2021-01-22 17:38:50

by Joel Fernandes

Subject: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On an octacore ARM64 device running the ChromeOS Linux kernel v5.4, I found
that there are a lot of calls to update_blocked_averages(). This causes
the schedule loop to slow down, at times taking up to 500 microseconds
(due to the newidle load balance). I have also seen this manifest in
the periodic balancer.

A closer look shows that the problem is caused by the following
ingredients:
1. If the system has a lot of inactive CGroups (thanks Dietmar for
suggesting to inspect /proc/sched_debug for this), this can make
__update_blocked_fair() take a long time.

2. The device has a lot of CPUs in a cluster which causes schedutil in a
shared frequency domain configuration to be slower than usual. (the load
average updates also try to update the frequency in schedutil).

3. The CPU is running at a low frequency causing the scheduler/schedutil
code paths to take longer than when running at a high CPU frequency.

The fix is to simply rate limit the calls to update_blocked_averages() to 20
times per second. It appears that updating the blocked averages less
often is sufficient. Currently I sometimes see about 200 calls per second,
which seems like overkill.
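
For reference, HZ/20 jiffies corresponds to roughly a 50 ms window between
updates (and therefore at most ~20 updates per second per runqueue)
independent of the HZ choice; the numbers below are just illustrative
arithmetic:

	HZ=250:  HZ/20 = 12 jiffies ~= 48 ms between updates
	HZ=1000: HZ/20 = 50 jiffies  = 50 ms between updates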

schbench shows a clear improvement with the change:

Without patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (212 total samples)
50.0th: 210 (106 samples)
75.0th: 619 (53 samples)
90.0th: 665 (32 samples)
95.0th: 703 (11 samples)
*99.0th: 12656 (8 samples)
99.5th: 12784 (1 samples)
99.9th: 13424 (1 samples)
min=15, max=13424

With patch:
~/schbench -t 2 -m 2 -r 5
Latency percentiles (usec) runtime 5 (s) (214 total samples)
50.0th: 188 (108 samples)
75.0th: 238 (53 samples)
90.0th: 623 (32 samples)
95.0th: 657 (12 samples)
*99.0th: 717 (7 samples)
99.5th: 725 (2 samples)
99.9th: 725 (0 samples)

Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Suggested-by: Dietmar Eggemann <[email protected]>
Co-developed-by: Qais Yousef <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>

---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..fe2dc0024db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
return false;

- if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
+ if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
return true;

update_blocked_averages(cpu);
--
2.30.0.280.ga3ce27912f-goog


2021-01-22 19:39:24

by Joel Fernandes

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, Jan 22, 2021 at 06:39:27PM +0000, Qais Yousef wrote:
> On 01/22/21 17:56, Vincent Guittot wrote:
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > return false;
> > >
> > > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure we update the blocked load at most
> > once a tick, in order to filter the newly idle case; otherwise the rate
> > limiting is already done by the load balance interval.
> > This hard-coded (HZ/20) really looks like an ugly hack
>
> This was meant as an RFC patch to discuss the problem really.

Agreed, sorry.

> Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> processing __update_blocked_fair() and the other half in sugov_update_shared().
> So roughly 50us each. Note that each function in turn calls an
> iterator. Correct me if my numbers are wrong, Joel.

Correct, and I see update_nohz_stats() itself called around 8 times during a
load balance, which multiplies the overhead.

Dietmar also found that the reason update_nohz_stats() is called 8 times is
that in our setup there is only 1 MC sched domain containing all 8 CPUs,
versus, say, 2 MC domains with 4 CPUs each.
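
The call path being hit is roughly the following (a sketch of the v5.4-era
code, not a verbatim call graph):

	newidle_balance()
	  load_balance()                       /* only one (MC) sched domain here */
	    find_busiest_group()
	      update_sd_lb_stats()             /* sets LBF_NOHZ_STATS */
	        update_sg_lb_stats()           /* once per sched group: 8 groups of 1 CPU */
	          update_nohz_stats(rq, false) /* per CPU of the group */
	            update_blocked_averages(cpu)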

> Running on a little core at a low frequency, these numbers don't look too odd.
> So I'm not seeing how we can speed these functions up.

Agreed.

> But since update_sg_lb_stats() will end up with multiple calls to
> update_blocked_averages() in one go, this latency adds up quickly.

True!

> One noticeable factor in Joel's system is the presence of a lot of cgroups,
> which is essentially what makes __update_blocked_fair() expensive, and it seems
> to always report that something has decayed, so we end up with a call to
> sugov_update_shared() on every invocation.

Correct.
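
For anyone following along, the shape of update_blocked_averages() on kernels
of this era is roughly the following (paraphrased from memory of the mainline
code, not a verbatim copy of our chromeos-5.4 tree):

	static void update_blocked_averages(int cpu)
	{
		bool decayed = false, done = true;
		struct rq *rq = cpu_rq(cpu);
		struct rq_flags rf;

		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);

		/* decay rt/dl/irq signals, then walk the leaf cfs_rq list */
		decayed |= __update_blocked_others(rq, &done);
		decayed |= __update_blocked_fair(rq, &done);  /* expensive with many cgroups */

		update_blocked_load_status(rq, !done);
		if (decayed)
			cpufreq_update_util(rq, 0);           /* -> sugov_update_shared() path */
		rq_unlock_irqrestore(rq, &rf);
	}

As long as __update_blocked_fair() keeps reporting decay, every one of these
calls also ends up in schedutil.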

thanks,

- Joel

[..]

2021-01-25 17:35:08

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann <[email protected]> wrote:
>
> On 22/01/2021 20:10, Joel Fernandes wrote:
> > Hi Vincent,
> >
> > Thanks for reply. Please see the replies below:
> >
> > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> >> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >> <[email protected]> wrote:
> >>>
> >>> On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> >>> that there are a lot of calls to update_blocked_averages(). This causes
> >>> the schedule loop to slow down to taking upto 500 micro seconds at
> >>> times (due to newidle load balance). I have also seen this manifest in
> >>> the periodic balancer.
> >>>
> >>> Closer look shows that the problem is caused by the following
> >>> ingredients:
> >>> 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> >>> suggesting to inspect /proc/sched_debug for this), this can make
> >>> __update_blocked_fair() take a long time.
> >>
> >> Inactive cgroups are removed from the list so they should not impact
> >> the duration
> >
> > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > decayed and not have any tasks running on it but its load needs to be
> > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > 'blocked'.
> >
> > * There can be a lot of idle CPU cgroups. Don't let fully
> > * decayed cfs_rqs linger on the list.
> > */
> > if (cfs_rq_is_decayed(cfs_rq))
> > list_del_leaf_cfs_rq(cfs_rq);
> >
> >>> 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> >>> shared frequency domain configuration to be slower than usual. (the load
> >>
> >> What do you mean exactly by it causes schedutil to be slower than usual ?
> >
> > sugov_next_freq_shared() is on the order of the number of CPUs in a cluster.
> > This system is a 6+2 system with 6 CPUs in a cluster. The schedutil shared-policy
> > frequency update needs to go through the utilization of the other CPUs in the
> > cluster. I believe this could be adding to the problem, but it does not really
> > need to be optimized if we can rate limit the calls to update_blocked_averages()
> > to begin with.
> >
> >>> average updates also try to update the frequency in schedutil).
> >>>
> >>> 3. The CPU is running at a low frequency causing the scheduler/schedutil
> >>> code paths to take longer than when running at a high CPU frequency.
> >>
> >> Low frequency usually means low utilization so it should not happen that much.
> >
> > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> >
> > schedule() can result in new idle balance with the CFS pick call happening
> > often. Here is a function graph trace. The tracer shows
> > update_blocked_averages taking a lot of time.
> >
> > sugov:0-2454 [002] 2657.992570: funcgraph_entry: | load_balance() {
> > sugov:0-2454 [002] 2657.992577: funcgraph_entry: | update_group_capacity() {
> > sugov:0-2454 [002] 2657.992580: funcgraph_entry: 2.656 us | __msecs_to_jiffies();
> > sugov:0-2454 [002] 2657.992585: funcgraph_entry: 2.447 us | _raw_spin_lock_irqsave();
> > sugov:0-2454 [002] 2657.992591: funcgraph_entry: 2.552 us | _raw_spin_unlock_irqrestore();
> > sugov:0-2454 [002] 2657.992595: funcgraph_exit: + 17.448 us | }
> > sugov:0-2454 [002] 2657.992597: funcgraph_entry: 1.875 us | update_nohz_stats();
> > sugov:0-2454 [002] 2657.992601: funcgraph_entry: 1.667 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992605: funcgraph_entry: | update_nohz_stats() {
> > sugov:0-2454 [002] 2657.992608: funcgraph_entry: + 33.333 us | update_blocked_averages();
> > sugov:0-2454 [002] 2657.992643: funcgraph_exit: + 38.073 us | }
> > sugov:0-2454 [002] 2657.992645: funcgraph_entry: 1.770 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992649: funcgraph_entry: | update_nohz_stats() {
> > sugov:0-2454 [002] 2657.992651: funcgraph_entry: + 41.823 us | update_blocked_averages();
> > sugov:0-2454 [002] 2657.992694: funcgraph_exit: + 45.729 us | }
> > sugov:0-2454 [002] 2657.992696: funcgraph_entry: 1.823 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992700: funcgraph_entry: | update_nohz_stats() {
> > sugov:0-2454 [002] 2657.992702: funcgraph_entry: + 35.312 us | update_blocked_averages();
> > sugov:0-2454 [002] 2657.992740: funcgraph_exit: + 39.792 us | }
> > sugov:0-2454 [002] 2657.992742: funcgraph_entry: 1.771 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992746: funcgraph_entry: | update_nohz_stats() {
> > sugov:0-2454 [002] 2657.992748: funcgraph_entry: + 33.438 us | update_blocked_averages();
> > sugov:0-2454 [002] 2657.992783: funcgraph_exit: + 37.500 us | }
> > sugov:0-2454 [002] 2657.992785: funcgraph_entry: 1.771 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992790: funcgraph_entry: | update_nohz_stats() {
> > sugov:0-2454 [002] 2657.992792: funcgraph_entry: + 45.521 us | update_blocked_averages();
> > sugov:0-2454 [002] 2657.992839: funcgraph_exit: + 49.323 us | }
> > sugov:0-2454 [002] 2657.992842: funcgraph_entry: 1.823 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992847: funcgraph_entry: | update_nohz_stats() {
> > sugov:0-2454 [002] 2657.992850: funcgraph_entry: + 67.187 us | update_blocked_averages();
> > sugov:0-2454 [002] 2657.992919: funcgraph_exit: + 72.031 us | }
> > sugov:0-2454 [002] 2657.992921: funcgraph_entry: 2.760 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992926: funcgraph_entry: | update_nohz_stats() {
> > sugov:0-2454 [002] 2657.992928: funcgraph_entry: + 61.146 us | update_blocked_averages();
> > sugov:0-2454 [002] 2657.992992: funcgraph_exit: + 65.886 us | }
> > sugov:0-2454 [002] 2657.992994: funcgraph_entry: 1.771 us | idle_cpu();
> > sugov:0-2454 [002] 2657.992998: funcgraph_exit: ! 430.209 us | }
> > sugov:0-2454 [002] 2657.993006: bprint: trace_long: wtf: lb: 432916
> > sugov:0-2454 [002] 2657.993017: bprint: trace_long: wtf: newidle_balance: 501458
> >
> >
> >>> The fix is simply rate limit the calls to update_blocked_averages to 20
> >>> times per second. It appears that updating the blocked average less
> >>> often is sufficient. Currently I see about 200 calls per second
> >>
> >> Would be good to explain why updating less often is sufficient ?
> >
> > I don't know this code that well; intuitively it seems to me that updating
> > blocked averages at such a high rate is pointless. But I defer to your expertise
> > on that. Why do you feel an update is needed at least HZ times per second?
> > What about systems with HZ=1000 or 300? That seems to be an insane rate of
> > updating (not to mention all the complexity of going through the leaf cgroup
> > list and doing the frequency updates).
>
> I assume this is what you're seeing on your device. This is on tip sched/core
> but should be close to your kernel. I glanced over the diffs in fair.c between
> chromeos-5.4 and tip sched/core and didn't spot any changes in this area.
>
> I ran on a hikey620 w/o CONFIG_SCHED_MC to mimic the 8 CPUs (8 sched groups
> (sg)) in the MC domain (the only sched domain).

you can also modify the DT to emulate 1 MC level with 8 cores

>
> Since nohz.has_blocked=1 in your newidle_balance() calls,
> load_balance() -> update_sd_lb_stats() sets LBF_NOHZ_STATS and calls
> update_sg_lb_stats() for each of the 8 sg's.
>
> Since LBF_NOHZ_STATS is set, update_sg_lb_stats() calls
> update_nohz_stats(..., false) per cpu in sg.
>
> And for a lot of these 8 sg's, i.e. 8 CPUs, update_blocked_averages()
> is called since none of the 3 bail-out conditions:
>
> (1) !rq->has_blocked_load
> (2) !cpumask_test_cpu(cpu, nohz.idle_cpus_mask)
> (3) !force && !time_after(jiffies, rq->last_blocked_load_update_tick)
>
> trigger.
>
> We advance nohz.next_blocked by msecs_to_jiffies(LOAD_AVG_PERIOD) (32ms,
> 8 jiffies w/ HZ=250) but we advance 'rq->last_blocked_load_update_tick
> only to jiffies' in update_blocked_load_status().
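> 
> In other words, with HZ=250 the two gates look like this (rough arithmetic):
> 
>   global gate: nohz.next_blocked                 = jiffies + 8 jiffies (32 ms)
>   per-rq gate: rq->last_blocked_load_update_tick = jiffies (passes again 1 jiffy, i.e. 4 ms, later)
> 
> so once a CPU's blocked load has been updated, the per-rq check in
> update_nohz_stats() already allows the next newidle_balance() to update it
> again one jiffy later, which matches the
> 'jiffies-last_blocked_load_update_tick=1' entries in the trace below.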
>
>
> [005] 7370.188469: bprint: nohz_balance_enter_idle: CPU5 nohz.has_blocked=1
> ...
> [005] 7370.210068: bprint: pick_next_task_fair: CPU5
> [005] 7370.210079: bprint: update_sd_lb_stats: CPU5 nohz.has_blocked=1 -> set LBF_NOHZ_STATS
> [005] 7370.210082: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=5 weight=1)
> [005] 7370.210085: bprint: update_nohz_stats: CPU5 cpu=5 not in nohz.idle_cpus_mask -> bail
> [005] 7370.210088: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=6 weight=1)
> [005] 7370.210091: bprint: update_nohz_stats: CPU5 cpu=6 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210112: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=7 weight=1)
> [005] 7370.210116: bprint: update_nohz_stats: CPU5 cpu=7 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210134: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=0 weight=1)
> [005] 7370.210137: bprint: update_nohz_stats: CPU5 cpu=0 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210156: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=1 weight=1)
> [005] 7370.210159: bprint: update_nohz_stats: CPU5 cpu=1 not in nohz.idle_cpus_mask -> bail
> [005] 7370.210162: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=2 weight=1)
> [005] 7370.210165: bprint: update_nohz_stats: CPU5 cpu=2 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210183: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=3 weight=1)
> [005] 7370.210186: bprint: update_nohz_stats: CPU5 cpu=3 force=0 jiffies-last_blocked_load_update_tick=1 (jiffies-nohz.next_blocked=-7) -> update_blocked_averages()
> [005] 7370.210205: bprint: update_sd_lb_stats: CPU5 dst_cpu=5 sg=(first_cpu=4 weight=1)
> [005] 7370.210207: bprint: update_nohz_stats: CPU5 cpu=4 not in nohz.idle_cpus_mask -> bail
> ...
> [005] 7370.444704: bprint: _nohz_idle_balance: CPU5 nohz.has_blocked=0
>
>
> If I understood you correctly, you want to avoid these frequent calls
> to update_blocked_averages() here to further avoid invoking schedutil
> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
> is set) very often in your setup.

So it's not clear if the problem that Joel wants to raise is about:
- the running time of update_blocked_averages()
- the running time of cpufreq_update_util(), which is called because
utilization has decayed during the update of blocked load
- the wakeup latency because of the newly idle load balance

> Since you have up to 6 CPUs in a frequency domain, this could be more
> costly than usual.

2021-01-25 18:20:01

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 25/01/2021 18:30, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 11:45, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 22/01/2021 20:10, Joel Fernandes wrote:
>>> Hi Vincent,
>>>
>>> Thanks for reply. Please see the replies below:
>>>
>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>> <[email protected]> wrote:

[...]

>> If I understood you correctly, you want to avoid these frequent calls
>> to update_blocked_averages() here to further avoid invoking schedutil
>> via update_blocked_averages() -> cpufreq_update_util() (since 'decayed'
>> is set) very often in your setup.
>
> So it's not clear if the problem that Joel wants to raise is about:
> - the running time of update_blocked_averages()
> - the running time of cpufreq_update_util(), which is called because
> utilization has decayed during the update of blocked load
> - the wakeup latency because of the newly idle load balance

Pretty much so.

IIRC his interest is driven by the fact that he saw much less activity
in newly idle load balance, and therefore in cpufreq_update_util(), on a
system with the same kernel and userspace but with fewer CPUs (i.e. also
smaller frequency domains) and fewer cgroups (with blocked load), and
started wondering why.

I assume that since he understands this environment now much better, he
should be able to come up with better test numbers to show if there is a
performance issue on his 2+6 DynamIQ system and if yes, where exactly in
this code path.

2021-01-29 17:33:10

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Friday 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <[email protected]> wrote:
> >
> > Hi Vincent,
> >
> > On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> > <[email protected]> wrote:
> > > > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
> > > > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > > > the periodic balancer.
> > > > > > > >
> > > > > > > > Closer look shows that the problem is caused by the following
> > > > > > > > ingredients:
> > > > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > > > __update_blocked_fair() take a long time.
> > > > > > >
> > > > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > > > the duration
> > > > > >
> > > > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > > > decayed and not have any tasks running on it but its load needs to be
> > > > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > > > 'blocked'.
> > > > >
> > > > > How many blocked cgroups have you got ?
> > > >
> > > > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > > > this loop runs per new idle balance. When the problem happens I see this loop
> > > > run 35-40 times (for one single instance of newidle balance). So in total
> > > > there are at least these many cfs_rq load updates.
> > >
> > > Do you mean that you have 35-40 cgroups ? Or the 35-40 includes all CPUs ?
> >
> > All CPUs.
> >
> > > > I also see that new idle balance can be called 200-500 times per second.
> > >
> > > This is not surprising because newidle_balance() is called every time
> > > the CPU is about to become idle
> >
> > Sure.
> >
> > > > > >
> > > > > > * There can be a lot of idle CPU cgroups. Don't let fully
> > > > > > * decayed cfs_rqs linger on the list.
> > > > > > */
> > > > > > if (cfs_rq_is_decayed(cfs_rq))
> > > > > > list_del_leaf_cfs_rq(cfs_rq);
> > > > > >
> > > > > > > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > > > >
> > > > > > > What do you mean exactly by it causes schedutil to be slower than usual ?
> > > > > >
> > > > > > sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> > > > > > system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> > > > > > frequency update needs to go through utilization of other CPUs in the
> > > > > > cluster. I believe this could be adding to the problem but is not really
> > > > > > needed to optimize if we can rate limit the calls to update_blocked_averages
> > > > > > to begin with.
> > > > >
> > > > > Qais mentioned half of the time being used by
> > > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > > this call ?
> > > >
> > > > I do not see a frequency update happening at the time of the problem. However
> > > > note that sugov_iowait_boost() does run even if frequency is not being
> > > > updated. IIRC, this function is also not that light weight and I am not sure
> > > > if it is a good idea to call this that often.
> > >
> > > The scheduler can't make any assumptions about how often schedutil/cpufreq
> > > wants to be called. Some are fast and straightforward and can be
> > > called very often to adjust the frequency; others can't handle that many
> > > updates. The rate-limit mechanisms in schedutil and io-boost are there
> > > for that purpose.
> >
> > Sure, I know that's the intention.
> >
> > > > > > > > average updates also try to update the frequency in schedutil).
> > > > > > > >
> > > > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > > > >
> > > > > > > Low frequency usually means low utilization so it should not happen that much.
> > > > > >
> > > > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > > > >
> > > > > Happening a lot in itself is not a problem if there is nothing else to
> > > > > do, so it's not an argument in itself
> > > >
> > > > It is a problem - it shows up in the preempt off critical section latency
> > >
> > > But this point is not related to the point above which is about how
> > > often it happens.
> >
> > It is related. A bad thing happening quite often is worse than a bad
> > thing happening infrequently. I agree it is not a root cause fix, but
> > it makes things "better".
> >
> > > > tracer. Are you saying it's OK for preemption to be disabled on the system for
> > > > 500 microseconds? That hurts real-time applications (audio etc.).
> > >
> > > So. Is your problem related to real-time applications (audio etc) ?
> >
> > Yes it is. I don't have a reproducer and it could be the audio
> > buffering will hide the problem. But that doesn't mean that there is
> > no problem or that we cannot improve things. There will also likely be
> > power consumption improvement.
> >
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see it
> > > > happen with other load balances as well, but not as often, since those LBs
> > > > don't run as often as the new idle balance.
> > >
> > > The update of the average blocked load was added to newidle_balance()
> > > to take advantage of the cpu becoming idle, but it seems to create a
> > > long preempt-off sequence. I'm going to prepare a patch to move it
> > > out of the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of times it happens but will not prevent it from happening.
> > > >
> > > > Sure, but a soft real-time workload can tolerate the issue if it does not happen
> > > > very often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
>
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved. As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.
> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay.... If
> you look at your results for schbench -t 2 -m 2 -r 5: the *99.0th:
> 12656 (8 samples) shows a delay of around 12ms, which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this result does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task had to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
>
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473
>
> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400
>
> So a quick and wrong conclusion could be to say that you should disable EAS ;-)
>
> All that to say that your schbench's results are not relevant and I
> disagree when you said that you have provided all details of your
> problem.
>
> Then, your function graph shows a half millisecond duration but there
> are no details about the context. It seems that your trace happens just
> after sugov changed the freq. But it would not be surprising
> to have such a large duration when running at the lowest
> capacity/frequency of the system as an example.
>
> All that to say that the main reason of my questions is just to make
> sure that the problem is fully understood and the results are
> relevant, which doesn't seem to be the case for schbench at least.
>
> The only point that I agree with is that running
> update_blocked_averages() with preempt and irq off is not a good thing,
> because we don't manage the number of cfs_rq to update, and I'm going
> to provide a patchset for this

The patch below moves the update of the blocked load of CPUs outside newidle_balance().

Instead, the update is done with the usual idle load balance update. I'm working on an
additional patch that will select this CPU that is about to become idle instead of a
random idle CPU, but this 1st step fixes the problem of the large number of updates
in the newly idle path.
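
With this change the newidle path only kicks an idle load balance instead of
iterating over the CPUs itself; the resulting flow is roughly (a sketch, not
a verbatim call graph):

	newidle_balance()              /* no blocked-load updates for the other idle CPUs here any more */
	  ...
	  if (!pulled_task)
	      nohz_newidle_balance(this_rq)
	        kick_ilb(NOHZ_STATS_KICK)   /* IPI an idle CPU from nohz.idle_cpus_mask */

	/* the kicked idle CPU then runs _nohz_idle_balance() and does the  */
	/* update_blocked_averages() calls there, outside the schedule path */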

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 32 +++-----------------------------
1 file changed, 3 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 197a51473e0c..8200b1d4df3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7421,8 +7421,6 @@ enum migration_type {
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
#define LBF_SOME_PINNED 0x08
-#define LBF_NOHZ_STATS 0x10
-#define LBF_NOHZ_AGAIN 0x20

struct lb_env {
struct sched_domain *sd;
@@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);

- if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
- env->flags |= LBF_NOHZ_AGAIN;
-
sgs->group_load += cpu_load(rq);
sgs->group_util += cpu_util(i);
sgs->group_runnable += cpu_runnable(rq);
@@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats tmp_sgs;
int sg_status = 0;

-#ifdef CONFIG_NO_HZ_COMMON
- if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
- env->flags |= LBF_NOHZ_STATS;
-#endif
-
do {
struct sg_lb_stats *sgs = &tmp_sgs;
int local_group;
@@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Tag domain that child domain prefers tasks go to siblings first */
sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;

-#ifdef CONFIG_NO_HZ_COMMON
- if ((env->flags & LBF_NOHZ_AGAIN) &&
- cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
-
- WRITE_ONCE(nohz.next_blocked,
- jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
- }
-#endif
-
if (env->sd->flags & SD_NUMA)
env->fbq_type = fbq_classify_group(&sds->busiest_stat);

@@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
return;

raw_spin_unlock(&this_rq->lock);
- /*
- * This CPU is going to be idle and blocked load of idle CPUs
- * need to be updated. Run the ilb locally as it is a good
- * candidate for ilb instead of waking up another idle CPU.
- * Kick an normal ilb if we failed to do the update.
- */
- if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
- kick_ilb(NOHZ_STATS_KICK);
+ kick_ilb(NOHZ_STATS_KICK);
raw_spin_lock(&this_rq->lock);
}

@@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

- nohz_newidle_balance(this_rq);
-
goto out;
}

@@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

if (pulled_task)
this_rq->idle_stamp = 0;
+ else
+ nohz_newidle_balance(this_rq);

rq_repin_lock(this_rq, rf);

--
2.17.1


>
> > for this.
> >
> > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > when the coming idle duration is expected to be long enough
> >
> > No, we do not want the schedule loop to take half a millisecond.
>
> keep in mind that you are scaling frequency so everything takes time
> at lowest frequency/capacity ...
>
> >
> > > > > IIUC, your real problem is that newidle_balance is running whereas a
> > > > > task is about to wake up on the cpu and we don't abort quickly during
> > > > > this load_balance
> > > >
> > > > Yes.
> > > >
> > > > > so we could also try to abort earlier in case of newly idle load balance
> > > >
> > > > I think interrupts are disabled when the load balance runs, so there's no way
> > > > for say an audio interrupt to even run in order to wake up a task. How would
> > > > you know to abort the new idle load balance?
> > > >
> > > > Could you elaborate more also on the drawback of the rate limiting patch we
> > > > posted? Do you see a side effect?
> > >
> > > Your patch just tries to hide your problem and not to solve the root cause.
> >
> > Agreed, the solution presented is a band aid and not a proper fix. It
> > was just intended to illustrate the problem and start a discussion. I
> > should have marked it RFC for sure.
> >
> > thanks!
> >
> > - Joel

2021-02-01 15:16:34

by Joel Fernandes

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, Jan 29, 2021 at 5:33 AM Vincent Guittot
<[email protected]> wrote:
[...]
> > > > > So why is it a problem for you ? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see it
> > > > happen with other load balance as well, but it not that often as those LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle but it seems to create a
> > > long preempt off sequence. I 'm going to prepare a patch to move it
> > > out the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of time it happens but it will not prevent it to happen.
> > > >
> > > > Sure, but soft real-time issue can tolerate if the issue does not happen very
> > > > often. In this case though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing ? It's
> > > not clear for me what happens in your case at the end. Have you got an
> > > audio glitch as an example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and schbench micro benchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
>
> In fact I disagree and I'm not sure that your results show the right
> problem but just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpus
> everytime. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8 cpus system), all threads directly wake up on an idle
> cpu and newidle balance is never involved.
> As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.

I disagree. Here you are assuming that schbench is the only task
running on the system. There are other processes, daemons as well. I
see a strong correlation between commenting out
update_blocked_averages() and not seeing the latency hit at the higher
percentiles.

> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay.... If
> you look at your results for schbench -t 2 -m 2 -r 5: The *99.0th:
> 12656 (8 samples) shows a delayed of around 12ms which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this results does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task has to wait before running. Your RFC patch probably
> changes the dynamic and as a result the task placement but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473

Sure, there could be a different problem causing these higher
percentile latencies on your device. I still think 12ms is awful.

> I have similar results and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then
> If I use another cpufreq governor than schedutil like ondemand as an
> example, the EAS is disabled and the results becomes:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400

Yes, definitely changing the governor also solves the problem (like
for example if performance governor is used). The problem happens at
low frequencies.

> So a quick and wrong conclusion could be to say that you should disable EAS ;-)

Sure, except the power management guys may come after me ;-)

[..]
> The only point that I agree with, is that running
> update_blocked_averages with preempt and irq off is not a good thing
> because we don't manage the number of cfs_rq to update and I'm going
> to provide a patchset for this

That's fine, as long as we agree on this problem ;-) Thanks for
providing the patches and I will try them once they are ready.

> > for this.
> >
> > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > when the coming idle duration is expected to be long enough
> >
> > No, we do not want the schedule loop to take half a millisecond.
>
> keep in mind that you are scaling frequency so everything takes time
> at lowest frequency/capacity ...

Agreed, I was also thinking the same. But that doesn't change the fact
that there is room for improvement and I'm grateful to you for trying
to improve it!

thanks,

- Joel

2021-02-03 11:56:27

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 29/01/2021 18:27, Vincent Guittot wrote:
> On Friday 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <[email protected]> wrote:
>>>
>>> Hi Vincent,
>>>
>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>> <[email protected]> wrote:
>>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
>>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>>>>> <[email protected]> wrote:

[...]

>> The only point that I agree with, is that running
>> update_blocked_averages with preempt and irq off is not a good thing
>> because we don't manage the number of cfs_rq to update and I'm going
>> to provide a patchset for this
>
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additional patch that will select this cpu that is about to become idle, instead of a
> random idle cpu, but this 1st step fixes the problem of the large number of updates in newly idle.

I'm trying to understand the need for this extra patch.

The patch below moves away from doing update_blocked_averages() (1) for
all CPUs in sched groups of the sched domain:

newidle_balance()->load_balance()->
find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()

to:

calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
update_nohz_stats() and for the ilb CPU.

newidle_balance() calls (1) for newidle CPU already.

What would be the benefit of choosing the newidle CPU as the ilb CPU?

> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 32 +++-----------------------------
> 1 file changed, 3 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
> #define LBF_NEED_BREAK 0x02
> #define LBF_DST_PINNED 0x04
> #define LBF_SOME_PINNED 0x08
> -#define LBF_NOHZ_STATS 0x10
> -#define LBF_NOHZ_AGAIN 0x20
>
> struct lb_env {
> struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
>
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
> sgs->group_load += cpu_load(rq);
> sgs->group_util += cpu_util(i);
> sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> struct sg_lb_stats tmp_sgs;
> int sg_status = 0;
>
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> /* Tag domain that child domain prefers tasks go to siblings first */
> sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> - jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
> if (env->sd->flags & SD_NUMA)
> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> return;
>
> raw_spin_unlock(&this_rq->lock);
> - /*
> - * This CPU is going to be idle and blocked load of idle CPUs
> - * need to be updated. Run the ilb locally as it is a good
> - * candidate for ilb instead of waking up another idle CPU.
> - * Kick an normal ilb if we failed to do the update.
> - */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);
> raw_spin_lock(&this_rq->lock);
> }
>
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> - nohz_newidle_balance(this_rq);
> -
> goto out;
> }
>
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> if (pulled_task)
> this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);
>
> rq_repin_lock(this_rq, rf);

2021-02-03 13:14:51

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann <[email protected]> wrote:
>
> On 29/01/2021 18:27, Vincent Guittot wrote:
> > On Friday 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
> >> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <[email protected]> wrote:
> >>>
> >>> Hi Vincent,
> >>>
> >>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> >>> <[email protected]> wrote:
> >>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> >>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
> >>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> >>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> >>>>>>>> <[email protected]> wrote:
>
> [...]
>
> >> The only point that I agree with, is that running
> >> update_blocked_averages with preempt and irq off is not a good thing
> >> because we don't manage the number of cfs_rq to update and I'm going
> >> to provide a patchset for this
> >
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm working on an
> > additional patch that will select this cpu that is about to become idle, instead of a
> > random idle cpu, but this 1st step fixes the problem of the large number of updates in newly idle.
>
> I'm trying to understand the need for this extra patch.
>
> The patch below moves away from doing update_blocked_averages() (1) for
> all CPUs in sched groups of the sched domain:
>
> newidle_balance()->load_balance()->
> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>
> to:
>
> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
> update_nohz_stats() and for the ilb CPU.
>
> newidle_balance() calls (1) for newidle CPU already.
>
> What would be the benefit to choose newidle CPU as ilb CPU?

To prevent waking up another idle cpu to run the update, whereas the
newidle cpu is already awake and about to become idle, so it is the best
candidate.
The whole aim of the removed code was to prevent waking up an idle cpu
to do something that could be done by the newidle cpu before it
enters the idle state.
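
In other words, with the patch above the flow is roughly (sketch):

	newidle CPU: nohz_newidle_balance() -> kick_ilb(NOHZ_STATS_KICK)
	                                         -> find_new_ilb()  /* pick some other idle CPU */
	                                         -> IPI that CPU
	ilb CPU:     _nohz_idle_balance()   -> update_blocked_averages() for the nohz-idle CPUs

and the follow-up patch would let the newidle CPU itself play the ilb role, so
no other CPU has to be woken just for the blocked load update.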

>
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 32 +++-----------------------------
> > 1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> > #define LBF_NEED_BREAK 0x02
> > #define LBF_DST_PINNED 0x04
> > #define LBF_SOME_PINNED 0x08
> > -#define LBF_NOHZ_STATS 0x10
> > -#define LBF_NOHZ_AGAIN 0x20
> >
> > struct lb_env {
> > struct sched_domain *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> > struct rq *rq = cpu_rq(i);
> >
> > - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> > - env->flags |= LBF_NOHZ_AGAIN;
> > -
> > sgs->group_load += cpu_load(rq);
> > sgs->group_util += cpu_util(i);
> > sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > struct sg_lb_stats tmp_sgs;
> > int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > - env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> > do {
> > struct sg_lb_stats *sgs = &tmp_sgs;
> > int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > /* Tag domain that child domain prefers tasks go to siblings first */
> > sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if ((env->flags & LBF_NOHZ_AGAIN) &&
> > - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > - WRITE_ONCE(nohz.next_blocked,
> > - jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > - }
> > -#endif
> > -
> > if (env->sd->flags & SD_NUMA)
> > env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> > return;
> >
> > raw_spin_unlock(&this_rq->lock);
> > - /*
> > - * This CPU is going to be idle and blocked load of idle CPUs
> > - * need to be updated. Run the ilb locally as it is a good
> > - * candidate for ilb instead of waking up another idle CPU.
> > - * Kick an normal ilb if we failed to do the update.
> > - */
> > - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> > - kick_ilb(NOHZ_STATS_KICK);
> > + kick_ilb(NOHZ_STATS_KICK);
> > raw_spin_lock(&this_rq->lock);
> > }
> >
> > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > update_next_balance(sd, &next_balance);
> > rcu_read_unlock();
> >
> > - nohz_newidle_balance(this_rq);
> > -
> > goto out;
> > }
> >
> > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> > if (pulled_task)
> > this_rq->idle_stamp = 0;
> > + else
> > + nohz_newidle_balance(this_rq);
> >
> > rq_repin_lock(this_rq, rf);

2021-02-03 17:14:51

by Qais Yousef

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 01/29/21 18:27, Vincent Guittot wrote:
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additional patch that will select this cpu that is about to become idle, instead of a
> random idle cpu, but this 1st step fixes the problem of the large number of updates in newly idle.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 32 +++-----------------------------
> 1 file changed, 3 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
> #define LBF_NEED_BREAK 0x02
> #define LBF_DST_PINNED 0x04
> #define LBF_SOME_PINNED 0x08
> -#define LBF_NOHZ_STATS 0x10
> -#define LBF_NOHZ_AGAIN 0x20
>
> struct lb_env {
> struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
>
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
> sgs->group_load += cpu_load(rq);
> sgs->group_util += cpu_util(i);
> sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> struct sg_lb_stats tmp_sgs;
> int sg_status = 0;
>
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> /* Tag domain that child domain prefers tasks go to siblings first */
> sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> - jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
> if (env->sd->flags & SD_NUMA)
> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> return;
>
> raw_spin_unlock(&this_rq->lock);
> - /*
> - * This CPU is going to be idle and blocked load of idle CPUs
> - * need to be updated. Run the ilb locally as it is a good
> - * candidate for ilb instead of waking up another idle CPU.
> - * Kick an normal ilb if we failed to do the update.
> - */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))

Since we removed the call to this function (which uses this_rq)

> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);

And we unconditionally call kick_ilb(), which will find a suitable cpu to run the
lb on regardless of what this_rq is.

Doesn't the below become unnecessary now?

10494 /*
10495 * This CPU doesn't want to be disturbed by scheduler
10496 * housekeeping
10497 */
10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
10499 return;
10500
10501 /* Will wake up very soon. No time for doing anything else*/
10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
10503 return;

And we can drop this_rq arg altogether?
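
For clarity, with your patch applied nohz_newidle_balance() is roughly left as
(a sketch reconstructed from the hunks above and memory of that era's fair.c,
not a verbatim copy):

	static void nohz_newidle_balance(struct rq *this_rq)
	{
		int this_cpu = this_rq->cpu;

		/* This CPU doesn't want to be disturbed by scheduler housekeeping */
		if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
			return;

		/* Will wake up very soon. No time for doing anything else */
		if (this_rq->avg_idle < sysctl_sched_migration_cost)
			return;

		/* Blocked load of idle CPUs doesn't need updating right now */
		if (!READ_ONCE(nohz.has_blocked) ||
		    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
			return;

		raw_spin_unlock(&this_rq->lock);
		kick_ilb(NOHZ_STATS_KICK);
		raw_spin_lock(&this_rq->lock);
	}

i.e. this_rq is only used for the early bail-outs and the lock dance around
kick_ilb().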

> raw_spin_lock(&this_rq->lock);
> }
>
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> - nohz_newidle_balance(this_rq);
> -
> goto out;
> }
>
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> if (pulled_task)
> this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);

Since nohz_newidle_balance() will not do any real work now, I couldn't figure
out what moving this here achieves. Most likely a failure on my end to parse
the change :-)

Joel can still test this patch as is of course. This is just an early review
since I already spent the time trying to understand it.

Thanks

--
Qais Yousef

>
> rq_repin_lock(this_rq, rf);
>
> --
> 2.17.1

2021-02-03 17:40:38

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Wed, 3 Feb 2021 at 18:09, Qais Yousef <[email protected]> wrote:
>
> On 01/29/21 18:27, Vincent Guittot wrote:
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
> >
> > Instead, the update is done with the usual idle load balance update. I'm working on an
> > additional patch that will select this cpu that is about to become idle, instead of a
> > random idle cpu, but this 1st step fixes the problem of the large number of updates in newly idle.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 32 +++-----------------------------
> > 1 file changed, 3 insertions(+), 29 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 197a51473e0c..8200b1d4df3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7421,8 +7421,6 @@ enum migration_type {
> > #define LBF_NEED_BREAK 0x02
> > #define LBF_DST_PINNED 0x04
> > #define LBF_SOME_PINNED 0x08
> > -#define LBF_NOHZ_STATS 0x10
> > -#define LBF_NOHZ_AGAIN 0x20
> >
> > struct lb_env {
> > struct sched_domain *sd;
> > @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> > struct rq *rq = cpu_rq(i);
> >
> > - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> > - env->flags |= LBF_NOHZ_AGAIN;
> > -
> > sgs->group_load += cpu_load(rq);
> > sgs->group_util += cpu_util(i);
> > sgs->group_runnable += cpu_runnable(rq);
> > @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > struct sg_lb_stats tmp_sgs;
> > int sg_status = 0;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> > - env->flags |= LBF_NOHZ_STATS;
> > -#endif
> > -
> > do {
> > struct sg_lb_stats *sgs = &tmp_sgs;
> > int local_group;
> > @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > /* Tag domain that child domain prefers tasks go to siblings first */
> > sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
> >
> > -#ifdef CONFIG_NO_HZ_COMMON
> > - if ((env->flags & LBF_NOHZ_AGAIN) &&
> > - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> > -
> > - WRITE_ONCE(nohz.next_blocked,
> > - jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> > - }
> > -#endif
> > -
> > if (env->sd->flags & SD_NUMA)
> > env->fbq_type = fbq_classify_group(&sds->busiest_stat);
> >
> > @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> > return;
> >
> > raw_spin_unlock(&this_rq->lock);
> > - /*
> > - * This CPU is going to be idle and blocked load of idle CPUs
> > - * need to be updated. Run the ilb locally as it is a good
> > - * candidate for ilb instead of waking up another idle CPU.
> > - * Kick an normal ilb if we failed to do the update.
> > - */
> > - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>
> Since we removed the call to this function (which uses this_rq)
>
> > - kick_ilb(NOHZ_STATS_KICK);
> > + kick_ilb(NOHZ_STATS_KICK);
>
> And unconditionally call kick_ilb() which will find a suitable cpu to run the
> lb at regardless what this_rq is.
>
> Doesn't the below become unnecessary now?

The end goal is to keep running on this CPU that is about to become idle.

This patch is mainly there to check that Joel's problem disappears if the
update of the blocked load of the CPUs is not done in the newidle_balance()
context. I'm preparing a few other patches on top to clean up the full path.

>
> 10494 /*
> 10495 * This CPU doesn't want to be disturbed by scheduler
> 10496 * housekeeping
> 10497 */
> 10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
> 10499 return;
> 10500
> 10501 /* Will wake up very soon. No time for doing anything else*/
> 10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
> 10503 return;
>
> And we can drop this_rq arg altogether?
>
> > raw_spin_lock(&this_rq->lock);
> > }
> >
> > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > update_next_balance(sd, &next_balance);
> > rcu_read_unlock();
> >
> > - nohz_newidle_balance(this_rq);
> > -
> > goto out;
> > }
> >
> > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> > if (pulled_task)
> > this_rq->idle_stamp = 0;
> > + else
> > + nohz_newidle_balance(this_rq);
>
> Since nohz_newidle_balance() will not do any real work now, I couldn't figure
> out what moving this here achieves. Fault from my end to parse the change most
> likely :-)

The goal is to schedule the update only if we are about to be idle and
nothing else has been queued in the meantime

>
> Joel can still test this patch as is of course. This is just an early review
> since I already spent the time trying to understand it.
>
> Thanks
>
> --
> Qais Yousef
>
> >
> > rq_repin_lock(this_rq, rf);
> >
> > --
> > 2.17.1

2021-02-03 20:02:11

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, Jan 29, 2021 at 06:27:27PM +0100, Vincent Guittot wrote:
[...]
> > update_blocked_averages with preempt and irq off is not a good thing
> > because we don't manage the number of csf_rq to update and I'm going
> > to provide a patchset for this
>
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>
> Instead, the update is done with the usual idle load balance update. I'm working on an
> additonnal patch that will select this cpu that is about to become idle, instead of a
> random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
>
> Signed-off-by: Vincent Guittot <[email protected]>

I confirmed that with this patch, I don't see the preemptoff issues related
to update_blocked_averages() anymore (tested using preemptoff tracer).

I went through the patch and it looks correct to me. I will review it further
and await reviews from others as well, and then backport the patch to our
kernels. Thanks Vince and everyone!

Tested-by: Joel Fernandes (Google) <[email protected]>

thanks,

- Joel



> ---
> kernel/sched/fair.c | 32 +++-----------------------------
> 1 file changed, 3 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
> #define LBF_NEED_BREAK 0x02
> #define LBF_DST_PINNED 0x04
> #define LBF_SOME_PINNED 0x08
> -#define LBF_NOHZ_STATS 0x10
> -#define LBF_NOHZ_AGAIN 0x20
>
> struct lb_env {
> struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
>
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
> sgs->group_load += cpu_load(rq);
> sgs->group_util += cpu_util(i);
> sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> struct sg_lb_stats tmp_sgs;
> int sg_status = 0;
>
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> /* Tag domain that child domain prefers tasks go to siblings first */
> sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> - jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
> if (env->sd->flags & SD_NUMA)
> env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
> return;
>
> raw_spin_unlock(&this_rq->lock);
> - /*
> - * This CPU is going to be idle and blocked load of idle CPUs
> - * need to be updated. Run the ilb locally as it is a good
> - * candidate for ilb instead of waking up another idle CPU.
> - * Kick an normal ilb if we failed to do the update.
> - */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);
> raw_spin_lock(&this_rq->lock);
> }
>
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> - nohz_newidle_balance(this_rq);
> -
> goto out;
> }
>
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> if (pulled_task)
> this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);
>
> rq_repin_lock(this_rq, rf);
>
> --
> 2.17.1
>
>
> >
> > > for this.
> > >
> > > > Also update_blocked_averages was supposed called in newlyidle_balance
> > > > when the coming idle duration is expected to be long enough
> > >
> > > No, we do not want the schedule loop to take half a millisecond.
> >
> > keep in mind that you are scaling frequency so everything takes time
> > at lowest frequency/capacity ...
> >
> > >
> > > > > > IIUC, your real problem is that newidle_balance is running whereas a
> > > > > > task is about to wake up on the cpu and we don't abort quickly during
> > > > > > this load_balance
> > > > >
> > > > > Yes.
> > > > >
> > > > > > so we could also try to abort earlier in case of newly idle load balance
> > > > >
> > > > > I think interrupts are disabled when the load balance runs, so there's no way
> > > > > for say an audio interrupt to even run in order to wake up a task. How would
> > > > > you know to abort the new idle load balance?
> > > > >
> > > > > Could you elaborate more also on the drawback of the rate limiting patch we
> > > > > posted? Do you see a side effect?
> > > >
> > > > Your patch just tries to hide your problem and not to solve the root cause.
> > >
> > > Agreed, the solution presented is a band aid and not a proper fix. It
> > > was just intended to illustrate the problem and start a discussion. I
> > > should have marked it RFC for sure.
> > >
> > > thanks!
> > >
> > > - Joel

2021-02-04 10:50:06

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 02/03/21 18:35, Vincent Guittot wrote:
> > > raw_spin_unlock(&this_rq->lock);
> > > - /*
> > > - * This CPU is going to be idle and blocked load of idle CPUs
> > > - * need to be updated. Run the ilb locally as it is a good
> > > - * candidate for ilb instead of waking up another idle CPU.
> > > - * Kick an normal ilb if we failed to do the update.
> > > - */
> > > - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> >
> > Since we removed the call to this function (which uses this_rq)
> >
> > > - kick_ilb(NOHZ_STATS_KICK);
> > > + kick_ilb(NOHZ_STATS_KICK);
> >
> > And unconditionally call kick_ilb() which will find a suitable cpu to run the
> > lb at regardless what this_rq is.
> >
> > Doesn't the below become unnecessary now?
>
> The end goal is to keep running on this cpu that is about to become idle.
>
> This patch is mainly there to check that Joel's problem disappears if
> the update of the blocked load of the cpus is not done in the
> newidle_balance context. I'm preparing few other patches on top to
> clean up the full path

+1

> >
> > 10494 /*
> > 10495 * This CPU doesn't want to be disturbed by scheduler
> > 10496 * housekeeping
> > 10497 */
> > 10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
> > 10499 return;
> > 10500
> > 10501 /* Will wake up very soon. No time for doing anything else*/
> > 10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
> > 10503 return;
> >
> > And we can drop this_rq arg altogether?
> >
> > > raw_spin_lock(&this_rq->lock);
> > > }
> > >
> > > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > update_next_balance(sd, &next_balance);
> > > rcu_read_unlock();
> > >
> > > - nohz_newidle_balance(this_rq);
> > > -
> > > goto out;
> > > }
> > >
> > > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >
> > > if (pulled_task)
> > > this_rq->idle_stamp = 0;
> > > + else
> > > + nohz_newidle_balance(this_rq);
> >
> > Since nohz_newidle_balance() will not do any real work now, I couldn't figure
> > out what moving this here achieves. Fault from my end to parse the change most
> > likely :-)
>
> The goal is to schedule the update only if we are about to be idle and
> nothing else has been queued in the meantime

I see. This shortcoming already existed and is not *strictly* related to
moving the update of blocked load out of newidle balance.

Thanks

--
Qais Yousef

2021-02-04 23:42:07

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 03/02/2021 14:12, Vincent Guittot wrote:
> On Wed, 3 Feb 2021 at 12:54, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 29/01/2021 18:27, Vincent Guittot wrote:
>>> On Fri, 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
>>>> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <[email protected]> wrote:
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
>>>>> <[email protected]> wrote:
>>>>>>> On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
>>>>>>>> On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <[email protected]> wrote:
>>>>>>>>> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
>>>>>>>>>> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
>>>>>>>>>> <[email protected]> wrote:
>>
>> [...]
>>
>>>> The only point that I agree with, is that running
>>>> update_blocked_averages with preempt and irq off is not a good thing
>>>> because we don't manage the number of csf_rq to update and I'm going
>>>> to provide a patchset for this
>>>
>>> The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>>>
>>> Instead, the update is done with the usual idle load balance update. I'm working on an
>>> additonnal patch that will select this cpu that is about to become idle, instead of a
>>> random idle cpu but this 1st step fixe the problem of lot of update in newly idle.
>>
>> I'm trying to understand the need for this extra patch.
>>
>> The patch below moves away from doing update_blocked_averages() (1) for
>> all CPUs in sched groups of the sched domain:
>>
>> newidle_balance()->load_balance()->
>> find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()
>>
>> to:
>>
>> calling (1) for CPUs in nohz.idle_cpus_mask in _nohz_idle_balance() via
>> update_nohz_stats() and for the ilb CPU.
>>
>> newidle_balance() calls (1) for newidle CPU already.
>>
>> What would be the benefit to choose newidle CPU as ilb CPU?
>
> To prevent waking up another idle cpu to run the update whereas
> newidle cpu is already woken up and about to be idle so the best
> candidate.
> All the aim of the removed code was to prevent waking up an idle cpu
> for doing something that could be done by the newidle cpu before it
> enters idle state

Ah OK, makes sense. So the only difference would be that the newidle CPU will
call (1) on nohz.idle_cpus_mask rather than on sd->span.

[...]

2021-03-23 21:41:09

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 1/29/21 9:27 AM, Vincent Guittot wrote:
>
> The patch below moves the update of the blocked load of CPUs outside newidle_balance().

On a well-known database workload, we also saw a lot of overhead from
update_blocked_averages() in newidle_balance(), so changes to reduce this
overhead are much welcome.

Turning on cgroups induces a 9% throughput degradation on a 2-socket Icelake
system with 40 cores per socket.

A big part of the overhead in our database workload comes from updating
blocked averages in newidle_balance, caused by I/O threads making
some CPUs go in and out of idle frequently in the following code path:

----__blkdev_direct_IO_simple
|
|----io_schedule_timeout
| |
| ----schedule_timeout
| |
| ----schedule
| |
| ----__schedule
| |
| ----pick_next_task_fair
| |
| ----newidle_balance
| |
----update_blocked_averages

We found that update_blocked_averages() now consumed the most CPU time, eating
up 2% of the CPU cycles once cgroups were turned on.

I hacked up Joel's original patch to rate limit the update of blocked
averages called from newidle_balance(). The 9% throughput degradation was
reduced to 5.4%. We'll be testing Vincent's change to see if it can give a
similar performance improvement.

Though in our test environment, sysctl_sched_migration_cost was kept much
lower (25000) than the default (500000) to encourage migrations to idle CPUs
and reduce latency. As a result, we got quite a lot of calls to
update_blocked_averages() directly, followed by an attempt to load_balance()
in newidle_balance(), instead of relegating the responsibility to the idle
load balancer. (See the code snippet from newidle_balance() below.)


if (this_rq->avg_idle < sysctl_sched_migration_cost || <-----sched_migration_cost check
!READ_ONCE(this_rq->rd->overload)) {

rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

goto out; <--- invoke idle load balancer
}

raw_spin_unlock(&this_rq->lock);

update_blocked_averages(this_cpu);

.... followed by load balance code ---


So the offload of update_blocked_averages() to the idle load balancer in
Vincent's patch is less effective in this case with a small
sched_migration_cost.

Looking at the code a bit more, we don't actually load balance every time in
this code path; we only do so when our avg_idle time exceeds some threshold.
Doing update_blocked_averages() immediately is only needed if we do call
load_balance(). If we don't do any load balancing in this code path, we can
let the idle load balancer update the blocked averages lazily.

Something like the following perhaps on top of Vincent's patch? We haven't really tested
this change yet but want to see if this change makes sense to you.

Tim

Signed-off-by: Tim Chen <[email protected]>
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63950d80fd0b..b93f5f52658a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
struct sched_domain *sd;
int pulled_task = 0;
u64 curr_cost = 0;
+ bool updated_blocked_avg = false;

update_misfit_status(NULL, this_rq);
/*
@@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

raw_spin_unlock(&this_rq->lock);

- update_blocked_averages(this_cpu);
rcu_read_lock();
for_each_domain(this_cpu, sd) {
int continue_balancing = 1;
@@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
}

if (sd->flags & SD_BALANCE_NEWIDLE) {
+ if (!updated_blocked_avg) {
+ update_blocked_averages(this_cpu);
+ updated_blocked_avg = true;
+ }
+
t0 = sched_clock_cpu(this_cpu);

pulled_task = load_balance(this_cpu, this_rq,


2021-03-24 13:49:57

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Hi Tim,

On Tue, 23 Mar 2021 at 14:37:59 (-0700), Tim Chen wrote:
>
>
> On 1/29/21 9:27 AM, Vincent Guittot wrote:
> >
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>
> On a well known database workload, we also saw a lot of overhead to do update_blocked_averages
> in newidle_balance(). So changes to reduce this overhead is much welcomed.
>
> Turning on cgroup induces 9% throughput degradation on a 2 socket 40 cores per socket Icelake system.
>
> A big part of the overhead in our database workload comes from updating
> blocked averages in newidle_balance, caused by I/O threads making
> some CPUs go in and out of idle frequently in the following code path:
>
> ----__blkdev_direct_IO_simple
> |
> |----io_schedule_timeout
> | |
> | ----schedule_timeout
> | |
> | ----schedule
> | |
> | ----__schedule
> | |
> | ----pick_next_task_fair
> | |
> | ----newidle_balance
> | |
> ----update_blocked_averages
>
> We found update_blocked_averages() now consumed most CPU time, eating up 2% of the CPU cycles once cgroup
> gets turned on.
>
> I hacked up Joe's original patch to rate limit the update of blocked
> averages called from newidle_balance(). The 9% throughput degradation reduced to
> 5.4%. We'll be testing Vincent's change to see if it can give
> similar performance improvement.
>
> Though in our test environment, sysctl_sched_migration_cost was kept
> much lower (25000) compared to the default (500000), to encourage migrations to idle cpu
> and reduce latency. We got quite a lot of calls to update_blocked_averages directly
> and then try to load_balance in newidle_balance instead of relegating
> the responsibility to idle load balancer. (See code snippet in newidle_balance below)
>
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost || <-----sched_migration_cost check
> !READ_ONCE(this_rq->rd->overload)) {
>
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> goto out; <--- invoke idle load balancer
> }
>
> raw_spin_unlock(&this_rq->lock);
>
> update_blocked_averages(this_cpu);
>
> .... followed by load balance code ---
>

> So the update_blocked_averages offload to idle_load_balancer in Vincent's patch is less
> effective in this case with small sched_migration_cost.
>
> Looking at the code a bit more, we don't actually load balance every time in this code path
> unless our avg_idle time exceeds some threshold. Doing update_blocked_averages immediately

IIUC your problem, we call update_blocked_averages() but because of:

if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
break;
}

the for_each_domain loop stops even before running load_balance() on the 1st
sched_domain level, which means that update_blocked_averages() was called
unnecessarily.

And this is even more true with a small sysctl_sched_migration_cost, which
allows newly idle LB for a very small this_rq->avg_idle. One could wonder why
you set sysctl_sched_migration_cost to such a low value, lower than the
max_newidle_lb_cost of the smallest domain, but that's probably because of
task_hot().

If avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain,
we should skip the spin_unlock/lock and the for_each_domain() loop entirely.

Maybe something like below:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76e33a70d575..08933e0d87ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10723,17 +10723,21 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);

+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

- rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

goto out;
}
+ rcu_read_unlock();
+

raw_spin_unlock(&this_rq->lock);


> is only needed if we do call load_balance(). If we don't do any load balance in the code path,
> we can let the idle load balancer update the blocked averages lazily.
>
> Something like the following perhaps on top of Vincent's patch? We haven't really tested
> this change yet but want to see if this change makes sense to you.
>
> Tim
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 63950d80fd0b..b93f5f52658a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> struct sched_domain *sd;
> int pulled_task = 0;
> u64 curr_cost = 0;
> + bool updated_blocked_avg = false;
>
> update_misfit_status(NULL, this_rq);
> /*
> @@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> raw_spin_unlock(&this_rq->lock);
>
> - update_blocked_averages(this_cpu);
> rcu_read_lock();
> for_each_domain(this_cpu, sd) {
> int continue_balancing = 1;
> @@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> }
>
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> + if (!updated_blocked_avg) {
> + update_blocked_averages(this_cpu);
> + updated_blocked_avg = true;
> + }
> +
> t0 = sched_clock_cpu(this_cpu);
>
> pulled_task = load_balance(this_cpu, this_rq,
>
>

2021-03-25 03:14:59

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 3/24/21 6:44 AM, Vincent Guittot wrote:
> Hi Tim,

>
> IIUC your problem, we call update_blocked_averages() but because of:
>
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> update_next_balance(sd, &next_balance);
> break;
> }
>
> the for_each_domain loop stops even before running load_balance on the 1st
> sched domain level which means that update_blocked_averages() was called
> unnecessarily.
>

That's right

> And this is even more true with a small sysctl_sched_migration_cost which allows newly
> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> smallest domain but that's probably because of task_hot().
>
> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> skip spin_unlock/lock and for_each_domain() loop entirely
>
> Maybe something like below:
>

The patch makes sense. I'll ask our benchmark team to queue this patch for testing.

Tim


2021-04-07 21:03:07

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Hi Tim,

On Wed, 24 Mar 2021 at 17:05, Tim Chen <[email protected]> wrote:
>
>
>
> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> > Hi Tim,
>
> >
> > IIUC your problem, we call update_blocked_averages() but because of:
> >
> > if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> > update_next_balance(sd, &next_balance);
> > break;
> > }
> >
> > the for_each_domain loop stops even before running load_balance on the 1st
> > sched domain level which means that update_blocked_averages() was called
> > unnecessarily.
> >
>
> That's right
>
> > And this is even more true with a small sysctl_sched_migration_cost which allows newly
> > idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> > for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> > smallest domain but that's probably because of task_hot().
> >
> > if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> > skip spin_unlock/lock and for_each_domain() loop entirely
> >
> > Maybe something like below:
> >
>
> The patch makes sense. I'll ask our benchmark team to queue this patch for testing.

Do you have feedback from your benchmark team ?

Regards,
Vincent
>
> Tim
>
>

2021-04-07 21:43:46

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 4/7/21 7:02 AM, Vincent Guittot wrote:
> Hi Tim,
>
> On Wed, 24 Mar 2021 at 17:05, Tim Chen <[email protected]> wrote:
>>
>>
>>
>> On 3/24/21 6:44 AM, Vincent Guittot wrote:
>>> Hi Tim,
>>
>>>
>>> IIUC your problem, we call update_blocked_averages() but because of:
>>>
>>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>> update_next_balance(sd, &next_balance);
>>> break;
>>> }
>>>
>>> the for_each_domain loop stops even before running load_balance on the 1st
>>> sched domain level which means that update_blocked_averages() was called
>>> unnecessarily.
>>>
>>
>> That's right
>>
>>> And this is even more true with a small sysctl_sched_migration_cost which allows newly
>>> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
>>> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
>>> smallest domain but that's probably because of task_hot().
>>>
>>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
>>> skip spin_unlock/lock and for_each_domain() loop entirely
>>>
>>> Maybe something like below:
>>>
>>
>> The patch makes sense. I'll ask our benchmark team to queue this patch for testing.
>
> Do you have feedback from your benchmark team ?
>

Vincent,

Thanks for following up. I just got some data back from the benchmark team.
The performance didn't change with your patch, and the overall cpu% of
update_blocked_averages() also remained at about the same level. My first
thought was that perhaps this update still didn't catch all the calls to
update_blocked_averages():

if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

To experiment, I added one more check against this_rq->next_balance to further
limit the path that actually does the newly idle load balance:

if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ time_before(jiffies, this_rq->next_balance) ||
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

I was surprised to find that the overall cpu% consumption of
update_blocked_averages() and the throughput of the benchmark still didn't
change much. So I took a peek into the profile and found that the
update_blocked_averages() calls had shifted to the idle load balancer. The
calls to update_blocked_averages() were reduced in newidle_balance(), so the
patch did what we intended. But the overall rate of calls to
update_blocked_averages() remains roughly the same, shifting from
newidle_balance to run_rebalance_domains.

100.00% (ffffffff810cf070)
|
---update_blocked_averages
|
|--95.47%--run_rebalance_domains
| __do_softirq
| |
| |--94.27%--asm_call_irq_on_stack
| | do_softirq_own_stack
| | |
| | |--93.74%--irq_exit_rcu
| | | |
| | | |--88.20%--sysvec_apic_timer_interrupt
| | | | asm_sysvec_apic_timer_interrupt
| | | | |
...
|
|
--4.53%--newidle_balance
pick_next_task_fair

I was expecting the idle load balancer to be rate limited to 60 Hz, which
should be 15 jiffies apart on the test system with CONFIG_HZ_250. When I did a
trace on a single CPU, I saw that calls to update_blocked_averages() are often
1 to 4 jiffies apart, which is a much higher rate than I expected. I haven't
taken a closer look yet, but you may have a better idea. I won't have access
to the test system and workload till probably next week.

Thanks.

Tim

2021-04-08 14:52:41

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Wed, 7 Apr 2021 at 19:19, Tim Chen <[email protected]> wrote:
>
>
>
> On 4/7/21 7:02 AM, Vincent Guittot wrote:
> > Hi Tim,
> >
> > On Wed, 24 Mar 2021 at 17:05, Tim Chen <[email protected]> wrote:
> >>
> >>
> >>
> >> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> >>> Hi Tim,
> >>
> >>>
> >>> IIUC your problem, we call update_blocked_averages() but because of:
> >>>
> >>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >>> update_next_balance(sd, &next_balance);
> >>> break;
> >>> }
> >>>
> >>> the for_each_domain loop stops even before running load_balance on the 1st
> >>> sched domain level which means that update_blocked_averages() was called
> >>> unnecessarily.
> >>>
> >>
> >> That's right
> >>
> >>> And this is even more true with a small sysctl_sched_migration_cost which allows newly
> >>> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> >>> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> >>> smallest domain but that's probably because of task_hot().
> >>>
> >>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> >>> skip spin_unlock/lock and for_each_domain() loop entirely
> >>>
> >>> Maybe something like below:
> >>>
> >>
> >> The patch makes sense. I'll ask our benchmark team to queue this patch for testing.
> >
> > Do you have feedback from your benchmark team ?
> >
>
> Vincent,
>
> Thanks for following up. I just got some data back from the benchmark team.
> The performance didn't change with your patch. And the overall cpu% of update_blocked_averages
> also remain at about the same level. My first thought was perhaps this update
> still didn't catch all the calls to update_blocked_averages
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> To experiment, I added one more check on the next_balance to further limit
> the path to actually do idle load balance with the next_balance time.
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + time_before(jiffies, this_rq->next_balance) ||
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> I was suprised to find the overall cpu% consumption of update_blocked_averages
> and throughput of the benchmark still didn't change much. So I took a
> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
> The call to update_locked_averages was reduced in newidle_balance so the patch did
> what we intended. But the overall rate of calls to

At least, we have removed the useless call to update_blocked_averages() in
newidle_balance() when we will not perform any newly idle load balance.

> update_blocked_averages remain roughly the same, shifting from
> newidle_balance to run_rebalance_domains.
>
> 100.00% (ffffffff810cf070)
> |
> ---update_blocked_averages
> |
> |--95.47%--run_rebalance_domains
> | __do_softirq
> | |
> | |--94.27%--asm_call_irq_on_stack
> | | do_softirq_own_stack

The calls to update_blocked_averages() mainly come from SCHED_SOFTIRQ and, as
a result, not from the new path do_idle()->nohz_run_idle_balance(), which has
been added by this patch to defer the call to update_nohz_stats() until after
newidle_balance() and before entering idle.

> | | |
> | | |--93.74%--irq_exit_rcu
> | | | |
> | | | |--88.20%--sysvec_apic_timer_interrupt
> | | | | asm_sysvec_apic_timer_interrupt
> | | | | |
> ...
> |
> |
> --4.53%--newidle_balance
> pick_next_task_fair
>
> I was expecting idle load balancer to be rate limited to 60 Hz, which

Why 60Hz ?

> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> When I did a trace on a single CPU, I see that update_blocked_averages
> are often called between 1 to 4 jiffies apart, which is at a much higher
> rate than I expected. I haven't taken a closer look yet. But you may

2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
- the need for an update of blocked load, which should not happen more
than once every 32ms, which means a rate of around 30Hz
- the need for a load balance of a sched_domain. The min interval for
a sched_domain is its weight when the CPU is idle, which is usually a few
jiffies
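
As a quick sanity check of these numbers (a stand-alone user-space sketch, not
kernel code; CONFIG_HZ=250 assumed, as on your test system):

#include <stdio.h>

int main(void)
{
	const int hz = 250;			/* assumed CONFIG_HZ */
	const int jiffy_ms = 1000 / hz;		/* 4 ms per jiffy */
	const int blocked_period_ms = 32;	/* min period between blocked load updates */

	printf("blocked load update: every %d ms = %d jiffies (~%d Hz)\n",
	       blocked_period_ms, blocked_period_ms / jiffy_ms,
	       1000 / blocked_period_ms);

	/* the 1 to 4 jiffy gaps seen in your trace correspond to: */
	for (int j = 1; j <= 4; j++)
		printf("%d jiffies = %d ms\n", j, j * jiffy_ms);

	return 0;
}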

The only idea that I have for now is that we spend less time in
newidle_balance(), which changes the dynamics of your system.

In your trace, could you check if update_blocked_averages() is called during
the tick? And is the current task the idle task?

Vincent

> have a better idea. I won't have access to the test system and workload
> till probably next week.
>
> Thanks.
>
> Tim

2021-04-08 23:06:48

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ




On 4/8/21 7:51 AM, Vincent Guittot wrote:

>> I was suprised to find the overall cpu% consumption of update_blocked_averages
>> and throughput of the benchmark still didn't change much. So I took a
>> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
>> The call to update_locked_averages was reduced in newidle_balance so the patch did
>> what we intended. But the overall rate of calls to
>
> At least , we have removed the useless call to update_blocked_averages
> in newidle_balance when we will not perform any newly idle load
> balance
>
>> update_blocked_averages remain roughly the same, shifting from
>> newidle_balance to run_rebalance_domains.
>>
>> 100.00% (ffffffff810cf070)
>> |
>> ---update_blocked_averages
>> |
>> |--95.47%--run_rebalance_domains
>> | __do_softirq
>> | |
>> | |--94.27%--asm_call_irq_on_stack
>> | | do_softirq_own_stack
>
> The call of update_blocked_averages mainly comes from SCHED_SOFTIRQ.
> And as a result, not from the new path
> do_idle()->nohz_run_idle_balance() which has been added by this patch
> to defer the call to update_nohz_stats() after newlyidle_balance and
> before entering idle.
>
>> | | |
>> | | |--93.74%--irq_exit_rcu
>> | | | |
>> | | | |--88.20%--sysvec_apic_timer_interrupt
>> | | | | asm_sysvec_apic_timer_interrupt
>> | | | | |
>> ...
>> |
>> |
>> --4.53%--newidle_balance
>> pick_next_task_fair
>>
>> I was expecting idle load balancer to be rate limited to 60 Hz, which
>
> Why 60Hz ?
>

My thinking is we will trigger load balance only after rq->next_balance.

void trigger_load_balance(struct rq *rq)
{
/* Don't need to rebalance while attached to NULL domain */
if (unlikely(on_null_domain(rq)))
return;

if (time_after_eq(jiffies, rq->next_balance))
raise_softirq(SCHED_SOFTIRQ);

nohz_balancer_kick(rq);
}

And it seems like next_balance is set to be 60 Hz

static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
int continue_balancing = 1;
int cpu = rq->cpu;
int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
unsigned long interval;
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;


>> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
>> When I did a trace on a single CPU, I see that update_blocked_averages
>> are often called between 1 to 4 jiffies apart, which is at a much higher
>> rate than I expected. I haven't taken a closer look yet. But you may
>
> 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> - the need for an update of blocked load which should not happen more
> than once every 32ms which means a rate of around 30Hz
> - the need for a load balance of a sched_domain. The min interval for
> a sched_domain is its weight when the CPU is idle which is usually few
> jiffies
>
> The only idea that I have for now is that we spend less time in
> newidle_balance which changes the dynamic of your system.
>
> In your trace, could you check if update_blocked_averages is called
> during the tick ? and Is the current task idle task ?

Here's a snapshot of the trace. However, I didn't capture the current task in
my trace. You can tell the frequency at which update_blocked_averages() is
called on cpu 2 from the jiffies values. They are quite close together (1 to 3
jiffies apart). When I have a chance to get on the machine, I'll take another
look at the current task and whether we got to trigger_load_balance() from
scheduler_tick().


3.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
4.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
6.484 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
6.506 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
9.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
11.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.059 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.083 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.100 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.103 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.150 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.227 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.248 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.311 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
16.140 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
16.185 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.224 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.340 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.384 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
16.993 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
17.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
17.630 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
17.830 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.031 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.036 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.040 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.502 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740

Thanks for taking a look.

Tim

2021-04-09 15:27:34

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, 9 Apr 2021 at 01:05, Tim Chen <[email protected]> wrote:
>
>
>
>
> On 4/8/21 7:51 AM, Vincent Guittot wrote:
>
> >> I was suprised to find the overall cpu% consumption of update_blocked_averages
> >> and throughput of the benchmark still didn't change much. So I took a
> >> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
> >> The call to update_locked_averages was reduced in newidle_balance so the patch did
> >> what we intended. But the overall rate of calls to
> >
> > At least , we have removed the useless call to update_blocked_averages
> > in newidle_balance when we will not perform any newly idle load
> > balance
> >
> >> update_blocked_averages remain roughly the same, shifting from
> >> newidle_balance to run_rebalance_domains.
> >>
> >> 100.00% (ffffffff810cf070)
> >> |
> >> ---update_blocked_averages
> >> |
> >> |--95.47%--run_rebalance_domains
> >> | __do_softirq
> >> | |
> >> | |--94.27%--asm_call_irq_on_stack
> >> | | do_softirq_own_stack
> >
> > The call of update_blocked_averages mainly comes from SCHED_SOFTIRQ.
> > And as a result, not from the new path
> > do_idle()->nohz_run_idle_balance() which has been added by this patch
> > to defer the call to update_nohz_stats() after newlyidle_balance and
> > before entering idle.
> >
> >> | | |
> >> | | |--93.74%--irq_exit_rcu
> >> | | | |
> >> | | | |--88.20%--sysvec_apic_timer_interrupt
> >> | | | | asm_sysvec_apic_timer_interrupt
> >> | | | | |
> >> ...
> >> |
> >> |
> >> --4.53%--newidle_balance
> >> pick_next_task_fair
> >>
> >> I was expecting idle load balancer to be rate limited to 60 Hz, which
> >
> > Why 60Hz ?
> >
>
> My thinking is we will trigger load balance only after rq->next_balance.
>
> void trigger_load_balance(struct rq *rq)
> {
> /* Don't need to rebalance while attached to NULL domain */
> if (unlikely(on_null_domain(rq)))
> return;
>
> if (time_after_eq(jiffies, rq->next_balance))
> raise_softirq(SCHED_SOFTIRQ);
>
> nohz_balancer_kick(rq);
> }
>
> And it seems like next_balance is set to be 60 Hz
>
> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> {
> int continue_balancing = 1;
> int cpu = rq->cpu;
> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> unsigned long interval;
> struct sched_domain *sd;
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;

This doesn't mean a 60 Hz period but 60*HZ, with HZ being the number of
jiffies per second. We init next_balance with now + 60 sec to make sure it's
far later than the next balance of the sched_domains.

Then, update_next_balance() keeps track of the 1st balance to happen next time.
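
To make the arithmetic concrete, here is a minimal user-space sketch (not the
kernel code; CONFIG_HZ=250 assumed, and the per-domain values below are made
up purely for illustration):

#include <stdio.h>

#define HZ 250	/* assumed CONFIG_HZ */

int main(void)
{
	unsigned long jiffies = 1000000UL;		/* pretend "now" */
	unsigned long next_balance = jiffies + 60 * HZ;	/* now + 60 seconds, the init value */

	/* hypothetical next-balance times of the sched_domains, in jiffies */
	unsigned long sd_next[] = { jiffies + 64, jiffies + 4, jiffies + 16 };

	/* keep the earliest one, as update_next_balance() does (ignoring jiffies wraparound here) */
	for (unsigned int i = 0; i < sizeof(sd_next) / sizeof(sd_next[0]); i++) {
		if (sd_next[i] < next_balance)
			next_balance = sd_next[i];
	}

	printf("init value:     now + %d jiffies (= 60 s at HZ=%d)\n", 60 * HZ, HZ);
	printf("earliest value: now + %lu jiffies\n", next_balance - jiffies);
	return 0;
}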

>
>
> >> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> >> When I did a trace on a single CPU, I see that update_blocked_averages
> >> are often called between 1 to 4 jiffies apart, which is at a much higher
> >> rate than I expected. I haven't taken a closer look yet. But you may
> >
> > 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> > - the need for an update of blocked load which should not happen more
> > than once every 32ms which means a rate of around 30Hz
> > - the need for a load balance of a sched_domain. The min interval for
> > a sched_domain is its weight when the CPU is idle which is usually few
> > jiffies
> >
> > The only idea that I have for now is that we spend less time in
> > newidle_balance which changes the dynamic of your system.
> >
> > In your trace, could you check if update_blocked_averages is called
> > during the tick ? and Is the current task idle task ?
>
> Here's a snapshot of the trace. However I didn't have the current task in my trace.
> You can tell the frequency that update_blocked_averages is called on
> cpu 2 by the jiffies value. They are quite close together (1 to 3 jiffies apart).
> When I have a chance to get on the machine, I'll take another look
> at the current task and whether we got to trigger_load_balance() from scheduler_tick().
>
>
> 3.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
> 4.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
> 6.484 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
> 6.506 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
> 9.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
> 11.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.059 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.083 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.100 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.103 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.150 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.227 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.248 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.311 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
> 16.140 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
> 16.185 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.224 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.340 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.384 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
> 16.993 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
> 17.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
> 17.630 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
> 17.830 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.031 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.036 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.040 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.502 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740
>

I don't know exactly what you track with "next_balance=" in
probe:newidle_balance, but it always starts with the same value 0x1004fb76c in
the future and finishes with a value 0x1004fb731 in the past. This would mean
that a load balance is needed during the next tick, which explains why we then
see probe:update_blocked_averages at each tick.

Also, could you check if the tick is stopped when idle? When the predicted
idle time is short and the next wakeup is expected to happen before the next
tick, the tick is not stopped.

Vincent

> Thanks for taking a look.
>
> Tim

2021-04-09 18:00:15

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 4/9/21 8:26 AM, Vincent Guittot wrote:

>>>>
>>>> I was expecting idle load balancer to be rate limited to 60 Hz, which
>>>
>>> Why 60Hz ?
>>>
>>
>> My thinking is we will trigger load balance only after rq->next_balance.
>>
>> void trigger_load_balance(struct rq *rq)
>> {
>> /* Don't need to rebalance while attached to NULL domain */
>> if (unlikely(on_null_domain(rq)))
>> return;
>>
>> if (time_after_eq(jiffies, rq->next_balance))
>> raise_softirq(SCHED_SOFTIRQ);
>>
>> nohz_balancer_kick(rq);
>> }
>>
>> And it seems like next_balance is set to be 60 Hz
>>
>> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>> {
>> int continue_balancing = 1;
>> int cpu = rq->cpu;
>> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>> unsigned long interval;
>> struct sched_domain *sd;
>> /* Earliest time when we have to do rebalance again */
>> unsigned long next_balance = jiffies + 60*HZ;
>
> This doesn't mean 60 Hz period but 60*HZ with HZ being the number of
> jiffies per second. We init next_balance with now + 60 sec to make
> sure it's far later than the next balance of the sched_domains
>
> Then, update_next_balance() keeps track of 1st balance to happen next time
>

Thanks for pointing out my misreading of the code. In this case the balance
frequency should be lower than I thought, as balances should be 60 sec apart
in theory.

>> Here's a snapshot of the trace. However I didn't have the current task in my trace.
>> You can tell the frequency that update_blocked_averages is called on
>> cpu 2 by the jiffies value. They are quite close together (1 to 3 jiffies apart).
>> When I have a chance to get on the machine, I'll take another look
>> at the current task and whether we got to trigger_load_balance() from scheduler_tick().
>>
>>
>> 3.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
>> 4.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
>> 6.484 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
>> 6.506 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
>> 9.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
>> 11.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
>> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
>> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
>> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
>> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
>> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
>> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.059 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.083 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.100 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.103 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.150 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.227 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.248 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.311 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
>> 16.140 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
>> 16.185 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.224 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.340 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.384 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
>> 16.993 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
>> 17.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
>> 17.630 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
>> 17.830 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.031 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.036 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.040 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.502 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740
>>
>
> I don't know exactly what you track with "next_balance=" in

It is the rq->next_balance value as we enter the newidle_balance function.

> probe:newidle_balance but it always starts with the same value
> 0x1004fb76c in the future to finish with a value 0x1004fb731 in the
> past.

This is indeed odd, as next_balance should move forward and not backward.

> This would mean that a load balance is needed during the next
> tick which explains why we can see then the
> probe:update_blocked_averages for each tick.

I will try to debug and find out why next_balance has gone backwards the
next time I get access to the test system.

>
> Also could you check if the tick is stopped when idle. When the
> predicted idle time is short and the next wake is expected to happen
> before the next tick, the tick is not stopped.
>

Will do.

Tim

2021-05-10 22:01:04

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 4/9/21 10:59 AM, Tim Chen wrote:

>>> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
>>> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
>>> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
>>> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c

>> I don't know exactly what you track with "next_balance=" in
>
> It is the rq->next_balance value as we enter the newidle_balance function.
>
>> probe:newidle_balance but it always starts with the same value
>> 0x1004fb76c in the future to finish with a value 0x1004fb731 in the
>> past.
>
> This indeed is odd as the next_balance should move forward and not backward.


Vincent,

I found a bug in newidle_balance() that moved the next balance time
backward. I fixed it in patch 1 below. This corrects the
next_balance time update and we now have proper load balance rate limiting.

After putting in the other two changes previously discussed with you (patches 2 and 3 below),
I see a very good improvement (about +5%) in the database workload I was investigating.
The costly update_blocked_averages() function is called much less frequently and consumes
only 0.2% of CPU cycles instead of the 2.6% before the changes.

Including all three patches here together for easier review. The patches
apply to the tip tree's sched/core branch.

Thanks.

Tim

--->8---

From 848eb8f45b53b45cacf70022c98f632daabefe77 Mon Sep 17 00:00:00 2001
Message-Id: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Fri, 7 May 2021 14:19:47 -0700
Subject: [PATCH 1/3] sched: Fix rq->next_balance time going backward

In traces on newidle_balance(), this_rq->next_balance
time goes backward from time to time, e.g.

11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c

This was due to newidle_balance() updating this_rq->next_balance
to an earlier time than its current value. The real intention
was to make sure the update moves this_rq->next_balance forward:

out:
/* Move the next balance forward */
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;

The actual outcome was moving this_rq->next_balance backward,
in the wrong direction.

Fix the incorrect check on next_balance causing the problem.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..b0b5698b2184 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10681,7 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

out:
/* Move the next balance forward */
- if (time_after(this_rq->next_balance, next_balance))
+ if (time_after(next_balance, this_rq->next_balance))
this_rq->next_balance = next_balance;

if (pulled_task)
--
2.20.1


From f2c9af4af6438ad79076e1a664003dc01ad4fdf0 Mon Sep 17 00:00:00 2001
Message-Id: <f2c9af4af6438ad79076e1a664003dc01ad4fdf0.1620677280.git.tim.c.chen@linux.intel.com>
In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
From: Vincent Guittot <[email protected]>
Date: Fri, 7 May 2021 14:38:10 -0700
Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are deferring
load balance

In newidle_balance(), the scheduler skips the load balance of the new idle cpu when, for this_rq's sched_domain sd,

this_rq->avg_idle < sd->max_newidle_lb_cost

Doing a costly call to update_blocked_averages() will
not be useful and simply adds overhead when this condition is true.

Check the condition early in newidle_balance() to skip update_blocked_averages()
when possible.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0b5698b2184..f828b75488a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10612,17 +10612,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);

+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

- rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

goto out;
}
+ rcu_read_unlock();

raw_spin_unlock(&this_rq->lock);

--
2.20.1


From c45d13c6156c3cdc340ef3ba523b8750642a9c50 Mon Sep 17 00:00:00 2001
Message-Id: <c45d13c6156c3cdc340ef3ba523b8750642a9c50.1620677280.git.tim.c.chen@linux.intel.com>
In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Fri, 7 May 2021 14:54:54 -0700
Subject: [PATCH 3/3] sched: Rate limit load balance in newidle_balance()

Currently newidle_balance() can do load balancing even if the cpu is not
due for a load balance. Make newidle_balance() check the next_balance
time on the cpu's runqueue so it defers load balancing until the cpu is
due for its load balance.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f828b75488a0..8e00e1fdd6e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10617,6 +10617,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

if (this_rq->avg_idle < sysctl_sched_migration_cost ||
!READ_ONCE(this_rq->rd->overload) ||
+ time_before(jiffies, this_rq->next_balance) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

if (sd)
--
2.20.1

2021-05-11 15:30:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

Hi Tim,

On Mon, 10 May 2021 at 23:59, Tim Chen <[email protected]> wrote:
>
>
>
> On 4/9/21 10:59 AM, Tim Chen wrote:
>
> >>> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> >>> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
> >>> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> >>> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
>
> >> I don't know exactly what you track with "next_balance=" in
> >
> > It is the rq->next_balance value as we enter the newidle_balance function.
> >
> >> probe:newidle_balance but it always starts with the same value
> >> 0x1004fb76c in the future to finish with a value 0x1004fb731 in the
> >> past.
> >
> > This indeed is odd as the next_balance should move forward and not backward.
>
>
> Vincent,
>
> I found a bug in newidle_balance() that moved the next balance time
> backward. I fixed it in patch 1 below. This corrects the
> next_balance time update and we now have proper load balance rate limiting.
>
> After putting in the other two changes previously discussed with you (patch 2 and 3 below),
> I see very good improvement (about +5%) in the database workload I was investigating.
> The costly update_blocked_averages() function is called much less frequently and consumed
> only 0.2% of cpu cycles instead of 2.6% before the changes.

That's good news.

>
> Including all three patches here together for easier review. The patches
> apply to the tip tree's sched/core branch.
>
> Thanks.
>
> Tim
>
> --->8---
>
> From 848eb8f45b53b45cacf70022c98f632daabefe77 Mon Sep 17 00:00:00 2001
> Message-Id: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <[email protected]>
> Date: Fri, 7 May 2021 14:19:47 -0700
> Subject: [PATCH 1/3] sched: Fix rq->next_balance time going backward
>
> In traces on newidle_balance(), this_rq->next_balance
> time goes backward from time to time, e.g.
>
> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>
> This was due to newidle_balance() updated this_rq->next_balance
> to an earlier time than its current value. The real intention
> was to make sure next_balance move this_rq->next_balance forward
> in its update:

Sometimes we want to set this_rq->next_balance backward compared to
its current value. When a CPU is busy, the balance interval is
multiplied by busy_factor, which is set to 16 by default. At the SMT2 sched
domain level, that means the interval will be 32ms when busy
instead of 2ms. But if a busy load balance happens just before the CPU
becomes idle, this_rq->next_balance will be set 32ms later,
whereas it should go down to 2ms now that the CPU is idle. And this
becomes even worse when you have 128 CPUs at the die sched_domain level,
because the idle CPU will have to wait 2048ms instead of the correct
128ms interval.
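
For illustration, a small stand-alone sketch (hypothetical, simplified model;
not the kernel code) of how a busy_factor of 16 stretches the balance interval
at different sched_domain levels:

#include <stdio.h>

/* Simplified model: the balance interval is the sched_domain's base
 * interval, multiplied by busy_factor when the CPU is busy. */
static unsigned int balance_interval_ms(unsigned int base_ms, int cpu_busy,
					unsigned int busy_factor)
{
	return cpu_busy ? base_ms * busy_factor : base_ms;
}

int main(void)
{
	/* SMT2 level: 2ms base interval, busy_factor = 16 */
	printf("SMT busy: %ums, SMT idle: %ums\n",
	       balance_interval_ms(2, 1, 16), balance_interval_ms(2, 0, 16));
	/* die level with 128 CPUs: 128ms base interval */
	printf("die busy: %ums, die idle: %ums\n",
	       balance_interval_ms(128, 1, 16), balance_interval_ms(128, 0, 16));
	return 0;
}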

>
> out:
> /* Move the next balance forward */
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> The actual outcome was moving this_rq->next_balance backward,
> in the wrong direction.

As explained above, this is intentional.
You are facing a situation where the load balance at this sched_domain
level has not run for a while, so last_balance is quite old and
generates a next_balance that is still in the past.

typically:

CPU is busy
CPU runs busy LB at time T so last_balance = T
And next LB will not happen before T+16*Interval
At time T+15*Interval, CPU enters idle
newidle_balance updates next_balance to last_balance+Interval which is
in the past

>
> Fix the incorrect check on next_balance causing the problem.
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..b0b5698b2184 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10681,7 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> out:
> /* Move the next balance forward */
> - if (time_after(this_rq->next_balance, next_balance))
> + if (time_after(next_balance, this_rq->next_balance))

The current comparison is correct but next_balance should not be in the past.

update_next_balance() is only used in newidle_balance() so we could
modify it to have:

next = max(jiffies+1, next = sd->last_balance + interval)


> this_rq->next_balance = next_balance;
>
> if (pulled_task)
> --
> 2.20.1
>
>
> From f2c9af4af6438ad79076e1a664003dc01ad4fdf0 Mon Sep 17 00:00:00 2001
> Message-Id: <f2c9af4af6438ad79076e1a664003dc01ad4fdf0.1620677280.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> From: Vincent Guittot <[email protected]>
> Date: Fri, 7 May 2021 14:38:10 -0700
> Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are defering
> load balance
>
> In newidle_balance(), the scheduler skips load balance to the new idle cpu when sd is this_rq and when
>
> this_rq->avg_idle < sd->max_newidle_lb_cost
>
> Doing a costly call to update_blocked_averages() will
> not be useful and simply adds overhead when this condition is true.
>
> Check the condition early in newidle_balance() to skip update_blocked_averages()
> when possible.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0b5698b2184..f828b75488a0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10612,17 +10612,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> */
> rq_unpin_lock(this_rq, rf);
>
> + rcu_read_lock();
> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
> +
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> - rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> goto out;
> }
> + rcu_read_unlock();
>
> raw_spin_unlock(&this_rq->lock);
>
> --
> 2.20.1
>
>
> From c45d13c6156c3cdc340ef3ba523b8750642a9c50 Mon Sep 17 00:00:00 2001
> Message-Id: <c45d13c6156c3cdc340ef3ba523b8750642a9c50.1620677280.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <[email protected]>
> Date: Fri, 7 May 2021 14:54:54 -0700
> Subject: [PATCH 3/3] sched: Rate limit load balance in newidle_balance()
>
> Currently newidle_balance() could do load balancng even if the cpu is not
> due for a load balance. Make newidle_balance() check the next_balance
> time on the cpu's runqueue so it defers load balancing if it is not
> due for its load balance.
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f828b75488a0..8e00e1fdd6e0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10617,6 +10617,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> !READ_ONCE(this_rq->rd->overload) ||
> + time_before(jiffies, this_rq->next_balance) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> if (sd)
> --
> 2.20.1
>

2021-05-11 17:27:06

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 5/11/21 8:25 AM, Vincent Guittot wrote:
> Hi Tim,
>
> Sometimes, we want to set this_rq->next_balance backward compared to
> its current value. When a CPU is busy, the balance interval is
> multiplied by busy_factor which is set to 16 by default. On SMT2 sched
> domain level, it means that the interval will be 32ms when busy
> instead of 2ms. But if a busy load balance happens just before
> becoming idle, the this_rq->next_balance will be set 32ms later
> whereas it should go down to 2ms as the CPU is now idle. And this
> becomes even worse when you have 128 CPUs at die sched_domain level
> because the idle CPU will have to wait 2048ms instead of the correct
> 128ms interval.
>
>>
>> out:
>> /* Move the next balance forward */
>> - if (time_after(this_rq->next_balance, next_balance))
>> + if (time_after(next_balance, this_rq->next_balance))
>
> The current comparison is correct but next_balance should not be in the past.

I understand then that the intention is that, after the update,
this_rq->next_balance should have a minimum value of jiffies+1. So
we will need

out:
/* Move the next balance forward */
+ this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;

as the computed next_balance will be at least jiffies+1 after your fix to
update_next_balance(). We still need to take care of the case where
this_rq->next_balance <= jiffies.

So combining with your suggestion on the fix to update_next_balance(),
the fix will be

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..2dc471c1511c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9901,7 +9901,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)

/* used by idle balance, so cpu_busy = 0 */
interval = get_sd_balance_interval(sd, 0);
- next = sd->last_balance + interval;
+ next = max(jiffies+1, sd->last_balance + interval);

if (time_after(*next_balance, next))
*next_balance = next;
@@ -10681,6 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

out:
/* Move the next balance forward */
+ this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;


>
> update_next_balance() is only used in newidle_balance() so we could
> modify it to have:
>
> next = max(jiffies+1, next = sd->last_balance + interval)

Is the extra assignment "next = sd->last_balance + interval" needed?
This seems more straightforward:

next = max(jiffies+1, sd->last_balance + interval)

I will try to get the benchmark folks to do another run with this change.
Hopefully I'll get some bandwidth from them soon.

Thanks.

Tim

2021-05-11 17:57:55

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Tue, 11 May 2021 at 19:25, Tim Chen <[email protected]> wrote:
>
>
>
> On 5/11/21 8:25 AM, Vincent Guittot wrote:
> > Hi Tim,
> >
> > Sometimes, we want to set this_rq->next_balance backward compared to
> > its current value. When a CPU is busy, the balance interval is
> > multiplied by busy_factor which is set to 16 by default. On SMT2 sched
> > domain level, it means that the interval will be 32ms when busy
> > instead of 2ms. But if a busy load balance happens just before
> > becoming idle, the this_rq->next_balance will be set 32ms later
> > whereas it should go down to 2ms as the CPU is now idle. And this
> > becomes even worse when you have 128 CPUs at die sched_domain level
> > because the idle CPU will have to wait 2048ms instead of the correct
> > 128ms interval.
> >
> >>
> >> out:
> >> /* Move the next balance forward */
> >> - if (time_after(this_rq->next_balance, next_balance))
> >> + if (time_after(next_balance, this_rq->next_balance))
> >
> > The current comparison is correct but next_balance should not be in the past.
>
> I understand then the intention is after the update,
> this_rq->next_balance should have a minimum value of jiffies+1. So
> we will need
>
> out:
> /* Move the next balance forward */
> + this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> as next_balance computed will be at least jiffies+1 after your fix to
> update_next_balance(). We still need to take care of the case when
> this_rq->next_balance <= jiffies.
>
> So combining with your suggestion on the fix to update_next_balance(),
> the fix will be
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..2dc471c1511c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9901,7 +9901,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>
> /* used by idle balance, so cpu_busy = 0 */
> interval = get_sd_balance_interval(sd, 0);
> - next = sd->last_balance + interval;
> + next = max(jiffies+1, sd->last_balance + interval);
>
> if (time_after(*next_balance, next))
> *next_balance = next;
> @@ -10681,6 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> out:
> /* Move the next balance forward */
> + this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
>
> >
> > update_next_balance() is only used in newidle_balance() so we could
> > modify it to have:
> >
> > next = max(jiffies+1, next = sd->last_balance + interval)
>
> Is the extra assignment "next = sd->last_balance + interval" needed?

No, it's a typo made while copy-pasting the line.

> This seems more straight forward:
>
> next = max(jiffies+1, sd->last_balance + interval)
>
> I will try to get the benchmark folks to do another run with this change.
> Hopefully I'll get some bandwidth from them soon.
>
> Thanks.
>
> Tim
>

2021-05-12 14:00:57

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 05/11/21 10:25, Tim Chen wrote:
> > update_next_balance() is only used in newidle_balance() so we could
> > modify it to have:
> >
> > next = max(jiffies+1, next = sd->last_balance + interval)
>
> Is the extra assignment "next = sd->last_balance + interval" needed?
> This seems more straight forward:
>
> next = max(jiffies+1, sd->last_balance + interval)

I haven't been following the whole conversation closely, but it's always
interesting when time is manipulated outside the time_*() functions.

Is this max() safe against wrapping?

Thanks

--
Qais Yousef

2021-05-13 23:08:04

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 5/12/21 6:59 AM, Qais Yousef wrote:
> On 05/11/21 10:25, Tim Chen wrote:
>>> update_next_balance() is only used in newidle_balance() so we could
>>> modify it to have:
>>>
>>> next = max(jiffies+1, next = sd->last_balance + interval)
>>
>> Is the extra assignment "next = sd->last_balance + interval" needed?
>> This seems more straight forward:
>>
>> next = max(jiffies+1, sd->last_balance + interval)
>
> I haven't been following the whole conversation closely, but it's always
> interesting when manipulating time in non time_*() functions.
>
> Is this max() safe against wrapping?

Looking at the definition, it seems max() doesn't take care of wrapping.
#define max(a, b) \
({ \
	typeof(a) __a = (a); \
	typeof(b) __b = (b); \
	MINMAX_ASSERT_COMPATIBLE(typeof(__a), typeof(__b)); \
	__a > __b ? __a : __b; \
})


Probably need to do
next = time_after(jiffies+1, sd->last_balance + interval) ? jiffies+1 : sd->last_balance + interval;
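
For illustration, a minimal user-space sketch (hypothetical values; time_after()
is simplified from the kernel's wrap-safe definition) of why a plain max() picks
the wrong value once jiffies wraps, while the time_after()-based form does not:

#include <stdio.h>

/* simplified wrap-safe comparison: true if a is after b */
#define time_after(a, b)  ((long)((b) - (a)) < 0)

int main(void)
{
	/* hypothetical: jiffies has just wrapped past ULONG_MAX */
	unsigned long jiffies = 5;			/* "now", shortly after the wrap */
	unsigned long next = (unsigned long)-10;	/* deadline computed before the wrap */

	/* plain max() keeps the huge pre-wrap value, pushing the balance far out */
	unsigned long with_max = (jiffies + 1 > next) ? jiffies + 1 : next;

	/* time_after() treats the pre-wrap value as already in the past */
	unsigned long with_time_after = time_after(jiffies + 1, next) ? jiffies + 1 : next;

	printf("max-based: %lu, time_after-based: %lu\n", with_max, with_time_after);
	return 0;
}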

Tim


2021-05-18 19:16:46

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On 05/13/21 11:45, Tim Chen wrote:
>
>
> On 5/12/21 6:59 AM, Qais Yousef wrote:
> > On 05/11/21 10:25, Tim Chen wrote:
> >>> update_next_balance() is only used in newidle_balance() so we could
> >>> modify it to have:
> >>>
> >>> next = max(jiffies+1, next = sd->last_balance + interval)
> >>
> >> Is the extra assignment "next = sd->last_balance + interval" needed?
> >> This seems more straight forward:
> >>
> >> next = max(jiffies+1, sd->last_balance + interval)
> >
> > I haven't been following the whole conversation closely, but it's always
> > interesting when manipulating time in non time_*() functions.
> >
> > Is this max() safe against wrapping?
>
> Looking at the definition, seems like max doesn't take care of wrapping.
> #define max(a, b) \
> ({ \
> typeof(a) __a = (a); \
> typeof(b) __b = (b); \
> MINMAX_ASSERT_COMPATIBLE(typeof(__a), typeof(__b)); \
> __a > __b ? __a : __b; \
> })
>
>
> Probably need to do
> next = time_after(jiffies+1, sd->last_balance + interval) ? jiffies+1 : sd->last_balance + interval;

Yep, that's what I thought it should look like. There's a small chance jiffies
would have changed between the 2 reads though. I can't see how this would cause
a problem, so we should be fine.

Would it be more useful (and readable) to have time_min()/time_max() wrappers?
This type of usage is rare but it'll help to have a common way to handle this
scenario.

Naming might get controversial though :-); time_earliest()/time_latest() could
be another option.

The wrapper is nice to have for me, so feel free to ignore the suggestion.
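
For illustration, one possible shape of such a wrapper (a hypothetical helper,
not an existing kernel macro, with time_after() simplified), together with a
single jiffies snapshot to avoid the double read mentioned above:

#include <stdio.h>

/* simplified wrap-safe comparison: true if a is after b */
#define time_after(a, b)   ((long)((b) - (a)) < 0)

/* hypothetical helper, not an existing kernel macro: the later of two times */
#define time_latest(a, b)  (time_after(a, b) ? (a) : (b))

int main(void)
{
	unsigned long jiffies = (unsigned long)-5;	/* hypothetical: just before a wrap */
	unsigned long last_balance = jiffies - 100;
	unsigned long interval = 2;

	/* snapshot jiffies once so both operands see the same "now" */
	unsigned long now = jiffies;
	unsigned long next = time_latest(now + 1, last_balance + interval);

	printf("next balance tick: %lu\n", next);
	return 0;
}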

Thanks

--
Qais Yousef

2021-06-11 20:04:51

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 5/12/21 6:59 AM, Qais Yousef wrote:
> On 05/11/21 10:25, Tim Chen wrote:
>>> update_next_balance() is only used in newidle_balance() so we could
>>> modify it to have:
>>>
>>> next = max(jiffies+1, next = sd->last_balance + interval)
>>
>> Is the extra assignment "next = sd->last_balance + interval" needed?
>> This seems more straight forward:
>>
>> next = max(jiffies+1, sd->last_balance + interval)
>
> I haven't been following the whole conversation closely, but it's always
> interesting when manipulating time in non time_*() functions.
>
> Is this max() safe against wrapping?
>

Vincent,

Sorry I haven't gotten back to you sooner. I was finally able to get some test
time on the test system. The fix corrects the next balance time
going backwards, but the frequency of balancing still remains the same,
so we don't see a performance improvement.

I incorporated Qais' suggestion to take care of the wrap-around case
(see patch #1) in the patches below. This patch by itself prevented
next_balance from going backward. However, most of the time the
next balance occurs immediately, in the jiffie after newidle_balance()
occurred, and we still have the same amount of load balancing as the vanilla
kernel on the OLTP workload I was looking at. I didn't see a performance
improvement with just patches #1 and #2.

The current logic is that when a CPU becomes idle, the next balance occurs very
shortly (usually in the next jiffie), as get_sd_balance_interval() returns
the next balance in the next jiffie if the CPU is idle. However, in
reality I saw most CPUs are 95% busy on average for my workload, and
a task will wake up on an idle CPU shortly. So frequent idle
balancing towards briefly idle CPUs is counterproductive: it simply
increases overhead and does not improve performance.

I tried a patch (patch 3) in addition to the other patches. It improved
performance by 5%, which is quite significant for my OLTP workload.
The patch considers a CPU busy when its average utilization is more
than 80% when determining the next_balance interval (see the sketch below).
This tweak may not be ideal for the case where a CPU becomes idle after a
CPU-intensive task that has dominated it for a long time blocks for a while.
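
For illustration, the 80% test can stay in integer arithmetic, as patch 3 below
does with cpu_util() and capacity_orig_of(); a minimal stand-alone sketch with
hypothetical utilization and capacity values:

#include <stdio.h>

/* Integer form of "utilization above 80% of capacity":
 * util / capacity > 0.8  <=>  util * 10 > capacity * 8 */
static int cpu_is_busy(unsigned long util, unsigned long capacity)
{
	return util * 10 > capacity * 8;
}

int main(void)
{
	/* hypothetical utilization values against a capacity of 1024 */
	printf("util 900/1024 -> busy = %d\n", cpu_is_busy(900, 1024));	/* 1 */
	printf("util 700/1024 -> busy = %d\n", cpu_is_busy(700, 1024));	/* 0 */
	return 0;
}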

Hopefully we can find a way to make a good judgement on whether we have
a mostly busy CPU that becomes idle with a task likely to wake up on
it soon. In such a case, we should push out the next balance time. Such
logic is absent today in the idle load balance path, and the frequent
load balancing hurts performance when cgroups are turned on, because
computing update_blocked_averages() before the load balance becomes
expensive. For my OLTP workload, we lose 9% of performance when cgroups
are turned on.

Tim


----

From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
Message-Id: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Tue, 11 May 2021 09:55:41 -0700
Subject: [PATCH 1/3] sched: Fix rq->next_balance time updated to
earlier than the current time

In traces on newidle_balance(), this_rq->next_balance
time goes backward and earlier than current time jiffies, e.g.

11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c

It doesn't make sense to have a next_balance in the past.
Fix newidle_balance() and update_next_balance() so the next
balance time is at least jiffies+1.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..740a0572cbf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)

/* used by idle balance, so cpu_busy = 0 */
interval = get_sd_balance_interval(sd, 0);
- next = sd->last_balance + interval;
+ if (time_after(jiffies+1, sd->last_balance + interval))
+ next = jiffies+1;
+ else
+ next = sd->last_balance + interval;

if (time_after(*next_balance, next))
*next_balance = next;
@@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

out:
/* Move the next balance forward */
+ if (time_after(jiffies+1, this_rq->next_balance))
+ this_rq->next_balance = jiffies+1;
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;

--
2.20.1


From 59de98515bda38b8d6faec5f8c68e1c9ec18962e Mon Sep 17 00:00:00 2001
Message-Id: <59de98515bda38b8d6faec5f8c68e1c9ec18962e.1623433293.git.tim.c.chen@linux.intel.com>
In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
From: Vincent Guittot <[email protected]>
Date: Fri, 7 May 2021 14:38:10 -0700
Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are defering
load balance

In newidle_balance(), the scheduler skips load balance to the new idle cpu when sd is this_rq and when

this_rq->avg_idle < sd->max_newidle_lb_cost

Doing a costly call to update_blocked_averages() will
not be useful and simply adds overhead when this condition is true.

Check the condition early in newidle_balance() to skip update_blocked_averages()
when possible.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 740a0572cbf1..a69bfc651e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10615,17 +10615,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);

+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

- rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

goto out;
}
+ rcu_read_unlock();

raw_spin_unlock(&this_rq->lock);

--
2.20.1


From 4622055d989a5feb446a7893a48fcd31305ec7a7 Mon Sep 17 00:00:00 2001
Message-Id: <4622055d989a5feb446a7893a48fcd31305ec7a7.1623433293.git.tim.c.chen@linux.intel.com>
In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
From: Tim Chen <[email protected]>
Date: Mon, 24 May 2021 13:21:03 -0700
Subject: [PATCH 3/3] sched: Don't shorten the load balance interval of an 80%
or more busy CPU

For a CPU that's busy 80% or more on average, it is quite likely that a task
will wake up on it very soon. It is better not to shorten the load
balance interval as if the CPU were completely idle, to save on load
balancing overhead.

Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a69bfc651e55..7353395d8a3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9895,12 +9895,11 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
}

static inline void
-update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
+update_next_balance(struct sched_domain *sd, unsigned long *next_balance, int cpu_busy)
{
unsigned long interval, next;

- /* used by idle balance, so cpu_busy = 0 */
- interval = get_sd_balance_interval(sd, 0);
+ interval = get_sd_balance_interval(sd, cpu_busy);
if (time_after(jiffies+1, sd->last_balance + interval))
next = jiffies+1;
else
@@ -10593,6 +10592,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
struct sched_domain *sd;
int pulled_task = 0;
u64 curr_cost = 0;
+ int cpu_busy = 0;

update_misfit_status(NULL, this_rq);
/*
@@ -10618,12 +10618,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);

+ /*
+ * Consider the cpu busy if it has more than 80% average utilization.
+ * Idle balance such cpu not as frequently as a task may wake up soon.
+ */
+ if ((cpu_util(this_cpu) * 10 > capacity_orig_of(this_cpu) * 8))
+ cpu_busy = 1;
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

if (sd)
- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance, cpu_busy);
+
rcu_read_unlock();

goto out;
@@ -10639,7 +10647,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
u64 t0, domain_cost;

if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance, cpu_busy);
break;
}

@@ -10657,7 +10665,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
curr_cost += domain_cost;
}

- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance, cpu_busy);

/*
* Stop searching for tasks to pull if there are
--
2.20.1

2021-06-18 12:34:51

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, 11 Jun 2021 at 22:00, Tim Chen <[email protected]> wrote:
>
>
>
> On 5/12/21 6:59 AM, Qais Yousef wrote:
> > On 05/11/21 10:25, Tim Chen wrote:
> >>> update_next_balance() is only used in newidle_balance() so we could
> >>> modify it to have:
> >>>
> >>> next = max(jiffies+1, next = sd->last_balance + interval)
> >>
> >> Is the extra assignment "next = sd->last_balance + interval" needed?
> >> This seems more straight forward:
> >>
> >> next = max(jiffies+1, sd->last_balance + interval)
> >
> > I haven't been following the whole conversation closely, but it's always
> > interesting when manipulating time in non time_*() functions.
> >
> > Is this max() safe against wrapping?
> >
>
> Vincent,
>
> Sorry I haven't got back sooner. I finally was able to get some test
> time on the test system. The fix works to correct the next balance time
> going backwards but the frequency of balancing still remains the same,
> so we don't see performance improvement.
>
> I incorporated Qais' suggestion to take care of the wrap around time
> (see patch #1) in patches below. This patch by itself prevented
> the next_balance from going backward. However, most of the time the
> next_balance occurs immediately in the next jiffie after newidle_balance

Which is not really surprising as we don't want to keep a CPU idle if
another one is overloaded.

> occured and we still have the same amount of load balancing as the vanilla
> kernel on the OLTP workload I was looking at. I didn't see performance
> improvement with just patch#1 and patch#2.
>
> The current logic is when a CPU becomes idle, next_balance occur very
> shortly (usually in the next jiffie) as get_sd_balance_interval returns
> the next_balance in the next jiffie if the CPU is idle. However, in
> reality, I saw most CPUs are 95% busy on average for my workload and
> a task will wake up on an idle CPU shortly. So having frequent idle
> balancing towards shortly idle CPUs is counter productive and simply
> increase overhead and does not improve performance.

Just to make sure that I understand your problem correctly, your problem is:
- that we have an ilb happening on the idle CPU and consuming cycles
- or that the ilb will pull a task onto an idle CPU on which another task will
shortly wake up, which ends up with 2 tasks competing for the same CPU.

>
> I tried a patch (patch 3) in addition to the other patches. It improved
> performance by 5%, which is quite significant for my OLTP workload.
> The patch considers a CPU busy when average its utilization is more
> than 80% when determining the next_balance interval. This tweak may
> not be ideal for the case when CPU becomes idle after a CPU intensive
> task dominates a CPU for a long time and will block for a while.
>
> Hopefully we can find a way to make good judgement on whether we have
> a mostly busy CPU that becomes idle, and a task likely to wake up on
> it soon. For such case, we should push out the next balance time. Such
> logic is absent today in the idle load balance path. And such frequent
> load balancing hurt performance when cgroup is turned on. Computing
> update_blocked_averages before load balance becomes expensive. For my
> OLTP workload, we lose 9% of performance when cgroup is turned on.
>
> Tim
>
>
> ----
>
> From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
> Message-Id: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <[email protected]>
> Date: Tue, 11 May 2021 09:55:41 -0700
> Subject: [PATCH 1/3] sched: sched: Fix rq->next_balance time updated to
> earlier than current time
>
> In traces on newidle_balance(), this_rq->next_balance
> time goes backward and earlier than current time jiffies, e.g.
>
> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>
> It doesn't make sense to have a next_balance in the past.
> Fix newidle_balance() and update_next_balance() so the next
> balance time is at least jiffies+1.
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..740a0572cbf1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>
> /* used by idle balance, so cpu_busy = 0 */
> interval = get_sd_balance_interval(sd, 0);
> - next = sd->last_balance + interval;
> + if (time_after(jiffies+1, sd->last_balance + interval))
> + next = jiffies+1;
> + else
> + next = sd->last_balance + interval;
>
> if (time_after(*next_balance, next))
> *next_balance = next;
> @@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> out:
> /* Move the next balance forward */
> + if (time_after(jiffies+1, this_rq->next_balance))
> + this_rq->next_balance = jiffies+1;
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> --
> 2.20.1
>
>
> From 59de98515bda38b8d6faec5f8c68e1c9ec18962e Mon Sep 17 00:00:00 2001
> Message-Id: <59de98515bda38b8d6faec5f8c68e1c9ec18962e.1623433293.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> From: Vincent Guittot <[email protected]>
> Date: Fri, 7 May 2021 14:38:10 -0700
> Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are defering
> load balance
>
> In newidle_balance(), the scheduler skips load balance to the new idle cpu when sd is this_rq and when
>
> this_rq->avg_idle < sd->max_newidle_lb_cost
>
> Doing a costly call to update_blocked_averages() will
> not be useful and simply adds overhead when this condition is true.
>
> Check the condition early in newidle_balance() to skip update_blocked_averages()
> when possible.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 740a0572cbf1..a69bfc651e55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10615,17 +10615,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> */
> rq_unpin_lock(this_rq, rf);
>
> + rcu_read_lock();
> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
> +
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> - rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> goto out;
> }
> + rcu_read_unlock();
>
> raw_spin_unlock(&this_rq->lock);
>
> --
> 2.20.1
>
>
> From 4622055d989a5feb446a7893a48fcd31305ec7a7 Mon Sep 17 00:00:00 2001
> Message-Id: <4622055d989a5feb446a7893a48fcd31305ec7a7.1623433293.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <[email protected]>
> Date: Mon, 24 May 2021 13:21:03 -0700
> Subject: [PATCH 3/3] sched: Don't shorten the load balance interval of a 80%
> or more busy CPU
>
> For a CPU that's busy 80% or more on average, it is quite likely that a task
> will wake up on it very soon. It is better to not shorten the load
> balance interval as if it is completely idle to save on the load
> balancing overhead.
>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a69bfc651e55..7353395d8a3a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9895,12 +9895,11 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
> }
>
> static inline void
> -update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
> +update_next_balance(struct sched_domain *sd, unsigned long *next_balance, int cpu_busy)
> {
> unsigned long interval, next;
>
> - /* used by idle balance, so cpu_busy = 0 */
> - interval = get_sd_balance_interval(sd, 0);
> + interval = get_sd_balance_interval(sd, cpu_busy);
> if (time_after(jiffies+1, sd->last_balance + interval))
> next = jiffies+1;
> else
> @@ -10593,6 +10592,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> struct sched_domain *sd;
> int pulled_task = 0;
> u64 curr_cost = 0;
> + int cpu_busy = 0;
>
> update_misfit_status(NULL, this_rq);
> /*
> @@ -10618,12 +10618,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> + /*
> + * Consider the cpu busy if it has more than 80% average utilization.
> + * Idle balance such cpu not as frequently as a task may wake up soon.
> + */
> + if ((cpu_util(this_cpu) * 10 > capacity_orig_of(this_cpu) * 8))
> + cpu_busy = 1;
> +
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> !READ_ONCE(this_rq->rd->overload) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> if (sd)
> - update_next_balance(sd, &next_balance);
> + update_next_balance(sd, &next_balance, cpu_busy);
> +
> rcu_read_unlock();
>
> goto out;
> @@ -10639,7 +10647,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> u64 t0, domain_cost;
>
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> - update_next_balance(sd, &next_balance);
> + update_next_balance(sd, &next_balance, cpu_busy);
> break;
> }
>
> @@ -10657,7 +10665,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> curr_cost += domain_cost;
> }
>
> - update_next_balance(sd, &next_balance);
> + update_next_balance(sd, &next_balance, cpu_busy);
>
> /*
> * Stop searching for tasks to pull if there are
> --
> 2.20.1
>

2021-06-18 19:00:24

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ



On 6/18/21 3:28 AM, Vincent Guittot wrote:

>>
>> The current logic is when a CPU becomes idle, next_balance occur very
>> shortly (usually in the next jiffie) as get_sd_balance_interval returns
>> the next_balance in the next jiffie if the CPU is idle. However, in
>> reality, I saw most CPUs are 95% busy on average for my workload and
>> a task will wake up on an idle CPU shortly. So having frequent idle
>> balancing towards shortly idle CPUs is counter productive and simply
>> increase overhead and does not improve performance.
>
> Just to make sure that I understand your problem correctly: Your problem is:
> - that we have an ilb happening on the idle CPU and consume cycle

That's right. The cycles are consumed heavily in update_blocked_averages()
when cgroup is enabled.

> - or that the ilb will pull a task on an idle CPU on which a task will
> shortly wakeup which ends to 2 tasks competing for the same CPU.
>

For the OLTP workload I'm looking at, we have tasks that sleep
for a short while and wake again very shortly (i.e. the CPU actually
is ~95% busy on average), so pulling tasks to such a CPU is really not
helpful for improving overall CPU utilization in the system. My
intuition is that for such an almost fully busy CPU, we should defer load
balancing for it (see prototype patch 3).

Tim




2021-06-25 08:51:47

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

On Fri, 18 Jun 2021 at 18:14, Tim Chen <[email protected]> wrote:
>
>
>
> On 6/18/21 3:28 AM, Vincent Guittot wrote:
>
> >>
> >> The current logic is when a CPU becomes idle, next_balance occur very
> >> shortly (usually in the next jiffie) as get_sd_balance_interval returns
> >> the next_balance in the next jiffie if the CPU is idle. However, in
> >> reality, I saw most CPUs are 95% busy on average for my workload and
> >> a task will wake up on an idle CPU shortly. So having frequent idle
> >> balancing towards shortly idle CPUs is counter productive and simply
> >> increase overhead and does not improve performance.
> >
> > Just to make sure that I understand your problem correctly: Your problem is:
> > - that we have an ilb happening on the idle CPU and consume cycle
>
> That's right. The cycles are consumed heavily in update_blocked_averages()
> when cgroup is enabled.

But those cycles are normally consumed on an idle CPU, and the ILB checks
need_resched() before running the load balance for the next idle CPU.

Does it mean that your problem is coming from update_blocked_averages()
spending a long time with rq_lock_irqsave and increasing the wakeup
latency of your short-running tasks?

>
> > - or that the ilb will pull a task on an idle CPU on which a task will
> > shortly wakeup which ends to 2 tasks competing for the same CPU.
> >
>
> Because for the OLTP workload I'm looking at, we have tasks that sleep
> for a short while and wake again very shortly (i.e. the CPU actually
> is ~95% busy on average), pulling tasks to such a CPU is really not
> helpful to improve overall CPU utilization in the system. So my
> intuition is for such almost fully busy CPU, we should defer load
> balancing to it (see prototype patch 3).

Note that this is the opposite of what you said earlier:
"
Though in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) compared to the default (500000), to encourage
migrations to idle cpu
and reduce latency.
"

But it will be quite hard to find a value that fits everybody's
requirements, and some will have use cases for which they want to pull tasks
even if the CPU is 95% busy; you can have 2ms of idle time while
utilization is above 95%, and an ILB inside a core or at the LLC level is
somewhat cheap and would take advantage of those 2ms.

>
> Tim
>
>
>
>