From: Joel Fernandes
Date: Mon, 1 Feb 2021 10:13:54 -0500
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
To: Vincent Guittot
Cc: linux-kernel, Paul McKenney, Frederic Weisbecker, Dietmar Eggemann,
    Qais Yousef, Ben Segall, Daniel Bristot de Oliveira, Ingo Molnar,
    Juri Lelli, Mel Gorman, Peter Zijlstra, Steven Rostedt,
    "Uladzislau Rezki (Sony)", Neeraj Upadhyay
References: <20210122154600.1722680-1-joel@joelfernandes.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 29, 2021 at 5:33 AM Vincent Guittot wrote:
[...]
> > > > > So why is it a problem for you? You are mentioning newly idle load
> > > > > balance, so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance, which then calls
> > > > > update_blocked_averages.
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see
> > > > it happen with other load balances as well, but not as often, since
> > > > those don't run as frequently as the new idle balance.
> > >
> > > The update of the average blocked load was added in newidle_balance
> > > to take advantage of the CPU becoming idle, but it seems to create a
> > > long preempt-off sequence. I'm going to prepare a patch to move it
> > > out of the schedule path.
> >
> > Ok thanks, that would really help!
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of times it happens, but it will not prevent it from
> > > > > happening.
> > > >
> > > > Sure, but a soft real-time issue can be tolerated if it does not happen
> > > > very often. In this case, though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing? It's
> > > not clear to me what happens in your case in the end. Did you get an
> > > audio glitch, for example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and the schbench micro-benchmark clearly showing
> > the issue and the improvement. Sorry, I don't have a real-world
> > reproducer.
>
> In fact I disagree, and I'm not sure that your results show the real
> problem rather than just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle CPU
> every time. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8-CPU system), all threads wake up directly on an idle
> CPU and newidle balance is never involved. As a result, the schbench
> results are not impacted by newidle_balance, whatever its duration.

I disagree. Here you are assuming that schbench is the only task running
on the system. There are other processes and daemons as well. I see a
strong correlation between commenting out update_blocked_averages() and
not seeing the latency hit at the higher percentiles.

> This means that a problem in newidle_balance doesn't impact schbench
> results with correct task placement. This also means that in your
> case, some threads are placed on the same CPU and wait to be scheduled,
> and finally a lot of things can generate the increased delay...
> If you look at your results for schbench -t 2 -m 2 -r 5, the *99.0th:
> 12656 (8 samples) shows a delay of around 12 ms, which is the typical
> time slice of a task when several tasks are fighting for the same CPU
> and one has to wait. So this result does not reflect the duration of
> newidle balance; it instead shows that the task placement was wrong and
> one task had to wait before running. Your RFC patch probably changes
> the dynamics, and as a result the task placement, but it does not save
> 12 ms and is irrelevant to the problem that you raised about the
> duration of update_blocked_load.
>
> If I run schbench -t 3 -m 2 -r 5 on a Dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
>
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
>         50.0th: 315 (159 samples)
>         75.0th: 735 (80 samples)
>         90.0th: 773 (48 samples)
>         95.0th: 845 (16 samples)
>         *99.0th: 12336 (12 samples)
>         99.5th: 15408 (2 samples)
>         99.9th: 17504 (1 samples)
>         min=4, max=17473

Sure, there could be a different problem causing these higher-percentile
latencies on your device. I still think 12 ms is awful.

> I have similar results, and a quick look at the trace shows that 2
> tasks are fighting for the same CPU whereas there are idle CPUs. Then,
> if I use a cpufreq governor other than schedutil, such as ondemand,
> EAS is disabled and the results become:
>
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
>         50.0th: 232 (159 samples)
>         75.0th: 268 (80 samples)
>         90.0th: 292 (49 samples)
>         95.0th: 307 (15 samples)
>         *99.0th: 394 (12 samples)
>         99.5th: 397 (2 samples)
>         99.9th: 400 (1 samples)
>         min=114, max=400

Yes, definitely, changing the governor also solves the problem (for
example if the performance governor is used). The problem happens at low
frequencies.
> So a quick and wrong conclusion could be to say that you should disable EAS ;-)

Sure, except the power management guys may come after me ;-)

[..]

> The only point that I agree with is that running
> update_blocked_averages with preempt and irqs off is not a good thing,
> because we don't manage the number of cfs_rq to update, and I'm going
> to provide a patchset for this.

That's fine, as long as we agree on this problem ;-) Thanks for providing
the patches and I will try them once they are ready.

> > for this.
> >
> > > Also update_blocked_averages was supposed to be called in
> > > newidle_balance when the coming idle duration is expected to be
> > > long enough
> >
> > No, we do not want the schedule loop to take half a millisecond.
>
> keep in mind that you are scaling frequency, so everything takes time
> at the lowest frequency/capacity ...

Agreed, I was also thinking the same. But that doesn't change the fact
that there is room for improvement, and I'm grateful to you for trying
to improve it!

thanks,

- Joel