Received-SPF: pass (google.com: domain of linux-kernel+bounces-176922-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:40f1:3f00::1 as permitted sender) client-ip=2604:1380:40f1:3f00::1;
Date: Sun, 12 May 2024 16:29:39 +0100
From: Qais Yousef <qyousef@layalina.io>
To: Christian Loehle <christian.loehle@arm.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>, linux-kernel@vger.kernel.org,
	peterz@infradead.org, juri.lelli@redhat.com, mingo@redhat.com,
	dietmar.eggemann@arm.com, vschneid@redhat.com,
	vincent.guittot@linaro.org, Johannes.Thumshirn@wdc.com,
	adrian.hunter@intel.com, ulf.hansson@linaro.org, andres@anarazel.de,
	asml.silence@gmail.com, linux-pm@vger.kernel.org,
	linux-block@vger.kernel.org, io-uring@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost
Message-ID: <20240512152939.uua2sritrwg4zobj@airbuntu>
References: <20240304201625.100619-1-christian.loehle@arm.com>
 <20240304201625.100619-3-christian.loehle@arm.com>
 <CAJZ5v0gMni0QJTBJXoVOav=kOtQ9W--NyXAgq+dXA+m-bciG8w@mail.gmail.com>
 <5060c335-e90a-430f-bca5-c0ee46a49249@arm.com>
 <CAJZ5v0janPrWRkjcLkFeP9gmTC-nVRF-NQCh6CTET6ENy-_knQ@mail.gmail.com>
 <20240325023726.itkhlg66uo5kbljx@airbuntu>
 <d99fd27a-dac5-4c71-b644-1213f51f2ba0@arm.com>
 <20240429111816.mqok5biihvy46eba@airbuntu>
 <80da988f-899e-4b93-a648-ffd0680d4000@arm.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <80da988f-899e-4b93-a648-ffd0680d4000@arm.com>

On 05/07/24 16:19, Christian Loehle wrote:
> On 29/04/2024 12:18, Qais Yousef wrote:
> > On 04/19/24 14:42, Christian Loehle wrote:
> > 
> >>> I think the major thing we need to be careful about is the behavior when the
> >>> task is sleeping. I think the boosting will be removed when the task is
> >>> dequeued and I can bet there will be systems out there where the BLOCK softirq
> >>> being boosted when the task is sleeping will matter.
> >>
> >> Currently I see this mainly protected by the sugov rate_limit_us.
> >> With the enqueue's being the dominating cpufreq updates it's not really an
> >> issue, the boost is expected to survive the sleep duration, during which it
> >> wouldn't be active.
> >> I did experiment with some sort of 'stickiness' of the boost to the rq, but
> >> it is somewhat of a pain to deal with if we want to remove it once enqueued
> >> on a different rq. A sugov 1ms timer is much simpler of course.
> >> Currently it's not necessary IMO, but for the sake of being future-proof in
> >> terms of more frequent freq updates I might include it in v2.
> > 
> > Making sure things work with purpose would be really great. This implicit
> > dependency is not great IMHO and make both testing and reasoning about why
> > things are good or bad harder when analysing real workloads. Especially by non
> > kernel developers.
> 
> Agreed.
> Even without your proposed changes [1] relying on sugov rate_limit_us is
> unfortunate.
> There is a problem with an arbitrarily low rate_limit_us more generally, not
> just because we kind of rely on the CPU being boosted right before the task is
> actually enqueued (for the interrupt/softirq part of it), but also because of
> the latency from requested frequency improvement to actually running on that
> frequency. If the task is 90% done by the time it sees the improvement and
> the frequency will be updated (back to a lower one) before the next enqueue,
> then that's hardly worth the effort.

I think that's why iowait boost is done the way it is today. You need to
sustain the boost because the things that need it run for a very short amount
of time.

Have you looked at how long the iowait boosted tasks run for in your tests?

> Currently this is covered by rate_limit_us probabillistically and that seems

Side note: while looking more at the history of rate_limit_us and why it is set
so much higher than the transition_latency reported by the driver, I am slowly
reaching the conclusion that what is happening here is similar to the
introduction of down_rate_limit_us in Android. Our ability to predict system
requirements is not great and we end up prematurely lowering frequencies, and
this rate limit just happens to be more likely to keep them higher, so on
average you see better perf in various workloads. I hope we can improve on this
so we don't rely on this magic and can make better use of the hardware's
ability to transition between frequencies fast and be more exact. I am trying
to work on this as part of my magic margins series. But I think the story is
bigger than that.
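
To illustrate the effect described above, here is a toy model of a rate limit
that drops frequency requests arriving too soon after the last applied one. It
is a simplification, not the schedutil code; the function name and the example
numbers are made up for illustration.

```python
def apply_rate_limit(requests, rate_limit_us):
    """Toy model: apply a request only if at least rate_limit_us has
    elapsed since the last applied request; otherwise drop it.

    requests: list of (time_us, freq_khz) tuples sorted by time.
    Returns the (time_us, freq_khz) tuples that were actually applied.
    """
    applied = []
    last_update_us = None
    for time_us, freq_khz in requests:
        if last_update_us is not None and time_us - last_update_us < rate_limit_us:
            continue  # too soon after the previous change: request is dropped
        applied.append((time_us, freq_khz))
        last_update_us = time_us
    return applied


# Hypothetical trace: a task asks for a high frequency, sleeps briefly
# (a low request is made), then wakes up and asks for the high one again.
requests = [(0, 2000000), (300, 800000), (600, 2000000)]

# With a 2000us limit the lowering at t=300us is dropped, so the CPU is still
# at the high frequency when the task wakes up at t=600us.
high_limit = apply_rate_limit(requests, 2000)

# With a 100us limit every request is applied, including the early lowering.
low_limit = apply_rate_limit(requests, 100)
```

In this model the larger limit keeps the frequency high across the short sleep
only as a side effect, which is the implicit dependency being discussed.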

> to be good enough in practice, but it's not very pleasing (and also EAS can't
> take it into consideration).
> That's not just exclusive for iowait wakeup tasks of course, but in theory any
> that is off the rq frequently (and still requests a higher frequency than it can
> realistically build up through util_avg like through uclamp_min).

For uclamp_min, if the task's RUNNING duration is shorter than the time the
hardware needs to apply the boost, I think there's little we can do. The user
can opt to boost the system in general. Note that this is likely a problem on
systems with multi-ms transition_delay_us. If the task is running for a few
hundred us and it really wants to be boosted for this short time, then a static
system-wide boost is all we can do to guarantee what it wants. The hardware is
a poor fit for the use case in this scenario. And personally I'd push back
against introducing complexity to deal with such poor-fit scenarios. We can
already set min freq for the policy via sysfs, and uclamp_min can still help
with task placement on HMP systems.
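
A back-of-the-envelope sketch of that argument, assuming the boost is requested
when the task starts running and only takes effect one transition delay later
(the helper name and the numbers are hypothetical):

```python
def boosted_fraction(run_us, transition_delay_us):
    """Fraction of a single run that executes at the boosted frequency,
    assuming the request is made at the start of the run and lands
    transition_delay_us later."""
    if run_us <= 0:
        raise ValueError("run_us must be positive")
    return max(0.0, run_us - transition_delay_us) / run_us


# A task running for 300us on hardware with a 2ms transition delay never
# sees its own boost.
short_task = boosted_fraction(300, 2000)

# A task running for 10ms on the same hardware is boosted for most of its run.
long_task = boosted_fraction(10000, 2000)
```

Under these assumptions the short task gets nothing from a per-task request,
which is why a static floor is the only guarantee in that case.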

Now, the problem we could have, which is similar to the iowait boost scenario,
is when there's a chaining effect that requires the boost for the duration of
the chain.

I think we can do something about it if we:

	1. Guarantee the chain will run on the same CPU.
	2. Introduce a 'sticky' flag for the boost to stay while the chain is
	   running.
	3. Introduce a start/finish indication for the chain.

I think we can do something like that with sched-qos [1] to tag the chain via
a cookie and request the boost to apply to them collectively.
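
The three points above could be sketched roughly as follows. This is only an
illustration of the bookkeeping, assuming the chain stays on one CPU; the class
and method names are hypothetical and do not correspond to any existing kernel
or sched-qos interface.

```python
class ChainBoost:
    """Hypothetical per-CPU bookkeeping for 'sticky' chain boosts."""

    def __init__(self):
        # cookie -> boost value requested for the chain tagged with it
        self._active = {}

    def chain_start(self, cookie, boost):
        """Start indication: the boost becomes sticky for this cookie."""
        self._active[cookie] = boost

    def chain_finish(self, cookie):
        """Finish indication: drop the sticky boost for this cookie."""
        self._active.pop(cookie, None)

    def effective_boost(self):
        """Boost to apply on this CPU, even while no chain member is
        currently enqueued."""
        return max(self._active.values(), default=0)


cpu0 = ChainBoost()
cpu0.chain_start(cookie=42, boost=512)
during_chain = cpu0.effective_boost()  # stays up between the chain's tasks
cpu0.chain_finish(cookie=42)
after_chain = cpu0.effective_boost()   # removed once the chain is done
```

The key design point is that the boost lifetime follows the start/finish
indications rather than the enqueue/dequeue of any single task.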

Generally, userspace would be better off collapsing this chain into a single
task that runs in one go.

I don't know how often this scenario exists in practice and what limitations
exist that make the simple collapse solution infeasible. So I'd leave this
out until more info is available.

> 
> >>>
> >>> FWIW I do have an implementation for per-task iowait boost where I went a step
> >>> further and converted intel_pstate too and like Christian didn't notice
> >>> a regression. But I am not sure (rather don't think) I triggered this use case.
> >>> I can't tell when the systems truly have per-cpu cpufreq control or just appear
> >>> so and they are actually shared but not visible at linux level.
> >>
> >> Please do share your intel_pstate proposal!
> > 
> > This is what I had. I haven't been working on this for the past few months, but
> > I remember tried several tests on different machines then without a problem.
> > I tried to re-order patches at some point though and I hope I didn't break
> > something accidentally and forgot the state.
> > 
> > https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation
> > 
> 
> Thanks for sharing, that looks reasonable with consolidating it into uclamp_min.
> Couple of thoughts on yours, I'm sure you're aware, but consider it me thinking out
> loud:
> - iowait boost is taken into consideration for task placement, but with just the
> 4 steps that made it more aggressive on HMP. (Potentially 2-3 consecutive iowait
> wakeups to land on the big instead of running at max OPP of a LITTLE).

Yeah I opted to keep the logic the same. I think there are gains to be had even
without being smarter about the algorithm. But we do need to improve it, yes.
The current logic is too aggressive and the perf/power trade-off will be tricky
in practice.
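
For reference, the existing logic being kept can be modelled roughly like this.
It is a simplified sketch of the doubling/halving behaviour in schedutil and
ignores the pending flag, the idle reset and rate limiting; the function name
is made up.

```python
SCHED_CAPACITY_SCALE = 1024
IOWAIT_BOOST_MIN = SCHED_CAPACITY_SCALE // 8


def next_boost(boost, iowait_wakeup):
    """Simplified model: double the boost on an iowait wakeup (starting at
    the minimum), halve it on an update without one, drop it to zero once
    it falls below the minimum."""
    if iowait_wakeup:
        if boost == 0:
            return IOWAIT_BOOST_MIN
        return min(boost * 2, SCHED_CAPACITY_SCALE)
    boost //= 2
    return boost if boost >= IOWAIT_BOOST_MIN else 0


# Four consecutive iowait wakeups are enough to go from no boost to max.
ramp = []
boost = 0
for _ in range(4):
    boost = next_boost(boost, True)
    ramp.append(boost)

# Updates without an iowait wakeup decay the boost back down again.
decay = []
for _ in range(4):
    boost = next_boost(boost, False)
    decay.append(boost)
```

With only four steps between the minimum and 1024, a couple of wakeups already
move the value a long way, which is where the aggressiveness comes from.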

> - If the current iowait boost decay is sensible is questionable, but there should
> probably be some decay. Taken to the extreme this would mean something
> like blk_wait_io() demands 1024 utilization, if it waits for a very long time.
> Repeating myself here, but iowait wakeups itself is tricky to work with (and I
> try to work around that).

I didn't get you here. But generally the story can go a few levels deep, yes.
My approach was to make incremental progress without breaking existing stuff,
while helping to move things in the right direction over time. Fixing
everything in one go will be hard and not productive.

> - The intel_pstate solution will increase boost even if
> previous_wakeup->iowait_boost > current->iowait_boost
> right? But using current->iowait_boost is a clever idea.

I forgot the details now. But I seem to remember intel_pstate had its own
accounting when it sees iowait boost in intel_pstate_hwp_boost_up/down().

Note that the major worry I had is about the softirq not being boosted. In my
testing this didn't seem to show up, as things seemed fine, but I haven't dug
down to see how accidental this was. I could easily see my patches making some
use case out there unhappy, as the softirq might not get a chance to see the
boost. I got distracted with other stuff and haven't got back to the topic
since then. I am more than happy to support your efforts though :)

[1] https://lore.kernel.org/lkml/20230916213316.p36nhgnibsidoggt@airbuntu/