Date: Mon, 25 Mar 2024 02:53:08 +0000
From: Qais Yousef
To: Christian Loehle
Cc: Bart Van Assche, linux-kernel@vger.kernel.org, peterz@infradead.org,
	juri.lelli@redhat.com, mingo@redhat.com, rafael@kernel.org,
	dietmar.eggemann@arm.com, vschneid@redhat.com,
	vincent.guittot@linaro.org, Johannes.Thumshirn@wdc.com,
	adrian.hunter@intel.com, ulf.hansson@linaro.org, andres@anarazel.de,
	asml.silence@gmail.com, linux-pm@vger.kernel.org,
	linux-block@vger.kernel.org, io-uring@vger.kernel.org,
	linux-mmc@vger.kernel.org
Subject: Re: [RFC PATCH 0/2] Introduce per-task io utilization boost
Message-ID: <20240325025308.6uqkhpyba6moxntl@airbuntu>
References: <20240304201625.100619-1-christian.loehle@arm.com>
 <86f0af00-8765-4481-9245-1819fb2c6379@acm.org>
 <0dc6a839-2922-40ac-8854-2884196da9b9@arm.com>
 <2784c093-eea1-4b73-87da-1a45f14013c8@arm.com>
 <20240321123935.zqscwi2aom7lfhts@airbuntu>
 <1ff973fc-66a4-446e-8590-ec655c686c90@arm.com>
In-Reply-To: <1ff973fc-66a4-446e-8590-ec655c686c90@arm.com>

On 03/21/24 17:57, Christian Loehle wrote:
> > So you want the hardirq to move to the big core? Unlike softirq, there will be
> > a single hardirq for the controller (to my limited knowledge), so if there are
> > multiple requests I'm not sure we can easily match which one relates to which
> > before it triggers. So we can end up waking up the wrong core.
> 
> It would be beneficial to move the hardirq to a big core if the IO task
> is using it anyway.
> I'm not sure I actually want to. There are quite a few pitfalls (like you

I'm actually against it. I think it's too much complexity for not necessarily
a big gain. FWIW, one of the design requests for per-task iowait boost is to
be able to *disable* it. It wastes power when only a handful of tasks actually
care about perf.

Caring where the hardirq runs for perf is unlikely to be a problem in
practice. Softirq should follow the requester already when it matters.

> mentioned) that the scheduler really shouldn't be concerned about.
> Moving the hardirq, if implemented in the kernel, would have to be done by the
> host controller driver anyway, which would explode this series.
> (host controller drivers are quite fragmented e.g. on mmc)
> 
> The fact that having a higher capacity CPU available ("running faster") for an
> IO task doesn't (always) imply higher throughput because of the hardirq staying
> on some LITTLE CPU is bothering (for this series), though.
> 
> > 
> > Generally this should be a userspace policy. If there's a scenario where the
> > throughput is that important they can easily move the hardirq to the big core
> > unconditionally and move it back again once this high throughput scenario is no
> > longer important.
> 
> It also feels wrong to let this be a userspace policy, as the hardirq must be
> migrated to the perf domain of the task, which userspace isn't aware of.
> Unless you expect userspace to do

The irq balancer is a userspace policy. For the kernel to make an automatic
decision there are a lot of ifs that must hold. Again, I don't see maximizing
throughput being a concern on such systems. And userspace can fix the problem
simply - they know, after all, when throughput really matters to the point
where the CPU the hardirq runs on becomes a bottleneck. In practice, I don't
think it is a bottleneck. But this is my handwavy judgement. The experts know
better. And note, I mean use cases that are not benchmarks ;-)

> CPU_affinity_task=big_perf_domain_0 && hardirq_affinity=big_perf_domain_0
> but then you could just as well ask them to set performance governor for
> big_perf_domain_0 (or uclamp_min=1024) and need neither this series nor
> any iowait boosting.
> 
> Furthermore you can't generally expect userspace to know if their IO will lead
> to any interrupt at all, much less which one. They ideally don't even know if
> the file IO they are doing is backed by any physical storage in the first place.
> (Or even further, that they are doing file IO at all, they might just be
> e.g. page-faulting.)

The way I see it, it's like gigabit networking. The hardirq will only matter
once you reach such high throughput scenarios. Those are corner cases, not the
norm.

> 
> > 
> > Or where you describing a different problem?
> 
> That is the problem I mentioned in the series and Bart and I were discussing.
> It's a problem of the series as in "the numbers aren't that impressive".
> Current iowait boosting on embedded/mobile systems will perform quite well by
> chance, as the (low util) task will often be on the same perf domain the hardirq
> will be run on. As can be seen in the cover letter the benefit of running the
> task on a (2xLITTLE capacity) big CPU therefore are practically non-existent,
> for tri-gear systems where big CPU is more like 10xLITTLE capacity the benefit
> will be much greater.
> I just wanted to point this out. We might just acknowledge the problem and say
> "don't care" about the potential performance benefits of those scenarios that
> would require hardirq moving.
I thought the softirq does the bulk of the work. A hardirq being such a
bottleneck is (naively maybe) a red flag for me that it's doing more than
simple interrupt servicing.

You don't boost when the task is sleeping, right? I think this is likely a
cause of the problem where the softirq is not running as fast - before the
series the CPU would be iowait boosted regardless of whether the task is
blocked or not.

> In the long-term it looks like for UFS the problem will disappear as we are
> expected to get one queue/hardirq per CPU (as Bart mentioned), on NVMe that
> is already the case.
> 
> I CC'd Uffe and Adrian for mmc, to my knowledge the only subsystem where
> 'fast' (let's say >10K IOPS) devices are common, but only one queue/hardirq
> is available (and it doesn't look like this is changing anytime soon).
> I would also love to hear what Bart or other UFS folks think about it.
> Furthermore if I forgot any storage subsystem with the same behavior in that
> regards do tell me.
> 
> Lastly, you could consider the IO workload:
> IO task being in iowait very frequently [1] with just a single IO inflight [2]
> and only very little time being spent on the CPU in-between iowaits [3],
> therefore the interrupt handler being on the critical path for IO throughput
> to a non-negligible degree, to be niche, but it's precisely the use-case where
> iowait boosting shows its biggest benefit.
> 
> Sorry for the abomination of a sentence, see footnotes for the reasons.
> 
> [1] If sugov doesn't see significantly more than 1 iowait per TICK_NSEC it
> won't apply any significant boost currently.

I CCed you on a patch where I fix this. I've been sleeping on it for too long.
Maybe I should have split this fix out of the consolidation patch.

> [2] If the storage device has enough in-flight requests to serve, iowait
> boosting is unnecessary/wasteful, see cover letter.
> [3] If the task actually uses the CPU in-between iowaits, it will build up
> utilization, iowait boosting benefit diminishes.

The current mechanism is very aggressive. It needs to evolve for sure.

> 
> > 
> > Glad to see your series by the way :-) I'll get a chance to review it over the
> > weekend hopefully.
> 
> Thank you!
> Apologies for not CCing you in the first place, I am curious about your opinion
> on the concept!

I actually had a patch that implements iowait boost per-task (on top of my
remove uclamp max aggregation series) where I did actually take the extra step
to remove iowait from intel_pstate. I can share the patches if you think
you'll find them useful.

Just want to note that this mechanism can end up wasting power, and this is an
important direction to consider. It's not about perf only (which matters too).
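To illustrate what I mean by aggressive and task-agnostic, here is a condensed
sketch of how the current per-CPU boost ramps, based on my reading of
kernel/sched/cpufreq_schedutil.c (names and structure are simplified for
illustration; this is not the actual kernel code): it doubles on every iowait
wakeup up to full capacity, only halves when no wakeup is seen, and has no
notion of which task asked for it.

#include <stdbool.h>

#define SCALE		1024		/* stand-in for SCHED_CAPACITY_SCALE */
#define BOOST_MIN	(SCALE / 8)	/* smallest boost step */

struct cpu_iowait_boost {
	unsigned int boost;		/* current boost value, 0..SCALE */
	bool pending;			/* iowait wakeup seen since last update? */
};

/* A task woke from iowait on this CPU: escalate the boost. */
static void iowait_boost_enqueue(struct cpu_iowait_boost *b)
{
	b->pending = true;
	b->boost = b->boost ? b->boost * 2 : BOOST_MIN;
	if (b->boost > SCALE)
		b->boost = SCALE;
}

/* Frequency update: decay if no iowait wakeup was seen, then apply. */
static unsigned long iowait_boost_apply(struct cpu_iowait_boost *b,
					unsigned long util)
{
	if (!b->pending) {
		b->boost >>= 1;		/* halve the boost */
		if (b->boost < BOOST_MIN)
			b->boost = 0;
	}
	b->pending = false;

	/* Boost the whole CPU's request, regardless of which task did the IO. */
	return util > b->boost ? util : b->boost;
}

The per-task approach matters exactly because this per-CPU state keeps the
whole CPU's frequency request inflated even when the tasks that benefit from
the boost are a small minority.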
> 
> FWIW I did mess up a last-minute, what was supposed to be, cosmetic change that
> only received a quick smoke test, so 1/2 needs the following:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4aaf64023b03..2b6f521be658 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6824,7 +6824,7 @@ static void dequeue_io_boost(struct cfs_rq *cfs_rq, struct task_struct *p)
>  	} else if (p->io_boost_curr_ios < p->io_boost_threshold_down) {
>  		/* Reduce boost */
>  		if (p->io_boost_level > 1)
> -			io_boost_scale_interval(p, true);
> +			io_boost_scale_interval(p, false);
>  		else
>  			p->io_boost_level = 0;
>  	} else if (p->io_boost_level == IO_BOOST_LEVELS) {
> 
> 
> I'll probably send a v2 rebased on 6.9 when it's out anyway, but so far the
> changes are mostly cosmetic and addressing Bart's comments about the benchmark
> numbers in the cover letter.

I didn't spend a lot of time on the series, but I can see a number of
problems. Let us discuss them first and plan a future direction. No need to
send a v2 if it's just for this fix, IMO.
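P.S. On the earlier point that moving the hardirq is a userspace policy: no
kernel change is needed for that, since both an IRQ's affinity and a task's
affinity can already be set from userspace. A minimal sketch is below; the IRQ
number (33) and the big CPUs (6-7) are made-up, platform-specific values, and
the IRQ write needs root.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin an IRQ to a CPU list by writing /proc/irq/<irq>/smp_affinity_list. */
static int pin_irq(int irq, const char *cpulist)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", cpulist);	/* e.g. "6-7" */
	fclose(f);
	return 0;
}

/* Pin the calling task to the (hypothetical) big cores 6 and 7. */
static int pin_task_to_big_cpus(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(6, &set);
	CPU_SET(7, &set);
	return sched_setaffinity(0, sizeof(set), &set);	/* 0 == this task */
}

int main(void)
{
	pin_irq(33, "6-7");		/* hypothetical storage controller IRQ */
	pin_task_to_big_cpus();
	return 0;
}

Whether anything should do this automatically is the policy question; the
mechanism itself is already there.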