Hi all,
Looks like iowait boost is completely broken upstream. Just
documenting my findings of iowait boost issues:
1. If a CPU requests iowait boost in a cluster, another CPU can go
ahead and reset it very quickly, since it thinks there is no new
request from the iowait-boosting CPU
2. If the iowait is longer than a tick, then successive iowait boost
doubling does not happen, so code doing heavy I/O waiting never gets a
boost.
3. update_load_avg() is triggered right after the iowait boost
request, which makes another cpufreq update request; this request is a
non-iowait-boost one, so it ends up resetting the iowait boost request
(in the same path!).
4. Same as #3, but due to update_blocked_averages() from the newidle balance path.
Here is a patch that tries to address these problems, and with it I
see better cpufreq boosting happening. However, it is just a broken
test patch meant to highlight the issues:
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/5.4/iowait-boost-debug-1&id=3627d896d499d168fef9a388e5d6b3359acc3423
I think we ought to rewrite the whole mess instead of fixing it, since
it feels like a lot has changed in the scheduler code over time. Beata
is working on rewriting the whole iowait boost infra; I am glad she
has started the work on this and am looking forward to helping with
the patches.
thanks,
- Joel
Hi Joel,
Thanks for sending this out.
On Mon, Jun 07, 2021 at 12:19:01PM -0400, Joel Fernandes wrote:
> Hi all,
> Looks like iowait boost is completely broken upstream. Just
> documenting my findings of iowait boost issues:
>
I wouldn't go as far as to state that it is completely broken.
Rather, the current sugov implementation for iowait boosting is not
meeting expectations, and I believe those expectations should be
clarified first. More on them below.
> 1. If a CPU requests iowait boost in a cluster, another CPU can go
> ahead and reset it very quickly, since it thinks there is no new
> request from the iowait-boosting CPU
So the 'boosting' value is tracked per CPU: each core in a cluster
will have its own variant of it. When calculating the shared frequency
for the cluster, sugov will use the max utilization reported on each
core, including the I/O boost. Now, if there is no pending request for
boosting on a given core at the time sugov_iowait_apply() is called,
the current 'boost' will be reduced, but only that one; it will not
affect the boost values on the remaining CPUs. It means that there was
no task waking up on that particular CPU after waiting on an I/O
request. So I would say it's fine. Unless I am misunderstanding your
case?
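(For reference, roughly how the shared frequency gets picked - a
simplified paraphrase of sugov_next_freq_shared(), not the verbatim
upstream code, and details vary between kernel versions:)

static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
{
        struct sugov_policy *sg_policy = sg_cpu->sg_policy;
        struct cpufreq_policy *policy = sg_policy->policy;
        unsigned long util = 0, max = 1;
        unsigned int j;

        for_each_cpu(j, policy->cpus) {
                struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);

                sugov_get_util(j_sg_cpu);
                /* Decays/resets this CPU's boost if no request is pending */
                sugov_iowait_apply(j_sg_cpu, time);

                /* The cluster frequency follows the max boosted utilization */
                if (j_sg_cpu->util * max > j_sg_cpu->max * util) {
                        util = j_sg_cpu->util;
                        max = j_sg_cpu->max;
                }
        }

        return get_next_freq(sg_policy, util, max);
}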
> 2. If the iowait is longer than a tick, then successive iowait boost
> doubling does not happen, so code doing heavy I/O waiting never gets
> a boost.
This might indeed be an issue. What's more: since boosting is applied
per core, any migration will disregard the boosting applied so far, so
the frequency might start ramping up on the 'new' CPU from scratch.
Might, because sugov (and the I/O boosting) has no notion of the
source of the boosting request, so the task might end up on a core
that has already lifted its I/O boost. This also means that having
different small tasks waking up from I/O within the mentioned
TICK_NSEC time window might drive the frequency to max, even though
those would be sporadic wakeups. Things get slightly cumbersome as
increasing the boost does not necessarily result in a frequency
change, so the above-mentioned case would need proper timing, but it
is possible. Also, the boost value will not get doubled unless the
previous one has been applied. This might result in misalignment
between task wake-ups/placement and sugov's frequency changes.
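(The doubling logic in question, paraphrased from sugov_iowait_boost()
- again a sketch, not the verbatim code:)

        /* Ensure the boost only doubles once per applied request */
        if (sg_cpu->iowait_boost_pending)
                return;
        sg_cpu->iowait_boost_pending = true;

        /* Double the boost at each iowait wakeup ... */
        if (sg_cpu->iowait_boost) {
                sg_cpu->iowait_boost =
                        min_t(unsigned int, sg_cpu->iowait_boost << 1,
                              SCHED_CAPACITY_SCALE);
                return;
        }

        /* ... starting from the minimum on the first wakeup after I/O */
        sg_cpu->iowait_boost = IOWAIT_BOOST_MIN;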
> 3. update_load_avg() is triggered right after the iowait boost
> request, which makes another cpufreq update request; this request is
> a non-iowait-boost one, so it ends up resetting the iowait boost
> request (in the same path!).
Not necessarily - this is guarded by the TICK_NSEC check you have
mentioned, in sugov_iowait_reset():

        ...
        if (delta_ns <= TICK_NSEC)
                return false;
        ...

So for that particular call sequence the boost should not get reset.
Another problem is when a non-I/O-bound task triggers the update after
TICK_NSEC has elapsed and before the frequency change is done
(see sugov_should_update_freq()).
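(For context, the rate-limit check there is roughly:)

static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
        s64 delta_ns;
        ...
        delta_ns = time - sg_policy->last_freq_update_time;

        return delta_ns >= sg_policy->freq_update_delay_ns;
}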
> 4. Same as #3, but due to update_blocked_averages() from the newidle balance path.
>
> Here is a patch that tries to address these problems, and with it I
> see better cpufreq boosting happening. However, it is just a broken
> test patch meant to highlight the issues:
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/5.4/iowait-boost-debug-1&id=3627d896d499d168fef9a388e5d6b3359acc3423
>
> I think we ought to rewrite the whole mess instead of fixing it,
> since it feels like a lot has changed in the scheduler code over
> time. Beata is working on rewriting the whole iowait boost infra; I
> am glad she has started the work on this and am looking forward to
> helping with the patches.
>
So, back to the expectations.
The main problem, as I see it, is: what do we actually want to achieve
with I/O boosting? Is it supposed to compensate for the time lost
while waiting for the I/O request to be completed, or is it supposed
to optimize the rate at which I/O requests are being made? Do we want
to boost I/O-bound tasks by default, with no limits applied, or should
we care about balancing performance vs power? Unless those
expectations are clearly stated, we might not get too far with any
changes, really.
Something I do agree with is what has been suggested a few times by
Quentin: make I/O boosting a per-task feature, not per-core, and that
is something I have been experimenting with. This would help with
maintaining the boost level upon migration and could potentially
improve task placement. It would also eliminate sugov's misfires:
currently the boost might be applied when the task on behalf of which
it was triggered has been migrated or is not runnable at that point.
Also, sugov's boosting is completely unaware of uclamp restrictions,
and I guess it should be aware; this could also be fixed with per-task
boosting.
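(To make that concrete, a purely hypothetical sketch of the direction
- none of these fields or hooks exist upstream, the names are
invented:)

struct task_struct {
        ...
        /* Invented: boost level carried with the task, surviving migration */
        unsigned int            iowait_boost;
        ...
};

/* Called on an iowait wakeup, instead of boosting the CPU */
static void iowait_boost_task(struct task_struct *p)
{
        if (!p->iowait_boost)
                p->iowait_boost = IOWAIT_BOOST_MIN;
        else
                p->iowait_boost = min_t(unsigned int, p->iowait_boost << 1,
                                        uclamp_eff_value(p, UCLAMP_MAX));
}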
There are a few tricky bits though:
- the current (sugov's) implementation makes sure the boost will not
get doubled unless the current level of boosting has been applied ->
there is no notion of that when moving the boosting to a per-task
feature (it is considered an optimization against too-frequent freq
changes)
- guarding the frequency ramp-up with TICK_NSEC & freq_update_delay_ns
is also out of the equation (same as above)
The above two mean we might be boosting faster than is currently
expected. On the other hand, we might lose some boosting, as currently
sugov does not care whether it is a single task or multiple tasks
waking up from I/O.
I've been trying to come up with some heuristics that would help
decide whether boosting makes sense or not (like: don't boost too much
if most of the time the task is blocked, or when, despite boosting,
the sleep time keeps getting longer). But it seems slightly
counterintuitive to place that sort of decision making in the
scheduler's generic code (without having all the info at hand), and I
would expect this to be handled by other, more suited mechanisms. On
the other hand, boosting upon each wake-up (from I/O) seems slightly
excessive. But again: it all comes down to the expectations on the
actually desired behaviour. We could potentially add another PELT-like
signal to track the time spent waiting on I/O requests alongside the
task utilization, but that means more messing with signals at the rq
level and losing some performance points due to the needed
calculations. The alternative of blindly increasing the boost for each
relevant wake-up means that small tasks could drive the frequency to
its max values - which I am not sure is desired for power-aware
scenarios.
So all in all - still in progress.
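(Again hypothetical, just to illustrate the idea - a PELT-style
average of iowait time next to the existing signals; the field is
invented:)

struct sched_avg {
        ...
        unsigned long           util_avg;
        /* Invented: geometric average of time spent waiting on I/O,
         * accumulated while p->in_iowait and decayed like util_avg */
        unsigned long           iowait_avg;
        ...
};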
Any comments are more than welcome.
---
BR
B.
> thanks,
> - Joel
+CC Dietmar, Rafael
On 06/07/21 20:10, Beata Michalska wrote:
> Hi Joel,
>
> Thanks for sending this out.
>
> On Mon, Jun 07, 2021 at 12:19:01PM -0400, Joel Fernandes wrote:
> > Hi all,
> > Looks like iowait boost is completely broken upstream. Just
> > documenting my findings of iowait boost issues:
> >
> I wouldn't go as far as to state that it is completely broken.
> Rather, the current sugov implementation for iowait boosting is not
> meeting expectations, and I believe those expectations should be
> clarified first. More on them below.
> > 1. If a CPU requests iowait boost in a cluster, another CPU can go
> > ahead and reset it very quickly, since it thinks there is no new
> > request from the iowait-boosting CPU
> So the 'boosting' value is tracked per CPU: each core in a cluster
> will have its own variant of it. When calculating the shared
> frequency for the cluster, sugov will use the max utilization
> reported on each core, including the I/O boost. Now, if there is no
> pending request for boosting on a given core at the time
> sugov_iowait_apply() is called, the current 'boost' will be reduced,
> but only that one; it will not affect the boost values on the
> remaining CPUs. It means that there was no task waking up on that
> particular CPU after waiting on an I/O request. So I would say it's
> fine. Unless I am misunderstanding your case?
> > 2. If the iowait is longer than a tick, then successive iowait
> > boost doubling does not happen, so code doing heavy I/O waiting
> > never gets a boost.
> This might indeed be an issue. What's more: since boosting is applied
> per core, any migration will disregard the boosting applied so far,
> so the frequency might start ramping up on the 'new' CPU from
> scratch. Might, because sugov (and the I/O boosting) has no notion of
> the source of the boosting request, so the task might end up on a
> core that has already lifted its I/O boost. This also means that
> having different small tasks waking up from I/O within the mentioned
> TICK_NSEC time window might drive the frequency to max, even though
> those would be sporadic wakeups. Things get slightly cumbersome as
> increasing the boost does not necessarily result in a frequency
> change, so the above-mentioned case would need proper timing, but it
> is possible. Also, the boost value will not get doubled unless the
> previous one has been applied. This might result in misalignment
> between task wake-ups/placement and sugov's frequency changes.
> > 3. update_load_avg() is triggered right after the iowait boost
> > request, which makes another cpufreq update request; this request
> > is a non-iowait-boost one, so it ends up resetting the iowait boost
> > request (in the same path!).
> Not necessarily - this is guarded by the TICK_NSEC check you have
> mentioned, in sugov_iowait_reset():
>
>         ...
>         if (delta_ns <= TICK_NSEC)
>                 return false;
>         ...
>
> So for that particular call sequence the boost should not get reset.
> Another problem is when a non-I/O-bound task triggers the update
> after TICK_NSEC has elapsed and before the frequency change is done
> (see sugov_should_update_freq()).
> > 4. Same as #3, but due to update_blocked_averages() from the newidle balance path.
> >
> > Here is a patch that tries to address these problems, and with it
> > I see better cpufreq boosting happening. However, it is just a
> > broken test patch meant to highlight the issues:
> > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/5.4/iowait-boost-debug-1&id=3627d896d499d168fef9a388e5d6b3359acc3423
> >
> > I think we ought to rewrite the whole mess instead of fixing it,
> > since it feels like a lot has changed in the scheduler code over
> > time. Beata is working on rewriting the whole iowait boost infra; I
> > am glad she has started the work on this and am looking forward to
> > helping with the patches.
> >
> So, back to the expectations.
> The main problem, as I see it, is: what do we actually want to
> achieve with I/O boosting? Is it supposed to compensate for the time
> lost while waiting for the I/O request to be completed, or is it
> supposed to optimize the rate at which I/O requests are being made?
> Do we want to boost I/O-bound tasks by default, with no limits
> applied, or should we care about balancing performance vs power?
> Unless those expectations are clearly stated, we might not get too
> far with any changes, really.
>
> Something I do agree with is what has been suggested a few times by
> Quentin: make I/O boosting a per-task feature, not per-core, and that
> is something I have been experimenting with. This would help with
> maintaining the boost level upon migration and could potentially
> improve task placement. It would also eliminate sugov's misfires:
> currently the boost might be applied when the task on behalf of which
> it was triggered has been migrated or is not runnable at that point.
> Also, sugov's boosting is completely unaware of uclamp restrictions,
> and I guess it should be aware; this could also be fixed with
> per-task boosting.
>
> There are a few tricky bits though:
> - the current (sugov's) implementation makes sure the boost will not
> get doubled unless the current level of boosting has been applied ->
> there is no notion of that when moving the boosting to a per-task
> feature (it is considered an optimization against too-frequent freq
> changes)
> - guarding the frequency ramp-up with TICK_NSEC &
> freq_update_delay_ns is also out of the equation (same as above)
>
> The above two mean we might be boosting faster than is currently
> expected. On the other hand, we might lose some boosting, as
> currently sugov does not care whether it is a single task or multiple
> tasks waking up from I/O.
>
> I've been trying to come up with some heuristics that would help
> decide whether boosting makes sense or not (like: don't boost too
> much if most of the time the task is blocked, or when, despite
> boosting, the sleep time keeps getting longer). But it seems slightly
> counterintuitive to place that sort of decision making in the
> scheduler's generic code (without having all the info at hand), and I
> would expect this to be handled by other, more suited mechanisms. On
> the other hand, boosting upon each wake-up (from I/O) seems slightly
> excessive. But again: it all comes down to the expectations on the
> actually desired behaviour. We could potentially add another
> PELT-like signal to track the time spent waiting on I/O requests
> alongside the task utilization, but that means more messing with
> signals at the rq level and losing some performance points due to the
> needed calculations. The alternative of blindly increasing the boost
> for each relevant wake-up means that small tasks could drive the
> frequency to its max values - which I am not sure is desired for
> power-aware scenarios.
> So all in all - still in progress.
>
> Any comments are more than welcome.
>
> ---
> BR
> B.
>
>
> > thanks,
> > - Joel
On Mon, Jun 07, 2021 at 08:10:32PM +0100, Beata Michalska wrote:
> So, back to the expectations.
> The main problem, as I see it, is: what do we actually want to
> achieve with I/O boosting? Is it supposed to compensate for the time
> lost while waiting for the I/O request to be completed, or is it
> supposed to optimize the rate at which I/O requests are being made?
The latter; you want to increase the rate of submission.
> Do we want to boost I/O-bound tasks by default, with no limits
> applied, or should we care about balancing performance vs power?
> Unless those expectations are clearly stated, we might not get too
> far with any changes, really.
You want to not increase power beyond what is needed to match the rate
of processing I suppose.
Hi Beata,
On Mon, Jun 07, 2021 at 08:10:32PM +0100, Beata Michalska wrote:
> Hi Joel,
>
> Thanks for sending this out.
Np, thanks for replying.
> On Mon, Jun 07, 2021 at 12:19:01PM -0400, Joel Fernandes wrote:
> > Hi all,
> > Looks like iowait boost is completely broken upstream. Just
> > documenting my findings of iowait boost issues:
> >
> I wouldn't go as far as to state that it is completely broken.
> Rather, the current sugov implementation for iowait boosting is not
> meeting expectations, and I believe those expectations should be
> clarified first. More on them below.
> > 1. If a CPU requests iowait boost in a cluster, another CPU can go
> > ahead and reset it very quickly, since it thinks there is no new
> > request from the iowait-boosting CPU
> So the 'boosting' value is tracked per CPU: each core in a cluster
> will have its own variant of it. When calculating the shared
> frequency for the cluster, sugov will use the max utilization
> reported on each core, including the I/O boost. Now, if there is no
> pending request for boosting on a given core at the time
> sugov_iowait_apply() is called, the current 'boost' will be reduced,
> but only that one; it will not affect the boost values on the
> remaining CPUs. It means that there was no task waking up on that
> particular CPU after waiting on an I/O request. So I would say it's
> fine. Unless I am misunderstanding your case?
Yes, but consider the case where the I/O is slow on one CPU (call it
X), so say the iowait takes 2 milliseconds. Now another CPU (call it
Y) is continuously making cpufreq requests much faster than that. Also
consider that the slow CPU X is doing back-to-back I/O requests and
has consecutive I/O sleep time (no other sleep, just I/O sleep). What
you'll see is that CPU X's boost always stays at _MIN when it wakes
up, because Y reset it to 0 in the meanwhile. So the boost never
accumulates. Does that make sense? I would say that the I/O CPU should
get a 'doubling' of its boost. The issue can probably be solved by
making rate_limit_us longer than the iowait time, but that seems like
a hack and would likely cause other issues.
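(For reference, the decay that Y's updates keep triggering is roughly
this, paraphrased from sugov_iowait_apply() - a sketch, not verbatim:)

        /* No new iowait wakeup on this CPU since the last update:
         * halve its boost, dropping it entirely below the minimum */
        if (!sg_cpu->iowait_boost_pending) {
                sg_cpu->iowait_boost >>= 1;
                if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
                        sg_cpu->iowait_boost = 0;
                        return;
                }
        }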
> > 2. If the iowait is longer than a tick, then successive iowait
> > boost doubling does not happen, so code doing heavy I/O waiting
> > never gets a boost.
> This might indeed be an issue. What's more: since boosting is applied
> per core, any migration will disregard the boosting applied so far,
> so the frequency might start ramping up on the 'new' CPU from
> scratch. Might, because sugov (and the I/O boosting) has no notion of
> the source of the boosting request, so the task might end up on a
> core that has already lifted its I/O boost. This also means that
> having different small tasks waking up from I/O within the mentioned
> TICK_NSEC time window might drive the frequency to max, even though
> those would be sporadic wakeups. Things get slightly cumbersome as
> increasing the boost does not necessarily result in a frequency
> change, so the above-mentioned case would need proper timing, but it
> is possible. Also, the boost value will not get doubled unless the
> previous one has been applied. This might result in misalignment
> between task wake-ups/placement and sugov's frequency changes.
Agreed, there are many issues, thanks for highlighting.
> > 3. update_load_avg() is triggered right after the iowait boost
> > request, which makes another cpufreq update request; this request
> > is a non-iowait-boost one, so it ends up resetting the iowait boost
> > request (in the same path!).
> Not necessarily - this is guarded by the TICK_NSEC check you have
> mentioned, in sugov_iowait_reset():
>
>         ...
>         if (delta_ns <= TICK_NSEC)
>                 return false;
>         ...
>
> So for that particular call sequence the boost should not get reset.
> Another problem is when a non-I/O-bound task triggers the update
> after TICK_NSEC has elapsed and before the frequency change is done
> (see sugov_should_update_freq()).
Oh yeah, you are quite right on this particular issue. I agree it
shouldn't reset. So I wonder what I was seeing for #3 then...
Hopefully delta_ns can be trusted.
> > 4. Same as #3, but due to update_blocked_averages() from the newidle balance path.
> >
> > Here is a patch that tries to address these problems, and with it
> > I see better cpufreq boosting happening. However, it is just a
> > broken test patch meant to highlight the issues:
> > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/5.4/iowait-boost-debug-1&id=3627d896d499d168fef9a388e5d6b3359acc3423
> >
> > I think we ought to rewrite the whole mess instead of fixing it,
> > since it feels like a lot has changed in the scheduler code over
> > time. Beata is working on rewriting the whole iowait boost infra; I
> > am glad she has started the work on this and am looking forward to
> > helping with the patches.
> >
> So, back to the expectations.
> The main problem, as I see it, is: what do we actually want to
> achieve with I/O boosting? Is it supposed to compensate for the time
> lost while waiting for the I/O request to be completed, or is it
> supposed to optimize the rate at which I/O requests are being made?
> Do we want to boost I/O-bound tasks by default, with no limits
> applied, or should we care about balancing performance vs power?
> Unless those expectations are clearly stated, we might not get too
> far with any changes, really.
Yeah, part of the problem is that it's not clear how much boost is
'ideal'. However, not having any boost at all, or a boost too low for
heavy I/O, seems a clear issue. I guess some boost will be better than
no boost. But I agree it's hard to design an algorithm that fits
everything.
> Something I do agree with is what has been suggested a few times by
> Quentin: make I/O boosting a per-task feature, not per-core, and that
> is something I have been experimenting with.
Yeah I agree with that too.
Thanks,
Joel
On Tuesday 08 Jun 2021 at 19:46:54 (+0200), Peter Zijlstra wrote:
> On Mon, Jun 07, 2021 at 08:10:32PM +0100, Beata Michalska wrote:
> > So, back to the expectations.
> > The main problem, as I see it, is: what do we actually want to
> > achieve with I/O boosting? Is it supposed to compensate for the
> > time lost while waiting for the I/O request to be completed, or is
> > it supposed to optimize the rate at which I/O requests are being
> > made?
>
> The latter; you want to increase the rate of submission.
>
> > Do we want to boost I/O-bound tasks by default, with no limits
> > applied, or should we care about balancing performance vs power?
> > Unless those expectations are clearly stated, we might not get too
> > far with any changes, really.
>
> You want to not increase power beyond what is needed to match the
> rate of processing I suppose.
Note that in some cases we also don't care about throughput, and would
prefer to keep the frequency low for some unimportant I/O-bound tasks
(e.g. background logging daemons and such). Uclamp.max indicates this
to some extent.
Android for instance has a pretty good idea of the tasks it cares about,
and it'd be good to have a way to enable IO wait boosting only for
those.
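(Something along those lines is already doable from user space - a
sketch only, assuming CONFIG_UCLAMP_TASK and >= v5.3 uapi headers that
provide struct sched_attr and SCHED_FLAG_UTIL_CLAMP_MAX; with older
headers the struct may need to be defined by hand. There is no glibc
wrapper for sched_setattr(), hence the raw syscall:)

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>
#include <linux/sched/types.h>

/* Cap a background task's utilization so sugov (iowait boost included)
 * won't raise the frequency on its behalf. */
static int clamp_util_max(pid_t pid, uint32_t util_max)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                /* Keep the existing policy/params, only apply the clamp */
                .sched_flags    = SCHED_FLAG_KEEP_ALL |
                                  SCHED_FLAG_UTIL_CLAMP_MAX,
                .sched_util_max = util_max,     /* 0..1024 */
        };

        return syscall(SYS_sched_setattr, pid, &attr, 0);
}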
Thanks,
Quentin
On Tue, Jun 08, 2021 at 03:09:37PM -0400, Joel Fernandes wrote:
> Hi Beata,
>
> On Mon, Jun 07, 2021 at 08:10:32PM +0100, Beata Michalska wrote:
> > Hi Joel,
> >
> > Thanks for sending this out.
>
> Np, thanks for replying.
>
> > On Mon, Jun 07, 2021 at 12:19:01PM -0400, Joel Fernandes wrote:
> > > Hi all,
> > > Looks like iowait boost is completely broken upstream. Just
> > > documenting my findings of iowait boost issues:
> > >
> > I wouldn't go as far as to state that it is completely broken.
> > Rather, the current sugov implementation for iowait boosting is not
> > meeting expectations, and I believe those expectations should be
> > clarified first. More on them below.
> > > 1. If a CPU requests iowait boost in a cluster, another CPU can
> > > go ahead and reset it very quickly, since it thinks there is no
> > > new request from the iowait-boosting CPU
> > So the 'boosting' value is tracked per CPU: each core in a cluster
> > will have its own variant of it. When calculating the shared
> > frequency for the cluster, sugov will use the max utilization
> > reported on each core, including the I/O boost. Now, if there is no
> > pending request for boosting on a given core at the time
> > sugov_iowait_apply() is called, the current 'boost' will be
> > reduced, but only that one; it will not affect the boost values on
> > the remaining CPUs. It means that there was no task waking up on
> > that particular CPU after waiting on an I/O request. So I would say
> > it's fine. Unless I am misunderstanding your case?
>
> Yes, but consider the case where the I/O is slow on one CPU (call it
> X), so say the iowait takes 2 milliseconds. Now another CPU (call it
> Y) is continuously making cpufreq requests much faster than that.
> Also consider that the slow CPU X is doing back-to-back I/O requests
> and has consecutive I/O sleep time (no other sleep, just I/O sleep).
> What you'll see is that CPU X's boost always stays at _MIN when it
> wakes up, because Y reset it to 0 in the meanwhile. So the boost
> never accumulates. Does that make sense? I would say that the I/O CPU
> should get a 'doubling' of its boost. The issue can probably be
> solved by making rate_limit_us longer than the iowait time, but that
> seems like a hack and would likely cause other issues.
>
OK, I think I see your point now.
So that is another issue to be added to the list.
I'm not sure, though, that twiddling with rate_limit_us would do any
good. It can already be tweaked from sysfs, but it touches on the
frequency transition delays, so I wouldn't mess around with it just to
tune I/O boosting.
I'd still rather move the boosting outside of sugov - as much as
possible, at least.
---
BR
B.
> > > 2. If the iowait is longer than a tick, then successive iowait
> > > boost doubling does not happen, so code doing heavy I/O waiting
> > > never gets a boost.
> > This might indeed be an issue. What's more: since boosting is
> > applied per core, any migration will disregard the boosting applied
> > so far, so the frequency might start ramping up on the 'new' CPU
> > from scratch. Might, because sugov (and the I/O boosting) has no
> > notion of the source of the boosting request, so the task might end
> > up on a core that has already lifted its I/O boost. This also means
> > that having different small tasks waking up from I/O within the
> > mentioned TICK_NSEC time window might drive the frequency to max,
> > even though those would be sporadic wakeups. Things get slightly
> > cumbersome as increasing the boost does not necessarily result in a
> > frequency change, so the above-mentioned case would need proper
> > timing, but it is possible. Also, the boost value will not get
> > doubled unless the previous one has been applied. This might result
> > in misalignment between task wake-ups/placement and sugov's
> > frequency changes.
>
> Agreed, there are many issues, thanks for highlighting.
>
> > > 3. update_load_avg() is triggered right after the iowait boost
> > > request, which makes another cpufreq update request; this request
> > > is a non-iowait-boost one, so it ends up resetting the iowait
> > > boost request (in the same path!).
> > Not necessarily - this is guarded by the TICK_NSEC check you have
> > mentioned, in sugov_iowait_reset():
> >
> >         ...
> >         if (delta_ns <= TICK_NSEC)
> >                 return false;
> >         ...
> >
> > So for that particular call sequence the boost should not get
> > reset. Another problem is when a non-I/O-bound task triggers the
> > update after TICK_NSEC has elapsed and before the frequency change
> > is done (see sugov_should_update_freq()).
>
> Oh yeah, you are quite right on this particular issue. I agree it
> shouldn't reset. So I wonder what I was seeing for #3 then...
> Hopefully delta_ns can be trusted.
>
> > > 4. Same as #3, but due to update_blocked_averages() from the newidle balance path.
> > >
> > > Here is a patch that tries to address these problems, and with it
> > > I see better cpufreq boosting happening. However, it is just a
> > > broken test patch meant to highlight the issues:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/5.4/iowait-boost-debug-1&id=3627d896d499d168fef9a388e5d6b3359acc3423
> > >
> > > I think we ought to rewrite the whole mess instead of fixing it,
> > > since it feels like a lot has changed in the scheduler code over
> > > time. Beata is working on rewriting the whole iowait boost infra;
> > > I am glad she has started the work on this and am looking forward
> > > to helping with the patches.
> > >
> > So, back to the expectations.
> > The main problem, as I see it, is: what do we actually want to
> > achieve with I/O boosting? Is it supposed to compensate for the
> > time lost while waiting for the I/O request to be completed, or is
> > it supposed to optimize the rate at which I/O requests are being
> > made? Do we want to boost I/O-bound tasks by default, with no
> > limits applied, or should we care about balancing performance vs
> > power? Unless those expectations are clearly stated, we might not
> > get too far with any changes, really.
>
> Yeah, part of the problem is that it's not clear how much boost is
> 'ideal'. However, not having any boost at all, or a boost too low for
> heavy I/O, seems a clear issue. I guess some boost will be better
> than no boost. But I agree it's hard to design an algorithm that fits
> everything.
>
> > Something I do agree with is what has been suggested a few times by
> > Quentin: make I/O boosting a per-task feature, not per-core, and
> > that is something I have been experimenting with.
>
> Yeah I agree with that too.
>
> Thanks,
> Joel
>
On Tue, Jun 08, 2021 at 07:46:54PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 07, 2021 at 08:10:32PM +0100, Beata Michalska wrote:
> > So, back to the expectations.
> > The main problem, as I see it, is: what do we actually want to
> > achieve with I/O boosting? Is it supposed to compensate for the
> > time lost while waiting for the I/O request to be completed, or is
> > it supposed to optimize the rate at which I/O requests are being
> > made?
>
> The latter; you want to increase the rate of submission.
>
> > Do we want to boost I/O-bound tasks by default, with no limits
> > applied, or should we care about balancing performance vs power?
> > Unless those expectations are clearly stated, we might not get too
> > far with any changes, really.
>
> You want to not increase power beyond what is needed to match the
> rate of processing I suppose.
This is what I took as a baseline for my playground.
This though means we will be operating on some assumptions (unless we
go for some major rework) and that boosting may not reach the highest
level in some cases. For those, I guess we will have to use another
way to deal with performance.
Thanks for your comments.
---
BR
B.
On 06/09/21 09:50, Quentin Perret wrote:
> On Tuesday 08 Jun 2021 at 19:46:54 (+0200), Peter Zijlstra wrote:
> > On Mon, Jun 07, 2021 at 08:10:32PM +0100, Beata Michalska wrote:
> > > So, back to the expectations.
> > > The main problem, as I see it, is: what do we actually want to
> > > achieve with I/O boosting? Is it supposed to compensate for the
> > > time lost while waiting for the I/O request to be completed, or
> > > is it supposed to optimize the rate at which I/O requests are
> > > being made?
> >
> > The latter; you want to increase the rate of submission.
> >
> > > Do we want to boost I/O-bound tasks by default, with no limits
> > > applied, or should we care about balancing performance vs power?
> > > Unless those expectations are clearly stated, we might not get
> > > too far with any changes, really.
> >
> > You want to not increase power beyond what is needed to match the
> > rate of processing I suppose.
>
> Note that in some cases we also don't care about throughput, and
> would prefer to keep the frequency low for some unimportant I/O-bound
> tasks (e.g. background logging daemons and such). Uclamp.max
> indicates this to some extent.
In theory, one can have a user-space daemon that monitors I/O (via
BPF?) and auto-boosts via uclamp. You can have allow/disallow lists
per app too, to set up the limits.
So I say rm -rf iowait_boost and let's make it a user-space problem :)
/me runs
--
Qais Yousef
On Thu, Jun 10, 2021 at 9:30 AM Beata Michalska <[email protected]> wrote:
>
> On Tue, Jun 08, 2021 at 03:09:37PM -0400, Joel Fernandes wrote:
> > Hi Beata,
> >
> > On Mon, Jun 07, 2021 at 08:10:32PM +0100, Beata Michalska wrote:
> > > Hi Joel,
> > >
> > > Thanks for sending this out.
> >
> > Np, thanks for replying.
> >
> > > On Mon, Jun 07, 2021 at 12:19:01PM -0400, Joel Fernandes wrote:
> > > > Hi all,
> > > > Looks like iowait boost is completely broken upstream. Just
> > > > documenting my findings of iowait boost issues:
> > > >
> > > I wouldn't go as far as to state that it is completely broken.
> > > Rather, the current sugov implementation for iowait boosting is
> > > not meeting expectations, and I believe those expectations should
> > > be clarified first. More on them below.
> > > > 1. If a CPU requests iowait boost in a cluster, another CPU can
> > > > go ahead and reset it very quickly, since it thinks there is no
> > > > new request from the iowait-boosting CPU
> > > So the 'boosting' value is tracked per CPU: each core in a
> > > cluster will have its own variant of it. When calculating the
> > > shared frequency for the cluster, sugov will use the max
> > > utilization reported on each core, including the I/O boost. Now,
> > > if there is no pending request for boosting on a given core at
> > > the time sugov_iowait_apply() is called, the current 'boost' will
> > > be reduced, but only that one; it will not affect the boost
> > > values on the remaining CPUs. It means that there was no task
> > > waking up on that particular CPU after waiting on an I/O request.
> > > So I would say it's fine. Unless I am misunderstanding your case?
> >
> > Yes, but consider the case where the I/O is slow on one CPU (call
> > it X), so say the iowait takes 2 milliseconds. Now another CPU
> > (call it Y) is continuously making cpufreq requests much faster
> > than that. Also consider that the slow CPU X is doing back-to-back
> > I/O requests and has consecutive I/O sleep time (no other sleep,
> > just I/O sleep). What you'll see is that CPU X's boost always stays
> > at _MIN when it wakes up, because Y reset it to 0 in the meanwhile.
> > So the boost never accumulates. Does that make sense? I would say
> > that the I/O CPU should get a 'doubling' of its boost. The issue
> > can probably be solved by making rate_limit_us longer than the
> > iowait time, but that seems like a hack and would likely cause
> > other issues.
> >
> OK, I think I see your point now.
> So that is another issue to be added to the list.
> I'm not sure, though, that twiddling with rate_limit_us would do any
> good. It can already be tweaked from sysfs, but it touches on the
> frequency transition delays, so I wouldn't mess around with it just
> to tune I/O boosting.
> I'd still rather move the boosting outside of sugov - as much as
> possible, at least.
How about something like this? It is at least a partial respite for
that issue. A concurrent cpufreq request has to wait until at least
TICK_NSEC before decaying a neighbor's boost, and the boost reset
takes place only after at least two ticks. Since we already start at a
low boost of _MIN, I think being less aggressive there should be OK.
Completely untested and purely a vacation-patch:
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4f09afd..72aaff4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -27,6 +27,7 @@ struct sugov_policy {
         struct list_head        tunables_hook;
 
         raw_spinlock_t          update_lock;
+        u64                     last_update;
         u64                     last_freq_update_time;
         s64                     freq_update_delay_ns;
         unsigned int            next_freq;
@@ -188,7 +189,7 @@ static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
         s64 delta_ns = time - sg_cpu->last_update;
 
         /* Reset boost only if a tick has elapsed since last request */
-        if (delta_ns <= TICK_NSEC)
+        if (delta_ns <= 2 * TICK_NSEC)
                 return false;
 
         sg_cpu->iowait_boost = set_iowait_boost ? IOWAIT_BOOST_MIN : 0;
@@ -215,6 +216,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
                                unsigned int flags)
 {
         bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+        struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 
         /* Reset boost if the CPU appears to have been idle enough */
         if (sg_cpu->iowait_boost &&
@@ -260,6 +262,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
  */
 static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time)
 {
+        struct sugov_policy *sg_policy = sg_cpu->sg_policy;
         unsigned long boost;
 
         /* No boost currently required */
@@ -270,7 +273,8 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time)
         if (sugov_iowait_reset(sg_cpu, time, false))
                 return;
 
-        if (!sg_cpu->iowait_boost_pending) {
+        if (!sg_cpu->iowait_boost_pending &&
+            time - sg_policy->last_update > TICK_NSEC) {
                 /*
                  * No boost pending; reduce the boost value.
                  */
@@ -440,6 +444,7 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
 
         sugov_iowait_boost(sg_cpu, time, flags);
         sg_cpu->last_update = time;
+        sg_policy->last_update = time;
 
         ignore_dl_rate_limit(sg_cpu);
--
2.27.0