On Sun, 10 Sept 2023 at 19:46, Qais Yousef <[email protected]> wrote:
>
> On 09/08/23 16:30, Vincent Guittot wrote:
>
> > > Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt
> > > will be restricted by their uclamp settings _after_ the headroom is applied.
> >
> > But then you mixed some utilization level (irq pressure), which is
> > time related, with some performance/bandwidth limitation, which is freq
> > related. And I think that's part of the reason why we can't agree on
> > where to put the headroom and how to take uclamp into account
>
> I did not change how this part works. We can drop patch 4 completely if this is
> what is causing the confusion.
>
> What I changed is the order of applying uclamp_rq_util_with() and
> apply_dvfs_headroom(). The rest is still the same as-is in the code today.
>
> >
> > >
> > > > above 512 whatever the current (720) formula or your proposal (608).
> > > > In the case of uclamp, it should be applied after having been scaled
> > > > by irq time.
> > >
> > > I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.
> >
> > My bad, I finally decided to use an irq pressure of 128 in my
> > calculation but forgot to change the value in my email
> >
> > >
> > > So the way I'm proposing it here
> > >
> > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> > > util = uclamp_rq_util_with(rq, util, NULL) = 512
> > > util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384
> > > util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
> > > util += dvfs_headroom(dl_bw) = 704
> > >
> > > >
> > > > So we should have reported utilization of 720 with a bandwidth
> > > > requirement of 512 and then cpufreq can applies its headroom if needed
> > >
> > > The key part I'm changing is this
> > >
> > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> > > util = uclamp_rq_util_with(rq, util, NULL) = 512
> > >
> > > Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
> > > running)
> > >
> > > util = cfs = 800
> > > util = uclamp_rq_util_with(rq, util, NULL) = 512
> > > util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640
> > >
> > > So we are running higher than we are allowed to. So applying the headroom
> > > after taking uclamp constraints into account is the problem.
> >
> > But I think it's not correct because you mixed some utilization level
> > with some performance requirement. IIRC, you said that
> > uclamp_min/max must not be seen as a utilization level but as a
> > performance hint. In that case, you can't add it to some other
> > utilization because that defeats its original purpose
>
> But this is what we do today already. I didn't change this part. See below.
And it seems that what is done today doesn't work correctly for you.
Your proposal to include the cpufreq headroom into the scheduler is not
correct IMO and it only applies to some cases. Then, the cpufreq
driver can have some really good reasons to apply some headroom even
with a uclamp value, but it can't make any decision.
I think that we should use another way to fix your problem with
uclamp than reordering how the headroom is applied by cpufreq. Mixing
utilization and performance in one signal hides some differences that
cpufreq can make good use of.
As an example:
cfs util = 650
cfs uclamp = 800
irq = 128
cfs with headroom 650*1.25=812 is clamped to 800
Final utilization will be : 800(1024-128)/1024+128*1.25=860 which is
above the target of 800.
When we look at the detail, we have:
cfs util once scaled to the full range is only 650(1024-128)/1024= 568
After applying irq (even with some headroom) 568+128*1.25 = 728 which
is below the uclamp of 800 so shouldn't we stay at 800 in this case ?
>
> >
> > >
> > > IIUC, if your concern is that the way the scheduler and schedutil interact
> > > loses some abstraction, then yeah, this makes sense and I can refactor the
> > > code to better abstract the two and keep them decoupled.
> > >
> > > But if you think the order the headroom is applied should be the way it is,
> > > then this doesn't fix the problem.
> > >
> > > A good use case to consider is a single task running at 1024. If I want to
> > > limit it to 80% using uclamp_max, how can this be enforced? With current code,
> > > the task will continue to run at 1024; which is the problem.
> >
> > Single case is the easy part
> >
> > >
> > > Running at 640 instead of 512 is still a problem too as this could be 1 or
> > > 2 OPP higher than what we want to allow; and there could be significant power
> > > difference between those OPPs.
> >
> > That's my point.
> >
> > Your proposal tries to fix the simple case of 1 task with a uclamp_max,
> > but if we want to move in this direction, we should fix all cases.
>
> I think I am addressing all cases. I don't think I understand the problem
> you're trying to highlight. Is RFC patch 4 what is causing the
> confusion? That one is not related to fixing uclamp_max; it's just me
> questioning the status quo of which pressure signal requires the headroom.
No, patch 4 doesn't cause me confusion.
>
> The main change being done here actually is to apply_dvfs_headroom() *before*
> doing uclamp_rq_util_with(). I am not sure how you see this mixing.
Because dvfs_headroom is a cpufreq hint and you want to apply it
somewhere else.
>
> Current code performs apply_dvfs_headroom() *after*, which is what causes the CPU
> to run at a performance level higher than rq->uclamp[UCLAMP_MAX].
>
> It doesn't matter how many tasks are on the rq; if rq->uclamp[UCLAMP_MAX] is set to
> 800, then the CPU should not vote for max (assuming all other pressures are 0).
You can't remove the irq pressure from the picture. If
rq->uclamp[UCLAMP_MAX] set to 800 means that the cpu must not go above
800, then it should also apply after taking the other inputs into account. At
least up to some level, as described in my example above
>
> Handling of irq pressure etc is not changed.
>
> > I probably need to put my idea into real code to see what it means, but
> > we should provide 2 values to cpufreq: a utilization level and a
> > performance hint, which would max-aggregate the various performance
> > hints that we have
>
> Hmm. This reads to me like a code structure change. From my perspective, the correct
> behavior is to apply the headroom before restricting the performance level.
>
> It seems you're seeing a better way to aggregate all the pressures when taking
> uclamp into account. I am starting to see your point, if this is what you're
> saying. Though I haven't changed the existing relationship. I did try to
> question some things in patch 4; my thoughts were specific to headroom, but it
> seems you have a more generalized form.
>
> I do agree that in general the scheduler and schedutil are getting tightly
> coupled code-wise, and it would be good to provide generic cpufreq interfaces
> to potentially allow other governors to hook in and benefit.
>
> Whether this is something desired or not, I don't know, as I remember Peter and
> Rafael wanted schedutil to grow to become the de facto governor. It seems the
> right thing anyway.
>
>
> Thanks!
>
> --
> Qais Yousef
On 09/12/23 18:03, Vincent Guittot wrote:
> And it seems that what is done today doesn't work correctly for you.
> Your proposal to include the cpufreq headroom into the scheduler is not
> correct IMO and it only applies to some cases. Then, the cpufreq
> driver can have some really good reasons to apply some headroom even
> with a uclamp value, but it can't make any decision.
>
> I think that we should use another way to fix your problem with
> uclamp than reordering how the headroom is applied by cpufreq. Mixing
> utilization and performance in one signal hides some differences that
> cpufreq can make good use of.
>
> As an example:
>
> cfs util = 650
> cfs uclamp = 800
> irq = 128
>
> cfs with headroom 650*1.25=812 is clamped to 800
>
> Final utilization will be : 800(1024-128)/1024+128*1.25=860 which is
> above the target of 800.
>
> When we look at the detail, we have:
>
> cfs util once scaled to the full range is only 650(1024-128)/1024= 568
>
> After applying irq (even with some headroom) 568+128*1.25 = 728 which
> is below the uclamp of 800 so shouldn't we stay at 800 in this case ?
Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
the 812 to 800; rounding errors almost account for the 10-point
difference between 870 and 860.
I might have gotten the math wrong. But what I saw is that we have
util = (X + Y + Z) * A
and what I did
util = AX + AY + AZ
so maybe I messed something up, but I just applied the multiplication with the
headroom to each element individually rather than after the sum.
So yeah, if I messed that part up, then that wasn't intentional and should be
done differently. But I still can't see it.
> >
> > The main change being done here actually is to apply_dvfs_headroom() *before*
> > doing uclamp_rq_util_with(). I am not sure how you see this mixing.
>
> Because dvfs_headroom is a cpufreq hint and you want to apply it
> somewhere else.
I am still not sure whether you mean we are mixing up the code and need better
abstraction, or something else.
Besides the abstraction problem, which I agree with, I can't see what I am
mixing up yet :( Sorry, I think I need more of a helping hand to see it.
> > Current code performs apply_dvfs_headroom() *after*, which is what causes the CPU
> > to run at a performance level higher than rq->uclamp[UCLAMP_MAX].
> >
> > It doesn't matter how many tasks are on the rq; if rq->uclamp[UCLAMP_MAX] is set to
> > 800, then the CPU should not vote for max (assuming all other pressures are 0).
>
> You can't remove the irq pressure from the picture. If
> rq->uclamp[UCLAMP_MAX] set to 800 means that the cpu must not go above
> 800, then it should also apply after taking the other inputs into account. At
> least up to some level, as described in my example above
I was trying to simplify, to understand what you mean, as I still don't think
I see the problem you're highlighting.
Thanks!
--
Qais Yousef
On Sat, 16 Sept 2023 at 21:25, Qais Yousef <[email protected]> wrote:
>
> On 09/12/23 18:03, Vincent Guittot wrote:
>
> > And it seems that what is done today doesn't work correctly for you.
> > Your proposal to include the cpufreq headroom into the scheduler is not
> > correct IMO and it only applies to some cases. Then, the cpufreq
> > driver can have some really good reasons to apply some headroom even
> > with a uclamp value, but it can't make any decision.
> >
> > I think that we should use another way to fix your problem with
> > uclamp than reordering how the headroom is applied by cpufreq. Mixing
> > utilization and performance in one signal hides some differences that
> > cpufreq can make good use of.
> >
> > As an example:
> >
> > cfs util = 650
> > cfs uclamp = 800
> > irq = 128
> >
> > cfs with headroom 650*1.25=812 is clamped to 800
> >
> > Final utilization will be : 800(1024-128)/1024+128*1.25=860 which is
> > above the target of 800.
> >
> > When we look at the detail, we have:
> >
> > cfs util once scaled to the full range is only 650(1024-128)/1024= 568
> >
> > After applying irq (even with some headroom) 568+128*1.25 = 728 which
> > is below the uclamp of 800 so shouldn't we stay at 800 in this case ?
>
> Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
> the 812 to 800; rounding errors almost account for the 10-point
> difference between 870 and 860.
no, I voluntarily used 568 + 128*1.25. I added the dvfs headroom for irq
just to ensure that you would not object that I removed the headroom for
irq, and to focus on the use case, but it might have created more
confusion.
My example above demonstrates that only taking care of cases with null
irq pressure is not enough and you can still end up above 800
IIUC your point with uclamp_max: it is a performance limit that you
don't want to cross because of CFS. This means that we should not go
above 800 in my example because of cfs utilization: irq needs
128 and CFS asks for 568, so the system needs 696, which is below the 800
uclamp. Even if you add the dvfs headroom on irq, the system is still
below 800. Only when you add the dvfs headroom to cfs do you go above
800, but it's not needed because uclamp says that you should not go
above 800 because of CFS, so we should stay at 800, whereas both the current
formula and your new formula return a value above 800
>
> I might have gotten the math wrong. But what I saw is that we have
>
> util = (X + Y + Z) * A
>
> and what I did
>
> util = AX + AY + AZ
>
> so maybe I messed something up, but I just applied the multiplication with the
> headroom to each element individually rather than after the sum.
>
> So yeah, if I messed that part up, then that wasn't intentional and should be
> done differently. But I still can't see it.
>
> > >
> > > The main change being done here actually is to apply_dvfs_headroom() *before*
> > > doing uclamp_rq_util_with(). I am not sure how you see this mixing.
> >
> > Because dvfs_headroom is a cpufreq hint and you want to apply it
> > somewhere else.
>
> I am still not sure if you mean we are mixing up the code and we need better
> abstraction or something else.
>
> Beside the abstraction problem, which I agree with, I can't see what I am
> mixing up yet :( Sorry I think I need more helping hand to see it.
There is a mix between actual utilization and performance limit, and
when we add both we lose important information, as highlighted by
my example. If the current formula is not correct because we can go
above the uclamp_max value, your proposal is not better. And the root
cause mainly comes from adding utilization to a performance limit
(i.e. uclamp)
That's why I said that we need a new interface to enable cpufreq to
not blindly apply its headroom but to make a smarter decision at the
cpufreq level
>
> > > Current code performs apply_dvfs_headroom() *after*, which is what causes the CPU
> > > to run at a performance level higher than rq->uclamp[UCLAMP_MAX].
> > >
> > > It doesn't matter how many tasks are on the rq; if rq->uclamp[UCLAMP_MAX] is set to
> > > 800, then the CPU should not vote for max (assuming all other pressures are 0).
> >
> > You can't remove the irq pressure from the picture. If
> > rq->uclamp[UCLAMP_MAX] set to 800 means that the cpu must not go above
> > 800, then it should also apply after taking the other inputs into account. At
> > least up to some level, as described in my example above
>
> I was trying to simplify to understand what you mean as I don't think I see the
> problem you're highlighting still.
>
>
> Thanks!
>
> --
> Qais Yousef
On 09/24/23 09:58, Vincent Guittot wrote:
> > Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
> > the 812 to 800; rounding errors almost account for the 10-point
> > difference between 870 and 860.
>
> no, I voluntarily used 568 + 128*1.25. I added the dvfs headroom for irq
> just to ensure that you would not object that I removed the headroom for
> irq, and to focus on the use case, but it might have created more
> confusion.
>
> My example above demonstrates that only taking care of cases with null
> irq pressure is not enough and you can still end up above 800
>
> IIUC your point with uclamp_max: it is a performance limit that you
> don't want to cross because of CFS. This means that we should not go
> above 800 in my example because of cfs utilization: irq needs
> 128 and CFS asks for 568, so the system needs 696, which is below the 800
> uclamp. Even if you add the dvfs headroom on irq, the system is still
> below 800. Only when you add the dvfs headroom to cfs do you go above
> 800, but it's not needed because uclamp says that you should not go
Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
capped, even if there's headroom; the question you have is about the way it is
being applied. As long as we agree on this part, which is the fundamental behavior
question that I thought was the pain point, the implementation details are
certainly something that I can improve on.
> above 800 because of CFS, so we should stay at 800, whereas both the current
> formula and your new formula return a value above 800
I'm not sure how to handle irq, rt and dl here, to be honest. They seem to have
been taken as an 'additional' demand on top of CFS. So yes, we'll go above, but
irq and dl don't have knowledge about uclamp_max. RT does and will be equally
capped like CFS. I kept the current behavior the same, but I did wonder about
them too in patch 4.
So in a system where there are active CFS, RT, DL and IRQ, and both CFS and RT
have a cap of 800, they won't ask for more. But once we add IRQ and DL on
top, we'll go above.
You think we shouldn't? See below for a suggestion.
> > I am still not sure if you mean we are mixing up the code and we need better
> > abstraction or something else.
> >
> > Beside the abstraction problem, which I agree with, I can't see what I am
> > mixing up yet :( Sorry I think I need more helping hand to see it.
>
> There is a mix between actual utilization and performance limit, and
> when we add both we lose important information, as highlighted by
> my example. If the current formula is not correct because we can go
> above the uclamp_max value, your proposal is not better. And the root
> cause mainly comes from adding utilization to a performance limit
> (i.e. uclamp)
>
> That's why I said that we need a new interface to enable cpufreq to
> not blindly apply its headroom but to make a smarter decision at the
> cpufreq level
Okay, I see. I tend to agree here too. The question is: should cpufreq take each
util (cfs, rt, dl, irq) as input and make the decision on its own, or should the
scheduler add them and pass the aggregated value? If the latter, how can
cpufreq know how to apply the limit? From what I see, all these decisions have
to happen in the same function, not split.
It seems the sticking point is how we interpret irq pressure with uclamp. It
seems you think we should apply any uclamp capping to it too, which I think
would make sense.
And DL bandwidth we need to max() with the aggregated value.
So I think the formula could be
util = cfs + rt pressure + irq pressure
unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
{
eff_util = apply_dvfs_headroom(util);
eff_util = uclamp_rq_util_with(rq, eff_util, NULL);
eff_util = max(eff_util, dl_bw);
}
so we add the utilization of cfs, rt and irq (as per the current formula), and
then let cpufreq do the headroom and limit management.
I changed the way we handle dl_bw as it is actually requesting to run at
a specific level and not really a pressure. So we max() it with eff_util.
If there's a DL task on the rq then it'd be running and the frequency it
needs is defined by its bandwidth.
We could also keep it as it is with
unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
{
eff_util = apply_dvfs_headroom(util);
eff_util = uclamp_rq_util_with(rq, eff_util, NULL);
eff_util += dl_bw;
}
RT has uclamp knowledge, so it'll either run at max or at whatever value it might
have requested via uclamp_min. But DL doesn't set any uclamp_min and must be
either added or max()ed. I'm not sure which is more correct yet, but maybe
adding is actually better, to ensure the CPU runs high enough to handle all the
tasks on the rq.
What do you think?
Thanks!
--
Qais Yousef
On Sunday 24 Sept 2023 at 18:23:01 (+0100), Qais Yousef wrote:
> On 09/24/23 09:58, Vincent Guittot wrote:
>
> > > Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
> > > the 812 to 800; rounding errors almost account for the 10-point
> > > difference between 870 and 860.
> >
> > no, I voluntarily used 568 + 128*1.25. I added the dvfs headroom for irq
> > just to ensure that you would not object that I removed the headroom for
> > irq, and to focus on the use case, but it might have created more
> > confusion.
> >
> > My example above demonstrates that only taking care of cases with null
> > irq pressure is not enough and you can still end up above 800
> >
> > IIUC your point with uclamp_max: it is a performance limit that you
> > don't want to cross because of CFS. This means that we should not go
> > above 800 in my example because of cfs utilization: irq needs
> > 128 and CFS asks for 568, so the system needs 696, which is below the 800
> > uclamp. Even if you add the dvfs headroom on irq, the system is still
> > below 800. Only when you add the dvfs headroom to cfs do you go above
> > 800, but it's not needed because uclamp says that you should not go
>
> Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
> capped, even if there's headroom; the question you have is about the way it is
At least I want to ensure that cpufreq has the right information to make a
smart decision. In the example above, there is no need to go above 800 for
either cfs or irq.
> being applied. As long as we agree on this part, which is the fundamental behavior
> question that I thought was the pain point, the implementation details are
> certainly something that I can improve on.
>
> > above 800 because of CFS, so we should stay at 800, whereas both the current
> > formula and your new formula return a value above 800
>
> I'm not sure how to handle irq, rt and dl here, to be honest. They seem to have
> been taken as an 'additional' demand on top of CFS. So yes, we'll go above, but
> irq and dl don't have knowledge about uclamp_max. RT does and will be equally
> capped like CFS. I kept the current behavior the same, but I did wonder about
> them too in patch 4.
>
> So in a system where there are active CFS, RT, DL and IRQ, and both CFS and RT
> have a cap of 800, they won't ask for more. But once we add IRQ and DL on
> top, we'll go above.
>
> You think we shouldn't? See below for a suggestion.
I'm afraid that the answer is not that straightforward.
In the example above, should we still stay at the uclamp value if we now set
uclamp to 680?
And what about a uclamp of 160?
>
> > > I am still not sure if you mean we are mixing up the code and we need better
> > > abstraction or something else.
> > >
> > > Beside the abstraction problem, which I agree with, I can't see what I am
> > > mixing up yet :( Sorry I think I need more helping hand to see it.
> >
> > There is a mix between actual utilization and performance limit, and
> > when we add both we lose important information, as highlighted by
> > my example. If the current formula is not correct because we can go
> > above the uclamp_max value, your proposal is not better. And the root
> > cause mainly comes from adding utilization to a performance limit
> > (i.e. uclamp)
> >
> > That's why I said that we need a new interface to enable cpufreq to
> > not blindly apply its headroom but to make a smarter decision at the
> > cpufreq level
>
> Okay, I see. I tend to agree here too. The question is: should cpufreq take each
> util (cfs, rt, dl, irq) as input and make the decision on its own, or should the
> scheduler add them and pass the aggregated value? If the latter, how can
> cpufreq know how to apply the limit? From what I see, all these decisions have
> to happen in the same function, not split.
I'm not in favor of showing all the details to cpufreq because it would have to
follow the internal changes. Instead, I was thinking of something like:
/* Function name to be changed */
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, unsigned long *min, unsigned long *max)
The function returns the actual utilization of the CPU and some minimum and
maximum limits, with the possibility of having the min and/or actual value
above max, because the min would be a hard minimum value whereas the max is
only a soft maximum value.
Min would be the minimum perf to provide to the cpu: typically DL_bw + irq.
Actual would be the actual utilization of the cpu: cfs+rt+dl+irq (after scaling
everything to the normal range).
Max would be the maximum needed performance for normal work: typically the
minimum of uclamp and capacity.
Then cpufreq can use these 3 values to compute a performance level and it
will know up to which perf level it should go and if it is worth it.
Something like:
/* get scheduler statistics */
actual = effective_cpu_util(cpu, util, &min, &max);
/* ensure min perf for dl and irq + some margin for others */
target = min + headroom;
/* check if we really need to go to max */
if ((actual + headroom) < max)
	max = actual + headroom;
/* use the max of the 2 values */
target = max(target, max);
I put all this in the quick patch below, which is only compiled but not tested
---
include/linux/energy_model.h | 1 -
kernel/sched/core.c | 98 +++++++++++++++++---------------
kernel/sched/cpufreq_schedutil.c | 6 +-
kernel/sched/fair.c | 4 +-
kernel/sched/sched.h | 7 ++-
5 files changed, 61 insertions(+), 55 deletions(-)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e4cf9baf5f9e..c424a1bcec38 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -261,7 +261,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
scale_cpu = arch_scale_cpu_capacity(cpu);
ref_freq = em_get_capacity_ref_freq(cpu, pd);
- max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ref_freq, scale_cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6560392f2f83..030564f5be24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7404,18 +7404,13 @@ int sched_core_idle_cpu(int cpu)
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- enum cpu_util_type type,
- struct task_struct *p)
+ unsigned long *min,
+ unsigned long *max)
{
- unsigned long dl_util, util, irq, max;
+ unsigned long dl_util, util, irq, scale;
struct rq *rq = cpu_rq(cpu);
- max = arch_scale_cpu_capacity(cpu);
-
- if (!uclamp_is_used() &&
- type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
- return max;
- }
+ scale = arch_scale_cpu_capacity(cpu);
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7423,9 +7418,16 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
- if (unlikely(irq >= max))
- return max;
+ if (unlikely(irq >= scale)) {
+ if (min)
+ *min = scale;
+ if (max)
+ *max = scale;
+ return scale;
+ }
+ if (min)
+ *min = irq + cpu_bw_dl(rq);
/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
@@ -7439,29 +7441,13 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
- if (type == FREQUENCY_UTIL)
- util = uclamp_rq_util_with(rq, util, p);
+ util += cpu_util_dl(rq);
- dl_util = cpu_util_dl(rq);
-
- /*
- * For frequency selection we do not make cpu_util_dl() a permanent part
- * of this sum because we want to use cpu_bw_dl() later on, but we need
- * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
- * that we select f_max when there is no idle time.
- *
- * NOTE: numerical errors or stop class might cause us to not quite hit
- * saturation when we should -- something for later.
- */
- if (util + dl_util >= max)
- return max;
-
- /*
- * OTOH, for energy computation we need the estimated running time, so
- * include util_dl and ignore dl_bw.
- */
- if (type == ENERGY_UTIL)
- util += dl_util;
+ if (util >= scale) {
+ if (max)
+ *max = scale;
+ return scale;
+ }
/*
* There is still idle time; further improve the number by using the
@@ -7472,28 +7458,48 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* U' = irq + --------- * U
* max
*/
- util = scale_irq_capacity(util, irq, max);
+ util = scale_irq_capacity(util, irq, scale);
util += irq;
+ if (max)
+ *max = uclamp_rq_util_with(rq, util, NULL);
+
+ return min(scale, util);
+}
+
+
+/*
+ * TODO: move this in cpufreq
+ */
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
+ struct task_struct *p)
+{
+ unsigned long actual, target, min, max;
+
+ /* Get utilization stats */
+ actual = effective_cpu_util(cpu, util_cfs, &min, &max);
+
/*
- * Bandwidth required by DEADLINE must always be granted while, for
- * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
- * to gracefully reduce the frequency when no tasks show up for longer
- * periods of time.
- *
- * Ideally we would like to set bw_dl as min/guaranteed freq and util +
- * bw_dl as requested freq. However, cpufreq is not yet ready for such
- * an interface. So, we only do the latter for now.
+ * Provide at least enough capacity for DL + irq plus some headroom
+ * for other activities
*/
- if (type == FREQUENCY_UTIL)
- util += cpu_bw_dl(rq);
+ target = map_util_perf(min);
- return min(max, util);
+ actual = map_util_perf(actual);
+ /* Actually we don't need to target the max performance */
+ if (actual < max)
+ max = actual;
+
+ /*
+ * Ensure at least the minimum perf target while providing more compute capacity when
+ * possible
+ */
+ target = max(target, max);
+
+ return target;
}
unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
}
#endif /* CONFIG_SMP */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e2b9c8c3d69a..ef6b4b09ac12 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -162,7 +162,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned int freq;
struct cpufreq_policy *policy = sg_policy->policy;
- util = map_util_perf(util);
freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max);
@@ -179,8 +178,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
struct rq *rq = cpu_rq(sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);
- sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
- FREQUENCY_UTIL, NULL);
+ sg_cpu->util = effective_cpu_perf(sg_cpu->cpu, util, NULL);
}
/**
@@ -427,7 +425,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sg_cpu->util = prev_util;
cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
- map_util_perf(sg_cpu->util), max_cap);
+ sg_cpu->util, max_cap);
sg_cpu->sg_policy->last_freq_update_time = time;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06d6d0dde48a..50568e2fa1ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7570,7 +7570,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);
- busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+ busy_time += effective_cpu_util(cpu, util, NULL, NULL);
}
eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7602,7 +7602,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+ eff_util = effective_cpu_perf(cpu, util, tsk);
max_util = max(max_util, eff_util);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17ae151e90c0..e79cb1110e8d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2988,10 +2988,13 @@ enum cpu_util_type {
ENERGY_UTIL,
};
-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- enum cpu_util_type type,
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
struct task_struct *p);
+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long *min,
+ unsigned long *max);
+
/*
* Verify the fitness of task @p to run on @cpu taking into account the
* CPU original capacity and the runtime/deadline ratio of the task.
--
2.34.1
>
> It seems the sticking point is how we interpret irq pressure with uclamp. It
> seems you think we should apply any uclamp capping to this, which I think would
> make sense.
>
> And DL bandwidth we need to max() with the aggregated value.
>
> So I think the formula could be
>
> util = cfs + rt pressure + irq pressure
>
> unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
> {
> eff_util = apply_dvfs_headroom(util);
> eff_util = uclamp_rq_util_with(rq, eff_util, NULL);
>
> eff_util = max(eff_util, dl_bw);
> return eff_util;
> }
>
> so we add the utilization of cfs, rt and irq (as per current formula). And then
> let cpufreq do the headroom and limit management.
>
> I changed the way we handle dl_bw as it is actually requesting to run at
> a specific level and not really a pressure. So we max() it with eff_util.
>
> If there's a DL task on the rq then it'd be running and the frequency it
> needs is defined by its bandwidth.
>
> We could also keep it as it is with
>
> unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
> {
> eff_util = apply_dvfs_headroom(util);
> eff_util = uclamp_rq_util_with(rq, eff_util, NULL);
>
> eff_util += dl_bw;
> return eff_util;
> }
>
> RT has uclamp knowledge so it'll either run at max or whatever value it might
> have requested via uclamp_min. But DL doesn't set any uclamp_min and must be
> either added or max()ed. I'm not sure which is more correct yet, but maybe
> adding actually is better to ensure the CPU runs higher to handle all the tasks
> on the rq.
>
> What do you think?
>
>
> Thanks!
>
> --
> Qais Yousef
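The add-vs-max question raised above can be made concrete with a small userspace sketch. This is purely illustrative, not kernel code: `dvfs_headroom()` stands in for the kernel's `map_util_perf()` (util + 25%), the uclamp clamp is reduced to a simple min, and the helper names are made up for this example.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* 25% DVFS headroom, like the kernel's map_util_perf(): util + util/4 */
static unsigned long dvfs_headroom(unsigned long util)
{
	return util + (util >> 2);
}

/* Option 1: treat DL bandwidth as extra demand and add it on top. */
static unsigned long eff_util_add(unsigned long util, unsigned long uclamp_max,
				  unsigned long dl_bw)
{
	unsigned long eff = dvfs_headroom(util);

	if (eff > uclamp_max)
		eff = uclamp_max;	/* stand-in for uclamp_rq_util_with() */
	return eff + dl_bw;
}

/* Option 2: treat DL bandwidth as a hard floor and max() it. */
static unsigned long eff_util_max(unsigned long util, unsigned long uclamp_max,
				  unsigned long dl_bw)
{
	unsigned long eff = dvfs_headroom(util);

	if (eff > uclamp_max)
		eff = uclamp_max;
	return eff > dl_bw ? eff : dl_bw;
}
```

With util = 400 and dl_bw = 200, adding yields 700 while max()ing yields 500: adding always reserves extra room for the DL tasks on top of the cfs/rt demand, whereas max() only kicks in when the DL bandwidth exceeds everything else.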
On 09/28/23 19:50, Vincent Guittot wrote:
> >
> > Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
> > capped even if there's headroom, but the question you have on the way it is
>
> At least I want to ensure that cpufreq has the right information to make a
> smart decision. In the example above, there is no need to go above 800 for
> either cfs or irq.
Okay you want to do even bigger rework :-) I thought I might have pushed some
boundary with the rework I had in mind hehe.
> I'm not in favor of showing all details to cpufreq because it will have to
> follow the internal changes. Instead, I was thinking of something like:
>
> /* Function name to be changed */
> unsigned long effective_cpu_util(int cpu, unsigned int *min, unsigned int *max)
>
> The function returns the actual utilization of the CPU and some minimum and
> maximum limits with the possibility to have the min and/or Actual values > Max
> because the min would be a hard minimum value whereas max only a soft maximum
> value.
>
> Min would be the minimum perf to provide to the cpu : typically DL_bw + irq
> Actual would be the actual utilization of the cpu: cfs+rt+dl+irq (after scaling
> everything in the normal range)
> Max would be the maximum needed performance for normal work: typically the
> minimum between uclamp and capacity
>
> Then cpufreq can use these 3 values to compute a performance level and it
> will know up to which perf level it should go and if it is worth it.
> Something like:
Okay thanks! I think I have better clarity now. Let me try to rework the
patches.
Cheers
--
Qais Yousef
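The three-value {min, actual, max} contract proposed above can be sketched in userspace C. This mirrors the combination logic of the `effective_cpu_perf()` function in the patch further down, but `pick_perf_level()` and the 25% headroom helper are illustrative names for this sketch, not kernel APIs.

```c
#include <assert.h>

/* 25% DVFS headroom, like the kernel's map_util_perf() */
static unsigned long map_util_perf(unsigned long util)
{
	return util + (util >> 2);
}

/*
 * min:    hard floor (typically DL bandwidth + irq)
 * actual: measured utilization (cfs + rt + dl + irq, scaled)
 * max:    soft ceiling (typically min(uclamp, capacity))
 */
static unsigned long pick_perf_level(unsigned long actual, unsigned long min,
				     unsigned long max)
{
	unsigned long target = map_util_perf(min);

	actual = map_util_perf(actual);
	/* No point going above the soft maximum for normal work... */
	if (actual < max)
		max = actual;
	/* ...but never drop below the hard minimum. */
	return target > max ? target : max;
}
```

Note how the soft ceiling only trims the headroomed actual utilization, while the hard floor (DL + irq) always wins: a capped cfs workload stays at its uclamp limit, but DL bandwidth still gets its guaranteed performance.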
On Thursday, 28 Sept 2023 at 23:05:04 (+0100), Qais Yousef wrote:
> On 09/28/23 19:50, Vincent Guittot wrote:
>
> > >
> > > Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
> > > capped even if there's headroom, but the question you have on the way it is
> >
> > At least I want to ensure that cpufreq has the right information to make a
> > smart decision. In the example above, there is no need to go above 800 for
> > either cfs or irq.
>
> Okay you want to do even bigger rework :-) I thought I might have pushed some
> boundary with the rework I had in mind hehe.
>
> > I'm not in favor of showing all details to cpufreq because it will have to
> > follow the internal changes. Instead, I was thinking of something like:
> >
> > /* Function name to be changed */
> > unsigned long effective_cpu_util(int cpu, unsigned int *min, unsigned int *max)
> >
> > The function returns the actual utilization of the CPU and some minimum and
> > maximum limits with the possibility to have the min and/or Actual values > Max
> > because the min would be a hard minimum value whereas max is only a soft maximum
> > value.
> >
> > Min would be the minimum perf to provide to the cpu : typically DL_bw + irq
> > Actual would be the actual utilization of the cpu: cfs+rt+dl+irq (after scaling
> > everything in the normal range)
> > Max would be the maximum needed performance for normal work: typically the
> > minimum between uclamp and capacity
> >
> > Then cpufreq can use these 3 values to compute a performance level and it
> > will know up to which perf level it should go and if it is worth it.
> > Something like:
>
> Okay thanks! I think I have better clarity now. Let me try to rework the
> patches.
This is the patch with everything amended; some pieces were missing in the previous version.
---
include/linux/energy_model.h | 1 -
kernel/sched/core.c | 103 +++++++++++++++++--------------
kernel/sched/cpufreq_schedutil.c | 6 +-
kernel/sched/fair.c | 4 +-
kernel/sched/sched.h | 7 ++-
5 files changed, 66 insertions(+), 55 deletions(-)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e4cf9baf5f9e..c424a1bcec38 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -261,7 +261,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
scale_cpu = arch_scale_cpu_capacity(cpu);
ref_freq = em_get_capacity_ref_freq(cpu, pd);
- max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ref_freq, scale_cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6560392f2f83..e5476703ba49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7404,18 +7404,13 @@ int sched_core_idle_cpu(int cpu)
* required to meet deadlines.
*/
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- enum cpu_util_type type,
- struct task_struct *p)
+ unsigned long *min,
+ unsigned long *max)
{
- unsigned long dl_util, util, irq, max;
+ unsigned long util, irq, scale;
struct rq *rq = cpu_rq(cpu);
- max = arch_scale_cpu_capacity(cpu);
-
- if (!uclamp_is_used() &&
- type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
- return max;
- }
+ scale = arch_scale_cpu_capacity(cpu);
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7423,9 +7418,16 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* update_irq_load_avg().
*/
irq = cpu_util_irq(rq);
- if (unlikely(irq >= max))
- return max;
+ if (unlikely(irq >= scale)) {
+ if (min)
+ *min = scale;
+ if (max)
+ *max = scale;
+ return scale;
+ }
+ if (min)
+ *min = irq + cpu_bw_dl(rq);
/*
* Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective
@@ -7439,29 +7441,13 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* frequency will be gracefully reduced with the utilization decay.
*/
util = util_cfs + cpu_util_rt(rq);
- if (type == FREQUENCY_UTIL)
- util = uclamp_rq_util_with(rq, util, p);
-
- dl_util = cpu_util_dl(rq);
-
- /*
- * For frequency selection we do not make cpu_util_dl() a permanent part
- * of this sum because we want to use cpu_bw_dl() later on, but we need
- * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
- * that we select f_max when there is no idle time.
- *
- * NOTE: numerical errors or stop class might cause us to not quite hit
- * saturation when we should -- something for later.
- */
- if (util + dl_util >= max)
- return max;
+ util += cpu_util_dl(rq);
- /*
- * OTOH, for energy computation we need the estimated running time, so
- * include util_dl and ignore dl_bw.
- */
- if (type == ENERGY_UTIL)
- util += dl_util;
+ if (util >= scale) {
+ if (max)
+ *max = scale;
+ return scale;
+ }
/*
* There is still idle time; further improve the number by using the
@@ -7472,28 +7458,53 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* U' = irq + --------- * U
* max
*/
- util = scale_irq_capacity(util, irq, max);
+ util = scale_irq_capacity(util, irq, scale);
util += irq;
+ if (max)
+ *max = uclamp_rq_util_with(rq, util, NULL);
+
+ return min(scale, util);
+}
+
+
+/*
+ * TODO: move this into cpufreq
+ */
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
+ struct task_struct *p)
+{
+ unsigned long actual, target, min, max;
+ struct rq *rq = cpu_rq(cpu);
+
+ /* Get utilization stats */
+ actual = effective_cpu_util(cpu, util_cfs, &min, &max);
+
+ /* Check how max would be changed with p */
+ if (p)
+ max = min(max, uclamp_rq_util_with(rq, util_cfs, p));
+
/*
- * Bandwidth required by DEADLINE must always be granted while, for
- * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
- * to gracefully reduce the frequency when no tasks show up for longer
- * periods of time.
- *
- * Ideally we would like to set bw_dl as min/guaranteed freq and util +
- * bw_dl as requested freq. However, cpufreq is not yet ready for such
- * an interface. So, we only do the latter for now.
+ * Provide at least enough capacity for DL + irq plus some headroom
+ * for other activities
*/
- if (type == FREQUENCY_UTIL)
- util += cpu_bw_dl(rq);
+ target = map_util_perf(min);
- return min(max, util);
+ actual = map_util_perf(actual);
+ /* Actually we don't need to target the max performance */
+ if (actual < max)
+ max = actual;
+
+ /*
+ * Ensure at least the minimum perf target while providing more compute capacity when
+ * possible
+ */
+ return max(target, max);
}
unsigned long sched_cpu_util(int cpu)
{
- return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
+ return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
}
#endif /* CONFIG_SMP */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e2b9c8c3d69a..ef6b4b09ac12 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -162,7 +162,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned int freq;
struct cpufreq_policy *policy = sg_policy->policy;
- util = map_util_perf(util);
freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max);
@@ -179,8 +178,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
struct rq *rq = cpu_rq(sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);
- sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
- FREQUENCY_UTIL, NULL);
+ sg_cpu->util = effective_cpu_perf(sg_cpu->cpu, util, NULL);
}
/**
@@ -427,7 +425,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sg_cpu->util = prev_util;
cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
- map_util_perf(sg_cpu->util), max_cap);
+ sg_cpu->util, max_cap);
sg_cpu->sg_policy->last_freq_update_time = time;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06d6d0dde48a..50568e2fa1ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7570,7 +7570,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
for_each_cpu(cpu, pd_cpus) {
unsigned long util = cpu_util(cpu, p, -1, 0);
- busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+ busy_time += effective_cpu_util(cpu, util, NULL, NULL);
}
eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7602,7 +7602,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+ eff_util = effective_cpu_perf(cpu, util, tsk);
max_util = max(max_util, eff_util);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17ae151e90c0..4cae9d7c4d8f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2988,10 +2988,13 @@ enum cpu_util_type {
ENERGY_UTIL,
};
-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
- enum cpu_util_type type,
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
struct task_struct *p);
+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long *min,
+ unsigned long *max);
+
/*
* Verify the fitness of task @p to run on @cpu taking into account the
* CPU original capacity and the runtime/deadline ratio of the task.
--
2.34.1
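The irq scaling step in the rewritten effective_cpu_util() above can be checked with a small userspace sketch. Time stolen by irq is invisible to CFS/RT/DL, so their utilization is scaled by the remaining capacity before the irq contribution is added back: U' = irq + U * (scale - irq) / scale. `scale_irq_capacity()` below mirrors the kernel helper of the same name; `cpu_effective_util()` is an illustrative wrapper, not a kernel function.

```c
#include <assert.h>

/* U' = U * (scale - irq) / scale, as in the kernel helper */
static unsigned long scale_irq_capacity(unsigned long util, unsigned long irq,
					unsigned long scale)
{
	return util * (scale - irq) / scale;
}

static unsigned long cpu_effective_util(unsigned long util_cfs_rt_dl,
					unsigned long irq, unsigned long scale)
{
	unsigned long util;

	if (irq >= scale)		/* irq/steal time alone saturates the CPU */
		return scale;
	if (util_cfs_rt_dl >= scale)	/* no idle time left */
		return scale;

	util = scale_irq_capacity(util_cfs_rt_dl, irq, scale) + irq;
	return util < scale ? util : scale;
}
```

With the numbers from the earlier examples (util = 512, irq = 256, scale = 1024), this gives 512 * 768 / 1024 + 256 = 640, matching the U' formula in the patch comment.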
>
>
> Cheers
>
> --
> Qais Yousef