2022-04-07 21:06:12

by Xuewen Yan

Subject: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

There are cases when the cpu max capacity might be reduced due to thermal.
Take the thermal pressure into account when judging whether an rt task
fits the cpu. The thermal pressure should also be considered when the
schedutil governor gets the cpu util.

Signed-off-by: Xuewen Yan <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 1 +
kernel/sched/rt.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3dbf351d12d5..285ad51caf0f 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
struct rq *rq = cpu_rq(sg_cpu->cpu);
unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);

+ max -= arch_scale_thermal_pressure(sg_cpu->cpu);
sg_cpu->max = max;
sg_cpu->bw_dl = cpu_bw_dl(rq);
sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a32c46889af8..d9982ebd4821 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
max_cap = uclamp_eff_value(p, UCLAMP_MAX);

cpu_cap = capacity_orig_of(cpu);
+ cpu_cap -= arch_scale_thermal_pressure(cpu);

return cpu_cap >= min(min_cap, max_cap);
}
--
2.25.1


2022-04-11 10:27:17

by Xuewen Yan

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Dietmar

On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
<[email protected]> wrote:
>
> On 07/04/2022 07:19, Xuewen Yan wrote:
> > There are cases when the cpu max capacity might be reduced due to thermal.
> > Take the thermal pressure into account when judging whether an rt task
> > fits the cpu. The thermal pressure should also be considered when the
> > schedutil governor gets the cpu util.
> >
> > Signed-off-by: Xuewen Yan <[email protected]>
> > ---
> > kernel/sched/cpufreq_schedutil.c | 1 +
> > kernel/sched/rt.c | 1 +
> > 2 files changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 3dbf351d12d5..285ad51caf0f 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> >
> > + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
>
> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
>
> For the energy part (A) we use max' in compute_energy() to cap sum_util
> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
> max'). This was done to match (B)'s `policy->max` capping.
>
> For the frequency part (B) we have freq_qos_update_request() in:
>
> power_actor_set_power()
> ...
> cdev->ops->set_cur_state()
>
> cpufreq_set_cur_state()
> freq_qos_update_request() <-- !
> arch_update_thermal_pressure()
>
> restricting `policy->max` which then clamps `target_freq` in:
>
> cpufreq_update_util()
> ...
> get_next_freq()
> cpufreq_driver_resolve_freq()
> __resolve_freq()
>

Do you mean that the "max" here will not affect the frequency
conversion, so there is no need to change it?
But is it better to reflect the influence of thermal here?

> [...]
>
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index a32c46889af8..d9982ebd4821 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > cpu_cap = capacity_orig_of(cpu);
> > + cpu_cap -= arch_scale_thermal_pressure(cpu);
> >
> > return cpu_cap >= min(min_cap, max_cap);
> > }
>
> IMHO, this should follow what we do with rq->cpu_capacity
> (capacity_of(), the remaining capacity for CFS). E.g. we use
> capacity_of() in find_energy_efficient_cpu() and select_idle_capacity()
> to compare capacities. So we would need a function like
> scale_rt_capacity() for RT (minus the rq->avg_rt.util_avg) but then also
> one for DL (minus rq->avg_dl.util_avg and rq->avg_rt.util_avg).

It's a really good idea. Do you already have a corresponding patch?
If so, can you tell me the corresponding link?

Thanks a lot!

BR
xuewen

2022-04-11 17:25:32

by Dietmar Eggemann

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 07/04/2022 07:19, Xuewen Yan wrote:
> There are cases when the cpu max capacity might be reduced due to thermal.
> Take the thermal pressure into account when judging whether an rt task
> fits the cpu. The thermal pressure should also be considered when the
> schedutil governor gets the cpu util.
>
> Signed-off-by: Xuewen Yan <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 1 +
> kernel/sched/rt.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 3dbf351d12d5..285ad51caf0f 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>
> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);

max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()

For the energy part (A) we use max' in compute_energy() to cap sum_util
and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
max'). This was done to match (B)'s `policy->max` capping.

For the frequency part (B) we have freq_qos_update_request() in:

power_actor_set_power()
...
cdev->ops->set_cur_state()

cpufreq_set_cur_state()
freq_qos_update_request() <-- !
arch_update_thermal_pressure()

restricting `policy->max` which then clamps `target_freq` in:

cpufreq_update_util()
...
get_next_freq()
cpufreq_driver_resolve_freq()
__resolve_freq()
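
For reference, the clamping at the end of that chain looks roughly like
this (sketch of __resolve_freq() from memory, not an exact quote):

    /*
     * drivers/cpufreq/cpufreq.c: the requested frequency is clamped
     * into [policy->min, policy->max], so a thermally lowered
     * policy->max silently caps whatever schedutil asked for.
     */
    target_freq = clamp_val(target_freq, policy->min, policy->max);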

[...]

> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a32c46889af8..d9982ebd4821 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
>
> cpu_cap = capacity_orig_of(cpu);
> + cpu_cap -= arch_scale_thermal_pressure(cpu);
>
> return cpu_cap >= min(min_cap, max_cap);
> }

IMHO, this should follow what we do with rq->cpu_capacity
(capacity_of(), the remaining capacity for CFS). E.g. we use
capacity_of() in find_energy_efficient_cpu() and select_idle_capacity()
to compare capacities. So we would need a function like
scale_rt_capacity() for RT (minus the rq->avg_rt.util_avg) but then also
one for DL (minus rq->avg_dl.util_avg and rq->avg_rt.util_avg).
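
For reference, scale_rt_capacity() in mainline fair.c looks roughly like
this (quoted from memory, so treat it as a sketch):

    static unsigned long scale_rt_capacity(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            unsigned long max = arch_scale_cpu_capacity(cpu);
            unsigned long used, free;
            unsigned long irq;

            irq = cpu_util_irq(rq);

            if (unlikely(irq >= max))
                    return 1;

            /*
             * Subtract the PELT-averaged RT, DL and thermal pressure
             * from the original capacity; what is left is what CFS can
             * actually use on this CPU.
             */
            used = READ_ONCE(rq->avg_rt.util_avg);
            used += READ_ONCE(rq->avg_dl.util_avg);
            used += thermal_load_avg(rq);

            if (unlikely(used >= max))
                    return 1;

            free = max - used;

            return scale_irq_capacity(free, irq, max);
    }

The RT variant would drop the avg_rt term, and the DL variant would drop
both the avg_rt and avg_dl terms.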

2022-04-12 00:54:28

by Dietmar Eggemann

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 11/04/2022 10:52, Xuewen Yan wrote:
> Hi Dietmar
>
> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
> <[email protected]> wrote:
>>
>> On 07/04/2022 07:19, Xuewen Yan wrote:
>>> There are cases when the cpu max capacity might be reduced due to thermal.
>>> Take the thermal pressure into account when judging whether an rt task
>>> fits the cpu. The thermal pressure should also be considered when the
>>> schedutil governor gets the cpu util.
>>>
>>> Signed-off-by: Xuewen Yan <[email protected]>
>>> ---
>>> kernel/sched/cpufreq_schedutil.c | 1 +
>>> kernel/sched/rt.c | 1 +
>>> 2 files changed, 2 insertions(+)
>>>
>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>> index 3dbf351d12d5..285ad51caf0f 100644
>>> --- a/kernel/sched/cpufreq_schedutil.c
>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>>>
>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
>>
>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
>>
>> For the energy part (A) we use max' in compute_energy() to cap sum_util
>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
>> max'). This was done to match (B)'s `policy->max` capping.
>>
>> For the frequency part (B) we have freq_qos_update_request() in:
>>
>> power_actor_set_power()
>> ...
>> cdev->ops->set_cur_state()
>>
>> cpufreq_set_cur_state()
>> freq_qos_update_request() <-- !
>> arch_update_thermal_pressure()
>>
>> restricting `policy->max` which then clamps `target_freq` in:
>>
>> cpufreq_update_util()
>> ...
>> get_next_freq()
>> cpufreq_driver_resolve_freq()
>> __resolve_freq()
>>
>
> Do you mean that the "max" here will not affect the frequency
> conversion, so there is no need to change it?
> But is it better to reflect the influence of thermal here?

I guess your point is that even though max' has no effect on frequency
since QOS caps policy->max anyway, it is still easier to understand the
dependency between schedutil and EAS/EM when it comes to the use of
thermal pressure.

I agree. The way how the "hidden" policy->max capping guarantees that
schedutil and EAS/EM are doing the same even when only the latter uses
max' is not obvious.

I just wanted to mention the historical reason why the code looks like
it does today.

>> [...]
>>
>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>> index a32c46889af8..d9982ebd4821 100644
>>> --- a/kernel/sched/rt.c
>>> +++ b/kernel/sched/rt.c
>>> @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
>>> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
>>>
>>> cpu_cap = capacity_orig_of(cpu);
>>> + cpu_cap -= arch_scale_thermal_pressure(cpu);
>>>
>>> return cpu_cap >= min(min_cap, max_cap);
>>> }
>>
>> IMHO, this should follow what we do with rq->cpu_capacity
>> (capacity_of(), the remaining capacity for CFS). E.g. we use
>> capacity_of() in find_energy_efficient_cpu() and select_idle_capacity()
>> to compare capacities. So we would need a function like
>> scale_rt_capacity() for RT (minus the rq->avg_rt.util_avg) but then also
>> one for DL (minus rq->avg_dl.util_avg and rq->avg_rt.util_avg).
>
> It's a really good idea. Do you already have a corresponding patch?
> If so, can you tell me the corresponding link?

No, I don't have any code for this. Should be trivial though. But the
issue is that the update would probably have to be decoupled from CFS
load_balance (update_group_capacity()) and the use cases in RT/DL are
only valid for asymmetric CPU capacity systems.
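
A minimal sketch of what such an RT-side helper could look like (invented
name, untested; it would still need its own update path as said above):

    /*
     * Remaining capacity for RT: original capacity minus the
     * PELT-averaged thermal pressure and the DL utilization that can
     * preempt RT.
     */
    static unsigned long rt_capacity_of(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            unsigned long max = arch_scale_cpu_capacity(cpu);
            unsigned long used;

            used = thermal_load_avg(rq);
            used += READ_ONCE(rq->avg_dl.util_avg);

            if (unlikely(used >= max))
                    return 1;

            return max - used;
    }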

2022-04-13 17:28:23

by Lukasz Luba

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity



On 4/11/22 15:07, Dietmar Eggemann wrote:
> On 11/04/2022 10:52, Xuewen Yan wrote:
>> Hi Dietmar
>>
>> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
>> <[email protected]> wrote:
>>>
>>> On 07/04/2022 07:19, Xuewen Yan wrote:
>>>> There are cases when the cpu max capacity might be reduced due to thermal.
>>>> Take the thermal pressure into account when judging whether an rt task
>>>> fits the cpu. The thermal pressure should also be considered when the
>>>> schedutil governor gets the cpu util.
>>>>
>>>> Signed-off-by: Xuewen Yan <[email protected]>
>>>> ---
>>>> kernel/sched/cpufreq_schedutil.c | 1 +
>>>> kernel/sched/rt.c | 1 +
>>>> 2 files changed, 2 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>>> index 3dbf351d12d5..285ad51caf0f 100644
>>>> --- a/kernel/sched/cpufreq_schedutil.c
>>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
>>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>>>>
>>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
>>>
>>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
>>>
>>> For the energy part (A) we use max' in compute_energy() to cap sum_util
>>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
>>> max'). This was done to match (B)'s `policy->max` capping.
>>>
>>> For the frequency part (B) we have freq_qos_update_request() in:
>>>
>>> power_actor_set_power()
>>> ...
>>> cdev->ops->set_cur_state()
>>>
>>> cpufreq_set_cur_state()
>>> freq_qos_update_request() <-- !
>>> arch_update_thermal_pressure()
>>>
>>> restricting `policy->max` which then clamps `target_freq` in:
>>>
>>> cpufreq_update_util()
>>> ...
>>> get_next_freq()
>>> cpufreq_driver_resolve_freq()
>>> __resolve_freq()
>>>
>>
>> Do you mean that the "max" here will not affect the frequency
>> conversion, so there is no need to change it?
>> But is it better to reflect the influence of thermal here?
>
> I guess your point is that even though max' has no effect on frequency
> since QOS caps policy->max anyway, it is still easier to understand the
> dependency between schedutil and EAS/EM when it comes to the use of
> thermal pressure.
>
> I agree. The way how the "hidden" policy->max capping guarantees that
> schedutil and EAS/EM are doing the same even when only the latter uses
> max' is not obvious.

+1 here, IMO we shouldn't rely on hidden stuff. There are two drivers which
set the thermal pressure, but one is not setting the freq_qos which causes
the update of the 'policy->max'. So schedutil will send that high
frequency but that driver would just ignore it and clamp internally. In the
end we might argue it still works, but is it clean and visible from the
code? Funny things might start to happen when the driver, which might be
the last safety net, cannot capture this.

We should also be OK with the energy estimation and the CPU capacity vs.
task utilization comparisons, since the thermal pressure is accounted
for there (until thermal is controlled in the kernel, not in FW, which is
something we are heading towards with the scmi-cpufreq work mentioned in
this cover letter [1]).

>
> I just wanted to mention the historical reason why the code looks like
> it does today.
>
>>> [...]
>>>
>>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>>> index a32c46889af8..d9982ebd4821 100644
>>>> --- a/kernel/sched/rt.c
>>>> +++ b/kernel/sched/rt.c
>>>> @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
>>>> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
>>>>
>>>> cpu_cap = capacity_orig_of(cpu);
>>>> + cpu_cap -= arch_scale_thermal_pressure(cpu);
>>>>
>>>> return cpu_cap >= min(min_cap, max_cap);
>>>> }
>>>
>>> IMHO, this should follow what we do with rq->cpu_capacity
>>> (capacity_of(), the remaining capacity for CFS). E.g. we use
>>> capacity_of() in find_energy_efficient_cpu() and select_idle_capacity()
>>> to compare capacities. So we would need a function like
>>> scale_rt_capacity() for RT (minus the rq->avg_rt.util_avg) but then also
>>> one for DL (minus rq->avg_dl.util_avg and rq->avg_rt.util_avg).
>>
>> It's a really good idea. Do you already have a corresponding patch?
>> If so, can you tell me the corresponding link?
>
> No, I don't have any code for this. Should be trivial though. But the
> issue is that the update would probably have to be decoupled from CFS
> load_balance (update_group_capacity()) and the use cases in RT/DL are
> only valid for asymmetric CPU capacity systems.

Having those in mind I would vote for fixing it incrementally.
Thus, IMHO this patch is good to take. Later we might think about how
to better estimate the real CPU capacity visible from the RT and DL classes.
In this shape it is good for many systems which only use the RT class,
but not the DL class. Those systems w/ RT and w/o DL might suffer on some
asymmetric CPU platforms where medium cores have capacity e.g. 850 and
thermal pressure reduces the big cores' capacity by 250, making them 774.
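
Plugging those numbers into the patched rt_task_fits_capacity() check
(assuming, just for illustration, an RT task with uclamp_min = 800):

    mid: cpu_cap = 850  - 0   = 850 >= 800 --> fits
    big: cpu_cap = 1024 - 250 = 774 <  800 --> no longer fits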

Regards,
Lukasz

[1]
https://lore.kernel.org/linux-pm/[email protected]/

2022-04-16 03:10:58

by Xuewen Yan

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Luba / Dietmar

On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
>
>
>
> On 4/11/22 15:07, Dietmar Eggemann wrote:
> > On 11/04/2022 10:52, Xuewen Yan wrote:
> >> Hi Dietmar
> >>
> >> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
> >> <[email protected]> wrote:
> >>>
> >>> On 07/04/2022 07:19, Xuewen Yan wrote:
> >>>> There are cases when the cpu max capacity might be reduced due to thermal.
> >>>> Take the thermal pressure into account when judging whether an rt task
> >>>> fits the cpu. The thermal pressure should also be considered when the
> >>>> schedutil governor gets the cpu util.
> >>>>
> >>>> Signed-off-by: Xuewen Yan <[email protected]>
> >>>> ---
> >>>> kernel/sched/cpufreq_schedutil.c | 1 +
> >>>> kernel/sched/rt.c | 1 +
> >>>> 2 files changed, 2 insertions(+)
> >>>>
> >>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >>>> index 3dbf351d12d5..285ad51caf0f 100644
> >>>> --- a/kernel/sched/cpufreq_schedutil.c
> >>>> +++ b/kernel/sched/cpufreq_schedutil.c
> >>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> >>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
> >>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> >>>>
> >>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
> >>>
> >>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
> >>>
> >>> For the energy part (A) we use max' in compute_energy() to cap sum_util
> >>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
> >>> max'). This was done to match (B)'s `policy->max` capping.
> >>>
> >>> For the frequency part (B) we have freq_qos_update_request() in:
> >>>
> >>> power_actor_set_power()
> >>> ...
> >>> cdev->ops->set_cur_state()
> >>>
> >>> cpufreq_set_cur_state()
> >>> freq_qos_update_request() <-- !
> >>> arch_update_thermal_pressure()
> >>>
> >>> restricting `policy->max` which then clamps `target_freq` in:
> >>>
> >>> cpufreq_update_util()
> >>> ...
> >>> get_next_freq()
> >>> cpufreq_driver_resolve_freq()
> >>> __resolve_freq()
> >>>
> >>
> >> Do you mean that the "max" here will not affect the frequency
> >> conversion, so there is no need to change it?
> >> But is it better to reflect the influence of thermal here?
> >
> > I guess your point is that even though max' has no effect on frequency
> > since QOS caps policy->max anyway, it is still easier to understand the
> > dependency between schedutil and EAS/EM when it comes to the use of
> > thermal pressure.
> >
> > I agree. The way how the "hidden" policy->max capping guarantees that
> > schedutil and EAS/EM are doing the same even when only the latter uses
> > max' is not obvious.
>
> +1 here, IMO we shouldn't rely on hidden stuff. There are two drivers which
> set the thermal pressure, but one is not setting the freq_qos which causes
> the update of the 'policy->max'. So schedutil will send that high
> frequency but that driver would just ignore it and clamp internally. In the
> end we might argue it still works, but is it clean and visible from the
> code? Funny things might start to happen when the driver, which might be
> the last safety net, cannot capture this.
>
> We should also be OK with the energy estimation and the CPU capacity vs.
> task utilization comparisons, since the thermal pressure is accounted
> for there (until thermal is controlled in the kernel, not in FW, which is
> something we are heading towards with the scmi-cpufreq work mentioned in
> this cover letter [1]).

IMO, if so, since we don't want to modify the original code but still
need to consider the impact of thermal, maybe it is possible to add a
new macro definition like this:

#define arch_scale_cpu_capacity_except_thermal(cpu) \
	(arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu))
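
Then rt_task_fits_capacity() could use it like this (just a sketch):

	cpu_cap = arch_scale_cpu_capacity_except_thermal(cpu);

	return cpu_cap >= min(min_cap, max_cap);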

>
> >
> > I just wanted to mention the historical reason why the code looks like
> > it does today.
> >
> >>> [...]
> >>>
> >>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>>> index a32c46889af8..d9982ebd4821 100644
> >>>> --- a/kernel/sched/rt.c
> >>>> +++ b/kernel/sched/rt.c
> >>>> @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> >>>> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> >>>>
> >>>> cpu_cap = capacity_orig_of(cpu);
> >>>> + cpu_cap -= arch_scale_thermal_pressure(cpu);
> >>>>
> >>>> return cpu_cap >= min(min_cap, max_cap);
> >>>> }
> >>>
> >>> IMHO, this should follow what we do with rq->cpu_capacity
> >>> (capacity_of(), the remaining capacity for CFS). E.g. we use
> >>> capacity_of() in find_energy_efficient_cpu() and select_idle_capacity()
> >>> to compare capacities. So we would need a function like
> >>> scale_rt_capacity() for RT (minus the rq->avg_rt.util_avg) but then also
> >>> one for DL (minus rq->avg_dl.util_avg and rq->avg_rt.util_avg).
> >>
> >> It's a really good idea. Do you already have a corresponding patch?
> >> If so, can you tell me the corresponding link?
> >
> > No, I don't have any code for this. Should be trivial though. But the
> > issue is that the update would probably have to be decoupled from CFS
> > load_balance (update_group_capacity()) and the use cases in RT/DL are
> > only valid for asymmetric CPU capacity systems.
>
> Having those in mind I would vote for fixing it incrementally.
> Thus, IMHO this patch is good to take. Later we might think about how
> to better estimate the real CPU capacity visible from the RT and DL classes.
> In this shape it is good for many systems which only use the RT class,
> but not the DL class. Those systems w/ RT and w/o DL might suffer on some
> asymmetric CPU platforms where medium cores have capacity e.g. 850 and
> thermal pressure reduces the big cores' capacity by 250, making them 774.
>

You mean that until there is a better way to handle RT capacity, we
can take this patch temporarily?
If so, I can update the patch so that it only fixes rt.c.

In fact, in the mobile phone usage scenario where the cpu contains 3
clusters (small/middle/big),
the big core's capacity can become smaller than that of
the middle core due to the thermal effect.
At this time, we do not want the big core CPU to be used as an RT
task's candidate cpu.

> Regards,
> Lukasz
>
> [1]
> https://lore.kernel.org/linux-pm/[email protected]/


BR
xuewen.yan

2022-04-19 13:38:53

by Lukasz Luba

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity



On 4/19/22 08:14, Vincent Guittot wrote:
> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
>>
>> Hi Luba / Dietmar
>>
>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
>>>
>>>
>>>
>>> On 4/11/22 15:07, Dietmar Eggemann wrote:
>>>> On 11/04/2022 10:52, Xuewen Yan wrote:
>>>>> Hi Dietmar
>>>>>
>>>>> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 07/04/2022 07:19, Xuewen Yan wrote:
>>>>>>> There are cases when the cpu max capacity might be reduced due to thermal.
>>>>>>> Take the thermal pressure into account when judging whether an rt task
>>>>>>> fits the cpu. The thermal pressure should also be considered when the
>>>>>>> schedutil governor gets the cpu util.
>>>>>>>
>>>>>>> Signed-off-by: Xuewen Yan <[email protected]>
>>>>>>> ---
>>>>>>> kernel/sched/cpufreq_schedutil.c | 1 +
>>>>>>> kernel/sched/rt.c | 1 +
>>>>>>> 2 files changed, 2 insertions(+)
>>>>>>>
>>>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>>>>>> index 3dbf351d12d5..285ad51caf0f 100644
>>>>>>> --- a/kernel/sched/cpufreq_schedutil.c
>>>>>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>>>>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>>>>>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
>>>>>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>>>>>>>
>>>>>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
>>>>>>
>>>>>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
>>>>>>
>>>>>> For the energy part (A) we use max' in compute_energy() to cap sum_util
>>>>>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
>>>>>> max'). This was done to match (B)'s `policy->max` capping.
>>>>>>
>>>>>> For the frequency part (B) we have freq_qos_update_request() in:
>>>>>>
>>>>>> power_actor_set_power()
>>>>>> ...
>>>>>> cdev->ops->set_cur_state()
>>>>>>
>>>>>> cpufreq_set_cur_state()
>>>>>> freq_qos_update_request() <-- !
>>>>>> arch_update_thermal_pressure()
>>>>>>
>>>>>> restricting `policy->max` which then clamps `target_freq` in:
>>>>>>
>>>>>> cpufreq_update_util()
>>>>>> ...
>>>>>> get_next_freq()
>>>>>> cpufreq_driver_resolve_freq()
>>>>>> __resolve_freq()
>>>>>>
>>>>>
>>>>> Do you mean that the "max" here will not affect the frequency
>>>>> conversion, so there is no need to change it?
>>>>> But is it better to reflect the influence of thermal here?
>>>>
>>>> I guess your point is that even though max' has no effect on frequency
>>>> since QOS caps policy->max anyway, it is still easier to understand the
>>>> dependency between schedutil and EAS/EM when it comes to the use of
>>>> thermal pressure.
>>>>
>>>> I agree. The way how the "hidden" policy->max capping guarantees that
>>>> schedutil and EAS/EM are doing the same even when only the latter uses
>>>> max' is not obvious.
>>>
>>> +1 here, IMO we shouldn't rely on hidden stuff. There are two drivers which
>>> set the thermal pressure, but one is not setting the freq_qos which causes
>>> the update of the 'policy->max'. So schedutil will send that high
>>> frequency but that driver would just ignore it and clamp internally. In the
>>> end we might argue it still works, but is it clean and visible from the
>>> code? Funny things might start to happen when the driver, which might be
>>> the last safety net, cannot capture this.
>
> schedutil reflects what the scheduler wants, not what the HW can do. If you
> start to cap the freq with arch_scale_thermal_pressure() in schedutil,

s/freq/util ?

To be precise, and maybe to fix some potential design issues: we are
talking here about the utilization and the max capacity set in the function
sugov_get_util(),
so the fields:

sugov_cpu::util
sugov_cpu::max /* max capacity */

> you will lose some opportunity to run at a higher frequency because
> arch_scale_thermal_pressure() is transient and might change just after
> using it. This means that you will stay at a lower freq after mitigation
> stops until a new cpufreq_update_util() happens. And I haven't even
> mentioned when thermal mitigation is managed by HW at a much higher
> frequency than what Linux can handle
>
> arch_scale_thermal_pressure() must not be used, but thermal_load_avg(),
> like scale_rt_capacity() does, as Dietmar suggested
>

First, I would like to see your view on this topic and why you are
making such strong statements. I have a slightly different view and
have made dozens of experiments with this thermal pressure in the last ~2-3y.

The code flow is like this and operates on those fields from above:

util, max <--- sugov_get_util()
util <--- sugov_iowait_apply() <--- util, max /* ignore this now */

get_next_freq():
util <--- map_util_perf() <--- util (1)
freq <--- map_util_freq() <--- util, max, max_freq (2)


At (1) we add +25% util, at (2) we do the conversion to frequency:
freq = max_freq * util / max

As you can see, with the patch we would still end up with a bigger
frequency than max_freq (since it can happen that max < util).
It's also true currently in mainline, when
max=1024 and util=1024+256=1280.
It would be similar if we cap max capacity:
max=800 and util=800+200=1000,
but then both cases are multiplied by 'max_freq' in (2).

As you can see this is not the situation that you have described, is it?
And the transient or non-transient aspect is minor here IMO.
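
For completeness, steps (1) and (2) are these mainline helpers (from
include/linux/sched/cpufreq.h, quoted from memory):

    /* (1): add the +25% headroom: util + util/4 */
    static inline unsigned long map_util_perf(unsigned long util)
    {
            return util + (util >> 2);
    }

    /* (2): freq = max_freq * util / max */
    static inline unsigned long map_util_freq(unsigned long util,
                                              unsigned long freq,
                                              unsigned long cap)
    {
            return freq * util / cap;
    }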

Secondly, you have mentioned the mitigation in HW and the issue between
the instantaneous vs. the PELT-averaged thermal pressure information. This
is something that I have been scratching my head over for a long time. I'm
trying to develop this for the new Arm FW thermal. You have mentioned:
'thermal mitigation is managed by HW at a much higher
frequency than what Linux can handle' - I would also be more
precise here: HW or FW? How often can the HW change the max freq, or
how often can the FW change it? If we don't have those numbers
then the statement 'a much higher' doesn't help in solving this
problem that Xuewen (and others) face. IMO it's not a technical
argument for blocking the patch and incremental development.

It's about timing, when we talk about thermal pressure signals and
those two pieces of information. For the PELT one there are also two use
cases: rising time and decay time (where we're actually increasing the
visible capacity of the CPU). The decay period is quite long,
e.g.
thermal pressure of 140 is removed, the signal should converge to 0 from 140
in 130ms (90% decayed),
in 230ms (fully decayed).
The default kernel code allows slowing down the decay period, which is
a derivative of the current global PELT default setting.
We can slow it down, but we cannot make it react faster (BTW I made
such a change to compare experiments). It's not always good to have
such long delays.
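
A back-of-envelope check of those numbers, using the standard PELT
half-life of 32ms (a factor of 0.5^(ms/32)); userspace sketch, not kernel
code:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            double pressure = 140.0;
            int ms;

            /* remaining thermal pressure after 'ms' milliseconds */
            for (ms = 0; ms <= 230; ms += 10)
                    printf("%3d ms: %5.1f\n", ms,
                           pressure * pow(0.5, ms / 32.0));
            return 0;
    }

At 130ms roughly 8 of the 140 remains (>90% decayed), and by ~230ms the
integer signal has effectively reached 0.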

For the asymmetric CPUs that Xuewen and I were describing, where a mid
core might be faster than a big one, we need this information in the RT class.
Android is such a system, so the situation is real (DL is not used there).
You have questioned this with:
'arch_scale_thermal_pressure() must not be used'
I wouldn't be so sure for the RT change.
Are you sure about that? Do you have experiments for it? I would
like to see them. I have run dozens of experiments and measurements
for this thermal pressure information on a few platforms: how
they behave on those task placements and what the thermal
signal decay delay impacts are. I'm still not sure which one is
best and thus have not proposed any changes. But I'll refactor my
test code and send a patch with a trace event for the new
topology_update_thermal_pressure(), which then allows comparing
those two designs and the nasty timing issues. We would then see
how often (if 'much higher' is true) platforms set this value.
Currently, in mainline there are two clients which set this
value.

I was investigating this PELT signal ~1-2 years ago and found
an issue where it's actually updated with delays for a long-idle CPU.
When one CPU was running fast and thermal throttling kicked in, while
the other was idle, the idle one didn't have recent thermal information,
but could be picked as a candidate because its visible capacity was ~max
possible - which is wrong because they both share the clock.
Check the function others_have_blocked() and the design around it.

That's why I'm a bit more careful with statements that one type of
information is better than the other.

Furthermore, check the code in rt_task_fits_capacity(): there is no
PELT signal from the RT task. There is only uclamp_eff_value() from
task 'p', which is not PELT information. So none of the involved variables
are PELT; why do you recommend the PELT thermal pressure here?

As I said, this patch for the RT class is an incremental step in the
right direction.
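
For reference, the whole function in question, as it looks in mainline at
the time of this patch (quoted from memory; the diff above shows its
tail):

    static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
    {
            unsigned int min_cap;
            unsigned int max_cap;
            unsigned int cpu_cap;

            /* Only heterogeneous systems can benefit from this check */
            if (!static_branch_unlikely(&sched_asym_cpucapacity))
                    return true;

            min_cap = uclamp_eff_value(p, UCLAMP_MIN);
            max_cap = uclamp_eff_value(p, UCLAMP_MAX);

            cpu_cap = capacity_orig_of(cpu);

            return cpu_cap >= min(min_cap, max_cap);
    }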


2022-04-20 11:54:51

by Vincent Guittot

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Tue, 19 Apr 2022 at 14:01, Lukasz Luba <[email protected]> wrote:
>
>
>
> On 4/19/22 08:14, Vincent Guittot wrote:
> > On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
> >>
> >> Hi Luba / Dietmar
> >>
> >> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> On 4/11/22 15:07, Dietmar Eggemann wrote:
> >>>> On 11/04/2022 10:52, Xuewen Yan wrote:
> >>>>> Hi Dietmar
> >>>>>
> >>>>> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On 07/04/2022 07:19, Xuewen Yan wrote:
> >>>>>>> There are cases when the cpu max capacity might be reduced due to thermal.
> >>>>>>> Take the thermal pressure into account when judging whether an rt task
> >>>>>>> fits the cpu. The thermal pressure should also be considered when the
> >>>>>>> schedutil governor gets the cpu util.
> >>>>>>>
> >>>>>>> Signed-off-by: Xuewen Yan <[email protected]>
> >>>>>>> ---
> >>>>>>> kernel/sched/cpufreq_schedutil.c | 1 +
> >>>>>>> kernel/sched/rt.c | 1 +
> >>>>>>> 2 files changed, 2 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >>>>>>> index 3dbf351d12d5..285ad51caf0f 100644
> >>>>>>> --- a/kernel/sched/cpufreq_schedutil.c
> >>>>>>> +++ b/kernel/sched/cpufreq_schedutil.c
> >>>>>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> >>>>>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
> >>>>>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> >>>>>>>
> >>>>>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
> >>>>>>
> >>>>>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
> >>>>>>
> >>>>>> For the energy part (A) we use max' in compute_energy() to cap sum_util
> >>>>>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
> >>>>>> max'). This was done to match (B)'s `policy->max` capping.
> >>>>>>
> >>>>>> For the frequency part (B) we have freq_qos_update_request() in:
> >>>>>>
> >>>>>> power_actor_set_power()
> >>>>>> ...
> >>>>>> cdev->ops->set_cur_state()
> >>>>>>
> >>>>>> cpufreq_set_cur_state()
> >>>>>> freq_qos_update_request() <-- !
> >>>>>> arch_update_thermal_pressure()
> >>>>>>
> >>>>>> restricting `policy->max` which then clamps `target_freq` in:
> >>>>>>
> >>>>>> cpufreq_update_util()
> >>>>>> ...
> >>>>>> get_next_freq()
> >>>>>> cpufreq_driver_resolve_freq()
> >>>>>> __resolve_freq()
> >>>>>>
> >>>>>
> >>>>> Do you mean that the "max" here will not affect the frequency
> >>>>> conversion, so there is no need to change it?
> >>>>> But is it better to reflect the influence of thermal here?
> >>>>
> >>>> I guess your point is that even though max' has no effect on frequency
> >>>> since QOS caps policy->max anyway, it is still easier to understand the
> >>>> dependency between schedutil and EAS/EM when it comes to the use of
> >>>> thermal pressure.
> >>>>
> >>>> I agree. The way how the "hidden" policy->max capping guarantees that
> >>>> schedutil and EAS/EM are doing the same even when only the latter uses
> >>>> max' is not obvious.
> >>>
> >>> +1 here, IMO we shouldn't rely on hidden stuff. There are two drivers which
> >>> set the thermal pressure, but one is not setting the freq_qos which causes
> >>> the update of the 'policy->max'. So schedutil will send that high
> >>> frequency but that driver would just ignore it and clamp internally. In the
> >>> end we might argue it still works, but is it clean and visible from the
> >>> code? Funny things might start to happen when the driver, which might be
> >>> the last safety net, cannot capture this.
> >
> > schedutil reflects what the scheduler wants, not what the HW can do. If you
> > start to cap the freq with arch_scale_thermal_pressure() in schedutil,
>
> s/freq/util ?

Isn't it the same in the end if you cap util to orig capacity -
arch_scale_thermal_pressure?

>
> To be precise, and maybe to fix some potential design issues: we are
> talking here about the utilization and the max capacity set in the function
> sugov_get_util(),
> so the fields:
>
> sugov_cpu::util
> sugov_cpu::max /* max capacity */

Yes. With this patch, util will be lowered by the current thermal
mitigation, whereas util normally reflects what we need, not what can
be provided

>
> > you will lose some opportunity to run at a higher frequency because
> > arch_scale_thermal_pressure() is transient and might change just after
> > using it. This means that you will stay at a lower freq after mitigation
> > stops until a new cpufreq_update_util() happens. And I haven't even
> > mentioned when thermal mitigation is managed by HW at a much higher
> > frequency than what Linux can handle
> >
> > arch_scale_thermal_pressure() must not be used, but thermal_load_avg(),
> > like scale_rt_capacity() does, as Dietmar suggested
> >
>
> First, I would like to see your view on this topic and why you are
> making such strong statements. I have a slightly different view and
> have made dozens of experiments with this thermal pressure in the last ~2-3y.
>
> The code flow is like this and operates on those fields from above:
>
> util, max <--- sugov_get_util()
> util <--- sugov_iowait_apply() <--- util, max /* ignore this now */
>
> get_next_freq():
> util <--- map_util_perf() <--- util (1)
> freq <--- map_util_freq() <--- util, max, max_freq (2)
>
>
> At (1) we add +25% util, at (2) we do the conversion to frequency:
> freq = max_freq * util / max
>
> As you can see, with the patch we would still end up with a bigger
> frequency than max_freq (since it can happen that max < util).
> It's also true currently in mainline, when
> max=1024 and util=1024+256=1280.
> It would be similar if we cap max capacity:
> max=800 and util=800+200=1000,

It's not because you end up with a similar value that it's ok. You
can't use one side to compensate for the other one. The 1.25 is there to
provide spare cycles for a change in the cpu load (load meaning what is
running on the cpu, not load_avg)

> but then both cases are multiplied by 'max_freq' in (2).
>
> As you can see this is not the situation that you have described, is it?
> And the transient or non-transient aspect is minor here IMO.

If max is 512 then util = 640 which is much lower than 1024.

>
> Secondly, you have mentioned the mitigation in HW and the issue between
> the instantaneous vs. the PELT-averaged thermal pressure information. This
> is something that I have been scratching my head over for a long time. I'm
> trying to develop this for the new Arm FW thermal. You have mentioned:
> 'thermal mitigation is managed by HW at a much higher
> frequency than what Linux can handle' - I would also be more
> precise here: HW or FW? How often can the HW change the max freq, or
> how often can the FW change it? If we don't have those numbers
> then the statement 'a much higher' doesn't help in solving this

'Much higher' means that Linux can't react fast enough and should not
try to sync, because it's a lost game

> problem that Xuewen (and others) face. IMO it's not a technical
> argument for blocking the patch and incremental development.
>
> It's about timing, when we talk about thermal pressure signals and
> those two pieces of information. For the PELT one there are also two use
> cases: rising time and decay time (where we're actually increasing the
> visible capacity of the CPU). The decay period is quite long,
> e.g.
> thermal pressure of 140 is removed, the signal should converge to 0 from 140
> in 130ms (90% decayed),
> in 230ms (fully decayed).
> The default kernel code allows slowing down the decay period, which is
> a derivative of the current global PELT default setting.
> We can slow it down, but we cannot make it react faster (BTW I made
> such a change to compare experiments). It's not always good to have
> such long delays.
>
> For the asymmetric CPUs that Xuewen and I were describing, where a mid
> core might be faster than a big one, we need this information in the RT class.
> Android is such a system, so the situation is real (DL is not used there).
> You have questioned this with:
> 'arch_scale_thermal_pressure() must not be used'
> I wouldn't be so sure for the RT change.
> Are you sure about that? Do you have experiments for it? I would
> like to see them. I have run dozens of experiments and measurements
> for this thermal pressure information on a few platforms: how
> they behave on those task placements and what the thermal
> signal decay delay impacts are. I'm still not sure which one is
> best and thus have not proposed any changes. But I'll refactor my
> test code and send a patch with a trace event for the new
> topology_update_thermal_pressure(), which then allows comparing
> those two designs and the nasty timing issues. We would then see
> how often (if 'much higher' is true) platforms set this value.
> Currently, in mainline there are two clients which set this
> value.
>
> I was investigating this PELT signal ~1-2 years ago and found
> an issue where it's actually updated with delays for a long-idle CPU.
> When one CPU was running fast and thermal throttling kicked in, while
> the other was idle, the idle one didn't have recent thermal information,
> but could be picked as a candidate because its visible capacity was ~max
> possible - which is wrong because they both share the clock.
> Check the function others_have_blocked() and the design around it.
>
> That's why I'm a bit more careful with statements that one type of
> information is better than the other.
>
> Furthermore, check the code in rt_task_fits_capacity(): there is no
> PELT signal from the RT task. There is only uclamp_eff_value() from
> task 'p', which is not PELT information. So none of the involved variables
> are PELT; why do you recommend the PELT thermal pressure here?
>
> As I said, this patch for the RT class is an incremental step in the
> right direction.
>
>

2022-04-21 13:58:02

by Xuewen Yan

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Qais

On Wed, Apr 20, 2022 at 9:51 PM Qais Yousef <[email protected]> wrote:
>
> Hi Xuewen
>
> Thanks for sending the patch. RT relationship with thermal pressure is an
> interesting topic :)
>
> On 04/07/22 13:19, Xuewen Yan wrote:
> > There are cases when the cpu max capacity might be reduced due to thermal.
> > Take the thermal pressure into account when judging whether an rt task
> > fits the cpu. The thermal pressure should also be considered when the
> > schedutil governor gets the cpu util.
>
> It would help to explain the mode of failure you're seeing here. What are you
> seeing?

I use this in an Android scenario; there are many RT processes in
Android. I do not want to set sched_uclamp_util_min_rt_default to
1024, as it would make the power consumption very high.
I used a compromise method, setting the value of
sysctl_sched_uclamp_util_min_rt_default to be bigger than the small
core capacity, but not so big that the frequency of the big core becomes
very high.
But when there are three clusters on the soc, the big core's capacity
often becomes smaller than the middle core's; when this happens, I want
the RT task to run on the middle cores instead of always on the big
core.
Taking the thermal pressure into account relieves this phenomenon.
>
> >
> > Signed-off-by: Xuewen Yan <[email protected]>
> > ---
> > kernel/sched/cpufreq_schedutil.c | 1 +
> > kernel/sched/rt.c | 1 +
> > 2 files changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 3dbf351d12d5..285ad51caf0f 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> >
> > + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
>
> Wouldn't this break the call to irq_scale_capacity() in effective_cpu_util()?
>
> > sg_cpu->max = max;
> > sg_cpu->bw_dl = cpu_bw_dl(rq);
> > sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,

It would destroy the irq_scale_capacity calculation, but indeed the cpu max
capacity has changed; is it better to use the real cpu capacity?

              max - irq
   U' = irq + --------- * U
                 max
I can't imagine how much of an impact this will have at the moment.
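
(The helper behind that formula - called irq_scale_capacity above, but
scale_irq_capacity() in mainline - is roughly:

    static unsigned long scale_irq_capacity(unsigned long util,
                                            unsigned long irq,
                                            unsigned long max)
    {
            util *= (max - irq);
            util /= max;

            return util;
    }

and effective_cpu_util() then adds the irq part back on top: util += irq.)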

> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index a32c46889af8..d9982ebd4821 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > cpu_cap = capacity_orig_of(cpu);
> > + cpu_cap -= arch_scale_thermal_pressure(cpu);
>
> Hmm I'm not a fan of this. By default all RT tasks have uclamp_min = 1024 to
> keep the default behavior of the system running at the max performance point.
>
> With this change, any tiny thermal pressure means all RT tasks will fail to fit
> on the biggest CPU. While this hint is not meant to be bulletproof, it
> shouldn't break that easily either. The highest performance point will still be
> on this CPU. The only exception is capacity inversion, where the bigs'
> performance goes below the mediums' under severe thermal circumstances. But
> then there are 2 issues.
>
> 1. This patch doesn't help with this case. It simply reverts to putting
> tasks 'randomly' and might still end up on this CPU. I can't see
> how this is better.
> 2. If we are under such severe thermal pressure, then things must be
> falling over badly anyway and I'm not sure we can still satisfy the
> perf requirements these tasks want anyway. Unless you're trying to
> keep these CPUs less busy to alleviate thermal pressure? This patch
> will not help achieving that either. Or I'm unable to see it if it
> does.

Yes, that is the problem it could lead to; maybe it needs more
consideration, such as selecting the cpus which have the least overutilization.

>
> It'd be good to explain the problem you're seeing and how this patch helped
> you.
>
> The only thing I can think of is that you have uclamp_min set to the medium
> CPUs' capacity, but due to thermal pressure they might fail to run at the
> highest frequency; hence by forcing them NOT to fit on the mediums you
> essentially make them run on the bigs, where they get a better chance of
> getting the perf they want.

Yes, I have changed the uclamp_min of RT, and use it on an android phone
whose soc has three clusters (small/middle/big). The scenario is as I
described earlier.

>
>
> Thanks
>
> --
> Qais Yousef
>
>
> >
> > return cpu_cap >= min(min_cap, max_cap);
> > }
> > --
> > 2.25.1
> >

Thanks!

BR
xuewen

2022-04-21 16:59:51

by Vincent Guittot

Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Tue, 19 Apr 2022 at 16:13, Lukasz Luba <[email protected]> wrote:
>
>
>
> On 4/19/22 13:51, Vincent Guittot wrote:
> > On Tue, 19 Apr 2022 at 14:01, Lukasz Luba <[email protected]> wrote:
> >>
> >>
> >>
> >> On 4/19/22 08:14, Vincent Guittot wrote:
> >>> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
> >>>>
> >>>> Hi Luba / Dietmar
> >>>>
> >>>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 4/11/22 15:07, Dietmar Eggemann wrote:
> >>>>>> On 11/04/2022 10:52, Xuewen Yan wrote:
> >>>>>>> Hi Dietmar
> >>>>>>>
> >>>>>>> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
> >>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 07/04/2022 07:19, Xuewen Yan wrote:
> >>>>>>>>> There are cases when the cpu max capacity might be reduced due to thermal.
> >>>>>>>>> Take the thermal pressure into account when judging whether an rt task
> >>>>>>>>> fits the cpu. The thermal pressure should also be considered when the
> >>>>>>>>> schedutil governor gets the cpu util.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Xuewen Yan <[email protected]>
> >>>>>>>>> ---
> >>>>>>>>> kernel/sched/cpufreq_schedutil.c | 1 +
> >>>>>>>>> kernel/sched/rt.c | 1 +
> >>>>>>>>> 2 files changed, 2 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >>>>>>>>> index 3dbf351d12d5..285ad51caf0f 100644
> >>>>>>>>> --- a/kernel/sched/cpufreq_schedutil.c
> >>>>>>>>> +++ b/kernel/sched/cpufreq_schedutil.c
> >>>>>>>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> >>>>>>>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
> >>>>>>>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> >>>>>>>>>
> >>>>>>>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
> >>>>>>>>
> >>>>>>>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
> >>>>>>>>
> >>>>>>>> For the energy part (A) we use max' in compute_energy() to cap sum_util
> >>>>>>>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
> >>>>>>>> max'). This was done to match (B)'s `policy->max` capping.
> >>>>>>>>
> >>>>>>>> For the frequency part (B) we have freq_qos_update_request() in:
> >>>>>>>>
> >>>>>>>> power_actor_set_power()
> >>>>>>>> ...
> >>>>>>>> cdev->ops->set_cur_state()
> >>>>>>>>
> >>>>>>>> cpufreq_set_cur_state()
> >>>>>>>> freq_qos_update_request() <-- !
> >>>>>>>> arch_update_thermal_pressure()
> >>>>>>>>
> >>>>>>>> restricting `policy->max` which then clamps `target_freq` in:
> >>>>>>>>
> >>>>>>>> cpufreq_update_util()
> >>>>>>>> ...
> >>>>>>>> get_next_freq()
> >>>>>>>> cpufreq_driver_resolve_freq()
> >>>>>>>> __resolve_freq()
> >>>>>>>>
> >>>>>>>
> >>>>>>> Do you mean that the "max" here will not affect the frequency
> >>>>>>> conversion, so there is no need to change it?
> >>>>>>> But is it better to reflect the influence of thermal here?
> >>>>>>
> >>>>>> I guess your point is that even though max' has no effect on frequency
> >>>>>> since QOS caps policy->max anyway, it is still easier to understand the
> >>>>>> dependency between schedutil and EAS/EM when it comes to the use of
> >>>>>> thermal pressure.
> >>>>>>
> >>>>>> I agree. The way how the "hidden" policy->max capping guarantees that
> >>>>>> schedutil and EAS/EM are doing the same even when only the latter uses
> >>>>>> max' is not obvious.
> >>>>>
> >>>>> +1 here, IMO we shouldn't rely on hidden stuff. There are two drivers which
> >>>>> set the thermal pressure, but one is not setting the freq_qos which causes
> >>>>> the update of the 'policy->max'. So schedutil will send that high
> >>>>> frequency but that driver would just ignore it and clamp internally. In the
> >>>>> end we might argue it still works, but is it clean and visible from the
> >>>>> code? Funny things might start to happen when the driver, which might be
> >>>>> the last safety net, cannot capture this.
> >>>
> >>> schedutil reflects what the scheduler wants, not what the HW can do. If you
> >>> start to cap the freq with arch_scale_thermal_pressure() in schedutil,
> >>
> >> s/freq/util ?
> >
> > Isn't it the same in the end if you cap util to orig capacity -
> > arch_scale_thermal_pressure?
>
> No, as I have shown you in the example calculation, 'max_freq'
> is always taken as a start and then scaled by the current 'util/max'.
> If 'max' was 1024 in both cases, then you might claim that
> we made a mistake and end up with an obviously too low frequency.
>
> That's why I asked you to be precise with your statements in the
> description when mentioning variables and signals.
>
> >
> >>
> >> To be precise, and maybe to fix some potential design issues: we are
> >> talking here about the utilization and the max capacity set in the function
> >> sugov_get_util(),
> >> so the fields:
> >>
> >> sugov_cpu::util
> >> sugov_cpu::max /* max capacity */
> >
> > Yes. With this patch, util will be lowered by the current thermal
> > mitigation, whereas util normally reflects what we need, not what can
> > be provided
>
> These are different requirements: util has to be capped at max capacity, and
> max capacity has to be the original max CPU capacity - for SchedUtil.
> OK, why? What does this requirement add to the design and the final values?

Because the calculation you are proposing is wrong and doesn't make
sense. Util is the average utilization of the cpu that has to be
compared with its original max capacity in order to get the freq that
matches this utilization.

We have freq = util / max * max_freq, and cpufreq will then cap the freq
if mitigation is applied. Once the mitigation disappears, the request
will go back to the targeted freq.

If you replace max by max' = max - arch_scale_thermal_pressure, then:

- by the time you do the calculation, arch_scale_thermal_pressure can
have changed and the result is meaningless. This is true whatever the
pace of updating arch_scale_thermal_pressure.

- you change the range of capacity to max' = max -
arch_scale_thermal_pressure and you scale it to max_freq. If util >
max', then you will ask for max_freq whether or not the util is really
close to max. Also you will ask for max_freq even if util is close to
but below max', whereas the mitigation doesn't impact the utilization.
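
As a worked example of the second point (my numbers, just for
illustration): take max = 1024, max' = 512 and an actual utilization of
600. Mainline requests

    freq = max_freq * (600 + 600/4) / 1024 ~= 0.73 * max_freq

whereas with the patch util is first capped at max' = 512, so

    freq = max_freq * (512 + 512/4) / 512   = 1.25 * max_freq

i.e. you request max_freq (after clamping) although the task only needs
~73% of it.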

>
> >
> >>
> >>> you will lose some opportunity to run at a higher frequency because
> >>> arch_scale_thermal_pressure() is transient and might change just after
> >>> using it. This means that you will stay at a lower freq after mitigation
> >>> stops until a new cpufreq_update_util() happens. And I haven't even
> >>> mentioned when thermal mitigation is managed by HW at a much higher
> >>> frequency than what Linux can handle
> >>>
> >>> arch_scale_thermal_pressure() must not be used, but thermal_load_avg(),
> >>> like scale_rt_capacity() does, as Dietmar suggested
> >>>
> >>
> >> First, I would like to see your view on this topic and why you are
> >> making such strong statements. I have a slightly different view and
> >> have made dozens of experiments with this thermal pressure in the last ~2-3y.
> >>
> >> The code flow is like this and operates on those fields from above:
> >>
> >> util, max <--- sugov_get_util()
> >> util <--- sugov_iowait_apply() <--- util, max /* ignore this now */
> >>
> >> get_next_freq():
> >> util <--- map_util_perf() <--- util (1)
> >> freq <--- map_util_freq() <--- util, max, max_freq (2)
> >>
> >>
> >> At (1) we add +25% util, at (2) we do the conversion to frequency:
> >> freq = max_freq * util / max
> >>
> >> As you can see, with the patch we would still end up with a bigger
> >> frequency than max_freq (since it can happen that max < util).
> >> It's also true currently in mainline, when
> >> max=1024 and util=1024+256=1280.
> >> It would be similar if we cap max capacity:
> >> max=800 and util=800+200=1000,
> >
> > It's not because you end up with a similar value that it's ok. You
> > can't use one side to compensate for the other one. The 1.25 is there to
> > provide spare cycles for a change in the cpu load (load meaning what is
> > running on the cpu, not load_avg)
>
> It's different. You've made a hard statement that we are going to break
> this selected frequency value, while IMO we aren't. It would behave
> the same. I don't compensate for anything; the divider ('max') has changed
> as well. The patch sets 'sg_cpu->max', it doesn't just use 'max' locally,
> so it's not the case that it produces a lower 'util' but then uses
> 'max=1024'.
>
> >
> >> but then both cases are multiplied by 'max_freq' in (2).
> >>
> >> As you can see this is not the situation that you have described, is it?
> >> And the transient or non-transient aspect is minor here IMO.
> >
> > If max is 512 then util = 640 which is much lower than 1024.
>
> What scenario is this?
> Is 1024 the utilization that we might have from the CPU rq?
> What is the original CPU capacity, 1024?
>
> >
> >>
> >> Secondly, you have mentioned the mitigation in HW and the issue between
> >> the instantaneous vs. the PELT-averaged thermal pressure information. This
> >> is something that I have been scratching my head over for a long time. I'm
> >> trying to develop this for the new Arm FW thermal. You have mentioned:
> >> 'thermal mitigation is managed by HW at a much higher
> >> frequency than what Linux can handle' - I would also be more
> >> precise here: HW or FW? How often can the HW change the max freq, or
> >> how often can the FW change it? If we don't have those numbers
> >> then the statement 'a much higher' doesn't help in solving this
> >
> > 'Much higher' means that Linux can't react fast enough and should not
> > try to sync, because it's a lost game
>
> As I said, 'much higher' is not a number to base a design on.

But that gives you the constraint that you can't expect to always be
synced with an up-to-date value, which is the most important point here.
This means that cpu_cap -= arch_scale_thermal_pressure(cpu) can be wrong
just after you compute it, and your decision is wrong.


> We need real numbers from real platforms. Currently we have two
> places where the thermal pressure is set:
> 1) cpufreq_cooling.c [1]
> 2) Qcom driver [2]
> (we might have 3rd soon for Arm SCMI+FW)

I don't have details but I have kHz in mind

>
> For the 2nd I would like to see numbers. For the 1st one, when
> kernel thermal is used (which supports a higher number of platforms
> compared to the Qcom driver), the kernel by design tries to control
> thermal, so changes are not that frequent.
>
> As for now, I can see in experiments that the 1st suffers long decay
> delays and also corner cases with long idle CPUs.
>
> >
> >> problem that Xuewen (and others) face. IMO it's not a technical
> >> argument for blocking the patch and incremental development.
> >>
> >> It's about timing, when we talk about thermal pressure signals and
> >> those two pieces of information. For the PELT one there are also two
> >> use cases: rise time and decay time (where we're actually increasing
> >> the visible capacity of the CPU). The decay period is quite long,
> >> e.g.
> >> thermal pressure of 140 is removed, and the signal should converge to
> >> 0 from 140 in 130ms (90% decayed),
> >> in 230ms (fully decayed).
> >> The default kernel code allows slowing down the decay period, which is
> >> derived from the current global PELT default setting.
> >> We can slow it down, but we cannot make it react faster (BTW I made
> >> such a change to compare experiments). It's not always good to have
> >> such long delays.
> >>
> >> For the asymmetric CPUs that I was describing (and also Xuewen), where
> >> a mid core might be faster than a big one, we need this information in
> >> the RT class. Android is such a system, so the situation is real (DL
> >> is not used there). You have questioned this with:
> >> 'arch_scale_thermal_pressure() must not be used'
> >> I wouldn't be so sure for the RT change.
> >> Are you sure about that? Do you have experiments for it? I would
> >> like to see them. I have run dozens of experiments and measurements
> >> for this thermal pressure information on a few platforms: how
> >> they behave for those task placements and what the thermal signal
> >> decay delay impacts are. I'm still not sure which one is
> >> best and thus haven't proposed any changes. But I'll refactor my
> >> test code and send a patch with a trace event for the new
> >> topology_update_thermal_pressure(), which then allows comparing
> >> those two designs and their nasty timing issues. We would then see
> >> how often (if 'much higher' is true) platforms set this value.
> >> Currently, in mainline there are two clients which set this
> >> value.
> >>
> >> I was investigating this PELT signal ~1-2 years ago and found
> >> an issue where it's actually updated with delays for a long-idle CPU.
> >> When one CPU was running fast and thermal throttling kicked in, while
> >> the other was idle, the idle one didn't have recent thermal information,
> >> but could be picked as a candidate because its visible capacity was
> >> ~max possible - which is wrong because they both share the clock.
> >> Check the function others_have_blocked() and the design around it.
> >>
> >> That's why I'm a bit more careful with statements that one type of
> >> information is better than the other.
> >>
> >> Furthermore, check the code in rt_task_fits_capacity(): there is no
> >> PELT signal from the RT task. There is only uclamp_eff_value() from
> >> task 'p', which is not PELT information. So all the involved variables
> >> are not PELT - why do you recommend the PELT thermal pressure here?
> >>
> >> As I said, this patch for the RT class is an incremental step in the
> >> right direction.
> >>
> >>
>
> You haven't answered my questions, which are about technical details of
> your recommendations and statements.
>
> I'm trying to help Xuewen solve his/her issues with the RT class
> incrementally. I don't want to push him/her into the deep dark water
> of PELT signals: which variables to compare them with, and the corner
> cases when they are (or are not) updated or not implemented at all.
> I'm not even sure if those extra complexities make sense for RT/DL
> (since they make some difference on big.mid.little specific platforms
> but not for the rest).

2022-04-22 16:12:55

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity



On 4/19/22 13:51, Vincent Guittot wrote:
> On Tue, 19 Apr 2022 at 14:01, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 4/19/22 08:14, Vincent Guittot wrote:
>>> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
>>>>
>>>> Hi Luba / Dietmar
>>>>
>>>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 4/11/22 15:07, Dietmar Eggemann wrote:
>>>>>> On 11/04/2022 10:52, Xuewen Yan wrote:
>>>>>>> HI Dietmar
>>>>>>>
>>>>>>> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 07/04/2022 07:19, Xuewen Yan wrote:
>>>>>>>>> There are cases when the cpu max capacity might be reduced due to thermal.
>>>>>>>>> Take the thermal pressure into account when judging whether the rt task
>>>>>>>>> fits the cpu. And when the schedutil governor gets the cpu util, the
>>>>>>>>> thermal pressure should also be considered.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Xuewen Yan <[email protected]>
>>>>>>>>> ---
>>>>>>>>> kernel/sched/cpufreq_schedutil.c | 1 +
>>>>>>>>> kernel/sched/rt.c | 1 +
>>>>>>>>> 2 files changed, 2 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>>>>>>>> index 3dbf351d12d5..285ad51caf0f 100644
>>>>>>>>> --- a/kernel/sched/cpufreq_schedutil.c
>>>>>>>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>>>>>>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>>>>>>>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
>>>>>>>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>>>>>>>>>
>>>>>>>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
>>>>>>>>
>>>>>>>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
>>>>>>>>
>>>>>>>> For the energy part (A) we use max' in compute_energy() to cap sum_util
>>>>>>>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
>>>>>>>> max'). This was done to match (B)'s `policy->max` capping.
>>>>>>>>
>>>>>>>> For the frequency part (B) we have freq_qos_update_request() in:
>>>>>>>>
>>>>>>>> power_actor_set_power()
>>>>>>>> ...
>>>>>>>> cdev->ops->set_cur_state()
>>>>>>>>
>>>>>>>> cpufreq_set_cur_state()
>>>>>>>> freq_qos_update_request() <-- !
>>>>>>>> arch_update_thermal_pressure()
>>>>>>>>
>>>>>>>> restricting `policy->max` which then clamps `target_freq` in:
>>>>>>>>
>>>>>>>> cpufreq_update_util()
>>>>>>>> ...
>>>>>>>> get_next_freq()
>>>>>>>> cpufreq_driver_resolve_freq()
>>>>>>>> __resolve_freq()
>>>>>>>>
>>>>>>>
>>>>>>> Do you mean that the "max" here will not affect the frequency
>>>>>>> conversion, so there is no need to change it?
>>>>>>> But is it better to reflect the influence of thermal here?
>>>>>>
>>>>>> I guess your point is that even though max' has no effect on frequency
>>>>>> since QOS caps policy->max anyway, it is still easier to understand the
>>>>>> dependency between schedutil and EAS/EM when it comes to the use of
>>>>>> thermal pressure.
>>>>>>
>>>>>> I agree. The way how the "hidden" policy->max capping guarantees that
>>>>>> schedutil and EAS/EM are doing the same even when only the latter uses
>>>>>> max' is not obvious.
>>>>>
>>>>> +1 here, IMO we shouldn't rely on hidden stuff. There are two drivers
>>>>> which set the thermal pressure, but one is not setting the freq_qos
>>>>> which causes the update of 'policy->max'. So schedutil will send that
>>>>> high frequency, but the driver will just ignore it and clamp internally.
>>>>> In the end we might argue it still works, but is it clean and visible
>>>>> from the code? Funny things might start to happen when the driver,
>>>>> which might be the last safety net, cannot capture this.
>>>
>>> schedutil reflects what the scheduler wants, not what the HW can do. If you
>>> start to cap the freq with arch_scale_thermal_pressure() in schedutil,
>>
>> s/freq/util ?
>
> Isn't it the same in the end if you cap util to orig capacity -
> arch_scale_thermal_pressure() ?

No, as I have shown you in the example calculation, 'max_freq'
is always taken as the start and then scaled by the current 'util/max'.
If 'max' were 1024 in both cases, then you might claim that
we made a mistake and ended up with an obviously too low frequency.

That's why I asked you to be precise with your statements in the
description when mentioning variables and signals.

>
>>
>> To be precise, and maybe to fix some potential design issues: we are
>> talking here about the utilization and the max capacity set in function:
>> sugov_get_util()
>> so fields:
>>
>> sugov_cpu::util
>> sugov_cpu::max /* max capacity */
>
> Yes. With this patch, util will be lowered by the current thermal
> mitigation, whereas util normally reflects what we need, not what can
> be provided

This is a different requirement: util has to be max capacity and
max capacity has to be the original max CPU capacity - for SchedUtil.
OK, why? What does this requirement add to the design and final values?

>
>>
>>> you will lose some opportunity to run at a higher frequency because
>>> arch_scale_thermal_pressure() is transient and might change just after
>>> using it. This means that you will stay at a lower freq after mitigation
>>> stops until a new cpufreq_update_util() happens. And I haven't even
>>> mentioned the case when thermal mitigation is managed by HW at a much
>>> higher frequency than what Linux can handle
>>>
>>> arch_scale_thermal_pressure() must not be used; use thermal_load_avg()
>>> instead, like scale_rt_capacity() does, as Dietmar suggested
>>>
>>
>> First, I would like to see your view on this topic and why you are
>> making such strong statements. I have a slightly different view and
>> have made dozens of experiments with this thermal pressure in the
>> last ~2-3y.
>>
>> The code flow is like this and operates on those fields from above:
>>
>> util, max <--- sugov_get_util()
>> util <--- sugov_iowait_apply() <--- util, max /* ignore this now */
>>
>> get_next_freq():
>> util <--- map_util_perf() <--- util (1)
>> freq <--- map_util_freq() <--- util, max, max_freq (2)
>>
>>
>> At (1) we add +25% util, at (2) we do the conversion to frequency:
>> freq = max_freq * util / max
>>
>> As you can see with the patch we would still end up with bigger
>> frequency than max_freq (since it can happen: max < util).
>> It's also true currently in mainline, when
>> max=1024 and util=1024+256=1280
>> It would be similar if we cap max capacity:
>> max=800 and util=800+200=1000
>
> It's not because you end up with a similar value that it's ok. You
> can't use one side to compensate for the other one. The 1.25 is there
> to provide spare cycles for a change in the cpu load (load meaning
> what is running on the cpu, not load_avg)

It's different. You've made a hard statement that we are going to break
this frequency selection, while IMO we aren't. It would behave
the same. I don't compensate anything; the divider ('max') has changed
as well. The patch sets 'sg_cpu->max', not just a local 'max', so it's
not the case that we produce a lower 'util' but then divide by
'max=1024'.
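
To spell out the arithmetic from (1) and (2) above (a sketch only,
mirroring get_next_freq(); the helper name next_freq() is mine):

static unsigned long next_freq(unsigned long util, unsigned long max,
                               unsigned long max_freq)
{
        util += util >> 2;              /* (1) map_util_perf(): +25% */
        return max_freq * util / max;   /* (2) map_util_freq() */
}

/*
 * mainline: next_freq(1024, 1024, F) == F * 1280 / 1024 == 1.25 * F
 * patched:  next_freq( 800,  800, F) == F * 1000 /  800 == 1.25 * F
 */

In both cases the result scales from 'max_freq' by the same 1.25 factor,
because the divider changes together with the utilization.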

>
>> but then in both cases they are multiplied by 'max_freq' in (2)
>>
>> As you can see this is not the situation that you have described, is it?
>> And the transient or non-transient is minor here IMO.
>
> If max is 512 then util = 640 which is much lower than 1024.

What scenario is this?
Is 1024 the utilization that we might have from the CPU rq?
What is the original CPU capacity, 1024?

>
>>
>> Secondly, you have mentioned the mitigation in HW and the issue between
>> the instantaneous vs. the PELT thermal pressure information. This is
>> something that I've been scratching my head over for a long time. I'm
>> trying to develop this for the new Arm FW thermal. You have mentioned:
>> 'thermal mitigation is managed by HW at a much higher
>> frequency than what Linux can handle' - I would also be more
>> precise here: HW or FW? How often can the HW change max freq, or
>> how often can the FW change that? If we don't have those numbers,
>> then the statement 'a much higher' doesn't help in solving this
>
> By 'much higher' I mean that Linux can't react fast enough and should
> not try to sync, because it's a losing game

As I said, 'much higher' is not a number to base a design on.
We need real numbers from real platforms. Currently we have two
places where the thermal pressure is set:
1) cpufreq_cooling.c [1]
2) Qcom driver [2]
(we might soon have a 3rd for Arm SCMI+FW)

For the 2nd I would like to see numbers. For the 1st one, when
kernel thermal is used (which supports a higher number of platforms
compared to the Qcom driver), the kernel by design tries to control
thermal, so changes are not that frequent.

As of now, I can see in experiments that the 1st suffers long decay
delays and also corner cases with long-idle CPUs.

>
>> problem that Xuewen (and others) face. IMO it's not a technical
>> argument for blocking the patch and incremental development.
>>
>> It's about timing, when we talk about thermal pressure signals and
>> those two pieces of information. For the PELT one there are also two
>> use cases: rise time and decay time (where we're actually increasing
>> the visible capacity of the CPU). The decay period is quite long,
>> e.g.
>> thermal pressure of 140 is removed, and the signal should converge to
>> 0 from 140 in 130ms (90% decayed),
>> in 230ms (fully decayed).
>> The default kernel code allows slowing down the decay period, which is
>> derived from the current global PELT default setting.
>> We can slow it down, but we cannot make it react faster (BTW I made
>> such a change to compare experiments). It's not always good to have
>> such long delays.
>>
>> For the asymmetric CPUs that I was describing (and also Xuewen), where
>> a mid core might be faster than a big one, we need this information in
>> the RT class. Android is such a system, so the situation is real (DL
>> is not used there). You have questioned this with:
>> 'arch_scale_thermal_pressure() must not be used'
>> I wouldn't be so sure for the RT change.
>> Are you sure about that? Do you have experiments for it? I would
>> like to see them. I have run dozens of experiments and measurements
>> for this thermal pressure information on a few platforms: how
>> they behave for those task placements and what the thermal signal
>> decay delay impacts are. I'm still not sure which one is
>> best and thus haven't proposed any changes. But I'll refactor my
>> test code and send a patch with a trace event for the new
>> topology_update_thermal_pressure(), which then allows comparing
>> those two designs and their nasty timing issues. We would then see
>> how often (if 'much higher' is true) platforms set this value.
>> Currently, in mainline there are two clients which set this
>> value.
>>
>> I was investigating this PELT signal ~1-2 years ago and found
>> an issue where it's actually updated with delays for a long-idle CPU.
>> When one CPU was running fast and thermal throttling kicked in, while
>> the other was idle, the idle one didn't have recent thermal information,
>> but could be picked as a candidate because its visible capacity was
>> ~max possible - which is wrong because they both share the clock.
>> Check the function others_have_blocked() and the design around it.
>>
>> That's why I'm a bit more careful with statements that one type of
>> information is better than the other.
>>
>> Furthermore, check the code in rt_task_fits_capacity(): there is no
>> PELT signal from the RT task. There is only uclamp_eff_value() from
>> task 'p', which is not PELT information. So all the involved variables
>> are not PELT - why do you recommend the PELT thermal pressure here?
>>
>> As I said, this patch for the RT class is an incremental step in the
>> right direction.
>>
>>

You haven't answered my questions, which are about technical details of
your recommendations and statements.

I'm trying to help Xuewen solve his/her issues with the RT class
incrementally. I don't want to push him/her into the deep dark water
of PELT signals: which variables to compare them with, and the corner
cases when they are (or are not) updated or not implemented at all.
I'm not even sure if those extra complexities make sense for RT/DL
(since they make some difference on big.mid.little specific platforms
but not for the rest).

2022-04-22 17:22:11

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 04/21/22 16:07, Xuewen Yan wrote:
> Hi Qais
>
> On Wed, Apr 20, 2022 at 9:51 PM Qais Yousef <[email protected]> wrote:
> >
> > Hi Xuewen
> >
> > Thanks for sending the patch. RT relationship with thermal pressure is an
> > interesting topic :)
> >
> > On 04/07/22 13:19, Xuewen Yan wrote:
> > > There are cases when the cpu max capacity might be reduced due to thermal.
> > > Take the thermal pressure into account when judging whether the rt task
> > > fits the cpu. And when the schedutil governor gets the cpu util, the
> > > thermal pressure should also be considered.
> >
> > It would help to explain the mode of failure you're seeing here. What are you
> > seeing?
>
> I use this in an Android scenario; there are many RT processes in
> Android. I do not want to set sched_uclamp_util_min_rt_default to
> 1024, as it would make the power consumption very high.
> I used a compromise method, setting the value of
> sysctl_sched_uclamp_util_min_rt_default to be bigger than the small
> core capacity, but not so big that the frequency of the big core
> becomes very high.
> But when there are three clusters on the SoC, the big core's capacity
> often becomes smaller than the middle core's; when this happens, I want
> the RT task to be able to run on the middle cores instead of always on
> the big core.
> Considering the thermal pressure could relieve this phenomenon.

Thanks for the explanation. It's worth putting some of this in the changelog in
the next versions.

So the problem is as I suspected, but capacity inversion is the major cause of
grief.

Is it okay to share what the capacities of the littles, mediums and bigs are
on your system? And how they change under a worst-case thermal pressure
scenario? Only IF you have these numbers handy :-)

Is it actually an indication of another potential problem if you swing into
capacity inversion on the bigs that often? I've seen a lot of systems where
the difference between the meds and bigs is small. But frequent inversion
could still be suspicious.

Do the littles and the mediums experience any significant thermal pressure too?

> >
> > >
> > > Signed-off-by: Xuewen Yan <[email protected]>
> > > ---
> > > kernel/sched/cpufreq_schedutil.c | 1 +
> > > kernel/sched/rt.c | 1 +
> > > 2 files changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 3dbf351d12d5..285ad51caf0f 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> > >
> > > + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
> >
> > Wouldn't this break the call to scale_irq_capacity() in effective_cpu_util()?
> >
> > > sg_cpu->max = max;
> > > sg_cpu->bw_dl = cpu_bw_dl(rq);
> > > sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
>
> It would destroy the scale_irq_capacity() calculation, but indeed the cpu
> max capacity has changed; is it better to use the real cpu capacity?
>
> U' = irq + ((max - irq) / max) * U
>
> I can't imagine how much of an impact this will have at the moment.

It doesn't seem it'll cause a significant error, but it still seems to me
that this function wants the original capacity passed to it.
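
(For reference, the irq scaling in question - roughly what mainline's
scale_irq_capacity() does, which is why 'max' is assumed to be the
original capacity:)

static inline unsigned long scale_irq_capacity(unsigned long util,
                                               unsigned long irq,
                                               unsigned long max)
{
        /* Scale 'util' by the fraction of 'max' not eaten by irq */
        util *= (max - irq);
        util /= max;

        return util;
}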

There are similar questions to be asked since you modify sg_cpu->max. Every
user needs to be audited to see if they're fine with this change or not.

I'm still not sure what we are achieving here. You want to force schedutil not
to request higher frequencies if thermal pressure is high? Should schedutil
actually care? Shouldn't the cpufreq driver reject this request and pick the
next best thing if it can't satisfy it? I could be missing something, I haven't
looked that hard tbh :-)

>
> > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > index a32c46889af8..d9982ebd4821 100644
> > > --- a/kernel/sched/rt.c
> > > +++ b/kernel/sched/rt.c
> > > @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > >
> > > cpu_cap = capacity_orig_of(cpu);
> > > + cpu_cap -= arch_scale_thermal_pressure(cpu);
> >
> > Hmm I'm not a fan of this. By default all RT tasks have uclamp_min = 1024 to
> > keep the default behavior of the system running at the max performance point.
> >
> > With this change, any tiny thermal pressure means all RT tasks will fail to
> > fit on the biggest CPU. While this hint is not meant to be bulletproof, it
> > shouldn't break that easily either. The highest performance point will still
> > be on this CPU. The only exception is capacity inversion, where the bigs'
> > performance goes below the mediums' under severe thermal circumstances. But
> > then there are 2 issues.
> >
> > 1. This patch doesn't help with this case. It simply reverts to placing
> > tasks 'randomly', and they might still end up on this CPU. I can't see
> > how this is better.
> > 2. If we are under such severe thermal pressure, then things must be
> > falling over badly anyway and I'm not sure we can still satisfy the
> > perf requirements these tasks want anyway. Unless you're trying to
> > keep these CPUs less busy to alleviate thermal pressure? This patch
> > will not help achieve that either. Or I'm unable to see it if it
> > does.
>
> Yes, it is the problem this would lead to; maybe it needs more
> consideration, like selecting the cpus which have the min overutilization.

It depends on the severity of the problem. The simplest thing I can suggest is
to check if the cpu is in a capacity inversion state, and if it is, then make
rt_task_fits_capacity() return false always.

If we need a generic solution to handle thermal pressure omitting OPPs, then
the search needs to become more complex. The proposal in this patch is not
adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
omit some cpus because of any tiny thermal pressure. For example if the
capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
any small thermal pressure on mediums will cause these tasks to run on big cpus
only, which is not what we want. Especially if these big cpus can end up in
capacity inversion later ;-)

So if we want to handle this case, then we need to ensure the search returns
false only if

1. Thermal pressure results in a real OPP being omitted.
2. Another CPU that can provide this performance level is available.

Otherwise we should still fit it on this CPU because it'll give us the closest
thing to what was requested.

I can think of 2 ways to implement this, but none of them seem particularly
pretty :-/
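
For the simple option, roughly something like this (cpu_in_capacity_inversion()
is a made-up helper just to show the shape - mainline has nothing like it, and
a real implementation would have to be much cheaper than scanning all CPUs on
every fitness check):

/*
 * Sketch: a CPU is in capacity inversion when its thermally reduced
 * capacity drops below the *original* capacity of a CPU from a lower
 * capacity class.
 */
static bool cpu_in_capacity_inversion(int cpu)
{
        unsigned long cap = capacity_orig_of(cpu) -
                            arch_scale_thermal_pressure(cpu);
        int i;

        for_each_possible_cpu(i) {
                if (capacity_orig_of(i) < capacity_orig_of(cpu) &&
                    capacity_orig_of(i) > cap)
                        return true;
        }

        return false;
}

rt_task_fits_capacity() would then simply return false when it sees such a
CPU.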


Thanks

--
Qais Yousef

2022-04-22 18:09:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
>
> Hi Luba / Dietmar
>
> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
> >
> >
> >
> > On 4/11/22 15:07, Dietmar Eggemann wrote:
> > > On 11/04/2022 10:52, Xuewen Yan wrote:
> > >> HI Dietmar
> > >>
> > >> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
> > >> <[email protected]> wrote:
> > >>>
> > >>> On 07/04/2022 07:19, Xuewen Yan wrote:
> > >>>> There are cases when the cpu max capacity might be reduced due to thermal.
> > >>>> Take the thermal pressure into account when judging whether the rt task
> > >>>> fits the cpu. And when the schedutil governor gets the cpu util, the
> > >>>> thermal pressure should also be considered.
> > >>>>
> > >>>> Signed-off-by: Xuewen Yan <[email protected]>
> > >>>> ---
> > >>>> kernel/sched/cpufreq_schedutil.c | 1 +
> > >>>> kernel/sched/rt.c | 1 +
> > >>>> 2 files changed, 2 insertions(+)
> > >>>>
> > >>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > >>>> index 3dbf351d12d5..285ad51caf0f 100644
> > >>>> --- a/kernel/sched/cpufreq_schedutil.c
> > >>>> +++ b/kernel/sched/cpufreq_schedutil.c
> > >>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > >>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
> > >>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> > >>>>
> > >>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
> > >>>
> > >>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
> > >>>
> > >>> For the energy part (A) we use max' in compute_energy() to cap sum_util
> > >>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
> > >>> max'). This was done to match (B)'s `policy->max` capping.
> > >>>
> > >>> For the frequency part (B) we have freq_qos_update_request() in:
> > >>>
> > >>> power_actor_set_power()
> > >>> ...
> > >>> cdev->ops->set_cur_state()
> > >>>
> > >>> cpufreq_set_cur_state()
> > >>> freq_qos_update_request() <-- !
> > >>> arch_update_thermal_pressure()
> > >>>
> > >>> restricting `policy->max` which then clamps `target_freq` in:
> > >>>
> > >>> cpufreq_update_util()
> > >>> ...
> > >>> get_next_freq()
> > >>> cpufreq_driver_resolve_freq()
> > >>> __resolve_freq()
> > >>>
> > >>
> > >> Do you mean that the "max" here will not affect the frequency
> > >> conversion, so there is no need to change it?
> > >> But is it better to reflect the influence of thermal here?
> > >
> > > I guess your point is that even though max' has no effect on frequency
> > > since QOS caps policy->max anyway, it is still easier to understand the
> > > dependency between schedutil and EAS/EM when it comes to the use of
> > > thermal pressure.
> > >
> > > I agree. The way how the "hidden" policy->max capping guarantees that
> > > schedutil and EAS/EM are doing the same even when only the latter uses
> > > max' is not obvious.
> >
> > +1 here, IMO we shouldn't rely on hidden stuff. There are two drivers
> > which set the thermal pressure, but one is not setting the freq_qos
> > which causes the update of 'policy->max'. So schedutil will send that
> > high frequency, but the driver will just ignore it and clamp internally.
> > In the end we might argue it still works, but is it clean and visible
> > from the code? Funny things might start to happen when the driver,
> > which might be the last safety net, cannot capture this.

schedutil reflects what the scheduler wants, not what the HW can do. If you
start to cap the freq with arch_scale_thermal_pressure() in schedutil,
you will lose some opportunity to run at a higher frequency because
arch_scale_thermal_pressure() is transient and might change just after
using it. This means that you will stay at a lower freq after mitigation
stops until a new cpufreq_update_util() happens. And I haven't even
mentioned the case when thermal mitigation is managed by HW at a much
higher frequency than what Linux can handle

arch_scale_thermal_pressure() must not be used; use thermal_load_avg()
instead, like scale_rt_capacity() does, as Dietmar suggested
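
For reference, this is roughly what scale_rt_capacity() looks like in
mainline kernel/sched/fair.c (paraphrased, comments mine) - note that it
uses thermal_load_avg(), not the instantaneous value:

static unsigned long scale_rt_capacity(int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        unsigned long max = arch_scale_cpu_capacity(cpu);
        unsigned long used, free;
        unsigned long irq;

        irq = cpu_util_irq(rq);

        if (unlikely(irq >= max))
                return 1;

        /*
         * Subtract RT and DL utilization plus the PELT-averaged
         * thermal pressure from the original capacity.
         */
        used = READ_ONCE(rq->avg_rt.util_avg);
        used += READ_ONCE(rq->avg_dl.util_avg);
        used += thermal_load_avg(rq);

        if (unlikely(used >= max))
                return 1;

        free = max - used;

        return scale_irq_capacity(free, irq, max);
}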

> >
> > We should also be OK with the energy estimation and the CPU capacity vs.
> > task utilization comparisons, since the thermal pressure is accounted
> > for there* (until thermal is controlled in FW rather than in the kernel,
> > which is where we are heading with the scmi-cpufreq work mentioned in
> > this cover letter [1]).
>
> IMO, if so, we don't want to modify the original code, but we also need
> to consider the impact of thermal; maybe it is possible to add a new
> macro definition
> like this:
>
> #define arch_scale_cpu_capacity_except_thermal(cpu) \
> 	(arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu))
>
> >
> > >
> > > I just wanted to mention the historical reason why the code looks like
> > > it does today.
> > >
> > >>> [...]
> > >>>
> > >>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > >>>> index a32c46889af8..d9982ebd4821 100644
> > >>>> --- a/kernel/sched/rt.c
> > >>>> +++ b/kernel/sched/rt.c
> > >>>> @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > >>>> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > >>>>
> > >>>> cpu_cap = capacity_orig_of(cpu);
> > >>>> + cpu_cap -= arch_scale_thermal_pressure(cpu);
> > >>>>
> > >>>> return cpu_cap >= min(min_cap, max_cap);
> > >>>> }
> > >>>
> > >>> IMHO, this should follow what we do with rq->cpu_capacity
> > >>> (capacity_of(), the remaining capacity for CFS). E.g. we use
> > >>> capacity_of() in find_energy_efficient_cpu() and select_idle_capacity()
> > >>> to compare capacities. So we would need a function like
> > >>> scale_rt_capacity() for RT (minus the rq->avg_rt.util_avg) but then also
> > >>> one for DL (minus rq->avg_dl.util_avg and rq->avg_rt.util_avg).
> > >>
> > >> It's a really good idea. And do you already have the corresponding patch?
> > >> If so, could you give me the corresponding link?
> > >
> > > No, I don't have any code for this. Should be trivial though. But the
> > > issue is that the update would probably have to be decoupled from CFS
> > > load_balance (update_group_capacity()) and the use cases in RT/DL are
> > > only valid for asymmetric CPU capacity systems.
> >
> > Having those in mind, I would vote for fixing it incrementally.
> > Thus, IMHO this patch is worth taking. Later we might think about how
> > to better estimate the real CPU capacity visible to the RT and DL classes.
> > In this shape it is good for many systems which only use the RT class,
> > but not DL. Those systems w/ RT and w/o DL might suffer on some
> > asymmetric CPU platforms where medium cores have a capacity of e.g. 850
> > and thermal pressure has reduced the big cores' capacity by 250, making
> > it 774.
> >
>
> You mean that until there is a better way to handle RT capacity, we
> can take this patch temporarily?
> If so, I can update the patch so that it just fixes rt.c.
>
> In fact, in the mobile phone usage scenario where the cpu contains 3
> clusters (small/middle/big),
> the capacity of the big core will become smaller than that of
> the middle core due to the thermal effect.
> At this time, we do not want the big core CPU to be used as an RT
> task's candidate cpu.
>
> > Regards,
> > Lukasz
> >
> > [1]
> > https://lore.kernel.org/linux-pm/[email protected]/
>
>
> BR
> xuewen.yan

2022-04-22 20:39:12

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Xuewen

Thanks for sending the patch. RT relationship with thermal pressure is an
interesting topic :)

On 04/07/22 13:19, Xuewen Yan wrote:
> There are cases when the cpu max capacity might be reduced due to thermal.
> Take the thermal pressure into account when judging whether the rt task
> fits the cpu. And when the schedutil governor gets the cpu util, the
> thermal pressure should also be considered.

It would help to explain the mode of failure you're seeing here. What are you
seeing?

>
> Signed-off-by: Xuewen Yan <[email protected]>
> ---
> kernel/sched/cpufreq_schedutil.c | 1 +
> kernel/sched/rt.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 3dbf351d12d5..285ad51caf0f 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>
> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);

Wouldn't this break the call to scale_irq_capacity() in effective_cpu_util()?

> sg_cpu->max = max;
> sg_cpu->bw_dl = cpu_bw_dl(rq);
> sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a32c46889af8..d9982ebd4821 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
>
> cpu_cap = capacity_orig_of(cpu);
> + cpu_cap -= arch_scale_thermal_pressure(cpu);

Hmm I'm not a fan of this. By default all RT tasks have uclamp_min = 1024 to
keep the default behavior of the system running at the max performance point.

With this change, any tiny thermal pressure means all RT tasks will fail to
fit on the biggest CPU. While this hint is not meant to be bulletproof, it
shouldn't break that easily either. The highest performance point will still
be on this CPU. The only exception is capacity inversion, where the bigs'
performance goes below the mediums' under severe thermal circumstances. But
then there are 2 issues.

1. This patch doesn't help with this case. It simply reverts to placing
tasks 'randomly', and they might still end up on this CPU. I can't see
how this is better.
2. If we are under such severe thermal pressure, then things must be
falling over badly anyway and I'm not sure we can still satisfy the
perf requirements these tasks want anyway. Unless you're trying to
keep these CPUs less busy to alleviate thermal pressure? This patch
will not help achieve that either. Or I'm unable to see it if it
does.

It'd be good to explain the problem you're seeing and how this patch helped
you.

The only thing I can think of is that you have uclamp_min set to the medium
CPUs' capacity, but due to thermal pressure they might fail to run at the
highest frequency; hence by forcing the tasks NOT to fit on the mediums you
essentially make them run on the bigs, where they get a better chance of
getting the perf they want.


Thanks

--
Qais Yousef


>
> return cpu_cap >= min(min_cap, max_cap);
> }
> --
> 2.25.1
>

2022-04-22 22:37:29

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity



On 4/21/22 09:29, Vincent Guittot wrote:
> On Tue, 19 Apr 2022 at 16:13, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 4/19/22 13:51, Vincent Guittot wrote:
>>> On Tue, 19 Apr 2022 at 14:01, Lukasz Luba <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 4/19/22 08:14, Vincent Guittot wrote:
>>>>> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
>>>>>>
>>>>>> Hi Luba / Dietmar
>>>>>>
>>>>>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 4/11/22 15:07, Dietmar Eggemann wrote:
>>>>>>>> On 11/04/2022 10:52, Xuewen Yan wrote:
>>>>>>>>> HI Dietmar
>>>>>>>>>
>>>>>>>>> On Mon, Apr 11, 2022 at 4:21 PM Dietmar Eggemann
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 07/04/2022 07:19, Xuewen Yan wrote:
>>>>>>>>>>> There are cases when the cpu max capacity might be reduced due to thermal.
>>>>>>>>>>> Take the thermal pressure into account when judging whether the rt task
>>>>>>>>>>> fits the cpu. And when the schedutil governor gets the cpu util, the
>>>>>>>>>>> thermal pressure should also be considered.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Xuewen Yan <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>> kernel/sched/cpufreq_schedutil.c | 1 +
>>>>>>>>>>> kernel/sched/rt.c | 1 +
>>>>>>>>>>> 2 files changed, 2 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>>>>>>>>>>> index 3dbf351d12d5..285ad51caf0f 100644
>>>>>>>>>>> --- a/kernel/sched/cpufreq_schedutil.c
>>>>>>>>>>> +++ b/kernel/sched/cpufreq_schedutil.c
>>>>>>>>>>> @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>>>>>>>>>>> struct rq *rq = cpu_rq(sg_cpu->cpu);
>>>>>>>>>>> unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
>>>>>>>>>>>
>>>>>>>>>>> + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
>>>>>>>>>>
>>>>>>>>>> max' = arch_scale_cpu_capacity() - arch_scale_thermal_pressure()
>>>>>>>>>>
>>>>>>>>>> For the energy part (A) we use max' in compute_energy() to cap sum_util
>>>>>>>>>> and max_util at max' and to call em_cpu_energy(..., max_util, sum_util,
>>>>>>>>>> max'). This was done to match (B)'s `policy->max` capping.
>>>>>>>>>>
>>>>>>>>>> For the frequency part (B) we have freq_qos_update_request() in:
>>>>>>>>>>
>>>>>>>>>> power_actor_set_power()
>>>>>>>>>> ...
>>>>>>>>>> cdev->ops->set_cur_state()
>>>>>>>>>>
>>>>>>>>>> cpufreq_set_cur_state()
>>>>>>>>>> freq_qos_update_request() <-- !
>>>>>>>>>> arch_update_thermal_pressure()
>>>>>>>>>>
>>>>>>>>>> restricting `policy->max` which then clamps `target_freq` in:
>>>>>>>>>>
>>>>>>>>>> cpufreq_update_util()
>>>>>>>>>> ...
>>>>>>>>>> get_next_freq()
>>>>>>>>>> cpufreq_driver_resolve_freq()
>>>>>>>>>> __resolve_freq()
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you mean that the "max" here will not affect the frequency
>>>>>>>>> conversion, so there is no need to change it?
>>>>>>>>> But is it better to reflect the influence of thermal here?
>>>>>>>>
>>>>>>>> I guess your point is that even though max' has no effect on frequency
>>>>>>>> since QOS caps policy->max anyway, it is still easier to understand the
>>>>>>>> dependency between schedutil and EAS/EM when it comes to the use of
>>>>>>>> thermal pressure.
>>>>>>>>
>>>>>>>> I agree. The way how the "hidden" policy->max capping guarantees that
>>>>>>>> schedutil and EAS/EM are doing the same even when only the latter uses
>>>>>>>> max' is not obvious.
>>>>>>>
>>>>>>> +1 here, IMO we shouldn't rely on hidden stuff. There are two drivers
>>>>>>> which set the thermal pressure, but one is not setting the freq_qos
>>>>>>> which causes the update of 'policy->max'. So schedutil will send that
>>>>>>> high frequency, but the driver will just ignore it and clamp internally.
>>>>>>> In the end we might argue it still works, but is it clean and visible
>>>>>>> from the code? Funny things might start to happen when the driver,
>>>>>>> which might be the last safety net, cannot capture this.
>>>>>
>>>>> schedutil reflects what the scheduler wants, not what the HW can do. If you
>>>>> start to cap the freq with arch_scale_thermal_pressure() in schedutil,
>>>>
>>>> s/freq/util ?
>>>
>>> Isn't it the same in the end if you cap util to orig capacity -
>>> arch_scale_thermal_pressure() ?
>>
>> No, as I have shown you in the example calculation, 'max_freq'
>> is always taken as the start and then scaled by the current 'util/max'.
>> If 'max' were 1024 in both cases, then you might claim that
>> we made a mistake and ended up with an obviously too low frequency.
>>
>> That's why I asked you to be precise with your statements in the
>> description when mentioning variables and signals.
>>
>>>
>>>>
>>>> To be precise, and maybe to fix some potential design issues: we are
>>>> talking here about the utilization and the max capacity set in function:
>>>> sugov_get_util()
>>>> so fields:
>>>>
>>>> sugov_cpu::util
>>>> sugov_cpu::max /* max capacity */
>>>
>>> Yes. With this patch, util will be lowered by the current thermal
>>> mitigation, whereas util normally reflects what we need, not what can
>>> be provided
>>
>> This is a different requirement: util has to be max capacity and
>> max capacity has to be the original max CPU capacity - for SchedUtil.
>> OK, why? What does this requirement add to the design and final values?
>
> Because the calculation you are proposing is wrong and doesn't make
> sense. Util is the average utilization of the cpu that has to be
> compared with its original max capacity in order to get the freq that
> matches this utilization.
>
> We have freq = util / max * max_freq, and cpufreq will then cap the freq
> if mitigation is applied. Once the mitigation disappears, the request
> will go back to the targeted freq.
>
> If you replace max with max' = max - arch_scale_thermal_pressure then:
>
> - by the time you do the calculation, arch_scale_thermal_pressure can
> have changed and the result is meaningless. This is true whatever the
> pace of updating arch_scale_thermal_pressure

The sudden change of the value taken from arch_scale_thermal_pressure()
I can understand, but there are similar cases and we live with them. Look
at the whole EAS estimation done in a single CPU wake-up event, or the
uclamp stuff. As long as this is not occurring too frequently - we live
with it.

I can see your concern here, since you mentioned below that you expect
some platforms to hit it at a 'kHz' rate. Triggering the kernel that
often from HW/FW is probably not good.

That's why I have been struggling to find a 'good' design for this
glue layer between the Arm FW and the kernel. Our FW probably won't cause
such huge notification traffic. A rate of e.g. 50-100ms would be enough,
especially if we have a per-CPU cpufreq policy. So we might have
this 'PELT-like filter or signal' in the FW, and just update the kernel
less often. But then there is an issue with the rising/decaying
penalty of the kernel thermal pressure signal.

We cannot assume that some SoCs don't do this already.

Let's meet in the middle:
1) use the thermal PELT signal in RT (sketched below):
capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
2) introduce a more configurable thermal_pressure shifter instead of
'sched_thermal_decay_shift', which would allow making the decay not only
longer, but also shorter when the platform might already do that itself,
so as not to cause too much traffic.
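
For 1), an untested sketch on top of the current rt_task_fits_capacity()
(only the cpu_cap line differs from mainline):

static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
{
        unsigned int min_cap;
        unsigned int max_cap;
        unsigned int cpu_cap;

        /* Only heterogeneous systems can benefit from this check */
        if (!static_branch_unlikely(&sched_asym_cpucapacity))
                return true;

        min_cap = uclamp_eff_value(p, UCLAMP_MIN);
        max_cap = uclamp_eff_value(p, UCLAMP_MAX);

        /* PELT-averaged thermal pressure, not the instantaneous value */
        cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));

        return cpu_cap >= min(min_cap, max_cap);
}

For 2), the place to touch would be the shift applied in rq_clock_thermal().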

>
> - you change the range of capacity to max' = max -
> arch_scale_thermal_pressure and you scale it to max_freq. If util >
> max', then you will ask for max_freq whether or not util is really
> close to max. Also you will ask for max_freq even if util is close
> to but below max', whereas the mitigation doesn't impact utilization

It's already there, even w/o patch. That's why I gave you the examples.

BTW, isn't it true that the utilization of a little CPU's rq can reach
1024 today, after your change to PELT, when there is no idle time -
even when the cpu max capacity is e.g. 300?
Before that change the utilization of a throttled CPU rq would converge
to the current capacity of the CPU, am I right?

Is it this commit:
23127296889fe84b0762b191

>
>>
>>>
>>>>
>>>>> you will lose some opportunity to run at a higher frequency because
>>>>> arch_scale_thermal_pressure() is transient and might change just after
>>>>> using it. This means that you will stay at a lower freq after mitigation
>>>>> stops until a new cpufreq_update_util() happens. And I haven't even
>>>>> mentioned the case when thermal mitigation is managed by HW at a much
>>>>> higher frequency than what Linux can handle
>>>>>
>>>>> arch_scale_thermal_pressure() must not be used; use thermal_load_avg()
>>>>> instead, like scale_rt_capacity() does, as Dietmar suggested
>>>>>
>>>>
>>>> First, I would like to see your view on this topic and why you are
>>>> making such strong statements. I have a slightly different view and
>>>> have made dozens of experiments with this thermal pressure in the
>>>> last ~2-3y.
>>>>
>>>> The code flow is like this and operates on those fields from above:
>>>>
>>>> util, max <--- sugov_get_util()
>>>> util <--- sugov_iowait_apply() <--- util, max /* ignore this now */
>>>>
>>>> get_next_freq():
>>>> util <--- map_util_perf() <--- util (1)
>>>> freq <--- map_util_freq() <--- util, max, max_freq (2)
>>>>
>>>>
>>>> At (1) we add +25% util, at (2) we do the conversion to frequency:
>>>> freq = max_freq * util / max
>>>>
>>>> As you can see with the patch we would still end up with bigger
>>>> frequency than max_freq (since it can happen: max < util).
>>>> It's also true currently in mainline, when
>>>> max=1024 and util=1024+256=1280
>>>> It would be similar if we cap max capacity:
>>>> max=800 and util=800+200=1000
>>>
>>> It's not because you end up with a similar value that it's ok. You
>>> can't use one side to compensate for the other one. The 1.25 is there
>>> to provide spare cycles for a change in the cpu load (load meaning
>>> what is running on the cpu, not load_avg)
>>
>> It's different. You've made a hard statement that we are going to break
>> this frequency selection, while IMO we aren't. It would behave
>> the same. I don't compensate anything; the divider ('max') has changed
>> as well. The patch sets 'sg_cpu->max', not just a local 'max', so it's
>> not the case that we produce a lower 'util' but then divide by
>> 'max=1024'.
>>
>>>
>>>> but then in both cases they are multiplied by 'max_freq' in (2)
>>>>
>>>> As you can see this is not the situation that you have described, is it?
>>>> And the transient or non-transient is minor here IMO.
>>>
>>> If max is 512 then util = 640 which is much lower than 1024.
>>
>> What scenario is this?
>> Is 1024 the utilization that we might have from the CPU rq?
>> What is the original CPU capacity, 1024?

Is this 1024 the utilization of the CPU runqueue? Because since
the new PELT we can have it bigger than the CPU capacity.

>>
>>>
>>>>
>>>> Secondly, you have mentioned the mitigation in HW and the issue between
>>>> the instantaneous vs. the PELT thermal pressure information. This is
>>>> something that I've been scratching my head over for a long time. I'm
>>>> trying to develop this for the new Arm FW thermal. You have mentioned:
>>>> 'thermal mitigation is managed by HW at a much higher
>>>> frequency than what Linux can handle' - I would also be more
>>>> precise here: HW or FW? How often can the HW change max freq, or
>>>> how often can the FW change that? If we don't have those numbers,
>>>> then the statement 'a much higher' doesn't help in solving this
>>>
>>> By 'much higher' I mean that Linux can't react fast enough and should
>>> not try to sync, because it's a losing game
>>
>> As I said, 'much higher' is not a number to base a design on.
>
> But that gives you the constraint that you can't expect to always be
> synced with an up-to-date value, which is the most important point here.
> This means that cpu_cap -= arch_scale_thermal_pressure(cpu) can be wrong
> just after you computed it, and then your decision is wrong.

This is a hypothetical situation in which the value can change in such a
noisy way on some platform. But I understand your concern.

>
>
>> We need real numbers from real platforms. Currently we have two
>> places where the thermal pressure is set:
>> 1) cpufreq_cooling.c [1]
>> 2) Qcom driver [2]
>> (we might soon have a 3rd for Arm SCMI+FW)
>
> I don't have details but I have kHz in mind

If such interrupt traffic at a kHz rate is true for the driver in 2),
then it's a bit concerning.

Although smarter platforms shouldn't suffer due to a design forced by
one corner-case platform.

>
>>
>> For the 2nd I would like to see numbers. For the 1st one, when
>> kernel thermal is used (which supports a higher number of platforms
>> compared to the Qcom driver), the kernel by design tries to control
>> thermal, so changes are not that frequent.
>>
>> As of now, I can see in experiments that the 1st suffers long decay
>> delays and also corner cases with long-idle CPUs.
>>
>>>
>>>> problem that Xuewen (and others) face. IMO it's not a technical
>>>> argument for blocking the patch and incremental development.
>>>>
>>>> It's about timing, when we talk about thermal pressure signals and
>>>> those two pieces of information. For the PELT one there are also two
>>>> use cases: rise time and decay time (where we're actually increasing
>>>> the visible capacity of the CPU). The decay period is quite long,
>>>> e.g.
>>>> thermal pressure of 140 is removed, and the signal should converge to
>>>> 0 from 140 in 130ms (90% decayed),
>>>> in 230ms (fully decayed).
>>>> The default kernel code allows slowing down the decay period, which is
>>>> derived from the current global PELT default setting.
>>>> We can slow it down, but we cannot make it react faster (BTW I made
>>>> such a change to compare experiments). It's not always good to have
>>>> such long delays.
>>>>
>>>> For the asymmetric CPUs that I was describing (and also Xuewen), where
>>>> a mid core might be faster than a big one, we need this information in
>>>> the RT class. Android is such a system, so the situation is real (DL
>>>> is not used there). You have questioned this with:
>>>> 'arch_scale_thermal_pressure() must not be used'
>>>> I wouldn't be so sure for the RT change.
>>>> Are you sure about that? Do you have experiments for it? I would
>>>> like to see them. I have run dozens of experiments and measurements
>>>> for this thermal pressure information on a few platforms: how
>>>> they behave for those task placements and what the thermal signal
>>>> decay delay impacts are. I'm still not sure which one is
>>>> best and thus haven't proposed any changes. But I'll refactor my
>>>> test code and send a patch with a trace event for the new
>>>> topology_update_thermal_pressure(), which then allows comparing
>>>> those two designs and their nasty timing issues. We would then see
>>>> how often (if 'much higher' is true) platforms set this value.
>>>> Currently, in mainline there are two clients which set this
>>>> value.
>>>>
>>>> I was investigating this PELT signal ~1-2 years ago and found
>>>> an issue where it's actually updated with delays for a long-idle CPU.
>>>> When one CPU was running fast and thermal throttling kicked in, while
>>>> the other was idle, the idle one didn't have recent thermal information,
>>>> but could be picked as a candidate because its visible capacity was
>>>> ~max possible - which is wrong because they both share the clock.
>>>> Check the function others_have_blocked() and the design around it.
>>>>
>>>> That's why I'm a bit more careful with statements that one type of
>>>> information is better than the other.
>>>>
>>>> Furthermore, check the code in rt_task_fits_capacity(): there is no
>>>> PELT signal from the RT task. There is only uclamp_eff_value() from
>>>> task 'p', which is not PELT information. So all the involved variables
>>>> are not PELT - why do you recommend the PELT thermal pressure here?
>>>>
>>>> As I said, this patch for the RT class is an incremental step in the
>>>> right direction.
>>>>
>>>>
>>
>> You haven't answered my questions, which are about technical details of
>> your recommendations and statements.
>>
>> I'm trying to help Xuewen solve his/her issues with the RT class
>> incrementally. I don't want to push him/her into the deep dark water
>> of PELT signals: which variables to compare them with, and the corner
>> cases when they are (or are not) updated or not implemented at all.
>> I'm not even sure if those extra complexities make sense for RT/DL
>> (since they make some difference on big.mid.little specific platforms
>> but not for the rest).

As I said, we need a way forward; this issue of capacity inversion
on big.mid.little is there. It has been for ~2-3 years and is going to
be even bigger in the future. So please don't block it, and prepare/share
the numbers for the corner-case platforms.

I have proposed where we can meet in the middle, consider it.
I will prepare a patch for that shifter.

2022-04-25 07:52:43

by Xuewen Yan

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef <[email protected]> wrote:
>
> On 04/21/22 16:07, Xuewen Yan wrote:
> > Hi Qais
> >
> > On Wed, Apr 20, 2022 at 9:51 PM Qais Yousef <[email protected]> wrote:
> > >
> > > Hi Xuewen
> > >
> > > Thanks for sending the patch. RT relationship with thermal pressure is an
> > > interesting topic :)
> > >
> > > On 04/07/22 13:19, Xuewen Yan wrote:
> > > > There are cases when the cpu max capacity might be reduced due to thermal.
> > > > Take the thermal pressure into account when judging whether the rt task
> > > > fits the cpu. And when the schedutil governor gets the cpu util, the
> > > > thermal pressure should also be considered.
> > >
> > > It would help to explain the mode of failure you're seeing here. What are you
> > > seeing?
> >
> > I use this in an Android scenario; there are many RT processes in
> > Android. I do not want to set sched_uclamp_util_min_rt_default to
> > 1024, as it would make the power consumption very high.
> > I used a compromise method, setting the value of
> > sysctl_sched_uclamp_util_min_rt_default to be bigger than the small
> > core capacity, but not so big that the frequency of the big core
> > becomes very high.
> > But when there are three clusters on the SoC, the big core's capacity
> > often becomes smaller than the middle core's; when this happens, I want
> > the RT task to be able to run on the middle cores instead of always on
> > the big core.
> > Considering the thermal pressure could relieve this phenomenon.
>
> Thanks for the explanation. It's worth putting some of this in the changelog in
> the next versions.
>
> So the problem is as I suspected, but capacity inversion is the major cause of
> grief.
>
> Is it okay to share what the capacities of the littles, mediums and bigs are
> on your system? And how they change under a worst-case thermal pressure
> scenario? Only IF you have these numbers handy :-)

Okay, the little/mid/big cpu scale capacities are 350/930/1024, but the
cpu frequency points are discrete, and the big core's highest frequency
point may be only a little above the mid core's highest.
In this case, once thermal decreases scaling_max_freq, the maximum
frequency of the big core easily drops below that of the medium core.
Of course, this corner case is due to the frequency design of the SoC
and our thermal algorithm.
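
To illustrate with made-up numbers (the real OPP tables are internal):
capacity scales roughly linearly with the frequency cap, so capping the
big core by a single OPP can already drop it below the mid's original
capacity:

/*
 * Hypothetical OPPs, for illustration only:
 *   mid: cap_orig =  930, f_max = 2.0 GHz
 *   big: cap_orig = 1024, f_max = 2.4 GHz, next OPP down = 2.1 GHz
 */
static unsigned long capped_capacity(unsigned long cap_orig,
                                     unsigned long f_capped_khz,
                                     unsigned long f_max_khz)
{
        return cap_orig * f_capped_khz / f_max_khz;
}

/* capped_capacity(1024, 2100000, 2400000) == 896 < 930 -> inversion */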

>
> Is it actually an indication of another potential problem if you swing into
> capacity inversion on the bigs that often? I've seen a lot of systems where
> the difference between the meds and bigs is small. But frequent inversion
> could still be suspicious.
>
> Do the littles and the mediums experience any significant thermal pressure too?

On our platform, they don't.

>
> > >
> > > >
> > > > Signed-off-by: Xuewen Yan <[email protected]>
> > > > ---
> > > > kernel/sched/cpufreq_schedutil.c | 1 +
> > > > kernel/sched/rt.c | 1 +
> > > > 2 files changed, 2 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > > index 3dbf351d12d5..285ad51caf0f 100644
> > > > --- a/kernel/sched/cpufreq_schedutil.c
> > > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > > @@ -159,6 +159,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > > > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > > unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> > > >
> > > > + max -= arch_scale_thermal_pressure(sg_cpu->cpu);
> > >
> > > Wouldn't this break the call to scale_irq_capacity() in effective_cpu_util()?
> > >
> > > > sg_cpu->max = max;
> > > > sg_cpu->bw_dl = cpu_bw_dl(rq);
> > > > sg_cpu->util = effective_cpu_util(sg_cpu->cpu, cpu_util_cfs(sg_cpu->cpu), max,
> >
> > It would destroy the scale_irq_capacity() calculation, but indeed the cpu
> > max capacity has changed; is it better to use the real cpu capacity?
> >
> > U' = irq + ((max - irq) / max) * U
> >
> > I can't imagine how much of an impact this will have at the moment.
>
> It doesn't seem it'll cause a significant error, but it still seems to me
> that this function wants the original capacity passed to it.
>
> There are similar questions to be asked since you modify sg_cpu->max. Every
> user needs to be audited to see if they're fine with this change or not.
>
> I'm still not sure what we are achieving here. You want to force schedutil not
> to request higher frequencies if thermal pressure is high? Should schedutil
> actually care? Shouldn't the cpufreq driver reject this request and pick the
> next best thing if it can't satisfy it? I could be missing something, I haven't
> looked that hard tbh :-)

I changed this just because I want it to reflect the real capacity of
the cpu; if it will cause other problems, maybe it would be better not
to change it. :)

>
> >
> > > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > > index a32c46889af8..d9982ebd4821 100644
> > > > --- a/kernel/sched/rt.c
> > > > +++ b/kernel/sched/rt.c
> > > > @@ -466,6 +466,7 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > > >
> > > > cpu_cap = capacity_orig_of(cpu);
> > > > + cpu_cap -= arch_scale_thermal_pressure(cpu);
> > >
> > > Hmm I'm not a fan of this. By default all RT tasks have uclamp_min = 1024 to
> > > keep the default behavior of the system running at the max performance point.
> > >
> > > With this change, any tiny thermal pressure means all RT tasks will fail to
> > > fit on the biggest CPU. While this hint is not meant to be bulletproof, it
> > > shouldn't break that easily either. The highest performance point will still
> > > be on this CPU. The only exception is capacity inversion, where the bigs'
> > > performance goes below the mediums' under severe thermal circumstances. But
> > > then there are 2 issues.
> > >
> > > 1. This patch doesn't help with this case. It simply reverts to placing
> > > tasks 'randomly', and they might still end up on this CPU. I can't see
> > > how this is better.
> > > 2. If we are under such severe thermal pressure, then things must be
> > > falling over badly anyway and I'm not sure we can still satisfy the
> > > perf requirements these tasks want anyway. Unless you're trying to
> > > keep these CPUs less busy to alleviate thermal pressure? This patch
> > > will not help achieve that either. Or I'm unable to see it if it
> > > does.
> >
> > Yes, it is the problem this would lead to; maybe it needs more
> > consideration, like selecting the cpus which have the min overutilization.
>
> It depends on the severity of the problem. The simplest thing I can suggest is
> to check if the cpu is in capacity inversion state, and if it is, then make
> rt_task_fits_capacity() return false always.
>
> If we need a generic solution to handle thermal pressure omitting OPPs, then
> the search needs to become more complex. The proposal in this patch is not
> adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
> omit some cpus because of any tiny thermal pressure. For example if the
> capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
> any small thermal pressure on mediums will cause these tasks to run on big cpus
> only, which is not what we want. Especially if these big cpus can end up in
> capacity inversion later ;-)
>
> So if we want to handle this case, then we need to ensure the search returns
> false only if
>
> 1. Thermal pressure results in real OPP to be omitted.
> 2. Another CPU that can provide this performance level is available.
>
> Otherwise we should still fit it on this CPU because it'll give us the closest
> thing to what was requested.
>
> I can think of 2 ways to implement this, but none of them seem particularly
> pretty :-/

Maybe as Lukasz Luba said:

https://lore.kernel.org/all/[email protected]/

> Let's meet in the middle:
> 1) use the thermal PELT signal in RT:
> capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
> 2) introduce a more configurable thermal_pressure shifter instead
> 'sched_thermal_decay_shift', which would allow not only to make the
> decaying longer, but also shorter when the platform already might do
> that, to not cause too much traffic.

But even if this is changed, there will still be the same problem. I
look forward to Lukasz's patch :)

Thanks!
---
BR
xuewen

>
>
> Thanks
>
> --
> Qais Yousef
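
For readers following the irq question above: below is a minimal user-space
model (not kernel code) of that scaling step, with scale_irq_capacity()
transcribed from kernel/sched/sched.h of this era; the util/irq/pressure
numbers are invented purely for illustration.

#include <stdio.h>

/* Same arithmetic as the kernel helper: scale util by (max - irq) / max. */
static unsigned long scale_irq_capacity(unsigned long util, unsigned long irq,
					unsigned long max)
{
	util *= (max - irq);
	util /= max;
	return util;
}

int main(void)
{
	unsigned long util = 600, irq = 100;
	unsigned long max = 1024;	/* arch_scale_cpu_capacity() */
	unsigned long max_capped = 824;	/* max minus some thermal pressure */

	/* U' = irq + ((max - irq) / max) * U, as in the formula above */
	printf("U' with original max: %lu\n",
	       irq + scale_irq_capacity(util, irq, max));
	printf("U' with capped max:   %lu\n",
	       irq + scale_irq_capacity(util, irq, max_capped));
	return 0;
}

The two results differ only slightly here, which matches the observation
above that the error is not significant, but the second is computed against
a moving 'max', which is exactly the auditing concern being raised.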

2022-04-26 08:50:04

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Thu, 21 Apr 2022 at 12:57, Lukasz Luba <[email protected]> wrote:
>
>
>
> On 4/21/22 09:29, Vincent Guittot wrote:
> > On Tue, 19 Apr 2022 at 16:13, Lukasz Luba <[email protected]> wrote:
> >>
> >>
> >>
> >> On 4/19/22 13:51, Vincent Guittot wrote:
> >>> On Tue, 19 Apr 2022 at 14:01, Lukasz Luba <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 4/19/22 08:14, Vincent Guittot wrote:
> >>>>> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi Luba / Dietmar
> >>>>>>
> >>>>>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>

[...]

> >>>> To be precise, and maybe to fix some potential design issues: we are
> >>>> talking here about utilization and the max capacity set in function:
> >>>> sugov_get_util()
> >>>> so fields:
> >>>>
> >>>> sugov_cpu::util
> >>>> sugov_cpu::max /* max capacity */
> >>>
> >>> Yes. With this patch, util will be lower under the current thermal
> >>> mitigation, whereas util normally reflects what we need, not what can
> >>> be provided.
> >>
> >> These are different requirements: util has to be capped at max capacity,
> >> and max capacity has to be the original max CPU capacity - for SchedUtil.
> >> OK, why? What does this requirement add to the design and final values?
> >
> > Because the calculation you are proposing is wrong and doesn't make
> > sense. Util is the average utilization of the cpu that has to be
> > compared with its original capacity max in order to get the freq that
> > matches with this utilization.
> >
> > We have freq = util / max * max_freq and cpufreq will then cap freq
> > if mitigation is applied. Once the mitigation disappears, the request
> > will go back to the targeted freq.
> >
> > If you replace max by max' = max - arch_scale_thermal_pressure then :
> >
> > - by the time you do the calculation, arch_scale_thermal_pressure can
> > have changed and the result is meaningless. This is true whatever the
> > pace of updating arch_scale_thermal_pressure
>
> The sudden change of the value taken from arch_scale_thermal_pressure
> I can understand, but there are similar cases and we live with them. Look at
> the whole EAS estimation done in a single CPU wake-up event, or the uclamp
> stuff. As long as this is not occurring too frequently - we live with it.
>
> I can see your concern here, since you mentioned below that you expect
> some platforms to hit it in 'khz' rate. This is probably not good, to
> trigger the kernel so often from HW/FW.
>
> That's why I have been struggling to find a 'good' design for this
> glue layer between Arm FW and the kernel. Our FW probably won't cause such
> huge notification traffic. A rate of e.g. 50-100ms would be enough,
> especially if we have per-CPU cpufreq policies. So we might have
> this 'PELT-like filter or signal' in FW, and just update the kernel

In this case arch_scale_thermal_pressure() doesn't reflect the actual
thermal pressure but an average which is what thermal_load_avg() is
also doing.

> less often. But then there is an issue with the rising/decaying
> penalty of the kernel thermal pressure signal.
>
> We cannot assume that some SoCs don't do this already.
>
> Let's meet in the middle:
> 1) use the thermal PELT signal in RT:
> capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))

This is what Dietmar and I have been suggesting

> 2) introduce a more configurable thermal_pressure shifter instead
> 'sched_thermal_decay_shift', which would allow not only to make the
> decaying longer, but also shorter when the platform already might do
> that, to not cause too much traffic.
>
> >
> > - you change the range of capacity to max' = max -
> > arch_scale_thermal_pressure and you scale it to max_freq. If util >
> > max', then you will ask for max_freq whether or not util is really
> > close to max. Also you will ask for max freq even if util is
> > close to but below max', whereas the mitigation doesn't impact utilization
>
> It's already there, even w/o patch. That's why I gave you the examples.

Not sure how to understand this above.

utilization can already go above, but this reflects the reality that
the task could need more capacity than the current cpu capacity

>
> BTW, isn't it true that the utilization of the Little CPU rq can reach
> 1024 today, after your change to the PELT, when there is no idle time,
> even when the cpu max capacity is e.g. 300?

yes

> Before that change the utilization of a throttled CPU rq would converge
> to the current capacity of the CPU, am I right?

yes

>
> Is it this commit:
> 23127296889fe84b0762b191
>
> >
> >>
> >>>
> >>>>

[...]

> >>>
> >>>> but then in both cases are multiplied by 'max_freq' in (2)
> >>>>
> >>>> As you can see this is not the situation that you have described, is it?
> >>>> And the transient or non-transient is minor here IMO.
> >>>
> >>> If max is 512 then util = 640 which is much lower than 1024.
> >>
> >> What scenario is this?
> >> Is 1024 the utilization that we might have from the CPU rq?
> >> What is the original CPU capacity, 1024?
>
> Is this 1024 the utilization of the CPU runqueue, which since the
> new PELT changes can be bigger than the CPU capacity?
>
> >>
> >>>
> >>>>
> >>>> Secondly, you have mentioned the mitigation in HW and the issue of
> >>>> instantaneous vs. PELT-averaged thermal pressure information. This is
> >>>> something I have been scratching my head over for a long time. I'm
> >>>> trying to develop this for the new Arm FW thermal. You have mentioned:
> >>>> 'thermal mitigation is managed by HW at a much higher
> >>>> frequency than what Linux can handle' - I would also be more
> >>>> precise here: HW or FW? How often can the HW change max freq, or
> >>>> how often can the FW change it? If we don't have those numbers,
> >>>> then the statement 'a much higher' doesn't help in solving this
> >>>
> >>> By 'much higher' I mean that Linux can't react fast enough and should not
> >>> try to sync because it's a lost game
> >>
> >> As I said, 'much higher' is not a number to base a design on.
> >
> > But that gives you the constraint that you can't expect to always be
> > synced with an up-to-date value, which is the most important point here. This
> > means that cpu_cap -= arch_scale_thermal_pressure(cpu) can be wrong
> > just after you computed it, and then your decision is wrong.
>
> This is a hypothetical situation where the value can change in such a
> noisy way on some platform. But I understand your concern.
>
> >
> >
> >> We need real numbers from real platforms. Currently we have two
> >> places where the thermal pressure is set:
> >> 1) cpufreq_cooling.c [1]
> >> 2) Qcom driver [2]
> >> (we might have 3rd soon for Arm SCMI+FW)
> >
> > I don't have details but I have kHz in mind
>
> If such interrupt traffic in the kHz range is true for the driver in 2),
> then it's a bit concerning.
>
> Although, smarter platforms shouldn't suffer due to a design forced by one
> corner-case platform.
>
> >
> >>
> >> For the 2nd I would like to see numbers. For the 1st one, when
> >> kernel thermal is used (which supports a higher number of platforms
> >> compared to the Qcom driver), the kernel by design tries to control
> >> thermal, so changes are not that frequent.
> >>
> >> As of now, I can see in experiments that the 1st suffers long decay
> >> delays and also corner cases with long-idle CPUs.
> >>
> >>>

[...]

> >> I'm trying to help Xuewen to solve his/her issues with the RT class
> >> incrementally. I don't want to push him/her into the deep dark water
> >> of PELT signals: which variable to compare them with, corner cases where
> >> they are (or are not) updated or are completely unimplemented. I'm not
> >> even sure if those extra complexities make sense for RT/DL (since they
> >> make some difference on big.mid.little specific platforms but not for
> >> the rest).
>
> As I said we need a way forward; this issue of capacity inversion
> on big.mid.little is real. It has been there for ~2-3 years and is going
> to get even bigger in the future. So please don't block it, and
> prepare/share the numbers for the corner-case platforms.

I don't want to block anything but just want a solution that is
coherent with the whole design and not just a fix for one UC

>
> I have proposed where we can meet in the middle, consider it.
> I will prepare a patch for that shifter.
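
To put numbers on the range distortion discussed above (invented example;
the 1.25 factor models schedutil's map_util_perf() headroom as of this era):

    max_freq = 2000 MHz, util = 640

    freq request with max  = 1024:  1.25 * 2000 * 640 / 1024 ~= 1562 MHz
    freq request with max' =  512:  1.25 * 2000 * 640 /  512  = 3125 MHz

With max' the request saturates at policy->max long before util approaches
the original capacity, which is the distortion Vincent points out; with the
original max the same util maps to a sane intermediate frequency and cpufreq
does the capping.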

2022-04-26 09:29:00

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 04/25/22 09:31, Xuewen Yan wrote:
> On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef <[email protected]> wrote:
> > Is it okay to share what the capacities of the littles, mediums and bigs on
> > your system? And how they change under worst case scenario thermal pressure?
> > Only IF you have these numbers handy :-)
>
> Okay, the little/mid/big cpu scale capacities are 350/930/1024, but
> the cpu frequency points are discrete; the big core's highest freq
> point may be only a little above the mid core's highest.
> In this case, once thermal decreases the scaling_max_freq, the
> maximum frequency of the big core easily falls below that of the
> medium core.
> Of course, this corner case is due to the frequency design of the soc
> and our thermal algorithm.

Okay, thanks for the info!
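
To put rough numbers on that inversion window: with capacity_orig_of(big) =
1024 and capacity_orig_of(mid) = 930, the bigs invert as soon as

    1024 - thermal_pressure < 930   =>   thermal_pressure > 94

i.e. (ignoring rounding in the frequency-to-capacity scaling) a cap of only
about 9% of the big core's top frequency, which would explain why mild
mitigation flips them so easily on this SoC.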

>
> >
> > Is it actually an indication of a potential other problem if you swing into
> > capacity inversion in the bigs that often? I've seen a lot of systems where the
> > difference between the meds and bigs is small. But frequent inversion could be
> > suspicious still.
> >
> > Do the littles and the mediums experience any significant thermal pressure too?
>
> In our platform, it's not.

Good.

> > It doesn't seem it'll cause a significant error, but still it seems to me this
> > function wants the original capacity passed to it.
> >
> > There are similar questions to be asked since you modify sg_cpu->max. Every
> > user needs to be audited if they're fine with this change or not.
> >
> > I'm not sure still what we are achieving here. You want to force schedutil not
> > to request higher frequencies if thermal pressure is high? Should schedutil
> > actually care? Shouldn't the cpufreq driver reject this request and pick the
> > next best thing if it can't satisfy it? I could be missing something, I haven't
> > looked that hard tbh :-)
>
> I changed this just to make it track the real capacity of the cpu
> more closely; if it will cause other problems, maybe it would be
> better not to change it. :)

There are others who can give you a better opinion. But AFAICS we're not fixing
anything but risking breaking other things. So I vote for not changing it :)

> > It depends on the severity of the problem. The simplest thing I can suggest is
> > to check if the cpu is in capacity inversion state, and if it is, then make
> > rt_task_fits_capacity() return false always.
> >
> > If we need a generic solution to handle thermal pressure omitting OPPs, then
> > the search needs to become more complex. The proposal in this patch is not
> > adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
> > omit some cpus because of any tiny thermal pressure. For example if the
> > capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
> > any small thermal pressure on mediums will cause these tasks to run on big cpus
> > only, which is not what we want. Especially if these big cpus can end up in
> > capacity inversion later ;-)
> >
> > So if we want to handle this case, then we need to ensure the search returns
> > false only if
> >
> > 1. Thermal pressure results in real OPP to be omitted.
> > 2. Another CPU that can provide this performance level is available.
> >
> > Otherwise we should still fit it on this CPU because it'll give us the closest
> > thing to what was requested.
> >
> > I can think of 2 ways to implement this, but none of them seem particularly
> > pretty :-/
>
> Maybe as Lukasz Luba said:
>
> https://lore.kernel.org/all/[email protected]/
>
> > Let's meet in the middle:
> > 1) use the thermal PELT signal in RT:
> > capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
> > 2) introduce a more configurable thermal_pressure shifter instead
> > 'sched_thermal_decay_shift', which would allow not only to make the
> > decaying longer, but also shorter when the platform already might do
> > that, to not cause too much traffic.
>
> But even if this is changed, there will still be the same problem. I
> look forward to Lukasz's patch :)

This will not address my concern unless I missed something.

The best (simplest) way forward IMHO is to introduce a new function

bool cpu_in_capacity_inversion(int cpu);

(feel free to pick another name) which will detect the scenario you're in. You
can use this function then in rt_task_fits_capacity()

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a32c46889af8..d48811a7e956 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
if (!static_branch_unlikely(&sched_asym_cpucapacity))
return true;

+ if (cpu_in_capacity_inversion(cpu))
+ return false;
+
min_cap = uclamp_eff_value(p, UCLAMP_MIN);
max_cap = uclamp_eff_value(p, UCLAMP_MAX);

You'll probably need to do something similar in dl_task_fits_capacity().

This might be a bit aggressive though as we'll steer away all RT tasks from
this CPU (as long as there's another CPU that can fit it). I need to think more
about it. But we could do something like this too

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a32c46889af8..f2a34946a7ab 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
if (!static_branch_unlikely(&sched_asym_cpucapacity))
return true;

+ cpu_cap = capacity_orig_of(cpu);
+
+ if (cpu_in_capacity_inversion(cpu))
+ cpu_cap -= thermal_load_avg(cpu_rq(cpu));
+
min_cap = uclamp_eff_value(p, UCLAMP_MIN);
max_cap = uclamp_eff_value(p, UCLAMP_MAX);

- cpu_cap = capacity_orig_of(cpu);
-
return cpu_cap >= min(min_cap, max_cap);
}
#else

Thoughts?


Thanks!

--
Qais Yousef
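
For concreteness, one possible shape for that helper - purely a sketch, as
no such function exists in mainline at this point, and a real implementation
would precompute the state per performance domain rather than walk every cpu
on each call:

/*
 * Hypothetical: a CPU is in capacity inversion when its thermally
 * pressured capacity drops below the *original* capacity of a CPU type
 * that started out smaller, e.g. a 1024 big falling under a 930 mid on
 * the system described above.
 */
static inline bool cpu_in_capacity_inversion(int cpu)
{
	unsigned long cap = capacity_orig_of(cpu) -
			    arch_scale_thermal_pressure(cpu);
	int i;

	for_each_possible_cpu(i) {
		unsigned long cap_orig_i = capacity_orig_of(i);

		/* Only CPUs with a smaller original capacity matter. */
		if (cap_orig_i < capacity_orig_of(cpu) && cap < cap_orig_i)
			return true;
	}

	return false;
}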

2022-04-26 12:15:54

by Xuewen Yan

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Tue, Apr 26, 2022 at 12:12 AM Qais Yousef <[email protected]> wrote:
>
> On 04/25/22 09:31, Xuewen Yan wrote:
> > On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef <[email protected]> wrote:
> > > Is it okay to share what the capacities of the littles, mediums and bigs on
> > > your system? And how they change under worst case scenario thermal pressure?
> > > Only IF you have these numbers handy :-)
> >
> > Okay, the little/mid/big cpu scale capacities are 350/930/1024, but
> > the cpu frequency points are discrete; the big core's highest freq
> > point may be only a little above the mid core's highest.
> > In this case, once thermal decreases the scaling_max_freq, the
> > maximum frequency of the big core easily falls below that of the
> > medium core.
> > Of course, this corner case is due to the frequency design of the soc
> > and our thermal algorithm.
>
> Okay, thanks for the info!
>
> >
> > >
> > > Is it actually an indication of a potential other problem if you swing into
> > > capacity inversion in the bigs that often? I've seen a lot of systems where the
> > > difference between the meds and bigs is small. But frequent inversion could be
> > > suspicious still.
> > >
> > > Do the littles and the mediums experience any significant thermal pressure too?
> >
> > In our platform, it's not.
>
> Good.
>
> > > It doesn't seem it'll cause a significant error, but still it seems to me this
> > > function wants the original capacity passed to it.
> > >
> > > There are similar questions to be asked since you modify sg_cpu->max. Every
> > > user needs to be audited if they're fine with this change or not.
> > >
> > > I'm not sure still what we are achieving here. You want to force schedutil not
> > > to request higher frequencies if thermal pressure is high? Should schedutil
> > > actually care? Shouldn't the cpufreq driver reject this request and pick the
> > > next best thing if it can't satisfy it? I could be missing something, I haven't
> > > looked that hard tbh :-)
> >
> > I changed this just to make it track the real capacity of the cpu
> > more closely; if it will cause other problems, maybe it would be
> > better not to change it. :)
>
> There are others who can give you a better opinion. But AFAICS we're not fixing
> anything but risking breaking other things. So I vote for not changing it :)
>
> > > It depends on the severity of the problem. The simplest thing I can suggest is
> > > to check if the cpu is in capacity inversion state, and if it is, then make
> > > rt_task_fits_capacity() return false always.
> > >
> > > If we need a generic solution to handle thermal pressure omitting OPPs, then
> > > the search needs to become more complex. The proposal in this patch is not
> > > adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
> > > omit some cpus because of any tiny thermal pressure. For example if the
> > > capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
> > > any small thermal pressure on mediums will cause these tasks to run on big cpus
> > > only, which is not what we want. Especially if these big cpus can end up in
> > > capacity inversion later ;-)
> > >
> > > So if we want to handle this case, then we need to ensure the search returns
> > > false only if
> > >
> > > 1. Thermal pressure results in real OPP to be omitted.
> > > 2. Another CPU that can provide this performance level is available.
> > >
> > > Otherwise we should still fit it on this CPU because it'll give us the closest
> > > thing to what was requested.
> > >
> > > I can think of 2 ways to implement this, but none of them seem particularly
> > > pretty :-/
> >
> > Maybe as Lukasz Luba said:
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > > Let's meet in the middle:
> > > 1) use the thermal PELT signal in RT:
> > > capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
> > > 2) introduce a more configurable thermal_pressure shifter instead
> > > 'sched_thermal_decay_shift', which would allow not only to make the
> > > decaying longer, but also shorter when the platform already might do
> > > that, to not cause too much traffic.
> >
> > But even if this is changed, there will still be the same problem. I
> > look forward to Lukasz's patch :)
>
> This will not address my concern unless I missed something.
>
> The best (simplest) way forward IMHO is to introduce a new function
>
> bool cpu_in_capacity_inversion(int cpu);
>
> (feel free to pick another name) which will detect the scenario you're in. You
> can use this function then in rt_task_fits_capacity()
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a32c46889af8..d48811a7e956 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> if (!static_branch_unlikely(&sched_asym_cpucapacity))
> return true;
>
> + if (cpu_in_capacity_inversion(cpu))
> + return false;
> +
> min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
>
> You'll probably need to do something similar in dl_task_fits_capacity().
>
> This might be a bit aggressive though as we'll steer away all RT tasks from
> this CPU (as long as there's another CPU that can fit it). I need to think more
> about it. But we could do something like this too
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a32c46889af8..f2a34946a7ab 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> if (!static_branch_unlikely(&sched_asym_cpucapacity))
> return true;
>
> + cpu_cap = capacity_orig_of(cpu);
> +
> + if (cpu_in_capacity_inversion(cpu))

It's a good idea, but as you said, in mainline
sysctl_sched_uclamp_util_min_rt_default is always 1024.
Maybe it's better to add it to the condition?

+	if (sysctl_sched_uclamp_util_min_rt_default != SCHED_CAPACITY_SCALE &&
+	    cpu_in_capacity_inversion(cpu))

> + cpu_cap -= thermal_load_avg(cpu_rq(cpu));

Why use thermal_load_avg? If thermal is always in effect,
thermal_load_avg will keep growing; as a result, cpu_cap may end up
smaller than (capacity_orig - thermal_pressure).

Thanks!

> +
> min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> max_cap = uclamp_eff_value(p, UCLAMP_MAX);
>
> - cpu_cap = capacity_orig_of(cpu);
> -
> return cpu_cap >= min(min_cap, max_cap);
> }
> #else
>
> Thoughts?

>
>
> Thanks!
>
> --
> Qais Yousef

2022-04-26 13:43:08

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Tue, 26 Apr 2022 at 11:31, Qais Yousef <[email protected]> wrote:
>
> On 04/26/22 10:09, Vincent Guittot wrote:
> > On Tue, 26 Apr 2022 at 04:07, Xuewen Yan <[email protected]> wrote:
> > >
> > > On Tue, Apr 26, 2022 at 12:12 AM Qais Yousef <[email protected]> wrote:
> > > >
> > > > On 04/25/22 09:31, Xuewen Yan wrote:
> > > > > On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef <[email protected]> wrote:
> > > > > > Is it okay to share what the capacities of the littles, mediums and bigs on
> > > > > > your system? And how they change under worst case scenario thermal pressure?
> > > > > > Only IF you have these numbers handy :-)
> > > > >
> > > > > Okay, the little/mid/big cpu scale capacities are 350/930/1024, but
> > > > > the cpu frequency points are discrete; the big core's highest freq
> > > > > point may be only a little above the mid core's highest.
> > > > > In this case, once thermal decreases the scaling_max_freq, the
> > > > > maximum frequency of the big core easily falls below that of the
> > > > > medium core.
> > > > > Of course, this corner case is due to the frequency design of the soc
> > > > > and our thermal algorithm.
> > > >
> > > > Okay, thanks for the info!
> > > >
> > > > >
> > > > > >
> > > > > > Is it actually an indication of a potential other problem if you swing into
> > > > > > capacity inversion in the bigs that often? I've seen a lot of systems where the
> > > > > > difference between the meds and bigs is small. But frequent inversion could be
> > > > > > suspicious still.
> > > > > >
> > > > > > Do the littles and the mediums experience any significant thermal pressure too?
> > > > >
> > > > > In our platform, it's not.
> > > >
> > > > Good.
> > > >
> > > > > > It doesn't seem it'll cause a significant error, but still it seems to me this
> > > > > > function wants the original capacity passed to it.
> > > > > >
> > > > > > There are similar questions to be asked since you modify sg_cpu->max. Every
> > > > > > user needs to be audited if they're fine with this change or not.
> > > > > >
> > > > > > I'm not sure still what we are achieving here. You want to force schedutil not
> > > > > > to request higher frequencies if thermal pressure is high? Should schedutil
> > > > > > actually care? Shouldn't the cpufreq driver reject this request and pick the
> > > > > > next best thing if it can't satisfy it? I could be missing something, I haven't
> > > > > > looked that hard tbh :-)
> > > > >
> > > > > I changed this just to make it track the real capacity of the cpu
> > > > > more closely; if it will cause other problems, maybe it would be
> > > > > better not to change it. :)
> > > >
> > > > There are others who can give you a better opinion. But AFAICS we're not fixing
> > > > anything but risking breaking other things. So I vote for not changing it :)
> > > >
> > > > > > It depends on the severity of the problem. The simplest thing I can suggest is
> > > > > > to check if the cpu is in capacity inversion state, and if it is, then make
> > > > > > rt_task_fits_capacity() return false always.
> > > > > >
> > > > > > If we need a generic solution to handle thermal pressure omitting OPPs, then
> > > > > > the search needs to become more complex. The proposal in this patch is not
> > > > > > adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
> > > > > > omit some cpus because of any tiny thermal pressure. For example if the
> > > > > > capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
> > > > > > any small thermal pressure on mediums will cause these tasks to run on big cpus
> > > > > > only, which is not what we want. Especially if these big cpus can end up in
> > > > > > capacity inversion later ;-)
> > > > > >
> > > > > > So if we want to handle this case, then we need to ensure the search returns
> > > > > > false only if
> > > > > >
> > > > > > 1. Thermal pressure results in real OPP to be omitted.
> > > > > > 2. Another CPU that can provide this performance level is available.
> > > > > >
> > > > > > Otherwise we should still fit it on this CPU because it'll give us the closest
> > > > > > thing to what was requested.
> > > > > >
> > > > > > I can think of 2 ways to implement this, but none of them seem particularly
> > > > > > pretty :-/
> > > > >
> > > > > Maybe as Lukasz Luba said:
> > > > >
> > > > > https://lore.kernel.org/all/[email protected]/
> > > > >
> > > > > > Let's meet in the middle:
> > > > > > 1) use the thermal PELT signal in RT:
> > > > > > capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
> > > > > > 2) introduce a more configurable thermal_pressure shifter instead
> > > > > > 'sched_thermal_decay_shift', which would allow not only to make the
> > > > > > decaying longer, but also shorter when the platform already might do
> > > > > > that, to not cause too much traffic.
> > > > >
> > > > > But even if this is changed, there will still be the same problem. I
> > > > > look forward to Lukasz's patch :)
> > > >
> > > > This will not address my concern unless I missed something.
> > > >
> > > > The best (simplest) way forward IMHO is to introduce a new function
> > > >
> > > > bool cpu_in_capacity_inversion(int cpu);
> > > >
> > > > (feel free to pick another name) which will detect the scenario you're in. You
> > > > can use this function then in rt_task_fits_capacity()
> > > >
> > > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > > index a32c46889af8..d48811a7e956 100644
> > > > --- a/kernel/sched/rt.c
> > > > +++ b/kernel/sched/rt.c
> > > > @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > > return true;
> > > >
> > > > + if (cpu_in_capacity_inversion(cpu))
> > > > + return false;
> > > > +
> > > > min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > > > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > > >
> > > > You'll probably need to do something similar in dl_task_fits_capacity().
> > > >
> > > > This might be a bit aggressive though as we'll steer away all RT tasks from
> > > > this CPU (as long as there's another CPU that can fit it). I need to think more
> > > > about it. But we could do something like this too
> > > >
> > > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > > index a32c46889af8..f2a34946a7ab 100644
> > > > --- a/kernel/sched/rt.c
> > > > +++ b/kernel/sched/rt.c
> > > > @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > > return true;
> > > >
> > > > + cpu_cap = capacity_orig_of(cpu);
> > > > +
> > > > + if (cpu_in_capacity_inversion(cpu))
> > >
> > > It's a good idea, but as you said, in mainline
> > > sysctl_sched_uclamp_util_min_rt_default is always 1024.
> > > Maybe it's better to add it to the condition?
> > >
> > > +	if (sysctl_sched_uclamp_util_min_rt_default != SCHED_CAPACITY_SCALE &&
> > > +	    cpu_in_capacity_inversion(cpu))
> > >
> > > > + cpu_cap -= thermal_load_avg(cpu_rq(cpu));
> > >
> > > Why use thermal_load_avg? If thermal is always in effect,
> > > thermal_load_avg will keep growing; as a result, cpu_cap may end up
> > > smaller than (capacity_orig - thermal_pressure).
> >
> > For a fixed thermal_pressure(), thermal_load_avg() will not be higher
> > than thermal_pressure() but will increase to reach thermal_pressure()
> >
> > In the current implementation for sched_asym_cpucapacity topology, we
> > do a 1st iteration trying to find a cpu that fits a task's capacity
> > but if it fails, we run a normal cpupri_find that doesn't care about
> > capacity.
> >
> > Do I understand correctly that in your case you would like to run
> > a 1st iteration that takes into account capacity_orig_of(cpu) -
> > thermal_load_avg(cpu_rq(cpu)), if that fails run another iteration
> > with only capacity_orig_of(cpu), and finally try without any
> > capacity constraint?
>
> Wouldn't it be expensive to have 3 loops? That was my other suggestion but
> I wasn't sure the complexity was worth it. So I suggested handling the
> capacity inversion case only.

3 loops might be too expensive. I mainly want to make sure I
understand what should be done to fix Xuewen's case without breaking
others. Then we can see how to optimize this into a reasonable number
of loops.

>
>
> Thanks
>
> --
> Qais Yousef
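
A rough sketch of the three-pass search being discussed, just to make the
shape of the loops concrete. cpupri_find_fitness() and cpupri_find() are the
existing entry points; the two fitness callbacks named here are hypothetical
and would compare uclamp_eff_value() against capacity_orig_of(cpu) -
thermal_load_avg(cpu_rq(cpu)) and against plain capacity_orig_of(cpu)
respectively:

static int find_lowest_rq_3pass(struct task_struct *p, struct cpumask *mask)
{
	struct cpupri *cp = &task_rq(p)->rd->cpupri;

	/* Pass 1: honour the thermally reduced capacity. */
	if (cpupri_find_fitness(cp, p, mask, rt_task_fits_cap_thermal))
		return 1;

	/* Pass 2: original capacity only. */
	if (cpupri_find_fitness(cp, p, mask, rt_task_fits_cap_orig))
		return 1;

	/* Pass 3: anywhere, no capacity constraint. */
	return cpupri_find(cp, p, mask);
}

The cost concern is that passes 1 and 2 each rescan the priority vector,
which is exactly the "3 loops might be too expensive" worry above.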

2022-04-26 19:08:13

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Tue, 26 Apr 2022 at 04:07, Xuewen Yan <[email protected]> wrote:
>
> On Tue, Apr 26, 2022 at 12:12 AM Qais Yousef <[email protected]> wrote:
> >
> > On 04/25/22 09:31, Xuewen Yan wrote:
> > > On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef <[email protected]> wrote:
> > > > Is it okay to share what the capacities of the littles, mediums and bigs on
> > > > your system? And how they change under worst case scenario thermal pressure?
> > > > Only IF you have these numbers handy :-)
> > >
> > > Okay, the little/mid/big cpu scale capacities are 350/930/1024, but
> > > the cpu frequency points are discrete; the big core's highest freq
> > > point may be only a little above the mid core's highest.
> > > In this case, once thermal decreases the scaling_max_freq, the
> > > maximum frequency of the big core easily falls below that of the
> > > medium core.
> > > Of course, this corner case is due to the frequency design of the soc
> > > and our thermal algorithm.
> >
> > Okay, thanks for the info!
> >
> > >
> > > >
> > > > Is it actually an indication of a potential other problem if you swing into
> > > > capacity inversion in the bigs that often? I've seen a lot of systems where the
> > > > difference between the meds and bigs is small. But frequent inversion could be
> > > > suspicious still.
> > > >
> > > > Do the littles and the mediums experience any significant thermal pressure too?
> > >
> > > In our platform, it's not.
> >
> > Good.
> >
> > > > It doesn't seem it'll cause a significant error, but still it seems to me this
> > > > function wants the original capacity passed to it.
> > > >
> > > > There are similar questions to be asked since you modify sg_cpu->max. Every
> > > > user needs to be audited if they're fine with this change or not.
> > > >
> > > > I'm not sure still what we are achieving here. You want to force schedutil not
> > > > to request higher frequencies if thermal pressure is high? Should schedutil
> > > > actually care? Shouldn't the cpufreq driver reject this request and pick the
> > > > next best thing if it can't satisfy it? I could be missing something, I haven't
> > > > looked that hard tbh :-)
> > >
> > > I changed this just to make it track the real capacity of the cpu
> > > more closely; if it will cause other problems, maybe it would be
> > > better not to change it. :)
> >
> > There are others who can give you a better opinion. But AFAICS we're not fixing
> > anything but risking breaking other things. So I vote for not changing it :)
> >
> > > > It depends on the severity of the problem. The simplest thing I can suggest is
> > > > to check if the cpu is in capacity inversion state, and if it is, then make
> > > > rt_task_fits_capacity() return false always.
> > > >
> > > > If we need a generic solution to handle thermal pressure omitting OPPs, then
> > > > the search needs to become more complex. The proposal in this patch is not
> > > > adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
> > > > omit some cpus because of any tiny thermal pressure. For example if the
> > > > capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
> > > > any small thermal pressure on mediums will cause these tasks to run on big cpus
> > > > only, which is not what we want. Especially if these big cpus can end up in
> > > > capacity inversion later ;-)
> > > >
> > > > So if we want to handle this case, then we need to ensure the search returns
> > > > false only if
> > > >
> > > > 1. Thermal pressure results in real OPP to be omitted.
> > > > 2. Another CPU that can provide this performance level is available.
> > > >
> > > > Otherwise we should still fit it on this CPU because it'll give us the closest
> > > > thing to what was requested.
> > > >
> > > > I can think of 2 ways to implement this, but none of them seem particularly
> > > > pretty :-/
> > >
> > > Maybe as Lukasz Luba said:
> > >
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > > Let's meet in the middle:
> > > > 1) use the thermal PELT signal in RT:
> > > > capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
> > > > 2) introduce a more configurable thermal_pressure shifter instead
> > > > 'sched_thermal_decay_shift', which would allow not only to make the
> > > > decaying longer, but also shorter when the platform already might do
> > > > that, to not cause too much traffic.
> > >
> > > But even if this is changed, there will still be the same problem. I
> > > look forward to Lukasz's patch :)
> >
> > This will not address my concern unless I missed something.
> >
> > The best (simplest) way forward IMHO is to introduce a new function
> >
> > bool cpu_in_capacity_inversion(int cpu);
> >
> > (feel free to pick another name) which will detect the scenario you're in. You
> > can use this function then in rt_task_fits_capacity()
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index a32c46889af8..d48811a7e956 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > return true;
> >
> > + if (cpu_in_capacity_inversion(cpu))
> > + return false;
> > +
> > min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > You'll probably need to do something similar in dl_task_fits_capacity().
> >
> > This might be a bit aggressive though as we'll steer away all RT tasks from
> > this CPU (as long as there's another CPU that can fit it). I need to think more
> > about it. But we could do something like this too
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index a32c46889af8..f2a34946a7ab 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > return true;
> >
> > + cpu_cap = capacity_orig_of(cpu);
> > +
> > + if (cpu_in_capacity_inversion(cpu))
>
> It's a good idea, but as you said, in mainline
> sysctl_sched_uclamp_util_min_rt_default is always 1024.
> Maybe it's better to add it to the condition?
>
> +	if (sysctl_sched_uclamp_util_min_rt_default != SCHED_CAPACITY_SCALE &&
> +	    cpu_in_capacity_inversion(cpu))
>
> > + cpu_cap -= thermal_load_avg(cpu_rq(cpu));
>
> Why use thermal_load_avg? If thermal is always in effect,
> thermal_load_avg will keep growing; as a result, cpu_cap may end up
> smaller than (capacity_orig - thermal_pressure).

For a fixed thermal_pressure(), thermal_load_avg() will not be higher
than thermal_pressure() but will increase to reach thermal_pressure()

In the current implementation for sched_asym_cpucapacity topology, we
do a 1st iteration trying to find a cpu that fits a task's capacity
but if it fails, we run a normal cpupri_find that doesn't care about
capacity.

Do I understand correctly that in your case you would like to run
a 1st iteration that takes into account capacity_orig_of(cpu) -
thermal_load_avg(cpu_rq(cpu)), if that fails run another iteration
with only capacity_orig_of(cpu), and finally try without any
capacity constraint?
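
A toy model (plain C, deliberately not the kernel's PELT code) of that
convergence: against a constant instantaneous pressure, a geometric average
rises from below and never overshoots, and the shift - standing in for the
sched_thermal_decay_shift knob Lukasz wants to make tunable in both
directions - controls how fast it tracks. In integer arithmetic it plateaus
just below the target, but the shape is the point:

#include <stdio.h>

int main(void)
{
	unsigned long pressure = 200;	/* constant thermal pressure */
	unsigned long avg = 0;		/* stand-in for thermal_load_avg() */
	int shift = 3;			/* bigger shift -> slower tracking */

	for (int tick = 1; tick <= 40; tick++) {
		/* move a fixed fraction of the remaining gap each tick */
		avg += (pressure - avg) >> shift;
		if (tick % 10 == 0)
			printf("tick %2d: avg = %lu (pressure = %lu)\n",
			       tick, avg, pressure);
	}
	return 0;
}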

>
> Thanks!
>
> > +
> > min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > - cpu_cap = capacity_orig_of(cpu);
> > -
> > return cpu_cap >= min(min_cap, max_cap);
> > }
> > #else
> >
> > Thoughts?
>
> >
> >
> > Thanks!
> >
> > --
> > Qais Yousef

2022-04-27 06:47:39

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 04/26/22 10:09, Vincent Guittot wrote:
> On Tue, 26 Apr 2022 at 04:07, Xuewen Yan <[email protected]> wrote:
> >
> > On Tue, Apr 26, 2022 at 12:12 AM Qais Yousef <[email protected]> wrote:
> > >
> > > On 04/25/22 09:31, Xuewen Yan wrote:
> > > > On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef <[email protected]> wrote:
> > > > > Is it okay to share what the capacities of the littles, mediums and bigs on
> > > > > your system? And how they change under worst case scenario thermal pressure?
> > > > > Only IF you have these numbers handy :-)
> > > >
> > > > Okay, the little/mid/big cpu scale capacities are 350/930/1024, but
> > > > the cpu frequency points are discrete; the big core's highest freq
> > > > point may be only a little above the mid core's highest.
> > > > In this case, once thermal decreases the scaling_max_freq, the
> > > > maximum frequency of the big core easily falls below that of the
> > > > medium core.
> > > > Of course, this corner case is due to the frequency design of the soc
> > > > and our thermal algorithm.
> > >
> > > Okay, thanks for the info!
> > >
> > > >
> > > > >
> > > > > Is it actually an indication of a potential other problem if you swing into
> > > > > capacity inversion in the bigs that often? I've seen a lot of systems where the
> > > > > difference between the meds and bigs is small. But frequent inversion could be
> > > > > suspicious still.
> > > > >
> > > > > Do the littles and the mediums experience any significant thermal pressure too?
> > > >
> > > > In our platform, it's not.
> > >
> > > Good.
> > >
> > > > > It doesn't seem it'll cause a significant error, but still it seems to me this
> > > > > function wants the original capacity passed to it.
> > > > >
> > > > > There are similar questions to be asked since you modify sg_cpu->max. Every
> > > > > user needs to be audited if they're fine with this change or not.
> > > > >
> > > > > I'm not sure still what we are achieving here. You want to force schedutil not
> > > > > to request higher frequencies if thermal pressure is high? Should schedutil
> > > > > actually care? Shouldn't the cpufreq driver reject this request and pick the
> > > > > next best thing if it can't satisfy it? I could be missing something, I haven't
> > > > > looked that hard tbh :-)
> > > >
> > > > I changed this just to make it track the real capacity of the cpu
> > > > more closely; if it will cause other problems, maybe it would be
> > > > better not to change it. :)
> > >
> > > There are others who can give you a better opinion. But AFAICS we're not fixing
> > > anything but risking breaking other things. So I vote for not changing it :)
> > >
> > > > > It depends on the severity of the problem. The simplest thing I can suggest is
> > > > > to check if the cpu is in capacity inversion state, and if it is, then make
> > > > > rt_task_fits_capacity() return false always.
> > > > >
> > > > > If we need a generic solution to handle thermal pressure omitting OPPs, then
> > > > > the search needs to become more complex. The proposal in this patch is not
> > > > > adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
> > > > > omit some cpus because of any tiny thermal pressure. For example if the
> > > > > capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
> > > > > any small thermal pressure on mediums will cause these tasks to run on big cpus
> > > > > only, which is not what we want. Especially if these big cpus can end up in
> > > > > capacity inversion later ;-)
> > > > >
> > > > > So if we want to handle this case, then we need to ensure the search returns
> > > > > false only if
> > > > >
> > > > > 1. Thermal pressure results in real OPP to be omitted.
> > > > > 2. Another CPU that can provide this performance level is available.
> > > > >
> > > > > Otherwise we should still fit it on this CPU because it'll give us the closest
> > > > > thing to what was requested.
> > > > >
> > > > > I can think of 2 ways to implement this, but none of them seem particularly
> > > > > pretty :-/
> > > >
> > > > Maybe as Lukasz Luba said:
> > > >
> > > > https://lore.kernel.org/all/[email protected]/
> > > >
> > > > > Let's meet in the middle:
> > > > > 1) use the thermal PELT signal in RT:
> > > > > capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
> > > > > 2) introduce a more configurable thermal_pressure shifter instead
> > > > > 'sched_thermal_decay_shift', which would allow not only to make the
> > > > > decaying longer, but also shorter when the platform already might do
> > > > > that, to not cause too much traffic.
> > > >
> > > > But even if this is changed, there will still be the same problem. I
> > > > look forward to Lukasz's patch :)
> > >
> > > This will not address my concern unless I missed something.
> > >
> > > The best (simplest) way forward IMHO is to introduce a new function
> > >
> > > bool cpu_in_capacity_inversion(int cpu);
> > >
> > > (feel free to pick another name) which will detect the scenario you're in. You
> > > can use this function then in rt_task_fits_capacity()
> > >
> > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > index a32c46889af8..d48811a7e956 100644
> > > --- a/kernel/sched/rt.c
> > > +++ b/kernel/sched/rt.c
> > > @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > return true;
> > >
> > > + if (cpu_in_capacity_inversion(cpu))
> > > + return false;
> > > +
> > > min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > >
> > > You'll probably need to do something similar in dl_task_fits_capacity().
> > >
> > > This might be a bit aggressive though as we'll steer away all RT tasks from
> > > this CPU (as long as there's another CPU that can fit it). I need to think more
> > > about it. But we could do something like this too
> > >
> > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > index a32c46889af8..f2a34946a7ab 100644
> > > --- a/kernel/sched/rt.c
> > > +++ b/kernel/sched/rt.c
> > > @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > return true;
> > >
> > > + cpu_cap = capacity_orig_of(cpu);
> > > +
> > > + if (cpu_in_capacity_inversion(cpu))
> >
> > It's a good idea, but as you said, in mainline
> > sysctl_sched_uclamp_util_min_rt_default is always 1024.
> > Maybe it's better to add it to the condition?
> >
> > +	if (sysctl_sched_uclamp_util_min_rt_default != SCHED_CAPACITY_SCALE &&
> > +	    cpu_in_capacity_inversion(cpu))
> >
> > > + cpu_cap -= thermal_load_avg(cpu_rq(cpu));
> >
> > Why use thermal_load_avg? If thermal is always in effect,
> > thermal_load_avg will keep growing; as a result, cpu_cap may end up
> > smaller than (capacity_orig - thermal_pressure).
>
> For a fixed thermal_pressure(), thermal_load_avg() will not be higher
> than thermal_pressure() but will increase to reach thermal_pressure()
>
> In the current implementation for sched_asym_cpucapacity topology, we
> do a 1st iteration trying to find a cpu that fits a task's capacity
> but if it fails, we run a normal cpupri_find that doesn't care about
> capacity.
>
> Do I understand correctly that in your case you would like to run
> a 1st iteration that takes into account capacity_orig_of(cpu) -
> thermal_load_avg(cpu_rq(cpu)), if that fails run another iteration
> with only capacity_orig_of(cpu), and finally try without any
> capacity constraint?

Wouldn't it be expensive to have 3 loops? That was my other suggestion but
I wasn't sure the complexity was worth it. So I suggested handling the
capacity inversion case only.


Thanks

--
Qais Yousef

2022-04-27 10:36:23

by Xuewen Yan

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

> > > The best (simplest) way forward IMHO is to introduce a new function
> > >
> > > bool cpu_in_capacity_inversion(int cpu);
> > >
> > > (feel free to pick another name) which will detect the scenario you're in. You
> > > can use this function then in rt_task_fits_capacity()
> > >
> > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > index a32c46889af8..d48811a7e956 100644
> > > --- a/kernel/sched/rt.c
> > > +++ b/kernel/sched/rt.c
> > > @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > return true;
> > >
> > > + if (cpu_in_capacity_inversion(cpu))
> > > + return false;
> > > +
> > > min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > >
> > > You'll probably need to do something similar in dl_task_fits_capacity().
> > >
> > > This might be a bit aggressive though as we'll steer away all RT tasks from
> > > this CPU (as long as there's another CPU that can fit it). I need to think more
> > > about it. But we could do something like this too
> > >
> > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > index a32c46889af8..f2a34946a7ab 100644
> > > --- a/kernel/sched/rt.c
> > > +++ b/kernel/sched/rt.c
> > > @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > return true;
> > >
> > > + cpu_cap = capacity_orig_of(cpu);
> > > +
> > > + if (cpu_in_capacity_inversion(cpu))
> >
> > It's a good idea, but as you said, in mainline
> > sysctl_sched_uclamp_util_min_rt_default is always 1024.
> > Maybe it's better to add it to the condition?
>
> I don't think so. If we want to handle finding the next best thing, we need to
> make the search more complex than that. This is no worse than having 2 RT tasks
> waking up at the same time while there's only a single big CPU. One of them
> will end up on a medium or a little and we don't provide better guarantees
> here.

I may have misunderstood your patch before; do you mean this:
1. The cpu has to be in capacity inversion. If not, the cpu's capacity
is still the biggest, so even though
sysctl_sched_uclamp_util_min_rt_default is 1024, the task can still be
put on the cpu.
2. If the cpu is in capacity inversion, the thermal pressure should be
considered. At this point, if sysctl_sched_uclamp_util_min_rt_default
is not 1024, the rt task still has a chance to select the cpu.
If sysctl_sched_uclamp_util_min_rt_default is 1024, none of the cpus
can actually fit the rt task; at this point, select a cpu without
considering capacity_orig_of(cpu). The worst outcome may be that the
rt task ends up on a small core.

Did I understand right? If so, perhaps this approach has the least
impact on the current code complexity.
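
If that reading is right, the combined check would look roughly like this
(sketch only, merging your second diff with the above; not a tested patch,
and cpu_in_capacity_inversion() is still the hypothetical helper from your
mail):

static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
{
	unsigned int min_cap, max_cap, cpu_cap;

	/* Only heterogeneous systems can benefit from this check */
	if (!static_branch_unlikely(&sched_asym_cpucapacity))
		return true;

	cpu_cap = capacity_orig_of(cpu);

	/*
	 * Case 1: no inversion -> cpu_cap stays the biggest, so even a
	 * default uclamp_min of 1024 can still land here.
	 * Case 2: inversion -> account for thermal; with uclamp_min =
	 * 1024 nothing fits and the caller falls back to the
	 * capacity-unaware search, worst case a small core.
	 */
	if (cpu_in_capacity_inversion(cpu))
		cpu_cap -= thermal_load_avg(cpu_rq(cpu));

	min_cap = uclamp_eff_value(p, UCLAMP_MIN);
	max_cap = uclamp_eff_value(p, UCLAMP_MAX);

	return cpu_cap >= min(min_cap, max_cap);
}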

Thanks!
BR
---
xuewen

2022-04-27 10:50:30

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 04/26/22 12:06, Vincent Guittot wrote:
> > Wouldn't it be expensive to have 3 loops? That was my other suggestion but
> > I wasn't sure the complexity was worth it. So I suggested handling the
> > capacity inversion case only.
>
> 3 loops might be too expensive. I mainly want to make sure I
> understand what should be done to fix Xuewen's case without breaking
> others. Then we can see how to optimize this into a reasonable number
> of loops.

The generic solution is what I tried to outline before:

> So if we want to handle this case, then we need to ensure the search returns
> false only if
>
> 1. Thermal pressure results in real OPP to be omitted.
> 2. Another CPU that can provide this performance level is available.
>
> Otherwise we should still fit it on this CPU because it'll give us the closest
> thing to what was requested.

And we can do this in 2 ways: 3 loops as you said, or by creating a fallback
cpumask as we search so that by the end we can fall back to it if we didn't
find the best fit.

My only worry here is that Xuewen doesn't see thermal issues on mids, so
testability is a problem. This generic solution will only help with the case
of mids losing some OPPs at the top, where we can do better by selecting a big
core instead of a medium (if the big is not in capacity inversion itself).

I *think* (and I don't feel strongly about it at all) that checking for
capacity inversion and bypassing that cpu, or only then considering its
thermal pressure, is the right approach to take here until someone reports
more woes due to thermal pressure.

That said, the generic solution might not actually be that bad and I'm just
being a bit conservative. So it would be good to hear what others think.


Thanks

--
Qais Yousef
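
The fallback-cpumask variant could be shaped roughly like the walk below
(hypothetical sketch; the real cpupri iteration is over priority levels
rather than a flat cpu loop, and 'candidates' stands in for the mask that
filtering produces):

static int pick_cpu_with_fallback(struct task_struct *p,
				  const struct cpumask *candidates)
{
	unsigned long fallback_cap = 0;
	int cpu, fallback = -1;

	for_each_cpu(cpu, candidates) {
		unsigned long cap = capacity_orig_of(cpu) -
				    thermal_load_avg(cpu_rq(cpu));

		if (rt_task_fits_capacity(p, cpu))
			return cpu;	/* a fitting CPU ends the search */

		/* Otherwise remember the highest effective capacity. */
		if (cap > fallback_cap) {
			fallback_cap = cap;
			fallback = cpu;
		}
	}

	/* The closest thing to what was requested, per the rules above. */
	return fallback;
}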

2022-04-27 10:59:39

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 04/26/22 10:07, Xuewen Yan wrote:
> On Tue, Apr 26, 2022 at 12:12 AM Qais Yousef <[email protected]> wrote:
> >
> > On 04/25/22 09:31, Xuewen Yan wrote:
> > > On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef <[email protected]> wrote:
> > > > Is it okay to share what the capacities of the littles, mediums and bigs on
> > > > your system? And how they change under worst case scenario thermal pressure?
> > > > Only IF you have these numbers handy :-)
> > >
> > > Okay, the little/mid/big cpu scale capacities are 350/930/1024, but
> > > the cpu frequency points are discrete; the big core's highest freq
> > > point may be only a little above the mid core's highest.
> > > In this case, once thermal decreases the scaling_max_freq, the
> > > maximum frequency of the big core easily falls below that of the
> > > medium core.
> > > Of course, this corner case is due to the frequency design of the soc
> > > and our thermal algorithm.
> >
> > Okay, thanks for the info!
> >
> > >
> > > >
> > > > Is it actually an indication of a potential other problem if you swing into
> > > > capacity inversion in the bigs that often? I've seen a lot of systems where the
> > > > difference between the meds and bigs is small. But frequent inversion could be
> > > > suspicious still.
> > > >
> > > > Do the littles and the mediums experience any significant thermal pressure too?
> > >
> > > In our platform, it's not.
> >
> > Good.
> >
> > > > It doesn't seem it'll cause a significant error, but still it seems to me this
> > > > function wants the original capacity passed to it.
> > > >
> > > > There are similar questions to be asked since you modify sg_cpu->max. Every
> > > > user needs to be audited if they're fine with this change or not.
> > > >
> > > > I'm not sure still what we are achieving here. You want to force schedutil not
> > > > to request higher frequencies if thermal pressure is high? Should schedutil
> > > > actually care? Shouldn't the cpufreq driver reject this request and pick the
> > > > next best thing if it can't satisfy it? I could be missing something, I haven't
> > > > looked that hard tbh :-)
> > >
> > > I changed this just to make it track the real capacity of the cpu
> > > more closely; if it will cause other problems, maybe it would be
> > > better not to change it. :)
> >
> > There are others who can give you a better opinion. But AFAICS we're not fixing
> > anything but risking breaking other things. So I vote for not changing it :)
> >
> > > > It depends on the severity of the problem. The simplest thing I can suggest is
> > > > to check if the cpu is in capacity inversion state, and if it is, then make
> > > > rt_task_fits_capacity() return false always.
> > > >
> > > > If we need a generic solution to handle thermal pressure omitting OPPs, then
> > > > the search needs to become more complex. The proposal in this patch is not
> > > > adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly
> > > > omit some cpus because of any tiny thermal pressure. For example if the
> > > > capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then
> > > > any small thermal pressure on mediums will cause these tasks to run on big cpus
> > > > only, which is not what we want. Especially if these big cpus can end up in
> > > > capacity inversion later ;-)
> > > >
> > > > So if we want to handle this case, then we need to ensure the search returns
> > > > false only if
> > > >
> > > > 1. Thermal pressure results in real OPP to be omitted.
> > > > 2. Another CPU that can provide this performance level is available.
> > > >
> > > > Otherwise we should still fit it on this CPU because it'll give us the closest
> > > > thing to what was requested.
> > > >
> > > > I can think of 2 ways to implement this, but none of them seem particularly
> > > > pretty :-/
> > >
> > > Maybe as Lukasz Luba said:
> > >
> > > https://lore.kernel.org/all/[email protected]/
> > >
> > > > Let's meet in the middle:
> > > > 1) use the thermal PELT signal in RT:
> > > > capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
> > > > 2) introduce a more configurable thermal_pressure shifter instead
> > > > 'sched_thermal_decay_shift', which would allow not only to make the
> > > > decaying longer, but also shorter when the platform already might do
> > > > that, to not cause too much traffic.
> > >
> > > But even if this is changed, there will still be the same problem. I
> > > look forward to Lukasz's patch :)
> >
> > This will not address my concern unless I missed something.
> >
> > The best (simplest) way forward IMHO is to introduce a new function
> >
> > bool cpu_in_capacity_inversion(int cpu);
> >
> > (feel free to pick another name) which will detect the scenario you're in. You
> > can use this function then in rt_task_fits_capacity()
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index a32c46889af8..d48811a7e956 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > return true;
> >
> > + if (cpu_in_capacity_inversion(cpu))
> > + return false;
> > +
> > min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > You'll probably need to do something similar in dl_task_fits_capacity().
> >
> > This might be a bit aggressive though as we'll steer away all RT tasks from
> > this CPU (as long as there's another CPU that can fit it). I need to think more
> > about it. But we could do something like this too
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index a32c46889af8..f2a34946a7ab 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > return true;
> >
> > + cpu_cap = capacity_orig_of(cpu);
> > +
> > + if (cpu_in_capacity_inversion(cpu))
>
> It's a good idea, but as you said, in mainline, the
> sysctl_sched_uclamp_util_min_rt_default is always 1024.
> Maybe it's better to add that to the check?

I don't think so. If we want to handle finding the next best thing, we need to
make the search more complex than that. This is no worse than having 2 RT tasks
waking up at the same time while there's only a single big CPU. One of them
will end up on a medium or a little and we don't provide better guarantees
here.

Basically we need to improve our fallback mechanism to try to pick the next
biggest cpu. Which is something we can do. We just need to be wary not to
increase the wake up latency by making our search more expensive.
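
As a rough sketch, the multi-level fallback described above could look
something like this (an editor's illustration only, not code from the
thread; rt_task_fits_capacity_thermal() is a hypothetical helper that
applies the thermal-aware check, though a patch along these lines does
appear later in the thread):

--->8---

/*
 * Illustrative only: try the strictest fitness criterion first and
 * degrade gracefully instead of failing the search outright.
 */
static int rt_find_fitting_cpu(struct task_struct *p,
			       struct cpumask *lowest_mask)
{
	int cpu;

	/* Level 1: fits even with thermal pressure applied. */
	for_each_cpu(cpu, lowest_mask)
		if (rt_task_fits_capacity_thermal(p, cpu))
			return cpu;

	/* Level 2: fits on original capacity, ignoring thermal. */
	for_each_cpu(cpu, lowest_mask)
		if (rt_task_fits_capacity(p, cpu))
			return cpu;

	/* Level 3: nothing fits; fall back to the lowest priority rq. */
	return cpumask_first(lowest_mask);
}

--->8---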

>
> + if (sysctl_sched_uclamp_util_min_rt_default !=
> SCHED_CAPACITY_SCALE && cpu_in_capacity_inversion(cpu))
>
> > + cpu_cap -= thermal_load_avg(cpu_rq(cpu));
>
> Why use thermal_load_avg? If thermal is always in effect, the
> thermal_load_avg would get bigger and bigger; as a result, the cpu_cap
> may be smaller than (capacity_orig - thermal_pressure).

I just copied the suggestion from Lukasz to be honest without thinking much
about it.
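
To make the two signals being debated concrete, here is a small editor's
sketch (both interfaces are existing kernel helpers; the wrapper
functions are made up for illustration):

--->8---

static inline unsigned long cpu_cap_raw(int cpu)
{
	/* Instantaneous: follows HW/FW mitigation immediately, so it
	 * can change abruptly between reading it and acting on it. */
	return capacity_orig_of(cpu) - arch_scale_thermal_pressure(cpu);
}

static inline unsigned long cpu_cap_avg(int cpu)
{
	/* PELT-averaged: smooth, but lags the raw signal while rising
	 * and while decaying - the concern raised above. */
	return capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
}

--->8---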


Thanks

--
Qais Yousef

2022-04-27 11:45:31

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 04/27/22 09:38, Xuewen Yan wrote:
> > > > The best (simplest) way forward IMHO is to introduce a new function
> > > >
> > > > bool cpu_in_capacity_inversion(int cpu);
> > > >
> > > > (feel free to pick another name) which will detect the scenario you're in. You
> > > > can use this function then in rt_task_fits_capacity()
> > > >
> > > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > > index a32c46889af8..d48811a7e956 100644
> > > > --- a/kernel/sched/rt.c
> > > > +++ b/kernel/sched/rt.c
> > > > @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > > return true;
> > > >
> > > > + if (cpu_in_capacity_inversion(cpu))
> > > > + return false;
> > > > +
> > > > min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> > > > max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> > > >
> > > > You'll probably need to do something similar in dl_task_fits_capacity().
> > > >
> > > > This might be a bit aggressive though as we'll steer away all RT tasks from
> > > > this CPU (as long as there's another CPU that can fit it). I need to think more
> > > > about it. But we could do something like this too
> > > >
> > > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > > > index a32c46889af8..f2a34946a7ab 100644
> > > > --- a/kernel/sched/rt.c
> > > > +++ b/kernel/sched/rt.c
> > > > @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> > > > if (!static_branch_unlikely(&sched_asym_cpucapacity))
> > > > return true;
> > > >
> > > > + cpu_cap = capacity_orig_of(cpu);
> > > > +
> > > > + if (cpu_in_capacity_inversion(cpu))
> > >
> > > It's a good idea, but as you said, in mainline, the
> > > sysctl_sched_uclamp_util_min_rt_default is always 1024,
> > > Maybe it's better to add it to the judgment?
> >
> > I don't think so. If we want to handle finding the next best thing, we need to
> > make the search more complex than that. This is no worse than having 2 RT tasks
> > waking up at the same time while there's only a single big CPU. One of them
> > will end up on a medium or a little and we don't provide better guarantees
> > here.
>
> I may have misunderstood your patch before; do you mean this:
> 1. the cpu has to be in inversion; if not, the cpu's capacity is still
> the biggest, so although sysctl_sched_uclamp_util_min_rt_default
> = 1024, the task can still be put on this cpu.
> 2. If the cpu is in inversion, the thermal pressure should be
> considered; at this point, if sysctl_sched_uclamp_util_min_rt_default
> is not 1024, the rt task still has a chance to select this cpu.
> If sysctl_sched_uclamp_util_min_rt_default is 1024, none of the
> cpus can actually fit the rt task; in that case, select a cpu without
> considering cap_orig_of(cpu). The worst case may be that the rt task
> ends up on a small core.
>
> Did I understand right? If so, perhaps this approach has the least
> impact on the current code complexity.

I believe you understood correctly. Tasks that need to run at 1024 when the
biggest cpu is in capacity inversion will get screwed - the system can't
satisfy their requirements. If they're happy to run on a medium (the next best
thing), then their uclamp_min should change to reflect that. If they are not
happy to run at the medium, then I'm not sure if it'll make much of
a difference if they end up on little. Their deadline will be missed anyway..

Again this is no worse than having two RT tasks with uclamp_min = 1024 waking
up at the same time on a system with 1 big cpu. Only one of them will be able
to run there.

I think tasks wanting 1024 are rare and no one seemed to bother with doing
better here so far. But we can certainly do better if need to :-)


Thanks

--
Qais Yousef

2022-05-01 00:06:55

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity



On 4/26/22 08:39, Vincent Guittot wrote:
> On Thu, 21 Apr 2022 at 12:57, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 4/21/22 09:29, Vincent Guittot wrote:
>>> On Tue, 19 Apr 2022 at 16:13, Lukasz Luba <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 4/19/22 13:51, Vincent Guittot wrote:
>>>>> On Tue, 19 Apr 2022 at 14:01, Lukasz Luba <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 4/19/22 08:14, Vincent Guittot wrote:
>>>>>>> On Sat, 16 Apr 2022 at 04:47, Xuewen Yan <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Luba / Dietmar
>>>>>>>>
>>>>>>>> On Wed, Apr 13, 2022 at 9:26 PM Lukasz Luba <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>
> [...]
>
>>>>>> To be precise, and maybe fix some potential design issues: we are
>>>>>> talking here about utilization and the max capacity set in the function:
>>>>>> sugov_get_util()
>>>>>> so fields:
>>>>>>
>>>>>> sugov_cpu::util
>>>>>> sugov_cpu::max /* max capacity */
>>>>>
>>>>> Yes. With this patch, util will be lower under the current thermal
>>>>> mitigation, whereas util normally reflects what we need, not what can
>>>>> be provided
>>>>
>>>> These are different requirements: util has to be compared with max
>>>> capacity, and max capacity has to be the original max CPU capacity - for SchedUtil.
>>>> OK, why? What does this requirement add to the design and the final values?
>>>
>>> Because the calculation you are proposing is wrong and doesn't make
>>> sense. Util is the average utilization of the cpu that has to be
>>> compared with its original capacity max in order to get the freq that
>>> matches with this utilization.
>>>
>>> We have freq = util / max * max_freq and cpufreq will then cap freq
>>> if mitigation is applied. Once the mitigation disappears, the request
>>> will be back to the targeted freq.
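
A worked illustration of this formula (editor's addition, with invented
numbers): with util = 640, max = 1024 and max_freq = 2000 MHz, the
request is 640/1024 * 2000 = 1250 MHz. Substituting max' = 824 (thermal
pressure of 200), the same util yields 640/824 * 2000 ~= 1553 MHz -
a higher request purely because the denominator shrank, even though the
task's needs did not change.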
>>>
>>> If you replace max by max' = max - arch_scale_thermal_pressure then :
>>>
>>> - by the time you do the calculation, arch_scale_thermal_pressure can
>>> have changed and the result is meaningless. This is true whatever the
>>> pace of updating arch_scale_thermal_pressure
>>
>> I can understand the sudden change of the value taken from
>> arch_scale_thermal_pressure, but there are similar cases and we live
>> with them. Look at the whole EAS estimation done in a single CPU
>> wake-up event, or the uclamp stuff. As long as this is not occurring
>> too frequently - we live with it.
>>
>> I can see your concern here, since you mentioned below that you expect
>> some platforms to hit it at a 'khz' rate. This is probably not good - to
>> trigger the kernel so often from HW/FW.
>>
>> That's why I have been struggling to find a 'good' design for this
>> glue layer for Arm FW+kernel. Our FW probably won't cause such
>> huge notification traffic. A rate of e.g. 50-100ms would be enough,
>> especially if we have the per-CPU cpufreq policy. So we might have
>> this 'PELT-like filter or signal' in FW, and just update kernel
>
> In this case arch_scale_thermal_pressure() doesn't reflect the actual
> thermal pressure but an average which is what thermal_load_avg() is
> also doing.

Exactly. I hope you understand that there is no point in bombarding the
kernel with 'khz'-rate interrupts carrying information which can be passed
along gently.

But there is an issue with this. We would have to go via the thermal
PELT-like characteristics again, since you said the raw
arch_scale_thermal_pressure() 'must not' be used.

>
>> less often. But then there is an issue with the rising/decaying
>> penalty of the kernel thermal pressure signal.
>>
>> We cannot assume that some SoCs don't do this already.
>>
>> Let's meet in the middle:
>> 1) use the thermal PELT signal in RT:
>> capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu))
>
> This is what Dietmar and I have been suggested

There is a difference: Dietmar was asking which signal should
be used, while you made the statement that it 'must not' be used.

That's why I proposed to meet in the middle - which is not a perfect
solution, because:
1) by default this 'sched_thermal_decay_shift' boot parameter would be
set to 0, so engineers might not be aware of this and its impact;
they would have to be pointed to it when they play with RT tasks
2) it would affect not only the RT 'view' of the current real
CPU capacity but also the CFS 'view' of it. But they might be interested
in different 'views':
a) RT - more instantaneous, since it's more latency focused, while
b) CFS - more smooth, since it tries to be more balanced

I've prepared two patches [1][2] and a notebook [3] with experiments on
the new thermal pressure signal's raising/decaying characteristics. That
might help them to understand.

>
>> 2) introduce a more configurable thermal_pressure shifter instead
>> 'sched_thermal_decay_shift', which would allow not only to make the
>> decaying longer, but also shorter when the platform already might do
>> that, to not cause too much traffic.
>>
>>>
>>> - you change the range of capacity to max' = max -
>>> arch_scale_thermal_pressure and you scale it to max_freq. If util >
>>> max', then you will ask for max_freq whether util is really close to
>>> max or not. Also you will ask for max freq even if util is close to
>>> but below max', whereas the mitigation doesn't impact utilization
>>
>> It's already there, even w/o patch. That's why I gave you the examples.
>
> Not sure how to understand this above.
>
> utilization can already go above, but this reflects the reality that
> the task could need more capacity than the current cpu capacity

We have issues with this for the CPU rq utilization. IMO it deserves
to be re-visited and somehow fixed, especially when we have uclamp
tasks. It's a topic for a different discussion. The signal which
we are using in SchedUtil is 'not ideal'. I don't have time
to make those plots from experiments. We can return to this topic
later.

>
>>
>> BTW, isn't it true that the utilization of the Little CPU rq can reach
>> 1024 today after your change to the PELT when there is no idle time,
>> even when the cpu max capacity is e.g. 300?
>
> yes
>
>> Before that change the utilization of a throttled CPU rq would converge
>> to the current capacity of the CPU, am I right?
>
> yes
>
>>
>> Is it this commit:
>> 23127296889fe84b0762b191
>>
>>>
>>>>
>>>>>
>>>>>>
>
> [...]
>
>>>>>
>>>>>> but then in both cases are multiplied by 'max_freq' in (2)
>>>>>>
>>>>>> As you can see this is not the situation that you have described, is it?
>>>>>> And the transient or non-transient is minor here IMO.
>>>>>
>>>>> If max is 512 then util = 640 which is much lower than 1024.
>>>>
>>>> What scenario is this?
>>>> Is 1024 the utilization that we might have from the CPU rq?
>>>> What is the original CPU capacity, 1024?
>>
>> Is this 1024 the utilization of the CPU runqueue, given that since
>> the new PELT we can have it bigger than the CPU capacity?
>>
>>>>
>>>>>
>>>>>>
>>>>>> Secondly, you have mentioned the mitigation in HW and the issue
>>>>>> between the instantaneous vs. the PELT-based thermal pressure
>>>>>> information. This is something I have been scratching my head over
>>>>>> for a long time. I'm trying to develop this for the new Arm FW
>>>>>> thermal. You have mentioned: 'thermal mitigation is managed by HW
>>>>>> at a much higher frequency than what Linux can handle' - I would
>>>>>> also be more precise here: HW or FW? How often can the HW change
>>>>>> max freq, or how often can FW change that? If we don't have those
>>>>>> numbers, then the statement 'a much higher' doesn't help in solving this
>>>>>
>>>>> By 'much higher' I mean that Linux can't react fast enough and should not
>>>>> try to sync because it's a lost game
>>>>
>>>> As I said, 'much higher' is not a number to base a design on.
>>>
>>> But that gives you the constraint that you can't expect to be always
>>> synced with an up-to-date value, which is the most important thing
>>> here. This means that cpu_cap -= arch_scale_thermal_pressure(cpu) can
>>> be wrong just after you computed it, and your decision is wrong.
>>
>> This is a hypothetical situation where the value can change in such a
>> noisy way on some platform. But I understand your concern.
>>
>>>
>>>
>>>> We need real numbers from real platforms. Currently we have two
>>>> places where the thermal pressure is set:
>>>> 1) cpufreq_cooling.c [1]
>>>> 2) Qcom driver [2]
>>>> (we might have 3rd soon for Arm SCMI+FW)
>>>
>>> I don't have details but I have khz in mind
>>
>> If such traffic of interrupts in khz is true for driver in 2)
>> then it's a bit concerning.
>>
>> Although, smarter platforms shouldn't suffer due to a design forced
>> by one corner-case platform.
>>
>>>
>>>>
>>>> For the 2nd I would like to see numbers. For the 1st one, when
>>>> kernel thermal is used (which supports a higher number of platforms
>>>> compared to the Qcom driver), the kernel by design tries to control
>>>> thermal, so changes are not that frequent.
>>>>
>>>> As of now, I can see in experiments that the 1st suffers long decay
>>>> delays and also corner cases with long-idle CPUs.
>>>>
>>>>>
>
> [...]
>
>>>> I'm trying to help Xuewen to solve his/her issues with the RT class
>>>> incrementally. I don't want to push him/her into the deep dark water
>>>> of PELT signals: which variable to compare them with, corner cases
>>>> where they are (or are not) updated or are completely not
>>>> implemented. I'm not even sure those extra complexities make sense
>>>> for RT/DL (since they make some difference on big.mid.little-specific
>>>> platforms but not for the rest).
>>
>> As I said, we need a way forward; this issue of capacity inversion
>> on big.mid.little is there. It has been for ~2-3 years and is going to
>> be even bigger in the future. So please don't block it, and
>> prepare/share the numbers for the corner-case platforms.
>
> I don't want to block anything but just want a solution that is
> coherent with the whole design and not just a fix for one UC

Good, but forcing the use of a PELT-like signal for all platforms for
RT tasks might not be a fix either.

The mainline RT code didn't take the thermal pressure into
account, but you made a strong statement that one of the two signals
'must not' be used (without any data from RT experiments on a few
platforms). You also added the argument that there might be a 'khz'-high
rate of frequency changes on some platform (which might actually be
a corner case, or even not possible, because the fastest DVFS that is
possible today IIRC is ~500-1000us).

In my team we spend hundreds (or even thousands) of hours a year on
research (with experiments and code modifications) in those fields:
scheduler, thermal, firmware. Even when we have the data internally,
we don't say we are 100% sure in all cases for all platforms.

We are trying to engage our partners, with this whole GKI movement, to
go with their patches upstream and have a discussion there. No wonder
they are afraid. I hope you understand this.

Regards,
Lukasz

[1] https://lore.kernel.org/lkml/[email protected]/
[2]
https://lore.kernel.org/lkml/[email protected]/T/#u
[3]
https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/thermal_pressure_delays-all-ipa.ipynb

2022-05-02 11:12:04

by Xuewen Yan

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Qais
Thanks for the patient explanation. :)
And I have some other concerns.

On Wed, Apr 27, 2022 at 6:58 PM Qais Yousef <[email protected]> wrote:
>
> On 04/27/22 09:38, Xuewen Yan wrote:
> > > > > The best (simplest) way forward IMHO is to introduce a new function
> > > > >
> > > > > bool cpu_in_capacity_inversion(int cpu);

Maybe the hard part is the implementation of this function - I have not
thought of a good solution yet.
(1) how to define the inversion: if the cpu has two
clusters (little/big) it is easy, but we still need to mark which is the
big cpu...
(2) because the mainline kernel should be generic: if the cpu has three
or more clusters, maybe the mid cpus would also be in inversion;
(3) when to update the cpu inversion state: if we check for inversion
whenever the thermal pressure changes, can we accept the added
complexity? Because we may have to traverse all possible cpus.

Thanks!

---
xuewen.yan

2022-05-03 16:24:24

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Xuewen

On 05/01/22 11:20, Xuewen Yan wrote:
> Hi Qais
> Thanks for the patient explanation.:)
> And I have some other concerns.
>
> On Wed, Apr 27, 2022 at 6:58 PM Qais Yousef <[email protected]> wrote:
> >
> > On 04/27/22 09:38, Xuewen Yan wrote:
> > > > > > The best (simplest) way forward IMHO is to introduce a new function
> > > > > >
> > > > > > bool cpu_in_capacity_inversion(int cpu);
>
> Maybe the implementation of this function, I have not thought of a
> good solution.
> (1)how to define the inversion, if the cpu has two
> cluster(little/big),it is easy, but still need mark which is the big
> cpu...

I'd define it as:

capacity_orig_of(cpu) - thermal_pressure(cpu) < capacity_orig_of(next_level_cpu)
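
For example (editor's illustration reusing the thread's earlier numbers):
with capacity_orig_of(medium) = 700 and capacity_orig_of(big) = 1024, a
thermal pressure of 330 on the big gives 1024 - 330 = 694 < 700, so the
big is in capacity inversion relative to the medium.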

> (2)because the mainline kernel should be common, if the cpu has three
> or more clusters, maybe the mid cpus also would be inversion;

Yes. I pray this is highly unlikely though! We should cater for it still.

> (3)what time update the cpu inversion state, if we judge the cpu
> inversion whenever the thermal pressure changed, could we receive the
> complexity? because may we should traverse all possible cpu.

In my head, it would make sense to detect the inversion in
update_cpu_capacity() OR in topology_update_thermal_pressure(). So at whatever
rate that happens.

Does this answer your question?

Basically I believe something like this should be enough (completely untested)

--->8---


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a68482d66535..44c7c2598d87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8399,16 +8399,37 @@ static unsigned long scale_rt_capacity(int cpu)

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
+ unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
+ struct rq *rq = cpu_rq(cpu);

- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
+ rq->cpu_capacity_orig = capacity_orig;

if (!capacity)
capacity = 1;

- cpu_rq(cpu)->cpu_capacity = capacity;
- trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+ rq->cpu_capacity = capacity;
+ trace_sched_cpu_capacity_tp(rq);
+
+ if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+ unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
+
+ rq->cpu_capacity_inverted = 0;
+
+ for_each_possible_cpu(cpu) {
+ unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+ if (capacity_orig <= cap)
+ continue;
+
+ if (cap > inv_cap) {
+ rq->cpu_capacity_inverted = inv_cap;
+ break;
+ }
+ }
+
+ }

sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8dccb34eb190..bfe84c870bf9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -992,6 +992,7 @@ struct rq {

unsigned long cpu_capacity;
unsigned long cpu_capacity_orig;
+ unsigned long cpu_capacity_inverted;

struct callback_head *balance_callback;

@@ -2807,6 +2808,11 @@ static inline unsigned long capacity_orig_of(int cpu)
return cpu_rq(cpu)->cpu_capacity_orig;
}

+static inline unsigned long cpu_in_capacity_inversion(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_inverted;
+}
+
/**
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency


--->8---

Thanks

--
Qais Yousef

2022-05-09 10:05:09

by Xuewen Yan

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Qais

Sorry for the late reply.

On Tue, May 3, 2022 at 10:43 PM Qais Yousef <[email protected]> wrote:
>
> Hi Xuewen
>
> On 05/01/22 11:20, Xuewen Yan wrote:
> > Hi Qais
> > Thanks for the patient explanation.:)
> > And I have some other concerns.
> >
> > On Wed, Apr 27, 2022 at 6:58 PM Qais Yousef <[email protected]> wrote:
> > >
> > > On 04/27/22 09:38, Xuewen Yan wrote:
> > > > > > > The best (simplest) way forward IMHO is to introduce a new function
> > > > > > >
> > > > > > > bool cpu_in_capacity_inversion(int cpu);
> >
> > Maybe the implementation of this function, I have not thought of a
> > good solution.
> > (1)how to define the inversion, if the cpu has two
> > cluster(little/big),it is easy, but still need mark which is the big
> > cpu...
>
> I'd define it as:
>
> capacity_orig_of(cpu) - thermal_pressure(cpu) < capacity_orig_of(next_level_cpu)
ok.
>
> > (2)because the mainline kernel should be common, if the cpu has three
> > or more clusters, maybe the mid cpus also would be inversion;
>
> Yes. I pray this is highly unlikely though! We should cater for it still.
>
> > (3)what time update the cpu inversion state, if we judge the cpu
> > inversion whenever the thermal pressure changed, could we receive the
> > complexity? because may we should traverse all possible cpu.
>
> In my head, it would make sense to detect the inversion in
> update_cpu_capacity() OR in topology_update_thermal_pressure(). So at whatever
> rate this happens at.
>
> Does this answer your question?
Yes, got it.
>
> Basically I believe something like this should be enough (completely untested)
>
> --->8---
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a68482d66535..44c7c2598d87 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8399,16 +8399,37 @@ static unsigned long scale_rt_capacity(int cpu)
>
> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> + unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
> unsigned long capacity = scale_rt_capacity(cpu);
> struct sched_group *sdg = sd->groups;
> + struct rq *rq = cpu_rq(cpu);
>
> - cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
> + rq->cpu_capacity_orig = capacity_orig;
>
> if (!capacity)
> capacity = 1;
>
> - cpu_rq(cpu)->cpu_capacity = capacity;
> - trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> + rq->cpu_capacity = capacity;
> + trace_sched_cpu_capacity_tp(rq);
> +
> + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> + unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);

Indeed, I prefer arch_thermal_pressure here, because the
thermal_load_avg would change over time,
but the inv_cap's update period may not be able to keep up with its changes.

> +
> + rq->cpu_capacity_inverted = 0;
> +
> + for_each_possible_cpu(cpu) {
> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> +
> + if (capacity_orig <= cap)
> + continue;
> +
> + if (cap > inv_cap) {
> + rq->cpu_capacity_inverted = inv_cap;
> + break;
> + }
> + }
> +
> + }
>
> sdg->sgc->capacity = capacity;
> sdg->sgc->min_capacity = capacity;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 8dccb34eb190..bfe84c870bf9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -992,6 +992,7 @@ struct rq {
>
> unsigned long cpu_capacity;
> unsigned long cpu_capacity_orig;
> + unsigned long cpu_capacity_inverted;
>
> struct callback_head *balance_callback;
>
> @@ -2807,6 +2808,11 @@ static inline unsigned long capacity_orig_of(int cpu)
> return cpu_rq(cpu)->cpu_capacity_orig;
> }
>
> +static inline unsigned long cpu_in_capacity_inversion(int cpu)
> +{
> + return cpu_rq(cpu)->cpu_capacity_inverted;
> +}
> +
> /**
> * enum cpu_util_type - CPU utilization type
> * @FREQUENCY_UTIL: Utilization used to select frequency
>
>
> --->8---

The patch looks amazing to me, and the complexity is not too high. Would
you please push the patch?
I think the idea is yours; I don't want to send it as my patch v2.

Thanks!

---
xuewen

2022-05-10 19:48:20

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity



On 5/10/22 15:56, Qais Yousef wrote:
> Hi Xuewen
>
> On 05/09/22 10:29, Xuewen Yan wrote:
>
> [...]
>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index a68482d66535..44c7c2598d87 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8399,16 +8399,37 @@ static unsigned long scale_rt_capacity(int cpu)
>>>
>>> static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>>> {
>>> + unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
>>> unsigned long capacity = scale_rt_capacity(cpu);
>>> struct sched_group *sdg = sd->groups;
>>> + struct rq *rq = cpu_rq(cpu);
>>>
>>> - cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
>>> + rq->cpu_capacity_orig = capacity_orig;
>>>
>>> if (!capacity)
>>> capacity = 1;
>>>
>>> - cpu_rq(cpu)->cpu_capacity = capacity;
>>> - trace_sched_cpu_capacity_tp(cpu_rq(cpu));
>>> + rq->cpu_capacity = capacity;
>>> + trace_sched_cpu_capacity_tp(rq);
>>> +
>>> + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
>>> + unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
>>
>> Indeed, I prefer arch_thermal_pressure here, because the
>> thermal_load_avg would change over time,
>> but the inv_cap's update period may not be able to keep up with its changes.
>
> If that's what works for you, I think that's fine. Vincent, Lukasz you okay
> with that?

To properly answer this question we probably have to analyze the timings
and this update path - how often it is actually called. Keep in mind
we are going to solve CPU capacity inversion for RT class, which
contains latency sensitive tasks. In this approach the information
about HW status is coming from this CFS load balance path.
What if that load balance is not called as often as RT might require?
What if there is a light load on CPUs, but GPU caused them to throttle,
reducing capacity by a decent chunk e.g. 50%?
That would translate to some RT periodic task which takes 2ms every
8ms to take 4ms, while maybe on other less power hungry CPU it could
take 3ms.

The usage of thermal_load_avg() in the scale_rt_capacity() looks OK
for the CFS, but might not be from the RT class point of view.
The RT class might want to realize faster that CPUs have changed the
capacity.
Maybe it's OK with that patch [1] and boot config shifter=-5, but in
the default boot config with shifter=0 we can suffer for hundreds of ms
running on a lower-capacity cpu (which is quite a high number of frames
nowadays).

Without research and experiment data I'm afraid this is too
big a step to make with this CFS load balance path.

>
>>
>>> +
>>> + rq->cpu_capacity_inverted = 0;
>>> +
>>> + for_each_possible_cpu(cpu) {
>>> + unsigned long cap = arch_scale_cpu_capacity(cpu);
>>> +
>>> + if (capacity_orig <= cap)
>>> + continue;

The search loop here assumes that other CPUs (fortunately not in the
same freq domain) don't suffer due to reduced capacity. This might not
be true - when we have a ~1 Watt budget for all CPUs in the system, a
single big core can use 3-4W at max or a single mid core ~1.2W.

>>> +
>>> + if (cap > inv_cap) {
>>> + rq->cpu_capacity_inverted = inv_cap;
>>> + break;
>>> + }
>>> + }
>>> +
>>> + }
>>>
>>> sdg->sgc->capacity = capacity;
>>> sdg->sgc->min_capacity = capacity;
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 8dccb34eb190..bfe84c870bf9 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -992,6 +992,7 @@ struct rq {
>>>
>>> unsigned long cpu_capacity;
>>> unsigned long cpu_capacity_orig;
>>> + unsigned long cpu_capacity_inverted;
>>>
>>> struct callback_head *balance_callback;
>>>
>>> @@ -2807,6 +2808,11 @@ static inline unsigned long capacity_orig_of(int cpu)
>>> return cpu_rq(cpu)->cpu_capacity_orig;
>>> }
>>>
>>> +static inline unsigned long cpu_in_capacity_inversion(int cpu)
>>> +{
>>> + return cpu_rq(cpu)->cpu_capacity_inverted;
>>> +}
>>> +
>>> /**
>>> * enum cpu_util_type - CPU utilization type
>>> * @FREQUENCY_UTIL: Utilization used to select frequency
>>>
>>>
>>> --->8---
>>
>> The patch is amazing for me, and the complexity is not too high. Would
>> you please push the patch?
>> I think the idea is yours, I don't want to use it as my patch v2.
>
> I'd be happy to add a commit message so that you can include it in your v2.
>
> First, I'd like to hear from Vincent and Lukasz they're happy with this
> approach.
>
> I've been trying to think how we can do this generically but can't find an
> alternative to the extra loop or additional fallback_cpu_mask. Maybe the mask
> is okay if we protect it with sched_asymmetric_cpucapacity static key..
>

I'm sorry Qais, I see that you are trying to bring this
real-CPU-capacity information into RT, but the source and quality of
this information IMO might matter. I cannot help you w/o experiment
results of your proposed approach.

Regards,
Lukasz

[1] https://lore.kernel.org/lkml/[email protected]/

2022-05-10 20:09:48

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Xuewen

On 05/09/22 10:29, Xuewen Yan wrote:

[...]

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a68482d66535..44c7c2598d87 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8399,16 +8399,37 @@ static unsigned long scale_rt_capacity(int cpu)
> >
> > static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> > {
> > + unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
> > unsigned long capacity = scale_rt_capacity(cpu);
> > struct sched_group *sdg = sd->groups;
> > + struct rq *rq = cpu_rq(cpu);
> >
> > - cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu);
> > + rq->cpu_capacity_orig = capacity_orig;
> >
> > if (!capacity)
> > capacity = 1;
> >
> > - cpu_rq(cpu)->cpu_capacity = capacity;
> > - trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> > + rq->cpu_capacity = capacity;
> > + trace_sched_cpu_capacity_tp(rq);
> > +
> > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> > + unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
>
> Indeed, I prefer arch_thermal_pressure here, because the
> thermal_load_avg would change over time,
> but the inv_cap's update period may not be able to keep up with its changes.

If that's what works for you, I think that's fine. Vincent, Lukasz you okay
with that?

>
> > +
> > + rq->cpu_capacity_inverted = 0;
> > +
> > + for_each_possible_cpu(cpu) {
> > + unsigned long cap = arch_scale_cpu_capacity(cpu);
> > +
> > + if (capacity_orig <= cap)
> > + continue;
> > +
> > + if (cap > inv_cap) {
> > + rq->cpu_capacity_inverted = inv_cap;
> > + break;
> > + }
> > + }
> > +
> > + }
> >
> > sdg->sgc->capacity = capacity;
> > sdg->sgc->min_capacity = capacity;
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 8dccb34eb190..bfe84c870bf9 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -992,6 +992,7 @@ struct rq {
> >
> > unsigned long cpu_capacity;
> > unsigned long cpu_capacity_orig;
> > + unsigned long cpu_capacity_inverted;
> >
> > struct callback_head *balance_callback;
> >
> > @@ -2807,6 +2808,11 @@ static inline unsigned long capacity_orig_of(int cpu)
> > return cpu_rq(cpu)->cpu_capacity_orig;
> > }
> >
> > +static inline unsigned long cpu_in_capacity_inversion(int cpu)
> > +{
> > + return cpu_rq(cpu)->cpu_capacity_inverted;
> > +}
> > +
> > /**
> > * enum cpu_util_type - CPU utilization type
> > * @FREQUENCY_UTIL: Utilization used to select frequency
> >
> >
> > --->8---
>
> The patch is amazing for me, and the complexity is not too high. Would
> you please push the patch?
> I think the idea is yours, I don't want to use it as my patch v2.

I'd be happy to add a commit message so that you can include it in your v2.

First, I'd like to hear from Vincent and Lukasz they're happy with this
approach.

I've been trying to think how we can do this generically but can't find an
alternative to the extra loop or additional fallback_cpu_mask. Maybe the mask
is okay if we protect it with sched_asymmetric_cpucapacity static key..

Thanks

--
Qais Yousef

2022-05-10 22:14:27

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 05/10/22 18:44, Lukasz Luba wrote:

[...]

> To properly answer this question we probably have to analyze the timings
> and this update path - how often it is actually called. Keep in mind
> we are going to solve CPU capacity inversion for RT class, which
> contains latency sensitive tasks. In this approach the information

This was an attempt at a generic inversion detection. We update
rq->cpu_capacity which is used by capacity_of() in the same path.

I didn't feel brave to write a quick patch in the topology code, but we can
certainly do the detection there in topology_update_thermal_pressure().
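
A minimal sketch of that direction (editor's illustration, untested; it
assumes a helper called from topology_update_thermal_pressure() once the
per-CPU pressure has been updated - the helper name is made up):

--->8---

/*
 * Hypothetical helper, invoked for each affected CPU after its thermal
 * pressure is written. Same detection loop as the fair.c proposal
 * above, but driven at the rate the pressure information arrives
 * instead of at load balance time.
 */
static void update_capacity_inversion(int cpu, unsigned long th_pressure)
{
	unsigned long capacity_orig = arch_scale_cpu_capacity(cpu);
	unsigned long inv_cap = capacity_orig - th_pressure;
	struct rq *rq = cpu_rq(cpu);
	int other;

	rq->cpu_capacity_inverted = 0;

	for_each_possible_cpu(other) {
		unsigned long cap = arch_scale_cpu_capacity(other);

		/* Only CPUs with a smaller capacity_orig can invert us. */
		if (capacity_orig <= cap)
			continue;

		if (cap > inv_cap) {
			rq->cpu_capacity_inverted = inv_cap;
			break;
		}
	}
}

--->8---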

> about HW status is coming from this CFS load balance path.
> What if that load balance is not called that often as RT might require?
> What if there is a light load on CPUs, but GPU caused them to throttle,
> reducing capacity by a decent chunk e.g. 50%?
> That would translate to some RT periodic task which takes 2ms every
> 8ms to take 4ms, while maybe on other less power hungry CPU it could
> take 3ms.
>
> The usage of thermal_load_avg() in the scale_rt_capacity() looks OK
> for the CFS, but might not be from the RT class point of view.
> The RT class might want to realize faster that CPUs have changed the
> capacity.
> Maybe it's OK with that patch [1] and boot config shifter=-5, but in
> default boot config for shifter=0 we can suffer for hundreds of ms
> running on lower capacity cpu (which is quite high number of frames
> nowadays).
>
> Without a research and experiments data I'm afraid this is too
> big step to make, with this CFS load balance path.

I think Xuewen didn't want to use thermal_load_avg(), and that's the question
I deferred.

>
> >
> > >
> > > > +
> > > > + rq->cpu_capacity_inverted = 0;
> > > > +
> > > > + for_each_possible_cpu(cpu) {
> > > > + unsigned long cap = arch_scale_cpu_capacity(cpu);
> > > > +
> > > > + if (capacity_orig <= cap)
> > > > + continue;
>
> The search loop here assumes that other CPUs (fortunately not in the
> same freq domain) don't suffer due to reduced capacity. This might be
> not true - when we have ~1 Watt budget for all CPUs in the system and
> single big core can use 3-4W at max or single mid core ~1.2W.

I defined capacity inversion against capacity_orig. IMHO that's the sensible
definition to make.

Would be good to hear more/other suggestions.

>
> > > > +
> > > > + if (cap > inv_cap) {
> > > > + rq->cpu_capacity_inverted = inv_cap;
> > > > + break;
> > > > + }
> > > > + }
> > > > +
> > > > + }
> > > >
> > > > sdg->sgc->capacity = capacity;
> > > > sdg->sgc->min_capacity = capacity;
> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > index 8dccb34eb190..bfe84c870bf9 100644
> > > > --- a/kernel/sched/sched.h
> > > > +++ b/kernel/sched/sched.h
> > > > @@ -992,6 +992,7 @@ struct rq {
> > > >
> > > > unsigned long cpu_capacity;
> > > > unsigned long cpu_capacity_orig;
> > > > + unsigned long cpu_capacity_inverted;
> > > >
> > > > struct callback_head *balance_callback;
> > > >
> > > > @@ -2807,6 +2808,11 @@ static inline unsigned long capacity_orig_of(int cpu)
> > > > return cpu_rq(cpu)->cpu_capacity_orig;
> > > > }
> > > >
> > > > +static inline unsigned long cpu_in_capacity_inversion(int cpu)
> > > > +{
> > > > + return cpu_rq(cpu)->cpu_capacity_inverted;
> > > > +}
> > > > +
> > > > /**
> > > > * enum cpu_util_type - CPU utilization type
> > > > * @FREQUENCY_UTIL: Utilization used to select frequency
> > > >
> > > >
> > > > --->8---
> > >
> > > The patch is amazing for me, and the complexity is not too high. Would
> > > you please push the patch?
> > > I think the idea is yours, I don't want to use it as my patch v2.
> >
> > I'd be happy to add a commit message so that you can include it in your v2.
> >
> > First, I'd like to hear from Vincent and Lukasz they're happy with this
> > approach.
> >
> > I've been trying to think how we can do this generically but can't find an
> > alternative to the extra loop or additional fallback_cpu_mask. Maybe the mask
> > is okay if we protect it with sched_asymmetric_cpucapacity static key..
> >
>
> I'm sorry Qais, I see that you are trying to bring this
> real-CPU-capacity information into RT, but the source and quality of
> this information IMO might matter. I cannot help you w/o experiment
> results of your proposed approach.

The question I was posing here is whether to handle thermal only in inversion
case as I was suggesting or do better. We are still trickling through the
details, but first, I wanted to make sure there's no objection to this
direction (detect inversion and bail out in rt_task_fits_capacity() for cpus in
capacity inversion).

Cheers

--
Qais Yousef

2022-05-11 01:13:23

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity



On 5/10/22 19:44, Qais Yousef wrote:
> On 05/10/22 18:44, Lukasz Luba wrote:
>
> [...]
>
>> To properly answer this question we probably have to analyze the timings
>> and this update path - how often it is actually called. Keep in mind
>> we are going to solve CPU capacity inversion for RT class, which
>> contains latency sensitive tasks. In this approach the information
>
> This was an attempt for a generic inversion detection. We update
> rq->cpu_capacity which is used by capacity_of() in the same path.

True, but this is a CFS 'world' and the update path is part of load
balance. Your proposed code which sets the new
'rq->cpu_capacity_inverted' is run there, which might have some
delays.

>
> I didn't feel brave to write a quick patch in the topology code, but we can
> certainly do the detection there in topology_update_thermal_pressure().

Looks better, since that code path is called when we get instantaneous
information about a CPU freq reduction. I'm afraid that again this
approach might be blocked due to the 'khz' calling rate of that code and
the claim that we 'must not' use this.

>
>> about HW status is coming from this CFS load balance path.
>> What if that load balance is not called that often as RT might require?
>> What if there is a light load on CPUs, but GPU caused them to throttle,
>> reducing capacity by a decent chunk e.g. 50%?
>> That would translate to some RT periodic task which takes 2ms every
>> 8ms to take 4ms, while maybe on other less power hungry CPU it could
>> take 3ms.
>>
>> The usage of thermal_load_avg() in the scale_rt_capacity() looks OK
>> for the CFS, but might not be from the RT class point of view.
>> The RT class might want to realize faster that CPUs have changed the
>> capacity.
>> Maybe it's OK with that patch [1] and boot config shifter=-5, but in
>> default boot config for shifter=0 we can suffer for hundreds of ms
>> running on lower capacity cpu (which is quite high number of frames
>> nowadays).
>>
>> Without a research and experiments data I'm afraid this is too
>> big step to make, with this CFS load balance path.
>
> I think Xuewen didn't want to use thermal_load_avg(), and that's the question
> I deferred.

Your code snippet might have a similar penalty, since you populate the
information about that CPU's inversion at 'some point in time'.
My point is: that 'point in time' is not well defined, since it's
CFS load balance. I'm afraid that the RT class deserves something better
defined (predictable, repeatable, reliable, short, etc.)

>
>>
>>>
>>>>
>>>>> +
>>>>> + rq->cpu_capacity_inverted = 0;
>>>>> +
>>>>> + for_each_possible_cpu(cpu) {
>>>>> + unsigned long cap = arch_scale_cpu_capacity(cpu);
>>>>> +
>>>>> + if (capacity_orig <= cap)
>>>>> + continue;
>>
>> The search loop here assumes that other CPUs (fortunately not in the
>> same freq domain) don't suffer due to reduced capacity. This might be
>> not true - when we have ~1 Watt budget for all CPUs in the system and
>> single big core can use 3-4W at max or single mid core ~1.2W.

s/1.2W/1-2W

>
> I defined capacity inversion against capacity_orig. IMHO that's the sensible
> definition to make.
>
> Would be good to hear more/other suggestions.

The capacity of another CPU might also be reduced, and capacity_orig does
not reflect that. My gut feeling tells me that this capacity_orig
assumption might be too optimistic for some platforms.

It's the same old question: how good the model should be.
We want to 'model' the reality (CPUs slow down); how good the
model should be in this RT use case - I don't know w/o
experiments.

I don't even know how often this new variable
'rq->cpu_capacity_inverted' gets updated, or what the time diff is to
the last update of the raw thermal pressure variable. You said that code
is 'completely untested'. So it's an unknown delay for now - it belongs
to a similar class as thermal_load_avg(), but the 2nd is known. I have
shared plots of the raw signal vs. PELT-like delays. We at least know
the delays, e.g. ~200ms to reach the raw value, but how that impacts the
RT world - I have no experiment results from real apps (i.e. w/ audio or
display threads).
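
(For scale, an editor's note assuming the standard 32ms PELT half-life:
a step in thermal pressure reaches ~97% of its final value after five
half-lives, i.e. ~160ms, which is consistent with the ~200ms figure
above; a positive sched_thermal_decay_shift stretches that further.)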

>
>>
>>>>> +
>>>>> + if (cap > inv_cap) {
>>>>> + rq->cpu_capacity_inverted = inv_cap;
>>>>> + break;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + }
>>>>>
>>>>> sdg->sgc->capacity = capacity;
>>>>> sdg->sgc->min_capacity = capacity;
>>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>>> index 8dccb34eb190..bfe84c870bf9 100644
>>>>> --- a/kernel/sched/sched.h
>>>>> +++ b/kernel/sched/sched.h
>>>>> @@ -992,6 +992,7 @@ struct rq {
>>>>>
>>>>> unsigned long cpu_capacity;
>>>>> unsigned long cpu_capacity_orig;
>>>>> + unsigned long cpu_capacity_inverted;
>>>>>
>>>>> struct callback_head *balance_callback;
>>>>>
>>>>> @@ -2807,6 +2808,11 @@ static inline unsigned long capacity_orig_of(int cpu)
>>>>> return cpu_rq(cpu)->cpu_capacity_orig;
>>>>> }
>>>>>
>>>>> +static inline unsigned long cpu_in_capacity_inversion(int cpu)
>>>>> +{
>>>>> + return cpu_rq(cpu)->cpu_capacity_inverted;
>>>>> +}
>>>>> +
>>>>> /**
>>>>> * enum cpu_util_type - CPU utilization type
>>>>> * @FREQUENCY_UTIL: Utilization used to select frequency
>>>>>
>>>>>
>>>>> --->8---
>>>>
>>>> The patch is amazing for me, and the complexity is not too high. Would
>>>> you please push the patch?
>>>> I think the idea is yours, I don't want to use it as my patch v2.
>>>
>>> I'd be happy to add a commit message so that you can include it in your v2.
>>>
>>> First, I'd like to hear from Vincent and Lukasz they're happy with this
>>> approach.
>>>
>>> I've been trying to think how we can do this generically but can't find an
>>> alternative to the extra loop or additional fallback_cpu_mask. Maybe the mask
>>> is okay if we protect it with sched_asymmetric_cpucapacity static key..
>>>
>>
>> I'm sorry Qais, I see that you are trying to bring this
>> real-CPU-capacity information into RT, but the source and quality of
>> this information IMO might matter. I cannot help you w/o experiment
>> results of your proposed approach.
>
> The question I was posing here is whether to handle thermal only in inversion
> case as I was suggesting or do better. We are still trickling through the
> details, but first, I wanted to make sure there's no objection to this
> direction (detect inversion and bail out in rt_task_fits_capacity() for cpus in
> capacity inversion).

IMO how you detect that inversion and at which point in time is part of
the scope.

I would vote for using that thermal update code path + comparing other
CPUs' real capacity, not capacity_orig, to detect inversion.
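
That variant would only change the comparison inside the detection loop;
a sketch (editor's illustration, untested, reusing the variables from
the earlier proposal):

--->8---

	for_each_possible_cpu(other) {
		unsigned long cap = arch_scale_cpu_capacity(other);
		/* Other CPUs may be throttled too: compare against their
		 * pressure-adjusted capacity, not their capacity_orig. */
		unsigned long real_cap = cap - arch_scale_thermal_pressure(other);

		if (capacity_orig <= cap)
			continue;

		if (real_cap > inv_cap) {
			rq->cpu_capacity_inverted = inv_cap;
			break;
		}
	}

--->8---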

2022-05-14 16:49:32

by Xuewen Yan

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On Wed, May 11, 2022 at 6:03 AM Lukasz Luba <[email protected]> wrote:
>
>
>
> On 5/10/22 19:44, Qais Yousef wrote:
> > On 05/10/22 18:44, Lukasz Luba wrote:
> >
> > [...]
> >
> >> To properly answer this question we probably have to analyze the timings
> >> and this update path - how often it is actually called. Keep in mind
> >> we are going to solve CPU capacity inversion for RT class, which
> >> contains latency sensitive tasks. In this approach the information
> >
> > This was an attempt for a generic inversion detection. We update
> > rq->cpu_capacity which is used by capacity_of() in the same path.
>
> True, but this is a CFS 'world' and the update path is part of load
> balance. Your proposed code which sets the new
> 'rq->cpu_capacity_inverted' is run there, which might have some
> delays.

Yes, that's exactly what I'm worried about.

>
> >
> > I didn't feel brave to write a quick patch in the topology code, but we can
> > certainly do the detection there in topology_update_thermal_pressure().
>
> Looks better, since that code path is called when we get instantaneous
> information about CPU freq reduction. I'm afraid that again this
> approach might be blocked due to 'khz' calling ratio of that code and we
> 'must not' use this.
>
> >
> >> about HW status is coming from this CFS load balance path.
> >> What if that load balance is not called that often as RT might require?
> >> What if there is a light load on CPUs, but GPU caused them to throttle,
> >> reducing capacity by a decent chunk e.g. 50%?
> >> That would translate to some RT periodic task which takes 2ms every
> >> 8ms to take 4ms, while maybe on other less power hungry CPU it could
> >> take 3ms.
> >>
> >> The usage of thermal_load_avg() in the scale_rt_capacity() looks OK
> >> for the CFS, but might not be from the RT class point of view.
> >> The RT class might want to realize faster that CPUs have changed the
> >> capacity.
> >> Maybe it's OK with that patch [1] and boot config shifter=-5, but in
> >> default boot config for shifter=0 we can suffer for hundreds of ms
> >> running on lower capacity cpu (which is quite high number of frames
> >> nowadays).
> >>
> >> Without a research and experiments data I'm afraid this is too
> >> big step to make, with this CFS load balance path.
> >
> > I think Xuewen didn't want to use thermal_load_avg(), and that's the question
> > I deferred.
>
> Your code snipped might have similar penalty, since you populate
> information about that CPU inversion at 'some point in time'.
> My point is: that 'point in time' is not well defined, since it's
> CFS load balance. I'm afraid that RT class deserves something better
> defined (predictable, repeatable, reliable, short, etc.)
>
> >
> >>
> >>>
> >>>>
> >>>>> +
> >>>>> + rq->cpu_capacity_inverted = 0;
> >>>>> +
> >>>>> + for_each_possible_cpu(cpu) {
> >>>>> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> >>>>> +
> >>>>> + if (capacity_orig <= cap)
> >>>>> + continue;
> >>
> >> The search loop here assumes that other CPUs (fortunately not in the
> >> same freq domain) don't suffer due to reduced capacity. This might be
> >> not true - when we have ~1 Watt budget for all CPUs in the system and
> >> single big core can use 3-4W at max or single mid core ~1.2W.
>
> s/1.2W/1-2W
>
> >
> > I defined capacity inversion against capacity_orig. IMHO that's the sensible
> > definition to make.
> >
> > Would be good to hear more/other suggestions.
>
> Capacity of other CPU might also be reduced and capacity_orig is not
> reflecting that. My gut feeling tells me that this capacity_orig
> assumption might be too optimistic for some platforms.

On our Unisoc platform with 3 clusters (little/mid/big), there are cases
where the middle cores and big cores are throttled at the same time.

>
> It's the same old question: how good the model should be.
> We want to 'model' the reality (CPUs slows down), how good the
> model should be in this RT world use case - I don't know w/o
> experiments.
>
> I don't even know how often this new variable
> 'rq->cpu_capacity_inverted' gets updated and what is the time diff to
> the last update of the raw thermal pressure variable. You said that code
> is 'completely untested'. So it's unknown delay for now - but belongs to
> similar class as thermal_load_avg(), but the 2nd is known. I have
> shared plots with raw signal vs. PELT-like delays. We at least know
> the delays, e.g. ~200ms to reach raw value, but how that impacts RT
> world - I have no experiment results from real apps (i.e. w/ audio or
> display threads).
>
> >
> >>
> >>>>> +
> >>>>> + if (cap > inv_cap) {
> >>>>> + rq->cpu_capacity_inverted = inv_cap;
> >>>>> + break;
> >>>>> + }
> >>>>> + }
> >>>>> +
> >>>>> + }
> >>>>>
> >>>>> sdg->sgc->capacity = capacity;
> >>>>> sdg->sgc->min_capacity = capacity;
> >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>>>> index 8dccb34eb190..bfe84c870bf9 100644
> >>>>> --- a/kernel/sched/sched.h
> >>>>> +++ b/kernel/sched/sched.h
> >>>>> @@ -992,6 +992,7 @@ struct rq {
> >>>>>
> >>>>> unsigned long cpu_capacity;
> >>>>> unsigned long cpu_capacity_orig;
> >>>>> + unsigned long cpu_capacity_inverted;
> >>>>>
> >>>>> struct callback_head *balance_callback;
> >>>>>
> >>>>> @@ -2807,6 +2808,11 @@ static inline unsigned long capacity_orig_of(int cpu)
> >>>>> return cpu_rq(cpu)->cpu_capacity_orig;
> >>>>> }
> >>>>>
> >>>>> +static inline unsigned long cpu_in_capacity_inversion(int cpu)
> >>>>> +{
> >>>>> + return cpu_rq(cpu)->cpu_capacity_inverted;
> >>>>> +}
> >>>>> +
> >>>>> /**
> >>>>> * enum cpu_util_type - CPU utilization type
> >>>>> * @FREQUENCY_UTIL: Utilization used to select frequency
> >>>>>
> >>>>>
> >>>>> --->8---
> >>>>
> >>>> The patch is amazing for me, and the complexity is not too high. Would
> >>>> you please push the patch?
> >>>> I think the idea is yours, I don't want to use it as my patch v2.
> >>>
> >>> I'd be happy to add a commit message so that you can include it in your v2.
> >>>
> >>> First, I'd like to hear from Vincent and Lukasz they're happy with this
> >>> approach.
> >>>
> >>> I've been trying to think how we can do this generically but can't find an
> >>> alternative to the extra loop or additional fallback_cpu_mask. Maybe the mask
> >>> is okay if we protect it with sched_asymmetric_cpucapacity static key..
> >>>
> >>
> >> I'm sorry Qais, I see that you are trying to bring this
> >> real-CPU-capacity information into RT, but the source and quality of
> >> this information IMO might matter. I cannot help you w/o experiment
> >> results of your proposed approach.
> >
> > The question I was posing here is whether to handle thermal only in inversion
> > case as I was suggesting or do better. We are still trickling through the
> > details, but first, I wanted to make sure there's no objection to this
> > direction (detect inversion and bail out in rt_task_fits_capacity() for cpus in
> > capacity inversion).
>
> IMO how you detect that inversion and at which point in time is part of
> the scope.
>
> I would vote for using that thermal update code path + compare other
> CPUs real capacity not capacity_orig to detect inversion.

Okay, I could push patch v2 later. Maybe we can continue to discuss
this topic based on v2.

Thanks!
---
xuewen.yan

2022-05-16 13:51:22

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 05/14/22 23:01, Xuewen Yan wrote:
> On Wed, May 11, 2022 at 6:03 AM Lukasz Luba <[email protected]> wrote:

[...]

> > True, but this is a CFS 'world' and the update path is part of load
> > balance. Your proposed code which sets the new
> > 'rq->cpu_capacity_inverted' is run there, which might have some
> > delays.
>
> Yes, that's exactly what I'm worried about.

Hmm. In the Android world, which is what we are concerned with here, CFS
is a first-class citizen. If we have issues updating capacities there,
this might be hurting other non-RT use cases too. So something to ponder
in general.

Anyways. It's a very valid concern and I agree with it too. We can do better.
See below.

[...]

> > Capacity of other CPU might also be reduced and capacity_orig is not
> > reflecting that. My gut feeling tells me that this capacity_orig
> > assumption might be too optimistic for some platforms.
>
> On our Unisoc platform with 3 clusters (little/mid/big), there are cases
> where the middle cores and big cores are throttled at the same time.

Okay. I might have misunderstood you before, but I thought medium cores don't
suffer any meaningful thermal pressure.

[...]

> Okay, I could push patch v2 later. Maybe we can continue to discuss
> this topic based on v2.

Please do.

I have scratched my head and played around with implementing the generic solution
using additional cpumasks. If you fancy testing it, that'd be great! I have
tested it very lightly only. If you like it and no one shouts it's terrible, we
can shape it further.

--->8---

From 625209b09bd0eb0eff07fba109e80102c5983c48 Mon Sep 17 00:00:00 2001
From: Qais Yousef <[email protected]>
Date: Fri, 13 May 2022 12:01:15 +0100
Subject: [PATCH] sched/rt: Support multi-criterion fitness search for
lowest_rq

We have multiple criteria that need to be taken into account when
searching for the best fitting lowest_rq.

On big.LITTLE systems, when uclamp is used, we use
rt_task_fits_capacity() to enforce going to a larger CPU if the task's
uclamp_min requested that. But we still would have fallen back to
priority based search if no fitting CPU was found.

There are reports now that severe thermal pressure could make the big
CPUs throttled to the point where they are less performant than the
mediums (capacity inversion).

To cater for that, we need to take into account thermal pressure
reducing the capacity of the CPU in the fitness search.

Note that this could introduce another set of problems if not careful.

For instance if an RT task has uclamp_min = 1024, a small amount of
thermal pressure could mean no CPU will fit this task; which means the
hint will become less effective. The big CPU still provides the max
performance level (which is what uclamp_min=1024 is asking for) so we
still better place it there even if thermal conditions mean we lost some
performance.

This corner case can happen at any boundary conditions for littles,
mediums or bigs. For example if an RT task has uclamp_min
= capacity_orig_of(medium_cpu), then any small thermal pressure will
prevent placing it there and force it to big CPUs instead. Which is not
the desired outcome if no big CPU is available. We should still fallback
to the medium CPU in this case.

This dictates a more complex search method to enable multi-level
fallback.

That is:

1. If rt_task_fits_capacity_thermal() returns true, we should
pick this lowest_mask.
2. If (1) failed for all priorities, we should fallback to
rt_task_fits_capacity() lowest_mask if it found any.
3. If (1) and (2) failed, we should fallback to the lowest_mask
based on lowest priority rq as usual.
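
As a self-contained userspace toy of that search order (illustrative
only; the candidate table and the fits_*() helpers below are made-up
stand-ins for the real cpupri masks and fitness functions):

#include <stdbool.h>
#include <stdio.h>

#define NCPU	4
#define NPRI	3
#define NFN	2

/* candidate[pri][cpu]: would cpupri offer this CPU at this priority? */
static const bool candidate[NPRI][NCPU] = {
	{ true,  false, false, false },	/* little only */
	{ false, true,  false, false },	/* medium only */
	{ false, false, true,  true  },	/* bigs only   */
};

/* made-up stand-ins for rt_task_fits_capacity_thermal()/_capacity() */
static bool fits_thermal(int cpu)  { return cpu == 1; }
static bool fits_capacity(int cpu) { return cpu >= 2; }

static bool (*const fn[NFN])(int) = { fits_thermal, fits_capacity };

static int find_fitting_cpu(void)
{
	int fallback[NFN] = { -1, -1 };
	int pri, cpu, i;

	for (pri = 0; pri < NPRI; pri++) {
		for (cpu = 0; cpu < NCPU; cpu++) {
			if (!candidate[pri][cpu])
				continue;
			/* (1) a thermal-aware fit ends the search */
			if (fn[0](cpu))
				return cpu;
			/* (3) remember a priority-only fallback once */
			if (fallback[0] < 0)
				fallback[0] = cpu;
			/* (2) remember lesser-criteria fallbacks once */
			for (i = 1; i < NFN; i++)
				if (fallback[i] < 0 && fn[i](cpu))
					fallback[i] = cpu;
		}
	}

	/* (2) wins over (3): lesser criterion, then priority-only */
	for (i = 1; i < NFN; i++)
		if (fallback[i] >= 0)
			return fallback[i];
	return fallback[0];
}

int main(void)
{
	/* with the toy data above this prints "picked CPU1" */
	printf("picked CPU%d\n", find_fitting_cpu());
	return 0;
}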

We teach cpupri_find_fitness() to do a multi-level search in a single
loop. This comes at the cost of allocating an extra cpumask per fitness
criterion/level.

At the moment, the only users are heterogeneous systems, which have a low
CPU count, so this should not lead to a big waste of memory.

The fitness_fn[] array is ordered by priority, with index 0 being the highest.

Signed-off-by: Qais Yousef <[email protected]>
---
kernel/sched/cpupri.c | 118 ++++++++++++++++++++++++++++++++----------
kernel/sched/cpupri.h | 14 +++--
kernel/sched/rt.c | 66 ++++++++++++++++++++---
3 files changed, 162 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index fa9ce9d83683..cfe56bd4e555 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -120,7 +120,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
int cpupri_find(struct cpupri *cp, struct task_struct *p,
struct cpumask *lowest_mask)
{
- return cpupri_find_fitness(cp, p, lowest_mask, NULL);
+ return cpupri_find_fitness(cp, p, lowest_mask, NULL, NULL);
}

/**
@@ -142,13 +142,24 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
*/
int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
struct cpumask *lowest_mask,
- bool (*fitness_fn)(struct task_struct *p, int cpu))
+ cpumask_var_t fitness_mask[], fitness_fn_t fitness_fn[])
{
int task_pri = convert_prio(p->prio);
- int idx, cpu;
+ bool fallback_found[NUM_FITNESS_FN];
+ int idx, cpu, fn_idx;

BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);

+ if (NUM_FITNESS_FN && fitness_fn) {
+ /*
+ * Clear the masks so that we can save a fallback hit in them
+ */
+ for (fn_idx = 0; fn_idx < NUM_FITNESS_FN; fn_idx++) {
+ cpumask_clear(fitness_mask[fn_idx]);
+ fallback_found[fn_idx] = false;
+ }
+ }
+
for (idx = 0; idx < task_pri; idx++) {

if (!__cpupri_find(cp, p, lowest_mask, idx))
@@ -157,41 +168,94 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
if (!lowest_mask || !fitness_fn)
return 1;

- /* Ensure the capacity of the CPUs fit the task */
+ /*
+ * We got a hit, save in our fallback masks that are empty.
+ *
+ * Note that we use fitness_mask[0] to save the fallback for
+ * when all fitness_fns fail to find a suitable CPU.
+ *
+ * We use lowest_mask to store the results of fitness_fn[0]
+ * directly.
+ */
+ if (!fallback_found[0]) {
+ cpumask_copy(fitness_mask[0], lowest_mask);
+ fallback_found[0] = true;
+ }
+ for (fn_idx = 1; fn_idx < NUM_FITNESS_FN; fn_idx++) {
+
+ /*
+ * We just need one valid fallback at highest level
+ * (smallest fn_idx). We don't care about checking for
+ * fallback beyond this once we found one.
+ */
+ if (fallback_found[fn_idx])
+ break;
+
+ cpumask_copy(fitness_mask[fn_idx], lowest_mask);
+ }
+
+ /*
+ * A fitness_fn[0] hit always terminates the search immediately,
+ * so do that first.
+ */
for_each_cpu(cpu, lowest_mask) {
- if (!fitness_fn(p, cpu))
+ if (!fitness_fn[0](p, cpu))
cpumask_clear_cpu(cpu, lowest_mask);
}

/*
- * If no CPU at the current priority can fit the task
- * continue looking
+ * Stop searching as soon as fitness_fn[0] is happy with the
+ * results.
*/
- if (cpumask_empty(lowest_mask))
- continue;
+ if (!cpumask_empty(lowest_mask))
+ return 1;

- return 1;
+ /*
+ * Find a fallback CPU for the other fitness_fns.
+ *
+ * Only do this once. As soon as we get a valid fallback mask,
+ * we'll remember it so that when fitness_fn[0] fails for all
+ * priorities, we'll return this fallback mask.
+ *
+ * Remember that we use fitness_mask[0] to store our fallback
+ * results for when all fitness_fns fail.
+ */
+ for (fn_idx = 1; fn_idx < NUM_FITNESS_FN; fn_idx++) {
+
+ /*
+ * We just need one valid fallback at highest level
+ * (smallest fn_idx). We don't care about checking for
+ * fallback beyond this once we found one.
+ */
+ if (fallback_found[fn_idx])
+ break;
+
+ for_each_cpu(cpu, fitness_mask[fn_idx]) {
+ if (!fitness_fn[fn_idx](p, cpu))
+ cpumask_clear_cpu(cpu, fitness_mask[fn_idx]);
+ }
+
+ if (!cpumask_empty(fitness_mask[fn_idx]))
+ fallback_found[fn_idx] = true;
+ }
+ }
+
+ for (fn_idx = 1; fn_idx < NUM_FITNESS_FN; fn_idx++) {
+ if (fallback_found[fn_idx]) {
+ cpumask_copy(lowest_mask, fitness_mask[fn_idx]);
+ return 1;
+ }
}

/*
- * If we failed to find a fitting lowest_mask, kick off a new search
- * but without taking into account any fitness criteria this time.
- *
- * This rule favours honouring priority over fitting the task in the
- * correct CPU (Capacity Awareness being the only user now).
- * The idea is that if a higher priority task can run, then it should
- * run even if this ends up being on unfitting CPU.
- *
- * The cost of this trade-off is not entirely clear and will probably
- * be good for some workloads and bad for others.
- *
- * The main idea here is that if some CPUs were over-committed, we try
- * to spread which is what the scheduler traditionally did. Sys admins
- * must do proper RT planning to avoid overloading the system if they
- * really care.
+ * No fallback from any of the fitness_fns; fall back to the
+ * priority-based lowest_mask, which we store at fitness_mask[0].
*/
- if (fitness_fn)
- return cpupri_find(cp, p, lowest_mask);
+
+ if (fallback_found[0]) {
+ cpumask_copy(lowest_mask, fitness_mask[0]);
+ return 1;
+ }

return 0;
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..1feb6324cf24 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -17,12 +17,20 @@ struct cpupri {
int *cpu_to_pri;
};

+#ifdef CONFIG_UCLAMP_TASK
+#define NUM_FITNESS_FN 2
+#else
+#define NUM_FITNESS_FN 0
+#endif
+
+typedef bool (*fitness_fn_t)(struct task_struct *p, int cpu);
+
#ifdef CONFIG_SMP
int cpupri_find(struct cpupri *cp, struct task_struct *p,
struct cpumask *lowest_mask);
-int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
- struct cpumask *lowest_mask,
- bool (*fitness_fn)(struct task_struct *p, int cpu));
+int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
+ struct cpumask *lowest_mask,
+ cpumask_var_t fitness_mask[], fitness_fn_t fitness_fn[]);
void cpupri_set(struct cpupri *cp, int cpu, int pri);
int cpupri_init(struct cpupri *cp);
void cpupri_cleanup(struct cpupri *cp);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a32c46889af8..125b9d360aab 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -452,7 +452,8 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
* Note that uclamp_min will be clamped to uclamp_max if uclamp_min
* > uclamp_max.
*/
-static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+static inline bool __rt_task_fits_capacity(struct task_struct *p, int cpu,
+ bool thermal)
{
unsigned int min_cap;
unsigned int max_cap;
@@ -467,10 +468,39 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)

cpu_cap = capacity_orig_of(cpu);

+ if (thermal)
+ cpu_cap -= arch_scale_thermal_pressure(cpu);
+
return cpu_cap >= min(min_cap, max_cap);
}
-#else
+
+static inline bool rt_task_fits_capacity_thermal(struct task_struct *p, int cpu)
+{
+ return __rt_task_fits_capacity(p, cpu, true);
+}
+
static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ return __rt_task_fits_capacity(p, cpu, false);
+}
+
+fitness_fn_t fitness_fn[NUM_FITNESS_FN] = {
+ rt_task_fits_capacity_thermal,
+ rt_task_fits_capacity
+};
+
+static inline bool rt_task_fits_cpu(struct task_struct *p, int cpu)
+{
+ /*
+ * fitness_fn[0] is the ultimate best choice. If it fails, we should
+ * assume we need to try again/harder to meet this criteria.
+ */
+ return fitness_fn[0](p, cpu);
+}
+
+#else
+fitness_fn_t *fitness_fn = NULL;
+static inline bool rt_task_fits_cpu(struct task_struct *p, int cpu)
{
return true;
}
@@ -1591,14 +1621,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
unlikely(rt_task(curr)) &&
(curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);

- if (test || !rt_task_fits_capacity(p, cpu)) {
+ if (test || !rt_task_fits_cpu(p, cpu)) {
int target = find_lowest_rq(p);

/*
* Bail out if we were forcing a migration to find a better
* fitting CPU but our search failed.
*/
- if (!test && target != -1 && !rt_task_fits_capacity(p, target))
+ if (!test && target != -1 && !rt_task_fits_cpu(p, target))
goto out_unlock;

/*
@@ -1823,6 +1853,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
}

static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
+static DEFINE_PER_CPU(cpumask_var_t *, local_cpu_fitness_mask);

static int find_lowest_rq(struct task_struct *task)
{
@@ -1843,11 +1874,12 @@ static int find_lowest_rq(struct task_struct *task)
* If we're on asym system ensure we consider the different capacities
* of the CPUs when searching for the lowest_mask.
*/
- if (static_branch_unlikely(&sched_asym_cpucapacity)) {
+ if (static_branch_unlikely(&sched_asym_cpucapacity) && NUM_FITNESS_FN) {

+ cpumask_var_t *fitness_mask = __this_cpu_read(local_cpu_fitness_mask);
ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
task, lowest_mask,
- rt_task_fits_capacity);
+ fitness_mask, fitness_fn);
} else {

ret = cpupri_find(&task_rq(task)->rd->cpupri,
@@ -2460,6 +2492,25 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
rt_queue_pull_task(rq);
}

+void __init init_sched_rt_fitness_mask(int cpu)
+{
+ cpumask_var_t *fitness_mask_array;
+ unsigned int i;
+
+ if (!NUM_FITNESS_FN)
+ return;
+
+ fitness_mask_array = kcalloc_node(NUM_FITNESS_FN, sizeof(cpumask_var_t),
+ GFP_KERNEL, cpu_to_node(cpu));
+
+ per_cpu(local_cpu_fitness_mask, cpu) = fitness_mask_array;
+
+ for (i = 0; i < NUM_FITNESS_FN; i++) {
+ zalloc_cpumask_var_node(&fitness_mask_array[i], GFP_KERNEL,
+ cpu_to_node(cpu));
+ }
+}
+
void __init init_sched_rt_class(void)
{
unsigned int i;
@@ -2467,6 +2518,9 @@ void __init init_sched_rt_class(void)
for_each_possible_cpu(i) {
zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
GFP_KERNEL, cpu_to_node(i));
+
+ if (static_branch_unlikely(&sched_asym_cpucapacity))
+ init_sched_rt_fitness_mask(i);
}
}
#endif /* CONFIG_SMP */
--
2.25.1

--->8---

Thanks both!

--
Qais Yousef

2022-05-22 19:33:33

by Oliver Sang

[permalink] [raw]
Subject: [sched] 0eee64011b: canonical_address#:#[##]



Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 0eee64011b1d43795b5c8d1aa62927ba3f07a225 ("[PATCH] sched/rt: Support multi-criterion fitness search for")
url: https://github.com/intel-lab-lkp/linux/commits/Qais-Yousef/sched-rt-Support-multi-criterion-fitness-search-for/20220515-075732
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 734387ec2f9d77b00276042b1fa7c95f48ee879d
patch link: https://lore.kernel.org/lkml/20220514235513.jm7ul2y6uddj6eh2@airbuntu

in testcase: pm-qa
version: pm-qa-x86_64-5ead848-1_20220411
with following parameters:

test: thermal
ucode: 0x28



on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz with 8G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 581.218544][ T1333] intel_powerclamp: Start idle injection to reduce power
[ 586.315335][ T376] thermal_04.41: checking cooling_device13:state=42 effective cool=0 ... Ok
[ 586.315347][ T376]
[ 586.337525][ T1333] intel_powerclamp: Stop forced idle injection
[ 591.382556][ T1333] intel_powerclamp: Start idle injection to reduce power
[ 595.244228][ C0] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
[ 595.255455][ C0] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[ 595.263733][ C0] CPU: 0 PID: 3730 Comm: kidle_inj/0 Tainted: G I 5.18.0-rc5-00021-g0eee64011b1d #3
[ 595.274246][ C0] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD5H/Z87X-UD5H-CF, BIOS F9 03/18/2014
[ 595.283979][ C0] RIP: 0010:cpupri_find_fitness (kbuild/src/x86_64/kernel/sched/cpupri.c:256)
[ 595.289736][ C0] Code: 0f b6 04 02 84 c0 7f 06 0f 85 ed 01 00 00 31 c0 80 7c 24 58 00 74 75 48 b8 00 00 00 00 00 fc ff df 48 8b 54 24 28 48 c1 ea 03 <80> 3c 02 00 0f 85 d1 01 00 00 48 ba 00 00 00 00 00 fc ff df 48 8b
All code
========
0: 0f b6 04 02 movzbl (%rdx,%rax,1),%eax
4: 84 c0 test %al,%al
6: 7f 06 jg 0xe
8: 0f 85 ed 01 00 00 jne 0x1fb
e: 31 c0 xor %eax,%eax
10: 80 7c 24 58 00 cmpb $0x0,0x58(%rsp)
15: 74 75 je 0x8c
17: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1e: fc ff df
21: 48 8b 54 24 28 mov 0x28(%rsp),%rdx
26: 48 c1 ea 03 shr $0x3,%rdx
2a:* 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2e: 0f 85 d1 01 00 00 jne 0x205
34: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
3b: fc ff df
3e: 48 rex.W
3f: 8b .byte 0x8b

Code starting with the faulting instruction
===========================================
0: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
4: 0f 85 d1 01 00 00 jne 0x1db
a: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
11: fc ff df
14: 48 rex.W
15: 8b .byte 0x8b
[ 595.309181][ C0] RSP: 0000:ffffc90000007bf8 EFLAGS: 00010046
[ 595.315113][ C0] RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
[ 595.322943][ C0] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffc90000007c50
[ 595.330771][ C0] RBP: ffff888100c89c58 R08: 0000000000000000 R09: ffff8881002bb0bb
[ 595.338601][ C0] R10: ffffed1020057617 R11: 0000000000000001 R12: fffffbfff0a31e96
[ 595.346432][ C0] R13: 0000000000000000 R14: ffff88811a375280 R15: ffff8881002bb0b8
[ 595.354267][ C0] FS: 0000000000000000(0000) GS:ffff8881cda00000(0000) knlGS:0000000000000000
[ 595.363059][ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 595.369503][ C0] CR2: 00007f2e7df2e9a0 CR3: 000000023d60e003 CR4: 00000000001706f0
[ 595.377342][ C0] Call Trace:
[ 595.380491][ C0] <IRQ>
[ 595.383204][ C0] ? __wake_up_pollfree (kbuild/src/x86_64/kernel/sched/cpupri.c:146)
[ 595.388090][ C0] ? detach_tasks (kbuild/src/x86_64/include/linux/list.h:69 kbuild/src/x86_64/include/linux/list.h:88 kbuild/src/x86_64/include/linux/list.h:218 kbuild/src/x86_64/kernel/sched/fair.c:7986)
[ 595.392622][ C0] ? _raw_spin_lock (kbuild/src/x86_64/arch/x86/include/asm/atomic.h:202 kbuild/src/x86_64/include/linux/atomic/atomic-instrumented.h:543 kbuild/src/x86_64/include/asm-generic/qspinlock.h:82 kbuild/src/x86_64/include/linux/spinlock.h:185 kbuild/src/x86_64/include/linux/spinlock_api_smp.h:134 kbuild/src/x86_64/kernel/locking/spinlock.c:154)
[ 595.397245][ C0] find_lowest_rq (kbuild/src/x86_64/kernel/sched/rt.c:1890)


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (5.18 kB)
config-5.18.0-rc5-00021-g0eee64011b1d (168.55 kB)
job-script (5.50 kB)
dmesg.xz (9.71 kB)
pm-qa (38.68 kB)
job.yaml (4.25 kB)
Download all attachments

2022-06-15 10:29:58

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Xuewen

On 05/15/22 00:55, Qais Yousef wrote:
>
> [...]
>
> I have scratched my head and played around implementing the generic solution
> using additional cpumasks. If you fancy testing it, that'd be great! I have
> tested it very lightly only. If you like it and no one shouts it's terrible, we
> can shape it further.

Any thoughts on the approach below? It implements the generic solution to
consider thermal pressure, and if nothing is found we fall back to the
original choice without thermal pressure being considered.

Do you mind giving it a spin to verify it addresses your problem?
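
In case it helps, this is the sort of minimal test program I had in mind
(untested sketch; the sched_attr layout follows the uapi definition, and
newer glibc/kernel headers may already provide it, in which case drop
the local copy):

#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* local copy of the uapi struct sched_attr layout */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size           = sizeof(attr);
	attr.sched_policy   = SCHED_FIFO;
	attr.sched_priority = 50;
	attr.sched_flags    = SCHED_FLAG_UTIL_CLAMP_MIN |
			      SCHED_FLAG_UTIL_CLAMP_MAX;
	attr.sched_util_min = 512;	/* near the medium/big boundary */
	attr.sched_util_max = 1024;

	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		return 1;

	for (;;)	/* spin so placement can be watched via ftrace */
		;
}

Running a few of these with uclamp_min around the medium/big capacity
boundary while the SoC is under thermal load should exercise the new
fallback levels.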

Thanks!

--
Qais Yousef

> [...]

2022-06-15 11:52:10

by Xuewen Yan

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

Hi Qais

Sorry for the late response.
I've had higher-priority work lately and haven't had the energy to
work on the patch for a while.
Could you test the patch? I will also work on it as soon as I can.

Thanks!
BR
xuewen

On Wed, Jun 15, 2022 at 6:13 PM Qais Yousef <[email protected]> wrote:
>
> Hi Xuewen
>
> On 05/15/22 00:55, Qais Yousef wrote:
> > [...]
>
> Any thoughts on the approach below? It implements the generic solution to
> consider thermal pressure, and if nothing is found we fall back to the
> original choice without thermal pressure being considered.

>
> Do you mind giving it a spin to verify it addresses your problem?
>
> Thanks!
>
> --
> Qais Yousef
>
> > [...]

2022-06-15 14:20:46

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity

On 06/15/22 19:17, Xuewen Yan wrote:
> Hi Qais
>
> Sorry for the late response.
> I've had higher-priority work lately and haven't had the energy to
> work on the patch for a while.

No problem.

> Could you test the patch? I will also work on it as soon as I can.

I did run some tests on the patch. I wanted confirmation that it fixes your
problem, though, before taking this any further.


Thanks

--
Qais Yousef