When the new max frequency value is stored, the task scheduler must
know about it. The scheduler uses the CPUs capacity information in the
task placement. Use the existing mechanism which provides information
about reduced CPU capacity to the scheduler due to thermal capping.
Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 1f8b93f42c76..205d9ea9c023 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -27,6 +27,7 @@
#include <linux/slab.h>
#include <linux/suspend.h>
#include <linux/syscore_ops.h>
+#include <linux/thermal.h>
#include <linux/tick.h>
#include <linux/units.h>
#include <trace/events/power.h>
@@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
static ssize_t store_scaling_max_freq
(struct cpufreq_policy *policy, const char *buf, size_t count)
{
+ unsigned int frequency;
+ struct cpumask *cpus;
unsigned long val;
int ret;
@@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
return -EINVAL;
ret = freq_qos_update_request(policy->max_freq_req, val);
- return ret >= 0 ? count : ret;
+ if (ret >= 0) {
+ /*
+ * Make sure that the task scheduler sees these CPUs
+ * capacity reduction. Use the thermal pressure mechanism
+ * to propagate this information to the scheduler.
+ */
+ cpus = policy->related_cpus;
+ frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
+ arch_update_thermal_pressure(cpus, frequency);
+
+ ret = count;
+ }
+
+ return ret;
}
static ssize_t store_scaling_min_freq
--
2.17.1
Would be good to always CC Scheduler maintainers for such a patch.
On 30-09-22, 10:48, Lukasz Luba wrote:
> When the new max frequency value is stored, the task scheduler must
> know about it. The scheduler uses the CPUs capacity information in the
> task placement. Use the existing mechanism which provides information
> about reduced CPU capacity to the scheduler due to thermal capping.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 1f8b93f42c76..205d9ea9c023 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -27,6 +27,7 @@
> #include <linux/slab.h>
> #include <linux/suspend.h>
> #include <linux/syscore_ops.h>
> +#include <linux/thermal.h>
> #include <linux/tick.h>
> #include <linux/units.h>
> #include <trace/events/power.h>
> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> static ssize_t store_scaling_max_freq
> (struct cpufreq_policy *policy, const char *buf, size_t count)
> {
> + unsigned int frequency;
> + struct cpumask *cpus;
> unsigned long val;
> int ret;
>
> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
> return -EINVAL;
>
> ret = freq_qos_update_request(policy->max_freq_req, val);
> - return ret >= 0 ? count : ret;
> + if (ret >= 0) {
> + /*
> + * Make sure that the task scheduler sees these CPUs
> + * capacity reduction. Use the thermal pressure mechanism
> + * to propagate this information to the scheduler.
> + */
> + cpus = policy->related_cpus;
No need of this, just use related_cpus directly.
> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
> + arch_update_thermal_pressure(cpus, frequency);
I wonder if using the thermal-pressure API here is the right thing to
do. It is a change coming from User, which may or may not be
thermal-related.
> +
> + ret = count;
> + }
> +
> + return ret;
> }
>
> static ssize_t store_scaling_min_freq
> --
> 2.17.1
--
viresh
On 10/10/22 06:39, Viresh Kumar wrote:
> Would be good to always CC Scheduler maintainers for such a patch.
Agree, I'll do that.
>
> On 30-09-22, 10:48, Lukasz Luba wrote:
>> When the new max frequency value is stored, the task scheduler must
>> know about it. The scheduler uses the CPUs capacity information in the
>> task placement. Use the existing mechanism which provides information
>> about reduced CPU capacity to the scheduler due to thermal capping.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>> index 1f8b93f42c76..205d9ea9c023 100644
>> --- a/drivers/cpufreq/cpufreq.c
>> +++ b/drivers/cpufreq/cpufreq.c
>> @@ -27,6 +27,7 @@
>> #include <linux/slab.h>
>> #include <linux/suspend.h>
>> #include <linux/syscore_ops.h>
>> +#include <linux/thermal.h>
>> #include <linux/tick.h>
>> #include <linux/units.h>
>> #include <trace/events/power.h>
>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>> static ssize_t store_scaling_max_freq
>> (struct cpufreq_policy *policy, const char *buf, size_t count)
>> {
>> + unsigned int frequency;
>> + struct cpumask *cpus;
>> unsigned long val;
>> int ret;
>>
>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
>> return -EINVAL;
>>
>> ret = freq_qos_update_request(policy->max_freq_req, val);
>> - return ret >= 0 ? count : ret;
>> + if (ret >= 0) {
>> + /*
>> + * Make sure that the task scheduler sees these CPUs
>> + * capacity reduction. Use the thermal pressure mechanism
>> + * to propagate this information to the scheduler.
>> + */
>> + cpus = policy->related_cpus;
>
> No need of this, just use related_cpus directly.
>
>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
>> + arch_update_thermal_pressure(cpus, frequency);
>
> I wonder if using the thermal-pressure API here is the right thing to
> do. It is a change coming from User, which may or may not be
> thermal-related.
Yes, I thought the same. Thermal-pressure name might be not the
best for covering this use case. I have been thinking about this
thermal pressure mechanism for a while, since there are other
use cases like PowerCap DTPM which also reduces CPU capacity
because of power policy from user-space. We don't notify
the scheduler about it. There might be also an issue with virtual
guest OS and how that kernel 'sees' the capacity of CPUs.
We might try to use this 'thermal-pressure' in the guest kernel
to notify about available CPU capacity (just a proposal, not
even an RFC, since we are missing requirements, but issues where
discussed on LPC 2022 on ChromeOS+Android_guest)
Android middleware has 'powerhits' (IIRC since ~4-5 versions now)
but our capacity in task scheduler is not aware of those reductions.
IMO thermal-pressure mechanism is good, but the naming convention
just might be a bit more 'generic' to cover those two users.
Some proposals of better naming:
1. Performance capping
2. Capacity capping
3. Performance reduction
What do you think about changing the name of this and cover
those two users: PowerCap DTPM and this user-space cpufreq?
On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
>
>
>
> On 10/10/22 06:39, Viresh Kumar wrote:
> > Would be good to always CC Scheduler maintainers for such a patch.
>
> Agree, I'll do that.
>
> >
> > On 30-09-22, 10:48, Lukasz Luba wrote:
> >> When the new max frequency value is stored, the task scheduler must
> >> know about it. The scheduler uses the CPUs capacity information in the
> >> task placement. Use the existing mechanism which provides information
> >> about reduced CPU capacity to the scheduler due to thermal capping.
> >>
> >> Signed-off-by: Lukasz Luba <[email protected]>
> >> ---
> >> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
> >> 1 file changed, 17 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> >> index 1f8b93f42c76..205d9ea9c023 100644
> >> --- a/drivers/cpufreq/cpufreq.c
> >> +++ b/drivers/cpufreq/cpufreq.c
> >> @@ -27,6 +27,7 @@
> >> #include <linux/slab.h>
> >> #include <linux/suspend.h>
> >> #include <linux/syscore_ops.h>
> >> +#include <linux/thermal.h>
> >> #include <linux/tick.h>
> >> #include <linux/units.h>
> >> #include <trace/events/power.h>
> >> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> >> static ssize_t store_scaling_max_freq
> >> (struct cpufreq_policy *policy, const char *buf, size_t count)
> >> {
> >> + unsigned int frequency;
> >> + struct cpumask *cpus;
> >> unsigned long val;
> >> int ret;
> >>
> >> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
> >> return -EINVAL;
> >>
> >> ret = freq_qos_update_request(policy->max_freq_req, val);
> >> - return ret >= 0 ? count : ret;
> >> + if (ret >= 0) {
> >> + /*
> >> + * Make sure that the task scheduler sees these CPUs
> >> + * capacity reduction. Use the thermal pressure mechanism
> >> + * to propagate this information to the scheduler.
> >> + */
> >> + cpus = policy->related_cpus;
> >
> > No need of this, just use related_cpus directly.
> >
> >> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
> >> + arch_update_thermal_pressure(cpus, frequency);
> >
> > I wonder if using the thermal-pressure API here is the right thing to
> > do. It is a change coming from User, which may or may not be
> > thermal-related.
>
> Yes, I thought the same. Thermal-pressure name might be not the
> best for covering this use case. I have been thinking about this
> thermal pressure mechanism for a while, since there are other
> use cases like PowerCap DTPM which also reduces CPU capacity
> because of power policy from user-space. We don't notify
> the scheduler about it. There might be also an issue with virtual
> guest OS and how that kernel 'sees' the capacity of CPUs.
> We might try to use this 'thermal-pressure' in the guest kernel
> to notify about available CPU capacity (just a proposal, not
> even an RFC, since we are missing requirements, but issues where
> discussed on LPC 2022 on ChromeOS+Android_guest)
The User space setting scaling_max_freq is a long scale event and it
should be considered as a new running environnement instead of a
transient event. I would suggest updating the EM is and capacity orig
of the system in this case. Similarly, we rebuild sched_domain with a
cpu hotplug. scaling_max_freq interface should not be used to do any
kind of dynamic scaling.
>
> Android middleware has 'powerhits' (IIRC since ~4-5 versions now)
> but our capacity in task scheduler is not aware of those reductions.
>
> IMO thermal-pressure mechanism is good, but the naming convention
> just might be a bit more 'generic' to cover those two users.
>
> Some proposals of better naming:
> 1. Performance capping
> 2. Capacity capping
> 3. Performance reduction
>
> What do you think about changing the name of this and cover
> those two users: PowerCap DTPM and this user-space cpufreq?
On Mon, 10 Oct 2022 at 11:30, Lukasz Luba <[email protected]> wrote:
>
>
>
> On 10/10/22 10:15, Vincent Guittot wrote:
> > On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
> >>
> >>
> >>
> >> On 10/10/22 06:39, Viresh Kumar wrote:
> >>> Would be good to always CC Scheduler maintainers for such a patch.
> >>
> >> Agree, I'll do that.
> >>
> >>>
> >>> On 30-09-22, 10:48, Lukasz Luba wrote:
> >>>> When the new max frequency value is stored, the task scheduler must
> >>>> know about it. The scheduler uses the CPUs capacity information in the
> >>>> task placement. Use the existing mechanism which provides information
> >>>> about reduced CPU capacity to the scheduler due to thermal capping.
> >>>>
> >>>> Signed-off-by: Lukasz Luba <[email protected]>
> >>>> ---
> >>>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
> >>>> 1 file changed, 17 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> >>>> index 1f8b93f42c76..205d9ea9c023 100644
> >>>> --- a/drivers/cpufreq/cpufreq.c
> >>>> +++ b/drivers/cpufreq/cpufreq.c
> >>>> @@ -27,6 +27,7 @@
> >>>> #include <linux/slab.h>
> >>>> #include <linux/suspend.h>
> >>>> #include <linux/syscore_ops.h>
> >>>> +#include <linux/thermal.h>
> >>>> #include <linux/tick.h>
> >>>> #include <linux/units.h>
> >>>> #include <trace/events/power.h>
> >>>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> >>>> static ssize_t store_scaling_max_freq
> >>>> (struct cpufreq_policy *policy, const char *buf, size_t count)
> >>>> {
> >>>> + unsigned int frequency;
> >>>> + struct cpumask *cpus;
> >>>> unsigned long val;
> >>>> int ret;
> >>>>
> >>>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
> >>>> return -EINVAL;
> >>>>
> >>>> ret = freq_qos_update_request(policy->max_freq_req, val);
> >>>> - return ret >= 0 ? count : ret;
> >>>> + if (ret >= 0) {
> >>>> + /*
> >>>> + * Make sure that the task scheduler sees these CPUs
> >>>> + * capacity reduction. Use the thermal pressure mechanism
> >>>> + * to propagate this information to the scheduler.
> >>>> + */
> >>>> + cpus = policy->related_cpus;
> >>>
> >>> No need of this, just use related_cpus directly.
> >>>
> >>>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
> >>>> + arch_update_thermal_pressure(cpus, frequency);
> >>>
> >>> I wonder if using the thermal-pressure API here is the right thing to
> >>> do. It is a change coming from User, which may or may not be
> >>> thermal-related.
> >>
> >> Yes, I thought the same. Thermal-pressure name might be not the
> >> best for covering this use case. I have been thinking about this
> >> thermal pressure mechanism for a while, since there are other
> >> use cases like PowerCap DTPM which also reduces CPU capacity
> >> because of power policy from user-space. We don't notify
> >> the scheduler about it. There might be also an issue with virtual
> >> guest OS and how that kernel 'sees' the capacity of CPUs.
> >> We might try to use this 'thermal-pressure' in the guest kernel
> >> to notify about available CPU capacity (just a proposal, not
> >> even an RFC, since we are missing requirements, but issues where
> >> discussed on LPC 2022 on ChromeOS+Android_guest)
> >
> > The User space setting scaling_max_freq is a long scale event and it
> > should be considered as a new running environnement instead of a
> > transient event. I would suggest updating the EM is and capacity orig
> > of the system in this case. Similarly, we rebuild sched_domain with a
> > cpu hotplug. scaling_max_freq interface should not be used to do any
> > kind of dynamic scaling.
>
> I tend to agree, but the EM capacity would be only used in part of EAS
> code. The whole fair.c view to the capacity_of() (RT + DL + irq +
> thermal_pressure) would be still wrong in other parts, e.g.
> select_idle_sibling() and load balance.
>
> When we get this powerhint we might be already in overutilied state
> so EAS is disabled. IMO other mechanisms in the task scheduler
> should be also aware of that capacity reduction.
That's why I also mentioned the capacity_orig
On 10/10/22 10:15, Vincent Guittot wrote:
> On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 10/10/22 06:39, Viresh Kumar wrote:
>>> Would be good to always CC Scheduler maintainers for such a patch.
>>
>> Agree, I'll do that.
>>
>>>
>>> On 30-09-22, 10:48, Lukasz Luba wrote:
>>>> When the new max frequency value is stored, the task scheduler must
>>>> know about it. The scheduler uses the CPUs capacity information in the
>>>> task placement. Use the existing mechanism which provides information
>>>> about reduced CPU capacity to the scheduler due to thermal capping.
>>>>
>>>> Signed-off-by: Lukasz Luba <[email protected]>
>>>> ---
>>>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>>> index 1f8b93f42c76..205d9ea9c023 100644
>>>> --- a/drivers/cpufreq/cpufreq.c
>>>> +++ b/drivers/cpufreq/cpufreq.c
>>>> @@ -27,6 +27,7 @@
>>>> #include <linux/slab.h>
>>>> #include <linux/suspend.h>
>>>> #include <linux/syscore_ops.h>
>>>> +#include <linux/thermal.h>
>>>> #include <linux/tick.h>
>>>> #include <linux/units.h>
>>>> #include <trace/events/power.h>
>>>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>>>> static ssize_t store_scaling_max_freq
>>>> (struct cpufreq_policy *policy, const char *buf, size_t count)
>>>> {
>>>> + unsigned int frequency;
>>>> + struct cpumask *cpus;
>>>> unsigned long val;
>>>> int ret;
>>>>
>>>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
>>>> return -EINVAL;
>>>>
>>>> ret = freq_qos_update_request(policy->max_freq_req, val);
>>>> - return ret >= 0 ? count : ret;
>>>> + if (ret >= 0) {
>>>> + /*
>>>> + * Make sure that the task scheduler sees these CPUs
>>>> + * capacity reduction. Use the thermal pressure mechanism
>>>> + * to propagate this information to the scheduler.
>>>> + */
>>>> + cpus = policy->related_cpus;
>>>
>>> No need of this, just use related_cpus directly.
>>>
>>>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
>>>> + arch_update_thermal_pressure(cpus, frequency);
>>>
>>> I wonder if using the thermal-pressure API here is the right thing to
>>> do. It is a change coming from User, which may or may not be
>>> thermal-related.
>>
>> Yes, I thought the same. Thermal-pressure name might be not the
>> best for covering this use case. I have been thinking about this
>> thermal pressure mechanism for a while, since there are other
>> use cases like PowerCap DTPM which also reduces CPU capacity
>> because of power policy from user-space. We don't notify
>> the scheduler about it. There might be also an issue with virtual
>> guest OS and how that kernel 'sees' the capacity of CPUs.
>> We might try to use this 'thermal-pressure' in the guest kernel
>> to notify about available CPU capacity (just a proposal, not
>> even an RFC, since we are missing requirements, but issues where
>> discussed on LPC 2022 on ChromeOS+Android_guest)
>
> The User space setting scaling_max_freq is a long scale event and it
> should be considered as a new running environnement instead of a
> transient event. I would suggest updating the EM is and capacity orig
> of the system in this case. Similarly, we rebuild sched_domain with a
> cpu hotplug. scaling_max_freq interface should not be used to do any
> kind of dynamic scaling.
I tend to agree, but the EM capacity would be only used in part of EAS
code. The whole fair.c view to the capacity_of() (RT + DL + irq +
thermal_pressure) would be still wrong in other parts, e.g.
select_idle_sibling() and load balance.
When we get this powerhint we might be already in overutilied state
so EAS is disabled. IMO other mechanisms in the task scheduler
should be also aware of that capacity reduction.
On 10/10/22 10:32, Vincent Guittot wrote:
> On Mon, 10 Oct 2022 at 11:30, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 10/10/22 10:15, Vincent Guittot wrote:
>>> On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 10/10/22 06:39, Viresh Kumar wrote:
>>>>> Would be good to always CC Scheduler maintainers for such a patch.
>>>>
>>>> Agree, I'll do that.
>>>>
>>>>>
>>>>> On 30-09-22, 10:48, Lukasz Luba wrote:
>>>>>> When the new max frequency value is stored, the task scheduler must
>>>>>> know about it. The scheduler uses the CPUs capacity information in the
>>>>>> task placement. Use the existing mechanism which provides information
>>>>>> about reduced CPU capacity to the scheduler due to thermal capping.
>>>>>>
>>>>>> Signed-off-by: Lukasz Luba <[email protected]>
>>>>>> ---
>>>>>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
>>>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>>>>> index 1f8b93f42c76..205d9ea9c023 100644
>>>>>> --- a/drivers/cpufreq/cpufreq.c
>>>>>> +++ b/drivers/cpufreq/cpufreq.c
>>>>>> @@ -27,6 +27,7 @@
>>>>>> #include <linux/slab.h>
>>>>>> #include <linux/suspend.h>
>>>>>> #include <linux/syscore_ops.h>
>>>>>> +#include <linux/thermal.h>
>>>>>> #include <linux/tick.h>
>>>>>> #include <linux/units.h>
>>>>>> #include <trace/events/power.h>
>>>>>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>>>>>> static ssize_t store_scaling_max_freq
>>>>>> (struct cpufreq_policy *policy, const char *buf, size_t count)
>>>>>> {
>>>>>> + unsigned int frequency;
>>>>>> + struct cpumask *cpus;
>>>>>> unsigned long val;
>>>>>> int ret;
>>>>>>
>>>>>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
>>>>>> return -EINVAL;
>>>>>>
>>>>>> ret = freq_qos_update_request(policy->max_freq_req, val);
>>>>>> - return ret >= 0 ? count : ret;
>>>>>> + if (ret >= 0) {
>>>>>> + /*
>>>>>> + * Make sure that the task scheduler sees these CPUs
>>>>>> + * capacity reduction. Use the thermal pressure mechanism
>>>>>> + * to propagate this information to the scheduler.
>>>>>> + */
>>>>>> + cpus = policy->related_cpus;
>>>>>
>>>>> No need of this, just use related_cpus directly.
>>>>>
>>>>>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
>>>>>> + arch_update_thermal_pressure(cpus, frequency);
>>>>>
>>>>> I wonder if using the thermal-pressure API here is the right thing to
>>>>> do. It is a change coming from User, which may or may not be
>>>>> thermal-related.
>>>>
>>>> Yes, I thought the same. Thermal-pressure name might be not the
>>>> best for covering this use case. I have been thinking about this
>>>> thermal pressure mechanism for a while, since there are other
>>>> use cases like PowerCap DTPM which also reduces CPU capacity
>>>> because of power policy from user-space. We don't notify
>>>> the scheduler about it. There might be also an issue with virtual
>>>> guest OS and how that kernel 'sees' the capacity of CPUs.
>>>> We might try to use this 'thermal-pressure' in the guest kernel
>>>> to notify about available CPU capacity (just a proposal, not
>>>> even an RFC, since we are missing requirements, but issues where
>>>> discussed on LPC 2022 on ChromeOS+Android_guest)
>>>
>>> The User space setting scaling_max_freq is a long scale event and it
>>> should be considered as a new running environnement instead of a
>>> transient event. I would suggest updating the EM is and capacity orig
>>> of the system in this case. Similarly, we rebuild sched_domain with a
>>> cpu hotplug. scaling_max_freq interface should not be used to do any
>>> kind of dynamic scaling.
>>
>> I tend to agree, but the EM capacity would be only used in part of EAS
>> code. The whole fair.c view to the capacity_of() (RT + DL + irq +
>> thermal_pressure) would be still wrong in other parts, e.g.
>> select_idle_sibling() and load balance.
>>
>> When we get this powerhint we might be already in overutilied state
>> so EAS is disabled. IMO other mechanisms in the task scheduler
>> should be also aware of that capacity reduction.
>
> That's why I also mentioned the capacity_orig
Well, I think this is a bit more complex. Thermal framework governor
reduces the perf IDs from top in the freq asc table and keeps that
in the statistics in sysfs. It also updates the thermal pressure signal.
When we rebuild the capacity of CPUs and make the capacity_orig smaller,
the capacity_of would still have the thermal framework reduced capacity
in there. We would end up with too small CPU capacity due to this
subtraction in capacity_of.
Ideally, I would see a mechanism which is aware of this performance
reduction reason:
1. thermal capping
2. power capping (from DTPM)
3. max freq reduction by user space
That common place would figure and maintain the context for the
requested capacity reduction.
BTW, those Android user space max freq requests are not that long,
mostly due to camera capturing (you can see a few in this file,
e.g. [1]).
[1]
https://android.googlesource.com/device/google/gs101/+/refs/heads/android12-qpr1-d-release/powerhint.json#441
On Mon, 10 Oct 2022 at 12:12, Lukasz Luba <[email protected]> wrote:
>
>
>
> On 10/10/22 10:32, Vincent Guittot wrote:
> > On Mon, 10 Oct 2022 at 11:30, Lukasz Luba <[email protected]> wrote:
> >>
> >>
> >>
> >> On 10/10/22 10:15, Vincent Guittot wrote:
> >>> On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/10/22 06:39, Viresh Kumar wrote:
> >>>>> Would be good to always CC Scheduler maintainers for such a patch.
> >>>>
> >>>> Agree, I'll do that.
> >>>>
> >>>>>
> >>>>> On 30-09-22, 10:48, Lukasz Luba wrote:
> >>>>>> When the new max frequency value is stored, the task scheduler must
> >>>>>> know about it. The scheduler uses the CPUs capacity information in the
> >>>>>> task placement. Use the existing mechanism which provides information
> >>>>>> about reduced CPU capacity to the scheduler due to thermal capping.
> >>>>>>
> >>>>>> Signed-off-by: Lukasz Luba <[email protected]>
> >>>>>> ---
> >>>>>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
> >>>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> >>>>>> index 1f8b93f42c76..205d9ea9c023 100644
> >>>>>> --- a/drivers/cpufreq/cpufreq.c
> >>>>>> +++ b/drivers/cpufreq/cpufreq.c
> >>>>>> @@ -27,6 +27,7 @@
> >>>>>> #include <linux/slab.h>
> >>>>>> #include <linux/suspend.h>
> >>>>>> #include <linux/syscore_ops.h>
> >>>>>> +#include <linux/thermal.h>
> >>>>>> #include <linux/tick.h>
> >>>>>> #include <linux/units.h>
> >>>>>> #include <trace/events/power.h>
> >>>>>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> >>>>>> static ssize_t store_scaling_max_freq
> >>>>>> (struct cpufreq_policy *policy, const char *buf, size_t count)
> >>>>>> {
> >>>>>> + unsigned int frequency;
> >>>>>> + struct cpumask *cpus;
> >>>>>> unsigned long val;
> >>>>>> int ret;
> >>>>>>
> >>>>>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
> >>>>>> return -EINVAL;
> >>>>>>
> >>>>>> ret = freq_qos_update_request(policy->max_freq_req, val);
> >>>>>> - return ret >= 0 ? count : ret;
> >>>>>> + if (ret >= 0) {
> >>>>>> + /*
> >>>>>> + * Make sure that the task scheduler sees these CPUs
> >>>>>> + * capacity reduction. Use the thermal pressure mechanism
> >>>>>> + * to propagate this information to the scheduler.
> >>>>>> + */
> >>>>>> + cpus = policy->related_cpus;
> >>>>>
> >>>>> No need of this, just use related_cpus directly.
> >>>>>
> >>>>>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
> >>>>>> + arch_update_thermal_pressure(cpus, frequency);
> >>>>>
> >>>>> I wonder if using the thermal-pressure API here is the right thing to
> >>>>> do. It is a change coming from User, which may or may not be
> >>>>> thermal-related.
> >>>>
> >>>> Yes, I thought the same. Thermal-pressure name might be not the
> >>>> best for covering this use case. I have been thinking about this
> >>>> thermal pressure mechanism for a while, since there are other
> >>>> use cases like PowerCap DTPM which also reduces CPU capacity
> >>>> because of power policy from user-space. We don't notify
> >>>> the scheduler about it. There might be also an issue with virtual
> >>>> guest OS and how that kernel 'sees' the capacity of CPUs.
> >>>> We might try to use this 'thermal-pressure' in the guest kernel
> >>>> to notify about available CPU capacity (just a proposal, not
> >>>> even an RFC, since we are missing requirements, but issues where
> >>>> discussed on LPC 2022 on ChromeOS+Android_guest)
> >>>
> >>> The User space setting scaling_max_freq is a long scale event and it
> >>> should be considered as a new running environnement instead of a
> >>> transient event. I would suggest updating the EM is and capacity orig
> >>> of the system in this case. Similarly, we rebuild sched_domain with a
> >>> cpu hotplug. scaling_max_freq interface should not be used to do any
> >>> kind of dynamic scaling.
> >>
> >> I tend to agree, but the EM capacity would be only used in part of EAS
> >> code. The whole fair.c view to the capacity_of() (RT + DL + irq +
> >> thermal_pressure) would be still wrong in other parts, e.g.
> >> select_idle_sibling() and load balance.
> >>
> >> When we get this powerhint we might be already in overutilied state
> >> so EAS is disabled. IMO other mechanisms in the task scheduler
> >> should be also aware of that capacity reduction.
> >
> > That's why I also mentioned the capacity_orig
>
> Well, I think this is a bit more complex. Thermal framework governor
> reduces the perf IDs from top in the freq asc table and keeps that
> in the statistics in sysfs. It also updates the thermal pressure signal.
> When we rebuild the capacity of CPUs and make the capacity_orig smaller,
> the capacity_of would still have the thermal framework reduced capacity
> in there. We would end up with too small CPU capacity due to this
> subtraction in capacity_of.
That's why using user space interface should not be used to do dynamic scaling.
I still think that user space interface is not the right interface
>
> Ideally, I would see a mechanism which is aware of this performance
> reduction reason:
> 1. thermal capping
> 2. power capping (from DTPM)
> 3. max freq reduction by user space
Yes for thermal and power capping but no for user space
>
> That common place would figure and maintain the context for the
> requested capacity reduction.
>
> BTW, those Android user space max freq requests are not that long,
> mostly due to camera capturing (you can see a few in this file,
> e.g. [1]).
Why are they doing this ?
This doesn't seem to be the correct interface to use. It seems to do
some power budget and they should use the right interface for this
>
>
> [1]
> https://android.googlesource.com/device/google/gs101/+/refs/heads/android12-qpr1-d-release/powerhint.json#441
On Mon, Oct 10, 2022 at 11:12:06AM +0100, Lukasz Luba wrote:
> BTW, those Android user space max freq requests are not that long,
> mostly due to camera capturing (you can see a few in this file,
> e.g. [1]).
It does what now ?!? Why is Android using this *at*all* ?
+CC Daniel
On 10/10/22 11:22, Vincent Guittot wrote:
> On Mon, 10 Oct 2022 at 12:12, Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 10/10/22 10:32, Vincent Guittot wrote:
>>> On Mon, 10 Oct 2022 at 11:30, Lukasz Luba <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 10/10/22 10:15, Vincent Guittot wrote:
>>>>> On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/10/22 06:39, Viresh Kumar wrote:
>>>>>>> Would be good to always CC Scheduler maintainers for such a patch.
>>>>>>
>>>>>> Agree, I'll do that.
>>>>>>
>>>>>>>
>>>>>>> On 30-09-22, 10:48, Lukasz Luba wrote:
>>>>>>>> When the new max frequency value is stored, the task scheduler must
>>>>>>>> know about it. The scheduler uses the CPUs capacity information in the
>>>>>>>> task placement. Use the existing mechanism which provides information
>>>>>>>> about reduced CPU capacity to the scheduler due to thermal capping.
>>>>>>>>
>>>>>>>> Signed-off-by: Lukasz Luba <[email protected]>
>>>>>>>> ---
>>>>>>>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
>>>>>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>>>>>>> index 1f8b93f42c76..205d9ea9c023 100644
>>>>>>>> --- a/drivers/cpufreq/cpufreq.c
>>>>>>>> +++ b/drivers/cpufreq/cpufreq.c
>>>>>>>> @@ -27,6 +27,7 @@
>>>>>>>> #include <linux/slab.h>
>>>>>>>> #include <linux/suspend.h>
>>>>>>>> #include <linux/syscore_ops.h>
>>>>>>>> +#include <linux/thermal.h>
>>>>>>>> #include <linux/tick.h>
>>>>>>>> #include <linux/units.h>
>>>>>>>> #include <trace/events/power.h>
>>>>>>>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>>>>>>>> static ssize_t store_scaling_max_freq
>>>>>>>> (struct cpufreq_policy *policy, const char *buf, size_t count)
>>>>>>>> {
>>>>>>>> + unsigned int frequency;
>>>>>>>> + struct cpumask *cpus;
>>>>>>>> unsigned long val;
>>>>>>>> int ret;
>>>>>>>>
>>>>>>>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
>>>>>>>> return -EINVAL;
>>>>>>>>
>>>>>>>> ret = freq_qos_update_request(policy->max_freq_req, val);
>>>>>>>> - return ret >= 0 ? count : ret;
>>>>>>>> + if (ret >= 0) {
>>>>>>>> + /*
>>>>>>>> + * Make sure that the task scheduler sees these CPUs
>>>>>>>> + * capacity reduction. Use the thermal pressure mechanism
>>>>>>>> + * to propagate this information to the scheduler.
>>>>>>>> + */
>>>>>>>> + cpus = policy->related_cpus;
>>>>>>>
>>>>>>> No need of this, just use related_cpus directly.
>>>>>>>
>>>>>>>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
>>>>>>>> + arch_update_thermal_pressure(cpus, frequency);
>>>>>>>
>>>>>>> I wonder if using the thermal-pressure API here is the right thing to
>>>>>>> do. It is a change coming from User, which may or may not be
>>>>>>> thermal-related.
>>>>>>
>>>>>> Yes, I thought the same. Thermal-pressure name might be not the
>>>>>> best for covering this use case. I have been thinking about this
>>>>>> thermal pressure mechanism for a while, since there are other
>>>>>> use cases like PowerCap DTPM which also reduces CPU capacity
>>>>>> because of power policy from user-space. We don't notify
>>>>>> the scheduler about it. There might be also an issue with virtual
>>>>>> guest OS and how that kernel 'sees' the capacity of CPUs.
>>>>>> We might try to use this 'thermal-pressure' in the guest kernel
>>>>>> to notify about available CPU capacity (just a proposal, not
>>>>>> even an RFC, since we are missing requirements, but issues where
>>>>>> discussed on LPC 2022 on ChromeOS+Android_guest)
>>>>>
>>>>> The User space setting scaling_max_freq is a long scale event and it
>>>>> should be considered as a new running environnement instead of a
>>>>> transient event. I would suggest updating the EM is and capacity orig
>>>>> of the system in this case. Similarly, we rebuild sched_domain with a
>>>>> cpu hotplug. scaling_max_freq interface should not be used to do any
>>>>> kind of dynamic scaling.
>>>>
>>>> I tend to agree, but the EM capacity would be only used in part of EAS
>>>> code. The whole fair.c view to the capacity_of() (RT + DL + irq +
>>>> thermal_pressure) would be still wrong in other parts, e.g.
>>>> select_idle_sibling() and load balance.
>>>>
>>>> When we get this powerhint we might be already in overutilied state
>>>> so EAS is disabled. IMO other mechanisms in the task scheduler
>>>> should be also aware of that capacity reduction.
>>>
>>> That's why I also mentioned the capacity_orig
>>
>> Well, I think this is a bit more complex. Thermal framework governor
>> reduces the perf IDs from top in the freq asc table and keeps that
>> in the statistics in sysfs. It also updates the thermal pressure signal.
>> When we rebuild the capacity of CPUs and make the capacity_orig smaller,
>> the capacity_of would still have the thermal framework reduced capacity
>> in there. We would end up with too small CPU capacity due to this
>> subtraction in capacity_of.
>
> That's why using user space interface should not be used to do dynamic scaling.
> I still think that user space interface is not the right interface
>
>>
>> Ideally, I would see a mechanism which is aware of this performance
>> reduction reason:
>> 1. thermal capping
>> 2. power capping (from DTPM)
>> 3. max freq reduction by user space
>
> Yes for thermal and power capping but no for user space
>
>>
>> That common place would figure and maintain the context for the
>> requested capacity reduction.
>>
>> BTW, those Android user space max freq requests are not that long,
>> mostly due to camera capturing (you can see a few in this file,
>> e.g. [1]).
>
> Why are they doing this ?
> This doesn't seem to be the correct interface to use. It seems to do
> some power budget and they should use the right interface for this
Yes, I agree. I have sent explanation with this to Peter's emails.
Daniel tries to give them a better interface: DTPM, but also would
suffer the same issue of capacity reduction for this short time.
We have a few discussions about it, also Daniel presented on a few
LPC those issues.
+CC Daniel, since I have mentioned a few times DTPM
On 10/10/22 11:25, Peter Zijlstra wrote:
> On Mon, Oct 10, 2022 at 11:12:06AM +0100, Lukasz Luba wrote:
>> BTW, those Android user space max freq requests are not that long,
>> mostly due to camera capturing (you can see a few in this file,
>> e.g. [1]).
>
> It does what now ?!? Why is Android using this *at*all* ?
It tries to balance the power budget, before bad things happen
randomly (throttling different devices w/o a good context what's
going on). Please keep in mind that we have ~3 Watts total power
budget in a phone, while several devices might be suddenly used:
1. big CPU with max power ~3-3.5 Watts (and we have 2 cores on pixel6)
2. GPU with max power ~6Watts (normally ~1-2Watts when lightly used)
3. ISP (Image Signal Processor) up to ~2Watts
4. DSP also up to 1-2Watts
We don't have currently a good mechanism which could be aware
of the total power/thermal budget and relations between those
devices. Vendors and OEMs run experiments on devices and profile
them to work more predictable in those 'important to users' scenarios.
AFAIK Daniel Lescano is trying to help with this new interface
for PowerCap: DTMP. It might be use as a new interface for those known
scenarios like the camera snapshot. But that interface is on the list
that I have also mentioned - it's missing the notification mechanism
for the scheduler reduced capacity due to user-space new scenario.
On Mon, 10 Oct 2022 at 12:49, Lukasz Luba <[email protected]> wrote:
>
>
> +CC Daniel
>
> On 10/10/22 11:22, Vincent Guittot wrote:
> > On Mon, 10 Oct 2022 at 12:12, Lukasz Luba <[email protected]> wrote:
> >>
> >>
> >>
> >> On 10/10/22 10:32, Vincent Guittot wrote:
> >>> On Mon, 10 Oct 2022 at 11:30, Lukasz Luba <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/10/22 10:15, Vincent Guittot wrote:
> >>>>> On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 10/10/22 06:39, Viresh Kumar wrote:
> >>>>>>> Would be good to always CC Scheduler maintainers for such a patch.
> >>>>>>
> >>>>>> Agree, I'll do that.
> >>>>>>
> >>>>>>>
> >>>>>>> On 30-09-22, 10:48, Lukasz Luba wrote:
> >>>>>>>> When the new max frequency value is stored, the task scheduler must
> >>>>>>>> know about it. The scheduler uses the CPUs capacity information in the
> >>>>>>>> task placement. Use the existing mechanism which provides information
> >>>>>>>> about reduced CPU capacity to the scheduler due to thermal capping.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Lukasz Luba <[email protected]>
> >>>>>>>> ---
> >>>>>>>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
> >>>>>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> >>>>>>>> index 1f8b93f42c76..205d9ea9c023 100644
> >>>>>>>> --- a/drivers/cpufreq/cpufreq.c
> >>>>>>>> +++ b/drivers/cpufreq/cpufreq.c
> >>>>>>>> @@ -27,6 +27,7 @@
> >>>>>>>> #include <linux/slab.h>
> >>>>>>>> #include <linux/suspend.h>
> >>>>>>>> #include <linux/syscore_ops.h>
> >>>>>>>> +#include <linux/thermal.h>
> >>>>>>>> #include <linux/tick.h>
> >>>>>>>> #include <linux/units.h>
> >>>>>>>> #include <trace/events/power.h>
> >>>>>>>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> >>>>>>>> static ssize_t store_scaling_max_freq
> >>>>>>>> (struct cpufreq_policy *policy, const char *buf, size_t count)
> >>>>>>>> {
> >>>>>>>> + unsigned int frequency;
> >>>>>>>> + struct cpumask *cpus;
> >>>>>>>> unsigned long val;
> >>>>>>>> int ret;
> >>>>>>>>
> >>>>>>>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
> >>>>>>>> return -EINVAL;
> >>>>>>>>
> >>>>>>>> ret = freq_qos_update_request(policy->max_freq_req, val);
> >>>>>>>> - return ret >= 0 ? count : ret;
> >>>>>>>> + if (ret >= 0) {
> >>>>>>>> + /*
> >>>>>>>> + * Make sure that the task scheduler sees these CPUs
> >>>>>>>> + * capacity reduction. Use the thermal pressure mechanism
> >>>>>>>> + * to propagate this information to the scheduler.
> >>>>>>>> + */
> >>>>>>>> + cpus = policy->related_cpus;
> >>>>>>>
> >>>>>>> No need of this, just use related_cpus directly.
> >>>>>>>
> >>>>>>>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
> >>>>>>>> + arch_update_thermal_pressure(cpus, frequency);
> >>>>>>>
> >>>>>>> I wonder if using the thermal-pressure API here is the right thing to
> >>>>>>> do. It is a change coming from User, which may or may not be
> >>>>>>> thermal-related.
> >>>>>>
> >>>>>> Yes, I thought the same. Thermal-pressure name might be not the
> >>>>>> best for covering this use case. I have been thinking about this
> >>>>>> thermal pressure mechanism for a while, since there are other
> >>>>>> use cases like PowerCap DTPM which also reduces CPU capacity
> >>>>>> because of power policy from user-space. We don't notify
> >>>>>> the scheduler about it. There might be also an issue with virtual
> >>>>>> guest OS and how that kernel 'sees' the capacity of CPUs.
> >>>>>> We might try to use this 'thermal-pressure' in the guest kernel
> >>>>>> to notify about available CPU capacity (just a proposal, not
> >>>>>> even an RFC, since we are missing requirements, but issues where
> >>>>>> discussed on LPC 2022 on ChromeOS+Android_guest)
> >>>>>
> >>>>> The User space setting scaling_max_freq is a long scale event and it
> >>>>> should be considered as a new running environnement instead of a
> >>>>> transient event. I would suggest updating the EM is and capacity orig
> >>>>> of the system in this case. Similarly, we rebuild sched_domain with a
> >>>>> cpu hotplug. scaling_max_freq interface should not be used to do any
> >>>>> kind of dynamic scaling.
> >>>>
> >>>> I tend to agree, but the EM capacity would be only used in part of EAS
> >>>> code. The whole fair.c view to the capacity_of() (RT + DL + irq +
> >>>> thermal_pressure) would be still wrong in other parts, e.g.
> >>>> select_idle_sibling() and load balance.
> >>>>
> >>>> When we get this powerhint we might be already in overutilied state
> >>>> so EAS is disabled. IMO other mechanisms in the task scheduler
> >>>> should be also aware of that capacity reduction.
> >>>
> >>> That's why I also mentioned the capacity_orig
> >>
> >> Well, I think this is a bit more complex. Thermal framework governor
> >> reduces the perf IDs from top in the freq asc table and keeps that
> >> in the statistics in sysfs. It also updates the thermal pressure signal.
> >> When we rebuild the capacity of CPUs and make the capacity_orig smaller,
> >> the capacity_of would still have the thermal framework reduced capacity
> >> in there. We would end up with too small CPU capacity due to this
> >> subtraction in capacity_of.
> >
> > That's why using user space interface should not be used to do dynamic scaling.
> > I still think that user space interface is not the right interface
> >
> >>
> >> Ideally, I would see a mechanism which is aware of this performance
> >> reduction reason:
> >> 1. thermal capping
> >> 2. power capping (from DTPM)
> >> 3. max freq reduction by user space
> >
> > Yes for thermal and power capping but no for user space
> >
> >>
> >> That common place would figure and maintain the context for the
> >> requested capacity reduction.
> >>
> >> BTW, those Android user space max freq requests are not that long,
> >> mostly due to camera capturing (you can see a few in this file,
> >> e.g. [1]).
> >
> > Why are they doing this ?
> > This doesn't seem to be the correct interface to use. It seems to do
> > some power budget and they should use the right interface for this
>
> Yes, I agree. I have sent explanation with this to Peter's emails.
> Daniel tries to give them a better interface: DTPM, but also would
> suffer the same issue of capacity reduction for this short time.
The comments in this thread are only about using the userspace
interface scale_max_freq to dynamically scale max freq and then to try
to report these changes in the thermal_pressure, which is the purpose
of this patch.
As said at LPC, I'm fine to rename thermal_pressure for something more
generic but this is not the purpose of this patch. This patch is about
connecting userspace scale_max_freq to thermal_pressure and it's not
the right things to do
>
> We have a few discussions about it, also Daniel presented on a few
> LPC those issues.
On 10/10/22 13:21, Vincent Guittot wrote:
> On Mon, 10 Oct 2022 at 12:49, Lukasz Luba <[email protected]> wrote:
>>
>>
>> +CC Daniel
>>
>> On 10/10/22 11:22, Vincent Guittot wrote:
>>> On Mon, 10 Oct 2022 at 12:12, Lukasz Luba <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 10/10/22 10:32, Vincent Guittot wrote:
>>>>> On Mon, 10 Oct 2022 at 11:30, Lukasz Luba <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/10/22 10:15, Vincent Guittot wrote:
>>>>>>> On Mon, 10 Oct 2022 at 11:02, Lukasz Luba <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/10/22 06:39, Viresh Kumar wrote:
>>>>>>>>> Would be good to always CC Scheduler maintainers for such a patch.
>>>>>>>>
>>>>>>>> Agree, I'll do that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 30-09-22, 10:48, Lukasz Luba wrote:
>>>>>>>>>> When the new max frequency value is stored, the task scheduler must
>>>>>>>>>> know about it. The scheduler uses the CPUs capacity information in the
>>>>>>>>>> task placement. Use the existing mechanism which provides information
>>>>>>>>>> about reduced CPU capacity to the scheduler due to thermal capping.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Lukasz Luba <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>> drivers/cpufreq/cpufreq.c | 18 +++++++++++++++++-
>>>>>>>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
>>>>>>>>>> index 1f8b93f42c76..205d9ea9c023 100644
>>>>>>>>>> --- a/drivers/cpufreq/cpufreq.c
>>>>>>>>>> +++ b/drivers/cpufreq/cpufreq.c
>>>>>>>>>> @@ -27,6 +27,7 @@
>>>>>>>>>> #include <linux/slab.h>
>>>>>>>>>> #include <linux/suspend.h>
>>>>>>>>>> #include <linux/syscore_ops.h>
>>>>>>>>>> +#include <linux/thermal.h>
>>>>>>>>>> #include <linux/tick.h>
>>>>>>>>>> #include <linux/units.h>
>>>>>>>>>> #include <trace/events/power.h>
>>>>>>>>>> @@ -718,6 +719,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>>>>>>>>>> static ssize_t store_scaling_max_freq
>>>>>>>>>> (struct cpufreq_policy *policy, const char *buf, size_t count)
>>>>>>>>>> {
>>>>>>>>>> + unsigned int frequency;
>>>>>>>>>> + struct cpumask *cpus;
>>>>>>>>>> unsigned long val;
>>>>>>>>>> int ret;
>>>>>>>>>>
>>>>>>>>>> @@ -726,7 +729,20 @@ static ssize_t store_scaling_max_freq
>>>>>>>>>> return -EINVAL;
>>>>>>>>>>
>>>>>>>>>> ret = freq_qos_update_request(policy->max_freq_req, val);
>>>>>>>>>> - return ret >= 0 ? count : ret;
>>>>>>>>>> + if (ret >= 0) {
>>>>>>>>>> + /*
>>>>>>>>>> + * Make sure that the task scheduler sees these CPUs
>>>>>>>>>> + * capacity reduction. Use the thermal pressure mechanism
>>>>>>>>>> + * to propagate this information to the scheduler.
>>>>>>>>>> + */
>>>>>>>>>> + cpus = policy->related_cpus;
>>>>>>>>>
>>>>>>>>> No need of this, just use related_cpus directly.
>>>>>>>>>
>>>>>>>>>> + frequency = __resolve_freq(policy, val, CPUFREQ_RELATION_HE);
>>>>>>>>>> + arch_update_thermal_pressure(cpus, frequency);
>>>>>>>>>
>>>>>>>>> I wonder if using the thermal-pressure API here is the right thing to
>>>>>>>>> do. It is a change coming from User, which may or may not be
>>>>>>>>> thermal-related.
>>>>>>>>
>>>>>>>> Yes, I thought the same. Thermal-pressure name might be not the
>>>>>>>> best for covering this use case. I have been thinking about this
>>>>>>>> thermal pressure mechanism for a while, since there are other
>>>>>>>> use cases like PowerCap DTPM which also reduces CPU capacity
>>>>>>>> because of power policy from user-space. We don't notify
>>>>>>>> the scheduler about it. There might be also an issue with virtual
>>>>>>>> guest OS and how that kernel 'sees' the capacity of CPUs.
>>>>>>>> We might try to use this 'thermal-pressure' in the guest kernel
>>>>>>>> to notify about available CPU capacity (just a proposal, not
>>>>>>>> even an RFC, since we are missing requirements, but issues where
>>>>>>>> discussed on LPC 2022 on ChromeOS+Android_guest)
>>>>>>>
>>>>>>> The User space setting scaling_max_freq is a long scale event and it
>>>>>>> should be considered as a new running environnement instead of a
>>>>>>> transient event. I would suggest updating the EM is and capacity orig
>>>>>>> of the system in this case. Similarly, we rebuild sched_domain with a
>>>>>>> cpu hotplug. scaling_max_freq interface should not be used to do any
>>>>>>> kind of dynamic scaling.
>>>>>>
>>>>>> I tend to agree, but the EM capacity would be only used in part of EAS
>>>>>> code. The whole fair.c view to the capacity_of() (RT + DL + irq +
>>>>>> thermal_pressure) would be still wrong in other parts, e.g.
>>>>>> select_idle_sibling() and load balance.
>>>>>>
>>>>>> When we get this powerhint we might be already in overutilied state
>>>>>> so EAS is disabled. IMO other mechanisms in the task scheduler
>>>>>> should be also aware of that capacity reduction.
>>>>>
>>>>> That's why I also mentioned the capacity_orig
>>>>
>>>> Well, I think this is a bit more complex. Thermal framework governor
>>>> reduces the perf IDs from top in the freq asc table and keeps that
>>>> in the statistics in sysfs. It also updates the thermal pressure signal.
>>>> When we rebuild the capacity of CPUs and make the capacity_orig smaller,
>>>> the capacity_of would still have the thermal framework reduced capacity
>>>> in there. We would end up with too small CPU capacity due to this
>>>> subtraction in capacity_of.
>>>
>>> That's why using user space interface should not be used to do dynamic scaling.
>>> I still think that user space interface is not the right interface
>>>
>>>>
>>>> Ideally, I would see a mechanism which is aware of this performance
>>>> reduction reason:
>>>> 1. thermal capping
>>>> 2. power capping (from DTPM)
>>>> 3. max freq reduction by user space
>>>
>>> Yes for thermal and power capping but no for user space
>>>
>>>>
>>>> That common place would figure and maintain the context for the
>>>> requested capacity reduction.
>>>>
>>>> BTW, those Android user space max freq requests are not that long,
>>>> mostly due to camera capturing (you can see a few in this file,
>>>> e.g. [1]).
>>>
>>> Why are they doing this ?
>>> This doesn't seem to be the correct interface to use. It seems to do
>>> some power budget and they should use the right interface for this
>>
>> Yes, I agree. I have sent explanation with this to Peter's emails.
>> Daniel tries to give them a better interface: DTPM, but also would
>> suffer the same issue of capacity reduction for this short time.
>
> The comments in this thread are only about using the userspace
> interface scale_max_freq to dynamically scale max freq and then to try
> to report these changes in the thermal_pressure, which is the purpose
> of this patch.
Fair enough. I think we all see those potential issues but from
different angle, which is good for progress. We also bring different
experience. This patch appeared to be useful for sharing those views and
the issue.
>
> As said at LPC, I'm fine to rename thermal_pressure for something more
> generic but this is not the purpose of this patch. This patch is about
> connecting userspace scale_max_freq to thermal_pressure and it's not
> the right things to do
Sounds good, something more generic would be great to have.
I will start a discussion in a new thread that topic then.
Thanks for sharing your opinion!
On Mon, Oct 10, 2022 at 11:46:29AM +0100, Lukasz Luba wrote:
>
> +CC Daniel, since I have mentioned a few times DTPM
>
> On 10/10/22 11:25, Peter Zijlstra wrote:
> > On Mon, Oct 10, 2022 at 11:12:06AM +0100, Lukasz Luba wrote:
> > > BTW, those Android user space max freq requests are not that long,
> > > mostly due to camera capturing (you can see a few in this file,
> > > e.g. [1]).
> >
> > It does what now ?!? Why is Android using this *at*all* ?
>
> It tries to balance the power budget, before bad things happen
> randomly (throttling different devices w/o a good context what's
> going on). Please keep in mind that we have ~3 Watts total power
> budget in a phone, while several devices might be suddenly used:
> 1. big CPU with max power ~3-3.5 Watts (and we have 2 cores on pixel6)
> 2. GPU with max power ~6Watts (normally ~1-2Watts when lightly used)
> 3. ISP (Image Signal Processor) up to ~2Watts
> 4. DSP also up to 1-2Watts
>
> We don't have currently a good mechanism which could be aware
> of the total power/thermal budget and relations between those
> devices. Vendors and OEMs run experiments on devices and profile
> them to work more predictable in those 'important to users' scenarios.
>
> AFAIK Daniel Lescano is trying to help with this new interface
> for PowerCap: DTMP. It might be use as a new interface for those known
> scenarios like the camera snapshot. But that interface is on the list
> that I have also mentioned - it's missing the notification mechanism
> for the scheduler reduced capacity due to user-space new scenario.
DTMP is like IPA but including random devices? Because I thought IPA
already did lots of this.
On 10/11/22 09:38, Peter Zijlstra wrote:
> On Mon, Oct 10, 2022 at 11:46:29AM +0100, Lukasz Luba wrote:
>>
>> +CC Daniel, since I have mentioned a few times DTPM
>>
>> On 10/10/22 11:25, Peter Zijlstra wrote:
>>> On Mon, Oct 10, 2022 at 11:12:06AM +0100, Lukasz Luba wrote:
>>>> BTW, those Android user space max freq requests are not that long,
>>>> mostly due to camera capturing (you can see a few in this file,
>>>> e.g. [1]).
>>>
>>> It does what now ?!? Why is Android using this *at*all* ?
>>
>> It tries to balance the power budget, before bad things happen
>> randomly (throttling different devices w/o a good context what's
>> going on). Please keep in mind that we have ~3 Watts total power
>> budget in a phone, while several devices might be suddenly used:
>> 1. big CPU with max power ~3-3.5 Watts (and we have 2 cores on pixel6)
>> 2. GPU with max power ~6Watts (normally ~1-2Watts when lightly used)
>> 3. ISP (Image Signal Processor) up to ~2Watts
>> 4. DSP also up to 1-2Watts
>>
>> We don't have currently a good mechanism which could be aware
>> of the total power/thermal budget and relations between those
>> devices. Vendors and OEMs run experiments on devices and profile
>> them to work more predictable in those 'important to users' scenarios.
>>
>> AFAIK Daniel Lescano is trying to help with this new interface
>> for PowerCap: DTMP. It might be use as a new interface for those known
>> scenarios like the camera snapshot. But that interface is on the list
>> that I have also mentioned - it's missing the notification mechanism
>> for the scheduler reduced capacity due to user-space new scenario.
>
> DTMP is like IPA but including random devices? Because I thought IPA
> already did lots of this.
The DTMP is a kernel interface for power split which happen in the user
space policy. It exposes the sysfs to set those scenarios, even before
(like those Android 'powerhints') the power/thermal issue occur. I have
been reviewing it (and advocating internally). There is more work to
do there still and AFAIK is not yet used by Android.
IPA contains the policy to power budget split, but misses this 'context'
of what's going on and would happen. It has some PID mechanism to fix
itself, but it's not a silver bullet.
Furthermore, there are other IPA fundamental issues:
1. You might recall we added last year to IPA the utilization signal
of the CPU runqueues. That model still has issues with input
power estimation and I have described that here [1].
2. Cpu frequency sampling issue (we assume const. freq at whole period)
(also in [1])
3. Power consumption of a CPU at the same frequency varies and depends
on workload instruction mix, e.g. heavy SIMD floating-point code
for some image filter in camera app drains more power vs. a code
which is a garbage-collector background thread traversing a graph
in memory and has big backend stall due to randomness of pointers
(or a game thread for collision detection on octrees).
Our Energy Model doesn't cover such thing (yet).
The issue become more severe for us with last year available big
cores: a new generation of uArch Cortex-X1. They are able to
drain 3.5W instantly, while in Energy Model we have 2.2W for max
freq. In previous big cores we haven't such power hungry CPUs.
A fair assumption was 1.0W for EM value and 1.7W for a pick power
in some SIMD code. That 3.5W-2.2W can heat up the SoC really
quickly and use the free thermal budget easily. So hints from
user space are welcome IMO.
4. User space restriction to cpufreq and devfreq, which are those
'powerhints' about possible coming soon scenarios, are not taken into
account, due to missing interface. I have mentioned it ~2 years ago
and sent a RFC example patch for devfreq (didn't dare to address
cpufreq at once) [2]
5. Thermal-pressure PELT signal converges slowly to the original
instant signal set by thermal governor, so the capacity_of()
has delays to 'observe' the reality of the capped CPUs. In those
user space scenario short hints is important. I have tried to
add a mechanism to react faster, since we might already have
delays in our FW or IPA to the original signal. Patch didn't
make any progress on LKML [3].
6. The leakage. Rising temperature above normal values, causing higher
power drain by the CPU core. Presented on LPC 2022 [4]. This is an
issue when our GPU or ISP heats up the SoC, thus CPUs.
If you like, I can give you more details how those different CPUs
(and other devices) behave under power/thermal stress in various
scenarios. I have spent a lot of time in last ~5years on researching
it.
Regards,
Lukasz
[1]
https://lore.kernel.org/linux-pm/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lpc.events/event/16/contributions/1341/