2021-01-26 11:43:45

by Lukasz Luba

Subject: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

Hi all,

This patch set tries to add the missing feature in the Intelligent Power
Allocation (IPA) governor which is: frequency limit set by user space.
User can set max allowed frequency for a given device which has impact on
max allowed power. In current design there is no mechanism to figure this
out. IPA must know the maximum allowed power for every device. It is then
used for proper power split and divvy-up. When the user limit for max
frequency is not known, IPA assumes it is the highest possible frequency.
This causes a wrong power split across the devices.
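
To illustrate why the assumed max power matters, here is a minimal user-space sketch of a proportional split with capping and redistribution. It is a simplified model written for this note, not the kernel's divvy_up_power(); the function name divvy_up_sketch() and all milliwatt values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model of a budget split (hypothetical, not kernel code):
 * 1. grant each actor a share of the budget proportional to its request,
 * 2. cap each grant at the actor's max power,
 * 3. hand the capped-off surplus to actors that still have headroom.
 */
void divvy_up_sketch(const uint32_t *req, const uint32_t *max, size_t n,
		     uint32_t budget, uint32_t *granted)
{
	uint64_t total_req = 0, extra = 0, headroom_total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total_req += req[i];
	assert(total_req > 0);

	for (i = 0; i < n; i++) {
		granted[i] = (uint32_t)((uint64_t)req[i] * budget / total_req);
		if (granted[i] > max[i]) {
			extra += granted[i] - max[i];
			granted[i] = max[i];
		}
		headroom_total += max[i] - granted[i];
	}

	if (!extra || !headroom_total)
		return;
	if (extra > headroom_total)
		extra = headroom_total;

	/* Redistribute the surplus in proportion to each actor's headroom. */
	for (i = 0; i < n; i++) {
		uint64_t headroom = max[i] - granted[i];

		granted[i] += (uint32_t)(extra * headroom / headroom_total);
	}
}
```

With two actors each requesting 2000 mW and a 3000 mW budget, an assumed GPU max of 2500 mW yields 1500/1500. If the GPU's real, user-limited max of 900 mW is known, the same call yields 2100/900, so the 600 mW the GPU cannot use goes to the CPU instead.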

This new mechanism provides the max allowed frequency to the thermal
framework and then max allowed power to the IPA.
The implementation is done in this way because currently there is no way
to retrieve the limits from the PM QoS, without uncapping the local
thermal limit and reading the next value. It would be a heavy way of
doing these things, since it should be done every polling time (e.g. 50ms).
Also, the value stored in PM QoS can be different than the real OPP 'rate'
so still would need conversion into proper OPP for comparison with EM.
Furthermore, uncapping the device in thermal just to check the user freq
limit is not the safest way.
Thus, this simple implementation moves the calculation of the proper
frequency to the sysfs write code, since it's called less often. The value
is then used as-is in the thermal framework without any hassle.
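
A rough user-space sketch of that idea: round the user's request to a real OPP once, at write time, and cache the result so the polling path can read it directly. The table, struct user_limits, and the helpers opp_find_floor()/opp_find_ceil() are hypothetical stand-ins for this note (the patch itself uses dev_pm_opp_find_freq_floor()/dev_pm_opp_find_freq_ceil()).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OPP table in Hz, sorted ascending (not from the patch). */
static const unsigned long opp_hz[] = {
	200000000UL, 400000000UL, 600000000UL, 800000000UL,
};
#define NUM_OPPS (sizeof(opp_hz) / sizeof(opp_hz[0]))

/* Round *freq down to the nearest OPP; -1 if every OPP is above it. */
int opp_find_floor(unsigned long *freq)
{
	size_t i;

	assert(freq != NULL);
	for (i = NUM_OPPS; i-- > 0; ) {
		if (opp_hz[i] <= *freq) {
			*freq = opp_hz[i];
			return 0;
		}
	}
	return -1;
}

/* Round *freq up to the nearest OPP; -1 if every OPP is below it. */
int opp_find_ceil(unsigned long *freq)
{
	size_t i;

	assert(freq != NULL);
	for (i = 0; i < NUM_OPPS; i++) {
		if (opp_hz[i] >= *freq) {
			*freq = opp_hz[i];
			return 0;
		}
	}
	return -1;
}

/* Cached, already-rounded user limits read by the hot path. */
struct user_limits {
	unsigned long min_hz;
	unsigned long max_hz;
};

/* Write path for the max limit; 0 means "no limit". */
int store_user_max(struct user_limits *lim, unsigned long req_hz)
{
	unsigned long f = req_hz ? req_hz : opp_hz[NUM_OPPS - 1];

	if (opp_find_floor(&f))
		return -1;	/* cached value is left unchanged */
	lim->max_hz = f;
	return 0;
}

/* Write path for the min limit; 0 means "no limit". */
int store_user_min(struct user_limits *lim, unsigned long req_hz)
{
	unsigned long f = req_hz ? req_hz : opp_hz[0];

	if (opp_find_ceil(&f))
		return -1;	/* cached value is left unchanged */
	lim->min_hz = f;
	return 0;
}
```

The max limit rounds down and the min limit rounds up, so the cached pair always names OPPs that lie inside the range the user asked for.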

As it's a RFC, it still misses the cpufreq sysfs implementation, but would
be addressed if all agree.

Regards,
Lukasz Luba

Lukasz Luba (3):
PM /devfreq: add user frequency limits into devfreq struct
thermal: devfreq_cooling: add new callback to get user limit for min
state
thermal: power_allocator: get proper max power limited by user

drivers/devfreq/devfreq.c | 41 ++++++++++++++++++++++++---
drivers/thermal/devfreq_cooling.c | 33 +++++++++++++++++++++
drivers/thermal/gov_power_allocator.c | 17 +++++++++--
include/linux/devfreq.h | 4 +++
include/linux/thermal.h | 1 +
5 files changed, 90 insertions(+), 6 deletions(-)

--
2.17.1


2021-01-26 11:57:03

by Lukasz Luba

Subject: [RFC][PATCH 3/3] thermal: power_allocator: get proper max power limited by user

Use the new API interface to get the maximum power of the cooling device.
This is needed to properly allocate and split the total power budget. The
allowed limit is taken from a cooling device that supports it and then
checked against the limits set in DT. The final state value is used to ask
the cooling device for the related power value.

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/thermal/gov_power_allocator.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/gov_power_allocator.c b/drivers/thermal/gov_power_allocator.c
index 92acae53df49..ec33fba5a358 100644
--- a/drivers/thermal/gov_power_allocator.c
+++ b/drivers/thermal/gov_power_allocator.c
@@ -308,6 +308,20 @@ power_actor_set_power(struct thermal_cooling_device *cdev,
return 0;
}

+static int
+power_actor_get_max_power(struct thermal_cooling_device *cdev,
+ struct thermal_instance *instance, u32 *max_power)
+{
+ unsigned long min_state = 0;
+
+ if (cdev->ops->get_user_min_state)
+ cdev->ops->get_user_min_state(cdev, &min_state);
+
+ min_state = max(instance->lower, min_state);
+
+ return cdev->ops->state2power(cdev, min_state, max_power);
+}
+
/**
* divvy_up_power() - divvy the allocated power between the actors
* @req_power: each actor's requested power
@@ -455,8 +469,7 @@ static int allocate_power(struct thermal_zone_device *tz,

weighted_req_power[i] = frac_to_int(weight * req_power[i]);

- if (cdev->ops->state2power(cdev, instance->lower,
- &max_power[i]))
+ if (power_actor_get_max_power(cdev, instance, &max_power[i]))
continue;

total_req_power += req_power[i];
--
2.17.1

2021-01-26 11:57:24

by Lukasz Luba

Subject: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct

The new fields inside the devfreq struct allow checking the frequency limits
set by the user via sysfs. These limits are important for the thermal governor
Intelligent Power Allocation (IPA), which needs to know the maximum allowed
power consumption of the device.

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/devfreq/devfreq.c | 41 +++++++++++++++++++++++++++++++++++----
include/linux/devfreq.h | 4 ++++
2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 94cc25fd68da..e985a76e5ff3 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -843,6 +843,9 @@ struct devfreq *devfreq_add_device(struct device *dev,
goto err_dev;
}

+ devfreq->user_min_freq = devfreq->scaling_min_freq;
+ devfreq->user_max_freq = devfreq->scaling_max_freq;
+
devfreq->suspend_freq = dev_pm_opp_get_suspend_opp_freq(dev);
atomic_set(&devfreq->suspend_count, 0);

@@ -1513,6 +1516,8 @@ static ssize_t min_freq_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
struct devfreq *df = to_devfreq(dev);
+ struct device *pdev = df->dev.parent;
+ struct dev_pm_opp *opp;
unsigned long value;
int ret;

@@ -1533,6 +1538,19 @@ static ssize_t min_freq_store(struct device *dev, struct device_attribute *attr,
if (ret < 0)
return ret;

+ if (!value)
+ value = df->scaling_min_freq;
+
+ opp = dev_pm_opp_find_freq_ceil(pdev, &value);
+ if (IS_ERR(opp))
+ return count;
+
+ dev_pm_opp_put(opp);
+
+ mutex_lock(&df->lock);
+ df->user_min_freq = value;
+ mutex_unlock(&df->lock);
+
return count;
}

@@ -1554,7 +1572,9 @@ static ssize_t max_freq_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
struct devfreq *df = to_devfreq(dev);
- unsigned long value;
+ struct device *pdev = df->dev.parent;
+ unsigned long value, value_khz;
+ struct dev_pm_opp *opp;
int ret;

/*
@@ -1579,14 +1599,27 @@ static ssize_t max_freq_store(struct device *dev, struct device_attribute *attr,
* A value of zero means "no limit".
*/
if (value)
- value = DIV_ROUND_UP(value, HZ_PER_KHZ);
+ value_khz = DIV_ROUND_UP(value, HZ_PER_KHZ);
else
- value = PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE;
+ value_khz = PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE;

- ret = dev_pm_qos_update_request(&df->user_max_freq_req, value);
+ ret = dev_pm_qos_update_request(&df->user_max_freq_req, value_khz);
if (ret < 0)
return ret;

+ if (!value)
+ value = df->scaling_max_freq;
+
+ opp = dev_pm_opp_find_freq_floor(pdev, &value);
+ if (IS_ERR(opp))
+ return count;
+
+ dev_pm_opp_put(opp);
+
+ mutex_lock(&df->lock);
+ df->user_max_freq = value;
+ mutex_unlock(&df->lock);
+
return count;
}

diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
index b6d3bae1c74d..147a229056d2 100644
--- a/include/linux/devfreq.h
+++ b/include/linux/devfreq.h
@@ -147,6 +147,8 @@ struct devfreq_stats {
* touch this.
* @user_min_freq_req: PM QoS minimum frequency request from user (via sysfs)
* @user_max_freq_req: PM QoS maximum frequency request from user (via sysfs)
+ * @user_min_freq: User's minimum frequency
+ * @user_max_freq: User's maximum frequency
* @scaling_min_freq: Limit minimum frequency requested by OPP interface
* @scaling_max_freq: Limit maximum frequency requested by OPP interface
* @stop_polling: devfreq polling status of a device.
@@ -185,6 +187,8 @@ struct devfreq {
struct dev_pm_qos_request user_max_freq_req;
unsigned long scaling_min_freq;
unsigned long scaling_max_freq;
+ unsigned long user_min_freq;
+ unsigned long user_max_freq;
bool stop_polling;

unsigned long suspend_freq;
--
2.17.1

2021-01-27 21:47:03

by Viresh Kumar

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

On 26-01-21, 10:39, Lukasz Luba wrote:
> As it's a RFC, it still misses the cpufreq sysfs implementation, but would
> be addressed if all agree.

Not commenting on the whole stuff but if you ever need something for cpufreq, it
is already there. Look for these.

store_one(scaling_min_freq, min);
store_one(scaling_max_freq, max);

Hopefully they will work just fine.

--
viresh

2021-01-27 22:00:41

by Lukasz Luba

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power



On 1/27/21 9:15 AM, Viresh Kumar wrote:
> On 26-01-21, 10:39, Lukasz Luba wrote:
>> As it's a RFC, it still misses the cpufreq sysfs implementation, but would
>> be addressed if all agree.
>
> Not commenting on the whole stuff but if you ever need something for cpufreq, it
> is already there. Look for these.
>
> store_one(scaling_min_freq, min);
> store_one(scaling_max_freq, max);
>
> Hopefully they will work just fine.
>

So, can I assume you don't mind if I plumb it into these two?

Yes, I know them and the tricky macro. I just wanted to avoid
one patch for this macro and one patch for cpufreq_cooling.c,
which would use it.

If you agree and Chanwoo agrees for the devfreq, and Daniel
for the new callback in cooling device, then I would continue
by adding missing patches for cpufreq-cooling part.

Regards,
Lukasz

2021-01-27 22:02:42

by Viresh Kumar

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

On 27-01-21, 10:11, Lukasz Luba wrote:
>
>
> On 1/27/21 9:15 AM, Viresh Kumar wrote:
> > On 26-01-21, 10:39, Lukasz Luba wrote:
> > > As it's a RFC, it still misses the cpufreq sysfs implementation, but would
> > > be addressed if all agree.
> >
> > Not commenting on the whole stuff but if you ever need something for cpufreq, it
> > is already there. Look for these.
> >
> > store_one(scaling_min_freq, min);
> > store_one(scaling_max_freq, max);
> >
> > Hopefully they will work just fine.
> >
>
> So, can I assume you don't mind if I plumb it into these two?

No :)

As I said at the top, I am not commenting on the whole thing yet. I may
need to think it over a bit, and Rafael will comment as well.

--
viresh

2021-02-01 11:28:02

by Lukasz Luba

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

Daniel, Chanwoo

Gentle ping. Have you had a chance to check these patches?

2021-02-01 14:24:11

by Rafael J. Wysocki

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

On Tue, Jan 26, 2021 at 11:40 AM Lukasz Luba <[email protected]> wrote:
>
> Hi all,
>
> This patch set tries to add the missing feature in the Intelligent Power
> Allocation (IPA) governor which is: frequency limit set by user space.
> User can set max allowed frequency for a given device which has impact on
> max allowed power.

If there is more than one frequency that can be limited for the given
device, are you going to add a limit knob for each of them?

> In current design there is no mechanism to figure this
> out. IPA must know the maximum allowed power for every device. It is then
> used for proper power split and divvy-up. When the user limit for max
> frequency is not known, IPA assumes it is the highest possible frequency.
> It causes wrong power split across the devices.

Do I think correctly that this depends on the Energy Model?

> This new mechanism provides the max allowed frequency to the thermal
> framework and then max allowed power to the IPA.
> The implementation is done in this way because currently there is no way
> to retrieve the limits from the PM QoS, without uncapping the local
> thermal limit and reading the next value.

The above is unclear. What PM QoS limit are you referring to in the
first place?

> It would be a heavy way of
> doing these things, since it should be done every polling time (e.g. 50ms).
> Also, the value stored in PM QoS can be different than the real OPP 'rate'
> so still would need conversion into proper OPP for comparison with EM.
> Furthermore, uncapping the device in thermal just to check the user freq
> limit is not the safest way.
> Thus, this simple implementation moves the calculation of the proper
> frequency to the sysfs write code, since it's called less often. The value
> is then used as-is in the thermal framework without any hassle.
>
> As it's a RFC, it still misses the cpufreq sysfs implementation,

What exactly do you mean by this?

> but would be addressed if all agree.

Depending on the answers above.

But my general comment would be that it might turn out to be
unrealistic to expect user space to know what frequency limit to use
to get the desired result in terms of constraining power.

2021-02-01 14:26:33

by Daniel Lezcano

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power


Hi Lukasz,

On 01/02/2021 12:23, Lukasz Luba wrote:
> Daniel, Chanwoo
>
> Gentle ping. Have you had a chance to check these patches?

I will review the patches in a couple of days

-- Daniel



2021-02-01 16:39:03

by Lukasz Luba

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

Hi Rafael,

On 2/1/21 2:19 PM, Rafael J. Wysocki wrote:
> On Tue, Jan 26, 2021 at 11:40 AM Lukasz Luba <[email protected]> wrote:
>>
>> Hi all,
>>
>> This patch set tries to add the missing feature in the Intelligent Power
>> Allocation (IPA) governor which is: frequency limit set by user space.
>> User can set max allowed frequency for a given device which has impact on
>> max allowed power.
>
> If there is more than one frequency that can be limited for the given
> device, are you going to add a limit knob for each of them?

I might have been unclear. I was referring to the normal sysfs
scaling_max_freq, which sets the max frequency for a CPU:

echo XYZ > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

and similarly for a devfreq device, like a GPU.


>
>> In current design there is no mechanism to figure this
>> out. IPA must know the maximum allowed power for every device. It is then
>> used for proper power split and divvy-up. When the user limit for max
>> frequency is not known, IPA assumes it is the highest possible frequency.
>> It causes wrong power split across the devices.
>
> Do I think correctly that this depends on the Energy Model?

Not directly, but IPA uses the max freq to ask EM for max power. The
issue is that I don't know this 'max freq' for a given device, because
the user might have set a limit for that device. In that case IPA still
blindly picks the power for the highest frequency.

>
>> This new mechanism provides the max allowed frequency to the thermal
>> framework and then max allowed power to the IPA.
>> The implementation is done in this way because currently there is no way
>> to retrieve the limits from the PM QoS, without uncapping the local
>> thermal limit and reading the next value.
>
> The above is unclear. What PM QoS limit are you referring to in the
> first place?

The PM QoS which we use in thermal for setting the frequency limits,
for cpufreq_cooling [1] and for devfreq_cooling [2]. I am able to read
that PM QoS value, but it is the lowest of the requested limits, which
is not necessarily the one set by the user.
Example:
2000MHz
1800MHz <----- user set this to 'max freq'
1400MHz <----- thermal set that to 'max freq'

then PM QoS would give me the 1400MHz, because it is the limit for
the max freq.

That's why I said that PM QoS is not able to give me the user limit,
unless I revert in IPA the capping for that device.
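
The effect can be modelled with a few lines of user-space C. This is only a sketch of the behaviour described above (the effective max frequency is the lowest of all requests), not the PM QoS implementation; struct max_freq_req, enum req_owner, and aggregate_max_freq() are hypothetical names, and the MHz values are the ones from the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical owners of a max-frequency request. */
enum req_owner { REQ_USER, REQ_THERMAL };

struct max_freq_req {
	enum req_owner owner;
	unsigned int mhz;
};

/*
 * Effective max frequency: the lowest of all requests, bounded by the
 * hardware maximum. A reader sees only this one value, not who asked
 * for what.
 */
unsigned int aggregate_max_freq(const struct max_freq_req *reqs, size_t n,
				unsigned int hw_max_mhz)
{
	unsigned int eff = hw_max_mhz;
	size_t i;

	assert(n == 0 || reqs != NULL);
	for (i = 0; i < n; i++) {
		if (reqs[i].mhz < eff)
			eff = reqs[i].mhz;
	}
	return eff;
}
```

With a user request of 1800 and a thermal request of 1400, the aggregate is 1400. The user's 1800 becomes visible only after the thermal request is raised back to 2000, which is the uncapping step the cover letter wants to avoid.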


>
>> It would be a heavy way of
>> doing these things, since it should be done every polling time (e.g. 50ms).
>> Also, the value stored in PM QoS can be different than the real OPP 'rate'
>> so still would need conversion into proper OPP for comparison with EM.
>> Furthermore, uncapping the device in thermal just to check the user freq
>> limit is not the safest way.
>> Thus, this simple implementation moves the calculation of the proper
>> frequency to the sysfs write code, since it's called less often. The value
>> is then used as-is in the thermal framework without any hassle.
>>
>> As it's a RFC, it still misses the cpufreq sysfs implementation,
>
> What exactly do you mean by this?

I haven't modified cpufreq.c and cpufreq_cooling.c because
maybe for CPUs there is a way to solve it differently, or you might
not want to modify the CPU code at all.

>
>> but would be addressed if all agree.
>
> Depending on the answers above.
>
> But my general comment would be that it might turn out to be
> unrealistic to expect user space to know what frequency limit to use
> to get the desired result in terms of constraining power.
>

There are scenarios where middleware (which is aware of what is in
the foreground on a mobile device) might limit the GPU max freq, to
avoid burning power at the highest OPPs.

Regards,
Lukasz

[1]
https://elixir.bootlin.com/linux/latest/source/drivers/thermal/cpufreq_cooling.c#L443
[2]
https://elixir.bootlin.com/linux/latest/source/drivers/thermal/devfreq_cooling.c#L106


2021-02-01 16:40:37

by Lukasz Luba

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

Hi Daniel,

On 2/1/21 2:21 PM, Daniel Lezcano wrote:
>
> Hi Lukasz,
>
> On 01/02/2021 12:23, Lukasz Luba wrote:
>> Daniel, Chanwoo
>>
>> Gentle ping. Have you had a chance to check these patches?
>
> I will review the patches in a couple of days

Thank you, I will wait then.

Regards,
Lukasz

>
> -- Daniel
>
>

2021-02-02 09:18:07

by Chanwoo Choi

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

Hi Lukasz,

I'll review this patchset by tomorrow.

Thanks.
Chanwoo Choi

--
Best Regards,
Chanwoo Choi
Samsung Electronics

2021-02-02 09:59:48

by Lukasz Luba

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power



On 2/2/21 9:31 AM, Chanwoo Choi wrote:
> Hi Lukasz,
>
> I'll review this patchset until tomorrow.

Thank you Chanwoo, I will wait then.

Lukasz

>
> Thanks.
> Chanwoo Choi
>

2021-02-03 09:59:03

by Chanwoo Choi

Subject: Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct

Hi Lukasz,

Even though devfreq_cooling.c can reach 'user_max_freq' and 'lock'
through the 'devfreq' instance when it needs the max_freq and min_freq,
I don't think direct access to the variables of struct devfreq
(lock/user_max_freq/user_min_freq) is a good idea.

Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
notification with the following changes to 'struct devfreq_freqs'?
It would also need code added to devfreq_set_target() to initialize
'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
notification.

diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
index 147a229056d2..d5726592d362 100644
--- a/include/linux/devfreq.h
+++ b/include/linux/devfreq.h
@@ -207,6 +207,8 @@ struct devfreq {
struct devfreq_freqs {
unsigned long old;
unsigned long new;
+ unsigned long new_max_freq;
+ unsigned long new_min_freq;
};


And I think that the new 'user_min_freq'/'user_max_freq' fields are not
necessary. You can get the current max_freq/min_freq with the following steps:

get_freq_range(devfreq, &min_freq, &max_freq);
dev_pm_opp_find_freq_floor(pdev, &min_freq);
dev_pm_opp_find_freq_floor(pdev, &max_freq);

That way you get the 'max_freq/min_freq' and can then
initialize 'freqs.new_max_freq' and 'freqs.new_min_freq'
with them as follows:

in devfreq_set_target()
get_freq_range(devfreq, &min_freq, &max_freq);
dev_pm_opp_find_freq_floor(pdev, &min_freq);
dev_pm_opp_find_freq_floor(pdev, &max_freq);
freqs.new_min_freq = min_freq;
freqs.new_max_freq = max_freq;
devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
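
As a rough illustration of the notifier idea, the following user-space sketch passes the limits to a listener inside the transition payload, so the listener never touches the sender's private fields or lock. All names here (struct freqs_sketch, struct cooling_sketch, notify_postchange(), cooling_on_postchange()) are hypothetical and only mirror the shape of the proposal; this is not the devfreq notifier API.

```c
#include <assert.h>
#include <stddef.h>

/* Payload modelled on the proposed struct change (hypothetical). */
struct freqs_sketch {
	unsigned long old_freq;
	unsigned long new_freq;
	unsigned long new_max_freq;
	unsigned long new_min_freq;
};

/* Listener-side cache, standing in for the cooling device's state. */
struct cooling_sketch {
	unsigned long min_freq;
	unsigned long max_freq;
};

typedef void (*transition_cb)(const struct freqs_sketch *freqs, void *data);

/* Listener: copy the limits carried by the notification. */
void cooling_on_postchange(const struct freqs_sketch *freqs, void *data)
{
	struct cooling_sketch *cool = data;

	cool->min_freq = freqs->new_min_freq;
	cool->max_freq = freqs->new_max_freq;
}

/* Sender: fill in the limits, then notify after the frequency change. */
void notify_postchange(transition_cb cb, void *data,
		       unsigned long old_freq, unsigned long new_freq,
		       unsigned long min_freq, unsigned long max_freq)
{
	struct freqs_sketch freqs;

	assert(cb != NULL);
	freqs.old_freq = old_freq;
	freqs.new_freq = new_freq;
	freqs.new_min_freq = min_freq;
	freqs.new_max_freq = max_freq;
	cb(&freqs, data);
}
```

The listener only ever sees a snapshot taken by the sender, so in this model no locking of the sender's structure is needed on the listener side.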



--
Best Regards,
Chanwoo Choi
Samsung Electronics

2021-02-03 10:24:31

by Lukasz Luba

Subject: Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct

Hi Chanwoo,

Thank you for looking at this.

On 2/3/21 10:11 AM, Chanwoo Choi wrote:
> Hi Lukasz,
>
> When accessing the max_freq and min_freq at devfreq-cooling.c,
> even if can access 'user_max_freq' and 'lock' by using the 'devfreq' instance,
> I think that the direct access of variables (lock/user_max_freq/user_min_freq)
> of struct devfreq are not good.
>
> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
> notification with following changes of 'struct devfreq_freq'?

I like the idea with devfreq notification. I will have to go through the
code to check that possibility.

> Also, need to add codes into devfreq_set_target() for initializing
> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
> notification.
>
> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
> index 147a229056d2..d5726592d362 100644
> --- a/include/linux/devfreq.h
> +++ b/include/linux/devfreq.h
> @@ -207,6 +207,8 @@ struct devfreq {
> struct devfreq_freqs {
> unsigned long old;
> unsigned long new;
> + unsigned long new_max_freq;
> + unsigned long new_min_freq;
> };
>
>
> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
> You can get the current max_freq/min_freq by using the following steps:
>
> get_freq_range(devfreq, &min_freq, &max_freq);
> dev_pm_opp_find_freq_floor(pdev, &min_freq);
> dev_pm_opp_find_freq_floor(pdev, &max_freq);
>
> So that you can get the 'max_freq/min_freq' and then
> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
> with them as following:
>
> in devfreq_set_target()
> get_freq_range(devfreq, &min_freq, &max_freq);
> dev_pm_opp_find_freq_floor(pdev, &min_freq);
> dev_pm_opp_find_freq_floor(pdev, &max_freq);
> freqs.new_min_freq = min_freq;
> freqs.new_max_freq = max_freq;
> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);

I will plumb it in and check that option. My concern is that
get_freq_range() would give me the max_freq value from PM QoS, which
might be my thermal limit, i.e. lower than user_max_freq. Then I still
need

I've been playing with PM QoS notifications because I thought it would
be possible to be notified in thermal about every newly set value, even a
devfreq sysfs user max_freq write whose value is higher than the
current limit set by the thermal governor. Unfortunately, PM QoS doesn't
send that information by design. PM QoS also, by design, won't allow
me to check the first two limits in the plist, which would be the thermal
and the user sysfs max_freq.

I will experiment with these notifications and share the results.
Thank you for your comments.

Regards,
Lukasz

2021-02-11 11:41:00

by Lukasz Luba

Subject: Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct

Hi Chanwoo,

On 2/3/21 10:21 AM, Lukasz Luba wrote:
> Hi Chanwoo,
>
> Thank you for looking at this.
>
> On 2/3/21 10:11 AM, Chanwoo Choi wrote:
>> Hi Lukasz,
>>
>> When accessing the max_freq and min_freq at devfreq-cooling.c,
>> even if can access 'user_max_freq' and 'lock' by using the 'devfreq'
>> instance,
>> I think that the direct access of variables
>> (lock/user_max_freq/user_min_freq)
>> of struct devfreq are not good.
>>
>> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
>> notification with following changes of 'struct devfreq_freq'?
>
> I like the idea with devfreq notification. I will have to go through the
> code to check that possibility.
>
>> Also, need to add codes into devfreq_set_target() for initializing
>> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
>> notification.
>>
>> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
>> index 147a229056d2..d5726592d362 100644
>> --- a/include/linux/devfreq.h
>> +++ b/include/linux/devfreq.h
>> @@ -207,6 +207,8 @@ struct devfreq {
>>   struct devfreq_freqs {
>>          unsigned long old;
>>          unsigned long new;
>> +       unsigned long new_max_freq;
>> +       unsigned long new_min_freq;
>>   };
>>
>>
>> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
>> You can get the current max_freq/min_freq by using the following steps:
>>
>>     get_freq_range(devfreq, &min_freq, &max_freq);
>>     dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>     dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>
>> So that you can get the 'max_freq/min_freq' and then
>> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
>> with them as following:
>>
>> in devfreq_set_target()
>>     get_freq_range(devfreq, &min_freq, &max_freq);
>>     dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>     dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>     freqs.new_min_freq = min_freq;
>>     freqs.new_max_freq = max_freq;
>>     devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
>
> I will plumb it in and check that option. My concern is that function
> get_freq_range() would give me the max_freq value from PM QoS, which
> might be my thermal limit - lower that user_max_freq. Then I still
> need
>
> I've been playing with PM QoS notifications because I thought it would
> be possible to be notified in thermal for all new set values - even from
> devfreq sysfs user max_freq write, which has value higher that the
> current limit set by thermal governor. Unfortunately PM QoS doesn't
> send that information by design. PM QoS also by desing won't allow
> me to check first two limits in the plist - which would be thermal
> and user sysfs max_freq.
>
> I will experiment with this notifications and share the results.
> That you for your comments.

I have experimented with your proposal. Unfortunately, the value stored
in the pm_qos, which is read by get_freq_range(), is not the user max
freq. It's the value from thermal devfreq cooling when that one is
lower. That is OK in the overall design, but not for my IPA use case.
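
To illustrate the finding above, here is a minimal userspace sketch (not
kernel code) of how a "max frequency" constraint is aggregated: the lowest
request wins, so a higher user value stays invisible while the thermal
request is lower. The struct, the function name and the kHz values are
assumptions made for this illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of one max-frequency request (user sysfs, thermal, ...). */
struct max_freq_req {
	const char *owner;
	unsigned long freq_khz;
};

/*
 * Aggregate all max-frequency requests: the effective limit is the
 * lowest requested value. With no requests there is no limit.
 */
unsigned long aggregate_max_freq(const struct max_freq_req *reqs, size_t n)
{
	unsigned long effective = ULONG_MAX;
	size_t i;

	assert(reqs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (reqs[i].freq_khz < effective)
			effective = reqs[i].freq_khz;
	}

	return effective;
}
```

With a thermal request of 400000 kHz in place, changing the user request
from 600000 kHz to 800000 kHz does not change the aggregated value, which
is why a reader of the aggregate alone cannot recover the user limit.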

Two options come to my mind:
1) this patch proposal, with the simple solution of two new variables
   protected by a mutex, which would maintain the user stored values
2) add a new notification chain in devfreq to notify about a new
   user written value, to which devfreq cooling would register; that
   would allow devfreq cooling to get that value instantly and store
   it locally

What do you think Chanwoo?

Regards,
Lukasz

2021-02-11 22:30:29

by Lukasz Luba

Subject: Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct



On 2/11/21 11:07 AM, Lukasz Luba wrote:
> Hi Chanwoo,
>
> On 2/3/21 10:21 AM, Lukasz Luba wrote:
>> Hi Chanwoo,
>>
>> Thank you for looking at this.
>>
>> On 2/3/21 10:11 AM, Chanwoo Choi wrote:
>>> Hi Lukasz,
>>>
>>> When accessing the max_freq and min_freq at devfreq-cooling.c,
>>> even if can access 'user_max_freq' and 'lock' by using the 'devfreq'
>>> instance,
>>> I think that the direct access of variables
>>> (lock/user_max_freq/user_min_freq)
>>> of struct devfreq are not good.
>>>
>>> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
>>> notification with following changes of 'struct devfreq_freq'?
>>
>> I like the idea with devfreq notification. I will have to go through the
>> code to check that possibility.
>>
>>> Also, need to add codes into devfreq_set_target() for initializing
>>> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
>>> notification.
>>>
>>> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
>>> index 147a229056d2..d5726592d362 100644
>>> --- a/include/linux/devfreq.h
>>> +++ b/include/linux/devfreq.h
>>> @@ -207,6 +207,8 @@ struct devfreq {
>>>   struct devfreq_freqs {
>>>          unsigned long old;
>>>          unsigned long new;
>>> +       unsigned long new_max_freq;
>>> +       unsigned long new_min_freq;
>>>   };
>>>
>>>
>>> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
>>> You can get the current max_freq/min_freq by using the following steps:
>>>
>>>     get_freq_range(devfreq, &min_freq, &max_freq);
>>>     dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>     dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>
>>> So that you can get the 'max_freq/min_freq' and then
>>> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
>>> with them as following:
>>>
>>> in devfreq_set_target()
>>>     get_freq_range(devfreq, &min_freq, &max_freq);
>>>     dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>     dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>     freqs.new_min_freq = min_freq;
>>>     freqs.new_max_freq = max_freq;
>>>     devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
>>
>> I will plumb it in and check that option. My concern is that function
>> get_freq_range() would give me the max_freq value from PM QoS, which
>> might be my thermal limit - lower that user_max_freq. Then I still
>> need
>>
>> I've been playing with PM QoS notifications because I thought it would
>> be possible to be notified in thermal for all new set values - even from
>> devfreq sysfs user max_freq write, which has value higher that the
>> current limit set by thermal governor. Unfortunately PM QoS doesn't
>> send that information by design. PM QoS also by desing won't allow
>> me to check first two limits in the plist - which would be thermal
>> and user sysfs max_freq.
>>
>> I will experiment with this notifications and share the results.
>> That you for your comments.
>
> I have experimented with your proposal. Unfortunately, the value stored
> in the pm_qos which is read by get_freq_range() is not the user max
> freq. It's the value from thermal devfreq cooling when that one is
> lower. Which is OK in the overall design, but not for my IPA use case.
>
> What comes to my mind is two options:
> 1) this patch proposal, with simple solution of two new variables
> protected by mutex, which would maintain user stored values
> 2) add a new notification chain in devfreq to notify about new
> user written value, to which devfreq cooling would register; that
> would allow devfreq cooling to get that value instantly and store
> locally

3) How about a new define for the existing notification chain:
#define DEVFREQ_USER_CHANGE (2)

Then a modified devfreq_notify_transition() would get:
@@ -339,6 +339,10 @@ static int devfreq_notify_transition(struct devfreq *devfreq,
 		srcu_notifier_call_chain(&devfreq->transition_notifier_list,
 				DEVFREQ_POSTCHANGE, freqs);
 		break;
+	case DEVFREQ_USER_CHANGE:
+		srcu_notifier_call_chain(&devfreq->transition_notifier_list,
+				DEVFREQ_USER_CHANGE, freqs);
+		break;
 	default:
 		return -EINVAL;
 	}

If that is present, I can plumb your suggestion with:
struct devfreq_freqs {
+ unsigned long new_max_freq;
+ unsigned long new_min_freq;

and populate them with values in the max_freq_store() by adding at the
end:

freqs.new_max_freq = max_freq;
mutex_lock(&devfreq->lock);
devfreq_notify_transition(devfreq, &freqs, DEVFREQ_USER_CHANGE);
mutex_unlock(&devfreq->lock);

I would handle this notification in devfreq cooling and keep the
value there, for future IPA checks.
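
The flow proposed in option 3) can be sketched in plain userspace C. This
is only a model under stated assumptions: the chain, the listener and all
*_model names are hypothetical, and DEVFREQ_USER_CHANGE is the event
proposed in this email, not an existing kernel define. It shows a cooling
listener caching the user-written max frequency when it sees that event
and ignoring the other events.

```c
#include <assert.h>
#include <stddef.h>

/* Event identifiers for this model; USER_CHANGE is the proposed addition. */
#define DEVFREQ_PRECHANGE	0
#define DEVFREQ_POSTCHANGE	1
#define DEVFREQ_USER_CHANGE	2

#define MAX_NOTIFIERS		4

/* Model of the extended notification payload discussed in the thread. */
struct devfreq_freqs_model {
	unsigned long old;
	unsigned long new;
	unsigned long new_max_freq;
	unsigned long new_min_freq;
};

typedef int (*notifier_fn)(unsigned long event, void *data, void *priv);

struct notifier_model {
	notifier_fn fn;
	void *priv;
};

/* Hypothetical, fixed-size stand-in for a notifier chain. */
struct chain_model {
	struct notifier_model nb[MAX_NOTIFIERS];
	size_t count;
};

/* Hypothetical cooling-side state: the last user limit it was told about. */
struct cooling_model {
	unsigned long user_max_freq;
};

int chain_register(struct chain_model *chain, notifier_fn fn, void *priv)
{
	if (chain->count >= MAX_NOTIFIERS)
		return -1;

	chain->nb[chain->count].fn = fn;
	chain->nb[chain->count].priv = priv;
	chain->count++;
	return 0;
}

/* Deliver one event to every listener; unknown events are rejected. */
int chain_notify(struct chain_model *chain, unsigned long event,
		 struct devfreq_freqs_model *freqs)
{
	size_t i;

	switch (event) {
	case DEVFREQ_PRECHANGE:
	case DEVFREQ_POSTCHANGE:
	case DEVFREQ_USER_CHANGE:
		break;
	default:
		return -1;
	}

	for (i = 0; i < chain->count; i++)
		chain->nb[i].fn(event, freqs, chain->nb[i].priv);

	return 0;
}

/* Cooling listener: remember the user limit, ignore everything else. */
int cooling_notifier(unsigned long event, void *data, void *priv)
{
	struct devfreq_freqs_model *freqs = data;
	struct cooling_model *cdev = priv;

	if (event == DEVFREQ_USER_CHANGE)
		cdev->user_max_freq = freqs->new_max_freq;

	return 0;
}
```

In this model the sysfs store path would fill new_max_freq and call
chain_notify() with DEVFREQ_USER_CHANGE; the cooling side can then read
its cached value at every polling interval without touching PM QoS.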

If you agree, I can send the next version of the patch set.

>
> What do you think Chanwoo?
>
> Regards,
> Lukasz

2021-02-15 15:17:06

by Chanwoo Choi

Subject: Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct

Hi Lukasz,

On Fri, Feb 12, 2021 at 7:28 AM Lukasz Luba <[email protected]> wrote:
>
>
>
> On 2/11/21 11:07 AM, Lukasz Luba wrote:
> > Hi Chanwoo,
> >
> > On 2/3/21 10:21 AM, Lukasz Luba wrote:
> >> Hi Chanwoo,
> >>
> >> Thank you for looking at this.
> >>
> >> On 2/3/21 10:11 AM, Chanwoo Choi wrote:
> >>> Hi Lukasz,
> >>>
> >>> When accessing the max_freq and min_freq at devfreq-cooling.c,
> >>> even if can access 'user_max_freq' and 'lock' by using the 'devfreq'
> >>> instance,
> >>> I think that the direct access of variables
> >>> (lock/user_max_freq/user_min_freq)
> >>> of struct devfreq are not good.
> >>>
> >>> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
> >>> notification with following changes of 'struct devfreq_freq'?
> >>
> >> I like the idea with devfreq notification. I will have to go through the
> >> code to check that possibility.
> >>
> >>> Also, need to add codes into devfreq_set_target() for initializing
> >>> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
> >>> notification.
> >>>
> >>> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
> >>> index 147a229056d2..d5726592d362 100644
> >>> --- a/include/linux/devfreq.h
> >>> +++ b/include/linux/devfreq.h
> >>> @@ -207,6 +207,8 @@ struct devfreq {
> >>> struct devfreq_freqs {
> >>> unsigned long old;
> >>> unsigned long new;
> >>> + unsigned long new_max_freq;
> >>> + unsigned long new_min_freq;
> >>> };
> >>>
> >>>
> >>> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
> >>> You can get the current max_freq/min_freq by using the following steps:
> >>>
> >>> get_freq_range(devfreq, &min_freq, &max_freq);
> >>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
> >>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
> >>>
> >>> So that you can get the 'max_freq/min_freq' and then
> >>> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
> >>> with them as following:
> >>>
> >>> in devfreq_set_target()
> >>> get_freq_range(devfreq, &min_freq, &max_freq);
> >>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
> >>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
> >>> freqs.new_min_freq = min_freq;
> >>> freqs.new_max_freq = max_freq;
> >>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
> >>
> >> I will plumb it in and check that option. My concern is that function
> >> get_freq_range() would give me the max_freq value from PM QoS, which
> >> might be my thermal limit - lower that user_max_freq. Then I still
> >> need
> >>
> >> I've been playing with PM QoS notifications because I thought it would
> >> be possible to be notified in thermal for all new set values - even from
> >> devfreq sysfs user max_freq write, which has value higher that the
> >> current limit set by thermal governor. Unfortunately PM QoS doesn't
> >> send that information by design. PM QoS also by desing won't allow
> >> me to check first two limits in the plist - which would be thermal
> >> and user sysfs max_freq.
> >>
> >> I will experiment with this notifications and share the results.
> >> That you for your comments.
> >
> > I have experimented with your proposal. Unfortunately, the value stored
> > in the pm_qos which is read by get_freq_range() is not the user max
> > freq. It's the value from thermal devfreq cooling when that one is
> > lower. Which is OK in the overall design, but not for my IPA use case.
> >
> > What comes to my mind is two options:
> > 1) this patch proposal, with simple solution of two new variables
> > protected by mutex, which would maintain user stored values
> > 2) add a new notification chain in devfreq to notify about new
> > user written value, to which devfreq cooling would register; that
> > would allow devfreq cooling to get that value instantly and store
> > locally
>
> 3) How about new define for existing notification chain:
> #define DEVFREQ_USER_CHANGE (2)

I don't think it is proper to add a notification tied to a specific
actor, such as a user change, an OPP change or others. But we can add
notifications for the moment when the min or max frequency changes, because
devfreq already has such notifications for the current frequency:
DEVFREQ_PRECHANGE and DEVFREQ_POSTCHANGE.

Maybe we can add the following notifications for min/max_freq.
The min_freq and max_freq values would be calculated by
get_freq_range().
DEVFREQ_MIN_FREQ_PRECHANGE
DEVFREQ_MIN_FREQ_POSTCHANGE
DEVFREQ_MAX_FREQ_PRECHANGE
DEVFREQ_MAX_FREQ_POSTCHANGE


>
> Then a modified devfreq_notify_transition() would get:
> @@ -339,6 +339,10 @@ static int devfreq_notify_transition(struct devfreq
> *devfreq,
>
> srcu_notifier_call_chain(&devfreq->transition_notifier_list,
> DEVFREQ_POSTCHANGE, freqs);
> break;
> + case DEVFREQ_USER_CHANGE:
> + srcu_notifier_call_chain(&devfreq->transition_notifier_list,
> + DEVFREQ_USER_CHANGE, freqs);
> + break;
> default:
> return -EINVAL;
> }
>
> If that is present, I can plumb your suggestion with:
> struct devfreq_freq {
> + unsigned long new_max_freq;
> + unsigned long new_min_freq;
>
> and populate them with values in the max_freq_store() by adding at the
> end:
>
> freqs.new_max_freq = max_freq;
> mutex_lock();
> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_USER_CHANGE);
> mutex_unlock();
>
> I would handle this notification in devfreq cooling and keep the
> value there, for future IPA checks.
>
> If you agree, I can send next version of the patch set.
>
> >
> > What do you think Chanwoo?

I understood your suggestion as exposing the user input for min/max_freq.
But these values are not meant for the public user. Actually, the devfreq core
handles these values only internally, without any explicit access from outside.

I'm not sure whether it is right to expose an internal value of the
devfreq struct.
So far, I think that it is not proper to show the internal value outside.

The devfreq subsystem only provides the min_freq and max_freq, which
reflect all the requirements of user input/cooling policy/OPP,
rather than user_min_freq and user_max_freq.

If we provide user_min_freq and user_max_freq via a DEVFREQ notification,
we have to add new sysfs attributes for user_min_freq and user_max_freq
to show the values to the user. But that does not seem nice.

Actually, I have no other idea how to support your feature.
Let's try to find a more proper method.

--
Best Regards,
Chanwoo Choi

2021-02-16 10:43:46

by Lukasz Luba

Subject: Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct

Hi Chanwoo,

On 2/15/21 3:00 PM, Chanwoo Choi wrote:
> Hi Lukasz,
>
> On Fri, Feb 12, 2021 at 7:28 AM Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 2/11/21 11:07 AM, Lukasz Luba wrote:
>>> Hi Chanwoo,
>>>
>>> On 2/3/21 10:21 AM, Lukasz Luba wrote:
>>>> Hi Chanwoo,
>>>>
>>>> Thank you for looking at this.
>>>>
>>>> On 2/3/21 10:11 AM, Chanwoo Choi wrote:
>>>>> Hi Lukasz,
>>>>>
>>>>> When accessing the max_freq and min_freq at devfreq-cooling.c,
>>>>> even if can access 'user_max_freq' and 'lock' by using the 'devfreq'
>>>>> instance,
>>>>> I think that the direct access of variables
>>>>> (lock/user_max_freq/user_min_freq)
>>>>> of struct devfreq are not good.
>>>>>
>>>>> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
>>>>> notification with following changes of 'struct devfreq_freq'?
>>>>
>>>> I like the idea with devfreq notification. I will have to go through the
>>>> code to check that possibility.
>>>>
>>>>> Also, need to add codes into devfreq_set_target() for initializing
>>>>> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
>>>>> notification.
>>>>>
>>>>> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
>>>>> index 147a229056d2..d5726592d362 100644
>>>>> --- a/include/linux/devfreq.h
>>>>> +++ b/include/linux/devfreq.h
>>>>> @@ -207,6 +207,8 @@ struct devfreq {
>>>>> struct devfreq_freqs {
>>>>> unsigned long old;
>>>>> unsigned long new;
>>>>> + unsigned long new_max_freq;
>>>>> + unsigned long new_min_freq;
>>>>> };
>>>>>
>>>>>
>>>>> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
>>>>> You can get the current max_freq/min_freq by using the following steps:
>>>>>
>>>>> get_freq_range(devfreq, &min_freq, &max_freq);
>>>>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>>>
>>>>> So that you can get the 'max_freq/min_freq' and then
>>>>> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
>>>>> with them as following:
>>>>>
>>>>> in devfreq_set_target()
>>>>> get_freq_range(devfreq, &min_freq, &max_freq);
>>>>> dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>>> dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>>> freqs.new_min_freq = min_freq;
>>>>> freqs.new_max_freq = max_freq;
>>>>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
>>>>
>>>> I will plumb it in and check that option. My concern is that function
>>>> get_freq_range() would give me the max_freq value from PM QoS, which
>>>> might be my thermal limit - lower that user_max_freq. Then I still
>>>> need
>>>>
>>>> I've been playing with PM QoS notifications because I thought it would
>>>> be possible to be notified in thermal for all new set values - even from
>>>> devfreq sysfs user max_freq write, which has value higher that the
>>>> current limit set by thermal governor. Unfortunately PM QoS doesn't
>>>> send that information by design. PM QoS also by desing won't allow
>>>> me to check first two limits in the plist - which would be thermal
>>>> and user sysfs max_freq.
>>>>
>>>> I will experiment with this notifications and share the results.
>>>> That you for your comments.
>>>
>>> I have experimented with your proposal. Unfortunately, the value stored
>>> in the pm_qos which is read by get_freq_range() is not the user max
>>> freq. It's the value from thermal devfreq cooling when that one is
>>> lower. Which is OK in the overall design, but not for my IPA use case.
>>>
>>> What comes to my mind is two options:
>>> 1) this patch proposal, with simple solution of two new variables
>>> protected by mutex, which would maintain user stored values
>>> 2) add a new notification chain in devfreq to notify about new
>>> user written value, to which devfreq cooling would register; that
>>> would allow devfreq cooling to get that value instantly and store
>>> locally
>>
>> 3) How about new define for existing notification chain:
>> #define DEVFREQ_USER_CHANGE (2)
>
> I think that if we add the notification with specific actor like user change
> or OPP change or others, it is not proper. But, we can add the notification
> for min or max frequency change timing. Because the devfreq already has
> the notification for current frequency like DEVFREQ_PRECHANGE,
> DEVFREQ_POSTCHANGE.
>
> Maybe, we can add the following notification for min/max_freq.
> The following min_freq and max_freq values will be calculated by
> get_freq_range().
> DEVFREQ_MIN_FREQ_PRECHANGE
> DEVFREQ_MIN_FREQ_POSTCHANGE
> DEVFREQ_MAX_FREQ_PRECHANGE
> DEVFREQ_MAX_FREQ_POSTCHANGE

Would it then be possible to pass the user max freq value written via
sysfs? Something like in the example below, when writing into max sysfs:

1) starting in max_freq_store()
freqs.new_max_freq = max_freq;
devfreq_notify_transition(devfreq, &freqs,
DEVFREQ_MAX_FREQ_PRECHANGE);
dev_pm_qos_update_request()

2) then after a while in devfreq_set_target()
get_freq_range(devfreq, &min_freq, &max_freq);
dev_pm_opp_find_freq_floor(pdev, &min_freq);
dev_pm_opp_find_freq_floor(pdev, &max_freq);
freqs.new_min_freq = min_freq;
freqs.new_max_freq = max_freq;
devfreq_notify_transition(devfreq, &freqs, DEVFREQ_MAX_FREQ_POSTCHANGE);

This 2nd part is called after PM QoS has changed that limit,
so it might be missing (in case the value was higher than the current one),
but thermal would know about that, so no worries.
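
Step 2) above rounds the limits down to real OPP frequencies before they
are sent out. As a rough userspace illustration of that rounding (not the
OPP library itself, whose dev_pm_opp_find_freq_floor() has a different
signature and return type), the sketch below picks the highest table entry
that does not exceed the requested value. The function name and the plain
array used as an OPP table are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round *freq down to the highest frequency in the table that is not
 * above it. Returns 0 on success, or -1 (leaving *freq untouched) when
 * every table entry is higher than the requested value.
 */
int find_freq_floor(const unsigned long *table, size_t n, unsigned long *freq)
{
	unsigned long best = 0;
	int found = 0;
	size_t i;

	assert(table != NULL && freq != NULL);

	for (i = 0; i < n; i++) {
		if (table[i] <= *freq && (!found || table[i] > best)) {
			best = table[i];
			found = 1;
		}
	}

	if (!found)
		return -1;

	*freq = best;
	return 0;
}
```

For a table of 400, 600, 800 and 1000 MHz, a limit of 700 MHz is rounded
to 600 MHz, which is the value that can then be compared directly with an
energy model entry.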

>
>
>>
>> Then a modified devfreq_notify_transition() would get:
>> @@ -339,6 +339,10 @@ static int devfreq_notify_transition(struct devfreq
>> *devfreq,
>>
>> srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>> DEVFREQ_POSTCHANGE, freqs);
>> break;
>> + case DEVFREQ_USER_CHANGE:
>> + srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>> + DEVFREQ_USER_CHANGE, freqs);
>> + break;
>> default:
>> return -EINVAL;
>> }
>>
>> If that is present, I can plumb your suggestion with:
>> struct devfreq_freq {
>> + unsigned long new_max_freq;
>> + unsigned long new_min_freq;
>>
>> and populate them with values in the max_freq_store() by adding at the
>> end:
>>
>> freqs.new_max_freq = max_freq;
>> mutex_lock();
>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_USER_CHANGE);
>> mutex_unlock();
>>
>> I would handle this notification in devfreq cooling and keep the
>> value there, for future IPA checks.
>>
>> If you agree, I can send next version of the patch set.
>>
>>>
>>> What do you think Chanwoo?
>
> I thought that your suggestion to expose the user input for min/max_freq.
> But, these values are not valid for the public user. Actually, the devfreq core
> handles these values only internally without any explicit access from outside.
>
> I'm not sure that it is right or not to expose the internal value of
> devfreq struct.
> Until now, I think that it is not proper to show the interval value outside.
>
> Because the devfreq subsystem only provides the min_freq and max_freq
> which reflect the all requirement of user input/cooling policy/OPP
> instead of user_min_freq, user_max_freq.
>
> If we provide the user_min_freq, user_max_freq via DEVFREQ notification,
> we have to make the new sysfs attributes for user_min_freq and user_max_freq
> to show the value to the user. But, it seems that it is not nice.

I would say we don't have to expose it. Let's take a closer look at
an example. The main problem is with GPUs. The middleware is aware of
the OPPs in the GPU. If the middleware wants to switch into a different
power-performance mode, e.g. power-saving, it writes the max allowed
freq into this sysfs. IPA does not know about it and makes wrong
decisions. As you said, the sysfs read operation combines all of:
user input/cooling policy/OPP, but that's not a problem for this aware
middleware. So it can stay as is.
The only addition would be this 'notification about a user attempt at
reducing the max device speed' internally inside the kernel, for those
subsystems which are interested in it.

>
> Actually, I have no other idea how to support your feature.
> We try to find the more proper method.
>

Thank you for coming back with your comments. I know it's not
an easy feature but I hope we can find a solution.

Regards,
Lukasz

2021-02-22 10:26:47

by Daniel Lezcano

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power


Hi Lukasz,

sorry for the delay, it took more time to finish my current work before
commenting on these patches.

On 26/01/2021 11:39, Lukasz Luba wrote:
> Hi all,
>
> This patch set tries to add the missing feature in the Intelligent Power
> Allocation (IPA) governor which is: frequency limit set by user space.

It is unclear whether we are talking about the frequency limit of the dvfs
device set through the hardware limit (min-max freq). If that is the case,
then it is an energy model change, and all users of the energy model must be
notified about this change. But I don't see why userspace wants to
change that.

If we just want to set a frequency limit, then that is what we are doing
with the DTPM framework via power numbers.

> User can set max allowed frequency for a given device which has impact on
> max allowed power. In current design there is no mechanism to figure this
> out. IPA must know the maximum allowed power for every device. It is then
> used for proper power split and divvy-up. When the user limit for max
> frequency is not know, IPA assumes it is the highest possible frequency.
> It causes wrong power split across the devices.

That is because the IPA introduced the power rebalancing between devices
belonging to the same thermal zone, so the feature was very use case specific.

The DTPM is supposed to solve this by providing a unified uW unit to
act on the different power capable devices in a generic way.

Today DTPM is mapped to userspace using the powercap framework, but it
is being considered to add a kernel API to let other subsystems act on it
directly. Maybe you can add those and call them from IPA directly, so
the governor makes the power decision and asks the DTPM to do the changes.

Conceptually, that would be much more consistent and would remove
duplicated code IMO.

Maybe create a power_qos framework to unify the units ...

> This new mechanism provides the max allowed frequency to the thermal
> framework and then max allowed power to the IPA.
> The implementation is done in this way because currently there is no way
> to retrieve the limits from the PM QoS, without uncapping the local
> thermal limit and reading the next value. It would be a heavy way of
> doing these things, since it should be done every polling time (e.g. 50ms).
>
> Also, the value stored in PM QoS can be different than the real OPP 'rate'
> so still would need conversion into proper OPP for comparison with EM.
> Furthermore, uncapping the device in thermal just to check the user freq
> limit is not the safest way.
> Thus, this simple implementation moves the calculation of the proper
> frequency to the sysfs write code, since it's called less often. The value
> is then used as-is in the thermal framework without any hassle.

Sounds like the DTPM is doing exactly that, no?

> As it's a RFC, it still misses the cpufreq sysfs implementation, but would
> be addressed if all agree.
We are talking about power, frequency, an in-kernel governor, and userspace
having knowledge of the max frequency limit to set power.

I'm a bit lost. What is the problem we want to solve here?

Maybe I'm missing something. Is it possible to share a scenario where
userspace acts on the devfreq and the IPA takes a decision, to
illustrate your proposal?


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2021-02-22 12:13:48

by Lukasz Luba

Subject: Re: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power

Hi Daniel,

On 2/22/21 10:22 AM, Daniel Lezcano wrote:
>
> Hi Lukasz,
>
> sorry for the delay, it took more time to finish my current work before
> commenting these patches.

No worries, thank you for looking at this.

>
> On 26/01/2021 11:39, Lukasz Luba wrote:
>> Hi all,
>>
>> This patch set tries to add the missing feature in the Intelligent Power
>> Allocation (IPA) governor which is: frequency limit set by user space.
>
> It is unclear if we are talking about frequency limit of the dvfs device
> by setting the hardware limit (min-max freq). If it is the case, then
> that is an energy model change, and all user of the energy model must be
> notified about this change. But I don't see why userspace wants to
> change that.
>
> If we just want to set a frequency limit, then that is what we are doing
> with the DTPM framework via power numbers.
>
>> User can set max allowed frequency for a given device which has impact on
>> max allowed power. In current design there is no mechanism to figure this
>> out. IPA must know the maximum allowed power for every device. It is then
>> used for proper power split and divvy-up. When the user limit for max
>> frequency is not know, IPA assumes it is the highest possible frequency.
>> It causes wrong power split across the devices.
>
> That is because the IPA introduced the power rebalancing between devices
> belonging the same thermal zone, so the feature was very use case specific.
>
> The DTPM is supposed to solve this by providing an unified uW unit to
> act on the different power capable devices on a generic way.
>
> Today DTPM is mapped to userspace using the powercap framework, but it
> is considered to add kernel API to let other subsystem to act on it
> directly. May be, you can add those and call them from IPA directly, so
> the governor does power decision and ask the DTPM to do the changes.
>
> Conceptually, that would be much more consistent and will remove
> duplicated code IMO.
>
> May be create a power_qos framework to unify the units ...
>
>> This new mechanism provides the max allowed frequency to the thermal
>> framework and then max allowed power to the IPA.
>> The implementation is done in this way because currently there is no way
>> to retrieve the limits from the PM QoS, without uncapping the local
>> thermal limit and reading the next value. It would be a heavy way of
>> doing these things, since it should be done every polling time (e.g. 50ms).
>>
>> Also, the value stored in PM QoS can be different than the real OPP 'rate'
>> so still would need conversion into proper OPP for comparison with EM.
>> Furthermore, uncapping the device in thermal just to check the user freq
>> limit is not the safest way.
>> Thus, this simple implementation moves the calculation of the proper
>> frequency to the sysfs write code, since it's called less often. The value
>> is then used as-is in the thermal framework without any hassle.
>
> Sounds like the DTPM is doing exactly that, no ?
>
>> As it's a RFC, it still misses the cpufreq sysfs implementation, but would
>> be addressed if all agree.
> We are talking about power, frequency, in-kernel governor, userspace
> having knowledge of max frequency limit to set power.
>
> I'm a bit lost. What is the problem we want to solve here ?
>
> May be I'm missing something. Is it possible to share a scenario where
> the userspace acts on the devfreq and the IPA taking decision to
> illustrate your proposal ?
>
>

Sure, here is a description of the configuration and the scenario in
which the issue shows up.

Configuration:
SoC with 2 CPU clusters (little cluster consuming 1W and big cluster 3W
at max freq, plenty of OPPs available) and
1 GPU (consuming ~6W at max freq, only a few OPPs).

Scenario:
IPA is active because the temperature crossed the 1st threshold, and it
tries to control the system so that it 'converges' to the 2nd temperature
threshold. It checks each actor's max possible power [1], gets the
current power, calculates the current budget, splits that budget and
grants power across the actors, so that the max allowed frequency is set
via QoS.

The state2power() callback called at [1] with argument '0' returns
the power from the EM for the highest OPP. This is fine in most cases. That
power information is used in lines 359 and 364 during the split.

If user space (the aware middleware) wants to switch into a different
power-performance mode, e.g. power-saving, it writes into the device
sysfs to limit the max allowed freq. IPA does not know about it and
makes wrong decisions. It's an issue for GPUs (but also for CPUs), which
can consume a lot of power at higher frequencies. For example, to limit
this 6W to 3W, a simple freq write via sysfs is enough, but IPA is
completely unaware of that (as you can see in the code).
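
To put numbers on it, below is a minimal user-space sketch of a
proportional budget split with redistribution of the surplus. It is only
loosely modeled on what the governor does; split_budget() and all values
are hypothetical and this is not the actual kernel code.

```c
#include <stdint.h>

/* Integer division rounded to the closest value. */
static uint32_t div_round_closest(uint64_t n, uint64_t d)
{
	return (uint32_t)((n + d / 2) / d);
}

/*
 * Simplified sketch of a proportional power split (all values in mW).
 * req[]       - power currently requested by each actor
 * max_power[] - max power each actor is assumed to be able to use
 * budget      - total power that may be granted
 * granted[]   - output: power granted to each actor
 */
void split_budget(const uint32_t *req, const uint32_t *max_power, int n,
		  uint32_t budget, uint32_t *granted)
{
	uint64_t total_req = 0, headroom_sum = 0;
	uint32_t extra = 0;
	int i;

	for (i = 0; i < n; i++)
		total_req += req[i];
	if (!total_req)
		total_req = 1;

	/* First pass: share the budget in proportion to the requests. */
	for (i = 0; i < n; i++) {
		granted[i] = div_round_closest((uint64_t)req[i] * budget,
					       total_req);
		if (granted[i] > max_power[i]) {
			/* An actor cannot use more than its max power. */
			extra += granted[i] - max_power[i];
			granted[i] = max_power[i];
		}
		headroom_sum += max_power[i] - granted[i];
	}

	if (!extra || !headroom_sum)
		return;
	if (extra > headroom_sum)
		extra = (uint32_t)headroom_sum;

	/* Second pass: hand the surplus to actors with headroom left. */
	for (i = 0; i < n; i++) {
		uint64_t headroom = max_power[i] - granted[i];

		granted[i] += div_round_closest(headroom * extra,
						headroom_sum);
	}
}
```

With requests of 500/1500/3000 mW (little/big/GPU) and a 6000 mW budget,
passing the GPU max power as 6000 mW grants 600/1800/3600 mW, so 600 mW
go to a GPU that cannot use them under a 3W user cap. Passing 3000 mW as
the GPU max power instead grants 750/2250/3000 mW.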

The sysfs interface for GPU:
$ cat /sys/class/devfreq/<device>/available_frequencies
400000000 600000000 800000000 1000000000

$ echo 600000000 > /sys/class/devfreq/<device>/max_freq
$ cat /sys/class/devfreq/<device>/max_freq
600000000
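
A minimal sketch of the lookup that would be needed to turn such a user
limit into a max power, assuming an EM-like table sorted by ascending
frequency. The frequencies match the example above; the power values and
max_power_for_cap() are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct opp_entry {
	unsigned long freq_hz;
	uint32_t power_mw;	/* hypothetical values, not measured */
};

/* Hypothetical GPU table, sorted by ascending frequency. */
static const struct opp_entry gpu_opps[] = {
	{  400000000UL, 1500 },
	{  600000000UL, 3000 },
	{  800000000UL, 4400 },
	{ 1000000000UL, 6000 },
};

/*
 * Return the power of the highest OPP whose frequency does not exceed
 * the user limit. If the limit is below the lowest OPP, fall back to
 * the lowest OPP (an assumption of this sketch).
 */
uint32_t max_power_for_cap(const struct opp_entry *tbl, size_t n,
			   unsigned long user_max_hz)
{
	uint32_t power = tbl[0].power_mw;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].freq_hz > user_max_hz)
			break;
		power = tbl[i].power_mw;
	}

	return power;
}
```

With this table, a limit of 600000000 gives 3000 mW instead of the
6000 mW of the highest OPP.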

IMHO this is not an issue specific to IPA, because DTPM might suffer from
the same missing 'user write' information. It's simply a missing design
feature: passing that user information on to the other frameworks or
governors.

Regards,
Lukasz

[1]
https://elixir.bootlin.com/linux/latest/source/drivers/thermal/gov_power_allocator.c#L458

2021-02-24 08:22:29

by Chanwoo Choi

Subject: Re: [RFC][PATCH 1/3] PM /devfreq: add user frequency limits into devfreq struct

On 2/16/21 7:41 PM, Lukasz Luba wrote:
> Hi Chanwoo,
>
> On 2/15/21 3:00 PM, Chanwoo Choi wrote:
>> Hi Lukasz,
>>
>> On Fri, Feb 12, 2021 at 7:28 AM Lukasz Luba <[email protected]> wrote:
>>>
>>>
>>>
>>> On 2/11/21 11:07 AM, Lukasz Luba wrote:
>>>> Hi Chanwoo,
>>>>
>>>> On 2/3/21 10:21 AM, Lukasz Luba wrote:
>>>>> Hi Chanwoo,
>>>>>
>>>>> Thank you for looking at this.
>>>>>
>>>>> On 2/3/21 10:11 AM, Chanwoo Choi wrote:
>>>>>> Hi Lukasz,
>>>>>>
>>>>>> When accessing the max_freq and min_freq at devfreq-cooling.c,
>>>>>> even if can access 'user_max_freq' and 'lock' by using the 'devfreq'
>>>>>> instance,
>>>>>> I think that the direct access of variables
>>>>>> (lock/user_max_freq/user_min_freq)
>>>>>> of struct devfreq are not good.
>>>>>>
>>>>>> Instead, how about using the 'DEVFREQ_TRANSITION_NOTIFIER'
>>>>>> notification with following changes of 'struct devfreq_freqs'?
>>>>>
>>>>> I like the idea with devfreq notification. I will have to go through the
>>>>> code to check that possibility.
>>>>>
>>>>>> Also, need to add codes into devfreq_set_target() for initializing
>>>>>> 'new_max_freq' and 'new_min_freq' before sending the DEVFREQ_POSTCHANGE
>>>>>> notification.
>>>>>>
>>>>>> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
>>>>>> index 147a229056d2..d5726592d362 100644
>>>>>> --- a/include/linux/devfreq.h
>>>>>> +++ b/include/linux/devfreq.h
>>>>>> @@ -207,6 +207,8 @@ struct devfreq {
>>>>>>    struct devfreq_freqs {
>>>>>>           unsigned long old;
>>>>>>           unsigned long new;
>>>>>> +       unsigned long new_max_freq;
>>>>>> +       unsigned long new_min_freq;
>>>>>>    };
>>>>>>
>>>>>>
>>>>>> And I think that new 'user_min_freq'/'user_max_freq' are not necessary.
>>>>>> You can get the current max_freq/min_freq by using the following steps:
>>>>>>
>>>>>>      get_freq_range(devfreq, &min_freq, &max_freq);
>>>>>>      dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>>>>      dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>>>>
>>>>>> So that you can get the 'max_freq/min_freq' and then
>>>>>> initialize the 'freqs.new_max_freq and freqs.new_min_freq'
>>>>>> with them as following:
>>>>>>
>>>>>> in devfreq_set_target()
>>>>>>      get_freq_range(devfreq, &min_freq, &max_freq);
>>>>>>      dev_pm_opp_find_freq_floor(pdev, &min_freq);
>>>>>>      dev_pm_opp_find_freq_floor(pdev, &max_freq);
>>>>>>      freqs.new_min_freq = min_freq;
>>>>>>      freqs.new_max_freq = max_freq;
>>>>>>      devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
>>>>>
>>>>> I will plumb it in and check that option. My concern is that function
>>>>> get_freq_range() would give me the max_freq value from PM QoS, which
>>>>> might be my thermal limit - lower than user_max_freq. Then I still
>>>>> need
>>>>>
>>>>> I've been playing with PM QoS notifications because I thought it would
>>>>> be possible to be notified in thermal for all new set values - even from
>>>>> devfreq sysfs user max_freq write, which has a value higher than the
>>>>> current limit set by thermal governor. Unfortunately PM QoS doesn't
>>>>> send that information by design. PM QoS also by design won't allow
>>>>> me to check first two limits in the plist - which would be thermal
>>>>> and user sysfs max_freq.
>>>>>
>>>>> I will experiment with these notifications and share the results.
>>>>> Thank you for your comments.
>>>>
>>>> I have experimented with your proposal. Unfortunately, the value stored
>>>> in the pm_qos which is read by get_freq_range() is not the user max
>>>> freq. It's the value from thermal devfreq cooling when that one is
>>>> lower. Which is OK in the overall design, but not for my IPA use case.
>>>>
>>>> What comes to my mind is two options:
>>>> 1) this patch proposal, with simple solution of two new variables
>>>> protected by mutex, which would maintain user stored values
>>>> 2) add a new notification chain in devfreq to notify about new
>>>> user written value, to which devfreq cooling would register; that
>>>> would allow devfreq cooling to get that value instantly and store
>>>> locally
>>>
>>> 3) How about new define for existing notification chain:
>>> #define DEVFREQ_USER_CHANGE            (2)
>>
>> I think that if we add the notification with specific actor like user change
>> or OPP change or others, it is not proper. But, we can add the notification
>> for min or max frequency change timing. Because the devfreq already has
>> the notification for current frequency like DEVFREQ_PRECHANGE,
>> DEVFREQ_POSTCHANGE.
>>
>> Maybe, we can add the following notification for min/max_freq.
>> The following min_freq and max_freq values will be calculated by
>> get_freq_range().
>> DEVFREQ_MIN_FREQ_PRECHANGE
>> DEVFREQ_MIN_FREQ_POSTCHANGE
>> DEVFREQ_MAX_FREQ_PRECHANGE
>> DEVFREQ_MAX_FREQ_POSTCHANGE
>
> Would it be then possible to pass the user max freq value written via
> sysfs? Something like in the example below, when writing into max sysfs:
>
> 1) starting in max_freq_store()
>         freqs.new_max_freq = max_freq;
>         devfreq_notify_transition(devfreq, &freqs, DEVFREQ_MAX_FREQ_PRECHANGE);
>     dev_pm_qos_update_request()


When we use the PRECHANGE and POSTCHANGE notifications,
the values they carry should be consistent.

On PRECHANGE, notify the previous min/max frequency,
which reflects the user input/cooling policy/OPP.
On POSTCHANGE, notify the new min/max frequency,
which reflects the user input/cooling policy/OPP.

But in your suggestion, DEVFREQ_MAX_FREQ_PRECHANGE considers
only the user input, without the cooling policy/OPP.
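
A small user-space sketch of the difference, assuming the effective
max_freq is the lowest of the user input, the cooling policy and the
OPP limit. The struct and function names are hypothetical and only
model the idea; this is not devfreq code.

```c
struct limits {
	unsigned long user_max;		/* written via sysfs */
	unsigned long cooling_max;	/* set by the cooling policy */
	unsigned long opp_max;		/* highest available OPP */
};

struct max_change {
	unsigned long old_max;	/* value a PRECHANGE would carry */
	unsigned long new_max;	/* value a POSTCHANGE would carry */
	int changed;		/* non-zero if the effective max moved */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Effective limit: the most restrictive of all inputs. */
unsigned long effective_max(const struct limits *l)
{
	return min_ul(l->user_max, min_ul(l->cooling_max, l->opp_max));
}

/*
 * Apply a new user limit and report the effective max before and
 * after the write, so that both values use the same definition.
 */
struct max_change set_user_max(struct limits *l, unsigned long freq)
{
	struct max_change ev;

	ev.old_max = effective_max(l);
	l->user_max = freq;
	ev.new_max = effective_max(l);
	ev.changed = ev.old_max != ev.new_max;

	return ev;
}
```

With a cooling limit of 800000000, a user write of 900000000 leaves the
effective max unchanged, so a listener that only sees the effective
value learns nothing about the user input. A user write of 600000000
moves the effective max from 800000000 to 600000000.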

>
> 2)then after a while in devfreq_set_target()
>     get_freq_range(devfreq, &min_freq, &max_freq);
>     dev_pm_opp_find_freq_floor(pdev, &min_freq);
>     dev_pm_opp_find_freq_floor(pdev, &max_freq);
>     freqs.new_min_freq = min_freq;
>     freqs.new_max_freq = max_freq;
>     devfreq_notify_transition(devfreq, &freqs, DEVFREQ_MAX_FREQ_POSTCHANGE);
>
> This 2nd part is called after the PM QoS has changed that limit,
> so might be missing (in case value was higher than current),
> but thermal would know about that, so no worries.

This is not only about thermal. We need to consider
all potential users of the max_freq notification.

Inside the devfreq subsystem, e.g. in a devfreq governor,
we might use the user min/max_freq without any restrictions.
But in this case, devfreq would provide the user min/max_freq
to outside subsystems/drivers like devfreq-cooling.c of thermal.
IMHO, it is difficult to agree with this approach.

If devfreq provides various min/max_freq values outside
of devfreq, it causes confusion about the meaning
of min/max_freq. Actually, the other users don't need to
know the user input for min/max_freq.

>
>>
>>
>>>
>>> Then a modified devfreq_notify_transition() would get:
>>> @@ -339,6 +339,10 @@ static int devfreq_notify_transition(struct devfreq
>>> *devfreq,
>>>
>>> srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>>>                                   DEVFREQ_POSTCHANGE, freqs);
>>>                   break;
>>> +       case DEVFREQ_USER_CHANGE:
>>> +               srcu_notifier_call_chain(&devfreq->transition_notifier_list,
>>> +                               DEVFREQ_USER_CHANGE, freqs);
>>> +               break;
>>>           default:
>>>                   return -EINVAL;
>>>           }
>>>
>>> If that is present, I can plumb your suggestion with:
>>> struct devfreq_freqs {
>>> +       unsigned long new_max_freq;
>>> +       unsigned long new_min_freq;
>>>
>>> and populate them with values in the max_freq_store() by adding at the
>>> end:
>>>
>>> freqs.new_max_freq = max_freq;
>>> mutex_lock();
>>> devfreq_notify_transition(devfreq, &freqs, DEVFREQ_USER_CHANGE);
>>> mutex_unlock();
>>>
>>> I would handle this notification in devfreq cooling and keep the
>>> value there, for future IPA checks.
>>>
>>> If you agree, I can send next version of the patch set.
>>>
>>>>
>>>> What do you think Chanwoo?
>>
>> I understood your suggestion as exposing the user input for min/max_freq.
>> But, these values are not valid for the public user. Actually, the devfreq core
>> handles these values only internally without any explicit access from outside.
>>
>> I'm not sure that it is right or not to expose the internal value of
>> devfreq struct.
>> Until now, I think that it is not proper to show the internal value outside.
>>
>> Because the devfreq subsystem only provides the min_freq and max_freq
>> which reflect the all requirement of user input/cooling policy/OPP
>> instead of user_min_freq, user_max_freq.
>>
>> If we provide the user_min_freq, user_max_freq via DEVFREQ notification,
>> we have to make the new sysfs attributes for user_min_freq and user_max_freq
>> to show the value to the user. But, it seems that it is not nice.
>
> I would say we don't have to expose it. Let's take a closer look into
> an example. The main problem is with GPUs. The middleware is aware of
> the OPPs in the GPU. If the middleware wants to switch into different
> power-performance mode e.g. power-saving, it writes into this sysfs
> the max allowed freq. IPA does not know about it and makes wrong
> decisions. As you said, the sysfs read operation combines all:
> user input/cooling policy/OPP, but that's not a problem for this aware
> middleware. So it can stay as is.
> The only addition would be this 'notification about user attempt of
> reducing max device speed' internally inside the kernel, for those
> subsystems which are interested in it.
As I commented above, I'm not sure about providing multiple
min/max_freq values outside of the devfreq subsystem. Instead, it is OK
to use the user min/max_freq inside the devfreq subsystem.

Unfortunately, I haven't suggested a good solution.
These are very important changes, so I want to consider
all users of devfreq.

>
>>
>> Actually, I have no other idea how to support your feature.
>> We try to find the more proper method.
>>
>
> Thank you for coming back with your comments. I know it's not
> an easy feature but I hope we can find a solution.
>
> Regards,
> Lukasz
>
>


--
Best Regards,
Chanwoo Choi
Samsung Electronics