The new runtime table would be populated with a new power data to better
reflect the actual power. The power can vary over time e.g. due to the
SoC temperature change. Higher temperature can increase power values.
For longer running scenarios, such as game or camera, when also other
devices are used (e.g. GPU, ISP) the CPU power can change. The new
EM framework is able to addresses this issue and change the data
at runtime safely.
The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
for the task placement. All the other users (thermal, etc.) are still
using the default (basic) EM. This fact drove the design of this feature.
Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 4 +++-
kernel/power/energy_model.c | 12 +++++++++++-
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 546dee90f716..740e7c25cfff 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -39,7 +39,7 @@ struct em_perf_state {
/**
* struct em_perf_table - Performance states table
* @state: List of performance states, in ascending order
- * @rcu: RCU used for safe access and destruction
+ * @rcu: RCU used only for runtime modifiable table
*/
struct em_perf_table {
struct em_perf_state *state;
@@ -49,6 +49,7 @@ struct em_perf_table {
/**
* struct em_perf_domain - Performance domain
* @default_table: Pointer to the default em_perf_table
+ * @runtime_table: Pointer to the runtime modifiable em_perf_table
* @nr_perf_states: Number of performance states
* @flags: See "em_perf_domain flags"
* @cpus: Cpumask covering the CPUs of the domain. It's here
@@ -64,6 +65,7 @@ struct em_perf_table {
*/
struct em_perf_domain {
struct em_perf_table *default_table;
+ struct em_perf_table __rcu *runtime_table;
int nr_perf_states;
unsigned long flags;
unsigned long cpus[];
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 797141638b29..5b40db38b745 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
return ret;
}
+ /* Initialize runtime table as default table. */
+ rcu_assign_pointer(pd->runtime_table, default_table);
+
if (_is_cpu_device(dev))
for_each_cpu(cpu, cpus) {
cpu_dev = get_cpu_device(cpu);
@@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
*/
void em_dev_unregister_perf_domain(struct device *dev)
{
+ struct em_perf_table __rcu *runtime_table;
struct em_perf_domain *pd;
if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
@@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
return;
pd = dev->em_pd;
-
/*
* The mutex separates all register/unregister requests and protects
* from potential clean-up/setup issues in the debugfs directories.
* The debugfs directory name is the same as device's name.
*/
mutex_lock(&em_pd_mutex);
+
em_debug_remove_pd(dev);
+ runtime_table = pd->runtime_table;
+
+ rcu_assign_pointer(pd->runtime_table, NULL);
+ synchronize_rcu();
+
kfree(pd->default_table->state);
kfree(pd->default_table);
kfree(dev->em_pd);
+
dev->em_pd = NULL;
mutex_unlock(&em_pd_mutex);
}
--
2.25.1
On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <[email protected]> wrote:
>
> The new runtime table would be populated with a new power data to better
> reflect the actual power. The power can vary over time e.g. due to the
> SoC temperature change. Higher temperature can increase power values.
> For longer running scenarios, such as game or camera, when also other
> devices are used (e.g. GPU, ISP) the CPU power can change. The new
> EM framework is able to addresses this issue and change the data
> at runtime safely.
>
> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
> for the task placement. All the other users (thermal, etc.) are still
> using the default (basic) EM. This fact drove the design of this feature.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/linux/energy_model.h | 4 +++-
> kernel/power/energy_model.c | 12 +++++++++++-
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 546dee90f716..740e7c25cfff 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -39,7 +39,7 @@ struct em_perf_state {
> /**
> * struct em_perf_table - Performance states table
> * @state: List of performance states, in ascending order
> - * @rcu: RCU used for safe access and destruction
> + * @rcu: RCU used only for runtime modifiable table
This still doesn't appear to be used anywhere, so why change it here?
> */
> struct em_perf_table {
> struct em_perf_state *state;
> @@ -49,6 +49,7 @@ struct em_perf_table {
> /**
> * struct em_perf_domain - Performance domain
> * @default_table: Pointer to the default em_perf_table
> + * @runtime_table: Pointer to the runtime modifiable em_perf_table
"Pointer to em_perf_table that can be dynamically updated"
> * @nr_perf_states: Number of performance states
> * @flags: See "em_perf_domain flags"
> * @cpus: Cpumask covering the CPUs of the domain. It's here
> @@ -64,6 +65,7 @@ struct em_perf_table {
> */
> struct em_perf_domain {
> struct em_perf_table *default_table;
> + struct em_perf_table __rcu *runtime_table;
> int nr_perf_states;
> unsigned long flags;
> unsigned long cpus[];
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 797141638b29..5b40db38b745 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
> return ret;
> }
>
> + /* Initialize runtime table as default table. */
Redundant comment.
> + rcu_assign_pointer(pd->runtime_table, default_table);
> +
> if (_is_cpu_device(dev))
> for_each_cpu(cpu, cpus) {
> cpu_dev = get_cpu_device(cpu);
> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
> */
> void em_dev_unregister_perf_domain(struct device *dev)
> {
> + struct em_perf_table __rcu *runtime_table;
> struct em_perf_domain *pd;
>
> if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
> return;
>
> pd = dev->em_pd;
> -
Unrelated change.
> /*
> * The mutex separates all register/unregister requests and protects
> * from potential clean-up/setup issues in the debugfs directories.
> * The debugfs directory name is the same as device's name.
> */
> mutex_lock(&em_pd_mutex);
> +
Same here.
> em_debug_remove_pd(dev);
>
> + runtime_table = pd->runtime_table;
> +
> + rcu_assign_pointer(pd->runtime_table, NULL);
> + synchronize_rcu();
Is it really a good idea to call this under a mutex?
> +
> kfree(pd->default_table->state);
> kfree(pd->default_table);
> kfree(dev->em_pd);
> +
Unrelated change.
> dev->em_pd = NULL;
> mutex_unlock(&em_pd_mutex);
> }
> --
So this really adds a pointer to a table that can be dynamically
updated to struct em_perf_domain without any users so far. It is not
used anywhere as of this patch AFAICS, which is not what the changelog
is saying.
On 9/26/23 20:12, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <[email protected]> wrote:
>>
>> The new runtime table would be populated with a new power data to better
>> reflect the actual power. The power can vary over time e.g. due to the
>> SoC temperature change. Higher temperature can increase power values.
>> For longer running scenarios, such as game or camera, when also other
>> devices are used (e.g. GPU, ISP) the CPU power can change. The new
>> EM framework is able to addresses this issue and change the data
>> at runtime safely.
>>
>> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
>> for the task placement. All the other users (thermal, etc.) are still
>> using the default (basic) EM. This fact drove the design of this feature.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> include/linux/energy_model.h | 4 +++-
>> kernel/power/energy_model.c | 12 +++++++++++-
>> 2 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index 546dee90f716..740e7c25cfff 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -39,7 +39,7 @@ struct em_perf_state {
>> /**
>> * struct em_perf_table - Performance states table
>> * @state: List of performance states, in ascending order
>> - * @rcu: RCU used for safe access and destruction
>> + * @rcu: RCU used only for runtime modifiable table
>
> This still doesn't appear to be used anywhere, so why change it here?
I will try to move this later in the series.
>
>> */
>> struct em_perf_table {
>> struct em_perf_state *state;
>> @@ -49,6 +49,7 @@ struct em_perf_table {
>> /**
>> * struct em_perf_domain - Performance domain
>> * @default_table: Pointer to the default em_perf_table
>> + * @runtime_table: Pointer to the runtime modifiable em_perf_table
>
> "Pointer to em_perf_table that can be dynamically updated"
OK
>
>> * @nr_perf_states: Number of performance states
>> * @flags: See "em_perf_domain flags"
>> * @cpus: Cpumask covering the CPUs of the domain. It's here
>> @@ -64,6 +65,7 @@ struct em_perf_table {
>> */
>> struct em_perf_domain {
>> struct em_perf_table *default_table;
>> + struct em_perf_table __rcu *runtime_table;
>> int nr_perf_states;
>> unsigned long flags;
>> unsigned long cpus[];
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 797141638b29..5b40db38b745 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
>> return ret;
>> }
>>
>> + /* Initialize runtime table as default table. */
>
> Redundant comment.
I'll drop it.
>
>> + rcu_assign_pointer(pd->runtime_table, default_table);
>> +
>> if (_is_cpu_device(dev))
>> for_each_cpu(cpu, cpus) {
>> cpu_dev = get_cpu_device(cpu);
>> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
>> */
>> void em_dev_unregister_perf_domain(struct device *dev)
>> {
>> + struct em_perf_table __rcu *runtime_table;
>> struct em_perf_domain *pd;
>>
>> if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
>> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
>> return;
>>
>> pd = dev->em_pd;
>> -
>
> Unrelated change.
ACK
>
>> /*
>> * The mutex separates all register/unregister requests and protects
>> * from potential clean-up/setup issues in the debugfs directories.
>> * The debugfs directory name is the same as device's name.
>> */
>> mutex_lock(&em_pd_mutex);
>> +
>
> Same here.
ACK
>
>> em_debug_remove_pd(dev);
>>
>> + runtime_table = pd->runtime_table;
>> +
>> + rcu_assign_pointer(pd->runtime_table, NULL);
>> + synchronize_rcu();
>
> Is it really a good idea to call this under a mutex?
This is the unregistration of the EM code path, so a thermal
driver which gets some IRQs might not be aware that the EM
is going to vanish. That's why those two code paths: update
& unregister are protected with the same lock.
This synchronize_rcu() won't be long, but makes sure
that when we free(dev->em_pd) we don't leak runtime_table.
>
>> +
>> kfree(pd->default_table->state);
>> kfree(pd->default_table);
>> kfree(dev->em_pd);
>> +
>
> Unrelated change.
ACK
>
>> dev->em_pd = NULL;
>> mutex_unlock(&em_pd_mutex);
>> }
>> --
>
> So this really adds a pointer to a table that can be dynamically
> updated to struct em_perf_domain without any users so far. It is not
> used anywhere as of this patch AFAICS, which is not what the changelog
> is saying.
Good catch. I will adjust the changlog in header and say:
'Add infrastructure and mechanisms for the new runtime table.
The runtime modifiable EM data is used by the Energy Aware Scheduler
(EAS)for the task placement. All the other users (thermal, etc.) are
still using the default (basic) EM. This fact drove the design of this
feature.'
On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <[email protected]> wrote:
>
>
>
> On 9/26/23 20:12, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <[email protected]> wrote:
> >>
> >> The new runtime table would be populated with a new power data to better
> >> reflect the actual power. The power can vary over time e.g. due to the
> >> SoC temperature change. Higher temperature can increase power values.
> >> For longer running scenarios, such as game or camera, when also other
> >> devices are used (e.g. GPU, ISP) the CPU power can change. The new
> >> EM framework is able to addresses this issue and change the data
> >> at runtime safely.
> >>
> >> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
> >> for the task placement. All the other users (thermal, etc.) are still
> >> using the default (basic) EM. This fact drove the design of this feature.
> >>
> >> Signed-off-by: Lukasz Luba <[email protected]>
> >> ---
> >> include/linux/energy_model.h | 4 +++-
> >> kernel/power/energy_model.c | 12 +++++++++++-
> >> 2 files changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> >> index 546dee90f716..740e7c25cfff 100644
> >> --- a/include/linux/energy_model.h
> >> +++ b/include/linux/energy_model.h
> >> @@ -39,7 +39,7 @@ struct em_perf_state {
> >> /**
> >> * struct em_perf_table - Performance states table
> >> * @state: List of performance states, in ascending order
> >> - * @rcu: RCU used for safe access and destruction
> >> + * @rcu: RCU used only for runtime modifiable table
> >
> > This still doesn't appear to be used anywhere, so why change it here?
>
> I will try to move this later in the series.
>
> >
> >> */
> >> struct em_perf_table {
> >> struct em_perf_state *state;
> >> @@ -49,6 +49,7 @@ struct em_perf_table {
> >> /**
> >> * struct em_perf_domain - Performance domain
> >> * @default_table: Pointer to the default em_perf_table
> >> + * @runtime_table: Pointer to the runtime modifiable em_perf_table
> >
> > "Pointer to em_perf_table that can be dynamically updated"
>
> OK
>
> >
> >> * @nr_perf_states: Number of performance states
> >> * @flags: See "em_perf_domain flags"
> >> * @cpus: Cpumask covering the CPUs of the domain. It's here
> >> @@ -64,6 +65,7 @@ struct em_perf_table {
> >> */
> >> struct em_perf_domain {
> >> struct em_perf_table *default_table;
> >> + struct em_perf_table __rcu *runtime_table;
> >> int nr_perf_states;
> >> unsigned long flags;
> >> unsigned long cpus[];
> >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> >> index 797141638b29..5b40db38b745 100644
> >> --- a/kernel/power/energy_model.c
> >> +++ b/kernel/power/energy_model.c
> >> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
> >> return ret;
> >> }
> >>
> >> + /* Initialize runtime table as default table. */
> >
> > Redundant comment.
>
> I'll drop it.
>
> >
> >> + rcu_assign_pointer(pd->runtime_table, default_table);
> >> +
> >> if (_is_cpu_device(dev))
> >> for_each_cpu(cpu, cpus) {
> >> cpu_dev = get_cpu_device(cpu);
> >> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
> >> */
> >> void em_dev_unregister_perf_domain(struct device *dev)
> >> {
> >> + struct em_perf_table __rcu *runtime_table;
> >> struct em_perf_domain *pd;
> >>
> >> if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
> >> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
> >> return;
> >>
> >> pd = dev->em_pd;
> >> -
> >
> > Unrelated change.
>
> ACK
>
> >
> >> /*
> >> * The mutex separates all register/unregister requests and protects
> >> * from potential clean-up/setup issues in the debugfs directories.
> >> * The debugfs directory name is the same as device's name.
> >> */
> >> mutex_lock(&em_pd_mutex);
> >> +
> >
> > Same here.
>
> ACK
>
> >
> >> em_debug_remove_pd(dev);
> >>
> >> + runtime_table = pd->runtime_table;
> >> +
> >> + rcu_assign_pointer(pd->runtime_table, NULL);
> >> + synchronize_rcu();
> >
> > Is it really a good idea to call this under a mutex?
>
> This is the unregistration of the EM code path, so a thermal
> driver which gets some IRQs might not be aware that the EM
> is going to vanish. That's why those two code paths: update
> & unregister are protected with the same lock.
>
> This synchronize_rcu() won't be long,
Are you sure? This potentially waits for all of the CPUs in the
system to go through a quiescent state which may take a while in
principle.
In any case, though, this effectively makes everyone waiting for the
mutex also wait for the grace period to elapse and they may not care
about it.
> but makes sure that when we free(dev->em_pd) we don't leak runtime_table.
>
> >
> >> +
> >> kfree(pd->default_table->state);
> >> kfree(pd->default_table);
> >> kfree(dev->em_pd);
> >> +
> >
> > Unrelated change.
>
> ACK
>
> >
> >> dev->em_pd = NULL;
> >> mutex_unlock(&em_pd_mutex);
> >> }
> >> --
> >
> > So this really adds a pointer to a table that can be dynamically
> > updated to struct em_perf_domain without any users so far. It is not
> > used anywhere as of this patch AFAICS, which is not what the changelog
> > is saying.
>
> Good catch. I will adjust the changlog in header and say:
>
> 'Add infrastructure and mechanisms for the new runtime table.
> The runtime modifiable EM data is used by the Energy Aware Scheduler
> (EAS)for the task placement.
I would make it more clear that this is going to happen after some
other subsequent changes.
> All the other users (thermal, etc.) are
> still using the default (basic) EM. This fact drove the design of this
> feature.'
On 9/29/23 13:27, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <[email protected]> wrote:
>>
>>
[snip]
>>>> em_debug_remove_pd(dev);
>>>>
>>>> + runtime_table = pd->runtime_table;
>>>> +
>>>> + rcu_assign_pointer(pd->runtime_table, NULL);
>>>> + synchronize_rcu();
>>>
>>> Is it really a good idea to call this under a mutex?
>>
>> This is the unregistration of the EM code path, so a thermal
>> driver which gets some IRQs might not be aware that the EM
>> is going to vanish. That's why those two code paths: update
>> & unregister are protected with the same lock.
>>
>> This synchronize_rcu() won't be long,
>
> Are you sure? This potentially waits for all of the CPUs in the
> system to go through a quiescent state which may take a while in
> principle.
>
> In any case, though, this effectively makes everyone waiting for the
> mutex also wait for the grace period to elapse and they may not care
> about it.
My apologies for the delay, I had to check this. Yes, should be possible
and safe to not wait here as you described on this synchronize_rcu().
What I have drawn in other response to patch 11/18 [1] should still be
true.
Thanks, I will remove this sync call from here.
>
>> but makes sure that when we free(dev->em_pd) we don't leak runtime_table.
>>
>>>
>>>> +
>>>> kfree(pd->default_table->state);
>>>> kfree(pd->default_table);
>>>> kfree(dev->em_pd);
>>>> +
>>>
>>> Unrelated change.
>>
>> ACK
>>
>>>
>>>> dev->em_pd = NULL;
>>>> mutex_unlock(&em_pd_mutex);
>>>> }
>>>> --
>>>
>>> So this really adds a pointer to a table that can be dynamically
>>> updated to struct em_perf_domain without any users so far. It is not
>>> used anywhere as of this patch AFAICS, which is not what the changelog
>>> is saying.
>>
>> Good catch. I will adjust the changlog in header and say:
>>
>> 'Add infrastructure and mechanisms for the new runtime table.
>> The runtime modifiable EM data is used by the Energy Aware Scheduler
>> (EAS)for the task placement.
>
> I would make it more clear that this is going to happen after some
> other subsequent changes.
>
OK, I will add that information too.
[1]
https://lore.kernel.org/lkml/[email protected]/