2023-11-29 11:08:24

by Lukasz Luba

Subject: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi all,

This patch set adds a new feature which allows modifying Energy Model (EM)
power values at runtime. It makes it possible to better reflect the power
model of recent SoCs and silicon. Different characteristics of the power usage
can be leveraged and thus better decisions made during task placement in EAS.

It's part of a feature set known as Dynamic Energy Model. It was presented
and discussed recently at OSPM2023 [3]. This patch set implements the 1st
improvement for the EM.

The concepts:
1. The CPU power usage can vary due to the workload that it's running or due
to the temperature of the SoC. The same workload can use more power when the
temperature of the silicon has increased (e.g. due to hot GPU or ISP).
In such a situation the EM can be adjusted to reflect the increased
power usage. That power increase is due to static power
(sometimes simply called leakage). The CPUs in recent SoCs are different.
We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
They are also built differently with High Performance (HP) cells or
Low Power (LP) cells. They are affected by the temperature increase
differently: HP cells have bigger leakage. The SW model can leverage that
knowledge.

2. It is also possible to change the EM to better reflect the currently
running workload. Usually the EM is derived from some average power values
taken from experiments with a benchmark (e.g. Dhrystone). A model derived
from such a scenario might not properly represent the workloads usually running
on the device. Therefore, runtime modification of the EM allows switching to
a different model when there is a need.

3. The EM can be adjusted after boot, when all the modules are loaded and
more information about the SoC is available, e.g. chip binning. This helps
to better reflect the silicon characteristics. The EM modification API now
makes this possible; in the past the EM had to be 'set in stone'.

A more detailed explanation and background can be found in the presentations
from LPC2022 [1][2] or in the documentation patches.

Some test results.
The EM can be updated to better fit the workload type. In the case below the EM
has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
for the scheduler bits). Jankbench was run 10 times for each of the two configurations,
to get more reliable data.

1. Janky frames percentage
+--------+-----------------+---------------------+-------+-----------+
| metric | variable        | kernel              | value | perc_diff |
+--------+-----------------+---------------------+-------+-----------+
| gmean  | jank_percentage | EM_default          | 2.0   | 0.0%      |
| gmean  | jank_percentage | EM_modified_runtime | 1.3   | -35.33%   |
+--------+-----------------+---------------------+-------+-----------+

2. Avg frame render time duration
+--------+---------------------+---------------------+-------+-----------+
| metric | variable            | kernel              | value | perc_diff |
+--------+---------------------+---------------------+-------+-----------+
| gmean  | mean_frame_duration | EM_default          | 10.5  | 0.0%      |
| gmean  | mean_frame_duration | EM_modified_runtime | 9.6   | -8.52%    |
+--------+---------------------+---------------------+-------+-----------+

3. Max frame render time duration
+--------+--------------------+---------------------+-------+-----------+
| metric | variable           | kernel              | value | perc_diff |
+--------+--------------------+---------------------+-------+-----------+
| gmean  | max_frame_duration | EM_default          | 251.6 | 0.0%      |
| gmean  | max_frame_duration | EM_modified_runtime | 115.5 | -54.09%   |
+--------+--------------------+---------------------+-------+-----------+

4. OS overutilized state percentage (when EAS is not working)
+--------------+---------------------+------+------------+------------+
| metric       | wa_path             | time | total_time | percentage |
+--------------+---------------------+------+------------+------------+
| overutilized | EM_default          | 1.65 | 253.38     | 0.65       |
| overutilized | EM_modified_runtime | 1.4  | 277.5      | 0.51       |
+--------------+---------------------+------+------------+------------+

5. All CPUs (Little+Mid+Big) power values in mW
+------------+--------+---------------------+-------+-----------+
| channel    | metric | kernel              | value | perc_diff |
+------------+--------+---------------------+-------+-----------+
| CPU        | gmean  | EM_default          | 142.1 | 0.0%      |
| CPU        | gmean  | EM_modified_runtime | 131.8 | -7.27%    |
+------------+--------+---------------------+-------+-----------+

The time cost to update the EM decreased in v5 compared to v4 (values below are v4 vs v5):
big: 5us vs 2us -> 2.6x faster
mid: 9us vs 3us -> 3x faster
little: 16us vs 16us -> no change

We still have to update the inefficiency information in the cpufreq framework,
so some overhead remains.

Changelog:
v5:
- removed 2 tables design
- have only one table (runtime_table) used also in thermal (Wei, Rafael)
- refactored update function and removed callback call for each opp
- added faster EM table swap, using only the RCU pointer update
- added memory allocation API and tracking with kref
- avoid overhead for computing 'cost' for each OPP in update, it can be
pre-computed in device drivers EM earlier
- add support for device drivers providing EM table
- added API for computing 'cost' values in EM for EAS
- added API for thermal/powercap to use EM (using RCU wrappers)
- switched to single allocation and 'state[]' array (Rafael)
- changed documentation to align with current design
- added helper API for computing cost values
- simplified EM free in unregister path (thanks to kref)
- split patch updating EM clients and changed them separately
- added separate patch removing old static EM table
- added EM debugfs change patch to dump the runtime_table
- addressed comments in v4 for spelling/comments/headers
- added review tags
v4 changes are here [4]

Regards,
Lukasz Luba

[1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
[2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
[3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
[4] https://lore.kernel.org/lkml/[email protected]/


Lukasz Luba (23):
PM: EM: Add missing newline for the message log
PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
PM: EM: Find first CPU active while updating OPP efficiency
PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
PM: EM: Refactor a new function em_compute_costs()
PM: EM: Check if the get_cost() callback is present in
em_compute_costs()
PM: EM: Refactor how the EM table is allocated and populated
PM: EM: Introduce runtime modifiable table
PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
PM: EM: Add API for memory allocations for new tables
PM: EM: Add API for updating the runtime modifiable EM
PM: EM: Add helpers to read under RCU lock the EM table
PM: EM: Add performance field to struct em_perf_state
PM: EM: Support late CPUs booting and capacity adjustment
PM: EM: Optimize em_cpu_energy() and remove division
powercap/dtpm_cpu: Use new Energy Model interface to get table
powercap/dtpm_devfreq: Use new Energy Model interface to get table
drivers/thermal/cpufreq_cooling: Use new Energy Model interface
drivers/thermal/devfreq_cooling: Use new Energy Model interface
PM: EM: Change debugfs configuration to use runtime EM table data
PM: EM: Remove old table
PM: EM: Add em_dev_compute_costs() as API for device drivers
Documentation: EM: Update with runtime modification design

Documentation/power/energy-model.rst | 206 +++++++++++-
drivers/powercap/dtpm_cpu.c | 35 +-
drivers/powercap/dtpm_devfreq.c | 31 +-
drivers/thermal/cpufreq_cooling.c | 40 ++-
drivers/thermal/devfreq_cooling.c | 43 ++-
include/linux/energy_model.h | 163 +++++----
kernel/power/energy_model.c | 479 +++++++++++++++++++++++----
7 files changed, 813 insertions(+), 184 deletions(-)

--
2.25.1


2023-11-29 11:08:31

by Lukasz Luba

Subject: [PATCH v5 01/23] PM: EM: Add missing newline for the message log

Fix the missing newline for the log string in the error code path.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 7b44f5b89fa1..8b9dd4a39f63 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -250,7 +250,7 @@ static void em_cpufreq_update_efficiencies(struct device *dev)

policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
if (!policy) {
- dev_warn(dev, "EM: Access to CPUFreq policy failed");
+ dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
return;
}

--
2.25.1

2023-11-29 11:08:47

by Lukasz Luba

Subject: [PATCH v5 02/23] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments

In order to prepare the code for the modifiable EM perf_state table,
refactor existing function em_cpufreq_update_efficiencies().

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 8b9dd4a39f63..42486674b834 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -237,10 +237,10 @@ static int em_create_pd(struct device *dev, int nr_states,
return 0;
}

-static void em_cpufreq_update_efficiencies(struct device *dev)
+static void
+em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
{
struct em_perf_domain *pd = dev->em_pd;
- struct em_perf_state *table;
struct cpufreq_policy *policy;
int found = 0;
int i;
@@ -254,8 +254,6 @@ static void em_cpufreq_update_efficiencies(struct device *dev)
return;
}

- table = pd->table;
-
for (i = 0; i < pd->nr_perf_states; i++) {
if (!(table[i].flags & EM_PERF_STATE_INEFFICIENT))
continue;
@@ -397,7 +395,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,

dev->em_pd->flags |= flags;

- em_cpufreq_update_efficiencies(dev);
+ em_cpufreq_update_efficiencies(dev, dev->em_pd->table);

em_debug_create_pd(dev);
dev_info(dev, "EM: created perf domain\n");
--
2.25.1

2023-11-29 11:08:48

by Lukasz Luba

Subject: [PATCH v5 04/23] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible

The Energy Model (EM) is going to support runtime modification. There
are going to be 2 EM tables which store information. This patch aims
to prepare the code to be generic and use one of the tables. The function
will no longer get a pointer to 'struct em_perf_domain' (the EM) but
instead a pointer to 'struct em_perf_state' (which is one of the EM's
tables).

Prepare em_pd_get_efficient_state() for the upcoming changes and
make it possible to re-use. Return an index for the best performance
state for a given EM table. The newly introduced function arguments
should allow it to work on different performance state arrays. The caller of
em_pd_get_efficient_state() should be able to use the index either
on the default or the modifiable EM table.

Signed-off-by: Lukasz Luba <[email protected]>
Reviewed-by: Daniel Lezcano <[email protected]>
---
include/linux/energy_model.h | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b9caa01dfac4..8069f526c9d8 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -175,33 +175,35 @@ void em_dev_unregister_perf_domain(struct device *dev);

/**
* em_pd_get_efficient_state() - Get an efficient performance state from the EM
- * @pd : Performance domain for which we want an efficient frequency
- * @freq : Frequency to map with the EM
+ * @table: List of performance states, in ascending order
+ * @nr_perf_states: Number of performance states
+ * @freq: Frequency to map with the EM
+ * @pd_flags: Performance Domain flags
*
* It is called from the scheduler code quite frequently and as a consequence
* doesn't implement any check.
*
- * Return: An efficient performance state, high enough to meet @freq
+ * Return: An efficient performance state id, high enough to meet @freq
* requirement.
*/
-static inline
-struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
- unsigned long freq)
+static inline int
+em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
+ unsigned long freq, unsigned long pd_flags)
{
struct em_perf_state *ps;
int i;

- for (i = 0; i < pd->nr_perf_states; i++) {
- ps = &pd->table[i];
+ for (i = 0; i < nr_perf_states; i++) {
+ ps = &table[i];
if (ps->frequency >= freq) {
- if (pd->flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+ if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
ps->flags & EM_PERF_STATE_INEFFICIENT)
continue;
- break;
+ return i;
}
}

- return ps;
+ return nr_perf_states - 1;
}

/**
@@ -226,7 +228,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
{
unsigned long freq, scale_cpu;
struct em_perf_state *ps;
- int cpu;
+ int cpu, i;

if (!sum_util)
return 0;
@@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
* Find the lowest performance state of the Energy Model above the
* requested frequency.
*/
- ps = em_pd_get_efficient_state(pd, freq);
+ i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
+ pd->flags);
+ ps = &pd->table[i];

/*
* The capacity of a CPU in the domain at the performance state (ps)
--
2.25.1
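
The index-returning lookup described in patch 04 can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code itself: `struct perf_state`, the two flag macros, `get_efficient_state()` and `example_table` are hypothetical stand-ins for `struct em_perf_state`, `EM_PERF_STATE_INEFFICIENT`, `EM_PERF_DOMAIN_SKIP_INEFFICIENCIES` and `em_pd_get_efficient_state()`, and the frequencies are made up.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel flags (assumed values). */
#define PERF_STATE_INEFFICIENT          (1UL << 0)
#define PERF_DOMAIN_SKIP_INEFFICIENCIES (1UL << 0)

/* Reduced stand-in for struct em_perf_state. */
struct perf_state {
	unsigned long frequency;
	unsigned long flags;
};

/*
 * Return the index of the lowest state whose frequency is >= freq,
 * optionally skipping states marked inefficient. Falls back to the
 * highest state when nothing matches.
 */
static int get_efficient_state(const struct perf_state *table, int nr_states,
			       unsigned long freq, unsigned long pd_flags)
{
	int i;

	for (i = 0; i < nr_states; i++) {
		const struct perf_state *ps = &table[i];

		if (ps->frequency < freq)
			continue;
		if ((pd_flags & PERF_DOMAIN_SKIP_INEFFICIENCIES) &&
		    (ps->flags & PERF_STATE_INEFFICIENT))
			continue;
		return i;
	}

	return nr_states - 1;
}

/* Made-up table: the 1000 state is marked inefficient. */
static const struct perf_state example_table[] = {
	{  500, 0 },
	{ 1000, PERF_STATE_INEFFICIENT },
	{ 1500, 0 },
	{ 2000, 0 },
};
```

Because the function returns an index rather than a pointer, a caller can apply the same result to any table with the same layout, which is the property the commit message relies on.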

2023-11-29 11:09:01

by Lukasz Luba

Subject: [PATCH v5 06/23] PM: EM: Check if the get_cost() callback is present in em_compute_costs()

Subsequent changes will introduce a case in which 'cb->get_cost' may
not be set in em_compute_costs(), so add a check to ensure that it is
not NULL before attempting to dereference it.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 3bea930410c6..3c8542443dd4 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -116,7 +116,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
for (i = nr_states - 1; i >= 0; i--) {
unsigned long power_res, cost;

- if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
+ if ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost) {
ret = cb->get_cost(dev, table[i].frequency, &cost);
if (ret || !cost || cost > EM_MAX_POWER) {
dev_err(dev, "EM: invalid cost %lu %d\n",
--
2.25.1

2023-11-29 11:09:05

by Lukasz Luba

Subject: [PATCH v5 07/23] PM: EM: Refactor how the EM table is allocated and populated

Split the process of allocation and data initialization for the EM table.
The upcoming changes for modifiable EM will use it.

This change is not expected to alter the general functionality.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 52 ++++++++++++++++++++++---------------
1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 3c8542443dd4..99426b5eedb6 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -142,18 +142,25 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
return 0;
}

+static int em_allocate_perf_table(struct em_perf_domain *pd,
+ int nr_states)
+{
+ pd->table = kcalloc(nr_states, sizeof(struct em_perf_state),
+ GFP_KERNEL);
+ if (!pd->table)
+ return -ENOMEM;
+
+ return 0;
+}
+
static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
+ struct em_perf_state *table,
int nr_states, struct em_data_callback *cb,
unsigned long flags)
{
unsigned long power, freq, prev_freq = 0;
- struct em_perf_state *table;
int i, ret;

- table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
- if (!table)
- return -ENOMEM;
-
/* Build the list of performance states for this performance domain */
for (i = 0, freq = 0; i < nr_states; i++, freq++) {
/*
@@ -165,7 +172,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
if (ret) {
dev_err(dev, "EM: invalid perf. state: %d\n",
ret);
- goto free_ps_table;
+ return -EINVAL;
}

/*
@@ -175,7 +182,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
if (freq <= prev_freq) {
dev_err(dev, "EM: non-increasing freq: %lu\n",
freq);
- goto free_ps_table;
+ return -EINVAL;
}

/*
@@ -185,7 +192,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
if (!power || power > EM_MAX_POWER) {
dev_err(dev, "EM: invalid power: %lu\n",
power);
- goto free_ps_table;
+ return -EINVAL;
}

table[i].power = power;
@@ -194,16 +201,9 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,

ret = em_compute_costs(dev, table, cb, nr_states, flags);
if (ret)
- goto free_ps_table;
-
- pd->table = table;
- pd->nr_perf_states = nr_states;
+ return -EINVAL;

return 0;
-
-free_ps_table:
- kfree(table);
- return -EINVAL;
}

static int em_create_pd(struct device *dev, int nr_states,
@@ -234,11 +234,15 @@ static int em_create_pd(struct device *dev, int nr_states,
return -ENOMEM;
}

- ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
- if (ret) {
- kfree(pd);
- return ret;
- }
+ pd->nr_perf_states = nr_states;
+
+ ret = em_allocate_perf_table(pd, nr_states);
+ if (ret)
+ goto free_pd;
+
+ ret = em_create_perf_table(dev, pd, pd->table, nr_states, cb, flags);
+ if (ret)
+ goto free_pd_table;

if (_is_cpu_device(dev))
for_each_cpu(cpu, cpus) {
@@ -249,6 +253,12 @@ static int em_create_pd(struct device *dev, int nr_states,
dev->em_pd = pd;

return 0;
+
+free_pd_table:
+ kfree(pd->table);
+free_pd:
+ kfree(pd);
+ return -EINVAL;
}

static void
--
2.25.1

2023-11-29 11:09:06

by Lukasz Luba

Subject: [PATCH v5 03/23] PM: EM: Find first CPU active while updating OPP efficiency

The Energy Model might be updated at runtime and the energy efficiency
for each OPP may change. Thus, the cpufreq framework also needs to be
updated and aligned with the new values. In order to do that, use the
first active CPU from the Performance Domain. This is needed since the
first CPU in the cpumask might be offline when we run this code path.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 42486674b834..aa7c89f9e115 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
struct em_perf_domain *pd = dev->em_pd;
struct cpufreq_policy *policy;
int found = 0;
- int i;
+ int i, cpu;

if (!_is_cpu_device(dev) || !pd)
return;

- policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
+ /* Try to get a CPU which is active and in this PD */
+ cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);
+ if (cpu >= nr_cpu_ids) {
+ dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
+ return;
+ }
+
+ policy = cpufreq_cpu_get(cpu);
if (!policy) {
dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
return;
--
2.25.1
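
To illustrate why patch 03 intersects the domain span with the active mask, here is a small userspace sketch. It assumes at most one machine word of CPUs, and `first_cpu_in_both()` is a hypothetical analogue of `cpumask_first_and()`, not the kernel helper.

```c
#include <limits.h>

/*
 * Hypothetical single-word analogue of cpumask_first_and(): return the
 * lowest CPU id set in both masks, or nr_ids when there is none.
 */
static unsigned int first_cpu_in_both(unsigned long span, unsigned long active,
				      unsigned int nr_ids)
{
	unsigned long both = span & active;
	unsigned int cpu;

	for (cpu = 0; cpu < nr_ids && cpu < sizeof(both) * CHAR_BIT; cpu++) {
		if (both & (1UL << cpu))
			return cpu;
	}

	/* Mirrors the "cpu >= nr_cpu_ids" failure check in the patch. */
	return nr_ids;
}
```

With a domain spanning CPUs 4-7 (`0xF0`) and CPU 4 not active (`0xEF`), the sketch picks CPU 5, whereas taking the first CPU of the span alone would pick CPU 4.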

2023-11-29 11:09:16

by Lukasz Luba

Subject: [PATCH v5 08/23] PM: EM: Introduce runtime modifiable table

The new runtime table can be populated with new power data to better
reflect the actual efficiency of the device e.g. CPU. The power can vary
over time e.g. due to the SoC temperature change. Higher temperature can
increase power values. For longer running scenarios, such as gaming or
camera use, when other devices are also used (e.g. GPU, ISP), the CPU power
can change. The new EM framework is able to address this issue and change
the EM data at runtime safely.

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 12 ++++++++
kernel/power/energy_model.c | 53 ++++++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 8069f526c9d8..1e618e431cac 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -36,9 +36,20 @@ struct em_perf_state {
*/
#define EM_PERF_STATE_INEFFICIENT BIT(0)

+/**
+ * struct em_perf_table - Performance states table
+ * @rcu: RCU used for safe access and destruction
+ * @state: List of performance states, in ascending order
+ */
+struct em_perf_table {
+ struct rcu_head rcu;
+ struct em_perf_state state[];
+};
+
/**
* struct em_perf_domain - Performance domain
* @table: List of performance states, in ascending order
+ * @runtime_table: Pointer to the runtime modifiable em_perf_table
* @nr_perf_states: Number of performance states
* @flags: See "em_perf_domain flags"
* @cpus: Cpumask covering the CPUs of the domain. It's here
@@ -54,6 +65,7 @@ struct em_perf_state {
*/
struct em_perf_domain {
struct em_perf_state *table;
+ struct em_perf_table __rcu *runtime_table;
int nr_perf_states;
unsigned long flags;
unsigned long cpus[];
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 99426b5eedb6..489287666705 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -23,6 +23,9 @@
*/
static DEFINE_MUTEX(em_pd_mutex);

+static void em_cpufreq_update_efficiencies(struct device *dev,
+ struct em_perf_state *table);
+
static bool _is_cpu_device(struct device *dev)
{
return (dev->bus == &cpu_subsys);
@@ -103,6 +106,31 @@ static void em_debug_create_pd(struct device *dev) {}
static void em_debug_remove_pd(struct device *dev) {}
#endif

+static void em_destroy_table_rcu(struct rcu_head *rp)
+{
+ struct em_perf_table __rcu *runtime_table;
+
+ runtime_table = container_of(rp, struct em_perf_table, rcu);
+ kfree(runtime_table);
+}
+
+static void em_free_table(struct em_perf_table __rcu *table)
+{
+ call_rcu(&table->rcu, em_destroy_table_rcu);
+}
+
+static struct em_perf_table __rcu *
+em_allocate_table(struct em_perf_domain *pd)
+{
+ struct em_perf_table __rcu *table;
+ int table_size;
+
+ table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
+
+ table = kzalloc(sizeof(*table) + table_size, GFP_KERNEL);
+ return table;
+}
+
static int em_compute_costs(struct device *dev, struct em_perf_state *table,
struct em_data_callback *cb, int nr_states,
unsigned long flags)
@@ -153,6 +181,24 @@ static int em_allocate_perf_table(struct em_perf_domain *pd,
return 0;
}

+static int em_create_runtime_table(struct em_perf_domain *pd)
+{
+ struct em_perf_table __rcu *runtime_table;
+ int table_size;
+
+ runtime_table = em_allocate_table(pd);
+ if (!runtime_table)
+ return -ENOMEM;
+
+ /* Initialize runtime table with existing data */
+ table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
+ memcpy(runtime_table->state, pd->table, table_size);
+
+ rcu_assign_pointer(pd->runtime_table, runtime_table);
+
+ return 0;
+}
+
static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
struct em_perf_state *table,
int nr_states, struct em_data_callback *cb,
@@ -244,6 +290,10 @@ static int em_create_pd(struct device *dev, int nr_states,
if (ret)
goto free_pd_table;

+ ret = em_create_runtime_table(pd);
+ if (ret)
+ goto free_pd_table;
+
if (_is_cpu_device(dev))
for_each_cpu(cpu, cpus) {
cpu_dev = get_cpu_device(cpu);
@@ -460,6 +510,9 @@ void em_dev_unregister_perf_domain(struct device *dev)
em_debug_remove_pd(dev);

kfree(dev->em_pd->table);
+
+ em_free_table(dev->em_pd->runtime_table);
+
kfree(dev->em_pd);
dev->em_pd = NULL;
mutex_unlock(&em_pd_mutex);
--
2.25.1

2023-11-29 11:09:18

by Lukasz Luba

Subject: [PATCH v5 09/23] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS

The new Energy Model (EM) supports runtime modification of the performance
state table to better model the power used by the SoC. Use this new
feature to improve energy estimation and therefore task placement in
the Energy Aware Scheduler (EAS).

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 1e618e431cac..94a77a813724 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -238,6 +238,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util,
unsigned long allowed_cpu_cap)
{
+ struct em_perf_table *runtime_table;
unsigned long freq, scale_cpu;
struct em_perf_state *ps;
int cpu, i;
@@ -255,7 +256,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
*/
cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(cpu);
- ps = &pd->table[pd->nr_perf_states - 1];
+
+ /*
+ * No rcu_read_lock() since it's already called by task scheduler.
+ * The runtime_table is always there for CPUs, so we don't check.
+ */
+ runtime_table = rcu_dereference(pd->runtime_table);
+
+ ps = &runtime_table->state[pd->nr_perf_states - 1];

max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
@@ -265,9 +273,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
* Find the lowest performance state of the Energy Model above the
* requested frequency.
*/
- i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
- pd->flags);
- ps = &pd->table[i];
+ i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
+ freq, pd->flags);
+ ps = &runtime_table->state[i];

/*
* The capacity of a CPU in the domain at the performance state (ps)
--
2.25.1

2023-11-29 11:09:26

by Lukasz Luba

Subject: [PATCH v5 05/23] PM: EM: Refactor a new function em_compute_costs()

Factor out a dedicated function, which will be easier to maintain and re-use
in the future. The upcoming changes for the modifiable EM perf_state table
will use it (instead of duplicating the code).

This change is not expected to alter the general functionality.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 72 ++++++++++++++++++++++---------------
1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index aa7c89f9e115..3bea930410c6 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -103,14 +103,52 @@ static void em_debug_create_pd(struct device *dev) {}
static void em_debug_remove_pd(struct device *dev) {}
#endif

+static int em_compute_costs(struct device *dev, struct em_perf_state *table,
+ struct em_data_callback *cb, int nr_states,
+ unsigned long flags)
+{
+ unsigned long prev_cost = ULONG_MAX;
+ u64 fmax;
+ int i, ret;
+
+ /* Compute the cost of each performance state. */
+ fmax = (u64) table[nr_states - 1].frequency;
+ for (i = nr_states - 1; i >= 0; i--) {
+ unsigned long power_res, cost;
+
+ if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
+ ret = cb->get_cost(dev, table[i].frequency, &cost);
+ if (ret || !cost || cost > EM_MAX_POWER) {
+ dev_err(dev, "EM: invalid cost %lu %d\n",
+ cost, ret);
+ return -EINVAL;
+ }
+ } else {
+ power_res = table[i].power;
+ cost = div64_u64(fmax * power_res, table[i].frequency);
+ }
+
+ table[i].cost = cost;
+
+ if (table[i].cost >= prev_cost) {
+ table[i].flags = EM_PERF_STATE_INEFFICIENT;
+ dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
+ table[i].frequency);
+ } else {
+ prev_cost = table[i].cost;
+ }
+ }
+
+ return 0;
+}
+
static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
int nr_states, struct em_data_callback *cb,
unsigned long flags)
{
- unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
+ unsigned long power, freq, prev_freq = 0;
struct em_perf_state *table;
int i, ret;
- u64 fmax;

table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
if (!table)
@@ -154,33 +192,9 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
table[i].frequency = prev_freq = freq;
}

- /* Compute the cost of each performance state. */
- fmax = (u64) table[nr_states - 1].frequency;
- for (i = nr_states - 1; i >= 0; i--) {
- unsigned long power_res, cost;
-
- if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
- ret = cb->get_cost(dev, table[i].frequency, &cost);
- if (ret || !cost || cost > EM_MAX_POWER) {
- dev_err(dev, "EM: invalid cost %lu %d\n",
- cost, ret);
- goto free_ps_table;
- }
- } else {
- power_res = table[i].power;
- cost = div64_u64(fmax * power_res, table[i].frequency);
- }
-
- table[i].cost = cost;
-
- if (table[i].cost >= prev_cost) {
- table[i].flags = EM_PERF_STATE_INEFFICIENT;
- dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
- table[i].frequency);
- } else {
- prev_cost = table[i].cost;
- }
- }
+ ret = em_compute_costs(dev, table, cb, nr_states, flags);
+ if (ret)
+ goto free_ps_table;

pd->table = table;
pd->nr_perf_states = nr_states;
--
2.25.1
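
The cost computation that patch 05 moves into em_compute_costs() can be reproduced in userspace. This is a sketch under assumptions: it covers only the non-artificial path (cost = fmax * power / frequency), `struct perf_state`, `PERF_STATE_INEFFICIENT` and `compute_costs()` are hypothetical stand-ins, and it ORs the flag in where the patch assigns it.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for EM_PERF_STATE_INEFFICIENT. */
#define PERF_STATE_INEFFICIENT (1UL << 0)

/* Reduced stand-in for struct em_perf_state. */
struct perf_state {
	unsigned long frequency;
	unsigned long power;
	unsigned long cost;
	unsigned long flags;
};

/*
 * Walk the table from the highest state down. A state is inefficient
 * when its cost is not lower than the cheapest cost seen so far among
 * the higher states.
 */
static void compute_costs(struct perf_state *table, int nr_states)
{
	unsigned long prev_cost = ULONG_MAX;
	uint64_t fmax = table[nr_states - 1].frequency;
	int i;

	for (i = nr_states - 1; i >= 0; i--) {
		table[i].cost = (unsigned long)(fmax * table[i].power /
						table[i].frequency);

		if (table[i].cost >= prev_cost)
			table[i].flags |= PERF_STATE_INEFFICIENT;
		else
			prev_cost = table[i].cost;
	}
}
```

For made-up states (500, 100), (1000, 300), (1500, 400), (2000, 800) in (frequency, power) form, the costs come out as 400, 600, 533 and 800, so only the 1000 state is marked inefficient: the 1500 state delivers more performance at a lower cost.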

2023-11-29 11:09:41

by Lukasz Luba

Subject: [PATCH v5 10/23] PM: EM: Add API for memory allocations for new tables

The runtime modified EM table can be provided by drivers. Create a
mechanism which allows device drivers to safely allocate and free the
table. The same table can be used by EAS in task scheduler code
paths, so make sure the memory is not freed when the device driver module
is unloaded.

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 11 +++++++++
kernel/power/energy_model.c | 44 ++++++++++++++++++++++++++++++++++--
2 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 94a77a813724..e785211828fe 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -5,6 +5,7 @@
#include <linux/device.h>
#include <linux/jump_label.h>
#include <linux/kobject.h>
+#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/sched/cpufreq.h>
#include <linux/sched/topology.h>
@@ -39,10 +40,12 @@ struct em_perf_state {
/**
* struct em_perf_table - Performance states table
* @rcu: RCU used for safe access and destruction
+ * @refcount: Reference count to track the owners
* @state: List of performance states, in ascending order
*/
struct em_perf_table {
struct rcu_head rcu;
+ struct kref refcount;
struct em_perf_state state[];
};

@@ -184,6 +187,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *span,
bool microwatts);
void em_dev_unregister_perf_domain(struct device *dev);
+struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
+void em_free_table(struct em_perf_table __rcu *table);

/**
* em_pd_get_efficient_state() - Get an efficient performance state from the EM
@@ -368,6 +373,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
{
return 0;
}
+static inline
+struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd)
+{
+ return NULL;
+}
+static inline void em_free_table(struct em_perf_table __rcu *table) {}
#endif

#endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 489287666705..489a358b9a00 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -114,12 +114,46 @@ static void em_destroy_table_rcu(struct rcu_head *rp)
kfree(runtime_table);
}

-static void em_free_table(struct em_perf_table __rcu *table)
+static void em_release_table_kref(struct kref *kref)
{
+ struct em_perf_table __rcu *table;
+
+ /* It was the last owner of this table so we can free */
+ table = container_of(kref, struct em_perf_table, refcount);
+
call_rcu(&table->rcu, em_destroy_table_rcu);
}

-static struct em_perf_table __rcu *
+static inline void em_inc_usage(struct em_perf_table __rcu *table)
+{
+ kref_get(&table->refcount);
+}
+
+static void em_dec_usage(struct em_perf_table __rcu *table)
+{
+ kref_put(&table->refcount, em_release_table_kref);
+}
+
+/**
+ * em_free_table() - Handles safe free of the EM table when needed
+ * @table : EM memory which is going to be freed
+ *
+ * No return values.
+ */
+void em_free_table(struct em_perf_table __rcu *table)
+{
+ em_dec_usage(table);
+}
+
+/**
+ * em_allocate_table() - Handles safe allocation of the new EM table
+ * @pd : EM performance domain for which the table is allocated
+ *
+ * Increments the reference counter to mark that there is an owner of that
+ * EM table. That might be a device driver module or EAS.
+ * Returns allocated table or NULL.
+ */
+struct em_perf_table __rcu *
em_allocate_table(struct em_perf_domain *pd)
{
struct em_perf_table __rcu *table;
@@ -128,6 +162,12 @@ em_allocate_table(struct em_perf_domain *pd)
table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;

table = kzalloc(sizeof(*table) + table_size, GFP_KERNEL);
+ if (!table)
+ return table;
+
+ kref_init(&table->refcount);
+ em_inc_usage(table);
+
return table;
}

--
2.25.1

2023-11-29 11:09:47

by Lukasz Luba

Subject: [PATCH v5 11/23] PM: EM: Add API for updating the runtime modifiable EM

Add API function em_dev_update_perf_domain() which allows to safely
change the EM. The concurrent modifiers are protected by the mutex
to serialize them. Removal of the old memory is asynchronous and
handled by the RCU mechanisms.
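
The ownership flow around an update can be sketched in plain userspace C. This is a minimal model, not the kernel implementation: `struct table`, `struct domain`, `table_alloc()`, `table_put()` and `domain_update()` are hypothetical stand-ins, the count is assumed to start at one on allocation (which may differ from the exact counting in this series), and there is no mutex or RCU grace period, so it only illustrates who holds a reference at each step.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for em_perf_table and em_perf_domain. */
struct table {
	int refcount;
	unsigned long power[4];
};

struct domain {
	struct table *runtime_table;
};

/* Counts released tables so the ownership flow can be observed. */
static int tables_freed;

/* Allocate a zeroed table owned by the caller (assumed: count starts at one). */
static struct table *table_alloc(void)
{
	struct table *t = calloc(1, sizeof(*t));

	if (t)
		t->refcount = 1;
	return t;
}

/* Drop one reference; the last owner releases the memory. */
static void table_put(struct table *t)
{
	if (--t->refcount == 0) {
		free(t);
		tables_freed++;
	}
}

/* Publish new_table: the domain takes its own reference, then drops the old one. */
static void domain_update(struct domain *d, struct table *new_table)
{
	struct table *old = d->runtime_table;

	new_table->refcount++;
	d->runtime_table = new_table;
	if (old)
		table_put(old);
}
```

In this model a caller allocates a table, fills it, calls `domain_update()`, and then drops its own reference with `table_put()`. The domain keeps the table alive until the next update replaces it.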

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 8 +++++++
kernel/power/energy_model.c | 46 ++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e785211828fe..520a8c8ad849 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -183,6 +183,8 @@ struct em_data_callback {

struct em_perf_domain *em_cpu_get(int cpu);
struct em_perf_domain *em_pd_get(struct device *dev);
+int em_dev_update_perf_domain(struct device *dev,
+ struct em_perf_table __rcu *new_table);
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *span,
bool microwatts);
@@ -379,6 +381,12 @@ struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd)
return NULL;
}
static inline void em_free_table(struct em_perf_table __rcu *table) {}
+static inline
+int em_dev_update_perf_domain(struct device *dev,
+ struct em_perf_table __rcu *new_table)
+{
+ return -EINVAL;
+}
#endif

#endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 489a358b9a00..614891fde8df 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -221,6 +221,52 @@ static int em_allocate_perf_table(struct em_perf_domain *pd,
return 0;
}

+/**
+ * em_dev_update_perf_domain() - Update runtime EM table for a device
+ * @dev : Device for which the EM is to be updated
+ * @new_table : The new EM table that is going to be used from now
+ *
+ * Update EM runtime modifiable table for @dev using the provided @new_table.
+ *
+ * This function uses mutex to serialize writers, so it must not be called
+ * from non-sleeping context.
+ *
+ * Return 0 on success or a proper error in case of failure.
+ */
+int em_dev_update_perf_domain(struct device *dev,
+ struct em_perf_table __rcu *new_table)
+{
+ struct em_perf_table __rcu *old_table;
+ struct em_perf_domain *pd;
+
+ /*
+ * The lock serializes update and unregister code paths. When the
+ * EM has been unregistered in the meantime, we should capture that
+ * when entering this critical section. It also makes sure that
+ * two concurrent updates will be serialized.
+ */
+ mutex_lock(&em_pd_mutex);
+
+ if (!dev || !dev->em_pd) {
+ mutex_unlock(&em_pd_mutex);
+ return -EINVAL;
+ }
+ pd = dev->em_pd;
+
+ em_inc_usage(new_table);
+
+ old_table = pd->runtime_table;
+ rcu_assign_pointer(pd->runtime_table, new_table);
+
+ em_cpufreq_update_efficiencies(dev, new_table->state);
+
+ em_dec_usage(old_table);
+
+ mutex_unlock(&em_pd_mutex);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
+
static int em_create_runtime_table(struct em_perf_domain *pd)
{
struct em_perf_table __rcu *runtime_table;
--
2.25.1

2023-11-29 11:09:58

by Lukasz Luba

Subject: [PATCH v5 13/23] PM: EM: Add performance field to struct em_perf_state

The performance doesn't scale linearly with the frequency. Also, it may
be different in different workloads. Some CPUs are designed to be
particularly good at some applications, e.g. image or video processing,
and other CPUs at different ones. When those different types of CPUs are
combined in one SoC they should be properly modeled to get the most out of
the HW in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the
power vs. performance curves to the EAS, but assumes the CPUs capacity
is fixed and scales linearly with the frequency. This patch allows to
adjust the curve on the 'performance' axis as well.
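
As a rough illustration of the default values, the linear initialization can be sketched in userspace C. `struct lin_perf_state` and `init_performance_linear()` are hypothetical names, a simplified stand-in for the kernel structure and helper. The sketch assumes a non-empty table sorted by ascending frequency.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct em_perf_state. */
struct lin_perf_state {
	unsigned long performance;
	unsigned long frequency;	/* kHz */
};

/*
 * Fill 'performance' assuming it scales linearly with frequency:
 *   performance[i] = max_cap * frequency[i] / fmax
 * The table must be non-empty and sorted by ascending frequency.
 */
static void init_performance_linear(struct lin_perf_state *table,
				    size_t nr_states, uint64_t max_cap)
{
	uint64_t fmax = table[nr_states - 1].frequency;
	size_t i;

	for (i = 0; i < nr_states; i++)
		table[i].performance =
			(unsigned long)(max_cap * table[i].frequency / fmax);
}
```

With a capacity of 1024 and frequencies of 500000, 1000000 and 2000000 kHz this yields 256, 512 and 1024. A platform with a non-linear curve would later replace these defaults with measured values.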

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 11 ++++++-----
kernel/power/energy_model.c | 27 +++++++++++++++++++++++++++
2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index ae3ccc8b9f44..e30750500b10 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -13,6 +13,7 @@

/**
* struct em_perf_state - Performance state of a performance domain
+ * @performance: Non-linear CPU performance at a given frequency
* @frequency: The frequency in KHz, for consistency with CPUFreq
* @power: The power consumed at this level (by 1 CPU or by a registered
* device). It can be a total power: static and dynamic.
@@ -21,6 +22,7 @@
* @flags: see "em_perf_state flags" description below.
*/
struct em_perf_state {
+ unsigned long performance;
unsigned long frequency;
unsigned long power;
unsigned long cost;
@@ -207,14 +209,14 @@ void em_free_table(struct em_perf_table __rcu *table);
*/
static inline int
em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
- unsigned long freq, unsigned long pd_flags)
+ unsigned long max_util, unsigned long pd_flags)
{
struct em_perf_state *ps;
int i;

for (i = 0; i < nr_perf_states; i++) {
ps = &table[i];
- if (ps->frequency >= freq) {
+ if (ps->performance >= max_util) {
if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
ps->flags & EM_PERF_STATE_INEFFICIENT)
continue;
@@ -246,8 +248,8 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long allowed_cpu_cap)
{
struct em_perf_table *runtime_table;
- unsigned long freq, scale_cpu;
struct em_perf_state *ps;
+ unsigned long scale_cpu;
int cpu, i;

if (!sum_util)
@@ -274,14 +276,13 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,

max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
- freq = map_util_freq(max_util, ps->frequency, scale_cpu);

/*
* Find the lowest performance state of the Energy Model above the
* requested frequency.
*/
i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
- freq, pd->flags);
+ max_util, pd->flags);
ps = &runtime_table->state[i];

/*
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 614891fde8df..b5016afe6a19 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -46,6 +46,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
debugfs_create_ulong("power", 0444, d, &ps->power);
debugfs_create_ulong("cost", 0444, d, &ps->cost);
+ debugfs_create_ulong("performance", 0444, d, &ps->performance);
debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
}

@@ -171,6 +172,30 @@ em_allocate_table(struct em_perf_domain *pd)
return table;
}

+static void em_init_performance(struct device *dev, struct em_perf_domain *pd,
+ struct em_perf_state *table, int nr_states)
+{
+ u64 fmax, max_cap;
+ int i, cpu;
+
+ /* This is needed only for CPUs and EAS, so skip other devices */
+ if (!_is_cpu_device(dev))
+ return;
+
+ cpu = cpumask_first(em_span_cpus(pd));
+
+ /*
+ * Calculate the performance value for each frequency with
+ * linear relationship. The final CPU capacity might not be ready at
+ * boot time, but the EM will be updated a bit later with correct one.
+ */
+ fmax = (u64) table[nr_states - 1].frequency;
+ max_cap = (u64) arch_scale_cpu_capacity(cpu);
+ for (i = 0; i < nr_states; i++)
+ table[i].performance = div64_u64(max_cap * table[i].frequency,
+ fmax);
+}
+
static int em_compute_costs(struct device *dev, struct em_perf_state *table,
struct em_data_callback *cb, int nr_states,
unsigned long flags)
@@ -331,6 +356,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
table[i].frequency = prev_freq = freq;
}

+ em_init_performance(dev, pd, table, nr_states);
+
ret = em_compute_costs(dev, table, cb, nr_states, flags);
if (ret)
return -EINVAL;
--
2.25.1

2023-11-29 11:10:05

by Lukasz Luba

Subject: [PATCH v5 12/23] PM: EM: Add helpers to read under RCU lock the EM table

To use the runtime modifiable EM table there is a need to use RCU
read locking properly. Add helper functions for the device drivers and
frameworks to make sure it's done properly.
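
The intended reader pattern is to copy the needed values while inside the read section and leave it on a single exit path. A minimal userspace sketch follows, assuming mock helpers: `mock_get_table()` and `mock_put_table()` are hypothetical stand-ins for em_get_table() and em_put_table(), and a plain counter models the RCU read-side nesting. Unlike the real em_put_table(), the mock takes the domain as an argument so that it can reach the counter.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct rd_perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* uW */
};

struct rd_perf_domain {
	struct rd_perf_state *table;
	int nr_perf_states;
	int read_depth;			/* models the read-side nesting */
};

/* Mock of em_get_table(): enter the read section, return the table. */
static struct rd_perf_state *mock_get_table(struct rd_perf_domain *pd)
{
	pd->read_depth++;
	return pd->table;
}

/* Mock of em_put_table(): leave the read section. */
static void mock_put_table(struct rd_perf_domain *pd)
{
	pd->read_depth--;
}

/*
 * Return the power of the first state at or above 'freq', or 0 if none.
 * The value is copied into a local before the read section is left, and
 * the loop uses 'break' rather than 'return' so the section always closes.
 */
static unsigned long power_at_freq(struct rd_perf_domain *pd,
				   unsigned long freq)
{
	struct rd_perf_state *table;
	unsigned long power = 0;
	int i;

	table = mock_get_table(pd);
	for (i = 0; i < pd->nr_perf_states; i++) {
		if (table[i].frequency < freq)
			continue;
		power = table[i].power;
		break;
	}
	mock_put_table(pd);

	return power;
}
```

The table pointer must not be used after the put call, since an updater may have replaced the table by then.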

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 520a8c8ad849..ae3ccc8b9f44 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -341,6 +341,20 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
return pd->nr_perf_states;
}

+static inline struct em_perf_state *em_get_table(struct em_perf_domain *pd)
+{
+ struct em_perf_table __rcu *runtime_table;
+
+ rcu_read_lock();
+ runtime_table = rcu_dereference(pd->runtime_table);
+ return runtime_table->state;
+}
+
+static inline void em_put_table(void)
+{
+ rcu_read_unlock();
+}
+
#else
struct em_data_callback {};
#define EM_ADV_DATA_CB(_active_power_cb, _cost_cb) { }
@@ -387,6 +401,11 @@ int em_dev_update_perf_domain(struct device *dev,
{
return -EINVAL;
}
+static inline struct em_perf_state *em_get_table(struct em_perf_domain *pd)
+{
+ return NULL;
+}
+static inline void em_put_table(void) {}
#endif

#endif
--
2.25.1

2023-11-29 11:10:15

by Lukasz Luba

Subject: [PATCH v5 14/23] PM: EM: Support late CPUs booting and capacity adjustment

The patch adds the infrastructure needed to handle late CPU boot, which
might change the capacity values of already booted CPUs. With these changes,
new CPUs which try to register an EM trigger the needed re-calculations for
the EMs of other CPUs. As a result, em_perf_state::performance values will
be aligned with the CPU capacity information after all CPUs finish the
boot and EM registrations.
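
The check-and-recalculate step can be sketched in userspace C as follows. `struct cap_perf_state` and `refresh_performance()` are hypothetical names. The sketch assumes a non-empty table sorted by ascending frequency and leaves out cost recalculation, locking and the final publish step.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct em_perf_state. */
struct cap_perf_state {
	unsigned long performance;
	unsigned long frequency;	/* kHz */
};

/*
 * Compare the highest state's performance with the current CPU capacity.
 * If they differ, copy the old table into new_table and rescale the
 * performance values linearly. Returns 1 if new_table was filled, else 0.
 */
static int refresh_performance(const struct cap_perf_state *old_table,
			       struct cap_perf_state *new_table,
			       size_t nr_states, unsigned long cpu_capacity)
{
	uint64_t fmax;
	size_t i;

	if (old_table[nr_states - 1].performance == cpu_capacity)
		return 0;

	/* Start from the old data so power and frequency are preserved. */
	memcpy(new_table, old_table, nr_states * sizeof(*old_table));

	fmax = old_table[nr_states - 1].frequency;
	for (i = 0; i < nr_states; i++)
		new_table[i].performance = (unsigned long)
			((uint64_t)cpu_capacity * new_table[i].frequency / fmax);

	return 1;
}
```

The old table is never modified in place. Readers keep seeing consistent values until the new table is published.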

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 121 ++++++++++++++++++++++++++++++++++++
1 file changed, 121 insertions(+)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index b5016afe6a19..d3fa5a77de80 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -25,6 +25,9 @@ static DEFINE_MUTEX(em_pd_mutex);

static void em_cpufreq_update_efficiencies(struct device *dev,
struct em_perf_state *table);
+static void em_check_capacity_update(void);
+static void em_update_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(em_update_work, em_update_workfn);

static bool _is_cpu_device(struct device *dev)
{
@@ -596,6 +599,10 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,

unlock:
mutex_unlock(&em_pd_mutex);
+
+ if (_is_cpu_device(dev))
+ em_check_capacity_update();
+
return ret;
}
EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
@@ -631,3 +638,117 @@ void em_dev_unregister_perf_domain(struct device *dev)
mutex_unlock(&em_pd_mutex);
}
EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain);
+
+/*
+ * Adjustment of CPU performance values after boot, when all CPUs capacities
+ * are correctly calculated.
+ */
+static void em_adjust_new_capacity(struct device *dev,
+ struct em_perf_domain *pd,
+ u64 max_cap)
+{
+ struct em_perf_table __rcu *runtime_table;
+ struct em_perf_state *table, *new_table;
+ int ret, table_size;
+
+ runtime_table = em_allocate_table(pd);
+ if (!runtime_table) {
+ dev_warn(dev, "EM: allocation failed\n");
+ return;
+ }
+
+ new_table = runtime_table->state;
+
+ table = em_get_table(pd);
+ /* Initialize data based on older runtime table */
+ table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
+ memcpy(new_table, table, table_size);
+
+ em_put_table();
+
+ em_init_performance(dev, pd, new_table, pd->nr_perf_states);
+ ret = em_compute_costs(dev, new_table, NULL, pd->nr_perf_states,
+ pd->flags);
+ if (ret) {
+ em_free_table(runtime_table);
+ return;
+ }
+
+ ret = em_dev_update_perf_domain(dev, runtime_table);
+ if (ret)
+ dev_warn(dev, "EM: update failed %d\n", ret);
+
+ /*
+ * This is one-time-update, so give up the ownership in this updater.
+ * The EM fwk will keep the reference and free the memory when needed.
+ */
+ em_free_table(runtime_table);
+}
+
+static void em_check_capacity_update(void)
+{
+ cpumask_var_t cpu_done_mask;
+ struct em_perf_state *table;
+ struct em_perf_domain *pd;
+ unsigned long cpu_capacity;
+ int cpu;
+
+ if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
+ pr_warn("no free memory\n");
+ return;
+ }
+
+ /* Check if CPUs capacity has changed, then update EM */
+ for_each_possible_cpu(cpu) {
+ struct cpufreq_policy *policy;
+ unsigned long em_max_perf;
+ struct device *dev;
+ int nr_states;
+
+ if (cpumask_test_cpu(cpu, cpu_done_mask))
+ continue;
+
+ policy = cpufreq_cpu_get(cpu);
+ if (!policy) {
+ pr_debug("Accessing cpu%d policy failed\n", cpu);
+ schedule_delayed_work(&em_update_work,
+ msecs_to_jiffies(1000));
+ break;
+ }
+ cpufreq_cpu_put(policy);
+
+ pd = em_cpu_get(cpu);
+ if (!pd || em_is_artificial(pd))
+ continue;
+
+ cpumask_or(cpu_done_mask, cpu_done_mask,
+ em_span_cpus(pd));
+
+ nr_states = pd->nr_perf_states;
+ cpu_capacity = arch_scale_cpu_capacity(cpu);
+
+ table = em_get_table(pd);
+ em_max_perf = table[pd->nr_perf_states - 1].performance;
+ em_put_table();
+
+ /*
+ * Check if the CPU capacity has been adjusted during boot
+ * and trigger the update for new performance values.
+ */
+ if (em_max_perf == cpu_capacity)
+ continue;
+
+ pr_debug("updating cpu%d cpu_cap=%lu old capacity=%lu\n",
+ cpu, cpu_capacity, em_max_perf);
+
+ dev = get_cpu_device(cpu);
+ em_adjust_new_capacity(dev, pd, cpu_capacity);
+ }
+
+ free_cpumask_var(cpu_done_mask);
+}
+
+static void em_update_workfn(struct work_struct *work)
+{
+ em_check_capacity_update();
+}
--
2.25.1

2023-11-29 11:10:20

by Lukasz Luba

Subject: [PATCH v5 16/23] powercap/dtpm_cpu: Use new Energy Model interface to get table

The Energy Model framework supports runtime modification of the power
values. Use the new EM table API which is protected with RCU. Align the
code so that this RCU read section is short.

This change is not expected to alter the general functionality.
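
The power-limit lookup in this driver amounts to finding the highest performance state whose power, multiplied by the number of online CPUs, still fits the budget. A minimal userspace sketch, with hypothetical names `struct limit_perf_state` and `highest_state_within()`. The clamp to index 0 when even the lowest state exceeds the limit is an assumption added here for safety; it is not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct em_perf_state. */
struct limit_perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* uW, for one CPU */
};

/*
 * Return the index of the highest state whose total power for nr_cpus
 * CPUs does not exceed 'limit'. The table is sorted by ascending power.
 * If even state 0 is over the limit, state 0 is returned (assumption).
 */
static int highest_state_within(const struct limit_perf_state *table,
				int nr_states, unsigned long long limit,
				int nr_cpus)
{
	int i;

	for (i = 0; i < nr_states; i++) {
		if ((unsigned long long)table[i].power * nr_cpus > limit)
			break;
	}

	return i > 0 ? i - 1 : 0;
}
```

In the driver, both the frequency and the resulting power limit are read from the selected state before the read section is closed, which is why the patch moves the power_limit assignment above em_put_table().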

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/powercap/dtpm_cpu.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index 8a2f18fa3faf..45bb7e2849d7 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -42,6 +42,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
{
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
struct em_perf_domain *pd = em_cpu_get(dtpm_cpu->cpu);
+ struct em_perf_state *table;
struct cpumask cpus;
unsigned long freq;
u64 power;
@@ -50,20 +51,21 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
nr_cpus = cpumask_weight(&cpus);

+ table = em_get_table(pd);
for (i = 0; i < pd->nr_perf_states; i++) {

- power = pd->table[i].power * nr_cpus;
+ power = table[i].power * nr_cpus;

if (power > power_limit)
break;
}

- freq = pd->table[i - 1].frequency;
+ freq = table[i - 1].frequency;
+ power_limit = table[i - 1].power * nr_cpus;
+ em_put_table();

freq_qos_update_request(&dtpm_cpu->qos_req, freq);

- power_limit = pd->table[i - 1].power * nr_cpus;
-
return power_limit;
}

@@ -87,9 +89,11 @@ static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
static u64 get_pd_power_uw(struct dtpm *dtpm)
{
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
+ struct em_perf_state *table;
struct em_perf_domain *pd;
struct cpumask *pd_mask;
unsigned long freq;
+ u64 power = 0;
int i;

pd = em_cpu_get(dtpm_cpu->cpu);
@@ -98,33 +102,41 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)

freq = cpufreq_quick_get(dtpm_cpu->cpu);

+ table = em_get_table(pd);
for (i = 0; i < pd->nr_perf_states; i++) {

- if (pd->table[i].frequency < freq)
+ if (table[i].frequency < freq)
continue;

- return scale_pd_power_uw(pd_mask, pd->table[i].power);
+ power = scale_pd_power_uw(pd_mask, table[i].power);
+ break;
}
+ em_put_table();

- return 0;
+ return power;
}

static int update_pd_power_uw(struct dtpm *dtpm)
{
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
struct em_perf_domain *em = em_cpu_get(dtpm_cpu->cpu);
+ struct em_perf_state *table;
struct cpumask cpus;
int nr_cpus;

cpumask_and(&cpus, cpu_online_mask, to_cpumask(em->cpus));
nr_cpus = cpumask_weight(&cpus);

- dtpm->power_min = em->table[0].power;
+ table = em_get_table(em);
+
+ dtpm->power_min = table[0].power;
dtpm->power_min *= nr_cpus;

- dtpm->power_max = em->table[em->nr_perf_states - 1].power;
+ dtpm->power_max = table[em->nr_perf_states - 1].power;
dtpm->power_max *= nr_cpus;

+ em_put_table();
+
return 0;
}

@@ -178,6 +190,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
{
struct dtpm_cpu *dtpm_cpu;
struct cpufreq_policy *policy;
+ struct em_perf_state *table;
struct em_perf_domain *pd;
char name[CPUFREQ_NAME_LEN];
int ret = -ENOMEM;
@@ -210,9 +223,11 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
if (ret)
goto out_kfree_dtpm_cpu;

+ table = em_get_table(pd);
ret = freq_qos_add_request(&policy->constraints,
&dtpm_cpu->qos_req, FREQ_QOS_MAX,
- pd->table[pd->nr_perf_states - 1].frequency);
+ table[pd->nr_perf_states - 1].frequency);
+ em_put_table();
if (ret)
goto out_dtpm_unregister;

--
2.25.1

2023-11-29 11:10:24

by Lukasz Luba

Subject: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

The Energy Model (EM) can be modified at runtime which brings new
possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
(EAS) in its hot path. The energy calculation uses power value for
a given performance state (ps) and the CPU busy time as percentage for that
given frequency, which effectively is:

    pd_nrg = ps->power * busy_time_pct                        (1)

                        cpu_util
    busy_time_pct = -----------------                         (2)
                     ps->performance

The 'ps->performance' is the CPU capacity (performance) at that given ps.
Thus, in a situation when the OS is not overloaded and we have EAS
working, the busy time is lower than 'ps->performance' that the CPU is
running at. Therefore, in longer scheduling period we can treat the power
value calculated above as the energy.

We can optimize the last arithmetic operation in em_cpu_energy() and
remove the division. This can be done because em_perf_state::cost, which
is a special coefficient, can now hold the pre-calculated value including
the 'ps->performance' information for a performance state (ps):

                  ps->power
    ps->cost = ---------------                                (3)
               ps->performance

In the past the 'ps->performance' had to be calculated at runtime every
time the em_cpu_energy() was called. Thus there was this formula involved:

                        ps->freq
    ps->performance = -------------- * scale_cpu              (4)
                       cpu_max_freq

When we inject (4) into (2), we get this equation:

                     cpu_util * cpu_max_freq
    busy_time_pct = -------------------------                 (5)
                      ps->freq * scale_cpu

Because the right 'scale_cpu' value wasn't ready during the boot time
and EM initialization, we had to perform the division by 'scale_cpu'
at runtime. There was no safe mechanism to update the EM at runtime.
That has changed thanks to the EM runtime modification feature.

It is possible to avoid the division by 'scale_cpu' at runtime, because
EM is updated whenever new max capacity CPU is set in the system or after
the boot has finished and proper CPU capacity is ready.

Use that feature and do the needed division during the calculation of the
coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
multiplied simply by utilization:

    pd_nrg = ps->cost * \Sum cpu_util                         (6)

to get the needed energy for whole Performance Domain (PD).

With this optimization, the em_cpu_energy() should run faster on the Big
CPU by 1.43x and on the Little CPU by 1.69x.
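
The pre-computed coefficient and the resulting multiply-only energy estimate can be sketched in userspace C. `struct cost_perf_state`, `compute_costs()` and `domain_energy()` are hypothetical names. The factor of 10 follows the resolution scaling used in the em_compute_costs() change in this patch, and the sketch assumes every 'performance' value is non-zero.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct em_perf_state. */
struct cost_perf_state {
	unsigned long performance;
	unsigned long power;		/* uW */
	unsigned long cost;
};

/*
 * Pre-compute cost = power * 10 / performance for every state.
 * Assumes 'performance' is non-zero for all entries.
 */
static void compute_costs(struct cost_perf_state *table, size_t nr_states)
{
	size_t i;

	for (i = 0; i < nr_states; i++)
		table[i].cost = table[i].power * 10 / table[i].performance;
}

/* Estimate for the whole domain: one multiplication, no division. */
static unsigned long domain_energy(const struct cost_perf_state *ps,
				   unsigned long sum_util)
{
	return ps->cost * sum_util;
}
```

For a state with power 512000 uW and performance 512, the cost is 10000. With a summed utilization of 300 the estimate is 3000000, expressed in the same 10x-scaled unit as the cost.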

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 68 +++++-------------------------------
kernel/power/energy_model.c | 7 ++--
2 files changed, 12 insertions(+), 63 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e30750500b10..0f5621898a81 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -115,27 +115,6 @@ struct em_perf_domain {
#define EM_MAX_NUM_CPUS 16
#endif

-/*
- * To avoid an overflow on 32bit machines while calculating the energy
- * use a different order in the operation. First divide by the 'cpu_scale'
- * which would reduce big value stored in the 'cost' field, then multiply by
- * the 'sum_util'. This would allow to handle existing platforms, which have
- * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
- * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
- * could be 4096, then multiplication: 'cost' * 'sum_util' would overflow.
- * This reordering of operations has some limitations, we lose small
- * precision in the estimation (comparing to 64bit platform w/o reordering).
- *
- * We are safe on 64bit machine.
- */
-#ifdef CONFIG_64BIT
-#define em_estimate_energy(cost, sum_util, scale_cpu) \
- (((cost) * (sum_util)) / (scale_cpu))
-#else
-#define em_estimate_energy(cost, sum_util, scale_cpu) \
- (((cost) / (scale_cpu)) * (sum_util))
-#endif
-
struct em_data_callback {
/**
* active_power() - Provide power at the next performance state of
@@ -249,29 +228,16 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
{
struct em_perf_table *runtime_table;
struct em_perf_state *ps;
- unsigned long scale_cpu;
- int cpu, i;
+ int i;

if (!sum_util)
return 0;

- /*
- * In order to predict the performance state, map the utilization of
- * the most utilized CPU of the performance domain to a requested
- * frequency, like schedutil. Take also into account that the real
- * frequency might be set lower (due to thermal capping). Thus, clamp
- * max utilization to the allowed CPU capacity before calculating
- * effective frequency.
- */
- cpu = cpumask_first(to_cpumask(pd->cpus));
- scale_cpu = arch_scale_cpu_capacity(cpu);
-
/*
* No rcu_read_lock() since it's already called by task scheduler.
* The runtime_table is always there for CPUs, so we don't check.
*/
runtime_table = rcu_dereference(pd->runtime_table);
-
ps = &runtime_table->state[pd->nr_perf_states - 1];

max_util = map_util_perf(max_util);
@@ -286,35 +252,21 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
ps = &runtime_table->state[i];

/*
- * The capacity of a CPU in the domain at the performance state (ps)
- * can be computed as:
- *
- * ps->freq * scale_cpu
- * ps->cap = -------------------- (1)
- * cpu_max_freq
- *
- * So, ignoring the costs of idle states (which are not available in
- * the EM), the energy consumed by this CPU at that performance state
+ * The energy consumed by the CPU at the given performance state (ps)
* is estimated as:
*
- * ps->power * cpu_util
- * cpu_nrg = -------------------- (2)
- * ps->cap
+ * ps->power
+ * cpu_nrg = --------------- * cpu_util (1)
+ * ps->performance
*
- * since 'cpu_util / ps->cap' represents its percentage of busy time.
+ * The 'cpu_util / ps->performance' represents its percentage of
+ * busy time. The idle cost is ignored (it's not available in the EM).
*
* NOTE: Although the result of this computation actually is in
* units of power, it can be manipulated as an energy value
* over a scheduling period, since it is assumed to be
* constant during that interval.
*
- * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
- * of two terms:
- *
- * ps->power * cpu_max_freq cpu_util
- * cpu_nrg = ------------------------ * --------- (3)
- * ps->freq scale_cpu
- *
* The first term is static, and is stored in the em_perf_state struct
* as 'ps->cost'.
*
@@ -323,11 +275,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
* total energy of the domain (which is the simple sum of the energy of
* all of its CPUs) can be factorized as:
*
- * ps->cost * \Sum cpu_util
- * pd_nrg = ------------------------ (4)
- * scale_cpu
+ * pd_nrg = ps->cost * \Sum cpu_util (2)
*/
- return em_estimate_energy(ps->cost, sum_util, scale_cpu);
+ return ps->cost * sum_util;
}

/**
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index d3fa5a77de80..c6e5f35a5129 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -204,11 +204,9 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
unsigned long flags)
{
unsigned long prev_cost = ULONG_MAX;
- u64 fmax;
int i, ret;

/* Compute the cost of each performance state. */
- fmax = (u64) table[nr_states - 1].frequency;
for (i = nr_states - 1; i >= 0; i--) {
unsigned long power_res, cost;

@@ -220,8 +218,9 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
return -EINVAL;
}
} else {
- power_res = table[i].power;
- cost = div64_u64(fmax * power_res, table[i].frequency);
+ /* increase resolution of 'cost' precision */
+ power_res = table[i].power * 10;
+ cost = power_res / table[i].performance;
}

table[i].cost = cost;
--
2.25.1

2023-11-29 11:11:06

by Lukasz Luba

Subject: [PATCH v5 19/23] drivers/thermal/devfreq_cooling: Use new Energy Model interface

The Energy Model framework supports runtime modification of the power
values. Use the new EM table API which is protected with RCU. Align the
code so that this RCU read section is short.

This change is not expected to alter the general functionality.
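
The cooling code indexes the EM table in reverse: cooling state 0 means no throttling and maps to the highest performance state. A minimal userspace sketch of that mapping, with hypothetical names `struct cool_perf_state` and `cooling_state_to_hz()`. Returning 0 for an out-of-range state is an assumption made for this sketch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct em_perf_state. */
struct cool_perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* uW */
};

/*
 * Map a cooling state to a frequency in Hz. The table is sorted by
 * ascending frequency and has max_state + 1 entries, so state 0 selects
 * the last entry. Returns 0 if 'state' is out of range (assumption).
 */
static unsigned long cooling_state_to_hz(const struct cool_perf_state *table,
					 unsigned long max_state,
					 unsigned long state)
{
	if (state > max_state)
		return 0;

	return table[max_state - state].frequency * 1000;
}
```

In the driver the table lookup sits between em_get_table() and em_put_table(), and only the copied frequency value is used afterwards.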

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/thermal/devfreq_cooling.c | 43 ++++++++++++++++++++++++-------
1 file changed, 34 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 262e62ab6cf2..b7aed5a3810e 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -87,6 +87,7 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
struct devfreq_cooling_device *dfc = cdev->devdata;
struct devfreq *df = dfc->devfreq;
struct device *dev = df->dev.parent;
+ struct em_perf_state *table;
unsigned long freq;
int perf_idx;

@@ -100,7 +101,10 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,

if (dfc->em_pd) {
perf_idx = dfc->max_state - state;
- freq = dfc->em_pd->table[perf_idx].frequency * 1000;
+
+ table = em_get_table(dfc->em_pd);
+ freq = table[perf_idx].frequency * 1000;
+ em_put_table();
} else {
freq = dfc->freq_table[state];
}
@@ -123,14 +127,20 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
*/
static int get_perf_idx(struct em_perf_domain *em_pd, unsigned long freq)
{
- int i;
+ struct em_perf_state *table;
+ int i, idx = -EINVAL;

+ table = em_get_table(em_pd);
for (i = 0; i < em_pd->nr_perf_states; i++) {
- if (em_pd->table[i].frequency == freq)
- return i;
+ if (table[i].frequency != freq)
+ continue;
+
+ idx = i;
+ break;
}
+ em_put_table();

- return -EINVAL;
+ return idx;
}

static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
@@ -181,6 +191,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
struct devfreq_cooling_device *dfc = cdev->devdata;
struct devfreq *df = dfc->devfreq;
struct devfreq_dev_status status;
+ struct em_perf_state *table;
unsigned long state;
unsigned long freq;
unsigned long voltage;
@@ -204,7 +215,10 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
state = dfc->capped_state;

/* Convert EM power into milli-Watts first */
- dfc->res_util = dfc->em_pd->table[state].power;
+ table = em_get_table(dfc->em_pd);
+ dfc->res_util = table[state].power;
+ em_put_table();
+
dfc->res_util /= MICROWATT_PER_MILLIWATT;

dfc->res_util *= SCALE_ERROR_MITIGATION;
@@ -225,7 +239,10 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
_normalize_load(&status);

/* Convert EM power into milli-Watts first */
- *power = dfc->em_pd->table[perf_idx].power;
+ table = em_get_table(dfc->em_pd);
+ *power = table[perf_idx].power;
+ em_put_table();
+
*power /= MICROWATT_PER_MILLIWATT;
/* Scale power for utilization */
*power *= status.busy_time;
@@ -245,13 +262,18 @@ static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,
unsigned long state, u32 *power)
{
struct devfreq_cooling_device *dfc = cdev->devdata;
+ struct em_perf_state *table;
int perf_idx;

if (state > dfc->max_state)
return -EINVAL;

perf_idx = dfc->max_state - state;
- *power = dfc->em_pd->table[perf_idx].power;
+
+ table = em_get_table(dfc->em_pd);
+ *power = table[perf_idx].power;
+ em_put_table();
+
*power /= MICROWATT_PER_MILLIWATT;

return 0;
@@ -264,6 +286,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
struct devfreq *df = dfc->devfreq;
struct devfreq_dev_status status;
unsigned long freq, em_power_mw;
+ struct em_perf_state *table;
s32 est_power;
int i;

@@ -288,13 +311,15 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
* Find the first cooling state that is within the power
* budget. The EM power table is sorted ascending.
*/
+ table = em_get_table(dfc->em_pd);
for (i = dfc->max_state; i > 0; i--) {
/* Convert EM power to milli-Watts to make safe comparison */
- em_power_mw = dfc->em_pd->table[i].power;
+ em_power_mw = table[i].power;
em_power_mw /= MICROWATT_PER_MILLIWATT;
if (est_power >= em_power_mw)
break;
}
+ em_put_table();

*state = dfc->max_state - i;
dfc->capped_state = *state;
--
2.25.1

2023-11-29 11:11:23

by Lukasz Luba

Subject: [PATCH v5 18/23] drivers/thermal/cpufreq_cooling: Use new Energy Model interface

The Energy Model framework supports runtime modification of the power
values. Use the new EM table API which is protected with RCU. Align the
code so that this RCU read section is short.

This change is not expected to alter the general functionality.

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/thermal/cpufreq_cooling.c | 40 ++++++++++++++++++++++++-------
1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index e2cc7bd30862..c32d8dfa4fff 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -91,12 +91,15 @@ struct cpufreq_cooling_device {
static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
unsigned int freq)
{
+ struct em_perf_state *table;
int i;

+ table = em_get_table(cpufreq_cdev->em);
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
- if (freq > cpufreq_cdev->em->table[i].frequency)
+ if (freq > table[i].frequency)
break;
}
+ em_put_table();

return cpufreq_cdev->max_level - i - 1;
}
@@ -104,16 +107,19 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
u32 freq)
{
+ struct em_perf_state *table;
unsigned long power_mw;
int i;

+ table = em_get_table(cpufreq_cdev->em);
for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
- if (freq > cpufreq_cdev->em->table[i].frequency)
+ if (freq > table[i].frequency)
break;
}

- power_mw = cpufreq_cdev->em->table[i + 1].power;
+ power_mw = table[i + 1].power;
power_mw /= MICROWATT_PER_MILLIWATT;
+ em_put_table();

return power_mw;
}
@@ -121,18 +127,23 @@ static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
u32 power)
{
+ struct em_perf_state *table;
unsigned long em_power_mw;
+ u32 freq;
int i;

+ table = em_get_table(cpufreq_cdev->em);
for (i = cpufreq_cdev->max_level; i > 0; i--) {
/* Convert EM power to milli-Watts to make safe comparison */
- em_power_mw = cpufreq_cdev->em->table[i].power;
+ em_power_mw = table[i].power;
em_power_mw /= MICROWATT_PER_MILLIWATT;
if (power >= em_power_mw)
break;
}
+ freq = table[i].frequency;
+ em_put_table();

- return cpufreq_cdev->em->table[i].frequency;
+ return freq;
}

/**
@@ -262,8 +273,9 @@ static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
static int cpufreq_state2power(struct thermal_cooling_device *cdev,
unsigned long state, u32 *power)
{
- unsigned int freq, num_cpus, idx;
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+ unsigned int freq, num_cpus, idx;
+ struct em_perf_state *table;

/* Request state should be less than max_level */
if (state > cpufreq_cdev->max_level)
@@ -272,7 +284,11 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
num_cpus = cpumask_weight(cpufreq_cdev->policy->cpus);

idx = cpufreq_cdev->max_level - state;
- freq = cpufreq_cdev->em->table[idx].frequency;
+
+ table = em_get_table(cpufreq_cdev->em);
+ freq = table[idx].frequency;
+ em_put_table();
+
*power = cpu_freq_to_power(cpufreq_cdev, freq) * num_cpus;

return 0;
@@ -378,8 +394,16 @@ static unsigned int get_state_freq(struct cpufreq_cooling_device *cpufreq_cdev,
#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
/* Use the Energy Model table if available */
if (cpufreq_cdev->em) {
+ struct em_perf_state *table;
+ unsigned int freq;
+
idx = cpufreq_cdev->max_level - state;
- return cpufreq_cdev->em->table[idx].frequency;
+
+ table = em_get_table(cpufreq_cdev->em);
+ freq = table[idx].frequency;
+ em_put_table();
+
+ return freq;
}
#endif

--
2.25.1
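
The pattern this patch applies in cpu_power_to_freq() can be sketched in
plain userspace C: scan the table inside the read-side section, copy the
needed value into a local variable, and only then leave the section. This is
a minimal sketch, not part of the patch. 'struct perf_state' and
power_to_freq() are hypothetical stand-ins, and the comments only mark where
em_get_table() and em_put_table() would sit in the kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MICROWATT_PER_MILLIWATT 1000UL

/* Simplified stand-in for the kernel's struct em_perf_state. */
struct perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* micro-Watts */
};

/*
 * Return the highest frequency whose power (converted to mW) fits within
 * power_mw. 'table' is sorted ascending and has max_level + 1 entries.
 * The loop stops at i == 0, so the lowest state is used when nothing fits.
 */
static unsigned long power_to_freq(const struct perf_state *table,
				   int max_level, unsigned long power_mw)
{
	unsigned long freq;
	int i;

	assert(table != NULL && max_level >= 0);

	/* The read-side section would begin here (em_get_table()). */
	for (i = max_level; i > 0; i--) {
		unsigned long em_power_mw;

		/* Convert EM power to milli-Watts before comparing. */
		em_power_mw = table[i].power / MICROWATT_PER_MILLIWATT;
		if (power_mw >= em_power_mw)
			break;
	}

	/* Copy the result out while the table is still known to be valid. */
	freq = table[i].frequency;
	/* The read-side section would end here (em_put_table()). */

	return freq;
}
```

Returning a local copy rather than indexing the table again after the section
ends is the reason the patch adds the 'freq' variable.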

2023-11-29 11:11:37

by Lukasz Luba

Subject: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design

Add a new 'Design' section which describes the Energy Model: the design
decisions, the models and how they reflect reality. Remove the description
of the default EM. Renumber the other sections. Document the new feature
which allows the EM to be modified at runtime.

Signed-off-by: Lukasz Luba <[email protected]>
---
Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
1 file changed, 196 insertions(+), 10 deletions(-)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index 13225965c9a4..1f8cf36914b1 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
domains can have different micro-architectures.


-2. Core APIs
+2. Design
+-----------------
+
+2.1 Runtime modifiable EM
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To better reflect power variation due to static power (leakage), the EM
+supports runtime modification of the power values. The mechanism relies on
+RCU to free the modifiable EM perf_state table memory. Its user, the task
+scheduler, also uses RCU to access this memory. The EM framework provides
+an API for allocating and freeing the memory for the modifiable EM table.
+The old memory is freed automatically using the RCU callback mechanism when
+the given EM runtime table instance has no owners left. Ownership is tracked
+using the kref mechanism. The device driver which provided the new EM at
+runtime should call the EM API to free it safely when it is no longer needed.
+The EM framework will handle the clean-up when it is possible.
+
+The kernel code which wants to modify the EM values is protected from
+concurrent access using a mutex. Therefore, the device driver code must run
+in a sleepable context when it tries to modify the EM.
+
+With the runtime modifiable EM we switch from a 'single EM which is static
+during the entire runtime' (system property) design to a 'single EM which
+can be changed during runtime according to e.g. the workload' (system and
+workload property) design.
+
+It is also possible to modify the CPU performance values for each EM's
+performance state. Thus, the full power and performance profile (which
+is an exponential curve) can be changed according to e.g. the workload
+or system property.
+
+
+3. Core APIs
------------

-2.1 Config options
+3.1 Config options
^^^^^^^^^^^^^^^^^^

CONFIG_ENERGY_MODEL must be enabled to use the EM framework.


-2.2 Registration of performance domains
+3.2 Registration of performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Registration of 'advanced' EM
@@ -110,8 +142,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel
subsystems which use EM might rely on this flag to check if all EM devices use
the same scale. If there are different scales, these subsystems might decide
to return warning/error, stop working or panic.
-See Section 3. for an example of driver implementing this
-callback, or Section 2.4 for further documentation on this API
+See Section 4. for an example of driver implementing this
+callback, or Section 3.4 for further documentation on this API

Registration of EM using DT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -156,7 +188,7 @@ The EM which is registered using this method might not reflect correctly the
physics of a real device, e.g. when static power (leakage) is important.


-2.3 Accessing performance domains
+3.3 Accessing performance domains
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two API functions which provide the access to the energy model:
@@ -175,10 +207,83 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
not provided for other type of devices.

More details about the above APIs can be found in ``<linux/energy_model.h>``
-or in Section 2.4
+or in Section 3.5
+
+
+3.4 Runtime modifications
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Drivers willing to update the EM at runtime should use the following dedicated
+function to allocate a new instance of the modified EM. The API is listed
+below::
+
+ struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
+
+This allocates a structure which contains the new EM table together with
+the RCU and kref members needed by the EM framework. The 'struct em_perf_table'
+contains the array 'struct em_perf_state state[]', which is a list of
+performance states in ascending order. That list must be populated by the
+device driver which wants to update the EM. The list of frequencies can be
+taken from the existing EM (created during boot). The content of each
+'struct em_perf_state' must be populated by the driver as well.
+
+This is the API which does the EM update, using an RCU pointer swap::
+
+ int em_dev_update_perf_domain(struct device *dev,
+ struct em_perf_table __rcu *new_table);
+
+Drivers must provide a pointer to the allocated and initialized new EM
+'struct em_perf_table'. That new EM will be safely used inside the EM framework
+and will be visible to other sub-systems in the kernel (thermal, powercap).
+The main design goal for this API is to be fast and avoid extra calculations
+or memory allocations at runtime. When pre-computed EMs are available in the
+device driver, then it should be possible to simply re-use them with low
+performance overhead.
+
+To free an EM provided earlier by the driver (e.g. when the module
+is unloaded), call the API::
+
+ void em_free_table(struct em_perf_table __rcu *table);
+
+It allows the EM framework to safely free the memory once no other
+sub-system, e.g. EAS, is using it.
+
+To use the power values in other sub-systems (like thermal, powercap), call
+the API which protects the reader and provides consistency of the EM
+table data::

+ struct em_perf_state *em_get_table(struct em_perf_domain *pd);

-2.4 Description details of this API
+It returns the 'struct em_perf_state' pointer which is an array of performance
+states in ascending order.
+
+When the EM table is not needed anymore, call the dedicated API::
+
+ void em_put_table(void);
+
+In this way the EM safely uses the RCU read section and protects the users.
+It also allows the EM framework to manage the memory and free it.
+
+There is a dedicated API for device drivers to calculate em_perf_state::cost
+values::
+
+ int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
+ int nr_states);
+
+These 'cost' values from the EM are used in EAS. The new EM table should be
+passed together with the number of entries and the device pointer. When the
+cost values are computed successfully, the function returns 0.
+The function also marks each inefficient performance state and updates
+em_perf_state::flags accordingly.
+The new EM prepared this way can then be passed to the
+em_dev_update_perf_domain() function, which puts it into use.
+
+More details about the above APIs can be found in ``<linux/energy_model.h>``
+or in Section 4.2, with example code showing a simple implementation of the
+updating mechanism in a device driver.
+
+
+3.5 Description details of this API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. kernel-doc:: include/linux/energy_model.h
:internal:
@@ -187,8 +292,11 @@ or in Section 2.4
:export:


-3. Example driver
------------------
+4. Examples
+-----------
+
+4.1 Example driver with EM registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The CPUFreq framework supports dedicated callback for registering
the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
@@ -242,3 +350,81 @@ EM framework::
39 static struct cpufreq_driver foo_cpufreq_driver = {
40 .register_em = foo_cpufreq_register_em,
41 };
+
+
+4.2 Example driver with EM modification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This section provides a simple example of a thermal driver modifying the EM.
+The driver implements a foo_thermal_em_update() function. The driver is woken
+up periodically to check the temperature and modify the EM data::
+
+ -> drivers/soc/example/example_em_mod.c
+
+ 01 static void foo_get_new_em(struct foo_context *ctx)
+ 02 {
+ 03 struct em_perf_table __rcu *runtime_table;
+ 04 struct em_perf_state *table, *new_table;
+ 05 struct device *dev = ctx->dev;
+ 06 struct em_perf_domain *pd;
+ 07 unsigned long freq;
+ 08 int i, ret;
+ 09
+ 10 pd = em_pd_get(dev);
+ 11 if (!pd)
+ 12 return;
+ 13
+ 14 runtime_table = em_allocate_table(pd);
+ 15 if (!runtime_table)
+ 16 return;
+ 17
+ 18 new_table = runtime_table->state;
+ 19
+ 20 table = em_get_table(pd);
+ 21 for (i = 0; i < pd->nr_perf_states; i++) {
+ 22 freq = table[i].frequency;
+ 23 foo_get_power_perf_values(dev, freq, &new_table[i]);
+ 24 }
+ 25 em_put_table();
+ 26
+ 27 /* Calculate 'cost' values for EAS */
+ 28 ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
+ 29 if (ret) {
+ 30 dev_warn(dev, "EM: compute costs failed %d\n", ret);
+ 31 em_free_table(runtime_table);
+ 32 return;
+ 33 }
+ 34
+ 35 ret = em_dev_update_perf_domain(dev, runtime_table);
+ 36 if (ret) {
+ 37 dev_warn(dev, "EM: update failed %d\n", ret);
+ 38 em_free_table(runtime_table);
+ 39 return;
+ 40 }
+ 41
+ 42 ctx->runtime_table = runtime_table;
+ 43 }
+ 44
+ 45 /*
+ 46 * Function called periodically to check the temperature and
+ 47 * update the EM if needed
+ 48 */
+ 49 static void foo_thermal_em_update(struct foo_context *ctx)
+ 50 {
+ 51 struct device *dev = ctx->dev;
+ 52
+ 53 ctx->temperature = foo_get_temp(dev, ctx);
+ 54 if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
+ 55 return;
+ 56
+ 57 foo_get_new_em(ctx);
+ 58 }
+ 59
+ 60 static void foo_exit(void)
+ 61 {
+ 62 struct foo_context *ctx = glob_ctx;
+ 63
+ 64 em_free_table(ctx->runtime_table);
+ 65 }
+ 66
+ 67 module_exit(foo_exit);
--
2.25.1
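
The ownership rules described in the 'Runtime modifiable EM' section above
can be modelled in single-threaded userspace C. This is a rough sketch under
stated assumptions, not the kernel implementation: a plain integer stands in
for the kref, free() stands in for the deferred RCU callback, and
table_alloc(), table_put(), domain_update() and the 'tables_freed' counter are
hypothetical names. The assumption that the update takes an extra reference
on behalf of the performance domain is a reading of the documentation text,
not something shown in this excerpt.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures. */
struct perf_state {
	unsigned long frequency;
	unsigned long power;
};

struct perf_table {
	int refcount;			/* stands in for struct kref */
	int nr_states;
	struct perf_state state[];	/* table lives in the same allocation */
};

struct perf_domain {
	struct perf_table *runtime_table;	/* an __rcu pointer in the kernel */
	int nr_perf_states;
};

static int tables_freed;	/* instrumentation for this sketch only */

/* Allocate a table; the caller (the driver) holds the first reference. */
static struct perf_table *table_alloc(int nr_states)
{
	struct perf_table *t;

	if (nr_states <= 0)
		return NULL;

	t = calloc(1, sizeof(*t) + nr_states * sizeof(t->state[0]));
	if (!t)
		return NULL;

	t->refcount = 1;
	t->nr_states = nr_states;
	return t;
}

/* Drop one reference; the memory goes away with the last owner. */
static void table_put(struct perf_table *t)
{
	if (!t)
		return;

	assert(t->refcount > 0);
	if (--t->refcount == 0) {
		free(t);	/* the kernel would defer this via an RCU callback */
		tables_freed++;
	}
}

/* Swap the domain's table pointer and release the domain's old reference. */
static int domain_update(struct perf_domain *pd, struct perf_table *new_table)
{
	struct perf_table *old;

	if (!pd || !new_table)
		return -1;

	new_table->refcount++;		/* the domain becomes an owner */
	old = pd->runtime_table;
	pd->runtime_table = new_table;	/* rcu_assign_pointer() in the kernel */
	table_put(old);

	return 0;
}
```

In this model a table stays alive as long as either the driver or the domain
still holds a reference, which is why the driver can drop its own reference at
module exit without caring whether the table is currently in use.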

2023-11-29 11:11:39

by Lukasz Luba

Subject: [PATCH v5 21/23] PM: EM: Remove old table

Remove the old EM table, which could not be modified at runtime. Remove the
functions that are no longer needed and refactor the code a bit.

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 2 --
kernel/power/energy_model.c | 47 ++++++------------------------------
2 files changed, 8 insertions(+), 41 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 0f5621898a81..9c47388482a0 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -53,7 +53,6 @@ struct em_perf_table {

/**
* struct em_perf_domain - Performance domain
- * @table: List of performance states, in ascending order
* @runtime_table: Pointer to the runtime modifiable em_perf_table
* @nr_perf_states: Number of performance states
* @flags: See "em_perf_domain flags"
@@ -69,7 +68,6 @@ struct em_perf_table {
* field is unused.
*/
struct em_perf_domain {
- struct em_perf_state *table;
struct em_perf_table __rcu *runtime_table;
int nr_perf_states;
unsigned long flags;
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index cc47993b4d64..234823c0e59d 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -286,17 +286,6 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
return 0;
}

-static int em_allocate_perf_table(struct em_perf_domain *pd,
- int nr_states)
-{
- pd->table = kcalloc(nr_states, sizeof(struct em_perf_state),
- GFP_KERNEL);
- if (!pd->table)
- return -ENOMEM;
-
- return 0;
-}
-
/**
* em_dev_update_perf_domain() - Update runtime EM table for a device
* @dev : Device for which the EM is to be updated
@@ -343,24 +332,6 @@ int em_dev_update_perf_domain(struct device *dev,
}
EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);

-static int em_create_runtime_table(struct em_perf_domain *pd)
-{
- struct em_perf_table __rcu *runtime_table;
- int table_size;
-
- runtime_table = em_allocate_table(pd);
- if (!runtime_table)
- return -ENOMEM;
-
- /* Initialize runtime table with existing data */
- table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
- memcpy(runtime_table->state, pd->table, table_size);
-
- rcu_assign_pointer(pd->runtime_table, runtime_table);
-
- return 0;
-}
-
static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
struct em_perf_state *table,
int nr_states, struct em_data_callback *cb,
@@ -420,6 +391,7 @@ static int em_create_pd(struct device *dev, int nr_states,
struct em_data_callback *cb, cpumask_t *cpus,
unsigned long flags)
{
+ struct em_perf_table __rcu *runtime_table;
struct em_perf_domain *pd;
struct device *cpu_dev;
int cpu, ret, num_cpus;
@@ -446,17 +418,16 @@ static int em_create_pd(struct device *dev, int nr_states,

pd->nr_perf_states = nr_states;

- ret = em_allocate_perf_table(pd, nr_states);
- if (ret)
+ runtime_table = em_allocate_table(pd);
+ if (!runtime_table)
goto free_pd;

- ret = em_create_perf_table(dev, pd, pd->table, nr_states, cb, flags);
+ ret = em_create_perf_table(dev, pd, runtime_table->state,
+ nr_states, cb, flags);
if (ret)
goto free_pd_table;

- ret = em_create_runtime_table(pd);
- if (ret)
- goto free_pd_table;
+ rcu_assign_pointer(pd->runtime_table, runtime_table);

if (_is_cpu_device(dev))
for_each_cpu(cpu, cpus) {
@@ -469,7 +440,7 @@ static int em_create_pd(struct device *dev, int nr_states,
return 0;

free_pd_table:
- kfree(pd->table);
+ kfree(runtime_table);
free_pd:
kfree(pd);
return -EINVAL;
@@ -640,7 +611,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,

dev->em_pd->flags |= flags;

- em_cpufreq_update_efficiencies(dev, dev->em_pd->table);
+ em_cpufreq_update_efficiencies(dev, dev->em_pd->runtime_table->state);

em_debug_create_pd(dev);
dev_info(dev, "EM: created perf domain\n");
@@ -677,8 +648,6 @@ void em_dev_unregister_perf_domain(struct device *dev)
mutex_lock(&em_pd_mutex);
em_debug_remove_pd(dev);

- kfree(dev->em_pd->table);
-
em_free_table(dev->em_pd->runtime_table);

kfree(dev->em_pd);
--
2.25.1

2023-11-29 11:11:52

by Lukasz Luba

Subject: [PATCH v5 22/23] PM: EM: Add em_dev_compute_costs() as API for device drivers

Device drivers can modify the EM at runtime by providing a new EM table.
The EM is used by EAS, and em_perf_state::cost stores a pre-calculated
value to avoid overhead. This patch provides an API for device drivers to
calculate the cost values properly (without duplicating the same code).

Signed-off-by: Lukasz Luba <[email protected]>
---
include/linux/energy_model.h | 8 ++++++++
kernel/power/energy_model.c | 18 ++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 9c47388482a0..836622b1a0a1 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -170,6 +170,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
void em_dev_unregister_perf_domain(struct device *dev);
struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
void em_free_table(struct em_perf_table __rcu *table);
+int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
+ int nr_states);

/**
* em_pd_get_efficient_state() - Get an efficient performance state from the EM
@@ -355,6 +357,12 @@ static inline struct em_perf_state *em_get_table(struct em_perf_domain *pd)
return NULL;
}
static inline void em_put_table(void) {}
+static inline
+int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
+ int nr_states)
+{
+ return -EINVAL;
+}
#endif

#endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 234823c0e59d..fadfdefbe5f0 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -286,6 +286,24 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
return 0;
}

+/**
+ * em_dev_compute_costs() - Calculate cost values for new runtime EM table
+ * @dev : Device for which the EM table is to be updated
+ * @table : The new EM table that is going to get the costs calculated
+ *
+ * Calculate the em_perf_state::cost values for new runtime EM table. The
+ * values are used for EAS during task placement. It also calculates and sets
+ * the efficiency flag for each performance state. When the function finishes
+ * successfully the EM table is ready to be updated and used by EAS.
+ *
+ * Return 0 on success or a proper error in case of failure.
+ */
+int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
+ int nr_states)
+{
+ return em_compute_costs(dev, table, NULL, nr_states, 0);
+}
+
/**
* em_dev_update_perf_domain() - Update runtime EM table for a device
* @dev : Device for which the EM is to be updated
--
2.25.1
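
What a driver gets from em_dev_compute_costs() can be illustrated with a
small userspace sketch. The body of em_compute_costs() is not shown in this
excerpt, so the formula below (power scaled by 10 and divided by the
'performance' value) and the rule for marking a state inefficient are
assumptions made for illustration only. compute_costs(), 'struct perf_state'
and PERF_STATE_INEFFICIENT are hypothetical stand-ins.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PERF_STATE_INEFFICIENT 1UL	/* stand-in for the kernel flag */

/* Simplified stand-in for the kernel's struct em_perf_state. */
struct perf_state {
	unsigned long performance;
	unsigned long frequency;
	unsigned long power;
	unsigned long cost;
	unsigned long flags;
};

/*
 * Fill in 'cost' and 'flags' for a table sorted in ascending order.
 * Returns 0 on success, -1 on invalid input.
 */
static int compute_costs(struct perf_state *table, int nr_states)
{
	unsigned long prev_cost = ULONG_MAX;
	int i;

	if (table == NULL || nr_states <= 0)
		return -1;

	/* Walk from the highest state down, tracking the cheapest cost seen. */
	for (i = nr_states - 1; i >= 0; i--) {
		if (table[i].performance == 0)
			return -1;

		/* Assumed formula: power per unit of performance, scaled by 10. */
		table[i].cost = table[i].power * 10 / table[i].performance;

		if (table[i].cost >= prev_cost) {
			/* A faster state is at least as cheap per unit of work. */
			table[i].flags = PERF_STATE_INEFFICIENT;
		} else {
			table[i].flags = 0;
			prev_cost = table[i].cost;
		}
	}

	return 0;
}
```

Because the costs are computed once per table, before the table is handed to
em_dev_update_perf_domain(), the update itself can stay a cheap pointer swap.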

2023-11-29 11:12:57

by Lukasz Luba

Subject: [PATCH v5 17/23] powercap/dtpm_devfreq: Use new Energy Model interface to get table

The Energy Model framework supports runtime modification of the power
values. Use the new EM table API, which is protected with RCU. Arrange the
code so that the RCU read-side section is short.

This change is not expected to alter the general functionality.

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/powercap/dtpm_devfreq.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/powercap/dtpm_devfreq.c b/drivers/powercap/dtpm_devfreq.c
index 612c3b59dd5b..514aa0d9d9c2 100644
--- a/drivers/powercap/dtpm_devfreq.c
+++ b/drivers/powercap/dtpm_devfreq.c
@@ -37,11 +37,15 @@ static int update_pd_power_uw(struct dtpm *dtpm)
struct devfreq *devfreq = dtpm_devfreq->devfreq;
struct device *dev = devfreq->dev.parent;
struct em_perf_domain *pd = em_pd_get(dev);
+ struct em_perf_state *table;

- dtpm->power_min = pd->table[0].power;
+ table = em_get_table(pd);

- dtpm->power_max = pd->table[pd->nr_perf_states - 1].power;
+ dtpm->power_min = table[0].power;

+ dtpm->power_max = table[pd->nr_perf_states - 1].power;
+
+ em_put_table();
return 0;
}

@@ -51,20 +55,22 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
struct devfreq *devfreq = dtpm_devfreq->devfreq;
struct device *dev = devfreq->dev.parent;
struct em_perf_domain *pd = em_pd_get(dev);
+ struct em_perf_state *table;
unsigned long freq;
int i;

+ table = em_get_table(pd);
for (i = 0; i < pd->nr_perf_states; i++) {
- if (pd->table[i].power > power_limit)
+ if (table[i].power > power_limit)
break;
}

- freq = pd->table[i - 1].frequency;
+ freq = table[i - 1].frequency;
+ power_limit = table[i - 1].power;
+ em_put_table();

dev_pm_qos_update_request(&dtpm_devfreq->qos_req, freq);

- power_limit = pd->table[i - 1].power;
-
return power_limit;
}

@@ -89,8 +95,9 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
struct device *dev = devfreq->dev.parent;
struct em_perf_domain *pd = em_pd_get(dev);
struct devfreq_dev_status status;
+ struct em_perf_state *table;
unsigned long freq;
- u64 power;
+ u64 power = 0;
int i;

mutex_lock(&devfreq->lock);
@@ -100,19 +107,21 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
freq = DIV_ROUND_UP(status.current_frequency, HZ_PER_KHZ);
_normalize_load(&status);

+ table = em_get_table(pd);
for (i = 0; i < pd->nr_perf_states; i++) {

- if (pd->table[i].frequency < freq)
+ if (table[i].frequency < freq)
continue;

- power = pd->table[i].power;
+ power = table[i].power;
power *= status.busy_time;
power >>= 10;

- return power;
+ break;
}
+ em_put_table();

- return 0;
+ return power;
}

static void pd_release(struct dtpm *dtpm)
--
2.25.1
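
The reworked control flow in get_pd_power_uw() (break out of the loop and
return after the read-side section, instead of returning from inside it) can
be sketched in userspace C. This is a minimal sketch, not part of the patch.
pd_power_uw() and 'struct perf_state' are hypothetical stand-ins, and
'busy_time' is assumed to be already normalized to a 0..1024 scale, which is
what the shift by 10 suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct em_perf_state. */
struct perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* micro-Watts */
};

/*
 * Estimate the current power: take the first state at or above 'freq'
 * and scale its power by the load. Returns 0 when no state matches.
 */
static uint64_t pd_power_uw(const struct perf_state *table, int nr_states,
			    unsigned long freq, unsigned long busy_time)
{
	uint64_t power = 0;
	int i;

	assert(table != NULL);

	/* The read-side section would begin here (em_get_table()). */
	for (i = 0; i < nr_states; i++) {
		if (table[i].frequency < freq)
			continue;

		power = table[i].power;
		power *= busy_time;
		power >>= 10;	/* busy_time is a fraction of 1024 */

		/* Leave the loop, not the function: the section must be closed. */
		break;
	}
	/* The read-side section would end here (em_put_table()). */

	return power;
}
```

Initializing 'power' to 0 keeps the old "no matching state" result while
giving the function a single exit point after the section is closed.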

2023-11-29 11:17:03

by Lukasz Luba

Subject: [PATCH v5 20/23] PM: EM: Change debugfs configuration to use runtime EM table data

Dump the runtime EM table values, which can be modified over time. To do
that, allocate a chunk of debug memory which is later freed automatically
thanks to devm_kcalloc().

This design handles the fact that the EM table memory can change after an
EM update, so the debug code cannot use the pointer from the initialization
phase.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/power/energy_model.c | 65 ++++++++++++++++++++++++++++++++-----
1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index c6e5f35a5129..cc47993b4d64 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -37,20 +37,63 @@ static bool _is_cpu_device(struct device *dev)
#ifdef CONFIG_DEBUG_FS
static struct dentry *rootdir;

-static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
+struct em_dbg_info {
+ struct em_perf_domain *pd;
+ int ps_id;
+};
+
+#define DEFINE_EM_DBG_SHOW(name, fname) \
+static int em_debug_##fname##_show(struct seq_file *s, void *unused) \
+{ \
+ struct em_dbg_info *em_dbg = s->private; \
+ struct em_perf_state *table; \
+ unsigned long val; \
+ \
+ table = em_get_table(em_dbg->pd); \
+ val = table[em_dbg->ps_id].name; \
+ em_put_table(); \
+ \
+ seq_printf(s, "%lu\n", val); \
+ return 0; \
+} \
+DEFINE_SHOW_ATTRIBUTE(em_debug_##fname)
+
+DEFINE_EM_DBG_SHOW(frequency, frequency);
+DEFINE_EM_DBG_SHOW(power, power);
+DEFINE_EM_DBG_SHOW(cost, cost);
+DEFINE_EM_DBG_SHOW(performance, performance);
+DEFINE_EM_DBG_SHOW(flags, inefficiency);
+
+static void em_debug_create_ps(struct em_perf_domain *em_pd,
+ struct em_dbg_info *em_dbg, int i,
+ struct dentry *pd)
{
+ struct em_perf_state *table;
+ unsigned long freq;
struct dentry *d;
char name[24];

- snprintf(name, sizeof(name), "ps:%lu", ps->frequency);
+ em_dbg[i].pd = em_pd;
+ em_dbg[i].ps_id = i;
+
+ table = em_get_table(em_pd);
+ freq = table[i].frequency;
+ em_put_table();
+
+ snprintf(name, sizeof(name), "ps:%lu", freq);

/* Create per-ps directory */
d = debugfs_create_dir(name, pd);
- debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
- debugfs_create_ulong("power", 0444, d, &ps->power);
- debugfs_create_ulong("cost", 0444, d, &ps->cost);
- debugfs_create_ulong("performance", 0444, d, &ps->performance);
- debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
+ debugfs_create_file("frequency", 0444, d, &em_dbg[i],
+ &em_debug_frequency_fops);
+ debugfs_create_file("power", 0444, d, &em_dbg[i],
+ &em_debug_power_fops);
+ debugfs_create_file("cost", 0444, d, &em_dbg[i],
+ &em_debug_cost_fops);
+ debugfs_create_file("performance", 0444, d, &em_dbg[i],
+ &em_debug_performance_fops);
+ debugfs_create_file("inefficient", 0444, d, &em_dbg[i],
+ &em_debug_inefficiency_fops);
}

static int em_debug_cpus_show(struct seq_file *s, void *unused)
@@ -73,6 +116,7 @@ DEFINE_SHOW_ATTRIBUTE(em_debug_flags);

static void em_debug_create_pd(struct device *dev)
{
+ struct em_dbg_info *em_dbg;
struct dentry *d;
int i;

@@ -86,9 +130,14 @@ static void em_debug_create_pd(struct device *dev)
debugfs_create_file("flags", 0444, d, dev->em_pd,
&em_debug_flags_fops);

+ em_dbg = devm_kcalloc(dev, dev->em_pd->nr_perf_states,
+ sizeof(*em_dbg), GFP_KERNEL);
+ if (!em_dbg)
+ return;
+
/* Create a sub-directory for each performance state */
for (i = 0; i < dev->em_pd->nr_perf_states; i++)
- em_debug_create_ps(&dev->em_pd->table[i], d);
+ em_debug_create_ps(dev->em_pd, em_dbg, i, d);

}

--
2.25.1

2023-12-12 18:49:21

by Dietmar Eggemann

Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

On 29/11/2023 12:08, Lukasz Luba wrote:
> Hi all,
>
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
>
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.
>
> The concepts:
> 1. The CPU power usage can vary due to the workload that it's running or due
> to the temperature of the SoC. The same workload can use more power when the
> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
> In such situation the EM can be adjusted and reflect the fact of increased
> power usage. That power increase is due to static power
> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
> They are also built differently with High Performance (HP) cells or
> Low Power (LP) cells. They are affected by the temperature increase
> differently: HP cells have bigger leakage. The SW model can leverage that
> knowledge.
>
> 2. It is also possible to change the EM to better reflect the currently
> running workload. Usually the EM is derived from some average power values
> taken from experiments with benchmark (e.g. Dhrystone). The model derived
> from such scenario might not represent properly the workloads usually running
> on the device. Therefore, runtime modification of the EM allows to switch to
> a different model, when there is a need.
>
> 3. The EM can be adjusted after boot, when all the modules are loaded and
> more information about the SoC is available e.g. chip binning. This would help
> to better reflect the silicon characteristics. Thus, this EM modification
> API allows it now. It wasn't possible in the past and the EM had to be
> 'set in stone'.
>
> More detailed explanation and background can be found in presentations
> during LPC2022 [1][2] or in the documentation patches.
>
> Some test results.
> The EM can be updated to fit better the workload type. In the case below the EM
> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
> to get more reliable data.
>
> 1. Janky frames percentage
> +--------+-----------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+-----------------+---------------------+-------+-----------+
> | gmean | jank_percentage | EM_default | 2.0 | 0.0% |
> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% |
> +--------+-----------------+---------------------+-------+-----------+
>
> 2. Avg frame render time duration
> +--------+---------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+---------------------+---------------------+-------+-----------+
> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% |
> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% |
> +--------+---------------------+---------------------+-------+-----------+
>
> 3. Max frame render time duration
> +--------+--------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+--------------------+---------------------+-------+-----------+
> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% |
> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% |
> +--------+--------------------+---------------------+-------+-----------+
>
> 4. OS overutilized state percentage (when EAS is not working)
> +--------------+---------------------+------+------------+------------+
> | metric | wa_path | time | total_time | percentage |
> +--------------+---------------------+------+------------+------------+
> | overutilized | EM_default | 1.65 | 253.38 | 0.65 |
> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 |
> +--------------+---------------------+------+------------+------------+
>
> 5. All CPUs (Little+Mid+Big) power values in mW
> +------------+--------+---------------------+-------+-----------+
> | channel | metric | kernel | value | perc_diff |
> +------------+--------+---------------------+-------+-----------+
> | CPU | gmean | EM_default | 142.1 | 0.0% |
> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
> +------------+--------+---------------------+-------+-----------+
>
> The time cost to update the EM decreased in this v5 vs v4:
> big: 5us vs 2us -> 2.6x faster
> mid: 9us vs 3us -> 3x faster
> little: 16us vs 16us -> no change

I guess this is entirely due to the changes in
em_dev_update_perf_domain()? Moving from per-OPP em_update_callback to
switching the entire table (pd->runtime_table) inside
em_dev_update_perf_domain()?

> We still have to update the inefficiency in the cpufreq framework, thus
> a bit of overhead will be there.
>
> Changelog:
> v5:
> - removed 2 tables design
> - have only one table (runtime_table) used also in thermal (Wei, Rafael)

Until v4 you had 2 EMs, the static and the modifiable (runtime). Now in
v5 this changed to only have one, the modifiable. IMHO it would be
better to change the existing table to be modifiable rather than starting
with two EMs and then removing the static one. I assume you would end up
with far fewer code changes and the patch-set would become easier to digest
for reviewers.

I would mention that 14/23 "PM: EM: Support late CPUs booting and
capacity adjustment" is a testcase for the modifiable EM built into
the code changes. This relies on the table being modifiable.

> - refactored update function and removed callback call for each opp
> - added faster EM table swap, using only the RCU pointer update
> - added memory allocation API and tracking with kref
> - avoid overhead for computing 'cost' for each OPP in update, it can be
> pre-computed in device drivers EM earlier
> - add support for device drivers providing EM table
> - added API for computing 'cost' values in EM for EAS
> - added API for thermal/powercap to use EM (using RCU wrappers)
> - switched to single allocation and 'state[]' array (Rafael)
> - changed documentation to align with current design
> - added helper API for computing cost values
> - simplified EM free in unregister path (thanks to kref)
> - split patch updating EM clients and changed them separetly
> - added seperate patch removing old static EM table
> - added EM debugfs change patch to dump the runtime_table
> - addressed comments in v4 for spelling/comments/headers
> - added review tags

[...]

2023-12-12 18:49:48

by Dietmar Eggemann

Subject: Re: [PATCH v5 04/23] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible

On 29/11/2023 12:08, Lukasz Luba wrote:
> The Energy Model (EM) is going to support runtime modification. There
> are going to be 2 EM tables which store information. This patch aims
> to prepare the code to be generic and use one of the tables. The function
> will no longer get a pointer to 'struct em_perf_domain' (the EM) but
> instead a pointer to 'struct em_perf_state' (which is one of the EM's
> tables).
I thought the 2 EM tables design is gone?

IMHO it would mean fewer code changes and hence a more enjoyable review
experience if you added the 'modifiable' feature to the existing EM
(1) rather than adding (2) and then removing (1) in [21/23].


struct em_perf_domain {
- struct em_perf_state *table; <-- (1)
struct em_perf_table __rcu *runtime_table; <-- (2)

> Prepare em_pd_get_efficient_state() for the upcoming changes and
> make it possible to re-use. Return an index for the best performance

s/make it possible to re-use/make it possible to be re-used ?

> state for a given EM table. The function arguments that are introduced
> should allow to work on different performance state arrays. The caller of
> em_pd_get_efficient_state() should be able to use the index either
> on the default or the modifiable EM table.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> Reviewed-by: Daniel Lezcano <[email protected]>
> ---
> include/linux/energy_model.h | 30 +++++++++++++++++-------------
> 1 file changed, 17 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index b9caa01dfac4..8069f526c9d8 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -175,33 +175,35 @@ void em_dev_unregister_perf_domain(struct device *dev);
>
> /**
> * em_pd_get_efficient_state() - Get an efficient performance state from the EM
> - * @pd : Performance domain for which we want an efficient frequency
> - * @freq : Frequency to map with the EM
> + * @state: List of performance states, in ascending order

(3)

> + * @nr_perf_states: Number of performance states
> + * @freq: Frequency to map with the EM
> + * @pd_flags: Performance Domain flags
> *
> * It is called from the scheduler code quite frequently and as a consequence
> * doesn't implement any check.
> *
> - * Return: An efficient performance state, high enough to meet @freq
> + * Return: An efficient performance state id, high enough to meet @freq
> * requirement.
> */
> -static inline
> -struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
> - unsigned long freq)
> +static inline int
> +em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
> + unsigned long freq, unsigned long pd_flags)

(3) but em_pd_get_efficient_state(struct em_perf_state *table
^^^^^
[...]
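
For readers following the signature change, the index-returning lookup can be sketched in plain userspace C. This is only a minimal sketch under stated assumptions, not the patch itself: `struct perf_state`, the flag names `PS_INEFFICIENT` and `PD_SKIP_INEFFICIENCIES`, and the function name `get_efficient_state()` are hypothetical stand-ins for the kernel definitions. It walks an ascending table and returns the index of the first state that meets the requested frequency, optionally skipping states marked inefficient.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's flags and types. */
#define PS_INEFFICIENT         (1UL << 0)
#define PD_SKIP_INEFFICIENCIES (1UL << 1)

struct perf_state {
	unsigned long frequency;	/* kHz, ascending within a table */
	unsigned long flags;
};

/*
 * Return the index of the lowest state whose frequency is >= freq.
 * Fall back to the highest state if none qualifies.
 */
static int get_efficient_state(const struct perf_state *table,
			       int nr_perf_states, unsigned long freq,
			       unsigned long pd_flags)
{
	int i;

	assert(table != NULL && nr_perf_states > 0);

	for (i = 0; i < nr_perf_states; i++) {
		const struct perf_state *ps = &table[i];

		if (ps->frequency < freq)
			continue;
		/* Skip inefficient states only when the domain asks for it. */
		if ((pd_flags & PD_SKIP_INEFFICIENCIES) &&
		    (ps->flags & PS_INEFFICIENT))
			continue;
		return i;
	}
	return nr_perf_states - 1;
}
```

Because the function only sees an array and a length, the returned index can be applied to any table with the same layout, which is the flexibility the commit message describes.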

2023-12-12 18:50:13

by Rafael J. Wysocki

Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi Lukasz,

On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <[email protected]> wrote:
>
> Hi all,
>
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
>
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.
>
> The concepts:
> 1. The CPU power usage can vary due to the workload that it's running or due
> to the temperature of the SoC. The same workload can use more power when the
> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
> In such situation the EM can be adjusted and reflect the fact of increased
> power usage. That power increase is due to static power
> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
> They are also built differently with High Performance (HP) cells or
> Low Power (LP) cells. They are affected by the temperature increase
> differently: HP cells have bigger leakage. The SW model can leverage that
> knowledge.
>
> 2. It is also possible to change the EM to better reflect the currently
> running workload. Usually the EM is derived from some average power values
> taken from experiments with benchmark (e.g. Dhrystone). The model derived
> from such scenario might not represent properly the workloads usually running
> on the device. Therefore, runtime modification of the EM allows to switch to
> a different model, when there is a need.
>
> 3. The EM can be adjusted after boot, when all the modules are loaded and
> more information about the SoC is available e.g. chip binning. This would help
> to better reflect the silicon characteristics. Thus, this EM modification
> API allows it now. It wasn't possible in the past and the EM had to be
> 'set in stone'.
>
> More detailed explanation and background can be found in presentations
> during LPC2022 [1][2] or in the documentation patches.
>
> Some test results.
> The EM can be updated to fit better the workload type. In the case below the EM
> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
> to get more reliable data.
>
> 1. Janky frames percentage
> +--------+-----------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+-----------------+---------------------+-------+-----------+
> | gmean | jank_percentage | EM_default | 2.0 | 0.0% |
> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% |
> +--------+-----------------+---------------------+-------+-----------+
>
> 2. Avg frame render time duration
> +--------+---------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+---------------------+---------------------+-------+-----------+
> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% |
> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% |
> +--------+---------------------+---------------------+-------+-----------+
>
> 3. Max frame render time duration
> +--------+--------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+--------------------+---------------------+-------+-----------+
> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% |
> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% |
> +--------+--------------------+---------------------+-------+-----------+
>
> 4. OS overutilized state percentage (when EAS is not working)
> +--------------+---------------------+------+------------+------------+
> | metric | wa_path | time | total_time | percentage |
> +--------------+---------------------+------+------------+------------+
> | overutilized | EM_default | 1.65 | 253.38 | 0.65 |
> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 |
> +--------------+---------------------+------+------------+------------+
>
> 5. All CPUs (Little+Mid+Big) power values in mW
> +------------+--------+---------------------+-------+-----------+
> | channel | metric | kernel | value | perc_diff |
> +------------+--------+---------------------+-------+-----------+
> | CPU | gmean | EM_default | 142.1 | 0.0% |
> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
> +------------+--------+---------------------+-------+-----------+
>
> The time cost to update the EM decreased in this v5 vs v4:
> big: 5us vs 2us -> 2.6x faster
> mid: 9us vs 3us -> 3x faster
> little: 16us vs 16us -> no change
>
> We still have to update the inefficiency in the cpufreq framework, thus
> a bit of overhead will be there.
>
> Changelog:
> v5:
> - removed 2 tables design
> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
> - refactored update function and removed callback call for each opp
> - added faster EM table swap, using only the RCU pointer update
> - added memory allocation API and tracking with kref
> - avoid overhead for computing 'cost' for each OPP in update, it can be
> pre-computed in device drivers EM earlier
> - add support for device drivers providing EM table
> - added API for computing 'cost' values in EM for EAS
> - added API for thermal/powercap to use EM (using RCU wrappers)
> - switched to single allocation and 'state[]' array (Rafael)
> - changed documentation to align with current design
> - added helper API for computing cost values
> - simplified EM free in unregister path (thanks to kref)
> - split patch updating EM clients and changed them separetly
> - added seperate patch removing old static EM table
> - added EM debugfs change patch to dump the runtime_table
> - addressed comments in v4 for spelling/comments/headers
> - added review tags

I like this one more than the previous one and thanks for taking my
feedback into account.

I would still like other people having a vested interest in the EM to
look at it and give feedback (or just tags), so I'm not inclined to
apply it just yet. However, I don't have any specific comments on it.

2023-12-12 18:50:31

by Dietmar Eggemann

Subject: Re: [PATCH v5 08/23] PM: EM: Introduce runtime modifiable table

On 29/11/2023 12:08, Lukasz Luba wrote:
> The new runtime table can be populated with a new power data to better
> reflect the actual efficiency of the device e.g. CPU. The power can vary
> over time e.g. due to the SoC temperature change. Higher temperature can
> increase power values. For longer running scenarios, such as game or
> camera, when also other devices are used (e.g. GPU, ISP) the CPU power can

I don't understand this sentence. So CPU power changes with higher
temperature, and also in longer-running scenarios when other devices are
involved? I'm not getting the second part.

> change. The new EM framework is able to addresses this issue and change
> the EM data at runtime safely.

Maybe better:
The new EM framework addresses this issue by allowing to change the EM
data at runtime.

[...]

2023-12-12 18:50:58

by Dietmar Eggemann

Subject: Re: [PATCH v5 07/23] PM: EM: Refactor how the EM table is allocated and populated

On 29/11/2023 12:08, Lukasz Luba wrote:
> Split the process of allocation and data initialization for the EM table.
> The upcoming changes for modifiable EM will use it.
>
> This change is not expected to alter the general functionality.

NIT: IMHO, I guess you wanted to say: "No functional changes
introduced"? I.e. all functionality, not only the general functionality ...

[...]

> static int em_create_pd(struct device *dev, int nr_states,
> @@ -234,11 +234,15 @@ static int em_create_pd(struct device *dev, int nr_states,
> return -ENOMEM;
> }
>
> - ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
> - if (ret) {
> - kfree(pd);
> - return ret;
> - }
> + pd->nr_perf_states = nr_states;

Why does `pd->nr_perf_states = nr_states;` have to move from
em_create_perf_table() to em_create_pd()?

> +
> + ret = em_allocate_perf_table(pd, nr_states);
> + if (ret)
> + goto free_pd;
> +
> + ret = em_create_perf_table(dev, pd, pd->table, nr_states, cb, flags);

If you set it in em_create_pd() then you can use 'pd->nr_perf_states' in
em_create_perf_table() and don't have to pass `nr_states`.

[...]

2023-12-12 18:51:00

by Dietmar Eggemann

Subject: Re: [PATCH v5 11/23] PM: EM: Add API for updating the runtime modifiable EM

On 29/11/2023 12:08, Lukasz Luba wrote:

[...]

> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 489a358b9a00..614891fde8df 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -221,6 +221,52 @@ static int em_allocate_perf_table(struct em_perf_domain *pd,
> return 0;
> }
>
> +/**
> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> + * @dev : Device for which the EM is to be updated
> + * @table : The new EM table that is going to used from now

s/going to used/going to be used

> + *
> + * Update EM runtime modifiable table for the @dev using the privided @table.

s/privided/provided

> + *
> + * This function uses mutex to serialize writers, so it must not be called
> + * from non-sleeping context.
> + *
> + * Return 0 on success or a proper error in case of failure.
> + */
> +int em_dev_update_perf_domain(struct device *dev,
> + struct em_perf_table __rcu *new_table)
> +{
> + struct em_perf_table __rcu *old_table;
> + struct em_perf_domain *pd;
> +
> + /*
> + * The lock serializes update and unregister code paths. When the
> + * EM has been unregistered in the meantime, we should capture that
> + * when entering this critical section. It also makes sure that

What do you want to capture here? You want to block at this point,
right? I don't understand the second sentence here.

[...]

2023-12-12 18:51:27

by Dietmar Eggemann

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

On 29/11/2023 12:08, Lukasz Luba wrote:
> The Energy Model (EM) can be modified at runtime which brings new
> possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
> (EAS) in it's hot path. The energy calculation uses power value for

NIT: s/it's/its

> a given performance state (ps) and the CPU busy time as percentage for that
> given frequency, which effectively is:
>
>   pd_nrg = ps->power * busy_time_pct                   (1)
>
>                        cpu_util
>   busy_time_pct = -----------------                    (2)
>                    ps->performance
>
> The 'ps->performance' is the CPU capacity (performance) at that given ps.
> Thus, in a situation when the OS is not overloaded and we have EAS
> working, the busy time is lower than 'ps->performance' that the CPU is
> running at. Therefore, in longer scheduling period we can treat the power
> value calculated above as the energy.

Not sure I understand what a longer 'scheduling period' has to do with
that? Is this to highlight the issue between instantaneous power and the
energy being the integral over it? And the 'scheduling period' is the
runnable time of this task?

> We can optimize the last arithmetic operation in em_cpu_energy() and
> remove the division. This can be done because em_perf_state::cost, which
> is a special coefficient, can now hold the pre-calculated value including
> the 'ps->performance' information for a performance state (ps):
>
>                 ps->power
>   ps->cost = ---------------                           (3)
>               ps->performance

Ah, this is equation (2) in the existing code with s/cap/performance.

> In the past the 'ps->performance' had to be calculated at runtime every
> time the em_cpu_energy() was called. Thus there was this formula involved:
>

>                       ps->freq
>   ps->performance = ------------- * scale_cpu          (4)
>                      cpu_max_freq
>
> When we inject (4) into (2) than we can have this equation:
>
>                    cpu_util * cpu_max_freq
>   busy_time_pct = ------------------------             (5)
>                     ps->freq * scale_cpu
>
> Because the right 'scale_cpu' value wasn't ready during the boot time
> and EM initialization, we had to perform the division by 'scale_cpu'
> at runtime. There was not safe mechanism to update EM at runtime.
> It has changed thanks to EM runtime modification feature.
>
> It is possible to avoid the division by 'scale_cpu' at runtime, because
> EM is updated whenever new max capacity CPU is set in the system or after
> the boot has finished and proper CPU capacity is ready.
>
> Use that feature and do the needed division during the calculation of the
> coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
> multiplied simply by utilization:
>
> pd_nrg = ps->cost * \Sum cpu_util (6)
>
> to get the needed energy for whole Performance Domain (PD).
>
> With this optimization, the em_cpu_energy() should run faster on the Big
> CPU by 1.43x and on the Little CPU by 1.69x.

Where are those precise numbers coming from? Which platform was it?

>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/linux/energy_model.h | 68 +++++-------------------------------
> kernel/power/energy_model.c | 7 ++--
> 2 files changed, 12 insertions(+), 63 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index e30750500b10..0f5621898a81 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -115,27 +115,6 @@ struct em_perf_domain {
> #define EM_MAX_NUM_CPUS 16
> #endif
>
> -/*
> - * To avoid an overflow on 32bit machines while calculating the energy
> - * use a different order in the operation. First divide by the 'cpu_scale'
> - * which would reduce big value stored in the 'cost' field, then multiply by
> - * the 'sum_util'. This would allow to handle existing platforms, which have
> - * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
> - * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
> - * could be 4096, then multiplication: 'cost' * 'sum_util' would overflow.
> - * This reordering of operations has some limitations, we lose small
> - * precision in the estimation (comparing to 64bit platform w/o reordering).
> - *
> - * We are safe on 64bit machine.
> - */
> -#ifdef CONFIG_64BIT
> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
> - (((cost) * (sum_util)) / (scale_cpu))
> -#else
> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
> - (((cost) / (scale_cpu)) * (sum_util))
> -#endif
> -
> struct em_data_callback {
> /**
> * active_power() - Provide power at the next performance state of
> @@ -249,29 +228,16 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> {
> struct em_perf_table *runtime_table;
> struct em_perf_state *ps;
> - unsigned long scale_cpu;
> - int cpu, i;
> + int i;
>
> if (!sum_util)
> return 0;
>
> - /*
> - * In order to predict the performance state, map the utilization of
> - * the most utilized CPU of the performance domain to a requested
> - * frequency, like schedutil. Take also into account that the real
> - * frequency might be set lower (due to thermal capping). Thus, clamp
> - * max utilization to the allowed CPU capacity before calculating
> - * effective frequency.

Why do you remove this comment? IMHO, it's still valid and independent
of the changes here?

> - */
> - cpu = cpumask_first(to_cpumask(pd->cpus));
> - scale_cpu = arch_scale_cpu_capacity(cpu);
> -
> /*
> * No rcu_read_lock() since it's already called by task scheduler.
> * The runtime_table is always there for CPUs, so we don't check.
> */
> runtime_table = rcu_dereference(pd->runtime_table);
> -
> ps = &runtime_table->state[pd->nr_perf_states - 1];
>
> max_util = map_util_perf(max_util);
> @@ -286,35 +252,21 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> ps = &runtime_table->state[i];
>
> /*
> - * The capacity of a CPU in the domain at the performance state (ps)
> - * can be computed as:
> - *
> - * ps->freq * scale_cpu
> - * ps->cap = -------------------- (1)
> - * cpu_max_freq
> - *
> - * So, ignoring the costs of idle states (which are not available in
> - * the EM), the energy consumed by this CPU at that performance state
> + * The energy consumed by the CPU at the given performance state (ps)
> * is estimated as:
> *
> - * ps->power * cpu_util
> - * cpu_nrg = -------------------- (2)
> - * ps->cap
> + * ps->power
> + * cpu_nrg = --------------- * cpu_util (1)
> + * ps->performance
> *
> - * since 'cpu_util / ps->cap' represents its percentage of busy time.
> + * The 'cpu_util / ps->performance' represents its percentage of
> + * busy time. The idle cost is ignored (it's not available in the EM).
> *
> * NOTE: Although the result of this computation actually is in
> * units of power, it can be manipulated as an energy value
> * over a scheduling period, since it is assumed to be
> * constant during that interval.
> *
> - * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
> - * of two terms:
> - *
> - * ps->power * cpu_max_freq cpu_util
> - * cpu_nrg = ------------------------ * --------- (3)
> - * ps->freq scale_cpu
> - *
> * The first term is static, and is stored in the em_perf_state struct
> * as 'ps->cost'.
> *
> @@ -323,11 +275,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> * total energy of the domain (which is the simple sum of the energy of
> * all of its CPUs) can be factorized as:
> *
> - * ps->cost * \Sum cpu_util
> - * pd_nrg = ------------------------ (4)
> - * scale_cpu
> + * pd_nrg = ps->cost * \Sum cpu_util (2)
> */
> - return em_estimate_energy(ps->cost, sum_util, scale_cpu);
> + return ps->cost * sum_util;

Can you not keep the existing comment and only change:

(a) that ps->cap is ps->performance in (2) and

(b) that:

 *             ps->power * cpu_max_freq   cpu_util
 *   cpu_nrg = ------------------------ * ---------          (3)
 *                     ps->freq           scale_cpu

               <---- (old) ps->cost --->

is now

               ps->power * cpu_max_freq       1
    ps->cost = ------------------------ * ----------
                       ps->freq           scale_cpu

               <---- (old) ps->cost --->

and (c) that (4) has changed to:

* pd_nrg = ps->cost * \Sum cpu_util (4)

which avoids the division?
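
The relationship between the old and new 'cost' can be checked with a small userspace sketch. It is an illustration under stated assumptions rather than the kernel code: the helper names (`old_cost()`, `old_energy()`, `performance()`, `new_cost()`, `new_energy()`) are hypothetical, and plain 64-bit integer arithmetic is used, so inputs that do not divide evenly may differ slightly between the two schemes because of rounding.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: 'cost' excludes scale_cpu, so the hot path has to divide. */
static uint64_t old_cost(uint64_t power, uint64_t freq, uint64_t max_freq)
{
	assert(freq != 0);
	return power * max_freq / freq;
}

static uint64_t old_energy(uint64_t cost, uint64_t sum_util,
			   uint64_t scale_cpu)
{
	assert(scale_cpu != 0);
	return cost * sum_util / scale_cpu;
}

/* Equation (4) from the commit message: capacity at a given state. */
static uint64_t performance(uint64_t freq, uint64_t max_freq,
			    uint64_t scale_cpu)
{
	assert(max_freq != 0);
	return freq * scale_cpu / max_freq;
}

/* New scheme: scale_cpu is folded into 'cost' through the performance. */
static uint64_t new_cost(uint64_t power, uint64_t perf)
{
	assert(perf != 0);
	return power / perf;
}

/* Hot path with the new scheme: a single multiplication. */
static uint64_t new_energy(uint64_t cost, uint64_t sum_util)
{
	return cost * sum_util;
}
```

With power = 1024000, freq = 1000000, max_freq = 2000000, scale_cpu = 1024 and sum_util = 512, both paths yield 1024000, while only the old path divides at estimation time.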

Fewer changes are always much nicer since they make it so much easier to
follow the history and review changes.

I do understand the changes from the technical viewpoint but the review
took me way too long, which I partly blame on all the changes in the
comments which could have been avoided. I just want to make sure that
others don't have to go through this pain too.

[...]


2023-12-12 18:51:37

by Dietmar Eggemann

Subject: Re: [PATCH v5 22/23] PM: EM: Add em_dev_compute_costs() as API for device drivers

On 29/11/2023 12:08, Lukasz Luba wrote:
> The device drivers can modify EM at runtime by providing a new EM table.
> The EM is used by the EAS and the em_perf_state::cost stores
> pre-calculated value to avoid overhead. This patch provides the API for
> device drivers to calculate the cost values properly (and not duplicate
> the same code).

New interface w/o any users? Can we not remove this from this patch-set
and introduce it with the first user(s)?

[...]

2023-12-12 18:51:58

by Dietmar Eggemann

Subject: Re: [PATCH v5 14/23] PM: EM: Support late CPUs booting and capacity adjustment

On 29/11/2023 12:08, Lukasz Luba wrote:
> The patch adds needed infrastructure to handle the late CPUs boot, which
> might change the previous CPUs capacity values. With this changes the new
> CPUs which try to register EM will trigger the needed re-calculations for
> other CPUs EMs. Thanks to that the em_per_state::performance values will
> be aligned with the CPU capacity information after all CPUs finish the
> boot and EM registrations.

IMHO, it's worth mentioning here that this added functionality is the
first use case of the modifiable EM.

[...]

> + * Adjustment of CPU performance values after boot, when all CPUs capacites
> + * are correctly calculated.
> + */
> +static void em_adjust_new_capacity(struct device *dev,
> + struct em_perf_domain *pd,
> + u64 max_cap)
> +{

[...]

> + /*
> + * This is one-time-update, so give up the ownership in this updater.
> + * The EM fwk will keep the reference and free the memory when needed.

s/fwk/framework ?

> + */
> + em_free_table(runtime_table);
> +}
> +
> +static void em_check_capacity_update(void)
> +{
> + cpumask_var_t cpu_done_mask;
> + struct em_perf_state *table;
> + struct em_perf_domain *pd;
> + unsigned long cpu_capacity;
> + int cpu;
> +
> + if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
> + pr_warn("no free memory\n");
> + return;
> + }
> +
> + /* Check if CPUs capacity has changed than update EM */

s/than/then ?

Maybe this comment is not needed since there is (1) further down?


> + for_each_possible_cpu(cpu) {
> + struct cpufreq_policy *policy;
> + unsigned long em_max_perf;
> + struct device *dev;
> + int nr_states;
> +
> + if (cpumask_test_cpu(cpu, cpu_done_mask))
> + continue;
> +
> + policy = cpufreq_cpu_get(cpu);
> + if (!policy) {
> + pr_debug("Accessing cpu%d policy failed\n", cpu);
> + schedule_delayed_work(&em_update_work,
> + msecs_to_jiffies(1000));
> + break;
> + }
> + cpufreq_cpu_put(policy);
> +
> + pd = em_cpu_get(cpu);
> + if (!pd || em_is_artificial(pd))
> + continue;
> +
> + cpumask_or(cpu_done_mask, cpu_done_mask,
> + em_span_cpus(pd));
> +
> + nr_states = pd->nr_perf_states;
> + cpu_capacity = arch_scale_cpu_capacity(cpu);
> +
> + table = em_get_table(pd);
> + em_max_perf = table[pd->nr_perf_states - 1].performance;
> + em_put_table();
> +
> + /*
> + * Check if the CPU capacity has been adjusted during boot
> + * and trigger the update for new performance values.
> + */

(1)

[...]
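
The check at (1) and the recalculation it triggers can be sketched in userspace C. This is a minimal sketch under stated assumptions: `struct perf_state` and `rescale_if_needed()` are hypothetical names, and the table is modified in place for brevity, whereas the series allocates a new table and hands it to the EM framework. The per-state formula follows equation (4) of patch 15/23, i.e. performance = freq * capacity / max_freq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct perf_state {
	uint64_t frequency;	/* kHz, ascending within a table */
	uint64_t performance;	/* capacity reachable at this state */
};

/*
 * Compare the performance of the highest state with the current CPU
 * capacity and recompute every state only when they differ.
 * Returns true if the table was updated.
 */
static bool rescale_if_needed(struct perf_state *table, int nr_states,
			      uint64_t cpu_capacity)
{
	uint64_t fmax;
	int i;

	assert(table != NULL && nr_states > 0);

	if (table[nr_states - 1].performance == cpu_capacity)
		return false;

	fmax = table[nr_states - 1].frequency;
	assert(fmax != 0);

	for (i = 0; i < nr_states; i++)
		table[i].performance =
			table[i].frequency * cpu_capacity / fmax;

	return true;
}
```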

2023-12-12 18:52:25

by Dietmar Eggemann

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design

On 29/11/2023 12:08, Lukasz Luba wrote:
> Add a new section 'Design' which covers the information about Energy
> Model. It contains the design decisions, describes models and how they
> reflect the reality. Remove description of the default EM. Change the
> other section IDs. Add documentation bit for the new feature which
> allows to modify the EM in runtime.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
> 1 file changed, 196 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
> index 13225965c9a4..1f8cf36914b1 100644
> --- a/Documentation/power/energy-model.rst
> +++ b/Documentation/power/energy-model.rst
> @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
> domains can have different micro-architectures.
>
>
> -2. Core APIs
> +2. Design
> +-----------------
> +
> +2.1 Runtime modifiable EM
> +^^^^^^^^^^^^^^^^^^^^^^^^^

The issue I see here is that, since the EM is now runtime modifiable and
there is only one EM, people might be confused when looking for a
non-runtime-modifiable EM (which matches the design up to v4).

So 'runtime modifiability' is now a feature of the EM itself.

There is also a figure in this document illustrating the use of
em_get_energy(), em_cpu_get() and em_dev_register_perf_domain().

I wonder if this should be extended to cover all the new interfaces
created for the 'runtime modifiability' feature?

[...]

2023-12-13 09:22:32

by Lukasz Luba

Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi Dietmar,

Thank you for the review, I will go one-by-one to respond to
your comments in the patches as well. First comments are below.

On 12/12/23 18:48, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM)
>> power values at runtime. It will allow to better reflect power model of
>> a recent SoCs and silicon. Different characteristics of the power usage
>> can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of feature set know as Dynamic Energy Model. It has been presented
>> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
>> improvement for the EM.
>>
>> The concepts:
>> 1. The CPU power usage can vary due to the workload that it's running or due
>> to the temperature of the SoC. The same workload can use more power when the
>> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
>> In such situation the EM can be adjusted and reflect the fact of increased
>> power usage. That power increase is due to static power
>> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
>> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
>> They are also built differently with High Performance (HP) cells or
>> Low Power (LP) cells. They are affected by the temperature increase
>> differently: HP cells have bigger leakage. The SW model can leverage that
>> knowledge.
>>
>> 2. It is also possible to change the EM to better reflect the currently
>> running workload. Usually the EM is derived from some average power values
>> taken from experiments with benchmark (e.g. Dhrystone). The model derived
>> from such scenario might not represent properly the workloads usually running
>> on the device. Therefore, runtime modification of the EM allows to switch to
>> a different model, when there is a need.
>>
>> 3. The EM can be adjusted after boot, when all the modules are loaded and
>> more information about the SoC is available e.g. chip binning. This would help
>> to better reflect the silicon characteristics. Thus, this EM modification
>> API allows it now. It wasn't possible in the past and the EM had to be
>> 'set in stone'.
>>
>> More detailed explanation and background can be found in presentations
>> during LPC2022 [1][2] or in the documentation patches.
>>
>> Some test results.
>> The EM can be updated to fit better the workload type. In the case below the EM
>> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
>> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
>> to get more reliable data.
>>
>> 1. Janky frames percentage
>> +--------+-----------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+-----------------+---------------------+-------+-----------+
>> | gmean | jank_percentage | EM_default | 2.0 | 0.0% |
>> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% |
>> +--------+-----------------+---------------------+-------+-----------+
>>
>> 2. Avg frame render time duration
>> +--------+---------------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+---------------------+---------------------+-------+-----------+
>> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% |
>> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% |
>> +--------+---------------------+---------------------+-------+-----------+
>>
>> 3. Max frame render time duration
>> +--------+--------------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+--------------------+---------------------+-------+-----------+
>> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% |
>> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% |
>> +--------+--------------------+---------------------+-------+-----------+
>>
>> 4. OS overutilized state percentage (when EAS is not working)
>> +--------------+---------------------+------+------------+------------+
>> | metric | wa_path | time | total_time | percentage |
>> +--------------+---------------------+------+------------+------------+
>> | overutilized | EM_default | 1.65 | 253.38 | 0.65 |
>> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 |
>> +--------------+---------------------+------+------------+------------+
>>
>> 5. All CPUs (Little+Mid+Big) power values in mW
>> +------------+--------+---------------------+-------+-----------+
>> | channel | metric | kernel | value | perc_diff |
>> +------------+--------+---------------------+-------+-----------+
>> | CPU | gmean | EM_default | 142.1 | 0.0% |
>> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
>> +------------+--------+---------------------+-------+-----------+
>>
>> The time cost to update the EM decreased in this v5 vs v4:
>> big: 5us vs 2us -> 2.6x faster
>> mid: 9us vs 3us -> 3x faster
>> little: 16us vs 16us -> no change
>
> I guess this is entirely due to the changes in
> em_dev_update_perf_domain()? Moving from per-OPP em_update_callback to
> switching the entire table (pd->runtime_table) inside
> em_dev_update_perf_domain()?

Yes correct, it's due to that design change.
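
As a rough illustration of why the table swap is cheaper than a per-OPP
callback, here is a minimal userspace sketch. It is not the kernel code:
the struct and function names are hypothetical, and a C11 atomic pointer
stands in for the `__rcu` pointer and rcu_assign_pointer(). There is no
grace period here, so freeing the old table is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel EM structures. */
struct perf_state {
	unsigned long frequency;
	unsigned long power;
};

struct perf_table {
	int nr_states;
	struct perf_state state[];	/* single allocation, flexible array */
};

struct perf_domain {
	_Atomic(struct perf_table *) table;	/* models the __rcu pointer */
};

/* Allocate a zeroed table with room for nr_states entries; NULL on failure. */
static struct perf_table *table_alloc(int nr_states)
{
	struct perf_table *t;

	t = calloc(1, sizeof(*t) + nr_states * sizeof(t->state[0]));
	if (t)
		t->nr_states = nr_states;
	return t;
}

/*
 * Publish a fully prepared table with a single pointer exchange and
 * return the previous table. The cost does not depend on nr_states,
 * unlike a callback invoked once per OPP.
 */
static struct perf_table *table_swap(struct perf_domain *pd,
				     struct perf_table *new_table)
{
	return atomic_exchange_explicit(&pd->table, new_table,
					memory_order_acq_rel);
}

/* Reader side: one pointer load, then plain array indexing. */
static unsigned long read_power(struct perf_domain *pd, int idx)
{
	struct perf_table *t;

	t = atomic_load_explicit(&pd->table, memory_order_acquire);
	assert(idx >= 0 && idx < t->nr_states);
	return t->state[idx].power;
}
```

A caller would fill a new table from table_alloc() off the hot path, call
table_swap() once, and release the returned old table only when no reader
can still hold it (in the kernel that is what the kref and RCU handle).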

>
>> We still have to update the inefficiency in the cpufreq framework, thus
>> a bit of overhead will be there.
>>
>> Changelog:
>> v5:
>> - removed 2 tables design
>> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
>
> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in
> v5 this changed to only have one, the modifiable. IMHO it would be
> better to change the existing table to be modifiable rather than starting
> with two EM's and then removing the static one. I assume you end up with
> way less code changes and the patch-set will become easier to digest for
> reviewers.

The patches are structured in this way following Daniel's recommendation
I got when I was adding similar big changes to EM in 2020 (support all
devices in kernel). The approach is as follows:
0. Do some basic clean-up/refactoring if needed for a new feature, to
re-use some code if possible in future
1. Introduce new feature next to the existing one
2. Add API and all needed infrastructure (structures, fields) for
drivers
3. Re-wire the existing drivers/frameworks to the new feature via new
API; ideally keep 1 patch per driver so the maintainer can easily
grasp the changes and ACK it, because it will go via different tree
(Rafael's tree); in case of some code clash in the driver's code
during merge - it will be a single driver so easier to handle
4. When all drivers and frameworks are wired up with the new feature,
remove the old feature (structures, fields, APIs, etc)
5. Update the documentation with the latest state of the design

In this approach the patches are less convoluted. If I removed the old
feature and added the new one in a single patch (e.g. the main structure),
that patch would have to modify all drivers to still compile. It
would be a big messy patch for this re-design.

I can see in some later comment from Rafael that he is OK with the current
patch set structure.

>
> I would mention that 14/23 "PM: EM: Support late CPUs booting and
> capacity adjustment" is a testcase for the modifiable EM built into
> the code changes. This relies on the table being modifiable.
>

Correct, that's the 1st user of the runtime modifiable EM, which is actually
also built-in. I could add that to the cover letter.

Regards,
Lukasz

2023-12-13 09:31:32

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi Rafael,

Thank you for having a look at the series.

On 12/12/23 18:49, Rafael J. Wysocki wrote:
> Hi Lukasz,
>
> On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <[email protected]> wrote:
>>
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM)
>> power values at runtime. It will allow to better reflect power model of
>> a recent SoCs and silicon. Different characteristics of the power usage
>> can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of the feature set known as Dynamic Energy Model. It has been presented
>> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
>> improvement for the EM.
>>
>> The concepts:
>> 1. The CPU power usage can vary due to the workload that it's running or due
>> to the temperature of the SoC. The same workload can use more power when the
>> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
>> In such situation the EM can be adjusted and reflect the fact of increased
>> power usage. That power increase is due to static power
>> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
>> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
>> They are also built differently with High Performance (HP) cells or
>> Low Power (LP) cells. They are affected by the temperature increase
>> differently: HP cells have bigger leakage. The SW model can leverage that
>> knowledge.
>>
>> 2. It is also possible to change the EM to better reflect the currently
>> running workload. Usually the EM is derived from some average power values
>> taken from experiments with benchmark (e.g. Dhrystone). The model derived
>> from such scenario might not represent properly the workloads usually running
>> on the device. Therefore, runtime modification of the EM allows to switch to
>> a different model, when there is a need.
>>
>> 3. The EM can be adjusted after boot, when all the modules are loaded and
>> more information about the SoC is available e.g. chip binning. This would help
>> to better reflect the silicon characteristics. Thus, this EM modification
>> API allows it now. It wasn't possible in the past and the EM had to be
>> 'set in stone'.
>>
>> More detailed explanation and background can be found in presentations
>> during LPC2022 [1][2] or in the documentation patches.
>>
>> Some test results.
>> The EM can be updated to fit better the workload type. In the case below the EM
>> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
>> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
>> to get more reliable data.
>>
>> 1. Janky frames percentage
>> +--------+-----------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+-----------------+---------------------+-------+-----------+
>> | gmean | jank_percentage | EM_default | 2.0 | 0.0% |
>> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% |
>> +--------+-----------------+---------------------+-------+-----------+
>>
>> 2. Avg frame render time duration
>> +--------+---------------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+---------------------+---------------------+-------+-----------+
>> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% |
>> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% |
>> +--------+---------------------+---------------------+-------+-----------+
>>
>> 3. Max frame render time duration
>> +--------+--------------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+--------------------+---------------------+-------+-----------+
>> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% |
>> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% |
>> +--------+--------------------+---------------------+-------+-----------+
>>
>> 4. OS overutilized state percentage (when EAS is not working)
>> +--------------+---------------------+------+------------+------------+
>> | metric | wa_path | time | total_time | percentage |
>> +--------------+---------------------+------+------------+------------+
>> | overutilized | EM_default | 1.65 | 253.38 | 0.65 |
>> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 |
>> +--------------+---------------------+------+------------+------------+
>>
>> 5. All CPUs (Little+Mid+Big) power values in mW
>> +------------+--------+---------------------+-------+-----------+
>> | channel | metric | kernel | value | perc_diff |
>> +------------+--------+---------------------+-------+-----------+
>> | CPU | gmean | EM_default | 142.1 | 0.0% |
>> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
>> +------------+--------+---------------------+-------+-----------+
>>
>> The time cost to update the EM decreased in this v5 vs v4:
>> big: 5us vs 2us -> 2.6x faster
>> mid: 9us vs 3us -> 3x faster
>> little: 16us vs 16us -> no change
>>
>> We still have to update the inefficiency in the cpufreq framework, thus
>> a bit of overhead will be there.
>>
>> Changelog:
>> v5:
>> - removed 2 tables design
>> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
>> - refactored update function and removed callback call for each opp
>> - added faster EM table swap, using only the RCU pointer update
>> - added memory allocation API and tracking with kref
>> - avoid overhead for computing 'cost' for each OPP in update, it can be
>> pre-computed in device drivers EM earlier
>> - add support for device drivers providing EM table
>> - added API for computing 'cost' values in EM for EAS
>> - added API for thermal/powercap to use EM (using RCU wrappers)
>> - switched to single allocation and 'state[]' array (Rafael)
>> - changed documentation to align with current design
>> - added helper API for computing cost values
>> - simplified EM free in unregister path (thanks to kref)
>> - split patch updating EM clients and changed them separately
>> - added separate patch removing old static EM table
>> - added EM debugfs change patch to dump the runtime_table
>> - addressed comments in v4 for spelling/comments/headers
>> - added review tags
>
> I like this one more than the previous one and thanks for taking my
> feedback into account.
>
> I would still like other people having a vested interest in the EM to
> look at it and give feedback (or just tags), so I'm not inclined to
> apply it just yet. However, I don't have any specific comments on it.

Let me contact, offline, some of the partners who were keen to have this
in mainline (when I presented a first implementation in 2021 on the
Android kernel review system).

Regards,
Lukasz

2023-12-13 11:37:20

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

On 13/12/2023 10:23, Lukasz Luba wrote:
> Hi Dietmar,
>
> Thank you for the review, I will go one-by-one to respond to
> your comments in patches as well. First comments are below.
>
> On 12/12/23 18:48, Dietmar Eggemann wrote:
>> On 29/11/2023 12:08, Lukasz Luba wrote:

[...]

>>> Changelog:
>>> v5:
>>> - removed 2 tables design
>>> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
>>
>> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in
>> v5 this changed to only have one, the modifiable. IMHO it would be
>> better to change the existing table to be modifiable rather than starting
>> with two EM's and then removing the static one. I assume you end up with
>> way less code changes and the patch-set will become easier to digest for
>> reviewers.
>
> The patches are structured in this way following Daniel's recommendation
> I got when I was adding similar big changes to EM in 2020 (support all
> devices in kernel). The approach is as follows:
> 0. Do some basic clean-up/refactoring if needed for a new feature, to
>    re-use some code if possible in future
> 1. Introduce new feature next to the existing one
> 2. Add API and all needed infrastructure (structures, fields) for
>    drivers
> 3. Re-wire the existing drivers/frameworks to the new feature via new
>    API; ideally keep 1 patch per driver so the maintainer can easily
>    grasp the changes and ACK it, because it will go via different tree
>    (Rafael's tree); in case of some code clash in the driver's code
>    during merge - it will be a single driver so easier to handle
> 4. when all drivers and frameworks are wired up with the new feature
>    remove the old feature (structures, fields, APIs, etc)
> 5. Update the documentation with new latest state of design
>
> In this approach the patches are less convoluted. Because if I remove
> the old feature and add new in a single patch (e.g. the main structure)
> that patch will have to modify all drivers to still compile. It
> would be a big messy patch for this re-design.
>
> I can see in some later comment from Rafael that he is OK with current
> patch set structure.

OK, in case Rafael and Daniel prefer this, then it's fine.

I just find it weird that we now have

struct em_perf_domain {
        struct em_perf_table __rcu *runtime_table;
                                    ^^^^^^^^^^^^^

as the only EM table.

2023-12-13 11:46:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

On Wed, Dec 13, 2023 at 12:34 PM Dietmar Eggemann
<[email protected]> wrote:
>
> On 13/12/2023 10:23, Lukasz Luba wrote:
> > Hi Dietmar,
> >
> > Thank you for the review, I will go one-by-one to respond to
> > your comments in patches as well. First comments are below.
> >
> > On 12/12/23 18:48, Dietmar Eggemann wrote:
> >> On 29/11/2023 12:08, Lukasz Luba wrote:
>
> [...]
>
> >>> Changelog:
> >>> v5:
> >>> - removed 2 tables design
> >>> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
> >>
> >> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in
> >> v5 this changed to only have one, the modifiable. IMHO it would be
> >> better to change the existing table to be modifiable rather than starting
> >> with two EM's and then removing the static one. I assume you end up with
> >> way less code changes and the patch-set will become easier to digest for
> >> reviewers.
> >
> > The patches are structured in this way following Daniel's recommendation
> > I got when I was adding similar big changes to EM in 2020 (support all
> > devices in kernel). The approach is as follows:
> > 0. Do some basic clean-up/refactoring if needed for a new feature, to
> > re-use some code if possible in future
> > 1. Introduce new feature next to the existing one
> > 2. Add API and all needed infrastructure (structures, fields) for
> > drivers
> > 3. Re-wire the existing drivers/frameworks to the new feature via new
> > API; ideally keep 1 patch per driver so the maintainer can easily
> > grasp the changes and ACK it, because it will go via different tree
> > (Rafael's tree); in case of some code clash in the driver's code
> > during merge - it will be a single driver so easier to handle
> > 4. when all drivers and frameworks are wired up with the new feature
> > remove the old feature (structures, fields, APIs, etc)
> > 5. Update the documentation with new latest state of design
> >
> > In this approach the patches are less convoluted. Because if I remove
> > the old feature and add new in a single patch (e.g. the main structure)
> > that patch will have to modify all drivers to still compile. It
> > would be a big messy patch for this re-design.
> >
> > I can see in some later comment from Rafael that he is OK with current
> > patch set structure.
>
> OK, in case Rafael and Daniel prefer this, then it's fine.
>
> I just find it weird that we now have
>
> struct em_perf_domain {
>         struct em_perf_table __rcu *runtime_table;
>                                     ^^^^^^^^^^^^^
>
> as the only EM table.

I agree that it would be better to call it something like em_table.

2023-12-13 12:19:32

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model



On 12/13/23 11:45, Rafael J. Wysocki wrote:
> On Wed, Dec 13, 2023 at 12:34 PM Dietmar Eggemann
> <[email protected]> wrote:
>>
>> On 13/12/2023 10:23, Lukasz Luba wrote:
>>> Hi Dietmar,
>>>
>>> Thank you for the review, I will go one-by-one to respond to
>>> your comments in patches as well. First comments are below.
>>>
>>> On 12/12/23 18:48, Dietmar Eggemann wrote:
>>>> On 29/11/2023 12:08, Lukasz Luba wrote:
>>
>> [...]
>>
>>>>> Changelog:
>>>>> v5:
>>>>> - removed 2 tables design
>>>>> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
>>>>
>>>> Until v4 you had 2 EM's, the static and the modifiable (runtime). Now in
>>>> v5 this changed to only have one, the modifiable. IMHO it would be
>>>> better to change the existing table to be modifiable rather than starting
>>>> with two EM's and then removing the static one. I assume you end up with
>>>> way less code changes and the patch-set will become easier to digest for
>>>> reviewers.
>>>
>>> The patches are structured in this way following Daniel's recommendation
>>> I got when I was adding similar big changes to EM in 2020 (support all
>>> devices in kernel). The approach is as follows:
>>> 0. Do some basic clean-up/refactoring if needed for a new feature, to
>>> re-use some code if possible in future
>>> 1. Introduce new feature next to the existing one
>>> 2. Add API and all needed infrastructure (structures, fields) for
>>> drivers
>>> 3. Re-wire the existing drivers/frameworks to the new feature via new
>>> API; ideally keep 1 patch per driver so the maintainer can easily
>>> grasp the changes and ACK it, because it will go via different tree
>>> (Rafael's tree); in case of some code clash in the driver's code
>>> during merge - it will be a single driver so easier to handle
>>> 4. when all drivers and frameworks are wired up with the new feature
>>> remove the old feature (structures, fields, APIs, etc)
>>> 5. Update the documentation with new latest state of design
>>>
>>> In this approach the patches are less convoluted. Because if I remove
>>> the old feature and add new in a single patch (e.g. the main structure)
>>> that patch will have to modify all drivers to still compile. It
>>> would be a big messy patch for this re-design.
>>>
>>> I can see in some later comment from Rafael that he is OK with current
>>> patch set structure.
>>
>> OK, in case Rafael and Daniel prefer this, then it's fine.
>>
>> I just find it weird that we now have
>>
>> struct em_perf_domain {
>>         struct em_perf_table __rcu *runtime_table;
>>                                     ^^^^^^^^^^^^^
>>
>> as the only EM table.
>
> I agree that it would be better to call it something like em_table.
>

OK, I'll change that. Thanks Rafael and Dietmar!

2023-12-13 13:15:41

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi Abhijeet,

It's been a while since we discussed an EM feature presented on the
Android common kernel Gerrit (Nov 2021).

On 11/29/23 11:08, Lukasz Luba wrote:
> Hi all,
>
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
>
> It's part of the feature set known as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.
>
> The concepts:
> 1. The CPU power usage can vary due to the workload that it's running or due
> to the temperature of the SoC. The same workload can use more power when the
> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
> In such situation the EM can be adjusted and reflect the fact of increased
> power usage. That power increase is due to static power
> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
> They are also built differently with High Performance (HP) cells or
> Low Power (LP) cells. They are affected by the temperature increase
> differently: HP cells have bigger leakage. The SW model can leverage that
> knowledge.
>
> 2. It is also possible to change the EM to better reflect the currently
> running workload. Usually the EM is derived from some average power values
> taken from experiments with benchmark (e.g. Dhrystone). The model derived
> from such scenario might not represent properly the workloads usually running
> on the device. Therefore, runtime modification of the EM allows to switch to
> a different model, when there is a need.
>
> 3. The EM can be adjusted after boot, when all the modules are loaded and
> more information about the SoC is available e.g. chip binning. This would help
> to better reflect the silicon characteristics. Thus, this EM modification
> API allows it now. It wasn't possible in the past and the EM had to be
> 'set in stone'.
>
> More detailed explanation and background can be found in presentations
> during LPC2022 [1][2] or in the documentation patches.
>
> Some test results.
> The EM can be updated to fit better the workload type. In the case below the EM
> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
> to get more reliable data.
>
> 1. Janky frames percentage
> +--------+-----------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+-----------------+---------------------+-------+-----------+
> | gmean | jank_percentage | EM_default | 2.0 | 0.0% |
> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% |
> +--------+-----------------+---------------------+-------+-----------+
>
> 2. Avg frame render time duration
> +--------+---------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+---------------------+---------------------+-------+-----------+
> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% |
> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% |
> +--------+---------------------+---------------------+-------+-----------+
>
> 3. Max frame render time duration
> +--------+--------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+--------------------+---------------------+-------+-----------+
> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% |
> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% |
> +--------+--------------------+---------------------+-------+-----------+
>
> 4. OS overutilized state percentage (when EAS is not working)
> +--------------+---------------------+------+------------+------------+
> | metric | wa_path | time | total_time | percentage |
> +--------------+---------------------+------+------------+------------+
> | overutilized | EM_default | 1.65 | 253.38 | 0.65 |
> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 |
> +--------------+---------------------+------+------------+------------+
>
> 5. All CPUs (Little+Mid+Big) power values in mW
> +------------+--------+---------------------+-------+-----------+
> | channel | metric | kernel | value | perc_diff |
> +------------+--------+---------------------+-------+-----------+
> | CPU | gmean | EM_default | 142.1 | 0.0% |
> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
> +------------+--------+---------------------+-------+-----------+
>
> The time cost to update the EM decreased in this v5 vs v4:
> big: 5us vs 2us -> 2.6x faster
> mid: 9us vs 3us -> 3x faster
> little: 16us vs 16us -> no change
>
> We still have to update the inefficiency in the cpufreq framework, thus
> a bit of overhead will be there.
>
> Changelog:
> v5:
> - removed 2 tables design
> - have only one table (runtime_table) used also in thermal (Wei, Rafael)
> - refactored update function and removed callback call for each opp
> - added faster EM table swap, using only the RCU pointer update
> - added memory allocation API and tracking with kref
> - avoid overhead for computing 'cost' for each OPP in update, it can be
> pre-computed in device drivers EM earlier
> - add support for device drivers providing EM table
> - added API for computing 'cost' values in EM for EAS
> - added API for thermal/powercap to use EM (using RCU wrappers)
> - switched to single allocation and 'state[]' array (Rafael)
> - changed documentation to align with current design
> - added helper API for computing cost values
> - simplified EM free in unregister path (thanks to kref)
> - split patch updating EM clients and changed them separately
> - added separate patch removing old static EM table
> - added EM debugfs change patch to dump the runtime_table
> - addressed comments in v4 for spelling/comments/headers
> - added review tags
> v4 changes are here [4]
>
> Regards,
> Lukasz Luba
>
> [1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
> [2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
> [3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
> [4] https://lore.kernel.org/lkml/[email protected]/
>
>
> Lukasz Luba (23):
> PM: EM: Add missing newline for the message log
> PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
> PM: EM: Find first CPU active while updating OPP efficiency
> PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
> PM: EM: Refactor a new function em_compute_costs()
> PM: EM: Check if the get_cost() callback is present in
> em_compute_costs()
> PM: EM: Refactor how the EM table is allocated and populated
> PM: EM: Introduce runtime modifiable table
> PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
> PM: EM: Add API for memory allocations for new tables
> PM: EM: Add API for updating the runtime modifiable EM
> PM: EM: Add helpers to read under RCU lock the EM table
> PM: EM: Add performance field to struct em_perf_state
> PM: EM: Support late CPUs booting and capacity adjustment
> PM: EM: Optimize em_cpu_energy() and remove division
> powercap/dtpm_cpu: Use new Energy Model interface to get table
> powercap/dtpm_devfreq: Use new Energy Model interface to get table
> drivers/thermal/cpufreq_cooling: Use new Energy Model interface
> drivers/thermal/devfreq_cooling: Use new Energy Model interface
> PM: EM: Change debugfs configuration to use runtime EM table data
> PM: EM: Remove old table
> PM: EM: Add em_dev_compute_costs() as API for device drivers
> Documentation: EM: Update with runtime modification design
>
> Documentation/power/energy-model.rst | 206 +++++++++++-
> drivers/powercap/dtpm_cpu.c | 35 +-
> drivers/powercap/dtpm_devfreq.c | 31 +-
> drivers/thermal/cpufreq_cooling.c | 40 ++-
> drivers/thermal/devfreq_cooling.c | 43 ++-
> include/linux/energy_model.h | 163 +++++----
> kernel/power/energy_model.c | 479 +++++++++++++++++++++++----
> 7 files changed, 813 insertions(+), 184 deletions(-)
>

You were interested in this feature back then.

I have a gentle ask, if you are still interested. It would be nice if
you (or some other Qcom engineer) could leave a feedback comment
(similar to what you made for the original Gerrit series). I would be
really grateful.

This cover letter has some power saving numbers from a real phone,
along with performance metrics (janky frames). You might be interested
in those scenarios as well.

Regards,
Lukasz

2023-12-13 13:41:25

by Hongyan Xia

[permalink] [raw]
Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi Rafael,

On 12/12/2023 18:49, Rafael J. Wysocki wrote:
> Hi Lukasz,
>
> On Wed, Nov 29, 2023 at 12:08 PM Lukasz Luba <[email protected]> wrote:
>>
>> [...]
>
> I like this one more than the previous one and thanks for taking my
> feedback into account.
>
> I would still like other people having a vested interest in the EM to
> look at it and give feedback (or just tags), so I'm not inclined to
> apply it just yet. However, I don't have any specific comments on it.

I do have a keen interest in this series, but mostly from the point of
view of uclamp. Currently uclamp is able to send hints to the scheduler
to bias task placement. Some CPU cores are known to have very different
energy efficiency depending on the task. We know these tasks beforehand
and can use uclamp to bias them towards certain CPUs which we know are
more efficient for them.

Personally I've always been wondering if this could just be reflected in
the EM itself without emphasizing the task placement aspect of
uclamp. The idea of this series LGTM and I'll take a deeper look.

Hongyan

2023-12-17 17:58:27

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 02/23] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments

On 11/29/23 11:08, Lukasz Luba wrote:
> In order to prepare the code for the modifiable EM perf_state table,
> refactor existing function em_cpufreq_update_efficiencies().

nit: What is being refactored here? The description is not adding much info
about the change.


Cheers

--
Qais Yousef

>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> kernel/power/energy_model.c | 8 +++-----
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 8b9dd4a39f63..42486674b834 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -237,10 +237,10 @@ static int em_create_pd(struct device *dev, int nr_states,
> return 0;
> }
>
> -static void em_cpufreq_update_efficiencies(struct device *dev)
> +static void
> +em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
> {
> struct em_perf_domain *pd = dev->em_pd;
> - struct em_perf_state *table;
> struct cpufreq_policy *policy;
> int found = 0;
> int i;
> @@ -254,8 +254,6 @@ static void em_cpufreq_update_efficiencies(struct device *dev)
> return;
> }
>
> - table = pd->table;
> -
> for (i = 0; i < pd->nr_perf_states; i++) {
> if (!(table[i].flags & EM_PERF_STATE_INEFFICIENT))
> continue;
> @@ -397,7 +395,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>
> dev->em_pd->flags |= flags;
>
> - em_cpufreq_update_efficiencies(dev);
> + em_cpufreq_update_efficiencies(dev, dev->em_pd->table);
>
> em_debug_create_pd(dev);
> dev_info(dev, "EM: created perf domain\n");
> --
> 2.25.1
>

2023-12-17 17:58:44

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 03/23] PM: EM: Find first CPU active while updating OPP efficiency

On 11/29/23 11:08, Lukasz Luba wrote:
> The Energy Model might be updated at runtime and the energy efficiency
> for each OPP may change. Thus, there is a need to update also the
> cpufreq framework and make it aligned to the new values. In order to
> do that, use a first active CPU from the Performance Domain. This is
> needed since the first CPU in the cpumask might be offline when we
> run this code path.

I didn't understand the problem here. It seems you're fixing a race, but it
is not clear to me from the description what the race is.

>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> kernel/power/energy_model.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 42486674b834..aa7c89f9e115 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
> struct em_perf_domain *pd = dev->em_pd;
> struct cpufreq_policy *policy;
> int found = 0;
> - int i;
> + int i, cpu;
>
> if (!_is_cpu_device(dev) || !pd)
> return;
>
> - policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
> + /* Try to get a CPU which is active and in this PD */
> + cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);
> + if (cpu >= nr_cpu_ids) {
> + dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
> + return;
> + }
> +
> + policy = cpufreq_cpu_get(cpu);

Shouldn't policy be NULL here if all policy->related_cpus were offlined?


Cheers

--
Qais Yousef

> if (!policy) {
> dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
> return;
> --
> 2.25.1
>

2023-12-17 17:59:07

by Qais Yousef

Subject: Re: [PATCH v5 05/23] PM: EM: Refactor a new function em_compute_costs()

On 11/29/23 11:08, Lukasz Luba wrote:
> Refactor a dedicated function which will be easier to maintain and re-use
> in future. The upcoming changes for the modifiable EM perf_state table
> will use it (instead of duplicating the code).

nit: What is being refactored? Looks like you took em_compute_costs() out of
em_create_perf_table().
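
For readers following along, the logic that moves into the new helper can be
sketched in userspace as below. It covers only the non-artificial path
(cost = fmax * power / frequency) and the inefficiency marking. The struct and
function names are hypothetical; the table is assumed to be sorted by
ascending, non-zero frequency.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

struct perf_state_sketch {
	unsigned long frequency;	/* kHz */
	unsigned long power;
	unsigned long cost;
	int inefficient;
};

/*
 * Walk the table from the highest to the lowest frequency. A state is
 * marked inefficient when its cost is not lower than the cheapest cost
 * seen so far among the higher-frequency states.
 */
static void compute_costs_sketch(struct perf_state_sketch *table, int nr_states)
{
	unsigned long prev_cost = ULONG_MAX;
	uint64_t fmax;
	int i;

	assert(nr_states > 0);
	fmax = table[nr_states - 1].frequency;

	for (i = nr_states - 1; i >= 0; i--) {
		table[i].cost = (unsigned long)(fmax * table[i].power /
						table[i].frequency);

		if (table[i].cost >= prev_cost)
			table[i].inefficient = 1;
		else
			prev_cost = table[i].cost;
	}
}
```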


Cheers

--
Qais Yousef

>
> This change is not expected to alter the general functionality.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> kernel/power/energy_model.c | 72 ++++++++++++++++++++++---------------
> 1 file changed, 43 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index aa7c89f9e115..3bea930410c6 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -103,14 +103,52 @@ static void em_debug_create_pd(struct device *dev) {}
> static void em_debug_remove_pd(struct device *dev) {}
> #endif
>
> +static int em_compute_costs(struct device *dev, struct em_perf_state *table,
> + struct em_data_callback *cb, int nr_states,
> + unsigned long flags)
> +{
> + unsigned long prev_cost = ULONG_MAX;
> + u64 fmax;
> + int i, ret;
> +
> + /* Compute the cost of each performance state. */
> + fmax = (u64) table[nr_states - 1].frequency;
> + for (i = nr_states - 1; i >= 0; i--) {
> + unsigned long power_res, cost;
> +
> + if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
> + ret = cb->get_cost(dev, table[i].frequency, &cost);
> + if (ret || !cost || cost > EM_MAX_POWER) {
> + dev_err(dev, "EM: invalid cost %lu %d\n",
> + cost, ret);
> + return -EINVAL;
> + }
> + } else {
> + power_res = table[i].power;
> + cost = div64_u64(fmax * power_res, table[i].frequency);
> + }
> +
> + table[i].cost = cost;
> +
> + if (table[i].cost >= prev_cost) {
> + table[i].flags = EM_PERF_STATE_INEFFICIENT;
> + dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
> + table[i].frequency);
> + } else {
> + prev_cost = table[i].cost;
> + }
> + }
> +
> + return 0;
> +}
> +
> static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> int nr_states, struct em_data_callback *cb,
> unsigned long flags)
> {
> - unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
> + unsigned long power, freq, prev_freq = 0;
> struct em_perf_state *table;
> int i, ret;
> - u64 fmax;
>
> table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
> if (!table)
> @@ -154,33 +192,9 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> table[i].frequency = prev_freq = freq;
> }
>
> - /* Compute the cost of each performance state. */
> - fmax = (u64) table[nr_states - 1].frequency;
> - for (i = nr_states - 1; i >= 0; i--) {
> - unsigned long power_res, cost;
> -
> - if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
> - ret = cb->get_cost(dev, table[i].frequency, &cost);
> - if (ret || !cost || cost > EM_MAX_POWER) {
> - dev_err(dev, "EM: invalid cost %lu %d\n",
> - cost, ret);
> - goto free_ps_table;
> - }
> - } else {
> - power_res = table[i].power;
> - cost = div64_u64(fmax * power_res, table[i].frequency);
> - }
> -
> - table[i].cost = cost;
> -
> - if (table[i].cost >= prev_cost) {
> - table[i].flags = EM_PERF_STATE_INEFFICIENT;
> - dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
> - table[i].frequency);
> - } else {
> - prev_cost = table[i].cost;
> - }
> - }
> + ret = em_compute_costs(dev, table, cb, nr_states, flags);
> + if (ret)
> + goto free_ps_table;
>
> pd->table = table;
> pd->nr_perf_states = nr_states;
> --
> 2.25.1
>

2023-12-17 17:59:28

by Qais Yousef

Subject: Re: [PATCH v5 07/23] PM: EM: Refactor how the EM table is allocated and populated

On 11/29/23 11:08, Lukasz Luba wrote:
> Split the process of allocation and data initialization for the EM table.
> The upcoming changes for modifiable EM will use it.
>
> This change is not expected to alter the general functionality.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> kernel/power/energy_model.c | 52 ++++++++++++++++++++++---------------
> 1 file changed, 31 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 3c8542443dd4..99426b5eedb6 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -142,18 +142,25 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
> return 0;
> }
>
> +static int em_allocate_perf_table(struct em_perf_domain *pd,
> + int nr_states)
> +{
> + pd->table = kcalloc(nr_states, sizeof(struct em_perf_state),
> + GFP_KERNEL);
> + if (!pd->table)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> + struct em_perf_state *table,
> int nr_states, struct em_data_callback *cb,
> unsigned long flags)
> {
> unsigned long power, freq, prev_freq = 0;
> - struct em_perf_state *table;
> int i, ret;
>
> - table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
> - if (!table)
> - return -ENOMEM;
> -
> /* Build the list of performance states for this performance domain */
> for (i = 0, freq = 0; i < nr_states; i++, freq++) {
> /*
> @@ -165,7 +172,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> if (ret) {
> dev_err(dev, "EM: invalid perf. state: %d\n",
> ret);
> - goto free_ps_table;
> + return -EINVAL;
> }
>
> /*
> @@ -175,7 +182,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> if (freq <= prev_freq) {
> dev_err(dev, "EM: non-increasing freq: %lu\n",
> freq);
> - goto free_ps_table;
> + return -EINVAL;
> }
>
> /*
> @@ -185,7 +192,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> if (!power || power > EM_MAX_POWER) {
> dev_err(dev, "EM: invalid power: %lu\n",
> power);
> - goto free_ps_table;
> + return -EINVAL;
> }
>
> table[i].power = power;
> @@ -194,16 +201,9 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>
> ret = em_compute_costs(dev, table, cb, nr_states, flags);
> if (ret)
> - goto free_ps_table;

Don't we care about propagating the error number stored in ret here?

> -
> - pd->table = table;
> - pd->nr_perf_states = nr_states;
> + return -EINVAL;
>
> return 0;
> -
> -free_ps_table:
> - kfree(table);
> - return -EINVAL;
> }
>
> static int em_create_pd(struct device *dev, int nr_states,
> @@ -234,11 +234,15 @@ static int em_create_pd(struct device *dev, int nr_states,
> return -ENOMEM;
> }
>
> - ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
> - if (ret) {
> - kfree(pd);
> - return ret;
> - }
> + pd->nr_perf_states = nr_states;
> +
> + ret = em_allocate_perf_table(pd, nr_states);
> + if (ret)
> + goto free_pd;
> +
> + ret = em_create_perf_table(dev, pd, pd->table, nr_states, cb, flags);
> + if (ret)
> + goto free_pd_table;

Ditto for all the above
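
As an illustration of what propagating ret would look like, here is a
userspace sketch of the same allocate/populate split with goto-based unwinding.
All names (pd_sketch, populate_sketch(), create_pd_sketch()) are hypothetical,
and the -ERANGE failure is invented for the example; the point is only that the
unwind path returns the callee's code rather than a blanket -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct pd_sketch {
	int nr_states;
	unsigned long *table;
};

/* Hypothetical populate step: rejects a zero entry with -ERANGE. */
static int populate_sketch(unsigned long *table, const unsigned long *src,
			   int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!src[i])
			return -ERANGE;
		table[i] = src[i];
	}

	return 0;
}

static int create_pd_sketch(struct pd_sketch **out, const unsigned long *src,
			    int n)
{
	struct pd_sketch *pd;
	int ret;

	assert(out && src && n > 0);

	pd = calloc(1, sizeof(*pd));
	if (!pd)
		return -ENOMEM;

	pd->nr_states = n;
	pd->table = calloc(n, sizeof(*pd->table));
	if (!pd->table) {
		ret = -ENOMEM;
		goto free_pd;
	}

	ret = populate_sketch(pd->table, src, n);
	if (ret)
		goto free_pd_table;

	*out = pd;
	return 0;

free_pd_table:
	free(pd->table);
free_pd:
	free(pd);
	return ret;	/* the callee's error code, not a fixed -EINVAL */
}
```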


Cheers

--
Qais Yousef

>
> if (_is_cpu_device(dev))
> for_each_cpu(cpu, cpus) {
> @@ -249,6 +253,12 @@ static int em_create_pd(struct device *dev, int nr_states,
> dev->em_pd = pd;
>
> return 0;
> +
> +free_pd_table:
> + kfree(pd->table);
> +free_pd:
> + kfree(pd);
> + return -EINVAL;
> }
>
> static void
> --
> 2.25.1
>

2023-12-17 17:59:41

by Qais Yousef

Subject: Re: [PATCH v5 09/23] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS

On 11/29/23 11:08, Lukasz Luba wrote:
> The new Energy Model (EM) supports runtime modification of the performance
> state table to better model the power used by the SoC. Use this new
> feature to improve energy estimation and therefore task placement in
> Energy Aware Scheduler (EAS).

nit: you moved the code to use the new runtime em table instead of the one
parsed at boot.

>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/linux/energy_model.h | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 1e618e431cac..94a77a813724 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -238,6 +238,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> unsigned long max_util, unsigned long sum_util,
> unsigned long allowed_cpu_cap)
> {
> + struct em_perf_table *runtime_table;
> unsigned long freq, scale_cpu;
> struct em_perf_state *ps;
> int cpu, i;
> @@ -255,7 +256,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> */
> cpu = cpumask_first(to_cpumask(pd->cpus));
> scale_cpu = arch_scale_cpu_capacity(cpu);
> - ps = &pd->table[pd->nr_perf_states - 1];
> +
> + /*
> + * No rcu_read_lock() since it's already called by task scheduler.
> + * The runtime_table is always there for CPUs, so we don't check.
> + */

WARN_ON(!rcu_read_lock_held()) instead?


Cheers

--
Qais Yousef

> + runtime_table = rcu_dereference(pd->runtime_table);
> +
> + ps = &runtime_table->state[pd->nr_perf_states - 1];
>
> max_util = map_util_perf(max_util);
> max_util = min(max_util, allowed_cpu_cap);
> @@ -265,9 +273,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> * Find the lowest performance state of the Energy Model above the
> * requested frequency.
> */
> - i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
> - pd->flags);
> - ps = &pd->table[i];
> + i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
> + freq, pd->flags);
> + ps = &runtime_table->state[i];
>
> /*
> * The capacity of a CPU in the domain at the performance state (ps)
> --
> 2.25.1
>

2023-12-17 18:00:11

by Qais Yousef

Subject: Re: [PATCH v5 10/23] PM: EM: Add API for memory allocations for new tables

On 11/29/23 11:08, Lukasz Luba wrote:
> The runtime modified EM table can be provided from drivers. Create
> mechanism which allows safely allocate and free the table for device
> drivers. The same table can be used by the EAS in task scheduler code
> paths, so make sure the memory is not freed when the device driver module
> is unloaded.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/linux/energy_model.h | 11 +++++++++
> kernel/power/energy_model.c | 44 ++++++++++++++++++++++++++++++++++--
> 2 files changed, 53 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 94a77a813724..e785211828fe 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -5,6 +5,7 @@
> #include <linux/device.h>
> #include <linux/jump_label.h>
> #include <linux/kobject.h>
> +#include <linux/kref.h>
> #include <linux/rcupdate.h>
> #include <linux/sched/cpufreq.h>
> #include <linux/sched/topology.h>
> @@ -39,10 +40,12 @@ struct em_perf_state {
> /**
> * struct em_perf_table - Performance states table
> * @rcu: RCU used for safe access and destruction
> + * @refcount: Reference count to track the owners
> * @state: List of performance states, in ascending order
> */
> struct em_perf_table {
> struct rcu_head rcu;
> + struct kref refcount;
> struct em_perf_state state[];
> };
>
> @@ -184,6 +187,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> struct em_data_callback *cb, cpumask_t *span,
> bool microwatts);
> void em_dev_unregister_perf_domain(struct device *dev);
> +struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
> +void em_free_table(struct em_perf_table __rcu *table);
>
> /**
> * em_pd_get_efficient_state() - Get an efficient performance state from the EM
> @@ -368,6 +373,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
> {
> return 0;
> }
> +static inline
> +struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd)
> +{
> + return NULL;
> +}
> +static inline void em_free_table(struct em_perf_table __rcu *table) {}
> #endif
>
> #endif
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 489287666705..489a358b9a00 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -114,12 +114,46 @@ static void em_destroy_table_rcu(struct rcu_head *rp)
> kfree(runtime_table);
> }
>
> -static void em_free_table(struct em_perf_table __rcu *table)
> +static void em_release_table_kref(struct kref *kref)
> {
> + struct em_perf_table __rcu *table;
> +
> + /* It was the last owner of this table so we can free */
> + table = container_of(kref, struct em_perf_table, refcount);
> +
> call_rcu(&table->rcu, em_destroy_table_rcu);
> }
>
> -static struct em_perf_table __rcu *
> +static inline void em_inc_usage(struct em_perf_table __rcu *table)
> +{
> + kref_get(&table->refcount);
> +}
> +
> +static void em_dec_usage(struct em_perf_table __rcu *table)
> +{
> + kref_put(&table->refcount, em_release_table_kref);
> +}

nit: em_table_inc/dec() instead? matches general theme elsewhere in the code
base.

> +
> +/**
> + * em_free_table() - Handles safe free of the EM table when needed
> + * @table : EM memory which is going to be freed
> + *
> + * No return values.
> + */
> +void em_free_table(struct em_perf_table __rcu *table)
> +{
> + em_dec_usage(table);
> +}
> +
> +/**
> + * em_allocate_table() - Handles safe allocation of the new EM table
> + * @table : EM memory which is going to be freed
> + *
> + * Increments the reference counter to mark that there is an owner of that
> + * EM table. That might be a device driver module or EAS.
> + * Returns allocated table or error.
> + */
> +struct em_perf_table __rcu *
> em_allocate_table(struct em_perf_domain *pd)
> {
> struct em_perf_table __rcu *table;
> @@ -128,6 +162,12 @@ em_allocate_table(struct em_perf_domain *pd)
> table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
>
> table = kzalloc(sizeof(*table) + table_size, GFP_KERNEL);
> + if (!table)
> + return table;
> +
> + kref_init(&table->refcount);
> + em_inc_usage(table);

Doesn't kref_init() initialize the count to 1 already? Is the em_inc_usage()
needed here?
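
The effect of the extra increment can be seen in a small userspace sketch. It
uses a plain, non-atomic counter in place of struct kref, and all names
(table_sketch, table_alloc(), table_get(), table_put()) are hypothetical. The
only kernel fact assumed is that kref_init() sets the count to 1.

```c
#include <assert.h>
#include <stdlib.h>

struct table_sketch {
	unsigned int refcount;
	int *released;	/* set to 1 by the release path, for observation */
};

static struct table_sketch *table_alloc(int *released)
{
	struct table_sketch *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;

	t->refcount = 1;	/* what kref_init() does */
	t->released = released;
	return t;
}

static void table_get(struct table_sketch *t)
{
	assert(t->refcount > 0);
	t->refcount++;
}

static void table_put(struct table_sketch *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0) {
		*t->released = 1;
		free(t);
	}
}
```

If the allocator also calls table_get(), the count starts at 2 and a single
table_put() from the owner does not release the memory; a second put is needed.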


Cheers

--
Qais Yousef

> +
> return table;
> }
>
> --
> 2.25.1
>

2023-12-17 18:00:28

by Qais Yousef

Subject: Re: [PATCH v5 13/23] PM: EM: Add performance field to struct em_perf_state

On 11/29/23 11:08, Lukasz Luba wrote:
> The performance doesn't scale linearly with the frequency. Also, it may
> be different in different workloads. Some CPUs are designed to be
> particularly good at some applications e.g. images or video processing
> and other CPUs in different. When those different types of CPUs are
> combined in one SoC they should be properly modeled to get max of the HW
> in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the
> power vs. performance curves to the EAS, but assumes the CPUs capacity
> is fixed and scales linearly with the frequency. This patch allows to
> adjust the curve on the 'performance' axis as well.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/linux/energy_model.h | 11 ++++++-----
> kernel/power/energy_model.c | 27 +++++++++++++++++++++++++++
> 2 files changed, 33 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index ae3ccc8b9f44..e30750500b10 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -13,6 +13,7 @@
>
> /**
> * struct em_perf_state - Performance state of a performance domain
> + * @performance: Non-linear CPU performance at a given frequency
> * @frequency: The frequency in KHz, for consistency with CPUFreq
> * @power: The power consumed at this level (by 1 CPU or by a registered
> * device). It can be a total power: static and dynamic.
> @@ -21,6 +22,7 @@
> * @flags: see "em_perf_state flags" description below.
> */
> struct em_perf_state {
> + unsigned long performance;
> unsigned long frequency;
> unsigned long power;
> unsigned long cost;
> @@ -207,14 +209,14 @@ void em_free_table(struct em_perf_table __rcu *table);
> */
> static inline int
> em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
> - unsigned long freq, unsigned long pd_flags)
> + unsigned long max_util, unsigned long pd_flags)
> {
> struct em_perf_state *ps;
> int i;
>
> for (i = 0; i < nr_perf_states; i++) {
> ps = &table[i];
> - if (ps->frequency >= freq) {
> + if (ps->performance >= max_util) {
> if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
> ps->flags & EM_PERF_STATE_INEFFICIENT)
> continue;
> @@ -246,8 +248,8 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> unsigned long allowed_cpu_cap)
> {
> struct em_perf_table *runtime_table;
> - unsigned long freq, scale_cpu;
> struct em_perf_state *ps;
> + unsigned long scale_cpu;
> int cpu, i;
>
> if (!sum_util)
> @@ -274,14 +276,13 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>
> max_util = map_util_perf(max_util);
> max_util = min(max_util, allowed_cpu_cap);
> - freq = map_util_freq(max_util, ps->frequency, scale_cpu);
>
> /*
> * Find the lowest performance state of the Energy Model above the
> * requested frequency.
> */
> i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
> - freq, pd->flags);
> + max_util, pd->flags);
> ps = &runtime_table->state[i];
>
> /*
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 614891fde8df..b5016afe6a19 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -46,6 +46,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
> debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
> debugfs_create_ulong("power", 0444, d, &ps->power);
> debugfs_create_ulong("cost", 0444, d, &ps->cost);
> + debugfs_create_ulong("performance", 0444, d, &ps->performance);
> debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
> }
>
> @@ -171,6 +172,30 @@ em_allocate_table(struct em_perf_domain *pd)
> return table;
> }
>
> +static void em_init_performance(struct device *dev, struct em_perf_domain *pd,
> + struct em_perf_state *table, int nr_states)
> +{
> + u64 fmax, max_cap;
> + int i, cpu;
> +
> + /* This is needed only for CPUs and EAS skip other devices */
> + if (!_is_cpu_device(dev))
> + return;
> +
> + cpu = cpumask_first(em_span_cpus(pd));
> +
> + /*
> + * Calculate the performance value for each frequency with
> + * linear relationship. The final CPU capacity might not be ready at
> + * boot time, but the EM will be updated a bit later with correct one.
> + */
> + fmax = (u64) table[nr_states - 1].frequency;
> + max_cap = (u64) arch_scale_cpu_capacity(cpu);
> + for (i = 0; i < nr_states; i++)
> + table[i].performance = div64_u64(max_cap * table[i].frequency,
> + fmax);

Should we sanity check that the returned performance value is correct in case
we got passed a malformed table? Maybe the table is sanity checked and sorted
before we get here; I didn't check, to be honest.

I think a warning when performance exceeds max_cap would be helpful in general
as the code evolves in the future.
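
A userspace sketch of such a check, under the assumption that performance is
computed as max_cap * frequency / fmax like in the hunk above. The function
name and the choice to return the number of offending states are hypothetical;
in the kernel this would more likely be a WARN or dev_warn().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fill perf[] with linearly scaled performance values and return how many
 * states ended up above max_cap. A non-zero result means the frequency
 * table was not sorted in ascending order. Returns -1 if fmax is zero.
 */
static int init_performance_sketch(const unsigned long *freq,
				   unsigned long *perf, int nr_states,
				   uint64_t max_cap)
{
	uint64_t fmax;
	int i, bad = 0;

	assert(nr_states > 0);

	fmax = freq[nr_states - 1];
	if (!fmax)
		return -1;

	for (i = 0; i < nr_states; i++) {
		perf[i] = (unsigned long)(max_cap * freq[i] / fmax);
		if (perf[i] > max_cap)
			bad++;
	}

	return bad;
}
```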


Cheers

--
Qais Yousef

> +}
> +
> static int em_compute_costs(struct device *dev, struct em_perf_state *table,
> struct em_data_callback *cb, int nr_states,
> unsigned long flags)
> @@ -331,6 +356,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> table[i].frequency = prev_freq = freq;
> }
>
> + em_init_performance(dev, pd, table, nr_states);
> +
> ret = em_compute_costs(dev, table, cb, nr_states, flags);
> if (ret)
> return -EINVAL;
> --
> 2.25.1
>

2023-12-17 18:00:54

by Qais Yousef

Subject: Re: [PATCH v5 14/23] PM: EM: Support late CPUs booting and capacity adjustment

On 11/29/23 11:08, Lukasz Luba wrote:
> The patch adds needed infrastructure to handle the late CPUs boot, which
> might change the previous CPUs capacity values. With this changes the new
> CPUs which try to register EM will trigger the needed re-calculations for
> other CPUs EMs. Thanks to that the em_per_state::performance values will
> be aligned with the CPU capacity information after all CPUs finish the
> boot and EM registrations.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> kernel/power/energy_model.c | 121 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 121 insertions(+)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index b5016afe6a19..d3fa5a77de80 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -25,6 +25,9 @@ static DEFINE_MUTEX(em_pd_mutex);
>
> static void em_cpufreq_update_efficiencies(struct device *dev,
> struct em_perf_state *table);
> +static void em_check_capacity_update(void);
> +static void em_update_workfn(struct work_struct *work);
> +static DECLARE_DELAYED_WORK(em_update_work, em_update_workfn);
>
> static bool _is_cpu_device(struct device *dev)
> {
> @@ -596,6 +599,10 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>
> unlock:
> mutex_unlock(&em_pd_mutex);
> +
> + if (_is_cpu_device(dev))
> + em_check_capacity_update();
> +
> return ret;
> }
> EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
> @@ -631,3 +638,117 @@ void em_dev_unregister_perf_domain(struct device *dev)
> mutex_unlock(&em_pd_mutex);
> }
> EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain);
> +
> +/*
> + * Adjustment of CPU performance values after boot, when all CPUs capacites
> + * are correctly calculated.
> + */
> +static void em_adjust_new_capacity(struct device *dev,
> + struct em_perf_domain *pd,
> + u64 max_cap)
> +{
> + struct em_perf_table __rcu *runtime_table;
> + struct em_perf_state *table, *new_table;
> + int ret, table_size;
> +
> + runtime_table = em_allocate_table(pd);
> + if (!runtime_table) {
> + dev_warn(dev, "EM: allocation failed\n");
> + return;
> + }
> +
> + new_table = runtime_table->state;
> +
> + table = em_get_table(pd);
> + /* Initialize data based on older runtime table */
> + table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
> + memcpy(new_table, table, table_size);
> +
> + em_put_table();
> +
> + em_init_performance(dev, pd, new_table, pd->nr_perf_states);
> + ret = em_compute_costs(dev, new_table, NULL, pd->nr_perf_states,
> + pd->flags);
> + if (ret) {
> + em_free_table(runtime_table);
> + return;
> + }
> +
> + ret = em_dev_update_perf_domain(dev, runtime_table);
> + if (ret)
> + dev_warn(dev, "EM: update failed %d\n", ret);
> +
> + /*
> + * This is one-time-update, so give up the ownership in this updater.
> + * The EM fwk will keep the reference and free the memory when needed.
> + */
> + em_free_table(runtime_table);
> +}
> +
> +static void em_check_capacity_update(void)
> +{
> + cpumask_var_t cpu_done_mask;
> + struct em_perf_state *table;
> + struct em_perf_domain *pd;
> + unsigned long cpu_capacity;
> + int cpu;
> +
> + if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
> + pr_warn("no free memory\n");
> + return;
> + }
> +
> + /* Check if CPUs capacity has changed than update EM */
> + for_each_possible_cpu(cpu) {

Can't we instead hook into cpufreq_online/offline() to check if we need to
do any em related update for this policy?


Cheers

--
Qais Yousef

> + struct cpufreq_policy *policy;
> + unsigned long em_max_perf;
> + struct device *dev;
> + int nr_states;
> +
> + if (cpumask_test_cpu(cpu, cpu_done_mask))
> + continue;
> +
> + policy = cpufreq_cpu_get(cpu);
> + if (!policy) {
> + pr_debug("Accessing cpu%d policy failed\n", cpu);
> + schedule_delayed_work(&em_update_work,
> + msecs_to_jiffies(1000));
> + break;
> + }
> + cpufreq_cpu_put(policy);
> +
> + pd = em_cpu_get(cpu);
> + if (!pd || em_is_artificial(pd))
> + continue;
> +
> + cpumask_or(cpu_done_mask, cpu_done_mask,
> + em_span_cpus(pd));
> +
> + nr_states = pd->nr_perf_states;
> + cpu_capacity = arch_scale_cpu_capacity(cpu);
> +
> + table = em_get_table(pd);
> + em_max_perf = table[pd->nr_perf_states - 1].performance;
> + em_put_table();
> +
> + /*
> + * Check if the CPU capacity has been adjusted during boot
> + * and trigger the update for new performance values.
> + */
> + if (em_max_perf == cpu_capacity)
> + continue;
> +
> + pr_debug("updating cpu%d cpu_cap=%lu old capacity=%lu\n",
> + cpu, cpu_capacity, em_max_perf);
> +
> + dev = get_cpu_device(cpu);
> + em_adjust_new_capacity(dev, pd, cpu_capacity);
> + }
> +
> + free_cpumask_var(cpu_done_mask);
> +}
> +
> +static void em_update_workfn(struct work_struct *work)
> +{
> + em_check_capacity_update();
> +}
> --
> 2.25.1
>

2023-12-17 18:03:29

by Qais Yousef

Subject: Re: [PATCH v5 22/23] PM: EM: Add em_dev_compute_costs() as API for device drivers

On 12/12/23 19:50, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
> > The device drivers can modify EM at runtime by providing a new EM table.
> > The EM is used by the EAS and the em_perf_state::cost stores
> > pre-calculated value to avoid overhead. This patch provides the API for
> > device drivers to calculate the cost values properly (and not duplicate
> > the same code).
>
> New interface w/o any users? Can we not remove this from this patch-set
> and introduce it with the first user(s)?

It's a chicken and egg problem. Without the interface, new users can't appear
either. So assuming the interface makes sense, I vote to keep it.

I ran out of brain power halfway through the series and haven't reviewed this
properly yet, but I will continue looking later in the week.


Cheers

--
Qais Yousef

2023-12-17 18:23:14

by Qais Yousef

Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi Lukasz

On 11/29/23 11:08, Lukasz Luba wrote:
> Hi all,
>
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
>
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.

Thanks. The problem of EM accuracy has been observed in the field and it would
be nice to have a mainline solution for it. We carry our own out-of-tree change
to enable modifying the EM.

>
> The concepts:
> 1. The CPU power usage can vary due to the workload that it's running or due
> to the temperature of the SoC. The same workload can use more power when the
> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
> In such situation the EM can be adjusted and reflect the fact of increased
> power usage. That power increase is due to static power
> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
> They are also built differently with High Performance (HP) cells or
> Low Power (LP) cells. They are affected by the temperature increase
> differently: HP cells have bigger leakage. The SW model can leverage that
> knowledge.

One thing I'm not sure about is that in practice the temperature of the SoC can
vary a lot in a short period of time. What is the expectation here? I can see
this being useful in practice only if we average it over a window of time.
Tracking it closely will be really hard; big variations can happen within a few
ms.
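
One possible reading of "average it over a window" is an exponential moving
average with a threshold before touching the EM. The sketch below is only an
illustration of that idea, not something taken from the patch set; the function
names, the millidegree Celsius units and the shift-based weight are all
assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Exponential moving average of temperature samples. Each new sample
 * contributes 1 / 2^shift of its distance from the current average, so a
 * larger shift means a longer effective window.
 */
static long temp_ema_update(long avg_mc, long sample_mc, unsigned int shift)
{
	assert(shift < 16);
	return avg_mc + (sample_mc - avg_mc) / (1L << shift);
}

/*
 * Only request an EM update once the averaged temperature has drifted at
 * least threshold_mc away from the value used for the last update.
 */
static int em_update_needed(long avg_mc, long applied_mc, long threshold_mc)
{
	return labs(avg_mc - applied_mc) >= threshold_mc;
}
```

With shift = 3, a single 8 degree spike moves the average by only 1 degree, so
short bursts would not trigger an EM update on their own.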

A driver interface for this part makes sense, as the thermal framework will
likely know how to feed things back to the EM table, if necessary.

>
> 2. It is also possible to change the EM to better reflect the currently
> running workload. Usually the EM is derived from some average power values
> taken from experiments with benchmark (e.g. Dhrystone). The model derived
> from such scenario might not represent properly the workloads usually running
> on the device. Therefore, runtime modification of the EM allows to switch to
> a different model, when there is a need.

I didn't get how the new performance field is supposed to be controlled and
modified by users. A driver interface doesn't seem suitable as there's no
subsystem that knows the characteristics of the workload except userspace. In
Android we do have contextual info about what the current top-app is, which
would enable modifying the capacities to match its characteristics.

>
> 3. The EM can be adjusted after boot, when all the modules are loaded and
> more information about the SoC is available e.g. chip binning. This would help
> to better reflect the silicon characteristics. Thus, this EM modification
> API allows it now. It wasn't possible in the past and the EM had to be
> 'set in stone'.
>
> More detailed explanation and background can be found in presentations
> during LPC2022 [1][2] or in the documentation patches.
>
> Some test results.
> The EM can be updated to fit better the workload type. In the case below the EM
> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
> to get more reliable data.
>
> 1. Janky frames percentage
> +--------+-----------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+-----------------+---------------------+-------+-----------+
> | gmean | jank_percentage | EM_default | 2.0 | 0.0% |
> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% |
> +--------+-----------------+---------------------+-------+-----------+
>
> 2. Avg frame render time duration
> +--------+---------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+---------------------+---------------------+-------+-----------+
> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% |
> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% |
> +--------+---------------------+---------------------+-------+-----------+
>
> 3. Max frame render time duration
> +--------+--------------------+---------------------+-------+-----------+
> | metric | variable | kernel | value | perc_diff |
> +--------+--------------------+---------------------+-------+-----------+
> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% |
> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% |
> +--------+--------------------+---------------------+-------+-----------+
>
> 4. OS overutilized state percentage (when EAS is not working)
> +--------------+---------------------+------+------------+------------+
> | metric | wa_path | time | total_time | percentage |
> +--------------+---------------------+------+------------+------------+
> | overutilized | EM_default | 1.65 | 253.38 | 0.65 |
> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 |
> +--------------+---------------------+------+------------+------------+
>
> 5. All CPUs (Little+Mid+Big) power values in mW
> +------------+--------+---------------------+-------+-----------+
> | channel | metric | kernel | value | perc_diff |
> +------------+--------+---------------------+-------+-----------+
> | CPU | gmean | EM_default | 142.1 | 0.0% |
> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
> +------------+--------+---------------------+-------+-----------+

How did you modify the EM here? Did you change both performance and power
fields? How did you calculate the new ones?

Did you try to simulate any heating effect during the run if you're taking
temperature into account to modify the power? What was the variation like and
at what rate was the EM being updated in this case? I think Jankbench in
general wouldn't stress the SoC enough.

It'd be insightful to look at frequency residencies between the two runs and
power breakdown for each cluster if you have access to them. No worries if not!

My brain started to fail me somewhere around patch 15. I'll have another look
some time later in the week, but generally it looks good to me. If I have any
worries, they are about how it can be used with the provided interfaces,
especially the expectations about managing fast thermal changes at the level
you're targeting.


Thanks!

--
Qais Yousef

2023-12-18 12:01:22

by Lukasz Luba

Subject: Re: [PATCH v5 22/23] PM: EM: Add em_dev_compute_costs() as API for device drivers

Hi Dietmar and Qais,

On 12/17/23 18:03, Qais Yousef wrote:
> On 12/12/23 19:50, Dietmar Eggemann wrote:
>> On 29/11/2023 12:08, Lukasz Luba wrote:
>>> The device drivers can modify EM at runtime by providing a new EM table.
>>> The EM is used by the EAS and the em_perf_state::cost stores
>>> pre-calculated value to avoid overhead. This patch provides the API for
>>> device drivers to calculate the cost values properly (and not duplicate
>>> the same code).
>>
>> New interface w/o any users? Can we not remove this from this patch-set
>> and introduce it with the first user(s)?

I didn't want to introduce a user of this API in the same patch set.
I will send a follow-up patch for the Exynos SoC. More about this below.

>
> It's a chicken and egg problem. No interface, will not enable the new users to
> appear too. So assuming the interface makes sense, I vote to keep it.

There are already platforms in mainline which will benefit from this
feature and would use this API: the platforms which support chip
binning and adjust the voltage based on that information. The user can be
a driver which can even be built as a module. One example is Exynos5 ASV
(Adaptive Supply Voltage), part of the Exynos chipid driver [1].
Here is the dmesg log with some additional debug output from this driver.
As you can see, the EM finished the registration and also the update (the
new feature from this patch set), but it worked on the old voltages from
the OPPs, since the chipid driver compares the ASV voltages only later in
the boot.

-------------------------------------------------
[ 4.651049] cpu cpu4: EM: created perf domain
[ 4.654073] cpu cpu0: EM: OPP:1200000 is inefficient
[ 4.654108] cpu cpu0: EM: OPP:1100000 is inefficient
[ 4.654140] cpu cpu0: EM: OPP:900000 is inefficient
[ 4.654173] cpu cpu0: EM: OPP:800000 is inefficient
[ 4.654204] cpu cpu0: EM: OPP:600000 is inefficient
[ 4.654235] cpu cpu0: EM: OPP:500000 is inefficient
[ 4.654266] cpu cpu0: EM: OPP:400000 is inefficient
[ 4.654297] cpu cpu0: EM: OPP:200000 is inefficient
[ 4.654342] cpu cpu0: EM: updated
....
[ 4.750026] exynos-chipid 10000000.chipid: cpu0 opp0, freq: 1500 missing
[ 4.755329] exynos-chipid 10000000.chipid: Checking asv_volt=1175000 opp_volt=1275000
[ 4.763213] exynos-chipid 10000000.chipid: Checking asv_volt=1125000 opp_volt=1250000
[ 4.770982] exynos-chipid 10000000.chipid: Checking asv_volt=1075000 opp_volt=1250000
[ 4.778820] exynos-chipid 10000000.chipid: Checking asv_volt=1037500 opp_volt=1250000
[ 4.786515] exynos-chipid 10000000.chipid: Checking asv_volt=1000000 opp_volt=1100000
[ 4.794356] exynos-chipid 10000000.chipid: Checking asv_volt=962500 opp_volt=1100000
[ 4.802018] exynos-chipid 10000000.chipid: Checking asv_volt=925000 opp_volt=1100000
[ 4.816323] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=1000000
[ 4.824109] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=1000000
[ 4.839933] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=1000000
[ 4.854762] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=1000000
[ 4.866191] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 4.878812] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 4.886052] exynos-chipid 10000000.chipid: cpu4 opp0, freq: 2100 missing
[ 4.892800] exynos-chipid 10000000.chipid: Checking asv_volt=1225000 opp_volt=1312500
[ 4.900542] exynos-chipid 10000000.chipid: Checking asv_volt=1162500 opp_volt=1262500
[ 4.908342] exynos-chipid 10000000.chipid: Checking asv_volt=1112500 opp_volt=1237500
[ 4.916066] exynos-chipid 10000000.chipid: Checking asv_volt=1075000 opp_volt=1250000
[ 4.923926] exynos-chipid 10000000.chipid: Checking asv_volt=1037500 opp_volt=1250000
[ 4.931707] exynos-chipid 10000000.chipid: Checking asv_volt=1000000 opp_volt=1100000
[ 4.939582] exynos-chipid 10000000.chipid: Checking asv_volt=975000 opp_volt=1100000
[ 4.947225] exynos-chipid 10000000.chipid: Checking asv_volt=950000 opp_volt=1100000
[ 4.954885] exynos-chipid 10000000.chipid: Checking asv_volt=925000 opp_volt=1000000
[ 4.962601] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=1000000
[ 4.974047] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=1000000
[ 4.974071] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=1000000
[ 4.993670] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 5.001163] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 5.008818] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 5.016318] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 5.023955] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 5.039723] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 5.054445] exynos-chipid 10000000.chipid: Checking asv_volt=900000 opp_volt=900000
[ 5.066709] exynos-chipid 10000000.chipid: Exynos: CPU[EXYNOS5800] PRO_ID[0xe5422000] REV[0x1] Detected

-------------------------------------------------

The new EM, which would be updated from that driver, would have lower
voltages as well as different 'inefficient OPPs'. The maximum voltage
difference based on the tables is 13.54%, which means for the dynamic
power:
1362500 = 1.135416667 * 1200000
P_dyn = C * f * (V * 1.1354)^2 = C * f * V^2 * 1.289

That's a ~29% difference in dynamic power (for one core).
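
As a quick sanity check of that arithmetic, here is a minimal userspace
sketch (plain C, not kernel code) which derives the dynamic power ratio from
two voltages at a fixed frequency, assuming P_dyn = C * f * V^2 as above. The
helper name dyn_power_ratio_permille() is made up for illustration and is not
part of the EM API; the voltages are the two values from the calculation
above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ratio of dynamic power at two voltages, for the same capacitance and
 * frequency: P_new / P_old = (V_new / V_old)^2.
 * Voltages are in microvolts. The result is in permille (1000 == equal
 * power) so that only integer math is needed.
 * Hypothetical helper for illustration, not an Energy Model function.
 */
static uint64_t dyn_power_ratio_permille(uint64_t v_new_uv, uint64_t v_old_uv)
{
	assert(v_old_uv != 0);

	/* Square first and divide last to keep precision; uV^2 * 1000 fits in 64 bits. */
	return (v_new_uv * v_new_uv * 1000) / (v_old_uv * v_old_uv);
}
```

With the voltages above, dyn_power_ratio_permille(1362500, 1200000) evaluates
to 1289, i.e. about 28.9% more dynamic power at the higher voltage.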

This voltage adjustment is due to the chip lottery. Different SoC vendors
use different names for it.
I only have this Exynos platform, but once this API
and the v5 features get in, the vendors can modify their drivers and test.

This should help both: EAS and IPA/DTPM.

Regards,
Lukasz

[1]
https://elixir.bootlin.com/linux/latest/source/drivers/soc/samsung/exynos5422-asv.c

2023-12-19 04:03:54

by Xuewen Yan

Subject: Re: [PATCH v5 09/23] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS

On Mon, Dec 18, 2023 at 1:59 AM Qais Yousef <[email protected]> wrote:
>
> On 11/29/23 11:08, Lukasz Luba wrote:
> > The new Energy Model (EM) supports runtime modification of the performance
> > state table to better model the power used by the SoC. Use this new
> > feature to improve energy estimation and therefore task placement in
> > Energy Aware Scheduler (EAS).
>
> nit: you moved the code to use the new runtime em table instead of the one
> parsed at boot.
>
> >
> > Signed-off-by: Lukasz Luba <[email protected]>
> > ---
> > include/linux/energy_model.h | 16 ++++++++++++----
> > 1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> > index 1e618e431cac..94a77a813724 100644
> > --- a/include/linux/energy_model.h
> > +++ b/include/linux/energy_model.h
> > @@ -238,6 +238,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > unsigned long max_util, unsigned long sum_util,
> > unsigned long allowed_cpu_cap)
> > {
> > + struct em_perf_table *runtime_table;
> > unsigned long freq, scale_cpu;
> > struct em_perf_state *ps;
> > int cpu, i;
> > @@ -255,7 +256,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > */
> > cpu = cpumask_first(to_cpumask(pd->cpus));
> > scale_cpu = arch_scale_cpu_capacity(cpu);
> > - ps = &pd->table[pd->nr_perf_states - 1];
> > +
> > + /*
> > + * No rcu_read_lock() since it's already called by task scheduler.
> > + * The runtime_table is always there for CPUs, so we don't check.
> > + */
>
> WARN_ON(rcu_read_lock_held()) instead?

I agree, or SCHED_WARN_ON(!rcu_read_lock_held()) ?
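
To make the polarity of the two checks concrete, here is a minimal userspace
sketch. It is not kernel code: every sketch_* name is a made-up stand-in, and
sketch_warn_on() only mimics the fact that WARN_ON() evaluates to its
condition. It assumes, as the comment in the patch says, that the scheduler
already holds the RCU read lock when em_cpu_energy() runs.

```c
#include <stdbool.h>

/* Stand-in for the RCU read-side nesting depth; not how the kernel tracks it. */
static int sketch_rcu_depth;

static void sketch_rcu_read_lock(void)      { sketch_rcu_depth++; }
static void sketch_rcu_read_unlock(void)    { sketch_rcu_depth--; }
static bool sketch_rcu_read_lock_held(void) { return sketch_rcu_depth > 0; }

/* Like WARN_ON(): evaluates to true (a warning would fire) when cond is true. */
static bool sketch_warn_on(bool cond)
{
	return cond;
}

/* Check without the negation: fires when the lock IS held, i.e. in the expected case. */
static bool check_without_negation(void)
{
	return sketch_warn_on(sketch_rcu_read_lock_held());
}

/* Negated check: fires only when the caller did not take the read lock. */
static bool check_with_negation(void)
{
	return sketch_warn_on(!sketch_rcu_read_lock_held());
}
```

Inside a sketch_rcu_read_lock()/sketch_rcu_read_unlock() section only
check_without_negation() fires, which is why the negated form is the one that
matches the intent of the comment.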

>
>
> Cheers
>
> --
> Qais Yousef
>
> > + runtime_table = rcu_dereference(pd->runtime_table);
> > +
> > + ps = &runtime_table->state[pd->nr_perf_states - 1];
> >
> > max_util = map_util_perf(max_util);
> > max_util = min(max_util, allowed_cpu_cap);
> > @@ -265,9 +273,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > * Find the lowest performance state of the Energy Model above the
> > * requested frequency.
> > */
> > - i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
> > - pd->flags);
> > - ps = &pd->table[i];
> > + i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
> > + freq, pd->flags);
> > + ps = &runtime_table->state[i];
> >
> > /*
> > * The capacity of a CPU in the domain at the performance state (ps)
> > --
> > 2.25.1
> >
>

2023-12-19 04:42:34

by Xuewen Yan

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design

On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <[email protected]> wrote:
>
> Add a new section 'Design' which covers the information about Energy
> Model. It contains the design decisions, describes models and how they
> reflect the reality. Remove description of the default EM. Change the
> other section IDs. Add documentation bit for the new feature which
> allows to modify the EM in runtime.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
> 1 file changed, 196 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
> index 13225965c9a4..1f8cf36914b1 100644
> --- a/Documentation/power/energy-model.rst
> +++ b/Documentation/power/energy-model.rst
> @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
> domains can have different micro-architectures.
>
>
> -2. Core APIs
> +2. Design
> +-----------------
> +
> +2.1 Runtime modifiable EM
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +To better reflect power variation due to static power (leakage) the EM
> +supports runtime modifications of the power values. The mechanism relies on
> +RCU to free the modifiable EM perf_state table memory. Its user, the task
> +scheduler, also uses RCU to access this memory. The EM framework provides
> +API for allocating/freeing the new memory for the modifiable EM table.
> +The old memory is freed automatically using RCU callback mechanism when there
> +are no owners anymore for the given EM runtime table instance. This is tracked
> +using kref mechanism. The device driver which provided the new EM at runtime,
> +should call EM API to free it safely when it's no longer needed. The EM
> +framework will handle the clean-up when it's possible.
> +
> +The kernel code which want to modify the EM values is protected from concurrent
> +access using a mutex. Therefore, the device driver code must run in sleeping
> +context when it tries to modify the EM.
> +
> +With the runtime modifiable EM we switch from a 'single and during the entire
> +runtime static EM' (system property) design to a 'single EM which can be
> +changed during runtime according e.g. to the workload' (system and workload
> +property) design.
> +
> +It is possible also to modify the CPU performance values for each EM's
> +performance state. Thus, the full power and performance profile (which
> +is an exponential curve) can be changed according e.g. to the workload
> +or system property.
> +
> +
> +3. Core APIs
> ------------
>
> -2.1 Config options
> +3.1 Config options
> ^^^^^^^^^^^^^^^^^^
>
> CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
>
>
> -2.2 Registration of performance domains
> +3.2 Registration of performance domains
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Registration of 'advanced' EM
> @@ -110,8 +142,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel
> subsystems which use EM might rely on this flag to check if all EM devices use
> the same scale. If there are different scales, these subsystems might decide
> to return warning/error, stop working or panic.
> -See Section 3. for an example of driver implementing this
> -callback, or Section 2.4 for further documentation on this API
> +See Section 4. for an example of driver implementing this
> +callback, or Section 3.4 for further documentation on this API
>
> Registration of EM using DT
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -156,7 +188,7 @@ The EM which is registered using this method might not reflect correctly the
> physics of a real device, e.g. when static power (leakage) is important.
>
>
> -2.3 Accessing performance domains
> +3.3 Accessing performance domains
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> There are two API functions which provide the access to the energy model:
> @@ -175,10 +207,83 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
> not provided for other type of devices.
>
> More details about the above APIs can be found in ``<linux/energy_model.h>``
> -or in Section 2.4
> +or in Section 3.5
> +
> +
> +3.4 Runtime modifications
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Drivers willing to update the EM at runtime should use the following dedicated
> +function to allocate a new instance of the modified EM. The API is listed
> +below::
> +
> + struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
> +
> +This allows to allocate a structure which contains the new EM table with
> +also RCU and kref needed by the EM framework. The 'struct em_perf_table'
> +contains array 'struct em_perf_state state[]' which is a list of performance
> +states in ascending order. That list must be populated by the device driver
> +which wants to update the EM. The list of frequencies can be taken from
> +existing EM (created during boot). The content in the 'struct em_perf_state'
> +must be populated by the driver as well.
> +
> +This is the API which does the EM update, using RCU pointers swap::
> +
> + int em_dev_update_perf_domain(struct device *dev,
> + struct em_perf_table __rcu *new_table);
> +
> +Drivers must provide a pointer to the allocated and initialized new EM
> +'struct em_perf_table'. That new EM will be safely used inside the EM framework
> +and will be visible to other sub-systems in the kernel (thermal, powercap).
> +The main design goal for this API is to be fast and avoid extra calculations
> +or memory allocations at runtime. When pre-computed EMs are available in the
> +device driver, than it should be possible to simply re-use them with low
> +performance overhead.
> +
> +In order to free the EM, provided earlier by the driver (e.g. when the module
> +is unloaded), there is a need to call the API::
> +
> + void em_free_table(struct em_perf_table __rcu *table);
> +
> +It will allow the EM framework to safely remove the memory, when there is
> +no other sub-system using it, e.g. EAS.
> +
> +To use the power values in other sub-systems (like thermal, powercap) there is
> +a need to call API which protects the reader and provide consistency of the EM
> +table data::
>
> + struct em_perf_state *em_get_table(struct em_perf_domain *pd);
>
> -2.4 Description details of this API
> +It returns the 'struct em_perf_state' pointer which is an array of performance
> +states in ascending order.
> +
> +When the EM table is not needed anymore there is a need to call dedicated API::
> +
> + void em_put_table(void);
> +
> +In this way the EM safely uses the RCU read section and protects the users.
> +It also allows the EM framework to manage the memory and free it.
> +
> +There is dedicated API for device drivers to calculate em_perf_state::cost
> +values::
> +
> + int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
> + int nr_states);
> +
> +These 'cost' values from EM are used in EAS. The new EM table should be passed
> +together with the number of entries and device pointer. When the computation
> +of the cost values is done properly the return value from the function is 0.
> +The function takes care for right setting of inefficiency for each performance
> +state as well. It updates em_perf_state::flags accordingly.
> +Then such prepared new EM can be passed to the em_dev_update_perf_domain()
> +function, which will allow to use it.
> +
> +More details about the above APIs can be found in ``<linux/energy_model.h>``
> +or in Section 4.2 with an example code showing simple implementation of the
> +updating mechanism in a device driver.
> +
> +
> +3.5 Description details of this API
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> .. kernel-doc:: include/linux/energy_model.h
> :internal:
> @@ -187,8 +292,11 @@ or in Section 2.4
> :export:
>
>
> -3. Example driver
> ------------------
> +4. Examples
> +-----------
> +
> +4.1 Example driver with EM registration
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The CPUFreq framework supports dedicated callback for registering
> the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
> @@ -242,3 +350,81 @@ EM framework::
> 39 static struct cpufreq_driver foo_cpufreq_driver = {
> 40 .register_em = foo_cpufreq_register_em,
> 41 };
> +
> +
> +4.2 Example driver with EM modification
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +This section provides a simple example of a thermal driver modifying the EM.
> +The driver implements a foo_thermal_em_update() function. The driver is woken
> +up periodically to check the temperature and modify the EM data::
> +
> + -> drivers/soc/example/example_em_mod.c
> +
> + 01 static void foo_get_new_em(struct device *dev)
> + 02 {
> + 03 struct em_perf_table __rcu *runtime_table;
> + 04 struct em_perf_state *table, *new_table;
> + 05 struct em_perf_domain *pd;
> + 06 unsigned long freq;
> + 07 int i, ret;
> + 08
> + 09 pd = em_pd_get(dev);
> + 10 if (!pd)
> + 11 return;
> + 12
> + 13 runtime_table = em_allocate_table(pd);
> + 14 if (!runtime_table)
> + 15 return;
> + 16
> + 17 new_table = runtime_table->state;
> + 18
> + 19 table = em_get_table(pd);
> + 20 for (i = 0; i < pd->nr_perf_states; i++) {
> + 21 freq = table[i].frequency;
> + 22 foo_get_power_perf_values(dev, freq, &new_table[i]);
> + 23 }
> + 24 em_put_table();
> + 25
> + 26 /* Calculate 'cost' values for EAS */
> + 27 ret = em_dev_compute_costs(dev, table, pd->nr_perf_states);
> + 28 if (ret) {
> + 29 dev_warn(dev, "EM: compute costs failed %d\n", ret);
> + 30 em_free_table(runtime_table);
> + 31 return;
> + 32 }
> + 33
> + 34 ret = em_dev_update_perf_domain(dev, runtime_table);
> + 35 if (ret) {
> + 36 dev_warn(dev, "EM: update failed %d\n", ret);
> + 37 em_free_table(runtime_table);
> + 38 return;
> + 39 }
> + 40
> + 41 ctx->runtime_table = runtime_table;

Since ctx is used here, maybe foo_get_new_em(struct device *dev)
should be foo_get_new_em(struct foo_context *ctx)?
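
A rough userspace sketch of what I mean is below. It is only an illustration
of the signature change: the sketch_* types and the mock_* functions are
made-up stand-ins so that it compiles outside the kernel, they are not the EM
API, and the frequency/power values are invented.

```c
#include <stdlib.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct sketch_perf_state { unsigned long frequency, power; };
struct sketch_perf_table { int nr_states; struct sketch_perf_state state[4]; };
struct sketch_device { int id; };

struct foo_context {
	struct sketch_device *dev;
	struct sketch_perf_table *runtime_table;
};

/* Mock of the table allocation step. */
static struct sketch_perf_table *mock_allocate_table(int nr_states)
{
	struct sketch_perf_table *t = calloc(1, sizeof(*t));

	if (t)
		t->nr_states = nr_states;
	return t;
}

/* Mock of the EM update step; it always succeeds in this sketch. */
static int mock_update_perf_domain(struct sketch_device *dev,
				   struct sketch_perf_table *t)
{
	(void)dev;
	(void)t;
	return 0;
}

/* Takes the context, so storing the new table in ctx at the end is well-defined. */
static int foo_get_new_em(struct foo_context *ctx)
{
	struct sketch_perf_table *t = mock_allocate_table(4);
	int i, ret;

	if (!t)
		return -1;

	for (i = 0; i < t->nr_states; i++) {
		t->state[i].frequency = 500000UL * (i + 1);	/* invented values */
		t->state[i].power = 100UL * (i + 1);
	}

	ret = mock_update_perf_domain(ctx->dev, t);
	if (ret) {
		free(t);
		return ret;
	}

	free(ctx->runtime_table);	/* drop the previous table, if any */
	ctx->runtime_table = t;
	return 0;
}
```

The caller in the example would then pass ctx instead of dev, and the device
pointer would be taken from ctx->dev inside the function.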


BR
---
xuewen

> + 42 }
> + 43
> + 44 /*
> + 45 * Function called periodically to check the temperature and
> + 46 * update the EM if needed
> + 47 */
> + 48 static void foo_thermal_em_update(struct foo_context *ctx)
> + 49 {
> + 50 struct device *dev = ctx->dev;
> + 51 int cpu;
> + 52
> + 53 ctx->temperature = foo_get_temp(dev, ctx);
> + 54 if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
> + 55 return;
> + 56
> + 57 foo_get_new_em(dev);
> + 58 }
> + 59
> + 60 static void foo_exit(void)
> + 61 {
> + 62 struct foo_context *ctx = glob_ctx;
> + 63
> + 64 em_free_table(ctx->runtime_table);
> + 65 }
> + 66
> + 67 module_exit(foo_exit);
> --
> 2.25.1
>

2023-12-19 06:23:07

by Xuewen Yan

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design

Hi Lukasz,

On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <[email protected]> wrote:
>
> Add a new section 'Design' which covers the information about Energy
> Model. It contains the design decisions, describes models and how they
> reflect the reality. Remove description of the default EM. Change the
> other section IDs. Add documentation bit for the new feature which
> allows to modify the EM in runtime.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
> 1 file changed, 196 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
> index 13225965c9a4..1f8cf36914b1 100644
> --- a/Documentation/power/energy-model.rst
> +++ b/Documentation/power/energy-model.rst
> @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
> domains can have different micro-architectures.
>
>
> -2. Core APIs
> +2. Design
> +-----------------
> +
> +2.1 Runtime modifiable EM
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +To better reflect power variation due to static power (leakage) the EM
> +supports runtime modifications of the power values. The mechanism relies on
> +RCU to free the modifiable EM perf_state table memory. Its user, the task
> +scheduler, also uses RCU to access this memory. The EM framework provides
> +API for allocating/freeing the new memory for the modifiable EM table.
> +The old memory is freed automatically using RCU callback mechanism when there
> +are no owners anymore for the given EM runtime table instance. This is tracked
> +using kref mechanism. The device driver which provided the new EM at runtime,
> +should call EM API to free it safely when it's no longer needed. The EM
> +framework will handle the clean-up when it's possible.
> +
> +The kernel code which want to modify the EM values is protected from concurrent
> +access using a mutex. Therefore, the device driver code must run in sleeping
> +context when it tries to modify the EM.
> +
> +With the runtime modifiable EM we switch from a 'single and during the entire
> +runtime static EM' (system property) design to a 'single EM which can be
> +changed during runtime according e.g. to the workload' (system and workload
> +property) design.
> +
> +It is possible also to modify the CPU performance values for each EM's
> +performance state. Thus, the full power and performance profile (which
> +is an exponential curve) can be changed according e.g. to the workload
> +or system property.
> +
> +
> +3. Core APIs
> ------------
>
> -2.1 Config options
> +3.1 Config options
> ^^^^^^^^^^^^^^^^^^
>
> CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
>
>
> -2.2 Registration of performance domains
> +3.2 Registration of performance domains
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Registration of 'advanced' EM
> @@ -110,8 +142,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel
> subsystems which use EM might rely on this flag to check if all EM devices use
> the same scale. If there are different scales, these subsystems might decide
> to return warning/error, stop working or panic.
> -See Section 3. for an example of driver implementing this
> -callback, or Section 2.4 for further documentation on this API
> +See Section 4. for an example of driver implementing this
> +callback, or Section 3.4 for further documentation on this API
>
> Registration of EM using DT
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -156,7 +188,7 @@ The EM which is registered using this method might not reflect correctly the
> physics of a real device, e.g. when static power (leakage) is important.
>
>
> -2.3 Accessing performance domains
> +3.3 Accessing performance domains
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> There are two API functions which provide the access to the energy model:
> @@ -175,10 +207,83 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
> not provided for other type of devices.
>
> More details about the above APIs can be found in ``<linux/energy_model.h>``
> -or in Section 2.4
> +or in Section 3.5
> +
> +
> +3.4 Runtime modifications
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Drivers willing to update the EM at runtime should use the following dedicated
> +function to allocate a new instance of the modified EM. The API is listed
> +below::
> +
> + struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
> +
> +This allows to allocate a structure which contains the new EM table with
> +also RCU and kref needed by the EM framework. The 'struct em_perf_table'
> +contains array 'struct em_perf_state state[]' which is a list of performance
> +states in ascending order. That list must be populated by the device driver
> +which wants to update the EM. The list of frequencies can be taken from
> +existing EM (created during boot). The content in the 'struct em_perf_state'
> +must be populated by the driver as well.
> +
> +This is the API which does the EM update, using RCU pointers swap::
> +
> + int em_dev_update_perf_domain(struct device *dev,
> + struct em_perf_table __rcu *new_table);
> +
> +Drivers must provide a pointer to the allocated and initialized new EM
> +'struct em_perf_table'. That new EM will be safely used inside the EM framework
> +and will be visible to other sub-systems in the kernel (thermal, powercap).
> +The main design goal for this API is to be fast and avoid extra calculations
> +or memory allocations at runtime. When pre-computed EMs are available in the
> +device driver, than it should be possible to simply re-use them with low
> +performance overhead.
> +
> +In order to free the EM, provided earlier by the driver (e.g. when the module
> +is unloaded), there is a need to call the API::
> +
> + void em_free_table(struct em_perf_table __rcu *table);
> +
> +It will allow the EM framework to safely remove the memory, when there is
> +no other sub-system using it, e.g. EAS.
> +
> +To use the power values in other sub-systems (like thermal, powercap) there is
> +a need to call API which protects the reader and provide consistency of the EM
> +table data::
>
> + struct em_perf_state *em_get_table(struct em_perf_domain *pd);
>
> -2.4 Description details of this API
> +It returns the 'struct em_perf_state' pointer which is an array of performance
> +states in ascending order.
> +
> +When the EM table is not needed anymore there is a need to call dedicated API::
> +
> + void em_put_table(void);
> +
> +In this way the EM safely uses the RCU read section and protects the users.
> +It also allows the EM framework to manage the memory and free it.
> +
> +There is dedicated API for device drivers to calculate em_perf_state::cost
> +values::
> +
> + int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
> + int nr_states);
> +
> +These 'cost' values from EM are used in EAS. The new EM table should be passed
> +together with the number of entries and device pointer. When the computation
> +of the cost values is done properly the return value from the function is 0.
> +The function takes care for right setting of inefficiency for each performance
> +state as well. It updates em_perf_state::flags accordingly.
> +Then such prepared new EM can be passed to the em_dev_update_perf_domain()
> +function, which will allow to use it.
> +
> +More details about the above APIs can be found in ``<linux/energy_model.h>``
> +or in Section 4.2 with an example code showing simple implementation of the
> +updating mechanism in a device driver.
> +
> +
> +3.5 Description details of this API
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> .. kernel-doc:: include/linux/energy_model.h
> :internal:
> @@ -187,8 +292,11 @@ or in Section 2.4
> :export:
>
>
> -3. Example driver
> ------------------
> +4. Examples
> +-----------
> +
> +4.1 Example driver with EM registration
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The CPUFreq framework supports dedicated callback for registering
> the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
> @@ -242,3 +350,81 @@ EM framework::
> 39 static struct cpufreq_driver foo_cpufreq_driver = {
> 40 .register_em = foo_cpufreq_register_em,
> 41 };
> +
> +
> +4.2 Example driver with EM modification
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +This section provides a simple example of a thermal driver modifying the EM.
> +The driver implements a foo_thermal_em_update() function. The driver is woken
> +up periodically to check the temperature and modify the EM data::
> +
> + -> drivers/soc/example/example_em_mod.c
> +
> + 01 static void foo_get_new_em(struct device *dev)

Some drivers now use dev_pm_opp_of_register_em() to register the
energy model. Maybe we could add a new function which updates the
energy model using "EM_SET_ACTIVE_POWER_CB(em_cb, cb)",
instead of making users set the power values again?

Thanks!

> + 02 {
> + 03 struct em_perf_table __rcu *runtime_table;
> + 04 struct em_perf_state *table, *new_table;
> + 05 struct em_perf_domain *pd;
> + 06 unsigned long freq;
> + 07 int i, ret;
> + 08
> + 09 pd = em_pd_get(dev);
> + 10 if (!pd)
> + 11 return;
> + 12
> + 13 runtime_table = em_allocate_table(pd);
> + 14 if (!runtime_table)
> + 15 return;
> + 16
> + 17 new_table = runtime_table->state;
> + 18
> + 19 table = em_get_table(pd);
> + 20 for (i = 0; i < pd->nr_perf_states; i++) {
> + 21 freq = table[i].frequency;
> + 22 foo_get_power_perf_values(dev, freq, &new_table[i]);
> + 23 }
> + 24 em_put_table();
> + 25
> + 26 /* Calculate 'cost' values for EAS */
> + 27 ret = em_dev_compute_costs(dev, table, pd->nr_perf_states);
> + 28 if (ret) {
> + 29 dev_warn(dev, "EM: compute costs failed %d\n", ret);
> + 30 em_free_table(runtime_table);
> + 31 return;
> + 32 }
> + 33
> + 34 ret = em_dev_update_perf_domain(dev, runtime_table);
> + 35 if (ret) {
> + 36 dev_warn(dev, "EM: update failed %d\n", ret);
> + 37 em_free_table(runtime_table);
> + 38 return;
> + 39 }
> + 40
> + 41 ctx->runtime_table = runtime_table;
> + 42 }
> + 43
> + 44 /*
> + 45 * Function called periodically to check the temperature and
> + 46 * update the EM if needed
> + 47 */
> + 48 static void foo_thermal_em_update(struct foo_context *ctx)
> + 49 {
> + 50 struct device *dev = ctx->dev;
> + 51 int cpu;
> + 52
> + 53 ctx->temperature = foo_get_temp(dev, ctx);
> + 54 if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
> + 55 return;
> + 56
> + 57 foo_get_new_em(dev);
> + 58 }
> + 59
> + 60 static void foo_exit(void)
> + 61 {
> + 62 struct foo_context *ctx = glob_ctx;
> + 63
> + 64 em_free_table(ctx->runtime_table);
> + 65 }
> + 66
> + 67 module_exit(foo_exit);
> --
> 2.25.1
>

2023-12-19 08:31:53

by Lukasz Luba

Subject: Re: [PATCH v5 09/23] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS

Hi Qais and Xuewen,

On 12/19/23 04:03, Xuewen Yan wrote:
> On Mon, Dec 18, 2023 at 1:59 AM Qais Yousef <[email protected]> wrote:
>>
>> On 11/29/23 11:08, Lukasz Luba wrote:
>>> The new Energy Model (EM) supports runtime modification of the performance
>>> state table to better model the power used by the SoC. Use this new
>>> feature to improve energy estimation and therefore task placement in
>>> Energy Aware Scheduler (EAS).
>>
>> nit: you moved the code to use the new runtime em table instead of the one
>> parsed at boot.
>>
>>>
>>> Signed-off-by: Lukasz Luba <[email protected]>
>>> ---
>>> include/linux/energy_model.h | 16 ++++++++++++----
>>> 1 file changed, 12 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>>> index 1e618e431cac..94a77a813724 100644
>>> --- a/include/linux/energy_model.h
>>> +++ b/include/linux/energy_model.h
>>> @@ -238,6 +238,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>> unsigned long max_util, unsigned long sum_util,
>>> unsigned long allowed_cpu_cap)
>>> {
>>> + struct em_perf_table *runtime_table;
>>> unsigned long freq, scale_cpu;
>>> struct em_perf_state *ps;
>>> int cpu, i;
>>> @@ -255,7 +256,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>> */
>>> cpu = cpumask_first(to_cpumask(pd->cpus));
>>> scale_cpu = arch_scale_cpu_capacity(cpu);
>>> - ps = &pd->table[pd->nr_perf_states - 1];
>>> +
>>> + /*
>>> + * No rcu_read_lock() since it's already called by task scheduler.
>>> + * The runtime_table is always there for CPUs, so we don't check.
>>> + */
>>
>> WARN_ON(rcu_read_lock_held()) instead?
>
> I agree, or SCHED_WARN_ON(!rcu_read_lock_held()) ?

I disagree here. This is a scheduler function in the hot path and, as the
comment says:

-----------------------
* This function must be used only for CPU devices. There is no validation,
* i.e. if the EM is a CPU type and has cpumask allocated. It is called from
* the scheduler code quite frequently and that is why there is not checks.
-----------------------

We don't have to put checks or warnings everywhere in kernel
functions, especially not in a hot one like this.

As you might not have noticed, we don't even check that pd->cpus is not NULL.

Regards,
Lukasz

2023-12-19 08:44:13

by Lukasz Luba

Subject: Re: [PATCH v5 10/23] PM: EM: Add API for memory allocations for new tables



On 12/17/23 17:59, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
>> The runtime modified EM table can be provided from drivers. Create
>> mechanism which allows safely allocate and free the table for device
>> drivers. The same table can be used by the EAS in task scheduler code
>> paths, so make sure the memory is not freed when the device driver module
>> is unloaded.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> include/linux/energy_model.h | 11 +++++++++
>> kernel/power/energy_model.c | 44 ++++++++++++++++++++++++++++++++++--
>> 2 files changed, 53 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index 94a77a813724..e785211828fe 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -5,6 +5,7 @@
>> #include <linux/device.h>
>> #include <linux/jump_label.h>
>> #include <linux/kobject.h>
>> +#include <linux/kref.h>
>> #include <linux/rcupdate.h>
>> #include <linux/sched/cpufreq.h>
>> #include <linux/sched/topology.h>
>> @@ -39,10 +40,12 @@ struct em_perf_state {
>> /**
>> * struct em_perf_table - Performance states table
>> * @rcu: RCU used for safe access and destruction
>> + * @refcount: Reference count to track the owners
>> * @state: List of performance states, in ascending order
>> */
>> struct em_perf_table {
>> struct rcu_head rcu;
>> + struct kref refcount;
>> struct em_perf_state state[];
>> };
>>
>> @@ -184,6 +187,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>> struct em_data_callback *cb, cpumask_t *span,
>> bool microwatts);
>> void em_dev_unregister_perf_domain(struct device *dev);
>> +struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd);
>> +void em_free_table(struct em_perf_table __rcu *table);
>>
>> /**
>> * em_pd_get_efficient_state() - Get an efficient performance state from the EM
>> @@ -368,6 +373,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
>> {
>> return 0;
>> }
>> +static inline
>> +struct em_perf_table __rcu *em_allocate_table(struct em_perf_domain *pd)
>> +{
>> + return NULL;
>> +}
>> +static inline void em_free_table(struct em_perf_table __rcu *table) {}
>> #endif
>>
>> #endif
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 489287666705..489a358b9a00 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -114,12 +114,46 @@ static void em_destroy_table_rcu(struct rcu_head *rp)
>> kfree(runtime_table);
>> }
>>
>> -static void em_free_table(struct em_perf_table __rcu *table)
>> +static void em_release_table_kref(struct kref *kref)
>> {
>> + struct em_perf_table __rcu *table;
>> +
>> + /* It was the last owner of this table so we can free */
>> + table = container_of(kref, struct em_perf_table, refcount);
>> +
>> call_rcu(&table->rcu, em_destroy_table_rcu);
>> }
>>
>> -static struct em_perf_table __rcu *
>> +static inline void em_inc_usage(struct em_perf_table __rcu *table)
>> +{
>> + kref_get(&table->refcount);
>> +}
>> +
>> +static void em_dec_usage(struct em_perf_table __rcu *table)
>> +{
>> + kref_put(&table->refcount, em_release_table_kref);
>> +}
>
> nit: em_table_inc/dec() instead? matches general theme elsewhere in the code
> base.

Looks good, I will change it.

>
>> +
>> +/**
>> + * em_free_table() - Handles safe free of the EM table when needed
>> + * @table : EM memory which is going to be freed
>> + *
>> + * No return values.
>> + */
>> +void em_free_table(struct em_perf_table __rcu *table)
>> +{
>> + em_dec_usage(table);
>> +}
>> +
>> +/**
>> + * em_allocate_table() - Handles safe allocation of the new EM table
>> + * @table : EM memory which is going to be freed
>> + *
>> + * Increments the reference counter to mark that there is an owner of that
>> + * EM table. That might be a device driver module or EAS.
>> + * Returns allocated table or error.
>> + */
>> +struct em_perf_table __rcu *
>> em_allocate_table(struct em_perf_domain *pd)
>> {
>> struct em_perf_table __rcu *table;
>> @@ -128,6 +162,12 @@ em_allocate_table(struct em_perf_domain *pd)
>> table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
>>
>> table = kzalloc(sizeof(*table) + table_size, GFP_KERNEL);
>> + if (!table)
>> + return table;
>> +
>> + kref_init(&table->refcount);
>> + em_inc_usage(table);
>
> Doesn't kref_init() initialize to the count to 1 already? Is the em_inc_usage()
> needed here?

Good catch, this is not needed here. Thanks!
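
The point above can be illustrated with a minimal userspace sketch. It is not the kernel's kref code: `struct demo_table`, `table_ref_init()`, `table_ref_get()` and `table_ref_put()` are hypothetical stand-ins, and a plain int replaces the atomic counter. The only behaviour it relies on from the discussion is that initialization already leaves the count at 1, so an extra get right after it would leave a reference that nobody drops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a refcounted EM table. */
struct demo_table {
	int refcount;	/* models struct kref; not atomic in this sketch */
	int nr_states;
};

/* Set by the release path so a caller can observe that the free happened. */
static bool demo_table_released;

/* Like kref_init(): the creator already holds one reference. */
static void table_ref_init(struct demo_table *t)
{
	t->refcount = 1;
}

/* Take one more reference, e.g. for a second owner of the table. */
static void table_ref_get(struct demo_table *t)
{
	t->refcount++;
}

/* Drop one reference; free the table when the last owner lets go. */
static void table_ref_put(struct demo_table *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0) {
		demo_table_released = true;
		free(t);
	}
}

/* Allocate a table that is owned once by the caller. */
static struct demo_table *demo_table_alloc(int nr_states)
{
	struct demo_table *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;

	t->nr_states = nr_states;
	table_ref_init(t);	/* count is 1 here; no extra get needed */
	return t;
}
```

With this model a single put after allocation releases the table. Had the allocation path also called the get helper, the same put would leave the count at 1 and the memory would never be freed.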

2023-12-19 08:54:59

by Lukasz Luba

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design



On 12/19/23 04:42, Xuewen Yan wrote:
> On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <[email protected]> wrote:

[snip]

>> +
>> + -> drivers/soc/example/example_em_mod.c
>> +
>> + 01 static void foo_get_new_em(struct device *dev)
>> + 02 {
>> + 03 struct em_perf_table __rcu *runtime_table;
>> + 04 struct em_perf_state *table, *new_table;
>> + 05 struct em_perf_domain *pd;
>> + 06 unsigned long freq;
>> + 07 int i, ret;
>> + 08
>> + 09 pd = em_pd_get(dev);
>> + 10 if (!pd)
>> + 11 return;
>> + 12
>> + 13 runtime_table = em_allocate_table(pd);
>> + 14 if (!runtime_table)
>> + 15 return;
>> + 16
>> + 17 new_table = runtime_table->state;
>> + 18
>> + 19 table = em_get_table(pd);
>> + 20 for (i = 0; i < pd->nr_perf_states; i++) {
>> + 21 freq = table[i].frequency;
>> + 22 foo_get_power_perf_values(dev, freq, &new_table[i]);
>> + 23 }
>> + 24 em_put_table();
>> + 25
>> + 26 /* Calculate 'cost' values for EAS */
>> + 27 ret = em_dev_compute_costs(dev, table, pd->nr_perf_states);
>> + 28 if (ret) {
>> + 29 dev_warn(dev, "EM: compute costs failed %d\n", ret);
>> + 30 em_free_table(runtime_table);
>> + 31 return;
>> + 32 }
>> + 33
>> + 34 ret = em_dev_update_perf_domain(dev, runtime_table);
>> + 35 if (ret) {
>> + 36 dev_warn(dev, "EM: update failed %d\n", ret);
>> + 37 em_free_table(runtime_table);
>> + 38 return;
>> + 39 }
>> + 40
>> + 41 ctx->runtime_table = runtime_table;
>
> Because here is ctx, maybe the foo_get_new_em(struct device *dev)
> shoule be foo_get_new_em(struct foo_context *ctx)?

Makes sense, I will change that bit. Thanks!

2023-12-19 09:32:08

by Lukasz Luba

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design



On 12/19/23 06:22, Xuewen Yan wrote:
> Hi Lukasz,
>
> On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <[email protected]> wrote:

[snip]

>> +
>> + -> drivers/soc/example/example_em_mod.c
>> +
>> + 01 static void foo_get_new_em(struct device *dev)
>
> Because now some drivers use the dev_pm_opp_of_register_em() to
> register energy model,
> and maybe we can add a new function to update the energy model using
> "EM_SET_ACTIVE_POWER_CB(em_cb, cb)"
> instead of letting users set power again?
>

There are different usages of this EM feature:
1. Adjust power values after boot has finished and e.g. ASV in Exynos
has adjusted the voltage values in the OPP framework. It's
due to chip binning. I have described that in the conversation
below patch 22/23. I'm going to send a patch for that
platform and the OPP fwk later as a follow-up to this series.
2. Change the EM power values after long gaming, when the GPU
heats up the SoC heavily and the CPUs' leakage starts to increase
3. Change the EM for long running heavy apps, e.g. video conference app,
which is using camera w/ image AI and filters (so some heavy stuff)
4. any other optimization that vendor/OEM like to have for

2023-12-19 09:35:48

by Lukasz Luba

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design



On 12/12/23 18:51, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> Add a new section 'Design' which covers the information about Energy
>> Model. It contains the design decisions, describes models and how they
>> reflect the reality. Remove description of the default EM. Change the
>> other section IDs. Add documentation bit for the new feature which
>> allows to modify the EM in runtime.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> Documentation/power/energy-model.rst | 206 +++++++++++++++++++++++++--
>> 1 file changed, 196 insertions(+), 10 deletions(-)
>>
>> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
>> index 13225965c9a4..1f8cf36914b1 100644
>> --- a/Documentation/power/energy-model.rst
>> +++ b/Documentation/power/energy-model.rst
>> @@ -72,16 +72,48 @@ required to have the same micro-architecture. CPUs in different performance
>> domains can have different micro-architectures.
>>
>>
>> -2. Core APIs
>> +2. Design
>> +-----------------
>> +
>> +2.1 Runtime modifiable EM
>> +^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The issue I see here is that since now the EM is runtime modifiable and
> there is only one EM people might be confused in locking for a
> non-runtime modifiable EM. (which matches the design till v4).
>
> So 'runtime modifiability' is now feature of the EM itself.

True, I can skip this, since it's now default.

>
> There is also a figure in this document illustrating the use of
> em_get_energy(), em_cpu_get() and em_dev_register_perf_domain().
>
> I wonder if this should be extended to cover all the new interfaces
> created for the 'runtime modifiability' feature?

That ASCII picture would be totally messy with that many interfaces.
We can think about some other picture later, when this basic code and
basic doc are merged.

2023-12-19 10:21:47

by Lukasz Luba

Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

Hi Qais,

On 12/17/23 18:22, Qais Yousef wrote:
> Hi Lukasz
>
> On 11/29/23 11:08, Lukasz Luba wrote:
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM)
>> power values at runtime. It will allow to better reflect power model of
>> a recent SoCs and silicon. Different characteristics of the power usage
>> can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of feature set know as Dynamic Energy Model. It has been presented
>> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
>> improvement for the EM.
>
> Thanks. The problem of EM accuracy has been observed in the field and would be
> nice to have a mainline solution for it. We carry our own out-of-tree change to
> enable modifying the EM.

Thanks for that statement here.

>
>>
>> The concepts:
>> 1. The CPU power usage can vary due to the workload that it's running or due
>> to the temperature of the SoC. The same workload can use more power when the
>> temperature of the silicon has increased (e.g. due to hot GPU or ISP).
>> In such situation the EM can be adjusted and reflect the fact of increased
>> power usage. That power increase is due to static power
>> (sometimes called simply: leakage). The CPUs in recent SoCs are different.
>> We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
>> They are also built differently with High Performance (HP) cells or
>> Low Power (LP) cells. They are affected by the temperature increase
>> differently: HP cells have bigger leakage. The SW model can leverage that
>> knowledge.
>
> One thing I'm not sure about is that in practice temperature of the SoC can
> vary a lot in a short period of time. What is the expectation here? I can see
> this useful in practice only if we average it over a window of time. Following
> it will be really hard. Big variations can happen in few ms scales.

It's mostly for long running heavy workloads which involve devices other
than CPUs, e.g. GPU or ISP (Image Signal Processor). Those devices can
heat up the SoC. In our game DrArm running on pixel6 the GPU uses 75-77%
of total power budget (starting from ~2.5W for GPU + 1.3W for all CPUs).
That 2.5W from the GPU is heating up the CPUs and mostly impacts the Big
cores, which are made from High-Performance cells (thus leaking more).
OverUtilization in the first 4-5min of gaming is ~4-9%, so EAS can work
and save some power, if it has a good model. Later we have thermal
throttling and OU goes to ~50% but EAS still can work. If the model is
more precise, i.e. adjusted for the rising leakage due to the temperature
increase (caused by the GPU power), then we can still make better use of
that power budget and not waste it on leakage at higher OPPs.

>
> Driver interface for this part makes sense; as thermal framework will likely to
> know how feed things back to EM table, if necessary.

The thermal framework or, I would rather say, a smarter dedicated thermal
driver which has a built-in power model and access to the sensor data. This
way it can feed an adjusted power model into the EM dynamically.
It will also calculate the efficiency (the 'cost' field).

>
>>
>> 2. It is also possible to change the EM to better reflect the currently
>> running workload. Usually the EM is derived from some average power values
>> taken from experiments with benchmark (e.g. Dhrystone). The model derived
>> from such scenario might not represent properly the workloads usually running
>> on the device. Therefore, runtime modification of the EM allows to switch to
>> a different model, when there is a need.
>
> I didn't get how the new performance field is supposed to be controlled and
> modified by users. A driver interface doesn't seem suitable as there's no
> subsystem that knows the characteristic of the workload except userspace. In
> Android we do have contextual info about what the current top-app to enable
> modifying the capacities to match its characteristics.

Well, the latest public documentation (May 2023) for Cortex-X4 describes
new features of Arm cores: PDP, MPMM, which can change the
'performance' of the core in FW. Our SCMI kernel subsystem will get an
interrupt, so the drivers can know about it. It could be used for
recalculating the efficiency of the CPUs in the EM. When there is no
hotplug and the long running app is still running, that FW policy would
be reflected in the EM. It's just not done all in one step. Those patches
will come later.

Second, I have used that 'performance' field to finally get rid of
the runtime division in the em_cpu_energy() hot path, which had been annoying
me for a very long time. It wasn't possible to optimize that last
operation there, because not all CPUs have booted and the final CPU capacity
is not known when we register EMs. With this feature I can finally
remove that heavy operation. You can see more in patch 15/23.
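
A rough userspace sketch of the idea in the paragraph above: compute a per-state 'performance' value once, when the table is written, so that the hot path only compares integers. Everything here is an assumption made for illustration and is not taken from patch 15/23: the `struct demo_state` layout, the helper names, and the scaling formula (CPU capacity scaled by the ratio of the state's frequency to the highest frequency).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified performance state. */
struct demo_state {
	unsigned long frequency;	/* kHz */
	unsigned long performance;	/* precomputed capacity at this frequency */
};

/*
 * Slow path: runs when the table is (re)written. The division happens here,
 * once per state, after the final CPU capacity is known.
 */
static void demo_fill_performance(struct demo_state *s, size_t n,
				  unsigned long cpu_capacity)
{
	unsigned long fmax;
	size_t i;

	assert(n > 0);
	fmax = s[n - 1].frequency;	/* states are in ascending order */

	for (i = 0; i < n; i++)
		s[i].performance = cpu_capacity * s[i].frequency / fmax;
}

/*
 * Hot path: pick the lowest state whose performance covers the requested
 * utilization, or the highest state if none does. Comparisons only.
 */
static size_t demo_find_state(const struct demo_state *s, size_t n,
			      unsigned long util)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (s[i].performance >= util)
			return i;

	return n - 1;
}
```

The design choice is the usual trade of a little memory (one extra field per state) and some work at update time for a cheaper lookup on every scheduler invocation.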

>
>>
>> 3. The EM can be adjusted after boot, when all the modules are loaded and
>> more information about the SoC is available e.g. chip binning. This would help
>> to better reflect the silicon characteristics. Thus, this EM modification
>> API allows it now. It wasn't possible in the past and the EM had to be
>> 'set in stone'.
>>
>> More detailed explanation and background can be found in presentations
>> during LPC2022 [1][2] or in the documentation patches.
>>
>> Some test results.
>> The EM can be updated to fit better the workload type. In the case below the EM
>> has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
>> for the scheduler bits). The Jankbench was run 10 times for those two configurations,
>> to get more reliable data.
>>
>> 1. Janky frames percentage
>> +--------+-----------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+-----------------+---------------------+-------+-----------+
>> | gmean | jank_percentage | EM_default | 2.0 | 0.0% |
>> | gmean | jank_percentage | EM_modified_runtime | 1.3 | -35.33% |
>> +--------+-----------------+---------------------+-------+-----------+
>>
>> 2. Avg frame render time duration
>> +--------+---------------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+---------------------+---------------------+-------+-----------+
>> | gmean | mean_frame_duration | EM_default | 10.5 | 0.0% |
>> | gmean | mean_frame_duration | EM_modified_runtime | 9.6 | -8.52% |
>> +--------+---------------------+---------------------+-------+-----------+
>>
>> 3. Max frame render time duration
>> +--------+--------------------+---------------------+-------+-----------+
>> | metric | variable | kernel | value | perc_diff |
>> +--------+--------------------+---------------------+-------+-----------+
>> | gmean | max_frame_duration | EM_default | 251.6 | 0.0% |
>> | gmean | max_frame_duration | EM_modified_runtime | 115.5 | -54.09% |
>> +--------+--------------------+---------------------+-------+-----------+
>>
>> 4. OS overutilized state percentage (when EAS is not working)
>> +--------------+---------------------+------+------------+------------+
>> | metric | wa_path | time | total_time | percentage |
>> +--------------+---------------------+------+------------+------------+
>> | overutilized | EM_default | 1.65 | 253.38 | 0.65 |
>> | overutilized | EM_modified_runtime | 1.4 | 277.5 | 0.51 |
>> +--------------+---------------------+------+------------+------------+
>>
>> 5. All CPUs (Little+Mid+Big) power values in mW
>> +------------+--------+---------------------+-------+-----------+
>> | channel | metric | kernel | value | perc_diff |
>> +------------+--------+---------------------+-------+-----------+
>> | CPU | gmean | EM_default | 142.1 | 0.0% |
>> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
>> +------------+--------+---------------------+-------+-----------+
>
> How did you modify the EM here? Did you change both performance and power
> fields? How did you calculate the new ones?

It was just the power values modified on my pixel6:
for Littles 1.6x, Mid 0.8x, Big 1.3x of their boot power.
TBH I don't know the chip binning of that SoC, but I suspect it
could be due to this fact. You can find more about the possible error
range in chip binning power values in my comment to patch 22/23.
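
As an illustration of the kind of modification described above, the following userspace sketch derives a new table from the boot-time one by applying a per-cluster factor. `struct demo_ps` and `demo_scale_power()` are hypothetical names, and the factor is expressed in permille (1600 for 1.6x) only to stay in integer arithmetic; none of this is code from the series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified performance state: frequency and power only. */
struct demo_ps {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* uW */
};

/*
 * Fill 'out' with the boot-time frequencies and with power values scaled by
 * factor_permille / 1000. The boot table itself is left untouched.
 */
static void demo_scale_power(const struct demo_ps *boot, struct demo_ps *out,
			     size_t n, unsigned long factor_permille)
{
	size_t i;

	for (i = 0; i < n; i++) {
		out[i].frequency = boot[i].frequency;
		out[i].power = boot[i].power * factor_permille / 1000;
	}
}
```

A caller would use 1600 for the Little cluster, 800 for the Mid cluster and 1300 for the Big cluster to mirror the factors quoted above, then recompute the 'cost' values before handing the table to the EM.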

>
> Did you try to simulate any heating effect during the run if you're taking
> temperature into account to modify the power? What was the variation like and

Yes, I did that experiment and presented it at OSPM 2023, slide 13. There
is a plot of the big CPU power changing over time due to GPU heat. All the
detailed data is there. The big CPU power is ~18-20% higher when a 1-1.5W
GPU is heating up the whole SoC.

> at what rate was the EM being updated in this case? I think Jankbench in

In this experiment the EM was only set once w/ the values mentioned above.
It could be due to the chip lottery. I cannot say that with 100% certainty
for this phone.

> general wouldn't stress the SoC enough.

True, this test is not power heavy, as can be seen. It's more
to show that the default EM after boot might not be the optimal one.

>
> It'd be insightful to look at frequency residencies between the two runs and
> power breakdown for each cluster if you have access to them. No worries if not!

I'm afraid you're asking for too much ;)

>
> My brain started to fail me somewhere around patch 15. I'll have another look
> some time later in the week but generally looks good to me. If I have any
> worries it is about how it can be used with the provided interfaces. Especially
> expectations about managing fast thermal changes at the level you're targeting.

No worries, thanks for the review! The fast thermal changes which are
linked to the CPU's own workload are not an issue here and I'm not worried
about those. The side effect of the heat from other devices is the issue.
Thus, the thermal driver which modifies the EM should be aware of the
'whole SoC' situation (like mainline IPA does, when it manages all
devices in a single thermal zone).

Regards,
Lukasz

2023-12-19 10:29:40

by Lukasz Luba

Subject: Re: [PATCH v5 02/23] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments



On 12/17/23 17:58, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
>> In order to prepare the code for the modifiable EM perf_state table,
>> refactor existing function em_cpufreq_update_efficiencies().
>
> nit: What is being refactored here? The description is not adding much info
> about the change.

The function now takes the ptr to the table as its argument. Have you
missed that in the code below?

>
>
> Cheers
>
> --
> Qais Yousef
>
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> kernel/power/energy_model.c | 8 +++-----
>> 1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 8b9dd4a39f63..42486674b834 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -237,10 +237,10 @@ static int em_create_pd(struct device *dev, int nr_states,
>> return 0;
>> }
>>
>> -static void em_cpufreq_update_efficiencies(struct device *dev)
>> +static void
>> +em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
>> {
>> struct em_perf_domain *pd = dev->em_pd;
>> - struct em_perf_state *table;
>> struct cpufreq_policy *policy;
>> int found = 0;
>> int i;
>> @@ -254,8 +254,6 @@ static void em_cpufreq_update_efficiencies(struct device *dev)
>> return;
>> }
>>
>> - table = pd->table;
>> -
>> for (i = 0; i < pd->nr_perf_states; i++) {
>> if (!(table[i].flags & EM_PERF_STATE_INEFFICIENT))
>> continue;
>> @@ -397,7 +395,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>>
>> dev->em_pd->flags |= flags;
>>
>> - em_cpufreq_update_efficiencies(dev);
>> + em_cpufreq_update_efficiencies(dev, dev->em_pd->table);
>>
>> em_debug_create_pd(dev);
>> dev_info(dev, "EM: created perf domain\n");
>> --
>> 2.25.1
>>
>

2023-12-19 10:52:30

by Lukasz Luba

Subject: Re: [PATCH v5 03/23] PM: EM: Find first CPU active while updating OPP efficiency



On 12/17/23 17:58, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
>> The Energy Model might be updated at runtime and the energy efficiency
>> for each OPP may change. Thus, there is a need to update also the
>> cpufreq framework and make it aligned to the new values. In order to
>> do that, use a first active CPU from the Performance Domain. This is
>> needed since the first CPU in the cpumask might be offline when we
>> run this code path.
>
> I didn't understand the problem here. It seems you're fixing a race, but the
> description is not clear to me what the race is.

I have explained that in the v1 and v4 comments for this patch.
When the EM is registered the first CPU is always online. No problem
for the old code, but the new code allows runtime modification at a
later time, potentially from different subsystems (e.g. thermal,
drivers, etc). The first CPU might be offline then, but such an EM
update for this domain still shouldn't fail. Although, when the CPU is
offline we cannot get a valid policy...

We can get it for the next CPU in the cpumask, that's what the code is
doing.

>
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> kernel/power/energy_model.c | 11 +++++++++--
>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 42486674b834..aa7c89f9e115 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
>> struct em_perf_domain *pd = dev->em_pd;
>> struct cpufreq_policy *policy;
>> int found = 0;
>> - int i;
>> + int i, cpu;
>>
>> if (!_is_cpu_device(dev) || !pd)
>> return;
>>
>> - policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
>> + /* Try to get a CPU which is active and in this PD */
>> + cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);
>> + if (cpu >= nr_cpu_ids) {
>> + dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
>> + return;
>> + }
>> +
>> + policy = cpufreq_cpu_get(cpu);
>
> Shouldn't policy be NULL here if all policy->realted_cpus were offlined?

It would be NULL, but we catch that case in another way, in the 'if'
above.

We want something else.

We want to get the policy using the id of 'some' online CPU from our known
cpumask. Then we can continue with that policy in the code.
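
The selection described above can be modelled in userspace with plain bitmasks. This is only a sketch of the semantics the patch relies on (lowest CPU id present in both masks, or a value at or above the CPU count when the intersection is empty); `demo_first_and()` and `DEMO_NR_CPUS` are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 8

/*
 * Userspace model of cpumask_first_and(): return the lowest CPU id set in
 * both masks, or DEMO_NR_CPUS when the intersection is empty.
 */
static unsigned int demo_first_and(uint32_t span, uint32_t active)
{
	uint32_t both = span & active;
	unsigned int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (both & (UINT32_C(1) << cpu))
			return cpu;

	return DEMO_NR_CPUS;
}
```

For a performance domain spanning CPUs 4-7 with CPU 4 offline, the helper returns 5, which is the CPU whose policy can still be looked up. If all four are offline it returns `DEMO_NR_CPUS`, the case the patch handles with a warning and an early return.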

2023-12-19 10:57:45

by Lukasz Luba

Subject: Re: [PATCH v5 04/23] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible



On 12/12/23 18:49, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> The Energy Model (EM) is going to support runtime modification. There
>> are going to be 2 EM tables which store information. This patch aims
>> to prepare the code to be generic and use one of the tables. The function
>> will no longer get a pointer to 'struct em_perf_domain' (the EM) but
>> instead a pointer to 'struct em_perf_state' (which is one of the EM's
>> tables).
> I thought the 2 EM tables design is gone?
>
> IMHO it would be less code changes and hence a more enjoyable review
> experience if you would add the 'modifiable' feature to the existing EM
> (1) and not add (2) and then remove (1) in [21/23].

I have explained that in reply to another email of yours: such an approach
would create a patch monster, touching all drivers and frameworks, just to
make sure they can still compile. This is not the right approach.


>
>
> struct em_perf_domain {
> - struct em_perf_state *table; <-- (1)
> struct em_perf_table __rcu *runtime_table; <-- (2)
>
>> Prepare em_pd_get_efficient_state() for the upcoming changes and
>> make it possible to re-use. Return an index for the best performance
>
> s/make it possible to re-use/make it possible to be re-used ?

OK

>
>> state for a given EM table. The function arguments that are introduced
>> should allow to work on different performance state arrays. The caller of
>> em_pd_get_efficient_state() should be able to use the index either
>> on the default or the modifiable EM table.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> Reviewed-by: Daniel Lezcano <[email protected]>
>> ---
>> include/linux/energy_model.h | 30 +++++++++++++++++-------------
>> 1 file changed, 17 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index b9caa01dfac4..8069f526c9d8 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -175,33 +175,35 @@ void em_dev_unregister_perf_domain(struct device *dev);
>>
>> /**
>> * em_pd_get_efficient_state() - Get an efficient performance state from the EM
>> - * @pd : Performance domain for which we want an efficient frequency
>> - * @freq : Frequency to map with the EM
>> + * @state: List of performance states, in ascending order
>
> (3)
>
>> + * @nr_perf_states: Number of performance states
>> + * @freq: Frequency to map with the EM
>> + * @pd_flags: Performance Domain flags
>> *
>> * It is called from the scheduler code quite frequently and as a consequence
>> * doesn't implement any check.
>> *
>> - * Return: An efficient performance state, high enough to meet @freq
>> + * Return: An efficient performance state id, high enough to meet @freq
>> * requirement.
>> */
>> -static inline
>> -struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
>> - unsigned long freq)
>> +static inline int
>> +em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
>> + unsigned long freq, unsigned long pd_flags)
>
> (3) but em_pd_get_efficient_state(struct em_perf_state *table
> ^^^^^
> [...]

Good catch, I'll change that.

2023-12-19 11:01:42

by Lukasz Luba

Subject: Re: [PATCH v5 05/23] PM: EM: Refactor a new function em_compute_costs()



On 12/17/23 17:58, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
>> Refactor a dedicated function which will be easier to maintain and re-use
>> in future. The upcoming changes for the modifiable EM perf_state table
>> will use it (instead of duplicating the code).
>
> nit: What is being refactored? Looks like you took em_compute_cost() out of
> em_create_perf_table().

Yes, it's going to be re-used later for the update code path too, not only
the register code path.

2023-12-19 11:33:03

by Lukasz Luba

Subject: Re: [PATCH v5 08/23] PM: EM: Introduce runtime modifiable table



On 12/12/23 18:50, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> The new runtime table can be populated with a new power data to better
>> reflect the actual efficiency of the device e.g. CPU. The power can vary
>> over time e.g. due to the SoC temperature change. Higher temperature can
>> increase power values. For longer running scenarios, such as game or
>> camera, when also other devices are used (e.g. GPU, ISP) the CPU power can
>
> Don't understand this sentence. So CPU power changes with higher
> temperature and for longer running scenarios when other devices are
> involved? Not getting the 2. part.

Total power consists of:
1. dynamic power - related to the freq, voltage^2 and logic capacitance
size involved in switching during the computation
2. static power - aka. leakage - depends on voltage and
temperature of the silicon. The higher the temperature, the higher
the static power.

When you heat up the SoC using e.g. the GPU, you start seeing a rising
curve on the CPU power-over-time plot. Even if your CPU has been running
the same workload and data constantly for a long time, this effect
will appear once you add the heat from the GPU on the same die.
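
The two components listed above can be written as a textbook first-order model. The symbols are chosen here for illustration and do not come from the patch set: C is the switched capacitance, V the supply voltage, f the frequency, T the silicon temperature and I_leak the leakage current.

```latex
P_{\text{total}} = P_{\text{dyn}} + P_{\text{static}}, \qquad
P_{\text{dyn}} \propto C \cdot V^{2} \cdot f, \qquad
P_{\text{static}} = V \cdot I_{\text{leak}}(V, T)
```

At a fixed OPP (fixed V and f) the dynamic term stays roughly constant for a constant workload, so a rise in T caused by a neighbouring device shows up in the static term only, through the growth of I_leak with temperature.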

IMO this is not the right place to educate people about the physics of
the chip... Some understanding and higher level education would
be needed, otherwise even the best patch header description won't help.
So, I would keep those patch descriptions simple.

Besides, I have explained that at a few LPC and OSPM conferences.
The cover letter has links to them.

>
>> change. The new EM framework is able to addresses this issue and change
>> the EM data at runtime safely.
>
> Maybe better:
> The new EM framework addresses this issue by allowing to change the EM
> data at runtime.
>

Sounds good, I can change that.

2023-12-19 13:18:37

by Lukasz Luba

Subject: Re: [PATCH v5 07/23] PM: EM: Refactor how the EM table is allocated and populated



On 12/12/23 18:50, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> Split the process of allocation and data initialization for the EM table.
>> The upcoming changes for modifiable EM will use it.
>>
>> This change is not expected to alter the general functionality.
>
> NIT: IMHO, I guess you wanted to say: "No functional changes
> introduced"? I.e. all not only general functionality ...
>

Yes, 'no functional changes'. Rafael gave me that wording once and I use
it in such cases.

> [...]
>
>> static int em_create_pd(struct device *dev, int nr_states,
>> @@ -234,11 +234,15 @@ static int em_create_pd(struct device *dev, int nr_states,
>> return -ENOMEM;
>> }
>>
>> - ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
>> - if (ret) {
>> - kfree(pd);
>> - return ret;
>> - }
>> + pd->nr_perf_states = nr_states;
>
> Why does `pd->nr_perf_states = nr_states;` have to move from
> em_create_perf_table() to em_create_pd()?

Because I have split the old code, which did both the allocation and
the initialization with data in em_create_perf_table().

Now we are going to have separate:
1. allocation of a new table (which can be re-used later)
2. initialization of the data (power, freq, etc) in registration
code path

It will also allow introducing an update-data function
that simply uses the same allocation function for both cases:
- EM registration code path
- update EM code path
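
The split described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the struct layout and the names alloc_perf_table(), fill_perf_table(), register_domain() and update_domain() are hypothetical, and the table swap uses a plain free() where the kernel code relies on RCU.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* uW */
};

struct perf_domain {
	int nr_perf_states;
	struct perf_state *table;
};

/* Step 1: allocation only; shared by the registration and update paths. */
static struct perf_state *alloc_perf_table(const struct perf_domain *pd)
{
	assert(pd->nr_perf_states > 0);
	return calloc((size_t)pd->nr_perf_states, sizeof(struct perf_state));
}

/* Step 2: fill an already allocated table with data. */
static void fill_perf_table(const struct perf_domain *pd,
			    struct perf_state *table,
			    const unsigned long *freq,
			    const unsigned long *power)
{
	for (int i = 0; i < pd->nr_perf_states; i++) {
		table[i].frequency = freq[i];
		table[i].power = power[i];
	}
}

/* Registration path: nr_perf_states is set first, so helpers read it from pd. */
static int register_domain(struct perf_domain *pd, int nr_states,
			   const unsigned long *freq,
			   const unsigned long *power)
{
	pd->nr_perf_states = nr_states;
	pd->table = alloc_perf_table(pd);
	if (!pd->table)
		return -1;
	fill_perf_table(pd, pd->table, freq, power);
	return 0;
}

/* Update path: same allocator, new data, then swap the tables. */
static int update_domain(struct perf_domain *pd,
			 const unsigned long *freq,
			 const unsigned long *power)
{
	struct perf_state *new_table = alloc_perf_table(pd);

	if (!new_table)
		return -1;
	fill_perf_table(pd, new_table, freq, power);
	free(pd->table);	/* simplification: no RCU grace period here */
	pd->table = new_table;
	return 0;
}
```

Because the number of states lives in the domain before any table exists, both paths can call the same allocator without passing the count separately.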

>
>> +
>> + ret = em_allocate_perf_table(pd, nr_states);
>> + if (ret)
>> + goto free_pd;
>> +
>> + ret = em_create_perf_table(dev, pd, pd->table, nr_states, cb, flags);
>
> If you set it in em_create_pd() then you can use 'pd->nr_perf_states' in
> em_create_perf_table() and doesn't have to pass `nr_states`.
>
> [...]

That's true. I could further refactor that function and remove that
'nr_states' argument.

I'll do this in v6. Thanks!

2023-12-20 02:08:42

by Xuewen Yan

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design

On Tue, Dec 19, 2023 at 5:31 PM Lukasz Luba <[email protected]> wrote:
>
>
>
> On 12/19/23 06:22, Xuewen Yan wrote:
> > Hi Lukasz,
> >
> > On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <[email protected]> wrote:
>
> [snip]
>
> >> +
> >> + -> drivers/soc/example/example_em_mod.c
> >> +
> >> + 01 static void foo_get_new_em(struct device *dev)
> >
> > Because now some drivers use the dev_pm_opp_of_register_em() to
> > register energy model,
> > and maybe we can add a new function to update the energy model using
> > "EM_SET_ACTIVE_POWER_CB(em_cb, cb)"
> > instead of letting users set power again?
> >
>
> There are different usage of this EM feature:
> 1. Adjust power values after boot is finish and e.g. ASV in Exynos
> has adjusted new voltage values in the OPP framework. It's
> due to chip binning. I have described that in conversation
> below patch 22/23. I'm going to send a patch for that
> platform and OPP fwk later as a follow up to this series.

I understand what you mean. What I mean is that if we can provide an
interface for changing the EM from the OPP fwk, it will be more friendly for
those users who use OPP, because then they don't have to calculate the
new EM by themselves; after updating the voltage of the OPP, they only
need to call this interface directly.

BR
---
xuewen

> 2. Change the EM power values after long gaming, when the GPU
> heats up the SoC heavily and CPUs start increase the leakage
> 3. Change the EM for long running heavy apps, e.g. video conference app,
> which is using camera w/ image AI and filters (so some heavy stuff)
> 4. any other optimization that vendor/OEM like to have for

2023-12-20 07:56:25

by Lukasz Luba

Subject: Re: [PATCH v5 23/23] Documentation: EM: Update with runtime modification design



On 12/20/23 02:08, Xuewen Yan wrote:
> On Tue, Dec 19, 2023 at 5:31 PM Lukasz Luba <[email protected]> wrote:
>>
>>
>>
>> On 12/19/23 06:22, Xuewen Yan wrote:
>>> Hi Lukasz,
>>>
>>> On Wed, Nov 29, 2023 at 7:11 PM Lukasz Luba <[email protected]> wrote:
>>
>> [snip]
>>
>>>> +
>>>> + -> drivers/soc/example/example_em_mod.c
>>>> +
>>>> + 01 static void foo_get_new_em(struct device *dev)
>>>
>>> Because now some drivers use the dev_pm_opp_of_register_em() to
>>> register energy model,
>>> and maybe we can add a new function to update the energy model using
>>> "EM_SET_ACTIVE_POWER_CB(em_cb, cb)"
>>> instead of letting users set power again?
>>>
>>
>> There are different usage of this EM feature:
>> 1. Adjust power values after boot is finish and e.g. ASV in Exynos
>> has adjusted new voltage values in the OPP framework. It's
>> due to chip binning. I have described that in conversation
>> below patch 22/23. I'm going to send a patch for that
>> platform and OPP fwk later as a follow up to this series.
>
> I understand what you mean, what I mean is that if we can provide an
> interface for changing EM of opp fwk, it will be more friendly for
> those users who use opp, because then they don't have to calculate the
> new EM by themselves, but only need After updating the voltage of opp,
> just call this interface directly.

That is the plan. Don't worry. I didn't want to push this in one
big patch set. Exynos driver + the OPP change would do exactly this.
The EM functions from drivers/opp/of.c will be re-used for this.

It is too big to be made in one step. There is a pattern in those more
complex changes, like in the Arm SCMI fwk, of making the improvements
gradually. This falls into the same bucket.

However, you are another person asking for a similar thing, so I
will send a follow-up change using this new EM API instead
of waiting to finish this review.

Thanks,
Lukasz

2023-12-20 08:06:21

by Lukasz Luba

Subject: Re: [PATCH v5 11/23] PM: EM: Add API for updating the runtime modifiable EM



On 12/12/23 18:50, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>
> [...]
>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 489a358b9a00..614891fde8df 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -221,6 +221,52 @@ static int em_allocate_perf_table(struct em_perf_domain *pd,
>> return 0;
>> }
>>
>> +/**
>> + * em_dev_update_perf_domain() - Update runtime EM table for a device
>> + * @dev : Device for which the EM is to be updated
>> + * @table : The new EM table that is going to used from now
>
> s/going to used/going to be used
>
>> + *
>> + * Update EM runtime modifiable table for the @dev using the privided @table.
>
> s/privided/provided
>
>> + *
>> + * This function uses mutex to serialize writers, so it must not be called
>> + * from non-sleeping context.
>> + *
>> + * Return 0 on success or a proper error in case of failure.
>> + */
>> +int em_dev_update_perf_domain(struct device *dev,
>> + struct em_perf_table __rcu *new_table)
>> +{
>> + struct em_perf_table __rcu *old_table;
>> + struct em_perf_domain *pd;
>> +
>> + /*
>> + * The lock serializes update and unregister code paths. When the
>> + * EM has been unregistered in the meantime, we should capture that
>> + * when entering this critical section. It also makes sure that
>
> What do you want to capture here? You want to block in this moment,
> right? Don't understand the 2. sentence here.
>
> [...]

There is a general issue with modules... they can reload. A driver which
registered an EM can then later disappear. I had similar issues with the
devfreq cooling. It can happen at any time. Let's consider a scenario
w/ 2 kernel drivers:
1. Main driver which registered EM, e.g. GPU driver
2. Thermal driver which updates that EM
When 1. starts the unload process, it has to make sure that it will
not free the main EM 'pd', because 2. might be using e.g.
'pd->nr_perf_states' while doing an update at that moment.
Thus, this 'pd' has a local mutex, to avoid issues of
module unload vs. EM update. The EM unregister will block on
that mutex and let the background update finish its critical
section.
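
A minimal userspace sketch of that unload vs. update serialization, assuming a pthread mutex in place of the kernel mutex. The names demo_update() and demo_unregister(), the struct layout and the single static lock are hypothetical and only illustrate the locking pattern, not the actual EM code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_STATES 8

struct perf_domain {
	int nr_perf_states;
	unsigned long power[MAX_STATES];
};

struct device {
	struct perf_domain *pd;	/* NULL once the EM is unregistered */
};

/* One lock serializes the update and unregister code paths. */
static pthread_mutex_t em_lock = PTHREAD_MUTEX_INITIALIZER;

/* Updater, e.g. a thermal driver: fails if the EM has gone away. */
static int demo_update(struct device *dev, const unsigned long *new_power)
{
	int ret = 0;

	assert(dev != NULL);
	pthread_mutex_lock(&em_lock);
	if (!dev->pd) {
		/* The owner unregistered the EM in the meantime. */
		ret = -ENODEV;
	} else {
		for (int i = 0; i < dev->pd->nr_perf_states; i++)
			dev->pd->power[i] = new_power[i];
	}
	pthread_mutex_unlock(&em_lock);
	return ret;
}

/* Owner, e.g. a GPU driver on module unload. */
static void demo_unregister(struct device *dev)
{
	assert(dev != NULL);
	/* Blocks here until a running update leaves its critical section. */
	pthread_mutex_lock(&em_lock);
	dev->pd = NULL;
	pthread_mutex_unlock(&em_lock);
}
```

The updater re-checks the pointer only after taking the lock, which is what lets it detect an unregister that happened while it was waiting.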

2023-12-20 08:20:12

by Lukasz Luba

Subject: Re: [PATCH v5 13/23] PM: EM: Add performance field to struct em_perf_state



On 12/17/23 18:00, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
>> The performance doesn't scale linearly with the frequency. Also, it may
>> be different in different workloads. Some CPUs are designed to be
>> particularly good at some applications e.g. images or video processing
>> and other CPUs in different. When those different types of CPUs are
>> combined in one SoC they should be properly modeled to get max of the HW
>> in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the
>> power vs. performance curves to the EAS, but assumes the CPUs capacity
>> is fixed and scales linearly with the frequency. This patch allows to
>> adjust the curve on the 'performance' axis as well.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> include/linux/energy_model.h | 11 ++++++-----
>> kernel/power/energy_model.c | 27 +++++++++++++++++++++++++++
>> 2 files changed, 33 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index ae3ccc8b9f44..e30750500b10 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -13,6 +13,7 @@
>>
>> /**
>> * struct em_perf_state - Performance state of a performance domain
>> + * @performance: Non-linear CPU performance at a given frequency
>> * @frequency: The frequency in KHz, for consistency with CPUFreq
>> * @power: The power consumed at this level (by 1 CPU or by a registered
>> * device). It can be a total power: static and dynamic.
>> @@ -21,6 +22,7 @@
>> * @flags: see "em_perf_state flags" description below.
>> */
>> struct em_perf_state {
>> + unsigned long performance;
>> unsigned long frequency;
>> unsigned long power;
>> unsigned long cost;
>> @@ -207,14 +209,14 @@ void em_free_table(struct em_perf_table __rcu *table);
>> */
>> static inline int
>> em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
>> - unsigned long freq, unsigned long pd_flags)
>> + unsigned long max_util, unsigned long pd_flags)
>> {
>> struct em_perf_state *ps;
>> int i;
>>
>> for (i = 0; i < nr_perf_states; i++) {
>> ps = &table[i];
>> - if (ps->frequency >= freq) {
>> + if (ps->performance >= max_util) {
>> if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
>> ps->flags & EM_PERF_STATE_INEFFICIENT)
>> continue;
>> @@ -246,8 +248,8 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>> unsigned long allowed_cpu_cap)
>> {
>> struct em_perf_table *runtime_table;
>> - unsigned long freq, scale_cpu;
>> struct em_perf_state *ps;
>> + unsigned long scale_cpu;
>> int cpu, i;
>>
>> if (!sum_util)
>> @@ -274,14 +276,13 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>
>> max_util = map_util_perf(max_util);
>> max_util = min(max_util, allowed_cpu_cap);
>> - freq = map_util_freq(max_util, ps->frequency, scale_cpu);
>>
>> /*
>> * Find the lowest performance state of the Energy Model above the
>> * requested frequency.
>> */
>> i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
>> - freq, pd->flags);
>> + max_util, pd->flags);
>> ps = &runtime_table->state[i];
>>
>> /*
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 614891fde8df..b5016afe6a19 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -46,6 +46,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
>> debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
>> debugfs_create_ulong("power", 0444, d, &ps->power);
>> debugfs_create_ulong("cost", 0444, d, &ps->cost);
>> + debugfs_create_ulong("performance", 0444, d, &ps->performance);
>> debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
>> }
>>
>> @@ -171,6 +172,30 @@ em_allocate_table(struct em_perf_domain *pd)
>> return table;
>> }
>>
>> +static void em_init_performance(struct device *dev, struct em_perf_domain *pd,
>> + struct em_perf_state *table, int nr_states)
>> +{
>> + u64 fmax, max_cap;
>> + int i, cpu;
>> +
>> + /* This is needed only for CPUs and EAS skip other devices */
>> + if (!_is_cpu_device(dev))
>> + return;
>> +
>> + cpu = cpumask_first(em_span_cpus(pd));
>> +
>> + /*
>> + * Calculate the performance value for each frequency with
>> + * linear relationship. The final CPU capacity might not be ready at
>> + * boot time, but the EM will be updated a bit later with correct one.
>> + */
>> + fmax = (u64) table[nr_states - 1].frequency;
>> + max_cap = (u64) arch_scale_cpu_capacity(cpu);
>> + for (i = 0; i < nr_states; i++)
>> + table[i].performance = div64_u64(max_cap * table[i].frequency,
>> + fmax);
>
> Should we sanity check the returned performance value is correct in case we got
> passed a malformed table? Maybe the table is sanity checked and sorted before
> we get here; I didn't check to be honest.

The frequency values are checked for ascending sort order. It's
done in em_create_perf_table(). An error is even printed and
returned, so the EM registration will fail.

>
> I think a warning that performance is always <= max_cap would be helpful in
> general as code evolved in the future.

I don't see the need. The needed checks for frequency values are there and
this simple math formula is just linear. Nothing can go wrong when
frequencies are sorted ascending. The whole EAS relies on that fact:

Frequencies are sorted ascending, thus
fmax = (u64) table[nr_states - 1].frequency
is always the highest frequency, so performance can never exceed max_cap.
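
The argument can be illustrated with a small userspace sketch: once the frequencies are verified to be ascending, the linear formula cannot produce a value above max_cap. The function name init_linear_performance() and the struct are hypothetical, and the strictness of the ordering check is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

struct perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long performance;	/* capacity units */
};

/* Returns 0 on success, -1 if the table is malformed. */
static int init_linear_performance(struct perf_state *table, int nr_states,
				   uint64_t max_cap)
{
	uint64_t fmax;

	assert(nr_states > 0);

	/* Reject tables whose frequencies are not strictly ascending. */
	for (int i = 1; i < nr_states; i++) {
		if (table[i].frequency <= table[i - 1].frequency)
			return -1;
	}

	/* With ascending order the last entry holds the maximum frequency. */
	fmax = table[nr_states - 1].frequency;
	if (fmax == 0)
		return -1;

	/* frequency <= fmax for every entry, hence performance <= max_cap. */
	for (int i = 0; i < nr_states; i++)
		table[i].performance =
			(unsigned long)(max_cap * table[i].frequency / fmax);

	return 0;
}
```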

2023-12-20 08:22:25

by Lukasz Luba

Subject: Re: [PATCH v5 14/23] PM: EM: Support late CPUs booting and capacity adjustment



On 12/12/23 18:50, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> The patch adds needed infrastructure to handle the late CPUs boot, which
>> might change the previous CPUs capacity values. With this changes the new
>> CPUs which try to register EM will trigger the needed re-calculations for
>> other CPUs EMs. Thanks to that the em_per_state::performance values will
>> be aligned with the CPU capacity information after all CPUs finish the
>> boot and EM registrations.
>
> IMHO, it's worth mentioning here that this added functionality is the 1.
> use case of the modifiable EM.

Makes sense. I will add that. It's quite important information, since
it also justifies the EM update feature.

>
> [...]
>
>> + * Adjustment of CPU performance values after boot, when all CPUs capacites
>> + * are correctly calculated.
>> + */
>> +static void em_adjust_new_capacity(struct device *dev,
>> + struct em_perf_domain *pd,
>> + u64 max_cap)
>> +{
>
> [...]
>
>> + /*
>> + * This is one-time-update, so give up the ownership in this updater.
>> + * The EM fwk will keep the reference and free the memory when needed.
>
> s/fwk/framework ?

OK

>
>> + */
>> + em_free_table(runtime_table);
>> +}
>> +
>> +static void em_check_capacity_update(void)
>> +{
>> + cpumask_var_t cpu_done_mask;
>> + struct em_perf_state *table;
>> + struct em_perf_domain *pd;
>> + unsigned long cpu_capacity;
>> + int cpu;
>> +
>> + if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
>> + pr_warn("no free memory\n");
>> + return;
>> + }
>> +
>> + /* Check if CPUs capacity has changed than update EM */
>
> s/than/then ?
>
> Maybe this comment is not needed since there is (1) further down?

Yes, I'll remove that.

>
>
>> + for_each_possible_cpu(cpu) {
>> + struct cpufreq_policy *policy;
>> + unsigned long em_max_perf;
>> + struct device *dev;
>> + int nr_states;
>> +
>> + if (cpumask_test_cpu(cpu, cpu_done_mask))
>> + continue;
>> +
>> + policy = cpufreq_cpu_get(cpu);
>> + if (!policy) {
>> + pr_debug("Accessing cpu%d policy failed\n", cpu);
>> + schedule_delayed_work(&em_update_work,
>> + msecs_to_jiffies(1000));
>> + break;
>> + }
>> + cpufreq_cpu_put(policy);
>> +
>> + pd = em_cpu_get(cpu);
>> + if (!pd || em_is_artificial(pd))
>> + continue;
>> +
>> + cpumask_or(cpu_done_mask, cpu_done_mask,
>> + em_span_cpus(pd));
>> +
>> + nr_states = pd->nr_perf_states;
>> + cpu_capacity = arch_scale_cpu_capacity(cpu);
>> +
>> + table = em_get_table(pd);
>> + em_max_perf = table[pd->nr_perf_states - 1].performance;
>> + em_put_table();
>> +
>> + /*
>> + * Check if the CPU capacity has been adjusted during boot
>> + * and trigger the update for new performance values.
>> + */
>
> (1)
>
> [...]
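
The capacity check discussed in this message can be sketched as follows. The quoted hunk is cut before the actual comparison, so the condition below (top EM performance differs from the current CPU capacity) is an assumption, and needs_capacity_update() and rescale_performance() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct perf_state {
	unsigned long frequency;	/* kHz */
	unsigned long performance;	/* capacity units */
};

/* True when the EM's top performance no longer matches the CPU capacity. */
static bool needs_capacity_update(const struct perf_state *table,
				  int nr_states, unsigned long cpu_capacity)
{
	assert(nr_states > 0);
	return table[nr_states - 1].performance != cpu_capacity;
}

/* Rescale all performance values linearly to the new maximum capacity. */
static void rescale_performance(struct perf_state *table, int nr_states,
				unsigned long cpu_capacity)
{
	uint64_t fmax = table[nr_states - 1].frequency;

	assert(fmax > 0);
	for (int i = 0; i < nr_states; i++)
		table[i].performance = (unsigned long)
			((uint64_t)cpu_capacity * table[i].frequency / fmax);
}
```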

2023-12-20 08:44:47

by Lukasz Luba

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division



On 12/12/23 18:50, Dietmar Eggemann wrote:
> On 29/11/2023 12:08, Lukasz Luba wrote:
>> The Energy Model (EM) can be modified at runtime which brings new
>> possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
>> (EAS) in it's hot path. The energy calculation uses power value for
>
> NIT: s/it's/its

OK

>
>> a given performance state (ps) and the CPU busy time as percentage for that
>> given frequency, which effectively is:
>>
>> pd_nrg = ps->power * busy_time_pct (1)
>>
>> cpu_util
>> busy_time_pct = ----------------- (2)
>> ps->performance
>>
>> The 'ps->performance' is the CPU capacity (performance) at that given ps.
>> Thus, in a situation when the OS is not overloaded and we have EAS
>> working, the busy time is lower than 'ps->performance' that the CPU is
>> running at. Therefore, in longer scheduling period we can treat the power
>> value calculated above as the energy.
>
> Not sure I understand what a longer 'scheduling period' has to do with
> that? Is this to highlight the issue between instantaneous power and the
> energy being the integral over it? And the 'scheduling period' is the
> runnable time of this task?

I can probably drop this sentence. I just wanted to describe that EAS
operates on power values, but actually treats them as energy
because we know that the tasks will run for longer. This patch header
is not the best place to even try to describe this bit of EAS+EM.

>
>> We can optimize the last arithmetic operation in em_cpu_energy() and
>> remove the division. This can be done because em_perf_state::cost, which
>> is a special coefficient, can now hold the pre-calculated value including
>> the 'ps->performance' information for a performance state (ps):
>>
>> ps->power
>> ps->cost = --------------- (3)
>> ps->performance
>
> Ah, this is equation (2) in the existing code with s/cap/performance.

yes

>
>> In the past the 'ps->performance' had to be calculated at runtime every
>> time the em_cpu_energy() was called. Thus there was this formula involved:
>>
>
>> ps->freq
>> ps->performance = ------------- * scale_cpu (4)
>> cpu_max_freq
>>
>> When we inject (4) into (2) than we can have this equation:
>>
>> cpu_util * cpu_max_freq
>> busy_time_pct = ------------------------ (5)
>> ps->freq * scale_cpu
>>
>> Because the right 'scale_cpu' value wasn't ready during the boot time
>> and EM initialization, we had to perform the division by 'scale_cpu'
>> at runtime. There was not safe mechanism to update EM at runtime.
>> It has changed thanks to EM runtime modification feature.
>>
>> It is possible to avoid the division by 'scale_cpu' at runtime, because
>> EM is updated whenever new max capacity CPU is set in the system or after
>> the boot has finished and proper CPU capacity is ready.
>>
>> Use that feature and do the needed division during the calculation of the
>> coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
>> multiplied simply by utilization:
>>
>> pd_nrg = ps->cost * \Sum cpu_util (6)
>>
>> to get the needed energy for whole Performance Domain (PD).
>>
>> With this optimization, the em_cpu_energy() should run faster on the Big
>> CPU by 1.43x and on the Little CPU by 1.69x.
>
> Where are those precise numbers are coming from? Which platform was it?

That was a mainline big.LITTLE board, RockPi 4B w/ Rockchip 3399, which is
present in quite a few commercial devices (e.g. Chromebooks or plenty of
others seen in DT). The numbers are from measuring the time it takes to run
this function em_cpu_cost() in a loop millions of times. Thus, the instruction
cache and data cache should be hot, and the operation itself is what makes
the difference in the score.
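
The algebra behind equations (3) to (6) in the quoted commit message can be checked numerically. This is a sketch in floating point with hypothetical function names; the kernel works with integers, so its exact rounding and any scaling it applies are not modeled here.

```c
#include <assert.h>

/* Old scheme: cost = power * fmax / freq, divided by scale_cpu at runtime. */
static double energy_old(double power, double freq, double fmax,
			 double scale_cpu, double sum_util)
{
	double cost = power * fmax / freq;

	assert(scale_cpu > 0.0);
	return cost * sum_util / scale_cpu;
}

/* Equation (4): performance of a state under the linear assumption. */
static double performance_of(double freq, double fmax, double scale_cpu)
{
	assert(fmax > 0.0);
	return freq / fmax * scale_cpu;
}

/* Equation (3): cost pre-calculated once, when the EM is (re)built. */
static double cost_new(double power, double performance)
{
	assert(performance > 0.0);
	return power / performance;
}

/* Equation (6): the hot path is a single multiplication. */
static double energy_new(double cost, double sum_util)
{
	return cost * sum_util;
}
```

Both paths give the same result as long as 'performance' is derived from the same scale_cpu value, which is why the EM has to be updated when the CPU capacity changes.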

>
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> include/linux/energy_model.h | 68 +++++-------------------------------
>> kernel/power/energy_model.c | 7 ++--
>> 2 files changed, 12 insertions(+), 63 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index e30750500b10..0f5621898a81 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -115,27 +115,6 @@ struct em_perf_domain {
>> #define EM_MAX_NUM_CPUS 16
>> #endif
>>
>> -/*
>> - * To avoid an overflow on 32bit machines while calculating the energy
>> - * use a different order in the operation. First divide by the 'cpu_scale'
>> - * which would reduce big value stored in the 'cost' field, then multiply by
>> - * the 'sum_util'. This would allow to handle existing platforms, which have
>> - * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
>> - * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
>> - * could be 4096, then multiplication: 'cost' * 'sum_util' would overflow.
>> - * This reordering of operations has some limitations, we lose small
>> - * precision in the estimation (comparing to 64bit platform w/o reordering).
>> - *
>> - * We are safe on 64bit machine.
>> - */
>> -#ifdef CONFIG_64BIT
>> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
>> - (((cost) * (sum_util)) / (scale_cpu))
>> -#else
>> -#define em_estimate_energy(cost, sum_util, scale_cpu) \
>> - (((cost) / (scale_cpu)) * (sum_util))
>> -#endif
>> -
>> struct em_data_callback {
>> /**
>> * active_power() - Provide power at the next performance state of
>> @@ -249,29 +228,16 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>> {
>> struct em_perf_table *runtime_table;
>> struct em_perf_state *ps;
>> - unsigned long scale_cpu;
>> - int cpu, i;
>> + int i;
>>
>> if (!sum_util)
>> return 0;
>>
>> - /*
>> - * In order to predict the performance state, map the utilization of
>> - * the most utilized CPU of the performance domain to a requested
>> - * frequency, like schedutil. Take also into account that the real
>> - * frequency might be set lower (due to thermal capping). Thus, clamp
>> - * max utilization to the allowed CPU capacity before calculating
>> - * effective frequency.
>
> Why do you remove this comment? IMHO, it's still valid and independent
> of the changes here?

Fair enough, I thought this comment would cause more confusion in the new
function, but I'll keep it.

>
>> - */
>> - cpu = cpumask_first(to_cpumask(pd->cpus));
>> - scale_cpu = arch_scale_cpu_capacity(cpu);
>> -
>> /*
>> * No rcu_read_lock() since it's already called by task scheduler.
>> * The runtime_table is always there for CPUs, so we don't check.
>> */
>> runtime_table = rcu_dereference(pd->runtime_table);
>> -
>> ps = &runtime_table->state[pd->nr_perf_states - 1];
>>
>> max_util = map_util_perf(max_util);
>> @@ -286,35 +252,21 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>> ps = &runtime_table->state[i];
>>
>> /*
>> - * The capacity of a CPU in the domain at the performance state (ps)
>> - * can be computed as:
>> - *
>> - * ps->freq * scale_cpu
>> - * ps->cap = -------------------- (1)
>> - * cpu_max_freq
>> - *
>> - * So, ignoring the costs of idle states (which are not available in
>> - * the EM), the energy consumed by this CPU at that performance state
>> + * The energy consumed by the CPU at the given performance state (ps)
>> * is estimated as:
>> *
>> - * ps->power * cpu_util
>> - * cpu_nrg = -------------------- (2)
>> - * ps->cap
>> + * ps->power
>> + * cpu_nrg = --------------- * cpu_util (1)
>> + * ps->performance
>> *
>> - * since 'cpu_util / ps->cap' represents its percentage of busy time.
>> + * The 'cpu_util / ps->performance' represents its percentage of
>> + * busy time. The idle cost is ignored (it's not available in the EM).
>> *
>> * NOTE: Although the result of this computation actually is in
>> * units of power, it can be manipulated as an energy value
>> * over a scheduling period, since it is assumed to be
>> * constant during that interval.
>> *
>> - * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
>> - * of two terms:
>> - *
>> - * ps->power * cpu_max_freq cpu_util
>> - * cpu_nrg = ------------------------ * --------- (3)
>> - * ps->freq scale_cpu
>> - *
>> * The first term is static, and is stored in the em_perf_state struct
>> * as 'ps->cost'.
>> *
>> @@ -323,11 +275,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>> * total energy of the domain (which is the simple sum of the energy of
>> * all of its CPUs) can be factorized as:
>> *
>> - * ps->cost * \Sum cpu_util
>> - * pd_nrg = ------------------------ (4)
>> - * scale_cpu
>> + * pd_nrg = ps->cost * \Sum cpu_util (2)
>> */
>> - return em_estimate_energy(ps->cost, sum_util, scale_cpu);
>> + return ps->cost * sum_util;
>
> Can you not keep the existing comment and only change:
>
> (a) that ps->cap id ps->performance in (2) and
>
> (b) that:
>
> * ps->power * cpu_max_freq cpu_util
> * cpu_nrg = ------------------------ * --------- (3)
> * ps->freq scale_cpu
>
> <---- (old) ps->cost --->
>
> is now
>
> ps->power * cpu_max_freq 1
> ps-> cost = ------------------------ * ----------
> ps->freq scale_cpu
>
> <---- (old) ps->cost --->
>
> and (c) that (4) has changed to:
>
> * pd_nrg = ps->cost * \Sum cpu_util (4)
>
> which avoid the division?
>
> Less changes is always much nicer since it makes it so much easier to
> detect history and review changes.

I'm open to changing that, but I will have to contact you offline
to clarify what you mean. This comment section in the code is really
tricky to get right.

>
> I do understand the changes from the technical viewpoint but the review
> took me way too long which I partly blame to all the changes in the
> comments which could have been avoided. Just want to make sure that
> others done have to go through this pain too.
>

I'll try to apply your comments and produce a smaller diff in that
patch.

2023-12-20 11:13:37

by Lukasz Luba

Subject: Re: [PATCH v5 22/23] PM: EM: Add em_dev_compute_costs() as API for device drivers

Hi Dietmar, Qais, Xuewen,

On 12/18/23 11:56, Lukasz Luba wrote:
> Hi Dietmar and Qais,
>
> On 12/17/23 18:03, Qais Yousef wrote:
>> On 12/12/23 19:50, Dietmar Eggemann wrote:
>>> On 29/11/2023 12:08, Lukasz Luba wrote:
>>>> The device drivers can modify EM at runtime by providing a new EM
>>>> table.
>>>> The EM is used by the EAS and the em_perf_state::cost stores
>>>> pre-calculated value to avoid overhead. This patch provides the API for
>>>> device drivers to calculate the cost values properly (and not duplicate
>>>> the same code).
>>>
>>> New interface w/o any users? Can we not remove this from this patch-set
>>> and introduce it with the first user(s)?
>
> I didn't wanted to introduce the user of this in the same patch set.
> I will send a follow up patch for Exynos SoC. More about this below.
>
>>
>> It's a chicken and egg problem. No interface, will not enable the new
>> users to
>> appear too. So assuming the interface makes sense, I vote to keep it.
>
> There are already in mainline platforms which will benefit from this
> feature and would use this API. The platform which support chip
> binning and adjust the voltage based on that information. It can be a
> driver which can even be built as a module. One example is Exynos5 ASV
> (Adaptive Supply Voltage) part of the Exynos chipid driver [1].
> Here is the dmesg log with some additional debug from this driver.
> As you can see the EM finished the registration and also update (the
> new feature from this patch set), but it worked on old Voltages from
> OPPs. (Also, this driver can be built as a module).
>
> -------------------------------------------------
> [    4.651049] cpu cpu4: EM: created perf domain
> [    4.654073] cpu cpu0: EM: OPP:1200000 is inefficient
> [    4.654108] cpu cpu0: EM: OPP:1100000 is inefficient
> [    4.654140] cpu cpu0: EM: OPP:900000 is inefficient
> [    4.654173] cpu cpu0: EM: OPP:800000 is inefficient
> [    4.654204] cpu cpu0: EM: OPP:600000 is inefficient
> [    4.654235] cpu cpu0: EM: OPP:500000 is inefficient
> [    4.654266] cpu cpu0: EM: OPP:400000 is inefficient
> [    4.654297] cpu cpu0: EM: OPP:200000 is inefficient
> [    4.654342] cpu cpu0: EM: updated
> ....
> [    4.750026] exynos-chipid 10000000.chipid: cpu0 opp0, freq: 1500 missing
> [    4.755329] exynos-chipid 10000000.chipid: Checking asv_volt=1175000
> opp_volt=1275000
> [    4.763213] exynos-chipid 10000000.chipid: Checking asv_volt=1125000
> opp_volt=1250000
> [    4.770982] exynos-chipid 10000000.chipid: Checking asv_volt=1075000
> opp_volt=1250000
> [    4.778820] exynos-chipid 10000000.chipid: Checking asv_volt=1037500
> opp_volt=1250000
> [    4.786515] exynos-chipid 10000000.chipid: Checking asv_volt=1000000
> opp_volt=1100000
> [    4.794356] exynos-chipid 10000000.chipid: Checking asv_volt=962500
> opp_volt=1100000
> [    4.802018] exynos-chipid 10000000.chipid: Checking asv_volt=925000
> opp_volt=1100000
> [    4.816323] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=1000000
> [    4.824109] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=1000000
> [    4.839933] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=1000000
> [    4.854762] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=1000000
> [    4.866191] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    4.878812] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    4.886052] exynos-chipid 10000000.chipid: cpu4 opp0, freq: 2100 missing
> [    4.892800] exynos-chipid 10000000.chipid: Checking asv_volt=1225000
> opp_volt=1312500
> [    4.900542] exynos-chipid 10000000.chipid: Checking asv_volt=1162500
> opp_volt=1262500
> [    4.908342] exynos-chipid 10000000.chipid: Checking asv_volt=1112500
> opp_volt=1237500
> [    4.916066] exynos-chipid 10000000.chipid: Checking asv_volt=1075000
> opp_volt=1250000
> [    4.923926] exynos-chipid 10000000.chipid: Checking asv_volt=1037500
> opp_volt=1250000
> [    4.931707] exynos-chipid 10000000.chipid: Checking asv_volt=1000000
> opp_volt=1100000
> [    4.939582] exynos-chipid 10000000.chipid: Checking asv_volt=975000
> opp_volt=1100000
> [    4.947225] exynos-chipid 10000000.chipid: Checking asv_volt=950000
> opp_volt=1100000
> [    4.954885] exynos-chipid 10000000.chipid: Checking asv_volt=925000
> opp_volt=1000000
> [    4.962601] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=1000000
> [    4.974047] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=1000000
> [    4.974071] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=1000000
> [    4.993670] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    5.001163] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    5.008818] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    5.016318] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    5.023955] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    5.039723] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    5.054445] exynos-chipid 10000000.chipid: Checking asv_volt=900000
> opp_volt=900000
> [    5.066709] exynos-chipid 10000000.chipid: Exynos: CPU[EXYNOS5800]
> PRO_ID[0xe5422000] REV[0x1] Detected
>
> -------------------------------------------------
>
> The new EM which would be updated from that driver, would have lower
> voltages as well as different 'inefficient OPPs'. The maximum voltage
> difference based on the tables is 13.54% which means for the dynamic
> power:
> 1362500 = 1.135416667 * 1200000
> P_dyn = C* f * (V*1.1354 * V*1.1354) = C*f*V^2 * 1.289
>
> That's ~29% different dynamic power (for one core).
>
> This Voltage adjustment is due to chip lottery. Different SoC vendors
> use different name for this fact.
> I only have this Exynos platform, but when this API
> and v5 features get in, the vendors can modify their drivers and test.
>
> This should help both: EAS and IPA/DTPM.
>
> Regards,
> Lukasz
>
> [1]
> https://elixir.bootlin.com/linux/latest/source/drivers/soc/samsung/exynos5422-asv.c
>
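
The dynamic power estimate in the quoted message can be reproduced with a few lines of C. This is a sketch assuming P_dyn = C * f * V^2 with C and f unchanged, so only the voltage ratio matters; dyn_power_ratio() is a hypothetical helper, and the two voltages are the ones from the quoted calculation.

```c
#include <assert.h>

/*
 * Ratio of dynamic power between two voltages for the same capacitance
 * and frequency: (C * f * V_new^2) / (C * f * V_old^2) = (V_new / V_old)^2.
 */
static double dyn_power_ratio(double v_new_uv, double v_old_uv)
{
	double r;

	assert(v_old_uv > 0.0);
	r = v_new_uv / v_old_uv;
	return r * r;
}
```

Calling dyn_power_ratio(1362500.0, 1200000.0) gives roughly 1.289, which matches the ~29% difference stated in the message.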

Because you wanted to see how this API is going to be used after
boot, I have sent a follow-up patch for the OPP framework and Exynos
chip driver [1].

You can see there that all drivers which need this feature would
share the same code in OPP. That OPP code uses the new EM APIs.

I don't want to combine this as well in one step in this patch set.
I'd rather follow step-by-step development like in Arm SCMI.

Regards,
Lukasz

[1] https://lore.kernel.org/lkml/[email protected]/
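
To make the voltage arithmetic quoted earlier in this message concrete: dynamic
power scales as C * f * V^2, so at the same frequency the ratio between two
bins depends only on the squared voltage ratio. The following is a minimal
userspace sketch in plain C, not kernel code; the helper name
dyn_power_ratio() is made up for this illustration.

```c
#include <assert.h>

/*
 * Ratio of dynamic power between two voltages at the same frequency.
 * P_dyn = C * f * V^2, so C and f cancel and only (V_hi / V_lo)^2 remains.
 * Hypothetical helper for illustration, not part of any kernel API.
 */
static double dyn_power_ratio(double volt_hi_uv, double volt_lo_uv)
{
	double r;

	assert(volt_lo_uv > 0.0);
	r = volt_hi_uv / volt_lo_uv;
	return r * r;
}
```

Calling dyn_power_ratio(1362500.0, 1200000.0) gives roughly 1.289, which is
the ~29% difference in dynamic power mentioned in the quoted text.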

2023-12-28 17:00:37

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 02/23] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments

On 12/19/23 10:30, Lukasz Luba wrote:
>
>
> On 12/17/23 17:58, Qais Yousef wrote:
> > On 11/29/23 11:08, Lukasz Luba wrote:
> > > In order to prepare the code for the modifiable EM perf_state table,
> > > refactor existing function em_cpufreq_update_efficiencies().
> >
> > nit: What is being refactored here? The description is not adding much info
> > about the change.
>
> The function takes the ptr to the table now as its argument. You have
> missed that in the code below?

I meant the commit message could be more descriptive if you care to expand on
it.

2023-12-28 17:13:31

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 03/23] PM: EM: Find first CPU active while updating OPP efficiency

On 12/19/23 10:53, Lukasz Luba wrote:
>
>
> On 12/17/23 17:58, Qais Yousef wrote:
> > On 11/29/23 11:08, Lukasz Luba wrote:
> > > The Energy Model might be updated at runtime and the energy efficiency
> > > for each OPP may change. Thus, there is a need to update also the
> > > cpufreq framework and make it aligned to the new values. In order to
> > > do that, use a first active CPU from the Performance Domain. This is
> > > needed since the first CPU in the cpumask might be offline when we
> > > run this code path.
> >
> > I didn't understand the problem here. It seems you're fixing a race, but the
> > description is not clear to me what the race is.
>
> I have explained that in v1, v4 comments for this patch.
> When the EM is registered the first CPU is always online. That is no
> problem for the old code, but it is one for the new code with runtime
> modification at a later time, potentially from different subsystems
> (e.g. thermal, drivers, etc). The first CPU might be offline, but still
> such an EM update for this domain shouldn't fail. Although, when the CPU
> is offline we cannot get the valid policy...
>
> We can get it for next cpu in the cpumask, that's what the code is
> doing.

Okay, I can see now that cpufreq_cpu_get_raw() ignores offline CPUs
intentionally.

A new variant seems better to me. But the experts know better. So LGTM.
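
The idea discussed above (skip an offline first CPU and take the next usable
one from the performance domain) can be sketched in userspace C with plain
bitmasks. This is only an illustration under assumptions: the function name
and the 64-bit mask representation are invented here, and the kernel works
with struct cpumask helpers instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the lowest-numbered CPU that is set in both masks, or -1 if the
 * performance domain has no online CPU. Bit N stands for CPU N.
 * Hypothetical userspace stand-in for a cpumask lookup.
 */
static int first_online_cpu_in_pd(uint64_t pd_cpus, uint64_t online_cpus,
				  int nr_cpus)
{
	uint64_t candidates = pd_cpus & online_cpus;
	int cpu;

	assert(nr_cpus > 0 && nr_cpus <= 64);
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (candidates & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```

For a domain spanning CPUs 4-7 (mask 0xF0) with CPU4 offline (online mask
0xE0), the lookup returns CPU5, so the policy can still be fetched for that
domain.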

2023-12-28 17:14:55

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 05/23] PM: EM: Refactor a new function em_compute_costs()

On 12/19/23 10:59, Lukasz Luba wrote:
>
>
> On 12/17/23 17:58, Qais Yousef wrote:
> > On 11/29/23 11:08, Lukasz Luba wrote:
> > > Refactor a dedicated function which will be easier to maintain and re-use
> > > in future. The upcoming changes for the modifiable EM perf_state table
> > > will use it (instead of duplicating the code).
> >
> > nit: What is being refactored? Looks like you took em_compute_cost() out of
> > em_create_perf_table().
>
> Yes, it's going to be re-used later for also update code path, not only
> register code path.

Sorry I was terse. I meant the commit message could be clearer to require less
effort untangling what is actually being changed.

2023-12-28 17:33:13

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 09/23] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS

On 12/19/23 08:32, Lukasz Luba wrote:
> Hi Qais and Xuewen,
>
> On 12/19/23 04:03, Xuewen Yan wrote:
> > On Mon, Dec 18, 2023 at 1:59 AM Qais Yousef <[email protected]> wrote:
> > >
> > > On 11/29/23 11:08, Lukasz Luba wrote:
> > > > The new Energy Model (EM) supports runtime modification of the performance
> > > > state table to better model the power used by the SoC. Use this new
> > > > feature to improve energy estimation and therefore task placement in
> > > > Energy Aware Scheduler (EAS).
> > >
> > > nit: you moved the code to use the new runtime em table instead of the one
> > > parsed at boot.
> > >
> > > >
> > > > Signed-off-by: Lukasz Luba <[email protected]>
> > > > ---
> > > > include/linux/energy_model.h | 16 ++++++++++++----
> > > > 1 file changed, 12 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> > > > index 1e618e431cac..94a77a813724 100644
> > > > --- a/include/linux/energy_model.h
> > > > +++ b/include/linux/energy_model.h
> > > > @@ -238,6 +238,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > > > unsigned long max_util, unsigned long sum_util,
> > > > unsigned long allowed_cpu_cap)
> > > > {
> > > > + struct em_perf_table *runtime_table;
> > > > unsigned long freq, scale_cpu;
> > > > struct em_perf_state *ps;
> > > > int cpu, i;
> > > > @@ -255,7 +256,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > > > */
> > > > cpu = cpumask_first(to_cpumask(pd->cpus));
> > > > scale_cpu = arch_scale_cpu_capacity(cpu);
> > > > - ps = &pd->table[pd->nr_perf_states - 1];
> > > > +
> > > > + /*
> > > > + * No rcu_read_lock() since it's already called by task scheduler.
> > > > + * The runtime_table is always there for CPUs, so we don't check.
> > > > + */
> > >
> > > WARN_ON(rcu_read_lock_held()) instead?
> >
> > I agree, or SCHED_WARN_ON(!rcu_read_lock_held()) ?
>
> I disagree here. This is a sched function in hot path and as comment

WARN_ON() is not a sched function.

> says:
>
> -----------------------
> * This function must be used only for CPU devices. There is no validation,
> * i.e. if the EM is a CPU type and has cpumask allocated. It is called from
> * the scheduler code quite frequently and that is why there is not checks.
> -----------------------
>
> We don't have to put the checks or warnings everywhere in the kernel
> functions. Especially hot one like this one.

When checks are necessary, there are ways even for hot paths.

>
> As you might not notice, we don't even check if the pd->cpus is not NULL

rcu_read_lock_held() is only enabled for lockdebug build and it's the standard
way to document and add verification to ensure locking rules are honoured. On
non lockdebug build this will be compiled out.

You had to put a long comment to ensure locking rules are correct, why not
use existing infrastructure instead to provide better checks and inherent
documentation?

We had a bug recently where the rcu_read_lock() was moved and this broke some
function buried down in the call stack. So subtle code shuffles elsewhere can
cause unwanted side effects; and it's hard to catch these bugs.

https://lore.kernel.org/stable/[email protected]/

2023-12-28 17:46:30

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 13/23] PM: EM: Add performance field to struct em_perf_state

On 12/20/23 08:21, Lukasz Luba wrote:
>
>
> On 12/17/23 18:00, Qais Yousef wrote:
> > On 11/29/23 11:08, Lukasz Luba wrote:
> > > The performance doesn't scale linearly with the frequency. Also, it may
> > > be different in different workloads. Some CPUs are designed to be
> > > particularly good at some applications e.g. images or video processing
> > > and other CPUs in different. When those different types of CPUs are
> > > combined in one SoC they should be properly modeled to get max of the HW
> > > in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the
> > > power vs. performance curves to the EAS, but assumes the CPUs capacity
> > > is fixed and scales linearly with the frequency. This patch allows to
> > > adjust the curve on the 'performance' axis as well.
> > >
> > > Signed-off-by: Lukasz Luba <[email protected]>
> > > ---
> > > include/linux/energy_model.h | 11 ++++++-----
> > > kernel/power/energy_model.c | 27 +++++++++++++++++++++++++++
> > > 2 files changed, 33 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> > > index ae3ccc8b9f44..e30750500b10 100644
> > > --- a/include/linux/energy_model.h
> > > +++ b/include/linux/energy_model.h
> > > @@ -13,6 +13,7 @@
> > > /**
> > > * struct em_perf_state - Performance state of a performance domain
> > > + * @performance: Non-linear CPU performance at a given frequency
> > > * @frequency: The frequency in KHz, for consistency with CPUFreq
> > > * @power: The power consumed at this level (by 1 CPU or by a registered
> > > * device). It can be a total power: static and dynamic.
> > > @@ -21,6 +22,7 @@
> > > * @flags: see "em_perf_state flags" description below.
> > > */
> > > struct em_perf_state {
> > > + unsigned long performance;
> > > unsigned long frequency;
> > > unsigned long power;
> > > unsigned long cost;
> > > @@ -207,14 +209,14 @@ void em_free_table(struct em_perf_table __rcu *table);
> > > */
> > > static inline int
> > > em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
> > > - unsigned long freq, unsigned long pd_flags)
> > > + unsigned long max_util, unsigned long pd_flags)
> > > {
> > > struct em_perf_state *ps;
> > > int i;
> > > for (i = 0; i < nr_perf_states; i++) {
> > > ps = &table[i];
> > > - if (ps->frequency >= freq) {
> > > + if (ps->performance >= max_util) {
> > > if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
> > > ps->flags & EM_PERF_STATE_INEFFICIENT)
> > > continue;
> > > @@ -246,8 +248,8 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > > unsigned long allowed_cpu_cap)
> > > {
> > > struct em_perf_table *runtime_table;
> > > - unsigned long freq, scale_cpu;
> > > struct em_perf_state *ps;
> > > + unsigned long scale_cpu;
> > > int cpu, i;
> > > if (!sum_util)
> > > @@ -274,14 +276,13 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > > max_util = map_util_perf(max_util);
> > > max_util = min(max_util, allowed_cpu_cap);
> > > - freq = map_util_freq(max_util, ps->frequency, scale_cpu);
> > > /*
> > > * Find the lowest performance state of the Energy Model above the
> > > * requested frequency.
> > > */
> > > i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
> > > - freq, pd->flags);
> > > + max_util, pd->flags);
> > > ps = &runtime_table->state[i];
> > > /*
> > > diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> > > index 614891fde8df..b5016afe6a19 100644
> > > --- a/kernel/power/energy_model.c
> > > +++ b/kernel/power/energy_model.c
> > > @@ -46,6 +46,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
> > > debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
> > > debugfs_create_ulong("power", 0444, d, &ps->power);
> > > debugfs_create_ulong("cost", 0444, d, &ps->cost);
> > > + debugfs_create_ulong("performance", 0444, d, &ps->performance);
> > > debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
> > > }
> > > @@ -171,6 +172,30 @@ em_allocate_table(struct em_perf_domain *pd)
> > > return table;
> > > }
> > > +static void em_init_performance(struct device *dev, struct em_perf_domain *pd,
> > > + struct em_perf_state *table, int nr_states)
> > > +{
> > > + u64 fmax, max_cap;
> > > + int i, cpu;
> > > +
> > > + /* This is needed only for CPUs and EAS skip other devices */
> > > + if (!_is_cpu_device(dev))
> > > + return;
> > > +
> > > + cpu = cpumask_first(em_span_cpus(pd));
> > > +
> > > + /*
> > > + * Calculate the performance value for each frequency with
> > > + * linear relationship. The final CPU capacity might not be ready at
> > > + * boot time, but the EM will be updated a bit later with correct one.
> > > + */
> > > + fmax = (u64) table[nr_states - 1].frequency;
> > > + max_cap = (u64) arch_scale_cpu_capacity(cpu);
> > > + for (i = 0; i < nr_states; i++)
> > > + table[i].performance = div64_u64(max_cap * table[i].frequency,
> > > + fmax);
> >
> > Should we sanity check the returned performance value is correct in case we got
> > passed a malformed table? Maybe the table is sanity checked and sorted before
> > we get here; I didn't check to be honest.
>
> The frequency values are checked if they have asc sorting order. It's
> done in the em_create_perf_table(). There is even an error printed and
> returned, so the EM registration will fail.
>
> >
> > I think a warning that performance is always <= max_cap would be helpful in
> > general as code evolved in the future.
>
> I don't see that need. There are needed checks for frequency values and
> this simple math formula is just linear. Nothing can happen when
> frequencies are sorted asc. The whole EAS relies on that fact:
>
> Frequencies are sorted ascending, thus
> fmax = (u64) table[nr_states - 1].frequency
> is always true.

I saw that but wasn't sure if this is always guaranteed. It seems it is, from
what you're saying, so no issues here.
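
For readers following the patch, the two pieces discussed in this message
(linear initialisation of 'performance' and the lookup by utilisation instead
of frequency) can be tried out in userspace. This is a reduced sketch, not the
kernel code: struct perf_state and both function names are stand-ins invented
here, and the handling of inefficient states is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-in for the kernel's struct em_perf_state. */
struct perf_state {
	unsigned long performance;	/* capacity units */
	unsigned long frequency;	/* kHz */
};

/*
 * Fill 'performance' linearly from frequency, in the spirit of the quoted
 * em_init_performance(). Assumes the table is sorted by ascending
 * frequency, so the last entry holds fmax.
 */
static void init_performance(struct perf_state *table, int nr_states,
			     unsigned long max_cap)
{
	uint64_t fmax;
	int i;

	assert(nr_states > 0 && table[nr_states - 1].frequency > 0);
	fmax = table[nr_states - 1].frequency;

	for (i = 0; i < nr_states; i++)
		table[i].performance = (unsigned long)
			((uint64_t)max_cap * table[i].frequency / fmax);
}

/*
 * Return the index of the lowest state whose performance covers max_util.
 * Falls back to the highest state when no state is high enough.
 */
static int lowest_state_for_util(const struct perf_state *table, int nr_states,
				 unsigned long max_util)
{
	int i;

	for (i = 0; i < nr_states; i++) {
		if (table[i].performance >= max_util)
			return i;
	}
	return nr_states - 1;
}
```

With ascending frequencies every ratio frequency/fmax is at most 1, so each
computed performance value stays at or below max_cap, which is the property
asked about above.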

2023-12-28 18:07:01

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

On 11/29/23 11:08, Lukasz Luba wrote:

> @@ -220,8 +218,9 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
> return -EINVAL;
> }
> } else {
> - power_res = table[i].power;
> - cost = div64_u64(fmax * power_res, table[i].frequency);
> + /* increase resolution of 'cost' precision */
> + power_res = table[i].power * 10;

Power is in uW, right? You're just taking advantage of the fact that everything
will use this new cost field, so you can add as many 0s as you like to improve
resolution without affecting anything else that compares in the same units?

Did you see a problem, or are you just being extra cautious here?

> + cost = power_res / table[i].performance;
> }
>
> table[i].cost = cost;
> --
> 2.25.1
>

2023-12-28 18:41:31

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model

On 12/19/23 10:22, Lukasz Luba wrote:

> > One thing I'm not sure about is that in practice temperature of the SoC can
> > vary a lot in a short period of time. What is the expectation here? I can see
> > this useful in practice only if we average it over a window of time. Following
> > it will be really hard. Big variations can happen in few ms scales.
>
> It's mostly for long running heavy workloads, which involve other device
> than CPUs, e.g. GPU or ISP (Image Signal Processor). Those devices can
> heat up the SoC. In our game DrArm running on pixel6 the GPU uses 75-77%
> of total power budget (starting from ~2.5W for GPU + 1.3W for all CPUs).
> That 2.5W from the GPU is heating up the CPUs and mostly impact the Big
> cores, which are made from High-Performance cells (thus leaking more).
> OverUtilization in the first 4-5min of gaming is ~4-9%, so EAS can work
> and save some power, if it has a good model. Later we have thermal
> throttling and OU goes to ~50% but EAS still can work. If the model is
> more precise, thus adjusted for the rising leakage due to temperature
> increase (caused by GPU power), then we can still make better use of that
> power budget and not waste it on the leakage at higher OPPs.

I can understand the need. But looking at one specific case vs generalized form
is different.

So IIUC the expectation is to track temperature variations over minutes, caused
by sources external to the CPU.

> > I didn't get how the new performance field is supposed to be controlled and
> > modified by users. A driver interface doesn't seem suitable as there's no
> > subsystem that knows the characteristic of the workload except userspace. In
> > Android we do have contextual info about what the current top-app to enable
> > modifying the capacities to match its characteristics.
>
> Well in latest public documentation (May2023) for Cortex-X4 there are
> described new features of Arm cores: PDP, MPMM, which can change the
> 'performance' of the core in FW. Our SCMI kernel subsystem will get an
> interrupt, so the drivers can know about it. It could be used for
> recalculating the efficiency of the CPUs in the EM. When there is no
> hotplug and the long running app is still running, that FW policy would
> be reflected in EM. It's just not done all-in-one-step. Those patches
> will be later.

I think these features are some form of thermal throttling IIUC.

I was asking about handling the EM accuracy issue using the runtime model. I was
expecting some sysfs knobs. Do you see this also requiring a vendor-specific
driver to try to account for the EM inaccuracy issues we're seeing?

> Second, I have used that 'performance' field to finally get rid of
> this runtime division in em_cpu_energy() hot path - which was annoying
> me for a very long time. It wasn't possible to optimize that last
> operation there, because not all CPUs have booted and the final CPU
> capacity is not known when we register EMs. With this feature I can
> finally remove that heavy operation. You can see more in patch 15/23.

Yep, it's a good addition :)

> > > 5. All CPUs (Little+Mid+Big) power values in mW
> > > +------------+--------+---------------------+-------+-----------+
> > > | channel | metric | kernel | value | perc_diff |
> > > +------------+--------+---------------------+-------+-----------+
> > > | CPU | gmean | EM_default | 142.1 | 0.0% |
> > > | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
> > > +------------+--------+---------------------+-------+-----------+
> >
> > How did you modify the EM here? Did you change both performance and power
> > fields? How did you calculate the new ones?
>
> It was just the power values modified on my pixel6:
> for Littles 1.6x, Mid 0.8x, Big 1.3x of their boot power.
> TBH I don't know the chip binning of that SoC, but I suspect it
> could be due to this fact. More about possible error range in chip
> binning power values you can find in my comment to the patch 22/23

Strange that just modifying the power had this impact. It could be related to
a similar impact I've seen with the migration margin for the littles increasing.
Making the cost higher there would move the residency to other cores and
potentially reduce running at higher frequencies on the littles.

> > Did you try to simulate any heating effect during the run if you're taking
> > temperature into account to modify the power? What was the variation like and
>
> Yes, I did that experiment and presented on OSPM 2023 slide 13. There is
> big CPU power plot change in time, due to GPU heat. All detailed data is
> there. The big CPU power is ~18-20% higher when 1-1.5W GPU is heating up
> the whole SoC.

I meant during your experiment above.

> > at what rate was the EM being updated in this case? I think Jankbench in
>
> In this experiment EM was only set once w/ the values mentioned above.
> It could be due to the chip lottery. I cannot say on 100% this phone.
>
> > general wouldn't stress the SoC enough.
>
> True, this test is not power heavy as it can be seen. It's more
> to show that the default EM after boot might not be the optimal one.

I wouldn't reach that conclusion for this particular case. But the underlying
issue exists for sure.

> > It'd be insightful to look at frequency residencies between the two runs and
> > power breakdown for each cluster if you have access to them. No worries if not!
>
> I'm afraid you're asking for too much ;)

It should be easy to get them. It's hard to know where the benefit is coming
from otherwise. But as I said, no worries if not. If you have perfetto traces
I can help take a look.

> > My brain started to fail me somewhere around patch 15. I'll have another look
> > some time later in the week but generally looks good to me. If I have any
> > worries it is about how it can be used with the provided interfaces. Especially
> > expectations about managing fast thermal changes at the level you're targeting.
>
> No worries, thanks for the review! The fast thermal changes, which are
> linked to the CPU's workload are not an issue here and I'm not worried
> about those. The side effect of the heat from other device is the issue.
> Thus, that thermal driver which modifies the EM should be aware of the
> 'whole SoC' situation (like mainline IPA does, when it manages all
> devices in a single thermal zone).

I think in practice there will be challenges to generalize the thermal impact.
But overall from EM accuracy point of view (for all the various reasons
mentioned), we need this ability to help handle them in practice. Booting with
a single hardcoded EM doesn't work.


Cheers

--
Qais Yousef

2024-01-02 09:39:22

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 02/23] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments



On 12/28/23 16:59, Qais Yousef wrote:
> On 12/19/23 10:30, Lukasz Luba wrote:
>>
>>
>> On 12/17/23 17:58, Qais Yousef wrote:
>>> On 11/29/23 11:08, Lukasz Luba wrote:
>>>> In order to prepare the code for the modifiable EM perf_state table,
>>>> refactor existing function em_cpufreq_update_efficiencies().
>>>
>>> nit: What is being refactored here? The description is not adding much info
>>> about the change.
>>
>> The function takes the ptr to the table now as its argument. You have
>> missed that in the code below?
>
> I meant the commit message could be more descriptive if you care to expand on
> it.

I see, yes I will adjust the comment.

2024-01-02 09:40:58

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 03/23] PM: EM: Find first CPU active while updating OPP efficiency



On 12/28/23 17:13, Qais Yousef wrote:
> On 12/19/23 10:53, Lukasz Luba wrote:
>>
>>
>> On 12/17/23 17:58, Qais Yousef wrote:
>>> On 11/29/23 11:08, Lukasz Luba wrote:
>>>> The Energy Model might be updated at runtime and the energy efficiency
>>>> for each OPP may change. Thus, there is a need to update also the
>>>> cpufreq framework and make it aligned to the new values. In order to
>>>> do that, use a first active CPU from the Performance Domain. This is
>>>> needed since the first CPU in the cpumask might be offline when we
>>>> run this code path.
>>>
>>> I didn't understand the problem here. It seems you're fixing a race, but the
>>> description is not clear to me what the race is.
>>
>> I have explained that in v1, v4 comments for this patch.
>> When the EM is registered the first CPU is always online. That is no
>> problem for the old code, but it is one for the new code with runtime
>> modification at a later time, potentially from different subsystems
>> (e.g. thermal, drivers, etc). The first CPU might be offline, but still
>> such an EM update for this domain shouldn't fail. Although, when the CPU
>> is offline we cannot get the valid policy...
>>
>> We can get it for next cpu in the cpumask, that's what the code is
>> doing.
>
> Okay, I can see now that cpufreq_cpu_get_raw() ignores offline CPUs
> intentionally.
>
> A new variant seems better to me. But the experts know better. So LGTM.

Thanks. Yes, I will gently ask Viresh to have a look at those
cpufreq-related places.

2024-01-02 09:42:20

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 05/23] PM: EM: Refactor a new function em_compute_costs()



On 12/28/23 17:14, Qais Yousef wrote:
> On 12/19/23 10:59, Lukasz Luba wrote:
>>
>>
>> On 12/17/23 17:58, Qais Yousef wrote:
>>> On 11/29/23 11:08, Lukasz Luba wrote:
>>>> Refactor a dedicated function which will be easier to maintain and re-use
>>>> in future. The upcoming changes for the modifiable EM perf_state table
>>>> will use it (instead of duplicating the code).
>>>
>>> nit: What is being refactored? Looks like you took em_compute_cost() out of
>>> em_create_perf_table().
>>
>> Yes, it's going to be re-used later for also update code path, not only
>> register code path.
>
> Sorry I was terse. I meant the commit message could be clearer to require less
> effort untangling what is actually being changed.

OK, I will rephrase that description. Thanks.

2024-01-02 11:16:29

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 09/23] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS



On 12/28/23 17:32, Qais Yousef wrote:
> On 12/19/23 08:32, Lukasz Luba wrote:
>> Hi Qais and Xuewen,
>>
>> On 12/19/23 04:03, Xuewen Yan wrote:
>>> On Mon, Dec 18, 2023 at 1:59 AM Qais Yousef <[email protected]> wrote:
>>>>
>>>> On 11/29/23 11:08, Lukasz Luba wrote:
>>>>> The new Energy Model (EM) supports runtime modification of the performance
>>>>> state table to better model the power used by the SoC. Use this new
>>>>> feature to improve energy estimation and therefore task placement in
>>>>> Energy Aware Scheduler (EAS).
>>>>
>>>> nit: you moved the code to use the new runtime em table instead of the one
>>>> parsed at boot.
>>>>
>>>>>
>>>>> Signed-off-by: Lukasz Luba <[email protected]>
>>>>> ---
>>>>> include/linux/energy_model.h | 16 ++++++++++++----
>>>>> 1 file changed, 12 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>>>>> index 1e618e431cac..94a77a813724 100644
>>>>> --- a/include/linux/energy_model.h
>>>>> +++ b/include/linux/energy_model.h
>>>>> @@ -238,6 +238,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>>>> unsigned long max_util, unsigned long sum_util,
>>>>> unsigned long allowed_cpu_cap)
>>>>> {
>>>>> + struct em_perf_table *runtime_table;
>>>>> unsigned long freq, scale_cpu;
>>>>> struct em_perf_state *ps;
>>>>> int cpu, i;
>>>>> @@ -255,7 +256,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>>>> */
>>>>> cpu = cpumask_first(to_cpumask(pd->cpus));
>>>>> scale_cpu = arch_scale_cpu_capacity(cpu);
>>>>> - ps = &pd->table[pd->nr_perf_states - 1];
>>>>> +
>>>>> + /*
>>>>> + * No rcu_read_lock() since it's already called by task scheduler.
>>>>> + * The runtime_table is always there for CPUs, so we don't check.
>>>>> + */
>>>>
>>>> WARN_ON(rcu_read_lock_held()) instead?
>>>
>>> I agree, or SCHED_WARN_ON(!rcu_read_lock_held()) ?
>>
>> I disagree here. This is a sched function in hot path and as comment
>
> WARN_ON() is not a sched function.

I was referring to em_cpu_energy() being a sched function. No one else
should call it. That's the old contract, which is also put into the doc of
that function.

>
>> says:
>>
>> -----------------------
>> * This function must be used only for CPU devices. There is no validation,
>> * i.e. if the EM is a CPU type and has cpumask allocated. It is called from
>> * the scheduler code quite frequently and that is why there is not checks.
>> -----------------------
>>
>> We don't have to put the checks or warnings everywhere in the kernel
>> functions. Especially hot one like this one.
>
> When checks are necessary, there are ways even for hot paths.

We have that function called from feec(), where the RCU read lock must be
held, otherwise the whole EAS would be unstable.

>
>>
>> As you might not notice, we don't even check if the pd->cpus is not NULL
>
> rcu_read_lock_held() is only enabled for lockdebug build and it's the standard
> way to document and add verification to ensure locking rules are honoured. On
> non lockdebug build this will be compiled out.
>
> You had to put a long comment to ensure locking rules are correct, why not
> use existing infrastructure instead to provide better checks and inherent
> documentation?

I didn't want to add any more overhead in this hot path.

>
> We had a bug recently where the rcu_read_lock() was moved and this broke some
> function buried down in the call stack. So subtle code shuffles elsewhere can
> cause unwanted side effects; and it's hard to catch these bugs.
>
> https://lore.kernel.org/stable/[email protected]/

OK, let me check that w/ and w/o lockdebug build and the
SCHED_WARN_ON(!rcu_read_lock_held())

Although, it would be only a safety net for accidental use of
em_cpu_energy() from code path other than feec()...
Which actually might bring some value.
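
The "assert that the caller already holds the read lock" pattern debated in
this sub-thread can be mimicked in userspace to see what the safety net buys.
Everything below is a mock and an assumption for illustration: the mock_*
helpers and MOCK_WARN_ON() only imitate the shape of rcu_read_lock_held() and
WARN_ON(), using a simple nesting counter instead of real RCU.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace stand-ins, all hypothetical: a nesting counter instead of RCU. */
static int mock_rcu_nesting;
static int mock_warn_count;

static void mock_rcu_read_lock(void)      { mock_rcu_nesting++; }
static void mock_rcu_read_unlock(void)    { mock_rcu_nesting--; }
static int  mock_rcu_read_lock_held(void) { return mock_rcu_nesting > 0; }

/* Warn (and count) when cond is true, loosely modelled on WARN_ON(). */
#define MOCK_WARN_ON(cond)					\
	do {							\
		if (cond) {					\
			mock_warn_count++;			\
			fprintf(stderr, "WARN: %s\n", #cond);	\
		}						\
	} while (0)

/* Stand-in for a hot-path reader that relies on the caller's read lock. */
static unsigned long read_cost_locked(const unsigned long *table, int idx)
{
	assert(idx >= 0);
	MOCK_WARN_ON(!mock_rcu_read_lock_held());
	return table[idx];
}
```

A call made between mock_rcu_read_lock() and mock_rcu_read_unlock() stays
silent, while a call from an unlocked path bumps mock_warn_count and prints a
warning. That is the accidental-caller case mentioned above.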

2024-01-02 11:38:17

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 14/23] PM: EM: Support late CPUs booting and capacity adjustment



On 12/17/23 18:00, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
>> The patch adds needed infrastructure to handle the late CPUs boot, which
>> might change the previous CPUs capacity values. With this changes the new
>> CPUs which try to register EM will trigger the needed re-calculations for
>> other CPUs EMs. Thanks to that the em_per_state::performance values will
>> be aligned with the CPU capacity information after all CPUs finish the
>> boot and EM registrations.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> kernel/power/energy_model.c | 121 ++++++++++++++++++++++++++++++++++++
>> 1 file changed, 121 insertions(+)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index b5016afe6a19..d3fa5a77de80 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -25,6 +25,9 @@ static DEFINE_MUTEX(em_pd_mutex);
>>
>> static void em_cpufreq_update_efficiencies(struct device *dev,
>> struct em_perf_state *table);
>> +static void em_check_capacity_update(void);
>> +static void em_update_workfn(struct work_struct *work);
>> +static DECLARE_DELAYED_WORK(em_update_work, em_update_workfn);
>>
>> static bool _is_cpu_device(struct device *dev)
>> {
>> @@ -596,6 +599,10 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>>
>> unlock:
>> mutex_unlock(&em_pd_mutex);
>> +
>> + if (_is_cpu_device(dev))
>> + em_check_capacity_update();
>> +
>> return ret;
>> }
>> EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
>> @@ -631,3 +638,117 @@ void em_dev_unregister_perf_domain(struct device *dev)
>> mutex_unlock(&em_pd_mutex);
>> }
>> EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain);
>> +
>> +/*
>> + * Adjustment of CPU performance values after boot, when all CPUs capacites
>> + * are correctly calculated.
>> + */
>> +static void em_adjust_new_capacity(struct device *dev,
>> + struct em_perf_domain *pd,
>> + u64 max_cap)
>> +{
>> + struct em_perf_table __rcu *runtime_table;
>> + struct em_perf_state *table, *new_table;
>> + int ret, table_size;
>> +
>> + runtime_table = em_allocate_table(pd);
>> + if (!runtime_table) {
>> + dev_warn(dev, "EM: allocation failed\n");
>> + return;
>> + }
>> +
>> + new_table = runtime_table->state;
>> +
>> + table = em_get_table(pd);
>> + /* Initialize data based on older runtime table */
>> + table_size = sizeof(struct em_perf_state) * pd->nr_perf_states;
>> + memcpy(new_table, table, table_size);
>> +
>> + em_put_table();
>> +
>> + em_init_performance(dev, pd, new_table, pd->nr_perf_states);
>> + ret = em_compute_costs(dev, new_table, NULL, pd->nr_perf_states,
>> + pd->flags);
>> + if (ret) {
>> + em_free_table(runtime_table);
>> + return;
>> + }
>> +
>> + ret = em_dev_update_perf_domain(dev, runtime_table);
>> + if (ret)
>> + dev_warn(dev, "EM: update failed %d\n", ret);
>> +
>> + /*
>> + * This is one-time-update, so give up the ownership in this updater.
>> + * The EM fwk will keep the reference and free the memory when needed.
>> + */
>> + em_free_table(runtime_table);
>> +}
>> +
>> +static void em_check_capacity_update(void)
>> +{
>> + cpumask_var_t cpu_done_mask;
>> + struct em_perf_state *table;
>> + struct em_perf_domain *pd;
>> + unsigned long cpu_capacity;
>> + int cpu;
>> +
>> + if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
>> + pr_warn("no free memory\n");
>> + return;
>> + }
>> +
>> + /* Check if CPUs capacity has changed than update EM */
>> + for_each_possible_cpu(cpu) {
>
> Can't we instead hook into cpufreq_online/offline() to check if we need to
> do any em related update for this policy?
>

I think it would be a bit of over-engineering. We know the moment when
there is a need for this check: it's when a new EM is registered.
Also, for CPU hotplug the capacity would not always change,
which would make such code confusing. Not to mention that it would create
extra overhead for that hotplug notification chain, for no good
reason.
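
The copy-then-publish flow of the quoted em_adjust_new_capacity() can be
sketched in userspace as follows. This is a simplified model under stated
assumptions: the struct and function names are invented here, the "capacity
changed" test (comparing against the top state's performance) is a guess
because the thread does not show that part, and plain malloc()/free() stand in
for the RCU-managed table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct perf_state {
	unsigned long performance;
	unsigned long frequency;	/* kHz, ascending */
};

struct perf_domain {
	struct perf_state *table;	/* currently published table */
	int nr_states;
};

/*
 * Rebuild the table when the CPU capacity no longer matches the
 * performance of the highest state. Returns 1 if a new table was
 * published, 0 if nothing changed, -1 on allocation failure.
 */
static int refresh_if_capacity_changed(struct perf_domain *pd,
				       unsigned long cpu_capacity)
{
	struct perf_state *new_table, *old_table = pd->table;
	unsigned long fmax;
	size_t size;
	int i;

	assert(pd->nr_states > 0);
	if (old_table[pd->nr_states - 1].performance == cpu_capacity)
		return 0;

	size = sizeof(*new_table) * (size_t)pd->nr_states;
	new_table = malloc(size);
	if (!new_table)
		return -1;

	/* Start from the old data, then recompute what depends on capacity. */
	memcpy(new_table, old_table, size);
	fmax = new_table[pd->nr_states - 1].frequency;
	assert(fmax > 0);
	for (i = 0; i < pd->nr_states; i++)
		new_table[i].performance = (unsigned long)
			((unsigned long long)cpu_capacity *
			 new_table[i].frequency / fmax);

	/*
	 * Swap in the fully built table. The kernel would do this under RCU
	 * and release the old table later; this single-threaded sketch
	 * frees it right away.
	 */
	pd->table = new_table;
	free(old_table);
	return 1;
}
```

Running the check only when an EM is registered, as argued above, means this
rebuild happens a handful of times during boot rather than on every hotplug
event.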

2024-01-02 11:45:59

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division



On 12/28/23 18:06, Qais Yousef wrote:
> On 11/29/23 11:08, Lukasz Luba wrote:
>
>> @@ -220,8 +218,9 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>> return -EINVAL;
>> }
>> } else {
>> - power_res = table[i].power;
>> - cost = div64_u64(fmax * power_res, table[i].frequency);
>> + /* increase resolution of 'cost' precision */
>> + power_res = table[i].power * 10;
>
> Power is in uW, right? You're just taking advantage here that everything will
> use this new cost field so you can add as many 0s to improve resolution without
> impact elsewhere that care to compare using the same units?

This code doesn't overwrite the 'power' field value. The 'cost' value is
only used in EAS, so yes, I just want to increase resolution there.

I think you mixed up the 'power' and 'cost' fields. We don't compare 'cost'
anywhere. We just use 'cost' in one place, em_cpu_energy(), and we
multiply it (not compare it).

>
> Did you see a problem or just being extra cautious here?

There is no problem, 'cost' is a private coefficient for EAS only.

>
>> + cost = power_res / table[i].performance;
>> }
>>
>> table[i].cost = cost;
>> --
>> 2.25.1
>>

2024-01-02 12:11:49

by Lukasz Luba

Subject: Re: [PATCH v5 00/23] Introduce runtime modifiable Energy Model



On 12/28/23 18:41, Qais Yousef wrote:
> On 12/19/23 10:22, Lukasz Luba wrote:
>
>>> One thing I'm not sure about is that in practice temperature of the SoC can
>>> vary a lot in a short period of time. What is the expectation here? I can see
>>> this useful in practice only if we average it over a window of time. Following
>>> it will be really hard. Big variations can happen in few ms scales.
>>
>> It's mostly for long running heavy workloads, which involve other device
>> than CPUs, e.g. GPU or ISP (Image Signal Processor). Those devices can
>> heat up the SoC. In our game DrArm running on pixel6 the GPU uses 75-77%
>> of total power budget (starting from ~2.5W for GPU + 1.3W for all CPUs).
>> That 2.5W from the GPU is heating up the CPUs and mostly impact the Big
>> cores, which are made from High-Performance cells (thus leaking more).
>> OverUtilization in the first 4-5min of gaming is ~4-9%, so EAS can work
>> and save some power, if it has a good model. Later we have thermal
>> throttling and OU goes to ~50% but EAS still can work. If the model is
>> more precised - thus adjusted for the raising leakage due to temperature
>> increase (generated due to GPU power), than we still can use better that
>> power budget and not waist on the leakage at higher OPPs.
>
> I can understand the need. But looking at one specific case vs generalized form
> is different.
>
> So IIUC the expectation is to track temperature variations over minutes by
> external sources to CPU.

Yes

>
>>> I didn't get how the new performance field is supposed to be controlled and
>>> modified by users. A driver interface doesn't seem suitable as there's no
>>> subsystem that knows the characteristic of the workload except userspace. In
>>> Android we do have contextual info about what the current top-app to enable
>>> modifying the capacities to match its characteristics.
>>
>> Well in latest public documentation (May2023) for Cortex-X4 there are
>> described new features of Arm cores: PDP, MPMM, which can change the
>> 'performance' of the core in FW. Our SCMI kernel subsystem will get an
>> interrupt, so the drivers can know about it. It could be used for
>> recalculating the efficiency of the CPUs in the EM. When there is no
>> hotplug and the long running app is still running, that FW policy would
>> be reflected in EM. It's just not done all-in-one-step. Those patches
>> will be later.
>
> I think these features are some form of thermal throttling IIUC.
>
> I was asking for handling the EM accuracy issue using the runtime model. I was
> expecting some sysfs knobs. Do you see this also require a vendor specific
> driver to try to account for the EM inaccuracy issues we're seeing?

Yes, it needs a vendor driver. In the EM framework we don't plan to add
a sysfs interface.

>
>> Second, I have used that 'performance' field to finally get rid of
>> this runtime division in em_cpu_energy() hot path - which was annoying
>> me for very long time. It wasn't possible to optimize that last
>> operation there, because the not all CPUs boot and final CPU capacity
>> is not known when we register EMs. With this feature finally I can
>> remove that heavy operation. You can see more in that patch 15/23.
>
> Yep, it's good addition :)
>
>>>> 5. All CPUs (Little+Mid+Big) power values in mW
>>>> +------------+--------+---------------------+-------+-----------+
>>>> | channel | metric | kernel | value | perc_diff |
>>>> +------------+--------+---------------------+-------+-----------+
>>>> | CPU | gmean | EM_default | 142.1 | 0.0% |
>>>> | CPU | gmean | EM_modified_runtime | 131.8 | -7.27% |
>>>> +------------+--------+---------------------+-------+-----------+
>>>
>>> How did you modify the EM here? Did you change both performance and power
>>> fields? How did you calculate the new ones?
>>
>> It was just the power values modified on my pixel6:
>> for Littles 1.6x, Mid 0.8x, Big 1.3x of their boot power.
>> TBH I don't know the chip binning of that SoC, but I suspect it
>> could be due to this fact. More about possible error range in chip
>> binning power values you can find in my comment to the patch 22/23
>
> Strange just modifying the power had this impact. It could be related to
> similar impact I've seen with migration margin for the little increasing. By
> making the cost higher there, then it'd move the residency to other cores and
> potentially reduce running at higher freq on the littles.

Well, on Pixel6 we don't know the chip binning for the CPUs and the big L3
cache... This could be why the power values need such an
adjustment. On my OdroidXU4 (Exynos5422) I can see binning, and the max
power for some OPPs can differ by ~30%. That's too big to ignore, and I dare
to say that on Pixel6 the binning should be there too (I don't know the
variation, though).
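
To make the per-cluster adjustment quoted above (Littles 1.6x, Mid 0.8x,
Big 1.3x of the boot power) more concrete, here is a minimal sketch of the
arithmetic only. The helper name scale_power() and the sample values are
hypothetical; this is not the code used in the experiment and it does not
touch the real EM API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale each boot-time power value (in uW) by the
 * rational factor num/den, e.g. 16/10 for 1.6x, using integer math only.
 */
static void scale_power(const uint64_t *boot_uw, uint64_t *out_uw, size_t n,
			uint64_t num, uint64_t den)
{
	for (size_t i = 0; i < n; i++)
		out_uw[i] = boot_uw[i] * num / den;
}
```

In a real driver the scaled values would be written into a newly allocated
runtime table before handing it to the EM framework.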

>
>>> Did you try to simulate any heating effect during the run if you're taking
>>> temperature into account to modify the power? What was the variation like and
>>
>> Yes, I did that experiment and presented on OSPM 2023 slide 13. There is
>> big CPU power plot change in time, due to GPU heat. All detailed data is
>> there. The big CPU power is ~18-20% higher when 1-1.5W GPU is heating up
>> the whole SoC.
>
> I meant during your experiment above.

For that experiment the power is too low, and the GPU power is even lower
(~5-10mW), so there is no temperature impact.

>
>>> at what rate was the EM being updated in this case? I think Jankbench in
>>
>> In this experiment EM was only set once w/ the values mentioned above.
>> It could be due to the chip lottery. I cannot say on 100% this phone.
>>
>>> general wouldn't stress the SoC enough.
>>
>> True, this test is not power heavy as it can be seen. It's more
>> to show that the default EM after boot might not be the optimal one.
>
> I wouldn't reach that conclusion for this particular case. But the underlying
> issues exists for sure.

It's hard to say for sure what the root cause is when you don't have full
access to the SoC internals and documentation. We would need the chip
binning and some other internals.

>
>>> It'd be insightful to look at frequency residencies between the two runs and
>>> power breakdown for each cluster if you have access to them. No worries if not!
>>
>> I'm afraid you're asking for too much ;)
>
> It should be easy to get them. It's hard to know where the benefit is coming
> from otherwise. But as I said, no worries if not. If you have perfetto traces
> I can take help to take a look.

We use Mid cores more (but still at lowest OPP) instead of keeping them
in idle. I don't have perfetto traces.

>
>>> My brain started to fail me somewhere around patch 15. I'll have another look
>>> some time later in the week but generally looks good to me. If I have any
>>> worries it is about how it can be used with the provided interfaces. Especially
>>> expectations about managing fast thermal changes at the level you're targeting.
>>
>> No worries, thanks for the review! The fast thermal changes, which are
>> linked to the CPU's workload are not an issue here and I'm not worried
>> about those. The side effect of the heat from other device is the issue.
>> Thus, that thermal driver which modifies the EM should be aware of the
>> 'whole SoC' situation (like mainline IPA does, when it manages all
>> devices in a single thermal zone).
>
> I think in practice there will be challenges to generalize the thermal impact.
> But overall from EM accuracy point of view (for all the various reasons
> mentioned), we need this ability to help handle them in practice. Booting with
> a single hardcoded EM doesn't work.
>

Thanks, I agree. That's the main goal, to get rid of the single
hardcoded EM created during boot, from sometimes bogus information.

Regards,
Lukasz


2024-01-04 15:45:56

by Dietmar Eggemann

Subject: Re: [PATCH v5 11/23] PM: EM: Add API for updating the runtime modifiable EM

On 20/12/2023 09:06, Lukasz Luba wrote:
>
>
> On 12/12/23 18:50, Dietmar Eggemann wrote:
>> On 29/11/2023 12:08, Lukasz Luba wrote:

[...]

>>> +int em_dev_update_perf_domain(struct device *dev,
>>> +                  struct em_perf_table __rcu *new_table)
>>> +{
>>> +    struct em_perf_table __rcu *old_table;
>>> +    struct em_perf_domain *pd;
>>> +
>>> +    /*
>>> +     * The lock serializes update and unregister code paths. When the
>>> +     * EM has been unregistered in the meantime, we should capture that
>>> +     * when entering this critical section. It also makes sure that
>>
>> What do you want to capture here? You want to block in this moment,
>> right? Don't understand the 2. sentence here.
>>
>> [...]
>
> There is general issue with module... they can reload. A driver which
> registered EM can than later disappear. I had similar issues for the
> devfreq cooling. It can happen at any time. In this scenario let's
> consider scenario w/ 2 kernel drivers:
> 1. Main driver which registered EM, e.g. GPU driver
> 2. Thermal driver which updates that EM
> When 1. starts unload process, it has to make sure that it will
> not free the main EM 'pd', because the 2. might try to use e.g.
> 'pd->nr_perf_states' while doing update at the moment.
> Thus, this 'pd' has local mutex, to avoid issues of
> module unload vs. EM update. The EM unregister will block on
> that mutex and let the background update finish it's critical
> section.

All true but wouldn't

/* Serialize update/unregister or concurrent updates */

be sufficient as a comment here?
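
For readers who want to see the update vs. unregister serialization
described above in isolation, here is a minimal userspace sketch. It uses
a pthread mutex as a stand-in for the kernel mutex, and the struct and
function names (perf_domain, device, em_lock, update_perf_domain,
unregister_perf_domain) are hypothetical, not the ones from the patch.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects. */
struct perf_domain { int nr_perf_states; };
struct device { struct perf_domain *em_pd; };

/* Serialize update/unregister or concurrent updates. */
static pthread_mutex_t em_lock = PTHREAD_MUTEX_INITIALIZER;

/* Update path: look at the domain only after taking the lock. */
static int update_perf_domain(struct device *dev, int *nr_states_seen)
{
	int ret = 0;

	pthread_mutex_lock(&em_lock);
	if (!dev->em_pd)
		ret = -EINVAL;	/* EM was unregistered in the meantime */
	else
		*nr_states_seen = dev->em_pd->nr_perf_states;
	pthread_mutex_unlock(&em_lock);

	return ret;
}

/* Unregister path: blocks until a running update leaves its section. */
static void unregister_perf_domain(struct device *dev)
{
	pthread_mutex_lock(&em_lock);
	dev->em_pd = NULL;
	pthread_mutex_unlock(&em_lock);
}
```

The point is only the ordering: the updater never dereferences 'pd' outside
the lock, so the unregister side cannot pull it away mid-update.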


2024-01-04 16:30:24

by Dietmar Eggemann

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

On 20/12/2023 09:42, Lukasz Luba wrote:
>
>
> On 12/12/23 18:50, Dietmar Eggemann wrote:
>> On 29/11/2023 12:08, Lukasz Luba wrote:

[...]

>>> With this optimization, the em_cpu_energy() should run faster on the Big
>>> CPU by 1.43x and on the Little CPU by 1.69x.
>>
>> Where are those precise numbers are coming from? Which platform was it?
>
> That was mainline big.Little board rockpi4 b w/ rockchip 3399, present

IMHO, you should mention the platform here so people don't wonder.

> quite a few commercial devices (e.g. chromebooks or plenty other seen in
> DT). The numbers are from measuring the time it takes to run this
> function em_cpu_cost() in a loop for mln of times. Thus, the instruction
> cache and data cache should be hot, but the operation would impact the
> different score.

[...]

>> Can you not keep the existing comment and only change:
>>
>> (a) that ps->cap id ps->performance in (2) and
>>
>> (b) that:
>>
>>            *             ps->power * cpu_max_freq   cpu_util
>>            *   cpu_nrg = ------------------------ * ---------     (3)
>>            *                    ps->freq            scale_cpu
>>
>>                          <---- (old) ps->cost --->
>>
>>      is now
>>
>>                  ps->power * cpu_max_freq       1
>>      ps-> cost = ------------------------ * ----------
>>                          ps->freq            scale_cpu
>>
>>                  <---- (old) ps->cost --->
>>
>> and (c) that (4) has changed to:
>>
>>           *   pd_nrg = ps->cost * \Sum cpu_util                   (4)
>>
>> which avoid the division?
>>
>> Less changes is always much nicer since it makes it so much easier to
>> detect history and review changes.
>
> I'm open to change that, but I will have to contact you offline
> what you mean. This comment section in code is really tricky to
> handle right.

OK, the changes you showed me offline LGTM.
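
As a side note on (3) and (4) above, the effect of folding 1/scale_cpu into
ps->cost can be sketched in plain C. This is a minimal illustration, not the
kernel code: the function names are hypothetical, and it assumes that
'performance' is derived as scale_cpu * freq / max_freq and that the new
cost carries the 10x resolution factor from the patch.

```c
#include <stdint.h>

/* Old scheme: cost = power * max_freq / freq, division in the hot path. */
static uint64_t old_energy(uint64_t power, uint64_t freq, uint64_t max_freq,
			   uint64_t sum_util, uint64_t scale_cpu)
{
	uint64_t cost = max_freq * power / freq;

	return cost * sum_util / scale_cpu;
}

/* New scheme: 1/scale_cpu is folded into cost through 'performance'. */
static uint64_t new_energy(uint64_t power, uint64_t freq, uint64_t max_freq,
			   uint64_t sum_util, uint64_t scale_cpu)
{
	/* Assumption: capacity of this OPP, scaled from the max capacity. */
	uint64_t performance = scale_cpu * freq / max_freq;
	/* Precomputed once per OPP; x10 only adds resolution. */
	uint64_t cost = power * 10 / performance;

	/* Hot path is a single multiplication. */
	return cost * sum_util;
}
```

The two results differ by the 10x factor (plus rounding), which does not
matter as long as only values from the same scheme are compared.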

[...]

2024-01-04 16:54:18

by Lukasz Luba

Subject: Re: [PATCH v5 11/23] PM: EM: Add API for updating the runtime modifiable EM



On 1/4/24 15:45, Dietmar Eggemann wrote:
> On 20/12/2023 09:06, Lukasz Luba wrote:
>>
>>
>> On 12/12/23 18:50, Dietmar Eggemann wrote:
>>> On 29/11/2023 12:08, Lukasz Luba wrote:
>
> [...]
>
>>>> +int em_dev_update_perf_domain(struct device *dev,
>>>> +                  struct em_perf_table __rcu *new_table)
>>>> +{
>>>> +    struct em_perf_table __rcu *old_table;
>>>> +    struct em_perf_domain *pd;
>>>> +
>>>> +    /*
>>>> +     * The lock serializes update and unregister code paths. When the
>>>> +     * EM has been unregistered in the meantime, we should capture that
>>>> +     * when entering this critical section. It also makes sure that
>>>
>>> What do you want to capture here? You want to block in this moment,
>>> right? Don't understand the 2. sentence here.
>>>
>>> [...]
>>
>> There is general issue with module... they can reload. A driver which
>> registered EM can than later disappear. I had similar issues for the
>> devfreq cooling. It can happen at any time. In this scenario let's
>> consider scenario w/ 2 kernel drivers:
>> 1. Main driver which registered EM, e.g. GPU driver
>> 2. Thermal driver which updates that EM
>> When 1. starts unload process, it has to make sure that it will
>> not free the main EM 'pd', because the 2. might try to use e.g.
>> 'pd->nr_perf_states' while doing update at the moment.
>> Thus, this 'pd' has local mutex, to avoid issues of
>> module unload vs. EM update. The EM unregister will block on
>> that mutex and let the background update finish it's critical
>> section.
>
> All true but wouldn't
>
> /* Serialize update/unregister or concurrent updates */
>
> be sufficient as a comment here?
>

Sounds good, I'll change that.

2024-01-04 16:56:10

by Lukasz Luba

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division



On 1/4/24 16:30, Dietmar Eggemann wrote:
> On 20/12/2023 09:42, Lukasz Luba wrote:
>>
>>
>> On 12/12/23 18:50, Dietmar Eggemann wrote:
>>> On 29/11/2023 12:08, Lukasz Luba wrote:
>
> [...]
>
>>>> With this optimization, the em_cpu_energy() should run faster on the Big
>>>> CPU by 1.43x and on the Little CPU by 1.69x.
>>>
>>> Where are those precise numbers are coming from? Which platform was it?
>>
>> That was mainline big.Little board rockpi4 b w/ rockchip 3399, present
>
> IMHO, you should mention the platform here so people don't wonder.
>
>> quite a few commercial devices (e.g. chromebooks or plenty other seen in
>> DT). The numbers are from measuring the time it takes to run this
>> function em_cpu_cost() in a loop for mln of times. Thus, the instruction
>> cache and data cache should be hot, but the operation would impact the
>> different score.
>
> [...]
>
>>> Can you not keep the existing comment and only change:
>>>
>>> (a) that ps->cap id ps->performance in (2) and
>>>
>>> (b) that:
>>>
>>>            *             ps->power * cpu_max_freq   cpu_util
>>>            *   cpu_nrg = ------------------------ * ---------     (3)
>>>            *                    ps->freq            scale_cpu
>>>
>>>                          <---- (old) ps->cost --->
>>>
>>>      is now
>>>
>>>                  ps->power * cpu_max_freq       1
>>>      ps-> cost = ------------------------ * ----------
>>>                          ps->freq            scale_cpu
>>>
>>>                  <---- (old) ps->cost --->
>>>
>>> and (c) that (4) has changed to:
>>>
>>>           *   pd_nrg = ps->cost * \Sum cpu_util                   (4)
>>>
>>> which avoid the division?
>>>
>>> Less changes is always much nicer since it makes it so much easier to
>>> detect history and review changes.
>>
>> I'm open to change that, but I will have to contact you offline
>> what you mean. This comment section in code is really tricky to
>> handle right.
>
> OK, the changes you showed me offline LGTM.
>
> [...]
>

All good then. Thank you for the comments. I'll send v6.

2024-01-04 19:24:33

by Qais Yousef

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

On 01/02/24 11:47, Lukasz Luba wrote:
> > Did you see a problem or just being extra cautious here?
>
> There is no problem, 'cost' is a private coefficient for EAS only.

Let me ask differently, what goes wrong if you don't increase the resolution
here? Why is it necessary?


Cheers

--
Qais Yousef

2024-01-10 13:58:08

by Lukasz Luba

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division



On 1/4/24 19:23, Qais Yousef wrote:
> On 01/02/24 11:47, Lukasz Luba wrote:
>>> Did you see a problem or just being extra cautious here?
>>
>> There is no problem, 'cost' is a private coefficient for EAS only.
>
> Let me ask differently, what goes wrong if you don't increase the resolution
> here? Why is it necessary?
>


When you have 800mW at CPU capacity 1024, then the value is small (below
1 thousand).
Example:
power = 800000 uW
cost = 800000 / 1024 = 781

I know from the past that OPPs sometimes have close voltage
values, so rounding could occur and make some OPPs look inefficient
when they aren't.

This is what would happen when we have the 1x resolution:
/sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
/sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
/sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
/sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
/sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
/sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
/sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
The bottom 3 OPPs have the same 'cost', thus 2 OPPs are marked inefficient,
which is not true (see below).

This is what would happen when we have the 10x resolution:
/sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
/sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
/sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
/sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
/sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
/sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
/sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
Here the OPP with 600MHz is more efficient than 408MHz,
which is true. So only 408MHz will be marked as an inefficient OPP.


This is what would happen when we have the 100x resolution:
/sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
/sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
/sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
/sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
/sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
/sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
/sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
The higher (100x) resolution does not bring that much in
practice.
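
The rounding effect in the three dumps above can be reproduced with a small
standalone sketch. The power and performance numbers below are hypothetical
(picked so that the integer division gives the same 'cost' pattern as the
dumps; they are not the real values of that board), and the marking rule is
an assumption: an OPP counts as inefficient when its cost is not lower than
the lowest cost seen at a higher frequency.

```c
#include <stddef.h>
#include <stdint.h>

struct opp {
	uint64_t freq_khz;
	uint64_t power_uw;	/* hypothetical */
	uint64_t performance;	/* hypothetical: ~1024 * freq / 1512000 */
};

/* Three lowest OPPs in ascending frequency order. */
static const struct opp opps[] = {
	{ 408000, 136195, 276 },
	{ 600000, 200284, 406 },
	{ 816000, 272390, 552 },
};

/* cost = power * scale / performance, with integer truncation. */
static uint64_t cost_at(size_t i, uint64_t scale)
{
	return opps[i].power_uw * scale / opps[i].performance;
}

/* Walk from the highest to the lowest OPP and count inefficient ones. */
static int count_inefficient(uint64_t scale)
{
	uint64_t prev_cost = UINT64_MAX;
	int inefficient = 0;

	for (size_t i = sizeof(opps) / sizeof(opps[0]); i-- > 0; ) {
		uint64_t cost = cost_at(i, scale);

		if (cost >= prev_cost)
			inefficient++;
		else
			prev_cost = cost;
	}

	return inefficient;
}
```

With scale 1 all three costs truncate to 493 and two OPPs get flagged; with
scale 10 or 100 the 600MHz cost drops just below its neighbours and only
408MHz is flagged.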

If you have other questions, let's continue on v6 series.

2024-01-15 12:22:18

by Qais Yousef

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

On 01/10/24 13:53, Lukasz Luba wrote:
>
>
> On 1/4/24 19:23, Qais Yousef wrote:
> > On 01/02/24 11:47, Lukasz Luba wrote:
> > > > Did you see a problem or just being extra cautious here?
> > >
> > > There is no problem, 'cost' is a private coefficient for EAS only.
> >
> > Let me ask differently, what goes wrong if you don't increase the resolution
> > here? Why is it necessary?
> >
>
>
> When you have 800mW at CPU capacity 1024, then the value is small (below
> 1 thousand).
> Example:
> power = 800000 uW
> cost = 800000 / 1024 = 781
>
> While I know from past that sometimes OPPs might have close voltage
> values and a rounding could occur and make some OPPs inefficient
> while they aren't.
>
> This is what would happen when we have the 1x resolution:
> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
> The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
> which is not true (see below).
>
> This is what would happen when we have the 10x resolution:
> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
> Here the OPP with 600MHz is more efficient than 408MHz,
> which is true. So only 408MHz will be marked as in-efficient OPP.
>
>
> This is what would happen when we have the 100x resolution:
> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
> The higher (100x) resolution does not bring that much in
> practice.

So it seems uW is not sufficient. We already moved from mW because of
resolution. Shall we make it nW then and always multiply by 1000? The
choice of 10 looks arbitrary IMHO.

>
> If you have other questions, let's continue on v6 series.

2024-01-15 12:35:18

by Lukasz Luba

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division



On 1/15/24 12:21, Qais Yousef wrote:
> On 01/10/24 13:53, Lukasz Luba wrote:
>>
>>
>> On 1/4/24 19:23, Qais Yousef wrote:
>>> On 01/02/24 11:47, Lukasz Luba wrote:
>>>>> Did you see a problem or just being extra cautious here?
>>>>
>>>> There is no problem, 'cost' is a private coefficient for EAS only.
>>>
>>> Let me ask differently, what goes wrong if you don't increase the resolution
>>> here? Why is it necessary?
>>>
>>
>>
>> When you have 800mW at CPU capacity 1024, then the value is small (below
>> 1 thousand).
>> Example:
>> power = 800000 uW
>> cost = 800000 / 1024 = 781
>>
>> While I know from past that sometimes OPPs might have close voltage
>> values and a rounding could occur and make some OPPs inefficient
>> while they aren't.
>>
>> This is what would happen when we have the 1x resolution:
>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
>> The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
>> which is not true (see below).
>>
>> This is what would happen when we have the 10x resolution:
>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
>> Here the OPP with 600MHz is more efficient than 408MHz,
>> which is true. So only 408MHz will be marked as in-efficient OPP.
>>
>>
>> This is what would happen when we have the 100x resolution:
>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
>> The higher (100x) resolution does not bring that much in
>> practice.
>
> So it seems a uW is not sufficient enough. We moved from mW because of
> resolution already. Shall we make it nW then and multiply by 1000 always? The
> choice of 10 looks arbitrary IMHO
>

No, there is no need for nW in the 'power' field for this.
You've missed the point.

2024-01-16 13:10:53

by Qais Yousef

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

On 01/15/24 12:36, Lukasz Luba wrote:
>
>
> On 1/15/24 12:21, Qais Yousef wrote:
> > On 01/10/24 13:53, Lukasz Luba wrote:
> > >
> > >
> > > On 1/4/24 19:23, Qais Yousef wrote:
> > > > On 01/02/24 11:47, Lukasz Luba wrote:
> > > > > > Did you see a problem or just being extra cautious here?
> > > > >
> > > > > There is no problem, 'cost' is a private coefficient for EAS only.
> > > >
> > > > Let me ask differently, what goes wrong if you don't increase the resolution
> > > > here? Why is it necessary?
> > > >
> > >
> > >
> > > When you have 800mW at CPU capacity 1024, then the value is small (below
> > > 1 thousand).
> > > Example:
> > > power = 800000 uW
> > > cost = 800000 / 1024 = 781
> > >
> > > While I know from past that sometimes OPPs might have close voltage
> > > values and a rounding could occur and make some OPPs inefficient
> > > while they aren't.
> > >
> > > This is what would happen when we have the 1x resolution:
> > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
> > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
> > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
> > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
> > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
> > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
> > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
> > > The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
> > > which is not true (see below).
> > >
> > > This is what would happen when we have the 10x resolution:
> > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
> > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
> > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
> > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
> > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
> > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
> > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
> > > Here the OPP with 600MHz is more efficient than 408MHz,
> > > which is true. So only 408MHz will be marked as in-efficient OPP.
> > >
> > >
> > > This is what would happen when we have the 100x resolution:
> > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
> > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
> > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
> > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
> > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
> > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
> > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
> > > The higher (100x) resolution does not bring that much in
> > > practice.
> >
> > So it seems a uW is not sufficient enough. We moved from mW because of
> > resolution already. Shall we make it nW then and multiply by 1000 always? The
> > choice of 10 looks arbitrary IMHO
> >
>
> No, there is no need of nW in the 'power' field for this.
> You've missed the point.

I think you're missing what I am saying. The multiplication by 10 looks like
a magic value to increase resolution based on a single observation you noticed.

The feedback I am giving is that this needs to be better explained, in
a comment. And instead of multiplying by 10 multiply by 1000. Saying this is
enough based on a single observation is not adequate IMO.

Also the difference is tiny. Could you actually measure a difference in overall
power with and without this extra decimal point resolution? It might be better
to run at 816MHz and go back to idle faster. So the trade-off is not clear cut
to me.

So generally I am not keen on magic values based on single observations.
I think removing this or using 1000 is better.

AFAICT you decided that 0.1uW is worth caring about. But 0.19uW difference
isn't.

I can't see how much difference this makes in practice tbh. But using more
uniform conversion so that the cost is in nW (keep the power field in uW) makes
more sense at least.

It still raises the question whether this minuscule cost difference is actually
better taken into account. I think the perf/watt for 816MHz is much better so
skipping 600MHz as inefficient looks better to me.

2024-01-16 15:33:42

by Lukasz Luba

Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division



On 1/16/24 13:10, Qais Yousef wrote:
> On 01/15/24 12:36, Lukasz Luba wrote:
>>
>>
>> On 1/15/24 12:21, Qais Yousef wrote:
>>> On 01/10/24 13:53, Lukasz Luba wrote:
>>>>
>>>>
>>>> On 1/4/24 19:23, Qais Yousef wrote:
>>>>> On 01/02/24 11:47, Lukasz Luba wrote:
>>>>>>> Did you see a problem or just being extra cautious here?
>>>>>>
>>>>>> There is no problem, 'cost' is a private coefficient for EAS only.
>>>>>
>>>>> Let me ask differently, what goes wrong if you don't increase the resolution
>>>>> here? Why is it necessary?
>>>>>
>>>>
>>>>
>>>> When you have 800mW at CPU capacity 1024, then the value is small (below
>>>> 1 thousand).
>>>> Example:
>>>> power = 800000 uW
>>>> cost = 800000 / 1024 = 781
>>>>
>>>> While I know from past that sometimes OPPs might have close voltage
>>>> values and a rounding could occur and make some OPPs inefficient
>>>> while they aren't.
>>>>
>>>> This is what would happen when we have the 1x resolution:
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
>>>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
>>>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
>>>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
>>>> The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
>>>> which is not true (see below).
>>>>
>>>> This is what would happen when we have the 10x resolution:
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
>>>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
>>>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
>>>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
>>>> Here the OPP with 600MHz is more efficient than 408MHz,
>>>> which is true. So only 408MHz will be marked as in-efficient OPP.
>>>>
>>>>
>>>> This is what would happen when we have the 100x resolution:
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
>>>> /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
>>>> /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
>>>> /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
>>>> /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
>>>> The higher (100x) resolution does not bring that much in
>>>> practice.
>>>
>>> So it seems a uW is not sufficient enough. We moved from mW because of
>>> resolution already. Shall we make it nW then and multiply by 1000 always? The
>>> choice of 10 looks arbitrary IMHO
>>>
>>
>> No, there is no need of nW in the 'power' field for this.
>> You've missed the point.
>
> I think you're missing what I am saying. The multiplication by 10 looks like
> magic value to increase resolution based on a single observation you noticed.
>
> The feedback I am giving is that this needs to be better explained, in
> a comment. And instead of multiplying by 10 multiply by 1000. Saying this is
> enough based on a single observation is not adequate IMO.

I think you are trying to review something for which you don't have the
full details and previous history. I have been fighting with those rounding
issues in the past and there are commits describing those issues.
You haven't analyzed all the edge cases; one more is below (about your
proposal with 1000x, i.e. nW).

>
> Also the difference is tiny. Could you actually measure a difference in overall
> power with and without this extra decimal point resolution? It might be better

Yes, I had such power measurements, but for older rounding issues. Take
into account that the EM model is reflecting one CPU, but in reality we
often have 4 CPUs linked together in one frequency domain. Thus, a small
energy difference is actually multiplied.

> to run at 816MHz and go back to idle faster. So the trade-off is not clear cut
> to me.

This $subject is not the place to discuss other possible designs which set
such trade-offs differently. Please don't mix many topics. A "race to idle"
from OPPs which have a slightly higher voltage is a totally different topic,
currently not in the EAS design at all. Otherwise we end up with a heuristic
issue like: how much more 'inefficient' does an OPP have to be to skip it?
Currently we are strict in 'inefficient' OPP tagging.

>
> So generally I am not keen on magic values based on single observations.
> I think removing this or use 1000 is better.

That is your opinion. I've tried to explain to you:
1) why we cannot remove it and why we need the 10x
2) why we don't need more than 10x

>
> AFAICT you decided that 0.1uW is worth caring about. But 0.19uW difference
> isn't.

It's not strictly related to the power value, but to the earlier division
operation that we perform at setup time and not at runtime (with the
arguments in a different order in the math involved). That operation cuts some
important information from the integer value (as listed above in the
dumps of 'cost' values for the different configurations).
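
A minimal sketch of that truncation, not the kernel's actual code: the helper
names are hypothetical, and it assumes the 'cost' is simply the power (scaled
by a resolution factor) divided by the capacity. It only uses numbers quoted
above in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper: integer cost per unit of capacity.
 * 'res' is the resolution multiplier discussed in this thread (1, 10, ...).
 */
static uint64_t cost_sketch(uint64_t power_uw, uint64_t capacity, uint64_t res)
{
    return (power_uw * res) / capacity;   /* integer division truncates */
}

/*
 * Hypothetical helper: returns non-zero if two 10x-resolution costs are
 * still different after the last decimal digit is dropped (1x resolution).
 */
static int distinct_at_1x(uint64_t cost_a_10x, uint64_t cost_b_10x)
{
    return (cost_a_10x / 10) != (cost_b_10x / 10);
}
```

With 800000 uW at capacity 1024, cost_sketch() gives 781 at 1x and 7812 at
10x. distinct_at_1x(4934, 4933) returns 0: both 10x values from the dump
become 493 once the last digit is dropped, which matches the collapse seen in
the 1x dump.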

>
> I can't see how much difference this makes in practice tbh. But using more
> uniform conversion so that the cost is in nW (keep the power field in uW) makes
> more sense at least.

This is the edge case which I mentioned at the beginning, where you're
missing some background. Your proposal is to have 1000x resolution, i.e.
power in nano-Watts for the 'cost'. Let's consider an example power of 1.4 W
on a single CPU at a mid-to-high frequency OPP (700 capacity), running on a
32-bit kernel, so unsigned long has 32 bits.

power = 1.4W = 1400000000nW

cost = 1400000000 / 700 = 2000000 (2mln)

Then in EAS we can have this simulation:
4 CPUs with util 550 voting for this OPP (700 capacity),
so the em_cpu_energy() would perform:

return cost * sum_util

2000000 * (4 * 550) = 4400000000 <--- overflow on 32bit ulong

That's why I said you haven't considered your proposal fully.
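
The wrap-around can be reproduced in user space with a small sketch. The
function names are hypothetical and uint32_t only stands in for a 32-bit
unsigned long; this is not em_cpu_energy() itself, just the multiplication
from the example above.

```c
#include <stdint.h>

/* Hypothetical stand-in for 'cost * sum_util' with a 32-bit unsigned long. */
static uint32_t energy_32bit(uint32_t cost, uint32_t sum_util)
{
    return cost * sum_util;              /* wraps modulo 2^32 */
}

/* The same product computed in 64 bits, for comparison. */
static uint64_t energy_64bit(uint32_t cost, uint32_t sum_util)
{
    return (uint64_t)cost * sum_util;    /* widen before multiplying */
}
```

For cost = 2000000 and sum_util = 4 * 550 = 2200, energy_64bit() returns
4400000000, while energy_32bit() returns 105032704 (4400000000 - 2^32). With
the 10x resolution the same case gives cost = 14000000 / 700 = 20000 and
20000 * 2200 = 44000000, which fits in 32 bits.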

>
> It still raises the question whether this minuscule cost difference is actually
> better taken into account. I think the perf/watt for 816MHz is much better so
> skipping 600MHz as inefficient looks better to me.
>

This is exactly the place where we disagree. You "think the perf/watt
for 816MHz is much better so skipping 600MHz as inefficient looks
better". For me, the numbers from the 3 different configuration dumps are
telling me exactly the opposite. I will base the algorithms on the numbers
and not on a heuristic that I think looks better.

I'm going to send v7. Please end this discussion on v5.

Regards,
Lukasz

2024-01-16 19:34:10

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v5 15/23] PM: EM: Optimize em_cpu_energy() and remove division

On 01/16/24 15:34, Lukasz Luba wrote:
>
>
> On 1/16/24 13:10, Qais Yousef wrote:
> > On 01/15/24 12:36, Lukasz Luba wrote:
> > >
> > >
> > > On 1/15/24 12:21, Qais Yousef wrote:
> > > > On 01/10/24 13:53, Lukasz Luba wrote:
> > > > >
> > > > >
> > > > > On 1/4/24 19:23, Qais Yousef wrote:
> > > > > > On 01/02/24 11:47, Lukasz Luba wrote:
> > > > > > > > Did you see a problem or just being extra cautious here?
> > > > > > >
> > > > > > > There is no problem, 'cost' is a private coefficient for EAS only.
> > > > > >
> > > > > > Let me ask differently, what goes wrong if you don't increase the resolution
> > > > > > here? Why is it necessary?
> > > > > >
> > > > >
> > > > >
> > > > > When you have 800mW at CPU capacity 1024, then the value is small (below
> > > > > 1 thousand).
> > > > > Example:
> > > > > power = 800000 uW
> > > > > cost = 800000 / 1024 = 781
> > > > >
> > > > > While I know from past that sometimes OPPs might have close voltage
> > > > > values and a rounding could occur and make some OPPs inefficient
> > > > > while they aren't.
> > > > >
> > > > > This is what would happen when we have the 1x resolution:
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:551
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:644
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:744
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:851
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:493
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:493
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:493
> > > > > The bottom 3 OPPs have the same 'cost' thus 2 OPPs are in-efficient,
> > > > > which is not true (see below).
> > > > >
> > > > > This is what would happen when we have the 10x resolution:
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:5513
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:6443
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:7447
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:8514
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:4934
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:4933
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:4934
> > > > > Here the OPP with 600MHz is more efficient than 408MHz,
> > > > > which is true. So only 408MHz will be marked as in-efficient OPP.
> > > > >
> > > > >
> > > > > This is what would happen when we have the 100x resolution:
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1008000/cost:55137
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1200000/cost:64433
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1416000/cost:74473
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:1512000/cost:85140
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:408000/cost:49346
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:600000/cost:49331
> > > > > /sys/kernel/debug/energy_model/cpu4/ps:816000/cost:49346
> > > > > The higher (100x) resolution does not bring that much in
> > > > > practice.
> > > >
> > > > So it seems a uW is not sufficient enough. We moved from mW because of
> > > > resolution already. Shall we make it nW then and multiply by 1000 always? The
> > > > choice of 10 looks arbitrary IMHO
> > > >
> > >
> > > No, there is no need of nW in the 'power' field for this.
> > > You've missed the point.
> >
> > I think you're missing what I am saying. The multiplication by 10 looks like
> > magic value to increase resolution based on a single observation you noticed.
> >
> > The feedback I am giving is that this needs to be better explained, in
> > a comment. And instead of multiplying by 10 multiply by 1000. Saying this is
> > enough based on a single observation is not adequate IMO.
>
> I think you are trying to review something which you don't have full
> details and previous history. I have been fighting with those rounding

I don't think so..

> issues in past and there are commits with description of issues.
> You haven't analyzed all edge cases, one more is below (about your
> proposal with 1000x the nW).
>
> >
> > Also the difference is tiny. Could you actually measure a difference in overall
> > power with and without this extra decimal point resolution? It might be better
>
> Yes, I had such power measurements, but for older rounding issues. Take

so not against this series..

> into account that the EM model is reflecting one CPU, but in reality we
> often have 4 CPUs linked together in one frequency domain. Thus, a small
> energy difference is actually multiplied.
>
> > to run at 816MHz and go back to idle faster. So the trade-off is not clear cut
> > to me.
>
> It's not the $subject to discuss other possible design which set such
> trade-offs differently. Please don't mix many topics. A "race to idle"

I am not mixing topics. I am questioning the claim about this addition of
resolution which looked random to me.

> from OPPs which have a bit higher voltage is totally different topic,
> currently not in EAS design at all. Otherwise we end up in a heuristic
> issue like: how much more 'inefficient' it has to be to skip it.
> Currently we are strict in 'inefficient' OPP tagging.

Then shouldn't the resolution part of this patch be split into its own
patch submission?

>
> >
> > So generally I am not keen on magic values based on single observations.
> > I think removing this or use 1000 is better.
>
> That is your opinion. I've tried to explain to you:
> 1) why we cannot remove it and why we need the 10x
> 2) why we don't need more than 10x
>
> >
> > AFAICT you decided that 0.1uW is worth caring about. But 0.19uW difference
> > isn't.
>
> It's not strictly related to power value, but the earlier division
> operation that we perform in setup time and not in runtime (in different
> order on the arguments in the math involved). That operation cuts some
> important information from the integer value (as listed above in those
> different configurations' dumps of 'cost' values).

-ENOPARSE. From what I see the cost has different resolution.

>
> >
> > I can't see how much difference this makes in practice tbh. But using more
> > uniform conversion so that the cost is in nW (keep the power field in uW) makes
> > more sense at least.
>
> This is the edge case which I've mentioned at the beginning that you're
> missing some background. Your proposal is to have 1000x resolution so in
> nano-Watts power for the 'cost'. Let's consider example power of 1.4Watt
> on single CPU at mid-high-freq OPP (700 capacity), running on 32bit
> kernel, so unsigned long has 32bit.
>
> power = 1.4W = 1400000000nW
>
> cost = 1400000000 / 700 = 2000000 (2mln)
>
> Then in EAS we can have this simulation:
> 4 CPUs with util 550 voting for this OPP (700 capacity),
> so the em_cpu_energy() would perform:
>
> return cost * sum_util
>
> 2000000 * (4 * 550) = 4400000000 <--- overflow on 32bit ulong
>
> That's why I said you haven't considered your proposal fully.

Overflow was in mind, I didn't feel it was necessary to elaborate more..
Overflow issues can be handled.

>
> >
> > It still raises the question whether this minuscule cost difference is actually
> > better taken into account. I think the perf/watt for 816MHz is much better so
> > skipping 600MHz as inefficient looks better to me.
> >
>
> This is exactly the place where we disagree. You "think the perf/watt
> for 816MHz is much better so skipping 600MHz as inefficient looks
> better". For me, the numbers from 3 different configuration dumps are
> telling me exactly the opposite. I will base the algorithms on the numbers
> and not on a heuristic that I think looks better.
>
> I'm going to send v7. Please end this discussion on v5.

This thread is the context of the discussion..

It seems you don't want the feedback. I don't think there's mixing of topics.
But decisions were made and I don't see a proper explanation for them. Hence the
questions and probing proposals in an attempt to understand more. We do have to
cover a wide range of cases in general, and enforcing such random numbers has
been a problem in practice as there are no sensible defaults. And I am not seeing
that this is a good generalization from what I am reading. It is similar to
util_threshold, which has caused power regressions, and to migration margin and
dvfs headroom, which are causing problems too. I see this as another random
number added. The numbers you're referring to are very limited in scope; the
issue here is not numbers vs. a heuristic that looks better. It seems you don't
want to consider the perf/watt impact rather than the pure cost of a single OPP
- which was the true intent behind the question. I might have misunderstood
something, but if you explained this in this reply, I certainly would have lost
it with this constant "stop discussing and move to v6 or v7".

Anyway. I'll leave it here.