2022-06-22 15:05:18

by Lukasz Luba

Subject: [PATCH 0/4] Energy Model power in micro-Watts and SCMI v3.1 alignment

Hi all,

This is a patch set which changes Energy Model power values scale to
micro-Watts. It also upgrades the SCMI performance layer + scmi-cpufreq
driver to leverage the SCMI v3.1 spec and process micro-Watts power values
coming from FW. The higher precision in EM power field solves an issue
of a rounding error, which then can be misinterpreted as 'inefficient OPP'.
An example rounding issue calculation is present in patch 1/4 description.

Regards,
Lukasz Luba

Lukasz Luba (4):
PM: EM: convert power field to micro-Watts precision and align drivers
Documentation: EM: Switch to micro-Watts scale
firmware: arm_scmi: Get detailed power scale from perf
cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1

Documentation/power/energy-model.rst | 14 +++---
drivers/cpufreq/mediatek-cpufreq-hw.c | 7 +--
drivers/cpufreq/scmi-cpufreq.c | 15 ++++++-
drivers/firmware/arm_scmi/perf.c | 18 +++++---
drivers/opp/of.c | 15 ++++---
drivers/powercap/dtpm_cpu.c | 5 +--
drivers/thermal/cpufreq_cooling.c | 13 +++++-
drivers/thermal/devfreq_cooling.c | 19 ++++++--
include/linux/energy_model.h | 63 ++++++++++++++++++++-------
include/linux/scmi_protocol.h | 8 +++-
kernel/power/energy_model.c | 31 ++++++++-----
11 files changed, 146 insertions(+), 62 deletions(-)

--
2.17.1


2022-06-22 15:05:24

by Lukasz Luba

Subject: [PATCH 1/4] PM: EM: convert power field to micro-Watts precision and align drivers

The milli-Watts precision causes rounding errors while calculating
efficiency cost for each OPP. This is especially visible in the 'simple'
Energy Model (EM), where the power for each OPP is provided from OPP
framework. As a result, some OPPs can be marked inefficient even though
they would not be with micro-Watts precision.

Update all EM users which access 'power' field and assume the value is
in milli-Watts.

Also solve an issue with a potential overflow in the energy estimation
calculation on 32bit machines. The fix is needed now because the power
value (and thus the 'cost' as well) is higher.
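
A minimal standalone sketch of that overflow, assuming uint32_t stands in
for a 32bit 'unsigned long'. The helper names are hypothetical and are not
part of the patch; the numbers (a 'cost' of 1.3 mln, 4 CPUs giving a
'sum_util' of 4096, 'scale_cpu' of 1024) are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Multiply first: best precision, but the 32-bit product can wrap. */
static uint32_t energy_mul_first(uint32_t cost, uint32_t sum_util,
				 uint32_t scale_cpu)
{
	assert(scale_cpu != 0);
	return (uint32_t)(cost * sum_util) / scale_cpu;
}

/* Divide first: no wrap for these inputs, at the price of some precision. */
static uint32_t energy_div_first(uint32_t cost, uint32_t sum_util,
				 uint32_t scale_cpu)
{
	assert(scale_cpu != 0);
	return cost / scale_cpu * sum_util;
}

/* 64-bit reference result, as a 64bit machine would compute it. */
static uint64_t energy_exact(uint32_t cost, uint32_t sum_util,
			     uint32_t scale_cpu)
{
	assert(scale_cpu != 0);
	return (uint64_t)cost * sum_util / scale_cpu;
}
```

With cost = 1300000, sum_util = 4096 and scale_cpu = 1024, the 64-bit
reference gives 5200000, the divide-first variant gives 5197824 (a small
precision loss), while the multiply-first variant wraps to 1005696.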

Example calculation which shows the rounding error and impact:

power = 'dyn-power-coeff' * volt_mV * volt_mV * freq_MHz

power_a_uW = (100 * 600mV * 600mV * 500MHz) / 10^6 = 18000
power_a_mW = (100 * 600mV * 600mV * 500MHz) / 10^9 = 18

power_b_uW = (100 * 605mV * 605mV * 600MHz) / 10^6 = 21961
power_b_mW = (100 * 605mV * 605mV * 600MHz) / 10^9 = 21

max_freq = 2000MHz

cost_a_mW = 18 * 2000MHz/500MHz = 72
cost_a_uW = 18000 * 2000MHz/500MHz = 72000

cost_b_mW = 21 * 2000MHz/600MHz = 70 // <- artificially better
cost_b_uW = 21961 * 2000MHz/600MHz = 73203

The 'cost_b_mW' (which is based on the old milli-Watts) is misleadingly
better than the 'cost_b_uW' (this patch uses micro-Watts), which
would affect the 'inefficient OPPs' information in the Cpufreq
framework. This patch set removes the rounding issue.
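
The arithmetic above can be reproduced with a small standalone C sketch.
The helper names are hypothetical and not taken from the kernel. The
inefficiency check is a simplified assumption: a lower-frequency OPP is
treated as inefficient when a higher-frequency OPP has a cost that is not
larger.

```c
#include <assert.h>
#include <stdint.h>

/*
 * P = coeff * V^2 * f, with V in mV and f in MHz. The divisor selects the
 * unit: 10^6 gives micro-Watts, 10^9 gives milli-Watts. Integer division
 * truncates, which is where the milli-Watts precision is lost.
 */
static uint64_t opp_power(uint64_t coeff, uint64_t volt_mv,
			  uint64_t freq_mhz, uint64_t divisor)
{
	assert(divisor != 0);
	return coeff * volt_mv * volt_mv * freq_mhz / divisor;
}

/* cost = power * max_freq / freq */
static uint64_t opp_cost(uint64_t power, uint64_t max_freq_mhz,
			 uint64_t freq_mhz)
{
	assert(freq_mhz != 0);
	return power * max_freq_mhz / freq_mhz;
}

/* Assumed rule: the lower OPP loses if the higher OPP is not more costly. */
static int lower_opp_looks_inefficient(uint64_t cost_low, uint64_t cost_high)
{
	return cost_high <= cost_low;
}
```

Feeding in the values from the example, the milli-Watts costs (72 for OPP
'a', 70 for OPP 'b') make OPP 'a' look inefficient, while the micro-Watts
costs (72000 and 73203) do not.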

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/cpufreq/mediatek-cpufreq-hw.c | 7 +--
drivers/cpufreq/scmi-cpufreq.c | 6 +++
drivers/opp/of.c | 15 ++++---
drivers/powercap/dtpm_cpu.c | 5 +--
drivers/thermal/cpufreq_cooling.c | 13 +++++-
drivers/thermal/devfreq_cooling.c | 19 ++++++--
include/linux/energy_model.h | 63 ++++++++++++++++++++-------
kernel/power/energy_model.c | 31 ++++++++-----
8 files changed, 114 insertions(+), 45 deletions(-)

diff --git a/drivers/cpufreq/mediatek-cpufreq-hw.c b/drivers/cpufreq/mediatek-cpufreq-hw.c
index 813cccbfe934..f0e0a35c7f21 100644
--- a/drivers/cpufreq/mediatek-cpufreq-hw.c
+++ b/drivers/cpufreq/mediatek-cpufreq-hw.c
@@ -51,7 +51,7 @@ static const u16 cpufreq_mtk_offsets[REG_ARRAY_SIZE] = {
};

static int __maybe_unused
-mtk_cpufreq_get_cpu_power(struct device *cpu_dev, unsigned long *mW,
+mtk_cpufreq_get_cpu_power(struct device *cpu_dev, unsigned long *uW,
unsigned long *KHz)
{
struct mtk_cpufreq_data *data;
@@ -71,8 +71,9 @@ mtk_cpufreq_get_cpu_power(struct device *cpu_dev, unsigned long *mW,
i--;

*KHz = data->table[i].frequency;
- *mW = readl_relaxed(data->reg_bases[REG_EM_POWER_TBL] +
- i * LUT_ROW_SIZE) / 1000;
+ /* Provide micro-Watts value to the Energy Model */
+ *uW = readl_relaxed(data->reg_bases[REG_EM_POWER_TBL] +
+ i * LUT_ROW_SIZE);

return 0;
}
diff --git a/drivers/cpufreq/scmi-cpufreq.c b/drivers/cpufreq/scmi-cpufreq.c
index 6d2a4cf46db7..bfd35583d653 100644
--- a/drivers/cpufreq/scmi-cpufreq.c
+++ b/drivers/cpufreq/scmi-cpufreq.c
@@ -19,6 +19,7 @@
#include <linux/slab.h>
#include <linux/scmi_protocol.h>
#include <linux/types.h>
+#include <linux/units.h>

struct scmi_data {
int domain_id;
@@ -99,6 +100,7 @@ static int __maybe_unused
scmi_get_cpu_power(struct device *cpu_dev, unsigned long *power,
unsigned long *KHz)
{
+ bool power_scale_mw = perf_ops->power_scale_mw_get(ph);
unsigned long Hz;
int ret, domain;

@@ -112,6 +114,10 @@ scmi_get_cpu_power(struct device *cpu_dev, unsigned long *power,
if (ret)
return ret;

+ /* Provide bigger resolution power to the Energy Model */
+ if (power_scale_mw)
+ *power *= MICROWATT_PER_MILLIWATT;
+
/* The EM framework specifies the frequency in KHz. */
*KHz = Hz / 1000;

diff --git a/drivers/opp/of.c b/drivers/opp/of.c
index ea8fc9e1f7e3..74c33cebeb29 100644
--- a/drivers/opp/of.c
+++ b/drivers/opp/of.c
@@ -1482,12 +1482,12 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_of_node);
* It provides the power used by @dev at @kHz if it is the frequency of an
* existing OPP, or at the frequency of the first OPP above @kHz otherwise
* (see dev_pm_opp_find_freq_ceil()). This function updates @kHz to the ceiled
- * frequency and @mW to the associated power.
+ * frequency and @uW to the associated power.
*
* Returns 0 on success or a proper -EINVAL value in case of error.
*/
static int __maybe_unused
-_get_dt_power(struct device *dev, unsigned long *mW, unsigned long *kHz)
+_get_dt_power(struct device *dev, unsigned long *uW, unsigned long *kHz)
{
struct dev_pm_opp *opp;
unsigned long opp_freq, opp_power;
@@ -1504,7 +1504,7 @@ _get_dt_power(struct device *dev, unsigned long *mW, unsigned long *kHz)
return -EINVAL;

*kHz = opp_freq / 1000;
- *mW = opp_power / 1000;
+ *uW = opp_power;

return 0;
}
@@ -1514,14 +1514,14 @@ _get_dt_power(struct device *dev, unsigned long *mW, unsigned long *kHz)
* This computes the power estimated by @dev at @kHz if it is the frequency
* of an existing OPP, or at the frequency of the first OPP above @kHz otherwise
* (see dev_pm_opp_find_freq_ceil()). This function updates @kHz to the ceiled
- * frequency and @mW to the associated power. The power is estimated as
+ * frequency and @uW to the associated power. The power is estimated as
* P = C * V^2 * f with C being the device's capacitance and V and f
* respectively the voltage and frequency of the OPP.
*
* Returns -EINVAL if the power calculation failed because of missing
* parameters, 0 otherwise.
*/
-static int __maybe_unused _get_power(struct device *dev, unsigned long *mW,
+static int __maybe_unused _get_power(struct device *dev, unsigned long *uW,
unsigned long *kHz)
{
struct dev_pm_opp *opp;
@@ -1551,9 +1551,10 @@ static int __maybe_unused _get_power(struct device *dev, unsigned long *mW,
return -EINVAL;

tmp = (u64)cap * mV * mV * (Hz / 1000000);
- do_div(tmp, 1000000000);
+ /* Provide power in micro-Watts */
+ do_div(tmp, 1000000);

- *mW = (unsigned long)tmp;
+ *uW = (unsigned long)tmp;
*kHz = Hz / 1000;

return 0;
diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index f5eced0842b3..61c5ff80bd30 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -53,7 +53,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)

for (i = 0; i < pd->nr_perf_states; i++) {

- power = pd->table[i].power * MICROWATT_PER_MILLIWATT * nr_cpus;
+ power = pd->table[i].power * nr_cpus;

if (power > power_limit)
break;
@@ -63,8 +63,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)

freq_qos_update_request(&dtpm_cpu->qos_req, freq);

- power_limit = pd->table[i - 1].power *
- MICROWATT_PER_MILLIWATT * nr_cpus;
+ power_limit = pd->table[i - 1].power * nr_cpus;

return power_limit;
}
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index b8151d95a806..dc19e7c80751 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -21,6 +21,7 @@
#include <linux/pm_qos.h>
#include <linux/slab.h>
#include <linux/thermal.h>
+#include <linux/units.h>

#include <trace/events/thermal.h>

@@ -101,6 +102,7 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
u32 freq)
{
+ unsigned long power_mw;
int i;

for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
@@ -108,16 +110,23 @@ static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
break;
}

- return cpufreq_cdev->em->table[i + 1].power;
+ power_mw = cpufreq_cdev->em->table[i + 1].power;
+ power_mw /= MICROWATT_PER_MILLIWATT;
+
+ return power_mw;
}

static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
u32 power)
{
+ unsigned long em_power_mw;
int i;

for (i = cpufreq_cdev->max_level; i > 0; i--) {
- if (power >= cpufreq_cdev->em->table[i].power)
+ /* Convert EM power to milli-Watts to make safe comparison */
+ em_power_mw = cpufreq_cdev->em->table[i].power;
+ em_power_mw /= MICROWATT_PER_MILLIWATT;
+ if (power >= em_power_mw)
break;
}

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 8c76f9655e57..8d1260f65061 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -200,7 +200,11 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
res = dfc->power_ops->get_real_power(df, power, freq, voltage);
if (!res) {
state = dfc->capped_state;
+
+ /* Convert EM power into milli-Watts first */
dfc->res_util = dfc->em_pd->table[state].power;
+ dfc->res_util /= MICROWATT_PER_MILLIWATT;
+
dfc->res_util *= SCALE_ERROR_MITIGATION;

if (*power > 1)
@@ -218,8 +222,10 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd

_normalize_load(&status);

- /* Scale power for utilization */
+ /* Convert EM power into milli-Watts first */
*power = dfc->em_pd->table[perf_idx].power;
+ *power /= MICROWATT_PER_MILLIWATT;
+ /* Scale power for utilization */
*power *= status.busy_time;
*power >>= 10;
}
@@ -244,6 +250,7 @@ static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,

perf_idx = dfc->max_state - state;
*power = dfc->em_pd->table[perf_idx].power;
+ *power /= MICROWATT_PER_MILLIWATT;

return 0;
}
@@ -254,7 +261,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
struct devfreq_cooling_device *dfc = cdev->devdata;
struct devfreq *df = dfc->devfreq;
struct devfreq_dev_status status;
- unsigned long freq;
+ unsigned long freq, em_power_mw;
s32 est_power;
int i;

@@ -279,9 +286,13 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
* Find the first cooling state that is within the power
* budget. The EM power table is sorted ascending.
*/
- for (i = dfc->max_state; i > 0; i--)
- if (est_power >= dfc->em_pd->table[i].power)
+ for (i = dfc->max_state; i > 0; i--) {
+ /* Convert EM power to milli-Watts to make safe comparison */
+ em_power_mw = dfc->em_pd->table[i].power;
+ em_power_mw /= MICROWATT_PER_MILLIWATT;
+ if (est_power >= em_power_mw)
break;
+ }

*state = dfc->max_state - i;
dfc->capped_state = *state;
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 8419bffb4398..a8199168c2a5 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -62,7 +62,7 @@ struct em_perf_domain {
/*
* em_perf_domain flags:
*
- * EM_PERF_DOMAIN_MILLIWATTS: The power values are in milli-Watts or some
+ * EM_PERF_DOMAIN_MICROWATTS: The power values are in micro-Watts or some
* other scale.
*
* EM_PERF_DOMAIN_SKIP_INEFFICIENCIES: Skip inefficient states when estimating
@@ -71,7 +71,7 @@ struct em_perf_domain {
* EM_PERF_DOMAIN_ARTIFICIAL: The power values are artificial and might be
* created by platform missing real power information
*/
-#define EM_PERF_DOMAIN_MILLIWATTS BIT(0)
+#define EM_PERF_DOMAIN_MICROWATTS BIT(0)
#define EM_PERF_DOMAIN_SKIP_INEFFICIENCIES BIT(1)
#define EM_PERF_DOMAIN_ARTIFICIAL BIT(2)

@@ -79,22 +79,53 @@ struct em_perf_domain {
#define em_is_artificial(em) ((em)->flags & EM_PERF_DOMAIN_ARTIFICIAL)

#ifdef CONFIG_ENERGY_MODEL
-#define EM_MAX_POWER 0xFFFF
+/*
+ * The max power value in micro-Watts. The limit of 64 Watts is set as
+ * a safety net to not overflow multiplications on 32bit platforms. The
+ * 32bit value limit for total Perf Domain power implies a limit of
+ * maximum CPUs in such domain to 64.
+ */
+#define EM_MAX_POWER (64000000) /* 64 Watts */
+
+/*
+ * Validate if the 'cost' value won't cause overflow on 32bit machine
+ * in the em_cpu_energy(). This is unlikely on existing 32bit platforms
+ * with small number of CPUs in a Performance Domain and/or small power
+ * value. Although, the issue would be triggered when we have more than
+ * 64 CPUs in a single Perf. Domain and each of them consumes 64 Watts.
+ *
+ * We are safe on 64bit machine.
+ */
+#ifdef CONFIG_64BIT
+#define em_validate_cost(cost, num_devs) (false)
+#else
+/*
+ * Simulate maximum possible sum utilization and multiply by
+ * cost/cpu_scale, which is in fact 'cost * num_devs'
+ */
+#define em_validate_cost(cost, num_devs) \
+ ((((u64)(cost) * (num_devs)) >= UINT_MAX))
+#endif

/*
- * Increase resolution of energy estimation calculations for 64-bit
- * architectures. The extra resolution improves decision made by EAS for the
- * task placement when two Performance Domains might provide similar energy
- * estimation values (w/o better resolution the values could be equal).
+ * To avoid an overflow on 32bit machines while calculating the energy
+ * use a different order in the operation. First divide by the 'cpu_scale'
+ * which would reduce big value stored in the 'cost' field, then multiply by
+ * the 'sum_util'. This would allow to handle existing platforms, which have
+ * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
+ * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
+ * could be 4096, then multiplication: 'cost' * 'sum_util' would overflow.
+ * This reordering of operations has some limitations, we lose small
+ * precision in the estimation (comparing to 64bit platform w/o reordering).
*
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. 64-bit). The costs for increasing resolution when 32-bit
- * are pretty high and the returns do not justify the increased costs.
+ * We are safe on 64bit machine.
*/
#ifdef CONFIG_64BIT
-#define em_scale_power(p) ((p) * 1000)
+#define em_estimate_energy(cost, sum_util, scale_cpu) \
+ (((cost) * (sum_util)) / (scale_cpu))
#else
-#define em_scale_power(p) (p)
+#define em_estimate_energy(cost, sum_util, scale_cpu) \
+ (((cost) / (scale_cpu)) * (sum_util))
#endif

struct em_data_callback {
@@ -112,7 +143,7 @@ struct em_data_callback {
* and frequency.
*
* In case of CPUs, the power is the one of a single CPU in the domain,
- * expressed in milli-Watts or an abstract scale. It is expected to
+ * expressed in micro-Watts or an abstract scale. It is expected to
* fit in the [0, EM_MAX_POWER] range.
*
* Return 0 on success.
@@ -148,7 +179,7 @@ struct em_perf_domain *em_cpu_get(int cpu);
struct em_perf_domain *em_pd_get(struct device *dev);
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *span,
- bool milliwatts);
+ bool microwatts);
void em_dev_unregister_perf_domain(struct device *dev);

/**
@@ -273,7 +304,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
* pd_nrg = ------------------------ (4)
* scale_cpu
*/
- return ps->cost * sum_util / scale_cpu;
+ return em_estimate_energy(ps->cost, sum_util, scale_cpu);
}

/**
@@ -297,7 +328,7 @@ struct em_data_callback {};
static inline
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *span,
- bool milliwatts)
+ bool microwatts)
{
return -EINVAL;
}
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 6c373f2960e7..910668ec8838 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -108,10 +108,11 @@ static void em_debug_remove_pd(struct device *dev) {}

static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
int nr_states, struct em_data_callback *cb,
- unsigned long flags)
+ unsigned long flags, int num_devs)
{
unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
struct em_perf_state *table;
+ unsigned long max_cost = 0;
int i, ret;
u64 fmax;

@@ -145,7 +146,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,

/*
* The power returned by active_state() is expected to be
- * positive and to fit into 16 bits.
+ * positive and be in range.
*/
if (!power || power > EM_MAX_POWER) {
dev_err(dev, "EM: invalid power: %lu\n",
@@ -170,7 +171,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
goto free_ps_table;
}
} else {
- power_res = em_scale_power(table[i].power);
+ power_res = table[i].power;
cost = div64_u64(fmax * power_res, table[i].frequency);
}

@@ -183,6 +184,15 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
} else {
prev_cost = table[i].cost;
}
+
+ if (max_cost < table[i].cost)
+ max_cost = table[i].cost;
+ }
+
+ /* Check if it won't overflow during energy estimation. */
+ if (em_validate_cost(max_cost, num_devs)) {
+ dev_err(dev, "EM: too big 'cost' value: %lu\n", max_cost);
+ goto free_ps_table;
}

pd->table = table;
@@ -199,9 +209,9 @@ static int em_create_pd(struct device *dev, int nr_states,
struct em_data_callback *cb, cpumask_t *cpus,
unsigned long flags)
{
+ int cpu, ret, num_devs = 1;
struct em_perf_domain *pd;
struct device *cpu_dev;
- int cpu, ret;

if (_is_cpu_device(dev)) {
pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL);
@@ -209,13 +219,14 @@ static int em_create_pd(struct device *dev, int nr_states,
return -ENOMEM;

cpumask_copy(em_span_cpus(pd), cpus);
+ num_devs = cpumask_weight(cpus);
} else {
pd = kzalloc(sizeof(*pd), GFP_KERNEL);
if (!pd)
return -ENOMEM;
}

- ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
+ ret = em_create_perf_table(dev, pd, nr_states, cb, flags, num_devs);
if (ret) {
kfree(pd);
return ret;
@@ -314,13 +325,13 @@ EXPORT_SYMBOL_GPL(em_cpu_get);
* @cpus : Pointer to cpumask_t, which in case of a CPU device is
* obligatory. It can be taken from i.e. 'policy->cpus'. For other
* type of devices this should be set to NULL.
- * @milliwatts : Flag indicating that the power values are in milliWatts or
+ * @microwatts : Flag indicating that the power values are in micro-Watts or
* in some other scale. It must be set properly.
*
* Create Energy Model tables for a performance domain using the callbacks
* defined in cb.
*
- * The @milliwatts is important to set with correct value. Some kernel
+ * The @microwatts is important to set with correct value. Some kernel
* sub-systems might rely on this flag and check if all devices in the EM are
* using the same scale.
*
@@ -331,7 +342,7 @@ EXPORT_SYMBOL_GPL(em_cpu_get);
*/
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *cpus,
- bool milliwatts)
+ bool microwatts)
{
unsigned long cap, prev_cap = 0;
unsigned long flags = 0;
@@ -381,8 +392,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
}
}

- if (milliwatts)
- flags |= EM_PERF_DOMAIN_MILLIWATTS;
+ if (microwatts)
+ flags |= EM_PERF_DOMAIN_MICROWATTS;
else if (cb->get_cost)
flags |= EM_PERF_DOMAIN_ARTIFICIAL;

--
2.17.1

2022-06-22 15:05:44

by Lukasz Luba

Subject: [PATCH 3/4] firmware: arm_scmi: Get detailed power scale from perf

In SCMI v3.1 the power scale can be in micro-Watts. The upper layers, e.g.
cpufreq and EM should handle received power values properly (upscale when
needed). Thus, provide an interface which allows to check what is the
scale for power values. The old interface allowed to distinguish between
bogo-Watts and milli-Watts only (which was good for older SCMI spec).

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/firmware/arm_scmi/perf.c | 18 +++++++++++-------
include/linux/scmi_protocol.h | 8 +++++++-
2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/firmware/arm_scmi/perf.c b/drivers/firmware/arm_scmi/perf.c
index bbb0331801ff..92414e53f908 100644
--- a/drivers/firmware/arm_scmi/perf.c
+++ b/drivers/firmware/arm_scmi/perf.c
@@ -170,8 +170,7 @@ struct perf_dom_info {
struct scmi_perf_info {
u32 version;
int num_domains;
- bool power_scale_mw;
- bool power_scale_uw;
+ enum scmi_power_scale power_scale;
u64 stats_addr;
u32 stats_size;
struct perf_dom_info *dom_info;
@@ -201,9 +200,13 @@ static int scmi_perf_attributes_get(const struct scmi_protocol_handle *ph,
u16 flags = le16_to_cpu(attr->flags);

pi->num_domains = le16_to_cpu(attr->num_domains);
- pi->power_scale_mw = POWER_SCALE_IN_MILLIWATT(flags);
+
+ if (POWER_SCALE_IN_MILLIWATT(flags))
+ pi->power_scale = SCMI_POWER_MILLIWATTS;
if (PROTOCOL_REV_MAJOR(pi->version) >= 0x3)
- pi->power_scale_uw = POWER_SCALE_IN_MICROWATT(flags);
+ if (POWER_SCALE_IN_MICROWATT(flags))
+ pi->power_scale = SCMI_POWER_MICROWATTS;
+
pi->stats_addr = le32_to_cpu(attr->stats_addr_low) |
(u64)le32_to_cpu(attr->stats_addr_high) << 32;
pi->stats_size = le32_to_cpu(attr->stats_size);
@@ -792,11 +795,12 @@ static bool scmi_fast_switch_possible(const struct scmi_protocol_handle *ph,
return dom->fc_info && dom->fc_info->level_set_addr;
}

-static bool scmi_power_scale_mw_get(const struct scmi_protocol_handle *ph)
+static enum scmi_power_scale
+scmi_power_scale_get(const struct scmi_protocol_handle *ph)
{
struct scmi_perf_info *pi = ph->get_priv(ph);

- return pi->power_scale_mw;
+ return pi->power_scale;
}

static const struct scmi_perf_proto_ops perf_proto_ops = {
@@ -811,7 +815,7 @@ static const struct scmi_perf_proto_ops perf_proto_ops = {
.freq_get = scmi_dvfs_freq_get,
.est_power_get = scmi_dvfs_est_power_get,
.fast_switch_possible = scmi_fast_switch_possible,
- .power_scale_mw_get = scmi_power_scale_mw_get,
+ .power_scale_get = scmi_power_scale_get,
};

static int scmi_perf_set_notify_enabled(const struct scmi_protocol_handle *ph,
diff --git a/include/linux/scmi_protocol.h b/include/linux/scmi_protocol.h
index 704111f63993..a0a246310ba1 100644
--- a/include/linux/scmi_protocol.h
+++ b/include/linux/scmi_protocol.h
@@ -60,6 +60,12 @@ struct scmi_clock_info {
};
};

+enum scmi_power_scale {
+ SCMI_POWER_BOGOWATTS,
+ SCMI_POWER_MILLIWATTS,
+ SCMI_POWER_MICROWATTS
+};
+
struct scmi_handle;
struct scmi_device;
struct scmi_protocol_handle;
@@ -135,7 +141,7 @@ struct scmi_perf_proto_ops {
unsigned long *rate, unsigned long *power);
bool (*fast_switch_possible)(const struct scmi_protocol_handle *ph,
struct device *dev);
- bool (*power_scale_mw_get)(const struct scmi_protocol_handle *ph);
+ enum scmi_power_scale (*power_scale_get)(const struct scmi_protocol_handle *ph);
};

/**
--
2.17.1

2022-06-22 15:07:05

by Lukasz Luba

Subject: [PATCH 2/4] Documentation: EM: Switch to micro-Watts scale

The EM now uses the micro-Watts scale for the power values. Update
related documentation to reflect that fact.

Also fix a problematic sentence in the doc containing "to:", which makes
test scripts complain about a wrong email address.

Signed-off-by: Lukasz Luba <[email protected]>
---
Documentation/power/energy-model.rst | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index feb257b7f350..ef341be2882b 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -20,20 +20,20 @@ possible source of information on its own, the EM framework intervenes as an
abstraction layer which standardizes the format of power cost tables in the
kernel, hence enabling to avoid redundant work.

-The power values might be expressed in milli-Watts or in an 'abstract scale'.
+The power values might be expressed in micro-Watts or in an 'abstract scale'.
Multiple subsystems might use the EM and it is up to the system integrator to
check that the requirements for the power value scale types are met. An example
can be found in the Energy-Aware Scheduler documentation
Documentation/scheduler/sched-energy.rst. For some subsystems like thermal or
powercap power values expressed in an 'abstract scale' might cause issues.
These subsystems are more interested in estimation of power used in the past,
-thus the real milli-Watts might be needed. An example of these requirements can
+thus the real micro-Watts might be needed. An example of these requirements can
be found in the Intelligent Power Allocation in
Documentation/driver-api/thermal/power_allocator.rst.
Kernel subsystems might implement automatic detection to check whether EM
registered devices have inconsistent scale (based on EM internal flag).
Important thing to keep in mind is that when the power values are expressed in
-an 'abstract scale' deriving real energy in milli-Joules would not be possible.
+an 'abstract scale' deriving real energy in micro-Joules would not be possible.

The figure below depicts an example of drivers (Arm-specific here, but the
approach is applicable to any architecture) providing power costs to the EM
@@ -98,7 +98,7 @@ Drivers are expected to register performance domains into the EM framework by
calling the following API::

int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
- struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts);
+ struct em_data_callback *cb, cpumask_t *cpus, bool microwatts);

Drivers must provide a callback function returning <frequency, power> tuples
for each performance state. The callback function provided by the driver is free
@@ -106,10 +106,10 @@ to fetch data from any relevant location (DT, firmware, ...), and by any mean
deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
performance domains using cpumask. For other devices than CPUs the last
argument must be set to NULL.
-The last argument 'milliwatts' is important to set with correct value. Kernel
+The last argument 'microwatts' is important to set with correct value. Kernel
subsystems which use EM might rely on this flag to check if all EM devices use
the same scale. If there are different scales, these subsystems might decide
-to: return warning/error, stop working or panic.
+to return warning/error, stop working or panic.
See Section 3. for an example of driver implementing this
callback, or Section 2.4 for further documentation on this API

@@ -137,7 +137,7 @@ The .get_cost() allows to provide the 'cost' values which reflect the
efficiency of the CPUs. This would allow to provide EAS information which
has different relation than what would be forced by the EM internal
formulas calculating 'cost' values. To register an EM for such platform, the
-driver must set the flag 'milliwatts' to 0, provide .get_power() callback
+driver must set the flag 'microwatts' to 0, provide .get_power() callback
and provide .get_cost() callback. The EM framework would handle such platform
properly during registration. A flag EM_PERF_DOMAIN_ARTIFICIAL is set for such
platform. Special care should be taken by other frameworks which are using EM
--
2.17.1

2022-06-22 15:56:02

by Lukasz Luba

Subject: [PATCH 4/4] cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1

The SCMI v3.1 adds support for power values in micro-Watts. They are not
always in milli-Watts anymore (ignoring the bogo-Watts). Thus, the power
must be converted conditionally before sending to Energy Model. Add the
logic which handles the needed checks and conversions.
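
A rough standalone sketch of the conditional conversion described above.
The enum, the UW_PER_MW constant and both helpers are local stand-ins
invented for illustration (they only mirror the idea of the SCMI power
scale and of MICROWATT_PER_MILLIWATT); they are not the driver code:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define UW_PER_MW 1000UL	/* stand-in for MICROWATT_PER_MILLIWATT */

/* Hypothetical mirror of the three power scales firmware may report. */
enum power_scale { SCALE_BOGOWATTS, SCALE_MILLIWATTS, SCALE_MICROWATTS };

/* Upscale mW to uW; uW and bogo-Watts values are passed through as-is. */
static unsigned long power_for_em(enum power_scale scale,
				  unsigned long fw_power)
{
	if (scale == SCALE_MILLIWATTS) {
		assert(fw_power <= ULONG_MAX / UW_PER_MW);
		return fw_power * UW_PER_MW;
	}
	return fw_power;
}

/* Only mW and uW are real units; bogo-Watts is an abstract scale. */
static bool em_uses_real_units(enum power_scale scale)
{
	return scale == SCALE_MILLIWATTS || scale == SCALE_MICROWATTS;
}
```

In this sketch a firmware value of 18 mW becomes 18000 uW, while a value
already in micro-Watts, or in bogo-Watts, is left unchanged.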

Signed-off-by: Lukasz Luba <[email protected]>
---
drivers/cpufreq/scmi-cpufreq.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/scmi-cpufreq.c b/drivers/cpufreq/scmi-cpufreq.c
index bfd35583d653..513a071845c2 100644
--- a/drivers/cpufreq/scmi-cpufreq.c
+++ b/drivers/cpufreq/scmi-cpufreq.c
@@ -100,7 +100,7 @@ static int __maybe_unused
scmi_get_cpu_power(struct device *cpu_dev, unsigned long *power,
unsigned long *KHz)
{
- bool power_scale_mw = perf_ops->power_scale_mw_get(ph);
+ enum scmi_power_scale power_scale = perf_ops->power_scale_get(ph);
unsigned long Hz;
int ret, domain;

@@ -114,8 +114,8 @@ scmi_get_cpu_power(struct device *cpu_dev, unsigned long *power,
if (ret)
return ret;

- /* Provide bigger resolution power to the Energy Model */
- if (power_scale_mw)
+ /* Convert the power to uW if it is mW (ignore bogoW) */
+ if (power_scale == SCMI_POWER_MILLIWATTS)
*power *= MICROWATT_PER_MILLIWATT;

/* The EM framework specifies the frequency in KHz. */
@@ -255,8 +255,9 @@ static int scmi_cpufreq_exit(struct cpufreq_policy *policy)
static void scmi_cpufreq_register_em(struct cpufreq_policy *policy)
{
struct em_data_callback em_cb = EM_DATA_CB(scmi_get_cpu_power);
- bool power_scale_mw = perf_ops->power_scale_mw_get(ph);
+ enum scmi_power_scale power_scale = perf_ops->power_scale_get(ph);
struct scmi_data *priv = policy->driver_data;
+ bool em_power_scale = false;

/*
* This callback will be called for each policy, but we don't need to
@@ -268,9 +269,13 @@ static void scmi_cpufreq_register_em(struct cpufreq_policy *policy)
if (!priv->nr_opp)
return;

+ if (power_scale == SCMI_POWER_MILLIWATTS
+ || power_scale == SCMI_POWER_MICROWATTS)
+ em_power_scale = true;
+
em_dev_register_perf_domain(get_cpu_device(policy->cpu), priv->nr_opp,
&em_cb, priv->opp_shared_cpus,
- power_scale_mw);
+ em_power_scale);
}

static struct cpufreq_driver scmi_cpufreq_driver = {
--
2.17.1

2022-06-22 16:25:36

by Sudeep Holla

Subject: Re: [PATCH 4/4] cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1

On Wed, Jun 22, 2022 at 03:58:02PM +0100, Lukasz Luba wrote:
> The SCMI v3.1 adds support for power values in micro-Watts. They are not
> always in milli-Watts anymore (ignoring the bogo-Watts). Thus, the power
> must be converted conditionally before sending to Energy Model. Add the
> logic which handles the needed checks and conversions.
>

Acked-by: Sudeep Holla <[email protected]>

--
Regards,
Sudeep

2022-06-22 16:26:46

by Sudeep Holla

Subject: Re: [PATCH 3/4] firmware: arm_scmi: Get detailed power scale from perf

On Wed, Jun 22, 2022 at 03:58:01PM +0100, Lukasz Luba wrote:
> In SCMI v3.1 the power scale can be in micro-Watts. The upper layers, e.g.
> cpufreq and EM should handle received power values properly (upscale when
> needed). Thus, provide an interface which allows to check what is the
> scale for power values. The old interface allowed to distinguish between
> bogo-Watts and milli-Watts only (which was good for older SCMI spec).
>

Assuming you will take this as a series,

Acked-by: Sudeep Holla <[email protected]>

--
Regards,
Sudeep

2022-06-23 08:08:47

by Lukasz Luba

Subject: Re: [PATCH 3/4] firmware: arm_scmi: Get detailed power scale from perf



On 6/22/22 16:53, Sudeep Holla wrote:
> On Wed, Jun 22, 2022 at 03:58:01PM +0100, Lukasz Luba wrote:
>> In SCMI v3.1 the power scale can be in micro-Watts. The upper layers, e.g.
>> cpufreq and EM should handle received power values properly (upscale when
>> needed). Thus, provide an interface which allows to check what is the
>> scale for power values. The old interface allowed to distinguish between
>> bogo-Watts and milli-Watts only (which was good for older SCMI spec).
>>
>
> Assuming you will take this as a series,
>
> Acked-by: Sudeep Holla <[email protected]>
>

Thanks Sudeep for the ACKs!

Regards,
Lukasz

2022-06-29 10:26:14

by Lukasz Luba

Subject: Re: [PATCH 0/4] Energy Model power in micro-Watts and SCMI v3.1 alignment

Hi Viresh,

On 6/29/22 10:53, Viresh Kumar wrote:
> On 29-06-22, 10:49, Lukasz Luba wrote:
>> I would like to move forward with the micro-Watts in
>> the Energy Model. We have feedback from our partners
>> that this is a limitation. Also, as you can see
>> this uW is part of the new SCMI spec, which we
>> have support on our roadmap.
>
> Should I pick them and merge via PM tree ?
>

Thanks for fast response. It would be great.

I have 2 ACKs from Sudeep for the SCMI part,
but I don't know the status e.g. of DTPM
current work which is using the EM milli-Watts
and does conversion to uW internally.
I hope, I won't make issues to Daniel's work with this
change.

2022-06-29 10:40:11

by Lukasz Luba

Subject: Re: [PATCH 0/4] Energy Model power in micro-Watts and SCMI v3.1 alignment

Hi guys,

Are there any objections to these patches?


On 6/22/22 15:57, Lukasz Luba wrote:
> Hi all,
>
> This is a patch set which changes Energy Model power values scale to
> micro-Watts. It also upgrades the SCMI performance layer + scmi-cpufreq
> driver to leverage the SCMI v3.1 spec and process micro-Watts power values
> coming from FW. The higher precision in EM power field solves an issue
> of a rounding error, which then can be misinterpreted as 'inefficient OPP'.
> An example rounding issue calculation is present in patch 1/4 description.
>
> Regards,
> Lukasz Luba
>
> Lukasz Luba (4):
> PM: EM: convert power field to micro-Watts precision and align drivers
> Documentation: EM: Switch to micro-Watts scale
> firmware: arm_scmi: Get detailed power scale from perf
> cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1
>
> Documentation/power/energy-model.rst | 14 +++---
> drivers/cpufreq/mediatek-cpufreq-hw.c | 7 +--
> drivers/cpufreq/scmi-cpufreq.c | 15 ++++++-
> drivers/firmware/arm_scmi/perf.c | 18 +++++---
> drivers/opp/of.c | 15 ++++---
> drivers/powercap/dtpm_cpu.c | 5 +--
> drivers/thermal/cpufreq_cooling.c | 13 +++++-
> drivers/thermal/devfreq_cooling.c | 19 ++++++--
> include/linux/energy_model.h | 63 ++++++++++++++++++++-------
> include/linux/scmi_protocol.h | 8 +++-
> kernel/power/energy_model.c | 31 ++++++++-----
> 11 files changed, 146 insertions(+), 62 deletions(-)
>


I would like to move forward with the micro-Watts in
the Energy Model. We have feedback from our partners
that this is a limitation. Also, as you can see
this uW is part of the new SCMI spec, which we
have support on our roadmap.

Regards,
Lukasz

2022-06-29 10:44:10

by Viresh Kumar

Subject: Re: [PATCH 0/4] Energy Model power in micro-Watts and SCMI v3.1 alignment

On 29-06-22, 10:49, Lukasz Luba wrote:
> I would like to move forward with the micro-Watts in
> the Energy Model. We have feedback from our partners
> that this is a limitation. Also, as you can see
> this uW is part of the new SCMI spec, which we
> have support on our roadmap.

Should I pick them and merge via PM tree ?

--
viresh

2022-06-29 10:58:01

by Lukasz Luba

Subject: Re: [PATCH 0/4] Energy Model power in micro-Watts and SCMI v3.1 alignment

Hi Daniel,

On 6/29/22 11:21, Daniel Lezcano wrote:
>
> Hi Lukasz,
>
> On 29/06/2022 12:01, Viresh Kumar wrote:
>> On 29-06-22, 11:00, Lukasz Luba wrote:
>>> Thanks for fast response. It would be great.
>>>
>>> I have 2 ACKs from Sudeep for the SCMI part,
>>> but I don't know the status e.g. of DTPM
>>> current work which is using the EM milli-Watts
>>> and does conversion to uW internally.
>>> I hope, I won't make issues to Daniel's work with this
>>> change.
>>
>> Daniel, do you have any objections to this ?
>
> Sorry I had no time to review the series yet, give me a couple of days,
> maybe a bit more if possible
>
>

OK, take your time. I hope this can land as material
for v5.20; we still have some time.

Regards,
Lukasz

2022-06-29 11:04:12

by Viresh Kumar

Subject: Re: [PATCH 0/4] Energy Model power in micro-Watts and SCMI v3.1 alignment

On 29-06-22, 11:00, Lukasz Luba wrote:
> Thanks for fast response. It would be great.
>
> I have 2 ACKs from Sudeep for the SCMI part,
> but I don't know the status e.g. of DTPM
> current work which is using the EM milli-Watts
> and does conversion to uW internally.
> I hope, I won't make issues to Daniel's work with this
> change.

Daniel, do you have any objections to this ?

--
viresh

2022-06-29 11:04:55

by Daniel Lezcano

Subject: Re: [PATCH 0/4] Energy Model power in micro-Watts and SCMI v3.1 alignment


Hi Lukasz,

On 29/06/2022 12:01, Viresh Kumar wrote:
> On 29-06-22, 11:00, Lukasz Luba wrote:
>> Thanks for fast response. It would be great.
>>
>> I have 2 ACKs from Sudeep for the SCMI part,
>> but I don't know the status e.g. of DTPM
>> current work which is using the EM milli-Watts
>> and does conversion to uW internally.
>> I hope, I won't make issues to Daniel's work with this
>> change.
>
> Daniel, do you have any objections to this ?

Sorry, I have had no time to review the series yet. Give me a couple of days,
maybe a bit more if possible.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2022-07-05 09:27:35

by Daniel Lezcano

Subject: Re: [PATCH 1/4] PM: EM: convert power field to micro-Watts precision and align drivers

On 22/06/2022 16:57, Lukasz Luba wrote:
> The milli-Watts precision causes rounding errors while calculating
> efficiency cost for each OPP. This is especially visible in the 'simple'
> Energy Model (EM), where the power for each OPP is provided from OPP
> framework. This can cause some OPPs to be marked inefficient, while
> using micro-Watts precision that might not happen.
>
> Update all EM users which access 'power' field and assume the value is
> in milli-Watts.
>
> Solve also an issue with potential overflow in calculation of energy
> estimation on 32bit machine. It's needed now since the power value
> (thus the 'cost' as well) are higher.
>
> Example calculation which shows the rounding error and impact:
>
> power = 'dyn-power-coeff' * volt_mV * volt_mV * freq_MHz
>
> power_a_uW = (100 * 600mV * 600mV * 500MHz) / 10^6 = 18000
> power_a_mW = (100 * 600mV * 600mV * 500MHz) / 10^9 = 18
>
> power_b_uW = (100 * 605mV * 605mV * 600MHz) / 10^6 = 21961
> power_b_mW = (100 * 605mV * 605mV * 600MHz) / 10^9 = 21
>
> max_freq = 2000MHz
>
> cost_a_mW = 18 * 2000MHz/500MHz = 72
> cost_a_uW = 18000 * 2000MHz/500MHz = 72000
>
> cost_b_mW = 21 * 2000MHz/600MHz = 70 // <- artificially better
> cost_b_uW = 21961 * 2000MHz/600MHz = 73203
>
> The 'cost_b_mW' (which is based on old milli-Watts) is misleadingly
> better than the 'cost_b_uW' (this patch uses micro-Watts) and such
> would have impact on the 'inefficient OPPs' information in the Cpufreq
> framework. This patch set removes the rounding issue.

Thanks for this detailed description, it really helps to understand why
this change is needed.
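
For anyone who wants to replay the numbers quoted above, here is a minimal
user-space sketch. It assumes the voltage operands are in mV and uses plain
integer division, as the quoted calculation does. The helper names opp_power()
and opp_cost() are made up for illustration; they are not kernel functions.

```c
#include <stdint.h>

/*
 * power = coeff * mV * mV * MHz, scaled down by 'div'
 * (10^6 gives micro-Watts, 10^9 gives milli-Watts).
 * Integer division truncates, which is the source of the rounding error.
 */
static inline uint64_t opp_power(uint64_t coeff, uint64_t mv,
				 uint64_t mhz, uint64_t div)
{
	return coeff * mv * mv * mhz / div;
}

/* cost = power * max_freq / freq, again with truncating division. */
static inline uint64_t opp_cost(uint64_t power, uint64_t fmax_mhz,
				uint64_t mhz)
{
	return power * fmax_mhz / mhz;
}
```

With coeff 100 and max_freq 2000 MHz, opp_cost() gives 72 for OPP 'a' and 70
for OPP 'b' on the milli-Watts inputs (18 and 21), but 72000 and 73203 on the
micro-Watts inputs (18000 and 21961), so the ordering of the two OPPs flips.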

Perhaps it would make sense to add a power_uw in the EM structure and
keeping the old one with the milli-watts in order to reduce the impact
of the change.

It is a suggestion if you find it more convenient. Otherwise I'm fine
with this approach too.

A few comments below.

> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> drivers/cpufreq/mediatek-cpufreq-hw.c | 7 +--
> drivers/cpufreq/scmi-cpufreq.c | 6 +++
> drivers/opp/of.c | 15 ++++---
> drivers/powercap/dtpm_cpu.c | 5 +--
> drivers/thermal/cpufreq_cooling.c | 13 +++++-
> drivers/thermal/devfreq_cooling.c | 19 ++++++--
> include/linux/energy_model.h | 63 ++++++++++++++++++++-------
> kernel/power/energy_model.c | 31 ++++++++-----
> 8 files changed, 114 insertions(+), 45 deletions(-)
>
> diff --git a/drivers/cpufreq/mediatek-cpufreq-hw.c b/drivers/cpufreq/mediatek-cpufreq-hw.c
> index 813cccbfe934..f0e0a35c7f21 100644
> --- a/drivers/cpufreq/mediatek-cpufreq-hw.c
> +++ b/drivers/cpufreq/mediatek-cpufreq-hw.c
> @@ -51,7 +51,7 @@ static const u16 cpufreq_mtk_offsets[REG_ARRAY_SIZE] = {
> };
>

[ ... ]

> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index b8151d95a806..dc19e7c80751 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -21,6 +21,7 @@
> #include <linux/pm_qos.h>
> #include <linux/slab.h>
> #include <linux/thermal.h>
> +#include <linux/units.h>
>
> #include <trace/events/thermal.h>
>
> @@ -101,6 +102,7 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
> static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
> u32 freq)
> {
> + unsigned long power_mw;
> int i;
>
> for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
> @@ -108,16 +110,23 @@ static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
> break;
> }
>
> - return cpufreq_cdev->em->table[i + 1].power;
> + power_mw = cpufreq_cdev->em->table[i + 1].power;
> + power_mw /= MICROWATT_PER_MILLIWATT;

Won't this fail with unresolved symbols on some archs? I mean, maybe
do_div should be used instead?

> +
> + return power_mw;
> }

[ ... ]

> #ifdef CONFIG_64BIT
> -#define em_scale_power(p) ((p) * 1000)
> +#define em_estimate_energy(cost, sum_util, scale_cpu) \
> + (((cost) * (sum_util)) / (scale_cpu))
> #else
> -#define em_scale_power(p) (p)
> +#define em_estimate_energy(cost, sum_util, scale_cpu) \
> + (((cost) / (scale_cpu)) * (sum_util))
> #endif
>
> struct em_data_callback {
> @@ -112,7 +143,7 @@ struct em_data_callback {
> * and frequency.
> *
> * In case of CPUs, the power is the one of a single CPU in the domain,
> - * expressed in milli-Watts or an abstract scale. It is expected to
> + * expressed in micro-Watts or an abstract scale. It is expected to
> * fit in the [0, EM_MAX_POWER] range.
> *
> * Return 0 on success.
> @@ -148,7 +179,7 @@ struct em_perf_domain *em_cpu_get(int cpu);
> struct em_perf_domain *em_pd_get(struct device *dev);
> int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> struct em_data_callback *cb, cpumask_t *span,
> - bool milliwatts);
> + bool microwatts);
> void em_dev_unregister_perf_domain(struct device *dev);
>
> /**
> @@ -273,7 +304,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> * pd_nrg = ------------------------ (4)
> * scale_cpu
> */
> - return ps->cost * sum_util / scale_cpu;
> + return em_estimate_energy(ps->cost, sum_util, scale_cpu);
> }
>
> /**
> @@ -297,7 +328,7 @@ struct em_data_callback {};
> static inline
> int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> struct em_data_callback *cb, cpumask_t *span,
> - bool milliwatts)
> + bool microwatts)
> {
> return -EINVAL;
> }
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 6c373f2960e7..910668ec8838 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -108,10 +108,11 @@ static void em_debug_remove_pd(struct device *dev) {}
>
> static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> int nr_states, struct em_data_callback *cb,
> - unsigned long flags)
> + unsigned long flags, int num_devs)
> {
> unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
> struct em_perf_state *table;
> + unsigned long max_cost = 0;
> int i, ret;
> u64 fmax;
>
> @@ -145,7 +146,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>
> /*
> * The power returned by active_state() is expected to be
> - * positive and to fit into 16 bits.
> + * positive and be in range.
> */
> if (!power || power > EM_MAX_POWER) {
> dev_err(dev, "EM: invalid power: %lu\n",
> @@ -170,7 +171,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> goto free_ps_table;
> }
> } else {
> - power_res = em_scale_power(table[i].power);
> + power_res = table[i].power;
> cost = div64_u64(fmax * power_res, table[i].frequency);
> }
>
> @@ -183,6 +184,15 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
> } else {
> prev_cost = table[i].cost;
> }
> +
> + if (max_cost < table[i].cost)
> + max_cost = table[i].cost;
> + }
> +
> + /* Check if it won't overflow during energy estimation. */
> + if (em_validate_cost(max_cost, num_devs)) {

I'm not finding the em_validate_cost() function

> + dev_err(dev, "EM: too big 'cost' value: %lu\n", max_cost);
> + goto free_ps_table;
> }
>
> pd->table = table;
> @@ -199,9 +209,9 @@ static int em_create_pd(struct device *dev, int nr_states,
> struct em_data_callback *cb, cpumask_t *cpus,
> unsigned long flags)
> {
> + int cpu, ret, num_devs = 1;
> struct em_perf_domain *pd;
> struct device *cpu_dev;
> - int cpu, ret;
>
> if (_is_cpu_device(dev)) {
> pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL);
> @@ -209,13 +219,14 @@ static int em_create_pd(struct device *dev, int nr_states,
> return -ENOMEM;
>
> cpumask_copy(em_span_cpus(pd), cpus);
> + num_devs = cpumask_weight(cpus);

Why is this change needed? What is the connection with the uW unit change?


> } else {
> pd = kzalloc(sizeof(*pd), GFP_KERNEL);
> if (!pd)
> return -ENOMEM;
> }
>
> - ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
> + ret = em_create_perf_table(dev, pd, nr_states, cb, flags, num_devs);
> if (ret) {
> kfree(pd);
> return ret;
> @@ -314,13 +325,13 @@ EXPORT_SYMBOL_GPL(em_cpu_get);
> * @cpus : Pointer to cpumask_t, which in case of a CPU device is
> * obligatory. It can be taken from i.e. 'policy->cpus'. For other
> * type of devices this should be set to NULL.
> - * @milliwatts : Flag indicating that the power values are in milliWatts or
> + * @microwatts : Flag indicating that the power values are in micro-Watts or
> * in some other scale. It must be set properly.
> *
> * Create Energy Model tables for a performance domain using the callbacks
> * defined in cb.
> *
> - * The @milliwatts is important to set with correct value. Some kernel
> + * The @microwatts is important to set with correct value. Some kernel
> * sub-systems might rely on this flag and check if all devices in the EM are
> * using the same scale.
> *
> @@ -331,7 +342,7 @@ EXPORT_SYMBOL_GPL(em_cpu_get);
> */
> int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> struct em_data_callback *cb, cpumask_t *cpus,
> - bool milliwatts)
> + bool microwatts)
> {
> unsigned long cap, prev_cap = 0;
> unsigned long flags = 0;
> @@ -381,8 +392,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> }
> }
>
> - if (milliwatts)
> - flags |= EM_PERF_DOMAIN_MILLIWATTS;
> + if (microwatts)
> + flags |= EM_PERF_DOMAIN_MICROWATTS;
> else if (cb->get_cost)
> flags |= EM_PERF_DOMAIN_ARTIFICIAL;
>



2022-07-05 09:28:09

by Daniel Lezcano

Subject: Re: [PATCH 2/4] Documentation: EM: Switch to micro-Watts scale

On 22/06/2022 16:58, Lukasz Luba wrote:
> The EM now uses the micro-Watts scale for the power values. Update
> related documentation to reflect that fact.
>
> Fix also a problematic sentence in the doc "to:" which triggers test
> scripts complaining about wrong email address.
>
> Signed-off-by: Lukasz Luba <[email protected]>

Reviewed-by: Daniel Lezcano <[email protected]>

> ---
> Documentation/power/energy-model.rst | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
> index feb257b7f350..ef341be2882b 100644
> --- a/Documentation/power/energy-model.rst
> +++ b/Documentation/power/energy-model.rst
> @@ -20,20 +20,20 @@ possible source of information on its own, the EM framework intervenes as an
> abstraction layer which standardizes the format of power cost tables in the
> kernel, hence enabling to avoid redundant work.
>
> -The power values might be expressed in milli-Watts or in an 'abstract scale'.
> +The power values might be expressed in micro-Watts or in an 'abstract scale'.
> Multiple subsystems might use the EM and it is up to the system integrator to
> check that the requirements for the power value scale types are met. An example
> can be found in the Energy-Aware Scheduler documentation
> Documentation/scheduler/sched-energy.rst. For some subsystems like thermal or
> powercap power values expressed in an 'abstract scale' might cause issues.
> These subsystems are more interested in estimation of power used in the past,
> -thus the real milli-Watts might be needed. An example of these requirements can
> +thus the real micro-Watts might be needed. An example of these requirements can
> be found in the Intelligent Power Allocation in
> Documentation/driver-api/thermal/power_allocator.rst.
> Kernel subsystems might implement automatic detection to check whether EM
> registered devices have inconsistent scale (based on EM internal flag).
> Important thing to keep in mind is that when the power values are expressed in
> -an 'abstract scale' deriving real energy in milli-Joules would not be possible.
> +an 'abstract scale' deriving real energy in micro-Joules would not be possible.
>
> The figure below depicts an example of drivers (Arm-specific here, but the
> approach is applicable to any architecture) providing power costs to the EM
> @@ -98,7 +98,7 @@ Drivers are expected to register performance domains into the EM framework by
> calling the following API::
>
> int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> - struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts);
> + struct em_data_callback *cb, cpumask_t *cpus, bool microwatts);
>
> Drivers must provide a callback function returning <frequency, power> tuples
> for each performance state. The callback function provided by the driver is free
> @@ -106,10 +106,10 @@ to fetch data from any relevant location (DT, firmware, ...), and by any mean
> deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
> performance domains using cpumask. For other devices than CPUs the last
> argument must be set to NULL.
> -The last argument 'milliwatts' is important to set with correct value. Kernel
> +The last argument 'microwatts' is important to set with correct value. Kernel
> subsystems which use EM might rely on this flag to check if all EM devices use
> the same scale. If there are different scales, these subsystems might decide
> -to: return warning/error, stop working or panic.
> +to return warning/error, stop working or panic.
> See Section 3. for an example of driver implementing this
> callback, or Section 2.4 for further documentation on this API
>
> @@ -137,7 +137,7 @@ The .get_cost() allows to provide the 'cost' values which reflect the
> efficiency of the CPUs. This would allow to provide EAS information which
> has different relation than what would be forced by the EM internal
> formulas calculating 'cost' values. To register an EM for such platform, the
> -driver must set the flag 'milliwatts' to 0, provide .get_power() callback
> +driver must set the flag 'microwatts' to 0, provide .get_power() callback
> and provide .get_cost() callback. The EM framework would handle such platform
> properly during registration. A flag EM_PERF_DOMAIN_ARTIFICIAL is set for such
> platform. Special care should be taken by other frameworks which are using EM



2022-07-05 09:31:36

by Daniel Lezcano

Subject: Re: [PATCH 4/4] cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1

On 22/06/2022 16:58, Lukasz Luba wrote:
> The SCMI v3.1 adds support for power values in micro-Watts. They are not
> always in milli-Watts anymore (ignoring the bogo-Watts). Thus, the power
> must be converted conditionally before sending to Energy Model. Add the
> logic which handles the needed checks and conversions.
>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> drivers/cpufreq/scmi-cpufreq.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cpufreq/scmi-cpufreq.c b/drivers/cpufreq/scmi-cpufreq.c
> index bfd35583d653..513a071845c2 100644
> --- a/drivers/cpufreq/scmi-cpufreq.c
> +++ b/drivers/cpufreq/scmi-cpufreq.c
> @@ -100,7 +100,7 @@ static int __maybe_unused
> scmi_get_cpu_power(struct device *cpu_dev, unsigned long *power,
> unsigned long *KHz)
> {
> - bool power_scale_mw = perf_ops->power_scale_mw_get(ph);
> + enum scmi_power_scale power_scale = perf_ops->power_scale_get(ph);
> unsigned long Hz;
> int ret, domain;
>
> @@ -114,8 +114,8 @@ scmi_get_cpu_power(struct device *cpu_dev, unsigned long *power,
> if (ret)
> return ret;
>
> - /* Provide bigger resolution power to the Energy Model */
> - if (power_scale_mw)
> + /* Convert the power to uW if it is mW (ignore bogoW) */
> + if (power_scale == SCMI_POWER_MILLIWATTS)
> *power *= MICROWATT_PER_MILLIWATT;
>
> /* The EM framework specifies the frequency in KHz. */
> @@ -255,8 +255,9 @@ static int scmi_cpufreq_exit(struct cpufreq_policy *policy)
> static void scmi_cpufreq_register_em(struct cpufreq_policy *policy)
> {
> struct em_data_callback em_cb = EM_DATA_CB(scmi_get_cpu_power);
> - bool power_scale_mw = perf_ops->power_scale_mw_get(ph);
> + enum scmi_power_scale power_scale = perf_ops->power_scale_get(ph);
> struct scmi_data *priv = policy->driver_data;
> + bool em_power_scale = false;

Just pass 'false' to em_dev_register_perf_domain()

> /*
> * This callback will be called for each policy, but we don't need to
> @@ -268,9 +269,13 @@ static void scmi_cpufreq_register_em(struct cpufreq_policy *policy)
> if (!priv->nr_opp)
> return;
>
> + if (power_scale == SCMI_POWER_MILLIWATTS
> + || power_scale == SCMI_POWER_MICROWATTS)
> + em_power_scale = true;
> +
> em_dev_register_perf_domain(get_cpu_device(policy->cpu), priv->nr_opp,
> &em_cb, priv->opp_shared_cpus,
> - power_scale_mw);
> + em_power_scale);
> }
>
> static struct cpufreq_driver scmi_cpufreq_driver = {



2022-07-06 09:12:37

by Lukasz Luba

Subject: Re: [PATCH 2/4] Documentation: EM: Switch to micro-Watts scale



On 7/5/22 10:10, Daniel Lezcano wrote:
> On 22/06/2022 16:58, Lukasz Luba wrote:
>> The EM now uses the micro-Watts scale for the power values. Update
>> related documentation to reflect that fact.
>>
>> Fix also a problematic sentence in the doc "to:" which triggers test
>> scripts complaining about wrong email address.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>
> Reviewed-by: Daniel Lezcano <[email protected]>
>

Thanks for the reviews!

2022-07-06 09:34:53

by Lukasz Luba

Subject: Re: [PATCH 4/4] cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1



On 7/5/22 10:25, Daniel Lezcano wrote:
> On 22/06/2022 16:58, Lukasz Luba wrote:
>> The SCMI v3.1 adds support for power values in micro-Watts. They are not
>> always in milli-Watts anymore (ignoring the bogo-Watts). Thus, the power
>> must be converted conditionally before sending to Energy Model. Add the
>> logic which handles the needed checks and conversions.
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>>   drivers/cpufreq/scmi-cpufreq.c | 15 ++++++++++-----
>>   1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/cpufreq/scmi-cpufreq.c
>> b/drivers/cpufreq/scmi-cpufreq.c
>> index bfd35583d653..513a071845c2 100644
>> --- a/drivers/cpufreq/scmi-cpufreq.c
>> +++ b/drivers/cpufreq/scmi-cpufreq.c
>> @@ -100,7 +100,7 @@ static int __maybe_unused
>>   scmi_get_cpu_power(struct device *cpu_dev, unsigned long *power,
>>              unsigned long *KHz)
>>   {
>> -    bool power_scale_mw = perf_ops->power_scale_mw_get(ph);
>> +    enum scmi_power_scale power_scale = perf_ops->power_scale_get(ph);
>>       unsigned long Hz;
>>       int ret, domain;
>> @@ -114,8 +114,8 @@ scmi_get_cpu_power(struct device *cpu_dev,
>> unsigned long *power,
>>       if (ret)
>>           return ret;
>> -    /* Provide bigger resolution power to the Energy Model */
>> -    if (power_scale_mw)
>> +    /* Convert the power to uW if it is mW (ignore bogoW) */
>> +    if (power_scale == SCMI_POWER_MILLIWATTS)
>>           *power *= MICROWATT_PER_MILLIWATT;
>>       /* The EM framework specifies the frequency in KHz. */
>> @@ -255,8 +255,9 @@ static int scmi_cpufreq_exit(struct cpufreq_policy
>> *policy)
>>   static void scmi_cpufreq_register_em(struct cpufreq_policy *policy)
>>   {
>>       struct em_data_callback em_cb = EM_DATA_CB(scmi_get_cpu_power);
>> -    bool power_scale_mw = perf_ops->power_scale_mw_get(ph);
>> +    enum scmi_power_scale power_scale = perf_ops->power_scale_get(ph);
>>       struct scmi_data *priv = policy->driver_data;
>> +    bool em_power_scale = false;
>
> Just pass 'false' to em_dev_register_perf_domain()

We cannot,

>
>>       /*
>>        * This callback will be called for each policy, but we don't
>> need to
>> @@ -268,9 +269,13 @@ static void scmi_cpufreq_register_em(struct
>> cpufreq_policy *policy)
>>       if (!priv->nr_opp)
>>           return;
>> +    if (power_scale == SCMI_POWER_MILLIWATTS
>> +        || power_scale == SCMI_POWER_MICROWATTS)
>> +        em_power_scale = true;
>> +

because sometimes it's 'true'.

>>       em_dev_register_perf_domain(get_cpu_device(policy->cpu),
>> priv->nr_opp,
>>                       &em_cb, priv->opp_shared_cpus,
>> -                    power_scale_mw);
>> +                    em_power_scale);

Then we just use the variable here in a single call.

>>   }
>>   static struct cpufreq_driver scmi_cpufreq_driver = {
>
>
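
To spell out the logic of the two hunks in one place, here is a rough
user-space sketch. The SCMI_POWER_MILLIWATTS and SCMI_POWER_MICROWATTS names
come from the patch; the SCMI_POWER_BOGOWATTS name, the enum ordering and both
helper names are assumptions made for illustration only.

```c
#include <stdbool.h>

/* Local stand-in for the constant from <linux/units.h>. */
#define MICROWATT_PER_MILLIWATT 1000UL

/* Stand-in enum: BOGOWATTS and the ordering are assumed, not from the patch. */
enum scmi_power_scale {
	SCMI_POWER_BOGOWATTS,
	SCMI_POWER_MILLIWATTS,
	SCMI_POWER_MICROWATTS,
};

/* Hypothetical helper: bring a firmware power value to the EM scale. */
static inline unsigned long scmi_power_to_em(enum scmi_power_scale scale,
					     unsigned long power)
{
	/* Only milli-Watts need upscaling. */
	if (scale == SCMI_POWER_MILLIWATTS)
		return power * MICROWATT_PER_MILLIWATT;

	/* Already micro-Watts, or abstract bogo-Watts left untouched. */
	return power;
}

/* Hypothetical helper: value for the 'microwatts' registration flag. */
static inline bool scmi_em_is_microwatts(enum scmi_power_scale scale)
{
	return scale == SCMI_POWER_MILLIWATTS ||
	       scale == SCMI_POWER_MICROWATTS;
}
```

The flag is true for both real-unit scales because milli-Watts values have
already been converted by the time they reach the EM; only the abstract scale
registers with 'false'.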

2022-07-06 09:41:55

by Lukasz Luba

Subject: Re: [PATCH 1/4] PM: EM: convert power field to micro-Watts precision and align drivers



On 7/5/22 10:09, Daniel Lezcano wrote:
> On 22/06/2022 16:57, Lukasz Luba wrote:
>> The milli-Watts precision causes rounding errors while calculating
>> efficiency cost for each OPP. This is especially visible in the 'simple'
>> Energy Model (EM), where the power for each OPP is provided from OPP
>> framework. This can cause some OPPs to be marked inefficient, while
>> using micro-Watts precision that might not happen.
>>
>> Update all EM users which access 'power' field and assume the value is
>> in milli-Watts.
>>
>> Solve also an issue with potential overflow in calculation of energy
>> estimation on 32bit machine. It's needed now since the power value
>> (thus the 'cost' as well) are higher.
>>
>> Example calculation which shows the rounding error and impact:
>>
>> power = 'dyn-power-coeff' * volt_mV * volt_mV * freq_MHz
>>
>> power_a_uW = (100 * 600mV * 600mV * 500MHz) / 10^6 = 18000
>> power_a_mW = (100 * 600mV * 600mV * 500MHz) / 10^9 = 18
>>
>> power_b_uW = (100 * 605mV * 605mV * 600MHz) / 10^6 = 21961
>> power_b_mW = (100 * 605mV * 605mV * 600MHz) / 10^9 = 21
>>
>> max_freq = 2000MHz
>>
>> cost_a_mW = 18 * 2000MHz/500MHz = 72
>> cost_a_uW = 18000 * 2000MHz/500MHz = 72000
>>
>> cost_b_mW = 21 * 2000MHz/600MHz = 70 // <- artificially better
>> cost_b_uW = 21961 * 2000MHz/600MHz = 73203
>>
>> The 'cost_b_mW' (which is based on old milli-Watts) is misleadingly
>> better than the 'cost_b_uW' (this patch uses micro-Watts) and such
>> would have impact on the 'inefficient OPPs' information in the Cpufreq
>> framework. This patch set removes the rounding issue.
>
> Thanks for this detailed description, it really helps to understand why
> this change is needed.
>
> Perhaps it would make sense to add a power_uw in the EM structure and
> keeping the old one with the milli-watts in order to reduce the impact
> of the change.
>
> It is a suggestion if you find it more convenient. Otherwise I'm fine
> with this approach too.

I see your point, it could go with 2 patches instead of one. If a v2
is needed, I will consider this split.

>
> A few comments below.
>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>>   drivers/cpufreq/mediatek-cpufreq-hw.c |  7 +--
>>   drivers/cpufreq/scmi-cpufreq.c        |  6 +++
>>   drivers/opp/of.c                      | 15 ++++---
>>   drivers/powercap/dtpm_cpu.c           |  5 +--
>>   drivers/thermal/cpufreq_cooling.c     | 13 +++++-
>>   drivers/thermal/devfreq_cooling.c     | 19 ++++++--
>>   include/linux/energy_model.h          | 63 ++++++++++++++++++++-------
>>   kernel/power/energy_model.c           | 31 ++++++++-----
>>   8 files changed, 114 insertions(+), 45 deletions(-)
>>
>> diff --git a/drivers/cpufreq/mediatek-cpufreq-hw.c
>> b/drivers/cpufreq/mediatek-cpufreq-hw.c
>> index 813cccbfe934..f0e0a35c7f21 100644
>> --- a/drivers/cpufreq/mediatek-cpufreq-hw.c
>> +++ b/drivers/cpufreq/mediatek-cpufreq-hw.c
>> @@ -51,7 +51,7 @@ static const u16 cpufreq_mtk_offsets[REG_ARRAY_SIZE]
>> = {
>>   };
>
> [ ... ]
>
>> diff --git a/drivers/thermal/cpufreq_cooling.c
>> b/drivers/thermal/cpufreq_cooling.c
>> index b8151d95a806..dc19e7c80751 100644
>> --- a/drivers/thermal/cpufreq_cooling.c
>> +++ b/drivers/thermal/cpufreq_cooling.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/pm_qos.h>
>>   #include <linux/slab.h>
>>   #include <linux/thermal.h>
>> +#include <linux/units.h>
>>   #include <trace/events/thermal.h>
>> @@ -101,6 +102,7 @@ static unsigned long get_level(struct
>> cpufreq_cooling_device *cpufreq_cdev,
>>   static u32 cpu_freq_to_power(struct cpufreq_cooling_device
>> *cpufreq_cdev,
>>                    u32 freq)
>>   {
>> +    unsigned long power_mw;
>>       int i;
>>       for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
>> @@ -108,16 +110,23 @@ static u32 cpu_freq_to_power(struct
>> cpufreq_cooling_device *cpufreq_cdev,
>>               break;
>>       }
>> -    return cpufreq_cdev->em->table[i + 1].power;
>> +    power_mw = cpufreq_cdev->em->table[i + 1].power;
>> +    power_mw /= MICROWATT_PER_MILLIWATT;
>
> Won't this fail with unresolved symbols on some archs? I mean, maybe
> do_div should be used instead?

I've run that code in our internal CI for all archs and it didn't crash.
We already have divisions in IPA and in devfreq_cooling where
the variables are 32-bit, and they work fine.
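
To illustrate why the plain '/' is fine here, a minimal sketch. In the kernel,
do_div() or div_u64() is needed only when the dividend is a 64-bit type on a
32-bit build; 'unsigned long' is register-width on Linux, so dividing it
compiles to a native division. The helper name uw_to_mw() is hypothetical and
the constant is redefined locally so the snippet builds outside the kernel.

```c
#include <stdint.h>

/* Local stand-in for the constant from <linux/units.h>. */
#define MICROWATT_PER_MILLIWATT 1000UL

/*
 * Hypothetical helper mirroring the quoted hunk: take the EM power in
 * micro-Watts and hand milli-Watts back to the thermal code as u32.
 * The dividend is 'unsigned long', so no 64-bit division helper is
 * involved even on 32-bit architectures.
 */
static inline uint32_t uw_to_mw(unsigned long power_uw)
{
	unsigned long power_mw = power_uw / MICROWATT_PER_MILLIWATT;

	return (uint32_t)power_mw;
}
```

Note that the division truncates, so 21961 uW becomes 21 mW on the thermal
side, the same as before the change.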

>
>> +
>> +    return power_mw;
>>   }
>
> [ ... ]


The em_validate_cost() function is in the section that was cut here.
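
For context on the overflow concern, here is a small user-space sketch of the
two em_estimate_energy() orderings from the quoted hunk below. It models a
32-bit 'unsigned long' with uint32_t; the function names are made up for
illustration and this is not the em_validate_cost() code itself.

```c
#include <stdint.h>

/*
 * 64-bit ordering (multiply first): exact, but with 32-bit arithmetic
 * the intermediate cost * sum_util can wrap around.
 */
static inline uint32_t energy_mul_first(uint32_t cost, uint32_t sum_util,
					uint32_t scale_cpu)
{
	return cost * sum_util / scale_cpu;
}

/*
 * 32-bit ordering (divide first): the intermediate stays in range,
 * at the price of dropping the remainder of cost / scale_cpu.
 */
static inline uint32_t energy_div_first(uint32_t cost, uint32_t sum_util,
					uint32_t scale_cpu)
{
	return cost / scale_cpu * sum_util;
}
```

With cost = 8000000, sum_util = 1024 and scale_cpu = 1024, the multiply-first
form wraps in 32 bits and returns 3805696, while the divide-first form returns
7999488, which is close to the exact 8000000.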


>
>>   #ifdef CONFIG_64BIT
>> -#define em_scale_power(p) ((p) * 1000)
>> +#define em_estimate_energy(cost, sum_util, scale_cpu) \
>> +    (((cost) * (sum_util)) / (scale_cpu))
>>   #else
>> -#define em_scale_power(p) (p)
>> +#define em_estimate_energy(cost, sum_util, scale_cpu) \
>> +    (((cost) / (scale_cpu)) * (sum_util))
>>   #endif
>>   struct em_data_callback {
>> @@ -112,7 +143,7 @@ struct em_data_callback {
>>        * and frequency.
>>        *
>>        * In case of CPUs, the power is the one of a single CPU in the
>> domain,
>> -     * expressed in milli-Watts or an abstract scale. It is expected to
>> +     * expressed in micro-Watts or an abstract scale. It is expected to
>>        * fit in the [0, EM_MAX_POWER] range.
>>        *
>>        * Return 0 on success.
>> @@ -148,7 +179,7 @@ struct em_perf_domain *em_cpu_get(int cpu);
>>   struct em_perf_domain *em_pd_get(struct device *dev);
>>   int em_dev_register_perf_domain(struct device *dev, unsigned int
>> nr_states,
>>                   struct em_data_callback *cb, cpumask_t *span,
>> -                bool milliwatts);
>> +                bool microwatts);
>>   void em_dev_unregister_perf_domain(struct device *dev);
>>   /**
>> @@ -273,7 +304,7 @@ static inline unsigned long em_cpu_energy(struct
>> em_perf_domain *pd,
>>        *   pd_nrg = ------------------------                       (4)
>>        *                  scale_cpu
>>        */
>> -    return ps->cost * sum_util / scale_cpu;
>> +    return em_estimate_energy(ps->cost, sum_util, scale_cpu);
>>   }
>>   /**
>> @@ -297,7 +328,7 @@ struct em_data_callback {};
>>   static inline
>>   int em_dev_register_perf_domain(struct device *dev, unsigned int
>> nr_states,
>>                   struct em_data_callback *cb, cpumask_t *span,
>> -                bool milliwatts)
>> +                bool microwatts)
>>   {
>>       return -EINVAL;
>>   }
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 6c373f2960e7..910668ec8838 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -108,10 +108,11 @@ static void em_debug_remove_pd(struct device
>> *dev) {}
>>   static int em_create_perf_table(struct device *dev, struct
>> em_perf_domain *pd,
>>                   int nr_states, struct em_data_callback *cb,
>> -                unsigned long flags)
>> +                unsigned long flags, int num_devs)
>>   {
>>       unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
>>       struct em_perf_state *table;
>> +    unsigned long max_cost = 0;
>>       int i, ret;
>>       u64 fmax;
>> @@ -145,7 +146,7 @@ static int em_create_perf_table(struct device
>> *dev, struct em_perf_domain *pd,
>>           /*
>>            * The power returned by active_state() is expected to be
>> -         * positive and to fit into 16 bits.
>> +         * positive and be in range.
>>            */
>>           if (!power || power > EM_MAX_POWER) {
>>               dev_err(dev, "EM: invalid power: %lu\n",
>> @@ -170,7 +171,7 @@ static int em_create_perf_table(struct device
>> *dev, struct em_perf_domain *pd,
>>                   goto free_ps_table;
>>               }
>>           } else {
>> -            power_res = em_scale_power(table[i].power);
>> +            power_res = table[i].power;
>>               cost = div64_u64(fmax * power_res, table[i].frequency);
>>           }
>> @@ -183,6 +184,15 @@ static int em_create_perf_table(struct device
>> *dev, struct em_perf_domain *pd,
>>           } else {
>>               prev_cost = table[i].cost;
>>           }
>> +
>> +        if (max_cost < table[i].cost)
>> +            max_cost = table[i].cost;
>> +    }
>> +
>> +    /* Check if it won't overflow during energy estimation. */
>> +    if (em_validate_cost(max_cost, num_devs)) {
>
> I'm not finding the em_validate_cost() function

It's in include/linux/energy_model.h

>
>> +        dev_err(dev, "EM: too big 'cost' value: %lu\n",    max_cost);
>> +        goto free_ps_table;
>>       }
>>       pd->table = table;
>> @@ -199,9 +209,9 @@ static int em_create_pd(struct device *dev, int
>> nr_states,
>>               struct em_data_callback *cb, cpumask_t *cpus,
>>               unsigned long flags)
>>   {
>> +    int cpu, ret, num_devs = 1;
>>       struct em_perf_domain *pd;
>>       struct device *cpu_dev;
>> -    int cpu, ret;
>>       if (_is_cpu_device(dev)) {
>>           pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL);
>> @@ -209,13 +219,14 @@ static int em_create_pd(struct device *dev, int
>> nr_states,
>>               return -ENOMEM;
>>           cpumask_copy(em_span_cpus(pd), cpus);
>> +        num_devs = cpumask_weight(cpus);
>
> Why is this change needed? What is the connection with the uW unit
> change?

We still support 32-bit archs with the 'unsigned long power' variable,
but we would now store e.g. 1.2 Watts there as:
power = 1200000 // 0x124f80
not
power = 1200 // 0x4b0

This uses more than 20 bits, as you can see. We then calculate:
cost_i = power_i * fmax / freq_i
which is used by EAS.

The 'cost' value is then used for calculating energy in EAS:

unsigned long energy = (cost * sum_utilization) / cpu_arch_capacity
OR on 32bit machines:
unsigned long energy = (cost / cpu_arch_capacity) * sum_utilization
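
To illustrate the two orderings above, here is a minimal userspace sketch
(not the kernel code). It models the 32-bit 'unsigned long' with uint32_t;
the function names and the sample values mentioned below are assumptions
made up for the example:

```c
#include <stdint.h>

/* 64-bit variant: multiply first, which keeps full precision. */
static uint64_t energy_mul_first_64(uint64_t cost, uint64_t sum_util,
				    uint64_t scale_cpu)
{
	return (cost * sum_util) / scale_cpu;
}

/*
 * Same ordering with 32-bit arithmetic: the intermediate product
 * wraps modulo 2^32 when cost * sum_util does not fit in 32 bits.
 */
static uint32_t energy_mul_first_32(uint32_t cost, uint32_t sum_util,
				    uint32_t scale_cpu)
{
	return (cost * sum_util) / scale_cpu;
}

/*
 * 32-bit variant: divide first, so the intermediate value stays small.
 * The integer division truncates, which costs some precision.
 */
static uint32_t energy_div_first_32(uint32_t cost, uint32_t sum_util,
				    uint32_t scale_cpu)
{
	return (cost / scale_cpu) * sum_util;
}
```

With cost = 1200000, sum_util = 4096 (4 CPUs at utilization 1024) and
scale_cpu = 1024, multiplying first in 32 bits wraps because
1200000 * 4096 > 2^32, while dividing first stays in range at the price
of a small truncation error (4796416 instead of 4800000).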

We must not overflow in any use case, and 'num_devs' is part of this
mechanism, as you can see in this example for 32-bit:
max_possible_cost_for_fmax = 64000000 // 64 Watts
energy = (64000000 / cpu_arch_capacity) *
         (num_cpus * max_cpu_utilization)
// assume: cpu_arch_capacity == max_cpu_utilization, so:
unsigned long energy = 64000000 * num_cpus
Then the question is:
Q: How many CPUs can you have without overflowing?
A: It depends on your max_power and thus on 'cost'.
In the above example:
num_cpus must be < 68 (2^32 / 64000000 ~= 67.1)
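
The bound in the example above can be derived with a small sketch.
em_max_cpus_32bit() is a hypothetical helper name, and uint32_t stands in
for the 32-bit 'unsigned long'; this is only an illustration of the
arithmetic, not the proposed kernel check:

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest num_cpus for which max_cost * num_cpus
 * still fits in a 32-bit unsigned value. A max_cost of 0 imposes no
 * limit, so UINT32_MAX is returned in that case.
 */
static uint32_t em_max_cpus_32bit(uint32_t max_cost)
{
	if (!max_cost)
		return UINT32_MAX;

	return UINT32_MAX / max_cost;
}
```

For max_cost = 64000000 this gives 67, i.e. num_cpus must be < 68.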

I can simplify this by just adding a new define that limits num_cpus
on 32-bit machines, e.g. to 16 for safety:

#ifdef CONFIG_64BIT
#define EM_MAX_NUM_CPUS UINT_MAX
#else
#define EM_MAX_NUM_CPUS 16 /* we don't expect more than that */
#endif

Then there is no need to modify the calculation function
em_create_perf_table().

The more I look at this, the more I'm convinced to do that...

In the old code, the power value was limited to 16 bits and num_cpus
was also limited (IIRC) to 16 bits, thus the multiplication wasn't a
problem.