From: "Gautham R. Shenoy" <[email protected]>
This is a third version of the patch to fix pstate related issues in
the powernv-cpufreq driver.
The previous versions can be found here:
[v2]: https://lkml.org/lkml/2017/12/7/1562
[v1]: https://lkml.org/lkml/2017/11/29/1338
On POWERNV platform, Pstates are 8-bit values. On POWER8 they are
negatively numbered while on POWER9 they are positively
numbered. Thus, on POWER9, the maximum number of pstates could be as
high as 256.
In multiple places, the current code interprets pstates as a signed
8-bit value which subsequently gets assigned to a signed integer
variable. This causes a problem on POWER9 platforms which have more
than 128 pstates. On such systems, on a CPU that is in a lower
pstate whose number is greater than 128, querying the current pstate
via the pstate_to_idx() returns a "pstate X is out of bound" error
message and the current pstate is reported as the nominal
pstate. This is due to the manner in which the bounds are checked in
pstate_to_idx which again depends on the sign of pstates and whether
the pstates max to min are monotonically increasing or decreasing.
Further the current code makes a couple of assumptions which is not
guaranteed by the device-tree bindings:
1) Pstate ids are continguous.
2) Every Pstate should always lie between the max and the min
pstates that are explicitly reported in the device tree.
Both these assumptions are unwarranted and can change on future
platforms.
In this patch-series, we fix the implementation via the following
changes:
PATCH 1: Define a helper function to correctly extract the pstates
from the PMCR and take care of any sign extentions. This is
an immediate fix to add the ability to handle more than 128
pstates on POWER9 systems.
PATCH 2: Define a hash-map which will return the index into the
cpufreq frequency table for a given pstate. Use this hashmap
in the implementation of pstate_to_idx(). This does away with
the assumptions (1) mentioned above, and will work with non
continguous pstate ids. If no entry exists for a particular
pstate, then such a pstate is treated as being out of
bounds. This gets rid of assumption (2).
PATCH 3: Treat pstates as opaque 8-bit values consistent with the
definitions in the PMSR and PMCR. We no longer need any
sign-extentions nor do we require to interpret the sign of
the pstates anywhere in the code.
Gautham R. Shenoy (3):
powernv-cpufreq: Add helper to extract pstate from PMSR
powernv-cpufreq: Fix pstate_to_idx() to handle non-continguous pstates
powernv-cpufreq: Treat pstates as opaque 8-bit values
drivers/cpufreq/powernv-cpufreq.c | 139 ++++++++++++++++++++++++--------------
1 file changed, 90 insertions(+), 49 deletions(-)
--
1.9.4
From: "Gautham R. Shenoy" <[email protected]>
On POWERNV platform, the fields for pstates in the Power Management
Status Register (PMSR) and the Power Management Control Register
(PMCR) are 8-bits wide. On POWER8 the pstates are negatively numbered
while on POWER9 they are positively numbered.
The device-tree exports pstates as 32-bit entries. The device-tree
implementation sign-extends the 8-bit pstate values to obtain the
corresponding 32-bit entry.
Eg: On POWER8, a pstate value 0x82 [-126] is represented in the
device-tree as 0xfffffff82 while on POWER9, the same value 0x82 [130]
is represented in the device-tree as 0x00000082.
The powernv-cpufreq driver implementation represents pstates using the
integer type. In multiple places in the driver, the code interprets
the pstates extracted from the PMSR as a signed byte and assigns it to
a integer variable to get the sign-extention.
On POWER9 platforms which have greater than 128 pstates, this results
in the driver performing incorrect sign-extention, and thereby
treating a legitimate pstate (say 130) as an invalid pstates (since it
is interpreted as -126).
This patch fixes the issue by implementing a helper function to
extract Pstates from PMSR register, and correctly sign-extend it to be
consistent with the values provided by the device-tree.
Signed-off-by: Gautham R. Shenoy <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 37 +++++++++++++++++++++++--------------
1 file changed, 23 insertions(+), 14 deletions(-)
diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index b6d7c4c..f46b60f 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -41,11 +41,9 @@
#define POWERNV_MAX_PSTATES 256
#define PMSR_PSAFE_ENABLE (1UL << 30)
#define PMSR_SPR_EM_DISABLE (1UL << 31)
-#define PMSR_MAX(x) ((x >> 32) & 0xFF)
+#define MAX_PSTATE_SHIFT 32
#define LPSTATE_SHIFT 48
#define GPSTATE_SHIFT 56
-#define GET_LPSTATE(x) (((x) >> LPSTATE_SHIFT) & 0xFF)
-#define GET_GPSTATE(x) (((x) >> GPSTATE_SHIFT) & 0xFF)
#define MAX_RAMP_DOWN_TIME 5120
/*
@@ -94,6 +92,7 @@ struct global_pstate_info {
};
static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+u32 pstate_sign_prefix;
static bool rebooting, throttled, occ_reset;
static const char * const throttle_reason[] = {
@@ -148,6 +147,20 @@ enum throttle_reason_type {
bool wof_enabled;
} powernv_pstate_info;
+static inline int extract_pstate(u64 pmsr_val, unsigned int shift)
+{
+ int ret = ((pmsr_val >> shift) & 0xFF);
+
+ if (!ret)
+ return ret;
+
+ return (pstate_sign_prefix | ret);
+}
+
+#define extract_local_pstate(x) extract_pstate(x, LPSTATE_SHIFT)
+#define extract_global_pstate(x) extract_pstate(x, GPSTATE_SHIFT)
+#define extract_max_pstate(x) extract_pstate(x, MAX_PSTATE_SHIFT)
+
/* Use following macros for conversions between pstate_id and index */
static inline int idx_to_pstate(unsigned int i)
{
@@ -278,6 +291,9 @@ static int init_powernv_pstates(void)
powernv_pstate_info.nr_pstates = nr_pstates;
pr_debug("NR PStates %d\n", nr_pstates);
+
+ pstate_sign_prefix = pstate_min & ~0xFF;
+
for (i = 0; i < nr_pstates; i++) {
u32 id = be32_to_cpu(pstate_ids[i]);
u32 freq = be32_to_cpu(pstate_freqs[i]);
@@ -438,17 +454,10 @@ struct powernv_smp_call_data {
static void powernv_read_cpu_freq(void *arg)
{
unsigned long pmspr_val;
- s8 local_pstate_id;
struct powernv_smp_call_data *freq_data = arg;
pmspr_val = get_pmspr(SPRN_PMSR);
-
- /*
- * The local pstate id corresponds bits 48..55 in the PMSR.
- * Note: Watch out for the sign!
- */
- local_pstate_id = (pmspr_val >> 48) & 0xFF;
- freq_data->pstate_id = local_pstate_id;
+ freq_data->pstate_id = extract_local_pstate(pmspr_val);
freq_data->freq = pstate_id_to_freq(freq_data->pstate_id);
pr_debug("cpu %d pmsr %016lX pstate_id %d frequency %d kHz\n",
@@ -522,7 +531,7 @@ static void powernv_cpufreq_throttle_check(void *data)
chip = this_cpu_read(chip_info);
/* Check for Pmax Capping */
- pmsr_pmax = (s8)PMSR_MAX(pmsr);
+ pmsr_pmax = extract_max_pstate(pmsr);
pmsr_pmax_idx = pstate_to_idx(pmsr_pmax);
if (pmsr_pmax_idx != powernv_pstate_info.max) {
if (chip->throttled)
@@ -645,8 +654,8 @@ void gpstate_timer_handler(struct timer_list *t)
* value. Hence, read from PMCR to get correct data.
*/
val = get_pmspr(SPRN_PMCR);
- freq_data.gpstate_id = (s8)GET_GPSTATE(val);
- freq_data.pstate_id = (s8)GET_LPSTATE(val);
+ freq_data.gpstate_id = extract_global_pstate(val);
+ freq_data.pstate_id = extract_local_pstate(val);
if (freq_data.gpstate_id == freq_data.pstate_id) {
reset_gpstates(policy);
spin_unlock(&gpstates->gpstate_lock);
--
1.9.4
From: "Gautham R. Shenoy" <[email protected]>
On POWER8 and POWER9, the PMSR and the PMCR registers define pstates
to be 8-bit wide values. The device-tree exports pstates as 32-bit
wide values of which the lower byte is the actual pstate.
The current implementation in the kernel treats pstates as integer
type, since it used to use the sign of the pstate for performing some
boundary-checks. This is no longer required after the patch
"powernv-cpufreq: Fix pstate_to_idx() to handle non-continguous
pstates".
So, in this patch, we modify the powernv-cpufreq driver to uniformly
treat pstates as opaque 8-bit values obtained from the device-tree or
the PMCR. This simplifies the extract_pstate() helper function since
we no longer no longer require to worry about the sign-extentions.
Signed-off-by: Gautham R. Shenoy <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 47 ++++++++++++++++-----------------------
1 file changed, 19 insertions(+), 28 deletions(-)
diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 8e3dbca..8a4e2ce 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -110,12 +110,11 @@ struct global_pstate_info {
* hashtable
*/
struct pstate_idx_revmap_data {
- int pstate_id;
+ u8 pstate_id;
unsigned int cpufreq_table_idx;
struct hlist_node hentry;
};
-u32 pstate_sign_prefix;
static bool rebooting, throttled, occ_reset;
static const char * const throttle_reason[] = {
@@ -170,14 +169,9 @@ enum throttle_reason_type {
bool wof_enabled;
} powernv_pstate_info;
-static inline int extract_pstate(u64 pmsr_val, unsigned int shift)
+static inline u8 extract_pstate(u64 pmsr_val, unsigned int shift)
{
- int ret = ((pmsr_val >> shift) & 0xFF);
-
- if (!ret)
- return ret;
-
- return (pstate_sign_prefix | ret);
+ return ((pmsr_val >> shift) & 0xFF);
}
#define extract_local_pstate(x) extract_pstate(x, LPSTATE_SHIFT)
@@ -194,7 +188,7 @@ static inline int extract_pstate(u64 pmsr_val, unsigned int shift)
* If @i is out of bound, this will return the pstate
* corresponding to the nominal frequency.
*/
-static inline int idx_to_pstate(unsigned int i)
+static inline u8 idx_to_pstate(unsigned int i)
{
if (unlikely(i >= powernv_pstate_info.nr_pstates)) {
pr_warn_once("idx_to_pstate: index %u is out of bound\n", i);
@@ -213,7 +207,7 @@ static inline int idx_to_pstate(unsigned int i)
* this will return the index of the nominal
* frequency.
*/
-static unsigned int pstate_to_idx(int pstate)
+static unsigned int pstate_to_idx(u8 pstate)
{
unsigned int key = pstate % POWERNV_MAX_PSTATES;
struct pstate_idx_revmap_data *revmap_data;
@@ -223,7 +217,7 @@ static unsigned int pstate_to_idx(int pstate)
return revmap_data->cpufreq_table_idx;
}
- pr_warn_once("pstate_to_idx: pstate %d not found\n", pstate);
+ pr_warn_once("pstate_to_idx: pstate 0x%x not found\n", pstate);
return powernv_pstate_info.nominal;
}
@@ -291,7 +285,7 @@ static int init_powernv_pstates(void)
powernv_pstate_info.wof_enabled = true;
next:
- pr_info("cpufreq pstate min %d nominal %d max %d\n", pstate_min,
+ pr_info("cpufreq pstate min 0x%x nominal 0x%x max 0x%x\n", pstate_min,
pstate_nominal, pstate_max);
pr_info("Workload Optimized Frequency is %s in the platform\n",
(powernv_pstate_info.wof_enabled) ? "enabled" : "disabled");
@@ -323,8 +317,6 @@ static int init_powernv_pstates(void)
powernv_pstate_info.nr_pstates = nr_pstates;
pr_debug("NR PStates %d\n", nr_pstates);
- pstate_sign_prefix = pstate_min & ~0xFF;
-
for (i = 0; i < nr_pstates; i++) {
u32 id = be32_to_cpu(pstate_ids[i]);
u32 freq = be32_to_cpu(pstate_freqs[i]);
@@ -333,14 +325,14 @@ static int init_powernv_pstates(void)
pr_debug("PState id %d freq %d MHz\n", id, freq);
powernv_freqs[i].frequency = freq * 1000; /* kHz */
- powernv_freqs[i].driver_data = id;
+ powernv_freqs[i].driver_data = id & 0xFF;
revmap_data = (struct pstate_idx_revmap_data *)
kmalloc(sizeof(*revmap_data), GFP_KERNEL);
- revmap_data->pstate_id = id;
+ revmap_data->pstate_id = id & 0xFF;
revmap_data->cpufreq_table_idx = i;
- key = id % POWERNV_MAX_PSTATES;
+ key = (revmap_data->pstate_id) % POWERNV_MAX_PSTATES;
hash_add(pstate_revmap, &revmap_data->hentry, key);
if (id == pstate_max)
@@ -364,14 +356,13 @@ static int init_powernv_pstates(void)
}
/* Returns the CPU frequency corresponding to the pstate_id. */
-static unsigned int pstate_id_to_freq(int pstate_id)
+static unsigned int pstate_id_to_freq(u8 pstate_id)
{
int i;
i = pstate_to_idx(pstate_id);
if (i >= powernv_pstate_info.nr_pstates || i < 0) {
- pr_warn("PState id %d outside of PState table, "
- "reporting nominal id %d instead\n",
+ pr_warn("PState id 0x%x outside of PState table, reporting nominal id 0x%x instead\n",
pstate_id, idx_to_pstate(powernv_pstate_info.nominal));
i = powernv_pstate_info.nominal;
}
@@ -477,8 +468,8 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val)
*/
struct powernv_smp_call_data {
unsigned int freq;
- int pstate_id;
- int gpstate_id;
+ u8 pstate_id;
+ u8 gpstate_id;
};
/*
@@ -501,9 +492,9 @@ static void powernv_read_cpu_freq(void *arg)
freq_data->pstate_id = extract_local_pstate(pmspr_val);
freq_data->freq = pstate_id_to_freq(freq_data->pstate_id);
- pr_debug("cpu %d pmsr %016lX pstate_id %d frequency %d kHz\n",
- raw_smp_processor_id(), pmspr_val, freq_data->pstate_id,
- freq_data->freq);
+ pr_debug("cpu %d pmsr %016lX pstate_id 0x%x frequency %d kHz\n",
+ raw_smp_processor_id(), pmspr_val, freq_data->pstate_id,
+ freq_data->freq);
}
/*
@@ -565,7 +556,7 @@ static void powernv_cpufreq_throttle_check(void *data)
struct chip *chip;
unsigned int cpu = smp_processor_id();
unsigned long pmsr;
- int pmsr_pmax;
+ u8 pmsr_pmax;
unsigned int pmsr_pmax_idx;
pmsr = get_pmspr(SPRN_PMSR);
@@ -579,7 +570,7 @@ static void powernv_cpufreq_throttle_check(void *data)
goto next;
chip->throttled = true;
if (pmsr_pmax_idx > powernv_pstate_info.nominal) {
- pr_warn_once("CPU %d on Chip %u has Pmax(%d) reduced below nominal frequency(%d)\n",
+ pr_warn_once("CPU %d on Chip %u has Pmax(0x%x) reduced below that of nominal frequency(0x%x)\n",
cpu, chip->id, pmsr_pmax,
idx_to_pstate(powernv_pstate_info.nominal));
chip->throttle_sub_turbo++;
--
1.9.4
From: "Gautham R. Shenoy" <[email protected]>
The code in powernv-cpufreq, makes the following two assumptions which
are not guaranteed by the device-tree bindings:
1) Pstate ids are continguous: This is used in pstate_to_idx() to
obtain the reverse map from a pstate to it's corresponding
entry into the cpufreq frequency table.
2) Every Pstate should always lie between the max and the min
pstates that are explicitly reported in the device tree: This
is used to determine whether a pstate reported by the PMSR is
out of bounds.
Both these assumptions are unwarranted and can change on future
platforms.
In this patch, we maintain the reverse map from a pstate to it's index
in the cpufreq frequency table and use this in pstate_to_idx(). This
does away with the assumptions (1) mentioned above, and will work with
non continguous pstate ids. If no entry exists for a particular
pstate, then such a pstate is treated as being out of bounds. This
gets rid of assumption (2).
On all the existing platforms, where the pstates are 8-bit long
values, the new implementation of pstate_to_idx() takes constant time.
Signed-off-by: Gautham R. Shenoy <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 85 +++++++++++++++++++++++++++++----------
1 file changed, 63 insertions(+), 22 deletions(-)
diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index f46b60f..8e3dbca 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -29,6 +29,7 @@
#include <linux/reboot.h>
#include <linux/slab.h>
#include <linux/cpu.h>
+#include <linux/hashtable.h>
#include <trace/events/power.h>
#include <asm/cputhreads.h>
@@ -38,7 +39,8 @@
#include <asm/opal.h>
#include <linux/timer.h>
-#define POWERNV_MAX_PSTATES 256
+#define POWERNV_MAX_PSTATES_ORDER 8
+#define POWERNV_MAX_PSTATES (1UL << (POWERNV_MAX_PSTATES_ORDER))
#define PMSR_PSAFE_ENABLE (1UL << 30)
#define PMSR_SPR_EM_DISABLE (1UL << 31)
#define MAX_PSTATE_SHIFT 32
@@ -92,6 +94,27 @@ struct global_pstate_info {
};
static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
+
+DEFINE_HASHTABLE(pstate_revmap, POWERNV_MAX_PSTATES_ORDER);
+/**
+ * struct pstate_idx_revmap_data: Entry in the hashmap pstate_revmap
+ * indexed by a function of pstate id.
+ *
+ * @pstate_id: pstate id for this entry.
+ *
+ * @cpufreq_table_idx: Index into the powernv_freqs
+ * cpufreq_frequency_table for frequency
+ * corresponding to pstate_id.
+ *
+ * @hentry: hlist_node that hooks this entry into the pstate_revmap
+ * hashtable
+ */
+struct pstate_idx_revmap_data {
+ int pstate_id;
+ unsigned int cpufreq_table_idx;
+ struct hlist_node hentry;
+};
+
u32 pstate_sign_prefix;
static bool rebooting, throttled, occ_reset;
@@ -161,39 +184,47 @@ static inline int extract_pstate(u64 pmsr_val, unsigned int shift)
#define extract_global_pstate(x) extract_pstate(x, GPSTATE_SHIFT)
#define extract_max_pstate(x) extract_pstate(x, MAX_PSTATE_SHIFT)
-/* Use following macros for conversions between pstate_id and index */
+/* Use following functions for conversions between pstate_id and index */
+
+/**
+ * idx_to_pstate : Returns the pstate id corresponding to the
+ * frequency in the cpufreq frequency table
+ * powernv_freqs indexed by @i.
+ *
+ * If @i is out of bound, this will return the pstate
+ * corresponding to the nominal frequency.
+ */
static inline int idx_to_pstate(unsigned int i)
{
if (unlikely(i >= powernv_pstate_info.nr_pstates)) {
- pr_warn_once("index %u is out of bound\n", i);
+ pr_warn_once("idx_to_pstate: index %u is out of bound\n", i);
return powernv_freqs[powernv_pstate_info.nominal].driver_data;
}
return powernv_freqs[i].driver_data;
}
-static inline unsigned int pstate_to_idx(int pstate)
+/**
+ * pstate_to_idx : Returns the index in the cpufreq frequencytable
+ * powernv_freqs for the frequency whose corresponding
+ * pstate id is @pstate.
+ *
+ * If no frequency corresponding to @pstate is found,
+ * this will return the index of the nominal
+ * frequency.
+ */
+static unsigned int pstate_to_idx(int pstate)
{
- int min = powernv_freqs[powernv_pstate_info.min].driver_data;
- int max = powernv_freqs[powernv_pstate_info.max].driver_data;
+ unsigned int key = pstate % POWERNV_MAX_PSTATES;
+ struct pstate_idx_revmap_data *revmap_data;
- if (min > 0) {
- if (unlikely((pstate < max) || (pstate > min))) {
- pr_warn_once("pstate %d is out of bound\n", pstate);
- return powernv_pstate_info.nominal;
- }
- } else {
- if (unlikely((pstate > max) || (pstate < min))) {
- pr_warn_once("pstate %d is out of bound\n", pstate);
- return powernv_pstate_info.nominal;
- }
+ hash_for_each_possible(pstate_revmap, revmap_data, hentry, key) {
+ if (revmap_data->pstate_id == pstate)
+ return revmap_data->cpufreq_table_idx;
}
- /*
- * abs() is deliberately used so that is works with
- * both monotonically increasing and decreasing
- * pstate values
- */
- return abs(pstate - idx_to_pstate(powernv_pstate_info.max));
+
+ pr_warn_once("pstate_to_idx: pstate %d not found\n", pstate);
+ return powernv_pstate_info.nominal;
}
static inline void reset_gpstates(struct cpufreq_policy *policy)
@@ -297,11 +328,21 @@ static int init_powernv_pstates(void)
for (i = 0; i < nr_pstates; i++) {
u32 id = be32_to_cpu(pstate_ids[i]);
u32 freq = be32_to_cpu(pstate_freqs[i]);
+ struct pstate_idx_revmap_data *revmap_data;
+ unsigned int key;
pr_debug("PState id %d freq %d MHz\n", id, freq);
powernv_freqs[i].frequency = freq * 1000; /* kHz */
powernv_freqs[i].driver_data = id;
+ revmap_data = (struct pstate_idx_revmap_data *)
+ kmalloc(sizeof(*revmap_data), GFP_KERNEL);
+
+ revmap_data->pstate_id = id;
+ revmap_data->cpufreq_table_idx = i;
+ key = id % POWERNV_MAX_PSTATES;
+ hash_add(pstate_revmap, &revmap_data->hentry, key);
+
if (id == pstate_max)
powernv_pstate_info.max = i;
else if (id == pstate_nominal)
--
1.9.4
On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
<[email protected]> wrote:
> From: "Gautham R. Shenoy" <[email protected]>
>
> On POWERNV platform, the fields for pstates in the Power Management
> Status Register (PMSR) and the Power Management Control Register
> (PMCR) are 8-bits wide. On POWER8 the pstates are negatively numbered
> while on POWER9 they are positively numbered.
>
> The device-tree exports pstates as 32-bit entries. The device-tree
> implementation sign-extends the 8-bit pstate values to obtain the
> corresponding 32-bit entry.
>
> Eg: On POWER8, a pstate value 0x82 [-126] is represented in the
> device-tree as 0xfffffff82 while on POWER9, the same value 0x82 [130]
> is represented in the device-tree as 0x00000082.
>
> The powernv-cpufreq driver implementation represents pstates using the
> integer type. In multiple places in the driver, the code interprets
> the pstates extracted from the PMSR as a signed byte and assigns it to
> a integer variable to get the sign-extention.
>
> On POWER9 platforms which have greater than 128 pstates, this results
> in the driver performing incorrect sign-extention, and thereby
> treating a legitimate pstate (say 130) as an invalid pstates (since it
> is interpreted as -126).
>
> This patch fixes the issue by implementing a helper function to
> extract Pstates from PMSR register, and correctly sign-extend it to be
> consistent with the values provided by the device-tree.
>
> Signed-off-by: Gautham R. Shenoy <[email protected]>
> ---
This looks better
Acked-by: Balbir Singh <[email protected]>
Balbir Singh
On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
<[email protected]> wrote:
> From: "Gautham R. Shenoy" <[email protected]>
>
> The code in powernv-cpufreq, makes the following two assumptions which
> are not guaranteed by the device-tree bindings:
>
> 1) Pstate ids are continguous: This is used in pstate_to_idx() to
> obtain the reverse map from a pstate to it's corresponding
> entry into the cpufreq frequency table.
>
> 2) Every Pstate should always lie between the max and the min
> pstates that are explicitly reported in the device tree: This
> is used to determine whether a pstate reported by the PMSR is
> out of bounds.
>
> Both these assumptions are unwarranted and can change on future
> platforms.
While this is a good thing, I wonder if it is worth the complexity. Pstates
are contiguous because they define transitions in incremental value
of change in frequency and I can't see how this can be broken in the
future?
Balbir Singh.
On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
<[email protected]> wrote:
> From: "Gautham R. Shenoy" <[email protected]>
>
> On POWER8 and POWER9, the PMSR and the PMCR registers define pstates
> to be 8-bit wide values. The device-tree exports pstates as 32-bit
> wide values of which the lower byte is the actual pstate.
>
> The current implementation in the kernel treats pstates as integer
> type, since it used to use the sign of the pstate for performing some
> boundary-checks. This is no longer required after the patch
> "powernv-cpufreq: Fix pstate_to_idx() to handle non-continguous
> pstates".
>
> So, in this patch, we modify the powernv-cpufreq driver to uniformly
> treat pstates as opaque 8-bit values obtained from the device-tree or
> the PMCR. This simplifies the extract_pstate() helper function since
> we no longer no longer require to worry about the sign-extentions.
>
> Signed-off-by: Gautham R. Shenoy <[email protected]>
> ---
> drivers/cpufreq/powernv-cpufreq.c | 47 ++++++++++++++++-----------------------
> 1 file changed, 19 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 8e3dbca..8a4e2ce 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -110,12 +110,11 @@ struct global_pstate_info {
> * hashtable
> */
> struct pstate_idx_revmap_data {
> - int pstate_id;
> + u8 pstate_id;
> unsigned int cpufreq_table_idx;
> struct hlist_node hentry;
> };
>
> -u32 pstate_sign_prefix;
> static bool rebooting, throttled, occ_reset;
>
> static const char * const throttle_reason[] = {
> @@ -170,14 +169,9 @@ enum throttle_reason_type {
> bool wof_enabled;
> } powernv_pstate_info;
>
> -static inline int extract_pstate(u64 pmsr_val, unsigned int shift)
> +static inline u8 extract_pstate(u64 pmsr_val, unsigned int shift)
> {
> - int ret = ((pmsr_val >> shift) & 0xFF);
> -
> - if (!ret)
> - return ret;
> -
> - return (pstate_sign_prefix | ret);
> + return ((pmsr_val >> shift) & 0xFF);
> }
So we just added this and moved from an int to u8. I was going to ask
if we still need an int in patch1, but I thought the driver dealt with
just integers because of the larger framework.
Balbir Singh.
Hi Balbir,
On Sun, Dec 17, 2017 at 02:15:25PM +1100, Balbir Singh wrote:
> On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
> <[email protected]> wrote:
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > The code in powernv-cpufreq, makes the following two assumptions which
> > are not guaranteed by the device-tree bindings:
> >
> > 1) Pstate ids are continguous: This is used in pstate_to_idx() to
> > obtain the reverse map from a pstate to it's corresponding
> > entry into the cpufreq frequency table.
> >
> > 2) Every Pstate should always lie between the max and the min
> > pstates that are explicitly reported in the device tree: This
> > is used to determine whether a pstate reported by the PMSR is
> > out of bounds.
> >
> > Both these assumptions are unwarranted and can change on future
> > platforms.
>
> While this is a good thing, I wonder if it is worth the complexity. Pstates
> are contiguous because they define transitions in incremental value
> of change in frequency and I can't see how this can be broken in the
> future?
In the future, we can have the OPAL firmware give us a smaller set of
pstates instead of expose every one of them. As it stands today, for
most of the workloads, we will need at best 20-30 pstates and not
beyond that.
>
> Balbir Singh.
>
--
Thanks and Regards
gautham.
Hi Balbir,
On Sun, Dec 17, 2017 at 02:17:02PM +1100, Balbir Singh wrote:
> On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
[..snip..]
> >
> > -static inline int extract_pstate(u64 pmsr_val, unsigned int shift)
> > +static inline u8 extract_pstate(u64 pmsr_val, unsigned int shift)
> > {
> > - int ret = ((pmsr_val >> shift) & 0xFF);
> > -
> > - if (!ret)
> > - return ret;
> > -
> > - return (pstate_sign_prefix | ret);
> > + return ((pmsr_val >> shift) & 0xFF);
> > }
>
> So we just added this and moved from an int to u8. I was going to ask
> if we still need an int in patch1, but I thought the driver dealt with
> just integers because of the larger framework.
The larger framework is with respect to the device tree which defines
pstates as 32-bit integers (I am not aware of the reasons for this
choice, but perhaps device-tree doesn't have a s8/u8 type?!). The
driver still knows that pstates are only 8-bits wide because of the
PMSR,PMCR definitions. Before this patch, the driver was still doing
the conversions from 8-bit to int and vice-versa every time. With this
patch, we do it only once, i.e at the initialization. After that we
treat pstates as 8-bit integers.
>
> Balbir Singh.
>
Hi Balbir,
On Sun, Dec 17, 2017 at 02:04:03PM +1100, Balbir Singh wrote:
> On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
> <[email protected]> wrote:
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > On POWERNV platform, the fields for pstates in the Power Management
> > Status Register (PMSR) and the Power Management Control Register
> > (PMCR) are 8-bits wide. On POWER8 the pstates are negatively numbered
> > while on POWER9 they are positively numbered.
> >
> > The device-tree exports pstates as 32-bit entries. The device-tree
> > implementation sign-extends the 8-bit pstate values to obtain the
> > corresponding 32-bit entry.
> >
> > Eg: On POWER8, a pstate value 0x82 [-126] is represented in the
> > device-tree as 0xfffffff82 while on POWER9, the same value 0x82 [130]
> > is represented in the device-tree as 0x00000082.
> >
> > The powernv-cpufreq driver implementation represents pstates using the
> > integer type. In multiple places in the driver, the code interprets
> > the pstates extracted from the PMSR as a signed byte and assigns it to
> > a integer variable to get the sign-extention.
> >
> > On POWER9 platforms which have greater than 128 pstates, this results
> > in the driver performing incorrect sign-extention, and thereby
> > treating a legitimate pstate (say 130) as an invalid pstates (since it
> > is interpreted as -126).
> >
> > This patch fixes the issue by implementing a helper function to
> > extract Pstates from PMSR register, and correctly sign-extend it to be
> > consistent with the values provided by the device-tree.
> >
> > Signed-off-by: Gautham R. Shenoy <[email protected]>
> > ---
>
> This looks better
Thanks for the Review.
>
> Acked-by: Balbir Singh <[email protected]>
>
> Balbir Singh
>
On Monday, December 18, 2017 9:38:20 AM CET Gautham R Shenoy wrote:
> Hi Balbir,
>
> On Sun, Dec 17, 2017 at 02:15:25PM +1100, Balbir Singh wrote:
> > On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
> > <[email protected]> wrote:
> > > From: "Gautham R. Shenoy" <[email protected]>
> > >
> > > The code in powernv-cpufreq, makes the following two assumptions which
> > > are not guaranteed by the device-tree bindings:
> > >
> > > 1) Pstate ids are continguous: This is used in pstate_to_idx() to
> > > obtain the reverse map from a pstate to it's corresponding
> > > entry into the cpufreq frequency table.
> > >
> > > 2) Every Pstate should always lie between the max and the min
> > > pstates that are explicitly reported in the device tree: This
> > > is used to determine whether a pstate reported by the PMSR is
> > > out of bounds.
> > >
> > > Both these assumptions are unwarranted and can change on future
> > > platforms.
> >
> > While this is a good thing, I wonder if it is worth the complexity. Pstates
> > are contiguous because they define transitions in incremental value
> > of change in frequency and I can't see how this can be broken in the
> > future?
>
> In the future, we can have the OPAL firmware give us a smaller set of
> pstates instead of expose every one of them. As it stands today, for
> most of the workloads, we will need at best 20-30 pstates and not
> beyond that.
I'm not sure about the status here.
Is this good to go as is or is it going to be updated?
Thanks,
Rafael
On Wed, Jan 3, 2018 at 11:07 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Monday, December 18, 2017 9:38:20 AM CET Gautham R Shenoy wrote:
>> Hi Balbir,
>>
>> On Sun, Dec 17, 2017 at 02:15:25PM +1100, Balbir Singh wrote:
>> > On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
>> > <[email protected]> wrote:
>> > > From: "Gautham R. Shenoy" <[email protected]>
>> > >
>> > > The code in powernv-cpufreq, makes the following two assumptions which
>> > > are not guaranteed by the device-tree bindings:
>> > >
>> > > 1) Pstate ids are continguous: This is used in pstate_to_idx() to
>> > > obtain the reverse map from a pstate to it's corresponding
>> > > entry into the cpufreq frequency table.
>> > >
>> > > 2) Every Pstate should always lie between the max and the min
>> > > pstates that are explicitly reported in the device tree: This
>> > > is used to determine whether a pstate reported by the PMSR is
>> > > out of bounds.
>> > >
>> > > Both these assumptions are unwarranted and can change on future
>> > > platforms.
>> >
>> > While this is a good thing, I wonder if it is worth the complexity. Pstates
>> > are contiguous because they define transitions in incremental value
>> > of change in frequency and I can't see how this can be broken in the
>> > future?
>>
>> In the future, we can have the OPAL firmware give us a smaller set of
>> pstates instead of expose every one of them. As it stands today, for
>> most of the workloads, we will need at best 20-30 pstates and not
>> beyond that.
>
> I'm not sure about the status here.
>
> Is this good to go as is or is it going to be updated?
>
I have no major objections, except some of the added complexity, but
Gautham makes a point that this is refactoring for the future
Balbir Singh.
Hi Rafael,
On Wed, Jan 03, 2018 at 11:47:58PM +1100, Balbir Singh wrote:
> On Wed, Jan 3, 2018 at 11:07 PM, Rafael J. Wysocki <[email protected]> wrote:
> > On Monday, December 18, 2017 9:38:20 AM CET Gautham R Shenoy wrote:
> >> Hi Balbir,
> >>
> >> On Sun, Dec 17, 2017 at 02:15:25PM +1100, Balbir Singh wrote:
> >> > On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
> >> > <[email protected]> wrote:
> >> > > From: "Gautham R. Shenoy" <[email protected]>
> >> > >
> >> > > The code in powernv-cpufreq, makes the following two assumptions which
> >> > > are not guaranteed by the device-tree bindings:
> >> > >
> >> > > 1) Pstate ids are continguous: This is used in pstate_to_idx() to
> >> > > obtain the reverse map from a pstate to it's corresponding
> >> > > entry into the cpufreq frequency table.
> >> > >
> >> > > 2) Every Pstate should always lie between the max and the min
> >> > > pstates that are explicitly reported in the device tree: This
> >> > > is used to determine whether a pstate reported by the PMSR is
> >> > > out of bounds.
> >> > >
> >> > > Both these assumptions are unwarranted and can change on future
> >> > > platforms.
> >> >
> >> > While this is a good thing, I wonder if it is worth the complexity. Pstates
> >> > are contiguous because they define transitions in incremental value
> >> > of change in frequency and I can't see how this can be broken in the
> >> > future?
> >>
> >> In the future, we can have the OPAL firmware give us a smaller set of
> >> pstates instead of expose every one of them. As it stands today, for
> >> most of the workloads, we will need at best 20-30 pstates and not
> >> beyond that.
> >
> > I'm not sure about the status here.
> >
> > Is this good to go as is or is it going to be updated?
> >
>
> I have no major objections, except some of the added complexity, but
> Gautham makes a point that this is refactoring for the future
I have tested this across POWER8 and POWER9. The additional complexity
introduced by the second patch is required for the future when we are
going to reduce the number of pstates.
>
> Balbir Singh.
>
--
Thanks and Regards
gautham.
On 13-12-17, 12:27, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <[email protected]>
>
> This is a third version of the patch to fix pstate related issues in
> the powernv-cpufreq driver.
Acked-by: Viresh Kumar <[email protected]>
--
viresh
On Wednesday, January 10, 2018 9:55:45 AM CET Gautham R Shenoy wrote:
> Hi Rafael,
>
> On Wed, Jan 03, 2018 at 11:47:58PM +1100, Balbir Singh wrote:
> > On Wed, Jan 3, 2018 at 11:07 PM, Rafael J. Wysocki <[email protected]> wrote:
> > > On Monday, December 18, 2017 9:38:20 AM CET Gautham R Shenoy wrote:
> > >> Hi Balbir,
> > >>
> > >> On Sun, Dec 17, 2017 at 02:15:25PM +1100, Balbir Singh wrote:
> > >> > On Wed, Dec 13, 2017 at 5:57 PM, Gautham R. Shenoy
> > >> > <[email protected]> wrote:
> > >> > > From: "Gautham R. Shenoy" <[email protected]>
> > >> > >
> > >> > > The code in powernv-cpufreq, makes the following two assumptions which
> > >> > > are not guaranteed by the device-tree bindings:
> > >> > >
> > >> > > 1) Pstate ids are continguous: This is used in pstate_to_idx() to
> > >> > > obtain the reverse map from a pstate to it's corresponding
> > >> > > entry into the cpufreq frequency table.
> > >> > >
> > >> > > 2) Every Pstate should always lie between the max and the min
> > >> > > pstates that are explicitly reported in the device tree: This
> > >> > > is used to determine whether a pstate reported by the PMSR is
> > >> > > out of bounds.
> > >> > >
> > >> > > Both these assumptions are unwarranted and can change on future
> > >> > > platforms.
> > >> >
> > >> > While this is a good thing, I wonder if it is worth the complexity. Pstates
> > >> > are contiguous because they define transitions in incremental value
> > >> > of change in frequency and I can't see how this can be broken in the
> > >> > future?
> > >>
> > >> In the future, we can have the OPAL firmware give us a smaller set of
> > >> pstates instead of expose every one of them. As it stands today, for
> > >> most of the workloads, we will need at best 20-30 pstates and not
> > >> beyond that.
> > >
> > > I'm not sure about the status here.
> > >
> > > Is this good to go as is or is it going to be updated?
> > >
> >
> > I have no major objections, except some of the added complexity, but
> > Gautham makes a point that this is refactoring for the future
>
> I have tested this across POWER8 and POWER9. The additional complexity
> introduced by the second patch is required for the future when we are
> going to reduce the number of pstates.
I have applied the series, thanks!