2017-07-05 16:39:04

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH 0/5] powernv:idle: Cleanup idle states initialization

From: "Gautham R. Shenoy" <[email protected]>

Hi,

This patch set aims at cleaning up the powernv idle initialization
code mainly covering the following

a) Currently there is redundant code for parsing the device-tree for
idle states. We do it in two places, once during the platform idle
initialization, once more when the cpidle driver initializes. In
this patchset the device-tree is parsed only once and we maintain
an in-kernel data structure with the details of each platform idle
state. The cpu-idle initialization code looks at this data
structure for initializing cpuidle states. This makes the cpuidle
driver initialization more streamlined.

b) Currently the idle initialzation code for power8 and power9 are
mixed up. In this patchset we segregate them into their respective
functions for improved readability.

c) The current code has a bug when the Sleep-Winkle-Engine is unable
to restore the hypervisor states for the deep idle states that lose
full hypervisor context, since in such cases we don't disable such
deep states. Thus, the CPUs that enter such deep states don't
wakeup correctly.

Patch 1 in the series addresses a).

Patches 2,3,4 address b)

Patch 5 fixes the bug c)

These patches are applied on top of next branch of the powerpc-linux
git tree. The patches have been tested on POWER8 and POWER9.

Gautham R. Shenoy (5):
powernv:idle: Move device-tree parsing to one place.
powernv:idle: Change return type of pnv_probe_idle_states to int
powernv:idle: Define idle init function for power8
powernv:idle: Move initialization of sibling pacas to
pnv_alloc_idle_core_states
powernv:idle: Disable LOSE_FULL_CONTEXT states when stop-api fails.

arch/powerpc/include/asm/cpuidle.h | 32 +-
arch/powerpc/platforms/powernv/idle.c | 576 ++++++++++++++++++++++++++--------
drivers/cpuidle/cpuidle-powernv.c | 233 ++++----------
3 files changed, 521 insertions(+), 320 deletions(-)

--
1.9.4


2017-07-05 16:38:43

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH 4/5] powernv:idle: Move initialization of sibling pacas to pnv_alloc_idle_core_states

From: "Gautham R. Shenoy" <[email protected]>

On POWER9 DD1, in order to get around a hardware issue, we store in
every CPU thread's paca the paca pointers of all its siblings.

Move this code into pnv_alloc_idle_core_states() soon after the space
for saving the sibling pacas is allocated.

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/platforms/powernv/idle.c | 45 +++++++++++++++++------------------
1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index c400ff9..254a0db8 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -194,6 +194,28 @@ static void pnv_alloc_idle_core_states(void)
}
}

+ /*
+ * For each CPU, record its PACA address in each of it's
+ * sibling thread's PACA at the slot corresponding to this
+ * CPU's index in the core.
+ */
+ if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
+ int cpu;
+
+ pr_info("powernv: idle: Saving PACA pointers of all CPUs in their thread sibling PACA\n");
+ for_each_possible_cpu(cpu) {
+ int base_cpu = cpu_first_thread_sibling(cpu);
+ int idx = cpu_thread_in_core(cpu);
+ int i;
+
+ for (i = 0; i < threads_per_core; i++) {
+ int j = base_cpu + i;
+
+ paca[j].thread_sibling_pacas[idx] = &paca[cpu];
+ }
+ }
+ }
+
update_subcore_sibling_mask();

if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT)
@@ -898,31 +920,8 @@ static int __init pnv_init_idle_states(void)
if (pnv_probe_idle_states())
goto out;

-
pnv_alloc_idle_core_states();

- /*
- * For each CPU, record its PACA address in each of it's
- * sibling thread's PACA at the slot corresponding to this
- * CPU's index in the core.
- */
- if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
- int cpu;
-
- pr_info("powernv: idle: Saving PACA pointers of all CPUs in their thread sibling PACA\n");
- for_each_possible_cpu(cpu) {
- int base_cpu = cpu_first_thread_sibling(cpu);
- int idx = cpu_thread_in_core(cpu);
- int i;
-
- for (i = 0; i < threads_per_core; i++) {
- int j = base_cpu + i;
-
- paca[j].thread_sibling_pacas[idx] = &paca[cpu];
- }
- }
- }
-
out:
return 0;
}
--
1.9.4

2017-07-05 16:38:47

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH 5/5] powernv:idle: Disable LOSE_FULL_CONTEXT states when stop-api fails.

From: "Gautham R. Shenoy" <[email protected]>

Currently, we use the opal call opal_slw_set_reg() to inform the that
the Sleep-Winkle Engine (SLW) to restore the contents of some of the
Hypervisor state on wakeup from deep idle states that lose full
hypervisor context (characterized by the flag
OPAL_PM_LOSE_FULL_CONTEXT).

However, the current code has a bug in that if opal_slw_set_reg()
fails, we don't disable the use of these deep states (winkle on
POWER8, stop4 onwards on POWER9).

This patch fixes this bug by ensuring that if the the sleep winkle
engine is unable to restore the hypervisor states in
pnv_save_sprs_for_deep_states(), then we mark as invalid the states
which lose full context.

As a side-effect, since supported_cpuidle_states in
pnv_probe_idle_states() consists of flags of only the valid states,
this patch will ensure that no other subsystem in the kernel can use
the states which lose full context on stop-api failures.

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/platforms/powernv/idle.c | 98 +++++++++++++++++++++++++++++++----
1 file changed, 87 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 254a0db8..8d07ce6 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -217,9 +217,6 @@ static void pnv_alloc_idle_core_states(void)
}

update_subcore_sibling_mask();
-
- if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT)
- pnv_save_sprs_for_deep_states();
}

u32 pnv_get_supported_cpuidle_states(void)
@@ -518,6 +515,57 @@ static void __init pnv_power9_idle_init(void)
u64 max_residency_ns = 0;
int i;
int dt_idle_states = pnv_idle.nr_states;
+ bool save_sprs_for_deep_stop = false;
+ bool disable_lose_full_context = false;
+ u64 psscr_rl, residency_ns, psscr_val, psscr_mask;
+ u32 flags;
+
+ /*
+ * pnv_deepest_stop_{val,mask} should be set to values
+ * corresponding to the deepest stop state.
+ */
+ for (i = 0; i < dt_idle_states; i++) {
+ psscr_val = pnv_idle.states[i].ctrl_reg_val;
+ psscr_mask = pnv_idle.states[i].ctrl_reg_mask;
+ psscr_rl = psscr_val & PSSCR_RL_MASK;
+ flags = pnv_idle.states[i].flags;
+ residency_ns = pnv_idle.states[i].residency_ns;
+
+ if (flags & OPAL_PM_LOSE_FULL_CONTEXT)
+ save_sprs_for_deep_stop = true;
+
+ if (max_residency_ns < residency_ns) {
+ max_residency_ns = residency_ns;
+ pnv_deepest_stop_psscr_val = psscr_val;
+ pnv_deepest_stop_psscr_mask = psscr_mask;
+ deepest_stop_found = true;
+ }
+ }
+
+ /*
+ * pnv_save_sprs_for_deep_states() expects
+ * pnv_deepest_stop_psscr_val to be initialized.
+ */
+ if (save_sprs_for_deep_stop) {
+ int rc;
+
+ rc = pnv_save_sprs_for_deep_states();
+
+ /*
+ * If the Sleep-Winkle Engine is unable to restore the
+ * critical SPRs on wakeup from some of the deep stop
+ * states that lose full context, then we mark such
+ * deep states as invalid and recompute the
+ * pnv_deepest_stop_psscr_val/mask from among the
+ * valid states.
+ */
+ if (unlikely(rc)) {
+ pr_warn("cpuidle-powernv:Disabling full-context loss states.SLW unable to restore SPRs\n");
+ disable_lose_full_context = true;
+ max_residency_ns = 0;
+ deepest_stop_found = false;
+ }
+ }

/*
* Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
@@ -526,16 +574,20 @@ static void __init pnv_power9_idle_init(void)
* pnv_first_deep_stop_state should be set to the first stop
* level to cause hypervisor state loss.
*
- * pnv_deepest_stop_{val,mask} should be set to values corresponding to
- * the deepest stop state.
*
* pnv_default_stop_{val,mask} should be set to values corresponding to
* the shallowest (OPAL_PM_STOP_INST_FAST) loss-less stop state.
*/
pnv_first_deep_stop_state = MAX_STOP_STATE;
for (i = 0; i < dt_idle_states; i++) {
- u64 psscr_rl, residency_ns, psscr_val, psscr_mask;
- u32 flags;
+ flags = pnv_idle.states[i].flags;
+
+ if ((flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
+ disable_lose_full_context) {
+ pnv_idle.states[i].valid = false;
+ pr_warn("cpuidle-powernv: Disabling full-context loss state :%s\n",
+ pnv_idle.states[i].name);
+ }

if (!pnv_idle.states[i].valid)
continue;
@@ -543,15 +595,14 @@ static void __init pnv_power9_idle_init(void)
psscr_val = pnv_idle.states[i].ctrl_reg_val;
psscr_mask = pnv_idle.states[i].ctrl_reg_mask;
psscr_rl = psscr_val & PSSCR_RL_MASK;
- flags = pnv_idle.states[i].flags;
+
residency_ns = pnv_idle.states[i].residency_ns;

if ((flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
- (pnv_first_deep_stop_state > psscr_rl)) {
+ (pnv_first_deep_stop_state > psscr_rl))
pnv_first_deep_stop_state = psscr_rl;
- }

- if (max_residency_ns < residency_ns) {
+ if (unlikely(max_residency_ns < residency_ns)) {
max_residency_ns = residency_ns;
pnv_deepest_stop_psscr_val = psscr_val;
pnv_deepest_stop_psscr_mask = psscr_mask;
@@ -593,6 +644,8 @@ static void __init pnv_power8_idle_init(void)
bool has_nap = false;
bool has_sleep_er1 = false;
int dt_idle_states = pnv_idle.nr_states;
+ bool disable_full_context_loss = false;
+ bool sprs_for_deep_state_saved = false;

for (i = 0; i < dt_idle_states; i++) {
struct pnv_idle_state *state = &pnv_idle.states[i];
@@ -601,6 +654,29 @@ static void __init pnv_power8_idle_init(void)
has_nap = true;
if (state->flags & OPAL_PM_SLEEP_ENABLED_ER1)
has_sleep_er1 = true;
+ if (state->flags & OPAL_PM_LOSE_FULL_CONTEXT) {
+ int rc;
+
+ if (sprs_for_deep_state_saved)
+ continue;
+ if (disable_full_context_loss) {
+ state->valid = false;
+ pr_warn("cpuidle-powernv: Disabling full-context loss state :%s\n",
+ pnv_idle.states[i].name);
+ continue;
+ }
+
+ rc = pnv_save_sprs_for_deep_states();
+ if (likely(!rc)) {
+ sprs_for_deep_state_saved = true;
+ } else {
+ pr_warn("cpuidle-powernv:Disabling full-context loss states.SLW unable to restore SPRs.\n");
+ disable_full_context_loss = true;
+ state->valid = false;
+ pr_warn("cpuidle-powernv:Disabling full-context loss state:%s\n",
+ pnv_idle.states[i].name);
+ }
+ }
}

if (!has_sleep_er1) {
--
1.9.4

2017-07-05 16:38:52

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH 2/5] powernv:idle: Change return type of pnv_probe_idle_states to int

From: "Gautham R. Shenoy" <[email protected]>

In the current idle initialization code, if there are failures in
pnv_probe_idle_states, then no platform idle state is
enabled. However, since the error is not propagated to the top-level
function pnv_init_idle_states, we continue initialization in this
top-level function even though this will never be used.

Hence change the the return type of pnv_probe_idle_states from void to
int and in case of failures, bail out early on in
pnv_init_idle_states.

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/platforms/powernv/idle.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index b747bb5..a5990d9 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -813,26 +813,27 @@ static int __init pnv_idle_parse(struct device_node *np, int dt_idle_states)
/*
* Probe device tree for supported idle states
*/
-static void __init pnv_probe_idle_states(void)
+static int __init pnv_probe_idle_states(void)
{
struct device_node *np;
int dt_idle_states;
- int i;
+ int i, rc;

np = of_find_node_by_path("/ibm,opal/power-mgt");
if (!np) {
pr_warn("opal: PowerMgmt Node not found\n");
- return;
+ return -ENODEV;
}
dt_idle_states = of_property_count_u32_elems(np,
"ibm,cpu-idle-state-flags");
if (dt_idle_states < 0) {
pr_warn("cpuidle-powernv: no idle states found in the DT\n");
- return;
+ return -ENOENT;
}

- if (pnv_idle_parse(np, dt_idle_states))
- return;
+ rc = pnv_idle_parse(np, dt_idle_states);
+ if (rc)
+ return rc;

if (cpu_has_feature(CPU_FTR_ARCH_300))
pnv_power9_idle_init();
@@ -842,6 +843,8 @@ static void __init pnv_probe_idle_states(void)
continue;
supported_cpuidle_states |= pnv_idle.states[i].flags;
}
+
+ return 0;
}

static int __init pnv_init_idle_states(void)
@@ -852,7 +855,8 @@ static int __init pnv_init_idle_states(void)
if (cpuidle_disable != IDLE_NO_OVERRIDE)
goto out;

- pnv_probe_idle_states();
+ if (pnv_probe_idle_states())
+ goto out;

if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
patch_instruction(
--
1.9.4

2017-07-05 16:38:41

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH 1/5] powernv:idle: Move device-tree parsing to one place.

From: "Gautham R. Shenoy" <[email protected]>

The details of the platform idle state are exposed by the firmware to
the kernel via device tree.

In the current code, we parse the device tree twice :

1) During the boot up in arch/powerpc/platforms/powernv/idle.c Here,
the device tree is parsed to obtain the details of the
supported_cpuidle_states which is used to determine the default idle
state (which would be used when cpuidle is absent) and the deepest
idle state (which would be used for cpu-hotplug).

2) During the powernv cpuidle driver initializion
(drivers/cpuidle/cpuidle-powernv.c). Here we parse the device tree to
populate the cpuidle driver's states.

This patch moves all the device tree parsing to the platform idle
code. It defines data-structures for recording the details of the
parsed idle states. Any other kernel subsystem that is interested in
the idle states (eg: cpuidle-powernv driver) can just use the
in-kernel data structure instead of parsing the device tree all over
again.

Further, this helps to check the validity of states in one place and
in case of invalid states (eg : stop states whose psscr values are
errorenous) flag them as invalid, so that the other subsystems can be
prevented from using those.

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/include/asm/cpuidle.h | 32 +--
arch/powerpc/platforms/powernv/idle.c | 390 ++++++++++++++++++++++++++--------
drivers/cpuidle/cpuidle-powernv.c | 233 +++++---------------
3 files changed, 378 insertions(+), 277 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 52586f9..88ff2a1 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -73,19 +73,25 @@
extern u64 pnv_first_deep_stop_state;

unsigned long pnv_cpu_offline(unsigned int cpu);
-int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags);
-static inline void report_invalid_psscr_val(u64 psscr_val, int err)
-{
- switch (err) {
- case ERR_EC_ESL_MISMATCH:
- pr_warn("Invalid psscr 0x%016llx : ESL,EC bits unequal",
- psscr_val);
- break;
- case ERR_DEEP_STATE_ESL_MISMATCH:
- pr_warn("Invalid psscr 0x%016llx : ESL cleared for deep stop-state",
- psscr_val);
- }
-}
+
+#define PNV_IDLE_NAME_LEN 16
+struct pnv_idle_state {
+ char name[PNV_IDLE_NAME_LEN];
+ u32 flags;
+ u32 latency_ns;
+ u32 residency_ns;
+ u64 ctrl_reg_val; /* The ctrl_reg on POWER8 would be pmicr. */
+ u64 ctrl_reg_mask; /* On POWER9 it is psscr */
+ bool valid;
+};
+
+struct pnv_idle_states {
+ unsigned int nr_states;
+ struct pnv_idle_state *states;
+};
+
+struct pnv_idle_states *get_pnv_idle_states(void);
+
#endif

#endif
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 2abee07..b747bb5 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -58,6 +58,17 @@
static u64 pnv_deepest_stop_psscr_mask;
static bool deepest_stop_found;

+/*
+ * Data structure that stores details of
+ * all the platform idle states.
+ */
+struct pnv_idle_states pnv_idle;
+
+struct pnv_idle_states *get_pnv_idle_states(void)
+{
+ return &pnv_idle;
+}
+
static int pnv_save_sprs_for_deep_states(void)
{
int cpu;
@@ -435,9 +446,11 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
* stop instruction
*/

-int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags)
+void validate_psscr_val_mask(int i)
{
- int err = 0;
+ u64 *psscr_val = &pnv_idle.states[i].ctrl_reg_val;
+ u64 *psscr_mask = &pnv_idle.states[i].ctrl_reg_mask;
+ u32 flags = pnv_idle.states[i].flags;

/*
* psscr_mask == 0xf indicates an older firmware.
@@ -447,7 +460,8 @@ int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags)
if (*psscr_mask == 0xf) {
*psscr_val = *psscr_val | PSSCR_HV_DEFAULT_VAL;
*psscr_mask = PSSCR_HV_DEFAULT_MASK;
- return err;
+ pnv_idle.states[i].valid = true;
+ return;
}

/*
@@ -458,13 +472,17 @@ int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags)
* - ESL bit is set for all the deep stop states.
*/
if (GET_PSSCR_ESL(*psscr_val) != GET_PSSCR_EC(*psscr_val)) {
- err = ERR_EC_ESL_MISMATCH;
+ pnv_idle.states[i].valid = false;
+ pr_warn("Invalid state:%s:psscr 0x%016llx: ESL,EC bits unequal\n",
+ pnv_idle.states[i].name, *psscr_val);
} else if ((flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
GET_PSSCR_ESL(*psscr_val) == 0) {
- err = ERR_DEEP_STATE_ESL_MISMATCH;
+ pnv_idle.states[i].valid = false;
+ pr_warn("Invalid state:%s:psscr 0x%016llx:ESL cleared for deep stop\n",
+ pnv_idle.states[i].name, *psscr_val);
+ } else {
+ pnv_idle.states[i].valid = true;
}
-
- return err;
}

/*
@@ -472,53 +490,12 @@ int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags)
* deep idle state and deepest idle state on
* ISA 3.0 CPUs.
*
- * @np: /ibm,opal/power-mgt device node
- * @flags: cpu-idle-state-flags array
- * @dt_idle_states: Number of idle state entries
- * Returns 0 on success
*/
-static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
- int dt_idle_states)
+static void __init pnv_power9_idle_init(void)
{
- u64 *psscr_val = NULL;
- u64 *psscr_mask = NULL;
- u32 *residency_ns = NULL;
u64 max_residency_ns = 0;
- int rc = 0, i;
-
- psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL);
- psscr_mask = kcalloc(dt_idle_states, sizeof(*psscr_mask), GFP_KERNEL);
- residency_ns = kcalloc(dt_idle_states, sizeof(*residency_ns),
- GFP_KERNEL);
-
- if (!psscr_val || !psscr_mask || !residency_ns) {
- rc = -1;
- goto out;
- }
-
- if (of_property_read_u64_array(np,
- "ibm,cpu-idle-state-psscr",
- psscr_val, dt_idle_states)) {
- pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n");
- rc = -1;
- goto out;
- }
-
- if (of_property_read_u64_array(np,
- "ibm,cpu-idle-state-psscr-mask",
- psscr_mask, dt_idle_states)) {
- pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr-mask in DT\n");
- rc = -1;
- goto out;
- }
-
- if (of_property_read_u32_array(np,
- "ibm,cpu-idle-state-residency-ns",
- residency_ns, dt_idle_states)) {
- pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-residency-ns in DT\n");
- rc = -1;
- goto out;
- }
+ int i;
+ int dt_idle_states = pnv_idle.nr_states;

/*
* Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
@@ -535,31 +512,34 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
*/
pnv_first_deep_stop_state = MAX_STOP_STATE;
for (i = 0; i < dt_idle_states; i++) {
- int err;
- u64 psscr_rl = psscr_val[i] & PSSCR_RL_MASK;
-
- if ((flags[i] & OPAL_PM_LOSE_FULL_CONTEXT) &&
- (pnv_first_deep_stop_state > psscr_rl))
- pnv_first_deep_stop_state = psscr_rl;
+ u64 psscr_rl, residency_ns, psscr_val, psscr_mask;
+ u32 flags;

- err = validate_psscr_val_mask(&psscr_val[i], &psscr_mask[i],
- flags[i]);
- if (err) {
- report_invalid_psscr_val(psscr_val[i], err);
+ if (!pnv_idle.states[i].valid)
continue;
+
+ psscr_val = pnv_idle.states[i].ctrl_reg_val;
+ psscr_mask = pnv_idle.states[i].ctrl_reg_mask;
+ psscr_rl = psscr_val & PSSCR_RL_MASK;
+ flags = pnv_idle.states[i].flags;
+ residency_ns = pnv_idle.states[i].residency_ns;
+
+ if ((flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
+ (pnv_first_deep_stop_state > psscr_rl)) {
+ pnv_first_deep_stop_state = psscr_rl;
}

- if (max_residency_ns < residency_ns[i]) {
- max_residency_ns = residency_ns[i];
- pnv_deepest_stop_psscr_val = psscr_val[i];
- pnv_deepest_stop_psscr_mask = psscr_mask[i];
+ if (max_residency_ns < residency_ns) {
+ max_residency_ns = residency_ns;
+ pnv_deepest_stop_psscr_val = psscr_val;
+ pnv_deepest_stop_psscr_mask = psscr_mask;
deepest_stop_found = true;
}

if (!default_stop_found &&
- (flags[i] & OPAL_PM_STOP_INST_FAST)) {
- pnv_default_stop_val = psscr_val[i];
- pnv_default_stop_mask = psscr_mask[i];
+ (flags & OPAL_PM_STOP_INST_FAST)) {
+ pnv_default_stop_val = psscr_val;
+ pnv_default_stop_mask = psscr_mask;
default_stop_found = true;
}
}
@@ -582,10 +562,251 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,

pr_info("cpuidle-powernv: Requested Level (RL) value of first deep stop = 0x%llx\n",
pnv_first_deep_stop_state);
+}
+
+/*
+ * Returns 0 if prop1_len == prop2_len. Else returns -1
+ */
+static inline int validate_dt_prop_sizes(const char *prop1, int prop1_len,
+ const char *prop2, int prop2_len)
+{
+ if (prop1_len == prop2_len)
+ return 0;
+
+ pr_warn("cpuidle-powernv: array sizes don't match for %s and %s\n",
+ prop1, prop2);
+ return -1;
+}
+
+/**
+ * get_idle_prop_u32_array: Returns an array of u32 elements
+ * parsed from the device tree corresponding
+ * to the property provided in variable propname.
+ *
+ * @np: Pointer to device tree node "/ibm,opal/power-mgt"
+ * @nr_states: Expected number of elements.
+ * @propname : Name of the property whose values is an array of
+ * u32 elements
+ *
+ * Returns a pointer to a u32 array of size nr_states on success.
+ * Returns NULL on failure.
+ */
+static inline u32 *get_idle_prop_u32_array(struct device_node *np,
+ int nr_states,
+ const char *propname)
+{
+ u32 *ret_array;
+ int rc, count;
+
+ count = of_property_count_u32_elems(np, propname);
+ rc = validate_dt_prop_sizes("ibm,cpu-idle-state-flags", nr_states,
+ propname, count);
+ if (rc)
+ return NULL;
+
+ ret_array = kcalloc(nr_states, sizeof(*ret_array), GFP_KERNEL);
+ if (!ret_array)
+ return NULL;
+
+ rc = of_property_read_u32_array(np, propname, ret_array, nr_states);
+ if (!rc)
+ return ret_array;
+
+ kfree(ret_array);
+ return NULL;
+}
+
+/**
+ * get_idle_prop_u64_array: Returns an array of u64 elements
+ * parsed from the device tree corresponding
+ * to the property provided in variable propname.
+ *
+ * @np: Pointer to device tree node "/ibm,opal/power-mgt"
+ * @nr_states: Expected number of elements.
+ * @propname : Name of the property whose value is an array of
+ * u64 elements
+ *
+ * Returns a pointer to a u64 array of size nr_states on success.
+ * Returns NULL on failure.
+ */
+static inline u64 *get_idle_prop_u64_array(struct device_node *np,
+ int nr_states,
+ const char *propname)
+{
+ u64 *ret_array;
+ int rc, count;
+
+ count = of_property_count_u64_elems(np, propname);
+ rc = validate_dt_prop_sizes("ibm,cpu-idle-state-flags", nr_states,
+ propname, count);
+ if (rc)
+ return NULL;
+
+ ret_array = kcalloc(nr_states, sizeof(*ret_array), GFP_KERNEL);
+ if (!ret_array)
+ return NULL;
+
+ rc = of_property_read_u64_array(np, propname, ret_array, nr_states);
+ if (!rc)
+ return ret_array;
+
+ kfree(ret_array);
+ return NULL;
+}
+
+/**
+ * get_idle_prop_strings_array: Returns an array of string pointers
+ * parsed from the device tree corresponding
+ * to the property provided in variable propname.
+ *
+ * @np: Pointer to device tree node "/ibm,opal/power-mgt"
+ * @nr_states: Expected number of elements.
+ *
+ * @propname : Name of the property whose values is an array of string
+ * pointers.
+ *
+ * Returns a pointer to an array of string pointers ofsize nr_states on success.
+ * Returns NULL on failure.
+ */
+static inline const char **get_idle_prop_strings_array(struct device_node *np,
+ int nr_states,
+ const char *propname)
+{
+ const char **ret_array;
+ int rc, count;
+
+ count = of_property_count_strings(np, propname);
+ rc = validate_dt_prop_sizes("ibm,cpu-idle-state-flags", nr_states,
+ propname, count);
+ if (rc)
+ return NULL;
+
+ ret_array = kcalloc(nr_states, sizeof(*ret_array), GFP_KERNEL);
+ if (!ret_array)
+ return NULL;
+
+ rc = of_property_read_string_array(np, propname, ret_array, nr_states);
+ if (rc >= 0)
+ return ret_array;
+
+ kfree(ret_array);
+ return NULL;
+}
+
+static int __init pnv_idle_parse(struct device_node *np, int dt_idle_states)
+{
+ int count;
+ u32 *latency_ns = NULL, *residency_ns = NULL, *flags = NULL;
+ u64 *psscr_vals = NULL, *psscr_masks = NULL;
+ const char **names = NULL;
+ u32 has_stop_states = 0;
+ int i, rc = -1;
+ bool residency_present = true;
+
+ pnv_idle.nr_states = 0;
+
+ flags = get_idle_prop_u32_array(np, dt_idle_states,
+ "ibm,cpu-idle-state-flags");
+ if (!flags)
+ goto out_err;
+
+ latency_ns = get_idle_prop_u32_array(np, dt_idle_states,
+ "ibm,cpu-idle-state-latencies-ns");
+ if (!latency_ns)
+ goto out_err;
+
+ names = get_idle_prop_strings_array(np, dt_idle_states,
+ "ibm,cpu-idle-state-names");
+ if (!names)
+ goto out_err;
+
+ /*
+ * If the idle states use stop instruction, probe for psscr
+ * values and psscr mask which are necessary to specify
+ * required stop level.
+ */
+ has_stop_states = (flags[0] &
+ (OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP));
+ if (has_stop_states) {
+ psscr_vals = get_idle_prop_u64_array(np, dt_idle_states,
+ "ibm,cpu-idle-state-psscr");
+ if (!psscr_vals)
+ goto out_err;
+
+ psscr_masks = get_idle_prop_u64_array(np, dt_idle_states,
+ "ibm,cpu-idle-state-psscr-mask");
+ if (!psscr_masks)
+ goto out_err;
+ }
+
+ count = of_property_count_u32_elems(np,
+ "ibm,cpu-idle-state-residency-ns");
+ /*
+ * On POWER8, on some of the older firmware, the residency
+ * array can be absent. In this case we hardcode the values
+ * for the nap and fastsleep states in the kernel.
+ *
+ * On POWER9, the cpu-idle-state-residency-ns is expected to be
+ * provided by the firmware.
+ */
+ if (count < 0) {
+ if (has_stop_states) {
+ pr_warn("cpuidle-powernv:Missing ibm,cpu-idle-state-residency in DT\n");
+ goto out_err;
+ } else {
+ residency_present = false;
+ }
+ } else {
+ residency_ns = get_idle_prop_u32_array(np, dt_idle_states,
+ "ibm,cpu-idle-state-residency-ns");
+ if (!residency_ns)
+ goto out_err;
+ }
+
+ pnv_idle.states = kcalloc(dt_idle_states, sizeof(struct pnv_idle_state),
+ GFP_KERNEL);
+ if (!pnv_idle.states)
+ goto out_err;
+
+ for (i = 0; i < dt_idle_states; i++) {
+ struct pnv_idle_state *state = &pnv_idle.states[i];
+
+ strlcpy(state->name, names[i], PNV_IDLE_NAME_LEN);
+ state->flags = flags[i];
+ state->latency_ns = latency_ns[i];
+
+ if (has_stop_states) {
+ state->residency_ns = residency_ns[i];
+ state->ctrl_reg_val = psscr_vals[i];
+ state->ctrl_reg_mask = psscr_masks[i];
+ validate_psscr_val_mask(i);
+ } else {
+ if (residency_present)
+ state->residency_ns = residency_ns[i];
+ else if (flags[i] & OPAL_PM_NAP_ENABLED)
+ state->residency_ns = 100000;
+ else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
+ flags[i] & OPAL_PM_SLEEP_ENABLED_ER1)
+ state->residency_ns = 300000000;
+
+ state->valid = true;
+ }
+ }
+
+ pnv_idle.nr_states = dt_idle_states;
+ rc = 0;
+ goto out;
+
+out_err:
+ kfree(pnv_idle.states);
out:
- kfree(psscr_val);
- kfree(psscr_mask);
+ kfree(names);
+ kfree(flags);
kfree(residency_ns);
+ kfree(latency_ns);
+ kfree(psscr_vals);
+ kfree(psscr_masks);
+
return rc;
}

@@ -596,40 +817,33 @@ static void __init pnv_probe_idle_states(void)
{
struct device_node *np;
int dt_idle_states;
- u32 *flags = NULL;
int i;

np = of_find_node_by_path("/ibm,opal/power-mgt");
if (!np) {
pr_warn("opal: PowerMgmt Node not found\n");
- goto out;
+ return;
}
dt_idle_states = of_property_count_u32_elems(np,
"ibm,cpu-idle-state-flags");
if (dt_idle_states < 0) {
pr_warn("cpuidle-powernv: no idle states found in the DT\n");
- goto out;
+ return;
}

- flags = kcalloc(dt_idle_states, sizeof(*flags), GFP_KERNEL);
+ if (pnv_idle_parse(np, dt_idle_states))
+ return;

- if (of_property_read_u32_array(np,
- "ibm,cpu-idle-state-flags", flags, dt_idle_states)) {
- pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-flags in DT\n");
- goto out;
- }
+ if (cpu_has_feature(CPU_FTR_ARCH_300))
+ pnv_power9_idle_init();

- if (cpu_has_feature(CPU_FTR_ARCH_300)) {
- if (pnv_power9_idle_init(np, flags, dt_idle_states))
- goto out;
+ for (i = 0; i < dt_idle_states; i++) {
+ if (!pnv_idle.states[i].valid)
+ continue;
+ supported_cpuidle_states |= pnv_idle.states[i].flags;
}
-
- for (i = 0; i < dt_idle_states; i++)
- supported_cpuidle_states |= flags[i];
-
-out:
- kfree(flags);
}
+
static int __init pnv_init_idle_states(void)
{

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 37b0698..6e5ce79 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -221,218 +221,99 @@ static inline void add_powernv_state(int index, const char *name,
stop_psscr_table[index].mask = psscr_mask;
}

-/*
- * Returns 0 if prop1_len == prop2_len. Else returns -1
- */
-static inline int validate_dt_prop_sizes(const char *prop1, int prop1_len,
- const char *prop2, int prop2_len)
-{
- if (prop1_len == prop2_len)
- return 0;
-
- pr_warn("cpuidle-powernv: array sizes don't match for %s and %s\n",
- prop1, prop2);
- return -1;
-}
-
static int powernv_add_idle_states(void)
{
- struct device_node *power_mgt;
int nr_idle_states = 1; /* Snooze */
- int dt_idle_states, count;
- u32 latency_ns[CPUIDLE_STATE_MAX];
- u32 residency_ns[CPUIDLE_STATE_MAX];
- u32 flags[CPUIDLE_STATE_MAX];
- u64 psscr_val[CPUIDLE_STATE_MAX];
- u64 psscr_mask[CPUIDLE_STATE_MAX];
- const char *names[CPUIDLE_STATE_MAX];
+ int dt_idle_states;
u32 has_stop_states = 0;
- int i, rc;
-
- /* Currently we have snooze statically defined */
+ int i;
+ struct pnv_idle_states *pnv_idle;

- power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
- if (!power_mgt) {
- pr_warn("opal: PowerMgmt Node not found\n");
- goto out;
- }
+ pnv_idle = get_pnv_idle_states();
+ dt_idle_states = pnv_idle->nr_states;

- /* Read values of any property to determine the num of idle states */
- dt_idle_states = of_property_count_u32_elems(power_mgt, "ibm,cpu-idle-state-flags");
- if (dt_idle_states < 0) {
- pr_warn("cpuidle-powernv: no idle states found in the DT\n");
+ /* Currently we have snooze statically defined */
+ if (!dt_idle_states) {
+ pr_warn("cpuidle-powernv: Only snooze state available\n");
goto out;
}

- count = of_property_count_u32_elems(power_mgt,
- "ibm,cpu-idle-state-latencies-ns");
-
- if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
- "ibm,cpu-idle-state-latencies-ns",
- count) != 0)
- goto out;
-
- count = of_property_count_strings(power_mgt,
- "ibm,cpu-idle-state-names");
- if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags", dt_idle_states,
- "ibm,cpu-idle-state-names",
- count) != 0)
- goto out;
-
/*
- * Since snooze is used as first idle state, max idle states allowed is
- * CPUIDLE_STATE_MAX -1
+ * Since snooze is used as first idle state, max idle states
+ * allowed is CPUIDLE_STATE_MAX -1
*/
if (dt_idle_states > CPUIDLE_STATE_MAX - 1) {
pr_warn("cpuidle-powernv: discovered idle states more than allowed");
dt_idle_states = CPUIDLE_STATE_MAX - 1;
}

- if (of_property_read_u32_array(power_mgt,
- "ibm,cpu-idle-state-flags", flags, dt_idle_states)) {
- pr_warn("cpuidle-powernv : missing ibm,cpu-idle-state-flags in DT\n");
- goto out;
- }
-
- if (of_property_read_u32_array(power_mgt,
- "ibm,cpu-idle-state-latencies-ns", latency_ns,
- dt_idle_states)) {
- pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-latencies-ns in DT\n");
- goto out;
- }
- if (of_property_read_string_array(power_mgt,
- "ibm,cpu-idle-state-names", names, dt_idle_states) < 0) {
- pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-names in DT\n");
- goto out;
- }
-
/*
- * If the idle states use stop instruction, probe for psscr values
- * and psscr mask which are necessary to specify required stop level.
+ * If the idle states use stop instruction, we will need psscr
+ * values and psscr mask to specify required stop level.
*/
- has_stop_states = (flags[0] &
+ has_stop_states = (pnv_idle->states[0].flags &
(OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP));
- if (has_stop_states) {
- count = of_property_count_u64_elems(power_mgt,
- "ibm,cpu-idle-state-psscr");
- if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
- dt_idle_states,
- "ibm,cpu-idle-state-psscr",
- count) != 0)
- goto out;
-
- count = of_property_count_u64_elems(power_mgt,
- "ibm,cpu-idle-state-psscr-mask");
- if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
- dt_idle_states,
- "ibm,cpu-idle-state-psscr-mask",
- count) != 0)
- goto out;
-
- if (of_property_read_u64_array(power_mgt,
- "ibm,cpu-idle-state-psscr", psscr_val, dt_idle_states)) {
- pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-psscr in DT\n");
- goto out;
- }

- if (of_property_read_u64_array(power_mgt,
- "ibm,cpu-idle-state-psscr-mask",
- psscr_mask, dt_idle_states)) {
- pr_warn("cpuidle-powernv:Missing ibm,cpu-idle-state-psscr-mask in DT\n");
- goto out;
- }
- }
+ for (i = 0; i < dt_idle_states; i++) {
+ unsigned int exit_latency, target_residency;
+ u64 latency_ns, psscr_val = 0, psscr_mask = 0;
+ u32 flags, cpu_idle_flags = CPUIDLE_FLAG_NONE;
+ const char *name;
+ int (*idle_fn)(struct cpuidle_device *,
+ struct cpuidle_driver *, int);

- count = of_property_count_u32_elems(power_mgt,
- "ibm,cpu-idle-state-residency-ns");
+ if (!pnv_idle->states[i].valid)
+ continue;

- if (count < 0) {
- rc = count;
- } else if (validate_dt_prop_sizes("ibm,cpu-idle-state-flags",
- dt_idle_states,
- "ibm,cpu-idle-state-residency-ns",
- count) != 0) {
- goto out;
- } else {
- rc = of_property_read_u32_array(power_mgt,
- "ibm,cpu-idle-state-residency-ns",
- residency_ns, dt_idle_states);
- }
+ latency_ns = pnv_idle->states[i].latency_ns;

- for (i = 0; i < dt_idle_states; i++) {
- unsigned int exit_latency, target_residency;
- bool stops_timebase = false;
/*
* If an idle state has exit latency beyond
* POWERNV_THRESHOLD_LATENCY_NS then don't use it
* in cpu-idle.
*/
- if (latency_ns[i] > POWERNV_THRESHOLD_LATENCY_NS)
+ if (latency_ns > POWERNV_THRESHOLD_LATENCY_NS)
continue;
- /*
- * Firmware passes residency and latency values in ns.
- * cpuidle expects it in us.
- */
- exit_latency = latency_ns[i] / 1000;
- if (!rc)
- target_residency = residency_ns[i] / 1000;
- else
- target_residency = 0;
-
- if (has_stop_states) {
- int err = validate_psscr_val_mask(&psscr_val[i],
- &psscr_mask[i],
- flags[i]);
- if (err) {
- report_invalid_psscr_val(psscr_val[i], err);
+
+ flags = pnv_idle->states[i].flags;
+
+ if (flags & OPAL_PM_TIMEBASE_STOP) {
+ /*
+ * All cpuidle states with CPU_IDLE_TIMER_STOP set
+ * depend on CONFIG_TICK_ONE_SHOT
+ */
+ if (!IS_ENABLED(CONFIG_TICK_ONESHOT))
continue;
- }
+ cpu_idle_flags = CPUIDLE_FLAG_TIMER_STOP;
}

- if (flags[i] & OPAL_PM_TIMEBASE_STOP)
- stops_timebase = true;
-
- /*
- * For nap and fastsleep, use default target_residency
- * values if f/w does not expose it.
- */
- if (flags[i] & OPAL_PM_NAP_ENABLED) {
- if (!rc)
- target_residency = 100;
- /* Add NAP state */
- add_powernv_state(nr_idle_states, "Nap",
- CPUIDLE_FLAG_NONE, nap_loop,
- target_residency, exit_latency, 0, 0);
- } else if (has_stop_states && !stops_timebase) {
- add_powernv_state(nr_idle_states, names[i],
- CPUIDLE_FLAG_NONE, stop_loop,
- target_residency, exit_latency,
- psscr_val[i], psscr_mask[i]);
+ if (flags & OPAL_PM_NAP_ENABLED) {
+ name = "Nap";
+ idle_fn = nap_loop;
+ } else if (flags & OPAL_PM_SLEEP_ENABLED ||
+ flags & OPAL_PM_SLEEP_ENABLED_ER1) {
+ name = "FastSleep";
+ idle_fn = fastsleep_loop;
+ } else if (has_stop_states) {
+ name = pnv_idle->states[i].name;
+ idle_fn = stop_loop;
+ psscr_val = pnv_idle->states[i].ctrl_reg_val;
+ psscr_mask = pnv_idle->states[i].ctrl_reg_mask;
+ } else {
+ continue;
}

/*
- * All cpuidle states with CPUIDLE_FLAG_TIMER_STOP set must come
- * within this config dependency check.
+ * Firmware passes residency and latency values in ns.
+ * cpuidle expects it in us.
*/
-#ifdef CONFIG_TICK_ONESHOT
- else if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
- flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
- if (!rc)
- target_residency = 300000;
- /* Add FASTSLEEP state */
- add_powernv_state(nr_idle_states, "FastSleep",
- CPUIDLE_FLAG_TIMER_STOP,
- fastsleep_loop,
- target_residency, exit_latency, 0, 0);
- } else if (has_stop_states && stops_timebase) {
- add_powernv_state(nr_idle_states, names[i],
- CPUIDLE_FLAG_TIMER_STOP, stop_loop,
- target_residency, exit_latency,
- psscr_val[i], psscr_mask[i]);
- }
-#endif
- else
- continue;
+ target_residency =
+ pnv_idle->states[i].residency_ns / 1000;
+ exit_latency = latency_ns / 1000;
+
+ add_powernv_state(nr_idle_states, name, cpu_idle_flags,
+ idle_fn, target_residency, exit_latency,
+ psscr_val, psscr_mask);
nr_idle_states++;
}
out:
--
1.9.4

2017-07-05 16:38:39

by Gautham R Shenoy

[permalink] [raw]
Subject: [PATCH 3/5] powernv:idle: Define idle init function for power8

From: "Gautham R. Shenoy" <[email protected]>

In this patch we define a new function named pnv_power8_idle_init().

We move the following code from pnv_init_idle_states() into this newly
defined function.
a) That patches out pnv_fastsleep_workaround_at_entry/exit when
no states with OPAL_PM_SLEEP_ENABLED_ER1 are present.
b) Creating a sysfs control to choose how the workaround has to be
applied when a OPAL_PM_SLEEP_ENABLED_ER1 state is present.
c) Set ppc_md.power_save to power7_idle when OPAL_PM_NAP_ENABLED is
present.

With this, all the power8 specific initializations are in one place.

Signed-off-by: Gautham R. Shenoy <[email protected]>
---
arch/powerpc/platforms/powernv/idle.c | 59 ++++++++++++++++++++++++-----------
1 file changed, 40 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index a5990d9..c400ff9 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -564,6 +564,44 @@ static void __init pnv_power9_idle_init(void)
pnv_first_deep_stop_state);
}

+
+static void __init pnv_power8_idle_init(void)
+{
+ int i;
+ bool has_nap = false;
+ bool has_sleep_er1 = false;
+ int dt_idle_states = pnv_idle.nr_states;
+
+ for (i = 0; i < dt_idle_states; i++) {
+ struct pnv_idle_state *state = &pnv_idle.states[i];
+
+ if (state->flags & OPAL_PM_NAP_ENABLED)
+ has_nap = true;
+ if (state->flags & OPAL_PM_SLEEP_ENABLED_ER1)
+ has_sleep_er1 = true;
+ }
+
+ if (!has_sleep_er1) {
+ patch_instruction(
+ (unsigned int *)pnv_fastsleep_workaround_at_entry,
+ PPC_INST_NOP);
+ patch_instruction(
+ (unsigned int *)pnv_fastsleep_workaround_at_exit,
+ PPC_INST_NOP);
+ } else {
+ /*
+ * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
+ * workaround is needed to use fastsleep. Provide sysfs
+ * control to choose how this workaround has to be applied.
+ */
+ device_create_file(cpu_subsys.dev_root,
+ &dev_attr_fastsleep_workaround_applyonce);
+ }
+
+ if (has_nap)
+ ppc_md.power_save = power7_idle;
+}
+
/*
* Returns 0 if prop1_len == prop2_len. Else returns -1
*/
@@ -837,6 +875,8 @@ static int __init pnv_probe_idle_states(void)

if (cpu_has_feature(CPU_FTR_ARCH_300))
pnv_power9_idle_init();
+ else
+ pnv_power8_idle_init();

for (i = 0; i < dt_idle_states; i++) {
if (!pnv_idle.states[i].valid)
@@ -858,22 +898,6 @@ static int __init pnv_init_idle_states(void)
if (pnv_probe_idle_states())
goto out;

- if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
- patch_instruction(
- (unsigned int *)pnv_fastsleep_workaround_at_entry,
- PPC_INST_NOP);
- patch_instruction(
- (unsigned int *)pnv_fastsleep_workaround_at_exit,
- PPC_INST_NOP);
- } else {
- /*
- * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
- * workaround is needed to use fastsleep. Provide sysfs
- * control to choose how this workaround has to be applied.
- */
- device_create_file(cpu_subsys.dev_root,
- &dev_attr_fastsleep_workaround_applyonce);
- }

pnv_alloc_idle_core_states();

@@ -899,9 +923,6 @@ static int __init pnv_init_idle_states(void)
}
}

- if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
- ppc_md.power_save = power7_idle;
-
out:
return 0;
}
--
1.9.4

2017-07-06 14:54:03

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/5] powernv:idle: Move device-tree parsing to one place.

On Wed, 5 Jul 2017 22:08:12 +0530
"Gautham R. Shenoy" <[email protected]> wrote:

> From: "Gautham R. Shenoy" <[email protected]>
>
> The details of the platform idle state are exposed by the firmware to
> the kernel via device tree.
>
> In the current code, we parse the device tree twice :
>
> 1) During the boot up in arch/powerpc/platforms/powernv/idle.c Here,
> the device tree is parsed to obtain the details of the
> supported_cpuidle_states which is used to determine the default idle
> state (which would be used when cpuidle is absent) and the deepest
> idle state (which would be used for cpu-hotplug).
>
> 2) During the powernv cpuidle driver initializion
> (drivers/cpuidle/cpuidle-powernv.c). Here we parse the device tree to
> populate the cpuidle driver's states.
>
> This patch moves all the device tree parsing to the platform idle
> code. It defines data-structures for recording the details of the
> parsed idle states. Any other kernel subsystem that is interested in
> the idle states (eg: cpuidle-powernv driver) can just use the
> in-kernel data structure instead of parsing the device tree all over
> again.
>
> Further, this helps to check the validity of states in one place and
> in case of invalid states (eg : stop states whose psscr values are
> errorenous) flag them as invalid, so that the other subsystems can be
> prevented from using those.
>
> Signed-off-by: Gautham R. Shenoy <[email protected]>

Hi,

I think the overall direction is good. A few small things.


> +
> +#define PNV_IDLE_NAME_LEN 16
> +struct pnv_idle_state {
> + char name[PNV_IDLE_NAME_LEN];
> + u32 flags;
> + u32 latency_ns;
> + u32 residency_ns;
> + u64 ctrl_reg_val; /* The ctrl_reg on POWER8 would be pmicr. */
> + u64 ctrl_reg_mask; /* On POWER9 it is psscr */
> + bool valid;
> +};

Do we use PMICR anywhere in the idle code? What about allowing for some
machine-specific fields?

union {
struct { /* p9 */
u64 psscr_val;
u64 psscr_mask;
};
struct { /* p8 */
u64 pmicr...;


> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 2abee07..b747bb5 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -58,6 +58,17 @@
> static u64 pnv_deepest_stop_psscr_mask;
> static bool deepest_stop_found;
>
> +/*
> + * Data structure that stores details of
> + * all the platform idle states.
> + */
> +struct pnv_idle_states pnv_idle;
> +
> +struct pnv_idle_states *get_pnv_idle_states(void)
> +{
> + return &pnv_idle;
> +}

I wouldn't have the wrapper function... but it's your code so it's
up to you. One thing though is that this function you have called get_
just to return the pointer, but it does not take a reference or
allocate memory or initialize the structure. Other functions with the
same prefix do such things. Can we make something more consistent?

...

> +/**
> + * get_idle_prop_u32_array: Returns an array of u32 elements
> + * parsed from the device tree corresponding
> + * to the property provided in variable propname.
> + *
> + * @np: Pointer to device tree node "/ibm,opal/power-mgt"
> + * @nr_states: Expected number of elements.
> + * @propname : Name of the property whose values is an array of
> + * u32 elements
> + *
> + * Returns a pointer to a u32 array of size nr_states on success.
> + * Returns NULL on failure.
> + */
> +static inline u32 *get_idle_prop_u32_array(struct device_node *np,
> + int nr_states,
> + const char *propname)
> +{
> + u32 *ret_array;
> + int rc, count;
> +
> + count = of_property_count_u32_elems(np, propname);
> + rc = validate_dt_prop_sizes("ibm,cpu-idle-state-flags", nr_states,
> + propname, count);
> + if (rc)
> + return NULL;
> +
> + ret_array = kcalloc(nr_states, sizeof(*ret_array), GFP_KERNEL);
> + if (!ret_array)
> + return NULL;

So I would say for this, how about moving the allocations into the caller?
You're still doing most of the error handling freeing there, so I would
say it's more balanced if you do that.

Also, perhaps consider dropping most of the inline keywords. Unless it's
performance critical or does some significant optimisation due to constant
parameters I would say avoid the keyword as a rule.

[snip]

There's a lot of code movement, I haven't reviewed it all carefully, but
it looks good in general. I'll apply the patches and check the result
in the next few days when I get a bit of time.

Thanks,
Nick

2017-07-06 15:02:05

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 2/5] powernv:idle: Change return type of pnv_probe_idle_states to int

On Wed, 5 Jul 2017 22:08:13 +0530
"Gautham R. Shenoy" <[email protected]> wrote:

> From: "Gautham R. Shenoy" <[email protected]>
>
> In the current idle initialization code, if there are failures in
> pnv_probe_idle_states, then no platform idle state is
> enabled. However, since the error is not propagated to the top-level
> function pnv_init_idle_states, we continue initialization in this
> top-level function even though this will never be used.
>
> Hence change the the return type of pnv_probe_idle_states from void to
> int and in case of failures, bail out early on in
> pnv_init_idle_states.
>
> Signed-off-by: Gautham R. Shenoy <[email protected]>

Looks good to me.

Reviewed-by: Nicholas Piggin <[email protected]>

I wonder if the warnings are strong enough here to let people know
idle won't be used so power consumption will be high and performance
significantly reduced on SMT machines?

2017-07-06 15:07:03

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 3/5] powernv:idle: Define idle init function for power8

On Wed, 5 Jul 2017 22:08:14 +0530
"Gautham R. Shenoy" <[email protected]> wrote:

> From: "Gautham R. Shenoy" <[email protected]>
>
> In this patch we define a new function named pnv_power8_idle_init().
>
> We move the following code from pnv_init_idle_states() into this newly
> defined function.
> a) That patches out pnv_fastsleep_workaround_at_entry/exit when
> no states with OPAL_PM_SLEEP_ENABLED_ER1 are present.
> b) Creating a sysfs control to choose how the workaround has to be
> applied when a OPAL_PM_SLEEP_ENABLED_ER1 state is present.
> c) Set ppc_md.power_save to power7_idle when OPAL_PM_NAP_ENABLED is
> present.
>
> With this, all the power8 specific initializations are in one place.
>
> Signed-off-by: Gautham R. Shenoy <[email protected]>
> ---
> arch/powerpc/platforms/powernv/idle.c | 59 ++++++++++++++++++++++++-----------
> 1 file changed, 40 insertions(+), 19 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index a5990d9..c400ff9 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -564,6 +564,44 @@ static void __init pnv_power9_idle_init(void)
> pnv_first_deep_stop_state);
> }
>
> +
> +static void __init pnv_power8_idle_init(void)
> +{
> + int i;
> + bool has_nap = false;
> + bool has_sleep_er1 = false;
> + int dt_idle_states = pnv_idle.nr_states;
> +
> + for (i = 0; i < dt_idle_states; i++) {
> + struct pnv_idle_state *state = &pnv_idle.states[i];
> +
> + if (state->flags & OPAL_PM_NAP_ENABLED)
> + has_nap = true;
> + if (state->flags & OPAL_PM_SLEEP_ENABLED_ER1)
> + has_sleep_er1 = true;
> + }
> +
> + if (!has_sleep_er1) {
> + patch_instruction(
> + (unsigned int *)pnv_fastsleep_workaround_at_entry,
> + PPC_INST_NOP);
> + patch_instruction(
> + (unsigned int *)pnv_fastsleep_workaround_at_exit,
> + PPC_INST_NOP);
> + } else {
> + /*
> + * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
> + * workaround is needed to use fastsleep. Provide sysfs
> + * control to choose how this workaround has to be applied.
> + */
> + device_create_file(cpu_subsys.dev_root,
> + &dev_attr_fastsleep_workaround_applyonce);
> + }
> +
> + if (has_nap)
> + ppc_md.power_save = power7_idle;
> +}
> +
> /*
> * Returns 0 if prop1_len == prop2_len. Else returns -1
> */
> @@ -837,6 +875,8 @@ static int __init pnv_probe_idle_states(void)
>
> if (cpu_has_feature(CPU_FTR_ARCH_300))
> pnv_power9_idle_init();
> + else
> + pnv_power8_idle_init();
>
> for (i = 0; i < dt_idle_states; i++) {
> if (!pnv_idle.states[i].valid)
> @@ -858,22 +898,6 @@ static int __init pnv_init_idle_states(void)
> if (pnv_probe_idle_states())
> goto out;
>
> - if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
> - patch_instruction(
> - (unsigned int *)pnv_fastsleep_workaround_at_entry,
> - PPC_INST_NOP);
> - patch_instruction(
> - (unsigned int *)pnv_fastsleep_workaround_at_exit,
> - PPC_INST_NOP);

So previously this would run on POWER9 and patch out those branches.
But POWER9 never runs that code, so no problem. Good cleanup.

Reviewed-by: Nicholas Piggin <[email protected]>

2017-07-06 15:16:26

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 4/5] powernv:idle: Move initialization of sibling pacas to pnv_alloc_idle_core_states

On Wed, 5 Jul 2017 22:08:15 +0530
"Gautham R. Shenoy" <[email protected]> wrote:

> From: "Gautham R. Shenoy" <[email protected]>
>
> On POWER9 DD1, in order to get around a hardware issue, we store in
> every CPU thread's paca the paca pointers of all its siblings.
>
> Move this code into pnv_alloc_idle_core_states() soon after the space
> for saving the sibling pacas is allocated.
>
> Signed-off-by: Gautham R. Shenoy <[email protected]>

> - if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
> - int cpu;
> -
> - pr_info("powernv: idle: Saving PACA pointers of all CPUs in their thread sibling PACA\n");
> - for_each_possible_cpu(cpu) {
> - int base_cpu = cpu_first_thread_sibling(cpu);
> - int idx = cpu_thread_in_core(cpu);
> - int i;
> -

You could move the thread_sibling_pacas allocation to here?

Speaking of which... core_idle_state and thread_sibling_pacas are
allocated with kmalloc_node... What happens if we take an SLB miss
in the idle wakeup code on these guys? Nothing good I think. Perhaps
we should put them into the pacas or somewhere in bolted memory.

Good cleanup though.

Reviewed-by: Nicholas Piggin <[email protected]>

2017-07-06 15:29:32

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 5/5] powernv:idle: Disable LOSE_FULL_CONTEXT states when stop-api fails.

On Wed, 5 Jul 2017 22:08:16 +0530
"Gautham R. Shenoy" <[email protected]> wrote:

> From: "Gautham R. Shenoy" <[email protected]>
>
> Currently, we use the opal call opal_slw_set_reg() to inform the that
> the Sleep-Winkle Engine (SLW) to restore the contents of some of the
> Hypervisor state on wakeup from deep idle states that lose full
> hypervisor context (characterized by the flag
> OPAL_PM_LOSE_FULL_CONTEXT).
>
> However, the current code has a bug in that if opal_slw_set_reg()
> fails, we don't disable the use of these deep states (winkle on
> POWER8, stop4 onwards on POWER9).
>
> This patch fixes this bug by ensuring that if the the sleep winkle
> engine is unable to restore the hypervisor states in
> pnv_save_sprs_for_deep_states(), then we mark as invalid the states
> which lose full context.
>
> As a side-effect, since supported_cpuidle_states in
> pnv_probe_idle_states() consists of flags of only the valid states,
> this patch will ensure that no other subsystem in the kernel can use
> the states which lose full context on stop-api failures.

Looks good. Is there something minimal we can do for stable here?

Aside question, do we need to restore LPCR at all with the SLW engine?
It gets set up again when by the idle wakeup code.

And does POWER9 really need MSR and PSSCR restored by SLW? (going a bit
off topic here, I'm just curious)

Thanks,
Nick

2017-07-07 11:25:49

by Gautham R Shenoy

[permalink] [raw]
Subject: Re: [PATCH 1/5] powernv:idle: Move device-tree parsing to one place.

Hello Nicholas,

On Fri, Jul 07, 2017 at 12:53:40AM +1000, Nicholas Piggin wrote:
> On Wed, 5 Jul 2017 22:08:12 +0530
> "Gautham R. Shenoy" <[email protected]> wrote:
>
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > The details of the platform idle state are exposed by the firmware to
> > the kernel via device tree.
> >
> > In the current code, we parse the device tree twice :
> >
> > 1) During the boot up in arch/powerpc/platforms/powernv/idle.c Here,
> > the device tree is parsed to obtain the details of the
> > supported_cpuidle_states which is used to determine the default idle
> > state (which would be used when cpuidle is absent) and the deepest
> > idle state (which would be used for cpu-hotplug).
> >
> > 2) During the powernv cpuidle driver initializion
> > (drivers/cpuidle/cpuidle-powernv.c). Here we parse the device tree to
> > populate the cpuidle driver's states.
> >
> > This patch moves all the device tree parsing to the platform idle
> > code. It defines data-structures for recording the details of the
> > parsed idle states. Any other kernel subsystem that is interested in
> > the idle states (eg: cpuidle-powernv driver) can just use the
> > in-kernel data structure instead of parsing the device tree all over
> > again.
> >
> > Further, this helps to check the validity of states in one place and
> > in case of invalid states (eg : stop states whose psscr values are
> > errorenous) flag them as invalid, so that the other subsystems can be
> > prevented from using those.
> >
> > Signed-off-by: Gautham R. Shenoy <[email protected]>
>
> Hi,
>
> I think the overall direction is good. A few small things.

Thanks for reviewing the patches.
>
>
> > +
> > +#define PNV_IDLE_NAME_LEN 16
> > +struct pnv_idle_state {
> > + char name[PNV_IDLE_NAME_LEN];
> > + u32 flags;
> > + u32 latency_ns;
> > + u32 residency_ns;
> > + u64 ctrl_reg_val; /* The ctrl_reg on POWER8 would be pmicr. */
> > + u64 ctrl_reg_mask; /* On POWER9 it is psscr */
> > + bool valid;
> > +};
>
> Do we use PMICR anywhere in the idle code? What about allowing for some
> machine-specific fields?

PMICR is not used anywhere so far. I will change to to psscr_val and
psscr_mask for now. If there is a use for pmicr n the future, we can
change this to the union struct as you suggest.


>
> union {
> struct { /* p9 */
> u64 psscr_val;
> u64 psscr_mask;
> };
> struct { /* p8 */
> u64 pmicr...;
>
>
> > diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> > index 2abee07..b747bb5 100644
> > --- a/arch/powerpc/platforms/powernv/idle.c
> > +++ b/arch/powerpc/platforms/powernv/idle.c
> > @@ -58,6 +58,17 @@
> > static u64 pnv_deepest_stop_psscr_mask;
> > static bool deepest_stop_found;
> >
> > +/*
> > + * Data structure that stores details of
> > + * all the platform idle states.
> > + */
> > +struct pnv_idle_states pnv_idle;
> > +
> > +struct pnv_idle_states *get_pnv_idle_states(void)
> > +{
> > + return &pnv_idle;
> > +}
>
> I wouldn't have the wrapper function... but it's your code so it's
> up to you. One thing though is that this function you have called get_
> just to return the pointer, but it does not take a reference or
> allocate memory or initialize the structure. Other functions with the
> same prefix do such things. Can we make something more consistent?

I agree with the wrapper function. But then the alternative was to
declare this variable as an extern so that cpuidle can access it. Is
that preferable ?

>
> ...
>
> > +/**
> > + * get_idle_prop_u32_array: Returns an array of u32 elements
> > + * parsed from the device tree corresponding
> > + * to the property provided in variable propname.
> > + *
> > + * @np: Pointer to device tree node "/ibm,opal/power-mgt"
> > + * @nr_states: Expected number of elements.
> > + * @propname : Name of the property whose values is an array of
> > + * u32 elements
> > + *
> > + * Returns a pointer to a u32 array of size nr_states on success.
> > + * Returns NULL on failure.
> > + */
> > +static inline u32 *get_idle_prop_u32_array(struct device_node *np,
> > + int nr_states,
> > + const char *propname)
> > +{
> > + u32 *ret_array;
> > + int rc, count;
> > +
> > + count = of_property_count_u32_elems(np, propname);
> > + rc = validate_dt_prop_sizes("ibm,cpu-idle-state-flags", nr_states,
> > + propname, count);
> > + if (rc)
> > + return NULL;
> > +
> > + ret_array = kcalloc(nr_states, sizeof(*ret_array), GFP_KERNEL);
> > + if (!ret_array)
> > + return NULL;
>
> So I would say for this, how about moving the allocations into the caller?
> You're still doing most of the error handling freeing there, so I would
> say it's more balanced if you do that.

Sure, that makes sense. I will move the allocation to the main
function and remove the "inline" associated with these helpers.

>
> Also, perhaps consider dropping most of the inline keywords. Unless it's
> performance critical or does some significant optimisation due to constant
> parameters I would say avoid the keyword as a rule.
>
> [snip]
>
> There's a lot of code movement, I haven't reviewed it all carefully, but
> it looks good in general. I'll apply the patches and check the result
> in the next few days when I get a bit of time.

If it helps, I will post the subsequent version breaking this patch
into smaller ones.


>
> Thanks,
> Nick
>
--
Thanks and Regards
gautham.

2017-07-07 12:35:51

by Gautham R Shenoy

[permalink] [raw]
Subject: Re: [PATCH 2/5] powernv:idle: Change return type of pnv_probe_idle_states to int

Hello Nicholas,

On Fri, Jul 07, 2017 at 01:01:49AM +1000, Nicholas Piggin wrote:
> On Wed, 5 Jul 2017 22:08:13 +0530
> "Gautham R. Shenoy" <[email protected]> wrote:
>
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > In the current idle initialization code, if there are failures in
> > pnv_probe_idle_states, then no platform idle state is
> > enabled. However, since the error is not propagated to the top-level
> > function pnv_init_idle_states, we continue initialization in this
> > top-level function even though this will never be used.
> >
> > Hence change the the return type of pnv_probe_idle_states from void to
> > int and in case of failures, bail out early on in
> > pnv_init_idle_states.
> >
> > Signed-off-by: Gautham R. Shenoy <[email protected]>
>
> Looks good to me.
>
> Reviewed-by: Nicholas Piggin <[email protected]>
>
> I wonder if the warnings are strong enough here to let people know
> idle won't be used so power consumption will be high and performance
> significantly reduced on SMT machines?

Good point. Will try to print an error message to this effect.

--
Thanks and Regards
gautham.

2017-07-07 12:43:26

by Gautham R Shenoy

[permalink] [raw]
Subject: Re: [PATCH 3/5] powernv:idle: Define idle init function for power8

Hi Nicholas,

On Fri, Jul 07, 2017 at 01:06:46AM +1000, Nicholas Piggin wrote:
> On Wed, 5 Jul 2017 22:08:14 +0530
> "Gautham R. Shenoy" <[email protected]> wrote:
>
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > In this patch we define a new function named pnv_power8_idle_init().
> >
> > We move the following code from pnv_init_idle_states() into this newly
> > defined function.
> > a) That patches out pnv_fastsleep_workaround_at_entry/exit when
> > no states with OPAL_PM_SLEEP_ENABLED_ER1 are present.
> > b) Creating a sysfs control to choose how the workaround has to be
> > applied when a OPAL_PM_SLEEP_ENABLED_ER1 state is present.
> > c) Set ppc_md.power_save to power7_idle when OPAL_PM_NAP_ENABLED is
> > present.
> >
> > With this, all the power8 specific initializations are in one place.
> >
> > Signed-off-by: Gautham R. Shenoy <[email protected]>
> > ---
> > arch/powerpc/platforms/powernv/idle.c | 59 ++++++++++++++++++++++++-----------
> > 1 file changed, 40 insertions(+), 19 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> > index a5990d9..c400ff9 100644
> > --- a/arch/powerpc/platforms/powernv/idle.c
> > +++ b/arch/powerpc/platforms/powernv/idle.c
> > @@ -564,6 +564,44 @@ static void __init pnv_power9_idle_init(void)
> > pnv_first_deep_stop_state);
> > }
> >
> > +
> > +static void __init pnv_power8_idle_init(void)
> > +{
> > + int i;
> > + bool has_nap = false;
> > + bool has_sleep_er1 = false;
> > + int dt_idle_states = pnv_idle.nr_states;
> > +
> > + for (i = 0; i < dt_idle_states; i++) {
> > + struct pnv_idle_state *state = &pnv_idle.states[i];
> > +
> > + if (state->flags & OPAL_PM_NAP_ENABLED)
> > + has_nap = true;
> > + if (state->flags & OPAL_PM_SLEEP_ENABLED_ER1)
> > + has_sleep_er1 = true;
> > + }
> > +
> > + if (!has_sleep_er1) {
> > + patch_instruction(
> > + (unsigned int *)pnv_fastsleep_workaround_at_entry,
> > + PPC_INST_NOP);
> > + patch_instruction(
> > + (unsigned int *)pnv_fastsleep_workaround_at_exit,
> > + PPC_INST_NOP);
> > + } else {
> > + /*
> > + * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
> > + * workaround is needed to use fastsleep. Provide sysfs
> > + * control to choose how this workaround has to be applied.
> > + */
> > + device_create_file(cpu_subsys.dev_root,
> > + &dev_attr_fastsleep_workaround_applyonce);
> > + }
> > +
> > + if (has_nap)
> > + ppc_md.power_save = power7_idle;
> > +}
> > +
> > /*
> > * Returns 0 if prop1_len == prop2_len. Else returns -1
> > */
> > @@ -837,6 +875,8 @@ static int __init pnv_probe_idle_states(void)
> >
> > if (cpu_has_feature(CPU_FTR_ARCH_300))
> > pnv_power9_idle_init();
> > + else
> > + pnv_power8_idle_init();
> >
> > for (i = 0; i < dt_idle_states; i++) {
> > if (!pnv_idle.states[i].valid)
> > @@ -858,22 +898,6 @@ static int __init pnv_init_idle_states(void)
> > if (pnv_probe_idle_states())
> > goto out;
> >
> > - if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
> > - patch_instruction(
> > - (unsigned int *)pnv_fastsleep_workaround_at_entry,
> > - PPC_INST_NOP);
> > - patch_instruction(
> > - (unsigned int *)pnv_fastsleep_workaround_at_exit,
> > - PPC_INST_NOP);
>
> So previously this would run on POWER9 and patch out those branches.
> But POWER9 never runs that code, so no problem. Good cleanup.

And that's what I thought, but on checking the assembly code, I found
that pnv_fastsleep_workaround_at_exit is executed on POWER9. Will fix
this!

>
> Reviewed-by: Nicholas Piggin <[email protected]>
>

--
Thanks and Regards
gautham.

2017-07-07 15:04:26

by Gautham R Shenoy

[permalink] [raw]
Subject: Re: [PATCH 4/5] powernv:idle: Move initialization of sibling pacas to pnv_alloc_idle_core_states

On Fri, Jul 07, 2017 at 01:16:09AM +1000, Nicholas Piggin wrote:
> On Wed, 5 Jul 2017 22:08:15 +0530
> "Gautham R. Shenoy" <[email protected]> wrote:
>
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > On POWER9 DD1, in order to get around a hardware issue, we store in
> > every CPU thread's paca the paca pointers of all its siblings.
> >
> > Move this code into pnv_alloc_idle_core_states() soon after the space
> > for saving the sibling pacas is allocated.
> >
> > Signed-off-by: Gautham R. Shenoy <[email protected]>
>
> > - if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
> > - int cpu;
> > -
> > - pr_info("powernv: idle: Saving PACA pointers of all CPUs in their thread sibling PACA\n");
> > - for_each_possible_cpu(cpu) {
> > - int base_cpu = cpu_first_thread_sibling(cpu);
> > - int idx = cpu_thread_in_core(cpu);
> > - int i;
> > -
>
> You could move the thread_sibling_pacas allocation to here?
>
> Speaking of which... core_idle_state and thread_sibling_pacas are
> allocated with kmalloc_node... What happens if we take an SLB miss
> in the idle wakeup code on these guys? Nothing good I think. Perhaps
> we should put them into the pacas or somewhere in bolted memory.

Yes, though the SLB miss hasn't yet been encountered in practise so
far!

While one can define thread_sibling_pacas in PACA, it doesn't make
sense to allocate space for core_idle_state in PACA since the
allocated value of the secondary threads will never be used.

What is the right way to ensure that these allocations fall in the
bolted range ?

>
> Good cleanup though.
>
> Reviewed-by: Nicholas Piggin <[email protected]>
>

2017-07-07 17:37:35

by Gautham R Shenoy

[permalink] [raw]
Subject: Re: [PATCH 5/5] powernv:idle: Disable LOSE_FULL_CONTEXT states when stop-api fails.

On Fri, Jul 07, 2017 at 01:29:16AM +1000, Nicholas Piggin wrote:
> On Wed, 5 Jul 2017 22:08:16 +0530
> "Gautham R. Shenoy" <[email protected]> wrote:
>
> > From: "Gautham R. Shenoy" <[email protected]>
> >
> > Currently, we use the opal call opal_slw_set_reg() to inform the that
> > the Sleep-Winkle Engine (SLW) to restore the contents of some of the
> > Hypervisor state on wakeup from deep idle states that lose full
> > hypervisor context (characterized by the flag
> > OPAL_PM_LOSE_FULL_CONTEXT).
> >
> > However, the current code has a bug in that if opal_slw_set_reg()
> > fails, we don't disable the use of these deep states (winkle on
> > POWER8, stop4 onwards on POWER9).
> >
> > This patch fixes this bug by ensuring that if the the sleep winkle
> > engine is unable to restore the hypervisor states in
> > pnv_save_sprs_for_deep_states(), then we mark as invalid the states
> > which lose full context.
> >
> > As a side-effect, since supported_cpuidle_states in
> > pnv_probe_idle_states() consists of flags of only the valid states,
> > this patch will ensure that no other subsystem in the kernel can use
> > the states which lose full context on stop-api failures.
>
> Looks good. Is there something minimal we can do for stable here?
>
> Aside question, do we need to restore LPCR at all with the SLW engine?
> It gets set up again when by the idle wakeup code.


>
> And does POWER9 really need MSR and PSSCR restored by SLW? (going a bit
> off topic here, I'm just curious)

MSR is needed to be restored so that we wakeup with the right
endianness and with the IR,DR disabled.

PSSCR is set to a value so that in case of a special wakeup for a
deep-stop, the SLW can program the core to go back to the stop level
provided by the PSSCR value via the stop-api.


>
> Thanks,
> Nick
>

--
Thanks and Regards
gautham.

2017-07-08 08:43:17

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/5] powernv:idle: Move device-tree parsing to one place.

On Fri, 7 Jul 2017 16:55:39 +0530
Gautham R Shenoy <[email protected]> wrote:

> Hello Nicholas,
>
> On Fri, Jul 07, 2017 at 12:53:40AM +1000, Nicholas Piggin wrote:

> > I wouldn't have the wrapper function... but it's your code so it's
> > up to you. One thing though is that this function you have called get_
> > just to return the pointer, but it does not take a reference or
> > allocate memory or initialize the structure. Other functions with the
> > same prefix do such things. Can we make something more consistent?
>
> I agree with the wrapper function. But then the alternative was to
> declare this variable as an extern so that cpuidle can access it. Is
> that preferable ?

Yeah I think that's fine.

[snip

> > [snip]
> >
> > There's a lot of code movement, I haven't reviewed it all carefully, but
> > it looks good in general. I'll apply the patches and check the result
> > in the next few days when I get a bit of time.
>
> If it helps, I will post the subsequent version breaking this patch
> into smaller ones.

If you could without too much trouble, that would be a good help.

Thanks,
Nick

2017-07-08 09:00:33

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 4/5] powernv:idle: Move initialization of sibling pacas to pnv_alloc_idle_core_states

On Fri, 7 Jul 2017 20:34:16 +0530
Gautham R Shenoy <[email protected]> wrote:

> On Fri, Jul 07, 2017 at 01:16:09AM +1000, Nicholas Piggin wrote:
> > On Wed, 5 Jul 2017 22:08:15 +0530
> > "Gautham R. Shenoy" <[email protected]> wrote:
> >
> > > From: "Gautham R. Shenoy" <[email protected]>
> > >
> > > On POWER9 DD1, in order to get around a hardware issue, we store in
> > > every CPU thread's paca the paca pointers of all its siblings.
> > >
> > > Move this code into pnv_alloc_idle_core_states() soon after the space
> > > for saving the sibling pacas is allocated.
> > >
> > > Signed-off-by: Gautham R. Shenoy <[email protected]>
> >
> > > - if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
> > > - int cpu;
> > > -
> > > - pr_info("powernv: idle: Saving PACA pointers of all CPUs in their thread sibling PACA\n");
> > > - for_each_possible_cpu(cpu) {
> > > - int base_cpu = cpu_first_thread_sibling(cpu);
> > > - int idx = cpu_thread_in_core(cpu);
> > > - int i;
> > > -
> >
> > You could move the thread_sibling_pacas allocation to here?
> >
> > Speaking of which... core_idle_state and thread_sibling_pacas are
> > allocated with kmalloc_node... What happens if we take an SLB miss
> > in the idle wakeup code on these guys? Nothing good I think. Perhaps
> > we should put them into the pacas or somewhere in bolted memory.
>
> Yes, though the SLB miss hasn't yet been encountered in practise so
> far!

Considering it's a node-affine allocation, it may actually be possible
to hit in practice on very large memory systems in practice.

> While one can define thread_sibling_pacas in PACA, it doesn't make
> sense to allocate space for core_idle_state in PACA since the
> allocated value of the secondary threads will never be used.

Well, same for core_idle_state, although that's smaller.

> What is the right way to ensure that these allocations fall in the
> bolted range ?

I'm not sure, I guess the memblock allocator is not up anymore at this
point. I think we'd have to move it earlier. You could allocate another
array of them along side the paca allocation.

Thanks,
Nick

2017-07-08 09:05:48

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 5/5] powernv:idle: Disable LOSE_FULL_CONTEXT states when stop-api fails.

On Fri, 7 Jul 2017 23:07:10 +0530
Gautham R Shenoy <[email protected]> wrote:

> On Fri, Jul 07, 2017 at 01:29:16AM +1000, Nicholas Piggin wrote:
> > On Wed, 5 Jul 2017 22:08:16 +0530
> > "Gautham R. Shenoy" <[email protected]> wrote:
> >
> > > From: "Gautham R. Shenoy" <[email protected]>
> > >
> > > Currently, we use the opal call opal_slw_set_reg() to inform the that
> > > the Sleep-Winkle Engine (SLW) to restore the contents of some of the
> > > Hypervisor state on wakeup from deep idle states that lose full
> > > hypervisor context (characterized by the flag
> > > OPAL_PM_LOSE_FULL_CONTEXT).
> > >
> > > However, the current code has a bug in that if opal_slw_set_reg()
> > > fails, we don't disable the use of these deep states (winkle on
> > > POWER8, stop4 onwards on POWER9).
> > >
> > > This patch fixes this bug by ensuring that if the the sleep winkle
> > > engine is unable to restore the hypervisor states in
> > > pnv_save_sprs_for_deep_states(), then we mark as invalid the states
> > > which lose full context.
> > >
> > > As a side-effect, since supported_cpuidle_states in
> > > pnv_probe_idle_states() consists of flags of only the valid states,
> > > this patch will ensure that no other subsystem in the kernel can use
> > > the states which lose full context on stop-api failures.
> >
> > Looks good. Is there something minimal we can do for stable here?
> >
> > Aside question, do we need to restore LPCR at all with the SLW engine?
> > It gets set up again when by the idle wakeup code.
>
>
> >
> > And does POWER9 really need MSR and PSSCR restored by SLW? (going a bit
> > off topic here, I'm just curious)
>
> MSR is needed to be restored so that we wakeup with the right
> endianness and with the IR,DR disabled.

And POWER8 does not require this?

> PSSCR is set to a value so that in case of a special wakeup for a
> deep-stop, the SLW can program the core to go back to the stop level
> provided by the PSSCR value via the stop-api.

It always restores to deepest stop? Is there any way to restore to the
achieved stop level? Maybe there is no usefulness for that.

Thanks,
Nick

2017-07-10 04:34:12

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 4/5] powernv:idle: Move initialization of sibling pacas to pnv_alloc_idle_core_states

Nicholas Piggin <[email protected]> writes:
> On Fri, 7 Jul 2017 20:34:16 +0530
> Gautham R Shenoy <[email protected]> wrote:
>> On Fri, Jul 07, 2017 at 01:16:09AM +1000, Nicholas Piggin wrote:
>> >
>> > Speaking of which... core_idle_state and thread_sibling_pacas are
>> > allocated with kmalloc_node... What happens if we take an SLB miss
>> > in the idle wakeup code on these guys? Nothing good I think. Perhaps
>> > we should put them into the pacas or somewhere in bolted memory.
>>
>> Yes, though the SLB miss hasn't yet been encountered in practise so
>> far!
>
> Considering it's a node-affine allocation, it may actually be possible
> to hit in practice on very large memory systems in practice.

You can boot with disable_1tb_segments on the kernel command line to
increase the change of hitting it.

cheers

2017-07-11 05:04:28

by Gautham R Shenoy

[permalink] [raw]
Subject: Re: [PATCH 5/5] powernv:idle: Disable LOSE_FULL_CONTEXT states when stop-api fails.

On Sat, Jul 08, 2017 at 07:05:26PM +1000, Nicholas Piggin wrote:
> On Fri, 7 Jul 2017 23:07:10 +0530
> Gautham R Shenoy <[email protected]> wrote:
>
> > On Fri, Jul 07, 2017 at 01:29:16AM +1000, Nicholas Piggin wrote:
> > > On Wed, 5 Jul 2017 22:08:16 +0530
> > > "Gautham R. Shenoy" <[email protected]> wrote:
> > >
> > > > From: "Gautham R. Shenoy" <[email protected]>
> > > >
> > > > Currently, we use the opal call opal_slw_set_reg() to inform the that
> > > > the Sleep-Winkle Engine (SLW) to restore the contents of some of the
> > > > Hypervisor state on wakeup from deep idle states that lose full
> > > > hypervisor context (characterized by the flag
> > > > OPAL_PM_LOSE_FULL_CONTEXT).
> > > >
> > > > However, the current code has a bug in that if opal_slw_set_reg()
> > > > fails, we don't disable the use of these deep states (winkle on
> > > > POWER8, stop4 onwards on POWER9).
> > > >
> > > > This patch fixes this bug by ensuring that if the the sleep winkle
> > > > engine is unable to restore the hypervisor states in
> > > > pnv_save_sprs_for_deep_states(), then we mark as invalid the states
> > > > which lose full context.
> > > >
> > > > As a side-effect, since supported_cpuidle_states in
> > > > pnv_probe_idle_states() consists of flags of only the valid states,
> > > > this patch will ensure that no other subsystem in the kernel can use
> > > > the states which lose full context on stop-api failures.
> > >
> > > Looks good. Is there something minimal we can do for stable here?
> > >
> > > Aside question, do we need to restore LPCR at all with the SLW engine?
> > > It gets set up again when by the idle wakeup code.
> >
> >
> > >
> > > And does POWER9 really need MSR and PSSCR restored by SLW? (going a bit
> > > off topic here, I'm just curious)
> >
> > MSR is needed to be restored so that we wakeup with the right
> > endianness and with the IR,DR disabled.
>
> And POWER8 does not require this?
>
> > PSSCR is set to a value so that in case of a special wakeup for a
> > deep-stop, the SLW can program the core to go back to the stop level
> > provided by the PSSCR value via the stop-api.
>
> It always restores to deepest stop? Is there any way to restore to the
> achieved stop level? Maybe there is no usefulness for that.

That would have been ideal. But there's no way to achieve that at the
moment. The alternative is to have call the stop-api with psscr set to
the desired stop level before every stop entry. This is will consume
additional cycles which is what we are trying to avoid.

So we are currently setting the psscr value to deepest stop via stop
api as a compromise, because then on wakeup, we end up restoring more
than what would typically be required, but that's still ok since we
would be erring on the side of caution.

Programming the PSSCR to any other value might have safety
concerns. Eg: Suppose a core which was in stop11 got woken up by a
special wakeup and if the psscr programmed via stop API was stop4 then
the firmware will put the core in stop4. Now, since stop4 doesn't lose
timebase and stop11 does, in the aforementioned case TB would have
gone out of sync in the duration that the core was in stop11. Thus,
when the core wakes up in the kernel in stop4, the kernel won't resync
the TB which is a problematic.

>
> Thanks,
> Nick
>