Previously, if an older kernel ran on newer firmware, it could enable all
available stop states irrespective of whether it was capable of handling them.
The new device tree format adds a compatible string, so that only a kernel
which is capable of handling a given version of a stop state will enable it.
Older kernels will still see stop0 and stop0_lite in the older format, and we
will deprecate that format after some time.
1) The idea is to bump up the version string in firmware whenever we find a bug
or regression in a stop state. A fix will be provided in Linux which knows
about the bumped-up version of the stop state, whereas a kernel without the
fix will ignore that state.
2) Slowly deprecate the cpuidle/cpuhotplug latency threshold which is hard-coded
into the cpuidle-powernv driver. Instead, use compatible strings to indicate
whether an idle state is suitable for cpuidle and for hotplug.
New idle state device tree format:
power-mgt {
	...
	ibm,enabled-stop-levels = <0xec000000>;
	ibm,cpu-idle-state-psscr-mask = <0x0 0x3003ff 0x0 0x3003ff>;
	ibm,cpu-idle-state-latencies-ns = <0x3e8 0x7d0>;
	ibm,cpu-idle-state-psscr = <0x0 0x330 0x0 0x300330>;
	ibm,cpu-idle-state-flags = <0x100000 0x101000>;
	ibm,cpu-idle-state-residency-ns = <0x2710 0x4e20>;
	ibm,idle-states {
		stop4 {
			flags = <0x207000>;
			compatible = "ibm,state-v1",
				     "cpuidle",
				     "opal-supported";
			psscr-mask = <0x0 0x3003ff>;
			handle = <0x102>;
			latency-ns = <0x186a0>;
			residency-ns = <0x989680>;
			psscr = <0x0 0x300374>;
		};
		...
		stop11 {
			...
			compatible = "ibm,state-v1",
				     "cpuoffline",
				     "opal-supported";
			...
		};
	};
The skiboot patch set for the device tree is posted here:
https://patchwork.ozlabs.org/project/skiboot/list/?series=58934
High-level parsing algorithm:
Say the known version string is "ibm,state-v1".
for each stop state node in device tree:
if (compatible has known version string)
kernel takes care of stop-transitions
else if (compatible has "opal-supported")
OPAL takes care of stop-transitions
else
Skip All deeper states
When a state has neither a known version string nor OPAL support, the kernel
cannot handle transitions for it; it is still possible to exit idle from a
shallower state, hence all deeper states are skipped.
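As a rough C sketch of this decision (classify_stop_node() is a hypothetical
helper name; the actual parsing is done in pnv_parse_cpuidle_dt() in patch 1):

	#include <linux/of.h>

	enum stop_handling { STOP_KERNEL, STOP_OPAL, STOP_SKIP_ALL };

	static enum stop_handling classify_stop_node(struct device_node *dt_node)
	{
		/* Kernel knows this version: it handles stop entry/exit itself. */
		if (of_device_is_compatible(dt_node, "ibm,state-v1"))
			return STOP_KERNEL;
		/* Unknown version, but OPAL can drive the stop transitions. */
		if (of_device_is_compatible(dt_node, "opal-supported"))
			return STOP_OPAL;
		/* Neither: skip this state and all deeper ones. */
		return STOP_SKIP_ALL;
	}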
How does it work?
-------------------
Consider a case where stop4 has a bug. We take the following steps to
mitigate the problem.
1) Change the compatible string for stop4 in OPAL to "ibm,state-v2" and
remove "opal-supported". Ship the new firmware.
The kernel ignores stop4 and all deeper states, but still has the shallower
states. This prevents stop states from being disabled completely.
2) Implement a workaround in OPAL and add back "opal-supported". Ship the new
firmware. The kernel uses OPAL for stop transitions, which now has the
workaround implemented. We get stop4 and deeper states working without kernel
changes or backports (and in considerably less time).
3) Implement the workaround in the kernel and add "ibm,state-v2" to its known
versions (see the sketch below). The kernel will now be able to handle stop4
and deeper states natively.
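A minimal sketch of how step 3 might extend the known_versions table introduced
in patch 1 (parse_dt_v2 is a hypothetical parser for the bumped version; only
the table change is shown):

	static int parse_dt_v1(struct device_node *np);
	static int parse_dt_v2(struct device_node *np);	/* hypothetical */

	struct stop_version_t known_versions[] = {
		{
			.name = "ibm,state-v1",
			.parser_fn = parse_dt_v1,
		},
		{
			.name = "ibm,state-v2",	/* version bumped for the fix */
			.parser_fn = parse_dt_v2,
		},
	};
	const int nr_known_versions = 2;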
This series also includes Abhishek's RFC, which was posted here:
https://patchwork.ozlabs.org/patch/947568/
This patch-set is on top of mpe-next
Abhishek Goel (1):
cpuidle/powernv: Conditionally save-restore sprs using opal
Akshay Adiga (2):
cpuidle/powernv: Add support for states with ibm,cpuidle-state-v1
powernv/cpuidle: Pass pointers instead of values to stop loop
arch/powerpc/include/asm/cpuidle.h | 12 +
arch/powerpc/include/asm/opal-api.h | 4 +-
arch/powerpc/include/asm/opal.h | 3 +
arch/powerpc/include/asm/paca.h | 5 +-
arch/powerpc/include/asm/processor.h | 9 +-
arch/powerpc/kernel/asm-offsets.c | 3 +
arch/powerpc/kernel/idle_book3s.S | 55 ++++-
arch/powerpc/platforms/powernv/idle.c | 214 +++++++++++++++---
.../powerpc/platforms/powernv/opal-wrappers.S | 2 +
arch/powerpc/xmon/xmon.c | 14 +-
drivers/cpuidle/cpuidle-powernv.c | 66 +++---
11 files changed, 304 insertions(+), 83 deletions(-)
--
2.18.0.rc2.85.g1fb9df7
This patch adds support for the new device-tree format for idle state
description.
Previously, if an older kernel ran on newer firmware, it could enable all
available stop states irrespective of whether it was capable of handling them.
The new device tree format adds a compatible string, so that only a kernel
which is capable of handling a given version of a stop state will enable it.
Older kernels will still see stop0 and stop0_lite in the older format, and we
will deprecate that format after some time.
1) The idea is to bump up the version in firmware whenever we find a bug or
regression in a stop state. A fix will be provided in Linux which knows
about the bumped-up version of the stop state, whereas a kernel without the
fix will ignore that state.
2) Slowly deprecate the cpuidle/cpuhotplug latency threshold which is hard-coded
into the cpuidle-powernv driver. Instead, use compatible strings to indicate
whether an idle state is suitable for cpuidle and for hotplug.
New idle state device tree format:
power-mgt {
	...
	ibm,enabled-stop-levels = <0xec000000>;
	ibm,cpu-idle-state-psscr-mask = <0x0 0x3003ff 0x0 0x3003ff>;
	ibm,cpu-idle-state-latencies-ns = <0x3e8 0x7d0>;
	ibm,cpu-idle-state-psscr = <0x0 0x330 0x0 0x300330>;
	ibm,cpu-idle-state-flags = <0x100000 0x101000>;
	ibm,cpu-idle-state-residency-ns = <0x2710 0x4e20>;
	ibm,idle-states {
		stop4 {
			flags = <0x207000>;
			compatible = "ibm,state-v1",
				     "cpuidle",
				     "opal-supported";
			psscr-mask = <0x0 0x3003ff>;
			handle = <0x102>;
			latency-ns = <0x186a0>;
			residency-ns = <0x989680>;
			psscr = <0x0 0x300374>;
		};
		...
		stop11 {
			...
			compatible = "ibm,state-v1",
				     "cpuoffline",
				     "opal-supported";
			...
		};
	};
compatible strings:
"cpuidle"        : indicates the state should be used by the cpuidle driver
"cpuoffline"     : indicates the state should be used by the hotplug (cpu-offline) path
"ibm,state-v1"   : the kernel checks whether it knows about this version
"opal-supported" : indicates the kernel can fall back to OPAL
                   for stop transitions
Signed-off-by: Akshay Adiga <[email protected]>
---
arch/powerpc/include/asm/cpuidle.h | 11 ++
arch/powerpc/platforms/powernv/idle.c | 139 +++++++++++++++++++++++++-
drivers/cpuidle/cpuidle-powernv.c | 50 +++++----
3 files changed, 175 insertions(+), 25 deletions(-)
diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 43e5f31fe64d..b965066560cc 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -79,17 +79,28 @@ struct stop_sprs {
u64 mmcra;
};
+enum idle_state_type_t {
+ CPUIDLE_TYPE,
+ CPUOFFLINE_TYPE
+};
+
+
+#define POWERNV_THRESHOLD_LATENCY_NS 200000
+#define PNV_VER_NAME_LEN 32
#define PNV_IDLE_NAME_LEN 16
struct pnv_idle_states_t {
char name[PNV_IDLE_NAME_LEN];
+ char version[PNV_VER_NAME_LEN];
u32 latency_ns;
u32 residency_ns;
u64 psscr_val;
u64 psscr_mask;
u32 flags;
+ enum idle_state_type_t type;
bool valid;
};
+
extern struct pnv_idle_states_t *pnv_idle_states;
extern int nr_pnv_idle_states;
extern u32 pnv_fastsleep_workaround_at_entry[];
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 7cf71b3e03a1..93accece92e3 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -47,6 +47,19 @@ static u64 pnv_default_stop_val;
static u64 pnv_default_stop_mask;
static bool default_stop_found;
+static int parse_dt_v1(struct device_node *np);
+struct stop_version_t {
+ const char name[PNV_VER_NAME_LEN];
+ int (*parser_fn)(struct device_node *np);
+};
+struct stop_version_t known_versions[] = {
+ {
+ .name = "ibm,state-v1",
+ .parser_fn = parse_dt_v1,
+ }
+};
+const int nr_known_versions = 1;
+
/*
* First deep stop state. Used to figure out when to save/restore
* hypervisor context.
@@ -659,8 +672,14 @@ static int __init pnv_power9_idle_init(void)
state->valid = false;
report_invalid_psscr_val(state->psscr_val, err);
continue;
+ } else {
+ state->valid = true;
}
+ /*
+ * We pick state with highest residency. We dont care if
+ * its a cpuidle state or a cpuoffline state.
+ */
if (max_residency_ns < state->residency_ns) {
max_residency_ns = state->residency_ns;
pnv_deepest_stop_psscr_val = state->psscr_val;
@@ -720,6 +739,73 @@ static void __init pnv_probe_idle_states(void)
supported_cpuidle_states |= pnv_idle_states[i].flags;
}
+static int parse_dt_v1(struct device_node *dt_node)
+{
+ const char *temp_str;
+ int rc;
+ int i = nr_pnv_idle_states;
+
+ if (!dt_node) {
+ pr_err("Invalid device_node\n");
+ return -EINVAL;
+ }
+
+ rc = of_property_read_string(dt_node, "name", &temp_str);
+ if (rc) {
+ pr_err("error reading names rc= %d\n", rc);
+ return -EINVAL;
+ } else {
+ strncpy(pnv_idle_states[i].name, temp_str,
+ PNV_IDLE_NAME_LEN);
+ }
+ rc = of_property_read_u32(dt_node, "residency-ns",
+ &pnv_idle_states[i].residency_ns);
+ if (rc) {
+ pr_err("error reading residency rc= %d\n", rc);
+ return -EINVAL;
+ }
+ rc = of_property_read_u32(dt_node, "latency-ns",
+ &pnv_idle_states[i].latency_ns);
+ if (rc) {
+ pr_err("error reading latency rc= %d\n", rc);
+ return -EINVAL;
+ }
+ rc = of_property_read_u32(dt_node, "flags",
+ &pnv_idle_states[i].flags);
+ if (rc) {
+ pr_err("error reading flags rc= %d\n", rc);
+ return -EINVAL;
+ }
+
+ /* We are not expecting power8 device-tree in this format */
+ rc = of_property_read_u64(dt_node, "psscr-mask",
+ &pnv_idle_states[i].psscr_mask);
+ if (rc) {
+ pr_err("error reading psscr-mask rc= %d\n", rc);
+ return -EINVAL;
+ }
+ rc = of_property_read_u64(dt_node, "psscr",
+ &pnv_idle_states[i].psscr_val);
+ if (rc) {
+ pr_err("error reading psscr rc= %d\n", rc);
+ return -EINVAL;
+ }
+
+ /*
+ * TODO : save the version strings in data structure
+ */
+ if (of_device_is_compatible(dt_node, "cpuidle"))
+ pnv_idle_states[i].type = CPUIDLE_TYPE;
+ else if (of_device_is_compatible(dt_node, "cpuoffline"))
+ pnv_idle_states[i].type = CPUOFFLINE_TYPE;
+ else {
+ pr_err("Invalid type skipping %s\n",
+ pnv_idle_states[i].name);
+ return -EINVAL;
+ }
+ return 0;
+
+}
/*
* This function parses device-tree and populates all the information
* into pnv_idle_states structure. It also sets up nr_pnv_idle_states
@@ -728,8 +814,9 @@ static void __init pnv_probe_idle_states(void)
static int pnv_parse_cpuidle_dt(void)
{
- struct device_node *np;
+ struct device_node *np, *np1, *dt_node;
int nr_idle_states, i;
+ int additional_states = 0;
int rc = 0;
u32 *temp_u32;
u64 *temp_u64;
@@ -742,9 +829,14 @@ static int pnv_parse_cpuidle_dt(void)
}
nr_idle_states = of_property_count_u32_elems(np,
"ibm,cpu-idle-state-flags");
-
- pnv_idle_states = kcalloc(nr_idle_states, sizeof(*pnv_idle_states),
- GFP_KERNEL);
+ np1 = of_find_node_by_path("/ibm,opal/power-mgt/ibm,idle-states");
+ if (np1) {
+ for_each_child_of_node(np1, dt_node)
+ additional_states++;
+ }
+ pr_info("states in new format : %d\n", additional_states);
+ pnv_idle_states = kcalloc(nr_idle_states + additional_states,
+ sizeof(*pnv_idle_states), GFP_KERNEL);
temp_u32 = kcalloc(nr_idle_states, sizeof(u32), GFP_KERNEL);
temp_u64 = kcalloc(nr_idle_states, sizeof(u64), GFP_KERNEL);
temp_string = kcalloc(nr_idle_states, sizeof(char *), GFP_KERNEL);
@@ -823,8 +915,45 @@ static int pnv_parse_cpuidle_dt(void)
for (i = 0; i < nr_idle_states; i++)
strncpy(pnv_idle_states[i].name, temp_string[i],
PNV_IDLE_NAME_LEN);
+
+ /* Mark states as CPUIDLE_TYPE /CPUOFFLINE for older version*/
+ for (i = 0; i < nr_idle_states; i++) {
+ if (pnv_idle_states[i].latency_ns > POWERNV_THRESHOLD_LATENCY_NS)
+ pnv_idle_states[i].type = CPUOFFLINE_TYPE;
+ else
+ pnv_idle_states[i].type = CPUIDLE_TYPE;
+ }
+
+ /* Setting up global count of parsed state */
nr_pnv_idle_states = nr_idle_states;
- rc = 0;
+
+ /* Parsing node-based idle states device-tree format */
+ if (!np1) {
+ pr_info("dt does not contain ibm,idle-states\n");
+ goto out;
+ }
+ /* Parse each child node with appropriate parser_fn */
+ for_each_child_of_node(np1, dt_node) {
+ bool found_known_version = false;
+ /* we don't have state falling back to opal*/
+ for (i = 0; i < nr_known_versions ; i++) {
+ if (of_device_is_compatible(dt_node, known_versions[i].name)) {
+ rc = known_versions[i].parser_fn(dt_node);
+ if (rc) {
+ pr_err("%s could not parse\n",known_versions[i].name);
+ continue;
+ }
+ found_known_version = true;
+ }
+ }
+
+ if (!found_known_version) {
+ pr_info("Unsupported state, skipping all further state\n");
+ goto out;
+ }
+ nr_pnv_idle_states++;
+ }
+
out:
kfree(temp_u32);
kfree(temp_u64);
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 47ac37d6c443..f5579f0369d1 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -26,7 +26,6 @@
* Expose only those Hardware idle states via the cpuidle framework
* that have latency value below POWERNV_THRESHOLD_LATENCY_NS.
*/
-#define POWERNV_THRESHOLD_LATENCY_NS 200000
static struct cpuidle_driver powernv_idle_driver = {
.name = "powernv_idle",
@@ -266,7 +265,7 @@ extern u32 pnv_get_supported_cpuidle_states(void);
static int powernv_add_idle_states(void)
{
int nr_idle_states = 1; /* Snooze */
- int dt_idle_states;
+ int dt_idle_states = 0;
u32 has_stop_states = 0;
int i;
u32 supported_flags = pnv_get_supported_cpuidle_states();
@@ -277,16 +276,19 @@ static int powernv_add_idle_states(void)
pr_warn("cpuidle-powernv : Only Snooze is available\n");
goto out;
}
-
- /* TODO: Count only states which are eligible for cpuidle */
- dt_idle_states = nr_pnv_idle_states;
-
+ /* Count only cpuidle states*/
+ for (i = 0; i < nr_pnv_idle_states; i++) {
+ if (pnv_idle_states[i].type == CPUIDLE_TYPE)
+ dt_idle_states++;
+ }
+ pr_info("idle states in dt = %d , states with idle flag = %d",
+ nr_pnv_idle_states, dt_idle_states);
/*
* Since snooze is used as first idle state, max idle states allowed is
* CPUIDLE_STATE_MAX -1
*/
- if (nr_pnv_idle_states > CPUIDLE_STATE_MAX - 1) {
- pr_warn("cpuidle-powernv: discovered idle states more than allowed");
+ if (dt_idle_states > CPUIDLE_STATE_MAX - 1) {
+ pr_warn("cpuidle-powernv: discovered idle states > allowed");
dt_idle_states = CPUIDLE_STATE_MAX - 1;
}
@@ -297,24 +299,28 @@ static int powernv_add_idle_states(void)
has_stop_states = (pnv_idle_states[0].flags &
(OPAL_PM_STOP_INST_FAST | OPAL_PM_STOP_INST_DEEP));
- for (i = 0; i < dt_idle_states; i++) {
+ for (i = 0; i < nr_pnv_idle_states; i++) {
unsigned int exit_latency, target_residency;
bool stops_timebase = false;
struct pnv_idle_states_t *state = &pnv_idle_states[i];
-
/*
- * Skip the platform idle state whose flag isn't in
- * the supported_cpuidle_states flag mask.
+ * For the older device-tree format the state is marked
+ * CPUOFFLINE_TYPE if its latency exceeds
+ * POWERNV_THRESHOLD_LATENCY_NS, and CPUIDLE_TYPE otherwise.
*/
- if ((state->flags & supported_flags) != state->flags)
+ if (state->type != CPUIDLE_TYPE) {
+ pr_info("State %d is not a cpuidle state, it is of type %d\n", i,
+ state->type);
continue;
+ }
/*
- * If an idle state has exit latency beyond
- * POWERNV_THRESHOLD_LATENCY_NS then don't use it
- * in cpu-idle.
+ * Skip the platform idle state whose flag isn't in
+ * the supported_cpuidle_states flag mask.
*/
- if (state->latency_ns > POWERNV_THRESHOLD_LATENCY_NS)
+ if ((state->flags & supported_flags) != state->flags) {
+ pr_warn("State %d does not have a supported flag\n", i);
continue;
+ }
/*
* Firmware passes residency and latency values in ns.
* cpuidle expects it in us.
@@ -322,8 +328,10 @@ static int powernv_add_idle_states(void)
exit_latency = DIV_ROUND_UP(state->latency_ns, 1000);
target_residency = DIV_ROUND_UP(state->residency_ns, 1000);
- if (has_stop_states && !(state->valid))
- continue;
+ if (has_stop_states && !(state->valid)) {
+ pr_warn("State %d is invalid\n", i);
+ continue;
+ }
if (state->flags & OPAL_PM_TIMEBASE_STOP)
stops_timebase = true;
@@ -365,8 +373,10 @@ static int powernv_add_idle_states(void)
state->psscr_mask);
}
#endif
- else
+ else {
+ pr_warn("cpuidle-powernv : could not add state\n");
continue;
+ }
nr_idle_states++;
}
out:
--
2.18.0.rc2.85.g1fb9df7
Pass a pointer to the pnv_idle_states_t entry instead of the psscr value and
mask. This lets us pass more information to the stop loop, which helps in
figuring out the method used to enter/exit the idle state.
Signed-off-by: Akshay Adiga <[email protected]>
---
arch/powerpc/include/asm/processor.h | 3 +-
arch/powerpc/platforms/powernv/idle.c | 58 ++++++++++++---------------
drivers/cpuidle/cpuidle-powernv.c | 16 +++-----
3 files changed, 32 insertions(+), 45 deletions(-)
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 5debe337ea9d..34f572056add 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -515,8 +515,7 @@ extern unsigned long power7_idle_insn(unsigned long type); /* PNV_THREAD_NAP/etc
extern void power7_idle_type(unsigned long type);
extern unsigned long power9_idle_stop(unsigned long psscr_val);
extern unsigned long power9_offline_stop(unsigned long psscr_val);
-extern void power9_idle_type(unsigned long stop_psscr_val,
- unsigned long stop_psscr_mask);
+extern void power9_idle_type(int index);
extern void flush_instruction_cache(void);
extern void hard_reset_now(void);
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 93accece92e3..a6ef9b68e27b 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -43,8 +43,7 @@ int nr_pnv_idle_states;
* The default stop state that will be used by ppc_md.power_save
* function on platforms that support stop instruction.
*/
-static u64 pnv_default_stop_val;
-static u64 pnv_default_stop_mask;
+struct pnv_idle_states_t *pnv_default_state;
static bool default_stop_found;
static int parse_dt_v1(struct device_node *np);
@@ -70,9 +69,7 @@ u64 pnv_first_deep_stop_state = MAX_STOP_STATE;
* psscr value and mask of the deepest stop idle state.
* Used when a cpu is offlined.
*/
-static u64 pnv_deepest_stop_psscr_val;
-static u64 pnv_deepest_stop_psscr_mask;
-static u64 pnv_deepest_stop_flag;
+static struct pnv_idle_states_t *pnv_deepest_state;
static bool deepest_stop_found;
static int pnv_save_sprs_for_deep_states(void)
@@ -92,7 +89,7 @@ static int pnv_save_sprs_for_deep_states(void)
uint64_t hid5_val = mfspr(SPRN_HID5);
uint64_t hmeer_val = mfspr(SPRN_HMEER);
uint64_t msr_val = MSR_IDLE;
- uint64_t psscr_val = pnv_deepest_stop_psscr_val;
+ uint64_t psscr_val = pnv_deepest_state->psscr_val;
for_each_present_cpu(cpu) {
uint64_t pir = get_hard_smp_processor_id(cpu);
@@ -218,18 +215,15 @@ static void pnv_alloc_idle_core_states(void)
pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
if (cpu_has_feature(CPU_FTR_ARCH_300) &&
- (pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
+ (pnv_deepest_state->flags & OPAL_PM_LOSE_FULL_CONTEXT)) {
/*
* Use the default stop state for CPU-Hotplug
* if available.
*/
if (default_stop_found) {
- pnv_deepest_stop_psscr_val =
- pnv_default_stop_val;
- pnv_deepest_stop_psscr_mask =
- pnv_default_stop_mask;
+ pnv_deepest_state = pnv_default_state;
pr_warn("cpuidle-powernv: Offlined CPUs will stop with psscr = 0x%016llx\n",
- pnv_deepest_stop_psscr_val);
+ pnv_deepest_state->psscr_val);
} else { /* Fallback to snooze loop for CPU-Hotplug */
deepest_stop_found = false;
pr_warn("cpuidle-powernv: Offlined CPUs will busy wait\n");
@@ -365,20 +359,20 @@ void power7_idle(void)
power7_idle_type(PNV_THREAD_NAP);
}
-static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
- unsigned long stop_psscr_mask)
+static unsigned long __power9_idle_type(struct pnv_idle_states_t *state)
{
- unsigned long psscr;
+ unsigned long psscr, stop_psscr_mask, stop_psscr_val;
unsigned long srr1;
+ stop_psscr_mask = state->psscr_mask;
+ stop_psscr_val = state->psscr_val;
if (!prep_irq_for_idle_irqsoff())
return 0;
psscr = mfspr(SPRN_PSSCR);
psscr = (psscr & ~stop_psscr_mask) | stop_psscr_val;
-
__ppc64_runlatch_off();
- srr1 = power9_idle_stop(psscr);
+ srr1 = power9_idle_stop(psscr, state->opal_supported);
__ppc64_runlatch_on();
fini_irq_for_idle_irqsoff();
@@ -386,12 +380,11 @@ static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
return srr1;
}
-void power9_idle_type(unsigned long stop_psscr_val,
- unsigned long stop_psscr_mask)
+void power9_idle_type(struct pnv_idle_states_t *state)
{
unsigned long srr1;
+ srr1 = __power9_idle_type(state);
- srr1 = __power9_idle_type(stop_psscr_val, stop_psscr_mask);
irq_set_pending_from_srr1(srr1);
}
@@ -400,7 +393,10 @@ void power9_idle_type(unsigned long stop_psscr_val,
*/
void power9_idle(void)
{
- power9_idle_type(pnv_default_stop_val, pnv_default_stop_mask);
+ unsigned long srr1;
+
+ srr1 = __power9_idle_type(pnv_default_state);
+ irq_set_pending_from_srr1(srr1);
}
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -500,6 +496,7 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
unsigned long srr1;
u32 idle_states = pnv_get_supported_cpuidle_states();
u64 lpcr_val;
+ struct pnv_idle_states_t *state = pnv_deepest_state;
/*
* We don't want to take decrementer interrupts while we are
@@ -520,9 +517,8 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
unsigned long psscr;
psscr = mfspr(SPRN_PSSCR);
- psscr = (psscr & ~pnv_deepest_stop_psscr_mask) |
- pnv_deepest_stop_psscr_val;
- srr1 = power9_offline_stop(psscr);
+ psscr = (psscr & ~state->psscr_mask) | state->psscr_val;
+ srr1 = power9_offline_stop(psscr, state->opal_supported);
} else if ((idle_states & OPAL_PM_WINKLE_ENABLED) &&
(idle_states & OPAL_PM_LOSE_FULL_CONTEXT)) {
@@ -682,16 +678,13 @@ static int __init pnv_power9_idle_init(void)
*/
if (max_residency_ns < state->residency_ns) {
max_residency_ns = state->residency_ns;
- pnv_deepest_stop_psscr_val = state->psscr_val;
- pnv_deepest_stop_psscr_mask = state->psscr_mask;
- pnv_deepest_stop_flag = state->flags;
+ pnv_deepest_state = state;
deepest_stop_found = true;
}
if (!default_stop_found &&
(state->flags & OPAL_PM_STOP_INST_FAST)) {
- pnv_default_stop_val = state->psscr_val;
- pnv_default_stop_mask = state->psscr_mask;
+ pnv_default_state = state;
default_stop_found = true;
}
}
@@ -701,15 +694,16 @@ static int __init pnv_power9_idle_init(void)
} else {
ppc_md.power_save = power9_idle;
pr_info("cpuidle-powernv: Default stop: psscr = 0x%016llx,mask=0x%016llx\n",
- pnv_default_stop_val, pnv_default_stop_mask);
+ pnv_default_state->psscr_val,
+ pnv_default_state->psscr_mask);
}
if (unlikely(!deepest_stop_found)) {
pr_warn("cpuidle-powernv: No suitable stop state for CPU-Hotplug. Offlined CPUs will busy wait");
} else {
pr_info("cpuidle-powernv: Deepest stop: psscr = 0x%016llx,mask=0x%016llx\n",
- pnv_deepest_stop_psscr_val,
- pnv_deepest_stop_psscr_mask);
+ pnv_deepest_state->psscr_val,
+ pnv_deepest_state->psscr_mask);
}
pr_info("cpuidle-powernv: Requested Level (RL) value of first deep stop = 0x%llx\n",
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index f5579f0369d1..b8cb377b774b 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -35,13 +35,7 @@ static struct cpuidle_driver powernv_idle_driver = {
static int max_idle_state __read_mostly;
static struct cpuidle_state *cpuidle_state_table __read_mostly;
-struct stop_psscr_table {
- u64 val;
- u64 mask;
-};
-
-static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly;
-
+struct pnv_idle_states_t idx_to_state_ptr[CPUIDLE_STATE_MAX] __read_mostly;
static u64 default_snooze_timeout __read_mostly;
static bool snooze_timeout_en __read_mostly;
@@ -143,8 +137,9 @@ static int stop_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
{
- power9_idle_type(stop_psscr_table[index].val,
- stop_psscr_table[index].mask);
+ struct pnv_idle_states_t state;
+ state = idx_to_state_ptr[index];
+ power9_idle_type(state);
return index;
}
@@ -243,8 +238,6 @@ static inline void add_powernv_state(int index, const char *name,
powernv_states[index].exit_latency = exit_latency;
powernv_states[index].enter = idle_fn;
/* For power8 and below psscr_* will be 0 */
- stop_psscr_table[index].val = psscr_val;
- stop_psscr_table[index].mask = psscr_mask;
}
/*
@@ -371,6 +364,7 @@ static int powernv_add_idle_states(void)
target_residency, exit_latency,
state->psscr_val,
state->psscr_mask);
+ idx_to_state_ptr[nr_idle_states] = state;
}
#endif
else {
--
2.18.0.rc2.85.g1fb9df7
From: Abhishek Goel <[email protected]>
If a state has the "opal-supported" compatible flag in the device tree, an OPAL
call needs to be made during entry to and exit from the stop state. This patch
passes a hint to power9_idle_stop and power9_offline_stop.
This patch moves the saving and restoring of SPRs for POWER9 cpuidle
from the kernel to OPAL. It still uses the existing code to detect the
first thread in the core.
In an attempt to make the powernv idle code backward compatible,
and to some extent forward compatible, add support for pre-stop entry
and post-stop exit actions in OPAL. If a kernel knows about this
OPAL call, then only firmware support for the newer hardware is required,
instead of waiting for kernel updates.
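For illustration, a rough C-level sketch of the intended call pattern, based on
the opal_cpuidle_save()/opal_cpuidle_restore() prototypes added below (the
patch itself makes these calls from idle_book3s.S; enter_stop() and
srr1_lost_state() are hypothetical helpers, and state, cpu, psscr and srr1 are
as in the surrounding idle code):

	if (state->req_opal_call)
		opal_cpuidle_save(paca_ptrs[cpu]->opal_stop_sprs,
				  SCOPE_CORE, psscr);

	srr1 = enter_stop(psscr);		/* executes PPC_STOP */

	if (state->req_opal_call && srr1_lost_state(srr1))
		opal_cpuidle_restore(paca_ptrs[cpu]->opal_stop_sprs,
				     SCOPE_CORE, mfspr(SPRN_PSSCR), srr1);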
Signed-off-by: Abhishek Goel <[email protected]>
Signed-off-by: Akshay Adiga <[email protected]>
---
arch/powerpc/include/asm/cpuidle.h | 1 +
arch/powerpc/include/asm/opal-api.h | 4 +-
arch/powerpc/include/asm/opal.h | 3 +
arch/powerpc/include/asm/paca.h | 5 +-
arch/powerpc/include/asm/processor.h | 6 +-
arch/powerpc/kernel/asm-offsets.c | 3 +
arch/powerpc/kernel/idle_book3s.S | 55 ++++++++++++++++++-
arch/powerpc/platforms/powernv/idle.c | 41 ++++++++++----
.../powerpc/platforms/powernv/opal-wrappers.S | 2 +
arch/powerpc/xmon/xmon.c | 14 ++---
10 files changed, 109 insertions(+), 25 deletions(-)
diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index b965066560cc..2fb2324d15fc 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -96,6 +96,7 @@ struct pnv_idle_states_t {
u64 psscr_val;
u64 psscr_mask;
u32 flags;
+ bool req_opal_call;
enum idle_state_type_t type;
bool valid;
};
diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 3bab299eda49..6792a737bc9a 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -208,7 +208,9 @@
#define OPAL_SENSOR_READ_U64 162
#define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164
#define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165
-#define OPAL_LAST 165
+#define OPAL_IDLE_SAVE 168
+#define OPAL_IDLE_RESTORE 169
+#define OPAL_LAST 169
#define QUIESCE_HOLD 1 /* Spin all calls at entry */
#define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index e1b2910c6e81..12d57aeacde2 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -356,6 +356,9 @@ extern void opal_kmsg_init(void);
extern int opal_event_request(unsigned int opal_event_nr);
+extern int opal_cpuidle_save(u64 *stop_sprs, int scope, u64 psscr);
+extern int opal_cpuidle_restore(u64 *stop_sprs, int scope, u64 psscr, u64 srr1);
+
struct opal_sg_list *opal_vmalloc_to_sg_list(void *vmalloc_addr,
unsigned long vmalloc_size);
void opal_free_sg_list(struct opal_sg_list *sg);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 6d34bd71139d..586059594443 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -195,11 +195,14 @@ struct paca_struct {
/* The PSSCR value that the kernel requested before going to stop */
u64 requested_psscr;
+ u64 wakeup_psscr;
+ u8 req_opal_call;
/*
- * Save area for additional SPRs that need to be
+ * Save area for SPRs that need to be
* saved/restored during cpuidle stop.
*/
struct stop_sprs stop_sprs;
+ u64 *opal_stop_sprs;
#endif
#ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 34f572056add..9f9fb1f11dd6 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -513,8 +513,10 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
extern int powersave_nap; /* set if nap mode can be used in idle loop */
extern unsigned long power7_idle_insn(unsigned long type); /* PNV_THREAD_NAP/etc*/
extern void power7_idle_type(unsigned long type);
-extern unsigned long power9_idle_stop(unsigned long psscr_val);
-extern unsigned long power9_offline_stop(unsigned long psscr_val);
+extern unsigned long power9_idle_stop(unsigned long psscr_val,
+ bool opal_enabled);
+extern unsigned long power9_offline_stop(unsigned long psscr_val,
+ bool opal_enabled);
extern void power9_idle_type(int index);
extern void flush_instruction_cache(void);
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 262c44a90ea1..740ae068ec74 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -768,6 +768,9 @@ int main(void)
OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
OFFSET(PACA_DONT_STOP, paca_struct, dont_stop);
+ OFFSET(PACA_WAKEUP_PSSCR, paca_struct, wakeup_psscr);
+ OFFSET(PACA_REQ_OPAL_CALL, paca_struct, req_opal_call);
+ OFFSET(STOP_SPRS, paca_struct, opal_stop_sprs);
#define STOP_SPR(x, f) OFFSET(x, paca_struct, stop_sprs.f)
STOP_SPR(STOP_PID, pid);
STOP_SPR(STOP_LDBAR, ldbar);
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index e734f6e45abc..b5f5245011a5 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -45,6 +45,9 @@
#define PSSCR_EC_ESL_MASK_SHIFTED (PSSCR_EC | PSSCR_ESL) >> 16
+#define SCOPE_CORE 0
+#define SCOPE_THREAD 1
+
.text
/*
@@ -388,7 +391,18 @@ lwarx_loop_stop:
bne- lwarx_loop_stop
isync
+ ld r6,PACA_REQ_OPAL_CALL(r13)
+ cmpwi r6,1
+ beq opal_save
bl save_sprs_to_stack
+ PPC_STOP
+ b .
+
+opal_save:
+ ld r3,STOP_SPRS(r13)
+ li r4,SCOPE_CORE
+ ld r5,PACA_REQ_PSSCR(r13)
+ bl opal_cpuidle_save
PPC_STOP /* Does not return (system reset interrupt) */
@@ -435,13 +449,14 @@ _GLOBAL(power9_offline_stop)
* between threads, but in that case KVM has a barrier sync in real
* mode before and after switching between radix and hash.
*/
- li r4,KVM_HWTHREAD_IN_IDLE
- stb r4,HSTATE_HWTHREAD_STATE(r13)
+ li r5,KVM_HWTHREAD_IN_IDLE
+ stb r5,HSTATE_HWTHREAD_STATE(r13)
#endif
/* fall through */
_GLOBAL(power9_idle_stop)
std r3, PACA_REQ_PSSCR(r13)
+ std r4, PACA_REQ_OPAL_CALL(r13)
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
BEGIN_FTR_SECTION
sync
@@ -566,6 +581,8 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
#endif
/* Return SRR1 from power7_nap() */
+ rlwinm r11,r12,47-31,30,31
+ cmpwi cr3,r11,2
blt cr3,pnv_wakeup_noloss
b pnv_wakeup_loss
@@ -576,6 +593,8 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
* cr3 - set to gt if waking up with partial/complete hypervisor state loss
*/
pnv_restore_hyp_resource_arch300:
+ mfspr r5, SPRN_PSSCR
+ std r5, PACA_WAKEUP_PSSCR(r13)
/*
* Workaround for POWER9, if we lost resources, the ERAT
* might have been mixed up and needs flushing. We also need
@@ -831,7 +850,24 @@ subcore_state_restored:
bne cr2,clear_lock
first_thread_in_core:
+ ld r6,PACA_REQ_OPAL_CALL(r13)
+ cmpwi r6,1
+ beq opal_restore_core
+ b kernel_restore_core
+
+opal_restore_core:
+ ld r3,STOP_SPRS(r13)
+ li r4,SCOPE_CORE
+ ld r5,PACA_WAKEUP_PSSCR(r13)
+ mr r6,r19 /*r19 contains SRR1*/
+ bl opal_cpuidle_restore
+ ld r1,PACAR1(r13)
+ xoris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
+ lwsync
+ stw r15,0(r14)
+ b hypervisor_state_restored
+kernel_restore_core:
/*
* First thread in the core waking up from any state which can cause
* partial or complete hypervisor state loss. It needs to
@@ -885,6 +921,21 @@ clear_lock:
stw r15,0(r14)
common_exit:
+ ld r6,PACA_REQ_OPAL_CALL(r13)
+ cmpwi r6,1
+ beq opal_restore_thread
+ b kernel_restore_thread
+
+opal_restore_thread:
+ ld r3,STOP_SPRS(r13)
+ li r4,SCOPE_THREAD
+ ld r5,PACA_WAKEUP_PSSCR(r13)
+ mr r6,r19 /*r19 contains SRR1*/
+ bl opal_cpuidle_restore
+ ld r1,PACAR1(r13)
+ b hypervisor_state_restored
+
+kernel_restore_thread:
/*
* Common to all threads.
*
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index a6ef9b68e27b..e5d38524aec3 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -147,6 +147,7 @@ static int pnv_save_sprs_for_deep_states(void)
return 0;
}
+#define MAX_STOP_SPRS_COUNT 25
static void pnv_alloc_idle_core_states(void)
{
int i, j;
@@ -186,6 +187,9 @@ static void pnv_alloc_idle_core_states(void)
for (j = 0; j < threads_per_core; j++) {
int cpu = first_cpu + j;
+ paca_ptrs[cpu]->opal_stop_sprs = kmalloc_node(
+ MAX_STOP_SPRS_COUNT * sizeof(u64),
+ GFP_KERNEL, node);
paca_ptrs[cpu]->core_idle_state_ptr = core_idle_state;
paca_ptrs[cpu]->thread_idle_state = PNV_THREAD_RUNNING;
paca_ptrs[cpu]->thread_mask = 1 << j;
@@ -372,7 +376,7 @@ static unsigned long __power9_idle_type(struct pnv_idle_states_t *state)
psscr = mfspr(SPRN_PSSCR);
psscr = (psscr & ~stop_psscr_mask) | stop_psscr_val;
__ppc64_runlatch_off();
- srr1 = power9_idle_stop(psscr, state->opal_supported);
+ srr1 = power9_idle_stop(psscr, state->req_opal_call);
__ppc64_runlatch_on();
fini_irq_for_idle_irqsoff();
@@ -518,7 +522,7 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
psscr = mfspr(SPRN_PSSCR);
psscr = (psscr & ~state->psscr_mask) | state->psscr_val;
- srr1 = power9_offline_stop(psscr, state->opal_supported);
+ srr1 = power9_offline_stop(psscr, state->req_opal_call);
} else if ((idle_states & OPAL_PM_WINKLE_ENABLED) &&
(idle_states & OPAL_PM_LOSE_FULL_CONTEXT)) {
@@ -815,6 +819,7 @@ static int pnv_parse_cpuidle_dt(void)
u32 *temp_u32;
u64 *temp_u64;
const char **temp_string;
+ bool fall_back_to_opal = false;
np = of_find_node_by_path("/ibm,opal/power-mgt");
if (!np) {
@@ -929,21 +934,33 @@ static int pnv_parse_cpuidle_dt(void)
/* Parse each child node with appropriate parser_fn */
for_each_child_of_node(np1, dt_node) {
bool found_known_version = false;
- /* we don't have state falling back to opal*/
- for (i = 0; i < nr_known_versions ; i++) {
- if (of_device_is_compatible(dt_node, known_versions[i].name)) {
- rc = known_versions[i].parser_fn(dt_node);
- if (rc) {
- pr_err("%s could not parse\n",known_versions[i].name);
- continue;
+ int idx = nr_pnv_idle_states;
+ if (!fall_back_to_opal) {
+ /* we don't have state falling back to opal*/
+ for (i = 0; i < nr_known_versions ; i++) {
+ if (of_device_is_compatible(dt_node, known_versions[i].name)) {
+ rc = known_versions[i].parser_fn(dt_node);
+ if (rc) {
+ pr_err("%s could not parse\n",known_versions[i].name);
+ continue;
+ }
+ found_known_version = true;
}
- found_known_version = true;
}
}
-
- if (!found_known_version) {
+ if (!found_known_version || fall_back_to_opal) {
+ if (of_device_is_compatible(dt_node, "opal-supported")) {
+ rc = known_versions[0].parser_fn(dt_node);
+ if (rc) {
+ pr_err("%s could not parse\n", "opal-supported");
+ continue;
+ }
+ pnv_idle_states[idx].req_opal_call = true;
+ fall_back_to_opal = true;
+ } else {
pr_info("Unsupported state, skipping all further state\n");
goto out;
+ }
}
nr_pnv_idle_states++;
}
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index a8d9b4089c31..b75c37d93efd 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -327,3 +327,5 @@ OPAL_CALL(opal_npu_tl_set, OPAL_NPU_TL_SET);
OPAL_CALL(opal_pci_get_pbcq_tunnel_bar, OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
OPAL_CALL(opal_pci_set_pbcq_tunnel_bar, OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
OPAL_CALL(opal_sensor_read_u64, OPAL_SENSOR_READ_U64);
+OPAL_CALL(opal_cpuidle_save, OPAL_IDLE_SAVE);
+OPAL_CALL(opal_cpuidle_restore, OPAL_IDLE_RESTORE);
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 47166ad2a669..8ae34f37e60a 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2431,14 +2431,14 @@ static void dump_one_paca(int cpu)
DUMP(p, subcore_sibling_mask, "%#-*x");
DUMP(p, thread_sibling_pacas, "%-*px");
DUMP(p, requested_psscr, "%#-*llx");
- DUMP(p, stop_sprs.pid, "%#-*llx");
- DUMP(p, stop_sprs.ldbar, "%#-*llx");
- DUMP(p, stop_sprs.fscr, "%#-*llx");
- DUMP(p, stop_sprs.hfscr, "%#-*llx");
- DUMP(p, stop_sprs.mmcr1, "%#-*llx");
- DUMP(p, stop_sprs.mmcr2, "%#-*llx");
- DUMP(p, stop_sprs.mmcra, "%#-*llx");
DUMP(p, dont_stop.counter, "%#-*x");
+
+ /*
+ * TODO Either kernel or opal has sprs stored. If opal stored it,
+ * we can find a way to make the indices available to kernel through
+ * paca.
+ */
+
#endif
DUMP(p, accounting.utime, "%#-*lx");
--
2.18.0.rc2.85.g1fb9df7
On Thu, 2 Aug 2018 10:21:32 +0530
Akshay Adiga <[email protected]> wrote:
> From: Abhishek Goel <[email protected]>
>
> If a state has "opal-supported" compat flag in device-tree, an opal call
> needs to be made during the entry and exit of the stop state. This patch
> passes a hint to the power9_idle_stop and power9_offline_stop.
>
> This patch moves the saving and restoring of sprs for P9 cpuidle
> from kernel to opal. This patch still uses existing code to detect
> first thread in core.
> In an attempt to make the powernv idle code backward compatible,
> and to some extent forward compatible, add support for pre-stop entry
> and post-stop exit actions in OPAL. If a kernel knows about this
> opal call, then just a firmware supporting newer hardware is required,
> instead of waiting for kernel updates.
Still think we should make these do-everything calls. Including
executing nap/stop instructions, restoring timebase, possibly even
saving and restoring SLB (although a return code could be used to
tell the kernel to do that maybe if performance advantage is enough).
I haven't had a lot of time to go through it, I'm working on moving
~all of idle_book3s.S to C code, I'd like to do that before this
OPAL idle driver if possible.
A minor thing I just noticed, you don't have to allocate the opal
spr save space in Linux, just do it all in OPAL.
Thanks,
Nick
Akshay Adiga <[email protected]> writes:
> Previously if a older kernel runs on a newer firmware, it may enable
> all available states irrespective of its capability of handling it.
> New device tree format adds a compatible flag, so that only kernel
> which has the capability to handle the version of stop state will enable
> it.
>
> Older kernel will still see stop0 and stop0_lite in older format and we
> will depricate it after some time.
>
> 1) Idea is to bump up the version string in firmware if we find a bug or
> regression in stop states. A fix will be provided in linux which would
> now know about the bumped up version of stop states, where as kernel
> without fixes would ignore the states.
>
> 2) Slowly deprecate cpuidle/cpuhotplug threshold which is hard-coded
> into cpuidle-powernv driver. Instead use compatible strings to indicate
> if idle state is suitable for cpuidle and hotplug.
>
> New idle state device tree format :
> power-mgt {
> ...
> ibm,enabled-stop-levels = <0xec000000>;
> ibm,cpu-idle-state-psscr-mask = <0x0 0x3003ff 0x0 0x3003ff>;
> ibm,cpu-idle-state-latencies-ns = <0x3e8 0x7d0>;
> ibm,cpu-idle-state-psscr = <0x0 0x330 0x0 0x300330>;
> ibm,cpu-idle-state-flags = <0x100000 0x101000>;
> ibm,cpu-idle-state-residency-ns = <0x2710 0x4e20>;
> ibm,idle-states {
> stop4 {
> flags = <0x207000>;
> compatible = "ibm,state-v1",
> "cpuidle",
> "opal-supported";
> psscr-mask = <0x0 0x3003ff>;
> handle = <0x102>;
> latency-ns = <0x186a0>;
> residency-ns = <0x989680>;
> psscr = <0x0 0x300374>;
> };
> ...
> stop11 {
> ...
> compatible = "ibm,state-v1",
> "cpuoffline",
> "opal-supported";
> ...
> };
> };
>
> Skiboot patch-set for device-tree is posted here :
> https://patchwork.ozlabs.org/project/skiboot/list/?series=58934
I don't see a device tree binding documented anywhere?
There is an existing binding defined for ARM chips, presumably it
doesn't do everything we need. But are there good reasons why we are not
using it as a base?
See: Documentation/devicetree/bindings/arm/idle-states.txt
The way you're using compatible is not really consistent with its
traditional meaning.
eg, you have multiple states with:
	compatible = "ibm,state-v1",
		     "cpuoffline",
		     "opal-supported";
This would typically mean that all those states are "compatible" with
some semantics defined by the name "ibm,state-v1". What you're trying to
say (I think) is that each state is "version 1" of *that state*. And
only kernels that understand version 1 should use the state.
And "cpuoffline" and "opal-supported" definitely don't belong in
compatible AFAICS, they should simply be boolean properties of the node.
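For illustration, boolean properties would be consumed on the kernel side with
the standard of_property_read_bool() helper (a minimal sketch, assuming the
current property names):

	bool is_cpuidle     = of_property_read_bool(dt_node, "cpuidle");
	bool is_cpuoffline  = of_property_read_bool(dt_node, "cpuoffline");
	bool opal_supported = of_property_read_bool(dt_node, "opal-supported");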
cheers
Hello Michael,
On Tue, Aug 07, 2018 at 10:15:37PM +1000, Michael Ellerman wrote:
> > Skiboot patch-set for device-tree is posted here :
> > https://patchwork.ozlabs.org/project/skiboot/list/?series=58934
>
> I don't see a device tree binding documented anywhere?
>
> There is an existing binding defined for ARM chips, presumably it
> doesn't do everything we need. But are there good reasons why we are not
> using it as a base?
>
> See: Documentation/devicetree/bindings/arm/idle-states.txt
>
In the case of ARM, the idle-states node is a child of the cpus node. Each
child of the idle-states node describes a particular idle state.
idle-states {
	entry-method = "psci";

	CPU_RETENTION_0_0: cpu-retention-0-0 {
		compatible = "arm,idle-state";
		arm,psci-suspend-param = <0x0010000>;
		entry-latency-us = <20>;
		exit-latency-us = <40>;
		min-residency-us = <80>;
		status = "disabled"
	};

	CPU_SLEEP_0_0: cpu-sleep-0-0 {
		compatible = "arm,idle-state";
		local-timer-stop;
		arm,psci-suspend-param = <0x0010000>;
		entry-latency-us = <250>;
		exit-latency-us = <500>;
		min-residency-us = <950>;
		status = "okay"
	};
	.
	.
	.
}
Furthermore, each CPU can have a different set of cpu-idle states
due to the asymmetric nature of the processor units on the board.
Thus, there is an additional property for each cpu called
cpu-idle-states which points to the containers of the idle states
themselves.
cpus {
	#size-cells = <0>;
	#address-cells = <2>;

	CPU0: cpu@0 {
		device_type = "cpu";
		compatible = "arm,cortex-a57";
		reg = <0x0 0x0>;
		enable-method = "psci";
		cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
				   &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
	};
	. . .
	. . .
	. . .
	CPU8: cpu@100000000 {
		device_type = "cpu";
		compatible = "arm,cortex-a53";
		reg = <0x1 0x0>;
		enable-method = "psci";
		cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
				   &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
	};
In our case, we already have an "ibm,opal/power-mgt/" node in the
device tree where we have defined the idle states so far. This was the
reason to put the new device tree format under this existing node,
which has been specially earmarked for power-management-related bits,
instead of defining the new format under the cpus node.
Also, in our case, since all the CPU nodes are symmetric, they will
have the same set of idle states. Hence, we don't need the
"cpu-idle-states" property for each CPU.
As for the properties of the idle states themselves, the only things in
common between the ARM idle-states and our case are compatible,
exit-latency-us and min-residency-us. In addition to these we need the
flags which indicate the nature of the resource loss (hypervisor
state loss, timebase loss, etc.), plus the psscr_val and psscr_mask
corresponding to the stop states, which the ARM device tree doesn't
provide.
For this reason we have opted for a new binding, since the overlap
between the two platforms is minimal.
>
> The way you're using compatible is not really consistent with its
> traditional meaning.
>
> eg, you have multiple states with:
>
> compatible = "ibm,state-v1",
> "cpuoffline",
> "opal-supported";
>
>
> This would typically mean that all those state are all "compatible" with
> some semantics defined by the name "ibm,state-v1". What you're trying to
> say (I think) is that each state is "version 1" of *that state*. And
> only kernels that understand version 1 should use the state.
Ok, I see what you mean here. Perhaps we should have had something
like "ibm,stop0-v1", "ibm,stop1-v2", "ibm,stop2-v2" etc., where
version 1, version 2 etc. pertain to the versions of those specific
states.
Thus a kernel that knows about "version 1" of stop0 and stop2 and
"version 2" of stop1 will end up using only stop0 and stop1, since it
doesn't know "version 2" of stop2.
In such a case, the kernel should fall back to OPAL for stop2 (a rough
sketch of such a per-state check follows below). Does this make sense?
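A hypothetical kernel-side sketch of that per-state versioning scheme (the
state names and version strings here are purely illustrative):

	/* Versions of each state this kernel can natively handle. */
	static const char * const known_state_versions[] = {
		"ibm,stop0-v1",
		"ibm,stop1-v2",
		"ibm,stop2-v1",
	};

	static bool kernel_handles_state(struct device_node *dt_node)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(known_state_versions); i++)
			if (of_device_is_compatible(dt_node, known_state_versions[i]))
				return true;
		/* Unknown version: caller falls back to OPAL if supported. */
		return false;
	}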
>
> And "cpuoffline" and "opal-supported" definitely don't belong in
> compatible AFAICS, they should simply be boolean properties of the
> node.
I agree. These should be flags.
>
> cheers
>
Hello Nicholas,
On Fri, Aug 03, 2018 at 12:05:47AM +1000, Nicholas Piggin wrote:
> On Thu, 2 Aug 2018 10:21:32 +0530
> Akshay Adiga <[email protected]> wrote:
>
> > From: Abhishek Goel <[email protected]>
> >
> > If a state has "opal-supported" compat flag in device-tree, an opal call
> > needs to be made during the entry and exit of the stop state. This patch
> > passes a hint to the power9_idle_stop and power9_offline_stop.
> >
> > This patch moves the saving and restoring of sprs for P9 cpuidle
> > from kernel to opal. This patch still uses existing code to detect
> > first thread in core.
> > In an attempt to make the powernv idle code backward compatible,
> > and to some extent forward compatible, add support for pre-stop entry
> > and post-stop exit actions in OPAL. If a kernel knows about this
> > opal call, then just a firmware supporting newer hardware is required,
> > instead of waiting for kernel updates.
>
> Still think we should make these do-everything calls. Including
> executing nap/stop instructions, restoring timebase, possibly even
> saving and restoring SLB (although a return code could be used to
> tell the kernel to do that maybe if performance advantage is
enough).
So, if we execute the stop instruction in OPAL, the wakeup from stop
still happens at the hypervisor system reset vector at 0x100. On wakeup, we
need to check SRR1 to see if we have lost state, in which case the stop exit
also needs to be handled inside OPAL. On return from this OPAL call, we
need to unwind the extra stack frame that would have been created when
the kernel entered OPAL to execute the stop, from which there was no
return. In the case where a lossy stop state was requested but the wakeup
happened from a lossless stop state, this adds additional overhead.
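For reference, a rough C equivalent of the SRR1 check being described (the
series does this in asm with "rlwinm r11,r12,47-31,30,31" followed by
"cmpwi cr3,r11,2"; the bit positions assume the POWER9 SRR1 wake-state
field in bits 46:47):

	/* roughly: 0b01 = no state loss, 0b10/0b11 = some/all state lost */
	unsigned long wake_state = (srr1 >> 16) & 0x3;

	if (wake_state >= 2) {
		/* State was lost: the full restore path (here, the OPAL
		 * restore call) must run before returning to the caller. */
	}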
Furthermore, measurements show that performing the restore of the resources in
OPAL rather than in the kernel on wakeup from stop adds an additional 5-10us.
For the current stop states that lose hypervisor state, since the latency is
relatively high (100s of us), this is a relatively small penalty (~1%).
However, if in future we have states that lose only a part of the
hypervisor state in order to provide a wakeup latency on the order of a few
tens of microseconds, the additional latency caused by the OPAL call would
become noticeable, no?
>
> I haven't had a lot of time to go through it, I'm working on moving
> ~all of idle_book3s.S to C code, I'd like to do that before this
> OPAL idle driver if possible.
>
> A minor thing I just noticed, you don't have to allocate the opal
> spr save space in Linux, just do it all in OPAL.
The idea was to not leave any state in OPAL, as OPAL is supposed to be
stateless. However, I agree that if OPAL is not going to interpret
the contents of the save area, it should be harmless to move that bit
into OPAL.
That said, if we are going to add the logic of determining the first
thread in the core waking up, etc, then we have no choice but to
maintain that state in OPAL.
>
> Thanks,
> Nick
>
--
Thanks and Regards
gautham.
On Wed, 8 Aug 2018 21:11:16 +0530
Gautham R Shenoy <[email protected]> wrote:
> Hello Nicholas,
>
> On Fri, Aug 03, 2018 at 12:05:47AM +1000, Nicholas Piggin wrote:
> > On Thu, 2 Aug 2018 10:21:32 +0530
> > Akshay Adiga <[email protected]> wrote:
> >
> > > From: Abhishek Goel <[email protected]>
> > >
> > > If a state has "opal-supported" compat flag in device-tree, an opal call
> > > needs to be made during the entry and exit of the stop state. This patch
> > > passes a hint to the power9_idle_stop and power9_offline_stop.
> > >
> > > This patch moves the saving and restoring of sprs for P9 cpuidle
> > > from kernel to opal. This patch still uses existing code to detect
> > > first thread in core.
> > > In an attempt to make the powernv idle code backward compatible,
> > > and to some extent forward compatible, add support for pre-stop entry
> > > and post-stop exit actions in OPAL. If a kernel knows about this
> > > opal call, then just a firmware supporting newer hardware is required,
> > > instead of waiting for kernel updates.
> >
> > Still think we should make these do-everything calls. Including
> > executing nap/stop instructions, restoring timebase, possibly even
> > saving and restoring SLB (although a return code could be used to
> > tell the kernel to do that maybe if performance advantage is
> enough).
>
> So, if we execute the stop instruction in opal, the wakeup from stop
> still happens at the hypervisor 0x100. On wake up, we need to check
> SRR1 to see if we have lost state, in which case, the stop exit also
> needs to be handled inside opal.
Yes. That's okay, SRR1 seems to be pretty well architected.
> On return from this opal call, we
> need to unwind the extra stack frame that would have been created when
> kernel entered opal to execute the stop from which there was no
> return. In the case where a lossy stop state was requested, but wakeup
> happened from a lossless stop state, this adds additional overhead.
True, but you're going from 1 OPAL call to 2. So you still have that
overhead. Although possibly we could implement some special light
weight stackless calls (I'm thinking about doing that for MCE handling
too). Or you could perhaps just discard the stack without needing to
unwind anything in the case of a lossless wakeup.
>
> Furthermore, the measurements show that the additional time taken to
> perform the restore of the resources in OPAL vs doing so in Kernel on
> wakeup from stop takes additional 5-10us. For the current stop states
> that lose hypervisor state, since the latency is relatively high (100s
> of us), this is a relatively small penalty (~1%) .
Yeah OPAL is pretty heavy to enter. We can improve that a bit. But
yes for P10 timeframe it may be still heavy weight.
>
> However, in future if we do have states that lose only a part of
> hypervisor state to provide a wakeup latency in the order of few tens
> of microseconds the additional latency caused by OPAL call would
> become noticable, no ?
I think so long as we can do shallow states in Linux it really won't
be that big a deal (even if we don't do any of the above speedup
tricks).
I think it's really desirable to have a complete firmware
implementation. Having this compromise seems like the worst of both
in a way (does not allow firmware to control everything, and does
not have great performance).
>
>
> >
> > I haven't had a lot of time to go through it, I'm working on moving
> > ~all of idle_book3s.S to C code, I'd like to do that before this
> > OPAL idle driver if possible.
> >
> > A minor thing I just noticed, you don't have to allocate the opal
> > spr save space in Linux, just do it all in OPAL.
>
> The idea was to not leave any state in OPAL, as OPAL is supposed to be
> state-less. However, I agree, that if OPAL is not going to interpret
> the contents of the save/area, it should be harmless to move that bit
> into OPAL.
>
> That said, if we are going to add the logic of determining the first
> thread in the core waking up, etc, then we have no choice but to
> maintain that state in OPAL.
I don't think it's such a problem for particular very carefully
defined cases like this.
Thanks,
Nick