2014-12-04 07:28:57

by Shreyas B. Prabhu

Subject: [PATCH v3 0/4] powernv: cpuidle: Redesign idle states management

Deep idle states like sleep and winkle are per core idle states. A core
enters these states only when all the threads enter either the particular
idle state or a deeper one. Tasks like the fastsleep hardware bug
workaround and hypervisor core state save have to be done only by the
last thread of the core entering a deep idle state; similarly, tasks
like timebase resync and hypervisor core register restore have to be
done only by the first thread waking up from these states.

The current idle state management does not have a way to distinguish the
first/last thread of the core waking/entering idle states. Tasks like
timebase resync are done for all the threads. This is not only
suboptimal, but can also cause functionality issues when subcores are
involved.

Winkle is a deeper idle state than fastsleep. In this state the power
supply to the chiplet, i.e. the core, private L2 and private L3, is turned
off. This results in a total hypervisor state loss. This patch set adds
support for winkle and provides a way to track the idle states of the
threads of the core, using that tracking to manage the sleep and winkle
idle states.
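
For reference, the tracking this series adds boils down to one shared
per-core word, pointed to from each thread's paca. A minimal C sketch of
its layout and of the last-thread test (the helper below is illustrative;
the real code does the update with lwarx/stwcx. in assembly):

	/*
	 * core_idle_state: bits 0-7 hold one bit per thread, set while
	 * the thread is running and cleared when it enters sleep/winkle.
	 * Bit 8 is a lock bit, held while one thread saves or restores
	 * core state or runs the fastsleep workaround.
	 */
	#define PNV_CORE_IDLE_LOCK_BIT		0x100
	#define PNV_CORE_IDLE_THREAD_BITS	0x0FF

	/* Sketch only: true if we are the last thread entering idle */
	static bool last_thread_entering(u32 state, u8 thread_mask)
	{
		return ((state & ~thread_mask) & PNV_CORE_IDLE_THREAD_BITS) == 0;
	}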

TBD:
----
- Remove duplication of branching to kvm code.

Changes in v3:
-------------
- Added barriers after lock
- Added a paca field that stores the thread mask.
- Changed code structure around the fastsleep workaround to allow patching
it out if the platform does not require it.
- Threads waiting on core_idle_state lock now loop in HMT_LOW
- Using NV CRs to avoid save/restore of CR while making OPAL calls.
- Fixed a couple of flow issues in the path where the fastsleep workaround
is not needed
- Using PPC_LR_STKOFF instead of _LINK in opal_call_realmode
- Restoring WORT and WORC

Changes in v2:
--------------
-Using PNV_THREAD_NAP/SLEEP defines while calling power7_powersave_common
-Comment changes based on review
-Rebased on top of 3.18-rc6


Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Vaidyanathan Srinivasan <[email protected]>
Cc: Preeti U Murthy <[email protected]>

Paul Mackerras (1):
powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle
mode

Preeti U. Murthy (1):
powerpc/powernv: Enable Offline CPUs to enter deep idle states

Shreyas B. Prabhu (2):
powernv: cpuidle: Redesign idle states management
powernv: powerpc: Add winkle support for offline cpus

arch/powerpc/include/asm/cpuidle.h | 14 ++
arch/powerpc/include/asm/opal.h | 13 +
arch/powerpc/include/asm/paca.h | 6 +
arch/powerpc/include/asm/ppc-opcode.h | 2 +
arch/powerpc/include/asm/processor.h | 1 +
arch/powerpc/include/asm/reg.h | 4 +
arch/powerpc/kernel/asm-offsets.c | 6 +
arch/powerpc/kernel/cpu_setup_power.S | 4 +
arch/powerpc/kernel/exceptions-64s.S | 30 ++-
arch/powerpc/kernel/idle_power7.S | 332 +++++++++++++++++++++----
arch/powerpc/platforms/powernv/opal-wrappers.S | 39 +++
arch/powerpc/platforms/powernv/powernv.h | 2 +
arch/powerpc/platforms/powernv/setup.c | 160 ++++++++++++
arch/powerpc/platforms/powernv/smp.c | 10 +-
arch/powerpc/platforms/powernv/subcore.c | 34 +++
arch/powerpc/platforms/powernv/subcore.h | 1 +
drivers/cpuidle/cpuidle-powernv.c | 10 +-
17 files changed, 608 insertions(+), 60 deletions(-)
create mode 100644 arch/powerpc/include/asm/cpuidle.h

--
1.9.3


2014-12-04 07:29:05

by Shreyas B. Prabhu

Subject: [PATCH v3 1/4] powerpc: powernv: Switch off MMU before entering nap/sleep/rvwinkle mode

From: Paul Mackerras <[email protected]>

Currently, when going idle, we set the flag indicating that we are in
nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
(or sleep or rvwinkle) instruction, all with the MMU on. This is bad
for two reasons: (a) the architecture specifies that those instructions
must be executed with the MMU off, and in fact with only the SF, HV, ME
and possibly RI bits set, and (b) this introduces a race, because as
soon as we set the flag, another thread can switch the MMU to a guest
context. If the race is lost, this thread will typically start looping
on relocation-on ISIs at 0xc...4400.

This fixes it by setting the MSR as required by the architecture before
setting the flag or executing the nap/sleep/rvwinkle instruction.
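
In rough C-like pseudocode (the helpers here are illustrative, not real
kernel functions), the ordering change is:

	/* Before: flag set and nap executed with the MMU still on */
	paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_NAP;
	/* window: another thread can now switch the MMU to a guest
	 * context and we start looping on relocation-on ISIs */
	nap();

	/* After: get to the architected MSR first (SF|HV|ME, RI clear),
	 * then set the flag and nap, all in real mode */
	rfid_to_real_mode();
	paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_NAP;
	nap();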

[ [email protected]: Edited to handle LE ]
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Shreyas B. Prabhu <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: [email protected]
---
arch/powerpc/include/asm/reg.h | 2 ++
arch/powerpc/kernel/idle_power7.S | 18 +++++++++++++++++-
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index c998279..a68ee15 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -118,8 +118,10 @@
#define __MSR (MSR_ME | MSR_RI | MSR_IR | MSR_DR | MSR_ISF |MSR_HV)
#ifdef __BIG_ENDIAN__
#define MSR_ __MSR
+#define MSR_IDLE (MSR_ME | MSR_SF | MSR_HV)
#else
#define MSR_ (__MSR | MSR_LE)
+#define MSR_IDLE (MSR_ME | MSR_SF | MSR_HV | MSR_LE)
#endif
#define MSR_KERNEL (MSR_ | MSR_64BIT)
#define MSR_USER32 (MSR_ | MSR_PR | MSR_EE)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index c0754bb..283c603 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -101,7 +101,23 @@ _GLOBAL(power7_powersave_common)
std r9,_MSR(r1)
std r1,PACAR1(r13)

-_GLOBAL(power7_enter_nap_mode)
+ /*
+ * Go to real mode to do the nap, as required by the architecture.
+ * Also, we need to be in real mode before setting hwthread_state,
+ * because as soon as we do that, another thread can switch
+ * the MMU context to the guest.
+ */
+ LOAD_REG_IMMEDIATE(r5, MSR_IDLE)
+ li r6, MSR_RI
+ andc r6, r9, r6
+ LOAD_REG_ADDR(r7, power7_enter_nap_mode)
+ mtmsrd r6, 1 /* clear RI before setting SRR0/1 */
+ mtspr SPRN_SRR0, r7
+ mtspr SPRN_SRR1, r5
+ rfid
+
+ .globl power7_enter_nap_mode
+power7_enter_nap_mode:
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
/* Tell KVM we're napping */
li r4,KVM_HWTHREAD_IN_NAP
--
1.9.3

2014-12-04 07:29:15

by Shreyas B. Prabhu

Subject: [PATCH v3 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states

From: "Preeti U. Murthy" <[email protected]>

The secondary threads should enter deep idle states so as to gain maximum
powersavings when the entire core is offline. To do so the offline path
must be made aware of the available deepest idle state. Hence probe the
device tree for the possible idle states in powernv core code and
expose the deepest idle state through flags.

Since the device tree is probed by the cpuidle driver as well, move
the parameters required to discover the idle states into a place common
to both the driver and the powernv core code.

Another point is that fastsleep idle state may require workarounds in
the kernel to function properly. This workaround is introduced in the
subsequent patches. However, neither the cpuidle driver nor the hotplug
path needs to be bothered about this workaround.

It will be taken care of by the core powernv code.

Originally-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Preeti U. Murthy <[email protected]>
Signed-off-by: Shreyas B. Prabhu <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/powerpc/include/asm/opal.h | 8 ++++++
arch/powerpc/platforms/powernv/powernv.h | 2 ++
arch/powerpc/platforms/powernv/setup.c | 49 ++++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/smp.c | 7 ++++-
drivers/cpuidle/cpuidle-powernv.c | 9 ++----
5 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 9124b0e..f8b95c0 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -155,6 +155,14 @@ struct opal_sg_list {
#define OPAL_REGISTER_DUMP_REGION 101
#define OPAL_UNREGISTER_DUMP_REGION 102

+/* Device tree flags */
+
+/* Flags set in power-mgmt nodes in device tree if
+ * respective idle states are supported in the platform.
+ */
+#define OPAL_PM_NAP_ENABLED 0x00010000
+#define OPAL_PM_SLEEP_ENABLED 0x00020000
+
#ifndef __ASSEMBLY__

#include <linux/notifier.h>
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index 6c8e2d1..604c48e 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -29,6 +29,8 @@ static inline u64 pnv_pci_dma_get_required_mask(struct pci_dev *pdev)
}
#endif

+extern u32 pnv_get_supported_cpuidle_states(void);
+
extern void pnv_lpc_init(void);

bool cpu_core_split_required(void);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 3f9546d..34c6665 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -290,6 +290,55 @@ static void __init pnv_setup_machdep_rtas(void)
}
#endif /* CONFIG_PPC_POWERNV_RTAS */

+static u32 supported_cpuidle_states;
+
+u32 pnv_get_supported_cpuidle_states(void)
+{
+ return supported_cpuidle_states;
+}
+
+static int __init pnv_init_idle_states(void)
+{
+ struct device_node *power_mgt;
+ int dt_idle_states;
+ const __be32 *idle_state_flags;
+ u32 len_flags, flags;
+ int i;
+
+ supported_cpuidle_states = 0;
+
+ if (cpuidle_disable != IDLE_NO_OVERRIDE)
+ return 0;
+
+ if (!firmware_has_feature(FW_FEATURE_OPALv3))
+ return 0;
+
+ power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+ if (!power_mgt) {
+ pr_warn("opal: PowerMgmt Node not found\n");
+ return 0;
+ }
+
+ idle_state_flags = of_get_property(power_mgt,
+ "ibm,cpu-idle-state-flags", &len_flags);
+ if (!idle_state_flags) {
+ pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+ return 0;
+ }
+
+ dt_idle_states = len_flags / sizeof(u32);
+
+ for (i = 0; i < dt_idle_states; i++) {
+ flags = be32_to_cpu(idle_state_flags[i]);
+ supported_cpuidle_states |= flags;
+ }
+
+ return 0;
+}
+
+subsys_initcall(pnv_init_idle_states);
+
+
static int __init pnv_probe(void)
{
unsigned long root = of_get_flat_dt_root();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 4753958..3dc4cec 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -149,6 +149,7 @@ static int pnv_smp_cpu_disable(void)
static void pnv_smp_cpu_kill_self(void)
{
unsigned int cpu;
+ u32 idle_states;

/* Standard hot unplug procedure */
local_irq_disable();
@@ -159,13 +160,17 @@ static void pnv_smp_cpu_kill_self(void)
generic_set_cpu_dead(cpu);
smp_wmb();

+ idle_states = pnv_get_supported_cpuidle_states();
/* We don't want to take decrementer interrupts while we are offline,
* so clear LPCR:PECE1. We keep PECE2 enabled.
*/
mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
while (!generic_check_cpu_restart(cpu)) {
ppc64_runlatch_off();
- power7_nap(1);
+ if (idle_states & OPAL_PM_SLEEP_ENABLED)
+ power7_sleep();
+ else
+ power7_nap(1);
ppc64_runlatch_on();

/* Clear the IPI that woke us up */
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 7d3a349..0a7d827 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -16,13 +16,10 @@

#include <asm/machdep.h>
#include <asm/firmware.h>
+#include <asm/opal.h>
#include <asm/runlatch.h>

-/* Flags and constants used in PowerNV platform */
-
#define MAX_POWERNV_IDLE_STATES 8
-#define IDLE_USE_INST_NAP 0x00010000 /* Use nap instruction */
-#define IDLE_USE_INST_SLEEP 0x00020000 /* Use sleep instruction */

struct cpuidle_driver powernv_idle_driver = {
.name = "powernv_idle",
@@ -198,7 +195,7 @@ static int powernv_add_idle_states(void)
* target residency to be 10x exit_latency
*/
latency_ns = be32_to_cpu(idle_state_latency[i]);
- if (flags & IDLE_USE_INST_NAP) {
+ if (flags & OPAL_PM_NAP_ENABLED) {
/* Add NAP state */
strcpy(powernv_states[nr_idle_states].name, "Nap");
strcpy(powernv_states[nr_idle_states].desc, "Nap");
@@ -211,7 +208,7 @@ static int powernv_add_idle_states(void)
nr_idle_states++;
}

- if (flags & IDLE_USE_INST_SLEEP) {
+ if (flags & OPAL_PM_SLEEP_ENABLED) {
/* Add FASTSLEEP state */
strcpy(powernv_states[nr_idle_states].name, "FastSleep");
strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
--
1.9.3

2014-12-04 07:29:20

by Shreyas B. Prabhu

Subject: [PATCH v3 3/4] powernv: cpuidle: Redesign idle states management

Deep idle states like sleep and winkle are per core idle states. A core
enters these states only when all the threads enter either the
particular idle state or a deeper one. There are tasks like fastsleep
hardware bug workaround and hypervisor core state save which have to be
done only by the last thread of the core entering deep idle state and
similarly tasks like timebase resync, hypervisor core register restore
that have to be done only by the first thread waking up from these
states.

The current idle state management does not have a way to distinguish the
first/last thread of the core waking/entering idle states. Tasks like
timebase resync are done for all the threads. This is not only
suboptimal, but can also cause functionality issues when subcores and
KVM are involved.

This patch adds the necessary infrastructure to track idle states of
threads in a per-core structure. It uses this info to perform tasks like
fastsleep workaround and timebase resync only once per core.
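
In C-like pseudocode, the entry-side protocol the assembly implements is
roughly the following (cmpxchg stands in for the lwarx/stwcx. loop, and
apply_fastsleep_workaround() paraphrases the OPAL call):

	do {
		old = *core_idle_state;
		new = old & ~thread_mask;	/* clear our thread bit */
		last = !(new & PNV_CORE_IDLE_THREAD_BITS);
		if (last)
			new |= PNV_CORE_IDLE_LOCK_BIT;	/* take the lock */
	} while (cmpxchg(core_idle_state, old, new) != old);

	if (last) {
		apply_fastsleep_workaround();	/* only if platform needs it */
		clear_lock_bit(core_idle_state);
	}
	enter_sleep();

The wakeup side mirrors this: the first thread out takes the lock bit,
undoes the workaround and resyncs the timebase; every other thread just
sets its thread bit back, spinning at low priority while the lock is held.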

Signed-off-by: Shreyas B. Prabhu <[email protected]>
Originally-by: Preeti U. Murthy <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/powerpc/include/asm/cpuidle.h | 20 +++
arch/powerpc/include/asm/opal.h | 2 +
arch/powerpc/include/asm/paca.h | 6 +
arch/powerpc/kernel/asm-offsets.c | 6 +
arch/powerpc/kernel/exceptions-64s.S | 24 ++--
arch/powerpc/kernel/idle_power7.S | 188 +++++++++++++++++++------
arch/powerpc/platforms/powernv/opal-wrappers.S | 37 +++++
arch/powerpc/platforms/powernv/setup.c | 47 ++++++-
arch/powerpc/platforms/powernv/smp.c | 3 +-
drivers/cpuidle/cpuidle-powernv.c | 3 +-
10 files changed, 277 insertions(+), 59 deletions(-)
create mode 100644 arch/powerpc/include/asm/cpuidle.h

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
new file mode 100644
index 0000000..d2f99ca
--- /dev/null
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_POWERPC_CPUIDLE_H
+#define _ASM_POWERPC_CPUIDLE_H
+
+#ifdef CONFIG_PPC_POWERNV
+/* Used in powernv idle state management */
+#define PNV_THREAD_RUNNING 0
+#define PNV_THREAD_NAP 1
+#define PNV_THREAD_SLEEP 2
+#define PNV_THREAD_WINKLE 3
+#define PNV_CORE_IDLE_LOCK_BIT 0x100
+#define PNV_CORE_IDLE_THREAD_BITS 0x0FF
+
+#ifndef __ASSEMBLY__
+extern u32 pnv_fastsleep_workaround_at_entry[];
+extern u32 pnv_fastsleep_workaround_at_exit[];
+#endif
+
+#endif
+
+#endif
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index f8b95c0..bef7fbc 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -152,6 +152,7 @@ struct opal_sg_list {
#define OPAL_PCI_ERR_INJECT 96
#define OPAL_PCI_EEH_FREEZE_SET 97
#define OPAL_HANDLE_HMI 98
+#define OPAL_CONFIG_CPU_IDLE_STATE 99
#define OPAL_REGISTER_DUMP_REGION 101
#define OPAL_UNREGISTER_DUMP_REGION 102

@@ -162,6 +163,7 @@ struct opal_sg_list {
*/
#define OPAL_PM_NAP_ENABLED 0x00010000
#define OPAL_PM_SLEEP_ENABLED 0x00020000
+#define OPAL_PM_SLEEP_ENABLED_ER1 0x00080000

#ifndef __ASSEMBLY__

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index a5139ea..e4578c3 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -158,6 +158,12 @@ struct paca_struct {
* early exception handler for use by high level C handler
*/
struct opal_machine_check_event *opal_mc_evt;
+
+ /* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
+ u32 *core_idle_state_ptr;
+ u8 thread_idle_state; /* ~Idle[0]/Nap[1]/Sleep[2]/Winkle[3] */
+ /* Mask to indicate thread id in core */
+ u8 thread_mask;
#endif
#ifdef CONFIG_PPC_BOOK3S_64
/* Exclusive emergency stack pointer for machine check exception. */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 9d7dede..3bc0352 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -731,6 +731,12 @@ int main(void)
DEFINE(OPAL_MC_SRR0, offsetof(struct opal_machine_check_event, srr0));
DEFINE(OPAL_MC_SRR1, offsetof(struct opal_machine_check_event, srr1));
DEFINE(PACA_OPAL_MC_EVT, offsetof(struct paca_struct, opal_mc_evt));
+ DEFINE(PACA_CORE_IDLE_STATE_PTR,
+ offsetof(struct paca_struct, core_idle_state_ptr));
+ DEFINE(PACA_THREAD_IDLE_STATE,
+ offsetof(struct paca_struct, thread_idle_state));
+ DEFINE(PACA_THREAD_MASK,
+ offsetof(struct paca_struct, thread_mask));
#endif

return 0;
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 72e783e..7637889 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -15,6 +15,7 @@
#include <asm/hw_irq.h>
#include <asm/exception-64s.h>
#include <asm/ptrace.h>
+#include <asm/cpuidle.h>

/*
* We layout physical memory as follows:
@@ -109,15 +110,19 @@ BEGIN_FTR_SECTION
rlwinm. r13,r13,47-31,30,31
beq 9f

- /* waking up from powersave (nap) state */
- cmpwi cr1,r13,2
- /* Total loss of HV state is fatal, we could try to use the
- * PIR to locate a PACA, then use an emergency stack etc...
- * OPAL v3 based powernv platforms have new idle states
- * which fall in this catagory.
- */
- bgt cr1,8f
+ cmpwi cr3,r13,2
+
GET_PACA(r13)
+ lbz r0,PACA_THREAD_IDLE_STATE(r13)
+ cmpwi cr2,r0,PNV_THREAD_NAP
+ bgt cr2,8f /* Either sleep or Winkle */
+
+ /* Waking up from nap should not cause hypervisor state loss */
+ bgt cr3,.
+
+ /* Waking up from nap */
+ li r0,PNV_THREAD_RUNNING
+ stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */

#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
li r0,KVM_HWTHREAD_IN_KERNEL
@@ -131,7 +136,7 @@ BEGIN_FTR_SECTION
1:
#endif

- beq cr1,2f
+ beq cr3,2f
b power7_wakeup_noloss
2: b power7_wakeup_loss

@@ -1386,6 +1391,7 @@ machine_check_handle_early:
MACHINE_CHECK_HANDLER_WINDUP
GET_PACA(r13)
ld r1,PACAR1(r13)
+ li r3,PNV_THREAD_NAP
b power7_enter_nap_mode
4:
#endif
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 283c603..8c3a1f4 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -18,6 +18,7 @@
#include <asm/hw_irq.h>
#include <asm/kvm_book3s_asm.h>
#include <asm/opal.h>
+#include <asm/cpuidle.h>

#undef DEBUG

@@ -37,8 +38,7 @@

/*
* Pass requested state in r3:
- * 0 - nap
- * 1 - sleep
+ * r3 - PNV_THREAD_NAP/SLEEP/WINKLE
*
* To check IRQ_HAPPENED in r4
* 0 - don't check
@@ -123,12 +123,58 @@ power7_enter_nap_mode:
li r4,KVM_HWTHREAD_IN_NAP
stb r4,HSTATE_HWTHREAD_STATE(r13)
#endif
- cmpwi cr0,r3,1
- beq 2f
+ stb r3,PACA_THREAD_IDLE_STATE(r13)
+ cmpwi cr1,r3,PNV_THREAD_SLEEP
+ bge cr1,2f
IDLE_STATE_ENTER_SEQ(PPC_NAP)
/* No return */
-2: IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
- /* No return */
+2:
+ /* Sleep or winkle */
+ lbz r7,PACA_THREAD_MASK(r13)
+ ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
+lwarx_loop1:
+ lwarx r15,0,r14
+ andc r15,r15,r7 /* Clear thread bit */
+
+ andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
+
+/*
+ * If cr0 = 0, then current thread is the last thread of the core entering
+ * sleep. Last thread needs to execute the hardware bug workaround code if
+ * required by the platform.
+ * Make the workaround call unconditionally here. The below branch call is
+ * patched out when the idle states are discovered if the platform does not
+ * require it.
+ */
+.global pnv_fastsleep_workaround_at_entry
+pnv_fastsleep_workaround_at_entry:
+ beq fastsleep_workaround_at_entry
+
+ stwcx. r15,0,r14
+ isync
+ bne- lwarx_loop1
+
+common_enter: /* common code for all the threads entering sleep */
+ IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+
+fastsleep_workaround_at_entry:
+ ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
+ stwcx. r15,0,r14
+ isync
+ bne- lwarx_loop1
+
+ /* Fast sleep workaround */
+ li r3,1
+ li r4,1
+ li r0,OPAL_CONFIG_CPU_IDLE_STATE
+ bl opal_call_realmode
+
+ /* Clear Lock bit */
+ li r0,0
+ lwsync
+ stw r0,0(r14)
+ b common_enter
+

_GLOBAL(power7_idle)
/* Now check if user or arch enabled NAP mode */
@@ -141,49 +187,16 @@ _GLOBAL(power7_idle)

_GLOBAL(power7_nap)
mr r4,r3
- li r3,0
+ li r3,PNV_THREAD_NAP
b power7_powersave_common
/* No return */

_GLOBAL(power7_sleep)
- li r3,1
+ li r3,PNV_THREAD_SLEEP
li r4,1
b power7_powersave_common
/* No return */

-/*
- * Make opal call in realmode. This is a generic function to be called
- * from realmode from reset vector. It handles endianess.
- *
- * r13 - paca pointer
- * r1 - stack pointer
- * r3 - opal token
- */
-opal_call_realmode:
- mflr r12
- std r12,_LINK(r1)
- ld r2,PACATOC(r13)
- /* Set opal return address */
- LOAD_REG_ADDR(r0,return_from_opal_call)
- mtlr r0
- /* Handle endian-ness */
- li r0,MSR_LE
- mfmsr r12
- andc r12,r12,r0
- mtspr SPRN_HSRR1,r12
- mr r0,r3 /* Move opal token to r0 */
- LOAD_REG_ADDR(r11,opal)
- ld r12,8(r11)
- ld r2,0(r11)
- mtspr SPRN_HSRR0,r12
- hrfid
-
-return_from_opal_call:
- FIXUP_ENDIAN
- ld r0,_LINK(r1)
- mtlr r0
- blr
-
#define CHECK_HMI_INTERRUPT \
mfspr r0,SPRN_SRR1; \
BEGIN_FTR_SECTION_NESTED(66); \
@@ -196,10 +209,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); \
/* Invoke opal call to handle hmi */ \
ld r2,PACATOC(r13); \
ld r1,PACAR1(r13); \
- std r3,ORIG_GPR3(r1); /* Save original r3 */ \
- li r3,OPAL_HANDLE_HMI; /* Pass opal token argument*/ \
+ li r0,OPAL_HANDLE_HMI; /* Pass opal token argument*/ \
bl opal_call_realmode; \
- ld r3,ORIG_GPR3(r1); /* Restore original r3 */ \
20: nop;


@@ -210,12 +221,90 @@ _GLOBAL(power7_wakeup_tb_loss)
BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
+
+ lbz r7,PACA_THREAD_MASK(r13)
+ ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
+lwarx_loop2:
+ lwarx r15,0,r14
+ andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
+ /*
+ * Lock bit is set in one of the 2 cases-
+ * a. In the sleep/winkle enter path, the last thread is executing
+ * fastsleep workaround code.
+ * b. In the wake up path, another thread is executing fastsleep
+ * workaround undo code or resyncing timebase or restoring context
+ * In either case loop until the lock bit is cleared.
+ */
+ bne core_idle_lock_held
+
+ cmpwi cr2,r15,0
+ or r15,r15,r7 /* Set thread bit */
+
+ beq cr2,first_thread
+
+ /* Not first thread in core to wake up */
+ stwcx. r15,0,r14
+ isync
+ bne- lwarx_loop2
+ b common_exit
+
+core_idle_lock_held:
+ HMT_LOW
+core_idle_lock_loop:
+ lwz r15,0(r14)
+ andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
+ bne core_idle_lock_loop
+ HMT_MEDIUM
+ b lwarx_loop2
+
+first_thread:
+ /* First thread in core to wakeup */
+ ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
+ stwcx. r15,0,r14
+ isync
+ bne- lwarx_loop2
+
+ /*
+ * First thread in the core waking up from fastsleep. It needs to
+ * call the fastsleep workaround code if the platform requires it.
+ * Call it unconditionally here. The below branch instruction will
+ * be patched out when the idle states are discovered if platform
+ * does not require workaround.
+ */
+.global pnv_fastsleep_workaround_at_exit
+pnv_fastsleep_workaround_at_exit:
+ b fastsleep_workaround_at_exit
+
+timebase_resync:
+ /* Do timebase resync if we are waking up from sleep. Use cr3 value
+ * set in exceptions-64s.S */
+ ble cr3,clear_lock
/* Time base re-sync */
- li r3,OPAL_RESYNC_TIMEBASE
+ li r0,OPAL_RESYNC_TIMEBASE
bl opal_call_realmode;
-
/* TODO: Check r3 for failure */

+clear_lock:
+ andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
+ lwsync
+ stw r15,0(r14)
+
+common_exit:
+ li r5,PNV_THREAD_RUNNING
+ stb r5,PACA_THREAD_IDLE_STATE(r13)
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ li r0,KVM_HWTHREAD_IN_KERNEL
+ stb r0,HSTATE_HWTHREAD_STATE(r13)
+ /* Order setting hwthread_state vs. testing hwthread_req */
+ sync
+ lbz r0,HSTATE_HWTHREAD_REQ(r13)
+ cmpwi r0,0
+ beq 6f
+ b kvm_start_guest
+6:
+#endif
+
REST_NVGPRS(r1)
REST_GPR(2, r1)
ld r3,_CCR(r1)
@@ -228,6 +317,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
mtspr SPRN_SRR0,r5
rfid

+fastsleep_workaround_at_exit:
+ li r3,1
+ li r4,0
+ li r0,OPAL_CONFIG_CPU_IDLE_STATE
+ bl opal_call_realmode
+ b timebase_resync
+
_GLOBAL(power7_wakeup_loss)
ld r1,PACAR1(r13)
BEGIN_FTR_SECTION
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index feb549a..a0f43e8 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -158,6 +158,43 @@ opal_tracepoint_return:
blr
#endif

+/*
+ * Make opal call in realmode. This is a generic function to be called
+ * from realmode. It handles endianness.
+ *
+ * r13 - paca pointer
+ * r1 - stack pointer
+ * r0 - opal token
+ */
+_GLOBAL(opal_call_realmode)
+ mflr r12
+ std r12,PPC_LR_STKOFF(r1)
+ ld r2,PACATOC(r13)
+ /* Set opal return address */
+ LOAD_REG_ADDR(r12,return_from_opal_call)
+ mtlr r12
+
+ mfmsr r12
+#ifdef __LITTLE_ENDIAN__
+ /* Handle endian-ness */
+ li r11,MSR_LE
+ andc r12,r12,r11
+#endif
+ mtspr SPRN_HSRR1,r12
+ LOAD_REG_ADDR(r11,opal)
+ ld r12,8(r11)
+ ld r2,0(r11)
+ mtspr SPRN_HSRR0,r12
+ hrfid
+
+return_from_opal_call:
+#ifdef __LITTLE_ENDIAN__
+ FIXUP_ENDIAN
+#endif
+ ld r12,PPC_LR_STKOFF(r1)
+ mtlr r12
+ blr
+
OPAL_CALL(opal_invalid_call, OPAL_INVALID_CALL);
OPAL_CALL(opal_console_write, OPAL_CONSOLE_WRITE);
OPAL_CALL(opal_console_read, OPAL_CONSOLE_READ);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 34c6665..97e0279 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -36,6 +36,9 @@
#include <asm/opal.h>
#include <asm/kexec.h>
#include <asm/smp.h>
+#include <asm/cputhreads.h>
+#include <asm/cpuidle.h>
+#include <asm/code-patching.h>

#include "powernv.h"

@@ -292,10 +295,43 @@ static void __init pnv_setup_machdep_rtas(void)

static u32 supported_cpuidle_states;

+static void pnv_alloc_idle_core_states(void)
+{
+ int i, j;
+ int nr_cores = cpu_nr_cores();
+ u32 *core_idle_state;
+
+ /*
+ * core_idle_state - First 8 bits track the idle state of each thread
+ * of the core. The 8th bit is the lock bit. Initially all thread bits
+ * are set. They are cleared when the thread enters deep idle state
+ * like sleep and winkle. Initially the lock bit is cleared.
+ * The lock bit has 2 purposes
+ * a. While the first thread is restoring core state, it prevents
+ * from other threads in the core from switching to prcoess context.
+ * b. While the last thread in the core is saving the core state, it
+ * prevent a different thread from waking up.
+ */
+ for (i = 0; i < nr_cores; i++) {
+ int first_cpu = i * threads_per_core;
+ int node = cpu_to_node(first_cpu);
+
+ core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
+ for (j = 0; j < threads_per_core; j++) {
+ int cpu = first_cpu + j;
+
+ paca[cpu].core_idle_state_ptr = core_idle_state;
+ paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
+ paca[cpu].thread_mask = 1 << (cpu % threads_per_core);
+ }
+ }
+}
+
u32 pnv_get_supported_cpuidle_states(void)
{
return supported_cpuidle_states;
}
+EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states);

static int __init pnv_init_idle_states(void)
{
@@ -332,13 +368,20 @@ static int __init pnv_init_idle_states(void)
flags = be32_to_cpu(idle_state_flags[i]);
supported_cpuidle_states |= flags;
}
-
+ if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
+ patch_instruction(
+ (unsigned int *)pnv_fastsleep_workaround_at_entry,
+ PPC_INST_NOP);
+ patch_instruction(
+ (unsigned int *)pnv_fastsleep_workaround_at_exit,
+ PPC_INST_NOP);
+ }
+ pnv_alloc_idle_core_states();
return 0;
}

subsys_initcall(pnv_init_idle_states);

-
static int __init pnv_probe(void)
{
unsigned long root = of_get_flat_dt_root();
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 3dc4cec..12b761a 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -167,7 +167,8 @@ static void pnv_smp_cpu_kill_self(void)
mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
while (!generic_check_cpu_restart(cpu)) {
ppc64_runlatch_off();
- if (idle_states & OPAL_PM_SLEEP_ENABLED)
+ if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
+ (idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
power7_sleep();
else
power7_nap(1);
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 0a7d827..a489b56 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -208,7 +208,8 @@ static int powernv_add_idle_states(void)
nr_idle_states++;
}

- if (flags & OPAL_PM_SLEEP_ENABLED) {
+ if (flags & OPAL_PM_SLEEP_ENABLED ||
+ flags & OPAL_PM_SLEEP_ENABLED_ER1) {
/* Add FASTSLEEP state */
strcpy(powernv_states[nr_idle_states].name, "FastSleep");
strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
--
1.9.3

2014-12-04 07:29:40

by Shreyas B. Prabhu

Subject: [PATCH v3 4/4] powernv: powerpc: Add winkle support for offline cpus

Winkle is a deep idle state supported in power8 chips. A core enters
winkle when all the threads of the core enter winkle. In this state
power supply to the entire chiplet, i.e. the core, private L2 and private L3,
is turned off. As a result it gives higher powersavings compared to
sleep.

But entering winkle results in a total hypervisor state loss. Hence the
hypervisor context has to be preserved before entering winkle and
restored upon wake up.

The Power-on Reset Engine (PORE) is a dedicated engine which is responsible
for powering on the chiplet during wake up. It can be programmed to
restore the contents of a few specific registers. This patch
uses PORE to restore register state wherever possible and uses the stack to
save and restore the rest of the necessary registers.

Hypervisor state restore falls under three categories:
per-core state, per-subcore state and per-thread state. To manage this,
extend the infrastructure introduced for sleep. Mainly we add a paca
variable subcore_sibling_mask. Using this and the core_idle_state we can
distinguish the first thread in the core and in the subcore.
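
As a rough map of the restore split on wakeup from winkle (a sketch;
restore() and the other helpers are shorthand, the exact SPR lists are in
the patch below):

	if (first_thread_in_subcore)
		restore(SDR1, RPR, AMOR);		/* per-subcore */
	if (first_thread_in_core) {
		resync_timebase();
		restore(TSCR, WORC);			/* per-core */
	}
	/* every thread: */
	__restore_cpu_power8();
	restore_slb_from_paca();
	restore(SPURR, PURR, DSCR, PMC5, PMC6, WORT);	/* per-thread */
	/* HSPRG0, LPCR, HIDs and HMEER come back via the SLW engine,
	 * programmed at boot with opal_slw_set_reg() */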

Signed-off-by: Shreyas B. Prabhu <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: [email protected]
---
arch/powerpc/include/asm/opal.h | 3 +
arch/powerpc/include/asm/paca.h | 2 +
arch/powerpc/include/asm/ppc-opcode.h | 2 +
arch/powerpc/include/asm/processor.h | 1 +
arch/powerpc/include/asm/reg.h | 2 +
arch/powerpc/kernel/asm-offsets.c | 2 +
arch/powerpc/kernel/exceptions-64s.S | 16 ++-
arch/powerpc/kernel/idle_power7.S | 151 +++++++++++++++++++++++--
arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
arch/powerpc/platforms/powernv/setup.c | 73 ++++++++++++
arch/powerpc/platforms/powernv/smp.c | 4 +-
arch/powerpc/platforms/powernv/subcore.c | 34 ++++++
arch/powerpc/platforms/powernv/subcore.h | 1 +
13 files changed, 280 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index bef7fbc..f0ca2d9 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -153,6 +153,7 @@ struct opal_sg_list {
#define OPAL_PCI_EEH_FREEZE_SET 97
#define OPAL_HANDLE_HMI 98
#define OPAL_CONFIG_CPU_IDLE_STATE 99
+#define OPAL_SLW_SET_REG 100
#define OPAL_REGISTER_DUMP_REGION 101
#define OPAL_UNREGISTER_DUMP_REGION 102

@@ -163,6 +164,7 @@ struct opal_sg_list {
*/
#define OPAL_PM_NAP_ENABLED 0x00010000
#define OPAL_PM_SLEEP_ENABLED 0x00020000
+#define OPAL_PM_WINKLE_ENABLED 0x00040000
#define OPAL_PM_SLEEP_ENABLED_ER1 0x00080000

#ifndef __ASSEMBLY__
@@ -972,6 +974,7 @@ int64_t opal_sensor_read(uint32_t sensor_hndl, int token, __be32 *sensor_data);
int64_t opal_handle_hmi(void);
int64_t opal_register_dump_region(uint32_t id, uint64_t start, uint64_t end);
int64_t opal_unregister_dump_region(uint32_t id);
+int64_t opal_slw_set_reg(uint64_t cpu_pir, uint64_t sprn, uint64_t val);
int64_t opal_pci_set_phb_cxl_mode(uint64_t phb_id, uint64_t mode, uint64_t pe_number);

/* Internal functions */
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index e4578c3..e89f4a4 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -164,6 +164,8 @@ struct paca_struct {
u8 thread_idle_state; /* ~Idle[0]/Nap[1]/Sleep[2]/Winkle[3] */
/* Mask to indicate thread id in core */
u8 thread_mask;
+ /* Mask to denote subcore sibling threads */
+ u8 subcore_sibling_mask;
#endif
#ifdef CONFIG_PPC_BOOK3S_64
/* Exclusive emergency stack pointer for machine check exception. */
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 6f85362..5155be7 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -194,6 +194,7 @@

#define PPC_INST_NAP 0x4c000364
#define PPC_INST_SLEEP 0x4c0003a4
+#define PPC_INST_WINKLE 0x4c0003e4

/* A2 specific instructions */
#define PPC_INST_ERATWE 0x7c0001a6
@@ -374,6 +375,7 @@

#define PPC_NAP stringify_in_c(.long PPC_INST_NAP)
#define PPC_SLEEP stringify_in_c(.long PPC_INST_SLEEP)
+#define PPC_WINKLE stringify_in_c(.long PPC_INST_WINKLE)

/* BHRB instructions */
#define PPC_CLRBHRB stringify_in_c(.long PPC_INST_CLRBHRB)
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index dda7ac4..c076842 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -453,6 +453,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
extern int powersave_nap; /* set if nap mode can be used in idle loop */
extern void power7_nap(int check_irq);
extern void power7_sleep(void);
+extern void power7_winkle(void);
extern void flush_instruction_cache(void);
extern void hard_reset_now(void);
extern void poweroff_now(void);
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index a68ee15..1c874fb 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -373,6 +373,7 @@
#define SPRN_DBAT7L 0x23F /* Data BAT 7 Lower Register */
#define SPRN_DBAT7U 0x23E /* Data BAT 7 Upper Register */
#define SPRN_PPR 0x380 /* SMT Thread status Register */
+#define SPRN_TSCR 0x399 /* Thread Switch Control Register */

#define SPRN_DEC 0x016 /* Decrement Register */
#define SPRN_DER 0x095 /* Debug Enable Regsiter */
@@ -730,6 +731,7 @@
#define SPRN_BESCR 806 /* Branch event status and control register */
#define BESCR_GE 0x8000000000000000ULL /* Global Enable */
#define SPRN_WORT 895 /* Workload optimization register - thread */
+#define SPRN_WORC 863 /* Workload optimization register - core */

#define SPRN_PMC1 787
#define SPRN_PMC2 788
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 3bc0352..f262e3e 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -737,6 +737,8 @@ int main(void)
offsetof(struct paca_struct, thread_idle_state));
DEFINE(PACA_THREAD_MASK,
offsetof(struct paca_struct, thread_mask));
+ DEFINE(PACA_SUBCORE_SIBLING_MASK,
+ offsetof(struct paca_struct, subcore_sibling_mask));
#endif

return 0;
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 7637889..2b9b5fb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -102,9 +102,7 @@ system_reset_pSeries:
#ifdef CONFIG_PPC_P7_NAP
BEGIN_FTR_SECTION
/* Running native on arch 2.06 or later, check if we are
- * waking up from nap. We only handle no state loss and
- * supervisor state loss. We do -not- handle hypervisor
- * state loss at this time.
+ * waking up from nap/sleep/winkle.
*/
mfspr r13,SPRN_SRR1
rlwinm. r13,r13,47-31,30,31
@@ -112,7 +110,17 @@ BEGIN_FTR_SECTION

cmpwi cr3,r13,2

- GET_PACA(r13)
+ /* Check if last bit of HSPRG0 is set. This indicates whether we are
+ * waking up from winkle */
+ li r3,1
+ mfspr r4,SPRN_HSPRG0
+ and r5,r4,r3
+ cmpwi cr4,r5,1 /* Store result in cr4 for later use */
+
+ andc r4,r4,r3
+ mtspr SPRN_HSPRG0,r4
+
+ mr r13,r4
lbz r0,PACA_THREAD_IDLE_STATE(r13)
cmpwi cr2,r0,PNV_THREAD_NAP
bgt cr2,8f /* Either sleep or Winkle */
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 8c3a1f4..8102075 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -19,8 +19,24 @@
#include <asm/kvm_book3s_asm.h>
#include <asm/opal.h>
#include <asm/cpuidle.h>
+#include <asm/mmu-hash64.h>

#undef DEBUG
+/*
+ * Use unused space in the interrupt stack to save and restore
+ * registers for winkle support.
+ */
+#define _SDR1 GPR3
+#define _RPR GPR4
+#define _SPURR GPR5
+#define _PURR GPR6
+#define _TSCR GPR7
+#define _DSCR GPR8
+#define _AMOR GPR9
+#define _PMC5 GPR10
+#define _PMC6 GPR11
+#define _WORT GPR12
+#define _WORC GPR13

/* Idle state entry routines */

@@ -124,8 +140,8 @@ power7_enter_nap_mode:
stb r4,HSTATE_HWTHREAD_STATE(r13)
#endif
stb r3,PACA_THREAD_IDLE_STATE(r13)
- cmpwi cr1,r3,PNV_THREAD_SLEEP
- bge cr1,2f
+ cmpwi cr3,r3,PNV_THREAD_SLEEP
+ bge cr3,2f
IDLE_STATE_ENTER_SEQ(PPC_NAP)
/* No return */
2:
@@ -154,7 +170,8 @@ pnv_fastsleep_workaround_at_entry:
isync
bne- lwarx_loop1

-common_enter: /* common code for all the threads entering sleep */
+common_enter: /* common code for all the threads entering sleep or winkle */
+ bgt cr3,enter_winkle
IDLE_STATE_ENTER_SEQ(PPC_SLEEP)

fastsleep_workaround_at_entry:
@@ -175,6 +192,34 @@ fastsleep_workaround_at_entry:
stw r0,0(r14)
b common_enter

+enter_winkle:
+ /*
+ * Note that all registers, i.e. per-core, per-subcore and per-thread,
+ * are saved here since any thread in the core might wake up first
+ */
+ mfspr r3,SPRN_SDR1
+ std r3,_SDR1(r1)
+ mfspr r3,SPRN_RPR
+ std r3,_RPR(r1)
+ mfspr r3,SPRN_SPURR
+ std r3,_SPURR(r1)
+ mfspr r3,SPRN_PURR
+ std r3,_PURR(r1)
+ mfspr r3,SPRN_TSCR
+ std r3,_TSCR(r1)
+ mfspr r3,SPRN_DSCR
+ std r3,_DSCR(r1)
+ mfspr r3,SPRN_AMOR
+ std r3,_AMOR(r1)
+ mfspr r3,SPRN_PMC5
+ std r3,_PMC5(r1)
+ mfspr r3,SPRN_PMC6
+ std r3,_PMC6(r1)
+ mfspr r3,SPRN_WORT
+ std r3,_WORT(r1)
+ mfspr r3,SPRN_WORC
+ std r3,_WORC(r1)
+ IDLE_STATE_ENTER_SEQ(PPC_WINKLE)

_GLOBAL(power7_idle)
/* Now check if user or arch enabled NAP mode */
@@ -197,6 +242,12 @@ _GLOBAL(power7_sleep)
b power7_powersave_common
/* No return */

+_GLOBAL(power7_winkle)
+ li r3,3
+ li r4,1
+ b power7_powersave_common
+ /* No return */
+
#define CHECK_HMI_INTERRUPT \
mfspr r0,SPRN_SRR1; \
BEGIN_FTR_SECTION_NESTED(66); \
@@ -238,11 +289,23 @@ lwarx_loop2:
bne core_idle_lock_held

cmpwi cr2,r15,0
+ lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
+ and r4,r4,r15
+ cmpwi cr1,r4,0 /* Check if first in subcore */
+
+ /*
+ * At this stage
+ * cr1 - 10 if first thread to wakeup in subcore
+ * cr2 - 10 if first thread to wakeup in core
+ * cr3 - 01 if waking up from sleep or winkle
+ * cr4 - 10 if waking up from winkle
+ */
+
or r15,r15,r7 /* Set thread bit */

- beq cr2,first_thread
+ beq cr1,first_thread_in_subcore

- /* Not first thread in core to wake up */
+ /* Not first thread in subcore to wake up */
stwcx. r15,0,r14
isync
bne- lwarx_loop2
@@ -257,14 +320,35 @@ core_idle_lock_loop:
HMT_MEDIUM
b lwarx_loop2

-first_thread:
- /* First thread in core to wakeup */
+first_thread_in_subcore:
+ /* First thread in subcore to wakeup */
ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
stwcx. r15,0,r14
isync
bne- lwarx_loop2

/*
+ * If waking up from sleep, subcore state is not lost. Hence
+ * skip subcore state restore
+ */
+ bne cr4,subcore_state_restored
+
+ /* Restore per-subcore state */
+ ld r4,_SDR1(r1)
+ mtspr SPRN_SDR1,r4
+ ld r4,_RPR(r1)
+ mtspr SPRN_RPR,r4
+ ld r4,_AMOR(r1)
+ mtspr SPRN_AMOR,r4
+
+subcore_state_restored:
+ /* Check if the thread is also the first thread in the core. If not,
+ * skip to clear_lock */
+ bne cr2,clear_lock
+
+first_thread_in_core:
+
+ /*
* First thread in the core waking up from fastsleep. It needs to
* call the fastsleep workaround code if the platform requires it.
* Call it unconditionally here. The below branch instruction will
@@ -284,12 +368,65 @@ timebase_resync:
bl opal_call_realmode;
/* TODO: Check r3 for failure */

+ /*
+ * If waking up from sleep, per core state is not lost, skip to
+ * clear_lock.
+ */
+ bne cr4,clear_lock
+
+ /* Restore per core state */
+ ld r4,_TSCR(r1)
+ mtspr SPRN_TSCR,r4
+ ld r4,_WORC(r1)
+ mtspr SPRN_WORC,r4
+
clear_lock:
andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
lwsync
stw r15,0(r14)

common_exit:
+ /* Common to all threads
+ *
+ * If waking up from sleep, hypervisor state is not lost. Hence
+ * skip hypervisor state restore.
+ */
+ bne cr4,hypervisor_state_restored
+
+ /* Waking up from winkle */
+
+ /* Restore per thread state */
+ bl __restore_cpu_power8
+
+ /* Restore SLB from PACA */
+ ld r8,PACA_SLBSHADOWPTR(r13)
+
+ .rept SLB_NUM_BOLTED
+ li r3, SLBSHADOW_SAVEAREA
+ LDX_BE r5, r8, r3
+ addi r3, r3, 8
+ LDX_BE r6, r8, r3
+ andis. r7,r5,SLB_ESID_V@h
+ beq 1f
+ slbmte r6,r5
+1: addi r8,r8,16
+ .endr
+
+ ld r4,_SPURR(r1)
+ mtspr SPRN_SPURR,r4
+ ld r4,_PURR(r1)
+ mtspr SPRN_PURR,r4
+ ld r4,_DSCR(r1)
+ mtspr SPRN_DSCR,r4
+ ld r4,_PMC5(r1)
+ mtspr SPRN_PMC5,r4
+ ld r4,_PMC6(r1)
+ mtspr SPRN_PMC6,r4
+ ld r4,_WORT(r1)
+ mtspr SPRN_WORT,r4
+
+hypervisor_state_restored:
+
li r5,PNV_THREAD_RUNNING
stb r5,PACA_THREAD_IDLE_STATE(r13)

diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index a0f43e8..72b4af8 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -284,6 +284,7 @@ OPAL_CALL(opal_sensor_read, OPAL_SENSOR_READ);
OPAL_CALL(opal_get_param, OPAL_GET_PARAM);
OPAL_CALL(opal_set_param, OPAL_SET_PARAM);
OPAL_CALL(opal_handle_hmi, OPAL_HANDLE_HMI);
+OPAL_CALL(opal_slw_set_reg, OPAL_SLW_SET_REG);
OPAL_CALL(opal_register_dump_region, OPAL_REGISTER_DUMP_REGION);
OPAL_CALL(opal_unregister_dump_region, OPAL_UNREGISTER_DUMP_REGION);
OPAL_CALL(opal_pci_set_phb_cxl_mode, OPAL_PCI_SET_PHB_CXL_MODE);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 97e0279..a0de28c 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -41,6 +41,7 @@
#include <asm/code-patching.h>

#include "powernv.h"
+#include "subcore.h"

static void __init pnv_setup_arch(void)
{
@@ -294,6 +295,74 @@ static void __init pnv_setup_machdep_rtas(void)
#endif /* CONFIG_PPC_POWERNV_RTAS */

static u32 supported_cpuidle_states;
+int pnv_save_sprs_for_winkle(void)
+{
+ int cpu;
+ int rc;
+
+ /*
+ * hid0, hid1, hid4, hid5, hmeer and lpcr values are symmetric across
+ * all cpus at boot. Get these reg values from the current cpu and use
+ * the same across all cpus.
+ */
+ uint64_t lpcr_val = mfspr(SPRN_LPCR);
+ uint64_t hid0_val = mfspr(SPRN_HID0);
+ uint64_t hid1_val = mfspr(SPRN_HID1);
+ uint64_t hid4_val = mfspr(SPRN_HID4);
+ uint64_t hid5_val = mfspr(SPRN_HID5);
+ uint64_t hmeer_val = mfspr(SPRN_HMEER);
+
+ for_each_possible_cpu(cpu) {
+ uint64_t pir = get_hard_smp_processor_id(cpu);
+ uint64_t hsprg0_val = (uint64_t)&paca[cpu];
+
+ /*
+ * HSPRG0 is used to store the cpu's pointer to paca. Hence last
+ * 3 bits are guaranteed to be 0. Program slw to restore HSPRG0
+ * with 63rd bit set, so that when a thread wakes up at 0x100 we
+ * can use this bit to distinguish between fastsleep and
+ * deep winkle.
+ */
+ hsprg0_val |= 1;
+
+ rc = opal_slw_set_reg(pir, SPRN_HSPRG0, hsprg0_val);
+ if (rc != 0)
+ return rc;
+
+ rc = opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
+ if (rc != 0)
+ return rc;
+
+ /* HIDs are per core registers */
+ if (cpu_thread_in_core(cpu) == 0) {
+
+ rc = opal_slw_set_reg(pir, SPRN_HMEER, hmeer_val);
+ if (rc != 0)
+ return rc;
+
+ rc = opal_slw_set_reg(pir, SPRN_HID0, hid0_val);
+ if (rc != 0)
+ return rc;
+
+ rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
+ if (rc != 0)
+ return rc;
+
+ rc = opal_slw_set_reg(pir, SPRN_HID4, hid4_val);
+ if (rc != 0)
+ return rc;
+
+ rc = opal_slw_set_reg(pir, SPRN_HID5, hid5_val);
+ if (rc != 0)
+ return rc;
+
+ }
+
+ }
+
+ return 0;
+
+}

static void pnv_alloc_idle_core_states(void)
{
@@ -325,6 +394,10 @@ static void pnv_alloc_idle_core_states(void)
paca[cpu].thread_mask = 1 << (cpu % threads_per_core);
}
}
+ update_subcore_sibling_mask();
+ if (supported_cpuidle_states & OPAL_PM_WINKLE_ENABLED)
+ pnv_save_sprs_for_winkle();
+
}

u32 pnv_get_supported_cpuidle_states(void)
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 12b761a..5e35857 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -167,7 +167,9 @@ static void pnv_smp_cpu_kill_self(void)
mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
while (!generic_check_cpu_restart(cpu)) {
ppc64_runlatch_off();
- if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
+ if (idle_states & OPAL_PM_WINKLE_ENABLED)
+ power7_winkle();
+ else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
(idle_states & OPAL_PM_SLEEP_ENABLED_ER1))
power7_sleep();
else
diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
index c87f96b..f60f80a 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -160,6 +160,18 @@ static void wait_for_sync_step(int step)
mb();
}

+static void update_hid_in_slw(u64 hid0)
+{
+ u64 idle_states = pnv_get_supported_cpuidle_states();
+
+ if (idle_states & OPAL_PM_WINKLE_ENABLED) {
+ /* OPAL call to patch slw with the new HID0 value */
+ u64 cpu_pir = hard_smp_processor_id();
+
+ opal_slw_set_reg(cpu_pir, SPRN_HID0, hid0);
+ }
+}
+
static void unsplit_core(void)
{
u64 hid0, mask;
@@ -179,6 +191,7 @@ static void unsplit_core(void)
hid0 = mfspr(SPRN_HID0);
hid0 &= ~HID0_POWER8_DYNLPARDIS;
mtspr(SPRN_HID0, hid0);
+ update_hid_in_slw(hid0);

while (mfspr(SPRN_HID0) & mask)
cpu_relax();
@@ -215,6 +228,7 @@ static void split_core(int new_mode)
hid0 = mfspr(SPRN_HID0);
hid0 |= HID0_POWER8_DYNLPARDIS | split_parms[i].value;
mtspr(SPRN_HID0, hid0);
+ update_hid_in_slw(hid0);

/* Wait for it to happen */
while (!(mfspr(SPRN_HID0) & split_parms[i].mask))
@@ -251,6 +265,25 @@ bool cpu_core_split_required(void)
return true;
}

+void update_subcore_sibling_mask(void)
+{
+ int cpu;
+ /*
+ * sibling mask for the first cpu. Left shift this by required bits
+ * to get sibling mask for the rest of the cpus.
+ */
+ int sibling_mask_first_cpu = (1 << threads_per_subcore) - 1;
+
+ for_each_possible_cpu(cpu) {
+ int tid = cpu_thread_in_core(cpu);
+ int offset = (tid / threads_per_subcore) * threads_per_subcore;
+ int mask = sibling_mask_first_cpu << offset;
+
+ paca[cpu].subcore_sibling_mask = mask;
+
+ }
+}
+
static int cpu_update_split_mode(void *data)
{
int cpu, new_mode = *(int *)data;
@@ -284,6 +317,7 @@ static int cpu_update_split_mode(void *data)
/* Make the new mode public */
subcores_per_core = new_mode;
threads_per_subcore = threads_per_core / subcores_per_core;
+ update_subcore_sibling_mask();

/* Make sure the new mode is written before we exit */
mb();
diff --git a/arch/powerpc/platforms/powernv/subcore.h b/arch/powerpc/platforms/powernv/subcore.h
index 148abc9..604eb40 100644
--- a/arch/powerpc/platforms/powernv/subcore.h
+++ b/arch/powerpc/platforms/powernv/subcore.h
@@ -15,4 +15,5 @@

#ifndef __ASSEMBLY__
void split_core_secondary_loop(u8 *state);
+extern void update_subcore_sibling_mask(void);
#endif
--
1.9.3

2014-12-08 03:34:02

by Paul Mackerras

Subject: Re: [PATCH v3 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states

On Thu, Dec 04, 2014 at 12:58:21PM +0530, Shreyas B. Prabhu wrote:
> From: "Preeti U. Murthy" <[email protected]>
>
> The secondary threads should enter deep idle states so as to gain maximum
> powersavings when the entire core is offline. To do so the offline path
> must be made aware of the available deepest idle state. Hence probe the
> device tree for the possible idle states in powernv core code and
> expose the deepest idle state through flags.
>
> Since the device tree is probed by the cpuidle driver as well, move
> the parameters required to discover the idle states into a place common
> to both the driver and the powernv core code.
>
> Another point is that fastsleep idle state may require workarounds in
> the kernel to function properly. This workaround is introduced in the
> subsequent patches. However, neither the cpuidle driver nor the hotplug
> path needs to be bothered about this workaround.
>
> It will be taken care of by the core powernv code.
>
> Originally-by: Srivatsa S. Bhat <[email protected]>
> Signed-off-by: Preeti U. Murthy <[email protected]>
> Signed-off-by: Shreyas B. Prabhu <[email protected]>

Reviewed-by: Paul Mackerras <[email protected]>

2014-12-08 05:01:38

by Paul Mackerras

Subject: Re: [PATCH v3 3/4] powernv: cpuidle: Redesign idle states management

On Thu, Dec 04, 2014 at 12:58:22PM +0530, Shreyas B. Prabhu wrote:
> Deep idle states like sleep and winkle are per core idle states. A core
> enters these states only when all the threads enter either the
> particular idle state or a deeper one. There are tasks like fastsleep
> hardware bug workaround and hypervisor core state save which have to be
> done only by the last thread of the core entering deep idle state and
> similarly tasks like timebase resync, hypervisor core register restore
> that have to be done only by the first thread waking up from these
> states.
>
> The current idle state management does not have a way to distinguish the
> first/last thread of the core waking/entering idle states. Tasks like
> timebase resync are done for all the threads. This is not only
> suboptimal, but can also cause functionality issues when subcores and
> KVM are involved.
>
> This patch adds the necessary infrastructure to track idle states of
> threads in a per-core structure. It uses this info to perform tasks like
> fastsleep workaround and timebase resync only once per core.

Comments below...

> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index a5139ea..e4578c3 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -158,6 +158,12 @@ struct paca_struct {
> * early exception handler for use by high level C handler
> */
> struct opal_machine_check_event *opal_mc_evt;
> +
> + /* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> + u32 *core_idle_state_ptr;
> + u8 thread_idle_state; /* ~Idle[0]/Nap[1]/Sleep[2]/Winkle[3] */

Might be clearer in the comment to say "/* PNV_THREAD_xxx */" so it's
clear the value should be one of PNV_THREAD_NAP, PNV_THREAD_SLEEP,
etc.

> diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
> index 283c603..8c3a1f4 100644
> --- a/arch/powerpc/kernel/idle_power7.S
> +++ b/arch/powerpc/kernel/idle_power7.S
> @@ -18,6 +18,7 @@
> #include <asm/hw_irq.h>
> #include <asm/kvm_book3s_asm.h>
> #include <asm/opal.h>
> +#include <asm/cpuidle.h>
>
> #undef DEBUG
>
> @@ -37,8 +38,7 @@
>
> /*
> * Pass requested state in r3:
> - * 0 - nap
> - * 1 - sleep
> + * r3 - PNV_THREAD_NAP/SLEEP/WINKLE
> *
> * To check IRQ_HAPPENED in r4
> * 0 - don't check
> @@ -123,12 +123,58 @@ power7_enter_nap_mode:
> li r4,KVM_HWTHREAD_IN_NAP
> stb r4,HSTATE_HWTHREAD_STATE(r13)
> #endif
> - cmpwi cr0,r3,1
> - beq 2f
> + stb r3,PACA_THREAD_IDLE_STATE(r13)
> + cmpwi cr1,r3,PNV_THREAD_SLEEP
> + bge cr1,2f
> IDLE_STATE_ENTER_SEQ(PPC_NAP)
> /* No return */
> -2: IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
> - /* No return */
> +2:
> + /* Sleep or winkle */
> + lbz r7,PACA_THREAD_MASK(r13)
> + ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
> +lwarx_loop1:
> + lwarx r15,0,r14
> + andc r15,r15,r7 /* Clear thread bit */
> +
> + andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
> +
> +/*
> + * If cr0 = 0, then current thread is the last thread of the core entering
> + * sleep. Last thread needs to execute the hardware bug workaround code if
> + * required by the platform.
> + * Make the workaround call unconditionally here. The below branch call is
> + * patched out when the idle states are discovered if the platform does not
> + * require it.
> + */
> +.global pnv_fastsleep_workaround_at_entry
> +pnv_fastsleep_workaround_at_entry:
> + beq fastsleep_workaround_at_entry

Did you investigate using the feature bit mechanism to do this
patching for you? You would need to allocate a CPU feature bit and
parse the device tree early on and set or clear the feature bit,
before the feature fixups are done. The code here would then end up
looking like:

BEGIN_FTR_SECTION
beq fastsleep_workaround_at_entry
END_FTR_SECTION_IFSET(CPU_FTR_FASTSLEEP_WORKAROUND)

> + stwcx. r15,0,r14
> + isync
> + bne- lwarx_loop1

The isync has to come after the bne. Please fix this here and in the
other places where you added the isync.

> +common_enter: /* common code for all the threads entering sleep */
> + IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
> +
> +fastsleep_workaround_at_entry:
> + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
> + stwcx. r15,0,r14
> + isync
> + bne- lwarx_loop1
> +
> + /* Fast sleep workaround */
> + li r3,1
> + li r4,1
> + li r0,OPAL_CONFIG_CPU_IDLE_STATE
> + bl opal_call_realmode
> +
> + /* Clear Lock bit */
> + li r0,0
> + lwsync
> + stw r0,0(r14)
> + b common_enter
> +
>
> _GLOBAL(power7_idle)
> /* Now check if user or arch enabled NAP mode */
> @@ -141,49 +187,16 @@ _GLOBAL(power7_idle)
>
> _GLOBAL(power7_nap)
> mr r4,r3
> - li r3,0
> + li r3,PNV_THREAD_NAP
> b power7_powersave_common
> /* No return */
>
> _GLOBAL(power7_sleep)
> - li r3,1
> + li r3,PNV_THREAD_SLEEP
> li r4,1
> b power7_powersave_common
> /* No return */
>
> -/*
> - * Make opal call in realmode. This is a generic function to be called
> - * from realmode from reset vector. It handles endianess.
> - *
> - * r13 - paca pointer
> - * r1 - stack pointer
> - * r3 - opal token
> - */
> -opal_call_realmode:
> - mflr r12
> - std r12,_LINK(r1)
> - ld r2,PACATOC(r13)
> - /* Set opal return address */
> - LOAD_REG_ADDR(r0,return_from_opal_call)
> - mtlr r0
> - /* Handle endian-ness */
> - li r0,MSR_LE
> - mfmsr r12
> - andc r12,r12,r0
> - mtspr SPRN_HSRR1,r12
> - mr r0,r3 /* Move opal token to r0 */
> - LOAD_REG_ADDR(r11,opal)
> - ld r12,8(r11)
> - ld r2,0(r11)
> - mtspr SPRN_HSRR0,r12
> - hrfid
> -
> -return_from_opal_call:
> - FIXUP_ENDIAN
> - ld r0,_LINK(r1)
> - mtlr r0
> - blr
> -
> #define CHECK_HMI_INTERRUPT \
> mfspr r0,SPRN_SRR1; \
> BEGIN_FTR_SECTION_NESTED(66); \
> @@ -196,10 +209,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); \
> /* Invoke opal call to handle hmi */ \
> ld r2,PACATOC(r13); \
> ld r1,PACAR1(r13); \
> - std r3,ORIG_GPR3(r1); /* Save original r3 */ \
> - li r3,OPAL_HANDLE_HMI; /* Pass opal token argument*/ \
> + li r0,OPAL_HANDLE_HMI; /* Pass opal token argument*/ \
> bl opal_call_realmode; \
> - ld r3,ORIG_GPR3(r1); /* Restore original r3 */ \
> 20: nop;

I recently sent a patch "powerpc: powernv: Return to cpu offline loop
when finished in KVM guest" which passes a value in r3 through
power7_wakeup_loss and power7_wakeup_noloss back to the caller of
power7_nap(). So please don't take out the save/restore of r3 here.

> @@ -210,12 +221,90 @@ _GLOBAL(power7_wakeup_tb_loss)
> BEGIN_FTR_SECTION
> CHECK_HMI_INTERRUPT
> END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
> +
> + lbz r7,PACA_THREAD_MASK(r13)
> + ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
> +lwarx_loop2:
> + lwarx r15,0,r14
> + andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
> + /*
> + * Lock bit is set in one of the 2 cases-
> + * a. In the sleep/winkle enter path, the last thread is executing
> + * fastsleep workaround code.
> + * b. In the wake up path, another thread is executing fastsleep
> + * workaround undo code or resyncing timebase or restoring context
> + * In either case loop until the lock bit is cleared.
> + */
> + bne core_idle_lock_held
> +
> + cmpwi cr2,r15,0
> + or r15,r15,r7 /* Set thread bit */
> +
> + beq cr2,first_thread
> +
> + /* Not first thread in core to wake up */
> + stwcx. r15,0,r14
> + isync
> + bne- lwarx_loop2
> + b common_exit
> +
> +core_idle_lock_held:
> + HMT_LOW
> +core_idle_lock_loop:
> +	lwz	r15,0(r14)
> + andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
> + bne core_idle_lock_loop
> + HMT_MEDIUM
> + b lwarx_loop2
> +
> +first_thread:
> + /* First thread in core to wakeup */
> + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
> + stwcx. r15,0,r14
> + isync
> + bne- lwarx_loop2
> +
> + /*
> + * First thread in the core waking up from fastsleep. It needs to
> + * call the fastsleep workaround code if the platform requires it.
> + * Call it unconditionally here. The below branch instruction will
> + * be patched out when the idle states are discovered if platform
> + * does not require workaround.
> + */
> +.global pnv_fastsleep_workaround_at_exit
> +pnv_fastsleep_workaround_at_exit:
> + b fastsleep_workaround_at_exit
> +
> +timebase_resync:
> + /* Do timebase resync if we are waking up from sleep. Use cr3 value
> + * set in exceptions-64s.S */
> + ble cr3,clear_lock
> /* Time base re-sync */
> - li r3,OPAL_RESYNC_TIMEBASE
> + li r0,OPAL_RESYNC_TIMEBASE
> bl opal_call_realmode;
> -
> /* TODO: Check r3 for failure */
>
> +clear_lock:
> + andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
> + lwsync
> + stw r15,0(r14)
> +
> +common_exit:
> + li r5,PNV_THREAD_RUNNING
> + stb r5,PACA_THREAD_IDLE_STATE(r13)
> +
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> + li r0,KVM_HWTHREAD_IN_KERNEL
> + stb r0,HSTATE_HWTHREAD_STATE(r13)
> + /* Order setting hwthread_state vs. testing hwthread_req */
> + sync
> + lbz r0,HSTATE_HWTHREAD_REQ(r13)
> + cmpwi r0,0
> + beq 6f
> + b kvm_start_guest

There is a bit of a problem here: the FIXUP_ENDIAN in
opal_call_realmode will trash SRR1 (if the kernel is little-endian),
but the code at kvm_start_guest needs SRR1 from the system reset
exception so that it can know what the wakeup reason was.

> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
> index 34c6665..97e0279 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -36,6 +36,9 @@
> #include <asm/opal.h>
> #include <asm/kexec.h>
> #include <asm/smp.h>
> +#include <asm/cputhreads.h>
> +#include <asm/cpuidle.h>
> +#include <asm/code-patching.h>
>
> #include "powernv.h"
>
> @@ -292,10 +295,43 @@ static void __init pnv_setup_machdep_rtas(void)
>
> static u32 supported_cpuidle_states;
>
> +static void pnv_alloc_idle_core_states(void)
> +{
> + int i, j;
> + int nr_cores = cpu_nr_cores();
> + u32 *core_idle_state;
> +
> + /*
> + * core_idle_state - First 8 bits track the idle state of each thread
> + * of the core. The 8th bit is the lock bit. Initially all thread bits
> + * are set. They are cleared when the thread enters deep idle state
> + * like sleep and winkle. Initially the lock bit is cleared.
> + * The lock bit has 2 purposes
> + * a. While the first thread is restoring core state, it prevents
> + * from other threads in the core from switching to prcoess context.

^^^^ remove "from" ^^^^^^^ process

> + * b. While the last thread in the core is saving the core state, it
> + * prevent a different thread from waking up.

^^^^^^^ prevents

> + */
> + for (i = 0; i < nr_cores; i++) {
> + int first_cpu = i * threads_per_core;
> + int node = cpu_to_node(first_cpu);
> +
> + core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
> + for (j = 0; j < threads_per_core; j++) {
> + int cpu = first_cpu + j;
> +
> + paca[cpu].core_idle_state_ptr = core_idle_state;
> + paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
> + paca[cpu].thread_mask = 1 << (cpu % threads_per_core);

This would be simpler and quicker:

paca[cpu].thread_mask = 1 << j;

Paul.

2014-12-08 05:26:57

by Shreyas B. Prabhu

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] powernv: cpuidle: Redesign idle states management

Hi Paul,

On Monday 08 December 2014 10:31 AM, Paul Mackerras wrote:
> On Thu, Dec 04, 2014 at 12:58:22PM +0530, Shreyas B. Prabhu wrote:
>> Deep idle states like sleep and winkle are per core idle states. A core
>> enters these states only when all the threads enter either the
>> particular idle state or a deeper one. There are tasks like fastsleep
>> hardware bug workaround and hypervisor core state save which have to be
>> done only by the last thread of the core entering deep idle state and
>> similarly tasks like timebase resync, hypervisor core register restore
>> that have to be done only by the first thread waking up from these
>> states.
>>
>> The current idle state management does not have a way to distinguish the
>> first/last thread of the core waking/entering idle states. Tasks like
>> timebase resync are done for all the threads. This is not only
>> suboptimal, but can cause functionality issues when subcores and kvm are
>> involved.
>>
>> This patch adds the necessary infrastructure to track idle states of
>> threads in a per-core structure. It uses this info to perform tasks like
>> fastsleep workaround and timebase resync only once per core.
>
> Comments below...
>
>> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
>> index a5139ea..e4578c3 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -158,6 +158,12 @@ struct paca_struct {
>> * early exception handler for use by high level C handler
>> */
>> struct opal_machine_check_event *opal_mc_evt;
>> +
>> + /* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
>> + u32 *core_idle_state_ptr;
>> + u8 thread_idle_state; /* ~Idle[0]/Nap[1]/Sleep[2]/Winkle[3] */
>
> Might be clearer in the comment to say "/* PNV_THREAD_xxx */" so it's
> clear the value should be one of PNV_THREAD_NAP, PNV_THREAD_SLEEP,
> etc.

Okay.
>
>> diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
>> index 283c603..8c3a1f4 100644
>> --- a/arch/powerpc/kernel/idle_power7.S
>> +++ b/arch/powerpc/kernel/idle_power7.S
>> @@ -18,6 +18,7 @@
>> #include <asm/hw_irq.h>
>> #include <asm/kvm_book3s_asm.h>
>> #include <asm/opal.h>
>> +#include <asm/cpuidle.h>
>>
>> #undef DEBUG
>>
>> @@ -37,8 +38,7 @@
>>
>> /*
>> * Pass requested state in r3:
>> - * 0 - nap
>> - * 1 - sleep
>> + * r3 - PNV_THREAD_NAP/SLEEP/WINKLE
>> *
>> * To check IRQ_HAPPENED in r4
>> * 0 - don't check
>> @@ -123,12 +123,58 @@ power7_enter_nap_mode:
>> li r4,KVM_HWTHREAD_IN_NAP
>> stb r4,HSTATE_HWTHREAD_STATE(r13)
>> #endif
>> - cmpwi cr0,r3,1
>> - beq 2f
>> + stb r3,PACA_THREAD_IDLE_STATE(r13)
>> + cmpwi cr1,r3,PNV_THREAD_SLEEP
>> + bge cr1,2f
>> IDLE_STATE_ENTER_SEQ(PPC_NAP)
>> /* No return */
>> -2: IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>> - /* No return */
>> +2:
>> + /* Sleep or winkle */
>> + lbz r7,PACA_THREAD_MASK(r13)
>> + ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
>> +lwarx_loop1:
>> + lwarx r15,0,r14
>> + andc r15,r15,r7 /* Clear thread bit */
>> +
>> + andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> +
>> +/*
>> + * If cr0 = 0, then current thread is the last thread of the core entering
>> + * sleep. Last thread needs to execute the hardware bug workaround code if
>> + * required by the platform.
>> + * Make the workaround call unconditionally here. The below branch call is
>> + * patched out when the idle states are discovered if the platform does not
>> + * require it.
>> + */
>> +.global pnv_fastsleep_workaround_at_entry
>> +pnv_fastsleep_workaround_at_entry:
>> + beq fastsleep_workaround_at_entry
>
> Did you investigate using the feature bit mechanism to do this
> patching for you? You would need to allocate a CPU feature bit and
> parse the device tree early on and set or clear the feature bit,
> before the feature fixups are done. The code here would then end up
> looking like:
>
> BEGIN_FTR_SECTION
> beq fastsleep_workaround_at_entry
> END_FTR_SECTION_IFSET(CPU_FTR_FASTSLEEP_WORKAROUND)
>

I agree using a feature fixup is a much cleaner implementation. The difficulty is
that the information on whether the fastsleep workaround is needed comes from the
device tree, and do_feature_fixups is currently called before we unflatten the
device tree. Any suggestions for this?
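
For the record, the kind of early scan that would be needed looks
roughly like the sketch below. The node name and the
CPU_FTR_FASTSLEEP_WORKAROUND bit are assumptions of this illustration;
only ibm,cpu-idle-state-flags, OPAL_PM_SLEEP_ENABLED_ER1 and the flat
device tree helpers are taken from the thread:

    /* Sketch only: the flattened tree can be scanned before it is
     * unflattened, so a hypothetical feature bit could be set early
     * enough for the feature fixups to see it. */
    static int __init early_scan_idle_flags(unsigned long node,
                                            const char *uname,
                                            int depth, void *data)
    {
            const __be32 *flags;
            int i, len;

            if (strcmp(uname, "power-mgt") != 0)    /* assumed name */
                    return 0;

            flags = of_get_flat_dt_prop(node, "ibm,cpu-idle-state-flags",
                                        &len);
            if (!flags)
                    return 0;

            for (i = 0; i < len / sizeof(u32); i++)
                    if (be32_to_cpu(flags[i]) & OPAL_PM_SLEEP_ENABLED_ER1)
                            cur_cpu_spec->cpu_features |=
                                    CPU_FTR_FASTSLEEP_WORKAROUND;
            return 1;
    }

    /* ...invoked via of_scan_flat_dt(early_scan_idle_flags, NULL) in
     * early boot, before do_feature_fixups runs. */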

>> + stwcx. r15,0,r14
>> + isync
>> + bne- lwarx_loop1
>
> The isync has to come after the bne. Please fix this here and in the
> other places where you added the isync.
>
Okay.

>> +common_enter: /* common code for all the threads entering sleep */
>> + IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>> +
>> +fastsleep_workaround_at_entry:
>> + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
>> + stwcx. r15,0,r14
>> + isync
>> + bne- lwarx_loop1
>> +
>> + /* Fast sleep workaround */
>> + li r3,1
>> + li r4,1
>> + li r0,OPAL_CONFIG_CPU_IDLE_STATE
>> + bl opal_call_realmode
>> +
>> + /* Clear Lock bit */
>> + li r0,0
>> + lwsync
>> + stw r0,0(r14)
>> + b common_enter
>> +
>>
>> _GLOBAL(power7_idle)
>> /* Now check if user or arch enabled NAP mode */
>> @@ -141,49 +187,16 @@ _GLOBAL(power7_idle)
>>
>> _GLOBAL(power7_nap)
>> mr r4,r3
>> - li r3,0
>> + li r3,PNV_THREAD_NAP
>> b power7_powersave_common
>> /* No return */
>>
>> _GLOBAL(power7_sleep)
>> - li r3,1
>> + li r3,PNV_THREAD_SLEEP
>> li r4,1
>> b power7_powersave_common
>> /* No return */
>>
>> -/*
>> - * Make opal call in realmode. This is a generic function to be called
>> - * from realmode from reset vector. It handles endianess.
>> - *
>> - * r13 - paca pointer
>> - * r1 - stack pointer
>> - * r3 - opal token
>> - */
>> -opal_call_realmode:
>> - mflr r12
>> - std r12,_LINK(r1)
>> - ld r2,PACATOC(r13)
>> - /* Set opal return address */
>> - LOAD_REG_ADDR(r0,return_from_opal_call)
>> - mtlr r0
>> - /* Handle endian-ness */
>> - li r0,MSR_LE
>> - mfmsr r12
>> - andc r12,r12,r0
>> - mtspr SPRN_HSRR1,r12
>> - mr r0,r3 /* Move opal token to r0 */
>> - LOAD_REG_ADDR(r11,opal)
>> - ld r12,8(r11)
>> - ld r2,0(r11)
>> - mtspr SPRN_HSRR0,r12
>> - hrfid
>> -
>> -return_from_opal_call:
>> - FIXUP_ENDIAN
>> - ld r0,_LINK(r1)
>> - mtlr r0
>> - blr
>> -
>> #define CHECK_HMI_INTERRUPT \
>> mfspr r0,SPRN_SRR1; \
>> BEGIN_FTR_SECTION_NESTED(66); \
>> @@ -196,10 +209,8 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); \
>> /* Invoke opal call to handle hmi */ \
>> ld r2,PACATOC(r13); \
>> ld r1,PACAR1(r13); \
>> - std r3,ORIG_GPR3(r1); /* Save original r3 */ \
>> - li r3,OPAL_HANDLE_HMI; /* Pass opal token argument*/ \
>> + li r0,OPAL_HANDLE_HMI; /* Pass opal token argument*/ \
>> bl opal_call_realmode; \
>> - ld r3,ORIG_GPR3(r1); /* Restore original r3 */ \
>> 20: nop;
>
> I recently sent a patch "powerpc: powernv: Return to cpu offline loop
> when finished in KVM guest" which passes a value in r3 through
> power7_wakeup_loss and power7_wakeup_noloss back to the caller of
> power7_nap(). So please don't take out the save/restore of r3 here.
>

Okay. I'll rebase these patches on top of your patch and resend.

>> @@ -210,12 +221,90 @@ _GLOBAL(power7_wakeup_tb_loss)
>> BEGIN_FTR_SECTION
>> CHECK_HMI_INTERRUPT
>> END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>> +
>> + lbz r7,PACA_THREAD_MASK(r13)
>> + ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
>> +lwarx_loop2:
>> + lwarx r15,0,r14
>> + andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
>> + /*
>> + * Lock bit is set in one of the 2 cases-
>> + * a. In the sleep/winkle enter path, the last thread is executing
>> + * fastsleep workaround code.
>> + * b. In the wake up path, another thread is executing fastsleep
>> + * workaround undo code or resyncing timebase or restoring context
>> + * In either case loop until the lock bit is cleared.
>> + */
>> + bne core_idle_lock_held
>> +
>> + cmpwi cr2,r15,0
>> + or r15,r15,r7 /* Set thread bit */
>> +
>> + beq cr2,first_thread
>> +
>> + /* Not first thread in core to wake up */
>> + stwcx. r15,0,r14
>> + isync
>> + bne- lwarx_loop2
>> + b common_exit
>> +
>> +core_idle_lock_held:
>> + HMT_LOW
>> +core_idle_lock_loop:
>> +	lwz	r15,0(r14)
>> + andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
>> + bne core_idle_lock_loop
>> + HMT_MEDIUM
>> + b lwarx_loop2
>> +
>> +first_thread:
>> + /* First thread in core to wakeup */
>> + ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
>> + stwcx. r15,0,r14
>> + isync
>> + bne- lwarx_loop2
>> +
>> + /*
>> + * First thread in the core waking up from fastsleep. It needs to
>> + * call the fastsleep workaround code if the platform requires it.
>> + * Call it unconditionally here. The below branch instruction will
>> + * be patched out when the idle states are discovered if platform
>> + * does not require workaround.
>> + */
>> +.global pnv_fastsleep_workaround_at_exit
>> +pnv_fastsleep_workaround_at_exit:
>> + b fastsleep_workaround_at_exit
>> +
>> +timebase_resync:
>> + /* Do timebase resync if we are waking up from sleep. Use cr3 value
>> + * set in exceptions-64s.S */
>> + ble cr3,clear_lock
>> /* Time base re-sync */
>> - li r3,OPAL_RESYNC_TIMEBASE
>> + li r0,OPAL_RESYNC_TIMEBASE
>> bl opal_call_realmode;
>> -
>> /* TODO: Check r3 for failure */
>>
>> +clear_lock:
>> + andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
>> + lwsync
>> + stw r15,0(r14)
>> +
>> +common_exit:
>> + li r5,PNV_THREAD_RUNNING
>> + stb r5,PACA_THREAD_IDLE_STATE(r13)
>> +
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> + li r0,KVM_HWTHREAD_IN_KERNEL
>> + stb r0,HSTATE_HWTHREAD_STATE(r13)
>> + /* Order setting hwthread_state vs. testing hwthread_req */
>> + sync
>> + lbz r0,HSTATE_HWTHREAD_REQ(r13)
>> + cmpwi r0,0
>> + beq 6f
>> + b kvm_start_guest
>
> There is a bit of a problem here: the FIXUP_ENDIAN in
> opal_call_realmode will trash SRR1 (if the kernel is little-endian),
> but the code at kvm_start_guest needs SRR1 from the system reset
> exception so that it can know what the wakeup reason was.
>
Hmm, I'll save/restore SRR1 before calling opal_call_realmode. Thanks for catching this.

>> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
>> index 34c6665..97e0279 100644
>> --- a/arch/powerpc/platforms/powernv/setup.c
>> +++ b/arch/powerpc/platforms/powernv/setup.c
>> @@ -36,6 +36,9 @@
>> #include <asm/opal.h>
>> #include <asm/kexec.h>
>> #include <asm/smp.h>
>> +#include <asm/cputhreads.h>
>> +#include <asm/cpuidle.h>
>> +#include <asm/code-patching.h>
>>
>> #include "powernv.h"
>>
>> @@ -292,10 +295,43 @@ static void __init pnv_setup_machdep_rtas(void)
>>
>> static u32 supported_cpuidle_states;
>>
>> +static void pnv_alloc_idle_core_states(void)
>> +{
>> + int i, j;
>> + int nr_cores = cpu_nr_cores();
>> + u32 *core_idle_state;
>> +
>> + /*
>> + * core_idle_state - First 8 bits track the idle state of each thread
>> + * of the core. The 8th bit is the lock bit. Initially all thread bits
>> + * are set. They are cleared when the thread enters deep idle state
>> + * like sleep and winkle. Initially the lock bit is cleared.
>> + * The lock bit has 2 purposes
>> + * a. While the first thread is restoring core state, it prevents
>> + * from other threads in the core from switching to prcoess context.
>
> ^^^^ remove "from" ^^^^^^^ process
>
>> + * b. While the last thread in the core is saving the core state, it
>> + * prevent a different thread from waking up.
>
> ^^^^^^^ prevents
>
Oops. Will fix it.
>> + */
>> + for (i = 0; i < nr_cores; i++) {
>> + int first_cpu = i * threads_per_core;
>> + int node = cpu_to_node(first_cpu);
>> +
>> + core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
>> + for (j = 0; j < threads_per_core; j++) {
>> + int cpu = first_cpu + j;
>> +
>> + paca[cpu].core_idle_state_ptr = core_idle_state;
>> + paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
>> + paca[cpu].thread_mask = 1 << (cpu % threads_per_core);
>
> This would be simpler and quicker:
>
> paca[cpu].thread_mask = 1 << j;
>
Will make the change.
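
For reference, the allocation loop with that change folded in would look
roughly like this -- a sketch based on the quoted patch; the explicit
initialization of *core_idle_state mirrors the "all thread bits are set
initially" description in the comment and is assumed here:

    for (i = 0; i < nr_cores; i++) {
            int first_cpu = i * threads_per_core;
            int node = cpu_to_node(first_cpu);

            core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
            *core_idle_state = PNV_CORE_IDLE_THREAD_BITS; /* all set */
            for (j = 0; j < threads_per_core; j++) {
                    int cpu = first_cpu + j;

                    paca[cpu].core_idle_state_ptr = core_idle_state;
                    paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
                    paca[cpu].thread_mask = 1 << j;
            }
    }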


Thanks,
Shreyas

2014-12-08 05:52:27

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] powernv: powerpc: Add winkle support for offline cpus

On Thu, Dec 04, 2014 at 12:58:23PM +0530, Shreyas B. Prabhu wrote:
> Winkle is a deep idle state supported in power8 chips. A core enters
> winkle when all the threads of the core enter winkle. In this state
> power supply to the entire chiplet i.e core, private L2 and private L3
> is turned off. As a result it gives higher powersavings compared to
> sleep.
>
> But entering winkle results in a total hypervisor state loss. Hence the
> hypervisor context has to be preserved before entering winkle and
> restored upon wake up.
>
> Power-on Reset Engine (PORE) is a dedicated engine which is responsible
> for powering on the chiplet during wake up. It can be programmed to
> restore the register contents of a few specific registers. This patch
> uses PORE to restore register state wherever possible and uses the stack to
> save and restore the rest of the necessary registers.
>
> With hypervisor state restore things fall under three categories-
> per-core state, per-subcore state and per-thread state. To manage this,
> extend the infrastructure introduced for sleep. Mainly we add a paca
> variable subcore_sibling_mask. Using this and the core_idle_state we can
> distinguish the first thread in core and subcore.

Comments below...

> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index 7637889..2b9b5fb 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -102,9 +102,7 @@ system_reset_pSeries:
> #ifdef CONFIG_PPC_P7_NAP
> BEGIN_FTR_SECTION
> /* Running native on arch 2.06 or later, check if we are
> - * waking up from nap. We only handle no state loss and
> - * supervisor state loss. We do -not- handle hypervisor
> - * state loss at this time.
> + * waking up from nap/sleep/winkle.
> */
> mfspr r13,SPRN_SRR1
> rlwinm. r13,r13,47-31,30,31
> @@ -112,7 +110,17 @@ BEGIN_FTR_SECTION
>
> cmpwi cr3,r13,2
>
> - GET_PACA(r13)
> +	/* Check if last bit of HSPRG0 is set. This indicates whether we are
> + * waking up from winkle */
> + li r3,1
> + mfspr r4,SPRN_HSPRG0
> + and r5,r4,r3
> + cmpwi cr4,r5,1 /* Store result in cr4 for later use */
> +
> + andc r4,r4,r3
> + mtspr SPRN_HSPRG0,r4
> +
> + mr r13,r4

This seems unnecessarily convoluted. How about:

GET_PACA(r13)
clrldi r5,r13,63
clrrdi r13,r13,1
cmpwi cr4,r5,1
mtspr SPRN_HSPRG0,r13
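
In C terms this is the usual pointer-tagging trick: the paca pointer is
sufficiently aligned that its low bit is free to carry the "waking from
winkle" flag. A rough analogue, with helper names invented for
illustration:

    #include <stdint.h>

    /* Before entering winkle: stash the flag in bit 0 of HSPRG0. */
    static inline uintptr_t paca_tag_winkle(void *paca)
    {
            return (uintptr_t)paca | 1;
    }

    /* On wakeup: recover the clean paca pointer and the flag, mirroring
     * the clrldi/clrrdi pair above. */
    static inline void *paca_untag(uintptr_t hsprg0, int *from_winkle)
    {
            *from_winkle = (int)(hsprg0 & 1);
            return (void *)(hsprg0 & ~(uintptr_t)1);
    }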

> diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
> index 8c3a1f4..8102075 100644
> --- a/arch/powerpc/kernel/idle_power7.S
> +++ b/arch/powerpc/kernel/idle_power7.S
> @@ -19,8 +19,24 @@
> #include <asm/kvm_book3s_asm.h>
> #include <asm/opal.h>
> #include <asm/cpuidle.h>
> +#include <asm/mmu-hash64.h>
>
> #undef DEBUG
> +/*
> + * Use unused space in the interrupt stack to save and restore
> + * registers for winkle support.
> + */
> +#define _SDR1 GPR3
> +#define _RPR GPR4
> +#define _SPURR GPR5
> +#define _PURR GPR6
> +#define _TSCR GPR7
> +#define _DSCR GPR8
> +#define _AMOR GPR9
> +#define _PMC5 GPR10
> +#define _PMC6 GPR11

Why only PMC5 and PMC6 out of all the PMU registers? What about
PMC1-PMC4 and the MMCR registers? I assume they're lost during winkle
state also, aren't they? If we're not saving them, what's the point
of saving and restoring PMC5 and PMC6?

> +#define _WORT GPR12
> +#define _WORC GPR13
>
> /* Idle state entry routines */
>
> @@ -124,8 +140,8 @@ power7_enter_nap_mode:
> stb r4,HSTATE_HWTHREAD_STATE(r13)
> #endif
> stb r3,PACA_THREAD_IDLE_STATE(r13)
> - cmpwi cr1,r3,PNV_THREAD_SLEEP
> - bge cr1,2f
> + cmpwi cr3,r3,PNV_THREAD_SLEEP
> + bge cr3,2f
> IDLE_STATE_ENTER_SEQ(PPC_NAP)
> /* No return */
> 2:
> @@ -154,7 +170,8 @@ pnv_fastsleep_workaround_at_entry:
> isync
> bne- lwarx_loop1
>
> -common_enter: /* common code for all the threads entering sleep */
> +common_enter: /* common code for all the threads entering sleep or winkle */
> + bgt cr3,enter_winkle
> IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>
> fastsleep_workaround_at_entry:
> @@ -175,6 +192,34 @@ fastsleep_workaround_at_entry:
> stw r0,0(r14)
> b common_enter
>
> +enter_winkle:
> + /*
> +	 * Note all registers, i.e. per-core, per-subcore or per-thread, are saved
> + * here since any thread in the core might wake up first
> + */
> + mfspr r3,SPRN_SDR1
> + std r3,_SDR1(r1)
> + mfspr r3,SPRN_RPR
> + std r3,_RPR(r1)
> + mfspr r3,SPRN_SPURR
> + std r3,_SPURR(r1)
> + mfspr r3,SPRN_PURR
> + std r3,_PURR(r1)
> + mfspr r3,SPRN_TSCR
> + std r3,_TSCR(r1)
> + mfspr r3,SPRN_DSCR
> + std r3,_DSCR(r1)
> + mfspr r3,SPRN_AMOR
> + std r3,_AMOR(r1)
> + mfspr r3,SPRN_PMC5
> + std r3,_PMC5(r1)
> + mfspr r3,SPRN_PMC6
> + std r3,_PMC6(r1)
> + mfspr r3,SPRN_WORT
> + std r3,_WORT(r1)
> + mfspr r3,SPRN_WORC
> + std r3,_WORC(r1)
> + IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
>
> _GLOBAL(power7_idle)
> /* Now check if user or arch enabled NAP mode */
> @@ -197,6 +242,12 @@ _GLOBAL(power7_sleep)
> b power7_powersave_common
> /* No return */
>
> +_GLOBAL(power7_winkle)
> + li r3,3
> + li r4,1
> + b power7_powersave_common
> + /* No return */
> +
> #define CHECK_HMI_INTERRUPT \
> mfspr r0,SPRN_SRR1; \
> BEGIN_FTR_SECTION_NESTED(66); \
> @@ -238,11 +289,23 @@ lwarx_loop2:
> bne core_idle_lock_held
>
> cmpwi cr2,r15,0
> + lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
> + and r4,r4,r15
> + cmpwi cr1,r4,0 /* Check if first in subcore */
> +
> + /*
> + * At this stage
> + * cr1 - 10 if first thread to wakeup in subcore
> + * cr2 - 10 if first thread to wakeup in core
> + * cr3- 01 if waking up from sleep or winkle
> + * cr4 - 10 if waking up from winkle
> + */

What do "10" and "01" mean in this comment? (If they were CR field
values in binary they would need to be 3 or 4 bits, not 2.)

Paul.

2014-12-08 21:54:53

by Shreyas B. Prabhu

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] powernv: powerpc: Add winkle support for offline cpus



On Monday 08 December 2014 11:22 AM, Paul Mackerras wrote:
> On Thu, Dec 04, 2014 at 12:58:23PM +0530, Shreyas B. Prabhu wrote:
>> Winkle is a deep idle state supported in power8 chips. A core enters
>> winkle when all the threads of the core enter winkle. In this state
>> power supply to the entire chiplet i.e core, private L2 and private L3
>> is turned off. As a result it gives higher powersavings compared to
>> sleep.
>>
>> But entering winkle results in a total hypervisor state loss. Hence the
>> hypervisor context has to be preserved before entering winkle and
>> restored upon wake up.
>>
>> Power-on Reset Engine (PORE) is a dedicated engine which is responsible
>> for powering on the chiplet during wake up. It can be programmed to
>> restore the register contents of a few specific registers. This patch
>> uses PORE to restore register state wherever possible and uses the stack to
>> save and restore the rest of the necessary registers.
>>
>> With hypervisor state restore things fall under three categories-
>> per-core state, per-subcore state and per-thread state. To manage this,
>> extend the infrastructure introduced for sleep. Mainly we add a paca
>> variable subcore_sibling_mask. Using this and the core_idle_state we can
>> distinguish the first thread in core and subcore.
>
> Comments below...
>
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 7637889..2b9b5fb 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -102,9 +102,7 @@ system_reset_pSeries:
>> #ifdef CONFIG_PPC_P7_NAP
>> BEGIN_FTR_SECTION
>> /* Running native on arch 2.06 or later, check if we are
>> - * waking up from nap. We only handle no state loss and
>> - * supervisor state loss. We do -not- handle hypervisor
>> - * state loss at this time.
>> + * waking up from nap/sleep/winkle.
>> */
>> mfspr r13,SPRN_SRR1
>> rlwinm. r13,r13,47-31,30,31
>> @@ -112,7 +110,17 @@ BEGIN_FTR_SECTION
>>
>> cmpwi cr3,r13,2
>>
>> - GET_PACA(r13)
>> +	/* Check if last bit of HSPRG0 is set. This indicates whether we are
>> + * waking up from winkle */
>> + li r3,1
>> + mfspr r4,SPRN_HSPRG0
>> + and r5,r4,r3
>> + cmpwi cr4,r5,1 /* Store result in cr4 for later use */
>> +
>> + andc r4,r4,r3
>> + mtspr SPRN_HSPRG0,r4
>> +
>> + mr r13,r4
>
> This seems unnecessarily convoluted. How about:
>
> GET_PACA(r13)
> clrldi r5,r13,63
> clrrdi r13,r13,1
> cmpwi cr4,r5,1
> mtspr SPRN_HSPRG0,r13
>
Yes, makes more sense. I'll use this.

>> diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
>> index 8c3a1f4..8102075 100644
>> --- a/arch/powerpc/kernel/idle_power7.S
>> +++ b/arch/powerpc/kernel/idle_power7.S
>> @@ -19,8 +19,24 @@
>> #include <asm/kvm_book3s_asm.h>
>> #include <asm/opal.h>
>> #include <asm/cpuidle.h>
>> +#include <asm/mmu-hash64.h>
>>
>> #undef DEBUG
>> +/*
>> + * Use unused space in the interrupt stack to save and restore
>> + * registers for winkle support.
>> + */
>> +#define _SDR1 GPR3
>> +#define _RPR GPR4
>> +#define _SPURR GPR5
>> +#define _PURR GPR6
>> +#define _TSCR GPR7
>> +#define _DSCR GPR8
>> +#define _AMOR GPR9
>> +#define _PMC5 GPR10
>> +#define _PMC6 GPR11
>
> Why only PMC5 and PMC6 out of all the PMU registers? What about
> PMC1-PMC4 and the MMCR registers? I assume they're lost during winkle
> state also, aren't they? If we're not saving them, what's the point
> of saving and restoring PMC5 and PMC6?
>
Yes, all PMC and MMCR contents are lost. Using __restore_cpu_power8, the
MMCR registers are initialized to 0. The reasoning behind specifically
restoring PMC5 and PMC6 was that they are not programmable and
count cycles/instructions by default. We suspected that there might be a
userspace program which relied on PMC5/PMC6 always increasing.
But on closer look, since these counters are 32-bit and cycle/
instruction counts are bound to exceed them, I doubt such userspace programs
exist. I'll drop PMC5 and PMC6 in the next version.
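
A back-of-envelope check (illustrative) backs that up -- a 32-bit
free-running cycle counter at a nominal 4 GHz wraps in about a second:

    #include <stdio.h>

    int main(void)
    {
            double wrap = 4294967296.0 / 4e9; /* 2^32 cycles / 4 GHz */

            printf("PMC5 wraps every %.2f s\n", wrap); /* ~1.07 s */
            return 0;
    }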

>> +#define _WORT GPR12
>> +#define _WORC GPR13
>>
>> /* Idle state entry routines */
>>
>> @@ -124,8 +140,8 @@ power7_enter_nap_mode:
>> stb r4,HSTATE_HWTHREAD_STATE(r13)
>> #endif
>> stb r3,PACA_THREAD_IDLE_STATE(r13)
>> - cmpwi cr1,r3,PNV_THREAD_SLEEP
>> - bge cr1,2f
>> + cmpwi cr3,r3,PNV_THREAD_SLEEP
>> + bge cr3,2f
>> IDLE_STATE_ENTER_SEQ(PPC_NAP)
>> /* No return */
>> 2:
>> @@ -154,7 +170,8 @@ pnv_fastsleep_workaround_at_entry:
>> isync
>> bne- lwarx_loop1
>>
>> -common_enter: /* common code for all the threads entering sleep */
>> +common_enter: /* common code for all the threads entering sleep or winkle */
>> + bgt cr3,enter_winkle
>> IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
>>
>> fastsleep_workaround_at_entry:
>> @@ -175,6 +192,34 @@ fastsleep_workaround_at_entry:
>> stw r0,0(r14)
>> b common_enter
>>
>> +enter_winkle:
>> + /*
>> +	 * Note all registers, i.e. per-core, per-subcore or per-thread, are saved
>> + * here since any thread in the core might wake up first
>> + */
>> + mfspr r3,SPRN_SDR1
>> + std r3,_SDR1(r1)
>> + mfspr r3,SPRN_RPR
>> + std r3,_RPR(r1)
>> + mfspr r3,SPRN_SPURR
>> + std r3,_SPURR(r1)
>> + mfspr r3,SPRN_PURR
>> + std r3,_PURR(r1)
>> + mfspr r3,SPRN_TSCR
>> + std r3,_TSCR(r1)
>> + mfspr r3,SPRN_DSCR
>> + std r3,_DSCR(r1)
>> + mfspr r3,SPRN_AMOR
>> + std r3,_AMOR(r1)
>> + mfspr r3,SPRN_PMC5
>> + std r3,_PMC5(r1)
>> + mfspr r3,SPRN_PMC6
>> + std r3,_PMC6(r1)
>> + mfspr r3,SPRN_WORT
>> + std r3,_WORT(r1)
>> + mfspr r3,SPRN_WORC
>> + std r3,_WORC(r1)
>> + IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
>>
>> _GLOBAL(power7_idle)
>> /* Now check if user or arch enabled NAP mode */
>> @@ -197,6 +242,12 @@ _GLOBAL(power7_sleep)
>> b power7_powersave_common
>> /* No return */
>>
>> +_GLOBAL(power7_winkle)
>> + li r3,3
>> + li r4,1
>> + b power7_powersave_common
>> + /* No return */
>> +
>> #define CHECK_HMI_INTERRUPT \
>> mfspr r0,SPRN_SRR1; \
>> BEGIN_FTR_SECTION_NESTED(66); \
>> @@ -238,11 +289,23 @@ lwarx_loop2:
>> bne core_idle_lock_held
>>
>> cmpwi cr2,r15,0
>> + lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
>> + and r4,r4,r15
>> + cmpwi cr1,r4,0 /* Check if first in subcore */
>> +
>> + /*
>> + * At this stage
>> + * cr1 - 10 if first thread to wakeup in subcore
>> + * cr2 - 10 if first thread to wakeup in core
>> + * cr3- 01 if waking up from sleep or winkle
>> + * cr4 - 10 if waking up from winkle
>> + */
>
> What do "10" and "01" mean in this comment? (If they were CR field
> values in binary they would need to be 3 or 4 bits, not 2.)
>
I'll fix this.

Thanks,
Shreyas

2014-12-14 10:06:03

by Michael Ellerman

[permalink] [raw]
Subject: Re: [v3, 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states

On Thu, 2014-12-04 at 07:28:21 UTC, "Shreyas B. Prabhu" wrote:
> From: "Preeti U. Murthy" <[email protected]>
>
> The secondary threads should enter deep idle states so as to gain maximum
> powersavings when the entire core is offline. To do so the offline path
> must be made aware of the available deepest idle state. Hence probe the
> device tree for the possible idle states in powernv core code and
> expose the deepest idle state through flags.
>
> Since the device tree is probed by the cpuidle driver as well, move
> the parameters required to discover the idle states into an appropriate
> place common to both the driver and the powernv core code.
>
> Another point is that fastsleep idle state may require workarounds in
> the kernel to function properly. This workaround is introduced in the
> subsequent patches. However neither the cpuidle driver nor the hotplug
> path need be bothered about this workaround.
>
> They will be taken care of by the core powernv code.

...

> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
> index 4753958..3dc4cec 100644
> --- a/arch/powerpc/platforms/powernv/smp.c
> +++ b/arch/powerpc/platforms/powernv/smp.c
> @@ -159,13 +160,17 @@ static void pnv_smp_cpu_kill_self(void)
> generic_set_cpu_dead(cpu);
> smp_wmb();
>
> + idle_states = pnv_get_supported_cpuidle_states();
> /* We don't want to take decrementer interrupts while we are offline,
> * so clear LPCR:PECE1. We keep PECE2 enabled.
> */
> mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
> while (!generic_check_cpu_restart(cpu)) {
> ppc64_runlatch_off();
> - power7_nap(1);
> + if (idle_states & OPAL_PM_SLEEP_ENABLED)
> + power7_sleep();
> + else
> + power7_nap(1);

So I might be missing something subtle here, but aren't we potentially enabling
sleep here, prior to your next patch which makes it safe to actually use sleep?

Shouldn't we only allow sleep after patch 3? Or in other words shouldn't this
be patch 3 (or 4)?

cheers

2014-12-14 11:50:12

by Shreyas B. Prabhu

[permalink] [raw]
Subject: Re: [v3, 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states



On Sunday 14 December 2014 03:35 PM, Michael Ellerman wrote:
> On Thu, 2014-12-04 at 07:28:21 UTC, "Shreyas B. Prabhu" wrote:
>> From: "Preeti U. Murthy" <[email protected]>
>>
>> The secondary threads should enter deep idle states so as to gain maximum
>> powersavings when the entire core is offline. To do so the offline path
>> must be made aware of the available deepest idle state. Hence probe the
>> device tree for the possible idle states in powernv core code and
>> expose the deepest idle state through flags.
>>
>> Since the device tree is probed by the cpuidle driver as well, move
>> the parameters required to discover the idle states into an appropriate
>> place common to both the driver and the powernv core code.
>>
>> Another point is that fastsleep idle state may require workarounds in
>> the kernel to function properly. This workaround is introduced in the
>> subsequent patches. However neither the cpuidle driver nor the hotplug
>> path need be bothered about this workaround.
>>
>> They will be taken care of by the core powernv code.
>
> ...
>
>> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
>> index 4753958..3dc4cec 100644
>> --- a/arch/powerpc/platforms/powernv/smp.c
>> +++ b/arch/powerpc/platforms/powernv/smp.c
>> @@ -159,13 +160,17 @@ static void pnv_smp_cpu_kill_self(void)
>> generic_set_cpu_dead(cpu);
>> smp_wmb();
>>
>> + idle_states = pnv_get_supported_cpuidle_states();
>> /* We don't want to take decrementer interrupts while we are offline,
>> * so clear LPCR:PECE1. We keep PECE2 enabled.
>> */
>> mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
>> while (!generic_check_cpu_restart(cpu)) {
>> ppc64_runlatch_off();
>> - power7_nap(1);
>> + if (idle_states & OPAL_PM_SLEEP_ENABLED)
>> + power7_sleep();
>> + else
>> + power7_nap(1);
>
> So I might be missing something subtle here, but aren't we potentially enabling
> sleep here, prior to your next patch which makes it safe to actually use sleep?
>
> Shouldn't we only allow sleep after patch 3? Or in other words shouldn't this
> be patch 3 (or 4)?
>

A point to note here: when sleep is exposed in the device tree under ibm,cpu-idle-state-flags,
we use 2 bits, OPAL_PM_SLEEP_ENABLED and OPAL_PM_SLEEP_ENABLED_ER1. This patch only enables
sleep in the OPAL_PM_SLEEP_ENABLED case. In current POWER8 chips, sleep is exposed as
OPAL_PM_SLEEP_ENABLED_ER1, indicating the hardware bug and the need for the fastsleep
workaround. And the bulk of the redesign introduced in the next patch serves the fastsleep
workaround and winkle.

That said, using sleep without "powernv: cpuidle: Redesign idle states management"
does expose us to a bug when performing VM migration onto subcores. But not enabling it
here (i.e. the offline case) until the next patch doesn't make much difference, as the
cpuidle framework has already enabled sleep.

In other words, the OPAL_PM_SLEEP_ENABLED case will come into the picture once the hardware
bug around fastsleep is fixed. And in that case, running any kernel without "powernv:
cpuidle: Redesign idle states management" does expose us to a bug with sleep + VM
migration onto subcores, because cpuidle enables sleep based on the OPAL_PM_SLEEP_ENABLED
bit. IMO delaying the enabling of sleep in the OPAL_PM_SLEEP_ENABLED case until the next
patch, only for offline cpus, would not gain us much. But I'll be happy to resend the
patches with the change if you think it is required.


Thanks,
Shreyas

2014-12-14 23:44:47

by Michael Ellerman

[permalink] [raw]
Subject: Re: [v3, 2/4] powerpc/powernv: Enable Offline CPUs to enter deep idle states

On Sun, 2014-12-14 at 17:19 +0530, Shreyas B Prabhu wrote:
>
> On Sunday 14 December 2014 03:35 PM, Michael Ellerman wrote:
> > On Thu, 2014-12-04 at 07:28:21 UTC, "Shreyas B. Prabhu" wrote:
> >> From: "Preeti U. Murthy" <[email protected]>
> >>
> >> The secondary threads should enter deep idle states so as to gain maximum
> >> powersavings when the entire core is offline. To do so the offline path
> >> must be made aware of the available deepest idle state. Hence probe the
> >> device tree for the possible idle states in powernv core code and
> >> expose the deepest idle state through flags.
> >>
> >> Since the device tree is probed by the cpuidle driver as well, move
> >> the parameters required to discover the idle states into an appropriate
> >> place common to both the driver and the powernv core code.
> >>
> >> Another point is that fastsleep idle state may require workarounds in
> >> the kernel to function properly. This workaround is introduced in the
> >> subsequent patches. However neither the cpuidle driver nor the hotplug
> >> path need be bothered about this workaround.
> >>
> >> They will be taken care of by the core powernv code.
> >
> > ...
> >
> >> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
> >> index 4753958..3dc4cec 100644
> >> --- a/arch/powerpc/platforms/powernv/smp.c
> >> +++ b/arch/powerpc/platforms/powernv/smp.c
> >> @@ -159,13 +160,17 @@ static void pnv_smp_cpu_kill_self(void)
> >> generic_set_cpu_dead(cpu);
> >> smp_wmb();
> >>
> >> + idle_states = pnv_get_supported_cpuidle_states();
> >> /* We don't want to take decrementer interrupts while we are offline,
> >> * so clear LPCR:PECE1. We keep PECE2 enabled.
> >> */
> >> mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
> >> while (!generic_check_cpu_restart(cpu)) {
> >> ppc64_runlatch_off();
> >> - power7_nap(1);
> >> + if (idle_states & OPAL_PM_SLEEP_ENABLED)
> >> + power7_sleep();
> >> + else
> >> + power7_nap(1);
> >
> > So I might be missing something subtle here, but aren't we potentially enabling
> > sleep here, prior to your next patch which makes it safe to actually use sleep?
> >
> > Shouldn't we only allow sleep after patch 3? Or in other words shouldn't this
> > be patch 3 (or 4)?
>
> A point to note here: when sleep is exposed in the device tree under ibm,cpu-idle-state-flags,
> we use 2 bits, OPAL_PM_SLEEP_ENABLED and OPAL_PM_SLEEP_ENABLED_ER1. This patch only enables
> sleep in the OPAL_PM_SLEEP_ENABLED case. In current POWER8 chips, sleep is exposed as
> OPAL_PM_SLEEP_ENABLED_ER1, indicating the hardware bug and the need for the fastsleep
> workaround. And the bulk of the redesign introduced in the next patch serves the fastsleep
> workaround and winkle.
>
> That said, using sleep without "powernv: cpuidle: Redesign idle states management"
> does expose us to a bug when performing VM migration onto subcores. But not enabling it
> here (i.e. the offline case) until the next patch doesn't make much difference, as the
> cpuidle framework has already enabled sleep.
>
> In other words, the OPAL_PM_SLEEP_ENABLED case will come into the picture once the hardware
> bug around fastsleep is fixed. And in that case, running any kernel without "powernv:
> cpuidle: Redesign idle states management" does expose us to a bug with sleep + VM
> migration onto subcores, because cpuidle enables sleep based on the OPAL_PM_SLEEP_ENABLED
> bit. IMO delaying the enabling of sleep in the OPAL_PM_SLEEP_ENABLED case until the next
> patch, only for offline cpus, would not gain us much. But I'll be happy to resend the
> patches with the change if you think it is required.

OK, thanks for the explanation. I'll put it in as-is.

In future, if you can add that sort of explanation to the changelog, that would
be great.

cheers