Subject: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

Hi,

**** RFC not for inclusion ****

This is the version 3 of the patch series to provide a cpu-offline framework
that enables the administrators choose the state of a CPU when it is
offlined, when multiple such states are exposed by the underlying
architecture.

Changes from Version 2:(can be found here: http://lkml.org/lkml/2009/8/28/102)
- Addressed Andrew Morton's review comments regarding names of global
variables, handling of error conditions and documentation of the interfaces.

- Implemented a patch to provide helper functions to set the cede latency
specifier value in the VPA indicating latency expectation of the guest OS
when the vcpu is ceded from a subsequent H_CEDE hypercall. Hypervisor may
use this for better energy savings.

- Renamed of the cpu-hotplug states. "deallocate" is renamed
as "offline" and "deactivate" is renamed as "inactive".

The patch-series exposes the following sysfs tunables to
allow the system-adminstrator to choose the state of a CPU:

To query the available hotplug states, one needs to read the sysfs tunable:
/sys/devices/system/cpu/cpu<number>/available_hotplug_states
To query or set the current state, on needs to read/write the sysfs tunable:
/sys/devices/system/cpu/cpu<number>/current_hotplug_state

The patchset ensures that the writes to the "current_hotplug_state" sysfs file are
serialized against the writes to the "online" file.

This patchset contains the offline state driver implemented for
pSeries. For pSeries, we define three available_hotplug_states. They are:

online: The processor is online.

offline: This is the the default behaviour when the cpu is offlined
even in the absense of this driver. The CPU would call make an
rtas_stop_self() call and hand over the CPU back to the resource pool,
thereby effectively deallocating that vCPU from the LPAR.
NOTE: This would result in a configuration change to the LPAR
which is visible to the outside world.

inactive: This cedes the vCPU to the hypervisor with a cede latency
specifier value 2.
NOTE: This option does not result in a configuration change
and the vCPU would be still entitled to the LPAR to which it earlier
belong to.

Any feedback on the patchset will be immensely valuable.
---

Arun R Bharadwaj (1):
pSeries: cede latency specifier helper function.

Gautham R Shenoy (2):
cpu: Implement cpu-offline-state callbacks for pSeries.
cpu: Offline state Framework.


Documentation/cpu-hotplug.txt | 22 +++
arch/powerpc/include/asm/lppaca.h | 9 +
arch/powerpc/platforms/pseries/Makefile | 2
arch/powerpc/platforms/pseries/hotplug-cpu.c | 88 ++++++++++-
arch/powerpc/platforms/pseries/offline_driver.c | 148 +++++++++++++++++++
arch/powerpc/platforms/pseries/offline_driver.h | 20 +++
arch/powerpc/platforms/pseries/plpar_wrappers.h | 17 ++
arch/powerpc/platforms/pseries/smp.c | 17 ++
arch/powerpc/xmon/xmon.c | 3
drivers/base/cpu.c | 181 ++++++++++++++++++++++-
include/linux/cpu.h | 10 +
11 files changed, 498 insertions(+), 19 deletions(-)
create mode 100644 arch/powerpc/platforms/pseries/offline_driver.c
create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h

--
Thanks and Regards
gautham.


Subject: [PATCH v3 1/3] pSeries: cede latency specifier helper function.

From: Arun R Bharadwaj <[email protected]>

This patch provides helper functions to set the cede latency specifier
value in the VPA indicating the latency expectation of the guest OS to
inform the hypervisor's choice of the platform dependent energy saving
mode chosen for the processor when unused during the subsequent
H_CEDE hypercall.

Signed-off-by: Arun R Bharadwaj <[email protected]>
Signed-off-by: Gautham R Shenoy <[email protected]>
---
arch/powerpc/include/asm/lppaca.h | 9 ++++++++-
arch/powerpc/platforms/pseries/plpar_wrappers.h | 17 +++++++++++++++++
arch/powerpc/xmon/xmon.c | 3 ++-
3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index f78f65c..aaa0066 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -100,7 +100,14 @@ struct lppaca {
// Used to pass parms from the OS to PLIC for SetAsrAndRfid
u64 saved_gpr3; // Saved GPR3 x20-x27
u64 saved_gpr4; // Saved GPR4 x28-x2F
- u64 saved_gpr5; // Saved GPR5 x30-x37
+ union {
+ u64 saved_gpr5; // Saved GPR5 x30-x37
+ struct {
+ u8 cede_latency_hint; // x30
+ u8 reserved[7]; // x31-x36
+ } fields;
+ } gpr5_dword;
+

u8 dtl_enable_mask; // Dispatch Trace Log mask x38-x38
u8 donate_dedicated_cpu; // Donate dedicated CPU cycles x39-x39
diff --git a/arch/powerpc/platforms/pseries/plpar_wrappers.h b/arch/powerpc/platforms/pseries/plpar_wrappers.h
index a24a6b2..1174d4b 100644
--- a/arch/powerpc/platforms/pseries/plpar_wrappers.h
+++ b/arch/powerpc/platforms/pseries/plpar_wrappers.h
@@ -9,11 +9,28 @@ static inline long poll_pending(void)
return plpar_hcall_norets(H_POLL_PENDING);
}

+static inline u8 get_cede_latency_hint(void)
+{
+ return get_lppaca()->gpr5_dword.fields.cede_latency_hint;
+}
+
+static inline void set_cede_latency_hint(u8 latency_hint)
+{
+ get_lppaca()->gpr5_dword.fields.cede_latency_hint = latency_hint;
+}
+
static inline long cede_processor(void)
{
return plpar_hcall_norets(H_CEDE);
}

+static inline long extended_cede_processor(u8 latency_hint)
+{
+ set_cede_latency_hint(latency_hint);
+ cede_processor();
+
+}
+
static inline long vpa_call(unsigned long flags, unsigned long cpu,
unsigned long vpa)
{
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index e1f33a8..a2089cd 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1613,7 +1613,8 @@ static void super_regs(void)
ptrLpPaca->saved_srr0, ptrLpPaca->saved_srr1);
printf(" Saved Gpr3=%.16lx Saved Gpr4=%.16lx \n",
ptrLpPaca->saved_gpr3, ptrLpPaca->saved_gpr4);
- printf(" Saved Gpr5=%.16lx \n", ptrLpPaca->saved_gpr5);
+ printf(" Saved Gpr5=%.16lx \n",
+ ptrLpPaca->gpr5_dword.saved_gpr5);
}
#endif

Subject: [PATCH v3 2/3] cpu: Offline state Framework.

Provide an interface by which the system administrator can decide what state
should the CPU go to when it is offlined.

To query the hotplug states, on needs to perform a read on:
/sys/devices/system/cpu/cpu<number>/available_hotplug_states

To query or set the current state for a particular CPU, one needs to
use the sysfs interface

/sys/devices/system/cpu/cpu<number>/current_hotplug_state

This patch implements the architecture independent bits of the
cpu-offline-state framework.

The architecture specific bits are expected to register the actual code which
implements the callbacks when the above mentioned sysfs interfaces are read or
written into. Thus the values provided by reading available_offline_states vary
with the architecture.

The patch provides serialization for writes to the "current_hotplug_state"
with respect to with the writes to the "online" sysfs tunable.

Signed-off-by: Gautham R Shenoy <[email protected]>
---
Documentation/cpu-hotplug.txt | 22 +++++
drivers/base/cpu.c | 181 +++++++++++++++++++++++++++++++++++++++--
include/linux/cpu.h | 10 ++
3 files changed, 204 insertions(+), 9 deletions(-)

diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
index 9d620c1..dcec06d 100644
--- a/Documentation/cpu-hotplug.txt
+++ b/Documentation/cpu-hotplug.txt
@@ -115,6 +115,28 @@ Just remember the critical section cannot call any
function that can sleep or schedule this process away. The preempt_disable()
will work as long as stop_machine_run() is used to take a cpu down.

+CPU-offline states
+--------------------------------------
+On architectures which allow the more than one valid state when
+the CPU goes offline, the system administrator can decide
+the state the CPU needs to go to when it is offlined.
+
+If the architecture has implemented a cpu-offline driver exposing these
+multiple offline states, the system administrator can use the following sysfs
+interfaces to query the available hotplug states and also query and set the
+current hotplug state for a given cpu:
+
+To query the hotplug states, on needs to perform a read on:
+/sys/devices/system/cpu/cpu<number>/available_hotplug_states
+
+To query or set the current state for a particular CPU,
+one needs to use the sysfs interface
+
+/sys/devices/system/cpu/cpu<number>/current_hotplug_state
+
+Writes to the "online" sysfs files are serialized against the writes to the
+"current_hotplug_state" file.
+
CPU Hotplug - Frequently Asked Questions.

Q: How to enable my kernel to support CPU hotplug?
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index e62a4cc..00c38be 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -20,7 +20,166 @@ EXPORT_SYMBOL(cpu_sysdev_class);

static DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices);

+struct sys_device *get_cpu_sysdev(unsigned cpu)
+{
+ if (cpu < nr_cpu_ids && cpu_possible(cpu))
+ return per_cpu(cpu_sys_devices, cpu);
+ else
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(get_cpu_sysdev);
+
+
#ifdef CONFIG_HOTPLUG_CPU
+
+struct cpu_offline_driver *cpu_offline_driver;
+static DEFINE_MUTEX(cpu_offline_driver_lock);
+
+ssize_t show_available_hotplug_states(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+ int cpu_num = cpu->sysdev.id;
+ ssize_t ret;
+
+ mutex_lock(&cpu_offline_driver_lock);
+ if (!cpu_offline_driver) {
+ ret = -EEXIST;
+ goto out_unlock;
+ }
+
+ ret = cpu_offline_driver->read_available_states(cpu_num, buf);
+
+out_unlock:
+ mutex_unlock(&cpu_offline_driver_lock);
+
+ return ret;
+
+}
+
+ssize_t show_current_hotplug_state(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+ int cpu_num = cpu->sysdev.id;
+ ssize_t ret = 0;
+
+ mutex_lock(&cpu_offline_driver_lock);
+ if (!cpu_offline_driver) {
+ ret = -EEXIST;
+ goto out_unlock;
+ }
+
+ ret = cpu_offline_driver->read_current_state(cpu_num, buf);
+
+out_unlock:
+ mutex_unlock(&cpu_offline_driver_lock);
+
+ return ret;
+}
+
+ssize_t store_current_hotplug_state(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct cpu *cpu = container_of(dev, struct cpu, sysdev);
+ int cpu_num = cpu->sysdev.id;
+ ssize_t ret = count;
+
+ mutex_lock(&cpu_offline_driver_lock);
+ if (!cpu_offline_driver) {
+ ret = -EEXIST;
+ goto out_unlock;
+ }
+
+ ret = cpu_offline_driver->write_current_state(cpu_num, buf);
+
+out_unlock:
+ mutex_unlock(&cpu_offline_driver_lock);
+
+ if (ret >= 0)
+ ret = count;
+ return ret;
+}
+
+static SYSDEV_ATTR(available_hotplug_states, 0444,
+ show_available_hotplug_states, NULL);
+static SYSDEV_ATTR(current_hotplug_state, 0644,
+ show_current_hotplug_state, store_current_hotplug_state);
+
+/* Should be called with cpu_add_remove_lock held */
+void cpu_offline_driver_add_cpu(struct sys_device *cpu_sys_dev)
+{
+ int rc;
+
+ if (!cpu_offline_driver || !cpu_sys_dev)
+ return;
+
+ rc = sysdev_create_file(cpu_sys_dev, &attr_available_hotplug_states);
+ BUG_ON(rc == -EEXIST);
+
+ rc = sysdev_create_file(cpu_sys_dev, &attr_current_hotplug_state);
+ BUG_ON(rc == -EEXIST);
+}
+
+/* Should be called with cpu_add_remove_lock held */
+void cpu_offline_driver_remove_cpu(struct sys_device *cpu_sys_dev)
+{
+ if (!cpu_offline_driver || !cpu_sys_dev)
+ return;
+
+ sysdev_remove_file(cpu_sys_dev, &attr_available_hotplug_states);
+ sysdev_remove_file(cpu_sys_dev, &attr_current_hotplug_state);
+
+}
+
+int register_cpu_offline_driver(struct cpu_offline_driver *arch_cpu_driver)
+{
+ int ret = 0;
+ int cpu;
+
+ mutex_lock(&cpu_offline_driver_lock);
+
+ if (cpu_offline_driver != NULL) {
+ ret = -EEXIST;
+ goto out_unlock;
+ }
+
+ if (WARN_ON(!(arch_cpu_driver->read_available_states &&
+ arch_cpu_driver->read_current_state &&
+ arch_cpu_driver->write_current_state))) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ cpu_offline_driver = arch_cpu_driver;
+
+ for_each_possible_cpu(cpu)
+ cpu_offline_driver_add_cpu(get_cpu_sysdev(cpu));
+
+out_unlock:
+ mutex_unlock(&cpu_offline_driver_lock);
+ return ret;
+}
+
+void unregister_cpu_offline_driver(struct cpu_offline_driver *arch_cpu_driver)
+{
+ int cpu;
+ mutex_lock(&cpu_offline_driver_lock);
+
+ if (WARN_ON(!cpu_offline_driver)) {
+ mutex_unlock(&cpu_offline_driver_lock);
+ return;
+ }
+
+ for_each_possible_cpu(cpu)
+ cpu_offline_driver_remove_cpu(get_cpu_sysdev(cpu));
+
+ cpu_offline_driver = NULL;
+ mutex_unlock(&cpu_offline_driver_lock);
+}
+
+
static ssize_t show_online(struct sys_device *dev, struct sysdev_attribute *attr,
char *buf)
{
@@ -35,6 +194,7 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
struct cpu *cpu = container_of(dev, struct cpu, sysdev);
ssize_t ret;

+ mutex_lock(&cpu_offline_driver_lock);
switch (buf[0]) {
case '0':
ret = cpu_down(cpu->sysdev.id);
@@ -50,6 +210,8 @@ static ssize_t __ref store_online(struct sys_device *dev, struct sysdev_attribut
ret = -EINVAL;
}

+ mutex_unlock(&cpu_offline_driver_lock);
+
if (ret >= 0)
ret = count;
return ret;
@@ -59,23 +221,33 @@ static SYSDEV_ATTR(online, 0644, show_online, store_online);
static void __cpuinit register_cpu_control(struct cpu *cpu)
{
sysdev_create_file(&cpu->sysdev, &attr_online);
+ mutex_lock(&cpu_offline_driver_lock);
+ cpu_offline_driver_add_cpu(&cpu->sysdev);
+ mutex_unlock(&cpu_offline_driver_lock);
}
+
void unregister_cpu(struct cpu *cpu)
{
int logical_cpu = cpu->sysdev.id;

unregister_cpu_under_node(logical_cpu, cpu_to_node(logical_cpu));

+ mutex_lock(&cpu_offline_driver_lock);
+ cpu_offline_driver_remove_cpu(&cpu->sysdev);
+ mutex_unlock(&cpu_offline_driver_lock);
+
sysdev_remove_file(&cpu->sysdev, &attr_online);

sysdev_unregister(&cpu->sysdev);
per_cpu(cpu_sys_devices, logical_cpu) = NULL;
return;
}
+
#else /* ... !CONFIG_HOTPLUG_CPU */
static inline void register_cpu_control(struct cpu *cpu)
{
}
+
#endif /* CONFIG_HOTPLUG_CPU */

#ifdef CONFIG_KEXEC
@@ -224,15 +396,6 @@ int __cpuinit register_cpu(struct cpu *cpu, int num)
return error;
}

-struct sys_device *get_cpu_sysdev(unsigned cpu)
-{
- if (cpu < nr_cpu_ids && cpu_possible(cpu))
- return per_cpu(cpu_sys_devices, cpu);
- else
- return NULL;
-}
-EXPORT_SYMBOL_GPL(get_cpu_sysdev);
-
int __init cpu_dev_init(void)
{
int err;
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 4d668e0..8ac12e8 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -51,6 +51,16 @@ struct notifier_block;
#ifdef CONFIG_HOTPLUG_CPU
extern int register_cpu_notifier(struct notifier_block *nb);
extern void unregister_cpu_notifier(struct notifier_block *nb);
+
+struct cpu_offline_driver {
+ ssize_t (*read_available_states)(unsigned int cpu, char *buf);
+ ssize_t (*read_current_state)(unsigned int cpu, char *buf);
+ ssize_t (*write_current_state)(unsigned int cpu, const char *buf);
+};
+
+extern int register_cpu_offline_driver(struct cpu_offline_driver *driver);
+extern void unregister_cpu_offline_driver(struct cpu_offline_driver *driver);
+
#else

#ifndef MODULE

Subject: [PATCH v3 3/3] cpu: Implement cpu-offline-state callbacks for pSeries.

This patch implements the callbacks to handle the reads/writes into the sysfs
interfaces

/sys/devices/system/cpu/cpu<number>/available_hotplug_states
and
/sys/devices/system/cpu/cpu<number>/current_hotplug_state

Currently, the patch defines two states which the processor can go to when it
is offlined. They are

- offline: The current behaviour when the cpu is offlined.
The CPU would call make an rtas_stop_self() call and hand over the
CPU back to the resource pool, thereby effectively deallocating
that vCPU from the LPAR.

- inactive: This is expected to cede the processor to the hypervisor with a
latency hint specifier value. Hypervisor may use this hint to provide
better energy savings. In this state, the control of the vCPU will continue
to be with the LPAR.

Signed-off-by: Gautham R Shenoy <[email protected]>
---
arch/powerpc/platforms/pseries/Makefile | 2
arch/powerpc/platforms/pseries/hotplug-cpu.c | 88 +++++++++++++-
arch/powerpc/platforms/pseries/offline_driver.c | 148 +++++++++++++++++++++++
arch/powerpc/platforms/pseries/offline_driver.h | 20 +++
arch/powerpc/platforms/pseries/smp.c | 17 +++
5 files changed, 267 insertions(+), 8 deletions(-)
create mode 100644 arch/powerpc/platforms/pseries/offline_driver.c
create mode 100644 arch/powerpc/platforms/pseries/offline_driver.h

diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile
index 790c0b8..3a569c7 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -17,7 +17,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_PCI) += pci.o pci_dlpar.o
obj-$(CONFIG_PSERIES_MSI) += msi.o

-obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o
+obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o offline_driver.o
obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o

obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index a20ead8..1e06bb1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -30,6 +30,7 @@
#include <asm/pSeries_reconfig.h>
#include "xics.h"
#include "plpar_wrappers.h"
+#include "offline_driver.h"

/* This version can't take the spinlock, because it never returns */
static struct rtas_args rtas_stop_self_args = {
@@ -54,13 +55,74 @@ static void rtas_stop_self(void)
panic("Alas, I survived.\n");
}

+static void cede_on_offline(u8 cede_latency_hint)
+{
+ unsigned int cpu = smp_processor_id();
+ unsigned int hwcpu = hard_smp_processor_id();
+ u8 old_cede_latency_hint;
+
+ old_cede_latency_hint = get_cede_latency_hint();
+ get_lppaca()->idle = 1;
+ if (!get_lppaca()->shared_proc)
+ get_lppaca()->donate_dedicated_cpu = 1;
+
+ printk(KERN_INFO "cpu %u (hwid %u) ceding for offline with hint %d\n",
+ cpu, hwcpu, cede_latency_hint);
+ while (get_preferred_offline_state(cpu) != CPU_STATE_ONLINE) {
+ extended_cede_processor(cede_latency_hint);
+ printk(KERN_INFO "cpu %u (hwid %u) returned from cede.\n",
+ cpu, hwcpu);
+ printk(KERN_INFO
+ "Decrementer value = %x Timebase value = %llx\n",
+ get_dec(), get_tb());
+ }
+
+ printk(KERN_INFO "cpu %u (hwid %u) got prodded to go online\n",
+ cpu, hwcpu);
+
+ if (!get_lppaca()->shared_proc)
+ get_lppaca()->donate_dedicated_cpu = 0;
+ get_lppaca()->idle = 0;
+
+ /* Reset the cede_latency specifier value */
+ set_cede_latency_hint(old_cede_latency_hint);
+
+ unregister_slb_shadow(hwcpu, __pa(get_slb_shadow()));
+
+ /*
+ * NOTE: Calling start_secondary() here for now to start
+ * a new context.
+ *
+ * However, need to do it cleanly by resetting the stack
+ * pointer.
+ */
+ start_secondary();
+}
+
static void pseries_mach_cpu_die(void)
{
+ unsigned int cpu = smp_processor_id();
+ u8 cede_latency_hint = 0;
+
local_irq_disable();
idle_task_exit();
xics_teardown_cpu();
- unregister_slb_shadow(hard_smp_processor_id(), __pa(get_slb_shadow()));
- rtas_stop_self();
+
+ if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
+
+ set_cpu_current_state(cpu, CPU_STATE_OFFLINE);
+ unregister_slb_shadow(hard_smp_processor_id(),
+ __pa(get_slb_shadow()));
+ rtas_stop_self();
+ goto out_bug;
+ } else if (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
+ set_cpu_current_state(cpu, CPU_STATE_INACTIVE);
+ cede_latency_hint = 2;
+ cede_on_offline(cede_latency_hint);
+
+ }
+
+out_bug:
/* Should never get here... */
BUG();
for(;;);
@@ -112,11 +174,23 @@ static void pseries_cpu_die(unsigned int cpu)
int cpu_status;
unsigned int pcpu = get_hard_smp_processor_id(cpu);

- for (tries = 0; tries < 25; tries++) {
- cpu_status = query_cpu_stopped(pcpu);
- if (cpu_status == 0 || cpu_status == -1)
- break;
- cpu_relax();
+ if (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
+ cpu_status = 1;
+ for (tries = 0; tries < 1000; tries++) {
+ if (get_cpu_current_state(cpu) == CPU_STATE_INACTIVE) {
+ cpu_status = 0;
+ break;
+ }
+ cpu_relax();
+ }
+ } else {
+
+ for (tries = 0; tries < 25; tries++) {
+ cpu_status = query_cpu_stopped(pcpu);
+ if (cpu_status == 0 || cpu_status == -1)
+ break;
+ cpu_relax();
+ }
}
if (cpu_status != 0) {
printk("Querying DEAD? cpu %i (%i) shows %i\n",
diff --git a/arch/powerpc/platforms/pseries/offline_driver.c b/arch/powerpc/platforms/pseries/offline_driver.c
new file mode 100644
index 0000000..ca15b6b
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/offline_driver.c
@@ -0,0 +1,148 @@
+#include "offline_driver.h"
+#include <linux/cpu.h>
+#include <linux/percpu-defs.h>
+
+struct cpu_hotplug_state {
+ enum cpu_state_vals state_val;
+ const char *state_name;
+ int available;
+} pSeries_cpu_hotplug_states[] = {
+ {CPU_STATE_OFFLINE, "offline", 1},
+ {CPU_STATE_INACTIVE, "inactive", 1},
+ {CPU_STATE_ONLINE, "online", 1},
+ {CPU_MAX_HOTPLUG_STATES, "", 0},
+};
+
+static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) =
+ CPU_STATE_OFFLINE;
+static DEFINE_PER_CPU(enum cpu_state_vals, current_state) = CPU_STATE_OFFLINE;
+
+static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;
+
+enum cpu_state_vals get_cpu_current_state(int cpu)
+{
+ return per_cpu(current_state, cpu);
+}
+
+void set_cpu_current_state(int cpu, enum cpu_state_vals state)
+{
+ per_cpu(current_state, cpu) = state;
+}
+
+enum cpu_state_vals get_preferred_offline_state(int cpu)
+{
+ return per_cpu(preferred_offline_state, cpu);
+}
+
+void set_preferred_offline_state(int cpu, enum cpu_state_vals state)
+{
+ per_cpu(preferred_offline_state, cpu) = state;
+}
+
+void set_default_offline_state(int cpu)
+{
+ per_cpu(preferred_offline_state, cpu) = default_offline_state;
+}
+
+static const char *get_cpu_hotplug_state_name(enum cpu_state_vals state_val)
+{
+ return pSeries_cpu_hotplug_states[state_val].state_name;
+}
+
+static bool cpu_hotplug_state_available(enum cpu_state_vals state_val)
+{
+ return pSeries_cpu_hotplug_states[state_val].available;
+}
+
+ssize_t pSeries_read_available_states(unsigned int cpu, char *buf)
+{
+ int state;
+ ssize_t ret = 0;
+
+ for (state = CPU_STATE_OFFLINE; state < CPU_MAX_HOTPLUG_STATES;
+ state++) {
+ if (!cpu_hotplug_state_available(state))
+ continue;
+
+ if (ret >= (ssize_t) ((PAGE_SIZE / sizeof(char))
+ - (CPU_STATES_LEN + 2)))
+ goto out;
+ ret += scnprintf(&buf[ret], CPU_STATES_LEN, "%s ",
+ get_cpu_hotplug_state_name(state));
+ }
+
+out:
+ ret += sprintf(&buf[ret], "\n");
+ return ret;
+}
+
+ssize_t pSeries_read_current_state(unsigned int cpu, char *buf)
+{
+ int state = get_cpu_current_state(cpu);
+
+ return scnprintf(buf, CPU_STATES_LEN, "%s\n",
+ get_cpu_hotplug_state_name(state));
+}
+
+ssize_t pSeries_write_current_state(unsigned int cpu, const char *buf)
+{
+ int ret;
+ char state_name[CPU_STATES_LEN];
+ int i;
+ struct sys_device *dev = get_cpu_sysdev(cpu);
+ ret = sscanf(buf, "%15s", state_name);
+
+ if (ret != 1) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ for (i = CPU_STATE_OFFLINE; i < CPU_MAX_HOTPLUG_STATES; i++)
+ if (!strnicmp(state_name,
+ get_cpu_hotplug_state_name(i),
+ CPU_STATES_LEN))
+ break;
+
+ if (i == CPU_MAX_HOTPLUG_STATES) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ if (i == get_cpu_current_state(cpu)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ if (i == CPU_STATE_ONLINE) {
+ ret = cpu_up(cpu);
+ if (!ret)
+ kobject_uevent(&dev->kobj, KOBJ_ONLINE);
+ goto out_unlock;
+ }
+
+ if (get_cpu_current_state(cpu) != CPU_STATE_ONLINE) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ set_preferred_offline_state(cpu, i);
+ ret = cpu_down(cpu);
+ if (!ret)
+ kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
+
+out_unlock:
+ return ret;
+}
+
+struct cpu_offline_driver pSeries_offline_driver = {
+ .read_available_states = pSeries_read_available_states,
+ .read_current_state = pSeries_read_current_state,
+ .write_current_state = pSeries_write_current_state,
+};
+
+static int pseries_hotplug_driver_init(void)
+{
+ return register_cpu_offline_driver(&pSeries_offline_driver);
+}
+
+arch_initcall(pseries_hotplug_driver_init);
diff --git a/arch/powerpc/platforms/pseries/offline_driver.h b/arch/powerpc/platforms/pseries/offline_driver.h
new file mode 100644
index 0000000..b4674df
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/offline_driver.h
@@ -0,0 +1,20 @@
+#ifndef _OFFLINE_DRIVER_H_
+#define _OFFLINE_DRIVER_H_
+
+#define CPU_STATES_LEN 16
+
+/* Cpu offline states go here */
+enum cpu_state_vals {
+ CPU_STATE_OFFLINE,
+ CPU_STATE_INACTIVE,
+ CPU_STATE_ONLINE,
+ CPU_MAX_HOTPLUG_STATES
+};
+
+extern enum cpu_state_vals get_cpu_current_state(int cpu);
+extern void set_cpu_current_state(int cpu, enum cpu_state_vals state);
+extern enum cpu_state_vals get_preferred_offline_state(int cpu);
+extern void set_preferred_offline_state(int cpu, enum cpu_state_vals state);
+extern int start_secondary(void);
+extern void set_default_offline_state(int cpu);
+#endif
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 1f8f6cf..48f8ae5 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -48,6 +48,7 @@
#include "plpar_wrappers.h"
#include "pseries.h"
#include "xics.h"
+#include "offline_driver.h"


/*
@@ -86,6 +87,9 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
/* Fixup atomic count: it exited inside IRQ handler. */
task_thread_info(paca[lcpu].__current)->preempt_count = 0;

+ if (get_cpu_current_state(lcpu) != CPU_STATE_OFFLINE)
+ goto out;
+
/*
* If the RTAS start-cpu token does not exist then presume the
* cpu is already spinning.
@@ -100,6 +104,7 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
return 0;
}

+out:
return 1;
}

@@ -113,12 +118,15 @@ static void __devinit smp_xics_setup_cpu(int cpu)
vpa_init(cpu);

cpu_clear(cpu, of_spin_map);
+ set_cpu_current_state(cpu, CPU_STATE_ONLINE);
+ set_default_offline_state(cpu);

}
#endif /* CONFIG_XICS */

static void __devinit smp_pSeries_kick_cpu(int nr)
{
+ long rc;
BUG_ON(nr < 0 || nr >= NR_CPUS);

if (!smp_startup_cpu(nr))
@@ -130,6 +138,15 @@ static void __devinit smp_pSeries_kick_cpu(int nr)
* the processor will continue on to secondary_start
*/
paca[nr].cpu_start = 1;
+
+ set_preferred_offline_state(nr, CPU_STATE_ONLINE);
+
+ if (get_cpu_current_state(nr) != CPU_STATE_OFFLINE) {
+ rc = plpar_hcall_norets(H_PROD, nr);
+ if (rc != H_SUCCESS)
+ panic("Error: Prod to wake up processor %d Ret= %ld\n",
+ nr, rc);
+ }
}

static int smp_pSeries_cpu_bootable(unsigned int nr)

2009-09-15 12:12:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> This patchset contains the offline state driver implemented for
> pSeries. For pSeries, we define three available_hotplug_states. They are:
>
> online: The processor is online.
>
> offline: This is the the default behaviour when the cpu is offlined
> even in the absense of this driver. The CPU would call make an
> rtas_stop_self() call and hand over the CPU back to the resource pool,
> thereby effectively deallocating that vCPU from the LPAR.
> NOTE: This would result in a configuration change to the LPAR
> which is visible to the outside world.
>
> inactive: This cedes the vCPU to the hypervisor with a cede latency
> specifier value 2.
> NOTE: This option does not result in a configuration change
> and the vCPU would be still entitled to the LPAR to which it earlier
> belong to.
>
> Any feedback on the patchset will be immensely valuable.

I still think its a layering violation... its the hypervisor manager
that should be bothered in what state an off-lined cpu is in.


2009-09-15 13:21:50

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > This patchset contains the offline state driver implemented for
> > pSeries. For pSeries, we define three available_hotplug_states. They are:
> >
> > online: The processor is online.
> >
> > offline: This is the the default behaviour when the cpu is offlined
> > even in the absense of this driver. The CPU would call make an
> > rtas_stop_self() call and hand over the CPU back to the resource pool,
> > thereby effectively deallocating that vCPU from the LPAR.
> > NOTE: This would result in a configuration change to the LPAR
> > which is visible to the outside world.
> >
> > inactive: This cedes the vCPU to the hypervisor with a cede latency
> > specifier value 2.
> > NOTE: This option does not result in a configuration change
> > and the vCPU would be still entitled to the LPAR to which it earlier
> > belong to.
> >
> > Any feedback on the patchset will be immensely valuable.
>
> I still think its a layering violation... its the hypervisor manager
> that should be bothered in what state an off-lined cpu is in.

Yeah it probably is a layering violation, but when has that stopped us
before :)

Is it anticipated that this will be useful on platforms other than
pseries?

cheers



Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2009-09-15 14:44:17

by Daniel Walker

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] pSeries: cede latency specifier helper function.

On Tue, 2009-09-15 at 17:37 +0530, Gautham R Shenoy wrote:
> // Used to pass parms from the OS to PLIC for SetAsrAndRfid
> u64 saved_gpr3; // Saved GPR3 x20-x27
> u64 saved_gpr4; // Saved GPR4 x28-x2F
> - u64 saved_gpr5; // Saved GPR5 x30-x37
> + union {
> + u64 saved_gpr5; // Saved GPR5 x30-x37
> + struct {
> + u8 cede_latency_hint; // x30
> + u8 reserved[7]; // x31-x36
> + } fields;
> + } gpr5_dword;
> +
>
> u8 dtl_enable_mask; // Dispatch Trace Log mask x38-x38
> u8 donate_dedicated_cpu; // Donate dedicated CPU cycles x39-x39

Could you drop the C99 style comments (use /* */ instead)? If you run
checkpatch on this it will error on these lines..

Daniel

2009-09-15 14:58:45

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

* Peter Zijlstra <[email protected]> [2009-09-15 14:11:41]:

> On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > This patchset contains the offline state driver implemented for
> > pSeries. For pSeries, we define three available_hotplug_states. They are:
> >
> > online: The processor is online.
> >
> > offline: This is the the default behaviour when the cpu is offlined
> > even in the absense of this driver. The CPU would call make an
> > rtas_stop_self() call and hand over the CPU back to the resource pool,
> > thereby effectively deallocating that vCPU from the LPAR.
> > NOTE: This would result in a configuration change to the LPAR
> > which is visible to the outside world.
> >
> > inactive: This cedes the vCPU to the hypervisor with a cede latency
> > specifier value 2.
> > NOTE: This option does not result in a configuration change
> > and the vCPU would be still entitled to the LPAR to which it earlier
> > belong to.
> >
> > Any feedback on the patchset will be immensely valuable.
>
> I still think its a layering violation... its the hypervisor manager
> that should be bothered in what state an off-lined cpu is in.
>

>From a design standpoint where we stand today is

1. A cede indicates that the CPU is no longer needed and can be
reassigned (remember we do dedicated CPU partitions in power)
2. What this patch is trying to do is to say "We don't need the
CPU, but please don't reassign, put it to sleep"

We work the same way with the hypervisor, that applications work with
the OS today, by providing madvise and other hints.

--
Balbir

2009-09-16 07:49:17

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Tue, Sep 15, 2009 at 08:28:34PM +0530, Balbir Singh wrote:
> * Peter Zijlstra <[email protected]> [2009-09-15 14:11:41]:
>
> > On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > > This patchset contains the offline state driver implemented for
> > > pSeries. For pSeries, we define three available_hotplug_states. They are:
> > >
> > > online: The processor is online.
> > >
> > > offline: This is the the default behaviour when the cpu is offlined
> > > even in the absense of this driver. The CPU would call make an
> > > rtas_stop_self() call and hand over the CPU back to the resource pool,
> > > thereby effectively deallocating that vCPU from the LPAR.
> > > NOTE: This would result in a configuration change to the LPAR
> > > which is visible to the outside world.
> > >
> > > inactive: This cedes the vCPU to the hypervisor with a cede latency
> > > specifier value 2.
> > > NOTE: This option does not result in a configuration change
> > > and the vCPU would be still entitled to the LPAR to which it earlier
> > > belong to.
> > >
> > > Any feedback on the patchset will be immensely valuable.
> >
> > I still think its a layering violation... its the hypervisor manager
> > that should be bothered in what state an off-lined cpu is in.
> >
>
> From a design standpoint where we stand today is
>
> 1. A cede indicates that the CPU is no longer needed and can be
> reassigned (remember we do dedicated CPU partitions in power)
> 2. What this patch is trying to do is to say "We don't need the
> CPU, but please don't reassign, put it to sleep"

FWIW, this sounds exactly like the same we have already on s390.
But back then I didn't consider adding a common code infrastructure
would make sense :)

Besides the "online" attribute we have an additional "configure"
attribute to which can only be written if the cpu is offline.
Writing a "0" to it would mean that you currently won't need the cpu
anymore and the hypervisor is free to reassign the cpu to a different
LPAR.
Writing a "1" to it means you want to use it. If there are enough
resources you will get it. If not.. bad luck.

2009-09-16 15:28:29

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Tue, Sep 15, 2009 at 02:11:41PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > This patchset contains the offline state driver implemented for
> > pSeries. For pSeries, we define three available_hotplug_states. They are:
> >
> > online: The processor is online.
> >
> > offline: This is the the default behaviour when the cpu is offlined
> >
> > inactive: This cedes the vCPU to the hypervisor with a cede latency
> >
> > Any feedback on the patchset will be immensely valuable.
>
> I still think its a layering violation... its the hypervisor manager
> that should be bothered in what state an off-lined cpu is in.

The problem is that all hypervisor managers cannot figure out what sort
of latency guest OS can tolerate under the situation. They wouldn't know
from what context guest OS has ceded the vcpu. It has to have
some sort of hint, which is what the guest OS provides.

Thanks
Dipankar

2009-09-16 15:33:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Wed, 2009-09-16 at 20:58 +0530, Dipankar Sarma wrote:
> On Tue, Sep 15, 2009 at 02:11:41PM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > > This patchset contains the offline state driver implemented for
> > > pSeries. For pSeries, we define three available_hotplug_states. They are:
> > >
> > > online: The processor is online.
> > >
> > > offline: This is the the default behaviour when the cpu is offlined
> > >
> > > inactive: This cedes the vCPU to the hypervisor with a cede latency
> > >
> > > Any feedback on the patchset will be immensely valuable.
> >
> > I still think its a layering violation... its the hypervisor manager
> > that should be bothered in what state an off-lined cpu is in.
>
> The problem is that all hypervisor managers cannot figure out what sort
> of latency guest OS can tolerate under the situation. They wouldn't know
> from what context guest OS has ceded the vcpu. It has to have
> some sort of hint, which is what the guest OS provides.

I'm missing something here, hot-unplug is a slow path and should not
ever be latency critical..?

2009-09-16 16:25:10

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Wed, Sep 16, 2009 at 05:32:51PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-16 at 20:58 +0530, Dipankar Sarma wrote:
> > On Tue, Sep 15, 2009 at 02:11:41PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > > > This patchset contains the offline state driver implemented for
> > > > pSeries. For pSeries, we define three available_hotplug_states. They are:
> > > >
> > > > online: The processor is online.
> > > >
> > > > offline: This is the the default behaviour when the cpu is offlined
> > > >
> > > > inactive: This cedes the vCPU to the hypervisor with a cede latency
> > > >
> > > > Any feedback on the patchset will be immensely valuable.
> > >
> > > I still think its a layering violation... its the hypervisor manager
> > > that should be bothered in what state an off-lined cpu is in.
> >
> > The problem is that all hypervisor managers cannot figure out what sort
> > of latency guest OS can tolerate under the situation. They wouldn't know
> > from what context guest OS has ceded the vcpu. It has to have
> > some sort of hint, which is what the guest OS provides.
>
> I'm missing something here, hot-unplug is a slow path and should not
> ever be latency critical..?

You aren't, I did :)

No, for this specific case, latency isn't an issue. The issue is -
how do we cede unused vcpus to hypervisor for better energy management ?
Yes, it can be done by a hypervisor manager telling the kernel to
offline and make a bunch of vcpus "inactive". It does have to choose
offline (release vcpu) vs. inactive (cede but guranteed if needed).
The problem is that long ago we exported a lot of hotplug stuff to
userspace through the sysfs interface and we cannot do something
inside the kernel without keeping the sysfs stuff consistent.
This seems like a sane way to do that without undoing all the
virtual cpu hotplug infrastructure in different supporting archs.

Thanks
Dipankar

2009-09-16 16:35:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Wed, 2009-09-16 at 21:54 +0530, Dipankar Sarma wrote:

> No, for this specific case, latency isn't an issue. The issue is -
> how do we cede unused vcpus to hypervisor for better energy management ?
> Yes, it can be done by a hypervisor manager telling the kernel to
> offline and make a bunch of vcpus "inactive". It does have to choose
> offline (release vcpu) vs. inactive (cede but guranteed if needed).
> The problem is that long ago we exported a lot of hotplug stuff to
> userspace through the sysfs interface and we cannot do something
> inside the kernel without keeping the sysfs stuff consistent.
> This seems like a sane way to do that without undoing all the
> virtual cpu hotplug infrastructure in different supporting archs.

I'm still not getting it..

Suppose we have some guest, it booted with 4 cpus.

We then offline 2 of them.

Apparently this LPAR binds guest cpus to physical cpus?
So we use a hypervisor interface to reclaim these 2 offlined cpus and
re-assign them to some other guest.

So far so good, right?

Now if you were to try and online the cpus in the guest, it'd fail
because the cpus aren't backed anymore, and the hot-plug simply
times-out and fails.

And we're still good, right?

2009-09-16 17:03:34

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

* Peter Zijlstra <[email protected]> [2009-09-16 18:35:16]:

> On Wed, 2009-09-16 at 21:54 +0530, Dipankar Sarma wrote:
>
> > No, for this specific case, latency isn't an issue. The issue is -
> > how do we cede unused vcpus to hypervisor for better energy management ?
> > Yes, it can be done by a hypervisor manager telling the kernel to
> > offline and make a bunch of vcpus "inactive". It does have to choose
> > offline (release vcpu) vs. inactive (cede but guranteed if needed).
> > The problem is that long ago we exported a lot of hotplug stuff to
> > userspace through the sysfs interface and we cannot do something
> > inside the kernel without keeping the sysfs stuff consistent.
> > This seems like a sane way to do that without undoing all the
> > virtual cpu hotplug infrastructure in different supporting archs.
>
> I'm still not getting it..
>
> Suppose we have some guest, it booted with 4 cpus.
>
> We then offline 2 of them.
>
> Apparently this LPAR binds guest cpus to physical cpus?
> So we use a hypervisor interface to reclaim these 2 offlined cpus and
> re-assign them to some other guest.
>
> So far so good, right?
>
> Now if you were to try and online the cpus in the guest, it'd fail
> because the cpus aren't backed anymore, and the hot-plug simply
> times-out and fails.
>
> And we're still good, right?

The requirement differ here. If we had offlined 2 vCPUs for the
purpose of system reconfiguration, the expected behavior with offline
interface will work right. However the proposed cede interface is
needed when we want them to temporarily go away but still come back
when we do an online. We want the online to always succeed since the
backing physical resources are not relinquished. The proposed
interface facilitates offline without relinquishing the physical
resources assigned to LPARs.

--Vaidy

2009-09-16 17:22:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Wed, 2009-09-16 at 22:33 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <[email protected]> [2009-09-16 18:35:16]:
>
> > On Wed, 2009-09-16 at 21:54 +0530, Dipankar Sarma wrote:
> >
> > > No, for this specific case, latency isn't an issue. The issue is -
> > > how do we cede unused vcpus to hypervisor for better energy management ?
> > > Yes, it can be done by a hypervisor manager telling the kernel to
> > > offline and make a bunch of vcpus "inactive". It does have to choose
> > > offline (release vcpu) vs. inactive (cede but guranteed if needed).
> > > The problem is that long ago we exported a lot of hotplug stuff to
> > > userspace through the sysfs interface and we cannot do something
> > > inside the kernel without keeping the sysfs stuff consistent.
> > > This seems like a sane way to do that without undoing all the
> > > virtual cpu hotplug infrastructure in different supporting archs.
> >
> > I'm still not getting it..
> >
> > Suppose we have some guest, it booted with 4 cpus.
> >
> > We then offline 2 of them.
> >
> > Apparently this LPAR binds guest cpus to physical cpus?
> > So we use a hypervisor interface to reclaim these 2 offlined cpus and
> > re-assign them to some other guest.
> >
> > So far so good, right?
> >
> > Now if you were to try and online the cpus in the guest, it'd fail
> > because the cpus aren't backed anymore, and the hot-plug simply
> > times-out and fails.
> >
> > And we're still good, right?
>
> The requirement differ here. If we had offlined 2 vCPUs for the
> purpose of system reconfiguration, the expected behavior with offline
> interface will work right. However the proposed cede interface is
> needed when we want them to temporarily go away but still come back
> when we do an online. We want the online to always succeed since the
> backing physical resources are not relinquished. The proposed
> interface facilitates offline without relinquishing the physical
> resources assigned to LPARs.

Then make that the platform default and leave the lpar management to
whatever pokes at the lpar?

2009-09-16 20:18:00

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Wed, Sep 16, 2009 at 07:22:35PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-16 at 22:33 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <[email protected]> [2009-09-16 18:35:16]:
> >
> > > Now if you were to try and online the cpus in the guest, it'd fail
> > > because the cpus aren't backed anymore, and the hot-plug simply
> > > times-out and fails.
> > >
> > > And we're still good, right?
> >
> > The requirement differ here. If we had offlined 2 vCPUs for the
> > purpose of system reconfiguration, the expected behavior with offline
> > interface will work right. However the proposed cede interface is
> > needed when we want them to temporarily go away but still come back
> > when we do an online. We want the online to always succeed since the
> > backing physical resources are not relinquished. The proposed
> > interface facilitates offline without relinquishing the physical
> > resources assigned to LPARs.
>
> Then make that the platform default and leave the lpar management to
> whatever pokes at the lpar?

That could have worked - however lpar management already uses
the same sysfs interface to poke. The current semantics make the lpar
vcpu deconfig state the platform default assuming that it will be used for
lpar management. The only clean way to do this without breaking lpar
management stuff is to add another state - "inactive" and retain backward
compatibility.

Thanks
Dipankar

2009-09-24 00:54:18

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> I still think its a layering violation... its the hypervisor manager
> that should be bothered in what state an off-lined cpu is in.
>
That's not how our hypervisor works.

If you ask through the management interface, to remove a CPU from a
partition, the HV will communicate with a daemon inside the partition
that will then unplug the CPU via the right call.

I don't really understand your objections to be honest. And I fail to
see why it would be a layering violation to have the ability for the OS
to indicate in what state it wishes to relinguish a CPU to the
hypervisor, which more or less defines what is the expected latency for
getting it back later on.

Ben.

2009-09-24 00:55:10

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Wed, 2009-09-16 at 09:48 +0200, Heiko Carstens wrote:
> FWIW, this sounds exactly like the same we have already on s390.
> But back then I didn't consider adding a common code infrastructure
> would make sense :)
>
> Besides the "online" attribute we have an additional "configure"
> attribute to which can only be written if the cpu is offline.
> Writing a "0" to it would mean that you currently won't need the cpu
> anymore and the hypervisor is free to reassign the cpu to a different
> LPAR.
> Writing a "1" to it means you want to use it. If there are enough
> resources you will get it. If not.. bad luck.
>
Maybe we should use a common API for that then. And the right way to do
so is via a generically located attribute.

Ben.

2009-09-24 00:57:38

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Wed, 2009-09-16 at 21:54 +0530, Dipankar Sarma wrote:
> You aren't, I did :)
>
> No, for this specific case, latency isn't an issue. The issue is -
> how do we cede unused vcpus to hypervisor for better energy
> management ?
> Yes, it can be done by a hypervisor manager telling the kernel to
> offline and make a bunch of vcpus "inactive". It does have to choose
> offline (release vcpu) vs. inactive (cede but guranteed if needed).
> The problem is that long ago we exported a lot of hotplug stuff to
> userspace through the sysfs interface and we cannot do something
> inside the kernel without keeping the sysfs stuff consistent.
> This seems like a sane way to do that without undoing all the
> virtual cpu hotplug infrastructure in different supporting archs.
>
Well, I did bring the latency into the picture. To some extent it -is- a
latency issue. Though we aren't talking milliseconds here... if the
CPU's been reallocated to another partition we are talking seconds or
minutes or more until we can get it back :-)

In any case, this sounds to me like a perfectly valid feature to have,
which s390 already does via arch specific hooks, so I fail to see why it
wouldn't be legitimate to have a common attribute for it.

I don't buy into the layering violation. It's too often a straw man and
an excuse for a lack of a proper reason.

Cheers,
Ben.

2009-09-25 14:48:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> > I still think its a layering violation... its the hypervisor manager
> > that should be bothered in what state an off-lined cpu is in.
> >
> That's not how our hypervisor works.

Then fix it?

> If you ask through the management interface, to remove a CPU from a
> partition, the HV will communicate with a daemon inside the partition
> that will then unplug the CPU via the right call.
>
> I don't really understand your objections to be honest. And I fail to
> see why it would be a layering violation to have the ability for the OS
> to indicate in what state it wishes to relinguish a CPU to the
> hypervisor, which more or less defines what is the expected latency for
> getting it back later on.

OK, so the main objection is the abuse of CPU hotplug as resource
management feature.

CPU hotplug is terribly invasive and expensive to the kernel, doing
hotplug on a minute basis is just plain crazy.

If you want a CPU in a keep it near and don't hand it back to the HV
state, why not use cpusets to isolate it and simply not run tasks on it?

cpusets don't use stopmachine and are much nicer to the rest of the
kernel over-all.

2009-09-25 21:15:12

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Fri, 2009-09-25 at 16:48 +0200, Peter Zijlstra wrote:
> On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> > > I still think its a layering violation... its the hypervisor manager
> > > that should be bothered in what state an off-lined cpu is in.
> > >
> > That's not how our hypervisor works.
>
> Then fix it?

Are you serious ? :-)

> CPU hotplug is terribly invasive and expensive to the kernel, doing
> hotplug on a minute basis is just plain crazy.
>
> If you want a CPU in a keep it near and don't hand it back to the HV
> state, why not use cpusets to isolate it and simply not run tasks on it?
>
> cpusets don't use stopmachine and are much nicer to the rest of the
> kernel over-all.

Gautham, what is the different in term of power saving between having
it idle for long periods of time (which could do H_CEDE and with NO_HZ,
probably wouln't need to wake up that often) and having it unplugged in
a H_CEDE loop ?

Ben.


2009-09-26 09:55:29

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

On Tue 2009-09-15 14:11:41, Peter Zijlstra wrote:
> On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > This patchset contains the offline state driver implemented for
> > pSeries. For pSeries, we define three available_hotplug_states. They are:
> >
> > online: The processor is online.
> >
> > offline: This is the the default behaviour when the cpu is offlined
> > even in the absense of this driver. The CPU would call make an
> > rtas_stop_self() call and hand over the CPU back to the resource pool,
> > thereby effectively deallocating that vCPU from the LPAR.
> > NOTE: This would result in a configuration change to the LPAR
> > which is visible to the outside world.
> >
> > inactive: This cedes the vCPU to the hypervisor with a cede latency
> > specifier value 2.
> > NOTE: This option does not result in a configuration change
> > and the vCPU would be still entitled to the LPAR to which it earlier
> > belong to.
> >
> > Any feedback on the patchset will be immensely valuable.
>
> I still think its a layering violation... its the hypervisor manager
> that should be bothered in what state an off-lined cpu is in.

Agreed. Proposed interface is ugly.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-28 13:51:15

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

* Peter Zijlstra <[email protected]> [2009-09-25 16:48:40]:

> On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> > > I still think its a layering violation... its the hypervisor manager
> > > that should be bothered in what state an off-lined cpu is in.
> > >
> > That's not how our hypervisor works.
>
> Then fix it?
>
> > If you ask through the management interface, to remove a CPU from a
> > partition, the HV will communicate with a daemon inside the partition
> > that will then unplug the CPU via the right call.
> >
> > I don't really understand your objections to be honest. And I fail to
> > see why it would be a layering violation to have the ability for the OS
> > to indicate in what state it wishes to relinguish a CPU to the
> > hypervisor, which more or less defines what is the expected latency for
> > getting it back later on.
>
> OK, so the main objection is the abuse of CPU hotplug as resource
> management feature.
>
> CPU hotplug is terribly invasive and expensive to the kernel, doing
> hotplug on a minute basis is just plain crazy.
>
> If you want a CPU in a keep it near and don't hand it back to the HV
> state, why not use cpusets to isolate it and simply not run tasks on it?
>
> cpusets don't use stopmachine and are much nicer to the rest of the
> kernel over-all.

Hi Peter,

This interface is not expected to be used every minute or so to impact
the operation of the rest of the system. Cpuhotplug is currently used
as a resource management feature in virtualised system using dlpar
operations. I do understand that cpu hotplug is invasive at run time
and that amount of complexity is required to carefully isolate the cpu
from any involvement in the running kernel.

Building another interface to isolate the cpus to the same extent as
cpu hotplug does today would be redundant and is going to be equally
invasive. Alternatives like cpuset for isolation migrates only tasks.

--Vaidy

2009-09-28 13:53:59

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework

* Benjamin Herrenschmidt <[email protected]> [2009-09-26 07:12:48]:

> On Fri, 2009-09-25 at 16:48 +0200, Peter Zijlstra wrote:
> > On Thu, 2009-09-24 at 10:51 +1000, Benjamin Herrenschmidt wrote:
> > > On Tue, 2009-09-15 at 14:11 +0200, Peter Zijlstra wrote:
> > > > I still think its a layering violation... its the hypervisor manager
> > > > that should be bothered in what state an off-lined cpu is in.
> > > >
> > > That's not how our hypervisor works.
> >
> > Then fix it?
>
> Are you serious ? :-)
>
> > CPU hotplug is terribly invasive and expensive to the kernel, doing
> > hotplug on a minute basis is just plain crazy.
> >
> > If you want a CPU in a keep it near and don't hand it back to the HV
> > state, why not use cpusets to isolate it and simply not run tasks on it?
> >
> > cpusets don't use stopmachine and are much nicer to the rest of the
> > kernel over-all.
>
> Gautham, what is the different in term of power saving between having
> it idle for long periods of time (which could do H_CEDE and with NO_HZ,
> probably wouln't need to wake up that often) and having it unplugged in
> a H_CEDE loop ?

Hi Ben,

A cede latency specifier value indicating latency expectation of the
guest OS can be established in the VPA to inform the hypervisor during
the H_CEDE call. Currently, we do call H_CEDE during NO_HZ for
efficient idle. However, higher cede latency values may not be
suitable for idle CPUs in the kernel and instead more energy savings
may result from exploiting this feature through CPU hotplug
interface.

--Vaidy

2009-09-30 17:32:23

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] cpu: Offline state Framework.

On Tue, 15 Sep 2009 17:37:06 +0530 Gautham R Shenoy wrote:

> Signed-off-by: Gautham R Shenoy <[email protected]>
> ---
> Documentation/cpu-hotplug.txt | 22 +++++
> drivers/base/cpu.c | 181 +++++++++++++++++++++++++++++++++++++++--
> include/linux/cpu.h | 10 ++
> 3 files changed, 204 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
> index 9d620c1..dcec06d 100644
> --- a/Documentation/cpu-hotplug.txt
> +++ b/Documentation/cpu-hotplug.txt
> @@ -115,6 +115,28 @@ Just remember the critical section cannot call any
> function that can sleep or schedule this process away. The preempt_disable()
> will work as long as stop_machine_run() is used to take a cpu down.
>
> +CPU-offline states
> +--------------------------------------
> +On architectures which allow the more than one valid state when

^drop "the"

> +the CPU goes offline, the system administrator can decide
> +the state the CPU needs to go to when it is offlined.

s/needs to/should/

> +
> +If the architecture has implemented a cpu-offline driver exposing these
> +multiple offline states, the system administrator can use the following sysfs
> +interfaces to query the available hotplug states and also query and set the
> +current hotplug state for a given cpu:
> +
> +To query the hotplug states, on needs to perform a read on:

one

> +/sys/devices/system/cpu/cpu<number>/available_hotplug_states
> +
> +To query or set the current state for a particular CPU,
> +one needs to use the sysfs interface
> +
> +/sys/devices/system/cpu/cpu<number>/current_hotplug_state
> +
> +Writes to the "online" sysfs files are serialized against the writes to the
> +"current_hotplug_state" file.
> +
> CPU Hotplug - Frequently Asked Questions.
>
> Q: How to enable my kernel to support CPU hotplug?


---
~Randy