2012-06-04 18:29:58

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

From: Fenghua Yu <[email protected]>

Since an offline CPU is in mwait, or in hlt if the mwait feature is not
available, it can be woken up by writing to the monitored memory range or via
nmi.

Compared to the current INIT, INIT, STARTUP wakeup sequence, waking up an
offline CPU via mwait or nmi is faster. This is especially useful when CPUs
are offlined for power saving and a shorter wakeup time is desired. On one
tested desktop machine, the wakeup time via mwait or nmi is reduced to 23% of
the wakeup time via INIT. Wakeup time is measured from the beginning of
store_online() to the beginning of cpu_idle() after the CPU is woken up.

Waking up an offline CPU via mwait or nmi is also useful for supporting BSP
offline/online, because an offline BSP cannot be woken up by the INIT
sequence. The BSP offline/online patchset will be sent out separately.
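
Conceptually, the mwait path is a handshake on a per-cpu flag: the dead CPU
arms a hardware address monitor on the flag and mwaits, and the waking CPU
simply stores to it. A simplified sketch of the loop from patch 3/6 (details
such as the eax C-state hint computation omitted):

	/* Parked CPU, simplified from mwait_play_dead() in patch 3/6: */
	while (1) {
		clflush(cpu_dead_ptr);		/* Xeon 7400 erratum workaround */
		__monitor(cpu_dead_ptr, 0, 0);	/* arm the address monitor */
		mb();
		if (!(*cpu_dead_ptr & CPU_DEAD_TRIGGER))
			__mwait(eax, 0);	/* sleep until a monitored write */
		if (*cpu_dead_ptr & CPU_DEAD_TRIGGER)
			start_cpu();		/* woken: re-enter the kernel */
	}

	/* Waking CPU, wakeup_secondary_cpu_via_mwait() in patch 3/6: */
	per_cpu(cpu_dead, cpu) |= CPU_DEAD_TRIGGER;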

Fenghua Yu (6):
x86/Documentation/kernel-parameters.txt: Add wakeup_cpu_via_init
kernel parameter help
x86/head_32.S/head_64.S: Kernel entry code after waking up offline
CPU via mwait or nmi
x86/smpboot.c: Wake up offline CPU via mwait or nmi
x86/apic_flat_64.c: Wakeup function in apic calls mwait or nmi method
x86/x2apic_cluster.c: Wakeup function in x2apic_cluster calls mwait
or nmi method
x86/x2apic_phys.c: Wakeup function in x2apic_phys calls mwait or nmi
method

Documentation/kernel-parameters.txt | 3 +
arch/x86/include/asm/apic.h | 5 +-
arch/x86/include/asm/cpu.h | 1 +
arch/x86/kernel/apic/apic_flat_64.c | 2 +
arch/x86/kernel/apic/x2apic_cluster.c | 1 +
arch/x86/kernel/apic/x2apic_phys.c | 1 +
arch/x86/kernel/head_32.S | 12 ++
arch/x86/kernel/head_64.S | 14 +++
arch/x86/kernel/smpboot.c | 187 ++++++++++++++++++++++++++++-----
9 files changed, 198 insertions(+), 28 deletions(-)


2012-06-04 18:30:00

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH 1/6] x86/Documentation/kernel-parameters.txt: Add wakeup_cpu_via_init kernel parameter help

From: Fenghua Yu <[email protected]>

The new kernel parameter wakeup_cpu_via_init overrides the mwait or nmi method
and forces the kernel to wake up an offline CPU via the INIT, INIT, STARTUP
sequence. This is useful for RAS, where a transient CPU error can be fixed by
INIT.

Signed-off-by: Fenghua Yu <[email protected]>
---
Documentation/kernel-parameters.txt | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 2b17e82..8efae55 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3099,6 +3099,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]

+ wakeup_cpu_via_init [X86]
+ Wake up an offline CPU via the INIT, INIT, STARTUP sequence.
+
______________________________________________________________________

TODO:
--
1.6.0.3

2012-06-04 18:30:45

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH 4/6] x86/apic_flat_64.c: Wakeup function in apic calls mwait or nmi method

From: Fenghua Yu <[email protected]>

The wakeup_secondary_cpu callback in both apic_flat and apic_physflat is set
to wakeup_secondary_cpu_via_soft().

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/kernel/apic/apic_flat_64.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index 0e881c4..dd79ab7 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -219,6 +219,7 @@ static struct apic apic_flat = {
.send_IPI_all = flat_send_IPI_all,
.send_IPI_self = apic_send_IPI_self,

+ .wakeup_secondary_cpu = wakeup_secondary_cpu_via_soft,
.trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
.trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
.wait_for_init_deassert = NULL,
@@ -379,6 +380,7 @@ static struct apic apic_physflat = {
.send_IPI_all = physflat_send_IPI_all,
.send_IPI_self = apic_send_IPI_self,

+ .wakeup_secondary_cpu = wakeup_secondary_cpu_via_soft,
.trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
.trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
.wait_for_init_deassert = NULL,
--
1.6.0.3

2012-06-04 18:30:37

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH 6/6] x86/x2apic_phys.c: Wakeup function in x2apic_phys calls mwait or nmi method

From: Fenghua Yu <[email protected]>

The wakeup_secondary_cpu callback in apic_x2apic_phys is set to
wakeup_secondary_cpu_via_soft().

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/kernel/apic/x2apic_phys.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index c17e982..64d9804 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -164,6 +164,7 @@ static struct apic apic_x2apic_phys = {
.send_IPI_all = x2apic_send_IPI_all,
.send_IPI_self = x2apic_send_IPI_self,

+ .wakeup_secondary_cpu = wakeup_secondary_cpu_via_soft,
.trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
.trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
.wait_for_init_deassert = NULL,
--
1.6.0.3

2012-06-04 18:31:25

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH 3/6] x86/smpboot.c: Wake up offline CPU via mwait or nmi

From: Fenghua Yu <[email protected]>

wakeup_secondary_cpu_via_soft() is defined to wake up an offline CPU via mwait
if the CPU is in mwait, or via nmi if the CPU is in hlt.

A CPU still boots up via the INIT, INIT, STARTUP sequence when it comes up for
the first time, whether at boot time or during hot plug.

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/apic.h | 5 +-
arch/x86/kernel/smpboot.c | 187 ++++++++++++++++++++++++++++++++++++------
2 files changed, 164 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index eaff479..cad00b1 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -425,7 +425,10 @@ extern struct apic *__apicdrivers[], *__apicdrivers_end[];
#ifdef CONFIG_SMP
extern atomic_t init_deasserted;
extern int wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip);
-#endif
+extern int wakeup_secondary_cpu_via_soft(int apicid, unsigned long start_eip);
+#else /* CONFIG_SMP */
+#define wakeup_secondary_cpu_via_soft NULL
+#endif /* CONFIG_SMP */

#ifdef CONFIG_X86_LOCAL_APIC

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fd019d7..109df30 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -472,13 +472,8 @@ void __inquire_remote_apic(int apicid)
}
}

-/*
- * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
- * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
- * won't ... remember to clear down the APIC, etc later.
- */
-int __cpuinit
-wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
+static int __cpuinit
+_wakeup_secondary_cpu_via_nmi(int apicid, int dest_mode)
{
unsigned long send_status, accept_status = 0;
int maxlvt;
@@ -486,7 +481,7 @@ wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
/* Target chip */
/* Boot on the stack */
/* Kick the second */
- apic_icr_write(APIC_DM_NMI | apic->dest_logical, logical_apicid);
+ apic_icr_write(APIC_DM_NMI | dest_mode, apicid);

pr_debug("Waiting for send to finish...\n");
send_status = safe_apic_wait_icr_idle();
@@ -511,6 +506,47 @@ wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
return (send_status | accept_status);
}

+/*
+ * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
+ * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
+ * won't ... remember to clear down the APIC, etc later.
+ */
+int __cpuinit
+wakeup_secondary_cpu_via_nmi_phys(int phys_apicid, unsigned long start_eip)
+{
+ return _wakeup_secondary_cpu_via_nmi(phys_apicid, APIC_DEST_PHYSICAL);
+}
+
+int __cpuinit
+wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
+{
+ return _wakeup_secondary_cpu_via_nmi(logical_apicid, APIC_DEST_LOGICAL);
+}
+
+DEFINE_PER_CPU(int, cpu_dead) = { 0 };
+#define CPU_DEAD_TRIGGER 1
+#define CPU_DEAD_MWAIT 2
+#define CPU_DEAD_HLT 4
+
+static int wakeup_secondary_cpu_via_mwait(int cpu)
+{
+ per_cpu(cpu_dead, cpu) |= CPU_DEAD_TRIGGER;
+ return 0;
+}
+
+static int wakeup_cpu_nmi(unsigned int cmd, struct pt_regs *regs)
+{
+ int cpu = smp_processor_id();
+ int *cpu_dead_ptr;
+
+ cpu_dead_ptr = &per_cpu(cpu_dead, cpu);
+ if (!cpu_online(cpu) && (*cpu_dead_ptr & CPU_DEAD_HLT) &&
+ (*cpu_dead_ptr & CPU_DEAD_TRIGGER))
+ return NMI_HANDLED;
+
+ return NMI_DONE;
+}
+
static int __cpuinit
wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
{
@@ -626,6 +662,52 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
return (send_status | accept_status);
}

+/*
+ * Kick a cpu.
+ *
+ * If the CPU is in mwait, wake it up via the mwait method. Otherwise, if the
+ * CPU is in halt, wake it up via NMI. If neither applies, wake it up via the
+ * INIT boot APIC message.
+ *
+ * When the CPU boots up for the first time, i.e. cpu_dead is 0, it's woken up
+ * via the INIT boot APIC message.
+ *
+ * At this point, the CPU should be in a fixed dead state, so we don't need to
+ * consider race conditions here.
+ */
+int __cpuinit
+wakeup_secondary_cpu_via_soft(int apicid, unsigned long start_eip)
+{
+ int cpu;
+ int boot_error = 0;
+ /* start_ip had better be page-aligned! */
+ unsigned long start_ip = real_mode_header->trampoline_start;
+
+ for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+ if (apicid == apic->cpu_present_to_apicid(cpu))
+ break;
+
+ if (cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ if (per_cpu(cpu_dead, cpu) & CPU_DEAD_MWAIT) {
+ boot_error = wakeup_secondary_cpu_via_mwait(cpu);
+ } else if (per_cpu(cpu_dead, cpu) & CPU_DEAD_HLT) {
+ int *cpu_dead_ptr;
+
+ cpu_dead_ptr = &per_cpu(cpu_dead, cpu);
+ *cpu_dead_ptr |= CPU_DEAD_TRIGGER;
+
+ boot_error = wakeup_secondary_cpu_via_nmi_phys(apicid,
+ start_ip);
+ if (boot_error)
+ *cpu_dead_ptr &= ~CPU_DEAD_TRIGGER;
+ } else
+ boot_error = wakeup_secondary_cpu_via_init(apicid, start_ip);
+
+ return boot_error;
+}
+
/* reduce the number of lines printed when booting a large cpu count system */
static void __cpuinit announce_cpu(int cpu, int apicid)
{
@@ -778,6 +860,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
*/
smpboot_restore_warm_reset_vector();
}
+
return boot_error;
}

@@ -977,6 +1060,20 @@ static void __init smp_cpu_index_default(void)
}
}

+static bool mwait_supported(void)
+{
+ struct cpuinfo_x86 *c = __this_cpu_ptr(&cpu_info);
+
+ if (!(this_cpu_has(X86_FEATURE_MWAIT) && mwait_usable(c)))
+ return false;
+ if (!this_cpu_has(X86_FEATURE_CLFLSH))
+ return false;
+ if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
+ return false;
+
+ return true;
+}
+
/*
* Prepare for SMP bootup. The MP table or ACPI has been read
* earlier. Just do some sanity checking here and enable APIC mode.
@@ -1051,6 +1148,11 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
uv_system_init();

set_mtrr_aps_delayed_init();
+
+#ifdef CONFIG_HOTPLUG_CPU
+ if (!mwait_supported())
+ register_nmi_handler(NMI_LOCAL, wakeup_cpu_nmi, 0, "wake_cpu");
+#endif
out:
preempt_enable();
}
@@ -1111,6 +1213,12 @@ static int __init _setup_possible_cpus(char *str)
}
early_param("possible_cpus", _setup_possible_cpus);

+static int __init setup_wakeup_cpu_via_init(char *str)
+{
+ apic->wakeup_secondary_cpu = NULL;
+ return 0;
+}
+__setup("wakeup_cpu_via_init", setup_wakeup_cpu_via_init);

/*
* cpu_possible_mask should be static, it cannot change as cpu's
@@ -1286,6 +1394,28 @@ void play_dead_common(void)
local_irq_disable();
}

+static bool wakeup_cpu(int *trigger)
+{
+ unsigned int timeout;
+
+ /*
+ * Wait up to 1 second to check if the CPU wakeup trigger is set in
+ * cpu_dead by either a memory write or an NMI.
+ * If there is no CPU wakeup trigger, go back to sleep.
+ */
+ for (timeout = 0; timeout < 1000000; timeout++) {
+ /*
+ * Check if CPU0 wakeup NMI is issued and handled.
+ */
+ if (*trigger & CPU_DEAD_TRIGGER)
+ return true;
+
+ udelay(1);
+ }
+
+ return false;
+}
+
/*
* We need to flush the caches before going to sleep, lest we have
* dirty data in our caches when we come back up.
@@ -1296,14 +1426,9 @@ static inline void mwait_play_dead(void)
unsigned int highest_cstate = 0;
unsigned int highest_subcstate = 0;
int i;
- void *mwait_ptr;
- struct cpuinfo_x86 *c = __this_cpu_ptr(&cpu_info);
+ int *cpu_dead_ptr;

- if (!(this_cpu_has(X86_FEATURE_MWAIT) && mwait_usable(c)))
- return;
- if (!this_cpu_has(X86_FEATURE_CLFLSH))
- return;
- if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
+ if (!mwait_supported())
return;

eax = CPUID_MWAIT_LEAF;
@@ -1328,16 +1453,10 @@ static inline void mwait_play_dead(void)
(highest_subcstate - 1);
}

- /*
- * This should be a memory location in a cache line which is
- * unlikely to be touched by other processors. The actual
- * content is immaterial as it is not actually modified in any way.
- */
- mwait_ptr = &current_thread_info()->flags;
-
- wbinvd();
-
+ cpu_dead_ptr = &per_cpu(cpu_dead, smp_processor_id());
+ *cpu_dead_ptr = CPU_DEAD_MWAIT;
while (1) {
+ *cpu_dead_ptr &= ~CPU_DEAD_TRIGGER;
/*
* The CLFLUSH is a workaround for erratum AAI65 for
* the Xeon 7400 series. It's not clear it is actually
@@ -1345,20 +1464,34 @@ static inline void mwait_play_dead(void)
* The WBINVD is insufficient due to the spurious-wakeup
* case where we return around the loop.
*/
- clflush(mwait_ptr);
- __monitor(mwait_ptr, 0, 0);
+ wbinvd();
+ clflush(cpu_dead_ptr);
+ __monitor(cpu_dead_ptr, 0, 0);
mb();
- __mwait(eax, 0);
+ if ((*cpu_dead_ptr & CPU_DEAD_TRIGGER) == 0)
+ __mwait(eax, 0);
+
+ /* Woken up by another CPU. */
+ if (wakeup_cpu(cpu_dead_ptr))
+ start_cpu();
}
}

static inline void hlt_play_dead(void)
{
+ int *cpu_dead_ptr;
+
if (__this_cpu_read(cpu_info.x86) >= 4)
wbinvd();

+ cpu_dead_ptr = &per_cpu(cpu_dead, smp_processor_id());
+ *cpu_dead_ptr = CPU_DEAD_HLT;
while (1) {
+ *cpu_dead_ptr &= ~CPU_DEAD_TRIGGER;
native_halt();
+ /* If an NMI wants to wake me up, I'll start. */
+ if (wakeup_cpu(cpu_dead_ptr))
+ start_cpu();
}
}

--
1.6.0.3

2012-06-04 18:31:23

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH 5/6] x86/x2apic_cluster.c: Wakeup function in x2apic_cluster calls mwait or nmi method

From: Fenghua Yu <[email protected]>

The wakeup_secondary_cpu callback in apic_x2apic_cluster is set to
wakeup_secondary_cpu_via_soft().

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/kernel/apic/x2apic_cluster.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index ff35cff..c3b8fcf 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -252,6 +252,7 @@ static struct apic apic_x2apic_cluster = {
.send_IPI_all = x2apic_send_IPI_all,
.send_IPI_self = x2apic_send_IPI_self,

+ .wakeup_secondary_cpu = wakeup_secondary_cpu_via_soft,
.trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
.trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
.wait_for_init_deassert = NULL,
--
1.6.0.3

2012-06-04 18:31:51

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH 2/6] x86/head_32.S/head_64.S: Kernel entry code after waking up offline CPU via mwait or nmi

From: Fenghua Yu <[email protected]>

start_cpu() is the CPU entry point after the CPU is woken up via mwait or nmi.
It's called from play_dead(). Everything has been set up already except the
stack, so we just set up the stack here and then call start_secondary().

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/cpu.h | 1 +
arch/x86/kernel/head_32.S | 12 ++++++++++++
arch/x86/kernel/head_64.S | 14 ++++++++++++++
3 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 4564c8e..c2ad71c 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -28,6 +28,7 @@ struct x86_cpu {
#ifdef CONFIG_HOTPLUG_CPU
extern int arch_register_cpu(int num);
extern void arch_unregister_cpu(int);
+extern void __cpuinit start_cpu(void);
#endif

DECLARE_PER_CPU(int, cpu_state);
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index d42ab17..55981a7 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -266,6 +266,18 @@ num_subarch_entries = (. - subarch_entries) / 4
jmp default_entry
#endif /* CONFIG_PARAVIRT */

+#ifdef CONFIG_HOTPLUG_CPU
+/*
+ * Boot CPU entry point. It's called from play_dead(). Everything has been set
+ * up already except the stack. We just set up the stack here, then call
+ * start_secondary().
+ */
+ENTRY(start_cpu)
+ movl stack_start, %ecx
+ movl %ecx, %esp
+ jmp *(initial_code)
+#endif
+
/*
* Non-boot CPU entry point; entered from trampoline.S
* We can't lgdt here, because lgdt itself uses a data segment, but
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 94bf9cc..84a8e1d 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -251,6 +251,20 @@ ENTRY(secondary_startup_64)
pushq $__KERNEL_CS # set correct cs
pushq %rax # target address in negative space
lretq
+#ifdef CONFIG_HOTPLUG_CPU
+/*
+ * Boot CPU entry point. It's called from play_dead(). Everything has been set
+ * up already except the stack. We just set up the stack here, then call
+ * start_secondary().
+ */
+ENTRY(start_cpu)
+ movq stack_start(%rip),%rsp
+ movq initial_code(%rip),%rax
+ pushq $0 # fake return address to stop unwinder
+ pushq $__KERNEL_CS # set correct cs
+ pushq %rax # target address in negative space
+ lretq
+#endif

/* SMP bootup changes these two */
__REFDATA
--
1.6.0.3

2012-06-04 18:59:33

by Suresh Siddha

[permalink] [raw]
Subject: Re: [PATCH 3/6] x86/smpboot.c: Wake up offline CPU via mwait or nmi

On Mon, 2012-06-04 at 11:17 -0700, Fenghua Yu wrote:
> From: Fenghua Yu <[email protected]>
>
> wakeup_secondary_cpu_via_soft() is defined to wake up an offline CPU via mwait
> if the CPU is in mwait, or via nmi if the CPU is in hlt.
>
> A CPU still boots up via the INIT, INIT, STARTUP sequence when it comes up for
> the first time, whether at boot time or during hot plug.

I think this breaks suspend/resume as the cpu state gets lost.

Have you tried suspend/resume?

>
> Signed-off-by: Fenghua Yu <[email protected]>
> ---
> arch/x86/include/asm/apic.h | 5 +-
> arch/x86/kernel/smpboot.c | 187 ++++++++++++++++++++++++++++++++++++------
> 2 files changed, 164 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index eaff479..cad00b1 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -425,7 +425,10 @@ extern struct apic *__apicdrivers[], *__apicdrivers_end[];
> #ifdef CONFIG_SMP
> extern atomic_t init_deasserted;
> extern int wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip);
> -#endif
> +extern int wakeup_secondary_cpu_via_soft(int apicid, unsigned long start_eip);
> +#else /* CONFIG_SMP */
> +#define wakeup_secondary_cpu_via_soft NULL
> +#endif /* CONFIG_SMP */
>
> #ifdef CONFIG_X86_LOCAL_APIC
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index fd019d7..109df30 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -472,13 +472,8 @@ void __inquire_remote_apic(int apicid)
> }
> }
>
> -/*
> - * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
> - * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
> - * won't ... remember to clear down the APIC, etc later.
> - */
> -int __cpuinit
> -wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
> +static int __cpuinit
> +_wakeup_secondary_cpu_via_nmi(int apicid, int dest_mode)
> {
> unsigned long send_status, accept_status = 0;
> int maxlvt;
> @@ -486,7 +481,7 @@ wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
> /* Target chip */
> /* Boot on the stack */
> /* Kick the second */
> - apic_icr_write(APIC_DM_NMI | apic->dest_logical, logical_apicid);
> + apic_icr_write(APIC_DM_NMI | dest_mode, apicid);
>
> pr_debug("Waiting for send to finish...\n");
> send_status = safe_apic_wait_icr_idle();
> @@ -511,6 +506,47 @@ wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
> return (send_status | accept_status);
> }
>
> +/*
> + * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
> + * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
> + * won't ... remember to clear down the APIC, etc later.
> + */
> +int __cpuinit
> +wakeup_secondary_cpu_via_nmi_phys(int phys_apicid, unsigned long start_eip)
> +{
> + return _wakeup_secondary_cpu_via_nmi(phys_apicid, APIC_DEST_PHYSICAL);
> +}
> +
> +int __cpuinit
> +wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
> +{
> + return _wakeup_secondary_cpu_via_nmi(logical_apicid, APIC_DEST_LOGICAL);
> +}
> +
> +DEFINE_PER_CPU(int, cpu_dead) = { 0 };
> +#define CPU_DEAD_TRIGGER 1
> +#define CPU_DEAD_MWAIT 2
> +#define CPU_DEAD_HLT 4
> +
> +static int wakeup_secondary_cpu_via_mwait(int cpu)
> +{
> + per_cpu(cpu_dead, cpu) |= CPU_DEAD_TRIGGER;
> + return 0;
> +}
> +
> +static int wakeup_cpu_nmi(unsigned int cmd, struct pt_regs *regs)
> +{
> + int cpu = smp_processor_id();
> + int *cpu_dead_ptr;
> +
> + cpu_dead_ptr = &per_cpu(cpu_dead, cpu);
> + if (!cpu_online(cpu) && (*cpu_dead_ptr & CPU_DEAD_HLT) &&
> + (*cpu_dead_ptr & CPU_DEAD_TRIGGER))
> + return NMI_HANDLED;
> +
> + return NMI_DONE;
> +}
> +
> static int __cpuinit
> wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
> {
> @@ -626,6 +662,52 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
> return (send_status | accept_status);
> }
>
> +/*
> + * Kick a cpu.
> + *
> + * If the CPU is in mwait, wake it up via the mwait method. Otherwise, if the
> + * CPU is in halt, wake it up via NMI. If neither applies, wake it up via the
> + * INIT boot APIC message.
> + *
> + * When the CPU boots up for the first time, i.e. cpu_dead is 0, it's woken up
> + * via the INIT boot APIC message.
> + *
> + * At this point, the CPU should be in a fixed dead state, so we don't need to
> + * consider race conditions here.
> + */
> +int __cpuinit
> +wakeup_secondary_cpu_via_soft(int apicid, unsigned long start_eip)
> +{
> + int cpu;
> + int boot_error = 0;
> + /* start_ip had better be page-aligned! */
> + unsigned long start_ip = real_mode_header->trampoline_start;
> +
> + for (cpu = 0; cpu < nr_cpu_ids; cpu++)
> + if (apicid == apic->cpu_present_to_apicid(cpu))
> + break;
> +
> + if (cpu >= nr_cpu_ids)
> + return -EINVAL;
> +
> + if (per_cpu(cpu_dead, cpu) & CPU_DEAD_MWAIT) {
> + boot_error = wakeup_secondary_cpu_via_mwait(cpu);
> + } else if (per_cpu(cpu_dead, cpu) & CPU_DEAD_HLT) {
> + int *cpu_dead_ptr;
> +
> + cpu_dead_ptr = &per_cpu(cpu_dead, cpu);
> + *cpu_dead_ptr |= CPU_DEAD_TRIGGER;
> +
> + boot_error = wakeup_secondary_cpu_via_nmi_phys(apicid,
> + start_ip);
> + if (boot_error)
> + *cpu_dead_ptr &= ~CPU_DEAD_TRIGGER;
> + } else
> + boot_error = wakeup_secondary_cpu_via_init(apicid, start_ip);
> +
> + return boot_error;
> +}
> +
> /* reduce the number of lines printed when booting a large cpu count system */
> static void __cpuinit announce_cpu(int cpu, int apicid)
> {
> @@ -778,6 +860,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
> */
> smpboot_restore_warm_reset_vector();
> }
> +
> return boot_error;
> }
>
> @@ -977,6 +1060,20 @@ static void __init smp_cpu_index_default(void)
> }
> }
>
> +static bool mwait_supported(void)
> +{
> + struct cpuinfo_x86 *c = __this_cpu_ptr(&cpu_info);
> +
> + if (!(this_cpu_has(X86_FEATURE_MWAIT) && mwait_usable(c)))
> + return false;
> + if (!this_cpu_has(X86_FEATURE_CLFLSH))
> + return false;
> + if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
> + return false;
> +
> + return true;
> +}
> +
> /*
> * Prepare for SMP bootup. The MP table or ACPI has been read
> * earlier. Just do some sanity checking here and enable APIC mode.
> @@ -1051,6 +1148,11 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
> uv_system_init();
>
> set_mtrr_aps_delayed_init();
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> + if (!mwait_supported())
> + register_nmi_handler(NMI_LOCAL, wakeup_cpu_nmi, 0, "wake_cpu");
> +#endif
> out:
> preempt_enable();
> }
> @@ -1111,6 +1213,12 @@ static int __init _setup_possible_cpus(char *str)
> }
> early_param("possible_cpus", _setup_possible_cpus);
>
> +static int __init setup_wakeup_cpu_via_init(char *str)
> +{
> + apic->wakeup_secondary_cpu = NULL;
> + return 0;
> +}
> +__setup("wakeup_cpu_via_init", setup_wakeup_cpu_via_init);
>
> /*
> * cpu_possible_mask should be static, it cannot change as cpu's
> @@ -1286,6 +1394,28 @@ void play_dead_common(void)
> local_irq_disable();
> }
>
> +static bool wakeup_cpu(int *trigger)
> +{
> + unsigned int timeout;
> +
> + /*
> + * Wait up to 1 second to check if the CPU wakeup trigger is set in
> + * cpu_dead by either a memory write or an NMI.
> + * If there is no CPU wakeup trigger, go back to sleep.
> + */
> + for (timeout = 0; timeout < 1000000; timeout++) {
> + /*
> + * Check if CPU0 wakeup NMI is issued and handled.
> + */
> + if (*trigger & CPU_DEAD_TRIGGER)
> + return true;
> +
> + udelay(1);
> + }
> +
> + return false;
> +}
> +
> /*
> * We need to flush the caches before going to sleep, lest we have
> * dirty data in our caches when we come back up.
> @@ -1296,14 +1426,9 @@ static inline void mwait_play_dead(void)
> unsigned int highest_cstate = 0;
> unsigned int highest_subcstate = 0;
> int i;
> - void *mwait_ptr;
> - struct cpuinfo_x86 *c = __this_cpu_ptr(&cpu_info);
> + int *cpu_dead_ptr;
>
> - if (!(this_cpu_has(X86_FEATURE_MWAIT) && mwait_usable(c)))
> - return;
> - if (!this_cpu_has(X86_FEATURE_CLFLSH))
> - return;
> - if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
> + if (!mwait_supported())
> return;
>
> eax = CPUID_MWAIT_LEAF;
> @@ -1328,16 +1453,10 @@ static inline void mwait_play_dead(void)
> (highest_subcstate - 1);
> }
>
> - /*
> - * This should be a memory location in a cache line which is
> - * unlikely to be touched by other processors. The actual
> - * content is immaterial as it is not actually modified in any way.
> - */
> - mwait_ptr = &current_thread_info()->flags;
> -
> - wbinvd();
> -
> + cpu_dead_ptr = &per_cpu(cpu_dead, smp_processor_id());
> + *cpu_dead_ptr = CPU_DEAD_MWAIT;
> while (1) {
> + *cpu_dead_ptr &= ~CPU_DEAD_TRIGGER;
> /*
> * The CLFLUSH is a workaround for erratum AAI65 for
> * the Xeon 7400 series. It's not clear it is actually
> @@ -1345,20 +1464,34 @@ static inline void mwait_play_dead(void)
> * The WBINVD is insufficient due to the spurious-wakeup
> * case where we return around the loop.
> */
> - clflush(mwait_ptr);
> - __monitor(mwait_ptr, 0, 0);
> + wbinvd();
> + clflush(cpu_dead_ptr);
> + __monitor(cpu_dead_ptr, 0, 0);
> mb();
> - __mwait(eax, 0);
> + if ((*cpu_dead_ptr & CPU_DEAD_TRIGGER) == 0)
> + __mwait(eax, 0);
> +
> + /* Woken up by another CPU. */
> + if (wakeup_cpu(cpu_dead_ptr))
> + start_cpu();
> }
> }
>
> static inline void hlt_play_dead(void)
> {
> + int *cpu_dead_ptr;
> +
> if (__this_cpu_read(cpu_info.x86) >= 4)
> wbinvd();
>
> + cpu_dead_ptr = &per_cpu(cpu_dead, smp_processor_id());
> + *cpu_dead_ptr = CPU_DEAD_HLT;
> while (1) {
> + *cpu_dead_ptr &= ~CPU_DEAD_TRIGGER;
> native_halt();
> + /* If an NMI wants to wake me up, I'll start. */
> + if (wakeup_cpu(cpu_dead_ptr))
> + start_cpu();
> }
> }
>

2012-06-04 20:11:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Mon, 4 Jun 2012, Fenghua Yu wrote:

> From: Fenghua Yu <[email protected]>
>
> Since an offline CPU is in mwait, or in hlt if the mwait feature is not
> available, it can be woken up by writing to the monitored memory range or via
> nmi.
>
> Compared to the current INIT, INIT, STARTUP wakeup sequence, waking up an
> offline CPU via mwait or nmi is faster. This is especially useful when CPUs
> are offlined for power saving and a shorter wakeup time is desired. On one
> tested desktop machine, the wakeup time via mwait or nmi is reduced to 23% of
> the wakeup time via INIT. Wakeup time is measured from the beginning of
> store_online() to the beginning of cpu_idle() after the CPU is woken up.
>
> Waking up an offline CPU via mwait or nmi is also useful for supporting BSP
> offline/online, because an offline BSP cannot be woken up by the INIT
> sequence. The BSP offline/online patchset will be sent out separately.

I understand what you are trying to do, though I completely disagree
with the solution.

The main problem of the current hotplug code is that it is an all or
nothing approach. You have to tear down the whole thing completely
instead of just taking it out of the usable set of cpus.

I'm working on a proper state machine driven online/offline sequence,
where you can put the cpu into an intermediate state which avoids
bringing it down completely. This is enough to get the full
powersaving benefits w/o having to go through all the synchronization
states of a full online/offline. That will shorten the onlining time
of a previously offlined cpu to almost nothing.

I really want to avoid adding more bandaids to the hotplug code before
we have sorted out the existing horror.

Thanks,

tglx

2012-06-04 20:33:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Mon, 2012-06-04 at 22:11 +0200, Thomas Gleixner wrote:

> I understand what you are trying to do, though I completely disagree
> with the solution.
>
> The main problem of the current hotplug code is that it is an all or
> nothing approach. You have to tear down the whole thing completely
> instead of just taking it out of the usable set of cpus.
>
> I'm working on a proper state machine driven online/offline sequence,
> where you can put the cpu into an intermediate state which avoids
> bringing it down completely. This is enough to get the full
> powersaving benefits w/o having to go through all the synchronization
> states of a full online/offline. That will shorten the onlining time
> of a previously offlined cpu to almost nothing.
>
> I really want to avoid adding more bandaids to the hotplug code before
> we have sorted out the existing horror.

It's far worse.. you shouldn't _ever_ care about hotplug latency unless
you've got absolutely braindead hardware. We all know ARM has been
particularly creative here, but is Intel now trying to trump ARM at
stupid?

2012-06-04 21:19:09

by Fenghua Yu

[permalink] [raw]
Subject: RE: [PATCH 3/6] x86/smpboot.c: Wake up offline CPU via mwait or nmi

> From: Siddha, Suresh B
> Sent: Monday, June 04, 2012 11:59 AM
> To: Yu, Fenghua
> Cc: Ingo Molnar; Thomas Gleixner; H Peter Anvin; Luck, Tony; Mallick,
> Asit K; Arjan van de Ven; linux-kernel; x86; linux-pm
> Subject: Re: [PATCH 3/6] x86/smpboot.c: Wake up offline CPU via mwait
> or nmi
>
> On Mon, 2012-06-04 at 11:17 -0700, Fenghua Yu wrote:
> > From: Fenghua Yu <[email protected]>
> >
> > wakeup_secondary_cpu_via_soft() is defined to wake up an offline CPU via
> > mwait if the CPU is in mwait, or via nmi if the CPU is in hlt.
> >
> > A CPU still boots up via the INIT, INIT, STARTUP sequence when it comes up
> > for the first time, whether at boot time or during hot plug.
>
> I think this breaks suspend/resume as the cpu state gets lost.
>
> Have you tried suspend/resume?

Good catch! Suspend/resume is broken.

A pm callback can be installed to clear the cpu_dead state in suspend preparation. Then on resume, the INIT sequence is used to wake up CPUs, just like in the boot time case.

I have a quick fix and will put it in the next version of the patchset.
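
Roughly along these lines, as a minimal sketch (hypothetical names, not the
actual fix):

	/*
	 * Hypothetical sketch only, not the posted fix: clear the per-cpu
	 * cpu_dead state before suspend so that resume falls back to the
	 * INIT, INIT, STARTUP sequence, just like at boot time.
	 */
	static int cpu_dead_pm_notify(struct notifier_block *nb,
				      unsigned long action, void *data)
	{
		int cpu;

		if (action == PM_SUSPEND_PREPARE ||
		    action == PM_HIBERNATION_PREPARE)
			for_each_possible_cpu(cpu)
				per_cpu(cpu_dead, cpu) = 0;
		return NOTIFY_OK;
	}

	static struct notifier_block cpu_dead_pm_nb = {
		.notifier_call = cpu_dead_pm_notify,
	};

	static int __init cpu_dead_pm_init(void)
	{
		return register_pm_notifier(&cpu_dead_pm_nb);
	}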

Thanks.

-Fenghua

2012-06-04 22:03:21

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

> I'm working on a proper state machine driven online/offline sequence,
> where you can put the cpu into an intermediate state which avoids
> bringing it down completely. This is enough to get the full
> powersaving benefits w/o having to go through all the synchronization
> states of a full online/offline. That will shorten the onlining time
> of a previously offlined cpu to almost nothing.

Will this also be a step towards Fenghua's goal of being able to
take cpu#0 away? I.e. will your state machine be fully symmetric
and allow for cpu#0 to enter this intermediate mostly-offline state?

-Tony

2012-06-04 22:52:39

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Mon, 4 Jun 2012, Luck, Tony wrote:

> > I'm working on a proper state machine driven online/offline sequence,
> > where you can put the cpu into an intermediate state which avoids
> > bringing it down completely. This is enough to get the full
> > powersaving benefits w/o having to go through all the synchronization
> > states of a full online/offline. That will shorten the onlining time
> > of a previously offlined cpu to almost nothing.
>
> Will this also be a step towards Fenghua's goal of being able to
> take cpu#0 away? I.e. will your state machine be fully symmetric
> and allow for cpu#0 to enter this intermediate mostly-offline state?

The state machine does not care about the cpu number.

The only reference to the initial boot cpu is that it can enter the
state machine at the "running" level instead of going through all
instances. That's unavoidable AFAICT.

Thanks,

tglx

2012-06-05 01:07:15

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Mon, 04 Jun 2012 22:33:21 +0200, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2012-06-04 at 22:11 +0200, Thomas Gleixner wrote:
>
> > I understand what you are trying to do, though I completely disagree
> > with the solution.
> >
> > The main problem of the current hotplug code is that it is an all or
> > nothing approach. You have to tear down the whole thing completely
> > instead of just taking it out of the usable set of cpus.
> >
> > I'm working on a proper state machine driven online/offline sequence,
> > where you can put the cpu into an intermediate state which avoids
> > bringing it down completely. This is enough to get the full
> > powersaving benefits w/o having to go through all the synchronization
> > states of a full online/offline. That will shorten the onlining time
> > > of a previously offlined cpu to almost nothing.
> >
> > I really want to avoid adding more bandaids to the hotplug code before
> > we have sorted out the existing horror.
>
> It's far worse.. you shouldn't _ever_ care about hotplug latency unless
> you've got absolutely braindead hardware. We all know ARM has been
> particularly creative here, but is Intel now trying to trump ARM at
> stupid?

I disagree. Deactivating a cpu for power saving is halfway to hotplug
anyway. I'd rather unify the two cases, where we can specify how dead a
CPU should be, than have individual archs and boards do random hacks.

It also gives us a great excuse to audit and neaten various of the
hotplug cpu callbacks; most of the ones I've looked at have been racy :(

The ones which simply want to keep per-cpu stats can be given a nice
helper with two simple callbacks: one to empty stats for a going-away
cpu, and (maybe) one to restore them.

The per-cpu kthreads should no longer get torn down and recreated, and
doing it via a separate notifier function is ugly and error-prone. My
plan is a "bool kthread_cpu_going(void)" and then a "void
kthread_cpu_can_go(void)", so kthreads can do:

	if (kthread_cpu_going()) {
		/* Do any cleanup we need. */
		...

		/* This returns when CPU comes back. */
		kthread_cpu_can_go();
	}

Yeah, we should probably have the kthread exit inside
kthread_cpu_can_go() if they stop the kthread, but that's a detail.

Cheers,
Rusty.

2012-06-05 01:23:26

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On 6/4/2012 5:40 PM, Rusty Russell wrote:
> On Mon, 04 Jun 2012 22:33:21 +0200, Peter Zijlstra <[email protected]> wrote:
>> On Mon, 2012-06-04 at 22:11 +0200, Thomas Gleixner wrote:
>>
>>> I understand what you are trying to do, though I completely disagree
>>> with the solution.
>>>
>>> The main problem of the current hotplug code is that it is an all or
>>> nothing approach. You have to tear down the whole thing completely
>>> instead of just taking it out of the usable set of cpus.
>>>
>>> I'm working on a proper state machine driven online/offline sequence,
>>> where you can put the cpu into an intermediate state which avoids
>>> bringing it down completely. This is enough to get the full
>>> powersaving benefits w/o having to go through all the synchronization
>>> states of a full online/offline. That will shorten the onlining time
>>> of a previously offlined cpu to almost nothing.
>>>
>>> I really want to avoid adding more bandaids to the hotplug code before
>>> we have sorted out the existing horror.
>>
>> It's far worse.. you shouldn't _ever_ care about hotplug latency unless
>> you've got absolutely braindead hardware. We all know ARM has been
>> particularly creative here, but is Intel now trying to trump ARM at
>> stupid?
>
> I disagree. Deactivating a cpu for power saving is halfway to hotplug
> anyway. I'd rather unify the two cases, where we can specify how dead a
> CPU should be, than have individual archs and boards do random hacks.

well on PC's there really is no difference at least;
idle equals "all power removed" already there.

but I can see that on some other architectures, that lack idle that
deep, there can be a real difference.

2012-06-05 07:39:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Mon, 2012-06-04 at 18:23 -0700, Arjan van de Ven wrote:
>
> but I can see that on some other architectures, that lack idle that
> deep, there can be a real difference.
>
That's what I said, stupid hardware..

2012-06-05 07:39:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 10:10 +0930, Rusty Russell wrote:
> I disagree. Deactivating a cpu for power saving is halfway to hotplug
> anyway.

No, that's only so for broken hardware. On sane hardware idling is
sufficient.

2012-06-05 09:37:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 5 Jun 2012, Rusty Russell wrote:
> On Mon, 04 Jun 2012 22:33:21 +0200, Peter Zijlstra <[email protected]> wrote:
> > On Mon, 2012-06-04 at 22:11 +0200, Thomas Gleixner wrote:
> >
> > > I understand what you are trying to do, though I completely disagree
> > > with the solution.
> > >
> > > The main problem of the current hotplug code is that it is an all or
> > > nothing approach. You have to tear down the whole thing completely
> > > instead of just taking it out of the usable set of cpus.
> > >
> > > I'm working on a proper state machine driven online/offline sequence,
> > > where you can put the cpu into an intermediate state which avoids
> > > bringing it down completely. This is enough to get the full
> > > powersaving benefits w/o having to go through all the synchronization
> > > states of a full online/offline. That will shorten the onlining time
> > > of a previously offlined cpu to almost nothing.
> > >
> > > I really want to avoid adding more bandaids to the hotplug code before
> > > we have sorted out the existing horror.
> >
> > It's far worse.. you shouldn't _ever_ care about hotplug latency unless
> > you've got absolutely braindead hardware. We all know ARM has been
> > particularly creative here, but is Intel now trying to trump ARM at
> > stupid?
>
> I disagree. Deactivating a cpu for power saving is halfway to hotplug
> anyway. I'd rather unify the two cases, where we can specify how dead a
> CPU should be, than have individual archs and boards do random hacks.
>
> It also gives us a great excuse to audit and neaten various of the
> hotplug cpu callbacks; most of the ones I've looked at have been racy :(
>
> The ones which simply want to keep per-cpu stats can be given a nice
> helper with two simple callbacks: one to empty stats for a going-away
> cpu, and (maybe) one to restore them.
>
> The per-cpu kthreads should no longer get torn down and recreated, and
> doing it via a separate notifier function is ugly and error-prone. My
> plan is a "bool kthread_cpu_going(void)" and then a "void
> kthread_cpu_can_go(void)", so kthreads can do:
>
> 	if (kthread_cpu_going()) {
> 		/* Do any cleanup we need. */
> 		...
>
> 		/* This returns when CPU comes back. */
> 		kthread_cpu_can_go();
> 	}
>
> Yeah, we should probably have the kthread exit inside
> kthread_cpu_can_go() if they stop the kthread, but that's a detail.

I have an implementation of that already. Need to polish and post. If
my day would have more than 24 hours....

Thanks,

tglx

2012-06-05 13:42:08

by Thomas Gleixner

[permalink] [raw]
Subject: [PATCH] kthread: Implement park/unpark facility

Subject: kthread: Implement park/unpark facility
From: Thomas Gleixner <[email protected]>
Date: Wed, 18 Apr 2012 16:37:40 +0200

To avoid the full teardown/setup of per cpu kthreads in the case of
cpu hot(un)plug, provide a facility which allows putting the kthread
into a park position and unparking it when the cpu comes online again.

Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/kthread.h | 10 ++
kernel/kthread.c | 161 +++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 160 insertions(+), 11 deletions(-)

Index: tip/include/linux/kthread.h
===================================================================
--- tip.orig/include/linux/kthread.h
+++ tip/include/linux/kthread.h
@@ -14,6 +14,11 @@ struct task_struct *kthread_create_on_no
kthread_create_on_node(threadfn, data, -1, namefmt, ##arg)


+struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
+ void *data,
+ unsigned int cpu,
+ const char *namefmt);
+
/**
* kthread_run - create and wake a thread.
* @threadfn: the function to run until signal_pending(current).
@@ -34,9 +39,12 @@ struct task_struct *kthread_create_on_no

void kthread_bind(struct task_struct *k, unsigned int cpu);
int kthread_stop(struct task_struct *k);
-int kthread_should_stop(void);
+bool kthread_should_stop(void);
+bool kthread_should_park(void);
bool kthread_freezable_should_stop(bool *was_frozen);
void *kthread_data(struct task_struct *k);
+int kthread_park(struct task_struct *k);
+void kthread_unpark(struct task_struct *k);

int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
Index: tip/kernel/kthread.c
===================================================================
--- tip.orig/kernel/kthread.c
+++ tip/kernel/kthread.c
@@ -37,8 +37,13 @@ struct kthread_create_info
};

struct kthread {
- int should_stop;
+ bool should_stop;
+ bool should_park;
+ bool is_parked;
+ bool is_percpu;
+ unsigned int cpu;
void *data;
+ struct completion parked;
struct completion exited;
};

@@ -52,13 +57,29 @@ struct kthread {
* and this will return true. You should then return, and your return
* value will be passed through to kthread_stop().
*/
-int kthread_should_stop(void)
+bool kthread_should_stop(void)
{
return to_kthread(current)->should_stop;
}
EXPORT_SYMBOL(kthread_should_stop);

/**
+ * kthread_should_park - should this kthread return now?
+ *
+ * When someone calls kthread_park() on your kthread, it will be woken
+ * and this will return true. You should then return, and your return
+ * value will be passed through to kthread_park().
+ *
+ * Similar to kthread_should_stop(), but this keeps the thread alive
+ * and in a park position. kthread_unpark() "restarts" the thread and
+ * calls the thread function again.
+ */
+bool kthread_should_park(void)
+{
+ return to_kthread(current)->should_park;
+}
+
+/**
* kthread_freezable_should_stop - should this freezable kthread return now?
* @was_frozen: optional out parameter, indicates whether %current was frozen
*
@@ -96,6 +117,23 @@ void *kthread_data(struct task_struct *t
return to_kthread(task)->data;
}

+static bool kthread_parking(struct kthread *self)
+{
+ bool ret = false;
+
+ __set_current_state(TASK_INTERRUPTIBLE);
+ if (self->should_park) {
+ ret = true;
+ if (!self->is_parked) {
+ self->is_parked = true;
+ complete(&self->parked);
+ }
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+ return ret;
+}
+
static int kthread(void *_create)
{
/* Copy data: it's on kthread's stack */
@@ -105,9 +143,12 @@ static int kthread(void *_create)
struct kthread self;
int ret;

- self.should_stop = 0;
+ self.should_stop = false;
+ self.should_park = false;
+ self.is_parked = false;
self.data = data;
init_completion(&self.exited);
+ init_completion(&self.parked);
current->vfork_done = &self.exited;

/* OK, tell user we're spawned, wait for stop or wakeup */
@@ -117,9 +158,15 @@ static int kthread(void *_create)
schedule();

ret = -EINTR;
- if (!self.should_stop)
- ret = threadfn(data);

+ while (!self.should_stop) {
+ if (kthread_parking(&self))
+ continue;
+ self.is_parked = false;
+ ret = threadfn(data);
+ if (!self.should_park)
+ break;
+ }
/* we can't just return, we must preserve "self" on stack */
do_exit(ret);
}
@@ -210,6 +257,13 @@ struct task_struct *kthread_create_on_no
}
EXPORT_SYMBOL(kthread_create_on_node);

+static void __kthread_bind(struct task_struct *p, unsigned int cpu)
+{
+ /* It's safe because the task is inactive. */
+ do_set_cpus_allowed(p, cpumask_of(cpu));
+ p->flags |= PF_THREAD_BOUND;
+}
+
/**
* kthread_bind - bind a just-created kthread to a cpu.
* @p: thread created by kthread_create().
@@ -226,14 +280,101 @@ void kthread_bind(struct task_struct *p,
WARN_ON(1);
return;
}
-
- /* It's safe because the task is inactive. */
- do_set_cpus_allowed(p, cpumask_of(cpu));
- p->flags |= PF_THREAD_BOUND;
+ __kthread_bind(p, cpu);
}
EXPORT_SYMBOL(kthread_bind);

/**
+ * kthread_create_on_cpu - Create a cpu bound kthread
+ * @threadfn: the function to run until signal_pending(current).
+ * @data: data ptr for @threadfn.
+ * @cpu: the cpu to bind the thread to.
+ * @namefmt: printf-style name for the thread.
+ *
+ * Description: This helper function creates and names a kernel thread
+ * and binds it to a given CPU. The thread will be woken and put into
+ * park mode.
+ */
+struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
+ void *data,
+ unsigned int cpu,
+ const char *namefmt)
+{
+ struct task_struct *p;
+
+ p = kthread_create_on_node(threadfn, data, cpu_to_node(cpu), namefmt,
+ cpu);
+ if (IS_ERR(p))
+ return p;
+ /* Park the thread, mark it percpu and then bind it */
+ kthread_park(p);
+ to_kthread(p)->is_percpu = true;
+ to_kthread(p)->cpu = cpu;
+ __kthread_bind(p, cpu);
+ return p;
+}
+
+/**
+ * kthread_unpark - unpark a thread created by kthread_create().
+ * @k: thread created by kthread_create().
+ *
+ * Sets kthread_should_park() for @k to return false, wakes it, and
+ * waits for it to return. If the thread is marked percpu then it's
+ * bound to the cpu again.
+ */
+void kthread_unpark(struct task_struct *k)
+{
+ struct kthread *kthread;
+
+ get_task_struct(k);
+
+ kthread = to_kthread(k);
+ barrier(); /* it might have exited */
+ if (k->vfork_done != NULL && kthread->is_parked) {
+ if (kthread->is_percpu)
+ __kthread_bind(k, kthread->cpu);
+ kthread->should_park = false;
+ wake_up_process(k);
+ }
+ put_task_struct(k);
+}
+
+/**
+ * kthread_park - park a thread created by kthread_create().
+ * @k: thread created by kthread_create().
+ *
+ * Sets kthread_should_park() for @k to return true, wakes it, and
+ * waits for it to return. This can also be called after kthread_create()
+ * instead of calling wake_up_process(): the thread will park without
+ * calling threadfn().
+ *
+ * Returns 0 if the thread is parked, -ENOSYS if the thread exited.
+ * If called by the kthread itself just the park bit is set.
+ */
+int kthread_park(struct task_struct *k)
+{
+ struct kthread *kthread;
+ int ret = -ENOSYS;
+
+ get_task_struct(k);
+
+ kthread = to_kthread(k);
+ barrier(); /* it might have exited */
+ if (k->vfork_done != NULL) {
+ if (!kthread->is_parked) {
+ kthread->should_park = true;
+ if (k != current) {
+ wake_up_process(k);
+ wait_for_completion(&kthread->parked);
+ }
+ }
+ ret = 0;
+ }
+ put_task_struct(k);
+ return ret;
+}
+
+/**
* kthread_stop - stop a thread created by kthread_create().
* @k: thread created by kthread_create().
*
@@ -259,7 +400,7 @@ int kthread_stop(struct task_struct *k)
kthread = to_kthread(k);
barrier(); /* it might have exited */
if (k->vfork_done != NULL) {
- kthread->should_stop = 1;
+ kthread->should_stop = true;
wake_up_process(k);
wait_for_completion(&kthread->exited);
}
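
For illustration, a per-cpu kthread using this facility could look like the
following sketch (my_task, my_percpu_thread and do_work are hypothetical
names, not part of the patch):

	/* Sketch: a per-cpu kthread that parks across cpu offline. */
	static DEFINE_PER_CPU(struct task_struct *, my_task);

	static int my_percpu_thread(void *data)
	{
		/*
		 * Returning with should_park set parks us in kthread();
		 * after kthread_unpark() the threadfn is called again.
		 */
		while (!kthread_should_stop() && !kthread_should_park())
			do_work(data);		/* hypothetical work item */
		return 0;
	}

	/* Creation: the thread comes up parked and bound to the cpu. */
	per_cpu(my_task, cpu) = kthread_create_on_cpu(my_percpu_thread,
						      NULL, cpu, "mythr/%u");

	/* From a cpu hotplug notifier: park on offline, unpark on online. */
	case CPU_DOWN_PREPARE:
		kthread_park(per_cpu(my_task, cpu));
		break;
	case CPU_ONLINE:
		kthread_unpark(per_cpu(my_task, cpu));
		break;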

2012-06-05 14:01:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kthread: Implement park/unpark facility

On Tue, 2012-06-05 at 15:41 +0200, Thomas Gleixner wrote:
> + * Returns 0 if the thread is parked, -ENOSYS if the thread exited.

I think we typically return something like -ESRCH in that case.

2012-06-05 14:06:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kthread: Implement park/unpark facility

On Tue, 2012-06-05 at 15:41 +0200, Thomas Gleixner wrote:
> struct kthread {
> - int should_stop;
> + bool should_stop;
> + bool should_park;
> + bool is_parked;
> + bool is_percpu;

bool doesn't have a well-specified storage type. I typically try to
avoid using it in structures for this reason. Others might not care
though.

2012-06-05 14:17:04

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Mon, 4 Jun 2012, Arjan van de Ven wrote:

> > I disagree. Deactivating a cpu for power saving is halfway to hotplug
> > anyway. I'd rather unify the two cases, where we can specify how dead a
> > CPU should be, than have individual archs and boards do random hacks.
>
> well on PC's there really is no difference at least;
> idle equals "all power removed" already there.

This doesn't sound right at all. Len Brown has often told us that on
Intel chips, power can't be removed from a package until _all_ the
cores in the package are idle.

Alan Stern

2012-06-05 15:27:10

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On 6/5/2012 7:17 AM, Alan Stern wrote:
> On Mon, 4 Jun 2012, Arjan van de Ven wrote:
>
>>> I disagree. Deactivating a cpu for power saving is halfway to hotplug
>>> anyway. I'd rather unify the two cases, where we can specify how dead a
>>> CPU should be, than have individual archs and boards do random hacks.
>>
>> well on PC's there really is no difference at least;
>> idle equals "all power removed" already there.
>
> This doesn't sound right at all. Len Brown has often told us that on
> Intel chips, power can't be removed from a package until _all_ the
> cores in the package are idle.

but "cpu hotplug" does not change that.. in fact, cpu hotplug is
implemented as a C state...


and to be specific; power will get removed from the cores one at a time.
you just cannot remove the power from the memory controller until the
last one is off.

2012-06-05 15:35:54

by Jiang Liu

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

Hi Thomas,
It's a great idea to use state machine to implement cpu online/offline.
Actually we are trying to split the cpu online/offline flow into several stages
so we could do error recover easily when hot-plugging physical processors.
Thanks!
Gerry

On 06/05/2012 04:11 AM, Thomas Gleixner wrote:
> On Mon, 4 Jun 2012, Fenghua Yu wrote:
>
>> From: Fenghua Yu <[email protected]>
>>
>> Since an offline CPU is in mwait, or in hlt if the mwait feature is not
>> available, it can be woken up by writing to the monitored memory range or via
>> nmi.
>>
>> Compared to the current INIT, INIT, STARTUP wakeup sequence, waking up an
>> offline CPU via mwait or nmi is faster. This is especially useful when CPUs
>> are offlined for power saving and a shorter wakeup time is desired. On one
>> tested desktop machine, the wakeup time via mwait or nmi is reduced to 23% of
>> the wakeup time via INIT. Wakeup time is measured from the beginning of
>> store_online() to the beginning of cpu_idle() after the CPU is woken up.
>>
>> Waking up an offline CPU via mwait or nmi is also useful for supporting BSP
>> offline/online, because an offline BSP cannot be woken up by the INIT
>> sequence. The BSP offline/online patchset will be sent out separately.
>
> I understand what you are trying to do, though I completely disagree
> with the solution.
>
> The main problem of the current hotplug code is that it is an all or
> nothing approach. You have to tear down the whole thing completely
> instead of just taking it out of the usable set of cpus.
>
> I'm working on a proper state machine driven online/offline sequence,
> where you can put the cpu into an intermediate state which avoids
> bringing it down completely. This is enough to get the full
> powersaving benefits w/o having to go through all the synchronization
> states of a full online/offline. That will shorten the onlining time
> of a previously offlined cpu to almost nothing.
>
> I really want to avoid adding more bandaids to the hotplug code before
> we have sorted out the existing horror.
>
> Thanks,
>
> tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2012-06-05 16:02:44

by Fenghua Yu

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Tuesday, June 05, 2012 12:40 AM
> To: Rusty Russell
> Cc: Thomas Gleixner; Yu, Fenghua; Ingo Molnar; H Peter Anvin; Siddha,
> Suresh B; Luck, Tony; Mallick, Asit K; Arjan van de Ven; linux-kernel;
> x86; linux-pm; Srivatsa S. Bhat
> Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait
> or nmi
>
> On Tue, 2012-06-05 at 10:10 +0930, Rusty Russell wrote:
> > I disagree. Deactivating a cpu for power saving is halfway to
> hotplug
> > anyway.
>
> No, that's only so for broken hardware. On sane hardware idling is
> sufficient.

Users can consolidate processes on a few online CPUs and offline the rest when the workload is light. This consolidation can save more power and give better performance than idling CPUs, because hot cache in the online CPUs and offline CPUs can be allocated in the same package.

Thanks.

-Fenghua

2012-06-05 16:09:48

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 16:02 +0000, Yu, Fenghua wrote:
>
> Users can consolidate processes on a few online CPUs and offline the
> rest when the workload is light.

No they can't, hotplug is a root only op.

> This consolidation can save more power and have better performance
> than idling CPU's because hot cache in online CPU's and offline CPU's
> can be allocated in same package.

Yeah, or you fix the load-balancer to do this.

2012-06-05 16:19:08

by Fenghua Yu

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Tuesday, June 05, 2012 9:09 AM
> To: Yu, Fenghua
> Cc: Rusty Russell; Thomas Gleixner; Ingo Molnar; H Peter Anvin; Siddha,
> Suresh B; Luck, Tony; Mallick, Asit K; Arjan Dan De Ven; linux-kernel;
> x86; linux-pm; Srivatsa S. Bhat
> Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait
> or nmi
>
> On Tue, 2012-06-05 at 16:02 +0000, Yu, Fenghua wrote:
> >
> > Users can consolidate processes on a few online CPU's and offline the
> > rest when workload is light.
>
> No they can't, hotplug is a root only op.

That's right. A root admin does this. Actually, some Linux users are already adjusting online/offline CPUs based on workload to save power in their businesses. Offlining CPUs has some advantages over idling them.

>
> > This consolidation can save more power and have better performance
> > than idling CPU's because hot cache in online CPU's and offline CPU's
> > can be allocated in same package.
>
> Yeah, or you fix the load-balancer to do this.

2012-06-05 16:20:11

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 16:18 +0000, Yu, Fenghua wrote:
> Actually some Linux people are adjusting online/offline CPU's based on
> workload to save power in their business. There are some advantages
> for online/offline CPU's than idling CPU's.

Like what? Offline is nothing more than a C state on x86.

2012-06-05 17:44:20

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

> Like what? Offline is nothing more than a C state on x86.

Offline is a bigger hammer than idle.

When a core is idle it may take an interrupt, which wakes it up and uses power.
The scheduler may assign a process to run on it, which will also wake it up and use power.

When a core is offline we take extra steps (re-routing interrupts, telling the
scheduler it is not available for work) to make sure it STAYS in that low
power state.

-Tony

2012-06-05 17:50:51

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 17:44 +0000, Luck, Tony wrote:
> > Like what? Offline is nothing more than a C state on x86.
>
> Offline is a bigger hammer than idle.
>
> When a core is idle it may take an interrupt which wakes it up to use power.
> The scheduler may assign a process to run on it, which will wake it up to use power.
>
> When a core is offline we take extra steps (re-routing interrupts, telling the
> scheduler it is not available for work) to make sure it STAYS in that low
> power state.

You also wreck cpusets, cpu affinity and you need some userspace crap to
poll state trying to figure out when to wake up again.

(And yes, I've heard stories about userspace hotplug daemons that cause
machine wakeups themselves and were a main source of power usage at some
point).

All the timer/interrupt nonsense needs to be fixed anyhow, the HPC and
RT people want isolation anyway.

So shouldn't we all start by fixing the entire
load-balancer/timer/interrupt madness before we start swinging stupid
big hammers around that break half the interfaces we have?

2012-06-05 18:07:46

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 17:44 +0000, Luck, Tony wrote:
> The scheduler may assign a process to run on it, which will wake it up
> to use power.

But given that you care about how fast a cpu can come up again, this
seems to be exactly what you want. You want to adapt to load fast, so
why bother going through userspace and wrecking bits in between?

2012-06-05 19:44:16

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 5 Jun 2012, Peter Zijlstra wrote:

> On Tue, 2012-06-05 at 17:44 +0000, Luck, Tony wrote:
> > > Like what? Offline is nothing more than a C state on x86.
> >
> > Offline is a bigger hammer than idle.
> >
> > When a core is idle it may take an interrupt which wakes it up to use power.
> > The scheduler may assign a process to run on it, which will wake it up to use power.
> >
> > When a core is offline we take extra steps (re-routing interrupts, telling the
> > scheduler it is not available for work) to make sure it STAYS in that low
> > power state.
>
> You also wreck cpusets, cpu affinity and you need some userspace crap to
> poll state trying to figure out when to wake up again.
>
> (And yes, I've heard stories about userspace hotplug daemons that cause
> machine wakeups themselves and were a main source of power usage at some
> point).
>
> All the timer/interrupt nonsense needs to be fixed anyhow, the HPC and
> RT people want isolation anyway.
>
> So shouldn't we all start by fixing the entire
> load-balancer/timer/interrupt madness before we start swinging stupid
> big hammers around that break half the interfaces we have?

My idea of the stateful hotplug is to have a state which just gets rid
of the interrupts, timers and some other crap (mostly IPIs) but allows
an ad hoc resurrection of the cpu.

Ideally the state transition would be driven by the load-balancer.

I know that the current load balancer is too stupid to do that, but
that's a different problem. Right now we can't fix the load balancer
because we have no mechanisms to solve the other issues and the other
issues are not solved because the stupid load balancer is in the way.

So we have to start somewhere.

IMNSHO providing a stateful hotplug mechanism which allows us to solve
the issues outside of the load balancer in a simple and robust way is
a proper approach. Once we have that we can tackle the load balancer
to control the whole thing.

Vs. the interrupt/timer/other crap madness:

- We really don't want to have an interrupt balancer in the kernel
again, but we need a mechanism to prevent the user space balancer
trainwreck from ruining the power saving party.

- The timer issue is mostly solved by the existing nohz stuff
(plus/minus the few bugs in there).

- The other details (silly IPIs and cross-CPU timer arming) are way
  easier to solve by a proper prohibitive state than by chasing that
  nonsense all over the tree forever.


Thoughts?

tglx

2012-06-05 19:46:15

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> My idea of the stateful hotplug is to have a state which just gets rid
> of the interrupts, timers and some other crap (mostly IPIs) but allows
> an ad hoc resurrection of the cpu.
>
> Ideally the state transition would be driven by the load-balancer.
>
No, that's retarded. Just make it a proper idle state; don't go mucking
about with hotplug from the load-balancer.

2012-06-05 19:49:34

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> Vs. the interrupt/timer/other crap madness:
>
> - We really don't want to have an interrupt balancer in the kernel
> again, but we need a mechanism to prevent the user space balancer
> trainwreck from ruining the power saving party.

What's wrong with having an interrupt balancer tied to the scheduler
which optimistically tries to avoid interrupting nohz/isolated/idle
cpus?

> - The timer issue is mostly solved by the existing nohz stuff
> (plus/minus the few bugs in there).

It's not; if you create an isolated domain there's no way to expel
existing timers from there.

> - The other details (silly IPIs) and cross CPU timer arming) are way
> easier to solve by a proper prohibitive state than by chasing that
> nonsense all over the tree forever.

But we need to solve all that without a prohibitive state anyway for the
isolation stuff to be useful.

2012-06-05 19:51:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On 6/5/2012 12:49 PM, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
>> Vs. the interrupt/timer/other crap madness:
>>
>> - We really don't want to have an interrupt balancer in the kernel
>> again, but we need a mechanism to prevent the user space balancer
>> trainwreck from ruining the power saving party.
>
> What's wrong with having an interrupt balancer tied to the scheduler
> which optimistically tries to avoid interrupting nohz/isolated/idle
> cpus?

Ideally threaded interrupts are like this; we really should push for
more usage of them and it all falls into place.

2012-06-05 19:51:46

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
>
> I know that the current load balancer is too stupid to do that, but
> that's a different problem. Right now we can't fix the load balancer
> because we have no mechanisms to solve the other issues and the other
> issues are not solved because the stupid load balancer is in the way.
>
This is not so. If we do a power-aware balancer that packs work
instead of spreading it, it's fairly easy to find who's waking the
should-be-idle cpus and try to fix those causes.

2012-06-05 19:53:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 12:51 -0700, Arjan van de Ven wrote:
> On 6/5/2012 12:49 PM, Peter Zijlstra wrote:
> > On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> >> Vs. the interrupt/timer/other crap madness:
> >>
> >> - We really don't want to have an interrupt balancer in the kernel
> >> again, but we need a mechanism to prevent the user space balancer
> >> trainwreck from ruining the power saving party.
> >
> > What's wrong with having an interrupt balancer tied to the scheduler
> > which optimistically tries to avoid interrupting nohz/isolated/idle
> > cpus?
>
> ideally threaded interrupts are like this.. we really should push for
> more usage of such and it all falls into place

They are nothing like that; threaded interrupts still have a hardirq
kick-off that's run on whatever CPU the interrupt routing picks.

2012-06-05 19:54:23

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

> But given that you care about how fast a cpu can come up again, this
> seems to be exactly what you want. You want to adapt to load fast, so
> why bother going through userspace and wrecking bits in between?

There are multiple needs here that appear to have some significant
overlap (re-routing interrupts, stopping scheduling).

My need is to take a cpu offline for RAS reasons - and since it is
broken, I'm not going to bring it back again (at least not anytime
soon - perhaps a service engineer will drive by in a few hours, pull out
the broken cpu, put in a new one, and then bring that online ... but
saving a few milliseconds in this use case is pointless).

Other people want to do this for power saving. They probably do care
how fast the processor can be brought back.

-Tony

2012-06-05 19:56:35

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 19:54 +0000, Luck, Tony wrote:

> My need is for RAS reasons to take a cpu offline - and since
> it is broken, I'm not going to bring it back again (at least not anytime
> soon - perhaps a service engineer will drive by in a few hours, pull out
> the broken cpu, put in a new one, and then bring that online ... but
> saving a few milli-seconds in this use case it pointless).

Right, performance is completely irrelevant in this case. In fact, the
current code should work perfectly fine for you -- except for the whole
BSP nightmare x86 has.

> Other people want to do this for power saving. They probably do care
> how fast the processor can be brought back.

They're bloody insane, or working with broken hardware.

2012-06-05 20:48:06

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> > Vs. the interrupt/timer/other crap madness:
> >
> > - We really don't want to have an interrupt balancer in the kernel
> > again, but we need a mechanism to prevent the user space balancer
> > trainwreck from ruining the power saving party.
>
> What's wrong with having an interrupt balancer tied to the scheduler
> which optimistically tries to avoid interrupting nohz/isolated/idle
> cpus?

You want to run through a boatload of interrupts and change their
affinity from the load balancer or something related? Not really.

> > - The timer issue is mostly solved by the existing nohz stuff
> > (plus/minus the few bugs in there).
>
> Its not.. if you create an isolated domain there's no way to expel
> existing timers from there.

Yep, that's one of the problems which need to be fixed independent of
the solution we come up with.

> > - The other details (silly IPIs) and cross CPU timer arming) are way
> > easier to solve by a proper prohibitive state than by chasing that
> > nonsense all over the tree forever.
>
> But we need to solve all that without a prohibitibe state anyway for the
> isolation stuff to be useful.

And what is preventing us from using a prohibitive state for that purpose?
The isolation stuff Frederic is working on is nothing else than
dynamically switching in and out of a prohibitive state.

So do we really need to make the world and some more aware of those
states, instead of having a facility which lets us control what's
allowed/applicable in a given situation? Whether that's controlled by
the load-balancer or by user space or partially by both or something
else is a totally different issue.

I completely understand your reasoning, but I seriously doubt that we
can educate the whole crowd to understand the problems at hand. My
experience in the last 10+ years tells me that if you do not restrict
stuff you enter a never ending "chase the human stupidity^Wcreativity"
game. Even if you restrict it massively you end up observing a patch
which does:

+ d->core_internal_state__do_not_mess_with_it |= SOME_CONSTANT;

So do you really want to promote a solution which requires brain
sanity of all involved parties?

What's wrong with making a 'hotplug' model which provides the
following states:

Fully functional

Isolated functional

Isolated idle

<the physical hotplug mess>

where you have the ability to control the transitions of the upper 3
(or maybe more) states from the load balancer and/or user space or
whatever instance we come up with?
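
To make that concrete, a minimal sketch of such a state space (names
purely illustrative; no such enum exists in the kernel today):

/* Illustrative only -- not an existing kernel API. */
enum cpu_avail_state {
	CPU_FULLY_FUNCTIONAL,		/* scheduling, interrupts, timers */
	CPU_ISOLATED_FUNCTIONAL,	/* runs pinned work; no stray IPIs,
					 * timers or cross-cpu calls */
	CPU_ISOLATED_IDLE,		/* parked, cheap ad hoc resurrection */
	CPU_OFFLINE,			/* the physical hotplug mess */
};

/* Transitions among the upper three states would be cheap and could be
 * driven by the load balancer, user space or both; only CPU_OFFLINE
 * requires the full teardown. */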

That puts the burden on the core facility design, but it removes the
maintenance burden of chasing a gazillion instances doing IPIs,
cross cpu function calls, add_timer_on, add_work_on and whatever
nonsense.

Note that these upper states are not 'hotplug' by definition, but
they have to be traversed by hot(un)plug as well. So why not make
them explicit states which we can exploit for the other problems we
want to solve?

Your idea of tying everything to the scheduler and the load balancer
just introduces exactly the same states again, merely in a different
context.

Thanks,

tglx

2012-06-05 20:59:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

Thomas Gleixner <[email protected]> writes:
>
> Vs. the interrupt/timer/other crap madness:
>
> - We really don't want to have an interrupt balancer in the kernel
> again, but we need a mechanism to prevent the user space balancer
> trainwreck from ruining the power saving party.

Why not? I think the kernel is exactly the right place for it.
It's essentially a scheduling problem. Scheduling in user space
is not a good idea.

With MSI-X the drivers just want a static setting. User space
shouldn't mess with it.

Some of the workarounds for user space messing with it (like that
interrupt rmap code) are really bad and just a workaround for doing the
scheduling in the wrong place.

For dynamic changes it should indeed be part of scheduling,
following similar rules, with only high level policy input
from userland.

-Andi

--
[email protected] -- Speaking for myself only

2012-06-05 21:15:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 5 Jun 2012, Andi Kleen wrote:
> Thomas Gleixner <[email protected]> writes:
> >
> > Vs. the interrupt/timer/other crap madness:
> >
> > - We really don't want to have an interrupt balancer in the kernel
> > again, but we need a mechanism to prevent the user space balancer
> > trainwreck from ruining the power saving party.
>
> Why not? I think the kernel is exactly the right place for it.
> It's essentially a scheduling problem. Scheduling in user space
> is not a good idea.

No argument about scheduling in user space. Though the real problem is
where do you draw the line between mechanism and policy?

> With MSI-X the drivers just want a static setting. User space
> shouldn't mess with it.
>
> Some of the workarounds for user space messing with it (like that
> interrupt rmap code) are really bad and just a workaround for doing the
> scheduling in the wrong place.
>
> For dynamic changes it should indeed by part of scheduling,
> following similar rules, with only high level policy input
> from userland.

I'd be happy to see a patch which implements all of that and avoids
the pitfalls of the old in-kernel irq balancer along with the
shortcomings of the user space one.

Thanks,

tglx

2012-06-05 21:29:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, Jun 05, 2012 at 09:49:16PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> > Vs. the interrupt/timer/other crap madness:
> >
> > - We really don't want to have an interrupt balancer in the kernel
> > again, but we need a mechanism to prevent the user space balancer
> > trainwreck from ruining the power saving party.
>
> What's wrong with having an interrupt balancer tied to the scheduler
> which optimistically tries to avoid interrupting nohz/isolated/idle
> cpus?

Such an interrupt balancer would be a good thing, but I don't believe
that it will be sufficient.

> > - The timer issue is mostly solved by the existing nohz stuff
> > (plus/minus the few bugs in there).
>
> Its not.. if you create an isolated domain there's no way to expel
> existing timers from there.

OK, I'll bite... Why not just use CPU hotplug to expel the timers?

(Sorry, but you just can't expect me to pass that one up!)

> > - The other details (silly IPIs) and cross CPU timer arming) are way
> > easier to solve by a proper prohibitive state than by chasing that
> > nonsense all over the tree forever.
>
> But we need to solve all that without a prohibitibe state anyway for the
> isolation stuff to be useful.

I bet that we will end up having to do both.

Thanx, Paul

2012-06-05 21:31:16

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 22:47 +0200, Thomas Gleixner wrote:
> On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> > > Vs. the interrupt/timer/other crap madness:
> > >
> > > - We really don't want to have an interrupt balancer in the kernel
> > > again, but we need a mechanism to prevent the user space balancer
> > > trainwreck from ruining the power saving party.
> >
> > What's wrong with having an interrupt balancer tied to the scheduler
> > which optimistically tries to avoid interrupting nohz/isolated/idle
> > cpus?
>
> You want to run through a boatload of interrupts and change their
> affinity from the load balancer or something related? Not really.

Well, no, not like that, but I think we could do with some coupling
there, like steering active interrupts away when they keep hitting
idle cpus.


> > > - The other details (silly IPIs) and cross CPU timer arming) are way
> > > easier to solve by a proper prohibitive state than by chasing that
> > > nonsense all over the tree forever.
> >
> > But we need to solve all that without a prohibitibe state anyway for the
> > isolation stuff to be useful.
>
> And what is preventing us to use a prohibitive state for that purpose?
> The isolation stuff Frederic is working on is nothing else than
> dynamically switching in and out of a prohibitive state.

I don't think so. It's perfectly fine to get TLB invalidate IPIs or
resched-IPIs or any other kind of kernel work that needs doing. It's even
fine for timers to happen. What's not fine is getting spurious IPIs when
there's no work to do, or getting timers from another workload.

> I completely understand your reasoning, but I seriously doubt that we
> can educate the whole crowd to understand the problems at hand. My
> experience in the last 10+ years tells me that if you do not restrict
> stuff you enter a never ending "chase the human stupidity^Wcreativity"
> game. Even if you restrict it massively you end up observing a patch
> which does:
>
> + d->core_internal_state__do_not_mess_with_it |= SOME_CONSTANT;
>
> So do you really want to promote a solution which requires brain
> sanity of all involved parties?

I just don't see a way to hard-wall interrupt sources, esp. when they
might be perfectly fine or even required for the correct operation of
the machine and desired workload.

kstopmachine -- however much we all love that thing -- will need to stop
all cpus and violate isolation barriers.

RCU has similar nasties.

> What's wrong with making a 'hotplug' model which provides the
> following states:

For one, calling it hotplug ;-)

> Fully functional
>
> Isolated functional
>
> Isolated idle

I can see the isolated idle, but we can implement that as an idle state
and have smp_send_reschedule() do the magic wakeup. This should even
work for crippled hardware.

What I can't see is the isolated functional state; aside from the
above-mentioned things, it's not strictly a per-cpu property, since we
can have a group that's isolated from the rest but not from each other.

> Note, that these upper states are not 'hotplug' by definition, but
> they have to be traversed by hot(un)plug as well. So why not making
> them explicit states which we can exploit for the other problems we
> want to solve?

I think I can agree with what you call isolated-idle, as long as we
expose that as a generic idle state and put some magic in
smp_send_reschedule(). But ideally we'd conceive a better name than
hotplug for all this and only call the transition down to the 'physical
hotplug mess' hotplug.

> That puts the burden on the core facility design, but it removes the
> maintainence burden to chase a gazillion of instances doing IPIs,
> cross cpu function calls, add_timer_on, add_work_on and whatever
> nonsense.

I'd love for something like that to exist and work, I'm just not seeing
how it could.

2012-06-05 21:33:37

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 5 Jun 2012, Thomas Gleixner wrote:
> On Tue, 5 Jun 2012, Andi Kleen wrote:
> > Thomas Gleixner <[email protected]> writes:
> > >
> > > Vs. the interrupt/timer/other crap madness:
> > >
> > > - We really don't want to have an interrupt balancer in the kernel
> > > again, but we need a mechanism to prevent the user space balancer
> > > trainwreck from ruining the power saving party.
> >
> > Why not? I think the kernel is exactly the right place for it.
> > It's essentially a scheduling problem. Scheduling in user space
> > is not a good idea.
>
> No argument about scheduling in user space. Though the real problem is
> where do you draw the line between mechanism and policy?
>
> > With MSI-X the drivers just want a static setting. User space
> > shouldn't mess with it.
> >
> > Some of the workarounds for user space messing with it (like that
> > interrupt rmap code) are really bad and just a workaround for doing the
> > scheduling in the wrong place.
> >
> > For dynamic changes it should indeed by part of scheduling,
> > following similar rules, with only high level policy input
> > from userland.
>
> I'd be happy to see a patch which implements all of that and avoids
> the pitfalls of the old in kernel irq balancer along with the short
> comings of the user space one.

And aside from the above requirements, it should add the ability to deal
with the fact that, aside from server workloads, this needs to be able to
cope with applications in the embedded/mobile space which know more
about the future system state than the scheduler itself.

Thanks,

tglx

2012-06-05 21:37:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 14:29 -0700, Paul E. McKenney wrote:
> OK, I'll bite... Why not just use CPU hotplug to expel the timers?

Currently? Can you say: 'kstopmachine'?

But it's also a question of interface and naming. Do you want to have to
iterate all cpus in your isolated set? Do you want to bring them down
far enough to physically unplug? Ideally, no to both.

If you don't bring them down far enough to unplug, should you still be
calling it hotplug?

Ideally I think there'd be a file in your cpuset which, if opened and
written to, will flush all pending bits (timers, workqueues, the lot) and
return when this is done (and maybe provide O_ASYNC writes to not wait
for completion).
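
Something like this, purely hypothetically (neither the file nor the
semantics exist today; name and behaviour are only what's proposed
above):

/* Hypothetical interface sketch -- the 'quiesce' file does not exist. */
#include <fcntl.h>
#include <unistd.h>

static int quiesce_set(const char *cpuset_quiesce_path)
{
	int fd = open(cpuset_quiesce_path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "1", 1);	/* returns once timers/workqueues are flushed */
	close(fd);
	return n == 1 ? 0 : -1;
}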

2012-06-05 22:01:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, Jun 05, 2012 at 11:37:21PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 14:29 -0700, Paul E. McKenney wrote:
> > OK, I'll bite... Why not just use CPU hotplug to expel the timers?
>
> Currently? Can you say: 'kstopmachine'?

So if CPU hotplug (or whatever you want to call it) stops using
kstopmachine, you are OK with it?

> But its also a question of interface and naming. Do you want to have to
> iterate all cpus in your isolated set, do you want to bring them down
> far enough to physically unplug. Ideally no to both.

For many use cases, it is indeed not necessary to get to a point where
the CPUs could be physically removed from the system. But CPU-failure
use cases would need the CPU to be fully deactivated. And many of the
hardware guys tell me that the CPU-failure case will be getting more
common, though I sure hope that they are wrong.

> If you don't bring them down far enough to unplug, should you still be
> calling it hotplug?

I am not too worried about what it is called. Though "banish to monastery"
would probably be going too far in the other direction.

> Ideally I think there'd be a file in your cpuset which if opened and
> written to will flush all pending bits (timers, workqueues, the lot) and
> return when this is done (and maybe provide O_ASYNC writes to not wait
> for completion).

The mobile guys probably are not too worried about bulk operations yet
because they don't have that many CPUs, but it might be useful elsewhere.

Thanx, Paul

2012-06-05 22:09:26

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 22:47 +0200, Thomas Gleixner wrote:
> > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> > > > Vs. the interrupt/timer/other crap madness:
> > > >
> > > > - We really don't want to have an interrupt balancer in the kernel
> > > > again, but we need a mechanism to prevent the user space balancer
> > > > trainwreck from ruining the power saving party.
> > >
> > > What's wrong with having an interrupt balancer tied to the scheduler
> > > which optimistically tries to avoid interrupting nohz/isolated/idle
> > > cpus?
> >
> > You want to run through a boatload of interrupts and change their
> > affinity from the load balancer or something related? Not really.
>
> Well, no not like that, but I think we could do with some coupling
> there. Like steer active interrupts away when they keep hitting idle
> state.

That's possible, but that wants a well coordinated mechanism which
takes the user space steering into account.

I'm not saying it's impossible, I'm just trying to imagine the extra
user space interfaces needed for that.

> > > > - The other details (silly IPIs) and cross CPU timer arming) are way
> > > > easier to solve by a proper prohibitive state than by chasing that
> > > > nonsense all over the tree forever.
> > >
> > > But we need to solve all that without a prohibitibe state anyway for the
> > > isolation stuff to be useful.
> >
> > And what is preventing us to use a prohibitive state for that purpose?
> > The isolation stuff Frederic is working on is nothing else than
> > dynamically switching in and out of a prohibitive state.
>
> I don't think so. Its perfectly fine to get TLB invalidate IPIs or

No it's not. It's silly. I've observed this very issue more than once,
and others have as well.

If you have a process with N threads where each thread is pinned to a
core, and only one of them is doing file operations, those operations
result in mmap/munmap and therefore in TLB shootdown IPIs, even if it
is ensured that the other pinned threads will never ever touch that
mapping. That's a PITA, as the workaround is to use NFS (how
performant) or to split the process into separate processes with
shared memory, giving up the sane design of a single process where the
housekeeping thread just writes to disk.
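
For the record, a minimal sketch of that pattern (hypothetical worker
loops, pinning via pthread_setaffinity_np, four-core box assumed):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

/* Sketch of the scenario above: threads pinned one per core, while a
 * housekeeping thread churns mappings. Every munmap() IPIs the pinned
 * cores for TLB shootdown although they never touch the mapping. */
static void *compute(void *arg)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET((long)arg, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	for (;;)
		;	/* number crunching, no syscalls, own data only */
	return NULL;
}

int main(void)
{
	pthread_t tid;
	long cpu;

	for (cpu = 1; cpu < 4; cpu++)
		pthread_create(&tid, NULL, compute, (void *)cpu);

	for (;;) {			/* housekeeping on cpu0 */
		void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* ... write results to disk ... */
		munmap(buf, 1 << 20);	/* -> shootdown IPIs to cpu1-3 */
	}
}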

This is exactly one of the issues where the application has more
knowledge than the kernel and there is no way to deal with it.

I know, it's a hen and egg problem, but a very real one.

> resched-IPIs or any other kind of kernel work that needs doing. Its even

resched IPIs are a different issue. They cause a real state transition
as does any other kind of work which needs to be scheduled on that
CPU. What I'm talking about is stuff which should not happen on an
isolated cpu. We have no mechanism to exclude those cpus from general
"oh you should do X and Y" tasks which are not really necessary at all.

> I just don't see a way to hard-wall interrupt sources, esp. when they
> might be perfectly fine or even required for the correct operation of
> the machine and desired workload.

You can't steer away interrupts which are willingly targeted at an
isolated CPU. So yes, we need mechanisms for that as well. I don't
claim that hotplug states are the cure for all problems.

> kstopmachine -- however much we all love that thing -- will need to stop
> all cpus and violate isolation barriers.

Yup. Though we really should sit down and figure out how much we
really need it. If code patching needs it on a given architecture, then
this particular arch has to cope with it, but all others which can
deal with other mechanisms should not care about it. Yes, that's not
what the kernel looks like ATM, but it's how it should look in the
near future.

> RCU has similar nasties.

Why?

> > What's wrong with making a 'hotplug' model which provides the
> > following states:
>
> For one calling it hotplug ;-)

Bah. Call it what you want. We can put it on top of the hotplug
mechanism as a separate facility, but that does not change the
semantics at all. Neither does it change the fact that the real
hotplug stuff needs these transitions as well.

> > Fully functional
> >
> > Isolated functional
> >
> > Isolated idle
>
> I can see the isolated idle, but we can implement that as an idle state
> and have smp_send_reschedule() do the magic wakeup. This should even
> work for crippled hardware.
>
> What I can't see is the isolated functional, aside from the above
> mentioned things, that's not strictly a per-cpu property, we can have a
> group that's isolated from the rest but not from each other.

That's an implementation detail, really.

> > Note, that these upper states are not 'hotplug' by definition, but
> > they have to be traversed by hot(un)plug as well. So why not making
> > them explicit states which we can exploit for the other problems we
> > want to solve?
>
> I think I can agree with what you call isolated-idle, as long as we
> expose that as a generic idle state and put some magic in
> smp_send_reschedule(). But ideally we'd conceive a better name than
> hotplug for all this and only call the transition to down to 'physical
> hotplug mess' hotplug.

Agreed for the naming convention part.

> > That puts the burden on the core facility design, but it removes the
> > maintainence burden to chase a gazillion of instances doing IPIs,
> > cross cpu function calls, add_timer_on, add_work_on and whatever
> > nonsense.
>
> I'd love for something like that to exist and work, I'm just not seeing
> how it could.

Think harder :)

tglx


2012-06-05 22:12:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, Jun 05, 2012 at 11:30:56PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 22:47 +0200, Thomas Gleixner wrote:
> > On Tue, 5 Jun 2012, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 21:43 +0200, Thomas Gleixner wrote:
> > > > Vs. the interrupt/timer/other crap madness:
> > > >
> > > > - We really don't want to have an interrupt balancer in the kernel
> > > > again, but we need a mechanism to prevent the user space balancer
> > > > trainwreck from ruining the power saving party.
> > >
> > > What's wrong with having an interrupt balancer tied to the scheduler
> > > which optimistically tries to avoid interrupting nohz/isolated/idle
> > > cpus?
> >
> > You want to run through a boatload of interrupts and change their
> > affinity from the load balancer or something related? Not really.
>
> Well, no not like that, but I think we could do with some coupling
> there. Like steer active interrupts away when they keep hitting idle
> state.

But the guys who are more fanatic about performance than about energy
efficiency would -want- the interrupts to hit the idle CPUs, right?

> > > > - The other details (silly IPIs) and cross CPU timer arming) are way
> > > > easier to solve by a proper prohibitive state than by chasing that
> > > > nonsense all over the tree forever.
> > >
> > > But we need to solve all that without a prohibitibe state anyway for the
> > > isolation stuff to be useful.
> >
> > And what is preventing us to use a prohibitive state for that purpose?
> > The isolation stuff Frederic is working on is nothing else than
> > dynamically switching in and out of a prohibitive state.
>
> I don't think so. Its perfectly fine to get TLB invalidate IPIs or
> resched-IPIs or any other kind of kernel work that needs doing. Its even
> fine for timers to happen. What's not fine is getting spurious IPIs when
> there's no work to do, or getting timers from another workload.

One desirable property of CPU hotplug is that it puts the CPU in a state
where it no longer needs to receive TLB invalidations, resched IPIs, etc.

> > I completely understand your reasoning, but I seriously doubt that we
> > can educate the whole crowd to understand the problems at hand. My
> > experience in the last 10+ years tells me that if you do not restrict
> > stuff you enter a never ending "chase the human stupidity^Wcreativity"
> > game. Even if you restrict it massively you end up observing a patch
> > which does:
> >
> > + d->core_internal_state__do_not_mess_with_it |= SOME_CONSTANT;
> >
> > So do you really want to promote a solution which requires brain
> > sanity of all involved parties?
>
> I just don't see a way to hard-wall interrupt sources, esp. when they
> might be perfectly fine or even required for the correct operation of
> the machine and desired workload.
>
> kstopmachine -- however much we all love that thing -- will need to stop
> all cpus and violate isolation barriers.
>
> RCU has similar nasties.

I am working to rid RCU of this sort of thing. I have rcu_barrier() so
that it avoids messing with CPUs that don't have callbacks, which will
be almost all of the idle CPUs, especially for CONFIG_RCU_FAST_NO_HZ=y.
I believe that I have also removed all of RCU's dependencies on CPU
hotplug's using kstopmachine, though Murphy would say otherwise.

I still need to fix up synchronize_sched_expedited(), but that is on
the list. I considered getting rid of this one, but I am probably going
to have to make synchronize_sched() map to it during boot time to keep
the boot-speed demons satisfied.

> > What's wrong with making a 'hotplug' model which provides the
> > following states:
>
> For one calling it hotplug ;-)

OK, what would you want to call it? CPU quiesce with different levels
of quiescence? CPU cripple? CPU curfew? Something else?

> > Fully functional
> >
> > Isolated functional
> >
> > Isolated idle
>
> I can see the isolated idle, but we can implement that as an idle state
> and have smp_send_reschedule() do the magic wakeup. This should even
> work for crippled hardware.
>
> What I can't see is the isolated functional, aside from the above
> mentioned things, that's not strictly a per-cpu property, we can have a
> group that's isolated from the rest but not from each other.

I suspect that Thomas is thinking that the CPU is so idle that it no
longer has to participate in TLB invalidation or RCU. (Thomas will
correct me if I am confused.) But Peter, is that the level of idle
you are thinking of?

Thanx, Paul

> > Note, that these upper states are not 'hotplug' by definition, but
> > they have to be traversed by hot(un)plug as well. So why not making
> > them explicit states which we can exploit for the other problems we
> > want to solve?
>
> I think I can agree with what you call isolated-idle, as long as we
> expose that as a generic idle state and put some magic in
> smp_send_reschedule(). But ideally we'd conceive a better name than
> hotplug for all this and only call the transition to down to 'physical
> hotplug mess' hotplug.
>
> > That puts the burden on the core facility design, but it removes the
> > maintainence burden to chase a gazillion of instances doing IPIs,
> > cross cpu function calls, add_timer_on, add_work_on and whatever
> > nonsense.
>
> I'd love for something like that to exist and work, I'm just not seeing
> how it could.

2012-06-05 23:13:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

> And aside of the above requirements it should add the ability to deal
> with the fact that aside of server workloads this needs to be able to
> cope with appplications in the embedded/mobile space which know more
> about the future system state than the scheduler itself.

Well solving world hunger in one try is hard. Baby steps are easier.

What I think would be useful short term is a clean mechanism for drivers
to lock an interrupt onto a CPU, without irqbalanced touching it.
This would be mainly for MSI-X drivers to spread their interrupts properly
and give better performance out of the box.

Another short term case is the power aware interrupt routing now on recent
Intel CPUs. In this case the interrupt needs logical focus to multiple CPUs
and the hardware makes the decision (essentially it does power aware load
balancing in hardware). Again nobody else should touch it.

Then maybe this mechanism could be extended with a power aware
software solution with some input from the load balancer like you suggested.
I don't have a firm picture of how exactly it should work.

-Andi

--
[email protected] -- Speaking for myself only.

2012-06-06 01:53:21

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On 6/5/2012 4:13 PM, Andi Kleen wrote:
>> And aside of the above requirements it should add the ability to deal
>> with the fact that aside of server workloads this needs to be able to
>> cope with appplications in the embedded/mobile space which know more
>> about the future system state than the scheduler itself.
>
> Well solving world hunger in one try is hard. Baby steps are easier.
>
> What I think would be useful short term is a clean mechanism for drivers
> to lock a interrupt onto a CPU, without irqbalanced touching it.
> This would be mainly for MSI-X drivers to spread their interrupts properly
> and give better performance out of the box.


like the IRQ_NO_BALANCING flag? ;-)
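
For reference, a rough sketch of how a driver could use it today
(assuming vectors already allocated via pci_enable_msix(); the helper
and the placement policy are illustrative, not from any real driver):

#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/irq.h>

/* Sketch: spread MSI-X vectors across CPUs and opt out of balancing. */
static void pin_msix_vectors(struct msix_entry *entries, int nvec)
{
	int i;

	for (i = 0; i < nvec; i++) {
		/* keep irqbalance and friends off this vector */
		irq_set_status_flags(entries[i].vector, IRQ_NO_BALANCING);
		/* suggest a home CPU for the vector */
		irq_set_affinity_hint(entries[i].vector,
				      cpumask_of(i % num_online_cpus()));
	}
}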


>
> Another short term case is the power aware interrupt routing now on recent
> Intel CPUs. In this case the interrupt needs logical focus to multiple CPUs
> and the hardware makes the decision (essentially it does power aware load
> balancing in hardware). Again nobody else should touch it.

PAIR is hard; it sadly needs a mostly complete revamp of how Linux does
interrupts.

>
> Then maybe this mechanism could be extended with a power aware
> software solution with some input from the load balancer like you suggested.

irqbalanced at least tries to be power aware.

2012-06-06 08:24:10

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, 2012-06-06 at 00:09 +0200, Thomas Gleixner wrote:
> > Well, no not like that, but I think we could do with some coupling
> > there. Like steer active interrupts away when they keep hitting idle
> > state.
>
> That's possible, but that wants a well coordinated mechanism which
> takes the user space steering into account.

Sure, if the provided interrupt affinity mask doesn't allow this, we
lose. But typically those masks have all bits set.

> I'm not saying it's impossible, I'm just trying to imagine the extra
> user space interfaces needed for that.

None, preferably.

2012-06-06 08:30:36

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, 2012-06-06 at 00:09 +0200, Thomas Gleixner wrote:
> > I don't think so. Its perfectly fine to get TLB invalidate IPIs or
>
> No it's not. It's silly. I've observed the very issue more than once
> and others have done as well.
>
> If you have a process which has N threads where each thread is pinned
> to a core. Only one of them is doing file operations, which result in
> mmap/munmap and therefor in TLB shoot down IPIs even if it's ensured
> that the other pinned threads will never ever touch that
> mapping. That's a PITA as the workaround is to use NFS (how
> performant) or split the process into separate processes with shared
> memory to avoid the sane design of a single process where the
> housekeeping thread just writes to disk.
>
> This is exactly one of the issues where the application has more
> knowlegde than the kernel and there is no way to deal with it.

But simply refusing IPIs 'just because' isn't going to work either.
Suppose the RT thing did touch the memory region; then you'll end up
with stale TLB entries, and those are not fun either.

We simply cannot not send TLB invalidates; we must assume userspace is
malicious - such is the burden of a general purpose kernel.

The only solution here is using multiple processes and shared memory.
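
A minimal sketch of that arrangement, for illustration (anonymous
shared mapping inherited across fork(); pinning elided):

#include <sys/mman.h>
#include <unistd.h>

/* Sketch: compute and housekeeping in separate processes sharing one
 * buffer, so the housekeeping side's mmap/munmap churn never shoots
 * down the compute side's TLB -- they run on different mms. */
int main(void)
{
	long *results = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
			     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (fork() == 0) {
		for (;;)		/* compute process, pinned elsewhere */
			results[0]++;	/* stand-in for real work */
	}

	for (;;) {			/* housekeeping process */
		void *scratch = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		/* ... flush results[] to disk ... */
		munmap(scratch, 4096);
	}
}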

2012-06-06 08:40:27

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, 2012-06-06 at 00:09 +0200, Thomas Gleixner wrote:
> We have no mechanism to exclude those cpus from general
> "oh you should do X and Y" tasks which are not really necessary at
> all.

It's that latter part which makes this nearly impossible. How do you tell
it's not really necessary?

Look at the patches Gilad did, there is very little common code between
each of those cases.

We cannot just skip flushing objects because the cpu is supposed to be
isolated. If it has buffers, they need flushing. Not doing so would lead
to memory leaks at best and crashes at worst.

Some people want isolation for code that never uses system calls; that
is the easy case. But the isolation case where the apps do use plenty of
system calls but just don't want to deal with the perturbations of other
workloads is much harder to sort.

2012-06-06 08:40:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
> > Well, no not like that, but I think we could do with some coupling
> > there. Like steer active interrupts away when they keep hitting idle
> > state.
>
> But the guys who are more fanatic about performance than about energy
> efficiency would -want- the interrupts to hit the idle CPUs, right?

Yeah, this is why you want a performance/power knob somewhere.

2012-06-06 08:43:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
> > RCU has similar nasties.
>
> I am working to rid RCU of this sort of thing. I have rcu_barrier() so
> that it avoids messing with CPUs that don't have callbacks, which will
> be almost all of the idle CPUs, especially for CONFIG_RCU_FAST_NO_HZ=y.
> I believe that I have also removed all of RCU's dependencies on CPU
> hotplug's using kstopmachine, though Murphy would say otherwise.
>
> I still need to fix up synchronize_sched_expedited(), but that is on
> the list. I considered getting rid of this one, but I am probably going
> to have to make synchronize_sched() map to it during boot time to keep
> the boot-speed demons satisfied.

Not the point really. It's perfectly fine for applications in an
'isolated' set to use system calls, hence they get to participate in RCU
state.

I don't think isolation that means userspace while(1) applications is
interesting. Sure, some people do this, and we should dtrt for them, but
the far more interesting case is 'regular' applications that do use
system calls.

2012-06-06 08:43:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
> > What I can't see is the isolated functional, aside from the above
> > mentioned things, that's not strictly a per-cpu property, we can have a
> > group that's isolated from the rest but not from each other.
>
> I suspect that Thomas is thinking that the CPU is so idle that it no
> longer has to participate in TLB invalidation or RCU. (Thomas will
> correct me if I am confused.) But Peter, is that the level of idle
> you are thinking of?

No, we're talking about isolated, so it's very much running something.

2012-06-06 12:17:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Tue, 2012-06-05 at 15:00 -0700, Paul E. McKenney wrote:
> On Tue, Jun 05, 2012 at 11:37:21PM +0200, Peter Zijlstra wrote:
> > On Tue, 2012-06-05 at 14:29 -0700, Paul E. McKenney wrote:
> > > OK, I'll bite... Why not just use CPU hotplug to expel the timers?
> >
> > Currently? Can you say: 'kstopmachine'?
>
> So if CPU hotplug (or whatever you want to call it) stops using
> kstopmachine, you are OK with it?

It would be much better, still not ideal though.

> > But its also a question of interface and naming. Do you want to have to
> > iterate all cpus in your isolated set, do you want to bring them down
> > far enough to physically unplug. Ideally no to both.
>
> For many use cases, it is indeed not necessary to get to a point where
> the CPUs could be physically removed from the system. But CPU-failure
> use cases would need the CPU to be fully deactivated. And many of the
> hardware guys tell me that the CPU-failure case will be getting more
> common, though I sure hope that they are wrong.

Uhm, yeah, that doesn't sound right.

2012-06-06 14:41:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, Jun 06, 2012 at 10:43:43AM +0200, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
> > > What I can't see is the isolated functional, aside from the above
> > > mentioned things, that's not strictly a per-cpu property, we can have a
> > > group that's isolated from the rest but not from each other.
> >
> > I suspect that Thomas is thinking that the CPU is so idle that it no
> > longer has to participate in TLB invalidation or RCU. (Thomas will
> > correct me if I am confused.) But Peter, is that the level of idle
> > you are thinking of?
>
> No, we're talking about isolated, so its very much running something.

From what I can see, if the CPU is running something, this is Thomas's
"Isolated functional" state rather than his "Isolated idle" state.
The isolated-idle state should not need to participate in TLB invalidation
or RCU, so that the CPU never ever needs to wake up while in the
isolated-idle state.

Thanx, Paul

2012-06-06 14:43:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, Jun 06, 2012 at 02:17:21PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 15:00 -0700, Paul E. McKenney wrote:
> > On Tue, Jun 05, 2012 at 11:37:21PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2012-06-05 at 14:29 -0700, Paul E. McKenney wrote:
> > > > OK, I'll bite... Why not just use CPU hotplug to expel the timers?
> > >
> > > Currently? Can you say: 'kstopmachine'?
> >
> > So if CPU hotplug (or whatever you want to call it) stops using
> > kstopmachine, you are OK with it?
>
> It would be much better, still not ideal though.

OK, fair enough. Then again, there is not much in this world that is ideal.

> > > But its also a question of interface and naming. Do you want to have to
> > > iterate all cpus in your isolated set, do you want to bring them down
> > > far enough to physically unplug. Ideally no to both.
> >
> > For many use cases, it is indeed not necessary to get to a point where
> > the CPUs could be physically removed from the system. But CPU-failure
> > use cases would need the CPU to be fully deactivated. And many of the
> > hardware guys tell me that the CPU-failure case will be getting more
> > common, though I sure hope that they are wrong.
>
> Uhm, yeah, that doesn't sound right.

The people arguing for this believe that failures will increase with
decreasing feature size. Of course, no one will really know until
the real hardware appears.

Thanx, Paul

2012-06-06 14:44:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, Jun 06, 2012 at 10:42:55AM +0200, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
> > > RCU has similar nasties.
> >
> > I am working to rid RCU of this sort of thing. I have rcu_barrier() so
> > that it avoids messing with CPUs that don't have callbacks, which will
> > be almost all of the idle CPUs, especially for CONFIG_RCU_FAST_NO_HZ=y.
> > I believe that I have also removed all of RCU's dependencies on CPU
> > hotplug's using kstopmachine, though Murphy would say otherwise.
> >
> > I still need to fix up synchronize_sched_expedited(), but that is on
> > the list. I considered getting rid of this one, but I am probably going
> > to have to make synchronize_sched() map to it during boot time to keep
> > the boot-speed demons satisfied.
>
> Not the point really. Its perfectly fine for applications in an
> 'isolated' set to use system calls, hence they get to participate in RCU
> state.
>
> I don't think the isolation means userspace while(1) applications is
> interesting. Sure, some people do this, and we should dtrt for them, but
> the far more interesting case is 'regular' applications that do use
> system calls.

OK, I will bite. What are the semantics/properties for your isolated set?

Thanx, Paul

2012-06-06 15:24:00

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On 6/6/2012 7:41 AM, Paul E. McKenney wrote:
> On Wed, Jun 06, 2012 at 10:43:43AM +0200, Peter Zijlstra wrote:
>> On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
>>>> What I can't see is the isolated functional, aside from the above
>>>> mentioned things, that's not strictly a per-cpu property, we can have a
>>>> group that's isolated from the rest but not from each other.
>>>
>>> I suspect that Thomas is thinking that the CPU is so idle that it no
>>> longer has to participate in TLB invalidation or RCU. (Thomas will
>>> correct me if I am confused.) But Peter, is that the level of idle
>>> you are thinking of?
>>
>> No, we're talking about isolated, so its very much running something.
>
> From what I can see, if the CPU is running something, this is Thomas's
> "Isolated functional" state rather than his "Isolated idle" state.
> The isolated-idle state should not need to participate in TLB invalidation
> or RCU, so that the CPU never ever needs to wake up while in the
> isolated-idle state.

btw, TLB invalidation I think is a red herring in this discussion
(other than "global PTEs" kind of kernel pte changes);
at least on x86 this has not been happening for a long time: if a CPU is
really idle (which means the CPU internally flushes the TLBs anyway),
Linux also switches to the kernel PTE set, so there's no need for a flush
later on.

2012-06-06 15:47:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, 2012-06-06 at 07:44 -0700, Paul E. McKenney wrote:

> > I don't think the isolation means userspace while(1) applications is
> > interesting. Sure, some people do this, and we should dtrt for them, but
> > the far more interesting case is 'regular' applications that do use
> > system calls.
>
> OK, I will bite. What are the semantics/properties for your isolated set?

The scheduler will not place tasks from outside the set in the set and
vice versa. Applications outside the set should not affect those in the
set, except where there are shared resources across the set
boundary.

So in the example of one process with multiple threads, some inside and
some outside, the obvious shared resource is the address space, hence TLB
invalidates etc. will come through.

Now the kernel as a whole is also a shared resource, and this is where
it all gets tricky, since if something inside the set ends up doing a
memory allocation, it will have to participate in mm/ locks etc.

Same with RCU, if you cannot stay in an extended grace period for some
reason or other, you have to participate in the global RCU state
machinery.

But it should be so that if you don't do any of these things, you should
also not be affected by them.

Now I realize this is a 'weak' model, but the 'strong' model proposed by
tglx would make it impossible to run anything but the while(1) stuff,
and that's of limited utility.

2012-06-06 15:49:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, 2012-06-06 at 08:23 -0700, Arjan van de Ven wrote:
> btw TLB invalidation I think is a red herring in this discussion
> (other than "global PTEs" kind of kernel pte changes);
> at least on x86 this is not happening for a long time; if a CPU is
> really idle (which means the CPU internally flushes the tlbs anyway),
> Linux also switches to the kernel PTE set so there's no need for a flush
> later on.

This is about isolation, not idle. If you share a mm across your
isolation barrier you get TLB invalidates; it's unavoidable. The solution
is not sharing the mm -- which is a perfectly usable solution.

2012-06-06 16:08:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, Jun 06, 2012 at 08:23:54AM -0700, Arjan van de Ven wrote:
> On 6/6/2012 7:41 AM, Paul E. McKenney wrote:
> > On Wed, Jun 06, 2012 at 10:43:43AM +0200, Peter Zijlstra wrote:
> >> On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
> >>>> What I can't see is the isolated functional; aside from the
> >>>> above-mentioned things, that's not strictly a per-cpu property, as
> >>>> we can have a group that's isolated from the rest but not from each
> >>>> other.
> >>>
> >>> I suspect that Thomas is thinking that the CPU is so idle that it no
> >>> longer has to participate in TLB invalidation or RCU. (Thomas will
> >>> correct me if I am confused.) But Peter, is that the level of idle
> >>> you are thinking of?
> >>
> >> No, we're talking about isolated, so it's very much running something.
> >
> > From what I can see, if the CPU is running something, this is Thomas's
> > "Isolated functional" state rather than his "Isolated idle" state.
> > The isolated-idle state should not need to participate in TLB invalidation
> > or RCU, so that the CPU never ever needs to wake up while in the
> > isolated-idle state.
>
> btw, I think TLB invalidation is a red herring in this discussion
> (other than kernel PTE changes of the "global PTEs" kind);
> at least on x86 this has not been happening for a long time: if a CPU is
> really idle (which means the CPU internally flushes the TLBs anyway),
> Linux also switches to the kernel PTE set, so there's no need for a flush
> later on.

Right, as I understand it, only unmappings in the kernel address space
would need to IPI an idle CPU. But this is still a source of IPIs that
could wake up the CPU, correct?

Thanx, Paul

2012-06-06 16:59:10

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On 6/6/2012 8:49 AM, Paul E. McKenney wrote:
> On Wed, Jun 06, 2012 at 08:23:54AM -0700, Arjan van de Ven wrote:
>> On 6/6/2012 7:41 AM, Paul E. McKenney wrote:
>>> On Wed, Jun 06, 2012 at 10:43:43AM +0200, Peter Zijlstra wrote:
>>>> On Tue, 2012-06-05 at 15:12 -0700, Paul E. McKenney wrote:
>>>>>> What I can't see is the isolated functional; aside from the
>>>>>> above-mentioned things, that's not strictly a per-cpu property, as
>>>>>> we can have a group that's isolated from the rest but not from each
>>>>>> other.
>>>>>
>>>>> I suspect that Thomas is thinking that the CPU is so idle that it no
>>>>> longer has to participate in TLB invalidation or RCU. (Thomas will
>>>>> correct me if I am confused.) But Peter, is that the level of idle
>>>>> you are thinking of?
>>>>
>>>> No, we're talking about isolated, so it's very much running something.
>>>
>>> From what I can see, if the CPU is running something, this is Thomas's
>>> "Isolated functional" state rather than his "Isolated idle" state.
>>> The isolated-idle state should not need to participate in TLB invalidation
>>> or RCU, so that the CPU never ever needs to wake up while in the
>>> isolated-idle state.
>>
>> btw, I think TLB invalidation is a red herring in this discussion
>> (other than kernel PTE changes of the "global PTEs" kind);
>> at least on x86 this has not been happening for a long time: if a CPU is
>> really idle (which means the CPU internally flushes the TLBs anyway),
>> Linux also switches to the kernel PTE set, so there's no need for a flush
>> later on.
>
> Right, as I understand it, only unmappings in the kernel address space
> would need to IPI an idle CPU. But this is still a source of IPIs that
> could wake up the CPU, correct?

This is correct.
In practice, they happen pretty much only when you load or unload a
kernel module...

Frankly, I wouldn't worry about optimizing for that case ;-)

2012-06-06 23:21:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, Jun 06, 2012 at 05:46:40PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-06-06 at 07:44 -0700, Paul E. McKenney wrote:
>
> > > I don't think the "isolation means userspace while(1) applications"
> > > view is interesting. Sure, some people do this, and we should dtrt
> > > for them, but the far more interesting case is 'regular' applications
> > > that do use system calls.
> >
> > OK, I will bite. What are the semantics/properties for your isolated set?
>
> The scheduler will not place tasks from outside the set in the set and
> vice versa. Applications outside the set should not affect those in the
> set, except where there are shared resources across the set
> boundary.
>
> So in the example of one process with multiple threads, some inside and
> some outside, the threads have the obvious shared resource of the
> address space, hence TLB invalidates etc. will come through.

So, for example, a way of nicely partitioning the system to allow multiple
real-time applications to run without needing to do cross-partition
global priority queuing of the real-time tasks. Cool!

> Now the kernel as a whole is also a shared resource, and this is where
> it all gets tricky, since if something inside the set ends up doing a
> memory allocation, it will have to participate in mm/ locks etc.

Yep.

> Same with RCU: if you cannot stay in an extended grace period for some
> reason or other, you have to participate in the global RCU state
> machinery.

Yep.

> But it should be so that if you don't do any of these things, you should
> also not be affected by them.
>
> Now I realize this is a 'weak' model, but the 'strong' model proposed by
> tglx would make it impossible to run anything but the while(1) stuff,
> and that's of limited utility.

Thomas's strong model also supports a strong form of idle, as well as the
while(1) stuff, both of which have their uses.

Thanx, Paul

2012-06-07 00:04:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] kthread: Implement park/unpark facility

On 06/05/2012 07:05 AM, Peter Zijlstra wrote:
> On Tue, 2012-06-05 at 15:41 +0200, Thomas Gleixner wrote:
>> struct kthread {
>> -        int should_stop;
>> +        bool should_stop;
>> +        bool should_park;
>> +        bool is_parked;
>> +        bool is_percpu;
>
> bool doesn't have a well-specified storage type. I typically try to
> avoid using it in structures for this reason. Others might not care
> though.
>

On all known Linux platforms, bool is implemented as a single byte
having a value of either 0 or 1. However, it might make more sense to
have a flags field here...
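
Something along these lines, say (all names made up):

        #include <linux/bitops.h>

        /* bit numbers in a single flags word */
        #define KTHREAD_SHOULD_STOP     0
        #define KTHREAD_SHOULD_PARK     1
        #define KTHREAD_IS_PARKED       2
        #define KTHREAD_IS_PERCPU       3

        struct kthread {
                unsigned long flags;
        };

        static inline void kthread_request_park(struct kthread *k)
        {
                /* atomic, and no questions about bool's storage */
                set_bit(KTHREAD_SHOULD_PARK, &k->flags);
        }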

-hpa

2012-06-08 09:20:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86/cpu hotplug: Wake up offline CPU via mwait or nmi

On Wed, 2012-06-06 at 16:20 -0700, Paul E. McKenney wrote:
>
> Thomas's strong model also supports a strong form of idle, as well as the
> while(1) stuff, both of which have their uses.

Just the while(1); the idle thing is independent.

2012-06-10 06:38:50

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] kthread: Implement park/unpark facility

On Tue, 5 Jun 2012 15:41:48 +0200 (CEST), Thomas Gleixner <[email protected]> wrote:
> Subject: kthread: Implement park/unpark facility
> From: Thomas Gleixner <[email protected]>
> Date: Wed, 18 Apr 2012 16:37:40 +0200
>
> To avoid the full teardown/setup of per cpu kthreads in the case of
> cpu hot(un)plug, provide a facility which allows putting the kthread
> into a park position and unparking it when the cpu comes online again.

Like the idea, but the API is awkward. Now you've made returning from a
thread do different things depending on whether it was parked or not.

How about just have the thread call "kthread_parkme()" which only
returns if/when the thread is unparked?

So the thread does:

        while (!kthread_should_stop()) {
                if (kthread_should_park()) {
                        ... cleanup ...
                        kthread_parkme();
                        ... restore ...
                }
                ... work ...
        }

Threads which never exit have "for (;;)" instead of while
(!kthread_should_stop()).
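
And the hotplug side would then be roughly (a sketch, assuming the
facility grows kthread_park()/kthread_unpark() entry points; my_thread
is a made-up per-cpu task pointer):

        #include <linux/kthread.h>
        #include <linux/percpu.h>

        static DEFINE_PER_CPU(struct task_struct *, my_thread);

        /* park on cpu-down, unpark on cpu-up, instead of a full
         * kthread_stop()/kthread_create() cycle each time */
        static void my_cpu_going_down(int cpu)
        {
                kthread_park(per_cpu(my_thread, cpu));
        }

        static void my_cpu_came_up(int cpu)
        {
                kthread_unpark(per_cpu(my_thread, cpu));
        }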

Cheers,
Rusty.

2012-06-11 09:26:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] kthread: Implement park/unpark facility

On Sun, 10 Jun 2012, Rusty Russell wrote:

> On Tue, 5 Jun 2012 15:41:48 +0200 (CEST), Thomas Gleixner <[email protected]> wrote:
> > Subject: kthread: Implement park/unpark facility
> > From: Thomas Gleixner <[email protected]>
> > Date: Wed, 18 Apr 2012 16:37:40 +0200
> >
> > To avoid the full teardown/setup of per cpu kthreads in the case of
> > cpu hot(un)plug, provide a facility which allows putting the kthread
> > into a park position and unparking it when the cpu comes online again.
>
> Like the idea, but the API is awkward. Now you've made returning from a
> thread do different things depending on whether it was parked or not.
>
> How about just have the thread call "kthread_parkme()" which only
> returns if/when the thread is unparked?
>
> So the thread does:
>
>         while (!kthread_should_stop()) {
>                 if (kthread_should_park()) {
>                         ... cleanup ...
>                         kthread_parkme();
>                         ... restore ...
>                 }
>                 ... work ...
>         }
>
> Threads which never exit have "for (;;)" instead of while
> (!kthread_should_stop()).

Makes sense. Will have a go at that.

One other thing I'm thinking about is avoiding the synchronous
parking mechanism, i.e. just telling the thread to park and checking
the park state later before going further down.
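
I.e. roughly (made-up names, hand-waving the wakeup details; assumes
struct kthread carries a task pointer and a waitqueue):

        /* request side: fire and forget */
        void kthread_park_async(struct kthread *k)
        {
                set_bit(KTHREAD_SHOULD_PARK, &k->flags);
                wake_up_process(k->task);       /* kick it if sleeping */
        }

        /* later, before taking the cpu further down */
        void kthread_park_wait(struct kthread *k)
        {
                wait_event(k->parked_wq,
                           test_bit(KTHREAD_IS_PARKED, &k->flags));
        }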

Thanks,

tglx

2012-06-12 09:19:15

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] kthread: Implement park/unpark facility

On Mon, 11 Jun 2012 11:26:12 +0200 (CEST), Thomas Gleixner <[email protected]> wrote:
> Makes sense. Will have a go at that.
>
> One other thing I'm thinking about is avoiding the synchronous
> parking mechanism, i.e. just telling the thread to park and checking
> the park state later before going further down.

Yes, definitely a nice optimization, and should be pretty simple.

Thanks,
Rusty.