complete() should not be used on an offlined CPU. Rewrite the
wait-complete mechanism with wait_on_bit_timeout().
The CPU triggering hot unplug (e.g. CPU0) will loop until the bit is
cleared. In each iteration schedule_timeout() is used, with an initial
sleep time of 1 ms that is then increased by roughly 10% per iteration.
The dying CPU will clear the bit, which is safe in that context.
This fixes the following RCU warning on ARMv8 (Exynos 4412, Trats2)
during suspend to RAM:
[ 31.113925] ===============================
[ 31.113928] [ INFO: suspicious RCU usage. ]
[ 31.113935] 3.19.0-rc7-next-20150203 #1914 Not tainted
[ 31.113938] -------------------------------
[ 31.113943] kernel/sched/fair.c:4740 suspicious rcu_dereference_check() usage!
[ 31.113946]
[ 31.113946] other info that might help us debug this:
[ 31.113946]
[ 31.113952]
[ 31.113952] RCU used illegally from offline CPU!
[ 31.113952] rcu_scheduler_active = 1, debug_locks = 0
[ 31.113957] 3 locks held by swapper/1/0:
[ 31.113988] #0: ((cpu_died).wait.lock){......}, at: [<c005a114>] complete+0x14/0x44
[ 31.114012] #1: (&p->pi_lock){-.-.-.}, at: [<c004a790>] try_to_wake_up+0x28/0x300
[ 31.114035] #2: (rcu_read_lock){......}, at: [<c004f1b8>] select_task_rq_fair+0x5c/0xa04
[ 31.114038]
[ 31.114038] stack backtrace:
[ 31.114046] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.19.0-rc7-next-20150203 #1914
[ 31.114050] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[ 31.114076] [<c0014ce4>] (unwind_backtrace) from [<c0011c30>] (show_stack+0x10/0x14)
[ 31.114091] [<c0011c30>] (show_stack) from [<c04dc048>] (dump_stack+0x70/0xbc)
[ 31.114105] [<c04dc048>] (dump_stack) from [<c004f83c>] (select_task_rq_fair+0x6e0/0xa04)
[ 31.114118] [<c004f83c>] (select_task_rq_fair) from [<c004a83c>] (try_to_wake_up+0xd4/0x300)
[ 31.114129] [<c004a83c>] (try_to_wake_up) from [<c00598a0>] (__wake_up_common+0x4c/0x80)
[ 31.114140] [<c00598a0>] (__wake_up_common) from [<c00598e8>] (__wake_up_locked+0x14/0x1c)
[ 31.114150] [<c00598e8>] (__wake_up_locked) from [<c005a134>] (complete+0x34/0x44)
[ 31.114167] [<c005a134>] (complete) from [<c04d6ca4>] (cpu_die+0x24/0x84)
[ 31.114179] [<c04d6ca4>] (cpu_die) from [<c005a508>] (cpu_startup_entry+0x328/0x358)
[ 31.114189] [<c005a508>] (cpu_startup_entry) from [<40008784>] (0x40008784)
[ 31.114226] CPU1: shutdown
Signed-off-by: Krzysztof Kozlowski <[email protected]>
---
Changes since v1:
1. Use adaptive sleep time when waiting for CPU die (idea and code
from Paul E. McKenney). Paul also acked the patch but I made even more
changes.
2. Add another bit (CPU_DIE_TIMEOUT_BIT) for synchronizing power down
failure in case:
CPU0 (killing) CPUx (killed)
wait_for_cpu_die
timeout
cpu_die()
clear_bit()
self power down
In this case the bit would be cleared and the CPU would be powered down,
introducing wrong behavior in the next power down sequence (CPU0 would
see the bit cleared).
I think such a race is still possible, but it has been narrowed to a
very short time frame. Any CPU up will reset the bit to proper values.
3. Remove the pre-test for the bit in wait_for_cpu_die(). Suggested by
Stephen Boyd. This further simplifies the wait_for_cpu_die() loop.
4. Update comment for second flush_cache_louis() in dying CPU.
Suggested by Stephen Boyd.
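As a userspace illustration of the two-bit handshake described in items 1 and 2 (a hypothetical pthreads/stdatomic sketch, not kernel code; names like run_handshake_demo() are invented here, and the atomics stand in for the kernel's set_bit()/clear_bit() plus barriers):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Userspace stand-ins for the patch's wait_cpu_die bits. */
#define CPU_DIE_WAIT_BIT    (1u << 0)
#define CPU_DIE_TIMEOUT_BIT (1u << 1)

static atomic_uint wait_cpu_die;

/* "Dying CPU": clear the wait bit; if the killer already timed out,
 * re-arm the bits for the next power-down sequence. */
static void *dying_cpu(void *arg)
{
    (void)arg;
    atomic_fetch_and(&wait_cpu_die, ~CPU_DIE_WAIT_BIT);
    if (atomic_load(&wait_cpu_die) & CPU_DIE_TIMEOUT_BIT)
        atomic_store(&wait_cpu_die, CPU_DIE_WAIT_BIT);
    return NULL;
}

/* "Killing CPU": poll for the wait bit to clear, 1 ms at a time.
 * Returns 0 if the bit cleared; on timeout, sets the timeout bit
 * (as __cpu_die() does in the patch) and returns -1. */
static int killer_wait(int timeout_ms)
{
    struct timespec one_ms = { 0, 1000000L };

    while (timeout_ms-- > 0) {
        if (!(atomic_load(&wait_cpu_die) & CPU_DIE_WAIT_BIT))
            return 0;               /* "CPU died" */
        nanosleep(&one_ms, NULL);
    }
    atomic_fetch_or(&wait_cpu_die, CPU_DIE_TIMEOUT_BIT);
    return -1;
}

int run_handshake_demo(void)
{
    pthread_t t;
    int ret;

    atomic_store(&wait_cpu_die, CPU_DIE_WAIT_BIT); /* as in __cpu_up() */
    pthread_create(&t, NULL, dying_cpu, NULL);
    ret = killer_wait(5000);
    pthread_join(t, NULL);
    return ret;
}
```

The residual race noted above maps to the window between the dying thread's clear of the wait bit and its test of the timeout bit.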
---
arch/arm/kernel/smp.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 75 insertions(+), 6 deletions(-)
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 86ef244c5a24..0f6f1371739d 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -26,6 +26,7 @@
#include <linux/completion.h>
#include <linux/cpufreq.h>
#include <linux/irq_work.h>
+#include <linux/wait.h>
#include <linux/atomic.h>
#include <asm/smp.h>
@@ -76,6 +77,10 @@ enum ipi_msg_type {
static DECLARE_COMPLETION(cpu_running);
+#define CPU_DIE_WAIT_BIT 0
+#define CPU_DIE_TIMEOUT_BIT 1
+static unsigned long wait_cpu_die;
+
static struct smp_operations smp_ops;
void __init smp_set_ops(struct smp_operations *ops)
@@ -133,6 +138,9 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
pr_err("CPU%u: failed to boot: %d\n", cpu, ret);
}
+ set_bit(CPU_DIE_WAIT_BIT, &wait_cpu_die);
+ clear_bit(CPU_DIE_TIMEOUT_BIT, &wait_cpu_die);
+ smp_mb__after_atomic();
memset(&secondary_data, 0, sizeof(secondary_data));
return ret;
@@ -213,7 +221,40 @@ int __cpu_disable(void)
return 0;
}
-static DECLARE_COMPLETION(cpu_died);
+static inline int wait_on_die_bit_timeout(int sleep_ms)
+{
+ smp_mb__before_atomic();
+
+ return out_of_line_wait_on_bit_timeout(&wait_cpu_die,
+ CPU_DIE_WAIT_BIT, bit_wait_timeout,
+ TASK_UNINTERRUPTIBLE,
+ msecs_to_jiffies(sleep_ms));
+}
+
+/*
+ * Wait for 5000 ms for 'wait_cpu_die' bit to be cleared.
+ * Actually the real wait time may be longer because bit_wait_timeout
+ * calls schedule() in each iteration.
+ *
+ * Returns 0 if bit was cleared (CPU died) or non-zero
+ * otherwise (1 or negative ERRNO).
+ */
+static int wait_for_cpu_die(void)
+{
+ int ms_left = 5000, sleep_ms = 1, ret;
+
+ might_sleep();
+
+ while ((ret = wait_on_die_bit_timeout(sleep_ms))) {
+ ms_left -= sleep_ms;
+ if (!ret || (ms_left <= 0))
+ break;
+
+ sleep_ms = DIV_ROUND_UP(sleep_ms * 11, 10);
+ }
+
+ return ret;
+}
/*
* called on the thread which is asking for a CPU to be shutdown -
@@ -221,7 +262,9 @@ static DECLARE_COMPLETION(cpu_died);
*/
void __cpu_die(unsigned int cpu)
{
- if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) {
+ if (wait_for_cpu_die()) {
+ set_bit(CPU_DIE_TIMEOUT_BIT, &wait_cpu_die);
+ smp_mb__after_atomic();
pr_err("CPU%u: cpu didn't die\n", cpu);
return;
}
@@ -236,6 +279,11 @@ void __cpu_die(unsigned int cpu)
*/
if (!platform_cpu_kill(cpu))
pr_err("CPU%u: unable to kill\n", cpu);
+
+ /* Prepare the bits for the next power down sequence */
+ set_bit(CPU_DIE_WAIT_BIT, &wait_cpu_die);
+ clear_bit(CPU_DIE_TIMEOUT_BIT, &wait_cpu_die);
+ smp_mb__after_atomic();
}
/*
@@ -250,6 +298,8 @@ void __ref cpu_die(void)
{
unsigned int cpu = smp_processor_id();
+ WARN_ON(!test_bit(CPU_DIE_WAIT_BIT, &wait_cpu_die));
+
idle_task_exit();
local_irq_disable();
@@ -267,12 +317,23 @@ void __ref cpu_die(void)
* this returns, power and/or clocks can be removed at any point
* from this CPU and its cache by platform_cpu_kill().
*/
- complete(&cpu_died);
+ clear_bit(CPU_DIE_WAIT_BIT, &wait_cpu_die);
+ smp_mb__after_atomic();
+
+ /*
+ * If the killing CPU reached the timeout then this thread must set the
+ * dying bit for the next power down sequence.
+ */
+ if (test_bit(CPU_DIE_TIMEOUT_BIT, &wait_cpu_die)) {
+ clear_bit(CPU_DIE_TIMEOUT_BIT, &wait_cpu_die);
+ set_bit(CPU_DIE_WAIT_BIT, &wait_cpu_die);
+ smp_mb__after_atomic();
+ }
/*
- * Ensure that the cache lines associated with that completion are
- * written out. This covers the case where _this_ CPU is doing the
- * powering down, to ensure that the completion is visible to the
+ * Ensure that the cache lines associated with clearing 'wait_cpu_die'
+ * bit are written out. This covers the case where _this_ CPU is doing
+ * the powering down, to ensure that the bit clearing is visible to the
* CPU waiting for this one.
*/
flush_cache_louis();
@@ -296,6 +357,14 @@ void __ref cpu_die(void)
cpu);
/*
+ * There is a chance that the killing CPU reached the timeout in
+ * __cpu_die() so set the bit for next power down sequence.
+ */
+ set_bit(CPU_DIE_WAIT_BIT, &wait_cpu_die);
+ clear_bit(CPU_DIE_TIMEOUT_BIT, &wait_cpu_die);
+ smp_mb__after_atomic();
+
+ /*
* Do not return to the idle loop - jump back to the secondary
* cpu initialisation. There's some initialisation which needs
* to be repeated to undo the effects of taking the CPU offline.
--
1.9.1
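The adaptive sleep in wait_for_cpu_die() above can be modelled in a stand-alone userspace sketch (hypothetical names; plain polling with nanosleep() instead of the kernel's wait-on-bit machinery):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int die_bit = ATOMIC_VAR_INIT(1); /* stands in for CPU_DIE_WAIT_BIT */

static void msleep_ms(int ms)
{
    struct timespec ts = { ms / 1000, (long)(ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void *dying_side(void *arg)
{
    (void)arg;
    msleep_ms(20);                  /* pretend the shutdown takes ~20 ms */
    atomic_store(&die_bit, 0);
    return NULL;
}

/* Mirror of wait_for_cpu_die(): start at 1 ms, then grow the sleep by
 * roughly 10% per iteration (1, 2, 3, ... with round-up semantics),
 * giving up after about 5000 ms.  Returns 0 if the bit cleared. */
static int wait_for_die_bit(void)
{
    int ms_left = 5000, sleep_ms = 1;

    while (atomic_load(&die_bit)) {
        if (ms_left <= 0)
            return -1;              /* timeout */
        msleep_ms(sleep_ms);
        ms_left -= sleep_ms;
        sleep_ms = (sleep_ms * 11 + 9) / 10;  /* DIV_ROUND_UP(sleep_ms * 11, 10) */
    }
    return 0;
}

int run_backoff_demo(void)
{
    pthread_t t;
    int ret;

    pthread_create(&t, NULL, dying_side, NULL);
    ret = wait_for_die_bit();
    pthread_join(t, NULL);
    return ret;
}
```

The round-up keeps the step strictly growing, so a fast CPU death is detected within a few milliseconds while a slow one costs only a handful of extra wakeups.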
On Thu, Feb 05, 2015 at 11:14:30AM +0100, Krzysztof Kozlowski wrote:
> The complete() should not be used on offlined CPU. Rewrite the
> wait-complete mechanism with wait_on_bit_timeout().
Yuck.
I think that the IPI idea would be far better, and a much smaller patch.
We can continue using the completions, but instead of running the
completion on the dying CPU, the dying CPU triggers an IPI which does
the completion on the requesting CPU.
You're modifying arch/arm/kernel/smp.c, so you just hook it directly
into the IPI mechanism without any registration required.
We can also kill the second cache flush by the dying CPU - as we're
not writing to memory anymore by calling complete() after the first
cache flush, so this will probably make CPU hotplug fractionally faster
too.
(You'll get some fuzz with this patch as I have the NMI backtrace stuff
in my kernel.)
Something like this - only build tested so far (waiting for the compile
to finish...):
arch/arm/kernel/smp.c | 43 ++++++++++++++++++++++++-------------------
1 file changed, 24 insertions(+), 19 deletions(-)
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 194df2f1aa87..c623e27a9c85 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -73,6 +73,9 @@ enum ipi_msg_type {
IPI_IRQ_WORK,
IPI_COMPLETION,
IPI_CPU_BACKTRACE,
+#ifdef CONFIG_HOTPLUG_CPU
+ IPI_CPU_DEAD,
+#endif
};
/* For reliability, we're prepared to waste bits here. */
@@ -88,6 +91,14 @@ void __init smp_set_ops(struct smp_operations *ops)
smp_ops = *ops;
};
+static void (*__smp_cross_call)(const struct cpumask *, unsigned int);
+
+void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
+{
+ if (!__smp_cross_call)
+ __smp_cross_call = fn;
+}
+
static unsigned long get_arch_pgd(pgd_t *pgd)
{
phys_addr_t pgdir = virt_to_idmap(pgd);
@@ -267,19 +278,13 @@ void __ref cpu_die(void)
flush_cache_louis();
/*
- * Tell __cpu_die() that this CPU is now safe to dispose of. Once
- * this returns, power and/or clocks can be removed at any point
+ * Tell __cpu_die() that this CPU is now safe to dispose of. We
+ * do this via an IPI to any online CPU - it doesn't matter, we
+ * just need another CPU to run the completion. Once this IPI
+ * has been sent, power and/or clocks can be removed at any point
* from this CPU and its cache by platform_cpu_kill().
*/
- complete(&cpu_died);
-
- /*
- * Ensure that the cache lines associated with that completion are
- * written out. This covers the case where _this_ CPU is doing the
- * powering down, to ensure that the completion is visible to the
- * CPU waiting for this one.
- */
- flush_cache_louis();
+ __smp_cross_call(cpumask_of(cpumask_any(cpu_online_mask)), IPI_CPU_DEAD);
/*
* The actual CPU shutdown procedure is at least platform (if not
@@ -442,14 +447,6 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
}
}
-static void (*__smp_cross_call)(const struct cpumask *, unsigned int);
-
-void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
-{
- if (!__smp_cross_call)
- __smp_cross_call = fn;
-}
-
static const char *ipi_types[NR_IPI] __tracepoint_string = {
#define S(x,s) [x] = s
S(IPI_WAKEUP, "CPU wakeup interrupts"),
@@ -648,6 +645,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
irq_exit();
break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case IPI_CPU_DEAD:
+ irq_enter();
+ complete(&cpu_died);
+ irq_exit();
+ break;
+#endif
+
default:
pr_crit("CPU%u: Unknown IPI message 0x%x\n",
cpu, ipinr);
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
On Thu, Feb 05, 2015 at 10:14:30AM +0000, Krzysztof Kozlowski wrote:
> The complete() should not be used on offlined CPU. Rewrite the
> wait-complete mechanism with wait_on_bit_timeout().
>
> The CPU triggering hot unplug (e.g. CPU0) will loop until some bit is
> cleared. In each iteration schedule_timeout() is used with initial sleep
> time of 1 ms. Later it is increased to 10 ms.
>
> The dying CPU will clear the bit which is safe in that context.
>
> This fixes following RCU warning on ARMv8 (Exynos 4412, Trats2) during
> suspend to RAM:
Nit: isn't Exynos4412 a quad-A9 (ARMv7 rather than ARMv8)?
> [ 31.113925] ===============================
> [ 31.113928] [ INFO: suspicious RCU usage. ]
> [ 31.113935] 3.19.0-rc7-next-20150203 #1914 Not tainted
> [ 31.113938] -------------------------------
> [ 31.113943] kernel/sched/fair.c:4740 suspicious rcu_dereference_check() usage!
> [ 31.113946]
> [ 31.113946] other info that might help us debug this:
> [ 31.113946]
> [ 31.113952]
> [ 31.113952] RCU used illegally from offline CPU!
> [ 31.113952] rcu_scheduler_active = 1, debug_locks = 0
> [ 31.113957] 3 locks held by swapper/1/0:
> [ 31.113988] #0: ((cpu_died).wait.lock){......}, at: [<c005a114>] complete+0x14/0x44
> [ 31.114012] #1: (&p->pi_lock){-.-.-.}, at: [<c004a790>] try_to_wake_up+0x28/0x300
> [ 31.114035] #2: (rcu_read_lock){......}, at: [<c004f1b8>] select_task_rq_fair+0x5c/0xa04
> [ 31.114038]
> [ 31.114038] stack backtrace:
> [ 31.114046] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.19.0-rc7-next-20150203 #1914
> [ 31.114050] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> [ 31.114076] [<c0014ce4>] (unwind_backtrace) from [<c0011c30>] (show_stack+0x10/0x14)
> [ 31.114091] [<c0011c30>] (show_stack) from [<c04dc048>] (dump_stack+0x70/0xbc)
> [ 31.114105] [<c04dc048>] (dump_stack) from [<c004f83c>] (select_task_rq_fair+0x6e0/0xa04)
> [ 31.114118] [<c004f83c>] (select_task_rq_fair) from [<c004a83c>] (try_to_wake_up+0xd4/0x300)
> [ 31.114129] [<c004a83c>] (try_to_wake_up) from [<c00598a0>] (__wake_up_common+0x4c/0x80)
> [ 31.114140] [<c00598a0>] (__wake_up_common) from [<c00598e8>] (__wake_up_locked+0x14/0x1c)
> [ 31.114150] [<c00598e8>] (__wake_up_locked) from [<c005a134>] (complete+0x34/0x44)
> [ 31.114167] [<c005a134>] (complete) from [<c04d6ca4>] (cpu_die+0x24/0x84)
> [ 31.114179] [<c04d6ca4>] (cpu_die) from [<c005a508>] (cpu_startup_entry+0x328/0x358)
> [ 31.114189] [<c005a508>] (cpu_startup_entry) from [<40008784>] (0x40008784)
> [ 31.114226] CPU1: shutdown
>
> Signed-off-by: Krzysztof Kozlowski <[email protected]>
>
> ---
> Changes since v1:
> 1. Use adaptive sleep time when waiting for CPU die (idea and code
> from Paul E. McKenney). Paul also acked the patch but I made even more
> changes.
>
> 2. Add another bit (CPU_DIE_TIMEOUT_BIT) for synchronizing power down
> failure in case:
> CPU0 (killing) CPUx (killed)
> wait_for_cpu_die
> timeout
> cpu_die()
> clear_bit()
> self power down
>
> In this case the bit would be cleared and CPU would be powered down
> introducing wrong behavior in next power down sequence (CPU0 would
> see the bit cleared).
> I think that such race is still possible but was narrowed to very
> short time frame. Any CPU up will reset the bit to proper values.
In the case of shutting down 2 CPUs in quick succession (without an
intervening boot of a CPU), surely this does not solve the potential
race on the wait_cpu_die variable?
I think we instead need a percpu synchronisation variable, which would
prevent racing on the value between CPUs, and a CPU would have to be
brought up before we could decide to kill it again. With that I think we
only need a single bit, too.
Thanks,
Mark.
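Mark's per-CPU suggestion could look something like this userspace sketch (one flag per CPU, so two CPUs dying back-to-back never touch the same word; all names here are hypothetical, not from the patch):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define NR_CPUS 4

/* One die flag per CPU: no sharing, so shutting down two CPUs in quick
 * succession cannot race on a single global word. */
static atomic_int cpu_die_flag[NR_CPUS];

static void cpu_boot(int cpu)           /* done on the __cpu_up() path */
{
    atomic_store(&cpu_die_flag[cpu], 1);
}

static void *cpu_die_thread(void *arg)  /* a dying CPU clears only its slot */
{
    int cpu = *(int *)arg;

    atomic_store(&cpu_die_flag[cpu], 0);
    return NULL;
}

static int wait_cpu_dead(int cpu, int timeout_ms)
{
    struct timespec one_ms = { 0, 1000000L };

    while (timeout_ms-- > 0) {
        if (!atomic_load(&cpu_die_flag[cpu]))
            return 0;
        nanosleep(&one_ms, NULL);
    }
    return -1;
}

int run_percpu_demo(void)
{
    pthread_t t1, t2;
    int c1 = 1, c2 = 2;

    cpu_boot(c1);
    cpu_boot(c2);
    pthread_create(&t1, NULL, cpu_die_thread, &c1);
    pthread_create(&t2, NULL, cpu_die_thread, &c2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return wait_cpu_dead(c1, 100) | wait_cpu_dead(c2, 100);
}
```

Because a CPU must be booted (its flag re-set) before it can be killed again, a single bit per CPU is enough.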
On czw, 2015-02-05 at 10:53 +0000, Mark Rutland wrote:
> On Thu, Feb 05, 2015 at 10:14:30AM +0000, Krzysztof Kozlowski wrote:
> > The complete() should not be used on offlined CPU. Rewrite the
> > wait-complete mechanism with wait_on_bit_timeout().
> >
> > The CPU triggering hot unplug (e.g. CPU0) will loop until some bit is
> > cleared. In each iteration schedule_timeout() is used with initial sleep
> > time of 1 ms. Later it is increased to 10 ms.
> >
> > The dying CPU will clear the bit which is safe in that context.
> >
> > This fixes following RCU warning on ARMv8 (Exynos 4412, Trats2) during
> > suspend to RAM:
>
> Nit: isn't Exynos4412 a quad-A9 (ARMv7 rather than ARMv8)?
Yes, it should be ARMv7. However, this should still be fixed for both
architectures.
>
> > [ 31.113925] ===============================
> > [ 31.113928] [ INFO: suspicious RCU usage. ]
> > [ 31.113935] 3.19.0-rc7-next-20150203 #1914 Not tainted
> > [ 31.113938] -------------------------------
> > [ 31.113943] kernel/sched/fair.c:4740 suspicious rcu_dereference_check() usage!
> > [ 31.113946]
> > [ 31.113946] other info that might help us debug this:
> > [ 31.113946]
> > [ 31.113952]
> > [ 31.113952] RCU used illegally from offline CPU!
> > [ 31.113952] rcu_scheduler_active = 1, debug_locks = 0
> > [ 31.113957] 3 locks held by swapper/1/0:
> > [ 31.113988] #0: ((cpu_died).wait.lock){......}, at: [<c005a114>] complete+0x14/0x44
> > [ 31.114012] #1: (&p->pi_lock){-.-.-.}, at: [<c004a790>] try_to_wake_up+0x28/0x300
> > [ 31.114035] #2: (rcu_read_lock){......}, at: [<c004f1b8>] select_task_rq_fair+0x5c/0xa04
> > [ 31.114038]
> > [ 31.114038] stack backtrace:
> > [ 31.114046] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.19.0-rc7-next-20150203 #1914
> > [ 31.114050] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> > [ 31.114076] [<c0014ce4>] (unwind_backtrace) from [<c0011c30>] (show_stack+0x10/0x14)
> > [ 31.114091] [<c0011c30>] (show_stack) from [<c04dc048>] (dump_stack+0x70/0xbc)
> > [ 31.114105] [<c04dc048>] (dump_stack) from [<c004f83c>] (select_task_rq_fair+0x6e0/0xa04)
> > [ 31.114118] [<c004f83c>] (select_task_rq_fair) from [<c004a83c>] (try_to_wake_up+0xd4/0x300)
> > [ 31.114129] [<c004a83c>] (try_to_wake_up) from [<c00598a0>] (__wake_up_common+0x4c/0x80)
> > [ 31.114140] [<c00598a0>] (__wake_up_common) from [<c00598e8>] (__wake_up_locked+0x14/0x1c)
> > [ 31.114150] [<c00598e8>] (__wake_up_locked) from [<c005a134>] (complete+0x34/0x44)
> > [ 31.114167] [<c005a134>] (complete) from [<c04d6ca4>] (cpu_die+0x24/0x84)
> > [ 31.114179] [<c04d6ca4>] (cpu_die) from [<c005a508>] (cpu_startup_entry+0x328/0x358)
> > [ 31.114189] [<c005a508>] (cpu_startup_entry) from [<40008784>] (0x40008784)
> > [ 31.114226] CPU1: shutdown
> >
> > Signed-off-by: Krzysztof Kozlowski <[email protected]>
> >
> > ---
> > Changes since v1:
> > 1. Use adaptive sleep time when waiting for CPU die (idea and code
> > from Paul E. McKenney). Paul also acked the patch but I made even more
> > changes.
> >
> > 2. Add another bit (CPU_DIE_TIMEOUT_BIT) for synchronizing power down
> > failure in case:
> > CPU0 (killing) CPUx (killed)
> > wait_for_cpu_die
> > timeout
> > cpu_die()
> > clear_bit()
> > self power down
> >
> > In this case the bit would be cleared and CPU would be powered down
> > introducing wrong behavior in next power down sequence (CPU0 would
> > see the bit cleared).
> > I think that such race is still possible but was narrowed to very
> > short time frame. Any CPU up will reset the bit to proper values.
>
> In the case of shutting down 2 CPUs in quick succession (without an
> intervening boot of a CPU), surely this does not solve the potential
> race on the wait_cpu_die variable?
Right, the race is not fully fixed.
>
> I think we instead need a percpu synchronisation variable, which would
> prevent racing on the value between CPUs, and a CPU would have to be
> brought up before we could decide to kill it again. With that I think we
> only need a single bit, too.
You mean a single bit-value per cpu?
Best regards,
Krzysztof
>
> Thanks,
> Mark.
On czw, 2015-02-05 at 10:50 +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 11:14:30AM +0100, Krzysztof Kozlowski wrote:
> > The complete() should not be used on offlined CPU. Rewrite the
> > wait-complete mechanism with wait_on_bit_timeout().
>
> Yuck.
>
> I think that the IPI idea would be far better, and a much smaller patch.
> We can continue using the completions, but instead of running the
> completion on the dying CPU, the dying CPU triggers an IPI which does
> the completion on the requesting CPU.
>
> You're modifying arch/arm/kernel/smp.c, so you just hook it directly
> into the IPI mechanism without any registration required.
>
> We can also kill the second cache flush by the dying CPU - as we're
> not writing to memory anymore by calling complete() after the first
> cache flush, so this will probably make CPU hotplug fractionally faster
> too.
>
> (You'll get some fuzz with this patch as I have the NMI backtrace stuff
> in my kernel.)
>
> Something like this - only build tested so far (waiting for the compile
> to finish...):
I am looking into IPI also. Maybe just smp_call_function_any() would be
enough?
Do you want to continue with the IPI version patch?
Best regards,
Krzysztof
>
> arch/arm/kernel/smp.c | 43 ++++++++++++++++++++++++-------------------
> 1 file changed, 24 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 194df2f1aa87..c623e27a9c85 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -73,6 +73,9 @@ enum ipi_msg_type {
> IPI_IRQ_WORK,
> IPI_COMPLETION,
> IPI_CPU_BACKTRACE,
> +#ifdef CONFIG_HOTPLUG_CPU
> + IPI_CPU_DEAD,
> +#endif
> };
>
> /* For reliability, we're prepared to waste bits here. */
> @@ -88,6 +91,14 @@ void __init smp_set_ops(struct smp_operations *ops)
> smp_ops = *ops;
> };
>
> +static void (*__smp_cross_call)(const struct cpumask *, unsigned int);
> +
> +void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
> +{
> + if (!__smp_cross_call)
> + __smp_cross_call = fn;
> +}
> +
> static unsigned long get_arch_pgd(pgd_t *pgd)
> {
> phys_addr_t pgdir = virt_to_idmap(pgd);
> @@ -267,19 +278,13 @@ void __ref cpu_die(void)
> flush_cache_louis();
>
> /*
> - * Tell __cpu_die() that this CPU is now safe to dispose of. Once
> - * this returns, power and/or clocks can be removed at any point
> + * Tell __cpu_die() that this CPU is now safe to dispose of. We
> + * do this via an IPI to any online CPU - it doesn't matter, we
> + * just need another CPU to run the completion. Once this IPI
> + * has been sent, power and/or clocks can be removed at any point
> * from this CPU and its cache by platform_cpu_kill().
> */
> - complete(&cpu_died);
> -
> - /*
> - * Ensure that the cache lines associated with that completion are
> - * written out. This covers the case where _this_ CPU is doing the
> - * powering down, to ensure that the completion is visible to the
> - * CPU waiting for this one.
> - */
> - flush_cache_louis();
> + __smp_cross_call(cpumask_of(cpumask_any(cpu_online_mask)), IPI_CPU_DEAD);
>
> /*
> * The actual CPU shutdown procedure is at least platform (if not
> @@ -442,14 +447,6 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> }
> }
>
> -static void (*__smp_cross_call)(const struct cpumask *, unsigned int);
> -
> -void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
> -{
> - if (!__smp_cross_call)
> - __smp_cross_call = fn;
> -}
> -
> static const char *ipi_types[NR_IPI] __tracepoint_string = {
> #define S(x,s) [x] = s
> S(IPI_WAKEUP, "CPU wakeup interrupts"),
> @@ -648,6 +645,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
> irq_exit();
> break;
>
> +#ifdef CONFIG_HOTPLUG_CPU
> + case IPI_CPU_DEAD:
> + irq_enter();
> + complete(&cpu_died);
> + irq_exit();
> + break;
> +#endif
> +
> default:
> pr_crit("CPU%u: Unknown IPI message 0x%x\n",
> cpu, ipinr);
>
On Thu, Feb 05, 2015 at 12:00:58PM +0100, Krzysztof Kozlowski wrote:
> On czw, 2015-02-05 at 10:50 +0000, Russell King - ARM Linux wrote:
> > On Thu, Feb 05, 2015 at 11:14:30AM +0100, Krzysztof Kozlowski wrote:
> > > The complete() should not be used on offlined CPU. Rewrite the
> > > wait-complete mechanism with wait_on_bit_timeout().
> >
> > Yuck.
> >
> > I think that the IPI idea would be far better, and a much smaller patch.
> > We can continue using the completions, but instead of running the
> > completion on the dying CPU, the dying CPU triggers an IPI which does
> > the completion on the requesting CPU.
> >
> > You're modifying arch/arm/kernel/smp.c, so you just hook it directly
> > into the IPI mechanism without any registration required.
> >
> > We can also kill the second cache flush by the dying CPU - as we're
> > not writing to memory anymore by calling complete() after the first
> > cache flush, so this will probably make CPU hotplug fractionally faster
> > too.
> >
> > (You'll get some fuzz with this patch as I have the NMI backtrace stuff
> > in my kernel.)
> >
> > Something like this - only build tested so far (waiting for the compile
> > to finish...):
>
> I am looking into IPI also. Maybe just smp_call_function_any() would be
> enough?
I really don't like that idea. Let's keep it simple, and avoid calling
code unnecessarily which could end up with RCU issues - and avoid the
need for a second L1 cache flush.
This does seem to work on iMX6 (if you ignore the lockdep splat caused
by that frigging CMA lock - yes, that issue I reported earlier last
year still exists and is still unsolved, which is really disgusting.)
Hi Russell,
On Thu, Feb 05, 2015 at 10:50:35AM +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 11:14:30AM +0100, Krzysztof Kozlowski wrote:
> > The complete() should not be used on offlined CPU. Rewrite the
> > wait-complete mechanism with wait_on_bit_timeout().
>
> Yuck.
>
> I think that the IPI idea would be far better, and a much smaller patch.
> We can continue using the completions, but instead of running the
> completion on the dying CPU, the dying CPU triggers an IPI which does
> the completion on the requesting CPU.
This does look _much_ nicer than the bitmask approach.
[...]
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 194df2f1aa87..c623e27a9c85 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -73,6 +73,9 @@ enum ipi_msg_type {
> IPI_IRQ_WORK,
> IPI_COMPLETION,
> IPI_CPU_BACKTRACE,
> +#ifdef CONFIG_HOTPLUG_CPU
> + IPI_CPU_DEAD,
> +#endif
> };
[...]
> static const char *ipi_types[NR_IPI] __tracepoint_string = {
> #define S(x,s) [x] = s
> S(IPI_WAKEUP, "CPU wakeup interrupts"),
We'll probably want to add an entry here ("CPU teardown interrupts"?),
and bump NR_IPI in asm/hardirq.h.
Thanks,
Mark.
On Thu, Feb 05, 2015 at 11:28:05AM +0000, Mark Rutland wrote:
> Hi Russell,
>
> On Thu, Feb 05, 2015 at 10:50:35AM +0000, Russell King - ARM Linux wrote:
> > On Thu, Feb 05, 2015 at 11:14:30AM +0100, Krzysztof Kozlowski wrote:
> > > The complete() should not be used on offlined CPU. Rewrite the
> > > wait-complete mechanism with wait_on_bit_timeout().
> >
> > Yuck.
> >
> > I think that the IPI idea would be far better, and a much smaller patch.
> > We can continue using the completions, but instead of running the
> > completion on the dying CPU, the dying CPU triggers an IPI which does
> > the completion on the requesting CPU.
>
> This does look _much_ nicer than the bitmask approach.
>
> [...]
>
> > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > index 194df2f1aa87..c623e27a9c85 100644
> > --- a/arch/arm/kernel/smp.c
> > +++ b/arch/arm/kernel/smp.c
> > @@ -73,6 +73,9 @@ enum ipi_msg_type {
> > IPI_IRQ_WORK,
> > IPI_COMPLETION,
> > IPI_CPU_BACKTRACE,
> > +#ifdef CONFIG_HOTPLUG_CPU
> > + IPI_CPU_DEAD,
> > +#endif
> > };
>
> [...]
>
> > static const char *ipi_types[NR_IPI] __tracepoint_string = {
> > #define S(x,s) [x] = s
> > S(IPI_WAKEUP, "CPU wakeup interrupts"),
>
> We'll probably want to add an entry here ("CPU teardown interrupts"?),
> and bump NR_IPI in asm/hardirq.h.
I'd need to move IPI_CPU_BACKTRACE out of the way then - that'll mostly
always be zero (even if the NMI IPI happens.) I'll sort that when I
backport the patch to mainline kernels. :)
On Thu, Feb 05, 2015 at 10:50:35AM +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 11:14:30AM +0100, Krzysztof Kozlowski wrote:
> > The complete() should not be used on offlined CPU. Rewrite the
> > wait-complete mechanism with wait_on_bit_timeout().
>
> Yuck.
>
> I think that the IPI idea would be far better, and a much smaller patch.
> We can continue using the completions, but instead of running the
> completion on the dying CPU, the dying CPU triggers an IPI which does
> the completion on the requesting CPU.
>
> You're modifying arch/arm/kernel/smp.c, so you just hook it directly
> into the IPI mechanism without any registration required.
>
> We can also kill the second cache flush by the dying CPU - as we're
> not writing to memory anymore by calling complete() after the first
> cache flush, so this will probably make CPU hotplug fractionally faster
> too.
>
> (You'll get some fuzz with this patch as I have the NMI backtrace stuff
> in my kernel.)
>
> Something like this - only build tested so far (waiting for the compile
> to finish...):
Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
Thanx, Paul
> arch/arm/kernel/smp.c | 43 ++++++++++++++++++++++++-------------------
> 1 file changed, 24 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 194df2f1aa87..c623e27a9c85 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -73,6 +73,9 @@ enum ipi_msg_type {
> IPI_IRQ_WORK,
> IPI_COMPLETION,
> IPI_CPU_BACKTRACE,
> +#ifdef CONFIG_HOTPLUG_CPU
> + IPI_CPU_DEAD,
> +#endif
> };
>
> /* For reliability, we're prepared to waste bits here. */
> @@ -88,6 +91,14 @@ void __init smp_set_ops(struct smp_operations *ops)
> smp_ops = *ops;
> };
>
> +static void (*__smp_cross_call)(const struct cpumask *, unsigned int);
> +
> +void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
> +{
> + if (!__smp_cross_call)
> + __smp_cross_call = fn;
> +}
> +
> static unsigned long get_arch_pgd(pgd_t *pgd)
> {
> phys_addr_t pgdir = virt_to_idmap(pgd);
> @@ -267,19 +278,13 @@ void __ref cpu_die(void)
> flush_cache_louis();
>
> /*
> - * Tell __cpu_die() that this CPU is now safe to dispose of. Once
> - * this returns, power and/or clocks can be removed at any point
> + * Tell __cpu_die() that this CPU is now safe to dispose of. We
> + * do this via an IPI to any online CPU - it doesn't matter, we
> + * just need another CPU to run the completion. Once this IPI
> + * has been sent, power and/or clocks can be removed at any point
> * from this CPU and its cache by platform_cpu_kill().
> */
> - complete(&cpu_died);
> -
> - /*
> - * Ensure that the cache lines associated with that completion are
> - * written out. This covers the case where _this_ CPU is doing the
> - * powering down, to ensure that the completion is visible to the
> - * CPU waiting for this one.
> - */
> - flush_cache_louis();
> + __smp_cross_call(cpumask_of(cpumask_any(cpu_online_mask)), IPI_CPU_DEAD);
>
> /*
> * The actual CPU shutdown procedure is at least platform (if not
> @@ -442,14 +447,6 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> }
> }
>
> -static void (*__smp_cross_call)(const struct cpumask *, unsigned int);
> -
> -void __init set_smp_cross_call(void (*fn)(const struct cpumask *, unsigned int))
> -{
> - if (!__smp_cross_call)
> - __smp_cross_call = fn;
> -}
> -
> static const char *ipi_types[NR_IPI] __tracepoint_string = {
> #define S(x,s) [x] = s
> S(IPI_WAKEUP, "CPU wakeup interrupts"),
> @@ -648,6 +645,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
> irq_exit();
> break;
>
> +#ifdef CONFIG_HOTPLUG_CPU
> + case IPI_CPU_DEAD:
> + irq_enter();
> + complete(&cpu_died);
> + irq_exit();
> + break;
> +#endif
> +
> default:
> pr_crit("CPU%u: Unknown IPI message 0x%x\n",
> cpu, ipinr);
>
On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
Sigh... I kinda knew it wouldn't be this simple. The gic code which
actually raises the IPI takes a raw spinlock, so it's not going to be
this simple - there's a small theoretical window where we have taken
this lock, written the register to send the IPI, and then dropped the
lock - the update to the lock to release it could get lost if the
CPU power is quickly cut at that point.
Also, we _do_ need the second cache flush in place to ensure that the
unlock is seen to other CPUs.
We could work around that by taking and releasing the lock in the IPI
processing function... but this is starting to look less attractive
as the lock is private to irq-gic.c.
Well, we're very close to 3.19, we're too close to be trying to sort
this out, so I'm hoping that your changes which cause this RCU error
are *not* going in during this merge window, because we seem to have
something of a problem right now which needs more time to resolve.
On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
>
> Sigh... I kind'a new it wouldn't be this simple. The gic code which
> actually raises the IPI takes a raw spinlock, so it's not going to be
> this simple - there's a small theoretical window where we have taken
> this lock, written the register to send the IPI, and then dropped the
> lock - the update to the lock to release it could get lost if the
> CPU power is quickly cut at that point.
>
> Also, we _do_ need the second cache flush in place to ensure that the
> unlock is seen to other CPUs.
>
> We could work around that by taking and releasing the lock in the IPI
> processing function... but this is starting to look less attractive
> as the lock is private to irq-gic.c.
>
> Well, we're very close to 3.19, we're too close to be trying to sort
> this out, so I'm hoping that your changes which cause this RCU error
> are *not* going in during this merge window, because we seem to have
> something of a problem right now which needs more time to resolve.
Most likely into the 3.20 merge window. But please keep in mind that
RCU is just the messenger here -- the current code will break if any
CPU for whatever reason takes more than a jiffy to get from its
_stop_machine() handler to the end of its last RCU read-side critical
section on its way out. A jiffy may sound like a lot, but it is not
hard to exceed this limit, especially in virtualized environments.
So not likely to go into v3.19, but it does need to be resolved.
Thanx, Paul
On Thu, Feb 05, 2015 at 09:02:28AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> >
> > Sigh... I kind'a new it wouldn't be this simple. The gic code which
> > actually raises the IPI takes a raw spinlock, so it's not going to be
> > this simple - there's a small theoretical window where we have taken
> > this lock, written the register to send the IPI, and then dropped the
> > lock - the update to the lock to release it could get lost if the
> > CPU power is quickly cut at that point.
> >
> > Also, we _do_ need the second cache flush in place to ensure that the
> > unlock is seen to other CPUs.
> >
> > We could work around that by taking and releasing the lock in the IPI
> > processing function... but this is starting to look less attractive
> > as the lock is private to irq-gic.c.
> >
> > Well, we're very close to 3.19, we're too close to be trying to sort
> > this out, so I'm hoping that your changes which cause this RCU error
> > are *not* going in during this merge window, because we seem to have
> > something of a problem right now which needs more time to resolve.
>
> Most likely into the 3.20 merge window. But please keep in mind that
> RCU is just the messenger here -- the current code will break if any
> CPU for whatever reason takes more than a jiffy to get from its
> _stop_machine() handler to the end of its last RCU read-side critical
> section on its way out. A jiffy may sound like a lot, but it is not
> hard to exceed this limit, especially in virtualized environments.
What I'm saying is that we can't likely get a good fix prepared before
the 3.20 merge window opens.
I don't term the set_bit/clear_bit solution a "good fix" because it is
far too complex - I've not done a thorough review on it, but the idea
of setting and clearing a couple of bits in unison, making sure that
their state is set appropriately through multiple different code paths
does not strike me as a provably correct replacement for this completion.
The reason for that complexity is because there is no pre-notification
to arch code that a CPU might be going down, so there's no way for the
"CPU is dead" flag to be properly reset (which is why there's all the
manipulation in lots of possible failure paths.)
The idea that we could reset it in the CPU up code doesn't fly - that
would only work if we had one secondary CPU (which would guarantee a
strict up/down/up ordering on it) but as soon as you have more than one
CPU, that doesn't hold true.
We could hook into the CPU hotplug notifiers - which would be quite a
lot of additional code to achieve the reset early enough in the hot
unplug path, though it would probably be the most reliable alternative
to the wait-for-bit solution.
However, any of those solutions needs writing and thorough testing,
which, if Linus opens the merge window on Sunday, isn't going to
happen beforehand (and we know Linus doesn't like extra development
appearing which wasn't in -next prior to the merge window - he's taken
snapshots of -next to check during the merge window in the past - so
it's not something I'm going to be adding during that time, not even
as a "fix" because we know about the problem right now, before the
merge window. To me, to treat this as a "fix" would be wilfully
deceitful.)
I don't think the existing code is a big problem at the moment - it's
been like this for about 10 years, and no one has ever reported an
issue with it, although there have been changes over that time:
aa033810461ee56abbef6cef10aabd6b97f5caee
ARM: smp: Drop RCU_NONIDLE usage in cpu_die()
This removed the RCU_NONIDLE() from the completion() call.
ff081e05bfba3461119cd280201d163b6858eda2
ARM: 7457/1: smp: Fix suspicious RCU originating from cpu_die()
This added the RCU_NONIDLE() to the completion() call.
3c030beabf937b1d3b4ecaedfd1fb2f1e2aa0c70
ARM: CPU hotplug: move cpu_killed completion to core code
This moved the completion code from Realview (and other ARM
platforms) into core ARM code.
and 97a63ecff4bd06da5d8feb8c0394a4d020f2d34d
[ARM SMP] Add CPU hotplug support for Realview MPcore
The earliest current introduction of CPU hotplug in 2005.
On Thu, Feb 05, 2015 at 05:34:40PM +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 09:02:28AM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> > > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > > > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> > >
> > > Sigh... I kind'a new it wouldn't be this simple. The gic code which
> > > actually raises the IPI takes a raw spinlock, so it's not going to be
> > > this simple - there's a small theoretical window where we have taken
> > > this lock, written the register to send the IPI, and then dropped the
> > > lock - the update to the lock to release it could get lost if the
> > > CPU power is quickly cut at that point.
> > >
> > > Also, we _do_ need the second cache flush in place to ensure that the
> > > unlock is seen to other CPUs.
> > >
> > > We could work around that by taking and releasing the lock in the IPI
> > > processing function... but this is starting to look less attractive
> > > as the lock is private to irq-gic.c.
> > >
> > > Well, we're very close to 3.19, we're too close to be trying to sort
> > > this out, so I'm hoping that your changes which cause this RCU error
> > > are *not* going in during this merge window, because we seem to have
> > > something of a problem right now which needs more time to resolve.
> >
> > Most likely into the 3.20 merge window. But please keep in mind that
> > RCU is just the messenger here -- the current code will break if any
> > CPU for whatever reason takes more than a jiffy to get from its
> > _stop_machine() handler to the end of its last RCU read-side critical
> > section on its way out. A jiffy may sound like a lot, but it is not
> > hard to exceed this limit, especially in virtualized environments.
>
> What I'm saying is that we can't likely get a good fix prepared before
> the 3.20 merge window opens.
And I cannot count. Or cannot type. Or something.
I meant to say "Most likely into the 3.21 merge window."  I agree that
this stuff is not ready for next week's merge window. For one thing,
there are similar issues with a number of other architectures as well.
> I don't term the set_bit/clear_bit solution a "good fix" because it is
> far too complex - I've not done a thorough review on it, but the idea
> of setting and clearing a couple of bits in unison, making sure that
> their state is set appropriately through multiple different code paths
> does not strike me as a provably correct replacement for this completion.
> The reason for that complexity is because there is no pre-notification
> to arch code that a CPU might be going down, so there's no way for the
> "CPU is dead" flag to be properly reset (which is why there's all the
> manipulation in lots of possible failure paths.)
>
> The idea that we could reset it in the CPU up code doesn't fly - that
> would only work if we had one secondary CPU (which would guarantee a
> strict up/down/up ordering on it) but as soon as you have more than one
> CPU, that doesn't hold true.
>
> We could hook into the CPU hotplug notifiers - which would be quite a
> lot of additional code to achieve the reset early enough in the hot
> unplug path, though it would probably be the most reliable solution to
> the wait-for-bit solution.
>
> However, any of those solutions needs writing and thorough testing,
> which, if Linus opens the merge window on Sunday, isn't going to
> happen before hand (and we know Linus doesn't like extra development
> appearing which wasn't in -next prior to the merge window - he's taken
> snapshots of -next to check during the merge window in the past - so
> it's not something I'm going to be adding during that time, not even
> as a "fix" because we know about the problem right now, before the
> merge window. To me, to treat this as a "fix" would be wilfully
> deceitful.)
>
> I don't think the existing code is a big problem at the moment - it's
> been like this for about 10 years, and no one has ever reported an
> issue with it, although there have been changes over that time:
>
> aa033810461ee56abbef6cef10aabd6b97f5caee
> ARM: smp: Drop RCU_NONIDLE usage in cpu_die()
>
> This removed the RCU_NONIDLE() from the completion() call.
>
> ff081e05bfba3461119cd280201d163b6858eda2
> ARM: 7457/1: smp: Fix suspicious RCU originating from cpu_die()
>
> This added the RCU_NONIDLE() to the completion() call.
>
> 3c030beabf937b1d3b4ecaedfd1fb2f1e2aa0c70
> ARM: CPU hotplug: move cpu_killed completion to core code
>
> This moved the completion code from Realview (and other ARM
> platforms) into core ARM code.
>
> and 97a63ecff4bd06da5d8feb8c0394a4d020f2d34d
> [ARM SMP] Add CPU hotplug support for Realview MPcore
>
> The earliest current introduction of CPU hotplug in 2005.
Agreed. It does need to be fixed, but it makes sense to take a few
weeks and get a fix that is more likely to be correct.
Thanx, Paul
On 02/05/15 08:11, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
>> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> Sigh... I kind'a new it wouldn't be this simple. The gic code which
> actually raises the IPI takes a raw spinlock, so it's not going to be
> this simple - there's a small theoretical window where we have taken
> this lock, written the register to send the IPI, and then dropped the
> lock - the update to the lock to release it could get lost if the
> CPU power is quickly cut at that point.
Hm.. at first glance it would seem like a similar problem exists with
the completion variable. But it seems that we rely on the call to
complete() from the dying CPU to synchronize with wait_for_completion()
on the killing CPU via the completion's wait.lock.
void complete(struct completion *x)
{
	unsigned long flags;

	spin_lock_irqsave(&x->wait.lock, flags);
	x->done++;
	__wake_up_locked(&x->wait, TASK_NORMAL, 1);
	spin_unlock_irqrestore(&x->wait.lock, flags);
}
and
static inline long __sched
do_wait_for_common(struct completion *x,
		   long (*action)(long), long timeout, int state)
	...
		spin_unlock_irq(&x->wait.lock);
		timeout = action(timeout);
		spin_lock_irq(&x->wait.lock);
so the power can't really be cut until the killing CPU sees the lock
released either explicitly via the second cache flush in cpu_die() or
implicitly via hardware. Maybe we can do the same thing here by using a
spinlock for synchronization between the IPI handler and the dying CPU?
So lock/unlock around the IPI sending from the dying CPU and then do a
lock/unlock on the killing CPU before continuing.
It would be nice if we didn't have to do anything at all though so
perhaps we can make it a nop on configs where there isn't a big.LITTLE
switcher. Yeah it's some ugly coupling between these two pieces of code,
but I'm not sure how we can do better.
>
> Also, we _do_ need the second cache flush in place to ensure that the
> unlock is seen to other CPUs.
>
> We could work around that by taking and releasing the lock in the IPI
> processing function... but this is starting to look less attractive
> as the lock is private to irq-gic.c.
With Daniel Thompson's NMI fiq patches at least the lock would almost
always be gone, except for the bL switcher users. Another solution might
be to put a hotplug lock around the bL switcher code and then skip
taking the lock in gic_raise_softirq() if the IPI is our special hotplug
one. Conditional locking is pretty ugly though, so perhaps this isn't
such a great idea.
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
On Mon, Feb 09, 2015 at 05:24:08PM -0800, Stephen Boyd wrote:
> On 02/05/15 08:11, Russell King - ARM Linux wrote:
> > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> >> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> > Sigh... I kind'a new it wouldn't be this simple. The gic code which
> > actually raises the IPI takes a raw spinlock, so it's not going to be
> > this simple - there's a small theoretical window where we have taken
> > this lock, written the register to send the IPI, and then dropped the
> > lock - the update to the lock to release it could get lost if the
> > CPU power is quickly cut at that point.
>
> Hm.. at first glance it would seem like a similar problem exists with
> the completion variable. But it seems that we rely on the call to
> complete() fom the dying CPU to synchronize with wait_for_completion()
> on the killing CPU via the completion's wait.lock.
>
> void complete(struct completion *x)
> {
> unsigned long flags;
>
> spin_lock_irqsave(&x->wait.lock, flags);
> x->done++;
> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> spin_unlock_irqrestore(&x->wait.lock, flags);
> }
>
> and
>
> static inline long __sched
> do_wait_for_common(struct completion *x,
> long (*action)(long), long timeout, int state)
> ...
> spin_unlock_irq(&x->wait.lock);
> timeout = action(timeout);
> spin_lock_irq(&x->wait.lock);
>
>
> so the power can't really be cut until the killing CPU sees the lock
> released either explicitly via the second cache flush in cpu_die() or
> implicitly via hardware. Maybe we can do the same thing here by using a
> spinlock for synchronization between the IPI handler and the dying CPU?
> So lock/unlock around the IPI sending from the dying CPU and then do a
> lock/unlock on the killing CPU before continuing.
>
> It would be nice if we didn't have to do anything at all though so
> perhaps we can make it a nop on configs where there isn't a big little
> switcher. Yeah it's some ugly coupling between these two pieces of code,
> but I'm not sure how we can do better.
The default ugly-but-known-to-work approach is to set a variable in
the dying CPU that the surviving CPU periodically polls. If all else
fails and all that.
> > Also, we _do_ need the second cache flush in place to ensure that the
> > unlock is seen to other CPUs.
> >
> > We could work around that by taking and releasing the lock in the IPI
> > processing function... but this is starting to look less attractive
> > as the lock is private to irq-gic.c.
>
> With Daniel Thompson's NMI fiq patches at least the lock would almost
> always be gone, except for the bL switcher users. Another solution might
> be to put a hotplug lock around the bL switcher code and then skip
> taking the lock in gic_raise_softirq() if the IPI is our special hotplug
> one. Conditional locking is pretty ugly though, so perhaps this isn't
> such a great idea.
Which hotplug lock are you suggesting? We cannot use sleeplocks, because
releasing them can go through the scheduler, which is not legal at this
point.
Thanx, Paul
On 02/09/15 17:37, Paul E. McKenney wrote:
> On Mon, Feb 09, 2015 at 05:24:08PM -0800, Stephen Boyd wrote:
>> On 02/05/15 08:11, Russell King - ARM Linux wrote:
>>> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
>>>> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
>>> Sigh... I kind'a new it wouldn't be this simple. The gic code which
>>> actually raises the IPI takes a raw spinlock, so it's not going to be
>>> this simple - there's a small theoretical window where we have taken
>>> this lock, written the register to send the IPI, and then dropped the
>>> lock - the update to the lock to release it could get lost if the
>>> CPU power is quickly cut at that point.
>> Hm.. at first glance it would seem like a similar problem exists with
>> the completion variable. But it seems that we rely on the call to
>> complete() fom the dying CPU to synchronize with wait_for_completion()
>> on the killing CPU via the completion's wait.lock.
>>
>> void complete(struct completion *x)
>> {
>> unsigned long flags;
>>
>> spin_lock_irqsave(&x->wait.lock, flags);
>> x->done++;
>> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
>> spin_unlock_irqrestore(&x->wait.lock, flags);
>> }
>>
>> and
>>
>> static inline long __sched
>> do_wait_for_common(struct completion *x,
>> long (*action)(long), long timeout, int state)
>> ...
>> spin_unlock_irq(&x->wait.lock);
>> timeout = action(timeout);
>> spin_lock_irq(&x->wait.lock);
>>
>>
>> so the power can't really be cut until the killing CPU sees the lock
>> released either explicitly via the second cache flush in cpu_die() or
>> implicitly via hardware. Maybe we can do the same thing here by using a
>> spinlock for synchronization between the IPI handler and the dying CPU?
>> So lock/unlock around the IPI sending from the dying CPU and then do a
>> lock/unlock on the killing CPU before continuing.
>>
>> It would be nice if we didn't have to do anything at all though so
>> perhaps we can make it a nop on configs where there isn't a big little
>> switcher. Yeah it's some ugly coupling between these two pieces of code,
>> but I'm not sure how we can do better.
> The default ugly-but-known-to-work approach is to set a variable in
> the dying CPU that the surviving CPU periodically polls. If all else
> fails and all that.
So it isn't the ugliest. Good.
>
>>> Also, we _do_ need the second cache flush in place to ensure that the
>>> unlock is seen to other CPUs.
>>>
>>> We could work around that by taking and releasing the lock in the IPI
>>> processing function... but this is starting to look less attractive
>>> as the lock is private to irq-gic.c.
>> With Daniel Thompson's NMI fiq patches at least the lock would almost
>> always be gone, except for the bL switcher users. Another solution might
>> be to put a hotplug lock around the bL switcher code and then skip
>> taking the lock in gic_raise_softirq() if the IPI is our special hotplug
>> one. Conditional locking is pretty ugly though, so perhaps this isn't
>> such a great idea.
> Which hotplug lock are you suggesting? We cannot use sleeplocks, because
> releasing them can go through the scheduler, which is not legal at this
> point.
>
I'm not suggesting we take any hotplug locks here in the cpu_die() path.
I'm thinking we make the bL switcher code hold a hotplug lock or at
least prevent hotplug from happening while it's moving IPIs from the
outgoing CPU to the incoming CPU (see code in
arch/arm/common/bL_switcher.c). Actually, I seem to recall that hotplug
can't happen if preemption/irqs are disabled, so maybe nothing needs to
change there and we can just assume that if we're sending the hotplug
IPI we don't need to worry about taking the spinlock in
gic_raise_softirq()? We still have conditional locking so it's still
fragile.
On Mon, Feb 09, 2015 at 06:05:55PM -0800, Stephen Boyd wrote:
> On 02/09/15 17:37, Paul E. McKenney wrote:
> > On Mon, Feb 09, 2015 at 05:24:08PM -0800, Stephen Boyd wrote:
> >> On 02/05/15 08:11, Russell King - ARM Linux wrote:
> >>> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> >>>> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> >>> Sigh... I kind'a new it wouldn't be this simple. The gic code which
> >>> actually raises the IPI takes a raw spinlock, so it's not going to be
> >>> this simple - there's a small theoretical window where we have taken
> >>> this lock, written the register to send the IPI, and then dropped the
> >>> lock - the update to the lock to release it could get lost if the
> >>> CPU power is quickly cut at that point.
> >> Hm.. at first glance it would seem like a similar problem exists with
> >> the completion variable. But it seems that we rely on the call to
> >> complete() fom the dying CPU to synchronize with wait_for_completion()
> >> on the killing CPU via the completion's wait.lock.
> >>
> >> void complete(struct completion *x)
> >> {
> >> unsigned long flags;
> >>
> >> spin_lock_irqsave(&x->wait.lock, flags);
> >> x->done++;
> >> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> >> spin_unlock_irqrestore(&x->wait.lock, flags);
> >> }
> >>
> >> and
> >>
> >> static inline long __sched
> >> do_wait_for_common(struct completion *x,
> >> long (*action)(long), long timeout, int state)
> >> ...
> >> spin_unlock_irq(&x->wait.lock);
> >> timeout = action(timeout);
> >> spin_lock_irq(&x->wait.lock);
> >>
> >>
> >> so the power can't really be cut until the killing CPU sees the lock
> >> released either explicitly via the second cache flush in cpu_die() or
> >> implicitly via hardware. Maybe we can do the same thing here by using a
> >> spinlock for synchronization between the IPI handler and the dying CPU?
> >> So lock/unlock around the IPI sending from the dying CPU and then do a
> >> lock/unlock on the killing CPU before continuing.
> >>
> >> It would be nice if we didn't have to do anything at all though so
> >> perhaps we can make it a nop on configs where there isn't a big little
> >> switcher. Yeah it's some ugly coupling between these two pieces of code,
> >> but I'm not sure how we can do better.
> > The default ugly-but-known-to-work approach is to set a variable in
> > the dying CPU that the surviving CPU periodically polls. If all else
> > fails and all that.
>
> So it isn't the ugliest. Good.
Woo-hoo!!! Something to aspire to! ;-)
> >>> Also, we _do_ need the second cache flush in place to ensure that the
> >>> unlock is seen to other CPUs.
> >>>
> >>> We could work around that by taking and releasing the lock in the IPI
> >>> processing function... but this is starting to look less attractive
> >>> as the lock is private to irq-gic.c.
> >> With Daniel Thompson's NMI fiq patches at least the lock would almost
> >> always be gone, except for the bL switcher users. Another solution might
> >> be to put a hotplug lock around the bL switcher code and then skip
> >> taking the lock in gic_raise_softirq() if the IPI is our special hotplug
> >> one. Conditional locking is pretty ugly though, so perhaps this isn't
> >> such a great idea.
> > Which hotplug lock are you suggesting? We cannot use sleeplocks, because
> > releasing them can go through the scheduler, which is not legal at this
> > point.
> >
>
> I'm not suggesting we take any hotplug locks here in the cpu_die() path.
> I'm thinking we make the bL switcher code hold a hotplug lock or at
> least prevent hotplug from happening while it's moving IPIs from the
> outgoing CPU to the incoming CPU (see code in
> arch/arm/common/bL_switcher.c). Actually, I seem to recall that hotplug
> can't happen if preemption/irqs are disabled, so maybe nothing needs to
> change there and we can just assume that if we're sending the hotplug
> IPI we don't need to worry about taking the spinlock in
> gic_raise_softirq()? We still have conditional locking so it's still
> fragile.
More precisely, if you are running on a given CPU with preemption disabled
(and disabling irqs disables preemption), then that CPU cannot go offline.
On the other hand, some other CPU may have already been partway offline,
and if it is far enough along in that process, it might well go the rest
of the way offline during the time your CPU is running with preemption
disabled.
Does that help?
Thanx, Paul
On Tue, Feb 10, 2015 at 01:24:08AM +0000, Stephen Boyd wrote:
> On 02/05/15 08:11, Russell King - ARM Linux wrote:
> > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> >> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> > Sigh... I kind'a new it wouldn't be this simple. The gic code which
> > actually raises the IPI takes a raw spinlock, so it's not going to be
> > this simple - there's a small theoretical window where we have taken
> > this lock, written the register to send the IPI, and then dropped the
> > lock - the update to the lock to release it could get lost if the
> > CPU power is quickly cut at that point.
>
> Hm.. at first glance it would seem like a similar problem exists with
> the completion variable. But it seems that we rely on the call to
> complete() fom the dying CPU to synchronize with wait_for_completion()
> on the killing CPU via the completion's wait.lock.
>
> void complete(struct completion *x)
> {
> unsigned long flags;
>
> spin_lock_irqsave(&x->wait.lock, flags);
> x->done++;
> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> spin_unlock_irqrestore(&x->wait.lock, flags);
> }
>
> and
>
> static inline long __sched
> do_wait_for_common(struct completion *x,
> long (*action)(long), long timeout, int state)
> ...
> spin_unlock_irq(&x->wait.lock);
> timeout = action(timeout);
> spin_lock_irq(&x->wait.lock);
>
>
> so the power can't really be cut until the killing CPU sees the lock
> released either explicitly via the second cache flush in cpu_die() or
> implicitly via hardware.
That sounds about right, though surely cache flush is irrelevant w.r.t.
publishing of the unlock? The dsb(ishst) in the unlock path will ensure
that the write is visible prior to the second flush_cache_louis().
That said, we _do_ need to flush the cache prior to the CPU being
killed, or we can lose any (shared) dirty cache lines the CPU owns. In
the presence of dirty cacheline migration we need to be sure the CPU to
be killed doesn't acquire any lines prior to being killed (i.e. its
caches need to be off and flushed). Given that I don't think it's
feasible to perform an IPI.
I think we need to move the synchronisation down into the
cpu_ops::{cpu_die,cpu_kill} implementations, so that we can have the
dying CPU signal readiness after it has disabled and flushed its caches.
If the CPU can kill itself and we can query the state of the CPU, then
the dying CPU needs to do nothing, and cpu_kill can just poll until it
is dead. If the CPU needs to be killed from another CPU, it can update a
(cacheline-padded) percpu variable that cpu_kill can poll (cleaning
before each read).
> Maybe we can do the same thing here by using a
> spinlock for synchronization between the IPI handler and the dying CPU?
> So lock/unlock around the IPI sending from the dying CPU and then do a
> lock/unlock on the killing CPU before continuing.
>
> It would be nice if we didn't have to do anything at all though so
> perhaps we can make it a nop on configs where there isn't a big little
> switcher. Yeah it's some ugly coupling between these two pieces of code,
> but I'm not sure how we can do better.
I'm missing something here. What does the switcher have to do with this?
> > Also, we _do_ need the second cache flush in place to ensure that the
> > unlock is seen to other CPUs.
> >
> > We could work around that by taking and releasing the lock in the IPI
> > processing function... but this is starting to look less attractive
> > as the lock is private to irq-gic.c.
>
> With Daniel Thompson's NMI fiq patches at least the lock would almost
> always be gone, except for the bL switcher users. Another solution might
> be to put a hotplug lock around the bL switcher code and then skip
> taking the lock in gic_raise_softirq() if the IPI is our special hotplug
> one. Conditional locking is pretty ugly though, so perhaps this isn't
> such a great idea.
There are also SMP platforms without a GIC (e.g. hip04) that would need
similar modifications to their interrupt controller drivers, which is
going to be painful.
Thanks,
Mark.
On Mon, Feb 09, 2015 at 05:24:08PM -0800, Stephen Boyd wrote:
> On 02/05/15 08:11, Russell King - ARM Linux wrote:
> > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> >> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> > Sigh... I kind'a new it wouldn't be this simple. The gic code which
> > actually raises the IPI takes a raw spinlock, so it's not going to be
> > this simple - there's a small theoretical window where we have taken
> > this lock, written the register to send the IPI, and then dropped the
> > lock - the update to the lock to release it could get lost if the
> > CPU power is quickly cut at that point.
>
> Hm.. at first glance it would seem like a similar problem exists with
> the completion variable. But it seems that we rely on the call to
> complete() fom the dying CPU to synchronize with wait_for_completion()
> on the killing CPU via the completion's wait.lock.
>
> void complete(struct completion *x)
> {
> unsigned long flags;
>
> spin_lock_irqsave(&x->wait.lock, flags);
> x->done++;
> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> spin_unlock_irqrestore(&x->wait.lock, flags);
> }
>
> and
>
> static inline long __sched
> do_wait_for_common(struct completion *x,
> long (*action)(long), long timeout, int state)
> ...
> spin_unlock_irq(&x->wait.lock);
> timeout = action(timeout);
> spin_lock_irq(&x->wait.lock);
>
>
> so the power can't really be cut until the killing CPU sees the lock
> released either explicitly via the second cache flush in cpu_die() or
> implicitly via hardware.
Correct - so the caller of wait_for_completion_timeout() needs to
re-acquire the cache line after the complete() in order to return
successfully - which means that the spin_unlock_irqrestore() on the
dying CPU _must_ have become visible to other observers for the
requesting CPU to proceed.
> Maybe we can do the same thing here by using a
> spinlock for synchronization between the IPI handler and the dying CPU?
> So lock/unlock around the IPI sending from the dying CPU and then do a
> lock/unlock on the killing CPU before continuing.
It would be nice, but it means exporting irq_controller_lock from irq-gic.c.
It's doable, but it's really not nice - it creates a layering issue, by
making arch/arm/kernel/smp.c depend on symbols exported from
drivers/irqchip/irq-gic.c.
> > Also, we _do_ need the second cache flush in place to ensure that the
> > unlock is seen to other CPUs.
> >
> > We could work around that by taking and releasing the lock in the IPI
> > processing function... but this is starting to look less attractive
> > as the lock is private to irq-gic.c.
>
> With Daniel Thompson's NMI fiq patches at least the lock would almost
> always be gone, except for the bL switcher users. Another solution might
> be to put a hotplug lock around the bL switcher code and then skip
> taking the lock in gic_raise_softirq() if the IPI is our special hotplug
> one. Conditional locking is pretty ugly though, so perhaps this isn't
> such a great idea.
I haven't even thought about the implications of that yet. :) We need to
fix the already existing in-kernel code before we consider not-yet-merged
code.
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
On 02/10/15 07:41, Russell King - ARM Linux wrote:
> On Mon, Feb 09, 2015 at 05:24:08PM -0800, Stephen Boyd wrote:
>
>
>> Maybe we can do the same thing here by using a
>> spinlock for synchronization between the IPI handler and the dying CPU?
>> So lock/unlock around the IPI sending from the dying CPU and then do a
>> lock/unlock on the killing CPU before continuing.
> It would be nice, but it means exporting irq_controller_lock from irq-gic.c.
> It's doable, but it's really not nice - it creates a layering issue, by
> making arch/arm/kernel/smp.c depend on symbols exported from
> drivers/irqchip/irq-gic.c.
I wasn't talking about the irq_controller_lock. I was saying we should
add another spinlock for synchronization purposes in arch/arm/kernel/smp.c
---8<----
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 02c5da16c7ed..fe0386c751b2 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -225,6 +225,7 @@ int __cpu_disable(void)
}
static DECLARE_COMPLETION(cpu_died);
+static DEFINE_RAW_SPINLOCK(stop_lock);
/*
* called on the thread which is asking for a CPU to be shutdown -
@@ -232,10 +233,13 @@ static DECLARE_COMPLETION(cpu_died);
*/
void __cpu_die(unsigned int cpu)
{
+ unsigned long flags;
if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) {
pr_err("CPU%u: cpu didn't die\n", cpu);
return;
}
+ raw_spin_lock_irqsave(&stop_lock, flags);
+ raw_spin_unlock_irqrestore(&stop_lock, flags);
pr_notice("CPU%u: shutdown\n", cpu);
/*
@@ -280,7 +284,17 @@ void __ref cpu_die(void)
* has been sent, power and/or clocks can be removed at any point
* from this CPU and its cache by platform_cpu_kill().
*/
+ raw_spin_lock(&stop_lock);
__smp_cross_call(cpumask_of(cpumask_any(cpu_online_mask)), IPI_CPU_DEAD);
+ raw_spin_unlock(&stop_lock);
+
+ /*
+ * Ensure that the cache lines associated with the stop_lock are
+ * written out. This covers the case where _this_ CPU is doing the
+ * powering down, to ensure that the lock is visible to the
+ * CPU waiting for this one.
+ */
+ flush_cache_louis();
/*
* The actual CPU shutdown procedure is at least platform (if not
@@ -517,8 +531,6 @@ void tick_broadcast(const struct cpumask *mask)
}
#endif
-static DEFINE_RAW_SPINLOCK(stop_lock);
-
/*
* ipi_cpu_stop - handle IPI from smp_send_stop()
*/
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
On 02/10/15 07:14, Mark Rutland wrote:
> On Tue, Feb 10, 2015 at 01:24:08AM +0000, Stephen Boyd wrote:
>> On 02/05/15 08:11, Russell King - ARM Linux wrote:
>>> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
>>>> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
>>> Sigh... I kinda knew it wouldn't be this simple. The gic code which
>>> actually raises the IPI takes a raw spinlock, so it's not going to be
>>> this simple - there's a small theoretical window where we have taken
>>> this lock, written the register to send the IPI, and then dropped the
>>> lock - the update to the lock to release it could get lost if the
>>> CPU power is quickly cut at that point.
>> Hm.. at first glance it would seem like a similar problem exists with
>> the completion variable. But it seems that we rely on the call to
>> complete() from the dying CPU to synchronize with wait_for_completion()
>> on the killing CPU via the completion's wait.lock.
>>
>> void complete(struct completion *x)
>> {
>> unsigned long flags;
>>
>> spin_lock_irqsave(&x->wait.lock, flags);
>> x->done++;
>> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
>> spin_unlock_irqrestore(&x->wait.lock, flags);
>> }
>>
>> and
>>
>> static inline long __sched
>> do_wait_for_common(struct completion *x,
>> long (*action)(long), long timeout, int state)
>> ...
>> spin_unlock_irq(&x->wait.lock);
>> timeout = action(timeout);
>> spin_lock_irq(&x->wait.lock);
>>
>>
>> so the power can't really be cut until the killing CPU sees the lock
>> released either explicitly via the second cache flush in cpu_die() or
>> implicitly via hardware.
> That sounds about right, though surely cache flush is irrelevant w.r.t.
> publishing of the unlock? The dsb(ishst) in the unlock path will ensure
> that the write is visible prior to the second flush_cache_louis().
Ah right. I was incorrectly thinking that the CPU had already disabled
coherency at this point.
>
> That said, we _do_ need to flush the cache prior to the CPU being
> killed, or we can lose any (shared) dirty cache lines the CPU owns. In
> the presence of dirty cacheline migration we need to be sure the CPU to
> be killed doesn't acquire any lines prior to being killed (i.e. its
> caches need to be off and flushed). Given that I don't think it's
> feasible to perform an IPI.
The IPI/completion sounds nice because it allows the killing CPU to
schedule and do other work until the dying CPU notifies that it's almost
dead.
>
> I think we need to move the synchronisation down into the
> cpu_ops::{cpu_die,cpu_kill} implementations, so that we can have the
> dying CPU signal readiness after it has disabled and flushed its caches.
>
> If the CPU can kill itself and we can query the state of the CPU, then
> the dying CPU needs to do nothing, and cpu_kill can just poll until it
> is dead. If the CPU needs to be killed from another CPU, it can update a
> (cacheline-padded) percpu variable that cpu_kill can poll (cleaning
> before each read).
How about a hybrid approach where we send the IPI from generic cpu_die()
and then do the cacheline-padded bit poll + invalidate and bit set? That
way we get the benefit of not doing that poll until we really need to,
if we need to do it at all.
cpu_kill | cpu_die | IPI | bit poll
---------+---------+-----+----------
Y | Y | Y | N
N | Y | Y | Y
Y | N | ? | ? <-- Is this a valid configuration?
N | N | N | N <-- Hotplug should be disabled
If the hardware doesn't have a synchronization method in row 1 we can
expose the bit polling functionality to the ops so that they can set and
poll the bit. It looks like rockchip would need this because we just
pull the power in cpu_kill without any synchronization. Unfortunately
this is starting to sound like a fairly large patch to backport.
Aside: For that last row we really should be setting cpu->hotpluggable
in struct cpu based on what cpu_ops::cpu_disable returns (from what I
can tell we use that op to indicate if a CPU can be hotplugged).
Double Aside: It seems that exynos has a bug with coherency.
exynos_cpu_die() calls v7_exit_coherency_flush() and then calls
exynos_cpu_power_down() which may call of_machine_is_compatible() and
that function will grab and release a kref which uses ldrex/strex for
atomics after we've disabled coherency in v7_exit_coherency_flush().
>> Maybe we can do the same thing here by using a
>> spinlock for synchronization between the IPI handler and the dying CPU?
>> So lock/unlock around the IPI sending from the dying CPU and then do a
>> lock/unlock on the killing CPU before continuing.
>>
>> It would be nice if we didn't have to do anything at all though so
>> perhaps we can make it a nop on configs where there isn't a big little
>> switcher. Yeah it's some ugly coupling between these two pieces of code,
>> but I'm not sure how we can do better.
> I'm missing something here. What does the switcher have to do with this?
The switcher is the reason we have a spinlock in gic_raise_softirq().
That's the problem that Russell was mentioning.
On Tue, Feb 10, 2015 at 03:14:16PM +0000, Mark Rutland wrote:
> That sounds about right, though surely cache flush is irrelevant w.r.t.
> publishing of the unlock? The dsb(ishst) in the unlock path will ensure
> that the write is visible prior to the second flush_cache_louis().
>
> That said, we _do_ need to flush the cache prior to the CPU being
> killed, or we can lose any (shared) dirty cache lines the CPU owns. In
> the presence of dirty cacheline migration we need to be sure the CPU to
> be killed doesn't acquire any lines prior to being killed (i.e. its
> caches need to be off and flushed). Given that I don't think it's
> feasible to perform an IPI.
>
> I think we need to move the synchronisation down into the
> cpu_ops::{cpu_die,cpu_kill} implementations, so that we can have the
> dying CPU signal readiness after it has disabled and flushed its caches.
Absolutely not - we're not having lots of implementations of the same
fscking thing. We need this basic stuff to be in the generic code, not
repeated some hundreds of times across every single SMP implementation
that's out there.
On 02/10/15 12:48, Stephen Boyd wrote:
> On 02/10/15 07:14, Mark Rutland wrote:
>> On Tue, Feb 10, 2015 at 01:24:08AM +0000, Stephen Boyd wrote:
>>> On 02/05/15 08:11, Russell King - ARM Linux wrote:
>>>> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
>>>>> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
>>>> Sigh... I kinda knew it wouldn't be this simple. The gic code which
>>>> actually raises the IPI takes a raw spinlock, so it's not going to be
>>>> this simple - there's a small theoretical window where we have taken
>>>> this lock, written the register to send the IPI, and then dropped the
>>>> lock - the update to the lock to release it could get lost if the
>>>> CPU power is quickly cut at that point.
>>> Hm.. at first glance it would seem like a similar problem exists with
>>> the completion variable. But it seems that we rely on the call to
>>> complete() from the dying CPU to synchronize with wait_for_completion()
>>> on the killing CPU via the completion's wait.lock.
>>>
>>> void complete(struct completion *x)
>>> {
>>> unsigned long flags;
>>>
>>> spin_lock_irqsave(&x->wait.lock, flags);
>>> x->done++;
>>> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
>>> spin_unlock_irqrestore(&x->wait.lock, flags);
>>> }
>>>
>>> and
>>>
>>> static inline long __sched
>>> do_wait_for_common(struct completion *x,
>>> long (*action)(long), long timeout, int state)
>>> ...
>>> spin_unlock_irq(&x->wait.lock);
>>> timeout = action(timeout);
>>> spin_lock_irq(&x->wait.lock);
>>>
>>>
>>> so the power can't really be cut until the killing CPU sees the lock
>>> released either explicitly via the second cache flush in cpu_die() or
>>> implicitly via hardware.
>> That sounds about right, though surely cache flush is irrelevant w.r.t.
>> publishing of the unlock? The dsb(ishst) in the unlock path will ensure
>> that the write is visible prior to the second flush_cache_louis().
> Ah right. I was incorrectly thinking that the CPU had already disabled
> coherency at this point.
>
>> That said, we _do_ need to flush the cache prior to the CPU being
>> killed, or we can lose any (shared) dirty cache lines the CPU owns. In
>> the presence of dirty cacheline migration we need to be sure the CPU to
>> be killed doesn't acquire any lines prior to being killed (i.e. its
>> caches need to be off and flushed). Given that I don't think it's
>> feasible to perform an IPI.
> The IPI/completion sounds nice because it allows the killing CPU to
> schedule and do other work until the dying CPU notifies that it's almost
> dead.
>
>> I think we need to move the synchronisation down into the
>> cpu_ops::{cpu_die,cpu_kill} implementations, so that we can have the
>> dying CPU signal readiness after it has disabled and flushed its caches.
>>
>> If the CPU can kill itself and we can query the state of the CPU, then
>> the dying CPU needs to do nothing, and cpu_kill can just poll until it
>> is dead. If the CPU needs to be killed from another CPU, it can update a
>> (cacheline-padded) percpu variable that cpu_kill can poll (cleaning
>> before each read).
> How about a hybrid approach where we send the IPI from generic cpu_die()
> and then do the cacheline-padded bit poll + invalidate and bit set? That
> way we get the benefit of not doing that poll until we really need to
> and if we need to do it at all.
>
> cpu_kill | cpu_die | IPI | bit poll
> ---------+---------+-----+----------
> Y | Y | Y | N
> N | Y | Y | Y
> Y | N | ? | ? <-- Is this a valid configuration?
> N | N | N | N <-- Hotplug should be disabled
>
>
> If the hardware doesn't have a synchronization method in row 1 we can
> expose the bit polling functionality to the ops so that they can set and
> poll the bit. It looks like rockchip would need this because we just
> pull the power in cpu_kill without any synchronization. Unfortunately
> this is starting to sound like a fairly large patch to backport.
>
> Aside: For that last row we really should be setting cpu->hotpluggable
> in struct cpu based on what cpu_ops::cpu_disable returns (from what I
> can tell we use that op to indicate if a CPU can be hotplugged).
>
There's a patch for that (tm).
----8<----
From: Stephen Boyd <[email protected]>
Subject: [PATCH] ARM: smp: Only expose /sys/.../cpuX/online if hotpluggable
Writes to /sys/.../cpuX/online fail if we determine the platform
doesn't support hotplug for that CPU. Let's figure this out
before we make the sysfs nodes so that the online file doesn't
even exist if it's not possible to hotplug the CPU.
Signed-off-by: Stephen Boyd <[email protected]>
---
arch/arm/include/asm/smp.h | 6 ++++++
arch/arm/kernel/setup.c | 2 +-
arch/arm/kernel/smp.c | 11 ++++-------
3 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/arch/arm/include/asm/smp.h b/arch/arm/include/asm/smp.h
index 18f5a554134f..9f82430efd59 100644
--- a/arch/arm/include/asm/smp.h
+++ b/arch/arm/include/asm/smp.h
@@ -123,4 +123,10 @@ struct of_cpu_method {
*/
extern void smp_set_ops(struct smp_operations *);
+#ifdef CONFIG_HOTPLUG_CPU
+extern int platform_can_hotplug_cpu(unsigned int cpu);
+#else
+static inline int platform_can_hotplug_cpu(unsigned int cpu) { return 0; }
+#endif
+
#endif /* ifndef __ASM_ARM_SMP_H */
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 715ae19bc7c8..c61c09defc78 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -974,7 +974,7 @@ static int __init topology_init(void)
for_each_possible_cpu(cpu) {
struct cpuinfo_arm *cpuinfo = &per_cpu(cpu_data, cpu);
- cpuinfo->cpu.hotpluggable = 1;
+ cpuinfo->cpu.hotpluggable = platform_can_hotplug_cpu(cpu);
register_cpu(&cpuinfo->cpu, cpu);
}
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index fe0386c751b2..4d213b24db60 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -174,18 +174,19 @@ static int platform_cpu_kill(unsigned int cpu)
return 1;
}
-static int platform_cpu_disable(unsigned int cpu)
+int platform_can_hotplug_cpu(unsigned int cpu)
{
if (smp_ops.cpu_disable)
- return smp_ops.cpu_disable(cpu);
+ return smp_ops.cpu_disable(cpu) ? 0 : 1;
/*
* By default, allow disabling all CPUs except the first one,
* since this is special on a lot of platforms, e.g. because
* of clock tick interrupts.
*/
- return cpu == 0 ? -EPERM : 0;
+ return cpu == 0 ? 0 : 1;
}
+
/*
* __cpu_disable runs on the processor to be shutdown.
*/
@@ -194,10 +195,6 @@ int __cpu_disable(void)
unsigned int cpu = smp_processor_id();
int ret;
- ret = platform_cpu_disable(cpu);
- if (ret)
- return ret;
-
/*
* Take this CPU offline. Once we clear this, we can't return,
* and we must not schedule until we're ready to give up the cpu.
On Tue, Feb 10, 2015 at 01:04:30PM -0800, Stephen Boyd wrote:
> From: Stephen Boyd <[email protected]>
> Subject: [PATCH] ARM: smp: Only expose /sys/.../cpuX/online if hotpluggable
>
> Writes to /sys/.../cpuX/online fail if we determine the platform
> doesn't support hotplug for that CPU. Let's figure this out
> before we make the sysfs nodes so that the online file doesn't
> even exist if it's not possible to hotplug the CPU.
>
> Signed-off-by: Stephen Boyd <[email protected]>
> ---
> arch/arm/include/asm/smp.h | 6 ++++++
> arch/arm/kernel/setup.c | 2 +-
> arch/arm/kernel/smp.c | 11 ++++-------
> 3 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm/include/asm/smp.h b/arch/arm/include/asm/smp.h
> index 18f5a554134f..9f82430efd59 100644
> --- a/arch/arm/include/asm/smp.h
> +++ b/arch/arm/include/asm/smp.h
> @@ -123,4 +123,10 @@ struct of_cpu_method {
> */
> extern void smp_set_ops(struct smp_operations *);
>
> +#ifdef CONFIG_HOTPLUG_CPU
> +extern int platform_can_hotplug_cpu(unsigned int cpu);
> +#else
> +static inline int platform_can_hotplug_cpu(unsigned int cpu) { return 0; }
Please split this across four lines like a normal function.
> +#endif
> +
> #endif /* ifndef __ASM_ARM_SMP_H */
> diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
> index 715ae19bc7c8..c61c09defc78 100644
> --- a/arch/arm/kernel/setup.c
> +++ b/arch/arm/kernel/setup.c
> @@ -974,7 +974,7 @@ static int __init topology_init(void)
>
> for_each_possible_cpu(cpu) {
> struct cpuinfo_arm *cpuinfo = &per_cpu(cpu_data, cpu);
> - cpuinfo->cpu.hotpluggable = 1;
> + cpuinfo->cpu.hotpluggable = platform_can_hotplug_cpu(cpu);
> register_cpu(&cpuinfo->cpu, cpu);
> }
>
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index fe0386c751b2..4d213b24db60 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -174,18 +174,19 @@ static int platform_cpu_kill(unsigned int cpu)
> return 1;
> }
>
> -static int platform_cpu_disable(unsigned int cpu)
> +int platform_can_hotplug_cpu(unsigned int cpu)
> {
> if (smp_ops.cpu_disable)
> - return smp_ops.cpu_disable(cpu);
> + return smp_ops.cpu_disable(cpu) ? 0 : 1;
>
> /*
> * By default, allow disabling all CPUs except the first one,
> * since this is special on a lot of platforms, e.g. because
> * of clock tick interrupts.
> */
> - return cpu == 0 ? -EPERM : 0;
> + return cpu == 0 ? 0 : 1;
> }
> +
> /*
> * __cpu_disable runs on the processor to be shutdown.
> */
> @@ -194,10 +195,6 @@ int __cpu_disable(void)
> unsigned int cpu = smp_processor_id();
> int ret;
>
> - ret = platform_cpu_disable(cpu);
> - if (ret)
> - return ret;
> -
I would much rather prefer smp_ops.cpu_disable() to be renamed in this
case - name it smp_ops.cpu_can_disable() so that it's clear that it's
no longer part of the __cpu_disable() path.
Thanks.
On 02/10/15 13:15, Russell King - ARM Linux wrote:
> On Tue, Feb 10, 2015 at 01:04:30PM -0800, Stephen Boyd wrote:
>> From: Stephen Boyd <[email protected]>
>> Subject: [PATCH] ARM: smp: Only expose /sys/.../cpuX/online if hotpluggable
>>
>> Writes to /sys/.../cpuX/online fail if we determine the platform
>> doesn't support hotplug for that CPU. Let's figure this out
>> before we make the sysfs nodes so that the online file doesn't
>> even exist if it's not possible to hotplug the CPU.
>>
>> Signed-off-by: Stephen Boyd <[email protected]>
>> ---
>> arch/arm/include/asm/smp.h | 6 ++++++
>> arch/arm/kernel/setup.c | 2 +-
>> arch/arm/kernel/smp.c | 11 ++++-------
>> 3 files changed, 11 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/smp.h b/arch/arm/include/asm/smp.h
>> index 18f5a554134f..9f82430efd59 100644
>> --- a/arch/arm/include/asm/smp.h
>> +++ b/arch/arm/include/asm/smp.h
>> @@ -123,4 +123,10 @@ struct of_cpu_method {
>> */
>> extern void smp_set_ops(struct smp_operations *);
>>
>> +#ifdef CONFIG_HOTPLUG_CPU
>> +extern int platform_can_hotplug_cpu(unsigned int cpu);
>> +#else
>> +static inline int platform_can_hotplug_cpu(unsigned int cpu) { return 0; }
> Please split this across four lines like a normal function.
>
Ok.
>> +#endif
>> +
>> #endif /* ifndef __ASM_ARM_SMP_H */
>> diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
>> index 715ae19bc7c8..c61c09defc78 100644
>> --- a/arch/arm/kernel/setup.c
>> +++ b/arch/arm/kernel/setup.c
>> @@ -974,7 +974,7 @@ static int __init topology_init(void)
>>
>> for_each_possible_cpu(cpu) {
>> struct cpuinfo_arm *cpuinfo = &per_cpu(cpu_data, cpu);
>> - cpuinfo->cpu.hotpluggable = 1;
>> + cpuinfo->cpu.hotpluggable = platform_can_hotplug_cpu(cpu);
>> register_cpu(&cpuinfo->cpu, cpu);
>> }
>>
>> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
>> index fe0386c751b2..4d213b24db60 100644
>> --- a/arch/arm/kernel/smp.c
>> +++ b/arch/arm/kernel/smp.c
>> @@ -174,18 +174,19 @@ static int platform_cpu_kill(unsigned int cpu)
>> return 1;
>> }
>>
>> -static int platform_cpu_disable(unsigned int cpu)
>> +int platform_can_hotplug_cpu(unsigned int cpu)
>> {
>> if (smp_ops.cpu_disable)
>> - return smp_ops.cpu_disable(cpu);
>> + return smp_ops.cpu_disable(cpu) ? 0 : 1;
>>
>> /*
>> * By default, allow disabling all CPUs except the first one,
>> * since this is special on a lot of platforms, e.g. because
>> * of clock tick interrupts.
>> */
>> - return cpu == 0 ? -EPERM : 0;
>> + return cpu == 0 ? 0 : 1;
>> }
>> +
>> /*
>> * __cpu_disable runs on the processor to be shutdown.
>> */
>> @@ -194,10 +195,6 @@ int __cpu_disable(void)
>> unsigned int cpu = smp_processor_id();
>> int ret;
>>
>> - ret = platform_cpu_disable(cpu);
>> - if (ret)
>> - return ret;
>> -
> I would much rather prefer smp_ops.cpu_disable() to be renamed in this
> case - name it smp_ops.cpu_can_disable() so that it's clear that it's
> no longer part of the __cpu_disable() path.
Sure. That also makes the ? 0 : 1 thing go away.
On 02/10/15 13:49, Stephen Boyd wrote:
>>> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
>>> index fe0386c751b2..4d213b24db60 100644
>>> --- a/arch/arm/kernel/smp.c
>>> +++ b/arch/arm/kernel/smp.c
>>> @@ -174,18 +174,19 @@ static int platform_cpu_kill(unsigned int cpu)
>>> return 1;
>>> }
>>>
>>> -static int platform_cpu_disable(unsigned int cpu)
>>> +int platform_can_hotplug_cpu(unsigned int cpu)
>>> {
>>> if (smp_ops.cpu_disable)
>>> - return smp_ops.cpu_disable(cpu);
>>> + return smp_ops.cpu_disable(cpu) ? 0 : 1;
>>>
BTW, we also need this
/* cpu_die must be specified to support hotplug */
if (!smp_ops.cpu_die)
return 0;
otherwise it seems that we can have a situation where cpu_die() does
nothing besides unexpectedly jump to secondary_start_kernel() before
generic cpu hotplug facilities expect it to.
On Tue, Feb 10, 2015 at 08:48:18PM +0000, Stephen Boyd wrote:
> On 02/10/15 07:14, Mark Rutland wrote:
> > On Tue, Feb 10, 2015 at 01:24:08AM +0000, Stephen Boyd wrote:
> >> On 02/05/15 08:11, Russell King - ARM Linux wrote:
> >>> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> >>>> Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> >>> Sigh... I kinda knew it wouldn't be this simple. The gic code which
> >>> actually raises the IPI takes a raw spinlock, so it's not going to be
> >>> this simple - there's a small theoretical window where we have taken
> >>> this lock, written the register to send the IPI, and then dropped the
> >>> lock - the update to the lock to release it could get lost if the
> >>> CPU power is quickly cut at that point.
> >> Hm.. at first glance it would seem like a similar problem exists with
> >> the completion variable. But it seems that we rely on the call to
> >> complete() from the dying CPU to synchronize with wait_for_completion()
> >> on the killing CPU via the completion's wait.lock.
> >>
> >> void complete(struct completion *x)
> >> {
> >> unsigned long flags;
> >>
> >> spin_lock_irqsave(&x->wait.lock, flags);
> >> x->done++;
> >> __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> >> spin_unlock_irqrestore(&x->wait.lock, flags);
> >> }
> >>
> >> and
> >>
> >> static inline long __sched
> >> do_wait_for_common(struct completion *x,
> >> long (*action)(long), long timeout, int state)
> >> ...
> >> spin_unlock_irq(&x->wait.lock);
> >> timeout = action(timeout);
> >> spin_lock_irq(&x->wait.lock);
> >>
> >>
> >> so the power can't really be cut until the killing CPU sees the lock
> >> released either explicitly via the second cache flush in cpu_die() or
> >> implicitly via hardware.
> > That sounds about right, though surely cache flush is irrelevant w.r.t.
> > publishing of the unlock? The dsb(ishst) in the unlock path will ensure
> > that the write is visible prior to the second flush_cache_louis().
>
> Ah right. I was incorrectly thinking that the CPU had already disabled
> coherency at this point.
>
> >
> > That said, we _do_ need to flush the cache prior to the CPU being
> > killed, or we can lose any (shared) dirty cache lines the CPU owns. In
> > the presence of dirty cacheline migration we need to be sure the CPU to
> > be killed doesn't acquire any lines prior to being killed (i.e. its
> > caches need to be off and flushed). Given that I don't think it's
> > feasible to perform an IPI.
>
> The IPI/completion sounds nice because it allows the killing CPU to
> schedule and do other work until the dying CPU notifies that it's almost
> dead.
Ok. My main concern was that it's not sufficient to know that a CPU is
ready to die, but I guess it may signal that it is close.
> > I think we need to move the synchronisation down into the
> > cpu_ops::{cpu_die,cpu_kill} implementations, so that we can have the
> > dying CPU signal readiness after it has disabled and flushed its caches.
> >
> > If the CPU can kill itself and we can query the state of the CPU, then
> > the dying CPU needs to do nothing, and cpu_kill can just poll until it
> > is dead. If the CPU needs to be killed from another CPU, it can update a
> > (cacheline-padded) percpu variable that cpu_kill can poll (cleaning
> > before each read).
>
> How about a hybrid approach where we send the IPI from generic cpu_die()
> and then do the cacheline-padded bit poll + invalidate and bit set? That
> way we get the benefit of not doing that poll until we really need to
> and if we need to do it at all.
That sounds sane to me.
> cpu_kill | cpu_die | IPI | bit poll
> ---------+---------+-----+----------
> Y | Y | Y | N
> N | Y | Y | Y
> Y | N | ? | ? <-- Is this a valid configuration?
> N | N | N | N <-- Hotplug should be disabled
>
>
> If the hardware doesn't have a synchronization method in row 1 we can
> expose the bit polling functionality to the ops so that they can set and
> poll the bit. It looks like rockchip would need this because we just
> pull the power in cpu_kill without any synchronization. Unfortunately
> this is starting to sound like a fairly large patch to backport.
Oh, fun.
> Aside: For that last row we really should be setting cpu->hotpluggable
> in struct cpu based on what cpu_ops::cpu_disable returns (from what I
> can tell we use that op to indicate if a CPU can be hotplugged).
>
> Double Aside: It seems that exynos has a bug with coherency.
> exynos_cpu_die() calls v7_exit_coherency_flush() and then calls
> exynos_cpu_power_down() which may call of_machine_is_compatible() and
> that function will grab and release a kref which uses ldrex/strex for
> atomics after we've disabled coherency in v7_exit_coherency_flush().
>
> >> Maybe we can do the same thing here by using a
> >> spinlock for synchronization between the IPI handler and the dying CPU?
> >> So lock/unlock around the IPI sending from the dying CPU and then do a
> >> lock/unlock on the killing CPU before continuing.
> >>
> >> It would be nice if we didn't have to do anything at all though so
> >> perhaps we can make it a nop on configs where there isn't a big little
> >> switcher. Yeah it's some ugly coupling between these two pieces of code,
> >> but I'm not sure how we can do better.
> > I'm missing something here. What does the switcher have to do with this?
>
> The switcher is the reason we have a spinlock in gic_raise_softirq().
> That's the problem that Russell was mentioning.
Ah. Thanks for the pointer.
Thanks,
Mark.
[...]
> > Aside: For that last row we really should be setting cpu->hotpluggable
> > in struct cpu based on what cpu_ops::cpu_disable returns (from what I
> > can tell we use that op to indicate if a CPU can be hotplugged).
> >
>
> There's a patch for that (tm).
>
> ----8<----
>
> From: Stephen Boyd <[email protected]>
> Subject: [PATCH] ARM: smp: Only expose /sys/.../cpuX/online if hotpluggable
>
> Writes to /sys/.../cpuX/online fail if we determine the platform
> doesn't support hotplug for that CPU. Let's figure this out
> before we make the sysfs nodes so that the online file doesn't
> even exist if it's not possible to hotplug the CPU.
>
> Signed-off-by: Stephen Boyd <[email protected]>
> ---
> arch/arm/include/asm/smp.h | 6 ++++++
> arch/arm/kernel/setup.c | 2 +-
> arch/arm/kernel/smp.c | 11 ++++-------
> 3 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm/include/asm/smp.h b/arch/arm/include/asm/smp.h
> index 18f5a554134f..9f82430efd59 100644
> --- a/arch/arm/include/asm/smp.h
> +++ b/arch/arm/include/asm/smp.h
> @@ -123,4 +123,10 @@ struct of_cpu_method {
> */
> extern void smp_set_ops(struct smp_operations *);
>
> +#ifdef CONFIG_HOTPLUG_CPU
> +extern int platform_can_hotplug_cpu(unsigned int cpu);
> +#else
> +static inline int platform_can_hotplug_cpu(unsigned int cpu) { return 0; }
> +#endif
> +
> #endif /* ifndef __ASM_ARM_SMP_H */
> diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
> index 715ae19bc7c8..c61c09defc78 100644
> --- a/arch/arm/kernel/setup.c
> +++ b/arch/arm/kernel/setup.c
> @@ -974,7 +974,7 @@ static int __init topology_init(void)
>
> for_each_possible_cpu(cpu) {
> struct cpuinfo_arm *cpuinfo = &per_cpu(cpu_data, cpu);
> - cpuinfo->cpu.hotpluggable = 1;
> + cpuinfo->cpu.hotpluggable = platform_can_hotplug_cpu(cpu);
> register_cpu(&cpuinfo->cpu, cpu);
> }
>
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index fe0386c751b2..4d213b24db60 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -174,18 +174,19 @@ static int platform_cpu_kill(unsigned int cpu)
> return 1;
> }
>
> -static int platform_cpu_disable(unsigned int cpu)
> +int platform_can_hotplug_cpu(unsigned int cpu)
> {
> if (smp_ops.cpu_disable)
> - return smp_ops.cpu_disable(cpu);
> + return smp_ops.cpu_disable(cpu) ? 0 : 1;
>
> /*
> * By default, allow disabling all CPUs except the first one,
> * since this is special on a lot of platforms, e.g. because
> * of clock tick interrupts.
> */
> - return cpu == 0 ? -EPERM : 0;
> + return cpu == 0 ? 0 : 1;
> }
> +
> /*
> * __cpu_disable runs on the processor to be shutdown.
> */
> @@ -194,10 +195,6 @@ int __cpu_disable(void)
> unsigned int cpu = smp_processor_id();
> int ret;
>
> - ret = platform_cpu_disable(cpu);
> - if (ret)
> - return ret;
For PSCI 0.2+ I was hoping to hook the MIGRATE logic in here. The secure
side may reject hotplugging of a CPU, but it's a dynamic property of the
system and so can't be probed once at boot time.
I guess when I manage to get around to that I can restore a hook here.
Thanks,
Mark.
On Fri, Feb 13, 2015 at 03:52:08PM +0000, Mark Rutland wrote:
> > @@ -194,10 +195,6 @@ int __cpu_disable(void)
> > unsigned int cpu = smp_processor_id();
> > int ret;
> >
> > - ret = platform_cpu_disable(cpu);
> > - if (ret)
> > - return ret;
>
> For PSCI 0.2+ I was hoping to hook the MIGRATE logic in here. The secure
> side may reject hotplugging of a CPU, but it's a dynamic property of the
> system and so can't be probed once at boot time.
You may have to think about how to deal with the static nature of the
sysfs CPU hotplug properties then - or, you may wish to have the existing
behaviour where we expose the sysfs hotplug properties on all CPUs and
rely on returning -EPERM.
One question does come up - if it's a dynamic property of the system,
what ensures that it can't change between the point when we test it
(in __cpu_disable()) and when we actually come to take the CPU offline?
How does the secure side signal its rejection of hotunplugging of a CPU?
If it happens after __cpu_disable(), then that's a problem: the system
will have gone through all the expensive preparation by that time to
shut the CPU down, and it will expect the CPU to go offline. The only
way it can come back at that point is by going through a CPU plug-in
cycle... which means going back through secondary_start_kernel.
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
On 02/13/15 07:52, Mark Rutland wrote:
> @@ -194,10 +195,6 @@ int __cpu_disable(void)
> unsigned int cpu = smp_processor_id();
> int ret;
>
> - ret = platform_cpu_disable(cpu);
> - if (ret)
> - return ret;
> For PSCI 0.2+ I was hoping to hook the MIGRATE logic in here. The secure
> side may reject hotplugging of a CPU, but it's a dynamic property of the
> system and so can't be probed once at boot time.
>
> I guess when I manage to get around to that I can restore a hook here.
>
Yeah I noticed a comment to that effect in the mcpm_cpu_disable() code.
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
On Fri, Feb 13, 2015 at 04:27:25PM +0000, Russell King - ARM Linux wrote:
> On Fri, Feb 13, 2015 at 03:52:08PM +0000, Mark Rutland wrote:
> > > @@ -194,10 +195,6 @@ int __cpu_disable(void)
> > > unsigned int cpu = smp_processor_id();
> > > int ret;
> > >
> > > - ret = platform_cpu_disable(cpu);
> > > - if (ret)
> > > - return ret;
> >
> > For PSCI 0.2+ I was hoping to hook the MIGRATE logic in here. The secure
> > side may reject hotplugging of a CPU, but it's a dynamic property of the
> > system and so can't be probed once at boot time.
>
> You may have to think about how to deal with the static nature of the
> sysfs CPU hotplug properties then - or, you may wish to have the existing
> behaviour where we expose the sysfs hotplug properties on all CPUs and
> rely on returning -EPERM.
>
> One question does come up - if it's a dynamic property of the system,
> what ensures that it can't change between the point when we test it
> (in __cpu_disable()) and when we actually come to take the CPU offline?
By relying on hotplug operations being serialised and the secure OS not
moving arbitrarily (as required by the PSCI spec).
This matters in the case of a UP, migrateable secure OS (AKA TOS). It
lives on a core, but we can ask it (via the PSCI implementation) to
move. It will only move in response to MIGRATE calls, and at boot time
we would query which CPU it's on (which should in practice be CPU0
except in rare cases like a crash kernel).
At __cpu_disable time (where the current platform_cpu_disable callback
is), if the core being disabled has the TOS resident it would call
MIGRATE, passing the physical ID of another CPU to migrate to. If this
fails, then the TOS didn't move and we can't hotplug. If it succeeds,
then we know it has moved to the other CPU.
The disabled CPU then goes through the rest of the teardown, eventually
calling PSCI_OFF to actually be shut down.
We can then wait for the dying CPU to have been killed with
AFFINITY_INFO (as with the current psci_cpu_kill implementation). As we
can't initiate another hotplug before this we can't race and migrate the
TOS back to the original CPU.
> How does the secure side signal its rejection of hotunplugging of a CPU?
It returns an error code in response to the PSCI MIGRATE call.
> If it happens after __cpu_disable(), then that's a problem: the system
> will have gone through all the expensive preparation by that time to
> shut the CPU down, and it will expect the CPU to go offline. The only
> way it can come back at that point is by going through a CPU plug-in
> cycle... which means going back through secondary_start_kernel.
This would happen within __cpu_disable, as the current
platform_cpu_disable() call does, before it's too late to abort the
hotplug.
Thanks,
Mark.
On Fri, Feb 13, 2015 at 05:21:41PM +0000, Mark Rutland wrote:
> On Fri, Feb 13, 2015 at 04:27:25PM +0000, Russell King - ARM Linux wrote:
> > If it happens after __cpu_disable(), then that's a problem: the system
> > will have gone through all the expensive preparation by that time to
> > shut the CPU down, and it will expect the CPU to go offline. The only
> > way it can come back at that point is by going through a CPU plug-in
> > cycle... which means going back through secondary_start_kernel.
>
> This would happen within __cpu_disable, as the current
> platform_cpu_disable() call does, before it's too late to abort the
> hotplug.
Okay, in which case it all sounds fine, but what it does mean is that
we can't merge Stephen's patch since we have no real idea which cores
can be hotplugged.
I'd rather not add additional complexity when we know that it's not
going to work very well...
On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
>
> Sigh... I kind'a knew it wouldn't be this simple. The gic code which
> actually raises the IPI takes a raw spinlock, so it's not going to be
> this simple - there's a small theoretical window where we have taken
> this lock, written the register to send the IPI, and then dropped the
> lock - the update to the lock to release it could get lost if the
> CPU power is quickly cut at that point.
>
> Also, we _do_ need the second cache flush in place to ensure that the
> unlock is seen to other CPUs.
It's time to start discussing this problem again now that we're the
other side of the merge window.
I've been thinking about the lock in the GIC code. Do we actually need
this lock in gic_raise_softirq(), or could we move this lock into the
higher level code?
Let's consider the bL switcher.
I think the bL switcher is racy wrt CPU hotplug at the moment. What
happens if we're sleeping in wait_for_completion(&inbound_alive) and
CPU hotplug unplugs the outgoing CPU? What protects us against
this scenario? I can't see anything in bL_switch_to() which ensures
that CPU hotplug can't run.
Let's assume that this rather glaring bug has been fixed, and that CPU
hotplug can't run in parallel with the bL switcher (and hence
gic_migrate_target() can't run concurrently with a CPU being taken
offline.)
If we have that guarantee, then we don't need to take a lock when sending
the completion IPI - we would know that while a CPU is being taken down,
the bL switcher could not run. What we now need is a lock-less way to
raise an IPI.
Now, is the locking between the normal IPI paths and the bL switcher
something that is specific to the interrupt controller, or should generic
code care about it? I think it's something generic code should care about
- and I believe that would make life a little easier.
The current bL switcher idea is to bring the new CPU up, disable IRQs
and FIQs on the outgoing CPU, change the IRQ/IPI targets, then read
any pending SGIs and raise them on the new CPU. But what about any
pending SPIs? These look like they could be lost.
How about this for an idea instead - the bL switcher code:
- brings the new CPU online.
- disables IRQs and FIQs.
- takes the IPI lock, which prevents new IPIs being raised.
- re-targets IRQs and IPIs onto the new CPU.
- releases the IPI lock.
- re-enables IRQs and FIQs.
- polls the IRQ controller to wait for any remaining IRQs and IPIs
to be delivered.
- re-disables IRQs and FIQs (which shouldn't be received anyway since
they're now targeting the new CPU.)
- shuts down the tick device.
- completes the switch
What this means is that we're not needing to have complex code in the
interrupt controllers to re-raise interrupts on other CPUs, and we
don't need a lock in the interrupt controller code synchronising IPI
raising with the bL switcher.
I'd also suggest that this IPI lock should _not_ be a spinlock - it
should be a read/write spin lock - it's perfectly acceptable to have
multiple CPUs raising IPIs to each other, but it is not acceptable to
have any CPU raising an IPI when the bL switcher is modifying the IRQ
targets. That fits the rwlock semantics.
What this means is that gic_raise_softirq() should again become a lock-
less function, which opens the door to using an IPI to complete the
CPU hot-unplug operation cleanly.
Thoughts (especially from Nico)?
On Wed, 25 Feb 2015, Russell King - ARM Linux wrote:
> On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> >
> > Sigh... I kind'a knew it wouldn't be this simple. The gic code which
> > actually raises the IPI takes a raw spinlock, so it's not going to be
> > this simple - there's a small theoretical window where we have taken
> > this lock, written the register to send the IPI, and then dropped the
> > lock - the update to the lock to release it could get lost if the
> > CPU power is quickly cut at that point.
> >
> > Also, we _do_ need the second cache flush in place to ensure that the
> > unlock is seen to other CPUs.
>
> It's time to start discussing this problem again now that we're the
> other side of the merge window.
>
> I've been thinking about the lock in the GIC code. Do we actually need
> this lock in gic_raise_softirq(), or could we move this lock into the
> higher level code?
It could be a rw lock as you say.
> Let's consider the bL switcher.
>
> I think the bL switcher is racy wrt CPU hotplug at the moment. What
> happens if we're sleeping in wait_for_completion(&inbound_alive) and
> CPU hotplug unplugs the outgoing CPU? What protects us against
> this scenario? I can't see anything in bL_switch_to() which ensures
> that CPU hotplug can't run.
True. The actual switch would then be suspended in mid air until that
CPU is plugged back in. The inbound CPU would wait at mcpm_entry_gated
until the outbound CPU comes back to open the gate. There wouldn't be
much harm besides the minor fact that the inbound CPU would be wasting
more power while looping on a WFE compared to its previously disabled
state. I'm still debating if this is worth fixing.
> Let's assume that this rather glaring bug has been fixed, and that CPU
> hotplug can't run in parallel with the bL switcher (and hence
> gic_migrate_target() can't run concurrently with a CPU being taken
> offline.)
I'm still trying to figure out how this might happen. At the point
where gic_migrate_target() is called, IRQs are disabled and nothing can
prevent the switch from happening anymore. Any IPI attempting to stop
that CPU for hotplug would be pending until the inbound CPU
eventually honors it.
> If we have that guarantee, then we don't need to take a lock when sending
> the completion IPI - we would know that while a CPU is being taken down,
> the bL switcher could not run. What we now need is a lock-less way to
> raise an IPI.
>
> Now, is the locking between the normal IPI paths and the bL switcher
> something that is specific to the interrupt controller, or should generic
> code care about it? I think it's something generic code should care about
> - and I believe that would make life a little easier.
Well... The only reason for having a lock there is to ensure that no
IPIs are sent to the outbound CPU after gic_cpu_map[] has been modified
and pending IPIs on the outbound CPU have been migrated to the inbound
CPU. I think this is pretty specific to the GIC driver code.
If there was a way for gic_migrate_target() to be sure no other CPUs are
using the old gic_cpu_map value any longer then no lock would be needed
in gic_raise_softirq(). The code in gic_migrate_target() would only
have to wait until it is safe to migrate pending IPIs on the outbound
CPU without missing any.
> The current bL switcher idea is to bring the new CPU up, disable IRQs
> and FIQs on the outgoing CPU, change the IRQ/IPI targets, then read
> any pending SGIs and raise them on the new CPU. But what about any
> pending SPIs? These look like they could be lost.
SPIs are raised and cleared independently of their distribution config.
So the only thing that gic_migrate_target() has to do is to disable the
distribution target for the outbound CPU and enable the target for the
inbound CPU. This way unserviced IRQs become pending on the outbound
CPU automatically. The only other part that plays with targets is
gic_set_affinity() and irq_controller_lock protects against concurrency
here.
> How about this for an idea instead - the bL switcher code:
>
> - brings the new CPU online.
> - disables IRQs and FIQs.
> - takes the IPI lock, which prevents new IPIs being raised.
> - re-targets IRQs and IPIs onto the new CPU.
> - releases the IPI lock.
But aren't we trying to get rid of that IPI lock to start with? I'd
personally love to remove it -- it's been nagging me since I initially
added it.
> - re-enables IRQs and FIQs.
> - polls the IRQ controller to wait for any remaining IRQs and IPIs
> to be delivered.
Poll for how long? How can you be sure no other CPU is in the process of
targeting an IPI to the outbound CPU? With things like the FIQ
debugger coming to mainline or even JTAG-based debuggers, this could
represent an indeterminate amount of time if the sending CPU is stopped
at the right moment.
That notwithstanding, I'm afraid this would open a big can of worms.
The CPU would no longer have functional interrupts since they're all
directed to the inbound CPU at that point. Any IRQ controls are now
directed to the new CPU and things like self-IPIs (first scenario that
comes to my mind) would no longer produce the expected result. I'd much
prefer to get over with the switch ASAP at that point rather than
letting the outbound CPU run much longer in a degraded state.
> - re-disables IRQs and FIQs (which shouldn't be received anyway since
> > they're now targeting the new CPU.)
> - shuts down the tick device.
> - completes the switch
>
> What this means is that we're not needing to have complex code in the
> interrupt controllers to re-raise interrupts on other CPUs, and we
> don't need a lock in the interrupt controller code synchronising IPI
> raising with the bL switcher.
>
> I'd also suggest that this IPI lock should _not_ be a spinlock - it
> should be a read/write spin lock - it's perfectly acceptable to have
> multiple CPUs raising IPIs to each other, but it is not acceptable to
> have any CPU raising an IPI when the bL switcher is modifying the IRQ
> targets. That fits the rwlock semantics.
>
> What this means is that gic_raise_softirq() should again become a lock-
> less function, which opens the door to using an IPI to complete the
> CPU hot-unplug operation cleanly.
>
> Thoughts (especially from Nico)?
I completely agree with the r/w spinlock. Something like this ought to
be sufficient to make gic_raise_softirq() reentrant which is the issue
here, right? I've been stress-testing it for a while with no problems
so far.
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 4634cf7d0e..3404c6bc12 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -80,6 +80,9 @@ static DEFINE_RAW_SPINLOCK(irq_controller_lock);
#define NR_GIC_CPU_IF 8
static u8 gic_cpu_map[NR_GIC_CPU_IF] __read_mostly;
+/* This allows for multiple concurrent users of gic_cpu_map[] */
+static DEFINE_RWLOCK(gic_cpu_map_lock);
+
/*
* Supported arch specific GIC irq extension.
* Default make them NULL.
@@ -627,7 +630,7 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
int cpu;
unsigned long flags, map = 0;
- raw_spin_lock_irqsave(&irq_controller_lock, flags);
+ read_lock_irqsave(&gic_cpu_map_lock, flags);
/* Convert our logical CPU mask into a physical one. */
for_each_cpu(cpu, mask)
@@ -642,7 +645,7 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
/* this always happens on GIC0 */
writel_relaxed(map << 16 | irq, gic_data_dist_base(&gic_data[0]) + GIC_DIST_SOFTINT);
- raw_spin_unlock_irqrestore(&irq_controller_lock, flags);
+ read_unlock_irqrestore(&gic_cpu_map_lock, flags);
}
#endif
@@ -711,6 +714,7 @@ void gic_migrate_target(unsigned int new_cpu_id)
cur_target_mask = 0x01010101 << cur_cpu_id;
ror_val = (cur_cpu_id - new_cpu_id) & 31;
+ write_lock(&gic_cpu_map_lock);
raw_spin_lock(&irq_controller_lock);
/* Update the target interface for this logical CPU */
@@ -731,6 +735,7 @@ void gic_migrate_target(unsigned int new_cpu_id)
}
}
+ write_unlock(&gic_cpu_map_lock);
raw_spin_unlock(&irq_controller_lock);
/*
On Wed, Feb 25, 2015 at 11:47:48AM -0500, Nicolas Pitre wrote:
> I completely agree with the r/w spinlock. Something like this ought to
> be sufficient to make gic_raise_softirq() reentrant which is the issue
> here, right? I've been stress-testing it for a while with no problems
> so far.
No. The issue is that we need a totally lockless way to raise an IPI
during CPU hot-unplug, so we can raise an IPI in __cpu_die() to tell
the __cpu_kill() code that it's safe to proceed to platform code.
As soon as that IPI has been received, the receiving CPU can decide
to cut power to the dying CPU. So, it's entirely possible that power
could be lost on the dying CPU before the unlock has become visible.
It's a catch-22 - the reason we're sending the IPI is for synchronisation,
but right now we need another form of synchronisation because we're
using a form of synchronisation...
We could just use the spin-and-poll solution instead of an IPI, but
I really don't like that - when you see the complexity needed to
re-initialise it each time, it quickly becomes very yucky because
there is no well defined order between __cpu_die() and __cpu_kill()
being called by the two respective CPUs.
The last patch I saw doing that had multiple bits to indicate success
and timeout, and rather a lot of complexity to recover from failures,
and reinitialise state for a second CPU going down.
On Wed, 25 Feb 2015, Russell King - ARM Linux wrote:
> On Wed, Feb 25, 2015 at 11:47:48AM -0500, Nicolas Pitre wrote:
> > I completely agree with the r/w spinlock. Something like this ought to
> > be sufficient to make gic_raise_softirq() reentrant which is the issue
> > here, right? I've been stress-testing it for a while with no problems
> > so far.
>
> No. The issue is that we need a totally lockless way to raise an IPI
> during CPU hot-unplug, so we can raise an IPI in __cpu_die() to tell
> the __cpu_kill() code that it's safe to proceed to platform code.
>
> As soon as that IPI has been received, the receiving CPU can decide
> to cut power to the dying CPU. So, it's entirely possible that power
> could be lost on the dying CPU before the unlock has become visible.
However... wouldn't it be fragile to rely on every interrupt
controller driver never modifying RAM in its IPI sending path? That
would constitute a strange requirement on IRQ controller drivers that
was never spelled out before.
> It's a catch-22 - the reason we're sending the IPI is for synchronisation,
> but right now we need another form of synchronisation because we're
> using a form of synchronisation...
Can't the dying CPU pull the plug by itself in most cases?
> We could just use the spin-and-poll solution instead of an IPI, but
> I really don't like that - when you see the complexity needed to
> re-initialise it each time, it quickly becomes very yucky because
> there is no well defined order between __cpu_die() and __cpu_kill()
> being called by the two respective CPUs.
>
> The last patch I saw doing that had multiple bits to indicate success
> and timeout, and rather a lot of complexity to recover from failures,
> and reinitialise state for a second CPU going down.
What about a per CPU state? That would at least avoid the need to
serialize things across CPUs. If only one CPU may write its state, that
should eliminate the need for any kind of locking.
Nicolas
On Wed, 25 Feb 2015, Nicolas Pitre wrote:
> On Wed, 25 Feb 2015, Russell King - ARM Linux wrote:
>
> > We could just use the spin-and-poll solution instead of an IPI, but
> > I really don't like that - when you see the complexity needed to
> > re-initialise it each time, it quickly becomes very yucky because
> > there is no well defined order between __cpu_die() and __cpu_kill()
> > being called by the two respective CPUs.
> >
> > The last patch I saw doing that had multiple bits to indicate success
> > and timeout, and rather a lot of complexity to recover from failures,
> > and reinitialise state for a second CPU going down.
>
> What about a per CPU state? That would at least avoid the need to
> serialize things across CPUs. If only one CPU may write its state, that
> should eliminate the need for any kind of locking.
Something like the following? If according to $subject it is the
complete() usage that has problems, then this replacement certainly has
it removed while keeping things simple. And I doubt CPU hotplug is
performance critical so a simple polling is certainly good enough.
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 86ef244c5a..f253f79a34 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -213,7 +213,7 @@ int __cpu_disable(void)
return 0;
}
-static DECLARE_COMPLETION(cpu_died);
+static struct cpumask dead_cpus;
/*
* called on the thread which is asking for a CPU to be shutdown -
@@ -221,7 +221,14 @@ static DECLARE_COMPLETION(cpu_died);
*/
void __cpu_die(unsigned int cpu)
{
- if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) {
+ int i;
+
+ for (i = 5 * HZ; i > 0; i -= 10) {
+ if (cpumask_test_cpu(cpu, &dead_cpus))
+ break;
+ schedule_timeout_uninterruptible(10);
+ }
+ if (i <= 0) {
pr_err("CPU%u: cpu didn't die\n", cpu);
return;
}
@@ -267,12 +274,12 @@ void __ref cpu_die(void)
* this returns, power and/or clocks can be removed at any point
* from this CPU and its cache by platform_cpu_kill().
*/
- complete(&cpu_died);
+ cpumask_set_cpu(cpu, &dead_cpus);
/*
- * Ensure that the cache lines associated with that completion are
+ * Ensure that the cache line associated with that dead_cpus update is
* written out. This covers the case where _this_ CPU is doing the
- * powering down, to ensure that the completion is visible to the
+ * powering down, to ensure that the update is visible to the
* CPU waiting for this one.
*/
flush_cache_louis();
@@ -349,6 +356,8 @@ asmlinkage void secondary_start_kernel(void)
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
+ cpumask_clear_cpu(cpu, &dead_cpus);
+
cpu_init();
pr_debug("CPU%u: Booted secondary processor\n", cpu);
On Wed, Feb 25, 2015 at 03:16:59PM -0500, Nicolas Pitre wrote:
> On Wed, 25 Feb 2015, Nicolas Pitre wrote:
>
> > On Wed, 25 Feb 2015, Russell King - ARM Linux wrote:
> >
> > > We could just use the spin-and-poll solution instead of an IPI, but
> > > I really don't like that - when you see the complexity needed to
> > > re-initialise it each time, it quickly becomes very yucky because
> > > there is no well defined order between __cpu_die() and __cpu_kill()
> > > being called by the two respective CPUs.
> > >
> > > The last patch I saw doing that had multiple bits to indicate success
> > > and timeout, and rather a lot of complexity to recover from failures,
> > > and reinitialise state for a second CPU going down.
> >
> > What about a per CPU state? That would at least avoid the need to
> > serialize things across CPUs. If only one CPU may write its state, that
> > should eliminate the need for any kind of locking.
>
> Something like the following? If according to $subject it is the
> complete() usage that has problems, then this replacement certainly has
> it removed while keeping things simple. And I doubt CPU hotplug is
> performance critical so a simple polling is certainly good enough.
For whatever it is worth, I am proposing the patch below for common code.
Works on x86. (Famous last words...)
Thanx, Paul
------------------------------------------------------------------------
smpboot: Add common code for notification from dying CPU
RCU ignores offlined CPUs, so they cannot safely run RCU read-side code.
(They -can- use SRCU, but not RCU.) This means that any use of RCU
during or after the call to arch_cpu_idle_dead() is unsafe. Unfortunately,
commit 2ed53c0d6cc99 added a complete() call, which will contain RCU
read-side critical sections if there is a task waiting to be awakened.
Which, as it turns out, there almost never is. In my qemu/KVM testing,
the to-be-awakened task is not yet asleep more than 99.5% of the time.
In current mainline, failure is even harder to reproduce, requiring a
virtualized environment that delays the outgoing CPU by at least three
jiffies between the time it exits its stop_machine() task at CPU_DYING
time and the time it calls arch_cpu_idle_dead() from the idle loop.
However, this problem really can occur, especially in virtualized
environments, and therefore really does need to be fixed.
This suggests moving back to the polling loop, but using a much shorter
wait, with gentle exponential backoff instead of the old 100-millisecond
wait. Most of the time, the loop will exit without waiting at all,
and almost all of the remaining uses will wait only five microseconds.
If the outgoing CPU is preempted, a loop will wait one jiffy, then
increase the wait by a factor of 11/10ths, rounding up. As before, there
is a five-second timeout.
This commit therefore provides common-code infrastructure to do the
dying-to-surviving CPU handoff in a safe manner. This code also
provides an indication at CPU-online of whether the CPU to be onlined
previously timed out on offline. The new cpu_check_up_prepare() function
returns -EBUSY if this CPU previously took more than five seconds to
go offline, or -EAGAIN if it has not yet managed to go offline. The
rationale for -EAGAIN is that it might still be preempted, so an additional
wait might well find it correctly offlined. Architecture-specific code
can decide how to handle these conditions. Systems in which CPUs take
themselves completely offline might respond to an -EBUSY return as if
it was a zero (success) return. Systems in which the surviving CPU must
take some action might take it at this time, or might simply mark the
other CPU as unusable.
Note that architectures that take the easy way out and simply pass the
-EBUSY and -EAGAIN upwards will change the sysfs API.
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 1d58c7a6ed72..ef87e3c2451a 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -97,6 +97,8 @@ enum {
* must not fail */
#define CPU_DYING_IDLE 0x000B /* CPU (unsigned)v dying, reached
* idle loop. */
+#define CPU_BROKEN 0x000C /* CPU (unsigned)v did not die properly,
+ * perhaps due to preemption. */
/* Used for CPU hotplug events occurring while tasks are frozen due to a suspend
* operation in progress
@@ -275,4 +277,12 @@ void arch_cpu_idle_dead(void);
DECLARE_PER_CPU(bool, cpu_dead_idle);
+int cpu_report_state(int cpu);
+int cpu_check_up_prepare(int cpu);
+void cpu_set_state_online(int cpu);
+#ifdef CONFIG_HOTPLUG_CPU
+bool cpu_wait_death(unsigned int cpu);
+bool cpu_report_death(void);
+#endif /* #ifdef CONFIG_HOTPLUG_CPU */
+
#endif /* _LINUX_CPU_H_ */
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index f032fb5284e3..e940f68008db 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -4,6 +4,7 @@
#include <linux/cpu.h>
#include <linux/err.h>
#include <linux/smp.h>
+#include <linux/delay.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/slab.h>
@@ -312,3 +313,139 @@ void smpboot_unregister_percpu_thread(struct smp_hotplug_thread *plug_thread)
put_online_cpus();
}
EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread);
+
+static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD);
+
+/*
+ * Called to poll specified CPU's state, for example, when waiting for
+ * a CPU to come online.
+ */
+int cpu_report_state(int cpu)
+{
+ return atomic_read(&per_cpu(cpu_hotplug_state, cpu));
+}
+
+/*
+ * If CPU has died properly, set its state to CPU_UP_PREPARE and
+ * return success. Otherwise, return -EBUSY if the CPU died after
+ * cpu_wait_death() timed out. And yet otherwise again, return -EAGAIN
+ * if cpu_wait_death() timed out and the CPU still hasn't gotten around
+ * to dying. In the latter two cases, the CPU might not be set up
+ * properly, but it is up to the arch-specific code to decide.
+ * Finally, -EIO indicates an unanticipated problem.
+ */
+int cpu_check_up_prepare(int cpu)
+{
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU)) {
+ atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_UP_PREPARE);
+ return 0;
+ }
+
+ switch (atomic_read(&per_cpu(cpu_hotplug_state, cpu))) {
+
+ case CPU_POST_DEAD:
+
+		/* The CPU died properly, so just start it up again. */
+		atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_UP_PREPARE);
+		return 0;
+
+	case CPU_DEAD:
+
+		/*
+		 * Timeout during CPU death, so let caller know.
+		 * The outgoing CPU completed its processing, but after
+		 * cpu_wait_death() timed out and reported the error. The
+		 * caller is free to proceed, in which case the state
+		 * will be reset properly by cpu_set_state_online().
+		 * Proceeding despite this -EBUSY return makes sense
+		 * for systems where the outgoing CPUs take themselves
+		 * offline, with no post-death manipulation required from
+		 * a surviving CPU.
+		 */
+		return -EBUSY;
+
+	case CPU_BROKEN:
+
+		/*
+		 * The most likely reason we got here is that there was
+		 * a timeout during CPU death, and the outgoing CPU never
+		 * did complete its processing. This could happen on
+		 * a virtualized system if the outgoing VCPU gets preempted
+		 * for more than five seconds, and the user attempts to
+		 * immediately online that same CPU. Trying again later
+		 * might return -EBUSY above, hence -EAGAIN.
+		 */
+		return -EAGAIN;
+
+	default:
+
+		/* Should not happen. Famous last words. */
+		return -EIO;
+	}
+}
+
+/*
+ * Mark the specified CPU online.
+ */
+void cpu_set_state_online(int cpu)
+{
+	(void)atomic_xchg(&per_cpu(cpu_hotplug_state, cpu), CPU_ONLINE);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+
+/*
+ * Wait for the specified CPU to exit the idle loop and die.
+ */
+bool cpu_wait_death(unsigned int cpu)
+{
+	int jf_left = 5 * HZ;
+	int oldstate;
+	bool ret = true;
+	int sleep_jf = 1;
+
+	might_sleep();
+
+	/* The outgoing CPU will normally get done quite quickly. */
+	if (atomic_read(&per_cpu(cpu_hotplug_state, cpu)) == CPU_DEAD)
+		goto update_state;
+	udelay(5);
+
+	/* But if the outgoing CPU dawdles, wait increasingly long times. */
+	while (atomic_read(&per_cpu(cpu_hotplug_state, cpu)) != CPU_DEAD) {
+		schedule_timeout_uninterruptible(sleep_jf);
+		jf_left -= sleep_jf;
+		if (jf_left <= 0)
+			break;
+		sleep_jf = DIV_ROUND_UP(sleep_jf * 11, 10);
+	}
+update_state:
+	oldstate = atomic_read(&per_cpu(cpu_hotplug_state, cpu));
+	if (oldstate == CPU_DEAD) {
+		/* Outgoing CPU died normally, update state. */
+		smp_mb(); /* atomic_read() before update. */
+		atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_POST_DEAD);
+	} else {
+		/* Outgoing CPU still hasn't died, set state accordingly. */
+		if (atomic_cmpxchg(&per_cpu(cpu_hotplug_state, cpu),
+				   oldstate, CPU_BROKEN) != oldstate)
+			goto update_state;
+		ret = false;
+	}
+	return ret;
+}
+
+/*
+ * Called by the outgoing CPU to report its successful death. Return
+ * false if this report follows the surviving CPU's timing out.
+ */
+bool cpu_report_death(void)
+{
+	int oldstate;
+	int cpu = smp_processor_id();
+
+	oldstate = atomic_xchg(&per_cpu(cpu_hotplug_state, cpu), CPU_DEAD);
+	return oldstate == CPU_ONLINE;
+}
+
+#endif /* #ifdef CONFIG_HOTPLUG_CPU */
On Wed, 2015-02-25 at 11:47 -0500, Nicolas Pitre wrote:
> On Wed, 25 Feb 2015, Russell King - ARM Linux wrote:
>
> > On Thu, Feb 05, 2015 at 04:11:00PM +0000, Russell King - ARM Linux wrote:
> > > On Thu, Feb 05, 2015 at 06:29:18AM -0800, Paul E. McKenney wrote:
> > > > Works for me, assuming no hidden uses of RCU in the IPI code. ;-)
> > >
> > > Sigh... I kinda knew it wouldn't be this simple. The GIC code which
> > > actually raises the IPI takes a raw spinlock, so it's not going to be
> > > this simple - there's a small theoretical window where we have taken
> > > this lock, written the register to send the IPI, and then dropped the
> > > lock - the update to the lock to release it could get lost if the
> > > CPU power is quickly cut at that point.
> > >
> > > Also, we _do_ need the second cache flush in place to ensure that the
> > > unlock is seen to other CPUs.
> >
> > It's time to start discussing this problem again now that we're the
> > other side of the merge window.
> >
> > I've been thinking about the lock in the GIC code. Do we actually need
> > this lock in gic_raise_softirq(), or could we move this lock into the
> > higher level code?
>
> It could be a rw lock as you say.
>
> > Let's consider the bL switcher.
> >
> > I think the bL switcher is racy wrt CPU hotplug at the moment. What
> > happens if we're sleeping in wait_for_completion(&inbound_alive) and
> > CPU hotplug unplugs the outgoing CPU? What protects us against
> > this scenario? I can't see anything in bL_switch_to() which ensures
> > that CPU hotplug can't run.
>
> True. The actual switch would then be suspended in mid air until that
> CPU is plugged back in. The inbound CPU would wait at mcpm_entry_gated
> until the outbound CPU comes back to open the gate. There wouldn't be
> much harm besides the minor fact that the inbound CPU would be wasting
> more power while looping on a WFE compared to its previously disabled
> state. I'm still debating if this is worth fixing.
>
> > Let's assume that this rather glaring bug has been fixed, and that CPU
> > hotplug can't run in parallel with the bL switcher (and hence
> > gic_migrate_target() can't run concurrently with a CPU being taken
> > offline.)
>
> I'm still trying to figure out how this might happen. At the point
> where gic_migrate_target() is called, IRQs are disabled and nothing can
> prevent the switch from happening anymore. Any IPI attempting to stop
> that CPU for hotplug would be pending until the inbound CPU
> eventually honors it.
>
> > If we have that guarantee, then we don't need to take a lock when sending
> > the completion IPI - we would know that while a CPU is being taken down,
> > the bL switcher could not run. What we now need is a lock-less way to
> > raise an IPI.
> >
> > Now, is the locking between the normal IPI paths and the bL switcher
> > something that is specific to the interrupt controller, or should generic
> > code care about it? I think it's something generic code should care about
> > - and I believe that would make life a little easier.
>
> Well... The only reason for having a lock there is to ensure that no
> IPIs are sent to the outbound CPU after gic_cpu_map[] has been modified
> and pending IPIs on the outbound CPU have been migrated to the inbound
> CPU. I think this is pretty specific to the GIC driver code.
>
> If there was a way for gic_migrate_target() to be sure no other CPUs are
> using the old gic_cpu_map value any longer then no lock would be needed
> in gic_raise_softirq(). The code in gic_migrate_target() would only
> have to wait until it is safe to migrate pending IPIs on the outbound
> CPU without missing any.
>
> > The current bL switcher idea is to bring the new CPU up, disable IRQs
> > and FIQs on the outgoing CPU, change the IRQ/IPI targets, then read
> > any pending SGIs and raise them on the new CPU. But what about any
> > pending SPIs? These look like they could be lost.
>
> SPIs are raised and cleared independently of their distribution config.
> So the only thing that gic_migrate_target() has to do is to disable the
> distribution target for the outbound CPU and enable the target for the
> inbound CPU. This way unserviced IRQs become pending on the outbound
> CPU automatically. The only other part that plays with targets is
> gic_set_affinity() and irq_controller_lock protects against concurrency
> here.
>
> > How about this for an idea instead - the bL switcher code:
> >
> > - brings the new CPU online.
> > - disables IRQs and FIQs.
> > - takes the IPI lock, which prevents new IPIs being raised.
> > - re-targets IRQs and IPIs onto the new CPU.
> > - releases the IPI lock.
>
> But aren't we trying to get rid of that IPI lock to start with? I'd
> personally love to remove it -- it's been nagging me since I initially
> added it.
>
> > - re-enables IRQs and FIQs.
> > - polls the IRQ controller to wait for any remaining IRQs and IPIs
> > to be delivered.
>
> Poll for how long? How can you be sure no other CPU is in the process of
> targeting an IPI to the outbound CPU? With things like the FIQ
> debugger coming to mainline or even JTAG-based debuggers, this could
> represent an indeterminate amount of time if the sending CPU is stopped
> at the right moment.
>
> That notwithstanding, I'm afraid this would open a big can of worms.
> The CPU would no longer have functional interrupts since they're all
> directed to the inbound CPU at that point. Any IRQ controls are now
> directed to the new CPU and things like self-IPIs (first scenario that
> comes to my mind) would no longer produce the expected result. I'd much
> prefer to get over with the switch ASAP at that point rather than
> letting the outbound CPU run much longer in a degraded state.
>
> > - re-disables IRQs and FIQs (which shouldn't be received anyway since
> > they're now targeting the new CPU.)
> > - shuts down the tick device.
> > - completes the switch
> >
> > What this means is that we're not needing to have complex code in the
> > interrupt controllers to re-raise interrupts on other CPUs, and we
> > don't need a lock in the interrupt controller code synchronising IPI
> > raising with the bL switcher.
> >
> > I'd also suggest that this IPI lock should _not_ be a spinlock - it
> > should be a read/write spin lock - it's perfectly acceptable to have
> > multiple CPUs raising IPIs to each other, but it is not acceptable to
> > have any CPU raising an IPI when the bL switcher is modifying the IRQ
> > targets. That fits the rwlock semantics.
> >
> > What this means is that gic_raise_softirq() should again become a lock-
> > less function, which opens the door to using an IPI to complete the
> > CPU hot-unplug operation cleanly.
> >
> > Thoughts (especially from Nico)?
>
> I completely agree with the r/w spinlock. Something like this ought to
> be sufficient to make gic_raise_softirq() reentrant which is the issue
> here, right? I've been stress-testing it for a while with no problems
> so far.
Do you fancy trying patches 1 and 2 from this series?
http://thread.gmane.org/gmane.linux.kernel/1881415
The recent FIQ work required gic_raise_softirq() to be reentrant, so I
came up with patches similar to yours. As soon as we tease this code out
into a separate lock, people will observe that the lock can melt away
entirely if the b.L switcher is not compiled in, and can make that the
next move...
>
> diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
> index 4634cf7d0e..3404c6bc12 100644
> --- a/drivers/irqchip/irq-gic.c
> +++ b/drivers/irqchip/irq-gic.c
> @@ -80,6 +80,9 @@ static DEFINE_RAW_SPINLOCK(irq_controller_lock);
> #define NR_GIC_CPU_IF 8
> static u8 gic_cpu_map[NR_GIC_CPU_IF] __read_mostly;
>
> +/* This allows for multiple concurrent users of gic_cpu_map[] */
> +static DEFINE_RWLOCK(gic_cpu_map_lock);
> +
> /*
> * Supported arch specific GIC irq extension.
> * Default make them NULL.
> @@ -627,7 +630,7 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
> int cpu;
> unsigned long flags, map = 0;
>
> - raw_spin_lock_irqsave(&irq_controller_lock, flags);
> + read_lock_irqsave(&gic_cpu_map_lock, flags);
>
> /* Convert our logical CPU mask into a physical one. */
> for_each_cpu(cpu, mask)
> @@ -642,7 +645,7 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
> /* this always happens on GIC0 */
> writel_relaxed(map << 16 | irq, gic_data_dist_base(&gic_data[0]) + GIC_DIST_SOFTINT);
>
> - raw_spin_unlock_irqrestore(&irq_controller_lock, flags);
> + read_unlock_irqrestore(&gic_cpu_map_lock, flags);
> }
> #endif
>
> @@ -711,6 +714,7 @@ void gic_migrate_target(unsigned int new_cpu_id)
> cur_target_mask = 0x01010101 << cur_cpu_id;
> ror_val = (cur_cpu_id - new_cpu_id) & 31;
>
> + write_lock(&gic_cpu_map_lock);
> raw_spin_lock(&irq_controller_lock);
>
> /* Update the target interface for this logical CPU */
> @@ -731,6 +735,7 @@ void gic_migrate_target(unsigned int new_cpu_id)
> }
> }
>
> + write_unlock(&gic_cpu_map_lock);
> raw_spin_unlock(&irq_controller_lock);
>
> /*
> --
On Thu, 26 Feb 2015, Daniel Thompson wrote:
> Do you fancy trying patch 1 and 2 from this series?
> http://thread.gmane.org/gmane.linux.kernel/1881415
>
> The recent FIQ work required gic_raise_softirq() to be reentrant so I
> came up with similar patches to yours. As soon as we tease out this code
> into a separate lock people observe that the lock can melt away entirely
> if the b.L switcher is not compiled in and make that the next move...
Patch #1 is wrong. It provably opens a race. I'll reply on that thread
to explain why. Patch #2 is fine.
Nicolas