2016-12-15 10:36:09

by Mason

[permalink] [raw]
Subject: Re: Linux crashes when trying to online secondary core

On 14/12/2016 18:47, Mason wrote:

> On 14/12/2016 18:08, Thomas Gleixner wrote:
>
>> On Wed, 14 Dec 2016, Mason wrote:
>>
>>> I'm seeing Linux v4.9 crash (dereferencing NULL) when I try to online
>>> the secondary core, after putting it offline.
>>
>> Does the patch below fix the issue?
>>
>> Thanks,
>>
>> tglx
>>
>> 8<---------------
>>
>> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
>> index 22acee76cf4c..2594c287b078 100644
>> --- a/include/linux/cpuhotplug.h
>> +++ b/include/linux/cpuhotplug.h
>> @@ -101,7 +101,6 @@ enum cpuhp_state {
>> CPUHP_AP_ARM_L2X0_STARTING,
>> CPUHP_AP_ARM_ARCH_TIMER_STARTING,
>> CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
>> - CPUHP_AP_DUMMY_TIMER_STARTING,
>> CPUHP_AP_JCORE_TIMER_STARTING,
>> CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
>> CPUHP_AP_ARM_TWD_STARTING,
>> @@ -111,6 +110,7 @@ enum cpuhp_state {
>> CPUHP_AP_MARCO_TIMER_STARTING,
>> CPUHP_AP_MIPS_GIC_TIMER_STARTING,
>> CPUHP_AP_ARC_TIMER_STARTING,
>> + CPUHP_AP_DUMMY_TIMER_STARTING,
>> CPUHP_AP_KVM_STARTING,
>> CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING,
>> CPUHP_AP_KVM_ARM_VGIC_STARTING,
>
> $ patch -p1 < tglx.patch
> patching file include/linux/cpuhotplug.h
> Hunk #1 succeeded at 80 (offset -21 lines).
> Hunk #2 succeeded at 89 (offset -21 lines).
>
> It does seem to fix the problem:
>
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> SMC called with a0=0x00[000001 a1=0x00000121 a2=0x00000005 a3 =0xc01189b4 0x00000121
> [1][flow/suspend3.c:39] CPU 1 die: jumping6 to. post-boot WFE
> 402826] CPU1: shutdown
> SMC called with a0=0x00000001 a1=0x00000122 a2=0x00000000 a3=0x00000000 0x00000122
> [0][flow/suspend.c:82] Killing core1
> armor+++ armor: core 1 booted, entering wfe...
> # echo 1 > /sys/devices/system/cpu/cpu1/online
> [ 215.692700] tango_boot_secondary from __cpu_up
> SMC called with a0=0x80101500 a1=0x00000105 a2=0x00000000 a3=0x00000000 0x00000105
> [ 215.704494] tango_set_aux_boot_addr=0
> SMC called with a0=0x00000001 a1=0x00000104 a2=0x00000000 a3=0x00000000 0x00000104
> [0][flow/smc_handler.c:127] waking up CPU1
> [ 215.719308] tango_start_aux_core=0
>
>
> I reverted your patch, and the kernel blows up again.
>
> So what's the problem, and how does your patch solve it?

Link to the original report:
https://marc.info/?l=linux-arm-kernel&m=148173152524746&w=2

Forgot to CC Robin Murphy, who had provided valuable input
in similar circumstances a few months back.

Also add LKML, since this doesn't appear to be ARM-specific.

Do I need to specify which device tree I was using?

Regards.


Subject: [tip:timers/urgent] tick/broadcast: Prevent NULL pointer dereference

Commit-ID: c1a9eeb938b5433947e5ea22f89baff3182e7075
Gitweb: http://git.kernel.org/tip/c1a9eeb938b5433947e5ea22f89baff3182e7075
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 15 Dec 2016 12:10:37 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 15 Dec 2016 12:25:13 +0100

tick/broadcast: Prevent NULL pointer dereference

When a disfunctional timer, e.g. dummy timer, is installed, the tick core
tries to setup the broadcast timer.

If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.

Reported-by: Mason <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Peter Zijlstra <[email protected]>,
Cc: Sebastian Frias <[email protected]>
Cc: Thibaud Cornic <[email protected]>
Cc: Robin Murphy <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
kernel/time/tick-broadcast.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index f6aae79..d2a20e8 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -871,6 +871,9 @@ void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
{
int cpu = smp_processor_id();

+ if (!bc)
+ return;
+
/* Set it up only once ! */
if (bc->event_handler != tick_handle_oneshot_broadcast) {
int was_periodic = clockevent_state_periodic(bc);

Subject: [tip:smp/urgent] clocksource/dummy_timer: Move hotplug callback after the real timers

Commit-ID: 9bf11ecce5a2758e5a097c2f3a13d08552d0d6f9
Gitweb: http://git.kernel.org/tip/9bf11ecce5a2758e5a097c2f3a13d08552d0d6f9
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 15 Dec 2016 12:01:05 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 15 Dec 2016 12:09:20 +0100

clocksource/dummy_timer: Move hotplug callback after the real timers

When the dummy timer callback is invoked before the real timer callbacks,
then it tries to install that timer for the starting CPU. If the platform
does not have a broadcast timer installed the installation fails with a
kernel crash. The crash happens due to a unconditional deference of the non
available broadcast device. This needs to be fixed in the timer core code.

But even when this is fixed in the core code then installing the dummy
timer before the real timers is a pointless exercise.

Move it to the end of the callback list.

Fixes: 00c1d17aab51 ("clocksource/dummy_timer: Convert to hotplug state machine")
Reported-and-tested-by: Mason <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Anna-Maria Gleixner <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Peter Zijlstra <[email protected]>,
Cc: Sebastian Frias <[email protected]>
Cc: Thibaud Cornic <[email protected]>
Cc: Robin Murphy <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/linux/cpuhotplug.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 22acee7..2ab7bf5 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -101,7 +101,6 @@ enum cpuhp_state {
CPUHP_AP_ARM_L2X0_STARTING,
CPUHP_AP_ARM_ARCH_TIMER_STARTING,
CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
- CPUHP_AP_DUMMY_TIMER_STARTING,
CPUHP_AP_JCORE_TIMER_STARTING,
CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
CPUHP_AP_ARM_TWD_STARTING,
@@ -115,6 +114,8 @@ enum cpuhp_state {
CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING,
CPUHP_AP_KVM_ARM_VGIC_STARTING,
CPUHP_AP_KVM_ARM_TIMER_STARTING,
+ /* Must be the last timer callback */
+ CPUHP_AP_DUMMY_TIMER_STARTING,
CPUHP_AP_ARM_XEN_STARTING,
CPUHP_AP_ARM_CORESIGHT_STARTING,
CPUHP_AP_ARM_CORESIGHT4_STARTING,

2016-12-15 12:02:23

by Mark Rutland

[permalink] [raw]
Subject: Re: Linux crashes when trying to online secondary core

On Thu, Dec 15, 2016 at 11:35:12AM +0100, Mason wrote:
> On 14/12/2016 18:47, Mason wrote:
> > On 14/12/2016 18:08, Thomas Gleixner wrote:
> >> Does the patch below fix the issue?

> >> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> >> index 22acee76cf4c..2594c287b078 100644
> >> --- a/include/linux/cpuhotplug.h
> >> +++ b/include/linux/cpuhotplug.h
> >> @@ -101,7 +101,6 @@ enum cpuhp_state {
> >> CPUHP_AP_ARM_L2X0_STARTING,
> >> CPUHP_AP_ARM_ARCH_TIMER_STARTING,
> >> CPUHP_AP_ARM_GLOBAL_TIMER_STARTING,
> >> - CPUHP_AP_DUMMY_TIMER_STARTING,
> >> CPUHP_AP_JCORE_TIMER_STARTING,
> >> CPUHP_AP_EXYNOS4_MCT_TIMER_STARTING,
> >> CPUHP_AP_ARM_TWD_STARTING,
> >> @@ -111,6 +110,7 @@ enum cpuhp_state {
> >> CPUHP_AP_MARCO_TIMER_STARTING,
> >> CPUHP_AP_MIPS_GIC_TIMER_STARTING,
> >> CPUHP_AP_ARC_TIMER_STARTING,
> >> + CPUHP_AP_DUMMY_TIMER_STARTING,
> >> CPUHP_AP_KVM_STARTING,
> >> CPUHP_AP_KVM_ARM_VGIC_INIT_STARTING,
> >> CPUHP_AP_KVM_ARM_VGIC_STARTING,

> > It does seem to fix the problem:

> > I reverted your patch, and the kernel blows up again.
> >
> > So what's the problem, and how does your patch solve it?
>
> Link to the original report:
> https://marc.info/?l=linux-arm-kernel&m=148173152524746&w=2
>
> Forgot to CC Robin Murphy, who had provided valuable input
> in similar circumstances a few months back.
>
> Also add LKML, since this doesn't appear to be ARM-specific.
>
> Do I need to specify which device tree I was using?

This is already fixed in the linux-tip tree, with commit messages
describing the fix.

It's specific to a few clocksources, due to their hotplug callbacks
occuring later than the dummy timer. That triggers the bug fixed in:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/urgent&id=c1a9eeb938b5433947e5ea22f89baff3182e7075

The relevant timers were fixed in:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=smp/urgent&id=9bf11ecce5a2758e5a097c2f3a13d08552d0d6f9

Thanks,
Mark.