2014-06-04 13:22:09

by Igor Mammedov

Subject: Re: [PATCH v5 0/4] x86: fix hang when AP bringup is too slow

On Mon, 5 May 2014 22:49:47 +0200
Igor Mammedov <[email protected]> wrote:

> changes since v4:
>  * merge "[PATCH v4 1/5] x86: fix list corruption on CPU hotplug"
>    and "[PATCH v4 2/5] x86: fix memory corruption in acpi_unmap_lsapic()"
>    together
>  * "x86: initialize secondary CPU only if master CPU will wait for it":
>     - add 10 seconds timeout description into commit message
>     - add smp_mb() after clearing cpu_initialized_mask
>
> changes since v3:
>  * put simple bugfixes first
>  * move common part of syncing with master CPU in cpu_init()
>    for x32/64 variant into helper function
>  * cpu_init(): WARN_ON if cpu_initialized_mask is set
>  * fix panic on CPU unplug, caused by erroneous removal
>    of "pr->dev = dev;" in drivers/acpi/acpi_processor.c
>
> --
> Hang is observed on virtual machines during CPU hotplug,
> especially in big guests with many CPUs. (It happens more
> often if host is over-committed).
>
> Hang happens because master CPU times out waiting for the AP
> to boot and 'cancels' the CPU online operation, assuming the AP
> is not functional. But the AP may continue to run wild later,
> causing various hangs or panics in the running kernel, which
> assumes that the AP is offline.
>
> This is an alternative approach, that instead of canceling
> in-progress AP bringup (https://lkml.org/lkml/2014/3/6/257),
> removes timeouts so that AP bringup won't be affected by
> poor timing and syncs AP with master CPU at early startup
> making sure that AP won't run wild if master CPU doesn't
> expect AP to come online.
>
> Series also fixes 3 bugs found during testing CPU bringup
> failure case.

since 3.16 merge window is open now,
ping

> --
> Below is a detailed description of the more frequently occurring hang:
> ---
> Master CPU may time out before cpu_callin_mask is set and cancel
> booting the CPU, but the CPU being onlined still continues to boot, sets
> cpu_active_mask (CPU_STARTING notifiers) and spins in
> check_tsc_sync_target() waiting for the master CPU to arrive. A following
> attempt to online another CPU hangs in stop_machine, initiated from here:
>   smp_callin ->
>     smp_store_cpu_info ->
>       identify_secondary_cpu ->
>         mtrr_ap_init -> set_mtrr_from_inactive_cpu
>
> stop_machine waits on completion of stop_work on all CPUs from
> cpu_active_mask including a failed CPU that spins in check_tsc_sync_target().
>
> Igor Mammedov (4):
>   x86: fix list/memory corruption on CPU hotplug
>   acpi_processor: do not mark present at boot but not onlined CPU as
>     onlined
>   x86: log error on secondary CPU wakeup failure at ERR level
>   x86: initialize secondary CPU only if master CPU will wait for it
>
>  arch/x86/kernel/cpu/common.c  |  27 ++++++----
>  arch/x86/kernel/smpboot.c     | 104 +++++++++++++----------------------------
>  drivers/acpi/acpi_processor.c |   1 -
>  3 files changed, 48 insertions(+), 84 deletions(-)
>
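
For readers of the archive, the idea in the cover letter above (the AP
syncs with the master CPU at early startup and does not initialize unless
the master is waiting for it) can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not code from the
series: struct ap_state, ap_step() and simulate_bringup() are hypothetical
names, time is modelled as abstract ticks, and the three flags only
loosely stand in for per-CPU bits such as cpu_initialized_mask and
cpu_callin_mask (the "callout" flag name is an assumption).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for one CPU's bits in the kernel cpumasks. */
struct ap_state {
	bool initialized; /* AP: "I am alive and parked" */
	bool callout;     /* master: "go ahead, I am waiting for you" (assumed name) */
	bool callin;      /* AP: "early initialization is done" */
};

/* One AP step: announce presence, then stay parked until the master acks. */
static void ap_step(struct ap_state *s, int tick, int ap_wakeup_tick)
{
	if (tick < ap_wakeup_tick)
		return;            /* AP has not started running yet */
	s->initialized = true;
	if (!s->callout)
		return;            /* parked: master has not acked */
	s->callin = true;          /* master is waiting, safe to initialize */
}

/*
 * Returns true if the AP came online, false if the master gave up.
 * Only the wait for the first sign of life is bounded; once the master
 * has acked, it waits for callin without a timeout (modelled here by
 * simply letting the loop keep running).
 */
bool simulate_bringup(int ap_wakeup_tick, int master_timeout_ticks)
{
	struct ap_state s = { false, false, false };
	bool master_gave_up = false;
	int tick;

	for (tick = 0; tick <= ap_wakeup_tick + master_timeout_ticks + 2; tick++) {
		ap_step(&s, tick, ap_wakeup_tick);

		if (!s.callout && !master_gave_up) {
			if (s.initialized)
				s.callout = true;       /* ack the AP */
			else if (tick >= master_timeout_ticks)
				master_gave_up = true;  /* cancel the bringup */
		}

		/* Invariant: a cancelled AP never gets past the parking loop. */
		assert(!(master_gave_up && s.callin));
	}
	return s.callin;
}
```

In this model, simulate_bringup(3, 10) describes an AP that shows up in
time and is brought online, while simulate_bringup(20, 10) describes a
late AP that stays parked instead of running wild after the master has
cancelled the operation.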


2014-06-05 12:29:46

by Ingo Molnar

Subject: Re: [PATCH v5 0/4] x86: fix hang when AP bringup is too slow


* Igor Mammedov <[email protected]> wrote:

> > Series also fixes 3 bugs found during testing CPU bringup
> > failure case.
>
> since 3.16 merge window is open now,
> ping

Mind resending the remaining patches on top of Linus's latest, which I
suppose has one of the fixes already included via Rafael's tree?

Thanks,

Ingo

2014-06-05 13:12:48

by Igor Mammedov

Subject: Re: [PATCH v5 0/4] x86: fix hang when AP bringup is too slow

On Thu, 5 Jun 2014 14:29:40 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Igor Mammedov <[email protected]> wrote:
>
> > > Series also fixes 3 bugs found during testing CPU bringup
> > > failure case.
> >
> > since 3.16 merge window is open now,
> > ping
>
> Mind resending the remaining patches on top of Linus's latest, which I
> suppose has one of the fixes already included via Rafael's tree?
Sure, I'll rebase and repost it today.

>
> Thanks,
>
> Ingo