Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755041AbcLZVps (ORCPT ); Mon, 26 Dec 2016 16:45:48 -0500 Received: from Galois.linutronix.de ([146.0.238.70]:39995 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751012AbcLZVpq (ORCPT ); Mon, 26 Dec 2016 16:45:46 -0500 Date: Mon, 26 Dec 2016 22:42:01 +0100 (CET) From: Thomas Gleixner To: Borislav Petkov cc: Boris Ostrovsky , Markus Trippelsdorf , Linus Torvalds , LKML , Ingo Molnar , "H. Peter Anvin" , Sebastian Andrzej Siewior Subject: Re: [GIT pull] smp/hotplug: Removal of notifiers In-Reply-To: Message-ID: References: <20161226074530.GA297@x4> <20161226110600.GB297@x4> <20161226154502.GA287@x4> <53e3b52b-f353-63c8-f96f-649d754596bc@oracle.com> <20161226210015.GA2945@nazgul.tnic> User-Agent: Alpine 2.20 (DEB 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3060 Lines: 99 On Mon, 26 Dec 2016, Thomas Gleixner wrote: > On Mon, 26 Dec 2016, Borislav Petkov wrote: > > On Mon, Dec 26, 2016 at 07:21:44PM +0100, Thomas Gleixner wrote: > > > Is there anything interesting error message before the BUG hits? I'll try > > > to reproduce on a AMD box tomorrow. > > > > Hmm, so lemme see if I see it correctly: > > > > threshold_create_bank() does kobject_create_and_add(name, &dev->kobj); > > and that dev thing is > > > > struct device *dev = per_cpu(mce_device, cpu); > > > > BUT(!), those mce_device per-CPU things get initialized in > > > > mce_cpu_online() > > |-> mce_device_create(cpu); > > > > With a CONFIG_HOTPLUG_CPU=n .config that doesn't happen, right? > > > > Oh, and I see what could've changed that: > > > > 8c0eeac819c8 ("x86/mcheck: Move CPU_ONLINE and CPU_DOWN_PREPARE to hotplug state machine") > > > > And before that, we did call mce_device_create(cpu) in > > mcheck_init_device() which is a device initcall and not dependent on CPU > > hotplug. > > > > And frankly, flipping back to the for_each_online_cpu(i) is yucky as > > hell but I don't see any other/better solution besides pulling up > > mce_device_create() into mcheck_init_device()... > > The hotplug callbacks are invoked even with HOTPLUG=n. So that's not the > problem. I can reproduce it. Will post info once I understand it. So the issue is indeed in that commit. I'm a moron. But the amd mce code should be made more solid, because exactly that issue can happen when something goes wrong in mcheck_init_device(). If that happens then the device pointer is NULL and this code crashes. Adding the NULL pointer check makes the machine survive despite the wreckage in the hotplug code. Fix below. Thanks, tglx 8<--------------------------- arch/x86/kernel/cpu/mcheck/mce_amd.c | 3 +++ kernel/cpu.c | 9 ++++++++- 2 files changed, 11 insertions(+), 1 deletion(-) --- a/arch/x86/kernel/cpu/mcheck/mce_amd.c +++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c @@ -1182,6 +1182,9 @@ static int threshold_create_bank(unsigne const char *name = get_name(bank, NULL); int err = 0; + if (!dev) + return -ENODEV; + if (is_shared_bank(bank)) { nb = node_to_amd_nb(amd_get_nb_id(cpu)); --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1471,6 +1471,7 @@ int __cpuhp_setup_state(enum cpuhp_state bool multi_instance) { int cpu, ret = 0; + bool dynstate; if (cpuhp_cb_check(state) || !name) return -EINVAL; @@ -1480,6 +1481,12 @@ int __cpuhp_setup_state(enum cpuhp_state ret = cpuhp_store_callbacks(state, name, startup, teardown, multi_instance); + dynstate = state == CPUHP_AP_ONLINE_DYN; + if (ret > 0 && dynstate) { + state = ret; + ret = 0; + } + if (ret || !invoke || !startup) goto out; @@ -1508,7 +1515,7 @@ int __cpuhp_setup_state(enum cpuhp_state * If the requested state is CPUHP_AP_ONLINE_DYN, return the * dynamically allocated state in case of success. */ - if (!ret && state == CPUHP_AP_ONLINE_DYN) + if (!ret && dynstate) return state; return ret; }