Date: Mon, 23 Feb 2015 10:27:39 +0100
From: Borislav Petkov
To: Naoya Horiguchi
Cc: Tony Luck, Vivek Goyal, "linux-kernel@vger.kernel.org", Junichi Nomura, Kiyoshi Ueda
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together
Message-ID: <20150223092739.GA22757@pd.tnic>
In-Reply-To: <1424682719-16493-1-git-send-email-n-horiguchi@ah.jp.nec.com>

On Mon, Feb 23, 2015 at 09:12:29AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. This disablement is done via NMI, and the crashing
> CPU wait for the completions by spinning at most for 1 second.
> However, there is a race window if this NMI handling doesn't complete within
> the 1 second on some CPU, which cause the fragile situation where only a
> portion of online CPUs are responsive to MCE interrupt. If MCE happens during
> this race window, MCE synchronization always timeouts and results in kernel
> panic. So the user-visible effect of this bug is kdump failure.
>
> Note that this race window did exist when current MCE handler was implemented
> around 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has
> reached a timeout") made it more visible by changing the default behavior of
> the synchronization timeout from "ignore" to "panic".

Let me guess: you could raise the tolerance level to 3 temporarily from
native_machine_crash_shutdown() and not touch the #MC handler at all,
right?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
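
For illustration only, the suggestion above can be modelled in plain userspace C. This is a rough sketch, not kernel code: `struct mce_cfg_model`, `timeout_panics()` and `crash_shutdown_model()` are hypothetical names, and the `<= 1` panic threshold is an assumption made for the example rather than something stated in this thread. The only idea taken from the mail is that the crash shutdown path raises the tolerance level to 3 before the other CPUs are shot down, so that an MCE synchronization timeout in the race window no longer escalates to a panic.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the MCE configuration; only the tolerance
 * level matters for this model. */
struct mce_cfg_model {
    int tolerant;   /* 0..3, where 3 is the most permissive level */
};

/* Assumed default for the example. */
static struct mce_cfg_model cfg = { .tolerant = 1 };

/* Model of the timeout policy: returns true if a synchronization
 * timeout would escalate to a panic. The threshold is an assumption. */
static bool timeout_panics(const struct mce_cfg_model *c)
{
    return c->tolerant <= 1;
}

/* Model of the crash shutdown path: raise the tolerance level first,
 * then shoot down the other CPUs. */
static void crash_shutdown_model(struct mce_cfg_model *c)
{
    c->tolerant = 3;
    /* ... the NMI shootdown of the other CPUs would happen here ... */
}
```

The model does not restore the old level, on the assumption that the crash path goes on to boot the 2nd kernel and never returns to normal operation.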