From: Naoya Horiguchi
To: Borislav Petkov
CC: Naoya Horiguchi, Tony Luck, Vivek Goyal, "linux-kernel@vger.kernel.org", Junichi Nomura, Kiyoshi Ueda
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together
Date: Tue, 24 Feb 2015 08:15:35 +0000
Message-ID: <20150224081517.GB2918@hori1.linux.bs1.fc.nec.co.jp>
References: <1424682719-16493-1-git-send-email-n-horiguchi@ah.jp.nec.com>
 <20150223092739.GA22757@pd.tnic> <54EB24BE.5050006@gmail.com>
 <20150223135842.GA22753@pd.tnic> <54EB4A17.6020800@gmail.com>
 <20150223170653.GA16699@pd.tnic>
In-Reply-To: <20150223170653.GA16699@pd.tnic>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Feb 23, 2015 at 06:06:53PM +0100, Borislav Petkov wrote:
> On Tue, Feb 24, 2015 at 12:41:11AM +0900, Naoya Horiguchi wrote:
> > What I saw was that once in hundreds of kdump and reboot cycle we hit
> > kdump failure and panic with "Timeout synchronizing machine check over
> > CPUs" message.
>
> Ok, but this doesn't necessarily mean you're seeing an MCE.

This message is emitted only in do_machine_check(), so I think this is a
real MCE.

> Or perhaps your NMI shooting down is causing an MCE once in a hundred
> kdump cycles, i.e. I'm looking at nmi_shootdown_cpus().
>
> Can you send me a dmesg from such a case where this happens? The more
> verbose, the better.

Unfortunately, the case was reproduced on a development version of a
distribution kernel, so I'm afraid I can't show the dmesg here now.
I haven't reproduced this on an upstream kernel yet.

Let me update my explanation of the problem. I wrote the description
about the race window between the NMI shootdown threads. That's not wrong,
but it's only part of the problem. A more suitable description is that
all "shot down" CPUs keep MCE enabled (disable_local_APIC() doesn't stop
it) after entering the infinite cpu_relax() loop, so any MCE event that
occurs after the second kernel launches on the crashing CPU causes a
kernel panic due to the synchronization timeout (the crashing CPU doesn't
run do_machine_check(), but the other CPUs do).

Thanks,
Naoya Horiguchi
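
The failure mode described above can be illustrated with a small user-space
model. This is only a sketch and not kernel code: struct cpu_model,
mce_rendezvous_times_out() and crash_scenario_times_out() are hypothetical
names invented for this illustration. The model assumes that a machine check
is broadcast to every CPU and that each CPU entering the old kernel's handler
waits for all CPUs to check in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical per-CPU state for this model; not a kernel structure. */
struct cpu_model {
	bool mce_enabled;      /* models machine checks being enabled on the CPU */
	bool runs_old_handler; /* CPU is still parked in the crashed kernel */
};

/*
 * Model one broadcast machine check. A CPU checks in only if it has MCE
 * enabled and still runs the crashed kernel's handler. The handlers wait
 * for all n CPUs, so the rendezvous times out when some, but not all,
 * CPUs check in.
 */
static bool mce_rendezvous_times_out(const struct cpu_model *cpus, size_t n)
{
	size_t callin = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		if (cpus[i].mce_enabled && cpus[i].runs_old_handler)
			callin++;
	}
	if (callin == 0)
		return false;	/* no handler ran, so nobody is waiting */
	return callin < n;	/* waiters exist but not everyone arrived */
}

/*
 * CPU 0 is the crashing CPU and has already jumped into the second kernel.
 * The other CPUs are "shot down" and spin in the crashed kernel.
 */
static bool crash_scenario_times_out(bool disable_mce_on_shootdown)
{
	struct cpu_model cpus[NR_CPUS];

	for (size_t i = 0; i < NR_CPUS; i++) {
		cpus[i].runs_old_handler = (i != 0);
		cpus[i].mce_enabled = (i == 0) || !disable_mce_on_shootdown;
	}
	return mce_rendezvous_times_out(cpus, NR_CPUS);
}
```

In this model, crash_scenario_times_out(false) reports a timeout because the
three parked CPUs wait for a fourth that never checks in, while
crash_scenario_times_out(true) does not, since no parked CPU enters the
handler. The model says nothing about what real hardware does when a machine
check arrives with MCE disabled.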