Message-ID: <54EB4A17.6020800@gmail.com>
Date: Tue, 24 Feb 2015 00:41:11 +0900
From: Naoya Horiguchi
To: Borislav Petkov
CC: Naoya Horiguchi, Tony Luck, Vivek Goyal, "linux-kernel@vger.kernel.org", Junichi Nomura, Kiyoshi Ueda
Subject: Re: [PATCH 1/2] x86: mce: kdump: use under_crashdumping to turn off MCE in all CPUs together
In-Reply-To: <20150223135842.GA22753@pd.tnic>

On 02/23/2015 10:58 PM, Borislav Petkov wrote:
> On Mon, Feb 23, 2015 at 10:01:50PM +0900, Naoya Horiguchi wrote:
>> userspace. What end users see is like these timeout messages:
>> - "Timeout: Not all CPUs entered broadcast exception handler",
>> - "Timeout: Subject CPUs unable to finish machine check processing",
>> - "Timeout: Monarch CPU unable to finish machine check processing", or
>> - "Timeout: Monarch CPU did not finish machine check processing".
>> These are informative for developers like us, but confusing for end users.
>
> Those messages won't go out if tolerant level is > 1 AFAICT and from
> looking at mce_timed_out() and the machine wouldn't panic, for that
> matter.

Sorry, I misread the code, and you're right. Please ignore this part.

> So what is the actual problem you're seeing?
>
> Cores timeoutting when a machine check happens during entering kdump or
> you not wanting cores to panic due to a machine check while the machine
> enters kdump?

What I saw was that once in hundreds of kdump and reboot cycles, we hit
a kdump failure and panic with the "Timeout synchronizing machine check
over CPUs" message. A panic is OK if the MCE is severe enough, but I
don't think a panic caused by this synchronization timeout is good,
because it is unrelated to the MCE's nature (such as the victim
component or the type of error) or its severity, so even recoverable
MCEs could trigger it. This timeout is just an artifact of the current
kdump code, so I think we can/should avoid it.

Anyway, your suggestion of raising tolerant to 3 should solve this
problem, so I'll take this approach in the next post.

Thanks,
Naoya Horiguchi
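
For reference, the timeout decision discussed in this thread can be
modelled roughly as below. This is a simplified user-space sketch, not
the kernel source: the enum, the function name check_sync_timeout() and
its parameters are hypothetical. The only behaviour taken from the
thread is that a tolerant level above 1 suppresses the panic, and that
the panic in question is the "Timeout synchronizing machine check over
CPUs" one.

```c
/* Hypothetical outcome of one timeout check; names are illustrative only. */
enum sync_outcome {
    SYNC_KEEP_WAITING,    /* synchronization budget not yet exhausted */
    SYNC_PANIC,           /* "Timeout synchronizing machine check over CPUs" */
    SYNC_GIVE_UP_QUIETLY  /* timed out, but the tolerant level suppresses the panic */
};

/*
 * Simplified model of the decision discussed in the thread (assumption:
 * this mirrors the behaviour described there, not the exact kernel code).
 *
 * remaining_ns: time left in the synchronization budget.
 * tolerant:     MCE tolerance level; levels above 1 never panic here.
 */
static enum sync_outcome check_sync_timeout(long long remaining_ns, int tolerant)
{
    if (remaining_ns > 0)
        return SYNC_KEEP_WAITING;

    /* Budget exhausted: panic only at the stricter tolerance levels. */
    if (tolerant <= 1)
        return SYNC_PANIC;

    return SYNC_GIVE_UP_QUIETLY;
}
```

In this model, raising tolerant to 3 before entering kdump means an
expired budget ends in SYNC_GIVE_UP_QUIETLY instead of SYNC_PANIC,
whatever the severity of the underlying MCE.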