Date: Mon, 9 May 2011 22:59:35 +0530
From: "K.Prasad"
To: Andi Kleen
Cc: Linux Kernel Mailing List, "Luck, Tony", Vivek Goyal,
    kexec@lists.infradead.org, Srivatsa Vaddagiri,
    Ananth N Mavinakayanahalli
Subject: Re: [RFC] Kdump and memory error handling
Message-ID: <20110509172935.GD1963@in.ibm.com>
Reply-To: prasad@linux.vnet.ibm.com
References: <20110504193509.GA5342@in.ibm.com> <20110504203914.GC1737@one.firstfloor.org>
In-Reply-To: <20110504203914.GC1737@one.firstfloor.org>

On Wed, May 04, 2011 at 10:39:14PM +0200, Andi Kleen wrote:
> > Any thoughts/suggestions?
>
> My old attempts to solve this are
>
> Don't dump on MCE:
>
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=shortlog;h=refs/heads/mce/xpanic
>

The problem we see in avoiding the panic->crash_kexec->[coredump capture]
path is that the user may have no means of knowing the reason for the
crash, unless a serial console is connected to capture and store the
panic string.

Alternatively, a 'slim' kdump (as described here:
https://lkml.org/lkml/2011/5/4/396) would not contain meaningless data
from the old memory, but would still inform the user about the cause of
the crash. I intend to post some patches with a quick implementation of
it soon.

> Handle dumps of corrupted memory regions:
>
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=shortlog;h=refs/heads/mce/crashdump
>
> IMHO these patches are still the right solutions for this.
>

As Vatsa pointed out, the processor's behaviour upon reading (or
performing any I/O operation on) the faulty memory location isn't
clearly defined (as far as I could tell from the System Programming
Guide, Part 1, Volume 3A, Chapter 15). In such a scenario, disabling MCE
for the kdump kernel (which can potentially read the faulty memory)
makes things hazy.

Thanks,
K.Prasad