Date: Mon, 15 Jun 2009 17:24:28 +0200
From: Andi Kleen
To: Alan Cox
Cc: Andi Kleen, Hugh Dickins, Wu Fengguang, Balbir Singh, Andrew Morton,
    LKML, Ingo Molnar, Mel Gorman, Thomas Gleixner, "H. Peter Anvin",
    Peter Zijlstra, Nick Piggin, "riel@redhat.com",
    "chris.mason@oracle.com", "linux-mm@kvack.org"
Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)
Message-ID: <20090615152427.GF31969@one.firstfloor.org>
References: <20090615024520.786814520@intel.com>
    <4A35BD7A.9070208@linux.vnet.ibm.com>
    <20090615042753.GA20788@localhost>
    <20090615140019.4e405d37@lxorguk.ukuu.org.uk>
    <20090615132934.GE31969@one.firstfloor.org>
    <20090615154832.73c89733@lxorguk.ukuu.org.uk>
In-Reply-To: <20090615154832.73c89733@lxorguk.ukuu.org.uk>

> Everyone I knew in the business end of deploying Linux turned on panics
> for I/O errors, reboot on panic and all the rest of those.

oops=panic already implies panic on all machine check exceptions, so
they will be fine then (assuming this is the best strategy for
availability for them, which I personally find quite doubtful, but we
can discuss this some other time)

> Really - so if your design is wrong for the way PPC wants to work what
> are we going to do ? It's not a requirement that PPC64 support is there

Then we change the code.
Or if it's too difficult, don't support their stuff. After all, it's
not cast in stone.

That said, I doubt the PPC requirements will be much different from
what we have.

> I'd guess that zSeries has some rather different views on how ECC
> failures propagate through the hypervisors for example, including the
> fact that a failed page can be unfailed which you don't seem to allow for.

That's correct. That's because unpoisoning is quite hard: you need
some kind of synchronization point for all the error handling, and
that's the poisoned page. If it unpoisons itself, then you need some
very heavyweight synchronization to avoid handling errors multiple
times. I looked at it, but it's quite messy.

Also it's of somewhat dubious value.

>
> (You can unfail pages on x86 as well it appears by scrubbing them via DMA
> - yes ?)

Not architecturally. Also, the other problem is not just unpoisoning
them, but finding out whether the page is permanently bad or just
temporarily.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.