From: Andi Kleen
To: "Theodore Ts'o"
Cc: Dave Chinner, "Luck, Tony", Naoya Horiguchi, "Kleen, Andi", "Wu, Fengguang", Andrew Morton, Jan Kara, "Jun'ichi Nomura", Akira Fujita, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", "linux-ext4@vger.kernel.org"
Subject: Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page
Date: Mon, 29 Oct 2012 12:07:04 -0700
In-Reply-To: <20121029182455.GA7098@thunk.org> (Theodore Ts'o's message of "Mon, 29 Oct 2012 14:24:56 -0400")

Theodore Ts'o writes:
>
> It's actually pretty easy to test this particular one,

Note the error can happen at any time.

> and certainly
> one of the things I'd strongly encourage in this patch series is the
> introduction of an interface via madvise

It already exists of course.
I would suggest studying the existing framework before making more suggestions.

> simulate an ECC hard error event. So I don't think "it's hard to
> test" is a reason not to do the right thing. Let's make it easy to

What you can't test doesn't work. It's that simple.

And memory error handling is extremely hard to test. The errors
can happen at any time. It's not a well-defined event.
There are test suites for it of course (mce-test, mce-inject[1]),
but they needed a lot of engineering effort to get to where
they are.

[1] despite the best efforts of some current RAS developers
at breaking it.

> Note that the problem that we're dealing with is buffered writes; so
> it's quite possible that the process which wrote the file, thus
> dirtying the page cache, has already exited; so there's no way we can
> guarantee we can inform the process which wrote the file via a signal
> or a error code return.

Is that any different from other IO errors? It doesn't need to
be better.

> Also, if you're going to keep this state in memory, what happens if
> the inode gets pushed out of memory?

You lose the error, just like you do today with any other IO error.

We had a lot of discussions on this when the memory error handling
was originally introduced, and that was the conclusion.

I don't think a special panic knob for this makes sense either.
We already have multiple panic knobs for memory errors that can be used.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
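
For reference, the existing madvise injection interface mentioned above is presumably MADV_HWPOISON, which Linux exposes when the kernel is built with CONFIG_MEMORY_FAILURE and the caller has CAP_SYS_ADMIN. The following is a minimal sketch, not taken from the thread: the helper names inject_hwpoison and explain_inject_result are hypothetical, and the error explanations are a simplified reading of the madvise(2) man page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100 /* value from the Linux uapi headers */
#endif

/*
 * Hypothetical helper: ask the kernel to treat the pages covering
 * [addr, addr + len) as if they had suffered an uncorrectable memory error.
 * addr must be page aligned. Returns 0 on success or an errno value.
 * WARNING: on success the pages really are poisoned; touching them
 * afterwards is expected to raise SIGBUS. Use only on a test machine.
 */
int inject_hwpoison(void *addr, size_t len)
{
    assert(addr != NULL);
    if (madvise(addr, len, MADV_HWPOISON) != 0)
        return errno;
    return 0;
}

/* Hypothetical helper: map the common failure codes to a short hint. */
const char *explain_inject_result(int err)
{
    switch (err) {
    case 0:
        return "pages poisoned";
    case EPERM:
        return "caller lacks CAP_SYS_ADMIN";
    case EINVAL:
        return "bad range, or kernel built without CONFIG_MEMORY_FAILURE";
    default:
        return strerror(err);
    }
}
```

A test program would typically mmap() one page (anonymous, or a file mapping to exercise the page cache path discussed in this thread), write to it so it is populated, call inject_hwpoison() on it as root, and then check how the kernel and the filesystem react. This only simulates the handling path; as noted above, real errors can arrive at any time.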