Date: Mon, 8 Jun 2009 20:31:33 +0800
From: Wu Fengguang
To: Nai Xia
Cc: Andi Kleen, Nick Piggin, "hugh@veritas.com", "riel@redhat.com",
	"akpm@linux-foundation.org", "chris.mason@oracle.com",
	"linux-kernel@vger.kernel.org", "linux-mm@kvack.org"
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
Message-ID: <20090608123133.GA7944@localhost>
References: <200905271012.668777061@firstfloor.org>
	<20090527201239.C2C9C1D0294@basil.firstfloor.org>
	<20090528082616.GG6920@wotan.suse.de>
	<20090528093141.GD1065@one.firstfloor.org>
	<20090528120854.GJ6920@wotan.suse.de>
	<20090528134520.GH1065@one.firstfloor.org>
	<20090528145021.GA5503@localhost>
	<20090607160225.GA24315@localhost>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang wrote:
> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang wrote:
> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >> >
> >> > [snip]
> >> >
> >> >> >
> >> >> > BTW.
> >> >> > I don't know if you are checking for PG_writeback often enough?
> >> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >> >>
> >> >> So pages can be in writeback without being locked? I still
> >> >> wasn't able to find such a case (in fact, unless I'm misreading
> >> >> the code badly, the writeback bit is only used by NFS and a few
> >> >> obscure cases)
> >> >
> >> > Yes, the writeback page is typically not locked. Only read IO needs
> >> > to be exclusive. Read IO is in fact the page *writer*, while writeback
> >> > IO is the page *reader* :-)
> >>
> >> Sorry if this is somewhat off topic;
> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> >>
> >> So you are saying PG_writeback & PG_locked are acting like a read/write
> >> lock? I notice wait_on_page_writeback(page) seems always to be called
> >> with the page locked --
> >
> > No. Note that pages are not locked in wait_on_page_writeback_range().
>
> I see. This function seems mostly called on the sync path;
> it just waits for data to be synchronized to disk.
> No writers from the pages' POV, so no lock.
> I missed this case, but my argument about the role of the read/write
> lock still seems consistent. :)

It's more constrained than that. Normal read/write locks allow concurrent
readers; fsync(), however, must wait for previous IO to finish before
starting its own IO.

> >
> >> that is the semantics of a writer waiting to get the lock while it's
> >> held by some reader: the callers (e.g. truncate_inode_pages_range() and
> >> invalidate_inode_pages2_range()) are the writers waiting for the
> >> writeback readers (as you clarified) to finish their job, right?
> >
> > Sorry if my metaphor confused you. But these are not typical
> > reader/writer problems; they are more about data integrity.
>
> No, you didn't :)
> Actually, you made it clear to me that those bits play mixed roles.
> >
> >
> > Pages have to be "not under writeback" when truncated.
> > Otherwise data loss is possible:
> >
> > 1) create a file with one page (page A)
> > 2) truncate page A while it is under writeback
> > 3) write to the file, which creates page B
> > 4) sync the file, which sends page B to disk quickly
> >
> > Now if page B reaches the disk before A, the new data will be overwritten
> > by the truncated old data, which corrupts the file.
>
> I fully understand this scenario, which you had already clarified in a
> previous message. :)
>
> 1. someone makes index1 --> page A
> 2. Path P1 acts as a *reader* of the cache page at index1 by
>    setting PG_writeback, and at the same time as a *writer* of
>    the corresponding file blocks.
> 3. Another path P2 comes in and truncates page A; it is a writer
>    of the same cache page.
> 4. Yet another path P3 comes as a writer of the cache page,
>    making it point to page B: index1 --> page B.
> 5. Path P4 comes to write back the cache page (and sets PG_writeback).
>    It is a reader of the cache page and a writer of the file blocks.
>
> The corruption occurs because P1 & P4 race when writing the file blocks.
> But the _root_ of this race is that nothing serializes them on the side
> that writes the file blocks, and the stream of reads above was
> inconsistent because of the writers (P2 & P3) of the cache page at index1.
>
> Note that the "sync file" step is somewhat irrelevant; even without it
> the race may still exist. I know you must want to show me that it
> makes the corruption easier to trigger.
>
> So I think the simple logic is:
> 1) if you want to truncate/change the mapping from an index to a
>    struct *page, test the writeback bit, because whoever writes back the
>    file blocks is a reader of this mapping.
> 2) if a writebacker wants to start a read of this mapping with
>    test_set_page_writeback() or set_page_writeback(), he'd better be sure
>    this page is locked, to keep out the writers of this
>    index --> struct *page mapping.
>
> This is really the behavior of a read/write lock, right?

Please, that's a dangerous idea. A page can be written to at any time
while writeback to disk is under way. Does PG_writeback (your reader
lock) prevent page data writers? NO.

Thanks,
Fengguang

> wait_on_page_writeback_range() looks different only because "sync"
> operates on "struct page"; it's not sensitive to the index --> struct *page
> mapping. It does care whether the pages returned by pagevec_lookup_tag()
> still maintain the mapping when wait_on_page_writeback(page) is called.
> Here, PG_writeback is only a status flag for "struct page", not a lock bit
> for the index --> struct *page mapping.
>
> >
> >> So do you think the idea is sane to group the two bits together
> >> to form a real read/write lock, which does not care about the _number_
> >> of readers?
> >
> > We don't care about the number of readers here. So please forget about it.
>
> Yeah, I meant the number of readers is not important.
>
> I still hold that these two bits in some way act like a _sparse_
> read/write lock.
> But I am going to drop the idea of making them a pure lock, since
> PG_writeback does have another meaning -- the page is being written back:
> for the sync path, it's only a status flag.
> Making them a pure read/write lock would definitely lose that, or at least
> distort it.
>
> Hoping I've made my words understandable; correct me if I'm wrong, and
> many thanks for your time and patience. :-)
>
> Nai Xia
>
> > Thanks,
> > Fengguang
> >
> >> > The writeback bit is _widely_ used.  test_set_page_writeback() is
> >> > directly used by NFS/AFS etc. But its main user is in fact
> >> > set_page_writeback(), which is called in 26 places.
> >> >
> >> >> > think would be safest
> >> >>
> >> >> Okay. I'll just add it after the page lock.
> >> >>
> >> >> > (then you never have to bother with the writeback bit again)
> >> >>
> >> >> Until Fengguang does something fancy with it.
> >> >
> >> > Yes I'm going to do it without wait_on_page_writeback().
> >> >
> >> > The reason truncate_inode_pages_range() has to wait on a writeback page
> >> > is to ensure data integrity. Otherwise, if there come two events:
> >> >        truncate page A at offset X
> >> >        populate page B at offset X
> >> > If A and B are both writeback pages, then B can hit the disk first and
> >> > then be overwritten by A, which corrupts the data at offset X from the
> >> > user's POV.
> >> >
> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
> >> > our best to isolate it as well as intercept its IO. If the interception
> >> > fails, it will trigger another machine check before hitting the disk.
> >> >
> >> > After all, poisoned A means the data at offset X is already corrupted.
> >> > It doesn't matter if there comes another page B.
> >> >
> >> > Thanks,
> >> > Fengguang
> >> >
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/