2009-06-01 11:50:54

by Nick Piggin

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Thu, May 28, 2009 at 09:54:28PM +0800, Wu Fengguang wrote:
> On Thu, May 28, 2009 at 08:23:57PM +0800, Nick Piggin wrote:
> > On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> > > Hi Nick,
> > >
> > > > > + /*
> > > > > + * remove_from_page_cache assumes (mapping && !mapped)
> > > > > + */
> > > > > + if (page_mapping(p) && !page_mapped(p)) {
> > > > > + remove_from_page_cache(p);
> > > > > + page_cache_release(p);
> > > > > + }
> > > >
> > > > remove_mapping would probably be a better idea. Otherwise you can
> > > > probably introduce pagecache removal vs page fault races which
> > > > will make the kernel bug.
> > >
> > > We use remove_mapping() at first, then discovered that it made strong
> > > assumption on page_count=2.
> > >
> > > I guess it is safe from races since we are locking the page?
> >
> > Yes it probably should (although you will lose get_user_pages data, but
> > I guess that's the aim anyway).
>
> Yes. We (and truncate) rely heavily on this logic:
>
> retry:
> lock_page(page);
> if (page->mapping == NULL)
> goto retry;
> // do something on page
> unlock_page(page);
>
> So that we can steal/isolate a page under its page lock.
>
> The truncate code does wait on writeback page, but we would like to
> isolate the page ASAP, so as to avoid someone to find it in the page
> cache (or swap cache) and then access its content.
>
> I see no obvious problems to isolate a writeback page from page cache
> or swap cache. But also I'm not sure it won't break some assumption
> in some corner of the kernel.
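
The lock-then-recheck idiom quoted above is the same one mm/filemap.c's find_lock_page() used in kernels of this era. The sketch below is a reconstruction for reference, not a quote from the thread; it shows the detail the shorthand omits, namely that the page is unlocked and released before retrying.

	struct page *find_lock_page(struct address_space *mapping, pgoff_t index)
	{
		struct page *page;
	repeat:
		page = find_get_page(mapping, index);
		if (page) {
			lock_page(page);
			/* Was the page truncated while we slept on the page lock? */
			if (unlikely(page->mapping != mapping)) {
				unlock_page(page);
				page_cache_release(page);
				goto repeat;
			}
		}
		return page;
	}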

The problem is that then you have lost synchronization in the
pagecache. Nothing then prevents a new page from being put
in there and trying to do IO to or from the same device as the
currently running writeback.


> > But I just don't like this one file having all that required knowledge
>
> Yes that's a big problem.
>
> One major complexity involves classify the page into different known
> types, by testing page flags, page_mapping, page_mapped, etc. This
> is not avoidable.

No.


> Another major complexity is on calling the isolation routines to
> remove references from
> - PTE
> - page cache
> - swap cache
> - LRU list
> They more or less made some assumptions on their operating environment
> that we have to take care of. Unfortunately these complexities are
> also not easily resolvable.
>
> > (and few comments) of all the files in mm/. If you want to get rid
>
> I promise I'll add more comments :)

OK, but they should still go in their relevant files. Or as best as
possible. Right now it's just silly to have all this here when much
of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.


> > of the page and don't care what it's count or dirtyness is, then
> > truncate_inode_pages_range is the correct API to use.
> >
> > (or you could extract out some of it so you can call it directly on
> > individual locked pages, if that helps).
>
> The patch to move over to truncate_complete_page() would look like this.
> It's not a big win indeed.

No I don't mean to do this, but to move the truncate_inode_pages
code for truncating a single, locked, page into another function
in mm/truncate.c and then call that from here.
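
A minimal sketch of the call pattern being suggested here, assuming a single-page helper shaped like the truncate_inode_page() Nick posts later in this thread, and assuming the handler already holds the page lock, as the surrounding discussion suggests it does; this is an illustration, not code from the series:

	/* inside the memory-failure handler, page p already locked */
	struct address_space *mapping = page_mapping(p);

	wait_on_page_writeback(p);
	if (mapping)
		truncate_inode_page(mapping, p);	/* single-page truncate factored out of mm/truncate.c */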

>
> ---
> mm/memory-failure.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> --- sound-2.6.orig/mm/memory-failure.c
> +++ sound-2.6/mm/memory-failure.c
> @@ -327,20 +327,18 @@ static int me_pagecache_clean(struct pag
> if (!isolate_lru_page(p))
> page_cache_release(p);
>
> - if (page_has_private(p))
> - do_invalidatepage(p, 0);
> - if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> - Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> - page_to_pfn(p));
> -
> /*
> * remove_from_page_cache assumes (mapping && !mapped)
> */
> if (page_mapping(p) && !page_mapped(p)) {
> - remove_from_page_cache(p);
> - page_cache_release(p);
> + ClearPageMlocked(p);
> + truncate_complete_page(p->mapping, p);
> }
>
> + if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> + Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> + page_to_pfn(p));
> +
> return RECOVERED;
> }
>
>
> > OK this is the point I was missing.
> >
> > Should all be commented and put into mm/swap_state.c (or somewhere that
> > Hugh prefers).
>
> But I doubt Hugh will welcome moving that bits into swap*.c ;)

Why not? If he has to look at it anyway, he probably rather looks
at fewer files :)


> > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > in the known good data from disk.
> >
> > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > swapcache anyway?
>
> The ClearPageUptodate() is kind of a careless addition, in the hope
> that it will stop some random readers. Need more investigations.

OK. But it just muddies the waters in the meantime, so maybe take
such things out until there is a case for them.


> > > > You haven't waited on writeback here AFAIKS, and have you
> > > > *really* verified it is safe to call delete_from_swap_cache?
> > >
> > > Good catch. I'll soon submit patches for handling the under
> > > read/write IO pages. In this patchset they are simply ignored.
> >
> > Well that's quite important ;) I would suggest you just wait_on_page_writeback.
> > It is simple and should work. _Unless_ you can show it is a big problem that
> > needs an equivalently big mess to fix ;)
>
> Yes we could do wait_on_page_writeback() if necessary. The downside is,
> keeping writeback page in page cache opens a small time window for
> some one to access the page.

AFAIKS there already is such a window? You're doing lock_page and such.
No, it seems rather insane to do something like this here that no other
code in the mm ever does.


2009-06-01 14:06:22

by Fengguang Wu

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> On Thu, May 28, 2009 at 09:54:28PM +0800, Wu Fengguang wrote:
> > On Thu, May 28, 2009 at 08:23:57PM +0800, Nick Piggin wrote:
> > > On Thu, May 28, 2009 at 05:59:34PM +0800, Wu Fengguang wrote:
> > > > Hi Nick,
> > > >
> > > > > > + /*
> > > > > > + * remove_from_page_cache assumes (mapping && !mapped)
> > > > > > + */
> > > > > > + if (page_mapping(p) && !page_mapped(p)) {
> > > > > > + remove_from_page_cache(p);
> > > > > > + page_cache_release(p);
> > > > > > + }
> > > > >
> > > > > remove_mapping would probably be a better idea. Otherwise you can
> > > > > probably introduce pagecache removal vs page fault races which
> > > > > will make the kernel bug.
> > > >
> > > > We use remove_mapping() at first, then discovered that it made strong
> > > > assumption on page_count=2.
> > > >
> > > > I guess it is safe from races since we are locking the page?
> > >
> > > Yes it probably should (although you will lose get_user_pages data, but
> > > I guess that's the aim anyway).
> >
> > Yes. We (and truncate) rely heavily on this logic:
> >
> > retry:
> > lock_page(page);
> > if (page->mapping == NULL)
> > goto retry;
> > // do something on page
> > unlock_page(page);
> >
> > So that we can steal/isolate a page under its page lock.
> >
> > The truncate code does wait on writeback page, but we would like to
> > isolate the page ASAP, so as to avoid someone to find it in the page
> > cache (or swap cache) and then access its content.
> >
> > I see no obvious problems to isolate a writeback page from page cache
> > or swap cache. But also I'm not sure it won't break some assumption
> > in some corner of the kernel.
>
> The problem is that then you have lost synchronization in the
> pagecache. Nothing then prevents a new page from being put
> in there and trying to do IO to or from the same device as the
> currently running writeback.

[ I'm not setting my mind to get rid of wait_on_page_writeback(),
however I'm curious about the consequences of not doing it :) ]

You are right in that IO can happen for a new page at the same file offset.
But I have analyzed that in another email:

: The reason truncate_inode_pages_range() has to wait on writeback page
: is to ensure data integrity. Otherwise if there comes two events:
: truncate page A at offset X
: populate page B at offset X
: If A and B are all writeback pages, then B can hit disk first and then
: be overwritten by A. Which corrupts the data at offset X from user's POV.
:
: But for hwpoison, there are no such worries. If A is poisoned, we do
: our best to isolate it as well as intercepting its IO. If the interception
: fails, it will trigger another machine check before hitting the disk.
:
: After all, poisoned A means the data at offset X is already corrupted.
: It doesn't matter if there comes another B page.

Does that make sense?

In fact, even under the assumption that a page won't be truncated during
writeback, nowhere except the writeback end_io handlers can actually
safely take advantage of this assumption. There are not many such
handlers, so it's relatively easy to check them one by one.

> > > But I just don't like this one file having all that required knowledge
> >
> > Yes that's a big problem.
> >
> > One major complexity involves classify the page into different known
> > types, by testing page flags, page_mapping, page_mapped, etc. This
> > is not avoidable.
>
> No.

If you don't know what kind of page it is, how do we know how to properly
isolate it? Or do you mean the current classifying code can be
simplified? Yeah that's kind of possible.

>
> > Another major complexity is on calling the isolation routines to
> > remove references from
> > - PTE
> > - page cache
> > - swap cache
> > - LRU list
> > They more or less made some assumptions on their operating environment
> > that we have to take care of. Unfortunately these complexities are
> > also not easily resolvable.
> >
> > > (and few comments) of all the files in mm/. If you want to get rid
> >
> > I promise I'll add more comments :)
>
> OK, but they should still go in their relevant files. Or as best as
> possible. Right now it's just silly to have all this here when much
> of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.

OK, I'll bear that point in mind.

> > > of the page and don't care what it's count or dirtyness is, then
> > > truncate_inode_pages_range is the correct API to use.
> > >
> > > (or you could extract out some of it so you can call it directly on
> > > individual locked pages, if that helps).
> >
> > The patch to move over to truncate_complete_page() would look like this.
> > It's not a big win indeed.
>
> No I don't mean to do this, but to move the truncate_inode_pages
> code for truncating a single, locked, page into another function
> in mm/truncate.c and then call that from here.

It seems to me that truncate_complete_page() is already the code
you want to move ;-) Or you mean more code around the call site of
truncate_complete_page()?

lock_page(page);

wait_on_page_writeback(page);
We could do this.

if (page_mapped(page)) {
unmap_mapping_range(mapping,
(loff_t)page->index<<PAGE_CACHE_SHIFT,
PAGE_CACHE_SIZE, 0);
}
We need a rather complex unmap logic.

if (page->index > next)
next = page->index;
next++;
truncate_complete_page(mapping, page);
unlock_page(page);

Now it's obvious that reusing more code than truncate_complete_page()
is not easy (or natural).

> > ---
> > mm/memory-failure.c | 14 ++++++--------
> > 1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > --- sound-2.6.orig/mm/memory-failure.c
> > +++ sound-2.6/mm/memory-failure.c
> > @@ -327,20 +327,18 @@ static int me_pagecache_clean(struct pag
> > if (!isolate_lru_page(p))
> > page_cache_release(p);
> >
> > - if (page_has_private(p))
> > - do_invalidatepage(p, 0);
> > - if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > - Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> > - page_to_pfn(p));
> > -
> > /*
> > * remove_from_page_cache assumes (mapping && !mapped)
> > */
> > if (page_mapping(p) && !page_mapped(p)) {
> > - remove_from_page_cache(p);
> > - page_cache_release(p);
> > + ClearPageMlocked(p);
> > + truncate_complete_page(p->mapping, p);
> > }
> >
> > + if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> > + Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> > + page_to_pfn(p));
> > +
> > return RECOVERED;
> > }
> >
> >
> > > OK this is the point I was missing.
> > >
> > > Should all be commented and put into mm/swap_state.c (or somewhere that
> > > Hugh prefers).
> >
> > But I doubt Hugh will welcome moving that bits into swap*.c ;)
>
> Why not? If he has to look at it anyway, he probably rather looks
> at fewer files :)

Heh. OK if that's more convenient - not a big issue for me really.

> > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > in the known good data from disk.
> > >
> > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > swapcache anyway?
> >
> > The ClearPageUptodate() is kind of a careless addition, in the hope
> > that it will stop some random readers. Need more investigations.
>
> OK. But it just muddies the waters in the meantime, so maybe take
> such things out until there is a case for them.

OK.

> > > > > You haven't waited on writeback here AFAIKS, and have you
> > > > > *really* verified it is safe to call delete_from_swap_cache?
> > > >
> > > > Good catch. I'll soon submit patches for handling the under
> > > > read/write IO pages. In this patchset they are simply ignored.
> > >
> > > Well that's quite important ;) I would suggest you just wait_on_page_writeback.
> > > It is simple and should work. _Unless_ you can show it is a big problem that
> > > needs an equivalently big mess to fix ;)
> >
> > Yes we could do wait_on_page_writeback() if necessary. The downside is,
> > keeping writeback page in page cache opens a small time window for
> > some one to access the page.
>
> AFAIKS there already is such a window? You're doing lock_page and such.

You know I'm such a crazy guy - I'm going to do try_lock_page() for
intercepting under read IOs 8-)

> No, it seems rather insane to do something like this here that no other
> code in the mm ever does.

Yes it's kind of insane. I'm interested in reasoning it out though.

Thanks,
Fengguang

2009-06-01 14:40:58

by Nick Piggin

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 01, 2009 at 10:05:53PM +0800, Wu Fengguang wrote:
> On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> > The problem is that then you have lost synchronization in the
> > pagecache. Nothing then prevents a new page from being put
> > in there and trying to do IO to or from the same device as the
> > currently running writeback.
>
> [ I'm not setting my mind to get rid of wait_on_page_writeback(),
> however I'm curious about the consequences of not doing it :) ]
>
> You are right in that IO can happen for a new page at the same file offset.
> But I have analyzed that in another email:
>
> : The reason truncate_inode_pages_range() has to wait on writeback page
> : is to ensure data integrity. Otherwise if there comes two events:
> : truncate page A at offset X
> : populate page B at offset X
> : If A and B are all writeback pages, then B can hit disk first and then
> : be overwritten by A. Which corrupts the data at offset X from user's POV.
> :
> : But for hwpoison, there are no such worries. If A is poisoned, we do
> : our best to isolate it as well as intercepting its IO. If the interception
> : fails, it will trigger another machine check before hitting the disk.
> :
> : After all, poisoned A means the data at offset X is already corrupted.
> : It doesn't matter if there comes another B page.
>
> Does that make sense?

But you just said that you try to intercept the IO. So the underlying
data is not necessarily corrupt. And even if it was then what if it
was reinitialized to something else in the meantime (such as filesystem
metadata blocks?) You'd just be introducing worse possibilities for
corruption.

You will need to demonstrate a *big* advantage before doing crazy things
with writeback ;)


> > > > But I just don't like this one file having all that required knowledge
> > >
> > > Yes that's a big problem.
> > >
> > > One major complexity involves classify the page into different known
> > > types, by testing page flags, page_mapping, page_mapped, etc. This
> > > is not avoidable.
> >
> > No.
>
> If you don't know what kind of page it is, how do we know how to properly
> isolate it? Or do you mean the current classifying code can be
> simplified? Yeah that's kind of possible.

No I just was agreeing that it is not avoidable ;)


> > > Another major complexity is on calling the isolation routines to
> > > remove references from
> > > - PTE
> > > - page cache
> > > - swap cache
> > > - LRU list
> > > They more or less made some assumptions on their operating environment
> > > that we have to take care of. Unfortunately these complexities are
> > > also not easily resolvable.
> > >
> > > > (and few comments) of all the files in mm/. If you want to get rid
> > >
> > > I promise I'll add more comments :)
> >
> > OK, but they should still go in their relevant files. Or as best as
> > possible. Right now it's just silly to have all this here when much
> > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
>
> OK, I'll bear that point in mind.
>
> > > > of the page and don't care what it's count or dirtyness is, then
> > > > truncate_inode_pages_range is the correct API to use.
> > > >
> > > > (or you could extract out some of it so you can call it directly on
> > > > individual locked pages, if that helps).
> > >
> > > The patch to move over to truncate_complete_page() would look like this.
> > > It's not a big win indeed.
> >
> > No I don't mean to do this, but to move the truncate_inode_pages
> > code for truncating a single, locked, page into another function
> > in mm/truncate.c and then call that from here.
>
> It seems to me that truncate_complete_page() is already the code
> you want to move ;-) Or you mean more code around the call site of
> truncate_complete_page()?
>
> lock_page(page);
>
> wait_on_page_writeback(page);
> We could do this.
>
> if (page_mapped(page)) {
> unmap_mapping_range(mapping,
> (loff_t)page->index<<PAGE_CACHE_SHIFT,
> PAGE_CACHE_SIZE, 0);
> }
> We need a rather complex unmap logic.
>
> if (page->index > next)
> next = page->index;
> next++;
> truncate_complete_page(mapping, page);
> unlock_page(page);
>
> Now it's obvious that reusing more code than truncate_complete_page()
> is not easy (or natural).

Just lock the page and wait for writeback, then do the truncate
work in another function. In your case if you've already unmapped
the page then it won't try to unmap again so no problem.

Truncating from pagecache does not change ->index so you can
move the loop logic out.


> > > Yes we could do wait_on_page_writeback() if necessary. The downside is,
> > > keeping writeback page in page cache opens a small time window for
> > > some one to access the page.
> >
> > AFAIKS there already is such a window? You're doing lock_page and such.
>
> You know I'm such a crazy guy - I'm going to do try_lock_page() for
> intercepting under read IOs 8-)
>
> > No, it seems rather insane to do something like this here that no other
> > code in the mm ever does.
>
> Yes it's kind of insane. I'm interested in reasoning it out though.

I guess it is a good idea to start simple.

Considering that there are so many other types of pages that are
impossible to deal with or have holes, then I very strongly doubt
it will be worth so much complexity for closing the gap from 90%
to 90.1%. But we'll see.

2009-06-01 18:25:32

by Andi Kleen

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > Another major complexity is on calling the isolation routines to
> > remove references from
> > - PTE
> > - page cache
> > - swap cache
> > - LRU list
> > They more or less made some assumptions on their operating environment
> > that we have to take care of. Unfortunately these complexities are
> > also not easily resolvable.
> >
> > > (and few comments) of all the files in mm/. If you want to get rid
> >
> > I promise I'll add more comments :)
>
> OK, but they should still go in their relevant files. Or as best as
> possible. Right now it's just silly to have all this here when much
> of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.

Can you be more specific what that "all this" is?

> > > of the page and don't care what it's count or dirtyness is, then
> > > truncate_inode_pages_range is the correct API to use.
> > >
> > > (or you could extract out some of it so you can call it directly on
> > > individual locked pages, if that helps).
> >
> > The patch to move over to truncate_complete_page() would look like this.
> > It's not a big win indeed.
>
> No I don't mean to do this, but to move the truncate_inode_pages
> code for truncating a single, locked, page into another function
> in mm/truncate.c and then call that from here.

I took a look at that. First there's no direct equivalent of
me_pagecache_clean/dirty in truncate.c and to be honest I don't
see a clean way to refactor any of the existing functions to
do the same.

Then memory-failure already calls into the other files for
pretty much anything interesting (do_invalidatepage, cancel_dirty_page,
try_to_free_mapping) -- there is very little that memory-failure.c
does on its own.

These are also all already called from all over the kernel, e.g.
there are 15+ callers of try_to_release_page outside truncate.c

For do_invalidatepage and cancel_dirty_page it's not as clear cut, but there is
already precedent, with several callers outside truncate.c.

We could presumably move the swap cache functions, but given how simple
they are and that they just call directly into the swap code anyway, is there
much value in it? Hugh, can you give guidance?

static int me_swapcache_dirty(struct page *p)
{
	ClearPageDirty(p);

	if (!isolate_lru_page(p))
		page_cache_release(p);

	return DELAYED;
}

static int me_swapcache_clean(struct page *p)
{
	ClearPageUptodate(p);

	if (!isolate_lru_page(p))
		page_cache_release(p);

	delete_from_swap_cache(p);

	return RECOVERED;
}


>
> > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > in the known good data from disk.
> > >
> > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > swapcache anyway?
> >
> > The ClearPageUptodate() is kind of a careless addition, in the hope
> > that it will stop some random readers. Need more investigations.
>
> OK. But it just muddies the waters in the meantime, so maybe take
> such things out until there is a case for them.

It's gone

> > > > > You haven't waited on writeback here AFAIKS, and have you
> > > > > *really* verified it is safe to call delete_from_swap_cache?
> > > >
> > > > Good catch. I'll soon submit patches for handling the under
> > > > read/write IO pages. In this patchset they are simply ignored.
> > >
> > > Well that's quite important ;) I would suggest you just wait_on_page_writeback.
> > > It is simple and should work. _Unless_ you can show it is a big problem that
> > > needs an equivalently big mess to fix ;)
> >
> > Yes we could do wait_on_page_writeback() if necessary. The downside is,
> > keeping writeback page in page cache opens a small time window for
> > some one to access the page.
>
> AFAIKS there already is such a window? You're doing lock_page and such.

Yes there already is plenty of window.

> No, it seems rather insane to do something like this here that no other
> code in the mm ever does.

Just because the rest of the VM doesn't do it doesn't mean it can't make sense.

But the writeback windows are probably too short to care about. I haven't
done numbers on those; if it's a significant percentage of memory in
some workload it might be worth it, otherwise not.

But all of that would be in the future, right now I just want to get
the basic facility in.

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-01 21:12:13

by Hugh Dickins

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, 1 Jun 2009, Wu Fengguang wrote:
> On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> > On Thu, May 28, 2009 at 09:54:28PM +0800, Wu Fengguang wrote:
> > > On Thu, May 28, 2009 at 08:23:57PM +0800, Nick Piggin wrote:
> > > >
> > > > Should all be commented and put into mm/swap_state.c (or somewhere that
> > > > Hugh prefers).
> > >
> > > But I doubt Hugh will welcome moving that bits into swap*.c ;)
> >
> > Why not? If he has to look at it anyway, he probably rather looks
> > at fewer files :)
>
> Heh. OK if that's more convenient - not a big issue for me really.

Sorry for being so elusive, leaving you all guessing: it's kind of
you to consider me at all. As I remarked to Andi in private mail
earlier, I'm so far behind on my promises (especially to KSM) that
I don't expect to be looking at HWPOISON for quite a while.

I don't think I'd mind about the number of files to look at.

Generally I agree with Nick, wanting the rmap-ish code to be in rmap.c
and the swap_state-ish code to be in swap_state.c and the swapfile-ish
code to be in swapfile.c etc. (Though it's an acquired skill to work
out which is which of those two - one thing you can be sure of though,
if it's swap-related code, swap.c is strangely not the place for it.
Yikes, someone put swap_setup in there.)

But like most of us, I'm not so keen on #ifdefs: am I right to think
that if you distribute the hwpoison code around in its appropriate
source files, we'll have a nasty rash of #ifdefs all over? We can
sometimes get away with the optimizer removing what's not needed,
but that only works in the simpler cases.
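
One conventional way to limit that rash is a stub in a header, so the #ifdef lives in one place and the call sites in filemap.c or swap_state.c stay clean. A sketch only: the function name below is hypothetical, and the config symbol is assumed to be something like CONFIG_MEMORY_FAILURE, which this excerpt does not confirm.

	#ifdef CONFIG_MEMORY_FAILURE
	int hwpoison_isolate_pagecache_page(struct page *page);	/* hypothetical name */
	#else
	static inline int hwpoison_isolate_pagecache_page(struct page *page)
	{
		return 0;	/* feature compiled out: nothing to isolate */
	}
	#endif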

Maybe we should start out, as you have, with most of the hwpoison
code located in one file (rather like with migrate.c?); but hope
to refactor things and distribute it over time.

How seriously does the hwpoison work interfere with the assumptions
in other sourcefiles? If it's playing tricks liable to confuse
someone reading through those other files, then it would be better
to place the hwpoison code in those files, even though #ifdefed.

There, how's that for a frustratingly equivocal answer?

Hugh

2009-06-01 21:34:32

by Andi Kleen

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

> Maybe we should start out, as you have, with most of the hwpoison
> code located in one file (rather like with migrate.c?); but hope
> to refactor things and distribute it over time.

That's the plan.

>
> How seriously does the hwpoison work interfere with the assumptions
> in other sourcefiles? If it's playing tricks liable to confuse

I hope only minimally. It can come in at any time (that's partly the
tricky part), but in general it tries to be very conservative and
non-intrusive and just calls code used elsewhere.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-02 11:21:24

by Fengguang Wu

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> On Mon, Jun 01, 2009 at 10:05:53PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 01, 2009 at 07:50:46PM +0800, Nick Piggin wrote:
> > > The problem is that then you have lost synchronization in the
> > > pagecache. Nothing then prevents a new page from being put
> > > in there and trying to do IO to or from the same device as the
> > > currently running writeback.
> >
> > [ I'm not setting my mind to get rid of wait_on_page_writeback(),
> > however I'm curious about the consequences of not doing it :) ]
> >
> > You are right in that IO can happen for a new page at the same file offset.
> > But I have analyzed that in another email:
> >
> > : The reason truncate_inode_pages_range() has to wait on writeback page
> > : is to ensure data integrity. Otherwise if there comes two events:
> > : truncate page A at offset X
> > : populate page B at offset X
> > : If A and B are all writeback pages, then B can hit disk first and then
> > : be overwritten by A. Which corrupts the data at offset X from user's POV.
> > :
> > : But for hwpoison, there are no such worries. If A is poisoned, we do
> > : our best to isolate it as well as intercepting its IO. If the interception
> > : fails, it will trigger another machine check before hitting the disk.
> > :
> > : After all, poisoned A means the data at offset X is already corrupted.
> > : It doesn't matter if there comes another B page.
> >
> > Does that make sense?
>
> But you just said that you try to intercept the IO. So the underlying
> data is not necessarily corrupt. And even if it was then what if it
> was reinitialized to something else in the meantime (such as filesystem
> metadata blocks?) You'd just be introducing worse possibilities for
> corruption.

The IO interception will be based on PFN instead of file offset, so it
won't affect innocent pages such as your example of reinitialized data.
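
A purely illustrative sketch of what keying the interception on the physical page, rather than on a (mapping, offset) pair, could look like. It assumes the per-page poison flag the series introduces (the merged code calls it PageHWPoison()); the function name itself is hypothetical.

	/* called on the IO submission path, sketch only */
	static bool hwpoison_intercept_io(struct page *page)
	{
		if (PageHWPoison(page)) {	/* poison bit lives in struct page, i.e. per PFN */
			SetPageError(page);	/* the data is already lost */
			return true;		/* caller must not issue the bio */
		}
		return false;			/* a fresh page at the same offset passes through */
	}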

poisoned dirty page == corrupt data => process shall be killed
poisoned clean page == recoverable data => process shall survive

In the case of a dirty hwpoison page, if we reload the old on-disk data
and let the application proceed with it, it may lead to *silent* data
corruption/inconsistency, because the application will first see v2 and
then v1, which is illogical and hence may mess up its internal data
structures.

> You will need to demonstrate a *big* advantage before doing crazy things
> with writeback ;)

OK. We can do two things about poisoned writeback pages:

1) to stop IO for them, thus avoiding corrupted data hitting disk and/or
triggering further machine checks
2) to isolate them from page cache, thus preventing possible
references in the writeback time window

1) is important, because there may be many writeback pages in a
production system.

2) is good to have if possible, because the time window may grow large
when the writeback IO queue is congested or the async write
requests are held off by many sync read/write requests.

> > > > > But I just don't like this one file having all that required knowledge
> > > >
> > > > Yes that's a big problem.
> > > >
> > > > One major complexity involves classify the page into different known
> > > > types, by testing page flags, page_mapping, page_mapped, etc. This
> > > > is not avoidable.
> > >
> > > No.
> >
> > If you don't know what kind of page it is, how do we know how to properly
> > isolate it? Or do you mean the current classifying code can be
> > simplified? Yeah that's kind of possible.
>
> No I just was agreeing that it is not avoidable ;)

Ah OK.

> > > > Another major complexity is on calling the isolation routines to
> > > > remove references from
> > > > - PTE
> > > > - page cache
> > > > - swap cache
> > > > - LRU list
> > > > They more or less made some assumptions on their operating environment
> > > > that we have to take care of. Unfortunately these complexities are
> > > > also not easily resolvable.
> > > >
> > > > > (and few comments) of all the files in mm/. If you want to get rid
> > > >
> > > > I promise I'll add more comments :)
> > >
> > > OK, but they should still go in their relevant files. Or as best as
> > > possible. Right now it's just silly to have all this here when much
> > > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> >
> > OK, I'll bear that point in mind.
> >
> > > > > of the page and don't care what it's count or dirtyness is, then
> > > > > truncate_inode_pages_range is the correct API to use.
> > > > >
> > > > > (or you could extract out some of it so you can call it directly on
> > > > > individual locked pages, if that helps).
> > > >
> > > > The patch to move over to truncate_complete_page() would look like this.
> > > > It's not a big win indeed.
> > >
> > > No I don't mean to do this, but to move the truncate_inode_pages
> > > code for truncating a single, locked, page into another function
> > > in mm/truncate.c and then call that from here.
> >
> > It seems to me that truncate_complete_page() is already the code
> > you want to move ;-) Or you mean more code around the call site of
> > truncate_complete_page()?
> >
> > lock_page(page);
> >
> > wait_on_page_writeback(page);
> > We could do this.
> >
> > if (page_mapped(page)) {
> > unmap_mapping_range(mapping,
> > (loff_t)page->index<<PAGE_CACHE_SHIFT,
> > PAGE_CACHE_SIZE, 0);
> > }
> > We need a rather complex unmap logic.
> >
> > if (page->index > next)
> > next = page->index;
> > next++;
> > truncate_complete_page(mapping, page);
> > unlock_page(page);
> >
> > Now it's obvious that reusing more code than truncate_complete_page()
> > is not easy (or natural).
>
> Just lock the page and wait for writeback, then do the truncate
> work in another function. In your case if you've already unmapped
> the page then it won't try to unmap again so no problem.
>
> Truncating from pagecache does not change ->index so you can
> move the loop logic out.

Right. So effectively the reusable function is exactly
truncate_complete_page(). As I said this reuse is not a big gain.

> > > > Yes we could do wait_on_page_writeback() if necessary. The downside is,
> > > > keeping writeback page in page cache opens a small time window for
> > > > some one to access the page.
> > >
> > > AFAIKS there already is such a window? You're doing lock_page and such.
> >
> > You know I'm such a crazy guy - I'm going to do try_lock_page() for
> > intercepting under read IOs 8-)
> >
> > > No, it seems rather insane to do something like this here that no other
> > > code in the mm ever does.
> >
> > Yes it's kind of insane. I'm interested in reasoning it out though.
>
> I guess it is a good idea to start simple.

Agreed.

> Considering that there are so many other types of pages that are
> impossible to deal with or have holes, then I very strongly doubt
> it will be worth so much complexity for closing the gap from 90%
> to 90.1%. But we'll see.

Yes, the plan is to first focus on the more important cases.

Thanks,
Fengguang

2009-06-02 12:00:49

by Nick Piggin

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Mon, Jun 01, 2009 at 08:32:25PM +0200, Andi Kleen wrote:
> On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > > Another major complexity is on calling the isolation routines to
> > > remove references from
> > > - PTE
> > > - page cache
> > > - swap cache
> > > - LRU list
> > > They more or less made some assumptions on their operating environment
> > > that we have to take care of. Unfortunately these complexities are
> > > also not easily resolvable.
> > >
> > > > (and few comments) of all the files in mm/. If you want to get rid
> > >
> > > I promise I'll add more comments :)
> >
> > OK, but they should still go in their relevant files. Or as best as
> > possible. Right now it's just silly to have all this here when much
> > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
>
> Can you be more specific what that "all this" is?

The functions which take action in response to a bad page being
detected. They belong with the subsystem that the page belongs
to. I'm amazed this is causing so much argument or confusion
because it is how the rest of mm/ code is arranged. OK, Hugh has
a point about ifdefs, but OTOH we have lots of ifdefs like this.


> > > > of the page and don't care what it's count or dirtyness is, then
> > > > truncate_inode_pages_range is the correct API to use.
> > > >
> > > > (or you could extract out some of it so you can call it directly on
> > > > individual locked pages, if that helps).
> > >
> > > The patch to move over to truncate_complete_page() would look like this.
> > > It's not a big win indeed.
> >
> > No I don't mean to do this, but to move the truncate_inode_pages
> > code for truncating a single, locked, page into another function
> > in mm/truncate.c and then call that from here.
>
> I took a look at that. First there's no direct equivalent of
> me_pagecache_clean/dirty in truncate.c and to be honest I don't
> see a clean way to refactor any of the existing functions to
> do the same.

With all that writing you could have just done it. It's really
not a big deal and just avoids duplicating code. I attached an
(untested) patch.


> > No, it seems rather insane to do something like this here that no other
> > code in the mm ever does.
>
> > Just because the rest of the VM doesn't do it doesn't mean it can't make sense.

It is going to be possible to do it somehow surely, but it is insane
to try to add such constraints to the VM to close a few small windows
if you already have other large ones.

---
mm/truncate.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -135,6 +135,16 @@ invalidate_complete_page(struct address_
return ret;
}

+void truncate_inode_page(struct address_space *mapping, struct page *page)
+{
+ if (page_mapped(page)) {
+ unmap_mapping_range(mapping,
+ (loff_t)page->index<<PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ }
+ truncate_complete_page(mapping, page);
+}
+
/**
* truncate_inode_pages - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
@@ -196,12 +206,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
continue;
}
- if (page_mapped(page)) {
- unmap_mapping_range(mapping,
- (loff_t)page_index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
- }
- truncate_complete_page(mapping, page);
+ truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_release(&pvec);
@@ -238,15 +243,10 @@ void truncate_inode_pages_range(struct a
break;
lock_page(page);
wait_on_page_writeback(page);
- if (page_mapped(page)) {
- unmap_mapping_range(mapping,
- (loff_t)page->index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
- }
+ truncate_inode_page(mapping, page);
if (page->index > next)
next = page->index;
next++;
- truncate_complete_page(mapping, page);
unlock_page(page);
}
pagevec_release(&pvec);
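
For illustration, here is roughly how me_pagecache_clean() could use the helper from the patch above, assuming truncate_inode_page() is made non-static; a simplified sketch, not code from either poster (the try_to_release_page() error reporting discussed elsewhere in the thread is omitted):

	static int me_pagecache_clean(struct page *p)
	{
		struct address_space *mapping = page_mapping(p);

		if (!isolate_lru_page(p))
			page_cache_release(p);

		if (mapping)
			truncate_inode_page(mapping, p);	/* unmaps if needed, then drops it from pagecache */

		return RECOVERED;
	}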

2009-06-02 12:19:48

by Nick Piggin

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > But you just said that you try to intercept the IO. So the underlying
> > data is not necessarily corrupt. And even if it was then what if it
> > was reinitialized to something else in the meantime (such as filesystem
> > metadata blocks?) You'd just be introducing worse possibilities for
> > corruption.
>
> The IO interception will be based on PFN instead of file offset, so it
> won't affect innocent pages such as your example of reinitialized data.

OK, if you could intercept the IO so it never happens at all, yes
of course that could work.


> poisoned dirty page == corrupt data => process shall be killed
> poisoned clean page == recoverable data => process shall survive
>
> In the case of dirty hwpoison page, if we reload the on disk old data
> and let application proceed with it, it may lead to *silent* data
> corruption/inconsistency, because the application will first see v2
> then v1, which is illogical and hence may mess up its internal data
> structure.

Right, but how do you prevent that? There is no way to reconstruct the
most up-to-date data because it was destroyed.


> > You will need to demonstrate a *big* advantage before doing crazy things
> > with writeback ;)
>
> OK. We can do two things about poisoned writeback pages:
>
> 1) to stop IO for them, thus avoid corrupted data to hit disk and/or
> trigger further machine checks

1b) At which point, you invoke the end-io handlers, and the page is
no longer writeback.

> 2) to isolate them from page cache, thus preventing possible
> references in the writeback time window

And then this is possible because you aren't violating mm
assumptions due to 1b. This proceeds just as the existing
pagecache mce error handler case which exists now.
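
A hedged sketch of what 1b) amounts to, assuming the in-flight IO really has been intercepted and will never complete on its own (otherwise the real end-io handler would run on top of this); the function name is hypothetical, not from the patchset.

	static void hwpoison_fail_writeback(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);

		SetPageError(page);
		if (mapping)
			mapping_set_error(mapping, -EIO);	/* surface the failure to fsync() callers */
		end_page_writeback(page);	/* clears PageWriteback and wakes any waiters */
	}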


> > > Now it's obvious that reusing more code than truncate_complete_page()
> > > is not easy (or natural).
> >
> > Just lock the page and wait for writeback, then do the truncate
> > work in another function. In your case if you've already unmapped
> > the page then it won't try to unmap again so no problem.
> >
> > Truncating from pagecache does not change ->index so you can
> > move the loop logic out.
>
> Right. So effectively the reusable function is exactly
> truncate_complete_page(). As I said this reuse is not a big gain.

Anyway, we don't have to argue about it. I already sent a patch
because it was so hard to do, so let's move past this ;)


> > > Yes it's kind of insane. I'm interested in reasoning it out though.

Well, with the IO interception (I missed this point), it seems
maybe no longer so insane. We could see how it looks.


> > I guess it is a good idea to start simple.
>
> Agreed.
>
> > Considering that there are so many other types of pages that are
> > impossible to deal with or have holes, then I very strongly doubt
> > it will be worth so much complexity for closing the gap from 90%
> > to 90.1%. But we'll see.
>
> Yes, the plan is to first focus on the more important cases.

Great.

Thanks,
Nick

2009-06-02 12:40:58

by Andi Kleen

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > > > Another major complexity is on calling the isolation routines to
> > > > remove references from
> > > > - PTE
> > > > - page cache
> > > > - swap cache
> > > > - LRU list
> > > > They more or less made some assumptions on their operating environment
> > > > that we have to take care of. Unfortunately these complexities are
> > > > also not easily resolvable.
> > > >
> > > > > (and few comments) of all the files in mm/. If you want to get rid
> > > >
> > > > I promise I'll add more comments :)
> > >
> > > OK, but they should still go in their relevant files. Or as best as
> > > possible. Right now it's just silly to have all this here when much
> > > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> >
> > Can you be more specific what that "all this" is?
>
> The functions which take action in response to a bad page being
> detected. They belong with the subsystem that the page belongs
> to. I'm amazed this is causing so much argument or confusion
> because it is how the rest of mm/ code is arranged. OK, Hugh has
> a point about ifdefs, but OTOH we have lots of ifdefs like this.

Well we're already calling into that subsystem, just not with
a single function call.

> > > > > of the page and don't care what it's count or dirtyness is, then
> > > > > truncate_inode_pages_range is the correct API to use.
> > > > >
> > > > > (or you could extract out some of it so you can call it directly on
> > > > > individual locked pages, if that helps).
> > > >
> > > > The patch to move over to truncate_complete_page() would look like this.
> > > > It's not a big win indeed.
> > >
> > > No I don't mean to do this, but to move the truncate_inode_pages
> > > code for truncating a single, locked, page into another function
> > > in mm/truncate.c and then call that from here.
> >
> > I took a look at that. First there's no direct equivalent of
> > me_pagecache_clean/dirty in truncate.c and to be honest I don't
> > see a clean way to refactor any of the existing functions to
> > do the same.
>
> With all that writing you could have just done it. It's really

I would have done it if it made sense to me, but so far it hasn't.

The problem with your suggestion is that you do the big picture,
but seem to skip over a lot of details. But details matter.

> not a big deal and just avoids duplicating code. I attached an
> (untested) patch.

Thanks. But the function in the patch is not doing the same as what
the me_pagecache_clean/dirty are doing. For one, there is no error
checking, as in the second try_to_release_page()

Then it doesn't do all the IO error and missing mapping handling.

The page_mapped() check is useless because the pages are not
mapped here etc.

We could probably call truncate_complete_page(), but then
we would also need to duplicate most of the checking outside
the function anyways and there wouldn't be any possibility
to share the clean/dirty variants. If you insist I can
do it, but I think it would be significantly worse code
than before and I'm reluctant to do that.

I don't also really see what the big deal is of just
calling these few functions directly. After all we're not
truncating here and they're all already called from other files.

> > > No, it seems rather insane to do something like this here that no other
> > > code in the mm ever does.
> >
> > > Just because the rest of the VM doesn't do it doesn't mean it can't make sense.
>
> It is going to be possible to do it somehow surely, but it is insane
> to try to add such constraints to the VM to close a few small windows

We don't know currently if they are small. If they are small I would
agree with you, but that needs numbers. That said fancy writeback handling
is currently not on my agenda.

> if you already have other large ones.

That's unclear too.

-Andi

2009-06-02 12:57:20

by Nick Piggin

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > not a big deal and just avoids duplicating code. I attached an
> > (untested) patch.
>
> Thanks. But the function in the patch is not doing the same what
> the me_pagecache_clean/dirty are doing. For once there is no error
> checking, as in the second try_to_release_page()
>
> Then it doesn't do all the IO error and missing mapping handling.

Obviously I don't mean just use that single call for the entire
handler. You can set the EIO bit or whatever you like. The
"error handling" you have there also seems strange. You could
retain it, but the page is assured to be removed from pagecache.


> The page_mapped() check is useless because the pages are not
> mapped here etc.

That's OK, it is a core part of the protocol to prevent
truncated pages from being mapped, so I like it to be in
that function.

(you are also doing extraneous page_mapped tests in your handler,
so surely your concern isn't from the perspective of this
error handler code)


> We could probably call truncate_complete_page(), but then
> we would also need to duplicate most of the checking outside
> the function anyways and there wouldn't be any possibility
> to share the clean/dirty variants. If you insist I can
> do it, but I think it would be significantly worse code
> than before and I'm reluctant to do that.

I can write you the patch for that too if you like.


> I don't also really see what the big deal is of just
> calling these few functions directly. After all we're not
> truncating here and they're all already called from other files.
>
> > > > No, it seems rather insane to do something like this here that no other
> > > > code in the mm ever does.
> > >
> > > Just because the rest of the VM doesn't do it doesn't mean it can't make sense.
> >
> > It is going to be possible to do it somehow surely, but it is insane
> > to try to add such constraints to the VM to close a few small windows
>
> We don't know currently if they are small. If they are small I would
> agree with you, but that needs numbers. That said fancy writeback handling
> is currently not on my agenda.

Yes, writeback pages are very limited, a tiny number at any time and
the fraction gets relatively smaller as total RAM size gets larger.


> > if you already have other large ones.
>
> That's unclear too.

Most kernel pages, which you can't do much about, and dirty metadata pages
are both going to cause big problems. User pagetable pages. Lots of
stuff.

2009-06-02 13:18:37

by Andi Kleen

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > not a big deal and just avoids duplicating code. I attached an
> > > (untested) patch.
> >
> > Thanks. But the function in the patch is not doing the same what
> > the me_pagecache_clean/dirty are doing. For once there is no error
> > checking, as in the second try_to_release_page()
> >
> > Then it doesn't do all the IO error and missing mapping handling.
>
> Obviously I don't mean just use that single call for the entire
> handler. You can set the EIO bit or whatever you like. The
> "error handling" you have there also seems strange. You could
> retain it, but the page is assured to be removed from pagecache.

The reason this code double checks is that someone could have
a reference (remember we can come in any time) we cannot kill immediately.

> > The page_mapped() check is useless because the pages are not
> > mapped here etc.
>
> That's OK, it is a core part of the protocol to prevent
> truncated pages from being mapped, so I like it to be in
> that function.
>
> (you are also doing extraneous page_mapped tests in your handler,
> so surely your concern isn't from the perspective of this
> error handler code)

We do page_mapping() checks, not page_mapped checks.

I know details, but ...

>
>
> > We could probably call truncate_complete_page(), but then
> > we would also need to duplicate most of the checking outside
> > the function anyways and there wouldn't be any possibility
> > to share the clean/dirty variants. If you insist I can
> > do it, but I think it would be significantly worse code
> > than before and I'm reluctant to do that.
>
> I can write you the patch for that too if you like.

Ok I will write it, but I will add a comment saying that Nick forced
me to make the code worse @)

It'll be fairly redundant at least.

> > > if you already have other large ones.
> >
> > That's unclear too.
>
> You can't do much about most kernel pages, and dirty metadata pages
> are both going to cause big problems. User pagetable pages. Lots of
> stuff.

User page tables were on the todo list; these are actually relatively
easy. The biggest issue is to detect them.

Metadata would likely need file system callbacks, which I would like to
avoid at this point.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-02 13:24:48

by Nick Piggin

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > > not a big deal and just avoids duplicating code. I attached an
> > > > (untested) patch.
> > >
> > > Thanks. But the function in the patch is not doing the same what
> > > the me_pagecache_clean/dirty are doing. For once there is no error
> > > checking, as in the second try_to_release_page()
> > >
> > > Then it doesn't do all the IO error and missing mapping handling.
> >
> > Obviously I don't mean just use that single call for the entire
> > handler. You can set the EIO bit or whatever you like. The
> > "error handling" you have there also seems strange. You could
> > retain it, but the page is assured to be removed from pagecache.
>
> The reason this code double checks is that someone could have
> a reference (remember we can come in any time) we cannot kill immediately.

Can't kill what? The page is gone from pagecache. There may remain
other kernel references, but I don't see why this code would
consider that a failure (and not, for example, a raised error
count).


> > > The page_mapped() check is useless because the pages are not
> > > mapped here etc.
> >
> > That's OK, it is a core part of the protocol to prevent
> > truncated pages from being mapped, so I like it to be in
> > that function.
> >
> > (you are also doing extraneous page_mapped tests in your handler,
> > so surely your concern isn't from the perspective of this
> > error handler code)
>
> We do page_mapping() checks, not page_mapped checks.
>
> I know details, but ...

+static int me_pagecache_clean(struct page *p)
+{
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ if (page_has_private(p))
+ do_invalidatepage(p, 0);
+ if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+ Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+ page_to_pfn(p));
+
+ /*
+ * remove_from_page_cache assumes (mapping && !mapped)
+ */
+ if (page_mapping(p) && !page_mapped(p)) {
^^^^^^^^^^^^^^^

+ remove_from_page_cache(p);
+ page_cache_release(p);
+ }
+
+ return RECOVERED;


> > > we would also need to duplicate most of the checking outside
> > > the function anyways and there wouldn't be any possibility
> > > to share the clean/dirty variants. If you insist I can
> > > do it, but I think it would be significantly worse code
> > > than before and I'm reluctant to do that.
> >
> > I can write you the patch for that too if you like.
>
> Ok I will write it, but I will add a comment saying that Nick forced
> me to make the code worse @)
>
> It'll be fairly redundant at least.

If it's that bad, then I'll be happy to rewrite it for you.


> > > > if you already have other large ones.
> > >
> > > That's unclear too.
> >
> > You can't do much about most kernel pages, and dirty metadata pages
> > are both going to cause big problems. User pagetable pages. Lots of
> > stuff.
>
> > User page tables were on the todo list; these are actually relatively
> easy. The biggest issue is to detect them.
>
> Metadata would likely need file system callbacks, which I would like to
> avoid at this point.

So I just don't know why you argue the point that you have lots
of large holes left.

2009-06-02 13:32:52

by Fengguang Wu

Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 08:47:57PM +0800, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > > On Mon, Jun 01, 2009 at 01:50:46PM +0200, Nick Piggin wrote:
> > > > > Another major complexity is on calling the isolation routines to
> > > > > remove references from
> > > > > - PTE
> > > > > - page cache
> > > > > - swap cache
> > > > > - LRU list
> > > > > They more or less made some assumptions on their operating environment
> > > > > that we have to take care of. Unfortunately these complexities are
> > > > > also not easily resolvable.
> > > > >
> > > > > > (and few comments) of all the files in mm/. If you want to get rid
> > > > >
> > > > > I promise I'll add more comments :)
> > > >
> > > > OK, but they should still go in their relevant files. Or as best as
> > > > possible. Right now it's just silly to have all this here when much
> > > > of it could be moved out to filemap.c, swap_state.c, page_alloc.c, etc.
> > >
> > > Can you be more specific what that "all this" is?
> >
> > The functions which take action in response to a bad page being
> > detected. They belong with the subsystem that the page belongs
> > to. I'm amazed this is causing so much argument or confusion
> > because it is how the rest of mm/ code is arranged. OK, Hugh has
> > a point about ifdefs, but OTOH we have lots of ifdefs like this.
>
> Well we're already calling into that subsystem, just not with
> a single function call.
>
> > > > > > of the page and don't care what it's count or dirtyness is, then
> > > > > > truncate_inode_pages_range is the correct API to use.
> > > > > >
> > > > > > (or you could extract out some of it so you can call it directly on
> > > > > > individual locked pages, if that helps).
> > > > >
> > > > > The patch to move over to truncate_complete_page() would look like this.
> > > > > It's not a big win indeed.
> > > >
> > > > No I don't mean to do this, but to move the truncate_inode_pages
> > > > code for truncating a single, locked, page into another function
> > > > in mm/truncate.c and then call that from here.
> > >
> > > I took a look at that. First there's no direct equivalent of
> > > me_pagecache_clean/dirty in truncate.c and to be honest I don't
> > > see a clean way to refactor any of the existing functions to
> > > do the same.
> >
> > With all that writing you could have just done it. It's really
>
> I would have done it if it made sense to me, but so far it hasn't.
>
> The problem with your suggestion is that you do the big picture,
> but seem to skip over a lot of details. But details matter.
>
> > not a big deal and just avoids duplicating code. I attached an
> > (untested) patch.
>
> Thanks. But the function in the patch is not doing the same as what
> the me_pagecache_clean/dirty are doing. For one, there is no error
> checking, as in the second try_to_release_page()
>
> Then it doesn't do all the IO error and missing mapping handling.
>
> The page_mapped() check is useless because the pages are not
> mapped here etc.
>
> We could probably call truncate_complete_page(), but then
> we would also need to duplicate most of the checking outside
> the function anyways and there wouldn't be any possibility
> to share the clean/dirty variants. If you insist I can
> do it, but I think it would be significantly worse code
> than before and I'm reluctant to do that.
>
> I also don't really see what the big deal is of just
> calling these few functions directly. After all we're not
> truncating here and they're all already called from other files.

Yes I like the current "one code block calling one elemental function
to isolate from one reference source" scenario:
- PTE
- page cache
- swap cache
- LRU list

Calling into the generic truncate code only messes up the concepts.
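
In code terms the shape I want to keep is roughly this (a sketch only,
with the flags and error handling of the real handlers elided):

	/* PTE: unmap the page from all processes */
	try_to_unmap(p, ...);

	/* LRU list */
	if (!isolate_lru_page(p))
		page_cache_release(p);

	/* page cache (file backed pages) */
	if (page_mapping(p) && !page_mapped(p)) {
		remove_from_page_cache(p);
		page_cache_release(p);
	}

	/* swap cache (anonymous pages) */
	if (PageSwapCache(p))
		delete_from_swap_cache(p);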

Thanks,
Fengguang

2009-06-02 13:34:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 03:24:41PM +0200, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> > On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > > > not a big deal and just avoids duplicating code. I attached an
> > > > > (untested) patch.
> > > >
> > > > Thanks. But the function in the patch is not doing the same as what
> > > > the me_pagecache_clean/dirty are doing. For one, there is no error
> > > > checking, as in the second try_to_release_page()
> > > >
> > > > Then it doesn't do all the IO error and missing mapping handling.
> > >
> > > Obviously I don't mean just use that single call for the entire
> > > handler. You can set the EIO bit or whatever you like. The
> > > "error handling" you have there also seems strange. You could
> > > retain it, but the page is assured to be removed from pagecache.
> >
> > The reason this code double checks is that someone could have
> > a reference (remember we can come in any time) we cannot kill immediately.
>
> Can't kill what? The page is gone from pagecache. There may remain
> other kernel references, but I don't see why this code will
> consider this as a failure (and not, for example, a raised error
> count).

It's a failure because the page was still used and not successfully
isolated.

> + * remove_from_page_cache assumes (mapping && !mapped)
> + */
> + if (page_mapping(p) && !page_mapped(p)) {

Ok you're right. That one is not needed. I will remove it.

> >
> > User page tables was on the todo list, these are actually relatively
> > easy. The biggest issue is to detect them.
> >
> > Metadata would likely need file system callbacks, which I would like to
> > avoid at this point.
>
> So I just don't know why you argue the point that you have lots
> of large holes left.

I didn't argue that. My point was just that I currently don't have
data on which holes are the worst on given workloads. If I figure out at
some point that writeback pages are a significant part of some important
workload I would be interested in tackling them.
That said I think that's unlikely, but I'm not ruling it out.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-02 13:40:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 03:41:26PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 03:24:41PM +0200, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> > > The reason this code double checks is that someone could have
> > > a reference (remember we can come in any time) we cannot kill immediately.
> >
> > Can't kill what? The page is gone from pagecache. There may remain
> > other kernel references, but I don't see why this code will
> > consider this as a failure (and not, for example, a raised error
> > count).
>
> It's a failure because the page was still used and not successfully
> isolated.

But you're predicating success on page_count, so there can be
other users anyway. You do check page_count later and emit
a different message in this case, but even that isn't enough
to tell you if it has no more users.

I wouldn't have thought it's worth the complication, but
there is nothing preventing you using my truncate function
and also keeping this error check to test afterwards.
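
I.e. something along these lines (the helper name here is just a
placeholder for whatever my patch ends up calling it):

	truncate_hwpoison_page(page_mapping(p), p); /* placeholder for the truncate helper */
	if (page_count(p) > 1)		/* other references remain: not fully isolated */
		return FAILED;
	return RECOVERED;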


> > + * remove_from_page_cache assumes (mapping && !mapped)
> > + */
> > + if (page_mapping(p) && !page_mapped(p)) {
>
> Ok you're right. That one is not needed. I will remove it.
>
> > >
> > > User page tables was on the todo list, these are actually relatively
> > > easy. The biggest issue is to detect them.
> > >
> > > Metadata would likely need file system callbacks, which I would like to
> > > avoid at this point.
> >
> > So I just don't know why you argue the point that you have lots
> > of large holes left.
>
> I didn't argue that. My point was just that I currently don't have
> data on which holes are the worst on given workloads. If I figure out at
> some point that writeback pages are a significant part of some important
> workload I would be interested in tackling them.
> That said I think that's unlikely, but I'm not ruling it out.

Well, it sounds like maybe there is a sane way to do them with your
IO interception... but anyway let's not worry about this right
now ;)

2009-06-02 13:53:45

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 09:41:26PM +0800, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 03:24:41PM +0200, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 03:25:38PM +0200, Andi Kleen wrote:
> > > On Tue, Jun 02, 2009 at 02:57:13PM +0200, Nick Piggin wrote:
> > > > > > not a big deal and just avoids duplicating code. I attached an
> > > > > > (untested) patch.
> > > > >
> > > > > Thanks. But the function in the patch is not doing the same as what
> > > > > the me_pagecache_clean/dirty are doing. For one, there is no error
> > > > > checking, as in the second try_to_release_page()
> > > > >
> > > > > Then it doesn't do all the IO error and missing mapping handling.
> > > >
> > > > Obviously I don't mean just use that single call for the entire
> > > > handler. You can set the EIO bit or whatever you like. The
> > > > "error handling" you have there also seems strange. You could
> > > > retain it, but the page is assured to be removed from pagecache.
> > >
> > > The reason this code double checks is that someone could have
> > > a reference (remember we can come in any time) we cannot kill immediately.
> >
> > Can't kill what? The page is gone from pagecache. There may remain
> > other kernel references, but I don't see why this code will
> > consider this as a failure (and not, for example, a raised error
> > count).
>
> It's a failure because the page was still used and not successfully
> isolated.
>
> > + * remove_from_page_cache assumes (mapping && !mapped)
> > + */
> > + if (page_mapping(p) && !page_mapped(p)) {
>
> Ok you're right. That one is not needed. I will remove it.

No! Please read the comment. In fact __remove_from_page_cache() has a

BUG_ON(page_mapped(page));

Or, at least correct that BUG_ON() line together.

Thanks,
Fengguang

2009-06-02 13:54:02

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 08:57:13PM +0800, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> > On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> > > not a big deal and just avoids duplicating code. I attached an
> > > (untested) patch.
> >
> > Thanks. But the function in the patch is not doing the same as what
> > the me_pagecache_clean/dirty are doing. For one, there is no error
> > checking, as in the second try_to_release_page()
> >
> > Then it doesn't do all the IO error and missing mapping handling.
>
> Obviously I don't mean just use that single call for the entire
> handler. You can set the EIO bit or whatever you like. The
> "error handling" you have there also seems strange. You could
> retain it, but the page is assured to be removed from pagecache.

You mean this?

if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
return FAILED;

If page->private cannot be removed, that means some fs may start IO on it, so
we return FAILED.

> > The page_mapped() check is useless because the pages are not
> > mapped here etc.
>
> That's OK, it is a core part of the protocol to prevent
> truncated pages from being mapped, so I like it to be in
> that function.

Right.

> (you are also doing extraneous page_mapped tests in your handler,
> so surely your concern isn't from the perspective of this
> error handler code)

That's because the initial try_to_unmap() may fail and the page may still
remain mapped, while remove_from_page_cache() assumes !page_mapped().

> > We could probably call truncate_complete_page(), but then
> > we would also need to duplicate most of the checking outside
> > the function anyways and there wouldn't be any possibility
> > to share the clean/dirty variants. If you insist I can
> > do it, but I think it would be significantly worse code
> > than before and I'm reluctant to do that.
>
> I can write you the patch for that too if you like.

I have already posted one on truncate_complete_page(). Not the way you want it?

> > I also don't really see what the big deal is of just
> > calling these few functions directly. After all we're not
> > truncating here and they're all already called from other files.
> >
> > > > > No, it seems rather insane to do something like this here that no other
> > > > > code in the mm ever does.
> > > >
> > > > Just because the rest of the VM doesn't do it doesn't mean it can't make sense.
> > >
> > > It is going to be possible to do it somehow surely, but it is insane
> > > to try to add such constraints to the VM to close a few small windows
> >
> > We don't know currently if they are small. If they are small I would
> > agree with you, but that needs numbers. That said fancy writeback handling
> > is currently not on my agenda.
>
> Yes, writeback pages are very limited, a tiny number at any time and
> the fraction gets relatively smaller as total RAM size gets larger.

Yes they are less interesting for now.

> > > if you already have other large ones.
> >
> > That's unclear too.
>
> You can't do much about most kernel pages, and dirty metadata pages
> are both going to cause big problems. User pagetable pages. Lots of
> stuff.

Yes, that's a network of pointers that's hard to break away from.

Thanks,
Fengguang

2009-06-02 13:54:21

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 08:19:40PM +0800, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > > But you just said that you try to intercept the IO. So the underlying
> > > data is not necessarily corrupt. And even if it was then what if it
> > > was reinitialized to something else in the meantime (such as filesystem
> > > metadata blocks?) You'd just be introducing worse possibilities for
> > > > corruption.
> >
> > The IO interception will be based on PFN instead of file offset, so it
> > won't affect innocent pages such as your example of reinitialized data.
>
> OK, if you could intercept the IO so it never happens at all, yes
> of course that could work.
>
> > poisoned dirty page == corrupt data => process shall be killed
> > poisoned clean page == recoverable data => process shall survive
> >
> > In the case of dirty hwpoison page, if we reload the on disk old data
> > and let application proceed with it, it may lead to *silent* data
> > corruption/inconsistency, because the application will first see v2
> > then v1, which is illogical and hence may mess up its internal data
> > structure.
>
> Right, but how do you prevent that? There is no way to reconstruct the
> most up-to-date data because it was destroyed.

To kill the application ruthlessly, rather than allow it to go rotten quietly.

> > > You will need to demonstrate a *big* advantage before doing crazy things
> > > with writeback ;)
> >
> > OK. We can do two things about poisoned writeback pages:
> >
> > 1) to stop IO for them, thus avoid corrupted data to hit disk and/or
> > trigger further machine checks
>
> 1b) At which point, you invoke the end-io handlers, and the page is
> no longer writeback.
>
> > 2) to isolate them from page cache, thus preventing possible
> > references in the writeback time window
>
> And then this is possible because you aren't violating mm
> assumptions due to 1b. This proceeds just as the existing
> pagecache mce error handler case which exists now.

Yeah that's a good scheme - we are talking about two interception
schemes. Mine is a passive one and yours is an active one.

passive: check hwpoison pages at __generic_make_request()/elv_next_request()
(the code will be enabled by an mce_bad_io_pages counter)

active: iterate all queued requests for hwpoison pages

Each has its merits and complexities.

I'll list the merits (+) and complexities (-) of the passive approach;
from them you automatically get the merits of the active one (a rough
sketch of the passive hook follows the list):

+ works on generic code and doesn't have to touch all the deadline/as/cfq elevators
- the wait_on_page_writeback() puzzle because of the writeback time window

+ could also intercept the "cannot de-dirty for now" pages when they
eventually go to writeback IO
- have to avoid filesystem references on PG_hwpoison pages, e.g.
- zeroing partial EOF page when i_size is not page aligned
- calculating checksums
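
For illustration only, the passive hook would sit roughly like this
(assuming a PageHWPoison() test and a global mce_bad_io_pages counter;
the exact placement is still to be decided):

	/* in __generic_make_request(), before the bio is queued */
	if (unlikely(atomic_read(&mce_bad_io_pages))) {
		struct bio_vec *bvec;
		int i;

		bio_for_each_segment(bvec, bio, i)
			if (PageHWPoison(bvec->bv_page)) {
				bio_endio(bio, -EIO);
				return;
			}
	}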


> > > > Now it's obvious that reusing more code than truncate_complete_page()
> > > > is not easy (or natural).
> > >
> > > Just lock the page and wait for writeback, then do the truncate
> > > work in another function. In your case if you've already unmapped
> > > the page then it won't try to unmap again so no problem.
> > >
> > > Truncating from pagecache does not change ->index so you can
> > > move the loop logic out.
> >
> > Right. So effectively the reusable function is exactly
> > truncate_complete_page(). As I said this reuse is not a big gain.
>
> Anyway, we don't have to argue about it. I already sent a patch
> because it was so hard to do, so let's move past this ;)
>
>
> > > > Yes it's kind of insane. I'm interested in reasoning it out though.
>
> Well with the IO interception (I missed this point), then it seems
> maybe no longer so insane. We could see how it looks.

OK.

Thanks,
Fengguang

2009-06-02 13:59:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

> > Ok you're right. That one is not needed. I will remove it.
>
> No! Please read the comment. In fact __remove_from_page_cache() has a
>
> BUG_ON(page_mapped(page));
>
> Or, at least correct that BUG_ON() line together.

Yes, but we already have them unmapped earlier and the poison check
in the page fault handler should prevent remapping.

So it really should not happen and if it happened we would deserve
the BUG.
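
The fault path check is essentially this (simplified, and the exact
hook point differs per path):

	/* after the fault path has locked the page */
	if (unlikely(PageHWPoison(page))) {
		unlock_page(page);
		page_cache_release(page);
		return VM_FAULT_SIGBUS;
	}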

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-02 14:01:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

> > > We could probably call truncate_complete_page(), but then
> > > we would also need to duplicate most of the checking outside
> > > the function anyways and there wouldn't be any possibility
> > > to share the clean/dirty variants. If you insist I can
> > > do it, but I think it would be significantly worse code
> > > than before and I'm reluctant to do that.
> >
> > I can write you the patch for that too if you like.
>
> I have already posted one on truncate_complete_page(). Not the way you want it?

Sorry I must have missed it (too much mail I guess). Can you repost please?

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-02 14:10:59

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 10:08:30PM +0800, Andi Kleen wrote:
> > > > We could probably call truncate_complete_page(), but then
> > > > we would also need to duplicate most of the checking outside
> > > > the function anyways and there wouldn't be any possibility
> > > > to share the clean/dirty variants. If you insist I can
> > > > do it, but I think it would be significantly worse code
> > > > than before and I'm reluctant to do that.
> > >
> > > I can write you the patch for that too if you like.
> >
> > I have already posted one on truncate_complete_page(). Not the way you want it?
>
> Sorry I must have missed it (too much mail I guess). Can you repost please?

OK, here it is, a further simplified one.

---
mm/memory-failure.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -324,23 +324,16 @@ static int me_free(struct page *p)
*/
static int me_pagecache_clean(struct page *p)
{
+ if (page_mapping(p))
+ truncate_complete_page(p->mapping, p);
+
if (!isolate_lru_page(p))
page_cache_release(p);

- if (page_has_private(p))
- do_invalidatepage(p, 0);
if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
page_to_pfn(p));

- /*
- * remove_from_page_cache assumes (mapping && !mapped)
- */
- if (page_mapping(p) && !page_mapped(p)) {
- remove_from_page_cache(p);
- page_cache_release(p);
- }
-
return RECOVERED;
}

2009-06-02 14:12:47

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 10:06:39PM +0800, Andi Kleen wrote:
> > > Ok you're right. That one is not needed. I will remove it.
> >
> > No! Please read the comment. In fact __remove_from_page_cache() has a
> >
> > BUG_ON(page_mapped(page));
> >
> > Or, at least correct that BUG_ON() line together.
>
> Yes, but we already have them unmapped earlier and the poison check

But you commented "try_to_unmap can fail temporarily due to races."

That's self-contradictory.

> in the page fault handler should prevent remapping.
>
> So it really should not happen and if it happened we would deserve
> the BUG.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.

2009-06-02 14:14:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 10:10:31PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 10:08:30PM +0800, Andi Kleen wrote:
> > > > > We could probably call truncate_complete_page(), but then
> > > > > we would also need to duplicate most of the checking outside
> > > > > the function anyways and there wouldn't be any possibility
> > > > > to share the clean/dirty variants. If you insist I can
> > > > > do it, but I think it would be significantly worse code
> > > > > than before and I'm reluctant to do that.
> > > >
> > > > I can write you the patch for that too if you like.
> > >
> > > I have already posted one on truncate_complete_page(). Not the way you want it?
> >
> > Sorry I must have missed it (too much mail I guess). Can you repost please?
>
> OK, here it is, a further simplified one.

I prefer mine because I don't want truncate_complete_page escaping
mm/truncate.c, since the caller then has to deal with truncate races.


> ---
> mm/memory-failure.c | 13 +++----------
> 1 file changed, 3 insertions(+), 10 deletions(-)
>
> --- sound-2.6.orig/mm/memory-failure.c
> +++ sound-2.6/mm/memory-failure.c
> @@ -324,23 +324,16 @@ static int me_free(struct page *p)
> */
> static int me_pagecache_clean(struct page *p)
> {
> + if (page_mapping(p))
> + truncate_complete_page(p->mapping, p);
> +
> if (!isolate_lru_page(p))
> page_cache_release(p);
>
> - if (page_has_private(p))
> - do_invalidatepage(p, 0);
> if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> page_to_pfn(p));
>
> - /*
> - * remove_from_page_cache assumes (mapping && !mapped)
> - */
> - if (page_mapping(p) && !page_mapped(p)) {
> - remove_from_page_cache(p);
> - page_cache_release(p);
> - }
> -
> return RECOVERED;
> }

2009-06-02 14:21:53

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 10:12:22PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 10:06:39PM +0800, Andi Kleen wrote:
> > > > Ok you're right. That one is not needed. I will remove it.
> > >
> > > No! Please read the comment. In fact __remove_from_page_cache() has a
> > >
> > > BUG_ON(page_mapped(page));
> > >
> > > Or, at least correct that BUG_ON() line together.
> >
> > Yes, but we already have them unmapped earlier and the poison check
>
> But you commented "try_to_unmap can fail temporarily due to races."
>
> That's self-contradictory.

If you use the bloody code I posted (and suggested from the start),
then you DON'T HAVE TO WORRY ABOUT THIS, because it is handled by
the subsystem that knows about it.

How anybody can say it will make your code overcomplicated or "is
not much improvement" is just totally beyond me.

2009-06-02 14:33:35

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 08:51:34PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 08:19:40PM +0800, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 07:14:07PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 01, 2009 at 10:40:51PM +0800, Nick Piggin wrote:
> > > > But you just said that you try to intercept the IO. So the underlying
> > > > data is not necessarily corrupt. And even if it was then what if it
> > > > was reinitialized to something else in the meantime (such as filesystem
> > > > metadata blocks?) You'd just be introducing worse possibilities for
> > > > > corruption.
> > >
> > > The IO interception will be based on PFN instead of file offset, so it
> > > won't affect innocent pages such as your example of reinitialized data.
> >
> > OK, if you could intercept the IO so it never happens at all, yes
> > of course that could work.
> >
> > > poisoned dirty page == corrupt data => process shall be killed
> > > poisoned clean page == recoverable data => process shall survive
> > >
> > > In the case of dirty hwpoison page, if we reload the on disk old data
> > > and let application proceed with it, it may lead to *silent* data
> > > corruption/inconsistency, because the application will first see v2
> > > then v1, which is illogical and hence may mess up its internal data
> > > structure.
> >
> > Right, but how do you prevent that? There is no way to reconstruct the
> > most up-to-date data because it was destroyed.
>
> To kill the application ruthlessly, rather than allow it to go rotten quietly.

Right, but you don't because you just do EIO in a lot of cases. See
EIO subthread.


> > > > You will need to demonstrate a *big* advantage before doing crazy things
> > > > with writeback ;)
> > >
> > > OK. We can do two things about poisoned writeback pages:
> > >
> > > 1) to stop IO for them, thus avoid corrupted data to hit disk and/or
> > > trigger further machine checks
> >
> > 1b) At which point, you invoke the end-io handlers, and the page is
> > no longer writeback.
> >
> > > 2) to isolate them from page cache, thus preventing possible
> > > references in the writeback time window
> >
> > And then this is possible because you aren't violating mm
> > assumptions due to 1b. This proceeds just as the existing
> > pagecache mce error handler case which exists now.
>
> Yeah that's a good scheme - we are talking about two interception
> schemes. Mine is a passive one and yours is an active one.

Oh, hmm, not quite. I had assumed your IO interception is based
on another MCE from DMA transfer (Andi said you get another exception
in that case).

If you are just hoping to get an MCE from CPU access in order to
intercept IO, then you may as well not bother because it is not
closing the window much (very likely that the page will never be
touched again by the CPU).

So if you can get an MCE from the DMA, then you would fail the
request, which will automatically clear writeback, so your CPU MCE
handler never has to bother with writeback pages.
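
I.e. the normal write completion already clears PG_writeback for you;
schematically the fs side of a failed write looks like:

	/* schematic write-side bio completion (what the fs end_io does) */
	static void sketch_end_bio_write(struct bio *bio, int err)
	{
		struct page *page = bio->bi_io_vec[0].bv_page;

		if (err) {
			SetPageError(page);
			mapping_set_error(page->mapping, err);
		}
		end_page_writeback(page);	/* clears PG_writeback, wakes waiters */
		bio_put(bio);
	}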

2009-06-02 15:10:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:

[snip: reusing truncate.c code]

> > With all that writing you could have just done it. It's really
>
> I would have done it if it made sense to me, but so far it hasn't.
>
> The problem with your suggestion is that you do the big picture,
> but seem to skip over a lot of details. But details matter.

BTW. just to answer this point. The reason for this may be
that the default response to my concerns seems to
have been "you're wrong". Not "i don't understand, can you
detail", and not "i don't agree because ...".

I may well be wrong (in this case I'm quite sure I'm not),
but if you say I'm wrong, then I assume that you understand
what I'm talking about and have a fair idea of the details.

Anyway don't worry. I get that a lot. I do really want to
help get this merged.

2009-06-02 15:17:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 09:46:59PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 08:57:13PM +0800, Nick Piggin wrote:
> > Obviously I don't mean just use that single call for the entire
> > handler. You can set the EIO bit or whatever you like. The
> > "error handling" you have there also seems strange. You could
> > retain it, but the page is assured to be removed from pagecache.
>
> You mean this?
>
> if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> return FAILED;
>
> If page->private cannot be removed, that means some fs may start IO on it, so
> we return FAILED.

Hmm, if you're handling buffercache here then possibly yes.
But if you throw out dirty buffer cache then you're probably
corrupting your filesystem just as bad (or even worse than
a couple of bits flipped). Just seems ad-hoc.

I guess it is best-effort in most places though, and this
doesn't take much effort. But being best effort
means that it is hard for someone who knows exactly what all
the code does to know what your intentions or intended
semantics are in places like this. So short comments would help.

2009-06-02 17:12:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 05:09:52PM +0200, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> > On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
>
> [snip: reusing truncate.c code]
>
> > > With all that writing you could have just done it. It's really
> >
> > I would have done it if it made sense to me, but so far it hasn't.
> >
> > The problem with your suggestion is that you do the big picture,
> > but seem to skip over a lot of details. But details matter.
>
> BTW. just to answer this point. The reason for this may be
> that the default response to my concerns seems to
> have been "you're wrong". Not "i don't understand, can you
> detail", and not "i don't agree because ...".

Sorry, I didn't want to imply you're wrong. I apologize if
it came across that way. I understand you understand this code
very well. I realize the one above came out
a bit flamey, but it wasn't really intended that way.

The disagreement right now seems to be more about how the
code is structured. Typically there's no clear "right" or "wrong"
with these things anyway.

I'll take a look at your suggestion this evening and see
how it comes out.

> Anyway don't worry. I get that a lot. I do really want to
> help get this merged.

I wanted to thank you for your great reviewing work, even if I didn't
agree with everything :) But I think the disagreements were quite
small and only about relatively unimportant things.

-Andi


--
[email protected] -- Speaking for myself only.

2009-06-02 17:20:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

> Hmm, if you're handling buffercache here then possibly yes.

Good question, will check.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-03 06:24:58

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 07:19:15PM +0200, Andi Kleen wrote:
> On Tue, Jun 02, 2009 at 05:09:52PM +0200, Nick Piggin wrote:
> > On Tue, Jun 02, 2009 at 02:47:57PM +0200, Andi Kleen wrote:
> > > On Tue, Jun 02, 2009 at 02:00:42PM +0200, Nick Piggin wrote:
> >
> > [snip: reusing truncate.c code]
> >
> > > > With all that writing you could have just done it. It's really
> > >
> > > I would have done it if it made sense to me, but so far it hasn't.
> > >
> > > The problem with your suggestion is that you do the big picture,
> > > but seem to skip over a lot of details. But details matter.
> >
> > BTW. just to answer this point. The reason for this may be
> > that the default response to my concerns seems to
> > have been "you're wrong". Not "i don't understand, can you
> > detail", and not "i don't agree because ...".
>
> Sorry, I didn't want to imply you're wrong. I apologize if
> it came across that way. I understand you understand this code
> very well. I realize the one above came out
> a bit flamey, but it wasn't really intended that way.

Ah it's OK :) Actually that was going too far; most of the time
you gave constructive responses. Just one or two
sticking points, but probably I was getting carried away
as well. Nothing personal of course!


> I'll take a look at your suggestion this evening and see
> how it comes out.

Cool.


> > Anyway don't worry. I get that a lot. I do really want to
> > help get this merged.
>
> I wanted to thank you for your great reviewing work, even if I didn't
> agree with everything :) But I think the disagreements were quite
> small and only about relatively unimportant things.

Yes, I see nothing fundamentally wrong with the design...

Thanks,
Nick

2009-06-03 09:36:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 07:27:15PM +0200, Andi Kleen wrote:
> > Hmm, if you're handling buffercache here then possibly yes.
>
> Good question, will check.

BTW. now that I think about it, buffercache is probably not a good
idea to truncate (truncate, as in: remove from pagecache), because
filesystems can assume that, with just a reference on the page,
it will not be truncated.

This code will cause ext2 (the first one I looked at) to
oops.

And this is not predicated on PagePrivate or page_has_buffers,
because filesystems are free to directly operate on their own
metadata buffercache pages.

So I think it would be a good idea to exclude buffercache from
here completely until it can be shown to be safe. Actually you
*can* use the invalidate_mapping_pages path, which will check
refcounts etc (or a derivative thereof, similarly to my truncate
patch).
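
The refcount-checking core of that path boils down to roughly this
(cf. the invalidate helpers in mm/truncate.c):

	if (page->mapping != mapping)
		return 0;	/* lost a truncate race, give up */
	if (page_has_private(page) && !try_to_release_page(page, 0))
		return 0;	/* buffers still in use */
	if (!remove_mapping(mapping, page))
		return 0;	/* refcount not as expected, someone else holds the page */
	return 1;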

2009-06-03 10:21:57

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02 2009, Wu Fengguang wrote:
> > And then this is possible because you aren't violating mm
> > assumptions due to 1b. This proceeds just as the existing
> > pagecache mce error handler case which exists now.
>
> Yeah that's a good scheme - we are talking about two interception
> schemes. Mine is a passive one and yours is an active one.
>
> passive: check hwpoison pages at __generic_make_request()/elv_next_request()
> (the code will be enabled by an mce_bad_io_pages counter)

That's not a feasible approach at all, it'll add an O(N) scan of the bio at
queue time. Ditto for the elv_next_request() approach.

What would be cheaper is to check the pages at dma map time, since you
have to scan the request anyway. That means putting it in
blk_rq_map_sg() or similar.
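
Something like the below, purely to show where the check would sit
(how the request then gets failed is a separate question):

	/* illustrative: scan a request's segments for poisoned pages */
	static int rq_has_hwpoison(struct request *rq)
	{
		struct req_iterator iter;
		struct bio_vec *bvec;

		rq_for_each_segment(bvec, rq, iter)
			if (unlikely(PageHWPoison(bvec->bv_page)))
				return 1;
		return 0;
	}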

--
Jens Axboe

2009-06-03 11:17:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Wed, Jun 03, 2009 at 11:35:46AM +0200, Nick Piggin wrote:
> On Tue, Jun 02, 2009 at 07:27:15PM +0200, Andi Kleen wrote:
> > > Hmm, if you're handling buffercache here then possibly yes.
> >
> > Good question, will check.
>
> BTW. now that I think about it, buffercache is probably not a good
> idea to truncate (truncate, as in: remove from pagecache), because
> filesystems can assume that, with just a reference on the page,
> it will not be truncated.

Yes I understand. Need to check for this, but I'm not sure
how we can reliably detect it based on the struct page alone. I guess we have
to look at the mapping.

> So I think it would be a good idea to exclude buffercache from
> here completely until it can be shown to be safe. Actually you

Agreed. Just need to figure out how.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-03 15:51:52

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Tue, Jun 02, 2009 at 02:32:25AM +0800, Andi Kleen wrote:
[snip]
> > > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > > in the known good data from disk.
> > > >
> > > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > > swapcache anyway?
> > >
> > > The ClearPageUptodate() is kind of a careless addition, in the hope
> > > that it will stop some random readers. Need more investigations.
> >
> > OK. But it just muddies the waters in the meantime, so maybe take
> > such things out until there is a case for them.
>
> It's gone

Andi, I'd recommend re-adding ClearPageUptodate() for dirty swap cache
pages. It will then make shmem_getpage() return EIO for
- shmem_fault() => kill app with VM_FAULT_SIGBUS
- shmem_readpage() => fail splice()/sendfile() etc.
- shmem_write_begin() => fail splice()/sendfile() etc.
which is exactly what we wanted. Note that the EIO here is permanent.
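
Concretely it is just this one line back in the dirty swap cache
handler (the rest of the handler unchanged):

	ClearPageUptodate(p);	/* a later shmem_getpage() sees !PageUptodate => -EIO */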

I'll continue to do some experiments on its normal read/write behaviors.

Thanks,
Fengguang

2009-06-03 15:58:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

On Wed, Jun 03, 2009 at 11:51:33PM +0800, Wu Fengguang wrote:
> On Tue, Jun 02, 2009 at 02:32:25AM +0800, Andi Kleen wrote:
> [snip]
> > > > > > Clean swap cache pages can be directly isolated. A later page fault will bring
> > > > > > in the known good data from disk.
> > > > >
> > > > > OK, but why do you ClearPageUptodate if it is just to be deleted from
> > > > > swapcache anyway?
> > > >
> > > > The ClearPageUptodate() is kind of a careless addition, in the hope
> > > > that it will stop some random readers. Need more investigations.
> > >
> > > OK. But it just muddies the waters in the meantime, so maybe take
> > > such things out until there is a case for them.
> >
> > It's gone
>
> Andi, I'd recommend re-adding ClearPageUptodate() for dirty swap cache
> pages. It will then make shmem_getpage() return EIO for
> - shmem_fault() => kill app with VM_FAULT_SIGBUS
> - shmem_readpage() => fail splice()/sendfile() etc.
> - shmem_write_begin() => fail splice()/sendfile() etc.
> which is exactly what we wanted. Note that the EIO here is permanent.

Done.

-Andi
--
[email protected] -- Speaking for myself only.