LinuxLists.cc - [PATCH 00/22] HWPOISON: Intro (v5)

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 11:18:18AM +0800, Balbir Singh wrote:
> Wu Fengguang wrote:
> > Hi all,
> >
> > Comments are warmly welcome on the newly introduced uevent code :)
> >
> > I hope we can reach consensus in this round and then be able to post
> > a final version for .31 inclusion.
>
> Isn't that too aggressive? .31 is already in the merge window.

Yes, a bit aggressive. This is a new feature that involves complex logics.
However it is basically a no-op when there are no memory errors,
and when memory corruption does occur, it's better to (possibly) panic
in this code than to panic unconditionally in the absence of this
feature (as said by Rik).

So IMHO it's OK for .31 as long as we agree on the user interfaces,
ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.

It comes a long way through numerous reviews, and I believe all the
important issues and concerns have been addressed. Nick, Rik, Hugh,
Ingo, ... what are your opinions? Is the uevent good enough to meet
your request to "die hard" or "die gracefully" or whatever on memory
failure events?

Thanks,
Fengguang

2009-06-15 06:44:58

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 12:27:53PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 11:18:18AM +0800, Balbir Singh wrote:
> > Wu Fengguang wrote:
> > > Hi all,
> > >
> > > Comments are warmly welcome on the newly introduced uevent code :)
> > >
> > > I hope we can reach consensus in this round and then be able to post
> > > a final version for .31 inclusion.
> >
> > Isn't that too aggressive? .31 is already in the merge window.
>
> Yes, a bit aggressive. This is a new feature that involves complex logics.
> However it is basically a no-op when there are no memory errors,
> and when memory corruption does occur, it's better to (possibly) panic
> in this code than to panic unconditionally in the absence of this
> feature (as said by Rik).
>
> So IMHO it's OK for .31 as long as we agree on the user interfaces,
> ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
>
> It comes a long way through numerous reviews, and I believe all the
> important issues and concerns have been addressed. Nick, Rik, Hugh,
> Ingo, ... what are your opinions? Is the uevent good enough to meet
> your request to "die hard" or "die gracefully" or whatever on memory
> failure events?

Uevent? As in, send a message to userspace? I don't think this
would be ideal for a fail-stop/failover situation.

I can't see a good reason to rush to merge it.

IMO the userspace-visible changes have maybe not been considered
too thoroughly, which is what I'd be most worried about. I probably
missed seeing documentation of exact semantics and situations
where admins should tune things one way or the other.

Did we verify with filesystem maintainers (eg. btrfs) that the
!ISREG test will be enough to prevent oopses?

I hope it is going to be merged with an easy-to-use fault injector,
because that is the only way Joe kernel developer is ever going to
test it.

2009-06-15 07:01:16

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> >
> > So IMHO it's OK for .31 as long as we agree on the user interfaces,
> > ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> >
> > It comes a long way through numerous reviews, and I believe all the
> > important issues and concerns have been addressed. Nick, Rik, Hugh,
> > Ingo, ... what are your opinions? Is the uevent good enough to meet
> > your request to "die hard" or "die gracefully" or whatever on memory
> > failure events?
>
> Uevent? As in, send a message to userspace? I don't think this
> would be ideal for a fail-stop/failover situation.

Agreed.

For failover you typically want a application level heartbeat anyways
to guard against user space software problems and if there's a kill then it
would catch it. Also again in you want to check against all corruptions you
have to do it in the low level handler or better watch corrected
events too to predict failures (but the later is quite hard to do generally).
To some extent the first is already implemented on x86, e.g. set
the tolerance level to 0 will give more aggressive panics.

> I can't see a good reason to rush to merge it.

The low level x86 code for MCA recovery is in, just this high level
part is missing to kill the correct process. I think it would be good to merge
a core now. The basic code seems to be also as well tested as we can do it
right now and exposing it to more users would be good. It's undoubtedly not
perfect yet, but that's not a requirement for merge.

There's a lot of fancy stuff that could be done in addition,
but that's not really needed right now and for a lot of the fancy
ideas (I have enough on my own :) it's dubious they are actually
worth it.

> IMO the userspace-visible changes have maybe not been considered
> too thoroughly, which is what I'd be most worried about. I probably
> missed seeing documentation of exact semantics and situations
> where admins should tune things one way or the other.

There's only a single tunable anyways, early kill vs late kill.

For KVM you need early kill, for the others it remains to be seen.

> I hope it is going to be merged with an easy-to-use fault injector,
> because that is the only way Joe kernel developer is ever going to
> test it.

See patches 13 and 14. In addition there's another low level x86
injector too.

There's also a test suite available (mce-test on kernel.org git)

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-15 07:19:20

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 09:09:14AM +0200, Andi Kleen wrote:
> On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > >
> > > So IMHO it's OK for .31 as long as we agree on the user interfaces,
> > > ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> > >
> > > It comes a long way through numerous reviews, and I believe all the
> > > important issues and concerns have been addressed. Nick, Rik, Hugh,
> > > Ingo, ... what are your opinions? Is the uevent good enough to meet
> > > your request to "die hard" or "die gracefully" or whatever on memory
> > > failure events?
> >
> > Uevent? As in, send a message to userspace? I don't think this
> > would be ideal for a fail-stop/failover situation.
>
> Agreed.
>
> For failover you typically want a application level heartbeat anyways
> to guard against user space software problems and if there's a kill then it
> would catch it. Also again in you want to check against all corruptions you
> have to do it in the low level handler or better watch corrected
> events too to predict failures (but the later is quite hard to do generally).
> To some extent the first is already implemented on x86, e.g. set
> the tolerance level to 0 will give more aggressive panics.
>
> > I can't see a good reason to rush to merge it.
>
> The low level x86 code for MCA recovery is in, just this high level
> part is missing to kill the correct process. I think it would be good to merge
> a core now. The basic code seems to be also as well tested as we can do it
> right now and exposing it to more users would be good. It's undoubtedly not
> perfect yet, but that's not a requirement for merge.
>
> There's a lot of fancy stuff that could be done in addition,
> but that's not really needed right now and for a lot of the fancy
> ideas (I have enough on my own :) it's dubious they are actually
> worth it.

Just my opinion. Normally it takes a lot longer for VM patches
like this to go through, but it's not really up to me anyway.
If Andrew or Linus has it in their head to merge it in 2.6.31,
it's going to get merged ;)

> > IMO the userspace-visible changes have maybe not been considered
> > too thoroughly, which is what I'd be most worried about. I probably
> > missed seeing documentation of exact semantics and situations
> > where admins should tune things one way or the other.
>
> There's only a single tunable anyways, early kill vs late kill.
>
> For KVM you need early kill, for the others it remains to be seen.

Right. It's almost like you need to do a per-process thing, and
those that can handle things (such as the new SIGBUS or the new
EIO) could get those, and others could be killed.

Early-kill for KVM does seem like reasonable justification on the
surface, but when I think more about it, I wonder does the guest
actually stand any better chance to correct the error if it is
reported at time T rather than T+delta? (who knows what the page
will be used for at any given time).

> > I hope it is going to be merged with an easy-to-use fault injector,
> > because that is the only way Joe kernel developer is ever going to
> > test it.
>
> See patches 13 and 14. In addition there's another low level x86
> injector too.
>
> There's also a test suite available (mce-test on kernel.org git)

OK, I probably saw some "FOR TETSING" things and wrongly assumed the
injector was not supposed to be merged.

Thanks,
Nick

2009-06-15 08:15:15

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> Did we verify with filesystem maintainers (eg. btrfs) that the
> !ISREG test will be enough to prevent oopses?

BTW. this is quite a significant change I think and not
really documented well enough. Previously a filesystem
will know exactly when and why pagecache in a mapping
under its control will be truncated (as opposed to
invalidated).

They even have opportunity to hold locks such as i_mutex.

And depending on what they do, they could do interesting
things even with ISREG files.

So, I really think this needs review by filesystem
maintainers and it would be far safer to use invalidate
until it is known to be safe.

2009-06-15 10:29:09

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 04:14:53PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > Did we verify with filesystem maintainers (eg. btrfs) that the
> > !ISREG test will be enough to prevent oopses?
>
> BTW. this is quite a significant change I think and not
> really documented well enough. Previously a filesystem
> will know exactly when and why pagecache in a mapping
> under its control will be truncated (as opposed to
> invalidated).
>
> They even have opportunity to hold locks such as i_mutex.
>
> And depending on what they do, they could do interesting
> things even with ISREG files.
>
> So, I really think this needs review by filesystem
> maintainers and it would be far safer to use invalidate
> until it is known to be safe.

Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
Do you mean to do invalidate_complete_page() for all inodes for now?
That's a good suggestion, it shall be able to do the job for most
pages indeed.

To make things look clear, here is the complete memory-failure.c
without the uevent bits.

Thanks,
Fengguang
---

/*
* Copyright (C) 2008, 2009 Intel Corporation
* Authors: Andi Kleen, Fengguang Wu
*
* This software may be redistributed and/or modified under the terms of
* the GNU General Public License ("GPL") version 2 only as published by the
* Free Software Foundation.
*
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focuses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* The operation to map back from RMAP chains to processes has to walk
* the complete process list and has non linear complexity with the number
* mappings. In short it can be quite slow. But since memory corruptions
* are rare we hope to get away with this.
*/

/*
* Notebook:
* - hugetlb needs more code
* - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
* - pass bad pages to kdump next kernel
*/
#define DEBUG 1
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/sched.h>
#include <linux/rmap.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/backing-dev.h>
#include "internal.h"

int sysctl_memory_failure_early_kill __read_mostly = 1;

atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);

/*
* Send all the processes who have the page mapped an ``action optional''
* signal.
*/
static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
unsigned long pfn)
{
struct siginfo si;
int ret;

printk(KERN_ERR
"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
pfn, t->comm, t->pid);
si.si_signo = SIGBUS;
si.si_errno = 0;
si.si_code = BUS_MCEERR_AO;
si.si_addr = (void *)addr;
#ifdef __ARCH_SI_TRAPNO
si.si_trapno = trapno;
#endif
si.si_addr_lsb = PAGE_SHIFT;
/*
* Don't use force here, it's convenient if the signal
* can be temporarily blocked.
* This could cause a loop when the user sets SIGBUS
* to SIG_IGN, but hopefully noone will do that?
*/
ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
if (ret < 0)
printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
t->comm, t->pid, ret);
return ret;
}

/*
* Kill all processes that have a poisoned page mapped and then isolate
* the page.
*
* General strategy:
* Find all processes having the page mapped and kill them.
* But we keep a page reference around so that the page is not
* actually freed yet.
* Then stash the page away
*
* There's no convenient way to get back to mapped processes
* from the VMAs. So do a brute-force search over all
* running processes.
*
* Remember that machine checks are not common (or rather
* if they are common you have other problems), so this shouldn't
* be a performance issue.
*
* Also there are some races possible while we get from the
* error detection to actually handle it.
*/

struct to_kill {
struct list_head nd;
struct task_struct *tsk;
unsigned long addr;
unsigned addr_valid:1;
};

/*
* Failure handling: if we can't find or can't kill a process there's
* not much we can do. We just print a message and ignore otherwise.
*/

/*
* Schedule a process for later kill.
*/
static void add_to_kill(struct task_struct *tsk, struct page *p,
struct vm_area_struct *vma,
struct list_head *to_kill,
struct to_kill **tkc)
{
struct to_kill *tk;

if (*tkc) {
tk = *tkc;
*tkc = NULL;
} else {
tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
if (!tk) {
printk(KERN_ERR
"MCE: Out of memory while machine check handling\n");
return;
}
}
tk->addr = page_address_in_vma(p, vma);
tk->addr_valid = 1;

/*
* In theory we don't have to kill when the page was
* munmaped. But it could be also a mremap. Since that's
* likely very rare kill anyways just out of paranoia, but use
* a SIGKILL because the error is not contained anymore.
*/
if (tk->addr == -EFAULT) {
pr_debug("MCE: Unable to find user space address %lx in %s\n",
page_to_pfn(p), tsk->comm);
tk->addr_valid = 0;
}
get_task_struct(tsk);
tk->tsk = tsk;
list_add_tail(&tk->nd, to_kill);
}

/*
* Kill the processes that have been collected earlier.
*
* Only do anything when DOIT is set, otherwise just free the list
* (this is used for clean pages which do not need killing)
* Also when FAIL is set do a force kill because something went
* wrong earlier.
*/
static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
int fail, unsigned long pfn)
{
struct to_kill *tk, *next;

list_for_each_entry_safe (tk, next, to_kill, nd) {
if (doit) {
/*
* In case something went wrong with munmaping
* make sure the process doesn't catch the
* signal and then access the memory. Just kill it.
* the signal handlers
*/
if (fail || tk->addr_valid == 0) {
printk(KERN_ERR
"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
pfn, tk->tsk->comm, tk->tsk->pid);
force_sig(SIGKILL, tk->tsk);
}

/*
* In theory the process could have mapped
* something else on the address in-between. We could
* check for that, but we need to tell the
* process anyways.
*/
else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
pfn) < 0)
printk(KERN_ERR
"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
pfn, tk->tsk->comm, tk->tsk->pid);
}
put_task_struct(tk->tsk);
kfree(tk);
}
}

/*
* Collect processes when the error hit an anonymous page.
*/
static void collect_procs_anon(struct page *page, struct list_head *to_kill,
struct to_kill **tkc)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
struct anon_vma *av;

read_lock(&tasklist_lock);

av = page_lock_anon_vma(page);
if (av == NULL) /* Not actually mapped anymore */
goto out;

for_each_process (tsk) {
if (!tsk->mm)
continue;
list_for_each_entry (vma, &av->head, anon_vma_node) {
if (!page_mapped_in_vma(page, vma))
continue;

if (vma->vm_mm == tsk->mm)
add_to_kill(tsk, page, vma, to_kill, tkc);
}
}
page_unlock_anon_vma(av);
out:
read_unlock(&tasklist_lock);
}

/*
* Collect processes when the error hit a file mapped page.
*/
static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct to_kill **tkc)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
struct prio_tree_iter iter;
struct address_space *mapping = page->mapping;

/*
* A note on the locking order between the two locks.
* We don't rely on this particular order.
* If you have some other code that needs a different order
* feel free to switch them around. Or add a reverse link
* from mm_struct to task_struct, then this could be all
* done without taking tasklist_lock and looping over all tasks.
*/

read_lock(&tasklist_lock);
spin_lock(&mapping->i_mmap_lock);
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

if (!tsk->mm)
continue;

vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
pgoff)
if (vma->vm_mm == tsk->mm)
add_to_kill(tsk, page, vma, to_kill, tkc);
}
spin_unlock(&mapping->i_mmap_lock);
read_unlock(&tasklist_lock);
}

/*
* Collect the processes who have the corrupted page mapped to kill.
*/
static void collect_procs(struct page *page, struct list_head *tokill)
{
struct to_kill *tk;

/*
* First preallocate one to_kill structure outside the spin locks,
* so that we can kill at least one process reasonably reliable.
*/
tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);

if (PageAnon(page))
collect_procs_anon(page, tokill, &tk);
else
collect_procs_file(page, tokill, &tk);
kfree(tk);
}

/*
* Error handlers for various types of pages.
*/

enum outcome {
FAILED, /* Error handling failed */
DELAYED, /* Will be handled later */
IGNORED, /* Error safely ignored */
RECOVERED, /* Successfully recovered */
};

static const char *action_name[] = {
[FAILED] = "Failed",
[DELAYED] = "Delayed",
[IGNORED] = "Ignored",
[RECOVERED] = "Recovered",
};

/*
* Error hit kernel page.
* Do nothing, try to be lucky and not touch this instead. For a few cases we
* could be more sophisticated.
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
return DELAYED;
}

/*
* Already poisoned page.
*/
static int me_ignore(struct page *p, unsigned long pfn)
{
return IGNORED;
}

/*
* Page in unknown state. Do nothing.
*/
static int me_unknown(struct page *p, unsigned long pfn)
{
printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
return FAILED;
}

/*
* Free memory
*/
static int me_free(struct page *p, unsigned long pfn)
{
return DELAYED;
}

/*
* Clean (or cleaned) page cache page.
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
struct address_space *mapping;

if (!isolate_lru_page(p))
page_cache_release(p);

mapping = page_mapping(p);
if (mapping == NULL)
return RECOVERED;

/*
* Now truncate the page in the page cache. This is really
* more like a "temporary hole punch"
* Don't do this for block devices when someone else
* has a reference, because it could be file system metadata
* and that's not safe to truncate.
*/
if (!S_ISREG(mapping->host->i_mode) &&
!invalidate_complete_page(mapping, p)) {
printk(KERN_ERR
"MCE %#lx: failed to invalidate metadata page\n",
pfn);
return FAILED;
}

truncate_inode_page(mapping, p);
if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
pfn);
return FAILED;
}
return RECOVERED;
}

/*
* Dirty cache page page
* Issues: when the error hit a hole page the error is not properly
* propagated.
*/
static int me_pagecache_dirty(struct page *p, unsigned long pfn)
{
struct address_space *mapping = page_mapping(p);

SetPageError(p);
/* TBD: print more information about the file. */
if (mapping) {
/*
* IO error will be reported by write(), fsync(), etc.
* who check the mapping.
* This way the application knows that something went
* wrong with its dirty file data.
*
* There's one open issue:
*
* The EIO will be only reported on the next IO
* operation and then cleared through the IO map.
* Normally Linux has two mechanisms to pass IO error
* first through the AS_EIO flag in the address space
* and then through the PageError flag in the page.
* Since we drop pages on memory failure handling the
* only mechanism open to use is through AS_AIO.
*
* This has the disadvantage that it gets cleared on
* the first operation that returns an error, while
* the PageError bit is more sticky and only cleared
* when the page is reread or dropped. If an
* application assumes it will always get error on
* fsync, but does other operations on the fd before
* and the page is dropped inbetween then the error
* will not be properly reported.
*
* This can already happen even without hwpoisoned
* pages: first on metadata IO errors (which only
* report through AS_EIO) or when the page is dropped
* at the wrong time.
*
* So right now we assume that the application DTRT on
* the first EIO, but we're not worse than other parts
* of the kernel.
*/
mapping_set_error(mapping, EIO);
}

return me_pagecache_clean(p, pfn);
}

/*
* Clean and dirty swap cache.
*
* Dirty swap cache page is tricky to handle. The page could live both in page
* cache and swap cache(ie. page is freshly swapped in). So it could be
* referenced concurrently by 2 types of PTEs:
* normal PTEs and swap PTEs. We try to handle them consistently by calling u
* try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
* and then
* - clear dirty bit to prevent IO
* - remove from LRU
* - but keep in the swap cache, so that when we return to it on
* a later page fault, we know the application is accessing
* corrupted data and shall be killed (we installed simple
* interception code in do_swap_page to catch it).
*
* Clean swap cache pages can be directly isolated. A later page fault will
* bring in the known good data from disk.
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);

if (!isolate_lru_page(p))
page_cache_release(p);

return DELAYED;
}

static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
if (!isolate_lru_page(p))
page_cache_release(p);

delete_from_swap_cache(p);

return RECOVERED;
}

/*
* Huge pages. Needs work.
* Issues:
* No rmap support so we cannot find the original mapper. In theory could walk
* all MMs and look for the mappings, but that would be non atomic and racy.
* Need rmap for hugepages for this. Alternatively we could employ a heuristic,
* like just walking the current process and hoping it has it mapped (that
* should be usually true for the common "shared database cache" case)
* Should handle free huge pages and dequeue them too, but this needs to
* handle huge page accounting correctly.
*/
static int me_huge_page(struct page *p, unsigned long pfn)
{
return FAILED;
}

/*
* Various page states we can handle.
*
* A page state is defined by its current page->flags bits.
* The table matches them in order and calls the right handler.
*
* This is quite tricky because we can access page at any time
* in its live cycle, so all accesses have to be extremly careful.
*
* This is not complete. More states could be added.
* For any missing state don't attempt recovery.
*/

#define dirty (1UL << PG_dirty)
#define sc (1UL << PG_swapcache)
#define unevict (1UL << PG_unevictable)
#define mlock (1UL << PG_mlocked)
#define writeback (1UL << PG_writeback)
#define lru (1UL << PG_lru)
#define swapbacked (1UL << PG_swapbacked)
#define head (1UL << PG_head)
#define tail (1UL << PG_tail)
#define compound (1UL << PG_compound)
#define slab (1UL << PG_slab)
#define buddy (1UL << PG_buddy)
#define reserved (1UL << PG_reserved)

static struct page_state {
unsigned long mask;
unsigned long res;
char *msg;
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, "reserved kernel", me_ignore },
{ buddy, buddy, "free kernel", me_free },

/*
* Could in theory check if slab page is free or if we can drop
* currently unused objects without touching them. But just
* treat it as standard kernel for now.
*/
{ slab, slab, "kernel slab", me_kernel },

#ifdef CONFIG_PAGEFLAGS_EXTENDED
{ head, head, "huge", me_huge_page },
{ tail, tail, "huge", me_huge_page },
#else
{ compound, compound, "huge", me_huge_page },
#endif

{ sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty },
{ sc|dirty, sc, "swapcache", me_swapcache_clean },

#ifdef CONFIG_UNEVICTABLE_LRU
{ unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty},
{ unevict, unevict, "unevictable LRU", me_pagecache_clean},
#endif

#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
{ mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty },
{ mlock, mlock, "mlocked LRU", me_pagecache_clean },
#endif

{ lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
{ lru|dirty, lru, "clean LRU", me_pagecache_clean },
{ swapbacked, swapbacked, "anonymous", me_pagecache_clean },

/*
* Catchall entry: must be at end.
*/
{ 0, 0, "unknown page state", me_unknown },
};

static void action_result(unsigned long pfn, char *msg, int result)
{
printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
msg, action_name[result]);
}

static void page_action(struct page_state *ps, struct page *p,
unsigned long pfn)
{
int result;

result = ps->action(p, pfn);
action_result(pfn, ps->msg, result);
if (page_count(p) != 1)
printk(KERN_ERR
"MCE %#lx: %s page still referenced by %d users\n",
pfn, ps->msg, page_count(p) - 1);

/* Could do more checks here if page looks ok */
atomic_long_add(1, &mce_bad_pages);

/*
* Could adjust zone counters here to correct for the missing page.
*/
}

#define N_UNMAP_TRIES 5

/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
*/
static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
int trapno)
{
enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
int kill = sysctl_memory_failure_early_kill;
struct address_space *mapping;
LIST_HEAD(tokill);
int ret;
int i;

if (PageReserved(p) || PageCompound(p) || PageSlab(p))
return;

if (!PageLRU(p))
lru_add_drain();

/*
* This check implies we don't kill processes if their pages
* are in the swap cache early. Those are always late kills.
*/
if (!page_mapped(p))
return;

if (PageSwapCache(p)) {
printk(KERN_ERR
"MCE %#lx: keeping poisoned page in swap cache\n", pfn);
ttu |= TTU_IGNORE_HWPOISON;
}

/*
* Propagate the dirty bit from PTEs to struct page first, because we
* need this to decide if we should kill or just drop the page.
*/
mapping = page_mapping(p);
if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
if (page_mkclean(p))
SetPageDirty(p);
else {
kill = 0;
ttu |= TTU_IGNORE_HWPOISON;
printk(KERN_INFO
"MCE %#lx: corrupted page was clean: dropped without side effects\n",
pfn);
}
}

/*
* First collect all the processes that have the page
* mapped. This has to be done before try_to_unmap,
* because ttu takes the rmap data structures down.
*
* This also has the side effect to propagate the dirty
* bit from PTEs into the struct page. This is needed
* to actually decide if something needs to be killed
* or errored, or if it's ok to just drop the page.
*
* Error handling: We ignore errors here because
* there's nothing that can be done.
*/
if (kill && p->mapping)
collect_procs(p, &tokill);

/*
* try_to_unmap can fail temporarily due to races.
* Try a few times (RED-PEN better strategy?)
*/
for (i = 0; i < N_UNMAP_TRIES; i++) {
ret = try_to_unmap(p, ttu);
if (ret == SWAP_SUCCESS)
break;
pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
}

if (ret != SWAP_SUCCESS)
printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(p));

/*
* Now that the dirty bit has been propagated to the
* struct page and all unmaps done we can decide if
* killing is needed or not. Only kill when the page
* was dirty, otherwise the tokill list is merely
* freed. When there was a problem unmapping earlier
* use a more force-full uncatchable kill to prevent
* any accesses to the poisoned memory.
*/
kill_procs_ao(&tokill, !!PageDirty(p), trapno,
ret != SWAP_SUCCESS, pfn);
}

/**
* memory_failure - Handle memory failure of a page.
* @pfn: Page Number of the corrupted page
* @trapno: Trap number reported in the signal to user space.
*
* This function is called by the low level machine check code
* of an architecture when it detects hardware memory corruption
* of a page. It tries its best to recover, which includes
* dropping pages, killing processes etc.
*
* The function is primarily of use for corruptions that
* happen outside the current execution context (e.g. when
* detected by a background scrubber)
*
* Must run in process context (e.g. a work queue) with interrupts
* enabled and no spinlocks hold.
*/
void memory_failure(unsigned long pfn, int trapno)
{
struct page_state *ps;
struct page *p;

if (!pfn_valid(pfn)) {
printk(KERN_ERR
"MCE %#lx: memory outside kernel control: Ignored\n",
pfn);
return;
}

p = pfn_to_page(pfn);
if (TestSetPageHWPoison(p)) {
action_result(pfn, "already hardware poisoned", IGNORED);
return;
}

/*
* We need/can do nothing about count=0 pages.
* 1) it's a free page, and therefore in safe hand:
* prep_new_page() will be the gate keeper.
* 2) it's part of a non-compound high order page.
* Implies some kernel user: cannot stop them from
* R/W the page; let's pray that the page has been
* used and will be freed some time later.
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
if (!get_page_unless_zero(compound_head(p))) {
action_result(pfn, "free or high order kernel", IGNORED);
return;
}

/*
* Lock the page and wait for writeback to finish.
* It's very difficult to mess with pages currently under IO
* and in many cases impossible, so we just avoid it here.
*/
lock_page_nosync(p);
wait_on_page_writeback(p);

/*
* Now take care of user space mappings.
*/
hwpoison_user_mappings(p, pfn, trapno);

/*
* Torn down by someone else?
*/
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, "already truncated LRU", IGNORED);
goto out;
}

for (ps = error_states;; ps++) {
if ((p->flags & ps->mask) == ps->res) {
page_action(ps, p, pfn);
break;
}
}
out:
unlock_page(p);
}

2009-06-15 10:36:31

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 06:09:54PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 04:14:53PM +0800, Nick Piggin wrote:
> > On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > > Did we verify with filesystem maintainers (eg. btrfs) that the
> > > !ISREG test will be enough to prevent oopses?
> >
> > BTW. this is quite a significant change I think and not
> > really documented well enough. Previously a filesystem
> > will know exactly when and why pagecache in a mapping
> > under its control will be truncated (as opposed to
> > invalidated).
> >
> > They even have opportunity to hold locks such as i_mutex.
> >
> > And depending on what they do, they could do interesting
> > things even with ISREG files.
> >
> > So, I really think this needs review by filesystem
> > maintainers and it would be far safer to use invalidate
> > until it is known to be safe.
>
> Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
> Do you mean to do invalidate_complete_page() for all inodes for now?

That would make me a lot happier. It is obviously correct because
that is basically what page reclaim and inode reclaim and drop
caches etc does.

Note that I still don't like exporting invalidate_complete_page
fro the same reasons I don't like exporting truncate_complete_page,
so I will ask if you can do an invalidate_inode_page function
along the same lines of the truncate_inode_page one please.

> That's a good suggestion, it shall be able to do the job for most
> pages indeed.

Yes I think it will be far far safer while only introducing
another small class of pages which cannot be recovered (probably
a much smaller set most of the time than the size of the existing
set of pages which cannot be recovered).

2009-06-15 12:10:41

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 09:09:14AM +0200, Andi Kleen wrote:
> > On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > > >
> > > > So IMHO it's OK for .31 as long as we agree on the user interfaces,
> > > > ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
> > > >
> > > > It comes a long way through numerous reviews, and I believe all the
> > > > important issues and concerns have been addressed. Nick, Rik, Hugh,
> > > > Ingo, ... what are your opinions? Is the uevent good enough to meet
> > > > your request to "die hard" or "die gracefully" or whatever on memory
> > > > failure events?
> > >
> > > Uevent? As in, send a message to userspace? I don't think this
> > > would be ideal for a fail-stop/failover situation.
> >
> > Agreed.
> >
> > For failover you typically want a application level heartbeat anyways
> > to guard against user space software problems and if there's a kill then it
> > would catch it. Also again in you want to check against all corruptions you
> > have to do it in the low level handler or better watch corrected
> > events too to predict failures (but the later is quite hard to do generally).
> > To some extent the first is already implemented on x86, e.g. set
> > the tolerance level to 0 will give more aggressive panics.
> >
> > > I can't see a good reason to rush to merge it.
> >
> > The low level x86 code for MCA recovery is in, just this high level
> > part is missing to kill the correct process. I think it would be good to merge
> > a core now. The basic code seems to be also as well tested as we can do it
> > right now and exposing it to more users would be good. It's undoubtedly not
> > perfect yet, but that's not a requirement for merge.
> >
> > There's a lot of fancy stuff that could be done in addition,
> > but that's not really needed right now and for a lot of the fancy
> > ideas (I have enough on my own :) it's dubious they are actually
> > worth it.
>
> Just my opinion. Normally it takes a lot longer for VM patches
> like this to go through, but it's not really up to me anyway.
> If Andrew or Linus has it in their head to merge it in 2.6.31,
> it's going to get merged ;)
>
>
> > > IMO the userspace-visible changes have maybe not been considered
> > > too thoroughly, which is what I'd be most worried about. I probably
> > > missed seeing documentation of exact semantics and situations
> > > where admins should tune things one way or the other.
> >
> > There's only a single tunable anyways, early kill vs late kill.
> >
> > For KVM you need early kill, for the others it remains to be seen.
>
> Right. It's almost like you need to do a per-process thing, and
> those that can handle things (such as the new SIGBUS or the new
> EIO) could get those, and others could be killed.

To send early SIGBUS kills to processes who has called
sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
don't mind they can understand that signal at all.

> Early-kill for KVM does seem like reasonable justification on the
> surface, but when I think more about it, I wonder does the guest
> actually stand any better chance to correct the error if it is
> reported at time T rather than T+delta? (who knows what the page
> will be used for at any given time).

Early kill makes a lot difference for KVM. Think about the vast
amount of clean page cache pages. With early kill the page can be
trivially isolated. With late kill the whole virtual machine dies
hard.

Thanks,
Fengguang

2009-06-15 12:10:53

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 06:36:19PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 06:09:54PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 04:14:53PM +0800, Nick Piggin wrote:
> > > On Mon, Jun 15, 2009 at 08:44:47AM +0200, Nick Piggin wrote:
> > > > Did we verify with filesystem maintainers (eg. btrfs) that the
> > > > !ISREG test will be enough to prevent oopses?
> > >
> > > BTW. this is quite a significant change I think and not
> > > really documented well enough. Previously a filesystem
> > > will know exactly when and why pagecache in a mapping
> > > under its control will be truncated (as opposed to
> > > invalidated).
> > >
> > > They even have opportunity to hold locks such as i_mutex.
> > >
> > > And depending on what they do, they could do interesting
> > > things even with ISREG files.
> > >
> > > So, I really think this needs review by filesystem
> > > maintainers and it would be far safer to use invalidate
> > > until it is known to be safe.
> >
> > Nick, we are doing invalidate_complete_page() for !S_ISREG inodes now.
> > Do you mean to do invalidate_complete_page() for all inodes for now?
>
> That would make me a lot happier. It is obviously correct because
> that is basically what page reclaim and inode reclaim and drop
> caches etc does.
>
> Note that I still don't like exporting invalidate_complete_page
> fro the same reasons I don't like exporting truncate_complete_page,
> so I will ask if you can do an invalidate_inode_page function
> along the same lines of the truncate_inode_page one please.

Sure. I did something radical - don't try to isolate dirty/writeback
pages, to match the exact invalidate_mapping_pages() behavior.

Let's mess with the dirty/writeback pages some time later.

+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+ struct address_space *mapping;
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ mapping = page_mapping(p);
+ if (mapping == NULL)
+ return RECOVERED;
+
+ /*
+ * Now remove it from page cache.
+ * Currently we only remove clean, unused page for the sake of safety.
+ */
+ if (!invalidate_inode_page(mapping, p)) {
+ printk(KERN_ERR
+ "MCE %#lx: failed to remove from page cache\n", pfn);
+ return FAILED;
+ }
+ return RECOVERED;
+}

--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -135,6 +135,21 @@ invalidate_complete_page(struct address_
return ret;
}

+/*
+ * Safely invalidate one page from its pagecache mapping.
+ * It only drops clean, unused pages. The page must be locked.
+ *
+ * Returns 1 if the page is successfully invalidated, otherwise 0.
+ */
+int invalidate_inode_page(struct address_space *mapping, struct page *page)
+{
+ if (PageDirty(page) || PageWriteback(page))
+ return 0;
+ if (page_mapped(page))
+ return 0;
+ return invalidate_complete_page(mapping, page);
+}
+
/**
* truncate_inode_pages - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
@@ -311,12 +326,8 @@ unsigned long invalidate_mapping_pages(s
if (lock_failed)
continue;

- if (PageDirty(page) || PageWriteback(page))
- goto unlock;
- if (page_mapped(page))
- goto unlock;
- ret += invalidate_complete_page(mapping, page);
-unlock:
+ ret += invalidate_inode_page(mapping, page);
+
unlock_page(page);
if (next > end)
break;

>
> > That's a good suggestion, it shall be able to do the job for most
> > pages indeed.
>
> Yes I think it will be far far safer while only introducing
> another small class of pages which cannot be recovered (probably
> a much smaller set most of the time than the size of the existing
> set of pages which cannot be recovered).

Anyway, dirty pages are limited to 15% of the total memory by default.

Thanks,
Fengguang

2009-06-15 12:25:42

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > For KVM you need early kill, for the others it remains to be seen.
> >
> > Right. It's almost like you need to do a per-process thing, and
> > those that can handle things (such as the new SIGBUS or the new
> > EIO) could get those, and others could be killed.
>
> To send early SIGBUS kills to processes who has called
> sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
> don't mind they can understand that signal at all.

For apps that hook into SIGBUS for some other means and
do not understand the new type of SIGBUS signal? What about
those?

> > Early-kill for KVM does seem like reasonable justification on the
> > surface, but when I think more about it, I wonder does the guest
> > actually stand any better chance to correct the error if it is
> > reported at time T rather than T+delta? (who knows what the page
> > will be used for at any given time).
>
> Early kill makes a lot difference for KVM. Think about the vast
> amount of clean page cache pages. With early kill the page can be
> trivially isolated. With late kill the whole virtual machine dies
> hard.

Why? In both cases it will enter the exception handler and
attempt to do something about it... in both cases I would
have thought there is some chance that the page error is not
recoverable and some chance it is recoverable. Or am I
missing something?

Anyway, I would like to see a basic analysis of those probabilities
to justify early kill. Not saying there is no justification, but
it would be helpful to see why.

Thanks,
Nick

2009-06-15 12:52:45

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, 15 Jun 2009, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 11:18:18AM +0800, Balbir Singh wrote:
> > Wu Fengguang wrote:
> > >
> > > I hope we can reach consensus in this round and then be able to post
> > > a final version for .31 inclusion.
> >
> > Isn't that too aggressive? .31 is already in the merge window.
>
> Yes, a bit aggressive. This is a new feature that involves complex logics.
> However it is basically a no-op when there are no memory errors,
> and when memory corruption does occur, it's better to (possibly) panic
> in this code than to panic unconditionally in the absence of this
> feature (as said by Rik).
>
> So IMHO it's OK for .31 as long as we agree on the user interfaces,
> ie. /proc/sys/vm/memory_failure_early_kill and the hwpoison uevent.
>
> It comes a long way through numerous reviews, and I believe all the
> important issues and concerns have been addressed. Nick, Rik, Hugh,
> Ingo, ... what are your opinions?

And for how long has this work been in linux-next or mmotm?

My opinion is that it's way too late for .31 - your only chance
is that Linus sometimes gets bored with playing safe, and decides
to break his rules and try something out regardless - but I'd hope
the bootmem business already sated his appetite for danger this time.

Hugh

2009-06-15 13:01:20

by Alan

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

> My opinion is that it's way too late for .31 - your only chance
> is that Linus sometimes gets bored with playing safe, and decides
> to break his rules and try something out regardless - but I'd hope
> the bootmem business already sated his appetite for danger this time.

I see no consensus on it being worth merging, no testing, no upstream
integration shakedown, no builds on non-x86 boxes, no work with other
arch maintainers who have similar abilities and needs.

It belongs in next for a release or two while people work on it, while
things like PPC64 can plumb into it if they wish and while people work
out if its actually useful given that for most users it reduces the
reliability of the services they are providing rather than improving it.

2009-06-15 13:21:29

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

I think you're wrong about killing processes decreasing
reliability. Traditionally we always tried to keep things running if possible
instead of panicing. That is why ext3 or block does not default to panic
on each IO error for example. Or oops does not panic by default like
on BSDs. Your argumentation would be good for a traditional early Unix
which likes to panic instead of handling errors, but that's not the
Linux way as I know it.

Also BTW in many cases (e.g. a lot of file cache pages) there
is actually no process kill involved; just a reload of the page from
disk.

Then for example in a cluster you typically have a application level
heartbeat on the application and restarting the app is faster
if you don't need to reboot the box too. If you don't have a cluster
with failover then gracefull degradation is the best. In general
panic is a very nasty failure mode and should be avoided.

That said you can configure it anyways to panic if you want,
but it would be a very bad default.

See also Linus' or hpa's statement on the topic.

> no testing,

There's an extensive test suite in mce-test.git

We did a lot of testing with these separate test suites and also
some other tests. For much more it needs actual users pounding on it, and that
can be only adequately done in mainline.

That said the real tests of this can be only done with test suites
really, these errors tend to not happen quickly.

> integration shakedown, no builds on non-x86 boxes, no work with other
> arch maintainers who have similar abilities and needs.

We did build tests on ia64 and power and it was reviewed by Tony for IA64.
The ia64 specific code is not quite ready yet, but will come at some point.

I don't think it's a requirement for merging to have PPC64 support.

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-15 13:30:16

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

Andi Kleen wrote:
>
> That said the real tests of this can be only done with test suites
> really, these errors tend to not happen quickly.
>

What's far more important, quite frankly, is harmlessness regression
testing.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-06-15 14:23:02

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > For KVM you need early kill, for the others it remains to be seen.
> > >
> > > Right. It's almost like you need to do a per-process thing, and
> > > those that can handle things (such as the new SIGBUS or the new
> > > EIO) could get those, and others could be killed.
> >
> > To send early SIGBUS kills to processes who has called
> > sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
> > don't mind they can understand that signal at all.
>
> For apps that hook into SIGBUS for some other means and

Yes I was referring to the sigaction(SIGBUS) apps, others will
be late killed anyway.

> do not understand the new type of SIGBUS signal? What about
> those?

We introduced two new SIGBUS codes:
BUS_MCEERR_AO=5 for early kill
BUS_MCEERR_AR=4 for late kill
I'd assume a legacy application will handle them in the same way (both
are unexpected code to the application).

We don't care whether the application can be killed by BUS_MCEERR_AO
or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
But (in the rare case) if the handler
- refused to die on BUS_MCEERR_AR, it may create a busy loop and
flooding of SIGBUS signals, which is a bug of the application.
BUS_MCEERR_AO is one time and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
it may well hurt the same way on BUS_MCEERR_AR. The latter one is
unavoidable, so the application must be fixed anyway.

>
> > > Early-kill for KVM does seem like reasonable justification on the
> > > surface, but when I think more about it, I wonder does the guest
> > > actually stand any better chance to correct the error if it is
> > > reported at time T rather than T+delta? (who knows what the page
> > > will be used for at any given time).
> >
> > Early kill makes a lot difference for KVM. Think about the vast
> > amount of clean page cache pages. With early kill the page can be
> > trivially isolated. With late kill the whole virtual machine dies
> > hard.
>
> Why? In both cases it will enter the exception handler and
> attempt to do something about it... in both cases I would
> have thought there is some chance that the page error is not
> recoverable and some chance it is recoverable. Or am I
> missing something?

The early kill / late kill to KVM from the POV of host kernel matches
the MCE AO/AR events inside the KVM guest kernel. The key difference
between AO/AR is, whether the page is _being_ consumed.

It's a lot harder (if possible) to try to stop an active consumer.
For example, the clean cache pages can be consumed in many ways:
- be accessed by read()/write() or mapped read/write
- be reclaimed and then allocated for whatever new usage, for example,
be zeroed by __GFP_ZERO, or be insert into another file and start
read/write IO and be accessed by disk driver via DMA, or even be
allocated for kernel slabs..
Frankly speaking I don't know how to stop all the above consumers.
We now simply die on AR events.

> Anyway, I would like to see a basic analysis of those probabilities
> to justify early kill. Not saying there is no justification, but
> it would be helpful to see why.

That's fine. I'd be glad if the above explanation paves way to
solutions for AR events :)

Thanks,
Fengguang

2009-06-15 14:48:59

by Alan

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, 15 Jun 2009 15:29:34 +0200
Andi Kleen <[email protected]> wrote:

>
> I think you're wrong about killing processes decreasing
> reliability. Traditionally we always tried to keep things running if possible
> instead of panicing. That is why ext3 or block does not default to panic
> on each IO error for example. Or oops does not panic by default like
> on BSDs. Your argumentation would be good for a traditional early Unix
> which likes to panic instead of handling errors, but that's not the
> Linux way as I know it.

Everyone I knew in the business end of deploying Linux turned on panics
for I/O errors, reboot on panic and all the rest of those.

Why ? because they don't want a system where the web server is running
but not logging transactions, or to find out the database is up but that
some other "must not fail" layer killed or stalled the backup server for
it last week ...

The I/O ones can really blow up on you in a reliable environment because
often the process still exists but isn't working so fools much of the
monitoring software.

> That said you can configure it anyways to panic if you want,
> but it would be a very bad default.

That depends for whom

> See also Linus' or hpa's statement on the topic.

Linus doesn't run big server systems. Its a really bad default for
developers. Its probably a bad default for desktop users.

> We did a lot of testing with these separate test suites and also
> some other tests. For much more it needs actual users pounding on it, and that
> can be only adequately done in mainline.

Thats why we have -next and -mm

> We did build tests on ia64 and power and it was reviewed by Tony for IA64.
> The ia64 specific code is not quite ready yet, but will come at some point.
>
> I don't think it's a requirement for merging to have PPC64 support.

Really - so if your design is wrong for the way PPC wants to work what
are we going to do ? It's not a requirement that PPC64 support is there
but it is most certainly a requirement that its been in -next a while and
other arch maintainers have at least had time to say "works for me",
"irrelevant to my platform" or "Arghhh noooo.. ECC errors work like
[this] so we need ..."

I'd guess that zSeries has some rather different views on how ECC
failures propogate through the hypervisors for example, including the
fact that a failed page can be unfailed which you don't seem to allow for.

(You can unfail pages on x86 as well it appears by scrubbing them via DMA
- yes ?)

Alan

2009-06-15 15:16:22

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

> Everyone I knew in the business end of deploying Linux turned on panics
> for I/O errors, reboot on panic and all the rest of those.

oops=panic already implies panic on all machine check exceptions, so they will
be fine then (assuming this is the best strategy for availability
for them, which I personally find quite doubtful, but we can discuss this some
other time)

> Really - so if your design is wrong for the way PPC wants to work what
> are we going to do ? It's not a requirement that PPC64 support is there

Then we change the code. Or if it's too difficult don't support their stuff.
After all it's not cast in stone. That said I doubt the PPC requirements will
be much different than what we have.

> I'd guess that zSeries has some rather different views on how ECC
> failures propogate through the hypervisors for example, including the
> fact that a failed page can be unfailed which you don't seem to allow for.

That's correct.

That's because unpoisioning is quite hard -- you need some kind
of synchronization point for all the error handling and that's
the poisoned page and if it unposions itself then you need
some very heavy weight synchronization to avoid handling errors
multiple time. I looked at it, but it's quite messy.

Also it's of somewhat dubious value.

>
> (You can unfail pages on x86 as well it appears by scrubbing them via DMA
> - yes ?)

Not architectually. Also the other problem is not just unpoisoning them,
but finding out if the page is permenantly bad or just temporarily.

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-15 15:28:41

by Alan

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

> oops=panic already implies panic on all machine check exceptions, so they will
> be fine then (assuming this is the best strategy for availability
> for them, which I personally find quite doubtful, but we can discuss this some
> other time)

You can have the argument with all the people who deploy large systems.
Providing their boxes can be persuaded to panic they don't care about the
masses.

> That's because unpoisioning is quite hard -- you need some kind
> of synchronization point for all the error handling and that's
> the poisoned page and if it unposions itself then you need
> some very heavy weight synchronization to avoid handling errors
> multiple time. I looked at it, but it's quite messy.
>
> Also it's of somewhat dubious value.

On a system running under a hypervisor or with hot swappable memory its
of rather higher value. In the hypervisor case the guest system can
acquire a new virtual page to replace the faulty one. In fact the
hypervisor case is even more complex as the guest may get migrated at
which point knowledge of "poisoned" memory is not at all connected to
information on hardware failings.

> >
> > (You can unfail pages on x86 as well it appears by scrubbing them via DMA
> > - yes ?)
>
> Not architectually. Also the other problem is not just unpoisoning them,
> but finding out if the page is permenantly bad or just temporarily.

Small detail you are overlooking: Hot swap mirrorable memory.

Second detail you are overlooking

curse a lot
suspend to disk
remove dirt from fans, clean/replace RAM
resume from disk

The very act of making the ECC error not take out the box creates the
environment whereby the underlying hardware error (if there was one) can
be cured.

In all these cases the fact you've got to shoot stuff because a page has
been lost becomes totally disconnected from the idea that the page is
somehow not recoverable and "contaminated" forever.

Which to me says your model is wrong.

Alan

2009-06-15 16:10:57

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 04:28:04PM +0100, Alan "zSeries" Cox wrote:

> curse a lot
> suspend to disk
> remove dirt from fans, clean/replace RAM
> resume from disk
>
> The very act of making the ECC error not take out the box creates the

Ok so at least you agree now that handling these errors without
panic is the right thing to do. That's at least some progress.

> environment whereby the underlying hardware error (if there was one) can
> be cured.

These ECC errors are still somewhat rare (or rather if they become
common you should definitely service the system). That is why
losing a single page of memory for them isn't a big issue normally.

Sure you could spend effort making unpoisioning work,
but it would seem very dubious to me. After all it's just another
4K of memory for each error.

The only reasonably good use case I heard for unpoisoning was
if you have a lot of huge pages (you can't use a full huge page with one bad
small page), but that's also still relatively exotic.

-Andi

[1] mostly you need a new special form of RCU I think

--
[email protected] -- Speaking for myself only.

2009-06-15 16:28:40

by Alan

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

> > curse a lot
> > suspend to disk
> > remove dirt from fans, clean/replace RAM
> > resume from disk
> >
> > The very act of making the ECC error not take out the box creates the
>
> Ok so at least you agree now that handling these errors without
> panic is the right thing to do. That's at least some progress.

There are some situations it may be useful - possibly. But then if you
can't sort the resulting mess out because your patches are too limited
its not useful yet is it.

Even then this isn't 2.6.31 stuff - its 2.6.31-mm or -next stuff maybe
for 2.6.32

2009-06-15 16:59:26

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

> But then if you
> can't sort the resulting mess out because your patches are too limited
> its not useful yet is it.

With "too limited" you refer to unpoisioning?

Again very slowly:

- If you have a lot of errors you die eventually anyways.
- If you have a very low rate of errors (which is the normal case) you don't
need unpoisioning because the memory lost for each error is miniscule.
- In the case of a hypervisor it's actually not memory lost, but only
guest physical address space, which is plenty on a 64bit system. You can
eventually replace it by readding memory to a guest, but that's unlikely
to be needed.

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-16 19:44:53

by Russ Anderson

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Mon, Jun 15, 2009 at 03:29:34PM +0200, Andi Kleen wrote:
>
> I think you're wrong about killing processes decreasing
> reliability. Traditionally we always tried to keep things running if possible
> instead of panicing.

Customers love the ia64 feature of killing a user process instead of
panicing the system when a user process hits a memory uncorrectable
error. Avoiding a system panic is a very good thing.

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]

2009-06-16 20:37:22

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

Russ Anderson wrote:
> On Mon, Jun 15, 2009 at 03:29:34PM +0200, Andi Kleen wrote:
>> I think you're wrong about killing processes decreasing
>> reliability. Traditionally we always tried to keep things running if possible
>> instead of panicing.
>
> Customers love the ia64 feature of killing a user process instead of
> panicing the system when a user process hits a memory uncorrectable
> error. Avoiding a system panic is a very good thing.

Sometimes (sometimes it's a very bad thing.)

However, the more fundamental thing is that it is always trivial to
promote an error to a higher severity; the opposite is not true. As
such, it becomes an administrator-set policy, which is what it needs to be.

-hpa

2009-06-16 20:55:17

by Russ Anderson

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

On Tue, Jun 16, 2009 at 01:28:54PM -0700, H. Peter Anvin wrote:
> Russ Anderson wrote:
> > On Mon, Jun 15, 2009 at 03:29:34PM +0200, Andi Kleen wrote:
> >> I think you're wrong about killing processes decreasing
> >> reliability. Traditionally we always tried to keep things running if possible
> >> instead of panicing.
> >
> > Customers love the ia64 feature of killing a user process instead of
> > panicing the system when a user process hits a memory uncorrectable
> > error. Avoiding a system panic is a very good thing.
>
> Sometimes (sometimes it's a very bad thing.)
>
> However, the more fundamental thing is that it is always trivial to
> promote an error to a higher severity; the opposite is not true. As
> such, it becomes an administrator-set policy, which is what it needs to be.

Good point. On ia64 the recovery code is implemented as a kernel
loadable module. Installing the module turns on the feature.

That is handy for customer demos. Install the module, inject a
memory error, have an application read the bad data and get killed.
Repeat a few times. Then uninstall the module, inject a
memory error, have an application read the bad data and watch
the system panic.

Then it is the customer's choice to have it on or off.

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]

2009-06-16 21:06:01

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [PATCH 00/22] HWPOISON: Intro (v5)

Russ Anderson wrote:
>>
>> However, the more fundamental thing is that it is always trivial to
>> promote an error to a higher severity; the opposite is not true. As
>> such, it becomes an administrator-set policy, which is what it needs to be.
>
> Good point. On ia64 the recovery code is implemented as a kernel
> loadable module. Installing the module turns on the feature.
>
> That is handy for customer demos. Install the module, inject a
> memory error, have an application read the bad data and get killed.
> Repeat a few times. Then uninstall the module, inject a
> memory error, have an application read the bad data and watch
> the system panic.
>
> Then it is the customer's choice to have it on or off.
>

There are a number of ways to set escalation policy. Modules isn't
necessarily the best, but it doesn't really matter what the exact
mechanism is.

-hpa

2009-06-17 06:37:23

[permalink] [raw]

Subject: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler

On Mon, Jun 15, 2009 at 10:22:25PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> > On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > > For KVM you need early kill, for the others it remains to be seen.
> > > >
> > > > Right. It's almost like you need to do a per-process thing, and
> > > > those that can handle things (such as the new SIGBUS or the new
> > > > EIO) could get those, and others could be killed.
> > >
> > > To send early SIGBUS kills to processes who has called
> > > sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
> > > don't mind they can understand that signal at all.
> >
> > For apps that hook into SIGBUS for some other means and
>
> Yes I was referring to the sigaction(SIGBUS) apps, others will
> be late killed anyway.
>
> > do not understand the new type of SIGBUS signal? What about
> > those?
>
> We introduced two new SIGBUS codes:
> BUS_MCEERR_AO=5 for early kill
> BUS_MCEERR_AR=4 for late kill
> I'd assume a legacy application will handle them in the same way (both
> are unexpected code to the application).
>
> We don't care whether the application can be killed by BUS_MCEERR_AO
> or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
> But (in the rare case) if the handler
> - refused to die on BUS_MCEERR_AR, it may create a busy loop and
> flooding of SIGBUS signals, which is a bug of the application.
> BUS_MCEERR_AO is one time and won't lead to busy loops.
> - does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
> it may well hurt the same way on BUS_MCEERR_AR. The latter one is
> unavoidable, so the application must be fixed anyway.

This patch materializes the automatically early kill idea.
It aims to remove the vm.memory_failure_ealy_kill sysctl parameter.

This is mainly a policy change, please comment.

Thanks,
Fengguang

---
HWPOISON: only early kill processes who installed SIGBUS handler

We want to send SIGBUS.BUS_MCEERR_AO signals to KVM ASAP, so that
it is able to take actions to isolate the corrupted page. In fact,
any applications that does extensive internal caching (KVM, Oracle,
etc.) is advised to install a SIGBUS handler to get early notifications
of corrupted memory, so that it has good possibility to find and remove
the page from its cache. If don't do so, they will later receive the
SIGBUS.BUS_MCEERR_AR signal on accessing the corrupted memory, which
can be deadly (too hard to rescue).

For applications that don't care the signal, let them continue to run
until they try to consume the corrupted data.

For applications that used to catch the SIGBUS handler but don't understand
the new BUS_MCEERR_AO/BUS_MCEERR_AR codes, they may
- refused to die on BUS_MCEERR_AR, creating a busy loop and
flooding of SIGBUS signals, which is a bug of the application.
BUS_MCEERR_AO is an one shot event and won't lead to busy loops.
- does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
it may well hurt the same way on BUS_MCEERR_AR. The latter one is
unavoidable, so the application must be fixed anyway.

CC: Nick Piggin <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/memory-failure.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -205,6 +205,20 @@ static void kill_procs_ao(struct list_he
}
}

+static bool task_early_kill_elegible(struct task_struct *tsk)
+{
+ __sighandler_t handler;
+
+ if (!tsk->mm)
+ return false;
+
+ handler = tsk->sighand->action[SIGBUS-1].sa.sa_handler;
+ if (handler == SIG_DFL || handler == SIG_IGN)
+ return false;
+
+ return true;
+}
+
/*
* Collect processes when the error hit an anonymous page.
*/
@@ -222,7 +236,7 @@ static void collect_procs_anon(struct pa
goto out;

for_each_process (tsk) {
- if (!tsk->mm)
+ if (!task_early_kill_elegible(tsk))
continue;
list_for_each_entry (vma, &av->head, anon_vma_node) {
if (!page_mapped_in_vma(page, vma))
@@ -262,7 +276,7 @@ static void collect_procs_file(struct pa
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);

- if (!tsk->mm)
+ if (!task_early_kill_elegible(tsk))
continue;

vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,

2009-06-17 08:04:36

[permalink] [raw]

Subject: Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler

On Wed, Jun 17, 2009 at 02:37:02PM +0800, Wu Fengguang wrote:
> On Mon, Jun 15, 2009 at 10:22:25PM +0800, Wu Fengguang wrote:
> > On Mon, Jun 15, 2009 at 08:25:28PM +0800, Nick Piggin wrote:
> > > On Mon, Jun 15, 2009 at 08:10:01PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jun 15, 2009 at 03:19:07PM +0800, Nick Piggin wrote:
> > > > > > For KVM you need early kill, for the others it remains to be seen.
> > > > >
> > > > > Right. It's almost like you need to do a per-process thing, and
> > > > > those that can handle things (such as the new SIGBUS or the new
> > > > > EIO) could get those, and others could be killed.
> > > >
> > > > To send early SIGBUS kills to processes who has called
> > > > sigaction(SIGBUS, ...)? KVM will sure do that. For other apps we
> > > > don't mind they can understand that signal at all.
> > >
> > > For apps that hook into SIGBUS for some other means and
> >
> > Yes I was referring to the sigaction(SIGBUS) apps, others will
> > be late killed anyway.
> >
> > > do not understand the new type of SIGBUS signal? What about
> > > those?
> >
> > We introduced two new SIGBUS codes:
> > BUS_MCEERR_AO=5 for early kill
> > BUS_MCEERR_AR=4 for late kill
> > I'd assume a legacy application will handle them in the same way (both
> > are unexpected code to the application).
> >
> > We don't care whether the application can be killed by BUS_MCEERR_AO
> > or BUS_MCEERR_AR depending on its SIGBUS handler implementation.
> > But (in the rare case) if the handler
> > - refused to die on BUS_MCEERR_AR, it may create a busy loop and
> > flooding of SIGBUS signals, which is a bug of the application.
> > BUS_MCEERR_AO is one time and won't lead to busy loops.
> > - does something that hurts itself (ie. data safety) on BUS_MCEERR_AO,
> > it may well hurt the same way on BUS_MCEERR_AR. The latter one is
> > unavoidable, so the application must be fixed anyway.
>
> This patch materializes the automatically early kill idea.
> It aims to remove the vm.memory_failure_ealy_kill sysctl parameter.
>
> This is mainly a policy change, please comment.

Well then you can still early-kill random apps that did not
want it, and you may still cause problems if its sigbus
handler does something nontrivial.

Can you use a prctl or something so it can expclitly
register interest in this?

2009-06-17 09:55:51

[permalink] [raw]

Subject: Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler

2009-06-17 10:00:24

[permalink] [raw]

Subject: Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler

On Wed, Jun 17, 2009 at 05:55:32PM +0800, Wu Fengguang wrote:
> On Wed, Jun 17, 2009 at 04:04:04PM +0800, Nick Piggin wrote:
> > Well then you can still early-kill random apps that did not
> > want it, and you may still cause problems if its sigbus
> > handler does something nontrivial.
> >
> > Can you use a prctl or something so it can expclitly
> > register interest in this?
>
> No I don't think prctl would be much better.
>
> - if an application want early/late kill, it can do so with a proper
> written SIGBUS handler: the prctl call is redundant.

s/proper written/is switched to new semantics based on the existance
of a/

> - if an admin want to control early/late kill for an unmodified app,
> prctl is as unhelpful as this patch(*).

Clearly you can execute a process with a given prctl.

> - prctl does can help legacy apps whose SIGBUS handler has trouble
> with the new SIGBUS codes, however such application should be rare
> and the application should be fixed(why shall it do something wrong
> on newly introduced code at all? Shall we stop introducing new codes
> just because some random buggy app cannot handle new codes?)

Backwards compatibility? Kind of important.

> So I still prefer this patch, until we come up with some solution that
> allows both app and admin to change the setting.

Not only does it allow that, but it also provides backwards
compatibility. Your patch does not allow admin to change
anything nor does it guarantee 100% back compat so I can't
see how you think it is better.

Also it does not allow for an app with a SIGBUS handler to
use late kill. If late kill is useful to anyone, why would
it not be useful to some app with a SIGBUS handler (that is
not KVM)?

2009-06-17 11:58:30

[permalink] [raw]

Subject: Re: [RFC][PATCH] HWPOISON: only early kill processes who installed SIGBUS handler

On Wed, Jun 17, 2009 at 06:00:06PM +0800, Nick Piggin wrote:
> On Wed, Jun 17, 2009 at 05:55:32PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 17, 2009 at 04:04:04PM +0800, Nick Piggin wrote:
> > > Well then you can still early-kill random apps that did not
> > > want it, and you may still cause problems if its sigbus
> > > handler does something nontrivial.
> > >
> > > Can you use a prctl or something so it can expclitly
> > > register interest in this?
> >
> > No I don't think prctl would be much better.
> >
> > - if an application want early/late kill, it can do so with a proper
> > written SIGBUS handler: the prctl call is redundant.
>
> s/proper written/is switched to new semantics based on the existance
> of a/

Not necessarily so. If an application
- did not has a SIGBUS handler, and want to be
- early killed: must install a handler, this is not a big problem
because it may well want to rescue something on the event.
- late killed: just do nothing.
(here kill = 'notification')
- had a SIGBUS hander, and want to
- early die: call exit(0) in the handler.
- late die: intercept and ignore the signal.
So if source code modification is viable, prctl is not necessary at all.

> > - if an admin want to control early/late kill for an unmodified app,
> > prctl is as unhelpful as this patch(*).
>
> Clearly you can execute a process with a given prctl.

OK, right.

> > - prctl does can help legacy apps whose SIGBUS handler has trouble
> > with the new SIGBUS codes, however such application should be rare
> > and the application should be fixed(why shall it do something wrong
> > on newly introduced code at all? Shall we stop introducing new codes
> > just because some random buggy app cannot handle new codes?)
>
> Backwards compatibility? Kind of important.

Maybe.

> > So I still prefer this patch, until we come up with some solution that
> > allows both app and admin to change the setting.
>
> Not only does it allow that, but it also provides backwards
> compatibility. Your patch does not allow admin to change
> anything nor does it guarantee 100% back compat so I can't
> see how you think it is better.

I didn't say it is better, but clearly mean that prctl is not better
enough to warrant a new user interface, if(!adm_friendly). Now it's
obvious that adm_friendly=1, so I agree prctl is a good interface :)

> Also it does not allow for an app with a SIGBUS handler to
> use late kill. If late kill is useful to anyone, why would
> it not be useful to some app with a SIGBUS handler (that is
> not KVM)?

Late kill will always be sent. Ignore the early kill signal in the
SIGBUS handler does the trick (see above analyzes).

Thanks,
Fengguang

2009-06-18 09:56:55