Upcoming Intel CPUs have support for recovering from some memory errors. This
requires the OS to declare a page "poisoned", kill the processes associated
with it and avoid using it in the future. This patchkit implements
the necessary infrastructure in the VM.
To quote the overview comment:
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focusses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* Some of the operations here are somewhat inefficient and have non
* linear algorithmic complexity, because the data structures have not
* been optimized for this case. This is in particular the case
* for the mapping from a vma to a process. Since this case is expected
* to be rare we hope we can get away with this.
The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.
The main target right now is KVM guests, but it works for all kinds
of applications.
For the KVM use there was a need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.
This is not fully complete yet, in particular there are still various
ways to access poison (crash dump, /proc/kcore etc.)
that need to be plugged too.
Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome.
The patch series requires the earlier x86 MCE feature series for the x86
specific action optional part. The code can be tested without the x86 specific
part using the injector; this only requires enabling the Kconfig entry
manually in some Kconfig file (by default it is implicitly enabled
by the architecture).
-Andi
Poisoned pages need special handling in the VM and shouldn't be touched
again. This requires a new page flag. Define it here.
The page flags wars seem to be over, so it shouldn't be a problem
to get a new one. I hope.
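As a rough illustration of the flag pattern this patch relies on, here is a
minimal user-space sketch of a macro that generates test/set/clear helpers
for one flag bit, plus a constant-false stub for when the feature is compiled
out. All demo_/Demo/DEMO_ names, the bit numbers and the struct are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Assumption: feature switch for this sketch; remove to compile it out. */
#define DEMO_CONFIG_MEMORY_FAILURE 1

/* Hypothetical stand-in for struct page: only a flags word. */
struct demo_page {
	unsigned long flags;
};

/* Hypothetical bit numbers; the real enum lives in page-flags.h. */
enum demo_pageflags {
	DEMO_PG_locked,
	DEMO_PG_dirty,
#ifdef DEMO_CONFIG_MEMORY_FAILURE
	DEMO_PG_poison,
#endif
	DEMO_NR_PAGEFLAGS,
};

/* Generate test/set/clear helpers for one flag bit. */
#define DEMO_PAGEFLAG(uname, lname)					\
static inline bool DemoPage##uname(const struct demo_page *p)		\
{ return (p->flags >> DEMO_PG_##lname) & 1UL; }				\
static inline void DemoSetPage##uname(struct demo_page *p)		\
{ p->flags |= 1UL << DEMO_PG_##lname; }					\
static inline void DemoClearPage##uname(struct demo_page *p)		\
{ p->flags &= ~(1UL << DEMO_PG_##lname); }

/* With the feature compiled out the test is a constant false. */
#define DEMO_PAGEFLAG_FALSE(uname)					\
static inline bool DemoPage##uname(const struct demo_page *p)		\
{ (void)p; return false; }

#ifdef DEMO_CONFIG_MEMORY_FAILURE
DEMO_PAGEFLAG(Poison, poison)
#define DEMO_PG_POISON_MASK (1UL << DEMO_PG_poison)
#else
DEMO_PAGEFLAG_FALSE(Poison)
#define DEMO_PG_POISON_MASK 0UL
#endif
```

Defining the mask as 0 in the compiled-out case lets callers OR it into
larger flag masks without an #ifdef at every use.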
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/page-flags.h | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h 2009-04-07 16:39:27.000000000 +0200
+++ linux/include/linux/page-flags.h 2009-04-07 16:39:39.000000000 +0200
@@ -51,6 +51,9 @@
* PG_buddy is set to indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
*
+ * PG_poison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
*/
/*
@@ -104,6 +107,9 @@
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+ PG_poison, /* poisoned page. Don't touch */
+#endif
__NR_PAGEFLAGS,
/* Filesystems */
@@ -273,6 +279,14 @@
PAGEFLAG_FALSE(Uncached)
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(Poison, poison)
+#define __PG_POISON (1UL << PG_poison)
+#else
+PAGEFLAG_FALSE(Poison)
+#define __PG_POISON 0
+#endif
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -403,7 +417,7 @@
1 << PG_private | 1 << PG_private_2 | \
1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- __PG_UNEVICTABLE | __PG_MLOCKED)
+ __PG_POISON | __PG_UNEVICTABLE | __PG_MLOCKED)
/*
* Flags checked when a page is prepped for return by the page allocator.
Needed for later patch that walks rmap entries on its own.
This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/rmap.h | 6 ++++++
mm/rmap.c | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-04-07 16:39:26.000000000 +0200
+++ linux/include/linux/rmap.h 2009-04-07 16:43:06.000000000 +0200
@@ -118,6 +118,12 @@
}
#endif
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-04-07 16:39:26.000000000 +0200
+++ linux/mm/rmap.c 2009-04-07 16:43:06.000000000 +0200
@@ -191,7 +191,7 @@
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;
@@ -211,7 +211,7 @@
return NULL;
}
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
Make sure no poisoned pages are put back into the free page
lists. This can happen with some races.
This is all in the slow path that handles bad page bits, so another
check doesn't really matter.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/page_alloc.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c 2009-04-07 16:39:26.000000000 +0200
+++ linux/mm/page_alloc.c 2009-04-07 16:39:39.000000000 +0200
@@ -228,6 +228,15 @@
static unsigned long nr_unshown;
/*
+ * Page may have been marked bad before process is freeing it.
+ * Make sure it is not put back into the free page lists.
+ */
+ if (PagePoison(page)) {
+ /* check more flags here... */
+ return;
+ }
+
+ /*
* Allow a burst of 60 reports, then keep quiet for that minute;
* or allow a steady drip of one report per second.
*/
The machine check poison handling needs to go to process context very
quickly. Add a new high priority queueing mechanism for work items.
This should be only used in exceptional cases! (but a machine check
is definitely exceptional)
The insert is not fully O(1) with regard to other high priority
items, but those should be rather rare anyway.
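The insertion rule can be sketched in user space as follows. This is only
an illustration of "insert after the last high priority item", using a
hand-rolled circular list; the demo_ names and the struct layout are
hypothetical and not the kernel's work_struct or list_head API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work item on a circular doubly linked list. */
struct demo_work {
	struct demo_work *prev, *next;
	bool highpri;
	int id;
};

/* Initialize the list head (a sentinel that carries no work). */
static void demo_list_init(struct demo_work *head)
{
	head->prev = head->next = head;
	head->highpri = false;
	head->id = -1;
}

/* Link 'item' immediately before 'pos'. */
static void demo_insert_before(struct demo_work *item, struct demo_work *pos)
{
	item->prev = pos->prev;
	item->next = pos;
	pos->prev->next = item;
	pos->prev = item;
}

/*
 * Normal items go to the tail. High priority items go behind the
 * leading run of high priority items, so they run before all normal
 * work but do not jump ahead of earlier high priority work.
 */
static void demo_queue_work(struct demo_work *head, struct demo_work *item)
{
	struct demo_work *pos = head;

	if (item->highpri) {
		for (pos = head->next; pos != head; pos = pos->next)
			if (!pos->highpri)
				break;
	}
	demo_insert_before(item, pos);
}
```

The walk is linear in the number of queued high priority items, which is
the non-O(1) cost mentioned above.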
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/workqueue.h | 3 +++
kernel/workqueue.c | 15 +++++++++++++++
2 files changed, 18 insertions(+)
Index: linux/include/linux/workqueue.h
===================================================================
--- linux.orig/include/linux/workqueue.h 2009-04-07 16:39:28.000000000 +0200
+++ linux/include/linux/workqueue.h 2009-04-07 16:39:39.000000000 +0200
@@ -25,6 +25,7 @@
struct work_struct {
atomic_long_t data;
#define WORK_STRUCT_PENDING 0 /* T if work item pending execution */
+#define WORK_STRUCT_HIGHPRI 1 /* work is high priority */
#define WORK_STRUCT_FLAG_MASK (3UL)
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
struct list_head entry;
@@ -163,6 +164,8 @@
#define work_clear_pending(work) \
clear_bit(WORK_STRUCT_PENDING, work_data_bits(work))
+#define set_work_highpri(work) \
+ set_bit(WORK_STRUCT_HIGHPRI, work_data_bits(work))
extern struct workqueue_struct *
__create_workqueue_key(const char *name, int singlethread,
Index: linux/kernel/workqueue.c
===================================================================
--- linux.orig/kernel/workqueue.c 2009-04-07 16:39:28.000000000 +0200
+++ linux/kernel/workqueue.c 2009-04-07 16:39:39.000000000 +0200
@@ -132,6 +132,21 @@
* result of list_add() below, see try_to_grab_pending().
*/
smp_wmb();
+ /*
+ * Insert after last high priority item. This avoids
+ * them starving each other.
+ * High priority items should be rare, so it's ok to not have
+ * O(1) insert for them.
+ */
+ if (test_bit(WORK_STRUCT_HIGHPRI, work_data_bits(work)) &&
+ !list_empty(head)) {
+ struct work_struct *w;
+ list_for_each_entry (w, head, entry) {
+ if (!test_bit(WORK_STRUCT_HIGHPRI, work_data_bits(w)))
+ break;
+ }
+ head = &w->entry;
+ }
list_add_tail(&work->entry, head);
wake_up(&cwq->more_work);
}
Page migration uses special swap entry types to trigger special actions on page
faults. Extend this mechanism to also support poisoned swap entries, to trigger
poison handling on page faults. This allows follow-on patches to prevent
processes from faulting in poisoned pages again.
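To illustrate the general idea of reserving swap types for non-swap
purposes, here is a small user-space sketch. The 5 bit type field matches
MAX_SWAPFILES_SHIFT, but the 64 bit entry layout and the exact numbering of
the reserved types are assumptions of this sketch and are not copied from
the macros in the patch. All demo_/DEMO_ names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TYPE_BITS		5
#define DEMO_OFFSET_BITS	(64 - DEMO_TYPE_BITS)
#define DEMO_OFFSET_MASK	((UINT64_C(1) << DEMO_OFFSET_BITS) - 1)

/* Assumption: three reserved types at the top of the type space. */
#define DEMO_RESERVED_TYPES	3
#define DEMO_MAX_SWAPFILES	((1 << DEMO_TYPE_BITS) - DEMO_RESERVED_TYPES)
#define DEMO_SWP_POISON		(DEMO_MAX_SWAPFILES + 0)
#define DEMO_SWP_MIGRATION_READ	(DEMO_MAX_SWAPFILES + 1)
#define DEMO_SWP_MIGRATION_WRITE (DEMO_MAX_SWAPFILES + 2)

typedef struct { uint64_t val; } demo_swp_entry_t;

/* Pack a type and an offset into one entry. */
static inline demo_swp_entry_t demo_swp_entry(unsigned type, uint64_t offset)
{
	demo_swp_entry_t e = {
		((uint64_t)type << DEMO_OFFSET_BITS) | (offset & DEMO_OFFSET_MASK)
	};
	return e;
}

static inline unsigned demo_swp_type(demo_swp_entry_t e)
{
	return (unsigned)(e.val >> DEMO_OFFSET_BITS);
}

static inline uint64_t demo_swp_offset(demo_swp_entry_t e)
{
	return e.val & DEMO_OFFSET_MASK;
}

/* A poison entry stores the page frame number in the offset field. */
static inline demo_swp_entry_t demo_make_poison_entry(uint64_t pfn)
{
	return demo_swp_entry(DEMO_SWP_POISON, pfn);
}

static inline bool demo_is_poison_entry(demo_swp_entry_t e)
{
	return demo_swp_type(e) == DEMO_SWP_POISON;
}

/* True for any reserved type, i.e. anything that is not a real swap file. */
static inline bool demo_non_swap_entry(demo_swp_entry_t e)
{
	return demo_swp_type(e) >= DEMO_MAX_SWAPFILES;
}
```

Whatever numbering is chosen, every reserved type has to fit into the type
field and the non-swap test has to cover all of them.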
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/swap.h | 34 ++++++++++++++++++++++++++++------
include/linux/swapops.h | 38 ++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 4 ++--
3 files changed, 68 insertions(+), 8 deletions(-)
Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h 2009-04-07 16:39:25.000000000 +0200
+++ linux/include/linux/swap.h 2009-04-07 16:39:39.000000000 +0200
@@ -34,16 +34,38 @@
* the type/offset into the pte as 5/27 as well.
*/
#define MAX_SWAPFILES_SHIFT 5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES (1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_POISON_NUM + 1)
+#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_POISON_NUM + 2)
#else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
#endif
/*
+ * Handling of poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_POISON_NUM 1
+#define SWP_POISON (MAX_SWAPFILES + 1)
+#else
+#define SWP_POISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_POISON_NUM)
+
+/*
* Magic header for a swap area. The first part of the union is
* what the swap magic looks like for the old (limited to 128MB)
* swap area format, the second part of the union adds - in the
Index: linux/include/linux/swapops.h
===================================================================
--- linux.orig/include/linux/swapops.h 2009-04-07 16:39:25.000000000 +0200
+++ linux/include/linux/swapops.h 2009-04-07 16:39:39.000000000 +0200
@@ -131,3 +131,41 @@
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for poisoned pages
+ */
+static inline swp_entry_t make_poison_entry(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ return swp_entry(SWP_POISON, page_to_pfn(page));
+}
+
+static inline int is_poison_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_POISON;
+}
+#else
+
+static inline swp_entry_t make_poison_entry(struct page *page)
+{
+ return swp_entry(0, 0);
+}
+
+static inline int is_poison_entry(swp_entry_t swp)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return swp_type(entry) > MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return 0;
+}
+#endif
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c 2009-04-07 16:39:25.000000000 +0200
+++ linux/mm/swapfile.c 2009-04-07 16:39:39.000000000 +0200
@@ -579,7 +579,7 @@
struct swap_info_struct *p;
struct page *page = NULL;
- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return 1;
p = swap_info_get(entry);
@@ -1949,7 +1949,7 @@
unsigned long offset, type;
int result = 0;
- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return 1;
type = swp_type(entry);
Bail out early when poisoned pages are found in page fault handling.
Since they are poisoned they should not be mapped freshly
into processes.
This is generally handled in the same way as OOM, just a different
error code is returned to the architecture code.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/memory.c | 7 +++++++
1 file changed, 7 insertions(+)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
+++ linux/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
@@ -2560,6 +2560,10 @@
goto oom;
__SetPageUptodate(page);
+ /* Kludge for now until we take poisoned pages out of the free lists */
+ if (unlikely(PagePoison(page)))
+ return VM_FAULT_POISON;
+
if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
goto oom_free_page;
@@ -2625,6 +2629,9 @@
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;
+ if (unlikely(PagePoison(vmf.page)))
+ return VM_FAULT_POISON;
+
/*
* For consistency in subsequent calls, make the faulted page always
* locked.
- Add a new VM_FAULT_POISON error code to handle_mm_fault. Right now
architectures have to explicitly enable poison page support, so
this is forward compatible with all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling in swap in fault code
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/mm.h | 3 ++-
mm/memory.c | 17 ++++++++++++++---
2 files changed, 16 insertions(+), 4 deletions(-)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2009-04-07 16:39:24.000000000 +0200
+++ linux/mm/memory.c 2009-04-07 16:43:06.000000000 +0200
@@ -1315,7 +1315,8 @@
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return i ? i : -ENOMEM;
- else if (ret & VM_FAULT_SIGBUS)
+ if (ret &
+ (VM_FAULT_POISON|VM_FAULT_SIGBUS))
return i ? i : -EFAULT;
BUG();
}
@@ -2426,8 +2427,15 @@
goto out;
entry = pte_to_swp_entry(orig_pte);
- if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ if (unlikely(non_swap_entry(entry))) {
+ if (is_migration_entry(entry)) {
+ migration_entry_wait(mm, pmd, address);
+ } else if (is_poison_entry(entry)) {
+ ret = VM_FAULT_POISON;
+ } else {
+ print_bad_pte(vma, address, pte, NULL);
+ ret = VM_FAULT_OOM;
+ }
goto out;
}
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2451,6 +2459,9 @@
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
+ } else if (PagePoison(page)) {
+ ret = VM_FAULT_POISON;
+ goto out;
}
lock_page(page);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2009-04-07 16:39:24.000000000 +0200
+++ linux/include/linux/mm.h 2009-04-07 16:43:05.000000000 +0200
@@ -702,11 +702,12 @@
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
+#define VM_FAULT_POISON 0x0010 /* Hit poisoned page */
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
-#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_POISON)
/*
* Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
Add VM_FAULT_POISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_SIGBUS, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.
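The mapping from fault code to signal information can be sketched like
this. It is a simplified user-space model, not the x86 fault handler: the
DEMO_ constants, the 4K page size and the demo_siginfo struct are
assumptions made for the example.

```c
#include <stdint.h>

/* Assumed values for this sketch only. */
#define DEMO_VM_FAULT_OOM	0x0001
#define DEMO_VM_FAULT_SIGBUS	0x0002
#define DEMO_VM_FAULT_POISON	0x0010
#define DEMO_BUS_ADRERR		2
#define DEMO_BUS_MCEERR_AR	4
#define DEMO_PAGE_SHIFT		12	/* assumes 4K pages */

/* Hypothetical cut-down siginfo with only the fields used here. */
struct demo_siginfo {
	int si_code;
	uintptr_t si_addr;
	short si_addr_lsb;
};

/*
 * Fill in SIGBUS information for a fault result.
 * Returns 0 on success, -1 if the fault is not a SIGBUS class error.
 */
static int demo_fill_sigbus(unsigned fault, uintptr_t address,
			    struct demo_siginfo *info)
{
	if (!(fault & (DEMO_VM_FAULT_SIGBUS | DEMO_VM_FAULT_POISON)))
		return -1;

	info->si_addr = address;
	if (fault & DEMO_VM_FAULT_POISON) {
		/* Poison: action required, whole page is affected. */
		info->si_code = DEMO_BUS_MCEERR_AR;
		info->si_addr_lsb = DEMO_PAGE_SHIFT;
	} else {
		/* Plain SIGBUS: no extent information. */
		info->si_code = DEMO_BUS_ADRERR;
		info->si_addr_lsb = 0;
	}
	return 0;
}
```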
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/mm/fault.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c 2009-04-07 16:39:23.000000000 +0200
+++ linux/arch/x86/mm/fault.c 2009-04-07 16:39:39.000000000 +0200
@@ -189,6 +189,7 @@
info.si_errno = 0;
info.si_code = si_code;
info.si_addr = (void __user *)address;
+ info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
force_sig_info(si_signo, &info, tsk);
}
@@ -827,10 +828,12 @@
}
static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+ unsigned int fault)
{
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ int code = BUS_ADRERR;
up_read(&mm->mmap_sem);
@@ -846,7 +849,14 @@
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+ if (fault & VM_FAULT_POISON) {
+ printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
+ tsk->comm, tsk->pid);
+ code = BUS_MCEERR_AR;
+ }
+#endif
+ force_sig_info_fault(SIGBUS, code, address, tsk);
}
static noinline void
@@ -856,8 +866,8 @@
if (fault & VM_FAULT_OOM) {
out_of_memory(regs, error_code, address);
} else {
- if (fault & VM_FAULT_SIGBUS)
- do_sigbus(regs, error_code, address);
+ if (fault & (VM_FAULT_SIGBUS|VM_FAULT_POISON))
+ do_sigbus(regs, error_code, address, fault);
else
BUG();
}
try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours (e.g.
migration turns off aging, not only sets up migration ptes etc.)
Also the different flags interact in magic ways.
A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.
Replace the different flags with an action code (migration, munlock, unmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want to
do.
This patch is supposed to be a no-op in behaviour. If anyone can prove
it is not that would be a bug.
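The action-plus-modifier encoding can be illustrated in isolation. The
enum values below mirror the ones in this patch; the helper functions,
including the mapping from the old (unlock, migration) flag pair, are
hypothetical and only meant to show how the encoding is decoded.

```c
#include <stdbool.h>

/* Same values as enum ttu_flags in this patch, with a demo prefix. */
enum demo_ttu_flags {
	DEMO_TTU_UNMAP		= 0,		/* unmap mode */
	DEMO_TTU_MIGRATION	= 1,		/* migration mode */
	DEMO_TTU_MUNLOCK	= 2,		/* munlock mode */
	DEMO_TTU_ACTION_MASK	= 0xff,

	DEMO_TTU_IGNORE_MLOCK	= (1 << 8),	/* ignore mlock */
	DEMO_TTU_IGNORE_ACCESS	= (1 << 9),	/* don't age */
};

/* The low byte selects exactly one action. */
static inline int demo_ttu_action(int flags)
{
	return flags & DEMO_TTU_ACTION_MASK;
}

/* The upper bits are independent modifiers. */
static inline bool demo_ttu_checks_mlock(int flags)
{
	return !(flags & DEMO_TTU_IGNORE_MLOCK);
}

static inline bool demo_ttu_checks_young(int flags)
{
	return !(flags & DEMO_TTU_IGNORE_ACCESS);
}

/*
 * Hypothetical helper: translate the old (unlock, migration) pair.
 * In the old interface migration was ignored when unlock was set.
 */
static int demo_ttu_from_legacy(bool unlock, bool migration)
{
	if (unlock)
		return DEMO_TTU_MUNLOCK;
	if (migration)
		return DEMO_TTU_MIGRATION | DEMO_TTU_IGNORE_MLOCK |
		       DEMO_TTU_IGNORE_ACCESS;
	return DEMO_TTU_UNMAP;
}
```

Keeping the action in a masked field means a new mode only needs a new
enum value, while modifiers can be combined freely with any action.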
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/rmap.h | 14 +++++++++++++-
mm/migrate.c | 2 +-
mm/rmap.c | 40 ++++++++++++++++++++++------------------
mm/vmscan.c | 2 +-
4 files changed, 37 insertions(+), 21 deletions(-)
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-04-07 16:39:39.000000000 +0200
+++ linux/include/linux/rmap.h 2009-04-07 16:39:39.000000000 +0200
@@ -84,7 +84,19 @@
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+ TTU_UNMAP = 0, /* unmap mode */
+ TTU_MIGRATION = 1, /* migration mode */
+ TTU_MUNLOCK = 2, /* munlock mode */
+ TTU_ACTION_MASK = 0xff,
+
+ TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
+ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
/*
* Called from mm/filemap_xip.c to unmap empty zero page
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-04-07 16:39:39.000000000 +0200
+++ linux/mm/rmap.c 2009-04-07 16:43:06.000000000 +0200
@@ -755,7 +755,7 @@
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- int migration)
+ enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -777,11 +777,13 @@
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration) {
+ if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
}
+ }
+ if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address, pte)) {
ret = SWAP_FAIL;
goto out_unmap;
@@ -821,12 +823,12 @@
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- BUG_ON(!migration);
+ BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
entry = make_migration_entry(page, pte_write(pteval));
}
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
- } else if (PAGE_MIGRATION && migration) {
+ } else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
/* Establish migration entry for a file page */
swp_entry_t entry;
entry = make_migration_entry(page, pte_write(pteval));
@@ -995,12 +997,13 @@
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1016,7 +1019,7 @@
continue; /* must visit all unlocked vmas */
ret = SWAP_MLOCK; /* saw at least one mlocked vma */
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
}
@@ -1040,8 +1043,7 @@
/**
* try_to_unmap_file - unmap/unlock file page using the object-based rmap method
* @page: the page to unmap/unlock
- * @unlock: request for unlock rather than unmap [unlikely]
- * @migration: unmapping for migration - ignored if @unlock
+ * @flags: action and flags
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
@@ -1053,7 +1055,7 @@
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1065,6 +1067,7 @@
unsigned long max_nl_size = 0;
unsigned int mapcount;
unsigned int mlocked = 0;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1077,7 +1080,7 @@
continue; /* must visit all vmas */
ret = SWAP_MLOCK;
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
goto out;
}
@@ -1102,7 +1105,8 @@
ret = SWAP_MLOCK; /* leave mlocked == 0 */
goto out; /* no need to look further */
}
- if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+ (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -1136,7 +1140,7 @@
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (!MLOCK_PAGES && !migration &&
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
@@ -1176,7 +1180,7 @@
/**
* try_to_unmap - try to remove all page table mappings to a page
* @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
*
* Tries to remove all the page table entries which are mapping this
* page, used in the pageout path. Caller must hold the page lock.
@@ -1187,16 +1191,16 @@
* SWAP_FAIL - the page is unswappable
* SWAP_MLOCK - page is mlocked.
*/
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
{
int ret;
BUG_ON(!PageLocked(page));
if (PageAnon(page))
- ret = try_to_unmap_anon(page, 0, migration);
+ ret = try_to_unmap_anon(page, flags);
else
- ret = try_to_unmap_file(page, 0, migration);
+ ret = try_to_unmap_file(page, flags);
if (ret != SWAP_MLOCK && !page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
@@ -1222,8 +1226,8 @@
VM_BUG_ON(!PageLocked(page) || PageLRU(page));
if (PageAnon(page))
- return try_to_unmap_anon(page, 1, 0);
+ return try_to_unmap_anon(page, TTU_MUNLOCK);
else
- return try_to_unmap_file(page, 1, 0);
+ return try_to_unmap_file(page, TTU_MUNLOCK);
}
#endif
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c 2009-04-07 16:39:23.000000000 +0200
+++ linux/mm/vmscan.c 2009-04-07 16:39:39.000000000 +0200
@@ -663,7 +663,7 @@
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, 0)) {
+ switch (try_to_unmap(page, TTU_UNMAP)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c 2009-04-07 16:39:23.000000000 +0200
+++ linux/mm/migrate.c 2009-04-07 16:39:39.000000000 +0200
@@ -669,7 +669,7 @@
}
/* Establish migration ptes or remove ptes */
- try_to_unmap(page, 1);
+ try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
Bail out early in set_page_dirty for poisoned pages. We don't want any
of the dirty accounting done or file system write back started, because
the page will be just thrown away.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/page-writeback.c | 4 ++++
1 file changed, 4 insertions(+)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2009-04-07 16:39:22.000000000 +0200
+++ linux/mm/page-writeback.c 2009-04-07 16:39:39.000000000 +0200
@@ -1277,6 +1277,10 @@
{
struct address_space *mapping = page_mapping(page);
+ if (unlikely(PagePoison(page))) {
+ SetPageDirty(page);
+ return 0;
+ }
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
#ifdef CONFIG_BLOCK
Add new SIGBUS codes for reporting machine checks as signals. When
the hardware detects an uncorrected ECC error it can trigger these
signals.
This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but might be also useful for other programs.
I find it useful in my test programs.
This patch merely defines the new types.
- Define two new si_codes for SIGBUS. BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application ran into an area
which has been known to be corrupted earlier. These require immediate
action and the program cannot just return to the faulting code. Most
programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.
AK: I plan to write a man page update before anyone asks.
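As a sketch of how a user program might interpret the new information,
the helpers below classify an si_code and compute the affected address
range from si_addr and si_addr_lsb. They take plain values so the example
does not depend on any header carrying the new definitions; a real program
would read them from the siginfo_t passed to a SIGBUS handler installed
with SA_SIGINFO. The numeric codes are the user space values of the
definitions in this patch; all demo_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* User space values of the new SIGBUS si_codes (assumed for the sketch). */
#define DEMO_BUS_MCEERR_AR	4
#define DEMO_BUS_MCEERR_AO	5

enum demo_mce_action {
	DEMO_NOT_MCE,		/* some other SIGBUS reason */
	DEMO_MAY_CONTINUE,	/* action optional: nothing consumed yet */
	DEMO_MUST_RECOVER,	/* action required: cannot simply resume */
};

struct demo_range {
	uintptr_t start;
	size_t len;
};

/* Decide how urgent a SIGBUS is based on its si_code. */
static enum demo_mce_action demo_classify(int si_code)
{
	switch (si_code) {
	case DEMO_BUS_MCEERR_AR:
		return DEMO_MUST_RECOVER;
	case DEMO_BUS_MCEERR_AO:
		return DEMO_MAY_CONTINUE;
	default:
		return DEMO_NOT_MCE;
	}
}

/*
 * si_addr_lsb gives the least significant valid bit of the address,
 * so the corruption lies somewhere in an aligned 2^lsb byte block.
 */
static struct demo_range demo_affected_range(uintptr_t si_addr,
					     short si_addr_lsb)
{
	struct demo_range r;

	r.len = (size_t)1 << si_addr_lsb;
	r.start = si_addr & ~((uintptr_t)r.len - 1);
	return r;
}
```

An application that keeps a clean copy of its data elsewhere could use the
range to drop and refetch the affected block instead of exiting.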
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
include/asm-generic/siginfo.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
Index: linux/include/asm-generic/siginfo.h
===================================================================
--- linux.orig/include/asm-generic/siginfo.h 2009-04-07 16:39:24.000000000 +0200
+++ linux/include/asm-generic/siginfo.h 2009-04-07 16:39:39.000000000 +0200
@@ -82,6 +82,7 @@
#ifdef __ARCH_SI_TRAPNO
int _trapno; /* TRAP # which caused the signal */
#endif
+ short _addr_lsb; /* LSB of the reported address */
} _sigfault;
/* SIGPOLL */
@@ -112,6 +113,7 @@
#ifdef __ARCH_SI_TRAPNO
#define si_trapno _sifields._sigfault._trapno
#endif
+#define si_addr_lsb _sifields._sigfault._addr_lsb
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
@@ -192,7 +194,11 @@
#define BUS_ADRALN (__SI_FAULT|1) /* invalid address alignment */
#define BUS_ADRERR (__SI_FAULT|2) /* non-existant physical address */
#define BUS_OBJERR (__SI_FAULT|3) /* object specific hardware error */
-#define NSIGBUS 3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR (__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO (__SI_FAULT|5)
+#define NSIGBUS 5
/*
* SIGTRAP si_codes
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
mm/rmap.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-04-07 16:39:39.000000000 +0200
+++ linux/mm/rmap.c 2009-04-07 16:39:39.000000000 +0200
@@ -801,7 +801,14 @@
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
- if (PageAnon(page)) {
+ if (PagePoison(page)) {
+ if (PageAnon(page))
+ dec_mm_counter(mm, anon_rss);
+ else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+ dec_mm_counter(mm, file_rss);
+ set_pte_at(mm, address, pte,
+ swp_entry_to_pte(make_poison_entry(page)));
+ } else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(page) };
if (PageSwapCache(page)) {
This patch adds the high level memory handler that poisons pages.
It is portable code and lives in mm/memory-failure.c
To quote the overview comment:
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focusses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* Some of the operations here are somewhat inefficient and have non
* linear algorithmic complexity, because the data structures have not
* been optimized for this case. This is in particular the case
* for the mapping from a VMA to a process. Since this case is expected
* to be rare we hope we can get away with this.
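
The "pages in various states" handling is table driven in the patch: each
row holds a flag mask and the value those flags must have, and the first
matching row decides the action. A minimal user-space sketch of that lookup
follows. The SK_* flag bits and sk_* names are hypothetical and do not use
the kernel's PG_* numbering; only the matching rule is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for the sketch; not the kernel's PG_* values. */
enum {
	SK_RESERVED = 1 << 0,
	SK_BUDDY    = 1 << 1,
	SK_LRU      = 1 << 2,
	SK_DIRTY    = 1 << 3,
};

struct sk_state {
	unsigned long mask;	/* which flag bits to look at */
	unsigned long res;	/* required value of those bits */
	const char *msg;
};

/* Order matters: the first matching row wins, the last row matches all. */
static const struct sk_state sk_states[] = {
	{ SK_RESERVED,       SK_RESERVED,       "reserved kernel" },
	{ SK_BUDDY,          SK_BUDDY,          "free kernel" },
	{ SK_LRU | SK_DIRTY, SK_LRU | SK_DIRTY, "dirty lru" },
	{ SK_LRU | SK_DIRTY, SK_LRU,            "clean lru" },
	{ 0,                 0,                 "unknown page state" },
};

const char *sk_classify(unsigned long flags)
{
	const struct sk_state *ps;

	/* Terminates because (flags & 0) == 0 always holds for the last row. */
	for (ps = sk_states; ; ps++)
		if ((flags & ps->mask) == ps->res)
			return ps->msg;
}
```

Putting the more specific rows first is what lets "dirty lru" and "clean
lru" share one mask while still being told apart.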
There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl, vm.memory_failure_early_kill.
The default is early kill.
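
The decision made in poison_page_prepare() can be summarized by the small
user-space model below. It is only a sketch of the logic as I read it from
this patch. The enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_kill {
	SK_KILL_NONE,		/* nobody is signalled right now */
	SK_KILL_CATCHABLE,	/* SIGBUS that the process may handle */
	SK_KILL_FORCED,		/* handlers are reset first, so the signal is fatal */
};

/*
 * early_kill: value of vm.memory_failure_early_kill
 * page_dirty: page was dirty after all mappings were torn down
 * unmap_ok:   unmapping the page from all processes succeeded
 */
enum sk_kill sk_kill_policy(bool early_kill, bool page_dirty, bool unmap_ok)
{
	/* Late kill: no processes are collected, so none can be signalled. */
	if (!early_kill)
		return SK_KILL_NONE;
	/* A clean page can be dropped and re-read; no need to kill. */
	if (!page_dirty)
		return SK_KILL_NONE;
	/* If unmapping failed the process must not be able to keep running. */
	return unmap_ok ? SK_KILL_CATCHABLE : SK_KILL_FORCED;
}
```

With late kill a process is only affected once it actually touches the
poisoned mapping.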
The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.
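
The shape of that walk, reduced to a user-space model: for every task that
has an address space, scan the VMAs known to map the page and remember the
task for each VMA that belongs to it. The sk_* types below are simplified
stand-ins invented for this sketch, not the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

struct sk_mm { int id; };
struct sk_task { int pid; struct sk_mm *mm; /* NULL for kernel threads */ };
struct sk_vma { struct sk_mm *vm_mm; };

/*
 * Record the pid of every task whose mm owns one of the VMAs mapping the
 * page. The cost is O(tasks * vmas), which is why the overview comment
 * warns about non linear complexity.
 * Returns the number of entries written to out[].
 */
size_t sk_collect(const struct sk_task *tasks, size_t ntasks,
		  const struct sk_vma *vmas, size_t nvmas,
		  int *out, size_t max_out)
{
	size_t n = 0;

	for (size_t t = 0; t < ntasks; t++) {
		if (!tasks[t].mm)
			continue;	/* no user address space */
		for (size_t v = 0; v < nvmas; v++)
			if (vmas[v].vm_mm == tasks[t].mm && n < max_out)
				out[n++] = tasks[t].pid;
	}
	return n;
}
```

Threads sharing one mm all show up in the result, matching the patch, which
compares vma->vm_mm against each task's mm.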
This isn't complete yet. The biggest gap is the missing hugepage
handling, along with a few other corner cases. The code cannot
get rid of all references in all cases.
This is rather tricky code and needs a lot of review. Undoubtedly it still
has bugs.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
fs/proc/meminfo.c | 9
include/linux/mm.h | 4
kernel/sysctl.c | 14 +
mm/Kconfig | 3
mm/Makefile | 1
mm/memory-failure.c | 575 ++++++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 605 insertions(+), 1 deletion(-)
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile 2009-04-07 16:39:21.000000000 +0200
+++ linux/mm/Makefile 2009-04-07 16:39:39.000000000 +0200
@@ -38,3 +38,4 @@
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c 2009-04-07 16:39:39.000000000 +0200
@@ -0,0 +1,575 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * Some of the operations here are somewhat inefficient and have non
+ * linear algorithmic complexity, because the data structures have not
+ * been optimized for this case. This is in particular the case
+ * for the mapping from a VMA to a process. Since this case is expected
+ * to be rare we hope we can get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - nonlinear
+ * - remap races
+ * - anonymous (tinject):
+ * + left over references when process catches signal?
+ * - error reporting on EIO missing (tinject)
+ * - kcore/oldmem/vmcore/mem/kmem check for poison pages
+ * - pass bad pages to kdump next kernel
+ */
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include "internal.h"
+
+#define Dprintk(x...) printk(x)
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages;
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno)
+{
+ struct siginfo si;
+ int ret;
+
+ printk(KERN_ERR
+ "MCE: Killing %s:%d due to hardware memory corruption\n",
+ t->comm, t->pid);
+ si.si_signo = SIGBUS;
+ si.si_errno = 0;
+ si.si_code = BUS_MCEERR_AO;
+ si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+ si.si_trapno = trapno;
+#endif
+ si.si_addr_lsb = PAGE_SHIFT;
+ ret = force_sig_info(SIGBUS, &si, t); /* synchronous? */
+ if (ret < 0)
+ printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+ t->comm, t->pid, ret);
+ return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handling it.
+ */
+
+struct to_kill {
+ struct list_head nd;
+ struct task_struct *tsk;
+ unsigned long addr;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do. We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+ struct vm_area_struct *vma,
+ struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ int fail = 0;
+ struct to_kill *tk;
+
+ if (*tkc) {
+ tk = *tkc;
+ *tkc = NULL;
+ } else {
+ tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+ if (!tk) {
+ printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
+ return;
+ }
+ }
+ tk->addr = page_address_in_vma(p, vma);
+ if (tk->addr == -EFAULT) {
+ printk(KERN_INFO "MCE: Failed to get address in VMA\n");
+ tk->addr = 0;
+ fail = 1;
+ }
+ get_task_struct(tsk);
+ tk->tsk = tsk;
+ list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ */
+static void
+kill_procs_ao(struct list_head *to_kill, int doit, int trapno, int fail)
+{
+ struct to_kill *tk, *next;
+
+ list_for_each_entry_safe (tk, next, to_kill, nd) {
+ if (doit) {
+ /*
+ * In case something went wrong with munmapping
+ * make sure the process doesn't catch the
+ * signal and then access the memory. So reset
+ * the signal handlers
+ */
+ if (fail)
+ flush_signal_handlers(tk->tsk, 1);
+
+ /*
+ * In theory the process could have mapped
+ * something else on the address in-between. We could
+ * check for that, but we need to tell the
+ * process anyways.
+ */
+ if (kill_proc_ao(tk->tsk, tk->addr, trapno) < 0)
+ printk(KERN_ERR
+ "MCE: Cannot send advisory machine check signal to %s:%d\n",
+ tk->tsk->comm, tk->tsk->pid);
+ }
+ put_task_struct(tk->tsk);
+ kfree(tk);
+ }
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct anon_vma *av = page_lock_anon_vma(page);
+
+ /* Not actually mapped anymore: nothing was locked, so just return. */
+ if (av == NULL)
+ return;
+
+ read_lock(&tasklist_lock);
+ for_each_process (tsk) {
+ if (!tsk->mm)
+ continue;
+ list_for_each_entry (vma, &av->head, anon_vma_node) {
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ }
+ read_unlock(&tasklist_lock);
+ page_unlock_anon_vma(av);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct prio_tree_iter iter;
+ struct address_space *mapping = page_mapping(page);
+
+ read_lock(&tasklist_lock);
+ spin_lock(&mapping->i_mmap_lock);
+ for_each_process(tsk) {
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ if (!tsk->mm)
+ continue;
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+ pgoff)
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+ struct to_kill *tk;
+
+ tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+ /* memory allocation failure is implicitly handled */
+ if (PageAnon(page))
+ collect_procs_anon(page, tokill, &tk);
+ else
+ collect_procs_file(page, tokill, &tk);
+ kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+ FAILED,
+ DELAYED,
+ IGNORED,
+ RECOVERED,
+};
+
+static const char *action_name[] = {
+ [FAILED] = "Failed",
+ [DELAYED] = "Delayed",
+ [IGNORED] = "Ignored",
+ [RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p)
+{
+ return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p)
+{
+ return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p)
+{
+ printk(KERN_ERR "MCE: Unknown state page %lx flags %lx, count %d\n",
+ page_to_pfn(p), p->flags, page_count(p));
+ return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p)
+{
+ /* TBD Should delete page from buddy here. */
+ return IGNORED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p)
+{
+ struct address_space *mapping;
+
+ if (PagePrivate(p))
+ do_invalidatepage(p, 0);
+ mapping = page_mapping(p);
+ if (mapping) {
+ if (!remove_mapping(mapping, p))
+ return FAILED;
+ }
+ return RECOVERED;
+}
+
+/*
+ * Dirty page cache page.
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p)
+{
+ struct address_space *mapping = page_mapping(p);
+
+ SetPageError(p);
+ /* TBD: print more information about the file. */
+ printk(KERN_ERR "MCE: Hardware memory corruption on dirty file page: write error\n");
+ if (mapping) {
+ /* CHECKME: does that report the error in all cases? */
+ mapping_set_error(mapping, EIO);
+ }
+ if (PagePrivate(p)) {
+ if (!try_to_release_page(p, GFP_KERNEL)) {
+ /*
+ * Normally this should not happen because we
+ * have the lock. What should we do
+ * here. wait on the page? (TBD)
+ */
+ printk(KERN_ERR
+ "MCE: Trying to release dirty page failed\n");
+ return FAILED;
+ }
+ } else if (mapping) {
+ cancel_dirty_page(p, PAGE_CACHE_SIZE);
+ }
+ return me_pagecache_clean(p);
+}
+
+/*
+ * Dirty swap cache.
+ * Cannot map back to the process because the rmaps are gone. Instead we rely
+ * on any subsequent re-fault to run into the Poison bit. This is not optimal.
+ */
+static int me_swapcache_dirty(struct page *p)
+{
+ delete_from_swap_cache(p);
+ return DELAYED;
+}
+
+/*
+ * Clean swap cache.
+ */
+static int me_swapcache_clean(struct page *p)
+{
+ delete_from_swap_cache(p);
+ return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p)
+{
+ return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * This is quite tricky because we can access page at any time
+ * in its life cycle.
+ *
+ * This is not complete. More states could be added.
+ */
+static struct page_state {
+ unsigned long mask;
+ unsigned long res;
+ char *msg;
+ int (*action)(struct page *p);
+} error_states[] = {
+#define F(x) (1UL << PG_ ## x)
+ { F(reserved), F(reserved), "reserved kernel", me_ignore },
+ { F(buddy), F(buddy), "free kernel", me_free },
+ /*
+ * Could in theory check if slab page is free or if we can drop
+ * currently unused objects without touching them. But just
+ * treat it as standard kernel for now.
+ */
+ { F(slab), F(slab), "kernel slab", me_kernel },
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ { F(head), F(head), "hugetlb", me_huge_page },
+ { F(tail), F(tail), "hugetlb", me_huge_page },
+#else
+ { F(compound), F(compound), "hugetlb", me_huge_page },
+#endif
+ { F(swapcache)|F(dirty), F(swapcache)|F(dirty), "dirty swapcache",
+ me_swapcache_dirty },
+ { F(swapcache)|F(dirty), F(swapcache), "clean swapcache",
+ me_swapcache_clean },
+#ifdef CONFIG_UNEVICTABLE_LRU
+ { F(unevictable)|F(dirty), F(unevictable)|F(dirty),
+ "unevictable dirty page cache", me_pagecache_dirty },
+ { F(unevictable), F(unevictable), "unevictable page cache",
+ me_pagecache_clean },
+#endif
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+ { F(mlocked)|F(dirty), F(mlocked)|F(dirty), "mlocked dirty page cache",
+ me_pagecache_dirty },
+ { F(mlocked), F(mlocked), "mlocked page cache", me_pagecache_clean },
+#endif
+ { F(lru)|F(dirty), F(lru)|F(dirty), "dirty lru", me_pagecache_dirty },
+ { F(lru)|F(dirty), F(lru), "clean lru", me_pagecache_clean },
+ { F(swapbacked), F(swapbacked), "anonymous", me_pagecache_clean },
+ /*
+ * More states could be added here.
+ */
+ { 0, 0, "unknown page state", me_unknown }, /* must be at end */
+#undef F
+};
+
+static void page_action(char *msg, struct page *p, int (*action)(struct page *),
+ unsigned long pfn)
+{
+ int ret;
+
+ printk(KERN_ERR
+ "MCE: Starting recovery on %s page %lx corrupted by hardware\n",
+ msg, pfn);
+ ret = action(p);
+ printk(KERN_ERR "MCE: Recovery of %s page %lx: %s\n",
+ msg, pfn, action_name[ret]);
+ if (page_count(p) != 1)
+ printk(KERN_ERR
+ "MCE: Page %lx (flags %lx) still referenced by %d users after recovery\n",
+ pfn, p->flags, page_count(p));
+
+ /* Could do more checks here if page looks ok */
+ atomic_long_add(1, &mce_bad_pages);
+
+ /*
+ * Could adjust zone counters here to correct for the missing page.
+ */
+}
+
+#define N_UNMAP_TRIES 5
+
+static int poison_page_prepare(struct page *p, unsigned long pfn, int trapno)
+{
+ if (PagePoison(p)) {
+ printk(KERN_ERR
+ "MCE: Error for already poisoned page at %lx\n", pfn);
+ return -1;
+ }
+ SetPagePoison(p);
+
+ if (!PageReserved(p) && !PageSlab(p) && page_mapped(p)) {
+ LIST_HEAD(tokill);
+ int ret;
+ int i;
+
+ /*
+ * First collect all the processes that have the page
+ * mapped. This has to be done before try_to_unmap,
+ * because ttu takes the rmap data structures down.
+ *
+ * Error handling: We ignore errors here because
+ * there's nothing that can be done.
+ *
+ * RED-PEN some cases in process exit seem to deadlock
+ * on the page lock. drop it or add poison checks?
+ */
+ if (sysctl_memory_failure_early_kill)
+ collect_procs(p, &tokill);
+
+ /*
+ * try_to_unmap can fail temporarily due to races.
+ * Try a few times (RED-PEN better strategy?)
+ */
+ for (i = 0; i < N_UNMAP_TRIES; i++) {
+ ret = try_to_unmap(p, TTU_UNMAP|
+ TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ if (ret == SWAP_SUCCESS)
+ break;
+ Dprintk("MCE: try_to_unmap retry needed %d\n", ret);
+ }
+
+ /*
+ * Now that the dirty bit has been propagated to the
+ * struct page and all unmaps done we can decide if
+ * killing is needed or not. Only kill when the page
+ * was dirty, otherwise the tokill list is merely
+ * freed. When there was a problem unmapping earlier
+ * use a more forceful uncatchable kill to prevent
+ * any accesses to the poisoned memory.
+ */
+ kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+ ret != SWAP_SUCCESS);
+ }
+
+ return 0;
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ *
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+ Dprintk("memory failure %lx\n", pfn);
+
+ if (!pfn_valid(pfn)) {
+ printk(KERN_ERR
+ "MCE: Hardware memory corruption in memory outside kernel control at %lx\n",
+ pfn);
+ } else {
+ struct page *p = pfn_to_page(pfn);
+ struct page_state *ps;
+
+ /*
+ * Make sure no one frees the page outside our control.
+ */
+ get_page(p);
+ lock_page_nosync(p);
+
+ if (poison_page_prepare(p, pfn, trapno) < 0)
+ goto out;
+
+ for (ps = error_states;; ps++) {
+ if ((p->flags & ps->mask) == ps->res) {
+ page_action(ps->msg, p, ps->action, pfn);
+ break;
+ }
+ }
+out:
+ unlock_page(p);
+ }
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2009-04-07 16:39:39.000000000 +0200
+++ linux/include/linux/mm.h 2009-04-07 16:39:39.000000000 +0200
@@ -1322,6 +1322,10 @@
extern void *alloc_locked_buffer(size_t size);
extern void free_locked_buffer(void *buffer, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
extern void release_locked_buffer(void *buffer, size_t size);
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c 2009-04-07 16:39:21.000000000 +0200
+++ linux/kernel/sysctl.c 2009-04-07 16:39:39.000000000 +0200
@@ -1266,6 +1266,20 @@
.extra2 = &one,
},
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "memory_failure_early_kill",
+ .data = &sysctl_memory_failure_early_kill,
+ .maxlen = sizeof(sysctl_memory_failure_early_kill),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
+
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c 2009-04-07 16:39:21.000000000 +0200
+++ linux/fs/proc/meminfo.c 2009-04-07 16:39:39.000000000 +0200
@@ -97,7 +97,11 @@
"Committed_AS: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+ "BadPages: %8lu kB\n"
+#endif
+ ,
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -144,6 +148,9 @@
(unsigned long)VMALLOC_TOTAL >> 10,
vmi.used >> 10,
vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+ ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
);
hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig 2009-04-07 16:39:21.000000000 +0200
+++ linux/mm/Kconfig 2009-04-07 16:39:39.000000000 +0200
@@ -223,3 +223,6 @@
config MMU_NOTIFIER
bool
+
+config MEMORY_FAILURE
+ bool
Newer Intel CPUs support a new class of machine checks called recoverable
action optional.
Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about it using a machine check
exception. The OS can then take appropriate action, like killing the
process with the corrupted data or logging the event properly to disk.
This is done by the new generic high level memory failure handler added in an
earlier patch. The high level handler takes the address of the failed
memory and takes the appropriate action, like killing the process.
The high level handler cannot be directly called from the machine check
exception though, because it has to run in a defined process context to be able
to sleep when taking VM locks (it is not expected to sleep for a long time,
just do so in some exceptional cases like lock contention).
Thus the MCE handler has to queue a work item for process context,
trigger process context and then call the high level handler from there.
This patch adds two paths to process context: through a per thread kernel exit
notify_user() callback or through a high priority work item. The first
runs when the process exits back to user space, the other when it goes
to sleep and there is no higher priority process.
The machine check handler will schedule both, and whoever runs first
will grab the event. This is done because quick reaction to this
event is critical to avoid a potentially more fatal machine check
when the corruption is consumed.
There is a simple lockless ring buffer to queue the corrupted
addresses between the exception handler and the process context handler.
Then in process context it just calls the high level VM code with
the corrupted PFNs.
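
For readers who want to experiment with the queueing scheme outside the
kernel, here is a user-space model of a single-producer, single-consumer
ring with the same "one entry less" capacity rule. It is a sketch under
assumptions: it uses C11 atomics with release/acquire ordering where the
patch uses wmb(), and all sk_* names are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_RING_SIZE 16	/* one slot stays unused to tell full from empty */

struct sk_ring {
	_Atomic unsigned short start;	/* advanced only by the consumer */
	_Atomic unsigned short end;	/* advanced only by the producer */
	unsigned long ring[SK_RING_SIZE];
};

/* Producer side: returns false when the ring is full and the PFN is dropped. */
bool sk_ring_add(struct sk_ring *r, unsigned long pfn)
{
	unsigned short end = atomic_load_explicit(&r->end, memory_order_relaxed);
	unsigned short next = (end + 1) % SK_RING_SIZE;

	if (next == atomic_load_explicit(&r->start, memory_order_acquire))
		return false;
	r->ring[end] = pfn;
	/* release: the slot contents become visible before the new end index */
	atomic_store_explicit(&r->end, next, memory_order_release);
	return true;
}

/* Consumer side: returns false when the ring is empty. */
bool sk_ring_get(struct sk_ring *r, unsigned long *pfn)
{
	unsigned short start = atomic_load_explicit(&r->start, memory_order_relaxed);

	if (start == atomic_load_explicit(&r->end, memory_order_acquire))
		return false;
	*pfn = r->ring[start];
	/* release: the slot is handed back only after it has been read */
	atomic_store_explicit(&r->start, (start + 1) % SK_RING_SIZE,
			      memory_order_release);
	return true;
}
```

Because each index has exactly one writer, no lock is needed. A full ring
simply rejects the new entry, which corresponds to the patch ignoring the
Action Optional error on overflow.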
The patch adds the required code to extract the failed address from
the CPU's machine check registers. It doesn't try to handle all
possible cases -- the specification has 6 different ways to specify
a memory address -- but only physical addresses (MCM_ADDR_PHYS), as
checked in mce_usable_address().
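
The address check can be tried in user space with the sketch below, which
mirrors mce_usable_address() from this patch. The constant values are my
understanding of the x86 machine check definitions and should be verified
against asm/mce.h. SK_PAGE_SHIFT assumes 4 KiB pages, and the sk_* names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; verify against the kernel headers before relying on them. */
#define SK_STATUS_MISCV (1ULL << 59)	/* MCi_MISC holds valid data */
#define SK_STATUS_ADDRV (1ULL << 58)	/* MCi_ADDR holds valid data */
#define SK_ADDR_PHYS    2		/* address mode: physical address */
#define SK_PAGE_SHIFT   12		/* assumes 4 KiB pages */

bool sk_usable_address(uint64_t status, uint64_t misc)
{
	if (!(status & SK_STATUS_MISCV) || !(status & SK_STATUS_ADDRV))
		return false;
	/* bits 5:0: least significant valid bit of the reported address */
	if ((misc & 0x3f) > SK_PAGE_SHIFT)
		return false;
	/* bits 8:6: address mode */
	if (((misc >> 6) & 7) != SK_ADDR_PHYS)
		return false;
	return true;
}

/* Page frame number that would be queued for a given physical address. */
uint64_t sk_addr_to_pfn(uint64_t addr)
{
	return addr >> SK_PAGE_SHIFT;
}
```

An address whose valid bits start above the page shift cannot identify a
single page, so it is rejected rather than guessed at.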
Most of the required checking has already been done earlier in the
mce_severity rule checking engine. Following the Intel
recommendations, Action Optional errors are only enabled for known
situations (encoded in MCACODs). The errors are ignored otherwise,
because they are action optional.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/Kconfig | 1
arch/x86/include/asm/irq_vectors.h | 1
arch/x86/include/asm/mce.h | 1
arch/x86/kernel/cpu/mcheck/mce-severity.c | 8 +-
arch/x86/kernel/cpu/mcheck/mce_64.c | 114 ++++++++++++++++++++++++++++++
arch/x86/kernel/signal.c | 2
6 files changed, 125 insertions(+), 2 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:39:39.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:39:39.000000000 +0200
@@ -14,6 +14,7 @@
#include <linux/sched.h>
#include <linux/string.h>
#include <linux/rcupdate.h>
+#include <linux/mm.h>
#include <linux/kallsyms.h>
#include <linux/sysdev.h>
#include <linux/miscdevice.h>
@@ -79,6 +80,8 @@
[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
};
+static DEFINE_PER_CPU(struct work_struct, mce_work);
+
/* Do initial initialization of a struct mce */
void mce_setup(struct mce *m)
{
@@ -273,6 +276,52 @@
wrmsrl(msr, v);
}
+/*
+ * Simple lockless ring to communicate PFNs from the exception handler with the
+ * process context work function. This is vastly simplified because there's
+ * only a single reader and a single writer.
+ */
+#define MCE_RING_SIZE 16 /* we use one entry less */
+
+struct mce_ring {
+ unsigned short start;
+ unsigned short end;
+ unsigned long ring[MCE_RING_SIZE];
+};
+static DEFINE_PER_CPU(struct mce_ring, mce_ring);
+
+static int mce_ring_empty(void)
+{
+ struct mce_ring *r = &__get_cpu_var(mce_ring);
+
+ return r->start == r->end;
+}
+
+static int mce_ring_get(unsigned long *pfn)
+{
+ struct mce_ring *r = &__get_cpu_var(mce_ring);
+
+ if (r->start == r->end)
+ return 0;
+ *pfn = r->ring[r->start];
+ r->start = (r->start + 1) % MCE_RING_SIZE;
+ return 1;
+}
+
+static int mce_ring_add(unsigned long pfn)
+{
+ struct mce_ring *r = &__get_cpu_var(mce_ring);
+ unsigned next;
+
+ next = (r->end + 1) % MCE_RING_SIZE;
+ if (next == r->start)
+ return -1;
+ r->ring[r->end] = pfn;
+ wmb();
+ r->end = next;
+ return 0;
+}
+
int mce_available(struct cpuinfo_x86 *c)
{
if (mce_dont_init)
@@ -293,6 +342,15 @@
m->ip = mce_rdmsrl(rip_msr);
}
+static void mce_schedule_work(void)
+{
+ if (!mce_ring_empty()) {
+ struct work_struct *work = &__get_cpu_var(mce_work);
+ if (!work_pending(work))
+ schedule_work(work);
+ }
+}
+
/*
* Called after interrupts have been reenabled again
* when a MCE happened during an interrupts off region
@@ -304,6 +362,7 @@
exit_idle();
irq_enter();
mce_notify_irq();
+ mce_schedule_work();
irq_exit();
}
@@ -311,6 +370,13 @@
{
if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
mce_notify_irq();
+ /*
+ * Triggering the work queue here is just an insurance
+ * policy in case the syscall exit notify handler
+ * doesn't run soon enough or ends up running on the
+ * wrong CPU (can happen when audit sleeps)
+ */
+ mce_schedule_work();
return;
}
@@ -669,6 +735,23 @@
return ret;
}
+/*
+ * Check if the address reported by the CPU is in a format we can parse.
+ * It would be possible to add code for most other cases, but all would
+ * be somewhat complicated (e.g. segment offset would require an instruction
+ * parser). So only support physical addresses up to page granularity for now.
+ */
+static int mce_usable_address(struct mce *m)
+{
+ if (!(m->status & MCI_STATUS_MISCV) || !(m->status & MCI_STATUS_ADDRV))
+ return 0;
+ if ((m->misc & 0x3f) > PAGE_SHIFT)
+ return 0;
+ if (((m->misc >> 6) & 7) != MCM_ADDR_PHYS)
+ return 0;
+ return 1;
+}
+
static void mce_clear_state(unsigned long *toclear)
{
int i;
@@ -802,6 +885,16 @@
if (m.status & MCI_STATUS_ADDRV)
m.addr = mce_rdmsrl(MSR_IA32_MC0_ADDR + i*4);
+ /*
+ * Action optional error. Queue address for later processing.
+ * When the ring overflows we just ignore the AO error.
+ * RED-PEN add some logging mechanism when
+ * mce_usable_address or mce_ring_add fails.
+ * RED-PEN don't ignore overflow for tolerant == 0
+ */
+ if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
+ mce_ring_add(m.addr >> PAGE_SHIFT);
+
mce_get_rip(&m, regs);
mce_log(&m);
@@ -852,6 +945,26 @@
}
EXPORT_SYMBOL_GPL(do_machine_check);
+/*
+ * Called after mce notification in process context. This code
+ * is allowed to sleep. Call the high level VM handler to process
+ * any corrupted pages.
+ * Assume that the work queue code only calls this one at a time
+ * per CPU.
+ */
+void mce_notify_process(void)
+{
+ unsigned long pfn;
+ mce_notify_irq();
+ while (mce_ring_get(&pfn))
+ memory_failure(pfn, MCE_VECTOR);
+}
+
+static void mce_process_work(struct work_struct *dummy)
+{
+ mce_notify_process();
+}
+
#ifdef CONFIG_X86_MCE_INTEL
/***
* mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
@@ -1088,6 +1201,7 @@
mce_init();
mce_cpu_features(c);
mce_init_timer();
+ INIT_WORK(&__get_cpu_var(mce_work), mce_process_work);
}
/*
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:39:39.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:39:39.000000000 +0200
@@ -163,6 +163,7 @@
extern void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
extern int mce_notify_irq(void);
+extern void mce_notify_process(void);
#endif /* !CONFIG_X86_32 */
Index: linux/arch/x86/kernel/signal.c
===================================================================
--- linux.orig/arch/x86/kernel/signal.c 2009-04-07 16:39:39.000000000 +0200
+++ linux/arch/x86/kernel/signal.c 2009-04-07 16:39:39.000000000 +0200
@@ -860,7 +860,7 @@
#if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE)
/* notify userspace of pending MCEs */
if (thread_info_flags & _TIF_MCE_NOTIFY)
- mce_notify_irq();
+ mce_notify_process();
#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
/* deal with pending signal delivery */
Index: linux/arch/x86/kernel/cpu/mcheck/mce-severity.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-severity.c 2009-04-07 16:39:00.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce-severity.c 2009-04-07 16:39:39.000000000 +0200
@@ -67,7 +67,13 @@
"Action required; unknown MCACOD", SER),
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
"Action required with lost events", SER),
- /* AO add known MCACODs here */
+
+ /* known AO MCACODs: handle by calling high level handler */
+ MASK(MCI_UC_SAR|0xfff0, MCI_UC_S|0xc0, AO,
+ "Action optional: memory scrubbing error", SER),
+ MASK(MCI_UC_SAR|MCACOD, MCI_UC_S|0x17a, AO,
+ "Action optional: last level cache writeback error", SER),
+
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
"Action optional unknown MCACOD", SER),
MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
Index: linux/arch/x86/include/asm/irq_vectors.h
===================================================================
--- linux.orig/arch/x86/include/asm/irq_vectors.h 2009-04-07 16:39:00.000000000 +0200
+++ linux/arch/x86/include/asm/irq_vectors.h 2009-04-07 16:39:39.000000000 +0200
@@ -25,6 +25,7 @@
*/
#define NMI_VECTOR 0x02
+#define MCE_VECTOR 0x12
/*
* IDT vectors usable for external interrupt sources start
Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig 2009-04-07 16:39:00.000000000 +0200
+++ linux/arch/x86/Kconfig 2009-04-07 16:39:39.000000000 +0200
@@ -760,6 +760,7 @@
config X86_MCE
bool "Machine Check Exception"
+ select MEMORY_FAILURE
---help---
Machine Check Exception support allows the processor to notify the
kernel if it detects a problem (e.g. overheating, component failure).
Impact: optional, useful for debugging
Add a new madvise subcommand to inject poison for some
pages in a process' address space. This is useful for
testing the poison page handling.
Open issues:
- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory, with a multithreaded
injector one could inject poison into a page belonging to another
process this way. Not a serious issue right now.
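
A test program could drive the new subcommand roughly as sketched below.
This is an assumption-laden example: SK_MADV_POISON uses the value 12 that
this patch assigns to MADV_POISON, which may mean something else on kernels
without the patch, so the wrapper should only be called on a kernel that
carries it. The sk_* helpers are hypothetical names for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Value of MADV_POISON as defined by this patch; an assumption elsewhere. */
#define SK_MADV_POISON 12

/*
 * Number of pages the injection loop would visit for a range of len bytes:
 * it starts at the given address and advances one page at a time.
 */
size_t sk_pages_visited(size_t len, size_t page_size)
{
	assert(page_size > 0);
	return (len + page_size - 1) / page_size;
}

/*
 * Ask the kernel to poison [addr, addr + len). Needs CAP_SYS_ADMIN and a
 * kernel with this patch applied. Returns the madvise() result: 0 on
 * success, -1 with errno set on failure.
 */
int sk_inject_poison(void *addr, size_t len)
{
	return madvise(addr, len, SK_MADV_POISON);
}
```

Each visited page is handed to memory_failure(), so with early kill enabled
the calling process should expect a SIGBUS for dirty pages it has mapped.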
Signed-off-by: Andi Kleen <[email protected]>
---
include/asm-generic/mman.h | 1 +
mm/madvise.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c 2009-04-07 16:36:29.000000000 +0200
+++ linux/mm/madvise.c 2009-04-07 16:39:39.000000000 +0200
@@ -208,6 +208,38 @@
return error;
}
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_poison(unsigned long start, unsigned long end)
+{
+ /*
+ * RED-PEN
+ * This allows tying up arbitrary amounts of memory.
+ * Might be a good idea to disable it inside containers even for root.
+ */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ for (; start < end; start += PAGE_SIZE) {
+ struct page *p;
+ int ret = get_user_pages(current, current->mm, start, 1,
+ 0, 0, &p, NULL);
+ if (ret != 1)
+ return ret;
+ put_page(p);
+ /*
+ * RED-PEN page can be reused, but otherwise we'll have to fight with the
+ * refcnt
+ */
+ printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+ page_to_pfn(p), start);
+ memory_failure(page_to_pfn(p), 0);
+ }
+ return 0;
+}
+#endif
+
static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
@@ -290,6 +322,11 @@
int write;
size_t len;
+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_POISON)
+ return madvise_poison(start, start+len_in);
+#endif
+
write = madvise_need_mmap_write(behavior);
if (write)
down_write(&current->mm->mmap_sem);
Index: linux/include/asm-generic/mman.h
===================================================================
--- linux.orig/include/asm-generic/mman.h 2009-04-07 16:36:29.000000000 +0200
+++ linux/include/asm-generic/mman.h 2009-04-07 16:39:39.000000000 +0200
@@ -34,6 +34,7 @@
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
+#define MADV_POISON 12 /* poison the page (root only) */
/* compatibility flags */
#define MAP_FILE 0
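For readers who want to exercise the injector from user space, the calling side might look roughly like the sketch below. It assumes the advice value 12 from the mman.h hunk above; that number is specific to this patch and may mean something else on other kernels, so it is passed in as a parameter rather than hard-wired into the call. page_floor() and inject_poison() are illustrative names and not part of the patch. Per the capable() check above, the call is expected to fail with EPERM without CAP_SYS_ADMIN.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Advice value proposed by this patch; an assumption for any other kernel. */
#define PATCH_MADV_POISON 12

/* Round an address down to the start of its page (page_size: power of two). */
uintptr_t page_floor(uintptr_t addr, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~((uintptr_t)page_size - 1);
}

/*
 * Ask the kernel to inject a memory failure for [addr, addr + len).
 * Returns 0 on success or a negative errno value on failure.
 * 'advice' would be PATCH_MADV_POISON on a kernel carrying this patch.
 */
int inject_poison(void *addr, size_t len, int advice)
{
    if (madvise(addr, len, advice) != 0)
        return -errno;
    return 0;
}
```

A test program would typically mmap() a region, touch it so the pages are populated, and then call inject_poison(region, length, PATCH_MADV_POISON) on a page-aligned range.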
Impact: cleanup
Rename the mce_notify_user function to mce_notify_irq. The next
patch will split the wakeup handling of interrupt context
and of process context and it's better to give it a clearer
name for this.
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/mce.h | 2 +-
arch/x86/kernel/cpu/mcheck/mce-inject.c | 2 +-
arch/x86/kernel/cpu/mcheck/mce_64.c | 10 +++++-----
arch/x86/kernel/cpu/mcheck/mce_intel_64.c | 2 +-
arch/x86/kernel/signal.c | 2 +-
5 files changed, 9 insertions(+), 9 deletions(-)
Index: linux/arch/x86/kernel/cpu/mcheck/mce_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_64.c 2009-04-07 16:43:04.000000000 +0200
@@ -303,14 +303,14 @@
ack_APIC_irq();
exit_idle();
irq_enter();
- mce_notify_user();
+ mce_notify_irq();
irq_exit();
}
static void mce_report_event(struct pt_regs *regs)
{
if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
- mce_notify_user();
+ mce_notify_irq();
return;
}
@@ -904,7 +904,7 @@
* polling interval, otherwise increase the polling interval.
*/
n = &__get_cpu_var(next_interval);
- if (mce_notify_user()) {
+ if (mce_notify_irq()) {
*n = max(*n/2, HZ/100);
} else {
*n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));
@@ -926,7 +926,7 @@
* Can be called from interrupt context, but not from machine check/NMI
* context.
*/
-int mce_notify_user(void)
+int mce_notify_irq(void)
{
/* Not more than two messages every minute */
static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
@@ -950,7 +950,7 @@
}
return 0;
}
-EXPORT_SYMBOL_GPL(mce_notify_user);
+EXPORT_SYMBOL_GPL(mce_notify_irq);
/*
* Initialize Machine Checks for a CPU.
Index: linux/arch/x86/include/asm/mce.h
===================================================================
--- linux.orig/arch/x86/include/asm/mce.h 2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/include/asm/mce.h 2009-04-07 16:43:04.000000000 +0200
@@ -162,7 +162,7 @@
};
extern void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
-extern int mce_notify_user(void);
+extern int mce_notify_irq(void);
#endif /* !CONFIG_X86_32 */
Index: linux/arch/x86/kernel/cpu/mcheck/mce-inject.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce-inject.c 2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce-inject.c 2009-04-07 16:39:39.000000000 +0200
@@ -65,7 +65,7 @@
memset(&b, 0xff, sizeof(mce_banks_t));
printk(KERN_INFO "Starting machine check poll CPU %d\n", cpu);
machine_check_poll(0, &b);
- mce_notify_user();
+ mce_notify_irq();
printk(KERN_INFO "Finished machine check poll on CPU %d\n",
cpu);
}
Index: linux/arch/x86/kernel/signal.c
===================================================================
--- linux.orig/arch/x86/kernel/signal.c 2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/signal.c 2009-04-07 16:43:04.000000000 +0200
@@ -860,7 +860,7 @@
#if defined(CONFIG_X86_64) && defined(CONFIG_X86_MCE)
/* notify userspace of pending MCEs */
if (thread_info_flags & _TIF_MCE_NOTIFY)
- mce_notify_user();
+ mce_notify_irq();
#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
/* deal with pending signal delivery */
Index: linux/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
===================================================================
--- linux.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c 2009-04-07 16:39:21.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mcheck/mce_intel_64.c 2009-04-07 16:39:39.000000000 +0200
@@ -132,7 +132,7 @@
static void intel_threshold_interrupt(void)
{
machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
- mce_notify_user();
+ mce_notify_irq();
}
static void print_update(char *type, int *hdr, int num)
Andi Kleen wrote:
> This is rather tricky code and needs a lot of review. Undoubtedly it still
> has bugs.
It's just complex enough that it looks like it might have
more bugs, but I sure couldn't find any.
Hitting a bug in this code seems favorable to hitting
guaranteed memory corruption, so I hope Andrew or Ingo
will merge this into one of their trees.
> Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed.
On Tue, Apr 07, 2009 at 12:03:00PM -0400, Rik van Riel wrote:
> Andi Kleen wrote:
>
> >This is rather tricky code and needs a lot of review. Undoubtedly it still
> >has bugs.
>
> It's just complex enough that it looks like it might have
> more bugs, but I sure couldn't find any.
Thanks for the review.
Perhaps I didn't put it strongly enough: I know there are still bugs
in there (e.g. nonlinear mappings deadlock and there are some cases
where the reference count of the page doesn't drop to zero).
> Hitting a bug in this code seems favorable to hitting
> guaranteed memory corruption, so I hope Andrew or Ingo
Yes the alternative is always panic() when the hardware detects
the consumed corruption and bails out. So even if this code is buggy it's
very likely still an improvement. So it would be reasonable to
do a relatively early merge and improve further in tree.
> >Signed-off-by: Andi Kleen <[email protected]>
>
> Acked-by: Rik van Riel <[email protected]>
Thanks added
-Andi
--
[email protected] -- Speaking for myself only.
Hi Andi,
On Tue, Apr 07, 2009 at 05:10:10PM +0200, Andi Kleen wrote:
> +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> + struct to_kill **tkc)
> +{
> + struct vm_area_struct *vma;
> + struct task_struct *tsk;
> + struct anon_vma *av = page_lock_anon_vma(page);
> +
> + if (av == NULL) /* Not actually mapped anymore */
> + goto out;
> +
> + read_lock(&tasklist_lock);
> + for_each_process (tsk) {
> + if (!tsk->mm)
> + continue;
> + list_for_each_entry (vma, &av->head, anon_vma_node) {
> + if (vma->vm_mm == tsk->mm)
> + add_to_kill(tsk, page, vma, to_kill, tkc);
> + }
> + }
> + read_unlock(&tasklist_lock);
> +out:
> + page_unlock_anon_vma(av);
If !av, this doesn't need an unlock and in fact crashes due to
dereferencing NULL.
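The control flow being asked for can be modelled in user space as follows. This is only a sketch: struct anon_vma_model, model_lock() and model_unlock() are hypothetical stand-ins for the kernel's anon_vma locking helpers, and the walk over tasks and VMAs is reduced to a counter. The point it illustrates is that the NULL case returns before the unlock is reached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for an anon_vma; not kernel API. */
struct anon_vma_model {
    int locked;   /* 1 while the lock is held */
    int nr_vmas;  /* number of VMAs the walk would visit */
};

/* Models the lock helper: returns NULL when the page is no longer mapped. */
struct anon_vma_model *model_lock(struct anon_vma_model *av)
{
    if (av == NULL)
        return NULL;
    av->locked = 1;
    return av;
}

/* Models the unlock helper: it must never be handed NULL. */
void model_unlock(struct anon_vma_model *av)
{
    assert(av != NULL);
    av->locked = 0;
}

/* Returns how many VMAs were visited; 0 if the page was not mapped. */
int collect_procs_model(struct anon_vma_model *candidate)
{
    struct anon_vma_model *av = model_lock(candidate);
    int visited;

    if (av == NULL)
        return 0;           /* nothing was locked, so nothing to unlock */

    visited = av->nr_vmas;  /* stands in for the task/vma walk */

    model_unlock(av);
    return visited;
}
```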
> +static int poison_page_prepare(struct page *p, unsigned long pfn, int trapno)
> +{
> + if (PagePoison(p)) {
> + printk(KERN_ERR
> + "MCE: Error for already poisoned page at %lx\n", pfn);
> + return -1;
> + }
> + SetPagePoison(p);
TestSetPagePoison()?
On Tue, Apr 07, 2009 at 05:10:05PM +0200, Andi Kleen wrote:
>
> Bail out early when poisoned pages are found in page fault handling.
> Since they are poisoned they should not be mapped freshly
> into processes.
>
> This is generally handled in the same way as OOM, just a different
> error code is returned to the architecture code.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/memory.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
> +++ linux/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
> @@ -2560,6 +2560,10 @@
> goto oom;
> __SetPageUptodate(page);
>
> + /* Kludge for now until we take poisoned pages out of the free lists */
> + if (unlikely(PagePoison(page)))
> + return VM_FAULT_POISON;
> +
When memory_failure() hits a page still on the free list
(!page_count()) then the get_page() in memory_failure() will trigger a
VM_BUG. So either this check is unneeded or it should be
get_page_unless_zero() in memory_failure()?
How does this overlap with the bad page quarantine that ia64 uses
following an MCA?
Robin
On Tue, Apr 07, 2009 at 05:09:56PM +0200, Andi Kleen wrote:
>
> Upcoming Intel CPUs have support for recovering from some memory errors. This
> requires the OS to declare a page "poisoned", kill the processes associated
> with it and avoid using it in the future. This patchkit implements
> the necessary infrastructure in the VM.
>
> To quote the overview comment:
>
> * High level machine check handler. Handles pages reported by the
> * hardware as being corrupted usually due to a 2bit ECC memory or cache
> * failure.
> *
> * This focusses on pages detected as corrupted in the background.
> * When the current CPU tries to consume corruption the currently
> * running process can just be killed directly instead. This implies
> * that if the error cannot be handled for some reason it's safe to
> * just ignore it because no corruption has been consumed yet. Instead
> * when that happens another machine check will happen.
> *
> * Handles page cache pages in various states. The tricky part
> * here is that we can access any page asynchronous to other VM
> * users, because memory failures could happen anytime and anywhere,
> * possibly violating some of their assumptions. This is why this code
> * has to be extremely careful. Generally it tries to use normal locking
> * rules, as in get the standard locks, even if that means the
> * error handling takes potentially a long time.
> *
> * Some of the operations here are somewhat inefficient and have non
> * linear algorithmic complexity, because the data structures have not
> * been optimized for this case. This is in particular the case
> * for the mapping from a vma to a process. Since this case is expected
> * to be rare we hope we can get away with this.
>
> The code consists of the high level handler in mm/memory-failure.c,
> a new page poison bit and various checks in the VM to handle poisoned
> pages.
>
> The main target right now is KVM guests, but it works for all kinds
> of applications.
>
> For the KVM use there was need for a new signal type so that
> KVM can inject the machine check into the guest with the proper
> address. This in theory allows other applications to handle
> memory failures too. The expectation is that nearly all applications
> won't do that, but some very specialized ones might.
>
> This is not fully complete yet, in particular there are still various
> ways to access poison (crash dump, /proc/kcore etc.)
> that need to be plugged too.
>
> Also undoubtedly the high level handler still has bugs and cases
> it cannot recover from. For example nonlinear mappings deadlock right now
> and a few other cases lose references. Huge pages are not supported
> yet. Any additional testing, reviewing etc. welcome.
>
> The patch series requires the earlier x86 MCE feature series for the x86
> specific action optional part. The code can be tested without the x86 specific
> part using the injector, this only requires to enable the Kconfig entry
> manually in some Kconfig file (by default it is implicitly enabled
> by the architecture).
>
> -Andi
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, Apr 07, 2009 at 09:03:30PM +0200, Johannes Weiner wrote:
> On Tue, Apr 07, 2009 at 05:10:05PM +0200, Andi Kleen wrote:
> >
> > Bail out early when poisoned pages are found in page fault handling.
> > Since they are poisoned they should not be mapped freshly
> > into processes.
> >
> > This is generally handled in the same way as OOM, just a different
> > error code is returned to the architecture code.
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> >
> > ---
> > mm/memory.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > Index: linux/mm/memory.c
> > ===================================================================
> > --- linux.orig/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
> > +++ linux/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
> > @@ -2560,6 +2560,10 @@
> > goto oom;
> > __SetPageUptodate(page);
> >
> > + /* Kludge for now until we take poisoned pages out of the free lists */
> > + if (unlikely(PagePoison(page)))
> > + return VM_FAULT_POISON;
> > +
>
> When memory_failure() hits a page still on the free list
It won't free it then. Later on it will take it out of the free lists,
but that code is not written yet.
> (!page_count()) then the get_page() in memory_failure() will trigger a
> VM_BUG. So either this check is unneeded or it should be
So no bug
> get_page_unless_zero() in memory_failure()?
That's not what this is handling. The issue is that sometimes
the process can still be freeing it and we need to make sure it
never hits the free lists.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 07, 2009 at 02:13:00PM -0500, Robin Holt wrote:
> How does this overlap with the bad page quarantine that ia64 uses
> following an MCA?
It's much more comprehensive than what ia64 has, mostly due to
differing requirements. It also doesn't limit itself to user
mapped anonymous pages only.
-Andi
On Tue, Apr 07, 2009 at 08:51:46PM +0200, Johannes Weiner wrote:
> > +
> > + if (av == NULL) /* Not actually mapped anymore */
> > + goto out;
> > +
> > + read_lock(&tasklist_lock);
> > + for_each_process (tsk) {
> > + if (!tsk->mm)
> > + continue;
> > + list_for_each_entry (vma, &av->head, anon_vma_node) {
> > + if (vma->vm_mm == tsk->mm)
> > + add_to_kill(tsk, page, vma, to_kill, tkc);
> > + }
> > + }
> > + read_unlock(&tasklist_lock);
> > +out:
> > + page_unlock_anon_vma(av);
>
> If !av, this doesn't need an unlock and in fact crashes due to
> dereferencing NULL.
Good point. Fixed. Thanks.
>
> > +static int poison_page_prepare(struct page *p, unsigned long pfn, int trapno)
> > +{
> > + if (PagePoison(p)) {
> > + printk(KERN_ERR
> > + "MCE: Error for already poisoned page at %lx\n", pfn);
> > + return -1;
> > + }
> > + SetPagePoison(p);
>
> TestSetPagePoison()?
It doesn't matter in this case because it doesn't need to be atomic.
The normal reason for TestSet is atomicity requirements. If someone
feels strongly about it I can add it.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 07, 2009 at 09:31:45PM +0200, Andi Kleen wrote:
> On Tue, Apr 07, 2009 at 09:03:30PM +0200, Johannes Weiner wrote:
> > On Tue, Apr 07, 2009 at 05:10:05PM +0200, Andi Kleen wrote:
> > >
> > > Bail out early when poisoned pages are found in page fault handling.
> > > Since they are poisoned they should not be mapped freshly
> > > into processes.
> > >
> > > This is generally handled in the same way as OOM, just a different
> > > error code is returned to the architecture code.
> > >
> > > Signed-off-by: Andi Kleen <[email protected]>
> > >
> > > ---
> > > mm/memory.c | 7 +++++++
> > > 1 file changed, 7 insertions(+)
> > >
> > > Index: linux/mm/memory.c
> > > ===================================================================
> > > --- linux.orig/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
> > > +++ linux/mm/memory.c 2009-04-07 16:39:39.000000000 +0200
> > > @@ -2560,6 +2560,10 @@
> > > goto oom;
> > > __SetPageUptodate(page);
> > >
> > > + /* Kludge for now until we take poisoned pages out of the free lists */
> > > + if (unlikely(PagePoison(page)))
> > > + return VM_FAULT_POISON;
> > > +
> >
> > When memory_failure() hits a page still on the free list
>
> It won't free it then. Later on it will take it out of the free lists,
> but that code is not written yet.
>
> > (!page_count()) then the get_page() in memory_failure() will trigger a
> > VM_BUG. So either this check is unneeded or it should be
>
> So no bug
> > get_page_unless_zero() in memory_failure()?
>
> That's not what this is handling. The issue is that sometimes
> the process can still be freeing it and we need to make sure it
> never hits the free lists.
I think we missed each other here. I wasn't talking about _why_ you
take that reference -- that is clear. But I see these two
possibilities:
a) memory_failure() is called on a page on the free list, the
get_page() will trigger a bug because the refcount is 0
b) if that is not possible, the above check is not needed
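Case (a) comes down to the difference between an unconditional reference grab and one that refuses to raise a count that is already zero. The sketch below models the second behaviour in user space with C11 atomics. struct page_ref_model and ref_get_unless_zero() are hypothetical names, and this is not the kernel's implementation, only an illustration of the semantics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page reference count. */
struct page_ref_model {
    atomic_int count;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true when a reference was taken, false for a "free" page.
 */
bool ref_get_unless_zero(struct page_ref_model *p)
{
    int old = atomic_load(&p->count);

    while (old != 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&p->count, &old, old + 1))
            return true;
    }
    return false;
}
```

A caller modelled on this would skip a page for which the function returns false, instead of tripping an assertion on a zero count.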
> I think we missed each other here. I wasn't talking about _why_ you
> take that reference -- that is clear. But I see these two
> possibilities:
>
> a) memory_failure() is called on a page on the free list, the
> get_page() will trigger a bug because the refcount is 0
Ah got it now. Sorry for misreading you. That's indeed a problem.
Fixing.
free pages was something my injector based test suite didn't cover :/
> b) if that is not possible, the above check is not needed
There was at least one case where the process could free it anyways.
I think. Or maybe that was something I fixed in a different way.
It's possible this check is not needed, but it's probably safer
to keep it (and it's all super slow path)
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 07, 2009 at 10:24:49PM +0200, Andi Kleen wrote:
> > I think we missed each other here. I wasn't talking about _why_ you
> > take that reference -- that is clear. But I see these two
> > possibilities:
> >
> > a) memory_failure() is called on a page on the free list, the
> > get_page() will trigger a bug because the refcount is 0
>
> Ah got it now. Sorry for misreading you. That's indeed a problem.
> Fixing.
>
> free pages was something my injector based test suite didn't cover :/
Hm, perhaps walking mem_map and poisoning pages at random? :)
> > b) if that is not possible, the above check is not needed
>
> There was at least one case where the process could free it anyways.
> I think. Or maybe that was something I fixed in a different way.
> It's possible this check is not needed, but it's probably safer
> to keep it (and it's all super slow path)
Ok. I first thought it could be useful to shrink the race window
between allocating the page and installing the pte but the rest of the
poisoning code should be able to cope.
Hannes
Acked-by: Christoph Lameter <[email protected]>
Could you separate the semantic changes to flag checking for migration
out for easier review?
On Tue, 7 Apr 2009, Andi Kleen wrote:
> +
> +enum ttu_flags {
> + TTU_UNMAP = 0, /* unmap mode */
> + TTU_MIGRATION = 1, /* migration mode */
> + TTU_MUNLOCK = 2, /* munlock mode */
> + TTU_ACTION_MASK = 0xff,
> +
> + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
Ignoring MLOCK? This means we are violating POSIX which says that an
MLOCKed page cannot be unmapped from a process? Note that page migration
does this under special pte entries so that the page will never appear to
be unmapped to user space.
How does that work for the poisoning case? We substitute a fresh page?
On Tue, Apr 07, 2009 at 05:11:26PM -0400, Christoph Lameter wrote:
>
> Could you separate the semantic changes to flag checking for migration
You mean to try_to_unmap?
> out for easier review?
That's already done. The first patch doesn't change any semantics,
just flags/action checking. Or rather any semantics change in there
would be a bug.
Only the two later ttu patches add to the semantics.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 07, 2009 at 05:19:19PM -0400, Christoph Lameter wrote:
> On Tue, 7 Apr 2009, Andi Kleen wrote:
>
> > +
> > +enum ttu_flags {
> > + TTU_UNMAP = 0, /* unmap mode */
> > + TTU_MIGRATION = 1, /* migration mode */
> > + TTU_MUNLOCK = 2, /* munlock mode */
> > + TTU_ACTION_MASK = 0xff,
> > +
> > + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
>
>
> Ignoring MLOCK? This means we are violating POSIX which says that an
> MLOCKed page cannot be unmapped from a process?
I'm sure you can find sufficiently vague language in the document
to standards-lawyer around that requirement @)
The alternative would be to panic.
> Note that page migration
> does this under special pte entries so that the page will never appear to
> be unmapped to user space.
>
> How does that work for the poisoning case? We substitute a fresh page?
It depends on the state of the page. If it was a clean disk mapped
page yes (it's just invalidated and can be reloaded). If it's a dirty anon
page the process is normally killed first (with advisory mode on) or only
killed when it hits the corrupted page. The process can also
catch the signal if it chooses so. The late killing works with
a special entry similar to the migration case, but that results
in a special SIGBUS.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, 7 Apr 2009, Andi Kleen wrote:
> On Tue, Apr 07, 2009 at 05:11:26PM -0400, Christoph Lameter wrote:
> >
> > Could you separate the semantic changes to flag checking for migration
>
> You mean to try_to_unmap?
I mean the changes to checking the pte contents for a migratable /
swappable page. Those are significant independent from this patchset and
would be useful to review independently.
On Tue, 7 Apr 2009, Andi Kleen wrote:
> > Ignoring MLOCK? This means we are violating POSIX which says that an
> > MLOCKed page cannot be unmapped from a process?
>
> I'm sure you can find sufficiently vague language in the document
> to standards-lawyer around that requirement @)
>
> The alternative would be to panic.
If you unmap an MLOCKed page then you may get memory corruption because
f.e. the Infiniband layer is doing DMA to that page.
> > How does that work for the poisoning case? We substitute a fresh page?
>
> It depends on the state of the page. If it was a clean disk mapped
> page yes (it's just invalidated and can be reloaded). If it's a dirty anon
> page the process is normally killed first (with advisory mode on) or only
> killed when it hits the corrupted page. The process can also
> catch the signal if it chooses so. The late killing works with
> a special entry similar to the migration case, but that results
> in a special SIGBUS.
I think a process needs to be killed if any MLOCKed page gets corrupted
because the OS cannot keep the POSIX guarantees.
On Tue, Apr 07, 2009 at 05:56:28PM -0400, Christoph Lameter wrote:
> On Tue, 7 Apr 2009, Andi Kleen wrote:
>
> > On Tue, Apr 07, 2009 at 05:11:26PM -0400, Christoph Lameter wrote:
> > >
> > > Could you separate the semantic changes to flag checking for migration
> >
> > You mean to try_to_unmap?
>
> I mean the changes to checking the pte contents for a migratable /
> swappable page. Those are significant independent from this patchset and
> would be useful to review independently.
Sorry I'm still not quite sure what you're asking for.
Are you asking about the fault path or about try_to_unmap or some
other path?
And why do you want a separate patchset versus merely a separate patch?
(afaik the patches to generic code are already pretty separated)
I don't really change the semantics of the migration or swap code itself
for example. At least not consciously. If I did that would be a bug.
e.g. the changes to try_to_unmap are two stages:
- add flags/action code. Everything should still do the same, just
the flags are passed around differently.
- add a check for an already poisoned page and insert a poison
swap entry for those
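The first stage relies on the layout of enum ttu_flags quoted earlier in the thread: the low byte selects the action and higher bits carry modifiers. A minimal sketch of how a caller might take such a value apart is shown below. The enum values are copied from the quoted patch; ttu_action() and ttu_ignores_mlock() are hypothetical helpers added for illustration only.

```c
#include <assert.h>

/* Values as quoted from the patch earlier in this thread. */
enum ttu_flags {
    TTU_UNMAP = 0,               /* unmap mode */
    TTU_MIGRATION = 1,           /* migration mode */
    TTU_MUNLOCK = 2,             /* munlock mode */
    TTU_ACTION_MASK = 0xff,

    TTU_IGNORE_MLOCK = (1 << 8)  /* ignore mlock */
};

/* Hypothetical helper: extract the action from a combined flags value. */
int ttu_action(int flags)
{
    return flags & TTU_ACTION_MASK;
}

/* Hypothetical helper: test the modifier bit independently of the action. */
int ttu_ignores_mlock(int flags)
{
    return (flags & TTU_IGNORE_MLOCK) != 0;
}
```

With this split, a call site that used to pass a plain migration boolean can pass TTU_MIGRATION, and a poison-handling caller can add TTU_IGNORE_MLOCK without disturbing the action.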
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 07, 2009 at 06:04:39PM -0400, Christoph Lameter wrote:
> On Tue, 7 Apr 2009, Andi Kleen wrote:
>
> > > Ignoring MLOCK? This means we are violating POSIX which says that an
> > > MLOCKed page cannot be unmapped from a process?
> >
> > I'm sure you can find sufficiently vague language in the document
> > to standards-lawyer around that requirement @)
> >
> > The alternative would be to panic.
>
>
> If you unmap an MLOCKed page then you may get memory corruption because
> f.e. the Infiniband layer is doing DMA to that page.
The page is not going away, it's poisoned in hardware and software
and stays. There is currently no mechanism to unpoison pages without
rebooting.
DMA should actually cause a bus abort on the hardware level,
at least for RMW.
I currently don't have a cancel mechanism for such kinds of mappings
though. It just does cancel_dirty_page(), but when IO is happening
In theory one could add a more forceful IO cancel mechanism using
special driver callbacks, but I'm not sure it's worth it. Normally the
hardware should abort on hitting poison (although some might do strange things)
and you'll get some more (recoverable) machine checks.
> > > How does that work for the poisoning case? We substitute a fresh page?
> >
> > It depends on the state of the page. If it was a clean disk mapped
> > page yes (it's just invalidated and can be reloaded). If it's a dirty anon
> > page the process is normally killed first (with advisory mode on) or only
> > killed when it hits the corrupted page. The process can also
> > catch the signal if it chooses so. The late killing works with
> > a special entry similar to the migration case, but that results
> > in a special SIGBUS.
>
> I think a process needs to be killed if any MLOCKed page gets corrupted
> because the OS cannot keep the POSIX guarantees.
That's the default behaviour with vm.memory_failure_early_kill = 1
However the process can catch the signal if it wants.
-Andi
--
[email protected] -- Speaking for myself only.
Hi, Andi.
On Wed, Apr 8, 2009 at 12:09 AM, Andi Kleen <[email protected]> wrote:
>
> Make sure no poisoned pages are put back into the free page
> lists. This can happen with some races.
>
> This is all slow path in the bad page bits path, so another
> check doesn't really matter.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/page_alloc.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c 2009-04-07 16:39:26.000000000 +0200
> +++ linux/mm/page_alloc.c 2009-04-07 16:39:39.000000000 +0200
> @@ -228,6 +228,15 @@
> static unsigned long nr_unshown;
>
> /*
> + * Page may have been marked bad before process is freeing it.
> + * Make sure it is not put back into the free page lists.
> + */
> + if (PagePoison(page)) {
> + /* check more flags here... */
How about adding a WARNING with some information (e.g. pfn, flags)?
> + return;
> + }
> +
> + /*
> * Allow a burst of 60 reports, then keep quiet for that minute;
> * or allow a steady drip of one report per second.
> */
>
--
Kind regards,
Minchan Kim
On Tue, Apr 07, 2009 at 05:09:58PM +0200, Andi Kleen wrote:
>
> Poisoned pages need special handling in the VM and shouldn't be touched
> again. This requires a new page flag. Define it here.
>
> The page flags wars seem to be over, so it shouldn't be a problem
> to get a new one. I hope.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> include/linux/page-flags.h | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> Index: linux/include/linux/page-flags.h
> ===================================================================
> --- linux.orig/include/linux/page-flags.h 2009-04-07 16:39:27.000000000 +0200
> +++ linux/include/linux/page-flags.h 2009-04-07 16:39:39.000000000 +0200
> @@ -51,6 +51,9 @@
> * PG_buddy is set to indicate that the page is free and in the buddy system
> * (see mm/page_alloc.c).
> *
> + * PG_poison indicates that a page got corrupted in hardware and contains
> + * data with incorrect ECC bits that triggered a machine check. Accessing is
> + * not safe since it may cause another machine check. Don't touch!
> */
>
> /*
> @@ -104,6 +107,9 @@
> #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
> PG_uncached, /* Page has been mapped as uncached */
> #endif
> +#ifdef CONFIG_MEMORY_FAILURE
Is it necessary to have this under CONFIG_MEMORY_FAILURE?
> + PG_poison, /* poisoned page. Don't touch */
> +#endif
> __NR_PAGEFLAGS,
>
> /* Filesystems */
> @@ -273,6 +279,14 @@
> PAGEFLAG_FALSE(Uncached)
> #endif
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +PAGEFLAG(Poison, poison)
> +#define __PG_POISON (1UL << PG_poison)
> +#else
> +PAGEFLAG_FALSE(Poison)
> +#define __PG_POISON 0
> +#endif
> +
> static inline int PageUptodate(struct page *page)
> {
> int ret = test_bit(PG_uptodate, &(page)->flags);
> @@ -403,7 +417,7 @@
> 1 << PG_private | 1 << PG_private_2 | \
> 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
> 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> - __PG_UNEVICTABLE | __PG_MLOCKED)
> + __PG_POISON | __PG_UNEVICTABLE | __PG_MLOCKED)
>
> /*
> * Flags checked when a page is prepped for return by the page allocator.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]
On Tue, 7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <[email protected]> wrote:
> Poisoned pages need special handling in the VM and shouldn't be touched
> again. This requires a new page flag. Define it here.
I wish this patchset didn't change/abuse the well-understood meaning of
the word "poison".
> The page flags wars seem to be over, so it shouldn't be a problem
> to get a new one. I hope.
They are? How did it all get addressed?
On Tue, 7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <[email protected]> wrote:
> Upcoming Intel CPUs have support for recovering from some memory errors. This
> requires the OS to declare a page "poisoned", kill the processes associated
> with it and avoid using it in the future. This patchkit implements
> the necessary infrastructure in the VM.
If the page is clean then we can just toss it and grab a new one from
backing store without killing anyone.
Does the patchset do that?
On Tue, 7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <[email protected]> wrote:
> Upcoming Intel CPUs have support for recovering from some memory errors. This
> requires the OS to declare a page "poisoned", kill the processes associated
> with it and avoid using it in the future. This patchkit implements
> the necessary infrastructure in the VM.
Seems that this feature is crying out for a testing framework (perhaps
it already has one?). A simplistic approach would be
echo some-pfn > /proc/bad-pfn-goes-here
A slightly more sophisticated version might do the deed from within a
timer interrupt, just to get a bit more coverage.
On Tue, Apr 07, 2009 at 10:15:42PM -0700, Andrew Morton wrote:
> On Tue, 7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <[email protected]> wrote:
>
> > Upcoming Intel CPUs have support for recovering from some memory errors. This
> > requires the OS to declare a page "poisoned", kill the processes associated
> > with it and avoid using it in the future. This patchkit implements
> > the necessary infrastructure in the VM.
>
> If the page is clean then we can just toss it and grab a new one from
> backing store without killing anyone.
>
> Does the patchset do that?
Yes. But it only really works for shared mmap; anonymous and private
mappings tend to be nearly always dirty.
Also you can disable even the early kill and only request kill
on access.
It also does some other tricks, like just triggering an IO error for
a dirty file (although I must admit the dirty handling is rather
tricky and I would appreciate very careful review of that part).
A few other known recovery tricks are not implemented yet
(like handling free memory[1]), but will be over time.
-Andi
[1] I didn't consider that one high priority since production
systems with long uptime shouldn't have much free memory.
--
[email protected] -- Speaking for myself only.
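The recovery choices described in this reply can be summarized as a small decision function. This is only a user-space sketch under stated assumptions: the enum names and choose_action() are hypothetical and are not taken from the patchkit, and the mapping simply restates what the message says (clean pages are dropped, dirty file pages get an IO error, dirty anonymous pages lead to a kill that is either early or deferred until access, free pages are not handled yet).

```c
#include <stdbool.h>

/* Hypothetical page classes, named for this sketch only. */
enum page_kind {
    PAGE_FREE,        /* not handled by the patchkit yet, per the message */
    PAGE_CLEAN_FILE,  /* can be re-read from backing store */
    PAGE_DIRTY_FILE,  /* data is lost; report it as an IO error */
    PAGE_DIRTY_ANON   /* no backing copy; the owner has to be killed */
};

/* Hypothetical recovery actions. */
enum recovery_action {
    ACT_NOT_HANDLED,
    ACT_DROP_PAGE,
    ACT_IO_ERROR,
    ACT_KILL_EARLY,
    ACT_KILL_ON_ACCESS
};

/*
 * Pick an action for a corrupted page. early_kill models the
 * setting mentioned in the message: when it is off, the process
 * is only killed once it actually touches the page.
 */
static enum recovery_action choose_action(enum page_kind kind, bool early_kill)
{
    switch (kind) {
    case PAGE_CLEAN_FILE:
        return ACT_DROP_PAGE;
    case PAGE_DIRTY_FILE:
        return ACT_IO_ERROR;
    case PAGE_DIRTY_ANON:
        return early_kill ? ACT_KILL_EARLY : ACT_KILL_ON_ACCESS;
    case PAGE_FREE:
    default:
        return ACT_NOT_HANDLED;
    }
}
```

For example, choose_action(PAGE_CLEAN_FILE, true) yields ACT_DROP_PAGE, which is the "toss it and grab a new one" case asked about above.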
On Tue, Apr 07, 2009 at 10:47:09PM -0700, Andrew Morton wrote:
> On Tue, 7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <[email protected]> wrote:
>
> > Upcoming Intel CPUs have support for recovering from some memory errors. This
> > requires the OS to declare a page "poisoned", kill the processes associated
> > with it and avoid using it in the future. This patchkit implements
> > the necessary infrastructure in the VM.
>
> Seems that this feature is crying out for a testing framework (perhaps
> it already has one?).
Multiple ones in fact.
One of them is
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
(test suite covering various cases)
git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
(injector using the x86 specific error injection hooks I posted
earlier)
Then I have some tests using the madvise MADV_POISON hook
(which test the various cases from a process standpoint
and recover). This is still a little hackish, but if there's
interest I can put it out. It has at least one test case
that is known to hang (non linear mappings), still looking
at that.
Long term plan was to put both mce-test above and the
MADV_POISON test into LTP.
And a few random hacks. But coverage is still not 100%.
> A simplistic approach would be
Random kill anywhere is hard to test because your system will
die regularly and randomly. mce-test.git does some automated
testing of fatal errors by catching them using kexec, but we haven't
tried that for full recovery.
>
> echo some-pfn > /proc/bad-pfn-goes-here
>
> A slightly more sophisticated version might do the deed from within a
> timer interrupt, just to get a bit more coverage.
mce-test/inject does it from other CPUs with smp_function_call_single,
so it's really relatively random. I've considered using NMIs too,
but at least the high level recovery code synchronizes first
to work queue context anyways, so it doesn't buy us too much for that.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 07, 2009 at 10:14:21PM -0700, Andrew Morton wrote:
> On Tue, 7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <[email protected]> wrote:
>
> > Poisoned pages need special handling in the VM and shouldn't be touched
> > again. This requires a new page flag. Define it here.
>
> I wish this patchset didn't change/abuse the well-understood meaning of
> the word "poison".
Sorry, that's the terminology on the hardware side.
If there's much confusion I could rename it HwPoison or somesuch?
> > The page flags wars seem to be over, so it shouldn't be a problem
> > to get a new one. I hope.
>
> They are? How did it all get addressed?
Allowing 64bit to use more and using [V]SPARSEMAP to limit flags
use for zones. I think.
-Andi
--
[email protected] -- Speaking for myself only.
> > @@ -104,6 +107,9 @@
> > #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
> > PG_uncached, /* Page has been mapped as uncached */
> > #endif
> > +#ifdef CONFIG_MEMORY_FAILURE
>
> Is it necessary to have this under CONFIG_MEMORY_FAILURE?
That was mainly so that !MEMORY_FAILURE 32-bit NUMA architectures that
might not use sparsemap/vsparsemap get a few more zone bits in page flags
to play with. Not sure those really exist, so it might indeed be
redundant, but it seemed safer.
-Andi
--
[email protected] -- Speaking for myself only.
> >
> >         /*
> > +        * Page may have been marked bad before process is freeing it.
> > +        * Make sure it is not put back into the free page lists.
> > +        */
> > +       if (PagePoison(page)) {
> > +               /* check more flags here... */
>
> How about adding WARNING with some information(ex, pfn, flags..).
The memory_failure() code is already quite chatty. Don't think more
noise is needed currently.
Or are you worrying about the case where a page gets corrupted
by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
That would deserve a printk, but I'm not sure how to reliably test for
that. After all a lot of flag combinations are valid.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, 8 Apr 2009 08:24:41 +0200 Andi Kleen <[email protected]> wrote:
> On Tue, Apr 07, 2009 at 10:14:21PM -0700, Andrew Morton wrote:
> > On Tue, 7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <[email protected]> wrote:
> >
> > > Poisoned pages need special handling in the VM and shouldn't be touched
> > > again. This requires a new page flag. Define it here.
> >
> > I wish this patchset didn't change/abuse the well-understood meaning of
> > the word "poison".
>
> Sorry, that's the terminology on the hardware side.
>
> If there's much confusion I could rename it HwPoison or somesuch?
I understand that'd be a PITA but I suspect it would be best,
long-term. Having this conflict in core MM is really pretty bad.
> > > The page flags wars seem to be over, so it shouldn't be a problem
> > > to get a new one. I hope.
> >
> > They are? How did it all get addressed?
>
> Allowing 64bit to use more and using [V]SPARSEMAP to limit flags
> use for zones. I think.
Nobody ever seems to be able to work out how many we actually have
left.
On Wed, Apr 8, 2009 at 3:51 PM, Andi Kleen <[email protected]> wrote:
>> >
>> > /*
>> > + * Page may have been marked bad before process is freeing it.
>> > + * Make sure it is not put back into the free page lists.
>> > + */
>> > + if (PagePoison(page)) {
>> > + /* check more flags here... */
>>
>> How about adding WARNING with some information(ex, pfn, flags..).
>
> The memory_failure() code is already quite chatty. Don't think more
> noise is needed currently.
Sure.
> Or are you worrying about the case where a page gets corrupted
> by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
> That would deserve a printk, but I'm not sure how to reliably test for
> that. After all a lot of flag combinations are valid.
I misunderstood your code.
That's because you added the code in bad_page.
As you commented, your intention was to prevent a bad page from returning to the buddy allocator.
Is that right?
If so, how about adding the prevention code to free_pages_check?
Right now, bad_page is for showing why a page is bad.
I don't like an emergency exit in bad_page.
> -Andi
>
> --
> [email protected] -- Speaking for myself only.
>
--
Kind regards,
Minchan Kim
On Wed, Apr 08, 2009 at 12:00:18AM -0700, Andrew Morton wrote:
> On Wed, 8 Apr 2009 08:24:41 +0200 Andi Kleen <[email protected]> wrote:
>
> > On Tue, Apr 07, 2009 at 10:14:21PM -0700, Andrew Morton wrote:
> > > On Tue, 7 Apr 2009 17:09:58 +0200 (CEST) Andi Kleen <[email protected]> wrote:
> > >
> > > > Poisoned pages need special handling in the VM and shouldn't be touched
> > > > again. This requires a new page flag. Define it here.
> > >
> > > I wish this patchset didn't change/abuse the well-understood meaning of
> > > the word "poison".
> >
> > Sorry, that's the terminology on the hardware side.
> >
> > If there's much confusion I could rename it HwPoison or somesuch?
>
> I understand that'd be a PITA but I suspect it would be best,
> long-term. Having this conflict in core MM is really pretty bad.
Ok. I'll rename it to HWPoison().
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Apr 08, 2009 at 04:39:17PM +0900, Minchan Kim wrote:
> On Wed, Apr 8, 2009 at 3:51 PM, Andi Kleen <[email protected]> wrote:
> >> >
> >> >         /*
> >> > +        * Page may have been marked bad before process is freeing it.
> >> > +        * Make sure it is not put back into the free page lists.
> >> > +        */
> >> > +       if (PagePoison(page)) {
> >> > +               /* check more flags here... */
> >>
> >> How about adding WARNING with some information(ex, pfn, flags..).
> >
> > The memory_failure() code is already quite chatty. Don't think more
> > noise is needed currently.
>
> Sure.
>
> > Or are you worrying about the case where a page gets corrupted
> > by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
> > That would deserve a printk, but I'm not sure how to reliably test for
> > that. After all a lot of flag combinations are valid.
>
> I misunderstood your code.
> That's because you add the code in bad_page.
>
> As you commented, your intention was to prevent bad page from returning buddy.
> Is right ?
Yes. Well actually it should not happen anymore. Perhaps I should
make it a BUG()
> If it is right, how about adding prevention code to free_pages_check ?
> Now, bad_page is for showing the information that why it is bad page
> I don't like emergency exit in bad_page.
There's already one in there, so I just reused that one. It was a convenient
way to keep things out of the fast path.
-Andi
[email protected] -- Speaking for myself only.
On Wed, Apr 8, 2009 at 6:41 PM, Andi Kleen <[email protected]> wrote:
> On Wed, Apr 08, 2009 at 04:39:17PM +0900, Minchan Kim wrote:
>> On Wed, Apr 8, 2009 at 3:51 PM, Andi Kleen <[email protected]> wrote:
>> >> >
>> >> > /*
>> >> > + * Page may have been marked bad before process is freeing it.
>> >> > + * Make sure it is not put back into the free page lists.
>> >> > + */
>> >> > + if (PagePoison(page)) {
>> >> > + /* check more flags here... */
>> >>
>> >> How about adding WARNING with some information(ex, pfn, flags..).
>> >
>> > The memory_failure() code is already quite chatty. Don't think more
>> > noise is needed currently.
>>
>> Sure.
>>
>> > Or are you worrying about the case where a page gets corrupted
>> > by software and suddenly has Poison bits set? (e.g. 0xff everywhere).
>> > That would deserve a printk, but I'm not sure how to reliably test for
>> > that. After all a lot of flag combinations are valid.
>>
>> I misunderstood your code.
>> That's because you add the code in bad_page.
>>
>> As you commented, your intention was to prevent bad page from returning buddy.
>> Is right ?
>
> Yes. Well actually it should not happen anymore. Perhaps I should
> make it a BUG()
>
>> If it is right, how about adding prevention code to free_pages_check ?
>> Now, bad_page is for showing the information that why it is bad page
>> I don't like emergency exit in bad_page.
>
> There's already one in there, so i just reused that one. It was a convenient
> way to keep things out of the fast path
Sorry for my vague previous comment.
What I mean is that the role of the bad_page function is just to print why a page is bad.
Anyone can use bad_page to show that information.
If someone begins to add a side branch to bad_page, other people might
add their own exception cases there too.
So, I think it would be better to check PagePoison in free_pages_check,
not bad_page. :)
> -Andi
>
> [email protected] -- Speaking for myself only.
>
--
Kind regards,
Minchan Kim
On Tue, 2009-04-07 at 17:10 +0200, Andi Kleen wrote:
> This patch adds the high level memory handler that poisons pages.
> It is portable code and lives in mm/memory-failure.c
I think this is an important feature, thanks for doing all this work
Andi.
> Index: linux/mm/memory-failure.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/mm/memory-failure.c 2009-04-07 16:39:39.000000000 +0200
> +
> +/*
> + * Clean (or cleaned) page cache page.
> + */
> +static int me_pagecache_clean(struct page *p)
> +{
> + struct address_space *mapping;
> +
> + if (PagePrivate(p))
> + do_invalidatepage(p, 0);
> + mapping = page_mapping(p);
> + if (mapping) {
> + if (!remove_mapping(mapping, p))
> + return FAILED;
> + }
> + return RECOVERED;
> +}
> +
> +/*
> + * Dirty cache page page
> + * Issues: when the error hit a hole page the error is not properly
> + * propagated.
> + */
> +static int me_pagecache_dirty(struct page *p)
> +{
> + struct address_space *mapping = page_mapping(p);
> +
> + SetPageError(p);
> + /* TBD: print more information about the file. */
> + printk(KERN_ERR "MCE: Hardware memory corruption on dirty file page: write error\n");
> + if (mapping) {
> + /* CHECKME: does that report the error in all cases? */
> + mapping_set_error(mapping, EIO);
> + }
> + if (PagePrivate(p)) {
> + if (try_to_release_page(p, GFP_KERNEL)) {
So, try_to_release_page returns 1 when it works. I know this only
because I have to read it every time to remember ;)
try_to_release_page is also very likely to fail if the page is dirty or
under writeback. At the end of the day, we'll probably need a call into
the FS to tell it a given page isn't coming back, and to clean it at all
cost.
invalidatepage is close, but ext3/reiserfs will keep the buffer heads
and let the page->mapping go to null in an ugly data=ordered corner
case. The buffer heads pin the page and it won't be freed until the IO
is done.
-chris
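The return convention pointed out here is easy to get backwards, so a minimal user-space model may help. It is a sketch only: struct fake_page, release_page_stub() and handle_private() are hypothetical names invented for illustration, and the stub merely mimics the convention described above (nonzero means the release worked). It does not reproduce the code that follows the quoted hunk, which is cut off in this message.

```c
#include <stdbool.h>

/* Outcome codes named after those used in the quoted patch. */
enum me_result { FAILED, RECOVERED };

/* Hypothetical stand-in for a page; only what this sketch needs. */
struct fake_page {
    bool has_private;     /* models PagePrivate() */
    bool fs_can_release;  /* whether the filesystem agrees to let go */
};

/*
 * Stub with the convention described for try_to_release_page():
 * 1 when the release worked, 0 when it did not.
 */
static int release_page_stub(const struct fake_page *p)
{
    return p->fs_can_release ? 1 : 0;
}

/* A zero return means private data is still attached, so report failure. */
static enum me_result handle_private(const struct fake_page *p)
{
    if (p->has_private && !release_page_stub(p))
        return FAILED;
    return RECOVERED;
}
```

Note the negation in handle_private(): treating a nonzero return as an error would report FAILED exactly when the release succeeded.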
> [1] I didn't consider that one high priority since production
> systems with long uptime shouldn't have much free memory.
Surely there are windows after a big job exits where lots of memory
might be free. Not sure how big those windows are in practice but it
does seem if a process using 128GB exits then it might take a while
before that memory all gets used again.
- R.
On Wed, Apr 08, 2009 at 10:29:34AM -0700, Roland Dreier wrote:
> > [1] I didn't consider that one high priority since production
> > systems with long uptime shouldn't have much free memory.
>
> Surely there are windows after a big job exits where lots of memory
> might be free. Not sure how big those windows are in practice but it
> does seem if a process using 128GB exits then it might take a while
> before that memory all gets used again.
Yes, it's definitely something to be fixed at some point.
Basically just needs a new entry point into the page_alloc
buddy allocator to unfree a page. The more tricky part
is actually finding a good injector design for testing for it,
there's no natural race free way to get a free page.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Apr 08, 2009 at 01:03:59PM -0400, Chris Mason wrote:
Hi Chris,
Thanks for the review.
> So, try_to_release_page returns 1 when it works. I know this only
> because I have to read it every time to remember ;)
Argh. I think I read that, but then somehow the code still came out
wrong and the tester didn't catch the failure.
>
> try_to_release_page is also very likely to fail if the page is dirty or
> under writeback. At the end of the day, we'll probably need a call into
Would you recommend a retry step? If it fails cancel_dirty_page() and then
retry?
Ideally I would like to stop the write back before it starts (it will
result in a hardware bus abort or even a machine check if the CPU
touches the data), but I realize it's difficult for anything with
private page state. I just cancel dirty for !Private at least.
> the FS to tell it a given page isn't coming back, and to clean it at all
> cost.
>
> invalidatepage is close, but ext3/reiserfs will keep the buffer heads
> and let the page->mapping go to null in an ugly data=ordered corner
> case. The buffer heads pin the page and it won't be freed until the IO
> is done.
invalidate_mapping_pages() ?
I had this in an earlier version, but took it out because it seemed
problematic to rely on a specific inode. Should I reconsider it?
-Andi
--
[email protected] -- Speaking for myself only.
Double checked the try_to_release_page logic. My assumption was that the
writeback case could never trigger, because during write back the page
should be locked and so it's excluded with the earlier lock_page_nosync().
Is that a correct assumption?
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, 2009-04-09 at 09:58 +0200, Andi Kleen wrote:
> Double checked the try_to_release_page logic. My assumption was that the
> writeback case could never trigger, because during write back the page
> should be locked and so it's excluded with the earlier lock_page_nosync().
>
> Is that a correct assumption?
Yes, the page won't become writeback when you're holding the page lock.
But, the FS usually thinks of try_to_releasepage as a polite request.
It might fail internally for a bunch of reasons.
To make things even more fun, the page won't become writeback magically,
but ext3 and reiser maintain lists of buffer heads for data=ordered, and
they do the data=ordered IO on the buffer heads directly. writepage is
never called and the page lock is never taken, but the buffer heads go
to disk. I don't think any of the other filesystems do it this way.
At least for Ext3 (and reiser3), try_to_releasepage is required to fail
for some data=ordered corner cases, and the only way it'll end up
passing is if you commit the transaction (which writes the buffer_head)
and try again. Even invalidatepage will just end up setting
page->mapping to null but leaving the page around for ext3 to finish
processing.
If we really want the page gone, we'll have to tell the FS
drop-this-or-else....sorry, it's some ugly stuff.
The good news is, it is pretty rare. I wouldn't hold up the whole patch
set just for this problem. We could document the future fun required
and fix the return value check and concentrate on something other than
this ugly corner ;)
-chris
On Thu, Apr 09, 2009 at 09:30:29AM -0400, Chris Mason wrote:
> > Is that a correct assumption?
>
> Yes, the page won't become writeback when you're holding the page lock.
> But, the FS usually thinks of try_to_releasepage as a polite request.
> It might fail internally for a bunch of reasons.
>
> To make things even more fun, the page won't become writeback magically,
> but ext3 and reiser maintain lists of buffer heads for data=ordered, and
> they do the data=ordered IO on the buffer heads directly. writepage is
> never called and the page lock is never taken, but the buffer heads go
> to disk. I don't think any of the other filesystems do it this way.
Ok, so do you think my code handles this correctly?
> If we really want the page gone, we'll have to tell the FS
> drop-this-or-else....sorry, its some ugly stuff.
I would like to give a very strong hint at least. If it fails
we can still ignore it, but it will likely have negative consequences later.
>
> The good news is, it is pretty rare. I wouldn't hold up the whole patch
You mean pages with Private bit are rare? Are you suggesting to just
ignore those? How common is it to have Private pages which are not
locked by someone else?
I keep thinking about doing some instrumentation to figure out
how common the various page types are under different loads, but haven't written
that bit so far.
> set just for this problem. We could document the future fun required
> and fix the return value check
I fixed the return value check. Thanks.
> and concentrate on something other than
> this ugly corner ;)
Any suggestions welcome.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, 2009-04-09 at 16:02 +0200, Andi Kleen wrote:
> On Thu, Apr 09, 2009 at 09:30:29AM -0400, Chris Mason wrote:
> > > Is that a correct assumption?
> >
> > Yes, the page won't become writeback when you're holding the page lock.
> > But, the FS usually thinks of try_to_releasepage as a polite request.
> > It might fail internally for a bunch of reasons.
> >
> > To make things even more fun, the page won't become writeback magically,
> > but ext3 and reiser maintain lists of buffer heads for data=ordered, and
> > they do the data=ordered IO on the buffer heads directly. writepage is
> > never called and the page lock is never taken, but the buffer heads go
> > to disk. I don't think any of the other filesystems do it this way.
>
> Ok, so do you think my code handles this correctly?
Even though try_to_releasepage only checks page_writeback() the lower
filesystems all bail on dirty pages or dirty buffers (see the checks
done by try_to_free_buffers).
It looks like the only way we have to clean a page and all the buffers
in it is the invalidatepage call. But that doesn't return success or
failure, so maybe invalidatepage followed by releasepage?
I'll have to read harder next week, the FS invalidatepage may expect
truncate to be the only caller.
>
> > If we really want the page gone, we'll have to tell the FS
> > drop-this-or-else....sorry, its some ugly stuff.
>
> I would like to give a very strong hint at least. If it fails
> we can still ignore it, but it will likely have negative consequences later.
>
Nod.
> >
> > The good news is, it is pretty rare. I wouldn't hold up the whole patch
>
> You mean pages with Private bit are rare? Are you suggesting to just
> ignore those? How common is it to have Private pages which are not
> locked by someone else?
>
PagePrivate is very common. try_to_releasepage failing on a clean page
without the writeback bit set and without dirty/locked buffers will be
pretty rare.
-chris
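The "invalidatepage followed by releasepage" idea floated above can be modelled in user space as follows. Everything here is an assumption made for illustration: struct model_page and the model_* functions are hypothetical, the invalidate step returns nothing (matching the remark that invalidatepage does not report success or failure), and the pinned_by_journal field stands in for the ext3/reiserfs data=ordered corner case where the buffers stay pinned until IO completes.

```c
#include <stdbool.h>

/* Hypothetical model of the page state discussed in the thread. */
struct model_page {
    bool has_private;        /* models PagePrivate() */
    bool dirty_buffers;      /* buffers the FS refuses to free while dirty */
    bool pinned_by_journal;  /* models the data=ordered corner case */
};

/* Like invalidatepage as described: cleans what it can, reports nothing. */
static void model_invalidate(struct model_page *p)
{
    p->dirty_buffers = false;
}

/* Like releasepage: 1 on success, 0 if the FS still holds on to the page. */
static int model_release(struct model_page *p)
{
    if (p->dirty_buffers || p->pinned_by_journal)
        return 0;
    p->has_private = false;
    return 1;
}

/* Invalidate first, then let the release step provide the verdict. */
static int model_drop_private(struct model_page *p)
{
    if (!p->has_private)
        return 1;
    model_invalidate(p);
    return model_release(p);
}
```

In this model a page with only dirty buffers is dropped successfully, while a journal-pinned page still fails, which is the rare case the message suggests documenting rather than blocking the patch set on.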
> Even though try_to_releasepage only checks page_writeback() the lower
> filesystems all bail on dirty pages or dirty buffers (see the checks
> done by try_to_free_buffers).
>
> It looks like the only way we have to clean a page and all the buffers
> in it is the invalidatepage call. But that doesn't return success or
> failure, so maybe invalidatepage followed by releasepage?
Ok. I'll poke at it more.
>
> I'll have to read harder next week, the FS invalidatepage may expect
> truncate to be the only caller.
I have to be careful with locks; taking another lock would deadlock. Ok,
I could drop the page lock temporarily, but that would risk
someone else coming in unexpectedly.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 07, 2009 at 10:47:09PM -0700, Andrew Morton wrote:
> On Tue, 7 Apr 2009 17:09:56 +0200 (CEST) Andi Kleen <[email protected]> wrote:
>
> > Upcoming Intel CPUs have support for recovering from some memory errors. This
> > requires the OS to declare a page "poisoned", kill the processes associated
> > with it and avoid using it in the future. This patchkit implements
> > the necessary infrastructure in the VM.
>
> Seems that this feature is crying out for a testing framework (perhaps
> it already has one?). A simplistic approach would be
>
> echo some-pfn > /proc/bad-pfn-goes-here
How about reusing the /proc/kpageflags interface? i.e. make it writable.
It may sound crazy and way too _hacky_, but it is possible to
attach actions to the state transition of some page flags ;)
PG_poison 0 => 1: call memory_failure()
PG_active 1 => 0: move page into inactive lru
PG_unevictable 1 => 0: move page out of unevictable lru
PG_swapcache 1 => 0: remove page from swap cache
PG_lru 1 => 0: reclaim page
Thanks,
Fengguang
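The flag-transition proposal above amounts to a lookup table keyed on (flag, old value, new value). The following is a user-space sketch of that table only; the enum, struct and action_for() are hypothetical names, the action strings are copied from the list in the message, and nothing here touches the real /proc/kpageflags interface.

```c
#include <stddef.h>

/* Hypothetical identifiers for the flags named in the proposal. */
enum flag_id {
    FLAG_POISON,
    FLAG_ACTIVE,
    FLAG_UNEVICTABLE,
    FLAG_SWAPCACHE,
    FLAG_LRU
};

/* One row: writing new_val over old_val for this flag triggers the action. */
struct flag_transition {
    enum flag_id flag;
    int old_val;
    int new_val;
    const char *action;
};

static const struct flag_transition transitions[] = {
    { FLAG_POISON,      0, 1, "call memory_failure()" },
    { FLAG_ACTIVE,      1, 0, "move page into inactive lru" },
    { FLAG_UNEVICTABLE, 1, 0, "move page out of unevictable lru" },
    { FLAG_SWAPCACHE,   1, 0, "remove page from swap cache" },
    { FLAG_LRU,         1, 0, "reclaim page" },
};

/* Return the action for a transition, or NULL if none is attached. */
static const char *action_for(enum flag_id flag, int old_val, int new_val)
{
    size_t n = sizeof(transitions) / sizeof(transitions[0]);

    for (size_t i = 0; i < n; i++) {
        const struct flag_transition *t = &transitions[i];

        if (t->flag == flag && t->old_val == old_val && t->new_val == new_val)
            return t->action;
    }
    return NULL;
}
```

Transitions that are not listed, such as clearing the poison flag, return NULL in this sketch, so a writable interface built this way would reject them rather than guess at a meaning.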
> A slightly more sophisticated version might do the deed from within a
> timer interrupt, just to get a bit more coverage.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
> On Thu, 2009-04-09 at 16:02 +0200, Andi Kleen wrote:
> > On Thu, Apr 09, 2009 at 09:30:29AM -0400, Chris Mason wrote:
> > > > Is that a correct assumption?
> > >
> > > Yes, the page won't become writeback when you're holding the page lock.
> > > But, the FS usually thinks of try_to_releasepage as a polite request.
> > > It might fail internally for a bunch of reasons.
> > >
> > > To make things even more fun, the page won't become writeback magically,
> > > but ext3 and reiser maintain lists of buffer heads for data=ordered, and
> > > they do the data=ordered IO on the buffer heads directly. writepage is
> > > never called and the page lock is never taken, but the buffer heads go
> > > to disk. I don't think any of the other filesystems do it this way.
> >
> > Ok, so do you think my code handles this correctly?
>
> Even though try_to_releasepage only checks page_writeback() the lower
> filesystems all bail on dirty pages or dirty buffers (see the checks
> done by try_to_free_buffers).
>
> It looks like the only way we have to clean a page and all the buffers
> in it is the invalidatepage call. But that doesn't return success or
> failure, so maybe invalidatepage followed by releasepage?
>
> I'll have to read harder next week, the FS invalidatepage may expect
> truncate to be the only caller.
If direct de-dirty is hard for some pages, how about just ignore them?
There are the PG_writeback pages anyway. We can inject code to
intercept them at the last stage of IO request dispatching.
Some perceivable problems and solutions are
1) the intercepting overheads could be costly => inject code at runtime.
2) there are cases that the dirty page could be copied for IO:
2.1) jbd2 has two copy-out cases => should be rare. just ignore them?
2.1.1) do_get_write_access(): buffer sits in two active commits
2.1.2) jbd2_journal_write_metadata_buffer(): buffer happens to start
with JBD2_MAGIC_NUMBER
2.2) btrfs has to read the page for compression/encryption
Chris: is btrfs_zlib_compress_pages() a good place for detecting
poison pages? Or is it necessary at all for btrfs? (i.e. it's
already relatively easy to de-dirty btrfs pages.)
2.3) maybe more cases...
> >
> > > If we really want the page gone, we'll have to tell the FS
> > > drop-this-or-else....sorry, its some ugly stuff.
> >
> > I would like to give a very strong hint at least. If it fails
> > we can still ignore it, but it will likely have negative consequences later.
> >
>
> Nod.
>
> > >
> > > The good news is, it is pretty rare. I wouldn't hold up the whole patch
> >
> > You mean pages with Private bit are rare? Are you suggesting to just
> > ignore those? How common is it to have Private pages which are not
> > locked by someone else?
> >
>
> PagePrivate is very common. try_to_releasepage failing on a clean page
> without the writeback bit set and without dirty/locked buffers will be
> pretty rare.
Yup. btrfs seems to tag most (if not all) dirty pages with PG_private,
while ext4 doesn't.
Thanks,
Fengguang
On Wed, Apr 29, 2009 at 04:16:16PM +0800, Wu Fengguang wrote:
> On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
[snip]
> > PagePrivate is very common. try_to_releasepage failing on a clean page
> > without the writeback bit set and without dirty/locked buffers will be
> > pretty rare.
>
> Yup. btrfs seems to tag most(if not all) dirty pages with PG_private.
> While ext4 won't.
Chris, I ran into a btrfs BUG() when doing
dd if=/dev/zero of=/b/sparse bs=1k count=1 seek=104857512345
The half created sparse file is
-rw-r--r-- 1 root root 98T 2009-04-29 14:54 /b/sparse
Or
-rw-r--r-- 1 root root 107374092641280 2009-04-29 14:54 /b/sparse
Below are the kernel messages. I can test patches you throw at me :-)
Thanks,
Fengguang
[ 1067.530868] btrfs allocation failed flags 1, wanted 4096
[ 1067.536313] space_info has 0 free, is full
[ 1067.540533] space_info total=4049600512, pinned=0, delalloc=4096, may_use=0, used=4049600512
[ 1067.549280] block group 12582912 has 8388608 bytes, 8388608 used 0 pinned 0 reserved
[ 1067.557172] 0 blocks of free space at or bigger than bytes is
[ 1067.563020] block group 255918080 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.571334] 0 blocks of free space at or bigger than bytes is
[ 1067.577159] block group 709099520 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.585459] 0 blocks of free space at or bigger than bytes is
[ 1067.591271] block group 1162280960 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.599641] 0 blocks of free space at or bigger than bytes is
[ 1067.605491] block group 1615462400 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.613858] 0 blocks of free space at or bigger than bytes is
[ 1067.619684] block group 2068643840 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.628069] 0 blocks of free space at or bigger than bytes is
[ 1067.633893] block group 2521825280 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.642277] 0 blocks of free space at or bigger than bytes is
[ 1067.648099] block group 2975006720 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.656483] 0 blocks of free space at or bigger than bytes is
[ 1067.662295] block group 3428188160 has 453181440 bytes, 453181440 used 0 pinned 0 reserved
[ 1067.670666] 0 blocks of free space at or bigger than bytes is
[ 1067.676508] block group 3881369600 has 415760384 bytes, 415760384 used 0 pinned 0 reserved
[ 1067.684877] 0 blocks of free space at or bigger than bytes is
[ 1067.690747] ------------[ cut here ]------------
[ 1067.695435] kernel BUG at fs/btrfs/extent-tree.c:2872!
[ 1067.700646] invalid opcode: 0000 [#1] SMP
[ 1067.704873] last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/C23B/charge_full
[ 1067.714473] CPU 0
[ 1067.716575] Modules linked in: drm iwlagn iwlcore snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device snd soundcore snd_page_alloc video
[ 1067.733699] Pid: 3358, comm: dd Not tainted 2.6.30-rc2-next-20090417 #202 HP Compaq 6910p
[ 1067.741975] RIP: 0010:[<ffffffff81201b23>] [<ffffffff81201b23>] __btrfs_reserve_extent+0x213/0x300
[ 1067.751185] RSP: 0018:ffff8800791c77f8 EFLAGS: 00010292
[ 1067.756581] RAX: 0000000000022533 RBX: ffff88007b8c5030 RCX: 0000000000000006
[ 1067.763777] RDX: ffffffff81ccffa0 RSI: ffff8800791c1db0 RDI: 0000000000000286
[ 1067.770984] RBP: ffff8800791c7878 R08: 0000000000000000 R09: 0000000000000000
[ 1067.778203] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88007b38e4b8
[ 1067.785440] R13: 0000000000001000 R14: ffff88007b38e6a8 R15: ffff88007b38e658
[ 1067.792657] FS: 00007f5801f136f0(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
[ 1067.800851] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1067.806668] CR2: 00007f58017c1622 CR3: 000000007bb62000 CR4: 00000000000006e0
[ 1067.813882] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1067.821087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1067.828304] Process dd (pid: 3358, threadinfo ffff8800791c6000, task ffff8800791c1600)
[ 1067.836319] Stack:
[ 1067.838389] 0000000000000000 ffff8800791c7948 0000000000000000 0000000000000000
[ 1067.845792] 0000000000000001 0000000000000000 0000000000000000 0000000000000000
[ 1067.853464] ffff88007bbe4000 0000000100000000 0000000000000001 ffff8800791c7948
[ 1067.861360] Call Trace:
[ 1067.863863] [<ffffffff81201e0b>] btrfs_reserve_extent+0x3b/0x70
[ 1067.869984] [<ffffffff81218feb>] cow_file_range+0x21b/0x3d0
[ 1067.875745] [<ffffffff8122f519>] ? test_range_bit+0xb9/0x180
[ 1067.881616] [<ffffffff81219be2>] run_delalloc_range+0x302/0x3b0
[ 1067.887727] [<ffffffff8122f519>] ? test_range_bit+0xb9/0x180
[ 1067.893583] [<ffffffff8123352f>] ? find_lock_delalloc_range+0x12f/0x1c0
[ 1067.900396] [<ffffffff81233c45>] __extent_writepage+0x175/0x990
[ 1067.906502] [<ffffffff810794a8>] ? mark_held_locks+0x68/0x90
[ 1067.912361] [<ffffffff810ca581>] ? clear_page_dirty_for_io+0x171/0x190
[ 1067.919080] [<ffffffff810797fd>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 1067.925891] [<ffffffff812308ce>] extent_write_cache_pages+0x1ee/0x400
[ 1067.932529] [<ffffffff8122e970>] ? flush_write_bio+0x0/0x40
[ 1067.938288] [<ffffffff81233ad0>] ? __extent_writepage+0x0/0x990
[ 1067.944404] [<ffffffff810794a8>] ? mark_held_locks+0x68/0x90
[ 1067.950254] [<ffffffff810f88e5>] ? kmem_cache_free+0x145/0x260
[ 1067.956287] [<ffffffff810797fd>] ? trace_hardirqs_on_caller+0x16d/0x1c0
[ 1067.963091] [<ffffffff81230b22>] extent_writepages+0x42/0x70
[ 1067.968957] [<ffffffff81217020>] ? btrfs_get_extent+0x0/0x960
[ 1067.974891] [<ffffffff81216e58>] btrfs_writepages+0x28/0x30
[ 1067.980663] [<ffffffff8122b940>] btrfs_fdatawrite_range+0x50/0x60
[ 1067.986942] [<ffffffff8122c2c6>] btrfs_wait_ordered_range+0xb6/0x170
[ 1067.993508] [<ffffffff8121cce4>] btrfs_truncate+0x74/0x160
[ 1067.999183] [<ffffffff810dd46d>] vmtruncate+0xad/0x110
[ 1068.004529] [<ffffffff81117095>] inode_setattr+0x35/0x180
[ 1068.010116] [<ffffffff8121d3ab>] btrfs_setattr+0x6b/0xd0
[ 1068.015616] [<ffffffff81117301>] notify_change+0x121/0x330
[ 1068.021298] [<ffffffff810fd1aa>] do_truncate+0x6a/0x90
[ 1068.026623] [<ffffffff810fd2c0>] sys_ftruncate+0xf0/0x130
[ 1068.032220] [<ffffffff8100c2b2>] system_call_fastpath+0x16/0x1b
[ 1068.038364] Code: 4c 8d a0 60 fe ff ff 49 8b 84 24 a0 01 00 00 0f 18 08 49 8d 84 24 a0 01 00 00 49 39 c7 0f 85 8c 00 00 00 4c 89 f7 e8 8d 8e e6 ff <0f> 0b eb fe 66 0f 1f 84 00 00 00 00 00 49 d1 ed 41 8b 84 24 60
[ 1068.059472] RIP [<ffffffff81201b23>] __btrfs_reserve_extent+0x213/0x300
[ 1068.066299] RSP <ffff8800791c77f8>
[ 1068.070292] ---[ end trace ab42ff0a881d9568 ]---
> > I'll have to read harder next week, the FS invalidatepage may expect
> > truncate to be the only caller.
>
> If direct de-dirty is hard for some pages, how about just ignoring them?
You mean just ignoring it for the pages where it is hard?
Yes that is what it is essentially doing right now. But at least
some dirty pages need to be handled because most user space
pages tend to be dirty.
> There are the PG_writeback pages anyway. We can inject code to
> intercept them at the last stage of IO request dispatching.
That would require adding error out code through all the file systems,
right?
>
> Some perceivable problems and solutions are
> 1) the intercepting overheads could be costly => inject code at runtime.
> 2) there are cases that the dirty page could be copied for IO:
At some point we should probably add poison checks before these operations,
yes. At least for read it should be the same code path as EIO --
you have to check PG_error anyway (or at least you ought to).
The main difference is that for write you have to check it too.
> 2.1) jbd2 has two copy-out cases => should be rare. just ignore them?
> 2.1.1) do_get_write_access(): buffer sits in two active commits
> 2.1.2) jbd2_journal_write_metadata_buffer(): buffer happens to start
> with JBD2_MAGIC_NUMBER
> 2.2) btrfs have to read page for compress/encryption
> Chris: is btrfs_zlib_compress_pages() a good place for detecting
> poison pages? Or is it necessary at all for btrfs?(ie. it's
> already relatively easy to de-dirty btrfs pages.)
I think btrfs' IO error handling is not very great right now. But once
it matures I hope poison pages can be handled in the same way as
regular IO errors.
> 2.3) maybe more cases...
Undoubtedly. Goal is just to handle the common cases that cover a lot
of memory. This will never be 100%.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Apr 29, 2009 at 04:36:55PM +0800, Andi Kleen wrote:
> > > I'll have to read harder next week, the FS invalidatepage may expect
> > > truncate to be the only caller.
> >
> > If direct de-dirty is hard for some pages, how about just ignoring them?
>
> You mean just ignoring it for the pages where it is hard?
Yes.
> Yes that is what it is essentially doing right now. But at least
> some dirty pages need to be handled because most user space
> pages tend to be dirty.
Sure. There are three types of dirty pages:
A. now dirty, can be de-dirtied in the current code
B. now dirty, cannot be de-dirtied
C. now dirty and under writeback, cannot be de-dirtied
I mean B and C can be handled in one single place - the block layer.
If B pages are hard to de-dirty now, ignore them for now; they will
eventually go to IO and become C.
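To make the three classes concrete, here is a minimal sketch of the classification. It is only a model: the `Page` fields, `classify()`, `needs_block_layer_check()` and `start_writeback()` are hypothetical names for illustration, not kernel interfaces, and `can_dedirty` stands in for "the current code is able to drop the dirty state".

```python
from dataclasses import dataclass
from enum import Enum


class DirtyClass(Enum):
    A = "dirty, can be de-dirtied directly"
    B = "dirty, cannot be de-dirtied directly"
    C = "dirty and under writeback"


@dataclass
class Page:
    # All three fields are illustrative stand-ins, not kernel page flags.
    dirty: bool
    writeback: bool
    can_dedirty: bool


def classify(page):
    """Return the class of a dirty page, or None for a clean page."""
    if not page.dirty:
        return None
    if page.writeback:
        return DirtyClass.C
    return DirtyClass.A if page.can_dedirty else DirtyClass.B


def needs_block_layer_check(page):
    """B and C are the classes left to a single check in the block layer."""
    return classify(page) in (DirtyClass.B, DirtyClass.C)


def start_writeback(page):
    """Model a page being submitted for IO."""
    page.writeback = True


page = Page(dirty=True, writeback=False, can_dedirty=False)
before = classify(page)   # class B: dirty, cannot be de-dirtied
start_writeback(page)
after = classify(page)    # class C once the page goes to IO
```

The last four lines show the point of the argument: a class B page that is left alone turns into class C as soon as it is written back, so one check at the IO stage would see both.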
> > There are the PG_writeback pages anyway. We can inject code to
> > intercept them at the last stage of IO request dispatching.
>
> That would require adding error out code through all the file systems,
> right?
Not necessarily. The file systems deal with buffer heads, extent maps
and bios; they normally won't touch the poisoned page content at all.
So it's mostly safe to add one single door-keeper at the low level
request dispatch queue.
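A rough sketch of what such a door-keeper could look like, purely as a model: `Page`, `Request` and `dispatch()` are hypothetical names, the `poisoned` field stands in for the page poison bit, and failing with -EIO is an assumption based on the idea of treating poison like a regular IO error.

```python
import errno
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    pfn: int
    poisoned: bool = False   # stand-in for the new page poison bit


@dataclass
class Request:
    pages: List[Page]


def dispatch(request: Request, submit: Callable[[Request], None]) -> int:
    """Hand the request to submit(), unless it carries a poisoned page."""
    if any(page.poisoned for page in request.pages):
        return -errno.EIO     # fail it like a regular IO error
    submit(request)
    return 0


sent = []
ok = dispatch(Request([Page(1), Page(2)]), sent.append)
bad = dispatch(Request([Page(3), Page(4, poisoned=True)]), sent.append)
```

Here only the first request reaches `sent`; the second one is rejected before any of its page contents would be read for IO. The check only looks at a flag, never at the page data itself.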
> >
> > Some perceivable problems and solutions are
> > 1) the intercepting overheads could be costly => inject code at runtime.
> > 2) there are cases that the dirty page could be copied for IO:
>
> At some point we should probably add poison checks before these operations
Maybe some ext4 developers can give us more hints on these two cases.
We can also add some instrumentation to see how often (2.1.x) happens.
But I guess a simple PagePoison() test is cheap anyway.
> yes. At least for read it should be the same code path as EIO --
> you have to check PG_error anyways (or at least you ought to)
> The main difference is that for write you have to check it too.
Check which on write? You mean Copy-out?
Another copy path is the bounced read/write... I guess it won't be
common on 64-bit systems though.
> > 2.1) jbd2 has two copy-out cases => should be rare. just ignore them?
> > 2.1.1) do_get_write_access(): buffer sits in two active commits
> > 2.1.2) jbd2_journal_write_metadata_buffer(): buffer happens to start
> > with JBD2_MAGIC_NUMBER
> > 2.2) btrfs have to read page for compress/encryption
> > Chris: is btrfs_zlib_compress_pages() a good place for detecting
> > poison pages? Or is it necessary at all for btrfs?(ie. it's
> > already relatively easy to de-dirty btrfs pages.)
>
> I think btrfs' IO error handling is not very great right now. But once
> it matures I hope poison pages can be handled in the same way as
> regular IO errors.
OK.
> > 2.3) maybe more cases...
>
> Undoubtedly. Goal is just to handle the common cases that cover a lot
> of memory. This will never be 100%.
Right. We'll discover/cover more cases as time goes by.
Thanks,
Fengguang
On Wed, 2009-04-29 at 17:05 +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 04:36:55PM +0800, Andi Kleen wrote:
> > > > I'll have to read harder next week, the FS invalidatepage may expect
> > > > truncate to be the only caller.
> > >
> > > If direct de-dirty is hard for some pages, how about just ignoring them?
> >
> > You mean just ignoring it for the pages where it is hard?
>
> Yes.
>
> > Yes that is what it is essentially doing right now. But at least
> > some dirty pages need to be handled because most user space
> > pages tend to be dirty.
>
> Sure. There are three types of dirty pages:
>
> A. now dirty, can be de-dirtied in the current code
> B. now dirty, cannot be de-dirtied
> C. now dirty and under writeback, cannot be de-dirtied
>
> I mean B and C can be handled in one single place - the block layer.
>
> If B pages are hard to de-dirty now, ignore them for now; they will
> eventually go to IO and become C.
>
> > > There are the PG_writeback pages anyway. We can inject code to
> > > intercept them at the last stage of IO request dispatching.
> >
> > That would require adding error out code through all the file systems,
> > right?
>
> Not necessarily. The file systems deal with buffer heads, extent maps
> and bios; they normally won't touch the poisoned page content at all.
>
They often do when zeroing parts of the page that straddle i_size. At
least for btrfs it's enough to change grab_cache_page and find_get_page
(and friends) to do the poison magic, along with the functions used by
write_cache_pages.
-chris
On Wed, 2009-04-29 at 16:21 +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 04:16:16PM +0800, Wu Fengguang wrote:
> > On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
> [snip]
> > > PagePrivate is very common. try_to_releasepage failing on a clean page
> > > without the writeback bit set and without dirty/locked buffers will be
> > > pretty rare.
> >
> > Yup. btrfs seems to tag most(if not all) dirty pages with PG_private.
> > While ext4 won't.
>
> Chris, I ran into a btrfs BUG() when doing
>
> dd if=/dev/zero of=/b/sparse bs=1k count=1 seek=104857512345
>
> The half created sparse file is
>
> -rw-r--r-- 1 root root 98T 2009-04-29 14:54 /b/sparse
> Or
> -rw-r--r-- 1 root root 107374092641280 2009-04-29 14:54 /b/sparse
>
> Below are the kernel messages. I can test patches you throw at me :-)
>
How big was the FS you were testing this on? It works for me...
-chris
On Wed, Apr 29, 2009 at 07:40:22PM +0800, Chris Mason wrote:
> On Wed, 2009-04-29 at 16:21 +0800, Wu Fengguang wrote:
> > On Wed, Apr 29, 2009 at 04:16:16PM +0800, Wu Fengguang wrote:
> > > On Thu, Apr 09, 2009 at 10:37:39AM -0400, Chris Mason wrote:
> > [snip]
> > > > PagePrivate is very common. try_to_releasepage failing on a clean page
> > > > without the writeback bit set and without dirty/locked buffers will be
> > > > pretty rare.
> > >
> > > Yup. btrfs seems to tag most(if not all) dirty pages with PG_private.
> > > While ext4 won't.
> >
> > Chris, I ran into a btrfs BUG() when doing
> >
> > dd if=/dev/zero of=/b/sparse bs=1k count=1 seek=104857512345
> >
> > The half created sparse file is
> >
> > -rw-r--r-- 1 root root 98T 2009-04-29 14:54 /b/sparse
> > Or
> > -rw-r--r-- 1 root root 107374092641280 2009-04-29 14:54 /b/sparse
> >
> > Below are the kernel messages. I can test patches you throw at me :-)
> >
>
> How big was the FS you were testing this on? It works for me...
df says:
/dev/sda3 4.3G 28K 4.3G 1% /b
Too bad, I cannot reproduce it now.
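For reference, the sizes quoted above can be cross-checked against the dd parameters. This is a small arithmetic sketch, assuming bs=1k means 1024 bytes and that the 98T shown by ls is the byte count in TiB rounded up; the variable names are made up for the example.

```python
import math

BS = 1024                 # bs=1k
SEEK = 104857512345       # seek=104857512345 (in units of BS)
COUNT = 1                 # count=1

seek_offset = SEEK * BS                  # where the single block would start
complete_size = (SEEK + COUNT) * BS      # size if the block had been written
size_tib = seek_offset / 2**40

print(seek_offset)                       # 107374092641280
print(complete_size - seek_offset)       # 1024
print(math.ceil(size_tib))               # 98
```

The reported file size matches the seek offset exactly, not the seek offset plus one block. That fits the trace going through sys_ftruncate: the file seems to have been extended to the seek offset, and the BUG hit before the 1k block was written. It is also far larger than the 4.3G file system it lives on.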
Thanks,
Fengguang