Latest version of the hwpoison patchkit.
I rebased it on the mmotm of today to make it easier for Andrew to merge.
All comments have been addressed I believe (except I didn't move the
handlers out to other files)
For details please see the changelogs in the individual patches.
Andrew,
Please consider for merging.
Nick:
I addressed your comments on truncate. Can you review that
part? Is it ok for you now?
I also added a check for metadata buffer pages. Can you please review
that too?
And I integrated your truncate patch. The only thing missing
is a Signed-off-by, can you provide that please?
Also I thought a bit about the fsync() error scenario. It's really
a problem that can already happen even without hwpoison, e.g.
when a page is dropped at the wrong time. The real fix would
be to make address space errors more sticky. Fengguang has been
looking at that, but it's probably not something for .31.
I wrote up a detailed comment in the code describing the issue.
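To make the scenario concrete, here is an illustrative user space sketch
(not part of the patchkit; the file name is made up). It only shows the
reporting semantics: because the address space error (AS_EIO) is cleared
by the first call that checks it, a second fsync() on the same file can
succeed even though data was lost:

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);

		memset(buf, 'x', sizeof(buf));
		write(fd, buf, sizeof(buf));	/* dirty a page cache page */

		/* ... assume the dirty page is hit by a memory failure here ... */

		if (fsync(fd) < 0 && errno == EIO)
			fprintf(stderr, "first fsync: EIO, data was lost\n");
		if (fsync(fd) == 0)
			fprintf(stderr, "second fsync: success, the error was not sticky\n");

		close(fd);
		return 0;
	}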
Thanks,
-Andi
---
Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.
This patchkit implements the necessary infrastructure in the VM.
To quote the overview comment:
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focuses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* Some of the operations here are somewhat inefficient and have non
* linear algorithmic complexity, because the data structures have not
* been optimized for this case. This is in particular the case
* for the mapping from a vma to a process. Since this case is expected
* to be rare we hope we can get away with this.
The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.
The main target right now is KVM guests, but it works for all kinds
of applications.
For the KVM use case a new signal type was needed so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.
This is not fully complete yet; in particular there are still various
ways to access poisoned pages (crash dump, /proc/kcore etc.)
that need to be plugged too.
Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome.
The patch series requires the earlier x86 MCE feature series for the x86
specific action optional part. The code can be tested without the x86 specific
part using the injector; this only requires enabling the Kconfig entry
manually in some Kconfig file (by default it is implicitly enabled
by the architecture).
v2: Lots of smaller changes in the series based on review feedback.
Rename Poison to HWPoison after akpm's request.
A new pfn based injector based on feedback.
A lot of improvements mostly from Fengguang Wu
See comments in the individual patches.
v3: Various updates, see changelogs in individual patches.
-Andi
Needed for a later patch that walks rmap entries on its own.
This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/rmap.h | 6 ++++++
mm/rmap.c | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-06-03 19:36:22.000000000 +0200
+++ linux/include/linux/rmap.h 2009-06-03 20:39:50.000000000 +0200
@@ -115,6 +115,12 @@
int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
#endif
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-06-03 19:36:22.000000000 +0200
+++ linux/mm/rmap.c 2009-06-03 20:39:50.000000000 +0200
@@ -191,7 +191,7 @@
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;
@@ -211,7 +211,7 @@
return NULL;
}
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
Hardware poisoned pages need special handling in the VM and shouldn't be
touched again. This requires a new page flag. Define it here.
The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.
v2: Add TestSetHWPoison (suggested by Johannes Weiner)
Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/page-flags.h | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h 2009-06-03 19:36:23.000000000 +0200
+++ linux/include/linux/page-flags.h 2009-06-03 19:36:23.000000000 +0200
@@ -51,6 +51,9 @@
* PG_buddy is set to indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
*
+ * PG_hwpoison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
*/
/*
@@ -102,6 +105,9 @@
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+ PG_hwpoison, /* hardware poisoned page. Don't touch */
+#endif
__NR_PAGEFLAGS,
/* Filesystems */
@@ -263,6 +269,15 @@
PAGEFLAG_FALSE(Uncached)
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(HWPoison, hwpoison)
+TESTSETFLAG(HWPoison, hwpoison)
+#define __PG_HWPOISON (1UL << PG_hwpoison)
+#else
+PAGEFLAG_FALSE(HWPoison)
+#define __PG_HWPOISON 0
+#endif
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -387,7 +402,7 @@
1 << PG_private | 1 << PG_private_2 | \
1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- 1 << PG_unevictable | __PG_MLOCKED)
+ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
/*
* Flags checked when a page is prepped for return by the page allocator.
Memory migration uses special swap entry types to trigger special actions on
page faults. Extend this mechanism to also support poisoned swap entries, to
trigger poison handling on page faults. This allows follow-on patches to
prevent processes from faulting in poisoned pages again.
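For illustration, with the default MAX_SWAPFILES_SHIFT of 5 and both
CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE enabled, the type numbers end
up laid out as follows (a worked example, not part of the patch):

	/*
	 * 1 << MAX_SWAPFILES_SHIFT = 32 possible swap type values
	 *
	 * MAX_SWAPFILES       = 32 - 2 - 1 = 29   (real swap files use types 0..28)
	 * SWP_HWPOISON        = 29
	 * SWP_MIGRATION_READ  = 30
	 * SWP_MIGRATION_WRITE = 31
	 *
	 * non_swap_entry() then reduces to "swp_type(entry) >= MAX_SWAPFILES".
	 */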
v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/swap.h | 34 ++++++++++++++++++++++++++++------
include/linux/swapops.h | 38 ++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 4 ++--
3 files changed, 68 insertions(+), 8 deletions(-)
Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h 2009-06-03 19:36:22.000000000 +0200
+++ linux/include/linux/swap.h 2009-06-03 19:36:23.000000000 +0200
@@ -34,16 +34,38 @@
* the type/offset into the pte as 5/27 as well.
*/
#define MAX_SWAPFILES_SHIFT 5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES (1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
#else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
#endif
/*
+ * Handling of hardware poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_HWPOISON_NUM 1
+#define SWP_HWPOISON MAX_SWAPFILES
+#else
+#define SWP_HWPOISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
* Magic header for a swap area. The first part of the union is
* what the swap magic looks like for the old (limited to 128MB)
* swap area format, the second part of the union adds - in the
Index: linux/include/linux/swapops.h
===================================================================
--- linux.orig/include/linux/swapops.h 2009-06-03 19:36:22.000000000 +0200
+++ linux/include/linux/swapops.h 2009-06-03 19:36:23.000000000 +0200
@@ -131,3 +131,41 @@
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_HWPOISON;
+}
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return swp_type(entry) >= MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return 0;
+}
+#endif
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c 2009-06-03 19:36:22.000000000 +0200
+++ linux/mm/swapfile.c 2009-06-03 19:36:23.000000000 +0200
@@ -697,7 +697,7 @@
struct swap_info_struct *p;
struct page *page = NULL;
- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return 1;
p = swap_info_get(entry);
@@ -2083,7 +2083,7 @@
int count;
bool has_cache;
- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return -EINVAL;
type = swp_type(entry);
Add new SIGBUS codes for reporting machine checks as signals. When
the hardware detects an uncorrected ECC error it can trigger these
signals.
This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but might also be useful for other programs.
I find it useful in my test programs.
This patch merely defines the new types.
- Define two new si_codes for SIGBUS: BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application ran into an area
which has been known to be corrupted earlier. These require immediate
action and cannot just be returned to. Most programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.
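As a rough illustration of how a specialized application could consume
these codes (sketch only, not part of the patch; it assumes the new
si_code values and the si_addr_lsb field are visible through the
installed kernel headers):

	#include <signal.h>
	#include <stdio.h>
	#include <unistd.h>

	static void memory_error_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == BUS_MCEERR_AO) {
			/* corruption found in the background, nothing consumed yet */
			fprintf(stderr, "AO error at %p, extent 2^%d bytes\n",
				si->si_addr, si->si_addr_lsb);
			return;		/* the application may choose to continue */
		}
		if (si->si_code == BUS_MCEERR_AR) {
			/* corrupted data was about to be consumed, cannot continue here */
			fprintf(stderr, "AR error at %p\n", si->si_addr);
			_exit(1);
		}
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= memory_error_handler,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
		/* ... application work ... */
		return 0;
	}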
AK: I plan to write a man page update before anyone asks.
Signed-off-by: Andi Kleen <[email protected]>
---
include/asm-generic/siginfo.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
Index: linux/include/asm-generic/siginfo.h
===================================================================
--- linux.orig/include/asm-generic/siginfo.h 2009-06-03 19:36:22.000000000 +0200
+++ linux/include/asm-generic/siginfo.h 2009-06-03 19:36:23.000000000 +0200
@@ -82,6 +82,7 @@
#ifdef __ARCH_SI_TRAPNO
int _trapno; /* TRAP # which caused the signal */
#endif
+ short _addr_lsb; /* LSB of the reported address */
} _sigfault;
/* SIGPOLL */
@@ -112,6 +113,7 @@
#ifdef __ARCH_SI_TRAPNO
#define si_trapno _sifields._sigfault._trapno
#endif
+#define si_addr_lsb _sifields._sigfault._addr_lsb
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
@@ -192,7 +194,11 @@
#define BUS_ADRALN (__SI_FAULT|1) /* invalid address alignment */
#define BUS_ADRERR (__SI_FAULT|2) /* non-existant physical address */
#define BUS_OBJERR (__SI_FAULT|3) /* object specific hardware error */
-#define NSIGBUS 3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR (__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO (__SI_FAULT|5)
+#define NSIGBUS 5
/*
* SIGTRAP si_codes
From: Wu Fengguang <[email protected]>
If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, then prep_new_page()
will automatically check and isolate PG_hwpoison page for us (for 0-order
allocation).
This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.
Note that the common case -- only allocating a single page -- doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.
This simple implementation may drop some innocent neighbor pages; hopefully
that is not a big problem because the event should be rare enough.
This patch adds some runtime costs to high order page users.
[AK: Improved description]
v2: Andi Kleen:
Port to -mm code
Move check into separate function.
Don't dump stack in bad_pages for hwpoisoned pages.
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
mm/page_alloc.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c 2009-06-03 19:37:39.000000000 +0200
+++ linux/mm/page_alloc.c 2009-06-03 20:13:43.000000000 +0200
@@ -237,6 +237,12 @@
static unsigned long nr_shown;
static unsigned long nr_unshown;
+ /* Don't complain about poisoned pages */
+ if (PageHWPoison(page)) {
+ __ClearPageBuddy(page);
+ return;
+ }
+
/*
* Allow a burst of 60 reports, then keep quiet for that minute;
* or allow a steady drip of one report per second.
@@ -650,7 +656,7 @@
/*
* This page is about to be returned from the page allocator
*/
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static inline int check_new_page(struct page *page)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
@@ -659,6 +665,18 @@
bad_page(page);
return 1;
}
+ return 0;
+}
+
+static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+{
+ int i;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page *p = page + i;
+ if (unlikely(check_new_page(p)))
+ return 1;
+ }
set_page_private(page, 0);
set_page_refcounted(page);
try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours (e.g.
migration turns off aging, not only sets up migration ptes etc.)
Also the different flags interact in magic ways.
A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.
Replace the different flags with an action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want to
do.
This patch is supposed to be a nop in behaviour. If anyone can prove
it is not, that would be a bug.
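For orientation, a typical call site conversion (from the migrate.c hunk
below) looks like this: the old magic int argument

	try_to_unmap(page, 1);

becomes an explicit action plus modifier flags

	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);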
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/rmap.h | 14 +++++++++++++-
mm/migrate.c | 2 +-
mm/rmap.c | 40 ++++++++++++++++++++++------------------
mm/vmscan.c | 2 +-
4 files changed, 37 insertions(+), 21 deletions(-)
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
+++ linux/include/linux/rmap.h 2009-06-03 20:39:50.000000000 +0200
@@ -84,7 +84,19 @@
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+ TTU_UNMAP = 0, /* unmap mode */
+ TTU_MIGRATION = 1, /* migration mode */
+ TTU_MUNLOCK = 2, /* munlock mode */
+ TTU_ACTION_MASK = 0xff,
+
+ TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
+ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
/*
* Called from mm/filemap_xip.c to unmap empty zero page
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-06-03 19:36:23.000000000 +0200
+++ linux/mm/rmap.c 2009-06-03 20:39:50.000000000 +0200
@@ -897,7 +897,7 @@
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- int migration)
+ enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -919,11 +919,13 @@
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration) {
+ if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
}
+ }
+ if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address, pte)) {
ret = SWAP_FAIL;
goto out_unmap;
@@ -963,12 +965,12 @@
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- BUG_ON(!migration);
+ BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
entry = make_migration_entry(page, pte_write(pteval));
}
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
- } else if (PAGE_MIGRATION && migration) {
+ } else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
/* Establish migration entry for a file page */
swp_entry_t entry;
entry = make_migration_entry(page, pte_write(pteval));
@@ -1137,12 +1139,13 @@
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1158,7 +1161,7 @@
continue; /* must visit all unlocked vmas */
ret = SWAP_MLOCK; /* saw at least one mlocked vma */
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
}
@@ -1182,8 +1185,7 @@
/**
* try_to_unmap_file - unmap/unlock file page using the object-based rmap method
* @page: the page to unmap/unlock
- * @unlock: request for unlock rather than unmap [unlikely]
- * @migration: unmapping for migration - ignored if @unlock
+ * @flags: action and flags
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
@@ -1195,7 +1197,7 @@
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1207,6 +1209,7 @@
unsigned long max_nl_size = 0;
unsigned int mapcount;
unsigned int mlocked = 0;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1219,7 +1222,7 @@
continue; /* must visit all vmas */
ret = SWAP_MLOCK;
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
goto out;
}
@@ -1244,7 +1247,8 @@
ret = SWAP_MLOCK; /* leave mlocked == 0 */
goto out; /* no need to look further */
}
- if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+ (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -1278,7 +1282,7 @@
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (!MLOCK_PAGES && !migration &&
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
@@ -1318,7 +1322,7 @@
/**
* try_to_unmap - try to remove all page table mappings to a page
* @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
*
* Tries to remove all the page table entries which are mapping this
* page, used in the pageout path. Caller must hold the page lock.
@@ -1329,16 +1333,16 @@
* SWAP_FAIL - the page is unswappable
* SWAP_MLOCK - page is mlocked.
*/
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
{
int ret;
BUG_ON(!PageLocked(page));
if (PageAnon(page))
- ret = try_to_unmap_anon(page, 0, migration);
+ ret = try_to_unmap_anon(page, flags);
else
- ret = try_to_unmap_file(page, 0, migration);
+ ret = try_to_unmap_file(page, flags);
if (ret != SWAP_MLOCK && !page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
@@ -1363,8 +1367,8 @@
VM_BUG_ON(!PageLocked(page) || PageLRU(page));
if (PageAnon(page))
- return try_to_unmap_anon(page, 1, 0);
+ return try_to_unmap_anon(page, TTU_MUNLOCK);
else
- return try_to_unmap_file(page, 1, 0);
+ return try_to_unmap_file(page, TTU_MUNLOCK);
}
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c 2009-06-03 19:36:21.000000000 +0200
+++ linux/mm/vmscan.c 2009-06-03 19:36:23.000000000 +0200
@@ -659,7 +659,7 @@
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, 0)) {
+ switch (try_to_unmap(page, TTU_UNMAP)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c 2009-06-03 19:36:21.000000000 +0200
+++ linux/mm/migrate.c 2009-06-03 19:36:23.000000000 +0200
@@ -669,7 +669,7 @@
}
/* Establish migration ptes or remove ptes */
- try_to_unmap(page, 1);
+ try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
Bail out early when hardware poisoned pages are found in page fault handling.
Since they are poisoned they should not be mapped freshly into processes,
because that would cause another (potentially deadly) machine check.
This is generally handled in the same way as OOM, just a different
error code is returned to the architecture code.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/memory.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
+++ linux/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
@@ -2797,6 +2797,9 @@
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;
+ if (unlikely(PageHWPoison(vmf.page)))
+ return VM_FAULT_HWPOISON;
+
/*
* For consistency in subsequent calls, make the faulted page always
* locked.
- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
architectures have to explicitly enable poison page support, so
this is forward compatible to all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling in swap in fault code
v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
v3: Really use delayacct_clear_flag (Hidehiro Kawai)
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/mm.h | 3 ++-
mm/memory.c | 18 +++++++++++++++---
2 files changed, 17 insertions(+), 4 deletions(-)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2009-06-03 19:36:21.000000000 +0200
+++ linux/mm/memory.c 2009-06-03 20:39:50.000000000 +0200
@@ -1315,7 +1315,8 @@
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return i ? i : -ENOMEM;
- else if (ret & VM_FAULT_SIGBUS)
+ if (ret &
+ (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
return i ? i : -EFAULT;
BUG();
}
@@ -2597,8 +2598,15 @@
goto out;
entry = pte_to_swp_entry(orig_pte);
- if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ if (unlikely(non_swap_entry(entry))) {
+ if (is_migration_entry(entry)) {
+ migration_entry_wait(mm, pmd, address);
+ } else if (is_hwpoison_entry(entry)) {
+ ret = VM_FAULT_HWPOISON;
+ } else {
+ print_bad_pte(vma, address, pte, NULL);
+ ret = VM_FAULT_OOM;
+ }
goto out;
}
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2622,6 +2630,10 @@
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
+ } else if (PageHWPoison(page)) {
+ ret = VM_FAULT_HWPOISON;
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ goto out;
}
lock_page(page);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2009-06-03 19:36:21.000000000 +0200
+++ linux/include/linux/mm.h 2009-06-03 20:39:49.000000000 +0200
@@ -702,11 +702,12 @@
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
+#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
-#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
/*
* Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.
v2: Make the printk more verbose/unique
Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/mm/fault.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c 2009-06-03 19:36:21.000000000 +0200
+++ linux/arch/x86/mm/fault.c 2009-06-03 19:36:23.000000000 +0200
@@ -166,6 +166,7 @@
info.si_errno = 0;
info.si_code = si_code;
info.si_addr = (void __user *)address;
+ info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
force_sig_info(si_signo, &info, tsk);
}
@@ -797,10 +798,12 @@
}
static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+ unsigned int fault)
{
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ int code = BUS_ADRERR;
up_read(&mm->mmap_sem);
@@ -816,7 +819,15 @@
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+ if (fault & VM_FAULT_HWPOISON) {
+ printk(KERN_ERR
+ "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+ tsk->comm, tsk->pid, address);
+ code = BUS_MCEERR_AR;
+ }
+#endif
+ force_sig_info_fault(SIGBUS, code, address, tsk);
}
static noinline void
@@ -826,8 +837,8 @@
if (fault & VM_FAULT_OOM) {
out_of_memory(regs, error_code, address);
} else {
- if (fault & VM_FAULT_SIGBUS)
- do_sigbus(regs, error_code, address);
+ if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
+ do_sigbus(regs, error_code, address, fault);
else
BUG();
}
Bail out early in set_page_dirty for poisoned pages. We don't want any
of the dirty accounting done or file system write back started, because
the page will be just thrown away.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/page-writeback.c | 4 ++++
1 file changed, 4 insertions(+)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2009-06-03 19:36:20.000000000 +0200
+++ linux/mm/page-writeback.c 2009-06-03 19:36:23.000000000 +0200
@@ -1304,6 +1304,10 @@
{
struct address_space *mapping = page_mapping(page);
+ if (unlikely(PageHWPoison(page))) {
+ SetPageDirty(page);
+ return 0;
+ }
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
#ifdef CONFIG_BLOCK
When a page has the poison bit set replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag to not do that (needed for the memory-failure handler
later).
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/rmap.h | 1 +
mm/rmap.c | 9 ++++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-06-03 19:36:23.000000000 +0200
+++ linux/mm/rmap.c 2009-06-03 20:39:49.000000000 +0200
@@ -943,7 +943,14 @@
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
- if (PageAnon(page)) {
+ if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+ if (PageAnon(page))
+ dec_mm_counter(mm, anon_rss);
+ else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+ dec_mm_counter(mm, file_rss);
+ set_pte_at(mm, address, pte,
+ swp_entry_to_pte(make_hwpoison_entry(page)));
+ } else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(page) };
if (PageSwapCache(page)) {
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
+++ linux/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
@@ -93,6 +93,7 @@
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
+ TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.
This is done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.
The code that does this is portable and lives in mm/memory-failure.c
To quote the overview comment:
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focuses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* Some of the operations here are somewhat inefficient and have non
* linear algorithmic complexity, because the data structures have not
* been optimized for this case. This is in particular the case
* for the mapping from a vma to a process. Since this case is expected
* to be rare we hope we can get away with this.
There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl, vm.memory_failure_early_kill.
The default is early kill.
The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.
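For reference, a minimal sketch of how architecture code might hand a
corrupted pfn to the new handler (illustrative only; everything except
memory_failure() itself is made up here). Since memory_failure() must run
in process context, the error is deferred through a work item:

	#include <linux/kernel.h>
	#include <linux/workqueue.h>
	#include <linux/slab.h>
	#include <linux/mm.h>

	struct mce_recover_work {
		struct work_struct work;
		unsigned long pfn;
		int trapno;
	};

	static void mce_do_recover(struct work_struct *w)
	{
		struct mce_recover_work *mw =
			container_of(w, struct mce_recover_work, work);

		memory_failure(mw->pfn, mw->trapno);
		kfree(mw);
	}

	/*
	 * Called after the machine check has been logged, from a context
	 * that allows atomic allocations (not from the #MC exception itself).
	 */
	static void mce_queue_recover(unsigned long pfn, int trapno)
	{
		struct mce_recover_work *mw = kmalloc(sizeof(*mw), GFP_ATOMIC);

		if (!mw)
			return;
		INIT_WORK(&mw->work, mce_do_recover);
		mw->pfn = pfn;
		mw->trapno = trapno;
		schedule_work(&mw->work);
	}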
v2: Fix anon vma unlock crash (noticed by Johannes Weiner <[email protected]>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu:
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)
v5: Improve various comments
Handle page_address_in_vma() failure better by SIGKILL and also
make message debugging only
Clean up printks
Remove redundant PageWriteback check (Nick Piggin)
Add missing clear_page_mlock
Reformat state table to be <80 columns again
Use truncate helper instead of manual truncate in me_pagecache_*
Check for metadata buffer pages and reject them.
A few cleanups.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Hidehiro Kawai <[email protected]>
---
Documentation/sysctl/vm.txt | 21 +
fs/proc/meminfo.c | 9
include/linux/mm.h | 5
kernel/sysctl.c | 14
mm/Kconfig | 3
mm/Makefile | 1
mm/filemap.c | 4
mm/memory-failure.c | 761 ++++++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 4
9 files changed, 820 insertions(+), 2 deletions(-)
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile 2009-06-03 19:37:38.000000000 +0200
+++ linux/mm/Makefile 2009-06-03 20:39:48.000000000 +0200
@@ -42,5 +42,6 @@
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c 2009-06-03 20:30:43.000000000 +0200
@@ -0,0 +1,761 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Authors: Andi Kleen, Fengguang Wu
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#define DEBUG 1
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+ unsigned long pfn)
+{
+ struct siginfo si;
+ int ret;
+
+ printk(KERN_ERR
+ "MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
+ pfn, t->comm, t->pid);
+ si.si_signo = SIGBUS;
+ si.si_errno = 0;
+ si.si_code = BUS_MCEERR_AO;
+ si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+ si.si_trapno = trapno;
+#endif
+ si.si_addr_lsb = PAGE_SHIFT;
+ /*
+ * Don't use force here, it's convenient if the signal
+ * can be temporarily blocked.
+ * This could cause a loop when the user sets SIGBUS
+ * to SIG_IGN, but hopefully no one will do that?
+ */
+ ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
+ if (ret < 0)
+ printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+ t->comm, t->pid, ret);
+ return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+ struct list_head nd;
+ struct task_struct *tsk;
+ unsigned long addr;
+ unsigned addr_valid:1;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do. We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+ struct vm_area_struct *vma,
+ struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct to_kill *tk;
+
+ if (*tkc) {
+ tk = *tkc;
+ *tkc = NULL;
+ } else {
+ tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+ if (!tk) {
+ printk(KERN_ERR
+ "MCE: Out of memory while machine check handling\n");
+ return;
+ }
+ }
+ tk->addr = page_address_in_vma(p, vma);
+ tk->addr_valid = 1;
+
+ /*
+ * In theory we don't have to kill when the page was
+ * munmaped. But it could be also a mremap. Since that's
+ * likely very rare kill anyways just out of paranoia, but use
+ * a SIGKILL because the error is not contained anymore.
+ */
+ if (tk->addr == -EFAULT) {
+ pr_debug("MCE: Unable to find user space address %lx in %s\n",
+ page_to_pfn(p), tsk->comm);
+ tk->addr_valid = 0;
+ }
+ get_task_struct(tsk);
+ tk->tsk = tsk;
+ list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+ int fail, unsigned long pfn)
+{
+ struct to_kill *tk, *next;
+
+ list_for_each_entry_safe (tk, next, to_kill, nd) {
+ if (doit) {
+ /*
+ * In case something went wrong with munmapping
+ * make sure the process doesn't catch the
+ * signal and then access the memory. Just kill it.
+ */
+ if (fail || tk->addr_valid == 0) {
+ printk(KERN_ERR
+ "MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+ pfn, tk->tsk->comm, tk->tsk->pid);
+ force_sig(SIGKILL, tk->tsk);
+ }
+
+ /*
+ * In theory the process could have mapped
+ * something else on the address in-between. We could
+ * check for that, but we need to tell the
+ * process anyways.
+ */
+ else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+ pfn) < 0)
+ printk(KERN_ERR
+ "MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+ pfn, tk->tsk->comm, tk->tsk->pid);
+ }
+ put_task_struct(tk->tsk);
+ kfree(tk);
+ }
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct anon_vma *av = page_lock_anon_vma(page);
+
+ if (av == NULL) /* Not actually mapped anymore */
+ return;
+
+ read_lock(&tasklist_lock);
+ for_each_process (tsk) {
+ if (!tsk->mm)
+ continue;
+ list_for_each_entry (vma, &av->head, anon_vma_node) {
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ }
+ page_unlock_anon_vma(av);
+ read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct prio_tree_iter iter;
+ struct address_space *mapping = page_mapping(page);
+
+ /*
+ * A note on the locking order between the two locks.
+ * We don't rely on this particular order.
+ * If you have some other code that needs a different order
+ * feel free to switch them around. Or add a reverse link
+ * from mm_struct to task_struct, then this could be all
+ * done without taking tasklist_lock and looping over all tasks.
+ */
+
+ read_lock(&tasklist_lock);
+ spin_lock(&mapping->i_mmap_lock);
+ for_each_process(tsk) {
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ if (!tsk->mm)
+ continue;
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+ pgoff)
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliable.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+ struct to_kill *tk;
+
+ tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+ /* memory allocation failure is implicitly handled */
+ if (PageAnon(page))
+ collect_procs_anon(page, tokill, &tk);
+ else
+ collect_procs_file(page, tokill, &tk);
+ kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+ FAILED, /* Error handling failed */
+ DELAYED, /* Will be handled later */
+ IGNORED, /* Error safely ignored */
+ RECOVERED, /* Successfully recovered */
+};
+
+static const char *action_name[] = {
+ [FAILED] = "Failed",
+ [DELAYED] = "Delayed",
+ [IGNORED] = "Ignored",
+ [RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p, unsigned long pfn)
+{
+ return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p, unsigned long pfn)
+{
+ return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p, unsigned long pfn)
+{
+ printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+ return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p, unsigned long pfn)
+{
+ return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+ struct address_space *mapping;
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ /*
+ * Now truncate the page in the page cache. This is really
+ * more like a "temporary hole punch"
+ * Don't do this for block devices when someone else
+ * has a reference, because it could be file system metadata
+ * and that's not safe to truncate.
+ */
+ mapping = page_mapping(p);
+ if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1) {
+ printk(KERN_ERR
+ "MCE %#lx: page looks like a unsupported file system metadata page\n",
+ pfn);
+ return FAILED;
+ }
+ if (mapping) {
+ truncate_inode_page(mapping, p);
+ if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
+ pr_debug("MCE %#lx: failed to release buffers\n",
+ pfn);
+ return FAILED;
+ }
+ }
+ return RECOVERED;
+}
+
+/*
+ * Dirty pagecache page
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+{
+ struct address_space *mapping = page_mapping(p);
+
+ SetPageError(p);
+ /* TBD: print more information about the file. */
+ if (mapping) {
+ /*
+ * IO error will be reported by write(), fsync(), etc.
+ * who check the mapping.
+ * This way the application knows that something went
+ * wrong with its dirty file data.
+ *
+ * There's one open issue:
+ *
+ * The EIO will be only reported on the next IO
+ * operation and then cleared through the IO map.
+ * Normally Linux has two mechanisms to pass IO error
+ * first through the AS_EIO flag in the address space
+ * and then through the PageError flag in the page.
+ * Since we drop pages on memory failure handling the
+ * only mechanism open to use is through AS_EIO.
+ *
+ * This has the disadvantage that it gets cleared on
+ * the first operation that returns an error, while
+ * the PageError bit is more sticky and only cleared
+ * when the page is reread or dropped. If an
+ * application assumes it will always get error on
+ * fsync, but does other operations on the fd before
+ * and the page is dropped in between, then the error
+ * will not be properly reported.
+ *
+ * This can already happen even without hwpoisoned
+ * pages: first on metadata IO errors (which only
+ * report through AS_EIO) or when the page is dropped
+ * at the wrong time.
+ *
+ * So right now we assume that the application DTRT on
+ * the first EIO, but we're not worse than other parts
+ * of the kernel.
+ */
+ mapping_set_error(mapping, EIO);
+ }
+
+ return me_pagecache_clean(p, pfn);
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache (i.e. page is freshly swapped in). So it could be
+ * referenced concurrently by 2 types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ * - clear dirty bit to prevent IO
+ * - remove from LRU
+ * - but keep in the swap cache, so that when we return to it on
+ * a later page fault, we know the application is accessing
+ * corrupted data and shall be killed (we installed simple
+ * interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated. A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+{
+ ClearPageDirty(p);
+ /* Trigger EIO in shmem: */
+ ClearPageUptodate(p);
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p, unsigned long pfn)
+{
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ delete_from_swap_cache(p);
+
+ return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p, unsigned long pfn)
+{
+ return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty (1UL << PG_dirty)
+#define sc (1UL << PG_swapcache)
+#define unevict (1UL << PG_unevictable)
+#define mlock (1UL << PG_mlocked)
+#define writeback (1UL << PG_writeback)
+#define lru (1UL << PG_lru)
+#define swapbacked (1UL << PG_swapbacked)
+#define head (1UL << PG_head)
+#define tail (1UL << PG_tail)
+#define compound (1UL << PG_compound)
+#define slab (1UL << PG_slab)
+#define buddy (1UL << PG_buddy)
+#define reserved (1UL << PG_reserved)
+
+static struct page_state {
+ unsigned long mask;
+ unsigned long res;
+ char *msg;
+ int (*action)(struct page *p, unsigned long pfn);
+} error_states[] = {
+ { reserved, reserved, "reserved kernel", me_ignore },
+ { buddy, buddy, "free kernel", me_free },
+
+ /*
+ * Could in theory check if slab page is free or if we can drop
+ * currently unused objects without touching them. But just
+ * treat it as standard kernel for now.
+ */
+ { slab, slab, "kernel slab", me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ { head, head, "huge", me_huge_page },
+ { tail, tail, "huge", me_huge_page },
+#else
+ { compound, compound, "huge", me_huge_page },
+#endif
+
+ { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty },
+ { sc|dirty, sc, "swapcache", me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+ { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty},
+ { unevict, unevict, "unevictable LRU", me_pagecache_clean},
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+ { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty },
+ { mlock, mlock, "mlocked LRU", me_pagecache_clean },
+#endif
+
+ { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
+ { lru|dirty, lru, "clean LRU", me_pagecache_clean },
+ { swapbacked, swapbacked, "anonymous", me_pagecache_clean },
+
+ /*
+ * Catchall entry: must be at end.
+ */
+ { 0, 0, "unknown page state", me_unknown },
+};
+
+static void action_result(unsigned long pfn, char *msg, int ret)
+{
+ printk(KERN_ERR "MCE %#lx: %s page recovery: %s%s\n",
+ pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
+ msg, action_name[ret]);
+}
+
+static void page_action(struct page_state *ps, struct page *p,
+ unsigned long pfn)
+{
+ int ret;
+
+ ret = ps->action(p, pfn);
+ action_result(pfn, ps->msg, ret);
+ if (page_count(p) != 1)
+ printk(KERN_ERR
+ "MCE %#lx: %s page still referenced by %d users\n",
+ pfn, ps->msg, page_count(p) - 1);
+
+ /* Could do more checks here if page looks ok */
+ atomic_long_add(1, &mce_bad_pages);
+
+ /*
+ * Could adjust zone counters here to correct for the missing page.
+ */
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+ int trapno)
+{
+ enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+ int kill = sysctl_memory_failure_early_kill;
+ struct address_space *mapping;
+ LIST_HEAD(tokill);
+ int ret;
+ int i;
+
+ if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+ return;
+
+ if (!PageLRU(p))
+ lru_add_drain();
+
+ /*
+ * This check implies we don't kill processes if their pages
+ * are in the swap cache early. Those are always late kills.
+ */
+ if (!page_mapped(p))
+ return;
+
+ if (PageSwapCache(p)) {
+ printk(KERN_ERR
+ "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+ ttu |= TTU_IGNORE_HWPOISON;
+ }
+
+ /*
+ * Propagate the dirty bit from PTEs to struct page first, because we
+ * need this to decide if we should kill or just drop the page.
+ */
+ mapping = page_mapping(p);
+ if (!PageDirty(p) && !PageAnon(p) && !PageSwapBacked(p) &&
+ mapping && mapping_cap_account_dirty(mapping)) {
+ if (page_mkclean(p))
+ SetPageDirty(p);
+ else {
+ kill = 0;
+ printk(KERN_INFO
+ "MCE %#lx: corrupted page was clean: dropped without side effects\n",
+ pfn);
+ ttu |= TTU_IGNORE_HWPOISON;
+ }
+ }
+
+ /*
+ * First collect all the processes that have the page
+ * mapped. This has to be done before try_to_unmap,
+ * because ttu takes the rmap data structures down.
+ *
+ * This also has the side effect of propagating the dirty
+ * bit from PTEs into the struct page. This is needed
+ * to actually decide if something needs to be killed
+ * or errored, or if it's ok to just drop the page.
+ *
+ * Error handling: We ignore errors here because
+ * there's nothing that can be done.
+ */
+ if (kill)
+ collect_procs(p, &tokill);
+
+ /*
+ * try_to_unmap can fail temporarily due to races.
+ * Try a few times (RED-PEN better strategy?)
+ */
+ for (i = 0; i < N_UNMAP_TRIES; i++) {
+ ret = try_to_unmap(p, ttu);
+ if (ret == SWAP_SUCCESS)
+ break;
+ pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
+ }
+
+ /*
+ * Now that the dirty bit has been propagated to the
+ * struct page and all unmaps done we can decide if
+ * killing is needed or not. Only kill when the page
+ * was dirty, otherwise the tokill list is merely
+ * freed. When there was a problem unmapping earlier
+ * use a more forceful uncatchable kill to prevent
+ * any accesses to the poisoned memory.
+ */
+ kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+ ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ *
+ * This function is called by the low level machine check code
+ * of an architecture when it detects hardware memory corruption
+ * of a page. It tries its best to recover, which includes
+ * dropping pages, killing processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Must run in process context (e.g. a work queue) with interrupts
+ * enabled and no spinlocks held.
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+ struct page_state *ps;
+ struct page *p;
+
+ if (!pfn_valid(pfn)) {
+ action_result(pfn, "memory outside kernel control", IGNORED);
+ return;
+ }
+
+
+ p = pfn_to_page(pfn);
+ if (TestSetPageHWPoison(p)) {
+ action_result(pfn, "already hardware poisoned", IGNORED);
+ return;
+ }
+
+ /*
+ * We need/can do nothing about count=0 pages.
+ * 1) it's a free page, and therefore in safe hand:
+ * prep_new_page() will be the gate keeper.
+ * 2) it's part of a non-compound high order page.
+ * Implies some kernel user: cannot stop them from
+ * R/W the page; let's pray that the page has been
+ * used and will be freed some time later.
+ * In fact it's dangerous to directly bump up page count from 0,
+ * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+ */
+ if (!get_page_unless_zero(compound_head(p))) {
+ action_result(pfn, "free or high order kernel", IGNORED);
+ return;
+ }
+
+ /*
+ * Lock the page and wait for writeback to finish.
+ * It's very difficult to mess with pages currently under IO
+ * and in many cases impossible, so we just avoid it here.
+ */
+ lock_page_nosync(p);
+ wait_on_page_writeback(p);
+
+ /*
+ * Now take care of user space mappings.
+ */
+ hwpoison_user_mappings(p, pfn, trapno);
+
+ /*
+ * Torn down by someone else?
+ */
+ if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+ action_result(pfn, "already unmapped LRU", IGNORED);
+ goto out;
+ }
+
+ for (ps = error_states;; ps++) {
+ if ((p->flags & ps->mask) == ps->res) {
+ page_action(ps, p, pfn);
+ break;
+ }
+ }
+out:
+ unlock_page(p);
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2009-06-03 20:13:43.000000000 +0200
+++ linux/include/linux/mm.h 2009-06-03 20:13:43.000000000 +0200
@@ -1326,5 +1326,10 @@
extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim,
size_t size);
extern void refund_locked_memory(struct mm_struct *mm, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c 2009-06-03 19:37:38.000000000 +0200
+++ linux/kernel/sysctl.c 2009-06-03 20:13:43.000000000 +0200
@@ -1311,6 +1311,20 @@
.mode = 0644,
.proc_handler = &scan_unevictable_handler,
},
+#ifdef CONFIG_MEMORY_FAILURE
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "memory_failure_early_kill",
+ .data = &sysctl_memory_failure_early_kill,
+ .maxlen = sizeof(vm_highmem_is_dirtyable),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
+
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c 2009-06-03 19:37:38.000000000 +0200
+++ linux/fs/proc/meminfo.c 2009-06-03 20:13:43.000000000 +0200
@@ -95,7 +95,11 @@
"Committed_AS: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+ "BadPages: %8lu kB\n"
+#endif
+ ,
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -140,6 +144,9 @@
(unsigned long)VMALLOC_TOTAL >> 10,
vmi.used >> 10,
vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+ ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
);
hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig 2009-06-03 19:37:38.000000000 +0200
+++ linux/mm/Kconfig 2009-06-03 20:39:48.000000000 +0200
@@ -220,6 +220,9 @@
Enable the KSM kernel module to allow page sharing of equal pages
among different tasks.
+config MEMORY_FAILURE
+ bool
+
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
depends on !MMU
Index: linux/Documentation/sysctl/vm.txt
===================================================================
--- linux.orig/Documentation/sysctl/vm.txt 2009-06-03 19:37:38.000000000 +0200
+++ linux/Documentation/sysctl/vm.txt 2009-06-03 20:13:43.000000000 +0200
@@ -32,6 +32,7 @@
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- memory_failure_early_kill
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
@@ -53,7 +54,6 @@
- vfs_cache_pressure
- zone_reclaim_mode
-
==============================================================
block_dump
@@ -275,6 +275,25 @@
The default value is 65536.
+=============================================================
+
+memory_failure_early_kill:
+
+Control how to kill processes when uncorrected memory error (typically
+a 2bit error in a memory module) is detected in the background by hardware.
+
+1: Kill all processes that have the corrupted page mapped as soon as the
+corruption is detected.
+
+0: Only unmap the page from all processes and only kill a process
+who tries to access it.
+
+The kill is done using a catchable SIGBUS, so processes can handle this
+if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
==============================================================
min_free_kbytes:
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c 2009-06-03 19:37:38.000000000 +0200
+++ linux/mm/filemap.c 2009-06-03 20:13:43.000000000 +0200
@@ -105,6 +105,10 @@
*
* ->task->proc_lock
* ->dcache_lock (proc_pid_lookup)
+ *
+ * (code doesn't rely on that order, so you could switch it around)
+ * ->tasklist_lock (memory_failure, collect_procs_ao)
+ * ->i_mmap_lock
*/
/*
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-06-03 19:37:38.000000000 +0200
+++ linux/mm/rmap.c 2009-06-03 20:13:43.000000000 +0200
@@ -36,6 +36,10 @@
* mapping->tree_lock (widely used, in set_page_dirty,
* in arch-dependent flush_dcache_mmap_lock,
* within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ * anon_vma->lock (memory_failure, collect_procs_anon)
*/
#include <linux/mm.h>
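A usage note, not part of the patch: memory_failure() is documented above as
requiring process context with interrupts enabled, so an atomic machine check
handler has to defer into it. A rough sketch of how an architecture could do
that through a work queue follows; the struct and function names are made up
for illustration, and whether allocation is even safe in the arch's MCE path
is its own problem (a static or per-CPU slot may be needed in practice):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct memory_failure_work {
	struct work_struct work;
	unsigned long pfn;
	int trapno;
};

static void memory_failure_work_fn(struct work_struct *work)
{
	struct memory_failure_work *mfw =
		container_of(work, struct memory_failure_work, work);

	/* Process context, interrupts on, no locks held: safe to call */
	memory_failure(mfw->pfn, mfw->trapno);
	kfree(mfw);
}

/* Called from the architecture's (atomic) machine check handler */
static void schedule_memory_failure(unsigned long pfn, int trapno)
{
	struct memory_failure_work *mfw = kmalloc(sizeof(*mfw), GFP_ATOMIC);

	if (!mfw)
		return;
	INIT_WORK(&mfw->work, memory_failure_work_fn);
	mfw->pfn = pfn;
	mfw->trapno = trapno;
	schedule_work(&mfw->work);
}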
Useful for some testing scenarios, although specific testing is often
done better through MADV_HWPOISON.
This can be done with the x86 level MCE injector too, but this interface
allows it to be done independently of low level x86 changes.
Open issues:
Should be disabled for cgroups.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/Kconfig | 4 ++++
mm/Makefile | 1 +
mm/hwpoison-inject.c | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+)
Index: linux/mm/hwpoison-inject.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/hwpoison-inject.c 2009-06-03 20:30:47.000000000 +0200
@@ -0,0 +1,41 @@
+/* Inject a hwpoison memory failure on an arbitrary pfn */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+ memory_failure(val, 18);
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+ if (hwpoison_dir)
+ debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+ hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+ if (hwpoison_dir == NULL)
+ return -ENOMEM;
+ corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+ NULL, &hwpoison_fops);
+ if (corrupt_pfn == NULL) {
+ pfn_inject_exit();
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig 2009-06-03 20:30:47.000000000 +0200
+++ linux/mm/Kconfig 2009-06-03 20:30:47.000000000 +0200
@@ -225,6 +225,10 @@
default y
depends on MMU
+config HWPOISON_INJECT
+ tristate "Poison pages injector"
+ depends on MEMORY_FAILURE && DEBUG_KERNEL
+
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
depends on !MMU
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile 2009-06-03 20:30:30.000000000 +0200
+++ linux/mm/Makefile 2009-06-03 20:30:47.000000000 +0200
@@ -43,5 +43,6 @@
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
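For reference, the injector is driven by writing a pfn to the new debugfs
file; the string is parsed by the generic simple attribute code. A small
user-space helper along these lines should work (untested sketch; it assumes
debugfs is mounted at /sys/kernel/debug and the caller is root):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[64];
	int fd, len;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pfn>\n", argv[0]);
		return 1;
	}
	/* Write the pfn string; the kernel side parses it for us */
	fd = open("/sys/kernel/debug/hwpoison/corrupt-pfn", O_WRONLY);
	if (fd < 0) {
		perror("open corrupt-pfn");
		return 1;
	}
	len = snprintf(buf, sizeof(buf), "%s\n", argv[1]);
	if (write(fd, buf, len) != len)
		perror("write corrupt-pfn");
	close(fd);
	return 0;
}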
From: Nick Piggin <[email protected]>
Extract out truncate_inode_page() out of the truncate path so that
it can be used by memory-failure.c
[AK: description, headers, fix typos]
Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/mm.h | 2 ++
mm/truncate.c | 24 ++++++++++++------------
2 files changed, 14 insertions(+), 12 deletions(-)
Index: linux/mm/truncate.c
===================================================================
--- linux.orig/mm/truncate.c 2009-06-03 19:37:38.000000000 +0200
+++ linux/mm/truncate.c 2009-06-03 20:13:43.000000000 +0200
@@ -135,6 +135,16 @@
return ret;
}
+void truncate_inode_page(struct address_space *mapping, struct page *page)
+{
+ if (page_mapped(page)) {
+ unmap_mapping_range(mapping,
+ (loff_t)page->index<<PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ }
+ truncate_complete_page(mapping, page);
+}
+
/**
* truncate_inode_pages - truncate range of pages specified by start & end byte offsets
* @mapping: mapping to truncate
@@ -196,12 +206,7 @@
unlock_page(page);
continue;
}
- if (page_mapped(page)) {
- unmap_mapping_range(mapping,
- (loff_t)page_index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
- }
- truncate_complete_page(mapping, page);
+ truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_release(&pvec);
@@ -238,15 +243,10 @@
break;
lock_page(page);
wait_on_page_writeback(page);
- if (page_mapped(page)) {
- unmap_mapping_range(mapping,
- (loff_t)page->index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
- }
+ truncate_inode_page(mapping, page);
if (page->index > next)
next = page->index;
next++;
- truncate_complete_page(mapping, page);
unlock_page(page);
}
pagevec_release(&pvec);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2009-06-03 19:37:38.000000000 +0200
+++ linux/include/linux/mm.h 2009-06-03 20:39:49.000000000 +0200
@@ -811,6 +811,8 @@
extern int vmtruncate(struct inode * inode, loff_t offset);
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
+void truncate_inode_page(struct address_space *mapping, struct page *page);
+
#ifdef CONFIG_MMU
extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
Impact: optional, useful for debugging
Add a new madvise sub command to inject poison for some
pages in a process' address space. This is useful for
testing the poison page handling.
Open issues:
- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory, with a multi-threaded
injector, one could inject poison for another process' page this way.
Not a serious issue right now.
v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)
Signed-off-by: Andi Kleen <[email protected]>
---
include/asm-generic/mman-common.h | 1 +
mm/madvise.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c 2009-06-03 20:30:31.000000000 +0200
+++ linux/mm/madvise.c 2009-06-03 20:30:47.000000000 +0200
@@ -207,6 +207,38 @@
return error;
}
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+ /*
+ * RED-PEN
+ * This allows tying up arbitrary amounts of memory.
+ * Might be a good idea to disable it inside containers even for root.
+ */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ for (; start < end; start += PAGE_SIZE) {
+ struct page *p;
+ int ret = get_user_pages(current, current->mm, start, 1,
+ 0, 0, &p, NULL);
+ if (ret != 1)
+ return ret;
+ put_page(p);
+ /*
+ * RED-PEN page can be reused in a short window, but otherwise
+ * we'll have to fight with the reference count.
+ */
+ printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+ page_to_pfn(p), start);
+ memory_failure(page_to_pfn(p), 0);
+ }
+ return 0;
+}
+#endif
+
static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
@@ -307,6 +339,10 @@
int write;
size_t len;
+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_HWPOISON)
+ return madvise_hwpoison(start, start+len_in);
+#endif
if (!madvise_behavior_valid(behavior))
return error;
Index: linux/include/asm-generic/mman-common.h
===================================================================
--- linux.orig/include/asm-generic/mman-common.h 2009-06-03 20:30:31.000000000 +0200
+++ linux/include/asm-generic/mman-common.h 2009-06-03 20:30:47.000000000 +0200
@@ -34,6 +34,7 @@
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
+#define MADV_HWPOISON 12 /* poison a page for testing */
/* compatibility flags */
#define MAP_FILE 0
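A minimal user-space test for the new madvise command could look like the
sketch below (untested; it requires CAP_SYS_ADMIN and a kernel with this
patchkit, and the MADV_HWPOISON fallback define is only there in case the
installed headers predate this patch). With early kill enabled the process
may already get SIGBUS from the injection itself, because the page is dirty;
otherwise the final store is expected to fault:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 12
#endif

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0xaa, pagesize);	/* populate and dirty the page */

	if (madvise(p, pagesize, MADV_HWPOISON) < 0) {
		perror("madvise(MADV_HWPOISON)");
		return 1;
	}
	printf("page at %p poisoned, touching it now\n", p);
	p[0] = 1;	/* expected to raise SIGBUS */
	return 0;
}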
Normally the memory-failure.c code is enabled by the architecture, but
to make testing easier independently of architecture changes, enable it unconditionally here.
This should not be merged into mainline.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/Kconfig | 2 ++
1 file changed, 2 insertions(+)
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig 2009-06-03 20:30:32.000000000 +0200
+++ linux/mm/Kconfig 2009-06-03 20:39:48.000000000 +0200
@@ -222,6 +222,8 @@
config MEMORY_FAILURE
bool
+ default y
+ depends on MMU
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
On Thu, Jun 04, 2009 at 02:46:43AM +0800, Andi Kleen wrote:
>
> Bail out early in set_page_dirty for poisoned pages. We don't want any
> of the dirty accounting done or file system write back started, because
> the page will be just thrown away.
I'm afraid this patch is not necessary and could be harmful.
It is not necessary because a poisoned page will normally already be
isolated from page cache, or likely cannot be isolated because it has
dirty buffers.
It is harmful because it puts the page into dirty state without queuing
it for IO by moving it to s_io. When more normal pages are dirtied
later, __set_page_dirty_nobuffers() won't move the inode into s_io,
hence delaying the writeback of good pages for an arbitrarily long time.
Thanks,
Fengguang
> ---
> mm/page-writeback.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c 2009-06-03 19:36:20.000000000 +0200
> +++ linux/mm/page-writeback.c 2009-06-03 19:36:23.000000000 +0200
> @@ -1304,6 +1304,10 @@
> {
> struct address_space *mapping = page_mapping(page);
>
> + if (unlikely(PageHWPoison(page))) {
> + SetPageDirty(page);
> + return 0;
> + }
> if (likely(mapping)) {
> int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> #ifdef CONFIG_BLOCK
On Thu, Jun 04, 2009 at 02:46:47AM +0800, Andi Kleen wrote:
[snip]
This patch is full of this style error (the old version didn't have
this problem though):
ERROR: code indent should use tabs where possible
> +/*
> + * Clean (or cleaned) page cache page.
> + */
> +static int me_pagecache_clean(struct page *p, unsigned long pfn)
> +{
> + struct address_space *mapping;
> +
> + if (!isolate_lru_page(p))
> + page_cache_release(p);
> +
> + /*
> + * Now truncate the page in the page cache. This is really
> + * more like a "temporary hole punch"
> + * Don't do this for block devices when someone else
> + * has a reference, because it could be file system metadata
> + * and that's not safe to truncate.
> + */
> + mapping = page_mapping(p);
> + if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1) {
Shall use (page_count > 2) to count for the page cache reference.
Or can we base the test on busy buffers instead of page count? Nick?
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -347,7 +347,7 @@ static int me_free(struct page *p)
*/
static int me_pagecache_clean(struct page *p)
{
- struct address_space *mapping;
+ struct address_space *mapping = page_mapping(p);
if (!isolate_lru_page(p))
page_cache_release(p);
@@ -355,18 +355,17 @@ static int me_pagecache_clean(struct pag
/*
* Now truncate the page in the page cache. This is really
* more like a "temporary hole punch"
- * Don't do this for block devices when someone else
- * has a reference, because it could be file system metadata
- * and that's not safe to truncate.
+ * Don't do this for block device pages with busy buffers,
+ * because file system metadata may not be safe to truncate.
*/
- mapping = page_mapping(p);
- if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1)
- return FAILED;
if (mapping) {
+ if (S_ISBLK(mapping->host->i_mode) &&
+ !try_to_release_page(p, GFP_NOIO))
+ return FAILED;
truncate_inode_page(mapping, p);
if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
- pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
- page_to_pfn(p));
+ dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+ page_to_pfn(p));
return FAILED;
}
}
> + printk(KERN_ERR
> + "MCE %#lx: page looks like a unsupported file system metadata page\n",
> + pfn);
> + return FAILED;
> + }
> + if (mapping) {
> + truncate_inode_page(mapping, p);
> + if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
> + pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
> + pfn);
> + return FAILED;
> + }
> + }
> + return RECOVERED;
> +}
> +
> +/*
> + * Dirty cache page page
> + * Issues: when the error hit a hole page the error is not properly
> + * propagated.
> + */
> +static int me_pagecache_dirty(struct page *p, unsigned long pfn)
> +{
> + struct address_space *mapping = page_mapping(p);
> +
> + SetPageError(p);
> + /* TBD: print more information about the file. */
> + if (mapping) {
> + /*
> + * IO error will be reported by write(), fsync(), etc.
> + * who check the mapping.
btw, here are some side notes on EIO.
close() *may* also report it. NFS will sync file on close.
> + * This way the application knows that something went
> + * wrong with its dirty file data.
> + *
> + * There's one open issue:
> + *
> + * The EIO will be only reported on the next IO
> + * operation and then cleared through the IO map.
The report is not reliable in two ways:
- IO error may occur in a previous session
close() will or will not clear IO error depending on filesystem,
so IO error in this session may be seen by a following session.
- IO error may be cleared by a concurrent operation in another process
mapping->flags is shared between file handles, so fsync() on another
process will clear and report the IO error bits that were produced in
this process.
> + * Normally Linux has two mechanisms to pass IO error
> + * first through the AS_EIO flag in the address space
> + * and then through the PageError flag in the page.
> + * Since we drop pages on memory failure handling the
> + * only mechanism open to use is through AS_AIO.
We have to isolate the poisoned page from page cache, otherwise we'll
have to insert PageHWPoison() tests in dozens of places.
It's fine to keep poisoned pages in swap cache, because they are
referenced in a very limited scope.
> + *
> + * This has the disadvantage that it gets cleared on
> + * the first operation that returns an error, while
> + * the PageError bit is more sticky and only cleared
> + * when the page is reread or dropped. If an
> + * application assumes it will always get error on
> + * fsync, but does other operations on the fd before
> + * and the page is dropped inbetween then the error
> + * will not be properly reported.
> + *
> + * This can already happen even without hwpoisoned
> + * pages: first on metadata IO errors (which only
> + * report through AS_EIO) or when the page is dropped
> + * at the wrong time.
> + *
> + * So right now we assume that the application DTRT on
DTRT = do the return value test?
> + * the first EIO, but we're not worse than other parts
> + * of the kernel.
> + */
> + mapping_set_error(mapping, EIO);
> + }
> +
> + return me_pagecache_clean(p, pfn);
> +}
> +
> +/*
> + * Clean and dirty swap cache.
> + *
> + * Dirty swap cache page is tricky to handle. The page could live both in page
> + * cache and swap cache(ie. page is freshly swapped in). So it could be
> + * referenced concurrently by 2 types of PTEs:
> + * normal PTEs and swap PTEs. We try to handle them consistently by calling u
s/ u$//
> + * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
> + * and then
> + * - clear dirty bit to prevent IO
> + * - remove from LRU
> + * - but keep in the swap cache, so that when we return to it on
> + * a later page fault, we know the application is accessing
> + * corrupted data and shall be killed (we installed simple
> + * interception code in do_swap_page to catch it).
> + *
> + * Clean swap cache pages can be directly isolated. A later page fault will
> + * bring in the known good data from disk.
> + */
> +static int me_swapcache_dirty(struct page *p, unsigned long pfn)
> +{
> + ClearPageDirty(p);
> + /* Trigger EIO in shmem: */
> + ClearPageUptodate(p);
style nitpick:
ClearPageDirty(p); /* don't start IO on me */
ClearPageUptodate(p); /* to trigger EIO in shmem */
> +
> + if (!isolate_lru_page(p))
> + page_cache_release(p);
> +
> + return DELAYED;
> +}
> +
> +static int me_swapcache_clean(struct page *p, unsigned long pfn)
> +{
> + if (!isolate_lru_page(p))
> + page_cache_release(p);
> +
> + delete_from_swap_cache(p);
> +
> + return RECOVERED;
> +}
> +
> +/*
> + * Huge pages. Needs work.
> + * Issues:
> + * No rmap support so we cannot find the original mapper. In theory could walk
> + * all MMs and look for the mappings, but that would be non atomic and racy.
> + * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
> + * like just walking the current process and hoping it has it mapped (that
> + * should be usually true for the common "shared database cache" case)
> + * Should handle free huge pages and dequeue them too, but this needs to
> + * handle huge page accounting correctly.
> + */
> +static int me_huge_page(struct page *p, unsigned long pfn)
> +{
> + return FAILED;
> +}
> +
> +/*
> + * Various page states we can handle.
> + *
> + * A page state is defined by its current page->flags bits.
> + * The table matches them in order and calls the right handler.
> + *
> + * This is quite tricky because we can access page at any time
> + * in its live cycle, so all accesses have to be extremly careful.
> + *
> + * This is not complete. More states could be added.
> + * For any missing state don't attempt recovery.
> + */
> +
> +#define dirty (1UL << PG_dirty)
> +#define sc (1UL << PG_swapcache)
> +#define unevict (1UL << PG_unevictable)
> +#define mlock (1UL << PG_mlocked)
> +#define writeback (1UL << PG_writeback)
> +#define lru (1UL << PG_lru)
> +#define swapbacked (1UL << PG_swapbacked)
> +#define head (1UL << PG_head)
> +#define tail (1UL << PG_tail)
> +#define compound (1UL << PG_compound)
> +#define slab (1UL << PG_slab)
> +#define buddy (1UL << PG_buddy)
> +#define reserved (1UL << PG_reserved)
> +
> +static struct page_state {
> + unsigned long mask;
> + unsigned long res;
> + char *msg;
> + int (*action)(struct page *p, unsigned long pfn);
> +} error_states[] = {
> + { reserved, reserved, "reserved kernel", me_ignore },
> + { buddy, buddy, "free kernel", me_free },
> +
> + /*
> + * Could in theory check if slab page is free or if we can drop
> + * currently unused objects without touching them. But just
> + * treat it as standard kernel for now.
> + */
> + { slab, slab, "kernel slab", me_kernel },
> +
> +#ifdef CONFIG_PAGEFLAGS_EXTENDED
> + { head, head, "huge", me_huge_page },
> + { tail, tail, "huge", me_huge_page },
> +#else
> + { compound, compound, "huge", me_huge_page },
> +#endif
> +
> + { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty },
> + { sc|dirty, sc, "swapcache", me_swapcache_clean },
> +
> +#ifdef CONFIG_UNEVICTABLE_LRU
> + { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty},
> + { unevict, unevict, "unevictable LRU", me_pagecache_clean},
> +#endif
> +
> +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
> + { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty },
> + { mlock, mlock, "mlocked LRU", me_pagecache_clean },
> +#endif
> +
> + { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
> + { lru|dirty, lru, "clean LRU", me_pagecache_clean },
> + { swapbacked, swapbacked, "anonymous", me_pagecache_clean },
> +
> + /*
> + * Catchall entry: must be at end.
> + */
> + { 0, 0, "unknown page state", me_unknown },
> +};
> +
> +static void action_result(unsigned long pfn, char *msg, int ret)
rename 'ret' to 'action'?
> +{
> + printk(KERN_ERR "MCE %#lx: %s page recovery: %s%s\n",
> + pfn, PageDirty(pfn_to_page(pfn)) ? "dirty " : "",
> + msg, action_name[ret]);
ditto.
> +}
> +
> +static void page_action(struct page_state *ps, struct page *p,
> + unsigned long pfn)
> +{
> + int ret;
ditto.
> +
> + ret = ps->action(p, pfn);
> + action_result(pfn, ps->msg, ret);
ditto.
> + if (page_count(p) != 1)
> + printk(KERN_ERR
> + "MCE %#lx: %s page still referenced by %d users\n",
> + pfn, ps->msg, page_count(p) - 1);
> +
> + /* Could do more checks here if page looks ok */
> + atomic_long_add(1, &mce_bad_pages);
> +
> + /*
> + * Could adjust zone counters here to correct for the missing page.
> + */
> +}
> +
> +#define N_UNMAP_TRIES 5
> +
> +/*
> + * Do all that is necessary to remove user space mappings. Unmap
> + * the pages and send SIGBUS to the processes if the data was dirty.
> + */
> +static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
> + int trapno)
> +{
> + enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> + int kill = sysctl_memory_failure_early_kill;
> + struct address_space *mapping;
> + LIST_HEAD(tokill);
> + int ret;
> + int i;
> +
> + if (PageReserved(p) || PageCompound(p) || PageSlab(p))
> + return;
> +
> + if (!PageLRU(p))
> + lru_add_drain();
> +
> + /*
> + * This check implies we don't kill processes if their pages
> + * are in the swap cache early. Those are always late kills.
> + */
> + if (!page_mapped(p))
> + return;
> +
> + if (PageSwapCache(p)) {
> + printk(KERN_ERR
> + "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
> + ttu |= TTU_IGNORE_HWPOISON;
> + }
> +
> + /*
> + * Propagate the dirty bit from PTEs to struct page first, because we
> + * need this to decide if we should kill or just drop the page.
> + */
> + mapping = page_mapping(p);
> + if (!PageDirty(p) && !PageAnon(p) && !PageSwapBacked(p) &&
!PageAnon(p) could be removed: the below non-zero mapping check will
do the work implicitly.
> + mapping && mapping_cap_account_dirty(mapping)) {
> + if (page_mkclean(p))
> + SetPageDirty(p);
> + else {
> + kill = 0;
> + printk(KERN_INFO
> + "MCE %#lx: corrupted page was clean: dropped without side effects\n",
> + pfn);
> + ttu |= TTU_IGNORE_HWPOISON;
Why not put the two assignment lines together? :)
> + }
> + }
> +
> + /*
> + * First collect all the processes that have the page
> + * mapped. This has to be done before try_to_unmap,
> + * because ttu takes the rmap data structures down.
> + *
> + * This also has the side effect to propagate the dirty
> + * bit from PTEs into the struct page. This is needed
> + * to actually decide if something needs to be killed
> + * or errored, or if it's ok to just drop the page.
> + *
> + * Error handling: We ignore errors here because
> + * there's nothing that can be done.
> + */
> + if (kill)
> + collect_procs(p, &tokill);
> +
> + /*
> + * try_to_unmap can fail temporarily due to races.
> + * Try a few times (RED-PEN better strategy?)
> + */
> + for (i = 0; i < N_UNMAP_TRIES; i++) {
> + ret = try_to_unmap(p, ttu);
> + if (ret == SWAP_SUCCESS)
> + break;
> + pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
Can we make it a printk? This is a serious accident.
> + }
> +
> + /*
> + * Now that the dirty bit has been propagated to the
> + * struct page and all unmaps done we can decide if
> + * killing is needed or not. Only kill when the page
> + * was dirty, otherwise the tokill list is merely
> + * freed. When there was a problem unmapping earlier
> + * use a more forceful uncatchable kill to prevent
> + * any accesses to the poisoned memory.
> + */
> + kill_procs_ao(&tokill, !!PageDirty(p), trapno,
> + ret != SWAP_SUCCESS, pfn);
> +}
> +
> +/**
> + * memory_failure - Handle memory failure of a page.
> + * @pfn: Page Number of the corrupted page
> + * @trapno: Trap number reported in the signal to user space.
> + *
> + * This function is called by the low level machine check code
> + * of an architecture when it detects hardware memory corruption
> + * of a page. It tries its best to recover, which includes
> + * dropping pages, killing processes etc.
> + *
> + * The function is primarily of use for corruptions that
> + * happen outside the current execution context (e.g. when
> + * detected by a background scrubber)
> + *
> + * Must run in process context (e.g. a work queue) with interrupts
> + * enabled and no spinlocks held.
> + */
> +void memory_failure(unsigned long pfn, int trapno)
> +{
> + struct page_state *ps;
> + struct page *p;
> +
> + if (!pfn_valid(pfn)) {
> + action_result(pfn, "memory outside kernel control", IGNORED);
> + return;
> + }
> +
> +
A bonus blank line.
> + p = pfn_to_page(pfn);
> + if (TestSetPageHWPoison(p)) {
> + action_result(pfn, "already hardware poisoned", IGNORED);
> + return;
> + }
> +
> + /*
> + * We need/can do nothing about count=0 pages.
> + * 1) it's a free page, and therefore in safe hand:
> + * prep_new_page() will be the gate keeper.
> + * 2) it's part of a non-compound high order page.
> + * Implies some kernel user: cannot stop them from
> + * R/W the page; let's pray that the page has been
> + * used and will be freed some time later.
> + * In fact it's dangerous to directly bump up page count from 0,
> + * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
> + */
> + if (!get_page_unless_zero(compound_head(p))) {
> + action_result(pfn, "free or high order kernel", IGNORED);
> + return;
> + }
> +
> + /*
> + * Lock the page and wait for writeback to finish.
> + * It's very difficult to mess with pages currently under IO
> + * and in many cases impossible, so we just avoid it here.
> + */
> + lock_page_nosync(p);
> + wait_on_page_writeback(p);
> +
> + /*
> + * Now take care of user space mappings.
> + */
> + hwpoison_user_mappings(p, pfn, trapno);
> +
> + /*
> + * Torn down by someone else?
> + */
> + if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
> + action_result(pfn, "already unmapped LRU", IGNORED);
"NULL mapping LRU" or "already truncated page"?
At least page_mapped != page_mapping.
> + goto out;
> + }
> +
> + for (ps = error_states;; ps++) {
> + if ((p->flags & ps->mask) == ps->res) {
> + page_action(ps, p, pfn);
> + break;
> + }
> + }
> +out:
> + unlock_page(p);
> +}
> Index: linux/include/linux/mm.h
> ===================================================================
> --- linux.orig/include/linux/mm.h 2009-06-03 20:13:43.000000000 +0200
> +++ linux/include/linux/mm.h 2009-06-03 20:13:43.000000000 +0200
> @@ -1326,5 +1326,10 @@
> extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim,
> size_t size);
> extern void refund_locked_memory(struct mm_struct *mm, size_t size);
> +
> +extern void memory_failure(unsigned long pfn, int trapno);
> +extern int sysctl_memory_failure_early_kill;
> +extern atomic_long_t mce_bad_pages;
> +
> #endif /* __KERNEL__ */
> #endif /* _LINUX_MM_H */
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c 2009-06-03 19:37:38.000000000 +0200
> +++ linux/kernel/sysctl.c 2009-06-03 20:13:43.000000000 +0200
> @@ -1311,6 +1311,20 @@
> .mode = 0644,
> .proc_handler = &scan_unevictable_handler,
> },
> +#ifdef CONFIG_MEMORY_FAILURE
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "memory_failure_early_kill",
> + .data = &sysctl_memory_failure_early_kill,
> + .maxlen = sizeof(vm_highmem_is_dirtyable),
s/vm_highmem_is_dirtyable/sysctl_memory_failure_early_kill/
> + .mode = 0644,
> + .proc_handler = &proc_dointvec_minmax,
> + .strategy = &sysctl_intvec,
> + .extra1 = &zero,
> + .extra2 = &one,
> + },
> +#endif
> +
> /*
> * NOTE: do not add new entries to this table unless you have read
> * Documentation/sysctl/ctl_unnumbered.txt
> Index: linux/fs/proc/meminfo.c
> ===================================================================
> --- linux.orig/fs/proc/meminfo.c 2009-06-03 19:37:38.000000000 +0200
> +++ linux/fs/proc/meminfo.c 2009-06-03 20:13:43.000000000 +0200
> @@ -95,7 +95,11 @@
> "Committed_AS: %8lu kB\n"
> "VmallocTotal: %8lu kB\n"
> "VmallocUsed: %8lu kB\n"
> - "VmallocChunk: %8lu kB\n",
> + "VmallocChunk: %8lu kB\n"
> +#ifdef CONFIG_MEMORY_FAILURE
> + "BadPages: %8lu kB\n"
"HWPoison:" or something like that?
People are more likely to misinterpret "BadPages".
> +#endif
> + ,
> K(i.totalram),
> K(i.freeram),
> K(i.bufferram),
> @@ -140,6 +144,9 @@
> (unsigned long)VMALLOC_TOTAL >> 10,
> vmi.used >> 10,
> vmi.largest_chunk >> 10
> +#ifdef CONFIG_MEMORY_FAILURE
> + ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
ERROR: space required after that ','
> +#endif
> );
>
> hugetlb_report_meminfo(m);
> Index: linux/mm/Kconfig
> ===================================================================
> --- linux.orig/mm/Kconfig 2009-06-03 19:37:38.000000000 +0200
> +++ linux/mm/Kconfig 2009-06-03 20:39:48.000000000 +0200
> @@ -220,6 +220,9 @@
> Enable the KSM kernel module to allow page sharing of equal pages
> among different tasks.
>
> +config MEMORY_FAILURE
> + bool
> +
Do we have code to automatically enable/disable CONFIG_MEMORY_FAILURE
based on hardware capability?
> config NOMMU_INITIAL_TRIM_EXCESS
> int "Turn on mmap() excess space trimming before booting"
> depends on !MMU
> Index: linux/Documentation/sysctl/vm.txt
> ===================================================================
> --- linux.orig/Documentation/sysctl/vm.txt 2009-06-03 19:37:38.000000000 +0200
> +++ linux/Documentation/sysctl/vm.txt 2009-06-03 20:13:43.000000000 +0200
> @@ -32,6 +32,7 @@
> - legacy_va_layout
> - lowmem_reserve_ratio
> - max_map_count
> +- memory_failure_early_kill
> - min_free_kbytes
> - min_slab_ratio
> - min_unmapped_ratio
> @@ -53,7 +54,6 @@
> - vfs_cache_pressure
> - zone_reclaim_mode
>
> -
> ==============================================================
>
> block_dump
> @@ -275,6 +275,25 @@
>
> The default value is 65536.
>
> +=============================================================
> +
> +memory_failure_early_kill:
> +
> +Control how to kill processes when uncorrected memory error (typically
> +a 2bit error in a memory module) is detected in the background by hardware.
> +
> +1: Kill all processes that have the corrupted page mapped as soon as the
> +corruption is detected.
> +
> +0: Only unmap the page from all processes and only kill a process
> +who tries to access it.
Note that
- no process will be killed if the page data is clean and can be
safely reloaded from disk
- pages in the swap cache are always late killed.
Thanks,
Fengguang
> +The kill is done using a catchable SIGBUS, so processes can handle this
> +if they want to.
> +
> +This is only active on architectures/platforms with advanced machine
> +check handling and depends on the hardware capabilities.
> +
> ==============================================================
>
> min_free_kbytes:
> Index: linux/mm/filemap.c
> ===================================================================
> --- linux.orig/mm/filemap.c 2009-06-03 19:37:38.000000000 +0200
> +++ linux/mm/filemap.c 2009-06-03 20:13:43.000000000 +0200
> @@ -105,6 +105,10 @@
> *
> * ->task->proc_lock
> * ->dcache_lock (proc_pid_lookup)
> + *
> + * (code doesn't rely on that order, so you could switch it around)
> + * ->tasklist_lock (memory_failure, collect_procs_ao)
> + * ->i_mmap_lock
> */
>
> /*
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c 2009-06-03 19:37:38.000000000 +0200
> +++ linux/mm/rmap.c 2009-06-03 20:13:43.000000000 +0200
> @@ -36,6 +36,10 @@
> * mapping->tree_lock (widely used, in set_page_dirty,
> * in arch-dependent flush_dcache_mmap_lock,
> * within inode_lock in __sync_single_inode)
> + *
> + * (code doesn't rely on that order so it could be switched around)
> + * ->tasklist_lock
> + * anon_vma->lock (memory_failure, collect_procs_anon)
> */
>
> #include <linux/mm.h>
On Thu, Jun 04, 2009 at 02:46:38AM +0800, Andi Kleen wrote:
>
> Bail out early when hardware poisoned pages are found in page fault handling.
I suspect this patch is also not absolutely necessary: the poisoned
page will normally have been isolated already.
> Since they are poisoned they should not be mapped freshly into processes,
> because that would cause another (potentially deadly) machine check
>
> This is generally handled in the same way as OOM, just a different
> error code is returned to the architecture code.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/memory.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> +++ linux/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> @@ -2797,6 +2797,9 @@
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> return ret;
>
> + if (unlikely(PageHWPoison(vmf.page)))
> + return VM_FAULT_HWPOISON;
> +
Direct return with locked page could lockup someone later.
Either drop this patch or fix it with this check?
Thanks,
Fengguang
---
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -2658,8 +2658,11 @@ static int __do_fault(struct mm_struct *
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;
- if (unlikely(PageHWPoison(vmf.page)))
+ if (unlikely(PageHWPoison(vmf.page))) {
+ if (ret & VM_FAULT_LOCKED)
+ unlock_page(vmf.page);
return VM_FAULT_HWPOISON;
+ }
/*
* For consistency in subsequent calls, make the faulted page always
On Thu, Jun 04, 2009 at 02:46:46AM +0800, Andi Kleen wrote:
>
> From: Nick Piggin <[email protected]>
>
> Extract out truncate_inode_page() out of the truncate path so that
> it can be used by memory-failure.c
>
> [AK: description, headers, fix typos]
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> include/linux/mm.h | 2 ++
> mm/truncate.c | 24 ++++++++++++------------
> 2 files changed, 14 insertions(+), 12 deletions(-)
>
> Index: linux/mm/truncate.c
> ===================================================================
> --- linux.orig/mm/truncate.c 2009-06-03 19:37:38.000000000 +0200
> +++ linux/mm/truncate.c 2009-06-03 20:13:43.000000000 +0200
> @@ -135,6 +135,16 @@
> return ret;
> }
>
> +void truncate_inode_page(struct address_space *mapping, struct page *page)
> +{
> + if (page_mapped(page)) {
> + unmap_mapping_range(mapping,
> + (loff_t)page->index<<PAGE_CACHE_SHIFT,
> + PAGE_CACHE_SIZE, 0);
> + }
> + truncate_complete_page(mapping, page);
> +}
> +
Small style cleanup:
---
mm/truncate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- sound-2.6.orig/mm/truncate.c
+++ sound-2.6/mm/truncate.c
@@ -139,8 +139,8 @@ void truncate_inode_page(struct address_
{
if (page_mapped(page)) {
unmap_mapping_range(mapping,
- (loff_t)page->index<<PAGE_CACHE_SHIFT,
- PAGE_CACHE_SIZE, 0);
+ (loff_t)page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
}
truncate_complete_page(mapping, page);
}
On Thu, Jun 04, 2009 at 02:46:42AM +0800, Andi Kleen wrote:
>
> When a page has the poison bit set replace the PTE with a poison entry.
> This causes the right error handling to be done later when a process runs
> into it.
>
> Also add a new flag to not do that (needed for the memory-failure handler
> later)
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> include/linux/rmap.h | 1 +
> mm/rmap.c | 9 ++++++++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c 2009-06-03 19:36:23.000000000 +0200
> +++ linux/mm/rmap.c 2009-06-03 20:39:49.000000000 +0200
> @@ -943,7 +943,14 @@
> /* Update high watermark before we lower rss */
> update_hiwater_rss(mm);
>
> - if (PageAnon(page)) {
> + if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> + if (PageAnon(page))
> + dec_mm_counter(mm, anon_rss);
> + else if (!is_migration_entry(pte_to_swp_entry(*pte)))
> + dec_mm_counter(mm, file_rss);
> + set_pte_at(mm, address, pte,
> + swp_entry_to_pte(make_hwpoison_entry(page)));
> + } else if (PageAnon(page)) {
> swp_entry_t entry = { .val = page_private(page) };
>
> if (PageSwapCache(page)) {
> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
> +++ linux/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
> @@ -93,6 +93,7 @@
>
> TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
> TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
> + TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
Or more precisely comment it as "corrupted data is recoverable"?
On Thu, Jun 04, 2009 at 11:24:41AM +0800, Wu Fengguang wrote:
> On Thu, Jun 04, 2009 at 02:46:47AM +0800, Andi Kleen wrote:
>
> [snip]
>
> This patch is full of this style error (the old version didn't have
> this problem though):
I don't see that here. At least nothing new compared to old.
>
> ERROR: code indent should use tabs where possible
It's checkpatch clean for me, except for a few > 80 character lines on printks,
one list_for_each_entry_safe (which I think checkpatch is wrong on) and
the meminfo comma error which I also think checkpatch.pl is wrong on too.
> > + page_cache_release(p);
> > +
> > + /*
> > + * Now truncate the page in the page cache. This is really
> > + * more like a "temporary hole punch"
> > + * Don't do this for block devices when someone else
> > + * has a reference, because it could be file system metadata
> > + * and that's not safe to truncate.
> > + */
> > + mapping = page_mapping(p);
> > + if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1) {
>
> Shall use (page_count > 2) to count for the page cache reference.
I think the page cache reference got dropped in
if (!isolate_lru_page(p))
page_cache_release(p);
So it should be only one if there are no other users
> Or can we base the test on busy buffers instead of page count? Nick?
At least the S_ISBLK test is the best one I came up with. I'm not
saying it's absolutely the best.
> > + SetPageError(p);
> > + /* TBD: print more information about the file. */
> > + if (mapping) {
> > + /*
> > + * IO error will be reported by write(), fsync(), etc.
> > + * who check the mapping.
>
> btw, here are some side notes on EIO.
>
> close() *may* also report it. NFS will sync file on close.
I think the comment is already too verbose, sure there are other
details too that it doesn't describe. It's not trying to be a
full reference on linux error reporting. So I prefer to not
add more cases.
> > + * at the wrong time.
> > + *
> > + * So right now we assume that the application DTRT on
>
> DTRT = do the return value test?
Do The Right Thing
> > +};
> > +
> > +static void action_result(unsigned long pfn, char *msg, int ret)
>
> rename 'ret' to 'action'?
But it's not an action (as in a page state handler), it's a return value?
(RECOVERED, FAILED etc.) I can name it result.
> > + * need this to decide if we should kill or just drop the page.
> > + */
> > + mapping = page_mapping(p);
> > + if (!PageDirty(p) && !PageAnon(p) && !PageSwapBacked(p) &&
>
> !PageAnon(p) could be removed: the below non-zero mapping check will
> do the work implicitly.
You mean !page_mapped? Ok.
> > + kill = 0;
> > + printk(KERN_INFO
> > + "MCE %#lx: corrupted page was clean: dropped without side effects\n",
> > + pfn);
> > + ttu |= TTU_IGNORE_HWPOISON;
>
> Why not put the two assignment lines together? :)
Ok. But that was your patch @)
> > + * Try a few times (RED-PEN better strategy?)
> > + */
> > + for (i = 0; i < N_UNMAP_TRIES; i++) {
> > + ret = try_to_unmap(p, ttu);
> > + if (ret == SWAP_SUCCESS)
> > + break;
> > + pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
>
> Can we make it a printk? This is a serious accident.
I think it can actually happen due to races, e.g. when a remap
is currently in progress.
> > + */
> > + hwpoison_user_mappings(p, pfn, trapno);
> > +
> > + /*
> > + * Torn down by someone else?
> > + */
> > + if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
> > + action_result(pfn, "already unmapped LRU", IGNORED);
>
> "NULL mapping LRU" or "already truncated page"?
> At least page_mapped != page_mapping.
It's "already truncated" now.
> > @@ -1311,6 +1311,20 @@
> > .mode = 0644,
> > .proc_handler = &scan_unevictable_handler,
> > },
> > +#ifdef CONFIG_MEMORY_FAILURE
> > + {
> > + .ctl_name = CTL_UNNUMBERED,
> > + .procname = "memory_failure_early_kill",
> > + .data = &sysctl_memory_failure_early_kill,
> > + .maxlen = sizeof(vm_highmem_is_dirtyable),
>
> s/vm_highmem_is_dirtyable/sysctl_memory_failure_early_kill/
Fixed thanks.
> > * Documentation/sysctl/ctl_unnumbered.txt
> > Index: linux/fs/proc/meminfo.c
> > ===================================================================
> > --- linux.orig/fs/proc/meminfo.c 2009-06-03 19:37:38.000000000 +0200
> > +++ linux/fs/proc/meminfo.c 2009-06-03 20:13:43.000000000 +0200
> > @@ -95,7 +95,11 @@
> > "Committed_AS: %8lu kB\n"
> > "VmallocTotal: %8lu kB\n"
> > "VmallocUsed: %8lu kB\n"
> > - "VmallocChunk: %8lu kB\n",
> > + "VmallocChunk: %8lu kB\n"
> > +#ifdef CONFIG_MEMORY_FAILURE
> > + "BadPages: %8lu kB\n"
>
> "HWPoison:" or something like that?
> People are more likely to misinterpret "BadPages".
I'll name it HardwareCorrupted. That makes it too long, but it's hopefully
clearer.
> > vmi.used >> 10,
> > vmi.largest_chunk >> 10
> > +#ifdef CONFIG_MEMORY_FAILURE
> > + ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
>
> ERROR: space required after that ','
That's one of the cases where checkpatch.pl is stupid. The lone comma
with a space looks absolutely ridiculous to me. I refuse to do ridiculous
things just for checkpatch.pl deficiencies.
> > Enable the KSM kernel module to allow page sharing of equal pages
> > among different tasks.
> >
> > +config MEMORY_FAILURE
> > + bool
> > +
>
> Do we have code to automatically enable/disable CONFIG_MEMORY_FAILURE
> based on hardware capability?
Yes the architecture can enable it. There's also another patch
which always enables it for testing.
> > +
> > +Control how to kill processes when uncorrected memory error (typically
> > +a 2bit error in a memory module) is detected in the background by hardware.
> > +
> > +1: Kill all processes that have the corrupted page mapped as soon as the
> > +corruption is detected.
> > +
> > +0: Only unmap the page from all processes and only kill a process
> > +who tries to access it.
>
> Note that
> - no process will be killed if the page data is clean and can be
> safely reloaded from disk
> - pages in the swap cache are always late killed.
I clarified that
Thanks,
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Jun 04, 2009 at 12:26:03PM +0800, Wu Fengguang wrote:
> On Thu, Jun 04, 2009 at 02:46:38AM +0800, Andi Kleen wrote:
> >
> > Bail out early when hardware poisoned pages are found in page fault handling.
>
> I suspect this patch is also not absolutely necessary: the poisoned
> page will normally have been isolated already.
It's needed to prevent new pages coming in when there is a parallel
fault while the memory failure handling is in progress.
Otherwise the pages could get remapped in that small window.
> > --- linux.orig/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > +++ linux/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > @@ -2797,6 +2797,9 @@
> > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> > return ret;
> >
> > + if (unlikely(PageHWPoison(vmf.page)))
> > + return VM_FAULT_HWPOISON;
> > +
>
> Direct return with locked page could lockup someone later.
> Either drop this patch or fix it with this check?
Fair point. Fixed.
Thanks,
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Jun 04, 2009 at 12:32:08PM +0800, Wu Fengguang wrote:
> > +void truncate_inode_page(struct address_space *mapping, struct page *page)
> > +{
> > + if (page_mapped(page)) {
> > + unmap_mapping_range(mapping,
> > + (loff_t)page->index<<PAGE_CACHE_SHIFT,
> > + PAGE_CACHE_SIZE, 0);
> > + }
> > + truncate_complete_page(mapping, page);
> > +}
> > +
>
> Small style cleanup:
Applied.
-Andi
--
[email protected] -- Speaking for myself only.
> > Index: linux/include/linux/rmap.h
> > ===================================================================
> > --- linux.orig/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
> > +++ linux/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
> > @@ -93,6 +93,7 @@
> >
> > TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
> > TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
> > + TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
>
> Or more precisely comment it as "corrupted data is recoverable"?
I think the original comment is clear enough, not changing that for now.
Thanks,
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Jun 04, 2009 at 08:36:21AM +0800, Wu Fengguang wrote:
> On Thu, Jun 04, 2009 at 02:46:43AM +0800, Andi Kleen wrote:
> >
> > Bail out early in set_page_dirty for poisoned pages. We don't want any
> > of the dirty accounting done or file system write back started, because
> > the page will be just thrown away.
>
> I'm afraid this patch is not necessary and could be harmful.
>
> It is not necessary because a poisoned page will normally already be
> isolated from page cache, or likely cannot be isolated because it has
> dirty buffers.
Hmm, I think I had a case where it was needed when I originally wrote the code,
but I can't clearly remember now what it was.
You're right that the page cache isolation should normally take care of it.
> It is harmful because it puts the page into dirty state without queuing
> it for IO by moving it to s_io. When more normal pages are dirtied
> later, __set_page_dirty_nobuffers() won't move the inode into s_io,
> hence delaying the writeback of good pages for an arbitrarily long time.
That's a good point.
I dropped the patch for now.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Jun 04, 2009 at 01:13:46PM +0800, Andi Kleen wrote:
> On Thu, Jun 04, 2009 at 11:24:41AM +0800, Wu Fengguang wrote:
> > On Thu, Jun 04, 2009 at 02:46:47AM +0800, Andi Kleen wrote:
> >
> > [snip]
> >
> > This patch is full of this style error (the old version didn't have
> > this problem though):
>
> I don't see that here. At least nothing new compared to old.
>
> >
> > ERROR: code indent should use tabs where possible
>
> It's checkpatch clean for me, except for a few > 80 character lines on printks,
> one list_for_each_entry_safe (which I think checkpatch is wrong on) and
> the meminfo comma error which I also think checkpatch.pl is wrong on too.
OK, that's fine. Maybe some email server expanded tabs in between.
I wonder whether it's the send side or the receive side, i.e. whether
it affected more people.
> > > + page_cache_release(p);
> > > +
> > > + /*
> > > + * Now truncate the page in the page cache. This is really
> > > + * more like a "temporary hole punch"
> > > + * Don't do this for block devices when someone else
> > > + * has a reference, because it could be file system metadata
> > > + * and that's not safe to truncate.
> > > + */
> > > + mapping = page_mapping(p);
> > > + if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1) {
> >
> > Shall use (page_count > 2) to count for the page cache reference.
>
> I think the page cache reference got dropped in
>
> if (!isolate_lru_page(p))
> page_cache_release(p);
>
> So it should be only one if there are no other users
Ah right!
> > Or can we base the test on busy buffers instead of page count? Nick?
>
> At least the S_ISBLK test is the best one I came up with. I'm not
> saying it's absolutely the best.
Yes I agree with the S_ISBLK test and was questioning the page count
test. btw, one exception to the S_ISBLK test is btrfs, which does not
use blockdev for metadata.
> > > + SetPageError(p);
> > > + /* TBD: print more information about the file. */
> > > + if (mapping) {
> > > + /*
> > > + * IO error will be reported by write(), fsync(), etc.
> > > + * who check the mapping.
> >
> > btw, here are some side notes on EIO.
> >
> > close() *may* also report it. NFS will sync file on close.
>
> I think the comment is already too verbose, sure there are other
> details too that it doesn't describe. It's not trying to be a
> full reference on linux error reporting. So I prefer to not
> add more cases.
Yes, I was not asking for expanding the long comment :-)
> > > + * at the wrong time.
> > > + *
> > > + * So right now we assume that the application DTRT on
> >
> > DTRT = do the return value test?
>
> Do The Right Thing
OK.
> > > +};
> > > +
> > > +static void action_result(unsigned long pfn, char *msg, int ret)
> >
> > rename 'ret' to 'action'?
>
> But it's not an action (as in a page state handler), it's a return value?
> (RECOVERED, FAILED etc.) I can name it result.
Ah yes, it's return code.
> > > + * need this to decide if we should kill or just drop the page.
> > > + */
> > > + mapping = page_mapping(p);
> > > + if (!PageDirty(p) && !PageAnon(p) && !PageSwapBacked(p) &&
> >
> > !PageAnon(p) could be removed: the below non-zero mapping check will
> > do the work implicitly.
>
> You mean !page_mapped? Ok.
I mean to do
mapping = page_mapping(p);
if (!PageDirty(p) && !PageSwapBacked(p) &&
mapping && mapping_cap_account_dirty(mapping)) {
Because for anonymous pages, page_mapping == NULL.
> > > + kill = 0;
> > > + printk(KERN_INFO
> > > + "MCE %#lx: corrupted page was clean: dropped without side effects\n",
> > > + pfn);
> > > + ttu |= TTU_IGNORE_HWPOISON;
> >
> > Why not put the two assignment lines together? :)
>
> Ok. But that was your patch ;)
Yes, so is the above one ;)
> > > + * Try a few times (RED-PEN better strategy?)
> > > + */
> > > + for (i = 0; i < N_UNMAP_TRIES; i++) {
> > > + ret = try_to_unmap(p, ttu);
> > > + if (ret == SWAP_SUCCESS)
> > > + break;
> > > + pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
> >
> > Can we make it a printk? This is a serious accident.
>
> I think it can actually happen due to races, e.g. when a remap
> is currently in process.
When that happens, the page may not have been isolated from the ptes and
the page cache, and is thus very likely to damage the system. So how about
adding a warning on failure?
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -660,6 +660,10 @@ static void hwpoison_user_mappings(struc
break;
pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
}
+ if (ret != SWAP_SUCCESS)
+ printk(KERN_ERR
+ "MCE %#lx: failed to unmap page (mapcount=%d)!\n",
+ pfn, page_mapcount(p));
/*
* Now that the dirty bit has been propagated to the
> > > + */
> > > + hwpoison_user_mappings(p, pfn, trapno);
> > > +
> > > + /*
> > > + * Torn down by someone else?
> > > + */
> > > + if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
> > > + action_result(pfn, "already unmapped LRU", IGNORED);
> >
> > "NULL mapping LRU" or "already truncated page"?
> > At least page_mapped != page_mapping.
>
> It's "already truncated" now.
Thanks.
> > > @@ -1311,6 +1311,20 @@
> > > .mode = 0644,
> > > .proc_handler = &scan_unevictable_handler,
> > > },
> > > +#ifdef CONFIG_MEMORY_FAILURE
> > > + {
> > > + .ctl_name = CTL_UNNUMBERED,
> > > + .procname = "memory_failure_early_kill",
> > > + .data = &sysctl_memory_failure_early_kill,
> > > + .maxlen = sizeof(vm_highmem_is_dirtyable),
> >
> > s/vm_highmem_is_dirtyable/sysctl_memory_failure_early_kill/
>
> Fixed thanks.
>
> > > * Documentation/sysctl/ctl_unnumbered.txt
> > > Index: linux/fs/proc/meminfo.c
> > > ===================================================================
> > > --- linux.orig/fs/proc/meminfo.c 2009-06-03 19:37:38.000000000 +0200
> > > +++ linux/fs/proc/meminfo.c 2009-06-03 20:13:43.000000000 +0200
> > > @@ -95,7 +95,11 @@
> > > "Committed_AS: %8lu kB\n"
> > > "VmallocTotal: %8lu kB\n"
> > > "VmallocUsed: %8lu kB\n"
> > > - "VmallocChunk: %8lu kB\n",
> > > + "VmallocChunk: %8lu kB\n"
> > > +#ifdef CONFIG_MEMORY_FAILURE
> > > + "BadPages: %8lu kB\n"
> >
> > "HWPoison:" or something like that?
> > People are more likely to misinterpret "BadPages".
>
> I'll name it HardwareCorrupted. That makes it too long, but it's hopefully
> clearer.
That's OK. Maybe we need a standalone alignment patch for /proc/meminfo ;-)
> > > vmi.used >> 10,
> > > vmi.largest_chunk >> 10
> > > +#ifdef CONFIG_MEMORY_FAILURE
> > > + ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
> >
> > ERROR: space required after that ','
>
> That's one of the cases where checkpatch.pl is stupid. The lone comma
> with a space looks absolutely ridiculous to me. I refuse to do ridiculous
> things just for checkpatch.pl deficiencies.
OK.
> > > Enable the KSM kernel module to allow page sharing of equal pages
> > > among different tasks.
> > >
> > > +config MEMORY_FAILURE
> > > + bool
> > > +
> >
> > Do we have code to automatically enable/disable CONFIG_MEMORY_FAILURE
> > based on hardware capability?
>
> Yes the architecture can enable it. There's also another patch
> which always enables it for testing.
OK.
> > > +
> > > +Control how to kill processes when uncorrected memory error (typically
> > > +a 2bit error in a memory module) is detected in the background by hardware.
> > > +
> > > +1: Kill all processes that have the corrupted page mapped as soon as the
> > > +corruption is detected.
> > > +
> > > +0: Only unmap the page from all processes and only kill a process
> > > +who tries to access it.
> >
> > Note that
> > - no process will be killed if the page data is clean and can be
> > safely reloaded from disk
> > - pages in swap cache are always late killed.
>
> I clarified that
Thanks,
Fengguang
On Thu, Jun 04, 2009 at 05:07:37PM +0800, Wu Fengguang wrote:
>
> > > > + * need this to decide if we should kill or just drop the page.
> > > > + */
> > > > + mapping = page_mapping(p);
> > > > + if (!PageDirty(p) && !PageAnon(p) && !PageSwapBacked(p) &&
> > >
> > > !PageAnon(p) could be removed: the below non-zero mapping check will
> > > do the work implicitly.
> >
> > You mean !page_mapped? Ok.
>
> I mean to do
> mapping = page_mapping(p);
> if (!PageDirty(p) && !PageSwapBacked(p) &&
> mapping && mapping_cap_account_dirty(mapping)) {
>
> Because for anonymous pages, page_mapping == NULL.
I realized this after pressing send. Anyway, the PageAnon check is dropped now.
>
> --- sound-2.6.orig/mm/memory-failure.c
> +++ sound-2.6/mm/memory-failure.c
> @@ -660,6 +660,10 @@ static void hwpoison_user_mappings(struc
> break;
> pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
> }
> + if (ret != SWAP_SUCCESS)
> + printk(KERN_ERR
> + "MCE %#lx: failed to unmap page (mapcount=%d)!\n",
> + pfn, page_mapcount(p));
Ok.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Jun 04, 2009 at 01:19:15PM +0800, Andi Kleen wrote:
> On Thu, Jun 04, 2009 at 12:26:03PM +0800, Wu Fengguang wrote:
> > On Thu, Jun 04, 2009 at 02:46:38AM +0800, Andi Kleen wrote:
> > >
> > > Bail out early when hardware poisoned pages are found in page fault handling.
> >
> > I suspect this patch is also not absolutely necessary: the poisoned
> > page will normally have been isolated already.
>
> It's needed to prevent new pages coming in when there is a parallel
> fault while the memory failure handling is in progress.
> Otherwise the pages could get remapped in that small window.
This patch makes no difference at least for file pages, including tmpfs.
In filemap_fault(), it will first do find_lock_page(), which locks
the page and then double-checks whether page->mapping is NULL. If so,
it drops that page and re-finds/re-creates one in the radix tree.
That logic automatically avoids the poisoned page that is being
processed, because the poisoned page is locked at that point, and when
find_lock_page() is eventually able to lock it, its page->mapping
will be NULL. So the PageHWPoison(vmf.page) test will never be true
for filemap_fault.
shmem_fault() calls shmem_getpage() to get the page, which will return
EIO on a !uptodate page. It then returns VM_FAULT_SIGBUS on its own.
> > > --- linux.orig/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > > +++ linux/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > > @@ -2797,6 +2797,9 @@
> > > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> > > return ret;
> > >
> > > + if (unlikely(PageHWPoison(vmf.page)))
> > > + return VM_FAULT_HWPOISON;
> > > +
> >
> > Direct return with locked page could lockup someone later.
> > Either drop this patch or fix it with this check?
>
> Fair point. Fixed.
OK, thanks.
Fengguang
On Thu, Jun 04, 2009 at 07:55:33PM +0800, Wu Fengguang wrote:
> On Thu, Jun 04, 2009 at 01:19:15PM +0800, Andi Kleen wrote:
> > On Thu, Jun 04, 2009 at 12:26:03PM +0800, Wu Fengguang wrote:
> > > On Thu, Jun 04, 2009 at 02:46:38AM +0800, Andi Kleen wrote:
> > > >
> > > > Bail out early when hardware poisoned pages are found in page fault handling.
> > >
> > > I suspect this patch is also not absolutely necessary: the poisoned
> > > page will normally have been isolated already.
> >
> > It's needed to prevent new pages comming in when there is a parallel
> > fault while the memory failure handling is in process.
> > Otherwise the pages could get remapped in that small window.
>
> This patch makes no difference at least for file pages, including tmpfs.
I was thinking more of anonymous pages with multiple mappers (e.g.
COW after fork).
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Jun 04, 2009 at 08:52:28PM +0800, Andi Kleen wrote:
> On Thu, Jun 04, 2009 at 07:55:33PM +0800, Wu Fengguang wrote:
> > On Thu, Jun 04, 2009 at 01:19:15PM +0800, Andi Kleen wrote:
> > > On Thu, Jun 04, 2009 at 12:26:03PM +0800, Wu Fengguang wrote:
> > > > On Thu, Jun 04, 2009 at 02:46:38AM +0800, Andi Kleen wrote:
> > > > >
> > > > > Bail out early when hardware poisoned pages are found in page fault handling.
> > > >
> > > > I suspect this patch is also not absolutely necessary: the poisoned
> > > > page will normally have been isolated already.
> > >
> > > It's needed to prevent new pages comming in when there is a parallel
> > > fault while the memory failure handling is in process.
> > > Otherwise the pages could get remapped in that small window.
> >
> > This patch makes no difference at least for file pages, including tmpfs.
>
> I was more thinking of anonymous pages with multiple mappers (e.g.
> COW after fork)
I guess they are handled by do_anonymous_page() or do_wp_page(),
instead of do_linear_fault()/do_nonlinear_fault()?
Thanks,
Fengguang
On Thu, Jun 04, 2009 at 08:50:26PM +0800, Wu Fengguang wrote:
> On Thu, Jun 04, 2009 at 08:52:28PM +0800, Andi Kleen wrote:
> > On Thu, Jun 04, 2009 at 07:55:33PM +0800, Wu Fengguang wrote:
> > > On Thu, Jun 04, 2009 at 01:19:15PM +0800, Andi Kleen wrote:
> > > > On Thu, Jun 04, 2009 at 12:26:03PM +0800, Wu Fengguang wrote:
> > > > > On Thu, Jun 04, 2009 at 02:46:38AM +0800, Andi Kleen wrote:
> > > > > >
> > > > > > Bail out early when hardware poisoned pages are found in page fault handling.
> > > > >
> > > > > I suspect this patch is also not absolutely necessary: the poisoned
> > > > > page will normally have been isolated already.
> > > >
> > > > It's needed to prevent new pages comming in when there is a parallel
> > > > fault while the memory failure handling is in process.
> > > > Otherwise the pages could get remapped in that small window.
> > >
> > > This patch makes no difference at least for file pages, including tmpfs.
> >
> > I was more thinking of anonymous pages with multiple mappers (e.g.
> > COW after fork)
>
> I guess they are handled by do_anonymous_page() or do_wp_page(),
> instead of do_linear_fault()/do_nonlinear_fault()?
You're right. Sorry, I was a little confused in my earlier reply.
I think what I meant is: what happens during the window
when the page has just the poison bit set, but is not isolated/unmapped yet.
During that window I want new mappers to not come in.
That is why that check is there.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, Jun 04, 2009 at 09:02:55PM +0800, Andi Kleen wrote:
> On Thu, Jun 04, 2009 at 08:50:26PM +0800, Wu Fengguang wrote:
> > On Thu, Jun 04, 2009 at 08:52:28PM +0800, Andi Kleen wrote:
> > > On Thu, Jun 04, 2009 at 07:55:33PM +0800, Wu Fengguang wrote:
> > > > On Thu, Jun 04, 2009 at 01:19:15PM +0800, Andi Kleen wrote:
> > > > > On Thu, Jun 04, 2009 at 12:26:03PM +0800, Wu Fengguang wrote:
> > > > > > On Thu, Jun 04, 2009 at 02:46:38AM +0800, Andi Kleen wrote:
> > > > > > >
> > > > > > > Bail out early when hardware poisoned pages are found in page fault handling.
> > > > > >
> > > > > > I suspect this patch is also not absolutely necessary: the poisoned
> > > > > > page will normally have been isolated already.
> > > > >
> > > > > It's needed to prevent new pages comming in when there is a parallel
> > > > > fault while the memory failure handling is in process.
> > > > > Otherwise the pages could get remapped in that small window.
> > > >
> > > > This patch makes no difference at least for file pages, including tmpfs.
> > >
> > > I was more thinking of anonymous pages with multiple mappers (e.g.
> > > COW after fork)
> >
> > I guess they are handled by do_anonymous_page() or do_wp_page(),
> > instead of do_linear_fault()/do_nonlinear_fault()?
>
> You're right. Sorry was a little confused in my earlier reply.
>
> I think what I meant is: what happens during the window
> when the page has just the poison bit set, but is not isolated/unmapped yet.
> During that window I want new mappers to not come in.
> That is why that check is there.
As soon as the poisoned page is locked, it is in safe hands - the new
mappers will have to wait, and then find it either truncated (mapping
== NULL) for file pages, or its PTE changed through the pte_same()
checks in do_wp_page(). do_anonymous_page() is safe because it
allocates a good new page.
We lock the page immediately after setting PG_hwpoison, so the window
is small enough :)
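(Illustrative sketch of the pte_same() recheck referred to above --
a hypothetical helper, not the actual do_wp_page() code:)

static int wp_fault_sketch(struct mm_struct *mm, pmd_t *pmd,
			   unsigned long address, pte_t orig_pte)
{
	spinlock_t *ptl;
	pte_t *ptep = pte_offset_map_lock(mm, pmd, address, &ptl);

	if (!pte_same(*ptep, orig_pte)) {
		/*
		 * pte changed under us, e.g. memory_failure() already
		 * unmapped the page: drop the stale fault, nothing to do.
		 */
		pte_unmap_unlock(ptep, ptl);
		return 0;
	}
	/* ... pte still matches, safe to do the copy-on-write ... */
	pte_unmap_unlock(ptep, ptl);
	return 0;
}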
Thanks,
Fengguang
On Wed, Jun 03, 2009 at 08:46:47PM +0200, Andi Kleen wrote:
> +static int me_pagecache_clean(struct page *p, unsigned long pfn)
> +{
> + struct address_space *mapping;
> +
> + if (!isolate_lru_page(p))
> + page_cache_release(p);
> +
> + /*
> + * Now truncate the page in the page cache. This is really
> + * more like a "temporary hole punch"
> + * Don't do this for block devices when someone else
> + * has a reference, because it could be file system metadata
> + * and that's not safe to truncate.
> + */
> + mapping = page_mapping(p);
> + if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1) {
> + printk(KERN_ERR
> + "MCE %#lx: page looks like a unsupported file system metadata page\n",
> + pfn);
> + return FAILED;
> + }
page_count check is racy. Hmm, S_ISBLK should handle xfs's private mapping.
AFAIK btrfs has a similar private mapping but a quick grep does not turn
up S_IFBLK anywhere, so I don't know what the situation is there.
Unfortunately though, the linear mapping is not the only metadata mapping
a filesystem might have. Many work on directories in separate mappings
(ext2, for example, which is where I first looked and will still oops with
your check).
Also, others may have other interesting inodes they use for metadata. Do
any of them go through the pagecache? I don't know. The ext3 journal,
for example? How does that work?
Unfortunately I don't know a good way to detect regular data mappings
easily. Cc'ing linux-fsdevel. Until that is worked out, you'd need to
use the safe pagecache invalidate rather than the unsafe truncate.
> + if (mapping) {
> + truncate_inode_page(mapping, p);
> + if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO)) {
> + pr_debug(KERN_ERR "MCE %#lx: failed to release buffers\n",
> + pfn);
> + return FAILED;
> + }
> + }
> + return RECOVERED;
> +}
On Wed, Jun 03, 2009 at 08:46:40PM +0200, Andi Kleen wrote:
>
> Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
> very similar to VM_FAULT_OOM, the only difference is that a different
> si_code is passed to user space and the new addr_lsb field is initialized.
>
> v2: Make the printk more verbose/unique
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/mm/fault.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> Index: linux/arch/x86/mm/fault.c
> ===================================================================
> --- linux.orig/arch/x86/mm/fault.c 2009-06-03 19:36:21.000000000 +0200
> +++ linux/arch/x86/mm/fault.c 2009-06-03 19:36:23.000000000 +0200
> @@ -166,6 +166,7 @@
> info.si_errno = 0;
> info.si_code = si_code;
> info.si_addr = (void __user *)address;
> + info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
>
> force_sig_info(si_signo, &info, tsk);
> }
> @@ -797,10 +798,12 @@
> }
>
> static void
> -do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
> +do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> + unsigned int fault)
> {
> struct task_struct *tsk = current;
> struct mm_struct *mm = tsk->mm;
> + int code = BUS_ADRERR;
>
> up_read(&mm->mmap_sem);
>
> @@ -816,7 +819,15 @@
> tsk->thread.error_code = error_code;
> tsk->thread.trap_no = 14;
>
> - force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (fault & VM_FAULT_HWPOISON) {
> + printk(KERN_ERR
> + "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> + tsk->comm, tsk->pid, address);
> + code = BUS_MCEERR_AR;
> + }
> +#endif
If you make VM_FAULT_HWPOISON 0 when !CONFIG_MEMORY_FAILURE, then
you can remove this ifdef, can't you?
> + force_sig_info_fault(SIGBUS, code, address, tsk);
> }
>
> static noinline void
> @@ -826,8 +837,8 @@
> if (fault & VM_FAULT_OOM) {
> out_of_memory(regs, error_code, address);
> } else {
> - if (fault & VM_FAULT_SIGBUS)
> - do_sigbus(regs, error_code, address);
> + if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
> + do_sigbus(regs, error_code, address, fault);
> else
> BUG();
> }
On Wed, Jun 03, 2009 at 08:46:41PM +0200, Andi Kleen wrote:
>
> try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
> which are selected by magic flag variables. The logic is not very straight
> forward, because each of these flag change multiple behaviours (e.g.
> migration turns off aging, not only sets up migration ptes etc.)
> Also the different flags interact in magic ways.
>
> A later patch in this series adds another mode to try_to_unmap, so
> this becomes quickly unmanageable.
>
> Replace the different flags with a action code (migration, munlock, munmap)
> and some additional flags as modifiers (ignore mlock, ignore aging).
> This makes the logic more straight forward and allows easier extension
> to new behaviours. Change all the caller to declare what they want to
> do.
>
> This patch is supposed to be a nop in behaviour. If anyone can prove
> it is not that would be a bug.
>
> Cc: [email protected]
> Cc: [email protected]
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> include/linux/rmap.h | 14 +++++++++++++-
> mm/migrate.c | 2 +-
> mm/rmap.c | 40 ++++++++++++++++++++++------------------
> mm/vmscan.c | 2 +-
> 4 files changed, 37 insertions(+), 21 deletions(-)
>
> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
> +++ linux/include/linux/rmap.h 2009-06-03 20:39:50.000000000 +0200
> @@ -84,7 +84,19 @@
> * Called from mm/vmscan.c to handle paging out
> */
> int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
> -int try_to_unmap(struct page *, int ignore_refs);
> +
> +enum ttu_flags {
> + TTU_UNMAP = 0, /* unmap mode */
> + TTU_MIGRATION = 1, /* migration mode */
> + TTU_MUNLOCK = 2, /* munlock mode */
> + TTU_ACTION_MASK = 0xff,
> +
> + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
> + TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
> +};
> +#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
I still think this is nasty and should work like gfp flags.
On Wed, Jun 03, 2009 at 08:46:43PM +0200, Andi Kleen wrote:
>
> Bail out early in set_page_dirty for poisoned pages. We don't want any
> of the dirty accounting done or file system write back started, because
> the page will be just thrown away.
I don't agree with adding overhead to fastpaths like this. Your
MCE handler should have already taken care of this so I can't
see what it can gain.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/page-writeback.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c 2009-06-03 19:36:20.000000000 +0200
> +++ linux/mm/page-writeback.c 2009-06-03 19:36:23.000000000 +0200
> @@ -1304,6 +1304,10 @@
> {
> struct address_space *mapping = page_mapping(page);
>
> + if (unlikely(PageHWPoison(page))) {
> + SetPageDirty(page);
> + return 0;
> + }
> if (likely(mapping)) {
> int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> #ifdef CONFIG_BLOCK
On Wed, Jun 03, 2009 at 08:46:45PM +0200, Andi Kleen wrote:
>
> From: Wu Fengguang <[email protected]>
>
> If memory corruption hits the free buddy pages, we can safely ignore them.
> No one will access them until page allocation time, then prep_new_page()
> will automatically check and isolate PG_hwpoison page for us (for 0-order
> allocation).
It would be kinda nice if this could be done in the handler
directly (ie. take the page directly out of the allocator
or pcp list). Completely avoiding fastpaths would be a
wonderful goal.
>
> This patch expands prep_new_page() to check every component page in a high
> order page allocation, in order to completely stop PG_hwpoison pages from
> being recirculated.
>
> Note that the common case -- only allocating a single page, doesn't
> do any more work than before. Allocating > order 0 does a bit more work,
> but that's relatively uncommon.
>
> This simple implementation may drop some innocent neighbor pages, hopefully
> it is not a big problem because the event should be rare enough.
>
> This patch adds some runtime costs to high order page users.
>
> [AK: Improved description]
>
> v2: Andi Kleen:
> Port to -mm code
> Move check into separate function.
> Don't dump stack in bad_pages for hwpoisoned pages.
> Signed-off-by: Wu Fengguang <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/page_alloc.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c 2009-06-03 19:37:39.000000000 +0200
> +++ linux/mm/page_alloc.c 2009-06-03 20:13:43.000000000 +0200
> @@ -237,6 +237,12 @@
> static unsigned long nr_shown;
> static unsigned long nr_unshown;
>
> + /* Don't complain about poisoned pages */
> + if (PageHWPoison(page)) {
> + __ClearPageBuddy(page);
> + return;
> + }
> +
> /*
> * Allow a burst of 60 reports, then keep quiet for that minute;
> * or allow a steady drip of one report per second.
> @@ -650,7 +656,7 @@
> /*
> * This page is about to be returned from the page allocator
> */
> -static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> +static inline int check_new_page(struct page *page)
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> @@ -659,6 +665,18 @@
> bad_page(page);
> return 1;
> }
> + return 0;
> +}
> +
> +static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> +{
> + int i;
> +
> + for (i = 0; i < (1 << order); i++) {
> + struct page *p = page + i;
> + if (unlikely(check_new_page(p)))
> + return 1;
> + }
>
> set_page_private(page, 0);
> set_page_refcounted(page);
On Wed, Jun 03, 2009 at 08:46:47PM +0200, Andi Kleen wrote:
Why not have this in rmap.c and not export the locking?
I don't know.. does Hugh care?
> +/*
> + * Collect processes when the error hit an anonymous page.
> + */
> +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> + struct to_kill **tkc)
> +{
> + struct vm_area_struct *vma;
> + struct task_struct *tsk;
> + struct anon_vma *av = page_lock_anon_vma(page);
> +
> + if (av == NULL) /* Not actually mapped anymore */
> + return;
> +
> + read_lock(&tasklist_lock);
> + for_each_process (tsk) {
> + if (!tsk->mm)
> + continue;
> + list_for_each_entry (vma, &av->head, anon_vma_node) {
> + if (vma->vm_mm == tsk->mm)
> + add_to_kill(tsk, page, vma, to_kill, tkc);
> + }
> + }
> + page_unlock_anon_vma(av);
> + read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect processes when the error hit a file mapped page.
> + */
> +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> + struct to_kill **tkc)
> +{
> + struct vm_area_struct *vma;
> + struct task_struct *tsk;
> + struct prio_tree_iter iter;
> + struct address_space *mapping = page_mapping(page);
> +
> + /*
> + * A note on the locking order between the two locks.
> + * We don't rely on this particular order.
> + * If you have some other code that needs a different order
> + * feel free to switch them around. Or add a reverse link
> + * from mm_struct to task_struct, then this could be all
> + * done without taking tasklist_lock and looping over all tasks.
> + */
> +
> + read_lock(&tasklist_lock);
> + spin_lock(&mapping->i_mmap_lock);
This still has my original complaint that it nests the tasklist lock inside
the anon_vma lock and outside the inode i_mmap lock (and anon_vma nests
inside i_mmap). I guess the properties of our current rw locks mean that
this does not matter, but it could if we had "fair" rw locks, or some tree
(the -rt tree maybe) changed the rw lock to a plain exclusive lock.
> + for_each_process(tsk) {
> + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +
> + if (!tsk->mm)
> + continue;
> +
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
> + pgoff)
> + if (vma->vm_mm == tsk->mm)
> + add_to_kill(tsk, page, vma, to_kill, tkc);
> + }
> + spin_unlock(&mapping->i_mmap_lock);
> + read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect the processes who have the corrupted page mapped to kill.
> + * This is done in two steps for locking reasons.
> + * First preallocate one tokill structure outside the spin locks,
> + * so that we can kill at least one process reasonably reliable.
> + */
> +static void collect_procs(struct page *page, struct list_head *tokill)
> +{
> + struct to_kill *tk;
> +
> + tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
> + /* memory allocation failure is implicitly handled */
> + if (PageAnon(page))
> + collect_procs_anon(page, tokill, &tk);
> + else
> + collect_procs_file(page, tokill, &tk);
> + kfree(tk);
> +}
> Index: linux/mm/filemap.c
> ===================================================================
> --- linux.orig/mm/filemap.c 2009-06-03 19:37:38.000000000 +0200
> +++ linux/mm/filemap.c 2009-06-03 20:13:43.000000000 +0200
> @@ -105,6 +105,10 @@
> *
> * ->task->proc_lock
> * ->dcache_lock (proc_pid_lookup)
> + *
> + * (code doesn't rely on that order, so you could switch it around)
> + * ->tasklist_lock (memory_failure, collect_procs_ao)
> + * ->i_mmap_lock
> */
>
> /*
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c 2009-06-03 19:37:38.000000000 +0200
> +++ linux/mm/rmap.c 2009-06-03 20:13:43.000000000 +0200
> @@ -36,6 +36,10 @@
> * mapping->tree_lock (widely used, in set_page_dirty,
> * in arch-dependent flush_dcache_mmap_lock,
> * within inode_lock in __sync_single_inode)
> + *
> + * (code doesn't rely on that order so it could be switched around)
> + * ->tasklist_lock
> + * anon_vma->lock (memory_failure, collect_procs_anon)
> */
>
> #include <linux/mm.h>
On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> Also I thought a bit about the fsync() error scenario. It's really
> a problem that can already happen even without hwpoison, e.g.
> when a page is dropped at the wrong time.
No, the page will never be "dropped" like that except with
this hwpoison. Errors, sure, might get dropped sometimes
due to implementation bugs, but this is adding semantics that
basically break fsync by design.
I really want to resolve the EIO issue because, as I said, it
is a user-abi issue, and too many of those just get shoved
through only for someone to care about the fundamental breakage
years later.
You say that SIGKILL is overkill for such pages, but in fact
this is exactly what you do with mapped pages anyway, so why
not with other pages as well? I think it is perfectly fine to
do so (and maybe a new error code can be introduced and that
can be delivered to processes that can handle it rather than
SIGKILL).
Last request: do you have a panic-on-memory-error option?
I think HA systems and ones with properly designed data
integrity at the application layer will much prefer to
halt the system rather than attempt ad-hoc recovery that does
not always work and might screw things up worse.
On Wed, Jun 03, 2009 at 08:46:38PM +0200, Andi Kleen wrote:
>
> Bail out early when hardware poisoned pages are found in page fault handling.
> Since they are poisoned they should not be mapped freshly into processes,
> because that would cause another (potentially deadly) machine check
>
> This is generally handled in the same way as OOM, just a different
> error code is returned to the architecture code.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/memory.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> +++ linux/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> @@ -2797,6 +2797,9 @@
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> return ret;
>
> + if (unlikely(PageHWPoison(vmf.page)))
> + return VM_FAULT_HWPOISON;
Again, it would be nice if you just worry about this in your MCE
handler and don't sprinkle things like this in fastpaths.
> +
> /*
> * For consistency in subsequent calls, make the faulted page always
> * locked.
On Tue, Jun 09, 2009 at 11:51:55AM +0200, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:47PM +0200, Andi Kleen wrote:
> > +static int me_pagecache_clean(struct page *p, unsigned long pfn)
> > +{
> > + struct address_space *mapping;
> > +
> > + if (!isolate_lru_page(p))
> > + page_cache_release(p);
> > +
> > + /*
> > + * Now truncate the page in the page cache. This is really
> > + * more like a "temporary hole punch"
> > + * Don't do this for block devices when someone else
> > + * has a reference, because it could be file system metadata
> > + * and that's not safe to truncate.
> > + */
> > + mapping = page_mapping(p);
> > + if (mapping && S_ISBLK(mapping->host->i_mode) && page_count(p) > 1) {
> > + printk(KERN_ERR
> > + "MCE %#lx: page looks like a unsupported file system metadata page\n",
> > + pfn);
> > + return FAILED;
> > + }
>
> page_count check is racy. Hmm, S_ISBLK should handle xfs's private mapping.
> AFAIK btrfs has a similar private mapping but a quick grep does not show
> up S_IFBLK anywhere, so I don't know what the situation is there.
>
> Unfortunately though, the linear mapping is not the only metadata mapping
> a filesystem might have. Many work on directories in seperate mappings
> (ext2, for example, which is where I first looked and will still oops with
> your check).
>
> Also, others may have other interesting inodes they use for metadata. Do
> any of them go through the pagecache? I dont know. The ext3 journal,
> for example? How does that work?
>
> Unfortunately I don't know a good way to detect regular data mappings
> easily. Ccing linux-fsdevel. Until that is worked out, you'd need to
> use the safe pagecache invalidate rather than unsafe truncate.
Maybe just testing S_ISREG would be better. Definitely safer than
ISBLK.
Note that for !ISREG files you can still attempt the
non-destructive invalidate (after extracting a suitable function,
similarly to the truncate one). Most likely the fs is not using
the page right now, so it should give a bit more coverage.
I still don't know exactly about, say, the ext3 journal. Probably
it doesn't use pagecache anyway. Do any other filesystems do
crazy things with S_ISREG files? They probably deserve to oops
if they do ;)
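(Rough sketch of that distinction -- hwpoison_drop_page is a made-up
helper; truncate_inode_page is the helper added by this patchkit, while
invalidate_mapping_pages is the existing non-destructive path that
refuses dirty, mapped or busy-buffer pages:)

static int hwpoison_drop_page(struct address_space *mapping, struct page *p)
{
	if (S_ISREG(mapping->host->i_mode)) {
		/* regular data page: hole-punch style removal is fine */
		truncate_inode_page(mapping, p);
		return 1;
	}
	/*
	 * Possibly fs metadata: only try the non-destructive invalidate,
	 * which skips anything the fs may still be using.
	 */
	return invalidate_mapping_pages(mapping, p->index, p->index) == 1;
}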
On Tue, Jun 09, 2009 at 06:25:04PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:38PM +0200, Andi Kleen wrote:
> >
> > Bail out early when hardware poisoned pages are found in page fault handling.
> > Since they are poisoned they should not be mapped freshly into processes,
> > because that would cause another (potentially deadly) machine check
> >
> > This is generally handled in the same way as OOM, just a different
> > error code is returned to the architecture code.
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> >
> > ---
> > mm/memory.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > Index: linux/mm/memory.c
> > ===================================================================
> > --- linux.orig/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > +++ linux/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > @@ -2797,6 +2797,9 @@
> > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> > return ret;
> >
> > + if (unlikely(PageHWPoison(vmf.page)))
> > + return VM_FAULT_HWPOISON;
>
> Again, it would be nice if you just worry about this in your MCE
> handler and don't sprinkle things like this in fastpaths.
For this patch, I cannot imagine a clear use case for it, and
propose to remove it until such a case comes up.
Thanks,
Fengguang
On Tue, Jun 09, 2009 at 05:54:23PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:40PM +0200, Andi Kleen wrote:
> > - force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
> > +#ifdef CONFIG_MEMORY_FAILURE
> > + if (fault & VM_FAULT_HWPOISON) {
> > + printk(KERN_ERR
> > + "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> > + tsk->comm, tsk->pid, address);
> > + code = BUS_MCEERR_AR;
> > + }
> > +#endif
>
> If you make VM_FAULT_HWPOISON 0 when !CONFIG_MEMORY_FAILURE, then
> you can remove this ifdef, can't you?
Sure we can. Here is the incremental patch.
---
HWPOISON: define VM_FAULT_HWPOISON to 0 when feature is disabled
So as to eliminate one #ifdef in the c source.
Proposed by Nick Piggin.
CC: Nick Piggin <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
arch/x86/mm/fault.c | 3 +--
include/linux/mm.h | 7 ++++++-
2 files changed, 7 insertions(+), 3 deletions(-)
--- sound-2.6.orig/arch/x86/mm/fault.c
+++ sound-2.6/arch/x86/mm/fault.c
@@ -819,14 +819,13 @@ do_sigbus(struct pt_regs *regs, unsigned
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
-#ifdef CONFIG_MEMORY_FAILURE
if (fault & VM_FAULT_HWPOISON) {
printk(KERN_ERR
"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
tsk->comm, tsk->pid, address);
code = BUS_MCEERR_AR;
}
-#endif
+
force_sig_info_fault(SIGBUS, code, address, tsk);
}
--- sound-2.6.orig/include/linux/mm.h
+++ sound-2.6/include/linux/mm.h
@@ -702,11 +702,16 @@ static inline int page_mapped(struct pag
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
-#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
+#ifdef CONFIG_MEMORY_FAILURE
+#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */
+#else
+#define VM_FAULT_HWPOISON 0
+#endif
+
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
/*
On Tue, Jun 09, 2009 at 08:21:31PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 06:25:04PM +0800, Nick Piggin wrote:
> > On Wed, Jun 03, 2009 at 08:46:38PM +0200, Andi Kleen wrote:
> > >
> > > Bail out early when hardware poisoned pages are found in page fault handling.
> > > Since they are poisoned they should not be mapped freshly into processes,
> > > because that would cause another (potentially deadly) machine check
> > >
> > > This is generally handled in the same way as OOM, just a different
> > > error code is returned to the architecture code.
> > >
> > > Signed-off-by: Andi Kleen <[email protected]>
> > >
> > > ---
> > > mm/memory.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > Index: linux/mm/memory.c
> > > ===================================================================
> > > --- linux.orig/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > > +++ linux/mm/memory.c 2009-06-03 19:36:23.000000000 +0200
> > > @@ -2797,6 +2797,9 @@
> > > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> > > return ret;
> > >
> > > + if (unlikely(PageHWPoison(vmf.page)))
> > > + return VM_FAULT_HWPOISON;
> >
> > Again, it would be nice if you just worry about this in your MCE
> > handler and don't sprinkle things like this in fastpaths.
>
> For this patch, I cannot imagine a clear usage case for it, and
> proposed to remove it until there comes a case.
Yes, I made some of these comments before seeing your review
comments. If this is gone then I'm happy :)
On Tue, Jun 09, 2009 at 05:59:20PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:43PM +0200, Andi Kleen wrote:
> >
> > Bail out early in set_page_dirty for poisoned pages. We don't want any
> > of the dirty accounting done or file system write back started, because
> > the page will be just thrown away.
>
> I don't agree with adding overhead to fastpaths like this. Your
> MCE handler should have already taken care of this so I can't
> see what it can gain.
Agreed to remove this patch. The poisoned page should already be
isolated (in normal cases), and won't reach this code path.
Thanks,
Fengguang
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> >
> > ---
> > mm/page-writeback.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > Index: linux/mm/page-writeback.c
> > ===================================================================
> > --- linux.orig/mm/page-writeback.c 2009-06-03 19:36:20.000000000 +0200
> > +++ linux/mm/page-writeback.c 2009-06-03 19:36:23.000000000 +0200
> > @@ -1304,6 +1304,10 @@
> > {
> > struct address_space *mapping = page_mapping(page);
> >
> > + if (unlikely(PageHWPoison(page))) {
> > + SetPageDirty(page);
> > + return 0;
> > + }
> > if (likely(mapping)) {
> > int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
> > #ifdef CONFIG_BLOCK
On Tue, Jun 09, 2009 at 06:02:29PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:45PM +0200, Andi Kleen wrote:
> >
> > From: Wu Fengguang <[email protected]>
> >
> > If memory corruption hits the free buddy pages, we can safely ignore them.
> > No one will access them until page allocation time, then prep_new_page()
> > will automatically check and isolate PG_hwpoison page for us (for 0-order
> > allocation).
>
> It would be kinda nice if this could be done in the handler
> directly (ie. take the page directly out of the allocator
> or pcp list). Completely avoiding fastpaths would be a
> wonderful goal.
In fact Andi has code to do that. We prefer this one because
- it's simple
- it's a good sanity check for possible software BUGs
- it mainly adds overhead to high order pages, which is acceptable
Thanks,
Fengguang
> >
> > + /* Don't complain about poisoned pages */
> > + if (PageHWPoison(page)) {
> > + __ClearPageBuddy(page);
> > + return;
> > + }
> > +
I do think the above chunk is not absolutely necessary, though.
Thanks,
Fengguang
> > /*
> > * Allow a burst of 60 reports, then keep quiet for that minute;
> > * or allow a steady drip of one report per second.
> > @@ -650,7 +656,7 @@
> > /*
> > * This page is about to be returned from the page allocator
> > */
> > -static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > +static inline int check_new_page(struct page *page)
> > {
> > if (unlikely(page_mapcount(page) |
> > (page->mapping != NULL) |
> > @@ -659,6 +665,18 @@
> > bad_page(page);
> > return 1;
> > }
> > + return 0;
> > +}
> > +
> > +static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < (1 << order); i++) {
> > + struct page *p = page + i;
> > + if (unlikely(check_new_page(p)))
> > + return 1;
> > + }
> >
> > set_page_private(page, 0);
> > set_page_refcounted(page);
On Tue, Jun 09, 2009 at 09:03:04PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 06:02:29PM +0800, Nick Piggin wrote:
> > On Wed, Jun 03, 2009 at 08:46:45PM +0200, Andi Kleen wrote:
> > >
> > > From: Wu Fengguang <[email protected]>
> > >
> > > If memory corruption hits the free buddy pages, we can safely ignore them.
> > > No one will access them until page allocation time, then prep_new_page()
> > > will automatically check and isolate PG_hwpoison page for us (for 0-order
> > > allocation).
> >
> > It would be kinda nice if this could be done in the handler
> > directly (ie. take the page directly out of the allocator
> > or pcp list). Completely avoiding fastpaths would be a
> > wonderful goal.
>
> In fact Andi have code to do that. We prefer this one because
> - it's simple
> - it's good sanity check for possible software BUGs
> - it mainly adds overhead to high order pages, which is acceptable
Yeah, it's not bad. But we don't have many other non-debug options
that check for random memory corruption like this. Given that the
struct page array is a very tiny proportion of memory, I'm not
totally convinced that all this checking in the page allocator is
worthwhile for everyone. It's a much bigger cost if checks and
branches have to be there just for hwpoison.
And I don't think removing a free page from the page allocator is
too much more complex than removing a live page from the pagecache ;)
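(Very rough sketch of what taking an order-0 free page straight out of
the buddy allocator could look like -- hypothetical code, written as if
inside mm/page_alloc.c using its internals; higher-order buddies would
first need splitting, and the pcp lists are ignored here:)

static int sketch_take_free_page(struct zone *zone, struct page *page)
{
	unsigned long flags;
	int taken = 0;

	spin_lock_irqsave(&zone->lock, flags);
	if (PageBuddy(page) && page_private(page) == 0) {
		/* order-0 page sitting on a free list: unlink it */
		list_del(&page->lru);
		__ClearPageBuddy(page);
		set_page_private(page, 0);
		zone->free_area[0].nr_free--;
		__mod_zone_page_state(zone, NR_FREE_PAGES, -1);
		taken = 1;
	}
	spin_unlock_irqrestore(&zone->lock, flags);
	return taken;
}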
>
> Thanks,
> Fengguang
>
> > >
> > > + /* Don't complain about poisoned pages */
> > > + if (PageHWPoison(page)) {
> > > + __ClearPageBuddy(page);
> > > + return;
> > > + }
> > > +
>
> I do think the above chunk is not absolutely necessary, though.
>
> Thanks,
> Fengguang
>
>
> > > /*
> > > * Allow a burst of 60 reports, then keep quiet for that minute;
> > > * or allow a steady drip of one report per second.
> > > @@ -650,7 +656,7 @@
> > > /*
> > > * This page is about to be returned from the page allocator
> > > */
> > > -static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > > +static inline int check_new_page(struct page *page)
> > > {
> > > if (unlikely(page_mapcount(page) |
> > > (page->mapping != NULL) |
> > > @@ -659,6 +665,18 @@
> > > bad_page(page);
> > > return 1;
> > > }
> > > + return 0;
> > > +}
> > > +
> > > +static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > > +{
> > > + int i;
> > > +
> > > + for (i = 0; i < (1 << order); i++) {
> > > + struct page *p = page + i;
> > > + if (unlikely(check_new_page(p)))
> > > + return 1;
> > > + }
> > >
> > > set_page_private(page, 0);
> > > set_page_refcounted(page);
On Tue, Jun 09, 2009 at 09:28:47PM +0800, Nick Piggin wrote:
> On Tue, Jun 09, 2009 at 09:03:04PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 06:02:29PM +0800, Nick Piggin wrote:
> > > On Wed, Jun 03, 2009 at 08:46:45PM +0200, Andi Kleen wrote:
> > > >
> > > > From: Wu Fengguang <[email protected]>
> > > >
> > > > If memory corruption hits the free buddy pages, we can safely ignore them.
> > > > No one will access them until page allocation time, then prep_new_page()
> > > > will automatically check and isolate PG_hwpoison page for us (for 0-order
> > > > allocation).
> > >
> > > It would be kinda nice if this could be done in the handler
> > > directly (ie. take the page directly out of the allocator
> > > or pcp list). Completely avoiding fastpaths would be a
> > > wonderful goal.
> >
> > In fact Andi have code to do that. We prefer this one because
> > - it's simple
> > - it's good sanity check for possible software BUGs
> > - it mainly adds overhead to high order pages, which is acceptable
>
> Yeah it's not bad. But we don't have much other non-debug options
> that check for random memory corruption like this. Given that the
> struct page is a very tiny proportion of memory, then I'm not
> totally convinced that all this checking in the page allocator is
> worthwhile for everyone. It's a much bigger cost if checks and
> branches have to be there just for hwpoison.
Maybe.
> And I don't think removing a free page from the page allocator is
> too much more complex than removing a live page from the pagecache ;)
There are usable functions for doing pagecache isolations, but none
to isolate one specific page from the buddy system.
Plus, if we did provide such a function, you'd then ask for it to be
included in page_alloc.c, injecting a big chunk of dead code into the
really hot code blocks and possibly polluting the L2 cache. Would it be
better than just inserting several lines? Hardly. Smaller text itself
yields faster speed.
Thanks,
Fengguang
>
> >
> > Thanks,
> > Fengguang
> >
> > > >
> > > > + /* Don't complain about poisoned pages */
> > > > + if (PageHWPoison(page)) {
> > > > + __ClearPageBuddy(page);
> > > > + return;
> > > > + }
> > > > +
> >
> > I do think the above chunk is not absolutely necessary, though.
> >
> > Thanks,
> > Fengguang
> >
> >
> > > > /*
> > > > * Allow a burst of 60 reports, then keep quiet for that minute;
> > > > * or allow a steady drip of one report per second.
> > > > @@ -650,7 +656,7 @@
> > > > /*
> > > > * This page is about to be returned from the page allocator
> > > > */
> > > > -static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > > > +static inline int check_new_page(struct page *page)
> > > > {
> > > > if (unlikely(page_mapcount(page) |
> > > > (page->mapping != NULL) |
> > > > @@ -659,6 +665,18 @@
> > > > bad_page(page);
> > > > return 1;
> > > > }
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > > > +{
> > > > + int i;
> > > > +
> > > > + for (i = 0; i < (1 << order); i++) {
> > > > + struct page *p = page + i;
> > > > + if (unlikely(check_new_page(p)))
> > > > + return 1;
> > > > + }
> > > >
> > > > set_page_private(page, 0);
> > > > set_page_refcounted(page);
On Tue, Jun 09, 2009 at 09:49:03PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 09:28:47PM +0800, Nick Piggin wrote:
> > And I don't think removing a free page from the page allocator is
> > too much more complex than removing a live page from the pagecache ;)
>
> There are usable functions for doing pagecache isolations, but no one
> to isolate one specific page from the buddy system.
But it shouldn't be too hard. Anyway you wanted to reinvent your
own functions for pagecache isolations ;)
> Plus, if we did present such a function, you'll then ask for it being
> included in page_alloc.c, injecting a big chunk of dead code into the
> really hot code blocks and possibly polluting the L2 cache. Will it be
But you would say no because you like it better in your memory
isolation file ;)
> better than just inserting several lines? Hardly. Smaller text itself
> yields faster speed.
Oh, speed I'm definitely thinking about, don't worry about that.
Moving hot and cold functions together could become an issue
indeed. Mostly it probably matters a little less than code
within a single function, due to their size. But I think gcc
already has options to annotate this kind of thing, which we
could be using.
So it's not such a good argument against moving things out of
hotpaths, or for guiding which files to place functions in.
Anyway, in this case it is not a "nack" from me. It's just that I
would like to see the non-fastpath code too, or at least have
it thought about.
On Tue, Jun 09, 2009 at 09:55:14PM +0800, Nick Piggin wrote:
> On Tue, Jun 09, 2009 at 09:49:03PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 09:28:47PM +0800, Nick Piggin wrote:
> > > And I don't think removing a free page from the page allocator is
> > > too much more complex than removing a live page from the pagecache ;)
> >
> > There are usable functions for doing pagecache isolations, but no one
> > to isolate one specific page from the buddy system.
>
> But it shouldn't be too hard. Anyway you wanted to reinvent your
> own functions for pagecache isolations ;)
Heh.
> > Plus, if we did present such a function, you'll then ask for it being
> > included in page_alloc.c, injecting a big chunk of dead code into the
> > really hot code blocks and possibly polluting the L2 cache. Will it be
>
> But you would say no because you like it better in your memory
> isolation file ;)
I would like to align with your principle on this one :)
> > better than just inserting several lines? Hardly. Smaller text itself
> > yields faster speed.
>
> Oh speed I'm definitely thinking about, don't worry about that.
>
> Moving hot and cold functions together could become an issue
> indeed. Mostly it probably matters a little less than code
> within a single function due to their size. But I think gcc
> already has options to annotate this kind of thing which we
> could be using.
Can we tell gcc "I bet this _function_ is rarely used"?
> So it's not such a good argument against moving things out of
> hotpaths, or guiding in which files to place functions.
Yes.
> Anyway, in this case it is not a "nack" from me. Just that I
> would like to see the non-fastpath code too or at least if
> it can be thought about.
I think Andi would be pleased to present you with his buddy page
isolation code :)
Thanks,
Fengguang
On Tue, Jun 09, 2009 at 10:56:14PM +0800, Wu Fengguang wrote:
> > Moving hot and cold functions together could become an issue
> > indeed. Mostly it probably matters a little less than code
> > within a single function due to their size. But I think gcc
> > already has options to annotate this kind of thing which we
> > could be using.
>
> Can we tell gcc "I bet this _function_ is rarely used"?
Yes, you can annotate a function as hot or cold.
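(For example -- the function names below are made up; the kernel already
wraps the cold attribute as __cold in include/linux/compiler-gcc4.h for
gcc >= 4.3, and cold functions get placed in .text.unlikely, away from
the hot text:)

/* error/recovery path, keep it out of the hot text */
static void __cold memory_failure_slowpath_sketch(unsigned long pfn)
{
	/* ... rarely executed recovery work ... */
}

/* or, spelled out without the kernel wrapper: */
void my_rare_function(void) __attribute__((cold));
void my_hot_function(void) __attribute__((hot));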
> > So it's not such a good argument against moving things out of
> > hotpaths, or guiding in which files to place functions.
>
> Yes.
>
> > Anyway, in this case it is not a "nack" from me. Just that I
> > would like to see the non-fastpath code too or at least if
> > it can be thought about.
>
> I think Andi would be pleased to present you with his buddy page
> isolation code :)
On Tue, 9 Jun 2009, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:47PM +0200, Andi Kleen wrote:
>
> Why not have this in rmap.c and not export the locking?
> I don't know.. does Hugh care?
Thanks for catching my eye with this, Nick.
As I've said to Andi, I don't actually have time to study all this.
To me, it's just another layer of complexity and maintenance burden
that one special-interest group is imposing upon mm, and I can't
keep up with it myself.
But looking at this one set of extracts: you're right that I'm much
happier when page_lock_anon_vma() isn't leaked outside of mm/rmap.c,
though I'd probably hate this whichever way round it was; and I
share your lock ordering concern.
However, if I'm interpreting these extracts correctly, the whole
thing looks very misguided to me. Are we really going to kill any
process that has a cousin who might once have mapped the page that's
been found hwpoisonous? The hwpoison secret police are dangerously
out of control, I'd say.
The usual use of rmap lookup loops is to go on to look into the page
table to see whether the page is actually mapped: I see no attempt
at that here, just an assumption that anyone on the list is guilty
of mapping the page and must be killed. And even if it did go on
to check if the page is there, maybe the process lost interest in
that page several weeks ago, why kill it?
At least in the file's prio_tree case, we'll only be killing those
who mmapped the range which happens to include the page. But in the
anon case, remember the anon_vma is just a bundle of "related" vmas
outside of which the page will not be found; so if one process got a
poisonous page through COW, all the other processes which happen to
be sharing that anon_vma through fork or through adjacent merging,
are going to get killed too.
Guilty by association.
I think a much more sensible approach would be to follow the page
migration technique of replacing the page's ptes by a special swap-like
entry, then do the killing from do_swap_page() if a process actually
tries to access the page.
But perhaps that has already been discussed and found impossible,
or I'm taking "kill" too seriously and other checks are done
elsewhere, or...
Hugh
>
> > +/*
> > + * Collect processes when the error hit an anonymous page.
> > + */
> > +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> > + struct to_kill **tkc)
> > +{
> > + struct vm_area_struct *vma;
> > + struct task_struct *tsk;
> > + struct anon_vma *av = page_lock_anon_vma(page);
> > +
> > + if (av == NULL) /* Not actually mapped anymore */
> > + return;
> > +
> > + read_lock(&tasklist_lock);
> > + for_each_process (tsk) {
> > + if (!tsk->mm)
> > + continue;
> > + list_for_each_entry (vma, &av->head, anon_vma_node) {
> > + if (vma->vm_mm == tsk->mm)
> > + add_to_kill(tsk, page, vma, to_kill, tkc);
> > + }
> > + }
> > + page_unlock_anon_vma(av);
> > + read_unlock(&tasklist_lock);
> > +}
> > +
> > +/*
> > + * Collect processes when the error hit a file mapped page.
> > + */
> > +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > + struct to_kill **tkc)
> > +{
> > + struct vm_area_struct *vma;
> > + struct task_struct *tsk;
> > + struct prio_tree_iter iter;
> > + struct address_space *mapping = page_mapping(page);
> > +
> > + /*
> > + * A note on the locking order between the two locks.
> > + * We don't rely on this particular order.
> > + * If you have some other code that needs a different order
> > + * feel free to switch them around. Or add a reverse link
> > + * from mm_struct to task_struct, then this could be all
> > + * done without taking tasklist_lock and looping over all tasks.
> > + */
> > +
> > + read_lock(&tasklist_lock);
> > + spin_lock(&mapping->i_mmap_lock);
>
> This still has my original complaint that it nests tasklist lock inside
> anon vma lock and outside inode mmap lock (and anon_vma nests inside i_mmap).
> I guess the property of our current rw locks means that does not matter,
> but it could if we had "fair" rw locks, or some tree (-rt tree maybe)
> changed rw lock to a plain exclusive lock.
>
>
> > + for_each_process(tsk) {
> > + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > +
> > + if (!tsk->mm)
> > + continue;
> > +
> > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
> > + pgoff)
> > + if (vma->vm_mm == tsk->mm)
> > + add_to_kill(tsk, page, vma, to_kill, tkc);
> > + }
> > + spin_unlock(&mapping->i_mmap_lock);
> > + read_unlock(&tasklist_lock);
> > +}
> > +
> > +/*
> > + * Collect the processes who have the corrupted page mapped to kill.
> > + * This is done in two steps for locking reasons.
> > + * First preallocate one tokill structure outside the spin locks,
> > + * so that we can kill at least one process reasonably reliable.
> > + */
> > +static void collect_procs(struct page *page, struct list_head *tokill)
> > +{
> > + struct to_kill *tk;
> > +
> > + tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
> > + /* memory allocation failure is implicitly handled */
> > + if (PageAnon(page))
> > + collect_procs_anon(page, tokill, &tk);
> > + else
> > + collect_procs_file(page, tokill, &tk);
> > + kfree(tk);
> > +}
>
> > Index: linux/mm/filemap.c
> > ===================================================================
> > --- linux.orig/mm/filemap.c 2009-06-03 19:37:38.000000000 +0200
> > +++ linux/mm/filemap.c 2009-06-03 20:13:43.000000000 +0200
> > @@ -105,6 +105,10 @@
> > *
> > * ->task->proc_lock
> > * ->dcache_lock (proc_pid_lookup)
> > + *
> > + * (code doesn't rely on that order, so you could switch it around)
> > + * ->tasklist_lock (memory_failure, collect_procs_ao)
> > + * ->i_mmap_lock
> > */
> >
> > /*
> > Index: linux/mm/rmap.c
> > ===================================================================
> > --- linux.orig/mm/rmap.c 2009-06-03 19:37:38.000000000 +0200
> > +++ linux/mm/rmap.c 2009-06-03 20:13:43.000000000 +0200
> > @@ -36,6 +36,10 @@
> > * mapping->tree_lock (widely used, in set_page_dirty,
> > * in arch-dependent flush_dcache_mmap_lock,
> > * within inode_lock in __sync_single_inode)
> > + *
> > + * (code doesn't rely on that order so it could be switched around)
> > + * ->tasklist_lock
> > + * anon_vma->lock (memory_failure, collect_procs_anon)
> > */
> >
> > #include <linux/mm.h>
On Tue, Jun 09, 2009 at 05:05:53PM +0100, Hugh Dickins wrote:
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> > On Wed, Jun 03, 2009 at 08:46:47PM +0200, Andi Kleen wrote:
> >
> > Why not have this in rmap.c and not export the locking?
> > I don't know.. does Hugh care?
>
> Thanks for catching my eye with this, Nick.
>
> As I've said to Andi, I don't actually have time to study all this.
> To me, it's just another layer of complexity and maintenance burden
> that one special-interest group is imposing upon mm, and I can't
> keep up with it myself.
I know how it feels. But the problem with unreviewed patches is
just that eventually you find yourself needing to review them in
6 months and finding the problems then.
There are a few options. Further question the cost/benefit of
a feature; slow down merging until it is satisfactorily reviewed;
or merge it anyway (aka a creative definition of satisfactory).
Not for me to decide, sorry. I'm nearly always more conservative
about merging features than popular opinion (except perhaps
performance improvements, which are my weakness!)
> But looking at this one set of extracts: you're right that I'm much
> happier when page_lock_anon_vma() isn't leaked outside of mm/rmap.c,
> though I'd probably hate this whichever way round it was; and I
> share your lock ordering concern.
>
> However, if I'm interpreting these extracts correctly, the whole
> thing looks very misguided to me. Are we really going to kill any
> process that has a cousin who might once have mapped the page that's
> been found hwpoisonous? The hwpoison secret police are dangerously
> out of control, I'd say.
>
> The usual use of rmap lookup loops is to go on to look into the page
> table to see whether the page is actually mapped: I see no attempt
> at that here, just an assumption that anyone on the list is guilty
> of mapping the page and must be killed. And even if it did go on
> to check if the page is there, maybe the process lost interest in
> that page several weeks ago, why kill it?
>
> At least in the file's prio_tree case, we'll only be killing those
> who mmapped the range which happens to include the page. But in the
> anon case, remember the anon_vma is just a bundle of "related" vmas
> outside of which the page will not be found; so if one process got a
> poisonous page through COW, all the other processes which happen to
> be sharing that anon_vma through fork or through adjacent merging,
> are going to get killed too.
>
> Guilty by association.
That's a very good point and I didn't even really notice that
(I didn't come across a writeup of the intended policy/semantics
in case of a bad page, so I've not really been able to tell most
of the time whether code matches intention, but I would say you
are probably right about this case).
> I think a much more sensible approach would be to follow the page
> migration technique of replacing the page's ptes by a special swap-like
> entry, then do the killing from do_swap_page() if a process actually
> tries to access the page.
>
> But perhaps that has already been discussed and found impossible,
> or I'm taking "kill" too seriously and other checks are done
> elsewhere, or...
No, I think this might be a very good suggestion.
On Tue, Jun 09, 2009 at 05:57:25PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:41PM +0200, Andi Kleen wrote:
> >
> > try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
> > which are selected by magic flag variables. The logic is not very straight
> > forward, because each of these flag change multiple behaviours (e.g.
> > migration turns off aging, not only sets up migration ptes etc.)
> > Also the different flags interact in magic ways.
> >
> > A later patch in this series adds another mode to try_to_unmap, so
> > this becomes quickly unmanageable.
> >
> > Replace the different flags with a action code (migration, munlock, munmap)
> > and some additional flags as modifiers (ignore mlock, ignore aging).
> > This makes the logic more straight forward and allows easier extension
> > to new behaviours. Change all the caller to declare what they want to
> > do.
> >
> > This patch is supposed to be a nop in behaviour. If anyone can prove
> > it is not that would be a bug.
> >
> > Cc: [email protected]
> > Cc: [email protected]
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> >
> > ---
> > include/linux/rmap.h | 14 +++++++++++++-
> > mm/migrate.c | 2 +-
> > mm/rmap.c | 40 ++++++++++++++++++++++------------------
> > mm/vmscan.c | 2 +-
> > 4 files changed, 37 insertions(+), 21 deletions(-)
> >
> > Index: linux/include/linux/rmap.h
> > ===================================================================
> > --- linux.orig/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
> > +++ linux/include/linux/rmap.h 2009-06-03 20:39:50.000000000 +0200
> > @@ -84,7 +84,19 @@
> > * Called from mm/vmscan.c to handle paging out
> > */
> > int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
> > -int try_to_unmap(struct page *, int ignore_refs);
> > +
> > +enum ttu_flags {
> > + TTU_UNMAP = 0, /* unmap mode */
> > + TTU_MIGRATION = 1, /* migration mode */
> > + TTU_MUNLOCK = 2, /* munlock mode */
> > + TTU_ACTION_MASK = 0xff,
> > +
> > + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
> > + TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
> > +};
> > +#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>
> I still think this is nasty and should work like Gfp flags.
I don't see big problems here.
We have page_zone() and gfp_zone(), so why not TTU_ACTION()? We could
allocate one bit for each action code, but in principle they are exclusive.
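For reference, a minimal sketch of how the action/modifier split decodes.
The enum values below are taken from the quoted rmap.h hunk; the dispatch
function is only illustrative and is not the real try_to_unmap():

/* Values as introduced by the quoted rmap.h hunk. */
enum ttu_flags {
	TTU_UNMAP	  = 0,		/* unmap mode */
	TTU_MIGRATION	  = 1,		/* migration mode */
	TTU_MUNLOCK	  = 2,		/* munlock mode */
	TTU_ACTION_MASK	  = 0xff,

	TTU_IGNORE_MLOCK  = (1 << 8),	/* ignore mlock */
	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
};
#define TTU_ACTION(x)	((x) & TTU_ACTION_MASK)

/*
 * Illustrative dispatcher: the low byte selects exactly one action,
 * the high bits are independent modifiers.  A migration caller would
 * pass e.g. TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS.
 */
static int ttu_decode_example(enum ttu_flags flags)
{
	int ignore_mlock  = !!(flags & TTU_IGNORE_MLOCK);
	int ignore_access = !!(flags & TTU_IGNORE_ACCESS);

	switch (TTU_ACTION(flags)) {
	case TTU_UNMAP:		/* normal reclaim unmap */
		break;
	case TTU_MIGRATION:	/* install migration ptes instead */
		break;
	case TTU_MUNLOCK:	/* only deal with mlocked pages */
		break;
	default:
		break;
	}
	return ignore_mlock + ignore_access;
}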
On Tue, Jun 09, 2009 at 06:09:22PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:47PM +0200, Andi Kleen wrote:
>
> Why not have this in rmap.c and not export the locking?
> I don't know.. does Hugh care?
I don't know either :)
> > +/*
> > + * Collect processes when the error hit an anonymous page.
> > + */
> > +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> > + struct to_kill **tkc)
> > +{
> > + struct vm_area_struct *vma;
> > + struct task_struct *tsk;
> > + struct anon_vma *av = page_lock_anon_vma(page);
> > +
> > + if (av == NULL) /* Not actually mapped anymore */
> > + return;
> > +
> > + read_lock(&tasklist_lock);
> > + for_each_process (tsk) {
> > + if (!tsk->mm)
> > + continue;
> > + list_for_each_entry (vma, &av->head, anon_vma_node) {
> > + if (vma->vm_mm == tsk->mm)
> > + add_to_kill(tsk, page, vma, to_kill, tkc);
> > + }
> > + }
> > + page_unlock_anon_vma(av);
> > + read_unlock(&tasklist_lock);
> > +}
> > +
> > +/*
> > + * Collect processes when the error hit a file mapped page.
> > + */
> > +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > + struct to_kill **tkc)
> > +{
> > + struct vm_area_struct *vma;
> > + struct task_struct *tsk;
> > + struct prio_tree_iter iter;
> > + struct address_space *mapping = page_mapping(page);
> > +
> > + /*
> > + * A note on the locking order between the two locks.
> > + * We don't rely on this particular order.
> > + * If you have some other code that needs a different order
> > + * feel free to switch them around. Or add a reverse link
> > + * from mm_struct to task_struct, then this could be all
> > + * done without taking tasklist_lock and looping over all tasks.
> > + */
> > +
> > + read_lock(&tasklist_lock);
> > + spin_lock(&mapping->i_mmap_lock);
>
> This still has my original complaint that it nests tasklist lock inside
> anon vma lock and outside inode mmap lock (and anon_vma nests inside i_mmap).
> I guess the property of our current rw locks means that does not matter,
> but it could if we had "fair" rw locks, or some tree (-rt tree maybe)
> changed rw lock to a plain exclusive lock.
Andi must have forgotten that - he did change the comment on the locking order.
This incremental patch aligns the code with his comment in rmap.c.
---
HWPOISON: fix tasklist_lock/anon_vma locking order
To avoid possible deadlock. Proposed by Nick Piggin:
You have tasklist_lock(R) nesting outside i_mmap_lock, and inside anon_vma
lock. And anon_vma lock nests inside i_mmap_lock.
This seems fragile. If rwlocks ever become FIFO or tasklist_lock changes
type (maybe -rt kernels do it), then you could have a task holding
anon_vma lock and waiting for tasklist_lock, and another holding tasklist
lock and waiting for i_mmap_lock, and another holding i_mmap_lock and
waiting for anon_vma lock.
CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/memory-failure.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
--- sound-2.6.orig/mm/memory-failure.c
+++ sound-2.6/mm/memory-failure.c
@@ -215,12 +215,14 @@ static void collect_procs_anon(struct pa
{
struct vm_area_struct *vma;
struct task_struct *tsk;
- struct anon_vma *av = page_lock_anon_vma(page);
+ struct anon_vma *av;
+ read_lock(&tasklist_lock);
+
+ av = page_lock_anon_vma(page);
if (av == NULL) /* Not actually mapped anymore */
- return;
+ goto out;
- read_lock(&tasklist_lock);
for_each_process (tsk) {
if (!tsk->mm)
continue;
@@ -230,6 +232,7 @@ static void collect_procs_anon(struct pa
}
}
page_unlock_anon_vma(av);
+out:
read_unlock(&tasklist_lock);
}
On Wed, Jun 10, 2009 at 10:27:36AM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 05:57:25PM +0800, Nick Piggin wrote:
> > On Wed, Jun 03, 2009 at 08:46:41PM +0200, Andi Kleen wrote:
> > >
> > > try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
> > > which are selected by magic flag variables. The logic is not very straight
> > > forward, because each of these flag change multiple behaviours (e.g.
> > > migration turns off aging, not only sets up migration ptes etc.)
> > > Also the different flags interact in magic ways.
> > >
> > > A later patch in this series adds another mode to try_to_unmap, so
> > > this becomes quickly unmanageable.
> > >
> > > Replace the different flags with a action code (migration, munlock, munmap)
> > > and some additional flags as modifiers (ignore mlock, ignore aging).
> > > This makes the logic more straight forward and allows easier extension
> > > to new behaviours. Change all the caller to declare what they want to
> > > do.
> > >
> > > This patch is supposed to be a nop in behaviour. If anyone can prove
> > > it is not that would be a bug.
> > >
> > > Cc: [email protected]
> > > Cc: [email protected]
> > >
> > > Signed-off-by: Andi Kleen <[email protected]>
> > >
> > > ---
> > > include/linux/rmap.h | 14 +++++++++++++-
> > > mm/migrate.c | 2 +-
> > > mm/rmap.c | 40 ++++++++++++++++++++++------------------
> > > mm/vmscan.c | 2 +-
> > > 4 files changed, 37 insertions(+), 21 deletions(-)
> > >
> > > Index: linux/include/linux/rmap.h
> > > ===================================================================
> > > --- linux.orig/include/linux/rmap.h 2009-06-03 19:36:23.000000000 +0200
> > > +++ linux/include/linux/rmap.h 2009-06-03 20:39:50.000000000 +0200
> > > @@ -84,7 +84,19 @@
> > > * Called from mm/vmscan.c to handle paging out
> > > */
> > > int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
> > > -int try_to_unmap(struct page *, int ignore_refs);
> > > +
> > > +enum ttu_flags {
> > > + TTU_UNMAP = 0, /* unmap mode */
> > > + TTU_MIGRATION = 1, /* migration mode */
> > > + TTU_MUNLOCK = 2, /* munlock mode */
> > > + TTU_ACTION_MASK = 0xff,
> > > +
> > > + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
> > > + TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
> > > +};
> > > +#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
> >
> > I still think this is nasty and should work like Gfp flags.
>
> I don't see big problems here.
>
> We have page_zone() and gfp_zone(), so why not TTU_ACTION()? We could
> allocate one bit for each action code, but in principle they are exclusive.
I haven't actually applied the patchset and looked at the resulting
try_to_unmap function, so I'll hold my tongue until I do that. I'll
send a patch if I can find something that looks nicer. So don't worry
about it for the moment.
On Wed, Jun 10, 2009 at 12:05:53AM +0800, Hugh Dickins wrote:
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> > On Wed, Jun 03, 2009 at 08:46:47PM +0200, Andi Kleen wrote:
> >
> > Why not have this in rmap.c and not export the locking?
> > I don't know.. does Hugh care?
>
> Thanks for catching my eye with this, Nick.
>
> As I've said to Andi, I don't actually have time to study all this.
> To me, it's just another layer of complexity and maintenance burden
> that one special-interest group is imposing upon mm, and I can't
> keep up with it myself.
>
> But looking at this one set of extracts: you're right that I'm much
> happier when page_lock_anon_vma() isn't leaked outside of mm/rmap.c,
> though I'd probably hate this whichever way round it was; and I
> share your lock ordering concern.
The lock ordering is now fixed :)
> However, if I'm interpreting these extracts correctly, the whole
> thing looks very misguided to me. Are we really going to kill any
> process that has a cousin who might once have mapped the page that's
> been found hwpoisonous? The hwpoison secret police are dangerously
> out of control, I'd say.
Good catch! It escaped our previous reviews.
I can add the missing page_mapped_in_vma() calls as well as move the
functions to rmap.c.
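Roughly like this, based on the code quoted above plus the locking-order
fix. The exact semantics of page_mapped_in_vma(page, vma) - returning
non-zero only when the page is currently mapped by that vma - are assumed
here, so treat this as a sketch rather than the final patch:

static void collect_procs_anon(struct page *page, struct list_head *to_kill,
				struct to_kill **tkc)
{
	struct vm_area_struct *vma;
	struct task_struct *tsk;
	struct anon_vma *av;

	read_lock(&tasklist_lock);
	av = page_lock_anon_vma(page);
	if (av == NULL)			/* Not actually mapped anymore */
		goto out;
	for_each_process(tsk) {
		if (!tsk->mm)
			continue;
		list_for_each_entry(vma, &av->head, anon_vma_node) {
			if (vma->vm_mm != tsk->mm)
				continue;
			/* Skip "cousin" vmas where the page is not mapped */
			if (!page_mapped_in_vma(page, vma))
				continue;
			add_to_kill(tsk, page, vma, to_kill, tkc);
		}
	}
	page_unlock_anon_vma(av);
out:
	read_unlock(&tasklist_lock);
}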
> The usual use of rmap lookup loops is to go on to look into the page
> table to see whether the page is actually mapped: I see no attempt
> at that here, just an assumption that anyone on the list is guilty
> of mapping the page and must be killed. And even if it did go on
Right, processes that didn't map the page shall not be killed (at
least by default).
I don't know if the hardware can guarantee to detect a corrupted page
when a process is accessing it (or a device is accessing it via DMA).
If not, then there are a lot more possibilities:
It's possible that a process mapped a corrupted page, consumed the
data, and then the page got unmapped by kswapd, while the hardware
didn't yet know that it was actually a corrupted page. Some time
later the hardware found that out and triggered an exception. In this
case, for maximum safety, we may kill the app whose vma covers the
corrupted file page but does not actually map it. Though that is not
a suitable default policy.
We may also have to define multiple levels of kill policies.
- kill affected application as early as possible?
- kill all mmap() users or only active ones?
- kill clean page users or only dirty page users?
We'd better check out the hardware capabilities first..
> to check if the page is there, maybe the process lost interest in
> that page several weeks ago, why kill it?
Yes, maybe.
> At least in the file's prio_tree case, we'll only be killing those
> who mmapped the range which happens to include the page. But in the
> anon case, remember the anon_vma is just a bundle of "related" vmas
> outside of which the page will not be found; so if one process got a
> poisonous page through COW, all the other processes which happen to
> be sharing that anon_vma through fork or through adjacent merging,
> are going to get killed too.
>
> Guilty by association.
Agreed.
> I think a much more sensible approach would be to follow the page
> migration technique of replacing the page's ptes by a special swap-like
> entry, then do the killing from do_swap_page() if a process actually
> tries to access the page.
We call that "late kill" and will be enabled when
sysctl_memory_failure_early_kill=0. Its default value is 1.
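For concreteness, a toy sketch of where that policy switch plugs in.
sysctl_memory_failure_early_kill is the knob named above and collect_procs()
is from the quoted code; kill_procs_now() is only an assumed name for the
kill-list machinery, and this is not the actual handler:

int sysctl_memory_failure_early_kill __read_mostly = 1;

/* Toy sketch only. */
static void hwpoison_kill_policy_sketch(struct page *p)
{
	LIST_HEAD(tokill);

	if (sysctl_memory_failure_early_kill) {
		/* Early kill: signal everyone who currently maps the page. */
		collect_procs(p, &tokill);
		kill_procs_now(&tokill);
	}
	/*
	 * Late kill (early_kill == 0): nothing is signalled here; a
	 * process that touches the poisoned page again is killed from
	 * the page fault path instead.
	 */
}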
> But perhaps that has already been discussed and found impossible,
> or I'm taking "kill" too seriously and other checks are done
> elsewhere, or...
You are right. We should implement different kill policies and let the
user choose the one that suits them.
Thanks,
Fengguang
> >
> > > +/*
> > > + * Collect processes when the error hit an anonymous page.
> > > + */
> > > +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> > > + struct to_kill **tkc)
> > > +{
> > > + struct vm_area_struct *vma;
> > > + struct task_struct *tsk;
> > > + struct anon_vma *av = page_lock_anon_vma(page);
> > > +
> > > + if (av == NULL) /* Not actually mapped anymore */
> > > + return;
> > > +
> > > + read_lock(&tasklist_lock);
> > > + for_each_process (tsk) {
> > > + if (!tsk->mm)
> > > + continue;
> > > + list_for_each_entry (vma, &av->head, anon_vma_node) {
> > > + if (vma->vm_mm == tsk->mm)
> > > + add_to_kill(tsk, page, vma, to_kill, tkc);
> > > + }
> > > + }
> > > + page_unlock_anon_vma(av);
> > > + read_unlock(&tasklist_lock);
> > > +}
> > > +
> > > +/*
> > > + * Collect processes when the error hit a file mapped page.
> > > + */
> > > +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > > + struct to_kill **tkc)
> > > +{
> > > + struct vm_area_struct *vma;
> > > + struct task_struct *tsk;
> > > + struct prio_tree_iter iter;
> > > + struct address_space *mapping = page_mapping(page);
> > > +
> > > + /*
> > > + * A note on the locking order between the two locks.
> > > + * We don't rely on this particular order.
> > > + * If you have some other code that needs a different order
> > > + * feel free to switch them around. Or add a reverse link
> > > + * from mm_struct to task_struct, then this could be all
> > > + * done without taking tasklist_lock and looping over all tasks.
> > > + */
> > > +
> > > + read_lock(&tasklist_lock);
> > > + spin_lock(&mapping->i_mmap_lock);
> >
> > This still has my original complaint that it nests tasklist lock inside
> > anon vma lock and outside inode mmap lock (and anon_vma nests inside i_mmap).
> > I guess the property of our current rw locks means that does not matter,
> > but it could if we had "fair" rw locks, or some tree (-rt tree maybe)
> > changed rw lock to a plain exclusive lock.
> >
> >
> > > + for_each_process(tsk) {
> > > + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > > +
> > > + if (!tsk->mm)
> > > + continue;
> > > +
> > > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
> > > + pgoff)
> > > + if (vma->vm_mm == tsk->mm)
> > > + add_to_kill(tsk, page, vma, to_kill, tkc);
> > > + }
> > > + spin_unlock(&mapping->i_mmap_lock);
> > > + read_unlock(&tasklist_lock);
> > > +}
> > > +
> > > +/*
> > > + * Collect the processes who have the corrupted page mapped to kill.
> > > + * This is done in two steps for locking reasons.
> > > + * First preallocate one tokill structure outside the spin locks,
> > > + * so that we can kill at least one process reasonably reliable.
> > > + */
> > > +static void collect_procs(struct page *page, struct list_head *tokill)
> > > +{
> > > + struct to_kill *tk;
> > > +
> > > + tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
> > > + /* memory allocation failure is implicitly handled */
> > > + if (PageAnon(page))
> > > + collect_procs_anon(page, tokill, &tk);
> > > + else
> > > + collect_procs_file(page, tokill, &tk);
> > > + kfree(tk);
> > > +}
> >
> > > Index: linux/mm/filemap.c
> > > ===================================================================
> > > --- linux.orig/mm/filemap.c 2009-06-03 19:37:38.000000000 +0200
> > > +++ linux/mm/filemap.c 2009-06-03 20:13:43.000000000 +0200
> > > @@ -105,6 +105,10 @@
> > > *
> > > * ->task->proc_lock
> > > * ->dcache_lock (proc_pid_lookup)
> > > + *
> > > + * (code doesn't rely on that order, so you could switch it around)
> > > + * ->tasklist_lock (memory_failure, collect_procs_ao)
> > > + * ->i_mmap_lock
> > > */
> > >
> > > /*
> > > Index: linux/mm/rmap.c
> > > ===================================================================
> > > --- linux.orig/mm/rmap.c 2009-06-03 19:37:38.000000000 +0200
> > > +++ linux/mm/rmap.c 2009-06-03 20:13:43.000000000 +0200
> > > @@ -36,6 +36,10 @@
> > > * mapping->tree_lock (widely used, in set_page_dirty,
> > > * in arch-dependent flush_dcache_mmap_lock,
> > > * within inode_lock in __sync_single_inode)
> > > + *
> > > + * (code doesn't rely on that order so it could be switched around)
> > > + * ->tasklist_lock
> > > + * anon_vma->lock (memory_failure, collect_procs_anon)
> > > */
> > >
> > > #include <linux/mm.h>
On Wed, Jun 10, 2009 at 04:38:03PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 12:05:53AM +0800, Hugh Dickins wrote:
> > I think a much more sensible approach would be to follow the page
> > migration technique of replacing the page's ptes by a special swap-like
> > entry, then do the killing from do_swap_page() if a process actually
> > tries to access the page.
>
> We call that "late kill" and will be enabled when
> sysctl_memory_failure_early_kill=0. Its default value is 1.
What's the use of this? What are the tradeoffs, in what situations
should an admin set this sysctl one way or the other?
On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > Also I thought a bit about the fsync() error scenario. It's really
> > a problem that can already happen even without hwpoison, e.g.
> > when a page is dropped at the wrong time.
>
> No, the page will never be "dropped" like that except with
> this hwpoison. Errors, sure, might get dropped sometimes
> due to implementation bugs, but this is adding semantics that
> basically break fsync by-design.
You mean the non-persistent EIO is undesirable?
On the other hand, a sticky EIO that can only be explicitly cleared by
the user can also be annoying. How about auto-clearing the EIO bit when
the last active user closes the file?
> I really want to resolve the EIO issue because as I said, it
> is a user-abi issue and too many of those just get shoved
> through only for someone to care about fundamental breakage
> after some years.
Yup.
> You say that SIGKILL is overkill for such pages, but in fact
> this is exactly what you do with mapped pages anyway, so why
> not with other pages as well? I think it is perfectly fine to
> do so (and maybe a new error code can be introduced and that
> can be delivered to processes that can handle it rather than
> SIGKILL).
We can make it a user selectable policy.
They are different in that, mapped dirty pages are normally more vital
(data structures etc.) for correct execution, while write() operates
more often on normal data.
> Last request: do you have a panic-on-memory-error option?
> I think HA systems and ones with properly designed data
> integrity at the application layer will much prefer to
> halt the system than attempt ad-hoc recovery that does not
> always work and might screw things up worse.
Good suggestion. We'll consider such an option. But unconditionally
panic may be undesirable. For example, a corrupted free page or a
clean unmapped file page can be simply isolated - they won't impact
anything.
Thanks,
Fengguang
On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > Also I thought a bit about the fsync() error scenario. It's really
> > > a problem that can already happen even without hwpoison, e.g.
> > > when a page is dropped at the wrong time.
> >
> > No, the page will never be "dropped" like that except with
> > this hwpoison. Errors, sure, might get dropped sometimes
> > due to implementation bugs, but this is adding semantics that
> > basically break fsync by-design.
>
> You mean the non persistent EIO is undesirable?
>
> In the other hand, sticky EIO that can only be explicitly cleared by
> user can also be annoying. How about auto clearing the EIO bit when
> the last active user closes the file?
Well the existing EIO semantics IMO are not great, but that
does not have a big bearing on this new situation. What you
are doing is deliberately throwing away the dirty data, and
giving EIO back in some cases. (but perhaps not others, a
subsequent read or write syscall is not going to get EIO is
it? only fsync).
So even if we did change existing EIO semantics then the
memory corruption case of throwing away dirty data is still
going to be "different" (wrong, I would say).
> > I really want to resolve the EIO issue because as I said, it
> > is a user-abi issue and too many of those just get shoved
> > through only for someone to care about fundamental breakage
> > after some years.
>
> Yup.
>
> > You say that SIGKILL is overkill for such pages, but in fact
> > this is exactly what you do with mapped pages anyway, so why
> > not with other pages as well? I think it is perfectly fine to
> > do so (and maybe a new error code can be introduced and that
> > can be delivered to processes that can handle it rather than
> > SIGKILL).
>
> We can make it a user selectable policy.
Really? Does it need to be? Can the admin sanely make that
choice?
> They are different in that, mapped dirty pages are normally more vital
> (data structures etc.) for correct execution, while write() operates
> more often on normal data.
read and write, remember. That might be somewhat true, but
definitely there are exceptions both ways. How do you
quantify that or justify it? Just handwaving? Why not make
it more consistent overall and just do SIGKILL for everyone?
> > Last request: do you have a panic-on-memory-error option?
> > I think HA systems and ones with properly designed data
> > integrity at the application layer will much prefer to
> > halt the system than attempt ad-hoc recovery that does not
> > always work and might screw things up worse.
>
> Good suggestion. We'll consider such an option. But unconditionally
> panic may be undesirable. For example, a corrupted free page or a
> clean unmapped file page can be simply isolated - they won't impact
> anything.
I thought you were worried about introducing races where the
data can be consumed when doing things such as lock_page and
wait_on_page_writeback. But if things can definitely be
discarded with no references or chances of being consumed, yes
you would not panic for that. But panic for dirty data or
corrupted kernel memory etc. makes a lot of sense.
On Wed, Jun 10, 2009 at 04:59:39PM +0800, Nick Piggin wrote:
> On Wed, Jun 10, 2009 at 04:38:03PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 10, 2009 at 12:05:53AM +0800, Hugh Dickins wrote:
> > > I think a much more sensible approach would be to follow the page
> > > migration technique of replacing the page's ptes by a special swap-like
> > > entry, then do the killing from do_swap_page() if a process actually
> > > tries to access the page.
> >
> > We call that "late kill" and will be enabled when
> > sysctl_memory_failure_early_kill=0. Its default value is 1.
>
> What's the use of this? What are the tradeoffs, in what situations
> should an admin set this sysctl one way or the other?
Good questions.
My understanding is: when an application is generating data A, B, C in
sequence, and A is found to be corrupted by the kernel, does it make
sense for the application to continue generating B and C? Or are there
data dependencies between them? With late kill, it becomes more likely
that the disk will contain new versions of B/C and an old version of A,
which is more likely to create data inconsistency.
So early kill is safer.
Thanks,
Fengguang
On Wed, Jun 10, 2009 at 05:18:07PM +0800, Nick Piggin wrote:
> On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > > Also I thought a bit about the fsync() error scenario. It's really
> > > > a problem that can already happen even without hwpoison, e.g.
> > > > when a page is dropped at the wrong time.
> > >
> > > No, the page will never be "dropped" like that except with
> > > this hwpoison. Errors, sure, might get dropped sometimes
> > > due to implementation bugs, but this is adding semantics that
> > > basically break fsync by-design.
> >
> > You mean the non persistent EIO is undesirable?
> >
> > In the other hand, sticky EIO that can only be explicitly cleared by
> > user can also be annoying. How about auto clearing the EIO bit when
> > the last active user closes the file?
>
> Well the existing EIO semantics IMO are not great, but that
> does not have a big bearing on this new situation. What you
Nod.
> are doing is deliberately throwing away the dirty data, and
> giving EIO back in some cases. (but perhaps not others, a
> subsequent read or write syscall is not going to get EIO is
> it? only fsync).
Right, only fsync/msync and close on nfs will report the error.
write() is normally cached, so obviously it cannot report the later IO
error.
We can make read() IO succeed even if the relevant pages are corrupted
- they can be isolated transparent to user space readers :-)
> So even if we did change existing EIO semantics then the
> memory corruption case of throwing away dirty data is still
> going to be "different" (wrong, I would say).
Oh well.
> > > I really want to resolve the EIO issue because as I said, it
> > > is a user-abi issue and too many of those just get shoved
> > > through only for someone to care about fundamental breakage
> > > after some years.
> >
> > Yup.
> >
> > > You say that SIGKILL is overkill for such pages, but in fact
> > > this is exactly what you do with mapped pages anyway, so why
> > > not with other pages as well? I think it is perfectly fine to
> > > do so (and maybe a new error code can be introduced and that
> > > can be delivered to processes that can handle it rather than
> > > SIGKILL).
> >
> > We can make it a user selectable policy.
>
> Really? Does it need to be? Can the admin sanely make that
> choice?
I just recalled another fact. See below.
> > They are different in that, mapped dirty pages are normally more vital
> > (data structures etc.) for correct execution, while write() operates
> > more often on normal data.
>
> read and write, remember. That might be somewhat true, but
> definitely there are exceptions both ways. How do you
> quantify that or justify it? Just handwaving? Why not make
> it more consistent overall and just do SIGKILL for everyone?
1) under read IO hwpoison pages can be hidden from user space
2) under write IO hwpoison pages are normally committed by pdflush,
so we cannot find the impacted application to kill at all.
3) fsync() users can be caught though. But then the application
has the option to check the return code. If it doesn't do that,
it may well not care. So why kill it?
Think about a multimedia server. Shall we kill the daemon if some IO
page in the movie get corrupted? And a mission critical server?
Obviously the admin will want the right to choose.
> > > Last request: do you have a panic-on-memory-error option?
> > > I think HA systems and ones with properly designed data
> > > integrity at the application layer will much prefer to
> > > halt the system than attempt ad-hoc recovery that does not
> > > always work and might screw things up worse.
> >
> > Good suggestion. We'll consider such an option. But unconditionally
> > panic may be undesirable. For example, a corrupted free page or a
> > clean unmapped file page can be simply isolated - they won't impact
> > anything.
>
> I thought you were worried about introducing races where the
> data can be consumed when doing things such as lock_page and
> wait_on_page_writeback. But if things can definitely be
> discarded with no references or chances of being consumed, yes
> you would not panic for that. But panic for dirty data or
> corrupted kernel memory etc. makes a lot of sense.
OK. We can panic on dirty/writeback pages, and do try_lock to check
for active users :)
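For concreteness, something like this; the sysctl name and the helper are
only illustrative and not part of the posted patches:

int sysctl_memory_failure_panic;	/* illustrative knob, default off */

/*
 * Sketch: panic if the corrupted page may hold unsaved data or is in
 * active use; otherwise it can be quietly isolated.
 */
static void hwpoison_maybe_panic(struct page *p)
{
	if (!sysctl_memory_failure_panic)
		return;

	if (PageDirty(p) || PageWriteback(p))
		panic("hwpoison: corrupted page %#lx has dirty/writeback data",
		      page_to_pfn(p));

	if (!trylock_page(p))		/* somebody is actively using it */
		panic("hwpoison: corrupted page %#lx is busy",
		      page_to_pfn(p));
	unlock_page(p);
}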
Thanks,
Fengguang
On Wed, Jun 10, 2009 at 05:20:11PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 04:59:39PM +0800, Nick Piggin wrote:
> > On Wed, Jun 10, 2009 at 04:38:03PM +0800, Wu Fengguang wrote:
> > > On Wed, Jun 10, 2009 at 12:05:53AM +0800, Hugh Dickins wrote:
> > > > I think a much more sensible approach would be to follow the page
> > > > migration technique of replacing the page's ptes by a special swap-like
> > > > entry, then do the killing from do_swap_page() if a process actually
> > > > tries to access the page.
> > >
> > > We call that "late kill" and will be enabled when
> > > sysctl_memory_failure_early_kill=0. Its default value is 1.
> >
> > What's the use of this? What are the tradeoffs, in what situations
> > should an admin set this sysctl one way or the other?
>
> Good questions.
>
> My understanding is, when an application is generating data A, B, C in
> sequence, and A is found to be corrupted by the kernel. Does it make
> sense for the application to continue generate B and C? Or, are there
> data dependencies between them? With late kill, it becomes more likely
> that the disk contain new versions of B/C and old version of A, so
> will more likely create data inconsistency.
>
> So early kill is more safe.
Hmm, I think that's pretty speculative, and doesn't seem possible for
an admin (or even kernel programmer) to choose the "right" value.
The application equally may not need to touch the data again, so
killing it might cause some inconsistency in whatever it is currently
doing.
On Wed, Jun 10, 2009 at 05:45:26PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 05:18:07PM +0800, Nick Piggin wrote:
> > On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> > > On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > > > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > > > Also I thought a bit about the fsync() error scenario. It's really
> > > > > a problem that can already happen even without hwpoison, e.g.
> > > > > when a page is dropped at the wrong time.
> > > >
> > > > No, the page will never be "dropped" like that except with
> > > > this hwpoison. Errors, sure, might get dropped sometimes
> > > > due to implementation bugs, but this is adding semantics that
> > > > basically break fsync by-design.
> > >
> > > You mean the non persistent EIO is undesirable?
> > >
> > > In the other hand, sticky EIO that can only be explicitly cleared by
> > > user can also be annoying. How about auto clearing the EIO bit when
> > > the last active user closes the file?
> >
> > Well the existing EIO semantics IMO are not great, but that
> > does not have a big bearing on this new situation. What you
>
> Nod.
>
> > are doing is deliberately throwing away the dirty data, and
> > giving EIO back in some cases. (but perhaps not others, a
> > subsequent read or write syscall is not going to get EIO is
> > it? only fsync).
>
> Right, only fsync/msync and close on nfs will report the error.
>
> write() is normally cached, so obviously it cannot report the later IO
> error.
>
> We can make read() IO succeed even if the relevant pages are corrupted
> - they can be isolated transparent to user space readers :-)
But if the page was dirty and you throw out the dirty data,
then next read will give inconsistent data.
> > So even if we did change existing EIO semantics then the
> > memory corruption case of throwing away dirty data is still
> > going to be "different" (wrong, I would say).
>
> Oh well.
Well I just think SIGKILL is the much safer behaviour to
start with (and matches behaviour with mmapped pagecache
and anon), and does not introduce these different semantics.
> > > > I really want to resolve the EIO issue because as I said, it
> > > > is a user-abi issue and too many of those just get shoved
> > > > through only for someone to care about fundamental breakage
> > > > after some years.
> > >
> > > Yup.
> > >
> > > > You say that SIGKILL is overkill for such pages, but in fact
> > > > this is exactly what you do with mapped pages anyway, so why
> > > > not with other pages as well? I think it is perfectly fine to
> > > > do so (and maybe a new error code can be introduced and that
> > > > can be delivered to processes that can handle it rather than
> > > > SIGKILL).
> > >
> > > We can make it a user selectable policy.
> >
> > Really? Does it need to be? Can the admin sanely make that
> > choice?
>
> I just recalled another fact. See below.
>
> > > They are different in that, mapped dirty pages are normally more vital
> > > (data structures etc.) for correct execution, while write() operates
> > > more often on normal data.
> >
> > read and write, remember. That might be somewhat true, but
> > definitely there are exceptions both ways. How do you
> > quantify that or justify it? Just handwaving? Why not make
> > it more consistent overall and just do SIGKILL for everyone?
>
> 1) under read IO hwpoison pages can be hidden to user space
I mean for cases where the recovery cannot be transparent
(ie. error in dirty page).
> 2) under write IO hwpoison pages are normally committed by pdflush,
> so cannot find the impacted application to kill at all.
Correct.
> 3) fsync() users can be caught though. But then the application
> have the option to check its return code. If it doesn't do it,
> it may well don't care. So why kill it?
Well if it does not check, then we cannot find it to kill
it anyway. If it does care (and hence check with fsync),
then we could kill it.
> Think about a multimedia server. Shall we kill the daemon if some IO
> page in the movie get corrupted?
My multimedia server is using mmap for data...
> And a mission critical server?
Mission critical server should be killed too because it
likely does not understand this semantic of throwing out
dirty data page. It should be detected and restarted and
should recover or fail over to another server.
> Obviously the admin will want the right to choose.
I don't know if they are equipped to really know. Do they
know that their application will correctly handle these
semantics of throwing out dirty data? It is potentially
much more dangerous to do this exactly because it can confuse
the case where it matters most (ie. ones that care about
data integrity).
It just seems like killing is far less controversial and
simpler. Start with that and it should do the right thing
for most people anyway. We could discuss possible ways
to recover in another patch if you want to do this
EIO thing.
> > > > Last request: do you have a panic-on-memory-error option?
> > > > I think HA systems and ones with properly designed data
> > > > integrity at the application layer will much prefer to
> > > > halt the system than attempt ad-hoc recovery that does not
> > > > always work and might screw things up worse.
> > >
> > > Good suggestion. We'll consider such an option. But unconditionally
> > > panic may be undesirable. For example, a corrupted free page or a
> > > clean unmapped file page can be simply isolated - they won't impact
> > > anything.
> >
> > I thought you were worried about introducing races where the
> > data can be consumed when doing things such as lock_page and
> > wait_on_page_writeback. But if things can definitely be
> > discarded with no references or chances of being consumed, yes
> > you would not panic for that. But panic for dirty data or
> > corrupted kernel memory etc. makes a lot of sense.
>
> OK. We can panic on dirty/writeback pages, and do try_lock to check
> for active users :)
That would be good. IMO panic should be the safest and sanest
option (admin knows exactly what it is and has very simple and
clear semantics).
On Wed, Jun 10, 2009 at 07:03:05PM +0800, Nick Piggin wrote:
> On Wed, Jun 10, 2009 at 05:20:11PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 10, 2009 at 04:59:39PM +0800, Nick Piggin wrote:
> > > On Wed, Jun 10, 2009 at 04:38:03PM +0800, Wu Fengguang wrote:
> > > > On Wed, Jun 10, 2009 at 12:05:53AM +0800, Hugh Dickins wrote:
> > > > > I think a much more sensible approach would be to follow the page
> > > > > migration technique of replacing the page's ptes by a special swap-like
> > > > > entry, then do the killing from do_swap_page() if a process actually
> > > > > tries to access the page.
> > > >
> > > > We call that "late kill" and will be enabled when
> > > > sysctl_memory_failure_early_kill=0. Its default value is 1.
> > >
> > > What's the use of this? What are the tradeoffs, in what situations
> > > should an admin set this sysctl one way or the other?
> >
> > Good questions.
> >
> > My understanding is, when an application is generating data A, B, C in
> > sequence, and A is found to be corrupted by the kernel. Does it make
> > sense for the application to continue generate B and C? Or, are there
> > data dependencies between them? With late kill, it becomes more likely
> > that the disk contain new versions of B/C and old version of A, so
> > will more likely create data inconsistency.
> >
> > So early kill is more safe.
>
> Hmm, I think that's pretty speculative, and doesn't seem possible for
> an admin (or even kernel programmer) to choose the "right" value.
>
Agreed. It's not easy to choose even if I were the admin myself ;)
> The application equally may not need to touch the data again, so
> killing it might cause some inconsistency in whatever it is currently
> doing.
Yes, early kill can also be evil. What I can do now is to document the
early kill parameter more carefully.
Thanks,
Fengguang
On Wed, Jun 10, 2009 at 07:15:41PM +0800, Nick Piggin wrote:
> On Wed, Jun 10, 2009 at 05:45:26PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 10, 2009 at 05:18:07PM +0800, Nick Piggin wrote:
> > > On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> > > > On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > > > > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > > > > Also I thought a bit about the fsync() error scenario. It's really
> > > > > > a problem that can already happen even without hwpoison, e.g.
> > > > > > when a page is dropped at the wrong time.
> > > > >
> > > > > No, the page will never be "dropped" like that except with
> > > > > this hwpoison. Errors, sure, might get dropped sometimes
> > > > > due to implementation bugs, but this is adding semantics that
> > > > > basically break fsync by-design.
> > > >
> > > > You mean the non persistent EIO is undesirable?
> > > >
> > > > In the other hand, sticky EIO that can only be explicitly cleared by
> > > > user can also be annoying. How about auto clearing the EIO bit when
> > > > the last active user closes the file?
> > >
> > > Well the existing EIO semantics IMO are not great, but that
> > > does not have a big bearing on this new situation. What you
> >
> > Nod.
> >
> > > are doing is deliberately throwing away the dirty data, and
> > > giving EIO back in some cases. (but perhaps not others, a
> > > subsequent read or write syscall is not going to get EIO is
> > > it? only fsync).
> >
> > Right, only fsync/msync and close on nfs will report the error.
> >
> > write() is normally cached, so obviously it cannot report the later IO
> > error.
> >
> > We can make read() IO succeed even if the relevant pages are corrupted
> > - they can be isolated transparent to user space readers :-)
>
> But if the page was dirty and you throw out the dirty data,
> then next read will give inconsistent data.
Yup. That's a big problem - the application won't get any error
feedback here if it doesn't call fsync() to commit IO.
>
> > > So even if we did change existing EIO semantics then the
> > > memory corruption case of throwing away dirty data is still
> > > going to be "different" (wrong, I would say).
> >
> > Oh well.
>
> Well I just think SIGKILL is the much safer behaviour to
> start with (and matches behaviour with mmapped pagecache
> and anon), and does not introduce these different semantics.
So what? SIGKILL any future processes visiting the corrupted file?
Or better to return EIO to them? Either way we'll be maintaining
a consistent AS_EIO_HWPOISON bit.
>
> > > > > I really want to resolve the EIO issue because as I said, it
> > > > > is a user-abi issue and too many of those just get shoved
> > > > > through only for someone to care about fundamental breakage
> > > > > after some years.
> > > >
> > > > Yup.
> > > >
> > > > > You say that SIGKILL is overkill for such pages, but in fact
> > > > > this is exactly what you do with mapped pages anyway, so why
> > > > > not with other pages as well? I think it is perfectly fine to
> > > > > do so (and maybe a new error code can be introduced and that
> > > > > can be delivered to processes that can handle it rather than
> > > > > SIGKILL).
> > > >
> > > > We can make it a user selectable policy.
> > >
> > > Really? Does it need to be? Can the admin sanely make that
> > > choice?
> >
> > I just recalled another fact. See below.
> >
> > > > They are different in that, mapped dirty pages are normally more vital
> > > > (data structures etc.) for correct execution, while write() operates
> > > > more often on normal data.
> > >
> > > read and write, remember. That might be somewhat true, but
> > > definitely there are exceptions both ways. How do you
> > > quantify that or justify it? Just handwaving? Why not make
> > > it more consistent overall and just do SIGKILL for everyone?
> >
> > 1) under read IO hwpoison pages can be hidden to user space
>
> I mean for cases where the recovery cannot be transparent
> (ie. error in dirty page).
OK. That's a good point.
> > 2) under write IO hwpoison pages are normally committed by pdflush,
> > so cannot find the impacted application to kill at all.
>
> Correct.
>
> > 3) fsync() users can be caught though. But then the application
> > have the option to check its return code. If it doesn't do it,
> > it may well don't care. So why kill it?
>
> Well if it does not check, then we cannot find it to kill
> it anyway. If it does care (and hence check with fsync),
> then we could kill it.
If it really cares, it will check EIO after fsync ;)
But yes, if it only moderately cares, it may ignore the return value.
So SIGKILL on fsync() seems to be a good option.
> > Think about a multimedia server. Shall we kill the daemon if some IO
> > page in the movie get corrupted?
>
> My multimedia server is using mmap for data...
>
> > And a mission critical server?
>
> Mission critical server should be killed too because it
> likely does not understand this semantic of throwing out
> dirty data page. It should be detected and restarted and
> should recover or fail over to another server.
Sorry for the confusion. I meant one server may want to survive,
while another may want to be killed (and have its service restarted).
> > Obviously the admin will want the right to choose.
>
> I don't know if they are equipped to really know. Do they
> know that their application will correctly handle these
> semantics of throwing out dirty data? It is potentially
> much more dangerous to do this exactly because it can confuse
> the case where it matters most (ie. ones that care about
> data integrity).
>
> It just seems like killing is far less controversial and
> simpler. Start with that and it should do the right thing
> for most people anyway. We could discuss possible ways
> to recover in another patch if you want to do this
> EIO thing.
OK, we can
- kill fsync() users
- and then return EIO for later read()/write()s
- forget about the EIO condition on last file close()
Do you agree?
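Concretely, that would mean a sticky per-mapping flag along these lines.
AS_EIO_HWPOISON is the bit name floated earlier in the thread; the bit
number and the hook points below are only illustrative:

#define AS_EIO_HWPOISON	3	/* assumed free bit in mapping->flags */

static inline void mapping_set_hwpoison_eio(struct address_space *mapping)
{
	set_bit(AS_EIO_HWPOISON, &mapping->flags);
}

static inline int mapping_hwpoison_eio(struct address_space *mapping)
{
	return test_bit(AS_EIO_HWPOISON, &mapping->flags);
}

/*
 * Hook points (illustrative only):
 * 1. poisoning a dirty pagecache page:  mapping_set_hwpoison_eio(mapping);
 * 2. fsync()/msync():                   SIGKILL the caller and return -EIO;
 * 3. later read()/write() on the file:  return -EIO while the bit is set;
 * 4. last close() of the file:          clear_bit(AS_EIO_HWPOISON, &mapping->flags);
 */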
> > > > > Last request: do you have a panic-on-memory-error option?
> > > > > I think HA systems and ones with properly designed data
> > > > > integrity at the application layer will much prefer to
> > > > > halt the system than attempt ad-hoc recovery that does not
> > > > > always work and might screw things up worse.
> > > >
> > > > Good suggestion. We'll consider such an option. But unconditionally
> > > > panic may be undesirable. For example, a corrupted free page or a
> > > > clean unmapped file page can be simply isolated - they won't impact
> > > > anything.
> > >
> > > I thought you were worried about introducing races where the
> > > data can be consumed when doing things such as lock_page and
> > > wait_on_page_writeback. But if things can definitely be
> > > discarded with no references or chances of being consumed, yes
> > > you would not panic for that. But panic for dirty data or
> > > corrupted kernel memory etc. makes a lot of sense.
> >
> > OK. We can panic on dirty/writeback pages, and do try_lock to check
> > for active users :)
>
> That would be good. IMO panic should be the safest and sanest
> option (admin knows exactly what it is and has very simple and
> clear semantics).
Thanks,
Fengguang
On Wed, Jun 10, 2009 at 08:16:45PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 07:03:05PM +0800, Nick Piggin wrote:
> > The application equally may not need to touch the data again, so
> > killing it might cause some inconsistency in whatever it is currently
> > doing.
>
> Yes, early kill can also be evil. What I can do now is to document the
> early kill parameter more carefully.
That would be good. I would also strongly consider removing
options where possible, to keep things simpler, unless there
is a really compelling reason to have them.
Thanks,
Nick
On Wed, Jun 10, 2009 at 08:36:00PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 07:15:41PM +0800, Nick Piggin wrote:
> > > We can make read() IO succeed even if the relevant pages are corrupted
> > > - they can be isolated transparent to user space readers :-)
> >
> > But if the page was dirty and you throw out the dirty data,
> > then next read will give inconsistent data.
>
> Yup. That's a big problem - the application won't get any error
> feedback here if it doesn't call fsync() to commit IO.
Right.
> > > > So even if we did change existing EIO semantics then the
> > > > memory corruption case of throwing away dirty data is still
> > > > going to be "different" (wrong, I would say).
> > >
> > > Oh well.
> >
> > Well I just think SIGKILL is the much safer behaviour to
> > start with (and matches behaviour with mmapped pagecache
> > and anon), and does not introduce these different semantics.
>
> So what? SIGKILL any future processes visiting the corrupted file?
> Or better to return EIO to them? Either way we'll be maintaining
> a consistent AS_EIO_HWPOISON bit.
If you don't throw the page out of the pagecache, it could
be left in there as a marker to SIGKILL anybody who tries to
access that page. OTOH this might present some other
difficulties regarding suppression of writeback etc. Not
quite sure.
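As a sketch of that marker check: PageHWPoison() is assumed to be the test
for the new poison page bit from this patchkit, and where exactly to hook
it in the read/fault paths is left open:

/* Illustrative check at the point where the page cache hands out a page. */
static int check_hwpoison_marker(struct page *page)
{
	if (unlikely(PageHWPoison(page))) {
		force_sig(SIGKILL, current);	/* kill whoever touches it */
		return -EIO;
	}
	return 0;
}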
Of course the safest mode, IMO, is to panic the kernel in
situations like this (eg. corruption in dirty pagecache). I
would almost like to see that made as the default mode. That
avoids all questions of how exactly to handle these things.
Then if you can subsequently justify what kind of application
or case would work better with a particular behaviour (such
as throw away the data) then we can discuss and merge that.
> > > 1) under read IO hwpoison pages can be hidden to user space
> >
> > I mean for cases where the recovery cannot be transparent
> > (ie. error in dirty page).
>
> OK. That's a good point.
>
> > > 2) under write IO hwpoison pages are normally committed by pdflush,
> > > so cannot find the impacted application to kill at all.
> >
> > Correct.
> >
> > > 3) fsync() users can be caught though. But then the application
> > > have the option to check its return code. If it doesn't do it,
> > > it may well don't care. So why kill it?
> >
> > Well if it does not check, then we cannot find it to kill
> > it anyway. If it does care (and hence check with fsync),
> > then we could kill it.
>
> If it really care, it will check EIO after fsync ;)
> But yes, if it moderately care, it may ignore the return value.
>
> So SIGKILL on fsync() seems to be a good option.
>
> > > Think about a multimedia server. Shall we kill the daemon if some IO
> > > page in the movie get corrupted?
> >
> > My multimedia server is using mmap for data...
> >
> > > And a mission critical server?
> >
> > Mission critical server should be killed too because it
> > likely does not understand this semantic of throwing out
> > dirty data page. It should be detected and restarted and
> > should recover or fail over to another server.
>
> Sorry for the confusion. I meant one server may want to survive,
> while another want to kill (and restart service).
Yes, I just don't think even a really good admin will know
what to choose. At that point we might as well remove the option
and just try to implement something sane...
But maybe you can write some good documentation for it, I will
stand corrected ;)
> > > Obviously the admin will want the right to choose.
> >
> > I don't know if they are equipped to really know. Do they
> > know that their application will correctly handle these
> > semantics of throwing out dirty data? It is potentially
> > much more dangerous to do this exactly because it can confuse
> > the case where it matters most (ie. ones that care about
> > data integrity).
> >
> > It just seems like killing is far less controversial and
> > simpler. Start with that and it should do the right thing
> > for most people anyway. We could discuss possible ways
> > to recover in another patch if you want to do this
> > EIO thing.
>
> OK, we can
> - kill fsync() users
> - and then return EIO for later read()/write()s
> - forget about the EIO condition on last file close()
> Do you agree?
I really don't know ;) Anything I can think could be wrong
for a given situation. panic seems like the best default
option to me.
I don't want to sound like I'm quibbling. I don't actually
care too much what options are implemented so long as each
is justified and documented, and so long as the default is a
sane one.
Thanks,
Nick
On Tue, Jun 09, 2009 at 05:05:53PM +0100, Hugh Dickins wrote:
> To me, it's just another layer of complexity and maintenance burden
> that one special-interest group is imposing upon mm, and I can't
> keep up with it myself.
Thanks for the kind words.
>
> However, if I'm interpreting these extracts correctly, the whole
> thing looks very misguided to me. Are we really going to kill any
> process that has a cousin who might once have mapped the page that's
> been found hwpoisonous? The hwpoison secret police are dangerously
> out of control, I'd say.
What do you mean by "once"? It's a "not yet", afaik.
The "not yet" was intentional for early kill mode -- the main reason
for that is KVM guests, where it should mimic the hardware behaviour
of reporting a future memory corruption, so that the guest
takes steps to never access it. So even if the access
to the bad page is in the future, as long as the process
has theoretical access it should be killed.
In late kill mode that's different of course.
>
> The usual use of rmap lookup loops is to go on to look into the page
> table to see whether the page is actually mapped: I see no attempt
> at that here, just an assumption that anyone on the list is guilty
> of mapping the page and must be killed. And even if it did go on
Yes that's intentional.
>
> At least in the file's prio_tree case, we'll only be killing those
> who mmapped the range which happens to include the page. But in the
> anon case, remember the anon_vma is just a bundle of "related" vmas
> outside of which the page will not be found; so if one process got a
> poisonous page through COW, all the other processes which happen to
> be sharing that anon_vma through fork or through adjacent merging,
> are going to get killed too.
You're right, the COW case is a bit of a problem; we don't distinguish
that. Perhaps that can be checked easily, but even if we kill
a bit too much it's still better than killing too little. I don't think it's
as big a problem as you claim.
> I think a much more sensible approach would be to follow the page
> migration technique of replacing the page's ptes by a special swap-like
> entry, then do the killing from do_swap_page() if a process actually
> tries to access the page.
That's what late kill mode does (see the patch description/comment at the
top of the file), but it doesn't have the right semantics for KVM.
It's still used for a few cases by default, e.g. for the swap cache.
-Andi
--
[email protected] -- Speaking for myself only.