2009-12-08 21:16:14

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [0/31] HWPOISON 2.6.33 pre-merge posting


These are the hwpoison updates for 2.6.33
I plan to send the following patchkit to Linus in a few days.
Any additional review would be appreciated.

Major new features:
- Be more aggressive at flushing caches to get access to a page
- Various fixes for the core memory_failure path
- Handle free memory better by detecting higher-order buddy pages
reliably too.
- Reliable return value for memory_failure. This allows to implement
some other functionality later on.
- New soft offlining feature:
Offline a page without killing a process.
This allows to implement predictive failure analysis for memory, by
watching error trends per page and offlining a page that has too many
corrected errors. The policy is all in user space; the kernel just
offlines the page and reports the errors.
The current git mcelog has support for using this interface.
- Provide a new sysfs interface for both hard and soft offlining.
The existing debugfs interface is still there.
- unpoison support
Unpoison a page. This is mainly for testing, it does not do unpoisioning
on the hardware level.
- hwpoison filter
Various filters to the hwpoison PFN error injection, including
memcg, page type, block device and others.
This is used by the mce-test stress suite to protect the test suite itself
and

This touches some code outside hwpoison, mostly for the memcg support
and for the page types. All these changes are straight-forward,
are in linux-next and have been posted before.

-Andi


2009-12-08 21:21:43

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [1/31] HWPOISON: Add Andi Kleen as hwpoison maintainer to MAINTAINERS


Signed-off-by: Andi Kleen <[email protected]>

---
MAINTAINERS | 9 +++++++++
1 file changed, 9 insertions(+)

Index: linux/MAINTAINERS
===================================================================
--- linux.orig/MAINTAINERS
+++ linux/MAINTAINERS
@@ -2322,6 +2322,15 @@ W: http://www.kernel.org/pub/linux/kerne
S: Maintained
F: drivers/hwmon/hdaps.c

+HWPOISON MEMORY FAILURE HANDLING
+M: Andi Kleen <[email protected]>
+L: [email protected]
+L: [email protected]
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison
+S: Maintained
+F: mm/memory-failure.c
+F: mm/hwpoison-inject.c
+
HYPERVISOR VIRTUAL CONSOLE DRIVER
L: [email protected]
S: Odd Fixes

2009-12-08 21:22:12

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [2/31] HWPOISON: Be more aggressive at freeing non LRU caches


shake_page handles more types of page caches than lru_drain_all()

- per cpu page allocator pages
- per CPU LRU

Stops early when the page became free.

Used in followon patches.

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/mm.h | 1 +
mm/memory-failure.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -83,6 +83,28 @@ static int kill_proc_ao(struct task_stru
}

/*
+ * When a unknown page type is encountered drain as many buffers as possible
+ * in the hope to turn the page into a LRU or free page, which we can handle.
+ */
+void shake_page(struct page *p)
+{
+ if (!PageSlab(p)) {
+ lru_add_drain_all();
+ if (PageLRU(p))
+ return;
+ drain_all_pages();
+ if (PageLRU(p) || is_free_buddy_page(p))
+ return;
+ }
+ /*
+ * Could call shrink_slab here (which would also
+ * shrink other caches). Unfortunately that might
+ * also access the corrupted page, which could be fatal.
+ */
+}
+EXPORT_SYMBOL_GPL(shake_page);
+
+/*
* Kill all processes that have a poisoned page mapped and then isolate
* the page.
*
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1320,6 +1320,7 @@ extern void memory_failure(unsigned long
extern int __memory_failure(unsigned long pfn, int trapno, int ref);
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
+extern void shake_page(struct page *p);
extern atomic_long_t mce_bad_pages;

#endif /* __KERNEL__ */

2009-12-08 21:16:42

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [3/31] page-types: add standard GPL license header


From: Wu Fengguang <[email protected]>

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/vm/page-types.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux/Documentation/vm/page-types.c
===================================================================
--- linux.orig/Documentation/vm/page-types.c
+++ linux/Documentation/vm/page-types.c
@@ -1,11 +1,22 @@
/*
* page-types: Tool for querying page flags
*
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; version 2.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should find a copy of v2 of the GNU General Public License somewhere on
+ * your Linux system; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
* Copyright (C) 2009 Intel corporation
*
* Authors: Wu Fengguang <[email protected]>
- *
- * Released under the General Public License (GPL).
*/

#define _LARGEFILE64_SOURCE

2009-12-08 21:16:22

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [4/31] HWPOISON: remove the anonymous entry


From: Wu Fengguang <[email protected]>

(PG_swapbacked && !PG_lru) pages should not happen.
Better to treat them as unknown pages.

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 1 -
1 file changed, 1 deletion(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -611,7 +611,6 @@ static struct page_state {

{ lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
{ lru|dirty, lru, "clean LRU", me_pagecache_clean },
- { swapbacked, swapbacked, "anonymous", me_pagecache_clean },

/*
* Catchall entry: must be at end.

2009-12-08 21:16:25

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [5/31] HWPOISON: return ENXIO on invalid page number


From: Wu Fengguang <[email protected]>

Use a different errno than the usual EIO for invalid page numbers.
This is mainly for better reporting for the injector.

This also avoids calling action_result() with invalid pfn.

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -620,13 +620,11 @@ static struct page_state {

static void action_result(unsigned long pfn, char *msg, int result)
{
- struct page *page = NULL;
- if (pfn_valid(pfn))
- page = pfn_to_page(pfn);
+ struct page *page = pfn_to_page(pfn);

printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
pfn,
- page && PageDirty(page) ? "dirty " : "",
+ PageDirty(page) ? "dirty " : "",
msg, action_name[result]);
}

@@ -752,8 +750,10 @@ int __memory_failure(unsigned long pfn,
panic("Memory failure from trap %d on page %lx", trapno, pfn);

if (!pfn_valid(pfn)) {
- action_result(pfn, "memory outside kernel control", IGNORED);
- return -EIO;
+ printk(KERN_ERR
+ "MCE %#lx: memory outside kernel control\n",
+ pfn);
+ return -ENXIO;
}

p = pfn_to_page(pfn);

2009-12-08 21:20:46

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [6/31] HWPOISON: avoid grabbing the page count multiple times during madvise injection


From: Wu Fengguang <[email protected]>

If page is double referenced in madvise_hwpoison() and __memory_failure(),
remove_mapping() will fail because it expects page_count=2. Fix it by
not grabbing extra page count in __memory_failure().

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/madvise.c | 1 -
mm/memory-failure.c | 8 ++++----
2 files changed, 4 insertions(+), 5 deletions(-)

Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c
+++ linux/mm/madvise.c
@@ -238,7 +238,6 @@ static int madvise_hwpoison(unsigned lon
page_to_pfn(p), start);
/* Ignore return value for now */
__memory_failure(page_to_pfn(p), 0, 1);
- put_page(p);
}
return ret;
}
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -629,7 +629,7 @@ static void action_result(unsigned long
}

static int page_action(struct page_state *ps, struct page *p,
- unsigned long pfn, int ref)
+ unsigned long pfn)
{
int result;
int count;
@@ -637,7 +637,7 @@ static int page_action(struct page_state
result = ps->action(p, pfn);
action_result(pfn, ps->msg, result);

- count = page_count(p) - 1 - ref;
+ count = page_count(p) - 1;
if (count != 0)
printk(KERN_ERR
"MCE %#lx: %s page still referenced by %d users\n",
@@ -775,7 +775,7 @@ int __memory_failure(unsigned long pfn,
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
- if (!get_page_unless_zero(compound_head(p))) {
+ if (!ref && !get_page_unless_zero(compound_head(p))) {
action_result(pfn, "free or high order kernel", IGNORED);
return PageBuddy(compound_head(p)) ? 0 : -EBUSY;
}
@@ -823,7 +823,7 @@ int __memory_failure(unsigned long pfn,
res = -EBUSY;
for (ps = error_states;; ps++) {
if (((p->flags | lru_flag)& ps->mask) == ps->res) {
- res = page_action(ps, p, pfn, ref);
+ res = page_action(ps, p, pfn);
break;
}
}

2009-12-08 21:16:34

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [7/31] HWPOISON: Turn ref argument into flags argument


Now that "ref" is just a boolean turn it into
a flags argument. First step is only a single flag
that makes the code's intention more clear, but more
may follow.

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/mm.h | 5 ++++-
mm/madvise.c | 2 +-
mm/memory-failure.c | 5 +++--
3 files changed, 8 insertions(+), 4 deletions(-)

Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1316,8 +1316,11 @@ extern int account_locked_memory(struct
size_t size);
extern void refund_locked_memory(struct mm_struct *mm, size_t size);

+enum mf_flags {
+ MF_COUNT_INCREASED = 1 << 0,
+};
extern void memory_failure(unsigned long pfn, int trapno);
-extern int __memory_failure(unsigned long pfn, int trapno, int ref);
+extern int __memory_failure(unsigned long pfn, int trapno, int flags);
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p);
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -739,7 +739,7 @@ static void hwpoison_user_mappings(struc
ret != SWAP_SUCCESS, pfn);
}

-int __memory_failure(unsigned long pfn, int trapno, int ref)
+int __memory_failure(unsigned long pfn, int trapno, int flags)
{
unsigned long lru_flag;
struct page_state *ps;
@@ -775,7 +775,8 @@ int __memory_failure(unsigned long pfn,
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
*/
- if (!ref && !get_page_unless_zero(compound_head(p))) {
+ if (!(flags & MF_COUNT_INCREASED) &&
+ !get_page_unless_zero(compound_head(p))) {
action_result(pfn, "free or high order kernel", IGNORED);
return PageBuddy(compound_head(p)) ? 0 : -EBUSY;
}
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c
+++ linux/mm/madvise.c
@@ -237,7 +237,7 @@ static int madvise_hwpoison(unsigned lon
printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
page_to_pfn(p), start);
/* Ignore return value for now */
- __memory_failure(page_to_pfn(p), 0, 1);
+ __memory_failure(page_to_pfn(p), 0, MF_COUNT_INCREASED);
}
return ret;
}

2009-12-08 21:22:00

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [8/31] HWPOISON: abort on failed unmap


From: Wu Fengguang <[email protected]>

Don't try to isolate a still mapped page. Otherwise we will hit the
BUG_ON(page_mapped(page)) in __remove_from_page_cache().

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -657,7 +657,7 @@ static int page_action(struct page_state
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
*/
-static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
int trapno)
{
enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
@@ -667,15 +667,18 @@ static void hwpoison_user_mappings(struc
int i;
int kill = 1;

- if (PageReserved(p) || PageCompound(p) || PageSlab(p) || PageKsm(p))
- return;
+ if (PageReserved(p) || PageSlab(p))
+ return SWAP_SUCCESS;

/*
* This check implies we don't kill processes if their pages
* are in the swap cache early. Those are always late kills.
*/
if (!page_mapped(p))
- return;
+ return SWAP_SUCCESS;
+
+ if (PageCompound(p) || PageKsm(p))
+ return SWAP_FAIL;

if (PageSwapCache(p)) {
printk(KERN_ERR
@@ -737,6 +740,8 @@ static void hwpoison_user_mappings(struc
*/
kill_procs_ao(&tokill, !!PageDirty(p), trapno,
ret != SWAP_SUCCESS, pfn);
+
+ return ret;
}

int __memory_failure(unsigned long pfn, int trapno, int flags)
@@ -809,8 +814,13 @@ int __memory_failure(unsigned long pfn,

/*
* Now take care of user space mappings.
+ * Abort on fail: __remove_from_page_cache() assumes unmapped page.
*/
- hwpoison_user_mappings(p, pfn, trapno);
+ if (hwpoison_user_mappings(p, pfn, trapno) != SWAP_SUCCESS) {
+ printk(KERN_ERR "MCE %#lx: cannot unmap page, give up\n", pfn);
+ res = -EBUSY;
+ goto out;
+ }

/*
* Torn down by someone else?

2009-12-08 21:21:51

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [9/31] HWPOISON: comment the possible set_page_dirty() race


From: Wu Fengguang <[email protected]>

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -689,6 +689,8 @@ static int hwpoison_user_mappings(struct
/*
* Propagate the dirty bit from PTEs to struct page first, because we
* need this to decide if we should kill or just drop the page.
+ * XXX: the dirty test could be racy: set_page_dirty() may not always
+ * be called inside page lock (it's recommended but not enforced).
*/
mapping = page_mapping(p);
if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {

2009-12-08 21:20:13

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [10/31] HWPOISON: comment dirty swapcache pages


From: Wu Fengguang <[email protected]>

AK: Improve comment

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -2540,6 +2540,10 @@ static int do_swap_page(struct mm_struct
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
} else if (PageHWPoison(page)) {
+ /*
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
+ */
ret = VM_FAULT_HWPOISON;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto out_release;

2009-12-08 21:19:00

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [11/31] HWPOISON: introduce delete_from_lru_cache()


From: Wu Fengguang <[email protected]>

Introduce delete_from_lru_cache() to
- clear PG_active, PG_unevictable to avoid complains at unpoison time
- move the isolate_lru_page() call back to the handlers instead of the
entrance of __memory_failure(), this is more hwpoison filter friendly

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 45 +++++++++++++++++++++++++++++++++++++--------
1 file changed, 37 insertions(+), 8 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -350,6 +350,30 @@ static const char *action_name[] = {
};

/*
+ * XXX: It is possible that a page is isolated from LRU cache,
+ * and then kept in swap cache or failed to remove from page cache.
+ * The page count will stop it from being freed by unpoison.
+ * Stress tests should be aware of this memory leak problem.
+ */
+static int delete_from_lru_cache(struct page *p)
+{
+ if (!isolate_lru_page(p)) {
+ /*
+ * Clear sensible page flags, so that the buddy system won't
+ * complain when the page is unpoison-and-freed.
+ */
+ ClearPageActive(p);
+ ClearPageUnevictable(p);
+ /*
+ * drop the page count elevated by isolate_lru_page()
+ */
+ page_cache_release(p);
+ return 0;
+ }
+ return -EIO;
+}
+
+/*
* Error hit kernel page.
* Do nothing, try to be lucky and not touch this instead. For a few cases we
* could be more sophisticated.
@@ -393,6 +417,8 @@ static int me_pagecache_clean(struct pag
int ret = FAILED;
struct address_space *mapping;

+ delete_from_lru_cache(p);
+
/*
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
@@ -522,14 +548,20 @@ static int me_swapcache_dirty(struct pag
/* Trigger EIO in shmem: */
ClearPageUptodate(p);

- return DELAYED;
+ if (!delete_from_lru_cache(p))
+ return DELAYED;
+ else
+ return FAILED;
}

static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
delete_from_swap_cache(p);

- return RECOVERED;
+ if (!delete_from_lru_cache(p))
+ return RECOVERED;
+ else
+ return FAILED;
}

/*
@@ -748,7 +780,6 @@ static int hwpoison_user_mappings(struct

int __memory_failure(unsigned long pfn, int trapno, int flags)
{
- unsigned long lru_flag;
struct page_state *ps;
struct page *p;
int res;
@@ -798,13 +829,11 @@ int __memory_failure(unsigned long pfn,
*/
if (!PageLRU(p))
lru_add_drain_all();
- lru_flag = p->flags & lru;
- if (isolate_lru_page(p)) {
+ if (!PageLRU(p)) {
action_result(pfn, "non LRU", IGNORED);
put_page(p);
return -EBUSY;
}
- page_cache_release(p);

/*
* Lock the page and wait for writeback to finish.
@@ -827,7 +856,7 @@ int __memory_failure(unsigned long pfn,
/*
* Torn down by someone else?
*/
- if ((lru_flag & lru) && !PageSwapCache(p) && p->mapping == NULL) {
+ if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, "already truncated LRU", IGNORED);
res = 0;
goto out;
@@ -835,7 +864,7 @@ int __memory_failure(unsigned long pfn,

res = -EBUSY;
for (ps = error_states;; ps++) {
- if (((p->flags | lru_flag)& ps->mask) == ps->res) {
+ if ((p->flags & ps->mask) == ps->res) {
res = page_action(ps, p, pfn);
break;
}

2009-12-08 21:21:07

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [12/31] HWPOISON: remove the free buddy page handler


From: Wu Fengguang <[email protected]>

The buddy page has already be handled in the very beginning.
So remove redundant code.

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -401,14 +401,6 @@ static int me_unknown(struct page *p, un
}

/*
- * Free memory
- */
-static int me_free(struct page *p, unsigned long pfn)
-{
- return DELAYED;
-}
-
-/*
* Clean (or cleaned) page cache page.
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
@@ -604,7 +596,6 @@ static int me_huge_page(struct page *p,
#define tail (1UL << PG_tail)
#define compound (1UL << PG_compound)
#define slab (1UL << PG_slab)
-#define buddy (1UL << PG_buddy)
#define reserved (1UL << PG_reserved)

static struct page_state {
@@ -614,7 +605,10 @@ static struct page_state {
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, "reserved kernel", me_ignore },
- { buddy, buddy, "free kernel", me_free },
+ /*
+ * free pages are specially detected outside this table:
+ * PG_buddy pages only make a small fraction of all free pages.
+ */

/*
* Could in theory check if slab page is free or if we can drop

2009-12-08 21:20:56

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [13/31] HWPOISON: detect free buddy pages explicitly


From: Wu Fengguang <[email protected]>

Most free pages in the buddy system have no PG_buddy set.
Introduce is_free_buddy_page() for detecting them reliably.

CC: Nick Piggin <[email protected]>
CC: Mel Gorman <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/internal.h | 3 +++
mm/memory-failure.c | 9 +++++++--
mm/page_alloc.c | 21 +++++++++++++++++++++
3 files changed, 31 insertions(+), 2 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -809,8 +809,13 @@ int __memory_failure(unsigned long pfn,
*/
if (!(flags & MF_COUNT_INCREASED) &&
!get_page_unless_zero(compound_head(p))) {
- action_result(pfn, "free or high order kernel", IGNORED);
- return PageBuddy(compound_head(p)) ? 0 : -EBUSY;
+ if (is_free_buddy_page(p)) {
+ action_result(pfn, "free buddy", DELAYED);
+ return 0;
+ } else {
+ action_result(pfn, "high order kernel", IGNORED);
+ return -EBUSY;
+ }
}

/*
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -50,6 +50,9 @@ extern void putback_lru_page(struct page
*/
extern void __free_pages_bootmem(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned long order);
+#ifdef CONFIG_MEMORY_FAILURE
+extern bool is_free_buddy_page(struct page *page);
+#endif


/*
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -5085,3 +5085,24 @@ __offline_isolated_pages(unsigned long s
spin_unlock_irqrestore(&zone->lock, flags);
}
#endif
+
+#ifdef CONFIG_MEMORY_FAILURE
+bool is_free_buddy_page(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long flags;
+ int order;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < MAX_ORDER; order++) {
+ struct page *page_head = page - (pfn & ((1 << order) - 1));
+
+ if (PageBuddy(page_head) && page_order(page_head) >= order)
+ break;
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ return order < MAX_ORDER;
+}
+#endif

2009-12-08 21:21:21

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [14/31] HWPOISON: Add unpoisoning support


From: Wu Fengguang <[email protected]>

The unpoisoning interface is useful for stress testing tools to
reclaim poisoned pages (to prevent OOM)

There is no hardware level unpoisioning, so this
cannot be used for real memory errors, only for software injected errors.

Note that it may leak pages silently - those who have been removed from
LRU cache, but not isolated from page cache/swap cache at hwpoison time.
Especially the stress test of dirty swap cache pages shall reboot system
before exhausting memory.

AK: Fix comments, add documentation, add printks, rename symbol

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/vm/hwpoison.txt | 16 ++++++++-
include/linux/mm.h | 1
include/linux/page-flags.h | 2 -
mm/hwpoison-inject.c | 36 ++++++++++++++++++----
mm/memory-failure.c | 68 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 114 insertions(+), 9 deletions(-)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -4,7 +4,7 @@
#include <linux/kernel.h>
#include <linux/mm.h>

-static struct dentry *hwpoison_dir, *corrupt_pfn;
+static struct dentry *hwpoison_dir;

static int hwpoison_inject(void *data, u64 val)
{
@@ -14,7 +14,16 @@ static int hwpoison_inject(void *data, u
return __memory_failure(val, 18, 0);
}

+static int hwpoison_unpoison(void *data, u64 val)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ return unpoison_memory(val);
+}
+
DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+DEFINE_SIMPLE_ATTRIBUTE(unpoison_fops, NULL, hwpoison_unpoison, "%lli\n");

static void pfn_inject_exit(void)
{
@@ -24,16 +33,31 @@ static void pfn_inject_exit(void)

static int pfn_inject_init(void)
{
+ struct dentry *dentry;
+
hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
if (hwpoison_dir == NULL)
return -ENOMEM;
- corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+
+ /*
+ * Note that the below poison/unpoison interfaces do not involve
+ * hardware status change, hence do not require hardware support.
+ * They are mainly for testing hwpoison in software level.
+ */
+ dentry = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
NULL, &hwpoison_fops);
- if (corrupt_pfn == NULL) {
- pfn_inject_exit();
- return -ENOMEM;
- }
+ if (!dentry)
+ goto fail;
+
+ dentry = debugfs_create_file("unpoison-pfn", 0600, hwpoison_dir,
+ NULL, &unpoison_fops);
+ if (!dentry)
+ goto fail;
+
return 0;
+fail:
+ pfn_inject_exit();
+ return -ENOMEM;
}

module_init(pfn_inject_init);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1321,6 +1321,7 @@ enum mf_flags {
};
extern void memory_failure(unsigned long pfn, int trapno);
extern int __memory_failure(unsigned long pfn, int trapno, int flags);
+extern int unpoison_memory(unsigned long pfn);
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p);
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -840,6 +840,16 @@ int __memory_failure(unsigned long pfn,
* and in many cases impossible, so we just avoid it here.
*/
lock_page_nosync(p);
+
+ /*
+ * unpoison always clear PG_hwpoison inside page lock
+ */
+ if (!PageHWPoison(p)) {
+ action_result(pfn, "unpoisoned", IGNORED);
+ res = 0;
+ goto out;
+ }
+
wait_on_page_writeback(p);

/*
@@ -895,3 +905,61 @@ void memory_failure(unsigned long pfn, i
{
__memory_failure(pfn, trapno, 0);
}
+
+/**
+ * unpoison_memory - Unpoison a previously poisoned page
+ * @pfn: Page number of the to be unpoisoned page
+ *
+ * Software-unpoison a page that has been poisoned by
+ * memory_failure() earlier.
+ *
+ * This is only done on the software-level, so it only works
+ * for linux injected failures, not real hardware failures
+ *
+ * Returns 0 for success, otherwise -errno.
+ */
+int unpoison_memory(unsigned long pfn)
+{
+ struct page *page;
+ struct page *p;
+ int freeit = 0;
+
+ if (!pfn_valid(pfn))
+ return -ENXIO;
+
+ p = pfn_to_page(pfn);
+ page = compound_head(p);
+
+ if (!PageHWPoison(p)) {
+ pr_debug("MCE: Page was already unpoisoned %#lx\n", pfn);
+ return 0;
+ }
+
+ if (!get_page_unless_zero(page)) {
+ if (TestClearPageHWPoison(p))
+ atomic_long_dec(&mce_bad_pages);
+ pr_debug("MCE: Software-unpoisoned free page %#lx\n", pfn);
+ return 0;
+ }
+
+ lock_page_nosync(page);
+ /*
+ * This test is racy because PG_hwpoison is set outside of page lock.
+ * That's acceptable because that won't trigger kernel panic. Instead,
+ * the PG_hwpoison page will be caught and isolated on the entrance to
+ * the free buddy page pool.
+ */
+ if (TestClearPageHWPoison(p)) {
+ pr_debug("MCE: Software-unpoisoned page %#lx\n", pfn);
+ atomic_long_dec(&mce_bad_pages);
+ freeit = 1;
+ }
+ unlock_page(page);
+
+ put_page(page);
+ if (freeit)
+ put_page(page);
+
+ return 0;
+}
+EXPORT_SYMBOL(unpoison_memory);
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -277,7 +277,7 @@ PAGEFLAG_FALSE(Uncached)

#ifdef CONFIG_MEMORY_FAILURE
PAGEFLAG(HWPoison, hwpoison)
-TESTSETFLAG(HWPoison, hwpoison)
+TESTSCFLAG(HWPoison, hwpoison)
#define __PG_HWPOISON (1UL << PG_hwpoison)
#else
PAGEFLAG_FALSE(HWPoison)
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -98,10 +98,22 @@ madvise(MADV_POISON, ....)


hwpoison-inject module through debugfs
- /sys/debug/hwpoison/corrupt-pfn

-Inject hwpoison fault at PFN echoed into this file
+/sys/debug/hwpoison/

+corrupt-pfn
+
+Inject hwpoison fault at PFN echoed into this file.
+
+unpoison-pfn
+
+Software-unpoison page at PFN echoed into this file. This
+way a page can be reused again.
+This only works for Linux injected failures, not for real
+memory failures.
+
+Note these injection interfaces are not stable and might change between
+kernel versions

Architecture specific MCE injector

2009-12-08 21:17:01

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [15/31] HWPOISON: make semantics of IGNORED/DELAYED clear


From: Wu Fengguang <[email protected]>

Change semantics for
- IGNORED: not handled; it may well be _unsafe_
- DELAYED: to be handled later; it is _safe_

With this change,
- IGNORED/FAILED mean (maybe) Error
- DELAYED/RECOVERED mean Success

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -336,16 +336,16 @@ static void collect_procs(struct page *p
*/

enum outcome {
- FAILED, /* Error handling failed */
+ IGNORED, /* Error: cannot be handled */
+ FAILED, /* Error: handling failed */
DELAYED, /* Will be handled later */
- IGNORED, /* Error safely ignored */
RECOVERED, /* Successfully recovered */
};

static const char *action_name[] = {
+ [IGNORED] = "Ignored",
[FAILED] = "Failed",
[DELAYED] = "Delayed",
- [IGNORED] = "Ignored",
[RECOVERED] = "Recovered",
};

@@ -380,14 +380,6 @@ static int delete_from_lru_cache(struct
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
- return DELAYED;
-}
-
-/*
- * Already poisoned page.
- */
-static int me_ignore(struct page *p, unsigned long pfn)
-{
return IGNORED;
}

@@ -604,7 +596,7 @@ static struct page_state {
char *msg;
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
- { reserved, reserved, "reserved kernel", me_ignore },
+ { reserved, reserved, "reserved kernel", me_kernel },
/*
* free pages are specially detected outside this table:
* PG_buddy pages only make a small fraction of all free pages.
@@ -790,7 +782,7 @@ int __memory_failure(unsigned long pfn,

p = pfn_to_page(pfn);
if (TestSetPageHWPoison(p)) {
- action_result(pfn, "already hardware poisoned", IGNORED);
+ printk(KERN_ERR "MCE %#lx: already hardware poisoned\n", pfn);
return 0;
}

@@ -845,7 +837,7 @@ int __memory_failure(unsigned long pfn,
* unpoison always clear PG_hwpoison inside page lock
*/
if (!PageHWPoison(p)) {
- action_result(pfn, "unpoisoned", IGNORED);
+ printk(KERN_ERR "MCE %#lx: just unpoisoned\n", pfn);
res = 0;
goto out;
}
@@ -867,7 +859,7 @@ int __memory_failure(unsigned long pfn,
*/
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, "already truncated LRU", IGNORED);
- res = 0;
+ res = -EBUSY;
goto out;
}

2009-12-08 21:16:51

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [16/31] HWPOISON: return 0 to indicate success reliably


From: Wu Fengguang <[email protected]>

Return 0 to indicate success, when
- action result is RECOVERED or DELAYED
- no extra page reference

Note that dirty swapcache pages are kept in swapcache, so can have one
more reference count.

Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -656,17 +656,21 @@ static int page_action(struct page_state
action_result(pfn, ps->msg, result);

count = page_count(p) - 1;
- if (count != 0)
+ if (ps->action == me_swapcache_dirty && result == DELAYED)
+ count--;
+ if (count != 0) {
printk(KERN_ERR
"MCE %#lx: %s page still referenced by %d users\n",
pfn, ps->msg, count);
+ result = FAILED;
+ }

/* Could do more checks here if page looks ok */
/*
* Could adjust zone counters here to correct for the missing page.
*/

- return result == RECOVERED ? 0 : -EBUSY;
+ return (result == RECOVERED || result == DELAYED) ? 0 : -EBUSY;
}

#define N_UNMAP_TRIES 5

2009-12-08 21:17:08

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [17/31] HWPOISON: add fs/device filters


From: Wu Fengguang <[email protected]>

Filesystem data/metadata present the most tricky-to-isolate pages.
It requires careful code review and stress testing to get them right.

The fs/device filter helps to target the stress tests to some specific
filesystem pages. The filter condition is block device's major/minor
numbers:
- corrupt-filter-dev-major
- corrupt-filter-dev-minor
When specified (non -1), only page cache pages that belong to that
device will be poisoned.

The filters are checked reliably on the locked and refcounted page.

Haicheng: clear PG_hwpoison and drop bad page count if filter not OK
AK: Add documentation

CC: Haicheng Li <[email protected]>
CC: Nick Piggin <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/vm/hwpoison.txt | 7 +++++
mm/hwpoison-inject.c | 11 +++++++++
mm/internal.h | 3 ++
mm/memory-failure.c | 51 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 72 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -48,6 +48,50 @@ int sysctl_memory_failure_recovery __rea

atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);

+u32 hwpoison_filter_dev_major = ~0U;
+u32 hwpoison_filter_dev_minor = ~0U;
+EXPORT_SYMBOL_GPL(hwpoison_filter_dev_major);
+EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
+
+static int hwpoison_filter_dev(struct page *p)
+{
+ struct address_space *mapping;
+ dev_t dev;
+
+ if (hwpoison_filter_dev_major == ~0U &&
+ hwpoison_filter_dev_minor == ~0U)
+ return 0;
+
+ /*
+ * page_mapping() does not accept slab page
+ */
+ if (PageSlab(p))
+ return -EINVAL;
+
+ mapping = page_mapping(p);
+ if (mapping == NULL || mapping->host == NULL)
+ return -EINVAL;
+
+ dev = mapping->host->i_sb->s_dev;
+ if (hwpoison_filter_dev_major != ~0U &&
+ hwpoison_filter_dev_major != MAJOR(dev))
+ return -EINVAL;
+ if (hwpoison_filter_dev_minor != ~0U &&
+ hwpoison_filter_dev_minor != MINOR(dev))
+ return -EINVAL;
+
+ return 0;
+}
+
+int hwpoison_filter(struct page *p)
+{
+ if (hwpoison_filter_dev(p))
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(hwpoison_filter);
+
/*
* Send all the processes who have the page mapped an ``action optional''
* signal.
@@ -845,6 +889,13 @@ int __memory_failure(unsigned long pfn,
res = 0;
goto out;
}
+ if (hwpoison_filter(p)) {
+ if (TestClearPageHWPoison(p))
+ atomic_long_dec(&mce_bad_pages);
+ unlock_page(p);
+ put_page(p);
+ return 0;
+ }

wait_on_page_writeback(p);

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -3,6 +3,7 @@
#include <linux/debugfs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include "internal.h"

static struct dentry *hwpoison_dir;

@@ -54,6 +55,16 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

+ dentry = debugfs_create_u32("corrupt-filter-dev-major", 0600,
+ hwpoison_dir, &hwpoison_filter_dev_major);
+ if (!dentry)
+ goto fail;
+
+ dentry = debugfs_create_u32("corrupt-filter-dev-minor", 0600,
+ hwpoison_dir, &hwpoison_filter_dev_minor);
+ if (!dentry)
+ goto fail;
+
return 0;
fail:
pfn_inject_exit();
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -263,3 +263,6 @@ int __get_user_pages(struct task_struct
#define ZONE_RECLAIM_SOME 0
#define ZONE_RECLAIM_SUCCESS 1
#endif
+
+extern u32 hwpoison_filter_dev_major;
+extern u32 hwpoison_filter_dev_minor;
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -115,6 +115,13 @@ memory failures.
Note these injection interfaces are not stable and might change between
kernel versions

+corrupt-filter-dev-major
+corrupt-filter-dev-minor
+
+Only handle memory failures to pages associated with the file system defined
+by block device major/minor. -1U is the wildcard value.
+This should be only used for testing with artificial injection.
+
Architecture specific MCE injector

x86 has mce-inject, mce-test

2009-12-08 21:20:03

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [18/31] HWPOISON: limit hwpoison injector to known page types


From: Wu Fengguang <[email protected]>

__memory_failure()'s workflow is

set PG_hwpoison
//...
unset PG_hwpoison if didn't pass hwpoison filter

That could kill unrelated process if it happens to page fault on the
page with the (temporary) PG_hwpoison. The race should be big enough to
appear in stress tests.

Fix it by grabbing the page and checking filter at inject time. This
also avoids the very noisy "Injecting memory failure..." messages.

- we don't touch madvise() based injection, because the filters are
generally not necessary for it.
- if we want to apply the filters to h/w aided injection, we'd better to
rearrange the logic in __memory_failure() instead of this patch.

AK: fix documentation, use drain all, cleanups

CC: Haicheng Li <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/vm/hwpoison.txt | 3 ++-
mm/hwpoison-inject.c | 41 +++++++++++++++++++++++++++++++++++++++--
mm/internal.h | 2 ++
3 files changed, 43 insertions(+), 3 deletions(-)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -3,16 +3,53 @@
#include <linux/debugfs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
#include "internal.h"

static struct dentry *hwpoison_dir;

static int hwpoison_inject(void *data, u64 val)
{
+ unsigned long pfn = val;
+ struct page *p;
+ int err;
+
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
- return __memory_failure(val, 18, 0);
+
+ if (!pfn_valid(pfn))
+ return -ENXIO;
+
+ p = pfn_to_page(pfn);
+ /*
+ * This implies unable to support free buddy pages.
+ */
+ if (!get_page_unless_zero(p))
+ return 0;
+
+ if (!PageLRU(p))
+ shake_page(p);
+ /*
+ * This implies unable to support non-LRU pages.
+ */
+ if (!PageLRU(p))
+ return 0;
+
+ /*
+ * do a racy check with elevated page count, to make sure PG_hwpoison
+ * will only be set for the targeted owner (or on a free page).
+ * We temporarily take page lock for try_get_mem_cgroup_from_page().
+ * __memory_failure() will redo the check reliably inside page lock.
+ */
+ lock_page(p);
+ err = hwpoison_filter(p);
+ unlock_page(p);
+ if (err)
+ return 0;
+
+ printk(KERN_INFO "Injecting memory failure at pfn %lx\n", pfn);
+ return __memory_failure(pfn, 18, MF_COUNT_INCREASED);
}

static int hwpoison_unpoison(void *data, u64 val)
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -264,5 +264,7 @@ int __get_user_pages(struct task_struct
#define ZONE_RECLAIM_SUCCESS 1
#endif

+extern int hwpoison_filter(struct page *p);
+
extern u32 hwpoison_filter_dev_major;
extern u32 hwpoison_filter_dev_minor;
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -103,7 +103,8 @@ hwpoison-inject module through debugfs

corrupt-pfn

-Inject hwpoison fault at PFN echoed into this file.
+Inject hwpoison fault at PFN echoed into this file. This does
+some early filtering to avoid corrupted unintended pages in test suites.

unpoison-pfn

2009-12-08 21:20:07

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [19/31] mm: export stable page flags


From: Wu Fengguang <[email protected]>

Rename get_uflags() to stable_page_flags() and make it a global function
for use in the hwpoison page flags filter, which need to compare user
page flags with the value provided by user space.

Also move KPF_* to kernel-page-flags.h for use by user space tools.

CC: Matt Mackall <[email protected]>
CC: Nick Piggin <[email protected]>
CC: Christoph Lameter <[email protected]>
CC: Andi Kleen <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
fs/proc/page.c | 45 ++-----------------------------------
include/linux/kernel-page-flags.h | 46 ++++++++++++++++++++++++++++++++++++++
include/linux/page-flags.h | 2 +
3 files changed, 51 insertions(+), 42 deletions(-)

Index: linux/fs/proc/page.c
===================================================================
--- linux.orig/fs/proc/page.c
+++ linux/fs/proc/page.c
@@ -8,6 +8,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"

@@ -71,52 +72,12 @@ static const struct file_operations proc
* physical page flags.
*/

-/* These macros are used to decouple internal flags from exported ones */
-
-#define KPF_LOCKED 0
-#define KPF_ERROR 1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE 3
-#define KPF_DIRTY 4
-#define KPF_LRU 5
-#define KPF_ACTIVE 6
-#define KPF_SLAB 7
-#define KPF_WRITEBACK 8
-#define KPF_RECLAIM 9
-#define KPF_BUDDY 10
-
-/* 11-20: new additions in 2.6.31 */
-#define KPF_MMAP 11
-#define KPF_ANON 12
-#define KPF_SWAPCACHE 13
-#define KPF_SWAPBACKED 14
-#define KPF_COMPOUND_HEAD 15
-#define KPF_COMPOUND_TAIL 16
-#define KPF_HUGE 17
-#define KPF_UNEVICTABLE 18
-#define KPF_HWPOISON 19
-#define KPF_NOPAGE 20
-
-#define KPF_KSM 21
-
-/* kernel hacking assistances
- * WARNING: subject to change, never rely on them!
- */
-#define KPF_RESERVED 32
-#define KPF_MLOCKED 33
-#define KPF_MAPPEDTODISK 34
-#define KPF_PRIVATE 35
-#define KPF_PRIVATE_2 36
-#define KPF_OWNER_PRIVATE 37
-#define KPF_ARCH 38
-#define KPF_UNCACHED 39
-
static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
{
return ((kflags >> kbit) & 1) << ubit;
}

-static u64 get_uflags(struct page *page)
+u64 stable_page_flags(struct page *page)
{
u64 k;
u64 u;
@@ -219,7 +180,7 @@ static ssize_t kpageflags_read(struct fi
else
ppage = NULL;

- if (put_user(get_uflags(ppage), out)) {
+ if (put_user(stable_page_flags(ppage), out)) {
ret = -EFAULT;
break;
}
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -284,6 +284,8 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+u64 stable_page_flags(struct page *page);
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
Index: linux/include/linux/kernel-page-flags.h
===================================================================
--- /dev/null
+++ linux/include/linux/kernel-page-flags.h
@@ -0,0 +1,46 @@
+#ifndef LINUX_KERNEL_PAGE_FLAGS_H
+#define LINUX_KERNEL_PAGE_FLAGS_H
+
+/*
+ * Stable page flag bits exported to user space
+ */
+
+#define KPF_LOCKED 0
+#define KPF_ERROR 1
+#define KPF_REFERENCED 2
+#define KPF_UPTODATE 3
+#define KPF_DIRTY 4
+#define KPF_LRU 5
+#define KPF_ACTIVE 6
+#define KPF_SLAB 7
+#define KPF_WRITEBACK 8
+#define KPF_RECLAIM 9
+#define KPF_BUDDY 10
+
+/* 11-20: new additions in 2.6.31 */
+#define KPF_MMAP 11
+#define KPF_ANON 12
+#define KPF_SWAPCACHE 13
+#define KPF_SWAPBACKED 14
+#define KPF_COMPOUND_HEAD 15
+#define KPF_COMPOUND_TAIL 16
+#define KPF_HUGE 17
+#define KPF_UNEVICTABLE 18
+#define KPF_HWPOISON 19
+#define KPF_NOPAGE 20
+
+#define KPF_KSM 21
+
+/* kernel hacking assistances
+ * WARNING: subject to change, never rely on them!
+ */
+#define KPF_RESERVED 32
+#define KPF_MLOCKED 33
+#define KPF_MAPPEDTODISK 34
+#define KPF_PRIVATE 35
+#define KPF_PRIVATE_2 36
+#define KPF_OWNER_PRIVATE 37
+#define KPF_ARCH 38
+#define KPF_UNCACHED 39
+
+#endif /* LINUX_KERNEL_PAGE_FLAGS_H */

2009-12-08 21:19:19

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [20/31] HWPOISON: add page flags filter


From: Wu Fengguang <[email protected]>

When specified, only poison pages if ((page_flags & mask) == value).

- corrupt-filter-flags-mask
- corrupt-filter-flags-value

This allows stress testing of many kinds of pages.

Strictly speaking, the buddy pages requires taking zone lock, to avoid
setting PG_hwpoison on a "was buddy but now allocated to someone" page.
However we can just do nothing because we set PG_locked in the beginning,
this prevents the page allocator from allocating it to someone. (It will
BUG() on the unexpected PG_locked, which is fine for hwpoison testing.)

CC: Nick Piggin <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/vm/hwpoison.txt | 10 ++++++++++
mm/hwpoison-inject.c | 10 ++++++++++
mm/internal.h | 2 ++
mm/memory-failure.c | 20 ++++++++++++++++++++
4 files changed, 42 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -102,6 +102,16 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

+ dentry = debugfs_create_u64("corrupt-filter-flags-mask", 0600,
+ hwpoison_dir, &hwpoison_filter_flags_mask);
+ if (!dentry)
+ goto fail;
+
+ dentry = debugfs_create_u64("corrupt-filter-flags-value", 0600,
+ hwpoison_dir, &hwpoison_filter_flags_value);
+ if (!dentry)
+ goto fail;
+
return 0;
fail:
pfn_inject_exit();
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -268,3 +268,5 @@ extern int hwpoison_filter(struct page *

extern u32 hwpoison_filter_dev_major;
extern u32 hwpoison_filter_dev_minor;
+extern u64 hwpoison_filter_flags_mask;
+extern u64 hwpoison_filter_flags_value;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -34,6 +34,7 @@
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/page-flags.h>
+#include <linux/kernel-page-flags.h>
#include <linux/sched.h>
#include <linux/ksm.h>
#include <linux/rmap.h>
@@ -50,8 +51,12 @@ atomic_long_t mce_bad_pages __read_mostl

u32 hwpoison_filter_dev_major = ~0U;
u32 hwpoison_filter_dev_minor = ~0U;
+u64 hwpoison_filter_flags_mask;
+u64 hwpoison_filter_flags_value;
EXPORT_SYMBOL_GPL(hwpoison_filter_dev_major);
EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
+EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
+EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);

static int hwpoison_filter_dev(struct page *p)
{
@@ -83,11 +88,26 @@ static int hwpoison_filter_dev(struct pa
return 0;
}

+static int hwpoison_filter_flags(struct page *p)
+{
+ if (!hwpoison_filter_flags_mask)
+ return 0;
+
+ if ((stable_page_flags(p) & hwpoison_filter_flags_mask) ==
+ hwpoison_filter_flags_value)
+ return 0;
+ else
+ return -EINVAL;
+}
+
int hwpoison_filter(struct page *p)
{
if (hwpoison_filter_dev(p))
return -EINVAL;

+ if (hwpoison_filter_flags(p))
+ return -EINVAL;
+
return 0;
}
EXPORT_SYMBOL_GPL(hwpoison_filter);
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -123,6 +123,16 @@ Only handle memory failures to pages ass
by block device major/minor. -1U is the wildcard value.
This should be only used for testing with artificial injection.

+
+corrupt-filter-flags-mask
+corrupt-filter-flags-value
+
+When specified, only poison pages if ((page_flags & mask) == value).
+This allows stress testing of many kinds of pages. The page_flags
+are the same as in /proc/kpageflags. The flag bits are defined in
+include/linux/kernel-page-flags.h and documented in
+Documentation/vm/pagemap.txt
+
Architecture specific MCE injector

x86 has mce-inject, mce-test

2009-12-08 21:19:12

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [21/31] memcg: rename and export try_get_mem_cgroup_from_page()


From: Wu Fengguang <[email protected]>

So that the hwpoison injector can get mem_cgroup for arbitrary page
and thus know whether it is owned by some mem_cgroup task(s).

CC: KOSAKI Motohiro <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Daisuke Nishimura <[email protected]>
CC: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/memcontrol.h | 6 ++++++
mm/memcontrol.c | 11 ++++-------
2 files changed, 10 insertions(+), 7 deletions(-)

Index: linux/mm/memcontrol.c
===================================================================
--- linux.orig/mm/memcontrol.c
+++ linux/mm/memcontrol.c
@@ -1379,25 +1379,22 @@ static struct mem_cgroup *mem_cgroup_loo
return container_of(css, struct mem_cgroup, css);
}

-static struct mem_cgroup *try_get_mem_cgroup_from_swapcache(struct page *page)
+struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
{
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem = NULL;
struct page_cgroup *pc;
unsigned short id;
swp_entry_t ent;

VM_BUG_ON(!PageLocked(page));

- if (!PageSwapCache(page))
- return NULL;
-
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
if (mem && !css_tryget(&mem->css))
mem = NULL;
- } else {
+ } else if (PageSwapCache(page)) {
ent.val = page_private(page);
id = lookup_swap_cgroup(ent);
rcu_read_lock();
@@ -1742,7 +1739,7 @@ int mem_cgroup_try_charge_swapin(struct
*/
if (!PageSwapCache(page))
return 0;
- mem = try_get_mem_cgroup_from_swapcache(page);
+ mem = try_get_mem_cgroup_from_page(page);
if (!mem)
goto charge_cur_mm;
*ptr = mem;
Index: linux/include/linux/memcontrol.h
===================================================================
--- linux.orig/include/linux/memcontrol.h
+++ linux/include/linux/memcontrol.h
@@ -68,6 +68,7 @@ extern unsigned long mem_cgroup_isolate_
extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);

+extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

static inline
@@ -189,6 +190,11 @@ mem_cgroup_move_lists(struct page *page,
{
}

+static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
static inline int mm_match_cgroup(struct mm_struct *mm, struct mem_cgroup *mem)
{
return 1;

2009-12-08 21:18:12

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [22/31] memcg: add accessor to mem_cgroup.css


From: Wu Fengguang <[email protected]>

So that an outside user can free the reference count grabbed by
try_get_mem_cgroup_from_page().

CC: KOSAKI Motohiro <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Daisuke Nishimura <[email protected]>
CC: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/memcontrol.h | 7 +++++++
mm/memcontrol.c | 5 +++++
2 files changed, 12 insertions(+)

Index: linux/include/linux/memcontrol.h
===================================================================
--- linux.orig/include/linux/memcontrol.h
+++ linux/include/linux/memcontrol.h
@@ -81,6 +81,8 @@ int mm_match_cgroup(const struct mm_stru
return cgroup == mem;
}

+extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem);
+
extern int
mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
@@ -206,6 +208,11 @@ static inline int task_in_mem_cgroup(str
return 1;
}

+static inline struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
+{
+ return NULL;
+}
+
static inline int
mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
{
Index: linux/mm/memcontrol.c
===================================================================
--- linux.orig/mm/memcontrol.c
+++ linux/mm/memcontrol.c
@@ -282,6 +282,11 @@ mem_cgroup_zoneinfo(struct mem_cgroup *m
return &mem->info.nodeinfo[nid]->zoneinfo[zid];
}

+struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem)
+{
+ return &mem->css;
+}
+
static struct mem_cgroup_per_zone *
page_cgroup_zoneinfo(struct page_cgroup *pc)
{

2009-12-08 21:19:31

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [23/31] HWPOISON: add memory cgroup filter


The hwpoison test suite need to inject hwpoison to a collection of
selected task pages, and must not touch pages not owned by them and
thus kill important system processes such as init. (But it's OK to
mis-hwpoison free/unowned pages as well as shared clean pages.
Mis-hwpoison of shared dirty pages will kill all tasks, so the test
suite will target all or non of such tasks in the first place.)

The memory cgroup serves this purpose well. We can put the target
processes under the control of a memory cgroup, and tell the hwpoison
injection code to only kill pages associated with some active memory
cgroup.

The prerequisite for doing hwpoison stress tests with mem_cgroup is,
the mem_cgroup code tracks task pages _accurately_ (unless page is
locked). Which we believe is/should be true.

The benefits are simplification of hwpoison injector code. Also the
mem_cgroup code will automatically be tested by hwpoison test cases.

The alternative interfaces pin-pfn/unpin-pfn can also delegate the
(process and page flags) filtering functions reliably to user space.
However prototype implementation shows that this scheme adds more
complexity than we wanted.

Example test case:

mkdir /cgroup/hwpoison

usemem -m 100 -s 1000 &
echo `jobs -p` > /cgroup/hwpoison/tasks

memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg

page-types -p `pidof init` --hwpoison # shall do nothing
page-types -p `pidof usemem` --hwpoison # poison its pages

AK: Fix documentation

CC: KOSAKI Motohiro <[email protected]>
CC: Hugh Dickins <[email protected]>
CC: Daisuke Nishimura <[email protected]>
CC: Balbir Singh <[email protected]>
CC: KAMEZAWA Hiroyuki <[email protected]>
CC: Li Zefan <[email protected]>
CC: Paul Menage <[email protected]>
CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/vm/hwpoison.txt | 16 ++++++++++++++++
mm/hwpoison-inject.c | 7 +++++++
mm/internal.h | 1 +
mm/memory-failure.c | 42 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 66 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -100,6 +100,45 @@ static int hwpoison_filter_flags(struct
return -EINVAL;
}

+/*
+ * This allows stress tests to limit test scope to a collection of tasks
+ * by putting them under some memcg. This prevents killing unrelated/important
+ * processes such as /sbin/init. Note that the target task may share clean
+ * pages with init (eg. libc text), which is harmless. If the target task
+ * share _dirty_ pages with another task B, the test scheme must make sure B
+ * is also included in the memcg. At last, due to race conditions this filter
+ * can only guarantee that the page either belongs to the memcg tasks, or is
+ * a freed page.
+ */
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+u64 hwpoison_filter_memcg;
+EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
+static int hwpoison_filter_task(struct page *p)
+{
+ struct mem_cgroup *mem;
+ struct cgroup_subsys_state *css;
+ unsigned long ino;
+
+ if (!hwpoison_filter_memcg)
+ return 0;
+
+ mem = try_get_mem_cgroup_from_page(p);
+ if (!mem)
+ return -EINVAL;
+
+ css = mem_cgroup_css(mem);
+ ino = css->cgroup->dentry->d_inode->i_ino;
+ css_put(css);
+
+ if (ino != hwpoison_filter_memcg)
+ return -EINVAL;
+
+ return 0;
+}
+#else
+static int hwpoison_filter_task(struct page *p) { return 0; }
+#endif
+
int hwpoison_filter(struct page *p)
{
if (hwpoison_filter_dev(p))
@@ -108,6 +147,9 @@ int hwpoison_filter(struct page *p)
if (hwpoison_filter_flags(p))
return -EINVAL;

+ if (hwpoison_filter_task(p))
+ return -EINVAL;
+
return 0;
}
EXPORT_SYMBOL_GPL(hwpoison_filter);
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -270,3 +270,4 @@ extern u32 hwpoison_filter_dev_major;
extern u32 hwpoison_filter_dev_minor;
extern u64 hwpoison_filter_flags_mask;
extern u64 hwpoison_filter_flags_value;
+extern u64 hwpoison_filter_memcg;
Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -112,6 +112,13 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+ dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
+ hwpoison_dir, &hwpoison_filter_memcg);
+ if (!dentry)
+ goto fail;
+#endif
+
return 0;
fail:
pfn_inject_exit();
Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -123,6 +123,22 @@ Only handle memory failures to pages ass
by block device major/minor. -1U is the wildcard value.
This should be only used for testing with artificial injection.

+corrupt-filter-memcg
+
+Limit injection to pages owned by memgroup. Specified by inode number
+of the memcg.
+
+Example:
+ mkdir /cgroup/hwpoison
+
+ usemem -m 100 -s 1000 &
+ echo `jobs -p` > /cgroup/hwpoison/tasks
+
+ memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
+ echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
+
+ page-types -p `pidof init` --hwpoison # shall do nothing
+ page-types -p `pidof usemem` --hwpoison # poison its pages

corrupt-filter-flags-mask
corrupt-filter-flags-value

2009-12-08 21:19:54

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [24/31] HWPOISON: add an interface to switch off/on all the page filters


From: Haicheng Li <[email protected]>

In some use cases, user doesn't need extra filtering. E.g. user program
can inject errors through madvise syscall to its own pages, however it
might not know what the page state exactly is or which inode the page
belongs to.

So introduce an one-off interface "corrupt-filter-enable".

Echo 0 to switch off page filters, and echo 1 to switch on the filters.
[AK: changed default to 0]

Signed-off-by: Haicheng Li <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>

---
mm/hwpoison-inject.c | 5 +++++
mm/internal.h | 1 +
mm/memory-failure.c | 5 +++++
3 files changed, 11 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -92,6 +92,11 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

+ dentry = debugfs_create_u32("corrupt-filter-enable", 0600,
+ hwpoison_dir, &hwpoison_filter_enable);
+ if (!dentry)
+ goto fail;
+
dentry = debugfs_create_u32("corrupt-filter-dev-major", 0600,
hwpoison_dir, &hwpoison_filter_dev_major);
if (!dentry)
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -271,3 +271,4 @@ extern u32 hwpoison_filter_dev_minor;
extern u64 hwpoison_filter_flags_mask;
extern u64 hwpoison_filter_flags_value;
extern u64 hwpoison_filter_memcg;
+extern u32 hwpoison_filter_enable;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -49,10 +49,12 @@ int sysctl_memory_failure_recovery __rea

atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);

+u32 hwpoison_filter_enable = 0;
u32 hwpoison_filter_dev_major = ~0U;
u32 hwpoison_filter_dev_minor = ~0U;
u64 hwpoison_filter_flags_mask;
u64 hwpoison_filter_flags_value;
+EXPORT_SYMBOL_GPL(hwpoison_filter_enable);
EXPORT_SYMBOL_GPL(hwpoison_filter_dev_major);
EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
@@ -141,6 +143,9 @@ static int hwpoison_filter_task(struct p

int hwpoison_filter(struct page *p)
{
+ if (!hwpoison_filter_enable)
+ return 0;
+
if (hwpoison_filter_dev(p))
return -EINVAL;

2009-12-08 21:18:52

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [25/31] HWPOISON: Don't do early filtering if filter is disabled


Signed-off-by: Andi Kleen <[email protected]>

---
mm/hwpoison-inject.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -18,6 +18,8 @@ static int hwpoison_inject(void *data, u
if (!capable(CAP_SYS_ADMIN))
return -EPERM;

+ if (!hwpoison_filter_enable)
+ goto inject;
if (!pfn_valid(pfn))
return -ENXIO;

@@ -48,6 +50,7 @@ static int hwpoison_inject(void *data, u
if (err)
return 0;

+inject:
printk(KERN_INFO "Injecting memory failure at pfn %lx\n", pfn);
return __memory_failure(pfn, 18, MF_COUNT_INCREASED);
}

2009-12-08 21:17:14

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [26/31] HWPOISON: mention HWPoison in Kconfig entry


Signed-off-by: Andi Kleen <[email protected]>

---
mm/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig
+++ linux/mm/Kconfig
@@ -257,7 +257,7 @@ config MEMORY_FAILURE
special hardware support and typically ECC memory.

config HWPOISON_INJECT
- tristate "Poison pages injector"
+ tristate "HWPoison pages injector"
depends on MEMORY_FAILURE && DEBUG_KERNEL

config NOMMU_INITIAL_TRIM_EXCESS

2009-12-08 21:17:11

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [27/31] HWPOISON: Use correct name for MADV_HWPOISON in documentation


Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/vm/hwpoison.txt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/Documentation/vm/hwpoison.txt
===================================================================
--- linux.orig/Documentation/vm/hwpoison.txt
+++ linux/Documentation/vm/hwpoison.txt
@@ -92,7 +92,7 @@ PR_MCE_KILL_GET

Testing:

-madvise(MADV_POISON, ....)
+madvise(MADV_HWPOISON, ....)
(as root)
Poison a page in the process for testing

2009-12-08 21:18:23

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [28/31] HWPOISON: Use new shake_page in memory_failure


shake_page handles more types of page caches than
the much simpler lru_add_drain_all:

- slab (quite inefficiently for now)
- any other caches with a shrinker callback
- per cpu page allocator pages
- per CPU LRU

Use this call to try to turn pages into free or LRU pages.
Then handle the case of the page becoming free after drain everything.

Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -934,8 +934,15 @@ int __memory_failure(unsigned long pfn,
* walked by the page reclaim code, however that's not a big loss.
*/
if (!PageLRU(p))
- lru_add_drain_all();
+ shake_page(p);
if (!PageLRU(p)) {
+ /*
+ * shake_page could have turned it free.
+ */
+ if (is_free_buddy_page(p)) {
+ action_result(pfn, "free buddy, 2nd try", DELAYED);
+ return 0;
+ }
action_result(pfn, "non LRU", IGNORED);
put_page(p);
return -EBUSY;

2009-12-08 21:17:21

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [29/31] HWPOISON: Undefine short-hand macros after use to avoid namespace conflict


Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory-failure.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -747,6 +747,19 @@ static struct page_state {
{ 0, 0, "unknown page state", me_unknown },
};

+#undef dirty
+#undef sc
+#undef unevict
+#undef mlock
+#undef writeback
+#undef lru
+#undef swapbacked
+#undef head
+#undef tail
+#undef compound
+#undef slab
+#undef reserved
+
static void action_result(unsigned long pfn, char *msg, int result)
{
struct page *page = pfn_to_page(pfn);

2009-12-08 21:18:34

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [30/31] HWPOISON: Add soft page offline support


This is a simpler, gentler variant of memory_failure() for soft page
offlining controlled from user space. It doesn't kill anything, just
tries to invalidate and if that doesn't work migrate the
page away.

This is useful for predictive failure analysis, where a page has
a high rate of corrected errors, but hasn't gone bad yet. Instead
it can be offlined early and avoided.

The offlining is controlled from sysfs, including a new generic
entry point for hard page offlining for symmetry too.

We use the page isolate facility to prevent re-allocation
race. Normally this is only used by memory hotplug. To avoid
races with memory allocation I am using lock_system_sleep().
This avoids the situation where memory hotplug is about
to isolate a page range and then hwpoison undoes that work.
This is a big hammer currently, but the simplest solution
currently.

When the page is not free or LRU we try to free pages
from slab and other caches. The slab freeing is currently
quite dumb and does not try to focus on the specific slab
cache which might own the page. This could be potentially
improved later.

OPEN: nfs migration hangs (broken low level migration op)

Thanks to Fengguang Wu and Haicheng Li for some fixes.

Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/ABI/testing/sysfs-memory-page-offline | 44 ++++
drivers/base/memory.c | 61 ++++++
include/linux/mm.h | 3
mm/hwpoison-inject.c | 2
mm/memory-failure.c | 194 +++++++++++++++++++-
5 files changed, 297 insertions(+), 7 deletions(-)

Index: linux/drivers/base/memory.c
===================================================================
--- linux.orig/drivers/base/memory.c
+++ linux/drivers/base/memory.c
@@ -341,6 +341,64 @@ static inline int memory_probe_init(void
}
#endif

+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for offlining pages of memory
+ */
+
+/* Soft offline a page */
+static ssize_t
+store_soft_offline_page(struct class *class, const char *buf, size_t count)
+{
+ int ret;
+ u64 pfn;
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ if (strict_strtoull(buf, 0, &pfn) < 0)
+ return -EINVAL;
+ pfn >>= PAGE_SHIFT;
+ if (!pfn_valid(pfn))
+ return -ENXIO;
+ ret = soft_offline_page(pfn_to_page(pfn), 0);
+ return ret == 0 ? count : ret;
+}
+
+/* Forcibly offline a page, including killing processes. */
+static ssize_t
+store_hard_offline_page(struct class *class, const char *buf, size_t count)
+{
+ int ret;
+ u64 pfn;
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ if (strict_strtoull(buf, 0, &pfn) < 0)
+ return -EINVAL;
+ pfn >>= PAGE_SHIFT;
+ ret = __memory_failure(pfn, 0, 0);
+ return ret ? ret : count;
+}
+
+static CLASS_ATTR(soft_offline_page, 0644, NULL, store_soft_offline_page);
+static CLASS_ATTR(hard_offline_page, 0644, NULL, store_hard_offline_page);
+
+static __init int memory_fail_init(void)
+{
+ int err;
+
+ err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
+ &class_attr_soft_offline_page.attr);
+ if (!err)
+ err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
+ &class_attr_hard_offline_page.attr);
+ return err;
+}
+#else
+static inline int memory_fail_init(void)
+{
+ return 0;
+}
+#endif
+
/*
* Note that phys_device is optional. It is here to allow for
* differentiation between which *physical* devices each
@@ -473,6 +531,9 @@ int __init memory_dev_init(void)
err = memory_probe_init();
if (!ret)
ret = err;
+ err = memory_fail_init();
+ if (!ret)
+ ret = err;
err = block_size_init();
if (!ret)
ret = err;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -41,6 +41,9 @@
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/backing-dev.h>
+#include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/suspend.h>
#include "internal.h"

int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -197,7 +200,7 @@ static int kill_proc_ao(struct task_stru
* When a unknown page type is encountered drain as many buffers as possible
* in the hope to turn the page into a LRU or free page, which we can handle.
*/
-void shake_page(struct page *p)
+void shake_page(struct page *p, int access)
{
if (!PageSlab(p)) {
lru_add_drain_all();
@@ -207,11 +210,19 @@ void shake_page(struct page *p)
if (PageLRU(p) || is_free_buddy_page(p))
return;
}
+
/*
- * Could call shrink_slab here (which would also
- * shrink other caches). Unfortunately that might
- * also access the corrupted page, which could be fatal.
+ * Only all shrink_slab here (which would also
+ * shrink other caches) if access is not potentially fatal.
*/
+ if (access) {
+ int nr;
+ do {
+ nr = shrink_slab(1000, GFP_KERNEL, 1000);
+ if (page_count(p) == 0)
+ break;
+ } while (nr > 10);
+ }
}
EXPORT_SYMBOL_GPL(shake_page);

@@ -947,7 +958,7 @@ int __memory_failure(unsigned long pfn,
* walked by the page reclaim code, however that's not a big loss.
*/
if (!PageLRU(p))
- shake_page(p);
+ shake_page(p, 0);
if (!PageLRU(p)) {
/*
* shake_page could have turned it free.
@@ -1097,3 +1108,176 @@ int unpoison_memory(unsigned long pfn)
return 0;
}
EXPORT_SYMBOL(unpoison_memory);
+
+static struct page *new_page(struct page *p, unsigned long private, int **x)
+{
+ return alloc_pages(GFP_HIGHUSER_MOVABLE, 0);
+}
+
+/*
+ * Safely get reference count of an arbitrary page.
+ * Returns 0 for a free page, -EIO for a zero refcount page
+ * that is not free, and 1 for any other page type.
+ * For 1 the page is returned with increased page count, otherwise not.
+ */
+static int get_any_page(struct page *p, unsigned long pfn, int flags)
+{
+ int ret;
+
+ if (flags & MF_COUNT_INCREASED)
+ return 1;
+
+ /*
+ * The lock_system_sleep prevents a race with memory hotplug,
+ * because the isolation assumes there's only a single user.
+ * This is a big hammer, a better would be nicer.
+ */
+ lock_system_sleep();
+
+ /*
+ * Isolate the page, so that it doesn't get reallocated if it
+ * was free.
+ */
+ set_migratetype_isolate(p);
+ if (!get_page_unless_zero(compound_head(p))) {
+ if (is_free_buddy_page(p)) {
+ pr_debug("get_any_page: %#lx free buddy page\n", pfn);
+ /* Set hwpoison bit while page is still isolated */
+ SetPageHWPoison(p);
+ ret = 0;
+ } else {
+ pr_debug("get_any_page: %#lx: unknown zero refcount page type %lx\n",
+ pfn, p->flags);
+ ret = -EIO;
+ }
+ } else {
+ /* Not a free page */
+ ret = 1;
+ }
+ unset_migratetype_isolate(p);
+ unlock_system_sleep();
+ return ret;
+}
+
+/**
+ * soft_offline_page - Soft offline a page.
+ * @page: page to offline
+ * @flags: flags. Same as memory_failure().
+ *
+ * Returns 0 on success, otherwise negated errno.
+ *
+ * Soft offline a page, by migration or invalidation,
+ * without killing anything. This is for the case when
+ * a page is not corrupted yet (so it's still valid to access),
+ * but has had a number of corrected errors and is better taken
+ * out.
+ *
+ * The actual policy on when to do that is maintained by
+ * user space.
+ *
+ * This should never impact any application or cause data loss,
+ * however it might take some time.
+ *
+ * This is not a 100% solution for all memory, but tries to be
+ * ``good enough'' for the majority of memory.
+ */
+int soft_offline_page(struct page *page, int flags)
+{
+ int ret;
+ unsigned long pfn = page_to_pfn(page);
+
+ ret = get_any_page(page, pfn, flags);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ goto done;
+
+ /*
+ * Page cache page we can handle?
+ */
+ if (!PageLRU(page)) {
+ /*
+ * Try to free it.
+ */
+ put_page(page);
+ shake_page(page, 1);
+
+ /*
+ * Did it turn free?
+ */
+ ret = get_any_page(page, pfn, flags);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ goto done;
+ }
+ if (!PageLRU(page)) {
+ pr_debug("soft_offline: %#lx: unknown non LRU page type %lx\n",
+ pfn, page->flags);
+ return -EIO;
+ }
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+
+ /*
+ * Synchronized using the page lock with memory_failure()
+ */
+ if (PageHWPoison(page)) {
+ unlock_page(page);
+ put_page(page);
+ pr_debug("soft offline: %#lx page already poisoned\n", pfn);
+ return -EBUSY;
+ }
+
+ /*
+ * Try to invalidate first. This should work for
+ * non dirty unmapped page cache pages.
+ */
+ ret = invalidate_inode_page(page);
+ unlock_page(page);
+
+ /*
+ * Drop count because page migration doesn't like raised
+ * counts. The page could get re-allocated, but if it becomes
+ * LRU the isolation will just fail.
+ * RED-PEN would be better to keep it isolated here, but we
+ * would need to fix isolation locking first.
+ */
+ put_page(page);
+ if (ret == 1) {
+ ret = 0;
+ pr_debug("soft_offline: %#lx: invalidated\n", pfn);
+ goto done;
+ }
+
+ /*
+ * Simple invalidation didn't work.
+ * Try to migrate to a new page instead. migrate.c
+ * handles a large number of cases for us.
+ */
+ ret = isolate_lru_page(page);
+ if (!ret) {
+ LIST_HEAD(pagelist);
+
+ list_add(&page->lru, &pagelist);
+ ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL);
+ if (ret) {
+ pr_debug("soft offline: %#lx: migration failed %d, type %lx\n",
+ pfn, ret, page->flags);
+ if (ret > 0)
+ ret = -EIO;
+ }
+ } else {
+ pr_debug("soft offline: %#lx: isolation failed: %d, page count %d, type %lx\n",
+ pfn, ret, page_count(page), page->flags);
+ }
+ if (ret)
+ return ret;
+
+done:
+ atomic_long_add(1, &mce_bad_pages);
+ SetPageHWPoison(page);
+ /* keep elevated page count for bad page */
+ return ret;
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1324,8 +1324,9 @@ extern int __memory_failure(unsigned lon
extern int unpoison_memory(unsigned long pfn);
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
-extern void shake_page(struct page *p);
+extern void shake_page(struct page *p, int access);
extern atomic_long_t mce_bad_pages;
+extern int soft_offline_page(struct page *page, int flags);

#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
Index: linux/Documentation/ABI/testing/sysfs-memory-page-offline
===================================================================
--- /dev/null
+++ linux/Documentation/ABI/testing/sysfs-memory-page-offline
@@ -0,0 +1,44 @@
+What: /sys/devices/system/memory/soft_offline_page
+Date: Sep 2009
+KernelVersion: 2.6.33
+Contact: [email protected]
+Description:
+ Soft-offline the memory page containing the physical address
+ written into this file. Input is a hex number specifying the
+ physical address of the page. The kernel will then attempt
+ to soft-offline it, by moving the contents elsewhere or
+ dropping it if possible. The kernel will then be placed
+ on the bad page list and never be reused.
+
+ The offlining is done in kernel specific granuality.
+ Normally it's the base page size of the kernel, but
+ this might change.
+
+ The page must be still accessible, not poisoned. The
+ kernel will never kill anything for this, but rather
+ fail the offline. Return value is the size of the
+ number, or a error when the offlining failed. Reading
+ the file is not allowed.
+
+What: /sys/devices/system/memory/hard_offline_page
+Date: Sep 2009
+KernelVersion: 2.6.33
+Contact: [email protected]
+Description:
+ Hard-offline the memory page containing the physical
+ address written into this file. Input is a hex number
+ specifying the physical address of the page. The
+ kernel will then attempt to hard-offline the page, by
+ trying to drop the page or killing any owner or
+ triggering IO errors if needed. Note this may kill
+ any processes owning the page. The kernel will avoid
+ to access this page assuming it's poisoned by the
+ hardware.
+
+ The offlining is done in kernel specific granuality.
+ Normally it's the base page size of the kernel, but
+ this might change.
+
+ Return value is the size of the number, or a error when
+ the offlining failed.
+ Reading the file is not allowed.
Index: linux/mm/hwpoison-inject.c
===================================================================
--- linux.orig/mm/hwpoison-inject.c
+++ linux/mm/hwpoison-inject.c
@@ -29,7 +29,7 @@ static int hwpoison_inject(void *data, u
return 0;

if (!PageLRU(p))
- shake_page(p);
+ shake_page(p, 0);
/*
* This implies unable to support non-LRU pages.
*/

2009-12-08 21:17:55

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining


Process based injection is much easier to handle for test programs,
who can first bring a page into a specific state and then test.
So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
to the existing hard offline injector.

Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-generic/mman-common.h | 1 +
mm/madvise.c | 15 ++++++++++++---
2 files changed, 13 insertions(+), 3 deletions(-)

Index: linux/include/asm-generic/mman-common.h
===================================================================
--- linux.orig/include/asm-generic/mman-common.h
+++ linux/include/asm-generic/mman-common.h
@@ -35,6 +35,7 @@
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
#define MADV_HWPOISON 100 /* poison a page for testing */
+#define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */

#define MADV_MERGEABLE 12 /* KSM may merge identical pages */
#define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c
+++ linux/mm/madvise.c
@@ -9,6 +9,7 @@
#include <linux/pagemap.h>
#include <linux/syscalls.h>
#include <linux/mempolicy.h>
+#include <linux/page-isolation.h>
#include <linux/hugetlb.h>
#include <linux/sched.h>
#include <linux/ksm.h>
@@ -222,7 +223,7 @@ static long madvise_remove(struct vm_are
/*
* Error injection support for memory error handling.
*/
-static int madvise_hwpoison(unsigned long start, unsigned long end)
+static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
{
int ret = 0;

@@ -233,6 +234,14 @@ static int madvise_hwpoison(unsigned lon
int ret = get_user_pages_fast(start, 1, 0, &p);
if (ret != 1)
return ret;
+ if (bhv == MADV_SOFT_OFFLINE) {
+ printk(KERN_INFO "Soft offlining page %lx at %lx\n",
+ page_to_pfn(p), start);
+ ret = soft_offline_page(p, MF_COUNT_INCREASED);
+ if (ret)
+ break;
+ continue;
+ }
printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
page_to_pfn(p), start);
/* Ignore return value for now */
@@ -333,8 +342,8 @@ SYSCALL_DEFINE3(madvise, unsigned long,
size_t len;

#ifdef CONFIG_MEMORY_FAILURE
- if (behavior == MADV_HWPOISON)
- return madvise_hwpoison(start, start+len_in);
+ if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+ return madvise_hwpoison(behavior, start, start+len_in);
#endif
if (!madvise_behavior_valid(behavior))
return error;

2009-12-08 23:10:23

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] [19/31] mm: export stable page flags

On Tue, 2009-12-08 at 22:16 +0100, Andi Kleen wrote:
> From: Wu Fengguang <[email protected]>
>
> Rename get_uflags() to stable_page_flags() and make it a global function
> for use in the hwpoison page flags filter, which need to compare user
> page flags with the value provided by user space.
>
> Also move KPF_* to kernel-page-flags.h for use by user space tools.
>
> CC: Matt Mackall <[email protected]>
> CC: Nick Piggin <[email protected]>
> CC: Christoph Lameter <[email protected]>
> CC: Andi Kleen <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>

Acked-by: Matt Mackall <[email protected]>
--
http://selenic.com : development and support for Mercurial and Linux

2009-12-09 02:00:42

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [19/31] mm: export stable page flags

On Wed, Dec 09, 2009 at 06:27:31AM +0800, Matt Mackall wrote:
> On Tue, 2009-12-08 at 22:16 +0100, Andi Kleen wrote:
> > From: Wu Fengguang <[email protected]>
> >
> > Rename get_uflags() to stable_page_flags() and make it a global function
> > for use in the hwpoison page flags filter, which need to compare user
> > page flags with the value provided by user space.
> >
> > Also move KPF_* to kernel-page-flags.h for use by user space tools.
> >
> > CC: Matt Mackall <[email protected]>
> > CC: Nick Piggin <[email protected]>
> > CC: Christoph Lameter <[email protected]>
> > CC: Andi Kleen <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> > Signed-off-by: Andi Kleen <[email protected]>
>
> Acked-by: Matt Mackall <[email protected]>

Andi and Matt,

Sorry the stable_page_flags() will be undefined on
!CONFIG_PROC_PAGE_MONITOR (it is almost always on,
except for some embedded systems).

Currently the easy solution is to add a Kconfig dependency to
CONFIG_PROC_PAGE_MONITOR. When there comes more users (ie. some
ftrace event), we can then always compile in stable_page_flags().

Thanks,
Fengguang
---
mm/Kconfig | 1 +
mm/memory-failure.c | 4 ++++
2 files changed, 5 insertions(+)

--- linux-mm.orig/mm/Kconfig 2009-12-09 09:47:51.000000000 +0800
+++ linux-mm/mm/Kconfig 2009-12-09 09:58:54.000000000 +0800
@@ -259,6 +259,7 @@ config MEMORY_FAILURE
config HWPOISON_INJECT
tristate "HWPoison pages injector"
depends on MEMORY_FAILURE && DEBUG_KERNEL
+ depends on PROC_PAGE_MONITOR

config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
--- linux-mm.orig/mm/memory-failure.c 2009-12-09 09:49:13.000000000 +0800
+++ linux-mm/mm/memory-failure.c 2009-12-09 09:55:42.000000000 +0800
@@ -51,6 +51,7 @@ int sysctl_memory_failure_recovery __rea

atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);

+#ifdef CONFIG_HWPOISON_INJECT
u32 hwpoison_filter_enable = 1;
u32 hwpoison_filter_dev_major = ~0U;
u32 hwpoison_filter_dev_minor = ~0U;
@@ -160,6 +161,9 @@ int hwpoison_filter(struct page *p)
return 0;
}
EXPORT_SYMBOL_GPL(hwpoison_filter);
+#else
+int hwpoison_filter(struct page *p) { return 0; }
+#endif

/*
* Send all the processes who have the page mapped an ``action optional''

2009-12-09 05:04:31

by Li Zefan

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> +u64 hwpoison_filter_memcg;
> +EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
> +static int hwpoison_filter_task(struct page *p)
> +{
> + struct mem_cgroup *mem;
> + struct cgroup_subsys_state *css;
> + unsigned long ino;
> +
> + if (!hwpoison_filter_memcg)
> + return 0;
> +
> + mem = try_get_mem_cgroup_from_page(p);
> + if (!mem)
> + return -EINVAL;
> +
> + css = mem_cgroup_css(mem);
> + ino = css->cgroup->dentry->d_inode->i_ino;

I have a question, can try_get_mem_cgroup_from_page() return
root_mem_cgroup?

if it can, then css->cgroup->dentry is NULL, if memcg is
not mounted and there is no subdir in memcg. Because the root
cgroup of an inactive subsystem has no dentry.

> + css_put(css);
> +
> + if (ino != hwpoison_filter_memcg)
> + return -EINVAL;
> +
> + return 0;
> +}
> +#else
> +static int hwpoison_filter_task(struct page *p) { return 0; }
> +#endif

2009-12-09 05:09:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

On Wed, 09 Dec 2009 13:04:06 +0800
Li Zefan <[email protected]> wrote:

> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > +u64 hwpoison_filter_memcg;
> > +EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
> > +static int hwpoison_filter_task(struct page *p)
> > +{
> > + struct mem_cgroup *mem;
> > + struct cgroup_subsys_state *css;
> > + unsigned long ino;
> > +
> > + if (!hwpoison_filter_memcg)
> > + return 0;
> > +
> > + mem = try_get_mem_cgroup_from_page(p);
> > + if (!mem)
> > + return -EINVAL;
> > +
> > + css = mem_cgroup_css(mem);
> > + ino = css->cgroup->dentry->d_inode->i_ino;
>
> I have a question, can try_get_mem_cgroup_from_page() return
> root_mem_cgroup?
>
yes.

> if it can, then css->cgroup->dentry is NULL, if memcg is
> not mounted and there is no subdir in memcg. Because the root
> cgroup of an inactive subsystem has no dentry.
>

Nice catch. It sounds possible. That should be handled.

Regards,
-Kame

2009-12-09 05:33:40

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

* KAMEZAWA Hiroyuki <[email protected]> [2009-12-09 14:06:20]:

> On Wed, 09 Dec 2009 13:04:06 +0800
> Li Zefan <[email protected]> wrote:
>
> > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > +u64 hwpoison_filter_memcg;
> > > +EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
> > > +static int hwpoison_filter_task(struct page *p)
> > > +{
> > > + struct mem_cgroup *mem;
> > > + struct cgroup_subsys_state *css;
> > > + unsigned long ino;
> > > +
> > > + if (!hwpoison_filter_memcg)
> > > + return 0;
> > > +
> > > + mem = try_get_mem_cgroup_from_page(p);
> > > + if (!mem)
> > > + return -EINVAL;
> > > +
> > > + css = mem_cgroup_css(mem);
> > > + ino = css->cgroup->dentry->d_inode->i_ino;
> >
> > I have a question, can try_get_mem_cgroup_from_page() return
> > root_mem_cgroup?
> >
> yes.
>
> > if it can, then css->cgroup->dentry is NULL, if memcg is
> > not mounted and there is no subdir in memcg. Because the root
> > cgroup of an inactive subsystem has no dentry.
> >
>
> Nice catch. It sounds possible. That should be handled.
>
>

Yes, agreed, good catch!

--
Balbir

2009-12-09 09:15:06

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

> I have a question, can try_get_mem_cgroup_from_page() return
> root_mem_cgroup?

It could be called for any page.


> if it can, then css->cgroup->dentry is NULL, if memcg is
> not mounted and there is no subdir in memcg. Because the root
> cgroup of an inactive subsystem has no dentry.

Thanks. I'll just add an return -EINVAL for this case, sounds good?

-Andi

--
[email protected] -- Speaking for myself only.

2009-12-09 20:47:31

by Paul Menage

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

On Tue, Dec 8, 2009 at 1:16 PM, Andi Kleen <[email protected]> wrote:
>
> The hwpoison test suite need to inject hwpoison to a collection of
> selected task pages, and must not touch pages not owned by them and
> thus kill important system processes such as init. (But it's OK to
> mis-hwpoison free/unowned pages as well as shared clean pages.
> Mis-hwpoison of shared dirty pages will kill all tasks, so the test
> suite will target all or non of such tasks in the first place.)

While the functionality sounds useful, the interface (passing an inode
number) feels a bit ugly to me. Also, if that group is deleted and a
new cgroup created, you could end up reusing the inode number.

How about an approach where you write either the cgroup path (relative
to the memcg mount) or an fd open on the desired cgroup? Then you
could store a (counted) css reference rather than an inode number,
which would make the filter function cleaner too, since it would just
need to compare css objects.

Paul

2009-12-09 21:38:14

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] [19/31] mm: export stable page flags

On Wed, 2009-12-09 at 10:00 +0800, Wu Fengguang wrote:
> On Wed, Dec 09, 2009 at 06:27:31AM +0800, Matt Mackall wrote:
> > On Tue, 2009-12-08 at 22:16 +0100, Andi Kleen wrote:
> > > From: Wu Fengguang <[email protected]>
> > >
> > > Rename get_uflags() to stable_page_flags() and make it a global function
> > > for use in the hwpoison page flags filter, which need to compare user
> > > page flags with the value provided by user space.
> > >
> > > Also move KPF_* to kernel-page-flags.h for use by user space tools.
> > >
> > > CC: Matt Mackall <[email protected]>
> > > CC: Nick Piggin <[email protected]>
> > > CC: Christoph Lameter <[email protected]>
> > > CC: Andi Kleen <[email protected]>
> > > Signed-off-by: Wu Fengguang <[email protected]>
> > > Signed-off-by: Andi Kleen <[email protected]>
> >
> > Acked-by: Matt Mackall <[email protected]>
>
> Andi and Matt,
>
> Sorry the stable_page_flags() will be undefined on
> !CONFIG_PROC_PAGE_MONITOR (it is almost always on,
> except for some embedded systems).
>
> Currently the easy solution is to add a Kconfig dependency to
> CONFIG_PROC_PAGE_MONITOR. When there comes more users (ie. some
> ftrace event), we can then always compile in stable_page_flags().

No objections.

Acked-by: Matt Mackall <[email protected]>

--
http://selenic.com : development and support for Mercurial and Linux

2009-12-09 23:59:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

On Wed, 9 Dec 2009 12:47:27 -0800
Paul Menage <[email protected]> wrote:

> On Tue, Dec 8, 2009 at 1:16 PM, Andi Kleen <[email protected]> wrote:
> >
> > The hwpoison test suite need to inject hwpoison to a collection of
> > selected task pages, and must not touch pages not owned by them and
> > thus kill important system processes such as init. (But it's OK to
> > mis-hwpoison free/unowned pages as well as shared clean pages.
> > Mis-hwpoison of shared dirty pages will kill all tasks, so the test
> > suite will target all or non of such tasks in the first place.)
>
> While the functionality sounds useful, the interface (passing an inode
> number) feels a bit ugly to me. Also, if that group is deleted and a
> new cgroup created, you could end up reusing the inode number.
>
I agree.

> How about an approach where you write either the cgroup path (relative
> to the memcg mount) or an fd open on the desired cgroup? Then you
> could store a (counted) css reference rather than an inode number,
> which would make the filter function cleaner too, since it would just
> need to compare css objects.
>
Sounds reasonable.

Thanks,
-Kame

2009-12-10 01:42:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

> While the functionality sounds useful, the interface (passing an inode
> number) feels a bit ugly to me. Also, if that group is deleted and a
> new cgroup created, you could end up reusing the inode number.

Please note this is just a testing interface, doesn't need to be
100% fool-proof. It'll never be used in production.

>
> How about an approach where you write either the cgroup path (relative
> to the memcg mount) or an fd open on the desired cgroup? Then you
> could store a (counted) css reference rather than an inode number,
> which would make the filter function cleaner too, since it would just
> need to compare css objects.

Sounds complicated, I assume it would be much more code?
I would prefer to keep the testing interfaces as simple as possible.

-Andi
--
[email protected] -- Speaking for myself only.

2009-12-10 01:50:10

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [19/31] mm: export stable page flags

> Sorry the stable_page_flags() will be undefined on
> !CONFIG_PROC_PAGE_MONITOR (it is almost always on,
> except for some embedded systems).
>
> Currently the easy solution is to add a Kconfig dependency to
> CONFIG_PROC_PAGE_MONITOR. When there comes more users (ie. some
> ftrace event), we can then always compile in stable_page_flags().

I decided to turn it into select instead.

Your original patch didn't handle hwpoison as module btw.

-Andi

2009-12-10 02:09:28

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [19/31] mm: export stable page flags

On Thu, Dec 10, 2009 at 09:50:14AM +0800, Andi Kleen wrote:
> > Sorry the stable_page_flags() will be undefined on
> > !CONFIG_PROC_PAGE_MONITOR (it is almost always on,
> > except for some embedded systems).
> >
> > Currently the easy solution is to add a Kconfig dependency to
> > CONFIG_PROC_PAGE_MONITOR. When there comes more users (ie. some
> > ftrace event), we can then always compile in stable_page_flags().
>
> I decided to turn it into select instead.

OK, that would be good.

> Your original patch didn't handle hwpoison as module btw.

Yes I realized it later..

Thanks,
Fengguang

2009-12-10 02:21:20

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

* Andi Kleen <[email protected]> [2009-12-10 02:42:12]:

> > While the functionality sounds useful, the interface (passing an inode
> > number) feels a bit ugly to me. Also, if that group is deleted and a
> > new cgroup created, you could end up reusing the inode number.
>
> Please note this is just a testing interface, doesn't need to be
> 100% fool-proof. It'll never be used in production.
>
> >
> > How about an approach where you write either the cgroup path (relative
> > to the memcg mount) or an fd open on the desired cgroup? Then you
> > could store a (counted) css reference rather than an inode number,
> > which would make the filter function cleaner too, since it would just
> > need to compare css objects.
>
> Sounds complicated, I assume it would be much more code?
> I would prefer to keep the testing interfaces as simple as possible.
>

We do this for cgroupstats and the code is not very complicated. In
case you want to look, the user space code is at
Documentation/accounting/getdelays.c and the kernel code is in
kernel/taskstats.c

--
Balbir

2009-12-10 13:41:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [19/31] mm: export stable page flags

> > Your original patch didn't handle hwpoison as module btw.
>
> Yes I realized it later..

It's always a big trap, I've seen lots of people (including myself)
run into it.

I preferred to not have the ifdefs in the core module, simply
because it's very little additional code.

-Andi

--
[email protected] -- Speaking for myself only.

2009-12-11 02:14:06

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

On Thu, Dec 10, 2009 at 10:21:13AM +0800, Balbir Singh wrote:
> * Andi Kleen <[email protected]> [2009-12-10 02:42:12]:
>
> > > While the functionality sounds useful, the interface (passing an inode
> > > number) feels a bit ugly to me. Also, if that group is deleted and a
> > > new cgroup created, you could end up reusing the inode number.
> >
> > Please note this is just a testing interface, doesn't need to be
> > 100% fool-proof. It'll never be used in production.
> >
> > >
> > > How about an approach where you write either the cgroup path (relative
> > > to the memcg mount) or an fd open on the desired cgroup? Then you
> > > could store a (counted) css reference rather than an inode number,
> > > which would make the filter function cleaner too, since it would just
> > > need to compare css objects.
> >
> > Sounds complicated, I assume it would be much more code?
> > I would prefer to keep the testing interfaces as simple as possible.
> >
>
> We do this for cgroupstats and the code is not very complicated. In
> case you want to look, the user space code is at
> Documentation/accounting/getdelays.c and the kernel code is in
> kernel/taskstats.c

Balbir, thanks for the tip.

We could keep an fd open on the desired cgroup, in user space:

#!/bin/bash

mkdir /cgroup/hwpoison && \
exec 9<>/cgroup/hwpoison/tasks || exit 1

A bit simpler than an in-kernel fget_light() or CSS refcount :)

Thanks,
Fengguang

2009-12-14 12:53:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [23/31] HWPOISON: add memory cgroup filter

Wu Fengguang <[email protected]> writes:
>
> We could keep an fd open on the desired cgroup, in user space:
>
> #!/bin/bash
>
> mkdir /cgroup/hwpoison && \
> exec 9<>/cgroup/hwpoison/tasks || exit 1
>
> A bit simpler than an in-kernel fget_light() or CSS refcount :)

FYI, I decided to not do any of this in .33, but just keep the
ugly-but-working inode hack. We can look at fixing that for .34.
These interfaces are debugfs, so can be changed.

-Andi

--
[email protected] -- Speaking for myself only.

Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

Hi Andi,

On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <[email protected]> wrote:
>
> Process based injection is much easier to handle for test programs,
> who can first bring a page into a specific state and then test.
> So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
> to the existing hard offline injector.

I see that this made its way into 2.6.33. Could you write a short
piece on it for the madvise.2 man page?

Thanks,

Michael


> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> ?include/asm-generic/mman-common.h | ? ?1 +
> ?mm/madvise.c ? ? ? ? ? ? ? ? ? ? ?| ? 15 ++++++++++++---
> ?2 files changed, 13 insertions(+), 3 deletions(-)
>
> Index: linux/include/asm-generic/mman-common.h
> ===================================================================
> --- linux.orig/include/asm-generic/mman-common.h
> +++ linux/include/asm-generic/mman-common.h
> @@ -35,6 +35,7 @@
> ?#define MADV_DONTFORK ?10 ? ? ? ? ? ? ?/* don't inherit across fork */
> ?#define MADV_DOFORK ? ?11 ? ? ? ? ? ? ?/* do inherit across fork */
> ?#define MADV_HWPOISON ?100 ? ? ? ? ? ? /* poison a page for testing */
> +#define MADV_SOFT_OFFLINE 101 ? ? ? ? ?/* soft offline page for testing */
>
> ?#define MADV_MERGEABLE ? 12 ? ? ? ? ? ?/* KSM may merge identical pages */
> ?#define MADV_UNMERGEABLE 13 ? ? ? ? ? ?/* KSM may not merge identical pages */
> Index: linux/mm/madvise.c
> ===================================================================
> --- linux.orig/mm/madvise.c
> +++ linux/mm/madvise.c
> @@ -9,6 +9,7 @@
> ?#include <linux/pagemap.h>
> ?#include <linux/syscalls.h>
> ?#include <linux/mempolicy.h>
> +#include <linux/page-isolation.h>
> ?#include <linux/hugetlb.h>
> ?#include <linux/sched.h>
> ?#include <linux/ksm.h>
> @@ -222,7 +223,7 @@ static long madvise_remove(struct vm_are
> ?/*
> ?* Error injection support for memory error handling.
> ?*/
> -static int madvise_hwpoison(unsigned long start, unsigned long end)
> +static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
> ?{
> ? ? ? ?int ret = 0;
>
> @@ -233,6 +234,14 @@ static int madvise_hwpoison(unsigned lon
> ? ? ? ? ? ? ? ?int ret = get_user_pages_fast(start, 1, 0, &p);
> ? ? ? ? ? ? ? ?if (ret != 1)
> ? ? ? ? ? ? ? ? ? ? ? ?return ret;
> + ? ? ? ? ? ? ? if (bhv == MADV_SOFT_OFFLINE) {
> + ? ? ? ? ? ? ? ? ? ? ? printk(KERN_INFO "Soft offlining page %lx at %lx\n",
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? page_to_pfn(p), start);
> + ? ? ? ? ? ? ? ? ? ? ? ret = soft_offline_page(p, MF_COUNT_INCREASED);
> + ? ? ? ? ? ? ? ? ? ? ? if (ret)
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
> + ? ? ? ? ? ? ? ? ? ? ? continue;
> + ? ? ? ? ? ? ? }
> ? ? ? ? ? ? ? ?printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
> ? ? ? ? ? ? ? ? ? ? ? page_to_pfn(p), start);
> ? ? ? ? ? ? ? ?/* Ignore return value for now */
> @@ -333,8 +342,8 @@ SYSCALL_DEFINE3(madvise, unsigned long,
> ? ? ? ?size_t len;
>
> ?#ifdef CONFIG_MEMORY_FAILURE
> - ? ? ? if (behavior == MADV_HWPOISON)
> - ? ? ? ? ? ? ? return madvise_hwpoison(start, start+len_in);
> + ? ? ? if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> + ? ? ? ? ? ? ? return madvise_hwpoison(behavior, start, start+len_in);
> ?#endif
> ? ? ? ?if (!madvise_behavior_valid(behavior))
> ? ? ? ? ? ? ? ?return error;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. ?For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

2010-06-19 13:20:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
> Hi Andi,
>
> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <[email protected]> wrote:
> >
> > Process based injection is much easier to handle for test programs,
> > who can first bring a page into a specific state and then test.
> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
> > to the existing hard offline injector.
>
> I see that this made its way into 2.6.33. Could you write a short
> piece on it for the madvise.2 man page?

Also fixed the previous snippet slightly.


commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
Author: Andi Kleen <[email protected]>
Date: Sat Jun 19 15:19:28 2010 +0200

MADV_SOFT_OFFLINE

Signed-off-by: Andi Kleen <[email protected]>

diff --git a/man2/madvise.2 b/man2/madvise.2
index db29feb..9dccd97 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -154,7 +154,15 @@ processes.
This operation may result in the calling process receiving a
.B SIGBUS
and the page being unmapped.
-This feature is intended for memory testing.
+This feature is intended for testing of memory error handling code.
+This feature is only available if the kernel was configured with
+.BR CONFIG_MEMORY_FAILURE .
+.TP
+.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
+Soft offline a page. This will result in the memory of the page
+being copied to a new page and original page be offlined. The operation
+should be transparent to the calling process.
+This feature is intended for testing of memory error handling code.
This feature is only available if the kernel was configured with
.BR CONFIG_MEMORY_FAILURE .
.TP

Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

Hi Andi,

Thanks for this. Some comments below.

On Sat, Jun 19, 2010 at 3:20 PM, Andi Kleen <[email protected]> wrote:
> On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
>> Hi Andi,
>>
>> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <[email protected]> wrote:
>> >
>> > Process based injection is much easier to handle for test programs,
>> > who can first bring a page into a specific state and then test.
>> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
>> > to the existing hard offline injector.
>>
>> I see that this made its way into 2.6.33. Could you write a short
>> piece on it for the madvise.2 man page?
>
> Also fixed the previous snippet slightly.

(thanks)

> commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
> Author: Andi Kleen <[email protected]>
> Date: ? Sat Jun 19 15:19:28 2010 +0200
>
> ? ?MADV_SOFT_OFFLINE
>
> ? ?Signed-off-by: Andi Kleen <[email protected]>
>
> diff --git a/man2/madvise.2 b/man2/madvise.2
> index db29feb..9dccd97 100644
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -154,7 +154,15 @@ processes.
> ?This operation may result in the calling process receiving a
> ?.B SIGBUS
> ?and the page being unmapped.
> -This feature is intended for memory testing.
> +This feature is intended for testing of memory error handling code.
> +This feature is only available if the kernel was configured with
> +.BR CONFIG_MEMORY_FAILURE .
> +.TP
> +.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> +Soft offline a page. This will result in the memory of the page
> +being copied to a new page and original page be offlined. The operation

Can you explain the term "offlined" please.

> +should be transparent to the calling process.

Does "should be transparent" mean "is normally invisible"?

Thanks,

Michael

> +This feature is intended for testing of memory error handling code.
> ?This feature is only available if the kernel was configured with
> ?.BR CONFIG_MEMORY_FAILURE .
> ?.TP
>
>



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2010-06-19 13:30:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

On Sat, Jun 19, 2010 at 03:25:16PM +0200, Michael Kerrisk wrote:
> Hi Andi,
>
> Thanks for this. Some comments below.
>
> On Sat, Jun 19, 2010 at 3:20 PM, Andi Kleen <[email protected]> wrote:
> > On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
> >> Hi Andi,
> >>
> >> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <[email protected]> wrote:
> >> >
> >> > Process based injection is much easier to handle for test programs,
> >> > who can first bring a page into a specific state and then test.
> >> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
> >> > to the existing hard offline injector.
> >>
> >> I see that this made its way into 2.6.33. Could you write a short
> >> piece on it for the madvise.2 man page?
> >
> > Also fixed the previous snippet slightly.
>
> (thanks)
>
> > commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
> > Author: Andi Kleen <[email protected]>
> > Date: ? Sat Jun 19 15:19:28 2010 +0200
> >
> > ? ?MADV_SOFT_OFFLINE
> >
> > ? ?Signed-off-by: Andi Kleen <[email protected]>
> >
> > diff --git a/man2/madvise.2 b/man2/madvise.2
> > index db29feb..9dccd97 100644
> > --- a/man2/madvise.2
> > +++ b/man2/madvise.2
> > @@ -154,7 +154,15 @@ processes.
> > ?This operation may result in the calling process receiving a
> > ?.B SIGBUS
> > ?and the page being unmapped.
> > -This feature is intended for memory testing.
> > +This feature is intended for testing of memory error handling code.
> > +This feature is only available if the kernel was configured with
> > +.BR CONFIG_MEMORY_FAILURE .
> > +.TP
> > +.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> > +Soft offline a page. This will result in the memory of the page
> > +being copied to a new page and original page be offlined. The operation
>
> Can you explain the term "offlined" please.

The memory is not used anymore and taken out of normal
memory management (until unpoisoned)
and the "HardwareCorrupted:" counter in /proc/meminfo increases

(don't put the later in, I'm thinking about changing that)

>
> > +should be transparent to the calling process.
>
> Does "should be transparent" mean "is normally invisible"?

Yes. It's similar to being swapped out and swapped in again.

-Andi

Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

Hi Andi,

On Sat, Jun 19, 2010 at 3:30 PM, Andi Kleen <[email protected]> wrote:
> On Sat, Jun 19, 2010 at 03:25:16PM +0200, Michael Kerrisk wrote:
>> Hi Andi,
>>
>> Thanks for this. Some comments below.
>>
>> On Sat, Jun 19, 2010 at 3:20 PM, Andi Kleen <[email protected]> wrote:
>> > On Sat, Jun 19, 2010 at 02:36:28PM +0200, Michael Kerrisk wrote:
>> >> Hi Andi,
>> >>
>> >> On Tue, Dec 8, 2009 at 11:16 PM, Andi Kleen <[email protected]> wrote:
>> >> >
>> >> > Process based injection is much easier to handle for test programs,
>> >> > who can first bring a page into a specific state and then test.
>> >> > So add a new MADV_SOFT_OFFLINE to soft offline a page, similar
>> >> > to the existing hard offline injector.
>> >>
>> >> I see that this made its way into 2.6.33. Could you write a short
>> >> piece on it for the madvise.2 man page?
>> >
>> > Also fixed the previous snippet slightly.
>>
>> (thanks)
>>
>> > commit edb43354f0ffc04bf4f23f01261f9ea9f43e0d3d
>> > Author: Andi Kleen <[email protected]>
>> > Date: ? Sat Jun 19 15:19:28 2010 +0200
>> >
>> > ? ?MADV_SOFT_OFFLINE
>> >
>> > ? ?Signed-off-by: Andi Kleen <[email protected]>
>> >
>> > diff --git a/man2/madvise.2 b/man2/madvise.2
>> > index db29feb..9dccd97 100644
>> > --- a/man2/madvise.2
>> > +++ b/man2/madvise.2
>> > @@ -154,7 +154,15 @@ processes.
>> > ?This operation may result in the calling process receiving a
>> > ?.B SIGBUS
>> > ?and the page being unmapped.
>> > -This feature is intended for memory testing.
>> > +This feature is intended for testing of memory error handling code.
>> > +This feature is only available if the kernel was configured with
>> > +.BR CONFIG_MEMORY_FAILURE .
>> > +.TP
>> > +.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> > +Soft offline a page. This will result in the memory of the page
>> > +being copied to a new page and original page be offlined. The operation
>>
>> Can you explain the term "offlined" please.
>
> The memory is not used anymore and taken out of normal
> memory management (until unpoisoned)

Is there a userspace operation to unpoison (i.e., reverse MADV_SOFT_OFFLINE)?

I ask because I wondered if there is something additional to be documented.

> and the "HardwareCorrupted:" counter in /proc/meminfo increases
>
> (don't put the later in, I'm thinking about changing that)

Okay.

>>
>> > +should be transparent to the calling process.
>>
>> Does "should be transparent" mean "is normally invisible"?
>
> Yes. It's similar to being swapped out and swapped in again.

Okay.

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2010-06-19 14:09:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

On Sat, Jun 19, 2010 at 03:43:28PM +0200, Michael Kerrisk wrote:
> Is there a userspace operation to unpoison (i.e., reverse MADV_SOFT_OFFLINE)?

Yes, but it's only a debugfs interface currently.

> I ask because I wondered if there is something additional to be documented.

I don't think debugfs needs manpages atm.

-Andi
--
[email protected] -- Speaking for myself only.

Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

Hi Andi,

On Sat, Jun 19, 2010 at 4:09 PM, Andi Kleen <[email protected]> wrote:
> On Sat, Jun 19, 2010 at 03:43:28PM +0200, Michael Kerrisk wrote:
>> Is there a userspace operation to unpoison (i.e., reverse MADV_SOFT_OFFLINE)?
>
> Yes, but it's only a debugfs interface currently.

Okay -- thanks.

>> I ask because I wondered if there is something additional to be documented.
>
> I don't think debugfs needs manpages atm.

Okay.

I edited your text somewhat. Could you please review the below.

Cheers,

Michael

.TP
.BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
Soft offline the pages in the range specified by
.I addr
and
.IR length .
This memory of each page in the specified range is copied to a new page,
and the original page is offlined
(i.e., no longer used, and taken out of normal memory management).
The effect of the
.B MADV_SOFT_OFFLINE
operation is normally invisible to (i.e., does not change the semantics of)
the calling process.
This feature is intended for testing of memory error-handling code;
it is only available if the kernel was configured with
.BR CONFIG_MEMORY_FAILURE .

2010-06-19 19:52:44

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

> .TP
> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> Soft offline the pages in the range specified by
> .I addr
> and
> .IR length .
> This memory of each page in the specified range is copied to a new page,

Actually there are some cases where it's also dropped if it's cached page.

Perhaps better would be something more fuzzy like

"the contents are preserved"

> and the original page is offlined
> (i.e., no longer used, and taken out of normal memory management).

-Andi
--
[email protected] -- Speaking for myself only.

Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

Hi Andi,
On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <[email protected]> wrote:
>> .TP
>> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> Soft offline the pages in the range specified by
>> .I addr
>> and
>> .IR length .
>> This memory of each page in the specified range is copied to a new page,
>
> Actually there are some cases where it's also dropped if it's cached page.
>
> Perhaps better would be something more fuzzy like
>
> "the contents are preserved"

The problem to me is that this gets so fuzzy that it's hard to
understand the meaning (I imagine many readers will ask: "What does it
mean that the contents are preserved"?). Would you be able to come up
with a wording that is a little miore detailed?

>> and the original page is offlined
>> (i.e., no longer used, and taken out of normal memory management).

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2010-06-20 07:15:21

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
> Hi Andi,
> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <[email protected]> wrote:
> >> .TP
> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> >> Soft offline the pages in the range specified by
> >> .I addr
> >> and
> >> .IR length .
> >> This memory of each page in the specified range is copied to a new page,
> >
> > Actually there are some cases where it's also dropped if it's cached page.
> >
> > Perhaps better would be something more fuzzy like
> >
> > "the contents are preserved"
>
> The problem to me is that this gets so fuzzy that it's hard to
> understand the meaning (I imagine many readers will ask: "What does it
> mean that the contents are preserved"?). Would you be able to come up
> with a wording that is a little miore detailed?

That is, MADV_SOFT_OFFLINE won't lose data.

If a process writes "1" to some virtual address and then called
madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
to read "1" from that virtual address.

MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
frame with a new one that contains the same data "1". The original page
frame is offlined, and the new page frame may be installed lazily.

Thanks,
Fengguang

Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

Hi Fengguang,

On Sun, Jun 20, 2010 at 9:14 AM, Wu Fengguang <[email protected]> wrote:
> On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
>> Hi Andi,
>> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <[email protected]> wrote:
>> >> .TP
>> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> >> Soft offline the pages in the range specified by
>> >> .I addr
>> >> and
>> >> .IR length .
>> >> This memory of each page in the specified range is copied to a new page,
>> >
>> > Actually there are some cases where it's also dropped if it's cached page.
>> >
>> > Perhaps better would be something more fuzzy like
>> >
>> > "the contents are preserved"
>>
>> The problem to me is that this gets so fuzzy that it's hard to
>> understand the meaning (I imagine many readers will ask: "What does it
>> mean that the contents are preserved"?). Would you be able to come up
>> with a wording that is a little miore detailed?
>
> That is, MADV_SOFT_OFFLINE won't lose data.
>
> If a process writes "1" to some virtual address and then called
> madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
> to read "1" from that virtual address.
>
> MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
> frame with a new one that contains the same data "1". The original page
> frame is offlined, and the new page frame may be installed lazily.

Thanks. That helps me come up with a description that is I think a bit clearer:

MADV_SOFT_OFFLINE (Since Linux 2.6.33)
Soft offline the pages in the range specified by
addr and length. The memory of each page in the
specified range is preserved (i.e., when next
accessed, the same content will be visible, but
in a new physical page frame), and the original
page is offlined (i.e., no longer used, and
taken out of normal memory management). The
effect of the MADV_SOFT_OFFLINE operation is
invisible to (i.e., does not change the seman-
tics of) the calling process. ...

The actual patch for man-pages-3.26 is below.

Cheers,

Michael

--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -163,12 +163,14 @@ Soft offline the pages in the range specified by
.I addr
and
.IR length .
-The memory of each page in the specified range is copied to a new page,
+The memory of each page in the specified range is preserved
+(i.e., when next accessed, the same content will be visible,
+but in a new physical page frame),
and the original page is offlined
(i.e., no longer used, and taken out of normal memory management).
The effect of the
.B MADV_SOFT_OFFLINE
-operation is normally invisible to (i.e., does not change the semantics of)
+operation is invisible to (i.e., does not change the semantics of)
the calling process.
This feature is intended for testing of memory error-handling code;
it is only available if the kernel was configured with

2010-06-26 23:30:35

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

On Sat, Jun 26, 2010 at 09:18:52PM +0800, Michael Kerrisk wrote:
> Hi Fengguang,
>
> On Sun, Jun 20, 2010 at 9:14 AM, Wu Fengguang <[email protected]> wrote:
> > On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
> >> Hi Andi,
> >> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <[email protected]> wrote:
> >> >> .TP
> >> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
> >> >> Soft offline the pages in the range specified by
> >> >> .I addr
> >> >> and
> >> >> .IR length .
> >> >> This memory of each page in the specified range is copied to a new page,
> >> >
> >> > Actually there are some cases where it's also dropped if it's cached page.
> >> >
> >> > Perhaps better would be something more fuzzy like
> >> >
> >> > "the contents are preserved"
> >>
> >> The problem to me is that this gets so fuzzy that it's hard to
> >> understand the meaning (I imagine many readers will ask: "What does it
> >> mean that the contents are preserved"?). Would you be able to come up
> >> with a wording that is a little miore detailed?
> >
> > That is, MADV_SOFT_OFFLINE won't lose data.
> >
> > If a process writes "1" to some virtual address and then called
> > madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
> > to read "1" from that virtual address.
> >
> > MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
> > frame with a new one that contains the same data "1". The original page
> > frame is offlined, and the new page frame may be installed lazily.
>
> Thanks. That helps me come up with a description that is I think a bit clearer:
>
> MADV_SOFT_OFFLINE (Since Linux 2.6.33)
> Soft offline the pages in the range specified by
> addr and length. The memory of each page in the
> specified range is preserved (i.e., when next
> accessed, the same content will be visible, but
> in a new physical page frame), and the original
> page is offlined (i.e., no longer used, and
> taken out of normal memory management). The
> effect of the MADV_SOFT_OFFLINE operation is
> invisible to (i.e., does not change the seman-
> tics of) the calling process. ...
>
> The actual patch for man-pages-3.26 is below.

Thanks. The change looks good to me.

Note that the other perceivable change may be a little access delay.
The kernel could choose to simply drop the in-memory data when there
is another copy in disk. When accessed again, the content for the new
physical page will be populated from disk IO.

Thanks,
Fengguang

>
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -163,12 +163,14 @@ Soft offline the pages in the range specified by
> .I addr
> and
> .IR length .
> -The memory of each page in the specified range is copied to a new page,
> +The memory of each page in the specified range is preserved
> +(i.e., when next accessed, the same content will be visible,
> +but in a new physical page frame),
> and the original page is offlined
> (i.e., no longer used, and taken out of normal memory management).
> The effect of the
> .B MADV_SOFT_OFFLINE
> -operation is normally invisible to (i.e., does not change the semantics of)
> +operation is invisible to (i.e., does not change the semantics of)
> the calling process.
> This feature is intended for testing of memory error-handling code;
> it is only available if the kernel was configured with

Subject: Re: [PATCH] [31/31] HWPOISON: Add a madvise() injector for soft page offlining

On Sun, Jun 27, 2010 at 1:30 AM, Wu Fengguang <[email protected]> wrote:
> On Sat, Jun 26, 2010 at 09:18:52PM +0800, Michael Kerrisk wrote:
>> Hi Fengguang,
>>
>> On Sun, Jun 20, 2010 at 9:14 AM, Wu Fengguang <[email protected]> wrote:
>> > On Sun, Jun 20, 2010 at 02:19:35PM +0800, Michael Kerrisk wrote:
>> >> Hi Andi,
>> >> On Sat, Jun 19, 2010 at 9:52 PM, Andi Kleen <[email protected]> wrote:
>> >> >> .TP
>> >> >> .BR MADV_SOFT_OFFLINE " (Since Linux 2.6.33)
>> >> >> Soft offline the pages in the range specified by
>> >> >> .I addr
>> >> >> and
>> >> >> .IR length .
>> >> >> This memory of each page in the specified range is copied to a new page,
>> >> >
>> >> > Actually there are some cases where it's also dropped if it's cached page.
>> >> >
>> >> > Perhaps better would be something more fuzzy like
>> >> >
>> >> > "the contents are preserved"
>> >>
>> >> The problem to me is that this gets so fuzzy that it's hard to
>> >> understand the meaning (I imagine many readers will ask: "What does it
>> >> mean that the contents are preserved"?). Would you be able to come up
>> >> with a wording that is a little miore detailed?
>> >
>> > That is, MADV_SOFT_OFFLINE won't lose data.
>> >
>> > If a process writes "1" to some virtual address and then called
>> > madvice(MADV_SOFT_OFFLINE) on that virtual address, it can continue
>> > to read "1" from that virtual address.
>> >
>> > MADV_SOFT_OFFLINE "transparently" replaces the underlying physical page
>> > frame with a new one that contains the same data "1". The original page
>> > frame is offlined, and the new page frame may be installed lazily.
>>
>> Thanks. That helps me come up with a description that is I think a bit clearer:
>>
>> ? ? ? ?MADV_SOFT_OFFLINE (Since Linux 2.6.33)
>> ? ? ? ? ? ? ? Soft offline the pages in the range specified by
>> ? ? ? ? ? ? ? addr and length. ?The memory of each page in the
>> ? ? ? ? ? ? ? specified ?range ?is ?preserved (i.e., when next
>> ? ? ? ? ? ? ? accessed, the same content will be visible, ?but
>> ? ? ? ? ? ? ? in ?a new physical page frame), and the original
>> ? ? ? ? ? ? ? page is offlined ?(i.e., ?no ?longer ?used, ?and
>> ? ? ? ? ? ? ? taken ?out ?of ?normal ?memory management). ?The
>> ? ? ? ? ? ? ? effect of ?the ?MADV_SOFT_OFFLINE ?operation ?is
>> ? ? ? ? ? ? ? invisible ?to ?(i.e., does not change the seman-
>> ? ? ? ? ? ? ? tics of) the calling process. ...
>>
>> The actual patch for man-pages-3.26 is below.
>
> Thanks. The change looks good to me.

Thanks for checking it.

> Note that the other perceivable change may be a little access delay.
> The kernel could choose to simply drop the in-memory data when there
> is another copy in disk. When accessed again, the content for the new
> physical page will be populated from disk IO.

Yes, I'd suposed as much, but decided that was a detail that probably
didm\t need to be mentioned in tha man page.

Thanks,

Michael


>>
>> --- a/man2/madvise.2
>> +++ b/man2/madvise.2
>> @@ -163,12 +163,14 @@ Soft offline the pages in the range specified by
>> ?.I addr
>> ?and
>> ?.IR length .
>> -The memory of each page in the specified range is copied to a new page,
>> +The memory of each page in the specified range is preserved
>> +(i.e., when next accessed, the same content will be visible,
>> +but in a new physical page frame),
>> ?and the original page is offlined
>> ?(i.e., no longer used, and taken out of normal memory management).
>> ?The effect of the
>> ?.B MADV_SOFT_OFFLINE
>> -operation is normally invisible to (i.e., does not change the semantics of)
>> +operation is invisible to (i.e., does not change the semantics of)
>> ?the calling process.
>> ?This feature is intended for testing of memory error-handling code;
>> ?it is only available if the kernel was configured with
>



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/