2012-08-10 21:42:18

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 0/3 v1] HWPOISON: improve dirty pagecache error handling

Hi,

This patchset is to improve handling and reporting of memory errors on
dirty pagecache.

Patch 1 is to fix a messaging bug, and patch 2 is to temporarily undo
the code which can happen the data lost. I think these two are obvious
fixes so I want to push them to merge promptly.

Patch 3 is for a new feature. The problem in error reporting (where AS_EIO
we rely on to report the error to userspace is cleared once checked) is
discussed when hwpoison core patches were reviewed, and we left it unfixed
because it can be fixed with more generic solution which covers legacy EIO.
But in my opinion, legacy EIO and hwpoison are different in how it can or
should be handled (for example, as described in patch 3, we can recover
from memory errors on dirty pagecache with overwriting.) So this patch
only solves the problem of memory error reporting.

My test for this patchset is available on:
https://github.com/Naoya-Horiguchi/test_memory_error_on_dirty_pagecache.git

Could you review or comment?

Thanks,
Naoya


2012-08-10 21:42:21

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean

action_result() fails to print out "dirty" even if an error occurred on a
dirty pagecache, because when we check PageDirty in action_result() it was
cleared after page isolation even if it's dirty before error handling. This
can break some applications that monitor this message, so should be fixed.

There are several callers of action_result() except page_action(), but
either of them are not for LRU pages but for free pages or kernel pages,
so we don't have to consider dirty or not for them.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/memory-failure.c | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)

diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
index a6e2141..79dfb2f 100644
--- v3.6-rc1.orig/mm/memory-failure.c
+++ v3.6-rc1/mm/memory-failure.c
@@ -779,16 +779,16 @@ static struct page_state {
{ compound, compound, "huge", me_huge_page },
#endif

- { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty },
- { sc|dirty, sc, "swapcache", me_swapcache_clean },
+ { sc|dirty, sc|dirty, "dirty swapcache", me_swapcache_dirty },
+ { sc|dirty, sc, "clean swapcache", me_swapcache_clean },

- { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty},
- { unevict, unevict, "unevictable LRU", me_pagecache_clean},
+ { unevict|dirty, unevict|dirty, "dirty unevictable LRU", me_pagecache_dirty },
+ { unevict, unevict, "clean unevictable LRU", me_pagecache_clean },

- { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty },
- { mlock, mlock, "mlocked LRU", me_pagecache_clean },
+ { mlock|dirty, mlock|dirty, "dirty mlocked LRU", me_pagecache_dirty },
+ { mlock, mlock, "clean mlocked LRU", me_pagecache_clean },

- { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
+ { lru|dirty, lru|dirty, "dirty LRU", me_pagecache_dirty },
{ lru|dirty, lru, "clean LRU", me_pagecache_clean },

/*
@@ -812,12 +812,8 @@ static struct page_state {

static void action_result(unsigned long pfn, char *msg, int result)
{
- struct page *page = pfn_to_page(pfn);
-
- printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
- pfn,
- PageDirty(page) ? "dirty " : "",
- msg, action_name[result]);
+ pr_err("MCE %#lx: %s page recovery: %s\n",
+ pfn, msg, action_name[result]);
}

static int page_action(struct page_state *ps, struct page *p,
--
1.7.11.2

2012-08-10 21:42:43

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

Current error reporting of memory errors on dirty pagecache has silent
data lost problem because AS_EIO in struct address_space is cleared
once checked.
A simple solution is to make AS_EIO sticky (as Wu Fengguang proposed in
https://lkml.org/lkml/2009/6/11/294), but this patch does more to make
dirty pagecache error recoverable under some conditions. Consider that
if there is a copy of the corrupted dirty pagecache on user buffer and
you write() over the error page with the copy data, then we can ignore
the effect of the error because no one consumes the corrupted data.

To implement this, this patch does roughly the following:
- add data structures and handling routines to manage the metadata
of memory errors on dirty pagecaches,
- return -EHWPOISON when we access to the error-affected address with
read(), partial-page write(), fsync(),
- cancel hwpoison when we do full-page write() over the error-affected
address.

One reason why we have a separate flag AS_HWPOISON is that the conditions
of clearing flags differs between legacy IO error and memory error. AS_EIO
is cleared when subsequent writeback for the error-affected file succeeds.
OTOH, AS_HWPOISON can be cleared when a pagecache on which the error lies
is fully overwritten with copy data in user buffer.
Another reason is that we expect user processes which get the error report
from the kernel to handle it differently between the two types of errors.
Processes which get -EHWPOISON can search copy data in their buffers and
try to write() over the error pages if they have.

We have one behavioral change on PageHWPoison flag. Before this patch,
PageHWPoison means literally "the page is corrupted," and the pages with
PageHWPoison set are never reused. After this patch, we give another role
to this flag. When a user process tries to access the address which was
backed by the corrupted page (which is already removed from pagecache by
memory error handler,) we permit to add a new page onto the pagecache
with PageHWPoison flag set. But we refuse to read() and partial write()
on the page until the PageHWPoison flag is cleared by whole-page write().

Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/page-flags.h | 2 +
include/linux/pagemap.h | 91 ++++++++++++
mm/filemap.c | 51 +++++++
mm/memory-failure.c | 343 +++++++++++++++++++++++++++++++++++++++++++--
mm/truncate.c | 3 +
5 files changed, 479 insertions(+), 11 deletions(-)

diff --git v3.6-rc1.orig/include/linux/page-flags.h v3.6-rc1/include/linux/page-flags.h
index b5d1384..25bbde0 100644
--- v3.6-rc1.orig/include/linux/page-flags.h
+++ v3.6-rc1/include/linux/page-flags.h
@@ -272,6 +272,8 @@ TESTSCFLAG(HWPoison, hwpoison)
#define __PG_HWPOISON (1UL << PG_hwpoison)
#else
PAGEFLAG_FALSE(HWPoison)
+SETPAGEFLAG_NOOP(HWPoison)
+CLEARPAGEFLAG_NOOP(HWPoison)
#define __PG_HWPOISON 0
#endif

diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h
index e42c762..8b18560 100644
--- v3.6-rc1.orig/include/linux/pagemap.h
+++ v3.6-rc1/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
+ AS_HWPOISON = __GFP_BITS_SHIFT + 4, /* pagecache is hwpoisoned */
};

static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -31,6 +32,8 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
if (unlikely(error)) {
if (error == -ENOSPC)
set_bit(AS_ENOSPC, &mapping->flags);
+ else if (error == -EHWPOISON)
+ set_bit(AS_HWPOISON, &mapping->flags);
else
set_bit(AS_EIO, &mapping->flags);
}
@@ -541,4 +544,92 @@ static inline int add_to_page_cache(struct page *page,
return error;
}

+#ifdef CONFIG_MEMORY_FAILURE
+extern int __hwpoison_file_range(struct address_space *mapping, loff_t start,
+ loff_t end);
+extern int __hwpoison_partial_write(struct address_space *mapping, loff_t pos,
+ size_t count);
+extern void __remove_hwp_dirty_pgoff(struct address_space *mapping,
+ pgoff_t index);
+extern void __remove_hwp_dirty_file(struct inode *inode);
+extern void __add_fake_hwpoison(struct page *page,
+ struct address_space *mapping, pgoff_t index);
+extern void __remove_fake_hwpoison(struct page *page,
+ struct address_space *mapping);
+
+static inline int hwpoison_file_range(struct address_space *mapping,
+ loff_t start, loff_t end)
+{
+ if (unlikely(test_bit(AS_HWPOISON, &mapping->flags)))
+ return __hwpoison_file_range(mapping, start, end);
+ return 0;
+}
+
+static inline int hwpoison_partial_write(struct address_space *mapping,
+ loff_t pos, size_t count)
+{
+ if (unlikely(test_bit(AS_HWPOISON, &mapping->flags)))
+ return __hwpoison_partial_write(mapping, pos, count);
+ return 0;
+}
+
+static inline void remove_hwp_dirty_pgoff(struct address_space *mapping,
+ pgoff_t index)
+{
+ if (unlikely(test_bit(AS_HWPOISON, &mapping->flags)))
+ __remove_hwp_dirty_pgoff(mapping, index);
+}
+
+static inline void remove_hwp_dirty_file(struct inode *inode)
+{
+ if (unlikely(test_bit(AS_HWPOISON, &inode->i_mapping->flags)))
+ __remove_hwp_dirty_file(inode);
+}
+
+static inline void add_fake_hwpoison(struct page *page,
+ struct address_space *mapping, pgoff_t index)
+{
+ if (unlikely(test_bit(AS_HWPOISON, &mapping->flags)))
+ __add_fake_hwpoison(page, mapping, index);
+}
+
+static inline void remove_fake_hwpoison(struct page *page,
+ struct address_space *mapping)
+{
+ if (unlikely(test_bit(AS_HWPOISON, &mapping->flags)))
+ __remove_fake_hwpoison(page, mapping);
+}
+#else
+static inline int hwpoison_file_range(struct address_space *mapping,
+ loff_t start, loff_t end)
+{
+ return 0;
+}
+
+static inline int hwpoison_partial_write(struct address_space *mapping,
+ loff_t pos, size_t count)
+{
+ return 0;
+}
+
+static inline void remove_hwp_dirty_pgoff(struct address_space *mapping,
+ pgoff_t index)
+{
+}
+
+static inline void remove_hwp_dirty_file(struct inode *inode)
+{
+}
+
+static inline void add_fake_hwpoison(struct page *page,
+ struct address_space *mapping, pgoff_t index)
+{
+}
+
+static inline void remove_fake_hwpoison(struct page *page,
+ struct address_space *mapping)
+{
+}
+#endif /* CONFIG_MEMORY_FAILURE */
+
#endif /* _LINUX_PAGEMAP_H */
diff --git v3.6-rc1.orig/mm/filemap.c v3.6-rc1/mm/filemap.c
index fa5ca30..b446f7c 100644
--- v3.6-rc1.orig/mm/filemap.c
+++ v3.6-rc1/mm/filemap.c
@@ -123,6 +123,9 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);

+ if (unlikely(PageHWPoison(page)))
+ remove_fake_hwpoison(page, mapping);
+
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
@@ -270,6 +273,9 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
if (end_byte < start_byte)
return 0;

+ if (unlikely(hwpoison_file_range(mapping, start_byte, end_byte)))
+ return -EHWPOISON;
+
pagevec_init(&pvec, 0);
while ((index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
@@ -369,6 +375,16 @@ int filemap_write_and_wait_range(struct address_space *mapping,
err = err2;
}
}
+
+ /*
+ * When AS_HWPOISON is set, dirty page with memory error is
+ * removed from pagecache and mapping->nrpages is decreased by 1.
+ * So in order to detect memory error on single page file, we need
+ * to check AS_HWPOISON bit outside if(mapping->nrpages) block below.
+ */
+ if (unlikely(hwpoison_file_range(mapping, lstart, lend)))
+ return -EHWPOISON;
+
return err;
}
EXPORT_SYMBOL(filemap_write_and_wait_range);
@@ -447,6 +463,22 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

+ /*
+ * Mark as "fake hwpoison" which shows that the virtual address range
+ * backed by the page are affected by a memory error. Accesse to
+ * the fake hwpoison page will get -EHWPOISON if it's read or partial
+ * page write.
+ * The benefit of fake hwpoison is that we can get it back to the
+ * healthy page (i.e. we can ignore the memory error) by full page
+ * overwrite. Applications are expected to do full page overwrite
+ * from their own buffers when they find memory errors on the files
+ * which they open.
+ * We can distinguish fake hwpoison pages from real hwpoison (i.e.
+ * physically corrupted) pages by checking struct hwp_dirty (see
+ * mm/memory-failure.c).
+ */
+ add_fake_hwpoison(page, mapping, offset);
+
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
@@ -1114,6 +1146,12 @@ find_page:
if (unlikely(page == NULL))
goto no_cached_page;
}
+
+ if (unlikely(PageHWPoison(page))) {
+ error = -EHWPOISON;
+ goto readpage_error;
+ }
+
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
@@ -2085,6 +2123,9 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
if (unlikely(*pos < 0))
return -EINVAL;

+ if (unlikely(hwpoison_partial_write(file->f_mapping, *pos, *count)))
+ return -EHWPOISON;
+
if (!isblk) {
/* FIXME: this is for backwards compatibility with 2.4 */
if (file->f_flags & O_APPEND)
@@ -2339,6 +2380,16 @@ again:
flush_dcache_page(page);

mark_page_accessed(page);
+ /*
+ * By overwriting hwpoisoned page from userspace buffers, we
+ * can "detoxify" the hwpoison because no corrupted data will
+ * be consumed after this.
+ */
+ if (unlikely(PageHWPoison(page))) {
+ VM_BUG_ON(!test_bit(AS_HWPOISON, &mapping->flags));
+ ClearPageHWPoison(page);
+ remove_hwp_dirty_pgoff(mapping, page_index(page));
+ }
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
if (unlikely(status < 0))
diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
index 7e62797..348fa05 100644
--- v3.6-rc1.orig/mm/memory-failure.c
+++ v3.6-rc1/mm/memory-failure.c
@@ -607,23 +607,334 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
}

/*
+ * A control structure which keeps information about dirty pagecache error.
+ * This allows user processes to know the virtual address of corrupted pages,
+ * which is useful to make applications handle memroy errors.
+ * Especially if applications have the copy of corrupted data on their own
+ * buffers, they can take a recovery action by overwriting the error page.
+ */
+struct hwp_dirty {
+ struct hlist_node hash;
+ struct hlist_node hash_pfn;
+ struct page *page; /* real hwpoison (physically corrupted) page */
+ struct page *fpage; /* fake hwpoison page */
+ struct address_space *mapping;
+ unsigned long index; /* page offset of hwpoison page in the file */
+};
+#define HWP_DIRTY_HASH_HEADS 64
+static struct hlist_head *hwp_dirty_hash;
+static struct hlist_head *hwp_dirty_hash_pfn;
+static DEFINE_SPINLOCK(hwp_dirty_lock);
+
+static inline struct hlist_head *hwp_dirty_hlist_head(struct address_space *as)
+{
+ return &hwp_dirty_hash[((unsigned long)as/sizeof(struct address_space))
+ % HWP_DIRTY_HASH_HEADS];
+}
+
+static inline struct hlist_head *hwp_dirty_hlist_pfn_head(unsigned long pfn)
+{
+ return &hwp_dirty_hash_pfn[(pfn/sizeof(unsigned long))
+ % HWP_DIRTY_HASH_HEADS];
+}
+
+/*
+ * Check whether given address range [start,end) overlaps corrupted dirty
+ * pagecaches or not. Returns 1 if it's the case, and returns 0 otherwise.
+ */
+int __hwpoison_file_range(struct address_space *mapping,
+ loff_t start, loff_t end)
+{
+ int ret = 0;
+ struct hwp_dirty *hwp;
+ loff_t hwpstart, hwpend;
+ struct hlist_head *head;
+ struct hlist_node *node1, *node2;
+
+ spin_lock(&hwp_dirty_lock);
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash) {
+ if (mapping != hwp->mapping)
+ continue;
+ hwpstart = hwp->index << PAGE_SHIFT;
+ hwpend = (hwp->index + 1) << PAGE_SHIFT;
+ if (!(hwpend <= start || end < hwpstart)) {
+ ret = 1;
+ break;
+ }
+ }
+ spin_unlock(&hwp_dirty_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__hwpoison_file_range);
+
+/*
+ * Check whether given write range partially covers corrupted dirty pagecaches
+ * or not. Writing over the entire range of the corrupted dirty pagecache is
+ * *acceptable*, because it does not cause any corrupted data consumption.
+ * Returns 1 if either @pos or (@pos + @count) is inside the range of dirty
+ * pagecache. Returns 0 otherwise.
+ *
+ * |....|....|XXXX|....|....| => returns 0 (covered entirely)
+ * ^------^
+ * @pos @pos + @count
+ *
+ * |....|....|XXXX|....|....| => returns 1 (covered partially)
+ * ^----^
+ *
+ * |....|XXXX|....|XXXX|....| => returns 1 (covered partially)
+ * ^-------------^
+ *
+ * |....|XXXX|....|XXXX|....| => returns 0 (covered entirely)
+ * ^------------------^
+ *
+ * |....| : healthy page
+ * |XXXX| : corrupted page
+ */
+int __hwpoison_partial_write(struct address_space *mapping,
+ loff_t pos, size_t count)
+{
+ int ret = 0;
+ struct hwp_dirty *hwp;
+ loff_t hwpstart, hwpend;
+ struct hlist_head *head;
+ struct hlist_node *node1, *node2;
+
+ spin_lock(&hwp_dirty_lock);
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash) {
+ if (mapping != hwp->mapping)
+ continue;
+ hwpstart = hwp->index << PAGE_SHIFT;
+ hwpend = (hwp->index + 1) << PAGE_SHIFT;
+ if ((hwpstart < pos && pos < hwpend) ||
+ (hwpstart < pos + count && pos + count < hwpend)) {
+ ret = 1;
+ break;
+ }
+ }
+ spin_unlock(&hwp_dirty_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(__hwpoison_partial_write);
+
+static void add_hwp_dirty(struct hwp_dirty *hwp)
+{
+ struct hlist_head *head;
+ struct address_space *mapping = hwp->mapping;
+ unsigned long pfn = page_to_pfn(hwp->page);
+
+ spin_lock(&hwp_dirty_lock);
+ /*
+ * Keep inode cache (in the result AS_HWPOISON bit in address_space)
+ * on memory to remember that the file experienced dirty pagecache
+ * errors when reopened in the future.
+ */
+ igrab(mapping->host);
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_add_head(&hwp->hash, head);
+ head = hwp_dirty_hlist_pfn_head(pfn);
+ hlist_add_head(&hwp->hash_pfn, head);
+ spin_unlock(&hwp_dirty_lock);
+}
+
+static void __remove_hwp_dirty(struct hwp_dirty *hwp, int count)
+{
+ hlist_del(&hwp->hash);
+ hlist_del(&hwp->hash_pfn);
+ /*
+ * If you try to unpoison the corrupted page with which a fake
+ * hwpoison associated, you also unpoison the fake hwpoison page for
+ * consistency. Unpoisoning should not be used on production systems.
+ */
+ if (hwp->fpage)
+ TestClearPageHWPoison(hwp->fpage);
+ if (count == 1)
+ clear_bit(AS_HWPOISON, &hwp->mapping->flags);
+ iput(hwp->mapping->host);
+ kfree(hwp);
+}
+
+/*
+ * Used by unpoison_memory(). If you try to unpoison a fake hwpoison
+ * which is linked to pagecache and not counted in mce_bad_pages,
+ * this function returns 1 to avoid switching on freeit flag in the
+ * caller and counting down mce_bad_pages. Otherwise, returns 0.
+ */
+static int remove_hwp_dirty_page(struct page *page)
+{
+ int ret = 0;
+ struct hwp_dirty *hwp;
+ struct hwp_dirty *hwp_hit = NULL;
+ struct hlist_head *head;
+ struct hlist_node *node1, *node2;
+ struct address_space *mapping = page_mapping(page);
+
+ spin_lock(&hwp_dirty_lock);
+ /* Unpoison a real corrupted page */
+ if (!mapping) {
+ head = hwp_dirty_hlist_pfn_head(page_to_pfn(page));
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash_pfn)
+ if (page == hwp->page) {
+ hwp_hit = hwp;
+ break;
+ }
+ if (hwp_hit) {
+ int count = 0;
+ head = hwp_dirty_hlist_head(hwp_hit->mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash)
+ if (hwp->mapping == hwp_hit->mapping)
+ count++;
+ __remove_hwp_dirty(hwp_hit, count);
+ }
+ /* Unpoison a fake hwpoison page */
+ } else {
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash)
+ if (page == hwp->fpage) {
+ ret = 1;
+ break;
+ }
+ }
+ spin_unlock(&hwp_dirty_lock);
+ return ret;
+}
+
+/*
+ * Remove hwp_dirty for the given page offset @index in the file with which
+ * @mapping is associated.
+ */
+void __remove_hwp_dirty_pgoff(struct address_space *mapping, pgoff_t index)
+{
+ int count = 0;
+ struct hwp_dirty *hwp;
+ struct hwp_dirty *hwp_hit = NULL;
+ struct hlist_head *head;
+ struct hlist_node *node1, *node2;
+
+ spin_lock(&hwp_dirty_lock);
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash) {
+ if (mapping == hwp->mapping) {
+ count++;
+ if (index == hwp->index)
+ hwp_hit = hwp;
+ }
+ }
+ if (hwp_hit)
+ __remove_hwp_dirty(hwp_hit, count);
+ spin_unlock(&hwp_dirty_lock);
+}
+EXPORT_SYMBOL_GPL(__remove_hwp_dirty_pgoff);
+
+/*
+ * Remove all dirty pagecache errors which belong to the given file
+ * represented by @inode from hwp_dirty_hash.
+ */
+void __remove_hwp_dirty_file(struct inode *inode)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct hwp_dirty *hwp;
+ struct hlist_head *head;
+ struct hlist_node *node1, *node2;
+
+ spin_lock(&hwp_dirty_lock);
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash)
+ if (mapping == hwp->mapping)
+ __remove_hwp_dirty(hwp, 0);
+ clear_bit(AS_HWPOISON, &mapping->flags);
+ spin_unlock(&hwp_dirty_lock);
+}
+
+
+void __add_fake_hwpoison(struct page *page, struct address_space *mapping,
+ pgoff_t index)
+{
+ struct hwp_dirty *hwp;
+ struct hlist_head *head;
+ struct hlist_node *node1, *node2;
+
+ spin_lock(&hwp_dirty_lock);
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash) {
+ if (mapping == hwp->mapping && index == hwp->index) {
+ hwp->fpage = page;
+ SetPageHWPoison(page);
+ break;
+ }
+ }
+ spin_unlock(&hwp_dirty_lock);
+}
+EXPORT_SYMBOL_GPL(__add_fake_hwpoison);
+
+void __remove_fake_hwpoison(struct page *page, struct address_space *mapping)
+{
+ struct hwp_dirty *hwp;
+ struct hlist_head *head;
+ struct hlist_node *node1, *node2;
+ pgoff_t index = page_index(page);
+
+ spin_lock(&hwp_dirty_lock);
+ head = hwp_dirty_hlist_head(mapping);
+ hlist_for_each_entry_safe(hwp, node1, node2, head, hash) {
+ if (page == hwp->fpage && mapping == hwp->mapping &&
+ index == hwp->index) {
+ hwp->fpage = NULL;
+ ClearPageHWPoison(page);
+ break;
+ }
+ }
+ spin_unlock(&hwp_dirty_lock);
+}
+EXPORT_SYMBOL_GPL(__remove_fake_hwpoison);
+
+/*
* Dirty cache page page
* Issues: when the error hit a hole page the error is not properly
* propagated.
*/
static int me_pagecache_dirty(struct page *p, unsigned long pfn)
{
- /*
- * The original memory error handling on dirty pagecache has
- * a bug that user processes who use corrupted pages via read()
- * or write() can't be aware of the memory error and result
- * in throwing out dirty data silently.
- *
- * Until we solve the problem, let's close the path of memory
- * error handling for dirty pagecache. We just leave errors
- * for the 2nd MCE to trigger panics.
- */
- return IGNORED;
+ struct address_space *mapping = page_mapping(p);
+
+ SetPageError(p);
+ if (mapping) {
+ struct hwp_dirty *hwp;
+ struct inode *inode = mapping->host;
+
+ /*
+ * Memory error is reported to userspace by AS_HWPOISON flags
+ * in mapping->flags. The mechanism is similar to that of
+ * AS_EIO, but we have separete flags because there'are two
+ * differences between them:
+ * 1. Expected userspace handling. When user processes get
+ * -EIO, they can retry writeback hoping the error in IO
+ * devices is temporary, switch to write to other devices,
+ * or do some other application-specific handling.
+ * For -EHWPOISON, we can clear the error by overwriting
+ * the corrupted page.
+ * 2. When to clear. For -EIO, we can think that we recover
+ * from the error when writeback succeeds. For -EHWPOISON
+ * OTOH, we can see that things are back to normal when
+ * corrupted data are overwritten from user buffer.
+ */
+ hwp = kmalloc(sizeof(struct hwp_dirty), GFP_ATOMIC);
+ hwp->page = p;
+ hwp->fpage = NULL;
+ hwp->mapping = mapping;
+ hwp->index = page_index(p);
+ hwp->ino = inode->i_ino;
+ hwp->dev = inode->i_sb->s_dev;
+ add_hwp_dirty(hwp);
+
+ pr_err("MCE %#lx: Corrupted dirty pagecache, dev %u:%u, inode:%lu, index:%lu\n",
+ pfn, MAJOR(inode->i_sb->s_dev),
+ MINOR(inode->i_sb->s_dev), inode->i_ino, page_index(p));
+ mapping_set_error(mapping, -EHWPOISON);
+ }
+
+ return me_pagecache_clean(p, pfn);
}

/*
@@ -1237,6 +1548,13 @@ static int __init memory_failure_init(void)
INIT_WORK(&mf_cpu->work, memory_failure_work_func);
}

+ hwp_dirty_hash = kzalloc(HWP_DIRTY_HASH_HEADS *
+ sizeof(struct hlist_head), GFP_KERNEL);
+ hwp_dirty_hash_pfn = kzalloc(HWP_DIRTY_HASH_HEADS *
+ sizeof(struct hlist_head), GFP_KERNEL);
+ if (!hwp_dirty_hash || !hwp_dirty_hash_pfn)
+ return -ENOMEM;
+
return 0;
}
core_initcall(memory_failure_init);
@@ -1299,11 +1617,14 @@ int unpoison_memory(unsigned long pfn)
*/
if (TestClearPageHWPoison(page)) {
pr_info("MCE: Software-unpoisoned page %#lx\n", pfn);
+ if (remove_hwp_dirty_page(page))
+ goto unlock;
atomic_long_sub(nr_pages, &mce_bad_pages);
freeit = 1;
if (PageHuge(page))
clear_page_hwpoison_huge_page(page);
}
+unlock:
unlock_page(page);

put_page(page);
diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
index 75801ac..1930b73 100644
--- v3.6-rc1.orig/mm/truncate.c
+++ v3.6-rc1/mm/truncate.c
@@ -217,6 +217,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (mapping->nrpages == 0)
return;

+ if (!hwpoison_file_range(mapping, 0, lstart))
+ remove_hwp_dirty_file(mapping->host);
+
BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
end = (lend >> PAGE_CACHE_SHIFT);

--
1.7.11.2

2012-08-10 21:42:24

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 2/3] HWPOISON: undo memory error handling for dirty pagecache

Current memory error handling on dirty pagecache has a bug that user
processes who use corrupted pages via read() or write() can't be aware
of the memory error and result in discarding dirty data silently.

The following patch is to improve handling/reporting memory errors on
this case, but as a short term solution I suggest that we should undo
the present error handling code and just leave errors for such cases
(which expect the 2nd MCE to panic the system) to ensure data consistency.

Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: [email protected]
---
mm/memory-failure.c | 54 +++++++++++------------------------------------------
1 file changed, 11 insertions(+), 43 deletions(-)

diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
index 79dfb2f..7e62797 100644
--- v3.6-rc1.orig/mm/memory-failure.c
+++ v3.6-rc1/mm/memory-failure.c
@@ -613,49 +613,17 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
*/
static int me_pagecache_dirty(struct page *p, unsigned long pfn)
{
- struct address_space *mapping = page_mapping(p);
-
- SetPageError(p);
- /* TBD: print more information about the file. */
- if (mapping) {
- /*
- * IO error will be reported by write(), fsync(), etc.
- * who check the mapping.
- * This way the application knows that something went
- * wrong with its dirty file data.
- *
- * There's one open issue:
- *
- * The EIO will be only reported on the next IO
- * operation and then cleared through the IO map.
- * Normally Linux has two mechanisms to pass IO error
- * first through the AS_EIO flag in the address space
- * and then through the PageError flag in the page.
- * Since we drop pages on memory failure handling the
- * only mechanism open to use is through AS_AIO.
- *
- * This has the disadvantage that it gets cleared on
- * the first operation that returns an error, while
- * the PageError bit is more sticky and only cleared
- * when the page is reread or dropped. If an
- * application assumes it will always get error on
- * fsync, but does other operations on the fd before
- * and the page is dropped between then the error
- * will not be properly reported.
- *
- * This can already happen even without hwpoisoned
- * pages: first on metadata IO errors (which only
- * report through AS_EIO) or when the page is dropped
- * at the wrong time.
- *
- * So right now we assume that the application DTRT on
- * the first EIO, but we're not worse than other parts
- * of the kernel.
- */
- mapping_set_error(mapping, EIO);
- }
-
- return me_pagecache_clean(p, pfn);
+ /*
+ * The original memory error handling on dirty pagecache has
+ * a bug that user processes who use corrupted pages via read()
+ * or write() can't be aware of the memory error and result
+ * in throwing out dirty data silently.
+ *
+ * Until we solve the problem, let's close the path of memory
+ * error handling for dirty pagecache. We just leave errors
+ * for the 2nd MCE to trigger panics.
+ */
+ return IGNORED;
}

/*
--
1.7.11.2

2012-08-10 22:01:39

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

On Fri, Aug 10, 2012 at 05:41:53PM -0400, Naoya Horiguchi wrote:
...
> +/*
> * Dirty cache page page
> * Issues: when the error hit a hole page the error is not properly
> * propagated.
> */
> static int me_pagecache_dirty(struct page *p, unsigned long pfn)
> {
> - /*
> - * The original memory error handling on dirty pagecache has
> - * a bug that user processes who use corrupted pages via read()
> - * or write() can't be aware of the memory error and result
> - * in throwing out dirty data silently.
> - *
> - * Until we solve the problem, let's close the path of memory
> - * error handling for dirty pagecache. We just leave errors
> - * for the 2nd MCE to trigger panics.
> - */
> - return IGNORED;
> + struct address_space *mapping = page_mapping(p);
> +
> + SetPageError(p);
> + if (mapping) {
> + struct hwp_dirty *hwp;
> + struct inode *inode = mapping->host;
> +
> + /*
> + * Memory error is reported to userspace by AS_HWPOISON flags
> + * in mapping->flags. The mechanism is similar to that of
> + * AS_EIO, but we have separete flags because there'are two
> + * differences between them:
> + * 1. Expected userspace handling. When user processes get
> + * -EIO, they can retry writeback hoping the error in IO
> + * devices is temporary, switch to write to other devices,
> + * or do some other application-specific handling.
> + * For -EHWPOISON, we can clear the error by overwriting
> + * the corrupted page.
> + * 2. When to clear. For -EIO, we can think that we recover
> + * from the error when writeback succeeds. For -EHWPOISON
> + * OTOH, we can see that things are back to normal when
> + * corrupted data are overwritten from user buffer.
> + */
> + hwp = kmalloc(sizeof(struct hwp_dirty), GFP_ATOMIC);
> + hwp->page = p;
> + hwp->fpage = NULL;
> + hwp->mapping = mapping;
> + hwp->index = page_index(p);

> + hwp->ino = inode->i_ino;
> + hwp->dev = inode->i_sb->s_dev;

Sorry, these two members are not in struct hwp_dirty in current version.
Please ignore them.

Thanks,
Naoya

> + add_hwp_dirty(hwp);
> +
> + pr_err("MCE %#lx: Corrupted dirty pagecache, dev %u:%u, inode:%lu, index:%lu\n",
> + pfn, MAJOR(inode->i_sb->s_dev),
> + MINOR(inode->i_sb->s_dev), inode->i_ino, page_index(p));
> + mapping_set_error(mapping, -EHWPOISON);
> + }
> +
> + return me_pagecache_clean(p, pfn);
> }
>
> /*

2012-08-10 23:08:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean

Naoya Horiguchi <[email protected]> writes:

> action_result() fails to print out "dirty" even if an error occurred on a
> dirty pagecache, because when we check PageDirty in action_result() it was
> cleared after page isolation even if it's dirty before error handling. This
> can break some applications that monitor this message, so should be fixed.
>
> There are several callers of action_result() except page_action(), but
> either of them are not for LRU pages but for free pages or kernel pages,
> so we don't have to consider dirty or not for them.

Looks good

Reviewed-by: Andi Kleen <[email protected]>


-Andi
--
[email protected] -- Speaking for myself only

2012-08-10 23:09:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2/3] HWPOISON: undo memory error handling for dirty pagecache

Naoya Horiguchi <[email protected]> writes:

> Current memory error handling on dirty pagecache has a bug that user
> processes who use corrupted pages via read() or write() can't be aware
> of the memory error and result in discarding dirty data silently.
>
> The following patch is to improve handling/reporting memory errors on
> this case, but as a short term solution I suggest that we should undo
> the present error handling code and just leave errors for such cases
> (which expect the 2nd MCE to panic the system) to ensure data consistency.

Not sure that's the right approach. It's not worse than any other IO
errors isn't it?

-Andi

--
[email protected] -- Speaking for myself only

2012-08-10 23:13:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

Naoya Horiguchi <[email protected]> writes:

> Current error reporting of memory errors on dirty pagecache has silent
> data lost problem because AS_EIO in struct address_space is cleared
> once checked.

Seems very complicated. I think I would prefer something simpler
if possible, especially unless it's proven the case is common.
It's hard to maintain rarely used error code when it's complicated.
Maybe try Fengguang's simple proposal first? That would fix other IO
errors too.

-Andi

--
[email protected] -- Speaking for myself only

2012-08-11 00:58:37

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 2/3] HWPOISON: undo memory error handling for dirty pagecache

Hi Andi,

On Fri, Aug 10, 2012 at 04:09:48PM -0700, Andi Kleen wrote:
> Naoya Horiguchi <[email protected]> writes:
>
> > Current memory error handling on dirty pagecache has a bug that user
> > processes who use corrupted pages via read() or write() can't be aware
> > of the memory error and result in discarding dirty data silently.
> >
> > The following patch is to improve handling/reporting memory errors on
> > this case, but as a short term solution I suggest that we should undo
> > the present error handling code and just leave errors for such cases
> > (which expect the 2nd MCE to panic the system) to ensure data consistency.
>
> Not sure that's the right approach. It's not worse than any other IO
> errors isn't it?

Right, in current situation both memory errors and other IO errors have
the possibility of data lost in the same manner.
I thought that in real mission critical system (for which I think
HWPOISON feature is targeted) closing dangerous path is better than
keeping waiting for someone to solve the problem in more generic manner.

But if we start with Fengguang's approach at first as you replied to
patch 3, this patch is not necessary.

Naoya

2012-08-11 01:01:34

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

Hello,

On Fri, Aug 10, 2012 at 04:13:03PM -0700, Andi Kleen wrote:
> Naoya Horiguchi <[email protected]> writes:
>
> > Current error reporting of memory errors on dirty pagecache has silent
> > data lost problem because AS_EIO in struct address_space is cleared
> > once checked.
>
> Seems very complicated. I think I would prefer something simpler
> if possible, especially unless it's proven the case is common.
> It's hard to maintain rarely used error code when it's complicated.

I'm not sure if memory error is a rare event, because I don't have
any numbers about that on real systems. But assuming that hwpoison
events are not rare, dirty pagecache error is not an ignorable case
because dirty page ratio is typically ~10% of total physical memory
in average systems. It may be small but not negligible.

> Maybe try Fengguang's simple proposal first? That would fix other IO
> errors too.

In my understanding, Fengguang's patch (specified in this patch's
description) only fixes memory error reporting. And I'm not sure
that similar appoarch (like making AS_EIO sticky) really fixes
the IO errors because this change can break userspace applications
which expect the current behavior.

Anyway, OK, I agree to start with Fengguang's one and separate
out the additional suggestion about "making dirty pagecache error
recoverable". And if possible, I want your feedback about the
additional part of my idea. Can I ask a favor?

Thanks,
Naoya

2012-08-11 11:15:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

Naoya Horiguchi <[email protected]> writes:

I'm sceptical on the patch, but here's my review.

> - return -EHWPOISON when we access to the error-affected address with
> read(), partial-page write(), fsync(),

Note that a lot of user space does not like new errnos (nothing in
strerror etc.). It's probably better to reuse some existing errno.

> @@ -270,6 +273,9 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
> if (end_byte < start_byte)
> return 0;
>
> + if (unlikely(hwpoison_file_range(mapping, start_byte, end_byte)))
> + return -EHWPOISON;
> +

That function uses a global lock. fdatawait is quite common. This will
likely cause performance problems in IO workloads.

You need to get that lock out of the hot path somehow.

Probably better to try to put the data into a existing data structure,
or if you cannot do that you would need some way to localize the lock.

Or at least make it conditional of hwpoison errors being around.


> pagevec_init(&pvec, 0);
> while ((index <= end) &&
> (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> @@ -369,6 +375,16 @@ int filemap_write_and_wait_range(struct address_space *mapping,
> err = err2;
> }
> }
> +
> + /*
> + * When AS_HWPOISON is set, dirty page with memory error is
> + * removed from pagecache and mapping->nrpages is decreased by 1.
> + * So in order to detect memory error on single page file, we need
> + * to check AS_HWPOISON bit outside if(mapping->nrpages) block below.
> + */
> + if (unlikely(hwpoison_file_range(mapping, lstart, lend)))
> + return -EHWPOISON;

Same here.
> ra, filp, page,
> @@ -2085,6 +2123,9 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
> if (unlikely(*pos < 0))
> return -EINVAL;
>
> + if (unlikely(hwpoison_partial_write(file->f_mapping, *pos, *count)))
> + return -EHWPOISON;

Same here.
> +
> + /*
> + * Memory error is reported to userspace by AS_HWPOISON flags
> + * in mapping->flags. The mechanism is similar to that of
> + * AS_EIO, but we have separete flags because there'are two
> + * differences between them:
> + * 1. Expected userspace handling. When user processes get
> + * -EIO, they can retry writeback hoping the error in IO
> + * devices is temporary, switch to write to other devices,
> + * or do some other application-specific handling.
> + * For -EHWPOISON, we can clear the error by overwriting
> + * the corrupted page.
> + * 2. When to clear. For -EIO, we can think that we recover
> + * from the error when writeback succeeds. For -EHWPOISON
> + * OTOH, we can see that things are back to normal when
> + * corrupted data are overwritten from user buffer.
> + */
> + hwp = kmalloc(sizeof(struct hwp_dirty), GFP_ATOMIC);

You need to check the return value, especially for GFP_ATOMIC which is
common to fail

> + hwp->page = p;
> + hwp->fpage = NULL;
> + hwp->mapping = mapping;
> + hwp->index = page_index(p);
> + hwp->ino = inode->i_ino;
> + hwp->dev = inode->i_sb->s_dev;
> + add_hwp_dirty(hwp);

-Andi
--
[email protected] -- Speaking for myself only

2012-08-11 21:14:57

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

Hi,

On Sat, Aug 11, 2012 at 04:15:06AM -0700, Andi Kleen wrote:
> Naoya Horiguchi <[email protected]> writes:
>
> I'm sceptical on the patch, but here's my review.

Thank you for taking your time for the review.

> > - return -EHWPOISON when we access to the error-affected address with
> > read(), partial-page write(), fsync(),
>
> Note that a lot of user space does not like new errnos (nothing in
> strerror etc.). It's probably better to reuse some existing errno.

I'm OK to use EIO if user space can know that it comes from memory errors.
The reason why I thought that user space should distinguish it from IO errors
is that the possible responses of user space for memory errors are different
from those for IO errors, considering that we can recover from memory errors
with overwriting. OTOH what user space can do for IO errors is to wait for
IO devices to recover by itself and retry, to change the IO's target to
another device, or something like that.

> > @@ -270,6 +273,9 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
> > if (end_byte < start_byte)
> > return 0;
> >
> > + if (unlikely(hwpoison_file_range(mapping, start_byte, end_byte)))
> > + return -EHWPOISON;
> > +
>
> That function uses a global lock. fdatawait is quite common. This will
> likely cause performance problems in IO workloads.

OK, I should avoid it.

> You need to get that lock out of the hot path somehow.
>
> Probably better to try to put the data into a existing data structure,
> or if you cannot do that you would need some way to localize the lock.

Yes, I have thought about adding some data like new pagecache tag or
new members in struct address_space, but it makes the size of heavily
used data structure larger so I'm not sure it's acceptable.
And localizing the lock is worth trying, I think.

> Or at least make it conditional of hwpoison errors being around.

I'll try to do your suggestions, but I'm not sure your point of the
last one. Can you explain more about 'make it conditional' option?

>
>
> > pagevec_init(&pvec, 0);
> > while ((index <= end) &&
> > (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> > @@ -369,6 +375,16 @@ int filemap_write_and_wait_range(struct address_space *mapping,
> > err = err2;
> > }
> > }
> > +
> > + /*
> > + * When AS_HWPOISON is set, dirty page with memory error is
> > + * removed from pagecache and mapping->nrpages is decreased by 1.
> > + * So in order to detect memory error on single page file, we need
> > + * to check AS_HWPOISON bit outside if(mapping->nrpages) block below.
> > + */
> > + if (unlikely(hwpoison_file_range(mapping, lstart, lend)))
> > + return -EHWPOISON;
>
> Same here.
> > ra, filp, page,
> > @@ -2085,6 +2123,9 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
> > if (unlikely(*pos < 0))
> > return -EINVAL;
> >
> > + if (unlikely(hwpoison_partial_write(file->f_mapping, *pos, *count)))
> > + return -EHWPOISON;
>
> Same here.
> > +
> > + /*
> > + * Memory error is reported to userspace by AS_HWPOISON flags
> > + * in mapping->flags. The mechanism is similar to that of
> > + * AS_EIO, but we have separete flags because there'are two
> > + * differences between them:
> > + * 1. Expected userspace handling. When user processes get
> > + * -EIO, they can retry writeback hoping the error in IO
> > + * devices is temporary, switch to write to other devices,
> > + * or do some other application-specific handling.
> > + * For -EHWPOISON, we can clear the error by overwriting
> > + * the corrupted page.
> > + * 2. When to clear. For -EIO, we can think that we recover
> > + * from the error when writeback succeeds. For -EHWPOISON
> > + * OTOH, we can see that things are back to normal when
> > + * corrupted data are overwritten from user buffer.
> > + */
> > + hwp = kmalloc(sizeof(struct hwp_dirty), GFP_ATOMIC);
>
> You need to check the return value, especially for GFP_ATOMIC which is
> common to fail

OK, I'll fix it.

Thanks,
Naoya

2012-08-11 22:41:53

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

> dirty pagecache error recoverable under some conditions. Consider that
> if there is a copy of the corrupted dirty pagecache on user buffer and
> you write() over the error page with the copy data, then we can ignore
> the effect of the error because no one consumes the corrupted data.

This sounds like a quite rare corner case. If the page is already dirty, it is
most likely because someone recently did a write(2) (or touched it via
mmap(2)). Now you are hoping that some process is going to write the
same page again. Do you have an application in mind where this would
be common. Remember that the write(2), memory-error, new write(2)
have to happen close together (before Linux decides to write out the
dirty page).

-Tony

2012-08-12 03:28:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

> > That function uses a global lock. fdatawait is quite common. This will
> > likely cause performance problems in IO workloads.
>
> OK, I should avoid it.

Maybe just RCU the hash table.

> > You need to get that lock out of the hot path somehow.
> >
> > Probably better to try to put the data into a existing data structure,
> > or if you cannot do that you would need some way to localize the lock.
>
> Yes, I have thought about adding some data like new pagecache tag or
> new members in struct address_space, but it makes the size of heavily
> used data structure larger so I'm not sure it's acceptable.
> And localizing the lock is worth trying, I think.

It's cheaper than a hash table lookup in the hot path.

> > Or at least make it conditional of hwpoison errors being around.
>
> I'll try to do your suggestions, but I'm not sure your point of the
> last one. Can you explain more about 'make it conditional' option?

The code should check some flag first that is only set when hwpoison
happened on the address space (or global, but that would mean that
performance can go down globally when any error is around)

-Andi

2012-08-12 15:19:41

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

Hi,

On Sun, Aug 12, 2012 at 05:28:44AM +0200, Andi Kleen wrote:
> > > That function uses a global lock. fdatawait is quite common. This will
> > > likely cause performance problems in IO workloads.
> >
> > OK, I should avoid it.
>
> Maybe just RCU the hash table.

OK.

> > > You need to get that lock out of the hot path somehow.
> > >
> > > Probably better to try to put the data into a existing data structure,
> > > or if you cannot do that you would need some way to localize the lock.
> >
> > Yes, I have thought about adding some data like new pagecache tag or
> > new members in struct address_space, but it makes the size of heavily
> > used data structure larger so I'm not sure it's acceptable.
> > And localizing the lock is worth trying, I think.
>
> It's cheaper than a hash table lookup in the hot path.
>
> > > Or at least make it conditional of hwpoison errors being around.
> >
> > I'll try to do your suggestions, but I'm not sure your point of the
> > last one. Can you explain more about 'make it conditional' option?
>
> The code should check some flag first that is only set when hwpoison
> happened on the address space (or global, but that would mean that
> performance can go down globally when any error is around)

I defined hwpoison_file_range() and hwpoison_partial_write() as wrapper
functions of __hwpoison_* variants, and they hold hwp_dirty_lock only
if AS_HWPOISON flag in mapping is set. So I hope we already did it.
But yes, I understand that in general a global lock is not good,
so I'll try to do other options.

Thank you,
Naoya

2012-08-12 15:58:07

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 3/3] HWPOISON: improve handling/reporting of memory error on dirty pagecache

Hi Tony,

Thank you for the comment.

On Sat, Aug 11, 2012 at 10:41:49PM +0000, Luck, Tony wrote:
> > dirty pagecache error recoverable under some conditions. Consider that
> > if there is a copy of the corrupted dirty pagecache on user buffer and
> > you write() over the error page with the copy data, then we can ignore
> > the effect of the error because no one consumes the corrupted data.
>
> This sounds like a quite rare corner case. If the page is already dirty, it is
> most likely because someone recently did a write(2) (or touched it via
> mmap(2)).

Yes, that's right.

> Now you are hoping that some process is going to write the
> same page again. Do you have an application in mind where this would
> be common.

No, I don't, particularly.

> Remember that the write(2), memory-error, new write(2)
> have to happen close together (before Linux decides to write out the
> dirty page).

Maybe this is different from my scenario, where I assumed that a hwpoison-
aware application kicks the second write(2) when it catches a memory error
report from kernel, and this write(2) copies from the same buffer from
which the first write(2) copied into pagecache.
In many case, user space applications keep their buffers for a while after
calling write(2), so then we can consider that dirty pagecaches also can
have copies in the buffers. This is a key idea of error recovery.

And let me discuss about another point. When memory errors happen on
dirty pagecaches, they are isolated from pagecache trees. So neither
fsync(2) nor writeback can write out the corrupted data on the backing
devices. So I don't think that we have to be careful about closeness
between two write(2)s.

Thanks,
Naoya

2012-08-13 10:17:45

by Junichi Nomura

[permalink] [raw]
Subject: Re: [PATCH 2/3] HWPOISON: undo memory error handling for dirty pagecache

On 08/11/12 08:09, Andi Kleen wrote:
> Naoya Horiguchi <[email protected]> writes:
>
>> Current memory error handling on dirty pagecache has a bug that user
>> processes who use corrupted pages via read() or write() can't be aware
>> of the memory error and result in discarding dirty data silently.
>>
>> The following patch is to improve handling/reporting memory errors on
>> this case, but as a short term solution I suggest that we should undo
>> the present error handling code and just leave errors for such cases
>> (which expect the 2nd MCE to panic the system) to ensure data consistency.
>
> Not sure that's the right approach. It's not worse than any other IO
> errors isn't it?

IMO, it's worse in certain cases. For example, producer-consumer type
program which uses file as a temporary storage.
Current memory-failure.c drops produced data from dirty pagecache
and allows reader to consume old or empty data from disk (silently!),
that's what I think HWPOISON should prevent.

Similar thing could happen theoretically with disk I/O errors,
though, practically those errors are often persistent and reader will
likely get errors again instead of bad data.
Also, ext3/ext4 has an option to panic when an error is detected,
for people who want to avoid corruption on intermittent errors.

--
Jun'ichi Nomura, NEC Corporation