[NOTE]
Although I think the patchset is ready to go since a) it fixes
the original issues and b) survives all my tests, I wanted to
give it a last RFC spin.
If no further objections are presented, I will drop the RFC.
This patchset was initially based on Naoya's hwpoison rework [1], so
thanks to him for the initial work.
I would also like to thank Naoya for testing the patchset off-line
and reporting any issues he found; that was quite helpful.
This patchset aims to fix some issues lying in {soft,hard}-offline handling,
but it also takes the chance and takes some further steps to perform
cleanups and some refactoring as well.
While this patchset was initially focused on soft-offlining, I think
the hard-offline part can be cleaned up further.
But that would be on top of this work.
- Motivation:
A customer and I were facing an issue where processes were killed
after having soft-offlined some of their pages.
This should not happen when soft-offlining, as it is meant to be non-disruptive.
I was able to reproduce the issue by stressing the memory while
soft-offlining pages in the meantime.
After debugging the issue, I saw that the problem was that pages were handed
back to user-space after having been properly offlined.
So, when those pages were faulted in, the fault handler returned VM_FAULT_HWPOISON
all the way down to the arch handler, and it simply killed the process.
After further analysis, it became clear that the problem was that when
kcompactd kicked in to migrate pages over, compaction_alloc callback
was handing poisoned pages to the migrate routine.
All this could happen because isolate_freepages_block and
fast_isolate_freepages only check whether the page is PageBuddy,
and since 1) poisoned pages can be part of a higher-order page
and 2) poisoned pages are also PageBuddy, they can sneak in easily.
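The hole in that check can be illustrated with a toy model in plain C
(flag names and helpers below are made up for the example; this is not
kernel code):

```c
#include <stdbool.h>

/* Toy page-flag model -- illustrative only, not the kernel's struct page. */
enum { PG_BUDDY = 1 << 0, PG_HWPOISON = 1 << 1 };

/* What isolate_freepages_block/fast_isolate_freepages effectively did:
 * accept any PageBuddy page as a migration target, poisoned or not. */
static bool compaction_can_use(unsigned int flags)
{
    return flags & PG_BUDDY;
}

/* With the containment approach, a poisoned page is never put back on
 * the buddy freelists, so PG_BUDDY and PG_HWPOISON can no longer be set
 * at the same time, and the simple PageBuddy check becomes safe. */
static bool flags_possible_after_containment(unsigned int flags)
{
    return !((flags & PG_BUDDY) && (flags & PG_HWPOISON));
}
```

With the old state of affairs, a page that was both Buddy and HWPoison
passed the check and was handed to the migrate routine.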
I also saw some other problems with swap pages, but I suspected it
to be the same sort of problem, so I did not follow that trace.
The above refers to soft-offline.
But I also saw problems with hard-offline, especially hugetlb corruption,
and some other weird stuff. (I could paste the logs)
The full explanation referring to the soft-offline case can be found at [2].
- Approach:
The taken approach is to contain those pages and never let them hit
neither pcplists nor buddy freelists.
Only when they are completely out of reach do we flag them as poisoned.
A full explanation of this can be found in patch#10 and patch#11.
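The invariant can be sketched with a self-contained toy allocator (all
names here are made up for illustration; this is not kernel code): the
page is first taken off the free list, and only once it is unreachable
is it flagged as poisoned, so the allocator can never hand it out again.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8
static bool on_freelist[NPAGES];
static bool poisoned[NPAGES];

static void freelist_init(void)
{
    for (size_t i = 0; i < NPAGES; i++) {
        on_freelist[i] = true;
        poisoned[i] = false;
    }
}

/* Mirror of the series' ordering: take the page off the free list
 * first, and only then, once it is out of reach, mark it poisoned. */
static bool handle_poison(size_t idx)
{
    if (!on_freelist[idx])
        return false;       /* could not contain it */
    on_freelist[idx] = false;
    poisoned[idx] = true;   /* flagged only after it is unreachable */
    return true;
}

static int alloc_page_idx(void)
{
    for (size_t i = 0; i < NPAGES; i++)
        if (on_freelist[i]) {
            on_freelist[i] = false;
            return (int)i;
        }
    return -1;
}
```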
- Outcome:
With this patchset, I no longer see the issues with soft-offline and
hard-offline.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
Naoya Horiguchi (6):
mm,hwpoison: cleanup unused PageHuge() check
mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
mm,hwpoison-inject: don't pin for hwpoison_filter
mm,hwpoison: remove MF_COUNT_INCREASED
mm,hwpoison: remove flag argument from soft offline functions
mm, soft-offline: convert parameter to pfn
Oscar Salvador (10):
mm,madvise: Refactor madvise_inject_error
mm,hwpoison: Un-export get_hwpoison_page and make it static
mm,hwpoison: Kill put_hwpoison_page
mm,hwpoison: Unify THP handling for hard and soft offline
mm,hwpoison: Rework soft offline for free pages
mm,hwpoison: Rework soft offline for in-use pages
mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
mm,hwpoison: Take pages off the buddy when hard-offlining
mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
mm/hwpoison-inject: Rip off duplicated checks
drivers/base/memory.c | 7 +-
include/linux/mm.h | 11 +-
include/linux/page-flags.h | 5 -
mm/hwpoison-inject.c | 43 +-----
mm/madvise.c | 39 ++---
mm/memory-failure.c | 365 +++++++++++++++++++++------------------------
mm/migrate.c | 11 +-
mm/page_alloc.c | 69 +++++++--
8 files changed, 253 insertions(+), 297 deletions(-)
--
2.12.3
Make a proper if-else condition for {hard,soft}-offline.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/madvise.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 89ed9a22ff4f..b765860a5d04 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -854,16 +854,15 @@ static long madvise_remove(struct vm_area_struct *vma,
static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
{
- struct page *page;
struct zone *zone;
unsigned int order;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
-
for (; start < end; start += PAGE_SIZE << order) {
unsigned long pfn;
+ struct page *page;
int ret;
ret = get_user_pages_fast(start, 1, 0, &page);
@@ -893,17 +892,14 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
- pfn, start);
-
+ pfn, start);
ret = soft_offline_page(page, 0);
- if (ret)
- return ret;
- continue;
+ } else {
+ pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+ pfn, start);
+ ret = memory_failure(pfn, 0);
}
- pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
- pfn, start);
- ret = memory_failure(pfn, 0);
if (ret)
return ret;
}
--
2.12.3
From: Naoya Horiguchi <[email protected]>
Now there's no user of MF_COUNT_INCREASED, so we can safely remove
it from all calling points.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/mm.h | 7 +++----
mm/memory-failure.c | 16 +++-------------
2 files changed, 6 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6dca8778cc2..4c2cf2ea9953 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2780,10 +2780,9 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
unsigned long nr_pages);
enum mf_flags {
- MF_COUNT_INCREASED = 1 << 0,
- MF_ACTION_REQUIRED = 1 << 1,
- MF_MUST_KILL = 1 << 2,
- MF_SOFT_OFFLINE = 1 << 3,
+ MF_ACTION_REQUIRED = 1 << 0,
+ MF_MUST_KILL = 1 << 1,
+ MF_SOFT_OFFLINE = 1 << 2,
};
extern int memory_failure(unsigned long pfn, int flags);
extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3e73738a2246..c6b31cf409d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1081,7 +1081,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
num_poisoned_pages_inc();
- if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+ if (!get_hwpoison_page(p)) {
/*
* Check "filter hit" and "race with other subpage."
*/
@@ -1277,7 +1277,7 @@ int memory_failure(unsigned long pfn, int flags)
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
- if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+ if (!get_hwpoison_page(p)) {
if (is_free_buddy_page(p)) {
action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
return 0;
@@ -1318,10 +1318,7 @@ int memory_failure(unsigned long pfn, int flags)
shake_page(p, 0);
/* shake_page could have turned it free. */
if (!PageLRU(p) && is_free_buddy_page(p)) {
- if (flags & MF_COUNT_INCREASED)
- action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
- else
- action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+ action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
return 0;
}
@@ -1609,9 +1606,6 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
{
int ret;
- if (flags & MF_COUNT_INCREASED)
- return 1;
-
/*
* When the target page is a free hugepage, just remove it
* from free hugepage list.
@@ -1881,15 +1875,11 @@ int soft_offline_page(struct page *page, int flags)
if (is_zone_device_page(page)) {
pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
pfn);
- if (flags & MF_COUNT_INCREASED)
- put_page(page);
return -EIO;
}
if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
- if (flags & MF_COUNT_INCREASED)
- put_page(page);
return -EBUSY;
}
--
2.12.3
After commit 4e41a30c6d50 ("mm: hwpoison: adjust for new thp refcounting"),
put_hwpoison_page got reduced to a mere put_page.
Let us just use put_page instead.
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/mm.h | 1 -
mm/memory-failure.c | 30 +++++++++++++++---------------
2 files changed, 15 insertions(+), 16 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a646eb4dc993..d6dca8778cc2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2788,7 +2788,6 @@ enum mf_flags {
extern int memory_failure(unsigned long pfn, int flags);
extern void memory_failure_queue(unsigned long pfn, int flags);
extern int unpoison_memory(unsigned long pfn);
-#define put_hwpoison_page(page) put_page(page)
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p, int access);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5ce634ddf9fa..3e73738a2246 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1107,7 +1107,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
num_poisoned_pages_dec();
unlock_page(head);
- put_hwpoison_page(head);
+ put_page(head);
return 0;
}
@@ -1299,7 +1299,7 @@ int memory_failure(unsigned long pfn, int flags)
pfn);
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
- put_hwpoison_page(p);
+ put_page(p);
return -EBUSY;
}
unlock_page(p);
@@ -1353,14 +1353,14 @@ int memory_failure(unsigned long pfn, int flags)
pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
num_poisoned_pages_dec();
unlock_page(p);
- put_hwpoison_page(p);
+ put_page(p);
return 0;
}
if (hwpoison_filter(p)) {
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
unlock_page(p);
- put_hwpoison_page(p);
+ put_page(p);
return 0;
}
@@ -1584,9 +1584,9 @@ int unpoison_memory(unsigned long pfn)
}
unlock_page(page);
- put_hwpoison_page(page);
+ put_page(page);
if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
- put_hwpoison_page(page);
+ put_page(page);
return 0;
}
@@ -1644,7 +1644,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
/*
* Try to free it.
*/
- put_hwpoison_page(page);
+ put_page(page);
shake_page(page, 1);
/*
@@ -1653,7 +1653,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
ret = __get_any_page(page, pfn, 0);
if (ret == 1 && !PageLRU(page)) {
/* Drop page reference which is from __get_any_page() */
- put_hwpoison_page(page);
+ put_page(page);
pr_info("soft_offline: %#lx: unknown non LRU page type %lx (%pGp)\n",
pfn, page->flags, &page->flags);
return -EIO;
@@ -1676,7 +1676,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
lock_page(hpage);
if (PageHWPoison(hpage)) {
unlock_page(hpage);
- put_hwpoison_page(hpage);
+ put_page(hpage);
pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
return -EBUSY;
}
@@ -1687,7 +1687,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
* get_any_page() and isolate_huge_page() takes a refcount each,
* so need to drop one here.
*/
- put_hwpoison_page(hpage);
+ put_page(hpage);
if (!ret) {
pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
return -EBUSY;
@@ -1736,7 +1736,7 @@ static int __soft_offline_page(struct page *page, int flags)
wait_on_page_writeback(page);
if (PageHWPoison(page)) {
unlock_page(page);
- put_hwpoison_page(page);
+ put_page(page);
pr_info("soft offline: %#lx page already poisoned\n", pfn);
return -EBUSY;
}
@@ -1751,7 +1751,7 @@ static int __soft_offline_page(struct page *page, int flags)
* would need to fix isolation locking first.
*/
if (ret == 1) {
- put_hwpoison_page(page);
+ put_page(page);
pr_info("soft_offline: %#lx: invalidated\n", pfn);
SetPageHWPoison(page);
num_poisoned_pages_inc();
@@ -1771,7 +1771,7 @@ static int __soft_offline_page(struct page *page, int flags)
* Drop page reference which is came from get_any_page()
* successful isolate_lru_page() already took another one.
*/
- put_hwpoison_page(page);
+ put_page(page);
if (!ret) {
LIST_HEAD(pagelist);
/*
@@ -1815,7 +1815,7 @@ static int soft_offline_in_use_page(struct page *page, int flags)
pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
else
pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
- put_hwpoison_page(page);
+ put_page(page);
return -EBUSY;
}
unlock_page(page);
@@ -1889,7 +1889,7 @@ int soft_offline_page(struct page *page, int flags)
if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
if (flags & MF_COUNT_INCREASED)
- put_hwpoison_page(page);
+ put_page(page);
return -EBUSY;
}
--
2.12.3
From: Naoya Horiguchi <[email protected]>
The argument @flags no longer affects the behavior of soft_offline_page()
and its variants, so let's remove it.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
---
drivers/base/memory.c | 2 +-
include/linux/mm.h | 2 +-
mm/madvise.c | 2 +-
mm/memory-failure.c | 27 +++++++++++++--------------
4 files changed, 16 insertions(+), 17 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 55907c27075b..b3cae2eb1c4f 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -543,7 +543,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
/* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
if (!pfn_to_online_page(pfn))
return -EIO;
- ret = soft_offline_page(pfn_to_page(pfn), 0);
+ ret = soft_offline_page(pfn_to_page(pfn));
return ret == 0 ? count : ret;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4c2cf2ea9953..0f80a1ce4e86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2791,7 +2791,7 @@ extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p, int access);
extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(struct page *page, int flags);
+extern int soft_offline_page(struct page *page);
/*
diff --git a/mm/madvise.c b/mm/madvise.c
index b765860a5d04..8a0b1f901d72 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -893,7 +893,7 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
pfn, start);
- ret = soft_offline_page(page, 0);
+ ret = soft_offline_page(page);
} else {
pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
pfn, start);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c6b31cf409d3..836d671fb74f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1469,7 +1469,7 @@ static void memory_failure_work_func(struct work_struct *work)
if (!gotten)
break;
if (entry.flags & MF_SOFT_OFFLINE)
- soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
+ soft_offline_page(pfn_to_page(entry.pfn));
else
memory_failure(entry.pfn, entry.flags);
}
@@ -1602,7 +1602,7 @@ static struct page *new_page(struct page *p, unsigned long private)
* that is not free, and 1 for any other page type.
* For 1 the page is returned with increased page count, otherwise not.
*/
-static int __get_any_page(struct page *p, unsigned long pfn, int flags)
+static int __get_any_page(struct page *p, unsigned long pfn)
{
int ret;
@@ -1629,9 +1629,9 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
return ret;
}
-static int get_any_page(struct page *page, unsigned long pfn, int flags)
+static int get_any_page(struct page *page, unsigned long pfn)
{
- int ret = __get_any_page(page, pfn, flags);
+ int ret = __get_any_page(page, pfn);
if (ret == 1 && !PageHuge(page) &&
!PageLRU(page) && !__PageMovable(page)) {
@@ -1644,7 +1644,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
/*
* Did it turn free?
*/
- ret = __get_any_page(page, pfn, 0);
+ ret = __get_any_page(page, pfn);
if (ret == 1 && !PageLRU(page)) {
/* Drop page reference which is from __get_any_page() */
put_page(page);
@@ -1656,7 +1656,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
return ret;
}
-static int soft_offline_huge_page(struct page *page, int flags)
+static int soft_offline_huge_page(struct page *page)
{
int ret;
unsigned long pfn = page_to_pfn(page);
@@ -1715,7 +1715,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
return ret;
}
-static int __soft_offline_page(struct page *page, int flags)
+static int __soft_offline_page(struct page *page)
{
int ret;
unsigned long pfn = page_to_pfn(page);
@@ -1795,7 +1795,7 @@ static int __soft_offline_page(struct page *page, int flags)
return ret;
}
-static int soft_offline_in_use_page(struct page *page, int flags)
+static int soft_offline_in_use_page(struct page *page)
{
int ret;
int mt;
@@ -1825,9 +1825,9 @@ static int soft_offline_in_use_page(struct page *page, int flags)
mt = get_pageblock_migratetype(page);
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
if (PageHuge(page))
- ret = soft_offline_huge_page(page, flags);
+ ret = soft_offline_huge_page(page);
else
- ret = __soft_offline_page(page, flags);
+ ret = __soft_offline_page(page);
set_pageblock_migratetype(page, mt);
return ret;
}
@@ -1848,7 +1848,6 @@ static int soft_offline_free_page(struct page *page)
/**
* soft_offline_page - Soft offline a page.
* @page: page to offline
- * @flags: flags. Same as memory_failure().
*
* Returns 0 on success, otherwise negated errno.
*
@@ -1867,7 +1866,7 @@ static int soft_offline_free_page(struct page *page)
* This is not a 100% solution for all memory, but tries to be
* ``good enough'' for the majority of memory.
*/
-int soft_offline_page(struct page *page, int flags)
+int soft_offline_page(struct page *page)
{
int ret;
unsigned long pfn = page_to_pfn(page);
@@ -1884,11 +1883,11 @@ int soft_offline_page(struct page *page, int flags)
}
get_online_mems();
- ret = get_any_page(page, pfn, flags);
+ ret = get_any_page(page, pfn);
put_online_mems();
if (ret > 0)
- ret = soft_offline_in_use_page(page, flags);
+ ret = soft_offline_in_use_page(page);
else if (ret == 0)
ret = soft_offline_free_page(page);
--
2.12.3
We need to do the same thing we now do for soft-offline, and take poisoned
pages off the buddy allocator.
Otherwise we could face issues like [1] as well.
[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/memory-failure.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 48eb314598e0..3d491c0d3f91 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -791,6 +791,14 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
return MF_FAILED;
}
+static int me_huge_free_page(struct page *p)
+{
+ if (page_handle_poison(p, true, false))
+ return MF_RECOVERED;
+ else
+ return MF_FAILED;
+}
+
/*
* Huge pages. Needs work.
* Issues:
@@ -818,8 +826,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
*/
if (PageAnon(hpage))
put_page(hpage);
- dissolve_free_huge_page(p);
- res = MF_RECOVERED;
+ res = me_huge_free_page(p);
lock_page(hpage);
}
@@ -1145,8 +1152,10 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
}
}
unlock_page(head);
- dissolve_free_huge_page(p);
- action_result(pfn, MF_MSG_FREE_HUGE, MF_DELAYED);
+ res = me_huge_free_page(p);
+ if (res == MF_FAILED)
+ num_poisoned_pages_dec();
+ action_result(pfn, MF_MSG_FREE_HUGE, res);
return 0;
}
@@ -1307,6 +1316,12 @@ int memory_failure(unsigned long pfn, int flags)
if (PageHuge(p))
return memory_failure_hugetlb(pfn, flags);
+
+ if (is_free_buddy_page(p) && page_handle_poison(p, true, false)) {
+ action_result(pfn, MF_MSG_BUDDY, MF_RECOVERED);
+ return 0;
+ }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
@@ -1328,10 +1343,10 @@ int memory_failure(unsigned long pfn, int flags)
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
if (!get_hwpoison_page(p)) {
- if (is_free_buddy_page(p)) {
- action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+ if (is_free_buddy_page(p) && page_handle_poison(p, true, false)) {
+ action_result(pfn, MF_MSG_BUDDY, MF_RECOVERED);
return 0;
- } else {
+ } else if (!is_free_buddy_page(p)) {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
return -EBUSY;
}
@@ -1354,8 +1369,8 @@ int memory_failure(unsigned long pfn, int flags)
*/
shake_page(p, 0);
/* shake_page could have turned it free. */
- if (!PageLRU(p) && is_free_buddy_page(p)) {
- action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+ if (!PageLRU(p) && is_free_buddy_page(p) && page_handle_poison(p, true, false)) {
+ action_result(pfn, MF_MSG_BUDDY_2ND, MF_RECOVERED);
return 0;
}
--
2.12.3
Currently, there is an inconsistency when calling soft-offline from
different paths on a page that is already poisoned.
1) madvise:
madvise_inject_error skips any poisoned page and continues
the loop.
If that was the only page to madvise, it returns 0.
2) /sys/devices/system/memory/:
When calling soft_offline_page_store()->soft_offline_page(),
we return -EBUSY in case the page is already poisoned.
This is inconsistent with a) the above example and b)
memory_failure, where we return 0 if the page was poisoned.
Fix this by dropping the PageHWPoison() check in madvise_inject_error,
and let soft_offline_page return 0 if it finds the page already poisoned.
Please note that this represents a user-API change, since the error
returned when calling soft_offline_page_store()->soft_offline_page() will now be different.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/madvise.c | 3 ---
mm/memory-failure.c | 4 ++--
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 8a0b1f901d72..9ca48345ce45 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -887,9 +887,6 @@ static int madvise_inject_error(int behavior,
*/
put_page(page);
- if (PageHWPoison(page))
- continue;
-
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
pfn, start);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3d491c0d3f91..c038896bedf0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1767,7 +1767,7 @@ static int __soft_offline_page(struct page *page)
unlock_page(page);
put_page(page);
pr_info("soft offline: %#lx page already poisoned\n", pfn);
- return -EBUSY;
+ return 0;
}
if (!PageHuge(page))
@@ -1866,7 +1866,7 @@ int soft_offline_page(struct page *page)
if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
- return -EBUSY;
+ return 0;
}
get_online_mems();
--
2.12.3
memory_failure() already performs the same checks, so leave it
to the main routine.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/hwpoison-inject.c | 33 +++------------------------------
1 file changed, 3 insertions(+), 30 deletions(-)
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 0c8cdb80fd7d..fdcca3df4283 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -14,49 +14,22 @@ static struct dentry *hwpoison_dir;
static int hwpoison_inject(void *data, u64 val)
{
unsigned long pfn = val;
- struct page *p;
- struct page *hpage;
- int err;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- if (!pfn_valid(pfn))
- return -ENXIO;
-
- p = pfn_to_page(pfn);
- hpage = compound_head(p);
-
- if (!hwpoison_filter_enable)
- goto inject;
-
- shake_page(hpage, 0);
- /*
- * This implies unable to support non-LRU pages.
- */
- if (!PageLRU(hpage) && !PageHuge(p))
- return 0;
-
- /*
- * do a racy check to make sure PG_hwpoison will only be set for
- * the targeted owner (or on a free page).
- * memory_failure() will redo the check reliably inside page lock.
- */
- err = hwpoison_filter(hpage);
- if (err)
- return 0;
-
-inject:
pr_info("Injecting memory failure at pfn %#lx\n", pfn);
return memory_failure(pfn, 0);
}
static int hwpoison_unpoison(void *data, u64 val)
{
+ unsigned long pfn = val;
+
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- return unpoison_memory(val);
+ return unpoison_memory(pfn);
}
DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
--
2.12.3
When trying to soft-offline a free page, we need to first take it off
the buddy allocator.
Once we know it is out of reach, we can safely flag it as poisoned.
take_page_off_buddy will be used to take a page meant to be poisoned
off the buddy allocator.
take_page_off_buddy calls break_down_buddy_pages, which splits a
higher-order page in case our page belongs to one.
Once the page is under our control, we call page_handle_poison to set it
as poisoned and grab a refcount on it.
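The splitting math of break_down_buddy_pages can be sketched with plain
index arithmetic (a toy model; the function name and layout below are
illustrative, not the actual kernel implementation):

```c
/* Split an order-`high` buddy block so that the single target page
 * (offset `t` within the block) stays out of the freelists. At each
 * step the half not containing the target goes back to the free lists,
 * and we descend into the other half. Returns the number of pages
 * handed back, i.e. everything except the target itself. */
static unsigned long split_keep_target(unsigned int high, unsigned long t)
{
    unsigned long base = 0;            /* offset of current sub-block */
    unsigned long size = 1UL << high;
    unsigned long freed = 0;

    while (high > 0) {
        high--;
        size >>= 1;
        freed += size;                 /* one half goes to the free list */
        if (t >= base + size)
            base += size;              /* target is in the upper half */
    }
    return freed;
}
```

For an order-4 block (16 pages), splitting frees an order-3, order-2,
order-1 and order-0 chunk: 8 + 4 + 2 + 1 = 15 pages, with only the
target kept out.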
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/memory-failure.c | 20 +++++++++++-----
mm/page_alloc.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 82 insertions(+), 6 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 37b230b8cfe7..1d986580522d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -78,6 +78,15 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
+extern bool take_page_off_buddy(struct page *page);
+
+static void page_handle_poison(struct page *page)
+{
+ SetPageHWPoison(page);
+ page_ref_inc(page);
+ num_poisoned_pages_inc();
+}
+
static int hwpoison_filter_dev(struct page *p)
{
struct address_space *mapping;
@@ -1830,14 +1839,13 @@ static int soft_offline_in_use_page(struct page *page)
static int soft_offline_free_page(struct page *page)
{
- int rc = dissolve_free_huge_page(page);
+ int rc = -EBUSY;
- if (!rc) {
- if (set_hwpoison_free_buddy_page(page))
- num_poisoned_pages_inc();
- else
- rc = -EBUSY;
+ if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
+ page_handle_poison(page);
+ rc = 0;
}
+
return rc;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1dd0712624..255df0c76a40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8632,6 +8632,74 @@ bool is_free_buddy_page(struct page *page)
#ifdef CONFIG_MEMORY_FAILURE
/*
+ * Break down a higher-order page in sub-pages, and keep our target out of
+ * buddy allocator.
+ */
+static void break_down_buddy_pages(struct zone *zone, struct page *page,
+ struct page *target, int low, int high,
+ struct free_area *area, int migratetype)
+{
+ unsigned long size = 1 << high;
+ struct page *current_buddy, *next_page;
+
+ while (high > low) {
+ area--;
+ high--;
+ size >>= 1;
+
+ if (target >= &page[size]) {
+ next_page = page + size;
+ current_buddy = page;
+ } else {
+ next_page = page;
+ current_buddy = page + size;
+ }
+
+ if (set_page_guard(zone, current_buddy, high, migratetype))
+ continue;
+
+ if (current_buddy != target) {
+ add_to_free_area(current_buddy, area, migratetype);
+ set_page_order(current_buddy, high);
+ page = next_page;
+ }
+ }
+}
+
+/*
+ * Take a page that will be marked as poisoned off the buddy allocator.
+ */
+bool take_page_off_buddy(struct page *page)
+ {
+ struct zone *zone = page_zone(page);
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long flags;
+ unsigned int order;
+ bool ret = false;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (order = 0; order < MAX_ORDER; order++) {
+ struct page *page_head = page - (pfn & ((1 << order) - 1));
+ int buddy_order = page_order(page_head);
+ struct free_area *area = &(zone->free_area[buddy_order]);
+
+ if (PageBuddy(page_head) && buddy_order >= order) {
+ unsigned long pfn_head = page_to_pfn(page_head);
+ int migratetype = get_pfnblock_migratetype(page_head,
+ pfn_head);
+
+ del_page_from_free_area(page_head, area);
+ break_down_buddy_pages(zone, page_head, page, 0,
+ buddy_order, area, migratetype);
+ ret = true;
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return ret;
+ }
+
+/*
* Set PG_hwpoison flag if a given page is confirmed to be a free page. This
* test is performed under the zone lock to prevent a race against page
* allocation.
--
2.12.3
Merging soft_offline_huge_page and __soft_offline_page lets us get rid of
quite a bit of duplicated code, and makes the code much easier to follow.
Now, __soft_offline_page will handle both normal and hugetlb pages.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/memory-failure.c | 175 ++++++++++++++++++++++++----------------------------
1 file changed, 79 insertions(+), 96 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9b40cf1cb4fc..48eb314598e0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -80,13 +80,33 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
extern bool take_page_off_buddy(struct page *page);
-static void page_handle_poison(struct page *page, bool release)
+static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
{
+ if (hugepage_or_freepage) {
+ /*
+ * Doing this check for free pages is also fine since dissolve_free_huge_page
+ * returns 0 for non-hugetlb pages as well.
+ */
+ if (dissolve_free_huge_page(page) || !take_page_off_buddy(page))
+ /*
+ * The hugetlb page can end up being enqueued back into
+ * the freelists by means of:
+ * unmap_and_move_huge_page
+ * putback_active_hugepage
+ * put_page->free_huge_page
+ * enqueue_huge_page
+ * If this happens, we might lose the race against an allocation.
+ */
+ return false;
+ }
+
SetPageHWPoison(page);
if (release)
put_page(page);
page_ref_inc(page);
num_poisoned_pages_inc();
+
+ return true;
}
static int hwpoison_filter_dev(struct page *p)
@@ -1673,63 +1693,51 @@ static int get_any_page(struct page *page, unsigned long pfn)
return ret;
}
-static int soft_offline_huge_page(struct page *page)
+static bool isolate_page(struct page *page, struct list_head *pagelist)
{
- int ret;
- unsigned long pfn = page_to_pfn(page);
- struct page *hpage = compound_head(page);
- LIST_HEAD(pagelist);
+ bool isolated = false;
+ bool lru = PageLRU(page);
- /*
- * This double-check of PageHWPoison is to avoid the race with
- * memory_failure(). See also comment in __soft_offline_page().
- */
- lock_page(hpage);
- if (PageHWPoison(hpage)) {
- unlock_page(hpage);
- put_page(hpage);
- pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
- return -EBUSY;
+ if (PageHuge(page)) {
+ isolated = isolate_huge_page(page, pagelist);
+ } else {
+ if (lru)
+ isolated = !isolate_lru_page(page);
+ else
+ isolated = !isolate_movable_page(page, ISOLATE_UNEVICTABLE);
+
+ if (isolated)
+ list_add(&page->lru, pagelist);
}
- unlock_page(hpage);
- ret = isolate_huge_page(hpage, &pagelist);
+ if (isolated && lru)
+ inc_node_page_state(page, NR_ISOLATED_ANON +
+ page_is_file_cache(page));
+
/*
- * get_any_page() and isolate_huge_page() takes a refcount each,
- * so need to drop one here.
+ * If we succeed to isolate the page, we grabbed another refcount on
+ * the page, so we can safely drop the one we got from get_any_pages().
+ * If we failed to isolate the page, it means that we cannot go further
+ * and we will return an error, so drop the reference we got from
+ * get_any_pages() as well.
*/
- put_page(hpage);
- if (!ret) {
- pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
- return -EBUSY;
- }
-
- ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
- MIGRATE_SYNC, MR_MEMORY_FAILURE);
- if (ret) {
- pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
- pfn, ret, page->flags, &page->flags);
- if (!list_empty(&pagelist))
- putback_movable_pages(&pagelist);
- if (ret > 0)
- ret = -EIO;
- } else {
- /*
- * We set PG_hwpoison only when we were able to take the page
- * off the buddy.
- */
- if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
- page_handle_poison(page, false);
- else
- ret = -EBUSY;
- }
- return ret;
+ put_page(page);
+ return isolated;
}
+/*
+ * __soft_offline_page handles hugetlb pages and non-hugetlb pages.
+ * If the page is a non-dirty, unmapped page-cache page, it is simply
+ * invalidated. If the page is mapped, its contents are migrated to a
+ * new page.
+ */
static int __soft_offline_page(struct page *page)
{
- int ret;
+ int ret = 0;
unsigned long pfn = page_to_pfn(page);
+ struct page *hpage = compound_head(page);
+ const char *msg_page[] = { "page", "hugepage"};
+ bool huge = PageHuge(page);
+ LIST_HEAD(pagelist);
/*
* Check PageHWPoison again inside page lock because PageHWPoison
@@ -1738,98 +1746,73 @@ static int __soft_offline_page(struct page *page)
* so there's no race between soft_offline_page() and memory_failure().
*/
lock_page(page);
- wait_on_page_writeback(page);
+ if (!PageHuge(page))
+ wait_on_page_writeback(page);
if (PageHWPoison(page)) {
unlock_page(page);
put_page(page);
pr_info("soft offline: %#lx page already poisoned\n", pfn);
return -EBUSY;
}
- /*
- * Try to invalidate first. This should work for
- * non dirty unmapped page cache pages.
- */
- ret = invalidate_inode_page(page);
+
+ if (!PageHuge(page))
+ /*
+ * Try to invalidate first. This should work for
+ * non dirty unmapped page cache pages.
+ */
+ ret = invalidate_inode_page(page);
unlock_page(page);
+
/*
* RED-PEN would be better to keep it isolated here, but we
* would need to fix isolation locking first.
*/
- if (ret == 1) {
+ if (ret) {
pr_info("soft_offline: %#lx: invalidated\n", pfn);
- page_handle_poison(page, true);
+ page_handle_poison(page, false, true);
return 0;
}
- /*
- * Simple invalidation didn't work.
- * Try to migrate to a new page instead. migrate.c
- * handles a large number of cases for us.
- */
- if (PageLRU(page))
- ret = isolate_lru_page(page);
- else
- ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
- /*
- * Drop page reference which is came from get_any_page()
- * successful isolate_lru_page() already took another one.
- */
- put_page(page);
- if (!ret) {
- LIST_HEAD(pagelist);
- /*
- * After isolated lru page, the PageLRU will be cleared,
- * so use !__PageMovable instead for LRU page's mapping
- * cannot have PAGE_MAPPING_MOVABLE.
- */
- if (!__PageMovable(page))
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
- list_add(&page->lru, &pagelist);
+ if (isolate_page(hpage, &pagelist)) {
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (!ret) {
- page_handle_poison(page, true);
+ bool release = !huge;
+ if (!page_handle_poison(page, true, release))
+ ret = -EBUSY;
} else {
if (!list_empty(&pagelist))
putback_movable_pages(&pagelist);
- pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
- pfn, ret, page->flags, &page->flags);
+ pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n",
+ pfn, msg_page[huge], ret, page->flags, &page->flags);
if (ret > 0)
ret = -EIO;
}
} else {
- pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n",
- pfn, ret, page_count(page), page->flags, &page->flags);
+ pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n",
+ pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags);
}
return ret;
}
static int soft_offline_in_use_page(struct page *page)
{
- int ret;
struct page *hpage = compound_head(page);
if (!PageHuge(page) && PageTransHuge(hpage))
if (try_to_split_thp_page(page, "soft offline") < 0)
return -EBUSY;
- if (PageHuge(page))
- ret = soft_offline_huge_page(page);
- else
- ret = __soft_offline_page(page);
- return ret;
+ return __soft_offline_page(page);
}
static int soft_offline_free_page(struct page *page)
{
- int rc = -EBUSY;
+ int rc = 0;
- if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
- page_handle_poison(page, false);
- rc = 0;
- }
+ if (!page_handle_poison(page, true, false))
+ rc = -EBUSY;
return rc;
}
--
2.12.3
This patch changes the way we set and handle in-use poisoned pages.
Until now, poisoned pages were released to the buddy allocator, trusting
that the checks that take place prior to handing out the page would act
as a safety net and would skip that page.
This has proved to be wrong, as there are some pfn walkers out there, like
compaction, that only care about the page being PageBuddy and sitting in a
freelist.
Although this might not be the only user, having poisoned pages
in the buddy allocator seems a bad idea as we should only have
free pages that are ready and meant to be used as such.
Before explaining the approach taken, let us break down the kinds
of pages we can soft-offline.
- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)
* Normal pages (order-0 and anon-THP)
- If they are clean and unmapped page cache pages, we invalidate
them by means of invalidate_inode_page().
- If they are mapped/dirty, we do the isolate-and-migrate dance.
Either way, we do not call put_page directly from those paths.
Instead, we keep the page and send it to page_handle_poison to perform the
right handling.
page_handle_poison sets the HWPoison flag and does the last put_page.
This call to put_page is mainly there to be able to reach
__page_cache_release, since that function is not exported.
Down the chain, we placed a check for HWPoison pages in free_pages_prepare,
which just skips any poisoned page, so those pages do not end up in any
pcplist/freelist.
After that, we set the refcount on the page to 1 and we increment
the poisoned pages counter.
We could do as we do for free pages:
1) wait until the page hits buddy's freelists
2) take it off
3) flag it
The problem is that we could race with an allocation, so by the time we
want to take the page off the buddy, the page is already allocated, so we
cannot soft-offline it.
This is not fatal, of course, but it is better if we can close the race,
as doing so does not require a lot of code.
* Hugetlb pages
- We isolate-and-migrate them
After the migration has been successful, we call dissolve_free_huge_page,
and we set HWPoison on the page if we succeed.
Hugetlb has a slightly different handling though.
While for non-hugetlb pages we cared about closing the race with an allocation,
doing so for hugetlb pages requires quite some additional code (we would need to
hook in free_huge_page and some other places).
So I decided not to make the code overly complicated and to just fail
normally if the page was allocated in the meantime.
Because of the way we now handle in-use pages, we no longer need the
put-as-isolation-migratetype dance that guarded against poisoned pages
ending up in pcplists.
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/page-flags.h | 5 -----
mm/memory-failure.c | 43 ++++++++++++++-----------------------------
mm/migrate.c | 11 +++--------
mm/page_alloc.c | 31 +++----------------------------
4 files changed, 20 insertions(+), 70 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..21df81c9ea57 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -414,13 +414,8 @@ PAGEFLAG_FALSE(Uncached)
PAGEFLAG(HWPoison, hwpoison, PF_ANY)
TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
#define __PG_HWPOISON (1UL << PG_hwpoison)
-extern bool set_hwpoison_free_buddy_page(struct page *page);
#else
PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
- return 0;
-}
#define __PG_HWPOISON 0
#endif
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1d986580522d..9b40cf1cb4fc 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -80,9 +80,11 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
extern bool take_page_off_buddy(struct page *page);
-static void page_handle_poison(struct page *page)
+static void page_handle_poison(struct page *page, bool release)
{
SetPageHWPoison(page);
+ if (release)
+ put_page(page);
page_ref_inc(page);
num_poisoned_pages_inc();
}
@@ -1713,19 +1715,13 @@ static int soft_offline_huge_page(struct page *page)
ret = -EIO;
} else {
/*
- * We set PG_hwpoison only when the migration source hugepage
- * was successfully dissolved, because otherwise hwpoisoned
- * hugepage remains on free hugepage list, then userspace will
- * find it as SIGBUS by allocation failure. That's not expected
- * in soft-offlining.
+ * We set PG_hwpoison only when we were able to take the page
+ * off the buddy.
*/
- ret = dissolve_free_huge_page(page);
- if (!ret) {
- if (set_hwpoison_free_buddy_page(page))
- num_poisoned_pages_inc();
- else
- ret = -EBUSY;
- }
+ if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
+ page_handle_poison(page, false);
+ else
+ ret = -EBUSY;
}
return ret;
}
@@ -1760,10 +1756,8 @@ static int __soft_offline_page(struct page *page)
* would need to fix isolation locking first.
*/
if (ret == 1) {
- put_page(page);
pr_info("soft_offline: %#lx: invalidated\n", pfn);
- SetPageHWPoison(page);
- num_poisoned_pages_inc();
+ page_handle_poison(page, true);
return 0;
}
@@ -1794,7 +1788,9 @@ static int __soft_offline_page(struct page *page)
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
- if (ret) {
+ if (!ret) {
+ page_handle_poison(page, true);
+ } else {
if (!list_empty(&pagelist))
putback_movable_pages(&pagelist);
@@ -1813,27 +1809,16 @@ static int __soft_offline_page(struct page *page)
static int soft_offline_in_use_page(struct page *page)
{
int ret;
- int mt;
struct page *hpage = compound_head(page);
if (!PageHuge(page) && PageTransHuge(hpage))
if (try_to_split_thp_page(page, "soft offline") < 0)
return -EBUSY;
- /*
- * Setting MIGRATE_ISOLATE here ensures that the page will be linked
- * to free list immediately (not via pcplist) when released after
- * successful page migration. Otherwise we can't guarantee that the
- * page is really free after put_page() returns, so
- * set_hwpoison_free_buddy_page() highly likely fails.
- */
- mt = get_pageblock_migratetype(page);
- set_pageblock_migratetype(page, MIGRATE_ISOLATE);
if (PageHuge(page))
ret = soft_offline_huge_page(page);
else
ret = __soft_offline_page(page);
- set_pageblock_migratetype(page, mt);
return ret;
}
@@ -1842,7 +1827,7 @@ static int soft_offline_free_page(struct page *page)
int rc = -EBUSY;
if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
- page_handle_poison(page);
+ page_handle_poison(page, false);
rc = 0;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 4fe45d1428c8..71acece248d7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1224,16 +1224,11 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* we want to retry.
*/
if (rc == MIGRATEPAGE_SUCCESS) {
- put_page(page);
- if (reason == MR_MEMORY_FAILURE) {
+ if (reason != MR_MEMORY_FAILURE)
/*
- * Set PG_HWPoison on just freed page
- * intentionally. Although it's rather weird,
- * it's how HWPoison flag works at the moment.
+ * We release the page in page_handle_poison.
*/
- if (set_hwpoison_free_buddy_page(page))
- num_poisoned_pages_inc();
- }
+ put_page(page);
} else {
if (rc != -EAGAIN) {
if (likely(!__PageMovable(page))) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 255df0c76a40..cb35a4c8b1f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1132,6 +1132,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
VM_BUG_ON_PAGE(PageTail(page), page);
+ if (unlikely(PageHWPoison(page)) && !order)
+ return false;
+
trace_mm_page_free(page, order);
/*
@@ -8698,32 +8701,4 @@ bool take_page_off_buddy(struct page *page)
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
}
-
-/*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
- */
-bool set_hwpoison_free_buddy_page(struct page *page)
-{
- struct zone *zone = page_zone(page);
- unsigned long pfn = page_to_pfn(page);
- unsigned long flags;
- unsigned int order;
- bool hwpoisoned = false;
-
- spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
- struct page *page_head = page - (pfn & ((1 << order) - 1));
-
- if (PageBuddy(page_head) && page_order(page_head) >= order) {
- if (!TestSetPageHWPoison(page))
- hwpoisoned = true;
- break;
- }
- }
- spin_unlock_irqrestore(&zone->lock, flags);
-
- return hwpoisoned;
-}
#endif
--
2.12.3
From: Naoya Horiguchi <[email protected]>
Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn. This discrepancy looks weird and makes
precheck on pfn validity tricky. So let's align them.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
---
drivers/base/memory.c | 7 +------
include/linux/mm.h | 2 +-
mm/madvise.c | 2 +-
mm/memory-failure.c | 17 +++++++++--------
4 files changed, 12 insertions(+), 16 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b3cae2eb1c4f..b510b4d176c9 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -538,12 +538,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
if (kstrtoull(buf, 0, &pfn) < 0)
return -EINVAL;
pfn >>= PAGE_SHIFT;
- if (!pfn_valid(pfn))
- return -ENXIO;
- /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
- if (!pfn_to_online_page(pfn))
- return -EIO;
- ret = soft_offline_page(pfn_to_page(pfn));
+ ret = soft_offline_page(pfn);
return ret == 0 ? count : ret;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f80a1ce4e86..40722854d357 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2791,7 +2791,7 @@ extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
extern void shake_page(struct page *p, int access);
extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(struct page *page);
+extern int soft_offline_page(unsigned long pfn);
/*
diff --git a/mm/madvise.c b/mm/madvise.c
index 9ca48345ce45..f83b7d4c68c1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -890,7 +890,7 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
pfn, start);
- ret = soft_offline_page(page);
+ ret = soft_offline_page(pfn);
} else {
pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
pfn, start);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c038896bedf0..bfecb61fc064 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1521,7 +1521,7 @@ static void memory_failure_work_func(struct work_struct *work)
if (!gotten)
break;
if (entry.flags & MF_SOFT_OFFLINE)
- soft_offline_page(pfn_to_page(entry.pfn));
+ soft_offline_page(entry.pfn);
else
memory_failure(entry.pfn, entry.flags);
}
@@ -1834,7 +1834,7 @@ static int soft_offline_free_page(struct page *page)
/**
* soft_offline_page - Soft offline a page.
- * @page: page to offline
+ * @pfn: pfn to soft-offline
*
* Returns 0 on success, otherwise negated errno.
*
@@ -1853,16 +1853,17 @@ static int soft_offline_free_page(struct page *page)
* This is not a 100% solution for all memory, but tries to be
* ``good enough'' for the majority of memory.
*/
-int soft_offline_page(struct page *page)
+int soft_offline_page(unsigned long pfn)
{
int ret;
- unsigned long pfn = page_to_pfn(page);
+ struct page *page;
- if (is_zone_device_page(page)) {
- pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
- pfn);
+ if (!pfn_valid(pfn))
+ return -ENXIO;
+ /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
+ page = pfn_to_online_page(pfn);
+ if (!page)
return -EIO;
- }
if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
--
2.12.3
Since get_hwpoison_page is only used in memory-failure code now,
let us un-export it and make it private to that code.
Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/mm.h | 1 -
mm/memory-failure.c | 3 +--
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 44d058723db9..a646eb4dc993 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2788,7 +2788,6 @@ enum mf_flags {
extern int memory_failure(unsigned long pfn, int flags);
extern void memory_failure_queue(unsigned long pfn, int flags);
extern int unpoison_memory(unsigned long pfn);
-extern int get_hwpoison_page(struct page *page);
#define put_hwpoison_page(page) put_page(page)
extern int sysctl_memory_failure_early_kill;
extern int sysctl_memory_failure_recovery;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2cbadb58c7df..5ce634ddf9fa 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -913,7 +913,7 @@ static int page_action(struct page_state *ps, struct page *p,
* Return: return 0 if failed to grab the refcount, otherwise true (some
* non-zero value.)
*/
-int get_hwpoison_page(struct page *page)
+static int get_hwpoison_page(struct page *page)
{
struct page *head = compound_head(page);
@@ -942,7 +942,6 @@ int get_hwpoison_page(struct page *page)
return 0;
}
-EXPORT_SYMBOL_GPL(get_hwpoison_page);
/*
* Do all that is necessary to remove user space mappings. Unmap
--
2.12.3
Place the THP split handling in a helper and use it
from both the hard and soft-offline machinery, so we get rid
of some duplicated code.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/memory-failure.c | 48 ++++++++++++++++++++++--------------------------
1 file changed, 22 insertions(+), 26 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 836d671fb74f..37b230b8cfe7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1066,6 +1066,25 @@ static int identify_page_state(unsigned long pfn, struct page *p,
return page_action(ps, p, pfn);
}
+static int try_to_split_thp_page(struct page *page, const char *msg)
+{
+ lock_page(page);
+ if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+ unsigned long pfn = page_to_pfn(page);
+
+ unlock_page(page);
+ if (!PageAnon(page))
+ pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
+ else
+ pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+ put_page(page);
+ return -EBUSY;
+ }
+ unlock_page(page);
+
+ return 0;
+}
+
static int memory_failure_hugetlb(unsigned long pfn, int flags)
{
struct page *p = pfn_to_page(pfn);
@@ -1288,21 +1307,8 @@ int memory_failure(unsigned long pfn, int flags)
}
if (PageTransHuge(hpage)) {
- lock_page(p);
- if (!PageAnon(p) || unlikely(split_huge_page(p))) {
- unlock_page(p);
- if (!PageAnon(p))
- pr_err("Memory failure: %#lx: non anonymous thp\n",
- pfn);
- else
- pr_err("Memory failure: %#lx: thp split failed\n",
- pfn);
- if (TestClearPageHWPoison(p))
- num_poisoned_pages_dec();
- put_page(p);
+ if (try_to_split_thp_page(p, "Memory Failure") < 0)
return -EBUSY;
- }
- unlock_page(p);
VM_BUG_ON_PAGE(!page_count(p), p);
hpage = compound_head(p);
}
@@ -1801,19 +1807,9 @@ static int soft_offline_in_use_page(struct page *page)
int mt;
struct page *hpage = compound_head(page);
- if (!PageHuge(page) && PageTransHuge(hpage)) {
- lock_page(page);
- if (!PageAnon(page) || unlikely(split_huge_page(page))) {
- unlock_page(page);
- if (!PageAnon(page))
- pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
- else
- pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
- put_page(page);
+ if (!PageHuge(page) && PageTransHuge(hpage))
+ if (try_to_split_thp_page(page, "soft offline") < 0)
return -EBUSY;
- }
- unlock_page(page);
- }
/*
* Setting MIGRATE_ISOLATE here ensures that the page will be linked
--
2.12.3
From: Naoya Horiguchi <[email protected]>
Another memory error injection interface debugfs:hwpoison/corrupt-pfn
also takes bogus refcount for hwpoison_filter(). It's justified
because this does a coarse filter, expecting that memory_failure()
redoes the check for sure.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/hwpoison-inject.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 5b7430bd83a6..0c8cdb80fd7d 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -26,11 +26,6 @@ static int hwpoison_inject(void *data, u64 val)
p = pfn_to_page(pfn);
hpage = compound_head(p);
- /*
- * This implies unable to support free buddy pages.
- */
- if (!get_hwpoison_page(p))
- return 0;
if (!hwpoison_filter_enable)
goto inject;
@@ -40,23 +35,20 @@ static int hwpoison_inject(void *data, u64 val)
* This implies unable to support non-LRU pages.
*/
if (!PageLRU(hpage) && !PageHuge(p))
- goto put_out;
+ return 0;
/*
- * do a racy check with elevated page count, to make sure PG_hwpoison
- * will only be set for the targeted owner (or on a free page).
+ * do a racy check to make sure PG_hwpoison will only be set for
+ * the targeted owner (or on a free page).
* memory_failure() will redo the check reliably inside page lock.
*/
err = hwpoison_filter(hpage);
if (err)
- goto put_out;
+ return 0;
inject:
pr_info("Injecting memory failure at pfn %#lx\n", pfn);
- return memory_failure(pfn, MF_COUNT_INCREASED);
-put_out:
- put_hwpoison_page(p);
- return 0;
+ return memory_failure(pfn, 0);
}
static int hwpoison_unpoison(void *data, u64 val)
--
2.12.3
On 17.10.19 16:21, Oscar Salvador wrote:
> From: Naoya Horiguchi <[email protected]>
>
> Currently soft_offline_page() receives struct page, and its sibling
> memory_failure() receives pfn. This discrepancy looks weird and makes
> precheck on pfn validity tricky. So let's align them.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> drivers/base/memory.c | 7 +------
> include/linux/mm.h | 2 +-
> mm/madvise.c | 2 +-
> mm/memory-failure.c | 17 +++++++++--------
> 4 files changed, 12 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index b3cae2eb1c4f..b510b4d176c9 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -538,12 +538,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
> if (kstrtoull(buf, 0, &pfn) < 0)
> return -EINVAL;
> pfn >>= PAGE_SHIFT;
> - if (!pfn_valid(pfn))
> - return -ENXIO;
> - /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
> - if (!pfn_to_online_page(pfn))
> - return -EIO;
> - ret = soft_offline_page(pfn_to_page(pfn));
> + ret = soft_offline_page(pfn);
> return ret == 0 ? count : ret;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0f80a1ce4e86..40722854d357 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2791,7 +2791,7 @@ extern int sysctl_memory_failure_early_kill;
> extern int sysctl_memory_failure_recovery;
> extern void shake_page(struct page *p, int access);
> extern atomic_long_t num_poisoned_pages __read_mostly;
> -extern int soft_offline_page(struct page *page);
> +extern int soft_offline_page(unsigned long pfn);
>
>
> /*
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9ca48345ce45..f83b7d4c68c1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -890,7 +890,7 @@ static int madvise_inject_error(int behavior,
> if (behavior == MADV_SOFT_OFFLINE) {
> pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> pfn, start);
> - ret = soft_offline_page(page);
> + ret = soft_offline_page(pfn);
> } else {
> pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
> pfn, start);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index c038896bedf0..bfecb61fc064 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1521,7 +1521,7 @@ static void memory_failure_work_func(struct work_struct *work)
> if (!gotten)
> break;
> if (entry.flags & MF_SOFT_OFFLINE)
> - soft_offline_page(pfn_to_page(entry.pfn));
> + soft_offline_page(entry.pfn);
> else
> memory_failure(entry.pfn, entry.flags);
> }
> @@ -1834,7 +1834,7 @@ static int soft_offline_free_page(struct page *page)
>
> /**
> * soft_offline_page - Soft offline a page.
> - * @page: page to offline
> + * @pfn: pfn to soft-offline
> *
> * Returns 0 on success, otherwise negated errno.
> *
> @@ -1853,16 +1853,17 @@ static int soft_offline_free_page(struct page *page)
> * This is not a 100% solution for all memory, but tries to be
> * ``good enough'' for the majority of memory.
> */
> -int soft_offline_page(struct page *page)
> +int soft_offline_page(unsigned long pfn)
> {
> int ret;
> - unsigned long pfn = page_to_pfn(page);
> + struct page *page;
>
> - if (is_zone_device_page(page)) {
> - pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
> - pfn);
> + if (!pfn_valid(pfn))
> + return -ENXIO;
> + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
> + page = pfn_to_online_page(pfn);
> + if (!page)
> return -EIO;
> - }
>
> if (PageHWPoison(page)) {
> pr_info("soft offline: %#lx page already poisoned\n", pfn);
>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On Thu 17-10-19 16:21:17, Oscar Salvador wrote:
[...]
> +bool take_page_off_buddy(struct page *page)
> + {
> + struct zone *zone = page_zone(page);
> + unsigned long pfn = page_to_pfn(page);
> + unsigned long flags;
> + unsigned int order;
> + bool ret = false;
> +
> + spin_lock_irqsave(&zone->lock, flags);
What prevents the page to be allocated in the meantime? Also what about
free pages on the pcp lists? Also the page could be gone by the time you
have reached here.
> + for (order = 0; order < MAX_ORDER; order++) {
> + struct page *page_head = page - (pfn & ((1 << order) - 1));
> + int buddy_order = page_order(page_head);
> + struct free_area *area = &(zone->free_area[buddy_order]);
> +
> + if (PageBuddy(page_head) && buddy_order >= order) {
> + unsigned long pfn_head = page_to_pfn(page_head);
> + int migratetype = get_pfnblock_migratetype(page_head,
> + pfn_head);
> +
> + del_page_from_free_area(page_head, area);
> + break_down_buddy_pages(zone, page_head, page, 0,
> + buddy_order, area, migratetype);
> + ret = true;
> + break;
> + }
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> + return ret;
> + }
> +
> +/*
> * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
> * test is performed under the zone lock to prevent a race against page
> * allocation.
> --
> 2.12.3
--
Michal Hocko
SUSE Labs
On Thu 17-10-19 16:21:18, Oscar Salvador wrote:
> This patch changes the way we set and handle in-use poisoned pages.
> Until now, poisoned pages were released to the buddy allocator, trusting
> that the checks that take place prior to hand the page would act as a
> safe net and would skip that page.
>
> This has proved to be wrong, as we got some pfn walkers out there, like
> compaction, that all they care is the page to be PageBuddy and be in a
> freelist.
> Although this might not be the only user, having poisoned pages
> in the buddy allocator seems a bad idea as we should only have
> free pages that are ready and meant to be used as such.
Agreed on this part.
> Before explainaing the taken approach, let us break down the kind
> of pages we can soft offline.
>
> - Anonymous THP (after the split, they end up being 4K pages)
> - Hugetlb
> - Order-0 pages (that can be either migrated or invalited)
>
> * Normal pages (order-0 and anon-THP)
>
> - If they are clean and unmapped page cache pages, we invalidate
> then by means of invalidate_inode_page().
> - If they are mapped/dirty, we do the isolate-and-migrate dance.
>
> Either way, do not call put_page directly from those paths.
> Instead, we keep the page and send it to page_set_poison to perform the
> right handling.
>
> page_set_poison sets the HWPoison flag and does the last put_page.
> This call to put_page is mainly to be able to call __page_cache_release,
> since this function is not exported.
>
> Down the chain, we placed a check for HWPoison page in free_pages_prepare,
> that just skips any poisoned page, so those pages do not end up in any
> pcplist/freelist.
>
> After that, we set the refcount on the page to 1 and we increment
> the poisoned pages counter.
>
> We could do as we do for free pages:
> 1) wait until the page hits buddy's freelists
> 2) take it off
> 3) flag it
>
> The problem is that we could race with an allocation, so by the time we
> want to take the page off the buddy, the page is already allocated, so we
> cannot soft-offline it.
> This is not fatal of course, but if it is better if we can close the race
> as does not require a lot of code.
>
> * Hugetlb pages
>
> - We isolate-and-migrate them
>
> After the migration has been succesful, we call dissolve_free_huge_page,
> and we set HWPoison on the page if we succeed.
> Hugetlb has a slightly different handling though.
>
> While for non-hugetlb pages we cared about closing the race with an allocation,
> doing so for hugetlb pages requires quite some additional code (we would need to
> hook in free_huge_page and some other places).
> So I decided to not make the code overly complicated and just fail normally
> if the page we allocated in the meantime.
>
> Because of the way we handle now in-use pages, we no longer need the
> put-as-isolation-migratetype dance, that was guarding for poisoned pages
> to end up in pcplists.
I am sorry but I got lost in the above description and I cannot really
make much sense from the code either. Let me try to outline the way how
I think about this.
Say we have a pfn to hwpoison. We have effectivelly three possibilities
- page is poisoned already - done, nothing to do
- page is managed by the buddy allocator - excavate from there
- page is in use
The last category is the most interesting one. There are essentially
three classes of pages
- freeable
- migrateable
- others
We cannot do really much about the last one, right? Do we mark them
HWPoison anyway?
Freeable should be simply marked HWPoison and freed.
For all those migrateable, we simply do migrate and mark HWPoison.
Now the main question is how to handle HWPoison page when it is freed
- aka last reference is dropped. The main question is whether the last
reference is ever dropped. If yes then the free_pages_prepare should
never release it to the allocator (some compound destructors would have
to special case as well, e.g. hugetlb would have to hand over to the
allocator rather than a pool). If not then the page would be lingering
potentially with some state bound to it (e.g. memcg charge). So I
suspect you want the former.
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> include/linux/page-flags.h | 5 -----
> mm/memory-failure.c | 43 ++++++++++++++-----------------------------
> mm/migrate.c | 11 +++--------
> mm/page_alloc.c | 31 +++----------------------------
> 4 files changed, 20 insertions(+), 70 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f91cb8898ff0..21df81c9ea57 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -414,13 +414,8 @@ PAGEFLAG_FALSE(Uncached)
> PAGEFLAG(HWPoison, hwpoison, PF_ANY)
> TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
> #define __PG_HWPOISON (1UL << PG_hwpoison)
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
> #else
> PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> - return 0;
> -}
> #define __PG_HWPOISON 0
> #endif
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 1d986580522d..9b40cf1cb4fc 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -80,9 +80,11 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
>
> extern bool take_page_off_buddy(struct page *page);
>
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release)
> {
> SetPageHWPoison(page);
> + if (release)
> + put_page(page);
> page_ref_inc(page);
> num_poisoned_pages_inc();
> }
> @@ -1713,19 +1715,13 @@ static int soft_offline_huge_page(struct page *page)
> ret = -EIO;
> } else {
> /*
> - * We set PG_hwpoison only when the migration source hugepage
> - * was successfully dissolved, because otherwise hwpoisoned
> - * hugepage remains on free hugepage list, then userspace will
> - * find it as SIGBUS by allocation failure. That's not expected
> - * in soft-offlining.
> + * We set PG_hwpoison only when we were able to take the page
> + * off the buddy.
> */
> - ret = dissolve_free_huge_page(page);
> - if (!ret) {
> - if (set_hwpoison_free_buddy_page(page))
> - num_poisoned_pages_inc();
> - else
> - ret = -EBUSY;
> - }
> + if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
> + page_handle_poison(page, false);
> + else
> + ret = -EBUSY;
> }
> return ret;
> }
> @@ -1760,10 +1756,8 @@ static int __soft_offline_page(struct page *page)
> * would need to fix isolation locking first.
> */
> if (ret == 1) {
> - put_page(page);
> pr_info("soft_offline: %#lx: invalidated\n", pfn);
> - SetPageHWPoison(page);
> - num_poisoned_pages_inc();
> + page_handle_poison(page, true);
> return 0;
> }
>
> @@ -1794,7 +1788,9 @@ static int __soft_offline_page(struct page *page)
> list_add(&page->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
> MIGRATE_SYNC, MR_MEMORY_FAILURE);
> - if (ret) {
> + if (!ret) {
> + page_handle_poison(page, true);
> + } else {
> if (!list_empty(&pagelist))
> putback_movable_pages(&pagelist);
>
> @@ -1813,27 +1809,16 @@ static int __soft_offline_page(struct page *page)
> static int soft_offline_in_use_page(struct page *page)
> {
> int ret;
> - int mt;
> struct page *hpage = compound_head(page);
>
> if (!PageHuge(page) && PageTransHuge(hpage))
> if (try_to_split_thp_page(page, "soft offline") < 0)
> return -EBUSY;
>
> - /*
> - * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> - * to free list immediately (not via pcplist) when released after
> - * successful page migration. Otherwise we can't guarantee that the
> - * page is really free after put_page() returns, so
> - * set_hwpoison_free_buddy_page() highly likely fails.
> - */
> - mt = get_pageblock_migratetype(page);
> - set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> if (PageHuge(page))
> ret = soft_offline_huge_page(page);
> else
> ret = __soft_offline_page(page);
> - set_pageblock_migratetype(page, mt);
> return ret;
> }
>
> @@ -1842,7 +1827,7 @@ static int soft_offline_free_page(struct page *page)
> int rc = -EBUSY;
>
> if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> - page_handle_poison(page);
> + page_handle_poison(page, false);
> rc = 0;
> }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..71acece248d7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1224,16 +1224,11 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> * we want to retry.
> */
> if (rc == MIGRATEPAGE_SUCCESS) {
> - put_page(page);
> - if (reason == MR_MEMORY_FAILURE) {
> + if (reason != MR_MEMORY_FAILURE)
> /*
> - * Set PG_HWPoison on just freed page
> - * intentionally. Although it's rather weird,
> - * it's how HWPoison flag works at the moment.
> + * We release the page in page_handle_poison.
> */
> - if (set_hwpoison_free_buddy_page(page))
> - num_poisoned_pages_inc();
> - }
> + put_page(page);
> } else {
> if (rc != -EAGAIN) {
> if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 255df0c76a40..cb35a4c8b1f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1132,6 +1132,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
>
> VM_BUG_ON_PAGE(PageTail(page), page);
>
> + if (unlikely(PageHWPoison(page)) && !order)
> + return false;
> +
> trace_mm_page_free(page, order);
>
> /*
> @@ -8698,32 +8701,4 @@ bool take_page_off_buddy(struct page *page)
> spin_unlock_irqrestore(&zone->lock, flags);
> return ret;
> }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> - struct zone *zone = page_zone(page);
> - unsigned long pfn = page_to_pfn(page);
> - unsigned long flags;
> - unsigned int order;
> - bool hwpoisoned = false;
> -
> - spin_lock_irqsave(&zone->lock, flags);
> - for (order = 0; order < MAX_ORDER; order++) {
> - struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> - if (PageBuddy(page_head) && page_order(page_head) >= order) {
> - if (!TestSetPageHWPoison(page))
> - hwpoisoned = true;
> - break;
> - }
> - }
> - spin_unlock_irqrestore(&zone->lock, flags);
> -
> - return hwpoisoned;
> -}
> #endif
> --
> 2.12.3
--
Michal Hocko
SUSE Labs
On Thu, Oct 17, 2019 at 04:21:10PM +0200, Oscar Salvador wrote:
> Make a proper if-else condition for {hard,soft}-offline.
>
> Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
On Thu, Oct 17, 2019 at 04:21:12PM +0200, Oscar Salvador wrote:
> Since get_hwpoison_page is only used in memory-failure code now,
> let us un-export it and make it private to that code.
>
> Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
On Thu, Oct 17, 2019 at 04:21:13PM +0200, Oscar Salvador wrote:
> After commit 4e41a30c6d50 ("mm: hwpoison: adjust for new thp refcounting"),
> put_hwpoison_page was reduced to a put_page.
> Let us just use put_page instead.
>
> Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
On Thu, Oct 17, 2019 at 04:21:16PM +0200, Oscar Salvador wrote:
> Place the THP's page handling in a helper and use it
> from both hard and soft-offline machinery, so we get rid
> of some duplicated code.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> mm/memory-failure.c | 48 ++++++++++++++++++++++--------------------------
> 1 file changed, 22 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 836d671fb74f..37b230b8cfe7 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1066,6 +1066,25 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> return page_action(ps, p, pfn);
> }
>
> +static int try_to_split_thp_page(struct page *page, const char *msg)
> +{
> + lock_page(page);
> + if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> + unsigned long pfn = page_to_pfn(page);
> +
> + unlock_page(page);
> + if (!PageAnon(page))
> + pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> + else
> + pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> + put_page(page);
> + return -EBUSY;
> + }
> + unlock_page(page);
> +
> + return 0;
> +}
> +
> static int memory_failure_hugetlb(unsigned long pfn, int flags)
> {
> struct page *p = pfn_to_page(pfn);
> @@ -1288,21 +1307,8 @@ int memory_failure(unsigned long pfn, int flags)
> }
>
> if (PageTransHuge(hpage)) {
> - lock_page(p);
> - if (!PageAnon(p) || unlikely(split_huge_page(p))) {
> - unlock_page(p);
> - if (!PageAnon(p))
> - pr_err("Memory failure: %#lx: non anonymous thp\n",
> - pfn);
> - else
> - pr_err("Memory failure: %#lx: thp split failed\n",
> - pfn);
> - if (TestClearPageHWPoison(p))
> - num_poisoned_pages_dec();
> - put_page(p);
> + if (try_to_split_thp_page(p, "Memory Failure") < 0)
> return -EBUSY;
Although this is not a cleanup thing, this failure path means that
hwpoison is handled (PG_hwpoison is marked), so action_result() should
be called. I'll add a patch for this later.
Anyway, this cleanup patch looks fine to me.
Acked-by: Naoya Horiguchi <[email protected]>
> - }
> - unlock_page(p);
> VM_BUG_ON_PAGE(!page_count(p), p);
> hpage = compound_head(p);
> }
> @@ -1801,19 +1807,9 @@ static int soft_offline_in_use_page(struct page *page)
> int mt;
> struct page *hpage = compound_head(page);
>
> - if (!PageHuge(page) && PageTransHuge(hpage)) {
> - lock_page(page);
> - if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> - unlock_page(page);
> - if (!PageAnon(page))
> - pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
> - else
> - pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
> - put_page(page);
> + if (!PageHuge(page) && PageTransHuge(hpage))
> + if (try_to_split_thp_page(page, "soft offline") < 0)
> return -EBUSY;
> - }
> - unlock_page(page);
> - }
>
> /*
> * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> --
> 2.12.3
>
>
On Thu, Oct 17, 2019 at 04:21:17PM +0200, Oscar Salvador wrote:
> When trying to soft-offline a free page, we need to first take it off
> the buddy allocator.
> Once we know it is out of reach, we can safely flag it as poisoned.
>
> take_page_off_buddy will be used to take a page meant to be poisoned
> off the buddy allocator.
> take_page_off_buddy calls break_down_buddy_pages, which splits a
> higher-order page in case our page belongs to one.
>
> Once the page is under our control, we call page_set_poison to set it
I guess you mean page_handle_poison here.
> as poisoned and grab a refcount on it.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> mm/memory-failure.c | 20 +++++++++++-----
> mm/page_alloc.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 82 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 37b230b8cfe7..1d986580522d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -78,6 +78,15 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
> EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
> EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
>
> +extern bool take_page_off_buddy(struct page *page);
> +
> +static void page_handle_poison(struct page *page)
hwpoison is a separate idea from page poisoning, so I think it may be
better to name it something like page_handle_hwpoison().
> +{
> + SetPageHWPoison(page);
> + page_ref_inc(page);
> + num_poisoned_pages_inc();
> +}
> +
> static int hwpoison_filter_dev(struct page *p)
> {
> struct address_space *mapping;
> @@ -1830,14 +1839,13 @@ static int soft_offline_in_use_page(struct page *page)
>
> static int soft_offline_free_page(struct page *page)
> {
> - int rc = dissolve_free_huge_page(page);
> + int rc = -EBUSY;
>
> - if (!rc) {
> - if (set_hwpoison_free_buddy_page(page))
> - num_poisoned_pages_inc();
> - else
> - rc = -EBUSY;
> + if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> + page_handle_poison(page);
> + rc = 0;
> }
> +
> return rc;
> }
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cd1dd0712624..255df0c76a40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8632,6 +8632,74 @@ bool is_free_buddy_page(struct page *page)
>
> #ifdef CONFIG_MEMORY_FAILURE
> /*
> + * Break down a higher-order page in sub-pages, and keep our target out of
> + * buddy allocator.
> + */
> +static void break_down_buddy_pages(struct zone *zone, struct page *page,
> + struct page *target, int low, int high,
> + struct free_area *area, int migratetype)
> +{
> + unsigned long size = 1 << high;
> + struct page *current_buddy, *next_page;
> +
> + while (high > low) {
> + area--;
> + high--;
> + size >>= 1;
> +
> + if (target >= &page[size]) {
> + next_page = page + size;
> + current_buddy = page;
> + } else {
> + next_page = page;
> + current_buddy = page + size;
> + }
> +
> + if (set_page_guard(zone, current_buddy, high, migratetype))
> + continue;
> +
> + if (current_buddy != target) {
> + add_to_free_area(current_buddy, area, migratetype);
> + set_page_order(current_buddy, high);
> + page = next_page;
> + }
> + }
> +}
> +
> +/*
> + * Take a page that will be marked as poisoned off the buddy allocator.
> + */
> +bool take_page_off_buddy(struct page *page)
> + {
> + struct zone *zone = page_zone(page);
> + unsigned long pfn = page_to_pfn(page);
> + unsigned long flags;
> + unsigned int order;
> + bool ret = false;
> +
> + spin_lock_irqsave(&zone->lock, flags);
> + for (order = 0; order < MAX_ORDER; order++) {
> + struct page *page_head = page - (pfn & ((1 << order) - 1));
> + int buddy_order = page_order(page_head);
> + struct free_area *area = &(zone->free_area[buddy_order]);
> +
> + if (PageBuddy(page_head) && buddy_order >= order) {
> + unsigned long pfn_head = page_to_pfn(page_head);
> + int migratetype = get_pfnblock_migratetype(page_head,
> + pfn_head);
> +
> + del_page_from_free_area(page_head, area);
> + break_down_buddy_pages(zone, page_head, page, 0,
> + buddy_order, area, migratetype);
> + ret = true;
> + break;
indent with whitespace?
And you can find a few more coding style warnings with checkpatch.pl.
BTW, if we want the unpoison mechanism to keep up with the
new semantics, we will need the reverse operation of take_page_off_buddy().
Do you think that part will come as separate work?
Thanks,
Naoya Horiguchi
> + }
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> + return ret;
> + }
> +
> +/*
> * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
> * test is performed under the zone lock to prevent a race against page
> * allocation.
> --
> 2.12.3
>
>
On Thu, Oct 17, 2019 at 04:21:21PM +0200, Oscar Salvador wrote:
> Currently, there is an inconsistency when calling soft-offline from
> different paths on a page that is already poisoned.
>
> 1) madvise:
>
> madvise_inject_error skips any poisoned page and continues
> the loop.
> If that was the only page to madvise, it returns 0.
>
> 2) /sys/devices/system/memory/:
>
> When calling soft_offline_page_store()->soft_offline_page(),
> we return -EBUSY in case the page is already poisoned.
> This is inconsistent with a) the above example and b)
> memory_failure, where we return 0 if the page was poisoned.
>
> Fix this by dropping the PageHWPoison() check in madvise_inject_error,
> and let soft_offline_page return 0 if it finds the page already poisoned.
>
> Please note that this represents a user-API change, since the error returned
> when calling soft_offline_page_store()->soft_offline_page() will now be different.
>
> Signed-off-by: Oscar Salvador <[email protected]>
Looks good to me.
Acked-by: Naoya Horiguchi <[email protected]>
> ---
> mm/madvise.c | 3 ---
> mm/memory-failure.c | 4 ++--
> 2 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 8a0b1f901d72..9ca48345ce45 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -887,9 +887,6 @@ static int madvise_inject_error(int behavior,
> */
> put_page(page);
>
> - if (PageHWPoison(page))
> - continue;
> -
> if (behavior == MADV_SOFT_OFFLINE) {
> pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> pfn, start);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3d491c0d3f91..c038896bedf0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1767,7 +1767,7 @@ static int __soft_offline_page(struct page *page)
> unlock_page(page);
> put_page(page);
> pr_info("soft offline: %#lx page already poisoned\n", pfn);
> - return -EBUSY;
> + return 0;
> }
>
> if (!PageHuge(page))
> @@ -1866,7 +1866,7 @@ int soft_offline_page(struct page *page)
>
> if (PageHWPoison(page)) {
> pr_info("soft offline: %#lx page already poisoned\n", pfn);
> - return -EBUSY;
> + return 0;
> }
>
> get_online_mems();
> --
> 2.12.3
>
>
On 17.10.19 16:21, Oscar Salvador wrote:
> memory_failure() already performs the same checks, so leave it
> to the main routine.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> mm/hwpoison-inject.c | 33 +++------------------------------
> 1 file changed, 3 insertions(+), 30 deletions(-)
>
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index 0c8cdb80fd7d..fdcca3df4283 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -14,49 +14,22 @@ static struct dentry *hwpoison_dir;
> static int hwpoison_inject(void *data, u64 val)
> {
> unsigned long pfn = val;
> - struct page *p;
> - struct page *hpage;
> - int err;
>
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> - if (!pfn_valid(pfn))
> - return -ENXIO;
> -
> - p = pfn_to_page(pfn);
> - hpage = compound_head(p);
> -
> - if (!hwpoison_filter_enable)
> - goto inject;
> -
> - shake_page(hpage, 0);
> - /*
> - * This implies unable to support non-LRU pages.
> - */
> - if (!PageLRU(hpage) && !PageHuge(p))
> - return 0;
> -
> - /*
> - * do a racy check to make sure PG_hwpoison will only be set for
> - * the targeted owner (or on a free page).
> - * memory_failure() will redo the check reliably inside page lock.
> - */
> - err = hwpoison_filter(hpage);
> - if (err)
> - return 0;
> -
> -inject:
> pr_info("Injecting memory failure at pfn %#lx\n", pfn);
> return memory_failure(pfn, 0);
> }
>
I explored somewhere already why this code was added:
commit 31d3d3484f9bd263925ecaa341500ac2df3a5d9b
Author: Wu Fengguang <[email protected]>
Date: Wed Dec 16 12:19:59 2009 +0100
HWPOISON: limit hwpoison injector to known page types
__memory_failure()'s workflow is
set PG_hwpoison
//...
unset PG_hwpoison if didn't pass hwpoison filter
That could kill unrelated process if it happens to page fault on the
page with the (temporary) PG_hwpoison. The race should be big enough to
appear in stress tests.
Fix it by grabbing the page and checking filter at inject time. This
also avoids the very noisy "Injecting memory failure..." messages.
- we don't touch madvise() based injection, because the filters are
generally not necessary for it.
- if we want to apply the filters to h/w aided injection, we'd better to
rearrange the logic in __memory_failure() instead of this patch.
AK: fix documentation, use drain all, cleanups
You should justify why it is okay to rip that code out now.
It's not just duplicate checks.
Was the documented race fixed?
Will we fix the race within memory_failure() later?
Don't we care?
Also, you should add that this fixes the access of uninitialized memmaps
now and makes the interface work correctly with devmem.
--
Thanks,
David / dhildenb
On Mon, Oct 21, 2019 at 07:04:40AM +0000, Naoya Horiguchi wrote:
> On Thu, Oct 17, 2019 at 04:21:16PM +0200, Oscar Salvador wrote:
> > Place the THP's page handling in a helper and use it
> > from both hard and soft-offline machinery, so we get rid
> > of some duplicated code.
> >
> > Signed-off-by: Oscar Salvador <[email protected]>
...
> > @@ -1288,21 +1307,8 @@ int memory_failure(unsigned long pfn, int flags)
> > }
> >
> > if (PageTransHuge(hpage)) {
> > - lock_page(p);
> > - if (!PageAnon(p) || unlikely(split_huge_page(p))) {
> > - unlock_page(p);
> > - if (!PageAnon(p))
> > - pr_err("Memory failure: %#lx: non anonymous thp\n",
> > - pfn);
> > - else
> > - pr_err("Memory failure: %#lx: thp split failed\n",
> > - pfn);
> > - if (TestClearPageHWPoison(p))
> > - num_poisoned_pages_dec();
> > - put_page(p);
> > + if (try_to_split_thp_page(p, "Memory Failure") < 0)
> > return -EBUSY;
>
> Although this is not a cleanup thing, this failure path means that
> hwpoison is handled (PG_hwpoison is marked), so action_result() should
> be called. I'll add a patch for this later.
Here's the one. So Oscar, if you like, could you append this to
your tree in the next spin (with your credit or signed-off-by)?
Thanks,
Naoya Horiguchi
---
From b920f965485f6679ddc27e1a51da5bff7a5cc81a Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <[email protected]>
Date: Mon, 21 Oct 2019 18:42:33 +0900
Subject: [PATCH] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
memory_failure() is supposed to call action_result() when it handles
a memory error event, but there's one missing case. So let's add it.
I find that include/ras/ras_event.h has some other MF_MSG_* undefined,
so this patch also adds them.
Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/mm.h | 1 +
include/ras/ras_event.h | 3 +++
mm/memory-failure.c | 5 ++++-
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3eba26324ff1..022033cc6782 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2818,6 +2818,7 @@ enum mf_action_page_type {
MF_MSG_BUDDY,
MF_MSG_BUDDY_2ND,
MF_MSG_DAX,
+ MF_MSG_UNSPLIT_THP,
MF_MSG_UNKNOWN,
};
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 36c5c5e38c1d..0bdbc0d17d2f 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -361,6 +361,7 @@ TRACE_EVENT(aer_event,
EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" ) \
EM ( MF_MSG_HUGE, "huge page" ) \
EM ( MF_MSG_FREE_HUGE, "free huge page" ) \
+ EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" ) \
EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" ) \
EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" ) \
EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" ) \
@@ -373,6 +374,8 @@ TRACE_EVENT(aer_event,
EM ( MF_MSG_TRUNCATED_LRU, "already truncated LRU page" ) \
EM ( MF_MSG_BUDDY, "free buddy page" ) \
EM ( MF_MSG_BUDDY_2ND, "free buddy page (2nd try)" ) \
+ EM ( MF_MSG_DAX, "dax page" ) \
+ EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" ) \
EMe ( MF_MSG_UNKNOWN, "unknown page" )
/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 46ca856703f6..b15086ad8948 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -583,6 +583,7 @@ static const char * const action_page_types[] = {
[MF_MSG_BUDDY] = "free buddy page",
[MF_MSG_BUDDY_2ND] = "free buddy page (2nd try)",
[MF_MSG_DAX] = "dax page",
+ [MF_MSG_UNSPLIT_THP] = "unsplit thp",
[MF_MSG_UNKNOWN] = "unknown page",
};
@@ -1361,8 +1362,10 @@ int memory_failure(unsigned long pfn, int flags)
}
if (PageTransHuge(hpage)) {
- if (try_to_split_thp_page(p, "Memory Failure") < 0)
+ if (try_to_split_thp_page(p, "Memory Failure") < 0) {
+ action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
return -EBUSY;
+ }
VM_BUG_ON_PAGE(!page_count(p), p);
hpage = compound_head(p);
}
--
2.17.1
On Fri, Oct 18, 2019 at 02:06:15PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 16:21:17, Oscar Salvador wrote:
> [...]
> > +bool take_page_off_buddy(struct page *page)
> > + {
> > + struct zone *zone = page_zone(page);
> > + unsigned long pfn = page_to_pfn(page);
> > + unsigned long flags;
> > + unsigned int order;
> > + bool ret = false;
> > +
> > + spin_lock_irqsave(&zone->lock, flags);
>
> What prevents the page from being allocated in the meantime? Also what about
> free pages on the pcp lists? Also the page could be gone by the time you
> have reached here.
Nothing prevents the page from being allocated in the meantime.
We would just bail out and return -EBUSY to userspace.
Since we do not do __anything__ to the page until we are sure we took it off,
and it is completely isolated from the memory, there is no danger.
Since soft-offline is kinda "best effort" mode, it is something like:
"Sorry, could not poison the page, try again".
Now, thinking about this a bit more, I guess we could be more clever here
and call the routine that handles in-use pages if we see that the page
was allocated by the time we reach take_page_off_buddy.
About pcp pages, you are right.
I thought that we were already handling that case, and we do, but looking closer the
call to shake_page() (that among other things spills pcppages into buddy)
is performed at a later stage.
I think we need to adjust __get_any_page to recognize pcp pages as well.
I will do some work here.
Thanks for comments.
--
Oscar Salvador
SUSE L3
On Fri, Oct 18, 2019 at 02:39:01PM +0200, Michal Hocko wrote:
>
> I am sorry but I got lost in the above description and I cannot really
> make much sense from the code either. Let me try to outline the way how
> I think about this.
>
> Say we have a pfn to hwpoison. We have effectively three possibilities
> - page is poisoned already - done, nothing to do
> - page is managed by the buddy allocator - excavate from there
> - page is in use
>
> The last category is the most interesting one. There are essentially
> three classes of pages
> - freeable
> - migrateable
> - others
>
> We cannot really do much about the last one, right? Do we mark them
> HWPoison anyway?
We can only perform actions on LRU/Movable pages or hugetlb pages.
So unless the page falls into one of those categories, we do not do
anything with it.
> Freeable should be simply marked HWPoison and freed.
> For all those migrateable, we simply do migrate and mark HWPoison.
> Now the main question is how to handle HWPoison page when it is freed
> - aka last reference is dropped. The main question is whether the last
> reference is ever dropped. If yes then the free_pages_prepare should
> never release it to the allocator (some compound destructors would have
> to special case as well, e.g. hugetlb would have to hand over to the
> allocator rather than a pool). If not then the page would be lingering
> potentially with some state bound to it (e.g. memcg charge). So I
> suspect you want the former.
For non-hugetlb pages, we do not call put_page in the migration path,
but we do it in page_handle_poison, after the page has been flagged as
hwpoison.
Then the check in free_pages_prepare will see that the page is hwpoison
and will bail out, so the page is not released into the allocator/pcp lists.
Hugetlb pages follow a different methodology.
They are dissolved, and then we split the higher-order page and take the
page off the buddy.
The problem is that while it is easy to hold a non-hugetlb page,
doing the same for hugetlb pages is not that easy:
1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
into hugetlb freelists
2) when trying to free a hugetlb page, we would need to do as we do for gigantic
pages now, and that is breaking the page down into order-0 pages and releasing
them to the buddy (so the check in free_pages_prepare would skip the
hwpoison page).
Trying to handle a higher-order hwpoison page in free_pages_prepare is
a bit complicated.
There is one thing I was unsure though.
Bailing out at the beginning of free_pages_prepare if the page is hwpoison
means that the calls to
- __memcg_kmem_uncharge
- page_cpupid_reset_last
- reset_page_owner
- ...
will not be performed.
I thought this was right because the page is not really "free", it is just
unusable, so... should it still be charged to the memcg?
--
Oscar Salvador
SUSE L3
On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> On Fri, Oct 18, 2019 at 02:39:01PM +0200, Michal Hocko wrote:
> >
> > I am sorry but I got lost in the above description and I cannot really
> > make much sense from the code either. Let me try to outline the way how
> > I think about this.
> >
> > Say we have a pfn to hwpoison. We have effectively three possibilities
> > - page is poisoned already - done, nothing to do
> > - page is managed by the buddy allocator - excavate from there
> > - page is in use
> >
> > The last category is the most interesting one. There are essentially
> > three classes of pages
> > - freeable
> > - migrateable
> > - others
> >
> > We cannot really do much about the last one, right? Do we mark them
> > HWPoison anyway?
>
> We can only perform actions on LRU/Movable pages or hugetlb pages.
What would prevent other pages mapped via page tables to be handled as
well?
> So unless the page falls into one of those categories, we do not do
> anything with it.
>
> > Freeable should be simply marked HWPoison and freed.
> > For all those migrateable, we simply do migrate and mark HWPoison.
> > Now the main question is how to handle HWPoison page when it is freed
> > - aka last reference is dropped. The main question is whether the last
> > reference is ever dropped. If yes then the free_pages_prepare should
> > never release it to the allocator (some compound destructors would have
> > to special case as well, e.g. hugetlb would have to hand over to the
> > allocator rather than a pool). If not then the page would be lingering
> > potentially with some state bound to it (e.g. memcg charge). So I
> > suspect you want the former.
>
> For non-hugetlb pages, we do not call put_page in the migration path,
> but we do it in page_handle_poison, after the page has been flagged as
> hwpoison.
> Then the check in free_pages_prepare will see that the page is hwpoison
> and will bail out, so the page is not released into the allocator/pcp lists.
>
> Hugetlb pages follow a different methodology.
> They are dissolved, and then we split the higher-order page and take the
> page off the buddy.
> The problem is that while it is easy to hold a non-hugetlb page,
> doing the same for hugetlb pages is not that easy:
>
> 1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
> into hugetlb freelists
> 2) when trying to free a hugetlb page, we would need to do as we do for gigantic
> pages now, and that is breaking the page down into order-0 pages and releasing
> them to the buddy (so the check in free_pages_prepare would skip the
> hwpoison page).
> Trying to handle a higher-order hwpoison page in free_pages_prepare is
> a bit complicated.
I am not sure I see the problem. If you dissolve the hugetlb page then
there is no hugetlb page anymore and so you make it a regular high-order
page.
> There is one thing I was unsure though.
> Bailing out at the beginning of free_pages_prepare if the page is hwpoison
> means that the calls to
>
> - __memcg_kmem_uncharge
> - page_cpupid_reset_last
> - reset_page_owner
> - ...
>
> will not be performed.
> I thought this was right because the page is not really "free", it is just
> unusable, so... should it still be charged to the memcg?
If the page is free then it shouldn't pin the memcg or any other state.
--
Michal Hocko
SUSE Labs
On Mon 21-10-19 14:58:49, Oscar Salvador wrote:
> On Fri, Oct 18, 2019 at 02:06:15PM +0200, Michal Hocko wrote:
> > On Thu 17-10-19 16:21:17, Oscar Salvador wrote:
> > [...]
> > > +bool take_page_off_buddy(struct page *page)
> > > + {
> > > + struct zone *zone = page_zone(page);
> > > + unsigned long pfn = page_to_pfn(page);
> > > + unsigned long flags;
> > > + unsigned int order;
> > > + bool ret = false;
> > > +
> > > + spin_lock_irqsave(&zone->lock, flags);
> >
> > What prevents the page from being allocated in the meantime? Also what about
> > free pages on the pcp lists? Also the page could be gone by the time you
> > have reached here.
>
> > Nothing prevents the page from being allocated in the meantime.
> We would just bail out and return -EBUSY to userspace.
> Since we do not do __anything__ to the page until we are sure we took it off,
> and it is completely isolated from the memory, there is no danger.
Wouldn't it be better to simply check the PageBuddy state after the lock
has been taken?
> Since soft-offline is kinda "best effort" mode, it is something like:
> "Sorry, could not poison the page, try again".
Well, I would disagree here. While madvise is indeed a best effort
operation please keep in mind that the sole purpose of this interface is
to allow real MCE behavior. And that operation should better try
_really_ hard to make sure we try to recover as gracefully as possible.
> Now, thinking about this a bit more, I guess we could be more clever here
> and call the routine that handles in-use pages if we see that the page
> was allocated by the time we reach take_page_off_buddy.
>
> About pcp pages, you are right.
> I thought that we were already handling that case, and we do, but looking closer the
> call to shake_page() (that among other things spills pcppages into buddy)
> is performed at a later stage.
> I think we need to adjust __get_any_page to recognize pcp pages as well.
Yeah, pcp pages are PITA. We cannot really recognize them now. Dropping
all pcp pages is certainly a way to go but we need to mark the page
before that happens.
--
Michal Hocko
SUSE Labs
On Mon, Oct 21, 2019 at 05:41:58PM +0200, Michal Hocko wrote:
> On Mon 21-10-19 14:58:49, Oscar Salvador wrote:
> > Nothing prevents the page from being allocated in the meantime.
> > We would just bail out and return -EBUSY to userspace.
> > Since we do not do __anything__ to the page until we are sure we took it off,
> > and it is completely isolated from the memory, there is no danger.
>
> Wouldn't it be better to simply check the PageBuddy state after the lock
> has been taken?
We already do that:
bool take_page_off_buddy(struct page *page)
{
...
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
int buddy_order = page_order(page_head);
struct free_area *area = &(zone->free_area[buddy_order]);
if (PageBuddy(page_head) && buddy_order >= order) {
...
}
Actually, we __only__ call break_down_buddy_pages() (which breaks down a higher-order page
and keeps our page out of buddy) if that is true.
> > Since soft-offline is kinda "best effort" mode, it is something like:
> > "Sorry, could not poison the page, try again".
>
> Well, I would disagree here. While madvise is indeed a best effort
> operation please keep in mind that the sole purpose of this interface is
> to allow real MCE behavior. And that operation should better try
> _really_ hard to make sure we try to recover as gracefully as possible.
In this case, there is nothing to be recovered from.
If we wanted to soft-offline a page that was free, and then it was allocated
in the meantime, there is no harm in that as we do not flag the page
until we are sure it is under our control.
That means:
- for free pages: it was successfully taken off buddy
- for in-use pages: it was freed or migrated
So, opposite to hard-offline, in soft-offline we do not fiddle with pages
unless we are sure the page is not reachable anymore by any means.
> > Now, thinking about this a bit more, I guess we could be more clever here
> > and call the routine that handles in-use pages if we see that the page
> > was allocated by the time we reach take_page_off_buddy.
> >
> > About pcp pages, you are right.
> > I thought that we were already handling that case, and we do, but looking closer the
> > call to shake_page() (that among other things spills pcppages into buddy)
> > is performed at a later stage.
> > I think we need to adjust __get_any_page to recognize pcp pages as well.
>
> Yeah, pcp pages are PITA. We cannot really recognize them now. Dropping
> all pcp pages is certainly a way to go but we need to mark the page
> before that happens.
I will work on that.
--
Oscar Salvador
SUSE L3
On Mon, Oct 21, 2019 at 04:06:19PM +0200, Michal Hocko wrote:
> On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> > We can only perform actions on LRU/Movable pages or hugetlb pages.
>
> What would prevent other pages mapped via page tables to be handled as
> well?
What kind of pages?
I mean, I guess it could be done, it was just not implemented, and I
did not want to add more "features" as my main goal was to re-work
the interface to be more deterministic.
> > 1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
> > into hugetlb freelists
> > 2) when trying to free a hugetlb page, we would need to do as we do for gigantic
> > pages now, and that is breaking down the pages into order-0 pages and release
> > them to the buddy (so the check in free_pages_prepare would skip the
> > hwpoison page).
> > Trying to handle a higher-order hwpoison page in free_pages_prepare is
> > a bit complicated.
>
> I am not sure I see the problem. If you dissolve the hugetlb page then
> there is no hugetlb page anymore and so you make it a regular high-order
> page.
Yes, but the problem comes when trying to work with a hwpoison high-order page
in free_pages_prepare: it gets more complicated, and when I weighed
code vs. benefits, I was not really sure I wanted to go down that road.
If we get a hwpoison high-order page in free_pages_prepare, we need to
break it down into smaller pages, so we can skip the "bad" one and not
send it into the buddy allocator.
> If the page is free then it shouldn't pin the memcg or any other state.
Well, it is not really free, as it is not usable, is it?
Anyway, I do agree that we should clean the bondings to other subsystems like memcg.
--
Oscar Salvador
SUSE L3
On Mon, Oct 21, 2019 at 07:45:33AM +0000, Naoya Horiguchi wrote:
> > +extern bool take_page_off_buddy(struct page *page);
> > +
> > +static void page_handle_poison(struct page *page)
>
> hwpoison is a separate idea from page poisoning, so I think it might be
> better to name it something like page_handle_hwpoison().
Yeah, that sounds better.
> BTW, if we consider to make unpoison mechanism to keep up with the
> new semantics, we will need the reverse operation of take_page_off_buddy().
> Do you think that that part will come with a separate work?
Well, I am not really sure.
Since we grab a refcount in page_handle_poison, all the unpoison mechanism
has to do is a put_page(), which should send the page back to the buddy/pcp lists.
I did not spot any problem when testing it, but I will double check.
Thanks Naoya.
--
Oscar Salvador
SUSE L3
On Mon, Oct 21, 2019 at 06:51:09PM +0900, Naoya Horiguchi wrote:
> Here's the one. So Oscar, if you like, could you append this to
> your tree in the next spin (with your credit or signed-off-by)?
Sure, I will add it.
Thanks
>
> Thanks,
> Naoya Horiguchi
> ---
> From b920f965485f6679ddc27e1a51da5bff7a5cc81a Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <[email protected]>
> Date: Mon, 21 Oct 2019 18:42:33 +0900
> Subject: [PATCH] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
>
> memory_failure() is supposed to call action_result() when it handles
> a memory error event, but there's one missing case. So let's add it.
>
> I find that include/ras/ras_event.h has some other MF_MSG_* undefined,
> so this patch also adds them.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> include/linux/mm.h | 1 +
> include/ras/ras_event.h | 3 +++
> mm/memory-failure.c | 5 ++++-
> 3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3eba26324ff1..022033cc6782 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2818,6 +2818,7 @@ enum mf_action_page_type {
> MF_MSG_BUDDY,
> MF_MSG_BUDDY_2ND,
> MF_MSG_DAX,
> + MF_MSG_UNSPLIT_THP,
> MF_MSG_UNKNOWN,
> };
>
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 36c5c5e38c1d..0bdbc0d17d2f 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -361,6 +361,7 @@ TRACE_EVENT(aer_event,
> EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" ) \
> EM ( MF_MSG_HUGE, "huge page" ) \
> EM ( MF_MSG_FREE_HUGE, "free huge page" ) \
> + EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" ) \
> EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" ) \
> EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" ) \
> EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" ) \
> @@ -373,6 +374,8 @@ TRACE_EVENT(aer_event,
> EM ( MF_MSG_TRUNCATED_LRU, "already truncated LRU page" ) \
> EM ( MF_MSG_BUDDY, "free buddy page" ) \
> EM ( MF_MSG_BUDDY_2ND, "free buddy page (2nd try)" ) \
> + EM ( MF_MSG_DAX, "dax page" ) \
> + EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" ) \
> EMe ( MF_MSG_UNKNOWN, "unknown page" )
>
> /*
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 46ca856703f6..b15086ad8948 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -583,6 +583,7 @@ static const char * const action_page_types[] = {
> [MF_MSG_BUDDY] = "free buddy page",
> [MF_MSG_BUDDY_2ND] = "free buddy page (2nd try)",
> [MF_MSG_DAX] = "dax page",
> + [MF_MSG_UNSPLIT_THP] = "unsplit thp",
> [MF_MSG_UNKNOWN] = "unknown page",
> };
>
> @@ -1361,8 +1362,10 @@ int memory_failure(unsigned long pfn, int flags)
> }
>
> if (PageTransHuge(hpage)) {
> - if (try_to_split_thp_page(p, "Memory Failure") < 0)
> + if (try_to_split_thp_page(p, "Memory Failure") < 0) {
> + action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> return -EBUSY;
> + }
> VM_BUG_ON_PAGE(!page_count(p), p);
> hpage = compound_head(p);
> }
> --
> 2.17.1
>
--
Oscar Salvador
SUSE L3
On Tue 22-10-19 09:46:20, Oscar Salvador wrote:
[...]
> So, opposite to hard-offline, in soft-offline we do not fiddle with pages
> unless we are sure the page is not reachable anymore by any means.
I have to say I do not follow. Is there any _real_ reason for
soft-offline to behave differently from MCE (hard-offline)?
--
Michal Hocko
SUSE Labs
On Tue 22-10-19 09:56:27, Oscar Salvador wrote:
> On Mon, Oct 21, 2019 at 04:06:19PM +0200, Michal Hocko wrote:
> > On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> > > We can only perform actions on LRU/Movable pages or hugetlb pages.
> >
> > What would prevent other pages mapped via page tables to be handled as
> > well?
>
> What kind of pages?
Any pages mapped to the userspace. E.g. driver memory which is not on
LRU.
> I mean, I guess it could be done, it was just not implemented, and I
> did not want to add more "features" as my main goal was to re-work
> the interface to be more deterministic.
Fair enough. One step at a time definitely sounds good.
> > > 1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
> > > into hugetlb freelists
> > > 2) when trying to free a hugetlb page, we would need to do as we do for gigantic
> > > pages now, and that is breaking down the pages into order-0 pages and release
> > > them to the buddy (so the check in free_pages_prepare would skip the
> > > hwpoison page).
> > > Trying to handle a higher-order hwpoison page in free_pages_prepare is
> > > a bit complicated.
> >
> > I am not sure I see the problem. If you dissolve the hugetlb page then
> > there is no hugetlb page anymore and so you make it a regular high-order
> > page.
>
> Yes, but the problem comes when trying to work with a hwpoison high-order page
> in free_pages_prepare, it gets more complicated, and when I weighed
> code vs benefits, I was not really sure to go down that road.
>
> If we get a hwpoison high-order page in free_pages_prepare, we need to
> break it down to smaller pages, so we can skip the "bad" to not be sent
> into buddy allocator.
But we have destructors for compound pages. Can we do the heavy lifting
there?
> > If the page is free then it shouldn't pin the memcg or any other state.
>
> Well, it is not really free, as it is not usable, is it?
Sorry I meant to say the page is free from the memcg POV - aka no task
from the memcg is holding a reference to it. The page is not usable for
anybody, that is true but no particular memcg should pay a price for
that. This would mean that random memcgs would end up pinned for ever
without a good reason.
--
Michal Hocko
SUSE Labs
On Tue 22-10-19 10:35:17, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 10:26:11AM +0200, Michal Hocko wrote:
> > On Tue 22-10-19 09:46:20, Oscar Salvador wrote:
> > [...]
> > > So, opposite to hard-offline, in soft-offline we do not fiddle with pages
> > > unless we are sure the page is not reachable anymore by any means.
> >
> > I have to say I do not follow. Is there any _real_ reason for
> > soft-offline to behave differently from MCE (hard-offline)?
>
> Yes.
> Do not take it as 100% true as I read that in some code/Documentation
> a while ago.
>
> But I think that it boils down to:
>
> soft-offline: "We have seen some errors in the underlying page, but
> it is still usable, so we have a chance to keep
> the contents (via migration)"
> hard-offline: "The underlying page is dead, we cannot trust it, so
> we shut it down, killing whoever is holding it
> along the way".
Hmm, that might be a misunderstanding on my end. I thought that it is
the MCE handler to say whether the failure is recoverable or not. If yes
then we can touch the content of the memory (that would imply the
migration). Other than that both paths should be essentially the same,
no? Well, the unrecoverable case would essentially be a forced migration
failure path.
MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
: This feature is intended for testing of memory error-handling
: code; it is available only if the kernel was configured with
: CONFIG_MEMORY_FAILURE.
There is no explicit note about the type of the error that is injected
but I think it is reasonably safe to assume this is a recoverable one.
--
Michal Hocko
SUSE Labs
On Tue, Oct 22, 2019 at 10:30:02AM +0200, Michal Hocko wrote:
> But we have destructors for compound pages. Can we do the heavy lifting
> there?
Yes, we could.
Actually, I tried that approach, but I thought it was simpler this way.
Since there is no hurry in this, I will try to take that up again and
see how it looks.
> > > If the page is free then it shouldn't pin the memcg or any other state.
> >
> > Well, it is not really free, as it is not usable, is it?
>
> Sorry I meant to say the page is free from the memcg POV - aka no task
> from the memcg is holding a reference to it. The page is not usable for
> anybody, that is true but no particular memcg should pay a price for
> that. This would mean that random memcgs would end up pinned for ever
> without a good reason.
Sure, I will re-work that.
--
Oscar Salvador
SUSE L3
On Tue, Oct 22, 2019 at 11:22:56AM +0200, Michal Hocko wrote:
> Hmm, that might be a misunderstanding on my end. I thought that it is
> the MCE handler to say whether the failure is recoverable or not. If yes
> then we can touch the content of the memory (that would imply the
> migration). Other than that both paths should be essentially the same,
> no? Well unrecoverable case would be essentially force migration failure
> path.
>
> MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
> : This feature is intended for testing of memory error-handling
> : code; it is available only if the kernel was configured with
> : CONFIG_MEMORY_FAILURE.
>
> There is no explicit note about the type of the error that is injected
> but I think it is reasonably safe to assume this is a recoverable one.
MADV_HWPOISON stands for hard-offline.
MADV_SOFT_OFFLINE stands for soft-offline.
MADV_SOFT_OFFLINE (since Linux 2.6.33)
Soft offline the pages in the range specified by addr and
length. The memory of each page in the specified range is
preserved (i.e., when next accessed, the same content will be
visible, but in a new physical page frame), and the original
page is offlined (i.e., no longer used, and taken out of
normal memory management). The effect of the
MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
change the semantics of) the calling process.
This feature is intended for testing of memory error-handling
code; it is available only if the kernel was configured with
CONFIG_MEMORY_FAILURE.
But both follow different approaches.
I think it is up to some controllers to trigger soft-offline or hard-offline:
static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int sev)
{
#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
...
/* iff following two events can be handled properly by now */
if (sec_sev == GHES_SEV_CORRECTED &&
(gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
flags = MF_SOFT_OFFLINE;
if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
flags = 0;
if (flags != -1)
memory_failure_queue(pfn, flags);
...
#endif
}
static void memory_failure_work_func(struct work_struct *work)
{
...
for (;;) {
spin_lock_irqsave(&mf_cpu->lock, proc_flags);
gotten = kfifo_get(&mf_cpu->fifo, &entry);
spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
if (!gotten)
break;
if (entry.flags & MF_SOFT_OFFLINE)
soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
else
memory_failure(entry.pfn, entry.flags);
}
}
AFAICS, for hard-offline case, a recovered event would be if:
- the page to shut down is already free
- the page was unmapped
In some cases we need to kill the process if it holds dirty pages.
But we never migrate contents in hard-offline path.
I guess it is because we cannot really trust the contents anymore.
--
Oscar Salvador
SUSE L3
On Tue, Oct 22, 2019 at 12:24:57PM +0200, Michal Hocko wrote:
> Yes, that makes a perfect sense. What I am saying that the migration
> (aka trying to recover) is the main and only difference. The soft
> offline should poison page tables when not able to migrate as well
> IIUC.
Yeah, I see your point.
I do not really know why soft-offline strived so much to leave the page
untouched unless it was able to contain the problem.
Note that if we now start to poison pages even when we could not
contain them (in soft-offline mode), that is a big and user-visible
change.
Not saying it is wrong, but something to consider.
Anyway, I would like to put that aside as a follow-up
rework after this one, as this one already changes quite
some things.
--
Oscar Salvador
SUSE L3
On Mon, Oct 21, 2019 at 11:40:39AM +0200, David Hildenbrand wrote:
> I explored somewhere already why this code was added:
>
>
> commit 31d3d3484f9bd263925ecaa341500ac2df3a5d9b
> Author: Wu Fengguang <[email protected]>
> Date: Wed Dec 16 12:19:59 2009 +0100
>
> HWPOISON: limit hwpoison injector to known page types
>
> __memory_failure()'s workflow is
>
> set PG_hwpoison
> //...
> unset PG_hwpoison if didn't pass hwpoison filter
>
> That could kill unrelated process if it happens to page fault on the
> page with the (temporary) PG_hwpoison. The race should be big enough to
> appear in stress tests.
>
> Fix it by grabbing the page and checking filter at inject time. This
> also avoids the very noisy "Injecting memory failure..." messages.
>
> - we don't touch madvise() based injection, because the filters are
> generally not necessary for it.
> - if we want to apply the filters to h/w aided injection, we'd better to
> rearrange the logic in __memory_failure() instead of this patch.
>
> AK: fix documentation, use drain all, cleanups
>
>
> You should justify why it is okay to do rip that code out now.
> It's not just duplicate checks.
>
> Was the documented race fixed?
> Will we fix the race within memory_failure() later?
> Don't we care?
>
> Also, you should add that this fixes the access of uninitialized memmaps
> now and makes the interface work correctly with devmem.
Thanks for bringing this up, David.
I guess I was carried away.
Since I have to do another re-spin to re-work a couple of things, I will
work on this as well.
--
Oscar Salvador
SUSE L3
On Tue, Oct 22, 2019 at 10:26:11AM +0200, Michal Hocko wrote:
> On Tue 22-10-19 09:46:20, Oscar Salvador wrote:
> [...]
> > So, opposite to hard-offline, in soft-offline we do not fiddle with pages
> > unless we are sure the page is not reachable anymore by any means.
>
> I have to say I do not follow. Is there any _real_ reason for
> soft-offline to behave differently from MCE (hard-offline)?
Yes.
Do not take it as 100% true as I read that in some code/Documentation
a while ago.
But I think that it boils down to:
soft-offline: "We have seen some errors in the underlying page, but
it is still usable, so we have a chance to keep
the contents (via migration)"
hard-offline: "The underlying page is dead, we cannot trust it, so
we shut it down, killing whoever is holding it
along the way".
Am I wrong Naoya?
--
Oscar Salvador
SUSE L3
On Tue 22-10-19 11:58:52, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 11:22:56AM +0200, Michal Hocko wrote:
> > Hmm, that might be a misunderstanding on my end. I thought that it is
> > the MCE handler to say whether the failure is recoverable or not. If yes
> > then we can touch the content of the memory (that would imply the
> > migration). Other than that both paths should be essentially the same,
> > no? Well unrecoverable case would be essentially force migration failure
> > path.
> >
> > MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
> > : This feature is intended for testing of memory error-handling
> > : code; it is available only if the kernel was configured with
> > : CONFIG_MEMORY_FAILURE.
> >
> > There is no explicit note about the type of the error that is injected
> > but I think it is reasonably safe to assume this is a recoverable one.
>
> MADV_HWPOISON stands for hard-offline.
> MADV_SOFT_OFFLINE stands for soft-offline.
>
> MADV_SOFT_OFFLINE (since Linux 2.6.33)
> Soft offline the pages in the range specified by addr and
> length. The memory of each page in the specified range is
> preserved (i.e., when next accessed, the same content will be
> visible, but in a new physical page frame), and the original
> page is offlined (i.e., no longer used, and taken out of
> normal memory management). The effect of the
> MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
> change the semantics of) the calling process.
>
> This feature is intended for testing of memory error-handling
> code; it is available only if the kernel was configured with
> CONFIG_MEMORY_FAILURE.
I have missed that one somehow. Thanks for pointing out.
[...]
> AFAICS, for hard-offline case, a recovered event would be if:
>
> - the page to shut down is already free
> - the page was unmapped
>
> In some cases we need to kill the process if it holds dirty pages.
Yes, I would expect that the page table would be poisoned and the
process receive a SIGBUS when accessing that memory.
> But we never migrate contents in hard-offline path.
> I guess it is because we cannot really trust the contents anymore.
Yes, that makes perfect sense. What I am saying is that the migration
(aka trying to recover) is the main and only difference. The soft
offline should poison page tables when not able to migrate as well,
IIUC.
--
Michal Hocko
SUSE Labs
On Tue, Oct 22, 2019 at 12:33:25PM +0200, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 12:24:57PM +0200, Michal Hocko wrote:
> > Yes, that makes a perfect sense. What I am saying that the migration
> > (aka trying to recover) is the main and only difference. The soft
> > offline should poison page tables when not able to migrate as well
> > IIUC.
>
> Yeah, I see your point.
> I do not really know why soft-offline strived so much to leave the page
> untouched unless it was able to contain the problem.
>
> Note that if we now start to poison pages even when we could not
> contain them (in soft-offline mode), that is a big and user-visible
> change.
Soft offline is declared never to disrupt userspace by design,
so if we poison page tables on migration failure, we could break this
and send SIGBUSes. Then end users would complain that their processes
are killed by corrected (so non-urgent) errors.
Thanks,
Naoya Horiguchi
On Tue, Oct 22, 2019 at 11:58:52AM +0200, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 11:22:56AM +0200, Michal Hocko wrote:
> > Hmm, that might be a misunderstanding on my end. I thought that it is
> > the MCE handler to say whether the failure is recoverable or not. If yes
> > then we can touch the content of the memory (that would imply the
> > migration). Other than that both paths should be essentially the same,
> > no? Well unrecoverable case would be essentially force migration failure
> > path.
> >
> > MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
> > : This feature is intended for testing of memory error-handling
> > : code; it is available only if the kernel was configured with
> > : CONFIG_MEMORY_FAILURE.
> >
> > There is no explicit note about the type of the error that is injected
> > but I think it is reasonably safe to assume this is a recoverable one.
>
> MADV_HWPOISON stands for hard-offline.
> MADV_SOFT_OFFLINE stands for soft-offline.
Maybe MADV_HWPOISON should have been named something like MADV_HARD_OFFLINE,
although it's an API and hard to change once implemented.
>
> MADV_SOFT_OFFLINE (since Linux 2.6.33)
> Soft offline the pages in the range specified by addr and
> length. The memory of each page in the specified range is
> preserved (i.e., when next accessed, the same content will be
> visible, but in a new physical page frame), and the original
> page is offlined (i.e., no longer used, and taken out of
> normal memory management). The effect of the
> MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
> change the semantics of) the calling process.
>
> This feature is intended for testing of memory error-handling
> code;
Although this expression might not be clear enough, madvise(MADV_HWPOISON or
MADV_SOFT_OFFLINE) only covers the memory error handling part, not the MCE
handling part. We have some other injection methods in the lower layers,
like mce-inject and APEI.
> it is available only if the kernel was configured with
> CONFIG_MEMORY_FAILURE.
>
>
> But both follow different approaches.
>
> I think it is up to some controllers to trigger soft-offline or hard-offline:
Yes, I think so. One usecase of soft offline is triggered by CMCI interrupt
in Intel CPU. CMCI handler stores corrected error events in /dev/mcelog.
mcelogd polls on this device file and if corrected errors occur often enough
(IIRC the default threshold is "10 events in 24 hours",) mcelogd triggers
soft-offline via soft_offline_page under /sys.
OTOH, hard-offline is triggered directly (more accurately, over a ring buffer
to a separate context) from the MCE handler. mcelogd logs MCE events but is
not involved in the page offline logic.
>
> static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int sev)
> {
> #ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
> ...
> /* iff following two events can be handled properly by now */
> if (sec_sev == GHES_SEV_CORRECTED &&
> (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
> flags = MF_SOFT_OFFLINE;
> if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> flags = 0;
>
> if (flags != -1)
> memory_failure_queue(pfn, flags);
> ...
> #endif
> }
>
>
> static void memory_failure_work_func(struct work_struct *work)
> {
> ...
> for (;;) {
> spin_lock_irqsave(&mf_cpu->lock, proc_flags);
> gotten = kfifo_get(&mf_cpu->fifo, &entry);
> spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
> if (!gotten)
> break;
> if (entry.flags & MF_SOFT_OFFLINE)
> soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
> else
> memory_failure(entry.pfn, entry.flags);
> }
> }
>
> AFAICS, for hard-offline case, a recovered event would be if:
>
> - the page to shut down is already free
> - the page was unmapped
>
> In some cases we need to kill the process if it holds dirty pages.
One caveat is that even if the process maps dirty error pages, we
don't have to kill it unless the error data is consumed.
Thanks,
Naoya Horiguchi
Hello!
We are facing similar problems with hwpoisoned pages
on one of our production clusters after a kernel update to stable 4.19.
An application that does a lot of memory allocations sometimes caught a
SIGBUS signal, with a message in dmesg about a hardware memory corruption fault.
In the kernel and mce logs we saw messages about soft offlining pages with
correctable errors. Those events had always happened before the application
was killed. This is not the behavior we expect: we want our application to
continue working on a smaller set of available pages in the system.
This issue is difficult to reproduce, but we suppose that the reason for this
behavior is that compaction does not check whether pages are poisoned while
processing free pages, so valid userspace data ends up migrated to bad pages.
We wrote a simple test:
- soft offline the first 4 pages of every 64 contiguous pages in ZONE_NORMAL
  by writing the pfn to /sys/devices/system/memory/soft_offline_page
- force compaction by echo 1 >> /proc/sys/vm/compact_memory
Without this patch series, after these steps bash became unusable,
and every attempt to run any command led to a SIGBUS with a message about
a hardware memory corruption fault. After applying this series to our kernel
tree we cannot reproduce such SIGBUSes with our test. On upstream kernel 5.7
this behavior is still reproducible.
So, we would like to know: why wasn't this patchset merged upstream?
Are there any problems with such a rework of the {soft,hard}-offline handling?
BTW, this patchset should be updated with upstream changes in mm.
Thanks for your replies.
--
Dmitry Yakunin
Hi Dmitry,
On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
>
> We are faced with similar problems with hwpoisoned pages
> on one of our production clusters after kernel update to stable 4.19.
> Application that does a lot of memory allocations sometimes caught SIGBUS signal
> with message in dmesg about hardware memory corruption fault.
> In kernel and mce logs we saw messages about soft offlining pages with
> correctable errors. Those events always had happened before application
> was killed. This is not the behavior we expect. We want our application to
> continue working on a smaller set of available pages in the system.
>
> This issue is difficult to reproduce, but we suppose that the reason for such
> behavior is that compaction does not check for page poisonness while processing
> free pages, so as a result valid userspace data gets migrated to bad pages.
> We wrote the simple test:
> - soft offline first 4 pages in every 64 continuous pages in ZONE_NORMAL
> through writing pfn to /sys/devices/system/memory/soft_offline_page
> - force compaction by echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series after these steps bash became unusable
> and every attempt to run any command leads to SIGBUS with message about
> hardware memory corruption fault. And after applying this series to our kernel
> tree we cannot reproduce such SIGBUSes by our test. On upstream kernel 5.7
> this behavior is still reproducible.
>
> So, we want to know, why this patchset wasn't merged to the upstream?
> Is there any problems in such rework for {soft,hard}-offline handling?
No technical reason, it's just because I didn't have enough power to push
this to be merged. Really sorry about that.
> BTW, this patchset should be updated with upstream changes in mm.
I'm working on this now and still need more testing to confirm, but I hope
I'll update and post this for 5.9.
Thanks,
Naoya Horiguchi