This patchset includes some fixups (patch#1, patch#2 and patch#3)
and some cleanups (patch#4-7).
Patch#1 is a fix to take HWPoison pages off the buddy freelists, since
leaving them there can put HWPoison pages back in the game without
anyone noticing.
So fix it (we did that already for soft_offline_page [1]).
Patch#2 fixes a rebasing problem that made the call
to page_handle_poison from __soft_offline_page set the
wrong value for hugepage_or_freepage. [2]
Patch#3 is not really a fixup; it retries handling a page
in case it was allocated under us.
[1] https://patchwork.kernel.org/cover/11704083/
[2] https://patchwork.kernel.org/comment/23619775/
Thanks
Oscar Salvador (7):
mm,hwpoison: take free pages off the buddy freelists
mm,hwpoison: Do not set hugepage_or_freepage unconditionally
mm,hwpoison: Try to narrow window race for free pages
mm,hwpoison: refactor madvise_inject_error
mm,hwpoison: drain pcplists before bailing out for non-buddy
zero-refcount page
mm,hwpoison: drop unneeded pcplist draining
mm,hwpoison: remove stale code
mm/madvise.c | 36 +++++++++++----------------
mm/memory-failure.c | 59 +++++++++++++++++++++++++++++++++------------
2 files changed, 58 insertions(+), 37 deletions(-)
--
2.26.2
The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that act as a gatekeeper for poisoned pages. Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.
As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy system, checks in place or not.
Let us fix this the same way we did for soft_offline [2], and take the page
off the buddy freelist, so it is completely unreachable.
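For anyone not familiar with the buddy internals, the operation boils down
to something like the sketch below. This is only a conceptual illustration
(the helper name is made up, and the real take_page_off_buddy() used by this
patch also has to deal with the poisoned page sitting inside a higher-order
free block):

        /*
         * Conceptual sketch only: sketch_unlink_free_page() is a made-up
         * stand-in for the freelist unlink plus free-counter fixup.
         */
        static bool sketch_take_page_off_buddy(struct page *page)
        {
                struct zone *zone = page_zone(page);
                unsigned long flags;
                bool taken = false;

                spin_lock_irqsave(&zone->lock, flags);
                if (PageBuddy(page)) {
                        /* Still free: unlink it so nobody can allocate it */
                        sketch_unlink_free_page(zone, page);
                        taken = true;
                }
                spin_unlock_irqrestore(&zone->lock, flags);

                return taken;
        }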
Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then run some sort of memory
stress on the system.
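To make "fairly simple to trigger" concrete, the injection half can be as
small as the sketch below (illustrative only, not part of the series; it
needs CAP_SYS_ADMIN, and the fallback define matches the uapi value of
MADV_HWPOISON). The stress half is whatever makes compaction scan the
freelists, e.g. writing 1 to /proc/sys/vm/compact_memory:

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>

        #ifndef MADV_HWPOISON
        #define MADV_HWPOISON 100
        #endif

        int main(void)
        {
                long pagesize = sysconf(_SC_PAGESIZE);
                char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                memset(p, 'a', pagesize);       /* fault the page in */
                if (madvise(p, pagesize, MADV_HWPOISON))
                        perror("madvise(MADV_HWPOISON)");

                return 0;
        }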
Just for a matter of reference, I put a dump_page in compaction_alloc to
trigger on HWPoison pages:
kernel: page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
kernel: flags: 0xfffffc0800000(hwpoison)
kernel: raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
kernel: raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
kernel: page dumped because: compaction_alloc
kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
kernel: Call Trace:
kernel: dump_stack+0x6d/0x8b
kernel: compaction_alloc+0xb2/0xc0
kernel: migrate_pages+0x2a6/0x12a0
kernel: ? isolate_freepages+0xc80/0xc80
kernel: ? __ClearPageMovable+0xb0/0xb0
kernel: compact_zone+0x5eb/0x11c0
kernel: ? finish_task_switch+0x74/0x300
kernel: ? lock_timer_base+0xa8/0x170
kernel: proactive_compact_node+0x89/0xf0
kernel: ? kcompactd+0x2d0/0x3a0
kernel: kcompactd+0x2d0/0x3a0
kernel: ? finish_wait+0x80/0x80
kernel: ? kcompactd_do_work+0x350/0x350
kernel: kthread+0x118/0x130
kernel: ? kthread_associate_blkcg+0xa0/0xa0
kernel: ret_from_fork+0x22/0x30
After that, if e.g: someone faults in the page, that someone will get
killed unexpectedly.
While we are at it, I also changed the action result for such cases.
I think that MF_DELAYED is a bit misleading, because if we manage to
contain the page and take it off the buddy, such a page is not to be used
again unless it is unpoisoned, so we did fix the situation.
So unless I am missing something, I strongly think that we should report
MF_RECOVERED.
[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/patch/11694847/
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/memory-failure.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4eb3c42ffe35..74a53881f94b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1323,8 +1323,9 @@ int memory_failure(unsigned long pfn, int flags)
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
- int res;
unsigned long page_flags;
+ int res = 0;
+ bool retry = true;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1364,10 +1365,21 @@ int memory_failure(unsigned long pfn, int flags)
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
+try_again:
if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
if (is_free_buddy_page(p)) {
- action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
- return 0;
+ if (take_page_off_buddy(p)) {
+ action_result(pfn, MF_MSG_BUDDY, MF_RECOVERED);
+ return 0;
+ } else {
+ /* We lost the race, try again */
+ if (retry) {
+ retry = false;
+ goto try_again;
+ }
+ action_result(pfn, MF_MSG_BUDDY, MF_FAILED);
+ return -EBUSY;
+ }
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
return -EBUSY;
@@ -1393,11 +1405,15 @@ int memory_failure(unsigned long pfn, int flags)
shake_page(p, 0);
/* shake_page could have turned it free. */
if (!PageLRU(p) && is_free_buddy_page(p)) {
+ if (!take_page_off_buddy(p))
+ res = -EBUSY;
+
if (flags & MF_COUNT_INCREASED)
- action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+ action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : MF_RECOVERED);
else
- action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
- return 0;
+ action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : MF_RECOVERED);
+
+ return res;
}
lock_page(p);
--
2.26.2
Aristeu Rozanski reported that some hwpoison tests
were returning -EBUSY, while before the rework they succeeded [1] [2].
The root cause is that during a rebase, the call to page_handle_poison was
setting hugepage_or_freepage = true unconditionally from __soft_offline_page.
Aristeu said that after this fix, everything works.
[1] https://patchwork.kernel.org/comment/23617301/
[2] https://patchwork.kernel.org/comment/23619535/
Signed-off-by: Oscar Salvador <[email protected]>
Reported-by: Aristeu Rozanski <[email protected]>
---
mm/memory-failure.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 74a53881f94b..db61bdee9734 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1852,7 +1852,7 @@ static int __soft_offline_page(struct page *page)
if (!ret) {
bool release = !huge;
- if (!page_handle_poison(page, true, release))
+ if (!page_handle_poison(page, huge, release))
ret = -EBUSY;
} else {
if (!list_empty(&pagelist))
--
2.26.2
The memory_failure and soft_offline paths now drain pcplists by calling
get_hwpoison_page.
memory_failure flags the page as HWPoison beforehand, so that page can no
longer go into a pcplist, and soft_offline_page only flags a page as
HWPoison if 1) we took the page off a buddy freelist, 2) the page was
in-use and we migrated it, or 3) it was a clean pagecache page.
Because of that, a page can no longer be both poisoned and sitting in a
pcplist.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/madvise.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 4a48f7215195..302f3a84d17c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -914,10 +914,6 @@ static int madvise_inject_error(int behavior,
return ret;
}
- /* Ensure that all poisoned pages are removed from per-cpu lists */
- for_each_populated_zone(zone)
- drain_all_pages(zone);
-
return 0;
}
#endif
--
2.26.2
Aristeu Rozanski reported that a customer test case started
to report -EBUSY after the hwpoison rework patchset.
There is a race window between spotting a free page and taking it off
its buddy freelist, so it might be that by the time we try to take it off,
the page has already been allocated.
This patch handles that race window by retrying to handle the page, now
as its new page type, if it was allocated under us.
After this patch, Aristeu said the test cases work properly.
Signed-off-by: Oscar Salvador <[email protected]>
Reported-by: Aristeu Rozanski <[email protected]>
---
mm/memory-failure.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index db61bdee9734..a2ccd3ba4015 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1917,6 +1917,7 @@ int soft_offline_page(unsigned long pfn, int flags)
{
int ret;
struct page *page;
+ bool try_again = true;
if (!pfn_valid(pfn))
return -ENXIO;
@@ -1932,6 +1933,7 @@ int soft_offline_page(unsigned long pfn, int flags)
return 0;
}
+retry:
get_online_mems();
ret = get_any_page(page, pfn, flags);
put_online_mems();
@@ -1939,7 +1941,10 @@ int soft_offline_page(unsigned long pfn, int flags)
if (ret > 0)
ret = soft_offline_in_use_page(page);
else if (ret == 0)
- ret = soft_offline_free_page(page);
+ if (soft_offline_free_page(page) && try_again) {
+ try_again = false;
+ goto retry;
+ }
return ret;
}
--
2.26.2
Currently we call shake_page and then check whether the page is a free buddy
page, because shake_page calls drain_all_pages, which sends pcp pages back to
the buddy freelists, so we have a chance to handle free pages there.
get_hwpoison_page already calls drain_all_pages, and we do call
get_hwpoison_page right before coming here, so we should be on the safe
side.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/memory-failure.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7fba4ba201d5..f68cb5e3b320 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1423,18 +1423,6 @@ int memory_failure(unsigned long pfn, int flags)
* walked by the page reclaim code, however that's not a big loss.
*/
shake_page(p, 0);
- /* shake_page could have turned it free. */
- if (!PageLRU(p) && is_free_buddy_page(p)) {
- if (!take_page_off_buddy(p))
- res = -EBUSY;
-
- if (flags & MF_COUNT_INCREASED)
- action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : MF_RECOVERED);
- else
- action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : MF_RECOVERED);
-
- return res;
- }
lock_page(p);
--
2.26.2
Make a proper if-else condition for {hard,soft}-offline.
[akpm: remove zone variable and refactor comment]
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/madvise.c | 32 ++++++++++++++------------------
1 file changed, 14 insertions(+), 18 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 4d87b31a1576..4a48f7215195 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -869,8 +869,6 @@ static long madvise_remove(struct vm_area_struct *vma,
static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
{
- struct page *page;
- struct zone *zone;
unsigned long size;
if (!capable(CAP_SYS_ADMIN))
@@ -879,6 +877,7 @@ static int madvise_inject_error(int behavior,
for (; start < end; start += size) {
unsigned long pfn;
+ struct page *page;
int ret;
ret = get_user_pages_fast(start, 1, 0, &page);
@@ -895,25 +894,22 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
- pfn, start);
-
+ pfn, start);
ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
- if (ret)
- return ret;
- continue;
+ } else {
+ pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+ pfn, start);
+ /*
+ * Drop the page reference taken by
+ * get_user_pages_fast(). In the absence of
+ * MF_COUNT_INCREASED the memory_failure() routine is
+ * responsible for pinning the page to prevent it
+ * from being released back to the page allocator.
+ */
+ put_page(page);
+ ret = memory_failure(pfn, 0);
}
- pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
- pfn, start);
-
- /*
- * Drop the page reference taken by get_user_pages_fast(). In
- * the absence of MF_COUNT_INCREASED the memory_failure()
- * routine is responsible for pinning the page to prevent it
- * from being released back to the page allocator.
- */
- put_page(page);
- ret = memory_failure(pfn, 0);
if (ret)
return ret;
}
--
2.26.2
A page with 0-refcount and !PageBuddy could perfectly be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we handle pcppages neither from the hard-offline nor from the
soft-offline path.
Fix this by draining pcplists whenever we find this kind of page, and
retrying the check. It might be that the pcplists have been spilled back
into the buddy allocator, in which case we can handle the page.
Signed-off-by: Oscar Salvador <[email protected]>
---
mm/memory-failure.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a2ccd3ba4015..7fba4ba201d5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -950,13 +950,13 @@ static int page_action(struct page_state *ps, struct page *p,
}
/**
- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
* @page: raw error page (hit by memory error)
*
* Return: return 0 if failed to grab the refcount, otherwise true (some
* non-zero value.)
*/
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
{
struct page *head = compound_head(page);
@@ -986,6 +986,26 @@ static int get_hwpoison_page(struct page *page)
return 0;
}
+static int get_hwpoison_page(struct page *p)
+{
+ int ret;
+ bool drained = false;
+
+retry:
+ ret = __get_hwpoison_page(p);
+ if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
+ /*
+ * The page might be in a pcplist, so try to drain those
+ * and see if we are lucky.
+ */
+ drain_all_pages(page_zone(p));
+ drained = true;
+ goto retry;
+ }
+
+ return ret;
+}
+
/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
--
2.26.2
On Thu, Sep 17, 2020 at 10:10:42AM +0200, Oscar Salvador wrote:
> This patchset includes some fixups (patch#1, patch#2 and patch#3)
> and some cleanups (patch#4-7).
>
> Patch#1 is a fix to take HWPoison pages off the buddy freelists, since
> leaving them there can put HWPoison pages back in the game without
> anyone noticing.
> So fix it (we did that already for soft_offline_page [1]).
>
> Patch#2 fixes a rebasing problem that made the call
> to page_handle_poison from __soft_offline_page set the
> wrong value for hugepage_or_freepage. [2]
>
> Patch#3 is not really a fixup; it retries handling a page
> in case it was allocated under us.
Thanks for the update.
This patchset triggers the following BUG_ON() with Aristeu's workload:
[ 1010.400900] Soft offlining pfn 0xbff8c at process virtual address 0x7fe6c99c8000
[ 1010.402931] page:00000000f5670686 refcount:1 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0xbff89
[ 1010.405604] flags: 0xfffe000800000(hwpoison)
[ 1010.406755] raw: 000fffe000800000 ffffcddf029ab848 ffffcddf02ff9448 0000000000000000
[ 1010.408824] raw: 0000000000000001 0000000000000000 00000001ffffff7f 0000000000000000
[ 1010.410877] page dumped because: VM_BUG_ON_PAGE(page_count(buddy) != 0)
[ 1010.412673] ------------[ cut here ]------------
[ 1010.413930] kernel BUG at mm/page_alloc.c:800!
[ 1010.415143] invalid opcode: 0000 [#1] SMP PTI
[ 1010.416320] CPU: 3 PID: 1340 Comm: kworker/3:0 Not tainted 5.9.0-rc2-mm1-v5.9-rc2-200917-1952-00212-gf1a0765b04cb+ #33
[ 1010.419101] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 1010.422645] Workqueue: mm_percpu_wq drain_local_pages_wq
[ 1010.424075] RIP: 0010:__free_one_page+0x552/0x580
[ 1010.425344] Code: 48 c7 c6 90 6c 0f 84 4c 89 e7 e8 69 7e fd ff 0f 0b 0f 1f 44 00 00 e9 e5 fc ff ff 48 c7 c6 c8 f3 11 84 4c 89 f7 e8 4e 7e fd ff <0f> 0b 83 fb 08 0f 86 cb fc ff ff 48 83 c4 20 5b 5d 41 5c 41 5d 41
[ 1010.430231] RSP: 0018:ffffaa96c171fda0 EFLAGS: 00010082
[ 1010.431651] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
[ 1010.433598] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8dc8bbd18d08
[ 1010.435627] RBP: 00000000000bff88 R08: ffff8dc8bbd18d00 R09: 6573756163656220
[ 1010.437544] R10: 6163656220646570 R11: 6d75642065676170 R12: ffffcddf02ffe200
[ 1010.439376] R13: 00000000000bff89 R14: ffffcddf02ffe240 R15: ffff8dc7bffd5680
[ 1010.441271] FS: 0000000000000000(0000) GS:ffff8dc8bbd00000(0000) knlGS:0000000000000000
[ 1010.443349] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1010.444892] CR2: 00007f6b69f92000 CR3: 0000000139c4a000 CR4: 00000000001506e0
[ 1010.446746] Call Trace:
[ 1010.447424] free_pcppages_bulk+0x1d4/0x2c0
[ 1010.448553] drain_pages_zone+0x42/0x50
[ 1010.449585] drain_local_pages_wq+0xe/0x10
[ 1010.450702] process_one_work+0x1b0/0x360
[ 1010.451769] worker_thread+0x50/0x3a0
[ 1010.452940] ? process_one_work+0x360/0x360
[ 1010.454072] kthread+0xfe/0x140
[ 1010.454989] ? kthread_park+0x90/0x90
[ 1010.455970] ret_from_fork+0x22/0x30
This message seems to show that the pages being moved back to buddy still
have a refcount.
Could you review how the changes from v3 -> v4 cause it?
Here's my reproducer.
[build1:~]$ cat test_ksm_madv_soft.c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/types.h>
#include <errno.h>
#include <stdlib.h>
#define MADV_SOFT_OFFLINE 101
#define err(x) perror(x),exit(EXIT_FAILURE)
int main() {
        int ret;
        int size = 100000*0x1000;

        char *p1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        printf("p1 %p\n", p1);
        char *p2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        printf("p2 %p\n", p2);

        /* Ask KSM to merge the two identical anonymous regions */
        ret = madvise(p1, size, MADV_MERGEABLE);
        printf("madvise(p1) %d\n", ret);
        ret = madvise(p2, size, MADV_MERGEABLE);
        printf("madvise(p2) %d\n", ret);

        printf("writing p1 ... ");
        memset(p1, 'a', size);
        printf("done\n");
        printf("writing p2 ... ");
        memset(p2, 'a', size);
        printf("done\n");

        /* Give ksmd time to scan and merge the pages */
        usleep(10000000);

        /* Soft-offline the whole (now KSM-backed) range */
        printf("soft offline\n");
        ret = madvise(p1, size, MADV_SOFT_OFFLINE);
        printf("soft offline returns %d\n", ret);
        if (ret)
                err("madvise");

        madvise(p1, size, MADV_UNMERGEABLE);
        madvise(p2, size, MADV_UNMERGEABLE);
        printf("OK\n");
        return 0;
}
[build1:~/upstream/mm_regression/lib]$ cat tmp_run_ksm_madv.sh
rm test_ksm_madv_soft 2> /dev/null
gcc -o test_ksm_madv_soft test_ksm_madv_soft.c || exit 1
echo 0 > /sys/kernel/mm/ksm/sleep_millisecs
echo 100000 > /sys/kernel/mm/ksm/pages_to_scan
echo 100000 > /sys/kernel/mm/ksm/max_page_sharing
echo 2 > /sys/kernel/mm/ksm/run
echo 1 > /sys/kernel/mm/ksm/run
./test_ksm_madv_soft
Thanks,
Naoya Horiguchi
On Thu, Sep 17, 2020 at 11:39:21AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Thanks for the update.
> This patchset triggers the following BUG_ON() with Aristeu's workload:
I just took a look, but I found more oddities.
The patchset you sent seems to be a bit buggy and it is missing some things
my patchset contains, e.g:
static __always_inline bool free_pages_prepare(struct page *page,
unsigned int order, bool check_free)
{
...
if (unlikely(PageHWPoison(page)) && !order) {
...
return false;
}
Moreover, in page_handle_poison, you got the ordering wrong:
static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
{
if (release) {
put_page(page);
drain_all_pages(page_zone(page));
}
...
SetPageHWPoison(page);
page_ref_inc(page);
1) You are freeing the page first, which means it goes to buddy
2) Then you set it as poisoned and you update its refcount.
Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.
Honestly, I do not know how your patchset diverged so much from mine, but it
is not right.
I will go over my patchset and yours and compare/fix things.
--
Oscar Salvador
SUSE L3
On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote:
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> {
> if (release) {
> put_page(page);
> drain_all_pages(page_zone(page));
> }
>
> ...
> SetPageHWPoison(page);
> page_ref_inc(page);
>
> 1) You are freeing the page first, which means it goes to buddy
> 2) Then you set it as poisoned and you update its refcount.
>
> Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.
Hi Naoya,
Ok, I tested it and with the following changes on top I cannot reproduce the issue:
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f68cb5e3b320..4ffaaa5c2603 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
{
- if (release) {
- put_page(page);
- drain_all_pages(page_zone(page));
- }
-
if (hugepage_or_freepage) {
/*
* Doing this check for free pages is also fine since dissolve_free_huge_page
@@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
}
SetPageHWPoison(page);
+
+ if (release) {
+ put_page(page);
+ drain_all_pages(page_zone(page));
+ }
+
page_ref_inc(page);
num_poisoned_pages_inc();
return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..8a6ab05f074c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
trace_mm_page_free(page, order);
+ if (unlikely(PageHWPoison(page)) && !order) {
+ /*
+ * Untie memcg state and reset page's owner
+ */
+ if (memcg_kmem_enabled() && PageKmemcg(page))
+ __memcg_kmem_uncharge_page(page, order);
+ reset_page_owner(page, order);
+ return false;
+ }
+
/*
* Check tail pages before head page information is cleared to
* avoid checking PageCompound for order-0 pages.
# sh tmp_run_ksm_madv.sh
p1 0x7f6b6b08e000
p2 0x7f6b529ee000
madvise(p1) 0
madvise(p2) 0
writing p1 ... done
writing p2 ... done
soft offline
soft offline returns 0
OK
Can you try to re-run your tests and see if they come clean?
If they do, I will try to see if Andrew can squeeze the above changes into [1],
where they belong.
Otherwise I will craft a v5 containing all fixups (pretty unfortunate).
[1] https://patchwork.kernel.org/patch/11704099/
--
Oscar Salvador
SUSE L3
On Thu, Sep 17, 2020 at 03:40:06PM +0200, Oscar Salvador wrote:
> On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote:
> > static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> > {
> > if (release) {
> > put_page(page);
> > drain_all_pages(page_zone(page));
> > }
> >
> > ...
> > SetPageHWPoison(page);
> > page_ref_inc(page);
> >
> > 1) You are freeing the page first, which means it goes to buddy
> > 2) Then you set it as poisoned and you update its refcount.
> >
> > Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.
>
> Hi Naoya,
>
> Ok, I tested it and with the following changes on top I cannot reproduce the issue:
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f68cb5e3b320..4ffaaa5c2603 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> {
> - if (release) {
> - put_page(page);
> - drain_all_pages(page_zone(page));
> - }
> -
> if (hugepage_or_freepage) {
> /*
> * Doing this check for free pages is also fine since dissolve_free_huge_page
> @@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
> }
>
> SetPageHWPoison(page);
> +
> + if (release) {
> + put_page(page);
> + drain_all_pages(page_zone(page));
> + }
> +
> page_ref_inc(page);
> num_poisoned_pages_inc();
> return true;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0d9f9bd0e06c..8a6ab05f074c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>
> trace_mm_page_free(page, order);
>
> + if (unlikely(PageHWPoison(page)) && !order) {
> + /*
> + * Untie memcg state and reset page's owner
> + */
> + if (memcg_kmem_enabled() && PageKmemcg(page))
> + __memcg_kmem_uncharge_page(page, order);
> + reset_page_owner(page, order);
> + return false;
> + }
> +
Sorry, I modified the patches based on a different assumption from yours.
I first thought of taking the page off after confirming the error page
is freed back to buddy. This approach leaves the possibility of reusing
the error page (which is acceptable), but it is simpler and less invasive.
Your approach removes the error page from the page allocator's control at
freeing time. It leaves no possibility of reusing the error page, but the
changes are tightly coupled with the page free code.
This is a tradeoff between complexity and completeness of soft offline.
Now I'm not sure I can insist on my own opinion without providing
working code, and it's OK for me to take your approach.
> /*
> * Check tail pages before head page information is cleared to
> * avoid checking PageCompound for order-0 pages.
>
>
...
(let me reply to an older email here...)
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> {
> if (release) {
> put_page(page);
> drain_all_pages(page_zone(page));
> }
>
> ...
> SetPageHWPoison(page);
> page_ref_inc(page);
>
> 1) You are freeing the page first, which means it goes to buddy
> 2) Then you set it as poisoned and you update its refcount.
>
> Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.
This order was correct in my approach. The isolation operation is done
after confirming the page is in the free list. This prevents PageHWPoison
pages from ending up in pcplists. But the page_ref_inc() may not be
necessary in my approach.
> # sh tmp_run_ksm_madv.sh
> p1 0x7f6b6b08e000
> p2 0x7f6b529ee000
> madvise(p1) 0
> madvise(p2) 0
> writing p1 ... done
> writing p2 ... done
> soft offline
> soft offline returns 0
> OK
>
>
> Can you try to re-run your tests and see if they come clean?
The test passed in my environment, so this is fine.
> If they do, I will try to see if Andrew can squeeze the above changes into [1],
> where they belong.
Yes, proposing the fix for mmhwpoison-rework-soft-offline-for-in-use-pages.patch
seems fine to me.
Again, sorry for modifying code without asking.
Naoya Horiguchi
On 2020-09-17 17:27, HORIGUCHI NAOYA wrote:
> Sorry, I modified the patches based on a different assumption from
> yours.
> I first thought of taking the page off after confirming the error page
> is freed back to buddy. This approach leaves the possibility of reusing
> the error page (which is acceptable), but it is simpler and less
> invasive.
>
> Your approach removes the error page from the page allocator's control
> at freeing time. It leaves no possibility of reusing the error page,
> but the changes are tightly coupled with the page free code.
>
> This is a tradeoff between complexity and completeness of soft offline.
> Now I'm not sure I can insist on my own opinion without providing
> working code, and it's OK for me to take your approach.
Yeah, you are right, it is a trade-off.
I would suggest taking this path now, and if it proves to be problematic
in some way, we can always do the:
free_page
take_it_off_buddy
OK: mark it as hwpoison and increment refcount
NOT_OK (raced with allocation): oops, sorry
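Spelled out as a hypothetical sketch (not code from either patchset, just
the ordering above in C form, reusing the helpers this thread already
discusses):

        /*
         * Hypothetical sketch of the "free first, isolate later" ordering
         * sketched above; not from either patchset.
         */
        static bool sketch_handle_free_path(struct page *page)
        {
                put_page(page);                         /* page goes back to buddy */
                drain_all_pages(page_zone(page));       /* flush it out of pcplists */

                if (!take_page_off_buddy(page))
                        return false;                   /* raced with an allocation: oops */

                SetPageHWPoison(page);
                page_ref_inc(page);
                num_poisoned_pages_inc();
                return true;
        }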
> The test passed in my environment, so this is fine.
Thanks for trying it out.
>
>> If they do, I will try to see if Andrew can squeeze the above changes
>> into [1],
>> where they belong.
>
> Yes, proposing the fix for
> mmhwpoison-rework-soft-offline-for-in-use-pages.patch
> seems fine to me.
>
> Again, sorry for modifying code without asking.
No worries, I will do a couple of tests on my own and then I will talk to
Andrew to see if we can squeeze the changes in there.
On Thu, Sep 17, 2020 at 10:10:44AM +0200, Oscar Salvador wrote:
> Aristeu Rozanski reported that some hwpoison tests
> were returning -EBUSY, while before the rework they succeeded [1] [2].
>
> The root cause is that during a rebase, the call to page_handle_poison was
> setting hugepage_or_freepage = true unconditionally from __soft_offline_page.
>
> Aristeu said that after this fix, everything works.
>
> [1] https://patchwork.kernel.org/comment/23617301/
> [2] https://patchwork.kernel.org/comment/23619535/
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reported-by: Aristeu Rozanski <[email protected]>
Tested-by: Aristeu Rozanski <[email protected]>
--
Aristeu
On Thu, Sep 17, 2020 at 03:27:18PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> The test passed in my environment, so this is fine.
With everything applied, it works for me as well.
(apologies for the delay, the machine I was using stopped working and it took
a while to get another one)
--
Aristeu
On Thu, Sep 17, 2020 at 10:10:45AM +0200, Oscar Salvador wrote:
> Aristeu Rozanski reported that a customer test case started
> to report -EBUSY after the hwpoison rework patchset.
>
> There is a race window between spotting a free page and taking it off
> its buddy freelist, so it might be that by the time we try to take it off,
> the page has already been allocated.
>
> This patch handles that race window by retrying to handle the page, now
> as its new page type, if it was allocated under us.
>
> After this patch, Aristeu said the test cases work properly.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reported-by: Aristeu Rozanski <[email protected]>
Tested-by: Aristeu Rozanski <[email protected]>
--
Aristeu
On Thu, Sep 17, 2020 at 10:10:43AM +0200, Oscar Salvador wrote:
> The crux of the matter is that historically we left poisoned pages in the
> buddy system because we have some checks in place when allocating a page
> that act as a gatekeeper for poisoned pages. Unfortunately, we do have other
> users (e.g: compaction [1]) that scan buddy freelists and try to get a
> page from there without checking whether the page is HWPoison.
>
> As I stated already, I think it is fundamentally wrong to keep HWPoison
> pages within the buddy system, checks in place or not.
>
> Let us fix this the same way we did for soft_offline [2], and take the page
> off the buddy freelist, so it is completely unreachable.
>
> Note that this is fairly simple to trigger, as we only need to poison free
> buddy pages (madvise MADV_HWPOISON) and then run some sort of memory
> stress on the system.
>
> Just for a matter of reference, I put a dump_page in compaction_alloc to
> trigger on HWPoison pages:
>
> kernel: page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
> kernel: flags: 0xfffffc0800000(hwpoison)
> kernel: raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
> kernel: raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
> kernel: page dumped because: compaction_alloc
>
> kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> kernel: Call Trace:
> kernel: dump_stack+0x6d/0x8b
> kernel: compaction_alloc+0xb2/0xc0
> kernel: migrate_pages+0x2a6/0x12a0
> kernel: ? isolate_freepages+0xc80/0xc80
> kernel: ? __ClearPageMovable+0xb0/0xb0
> kernel: compact_zone+0x5eb/0x11c0
> kernel: ? finish_task_switch+0x74/0x300
> kernel: ? lock_timer_base+0xa8/0x170
> kernel: proactive_compact_node+0x89/0xf0
> kernel: ? kcompactd+0x2d0/0x3a0
> kernel: kcompactd+0x2d0/0x3a0
> kernel: ? finish_wait+0x80/0x80
> kernel: ? kcompactd_do_work+0x350/0x350
> kernel: kthread+0x118/0x130
> kernel: ? kthread_associate_blkcg+0xa0/0xa0
> kernel: ret_from_fork+0x22/0x30
>
> After that, if e.g: someone faults in the page, that someone will get
> killed unexpectedly.
>
> While we are at it, I also changed the action result for such cases.
> I think that MF_DELAYED is a bit misleading, because if we manage to
> contain the page and take it off the buddy, such a page is not to be used
> again unless it is unpoisoned, so we did fix the situation.
>
> So unless I am missing something, I strongly think that we should report
> MF_RECOVERED.
>
> [1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
> [2] https://patchwork.kernel.org/patch/11694847/
>
> Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
On Thu, Sep 17, 2020 at 10:10:47AM +0200, Oscar Salvador wrote:
> A page with 0-refcount and !PageBuddy could perfectly be a pcppage.
> Currently, we bail out with an error if we encounter such a page, meaning
> that we handle pcppages neither from the hard-offline nor from the
> soft-offline path.
>
> Fix this by draining pcplists whenever we find this kind of page, and
> retrying the check. It might be that the pcplists have been spilled back
> into the buddy allocator, in which case we can handle the page.
>
> Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>