The important bit of this patchset is patch#1, which is a fix to take
HWPoison pages off the buddy freelists, since leaving them there can put
HWPoison pages back in the game without anyone noticing it.
So fix it (we did that already for soft_offline_page [1]).
The other patches are clean-ups and not that important, so if anything,
consider patch#1 for inclusion.
[1] https://patchwork.kernel.org/cover/11704083/
Thanks
Oscar Salvador (5):
mm,hwpoison: take free pages off the buddy freelists
mm,hwpoison: refactor madvise_inject_error
mm,hwpoison: drain pcplists before bailing out for non-buddy
zero-refcount page
mm,hwpoison: drop unneeded pcplist draining
mm,hwpoison: remove stale code
mm/madvise.c | 36 +++++++++++++-------------------
mm/memory-failure.c | 50 +++++++++++++++++++++++++++++++++------------
2 files changed, 51 insertions(+), 35 deletions(-)
--
2.26.2
Patch series "HWpoison: further fixes and cleanups", v2.
The important bit of this patchset is patch#1, which is a fix to take
HWPoison pages off the buddy freelists, since leaving them there can put
HWPoison pages back in the game without anyone noticing it. So fix it
(we did that already for soft_offline_page [1]).
The other patches are clean-ups and not that important.
This patch (of 5):
The crux of the matter is that historically we left poisoned pages in
the buddy system because we have some checks in place when allocating a
page that act as a gatekeeper for poisoned pages. Unfortunately, we do
have other users (e.g. compaction [1]) that scan the buddy freelists and
try to get a page from there without checking whether the page is
HWPoison.
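To make that concrete, here is a paraphrased sketch (not the exact
kernel source) of the kind of buddy-freelist scan compaction's
isolate_freepages_block() does; start_pfn/end_pfn are assumed inputs and
locking is elided, but note that nothing on this path checks
PageHWPoison():

	unsigned long pfn;
	LIST_HEAD(freelist);

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		if (!PageBuddy(page))
			continue;
		/*
		 * A poisoned page left on a freelist is isolated and
		 * handed out as a migration target like any other
		 * free page.
		 */
		if (__isolate_free_page(page, page_order(page)))
			list_add_tail(&page->lru, &freelist);
	}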
As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy system, checks in place or not.
Let us fix this the same way we did for soft_offline [2], and take the
page off the buddy freelist, so it is completely unreachable.
Note that this is fairly simple to trigger, as we only need to poison
free buddy pages (madvise MADV_HWPOISON) and then run some sort of
memory stressor on the system.
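A minimal userspace trigger might look like the sketch below
(illustrative only: it needs CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE,
and the stress step afterwards is left to the reader):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		memset(p, 'a', psz);	/* fault the page in */
		/* poison it; pre-fix, the page ends up free in buddy */
		if (madvise(p, psz, MADV_HWPOISON))
			perror("madvise(MADV_HWPOISON)");
		return 0;
	}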
Just for reference, I put a dump_page() call in compaction_alloc() that
triggers for HWPoison pages:
kernel: page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
kernel: flags: 0xfffffc0800000(hwpoison)
kernel: raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
kernel: raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
kernel: page dumped because: compaction_alloc
kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
kernel: Call Trace:
kernel: dump_stack+0x6d/0x8b
kernel: compaction_alloc+0xb2/0xc0
kernel: migrate_pages+0x2a6/0x12a0
kernel: ? isolate_freepages+0xc80/0xc80
kernel: ? __ClearPageMovable+0xb0/0xb0
kernel: compact_zone+0x5eb/0x11c0
kernel: ? finish_task_switch+0x74/0x300
kernel: ? lock_timer_base+0xa8/0x170
kernel: proactive_compact_node+0x89/0xf0
kernel: ? kcompactd+0x2d0/0x3a0
kernel: kcompactd+0x2d0/0x3a0
kernel: ? finish_wait+0x80/0x80
kernel: ? kcompactd_do_work+0x350/0x350
kernel: kthread+0x118/0x130
kernel: ? kthread_associate_blkcg+0xa0/0xa0
kernel: ret_from_fork+0x22/0x30
After that, if e.g. someone faults in the page, that someone will get
killed unexpectedly.
While we are at it, I also changed the action result for such cases.
I think that MF_DELAYED is a bit misleading, because if we managed to
contain the page and take it off the buddy freelists, such a page will
not be used again unless it is unpoisoned, so we did fix the situation.
So unless I am missing something, I strongly think that we should report
MF_RECOVERED.
[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/patch/11694847/
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
---
mm/memory-failure.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 68f736add7cc..989fb3efdca6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1321,8 +1321,9 @@ int memory_failure(unsigned long pfn, int flags)
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
- int res;
unsigned long page_flags;
+ int res = 0;
+ bool retry = true;
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1362,10 +1363,21 @@ int memory_failure(unsigned long pfn, int flags)
* In fact it's dangerous to directly bump up page count from 0,
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
+try_again:
if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
if (is_free_buddy_page(p)) {
- action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
- return 0;
+ if (take_page_off_buddy(p)) {
+ action_result(pfn, MF_MSG_BUDDY, MF_RECOVERED);
+ return 0;
+ } else {
+ /* We lost the race, try again */
+ if (retry) {
+ retry = false;
+ goto try_again;
+ }
+ action_result(pfn, MF_MSG_BUDDY, MF_FAILED);
+ return -EBUSY;
+ }
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
return -EBUSY;
@@ -1391,11 +1403,15 @@ int memory_failure(unsigned long pfn, int flags)
shake_page(p, 0);
/* shake_page could have turned it free. */
if (!PageLRU(p) && is_free_buddy_page(p)) {
+ if (!take_page_off_buddy(p))
+ res = -EBUSY;
+
if (flags & MF_COUNT_INCREASED)
- action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+ action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : MF_RECOVERED);
else
- action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
- return 0;
+ action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : MF_RECOVERED);
+
+ return res;
}
lock_page(p);
--
2.26.2
Make a proper if-else condition for {hard,soft}-offline.
[akpm: remove zone variable and refactor comment]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
---
mm/madvise.c | 32 ++++++++++++++------------------
1 file changed, 14 insertions(+), 18 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index a8f18420efeb..4ce66bab53dd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -872,8 +872,6 @@ static long madvise_remove(struct vm_area_struct *vma,
static int madvise_inject_error(int behavior,
unsigned long start, unsigned long end)
{
- struct page *page;
- struct zone *zone;
unsigned long size;
if (!capable(CAP_SYS_ADMIN))
@@ -882,6 +880,7 @@ static int madvise_inject_error(int behavior,
for (; start < end; start += size) {
unsigned long pfn;
+ struct page *page;
int ret;
ret = get_user_pages_fast(start, 1, 0, &page);
@@ -898,25 +897,22 @@ static int madvise_inject_error(int behavior,
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
- pfn, start);
-
+ pfn, start);
ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
- if (ret)
- return ret;
- continue;
+ } else {
+ pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+ pfn, start);
+ /*
+ * Drop the page reference taken by
+ * get_user_pages_fast(). In the absence of
+ * MF_COUNT_INCREASED the memory_failure() routine is
+ * responsible for pinning the page to prevent it
+ * from being released back to the page allocator.
+ */
+ put_page(page);
+ ret = memory_failure(pfn, 0);
}
- pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
- pfn, start);
-
- /*
- * Drop the page reference taken by get_user_pages_fast(). In
- * the absence of MF_COUNT_INCREASED the memory_failure()
- * routine is responsible for pinning the page to prevent it
- * from being released back to the page allocator.
- */
- put_page(page);
- ret = memory_failure(pfn, 0);
if (ret)
return ret;
}
--
2.26.2
Currently we call shake_page and then check whether the page is Buddy
because shake_page calls drain_all_pages, which sends pcp-pages back to
the buddy freelists, so we could have a chance to handle free pages.
get_hwpoison_page already calls drain_all_pages, and we do call
get_hwpoison_page right before coming here, so we should be on the safe
side.
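In call-chain form (a rough sketch of the relevant path in
mm/memory-failure.c, not exact code):

	memory_failure(pfn, flags)
	    get_hwpoison_page(p)   /* already drains the pcplists if needed */
	    ...
	    shake_page(p, 0)
	    /* re-checking is_free_buddy_page(p) here is thus redundant */
	    lock_page(p)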
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
---
mm/memory-failure.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4468c1eb5027..fbe174d54fad 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1421,18 +1421,6 @@ int memory_failure(unsigned long pfn, int flags)
* walked by the page reclaim code, however that's not a big loss.
*/
shake_page(p, 0);
- /* shake_page could have turned it free. */
- if (!PageLRU(p) && is_free_buddy_page(p)) {
- if (!take_page_off_buddy(p))
- res = -EBUSY;
-
- if (flags & MF_COUNT_INCREASED)
- action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : MF_RECOVERED);
- else
- action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : MF_RECOVERED);
-
- return res;
- }
lock_page(p);
--
2.26.2
A page with 0-refcount and !PageBuddy could perfectly well be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we handle pcppages neither from the hard-offline nor the
soft-offline path.
Fix this by draining the pcplists whenever we find this kind of page and
retrying the check. It might be that the pcplists have been spilled into
the buddy allocator and so we can handle it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
---
mm/memory-failure.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 989fb3efdca6..4468c1eb5027 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -948,13 +948,13 @@ static int page_action(struct page_state *ps, struct page *p,
}
/**
- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
* @page: raw error page (hit by memory error)
*
* Return: return 0 if failed to grab the refcount, otherwise true (some
* non-zero value.)
*/
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
{
struct page *head = compound_head(page);
@@ -984,6 +984,26 @@ static int get_hwpoison_page(struct page *page)
return 0;
}
+static int get_hwpoison_page(struct page *p)
+{
+ int ret;
+ bool drained = false;
+
+retry:
+ ret = __get_hwpoison_page(p);
+ if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
+ /*
+ * The page might be in a pcplist, so try to drain those
+ * and see if we are lucky.
+ */
+ drain_all_pages(page_zone(p));
+ drained = true;
+ goto retry;
+ }
+
+ return ret;
+}
+
/*
* Do all that is necessary to remove user space mappings. Unmap
* the pages and send SIGBUS to the processes if the data was dirty.
--
2.26.2
The memory_failure and soft_offline paths now drain the pcplists by
calling get_hwpoison_page.
memory_failure flags the page as HWPoison beforehand, so that page can
no longer go into a pcplist, and soft_offline_page only flags a page as
HWPoison if 1) we took the page off a buddy freelist, 2) the page was
in-use and we migrated it, or 3) it was a clean pagecache page.
Because of that, a page can no longer be both poisoned and in a pcplist.
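For reference, this relies on the gate the rework added to the page-free
path; roughly (a from-memory paraphrase of the free_pages_prepare()
check in page_alloc.c, not the exact hunk):

	if (unlikely(PageHWPoison(page)) && !order) {
		/*
		 * Do not let hwpoison pages hit the pcplists or the
		 * buddy freelists: refuse the free.
		 */
		reset_page_owner(page, order);
		return false;
	}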
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
---
mm/madvise.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 4ce66bab53dd..4bac8e85497a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -917,10 +917,6 @@ static int madvise_inject_error(int behavior,
return ret;
}
- /* Ensure that all poisoned pages are removed from per-cpu lists */
- for_each_populated_zone(zone)
- drain_all_pages(zone);
-
return 0;
}
#endif
--
2.26.2
Hi Oscar, Naoya,
On Mon, Sep 14, 2020 at 12:15:54PM +0200, Oscar Salvador wrote:
> The important bit of this patchset is patch#1, which is a fix to take
> HWPoison pages off the buddy freelists, since leaving them there can put
> HWPoison pages back in the game without anyone noticing it.
> So fix it (we did that already for soft_offline_page [1]).
>
> The other patches are clean-ups and not that important, so if anything,
> consider patch#1 for inclusion.
>
> [1] https://patchwork.kernel.org/cover/11704083/
I found something strange with your and Naoya's hwpoison rework. We have a
customer with a testcase that basically does:
p1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
p2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
madvise(p1, size, MADV_MERGEABLE);
madvise(p2, size, MADV_MERGEABLE);
memset(p1, 'a', size);
memset(p2, 'a', size);
madvise(p1, size, MADV_SOFT_OFFLINE);
madvise(p1, size, MADV_UNMERGEABLE);
madvise(p2, size, MADV_UNMERGEABLE);
where size is about 200,000 pages. It works on an x86_64 box (with and
without the hwpoison rework). On ppc64 boxes (tested 3 different ones
with at least 250GB memory) it fails to take a page off the buddy list
(page_handle_poison()/take_page_off_buddy()) (madvise MADV_SOFT_OFFLINE
returns -EBUSY). Without the hwpoison rework the test passes.
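For reference, a self-contained sketch of that testcase might look like
this (a hypothetical fleshing-out of the fragment above: size is scaled
down, error handling is minimal, and MADV_SOFT_OFFLINE needs
CAP_SYS_ADMIN):

	#define _GNU_SOURCE
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t size = 4096UL * 1024;	/* stand-in for ~200,000 pages */
		char *p1 = mmap(NULL, size, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char *p2 = mmap(NULL, size, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p1 == MAP_FAILED || p2 == MAP_FAILED)
			return 1;
		madvise(p1, size, MADV_MERGEABLE);
		madvise(p2, size, MADV_MERGEABLE);
		memset(p1, 'a', size);
		memset(p2, 'a', size);
		madvise(p1, size, MADV_SOFT_OFFLINE);
		madvise(p1, size, MADV_UNMERGEABLE);
		madvise(p2, size, MADV_UNMERGEABLE);
		return 0;
	}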
Possibly related is that ppc64 takes a long time to run this test and,
according to perf, it spends most of the time clearing pages:
17.15% ksm_poison [kernel.kallsyms] [k] copypage_power7
13.39% ksm_poison [kernel.kallsyms] [k] clear_user_page
8.70% ksm_poison libc-2.28.so [.] __memset_power8
8.63% ksm_poison [kernel.kallsyms] [k] opal_return
6.04% ksm_poison [kernel.kallsyms] [k] __opal_call
2.67% ksm_poison [kernel.kallsyms] [k] opal_call
1.52% ksm_poison [kernel.kallsyms] [k] _raw_spin_lock
1.45% ksm_poison [kernel.kallsyms] [k] opal_flush_console
1.43% ksm_poison [unknown] [k] 0x0000000030005138
1.43% ksm_poison [kernel.kallsyms] [k] opal_console_write_buffer_space
1.26% ksm_poison [kernel.kallsyms] [k] hvc_console_print
(...)
I've run these tests using mmotm and mmotm with this patchset on top.
Do you know what might be happening here?
--
Aristeu
On Tue, Sep 15, 2020 at 05:22:22PM -0400, Aristeu Rozanski wrote:
> Hi Oscar, Naoya,
Hi Aristeu,
thanks for reporting this.
> I've run these tests using mmotm and mmotm with this patchset on top.
Could you please re-run the tests with the below patch applied, and
attach the logs here?
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 84a7f228af36..d7b6e7724e47 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -67,6 +67,7 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
{
+ dump_page(page, "page_handle_poison");
if (release) {
put_page(page);
drain_all_pages(page_zone(page));
@@ -77,7 +78,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
* Doing this check for free pages is also fine since dissolve_free_huge_page
* returns 0 for non-hugetlb pages as well.
*/
- if (dissolve_free_huge_page(page) || !take_page_off_buddy(page))
+ if (dissolve_free_huge_page(page) || !take_page_off_buddy(page)) {
/*
* We could fail to take off the target page from buddy
* for example due to racy page allocation, but that's
@@ -85,7 +86,9 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
* and if someone really want to use it, they should
* take it.
*/
+ pr_info("%s: hugepage_or_freepage failed?n", __func__);
return false;
+ }
}
SetPageHWPoison(page);
@@ -1858,8 +1861,11 @@ static int __soft_offline_page(struct page *page)
if (!ret) {
bool release = !huge;
- if (!page_handle_poison(page, true, release))
+ if (!page_handle_poison(page, true, release)) {
+ pr_info("%s: page_handle_poison -EBUSY\n", __func__);
+ dump_page(page, "__soft_offline_page after migrate");
ret = -EBUSY;
+ }
} else {
if (!list_empty(&pagelist))
putback_movable_pages(&pagelist);
@@ -1872,6 +1878,7 @@ static int __soft_offline_page(struct page *page)
} else {
pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n",
pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags);
+ dump_page(page, "__soft_offline_page isolation failed");
ret = -EBUSY;
}
return ret;
@@ -1882,8 +1889,11 @@ static int soft_offline_in_use_page(struct page *page)
struct page *hpage = compound_head(page);
if (!PageHuge(page) && PageTransHuge(hpage))
- if (try_to_split_thp_page(page, "soft offline") < 0)
+ if (try_to_split_thp_page(page, "soft offline") < 0) {
+ pr_info("%s: try_to_split_thp_page -EBUSY\n", __func__);
+ dump_page(page, "try_to_split_thp_page");
return -EBUSY;
+ }
return __soft_offline_page(page);
}
@@ -1891,8 +1901,11 @@ static int soft_offline_free_page(struct page *page)
{
int rc = 0;
- if (!page_handle_poison(page, true, false))
+ if (!page_handle_poison(page, true, false)) {
+ pr_info("%s: page_handle_poison -EBUSY\n", __func__);
+ dump_page(page, "soft_offline_free_page");
rc = -EBUSY;
+ }
return rc;
}
Thanks
--
Oscar Salvador
SUSE L3
On Wed, Sep 16, 2020 at 09:53:58AM -0400, Aristeu Rozanski wrote:
> Hi Oscar,
Thanks Aristeu,
>
> On Wed, Sep 16, 2020 at 09:27:02AM +0200, Oscar Salvador wrote:
> > Could you please re-run the tests with the below patch applied, and
> > attach the logs here?
>
> here it is:
> (removed previous dump_page() calls for other pages that didn't fail)
>
> [ 109.709342] Soft offlining pfn 0x3fb526 at process virtual address 0x7ffc7a180000
> [ 109.716969] page:00000000f367dde5 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ffc7a18 pfn:0x3fb526
> [ 109.716978] anon flags: 0x3ffff80008000e(referenced|uptodate|dirty|swapbacked)
> [ 109.716988] raw: 003ffff80008000e 5deadbeef0000100 5deadbeef0000122 c000003fcd56d839
> [ 109.716997] raw: 000000007ffc7a18 0000000000000000 00000001ffffffff c000003fd42f5000
> [ 109.717005] page dumped because: page_handle_poison
> [ 109.717011] page->mem_cgroup:c000003fd42f5000
> [ 109.725882] page_handle_poison: hugepage_or_freepage failed
> [ 109.725885] __soft_offline_page: page_handle_poison -EBUSY
> [ 109.725898] page:00000000f367dde5 refcount:3 mapcount:0 mapping:00000000b43d73e6 index:0x58 pfn:0x3fb526
> [ 109.725951] aops:xfs_address_space_operations [xfs] ino:49f9c5f dentry name:"messages"
> [ 109.725958] flags: 0x3ffff800002008(dirty|private)
> [ 109.725963] raw: 003ffff800002008 5deadbeef0000100 5deadbeef0000122 c000003fb8b7eea8
> [ 109.725969] raw: 0000000000000058 c000003fdd94eb20 00000003ffffffff c000003fd3c42000
> [ 109.725975] page dumped because: __soft_offline_page after migrate
> [ 109.725980] page->mem_cgroup:c000003fd3c42000
Can you try the other patch I posted in response to Naoya?
thanks
--
Oscar Salvador
SUSE L3
On Tue, Sep 15, 2020 at 05:22:22PM -0400, Aristeu Rozanski wrote:
> Hi Oscar, Naoya,
>
> On Mon, Sep 14, 2020 at 12:15:54PM +0200, Oscar Salvador wrote:
> > The important bit of this patchset is patch#1, which is a fix to take
> > HWPoison pages off the buddy freelists, since leaving them there can put
> > HWPoison pages back in the game without anyone noticing it.
> > So fix it (we did that already for soft_offline_page [1]).
> >
> > The other patches are clean-ups and not that important, so if anything,
> > consider patch#1 for inclusion.
> >
> > [1] https://patchwork.kernel.org/cover/11704083/
>
> I found something strange with your and Naoya's hwpoison rework. We have a
> customer with a testcase that basically does:
>
> p1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> p2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
> madvise(p1, size, MADV_MERGEABLE);
> madvise(p2, size, MADV_MERGEABLE);
>
> memset(p1, 'a', size);
> memset(p2, 'a', size);
>
> madvise(p1, size, MADV_SOFT_OFFLINE);
>
> madvise(p1, size, MADV_UNMERGEABLE);
> madvise(p2, size, MADV_UNMERGEABLE);
>
>
> where size is about 200,000 pages. It works on a x86_64 box (with and without the
> hwpoison rework). On ppc64 boxes (tested 3 different ones with at least 250GB memory)
> it fails to take a page off the buddy list (page_handle_poison()/take_page_off_buddy())
> (madvise MADV_SOFT_OFFLINE returns -EBUSY). Without the hwpoison rework the test passes.
I reproduced a similar -EBUSY on a small x86 VM, where it seems to me
there is a race between take_page_off_buddy() and page allocation.
Oscar's debug patch shows the following kernel messages:
[ 627.357009] Soft offlining pfn 0x235018 at process virtual address 0x7fd112140000
[ 627.358747] __get_any_page: 0x235018 free buddy page
[ 627.359875] page:00000000038b52c9 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0x235018
[ 627.362002] flags: 0x57ffe000000000()
[ 627.362841] raw: 0057ffe000000000 fffff84648d12688 ffff955abffd1dd0 0000000000000000
[ 627.364555] raw: 0000000000000001 0000000000000000 00000000ffffff7f 0000000000000000
[ 627.366258] page dumped because: page_handle_poison
[ 627.367357] page->mem_cgroup:ffff9559b6912000
[ 627.368342] page_handle_poison: hugepage_or_freepage failed\xb8n
[ 627.368344] soft_offline_free_page: page_handle_poison -EBUSY
[ 627.370901] page:00000000038b52c9 refcount:6 mapcount:3 mapping:000000001226bf89 index:0x2710 pfn:0x235018
[ 627.373048] aops:ext4_da_aops ino:c63f3 dentry name:"system.journal"
[ 627.374526] flags: 0x57ffe00000201c(uptodate|dirty|lru|private)
[ 627.375865] raw: 0057ffe00000201c fffff84648d300c8 ffff955ab8c3f020 ffff955aba5f4ee0
[ 627.377586] raw: 0000000000002710 ffff9559b811fc98 0000000500000002 ffff9559b6912000
[ 627.379308] page dumped because: soft_offline_free_page
[ 627.380480] page->mem_cgroup:ffff9559b6912000
CPU 0                                    CPU 1

get_any_page // returns 0 (free buddy path)
soft_offline_free_page
                                         the page is allocated
  page_handle_poison -> fail
return -EBUSY
I'm still not sure why this issue was invisible before the rework patch,
but setting the migrate type to MIGRATE_ISOLATE during offlining could
affect the behavior in subtle ways.
Thanks,
Naoya Horiguchi
On Wed, Sep 16, 2020 at 06:34:52PM +0200, [email protected] wrote:
> Fat fingers, sorry:
>
> Ok, this is something different.
> The race you saw previously is kinda normal as there is a race window
> between spotting a freepage and taking it off the buddy freelists.
> The retry patch should help there.
>
> The issue you are seeing right here is due to the call to
> page_handle_poison in __soft_offline_page being wrong, as we pass
> hugepage_or_freepage = true unconditionally, which is wrong.
> I think it was caused during rebasing.
>
> Should be:
Tests passed with this patch on top of others.
Thanks!
--
Aristeu
On 2020-09-16 19:58, Aristeu Rozanski wrote:
> On Wed, Sep 16, 2020 at 06:34:52PM +0200, [email protected] wrote:
>> Fat fingers, sorry:
>>
>> Ok, this is something different.
>> The race you saw previously is kinda normal as there is a race window
>> between spotting a freepage and taking it off the buddy freelists.
>> The retry patch should help there.
>>
>> The issue you are seeing right here is due to the call to
>> page_handle_poison in __soft_offline_page being wrong, as we pass
>> hugepage_or_freepage = true unconditionally, which is wrong.
>> I think it was caused during rebasing.
>>
>> Should be:
>
> Tests passed with this patch on top of others.
> Thanks!
Thanks for giving it a shot!
I will be sending patches tomorrow morning.
Thanks
Hi Oscar,
On Wed, Sep 16, 2020 at 04:09:30PM +0200, Oscar Salvador wrote:
> On Wed, Sep 16, 2020 at 09:53:58AM -0400, Aristeu Rozanski wrote:
> Can you try the other patch I posted in response to Naoya?
Same thing:
[ 369.195056] Soft offlining pfn 0x3fb5bf at process virtual address 0x7ffc84350000
[ 369.195073] page:000000002bb131e4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ffc8435 pfn:0x3fb5bf
[ 369.195080] anon flags: 0x3ffff80008000e(referenced|uptodate|dirty|swapbacked)
[ 369.202131] raw: 003ffff80008000e 5deadbeef0000100 5deadbeef0000122 c000003fda1c7431
[ 369.202137] raw: 000000007ffc8435 0000000000000000 00000001ffffffff c000003fd63af000
[ 369.202141] page dumped because: page_handle_poison
[ 369.202145] page->mem_cgroup:c000003fd63af000
[ 369.215055] page_handle_poison: hugepage_or_freepage failed�n
[ 369.215057] __soft_offline_page: page_handle_poison -EBUSY
[ 369.215068] page:000000002bb131e4 refcount:3 mapcount:0 mapping:00000000f6ca3f32 index:0x5c pfn:0x3fb5bf
[ 369.215110] aops:xfs_address_space_operations [xfs] ino:49f9c5f dentry name:"messages"
[ 369.215117] flags: 0x3ffff800002008(dirty|private)
[ 369.215121] raw: 003ffff800002008 5deadbeef0000100 5deadbeef0000122 c000003fadd3daa8
[ 369.215127] raw: 000000000000005c c000003fd9497c20 00000003ffffffff c000003fd1143000
[ 369.215132] page dumped because: __soft_offline_page after migrate
[ 369.215136] page->mem_cgroup:c000003fd1143000
--
Aristeu
Hi Oscar,
On Wed, Sep 16, 2020 at 09:27:02AM +0200, Oscar Salvador wrote:
> Could you please re-run the tests with the below patch applied, and
> attach the logs here?
here it is:
(removed previous dump_page() calls for other pages that didn't fail)
[ 109.709342] Soft offlining pfn 0x3fb526 at process virtual address 0x7ffc7a180000
[ 109.716969] page:00000000f367dde5 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ffc7a18 pfn:0x3fb526
[ 109.716978] anon flags: 0x3ffff80008000e(referenced|uptodate|dirty|swapbacked)
[ 109.716988] raw: 003ffff80008000e 5deadbeef0000100 5deadbeef0000122 c000003fcd56d839
[ 109.716997] raw: 000000007ffc7a18 0000000000000000 00000001ffffffff c000003fd42f5000
[ 109.717005] page dumped because: page_handle_poison
[ 109.717011] page->mem_cgroup:c000003fd42f5000
[ 109.725882] page_handle_poison: hugepage_or_freepage failed
[ 109.725885] __soft_offline_page: page_handle_poison -EBUSY
[ 109.725898] page:00000000f367dde5 refcount:3 mapcount:0 mapping:00000000b43d73e6 index:0x58 pfn:0x3fb526
[ 109.725951] aops:xfs_address_space_operations [xfs] ino:49f9c5f dentry name:"messages"
[ 109.725958] flags: 0x3ffff800002008(dirty|private)
[ 109.725963] raw: 003ffff800002008 5deadbeef0000100 5deadbeef0000122 c000003fb8b7eea8
[ 109.725969] raw: 0000000000000058 c000003fdd94eb20 00000003ffffffff c000003fd3c42000
[ 109.725975] page dumped because: __soft_offline_page after migrate
[ 109.725980] page->mem_cgroup:c000003fd3c42000
--
Aristeu
On Wed, Sep 16, 2020 at 01:42:15PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Sep 15, 2020 at 05:22:22PM -0400, Aristeu Rozanski wrote:
> I reproduced a similar -EBUSY on a small x86 VM, where it seems to me
> there is a race between take_page_off_buddy() and page allocation.
> Oscar's debug patch shows the following kernel messages:
>
> [ 627.357009] Soft offlining pfn 0x235018 at process virtual address 0x7fd112140000
> [ 627.358747] __get_any_page: 0x235018 free buddy page
> [ 627.359875] page:00000000038b52c9 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0x235018
> [ 627.362002] flags: 0x57ffe000000000()
> [ 627.362841] raw: 0057ffe000000000 fffff84648d12688 ffff955abffd1dd0 0000000000000000
> [ 627.364555] raw: 0000000000000001 0000000000000000 00000000ffffff7f 0000000000000000
> [ 627.366258] page dumped because: page_handle_poison
> [ 627.367357] page->mem_cgroup:ffff9559b6912000
> [ 627.368342] page_handle_poison: hugepage_or_freepage failed\xb8n
> [ 627.368344] soft_offline_free_page: page_handle_poison -EBUSY
> [ 627.370901] page:00000000038b52c9 refcount:6 mapcount:3 mapping:000000001226bf89 index:0x2710 pfn:0x235018
> [ 627.373048] aops:ext4_da_aops ino:c63f3 dentry name:"system.journal"
> [ 627.374526] flags: 0x57ffe00000201c(uptodate|dirty|lru|private)
> [ 627.375865] raw: 0057ffe00000201c fffff84648d300c8 ffff955ab8c3f020 ffff955aba5f4ee0
> [ 627.377586] raw: 0000000000002710 ffff9559b811fc98 0000000500000002 ffff9559b6912000
> [ 627.379308] page dumped because: soft_offline_free_page
> [ 627.380480] page->mem_cgroup:ffff9559b6912000
>
> CPU 0                                    CPU 1
>
> get_any_page // returns 0 (free buddy path)
> soft_offline_free_page
>                                          the page is allocated
>   page_handle_poison -> fail
> return -EBUSY
>
> I'm still not sure why this issue was invisible before the rework patch,
> but setting the migrate type to MIGRATE_ISOLATE during offlining could
> affect the behavior in subtle ways.
Well, this is very timing dependent.
AFAICS, before the rework patchset, we could still race with an
allocation, as the page could have been allocated between get_any_page()
and the call to set_hwpoison_free_buddy_page(), which takes the
zone->lock to prevent exactly that.
Maybe we just happen to take longer now to reach take_page_off_buddy, so
the race window is bigger.
AFAICS, this has nothing to do with MIGRATE_ISOLATE, because here we are
dealing with pages that are already free (part of the buddy system).
The only thing that comes to my mind right off the bat might be to do a
"retry" in soft_offline_page in case soft_offline_free_page returns
-EBUSY, so we can call get_any_page again and try to handle the new type
of page.
Something like (untested):
@@ -1923,6 +1977,7 @@ int soft_offline_page(unsigned long pfn, int flags)
{
int ret;
struct page *page;
+ bool try_again = true;
if (!pfn_valid(pfn))
return -ENXIO;
@@ -1938,6 +1993,7 @@ int soft_offline_page(unsigned long pfn, int flags)
return 0;
}
+retry:
get_online_mems();
ret = get_any_page(page, pfn, flags);
put_online_mems();
@@ -1945,7 +2001,10 @@ int soft_offline_page(unsigned long pfn, int flags)
if (ret > 0)
ret = soft_offline_in_use_page(page);
else if (ret == 0)
- ret = soft_offline_free_page(page);
+ if (soft_offline_free_page(page) && try_again) {
+ try_again = false;
+ goto retry;
+ }
return ret;
--
Oscar Salvador
SUSE L3
On 2020-09-16 16:46, Aristeu Rozanski wrote:
> Hi Oscar,
>
> On Wed, Sep 16, 2020 at 04:09:30PM +0200, Oscar Salvador wrote:
>> On Wed, Sep 16, 2020 at 09:53:58AM -0400, Aristeu Rozanski wrote:
>> Can you try the other patch I posted in response to Naoya?
>
> Same thing:
>
> [ 369.195056] Soft offlining pfn 0x3fb5bf at process virtual address
> 0x7ffc84350000
> [ 369.195073] page:000000002bb131e4 refcount:1 mapcount:0
> mapping:0000000000000000 index:0x7ffc8435 pfn:0x3fb5bf
> [ 369.195080] anon flags:
> 0x3ffff80008000e(referenced|uptodate|dirty|swapbacked)
> [ 369.202131] raw: 003ffff80008000e 5deadbeef0000100 5deadbeef0000122
> c000003fda1c7431
> [ 369.202137] raw: 000000007ffc8435 0000000000000000 00000001ffffffff
> c000003fd63af000
> [ 369.202141] page dumped because: page_handle_poison
> [ 369.202145] page->mem_cgroup:c000003fd63af000
> [ 369.215055] page_handle_poison: hugepage_or_freepage failed�n
> [ 369.215057] __soft_offline_page: page_handle_poison -EBUSY
> [ 369.215068] page:000000002bb131e4 refcount:3 mapcount:0
> mapping:00000000f6ca3f32 index:0x5c pfn:0x3fb5bf
> [ 369.215110] aops:xfs_address_space_operations [xfs] ino:49f9c5f
> dentry name:"messages"
> [ 369.215117] flags: 0x3ffff800002008(dirty|private)
> [ 369.215121] raw: 003ffff800002008 5deadbeef0000100 5deadbeef0000122
> c000003fadd3daa8
> [ 369.215127] raw: 000000000000005c c000003fd9497c20 00000003ffffffff
> c000003fd1143000
> [ 369.215132] page dumped because: __soft_offline_page after migrate
> [ 369.215136] page->mem_cgroup:c000003fd1143000
Ok, this is something different.
The race you saw previously is kinda normal as there is a race window
between spotting a freepage and taking it off the buddy freelists.
The retry patch should help there.
The issue you are seeing right here is due to the call to
page_handle_poison in __soft_offline_page being wrong, as we pass
hugepage_or_freepage = true unconditionally, which is wrong.
Should be:
On 2020-09-16 18:30, [email protected] wrote:
> On 2020-09-16 16:46, Aristeu Rozanski wrote:
>> Hi Oscar,
>>
>> On Wed, Sep 16, 2020 at 04:09:30PM +0200, Oscar Salvador wrote:
>>> On Wed, Sep 16, 2020 at 09:53:58AM -0400, Aristeu Rozanski wrote:
>>> Can you try the other patch I posted in response to Naoya?
>>
>> Same thing:
>>
>> [ 369.195056] Soft offlining pfn 0x3fb5bf at process virtual address
>> 0x7ffc84350000
>> [ 369.195073] page:000000002bb131e4 refcount:1 mapcount:0
>> mapping:0000000000000000 index:0x7ffc8435 pfn:0x3fb5bf
>> [ 369.195080] anon flags:
>> 0x3ffff80008000e(referenced|uptodate|dirty|swapbacked)
>> [ 369.202131] raw: 003ffff80008000e 5deadbeef0000100 5deadbeef0000122
>> c000003fda1c7431
>> [ 369.202137] raw: 000000007ffc8435 0000000000000000 00000001ffffffff
>> c000003fd63af000
>> [ 369.202141] page dumped because: page_handle_poison
>> [ 369.202145] page->mem_cgroup:c000003fd63af000
>> [ 369.215055] page_handle_poison: hugepage_or_freepage failed�n
>> [ 369.215057] __soft_offline_page: page_handle_poison -EBUSY
>> [ 369.215068] page:000000002bb131e4 refcount:3 mapcount:0
>> mapping:00000000f6ca3f32 index:0x5c pfn:0x3fb5bf
>> [ 369.215110] aops:xfs_address_space_operations [xfs] ino:49f9c5f
>> dentry name:"messages"
>> [ 369.215117] flags: 0x3ffff800002008(dirty|private)
>> [ 369.215121] raw: 003ffff800002008 5deadbeef0000100 5deadbeef0000122
>> c000003fadd3daa8
>> [ 369.215127] raw: 000000000000005c c000003fd9497c20 00000003ffffffff
>> c000003fd1143000
>> [ 369.215132] page dumped because: __soft_offline_page after migrate
>> [ 369.215136] page->mem_cgroup:c000003fd1143000
>
>
> Ok, this is something different.
> The race you saw previously is kinda normal as there is a race window
> between spotting a freepage and taking it off the buddy freelists.
> The retry patch should help there.
>
> The issue you are seeing right here is due to the call to
> page_handle_poison in __soft_offline_page being wrong, as we pass
> hugepage_or_freepage = true unconditionally, which is wrong.
>
> Should be:
Fat fingers, sorry:
Ok, this is something different.
The race you saw previously is kinda normal as there is a race window
between spotting a freepage and taking it off the buddy freelists.
The retry patch should help there.
The issue you are seeing right here is due to the call to
page_handle_poison in __soft_offline_page being wrong, as we pass
hugepage_or_freepage = true unconditionally, which is wrong.
I think it was caused during rebasing.
Should be:
@@ -1858,8 +1903,11 @@ static int __soft_offline_page(struct page *page)
if (!ret) {
bool release = !huge;
- if (!page_handle_poison(page, true, release))
+ if (!page_handle_poison(page, huge, release)) {
+ pr_info("%s: page_handle_poison -EBUSY\n", __func__);
+ dump_page(page, "__soft_offline_page after migrate");
ret = -EBUSY;
+ }
Could you try that on top please?
I am away from my laptop now but I will be taking a look later today.
thanks