2022-04-27 09:44:46

by Naoya Horiguchi

Subject: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

Hi,

This patchset addresses some issues in workloads involving hwpoison,
hugetlb, and memory_hotplug. The memory hotremove problem reported by
Miaohe Lin [1] is addressed in 2/4. That patch depends on the "storing raw
error info" functionality provided by 1/4, and it also provides a delayed
dissolve function.

Patch 3/4 adjusts unpoison to the new semantics of HPageMigratable for
hwpoisoned hugepages. And 4/4 fixes the inconsistent counter issue.

[1] https://lore.kernel.org/linux-mm/[email protected]/

Please let me know if you have any suggestions and comments.

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (4):
mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage
mm, hwpoison: add parameter unpoison to get_hwpoison_huge_page()
mm, memory_hotplug: fix inconsistent num_poisoned_pages on memory hotremove

include/linux/hugetlb.h | 29 +++++++++++++++++++++++++++--
mm/hugetlb.c | 24 ++++++++++++++++++++++--
mm/memory-failure.c | 8 ++++++--
mm/memory_hotplug.c | 23 +++++++++++------------
mm/page_alloc.c | 6 ++++++
5 files changed, 72 insertions(+), 18 deletions(-)


2022-04-27 09:46:51

by Naoya Horiguchi

Subject: [RFC PATCH v1 2/4] mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage

From: Naoya Horiguchi <[email protected]>

A hwpoisoned page is not supposed to prevent memory hotremove, but
currently this does not work properly for hwpoisoned hugepages: the
kernel tries to migrate them, which could lead to consuming corrupted
data.

Move dissolve_free_huge_pages() before scan_movable_pages(), because
the result of the movable check depends on the result of the dissolve.
Now that delayed dissolve is available, hwpoisoned hugepages can be
turned into 4kB hwpoison pages which memory hotplug can handle.

Also clear the HPageMigratable pseudo flag for hwpoisoned hugepages.
This is important because dissolve_free_huge_page() can fail, so it's
still necessary to prevent do_migrate_pages() from trying to migrate
hwpoisoned hugepages.

Reported-by: Miaohe Lin <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/hugetlb.c | 11 +++++++++++
mm/memory-failure.c | 2 ++
mm/memory_hotplug.c | 23 +++++++++++------------
3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6867ea8345d1..95b1db852ca9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2159,6 +2159,17 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)

for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
page = pfn_to_page(pfn);
+
+ if (PageHuge(page) && PageHWPoison(page)) {
+ /*
+ * Release the last refcount from hwpoison to turn into
+ * a free hugepage.
+ */
+ if (page_count(page) == 1)
+ put_page(page);
+ page = hugetlb_page_hwpoison(page);
+ }
+
rc = dissolve_free_huge_page(page);
if (rc)
break;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 73948a00ad4a..4a2e22bf0983 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1607,6 +1607,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
return res == MF_RECOVERED ? 0 : -EBUSY;
}

+ ClearHPageMigratable(head);
+
page_flags = head->flags;

/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 416b38ca8def..4bc0590f4334 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1864,6 +1864,17 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,

cond_resched();

+ /*
+ * Dissolve free hugepages in the memory block before doing
+ * offlining actually in order to make hugetlbfs's object
+ * counting consistent.
+ */
+ ret = dissolve_free_huge_pages(start_pfn, end_pfn);
+ if (ret) {
+ reason = "failure to dissolve huge pages";
+ goto failed_removal_isolated;
+ }
+
ret = scan_movable_pages(pfn, end_pfn, &pfn);
if (!ret) {
/*
@@ -1879,19 +1890,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
goto failed_removal_isolated;
}

- /*
- * Dissolve free hugepages in the memory block before doing
- * offlining actually in order to make hugetlbfs's object
- * counting consistent.
- */
- ret = dissolve_free_huge_pages(start_pfn, end_pfn);
- if (ret) {
- reason = "failure to dissolve huge pages";
- goto failed_removal_isolated;
- }
-
ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
-
} while (ret);

/* Mark all sections offline and remove free pages from the buddy. */
--
2.25.1

2022-04-27 10:47:39

by Naoya Horiguchi

Subject: [RFC PATCH v1 3/4] mm, hwpoison: add parameter unpoison to get_hwpoison_huge_page()

From: Naoya Horiguchi <[email protected]>

A hwpoisoned hugepage is now expected to be !HPageMigratable, so grabbing
a hugepage for unpoison should use the negation of the check used for
poisoning. This patch implements that with a logical XOR.
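
For illustration only (this table is not part of the patch, just the
intended semantics of the new check in the hunk below):

  unpoison | HPageMigratable | !unpoison != !HPageMigratable | result
  ---------+-----------------+-------------------------------+---------------------------
  false    | true            | true                          | try get_page_unless_zero()
  false    | false           | false                         | -EBUSY
  true     | false           | true                          | try get_page_unless_zero()
  true     | true            | false                         | -EBUSY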

Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/hugetlb.h | 5 +++--
mm/hugetlb.c | 4 ++--
mm/memory-failure.c | 4 ++--
3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 689e69cb556b..99b7ded651f6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -171,7 +171,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
-int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb, bool unpoison);
int get_huge_page_for_hwpoison(unsigned long pfn, int flags);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
@@ -377,7 +377,8 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
return false;
}

-static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb,
+ bool unpoison)
{
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95b1db852ca9..0fbdfa753b54 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6788,7 +6788,7 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
return ret;
}

-int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb, bool unpoison)
{
int ret = 0;

@@ -6798,7 +6798,7 @@ int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
*hugetlb = true;
if (HPageFreed(page))
ret = 0;
- else if (HPageMigratable(page))
+ else if (!unpoison != !HPageMigratable(page))
ret = get_page_unless_zero(page);
else
ret = -EBUSY;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4a2e22bf0983..b5ee3cbc7fbc 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1190,7 +1190,7 @@ static int __get_hwpoison_page(struct page *page, unsigned long flags)
int ret = 0;
bool hugetlb = false;

- ret = get_hwpoison_huge_page(head, &hugetlb);
+ ret = get_hwpoison_huge_page(head, &hugetlb, false);
if (hugetlb)
return ret;

@@ -1283,7 +1283,7 @@ static int __get_unpoison_page(struct page *page)
int ret = 0;
bool hugetlb = false;

- ret = get_hwpoison_huge_page(head, &hugetlb);
+ ret = get_hwpoison_huge_page(head, &hugetlb, true);
if (hugetlb)
return ret;

--
2.25.1

2022-04-27 11:25:53

by Naoya Horiguchi

Subject: [RFC PATCH v1 1/4] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page

From: Naoya Horiguchi <[email protected]>

When handling memory error on a hugetlb page, the error handler tries to
dissolve and turn it into 4kB pages. If it's successfully dissolved,
PageHWPoison flag is moved to the raw error page, so but that's all
right. However, dissolve sometimes fails, then the error page is left
as hwpoisoned hugepage. It's useful if we can retry to dissolve it to
save healthy pages, but that's not possible now because the information
about where the raw error page is lost.

Use the private field of a tail page to keep that information. The code
path of shrinking hugepage pool used this info to try delayed dissolve.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
mm/hugetlb.c | 9 +++++++++
mm/memory-failure.c | 2 ++
3 files changed, 35 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ac2a1d758a80..689e69cb556b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -42,6 +42,9 @@ enum {
SUBPAGE_INDEX_CGROUP, /* reuse page->private */
SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */
__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
+#endif
+#ifdef CONFIG_CGROUP_HUGETLB
+ SUBPAGE_INDEX_HWPOISON,
#endif
__NR_USED_SUBPAGE,
};
@@ -784,6 +787,27 @@ extern int dissolve_free_huge_page(struct page *page);
extern int dissolve_free_huge_pages(unsigned long start_pfn,
unsigned long end_pfn);

+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * pointer to raw error page is located in hpage[SUBPAGE_INDEX_HWPOISON].private
+ */
+static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
+{
+ return (void *)page_private(hpage + SUBPAGE_INDEX_HWPOISON);
+}
+
+static inline void hugetlb_set_page_hwpoison(struct page *hpage,
+ struct page *page)
+{
+ set_page_private(hpage + SUBPAGE_INDEX_HWPOISON, (unsigned long)page);
+}
+#else
+static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
+{
+ return NULL;
+}
+#endif
+
#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
#ifndef arch_hugetlb_migration_supported
static inline bool arch_hugetlb_migration_supported(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f8e048b939c7..6867ea8345d1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1547,6 +1547,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
return;
}

+ if (unlikely(PageHWPoison(page))) {
+ struct page *raw_error = hugetlb_page_hwpoison(page);
+
+ if (raw_error && raw_error != page) {
+ SetPageHWPoison(raw_error);
+ ClearPageHWPoison(page);
+ }
+ }
+
for (i = 0; i < pages_per_huge_page(h);
i++, subpage = mem_map_next(subpage, page, i)) {
subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3e36fc19c4d1..73948a00ad4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1535,6 +1535,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
goto out;
}

+ hugetlb_set_page_hwpoison(head, page);
+
return ret;
out:
if (count_increased)
--
2.25.1

2022-04-27 11:43:50

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On 27.04.22 06:28, Naoya Horiguchi wrote:
> Hi,
>
> This patchset addresses some issues on the workload related to hwpoison,
> hugetlb, and memory_hotplug. The problem in memory hotremove reported by
> Miaohe Lin [1] is mentioned in 2/4. This patch depends on "storing raw
> error info" functionality provided by 1/4. This patch also provide delayed
> dissolve function too.
>
> Patch 3/4 is to adjust unpoison to new semantics of HPageMigratable for
> hwpoisoned hugepage. And 4/4 is the fix for the inconsistent counter issue.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Please let me know if you have any suggestions and comments.
>

Hi,

I raised some time ago already that I don't quite see the value of
allowing memory offlining with poisoned pages.

1) It overcomplicates the offlining code and seems to be partially
broken
2) It happens rarely (ever?), so do we even care?
3) Once the memory is offline, we can re-online it and lose HWPoison.
The memory can be happily used.

3) can happen easily if our DIMM consists of multiple memory blocks and
offlining of some memory block fails -> we'll re-online all already
offlined ones. We'll happily reuse previously HWPoisoned pages, which
feels more dangerous to me than just leaving the DIMM around (and
eventually hwpoisoning all pages on it such that it won't get used
anymore?).

So maybe we should just fail offlining once we stumble over a hwpoisoned
page?

Yes, we would disallow removing a semi-broken DIMM from the system that
was onlined MOVABLE. I wonder if we really need that and how often it
happens in real life. Most systems I am aware of don't allow for
replacing individual DIMMs, but only complete NUMA nodes. Hm.

--
Thanks,

David / dhildenb

2022-04-27 12:46:46

by Oscar Salvador

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote:
> I raised some time ago already that I don't quite see the value of
> allowing memory offlining with poisened pages.
>
> 1) It overcomplicates the offlining code and seems to be partially
> broken
> 2) It happens rarely (ever?), so do we even care?
> 3) Once the memory is offline, we can re-online it and lost HWPoison.
> The memory can be happily used.
>
> 3) can happen easily if our DIMM consists of multiple memory blocks and
> offlining of some memory block fails -> we'll re-online all already
> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> feels more dangerous to me then just leaving the DIMM around (and
> eventually hwpoisoning all pages on it such that it won't get used
> anymore?).
>
> So maybe we should just fail offlining once we stumble over a hwpoisoned
> page?
>
> Yes, we would disallow removing a semi-broken DIMM from the system that
> was onlined MOVABLE. I wonder if we really need that and how often it
> happens in real life. Most systems I am aware of don't allow for
> replacing individual DIMMs, but only complete NUMA nodes. Hm.

I tend to agree with all you said.
And to be honest, the mechanism of making a semi-broken DIMM healthy
again has always been a mystery to me.

One would think that:

1- you hot-remove the memory
2- you fix/remove it
3- you hotplug memory again

but I am not sure how often this actually happens.

And there is also the issue of losing the hwpoison information
across offline<->online transitions, so the whole thing is unreliable.

And for that to work, we would have to add a bunch of code
to keep track of "offlined" pages that are hwpoisoned, so we can
flag them again once they get onlined, and that means more
room for errors.

So, I would lean towards not allowing offlining of memory that
contains such pages in the first place, unless that
proves to be a no-go.


--
Oscar Salvador
SUSE Labs

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote:
> On 27.04.22 06:28, Naoya Horiguchi wrote:
> > Hi,
> >
> > This patchset addresses some issues on the workload related to hwpoison,
> > hugetlb, and memory_hotplug. The problem in memory hotremove reported by
> > Miaohe Lin [1] is mentioned in 2/4. This patch depends on "storing raw
> > error info" functionality provided by 1/4. This patch also provide delayed
> > dissolve function too.
> >
> > Patch 3/4 is to adjust unpoison to new semantics of HPageMigratable for
> > hwpoisoned hugepage. And 4/4 is the fix for the inconsistent counter issue.
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Please let me know if you have any suggestions and comments.
> >
>
> Hi,
>
> I raised some time ago already that I don't quite see the value of
> allowing memory offlining with poisened pages.
>
> 1) It overcomplicates the offlining code and seems to be partially
> broken

Yes, the current code has room for improvement. Maybe the page refcount
of hwpoisoned pages is one such area.

> 2) It happens rarely (ever?), so do we even care?

I'm not certain of the rarity. Some cloud service providers who maintain
lots of servers may care?

> 3) Once the memory is offline, we can re-online it and lost HWPoison.
> The memory can be happily used.
>
> 3) can happen easily if our DIMM consists of multiple memory blocks and
> offlining of some memory block fails -> we'll re-online all already
> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> feels more dangerous to me then just leaving the DIMM around (and
> eventually hwpoisoning all pages on it such that it won't get used
> anymore?).

I see. This scenario can often happen.

>
> So maybe we should just fail offlining once we stumble over a hwpoisoned
> page?

That could be one choice.

Maybe another is like this: offlining can succeed but HWPoison flags are
kept over offline-reonline operations. If the system noticed that the
re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
saved HWPoison flags are still effective, so keep using them. If the
re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
all HWPoison flags associated with replaced physical address range. This
can be done automatically in re-onlining if there's a way for kernel to know
whether DIMM/NUMA nodes are replaced with new ones. But if there isn't,
system applications have to check the HW and explicitly reset the HWPoison
flags.

>
> Yes, we would disallow removing a semi-broken DIMM from the system that
> was onlined MOVABLE. I wonder if we really need that and how often it
> happens in real life. Most systems I am aware of don't allow for
> replacing individual DIMMs, but only complete NUMA nodes. Hm.

Yes, maybe most servers do not provide direct physical access to DIMMs.

Thanks,
Naoya Horiguchi

Subject: Re: [RFC PATCH v1 1/4] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page

On Wed, Apr 27, 2022 at 03:11:31PM +0800, Miaohe Lin wrote:
> On 2022/4/27 12:28, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <[email protected]>
> >
> > When handling memory error on a hugetlb page, the error handler tries to
> > dissolve and turn it into 4kB pages. If it's successfully dissolved,
> > PageHWPoison flag is moved to the raw error page, so but that's all
>
> s/so but/so/

Fixed, thank you.

>
> > right. However, dissolve sometimes fails, then the error page is left
> > as hwpoisoned hugepage. It's useful if we can retry to dissolve it to
> > save healthy pages, but that's not possible now because the information
> > about where the raw error page is lost.
> >
> > Use the private field of a tail page to keep that information. The code
>
> Only one raw error page is saved now. Should this be ok? I think so as memory
> failure should be rare anyway?

This is a good point. It might be rare, but maybe it needs some consideration.
Some ideas in my mind below ...

- using struct page of all subpages is not compatible with hugetlb_free_vmemmap,
so it's not desirable.
- defining a linked list starting from hpage[SUBPAGE_INDEX_HWPOISON].private
might be a solution to save the multiple offsets.
- hacking bits in the hpage[SUBPAGE_INDEX_HWPOISON].private field to save offset
info in a compressed format (a rough sketch follows after this list). For
example, for a 2MB hugepage there could be 512 offset numbers, so we can save
one offset in a 9-bit subfield. So we can save up to 7 offsets in the field.
This is not flexible and still can't handle many errors.
- maintaining a global data structure to save the pfn of all hwpoison pages
in the system. This might sound like overkill for the current purpose,
but this data structure might be helpful for other purposes, so in the long
run someone might get interested in it.
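
A minimal userspace sketch of the bit-packing idea above (names and layout
here are made up for illustration and are not part of this series; a real
implementation would also need a way to track how many slots are in use,
since offset 0 is itself a valid subpage index):

    #include <stdio.h>

    #define HWP_OFFSET_BITS  9   /* 512 subpages in a 2MB hugepage */
    #define HWP_OFFSET_MASK  ((1UL << HWP_OFFSET_BITS) - 1)
    #define HWP_MAX_SLOTS    7   /* 7 * 9 = 63 bits fit in one unsigned long */

    /* store 'offset' into slot 'i' of the packed private value */
    static unsigned long hwp_pack(unsigned long packed, unsigned int i,
                                  unsigned int offset)
    {
            packed &= ~(HWP_OFFSET_MASK << (i * HWP_OFFSET_BITS));
            packed |= ((unsigned long)offset & HWP_OFFSET_MASK)
                      << (i * HWP_OFFSET_BITS);
            return packed;
    }

    /* read slot 'i' back */
    static unsigned int hwp_unpack(unsigned long packed, unsigned int i)
    {
            return (packed >> (i * HWP_OFFSET_BITS)) & HWP_OFFSET_MASK;
    }

    int main(void)
    {
            unsigned long packed = 0;

            packed = hwp_pack(packed, 0, 17);
            packed = hwp_pack(packed, 1, 511);
            printf("slot0=%u slot1=%u\n",
                   hwp_unpack(packed, 0), hwp_unpack(packed, 1));
            return 0;
    }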

>
> > path of shrinking hugepage pool used this info to try delayed dissolve.
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
> > mm/hugetlb.c | 9 +++++++++
> > mm/memory-failure.c | 2 ++
> > 3 files changed, 35 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index ac2a1d758a80..689e69cb556b 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -42,6 +42,9 @@ enum {
> > SUBPAGE_INDEX_CGROUP, /* reuse page->private */
> > SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */
> > __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
> > +#endif
> > +#ifdef CONFIG_CGROUP_HUGETLB
> > + SUBPAGE_INDEX_HWPOISON,
> > #endif
>
> Do we rely on the CONFIG_CGROUP_HUGETLB to store the raw error page?

No. I meant CONFIG_MEMORY_FAILURE.
# I just copied and pasted the #ifdef line just above, and forgot to update
# the CONFIG_* part :(

Thanks,
Naoya Horiguchi

2022-04-28 14:57:06

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

>> 2) It happens rarely (ever?), so do we even care?
>
> I'm not certain of the rarity. Some cloud service providers who maintain
> lots of servers may care?

About replacing broken DIMMs? I'm not so sure, especially because it
requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be
somewhat reliable and individual DIMMs can usually not get replaced at all.

>
>> 3) Once the memory is offline, we can re-online it and lost HWPoison.
>> The memory can be happily used.
>>
>> 3) can happen easily if our DIMM consists of multiple memory blocks and
>> offlining of some memory block fails -> we'll re-online all already
>> offlined ones. We'll happily reuse previously HWPoisoned pages, which
>> feels more dangerous to me then just leaving the DIMM around (and
>> eventually hwpoisoning all pages on it such that it won't get used
>> anymore?).
>
> I see. This scenario can often happen.
>
>>
>> So maybe we should just fail offlining once we stumble over a hwpoisoned
>> page?
>
> That could be one choice.
>
> Maybe another is like this: offlining can succeed but HWPoison flags are
> kept over offline-reonline operations. If the system noticed that the
> re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
> saved HWPoison flags are still effective, so keep using them. If the
> re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
> all HWPoison flags associated with replaced physical address range. This
> can be done automatically in re-onlining if there's a way for kernel to know
> whether DIMM/NUMA nodes are replaced with new ones. But if there isn't,
> system applications have to check the HW and explicitly reset the HWPoison
> flags.

Offline memory sections have a stale memmap, so there is no trusting it.
And trying to work around that or adjusting memory onlining code
overcomplicates something we really don't care about supporting.

So if we continue allowing offlining memory blocks with poisoned pages,
we could simply remember that that memory block had any poisoned page
(either for the memory section or maybe better for the whole memory
block). We can then simply reject/fail memory onlining of these memory
blocks.

So that leaves us with either

1) Fail offlining -> no need to care about reonlining
2) Succeed offlining but fail re-onlining

--
Thanks,

David / dhildenb

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On Thu, Apr 28, 2022 at 10:44:15AM +0200, David Hildenbrand wrote:
> >> 2) It happens rarely (ever?), so do we even care?
> >
> > I'm not certain of the rarity. Some cloud service providers who maintain
> > lots of servers may care?
>
> About replacing broken DIMMs? I'm not so sure, especially because it
> requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be
> somewhat reliable and individual DIMMs can usually not get replaced at all.
>
> >
> >> 3) Once the memory is offline, we can re-online it and lost HWPoison.
> >> The memory can be happily used.
> >>
> >> 3) can happen easily if our DIMM consists of multiple memory blocks and
> >> offlining of some memory block fails -> we'll re-online all already
> >> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> >> feels more dangerous to me then just leaving the DIMM around (and
> >> eventually hwpoisoning all pages on it such that it won't get used
> >> anymore?).
> >
> > I see. This scenario can often happen.
> >
> >>
> >> So maybe we should just fail offlining once we stumble over a hwpoisoned
> >> page?
> >
> > That could be one choice.
> >
> > Maybe another is like this: offlining can succeed but HWPoison flags are
> > kept over offline-reonline operations. If the system noticed that the
> > re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
> > saved HWPoison flags are still effective, so keep using them. If the
> > re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
> > all HWPoison flags associated with replaced physical address range. This
> > can be done automatically in re-onlining if there's a way for kernel to know
> > whether DIMM/NUMA nodes are replaced with new ones. But if there isn't,
> > system applications have to check the HW and explicitly reset the HWPoison
> > flags.
>
> Offline memory sections have a stale memmap, so there is no trusting on
> that. And trying to work around that or adjusting memory onlining code
> overcomplicates something we really don't care about supporting.

OK, so I'll go ahead and reduce the hwpoison-specific complexity in the
memory offlining code.

>
> So if we continue allowing offlining memory blocks with poisoned pages,
> we could simply remember that that memory block had any posioned page
> (either for the memory section or maybe better for the whole memory
> block). We can then simply reject/fail memory onlining of these memory
> blocks.

It also seems helpful for other contexts (like hugetlb) to know whether
there's any hwpoisoned page in a given physical address range, so I'll
consider this approach.

>
> So that leaves us with either
>
> 1) Fail offlining -> no need to care about reonlining
> 2) Succeed offlining but fail re-onlining

Rephrasing in case I misread: in the end, memory offlining code should never
check hwpoisoned pages, and memory onlining code would do a kind of range
query to find hwpoisoned pages (without depending on the PageHWPoison flag).

Thanks,
Naoya Horiguchi

Subject: Re: [RFC PATCH v1 2/4] mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage

On Fri, Apr 29, 2022 at 04:49:16PM +0800, Miaohe Lin wrote:
> On 2022/4/27 12:28, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <[email protected]>
> >
> > HWPoisoned page is not supposed to prevent memory hotremove, but
> > currently this does not properly work for hwpoisoned hugepages and the
> > kernel tries to migrate them, which could cause consuming corrupted
> > data.
> >
> > Move dissolve_free_huge_pages() before scan_movable_pages(). This is
> > because the result of the movable check depends on the result of the
> > dissolve. Now delayed dissolve is available, so hwpoisoned hugepages
> > can be turned into 4kB hwpoison page which memory hotplug can handle.
> >
> > And clear HPageMigratable pseudo flag for hwpoisoned hugepages. This is
> > also important because dissolve_free_huge_page() can fail. So it's
> > still necessary to prevent do_migrate_pages() from trying to migrate
> > hwpoison hugepages.
> >
> > Reported-by: Miaohe Lin <[email protected]>
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > mm/hugetlb.c | 11 +++++++++++
> > mm/memory-failure.c | 2 ++
> > mm/memory_hotplug.c | 23 +++++++++++------------
> > 3 files changed, 24 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6867ea8345d1..95b1db852ca9 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2159,6 +2159,17 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> >
> > for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
> > page = pfn_to_page(pfn);
> > +
> > + if (PageHuge(page) && PageHWPoison(page)) {
> > + /*
> > + * Release the last refcount from hwpoison to turn into
> > + * a free hugepage.
> > + */
> > + if (page_count(page) == 1)
> > + put_page(page);
> > + page = hugetlb_page_hwpoison(page);
> > + }
> > +
>
> This patch looks good to me. Thanks!
>
> One question: Can this hugepage be put into buddy system? In free_huge_page,
> if h->surplus_huge_pages_node[nid] > 0, hugepage might be put into buddy via
> update_and_free_page. So it's not PageHuge anymore and won't be dissolved. If
> this happens, the "raw error page" is still missed and might be accessed later.

Yes, this put_page() could free the page directly into the buddy allocator.
In that case, I expect __update_and_free_page() to move the PageHWPoison flag
to the raw error page, so I think the final result should be the same.

Thanks,
Naoya Horiguchi

2022-05-09 10:56:18

by Oscar Salvador

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
> >> So that leaves us with either
> >>
> >> 1) Fail offlining -> no need to care about reonlining
>
> Maybe fail offlining will be a better alternative as we can get rid of many races
> between memory failure and memory offline? But no strong opinion. :)

If taking care of those races is not an herculean effort, I'd go with
allowing offlining + disallow re-onlining.
Mainly because memory RAS stuff.

Now, to the re-onlining thing, we'll have to come up with a way to check
whether a section contains hwpoisoned pages, so we do not have to go
and check every single page, as that will be really suboptimal.


--
Oscar Salvador
SUSE Labs

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote:
> On 09.05.22 12:53, Miaohe Lin wrote:
> > On 2022/5/9 17:58, Oscar Salvador wrote:
> >> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
> >>>>> So that leaves us with either
> >>>>>
> >>>>> 1) Fail offlining -> no need to care about reonlining
> >>>
> >>> Maybe fail offlining will be a better alternative as we can get rid of many races
> >>> between memory failure and memory offline? But no strong opinion. :)
> >>
> >> If taking care of those races is not an herculean effort, I'd go with
> >> allowing offlining + disallow re-onlining.
> >> Mainly because memory RAS stuff.
> >
> > This dose make sense to me. Thanks. We can try to solve those races if
> > offlining + disallow re-onlining is applied. :)
> >
> >>
> >> Now, to the re-onlining thing, we'll have to come up with a way to check
> >> whether a section contains hwpoisoned pages, so we do not have to go
> >> and check every single page, as that will be really suboptimal.
> >
> > Yes, we need a stable and cheap way to do that.
>
> My simplistic approach would be a simple flag/indicator in the memory block devices
> that indicates that any page in the memory block was hwpoisoned. It's easy to
> check that during memory onlining and fail it.
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 084d67fd55cc..3d0ef812e901 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
> struct zone *zone;
> int ret;
>
> + if (mem->hwpoisoned)
> + return -EHWPOISON;
> +
> zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
> start_pfn, nr_pages);
>

Thanks for the idea, a simple flag could work if we don't have to consider
unpoison. If we need to consider unpoison, we need to remember the last
hwpoison page in the memory block, so mem->hwpoisoned should be a counter
of hwpoison pages.
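
Roughly, the counter variant could look like this (purely illustrative;
the field and helpers below are hypothetical and not existing kernel
code): memory_failure() would increment, unpoison_memory() would
decrement, and onlining would be refused while the count is non-zero.

    /* hypothetical new field in struct memory_block: atomic_t nr_hwpoison; */

    /* called by hwpoison code when a page in the block gets poisoned */
    static void memblk_note_hwpoison(struct memory_block *mem)
    {
            atomic_inc(&mem->nr_hwpoison);
    }

    /* called by unpoison to undo it */
    static void memblk_note_unpoison(struct memory_block *mem)
    {
            atomic_dec(&mem->nr_hwpoison);
    }

    /* memory_block_online() would check this and return -EHWPOISON */
    static bool memblk_may_online(struct memory_block *mem)
    {
            return atomic_read(&mem->nr_hwpoison) == 0;
    }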

>
>
> Once the problematic DIMM would actually get unplugged, the memory block devices
> would get removed as well. So when hotplugging a new DIMM in the same
> location, we could online that memory again.

What about PG_hwpoison flags? struct pages are also freed and reallocated
in the actual DIMM replacement?

Thanks,
Naoya Horiguchi

>
> Another place to store that would be the memory section, we'd then have to check
> all underlying sections here.
>
> We're a bit short on flags in the memory section I think, but they are easier to
> lookup from other code eventually then memory block devices.

2022-05-12 14:33:52

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

>>>>
>>>> Once the problematic DIMM would actually get unplugged, the memory block devices
>>>> would get removed as well. So when hotplugging a new DIMM in the same
>>>> location, we could online that memory again.
>>>
>>> What about PG_hwpoison flags? struct pages are also freed and reallocated
>>> in the actual DIMM replacement?
>>
>> Once memory is offline, the memmap is stale and is no longer
>> trustworthy. It gets reinitialize during memory onlining -- so any
>> previous PG_hwpoison is overridden at least there. In some setups, we
>> even poison the whole memmap via page_init_poison() during memory offlining.
>>
>> Apart from that, we should be freeing the memmap in all relevant cases
>> when removing memory. I remember there are a couple of corner cases, but
>> we don't really have to care about that.
>
> OK, so there seems no need to manipulate struct pages for hwpoison in
> all relevant cases.

Right. When offlining a memory block, all we have to do is remember if
we stumbled over a hwpoisoned page and remember that inside the memory
block. Rejecting to online is then easy.

--
Thanks,

David / dhildenb


2022-05-12 18:55:18

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote:
>> On 09.05.22 12:53, Miaohe Lin wrote:
>>> On 2022/5/9 17:58, Oscar Salvador wrote:
>>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>>>>> So that leaves us with either
>>>>>>>
>>>>>>> 1) Fail offlining -> no need to care about reonlining
>>>>>
>>>>> Maybe fail offlining will be a better alternative as we can get rid of many races
>>>>> between memory failure and memory offline? But no strong opinion. :)
>>>>
>>>> If taking care of those races is not an herculean effort, I'd go with
>>>> allowing offlining + disallow re-onlining.
>>>> Mainly because memory RAS stuff.
>>>
>>> This dose make sense to me. Thanks. We can try to solve those races if
>>> offlining + disallow re-onlining is applied. :)
>>>
>>>>
>>>> Now, to the re-onlining thing, we'll have to come up with a way to check
>>>> whether a section contains hwpoisoned pages, so we do not have to go
>>>> and check every single page, as that will be really suboptimal.
>>>
>>> Yes, we need a stable and cheap way to do that.
>>
>> My simplistic approach would be a simple flag/indicator in the memory block devices
>> that indicates that any page in the memory block was hwpoisoned. It's easy to
>> check that during memory onlining and fail it.
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 084d67fd55cc..3d0ef812e901 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
>> struct zone *zone;
>> int ret;
>>
>> + if (mem->hwpoisoned)
>> + return -EHWPOISON;
>> +
>> zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
>> start_pfn, nr_pages);
>>
>
> Thanks for the idea, a simple flag could work if we don't have to consider
> unpoison. If we need consider unpoison, we need remember the last hwpoison
> page in the memory block, so mem->hwpoisoned should be the counter of
> hwpoison pages.

Right, but unpoisoning+memory offlining+memory onlining is an even more
extreme use case we don't have to bother about, I think.

>
>>
>>
>> Once the problematic DIMM would actually get unplugged, the memory block devices
>> would get removed as well. So when hotplugging a new DIMM in the same
>> location, we could online that memory again.
>
> What about PG_hwpoison flags? struct pages are also freed and reallocated
> in the actual DIMM replacement?

Once memory is offline, the memmap is stale and is no longer
trustworthy. It gets reinitialized during memory onlining -- so any
previous PG_hwpoison is overridden at least there. In some setups, we
even poison the whole memmap via page_init_poison() during memory offlining.

Apart from that, we should be freeing the memmap in all relevant cases
when removing memory. I remember there are a couple of corner cases, but
we don't really have to care about that.

--
Thanks,

David / dhildenb


Subject: Re: [RFC PATCH v1 1/4] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page

On Thu, May 12, 2022 at 10:31:42PM +0000, Jane Chu wrote:
> On 4/26/2022 9:28 PM, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <[email protected]>
> >
> > When handling memory error on a hugetlb page, the error handler tries to
> > dissolve and turn it into 4kB pages. If it's successfully dissolved,
> > PageHWPoison flag is moved to the raw error page, so but that's all
> > right. However, dissolve sometimes fails, then the error page is left
> > as hwpoisoned hugepage. It's useful if we can retry to dissolve it to
> > save healthy pages, but that's not possible now because the information
> > about where the raw error page is lost.
> >
> > Use the private field of a tail page to keep that information. The code
> > path of shrinking hugepage pool used this info to try delayed dissolve.
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > ---
> > include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
> > mm/hugetlb.c | 9 +++++++++
> > mm/memory-failure.c | 2 ++
> > 3 files changed, 35 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index ac2a1d758a80..689e69cb556b 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -42,6 +42,9 @@ enum {
> > SUBPAGE_INDEX_CGROUP, /* reuse page->private */
> > SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */
> > __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
> > +#endif
> > +#ifdef CONFIG_CGROUP_HUGETLB
> > + SUBPAGE_INDEX_HWPOISON,
> > #endif
> > __NR_USED_SUBPAGE,
> > };
> > @@ -784,6 +787,27 @@ extern int dissolve_free_huge_page(struct page *page);
> > extern int dissolve_free_huge_pages(unsigned long start_pfn,
> > unsigned long end_pfn);
> >
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +/*
> > + * pointer to raw error page is located in hpage[SUBPAGE_INDEX_HWPOISON].private
> > + */
> > +static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
> > +{
> > + return (void *)page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> > +}
> > +
> > +static inline void hugetlb_set_page_hwpoison(struct page *hpage,
> > + struct page *page)
> > +{
> > + set_page_private(hpage + SUBPAGE_INDEX_HWPOISON, (unsigned long)page);
> > +}
>
> What happens if the ->private field is already holding a poisoned page
> pointer? that is, in a scenario of multiple poisoned pages within a
> hugepage, what to do? mark the entire hpage poisoned?

Hi Jane,

The current version does not consider the multiple poisoned pages scenario,
so if that happens, the ->private field would simply be overwritten.
But in this patch hugetlb_set_page_hwpoison() is called just after the
"if (TestSetPageHWPoison(head))" check, so hugetlb_set_page_hwpoison()
is not expected to be called more than once on a single hugepage.

When we try to support the multiple poison scenario, we may add some code
in the "already hwpoisoned" path to store additional info about the raw
error page. The implementation details are still to be determined.

Thanks,
Naoya Horiguchi

2022-05-14 00:16:42

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On 12.05.22 13:13, Miaohe Lin wrote:
> On 2022/5/12 15:28, David Hildenbrand wrote:
>>>>>>
>>>>>> Once the problematic DIMM would actually get unplugged, the memory block devices
>>>>>> would get removed as well. So when hotplugging a new DIMM in the same
>>>>>> location, we could online that memory again.
>>>>>
>>>>> What about PG_hwpoison flags? struct pages are also freed and reallocated
>>>>> in the actual DIMM replacement?
>>>>
>>>> Once memory is offline, the memmap is stale and is no longer
>>>> trustworthy. It gets reinitialize during memory onlining -- so any
>>>> previous PG_hwpoison is overridden at least there. In some setups, we
>>>> even poison the whole memmap via page_init_poison() during memory offlining.
>>>>
>>>> Apart from that, we should be freeing the memmap in all relevant cases
>>>> when removing memory. I remember there are a couple of corner cases, but
>>>> we don't really have to care about that.
>>>
>>> OK, so there seems no need to manipulate struct pages for hwpoison in
>>> all relevant cases.
>>
>> Right. When offlining a memory block, all we have to do is remember if
>> we stumbled over a hwpoisoned page and rememebr that inside the memory
>> block. Rejecting to online is then easy.
>
> BTW: How should we deal with the below race window:
>
> CPU A: accessing page while holding page refcnt
> CPU B: memory_failure happened on page
> CPU C: offline_pages
> CPU C: page can be offlined because page refcnt
>        is ignored when PG_hwpoison is set
> CPU A: can still access page struct...
>
> Any in use page (with page refcnt incremented) might be offlined while its content, e.g. flags, private ..., can
> still be accessed if the above race happened. Is this possible? Or am I missing something? Any suggestion to fix it?
> I can't figure out a way yet. :(

I assume you mean that test_pages_isolated() essentially only checks for
PageHWPoison() and doesn't care about the refcount?

That part is very dodgy and it's part of my motivation to question that
whole handling in the first place.


In do_migrate_range(), there is a comment:

"
HWPoison pages have elevated reference counts so the migration would
fail on them. It also doesn't make any sense to migrate them in the
first place. Still try to unmap such a page in case it is still mapped
(e.g. current hwpoison implementation doesn't unmap KSM pages but keep
the unmap as the catch all safety net).
"

My assumption would be: if there are any unexpected references on a
hwpoison page, we must fail offlining. Ripping out the page might be
more harmful than just leaving it in place and failing offlining for the
time being.



I am no expert on PageHWPoison(). Which guarantees do we have regarding
the page count?

If we succeed in unmapping the page, there shouldn't be any references
from the page tables. We might still have GUP references to such pages,
and it would be fair enough to fail offlining. I remember we try
removing the page from the pagecache etc. to free up these references.
So which additional references do we have that the comment in offlining
code talks about? A single additional one from hwpoison code?

Once we figure that out, we might tweak test_pages_isolated() to also
consider the page count and not rip out random pages that are still
referenced in the system.

--
Thanks,

David / dhildenb


Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On Wed, May 11, 2022 at 06:22:41PM +0200, David Hildenbrand wrote:
> On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote:
> >> On 09.05.22 12:53, Miaohe Lin wrote:
> >>> On 2022/5/9 17:58, Oscar Salvador wrote:
> >>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
> >>>>>>> So that leaves us with either
> >>>>>>>
> >>>>>>> 1) Fail offlining -> no need to care about reonlining
> >>>>>
> >>>>> Maybe fail offlining will be a better alternative as we can get rid of many races
> >>>>> between memory failure and memory offline? But no strong opinion. :)
> >>>>
> >>>> If taking care of those races is not an herculean effort, I'd go with
> >>>> allowing offlining + disallow re-onlining.
> >>>> Mainly because memory RAS stuff.
> >>>
> >>> This dose make sense to me. Thanks. We can try to solve those races if
> >>> offlining + disallow re-onlining is applied. :)
> >>>
> >>>>
> >>>> Now, to the re-onlining thing, we'll have to come up with a way to check
> >>>> whether a section contains hwpoisoned pages, so we do not have to go
> >>>> and check every single page, as that will be really suboptimal.
> >>>
> >>> Yes, we need a stable and cheap way to do that.
> >>
> >> My simplistic approach would be a simple flag/indicator in the memory block devices
> >> that indicates that any page in the memory block was hwpoisoned. It's easy to
> >> check that during memory onlining and fail it.
> >>
> >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >> index 084d67fd55cc..3d0ef812e901 100644
> >> --- a/drivers/base/memory.c
> >> +++ b/drivers/base/memory.c
> >> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
> >> struct zone *zone;
> >> int ret;
> >>
> >> + if (mem->hwpoisoned)
> >> + return -EHWPOISON;
> >> +
> >> zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
> >> start_pfn, nr_pages);
> >>
> >
> > Thanks for the idea, a simple flag could work if we don't have to consider
> > unpoison. If we need consider unpoison, we need remember the last hwpoison
> > page in the memory block, so mem->hwpoisoned should be the counter of
> > hwpoison pages.
>
> Right, but unpoisoning+memory offlining+memory onlining is a yet more
> extreme use case we don't have to bother about I think.

OK. Maybe starting with the simple one is fine.

>
> >
> >>
> >>
> >> Once the problematic DIMM would actually get unplugged, the memory block devices
> >> would get removed as well. So when hotplugging a new DIMM in the same
> >> location, we could online that memory again.
> >
> > What about PG_hwpoison flags? struct pages are also freed and reallocated
> > in the actual DIMM replacement?
>
> Once memory is offline, the memmap is stale and is no longer
> trustworthy. It gets reinitialize during memory onlining -- so any
> previous PG_hwpoison is overridden at least there. In some setups, we
> even poison the whole memmap via page_init_poison() during memory offlining.
>
> Apart from that, we should be freeing the memmap in all relevant cases
> when removing memory. I remember there are a couple of corner cases, but
> we don't really have to care about that.

OK, so there seems no need to manipulate struct pages for hwpoison in
all relevant cases.

Thanks,
Naoya Horiguchi

2022-05-14 02:43:16

by David Hildenbrand

Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

On 09.05.22 12:53, Miaohe Lin wrote:
> On 2022/5/9 17:58, Oscar Salvador wrote:
>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>>> So that leaves us with either
>>>>>
>>>>> 1) Fail offlining -> no need to care about reonlining
>>>
>>> Maybe fail offlining will be a better alternative as we can get rid of many races
>>> between memory failure and memory offline? But no strong opinion. :)
>>
>> If taking care of those races is not an herculean effort, I'd go with
>> allowing offlining + disallow re-onlining.
>> Mainly because memory RAS stuff.
>
> This dose make sense to me. Thanks. We can try to solve those races if
> offlining + disallow re-onlining is applied. :)
>
>>
>> Now, to the re-onlining thing, we'll have to come up with a way to check
>> whether a section contains hwpoisoned pages, so we do not have to go
>> and check every single page, as that will be really suboptimal.
>
> Yes, we need a stable and cheap way to do that.

My simplistic approach would be a simple flag/indicator in the memory block devices
that indicates that any page in the memory block was hwpoisoned. It's easy to
check that during memory onlining and fail it.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 084d67fd55cc..3d0ef812e901 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
struct zone *zone;
int ret;

+ if (mem->hwpoisoned)
+ return -EHWPOISON;
+
zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
start_pfn, nr_pages);



Once the problematic DIMM would actually get unplugged, the memory block devices
would get removed as well. So when hotplugging a new DIMM in the same
location, we could online that memory again.

Another place to store that would be the memory section; we'd then have to check
all underlying sections here.

We're a bit short on flags in the memory section I think, but they are easier to
look up from other code eventually than memory block devices.
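
As a rough sketch of the offlining side (again purely illustrative: the
"hwpoisoned" field is the hypothetical one from the hunk above, and none
of this exists in the tree), the offlining path could record whether the
block ever contained a hwpoisoned page so that onlining can be refused
later:

    static void memory_block_record_hwpoison(struct memory_block *mem,
                                             unsigned long start_pfn,
                                             unsigned long nr_pages)
    {
            unsigned long pfn;

            for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                    /* remember any hwpoisoned page in the block */
                    if (pfn_valid(pfn) && PageHWPoison(pfn_to_page(pfn))) {
                            mem->hwpoisoned = true;
                            break;
                    }
            }
    }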

--
Thanks,

David / dhildenb


2022-05-14 03:45:08

by Jane Chu

Subject: Re: [RFC PATCH v1 1/4] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page

On 4/26/2022 9:28 PM, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <[email protected]>
>
> When handling memory error on a hugetlb page, the error handler tries to
> dissolve and turn it into 4kB pages. If it's successfully dissolved,
> PageHWPoison flag is moved to the raw error page, so but that's all
> right. However, dissolve sometimes fails, then the error page is left
> as hwpoisoned hugepage. It's useful if we can retry to dissolve it to
> save healthy pages, but that's not possible now because the information
> about where the raw error page is lost.
>
> Use the private field of a tail page to keep that information. The code
> path of shrinking hugepage pool used this info to try delayed dissolve.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
> mm/hugetlb.c | 9 +++++++++
> mm/memory-failure.c | 2 ++
> 3 files changed, 35 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ac2a1d758a80..689e69cb556b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -42,6 +42,9 @@ enum {
> SUBPAGE_INDEX_CGROUP, /* reuse page->private */
> SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */
> __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
> +#endif
> +#ifdef CONFIG_CGROUP_HUGETLB
> + SUBPAGE_INDEX_HWPOISON,
> #endif
> __NR_USED_SUBPAGE,
> };
> @@ -784,6 +787,27 @@ extern int dissolve_free_huge_page(struct page *page);
> extern int dissolve_free_huge_pages(unsigned long start_pfn,
> unsigned long end_pfn);
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * pointer to raw error page is located in hpage[SUBPAGE_INDEX_HWPOISON].private
> + */
> +static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
> +{
> + return (void *)page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> +}
> +
> +static inline void hugetlb_set_page_hwpoison(struct page *hpage,
> + struct page *page)
> +{
> + set_page_private(hpage + SUBPAGE_INDEX_HWPOISON, (unsigned long)page);
> +}

What happens if the ->private field is already holding a poisoned page
pointer? That is, in a scenario of multiple poisoned pages within a
hugepage, what should we do? Mark the entire hpage poisoned?

thanks,
-jane


> +#else
> +static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
> +{
> + return NULL;
> +}
> +#endif
> +
> #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> #ifndef arch_hugetlb_migration_supported
> static inline bool arch_hugetlb_migration_supported(struct hstate *h)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f8e048b939c7..6867ea8345d1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1547,6 +1547,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
> return;
> }
>
> + if (unlikely(PageHWPoison(page))) {
> + struct page *raw_error = hugetlb_page_hwpoison(page);
> +
> + if (raw_error && raw_error != page) {
> + SetPageHWPoison(raw_error);
> + ClearPageHWPoison(page);
> + }
> + }
> +
> for (i = 0; i < pages_per_huge_page(h);
> i++, subpage = mem_map_next(subpage, page, i)) {
> subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3e36fc19c4d1..73948a00ad4a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1535,6 +1535,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> goto out;
> }
>
> + hugetlb_set_page_hwpoison(head, page);
> +
> return ret;
> out:
> if (count_increased)