2023-07-17 18:47:38

by Sidhartha Kumar

[permalink] [raw]
Subject: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

It was pointed out[1] that using folio_test_hwpoison() is wrong
as we need to check the indiviual page that has poison.
folio_test_hwpoison() only checks the head page so go back to using
PageHWPoison().

Reported-by: Matthew Wilcox (Oracle) <[email protected]>
Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
Cc: [email protected] #v6.4
Signed-off-by: Sidhartha Kumar <[email protected]>

[1]: https://lore.kernel.org/lkml/[email protected]/
---
mm/memory-failure.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 02b1d8f104d51..a114c8c3039cd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2523,7 +2523,7 @@ int unpoison_memory(unsigned long pfn)
goto unlock_mutex;
}

- if (!folio_test_hwpoison(folio)) {
+ if (!PageHWPoison(p)) {
unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
pfn, &unpoison_rs);
goto unlock_mutex;
--
2.41.0



2023-07-17 19:34:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On Mon, 17 Jul 2023 11:18:12 -0700 Sidhartha Kumar <[email protected]> wrote:

> It was pointed out[1] that using folio_test_hwpoison() is wrong
> as we need to check the indiviual page that has poison.
> folio_test_hwpoison() only checks the head page so go back to using
> PageHWPoison().

Please describe the user-visible effects of the bug, especially
when proposing a -stable backport.

2023-07-17 23:37:11

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On 7/17/2023 11:18 AM, Sidhartha Kumar wrote:
> It was pointed out[1] that using folio_test_hwpoison() is wrong
> as we need to check the indiviual page that has poison.
> folio_test_hwpoison() only checks the head page so go back to using
> PageHWPoison().
>
> Reported-by: Matthew Wilcox (Oracle) <[email protected]>
> Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
> Cc: [email protected] #v6.4
> Signed-off-by: Sidhartha Kumar <[email protected]>
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
> ---
> mm/memory-failure.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 02b1d8f104d51..a114c8c3039cd 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2523,7 +2523,7 @@ int unpoison_memory(unsigned long pfn)
> goto unlock_mutex;
> }
>
> - if (!folio_test_hwpoison(folio)) {
> + if (!PageHWPoison(p)) {
> unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
> pfn, &unpoison_rs);
> goto unlock_mutex;

Would it worth the trouble to create folio_page_test_##lname(folio,
index) macros to address folio subpage?

thanks!
-jane



2023-07-18 00:47:59

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On Mon, Jul 17, 2023 at 11:18:12AM -0700, Sidhartha Kumar wrote:
> It was pointed out[1] that using folio_test_hwpoison() is wrong
> as we need to check the indiviual page that has poison.
> folio_test_hwpoison() only checks the head page so go back to using
> PageHWPoison().
>
> Reported-by: Matthew Wilcox (Oracle) <[email protected]>
> Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
> Cc: [email protected] #v6.4
> Signed-off-by: Sidhartha Kumar <[email protected]>
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
> ---
> mm/memory-failure.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 02b1d8f104d51..a114c8c3039cd 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2523,7 +2523,7 @@ int unpoison_memory(unsigned long pfn)
> goto unlock_mutex;
> }
>
> - if (!folio_test_hwpoison(folio)) {
> + if (!PageHWPoison(p)) {


I don't think this works for hwpoisoned hugetlb pages that have PageHWPoison
set on the head page, rather than on the raw subpage. In the case of
hwpoisoned thps, PageHWPoison is set on the raw subpage, not on the head
pages. (I believe this is not detected because no one considers the
scenario of unpoisoning hwpoisoned thps, which is a rare case). Perhaps the
function is_page_hwpoison() would be useful for this purpose?

Thanks,
Naoya Horiguchi

> unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
> pfn, &unpoison_rs);
> goto unlock_mutex;
> --
> 2.41.0
>
>
>

2023-07-18 01:16:46

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On Tue, Jul 18, 2023 at 09:14:09AM +0900, Naoya Horiguchi wrote:
> On Mon, Jul 17, 2023 at 11:18:12AM -0700, Sidhartha Kumar wrote:
> > It was pointed out[1] that using folio_test_hwpoison() is wrong
> > as we need to check the indiviual page that has poison.
> > folio_test_hwpoison() only checks the head page so go back to using
> > PageHWPoison().
> >
> > Reported-by: Matthew Wilcox (Oracle) <[email protected]>
> > Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
> > Cc: [email protected] #v6.4
> > Signed-off-by: Sidhartha Kumar <[email protected]>
> >
> > [1]: https://lore.kernel.org/lkml/[email protected]/
> > ---
> > mm/memory-failure.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 02b1d8f104d51..a114c8c3039cd 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2523,7 +2523,7 @@ int unpoison_memory(unsigned long pfn)
> > goto unlock_mutex;
> > }
> >
> > - if (!folio_test_hwpoison(folio)) {
> > + if (!PageHWPoison(p)) {
>
>
> I don't think this works for hwpoisoned hugetlb pages that have PageHWPoison
> set on the head page, rather than on the raw subpage. In the case of
> hwpoisoned thps, PageHWPoison is set on the raw subpage, not on the head
> pages. (I believe this is not detected because no one considers the
> scenario of unpoisoning hwpoisoned thps, which is a rare case). Perhaps the
> function is_page_hwpoison() would be useful for this purpose?

Sorry, I was wrong. Checking PageHWPoison() is fine because the users of
unpoison should know where the PageHWPoison is set via /proc/kpageflags.
So this patch is OK to me after comments from other reviewers are resolved.

Acked-by: Naoya Horiguchi <[email protected]>

Thanks,
Naoya Horiguchi

2023-07-18 14:37:23

by Sidhartha Kumar

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On 7/17/23 12:00 PM, Andrew Morton wrote:
> On Mon, 17 Jul 2023 11:18:12 -0700 Sidhartha Kumar <[email protected]> wrote:
>
>> It was pointed out[1] that using folio_test_hwpoison() is wrong
>> as we need to check the indiviual page that has poison.
>> folio_test_hwpoison() only checks the head page so go back to using
>> PageHWPoison().
>
> Please describe the user-visible effects of the bug, especially
> when proposing a -stable backport.

User-visible effects include existing hwpoison-inject tests possibly
failing as unpoisoning a single subpage could lead to unpoisoning an
entire folio. Memory unpoisoning could also not work as expected as the
function will break early due to only checking the head page and not the
actually poisoned subpage.

2023-07-18 14:40:48

by Sidhartha Kumar

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On 7/17/23 5:39 PM, Naoya Horiguchi wrote:
> On Tue, Jul 18, 2023 at 09:14:09AM +0900, Naoya Horiguchi wrote:
>> On Mon, Jul 17, 2023 at 11:18:12AM -0700, Sidhartha Kumar wrote:
>>> It was pointed out[1] that using folio_test_hwpoison() is wrong
>>> as we need to check the indiviual page that has poison.
>>> folio_test_hwpoison() only checks the head page so go back to using
>>> PageHWPoison().
>>>
>>> Reported-by: Matthew Wilcox (Oracle) <[email protected]>
>>> Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
>>> Cc: [email protected] #v6.4
>>> Signed-off-by: Sidhartha Kumar <[email protected]>
>>>
>>> [1]: https://lore.kernel.org/lkml/[email protected]/
>>> ---
>>> mm/memory-failure.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 02b1d8f104d51..a114c8c3039cd 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2523,7 +2523,7 @@ int unpoison_memory(unsigned long pfn)
>>> goto unlock_mutex;
>>> }
>>>
>>> - if (!folio_test_hwpoison(folio)) {
>>> + if (!PageHWPoison(p)) {
>>
>>
>> I don't think this works for hwpoisoned hugetlb pages that have PageHWPoison
>> set on the head page, rather than on the raw subpage. In the case of
>> hwpoisoned thps, PageHWPoison is set on the raw subpage, not on the head
>> pages. (I believe this is not detected because no one considers the
>> scenario of unpoisoning hwpoisoned thps, which is a rare case). Perhaps the
>> function is_page_hwpoison() would be useful for this purpose?
>
> Sorry, I was wrong. Checking PageHWPoison() is fine because the users of
> unpoison should know where the PageHWPoison is set via /proc/kpageflags.
> So this patch is OK to me after comments from other reviewers are resolved.
>

Hi Naoya,

While taking a closer at the patch, later in unpoison_memory() there is
also:

- ret = TestClearPageHWPoison(page) ? 0 : -EBUSY;
+ ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY;

I thought this folio conversion would be safe because page is the result
of a compound_head() call but I'm wondering if the same issue exists
here and we should be calling TestClearPageHWPoison() on the specific
subpage by doing TestClearPageHWPoison(p).

Thanks,
Sidhartha Kumar

> Acked-by: Naoya Horiguchi <[email protected]>
>
> Thanks,
> Naoya Horiguchi
>


2023-07-19 00:20:08

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On Tue, Jul 18, 2023 at 07:30:23AM -0700, Sidhartha Kumar wrote:
> On 7/17/23 5:39 PM, Naoya Horiguchi wrote:
> > On Tue, Jul 18, 2023 at 09:14:09AM +0900, Naoya Horiguchi wrote:
> > > On Mon, Jul 17, 2023 at 11:18:12AM -0700, Sidhartha Kumar wrote:
> > > > It was pointed out[1] that using folio_test_hwpoison() is wrong
> > > > as we need to check the indiviual page that has poison.
> > > > folio_test_hwpoison() only checks the head page so go back to using
> > > > PageHWPoison().
> > > >
> > > > Reported-by: Matthew Wilcox (Oracle) <[email protected]>
> > > > Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
> > > > Cc: [email protected] #v6.4
> > > > Signed-off-by: Sidhartha Kumar <[email protected]>
> > > >
> > > > [1]: https://lore.kernel.org/lkml/[email protected]/
> > > > ---
> > > > mm/memory-failure.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > > index 02b1d8f104d51..a114c8c3039cd 100644
> > > > --- a/mm/memory-failure.c
> > > > +++ b/mm/memory-failure.c
> > > > @@ -2523,7 +2523,7 @@ int unpoison_memory(unsigned long pfn)
> > > > goto unlock_mutex;
> > > > }
> > > > - if (!folio_test_hwpoison(folio)) {
> > > > + if (!PageHWPoison(p)) {
> > >
> > >
> > > I don't think this works for hwpoisoned hugetlb pages that have PageHWPoison
> > > set on the head page, rather than on the raw subpage. In the case of
> > > hwpoisoned thps, PageHWPoison is set on the raw subpage, not on the head
> > > pages. (I believe this is not detected because no one considers the
> > > scenario of unpoisoning hwpoisoned thps, which is a rare case). Perhaps the
> > > function is_page_hwpoison() would be useful for this purpose?
> >
> > Sorry, I was wrong. Checking PageHWPoison() is fine because the users of
> > unpoison should know where the PageHWPoison is set via /proc/kpageflags.
> > So this patch is OK to me after comments from other reviewers are resolved.
> >
>
> Hi Naoya,
>
> While taking a closer at the patch, later in unpoison_memory() there is
> also:
>
> - ret = TestClearPageHWPoison(page) ? 0 : -EBUSY;
> + ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY;
>
> I thought this folio conversion would be safe because page is the result of
> a compound_head() call but I'm wondering if the same issue exists here and
> we should be calling TestClearPageHWPoison() on the specific subpage by
> doing TestClearPageHWPoison(p).

In this case (get_hwpoison_page returns 0), the target of unpoison_memory was
buddy page or free huge page, so there seems not any realistic problem.
But putting back to TestClearPageHWPoison() looks consistent, so I'm fine with it.

Thanks,
Naoya Horiguchi

2023-07-20 09:37:54

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH] mm/memory-failure: fix hardware poison check in unpoison_memory()

On 2023/7/18 2:18, Sidhartha Kumar wrote:
> It was pointed out[1] that using folio_test_hwpoison() is wrong
> as we need to check the indiviual page that has poison.
> folio_test_hwpoison() only checks the head page so go back to using
> PageHWPoison().
>
> Reported-by: Matthew Wilcox (Oracle) <[email protected]>
> Fixes: a6fddef49eef ("mm/memory-failure: convert unpoison_memory() to folios")
> Cc: [email protected] #v6.4
> Signed-off-by: Sidhartha Kumar <[email protected]>
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
> ---
> mm/memory-failure.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 02b1d8f104d51..a114c8c3039cd 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2523,7 +2523,7 @@ int unpoison_memory(unsigned long pfn)
> goto unlock_mutex;
> }
>
> - if (!folio_test_hwpoison(folio)) {
> + if (!PageHWPoison(p)) {

For successfully handled pages, they should be non-compound pages (dissolved, splitted or normal pages).
So this patch makes no change for them. But for failed to hwpoisoned thp and hugetlb, there's some difference.
But since Naoya points out that, "the users of unpoison should know where the PageHWPoison is set via
/proc/kpageflags.", I'm fine with this patch.

Reviewed-by: Miaohe Lin <[email protected]>

Thanks.