2021-03-17 08:39:34

by yaoaili [么爱利]

[permalink] [raw]
Subject: [PATCH] mm/gup: check page posion status for coredump.

When we do coredump for user process signal, this may be an SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
resulted from ECC memory fail like SRAR or SRAO, we expect the memory
recovery work is finished correctly, then the get_dump_page() will not
return the error page as its process pte is set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid, for current code, we will return the poison page
and get it dumped and lead to system panic as its in kernel code.

So check the poison status in get_dump_page(), and if TRUE, return NULL.

Signed-off-by: Aili Yao <[email protected]>
---
mm/gup.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..499a496 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,14 @@ struct page *get_dump_page(unsigned long addr)
FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+ if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && ret == 1) {
+ if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
+ ret = 0;
+ else if (unlikely(PageHWPoison(page)))
+ ret = 0;
+ }
+
return (ret == 1) ? page : NULL;
}
#endif /* CONFIG_ELF_CORE */
--
1.8.3.1


2021-03-17 09:14:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/gup: check page posion status for coredump.

On 17.03.21 09:37, Aili Yao wrote:
> When we do coredump for user process signal, this may be an SIGBUS signal
> with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> recovery work is finished correctly, then the get_dump_page() will not
> return the error page as its process pte is set invalid by
> memory_failure().
>
> But memory_failure() may fail, and the process's related pte may not be
> correctly set invalid, for current code, we will return the poison page
> and get it dumped and lead to system panic as its in kernel code.
>
> So check the poison status in get_dump_page(), and if TRUE, return NULL.
>
> Signed-off-by: Aili Yao <[email protected]>
> ---
> mm/gup.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index e4c224c..499a496 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1536,6 +1536,14 @@ struct page *get_dump_page(unsigned long addr)
> FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> if (locked)
> mmap_read_unlock(mm);
> +
> + if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && ret == 1) {
> + if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
> + ret = 0;
> + else if (unlikely(PageHWPoison(page)))
> + ret = 0;
> + }

I wonder if a simple

if (PageHWPoison(compound_head(page)))
ret = 0;

won't suffice. But I guess the "issue" is compound pages that are not
huge pages or transparent huge pages.

If not, we certainly want a wrapper for that magic, otherwise we have to
replicate the same logic all over the place.

> +
> return (ret == 1) ? page : NULL;
> }
> #endif /* CONFIG_ELF_CORE */
>


--
Thanks,

David / dhildenb

2021-03-18 03:20:33

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH] mm/gup: check page posion status for coredump.

On Wed, 17 Mar 2021 10:12:02 +0100
David Hildenbrand <[email protected]> wrote:

>
> I wonder if a simple
>
> if (PageHWPoison(compound_head(page)))
> ret = 0;
>
> won't suffice. But I guess the "issue" is compound pages that are not
> huge pages or transparent huge pages.

Yes, the simple case won't suffice, as we mark the hugetlb page poison in head, and
other cases in the specific single page struct.

> If not, we certainly want a wrapper for that magic, otherwise we have to
> replicate the same logic all over the place.
>
> > +
> > return (ret == 1) ? page : NULL;
> > }
> > #endif /* CONFIG_ELF_CORE */
> >
>
>

Yes, May other places meet the requirements as the coredump meets, it's better to make a
wrapper for this. But i am not familiar with the specific scenario, so this patch only cover
the coredump case.

I will post a v2 patch for this.

--
Thanks!
Aili Yao

2021-03-18 03:21:17

by yaoaili [么爱利]

[permalink] [raw]
Subject: [PATCH v2] mm/gup: check page posion status for coredump.

When we do coredump for user process signal, this may be an SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
resulted from ECC memory fail like SRAR or SRAO, we expect the memory
recovery work is finished correctly, then the get_dump_page() will not
return the error page as its process pte is set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid, for current code, we will return the poison page
and get it dumped and lead to system panic as its in kernel code.

So check the poison status in get_dump_page(), and if TRUE, return NULL.

There maybe other scenario that is also better to check the posion status
and not to panic, so make a wrapper for this check, suggested by
David Hildenbrand <[email protected]>

Signed-off-by: Aili Yao <[email protected]>
---
mm/gup.c | 4 ++++
mm/internal.h | 21 +++++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..3b4703a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+ if (ret == 1 && check_user_page_poison(page))
+ return NULL;
+
return (ret == 1) ? page : NULL;
}
#endif /* CONFIG_ELF_CORE */
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439..777b3e5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,27 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

+/*
+ * When kernel touch the user page, the user page may be have been marked
+ * poison but still mapped in user space, if without this page, the kernel
+ * can guarantee the data integrity and operation success, the kernel is
+ * better to check the posion status and avoid touching it, be good not to
+ * panic, coredump for process fatal signal is a sample case matching this
+ * scenario. Or if kernel can't guarantee the data integrity, it's better
+ * not to call this function, let kernel touch the poison page and get to
+ * panic.
+ */
+static inline int check_user_page_poison(struct page *page)
+{
+ if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && page != NULL) {
+ if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
+ return true;
+ else if (unlikely(PageHWPoison(page)))
+ return true;
+ }
+ return 0;
+}
+
extern unsigned long highest_memmap_pfn;

/*
--
1.8.3.1

2021-03-18 04:49:00

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/gup: check page posion status for coredump.

On Wed, Mar 17, 2021 at 10:12:02AM +0100, David Hildenbrand wrote:
> > + if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && ret == 1) {
> > + if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
> > + ret = 0;
> > + else if (unlikely(PageHWPoison(page)))
> > + ret = 0;
> > + }
>
> I wonder if a simple
>
> if (PageHWPoison(compound_head(page)))
> ret = 0;
>
> won't suffice. But I guess the "issue" is compound pages that are not huge
> pages or transparent huge pages.

THPs don't set the HWPoison bit on the head page.

https://lore.kernel.org/linux-mm/[email protected]/

(and PAGEFLAG(HWPoison, hwpoison, PF_ANY))

By the way,

#ifdef CONFIG_MEMORY_FAILURE
PAGEFLAG(HWPoison, hwpoison, PF_ANY)
TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
#define __PG_HWPOISON (1UL << PG_hwpoison)
extern bool take_page_off_buddy(struct page *page);
#else
PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

so there's no need for this
if (IS_ENABLED(CONFIG_MEMORY_FAILURE)
check, as it simply turns into

if (PageHuge(page) && 0)
else if (0)

and the compiler can optimise it all away.

2021-03-18 05:36:26

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH] mm/gup: check page posion status for coredump.

On Thu, 18 Mar 2021 04:46:00 +0000
Matthew Wilcox <[email protected]> wrote:

> On Wed, Mar 17, 2021 at 10:12:02AM +0100, David Hildenbrand wrote:
> > > + if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && ret == 1) {
> > > + if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
> > > + ret = 0;
> > > + else if (unlikely(PageHWPoison(page)))
> > > + ret = 0;
> > > + }
> >
> > I wonder if a simple
> >
> > if (PageHWPoison(compound_head(page)))
> > ret = 0;
> >
> > won't suffice. But I guess the "issue" is compound pages that are not huge
> > pages or transparent huge pages.
>
> THPs don't set the HWPoison bit on the head page.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> (and PAGEFLAG(HWPoison, hwpoison, PF_ANY))
>
> By the way,
>
> #ifdef CONFIG_MEMORY_FAILURE
> PAGEFLAG(HWPoison, hwpoison, PF_ANY)
> TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
> #define __PG_HWPOISON (1UL << PG_hwpoison)
> extern bool take_page_off_buddy(struct page *page);
> #else
> PAGEFLAG_FALSE(HWPoison)
> #define __PG_HWPOISON 0
> #endif
>
> so there's no need for this
> if (IS_ENABLED(CONFIG_MEMORY_FAILURE)
> check, as it simply turns into
>
> if (PageHuge(page) && 0)
> else if (0)
>
> and the compiler can optimise it all away.

Yes, You are right, I will modify this later.
Thanks for correction

--
Thanks!
Aili Yao

2021-03-18 08:18:38

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm/gup: check page posion status for coredump.

On 18.03.21 05:46, Matthew Wilcox wrote:
> On Wed, Mar 17, 2021 at 10:12:02AM +0100, David Hildenbrand wrote:
>>> + if (IS_ENABLED(CONFIG_MEMORY_FAILURE) && ret == 1) {
>>> + if (unlikely(PageHuge(page) && PageHWPoison(compound_head(page))))
>>> + ret = 0;
>>> + else if (unlikely(PageHWPoison(page)))
>>> + ret = 0;
>>> + }
>>
>> I wonder if a simple
>>
>> if (PageHWPoison(compound_head(page)))
>> ret = 0;
>>
>> won't suffice. But I guess the "issue" is compound pages that are not huge
>> pages or transparent huge pages.
>
> THPs don't set the HWPoison bit on the head page.
>
> https://lore.kernel.org/linux-mm/[email protected]/

Oh, okay -- I was missing that we actually already set the HWPoison bit
before trying to split via TestSetPageHWPoison(). I thought for a second
that if splitting fails, we don't set any HWPoison bit.

--
Thanks,

David / dhildenb

2021-03-19 02:46:21

by yaoaili [么爱利]

[permalink] [raw]
Subject: [PATCH v3] mm/gup: check page posion status for coredump.

When we do coredump for user process signal, this may be an SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
resulted from ECC memory fail like SRAR or SRAO, we expect the memory
recovery work is finished correctly, then the get_dump_page() will not
return the error page as its process pte is set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid, for current code, we will return the poison page,
get it dumped, and then lead to system panic as its in kernel code.

So check the poison status in get_dump_page(), and if TRUE, return NULL.

There maybe other scenario that is also better to check the posion status
and not to panic, so make a wrapper for this check, Thanks to David's
suggestion(<[email protected]>).

Signed-off-by: Aili Yao <[email protected]>
---
mm/gup.c | 4 ++++
mm/internal.h | 21 +++++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..dcabe96 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+ if (ret == 1 && is_page_poisoned(page))
+ return NULL;
+
return (ret == 1) ? page : NULL;
}
#endif /* CONFIG_ELF_CORE */
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439..902d993 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,27 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

+/*
+ * When kernel touch the user page, the user page may be have been marked
+ * poison but still mapped in user space, if without this page, the kernel
+ * can guarantee the data integrity and operation success, the kernel is
+ * better to check the posion status and avoid touching it, be good not to
+ * panic, coredump for process fatal signal is a sample case matching this
+ * scenario. Or if kernel can't guarantee the data integrity, it's better
+ * not to call this function, let kernel touch the poison page and get to
+ * panic.
+ */
+static inline bool is_page_poisoned(struct page *page)
+{
+ if (page != NULL) {
+ if (PageHWPoison(page))
+ return true;
+ else if (PageHuge(page) && PageHWPoison(compound_head(page)))
+ return true;
+ }
+ return 0;
+}
+
extern unsigned long highest_memmap_pfn;

/*
--
1.8.3.1

2021-03-20 00:37:41

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3] mm/gup: check page posion status for coredump.

On Fri, Mar 19, 2021 at 10:44:37AM +0800, Aili Yao wrote:
> +++ b/mm/gup.c
> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> if (locked)
> mmap_read_unlock(mm);
> +
> + if (ret == 1 && is_page_poisoned(page))
> + return NULL;
> +
> return (ret == 1) ? page : NULL;
> }
> #endif /* CONFIG_ELF_CORE */
> diff --git a/mm/internal.h b/mm/internal.h
> index 25d2b2439..902d993 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -97,6 +97,27 @@ static inline void set_page_refcounted(struct page *page)
> set_page_count(page, 1);
> }
>
> +/*
> + * When kernel touch the user page, the user page may be have been marked
> + * poison but still mapped in user space, if without this page, the kernel
> + * can guarantee the data integrity and operation success, the kernel is
> + * better to check the posion status and avoid touching it, be good not to
> + * panic, coredump for process fatal signal is a sample case matching this
> + * scenario. Or if kernel can't guarantee the data integrity, it's better
> + * not to call this function, let kernel touch the poison page and get to
> + * panic.
> + */
> +static inline bool is_page_poisoned(struct page *page)
> +{
> + if (page != NULL) {

Why are you checking page for NULL here? How can it possibly be NULL?

> + if (PageHWPoison(page))
> + return true;
> + else if (PageHuge(page) && PageHWPoison(compound_head(page)))
> + return true;
> + }
> + return 0;
> +}
> +
> extern unsigned long highest_memmap_pfn;
>
> /*
> --
> 1.8.3.1
>
>

2021-03-22 03:42:25

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v3] mm/gup: check page posion status for coredump.

On Sat, 20 Mar 2021 00:35:16 +0000
Matthew Wilcox <[email protected]> wrote:

> On Fri, Mar 19, 2021 at 10:44:37AM +0800, Aili Yao wrote:
> > +++ b/mm/gup.c
> > @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> > FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> > if (locked)
> > mmap_read_unlock(mm);
> > +
> > + if (ret == 1 && is_page_poisoned(page))
> > + return NULL;
> > +
> > return (ret == 1) ? page : NULL;
> > }
> > #endif /* CONFIG_ELF_CORE */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 25d2b2439..902d993 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -97,6 +97,27 @@ static inline void set_page_refcounted(struct page *page)
> > set_page_count(page, 1);
> > }
> >
> > +/*
> > + * When kernel touch the user page, the user page may be have been marked
> > + * poison but still mapped in user space, if without this page, the kernel
> > + * can guarantee the data integrity and operation success, the kernel is
> > + * better to check the posion status and avoid touching it, be good not to
> > + * panic, coredump for process fatal signal is a sample case matching this
> > + * scenario. Or if kernel can't guarantee the data integrity, it's better
> > + * not to call this function, let kernel touch the poison page and get to
> > + * panic.
> > + */
> > +static inline bool is_page_poisoned(struct page *page)
> > +{
> > + if (page != NULL) {
>
> Why are you checking page for NULL here? How can it possibly be NULL?

For this get_dump_page() case, it can't be NULL, I thougt may other place
will call this function and may not guarantee this, But yes, kernel is a more
safer place and checking page NULL is not a common behavior.

Better to remove it, Thanks you very much for pointing this!

> > + if (PageHWPoison(page))
> > + return true;
> > + else if (PageHuge(page) && PageHWPoison(compound_head(page)))
> > + return true;
> > + }
> > + return 0;
> > +}
> > +
> > extern unsigned long highest_memmap_pfn;
> >
> > /*
> > --
> > 1.8.3.1
> >
> >
--
Thanks!
Aili Yao

2021-03-22 11:37:08

by yaoaili [么爱利]

[permalink] [raw]
Subject: [PATCH v5] mm/gup: check page hwposion status for coredump.

When we do coredump for user process signal, this may be one SIGBUS signal
with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
resulted from ECC memory fail like SRAR or SRAO, we expect the memory
recovery work is finished correctly, then the get_dump_page() will not
return the error page as its process pte is set invalid by
memory_failure().

But memory_failure() may fail, and the process's related pte may not be
correctly set invalid, for current code, we will return the poison page,
get it dumped, and then lead to system panic as its in kernel code.

So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.

There maybe other scenario that is also better to check hwposion status
and not to panic, so make a wrapper for this check, Thanks to David's
suggestion(<[email protected]>).

Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
Signed-off-by: Aili Yao <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Aili Yao <[email protected]>
Cc: [email protected]
Signed-off-by: Andrew Morton <[email protected]>
---
mm/gup.c | 4 ++++
mm/internal.h | 20 ++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e4c224c..6f7e1aa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
FOLL_FORCE | FOLL_DUMP | FOLL_GET);
if (locked)
mmap_read_unlock(mm);
+
+ if (ret == 1 && is_page_hwpoison(page))
+ return NULL;
+
return (ret == 1) ? page : NULL;
}
#endif /* CONFIG_ELF_CORE */
diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439..b751cef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,26 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

+/*
+ * When kernel touch the user page, the user page may be have been marked
+ * poison but still mapped in user space, if without this page, the kernel
+ * can guarantee the data integrity and operation success, the kernel is
+ * better to check the posion status and avoid touching it, be good not to
+ * panic, coredump for process fatal signal is a sample case matching this
+ * scenario. Or if kernel can't guarantee the data integrity, it's better
+ * not to call this function, let kernel touch the poison page and get to
+ * panic.
+ */
+static inline bool is_page_hwpoison(struct page *page)
+{
+ if (PageHWPoison(page))
+ return true;
+ else if (PageHuge(page) && PageHWPoison(compound_head(page)))
+ return true;
+
+ return false;
+}
+
extern unsigned long highest_memmap_pfn;

/*
--
1.8.3.1

2021-03-26 14:11:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On 22.03.21 12:33, Aili Yao wrote:
> When we do coredump for user process signal, this may be one SIGBUS signal
> with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> recovery work is finished correctly, then the get_dump_page() will not
> return the error page as its process pte is set invalid by
> memory_failure().
>
> But memory_failure() may fail, and the process's related pte may not be
> correctly set invalid, for current code, we will return the poison page,
> get it dumped, and then lead to system panic as its in kernel code.
>
> So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
>
> There maybe other scenario that is also better to check hwposion status
> and not to panic, so make a wrapper for this check, Thanks to David's
> suggestion(<[email protected]>).
>
> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> Signed-off-by: Aili Yao <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Kravetz <[email protected]>
> Cc: Aili Yao <[email protected]>
> Cc: [email protected]
> Signed-off-by: Andrew Morton <[email protected]>
> ---
> mm/gup.c | 4 ++++
> mm/internal.h | 20 ++++++++++++++++++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index e4c224c..6f7e1aa 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> if (locked)
> mmap_read_unlock(mm);

Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
when stumbling over a hwpoisoned page?

See __get_user_pages_locked()->__get_user_pages()->faultin_page():

handle_mm_fault()->vm_fault_to_errno(), which translates
VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)

?

> +
> + if (ret == 1 && is_page_hwpoison(page))
> + return NULL;
> +
> return (ret == 1) ? page : NULL;
> }
> #endif /* CONFIG_ELF_CORE */
> diff --git a/mm/internal.h b/mm/internal.h
> index 25d2b2439..b751cef 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -97,6 +97,26 @@ static inline void set_page_refcounted(struct page *page)
> set_page_count(page, 1);
> }
>
> +/*
> + * When kernel touch the user page, the user page may be have been marked
> + * poison but still mapped in user space, if without this page, the kernel
> + * can guarantee the data integrity and operation success, the kernel is
> + * better to check the posion status and avoid touching it, be good not to
> + * panic, coredump for process fatal signal is a sample case matching this
> + * scenario. Or if kernel can't guarantee the data integrity, it's better
> + * not to call this function, let kernel touch the poison page and get to
> + * panic.
> + */
> +static inline bool is_page_hwpoison(struct page *page)
> +{
> + if (PageHWPoison(page))
> + return true;
> + else if (PageHuge(page) && PageHWPoison(compound_head(page)))
> + return true;
> +
> + return false;
> +}
> +
> extern unsigned long highest_memmap_pfn;
>
> /*
>


--
Thanks,

David / dhildenb

2021-03-26 14:25:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On 26.03.21 15:09, David Hildenbrand wrote:
> On 22.03.21 12:33, Aili Yao wrote:
>> When we do coredump for user process signal, this may be one SIGBUS signal
>> with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
>> resulted from ECC memory fail like SRAR or SRAO, we expect the memory
>> recovery work is finished correctly, then the get_dump_page() will not
>> return the error page as its process pte is set invalid by
>> memory_failure().
>>
>> But memory_failure() may fail, and the process's related pte may not be
>> correctly set invalid, for current code, we will return the poison page,
>> get it dumped, and then lead to system panic as its in kernel code.
>>
>> So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
>>
>> There maybe other scenario that is also better to check hwposion status
>> and not to panic, so make a wrapper for this check, Thanks to David's
>> suggestion(<[email protected]>).
>>
>> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
>> Signed-off-by: Aili Yao <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Naoya Horiguchi <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Mike Kravetz <[email protected]>
>> Cc: Aili Yao <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Andrew Morton <[email protected]>
>> ---
>> mm/gup.c | 4 ++++
>> mm/internal.h | 20 ++++++++++++++++++++
>> 2 files changed, 24 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index e4c224c..6f7e1aa 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
>> FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>> if (locked)
>> mmap_read_unlock(mm);
>
> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> when stumbling over a hwpoisoned page?
>
> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
>
> handle_mm_fault()->vm_fault_to_errno(), which translates
> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
>
> ?

Or doesn't that happen as you describe "But memory_failure() may fail,
and the process's related pte may not be correctly set invalid" -- but
why does that happen?

On a similar thought, should get_user_pages() never return a page that
has HWPoison set? E.g., check also for existing PTEs if the page is
hwpoisoned?

@Naoya, Oscar

--
Thanks,

David / dhildenb

Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> On 26.03.21 15:09, David Hildenbrand wrote:
> > On 22.03.21 12:33, Aili Yao wrote:
> > > When we do coredump for user process signal, this may be one SIGBUS signal
> > > with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> > > resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> > > recovery work is finished correctly, then the get_dump_page() will not
> > > return the error page as its process pte is set invalid by
> > > memory_failure().
> > >
> > > But memory_failure() may fail, and the process's related pte may not be
> > > correctly set invalid, for current code, we will return the poison page,
> > > get it dumped, and then lead to system panic as its in kernel code.
> > >
> > > So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
> > >
> > > There maybe other scenario that is also better to check hwposion status
> > > and not to panic, so make a wrapper for this check, Thanks to David's
> > > suggestion(<[email protected]>).
> > >
> > > Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> > > Signed-off-by: Aili Yao <[email protected]>
> > > Cc: David Hildenbrand <[email protected]>
> > > Cc: Matthew Wilcox <[email protected]>
> > > Cc: Naoya Horiguchi <[email protected]>
> > > Cc: Oscar Salvador <[email protected]>
> > > Cc: Mike Kravetz <[email protected]>
> > > Cc: Aili Yao <[email protected]>
> > > Cc: [email protected]
> > > Signed-off-by: Andrew Morton <[email protected]>
> > > ---
> > > mm/gup.c | 4 ++++
> > > mm/internal.h | 20 ++++++++++++++++++++
> > > 2 files changed, 24 insertions(+)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index e4c224c..6f7e1aa 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> > > FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> > > if (locked)
> > > mmap_read_unlock(mm);
> >
> > Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> > when stumbling over a hwpoisoned page?
> >
> > See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> >
> > handle_mm_fault()->vm_fault_to_errno(), which translates
> > VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> >
> > ?

We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).

If we failed to unmap, the page table is not converted to hwpoison entry,
so __get_user_pages_locked() get the hwpoisoned page.

If we successfully unmapped but failed in truncate_error_page() for example,
the processes mapping the page would get -EFAULT as expected. But even in
this case, other processes could reach the error page via page cache and
__get_user_pages_locked() for them could return the hwpoisoned page.

>
> Or doesn't that happen as you describe "But memory_failure() may fail, and
> the process's related pte may not be correctly set invalid" -- but why does
> that happen?

Simply because memory_failure() doesn't handle some page types like ksm page
and zero page. Or maybe shmem thp also belongs to this class.

>
> On a similar thought, should get_user_pages() never return a page that has
> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?

Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
follow_huge_pmd() would work well.

Thanks,
Naoya Horiguchi

2021-03-31 02:45:28

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On Wed, 31 Mar 2021 01:52:59 +0000
HORIGUCHI NAOYA(堀口 直也) <[email protected]> wrote:

> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> > On 26.03.21 15:09, David Hildenbrand wrote:
> > > On 22.03.21 12:33, Aili Yao wrote:
> > > > When we do coredump for user process signal, this may be one SIGBUS signal
> > > > with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> > > > resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> > > > recovery work is finished correctly, then the get_dump_page() will not
> > > > return the error page as its process pte is set invalid by
> > > > memory_failure().
> > > >
> > > > But memory_failure() may fail, and the process's related pte may not be
> > > > correctly set invalid, for current code, we will return the poison page,
> > > > get it dumped, and then lead to system panic as its in kernel code.
> > > >
> > > > So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
> > > >
> > > > There maybe other scenario that is also better to check hwposion status
> > > > and not to panic, so make a wrapper for this check, Thanks to David's
> > > > suggestion(<[email protected]>).
> > > >
> > > > Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> > > > Signed-off-by: Aili Yao <[email protected]>
> > > > Cc: David Hildenbrand <[email protected]>
> > > > Cc: Matthew Wilcox <[email protected]>
> > > > Cc: Naoya Horiguchi <[email protected]>
> > > > Cc: Oscar Salvador <[email protected]>
> > > > Cc: Mike Kravetz <[email protected]>
> > > > Cc: Aili Yao <[email protected]>
> > > > Cc: [email protected]
> > > > Signed-off-by: Andrew Morton <[email protected]>
> > > > ---
> > > > mm/gup.c | 4 ++++
> > > > mm/internal.h | 20 ++++++++++++++++++++
> > > > 2 files changed, 24 insertions(+)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index e4c224c..6f7e1aa 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> > > > FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> > > > if (locked)
> > > > mmap_read_unlock(mm);
> > >
> > > Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> > > when stumbling over a hwpoisoned page?
> > >
> > > See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> > >
> > > handle_mm_fault()->vm_fault_to_errno(), which translates
> > > VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> > >
> > > ?
>
> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
>
> If we failed to unmap, the page table is not converted to hwpoison entry,
> so __get_user_pages_locked() get the hwpoisoned page.
>
> If we successfully unmapped but failed in truncate_error_page() for example,
> the processes mapping the page would get -EFAULT as expected. But even in
> this case, other processes could reach the error page via page cache and
> __get_user_pages_locked() for them could return the hwpoisoned page.
>
> >
> > Or doesn't that happen as you describe "But memory_failure() may fail, and
> > the process's related pte may not be correctly set invalid" -- but why does
> > that happen?
>
> Simply because memory_failure() doesn't handle some page types like ksm page
> and zero page. Or maybe shmem thp also belongs to this class.
>
> >
> > On a similar thought, should get_user_pages() never return a page that has
> > HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?
>
> Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
> follow_huge_pmd() would work well.

I think we should take more care to broadcast the hwpoison check to other cases,
SIGBUS coredump is such a case that it is supposed to not touch the poison page,
and if we return NULL for this, the coredump process will get a successful finish.

Other cases may also meet the requirements like coredump, but we need to identify it,
that's the poison check wrapper's purpose. If not, we may break the integrity of the
related action, which may be no better than panic.

--
Thanks!
Aili Yao

Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:
> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口 直也) <[email protected]> wrote:
> > On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> > > On 26.03.21 15:09, David Hildenbrand wrote:
> > > > On 22.03.21 12:33, Aili Yao wrote:
> > > > > When we do coredump for user process signal, this may be one SIGBUS signal
> > > > > with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> > > > > resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> > > > > recovery work is finished correctly, then the get_dump_page() will not
> > > > > return the error page as its process pte is set invalid by
> > > > > memory_failure().
> > > > >
> > > > > But memory_failure() may fail, and the process's related pte may not be
> > > > > correctly set invalid, for current code, we will return the poison page,
> > > > > get it dumped, and then lead to system panic as its in kernel code.
> > > > >
> > > > > So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
> > > > >
> > > > > There maybe other scenario that is also better to check hwposion status
> > > > > and not to panic, so make a wrapper for this check, Thanks to David's
> > > > > suggestion(<[email protected]>).
> > > > >
> > > > > Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> > > > > Signed-off-by: Aili Yao <[email protected]>
> > > > > Cc: David Hildenbrand <[email protected]>
> > > > > Cc: Matthew Wilcox <[email protected]>
> > > > > Cc: Naoya Horiguchi <[email protected]>
> > > > > Cc: Oscar Salvador <[email protected]>
> > > > > Cc: Mike Kravetz <[email protected]>
> > > > > Cc: Aili Yao <[email protected]>
> > > > > Cc: [email protected]
> > > > > Signed-off-by: Andrew Morton <[email protected]>
> > > > > ---
> > > > > mm/gup.c | 4 ++++
> > > > > mm/internal.h | 20 ++++++++++++++++++++
> > > > > 2 files changed, 24 insertions(+)
> > > > >
> > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > index e4c224c..6f7e1aa 100644
> > > > > --- a/mm/gup.c
> > > > > +++ b/mm/gup.c
> > > > > @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> > > > > FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> > > > > if (locked)
> > > > > mmap_read_unlock(mm);
> > > >
> > > > Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> > > > when stumbling over a hwpoisoned page?
> > > >
> > > > See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> > > >
> > > > handle_mm_fault()->vm_fault_to_errno(), which translates
> > > > VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> > > >
> > > > ?
> >
> > We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
> >
> > If we failed to unmap, the page table is not converted to hwpoison entry,
> > so __get_user_pages_locked() get the hwpoisoned page.
> >
> > If we successfully unmapped but failed in truncate_error_page() for example,
> > the processes mapping the page would get -EFAULT as expected. But even in
> > this case, other processes could reach the error page via page cache and
> > __get_user_pages_locked() for them could return the hwpoisoned page.
> >
> > >
> > > Or doesn't that happen as you describe "But memory_failure() may fail, and
> > > the process's related pte may not be correctly set invalid" -- but why does
> > > that happen?
> >
> > Simply because memory_failure() doesn't handle some page types like ksm page
> > and zero page. Or maybe shmem thp also belongs to this class.
> >
> > >
> > > On a similar thought, should get_user_pages() never return a page that has
> > > HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?
> >
> > Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
> > follow_huge_pmd() would work well.
>
> I think we should take more care to broadcast the hwpoison check to other cases,
> SIGBUS coredump is such a case that it is supposed to not touch the poison page,
> and if we return NULL for this, the coredump process will get a successful finish.
>
> Other cases may also meet the requirements like coredump, but we need to identify it,
> that's the poison check wrapper's purpose. If not, we may break the integrity of the
> related action, which may be no better than panic.

If you worry about regression and would like to make this new behavior conditional,
we could use FOLL_HWPOISON to specify that the caller is hwpoison-aware so that
any !FOLL_HWPOISON caller ignores the hwpoison check and works as it does now.
This approach looks to me helpful because it would encourage developers touching
gup code to pay attention to FOLL_HWPOISON code.

Thanks,
Naoya Horiguchi

2021-03-31 06:10:29

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On Wed, Mar 31, 2021 at 01:52:59AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> If we successfully unmapped but failed in truncate_error_page() for example,
> the processes mapping the page would get -EFAULT as expected. But even in
> this case, other processes could reach the error page via page cache and
> __get_user_pages_locked() for them could return the hwpoisoned page.

How would that happen? We check PageHWPoison before inserting a page
into the page tables. See, eg, filemap_map_pages() and __do_fault().

2021-03-31 06:48:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On 31.03.21 06:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:
>> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口 直也) <[email protected]> wrote:
>>> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
>>>> On 26.03.21 15:09, David Hildenbrand wrote:
>>>>> On 22.03.21 12:33, Aili Yao wrote:
>>>>>> When we do coredump for user process signal, this may be one SIGBUS signal
>>>>>> with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
>>>>>> resulted from ECC memory fail like SRAR or SRAO, we expect the memory
>>>>>> recovery work is finished correctly, then the get_dump_page() will not
>>>>>> return the error page as its process pte is set invalid by
>>>>>> memory_failure().
>>>>>>
>>>>>> But memory_failure() may fail, and the process's related pte may not be
>>>>>> correctly set invalid, for current code, we will return the poison page,
>>>>>> get it dumped, and then lead to system panic as its in kernel code.
>>>>>>
>>>>>> So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
>>>>>>
>>>>>> There maybe other scenario that is also better to check hwposion status
>>>>>> and not to panic, so make a wrapper for this check, Thanks to David's
>>>>>> suggestion(<[email protected]>).
>>>>>>
>>>>>> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
>>>>>> Signed-off-by: Aili Yao <[email protected]>
>>>>>> Cc: David Hildenbrand <[email protected]>
>>>>>> Cc: Matthew Wilcox <[email protected]>
>>>>>> Cc: Naoya Horiguchi <[email protected]>
>>>>>> Cc: Oscar Salvador <[email protected]>
>>>>>> Cc: Mike Kravetz <[email protected]>
>>>>>> Cc: Aili Yao <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Signed-off-by: Andrew Morton <[email protected]>
>>>>>> ---
>>>>>> mm/gup.c | 4 ++++
>>>>>> mm/internal.h | 20 ++++++++++++++++++++
>>>>>> 2 files changed, 24 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>> index e4c224c..6f7e1aa 100644
>>>>>> --- a/mm/gup.c
>>>>>> +++ b/mm/gup.c
>>>>>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
>>>>>> FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>>>>>> if (locked)
>>>>>> mmap_read_unlock(mm);
>>>>>
>>>>> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
>>>>> when stumbling over a hwpoisoned page?
>>>>>
>>>>> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
>>>>>
>>>>> handle_mm_fault()->vm_fault_to_errno(), which translates
>>>>> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
>>>>>
>>>>> ?
>>>
>>> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
>>>
>>> If we failed to unmap, the page table is not converted to hwpoison entry,
>>> so __get_user_pages_locked() get the hwpoisoned page.
>>>
>>> If we successfully unmapped but failed in truncate_error_page() for example,
>>> the processes mapping the page would get -EFAULT as expected. But even in
>>> this case, other processes could reach the error page via page cache and
>>> __get_user_pages_locked() for them could return the hwpoisoned page.
>>>
>>>>
>>>> Or doesn't that happen as you describe "But memory_failure() may fail, and
>>>> the process's related pte may not be correctly set invalid" -- but why does
>>>> that happen?
>>>
>>> Simply because memory_failure() doesn't handle some page types like ksm page
>>> and zero page. Or maybe shmem thp also belongs to this class.

Thanks for that info!

>>>
>>>>
>>>> On a similar thought, should get_user_pages() never return a page that has
>>>> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?
>>>
>>> Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
>>> follow_huge_pmd() would work well.
>>
>> I think we should take more care to broadcast the hwpoison check to other cases,
>> SIGBUS coredump is such a case that it is supposed to not touch the poison page,
>> and if we return NULL for this, the coredump process will get a successful finish.
>>
>> Other cases may also meet the requirements like coredump, but we need to identify it,
>> that's the poison check wrapper's purpose. If not, we may break the integrity of the
>> related action, which may be no better than panic.
>
> If you worry about regression and would like to make this new behavior conditional,
> we could use FOLL_HWPOISON to specify that the caller is hwpoison-aware so that
> any !FOLL_HWPOISON caller ignores the hwpoison check and works as it does now.
> This approach looks to me helpful because it would encourage developers touching
> gup code to pay attention to FOLL_HWPOISON code.

FOLL_HWPOISON might be the right start, indeed.

--
Thanks,

David / dhildenb

Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On Wed, Mar 31, 2021 at 07:07:39AM +0100, Matthew Wilcox wrote:
> On Wed, Mar 31, 2021 at 01:52:59AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > If we successfully unmapped but failed in truncate_error_page() for example,
> > the processes mapping the page would get -EFAULT as expected. But even in
> > this case, other processes could reach the error page via page cache and
> > __get_user_pages_locked() for them could return the hwpoisoned page.
>
> How would that happen? We check PageHWPoison before inserting a page
> into the page tables. See, eg, filemap_map_pages() and __do_fault().

Ah, you're right, that never happens. I misread the code.
Thanks for correcting me.

2021-03-31 07:09:17

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On 31.03.21 08:53, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Mar 31, 2021 at 07:07:39AM +0100, Matthew Wilcox wrote:
>> On Wed, Mar 31, 2021 at 01:52:59AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> If we successfully unmapped but failed in truncate_error_page() for example,
>>> the processes mapping the page would get -EFAULT as expected. But even in
>>> this case, other processes could reach the error page via page cache and
>>> __get_user_pages_locked() for them could return the hwpoisoned page.
>>
>> How would that happen? We check PageHWPoison before inserting a page
>> into the page tables. See, eg, filemap_map_pages() and __do_fault().
>
> Ah, you're right, that never happens. I misread the code.
> Thanks for correcting me.
>

I'm wondering if there is a small race window, if we poison a page while
inserting it.

--
Thanks,

David / dhildenb

2021-03-31 07:09:35

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On Wed, 31 Mar 2021 08:44:53 +0200
David Hildenbrand <[email protected]> wrote:

> On 31.03.21 06:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:
> >> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口 直也) <[email protected]> wrote:
> >>> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> >>>> On 26.03.21 15:09, David Hildenbrand wrote:
> >>>>> On 22.03.21 12:33, Aili Yao wrote:
> >>>>>> When we do coredump for user process signal, this may be one SIGBUS signal
> >>>>>> with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> >>>>>> resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> >>>>>> recovery work is finished correctly, then the get_dump_page() will not
> >>>>>> return the error page as its process pte is set invalid by
> >>>>>> memory_failure().
> >>>>>>
> >>>>>> But memory_failure() may fail, and the process's related pte may not be
> >>>>>> correctly set invalid, for current code, we will return the poison page,
> >>>>>> get it dumped, and then lead to system panic as its in kernel code.
> >>>>>>
> >>>>>> So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
> >>>>>>
> >>>>>> There maybe other scenario that is also better to check hwposion status
> >>>>>> and not to panic, so make a wrapper for this check, Thanks to David's
> >>>>>> suggestion(<[email protected]>).
> >>>>>>
> >>>>>> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> >>>>>> Signed-off-by: Aili Yao <[email protected]>
> >>>>>> Cc: David Hildenbrand <[email protected]>
> >>>>>> Cc: Matthew Wilcox <[email protected]>
> >>>>>> Cc: Naoya Horiguchi <[email protected]>
> >>>>>> Cc: Oscar Salvador <[email protected]>
> >>>>>> Cc: Mike Kravetz <[email protected]>
> >>>>>> Cc: Aili Yao <[email protected]>
> >>>>>> Cc: [email protected]
> >>>>>> Signed-off-by: Andrew Morton <[email protected]>
> >>>>>> ---
> >>>>>> mm/gup.c | 4 ++++
> >>>>>> mm/internal.h | 20 ++++++++++++++++++++
> >>>>>> 2 files changed, 24 insertions(+)
> >>>>>>
> >>>>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>>>> index e4c224c..6f7e1aa 100644
> >>>>>> --- a/mm/gup.c
> >>>>>> +++ b/mm/gup.c
> >>>>>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> >>>>>> FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> >>>>>> if (locked)
> >>>>>> mmap_read_unlock(mm);
> >>>>>
> >>>>> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> >>>>> when stumbling over a hwpoisoned page?
> >>>>>
> >>>>> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> >>>>>
> >>>>> handle_mm_fault()->vm_fault_to_errno(), which translates
> >>>>> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> >>>>>
> >>>>> ?
> >>>
> >>> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
> >>>
> >>> If we failed to unmap, the page table is not converted to hwpoison entry,
> >>> so __get_user_pages_locked() get the hwpoisoned page.
> >>>
> >>> If we successfully unmapped but failed in truncate_error_page() for example,
> >>> the processes mapping the page would get -EFAULT as expected. But even in
> >>> this case, other processes could reach the error page via page cache and
> >>> __get_user_pages_locked() for them could return the hwpoisoned page.
> >>>
> >>>>
> >>>> Or doesn't that happen as you describe "But memory_failure() may fail, and
> >>>> the process's related pte may not be correctly set invalid" -- but why does
> >>>> that happen?
> >>>
> >>> Simply because memory_failure() doesn't handle some page types like ksm page
> >>> and zero page. Or maybe shmem thp also belongs to this class.
>
> Thanks for that info!
>
> >>>
> >>>>
> >>>> On a similar thought, should get_user_pages() never return a page that has
> >>>> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?
> >>>
> >>> Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
> >>> follow_huge_pmd() would work well.
> >>
> >> I think we should take more care to broadcast the hwpoison check to other cases,
> >> SIGBUS coredump is such a case that it is supposed to not touch the poison page,
> >> and if we return NULL for this, the coredump process will get a successful finish.
> >>
> >> Other cases may also meet the requirements like coredump, but we need to identify it,
> >> that's the poison check wrapper's purpose. If not, we may break the integrity of the
> >> related action, which may be no better than panic.
> >
> > If you worry about regression and would like to make this new behavior conditional,
> > we could use FOLL_HWPOISON to specify that the caller is hwpoison-aware so that
> > any !FOLL_HWPOISON caller ignores the hwpoison check and works as it does now.
> > This approach looks to me helpful because it would encourage developers touching
> > gup code to pay attention to FOLL_HWPOISON code.
>
> FOLL_HWPOISON might be the right start, indeed.
>

Got this, Thanks!
I will dig more!

--
Thanks!
Aili Yao

2021-04-01 02:33:08

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v5] mm/gup: check page hwposion status for coredump.

On Wed, 31 Mar 2021 08:44:53 +0200
David Hildenbrand <[email protected]> wrote:

> On 31.03.21 06:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Mar 31, 2021 at 10:43:36AM +0800, Aili Yao wrote:
> >> On Wed, 31 Mar 2021 01:52:59 +0000 HORIGUCHI NAOYA(堀口 直也) <[email protected]> wrote:
> >>> On Fri, Mar 26, 2021 at 03:22:49PM +0100, David Hildenbrand wrote:
> >>>> On 26.03.21 15:09, David Hildenbrand wrote:
> >>>>> On 22.03.21 12:33, Aili Yao wrote:
> >>>>>> When we do coredump for user process signal, this may be one SIGBUS signal
> >>>>>> with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal is
> >>>>>> resulted from ECC memory fail like SRAR or SRAO, we expect the memory
> >>>>>> recovery work is finished correctly, then the get_dump_page() will not
> >>>>>> return the error page as its process pte is set invalid by
> >>>>>> memory_failure().
> >>>>>>
> >>>>>> But memory_failure() may fail, and the process's related pte may not be
> >>>>>> correctly set invalid, for current code, we will return the poison page,
> >>>>>> get it dumped, and then lead to system panic as its in kernel code.
> >>>>>>
> >>>>>> So check the hwpoison status in get_dump_page(), and if TRUE, return NULL.
> >>>>>>
> >>>>>> There maybe other scenario that is also better to check hwposion status
> >>>>>> and not to panic, so make a wrapper for this check, Thanks to David's
> >>>>>> suggestion(<[email protected]>).
> >>>>>>
> >>>>>> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
> >>>>>> Signed-off-by: Aili Yao <[email protected]>
> >>>>>> Cc: David Hildenbrand <[email protected]>
> >>>>>> Cc: Matthew Wilcox <[email protected]>
> >>>>>> Cc: Naoya Horiguchi <[email protected]>
> >>>>>> Cc: Oscar Salvador <[email protected]>
> >>>>>> Cc: Mike Kravetz <[email protected]>
> >>>>>> Cc: Aili Yao <[email protected]>
> >>>>>> Cc: [email protected]
> >>>>>> Signed-off-by: Andrew Morton <[email protected]>
> >>>>>> ---
> >>>>>> mm/gup.c | 4 ++++
> >>>>>> mm/internal.h | 20 ++++++++++++++++++++
> >>>>>> 2 files changed, 24 insertions(+)
> >>>>>>
> >>>>>> diff --git a/mm/gup.c b/mm/gup.c
> >>>>>> index e4c224c..6f7e1aa 100644
> >>>>>> --- a/mm/gup.c
> >>>>>> +++ b/mm/gup.c
> >>>>>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
> >>>>>> FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> >>>>>> if (locked)
> >>>>>> mmap_read_unlock(mm);
> >>>>>
> >>>>> Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
> >>>>> when stumbling over a hwpoisoned page?
> >>>>>
> >>>>> See __get_user_pages_locked()->__get_user_pages()->faultin_page():
> >>>>>
> >>>>> handle_mm_fault()->vm_fault_to_errno(), which translates
> >>>>> VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set (-> -EHWPOISON)
> >>>>>
> >>>>> ?
> >>>
> >>> We could get -EFAULT, but sometimes not (depends on how memory_failure() fails).
> >>>
> >>> If we failed to unmap, the page table is not converted to hwpoison entry,
> >>> so __get_user_pages_locked() get the hwpoisoned page.
> >>>
> >>> If we successfully unmapped but failed in truncate_error_page() for example,
> >>> the processes mapping the page would get -EFAULT as expected. But even in
> >>> this case, other processes could reach the error page via page cache and
> >>> __get_user_pages_locked() for them could return the hwpoisoned page.
> >>>
> >>>>
> >>>> Or doesn't that happen as you describe "But memory_failure() may fail, and
> >>>> the process's related pte may not be correctly set invalid" -- but why does
> >>>> that happen?
> >>>
> >>> Simply because memory_failure() doesn't handle some page types like ksm page
> >>> and zero page. Or maybe shmem thp also belongs to this class.
>
> Thanks for that info!
>
> >>>
> >>>>
> >>>> On a similar thought, should get_user_pages() never return a page that has
> >>>> HWPoison set? E.g., check also for existing PTEs if the page is hwpoisoned?
> >>>
> >>> Make sense to me. Maybe inserting hwpoison check into follow_page_pte() and
> >>> follow_huge_pmd() would work well.
> >>
> >> I think we should take more care to broadcast the hwpoison check to other cases,
> >> SIGBUS coredump is such a case that it is supposed to not touch the poison page,
> >> and if we return NULL for this, the coredump process will get a successful finish.
> >>
> >> Other cases may also meet the requirements like coredump, but we need to identify it,
> >> that's the poison check wrapper's purpose. If not, we may break the integrity of the
> >> related action, which may be no better than panic.

I think I have wrong logic here, before this patch, the code has already returned error for
pages which the user pte has been set invalid because of hwpoison. And this patch is adding another
missing scenario for the same purpose. Without this patch, the code may still fail in gup.c for
hwpoison case, I think that's OK as it's already there. Then the same rule will apply to this missing
case, I think I am wrong, David,Naoya, you are right!

Thanks!

> > If you worry about regression and would like to make this new behavior conditional,
> > we could use FOLL_HWPOISON to specify that the caller is hwpoison-aware so that
> > any !FOLL_HWPOISON caller ignores the hwpoison check and works as it does now.
> > This approach looks to me helpful because it would encourage developers touching
> > gup code to pay attention to FOLL_HWPOISON code.
>
> FOLL_HWPOISON might be the right start, indeed.
>
I think we may still need this flag to return different error code for this case.
I will change the patch accordingly!

--
Thanks!
Aili Yao

2021-04-06 12:49:43

by yaoaili [么爱利]

[permalink] [raw]
Subject: [PATCH v6] mm/gup: check page hwpoison status for memory recovery failures.

When we call get_user_pages() to pin user page in memory, there may be
hwpoison page, currently, we just handle the normal case that memory
recovery jod is correctly finished, and we will not return the hwpoison
page to callers, but for other cases like memory recovery fails and the
user process related pte is not correctly set invalid, we will still
return the hwpoison page, and may touch it and lead to panic.

In gup.c, for normal page, after we call follow_page_mask(), we will
return the related page pointer; or like another hwpoison case with pte
invalid, it will return NULL. For NULL, we will handle it in if (!page)
branch. In this patch, we will filter out the hwpoison page in
follow_page_mask() and return error code for recovery failure cases.

We will check the page hwpoison status as soon as possible and avoid doing
followed normal procedure and try not to grab related pages.

Signed-off-by: Aili Yao <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
---
mm/gup.c | 27 +++++++++++++++++++++++----
mm/huge_memory.c | 9 +++++++--
mm/hugetlb.c | 8 +++++++-
mm/internal.h | 13 +++++++++++++
4 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..88a93b89c03e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -433,6 +433,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(ret);
goto out;
}
+ } else if (PageHWPoison(page)) {
+ page = ERR_PTR(-EHWPOISON);
+ goto out;
}

if (flags & FOLL_SPLIT && PageTransCompound(page)) {
@@ -540,8 +543,13 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
page = follow_huge_pd(vma, address,
__hugepd(pmd_val(pmdval)), flags,
PMD_SHIFT);
- if (page)
- return page;
+ if (page) {
+ struct page *p = check_page_hwpoison(page);
+
+ if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+ put_page(page);
+ return p;
+ }
return no_page_table(vma, flags);
}
retry:
@@ -643,7 +651,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) {
page = follow_huge_pud(mm, address, pud, flags);
if (page)
- return page;
+ return check_page_hwpoison(page);
return no_page_table(vma, flags);
}
if (is_hugepd(__hugepd(pud_val(*pud)))) {
@@ -652,6 +660,13 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
PUD_SHIFT);
if (page)
return page;
+ if (page) {
+ struct page *p = check_page_hwpoison(page);
+
+ if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+ put_page(page);
+ return p;
+ }
return no_page_table(vma, flags);
}
if (pud_devmap(*pud)) {
@@ -1087,10 +1102,14 @@ static long __get_user_pages(struct mm_struct *mm,
* struct page.
*/
goto next_page;
- } else if (IS_ERR(page)) {
+ } else if (PTR_ERR(page) == -EHWPOISON) {
+ ret = (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+ goto out;
+ } else if (IS_ERR(page)) {
ret = PTR_ERR(page);
goto out;
}
+
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae907a9c2050..3973d988e485 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1349,6 +1349,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
+ struct page *tail = NULL;

assert_spin_locked(pmd_lockptr(mm, pmd));

@@ -1366,6 +1367,11 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

+ tail = page + ((addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT);
+
+ if (PageHWPoison(tail))
+ return ERR_PTR(-EHWPOISON);
+
if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);

@@ -1405,11 +1411,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unlock_page(page);
}
skip_mlock:
- page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);

out:
- return page;
+ return tail;
}

/* NUMA hinting page fault entry point for trans huge pmds */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a86a58ef132d..8b50f7eaa159 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4958,7 +4958,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
likely(pages) ? pages + i : NULL,
vmas ? vmas + i : NULL);

- if (pages) {
+ /* As we will filter out the hwpoison page, so don't try grab it */
+ if (pages && !PageHWPoison(page)) {
/*
* try_grab_compound_head() should always succeed here,
* because: a) we hold the ptl lock, and b) we've just
@@ -5581,6 +5582,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pte = huge_ptep_get((pte_t *)pmd);
if (pte_present(pte)) {
page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ /* if hwpoison, we don't grab it */
+ if (PageHWPoison(compound_head(page))) {
+ page = ERR_PTR(-EHWPOISON);
+ goto out;
+ }
/*
* try_grab_page() should always succeed here, because: a) we
* hold the pmd (ptl) lock, and b) we've just checked that the
diff --git a/mm/internal.h b/mm/internal.h
index 1432feec62df..049b310bc79a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,19 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

+/*
+ * Check the hwposion status of any page type, and if TRUE, return ERR ptr.
+ */
+static inline struct page *check_page_hwpoison(struct page *page)
+{
+ if (PageHWPoison(page))
+ return ERR_PTR(-EHWPOISON);
+ else if (PageHuge(page) && PageHWPoison(compound_head(page)))
+ return ERR_PTR(-EHWPOISON);
+
+ return page;
+}
+
extern unsigned long highest_memmap_pfn;

/*
--
2.29.3

2021-04-06 14:55:21

by yaoaili [么爱利]

[permalink] [raw]
Subject: [PATCH v7] mm/gup: check page hwpoison status for memory recovery failures.

When we call get_user_pages() to pin user page in memory, there may be
hwpoison page, currently, we just handle the normal case that memory
recovery jod is correctly finished, and we will not return the hwpoison
page to callers, but for other cases like memory recovery fails and the
user process related pte is not correctly set invalid, we will still
return the hwpoison page, and may touch it and lead to panic.

In gup.c, for normal page, after we call follow_page_mask(), we will
return the related page pointer; or like another hwpoison case with pte
invalid, it will return NULL. For NULL, we will handle it in if (!page)
branch. In this patch, we will filter out the hwpoison page in
follow_page_mask() and return error code for recovery failure cases.

We will check the page hwpoison status as soon as possible and avoid doing
followed normal procedure and try not to grab related pages.

Changes since v6:
- Fix wrong page pointer check in follow_trans_huge_pmd();

Signed-off-by: Aili Yao <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
---
mm/gup.c | 27 +++++++++++++++++++++++----
mm/huge_memory.c | 11 ++++++++---
mm/hugetlb.c | 8 +++++++-
mm/internal.h | 13 +++++++++++++
4 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..88a93b89c03e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -433,6 +433,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(ret);
goto out;
}
+ } else if (PageHWPoison(page)) {
+ page = ERR_PTR(-EHWPOISON);
+ goto out;
}

if (flags & FOLL_SPLIT && PageTransCompound(page)) {
@@ -540,8 +543,13 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
page = follow_huge_pd(vma, address,
__hugepd(pmd_val(pmdval)), flags,
PMD_SHIFT);
- if (page)
- return page;
+ if (page) {
+ struct page *p = check_page_hwpoison(page);
+
+ if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+ put_page(page);
+ return p;
+ }
return no_page_table(vma, flags);
}
retry:
@@ -643,7 +651,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) {
page = follow_huge_pud(mm, address, pud, flags);
if (page)
- return page;
+ return check_page_hwpoison(page);
return no_page_table(vma, flags);
}
if (is_hugepd(__hugepd(pud_val(*pud)))) {
@@ -652,6 +660,13 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
PUD_SHIFT);
if (page)
return page;
+ if (page) {
+ struct page *p = check_page_hwpoison(page);
+
+ if (p == ERR_PTR(-EHWPOISON) && flags & FOLL_GET)
+ put_page(page);
+ return p;
+ }
return no_page_table(vma, flags);
}
if (pud_devmap(*pud)) {
@@ -1087,10 +1102,14 @@ static long __get_user_pages(struct mm_struct *mm,
* struct page.
*/
goto next_page;
- } else if (IS_ERR(page)) {
+ } else if (PTR_ERR(page) == -EHWPOISON) {
+ ret = (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+ goto out;
+ } else if (IS_ERR(page)) {
ret = PTR_ERR(page);
goto out;
}
+
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae907a9c2050..56ff2e83b67c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1349,6 +1349,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
+ struct page *tail = NULL;

assert_spin_locked(pmd_lockptr(mm, pmd));

@@ -1366,6 +1367,11 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

+ tail = page + ((addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT);
+
+ if (PageHWPoison(tail))
+ return ERR_PTR(-EHWPOISON);
+
if (!try_grab_page(page, flags))
return ERR_PTR(-ENOMEM);

@@ -1405,11 +1411,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unlock_page(page);
}
skip_mlock:
- page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
- VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+ VM_BUG_ON_PAGE(!PageCompound(tail) && !is_zone_device_page(tail), tail);

out:
- return page;
+ return tail;
}

/* NUMA hinting page fault entry point for trans huge pmds */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a86a58ef132d..8b50f7eaa159 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4958,7 +4958,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
likely(pages) ? pages + i : NULL,
vmas ? vmas + i : NULL);

- if (pages) {
+ /* As we will filter out the hwpoison page, so don't try grab it */
+ if (pages && !PageHWPoison(page)) {
/*
* try_grab_compound_head() should always succeed here,
* because: a) we hold the ptl lock, and b) we've just
@@ -5581,6 +5582,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pte = huge_ptep_get((pte_t *)pmd);
if (pte_present(pte)) {
page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ /* if hwpoison, we don't grab it */
+ if (PageHWPoison(compound_head(page))) {
+ page = ERR_PTR(-EHWPOISON);
+ goto out;
+ }
/*
* try_grab_page() should always succeed here, because: a) we
* hold the pmd (ptl) lock, and b) we've just checked that the
diff --git a/mm/internal.h b/mm/internal.h
index 1432feec62df..049b310bc79a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -97,6 +97,19 @@ static inline void set_page_refcounted(struct page *page)
set_page_count(page, 1);
}

+/*
+ * Check the hwposion status of any page type, and if TRUE, return ERR ptr.
+ */
+static inline struct page *check_page_hwpoison(struct page *page)
+{
+ if (PageHWPoison(page))
+ return ERR_PTR(-EHWPOISON);
+ else if (PageHuge(page) && PageHWPoison(compound_head(page)))
+ return ERR_PTR(-EHWPOISON);
+
+ return page;
+}
+
extern unsigned long highest_memmap_pfn;

/*
--
2.29.3

Subject: Re: [PATCH v7] mm/gup: check page hwpoison status for memory recovery failures.

On Tue, Apr 06, 2021 at 10:41:23AM +0800, Aili Yao wrote:
> When we call get_user_pages() to pin user page in memory, there may be
> hwpoison page, currently, we just handle the normal case that memory
> recovery jod is correctly finished, and we will not return the hwpoison
> page to callers, but for other cases like memory recovery fails and the
> user process related pte is not correctly set invalid, we will still
> return the hwpoison page, and may touch it and lead to panic.
>
> In gup.c, for normal page, after we call follow_page_mask(), we will
> return the related page pointer; or like another hwpoison case with pte
> invalid, it will return NULL. For NULL, we will handle it in if (!page)
> branch. In this patch, we will filter out the hwpoison page in
> follow_page_mask() and return error code for recovery failure cases.
>
> We will check the page hwpoison status as soon as possible and avoid doing
> followed normal procedure and try not to grab related pages.
>
> Changes since v6:
> - Fix wrong page pointer check in follow_trans_huge_pmd();
>
> Signed-off-by: Aili Yao <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Mike Kravetz <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> ---
> mm/gup.c | 27 +++++++++++++++++++++++----
> mm/huge_memory.c | 11 ++++++++---
> mm/hugetlb.c | 8 +++++++-
> mm/internal.h | 13 +++++++++++++
> 4 files changed, 51 insertions(+), 8 deletions(-)

Thank you for the work.

Looking through this patch, the internal of follow_page_mask() is
very complicated so it's not easy to make this hwpoison-aware.
Now I'm getting unsure to judge that this is the best approach.
What actually I imagined might be like below (which is totally
untested, and I'm sorry about my previous misleading comments):

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..a60a08fc7668 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1090,6 +1090,11 @@ static long __get_user_pages(struct mm_struct *mm,
} else if (IS_ERR(page)) {
ret = PTR_ERR(page);
goto out;
+ } else if (gup_flags & FOLL_HWPOISON && PageHWPoison(page)) {
+ if (gup_flags & FOLL_GET)
+ put_page(page);
+ ret = -EHWPOISON;
+ goto out;
}
if (pages) {
pages[i] = page;
@@ -1532,7 +1537,7 @@ struct page *get_dump_page(unsigned long addr)
if (mmap_read_lock_killable(mm))
return NULL;
ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
- FOLL_FORCE | FOLL_DUMP | FOLL_GET);
+ FOLL_FORCE | FOLL_DUMP | FOLL_GET | FOLL_HWPOISON);
if (locked)
mmap_read_unlock(mm);
return (ret == 1) ? page : NULL;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a86a58ef132d..03c3d3225c0d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4949,6 +4949,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
continue;
}

+ if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
+ vaddr += huge_page_size(h);
+ remainder -= pages_per_huge_page(h);
+ i += pages_per_huge_page(h);
+ spin_unlock(ptl);
+ continue;
+ }
+
refs = min3(pages_per_huge_page(h) - pfn_offset,
(vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);


We can surely say that this change only affects get_user_pages() callers
with FOLL_HWPOISON set, so this should pinpoint the current problem only.
A side note is that the above change on follow_hugetlb_page() has a room of
refactoring to reduce duplicated code.

Could you try to test and complete it?

Thanks,
Naoya Horiguchi

2021-04-07 20:28:51

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v7] mm/gup: check page hwpoison status for memory recovery failures.

On Wed, 7 Apr 2021 01:54:28 +0000
HORIGUCHI NAOYA(堀口 直也) <[email protected]> wrote:

> On Tue, Apr 06, 2021 at 10:41:23AM +0800, Aili Yao wrote:
> > When we call get_user_pages() to pin user page in memory, there may be
> > hwpoison page, currently, we just handle the normal case that memory
> > recovery jod is correctly finished, and we will not return the hwpoison
> > page to callers, but for other cases like memory recovery fails and the
> > user process related pte is not correctly set invalid, we will still
> > return the hwpoison page, and may touch it and lead to panic.
> >
> > In gup.c, for normal page, after we call follow_page_mask(), we will
> > return the related page pointer; or like another hwpoison case with pte
> > invalid, it will return NULL. For NULL, we will handle it in if (!page)
> > branch. In this patch, we will filter out the hwpoison page in
> > follow_page_mask() and return error code for recovery failure cases.
> >
> > We will check the page hwpoison status as soon as possible and avoid doing
> > followed normal procedure and try not to grab related pages.
> >
> > Changes since v6:
> > - Fix wrong page pointer check in follow_trans_huge_pmd();
> >
> > Signed-off-by: Aili Yao <[email protected]>
> > Cc: David Hildenbrand <[email protected]>
> > Cc: Matthew Wilcox <[email protected]>
> > Cc: Naoya Horiguchi <[email protected]>
> > Cc: Oscar Salvador <[email protected]>
> > Cc: Mike Kravetz <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: [email protected]
> > ---
> > mm/gup.c | 27 +++++++++++++++++++++++----
> > mm/huge_memory.c | 11 ++++++++---
> > mm/hugetlb.c | 8 +++++++-
> > mm/internal.h | 13 +++++++++++++
> > 4 files changed, 51 insertions(+), 8 deletions(-)
>
> Thank you for the work.
>
> Looking through this patch, the internal of follow_page_mask() is
> very complicated so it's not easy to make this hwpoison-aware.
> Now I'm getting unsure to judge that this is the best approach.
> What actually I imagined might be like below (which is totally
> untested, and I'm sorry about my previous misleading comments):
>
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..a60a08fc7668 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1090,6 +1090,11 @@ static long __get_user_pages(struct mm_struct *mm,
> } else if (IS_ERR(page)) {
> ret = PTR_ERR(page);
> goto out;
> + } else if (gup_flags & FOLL_HWPOISON && PageHWPoison(page)) {
> + if (gup_flags & FOLL_GET)
> + put_page(page);
> + ret = -EHWPOISON;
> + goto out;
> }
> if (pages) {
> pages[i] = page;
> @@ -1532,7 +1537,7 @@ struct page *get_dump_page(unsigned long addr)
> if (mmap_read_lock_killable(mm))
> return NULL;
> ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
> - FOLL_FORCE | FOLL_DUMP | FOLL_GET);
> + FOLL_FORCE | FOLL_DUMP | FOLL_GET | FOLL_HWPOISON);
> if (locked)
> mmap_read_unlock(mm);
> return (ret == 1) ? page : NULL;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a86a58ef132d..03c3d3225c0d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4949,6 +4949,14 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> continue;
> }
>
> + if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
> + vaddr += huge_page_size(h);
> + remainder -= pages_per_huge_page(h);
> + i += pages_per_huge_page(h);
> + spin_unlock(ptl);
> + continue;
> + }
> +
> refs = min3(pages_per_huge_page(h) - pfn_offset,
> (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
>
>
> We can surely say that this change only affects get_user_pages() callers
> with FOLL_HWPOISON set, so this should pinpoint the current problem only.
> A side note is that the above change on follow_hugetlb_page() has a room of
> refactoring to reduce duplicated code.
>
> Could you try to test and complete it?

Got it, I will try to complete it and test it.

For the code:

long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> continue;
> }
>
> + if (flags & FOLL_HWPOISON && PageHWPoison(page)) {
> + vaddr += huge_page_size(h);
> + remainder -= pages_per_huge_page(h);
> + i += pages_per_huge_page(h);
> + spin_unlock(ptl);
> + continue;
> + }
> +

I am wondering if we still need to continue the loop in follow_hugetlb_page()? This function
seems mainly for prerparation of vmas and grab the hugepage, if we meet one hwpoison hugetlb page,
we will check it after follow_page_mask() return, then we will quit the total loop
and the num of page or error code will be returned, and the vmas after the hwpoison one will
not be needed?

--
Thanks!
Aili Yao

2021-05-10 03:16:55

by yaoaili [么爱利]

[permalink] [raw]
Subject: Re: [PATCH v7] mm/gup: check page hwpoison status for memory recovery failures.

On Tue, 6 Apr 2021 10:41:23 +0800
Aili Yao <[email protected]> wrote:

> When we call get_user_pages() to pin user page in memory, there may be
> hwpoison page, currently, we just handle the normal case that memory
> recovery jod is correctly finished, and we will not return the hwpoison
> page to callers, but for other cases like memory recovery fails and the
> user process related pte is not correctly set invalid, we will still
> return the hwpoison page, and may touch it and lead to panic.
>
> In gup.c, for normal page, after we call follow_page_mask(), we will
> return the related page pointer; or like another hwpoison case with pte
> invalid, it will return NULL. For NULL, we will handle it in if (!page)
> branch. In this patch, we will filter out the hwpoison page in
> follow_page_mask() and return error code for recovery failure cases.
>
> We will check the page hwpoison status as soon as possible and avoid doing
> followed normal procedure and try not to grab related pages.
>
> Changes since v6:
> - Fix wrong page pointer check in follow_trans_huge_pmd();
>
> Signed-off-by: Aili Yao <[email protected]>

After considering more, I still insist that we should take more care to treat the case that kernel touched user hwpoison pages.

Usually when process touched the hwpoison page, it will be killed, and no other bad consequence will happen. while in a case, kernel touches
the user process hwpoison page, It should be one more serious issue than the user process's access, the error return nor Null return is
not sufficient and reasonable for me.

There will be some doubt places,
1) should the user process be killed first?
2) if error returned, the process can correctly process it? It seems there is no way the process can finish that.
3) if NULL returned, job continue again? the data integrity will be guaranteed? I doubted that.

There are so many arguements about the CopyIn patch from intel that the process should killed or not...
And this gup module seems more compicated and have the same question.
When error returned or NULL return from gup, in some way and some scenario, this will make the hw issue even more complicated.

Yes, there is a flag FOLL_HWPOISON, but this flag seems be designed for different return values that the process can accept,
it is not designed to decide kernel to panic or not;

Currently I have no other scenario like coredump case. And I havn't one good way to fix this leak in a graceful way.

Maybe, For some scenario, the kernel recovery thing for user process is wrong and not deserved.

I have a lot of other works to do and have not much time to continue on this thread, And I insist my original post.

There should be a better option for this issue, when the better option is ready, this patch can and should be reverted.

Thanks!
Aili Yao