2022-10-18 00:09:27

by Luck, Tony

[permalink] [raw]
Subject: [RFC PATCH] mm, hwpoison: Recover from copy-on-write machine checks

If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.

It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.

I wrapped that neatly into a test at:

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

just enable ACPI error injection and run:

# ./einj_mem-uc -f copy-on-write

[Note this test needs some better reporting for the case where this
patch has been applied and the system does NOT crash]

Patch below works ... but there are probably many places where it could
fit better into the general "mm" way of doing things. E.g. using the
copy_mc_to_kernel() function does what I need here, but the name doesn't
seem like it is quite right.

Basic idea is very simple ... if the kernel gets a machine check copying
the page, just free up the new page that was going to be the target of
the copy and return VM_FAULT_HWPOISON to the calling stack.

Slightly-signed-off-by: Tony Luck <[email protected]>
---
include/linux/highmem.h | 19 +++++++++++++++++++
mm/memory.c | 28 ++++++++++++++++++++--------
2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index e9912da5441b..5967541fbf0e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,

#endif

+static inline int copy_user_highpage_mc(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ unsigned long ret = 0;
+#ifdef copy_mc_to_kernel
+ char *vfrom, *vto;
+
+ vfrom = kmap_local_page(from);
+ vto = kmap_local_page(to);
+ ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
+ kunmap_local(vto);
+ kunmap_local(vfrom);
+#else
+ copy_user_highpage(to, from, vaddr, vma);
+#endif
+
+ return ret;
+}
+
#ifndef __HAVE_ARCH_COPY_HIGHPAGE

static inline void copy_highpage(struct page *to, struct page *from)
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..b5e22bf4c10a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
return same;
}

-static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
- struct vm_fault *vmf)
+/*
+ * Return:
+ * -1 = copy failed due to poison in source page
+ * 0 = copied failed (some other reason)
+ * 1 = copied succeeded
+ */
+static inline int __wp_page_copy_user(struct page *dst, struct page *src,
+ struct vm_fault *vmf)
{
bool ret;
void *kaddr;
@@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
unsigned long addr = vmf->address;

if (likely(src)) {
- copy_user_highpage(dst, src, addr, vma);
- return true;
+ if (copy_user_highpage_mc(dst, src, addr, vma))
+ return -1;
+ return 1;
}

/*
@@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
* and update local tlb only
*/
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = 0;
goto pte_unlock;
}

@@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = 0;
goto pte_unlock;
}

@@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
}
}

- ret = true;
+ ret = 1;

pte_unlock:
if (locked)
@@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
pte_t entry;
int page_copied = 0;
struct mmu_notifier_range range;
+ int ret;

delayacct_wpcopy_start();

@@ -3121,7 +3129,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (!new_page)
goto oom;

- if (!__wp_page_copy_user(new_page, old_page, vmf)) {
+ ret = __wp_page_copy_user(new_page, old_page, vmf);
+ if (ret == -1) {
+ put_page(new_page);
+ return VM_FAULT_HWPOISON;
+ } else if (ret == 0) {
/*
* COW failed, if the fault was solved by other,
* it's fine. If not, userspace would re-fault on
--
2.37.3


Subject: Re: [RFC PATCH] mm, hwpoison: Recover from copy-on-write machine checks

On Mon, Oct 17, 2022 at 04:42:03PM -0700, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
>
> I wrapped that neatly into a test at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> just enable ACPI error injection and run:
>
> # ./einj_mem-uc -f copy-on-write
>
> [Note this test needs some better reporting for the case where this
> patch has been applied and the system does NOT crash]
>
> Patch below works ... but there are probably many places where it could
> fit better into the general "mm" way of doing things. E.g. using the
> copy_mc_to_kernel() function does what I need here, but the name doesn't
> seem like it is quite right.

As for the name "copy_user_highpage_mc", it simply sounds like an mcsafe
variant of copy_user_highpage, so I have no objection.

Recently similar routine is suggested in
https://lore.kernel.org/linux-mm/[email protected]/
, so maybe we need some coordination between related interfaces.

>
> Basic idea is very simple ... if the kernel gets a machine check copying
> the page, just free up the new page that was going to be the target of
> the copy and return VM_FAULT_HWPOISON to the calling stack.
>
> Slightly-signed-off-by: Tony Luck <[email protected]>

I'm basically supportive to what this patch intends to do.

> ---
> include/linux/highmem.h | 19 +++++++++++++++++++
> mm/memory.c | 28 ++++++++++++++++++++--------
> 2 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..5967541fbf0e 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>
> #endif
>
> +static inline int copy_user_highpage_mc(struct page *to, struct page *from,
> + unsigned long vaddr, struct vm_area_struct *vma)
> +{
> + unsigned long ret = 0;
> +#ifdef copy_mc_to_kernel
> + char *vfrom, *vto;
> +
> + vfrom = kmap_local_page(from);
> + vto = kmap_local_page(to);
> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> + kunmap_local(vto);
> + kunmap_local(vfrom);
> +#else
> + copy_user_highpage(to, from, vaddr, vma);
> +#endif
> +
> + return ret;
> +}
> +
> #ifndef __HAVE_ARCH_COPY_HIGHPAGE
>
> static inline void copy_highpage(struct page *to, struct page *from)
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..b5e22bf4c10a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
> return same;
> }
>
> -static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> - struct vm_fault *vmf)
> +/*
> + * Return:
> + * -1 = copy failed due to poison in source page

Simply calling "poison" might cause confusion with page poisoning feature,
so "hwpoison" might be better. But I know that "poison" is commonly used
under arch/x86, and it's not clear to me what to do with this terminology.
Maybe using -EHWPOISON instead of -1 might be helpful to the distinction.

> + * 0 = copied failed (some other reason)
> + * 1 = copied succeeded
> + */
> +static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> + struct vm_fault *vmf)
> {
> bool ret;
> void *kaddr;
...
> @@ -3121,7 +3129,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> if (!new_page)
> goto oom;
>
> - if (!__wp_page_copy_user(new_page, old_page, vmf)) {
> + ret = __wp_page_copy_user(new_page, old_page, vmf);
> + if (ret == -1) {
> + put_page(new_page);

Maybe I miss something, but don't you have to care about refcount of
old_page in this branch (as done in "ret == 0" branch)?

Thanks,
Naoya Horiguchi

> + return VM_FAULT_HWPOISON;
> + } else if (ret == 0) {
> /*
> * COW failed, if the fault was solved by other,
> * it's fine. If not, userspace would re-fault on
> --
> 2.37.3

2022-10-18 18:13:40

by Luck, Tony

[permalink] [raw]
Subject: RE: [RFC PATCH] mm, hwpoison: Recover from copy-on-write machine checks

>> + * -1 = copy failed due to poison in source page

> Simply calling "poison" might cause confusion with page poisoning feature,
> so "hwpoison" might be better. But I know that "poison" is commonly used
> under arch/x86, and it's not clear to me what to do with this terminology.
> Maybe using -EHWPOISON instead of -1 might be helpful to the distinction.

Agreed. Using -EHWPOISON return is clearer here.

>> - if (!__wp_page_copy_user(new_page, old_page, vmf)) {
>> + ret = __wp_page_copy_user(new_page, old_page, vmf);
>> + if (ret == -1) {
>> + put_page(new_page);
>
> Maybe I miss something, but don't you have to care about refcount of
> old_page in this branch (as done in "ret == 0" branch)?

You didn't miss anything. But I did. More needs to be done with old_page
(it is where the poison is). I got "lucky" just ignoring/forgetting about it in
my patch ... the system just happened to recover, but I think the poison
may not have been handled for the parent process. that still has the page
mapped.. Need to think about this more.

Thanks for the review.

-Tony

2022-10-19 17:32:22

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.

It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.

I wrapped that neatly into a test at:

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

just enable ACPI error injection and run:

# ./einj_mem-uc -f copy-on-write

Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
on architectures where that is available (currently x86 and powerpc).
When an error is detected during the page copy, return VM_FAULT_HWPOISON
to caller of wp_page_copy(). This propagates up the call stack. Both x86
and powerpc have code in their fault handler to deal with this code by
sending a SIGBUS to the application.

Note that this patch avoids a system crash and signals the process that
triggered the copy-on-write action. It does not take any action for the
memory error that is still in the shared page. To handle that a call to
memory_failure() is needed. But this cannot be done from wp_page_copy()
because it holds mmap_lock(). Perhaps the architecture fault handlers
can deal with this loose end in a subsequent patch?

On Intel/x86 this loose end will often be handled automatically because
the memory controller provides an additional notification of the h/w
poison in memory, the handler for this will call memory_failure(). This
isn't a 100% solution. If there are multiple errors, not all may be
logged in this way.

Signed-off-by: Tony Luck <[email protected]>

---
Changes in V2:
Naoya Horiguchi:
1) Use -EHWPOISON error code instead of minus one.
2) Poison path needs also to deal with old_page
Tony Luck:
Rewrote commit message
Added some powerpc folks to Cc: list
---
include/linux/highmem.h | 19 +++++++++++++++++++
mm/memory.c | 28 +++++++++++++++++++---------
2 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index e9912da5441b..5967541fbf0e 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,

#endif

+static inline int copy_user_highpage_mc(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ unsigned long ret = 0;
+#ifdef copy_mc_to_kernel
+ char *vfrom, *vto;
+
+ vfrom = kmap_local_page(from);
+ vto = kmap_local_page(to);
+ ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
+ kunmap_local(vto);
+ kunmap_local(vfrom);
+#else
+ copy_user_highpage(to, from, vaddr, vma);
+#endif
+
+ return ret;
+}
+
#ifndef __HAVE_ARCH_COPY_HIGHPAGE

static inline void copy_highpage(struct page *to, struct page *from)
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..a32556c9b689 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
return same;
}

-static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
- struct vm_fault *vmf)
+/*
+ * Return:
+ * -EHWPOISON: copy failed due to hwpoison in source page
+ * 0: copied failed (some other reason)
+ * 1: copied succeeded
+ */
+static inline int __wp_page_copy_user(struct page *dst, struct page *src,
+ struct vm_fault *vmf)
{
bool ret;
void *kaddr;
@@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
unsigned long addr = vmf->address;

if (likely(src)) {
- copy_user_highpage(dst, src, addr, vma);
- return true;
+ if (copy_user_highpage_mc(dst, src, addr, vma))
+ return -EHWPOISON;
+ return 1;
}

/*
@@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
* and update local tlb only
*/
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = 0;
goto pte_unlock;
}

@@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = 0;
goto pte_unlock;
}

@@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
}
}

- ret = true;
+ ret = 1;

pte_unlock:
if (locked)
@@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
pte_t entry;
int page_copied = 0;
struct mmu_notifier_range range;
+ int ret;

delayacct_wpcopy_start();

@@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (!new_page)
goto oom;

- if (!__wp_page_copy_user(new_page, old_page, vmf)) {
+ ret = __wp_page_copy_user(new_page, old_page, vmf);
+ if (ret <= 0) {
/*
* COW failed, if the fault was solved by other,
* it's fine. If not, userspace would re-fault on
* the same address and we will handle the fault
* from the second attempt.
+ * The -EHWPOISON case will not be retried.
*/
put_page(new_page);
if (old_page)
put_page(old_page);

delayacct_wpcopy_end();
- return 0;
+ return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
}
kmsan_copy_page_meta(new_page, old_page);
}
--
2.37.3

2022-10-19 18:35:43

by Dan Williams

[permalink] [raw]
Subject: RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
>
> I wrapped that neatly into a test at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> just enable ACPI error injection and run:
>
> # ./einj_mem-uc -f copy-on-write
>
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.
>
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
>
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
>
> Signed-off-by: Tony Luck <[email protected]>

Just some minor comments below, but you can add:

Reviewed-by: Dan Williams <[email protected]>

>
> ---
> Changes in V2:
> Naoya Horiguchi:
> 1) Use -EHWPOISON error code instead of minus one.
> 2) Poison path needs also to deal with old_page
> Tony Luck:
> Rewrote commit message
> Added some powerpc folks to Cc: list
> ---
> include/linux/highmem.h | 19 +++++++++++++++++++
> mm/memory.c | 28 +++++++++++++++++++---------
> 2 files changed, 38 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..5967541fbf0e 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>
> #endif
>
> +static inline int copy_user_highpage_mc(struct page *to, struct page *from,
> + unsigned long vaddr, struct vm_area_struct *vma)
> +{
> + unsigned long ret = 0;
> +#ifdef copy_mc_to_kernel
> + char *vfrom, *vto;
> +
> + vfrom = kmap_local_page(from);
> + vto = kmap_local_page(to);
> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> + kunmap_local(vto);
> + kunmap_local(vfrom);
> +#else
> + copy_user_highpage(to, from, vaddr, vma);
> +#endif
> +
> + return ret;
> +}
> +

There is likely some small benefit of doing this the idiomatic way and
let grep see that there are multiple definitions of
copy_user_highpage_mc() with an organization like:

#ifdef copy_mc_to_kernel
static inline int copy_user_highpage_mc(struct page *to, struct page *from,
unsigned long vaddr,
struct vm_area_struct *vma)
{
unsigned long ret = 0;
char *vfrom, *vto;

vfrom = kmap_local_page(from);
vto = kmap_local_page(to);
ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
kunmap_local(vto);
kunmap_local(vfrom);

return ret;
}
#else
static inline int copy_user_highpage_mc(struct page *to, struct page *from,
unsigned long vaddr,
struct vm_area_struct *vma)
{
copy_user_highpage(to, from, vaddr, vma);
return 0;
}
#endif

Per the copy_mc* discussion with Linus I would have called this function
copy_mc_to_user_highpage() to clarify that hwpoison is handled from the
source buffer of the copy.

> #ifndef __HAVE_ARCH_COPY_HIGHPAGE
>
> static inline void copy_highpage(struct page *to, struct page *from)
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..a32556c9b689 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
> return same;
> }
>
> -static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> - struct vm_fault *vmf)
> +/*
> + * Return:
> + * -EHWPOISON: copy failed due to hwpoison in source page
> + * 0: copied failed (some other reason)
> + * 1: copied succeeded
> + */
> +static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> + struct vm_fault *vmf)
> {
> bool ret;
> void *kaddr;
> @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> unsigned long addr = vmf->address;
>
> if (likely(src)) {
> - copy_user_highpage(dst, src, addr, vma);
> - return true;
> + if (copy_user_highpage_mc(dst, src, addr, vma))
> + return -EHWPOISON;

Given there is no use case for the residue value returned by
copy_mc_to_kernel() perhaps just return EHWPOISON directly from
copyuser_highpage_mc() in the short-copy case?

> + return 1;
> }
>
> /*
> @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> * and update local tlb only
> */
> update_mmu_tlb(vma, addr, vmf->pte);
> - ret = false;
> + ret = 0;

What do you think about just making these 'false' cases also return a
negative errno? (rationale below...)

> goto pte_unlock;
> }
>
> @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
> /* The PTE changed under us, update local tlb */
> update_mmu_tlb(vma, addr, vmf->pte);
> - ret = false;
> + ret = 0;
> goto pte_unlock;
> }
>
> @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> }
> }
>
> - ret = true;
> + ret = 1;
>
> pte_unlock:
> if (locked)
> @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> pte_t entry;
> int page_copied = 0;
> struct mmu_notifier_range range;
> + int ret;
>
> delayacct_wpcopy_start();
>
> @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> if (!new_page)
> goto oom;
>
> - if (!__wp_page_copy_user(new_page, old_page, vmf)) {
> + ret = __wp_page_copy_user(new_page, old_page, vmf);
> + if (ret <= 0) {

...this would become a typical '0 == success' and 'negative errno ==
failure', where all but EHWPOISON are retried.

> /*
> * COW failed, if the fault was solved by other,
> * it's fine. If not, userspace would re-fault on
> * the same address and we will handle the fault
> * from the second attempt.
> + * The -EHWPOISON case will not be retried.
> */
> put_page(new_page);
> if (old_page)
> put_page(old_page);
>
> delayacct_wpcopy_end();
> - return 0;
> + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;

2022-10-19 21:51:50

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

> Given there is no use case for the residue value returned by
> copy_mc_to_kernel() perhaps just return EHWPOISON directly from
> copyuser_highpage_mc() in the short-copy case?

I don't think it hurts to keep the return value as residue count. It isn't
making that code any more complex and could be useful someday.

Other feedback looks good and I have applied ready for next version.

Thanks for the review.

-Tony

2022-10-20 03:03:53

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults



在 2022/10/20 AM1:08, Tony Luck 写道:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
>
> I wrapped that neatly into a test at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> just enable ACPI error injection and run:
>
> # ./einj_mem-uc -f copy-on-write
>
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.

Does it send SIGBUS to only child process or both parent and child process?

>
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed.

If the error page is not poisoned, should the return value of wp_page_copy
be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.

Thanks.

Best Regards,
Shuai


> But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
>
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
>
> Signed-off-by: Tony Luck <[email protected]>
>
> ---
> Changes in V2:
> Naoya Horiguchi:
> 1) Use -EHWPOISON error code instead of minus one.
> 2) Poison path needs also to deal with old_page
> Tony Luck:
> Rewrote commit message
> Added some powerpc folks to Cc: list
> ---
> include/linux/highmem.h | 19 +++++++++++++++++++
> mm/memory.c | 28 +++++++++++++++++++---------
> 2 files changed, 38 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..5967541fbf0e 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
>
> #endif
>
> +static inline int copy_user_highpage_mc(struct page *to, struct page *from,
> + unsigned long vaddr, struct vm_area_struct *vma)
> +{
> + unsigned long ret = 0;
> +#ifdef copy_mc_to_kernel
> + char *vfrom, *vto;
> +
> + vfrom = kmap_local_page(from);
> + vto = kmap_local_page(to);
> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> + kunmap_local(vto);
> + kunmap_local(vfrom);
> +#else
> + copy_user_highpage(to, from, vaddr, vma);
> +#endif
> +
> + return ret;
> +}
> +
> #ifndef __HAVE_ARCH_COPY_HIGHPAGE
>
> static inline void copy_highpage(struct page *to, struct page *from)
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..a32556c9b689 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
> return same;
> }
>
> -static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> - struct vm_fault *vmf)
> +/*
> + * Return:
> + * -EHWPOISON: copy failed due to hwpoison in source page
> + * 0: copied failed (some other reason)
> + * 1: copied succeeded
> + */
> +static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> + struct vm_fault *vmf)
> {
> bool ret;
> void *kaddr;
> @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> unsigned long addr = vmf->address;
>
> if (likely(src)) {
> - copy_user_highpage(dst, src, addr, vma);
> - return true;
> + if (copy_user_highpage_mc(dst, src, addr, vma))
> + return -EHWPOISON;
> + return 1;
> }
>
> /*
> @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> * and update local tlb only
> */
> update_mmu_tlb(vma, addr, vmf->pte);
> - ret = false;
> + ret = 0;
> goto pte_unlock;
> }
>
> @@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
> /* The PTE changed under us, update local tlb */
> update_mmu_tlb(vma, addr, vmf->pte);
> - ret = false;
> + ret = 0;
> goto pte_unlock;
> }
>
> @@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> }
> }
>
> - ret = true;
> + ret = 1;
>
> pte_unlock:
> if (locked)
> @@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> pte_t entry;
> int page_copied = 0;
> struct mmu_notifier_range range;
> + int ret;
>
> delayacct_wpcopy_start();
>
> @@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> if (!new_page)
> goto oom;
>
> - if (!__wp_page_copy_user(new_page, old_page, vmf)) {
> + ret = __wp_page_copy_user(new_page, old_page, vmf);
> + if (ret <= 0) {
> /*
> * COW failed, if the fault was solved by other,
> * it's fine. If not, userspace would re-fault on
> * the same address and we will handle the fault
> * from the second attempt.
> + * The -EHWPOISON case will not be retried.
> */
> put_page(new_page);
> if (old_page)
> put_page(old_page);
>
> delayacct_wpcopy_end();
> - return 0;
> + return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
> }
> kmsan_copy_page_meta(new_page, old_page);
> }

2022-10-20 20:23:43

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>
>
> 在 2022/10/20 AM1:08, Tony Luck 写道:
> > If the kernel is copying a page as the result of a copy-on-write
> > fault and runs into an uncorrectable error, Linux will crash because
> > it does not have recovery code for this case where poison is consumed
> > by the kernel.
> >
> > It is easy to set up a test case. Just inject an error into a private
> > page, fork(2), and have the child process write to the page.
> >
> > I wrapped that neatly into a test at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> >
> > just enable ACPI error injection and run:
> >
> > # ./einj_mem-uc -f copy-on-write
> >
> > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> > on architectures where that is available (currently x86 and powerpc).
> > When an error is detected during the page copy, return VM_FAULT_HWPOISON
> > to caller of wp_page_copy(). This propagates up the call stack. Both x86
> > and powerpc have code in their fault handler to deal with this code by
> > sending a SIGBUS to the application.
>
> Does it send SIGBUS to only child process or both parent and child process?

This only sends a SIGBUS to the process that wrote the page (typically
the child, but also possible that the parent is the one that does the
write that causes the COW).

> >
> > Note that this patch avoids a system crash and signals the process that
> > triggered the copy-on-write action. It does not take any action for the
> > memory error that is still in the shared page. To handle that a call to
> > memory_failure() is needed.
>
> If the error page is not poisoned, should the return value of wp_page_copy
> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.

The page has uncorrected data in it, but this patch doesn't mark it
as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
also skip the:

"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"

console message. So might result in confusion and attepmts to debug a
s/w problem with the application instead of blaming the death on a bad
DIMM.

> > But this cannot be done from wp_page_copy()
> > because it holds mmap_lock(). Perhaps the architecture fault handlers
> > can deal with this loose end in a subsequent patch?

I started looking at this for x86 ... but I have changed my mind
about this being a good place for a fix. When control returns back
to the architecture fault handler it no longer has easy access to
the physical page frame number. It has the virtual address, so it
could descend back into somee new mm/memory.c function to get the
physical address ... but that seems silly.

I'm experimenting with using sched_work() to handle the call to
memory_failure() (echoing what the machine check handler does using
task_work)_add() to avoid the same problem of not being able to directly
call memory_failure()).

So far it seems to be working. Patch below (goes on top of original
patch ... well on top of the internal version with mods based on
feedback from Dan Williams ... but should show the general idea)

With this patch applied the page does get unmapped from all users.
Other tasks that shared the page will get a SIGBUS if they attempt
to access it later (from the page fault handler because of
is_hwpoison_entry() as you mention above.

-Tony

From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
From: Tony Luck <[email protected]>
Date: Thu, 20 Oct 2022 09:57:28 -0700
Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
failure

Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.

It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.

Use schedule_work() to queue a request to call memory_failure() for the
page with the error.

Signed-off-by: Tony Luck <[email protected]>
---
mm/memory.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b6056eef2f72..4a1304cf1f4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
return same;
}

+#ifdef CONFIG_MEMORY_FAILURE
+struct pfn_work {
+ struct work_struct work;
+ unsigned long pfn;
+};
+
+static void do_sched_memory_failure(struct work_struct *w)
+{
+ struct pfn_work *p = container_of(w, struct pfn_work, work);
+
+ memory_failure(p->pfn, 0);
+ kfree(p);
+}
+
+static void sched_memory_failure(unsigned long pfn)
+{
+ struct pfn_work *p;
+
+ p = kmalloc(sizeof *p, GFP_KERNEL);
+ if (!p)
+ return;
+ INIT_WORK(&p->work, do_sched_memory_failure);
+ p->pfn = pfn;
+ schedule_work(&p->work);
+}
+#else
+static void sched_memory_failure(unsigned long pfn)
+{
+}
+#endif
+
/*
* Return:
* 0: copied succeeded
@@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
unsigned long addr = vmf->address;

if (likely(src)) {
- if (copy_mc_user_highpage(dst, src, addr, vma))
+ if (copy_mc_user_highpage(dst, src, addr, vma)) {
+ sched_memory_failure(page_to_pfn(src));
return -EHWPOISON;
+ }
return 0;
}

--
2.37.3

2022-10-21 02:05:55

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults



在 2022/10/21 AM4:05, Tony Luck 写道:
> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>
>>
>> 在 2022/10/20 AM1:08, Tony Luck 写道:
>>> If the kernel is copying a page as the result of a copy-on-write
>>> fault and runs into an uncorrectable error, Linux will crash because
>>> it does not have recovery code for this case where poison is consumed
>>> by the kernel.
>>>
>>> It is easy to set up a test case. Just inject an error into a private
>>> page, fork(2), and have the child process write to the page.
>>>
>>> I wrapped that neatly into a test at:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>>>
>>> just enable ACPI error injection and run:
>>>
>>> # ./einj_mem-uc -f copy-on-write
>>>
>>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
>>> on architectures where that is available (currently x86 and powerpc).
>>> When an error is detected during the page copy, return VM_FAULT_HWPOISON
>>> to caller of wp_page_copy(). This propagates up the call stack. Both x86
>>> and powerpc have code in their fault handler to deal with this code by
>>> sending a SIGBUS to the application.
>>
>> Does it send SIGBUS to only child process or both parent and child process?
>
> This only sends a SIGBUS to the process that wrote the page (typically
> the child, but also possible that the parent is the one that does the
> write that causes the COW).


Thanks for your explanation.

>
>>>
>>> Note that this patch avoids a system crash and signals the process that
>>> triggered the copy-on-write action. It does not take any action for the
>>> memory error that is still in the shared page. To handle that a call to
>>> memory_failure() is needed.
>>
>> If the error page is not poisoned, should the return value of wp_page_copy
>> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
>> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
>> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.
>
> The page has uncorrected data in it, but this patch doesn't mark it
> as poisoned. Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
> that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
> also skip the:
>
> "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"
>
> console message. So might result in confusion and attepmts to debug a
> s/w problem with the application instead of blaming the death on a bad
> DIMM.

I see your point. Thank you.

>
>>> But this cannot be done from wp_page_copy()
>>> because it holds mmap_lock(). Perhaps the architecture fault handlers
>>> can deal with this loose end in a subsequent patch?
>
> I started looking at this for x86 ... but I have changed my mind
> about this being a good place for a fix. When control returns back
> to the architecture fault handler it no longer has easy access to
> the physical page frame number. It has the virtual address, so it
> could descend back into somee new mm/memory.c function to get the
> physical address ... but that seems silly.
>
> I'm experimenting with using sched_work() to handle the call to
> memory_failure() (echoing what the machine check handler does using
> task_work)_add() to avoid the same problem of not being able to directly
> call memory_failure()).

Work queues permit work to be deferred outside of the interrupt context
into the kernel process context. If we return to user-space before the
queued memory_failure() work is processed, we will take the fault again,
as we discussed recently.

commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak

So, in my opinion, we should add memory failure as a task work, like
do_machine_check does, e.g.

queue_task_work(&m, msg, kill_me_maybe);

>
> So far it seems to be working. Patch below (goes on top of original
> patch ... well on top of the internal version with mods based on
> feedback from Dan Williams ... but should show the general idea)
>
> With this patch applied the page does get unmapped from all users.
> Other tasks that shared the page will get a SIGBUS if they attempt
> to access it later (from the page fault handler because of
> is_hwpoison_entry() as you mention above.
>
> -Tony
>
> From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
> From: Tony Luck <[email protected]>
> Date: Thu, 20 Oct 2022 09:57:28 -0700
> Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
> failure
>
> Cannot call memory_failure() directly from the fault handler because
> mmap_lock (and others) are held.
>
> It is important, but not urgent, to mark the source page as h/w poisoned
> and unmap it from other tasks.
>
> Use schedule_work() to queue a request to call memory_failure() for the
> page with the error.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> mm/memory.c | 35 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..4a1304cf1f4e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
> return same;
> }
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +struct pfn_work {
> + struct work_struct work;
> + unsigned long pfn;
> +};
> +
> +static void do_sched_memory_failure(struct work_struct *w)
> +{
> + struct pfn_work *p = container_of(w, struct pfn_work, work);
> +
> + memory_failure(p->pfn, 0);
> + kfree(p);
> +}
> +
> +static void sched_memory_failure(unsigned long pfn)
> +{
> + struct pfn_work *p;
> +
> + p = kmalloc(sizeof *p, GFP_KERNEL);
> + if (!p)
> + return;
> + INIT_WORK(&p->work, do_sched_memory_failure);
> + p->pfn = pfn;
> + schedule_work(&p->work);
> +}

I think there is already a function to do such work in mm/memory-failure.c.

void memory_failure_queue(unsigned long pfn, int flags)


Best Regards,
Shuai


> +#else
> +static void sched_memory_failure(unsigned long pfn)
> +{
> +}
> +#endif
> +
> /*
> * Return:
> * 0: copied succeeded
> @@ -2866,8 +2897,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> unsigned long addr = vmf->address;
>
> if (likely(src)) {
> - if (copy_mc_user_highpage(dst, src, addr, vma))
> + if (copy_mc_user_highpage(dst, src, addr, vma)) {
> + sched_memory_failure(page_to_pfn(src));
> return -EHWPOISON;
> + }
> return 0;
> }
>

2022-10-21 04:04:52

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

>> + INIT_WORK(&p->work, do_sched_memory_failure);
>> + p->pfn = pfn;
>> + schedule_work(&p->work);
>
> There is already memory_failure_queue() that can do this. Can we use it directly?

Miaohe Lin,

Yes, can use that. A thousand thanks for pointing it out. I just tried it, and it works
perfectly.

I think I'll need to add an empty stub version for the CONFIG_MEMORY_FAILURE=n
build. But that's trivial.

-Tony

2022-10-21 04:16:09

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>
>
> 在 2022/10/21 AM4:05, Tony Luck 写道:
> > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
> >>
> >>
> >> 在 2022/10/20 AM1:08, Tony Luck 写道:

> > I'm experimenting with using sched_work() to handle the call to
> > memory_failure() (echoing what the machine check handler does using
> > task_work)_add() to avoid the same problem of not being able to directly
> > call memory_failure()).
>
> Work queues permit work to be deferred outside of the interrupt context
> into the kernel process context. If we return to user-space before the
> queued memory_failure() work is processed, we will take the fault again,
> as we discussed recently.
>
> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>
> So, in my opinion, we should add memory failure as a task work, like
> do_machine_check does, e.g.
>
> queue_task_work(&m, msg, kill_me_maybe);

Maybe ... but this case isn't pending back to a user instruction
that is trying to READ the poison memory address. The task is just
trying to WRITE to any address within the page.

So this is much more like a patrol scrub error found asynchronously
by the memory controller (in this case found asynchronously by the
Linux page copy function). So I don't feel that it's really the
responsibility of the current task.

When we do return to user mode the task is going to be busy servicing
a SIGBUS ... so shouldn't try to touch the poison page before the
memory_failure() called by the worker thread cleans things up.

> > + INIT_WORK(&p->work, do_sched_memory_failure);
> > + p->pfn = pfn;
> > + schedule_work(&p->work);
> > +}
>
> I think there is already a function to do such work in mm/memory-failure.c.
>
> void memory_failure_queue(unsigned long pfn, int flags)

Also pointed out by Miaohe Lin <[email protected]> ... this does
exacly what I want, and is working well in tests so far. So perhaps
a cleaner solution than making the kill_me_maybe() function globally
visible.

-Tony

2022-10-21 04:42:08

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

From: Tony Luck
> Sent: 21 October 2022 05:08
....
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.

What about an RT process on a busy system?
The worker threads are pretty low priority.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-10-21 05:08:08

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

>> When we do return to user mode the task is going to be busy servicing
>> a SIGBUS ... so shouldn't try to touch the poison page before the
>> memory_failure() called by the worker thread cleans things up.
>
> What about an RT process on a busy system?
> The worker threads are pretty low priority.

Most tasks don't have a SIGBUS handler ... so they just die without possibility of accessing poison

If this task DOES have a SIGBUS handler, and that for some bizarre reason just does a "return"
so the task jumps back to the instruction that cause the COW then there is a 63/64
likelihood that it is touching a different cache line from the poisoned one.

In the 1/64 case ... its probably a simple store (since there was a COW, we know it was trying to
modify the page) ... so won't generate another machine check (those only happen for reads).

But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we
could get another machine check from the same address. But then we just follow the usual
recovery path.

-Tony

2022-10-21 07:19:47

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults



在 2022/10/21 PM12:08, Tony Luck 写道:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> 在 2022/10/21 AM4:05, Tony Luck 写道:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> 在 2022/10/20 AM1:08, Tony Luck 写道:
>
>>> I'm experimenting with using sched_work() to handle the call to
>>> memory_failure() (echoing what the machine check handler does using
>>> task_work)_add() to avoid the same problem of not being able to directly
>>> call memory_failure()).
>>
>> Work queues permit work to be deferred outside of the interrupt context
>> into the kernel process context. If we return to user-space before the
>> queued memory_failure() work is processed, we will take the fault again,
>> as we discussed recently.
>>
>> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
>> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak
>>
>> So, in my opinion, we should add memory failure as a task work, like
>> do_machine_check does, e.g.
>>
>> queue_task_work(&m, msg, kill_me_maybe);
>
> Maybe ... but this case isn't pending back to a user instruction
> that is trying to READ the poison memory address. The task is just
> trying to WRITE to any address within the page.

Aha, I see the difference. Thank you. But I still have a question on
this. Let us discuss in your reply to David Laight.

Best Regards,
Shuai

>
> So this is much more like a patrol scrub error found asynchronously
> by the memory controller (in this case found asynchronously by the
> Linux page copy function). So I don't feel that it's really the
> responsibility of the current task.
>
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.
>
>>> + INIT_WORK(&p->work, do_sched_memory_failure);
>>> + p->pfn = pfn;
>>> + schedule_work(&p->work);
>>> +}
>>
>> I think there is already a function to do such work in mm/memory-failure.c.
>>
>> void memory_failure_queue(unsigned long pfn, int flags)
>
> Also pointed out by Miaohe Lin <[email protected]> ... this does
> exacly what I want, and is working well in tests so far. So perhaps
> a cleaner solution than making the kill_me_maybe() function globally
> visible.
>
> -Tony

2022-10-21 09:55:36

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults



在 2022/10/21 PM12:41, Luck, Tony 写道:
>>> When we do return to user mode the task is going to be busy servicing
>>> a SIGBUS ... so shouldn't try to touch the poison page before the
>>> memory_failure() called by the worker thread cleans things up.
>>
>> What about an RT process on a busy system?
>> The worker threads are pretty low priority.
>
> Most tasks don't have a SIGBUS handler ... so they just die without possibility of accessing poison
>
> If this task DOES have a SIGBUS handler, and that for some bizarre reason just does a "return"
> so the task jumps back to the instruction that cause the COW then there is a 63/64
> likelihood that it is touching a different cache line from the poisoned one.
>
> In the 1/64 case ... its probably a simple store (since there was a COW, we know it was trying to
> modify the page) ... so won't generate another machine check (those only happen for reads).
>
> But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we
> could get another machine check from the same address. But then we just follow the usual
> recovery path.
>
> -Tony


Let assume the instruction that cause the COW is in the 63/64 case, aka,
it is writing a different cache line from the poisoned one. But the new_page
allocated in COW is dropped right? So might page fault again?

Best Regards,
Shuai

2022-10-21 17:19:58

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

>> But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we
>> could get another machine check from the same address. But then we just follow the usual
>> recovery path.


> Let assume the instruction that cause the COW is in the 63/64 case, aka,
> it is writing a different cache line from the poisoned one. But the new_page
> allocated in COW is dropped right? So might page fault again?

It can, but this should be no surprise to a user that has a signal handler for
a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
problem, but simply returns to re-execute the same instruction that caused
the original trap.

There may be badly written signal handlers that do this. But they just cause
pain for themselves. Linux can keep taking the traps and fixing things up and
sending a new signal over and over.

In this case that loop may involve taking the machine check again, so some
extra pain for the kernel, but recoverable machine checks on Intel/x86 switched
from broadcast to delivery to just the logical CPU that tried to consume the poison
a few generations back. So only a bit more painful than a repeated page fault.

-Tony


2022-10-21 20:25:31

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3 0/2] Copy-on-write poison recovery

Part 1 deals with the process that triggered the copy on write
fault with a store to a shared read-only page. That process is
send a SIGBUS with the usual machine check decoration to specify
the virtual address of the lost page, together with the scope.

Part 2 sets up to asynchronously take the page with the uncorrected
error offline to prevent additional machine check faults. H/t to
Miaohe Lin <[email protected]> and Shuai Xue <[email protected]>
for pointing me to the existing function to queue a call to
memory_failure().

On x86 there is some duplicate reporting (because the error is
also signalled by the memory controller as well as by the core
that triggered the machine check). Console logs look like this:

[ 1647.723403] mce: [Hardware Error]: Machine check events logged
Machine check from kernel copy routine

[ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
x86 fault handler sends SIGBUS to child process

[ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
Async call to memory_failure() from copy on write path

[ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
uc_decode_notifier() processes memory controller report

[ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
Parent process tries to read poisoned page. Page has been unmapped, so
#PF handler sends SIGBUS


Tony Luck (2):
mm, hwpoison: Try to recover from copy-on write faults
mm, hwpoison: When copy-on-write hits poison, take page offline

include/linux/highmem.h | 24 ++++++++++++++++++++++++
include/linux/mm.h | 5 ++++-
mm/memory.c | 32 ++++++++++++++++++++++----------
3 files changed, 50 insertions(+), 11 deletions(-)

--
2.37.3

2022-10-21 20:39:34

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults

If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.

It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.

I wrapped that neatly into a test at:

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

just enable ACPI error injection and run:

# ./einj_mem-uc -f copy-on-write

Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
on architectures where that is available (currently x86 and powerpc).
When an error is detected during the page copy, return VM_FAULT_HWPOISON
to caller of wp_page_copy(). This propagates up the call stack. Both x86
and powerpc have code in their fault handler to deal with this code by
sending a SIGBUS to the application.

Note that this patch avoids a system crash and signals the process that
triggered the copy-on-write action. It does not take any action for the
memory error that is still in the shared page. To handle that a call to
memory_failure() is needed. But this cannot be done from wp_page_copy()
because it holds mmap_lock(). Perhaps the architecture fault handlers
can deal with this loose end in a subsequent patch?

On Intel/x86 this loose end will often be handled automatically because
the memory controller provides an additional notification of the h/w
poison in memory, the handler for this will call memory_failure(). This
isn't a 100% solution. If there are multiple errors, not all may be
logged in this way.

Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Tony Luck <[email protected]>

---
Changes in V3:
Dan Williams
Rename copy_user_highpage_mc() to copy_mc_user_highpage() for
consistency with Linus' discussion on names of functions that
check for machine check.
Write complete functions for the have/have-not copy_mc_to_kernel
cases (so grep shows there are two versions)
Change __wp_page_copy_user() to return 0 for success, negative for fail
[I picked -EAGAIN for both non-EHWPOISON cases]

Changes in V2:
Naoya Horiguchi:
1) Use -EHWPOISON error code instead of minus one.
2) Poison path needs also to deal with old_page
Tony Luck:
Rewrote commit message
Added some powerpc folks to Cc: list
---
include/linux/highmem.h | 24 ++++++++++++++++++++++++
mm/memory.c | 30 ++++++++++++++++++++----------
2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index e9912da5441b..a32c64681f03 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -319,6 +319,30 @@ static inline void copy_user_highpage(struct page *to, struct page *from,

#endif

+#ifdef copy_mc_to_kernel
+static inline int copy_mc_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ unsigned long ret;
+ char *vfrom, *vto;
+
+ vfrom = kmap_local_page(from);
+ vto = kmap_local_page(to);
+ ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
+ kunmap_local(vto);
+ kunmap_local(vfrom);
+
+ return ret;
+}
+#else
+static inline int copy_mc_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ copy_user_highpage(to, from, vaddr, vma);
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_COPY_HIGHPAGE

static inline void copy_highpage(struct page *to, struct page *from)
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..b6056eef2f72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,10 +2848,16 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
return same;
}

-static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
- struct vm_fault *vmf)
+/*
+ * Return:
+ * 0: copied succeeded
+ * -EHWPOISON: copy failed due to hwpoison in source page
+ * -EAGAIN: copied failed (some other reason)
+ */
+static inline int __wp_page_copy_user(struct page *dst, struct page *src,
+ struct vm_fault *vmf)
{
- bool ret;
+ int ret;
void *kaddr;
void __user *uaddr;
bool locked = false;
@@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
unsigned long addr = vmf->address;

if (likely(src)) {
- copy_user_highpage(dst, src, addr, vma);
- return true;
+ if (copy_mc_user_highpage(dst, src, addr, vma))
+ return -EHWPOISON;
+ return 0;
}

/*
@@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
* and update local tlb only
*/
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = -EAGAIN;
goto pte_unlock;
}

@@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = -EAGAIN;
goto pte_unlock;
}

@@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
}
}

- ret = true;
+ ret = 0;

pte_unlock:
if (locked)
@@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
pte_t entry;
int page_copied = 0;
struct mmu_notifier_range range;
+ int ret;

delayacct_wpcopy_start();

@@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (!new_page)
goto oom;

- if (!__wp_page_copy_user(new_page, old_page, vmf)) {
+ ret = __wp_page_copy_user(new_page, old_page, vmf);
+ if (ret) {
/*
* COW failed, if the fault was solved by other,
* it's fine. If not, userspace would re-fault on
* the same address and we will handle the fault
* from the second attempt.
+ * The -EHWPOISON case will not be retried.
*/
put_page(new_page);
if (old_page)
put_page(old_page);

delayacct_wpcopy_end();
- return 0;
+ return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
}
kmsan_copy_page_meta(new_page, old_page);
}
--
2.37.3

2022-10-23 15:17:08

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults



在 2022/10/22 AM12:30, Luck, Tony 写道:
>>> But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we
>>> could get another machine check from the same address. But then we just follow the usual
>>> recovery path.
>
>
>> Let assume the instruction that cause the COW is in the 63/64 case, aka,
>> it is writing a different cache line from the poisoned one. But the new_page
>> allocated in COW is dropped right? So might page fault again?
>
> It can, but this should be no surprise to a user that has a signal handler for
> a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
> problem, but simply returns to re-execute the same instruction that caused
> the original trap.
>
> There may be badly written signal handlers that do this. But they just cause
> pain for themselves. Linux can keep taking the traps and fixing things up and
> sending a new signal over and over.
>
> In this case that loop may involve taking the machine check again, so some
> extra pain for the kernel, but recoverable machine checks on Intel/x86 switched
> from broadcast to delivery to just the logical CPU that tried to consume the poison
> a few generations back. So only a bit more painful than a repeated page fault.
>
> -Tony
>
>

I see, thanks for your patient explanation :)

Best Regards,
Shuai

2022-10-23 16:07:50

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] Copy-on-write poison recovery



在 2022/10/22 AM4:01, Tony Luck 写道:
> Part 1 deals with the process that triggered the copy on write
> fault with a store to a shared read-only page. That process is
> send a SIGBUS with the usual machine check decoration to specify
> the virtual address of the lost page, together with the scope.
>
> Part 2 sets up to asynchronously take the page with the uncorrected
> error offline to prevent additional machine check faults. H/t to
> Miaohe Lin <[email protected]> and Shuai Xue <[email protected]>
> for pointing me to the existing function to queue a call to
> memory_failure().
>
> On x86 there is some duplicate reporting (because the error is
> also signalled by the memory controller as well as by the core
> that triggered the machine check). Console logs look like this:
>
> [ 1647.723403] mce: [Hardware Error]: Machine check events logged
> Machine check from kernel copy routine
>
> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
> x86 fault handler sends SIGBUS to child process
>
> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
> Async call to memory_failure() from copy on write path

The recovery action might also be handled asynchronously in CMCI uc_decode_notifier
handler signaled by memory controller, right?

I have a one more memory failure log than yours.

[ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory corruption fault at 7fc4bf7cf400
[ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: Recovered
uc_decode_notifier() processes memory controller report

[ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned
Workqueue: events memory_failure_work_func // queued by ghes_do_memory_failure

[ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned
Workqueue: events memory_failure_work_func // queued by __wp_page_copy_user

[ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory corruption fault at 7fc4bf7cf400

Best Regards,
Shuai

>
> [ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
> uc_decode_notifier() processes memory controller report
>
> [ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
> Parent process tries to read poisoned page. Page has been unmapped, so
> #PF handler sends SIGBUS
>
>
> Tony Luck (2):
> mm, hwpoison: Try to recover from copy-on write faults
> mm, hwpoison: When copy-on-write hits poison, take page offline
>
> include/linux/highmem.h | 24 ++++++++++++++++++++++++
> include/linux/mm.h | 5 ++++-
> mm/memory.c | 32 ++++++++++++++++++++++----------
> 3 files changed, 50 insertions(+), 11 deletions(-)
>

Subject: Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults

On Fri, Oct 21, 2022 at 01:01:19PM -0700, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
>
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
>
> I wrapped that neatly into a test at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>
> just enable ACPI error injection and run:
>
> # ./einj_mem-uc -f copy-on-write
>
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.
>
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
>
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>

Thank you for the update. Looks good to me.

Reviewed-by: Naoya Horiguchi <[email protected]>

2022-10-26 05:37:57

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] Copy-on-write poison recovery



在 2022/10/23 PM11:52, Shuai Xue 写道:
>
>
> 在 2022/10/22 AM4:01, Tony Luck 写道:
>> Part 1 deals with the process that triggered the copy on write
>> fault with a store to a shared read-only page. That process is
>> send a SIGBUS with the usual machine check decoration to specify
>> the virtual address of the lost page, together with the scope.
>>
>> Part 2 sets up to asynchronously take the page with the uncorrected
>> error offline to prevent additional machine check faults. H/t to
>> Miaohe Lin <[email protected]> and Shuai Xue <[email protected]>
>> for pointing me to the existing function to queue a call to
>> memory_failure().
>>
>> On x86 there is some duplicate reporting (because the error is
>> also signalled by the memory controller as well as by the core
>> that triggered the machine check). Console logs look like this:
>>
>> [ 1647.723403] mce: [Hardware Error]: Machine check events logged
>> Machine check from kernel copy routine
>>
>> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
>> x86 fault handler sends SIGBUS to child process
>>
>> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
>> Async call to memory_failure() from copy on write path
>
> The recovery action might also be handled asynchronously in CMCI uc_decode_notifier
> handler signaled by memory controller, right?
>
> I have a one more memory failure log than yours.
>
> [ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory corruption fault at 7fc4bf7cf400
> [ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: Recovered
> uc_decode_notifier() processes memory controller report
>
> [ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned
> Workqueue: events memory_failure_work_func // queued by ghes_do_memory_failure
>
> [ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned
> Workqueue: events memory_failure_work_func // queued by __wp_page_copy_user
>
> [ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory corruption fault at 7fc4bf7cf400
>
> Best Regards,
> Shuai

Tested-by: Shuai Xue <[email protected]>

Thank you.
Shuai

>
>>
>> [ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
>> uc_decode_notifier() processes memory controller report
>>
>> [ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
>> Parent process tries to read poisoned page. Page has been unmapped, so
>> #PF handler sends SIGBUS
>>
>>
>> Tony Luck (2):
>> mm, hwpoison: Try to recover from copy-on write faults
>> mm, hwpoison: When copy-on-write hits poison, take page offline
>>
>> include/linux/highmem.h | 24 ++++++++++++++++++++++++
>> include/linux/mm.h | 5 ++++-
>> mm/memory.c | 32 ++++++++++++++++++++++----------
>> 3 files changed, 50 insertions(+), 11 deletions(-)
>>

2022-10-28 16:40:09

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults

>> + vfrom = kmap_local_page(from);
>> + vto = kmap_local_page(to);
>> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
>
> In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
> __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
> with kmsan, so I can easy be wrong.

It looks like that kmsan_unpoison_memory() call was added recently, after I copied
copy_user_highpage() to create copy_mc_user_highpage(). I'm not familiar with
kmsan either. Adding Alexander to this thread since they added that code.

> Anyway, this patch looks good to me. Thanks.
>
> Reviewed-by: Miaohe Lin <[email protected]>

Thanks for the review.

-Tony

2022-10-31 20:26:43

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v4 0/2] Copy-on-write poison recovery

Recover from poison consumption while copying pages
in the kernel for a copy-on-write fault.

Changes since v3:

1) Miaohe Lin <[email protected]> pointed out that a recent change
by Alexander Potapenko <[email protected]> to copy_user_highpage()
added a call to kmsan_unpoison_memory(). Same is needed in my cloned
copy_mc_user_highpage() ... at least in the successful case where the
page was copied with no machine checks.

2) Picked up some additional Reviewed-by and Tested-by tags.

Tony Luck (2):
mm, hwpoison: Try to recover from copy-on write faults
mm, hwpoison: When copy-on-write hits poison, take page offline

include/linux/highmem.h | 26 ++++++++++++++++++++++++++
include/linux/mm.h | 5 ++++-
mm/memory.c | 32 ++++++++++++++++++++++----------
3 files changed, 52 insertions(+), 11 deletions(-)


base-commit: 30a0b95b1335e12efef89dd78518ed3e4a71a763
--
2.37.3


2022-10-31 21:17:57

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v4 1/2] mm, hwpoison: Try to recover from copy-on write faults

If the kernel is copying a page as the result of a copy-on-write
fault and runs into an uncorrectable error, Linux will crash because
it does not have recovery code for this case where poison is consumed
by the kernel.

It is easy to set up a test case. Just inject an error into a private
page, fork(2), and have the child process write to the page.

I wrapped that neatly into a test at:

git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

just enable ACPI error injection and run:

# ./einj_mem-uc -f copy-on-write

Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
on architectures where that is available (currently x86 and powerpc).
When an error is detected during the page copy, return VM_FAULT_HWPOISON
to caller of wp_page_copy(). This propagates up the call stack. Both x86
and powerpc have code in their fault handler to deal with this code by
sending a SIGBUS to the application.

Note that this patch avoids a system crash and signals the process that
triggered the copy-on-write action. It does not take any action for the
memory error that is still in the shared page. To handle that a call to
memory_failure() is needed. But this cannot be done from wp_page_copy()
because it holds mmap_lock(). Perhaps the architecture fault handlers
can deal with this loose end in a subsequent patch?

On Intel/x86 this loose end will often be handled automatically because
the memory controller provides an additional notification of the h/w
poison in memory, the handler for this will call memory_failure(). This
isn't a 100% solution. If there are multiple errors, not all may be
logged in this way.

Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Tested-by: Shuai Xue <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/highmem.h | 26 ++++++++++++++++++++++++++
mm/memory.c | 30 ++++++++++++++++++++----------
2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index e9912da5441b..44242268f53b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -319,6 +319,32 @@ static inline void copy_user_highpage(struct page *to, struct page *from,

#endif

+#ifdef copy_mc_to_kernel
+static inline int copy_mc_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ unsigned long ret;
+ char *vfrom, *vto;
+
+ vfrom = kmap_local_page(from);
+ vto = kmap_local_page(to);
+ ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
+ if (!ret)
+ kmsan_unpoison_memory(page_address(to), PAGE_SIZE);
+ kunmap_local(vto);
+ kunmap_local(vfrom);
+
+ return ret;
+}
+#else
+static inline int copy_mc_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ copy_user_highpage(to, from, vaddr, vma);
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_COPY_HIGHPAGE

static inline void copy_highpage(struct page *to, struct page *from)
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..b6056eef2f72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,10 +2848,16 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
return same;
}

-static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
- struct vm_fault *vmf)
+/*
+ * Return:
+ * 0: copied succeeded
+ * -EHWPOISON: copy failed due to hwpoison in source page
+ * -EAGAIN: copied failed (some other reason)
+ */
+static inline int __wp_page_copy_user(struct page *dst, struct page *src,
+ struct vm_fault *vmf)
{
- bool ret;
+ int ret;
void *kaddr;
void __user *uaddr;
bool locked = false;
@@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
unsigned long addr = vmf->address;

if (likely(src)) {
- copy_user_highpage(dst, src, addr, vma);
- return true;
+ if (copy_mc_user_highpage(dst, src, addr, vma))
+ return -EHWPOISON;
+ return 0;
}

/*
@@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
* and update local tlb only
*/
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = -EAGAIN;
goto pte_unlock;
}

@@ -2913,7 +2920,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
update_mmu_tlb(vma, addr, vmf->pte);
- ret = false;
+ ret = -EAGAIN;
goto pte_unlock;
}

@@ -2932,7 +2939,7 @@ static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
}
}

- ret = true;
+ ret = 0;

pte_unlock:
if (locked)
@@ -3104,6 +3111,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
pte_t entry;
int page_copied = 0;
struct mmu_notifier_range range;
+ int ret;

delayacct_wpcopy_start();

@@ -3121,19 +3129,21 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
if (!new_page)
goto oom;

- if (!__wp_page_copy_user(new_page, old_page, vmf)) {
+ ret = __wp_page_copy_user(new_page, old_page, vmf);
+ if (ret) {
/*
* COW failed, if the fault was solved by other,
* it's fine. If not, userspace would re-fault on
* the same address and we will handle the fault
* from the second attempt.
+ * The -EHWPOISON case will not be retried.
*/
put_page(new_page);
if (old_page)
put_page(old_page);

delayacct_wpcopy_end();
- return 0;
+ return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
}
kmsan_copy_page_meta(new_page, old_page);
}
--
2.37.3


2022-10-31 21:23:14

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline

Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.

It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.

Use memory_failure_queue() to request a call to memory_failure() for the
page with the error.

Also provide a stub version for CONFIG_MEMORY_FAILURE=n

Reviewed-by: Miaohe Lin <[email protected]>
Tested-by: Shuai Xue <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
include/linux/mm.h | 5 ++++-
mm/memory.c | 4 +++-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..03ced659eb58 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3268,7 +3268,6 @@ enum mf_flags {
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
extern int memory_failure(unsigned long pfn, int flags);
-extern void memory_failure_queue(unsigned long pfn, int flags);
extern void memory_failure_queue_kick(int cpu);
extern int unpoison_memory(unsigned long pfn);
extern int sysctl_memory_failure_early_kill;
@@ -3277,8 +3276,12 @@ extern void shake_page(struct page *p);
extern atomic_long_t num_poisoned_pages __read_mostly;
extern int soft_offline_page(unsigned long pfn, int flags);
#ifdef CONFIG_MEMORY_FAILURE
+extern void memory_failure_queue(unsigned long pfn, int flags);
extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
#else
+static inline void memory_failure_queue(unsigned long pfn, int flags)
+{
+}
static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
{
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index b6056eef2f72..eae242351726 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2866,8 +2866,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
unsigned long addr = vmf->address;

if (likely(src)) {
- if (copy_mc_user_highpage(dst, src, addr, vma))
+ if (copy_mc_user_highpage(dst, src, addr, vma)) {
+ memory_failure_queue(page_to_pfn(src), 0);
return -EHWPOISON;
+ }
return 0;
}

--
2.37.3


2022-11-02 15:03:53

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults

On Wed, Nov 2, 2022 at 3:27 PM Alexander Potapenko <[email protected]> wrote:
>
> On Fri, Oct 28, 2022 at 6:14 PM Luck, Tony <[email protected]> wrote:
> >
> > >> + vfrom = kmap_local_page(from);
> > >> + vto = kmap_local_page(to);
> > >> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > >
> > > In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
> > > __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
> > > with kmsan, so I can easy be wrong.
> >
> > It looks like that kmsan_unpoison_memory() call was added recently, after I copied
> > copy_user_highpage() to create copy_mc_user_highpage(). I'm not familiar with
> > kmsan either. Adding Alexander to this thread since they added that code.
> >
>
> Given that copy_mc_user_highpage() replaces one of the calls to
> copy_user_highpage(), it sure makes sense to call
> kmsan_unpoison_memory() here.
>
> KMSAN tracks the status (initialized/uninitialized) of the kernel
> memory. Newly allocated memory is marked uninitialized, copying memory
> preserves its status, and writing constants to that memory makes it
> initialized.
> Userspace memory does not have its status tracked by KMSAN, so when
> values are copied from the userspace, KMSAN does nothing with their
> status.
> That's why every (successful) copy_from_user event should be followed
> by kmsan_unpoison_memory(), which marks the corresponding kernel
> buffer initialized - otherwise the status of that buffer may get
> stale.
>
> > > Anyway, this patch looks good to me. Thanks.
> > >
> > > Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>

> >
> > Thanks for the review.
> >
> > -Tony
>
>
>
> --
> Alexander Potapenko
> Software Engineer
>
> Google Germany GmbH
> Erika-Mann-Straße, 33
> 80636 München
>
> Geschäftsführer: Paul Manicle, Liana Sebastian
> Registergericht und -nummer: Hamburg, HRB 86891
> Sitz der Gesellschaft: Hamburg



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2022-11-02 15:44:53

by Alexander Potapenko

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mm, hwpoison: Try to recover from copy-on write faults

On Fri, Oct 28, 2022 at 6:14 PM Luck, Tony <[email protected]> wrote:
>
> >> + vfrom = kmap_local_page(from);
> >> + vto = kmap_local_page(to);
> >> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> >
> > In copy_user_highpage(), kmsan_unpoison_memory(page_address(to), PAGE_SIZE) is done after the copy when
> > __HAVE_ARCH_COPY_USER_HIGHPAGE isn't defined. Do we need to do something similar here? But I'm not familiar
> > with kmsan, so I can easy be wrong.
>
> It looks like that kmsan_unpoison_memory() call was added recently, after I copied
> copy_user_highpage() to create copy_mc_user_highpage(). I'm not familiar with
> kmsan either. Adding Alexander to this thread since they added that code.
>

Given that copy_mc_user_highpage() replaces one of the calls to
copy_user_highpage(), it sure makes sense to call
kmsan_unpoison_memory() here.

KMSAN tracks the status (initialized/uninitialized) of the kernel
memory. Newly allocated memory is marked uninitialized, copying memory
preserves its status, and writing constants to that memory makes it
initialized.
Userspace memory does not have its status tracked by KMSAN, so when
values are copied from the userspace, KMSAN does nothing with their
status.
That's why every (successful) copy_from_user event should be followed
by kmsan_unpoison_memory(), which marks the corresponding kernel
buffer initialized - otherwise the status of that buffer may get
stale.

> > Anyway, this patch looks good to me. Thanks.
> >
> > Reviewed-by: Miaohe Lin <[email protected]>
>
> Thanks for the review.
>
> -Tony



--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

2023-05-18 21:55:12

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline

Hi, Tony, Greg,

Does it make sense to include this patch series in the
5.15.y LTS kernel? I just checked: it's not in 5.15.112.

Thanks!
-jane


On 10/31/2022 1:10 PM, Tony Luck wrote:
> Cannot call memory_failure() directly from the fault handler because
> mmap_lock (and others) are held.
>
> It is important, but not urgent, to mark the source page as h/w poisoned
> and unmap it from other tasks.
>
> Use memory_failure_queue() to request a call to memory_failure() for the
> page with the error.
>
> Also provide a stub version for CONFIG_MEMORY_FAILURE=n
>
> Reviewed-by: Miaohe Lin <[email protected]>
> Tested-by: Shuai Xue <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/mm.h | 5 ++++-
> mm/memory.c | 4 +++-
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bbcccbc5565..03ced659eb58 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3268,7 +3268,6 @@ enum mf_flags {
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> extern int memory_failure(unsigned long pfn, int flags);
> -extern void memory_failure_queue(unsigned long pfn, int flags);
> extern void memory_failure_queue_kick(int cpu);
> extern int unpoison_memory(unsigned long pfn);
> extern int sysctl_memory_failure_early_kill;
> @@ -3277,8 +3276,12 @@ extern void shake_page(struct page *p);
> extern atomic_long_t num_poisoned_pages __read_mostly;
> extern int soft_offline_page(unsigned long pfn, int flags);
> #ifdef CONFIG_MEMORY_FAILURE
> +extern void memory_failure_queue(unsigned long pfn, int flags);
> extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
> #else
> +static inline void memory_failure_queue(unsigned long pfn, int flags)
> +{
> +}
> static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> {
> return 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..eae242351726 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2866,8 +2866,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> unsigned long addr = vmf->address;
>
> if (likely(src)) {
> - if (copy_mc_user_highpage(dst, src, addr, vma))
> + if (copy_mc_user_highpage(dst, src, addr, vma)) {
> + memory_failure_queue(page_to_pfn(src), 0);
> return -EHWPOISON;
> + }
> return 0;
> }
>

2023-05-18 22:27:06

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline

Jane,

You should do some analysis and testing to make sure that applying just this patch to 5.15.y works. There has been a bunch of recovery changes and I'm not sure if this depends on other changes.

-Tony

-----Original Message-----
From: Jane Chu <[email protected]>
Sent: Thursday, May 18, 2023 2:50 PM
To: Luck, Tony <[email protected]>; Andrew Morton <[email protected]>; Greg Kroah-Hartman <[email protected]>
Cc: Alexander Potapenko <[email protected]>; Naoya Horiguchi <[email protected]>; Miaohe Lin <[email protected]>; Matthew Wilcox <[email protected]>; Shuai Xue <[email protected]>; Williams, Dan J <[email protected]>; [email protected]; [email protected]
Subject: Re: [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline

Hi, Tony, Greg,

Does it make sense to include this patch series in the
5.15.y LTS kernel? I just checked: it's not in 5.15.112.

Thanks!
-jane


On 10/31/2022 1:10 PM, Tony Luck wrote:
> Cannot call memory_failure() directly from the fault handler because
> mmap_lock (and others) are held.
>
> It is important, but not urgent, to mark the source page as h/w poisoned
> and unmap it from other tasks.
>
> Use memory_failure_queue() to request a call to memory_failure() for the
> page with the error.
>
> Also provide a stub version for CONFIG_MEMORY_FAILURE=n
>
> Reviewed-by: Miaohe Lin <[email protected]>
> Tested-by: Shuai Xue <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/mm.h | 5 ++++-
> mm/memory.c | 4 +++-
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bbcccbc5565..03ced659eb58 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3268,7 +3268,6 @@ enum mf_flags {
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> extern int memory_failure(unsigned long pfn, int flags);
> -extern void memory_failure_queue(unsigned long pfn, int flags);
> extern void memory_failure_queue_kick(int cpu);
> extern int unpoison_memory(unsigned long pfn);
> extern int sysctl_memory_failure_early_kill;
> @@ -3277,8 +3276,12 @@ extern void shake_page(struct page *p);
> extern atomic_long_t num_poisoned_pages __read_mostly;
> extern int soft_offline_page(unsigned long pfn, int flags);
> #ifdef CONFIG_MEMORY_FAILURE
> +extern void memory_failure_queue(unsigned long pfn, int flags);
> extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags);
> #else
> +static inline void memory_failure_queue(unsigned long pfn, int flags)
> +{
> +}
> static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> {
> return 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..eae242351726 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2866,8 +2866,10 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> unsigned long addr = vmf->address;
>
> if (likely(src)) {
> - if (copy_mc_user_highpage(dst, src, addr, vma))
> + if (copy_mc_user_highpage(dst, src, addr, vma)) {
> + memory_failure_queue(page_to_pfn(src), 0);
> return -EHWPOISON;
> + }
> return 0;
> }
>

2023-05-19 07:35:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline

On Thu, May 18, 2023 at 02:49:44PM -0700, Jane Chu wrote:
> Hi, Tony, Greg,
>
> Does it make sense to include this patch series in the
> 5.15.y LTS kernel? I just checked: it's not in 5.15.112.

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>