LinuxLists.cc - [PATCH] mm: clean up hwpoison page cache page in fault path

2022-02-11 22:48:06

Subject: [PATCH] mm: clean up hwpoison page cache page in fault path

Sometimes the page offlining code can leave behind a hwpoisoned clean
page cache page. This can lead to programs being killed over and over
and over again as they fault in the hwpoisoned page, get killed, and
then get re-spawned by whatever wanted to run them.

This is particularly embarrassing when the page was offlined due to
having too many corrected memory errors. Now we are killing tasks
due to them trying to access memory that probably isn't even corrupted.

This problem can be avoided by invalidating the page from the page
fault handler, which already has a branch for dealing with these
kinds of pages. With this patch we simply pretend the page fault
was successful if the page was invalidated, return to userspace,
incur another page fault, read in the file from disk (to a new
memory page), and then everything works again.

Signed-off-by: Rik van Riel <[email protected]>

diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..2300358e268c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
return ret;

if (unlikely(PageHWPoison(vmf->page))) {
- if (ret & VM_FAULT_LOCKED)
+ int poisonret = VM_FAULT_HWPOISON;
+ if (ret & VM_FAULT_LOCKED) {
+ /* Retry if a clean page was removed from the cache. */
+ if (invalidate_inode_page(vmf->page))
+ poisonret = 0;
unlock_page(vmf->page);
+ }
put_page(vmf->page);
vmf->page = NULL;
- return VM_FAULT_HWPOISON;
+ return poisonret;
}

if (unlikely(!(ret & VM_FAULT_LOCKED)))

--
All rights reversed.

2022-02-14 02:32:33

by Rik van Riel

[permalink] [raw]

Subject: Re: [PATCH] mm: clean up hwpoison page cache page in fault path

On Sun, 2022-02-13 at 00:56 -0800, John Hubbard wrote:
> On Fri, 11 Feb 2022, Rik van Riel wrote:
>
> >
> > This is particularly embarrassing when the page was offlined due to
> > having too many corrected memory errors. Now we are killing tasks
> > due to them trying to access memory that probably isn't even
> > corrupted.
>
> I'd recommend deleting that paragraph entirely. It's a separate
> question, and it is not necessarily an accurate assessment of that
> question either: the engineers who set the thresholds for "too many
> corrected errors" may not--in fact, probably *will not*--agree with
> your
> feeling that the memory is still working and reliable!

Fair enough. We try to offline pages before we get to
a point where the error correction might no longer be
able to correct the error correctly, but I am pretty
sure I have seen a few odd kernel crashes following a
stream of corrected errors that strongly suggested
corruption had in fact happened.

I'll take that paragraph out if anybody else asks
for further changes for v3 of the patch.

--
All Rights Reversed.

Attachments:

signature.asc (499.00 B)
This is a digitally signed message part

2022-02-14 09:17:32

by Miaohe Lin

[permalink] [raw]

Subject: Re: [PATCH] mm: clean up hwpoison page cache page in fault path

On 2022/2/12 6:05, Rik van Riel wrote:
> Sometimes the page offlining code can leave behind a hwpoisoned clean
> page cache page. This can lead to programs being killed over and over

Yep, __soft_offline_page tries to invalidate_inode_page in a lightway.

> and over again as they fault in the hwpoisoned page, get killed, and
> then get re-spawned by whatever wanted to run them.
>
> This is particularly embarrassing when the page was offlined due to
> having too many corrected memory errors. Now we are killing tasks
> due to them trying to access memory that probably isn't even corrupted.
>
> This problem can be avoided by invalidating the page from the page
> fault handler, which already has a branch for dealing with these
> kinds of pages. With this patch we simply pretend the page fault
> was successful if the page was invalidated, return to userspace,
> incur another page fault, read in the file from disk (to a new
> memory page), and then everything works again.
>
> Signed-off-by: Rik van Riel <[email protected]>

Good catch! This looks good to me. Thanks.

Reviewed-by: Miaohe Lin <[email protected]>

>
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..2300358e268c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> return ret;
>
> if (unlikely(PageHWPoison(vmf->page))) {
> - if (ret & VM_FAULT_LOCKED)
> + int poisonret = VM_FAULT_HWPOISON;
> + if (ret & VM_FAULT_LOCKED) {
> + /* Retry if a clean page was removed from the cache. */
> + if (invalidate_inode_page(vmf->page))
> + poisonret = 0;
> unlock_page(vmf->page);
> + }
> put_page(vmf->page);
> vmf->page = NULL;
> - return VM_FAULT_HWPOISON;
> + return poisonret;
> }
>
> if (unlikely(!(ret & VM_FAULT_LOCKED)))
>
>

2022-02-14 10:08:51

by John Hubbard

[permalink] [raw]

Subject: Re: [PATCH] mm: clean up hwpoison page cache page in fault path

On Fri, 11 Feb 2022, Rik van Riel wrote:
> Sometimes the page offlining code can leave behind a hwpoisoned clean
> page cache page. This can lead to programs being killed over and over
> and over again as they fault in the hwpoisoned page, get killed, and
> then get re-spawned by whatever wanted to run them.

Hi Rik,

This looks good. Some minor suggestions below:

>
> This is particularly embarrassing when the page was offlined due to
> having too many corrected memory errors. Now we are killing tasks
> due to them trying to access memory that probably isn't even corrupted.

I'd recommend deleting that paragraph entirely. It's a separate
question, and it is not necessarily an accurate assessment of that
question either: the engineers who set the thresholds for "too many
corrected errors" may not--in fact, probably *will not*--agree with your
feeling that the memory is still working and reliable!

>
> This problem can be avoided by invalidating the page from the page
> fault handler, which already has a branch for dealing with these
> kinds of pages. With this patch we simply pretend the page fault
> was successful if the page was invalidated, return to userspace,
> incur another page fault, read in the file from disk (to a new
> memory page), and then everything works again.

Very nice write-up here, it's rare that I get to read such a short, yet
clear explanation. Thank you for that. :)

>
> Signed-off-by: Rik van Riel <[email protected]>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..2300358e268c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3871,11 +3871,16 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> return ret;
>
> if (unlikely(PageHWPoison(vmf->page))) {
> - if (ret & VM_FAULT_LOCKED)
> + int poisonret = VM_FAULT_HWPOISON;
> + if (ret & VM_FAULT_LOCKED) {

How about changing those two lines to these three (note the newline
after the declaration):

vm_fault_t poison_ret = VM_FAULT_HWPOISON;

if (ret & VM_FAULT_LOCKED) {

...which should fix the krobot complaint, and the underscore is just a
very tiny readability improvement while we're there.

> + /* Retry if a clean page was removed from the cache. */

This is more cryptic than necessary, and in fact you've already provided
a write-up, so how about something like this, instead of the above:

/*
* Avoid refaulting on the same poisoned page
* forever: attempt to invalidate the page. If that
* succeeds, then pretend the page fault was successful,
* return to userspace, incur another page fault, read
* in the file from disk (to a new page),and then
* everything works again.
*/

> + if (invalidate_inode_page(vmf->page))
> + poisonret = 0;
> unlock_page(vmf->page);
> + }
> put_page(vmf->page);
> vmf->page = NULL;
> - return VM_FAULT_HWPOISON;
> + return poisonret;
> }
>
> if (unlikely(!(ret & VM_FAULT_LOCKED)))
>
>
> --
> All rights reversed.

Anyway, those are all documentation nits, except for the vm_fault_t,
which is a declaration nit :) , so I'm comfortable saying:

Reviewed-by: John Hubbard <[email protected]>

thanks,
--
John Hubbard
NVIDIA

2022-02-14 20:47:36

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH] mm: clean up hwpoison page cache page in fault path

Hi Rik,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]

url: https://github.com/0day-ci/linux/commits/Rik-van-Riel/mm-clean-up-hwpoison-page-cache-page-in-fault-path/20220212-060643
base: https://github.com/hnaz/linux-mm master
config: sparc-randconfig-s031-20220211 (https://download.01.org/0day-ci/archive/20220212/[email protected]/config)
compiler: sparc-linux-gcc (GCC) 11.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-dirty
# https://github.com/0day-ci/linux/commit/abd960cdbc9487dbf0a0dc3b2395825a38f8fa44
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Rik-van-Riel/mm-clean-up-hwpoison-page-cache-page-in-fault-path/20220212-060643
git checkout abd960cdbc9487dbf0a0dc3b2395825a38f8fa44
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

sparse warnings: (new ones prefixed by >>)
>> mm/memory.c:3913:33: sparse: sparse: incorrect type in initializer (different base types) @@ expected int poisonret @@ got restricted vm_fault_t @@
mm/memory.c:3913:33: sparse: expected int poisonret
mm/memory.c:3913:33: sparse: got restricted vm_fault_t
>> mm/memory.c:3922:24: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted vm_fault_t @@ got int [assigned] poisonret @@
mm/memory.c:3922:24: sparse: expected restricted vm_fault_t
mm/memory.c:3922:24: sparse: got int [assigned] poisonret
mm/memory.c:1024:17: sparse: sparse: context imbalance in 'copy_pte_range' - different lock contexts for basic block
mm/memory.c:1740:16: sparse: sparse: context imbalance in '__get_locked_pte' - different lock contexts for basic block
mm/memory.c:1788:9: sparse: sparse: context imbalance in 'insert_page' - different lock contexts for basic block
mm/memory.c:2290:17: sparse: sparse: context imbalance in 'remap_pte_range' - different lock contexts for basic block
mm/memory.c:2546:17: sparse: sparse: context imbalance in 'apply_to_pte_range' - unexpected unlock
mm/memory.c:2834:17: sparse: sparse: context imbalance in 'wp_page_copy' - unexpected unlock
mm/memory.c:3173:17: sparse: sparse: context imbalance in 'wp_pfn_shared' - unexpected unlock
mm/memory.c:3236:19: sparse: sparse: context imbalance in 'do_wp_page' - different lock contexts for basic block
mm/memory.c:4939:5: sparse: sparse: context imbalance in 'follow_invalidate_pte' - different lock contexts for basic block
mm/memory.c: note: in included file (through arch/sparc/include/asm/pgtable.h, include/linux/pgtable.h, include/linux/mm.h):
arch/sparc/include/asm/pgtable_32.h:275:29: sparse: sparse: context imbalance in 'follow_pfn' - unexpected unlock

vim +3913 mm/memory.c

3875
3876 /*
3877 * The mmap_lock must have been held on entry, and may have been
3878 * released depending on flags and vma->vm_ops->fault() return value.
3879 * See filemap_fault() and __lock_page_retry().
3880 */
3881 static vm_fault_t __do_fault(struct vm_fault *vmf)
3882 {
3883 struct vm_area_struct *vma = vmf->vma;
3884 vm_fault_t ret;
3885
3886 /*
3887 * Preallocate pte before we take page_lock because this might lead to
3888 * deadlocks for memcg reclaim which waits for pages under writeback:
3889 * lock_page(A)
3890 * SetPageWriteback(A)
3891 * unlock_page(A)
3892 * lock_page(B)
3893 * lock_page(B)
3894 * pte_alloc_one
3895 * shrink_page_list
3896 * wait_on_page_writeback(A)
3897 * SetPageWriteback(B)
3898 * unlock_page(B)
3899 * # flush A, B to clear the writeback
3900 */
3901 if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
3902 vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
3903 if (!vmf->prealloc_pte)
3904 return VM_FAULT_OOM;
3905 }
3906
3907 ret = vma->vm_ops->fault(vmf);
3908 if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
3909 VM_FAULT_DONE_COW)))
3910 return ret;
3911
3912 if (unlikely(PageHWPoison(vmf->page))) {
> 3913 int poisonret = VM_FAULT_HWPOISON;
3914 if (ret & VM_FAULT_LOCKED) {
3915 /* Retry if a clean page was removed from the cache. */
3916 if (invalidate_inode_page(vmf->page))
3917 poisonret = 0;
3918 unlock_page(vmf->page);
3919 }
3920 put_page(vmf->page);
3921 vmf->page = NULL;
> 3922 return poisonret;
3923 }
3924
3925 if (unlikely(!(ret & VM_FAULT_LOCKED)))
3926 lock_page(vmf->page);
3927 else
3928 VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
3929
3930 return ret;
3931 }
3932

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]