2021-09-14 18:38:50

by Yang Shi

Subject: [RFC PATCH 0/4] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)


When discussing the patch that splits page cache THP in order to offline the
poisoned page, Naoya mentioned a bigger problem [1] that prevents this
from working: the page cache page will be truncated if uncorrectable
errors happen. Looking into this more deeply, it turns out this approach
(truncating the poisoned page) may incur silent data loss for all
non-readonly filesystems if the page is dirty. It is even worse for
in-memory filesystems, e.g. shmem/tmpfs, since the data blocks are
actually gone.

To solve this problem we could keep the poisoned dirty page in the page cache
and notify users on any later access, e.g. page fault, read/write, etc. Clean
pages can be truncated as is, since they can be reread from disk later on.

The consequence is that filesystems may find a poisoned page and manipulate it
as a healthy page, since filesystems don't actually check whether the page is
poisoned in any of the relevant paths except page fault. In general, we need
to make the filesystems aware of poisoned pages before we can keep poisoned
pages in the page cache in order to solve the data loss problem.

To make filesystems aware of poisoned pages we should consider:
- The page should not be written back: clearing the dirty flag prevents
writeback.
- The page should not be dropped (it shows as a clean page) by drop_caches or
other callers: the refcount pin from hwpoison prevents invalidation (called
by cache drop, inode cache shrinking, etc.), but it doesn't prevent
invalidation in the DIO path.
- The page should still be able to be truncated/hole punched/unlinked: this
works as is.
- Users should be notified when the page is accessed, e.g. by read/write, page
fault and other paths (compression, encryption, etc).

The scope of the last one is huge, since almost all filesystems need to do it
once a page is returned from a page cache lookup. There are a couple of
options to do it:

1. Check the hwpoison flag in every path; the most straightforward way.
2. Return NULL for a poisoned page from page cache lookup. Most callsites
already check for a NULL return, so this should need the least work, I
think. But the error handling in filesystems would just return -ENOMEM,
and that error code would obviously confuse users.
3. To improve on #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
but this involves a significant amount of code change as well, since all
the paths need to check whether the pointer is an ERR_PTR, just like
option #1.

I prototyped both #1 and #3, but #3 seems to require more changes than #1.
For #3, an ERR_PTR is returned, so every caller needs to check the return
value or an invalid pointer may be dereferenced. But not all callers really
care about the content of the page; for example, a partial truncate just
zeroes the truncated range within one page. Such paths need additional
modification if an ERR_PTR is returned. And if callers have their own way to
handle problematic pages, we would need to add a new FGP flag to tell the FGP
functions to return the page pointer anyway.

This may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and very hard to debug. It seems this problem
was briefly discussed before, but no action was taken at that time. [2]

As the above investigation shows, it would take a huge amount of work to
solve the potential data loss for all filesystems. But it is much easier for
in-memory filesystems, and such filesystems actually suffer more than others
since even the data blocks are gone due to truncation. So this patchset
starts with shmem/tmpfs, taking option #1.

Patches #1 and #2: fix bugs in page fault and khugepaged; patch #2 also does
some preparation for the later patches.
Patch #3: keep the poisoned page in the page cache and handle this case in
all the paths.
Patch #4: the previous patches unblock page cache THP split, so this patch
adds page cache THP split support.


[1] https://lore.kernel.org/linux-mm/CAHbLzkqNPBh_sK09qfr4yu4WTFOzRy+MKj+PA7iG-adzi9zGsg@mail.gmail.com/T/#m0e959283380156f1d064456af01ae51fdff91265
[2] https://lore.kernel.org/lkml/[email protected]/


2021-09-14 18:39:04

by Yang Shi

Subject: [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens

The current behavior of memory failure is to truncate the page cache
regardless of whether the page is dirty or clean. If the page is dirty,
later accesses will get obsolete data from disk without any notification
to the users. This may cause silent data loss. It is even worse for shmem,
since shmem is an in-memory filesystem; truncating the page cache means
discarding the data blocks, so a later read would return all zeroes.

The right approach is to keep the corrupted page in the page cache; any
later access returns an error for syscalls or SIGBUS for page faults,
until the file is truncated, hole punched or removed. Regular storage
backed filesystems would be more complicated, so this patch focuses on
shmem. This also unblocks support for soft offlining shmem THP.

Signed-off-by: Yang Shi <[email protected]>
---
mm/memory-failure.c | 3 ++-
mm/shmem.c | 25 +++++++++++++++++++++++--
2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 54879c339024..3e06cb9d5121 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1101,7 +1101,8 @@ static int page_action(struct page_state *ps, struct page *p,
result = ps->action(p, pfn);

count = page_count(p) - 1;
- if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
+ if ((ps->action == me_swapcache_dirty && result == MF_DELAYED) ||
+ (ps->action == me_pagecache_dirty && result == MF_FAILED))
count--;
if (count > 0) {
pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
diff --git a/mm/shmem.c b/mm/shmem.c
index 88742953532c..ec33f4f7173d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
struct inode *inode = mapping->host;
struct shmem_inode_info *info = SHMEM_I(inode);
pgoff_t index = pos >> PAGE_SHIFT;
+ int ret = 0;

/* i_rwsem is held by caller */
if (unlikely(info->seals & (F_SEAL_GROW |
@@ -2466,7 +2467,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
return -EPERM;
}

- return shmem_getpage(inode, index, pagep, SGP_WRITE);
+ ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+
+ if (!ret) {
+ if (*pagep) {
+ if (PageHWPoison(*pagep)) {
+ unlock_page(*pagep);
+ put_page(*pagep);
+ ret = -EIO;
+ }
+ }
+ }
+
+ return ret;
}

static int
@@ -2555,6 +2568,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
unlock_page(page);
}

+ if (page && PageHWPoison(page)) {
+ error = -EIO;
+ break;
+ }
+
/*
* We must evaluate after, since reads (unlike writes)
* are called without i_rwsem protection against truncate
@@ -3782,7 +3800,6 @@ const struct address_space_operations shmem_aops = {
#ifdef CONFIG_MIGRATION
.migratepage = migrate_page,
#endif
- .error_remove_page = generic_error_remove_page,
};
EXPORT_SYMBOL(shmem_aops);

@@ -4193,6 +4210,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
page = ERR_PTR(error);
else
unlock_page(page);
+
+ if (!IS_ERR(page) && PageHWPoison(page))
+ page = NULL;
+
return page;
#else
/*
--
2.26.2

2021-09-14 18:39:49

by Yang Shi

Subject: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page

khugepaged does check whether the page is on the LRU, but it doesn't hold
the page lock at that point, and it doesn't check again after taking the
page lock. So it may race with others, e.g. the reclaimer, migration, etc.
All of them isolate the page from the LRU, then lock the page, then do
something.

But such a page could still pass the refcount check done by khugepaged and
proceed to collapse. Typically such a race is not fatal, but if the page has
been isolated from the LRU before khugepaged saw it, it likely means the
page is not suitable for collapse for now.

The other, more fatal case: the following patch will keep the poisoned page
in the page cache for shmem, so khugepaged may collapse a poisoned page
since the refcount check could pass. The 3 refcounts come from:
- hwpoison
- page cache
- khugepaged

Since the page is not on the LRU, no refcount is added by LRU isolation.

This is definitely not expected. Checking whether the page is on the LRU
after taking the page lock helps serialize against the hwpoison handler.

There is still a small race window between setting the hwpoison flag and
bumping the refcount in the hwpoison handler. It could be closed by checking
the hwpoison flag in khugepaged, but this race seems unlikely to happen in
real-life workloads. So just check the LRU flag for now to avoid
over-engineering.

Signed-off-by: Yang Shi <[email protected]>
---
mm/khugepaged.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc579f724..bdc161dc27dc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
goto out_unlock;
}

+ /* The hwpoisoned page is off LRU but in page cache */
+ if (!PageLRU(page)) {
+ result = SCAN_PAGE_LRU;
+ goto out_unlock;
+ }
+
if (isolate_lru_page(page)) {
result = SCAN_DEL_PAGE_LRU;
goto out_unlock;
--
2.26.2

2021-09-14 18:40:43

by Yang Shi

[permalink] [raw]
Subject: [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly

Currently hwpoison doesn't handle non-anonymous THP, but THP support for
tmpfs and the read-only file cache has been available since v4.8. Such pages
can be offlined by splitting the THP, just like anonymous THP.

Signed-off-by: Yang Shi <[email protected]>
---
mm/memory-failure.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3e06cb9d5121..6f72aab8ec4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1150,13 +1150,16 @@ static int __get_hwpoison_page(struct page *page)

if (PageTransHuge(head)) {
/*
- * Non anonymous thp exists only in allocation/free time. We
- * can't handle such a case correctly, so let's give it up.
- * This should be better than triggering BUG_ON when kernel
- * tries to touch the "partially handled" page.
+ * We can't handle allocating or freeing THPs, so let's give
+ * it up. This should be better than triggering BUG_ON when
+ * kernel tries to touch the "partially handled" page.
+ *
+ * page->mapping won't be initialized until the page is added
+ * to rmap or page cache. Use this as an indicator for if
+ * this is an instantiated page.
*/
- if (!PageAnon(head)) {
- pr_err("Memory failure: %#lx: non anonymous thp\n",
+ if (!head->mapping) {
+ pr_err("Memory failure: %#lx: non instantiated thp\n",
page_to_pfn(page));
return 0;
}
@@ -1415,12 +1418,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
static int try_to_split_thp_page(struct page *page, const char *msg)
{
lock_page(page);
- if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+ if (!page->mapping || unlikely(split_huge_page(page))) {
unsigned long pfn = page_to_pfn(page);

unlock_page(page);
- if (!PageAnon(page))
- pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
+ if (!page->mapping)
+ pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
else
pr_info("%s: %#lx: thp split failed\n", msg, pfn);
put_page(page);
--
2.26.2

2021-09-15 11:50:42

by Kirill A. Shutemov

Subject: Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page

On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> The khugepaged does check if the page is on LRU or not but it doesn't
> hold page lock. And it doesn't check this again after holding page
> lock. So it may race with some others, e.g. reclaimer, migration, etc.
> All of them isolates page from LRU then lock the page then do something.
>
> But it could pass the refcount check done by khugepaged to proceed
> collapse. Typically such race is not fatal. But if the page has been
> isolated from LRU before khugepaged it likely means the page may be not
> suitable for collapse for now.
>
> The other more fatal case is the following patch will keep the poisoned
> page in page cache for shmem, so khugepaged may collapse a poisoned page
> since the refcount check could pass. 3 refcounts come from:
> - hwpoison
> - page cache
> - khugepaged
>
> Since it is not on LRU so no refcount is incremented from LRU isolation.
>
> This is definitely not expected. Checking if it is on LRU or not after
> holding page lock could help serialize against hwpoison handler.
>
> But there is still a small race window between setting hwpoison flag and
> bump refcount in hwpoison handler. It could be closed by checking
> hwpoison flag in khugepaged, however this race seems unlikely to happen
> in real life workload. So just check LRU flag for now to avoid
> over-engineering.
>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> mm/khugepaged.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 045cc579f724..bdc161dc27dc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
> goto out_unlock;
> }
>
> + /* The hwpoisoned page is off LRU but in page cache */
> + if (!PageLRU(page)) {
> + result = SCAN_PAGE_LRU;
> + goto out_unlock;
> + }
> +
> if (isolate_lru_page(page)) {

isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
and we get here.

> result = SCAN_DEL_PAGE_LRU;
> goto out_unlock;
> --
> 2.26.2
>
>

--
Kirill A. Shutemov

2021-09-15 17:52:38

by Yang Shi

Subject: Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page

On Wed, Sep 15, 2021 at 4:49 AM Kirill A. Shutemov <[email protected]> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> > The khugepaged does check if the page is on LRU or not but it doesn't
> > hold page lock. And it doesn't check this again after holding page
> > lock. So it may race with some others, e.g. reclaimer, migration, etc.
> > All of them isolates page from LRU then lock the page then do something.
> >
> > But it could pass the refcount check done by khugepaged to proceed
> > collapse. Typically such race is not fatal. But if the page has been
> > isolated from LRU before khugepaged it likely means the page may be not
> > suitable for collapse for now.
> >
> > The other more fatal case is the following patch will keep the poisoned
> > page in page cache for shmem, so khugepaged may collapse a poisoned page
> > since the refcount check could pass. 3 refcounts come from:
> > - hwpoison
> > - page cache
> > - khugepaged
> >
> > Since it is not on LRU so no refcount is incremented from LRU isolation.
> >
> > This is definitely not expected. Checking if it is on LRU or not after
> > holding page lock could help serialize against hwpoison handler.
> >
> > But there is still a small race window between setting hwpoison flag and
> > bump refcount in hwpoison handler. It could be closed by checking
> > hwpoison flag in khugepaged, however this race seems unlikely to happen
> > in real life workload. So just check LRU flag for now to avoid
> > over-engineering.
> >
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > mm/khugepaged.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 045cc579f724..bdc161dc27dc 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
> > goto out_unlock;
> > }
> >
> > + /* The hwpoisoned page is off LRU but in page cache */
> > + if (!PageLRU(page)) {
> > + result = SCAN_PAGE_LRU;
> > + goto out_unlock;
> > + }
> > +
> > if (isolate_lru_page(page)) {
>
> isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
> and we get here.

Hmm... you are definitely right. How could I have missed this point?

It might be because I messed up the page state with some tests that do a
hole punch then reread the same index. That could drop the poisoned page
and let the collapse succeed. But I'm not sure. Anyway, I couldn't figure
out how the poisoned page could be collapsed. It seems impossible. I will
drop this patch.

>
> > result = SCAN_DEL_PAGE_LRU;
> > goto out_unlock;
> > --
> > 2.26.2
> >
> >
>
> --
> Kirill A. Shutemov

2021-09-15 23:01:40

by Yang Shi

Subject: Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page

On Wed, Sep 15, 2021 at 10:48 AM Yang Shi <[email protected]> wrote:
>
> On Wed, Sep 15, 2021 at 4:49 AM Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> > > The khugepaged does check if the page is on LRU or not but it doesn't
> > > hold page lock. And it doesn't check this again after holding page
> > > lock. So it may race with some others, e.g. reclaimer, migration, etc.
> > > All of them isolates page from LRU then lock the page then do something.
> > >
> > > But it could pass the refcount check done by khugepaged to proceed
> > > collapse. Typically such race is not fatal. But if the page has been
> > > isolated from LRU before khugepaged it likely means the page may be not
> > > suitable for collapse for now.
> > >
> > > The other more fatal case is the following patch will keep the poisoned
> > > page in page cache for shmem, so khugepaged may collapse a poisoned page
> > > since the refcount check could pass. 3 refcounts come from:
> > > - hwpoison
> > > - page cache
> > > - khugepaged
> > >
> > > Since it is not on LRU so no refcount is incremented from LRU isolation.
> > >
> > > This is definitely not expected. Checking if it is on LRU or not after
> > > holding page lock could help serialize against hwpoison handler.
> > >
> > > But there is still a small race window between setting hwpoison flag and
> > > bump refcount in hwpoison handler. It could be closed by checking
> > > hwpoison flag in khugepaged, however this race seems unlikely to happen
> > > in real life workload. So just check LRU flag for now to avoid
> > > over-engineering.
> > >
> > > Signed-off-by: Yang Shi <[email protected]>
> > > ---
> > > mm/khugepaged.c | 6 ++++++
> > > 1 file changed, 6 insertions(+)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 045cc579f724..bdc161dc27dc 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
> > > goto out_unlock;
> > > }
> > >
> > > + /* The hwpoisoned page is off LRU but in page cache */
> > > + if (!PageLRU(page)) {
> > > + result = SCAN_PAGE_LRU;
> > > + goto out_unlock;
> > > + }
> > > +
> > > if (isolate_lru_page(page)) {
> >
> > isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
> > and we get here.
>
> Hmm... you are definitely right. How could I miss this point.
>
> It might be because of I messed up the page state by some tests which
> may do hole punch then reread the same index. That could drop the
> poisoned page then collapse succeed. But I'm not sure. Anyway I didn't
> figure out how the poisoned page could be collapsed. It seems
> impossible. I will drop this patch.

I think I figured out the problem. It happens after the page cache split
patch, and only if the hwpoisoned page is not the head page. THP split
unfreezes the refcount of tail pages to 2 (restoring the refcount from the
page cache), then decrements it to 1. The refcount pin from hwpoison is
gone and the page is still on the LRU. Then khugepaged locks the page
before hwpoison does, and the refcount is exactly what khugepaged expects.

The worse thing is that this problem seems applicable to anonymous pages
too. Once an anonymous THP is split by hwpoison, the pin from hwpoison is
gone as well and the refcount is 1 (from the PTE map). Then khugepaged
could collapse it into a huge page again. That may incur data corruption.

And the poisoned page may be freed back to the buddy allocator because of
the lost refcount pin.

If the poisoned page is the head page, the code is fine since hwpoison
doesn't put the refcount for the head page after the split.

The fix is simple: just keep the refcount pin for the hwpoisoned subpage.

>
> >
> > > result = SCAN_DEL_PAGE_LRU;
> > > goto out_unlock;
> > > --
> > > 2.26.2
> > >
> > >
> >
> > --
> > Kirill A. Shutemov

2021-09-15 23:11:37

by Yang Shi

Subject: Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page

On Wed, Sep 15, 2021 at 4:00 PM Yang Shi <[email protected]> wrote:
>
> On Wed, Sep 15, 2021 at 10:48 AM Yang Shi <[email protected]> wrote:
> >
> > On Wed, Sep 15, 2021 at 4:49 AM Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> > > > The khugepaged does check if the page is on LRU or not but it doesn't
> > > > hold page lock. And it doesn't check this again after holding page
> > > > lock. So it may race with some others, e.g. reclaimer, migration, etc.
> > > > All of them isolates page from LRU then lock the page then do something.
> > > >
> > > > But it could pass the refcount check done by khugepaged to proceed
> > > > collapse. Typically such race is not fatal. But if the page has been
> > > > isolated from LRU before khugepaged it likely means the page may be not
> > > > suitable for collapse for now.
> > > >
> > > > The other more fatal case is the following patch will keep the poisoned
> > > > page in page cache for shmem, so khugepaged may collapse a poisoned page
> > > > since the refcount check could pass. 3 refcounts come from:
> > > > - hwpoison
> > > > - page cache
> > > > - khugepaged
> > > >
> > > > Since it is not on LRU so no refcount is incremented from LRU isolation.
> > > >
> > > > This is definitely not expected. Checking if it is on LRU or not after
> > > > holding page lock could help serialize against hwpoison handler.
> > > >
> > > > But there is still a small race window between setting hwpoison flag and
> > > > bump refcount in hwpoison handler. It could be closed by checking
> > > > hwpoison flag in khugepaged, however this race seems unlikely to happen
> > > > in real life workload. So just check LRU flag for now to avoid
> > > > over-engineering.
> > > >
> > > > Signed-off-by: Yang Shi <[email protected]>
> > > > ---
> > > > mm/khugepaged.c | 6 ++++++
> > > > 1 file changed, 6 insertions(+)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 045cc579f724..bdc161dc27dc 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
> > > > goto out_unlock;
> > > > }
> > > >
> > > > + /* The hwpoisoned page is off LRU but in page cache */
> > > > + if (!PageLRU(page)) {
> > > > + result = SCAN_PAGE_LRU;
> > > > + goto out_unlock;
> > > > + }
> > > > +
> > > > if (isolate_lru_page(page)) {
> > >
> > > isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
> > > and we get here.
> >
> > Hmm... you are definitely right. How could I miss this point.
> >
> > It might be because of I messed up the page state by some tests which
> > may do hole punch then reread the same index. That could drop the
> > poisoned page then collapse succeed. But I'm not sure. Anyway I didn't
> > figure out how the poisoned page could be collapsed. It seems
> > impossible. I will drop this patch.
>
> I think I figured out the problem. This problem happened after the
> page cache split patch and if the hwpoisoned page is not head page. It
> is because THP split will unfreeze the refcount of tail pages to 2
> (restore refcount from page cache) then dec refcount to 1. The
> refcount pin from hwpoison is gone and it is still on LRU. Then
> khugepged locked the page before hwpoison, the refcount is expected to
> khugepaged.
>
> The worse thing is it seems this problem is applicable to anonymous
> page too. Once the anonymous THP is split by hwpoison the pin from
> hwpoison is gone too the refcount is 1 (comes from PTE map). Then
> khugepaged could collapse it to huge page again. It may incur data
> corruption.
>
> And the poisoned page may be freed back to buddy since the lost refcount pin.
>
> If the poisoned page is head page, the code is fine since hwpoison
> doesn't put the refcount for head page after split.
>
> The fix is simple, just keep the refcount pin for hwpoisoned subpage.

Err... wait... I just realized I missed the code block below:

if (subpage == page)
continue;

It skips the subpage passed to split_huge_page(), so the refcount pin from
the caller for this subpage is kept, and hwpoison doesn't put it. So it
seems fine.

>
> >
> > >
> > > > result = SCAN_DEL_PAGE_LRU;
> > > > goto out_unlock;
> > > > --
> > > > 2.26.2
> > > >
> > > >
> > >
> > > --
> > > Kirill A. Shutemov

2021-09-21 09:50:25

by Naoya Horiguchi

Subject: Re: [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens

On Tue, Sep 14, 2021 at 11:37:17AM -0700, Yang Shi wrote:
> The current behavior of memory failure is to truncate the page cache
> regardless of dirty or clean. If the page is dirty the later access
> will get the obsolete data from disk without any notification to the
> users. This may cause silent data loss. It is even worse for shmem
> since shmem is in-memory filesystem, truncating page cache means
> discarding data blocks. The later read would return all zero.
>
> The right approach is to keep the corrupted page in page cache, any
> later access would return error for syscalls or SIGBUS for page fault,
> until the file is truncated, hole punched or removed. The regular
> storage backed filesystems would be more complicated so this patch
> is focused on shmem. This also unblock the support for soft
> offlining shmem THP.
>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> mm/memory-failure.c | 3 ++-
> mm/shmem.c | 25 +++++++++++++++++++++++--
> 2 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 54879c339024..3e06cb9d5121 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1101,7 +1101,8 @@ static int page_action(struct page_state *ps, struct page *p,
> result = ps->action(p, pfn);
>
> count = page_count(p) - 1;
> - if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
> + if ((ps->action == me_swapcache_dirty && result == MF_DELAYED) ||
> + (ps->action == me_pagecache_dirty && result == MF_FAILED))

This new line seems to affect the dirty page cache case on other
filesystems too; the result is that the "still referenced" message is
missed for some unmap-failure cases (although that's not critical).
So checking the filesystem type (for example with shmem_mapping())
might be helpful?

And I think that if we want some refactoring, we could pass *ps to each
ps->action() callback and move this refcount check to the places that
actually need it. I don't think we always need the refcount check, for
example in the MF_MSG_KERNEL and MF_MSG_UNKNOWN cases (because no one
knows the expected values for these cases).


> count--;
> if (count > 0) {
> pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 88742953532c..ec33f4f7173d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> struct inode *inode = mapping->host;
> struct shmem_inode_info *info = SHMEM_I(inode);
> pgoff_t index = pos >> PAGE_SHIFT;
> + int ret = 0;
>
> /* i_rwsem is held by caller */
> if (unlikely(info->seals & (F_SEAL_GROW |
> @@ -2466,7 +2467,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> return -EPERM;
> }
>
> - return shmem_getpage(inode, index, pagep, SGP_WRITE);
> + ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> +
> + if (!ret) {

Maybe this "!ret" check is not necessary because *pagep is set
non-NULL only when ret is 0. It could save one indent level.

> + if (*pagep) {
> + if (PageHWPoison(*pagep)) {
> + unlock_page(*pagep);
> + put_page(*pagep);
> + ret = -EIO;
> + }
> + }
> + }
> +
> + return ret;
> }
>
> static int
> @@ -2555,6 +2568,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> unlock_page(page);
> }
>
> + if (page && PageHWPoison(page)) {
> + error = -EIO;
> + break;
> + }
> +
> /*
> * We must evaluate after, since reads (unlike writes)
> * are called without i_rwsem protection against truncate
> @@ -3782,7 +3800,6 @@ const struct address_space_operations shmem_aops = {
> #ifdef CONFIG_MIGRATION
> .migratepage = migrate_page,
> #endif
> - .error_remove_page = generic_error_remove_page,

This change makes truncate_error_page() call invalidate_inode_page(), and
in my testing that fails with a "Failed to invalidate" message, so
memory_failure() finally returns -EBUSY. I'm not sure that's expected,
because this patchset changes things to keep error pages in the page cache
as proper error handling. Maybe you can avoid this by defining
.error_remove_page in shmem_aops to simply return 0.

> };
> EXPORT_SYMBOL(shmem_aops);
>
> @@ -4193,6 +4210,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> page = ERR_PTR(error);
> else
> unlock_page(page);
> +
> + if (PageHWPoison(page))
> + page = NULL;
> +
> return page;

One more comment ...

- I guess that you added PageHWPoison() checks after some call sites of
shmem_getpage() and shmem_getpage_gfp(), but they don't seem to cover all
of them. For example, can mcontinue_atomic_pte() in mm/userfaultfd.c
properly handle PageHWPoison?

I'm still testing in more detail, but in my current understanding this
patch looks promising to me. Thank you for your effort.

Thanks,
Naoya Horiguchi

2021-09-21 09:52:10

by Naoya Horiguchi

Subject: Re: [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly

On Tue, Sep 14, 2021 at 11:37:18AM -0700, Yang Shi wrote:
> Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP
> support for tmpfs and read-only file cache has been added. They could
> be offlined by split THP, just like anonymous THP.
>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> mm/memory-failure.c | 21 ++++++++++++---------
> 1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3e06cb9d5121..6f72aab8ec4a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1150,13 +1150,16 @@ static int __get_hwpoison_page(struct page *page)
>
> if (PageTransHuge(head)) {
> /*
> - * Non anonymous thp exists only in allocation/free time. We
> - * can't handle such a case correctly, so let's give it up.
> - * This should be better than triggering BUG_ON when kernel
> - * tries to touch the "partially handled" page.
> + * We can't handle allocating or freeing THPs, so let's give
> + * it up. This should be better than triggering BUG_ON when
> + * kernel tries to touch the "partially handled" page.
> + *
> + * page->mapping won't be initialized until the page is added
> + * to rmap or page cache. Use this as an indicator for if
> + * this is an instantiated page.
> */
> - if (!PageAnon(head)) {
> - pr_err("Memory failure: %#lx: non anonymous thp\n",
> + if (!head->mapping) {
> + pr_err("Memory failure: %#lx: non instantiated thp\n",
> page_to_pfn(page));
> return 0;
> }

How about cleaning up this whole "PageTransHuge()" block? As explained in
commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error
handling"), this check was introduced to prevent a non-anonymous thp from
being treated as hugetlb, which would run hugetlb code on it and crash.

With the recent improvements in __get_hwpoison_page(), this confusion can
no longer happen (the hugetlb check is done before this one), so this
check seems to have finished its role.

Thanks,
Naoya Horiguchi

> @@ -1415,12 +1418,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> static int try_to_split_thp_page(struct page *page, const char *msg)
> {
> lock_page(page);
> - if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> + if (!page->mapping || unlikely(split_huge_page(page))) {
> unsigned long pfn = page_to_pfn(page);
>
> unlock_page(page);
> - if (!PageAnon(page))
> - pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> + if (!page->mapping)
> + pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
> else
> pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> put_page(page);
> --
> 2.26.2
>

2021-09-21 20:50:57

by Yang Shi


Subject: Re: [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly

On Tue, Sep 21, 2021 at 2:50 AM Naoya Horiguchi
<[email protected]> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:18AM -0700, Yang Shi wrote:
> > Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP
> > support for tmpfs and read-only file cache has been added. They could
> > be offlined by split THP, just like anonymous THP.
> >
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > mm/memory-failure.c | 21 ++++++++++++---------
> > 1 file changed, 12 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 3e06cb9d5121..6f72aab8ec4a 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1150,13 +1150,16 @@ static int __get_hwpoison_page(struct page *page)
> >
> > if (PageTransHuge(head)) {
> > /*
> > - * Non anonymous thp exists only in allocation/free time. We
> > - * can't handle such a case correctly, so let's give it up.
> > - * This should be better than triggering BUG_ON when kernel
> > - * tries to touch the "partially handled" page.
> > + * We can't handle allocating or freeing THPs, so let's give
> > + * it up. This should be better than triggering BUG_ON when
> > + * kernel tries to touch the "partially handled" page.
> > + *
> > + * page->mapping won't be initialized until the page is added
> > + * to rmap or page cache. Use this as an indicator for if
> > + * this is an instantiated page.
> > */
> > - if (!PageAnon(head)) {
> > - pr_err("Memory failure: %#lx: non anonymous thp\n",
> > + if (!head->mapping) {
> > + pr_err("Memory failure: %#lx: non instantiated thp\n",
> > page_to_pfn(page));
> > return 0;
> > }
>
> How about cleaning up this whole "PageTransHuge()" block? As explained in
> commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error
> handling"), this check was introduced to avoid a non-anonymous THP being
> treated as a hugetlb page and running the hugetlb code (resulting in a crash).
>
> With the recent improvement in __get_hwpoison_page(), this confusion can no
> longer happen (the hugetlb check is done before this one), so this check
> seems to have finished its role.

I see. IIUC the !PageAnon check was used to prevent the THP from being
mistaken for a hugetlb page, but that is now solved by splitting the THP
earlier. If so, this check can definitely go away since the worst case
is a split failure. Will fix it in the next version.

>
> Thanks,
> Naoya Horiguchi
>
> > @@ -1415,12 +1418,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> > static int try_to_split_thp_page(struct page *page, const char *msg)
> > {
> > lock_page(page);
> > - if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> > + if (!page->mapping || unlikely(split_huge_page(page))) {
> > unsigned long pfn = page_to_pfn(page);
> >
> > unlock_page(page);
> > - if (!PageAnon(page))
> > - pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> > + if (!page->mapping)
> > + pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
> > else
> > pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> > put_page(page);
> > --
> > 2.26.2
> >

2021-09-21 22:08:18

by Yang Shi

Subject: Re: [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens

On Tue, Sep 21, 2021 at 2:49 AM Naoya Horiguchi
<[email protected]> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:17AM -0700, Yang Shi wrote:
> > The current behavior of memory failure is to truncate the page cache
> > regardless of dirty or clean. If the page is dirty the later access
> > will get the obsolete data from disk without any notification to the
> > users. This may cause silent data loss. It is even worse for shmem
> > since shmem is in-memory filesystem, truncating page cache means
> > discarding data blocks. The later read would return all zero.
> >
> > The right approach is to keep the corrupted page in page cache, any
> > later access would return error for syscalls or SIGBUS for page fault,
> > until the file is truncated, hole punched or removed. The regular
> > storage backed filesystems would be more complicated so this patch
> > is focused on shmem. This also unblock the support for soft
> > offlining shmem THP.
> >
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > mm/memory-failure.c | 3 ++-
> > mm/shmem.c | 25 +++++++++++++++++++++++--
> > 2 files changed, 25 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 54879c339024..3e06cb9d5121 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1101,7 +1101,8 @@ static int page_action(struct page_state *ps, struct page *p,
> > result = ps->action(p, pfn);
> >
> > count = page_count(p) - 1;
> > - if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
> > + if ((ps->action == me_swapcache_dirty && result == MF_DELAYED) ||
> > + (ps->action == me_pagecache_dirty && result == MF_FAILED))
>
> This new line seems to affect the dirty page cache case on other
> filesystems, where the result is that the "still referenced" message is
> missed for some unmap failure cases (although that's not so critical).
> So checking the filesystem type (for example with shmem_mapping())
> might be helpful?
>
> And I think that if we want some refactoring here, we could pass *ps to
> each ps->action() callback and move this refcount check into the places
> that need it.
> I don't think we always need the refcount check; for example, in the
> MF_MSG_KERNEL and MF_MSG_UNKNOWN cases no one knows the expected
> values.

Yeah, that makes sense to me. How about doing the below (totally untested):

static inline bool check_refcount(struct page_state *ps, struct page *page,
                                  unsigned long pfn, bool dec)
{
        int count = page_count(page) - 1;

        if (dec || shmem_mapping(page->mapping))
                count -= 1;

        if (count > 0) {
                pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
                       pfn, action_page_types[ps->type], count);
                return false;
        }

        return true;
}

Then call this in the needed me_* functions and return the right value
based on its result. I think me_swapcache_dirty() is the only place that
needs to pass true for the dec parameter.

>
>
> > count--;
> > if (count > 0) {
> > pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 88742953532c..ec33f4f7173d 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> > struct inode *inode = mapping->host;
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > pgoff_t index = pos >> PAGE_SHIFT;
> > + int ret = 0;
> >
> > /* i_rwsem is held by caller */
> > if (unlikely(info->seals & (F_SEAL_GROW |
> > @@ -2466,7 +2467,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> > return -EPERM;
> > }
> >
> > - return shmem_getpage(inode, index, pagep, SGP_WRITE);
> > + ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +
> > + if (!ret) {
>
> Maybe this "!ret" check is not necessary because *pagep is set
> non-NULL only when ret is 0. It could save one indent level.

Yes, sure.
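
For reference, with the !ret check dropped, the tail of shmem_write_begin()
from the patch could read like this (an untested sketch, same logic as the
hunk above, one indent level saved):

```c
	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
	if (ret)
		return ret;

	if (PageHWPoison(*pagep)) {
		unlock_page(*pagep);
		put_page(*pagep);
		return -EIO;
	}

	return 0;
```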

>
> > + if (*pagep) {
> > + if (PageHWPoison(*pagep)) {
> > + unlock_page(*pagep);
> > + put_page(*pagep);
> > + ret = -EIO;
> > + }
> > + }
> > + }
> > +
> > + return ret;
> > }
> >
> > static int
> > @@ -2555,6 +2568,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > unlock_page(page);
> > }
> >
> > + if (page && PageHWPoison(page)) {
> > + error = -EIO;
> > + break;
> > + }
> > +
> > /*
> > * We must evaluate after, since reads (unlike writes)
> > * are called without i_rwsem protection against truncate
> > @@ -3782,7 +3800,6 @@ const struct address_space_operations shmem_aops = {
> > #ifdef CONFIG_MIGRATION
> > .migratepage = migrate_page,
> > #endif
> > - .error_remove_page = generic_error_remove_page,
>
> This change makes truncate_error_page() call invalidate_inode_page(),
> and in my testing it fails with a "Failed to invalidate" message, so
> memory_failure() finally returns -EBUSY. I'm not sure that's expected,
> because this patchset changes to keep error pages in the page cache as
> the proper error handling.
> Maybe you can avoid this by defining an .error_remove_page in
> shmem_aops which simply returns 0.

Yes, the "Failed to invalidate" message seems confusing. I agree a
shmem-specific callback is better.
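
For concreteness, such a callback could be as small as this (untested
sketch; the name shmem_error_remove_page is just a plausible choice,
wired up as .error_remove_page in shmem_aops):

```c
static int shmem_error_remove_page(struct address_space *mapping,
				   struct page *page)
{
	/*
	 * Keep the poisoned page in the page cache so that later
	 * accesses see PageHWPoison and fail with -EIO/SIGBUS,
	 * instead of the page being truncated and the data being
	 * silently lost.
	 */
	return 0;
}
```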

>
> > };
> > EXPORT_SYMBOL(shmem_aops);
> >
> > @@ -4193,6 +4210,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> > page = ERR_PTR(error);
> > else
> > unlock_page(page);
> > +
> > + if (PageHWPoison(page))
> > + page = NULL;
> > +
> > return page;
>
> One more comment ...
>
> - I guess that you added PageHWPoison() checks after some call sites
> of shmem_getpage() and shmem_getpage_gfp(), but they don't seem to
> cover all of them. For example, can mcontinue_atomic_pte() in
> mm/userfaultfd.c properly handle PageHWPoison?

No, I didn't touch anything outside shmem.c. I could add this in the
next version.

BTW, I just found another problem with the change in
shmem_read_mapping_page_gfp(): it should return ERR_PTR(-EIO) instead
of NULL, since the callers may not handle NULL. Will fix that in the
next version too.

>
> I'm still testing in more detail, but in my current understanding
> this patch looks promising to me. Thank you for your effort.

Thanks a lot for taking the time to do the review.

>
> Thanks,
> Naoya Horiguchi