2024-06-08 02:43:51

by Yosry Ahmed

Subject: [PATCH v2] mm: zswap: handle incorrect attempts to load large folios

Zswap does not support storing or loading large folios. Until proper
support is added, attempts to load large folios from zswap are a bug.

For example, if a swapin fault observes that contiguous PTEs are
pointing to contiguous swap entries and tries to swap them in as a large
folio, swap_read_folio() will pass in a large folio to zswap_load(), but
zswap_load() will only effectively load the first page in the folio. If
the first page is not in zswap, the folio will be read from disk, even
though other pages may be in zswap.

Either way, this will lead to silent data corruption. Proper support
needs to be added before large folio swapins and zswap can work
together.

Looking at the callers of swap_read_folio(), the folios passed in are
allocated either by __read_swap_cache_async() or by do_swap_page() in
the SWP_SYNCHRONOUS_IO path. Both allocate order-0 folios, so
everything is fine for now.

However, there is ongoing work to add support for large folio swapins
[1]. To make sure new development does not break zswap (or get broken by
zswap), add minimal handling of incorrect loads of large folios to
zswap.

First, move the call to folio_mark_uptodate() into zswap_load().

If a large folio load is attempted, and any page in that folio is in
zswap, return 'true' without calling folio_mark_uptodate(). This will
prevent the folio from being read from disk, and will emit an IO error
because the folio is not uptodate (e.g. do_swap_page() will return
VM_FAULT_SIGBUS). This may not be a reliable recovery path in all cases,
but it is better than nothing.
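
The recovery behavior described above can be modeled outside the kernel
as a small sketch (plain user-space C; the names and the `presence[]`
bitmap are illustrative stand-ins for the zswap xarray, not kernel API):

```c
#include <stdbool.h>
#include <stddef.h>

/* presence[i] plays the role of "swap offset i has a zswap entry". */
#define NSLOTS 64
static bool presence[NSLOTS];
static bool folio_uptodate;   /* stands in for folio_mark_uptodate() */

/*
 * Large-folio load: if any subpage is in "zswap", claim the load handled
 * the folio (true) but leave it !uptodate, so the caller sees an IO error
 * instead of silently corrupted data; if no subpage is present, return
 * false so the folio is read from disk.
 */
static bool load_large_folio(size_t offset, size_t nr_pages)
{
    folio_uptodate = false;
    for (size_t i = offset; i < offset + nr_pages && i < NSLOTS; i++)
        if (presence[i])
            return true;   /* partial hit: refuse, surface an error */
    return false;          /* full miss: safe to fall back to disk */
}
```

With no entries present, the caller falls back to disk; with even one
entry present, the load "succeeds" without uptodate, which surfaces as an
IO error rather than silent corruption.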

This was tested by hacking the allocation in __read_swap_cache_async()
to use order 2 and __GFP_COMP.

In the future, to handle this correctly, the swapin code should:
(a) Fall back to order-0 swapins if zswap was ever used on the machine,
because compressed pages remain in zswap after it is disabled.
(b) Add proper support for swapping in large folios from zswap (fully or
partially).

Probably start with (a), then follow up with (b).

[1] https://lore.kernel.org/linux-mm/[email protected]/

Signed-off-by: Yosry Ahmed <[email protected]>
---

v1: https://lore.kernel.org/lkml/[email protected]/

v1 -> v2:
- Instead of using VM_BUG_ON() use WARN_ON_ONCE() and add some recovery
handling (David Hildenbrand).

---
mm/page_io.c | 1 -
mm/zswap.c | 22 +++++++++++++++++++++-
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index f1a9cfab6e748..8f441dd8e109f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
delayacct_swapin_start();

if (zswap_load(folio)) {
- folio_mark_uptodate(folio);
folio_unlock(folio);
} else if (data_race(sis->flags & SWP_FS_OPS)) {
swap_read_folio_fs(folio, plug);
diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9be..ebb878d3e7865 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)

VM_WARN_ON_ONCE(!folio_test_locked(folio));

+ /*
+ * Large folios should not be swapped in while zswap is being used, as
+ * they are not properly handled. Zswap does not properly load large
+ * folios, and a large folio may only be partially in zswap.
+ *
+ * If any of the subpages are in zswap, reading from disk would result
+ * in data corruption, so return true without marking the folio uptodate
+ * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
+ *
+ * Otherwise, return false and read the folio from disk.
+ */
+ if (folio_test_large(folio)) {
+ if (xa_find(tree, &offset,
+ offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
+ WARN_ON_ONCE(1);
+ return true;
+ }
+ return false;
+ }
+
/*
* When reading into the swapcache, invalidate our entry. The
* swapcache can be the authoritative owner of the page and
@@ -1590,7 +1610,7 @@ bool zswap_load(struct folio *folio)
zswap_entry_free(entry);
folio_mark_dirty(folio);
}
-
+ folio_mark_uptodate(folio);
return true;
}

--
2.45.2.505.gda0bf45e8d-goog



2024-06-08 04:08:51

by Mika Penttilä

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load large folios

On 6/8/24 05:36, Yosry Ahmed wrote:
> diff --git a/mm/zswap.c b/mm/zswap.c
> index b9b35ef86d9be..ebb878d3e7865 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> + /*
> + * Large folios should not be swapped in while zswap is being used, as
> + * they are not properly handled. Zswap does not properly load large
> + * folios, and a large folio may only be partially in zswap.
> + *
> + * If any of the subpages are in zswap, reading from disk would result
> + * in data corruption, so return true without marking the folio uptodate
> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> + *
> + * Otherwise, return false and read the folio from disk.
> + */
> + if (folio_test_large(folio)) {
> + if (xa_find(tree, &offset,
> + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> + WARN_ON_ONCE(1);
> + return true;
> + }

How does that work? Should it be xa_find_after() to not always find
current entry?

And does it still mean those subsequent entries map to same folio?
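
For reference, xa_find() searches starting at *index inclusively, while
xa_find_after() starts strictly after it. A user-space model of the two
(illustrative only, not the xarray implementation; `slot[]` stands in for
the zswap tree):

```c
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 64
static bool slot[NSLOTS];

/* first present slot i with first <= i <= max, i.e. xa_find()-like */
static long find_from(size_t first, size_t max)
{
    for (size_t i = first; i <= max && i < NSLOTS; i++)
        if (slot[i])
            return (long)i;
    return -1;
}

/* first present slot i with first < i <= max, i.e. xa_find_after()-like */
static long find_after(size_t first, size_t max)
{
    return find_from(first + 1, max);
}
```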


--Mika



2024-06-08 04:14:04

by Barry Song

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load large folios

On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed <[email protected]> wrote:
>
> [...]
> + if (folio_test_large(folio)) {
> + if (xa_find(tree, &offset,
> + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> + WARN_ON_ONCE(1);
> + return true;
> + }
> + return false;

IMHO, this appears to be over-designed. Personally, I would opt to
use

if (folio_test_large(folio))
return true;

Before we address large folio support in zswap, it’s essential
not to let them coexist. Expecting valid data by lunchtime is
not advisable.

> [...]

Thanks
Barry

2024-06-10 17:37:57

by Yosry Ahmed

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load large folios

On Fri, Jun 7, 2024 at 9:08 PM Mika Penttilä <[email protected]> wrote:
>
> On 6/8/24 05:36, Yosry Ahmed wrote:
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index b9b35ef86d9be..ebb878d3e7865 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
> >
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >
> > + /*
> > + * Large folios should not be swapped in while zswap is being used, as
> > + * they are not properly handled. Zswap does not properly load large
> > + * folios, and a large folio may only be partially in zswap.
> > + *
> > + * If any of the subpages are in zswap, reading from disk would result
> > + * in data corruption, so return true without marking the folio uptodate
> > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > + *
> > + * Otherwise, return false and read the folio from disk.
> > + */
> > + if (folio_test_large(folio)) {
> > + if (xa_find(tree, &offset,
> > + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > + WARN_ON_ONCE(1);
> > + return true;
> > + }
>
> How does that work? Should it be xa_find_after() to not always find
> current entry?

By "current entry" I believe you mean the entry corresponding to
"offset" (i.e. the first subpage of the folio). At this point, we
haven't checked whether that offset has a corresponding entry in zswap
or not. It may be on disk, or zswap may be disabled.

>
> And does it still mean those subsequent entries map to same folio?

If I understand correctly, a folio in the swapcache has contiguous
swap offsets for its subpages. So I am assuming that the large folio
swapin case will adhere to that (i.e. we only swapin a large folio if
the swap offsets are contiguous). Did I misunderstand something here?
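
The contiguity assumption above can be written down as a tiny model:
subpage i of a folio whose first swap entry has offset `base` lives at
offset `base + i`, so an order-N folio covers the inclusive range
[base, base + 2^N - 1], which is exactly the range the xa_find() call
scans. (Illustrative user-space C, not kernel code.)

```c
#include <stddef.h>

/* swap offset of subpage i of a folio whose first entry is at 'base' */
static size_t subpage_swap_offset(size_t base, size_t i)
{
    return base + i;
}

/* last swap offset covered by an order-'order' folio starting at 'base' */
static size_t folio_last_offset(size_t base, unsigned int order)
{
    return base + ((size_t)1 << order) - 1;
}
```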

>
>
> --Mika
>
>

2024-06-10 17:43:09

by Yosry Ahmed

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load large folios

On Fri, Jun 7, 2024 at 9:13 PM Barry Song <[email protected]> wrote:
>
> On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed <[email protected]> wrote:
> >
> > [...]
> > + if (folio_test_large(folio)) {
> > + if (xa_find(tree, &offset,
> > + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > + WARN_ON_ONCE(1);
> > + return true;
> > + }
> > + return false;
>
> IMHO, this appears to be over-designed. Personally, I would opt to
> use
>
> if (folio_test_large(folio))
> return true;

I am sure you mean "return false" here. Always returning true means we
will never read a large folio from either zswap or disk, whether it's
in zswap or not, basically guaranteeing data corruption for large
folio swapin, even if zswap is disabled :)

>
> Before we address large folio support in zswap, it’s essential
> not to let them coexist. Expecting valid data by lunchtime is
> not advisable.

The goal here is to enable development for large folio swapin without
breaking zswap or being blocked on adding support in zswap. If we
always return false for large folios, as you suggest, then even if the
folio is in zswap (or parts of it), we will go read it from disk. This
will result in silent data corruption.

As you mentioned before, you spent a week debugging problems with your
large folio swapin series because of a zswap problem, and even after
then, the zswap_is_enabled() check you had is not enough to prevent
problems as I mentioned before (if zswap was enabled before). So we
need stronger checks to make sure we don't break things when we
support large folio swapin.

Since we can't just check if zswap is enabled or not, we need to
rather check if the folio (or any part of it) is in zswap or not. We
can only WARN in that case, but delivering the error to userspace is a
couple of extra lines of code (not set uptodate), and will make the
problem much easier to notice.

I am not sure I understand what you mean. The alternative is to
introduce a config option (perhaps internal) for large folio swapin,
and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
enabled if large folio swapin is enabled (through config or boot
option). This is until proper handling is added, of course.

>
> > [...]
>
> Thanks
> Barry

2024-06-10 20:06:13

by Barry Song

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load large folios

On Tue, Jun 11, 2024 at 1:42 AM Yosry Ahmed <[email protected]> wrote:
>
> On Fri, Jun 7, 2024 at 9:13 PM Barry Song <[email protected]> wrote:
> >
> > On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed <[email protected]> wrote:
> > >
> > > [...]
> > > + if (folio_test_large(folio)) {
> > > + if (xa_find(tree, &offset,
> > > + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > > + WARN_ON_ONCE(1);
> > > + return true;
> > > + }
> > > + return false;
> >
> > IMHO, this appears to be over-designed. Personally, I would opt to
> > use
> >
> > if (folio_test_large(folio))
> > return true;
>
> I am sure you mean "return false" here. Always returning true means we
> will never read a large folio from either zswap or disk, whether it's
> in zswap or not. Basically guaranteeing corrupting data for large
> folio swapin, even if zswap is disabled :)
>
> >
> > Before we address large folio support in zswap, it’s essential
> > not to let them coexist. Expecting valid data by lunchtime is
> > not advisable.
>
> The goal here is to enable development for large folio swapin without
> breaking zswap or being blocked on adding support in zswap. If we
> always return false for large folios, as you suggest, then even if the
> folio is in zswap (or parts of it), we will go read it from disk. This
> will result in silent data corruption.
>
> As you mentioned before, you spent a week debugging problems with your
> large folio swapin series because of a zswap problem, and even after
> then, the zswap_is_enabled() check you had is not enough to prevent
> problems as I mentioned before (if zswap was enabled before). So we
> need stronger checks to make sure we don't break things when we
> support large folio swapin.
>
> Since we can't just check if zswap is enabled or not, we need to
> rather check if the folio (or any part of it) is in zswap or not. We
> can only WARN in that case, but delivering the error to userspace is a
> couple of extra lines of code (not set uptodate), and will make the
> problem much easier to notice.
>
> I am not sure I understand what you mean. The alternative is to
> introduce a config option (perhaps internal) for large folio swapin,
> and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
> enabled if large folio swapin is enabled (through config or boot
> option). This is until proper handling is added, of course.

Hi Yosry,
My point is that anybody attempting to do large folio swap-in should
either
1. always use small folios if zswap has been enabled before or is
enabled now, or
2. address the large folio swapin issues in zswap.

There is no 3rd way which you are providing.

It is over-designed to give users true or false based on whether the
data is in zswap, as there is always a chance the data could be in
zswap. So before approach 2 is done, we should always WARN_ON large
folios and report data corruption.

>
> >
> > > [...]

Thanks
Barry

2024-06-10 20:12:20

by Yosry Ahmed

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load large folios

On Mon, Jun 10, 2024 at 1:06 PM Barry Song <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 1:42 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Fri, Jun 7, 2024 at 9:13 PM Barry Song <[email protected]> wrote:
> > >
> > > On Sat, Jun 8, 2024 at 10:37 AM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > [...]
> > > > + if (folio_test_large(folio)) {
> > > > + if (xa_find(tree, &offset,
> > > > + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> > > > + WARN_ON_ONCE(1);
> > > > + return true;
> > > > + }
> > > > + return false;
> > >
> > > IMHO, this appears to be over-designed. Personally, I would opt to
> > > use
> > >
> > > if (folio_test_large(folio))
> > > return true;
> >
> > I am sure you mean "return false" here. Always returning true means we
> > will never read a large folio from either zswap or disk, whether it's
> > in zswap or not. Basically guaranteeing corrupting data for large
> > folio swapin, even if zswap is disabled :)
> >
> > >
> > > Before we address large folio support in zswap, it’s essential
> > > not to let them coexist. Expecting valid data by lunchtime is
> > > not advisable.
> >
> > The goal here is to enable development for large folio swapin without
> > breaking zswap or being blocked on adding support in zswap. If we
> > always return false for large folios, as you suggest, then even if the
> > folio is in zswap (or parts of it), we will go read it from disk. This
> > will result in silent data corruption.
> >
> > As you mentioned before, you spent a week debugging problems with your
> > large folio swapin series because of a zswap problem, and even after
> > then, the zswap_is_enabled() check you had is not enough to prevent
> > problems as I mentioned before (if zswap was enabled before). So we
> > need stronger checks to make sure we don't break things when we
> > support large folio swapin.
> >
> > Since we can't just check if zswap is enabled or not, we need to
> > rather check if the folio (or any part of it) is in zswap or not. We
> > can only WARN in that case, but delivering the error to userspace is a
> > couple of extra lines of code (not set uptodate), and will make the
> > problem much easier to notice.
> >
> > I am not sure I understand what you mean. The alternative is to
> > introduce a config option (perhaps internal) for large folio swapin,
> > and make this depend on !CONFIG_ZSWAP, or make zswap refuse to get
> > enabled if large folio swapin is enabled (through config or boot
> > option). This is until proper handling is added, of course.
>
> Hi Yosry,
> My point is that anybody attempts to do large folios swap-in should
> either
> 1. always use small folios if zswap has been once enabled before or now
> or
> 2. address the large folios swapin issues in zswap
>
> there is no 3rd way which you are providing.
>
> it is over-designed to give users true or false based on if data is zswap
> as there is always a chance data could be in zswap. so before approach
> 2 is done, we should always WARN_ON large folios and report data
> corruption.

We can't always WARN_ON for large folios, as this will fire even if
zswap was never enabled. The alternative is tracking whether zswap was
ever enabled, and checking that instead of checking if any part of the
folio is in zswap.

Basically replacing xa_find(..) with zswap_was_enabled(..) or something.

What I don't like about this is that we will report data corruption
even in cases where data is not really corrupted and it exists on
disk. For example, if zswap is globally enabled but disabled in a
cgroup, there shouldn't be any corruption swapping in large folios.

That being said, I don't feel strongly, as long as we either check
that part of the folio is in zswap or that zswap was ever enabled (or
maybe check if a page was ever stored, just in case zswap was enabled
and immediately disabled).

Johannes, Nhat, any opinions on which way to handle this?
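To spell out the semantics under discussion, here is a standalone userspace C model of the guard in the patch (the types and array are illustrative stand-ins, not the kernel implementation):

```c
#include <stdbool.h>
#include <assert.h>

/* Illustrative stand-ins for the kernel types; not the real structures. */
struct folio {
	int nr_pages;   /* folio_nr_pages() */
	bool uptodate;  /* set by folio_mark_uptodate() */
};

/* Stand-in for the zswap tree: which swap offsets have a zswap entry. */
bool zswap_present[16];

/*
 * Model of the patch's zswap_load() behavior:
 *  - large folio with any subpage in zswap: return true WITHOUT marking
 *    the folio uptodate, so the caller emits an IO error (e.g.
 *    VM_FAULT_SIGBUS) instead of silently reading stale data from disk;
 *  - large folio with nothing in zswap: return false, read from disk;
 *  - order-0 folio in zswap: load it and mark it uptodate as usual.
 */
bool zswap_load_model(struct folio *folio, int offset)
{
	if (folio->nr_pages > 1) {	/* folio_test_large() */
		for (int i = 0; i < folio->nr_pages; i++)
			if (zswap_present[offset + i])
				return true;	/* WARN_ON_ONCE(1) here */
		return false;
	}
	if (!zswap_present[offset])
		return false;
	folio->uptodate = true;		/* decompress + mark uptodate */
	return true;
}
```

The key case is a large folio that is only partially in zswap: the model returns true while leaving the folio !uptodate, which is exactly what turns silent corruption into a visible IO error.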

2024-06-10 21:00:24

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios

On Tue, Jun 11, 2024 at 4:12 AM Yosry Ahmed <[email protected]> wrote:
>
> On Mon, Jun 10, 2024 at 1:06 PM Barry Song <[email protected]> wrote:
> >
> > On Tue, Jun 11, 2024 at 1:42 AM Yosry Ahmed <[email protected]> wrote:
> > >
[..]
> >
> > Hi Yosry,
> > My point is that anybody attempts to do large folios swap-in should
> > either
> > 1. always use small folios if zswap has been once enabled before or now
> > or
> > 2. address the large folios swapin issues in zswap
> >
> > there is no 3rd way which you are providing.
> >
> > it is over-designed to give users true or false based on if data is zswap
> > as there is always a chance data could be in zswap. so before approach
> > 2 is done, we should always WARN_ON large folios and report data
> > corruption.
>
> We can't always WARN_ON for large folios, as this will fire even if
> zswap was never enabled. The alternative is tracking whether zswap was
> ever enabled, and checking that instead of checking if any part of the
> folio is in zswap.
>
> Basically replacing xa_find(..) with zswap_was_enabled(..) or something.

My point is that mm core should always fall back:

if (zswap_was_or_is_enabled())
goto fallback;

until zswap fixes the issue. This is the only way to enable large folio swap-in
development before we fix zswap.

>
> What I don't like about this is that we will report data corruption
> even in cases where data is not really corrupted and it exists on
> disk. For example, if zswap is globally enabled but disabled in a
> cgroup, there shouldn't be any corruption swapping in large folios.
>
> That being said, I don't feel strongly, as long as we either check
> that part of the folio is in zswap or that zswap was ever enabled (or
> maybe check if a page was ever stored, just in case zswap was enabled
> and immediately disabled).
>
> Johannes, Nhat, any opinions on which way to handle this?

Thanks
Barry
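The fallback being proposed can be modeled in a few lines of standalone C (the function names mirror the snippet above; the flag is a hypothetical stand-in for however the kernel would track "zswap was ever enabled"):

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical one-way flag: set once zswap has ever been enabled. */
bool zswap_ever_enabled_flag;

bool is_zswap_ever_enabled(void)
{
	return zswap_ever_enabled_flag;
}

/*
 * Swapin order selection per the suggestion above: until zswap can load
 * large folios, fall back to order-0 whenever zswap is or ever was
 * enabled, because compressed pages may still be sitting in zswap even
 * after it is disabled.
 */
int choose_swapin_order(int requested_order)
{
	if (is_zswap_ever_enabled())
		return 0;	/* fallback: swap in page by page */
	return requested_order;
}
```

With this in mm core, the check inside zswap_load() only ever fires if the caller forgot the fallback, which is why it can be a plain WARN.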

2024-06-10 21:12:39

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios

On Mon, Jun 10, 2024 at 2:00 PM Barry Song <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 4:12 AM Yosry Ahmed <[email protected]> wrote:
> >
[..]
> > We can't always WARN_ON for large folios, as this will fire even if
> > zswap was never enabled. The alternative is tracking whether zswap was
> > ever enabled, and checking that instead of checking if any part of the
> > folio is in zswap.
> >
> > Basically replacing xa_find(..) with zswap_was_enabled(..) or something.
>
> My point is that mm core should always fallback
>
> if (zswap_was_or_is_enabled())
> goto fallback;
>
> till zswap fixes the issue. This is the only way to enable large folios swap-in
> development before we fix zswap.

I agree with this, I just want an extra fallback in zswap itself in
case something was missed during large folio swapin development (which
can evidently happen).

2024-06-10 23:35:02

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios

On Tue, Jun 11, 2024 at 9:12 AM Yosry Ahmed <[email protected]> wrote:
>
> On Mon, Jun 10, 2024 at 2:00 PM Barry Song <[email protected]> wrote:
> >
> > On Tue, Jun 11, 2024 at 4:12 AM Yosry Ahmed <[email protected]> wrote:
> > >
[..]
> > > We can't always WARN_ON for large folios, as this will fire even if
> > > zswap was never enabled. The alternative is tracking whether zswap was
> > > ever enabled, and checking that instead of checking if any part of the
> > > folio is in zswap.
> > >
> > > Basically replacing xa_find(..) with zswap_was_enabled(..) or something.
> >
> > My point is that mm core should always fallback
> >
> > if (zswap_was_or_is_enabled())
> > goto fallback;
> >
> > till zswap fixes the issue. This is the only way to enable large folios swap-in
> > development before we fix zswap.
>
> I agree with this, I just want an extra fallback in zswap itself in
> case something was missed during large folio swapin development (which
> can evidently happen).

Yes, then I feel we only need to WARN_ON the case where mm-core fails to fall back.

I mean, only WARN_ON if is_zswap_ever_enabled() && large folio; there is no
need to do more. Before zswap brings up large folio support, mm-core
will need is_zswap_ever_enabled() to do the fallback.

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 2a85b941db97..035e51ed89c4 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -36,6 +36,7 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
bool is_zswap_enabled(void);
+bool is_zswap_ever_enabled(void);
#else

struct zswap_lruvec_state {};
@@ -65,6 +66,10 @@ static inline bool is_zswap_enabled(void)
return false;
}

+static inline bool is_zswap_ever_enabled(void)
+{
+ return false;
+}
#endif

#endif /* _LINUX_ZSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9b..bf2da5d37e47 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -86,6 +86,9 @@ static int zswap_setup(void);
static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);
static int zswap_enabled_param_set(const char *,
const struct kernel_param *);
+
+static bool zswap_ever_enabled;
+
static const struct kernel_param_ops zswap_enabled_param_ops = {
.set = zswap_enabled_param_set,
.get = param_get_bool,
@@ -136,6 +139,11 @@ bool is_zswap_enabled(void)
return zswap_enabled;
}

+bool is_zswap_ever_enabled(void)
+{
+ return zswap_enabled || zswap_ever_enabled;
+}
+
/*********************************
* data structures
**********************************/
@@ -1734,6 +1742,7 @@ static int zswap_setup(void)
pr_info("loaded using pool %s/%s\n", pool->tfm_name,
zpool_get_type(pool->zpools[0]));
list_add(&pool->list, &zswap_pools);
+ zswap_ever_enabled = true;
zswap_has_pool = true;
} else {
pr_err("pool creation failed\n");

Thanks
Barry

2024-06-10 23:42:03

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios

[..]
> > > > We can't always WARN_ON for large folios, as this will fire even if
> > > > zswap was never enabled. The alternative is tracking whether zswap was
> > > > ever enabled, and checking that instead of checking if any part of the
> > > > folio is in zswap.
> > > >
> > > > Basically replacing xa_find(..) with zswap_was_enabled(..) or something.
> > >
> > > My point is that mm core should always fallback
> > >
> > > if (zswap_was_or_is_enabled())
> > > goto fallback;
> > >
> > > till zswap fixes the issue. This is the only way to enable large folios swap-in
> > > development before we fix zswap.
> >
> > I agree with this, I just want an extra fallback in zswap itself in
> > case something was missed during large folio swapin development (which
> > can evidently happen).
>
> Yes. Then I feel we only need to WARN_ON the case where mm-core fails to fall back.
>
> I mean, only WARN_ON on is_zswap_ever_enabled() && large folio; there is
> no need to do more. Until zswap brings up large folio support, mm-core
> will need is_zswap_ever_enabled() to do the fallback.

I don't have a problem with doing it this way instead of checking if
any part of the folio is in zswap. Such a check may be needed for core
MM to fallback to order-0 anyway, as we discussed. But I'd rather have
this as a static key since it will never be changed.

Also, I still prefer that we do not mark the folio as uptodate in this
case. It is one extra line of code that also propagates the kernel
warning to userspace and makes it much more noticeable.


>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 2a85b941db97..035e51ed89c4 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -36,6 +36,7 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> void zswap_lruvec_state_init(struct lruvec *lruvec);
> void zswap_folio_swapin(struct folio *folio);
> bool is_zswap_enabled(void);
> +bool is_zswap_ever_enabled(void);
> #else
>
> struct zswap_lruvec_state {};
> @@ -65,6 +66,10 @@ static inline bool is_zswap_enabled(void)
> return false;
> }
>
> +static inline bool is_zswap_ever_enabled(void)
> +{
> + return false;
> +}
> #endif
>
> #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/zswap.c b/mm/zswap.c
> index b9b35ef86d9b..bf2da5d37e47 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -86,6 +86,9 @@ static int zswap_setup(void);
> static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);
> static int zswap_enabled_param_set(const char *,
> const struct kernel_param *);
> +
> +static bool zswap_ever_enabled;
> +
> static const struct kernel_param_ops zswap_enabled_param_ops = {
> .set = zswap_enabled_param_set,
> .get = param_get_bool,
> @@ -136,6 +139,11 @@ bool is_zswap_enabled(void)
> return zswap_enabled;
> }
>
> +bool is_zswap_ever_enabled(void)
> +{
> + return zswap_enabled || zswap_ever_enabled;
> +}
> +
> /*********************************
> * data structures
> **********************************/
> @@ -1734,6 +1742,7 @@ static int zswap_setup(void)
> pr_info("loaded using pool %s/%s\n", pool->tfm_name,
> zpool_get_type(pool->zpools[0]));
> list_add(&pool->list, &zswap_pools);
> + zswap_ever_enabled = true;
> zswap_has_pool = true;
> } else {
> pr_err("pool creation failed\n");
>
> Thanks
> Barry
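The one-way "ever enabled" latch being discussed can be modeled in a small userspace sketch. In the kernel, Yosry's suggestion would make this a static key (DEFINE_STATIC_KEY_FALSE plus static_branch_enable), since it flips at most once, at pool creation, and is then read on every swapin. The plain bools and the swapin_order() helper below are illustrative stand-ins, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the proposed one-way latch. */
static bool zswap_enabled;        /* runtime-writable param */
static bool zswap_ever_enabled;   /* set once, never cleared */

static void zswap_pool_created(void)
{
	zswap_enabled = true;
	zswap_ever_enabled = true;
}

static void zswap_disable(void)
{
	/* Disabling zswap does not clear the latch: compressed pages
	 * may still sit in zswap pools after it is turned off. */
	zswap_enabled = false;
}

static bool is_zswap_enabled(void)
{
	return zswap_enabled;
}

static bool is_zswap_ever_enabled(void)
{
	return zswap_ever_enabled;
}

/* Hypothetical mm-core fallback: until zswap handles large folios,
 * swap in at order-0 whenever zswap is or was ever enabled. */
static unsigned int swapin_order(unsigned int desired_order)
{
	if (is_zswap_ever_enabled())
		return 0;
	return desired_order;
}
```

Once set, the latch stays true across an enable/disable cycle, which is exactly why a static key fits: the transition is one-way and the check sits on the swapin fast path.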

2024-06-11 00:11:36

by Barry Song

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios

On Tue, Jun 11, 2024 at 11:41 AM Yosry Ahmed <[email protected]> wrote:
>
> [..]
> > > > > We can't always WARN_ON for large folios, as this will fire even if
> > > > > zswap was never enabled. The alternative is tracking whether zswap was
> > > > > ever enabled, and checking that instead of checking if any part of the
> > > > > folio is in zswap.
> > > > >
> > > > > Basically replacing xa_find(..) with zswap_was_enabled(..) or something.
> > > >
> > > > My point is that mm core should always fallback
> > > >
> > > > if (zswap_was_or_is_enabled())
> > > > goto fallback;
> > > >
> > > > till zswap fixes the issue. This is the only way to enable large folios swap-in
> > > > development before we fix zswap.
> > >
> > > I agree with this, I just want an extra fallback in zswap itself in
> > > case something was missed during large folio swapin development (which
> > > can evidently happen).
> >
> > Yes. Then I feel we only need to WARN_ON the case where mm-core fails to fall back.
> >
> > I mean, only WARN_ON on is_zswap_ever_enabled() && large folio; there is
> > no need to do more. Until zswap brings up large folio support, mm-core
> > will need is_zswap_ever_enabled() to do the fallback.
>
> I don't have a problem with doing it this way instead of checking if
> any part of the folio is in zswap. Such a check may be needed for core
> MM to fallback to order-0 anyway, as we discussed. But I'd rather have
> this as a static key since it will never be changed.

right. This is better.

>
> Also, I still prefer that we do not mark the folio as uptodate in this
> case. It is one extra line of code that also propagates the kernel
> warning to userspace and makes it much more noticeable.

Right. I have no objection to returning true and skipping the uptodate
mark. Just searching the xarray is not so useful anyway, as we have to
either fall back in mm-core or bring up large folio support in zswap.
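The behavior both sides agree on here, returning true from zswap_load() without marking the folio uptodate so the caller emits an IO error instead of silently reading stale data from disk, can be modeled in userspace. The enum, struct, and function names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the swap_read_folio()/zswap_load() contract: a true return
 * means "zswap handled it, do not touch disk"; the folio is usable only
 * if it was also marked uptodate. Returning true without uptodate thus
 * surfaces as an IO error rather than silent corruption. */
enum outcome { READ_FROM_DISK, OK_FROM_ZSWAP, IO_ERROR };

struct folio_model {
	bool large;
	bool any_page_in_zswap;
	bool uptodate;
};

static bool model_zswap_load(struct folio_model *f)
{
	if (f->large) {
		if (f->any_page_in_zswap)
			return true;   /* handled, but NOT uptodate */
		return false;          /* nothing in zswap: disk is safe */
	}
	f->uptodate = true;            /* order-0 load succeeds here */
	return true;
}

static enum outcome swap_read(struct folio_model *f)
{
	if (!model_zswap_load(f))
		return READ_FROM_DISK;
	return f->uptodate ? OK_FROM_ZSWAP : IO_ERROR;
}
```

In this model, a large folio with any subpage in zswap yields IO_ERROR (e.g. do_swap_page() returning VM_FAULT_SIGBUS), a large folio with nothing in zswap falls through to disk, and an order-0 folio loads normally.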

>
>
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 2a85b941db97..035e51ed89c4 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -36,6 +36,7 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> > void zswap_lruvec_state_init(struct lruvec *lruvec);
> > void zswap_folio_swapin(struct folio *folio);
> > bool is_zswap_enabled(void);
> > +bool is_zswap_ever_enabled(void);
> > #else
> >
> > struct zswap_lruvec_state {};
> > @@ -65,6 +66,10 @@ static inline bool is_zswap_enabled(void)
> > return false;
> > }
> >
> > +static inline bool is_zswap_ever_enabled(void)
> > +{
> > + return false;
> > +}
> > #endif
> >
> > #endif /* _LINUX_ZSWAP_H */
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index b9b35ef86d9b..bf2da5d37e47 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -86,6 +86,9 @@ static int zswap_setup(void);
> > static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);
> > static int zswap_enabled_param_set(const char *,
> > const struct kernel_param *);
> > +
> > +static bool zswap_ever_enabled;
> > +
> > static const struct kernel_param_ops zswap_enabled_param_ops = {
> > .set = zswap_enabled_param_set,
> > .get = param_get_bool,
> > @@ -136,6 +139,11 @@ bool is_zswap_enabled(void)
> > return zswap_enabled;
> > }
> >
> > +bool is_zswap_ever_enabled(void)
> > +{
> > + return zswap_enabled || zswap_ever_enabled;
> > +}
> > +
> > /*********************************
> > * data structures
> > **********************************/
> > @@ -1734,6 +1742,7 @@ static int zswap_setup(void)
> > pr_info("loaded using pool %s/%s\n", pool->tfm_name,
> > zpool_get_type(pool->zpools[0]));
> > list_add(&pool->list, &zswap_pools);
> > + zswap_ever_enabled = true;
> > zswap_has_pool = true;
> > } else {
> > pr_err("pool creation failed\n");
> >

Thanks
Barry

2024-06-11 04:14:35

by Mika Penttilä

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios


On 6/10/24 20:35, Yosry Ahmed wrote:
> On Fri, Jun 7, 2024 at 9:08 PM Mika Penttilä <[email protected]> wrote:
>> On 6/8/24 05:36, Yosry Ahmed wrote:
>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>> index b9b35ef86d9be..ebb878d3e7865 100644
>>> --- a/mm/zswap.c
>>> +++ b/mm/zswap.c
>>> @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
>>>
>>> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>
>>> + /*
>>> + * Large folios should not be swapped in while zswap is being used, as
>>> + * they are not properly handled. Zswap does not properly load large
>>> + * folios, and a large folio may only be partially in zswap.
>>> + *
>>> + * If any of the subpages are in zswap, reading from disk would result
>>> + * in data corruption, so return true without marking the folio uptodate
>>> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
>>> + *
>>> + * Otherwise, return false and read the folio from disk.
>>> + */
>>> + if (folio_test_large(folio)) {
>>> + if (xa_find(tree, &offset,
>>> + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
>>> + WARN_ON_ONCE(1);
>>> + return true;
>>> + }
>> How does that work? Should it be xa_find_after() to not always find
>> current entry?
> By "current entry" I believe you mean the entry corresponding to
> "offset" (i.e. the first subpage of the folio). At this point, we
> haven't checked if that offset has a corresponding entry in zswap or
not. It may be on disk, or zswap may be disabled.

Okay, you test whether there's any matching offset in zswap for the folio.


>> And does it still mean those subsequent entries map to same folio?
> If I understand correctly, a folio in the swapcache has contiguous
> swap offsets for its subpages. So I am assuming that the large folio
> swapin case will adhere to that (i.e. we only swapin a large folio if
> the swap offsets are contiguous). Did I misunderstand something here?

Yes, I think that is a fair assumption for now. But I also saw your v3,
which doesn't depend on this.


>
>>
>> --Mika
>>
>>
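The inclusive range scan Mika asked about, xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, XA_PRESENT), can be modeled with a minimal userspace sketch. A fixed-size bitmap stands in for the per-tree xarray, and the names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A folio in the swapcache occupies nr contiguous swap offsets
 * [offset, offset + nr - 1]; the folio must be rejected if ANY of
 * them has a zswap entry, not just the first. */
#define NR_OFFSETS 64
static bool zswap_present[NR_OFFSETS];   /* stand-in for the xarray */

static bool any_subpage_in_zswap(size_t offset, size_t nr)
{
	/* Like xa_find() with XA_PRESENT: an inclusive range scan
	 * returning whether any entry in the range is present. */
	for (size_t i = offset; i < offset + nr && i < NR_OFFSETS; i++)
		if (zswap_present[i])
			return true;
	return false;
}
```

This also shows why xa_find() (and not xa_find_after()) is correct here: the first offset has not been looked up yet, so the scan must include it.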


2024-06-11 15:48:10

by Yosry Ahmed

Subject: Re: [PATCH v2] mm: zswap: handle incorrect attempts to load of large folios

On Mon, Jun 10, 2024 at 9:14 PM Mika Penttilä <[email protected]> wrote:
>
>
> On 6/10/24 20:35, Yosry Ahmed wrote:
> > On Fri, Jun 7, 2024 at 9:08 PM Mika Penttilä <[email protected]> wrote:
> >> On 6/8/24 05:36, Yosry Ahmed wrote:
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index b9b35ef86d9be..ebb878d3e7865 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -1557,6 +1557,26 @@ bool zswap_load(struct folio *folio)
> >>>
> >>> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>
> >>> + /*
> >>> + * Large folios should not be swapped in while zswap is being used, as
> >>> + * they are not properly handled. Zswap does not properly load large
> >>> + * folios, and a large folio may only be partially in zswap.
> >>> + *
> >>> + * If any of the subpages are in zswap, reading from disk would result
> >>> + * in data corruption, so return true without marking the folio uptodate
> >>> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> >>> + *
> >>> + * Otherwise, return false and read the folio from disk.
> >>> + */
> >>> + if (folio_test_large(folio)) {
> >>> + if (xa_find(tree, &offset,
> >>> + offset + folio_nr_pages(folio) - 1, XA_PRESENT)) {
> >>> + WARN_ON_ONCE(1);
> >>> + return true;
> >>> + }
> >> How does that work? Should it be xa_find_after() to not always find
> >> current entry?
> > By "current entry" I believe you mean the entry corresponding to
> > "offset" (i.e. the first subpage of the folio). At this point, we
> > haven't checked if that offset has a corresponding entry in zswap or
> not. It may be on disk, or zswap may be disabled.
>
> Okay, you test whether there's any matching offset in zswap for the folio.
>
>
> >> And does it still mean those subsequent entries map to same folio?
> > If I understand correctly, a folio in the swapcache has contiguous
> > swap offsets for its subpages. So I am assuming that the large folio
> > swapin case will adhere to that (i.e. we only swapin a large folio if
> > the swap offsets are contiguous). Did I misunderstand something here?
>
> Yes, I think that is a fair assumption for now. But I also saw your v3,
> which doesn't depend on this.

Yeah, Barry pointed out that we want to warn if a large folio reaches
zswap and there is a chance that it is in zswap (i.e. zswap was
enabled), even if it happens that the folio is not in zswap during
swapin. This gives wider coverage, and when zswap is disabled it is
cheaper than the lookups.

We will also need to check if zswap was ever enabled in the large
folio swapin series anyway, so the helper introduced in v3 should be
helpful there as well.