With ongoing work to support large folio swapin, it is important to make
sure we do not pass large folios to zswap_load() without implementing
proper support.
For example, if a swapin fault observes that contiguous PTEs are
pointing to contiguous swap entries and tries to swap them in as a large
folio, swap_read_folio() will pass in a large folio to zswap_load(), but
zswap_load() will only effectively load the first page in the folio. If
the first page is not in zswap, the folio will be read from disk, even
though other pages may be in zswap.
In both cases, this will lead to silent data corruption.
Proper large folio swapin support needs to go into zswap before zswap
can be enabled in a system that supports large folio swapin.
Looking at callers of swap_read_folio(), it seems like they are either
allocated from __read_swap_cache_async() or do_swap_page() in the
SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
are fine for now.
Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
the order of those allocations without proper handling of zswap.
Alternatively, swap_read_folio() (or its callers) can be updated to have
a fallback mechanism that splits large folios or reads subpages
separately. Similar logic may be needed anyway in case part of a large
folio is already in the swapcache and the rest of it is swapped out.
Signed-off-by: Yosry Ahmed <[email protected]>
---
Sorry for the long CC list, I just found myself repeatedly looking at
new series that add swap support for mTHPs / large folios, making sure
they do not break with zswap or make incorrect assumptions. This debug
check should give us some peace of mind. Hopefully this patch will also
raise awareness among people who are working on this.
---
mm/zswap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9be..6007252429bb2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
if (!entry)
return false;
+ /* Zswap loads do not handle large folio swapins correctly yet */
+ VM_BUG_ON(folio_test_large(folio));
+
if (entry->length)
zswap_decompress(entry, folio);
else
--
2.45.2.505.gda0bf45e8d-goog
On 06.06.24 20:48, Yosry Ahmed wrote:
> With ongoing work to support large folio swapin, it is important to make
> sure we do not pass large folios to zswap_load() without implementing
> proper support.
>
> For example, if a swapin fault observes that contiguous PTEs are
> pointing to contiguous swap entries and tries to swap them in as a large
> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> zswap_load() will only effectively load the first page in the folio. If
> the first page is not in zswap, the folio will be read from disk, even
> though other pages may be in zswap.
>
> In both cases, this will lead to silent data corruption.
>
> Proper large folio swapin support needs to go into zswap before zswap
> can be enabled in a system that supports large folio swapin.
>
> Looking at callers of swap_read_folio(), it seems like they are either
> allocated from __read_swap_cache_async() or do_swap_page() in the
> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> are fine for now.
>
> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> the order of those allocations without proper handling of zswap.
>
> Alternatively, swap_read_folio() (or its callers) can be updated to have
> a fallback mechanism that splits large folios or reads subpages
> separately. Similar logic may be needed anyway in case part of a large
> folio is already in the swapcache and the rest of it is swapped out.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
>
> Sorry for the long CC list, I just found myself repeatedly looking at
> new series that add swap support for mTHPs / large folios, making sure
> they do not break with zswap or make incorrect assumptions. This debug
> check should give us some peace of mind. Hopefully this patch will also
> raise awareness among people who are working on this.
>
> ---
> mm/zswap.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index b9b35ef86d9be..6007252429bb2 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> if (!entry)
> return false;
>
> + /* Zswap loads do not handle large folio swapins correctly yet */
> + VM_BUG_ON(folio_test_large(folio));
> +
There is no way we could have a WARN_ON_ONCE() and recover, right?
--
Cheers,
David / dhildenb
On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
>
> On 06.06.24 20:48, Yosry Ahmed wrote:
> > With ongoing work to support large folio swapin, it is important to make
> > sure we do not pass large folios to zswap_load() without implementing
> > proper support.
> >
> > For example, if a swapin fault observes that contiguous PTEs are
> > pointing to contiguous swap entries and tries to swap them in as a large
> > folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > zswap_load() will only effectively load the first page in the folio. If
> > the first page is not in zswap, the folio will be read from disk, even
> > though other pages may be in zswap.
> >
> > In both cases, this will lead to silent data corruption.
> >
> > Proper large folio swapin support needs to go into zswap before zswap
> > can be enabled in a system that supports large folio swapin.
> >
> > Looking at callers of swap_read_folio(), it seems like they are either
> > allocated from __read_swap_cache_async() or do_swap_page() in the
> > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > are fine for now.
> >
> > Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > the order of those allocations without proper handling of zswap.
> >
> > Alternatively, swap_read_folio() (or its callers) can be updated to have
> > a fallback mechanism that splits large folios or reads subpages
> > separately. Similar logic may be needed anyway in case part of a large
> > folio is already in the swapcache and the rest of it is swapped out.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> >
> > Sorry for the long CC list, I just found myself repeatedly looking at
> > new series that add swap support for mTHPs / large folios, making sure
> > they do not break with zswap or make incorrect assumptions. This debug
> > check should give us some peace of mind. Hopefully this patch will also
> > raise awareness among people who are working on this.
> >
> > ---
> > mm/zswap.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index b9b35ef86d9be..6007252429bb2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > if (!entry)
> > return false;
> >
> > + /* Zswap loads do not handle large folio swapins correctly yet */
> > + VM_BUG_ON(folio_test_large(folio));
> > +
>
> There is no way we could have a WARN_ON_ONCE() and recover, right?
Not without making more fundamental changes to the surrounding swap
code. Currently zswap_load() returns either true (folio was loaded
from zswap) or false (folio is not in zswap).
To handle this correctly zswap_load() would need to tell
swap_read_folio() which subpages are in zswap and have been loaded,
and then swap_read_folio() would need to read the remaining subpages
from disk. This of course assumes that the caller of swap_read_folio()
made sure that the entire folio is swapped out and protected against
races with other swapins.
Also, because swap_read_folio() cannot split the folio itself, other
swap_read_folio_*() functions that are called from it should be
updated to handle swapping in tail subpages, which may be questionable
in its own right.
An alternative would be that zswap_load() (or a separate interface)
could tell swap_read_folio() that the folio is partially in zswap,
then we can just bail and tell the caller that it cannot read the
large folio and that it should be split.
There may be other options as well, but the bottom line is that it is
possible, but probably not something that we want to do right now.
A stronger protection method would be to introduce a config option or
boot parameter for large folio swapin, and then make CONFIG_ZSWAP
depend on it being disabled, or have zswap check it at boot and refuse
to be enabled if it is on.
On Fri, Jun 7, 2024 at 8:32 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 06.06.24 20:48, Yosry Ahmed wrote:
> > > With ongoing work to support large folio swapin, it is important to make
> > > sure we do not pass large folios to zswap_load() without implementing
> > > proper support.
> > >
> > > For example, if a swapin fault observes that contiguous PTEs are
> > > pointing to contiguous swap entries and tries to swap them in as a large
> > > folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > > zswap_load() will only effectively load the first page in the folio. If
> > > the first page is not in zswap, the folio will be read from disk, even
> > > though other pages may be in zswap.
> > >
> > > In both cases, this will lead to silent data corruption.
> > >
> > > Proper large folio swapin support needs to go into zswap before zswap
> > > can be enabled in a system that supports large folio swapin.
> > >
> > > Looking at callers of swap_read_folio(), it seems like they are either
> > > allocated from __read_swap_cache_async() or do_swap_page() in the
> > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > > are fine for now.
> > >
> > > Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > > the order of those allocations without proper handling of zswap.
> > >
> > > Alternatively, swap_read_folio() (or its callers) can be updated to have
> > > a fallback mechanism that splits large folios or reads subpages
> > > separately. Similar logic may be needed anyway in case part of a large
> > > folio is already in the swapcache and the rest of it is swapped out.
> > >
> > > Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Barry Song <[email protected]>
this has been observed by me[1], that's why you can find the below
code in my swapin patch
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+ ...
+ /*
+ * a large folio being swapped-in could be partially in
+ * zswap and partially in swap devices, zswap doesn't
+ * support large folios yet, we might get corrupted
+ * zero-filled data by reading all subpages from swap
+ * devices while some of them are actually in zswap
+ */
+ if (is_zswap_enabled())
+ goto fallback;
[1] https://lore.kernel.org/linux-mm/[email protected]/
> > > ---
> > >
> > > Sorry for the long CC list, I just found myself repeatedly looking at
> > > new series that add swap support for mTHPs / large folios, making sure
> > > they do not break with zswap or make incorrect assumptions. This debug
> > > check should give us some peace of mind. Hopefully this patch will also
> > > raise awareness among people who are working on this.
> > >
> > > ---
> > > mm/zswap.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index b9b35ef86d9be..6007252429bb2 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > > if (!entry)
> > > return false;
> > >
> > > + /* Zswap loads do not handle large folio swapins correctly yet */
> > > + VM_BUG_ON(folio_test_large(folio));
> > > +
> >
> > There is no way we could have a WARN_ON_ONCE() and recover, right?
>
> Not without making more fundamental changes to the surrounding swap
> code. Currently zswap_load() returns either true (folio was loaded
> from zswap) or false (folio is not in zswap).
>
> To handle this correctly zswap_load() would need to tell
> swap_read_folio() which subpages are in zswap and have been loaded,
> and then swap_read_folio() would need to read the remaining subpages
> from disk. This of course assumes that the caller of swap_read_folio()
> made sure that the entire folio is swapped out and protected against
> races with other swapins.
>
> Also, because swap_read_folio() cannot split the folio itself, other
> swap_read_folio_*() functions that are called from it should be
> updated to handle swapping in tail subpages, which may be questionable
> in its own right.
>
> An alternative would be that zswap_load() (or a separate interface)
> could tell swap_read_folio() that the folio is partially in zswap,
> then we can just bail and tell the caller that it cannot read the
> large folio and that it should be split.
>
> There may be other options as well, but the bottom line is that it is
> possible, but probably not something that we want to do right now.
>
> A stronger protection method would be to introduce a config option or
> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> depend on it being disabled, or have zswap check it at boot and refuse
> to be enabled if it is on.
Thanks
Barry
On Thu, Jun 6, 2024 at 1:55 PM Barry Song <[email protected]> wrote:
>
> On Fri, Jun 7, 2024 at 8:32 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> > >
> > > On 06.06.24 20:48, Yosry Ahmed wrote:
> > > > With ongoing work to support large folio swapin, it is important to make
> > > > sure we do not pass large folios to zswap_load() without implementing
> > > > proper support.
> > > >
> > > > For example, if a swapin fault observes that contiguous PTEs are
> > > > pointing to contiguous swap entries and tries to swap them in as a large
> > > > folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > > > zswap_load() will only effectively load the first page in the folio. If
> > > > the first page is not in zswap, the folio will be read from disk, even
> > > > though other pages may be in zswap.
> > > >
> > > > In both cases, this will lead to silent data corruption.
> > > >
> > > > Proper large folio swapin support needs to go into zswap before zswap
> > > > can be enabled in a system that supports large folio swapin.
> > > >
> > > > Looking at callers of swap_read_folio(), it seems like they are either
> > > > allocated from __read_swap_cache_async() or do_swap_page() in the
> > > > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > > > are fine for now.
> > > >
> > > > Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > > > the order of those allocations without proper handling of zswap.
> > > >
> > > > Alternatively, swap_read_folio() (or its callers) can be updated to have
> > > > a fallback mechanism that splits large folios or reads subpages
> > > > separately. Similar logic may be needed anyway in case part of a large
> > > > folio is already in the swapcache and the rest of it is swapped out.
> > > >
> > > > Signed-off-by: Yosry Ahmed <[email protected]>
>
> Acked-by: Barry Song <[email protected]>
Thanks!
>
> this has been observed by me[1], that's why you can find the below
> code in my swapin patch
Thanks for the pointer. I suppose if we add more generic swapin
support we'll have to add a similar check in
__read_swap_cache_async(), unless we are planning to reuse
alloc_swap_folio() there.
It would be nice if we can have a global check for this rather than
add it to all different swapin paths (although I suspect there are
only two allocations paths right now). We can always disable zswap if
mTHP swapin is enabled, or always disable mTHP swapin if zswap is
enabled. At least until proper support is added.
>
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> + ...
> + /*
> + * a large folio being swapped-in could be partially in
> + * zswap and partially in swap devices, zswap doesn't
> + * support large folios yet, we might get corrupted
> + * zero-filled data by reading all subpages from swap
> + * devices while some of them are actually in zswap
> + */
> + if (is_zswap_enabled())
> + goto fallback;
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
On 06.06.24 22:31, Yosry Ahmed wrote:
> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 06.06.24 20:48, Yosry Ahmed wrote:
>>> With ongoing work to support large folio swapin, it is important to make
>>> sure we do not pass large folios to zswap_load() without implementing
>>> proper support.
>>>
>>> For example, if a swapin fault observes that contiguous PTEs are
>>> pointing to contiguous swap entries and tries to swap them in as a large
>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
>>> zswap_load() will only effectively load the first page in the folio. If
>>> the first page is not in zswap, the folio will be read from disk, even
>>> though other pages may be in zswap.
>>>
>>> In both cases, this will lead to silent data corruption.
>>>
>>> Proper large folio swapin support needs to go into zswap before zswap
>>> can be enabled in a system that supports large folio swapin.
>>>
>>> Looking at callers of swap_read_folio(), it seems like they are either
>>> allocated from __read_swap_cache_async() or do_swap_page() in the
>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
>>> are fine for now.
>>>
>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
>>> the order of those allocations without proper handling of zswap.
>>>
>>> Alternatively, swap_read_folio() (or its callers) can be updated to have
>>> a fallback mechanism that splits large folios or reads subpages
>>> separately. Similar logic may be needed anyway in case part of a large
>>> folio is already in the swapcache and the rest of it is swapped out.
>>>
>>> Signed-off-by: Yosry Ahmed <[email protected]>
>>> ---
>>>
>>> Sorry for the long CC list, I just found myself repeatedly looking at
>>> new series that add swap support for mTHPs / large folios, making sure
>>> they do not break with zswap or make incorrect assumptions. This debug
>>> check should give us some peace of mind. Hopefully this patch will also
>>> raise awareness among people who are working on this.
>>>
>>> ---
>>> mm/zswap.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>> index b9b35ef86d9be..6007252429bb2 100644
>>> --- a/mm/zswap.c
>>> +++ b/mm/zswap.c
>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
>>> if (!entry)
>>> return false;
>>>
>>> + /* Zswap loads do not handle large folio swapins correctly yet */
>>> + VM_BUG_ON(folio_test_large(folio));
>>> +
>>
>> There is no way we could have a WARN_ON_ONCE() and recover, right?
>
> Not without making more fundamental changes to the surrounding swap
> code. Currently zswap_load() returns either true (folio was loaded
> from zswap) or false (folio is not in zswap).
>
> To handle this correctly zswap_load() would need to tell
> swap_read_folio() which subpages are in zswap and have been loaded,
> and then swap_read_folio() would need to read the remaining subpages
> from disk. This of course assumes that the caller of swap_read_folio()
> made sure that the entire folio is swapped out and protected against
> races with other swapins.
>
> Also, because swap_read_folio() cannot split the folio itself, other
> swap_read_folio_*() functions that are called from it should be
> updated to handle swapping in tail subpages, which may be questionable
> in its own right.
>
> An alternative would be that zswap_load() (or a separate interface)
> could tell swap_read_folio() that the folio is partially in zswap,
> then we can just bail and tell the caller that it cannot read the
> large folio and that it should be split.
>
> There may be other options as well, but the bottom line is that it is
> possible, but probably not something that we want to do right now.
>
> A stronger protection method would be to introduce a config option or
> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> depend on it being disabled, or have zswap check it at boot and refuse
> to be enabled if it is on.
Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
I was wondering, if we could WARN_ON_ONCE and make the swap code detect
this like a read-error from disk.
I think do_swap_page() detects that by checking if the folio is not
uptodate:
if (unlikely(!folio_test_uptodate(folio))) {
ret = VM_FAULT_SIGBUS;
goto out_nomap;
}
So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
system (but the app would crash either way, there is no way around it).
--
Cheers,
David / dhildenb
On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <[email protected]> wrote:
>
> On 06.06.24 22:31, Yosry Ahmed wrote:
> > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 06.06.24 20:48, Yosry Ahmed wrote:
> >>> With ongoing work to support large folio swapin, it is important to make
> >>> sure we do not pass large folios to zswap_load() without implementing
> >>> proper support.
> >>>
> >>> For example, if a swapin fault observes that contiguous PTEs are
> >>> pointing to contiguous swap entries and tries to swap them in as a large
> >>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> >>> zswap_load() will only effectively load the first page in the folio. If
> >>> the first page is not in zswap, the folio will be read from disk, even
> >>> though other pages may be in zswap.
> >>>
> >>> In both cases, this will lead to silent data corruption.
> >>>
> >>> Proper large folio swapin support needs to go into zswap before zswap
> >>> can be enabled in a system that supports large folio swapin.
> >>>
> >>> Looking at callers of swap_read_folio(), it seems like they are either
> >>> allocated from __read_swap_cache_async() or do_swap_page() in the
> >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> >>> are fine for now.
> >>>
> >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> >>> the order of those allocations without proper handling of zswap.
> >>>
> >>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> >>> a fallback mechanism that splits large folios or reads subpages
> >>> separately. Similar logic may be needed anyway in case part of a large
> >>> folio is already in the swapcache and the rest of it is swapped out.
> >>>
> >>> Signed-off-by: Yosry Ahmed <[email protected]>
> >>> ---
> >>>
> >>> Sorry for the long CC list, I just found myself repeatedly looking at
> >>> new series that add swap support for mTHPs / large folios, making sure
> >>> they do not break with zswap or make incorrect assumptions. This debug
> >>> check should give us some peace of mind. Hopefully this patch will also
> >>> raise awareness among people who are working on this.
> >>>
> >>> ---
> >>> mm/zswap.c | 3 +++
> >>> 1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index b9b35ef86d9be..6007252429bb2 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> >>> if (!entry)
> >>> return false;
> >>>
> >>> + /* Zswap loads do not handle large folio swapins correctly yet */
> >>> + VM_BUG_ON(folio_test_large(folio));
> >>> +
> >>
> >> There is no way we could have a WARN_ON_ONCE() and recover, right?
> >
> > Not without making more fundamental changes to the surrounding swap
> > code. Currently zswap_load() returns either true (folio was loaded
> > from zswap) or false (folio is not in zswap).
> >
> > To handle this correctly zswap_load() would need to tell
> > swap_read_folio() which subpages are in zswap and have been loaded,
> > and then swap_read_folio() would need to read the remaining subpages
> > from disk. This of course assumes that the caller of swap_read_folio()
> > made sure that the entire folio is swapped out and protected against
> > races with other swapins.
> >
> > Also, because swap_read_folio() cannot split the folio itself, other
> > swap_read_folio_*() functions that are called from it should be
> > updated to handle swapping in tail subpages, which may be questionable
> > in its own right.
> >
> > An alternative would be that zswap_load() (or a separate interface)
> > could tell swap_read_folio() that the folio is partially in zswap,
> > then we can just bail and tell the caller that it cannot read the
> > large folio and that it should be split.
> >
> > There may be other options as well, but the bottom line is that it is
> > possible, but probably not something that we want to do right now.
> >
> > A stronger protection method would be to introduce a config option or
> > boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > depend on it being disabled, or have zswap check it at boot and refuse
> > to be enabled if it is on.
>
> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
>
> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> this like a read-error from disk.
>
> I think do_swap_page() detects that by checking if the folio is not
> uptodate:
>
> if (unlikely(!folio_test_uptodate(folio))) {
> ret = VM_FAULT_SIGBUS;
> goto out_nomap;
> }
>
> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> system (but the app would crash either way, there is no way around it).
>
I'd rather fallback to small folios swapin instead crashing apps till we fix
the large folio swapin in zswap :-)
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+ ...
+
+ if (is_zswap_enabled())
+ goto fallback;
> --
> Cheers,
>
> David / dhildenb
Thanks
Barry
On Thu, Jun 6, 2024 at 2:17 PM David Hildenbrand <[email protected]> wrote:
>
> On 06.06.24 22:31, Yosry Ahmed wrote:
> > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 06.06.24 20:48, Yosry Ahmed wrote:
> >>> With ongoing work to support large folio swapin, it is important to make
> >>> sure we do not pass large folios to zswap_load() without implementing
> >>> proper support.
> >>>
> >>> For example, if a swapin fault observes that contiguous PTEs are
> >>> pointing to contiguous swap entries and tries to swap them in as a large
> >>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> >>> zswap_load() will only effectively load the first page in the folio. If
> >>> the first page is not in zswap, the folio will be read from disk, even
> >>> though other pages may be in zswap.
> >>>
> >>> In both cases, this will lead to silent data corruption.
> >>>
> >>> Proper large folio swapin support needs to go into zswap before zswap
> >>> can be enabled in a system that supports large folio swapin.
> >>>
> >>> Looking at callers of swap_read_folio(), it seems like they are either
> >>> allocated from __read_swap_cache_async() or do_swap_page() in the
> >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> >>> are fine for now.
> >>>
> >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> >>> the order of those allocations without proper handling of zswap.
> >>>
> >>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> >>> a fallback mechanism that splits large folios or reads subpages
> >>> separately. Similar logic may be needed anyway in case part of a large
> >>> folio is already in the swapcache and the rest of it is swapped out.
> >>>
> >>> Signed-off-by: Yosry Ahmed <[email protected]>
> >>> ---
> >>>
> >>> Sorry for the long CC list, I just found myself repeatedly looking at
> >>> new series that add swap support for mTHPs / large folios, making sure
> >>> they do not break with zswap or make incorrect assumptions. This debug
> >>> check should give us some peace of mind. Hopefully this patch will also
> >>> raise awareness among people who are working on this.
> >>>
> >>> ---
> >>> mm/zswap.c | 3 +++
> >>> 1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index b9b35ef86d9be..6007252429bb2 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> >>> if (!entry)
> >>> return false;
> >>>
> >>> + /* Zswap loads do not handle large folio swapins correctly yet */
> >>> + VM_BUG_ON(folio_test_large(folio));
> >>> +
> >>
> >> There is no way we could have a WARN_ON_ONCE() and recover, right?
> >
> > Not without making more fundamental changes to the surrounding swap
> > code. Currently zswap_load() returns either true (folio was loaded
> > from zswap) or false (folio is not in zswap).
> >
> > To handle this correctly zswap_load() would need to tell
> > swap_read_folio() which subpages are in zswap and have been loaded,
> > and then swap_read_folio() would need to read the remaining subpages
> > from disk. This of course assumes that the caller of swap_read_folio()
> > made sure that the entire folio is swapped out and protected against
> > races with other swapins.
> >
> > Also, because swap_read_folio() cannot split the folio itself, other
> > swap_read_folio_*() functions that are called from it should be
> > updated to handle swapping in tail subpages, which may be questionable
> > in its own right.
> >
> > An alternative would be that zswap_load() (or a separate interface)
> > could tell swap_read_folio() that the folio is partially in zswap,
> > then we can just bail and tell the caller that it cannot read the
> > large folio and that it should be split.
> >
> > There may be other options as well, but the bottom line is that it is
> > possible, but probably not something that we want to do right now.
> >
> > A stronger protection method would be to introduce a config option or
> > boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > depend on it being disabled, or have zswap check it at boot and refuse
> > to be enabled if it is on.
>
> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
>
> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> this like a read-error from disk.
>
> I think do_swap_page() detects that by checking if the folio is not
> uptodate:
>
> if (unlikely(!folio_test_uptodate(folio))) {
> ret = VM_FAULT_SIGBUS;
> goto out_nomap;
> }
>
> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> system (but the app would crash either way, there is no way around it).
It seems like most paths will handle this correctly just if the folio
is not uptodate. do_swap_page() seems like it will work correctly
whether
swapin_readahead() and the direct call to swap_read_folio() in
do_swap_page() should work correctly in this case. The shmem swapin
path seems like it will return -EIO, which in the fault path will also
sigbus, and in the file read/write path I assume will be handled
correctly.
However, looking at the swapoff paths, it seems like we don't really
check uptodate. For example, shmem_unuse_swap_entries() will just
throw -EIO away. Maybe it is handled on a higher level by the fact
that the number of swap entries will not drop to zero so swapoff will
not complete? :)
Anyway, I believe it may be possible to just not set uptodate, but I
am not very sure how reliable it will be. It may be better than
nothing anyway, I guess?
On Thu, Jun 6, 2024 at 2:30 PM Barry Song <[email protected]> wrote:
>
> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <[email protected]> wrote:
> >
> > On 06.06.24 22:31, Yosry Ahmed wrote:
> > > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> > >>
> > >> On 06.06.24 20:48, Yosry Ahmed wrote:
> > >>> With ongoing work to support large folio swapin, it is important to make
> > >>> sure we do not pass large folios to zswap_load() without implementing
> > >>> proper support.
> > >>>
> > >>> For example, if a swapin fault observes that contiguous PTEs are
> > >>> pointing to contiguous swap entries and tries to swap them in as a large
> > >>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > >>> zswap_load() will only effectively load the first page in the folio. If
> > >>> the first page is not in zswap, the folio will be read from disk, even
> > >>> though other pages may be in zswap.
> > >>>
> > >>> In both cases, this will lead to silent data corruption.
> > >>>
> > >>> Proper large folio swapin support needs to go into zswap before zswap
> > >>> can be enabled in a system that supports large folio swapin.
> > >>>
> > >>> Looking at callers of swap_read_folio(), it seems like they are either
> > >>> allocated from __read_swap_cache_async() or do_swap_page() in the
> > >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > >>> are fine for now.
> > >>>
> > >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > >>> the order of those allocations without proper handling of zswap.
> > >>>
> > >>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> > >>> a fallback mechanism that splits large folios or reads subpages
> > >>> separately. Similar logic may be needed anyway in case part of a large
> > >>> folio is already in the swapcache and the rest of it is swapped out.
> > >>>
> > >>> Signed-off-by: Yosry Ahmed <[email protected]>
> > >>> ---
> > >>>
> > >>> Sorry for the long CC list, I just found myself repeatedly looking at
> > >>> new series that add swap support for mTHPs / large folios, making sure
> > >>> they do not break with zswap or make incorrect assumptions. This debug
> > >>> check should give us some peace of mind. Hopefully this patch will also
> > >>> raise awareness among people who are working on this.
> > >>>
> > >>> ---
> > >>> mm/zswap.c | 3 +++
> > >>> 1 file changed, 3 insertions(+)
> > >>>
> > >>> diff --git a/mm/zswap.c b/mm/zswap.c
> > >>> index b9b35ef86d9be..6007252429bb2 100644
> > >>> --- a/mm/zswap.c
> > >>> +++ b/mm/zswap.c
> > >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > >>> if (!entry)
> > >>> return false;
> > >>>
> > >>> + /* Zswap loads do not handle large folio swapins correctly yet */
> > >>> + VM_BUG_ON(folio_test_large(folio));
> > >>> +
> > >>
> > >> There is no way we could have a WARN_ON_ONCE() and recover, right?
> > >
> > > Not without making more fundamental changes to the surrounding swap
> > > code. Currently zswap_load() returns either true (folio was loaded
> > > from zswap) or false (folio is not in zswap).
> > >
> > > To handle this correctly zswap_load() would need to tell
> > > swap_read_folio() which subpages are in zswap and have been loaded,
> > > and then swap_read_folio() would need to read the remaining subpages
> > > from disk. This of course assumes that the caller of swap_read_folio()
> > > made sure that the entire folio is swapped out and protected against
> > > races with other swapins.
> > >
> > > Also, because swap_read_folio() cannot split the folio itself, other
> > > swap_read_folio_*() functions that are called from it should be
> > > updated to handle swapping in tail subpages, which may be questionable
> > > in its own right.
> > >
> > > An alternative would be that zswap_load() (or a separate interface)
> > > could tell swap_read_folio() that the folio is partially in zswap,
> > > then we can just bail and tell the caller that it cannot read the
> > > large folio and that it should be split.
> > >
> > > There may be other options as well, but the bottom line is that it is
> > > possible, but probably not something that we want to do right now.
> > >
> > > A stronger protection method would be to introduce a config option or
> > > boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > > depend on it being disabled, or have zswap check it at boot and refuse
> > > to be enabled if it is on.
> >
> > Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> >
> > I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> > this like a read-error from disk.
> >
> > I think do_swap_page() detects that by checking if the folio is not
> > uptodate:
> >
> > if (unlikely(!folio_test_uptodate(folio))) {
> > ret = VM_FAULT_SIGBUS;
> > goto out_nomap;
> > }
> >
> > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > system (but the app would crash either way, there is no way around it).
> >
>
> I'd rather fallback to small folios swapin instead crashing apps till we fix
> the large folio swapin in zswap :-)
I think David is referring to catching the buggy cases that do not
properly fallback to small folios with zswap, not as an alternative to
the fallback. This is at least what I had in mind with the patch.
On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <[email protected]> wrote:
> >
> > On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <[email protected]> wrote:
> > >
> > > On 06.06.24 22:31, Yosry Ahmed wrote:
> > > > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> > > >>
> > > >> On 06.06.24 20:48, Yosry Ahmed wrote:
> > > >>> With ongoing work to support large folio swapin, it is important to make
> > > >>> sure we do not pass large folios to zswap_load() without implementing
> > > >>> proper support.
> > > >>>
> > > >>> For example, if a swapin fault observes that contiguous PTEs are
> > > >>> pointing to contiguous swap entries and tries to swap them in as a large
> > > >>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > > >>> zswap_load() will only effectively load the first page in the folio. If
> > > >>> the first page is not in zswap, the folio will be read from disk, even
> > > >>> though other pages may be in zswap.
> > > >>>
> > > >>> In both cases, this will lead to silent data corruption.
> > > >>>
> > > >>> Proper large folio swapin support needs to go into zswap before zswap
> > > >>> can be enabled in a system that supports large folio swapin.
> > > >>>
> > > >>> Looking at callers of swap_read_folio(), it seems like they are either
> > > >>> allocated from __read_swap_cache_async() or do_swap_page() in the
> > > >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > > >>> are fine for now.
> > > >>>
> > > >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > > >>> the order of those allocations without proper handling of zswap.
> > > >>>
> > > >>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> > > >>> a fallback mechanism that splits large folios or reads subpages
> > > >>> separately. Similar logic may be needed anyway in case part of a large
> > > >>> folio is already in the swapcache and the rest of it is swapped out.
> > > >>>
> > > >>> Signed-off-by: Yosry Ahmed <[email protected]>
> > > >>> ---
> > > >>>
> > > >>> Sorry for the long CC list, I just found myself repeatedly looking at
> > > >>> new series that add swap support for mTHPs / large folios, making sure
> > > >>> they do not break with zswap or make incorrect assumptions. This debug
> > > >>> check should give us some peace of mind. Hopefully this patch will also
> > > >>> raise awareness among people who are working on this.
> > > >>>
> > > >>> ---
> > > >>> mm/zswap.c | 3 +++
> > > >>> 1 file changed, 3 insertions(+)
> > > >>>
> > > >>> diff --git a/mm/zswap.c b/mm/zswap.c
> > > >>> index b9b35ef86d9be..6007252429bb2 100644
> > > >>> --- a/mm/zswap.c
> > > >>> +++ b/mm/zswap.c
> > > >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > > >>> if (!entry)
> > > >>> return false;
> > > >>>
> > > >>> + /* Zswap loads do not handle large folio swapins correctly yet */
> > > >>> + VM_BUG_ON(folio_test_large(folio));
> > > >>> +
> > > >>
> > > >> There is no way we could have a WARN_ON_ONCE() and recover, right?
> > > >
> > > > Not without making more fundamental changes to the surrounding swap
> > > > code. Currently zswap_load() returns either true (folio was loaded
> > > > from zswap) or false (folio is not in zswap).
> > > >
> > > > To handle this correctly zswap_load() would need to tell
> > > > swap_read_folio() which subpages are in zswap and have been loaded,
> > > > and then swap_read_folio() would need to read the remaining subpages
> > > > from disk. This of course assumes that the caller of swap_read_folio()
> > > > made sure that the entire folio is swapped out and protected against
> > > > races with other swapins.
> > > >
> > > > Also, because swap_read_folio() cannot split the folio itself, other
> > > > swap_read_folio_*() functions that are called from it should be
> > > > updated to handle swapping in tail subpages, which may be questionable
> > > > in its own right.
> > > >
> > > > An alternative would be that zswap_load() (or a separate interface)
> > > > could tell swap_read_folio() that the folio is partially in zswap,
> > > > then we can just bail and tell the caller that it cannot read the
> > > > large folio and that it should be split.
> > > >
> > > > There may be other options as well, but the bottom line is that it is
> > > > possible, but probably not something that we want to do right now.
> > > >
> > > > A stronger protection method would be to introduce a config option or
> > > > boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > > > depend on it being disabled, or have zswap check it at boot and refuse
> > > > to be enabled if it is on.
> > >
> > > Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> > >
> > > I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> > > this like a read-error from disk.
> > >
> > > I think do_swap_page() detects that by checking if the folio is not
> > > uptodate:
> > >
> > > if (unlikely(!folio_test_uptodate(folio))) {
> > > ret = VM_FAULT_SIGBUS;
> > > goto out_nomap;
> > > }
> > >
> > > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > > system (but the app would crash either way, there is no way around it).
> > >
> >
> > I'd rather fallback to small folios swapin instead crashing apps till we fix
> > the large folio swapin in zswap :-)
>
> I think David is referring to catching the buggy cases that do not
> properly fallback to small folios with zswap, not as an alternative to
> the fallback. This is at least what I had in mind with the patch.
Cool. Thanks for the clarification. I'm fine with keeping the fallback,
whether it's the current VM_BUG_ON or David's recommended
SIGBUS.
Currently, mainline doesn't support large folios swap-in, so I see
your patch as a helpful warning for those attempting large folio
swap-in to avoid making mistakes like loading large folios from
zswap.
In fact, I spent a week trying to figure out why my app was crashing
before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
had come earlier, it would have saved me that week of work :-)
To me, a direct crash seems like a more straightforward way to
prompt people to fallback when attempting large folio swap-ins.
So, I am slightly in favor of your current patch.
Thanks
Barry
On 06.06.24 23:53, Barry Song wrote:
> On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <[email protected]> wrote:
>>
>> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <[email protected]> wrote:
>>>
>>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 06.06.24 22:31, Yosry Ahmed wrote:
>>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
>>>>>>
>>>>>> On 06.06.24 20:48, Yosry Ahmed wrote:
>>>>>>> With ongoing work to support large folio swapin, it is important to make
>>>>>>> sure we do not pass large folios to zswap_load() without implementing
>>>>>>> proper support.
>>>>>>>
>>>>>>> For example, if a swapin fault observes that contiguous PTEs are
>>>>>>> pointing to contiguous swap entries and tries to swap them in as a large
>>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
>>>>>>> zswap_load() will only effectively load the first page in the folio. If
>>>>>>> the first page is not in zswap, the folio will be read from disk, even
>>>>>>> though other pages may be in zswap.
>>>>>>>
>>>>>>> In both cases, this will lead to silent data corruption.
>>>>>>>
>>>>>>> Proper large folio swapin support needs to go into zswap before zswap
>>>>>>> can be enabled in a system that supports large folio swapin.
>>>>>>>
>>>>>>> Looking at callers of swap_read_folio(), it seems like they are either
>>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the
>>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
>>>>>>> are fine for now.
>>>>>>>
>>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
>>>>>>> the order of those allocations without proper handling of zswap.
>>>>>>>
>>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have
>>>>>>> a fallback mechanism that splits large folios or reads subpages
>>>>>>> separately. Similar logic may be needed anyway in case part of a large
>>>>>>> folio is already in the swapcache and the rest of it is swapped out.
>>>>>>>
>>>>>>> Signed-off-by: Yosry Ahmed <[email protected]>
>>>>>>> ---
>>>>>>>
>>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at
>>>>>>> new series that add swap support for mTHPs / large folios, making sure
>>>>>>> they do not break with zswap or make incorrect assumptions. This debug
>>>>>>> check should give us some peace of mind. Hopefully this patch will also
>>>>>>> raise awareness among people who are working on this.
>>>>>>>
>>>>>>> ---
>>>>>>> mm/zswap.c | 3 +++
>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>>>>> index b9b35ef86d9be..6007252429bb2 100644
>>>>>>> --- a/mm/zswap.c
>>>>>>> +++ b/mm/zswap.c
>>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
>>>>>>> if (!entry)
>>>>>>> return false;
>>>>>>>
>>>>>>> + /* Zswap loads do not handle large folio swapins correctly yet */
>>>>>>> + VM_BUG_ON(folio_test_large(folio));
>>>>>>> +
>>>>>>
>>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right?
>>>>>
>>>>> Not without making more fundamental changes to the surrounding swap
>>>>> code. Currently zswap_load() returns either true (folio was loaded
>>>>> from zswap) or false (folio is not in zswap).
>>>>>
>>>>> To handle this correctly zswap_load() would need to tell
>>>>> swap_read_folio() which subpages are in zswap and have been loaded,
>>>>> and then swap_read_folio() would need to read the remaining subpages
>>>>> from disk. This of course assumes that the caller of swap_read_folio()
>>>>> made sure that the entire folio is swapped out and protected against
>>>>> races with other swapins.
>>>>>
>>>>> Also, because swap_read_folio() cannot split the folio itself, other
>>>>> swap_read_folio_*() functions that are called from it should be
>>>>> updated to handle swapping in tail subpages, which may be questionable
>>>>> in its own right.
>>>>>
>>>>> An alternative would be that zswap_load() (or a separate interface)
>>>>> could tell swap_read_folio() that the folio is partially in zswap,
>>>>> then we can just bail and tell the caller that it cannot read the
>>>>> large folio and that it should be split.
>>>>>
>>>>> There may be other options as well, but the bottom line is that it is
>>>>> possible, but probably not something that we want to do right now.
>>>>>
>>>>> A stronger protection method would be to introduce a config option or
>>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
>>>>> depend on it being disabled, or have zswap check it at boot and refuse
>>>>> to be enabled if it is on.
>>>>
>>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
>>>>
>>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
>>>> this like a read-error from disk.
>>>>
>>>> I think do_swap_page() detects that by checking if the folio is not
>>>> uptodate:
>>>>
>>>> if (unlikely(!folio_test_uptodate(folio))) {
>>>> ret = VM_FAULT_SIGBUS;
>>>> goto out_nomap;
>>>> }
>>>>
>>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
>>>> system (but the app would crash either way, there is no way around it).
>>>>
>>>
>>> I'd rather fallback to small folios swapin instead crashing apps till we fix
>>> the large folio swapin in zswap :-)
>>
>> I think David is referring to catching the buggy cases that do not
>> properly fallback to small folios with zswap, not as an alternative to
>> the fallback. This is at least what I had in mind with the patch.
>
> Cool. Thanks for the clarification. I'm fine with keeping the fallback,
> whether it's the current VM_BUG_ON or David's recommended
> SIGBUS.
>
> Currently, mainline doesn't support large folios swap-in, so I see
> your patch as a helpful warning for those attempting large folio
> swap-in to avoid making mistakes like loading large folios from
> zswap.
>
> In fact, I spent a week trying to figure out why my app was crashing
> before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
> had come earlier, it would have saved me that week of work :-)
>
> To me, a direct crash seems like a more straightforward way to
> prompt people to fallback when attempting large folio swap-ins.
> So, I am slightly in favor of your current patch.
BUG_ON and friends are frowned-upon, in particular in scenarios where we
can recover or that are so unexpected that they can be found during
early testing.
I have no strong opinion on this one, but likely a VM_WARN_ON would also
be sufficient to find such issues early during testing. No need to crash
the machine.
--
Cheers,
David / dhildenb
On Fri, Jun 7, 2024 at 12:11 AM David Hildenbrand <[email protected]> wrote:
>
> On 06.06.24 23:53, Barry Song wrote:
> > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <[email protected]> wrote:
> >>
> >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <[email protected]> wrote:
> >>>
> >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <[email protected]> wrote:
> >>>>
> >>>> On 06.06.24 22:31, Yosry Ahmed wrote:
> >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> >>>>>>
> >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote:
> >>>>>>> With ongoing work to support large folio swapin, it is important to make
> >>>>>>> sure we do not pass large folios to zswap_load() without implementing
> >>>>>>> proper support.
> >>>>>>>
> >>>>>>> For example, if a swapin fault observes that contiguous PTEs are
> >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large
> >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> >>>>>>> zswap_load() will only effectively load the first page in the folio. If
> >>>>>>> the first page is not in zswap, the folio will be read from disk, even
> >>>>>>> though other pages may be in zswap.
> >>>>>>>
> >>>>>>> In both cases, this will lead to silent data corruption.
> >>>>>>>
> >>>>>>> Proper large folio swapin support needs to go into zswap before zswap
> >>>>>>> can be enabled in a system that supports large folio swapin.
> >>>>>>>
> >>>>>>> Looking at callers of swap_read_folio(), it seems like they are either
> >>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the
> >>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> >>>>>>> are fine for now.
> >>>>>>>
> >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> >>>>>>> the order of those allocations without proper handling of zswap.
> >>>>>>>
> >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> >>>>>>> a fallback mechanism that splits large folios or reads subpages
> >>>>>>> separately. Similar logic may be needed anyway in case part of a large
> >>>>>>> folio is already in the swapcache and the rest of it is swapped out.
> >>>>>>>
> >>>>>>> Signed-off-by: Yosry Ahmed <[email protected]>
> >>>>>>> ---
> >>>>>>>
> >>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at
> >>>>>>> new series that add swap support for mTHPs / large folios, making sure
> >>>>>>> they do not break with zswap or make incorrect assumptions. This debug
> >>>>>>> check should give us some peace of mind. Hopefully this patch will also
> >>>>>>> raise awareness among people who are working on this.
> >>>>>>>
> >>>>>>> ---
> >>>>>>> mm/zswap.c | 3 +++
> >>>>>>> 1 file changed, 3 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>>>>> index b9b35ef86d9be..6007252429bb2 100644
> >>>>>>> --- a/mm/zswap.c
> >>>>>>> +++ b/mm/zswap.c
> >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> >>>>>>> if (!entry)
> >>>>>>> return false;
> >>>>>>>
> >>>>>>> + /* Zswap loads do not handle large folio swapins correctly yet */
> >>>>>>> + VM_BUG_ON(folio_test_large(folio));
> >>>>>>> +
> >>>>>>
> >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right?
> >>>>>
> >>>>> Not without making more fundamental changes to the surrounding swap
> >>>>> code. Currently zswap_load() returns either true (folio was loaded
> >>>>> from zswap) or false (folio is not in zswap).
> >>>>>
> >>>>> To handle this correctly zswap_load() would need to tell
> >>>>> swap_read_folio() which subpages are in zswap and have been loaded,
> >>>>> and then swap_read_folio() would need to read the remaining subpages
> >>>>> from disk. This of course assumes that the caller of swap_read_folio()
> >>>>> made sure that the entire folio is swapped out and protected against
> >>>>> races with other swapins.
> >>>>>
> >>>>> Also, because swap_read_folio() cannot split the folio itself, other
> >>>>> swap_read_folio_*() functions that are called from it should be
> >>>>> updated to handle swapping in tail subpages, which may be questionable
> >>>>> in its own right.
> >>>>>
> >>>>> An alternative would be that zswap_load() (or a separate interface)
> >>>>> could tell swap_read_folio() that the folio is partially in zswap,
> >>>>> then we can just bail and tell the caller that it cannot read the
> >>>>> large folio and that it should be split.
> >>>>>
> >>>>> There may be other options as well, but the bottom line is that it is
> >>>>> possible, but probably not something that we want to do right now.
> >>>>>
> >>>>> A stronger protection method would be to introduce a config option or
> >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> >>>>> depend on it being disabled, or have zswap check it at boot and refuse
> >>>>> to be enabled if it is on.
> >>>>
> >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> >>>>
> >>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> >>>> this like a read-error from disk.
> >>>>
> >>>> I think do_swap_page() detects that by checking if the folio is not
> >>>> uptodate:
> >>>>
> >>>> if (unlikely(!folio_test_uptodate(folio))) {
> >>>> ret = VM_FAULT_SIGBUS;
> >>>> goto out_nomap;
> >>>> }
> >>>>
> >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> >>>> system (but the app would crash either way, there is no way around it).
> >>>>
> >>>
> >>> I'd rather fallback to small folios swapin instead crashing apps till we fix
> >>> the large folio swapin in zswap :-)
> >>
> >> I think David is referring to catching the buggy cases that do not
> >> properly fallback to small folios with zswap, not as an alternative to
> >> the fallback. This is at least what I had in mind with the patch.
> >
> > Cool. Thanks for the clarification. I'm fine with keeping the fallback,
> > whether it's the current VM_BUG_ON or David's recommended
> > SIGBUS.
> >
> > Currently, mainline doesn't support large folios swap-in, so I see
> > your patch as a helpful warning for those attempting large folio
> > swap-in to avoid making mistakes like loading large folios from
> > zswap.
> >
> > In fact, I spent a week trying to figure out why my app was crashing
> > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
> > had come earlier, it would have saved me that week of work :-)
> >
> > To me, a direct crash seems like a more straightforward way to
> > prompt people to fallback when attempting large folio swap-ins.
> > So, I am slightly in favor of your current patch.
>
> BUG_ON and friends are frowned-upon, in particular in scenarios where we
> can recover or that are so unexpected that they can be found during
> early testing.
>
> I have no strong opinion on this one, but likely a VM_WARN_ON would also
> be sufficient to find such issues early during testing. No need to crash
> the machine.
I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
some digging I found your patches to checkpatch and Linus clearly
stating that it isn't.
How about something like the following (untested), it is the minimal
recovery we can do but should work for a lot of cases, and does
nothing beyond a warning if we can swapin the large folio from disk:
diff --git a/mm/page_io.c b/mm/page_io.c
index f1a9cfab6e748..8f441dd8e109f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
swap_iocb **plug)
delayacct_swapin_start();
if (zswap_load(folio)) {
- folio_mark_uptodate(folio);
folio_unlock(folio);
} else if (data_race(sis->flags & SWP_FS_OPS)) {
swap_read_folio_fs(folio, plug);
diff --git a/mm/zswap.c b/mm/zswap.c
index 6007252429bb2..cc04db6bb217e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
VM_WARN_ON_ONCE(!folio_test_locked(folio));
+ /*
+ * Large folios should not be swapped in while zswap is being used, as
+ * they are not properly handled.
+ *
+ * If any of the subpages are in zswap, reading from disk would result
+ * in data corruption, so return true without marking the folio uptodate
+ * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
+ *
+ * Otherwise, return false and read the folio from disk.
+ */
+ if (WARN_ON_ONCE(folio_test_large(folio))) {
+ if (xa_find(tree, &offset, offset +
folio_nr_pages(folio) - 1, 0))
+ return true;
+ return false;
+ }
+
/*
* When reading into the swapcache, invalidate our entry. The
* swapcache can be the authoritative owner of the page and
@@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
zswap_entry_free(entry);
folio_mark_dirty(folio);
}
-
+ folio_mark_uptodate(folio);
return true;
}
One problem is that even if zswap was never enabled, the warning will
be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
static key if zswap was "ever" enabled.
Barry, I suspect your is_zswap_enabled() check is deficient for
similar reasons, zswap could have been enabled before then became
disabled.
>> I have no strong opinion on this one, but likely a VM_WARN_ON would also
>> be sufficient to find such issues early during testing. No need to crash
>> the machine.
>
> I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> some digging I found your patches to checkpatch and Linus clearly
> stating that it isn't.
:) yes.
VM_BUG_ON is not particularly helpful IMHO. If you want something to be
found early during testing, VM_WARN_ON is good enough.
Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are
primarily reported during early/development testing only. But maybe some
distro out there still sets it.
>
> How about something like the following (untested), it is the minimal
> recovery we can do but should work for a lot of cases, and does
> nothing beyond a warning if we can swapin the large folio from disk:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index f1a9cfab6e748..8f441dd8e109f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> swap_iocb **plug)
> delayacct_swapin_start();
>
> if (zswap_load(folio)) {
> - folio_mark_uptodate(folio);
> folio_unlock(folio);
> } else if (data_race(sis->flags & SWP_FS_OPS)) {
> swap_read_folio_fs(folio, plug);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6007252429bb2..cc04db6bb217e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> + /*
> + * Large folios should not be swapped in while zswap is being used, as
> + * they are not properly handled.
> + *
> + * If any of the subpages are in zswap, reading from disk would result
> + * in data corruption, so return true without marking the folio uptodate
> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> + *
> + * Otherwise, return false and read the folio from disk.
> + */
> + if (WARN_ON_ONCE(folio_test_large(folio))) {
> + if (xa_find(tree, &offset, offset +
> folio_nr_pages(folio) - 1, 0))
> + return true;
> + return false;
> + }
> +
> /*
> * When reading into the swapcache, invalidate our entry. The
> * swapcache can be the authoritative owner of the page and
> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> zswap_entry_free(entry);
> folio_mark_dirty(folio);
> }
> -
> + folio_mark_uptodate(folio);
> return true;
> }
>
> One problem is that even if zswap was never enabled, the warning will
> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> static key if zswap was "ever" enabled.
We should use WARN_ON_ONCE() only for things that cannot happen. So if
there are cases where this could be triggered today, it would be
problematic -- especially if it can be triggered from unprivileged user
space. But if we're concerned of other code messing up our invariant in
the future (e.g., enabling large folios without taking proper care about
zswap etc), we're good to add it.
--
Cheers,
David / dhildenb
On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <[email protected]> wrote:
>
> >> I have no strong opinion on this one, but likely a VM_WARN_ON would also
> >> be sufficient to find such issues early during testing. No need to crash
> >> the machine.
> >
> > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> > some digging I found your patches to checkpatch and Linus clearly
> > stating that it isn't.
>
> :) yes.
>
> VM_BUG_ON is not particularly helpful IMHO. If you want something to be
> found early during testing, VM_WARN_ON is good enough.
>
> Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are
> primarily reported during early/development testing only. But maybe some
> distro out there still sets it.
>
> >
> > How about something like the following (untested), it is the minimal
> > recovery we can do but should work for a lot of cases, and does
> > nothing beyond a warning if we can swapin the large folio from disk:
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index f1a9cfab6e748..8f441dd8e109f 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> > swap_iocb **plug)
> > delayacct_swapin_start();
> >
> > if (zswap_load(folio)) {
> > - folio_mark_uptodate(folio);
> > folio_unlock(folio);
> > } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > swap_read_folio_fs(folio, plug);
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 6007252429bb2..cc04db6bb217e 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
> >
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >
> > + /*
> > + * Large folios should not be swapped in while zswap is being used, as
> > + * they are not properly handled.
> > + *
> > + * If any of the subpages are in zswap, reading from disk would result
> > + * in data corruption, so return true without marking the folio uptodate
> > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > + *
> > + * Otherwise, return false and read the folio from disk.
> > + */
> > + if (WARN_ON_ONCE(folio_test_large(folio))) {
> > + if (xa_find(tree, &offset, offset +
> > folio_nr_pages(folio) - 1, 0))
> > + return true;
> > + return false;
> > + }
> > +
> > /*
> > * When reading into the swapcache, invalidate our entry. The
> > * swapcache can be the authoritative owner of the page and
> > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> > zswap_entry_free(entry);
> > folio_mark_dirty(folio);
> > }
> > -
> > + folio_mark_uptodate(folio);
> > return true;
> > }
> >
> > One problem is that even if zswap was never enabled, the warning will
> > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > static key if zswap was "ever" enabled.
>
> We should use WARN_ON_ONCE() only for things that cannot happen. So if
> there are cases where this could be triggered today, it would be
> problematic -- especially if it can be triggered from unprivileged user
> space. But if we're concerned of other code messing up our invariant in
> the future (e.g., enabling large folios without taking proper care about
> zswap etc), we're good to add it.
Right now I can't see any paths allocating large folios for swapin, so
I think it cannot happen. Once someone tries adding it, the warning
will fire if CONFIG_ZSWAP is used, even if zswap is disabled.
At this point we will have several options:
- Make large folios swapin depend on !CONFIG_ZSWAP for now.
- Keep track if zswap was ever enabled and make the warning
conditional on it. We should also always fallback to order-0 if zswap
was ever enabled.
- Properly handle large folio swapin with zswap.
Does this sound reasonable to you?
On 07.06.24 20:58, Yosry Ahmed wrote:
> On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <[email protected]> wrote:
>>
>>>> I have no strong opinion on this one, but likely a VM_WARN_ON would also
>>>> be sufficient to find such issues early during testing. No need to crash
>>>> the machine.
>>>
>>> I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
>>> some digging I found your patches to checkpatch and Linus clearly
>>> stating that it isn't.
>>
>> :) yes.
>>
>> VM_BUG_ON is not particularly helpful IMHO. If you want something to be
>> found early during testing, VM_WARN_ON is good enough.
>>
>> Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are
>> primarily reported during early/development testing only. But maybe some
>> distro out there still sets it.
>>
>>>
>>> How about something like the following (untested), it is the minimal
>>> recovery we can do but should work for a lot of cases, and does
>>> nothing beyond a warning if we can swapin the large folio from disk:
>>>
>>> diff --git a/mm/page_io.c b/mm/page_io.c
>>> index f1a9cfab6e748..8f441dd8e109f 100644
>>> --- a/mm/page_io.c
>>> +++ b/mm/page_io.c
>>> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
>>> swap_iocb **plug)
>>> delayacct_swapin_start();
>>>
>>> if (zswap_load(folio)) {
>>> - folio_mark_uptodate(folio);
>>> folio_unlock(folio);
>>> } else if (data_race(sis->flags & SWP_FS_OPS)) {
>>> swap_read_folio_fs(folio, plug);
>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>> index 6007252429bb2..cc04db6bb217e 100644
>>> --- a/mm/zswap.c
>>> +++ b/mm/zswap.c
>>> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
>>>
>>> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>
>>> + /*
>>> + * Large folios should not be swapped in while zswap is being used, as
>>> + * they are not properly handled.
>>> + *
>>> + * If any of the subpages are in zswap, reading from disk would result
>>> + * in data corruption, so return true without marking the folio uptodate
>>> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
>>> + *
>>> + * Otherwise, return false and read the folio from disk.
>>> + */
>>> + if (WARN_ON_ONCE(folio_test_large(folio))) {
>>> + if (xa_find(tree, &offset, offset +
>>> folio_nr_pages(folio) - 1, 0))
>>> + return true;
>>> + return false;
>>> + }
>>> +
>>> /*
>>> * When reading into the swapcache, invalidate our entry. The
>>> * swapcache can be the authoritative owner of the page and
>>> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
>>> zswap_entry_free(entry);
>>> folio_mark_dirty(folio);
>>> }
>>> -
>>> + folio_mark_uptodate(folio);
>>> return true;
>>> }
>>>
>>> One problem is that even if zswap was never enabled, the warning will
>>> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
>>> static key if zswap was "ever" enabled.
>>
>> We should use WARN_ON_ONCE() only for things that cannot happen. So if
>> there are cases where this could be triggered today, it would be
>> problematic -- especially if it can be triggered from unprivileged user
>> space. But if we're concerned of other code messing up our invariant in
>> the future (e.g., enabling large folios without taking proper care about
>> zswap etc), we're good to add it.
>
> Right now I can't see any paths allocating large folios for swapin, so
> I think it cannot happen. Once someone tries adding it, the warning
> will fire if CONFIG_ZSWAP is used, even if zswap is disabled.
>
> At this point we will have several options:
> - Make large folios swapin depend on !CONFIG_ZSWAP for now.
> - Keep track if zswap was ever enabled and make the warning
> conditional on it. We should also always fallback to order-0 if zswap
> was ever enabled.
> - Properly handle large folio swapin with zswap.
>
> Does this sound reasonable to you?
Yes, probably easiest is 1) -> 3) or if we don't want to glue it to
config options 2) -> 3).
--
Cheers,
David / dhildenb
On Fri, Jun 7, 2024 at 11:58 AM Yosry Ahmed <[email protected]> wrote:
>
> On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <[email protected]> wrote:
> >
> > >> I have no strong opinion on this one, but likely a VM_WARN_ON would also
> > >> be sufficient to find such issues early during testing. No need to crash
> > >> the machine.
> > >
> > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> > > some digging I found your patches to checkpatch and Linus clearly
> > > stating that it isn't.
> >
> > :) yes.
> >
> > VM_BUG_ON is not particularly helpful IMHO. If you want something to be
> > found early during testing, VM_WARN_ON is good enough.
> >
> > Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are
> > primarily reported during early/development testing only. But maybe some
> > distro out there still sets it.
> >
> > >
> > > How about something like the following (untested), it is the minimal
> > > recovery we can do but should work for a lot of cases, and does
> > > nothing beyond a warning if we can swapin the large folio from disk:
> > >
> > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > --- a/mm/page_io.c
> > > +++ b/mm/page_io.c
> > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> > > swap_iocb **plug)
> > > delayacct_swapin_start();
> > >
> > > if (zswap_load(folio)) {
> > > - folio_mark_uptodate(folio);
> > > folio_unlock(folio);
> > > } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > swap_read_folio_fs(folio, plug);
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 6007252429bb2..cc04db6bb217e 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
> > >
> > > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > >
> > > + /*
> > > + * Large folios should not be swapped in while zswap is being used, as
> > > + * they are not properly handled.
> > > + *
> > > + * If any of the subpages are in zswap, reading from disk would result
> > > + * in data corruption, so return true without marking the folio uptodate
> > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > + *
> > > + * Otherwise, return false and read the folio from disk.
> > > + */
> > > + if (WARN_ON_ONCE(folio_test_large(folio))) {
> > > + if (xa_find(tree, &offset, offset +
> > > folio_nr_pages(folio) - 1, 0))
> > > + return true;
> > > + return false;
> > > + }
> > > +
> > > /*
> > > * When reading into the swapcache, invalidate our entry. The
> > > * swapcache can be the authoritative owner of the page and
> > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> > > zswap_entry_free(entry);
> > > folio_mark_dirty(folio);
> > > }
> > > -
> > > + folio_mark_uptodate(folio);
> > > return true;
> > > }
> > >
> > > One problem is that even if zswap was never enabled, the warning will
> > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > static key if zswap was "ever" enabled.
> >
> > We should use WARN_ON_ONCE() only for things that cannot happen. So if
> > there are cases where this could be triggered today, it would be
> > problematic -- especially if it can be triggered from unprivileged user
> > space. But if we're concerned of other code messing up our invariant in
> > the future (e.g., enabling large folios without taking proper care about
> > zswap etc), we're good to add it.
>
> Right now I can't see any paths allocating large folios for swapin, so
> I think it cannot happen. Once someone tries adding it, the warning
> will fire if CONFIG_ZSWAP is used, even if zswap is disabled.
> At this point we will have several options:
Here is my take on this:
> - Make large folios swapin depend on !CONFIG_ZSWAP for now.
I think a WARON or BUG_ON is better. I would need to revert this
change when I am working on 3). It is a make up rule, not a real
dependency any way.
> - Keep track if zswap was ever enabled and make the warning
> conditional on it. We should also always fallback to order-0 if zswap
> was ever enabled.
IMHO, falling back to order-0 inside zswap is not desired because it
complicates the zswap code. We should not pass large folio to zswap if
zswap is not ready to handle large folio. The core swap already has
the fall back to order-0. If we get to 3), then this fall back in
zswap needs to be removed. It is a transitional thing then maybe not
introduce it in the first place.
> - Properly handle large folio swapin with zswap.
That obviously is ideal.
Chris
On Sat, Jun 8, 2024 at 5:43 AM Yosry Ahmed <[email protected]> wrote:
>
> On Fri, Jun 7, 2024 at 12:11 AM David Hildenbrand <[email protected]> wrote:
> >
> > On 06.06.24 23:53, Barry Song wrote:
> > > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <[email protected]> wrote:
> > >>
> > >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <[email protected]> wrote:
> > >>>
> > >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <[email protected]> wrote:
> > >>>>
> > >>>> On 06.06.24 22:31, Yosry Ahmed wrote:
> > >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <[email protected]> wrote:
> > >>>>>>
> > >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote:
> > >>>>>>> With ongoing work to support large folio swapin, it is important to make
> > >>>>>>> sure we do not pass large folios to zswap_load() without implementing
> > >>>>>>> proper support.
> > >>>>>>>
> > >>>>>>> For example, if a swapin fault observes that contiguous PTEs are
> > >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large
> > >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > >>>>>>> zswap_load() will only effectively load the first page in the folio. If
> > >>>>>>> the first page is not in zswap, the folio will be read from disk, even
> > >>>>>>> though other pages may be in zswap.
> > >>>>>>>
> > >>>>>>> In both cases, this will lead to silent data corruption.
> > >>>>>>>
> > >>>>>>> Proper large folio swapin support needs to go into zswap before zswap
> > >>>>>>> can be enabled in a system that supports large folio swapin.
> > >>>>>>>
> > >>>>>>> Looking at callers of swap_read_folio(), it seems like they are either
> > >>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the
> > >>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > >>>>>>> are fine for now.
> > >>>>>>>
> > >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > >>>>>>> the order of those allocations without proper handling of zswap.
> > >>>>>>>
> > >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have
> > >>>>>>> a fallback mechanism that splits large folios or reads subpages
> > >>>>>>> separately. Similar logic may be needed anyway in case part of a large
> > >>>>>>> folio is already in the swapcache and the rest of it is swapped out.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Yosry Ahmed <[email protected]>
> > >>>>>>> ---
> > >>>>>>>
> > >>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at
> > >>>>>>> new series that add swap support for mTHPs / large folios, making sure
> > >>>>>>> they do not break with zswap or make incorrect assumptions. This debug
> > >>>>>>> check should give us some peace of mind. Hopefully this patch will also
> > >>>>>>> raise awareness among people who are working on this.
> > >>>>>>>
> > >>>>>>> ---
> > >>>>>>> mm/zswap.c | 3 +++
> > >>>>>>> 1 file changed, 3 insertions(+)
> > >>>>>>>
> > >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
> > >>>>>>> index b9b35ef86d9be..6007252429bb2 100644
> > >>>>>>> --- a/mm/zswap.c
> > >>>>>>> +++ b/mm/zswap.c
> > >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> > >>>>>>> if (!entry)
> > >>>>>>> return false;
> > >>>>>>>
> > >>>>>>> + /* Zswap loads do not handle large folio swapins correctly yet */
> > >>>>>>> + VM_BUG_ON(folio_test_large(folio));
> > >>>>>>> +
> > >>>>>>
> > >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right?
> > >>>>>
> > >>>>> Not without making more fundamental changes to the surrounding swap
> > >>>>> code. Currently zswap_load() returns either true (folio was loaded
> > >>>>> from zswap) or false (folio is not in zswap).
> > >>>>>
> > >>>>> To handle this correctly zswap_load() would need to tell
> > >>>>> swap_read_folio() which subpages are in zswap and have been loaded,
> > >>>>> and then swap_read_folio() would need to read the remaining subpages
> > >>>>> from disk. This of course assumes that the caller of swap_read_folio()
> > >>>>> made sure that the entire folio is swapped out and protected against
> > >>>>> races with other swapins.
> > >>>>>
> > >>>>> Also, because swap_read_folio() cannot split the folio itself, other
> > >>>>> swap_read_folio_*() functions that are called from it should be
> > >>>>> updated to handle swapping in tail subpages, which may be questionable
> > >>>>> in its own right.
> > >>>>>
> > >>>>> An alternative would be that zswap_load() (or a separate interface)
> > >>>>> could tell swap_read_folio() that the folio is partially in zswap,
> > >>>>> then we can just bail and tell the caller that it cannot read the
> > >>>>> large folio and that it should be split.
> > >>>>>
> > >>>>> There may be other options as well, but the bottom line is that it is
> > >>>>> possible, but probably not something that we want to do right now.
> > >>>>>
> > >>>>> A stronger protection method would be to introduce a config option or
> > >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > >>>>> depend on it being disabled, or have zswap check it at boot and refuse
> > >>>>> to be enabled if it is on.
> > >>>>
> > >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
> > >>>>
> > >>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> > >>>> this like a read-error from disk.
> > >>>>
> > >>>> I think do_swap_page() detects that by checking if the folio is not
> > >>>> uptodate:
> > >>>>
> > >>>> if (unlikely(!folio_test_uptodate(folio))) {
> > >>>> ret = VM_FAULT_SIGBUS;
> > >>>> goto out_nomap;
> > >>>> }
> > >>>>
> > >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > >>>> system (but the app would crash either way, there is no way around it).
> > >>>>
> > >>>
> > >>> I'd rather fallback to small folios swapin instead crashing apps till we fix
> > >>> the large folio swapin in zswap :-)
> > >>
> > >> I think David is referring to catching the buggy cases that do not
> > >> properly fallback to small folios with zswap, not as an alternative to
> > >> the fallback. This is at least what I had in mind with the patch.
> > >
> > > Cool. Thanks for the clarification. I'm fine with keeping the fallback,
> > > whether it's the current VM_BUG_ON or David's recommended
> > > SIGBUS.
> > >
> > > Currently, mainline doesn't support large folios swap-in, so I see
> > > your patch as a helpful warning for those attempting large folio
> > > swap-in to avoid making mistakes like loading large folios from
> > > zswap.
> > >
> > > In fact, I spent a week trying to figure out why my app was crashing
> > > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch
> > > had come earlier, it would have saved me that week of work :-)
> > >
> > > To me, a direct crash seems like a more straightforward way to
> > > prompt people to fallback when attempting large folio swap-ins.
> > > So, I am slightly in favor of your current patch.
> >
> > BUG_ON and friends are frowned-upon, in particular in scenarios where we
> > can recover or that are so unexpected that they can be found during
> > early testing.
> >
> > I have no strong opinion on this one, but likely a VM_WARN_ON would also
> > be sufficient to find such issues early during testing. No need to crash
> > the machine.
>
> I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> some digging I found your patches to checkpatch and Linus clearly
> stating that it isn't.
>
> How about something like the following (untested), it is the minimal
> recovery we can do but should work for a lot of cases, and does
> nothing beyond a warning if we can swapin the large folio from disk:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index f1a9cfab6e748..8f441dd8e109f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> swap_iocb **plug)
> delayacct_swapin_start();
>
> if (zswap_load(folio)) {
> - folio_mark_uptodate(folio);
> folio_unlock(folio);
> } else if (data_race(sis->flags & SWP_FS_OPS)) {
> swap_read_folio_fs(folio, plug);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6007252429bb2..cc04db6bb217e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> + /*
> + * Large folios should not be swapped in while zswap is being used, as
> + * they are not properly handled.
> + *
> + * If any of the subpages are in zswap, reading from disk would result
> + * in data corruption, so return true without marking the folio uptodate
> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> + *
> + * Otherwise, return false and read the folio from disk.
> + */
> + if (WARN_ON_ONCE(folio_test_large(folio))) {
> + if (xa_find(tree, &offset, offset +
> folio_nr_pages(folio) - 1, 0))
> + return true;
> + return false;
> + }
> +
> /*
> * When reading into the swapcache, invalidate our entry. The
> * swapcache can be the authoritative owner of the page and
> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> zswap_entry_free(entry);
> folio_mark_dirty(folio);
> }
> -
> + folio_mark_uptodate(folio);
> return true;
> }
>
> One problem is that even if zswap was never enabled, the warning will
> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> static key if zswap was "ever" enabled.
>
> Barry, I suspect your is_zswap_enabled() check is deficient for
> similar reasons, zswap could have been enabled before then became
> disabled.
I don't understand this. if zswap was enabled before but is disabled when
I am loading data, will I get corrupted data before zswap was once enabled?
If not, it seems nothing important.
Thanks
Barry
On Sat, Jun 8, 2024 at 6:58 AM Yosry Ahmed <[email protected]> wrote:
>
> On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <[email protected]> wrote:
> >
> > >> I have no strong opinion on this one, but likely a VM_WARN_ON would also
> > >> be sufficient to find such issues early during testing. No need to crash
> > >> the machine.
> > >
> > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> > > some digging I found your patches to checkpatch and Linus clearly
> > > stating that it isn't.
> >
> > :) yes.
> >
> > VM_BUG_ON is not particularly helpful IMHO. If you want something to be
> > found early during testing, VM_WARN_ON is good enough.
> >
> > Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are
> > primarily reported during early/development testing only. But maybe some
> > distro out there still sets it.
> >
> > >
> > > How about something like the following (untested), it is the minimal
> > > recovery we can do but should work for a lot of cases, and does
> > > nothing beyond a warning if we can swapin the large folio from disk:
> > >
> > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > --- a/mm/page_io.c
> > > +++ b/mm/page_io.c
> > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> > > swap_iocb **plug)
> > > delayacct_swapin_start();
> > >
> > > if (zswap_load(folio)) {
> > > - folio_mark_uptodate(folio);
> > > folio_unlock(folio);
> > > } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > swap_read_folio_fs(folio, plug);
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 6007252429bb2..cc04db6bb217e 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
> > >
> > > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > >
> > > + /*
> > > + * Large folios should not be swapped in while zswap is being used, as
> > > + * they are not properly handled.
> > > + *
> > > + * If any of the subpages are in zswap, reading from disk would result
> > > + * in data corruption, so return true without marking the folio uptodate
> > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > + *
> > > + * Otherwise, return false and read the folio from disk.
> > > + */
> > > + if (WARN_ON_ONCE(folio_test_large(folio))) {
> > > + if (xa_find(tree, &offset, offset +
> > > folio_nr_pages(folio) - 1, 0))
> > > + return true;
> > > + return false;
> > > + }
> > > +
> > > /*
> > > * When reading into the swapcache, invalidate our entry. The
> > > * swapcache can be the authoritative owner of the page and
> > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> > > zswap_entry_free(entry);
> > > folio_mark_dirty(folio);
> > > }
> > > -
> > > + folio_mark_uptodate(folio);
> > > return true;
> > > }
> > >
> > > One problem is that even if zswap was never enabled, the warning will
> > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > static key if zswap was "ever" enabled.
> >
> > We should use WARN_ON_ONCE() only for things that cannot happen. So if
> > there are cases where this could be triggered today, it would be
> > problematic -- especially if it can be triggered from unprivileged user
> > space. But if we're concerned of other code messing up our invariant in
> > the future (e.g., enabling large folios without taking proper care about
> > zswap etc), we're good to add it.
>
> Right now I can't see any paths allocating large folios for swapin, so
> I think it cannot happen. Once someone tries adding it, the warning
> will fire if CONFIG_ZSWAP is used, even if zswap is disabled.
>
> At this point we will have several options:
> - Make large folios swapin depend on !CONFIG_ZSWAP for now.
It appears quite problematic. We lack control over whether the kernel build
will enable CONFIG_ZSWAP, particularly when aiming for a common
defconfig across all platforms to streamline configurations. For instance,
in the case of ARM, this was once a significant goal.
Simply trigger a single WARN or BUG if an attempt is made to load
large folios in zswap_load, while ensuring that zswap_is_enabled()
remains unaffected. In the mainline code, large folio swap-in support
is absent, so this warning is intended for debugging purposes and
targets a very small audience—perhaps fewer than five individuals
worldwide. Real users won’t encounter this warning, as it remains
hidden from their view.
> - Keep track if zswap was ever enabled and make the warning
> conditional on it. We should also always fallback to order-0 if zswap
> was ever enabled.
> - Properly handle large folio swapin with zswap.
>
> Does this sound reasonable to you?
Thanks
Barry
[..]
> > > >
> > > > How about something like the following (untested), it is the minimal
> > > > recovery we can do but should work for a lot of cases, and does
> > > > nothing beyond a warning if we can swapin the large folio from disk:
> > > >
> > > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > > --- a/mm/page_io.c
> > > > +++ b/mm/page_io.c
> > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> > > > swap_iocb **plug)
> > > > delayacct_swapin_start();
> > > >
> > > > if (zswap_load(folio)) {
> > > > - folio_mark_uptodate(folio);
> > > > folio_unlock(folio);
> > > > } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > > swap_read_folio_fs(folio, plug);
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 6007252429bb2..cc04db6bb217e 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
> > > >
> > > > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > >
> > > > + /*
> > > > + * Large folios should not be swapped in while zswap is being used, as
> > > > + * they are not properly handled.
> > > > + *
> > > > + * If any of the subpages are in zswap, reading from disk would result
> > > > + * in data corruption, so return true without marking the folio uptodate
> > > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > > + *
> > > > + * Otherwise, return false and read the folio from disk.
> > > > + */
> > > > + if (WARN_ON_ONCE(folio_test_large(folio))) {
> > > > + if (xa_find(tree, &offset, offset +
> > > > folio_nr_pages(folio) - 1, 0))
> > > > + return true;
> > > > + return false;
> > > > + }
> > > > +
> > > > /*
> > > > * When reading into the swapcache, invalidate our entry. The
> > > > * swapcache can be the authoritative owner of the page and
> > > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> > > > zswap_entry_free(entry);
> > > > folio_mark_dirty(folio);
> > > > }
> > > > -
> > > > + folio_mark_uptodate(folio);
> > > > return true;
> > > > }
> > > >
> > > > One problem is that even if zswap was never enabled, the warning will
> > > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > > static key if zswap was "ever" enabled.
> > >
> > > We should use WARN_ON_ONCE() only for things that cannot happen. So if
> > > there are cases where this could be triggered today, it would be
> > > problematic -- especially if it can be triggered from unprivileged user
> > > space. But if we're concerned of other code messing up our invariant in
> > > the future (e.g., enabling large folios without taking proper care about
> > > zswap etc), we're good to add it.
> >
> > Right now I can't see any paths allocating large folios for swapin, so
> > I think it cannot happen. Once someone tries adding it, the warning
> > will fire if CONFIG_ZSWAP is used, even if zswap is disabled.
> > At this point we will have several options:
>
> Here is my take on this:
>
> > - Make large folios swapin depend on !CONFIG_ZSWAP for now.
>
> I think a WARON or BUG_ON is better. I would need to revert this
> change when I am working on 3). It is a make up rule, not a real
> dependency any way.
I am intending to send a new version with WARN_ON_ONCE() and some
attempt to recover.
It is not a rule, it is just that we don't have the support for it today.
>
> > - Keep track if zswap was ever enabled and make the warning
> > conditional on it. We should also always fallback to order-0 if zswap
> > was ever enabled.
>
> IMHO, falling back to order-0 inside zswap is not desired because it
> complicates the zswap code. We should not pass large folio to zswap if
> zswap is not ready to handle large folio. The core swap already has
> the fall back to order-0. If we get to 3), then this fall back in
> zswap needs to be removed. It is a transitional thing then maybe not
> introduce it in the first place.
We cannot split the folio inside zswap. What I meant is that the
swapin code should fallback to order-0 if zswap is being used, to
avoid passing large folios to zswap.
>
> > - Properly handle large folio swapin with zswap.
> That obviously is ideal.
>
> Chris
[..]
> > One problem is that even if zswap was never enabled, the warning will
> > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > static key if zswap was "ever" enabled.
> >
> > Barry, I suspect your is_zswap_enabled() check is deficient for
> > similar reasons, zswap could have been enabled before then became
> > disabled.
>
> I don't understand this. if zswap was enabled before but is disabled when
> I am loading data, will I get corrupted data before zswap was once enabled?
> If not, it seems nothing important.
If zswap was enabled and then disabled, some pages may still be in
zswap. We do not load the pages from zswap when it is disabled, we
just stop storing new pages.
So if you just rely in checking whether zswap is enabled at swapin
time to decide whether to use large folios, you may end up with a
situation where zswap is disabled, yet parts of the large folio you
are trying to swapin (or all of it) is in zswap.
This is why I think we'll need to track whether zswap was ever enabled
instead (or if a page was ever stored).
>
> Thanks
> Barry
On Sat, Jun 8, 2024 at 7:17 AM Yosry Ahmed <[email protected]> wrote:
>
> [..]
> > > One problem is that even if zswap was never enabled, the warning will
> > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > static key if zswap was "ever" enabled.
> > >
> > > Barry, I suspect your is_zswap_enabled() check is deficient for
> > > similar reasons, zswap could have been enabled before then became
> > > disabled.
> >
> > I don't understand this. if zswap was enabled before but is disabled when
> > I am loading data, will I get corrupted data before zswap was once enabled?
> > If not, it seems nothing important.
>
> If zswap was enabled and then disabled, some pages may still be in
> zswap. We do not load the pages from zswap when it is disabled, we
> just stop storing new pages.
>
> So if you just rely in checking whether zswap is enabled at swapin
> time to decide whether to use large folios, you may end up with a
> situation where zswap is disabled, yet parts of the large folio you
> are trying to swapin (or all of it) is in zswap.
>
> This is why I think we'll need to track whether zswap was ever enabled
> instead (or if a page was ever stored).
Thanks! It doesn't seem good. Do we have a simple way to clean zswap
when it is disabled? seems not easy? Just like we do swapoff, or disable
cache, we ensure they are clean - this is a real "disable".
Thanks
Barry
On Fri, Jun 7, 2024 at 3:09 PM Barry Song <[email protected]> wrote:
>
> On Sat, Jun 8, 2024 at 6:58 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <[email protected]> wrote:
> > >
> > > >> I have no strong opinion on this one, but likely a VM_WARN_ON would also
> > > >> be sufficient to find such issues early during testing. No need to crash
> > > >> the machine.
> > > >
> > > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> > > > some digging I found your patches to checkpatch and Linus clearly
> > > > stating that it isn't.
> > >
> > > :) yes.
> > >
> > > VM_BUG_ON is not particularly helpful IMHO. If you want something to be
> > > found early during testing, VM_WARN_ON is good enough.
> > >
> > > Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are
> > > primarily reported during early/development testing only. But maybe some
> > > distro out there still sets it.
> > >
> > > >
> > > > How about something like the following (untested), it is the minimal
> > > > recovery we can do but should work for a lot of cases, and does
> > > > nothing beyond a warning if we can swapin the large folio from disk:
> > > >
> > > > diff --git a/mm/page_io.c b/mm/page_io.c
> > > > index f1a9cfab6e748..8f441dd8e109f 100644
> > > > --- a/mm/page_io.c
> > > > +++ b/mm/page_io.c
> > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct
> > > > swap_iocb **plug)
> > > > delayacct_swapin_start();
> > > >
> > > > if (zswap_load(folio)) {
> > > > - folio_mark_uptodate(folio);
> > > > folio_unlock(folio);
> > > > } else if (data_race(sis->flags & SWP_FS_OPS)) {
> > > > swap_read_folio_fs(folio, plug);
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 6007252429bb2..cc04db6bb217e 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
> > > >
> > > > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > >
> > > > + /*
> > > > + * Large folios should not be swapped in while zswap is being used, as
> > > > + * they are not properly handled.
> > > > + *
> > > > + * If any of the subpages are in zswap, reading from disk would result
> > > > + * in data corruption, so return true without marking the folio uptodate
> > > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > > + *
> > > > + * Otherwise, return false and read the folio from disk.
> > > > + */
> > > > + if (WARN_ON_ONCE(folio_test_large(folio))) {
> > > > + if (xa_find(tree, &offset, offset +
> > > > folio_nr_pages(folio) - 1, 0))
> > > > + return true;
> > > > + return false;
> > > > + }
> > > > +
> > > > /*
> > > > * When reading into the swapcache, invalidate our entry. The
> > > > * swapcache can be the authoritative owner of the page and
> > > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> > > > zswap_entry_free(entry);
> > > > folio_mark_dirty(folio);
> > > > }
> > > > -
> > > > + folio_mark_uptodate(folio);
> > > > return true;
> > > > }
> > > >
> > > > One problem is that even if zswap was never enabled, the warning will
> > > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > > static key if zswap was "ever" enabled.
> > >
> > > We should use WARN_ON_ONCE() only for things that cannot happen. So if
> > > there are cases where this could be triggered today, it would be
> > > problematic -- especially if it can be triggered from unprivileged user
> > > space. But if we're concerned of other code messing up our invariant in
> > > the future (e.g., enabling large folios without taking proper care about
> > > zswap etc), we're good to add it.
> >
> > Right now I can't see any paths allocating large folios for swapin, so
> > I think it cannot happen. Once someone tries adding it, the warning
> > will fire if CONFIG_ZSWAP is used, even if zswap is disabled.
> >
> > At this point we will have several options:
> > - Make large folios swapin depend on !CONFIG_ZSWAP for now.
>
> It appears quite problematic. We lack control over whether the kernel build
> will enable CONFIG_ZSWAP, particularly when aiming for a common
> defconfig across all platforms to streamline configurations. For instance,
> in the case of ARM, this was once a significant goal.
>
> Simply trigger a single WARN or BUG if an attempt is made to load
> large folios in zswap_load, while ensuring that zswap_is_enabled()
> remains unaffected. In the mainline code, large folio swap-in support
> is absent, so this warning is intended for debugging purposes and
> targets a very small audience—perhaps fewer than five individuals
> worldwide. Real users won’t encounter this warning, as it remains
> hidden from their view.
I can make the warning only fire if any part of the folio is in zswap
to avoid getting warnings from zswap_load() if we never actually use
zswap, that's reasonable. I wanted to warn if we reach zswap_load()
with any large folio at all for higher coverage only.
I will send something out in the next week or so.
>
> > - Keep track if zswap was ever enabled and make the warning
> > conditional on it. We should also always fallback to order-0 if zswap
> > was ever enabled.
> > - Properly handle large folio swapin with zswap.
> >
> > Does this sound reasonable to you?
>
> Thanks
> Barry
On Fri, Jun 7, 2024 at 4:28 PM Barry Song <[email protected]> wrote:
>
> On Sat, Jun 8, 2024 at 7:17 AM Yosry Ahmed <[email protected]> wrote:
> >
> > [..]
> > > > One problem is that even if zswap was never enabled, the warning will
> > > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > > static key if zswap was "ever" enabled.
> > > >
> > > > Barry, I suspect your is_zswap_enabled() check is deficient for
> > > > similar reasons, zswap could have been enabled before then became
> > > > disabled.
> > >
> > > I don't understand this. if zswap was enabled before but is disabled when
> > > I am loading data, will I get corrupted data before zswap was once enabled?
> > > If not, it seems nothing important.
> >
> > If zswap was enabled and then disabled, some pages may still be in
> > zswap. We do not load the pages from zswap when it is disabled, we
> > just stop storing new pages.
> >
> > So if you just rely in checking whether zswap is enabled at swapin
> > time to decide whether to use large folios, you may end up with a
> > situation where zswap is disabled, yet parts of the large folio you
> > are trying to swapin (or all of it) is in zswap.
> >
> > This is why I think we'll need to track whether zswap was ever enabled
> > instead (or if a page was ever stored).
>
> Thanks! It doesn't seem good. Do we have a simple way to clean zswap
> when it is disabled? seems not easy? Just like we do swapoff, or disable
> cache, we ensure they are clean - this is a real "disable".
I think it's just simpler, and disabling zswap at runtime is not that
common. Keep in mind that if zswap being disabled takes time, then
we'll want to handle races between zswap being disabled and incoming
swapins, it will probably end up being a state machine or so.
Without a valid use case, I think the complexity is not justified.
It's probably simpler to just track if zswap was ever used, and
disable large folio swapin. This should be a transitional state
anyway.