2020-08-19 20:00:18

by Gao Xiang

Subject: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

SWP_FS doesn't mean the device is file-backed swap device,
which just means each writeback request should go through fs
by DIO. Or it'll just use extents added by .swap_activate(),
but it also works as file-backed swap device.

So in order to achieve the goal of the original patch,
SWP_BLKDEV should be used instead.

FS corruption can be observed with SSD device + XFS +
fragmented swapfile due to CONFIG_THP_SWAP=y.

Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Cc: "Huang, Ying" <[email protected]>
Cc: stable <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---

I reproduced the issue with the following details:

Environment:
QEMU + upstream kernel + buildroot + NVMe (2 GB)

Kernel config:
CONFIG_BLK_DEV_NVME=y
CONFIG_THP_SWAP=y

Some reproducible steps:
mkfs.xfs -f /dev/nvme0n1
mkdir /tmp/mnt
mount /dev/nvme0n1 /tmp/mnt
bs="32k"
sz="1024m" # doesn't matter too much, I also tried 16m
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw

mkswap /tmp/mnt/sw
swapon /tmp/mnt/sw

stress --vm 2 --vm-bytes 600M # doesn't matter too much as well

Symptoms:
- FS corruption (e.g. checksum failure)
- memory corruption at: 0xd2808010
- segfault
...

mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6c26916e95fd..2937daf3ca02 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			goto nextsi;
 		}
 		if (size == SWAPFILE_CLUSTER) {
-			if (!(si->flags & SWP_FS))
+			if (si->flags & SWP_BLKDEV)
 				n_ret = swap_alloc_cluster(si, swp_entries);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
--
2.18.1
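
The one-line change above can be modelled in userspace to see which kinds of swap area each check lets through. This is only a sketch: the DEMO_* names and bit values are made up for illustration (the real flags are defined in include/linux/swap.h), and the flag combinations assumed for a local swapfile and for an NFS-style swapfile are assumptions based on the discussion in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative bit values only (assumption); the real definitions
 * live in include/linux/swap.h and are not the same as these.
 */
#define DEMO_SWP_BLKDEV    (1u << 0)	/* swap area is a block device */
#define DEMO_SWP_ACTIVATED (1u << 1)	/* ->swap_activate() succeeded */
#define DEMO_SWP_FS        (1u << 2)	/* page I/O goes through the fs */

/* Check before the patch: anything without SWP_FS gets a huge cluster. */
static bool demo_old_allows_cluster(unsigned int flags)
{
	return !(flags & DEMO_SWP_FS);
}

/* Check after the patch: only block devices get a huge cluster. */
static bool demo_new_allows_cluster(unsigned int flags)
{
	return (flags & DEMO_SWP_BLKDEV) != 0;
}

/* Walk the three kinds of swap area mentioned in the thread. */
static void demo_compare_checks(void)
{
	/* Assumed flag combinations, for illustration only. */
	unsigned int blkdev = DEMO_SWP_BLKDEV;
	unsigned int local_file = DEMO_SWP_ACTIVATED;
	unsigned int fs_file = DEMO_SWP_ACTIVATED | DEMO_SWP_FS;

	/* A block device is allowed by both checks. */
	assert(demo_old_allows_cluster(blkdev));
	assert(demo_new_allows_cluster(blkdev));

	/* A local swapfile slips through the old check but not the new one. */
	assert(demo_old_allows_cluster(local_file));
	assert(!demo_new_allows_cluster(local_file));

	/* A swapfile whose I/O goes through the fs is rejected by both. */
	assert(!demo_old_allows_cluster(fs_file));
	assert(!demo_new_allows_cluster(fs_file));
}
```

Calling demo_compare_checks() from a test harness exercises all three cases; the middle case is the one the patch is about.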


2020-08-19 20:06:39

by Andrew Morton

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <[email protected]> wrote:

> SWP_FS doesn't mean the device is file-backed swap device,
> which just means each writeback request should go through fs
> by DIO. Or it'll just use extents added by .swap_activate(),
> but it also works as file-backed swap device.

This is very hard to understand :(

> So in order to achieve the goal of the original patch,
> SWP_BLKDEV should be used instead.
>
> FS corruption can be observed with SSD device + XFS +
> fragmented swapfile due to CONFIG_THP_SWAP=y.
>
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")

Why do you think it has taken three years to discover this?


2020-08-19 20:18:37

by Gao Xiang

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

Hi Andrew,

On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <[email protected]> wrote:
>
> > SWP_FS doesn't mean the device is file-backed swap device,
> > which just means each writeback request should go through fs
> > by DIO. Or it'll just use extents added by .swap_activate(),
> > but it also works as file-backed swap device.
>
> This is very hard to understand :(

Thanks for your reply...

The related logic is in __swap_writepage() and setup_swap_extents(),
and also see e.g. generic_swapfile_activate() or iomap_swapfile_activate()...

I will also talk with "Huang, Ying" in person if no response here.

>
> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> >
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> >
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
>
> Why do you think it has taken three years to discover this?

I'm not sure whether the Red Hat BZ is publicly accessible; the issue has
been reproducible since RHEL 8:
https://bugzilla.redhat.com/show_bug.cgi?id=1855474

It seems hard to believe, but I think it is simply because few users run the
SSD device + THP + file-backed swap device combination... Maybe I'm wrong here,
but that is what my test shows.

Thanks,
Gao Xiang

>
>
>

2020-08-19 20:45:20

by Rafael Aquini

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <[email protected]> wrote:
>
> > SWP_FS doesn't mean the device is file-backed swap device,
> > which just means each writeback request should go through fs
> > by DIO. Or it'll just use extents added by .swap_activate(),
> > but it also works as file-backed swap device.
>
> This is very hard to understand :(
>

I'll work with Gao to rephrase that message. Sorry!


> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> >
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> >
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
>
> Why do you think it has taken three years to discover this?
>

My bet here is that it's rare to go for a swapfile on non-rotational
devices, and even rarer to create the swapfile when the filesystem is
already fragmented.

RHEL-8, v4.18-based, is starting to see more adopters among Red Hat's
customer base, thus the report now. We are also working on a secondary
issue related to CONFIG_THP_SWAP, where the registered deferred THP split
shrinker hits a NULL pointer dereference when the swap device is backed
by a rotational drive.

-- Rafael
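
Why fragmentation matters can be sketched with a toy model of swap extents. The assumption here, which the thread implies but does not spell out, is that a huge cluster is treated as one physically contiguous run of blocks, while a swapfile only guarantees contiguity inside each extent. The struct and function names below are hypothetical and are not the kernel's own data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one swap extent: a run of swap pages that
 * is contiguous on disk. */
struct demo_extent {
	unsigned long start_page;	/* first swap page offset covered */
	unsigned long nr_pages;		/* number of pages in this run */
	unsigned long start_block;	/* on-disk block of start_page */
};

/* Find the extent covering `page`, or NULL if no extent covers it. */
static const struct demo_extent *
demo_find_extent(const struct demo_extent *ext, size_t n, unsigned long page)
{
	for (size_t i = 0; i < n; i++)
		if (page >= ext[i].start_page &&
		    page - ext[i].start_page < ext[i].nr_pages)
			return &ext[i];
	return NULL;
}

/* True if swap pages [first, first + nr) all sit inside one extent,
 * i.e. the range is known to be contiguous on disk in this model. */
static bool demo_range_in_one_extent(const struct demo_extent *ext, size_t n,
				     unsigned long first, unsigned long nr)
{
	const struct demo_extent *e = demo_find_extent(ext, n, first);

	assert(nr > 0);
	if (!e)
		return false;
	/* Pages left in the extent, counting from `first`. */
	return nr <= e->nr_pages - (first - e->start_page);
}
```

With a swapfile split into many small extents, a 512-page range starting at page 0 fails this check, whereas on a block device the whole area behaves like a single extent and the check always passes.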

2020-08-19 20:57:37

by Gao Xiang

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

Hi Rafael,

On Wed, Aug 19, 2020 at 04:44:05PM -0400, Rafael Aquini wrote:
> On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <[email protected]> wrote:
> >
> > > SWP_FS doesn't mean the device is file-backed swap device,
> > > which just means each writeback request should go through fs
> > > by DIO. Or it'll just use extents added by .swap_activate(),
> > > but it also works as file-backed swap device.
> >
> > This is very hard to understand :(
> >
>
> I'll work with Gao to rephrase that message. Sorry!

Sorry about that :( I just finished the test and went through
the related swap code and finally saw this so I think it wouldn't
work entirely for the current swap code... and sorry about
my limited English.

Kindly feel free to repost the patch with rephrased commit
message. Anyway, I've done this task :)

Thanks,
Gao Xiang

2020-08-19 21:42:23

by Yang Shi

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

On Wed, Aug 19, 2020 at 1:15 PM Gao Xiang <[email protected]> wrote:
>
> Hi Andrew,
>
> On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <[email protected]> wrote:
> >
> > > SWP_FS doesn't mean the device is file-backed swap device,
> > > which just means each writeback request should go through fs
> > > by DIO. Or it'll just use extents added by .swap_activate(),
> > > but it also works as file-backed swap device.
> >
> > This is very hard to understand :(
>
> Thanks for your reply...
>
> The related logic is in __swap_writepage() and setup_swap_extents(),
> and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...

I think just NFS falls into this case, so you may rephrase it to:

SWP_FS is only used for swap files over NFS. So, !SWP_FS means non NFS
swap, it could be either file backed or device backed.

Does this look more understandable?

> I will also talk with "Huang, Ying" in person if no response here.
>
> >
> > > So in order to achieve the goal of the original patch,
> > > SWP_BLKDEV should be used instead.
> > >
> > > FS corruption can be observed with SSD device + XFS +
> > > fragmented swapfile due to CONFIG_THP_SWAP=y.
> > >
> > > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> >
> > Why do you think it has taken three years to discover this?
>
> I'm not sure if the Redhat BZ is available for public, it can be reproduced
> since rhel 8
> https://bugzilla.redhat.com/show_bug.cgi?id=1855474
>
> It seems hard to believe, but I think just because rare user uses the SSD device +
> THP + file-backed swap device combination... maybe I'm wrong here, but my test
> shows as it is.
>
> Thanks,
> Gao Xiang
>
> >
> >
> >
>
>

2020-08-20 01:25:47

by Gao Xiang

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

Hi Yang,

On Wed, Aug 19, 2020 at 02:41:08PM -0700, Yang Shi wrote:
> On Wed, Aug 19, 2020 at 1:15 PM Gao Xiang <[email protected]> wrote:
> >
> > Hi Andrew,
> >
> > On Wed, Aug 19, 2020 at 01:05:06PM -0700, Andrew Morton wrote:
> > > On Thu, 20 Aug 2020 03:56:13 +0800 Gao Xiang <[email protected]> wrote:
> > >
> > > > SWP_FS doesn't mean the device is file-backed swap device,
> > > > which just means each writeback request should go through fs
> > > > by DIO. Or it'll just use extents added by .swap_activate(),
> > > > but it also works as file-backed swap device.
> > >
> > > This is very hard to understand :(
> >
> > Thanks for your reply...
> >
> > The related logic is in __swap_writepage() and setup_swap_extents(),
> > and also see e.g generic_swapfile_activate() or iomap_swapfile_activate()...
>
> I think just NFS falls into this case, so you may rephrase it to:
>
> SWP_FS is only used for swap files over NFS. So, !SWP_FS means non NFS
> swap, it could be either file backed or device backed.

Thanks for your suggestion...

That looks reasonable, and after I looked at
bc4ae27d817a ("mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS")

I think it could be rephrased into

"
The SWP_FS flag is used to make swap_{read,write}page() go
through the filesystem, and it's only used for swap files
over NFS. So, !SWP_FS means non NFS for now, it could be
either file backed or device backed. Something similar goes
with legacy SWP_FILE.
"

Does it look sane? I will wait a while for further suggestions
about this.

And IMO, the SWP_FS flag might be useful for other purposes later
(e.g. for some CoW swapfile use, but I haven't thought carefully
about whether that's practical or not...)

Thanks,
Gao Xiang

2020-08-20 04:37:20

by Huang, Ying

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

Gao Xiang <[email protected]> writes:

> SWP_FS doesn't mean the device is file-backed swap device,
> which just means each writeback request should go through fs
> by DIO. Or it'll just use extents added by .swap_activate(),
> but it also works as file-backed swap device.
>
> So in order to achieve the goal of the original patch,
> SWP_BLKDEV should be used instead.
>
> FS corruption can be observed with SSD device + XFS +
> fragmented swapfile due to CONFIG_THP_SWAP=y.
>
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Cc: "Huang, Ying" <[email protected]>
> Cc: stable <[email protected]>
> Signed-off-by: Gao Xiang <[email protected]>

Good catch! The fix itself looks good to me! Although the description is
a little confusing.

After some digging, it seems that SWP_FS is set on the swap devices
which make swap entry read/write go through the file system specific
callback (now used by swap over NFS only).

Best Regards,
Huang, Ying
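
The point made above, that SWP_FS selects how swap I/O is issued rather than what backs the swap area, can be sketched as a simple dispatch. The enum, the function names, and the flag value are hypothetical; this is a reading of the description in this thread, not the kernel's actual write path.

```c
#include <assert.h>

#define DEMO_SWP_FS (1u << 0)	/* illustrative value only (assumption) */

enum demo_io_path {
	DEMO_IO_VIA_FS,		/* filesystem-specific callback, e.g. swap over NFS */
	DEMO_IO_VIA_BLOCKDEV,	/* I/O sent to the block device, located via swap extents */
};

/* Pick the I/O path from the flags: only SWP_FS routes through the fs. */
static enum demo_io_path demo_pick_io_path(unsigned int flags)
{
	return (flags & DEMO_SWP_FS) ? DEMO_IO_VIA_FS : DEMO_IO_VIA_BLOCKDEV;
}

/* Both a swap partition and a local swapfile lack SWP_FS in this model,
 * so they take the same path and the flag cannot tell them apart. */
static void demo_io_path_examples(void)
{
	unsigned int partition = 0;	/* no SWP_FS */
	unsigned int local_file = 0;	/* no SWP_FS either */
	unsigned int nfs_file = DEMO_SWP_FS;

	assert(demo_pick_io_path(partition) == DEMO_IO_VIA_BLOCKDEV);
	assert(demo_pick_io_path(local_file) == DEMO_IO_VIA_BLOCKDEV);
	assert(demo_pick_io_path(nfs_file) == DEMO_IO_VIA_FS);
}
```

This is why testing !SWP_FS cannot stand in for "is a block device": in this model the local swapfile and the partition are indistinguishable by that flag alone.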

> ---
>
> I reproduced the issue with the following details:
>
> Environment:
> QEMU + upstream kernel + buildroot + NVMe (2 GB)
>
> Kernel config:
> CONFIG_BLK_DEV_NVME=y
> CONFIG_THP_SWAP=y
>
> Some reproducable steps:
> mkfs.xfs -f /dev/nvme0n1
> mkdir /tmp/mnt
> mount /dev/nvme0n1 /tmp/mnt
> bs="32k"
> sz="1024m" # doesn't matter too much, I also tried 16m
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
>
> mkswap /tmp/mnt/sw
> swapon /tmp/mnt/sw
>
> stress --vm 2 --vm-bytes 600M # doesn't matter too much as well
>
> Symptoms:
> - FS corruption (e.g. checksum failure)
> - memory corruption at: 0xd2808010
> - segfault
> ...
>
> mm/swapfile.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6c26916e95fd..2937daf3ca02 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> goto nextsi;
> }
> if (size == SWAPFILE_CLUSTER) {
> - if (!(si->flags & SWP_FS))
> + if (si->flags & SWP_BLKDEV)
> n_ret = swap_alloc_cluster(si, swp_entries);
> } else
> n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,

2020-08-20 04:44:48

by Gao Xiang

Subject: Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

Hi Ying,

On Thu, Aug 20, 2020 at 12:36:08PM +0800, Huang, Ying wrote:
> Gao Xiang <[email protected]> writes:
>
> > SWP_FS doesn't mean the device is file-backed swap device,
> > which just means each writeback request should go through fs
> > by DIO. Or it'll just use extents added by .swap_activate(),
> > but it also works as file-backed swap device.
> >
> > So in order to achieve the goal of the original patch,
> > SWP_BLKDEV should be used instead.
> >
> > FS corruption can be observed with SSD device + XFS +
> > fragmented swapfile due to CONFIG_THP_SWAP=y.
> >
> > Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
> > Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> > Cc: "Huang, Ying" <[email protected]>
> > Cc: stable <[email protected]>
> > Signed-off-by: Gao Xiang <[email protected]>
>
> Good catch! The fix itself looks good me! Although the description is
> a little confusing.
>
> After some digging, it seems that SWP_FS is set on the swap devices
> which make swap entry read/write go through the file system specific
> callback (now used by swap over NFS only).

Okay, let me send out v2 with the updated commit message in
https://lore.kernel.org/r/[email protected]/

Thanks,
Gao Xiang

>
> Best Regards,
> Huang, Ying
>
> > ---
> >
> > I reproduced the issue with the following details:
> >
> > Environment:
> > QEMU + upstream kernel + buildroot + NVMe (2 GB)
> >
> > Kernel config:
> > CONFIG_BLK_DEV_NVME=y
> > CONFIG_THP_SWAP=y
> >
> > Some reproducable steps:
> > mkfs.xfs -f /dev/nvme0n1
> > mkdir /tmp/mnt
> > mount /dev/nvme0n1 /tmp/mnt
> > bs="32k"
> > sz="1024m" # doesn't matter too much, I also tried 16m
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> > xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
> >
> > mkswap /tmp/mnt/sw
> > swapon /tmp/mnt/sw
> >
> > stress --vm 2 --vm-bytes 600M # doesn't matter too much as well
> >
> > Symptoms:
> > - FS corruption (e.g. checksum failure)
> > - memory corruption at: 0xd2808010
> > - segfault
> > ...
> >
> > mm/swapfile.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 6c26916e95fd..2937daf3ca02 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> > goto nextsi;
> > }
> > if (size == SWAPFILE_CLUSTER) {
> > - if (!(si->flags & SWP_FS))
> > + if (si->flags & SWP_BLKDEV)
> > n_ret = swap_alloc_cluster(si, swp_entries);
> > } else
> > n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>