LinuxLists.cc - [PATCH] md/raid0: drop discard support past the first zone for the original layout (stable fix)

2023-06-28 14:52:20

Subject: [PATCH] md/raid0: drop discard support past the first zone for the original layout (stable fix)

We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
layout. We've seen this corruption using ext4 but other filesystems are
likely susceptible as well.

More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
layout in such a way the layout matches what the 'alternate' layout
would have produced. Thus, not all raid0 devices with the 'original'
layout, different size disks and discard enabled will encounter this
corruption.

The 3.14 kernel inadvertently changed the raid0 disk layout for different
size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
same raid0 array could corrupt data. This lead to the creation of the
'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
(to match the post 3.14 layout) in the 5.4 kernel time frame and an option
to tell the kernel which layout to use (since it couldn't be autodetected).
However, when the 'original' layout was added back to 5.4 discard support
for the 'original' layout was not added leading this issue.

I've been able to reliably reproduce the corruption with the following
test case:

1. create raid0 array with different size disks using original layout
2. mkfs
3. mount -o discard
4. create lots of files
5. remove 1/2 the files
6. fstrim -a (or just the mount point for the raid0 array)
7. umount
8. fsck -fn /dev/md0 (spews all sorts of corruptions)

The fix here is a minimal fix intended for stable trees which doesn't do
discard if we have the original layout and we are not in first zone or
if the i/o crosses zones (we either do the entire discard or none of it).
The proper fix to actually perform discard to all zones for the original
layout will land in upstream versions. We have implemented the minimal
fix here for stable branches to reduce risk.

I've verified the change using the reproducer mentioned above. Typically,
the corruption is seen after less than 3 iterations, while the patch has
run 500+ iterations.

Cc: NeilBrown <[email protected]>
Cc: Song Liu <[email protected]>
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Signed-off-by: Jason Baron <[email protected]>
---
drivers/md/raid0.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f8ee9a95e25d..2713a4acb44f 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -444,10 +444,25 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
sector_t end_disk_offset;
unsigned int end_disk_index;
unsigned int disk;
+ bool bio_zone_overlap = false;

zone = find_zone(conf, &start);
+ if (bio_end_sector(bio) > zone->zone_end)
+ bio_zone_overlap = true;
+
+ /* The discard code below doesn't properly support the original
+ * layout for the zones above the first one. We are adding
+ * proper support in later kernel versions but have decided
+ * that dropping discard support here is the lower risk option
+ * to avoid data corruption for stable versions.
+ */
+ if ((conf->layout == RAID0_ORIG_LAYOUT) &&
+ ((zone != conf->strip_zone) || (bio_zone_overlap))) {
+ bio_endio(bio);
+ return;
+ }

- if (bio_end_sector(bio) > zone->zone_end) {
+ if (bio_zone_overlap) {
struct bio *split = bio_split(bio,
zone->zone_end - bio->bi_iter.bi_sector, GFP_NOIO,
&mddev->bio_set);
--
2.25.1

2023-06-28 22:45:22

by Song Liu

[permalink] [raw]

Subject: Re: [PATCH] md/raid0: drop discard support past the first zone for the original layout (stable fix)

Hi Jason,

Thanks for the fix!

On Wed, Jun 28, 2023 at 7:32 AM Jason Baron <[email protected]> wrote:
>
> We've found that using raid0 with the 'original' layout and discard
> enabled with different disk sizes (such that at least two zones are
> created) can result in data corruption. This is due to the fact that
> the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
> layout. We've seen this corruption using ext4 but other filesystems are
> likely susceptible as well.
>
> More specifically, while multiple zones are necessary to create the
> corruption, the corruption may not occur with multiple zones if they
> layout in such a way the layout matches what the 'alternate' layout
> would have produced. Thus, not all raid0 devices with the 'original'
> layout, different size disks and discard enabled will encounter this
> corruption.
>
> The 3.14 kernel inadvertently changed the raid0 disk layout for different
> size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
> same raid0 array could corrupt data. This lead to the creation of the
> 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
> (to match the post 3.14 layout) in the 5.4 kernel time frame and an option
> to tell the kernel which layout to use (since it couldn't be autodetected).
> However, when the 'original' layout was added back to 5.4 discard support
> for the 'original' layout was not added leading this issue.
>
> I've been able to reliably reproduce the corruption with the following
> test case:
>
> 1. create raid0 array with different size disks using original layout
> 2. mkfs
> 3. mount -o discard
> 4. create lots of files
> 5. remove 1/2 the files
> 6. fstrim -a (or just the mount point for the raid0 array)
> 7. umount
> 8. fsck -fn /dev/md0 (spews all sorts of corruptions)
>
> The fix here is a minimal fix intended for stable trees which doesn't do
> discard if we have the original layout and we are not in first zone or
> if the i/o crosses zones (we either do the entire discard or none of it).
> The proper fix to actually perform discard to all zones for the original
> layout will land in upstream versions. We have implemented the minimal
> fix here for stable branches to reduce risk.
>
> I've verified the change using the reproducer mentioned above. Typically,
> the corruption is seen after less than 3 iterations, while the patch has
> run 500+ iterations.
>
> Cc: NeilBrown <[email protected]>
> Cc: Song Liu <[email protected]>
> Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
> Signed-off-by: Jason Baron <[email protected]>
> ---
> drivers/md/raid0.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index f8ee9a95e25d..2713a4acb44f 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -444,10 +444,25 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
> sector_t end_disk_offset;
> unsigned int end_disk_index;
> unsigned int disk;
> + bool bio_zone_overlap = false;
>
> zone = find_zone(conf, &start);
> + if (bio_end_sector(bio) > zone->zone_end)
> + bio_zone_overlap = true;
> +
> + /* The discard code below doesn't properly support the original
> + * layout for the zones above the first one. We are adding
> + * proper support in later kernel versions but have decided
> + * that dropping discard support here is the lower risk option
> + * to avoid data corruption for stable versions.
> + */
> + if ((conf->layout == RAID0_ORIG_LAYOUT) &&
> + ((zone != conf->strip_zone) || (bio_zone_overlap))) {
> + bio_endio(bio);
> + return;
> + }

For bio_zone_overlap case, I think we can still do the split below and
send discard to the first zone?

Song

>
> - if (bio_end_sector(bio) > zone->zone_end) {
> + if (bio_zone_overlap) {
> struct bio *split = bio_split(bio,
> zone->zone_end - bio->bi_iter.bi_sector, GFP_NOIO,
> &mddev->bio_set);
> --
> 2.25.1
>

2023-06-28 23:16:10

by Song Liu

[permalink] [raw]

Subject: Re: [PATCH] md/raid0: drop discard support past the first zone for the original layout (stable fix)

On Wed, Jun 28, 2023 at 3:43 PM Song Liu <[email protected]> wrote:
>
> Hi Jason,
>
> Thanks for the fix!
>
> On Wed, Jun 28, 2023 at 7:32 AM Jason Baron <[email protected]> wrote:
> >
> > We've found that using raid0 with the 'original' layout and discard
> > enabled with different disk sizes (such that at least two zones are
> > created) can result in data corruption. This is due to the fact that
> > the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
> > layout. We've seen this corruption using ext4 but other filesystems are
> > likely susceptible as well.
> >
> > More specifically, while multiple zones are necessary to create the
> > corruption, the corruption may not occur with multiple zones if they
> > layout in such a way the layout matches what the 'alternate' layout
> > would have produced. Thus, not all raid0 devices with the 'original'
> > layout, different size disks and discard enabled will encounter this
> > corruption.
> >
> > The 3.14 kernel inadvertently changed the raid0 disk layout for different
> > size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
> > same raid0 array could corrupt data. This lead to the creation of the
> > 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
> > (to match the post 3.14 layout) in the 5.4 kernel time frame and an option
> > to tell the kernel which layout to use (since it couldn't be autodetected).
> > However, when the 'original' layout was added back to 5.4 discard support
> > for the 'original' layout was not added leading this issue.
> >
> > I've been able to reliably reproduce the corruption with the following
> > test case:
> >
> > 1. create raid0 array with different size disks using original layout
> > 2. mkfs
> > 3. mount -o discard
> > 4. create lots of files
> > 5. remove 1/2 the files
> > 6. fstrim -a (or just the mount point for the raid0 array)
> > 7. umount
> > 8. fsck -fn /dev/md0 (spews all sorts of corruptions)
> >
> > The fix here is a minimal fix intended for stable trees which doesn't do
> > discard if we have the original layout and we are not in first zone or
> > if the i/o crosses zones (we either do the entire discard or none of it).
> > The proper fix to actually perform discard to all zones for the original
> > layout will land in upstream versions. We have implemented the minimal
> > fix here for stable branches to reduce risk.
> >
> > I've verified the change using the reproducer mentioned above. Typically,
> > the corruption is seen after less than 3 iterations, while the patch has
> > run 500+ iterations.
> >
> > Cc: NeilBrown <[email protected]>
> > Cc: Song Liu <[email protected]>
> > Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
> > Signed-off-by: Jason Baron <[email protected]>
> > ---
> > drivers/md/raid0.c | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> > index f8ee9a95e25d..2713a4acb44f 100644
> > --- a/drivers/md/raid0.c
> > +++ b/drivers/md/raid0.c
> > @@ -444,10 +444,25 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
> > sector_t end_disk_offset;
> > unsigned int end_disk_index;
> > unsigned int disk;
> > + bool bio_zone_overlap = false;
> >
> > zone = find_zone(conf, &start);
> > + if (bio_end_sector(bio) > zone->zone_end)
> > + bio_zone_overlap = true;
> > +
> > + /* The discard code below doesn't properly support the original
> > + * layout for the zones above the first one. We are adding
> > + * proper support in later kernel versions but have decided
> > + * that dropping discard support here is the lower risk option
> > + * to avoid data corruption for stable versions.
> > + */
> > + if ((conf->layout == RAID0_ORIG_LAYOUT) &&
> > + ((zone != conf->strip_zone) || (bio_zone_overlap))) {
> > + bio_endio(bio);
> > + return;
> > + }
>
> For bio_zone_overlap case, I think we can still do the split below and
> send discard to the first zone?

Actually, on a second (third?) thought, I think we can just ship the other
version ([email protected]) to stable
kernel. It is very safe. So we don't need more work on this version.
Sorry for the confusion.

Thanks again for the fix!
Song