LinuxLists.cc - [PATCH] md/raid0: add discard support for the 'original' layout

2023-06-23 19:04:51

Subject: [PATCH] md/raid0: add discard support for the 'original' layout

We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
layout. We've seen this corruption using ext4 but other filesystems are
likely susceptible as well.

More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
layout in such a way the layout matches what the 'alternate' layout
would have produced. Thus, not all raid0 devices with the 'original'
layout, different size disks and discard enabled will encounter this
corruption.

The 3.14 kernel inadvertently changed the raid0 disk layout for different
size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
same raid0 array could corrupt data. This lead to the creation of the
'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
(to match the post 3.14 layout) in the 5.4 kernel time frame and an option
to tell the kernel which layout to use (since it couldn't be autodetected).
However, when the 'original' layout was added back to 5.4 discard support
for the 'original' layout was not added leading this issue.

I've been able to reliably reproduce the corruption with the following
test case:

1. create raid0 array with different size disks using original layout
2. mkfs
3. mount -o discard
4. create lots of files
5. remove 1/2 the files
6. fstrim -a (or just the mount point for the raid0 array)
7. umount
8. fsck -fn /dev/md0 (spews all sorts of corruptions)

Let's fix this by adding proper discard support to the 'original' layout.
The fix 'maps' the 'original' layout disks to the order in which they are
read/written such that we can compare the disks in the same way that the
current 'alternate' layout does. A 'disk_shift' field is added to
'struct strip_zone'. This could be computed on the fly in
raid0_handle_discard() but by adding this field, we save some computation
in the discard path.

Note we could also potentially fix this by re-ordering the disks in the
zones that follow the first one, and then always read/writing them using
the 'alternate' layout. However, that is seen as a more substantial change,
and we are attempting the least invasive fix at this time to remedy the
corruption.

I've verified the change using the reproducer mentioned above. Typically,
the corruption is seen after less than 3 iterations, while the patch has
run 500+ iterations.

Cc: NeilBrown <[email protected]>
Cc: Song Liu <[email protected]>
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Signed-off-by: Jason Baron <[email protected]>
---
drivers/md/raid0.c | 61 ++++++++++++++++++++++++++++++++++++++++------
drivers/md/raid0.h | 1 +
2 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f8ee9a95e25d..cb4b3e248519 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -270,6 +270,18 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
goto abort;
}

+ if (conf->layout == RAID0_ORIG_LAYOUT) {
+ for (i = 1; i < conf->nr_strip_zones; i++) {
+ sector_t first_sector = conf->strip_zone[i-1].zone_end;
+
+ sector_div(first_sector, mddev->chunk_sectors);
+ zone = conf->strip_zone + i;
+ /* disk_shift is first disk index used in the zone */
+ zone->disk_shift = sector_div(first_sector,
+ zone->nb_dev);
+ }
+ }
+
pr_debug("md/raid0:%s: done.\n", mdname(mddev));
*private_conf = conf;

@@ -431,6 +443,19 @@ static int raid0_run(struct mddev *mddev)
return ret;
}

+/* Convert disk_index to the disk order in which it is read/written.
+ * For example, if we have 4 disks, they are numbered 0,1,2,3. If we
+ * write the disks starting at disk 3, then the read/write order would
+ * be disk 3, then 0, then 1, and then disk 2 and we want map_disk_shift()
+ * to map the disks as follows 0,1,2,3 => 1,2,3,0. So disk 0 would map
+ * to 1, 1 to 2, 2 to 3, and 3 to 0. That way we can compare disks in
+ * that 'output' space to understand the read/write disk ordering.
+ */
+static int map_disk_shift(int disk_index, int num_disks, int disk_shift)
+{
+ return ((disk_index + num_disks - disk_shift) % num_disks);
+}
+
static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
{
struct r0conf *conf = mddev->private;
@@ -444,7 +469,9 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
sector_t end_disk_offset;
unsigned int end_disk_index;
unsigned int disk;
+ sector_t orig_start, orig_end;

+ orig_start = start;
zone = find_zone(conf, &start);

if (bio_end_sector(bio) > zone->zone_end) {
@@ -458,6 +485,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
} else
end = bio_end_sector(bio);

+ orig_end = end;
if (zone != conf->strip_zone)
end = end - zone[-1].zone_end;

@@ -469,13 +497,26 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
last_stripe_index = end;
sector_div(last_stripe_index, stripe_size);

- start_disk_index = (int)(start - first_stripe_index * stripe_size) /
- mddev->chunk_sectors;
+ /* In the first zone the original and alternate layouts are the same */
+ if ((conf->layout == RAID0_ORIG_LAYOUT) && (zone != conf->strip_zone)) {
+ sector_div(orig_start, mddev->chunk_sectors);
+ start_disk_index = sector_div(orig_start, zone->nb_dev);
+ start_disk_index = map_disk_shift(start_disk_index,
+ zone->nb_dev,
+ zone->disk_shift);
+ sector_div(orig_end, mddev->chunk_sectors);
+ end_disk_index = sector_div(orig_end, zone->nb_dev);
+ end_disk_index = map_disk_shift(end_disk_index,
+ zone->nb_dev, zone->disk_shift);
+ } else {
+ start_disk_index = (int)(start - first_stripe_index * stripe_size) /
+ mddev->chunk_sectors;
+ end_disk_index = (int)(end - last_stripe_index * stripe_size) /
+ mddev->chunk_sectors;
+ }
start_disk_offset = ((int)(start - first_stripe_index * stripe_size) %
mddev->chunk_sectors) +
first_stripe_index * mddev->chunk_sectors;
- end_disk_index = (int)(end - last_stripe_index * stripe_size) /
- mddev->chunk_sectors;
end_disk_offset = ((int)(end - last_stripe_index * stripe_size) %
mddev->chunk_sectors) +
last_stripe_index * mddev->chunk_sectors;
@@ -483,18 +524,22 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
for (disk = 0; disk < zone->nb_dev; disk++) {
sector_t dev_start, dev_end;
struct md_rdev *rdev;
+ int compare_disk;
+
+ compare_disk = map_disk_shift(disk, zone->nb_dev,
+ zone->disk_shift);

- if (disk < start_disk_index)
+ if (compare_disk < start_disk_index)
dev_start = (first_stripe_index + 1) *
mddev->chunk_sectors;
- else if (disk > start_disk_index)
+ else if (compare_disk > start_disk_index)
dev_start = first_stripe_index * mddev->chunk_sectors;
else
dev_start = start_disk_offset;

- if (disk < end_disk_index)
+ if (compare_disk < end_disk_index)
dev_end = (last_stripe_index + 1) * mddev->chunk_sectors;
- else if (disk > end_disk_index)
+ else if (compare_disk > end_disk_index)
dev_end = last_stripe_index * mddev->chunk_sectors;
else
dev_end = end_disk_offset;
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 3816e5477db1..8cc761ca7423 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -6,6 +6,7 @@ struct strip_zone {
sector_t zone_end; /* Start of the next zone (in sectors) */
sector_t dev_start; /* Zone offset in real dev (in sectors) */
int nb_dev; /* # of devices attached to the zone */
+ int disk_shift; /* start disk for the original layout */
};

/* Linux 3.14 (20d0189b101) made an unintended change to
--
2.25.1

2023-06-27 00:40:42

by Song Liu

[permalink] [raw]

Subject: Re: [PATCH] md/raid0: add discard support for the 'original' layout

On Fri, Jun 23, 2023 at 11:05 AM Jason Baron <[email protected]> wrote:
>
> We've found that using raid0 with the 'original' layout and discard
> enabled with different disk sizes (such that at least two zones are
> created) can result in data corruption. This is due to the fact that
> the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
> layout. We've seen this corruption using ext4 but other filesystems are
> likely susceptible as well.
>
> More specifically, while multiple zones are necessary to create the
> corruption, the corruption may not occur with multiple zones if they
> layout in such a way the layout matches what the 'alternate' layout
> would have produced. Thus, not all raid0 devices with the 'original'
> layout, different size disks and discard enabled will encounter this
> corruption.
>
> The 3.14 kernel inadvertently changed the raid0 disk layout for different
> size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
> same raid0 array could corrupt data. This lead to the creation of the
> 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
> (to match the post 3.14 layout) in the 5.4 kernel time frame and an option
> to tell the kernel which layout to use (since it couldn't be autodetected).
> However, when the 'original' layout was added back to 5.4 discard support
> for the 'original' layout was not added leading this issue.
>
> I've been able to reliably reproduce the corruption with the following
> test case:
>
> 1. create raid0 array with different size disks using original layout
> 2. mkfs
> 3. mount -o discard
> 4. create lots of files
> 5. remove 1/2 the files
> 6. fstrim -a (or just the mount point for the raid0 array)
> 7. umount
> 8. fsck -fn /dev/md0 (spews all sorts of corruptions)
>
> Let's fix this by adding proper discard support to the 'original' layout.
> The fix 'maps' the 'original' layout disks to the order in which they are
> read/written such that we can compare the disks in the same way that the
> current 'alternate' layout does. A 'disk_shift' field is added to
> 'struct strip_zone'. This could be computed on the fly in
> raid0_handle_discard() but by adding this field, we save some computation
> in the discard path.
>
> Note we could also potentially fix this by re-ordering the disks in the
> zones that follow the first one, and then always read/writing them using
> the 'alternate' layout. However, that is seen as a more substantial change,
> and we are attempting the least invasive fix at this time to remedy the
> corruption.
>
> I've verified the change using the reproducer mentioned above. Typically,
> the corruption is seen after less than 3 iterations, while the patch has
> run 500+ iterations.
>
> Cc: NeilBrown <[email protected]>
> Cc: Song Liu <[email protected]>
> Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
> Signed-off-by: Jason Baron <[email protected]>

Looks good to me! Applied to md-next.

Since this will be released with 6.6, we should have a smaller and safer fix
before that. Would you mind create a patch that ignores all discards to
orig_layout and not the first zone? We will roll that to 6.5 and back port to
stable. Then this version will be shipped to 6.6+.

Thanks,
Song

2023-06-27 15:16:23

by Jason Baron

[permalink] [raw]

Subject: Re: [PATCH] md/raid0: add discard support for the 'original' layout

On 6/26/23 8:35 PM, Song Liu wrote:
> On Fri, Jun 23, 2023 at 11:05 AM Jason Baron <[email protected]> wrote:
>>
>> We've found that using raid0 with the 'original' layout and discard
>> enabled with different disk sizes (such that at least two zones are
>> created) can result in data corruption. This is due to the fact that
>> the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
>> layout. We've seen this corruption using ext4 but other filesystems are
>> likely susceptible as well.
>>
>> More specifically, while multiple zones are necessary to create the
>> corruption, the corruption may not occur with multiple zones if they
>> layout in such a way the layout matches what the 'alternate' layout
>> would have produced. Thus, not all raid0 devices with the 'original'
>> layout, different size disks and discard enabled will encounter this
>> corruption.
>>
>> The 3.14 kernel inadvertently changed the raid0 disk layout for different
>> size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
>> same raid0 array could corrupt data. This lead to the creation of the
>> 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
>> (to match the post 3.14 layout) in the 5.4 kernel time frame and an option
>> to tell the kernel which layout to use (since it couldn't be autodetected).
>> However, when the 'original' layout was added back to 5.4 discard support
>> for the 'original' layout was not added leading this issue.
>>
>> I've been able to reliably reproduce the corruption with the following
>> test case:
>>
>> 1. create raid0 array with different size disks using original layout
>> 2. mkfs
>> 3. mount -o discard
>> 4. create lots of files
>> 5. remove 1/2 the files
>> 6. fstrim -a (or just the mount point for the raid0 array)
>> 7. umount
>> 8. fsck -fn /dev/md0 (spews all sorts of corruptions)
>>
>> Let's fix this by adding proper discard support to the 'original' layout.
>> The fix 'maps' the 'original' layout disks to the order in which they are
>> read/written such that we can compare the disks in the same way that the
>> current 'alternate' layout does. A 'disk_shift' field is added to
>> 'struct strip_zone'. This could be computed on the fly in
>> raid0_handle_discard() but by adding this field, we save some computation
>> in the discard path.
>>
>> Note we could also potentially fix this by re-ordering the disks in the
>> zones that follow the first one, and then always read/writing them using
>> the 'alternate' layout. However, that is seen as a more substantial change,
>> and we are attempting the least invasive fix at this time to remedy the
>> corruption.
>>
>> I've verified the change using the reproducer mentioned above. Typically,
>> the corruption is seen after less than 3 iterations, while the patch has
>> run 500+ iterations.
>>
>> Cc: NeilBrown <[email protected]>
>> Cc: Song Liu <[email protected]>
>> Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
>> Signed-off-by: Jason Baron <[email protected]>
>
> Looks good to me! Applied to md-next.
>
> Since this will be released with 6.6, we should have a smaller and safer fix
> before that. Would you mind create a patch that ignores all discards to
> orig_layout and not the first zone? We will roll that to 6.5 and back port to
> stable. Then this version will be shipped to 6.6+.
>

Hi Song,

Ok, I mean the current patch was meant to be fairly conservative in that
it attempts to only change codepaths where we are doing discards above
the first zone. IE Changing only the codepaths that currently don't work.

But if we want to be more conservative (given this fixes corruption), I
can post a patch to disable discard as you've suggested. I'm going to
let the testing run for a while, so I'll post it in a bit.

Thanks,

-Jason