2023-06-01 06:40:49

by Li Nan

[permalink] [raw]
Subject: [PATCH v5 0/2] raid10 bugfix

From: Li Nan <[email protected]>

Changes in v5:
- v4 send wrong patch, correct and resend.

Changes in v4:
- improve commit log
- removed applied patches

Li Nan (2):
md/raid10: fix incorrect done of recovery
md/raid10: fix io loss while replacement replace rdev

drivers/md/raid10.c | 40 +++++++++++++++++++++++++++++++++++-----
1 file changed, 35 insertions(+), 5 deletions(-)

--
2.31.1



2023-06-01 06:41:40

by Li Nan

[permalink] [raw]
Subject: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery

From: Li Nan <[email protected]>

In raid10_sync_request(), if data cannot be read from any disk for
recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
return 'max_sector', indicating that the recovery has been completed.
However, the recovery is just aborted and the data remains inconsistent.

Fix it by setting mirror->recovery_disabled, which will prevent the spare
disk from being added to this mirror. The same issue also exists during
resync, it will be fixed afterwards.

Signed-off-by: Li Nan <[email protected]>
---
drivers/md/raid10.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d93d8cb2b620..3ba1516ea160 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
int chunks_skipped = 0;
sector_t chunk_mask = conf->geo.chunk_mask;
int page_idx = 0;
+ int error_disk = -1;

/*
* Allow skipping a full rebuild for incremental assembly
@@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
return reshape_request(mddev, sector_nr, skipped);

if (chunks_skipped >= conf->geo.raid_disks) {
- /* if there has been nothing to do on any drive,
+ pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
+ test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ? "resync" : "recovery");
+ if (error_disk >= 0 &&
+ !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+ /*
+ * recovery fail, set mirrors.recovory_disabled,
+ * device shouldn't be added to there.
+ */
+ conf->mirrors[error_disk].recovery_disabled =
+ mddev->recovery_disabled;
+ return 0;
+ }
+ /*
+ * if there has been nothing to do on any drive,
* then there is nothing to do at all..
*/
*skipped = 1;
@@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
mdname(mddev));
mirror->recovery_disabled
= mddev->recovery_disabled;
+ } else {
+ error_disk = i;
}
put_buf(r10_bio);
if (rb2)
--
2.31.1


2023-06-01 06:45:14

by Li Nan

[permalink] [raw]
Subject: [PATCH v5 2/2] md/raid10: fix io loss while replacement replace rdev

From: Li Nan <[email protected]>

When removing a disk with replacement, the replacement will be used to
replace rdev. During this process, there is a brief window in which both
rdev and replacement are read as NULL in raid10_write_request(). This
will result in io not being submitted but it should be.

//remove //write
raid10_remove_disk raid10_write_request
mirror->rdev = NULL
read rdev -> NULL
mirror->rdev = mirror->replacement
mirror->replacement = NULL
read replacement -> NULL

Fix it by reading replacement first and rdev later, meanwhile, use smp_mb()
to prevent memory reordering.

Fixes: 475b0321a4df ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
---
drivers/md/raid10.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3ba1516ea160..a7830d2c7477 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -779,8 +779,16 @@ static struct md_rdev *read_balance(struct r10conf *conf,
disk = r10_bio->devs[slot].devnum;
rdev = rcu_dereference(conf->mirrors[disk].replacement);
if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
- r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
+ r10_bio->devs[slot].addr + sectors >
+ rdev->recovery_offset) {
+ /*
+ * Read replacement first to prevent reading both rdev
+ * and replacement as NULL during replacement replace
+ * rdev.
+ */
+ smp_mb();
rdev = rcu_dereference(conf->mirrors[disk].rdev);
+ }
if (rdev == NULL ||
test_bit(Faulty, &rdev->flags))
continue;
@@ -1479,9 +1487,15 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,

for (i = 0; i < conf->copies; i++) {
int d = r10_bio->devs[i].devnum;
- struct md_rdev *rdev = rcu_dereference(conf->mirrors[d].rdev);
- struct md_rdev *rrdev = rcu_dereference(
- conf->mirrors[d].replacement);
+ struct md_rdev *rdev, *rrdev;
+
+ rrdev = rcu_dereference(conf->mirrors[d].replacement);
+ /*
+ * Read replacement first to Prevent reading both rdev and
+ * replacement as NULL during replacement replace rdev.
+ */
+ smp_mb();
+ rdev = rcu_dereference(conf->mirrors[d].rdev);
if (rdev == rrdev)
rrdev = NULL;
if (rdev && (test_bit(Faulty, &rdev->flags)))
--
2.31.1


2023-06-01 07:26:32

by Paul Menzel

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery

Dear Li,


Thank you for your patch.

Am 01.06.23 um 08:24 schrieb [email protected]:
> From: Li Nan <[email protected]>

Unfortunately, I do not understand your commit message summary “fix
incorrect done of recovery”. Maybe:

Do not add sparse disk when recovery aborts

> In raid10_sync_request(), if data cannot be read from any disk for
> recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
> multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
> return 'max_sector', indicating that the recovery has been completed.
> However, the recovery is just aborted and the data remains inconsistent.
>
> Fix it by setting mirror->recovery_disabled, which will prevent the spare
> disk from being added to this mirror. The same issue also exists during
> resync, it will be fixed afterwards.
>
> Signed-off-by: Li Nan <[email protected]>
> ---
> drivers/md/raid10.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index d93d8cb2b620..3ba1516ea160 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
> int chunks_skipped = 0;
> sector_t chunk_mask = conf->geo.chunk_mask;
> int page_idx = 0;
> + int error_disk = -1;
>
> /*
> * Allow skipping a full rebuild for incremental assembly
> @@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
> return reshape_request(mddev, sector_nr, skipped);
>
> if (chunks_skipped >= conf->geo.raid_disks) {
> - /* if there has been nothing to do on any drive,
> + pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
> + test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ? "resync" : "recovery");
> + if (error_disk >= 0 &&
> + !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
> + /*
> + * recovery fail, set mirrors.recovory_disabled,

recov*e*ry

> + * device shouldn't be added to there.
> + */
> + conf->mirrors[error_disk].recovery_disabled =
> + mddev->recovery_disabled;
> + return 0;
> + }
> + /*
> + * if there has been nothing to do on any drive,
> * then there is nothing to do at all..

Just one dot/period at the end?

> */
> *skipped = 1;
> @@ -3638,6 +3652,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
> mdname(mddev));
> mirror->recovery_disabled
> = mddev->recovery_disabled;
> + } else {
> + error_disk = i;
> }
> put_buf(r10_bio);
> if (rb2)


Kind regards,

Paul

2023-06-02 07:06:02

by Li Nan

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery



在 2023/6/1 15:06, Paul Menzel 写道:
> Dear Li,
>
>
> Thank you for your patch.
>
> Am 01.06.23 um 08:24 schrieb [email protected]:
>> From: Li Nan <[email protected]>
>
> Unfortunately, I do not understand your commit message summary “fix
> incorrect done of recovery”. Maybe:
>
> Do not add sparse disk when recovery aborts
>

"recovery fail" is better?

>> In raid10_sync_request(), if data cannot be read from any disk for
>> recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
>> multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
>> return 'max_sector', indicating that the recovery has been completed.
>> However, the recovery is just aborted and the data remains inconsistent.
>>
>> Fix it by setting mirror->recovery_disabled, which will prevent the spare
>> disk from being added to this mirror. The same issue also exists during
>> resync, it will be fixed afterwards.
>>
>> Signed-off-by: Li Nan <[email protected]>
>> ---
>>   drivers/md/raid10.c | 18 +++++++++++++++++-
>>   1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index d93d8cb2b620..3ba1516ea160 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev
>> *mddev, sector_t sector_nr,
>>       int chunks_skipped = 0;
>>       sector_t chunk_mask = conf->geo.chunk_mask;
>>       int page_idx = 0;
>> +    int error_disk = -1;
>>       /*
>>        * Allow skipping a full rebuild for incremental assembly
>> @@ -3386,7 +3387,20 @@ static sector_t raid10_sync_request(struct
>> mddev *mddev, sector_t sector_nr,
>>           return reshape_request(mddev, sector_nr, skipped);
>>       if (chunks_skipped >= conf->geo.raid_disks) {
>> -        /* if there has been nothing to do on any drive,
>> +        pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
>> +            test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync"
>> : "recovery");
>> +        if (error_disk >= 0 &&
>> +            !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
>> +            /*
>> +             * recovery fail, set mirrors.recovory_disabled,
>
> recov*e*ry
>
>> +             * device shouldn't be added to there.
>> +             */
>> +            conf->mirrors[error_disk].recovery_disabled =
>> +                        mddev->recovery_disabled;
>> +            return 0;
>> +        }
>> +        /*
>> +         * if there has been nothing to do on any drive,
>>            * then there is nothing to do at all..
>
> Just one dot/period at the end?
>
Thanks for your suggestion. I will change it in next version.

--
Thanks,
Nan


2023-06-02 08:53:44

by Paul Menzel

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery

Dear Li Nan,


Am 02.06.23 um 08:57 schrieb Li Nan:

> 在 2023/6/1 15:06, Paul Menzel 写道:

>> Am 01.06.23 um 08:24 schrieb [email protected]:
>>> From: Li Nan <[email protected]>
>>
>> Unfortunately, I do not understand your commit message summary “fix
>> incorrect done of recovery”. Maybe:
>>
>> Do not add sparse disk when recovery aborts
>
> "recovery fail" is better?

I think the grammar is incorrect, and it should be fail*s*.

[…]


Kind regards,

Paul

2023-06-02 09:41:56

by Li Nan

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] md/raid10: fix incorrect done of recovery



在 2023/6/2 16:33, Paul Menzel 写道:
> Dear Li Nan,
>
>
> Am 02.06.23 um 08:57 schrieb Li Nan:
>
>> 在 2023/6/1 15:06, Paul Menzel 写道:
>
>>> Am 01.06.23 um 08:24 schrieb [email protected]:
>>>> From: Li Nan <[email protected]>
>>>
>>> Unfortunately, I do not understand your commit message summary “fix
>>> incorrect done of recovery”. Maybe:
>>>
>>> Do not add sparse disk when recovery aborts
>>
>> "recovery fail" is better?
>
> I think the grammar is incorrect, and it should be fail*s*.
>

fix in v7. :)

--
Thanks,
Nan