From: Yu Kuai <[email protected]>
patch 1 fix that array can't assemble if replacement is set while
reshape is still in progress.
patch 2 fix data corruption while echo reshape to sync_action
patch 3-5 fix a deadlock for assemble while reboot from growing
Yu Kuai (5):
md/raid5: don't allow replacement while reshape is not done
md: fix data corruption for raid456 when reshape restart while grow up
md: export md_is_rdwr() and is_md_suspended()
md: add a new api prepare_suspend() in md_personality
md/raid5: fix a deadlock in the case that reshape is interrupted
drivers/md/md.c | 35 +++++++++++++++++------------------
drivers/md/md.h | 18 ++++++++++++++++++
drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 79 insertions(+), 19 deletions(-)
--
2.39.2
From: Yu Kuai <[email protected]>
If reshape is in progress and io across reshape_position is issued, such
io will wait for reshape to make progress(see details in the case that
make_stripe_request() return STRIPE_SCHEDULE_AND_RETRY).
It has been reported several times that if system reboot while growing
raid5 to raid6, array assemble will hang infinitely([1, 2]). This is
because following deadlock is triggered:
1) a normal io is waiting for reshape to progress, this io can be from
system-udevd or mdadm.
2) while assemble, mdadm tries to suspend the array, hence
'reconfig_mutex' is held and mddev_suspend() must wait for normal io
to be done.
3) daemon thread can't start reshape because 'reconfig_mutex' can't be
held.
1) and 3) is unbreakable because they're foundation design. In order to
break 2), following is possible solutions that I can think of:
a) Let mddev_suspend() fail is not a good option, because this will
break many scenarios since mddev_suspend() doesn't fail before.
b) Fail the io that is waiting for reshape to make progress from
mddev_suspend().
c) Return false for the io that is waiting for reshape to make
progress from raid5_make_request(), and these io will wait for
suspend to be done in md_handle_request(), where 'active_io' is
not grabbed.
c) sounds better than b), however, b) is used because it's easy and
straightforward, and it's verified that mdadm can assemble in this case.
On the other hand, c) breaks the logic that mddev_suspend() will wait
for submitted io to be completely handled.
Fix the problem by checking reshape in mddev_suspend(), if reshape can't
make progress and there are still some io waiting for reshape, fail
those io.
Reported-by: Jove <[email protected]>
[1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/
Reported-by: David Gilmour <[email protected]>
[2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 1 +
drivers/md/raid5.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1ef81421e131..e2bef5bb00f2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9133,6 +9133,7 @@ void md_do_sync(struct md_thread *thread)
spin_unlock(&mddev->lock);
wake_up(&resync_wait);
+ wake_up(&mddev->sb_wait);
md_wakeup_thread(mddev->thread);
return;
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bd3b535c0739..7e2bbcfef325 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5966,6 +5966,19 @@ static int add_all_stripe_bios(struct r5conf *conf,
return ret;
}
+static bool reshape_inprogress(struct mddev *mddev)
+{
+ return test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
+ test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+ !test_bit(MD_RECOVERY_DONE, &mddev->recovery) &&
+ !test_bit(MD_RECOVERY_INTR, &mddev->recovery);
+}
+
+static bool reshape_disabled(struct mddev *mddev)
+{
+ return is_md_suspended(mddev) || !md_is_rdwr(mddev);
+}
+
static enum stripe_result make_stripe_request(struct mddev *mddev,
struct r5conf *conf, struct stripe_request_ctx *ctx,
sector_t logical_sector, struct bio *bi)
@@ -5997,7 +6010,8 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
if (ahead_of_reshape(mddev, logical_sector,
conf->reshape_safe)) {
spin_unlock_irq(&conf->device_lock);
- return STRIPE_SCHEDULE_AND_RETRY;
+ ret = STRIPE_SCHEDULE_AND_RETRY;
+ goto out;
}
}
spin_unlock_irq(&conf->device_lock);
@@ -6076,6 +6090,15 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
out_release:
raid5_release_stripe(sh);
+out:
+ if (ret == STRIPE_SCHEDULE_AND_RETRY && !reshape_inprogress(mddev) &&
+ reshape_disabled(mddev)) {
+ bi->bi_status = BLK_STS_IOERR;
+ ret = STRIPE_FAIL;
+ pr_err("md/raid456:%s: io failed across reshape position while reshape can't make progress.\n",
+ mdname(mddev));
+ }
+
return ret;
}
@@ -9045,6 +9068,22 @@ static int raid5_start(struct mddev *mddev)
return r5l_start(conf->log);
}
+static void raid5_prepare_suspend(struct mddev *mddev)
+{
+ struct r5conf *conf = mddev->private;
+
+ wait_event(mddev->sb_wait, !reshape_inprogress(mddev) ||
+ percpu_ref_is_zero(&mddev->active_io));
+ if (percpu_ref_is_zero(&mddev->active_io))
+ return;
+
+ /*
+ * Reshape is not in progress, and array is suspended, io that is
+ * waiting for reshpape can never be done.
+ */
+ wake_up(&conf->wait_for_overlap);
+}
+
static struct md_personality raid6_personality =
{
.name = "raid6",
@@ -9065,6 +9104,7 @@ static struct md_personality raid6_personality =
.check_reshape = raid6_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
+ .prepare_suspend = raid5_prepare_suspend,
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
@@ -9089,6 +9129,7 @@ static struct md_personality raid5_personality =
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
+ .prepare_suspend = raid5_prepare_suspend,
.quiesce = raid5_quiesce,
.takeover = raid5_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
@@ -9114,6 +9155,7 @@ static struct md_personality raid4_personality =
.check_reshape = raid5_check_reshape,
.start_reshape = raid5_start_reshape,
.finish_reshape = raid5_finish_reshape,
+ .prepare_suspend = raid5_prepare_suspend,
.quiesce = raid5_quiesce,
.takeover = raid4_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
--
2.39.2
From: Yu Kuai <[email protected]>
Set rdev replacement has but not only two conditions:
1) MD_RECOVERY_RUNNING is not set;
2) rdev nr_pending is 0;
If reshape is interrupted(for example, echo frozen to sync_action), then
rdev replacement can be set. It's safe because reshape is always prior to
resync in md_check_recovery(). However, if system reboots, then kernel will
complain cannot handle concurrent replacement and reshape and this array
is not able to assemble anymore.
Fix this problem by don't allow replacement until reshape is done.
Reported-by: Peter Neuwirth <[email protected]>
Link: https://lore.kernel.org/linux-raid/[email protected]/
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid5.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a58507a4345d..bd3b535c0739 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8378,6 +8378,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
p = conf->disks + disk;
tmp = rdev_mdlock_deref(mddev, p->rdev);
if (test_bit(WantReplacement, &tmp->flags) &&
+ mddev->reshape_position == MaxSector &&
p->replacement == NULL) {
clear_bit(In_sync, &rdev->flags);
set_bit(Replacement, &rdev->flags);
--
2.39.2
From: Yu Kuai <[email protected]>
The two apis will be used later to fix a deadlock in raid456, there are
no functional changes.
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 16 ----------------
drivers/md/md.h | 17 +++++++++++++++++
2 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index cf60d4b5356d..5db26b7e7314 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -90,18 +90,6 @@ static void mddev_detach(struct mddev *mddev);
static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
static void export_rdev(struct md_rdev *rdev);
-enum md_ro_state {
- MD_RDWR,
- MD_RDONLY,
- MD_AUTO_READ,
- MD_MAX_STATE
-};
-
-static bool md_is_rdwr(struct mddev *mddev)
-{
- return (mddev->ro == MD_RDWR);
-}
-
/*
* Default number of read corrections we'll attempt on an rdev
* before ejecting it from the array. We divide the read error
@@ -377,10 +365,6 @@ EXPORT_SYMBOL_GPL(md_new_event);
static LIST_HEAD(all_mddevs);
static DEFINE_SPINLOCK(all_mddevs_lock);
-static bool is_md_suspended(struct mddev *mddev)
-{
- return percpu_ref_is_dying(&mddev->active_io);
-}
/* Rather than calling directly into the personality make_request function,
* IO requests come here first so that we can check if the device is
* being suspended pending a reconfiguration.
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a228e78e7f3a..7827f27c1406 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -563,6 +563,23 @@ enum recovery_flags {
MD_RESYNCING_REMOTE, /* remote node is running resync thread */
};
+enum md_ro_state {
+ MD_RDWR,
+ MD_RDONLY,
+ MD_AUTO_READ,
+ MD_MAX_STATE
+};
+
+static inline bool md_is_rdwr(struct mddev *mddev)
+{
+ return (mddev->ro == MD_RDWR);
+}
+
+static inline bool is_md_suspended(struct mddev *mddev)
+{
+ return percpu_ref_is_dying(&mddev->active_io);
+}
+
static inline int __must_check mddev_lock(struct mddev *mddev)
{
return mutex_lock_interruptible(&mddev->reconfig_mutex);
--
2.39.2
On Thu, May 11, 2023 at 6:59 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> Set rdev replacement has but not only two conditions:
>
> 1) MD_RECOVERY_RUNNING is not set;
> 2) rdev nr_pending is 0;
The above is confusing. I updated it and applied the set to md-next.
Please let me know if it looks good.
Thanks,
Song
>
> If reshape is interrupted(for example, echo frozen to sync_action), then
> rdev replacement can be set. It's safe because reshape is always prior to
> resync in md_check_recovery(). However, if system reboots, then kernel will
> complain cannot handle concurrent replacement and reshape and this array
> is not able to assemble anymore.
>
> Fix this problem by don't allow replacement until reshape is done.
>
> Reported-by: Peter Neuwirth <[email protected]>
> Link: https://lore.kernel.org/linux-raid/[email protected]/
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid5.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index a58507a4345d..bd3b535c0739 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8378,6 +8378,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> p = conf->disks + disk;
> tmp = rdev_mdlock_deref(mddev, p->rdev);
> if (test_bit(WantReplacement, &tmp->flags) &&
> + mddev->reshape_position == MaxSector &&
> p->replacement == NULL) {
> clear_bit(In_sync, &rdev->flags);
> set_bit(Replacement, &rdev->flags);
> --
> 2.39.2
>
Hi,
在 2023/05/20 7:33, Song Liu 写道:
> On Thu, May 11, 2023 at 6:59 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> Set rdev replacement has but not only two conditions:
>>
>> 1) MD_RECOVERY_RUNNING is not set;
>> 2) rdev nr_pending is 0;
>
> The above is confusing. I updated it and applied the set to md-next.
By the way, I'm willing to add regression test for these problems, and I
already send two other tests and there are no response yet. Should the
test wait for fixed patch to be applied to make progress?
Thanks,
Kuai
> Please let me know if it looks good.
>
> Thanks,
> Song
>
>>
>> If reshape is interrupted(for example, echo frozen to sync_action), then
>> rdev replacement can be set. It's safe because reshape is always prior to
>> resync in md_check_recovery(). However, if system reboots, then kernel will
>> complain cannot handle concurrent replacement and reshape and this array
>> is not able to assemble anymore.
>>
>> Fix this problem by don't allow replacement until reshape is done.
>>
>> Reported-by: Peter Neuwirth <[email protected]>
>> Link: https://lore.kernel.org/linux-raid/[email protected]/
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>> drivers/md/raid5.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index a58507a4345d..bd3b535c0739 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -8378,6 +8378,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
>> p = conf->disks + disk;
>> tmp = rdev_mdlock_deref(mddev, p->rdev);
>> if (test_bit(WantReplacement, &tmp->flags) &&
>> + mddev->reshape_position == MaxSector &&
>> p->replacement == NULL) {
>> clear_bit(In_sync, &rdev->flags);
>> set_bit(Replacement, &rdev->flags);
>> --
>> 2.39.2
>>
> .
>
On Sun, May 21, 2023 at 8:46 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2023/05/20 7:33, Song Liu 写道:
> > On Thu, May 11, 2023 at 6:59 PM Yu Kuai <[email protected]> wrote:
> >>
> >> From: Yu Kuai <[email protected]>
> >>
> >> Set rdev replacement has but not only two conditions:
> >>
> >> 1) MD_RECOVERY_RUNNING is not set;
> >> 2) rdev nr_pending is 0;
> >
> > The above is confusing. I updated it and applied the set to md-next.
>
> By the way, I'm willing to add regression test for these problems, and I
> already send two other tests and there are no response yet. Should the
> test wait for fixed patch to be applied to make progress?
Jes just had a baby, so there will be long delays with mdadm patches.
I am already using the new tests, so please keep sending them.
Thanks,
Song