2024-01-24 09:40:52

by Yu Kuai

Subject: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

First regression, related to stopping the sync thread:

The lifetime of sync_thread is designed as follows:

1) Decide to start sync_thread: set MD_RECOVERY_NEEDED and wake up the
daemon thread;
2) The daemon thread detects that MD_RECOVERY_NEEDED is set, then sets
MD_RECOVERY_RUNNING and registers sync_thread;
3) md_do_sync() executes the actual work; when it finishes or is
interrupted, it sets MD_RECOVERY_DONE and wakes up the daemon thread;
4) The daemon thread detects that MD_RECOVERY_DONE is set, then clears
MD_RECOVERY_RUNNING and unregisters sync_thread;
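
A simplified sketch of this handshake, using the real flag names but
otherwise illustrative (this is not the literal md.c code):

	/* 1) requester: ask for a new sync action */
	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
	md_wakeup_thread(mddev->thread);

	/* 2) daemon thread, in md_check_recovery()/md_start_sync() */
	set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
	rcu_assign_pointer(mddev->sync_thread,
			   md_register_thread(md_do_sync, mddev, "resync"));

	/* 3) md_do_sync(), when the work finishes or is interrupted */
	set_bit(MD_RECOVERY_DONE, &mddev->recovery);
	md_wakeup_thread(mddev->thread);

	/* 4) daemon thread, in md_check_recovery() */
	md_reap_sync_thread(mddev);	/* unregisters sync_thread and
					 * clears MD_RECOVERY_RUNNING */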

In v6.7, commit f52f5c71f3d4 ("md: fix stopping sync thread") made md/raid
follow this design; however, dm-raid was not considered at that time, and
the following tests hang:

shell/integrity-caching.sh
shell/lvconvert-raid-reshape.sh

Patches 1-4 fix the broken tests:
- patch 1 fixes step 4) being broken by a suspended array;
- patch 2 fixes step 4) being broken by a read-only array;
- patch 3 fixes step 3) being broken because md_do_sync() doesn't set
MD_RECOVERY_DONE; note that this patch introduces a new problem, data
corruption, which is fixed by later patches;
- patch 4 fixes step 1) being broken because sync_thread is registered and
MD_RECOVERY_RUNNING is set directly;

With patches 1-4, the above tests no longer hang; however, they still
fail and complain that ext4 is corrupted.

Second regression, related to the frozen sync thread:

Note that for raid456, if reshape is interrupted, then calling
"pers->start_reshape" will corrupt data. This is because dm-raid relies on
md_do_sync() not setting MD_RECOVERY_DONE so that a new sync_thread won't
be registered, and patch 3 breaks exactly that.

- Patches 5-6 fix this problem by interrupting reshape and freezing
sync_thread in dm_suspend(), then unfreezing it and continuing reshape in
dm_resume() (a rough sketch of the idea appears after this list). It's
verified that the dm-raid tests no longer complain that ext4 is
corrupted.
- Patch 7 fixes the problem that raid_message() calls
md_reap_sync_thread() directly, without holding 'reconfig_mutex'.
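
A rough sketch of where patches 5-6 hook in, assuming the helpers exported
by patch 5 are md_frozen_sync_thread()/md_unfrozen_sync_thread()
(illustrative only, not the actual dm-raid.c changes):

	static void raid_presuspend(struct dm_target *ti)
	{
		struct raid_set *rs = ti->private;

		/* interrupt a running reshape and keep sync_thread frozen */
		md_frozen_sync_thread(&rs->md);
	}

	static void raid_resume(struct dm_target *ti)
	{
		struct raid_set *rs = ti->private;

		/* unfreeze so the reshape continues from its checkpoint */
		md_unfrozen_sync_thread(&rs->md);
	}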

Last regression, related to dm-raid456 IO concurrent with reshape:

For raid456, if reshape is still in progress, IO across the reshape
position waits for reshape to make progress. However, for dm-raid, in the
following cases reshape will never make progress, hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;

After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fixed the problem that IO across the
reshape position didn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh started to hang in raid5_make_request().
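
Conceptually, the submission path behaves like the pseudo-code below; the
helper names are made up for illustration, the real raid5.c code works at
stripe granularity:

	/* Illustrative pseudo-code, not the actual raid5_make_request(). */
	while (reshape_in_progress(conf) &&
	       range_ahead_of_reshape(conf, first_sector, last_sector)) {
		/*
		 * Sleep until the reshape frontier has moved past this
		 * range. If reshape never makes progress (the three
		 * dm-raid cases above), the submitter is stuck here
		 * forever.
		 */
		wait_for_reshape_progress(conf);
	}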

For md/raid, the problem doesn't exist because:

1) If the array is read-only, it can be switched to read-write via
ioctl/sysfs;
2) md/raid never sets MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, since mddev_suspend() no longer holds
'reconfig_mutex', the flag can be cleared and reshape can continue via the
sysfs api 'sync_action'.

However, I'm not sure how to avoid the problem in dm-raid yet.

- patches 9-11 fix this problem by detecting the above 3 cases in
dm_suspend() and failing such IO directly.
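
A minimal sketch of the stall condition itself (my own illustration of the
idea, not necessarily how patches 9-11 spell it; md_is_rdwr() is the helper
exported by patch 10):

	/* Reshape cannot make progress in any of these cases. */
	static bool reshape_is_stalled(struct mddev *mddev)
	{
		return !md_is_rdwr(mddev) ||
		       test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
		       test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
	}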

If users really hit this IO error, it means they were reading wrong data
before c467e97f079f. And it's safe to read/write the array after reshape
makes progress successfully.

Tests:

I have already run the following two tests many times and verified that
they no longer fail:

shell/integrity-caching.sh
shell/lvconvert-raid-reshape.sh

The other tests are still running. However, I'm sending this patchset now
in case people think the fixes are not appropriate. Running the full test
suite takes a lot of time in my VM; I'll post full test results soon.

Yu Kuai (11):
md: don't ignore suspended array in md_check_recovery()
md: don't ignore read-only array in md_check_recovery()
md: make sure md_do_sync() will set MD_RECOVERY_DONE
md: don't register sync_thread for reshape directly
md: export helpers to stop sync_thread
dm-raid: really frozen sync_thread during suspend
md/dm-raid: don't call md_reap_sync_thread() directly
dm-raid: remove mddev_suspend/resume()
dm-raid: add a new helper prepare_suspend() in md_personality
md: export helper md_is_rdwr()
md/raid456: fix a deadlock for dm-raid456 while io concurrent with
reshape

drivers/md/dm-raid.c | 76 +++++++++++++++++++++----------
drivers/md/md.c | 104 ++++++++++++++++++++++++++++---------------
drivers/md/md.h | 16 +++++++
drivers/md/raid10.c | 16 +------
drivers/md/raid5.c | 61 +++++++++++++------------
5 files changed, 171 insertions(+), 102 deletions(-)

--
2.39.2



2024-01-24 09:45:18

by Yu Kuai

Subject: [PATCH v2 03/11] md: make sure md_do_sync() will set MD_RECOVERY_DONE

stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must
set MD_RECOVERY_DONE, so that the follow-up md_check_recovery() will
unregister sync_thread, clear MD_RECOVERY_RUNNING and wake up
stop_sync_thread().
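
For reference, after f52f5c71f3d4 stop_sync_thread() waits roughly as
follows (paraphrased, with locking details trimmed):

	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
	md_wakeup_thread_directly(mddev->sync_thread);
	/*
	 * Only the daemon thread clears MD_RECOVERY_RUNNING, and it does
	 * so only after md_do_sync() has set MD_RECOVERY_DONE; if that
	 * never happens, this wait never finishes.
	 */
	wait_event(resync_wait,
		   !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));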

If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will
return without setting MD_RECOVERY_DONE, and after commit f52f5c71f3d4
("md: fix stopping sync thread"), dm-raid switched from
md_reap_sync_thread() to stop_sync_thread() to unregister sync_thread
from md_stop() and md_stop_writes(), causing the test
shell/lvconvert-raid-reshape.sh to hang.

We shouldn't switch back to md_reap_sync_thread() because it's
problematic in the first place. Fix the problem by making sure
md_do_sync() will set MD_RECOVERY_DONE.

Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: d5d885fd514f ("md: introduce new personality funciton start()")
Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6906d023f1d6..c65dfd156090 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8788,12 +8788,16 @@ void md_do_sync(struct md_thread *thread)
int ret;

/* just incase thread restarts... */
- if (test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
- test_bit(MD_RECOVERY_WAIT, &mddev->recovery))
+ if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
return;
- if (!md_is_rdwr(mddev)) {/* never try to sync a read-only array */
+
+ if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+ goto skip;
+
+ if (test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
+ !md_is_rdwr(mddev)) {/* never try to sync a read-only array */
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- return;
+ goto skip;
}

if (mddev_is_clustered(mddev)) {
--
2.39.2


2024-01-24 09:45:48

by Yu Kuai

Subject: [PATCH v2 04/11] md: don't register sync_thread for reshape directly

Currently, if reshape is interrupted, then reassembling the array will
register sync_thread directly from pers->run(). In this case
'MD_RECOVERY_RUNNING' is set directly, but there is no guarantee that
md_do_sync() will be executed, hence stop_sync_thread() will hang because
'MD_RECOVERY_RUNNING' can't be cleared.

The previous patch makes sure that md_do_sync() will set MD_RECOVERY_DONE;
however, the following hang can still be triggered occasionally by the
dm-raid test shell/lvconvert-raid-reshape.sh:

[root@fedora ~]# cat /proc/1982/stack
[<0>] stop_sync_thread+0x1ab/0x270 [md_mod]
[<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod]
[<0>] raid_presuspend+0x1e/0x70 [dm_raid]
[<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod]
[<0>] __dm_destroy+0x2a5/0x310 [dm_mod]
[<0>] dm_destroy+0x16/0x30 [dm_mod]
[<0>] dev_remove+0x165/0x290 [dm_mod]
[<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod]
[<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod]
[<0>] vfs_ioctl+0x21/0x60
[<0>] __x64_sys_ioctl+0xb9/0xe0
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

Meanwhile mddev->recovery is:
MD_RECOVERY_RUNNING |
MD_RECOVERY_INTR |
MD_RECOVERY_RESHAPE |
MD_RECOVERY_FROZEN

Fix this problem by removing the code that registers sync_thread directly
from raid10 and raid5, and let md_check_recovery() register sync_thread
instead.

Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 5 ++++-
drivers/md/raid10.c | 16 ++--------------
drivers/md/raid5.c | 29 ++---------------------------
3 files changed, 8 insertions(+), 42 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c65dfd156090..6c5d0a372927 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9372,6 +9372,7 @@ static void md_start_sync(struct work_struct *ws)
struct mddev *mddev = container_of(ws, struct mddev, sync_work);
int spares = 0;
bool suspend = false;
+ char *name;

if (md_spares_need_change(mddev))
suspend = true;
@@ -9404,8 +9405,10 @@ static void md_start_sync(struct work_struct *ws)
if (spares)
md_bitmap_write_all(mddev->bitmap);

+ name = test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ?
+ "reshape" : "resync";
rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "resync"));
+ md_register_thread(md_do_sync, mddev, name));
if (!mddev->sync_thread) {
pr_warn("%s: could not start resync thread...\n",
mdname(mddev));
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7412066ea22c..a5f8419e2df1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4175,11 +4175,7 @@ static int raid10_run(struct mddev *mddev)
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread)
- goto out_free_conf;
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
}

return 0;
@@ -4573,16 +4569,8 @@ static int raid10_start_reshape(struct mddev *mddev)
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread) {
- ret = -EAGAIN;
- goto abort;
- }
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
conf->reshape_checkpoint = jiffies;
- md_wakeup_thread(mddev->sync_thread);
md_new_event();
return 0;

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8497880135ee..6a7a32f7fb91 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7936,11 +7936,7 @@ static int raid5_run(struct mddev *mddev)
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread)
- goto abort;
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
}

/* Ok, everything is just fine now */
@@ -8506,29 +8502,8 @@ static int raid5_start_reshape(struct mddev *mddev)
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread) {
- mddev->recovery = 0;
- spin_lock_irq(&conf->device_lock);
- write_seqcount_begin(&conf->gen_lock);
- mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
- mddev->new_chunk_sectors =
- conf->chunk_sectors = conf->prev_chunk_sectors;
- mddev->new_layout = conf->algorithm = conf->prev_algo;
- rdev_for_each(rdev, mddev)
- rdev->new_data_offset = rdev->data_offset;
- smp_wmb();
- conf->generation --;
- conf->reshape_progress = MaxSector;
- mddev->reshape_position = MaxSector;
- write_seqcount_end(&conf->gen_lock);
- spin_unlock_irq(&conf->device_lock);
- return -EAGAIN;
- }
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
conf->reshape_checkpoint = jiffies;
- md_wakeup_thread(mddev->sync_thread);
md_new_event();
return 0;
}
--
2.39.2


2024-01-24 09:49:05

by Yu Kuai

Subject: [PATCH v2 07/11] md/dm-raid: don't call md_reap_sync_thread() directly

Currently md_reap_sync_thread() is called from raid_message() directly
without holding 'reconfig_mutex'. This is definitely unsafe because
md_reap_sync_thread() can change many fields that are protected by
'reconfig_mutex'.

However, holding 'reconfig_mutex' here is still problematic because it
can cause a deadlock; see, for example, commit 130443d60b1b ("md: refactor
idle/frozen_sync_thread() to fix deadlock").

Fix this problem by using stop_sync_thread() to unregister sync_thread,
as md/raid does.

Fixes: be83651f0050 ("DM RAID: Add message/status support for changing sync action")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5ce3c6020b1b..6b6c011d9f69 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3719,6 +3719,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
{
struct raid_set *rs = ti->private;
struct mddev *mddev = &rs->md;
+ int ret = 0;

if (!mddev->pers || !mddev->pers->sync_request)
return -EINVAL;
@@ -3726,17 +3727,24 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
if (test_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
return -EBUSY;

- if (!strcasecmp(argv[0], "frozen"))
- set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
- else
- clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ if (!strcasecmp(argv[0], "frozen")) {
+ ret = mddev_lock(mddev);
+ if (ret)
+ return ret;

- if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
- if (mddev->sync_thread) {
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev);
- }
- } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
+ md_frozen_sync_thread(mddev);
+ mddev_unlock(mddev);
+ } else if (!strcasecmp(argv[0], "idle")) {
+ ret = mddev_lock(mddev);
+ if (ret)
+ return ret;
+
+ md_idle_sync_thread(mddev);
+ mddev_unlock(mddev);
+ }
+
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
return -EBUSY;
else if (!strcasecmp(argv[0], "resync"))
; /* MD_RECOVERY_NEEDED set below */
--
2.39.2


2024-01-24 09:50:35

by Yu Kuai

Subject: [PATCH v2 10/11] md: export helper md_is_rdwr()

There are no functional changes for now; this prepares for fixing a
deadlock in dm-raid456.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 12 ------------
drivers/md/md.h | 12 ++++++++++++
2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 90cf31b53804..c803a805ea0d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -99,18 +99,6 @@ static void mddev_detach(struct mddev *mddev);
static void export_rdev(struct md_rdev *rdev, struct mddev *mddev);
static void md_wakeup_thread_directly(struct md_thread __rcu *thread);

-enum md_ro_state {
- MD_RDWR,
- MD_RDONLY,
- MD_AUTO_READ,
- MD_MAX_STATE
-};
-
-static bool md_is_rdwr(struct mddev *mddev)
-{
- return (mddev->ro == MD_RDWR);
-}
-
/*
* Default number of read corrections we'll attempt on an rdev
* before ejecting it from the array. We divide the read error
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 29b476ff3b9f..98da86d38ba8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -558,6 +558,18 @@ enum recovery_flags {
MD_RESYNCING_REMOTE, /* remote node is running resync thread */
};

+enum md_ro_state {
+ MD_RDWR,
+ MD_RDONLY,
+ MD_AUTO_READ,
+ MD_MAX_STATE
+};
+
+static bool md_is_rdwr(struct mddev *mddev)
+{
+ return (mddev->ro == MD_RDWR);
+}
+
static inline int __must_check mddev_lock(struct mddev *mddev)
{
return mutex_lock_interruptible(&mddev->reconfig_mutex);
--
2.39.2


2024-01-24 12:52:14

by Xiao Ni

Subject: Re: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

On Wed, Jan 24, 2024 at 5:18 PM Yu Kuai <[email protected]> wrote:
>
> First regression related to stop sync thread:
>
> The lifetime of sync_thread is designed as following:
>
> 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up
> daemon thread;
> 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set
> MD_RECOVERY_RUNNING and register sync_thread;
> 3) Execute md_do_sync() for the actual work, if it's done or
> interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread;
> 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear
> MD_RECOVERY_RUNNING and unregister sync_thread;
>
> In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4
> ("md: fix stopping sync thread"), however, dm-raid is not considered at
> that time, and following test will hang:

Hi Kuai

Thanks very much for the patch set. I reported the dm-raid deadlock
when stopping dm-raid, and we had the patch set "[PATCH v5 md-fixes
0/3] md: fix stopping sync thread" which contains patch f52f5c71f3d4. So
we did consider dm-raid at that time, because we wanted to resolve the
deadlock problem. I re-read patch f52f5c71f3d4. It has two major
changes. One is to use a common function, stop_sync_thread(), for
stopping the sync thread; this fixes the deadlock problem. The second
changes the way the sync thread is reaped: mdraid and dmraid reap the
sync thread in __md_stop_writes. So the patch looks overweight.

Before f52f5c71f3d4, do_md_stop released reconfig_mutex before waiting
for sync_thread to finish. So there should not be the deadlock problem
that was fixed in 130443d60b1b ("md: refactor
idle/frozen_sync_thread() to fix deadlock"). So we only need to change
__md_stop_writes to stop the sync thread like do_md_stop does, and reap
the sync thread directly.

Maybe this can avoid the deadlock? I'll try this way and report the test result.
>
> shell/integrity-caching.sh
> shell/lvconvert-raid-reshape.sh
>
> This patch set fix the broken test by patch 1-4;
> - patch 1 fix that step 4) is broken by suspended array;
> - patch 2 fix that step 4) is broken by read-only array;
> - patch 3 fix that step 3) is broken that md_do_sync() doesn't set
> MD_RECOVERY_DONE; Noted that this patch will introdece new problem that
> data will be corrupted, which will be fixed in later patches.
> - patch 4 fix that setp 1) is broken that sync_thread is register and
> MD_RECOVERY_RUNNING is set directly;
>
> With patch 1-4, the above test won't hang anymore, however, the test
> will still fail and complain that ext4 is corrupted;

For patch 3, as I mentioned today, the root cause is that dm-raid's
rs_start_reshape() sets MD_RECOVERY_WAIT, so md_do_sync() returns when
MD_RECOVERY_WAIT is set. That's the reason why dm-raid can't stop the
sync thread when starting a new reshape. The way patch 3 does it looks
like a workaround. We need to figure out whether dm-raid really needs to
set MD_RECOVERY_WAIT. Because we now stop the sync thread in an
asynchronous way, the deadlock problem that was fixed in 644e2537f ("dm
raid: fix stripe adding reshape deadlock") may disappear. Maybe we can
revert that patch.

Best Regards
Xiao



2024-01-25 00:51:32

by Xiao Ni

Subject: Re: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

On Wed, Jan 24, 2024 at 8:19 PM Xiao Ni <[email protected]> wrote:
>
> On Wed, Jan 24, 2024 at 5:18 PM Yu Kuai <[email protected]> wrote:
> >
> > First regression related to stop sync thread:
> >
> > The lifetime of sync_thread is designed as following:
> >
> > 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up
> > daemon thread;
> > 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set
> > MD_RECOVERY_RUNNING and register sync_thread;
> > 3) Execute md_do_sync() for the actual work, if it's done or
> > interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread;
> > 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear
> > MD_RECOVERY_RUNNING and unregister sync_thread;
> >
> > In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4
> > ("md: fix stopping sync thread"), however, dm-raid is not considered at
> > that time, and following test will hang:
>
> Hi Kuai
>
> Thanks very much for the patch set. I reported the dm raid deadlock
> when stopping dm raid and we had the patch set "[PATCH v5 md-fixes
> 0/3] md: fix stopping sync thread" which has patch f52f5c71f3d4. So we
> indeed considered dm-raid that time. Because we want to resolve the
> deadlock problem. I re-look patch f52f5c71f3d4. It has two major
> changes. One is to use a common function stop_sync_thread for stopping
> sync thread. This can fix the deadlock problem. The second change
> changes the way to reap sync thread. mdraid and dmraid reap sync
> thread in __md_stop_writes. So the patch looks overweight.
>
> Before f52f5c71f3d4 do_md_stop release reconfig_mutex before waiting
> sync_thread to finish. So there should not be the deadlock problem
> which has been fixed in 130443d60b1b ("md: refactor
> idle/frozen_sync_thread() to fix deadlock"). So we only need to change
> __md_stop_writes to stop sync thread like do_md_stop and reap sync
> thread directly.
>
> Maybe this can avoid deadlock? I'll try this way and give the test result.

Please ignore my last comment; there was something wrong with it. Only
dmraid called md_reap_sync_thread directly in __md_stop_writes before.

130443d60b1b ("md: refactor idle/frozen_sync_thread() to fix
deadlock") fixes a deadlock problem: sync IO is running and user IO
comes in; the sync IO needs to wait for the user IO; the user IO needs
to update the superblock, which needs mddev->reconfig_mutex; but the
user action that stops the sync thread happens while holding this lock.
So this is the deadlock. dmraid doesn't update the superblock the way md
does. I'm not sure whether dmraid has such a deadlock problem. If not,
dmraid can call md_reap_sync_thread directly, right?

> >
> > shell/integrity-caching.sh
> > shell/lvconvert-raid-reshape.sh
> >
> > This patch set fix the broken test by patch 1-4;
> > - patch 1 fix that step 4) is broken by suspended array;
> > - patch 2 fix that step 4) is broken by read-only array;
> > - patch 3 fix that step 3) is broken that md_do_sync() doesn't set
> > MD_RECOVERY_DONE; Noted that this patch will introdece new problem that
> > data will be corrupted, which will be fixed in later patches.
> > - patch 4 fix that setp 1) is broken that sync_thread is register and
> > MD_RECOVERY_RUNNING is set directly;
> >
> > With patch 1-4, the above test won't hang anymore, however, the test
> > will still fail and complain that ext4 is corrupted;
>
> For patch3, as I mentioned today, the root cause is
> dm-raid->rs_start_reshape sets MD_RECOVERY_WAIT. So md_do_sync returns
> when MD_RECOVERY_WAIT is set. It's the reason why dm-raid can't stop
> sync thread when start a new reshape. . The way in patch3 looks like a
> workaround. We need to figure out if dm raid really needs to set
> MD_RECOVERY_WAIT. Because now we stop sync thread in an asynchronous
> way. So the deadlock problem which was fixed in 644e2537f (dm raid:
> fix stripe adding reshape deadlock) may disappear. Maybe we can revert
> the patch.

After talking with Heinz, he mentioned that dmraid needs this bit to
prevent the md sync thread from starting during reshape. So patch 3
looks good.

Best Regards
Xiao


2024-01-25 00:56:38

by Song Liu

Subject: Re: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

On Wed, Jan 24, 2024 at 1:18 AM Yu Kuai <[email protected]> wrote:
>
> First regression related to stop sync thread:
>
> The lifetime of sync_thread is designed as following:
>
> 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up
> daemon thread;
> 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set
> MD_RECOVERY_RUNNING and register sync_thread;
> 3) Execute md_do_sync() for the actual work, if it's done or
> interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread;
> 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear
> MD_RECOVERY_RUNNING and unregister sync_thread;
>
> In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4
> ("md: fix stopping sync thread"), however, dm-raid is not considered at
> that time, and following test will hang:
>
> shell/integrity-caching.sh
> shell/lvconvert-raid-reshape.sh
>
> This patch set fix the broken test by patch 1-4;
> - patch 1 fix that step 4) is broken by suspended array;
> - patch 2 fix that step 4) is broken by read-only array;
> - patch 3 fix that step 3) is broken that md_do_sync() doesn't set
> MD_RECOVERY_DONE; Noted that this patch will introdece new problem that
> data will be corrupted, which will be fixed in later patches.
> - patch 4 fix that setp 1) is broken that sync_thread is register and
> MD_RECOVERY_RUNNING is set directly;
>
> With patch 1-4, the above test won't hang anymore, however, the test
> will still fail and complain that ext4 is corrupted;
>
> Second regression related to frozen sync thread:
>
> Noted that for raid456, if reshape is interrupted, then call
> "pers->start_reshape" will corrupt data. This is because dm-raid rely on
> md_do_sync() doesn't set MD_RECOVERY_DONE so that new sync_thread won't
> be registered, and patch 3 just break this.
>
> - Patch 5-6 fix this problem by interrupting reshape and frozen
> sync_thread in dm_suspend(), then unfrozen and continue reshape in
> dm_resume(). It's verified that dm-raid tests won't complain that
> ext4 is corrupted anymore.
> - Patch 7 fix the problem that raid_message() call
> md_reap_sync_thread() directly, without holding 'reconfig_mutex'.
>
> Last regression related to dm-raid456 IO concurrent with reshape:
>
> For raid456, if reshape is still in progress, then IO across reshape
> position will wait for reshape to make progress. However, for dm-raid,
> in following cases reshape will never make progress hence IO will hang:
>
> 1) the array is read-only;
> 2) MD_RECOVERY_WAIT is set;
> 3) MD_RECOVERY_FROZEN is set;
>
> After commit c467e97f079f ("md/raid6: use valid sector values to determine
> if an I/O should wait on the reshape") fix the problem that IO across
> reshape position doesn't wait for reshape, the dm-raid test
> shell/lvconvert-raid-reshape.sh start to hang at raid5_make_request().
>
> For md/raid, the problem doesn't exist because:
>
> 1) If array is read-only, it can switch to read-write by ioctl/sysfs;
> 2) md/raid never set MD_RECOVERY_WAIT;
> 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
> 'reconfig_mutex' anymore, it can be cleared and reshape can continue by
> sysfs api 'sync_action'.
>
> However, I'm not sure yet how to avoid the problem in dm-raid yet.
>
> - patch 9-11 fix this problem by detecting the above 3 cases in
> dm_suspend(), and fail those IO directly.
>
> If user really meet the IO error, then it means they're reading the wrong
> data before c467e97f079f. And it's safe to read/write the array after
> reshape make progress successfully.

c467e97f079f got backported to stable kernels (6.6.13, for example). We
will need some fixes for them (to fix shell/lvconvert-raid-reshape.sh).

Mikulas and folks, please help review the analysis above and the dm-raid
changes. The failure was triggered by c467e97f079f; however, the commit
is doing the right thing, so we really shouldn't revert it.

>
> Tests:
>
> I already run the following two tests many times and verified that they
> won't fail anymore:
>
> shell/integrity-caching.sh
> shell/lvconvert-raid-reshape.sh

shell/lvconvert-raid-reshape-linear_to_raid6-single-type.sh is failing
with upstream + this set. (I need to fix some trivial compilation errors,
which are probably last minute typos).

Thanks,
Song

2024-01-25 01:09:26

by Yu Kuai

Subject: Re: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

Hi,

On 2024/01/25 8:46, Song Liu wrote:
> On Wed, Jan 24, 2024 at 1:18 AM Yu Kuai <[email protected]> wrote:
>>
>> First regression related to stop sync thread:
>>
>> The lifetime of sync_thread is designed as following:
>>
>> 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up
>> daemon thread;
>> 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set
>> MD_RECOVERY_RUNNING and register sync_thread;
>> 3) Execute md_do_sync() for the actual work, if it's done or
>> interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread;
>> 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear
>> MD_RECOVERY_RUNNING and unregister sync_thread;
>>
>> In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4
>> ("md: fix stopping sync thread"), however, dm-raid is not considered at
>> that time, and following test will hang:
>>
>> shell/integrity-caching.sh
>> shell/lvconvert-raid-reshape.sh
>>
>> This patch set fix the broken test by patch 1-4;
>> - patch 1 fix that step 4) is broken by suspended array;
>> - patch 2 fix that step 4) is broken by read-only array;
>> - patch 3 fix that step 3) is broken that md_do_sync() doesn't set
>> MD_RECOVERY_DONE; Noted that this patch will introdece new problem that
>> data will be corrupted, which will be fixed in later patches.
>> - patch 4 fix that setp 1) is broken that sync_thread is register and
>> MD_RECOVERY_RUNNING is set directly;
>>
>> With patch 1-4, the above test won't hang anymore, however, the test
>> will still fail and complain that ext4 is corrupted;
>>
>> Second regression related to frozen sync thread:
>>
>> Noted that for raid456, if reshape is interrupted, then call
>> "pers->start_reshape" will corrupt data. This is because dm-raid rely on
>> md_do_sync() doesn't set MD_RECOVERY_DONE so that new sync_thread won't
>> be registered, and patch 3 just break this.
>>
>> - Patch 5-6 fix this problem by interrupting reshape and frozen
>> sync_thread in dm_suspend(), then unfrozen and continue reshape in
>> dm_resume(). It's verified that dm-raid tests won't complain that
>> ext4 is corrupted anymore.
>> - Patch 7 fix the problem that raid_message() call
>> md_reap_sync_thread() directly, without holding 'reconfig_mutex'.
>>
>> Last regression related to dm-raid456 IO concurrent with reshape:
>>
>> For raid456, if reshape is still in progress, then IO across reshape
>> position will wait for reshape to make progress. However, for dm-raid,
>> in following cases reshape will never make progress hence IO will hang:
>>
>> 1) the array is read-only;
>> 2) MD_RECOVERY_WAIT is set;
>> 3) MD_RECOVERY_FROZEN is set;
>>
>> After commit c467e97f079f ("md/raid6: use valid sector values to determine
>> if an I/O should wait on the reshape") fix the problem that IO across
>> reshape position doesn't wait for reshape, the dm-raid test
>> shell/lvconvert-raid-reshape.sh start to hang at raid5_make_request().
>>
>> For md/raid, the problem doesn't exist because:
>>
>> 1) If array is read-only, it can switch to read-write by ioctl/sysfs;
>> 2) md/raid never set MD_RECOVERY_WAIT;
>> 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
>> 'reconfig_mutex' anymore, it can be cleared and reshape can continue by
>> sysfs api 'sync_action'.
>>
>> However, I'm not sure yet how to avoid the problem in dm-raid yet.
>>
>> - patch 9-11 fix this problem by detecting the above 3 cases in
>> dm_suspend(), and fail those IO directly.
>>
>> If user really meet the IO error, then it means they're reading the wrong
>> data before c467e97f079f. And it's safe to read/write the array after
>> reshape make progress successfully.
>
> c467e97f079f got back ported to stable kernels (6.6.13, for example). We
> will need some fixes for them (to fix shell/lvconvert-raid-reshape.sh).
>
> Mikulas and folks, please help review the analysis above and dm-raid
> changes. The failure was triggered by c467e97f079f. However, the commit
> is doing the right thing, so we really shouldn't revert it.
>
>>
>> Tests:
>>
>> I already run the following two tests many times and verified that they
>> won't fail anymore:
>>
>> shell/integrity-caching.sh
>> shell/lvconvert-raid-reshape.sh
>
> shell/lvconvert-raid-reshape-linear_to_raid6-single-type.sh is failing
> with upstream + this set. (I need to fix some trivial compilation errors,
> which are probably last minute typos).

I've been running tests for this patchset overnight in my VM, and this
test has been run 9 times and passed every time. It looks like I can't
reproduce this in my VM.

Thanks,
Kuai



2024-01-25 01:41:34

by Yu Kuai

Subject: Re: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

Hi,

On 2024/01/25 8:50, Xiao Ni wrote:
> On Wed, Jan 24, 2024 at 8:19 PM Xiao Ni <[email protected]> wrote:
>>
>> On Wed, Jan 24, 2024 at 5:18 PM Yu Kuai <[email protected]> wrote:
>>>
>>> First regression related to stop sync thread:
>>>
>>> The lifetime of sync_thread is designed as following:
>>>
>>> 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up
>>> daemon thread;
>>> 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set
>>> MD_RECOVERY_RUNNING and register sync_thread;
>>> 3) Execute md_do_sync() for the actual work, if it's done or
>>> interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread;
>>> 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear
>>> MD_RECOVERY_RUNNING and unregister sync_thread;
>>>
>>> In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4
>>> ("md: fix stopping sync thread"), however, dm-raid is not considered at
>>> that time, and following test will hang:
>>
>> Hi Kuai
>>
>> Thanks very much for the patch set. I reported the dm raid deadlock
>> when stopping dm raid and we had the patch set "[PATCH v5 md-fixes
>> 0/3] md: fix stopping sync thread" which has patch f52f5c71f3d4. So we
>> indeed considered dm-raid that time. Because we want to resolve the
>> deadlock problem. I re-look patch f52f5c71f3d4. It has two major
>> changes. One is to use a common function stop_sync_thread for stopping
>> sync thread. This can fix the deadlock problem. The second change
>> changes the way to reap sync thread. mdraid and dmraid reap sync
>> thread in __md_stop_writes. So the patch looks overweight.
>>
>> Before f52f5c71f3d4 do_md_stop release reconfig_mutex before waiting
>> sync_thread to finish. So there should not be the deadlock problem
>> which has been fixed in 130443d60b1b ("md: refactor
>> idle/frozen_sync_thread() to fix deadlock"). So we only need to change
>> __md_stop_writes to stop sync thread like do_md_stop and reap sync
>> thread directly.
>>
>> Maybe this can avoid deadlock? I'll try this way and give the test result.
>
> Please ignore my last comment. There is something wrong. Only dmraid
> calls reap_sync_thread directly in __md_stop_writes before.
>
> 130443d60b1b ("md: refactor idle/frozen_sync_thread() to fix
> deadlock") fixes a deadlock problem. sync io is running and user io
> comes. sync io needs to wait user io. user io needs to update
> suerblock and it needs mddev->reconfig_mutex. But user action happens
> with this lock to stop sync thread. So this is the deadlock. For
> dmraid, it doesn't update superblock like md. I'm not sure if dmraid
> has such deadlock problem. If not, dmraid can call md_reap_sync_thread
> directly, right?

Yes, the deadlock problem is that holding the lock while calling
md_reap_sync_thread() directly will block the daemon thread from
handling IO.

However, regarding the dm-raid superblock, I'm confused here: the code
looks like the md superblock is still updated, for example:

rs_update_sbs
set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
md_update_sb(mddev, 1);

And the code in raid1/10/5 to update md superblock doesn't have any
special handling for dm-raid. Or am I missing something here?

Thanks,
Kuai

>
>>>
>>> shell/integrity-caching.sh
>>> shell/lvconvert-raid-reshape.sh
>>>
>>> This patch set fix the broken test by patch 1-4;
>>> - patch 1 fix that step 4) is broken by suspended array;
>>> - patch 2 fix that step 4) is broken by read-only array;
>>> - patch 3 fix that step 3) is broken that md_do_sync() doesn't set
>>> MD_RECOVERY_DONE; Noted that this patch will introdece new problem that
>>> data will be corrupted, which will be fixed in later patches.
>>> - patch 4 fix that setp 1) is broken that sync_thread is register and
>>> MD_RECOVERY_RUNNING is set directly;
>>>
>>> With patch 1-4, the above test won't hang anymore, however, the test
>>> will still fail and complain that ext4 is corrupted;
>>
>> For patch3, as I mentioned today, the root cause is
>> dm-raid->rs_start_reshape sets MD_RECOVERY_WAIT. So md_do_sync returns
>> when MD_RECOVERY_WAIT is set. It's the reason why dm-raid can't stop
>> sync thread when start a new reshape. . The way in patch3 looks like a
>> workaround. We need to figure out if dm raid really needs to set
>> MD_RECOVERY_WAIT. Because now we stop sync thread in an asynchronous
>> way. So the deadlock problem which was fixed in 644e2537f (dm raid:
>> fix stripe adding reshape deadlock) may disappear. Maybe we can revert
>> the patch.

In fact, the MD_RECOVERY_WAIT flag looks to me like a workaround to
prevent a new sync thread from starting. I actually freeze the
sync_thread during suspend, and prevent the user from unfreezing it from
raid_message() in patch 6. I think this way is better, and
MD_RECOVERY_WAIT can probably be removed.

>
> After talking with Heinz, he mentioned dmraid needs this bit to avoid
> md sync thread to start during reshape. So patch3 looks good.
>
> Best Regards
> Xiao


2024-01-25 01:53:17

by Song Liu

Subject: Re: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

On Wed, Jan 24, 2024 at 5:09 PM Yu Kuai <[email protected]> wrote:
>
[...]
> >>
> >> shell/integrity-caching.sh
> >> shell/lvconvert-raid-reshape.sh
> >
> > shell/lvconvert-raid-reshape-linear_to_raid6-single-type.sh is failing
> > with upstream + this set. (I need to fix some trivial compilation errors,
> > which are probably last minute typos).
>
> I'm running test for this patchset overnight in my vm, and this test has
> been ran for 9 times and all passed. Looks like I can't reporduce this
> in my vm.

Hmm.. maybe it is caused by different .config? Attached is the
output from the test and the config file I used.

Thanks,
Song


Attachments:
config (125.73 kB)
runner.out (816.41 kB)

2024-01-25 02:37:38

by Yu Kuai

Subject: Re: [PATCH v2 00/11] dm-raid: fix v6.7 regressions

Hi,

On 2024/01/25 9:51, Song Liu wrote:
> Hmm.. maybe it is caused by different .config? Attached is the
> output from the test and the config file I used.

It looks like your test output differs from mine here:

| [ 0:16.663] aux wait_for_sync $vg $lv
| [ 0:17.002] #lvconvert-raid-reshape-linear_to_raid6-single-type.sh:67+
aux wait_for_sync LVMTEST2056vg LV
| [ 0:17.002] LVMTEST2056vg/LV (raid5_ls) in-sync, but 'a' characters in
health status
| [ 0:17.200] LVMTEST2056vg/LV (raid5_ls) in-sync, but 'a' characters in
health status

Mine:
[ 0:08.875] #lvconvert-raid-reshape-linear_to_raid6-single-type.sh:67+
aux wait_for_sync LVMTEST2826329vg LV
[ 0:08.876] LVMTEST2826329vg/LV (raid5_ls) is in-sync 0 98304 raid
raid5_ls 4 AAAA 32768/32768 idle 0 0 -
[ 0:08.940] fsck -fn "$DM_DEV_DIR/$vg/$lv"
[ 0:08.943] #lvconvert-raid-reshape-linear_to_raid6-single-type.sh:68+
fsck -fn /mnt/test/LVMTEST2826329.RgRdke29n8/dev/LVMTEST2826329vg/LV

However, we both enabled the dm- and raid-related config options; I
have no clue yet. :(

Thanks,
Kuai