From: Yu Kuai <[email protected]>
Changes in v4:
- add patch 10 to fix a raid456 deadlock (for both md/raid and dm-raid);
- add patch 13 to wait for inflight IO completion while removing dm
device;
Changes in v3:
- fix a problem in patch 5;
- add patch 12;
Changes in v2:
- replace revert changes for dm-raid with real fixes;
- fix a dm-raid5 deadlock that has existed for a long time; this deadlock
 is triggered because another problem was fixed in raid5, and instead of
 deadlocking, users would read wrong data before v6.7; patches 9-11;
First regression, related to stopping the sync thread:
The lifetime of the sync_thread is designed as follows (a condensed C
sketch of the flag handshake follows the list):
1) Decide to start a sync_thread, set MD_RECOVERY_NEEDED, and wake up
the daemon thread;
2) The daemon thread detects that MD_RECOVERY_NEEDED is set, then sets
MD_RECOVERY_RUNNING and registers the sync_thread;
3) Execute md_do_sync() for the actual work; when it's done or
interrupted, it sets MD_RECOVERY_DONE and wakes up the daemon thread;
4) The daemon thread detects that MD_RECOVERY_DONE is set, then clears
MD_RECOVERY_RUNNING and unregisters the sync_thread;
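For reference, a condensed C sketch of that handshake; the fragments are
simplified and not verbatim, the real logic lives in md_check_recovery(),
md_start_sync() and md_do_sync():

    /* Step 1: request a sync and kick the daemon thread. */
    set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
    md_wakeup_thread(mddev->thread);

    /* Step 2, daemon thread: claim the slot, register the worker. */
    if (test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)) {
            set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
            rcu_assign_pointer(mddev->sync_thread,
                    md_register_thread(md_do_sync, mddev, "resync"));
    }

    /* Step 3, sync_thread: do the work, then signal completion. */
    set_bit(MD_RECOVERY_DONE, &mddev->recovery);
    md_wakeup_thread(mddev->thread);

    /* Step 4, daemon thread: reap the worker, release the slot. */
    if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
            md_reap_sync_thread(mddev); /* clears MD_RECOVERY_RUNNING */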
In v6.7, we fixed md/raid to follow this design by commit f52f5c71f3d4
("md: fix stopping sync thread"); however, dm-raid was not considered at
that time, and the following tests hang:
shell/integrity-caching.sh
shell/lvconvert-raid-reshape.sh
This patch set fixes the broken tests with patches 1-4:
- patch 1 fixes that step 4) is broken by a suspended array;
- patch 2 fixes that step 4) is broken by a read-only array;
- patch 3 fixes that step 3) is broken because md_do_sync() doesn't set
MD_RECOVERY_DONE; note that this patch introduces a new problem that
data will be corrupted, which will be fixed in later patches;
- patch 4 fixes that step 1) is broken because the sync_thread is
registered and MD_RECOVERY_RUNNING is set directly; this is md/raid
behaviour, not related to dm-raid;
With patches 1-4, the above tests won't hang anymore; however, they will
still fail and complain that ext4 is corrupted.
Second regression, related to the frozen sync thread:
Note that for raid456, if reshape is interrupted, then calling
"pers->start_reshape" will corrupt data. dm-raid relies on md_do_sync()
not setting MD_RECOVERY_DONE so that a new sync_thread won't be
registered, and patch 3 breaks exactly that.
- Patches 5-6 fix this problem by interrupting reshape and freezing the
  sync_thread in dm_suspend(), then unfreezing and continuing reshape in
  dm_resume(). It's verified that the dm-raid tests no longer complain
  that ext4 is corrupted.
- Patch 7 fixes the problem that raid_message() calls
  md_reap_sync_thread() directly, without holding 'reconfig_mutex'.
Last regression, related to dm-raid456 IO concurrent with reshape:
For raid456, if reshape is still in progress, then IO across the reshape
position will wait for reshape to make progress. However, for dm-raid,
in the following cases reshape will never make progress, hence IO will hang:
1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;
After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fixed the problem that IO across
the reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh started to hang at raid5_make_request().
For md/raid, the problem doesn't exist because:
1) If the array is read-only, it can be switched to read-write via
ioctl/sysfs;
2) md/raid never sets MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex' anymore, so the flag can be cleared and reshape can
continue via the sysfs api 'sync_action'.
However, I'm not sure yet how to avoid the problem in dm-raid.
- patches 9-11 fix this problem by detecting the above 3 cases in
  dm_suspend(), and failing those IO directly.
If users really hit this IO error, it means they were reading wrong data
before c467e97f079f. And it's safe to read/write the array after reshape
makes progress successfully.
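For context, a condensed sketch (not verbatim; see the raid5.c hunk in
patch 11) of the wait in make_stripe_request() where the test hangs:

    if (unlikely(conf->reshape_progress != MaxSector)) {
            /* IO across the reshape position must wait for reshape. */
            if (ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) {
                    spin_unlock_irq(&conf->device_lock);
                    return STRIPE_SCHEDULE_AND_RETRY;
            }
    }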
There are also some other minor changes: patches 8 and 12.
Test result:
I applied this patchset on top of v6.8-rc1, and ran the lvm2 test suite
with the following cmd for 24 rounds (about 2 days):
for t in `ls test/shell`; do
if cat test/shell/$t | grep raid &> /dev/null; then
make check T=shell/$t
fi
done
failed count failed test
1 ### failed: [ndev-vanilla] shell/dmsecuretest.sh
1 ### failed: [ndev-vanilla] shell/dmsetup-integrity-keys.sh
1 ### failed: [ndev-vanilla] shell/dmsetup-keyring.sh
5 ### failed: [ndev-vanilla] shell/duplicate-pvs-md0.sh
1 ### failed: [ndev-vanilla] shell/duplicate-vgid.sh
2 ### failed: [ndev-vanilla] shell/duplicate-vgnames.sh
1 ### failed: [ndev-vanilla] shell/fsadm-crypt.sh
1 ### failed: [ndev-vanilla] shell/integrity.sh
6 ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
2 ### failed: [ndev-vanilla] shell/lvchange-rebuild-raid.sh
5 ### failed: [ndev-vanilla] shell/lvconvert-raid-reshape-stripes-load-reload.sh
4 ### failed: [ndev-vanilla] shell/lvconvert-raid-restripe-linear.sh
1 ### failed: [ndev-vanilla] shell/lvconvert-raid1-split-trackchanges.sh
20 ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
20 ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
24 ### failed: [ndev-vanilla] shell/lvextend-raid.sh
And I randomly picked some tests and verified by hand that they fail in
v6.6 as well (not all tests; I don't have the time to do this for all of
them yet):
shell/lvextend-raid.sh
shell/lvcreate-large-raid.sh
shell/lvconvert-repair-raid.sh
shell/lvchange-rebuild-raid.sh
shell/lvchange-raid1-writemostly.sh
Yu Kuai (14):
md: don't ignore suspended array in md_check_recovery()
md: don't ignore read-only array in md_check_recovery()
md: make sure md_do_sync() will set MD_RECOVERY_DONE
md: don't register sync_thread for reshape directly
md: export helpers to stop sync_thread
dm-raid: really frozen sync_thread during suspend
md/dm-raid: don't call md_reap_sync_thread() directly
dm-raid: add a new helper prepare_suspend() in md_personality
md: export helper md_is_rdwr()
md: don't suspend the array for interrupted reshape
md/raid456: fix a deadlock for dm-raid456 while io concurrent with
reshape
dm-raid: fix lockdep warning in "pers->hot_add_disk"
dm: wait for IO completion before removing dm device
dm-raid: remove mddev_suspend/resume()
drivers/md/dm-raid.c | 78 +++++++++++++++++++---------
drivers/md/dm.c | 3 ++
drivers/md/md.c | 120 +++++++++++++++++++++++++++++--------------
drivers/md/md.h | 16 ++++++
drivers/md/raid10.c | 16 +-----
drivers/md/raid5.c | 61 ++++++++++++----------
6 files changed, 190 insertions(+), 104 deletions(-)
--
2.39.2
From: Yu Kuai <[email protected]>
mddev_suspend() never stops the sync_thread, hence it doesn't make sense
to ignore a suspended array in md_check_recovery(), which might cause the
sync_thread to never be unregistered.
After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following
hang can be triggered by the test shell/integrity-caching.sh:
1) suspend the array:
raid_postsuspend
mddev_suspend
2) stop the array:
raid_dtr
md_stop
__md_stop_writes
stop_sync_thread
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_wakeup_thread_directly(mddev->sync_thread);
wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
3) sync thread done:
md_do_sync
set_bit(MD_RECOVERY_DONE, &mddev->recovery);
md_wakeup_thread(mddev->thread);
4) daemon thread can't unregister sync thread:
md_check_recovery
if (mddev->suspended)
return; -> return directly
md_reap_sync_thread
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hangs;
This problem is not just related to dm-raid; fix it by no longer ignoring
a suspended array in md_check_recovery(). Follow-up patches will improve
dm-raid to properly freeze the sync thread during suspend.
Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: 68866e425be2 ("MD: no sync IO while suspended")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2266358d8074..07b80278eaa5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9469,9 +9469,6 @@ static void md_start_sync(struct work_struct *ws)
*/
void md_check_recovery(struct mddev *mddev)
{
- if (READ_ONCE(mddev->suspended))
- return;
-
if (mddev->bitmap)
md_bitmap_daemon_work(mddev);
--
2.39.2
From: Yu Kuai <[email protected]>
Usually, if the array is not read-write, md_check_recovery() won't
register a new sync_thread in the first place. And if the array is
read-write and a sync_thread is registered, md_set_readonly() will
unregister the sync_thread before setting the array read-only. md/raid
follows this behavior, hence there is no problem.
After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following
hang can be triggered by the test shell/integrity-caching.sh:
1) the array is read-only. dm-raid updates the super block:
rs_update_sbs
ro = mddev->ro
mddev->ro = 0
-> set array read-write
md_update_sb
2) a new sync thread is registered concurrently.
3) dm-raid sets the array back to read-only:
rs_update_sbs
mddev->ro = ro
4) stop the array:
raid_dtr
md_stop
stop_sync_thread
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_wakeup_thread_directly(mddev->sync_thread);
wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
5) sync thread done:
md_do_sync
set_bit(MD_RECOVERY_DONE, &mddev->recovery);
md_wakeup_thread(mddev->thread);
6) daemon thread can't unregister sync thread:
md_check_recovery
if (!md_is_rdwr(mddev) &&
!test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
return;
-> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hangs;
The root cause is that dm-raid manipulates 'mddev->ro' by itself;
dm-raid really should stop the sync thread before setting the array
read-only. Unfortunately, I need to read more code before I can refactor
the handling of 'mddev->ro' in dm-raid, hence let's fix the problem the
easy way for now to prevent a dm-raid regression.
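For reference, a condensed sketch of the racy pattern in dm-raid's
rs_update_sbs() from the trace above (simplified, not verbatim; the
'force' argument passed to md_update_sb() is assumed):

    int ro = mddev->ro;

    mddev->ro = 0;          /* temporarily read-write */
    md_update_sb(mddev, 1); /* meanwhile a new sync_thread can register */
    mddev->ro = ro;         /* back to read-only, sync_thread still runs */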
Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: ecbfb9f118bc ("dm raid: add raid level takeover support")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 31 ++++++++++++++++++-------------
1 file changed, 18 insertions(+), 13 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 07b80278eaa5..6906d023f1d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9445,6 +9445,20 @@ static void md_start_sync(struct work_struct *ws)
sysfs_notify_dirent_safe(mddev->sysfs_action);
}
+static void unregister_sync_thread(struct mddev *mddev)
+{
+ if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+ /* resync/recovery still happening */
+ clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ return;
+ }
+
+ if (WARN_ON_ONCE(!mddev->sync_thread))
+ return;
+
+ md_reap_sync_thread(mddev);
+}
+
/*
* This routine is regularly called by all per-raid-array threads to
* deal with generic issues like resync and super-block update.
@@ -9482,7 +9496,8 @@ void md_check_recovery(struct mddev *mddev)
}
if (!md_is_rdwr(mddev) &&
- !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
+ !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) &&
+ !test_bit(MD_RECOVERY_DONE, &mddev->recovery))
return;
if ( ! (
(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
@@ -9504,8 +9519,7 @@ void md_check_recovery(struct mddev *mddev)
struct md_rdev *rdev;
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
- /* sync_work already queued. */
- clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ unregister_sync_thread(mddev);
goto unlock;
}
@@ -9568,16 +9582,7 @@ void md_check_recovery(struct mddev *mddev)
* still set.
*/
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
- if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
- /* resync/recovery still happening */
- clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
- goto unlock;
- }
-
- if (WARN_ON_ONCE(!mddev->sync_thread))
- goto unlock;
-
- md_reap_sync_thread(mddev);
+ unregister_sync_thread(mddev);
goto unlock;
}
--
2.39.2
From: Yu Kuai <[email protected]>
stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must
set MD_RECOVERY_DONE, so that the following md_check_recovery() will
unregister the sync_thread, clear MD_RECOVERY_RUNNING and wake up
stop_sync_thread().
If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will
return without setting MD_RECOVERY_DONE. And after commit f52f5c71f3d4
("md: fix stopping sync thread"), dm-raid switched from
md_reap_sync_thread() to stop_sync_thread() to unregister the sync_thread
from md_stop() and md_stop_writes(), causing the test
shell/lvconvert-raid-reshape.sh to hang.
We shouldn't switch back to md_reap_sync_thread() because it's
problematic in the first place. Fix the problem by making sure
md_do_sync() will always set MD_RECOVERY_DONE.
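For reference, a condensed sketch of the handshake stop_sync_thread()
relies on (taken from the trace in patch 1; 'resync_wait' is md's
internal wait queue):

    /* stop_sync_thread(), condensed: */
    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
    md_wakeup_thread_directly(mddev->sync_thread);
    wait_event(resync_wait,
               !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));

The wait_event() only finishes if md_do_sync() sets MD_RECOVERY_DONE, so
that md_check_recovery() can reap the sync_thread and clear
MD_RECOVERY_RUNNING.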
Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: d5d885fd514f ("md: introduce new personality function start()")
Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6906d023f1d6..c65dfd156090 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8788,12 +8788,16 @@ void md_do_sync(struct md_thread *thread)
int ret;
/* just incase thread restarts... */
- if (test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
- test_bit(MD_RECOVERY_WAIT, &mddev->recovery))
+ if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
return;
- if (!md_is_rdwr(mddev)) {/* never try to sync a read-only array */
+
+ if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+ goto skip;
+
+ if (test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
+ !md_is_rdwr(mddev)) {/* never try to sync a read-only array */
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- return;
+ goto skip;
}
if (mddev_is_clustered(mddev)) {
--
2.39.2
From: Yu Kuai <[email protected]>
1) The flag MD_RECOVERY_FROZEN doesn't mean that the sync thread is
frozen; it only prevents a new sync_thread from starting, and it can't
stop a running sync thread.
2) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped;
using it as the condition for md_stop_writes() in raid_postsuspend()
doesn't look correct.
3) raid_message() can set/clear the flag MD_RECOVERY_FROZEN at any time,
and if MD_RECOVERY_FROZEN is cleared while the array is suspended, a
new sync_thread can start unexpectedly.
Fix the above problems by using the new helpers to really freeze the
sync_thread during suspend, and disallow raid_message() from changing the
sync_thread status during suspend.
Note that after commit f52f5c71f3d4 ("md: fix stopping sync thread"), the
test shell/lvconvert-raid-reshape.sh started to hang in stop_sync_thread(),
and with the previous fixes the test won't hang there anymore; however, it
will still fail and complain that ext4 is corrupted. With this patch, the
test no longer hangs due to stop_sync_thread() or fails due to ext4
corruption. However, there is still a deadlock related to dm-raid456 that
will be fixed in the following patches.
Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: 1af2048a3e87 ("dm raid: fix deadlock caused by premature md_stop_writes()")
Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 38 +++++++++++++++++++++++++++++---------
1 file changed, 29 insertions(+), 9 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index eb009d6bb03a..5ce3c6020b1b 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3240,11 +3240,12 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
rs->md.ro = 1;
rs->md.in_sync = 1;
- /* Keep array frozen until resume. */
- set_bit(MD_RECOVERY_FROZEN, &rs->md.recovery);
-
/* Has to be held on running the array */
mddev_suspend_and_lock_nointr(&rs->md);
+
+ /* Keep array frozen until resume. */
+ md_frozen_sync_thread(&rs->md);
+
r = md_run(&rs->md);
rs->md.in_sync = 0; /* Assume already marked dirty */
if (r) {
@@ -3722,6 +3723,9 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
if (!mddev->pers || !mddev->pers->sync_request)
return -EINVAL;
+ if (test_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
+ return -EBUSY;
+
if (!strcasecmp(argv[0], "frozen"))
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
else
@@ -3791,15 +3795,31 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
blk_limits_io_opt(limits, chunk_size_bytes * mddev_data_stripes(rs));
}
+static void raid_presuspend(struct dm_target *ti)
+{
+ struct raid_set *rs = ti->private;
+
+ mddev_lock_nointr(&rs->md);
+ md_frozen_sync_thread(&rs->md);
+ mddev_unlock(&rs->md);
+}
+
+static void raid_presuspend_undo(struct dm_target *ti)
+{
+ struct raid_set *rs = ti->private;
+
+ mddev_lock_nointr(&rs->md);
+ md_unfrozen_sync_thread(&rs->md);
+ mddev_unlock(&rs->md);
+}
+
static void raid_postsuspend(struct dm_target *ti)
{
struct raid_set *rs = ti->private;
if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
/* Writes have to be stopped before suspending to avoid deadlocks. */
- if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery))
- md_stop_writes(&rs->md);
-
+ md_stop_writes(&rs->md);
mddev_suspend(&rs->md, false);
}
}
@@ -4012,8 +4032,6 @@ static int raid_preresume(struct dm_target *ti)
}
/* Check for any resize/reshape on @rs and adjust/initiate */
- /* Be prepared for mddev_resume() in raid_resume() */
- set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (mddev->recovery_cp && mddev->recovery_cp < MaxSector) {
set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
mddev->resync_min = mddev->recovery_cp;
@@ -4056,9 +4074,9 @@ static void raid_resume(struct dm_target *ti)
rs_set_capacity(rs);
mddev_lock_nointr(mddev);
- clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
mddev->ro = 0;
mddev->in_sync = 0;
+ md_unfrozen_sync_thread(mddev);
mddev_unlock_and_resume(mddev);
}
}
@@ -4074,6 +4092,8 @@ static struct target_type raid_target = {
.message = raid_message,
.iterate_devices = raid_iterate_devices,
.io_hints = raid_io_hints,
+ .presuspend = raid_presuspend,
+ .presuspend_undo = raid_presuspend_undo,
.postsuspend = raid_postsuspend,
.preresume = raid_preresume,
.resume = raid_resume,
--
2.39.2
From: Yu Kuai <[email protected]>
Currently md_reap_sync_thread() is called from raid_message() directly
without holding 'reconfig_mutex'. This is definitely unsafe because
md_reap_sync_thread() can change many fields that are protected by
'reconfig_mutex'.
However, holding 'reconfig_mutex' here is still problematic because it
can cause deadlock; see for example commit 130443d60b1b ("md: refactor
idle/frozen_sync_thread() to fix deadlock").
Fix this problem by using stop_sync_thread() to unregister the
sync_thread, like md/raid does.
Fixes: be83651f0050 ("DM RAID: Add message/status support for changing sync action")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5ce3c6020b1b..6b6c011d9f69 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3719,6 +3719,7 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
{
struct raid_set *rs = ti->private;
struct mddev *mddev = &rs->md;
+ int ret = 0;
if (!mddev->pers || !mddev->pers->sync_request)
return -EINVAL;
@@ -3726,17 +3727,24 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
if (test_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
return -EBUSY;
- if (!strcasecmp(argv[0], "frozen"))
- set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
- else
- clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ if (!strcasecmp(argv[0], "frozen")) {
+ ret = mddev_lock(mddev);
+ if (ret)
+ return ret;
- if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
- if (mddev->sync_thread) {
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev);
- }
- } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
+ md_frozen_sync_thread(mddev);
+ mddev_unlock(mddev);
+ } else if (!strcasecmp(argv[0], "idle")) {
+ ret = mddev_lock(mddev);
+ if (ret)
+ return ret;
+
+ md_idle_sync_thread(mddev);
+ mddev_unlock(mddev);
+ }
+
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
return -EBUSY;
else if (!strcasecmp(argv[0], "resync"))
; /* MD_RECOVERY_NEEDED set below */
--
2.39.2
From: Yu Kuai <[email protected]>
There are no functional changes for now; prepare to fix a deadlock for
dm-raid456.
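For context, this is how raid456 will implement the new hook in a later
patch of this series (patch 11); shown here only as a usage sketch:

    static void raid5_prepare_suspend(struct mddev *mddev)
    {
            struct r5conf *conf = mddev->private;

            /* Reshape can't make progress anymore; wake up the waiters. */
            wake_up(&conf->wait_for_overlap);
    }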
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 10 +++++++---
drivers/md/md.h | 1 +
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 6b6c011d9f69..16d3348abfed 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3806,10 +3806,14 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
static void raid_presuspend(struct dm_target *ti)
{
struct raid_set *rs = ti->private;
+ struct mddev *mddev = &rs->md;
- mddev_lock_nointr(&rs->md);
- md_frozen_sync_thread(&rs->md);
- mddev_unlock(&rs->md);
+ mddev_lock_nointr(mddev);
+ md_frozen_sync_thread(mddev);
+ mddev_unlock(mddev);
+
+ if (mddev->pers && mddev->pers->prepare_suspend)
+ mddev->pers->prepare_suspend(mddev);
}
static void raid_presuspend_undo(struct dm_target *ti)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 437ab70ce79b..29b476ff3b9f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -617,6 +617,7 @@ struct md_personality
int (*start_reshape) (struct mddev *mddev);
void (*finish_reshape) (struct mddev *mddev);
void (*update_reshape_pos) (struct mddev *mddev);
+ void (*prepare_suspend) (struct mddev *mddev);
/* quiesce suspends or resumes internal processing.
* 1 - stop new actions and wait for action io to complete
* 0 - return to normal behaviour
--
2.39.2
From: Yu Kuai <[email protected]>
The new helpers will be used in dm-raid in later patches to fix
regressions and to prevent calling md_reap_sync_thread() directly.
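A minimal usage sketch of the pattern dm-raid adopts in later patches
(all three helpers assert that 'reconfig_mutex' is held):

    /* Freeze: stop a running sync_thread and keep it stopped. */
    mddev_lock_nointr(mddev);
    md_frozen_sync_thread(mddev);
    mddev_unlock(mddev);

    /* ... the array is quiesced here ... */

    /* Unfreeze: let md_check_recovery() restart the sync_thread. */
    mddev_lock_nointr(mddev);
    md_unfrozen_sync_thread(mddev);
    mddev_unlock(mddev);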
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 33 +++++++++++++++++++++++++++++++--
drivers/md/md.h | 3 +++
2 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6c5d0a372927..6a76bd27e381 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4915,16 +4915,45 @@ static void stop_sync_thread(struct mddev *mddev, bool locked, bool check_seq)
mddev_lock_nointr(mddev);
}
+void md_idle_sync_thread(struct mddev *mddev)
+{
+ lockdep_assert_held(&mddev->reconfig_mutex);
+
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ stop_sync_thread(mddev, true, true);
+}
+EXPORT_SYMBOL_GPL(md_idle_sync_thread);
+
+void md_frozen_sync_thread(struct mddev *mddev)
+{
+ lockdep_assert_held(&mddev->reconfig_mutex);
+
+ set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ stop_sync_thread(mddev, true, false);
+}
+EXPORT_SYMBOL_GPL(md_frozen_sync_thread);
+
+void md_unfrozen_sync_thread(struct mddev *mddev)
+{
+ lockdep_assert_held(&mddev->reconfig_mutex);
+
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ md_wakeup_thread(mddev->thread);
+ sysfs_notify_dirent_safe(mddev->sysfs_action);
+}
+EXPORT_SYMBOL_GPL(md_unfrozen_sync_thread);
+
static void idle_sync_thread(struct mddev *mddev)
{
mutex_lock(&mddev->sync_mutex);
- clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (mddev_lock(mddev)) {
mutex_unlock(&mddev->sync_mutex);
return;
}
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev, false, true);
mutex_unlock(&mddev->sync_mutex);
}
@@ -4932,13 +4961,13 @@ static void idle_sync_thread(struct mddev *mddev)
static void frozen_sync_thread(struct mddev *mddev)
{
mutex_lock(&mddev->sync_mutex);
- set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (mddev_lock(mddev)) {
mutex_unlock(&mddev->sync_mutex);
return;
}
+ set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev, false, false);
mutex_unlock(&mddev->sync_mutex);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 8d881cc59799..437ab70ce79b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -781,6 +781,9 @@ extern void md_rdev_clear(struct md_rdev *rdev);
extern void md_handle_request(struct mddev *mddev, struct bio *bio);
extern int mddev_suspend(struct mddev *mddev, bool interruptible);
extern void mddev_resume(struct mddev *mddev);
+extern void md_idle_sync_thread(struct mddev *mddev);
+extern void md_frozen_sync_thread(struct mddev *mddev);
+extern void md_unfrozen_sync_thread(struct mddev *mddev);
extern void md_reload_sb(struct mddev *mddev, int raid_disk);
extern void md_update_sb(struct mddev *mddev, int force);
--
2.39.2
From: Yu Kuai <[email protected]>
There are no functional changes for now; prepare to fix a deadlock for
dm-raid456.
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 12 ------------
drivers/md/md.h | 12 ++++++++++++
2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6a76bd27e381..b942cc126ce0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -99,18 +99,6 @@ static void mddev_detach(struct mddev *mddev);
static void export_rdev(struct md_rdev *rdev, struct mddev *mddev);
static void md_wakeup_thread_directly(struct md_thread __rcu *thread);
-enum md_ro_state {
- MD_RDWR,
- MD_RDONLY,
- MD_AUTO_READ,
- MD_MAX_STATE
-};
-
-static bool md_is_rdwr(struct mddev *mddev)
-{
- return (mddev->ro == MD_RDWR);
-}
-
/*
* Default number of read corrections we'll attempt on an rdev
* before ejecting it from the array. We divide the read error
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 29b476ff3b9f..98da86d38ba8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -558,6 +558,18 @@ enum recovery_flags {
MD_RESYNCING_REMOTE, /* remote node is running resync thread */
};
+enum md_ro_state {
+ MD_RDWR,
+ MD_RDONLY,
+ MD_AUTO_READ,
+ MD_MAX_STATE
+};
+
+static bool md_is_rdwr(struct mddev *mddev)
+{
+ return (mddev->ro == MD_RDWR);
+}
+
static inline int __must_check mddev_lock(struct mddev *mddev)
{
return mutex_lock_interruptible(&mddev->reconfig_mutex);
--
2.39.2
From: Yu Kuai <[email protected]>
The lockdep assert was added to print_conf() by commit a448af25becf
("md/raid10: remove rcu protection to access rdev from conf"), and I
didn't notice that dm-raid is calling "pers->hot_add_disk" without
holding 'reconfig_mutex'.
"pers->hot_add_disk" reads and writes many fields that are protected by
'reconfig_mutex', and raid_resume() already grabs the lock in another
context. Hence fix this problem by protecting "pers->hot_add_disk" with
the lock.
Fixes: 9092c02d9435 ("DM RAID: Add ability to restore transiently failed devices on resume")
Fixes: a448af25becf ("md/raid10: remove rcu protection to access rdev from conf")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 16d3348abfed..5f78cc19d6f3 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -4077,7 +4077,9 @@ static void raid_resume(struct dm_target *ti)
* Take this opportunity to check whether any failed
* devices are reachable again.
*/
+ mddev_lock_nointr(mddev);
attempt_restore_of_faulty_devices(rs);
+ mddev_unlock(mddev);
}
if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
--
2.39.2
From: Yu Kuai <[email protected]>
__dm_destroy() guarantees that the number of device openers is zero, and
only then calls 'presuspend' and 'postsuspend' for the target. For
request-based dm, 'md->holders' will be grabbed for each rq, and
__dm_destroy() will wait for 'md->holders' to reach zero. However, for
bio-based devices, __dm_destroy() doesn't wait for all bios to be done.
Fix this problem by calling dm_wait_for_completion() to wait for all
inflight IO to be done, like dm_suspend() does.
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8dcabf84d866..2c0eae67d0f1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -58,6 +58,7 @@ static DEFINE_IDR(_minor_idr);
static DEFINE_SPINLOCK(_minor_lock);
static void do_deferred_remove(struct work_struct *w);
+static int dm_wait_for_completion(struct mapped_device *md, unsigned int task_state);
static DECLARE_WORK(deferred_remove_work, do_deferred_remove);
@@ -2495,6 +2496,8 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
if (!dm_suspended_md(md)) {
dm_table_presuspend_targets(map);
set_bit(DMF_SUSPENDED, &md->flags);
+ if (wait)
+ dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
set_bit(DMF_POST_SUSPENDING, &md->flags);
dm_table_postsuspend_targets(map);
}
--
2.39.2
From: Yu Kuai <[email protected]>
Currently, if reshape is interrupted, then reassembling the array will
register the sync_thread directly from pers->run(). In this case
'MD_RECOVERY_RUNNING' is set directly, but there is no guarantee that
md_do_sync() will be executed, hence stop_sync_thread() will hang because
'MD_RECOVERY_RUNNING' can't be cleared.
The last patch made sure that md_do_sync() will set MD_RECOVERY_DONE;
however, the following hang can still be triggered occasionally by the
dm-raid test shell/lvconvert-raid-reshape.sh:
[root@fedora ~]# cat /proc/1982/stack
[<0>] stop_sync_thread+0x1ab/0x270 [md_mod]
[<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod]
[<0>] raid_presuspend+0x1e/0x70 [dm_raid]
[<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod]
[<0>] __dm_destroy+0x2a5/0x310 [dm_mod]
[<0>] dm_destroy+0x16/0x30 [dm_mod]
[<0>] dev_remove+0x165/0x290 [dm_mod]
[<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod]
[<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod]
[<0>] vfs_ioctl+0x21/0x60
[<0>] __x64_sys_ioctl+0xb9/0xe0
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74
Meanwhile mddev->recovery is:
MD_RECOVERY_RUNNING |
MD_RECOVERY_INTR |
MD_RECOVERY_RESHAPE |
MD_RECOVERY_FROZEN
Fix this problem by removing the code that registers the sync_thread
directly from raid10 and raid5, and letting md_check_recovery() register
the sync_thread instead.
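The resulting flow, sketched (condensed from the hunks below; the
personality paths now only set flags, and registration moves to the
daemon thread):

    /* raid5_run()/raid10_run()/start_reshape(): only request it. */
    set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
    set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);

    /* md_start_sync(), on the daemon thread: */
    name = test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ?
            "reshape" : "resync";
    rcu_assign_pointer(mddev->sync_thread,
                       md_register_thread(md_do_sync, mddev, name));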
Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 5 ++++-
drivers/md/raid10.c | 16 ++--------------
drivers/md/raid5.c | 29 ++---------------------------
3 files changed, 8 insertions(+), 42 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c65dfd156090..6c5d0a372927 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9372,6 +9372,7 @@ static void md_start_sync(struct work_struct *ws)
struct mddev *mddev = container_of(ws, struct mddev, sync_work);
int spares = 0;
bool suspend = false;
+ char *name;
if (md_spares_need_change(mddev))
suspend = true;
@@ -9404,8 +9405,10 @@ static void md_start_sync(struct work_struct *ws)
if (spares)
md_bitmap_write_all(mddev->bitmap);
+ name = test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ?
+ "reshape" : "resync";
rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "resync"));
+ md_register_thread(md_do_sync, mddev, name));
if (!mddev->sync_thread) {
pr_warn("%s: could not start resync thread...\n",
mdname(mddev));
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7412066ea22c..a5f8419e2df1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4175,11 +4175,7 @@ static int raid10_run(struct mddev *mddev)
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread)
- goto out_free_conf;
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
}
return 0;
@@ -4573,16 +4569,8 @@ static int raid10_start_reshape(struct mddev *mddev)
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread) {
- ret = -EAGAIN;
- goto abort;
- }
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
conf->reshape_checkpoint = jiffies;
- md_wakeup_thread(mddev->sync_thread);
md_new_event();
return 0;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8497880135ee..6a7a32f7fb91 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7936,11 +7936,7 @@ static int raid5_run(struct mddev *mddev)
clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread)
- goto abort;
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
}
/* Ok, everything is just fine now */
@@ -8506,29 +8502,8 @@ static int raid5_start_reshape(struct mddev *mddev)
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- rcu_assign_pointer(mddev->sync_thread,
- md_register_thread(md_do_sync, mddev, "reshape"));
- if (!mddev->sync_thread) {
- mddev->recovery = 0;
- spin_lock_irq(&conf->device_lock);
- write_seqcount_begin(&conf->gen_lock);
- mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
- mddev->new_chunk_sectors =
- conf->chunk_sectors = conf->prev_chunk_sectors;
- mddev->new_layout = conf->algorithm = conf->prev_algo;
- rdev_for_each(rdev, mddev)
- rdev->new_data_offset = rdev->data_offset;
- smp_wmb();
- conf->generation --;
- conf->reshape_progress = MaxSector;
- mddev->reshape_position = MaxSector;
- write_seqcount_end(&conf->gen_lock);
- spin_unlock_irq(&conf->device_lock);
- return -EAGAIN;
- }
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
conf->reshape_checkpoint = jiffies;
- md_wakeup_thread(mddev->sync_thread);
md_new_event();
return 0;
}
--
2.39.2
From: Yu Kuai <[email protected]>
md_start_sync() will suspend the array if there are spares that can be
added or removed from conf. However, if reshape is still in progress,
this won't happen at all, or data will be corrupted
(remove_and_add_spares() won't be called from md_choose_sync_action()
for reshape), hence there is no need to suspend the array if reshape is
not done yet.
Meanwhile, there is a potential deadlock for raid456:
1) reshape is interrupted;
2) set one of the disks WantReplacement, and add a new disk to the array;
however, recovery won't start until the reshape is finished;
3) then issue an IO across the reshape position; this IO will wait for
reshape to make progress;
4) continue to reshape; then md_start_sync() finds there is a spare disk
that can be added to conf, and mddev_suspend() is called;
Step 4 and step 3 are waiting for each other: deadlock. Note that this
problem was found by code review and has not been reproduced yet.
Fix this problem by not suspending the array for an interrupted reshape;
this is safe because conf won't be changed until reshape is done.
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index b942cc126ce0..093abf3ce27b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9391,12 +9391,17 @@ static void md_start_sync(struct work_struct *ws)
bool suspend = false;
char *name;
- if (md_spares_need_change(mddev))
+ /*
+ * If reshape is still in progress, spares won't be added or removed
+ * from conf until reshape is done.
+ */
+ if (mddev->reshape_position == MaxSector &&
+ md_spares_need_change(mddev)) {
suspend = true;
+ mddev_suspend(mddev, false);
+ }
- suspend ? mddev_suspend_and_lock_nointr(mddev) :
- mddev_lock_nointr(mddev);
-
+ mddev_lock_nointr(mddev);
if (!md_is_rdwr(mddev)) {
/*
* On a read-only array we can:
--
2.39.2
From: Yu Kuai <[email protected]>
dm_suspend() already makes sure that no new IO can be issued and waits
for all dispatched IO to be done. There is no need to call
mddev_suspend() to guarantee that again.
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 8 +++-----
drivers/md/md.c | 11 +++++++++++
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5f78cc19d6f3..ed8c28952b14 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3241,7 +3241,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
rs->md.in_sync = 1;
/* Has to be held on running the array */
- mddev_suspend_and_lock_nointr(&rs->md);
+ mddev_lock_nointr(&rs->md);
/* Keep array frozen until resume. */
md_frozen_sync_thread(&rs->md);
@@ -3829,11 +3829,9 @@ static void raid_postsuspend(struct dm_target *ti)
{
struct raid_set *rs = ti->private;
- if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
+ if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
/* Writes have to be stopped before suspending to avoid deadlocks. */
md_stop_writes(&rs->md);
- mddev_suspend(&rs->md, false);
- }
}
static void attempt_restore_of_faulty_devices(struct raid_set *rs)
@@ -4091,7 +4089,7 @@ static void raid_resume(struct dm_target *ti)
mddev->ro = 0;
mddev->in_sync = 0;
md_unfrozen_sync_thread(mddev);
- mddev_unlock_and_resume(mddev);
+ mddev_unlock(mddev);
}
}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 093abf3ce27b..e3a56a958b47 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -437,6 +437,10 @@ int mddev_suspend(struct mddev *mddev, bool interruptible)
{
int err = 0;
+ /* Array is suspended from dm_suspend() for dm-raid. */
+ if (!mddev->gendisk)
+ return 0;
+
/*
* hold reconfig_mutex to wait for normal io will deadlock, because
* other context can't update super_block, and normal io can rely on
@@ -488,6 +492,13 @@ EXPORT_SYMBOL_GPL(mddev_suspend);
static void __mddev_resume(struct mddev *mddev, bool recovery_needed)
{
+ /*
+ * Array is suspended from dm_suspend() and resumed from dm_resume() for
+ * dm-raid.
+ */
+ if (!mddev->gendisk)
+ return;
+
lockdep_assert_not_held(&mddev->reconfig_mutex);
mutex_lock(&mddev->suspend_mutex);
--
2.39.2
From: Yu Kuai <[email protected]>
For raid456, if reshape is still in progress, then IO across the reshape
position will wait for reshape to make progress. However, for dm-raid, in
the following cases reshape will never make progress, hence IO will hang:
1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set;
After commit c467e97f079f ("md/raid6: use valid sector values to determine
if an I/O should wait on the reshape") fixed the problem that IO across
the reshape position doesn't wait for reshape, the dm-raid test
shell/lvconvert-raid-reshape.sh started to hang:
[root@fedora ~]# cat /proc/979/stack
[<0>] wait_woken+0x7d/0x90
[<0>] raid5_make_request+0x929/0x1d70 [raid456]
[<0>] md_handle_request+0xc2/0x3b0 [md_mod]
[<0>] raid_map+0x2c/0x50 [dm_raid]
[<0>] __map_bio+0x251/0x380 [dm_mod]
[<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
[<0>] __submit_bio+0xc2/0x1c0
[<0>] submit_bio_noacct_nocheck+0x17f/0x450
[<0>] submit_bio_noacct+0x2bc/0x780
[<0>] submit_bio+0x70/0xc0
[<0>] mpage_readahead+0x169/0x1f0
[<0>] blkdev_readahead+0x18/0x30
[<0>] read_pages+0x7c/0x3b0
[<0>] page_cache_ra_unbounded+0x1ab/0x280
[<0>] force_page_cache_ra+0x9e/0x130
[<0>] page_cache_sync_ra+0x3b/0x110
[<0>] filemap_get_pages+0x143/0xa30
[<0>] filemap_read+0xdc/0x4b0
[<0>] blkdev_read_iter+0x75/0x200
[<0>] vfs_read+0x272/0x460
[<0>] ksys_read+0x7a/0x170
[<0>] __x64_sys_read+0x1c/0x30
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74
This is because reshape can't make progress.
For md/raid, the problem doesn't exist because registering a new
sync_thread doesn't rely on the IO to be done any more:
1) If the array is read-only, it can be switched to read-write via
ioctl/sysfs;
2) md/raid never sets MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
'reconfig_mutex', hence the flag can be cleared and reshape can continue
via the sysfs api 'sync_action'.
However, I'm not sure yet how to avoid the problem in dm-raid. This
patch detects the above 3 cases in dm_suspend() and fails those IO
directly. If users really hit this IO error, it means they were reading
wrong data before c467e97f079f. And it's safe to read/write the array
after reshape makes progress successfully.
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.h | 2 +-
drivers/md/raid5.c | 32 +++++++++++++++++++++++++++++++-
2 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 98da86d38ba8..8e81f9e2fb20 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -565,7 +565,7 @@ enum md_ro_state {
MD_MAX_STATE
};
-static bool md_is_rdwr(struct mddev *mddev)
+static inline bool md_is_rdwr(struct mddev *mddev)
{
return (mddev->ro == MD_RDWR);
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6a7a32f7fb91..812d7ec64da5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5915,6 +5915,13 @@ static int add_all_stripe_bios(struct r5conf *conf,
return ret;
}
+static bool reshape_disabled(struct mddev *mddev)
+{
+ return !md_is_rdwr(mddev) ||
+ test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
+ test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+}
+
static enum stripe_result make_stripe_request(struct mddev *mddev,
struct r5conf *conf, struct stripe_request_ctx *ctx,
sector_t logical_sector, struct bio *bi)
@@ -5946,7 +5953,8 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
if (ahead_of_reshape(mddev, logical_sector,
conf->reshape_safe)) {
spin_unlock_irq(&conf->device_lock);
- return STRIPE_SCHEDULE_AND_RETRY;
+ ret = STRIPE_SCHEDULE_AND_RETRY;
+ goto out;
}
}
spin_unlock_irq(&conf->device_lock);
@@ -6025,6 +6033,13 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
out_release:
raid5_release_stripe(sh);
+out:
+ if (ret == STRIPE_SCHEDULE_AND_RETRY && !mddev->gendisk &&
+ reshape_disabled(mddev)) {
+ bi->bi_status = BLK_STS_IOERR;
+ ret = STRIPE_FAIL;
+ pr_err("dm-raid456: io failed across reshape position while reshape can't make progress");
+ }
return ret;
}
@@ -8909,6 +8924,18 @@ static int raid5_start(struct mddev *mddev)
return r5l_start(conf->log);
}
+/*
+ * This is only used for dm-raid456; the caller has already frozen the
+ * sync_thread, hence if reshape is still in progress, IO that is waiting
+ * for reshape can never be done now, so wake up and handle those IO.
+ */
+static void raid5_prepare_suspend(struct mddev *mddev)
+{
+ struct r5conf *conf = mddev->private;
+
+ wake_up(&conf->wait_for_overlap);
+}
+
static struct md_personality raid6_personality =
{
.name = "raid6",
@@ -8932,6 +8959,7 @@ static struct md_personality raid6_personality =
.quiesce = raid5_quiesce,
.takeover = raid6_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
+ .prepare_suspend = raid5_prepare_suspend,
};
static struct md_personality raid5_personality =
{
@@ -8956,6 +8984,7 @@ static struct md_personality raid5_personality =
.quiesce = raid5_quiesce,
.takeover = raid5_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
+ .prepare_suspend = raid5_prepare_suspend,
};
static struct md_personality raid4_personality =
@@ -8981,6 +9010,7 @@ static struct md_personality raid4_personality =
.quiesce = raid5_quiesce,
.takeover = raid4_takeover,
.change_consistency_policy = raid5_change_consistency_policy,
+ .prepare_suspend = raid5_prepare_suspend,
};
static int __init raid5_init(void)
--
2.39.2
On Mon, Jan 29, 2024 at 6:23 PM Yu Kuai <[email protected]> wrote:
>
[...]
>
> Test result:
>
> I apply this patchset on top of v6.8-rc1, and run lvm2 tests suite with
> folling cmd for 24 round(for about 2 days):
>
> for t in `ls test/shell`; do
> if cat test/shell/$t | grep raid &> /dev/null; then
> make check T=shell/$t
> fi
> done
>
> failed count failed test
> 1 ### failed: [ndev-vanilla] shell/dmsecuretest.sh
> 1 ### failed: [ndev-vanilla] shell/dmsetup-integrity-keys.sh
> 1 ### failed: [ndev-vanilla] shell/dmsetup-keyring.sh
> 5 ### failed: [ndev-vanilla] shell/duplicate-pvs-md0.sh
> 1 ### failed: [ndev-vanilla] shell/duplicate-vgid.sh
> 2 ### failed: [ndev-vanilla] shell/duplicate-vgnames.sh
> 1 ### failed: [ndev-vanilla] shell/fsadm-crypt.sh
> 1 ### failed: [ndev-vanilla] shell/integrity.sh
> 6 ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
> 2 ### failed: [ndev-vanilla] shell/lvchange-rebuild-raid.sh
> 5 ### failed: [ndev-vanilla] shell/lvconvert-raid-reshape-stripes-load-reload.sh
> 4 ### failed: [ndev-vanilla] shell/lvconvert-raid-restripe-linear.sh
> 1 ### failed: [ndev-vanilla] shell/lvconvert-raid1-split-trackchanges.sh
> 20 ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> 20 ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> 24 ### failed: [ndev-vanilla] shell/lvextend-raid.sh
>
> And I randomly picked some tests and verified by hand that they fail in
> v6.6 as well (not all tests; I don't have the time to do this for all of
> them yet):
>
> shell/lvextend-raid.sh
> shell/lvcreate-large-raid.sh
> shell/lvconvert-repair-raid.sh
> shell/lvchange-rebuild-raid.sh
> shell/lvchange-raid1-writemostly.sh
Hi Mikulas,
Could you please advise the proper way to run the tests and to interpret
the results? Are these failures on 6.6 expected?
We hope to run lvm2 tests regularly for all md patches. However, Yu Kuai
has spent days on this, and it seems really hard to run it properly once.
Thanks,
Song
On Tue, 30 Jan 2024, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> __dm_destroy() guarantees that the number of device openers is zero, and
> only then calls 'presuspend' and 'postsuspend' for the target. For
> request-based dm, 'md->holders' will be grabbed for each rq, and
> __dm_destroy() will wait for 'md->holders' to reach zero. However, for
> bio-based devices, __dm_destroy() doesn't wait for all bios to be done.
>
> Fix this problem by calling dm_wait_for_completion() to wait for all
> inflight IO to be done, like dm_suspend() does.
If the number of openers is zero, it is guaranteed that there are no bios
in flight. Therefore, we don't have to wait for them.
If there are bios in flight, it is a bug in the code that issues the bios.
You can put WARN_ON(dm_in_flight_bios(md)) there.
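Sketched against the __dm_destroy() hunk from the patch, that suggestion
would look like this (assuming dm_in_flight_bios() is visible at that
point in dm.c):

    if (!dm_suspended_md(md)) {
            dm_table_presuspend_targets(map);
            set_bit(DMF_SUSPENDED, &md->flags);
            /* openers == 0 should imply no bios in flight */
            WARN_ON(dm_in_flight_bios(md));
            set_bit(DMF_POST_SUSPENDING, &md->flags);
            dm_table_postsuspend_targets(map);
    }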
Mikulas
> [...]
Hi,
On 2024/01/30 19:46, Mikulas Patocka wrote:
>
>
> On Tue, 30 Jan 2024, Yu Kuai wrote:
>
>> From: Yu Kuai <[email protected]>
>>
>> __dm_destroy() guarantee that device openers is zero, and then
>> only call 'presuspend' and 'postsuspend' for the target. For
>> request-based dm, 'md->holders' will be grabbed for each rq and
>> __dm_destroy() will wait for 'md->holders' to be zero. However, for
>> bio-based device, __dm_destroy() doesn't wait for all bios to be done.
>>
>> Fix this problem by calling dm_wait_for_completion() to wail for all
>> inflight IO to be done, like what dm_suspend() does.
>
> If the number of openers is zero, it is guaranteed that there are no bios
> in flight. Therefore, we don't have to wait for them.
>
> If there are bios in flight, it is a bug in the code that issues the bios.
> You can put WARN_ON(dm_in_flight_bios(md)) there.
I added this patch because while testing there is a problem that is hard
to reproduce, as I mentioned in the other thread. I'll add BUG_ON() and
see whether I can still reproduce the problem without triggering it.
Thanks,
Kuai
[12504.959682] BUG bio-296 (Not tainted): Object already free
[12504.960239]
-----------------------------------------------------------------------------
[12504.960239]
[12504.961209] Allocated in mempool_alloc+0xe8/0x270 age=30 cpu=1 pid=203288
[12504.961905] kmem_cache_alloc+0x36a/0x3b0
[12504.962324] mempool_alloc+0xe8/0x270
[12504.962712] bio_alloc_bioset+0x3b5/0x920
[12504.963129] bio_alloc_clone+0x3e/0x160
[12504.963533] alloc_io+0x3d/0x1f0
[12504.963876] dm_submit_bio+0x12f/0xa30
[12504.964267] __submit_bio+0x9c/0xe0
[12504.964639] submit_bio_noacct_nocheck+0x25a/0x570
[12504.965136] submit_bio_wait+0xc2/0x160
[12504.965535] blkdev_issue_zeroout+0x19b/0x2e0
[12504.965991] ext4_init_inode_table+0x246/0x560
[12504.966462] ext4_lazyinit_thread+0x750/0xbe0
[12504.966922] kthread+0x1b4/0x1f0
>
> Mikulas
>
>> [...]
On Tue, Jan 30, 2024 at 10:23 AM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> Changes in v4:
> - add patch 10 to fix a raid456 deadlock(for both md/raid and dm-raid);
> - add patch 13 to wait for inflight IO completion while removing dm
> device;
>
> Changes in v3:
> - fix a problem in patch 5;
> - add patch 12;
>
> Changes in v2:
> - replace revert changes for dm-raid with real fixes;
> - fix dm-raid5 deadlock that exist for a long time, this deadlock is
> triggered because another problem is fixed in raid5, and instead of
> deadlock, user will read wrong data before v6.7, patch 9-11;
>
> First regression related to stop sync thread:
>
> The lifetime of sync_thread is designed as following:
>
> 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up
> daemon thread;
> 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set
> MD_RECOVERY_RUNNING and register sync_thread;
> 3) Execute md_do_sync() for the actual work, if it's done or
> interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread;
> 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear
> MD_RECOVERY_RUNNING and unregister sync_thread;
>
> In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4
> ("md: fix stopping sync thread"), however, dm-raid is not considered at
> that time, and following test will hang:
>
> shell/integrity-caching.sh
> shell/lvconvert-raid-reshape.sh
>
> This patch set fix the broken test by patch 1-4;
> - patch 1 fix that step 4) is broken by suspended array;
> - patch 2 fix that step 4) is broken by read-only array;
> - patch 3 fix that step 3) is broken that md_do_sync() doesn't set
> MD_RECOVERY_DONE; Noted that this patch will introdece new problem that
> data will be corrupted, which will be fixed in later patches.
> - patch 4 fix that setp 1) is broken that sync_thread is register and
> MD_RECOVERY_RUNNING is set directly, md/raid behaviour, not related to
> dm-raid;
>
> With patch 1-4, the above test won't hang anymore, however, the test
> will still fail and complain that ext4 is corrupted;
>
> Second regression related to frozen sync thread:
>
> Noted that for raid456, if reshape is interrupted, then call
> "pers->start_reshape" will corrupt data. And dm-raid rely on md_do_sync()
> doesn't set MD_RECOVERY_DONE so that new sync_thread won't be registered,
> and patch 3 just break this.
>
> - Patch 5-6 fix this problem by interrupting reshape and frozen
> sync_thread in dm_suspend(), then unfrozen and continue reshape in
> dm_resume(). It's verified that dm-raid tests won't complain that
> ext4 is corrupted anymore.
> - Patch 7 fix the problem that raid_message() call
> md_reap_sync_thread() directly, without holding 'reconfig_mutex'.
>
> Last regression related to dm-raid456 IO concurrent with reshape:
>
> For raid456, if reshape is still in progress, then IO across reshape
> position will wait for reshape to make progress. However, for dm-raid,
> in following cases reshape will never make progress hence IO will hang:
>
> 1) the array is read-only;
> 2) MD_RECOVERY_WAIT is set;
> 3) MD_RECOVERY_FROZEN is set;
>
> After commit c467e97f079f ("md/raid6: use valid sector values to determine
> if an I/O should wait on the reshape") fix the problem that IO across
> reshape position doesn't wait for reshape, the dm-raid test
> shell/lvconvert-raid-reshape.sh start to hang at raid5_make_request().
>
> For md/raid, the problem doesn't exist because:
>
> 1) If array is read-only, it can switch to read-write by ioctl/sysfs;
> 2) md/raid never set MD_RECOVERY_WAIT;
> 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
> 'reconfig_mutex' anymore, it can be cleared and reshape can continue by
> sysfs api 'sync_action'.
>
> However, I'm not sure yet how to avoid the problem in dm-raid yet.
>
> - patch 9-11 fix this problem by detecting the above 3 cases in
> dm_suspend(), and fail those IO directly.
>
> If user really meet the IO error, then it means they're reading the wrong
> data before c467e97f079f. And it's safe to read/write the array after
> reshape make progress successfully.
>
> There are also some other minor changes: patch 8 and patch 12;
>
> Test result:
>
> I apply this patchset on top of v6.8-rc1, and run lvm2 tests suite with
> folling cmd for 24 round(for about 2 days):
>
> for t in `ls test/shell`; do
> if cat test/shell/$t | grep raid &> /dev/null; then
> make check T=shell/$t
> fi
> done
>
> failed count failed test
> 1 ### failed: [ndev-vanilla] shell/dmsecuretest.sh
> 1 ### failed: [ndev-vanilla] shell/dmsetup-integrity-keys.sh
> 1 ### failed: [ndev-vanilla] shell/dmsetup-keyring.sh
> 5 ### failed: [ndev-vanilla] shell/duplicate-pvs-md0.sh
> 1 ### failed: [ndev-vanilla] shell/duplicate-vgid.sh
> 2 ### failed: [ndev-vanilla] shell/duplicate-vgnames.sh
> 1 ### failed: [ndev-vanilla] shell/fsadm-crypt.sh
> 1 ### failed: [ndev-vanilla] shell/integrity.sh
> 6 ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostlysh
> 2 ### failed: [ndev-vanilla] shell/lvchange-rebuild-raid.sh
> 5 ### failed: [ndev-vanilla] shell/lvconvert-raid-reshape-stripes-load-reload.sh
> 4 ### failed: [ndev-vanilla] shell/lvconvert-raid-restripe-linear.sh
> 1 ### failed: [ndev-vanilla] shell/lvconvert-raid1-split-trackchanges.sh
> 20 ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> 20 ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> 24 ### failed: [ndev-vanilla] shell/lvextend-raid.sh
>
> And I ramdomly pick some tests verified by hand that these test will
> fail in v6.6 as well(not all tests, I don't have the time do do this yet):
>
> shell/lvextend-raid.sh
> shell/lvcreate-large-raid.sh
> shell/lvconvert-repair-raid.sh
> shell/lvchange-rebuild-raid.sh
> shell/lvchange-raid1-writemostly.sh
>
> Yu Kuai (14):
> md: don't ignore suspended array in md_check_recovery()
> md: don't ignore read-only array in md_check_recovery()
> md: make sure md_do_sync() will set MD_RECOVERY_DONE
> md: don't register sync_thread for reshape directly
> md: export helpers to stop sync_thread
> dm-raid: really frozen sync_thread during suspend
> md/dm-raid: don't call md_reap_sync_thread() directly
> dm-raid: add a new helper prepare_suspend() in md_personality
> md: export helper md_is_rdwr()
> md: don't suspend the array for interrupted reshape
> md/raid456: fix a deadlock for dm-raid456 while io concurrent with
> reshape
> dm-raid: fix lockdep warning in "pers->hot_add_disk"
> dm: wait for IO completion before removing dm device
> dm-raid: remove mddev_suspend/resume()
>
> drivers/md/dm-raid.c | 78 +++++++++++++++++++---------
> drivers/md/dm.c | 3 ++
> drivers/md/md.c | 120 +++++++++++++++++++++++++++++--------------
> drivers/md/md.h | 16 ++++++
> drivers/md/raid10.c | 16 +-----
> drivers/md/raid5.c | 61 ++++++++++++----------
> 6 files changed, 190 insertions(+), 104 deletions(-)
>
> --
> 2.39.2
>
Hi all
In my environment, the lvm2 regression test has passed. There are only
three failed cases which also fail in kernel 6.6.
### failed: [ndev-vanilla] shell/lvresize-fs-crypt.sh
### failed: [ndev-vanilla] shell/pvck-dump.sh
### failed: [ndev-vanilla] shell/select-report.sh
### 426 tests: 346 passed, 70 skipped, 0 timed out, 7 warned, 3 failed
in 89:26.073
Best Regards
Xiao
Hi, Xiao Ni!
On 2024/01/31 8:29, Xiao Ni wrote:
> In my environment, the lvm2 regression test has passed. There are only
> three failed cases which also fail in kernel 6.6.
>
> ### failed: [ndev-vanilla] shell/lvresize-fs-crypt.sh
> ### failed: [ndev-vanilla] shell/pvck-dump.sh
> ### failed: [ndev-vanilla] shell/select-report.sh
> ### 426 tests: 346 passed, 70 skipped, 0 timed out, 7 warned, 3 failed
> in 89:26.073
Thanks for the test, this is great news.
Kuai
On Wed, Jan 31, 2024 at 9:25 AM Yu Kuai <[email protected]> wrote:
>
> Hi, Xiao Ni!
>
> On 2024/01/31 8:29, Xiao Ni wrote:
> > In my environment, the lvm2 regression test has passed. There are only
> > three failed cases which also fail in kernel 6.6.
> >
> > ### failed: [ndev-vanilla] shell/lvresize-fs-crypt.sh
> > ### failed: [ndev-vanilla] shell/pvck-dump.sh
> > ### failed: [ndev-vanilla] shell/select-report.sh
> > ### 426 tests: 346 passed, 70 skipped, 0 timed out, 7 warned, 3 failed
> > in 89:26.073
>
> Thanks for the test, this is great news.
>
> Kuai
>
Hi Kuai
Have you run mdadm regression tests based on this patch set?
Regards
Xiao
Hi,
On 2024/01/30 21:05, Yu Kuai wrote:
> Hi,
>
> On 2024/01/30 19:46, Mikulas Patocka wrote:
>>
>>
>> On Tue, 30 Jan 2024, Yu Kuai wrote:
>>
>>> From: Yu Kuai <[email protected]>
>>>
>>> __dm_destroy() guarantees that the number of device openers is zero,
>>> and then only calls 'presuspend' and 'postsuspend' for the target. For
>>> request-based dm, 'md->holders' will be grabbed for each rq and
>>> __dm_destroy() will wait for 'md->holders' to reach zero. However, for
>>> bio-based devices, __dm_destroy() doesn't wait for all bios to be done.
>>>
>>> Fix this problem by calling dm_wait_for_completion() to wait for all
>>> inflight IO to be done, like what dm_suspend() does.
>>
>> If the number of openers is zero, it is guaranteed that there are no bios
>> in flight. Therefore, we don't have to wait for them.
>>
>> If there are bios in flight, it is a bug in the code that issues the
>> bios.
>> You can put WARN_ON(dm_in_flight_bios(md)) there.
>
> I added this patch because, while testing, there is a problem that is
> hard to reproduce, as I mentioned in the other thread. I'll add a BUG_ON()
> and see if I can still reproduce this problem without triggering it.
>
> Thanks,
> Kuai
>
> [12504.959682] BUG bio-296 (Not tainted): Object already free
> [12504.960239]
> -----------------------------------------------------------------------------
>
> [12504.960239]
> [12504.961209] Allocated in mempool_alloc+0xe8/0x270 age=30 cpu=1
> pid=203288
> [12504.961905] kmem_cache_alloc+0x36a/0x3b0
> [12504.962324] mempool_alloc+0xe8/0x270
> [12504.962712] bio_alloc_bioset+0x3b5/0x920
> [12504.963129] bio_alloc_clone+0x3e/0x160
> [12504.963533] alloc_io+0x3d/0x1f0
> [12504.963876] dm_submit_bio+0x12f/0xa30
> [12504.964267] __submit_bio+0x9c/0xe0
> [12504.964639] submit_bio_noacct_nocheck+0x25a/0x570
> [12504.965136] submit_bio_wait+0xc2/0x160
> [12504.965535] blkdev_issue_zeroout+0x19b/0x2e0
> [12504.965991] ext4_init_inode_table+0x246/0x560
> [12504.966462] ext4_lazyinit_thread+0x750/0xbe0
> [12504.966922] kthread+0x1b4/0x1f0
After adding the BUG_ON(), I can still reproduce this BUG. It really
looks like a kernel bug, and I don't think it is related to dm-raid.
Perhaps you guys can take a look?
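
For reference, the check I added looks roughly like this (a sketch of
what I tested, not a posted patch; it reuses the in-flight accounting
helper that dm_wait_for_completion() is built on, which is named
md_in_flight_bios() in dm.c as far as I can tell):

    if (!dm_suspended_md(md)) {
            dm_table_presuspend_targets(map);
            set_bit(DMF_SUSPENDED, &md->flags);
            /* openers are zero here, so no bio should still be in flight */
            BUG_ON(md_in_flight_bios(md));
            set_bit(DMF_POST_SUSPENDING, &md->flags);
            dm_table_postsuspend_targets(map);
    }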
Thanks,
Kuai
>>
>> Mikulas
>>
>>> Signed-off-by: Yu Kuai <[email protected]>
>>> ---
>>> drivers/md/dm.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>> index 8dcabf84d866..2c0eae67d0f1 100644
>>> --- a/drivers/md/dm.c
>>> +++ b/drivers/md/dm.c
>>> @@ -58,6 +58,7 @@ static DEFINE_IDR(_minor_idr);
>>> static DEFINE_SPINLOCK(_minor_lock);
>>> static void do_deferred_remove(struct work_struct *w);
>>> +static int dm_wait_for_completion(struct mapped_device *md, unsigned
>>> int task_state);
>>> static DECLARE_WORK(deferred_remove_work, do_deferred_remove);
>>> @@ -2495,6 +2496,8 @@ static void __dm_destroy(struct mapped_device
>>> *md, bool wait)
>>> if (!dm_suspended_md(md)) {
>>> dm_table_presuspend_targets(map);
>>> set_bit(DMF_SUSPENDED, &md->flags);
>>> + if (wait)
>>> + dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>>> set_bit(DMF_POST_SUSPENDING, &md->flags);
>>> dm_table_postsuspend_targets(map);
>>> }
>>> --
>>> 2.39.2
>>>
>>
>> .
>>
>
> .
>
Hi,
On 2024/01/30 10:18, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> dm_suspend() already makes sure that no new IO can be issued and will
> wait for all dispatched IO to be done. There is no need to call
> mddev_suspend() to ensure that again.
>
I'm about to send the final version (I hope); please let me know if
anyone thinks this patch should not be included.
Thanks,
Kuai
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/dm-raid.c | 8 +++-----
> drivers/md/md.c | 11 +++++++++++
> 2 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 5f78cc19d6f3..ed8c28952b14 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3241,7 +3241,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> rs->md.in_sync = 1;
>
> /* Has to be held on running the array */
> - mddev_suspend_and_lock_nointr(&rs->md);
> + mddev_lock_nointr(&rs->md);
>
> /* Keep array frozen until resume. */
> md_frozen_sync_thread(&rs->md);
> @@ -3829,11 +3829,9 @@ static void raid_postsuspend(struct dm_target *ti)
> {
> struct raid_set *rs = ti->private;
>
> - if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
> + if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
> /* Writes have to be stopped before suspending to avoid deadlocks. */
> md_stop_writes(&rs->md);
> - mddev_suspend(&rs->md, false);
> - }
> }
>
> static void attempt_restore_of_faulty_devices(struct raid_set *rs)
> @@ -4091,7 +4089,7 @@ static void raid_resume(struct dm_target *ti)
> mddev->ro = 0;
> mddev->in_sync = 0;
> md_unfrozen_sync_thread(mddev);
> - mddev_unlock_and_resume(mddev);
> + mddev_unlock(mddev);
> }
> }
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 093abf3ce27b..e3a56a958b47 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -437,6 +437,10 @@ int mddev_suspend(struct mddev *mddev, bool interruptible)
> {
> int err = 0;
>
> + /* Array is suspended from dm_suspend() for dm-raid. */
> + if (!mddev->gendisk)
> + return 0;
> +
> /*
> * hold reconfig_mutex to wait for normal io will deadlock, because
> * other context can't update super_block, and normal io can rely on
> @@ -488,6 +492,13 @@ EXPORT_SYMBOL_GPL(mddev_suspend);
>
> static void __mddev_resume(struct mddev *mddev, bool recovery_needed)
> {
> + /*
> + * Array is suspended from dm_suspend() and resumed from dm_resume() for
> + * dm-raid.
> + */
> + if (!mddev->gendisk)
> + return;
> +
> lockdep_assert_not_held(&mddev->reconfig_mutex);
>
> mutex_lock(&mddev->suspend_mutex);
>
Hi,
On 2024/01/31 9:28, Xiao Ni wrote:
> On Wed, Jan 31, 2024 at 9:25 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi, Xiao Ni!
>>
>> On 2024/01/31 8:29, Xiao Ni wrote:
>>> In my environment, the lvm2 regression test has passed. There are only
>>> three failed cases which also fail in kernel 6.6.
>>>
>>> ### failed: [ndev-vanilla] shell/lvresize-fs-crypt.sh
>>> ### failed: [ndev-vanilla] shell/pvck-dump.sh
>>> ### failed: [ndev-vanilla] shell/select-report.sh
>>> ### 426 tests: 346 passed, 70 skipped, 0 timed out, 7 warned, 3 failed
>>> in 89:26.073
>>
>> Thanks for the test, this is great news.
>>
>> Kuai
>>
>
> Hi Kuai
>
> Have you run mdadm regression tests based on this patch set?
Of course, I'm running it in my VM with loop devices.
Thanks,
Kuai
>
> Regards
> Xiao
>
> .
>
Hi,
On 2024/01/31 10:52, Yu Kuai wrote:
> Hi,
>
> On 2024/01/31 9:28, Xiao Ni wrote:
>> On Wed, Jan 31, 2024 at 9:25 AM Yu Kuai <[email protected]> wrote:
>>>
>>> Hi, Xiao Ni!
>>>
>>> On 2024/01/31 8:29, Xiao Ni wrote:
>>>> In my environment, the lvm2 regression test has passed. There are only
>>>> three failed cases which also fail in kernel 6.6.
>>>>
>>>> ### failed: [ndev-vanilla] shell/lvresize-fs-crypt.sh
>>>> ### failed: [ndev-vanilla] shell/pvck-dump.sh
>>>> ### failed: [ndev-vanilla] shell/select-report.sh
>>>> ### 426 tests: 346 passed, 70 skipped, 0 timed out, 7 warned, 3 failed
>>>> in 89:26.073
>>>
>>> Thanks for the test, this is great news.
>>>
>>> Kuai
>>>
>>
>> Hi Kuai
>>
>> Have you run mdadm regression tests based on this patch set?
BTW, I just made sure there are no new failed tests; however, it looks
like some tests are broken. For example:
04update-metadata:
++ /root/mdadm/mdadm --quiet -CR --assume-clean -e 0.90 /dev/md0 --level
linear -n 4 -c 64 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 --auto=yes
++ rv=1
++ case $* in
++ cat /var/tmp/stderr
mdadm: RUN_ARRAY failed: Invalid argument
04r1update:
++ /root/mdadm/mdadm --quiet -A /dev/md0 -U resync /dev/loop0 /dev/loop1
++ rv=1
++ case $* in
++ cat /var/tmp/stderr
mdadm: --update=resync not understood for 1.x metadata
Thanks,
Kuai
>
> Of course, I'm running it in my VM with loop devices.
>
> Thanks,
> Kuai
>
>>
>> Regards
>> Xiao
>>
>> .
>>
>
> .
>