2023-05-29 13:26:46

by Yu Kuai

Subject: [PATCH -next v2 0/6] md: fix that MD_RECOVERY_RUNNING can be cleared while sync_thread is still running

From: Yu Kuai <[email protected]>

Changes in v2:
- rebase for the latest md-next

Patch 1 reverts the commit because it allows MD_RECOVERY_RUNNING to be
cleared while sync_thread is still running. The deadlock that commit
tried to fix is fixed instead by patches 2-5.

Patch 6 enhances the checking to prevent MD_RECOVERY_RUNNING from being
cleared while sync_thread is still running.

Yu Kuai (6):
Revert "md: unlock mddev before reap sync_thread in action_store"
md: refactor action_store() for 'idle' and 'frozen'
md: add a mutex to synchronize idle and frozen in action_store()
md: refactor idle/frozen_sync_thread() to fix deadlock
md: wake up 'resync_wait' at last in md_reap_sync_thread()
md: enhance checking in md_check_recovery()

drivers/md/dm-raid.c | 1 -
drivers/md/md.c | 124 +++++++++++++++++++++++++++++--------------
drivers/md/md.h | 5 ++
3 files changed, 88 insertions(+), 42 deletions(-)

--
2.39.2



2023-05-29 13:28:02

by Yu Kuai

Subject: [PATCH -next v2 6/6] md: enhance checking in md_check_recovery()

From: Yu Kuai <[email protected]>

For md_check_recovery():

1) if 'MD_RECOVERY_RUNNING' is not set, register a new sync_thread.
2) if 'MD_RECOVERY_RUNNING' is set:
   a) if 'MD_RECOVERY_DONE' is not set, don't do anything and wait for
      md_do_sync() to be done.
   b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. The current
      code expects that sync_thread is not NULL; otherwise a new
      sync_thread will be registered, which will corrupt the array.

Make sure md_check_recovery() won't register a new sync_thread if
'MD_RECOVERY_RUNNING' is still set, and add a new WARN_ON_ONCE() for the
above corruption.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f90226e6ddf8..9da0fc906bbd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9397,16 +9397,24 @@ void md_check_recovery(struct mddev *mddev)
if (mddev->sb_flags)
md_update_sb(mddev, 0);

- if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
- !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
- /* resync/recovery still happening */
- clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
- goto unlock;
- }
- if (mddev->sync_thread) {
+ /*
+ * Never start a new sync thread if MD_RECOVERY_RUNNING is
+ * still set.
+ */
+ if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
+ if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+ /* resync/recovery still happening */
+ clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ goto unlock;
+ }
+
+ if (WARN_ON_ONCE(!mddev->sync_thread))
+ goto unlock;
+
md_reap_sync_thread(mddev);
goto unlock;
}
+
/* Set RUNNING before clearing NEEDED to avoid
* any transients in the value of "sync_action".
*/
--
2.39.2


2023-05-29 13:32:32

by Yu Kuai

Subject: [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

From: Yu Kuai <[email protected]>

Our test found the following deadlock in raid10:

1) Issue a normal write, and such write failed:

   raid10_end_write_request
    set_bit(R10BIO_WriteError, &r10_bio->state)
    one_write_done
     reschedule_retry

   // later from md thread
   raid10d
    handle_write_completed
     list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

   // later from md thread
   raid10d
    if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
     list_move(conf->bio_end_io_list.prev, &tmp)
     r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
     raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

2) Trigger a recovery:

   raid10_sync_request
    raise_barrier

Dependency chain 2: sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

   action_store
    mddev_lock
     md_unregister_thread
      kthread_stop

Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread

4) md thread can't update superblock:

   raid10d
    md_check_recovery
     if (mddev_trylock(mddev))
      md_update_sb

Dependency chain 4: update superblock is waiting for 'reconfig_mutex'

Hence a cyclic dependency exists; in order to fix the problem, we must
break one of the dependencies. Dependencies 1 and 2 can't be broken
because they are fundamental to the design. Dependency 4 might be
breakable if it could be guaranteed that no io can be inflight, however,
this requires a new mechanism which seems complex. Dependency 3 is a
good choice, because idle/frozen only requires the sync thread to
finish, which can already be done asynchronously, and 'reconfig_mutex'
is not needed anymore.
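
The cycle, schematically ('A -> B' means A is waiting for B):

   normal io      -> md_update_sb     // chain 1
   sync thread    -> normal io        // chain 2
   reconfig_mutex -> sync thread      // chain 3, broken by this patch
   md_update_sb   -> reconfig_mutex   // chain 4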

This patch switches 'idle' and 'frozen' to wait for the sync thread to
be done asynchronously, and also adds a sequence counter to record how
many times the sync thread has finished, so that 'idle' won't keep
waiting on a newly started sync thread.

Note that raid456 has a similar deadlock([1]), and it's verified[2] that
this deadlock can be fixed by this patch as well.

[1] https://lore.kernel.org/linux-raid/[email protected]/T/#t
[2] https://lore.kernel.org/linux-raid/[email protected]/
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 23 +++++++++++++++++++----
drivers/md/md.h | 2 ++
2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 63a993b52cd7..7912de0e4d12 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
atomic_set(&mddev->active, 1);
atomic_set(&mddev->openers, 0);
+ atomic_set(&mddev->sync_seq, 0);
spin_lock_init(&mddev->lock);
atomic_set(&mddev->flush_pending, 0);
init_waitqueue_head(&mddev->sb_wait);
@@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
if (work_pending(&mddev->del_work))
flush_workqueue(md_misc_wq);

- if (mddev->sync_thread) {
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev);
- }
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ /*
+ * Thread might be blocked waiting for metadata update which will now
+ * never happen
+ */
+ md_wakeup_thread_directly(mddev->sync_thread);

mddev_unlock(mddev);
}

static void idle_sync_thread(struct mddev *mddev)
{
+ int sync_seq = atomic_read(&mddev->sync_seq);
+
mutex_lock(&mddev->sync_mutex);
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+
+ wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+ !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
mutex_unlock(&mddev->sync_mutex);
}

@@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev *mddev)
mutex_init(&mddev->delete_mutex);
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+
+ wait_event(resync_wait, mddev->sync_thread == NULL &&
+ !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
mutex_unlock(&mddev->sync_mutex);
}

@@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)

/* resync has finished, collect result */
md_unregister_thread(&mddev->sync_thread);
+ atomic_inc(&mddev->sync_seq);
+
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2fa903de5bd0..7cab9c7c45b8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -539,6 +539,8 @@ struct mddev {

/* Used to synchronize idle and frozen for action_store() */
struct mutex sync_mutex;
+ /* The sequence number for sync thread */
+ atomic_t sync_seq;

bool has_superblocks:1;
bool fail_last_dev:1;
--
2.39.2


2023-05-29 13:32:45

by Yu Kuai

Subject: [PATCH -next v2 5/6] md: wake up 'resync_wait' at last in md_reap_sync_thread()

From: Yu Kuai <[email protected]>

md_reap_sync_thread() was just replaced with wait_event(resync_wait,
...) in action_store(); make sure action_store() will still wait for
everything to be done in md_reap_sync_thread() by waking up
'resync_wait' only after all cleanup has finished.
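
Schematically, the resulting order (a sketch of md_reap_sync_thread()
after this patch, not the full function):

void md_reap_sync_thread(struct mddev *mddev)
{
        /* unregister sync_thread, update superblock, notify sysfs, ... */
        ...
        /*
         * Wake up waiters last, so that anyone waiting on 'resync_wait'
         * observes a fully reaped sync thread.
         */
        wake_up(&resync_wait);
}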

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7912de0e4d12..f90226e6ddf8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9531,7 +9531,6 @@ void md_reap_sync_thread(struct mddev *mddev)
if (mddev_is_clustered(mddev) && is_reshaped
&& !test_bit(MD_CLOSING, &mddev->flags))
md_cluster_ops->update_size(mddev, old_dev_sectors);
- wake_up(&resync_wait);
/* flag recovery needed just to double check */
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
sysfs_notify_dirent_safe(mddev->sysfs_completed);
@@ -9539,6 +9538,7 @@ void md_reap_sync_thread(struct mddev *mddev)
md_new_event();
if (mddev->event_work.func)
queue_work(md_misc_wq, &mddev->event_work);
+ wake_up(&resync_wait);
}
EXPORT_SYMBOL(md_reap_sync_thread);

--
2.39.2


2023-05-29 13:34:34

by Yu Kuai

Subject: [PATCH -next v2 3/6] md: add a mutex to synchronize idle and frozen in action_store()

From: Yu Kuai <[email protected]>

Currently, for idle and frozen, action_store() will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop the sync thread; however, this
will cause a deadlock (explained in the next patch). In order to fix the
problem, a following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() do.

Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems. For example,
'frozen' has just set 'MD_RECOVERY_FROZEN' and is still in progress,
while 'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is
started, which might starve the in-progress 'frozen'. A mutex is added
to synchronize idle and frozen from action_store().
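
Schematically, the race the mutex closes (a sketch of two concurrent
writers to 'sync_action'):

   // 'frozen' writer                 // 'idle' writer
   set_bit(MD_RECOVERY_FROZEN)
   stop_sync_thread()
   // ... still in progress ...
                                      clear_bit(MD_RECOVERY_FROZEN)
                                      // a new sync thread can now start,
                                      // starving the in-progress 'frozen'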

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 5 +++++
drivers/md/md.h | 3 +++
2 files changed, 8 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 23e8e7eae062..63a993b52cd7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -644,6 +644,7 @@ void mddev_init(struct mddev *mddev)
mutex_init(&mddev->open_mutex);
mutex_init(&mddev->reconfig_mutex);
mutex_init(&mddev->delete_mutex);
+ mutex_init(&mddev->sync_mutex);
mutex_init(&mddev->bitmap_info.mutex);
INIT_LIST_HEAD(&mddev->disks);
INIT_LIST_HEAD(&mddev->all_mddevs);
@@ -4785,14 +4786,18 @@ static void stop_sync_thread(struct mddev *mddev)

static void idle_sync_thread(struct mddev *mddev)
{
+ mutex_lock(&mddev->sync_mutex);
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+ mutex_unlock(&mddev->sync_mutex);
}

static void frozen_sync_thread(struct mddev *mddev)
{
+ mutex_init(&mddev->delete_mutex);
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+ mutex_unlock(&mddev->sync_mutex);
}

static ssize_t
diff --git a/drivers/md/md.h b/drivers/md/md.h
index bfd2306bc750..2fa903de5bd0 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -537,6 +537,9 @@ struct mddev {
/* Protect the deleting list */
struct mutex delete_mutex;

+ /* Used to synchronize idle and frozen for action_store() */
+ struct mutex sync_mutex;
+
bool has_superblocks:1;
bool fail_last_dev:1;
bool serialize_policy:1;
--
2.39.2


2023-05-29 13:47:08

by Yu Kuai

Subject: [PATCH -next v2 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

From: Yu Kuai <[email protected]>

This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.

Because it introduces a defect where sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause unexpected problems,
for example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18

This is because the work is requeued while it's still inside the
workqueue:

t1:                     t2:
action_store
 mddev_lock
 if (mddev->sync_thread)
  mddev_unlock
  md_unregister_thread
  // first sync_thread is done
                        md_check_recovery
                         mddev_try_lock
                         /*
                          * once MD_RECOVERY_DONE is set, new sync_thread
                          * can start.
                          */
                         set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                         INIT_WORK(&mddev->del_work, md_start_sync)
                         queue_work(md_misc_wq, &mddev->del_work)
                          test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
                          // set pending bit
                          insert_work
                           list_add_tail
                         mddev_unlock
  mddev_lock_nointr
  md_reap_sync_thread
  // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
 INIT_WORK(&mddev->del_work, md_start_sync)
  work->data = 0
  // work pending bit is cleared
 queue_work(md_misc_wq, &mddev->del_work)
  insert_work
   list_add_tail
   // list is corrupted

The above commit is reverted to fix the problem; the deadlock it tried
to fix will be fixed in the following patches.
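
Schematically, the revert restores unregistering the sync thread inside
md_reap_sync_thread(), so that MD_RECOVERY_RUNNING can't be observed
clear while sync_thread is still running (a sketch, not the full
function):

void md_reap_sync_thread(struct mddev *mddev)
{
        /* resync has finished, collect result */
        md_unregister_thread(&mddev->sync_thread);
        ...
        /* RUNNING is only cleared after the thread is gone */
        clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
        ...
}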

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 1 -
drivers/md/md.c | 19 ++-----------------
2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 8846bf510a35..1f22bef27841 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3725,7 +3725,6 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
}
} else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a5a7af2f4e59..9b97731e1fe4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4772,19 +4772,6 @@ action_store(struct mddev *mddev, const char *page, size_t len)
if (work_pending(&mddev->del_work))
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
- sector_t save_rp = mddev->reshape_position;
-
- mddev_unlock(mddev);
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
- mddev_lock_nointr(mddev);
- /*
- * set RECOVERY_INTR again and restore reshape
- * position in case others changed them after
- * got lock, eg, reshape_position_store and
- * md_check_recovery.
- */
- mddev->reshape_position = save_rp;
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
@@ -6184,7 +6171,6 @@ static void __md_stop_writes(struct mddev *mddev)
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
}

@@ -9336,7 +9322,6 @@ void md_check_recovery(struct mddev *mddev)
* ->spare_active and clear saved_raid_disk
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -9372,7 +9357,6 @@ void md_check_recovery(struct mddev *mddev)
goto unlock;
}
if (mddev->sync_thread) {
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
goto unlock;
}
@@ -9452,7 +9436,8 @@ void md_reap_sync_thread(struct mddev *mddev)
sector_t old_dev_sectors = mddev->dev_sectors;
bool is_reshaped = false;

- /* sync_thread should be unregistered, collect result */
+ /* resync has finished, collect result */
+ md_unregister_thread(&mddev->sync_thread);
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
mddev->degraded != mddev->raid_disks) {
--
2.39.2


2023-05-29 13:47:35

by Yu Kuai

Subject: [PATCH -next v2 2/6] md: refactor action_store() for 'idle' and 'frozen'

From: Yu Kuai <[email protected]>

Prepare to handle 'idle' and 'frozen' differently to fix a deadlock;
there are no functional changes except that MD_RECOVERY_RUNNING is
checked again after 'reconfig_mutex' is held.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 61 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 45 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9b97731e1fe4..23e8e7eae062 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4755,6 +4755,46 @@ action_show(struct mddev *mddev, char *page)
return sprintf(page, "%s\n", type);
}

+static void stop_sync_thread(struct mddev *mddev)
+{
+ if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+ return;
+
+ if (mddev_lock(mddev))
+ return;
+
+ /*
+ * Check again in case MD_RECOVERY_RUNNING is cleared before lock is
+ * held.
+ */
+ if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
+ mddev_unlock(mddev);
+ return;
+ }
+
+ if (work_pending(&mddev->del_work))
+ flush_workqueue(md_misc_wq);
+
+ if (mddev->sync_thread) {
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ md_reap_sync_thread(mddev);
+ }
+
+ mddev_unlock(mddev);
+}
+
+static void idle_sync_thread(struct mddev *mddev)
+{
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ stop_sync_thread(mddev);
+}
+
+static void frozen_sync_thread(struct mddev *mddev)
+{
+ set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ stop_sync_thread(mddev);
+}
+
static ssize_t
action_store(struct mddev *mddev, const char *page, size_t len)
{
@@ -4762,22 +4802,11 @@ action_store(struct mddev *mddev, const char *page, size_t len)
return -EINVAL;


- if (cmd_match(page, "idle") || cmd_match(page, "frozen")) {
- if (cmd_match(page, "frozen"))
- set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
- else
- clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
- if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
- mddev_lock(mddev) == 0) {
- if (work_pending(&mddev->del_work))
- flush_workqueue(md_misc_wq);
- if (mddev->sync_thread) {
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev);
- }
- mddev_unlock(mddev);
- }
- } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+ if (cmd_match(page, "idle"))
+ idle_sync_thread(mddev);
+ else if (cmd_match(page, "frozen"))
+ frozen_sync_thread(mddev);
+ else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
return -EBUSY;
else if (cmd_match(page, "resync"))
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
--
2.39.2


2023-06-08 02:55:25

by Yu Kuai

Subject: Re: [PATCH -next v2 0/6] md: fix that MD_RECOVERY_RUNNING can be cleared while sync_thread is still running

Hi,

On 2023/05/29 21:20, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Changes in v2:
> - rebase for the latest md-next
>
> Patch 1 reverts the commit because it allows MD_RECOVERY_RUNNING to be
> cleared while sync_thread is still running. The deadlock that commit
> tried to fix is fixed instead by patches 2-5.
>
> Patch 6 enhances the checking to prevent MD_RECOVERY_RUNNING from being
> cleared while sync_thread is still running.

Any suggestions on this patchset? I already sent regression tests
for the deadlock problem for both raid10 and raid456.

Thanks,
Kuai
>
> Yu Kuai (6):
> Revert "md: unlock mddev before reap sync_thread in action_store"
> md: refactor action_store() for 'idle' and 'frozen'
> md: add a mutex to synchronize idle and frozen in action_store()
> md: refactor idle/frozen_sync_thread() to fix deadlock
> md: wake up 'resync_wait' at last in md_reap_sync_thread()
> md: enhance checking in md_check_recovery()
>
> drivers/md/dm-raid.c | 1 -
> drivers/md/md.c | 124 +++++++++++++++++++++++++++++--------------
> drivers/md/md.h | 5 ++
> 3 files changed, 88 insertions(+), 42 deletions(-)
>


2023-06-09 05:50:30

by Song Liu

Subject: Re: [PATCH -next v2 0/6] md: fix that MD_RECOVERY_RUNNING can be cleared while sync_thread is still running

On Wed, Jun 7, 2023 at 7:41 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/05/29 21:20, Yu Kuai wrote:
> > From: Yu Kuai <[email protected]>
> >
> > Changes in v2:
> > - rebase for the latest md-next
> >
> > Patch 1 reverts the commit because it allows MD_RECOVERY_RUNNING to be
> > cleared while sync_thread is still running. The deadlock that commit
> > tried to fix is fixed instead by patches 2-5.
> >
> > Patch 6 enhances the checking to prevent MD_RECOVERY_RUNNING from being
> > cleared while sync_thread is still running.
>
> Any suggestions on this patchset? I already sent regression tests
> for the deadlock problem for both raid10 and raid456.

Sorry for the delay. I will look into this soon.

Thanks,
Song

2023-06-13 06:39:19

by Xiao Ni

Subject: Re: [PATCH -next v2 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"


On 2023/5/29 9:20 PM, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.
>
> Because it will introduce a defect that sync_thread can be running while
> MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
> for example:
>
> list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
> Call trace:
> __list_add_valid+0xfc/0x140
> insert_work+0x78/0x1a0
> __queue_work+0x500/0xcf4
> queue_work_on+0xe8/0x12c
> md_check_recovery+0xa34/0xf30
> raid10d+0xb8/0x900 [raid10]
> md_thread+0x16c/0x2cc
> kthread+0x1a4/0x1ec
> ret_from_fork+0x10/0x18
>
> This is because work is requeued while it's still inside workqueue:
>
> t1: t2:
> action_store
> mddev_lock
> if (mddev->sync_thread)
> mddev_unlock
> md_unregister_thread
> // first sync_thread is done
> md_check_recovery
> mddev_try_lock
> /*
> * once MD_RECOVERY_DONE is set, new sync_thread
> * can start.
> */
> set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
> INIT_WORK(&mddev->del_work, md_start_sync)
> queue_work(md_misc_wq, &mddev->del_work)
> test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
> // set pending bit
> insert_work
> list_add_tail
> mddev_unlock
> mddev_lock_nointr
> md_reap_sync_thread
> // MD_RECOVERY_RUNNING is cleared
> mddev_unlock
>
> t3:
>
> // before queued work started from t2
> md_check_recovery
> // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
> INIT_WORK(&mddev->del_work, md_start_sync)
> work->data = 0
> // work pending bit is cleared
> queue_work(md_misc_wq, &mddev->del_work)
> insert_work
> list_add_tail
> // list is corrupted
>
> The above commit is reverted to fix the problem, the deadlock this
> commit tries to fix will be fixed in following patches.
>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/dm-raid.c | 1 -
> drivers/md/md.c | 19 ++-----------------
> 2 files changed, 2 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 8846bf510a35..1f22bef27841 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3725,7 +3725,6 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
> if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
> if (mddev->sync_thread) {
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> - md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> }
> } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index a5a7af2f4e59..9b97731e1fe4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4772,19 +4772,6 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> if (work_pending(&mddev->del_work))
> flush_workqueue(md_misc_wq);
> if (mddev->sync_thread) {
> - sector_t save_rp = mddev->reshape_position;
> -
> - mddev_unlock(mddev);
> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> - md_unregister_thread(&mddev->sync_thread);
> - mddev_lock_nointr(mddev);
> - /*
> - * set RECOVERY_INTR again and restore reshape
> - * position in case others changed them after
> - * got lock, eg, reshape_position_store and
> - * md_check_recovery.
> - */
> - mddev->reshape_position = save_rp;
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> md_reap_sync_thread(mddev);
> }
> @@ -6184,7 +6171,6 @@ static void __md_stop_writes(struct mddev *mddev)
> flush_workqueue(md_misc_wq);
> if (mddev->sync_thread) {
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> - md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> }
>
> @@ -9336,7 +9322,6 @@ void md_check_recovery(struct mddev *mddev)
> * ->spare_active and clear saved_raid_disk
> */
> set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> - md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> @@ -9372,7 +9357,6 @@ void md_check_recovery(struct mddev *mddev)
> goto unlock;
> }
> if (mddev->sync_thread) {
> - md_unregister_thread(&mddev->sync_thread);
> md_reap_sync_thread(mddev);
> goto unlock;
> }
> @@ -9452,7 +9436,8 @@ void md_reap_sync_thread(struct mddev *mddev)
> sector_t old_dev_sectors = mddev->dev_sectors;
> bool is_reshaped = false;
>
> - /* sync_thread should be unregistered, collect result */
> + /* resync has finished, collect result */
> + md_unregister_thread(&mddev->sync_thread);
> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
> mddev->degraded != mddev->raid_disks) {


Hi Kuai

Thanks for the patch and the explanation in v1. In v1, I took much time
trying to understand the problem. Maybe we can use the problem itself as
the subject, something like "Don't allow two sync processes to run at
the same time"? And could you add the test steps discussed in v1 to the
patch? It would help a lot in understanding the problem.

Best Regards

Xiao


2023-06-13 08:06:36

by Xiao Ni

Subject: Re: [dm-devel] [PATCH -next v2 2/6] md: refactor action_store() for 'idle' and 'frozen'


On 2023/5/29 9:20 PM, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Prepare to handle 'idle' and 'frozen' differently to fix a deadlock, there
> are no functional changes except that MD_RECOVERY_RUNNING is checked
> again after 'reconfig_mutex' is held.


Can you explain more about why it needs to check MD_RECOVERY_RUNNING
again here?

>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.c | 61 ++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 45 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 9b97731e1fe4..23e8e7eae062 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4755,6 +4755,46 @@ action_show(struct mddev *mddev, char *page)
> return sprintf(page, "%s\n", type);
> }
>
> +static void stop_sync_thread(struct mddev *mddev)
> +{
> + if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> + return;
> +
> + if (mddev_lock(mddev))
> + return;
> +
> + /*
> + * Check again in case MD_RECOVERY_RUNNING is cleared before lock is
> + * held.
> + */
> + if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
> + mddev_unlock(mddev);
> + return;
> + }
> +
> + if (work_pending(&mddev->del_work))
> + flush_workqueue(md_misc_wq);
> +
> + if (mddev->sync_thread) {
> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> + md_reap_sync_thread(mddev);
> + }
> +
> + mddev_unlock(mddev);
> +}
> +
> +static void idle_sync_thread(struct mddev *mddev)
> +{
> + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> + stop_sync_thread(mddev);
> +}
> +
> +static void frozen_sync_thread(struct mddev *mddev)
> +{
> + set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> + stop_sync_thread(mddev);
> +}
> +
> static ssize_t
> action_store(struct mddev *mddev, const char *page, size_t len)
> {
> @@ -4762,22 +4802,11 @@ action_store(struct mddev *mddev, const char *page, size_t len)
> return -EINVAL;
>
>
> - if (cmd_match(page, "idle") || cmd_match(page, "frozen")) {
> - if (cmd_match(page, "frozen"))
> - set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> - else
> - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> - mddev_lock(mddev) == 0) {
> - if (work_pending(&mddev->del_work))
> - flush_workqueue(md_misc_wq);
> - if (mddev->sync_thread) {
> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> - md_reap_sync_thread(mddev);
> - }
> - mddev_unlock(mddev);
> - }
> - } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> + if (cmd_match(page, "idle"))
> + idle_sync_thread(mddev);
> + else if (cmd_match(page, "frozen"))
> + frozen_sync_thread(mddev);
> + else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> return -EBUSY;
> else if (cmd_match(page, "resync"))
> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);


2023-06-13 12:11:18

by Yu Kuai

Subject: Re: [PATCH -next v2 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

Hi,

On 2023/06/13 14:25, Xiao Ni wrote:
> Thanks for the patch and the explanation in v1. In v1, I took much time
> trying to understand the problem. Maybe we can use the problem itself as
> the subject, something like "Don't allow two sync processes to run at
> the same time"? And could you add the test steps discussed in v1 to the
> patch? It would help a lot in understanding the problem.

Ok, thanks for the suggestion, I can do that in the next version.

Kuai


2023-06-13 12:29:48

by Yu Kuai

Subject: Re: [dm-devel] [PATCH -next v2 2/6] md: refactor action_store() for 'idle' and 'frozen'

Hi,

On 2023/06/13 16:02, Xiao Ni wrote:
>
> On 2023/5/29 9:20 PM, Yu Kuai wrote:
>> From: Yu Kuai <[email protected]>
>>
>> Prepare to handle 'idle' and 'frozen' differently to fix a deadlock,
>> there
>> are no functional changes except that MD_RECOVERY_RUNNING is checked
>> again after 'reconfig_mutex' is held.
>
>
> Can you explain more about why it needs to check MD_RECOVERY_RUNNING
> again here?

As I explain in the following comment:
>> +    /*
>> +     * Check again in case MD_RECOVERY_RUNNING is cleared before lock is
>> +     * held.
>> +     */
>> +    if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
>> +        mddev_unlock(mddev);
>> +        return;
>> +    }

Thanks,
Kuai


2023-06-13 12:33:52

by Xiao Ni

Subject: Re: [dm-devel] [PATCH -next v2 2/6] md: refactor action_store() for 'idle' and 'frozen'

On Tue, Jun 13, 2023 at 8:00 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/06/13 16:02, Xiao Ni wrote:
> >
> > On 2023/5/29 9:20 PM, Yu Kuai wrote:
> >> From: Yu Kuai <[email protected]>
> >>
> >> Prepare to handle 'idle' and 'frozen' differently to fix a deadlock,
> >> there
> >> are no functional changes except that MD_RECOVERY_RUNNING is checked
> >> again after 'reconfig_mutex' is held.
> >
> >
> > Can you explain more about why it needs to check MD_RECOVERY_RUNNING
> > again here?
>
> As I explain in the following comment:

Hi

Who can clear the flag before the lock is held?

Regards
Xiao
> >> + /*
> >> + * Check again in case MD_RECOVERY_RUNNING is cleared before lock is
> >> + * held.
> >> + */
> >> + if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
> >> + mddev_unlock(mddev);
> >> + return;
> >> + }
>
> Thanks,
> Kuai
>


2023-06-13 12:50:48

by Yu Kuai

Subject: Re: [dm-devel] [PATCH -next v2 2/6] md: refactor action_store() for 'idle' and 'frozen'

Hi,

On 2023/06/13 20:25, Xiao Ni wrote:
> On Tue, Jun 13, 2023 at 8:00 PM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> On 2023/06/13 16:02, Xiao Ni wrote:
>>>
>>> On 2023/5/29 9:20 PM, Yu Kuai wrote:
>>>> From: Yu Kuai <[email protected]>
>>>>
>>>> Prepare to handle 'idle' and 'frozen' differently to fix a deadlock,
>>>> there
>>>> are no functional changes except that MD_RECOVERY_RUNNING is checked
>>>> again after 'reconfig_mutex' is held.
>>>
>>>
>>> Can you explain more about why it needs to check MD_RECOVERY_RUNNING
>>> again here?
>>
>> As I explain in the following comment:
>
> Hi
>
> Who can clear the flag before the lock is held?

Basically everywhere that can clear the flag...

// This context              // Other context
                             mutex_lock
                             ...
test_bit -> pass
                             clear_bit
                             mutex_unlock
mutex_lock
test_bit -> check again
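
A concrete example of the 'other context' (a sketch; md_check_recovery()
from the md thread is one path that clears the bit while holding
'reconfig_mutex'):

// md thread
md_check_recovery
 if (mddev_trylock(mddev))       // takes 'reconfig_mutex'
  md_reap_sync_thread(mddev)     // clears MD_RECOVERY_RUNNING
  mddev_unlock(mddev)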

Thanks,
Kuai
>
> Regards
> Xiao
>>>> + /*
>>>> + * Check again in case MD_RECOVERY_RUNNING is cleared before lock is
>>>> + * held.
>>>> + */
>>>> + if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
>>>> + mddev_unlock(mddev);
>>>> + return;
>>>> + }
>>
>> Thanks,
>> Kuai
>>
>
> .
>


2023-06-13 14:42:33

by Xiao Ni

Subject: Re: [dm-devel] [PATCH -next v2 2/6] md: refactor action_store() for 'idle' and 'frozen'


On 2023/6/13 8:44 PM, Yu Kuai wrote:
> Hi,
>
> On 2023/06/13 20:25, Xiao Ni wrote:
>> On Tue, Jun 13, 2023 at 8:00 PM Yu Kuai <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> On 2023/06/13 16:02, Xiao Ni wrote:
>>>>
>>>> On 2023/5/29 9:20 PM, Yu Kuai wrote:
>>>>> From: Yu Kuai <[email protected]>
>>>>>
>>>>> Prepare to handle 'idle' and 'frozen' differently to fix a deadlock,
>>>>> there
>>>>> are no functional changes except that MD_RECOVERY_RUNNING is checked
>>>>> again after 'reconfig_mutex' is held.
>>>>
>>>>
>>>> Can you explain more about why it needs to check MD_RECOVERY_RUNNING
>>>> again here?
>>>
>>> As I explain in the following comment:
>>
>> Hi
>>
>> Who can clear the flag before the lock is held?
>
> Basically every where that can clear the flag...
>
> // This context     // Other context
>             mutex_lock
>             ...
> test_bit -> pass
>             clear_bit
>             mutex_unlock
> mutex_lock
> test_bit -> check again
>
> Thanks,
> Kuai

At first, I wanted to figure out a specific case. Now I have the answer:
maybe two people want to stop the sync action at the same time. So this
is the case that the code checks for.

Regards

Xiao

>>
>> Regards
>> Xiao
>>>>> +    /*
>>>>> +     * Check again in case MD_RECOVERY_RUNNING is cleared before
>>>>> lock is
>>>>> +     * held.
>>>>> +     */
>>>>> +    if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
>>>>> +        mddev_unlock(mddev);
>>>>> +        return;
>>>>> +    }
>>>
>>> Thanks,
>>> Kuai
>>>
>>
>> .
>>
>


2023-06-13 14:58:35

by Xiao Ni

Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock


On 2023/5/29 9:20 PM, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Our test found a following deadlock in raid10:
>
> 1) Issue a normal write, and such write failed:
>
> raid10_end_write_request
> set_bit(R10BIO_WriteError, &r10_bio->state)
> one_write_done
> reschedule_retry
>
> // later from md thread
> raid10d
> handle_write_completed
> list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
>
> // later from md thread
> raid10d
> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> list_move(conf->bio_end_io_list.prev, &tmp)
> r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> raid_end_bio_io(r10_bio)
>
> Dependency chain 1: normal io is waiting for updating superblock

Hi Kuai

It looks like the above situation is more complex than it needs to be.
It only needs a normal write, and md_write_start() has to wait until the
metadata is written to the member disks, right? If so, it isn't
necessary to introduce a raid10 write failure here. I guess that is your
test case. It would be nice if you could put your test steps in the
patch. But for the analysis of the deadlock here, it's better to keep it
simple.

>
> 2) Trigger a recovery:
>
> raid10_sync_request
> raise_barrier
>
> Dependency chain 2: sync thread is waiting for normal io
>
> 3) echo idle/frozen to sync_action:
>
> action_store
> mddev_lock
> md_unregister_thread
> kthread_stop
>
> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
>
> 4) md thread can't update superblock:
>
> raid10d
> md_check_recovery
> if (mddev_trylock(mddev))
> md_update_sb
>
> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
>
> Hence cyclic dependency exist, in order to fix the problem, we must
> break one of them. Dependency 1 and 2 can't be broken because they are
> foundation design. Dependency 4 may be possible if it can be guaranteed
> that no io can be inflight, however, this requires a new mechanism which
> seems complex. Dependency 3 is a good choice, because idle/frozen only
> requires sync thread to finish, which can be done asynchronously that is
> already implemented, and 'reconfig_mutex' is not needed anymore.
>
> This patch switch 'idle' and 'frozen' to wait sync thread to be done
> asynchronously, and this patch also add a sequence counter to record how
> many times sync thread is done, so that 'idle' won't keep waiting on new
> started sync thread.

In the patch, sync_seq is added in md_reap_sync_thread(). In
idle_sync_thread(), if sync_seq isn't equal to mddev->sync_seq, it should
mean that someone has already stopped the sync thread, right? Why do you
say 'newly started sync thread' here?

Regards

Xiao


>
> Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
> deadlock can be fixed by this patch as well.
>
> [1] https://lore.kernel.org/linux-raid/[email protected]/T/#t
> [2] https://lore.kernel.org/linux-raid/[email protected]/
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.c | 23 +++++++++++++++++++----
> drivers/md/md.h | 2 ++
> 2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 63a993b52cd7..7912de0e4d12 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
> timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
> atomic_set(&mddev->active, 1);
> atomic_set(&mddev->openers, 0);
> + atomic_set(&mddev->sync_seq, 0);
> spin_lock_init(&mddev->lock);
> atomic_set(&mddev->flush_pending, 0);
> init_waitqueue_head(&mddev->sb_wait);
> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
> if (work_pending(&mddev->del_work))
> flush_workqueue(md_misc_wq);
>
> - if (mddev->sync_thread) {
> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> - md_reap_sync_thread(mddev);
> - }
> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> + /*
> + * Thread might be blocked waiting for metadata update which will now
> + * never happen
> + */
> + md_wakeup_thread_directly(mddev->sync_thread);
>
> mddev_unlock(mddev);
> }
>
> static void idle_sync_thread(struct mddev *mddev)
> {
> + int sync_seq = atomic_read(&mddev->sync_seq);
> +
> mutex_lock(&mddev->sync_mutex);
> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> stop_sync_thread(mddev);
> +
> + wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> +
> mutex_unlock(&mddev->sync_mutex);
> }
>
> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev *mddev)
> mutex_init(&mddev->delete_mutex);
> set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> stop_sync_thread(mddev);
> +
> + wait_event(resync_wait, mddev->sync_thread == NULL &&
> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> +
> mutex_unlock(&mddev->sync_mutex);
> }
>
> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
>
> /* resync has finished, collect result */
> md_unregister_thread(&mddev->sync_thread);
> + atomic_inc(&mddev->sync_seq);
> +
> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
> mddev->degraded != mddev->raid_disks) {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 2fa903de5bd0..7cab9c7c45b8 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -539,6 +539,8 @@ struct mddev {
>
> /* Used to synchronize idle and frozen for action_store() */
> struct mutex sync_mutex;
> + /* The sequence number for sync thread */
> + atomic_t sync_seq;
>
> bool has_superblocks:1;
> bool fail_last_dev:1;


2023-06-13 15:05:09

by Xiao Ni

Subject: Re: [dm-devel] [PATCH -next v2 3/6] md: add a mutex to synchronize idle and frozen in action_store()


On 2023/5/29 9:20 PM, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
> and call md_reap_sync_thread() to stop sync thread, however, this will
> cause deadlock (explained in the next patch). In order to fix the
> problem, following patch will release 'reconfig_mutex' and wait on
> 'resync_wait', like md_set_readonly() and do_md_stop() does.
>
> Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
> unconditionally, which might cause unexpected problems, for example,
> frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while
> 'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which
> might starve in progress frozen. A mutex is added to synchronize idle
> and frozen from action_store().
>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.c | 5 +++++
> drivers/md/md.h | 3 +++
> 2 files changed, 8 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 23e8e7eae062..63a993b52cd7 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -644,6 +644,7 @@ void mddev_init(struct mddev *mddev)
> mutex_init(&mddev->open_mutex);
> mutex_init(&mddev->reconfig_mutex);
> mutex_init(&mddev->delete_mutex);
> + mutex_init(&mddev->sync_mutex);
> mutex_init(&mddev->bitmap_info.mutex);
> INIT_LIST_HEAD(&mddev->disks);
> INIT_LIST_HEAD(&mddev->all_mddevs);
> @@ -4785,14 +4786,18 @@ static void stop_sync_thread(struct mddev *mddev)
>
> static void idle_sync_thread(struct mddev *mddev)
> {
> + mutex_lock(&mddev->sync_mutex);
> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> stop_sync_thread(mddev);
> + mutex_unlock(&mddev->sync_mutex);
> }
>
> static void frozen_sync_thread(struct mddev *mddev)
> {
> + mutex_init(&mddev->delete_mutex);


A typo? It should be mutex_lock(&mddev->sync_mutex), right?

Regards

Xiao

> set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> stop_sync_thread(mddev);
> + mutex_unlock(&mddev->sync_mutex);
> }
>
> static ssize_t
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index bfd2306bc750..2fa903de5bd0 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -537,6 +537,9 @@ struct mddev {
> /* Protect the deleting list */
> struct mutex delete_mutex;
>
> + /* Used to synchronize idle and frozen for action_store() */
> + struct mutex sync_mutex;
> +
> bool has_superblocks:1;
> bool fail_last_dev:1;
> bool serialize_policy:1;


2023-06-14 01:23:13

by Yu Kuai

Subject: Re: [dm-devel] [PATCH -next v2 3/6] md: add a mutex to synchronize idle and frozen in action_store()

Hi,

On 2023/06/13 22:43, Xiao Ni wrote:
>
> On 2023/5/29 9:20 PM, Yu Kuai wrote:
>> From: Yu Kuai <[email protected]>
>>
>> Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
>> and call md_reap_sync_thread() to stop sync thread, however, this will
>> cause deadlock (explained in the next patch). In order to fix the
>> problem, following patch will release 'reconfig_mutex' and wait on
>> 'resync_wait', like md_set_readonly() and do_md_stop() does.
>>
>> Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
>> unconditionally, which might cause unexpected problems, for example,
>> frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while
>> 'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which
>> might starve in progress frozen. A mutex is added to synchronize idle
>> and frozen from action_store().
>>
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>>   drivers/md/md.c | 5 +++++
>>   drivers/md/md.h | 3 +++
>>   2 files changed, 8 insertions(+)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 23e8e7eae062..63a993b52cd7 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -644,6 +644,7 @@ void mddev_init(struct mddev *mddev)
>>       mutex_init(&mddev->open_mutex);
>>       mutex_init(&mddev->reconfig_mutex);
>>       mutex_init(&mddev->delete_mutex);
>> +    mutex_init(&mddev->sync_mutex);
>>       mutex_init(&mddev->bitmap_info.mutex);
>>       INIT_LIST_HEAD(&mddev->disks);
>>       INIT_LIST_HEAD(&mddev->all_mddevs);
>> @@ -4785,14 +4786,18 @@ static void stop_sync_thread(struct mddev *mddev)
>>   static void idle_sync_thread(struct mddev *mddev)
>>   {
>> +    mutex_lock(&mddev->sync_mutex);
>>       clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>       stop_sync_thread(mddev);
>> +    mutex_unlock(&mddev->sync_mutex);
>>   }
>>   static void frozen_sync_thread(struct mddev *mddev)
>>   {
>> +    mutex_init(&mddev->delete_mutex);
>
>
> A typo? It should be mutex_lock(&mddev->sync_mutex), right?
>

Yes, and thanks for spotting this; it looks like I introduced it while
rebasing.

Thanks,
Kuai
> Regards
>
> Xiao
>
>>       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>       stop_sync_thread(mddev);
>> +    mutex_unlock(&mddev->sync_mutex);
>>   }
>>   static ssize_t
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index bfd2306bc750..2fa903de5bd0 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -537,6 +537,9 @@ struct mddev {
>>       /* Protect the deleting list */
>>       struct mutex            delete_mutex;
>> +    /* Used to synchronize idle and frozen for action_store() */
>> +    struct mutex            sync_mutex;
>> +
>>       bool    has_superblocks:1;
>>       bool    fail_last_dev:1;
>>       bool    serialize_policy:1;
>
> .
>


2023-06-14 01:58:31

by Yu Kuai

Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Hi,

On 2023/06/13 22:50, Xiao Ni wrote:
>
> On 2023/5/29 9:20 PM, Yu Kuai wrote:
>> From: Yu Kuai <[email protected]>
>>
>> Our test found a following deadlock in raid10:
>>
>> 1) Issue a normal write, and such write failed:
>>
>>    raid10_end_write_request
>>     set_bit(R10BIO_WriteError, &r10_bio->state)
>>     one_write_done
>>      reschedule_retry
>>
>>    // later from md thread
>>    raid10d
>>     handle_write_completed
>>      list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
>>
>>    // later from md thread
>>    raid10d
>>     if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>      list_move(conf->bio_end_io_list.prev, &tmp)
>>      r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>>      raid_end_bio_io(r10_bio)
>>
>> Dependency chain 1: normal io is waiting for updating superblock
>
> Hi Kuai
>
> It looks like the above situation is more complex than it needs to be.
> It only needs a normal write, and md_write_start() has to wait until the
> metadata is written to the member disks, right? If so, it isn't
> necessary to introduce a raid10 write failure here. I guess that is your
> test case. It would be nice if you could put your test steps in the
> patch. But for the analysis of the deadlock here, it's better to keep it
> simple.

The test script can be found here; it's pretty easy to trigger:

https://patchwork.kernel.org/project/linux-raid/patch/[email protected]/

While reviewing the related code, I found that io can only be added to
the list bio_end_io_list from handle_write_completed() if such io failed,
so I think an io failure is needed to trigger the deadlock from the
daemon thread.

I think the key point is how MD_SB_CHANGE_PENDING is set:

1) raid10_error() and rdev_set_badblocks(), triggered by io failure;
2) raid10_write_request(), related to reshape;
3) md_write_start() and md_allow_write(), when mddev->in_sync is set;
   however, I was thinking this is not a common case.

1) is used here because it's quite easy to trigger and this is what we
met in real tests. 3) is possible, but I would say let's keep 1); I
don't think it's necessary to reproduce this deadlock through another
path again.
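
For reference, a sketch of the gating from dependency chain 1,
paraphrased from the call chain in the patch description (not the exact
upstream code):

// raid10d(): failed writes parked on bio_end_io_list only complete
// once the superblock update is no longer pending
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
        list_move(conf->bio_end_io_list.prev, &tmp);
        r10_bio = list_first_entry(&tmp, struct r10bio, retry_list);
        raid_end_bio_io(r10_bio);
}
// otherwise normal io keeps waiting for md_update_sb()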

Thanks,
Kuai
>
>>
>> 2) Trigger a recovery:
>>
>>    raid10_sync_request
>>     raise_barrier
>>
>> Dependency chain 2: sync thread is waiting for normal io
>>
>> 3) echo idle/frozen to sync_action:
>>
>>    action_store
>>     mddev_lock
>>      md_unregister_thread
>>       kthread_stop
>>
>> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
>>
>> 4) md thread can't update superblock:
>>
>>    raid10d
>>     md_check_recovery
>>      if (mddev_trylock(mddev))
>>       md_update_sb
>>
>> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
>>
>> Hence cyclic dependency exist, in order to fix the problem, we must
>> break one of them. Dependency 1 and 2 can't be broken because they are
>> foundation design. Dependency 4 may be possible if it can be guaranteed
>> that no io can be inflight, however, this requires a new mechanism which
>> seems complex. Dependency 3 is a good choice, because idle/frozen only
>> requires sync thread to finish, which can be done asynchronously that is
>> already implemented, and 'reconfig_mutex' is not needed anymore.
>>
>> This patch switch 'idle' and 'frozen' to wait sync thread to be done
>> asynchronously, and this patch also add a sequence counter to record how
>> many times sync thread is done, so that 'idle' won't keep waiting on new
>> started sync thread.
>
> In the patch, sync_seq is added in md_reap_sync_thread. In
> idle_sync_thread, if sync_seq isn't equal
>
> mddev->sync_seq, it should mean there is someone that stops the sync
> thread already, right? Why do
>
> you say 'new started sync thread' here?
>
> Regards
>
> Xiao
>
>
>>
>> Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
>> deadlock can be fixed by this patch as well.
>>
>> [1]
>> https://lore.kernel.org/linux-raid/[email protected]/T/#t
>>
>> [2]
>> https://lore.kernel.org/linux-raid/[email protected]/
>>
>> Signed-off-by: Yu Kuai <[email protected]>
>> ---
>>   drivers/md/md.c | 23 +++++++++++++++++++----
>>   drivers/md/md.h |  2 ++
>>   2 files changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 63a993b52cd7..7912de0e4d12 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
>>       timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
>>       atomic_set(&mddev->active, 1);
>>       atomic_set(&mddev->openers, 0);
>> +    atomic_set(&mddev->sync_seq, 0);
>>       spin_lock_init(&mddev->lock);
>>       atomic_set(&mddev->flush_pending, 0);
>>       init_waitqueue_head(&mddev->sb_wait);
>> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
>>       if (work_pending(&mddev->del_work))
>>           flush_workqueue(md_misc_wq);
>> -    if (mddev->sync_thread) {
>> -        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> -        md_reap_sync_thread(mddev);
>> -    }
>> +    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> +    /*
>> +     * Thread might be blocked waiting for metadata update which will
>> now
>> +     * never happen
>> +     */
>> +    md_wakeup_thread_directly(mddev->sync_thread);
>>       mddev_unlock(mddev);
>>   }
>>   static void idle_sync_thread(struct mddev *mddev)
>>   {
>> +    int sync_seq = atomic_read(&mddev->sync_seq);
>> +
>>       mutex_lock(&mddev->sync_mutex);
>>       clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>       stop_sync_thread(mddev);
>> +
>> +    wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>> +
>>       mutex_unlock(&mddev->sync_mutex);
>>   }
>> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
>> *mddev)
>>       mutex_init(&mddev->delete_mutex);
>>       set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>       stop_sync_thread(mddev);
>> +
>> +    wait_event(resync_wait, mddev->sync_thread == NULL &&
>> +            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>> +
>>       mutex_unlock(&mddev->sync_mutex);
>>   }
>> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
>>       /* resync has finished, collect result */
>>       md_unregister_thread(&mddev->sync_thread);
>> +    atomic_inc(&mddev->sync_seq);
>> +
>>       if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>>           !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>>           mddev->degraded != mddev->raid_disks) {
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 2fa903de5bd0..7cab9c7c45b8 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -539,6 +539,8 @@ struct mddev {
>>       /* Used to synchronize idle and frozen for action_store() */
>>       struct mutex            sync_mutex;
>> +    /* The sequence number for sync thread */
>> +    atomic_t sync_seq;
>>       bool    has_superblocks:1;
>>       bool    fail_last_dev:1;
>


2023-06-14 02:38:19

by Yu Kuai

Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Hi,

On 2023/06/14 9:48, Yu Kuai wrote:


>>
>> In the patch, sync_seq is added in md_reap_sync_thread(). In
>> idle_sync_thread(), if sync_seq isn't equal to mddev->sync_seq, it should
>> mean that someone has already stopped the sync thread, right? Why do you
>> say 'newly started sync thread' here?

If someone stops the sync thread and no new sync thread is started, then
sync_seq won't make a difference: the wait_event() above won't wait,
because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) passes. So
'sync_seq' only matters when the old sync thread stops and a new sync
thread starts; the counter lets 'idle' return in that case instead of
waiting on the new thread.
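
Schematically (a sketch of the interleaving, not code from the patch):

   idle_sync_thread()                  md_reap_sync_thread()
   sync_seq = atomic_read(...)  // N
                                       atomic_inc(&mddev->sync_seq) // N+1
                                       // a new sync thread starts,
                                       // MD_RECOVERY_RUNNING is set again
   wait_event(resync_wait,
       sync_seq != atomic_read(...)    // N != N+1, so 'idle' returns
       || !test_bit(MD_RECOVERY_RUNNING, ...))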

Thanks,
Kuai


2023-06-14 04:00:27

by Xiao Ni

Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/06/13 22:50, Xiao Ni wrote:
> >
> > On 2023/5/29 9:20 PM, Yu Kuai wrote:
> >> From: Yu Kuai <[email protected]>
> >>
> >> Our test found a following deadlock in raid10:
> >>
> >> 1) Issue a normal write, and such write failed:
> >>
> >> raid10_end_write_request
> >> set_bit(R10BIO_WriteError, &r10_bio->state)
> >> one_write_done
> >> reschedule_retry
> >>
> >> // later from md thread
> >> raid10d
> >> handle_write_completed
> >> list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
> >>
> >> // later from md thread
> >> raid10d
> >> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> >> list_move(conf->bio_end_io_list.prev, &tmp)
> >> r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> >> raid_end_bio_io(r10_bio)
> >>
> >> Dependency chain 1: normal io is waiting for updating superblock
> >
> > Hi Kuai
> >
> > It looks like the above situation is more complex than it needs to be.
> > It only needs a normal write, and md_write_start() has to wait until the
> > metadata is written to the member disks, right? If so, it isn't
> > necessary to introduce a raid10 write failure here. I guess that is your
> > test case. It would be nice if you could put your test steps in the
> > patch. But for the analysis of the deadlock here, it's better to keep it
> > simple.
>
> The test script can be found here; it's pretty easy to trigger:
>
> https://patchwork.kernel.org/project/linux-raid/patch/[email protected]/

Thanks for this.
>
> While reviewing the related code, I found that io can only be added to
> the list bio_end_io_list from handle_write_completed() if such io failed,
> so I think an io failure is needed to trigger the deadlock from the
> daemon thread.
>
> I think the key point is how MD_SB_CHANGE_PENDING is set:
>
> 1) raid10_error() and rdev_set_badblocks(), triggered by io failure;
> 2) raid10_write_request(), related to reshape;
> 3) md_write_start() and md_allow_write(), when mddev->in_sync is set;
>    however, I was thinking this is not a common case.
>
> 1) is used here because it's quite easy to trigger and this is what we
> met in real tests. 3) is possible, but I would say let's keep 1); I
> don't think it's necessary to reproduce this deadlock through another
> path again.

It makes sense. Let's go back to the first path mentioned in the patch.

> 1) Issue a normal write, and such write failed:
>
> raid10_end_write_request
> set_bit(R10BIO_WriteError, &r10_bio->state)
> one_write_done
> reschedule_retry

This is good.
>
> // later from md thread
> raid10d
> handle_write_completed
> list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

I have a question here. It should run narrow_write_error() in
handle_write_completed(). In the test case, will narrow_write_error()
run successfully? Or does it fail and call rdev_set_badblocks() and
md_error(), so that MD_SB_CHANGE_PENDING will be set?

>
> // later from md thread
> raid10d
> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> list_move(conf->bio_end_io_list.prev, &tmp)
> r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> raid_end_bio_io(r10_bio)
>
> Dependency chain 1: normal io is waiting for updating superblock

It's a little hard to understand, because it doesn't show how normal
io waits for a superblock update. Based on your last email, I guess you
want to say that rdev_set_badblocks() sets MD_SB_CHANGE_PENDING, but the
flag can't be cleared, so the bios parked on bio_end_io_list can't be
completed, and so the io requests can't be finished.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> >>
> >> 2) Trigger a recovery:
> >>
> >> raid10_sync_request
> >> raise_barrier
> >>
> >> Dependency chain 2: sync thread is waiting for normal io
> >>
> >> 3) echo idle/frozen to sync_action:
> >>
> >> action_store
> >> mddev_lock
> >> md_unregister_thread
> >> kthread_stop
> >>
> >> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
> >>
> >> 4) md thread can't update superblock:
> >>
> >> raid10d
> >> md_check_recovery
> >> if (mddev_trylock(mddev))
> >> md_update_sb
> >>
> >> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
> >>
> >> Hence cyclic dependency exist, in order to fix the problem, we must
> >> break one of them. Dependency 1 and 2 can't be broken because they are
> >> foundation design. Dependency 4 may be possible if it can be guaranteed
> >> that no io can be inflight, however, this requires a new mechanism which
> >> seems complex. Dependency 3 is a good choice, because idle/frozen only
> >> requires sync thread to finish, which can be done asynchronously that is
> >> already implemented, and 'reconfig_mutex' is not needed anymore.
> >>
> >> This patch switches 'idle' and 'frozen' to wait for the sync thread to
> >> be done asynchronously, and it also adds a sequence counter to record how
> >> many times the sync thread is done, so that 'idle' won't keep waiting on
> >> a newly started sync thread.
> >
> > In the patch, sync_seq is added in md_reap_sync_thread. In
> > idle_sync_thread, if sync_seq isn't equal
> >
> > mddev->sync_seq, it should mean there is someone that stops the sync
> > thread already, right? Why do
> >
> > you say 'new started sync thread' here?
> >
> > Regards
> >
> > Xiao
> >
> >
> >>
> >> Noted that raid456 has a similar deadlock([1]), and it's verified[2] this
> >> deadlock can be fixed by this patch as well.
> >>
> >> [1]
> >> https://lore.kernel.org/linux-raid/[email protected]/T/#t
> >>
> >> [2]
> >> https://lore.kernel.org/linux-raid/[email protected]/
> >>
> >> Signed-off-by: Yu Kuai <[email protected]>
> >> ---
> >> drivers/md/md.c | 23 +++++++++++++++++++----
> >> drivers/md/md.h | 2 ++
> >> 2 files changed, 21 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 63a993b52cd7..7912de0e4d12 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
> >> timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
> >> atomic_set(&mddev->active, 1);
> >> atomic_set(&mddev->openers, 0);
> >> + atomic_set(&mddev->sync_seq, 0);
> >> spin_lock_init(&mddev->lock);
> >> atomic_set(&mddev->flush_pending, 0);
> >> init_waitqueue_head(&mddev->sb_wait);
> >> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
> >> if (work_pending(&mddev->del_work))
> >> flush_workqueue(md_misc_wq);
> >> - if (mddev->sync_thread) {
> >> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >> - md_reap_sync_thread(mddev);
> >> - }
> >> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >> + /*
> >> + * Thread might be blocked waiting for metadata update which will
> >> now
> >> + * never happen
> >> + */
> >> + md_wakeup_thread_directly(mddev->sync_thread);
> >> mddev_unlock(mddev);
> >> }
> >> static void idle_sync_thread(struct mddev *mddev)
> >> {
> >> + int sync_seq = atomic_read(&mddev->sync_seq);
> >> +
> >> mutex_lock(&mddev->sync_mutex);
> >> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >> stop_sync_thread(mddev);
> >> +
> >> + wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
> >> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >> +
> >> mutex_unlock(&mddev->sync_mutex);
> >> }
> >> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
> >> *mddev)
> >> mutex_init(&mddev->delete_mutex);
> >> set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >> stop_sync_thread(mddev);
> >> +
> >> + wait_event(resync_wait, mddev->sync_thread == NULL &&
> >> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >> +
> >> mutex_unlock(&mddev->sync_mutex);
> >> }
> >> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
> >> /* resync has finished, collect result */
> >> md_unregister_thread(&mddev->sync_thread);
> >> + atomic_inc(&mddev->sync_seq);
> >> +
> >> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> >> !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
> >> mddev->degraded != mddev->raid_disks) {
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index 2fa903de5bd0..7cab9c7c45b8 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -539,6 +539,8 @@ struct mddev {
> >> /* Used to synchronize idle and frozen for action_store() */
> >> struct mutex sync_mutex;
> >> + /* The sequence number for sync thread */
> >> + atomic_t sync_seq;
> >> bool has_superblocks:1;
> >> bool fail_last_dev:1;
> >
> > --
> > dm-devel mailing list
> > [email protected]
> > https://listman.redhat.com/mailman/listinfo/dm-devel
>


2023-06-14 06:30:36

by Yu Kuai

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Hi,

在 2023/06/14 11:47, Xiao Ni 写道:
> On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> 在 2023/06/13 22:50, Xiao Ni 写道:
>>>
>>> 在 2023/5/29 下午9:20, Yu Kuai 写道:
>>>> From: Yu Kuai <[email protected]>
>>>>
>>>> Our test found a following deadlock in raid10:
>>>>
>>>> 1) Issue a normal write, and such write failed:
>>>>
>>>> raid10_end_write_request
>>>> set_bit(R10BIO_WriteError, &r10_bio->state)
>>>> one_write_done
>>>> reschedule_retry
>>>>
>>>> // later from md thread
>>>> raid10d
>>>> handle_write_completed
>>>> list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
>>>>
>>>> // later from md thread
>>>> raid10d
>>>> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>>> list_move(conf->bio_end_io_list.prev, &tmp)
>>>> r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>>>> raid_end_bio_io(r10_bio)
>>>>
>>>> Dependency chain 1: normal io is waiting for updating superblock
>>>
>>> Hi Kuai
>>>
>>> It looks like the above situation is more complex. It only needs a
>>> normal write and md_write_start needs to
>>>
>>> wait until the metadata is written to member disks, right? If so, it
>>> doesn't need to introduce raid10 write failure
>>>
>>> here. I guess, it should be your test case. It's nice, if you can put
>>> your test steps in the patch. But for the analysis
>>>
>>> of the deadlock here, it's better to be simple.
>>
>> Test script can be found here, it's pretty easy to trigger:
>>
>> https://patchwork.kernel.org/project/linux-raid/patch/[email protected]/
>
> Thanks for this.
>>
>> While reviewing the related code, I found that io can only be added to
>> list bio_end_io_list from handle_write_completed() if such io failed, so
>> I think io failure is needed to trigger deadlock from daemon thread.
>>
>> I think the key point is how MD_SB_CHANGE_PENDING is set:
>>
>> 1) raid10_error() and rdev_set_badblocks(), trigger by io failure;
>> 2) raid10_write_request() related to reshape;
>> 3) md_write_start() and md_allow_write(), and mddev->in_sync is set,
>> however, I was thinking this is not a common case;
>>
>> 1) is used here because it's quite easy to trigger and this is what
>> we meet in real test. 3) is possible but I will say let's keep 1), I
> don't think it's necessary to reproduce this deadlock through another
>> path again.
>
> It makes sense. Let's go back to the first path mentioned in the patch.
>
>> 1) Issue a normal write, and such write failed:
>>
>> raid10_end_write_request
>> set_bit(R10BIO_WriteError, &r10_bio->state)
>> one_write_done
>> reschedule_retry
>
> This is good.
>>
>> // later from md thread
>> raid10d
>> handle_write_completed
>> list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
>
> I have a question here. handle_write_completed() should run
> narrow_write_error(). In the test case, will narrow_write_error() run
> successfully? Or will it fail and call rdev_set_badblocks() and
> md_error(), so that MD_SB_CHANGE_PENDING is set?

r10_bio will always be added to bio_end_io_list, no matter whether
narrow_write_error() succeeds or not. Dependency chain 1 here just
indicates that handling this r10_bio will wait for the superblock
update; it's not where MD_SB_CHANGE_PENDING is set...

And MD_SB_CHANGE_PENDING can be set from narrow_write_error() and other
places where rdev_set_badblocks() is called.
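
For reference, a condensed sketch of that path (a hand-written summary,
not verbatim kernel code):

narrow_write_error
  rdev_set_badblocks
    set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags)
    set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)
    /* raid10d won't touch bio_end_io_list again until
     * md_update_sb() runs and MD_SB_CHANGE_PENDING is cleared */
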
>
>>
>> // later from md thread
>> raid10d
>> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
-> It's here, if the flag is set, bio won't be handled.
>> list_move(conf->bio_end_io_list.prev, &tmp)
>> r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
>> raid_end_bio_io(r10_bio)
>>
>> Dependency chain 1: normal io is waiting for updating superblock
>
> It's a little hard to understand, because it doesn't show how normal
> io waits for a superblock update. Based on your last email, I guess
> you mean that rdev_set_badblocks() sets MD_SB_CHANGE_PENDING, but the
> flag can't be cleared, so the bios can't be added to bio_end_io_list,
> so the io requests can't be finished.

It's not that the bios can't be added to bio_end_io_list; it's that
bios on this list can't be handled while sb_flags is set.

Thanks,
Kuai
>
> Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>>
>>>>
>>>> 2) Trigger a recovery:
>>>>
>>>> raid10_sync_request
>>>> raise_barrier
>>>>
>>>> Dependency chain 2: sync thread is waiting for normal io
>>>>
>>>> 3) echo idle/frozen to sync_action:
>>>>
>>>> action_store
>>>> mddev_lock
>>>> md_unregister_thread
>>>> kthread_stop
>>>>
>>>> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
>>>>
>>>> 4) md thread can't update superblock:
>>>>
>>>> raid10d
>>>> md_check_recovery
>>>> if (mddev_trylock(mddev))
>>>> md_update_sb
>>>>
>>>> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
>>>>
>>>> Hence cyclic dependency exist, in order to fix the problem, we must
>>>> break one of them. Dependency 1 and 2 can't be broken because they are
>>>> foundation design. Dependency 4 may be possible if it can be guaranteed
>>>> that no io can be inflight, however, this requires a new mechanism which
>>>> seems complex. Dependency 3 is a good choice, because idle/frozen only
>>>> requires sync thread to finish, which can be done asynchronously that is
>>>> already implemented, and 'reconfig_mutex' is not needed anymore.
>>>>
>>>> This patch switch 'idle' and 'frozen' to wait sync thread to be done
>>>> asynchronously, and this patch also add a sequence counter to record how
>>>> many times sync thread is done, so that 'idle' won't keep waiting on new
>>>> started sync thread.
>>>
>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>> idle_sync_thread, if sync_seq isn't equal
>>>
>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>> thread already, right? Why do
>>>
>>> you say 'new started sync thread' here?
>>>
>>> Regards
>>>
>>> Xiao
>>>
>>>
>>>>
> >>>> Noted that raid456 has a similar deadlock([1]), and it's verified[2] this
>>>> deadlock can be fixed by this patch as well.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-raid/[email protected]/T/#t
>>>>
>>>> [2]
>>>> https://lore.kernel.org/linux-raid/[email protected]/
>>>>
>>>> Signed-off-by: Yu Kuai <[email protected]>
>>>> ---
>>>> drivers/md/md.c | 23 +++++++++++++++++++----
>>>> drivers/md/md.h | 2 ++
>>>> 2 files changed, 21 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 63a993b52cd7..7912de0e4d12 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
>>>> timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
>>>> atomic_set(&mddev->active, 1);
>>>> atomic_set(&mddev->openers, 0);
>>>> + atomic_set(&mddev->sync_seq, 0);
>>>> spin_lock_init(&mddev->lock);
>>>> atomic_set(&mddev->flush_pending, 0);
>>>> init_waitqueue_head(&mddev->sb_wait);
>>>> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
>>>> if (work_pending(&mddev->del_work))
>>>> flush_workqueue(md_misc_wq);
>>>> - if (mddev->sync_thread) {
>>>> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>> - md_reap_sync_thread(mddev);
>>>> - }
>>>> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>> + /*
>>>> + * Thread might be blocked waiting for metadata update which will
>>>> now
>>>> + * never happen
>>>> + */
>>>> + md_wakeup_thread_directly(mddev->sync_thread);
>>>> mddev_unlock(mddev);
>>>> }
>>>> static void idle_sync_thread(struct mddev *mddev)
>>>> {
>>>> + int sync_seq = atomic_read(&mddev->sync_seq);
>>>> +
>>>> mutex_lock(&mddev->sync_mutex);
>>>> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>> stop_sync_thread(mddev);
>>>> +
>>>> + wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
>>>> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>>>> +
>>>> mutex_unlock(&mddev->sync_mutex);
>>>> }
>>>> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
>>>> *mddev)
>>>> mutex_init(&mddev->delete_mutex);
>>>> set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>> stop_sync_thread(mddev);
>>>> +
>>>> + wait_event(resync_wait, mddev->sync_thread == NULL &&
>>>> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
>>>> +
>>>> mutex_unlock(&mddev->sync_mutex);
>>>> }
>>>> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
>>>> /* resync has finished, collect result */
>>>> md_unregister_thread(&mddev->sync_thread);
>>>> + atomic_inc(&mddev->sync_seq);
>>>> +
>>>> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>>>> !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>>>> mddev->degraded != mddev->raid_disks) {
>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>> index 2fa903de5bd0..7cab9c7c45b8 100644
>>>> --- a/drivers/md/md.h
>>>> +++ b/drivers/md/md.h
>>>> @@ -539,6 +539,8 @@ struct mddev {
>>>> /* Used to synchronize idle and frozen for action_store() */
>>>> struct mutex sync_mutex;
>>>> + /* The sequence number for sync thread */
>>>> + atomic_t sync_seq;
>>>> bool has_superblocks:1;
>>>> bool fail_last_dev:1;
>>>
>>> --
>>> dm-devel mailing list
>>> [email protected]
>>> https://listman.redhat.com/mailman/listinfo/dm-devel
>>
>
> .
>


2023-06-14 06:53:01

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Wed, Jun 14, 2023 at 2:05 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2023/06/14 11:47, Xiao Ni 写道:
> > On Wed, Jun 14, 2023 at 9:48 AM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> 在 2023/06/13 22:50, Xiao Ni 写道:
> >>>
> >>> 在 2023/5/29 下午9:20, Yu Kuai 写道:
> >>>> From: Yu Kuai <[email protected]>
> >>>>
> >>>> Our test found a following deadlock in raid10:
> >>>>
> >>>> 1) Issue a normal write, and such write failed:
> >>>>
> >>>> raid10_end_write_request
> >>>> set_bit(R10BIO_WriteError, &r10_bio->state)
> >>>> one_write_done
> >>>> reschedule_retry
> >>>>
> >>>> // later from md thread
> >>>> raid10d
> >>>> handle_write_completed
> >>>> list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
> >>>>
> >>>> // later from md thread
> >>>> raid10d
> >>>> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> >>>> list_move(conf->bio_end_io_list.prev, &tmp)
> >>>> r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> >>>> raid_end_bio_io(r10_bio)
> >>>>
> >>>> Dependency chain 1: normal io is waiting for updating superblock
> >>>
> >>> Hi Kuai
> >>>
> >>> It looks like the above situation is more complex. It only needs a
> >>> normal write and md_write_start needs to
> >>>
> >>> wait until the metadata is written to member disks, right? If so, it
> >>> doesn't need to introduce raid10 write failure
> >>>
> >>> here. I guess, it should be your test case. It's nice, if you can put
> >>> your test steps in the patch. But for the analysis
> >>>
> >>> of the deadlock here, it's better to be simple.
> >>
> >> Test script can be found here, it's pretty easy to trigger:
> >>
> >> https://patchwork.kernel.org/project/linux-raid/patch/[email protected]/
> >
> > Thanks for this.
> >>
> >> While reviewing the related code, I found that io can only be added to
> >> list bio_end_io_list from handle_write_completed() if such io failed, so
> >> I think io failure is needed to trigger deadlock from daemon thread.
> >>
> >> I think the key point is how MD_SB_CHANGE_PENDING is set:
> >>
> >> 1) raid10_error() and rdev_set_badblocks(), trigger by io failure;
> >> 2) raid10_write_request() related to reshape;
> >> 3) md_write_start() and md_allow_write(), and mddev->in_sync is set,
> >> however, I was thinking this is not a common case;
> >>
> >> 1) is used here because it's quite easy to trigger and this is what
> >> we meet in real test. 3) is possible but I will say let's keep 1), I
> >> don't think it's necessary to reproduce this deadlock through another
> >> path again.
> >
> > It makes sense. Let's go back to the first path mentioned in the patch.
> >
> >> 1) Issue a normal write, and such write failed:
> >>
> >> raid10_end_write_request
> >> set_bit(R10BIO_WriteError, &r10_bio->state)
> >> one_write_done
> >> reschedule_retry
> >
> > This is good.
> >>
> >> // later from md thread
> >> raid10d
> >> handle_write_completed
> >> list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
> >
> > I have a question here. handle_write_completed() should run
> > narrow_write_error(). In the test case, will narrow_write_error() run
> > successfully? Or will it fail and call rdev_set_badblocks() and
> > md_error(), so that MD_SB_CHANGE_PENDING is set?
>
> r10_bio will always be added to bio_end_io_list, no matter whether
> narrow_write_error() succeeds or not. Dependency chain 1 here just
> indicates that handling this r10_bio will wait for the superblock
> update; it's not where MD_SB_CHANGE_PENDING is set...
>
> And MD_SB_CHANGE_PENDING can be set from narrow_write_error() and other
> places where rdev_set_badblocks() is called.

Because your patch doesn't show which step sets MD_SB_CHANGE_PENDING,
I had to guess. It's a normal write, so md_write_start() can also set
the flag, and in that case it can cause the same deadlock. So it's
better to state which step sets the flag.
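
(For reference, the md_write_start() path in question looks roughly like
this; a simplified sketch from memory, not verbatim kernel code:

md_write_start
  if (mddev->in_sync) {
      mddev->in_sync = 0;
      set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
      set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
      md_wakeup_thread(mddev->thread);
  }
  /* the normal write blocks here until the superblock update is done */
  wait_event(mddev->sb_wait,
             !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
)
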
> >
> >>
> >> // later from md thread
> >> raid10d
> >> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> -> It's here, if the flag is set, bio won't be handled.

Yes.

> >> list_move(conf->bio_end_io_list.prev, &tmp)
> >> r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
> >> raid_end_bio_io(r10_bio)
> >>
> >> Dependency chain 1: normal io is waiting for updating superblock
> >
> > It's a little hard to understand, because it doesn't show how normal
> > io waits for a superblock update. Based on your last email, I guess
> > you mean that rdev_set_badblocks() sets MD_SB_CHANGE_PENDING, but the
> > flag can't be cleared, so the bios can't be added to bio_end_io_list,
> > so the io requests can't be finished.
>
> It's not that the bios can't be added to bio_end_io_list; it's that
> bios on this list can't be handled while sb_flags is set.

Sorry for that; I wanted to say the same thing. Now I understand the case fully.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Regards
> > Xiao
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>>>
> >>>> 2) Trigger a recovery:
> >>>>
> >>>> raid10_sync_request
> >>>> raise_barrier
> >>>>
> >>>> Dependency chain 2: sync thread is waiting for normal io
> >>>>
> >>>> 3) echo idle/frozen to sync_action:
> >>>>
> >>>> action_store
> >>>> mddev_lock
> >>>> md_unregister_thread
> >>>> kthread_stop
> >>>>
> >>>> Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
> >>>>
> >>>> 4) md thread can't update superblock:
> >>>>
> >>>> raid10d
> >>>> md_check_recovery
> >>>> if (mddev_trylock(mddev))
> >>>> md_update_sb
> >>>>
> >>>> Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
> >>>>
> >>>> Hence cyclic dependency exist, in order to fix the problem, we must
> >>>> break one of them. Dependency 1 and 2 can't be broken because they are
> >>>> foundation design. Dependency 4 may be possible if it can be guaranteed
> >>>> that no io can be inflight, however, this requires a new mechanism which
> >>>> seems complex. Dependency 3 is a good choice, because idle/frozen only
> >>>> requires sync thread to finish, which can be done asynchronously that is
> >>>> already implemented, and 'reconfig_mutex' is not needed anymore.
> >>>>
> >>>> This patch switch 'idle' and 'frozen' to wait sync thread to be done
> >>>> asynchronously, and this patch also add a sequence counter to record how
> >>>> many times sync thread is done, so that 'idle' won't keep waiting on new
> >>>> started sync thread.
> >>>
> >>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>> idle_sync_thread, if sync_seq isn't equal
> >>>
> >>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>> thread already, right? Why do
> >>>
> >>> you say 'new started sync thread' here?
> >>>
> >>> Regards
> >>>
> >>> Xiao
> >>>
> >>>
> >>>>
> >>>> Noted that raid456 has a similar deadlock([1]), and it's verified[2] this
> >>>> deadlock can be fixed by this patch as well.
> >>>>
> >>>> [1]
> >>>> https://lore.kernel.org/linux-raid/[email protected]/T/#t
> >>>>
> >>>> [2]
> >>>> https://lore.kernel.org/linux-raid/[email protected]/
> >>>>
> >>>> Signed-off-by: Yu Kuai <[email protected]>
> >>>> ---
> >>>> drivers/md/md.c | 23 +++++++++++++++++++----
> >>>> drivers/md/md.h | 2 ++
> >>>> 2 files changed, 21 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>>> index 63a993b52cd7..7912de0e4d12 100644
> >>>> --- a/drivers/md/md.c
> >>>> +++ b/drivers/md/md.c
> >>>> @@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
> >>>> timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
> >>>> atomic_set(&mddev->active, 1);
> >>>> atomic_set(&mddev->openers, 0);
> >>>> + atomic_set(&mddev->sync_seq, 0);
> >>>> spin_lock_init(&mddev->lock);
> >>>> atomic_set(&mddev->flush_pending, 0);
> >>>> init_waitqueue_head(&mddev->sb_wait);
> >>>> @@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
> >>>> if (work_pending(&mddev->del_work))
> >>>> flush_workqueue(md_misc_wq);
> >>>> - if (mddev->sync_thread) {
> >>>> - set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >>>> - md_reap_sync_thread(mddev);
> >>>> - }
> >>>> + set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >>>> + /*
> >>>> + * Thread might be blocked waiting for metadata update which will
> >>>> now
> >>>> + * never happen
> >>>> + */
> >>>> + md_wakeup_thread_directly(mddev->sync_thread);
> >>>> mddev_unlock(mddev);
> >>>> }
> >>>> static void idle_sync_thread(struct mddev *mddev)
> >>>> {
> >>>> + int sync_seq = atomic_read(&mddev->sync_seq);
> >>>> +
> >>>> mutex_lock(&mddev->sync_mutex);
> >>>> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >>>> stop_sync_thread(mddev);
> >>>> +
> >>>> + wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
> >>>> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >>>> +
> >>>> mutex_unlock(&mddev->sync_mutex);
> >>>> }
> >>>> @@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev
> >>>> *mddev)
> >>>> mutex_init(&mddev->delete_mutex);
> >>>> set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >>>> stop_sync_thread(mddev);
> >>>> +
> >>>> + wait_event(resync_wait, mddev->sync_thread == NULL &&
> >>>> + !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> >>>> +
> >>>> mutex_unlock(&mddev->sync_mutex);
> >>>> }
> >>>> @@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
> >>>> /* resync has finished, collect result */
> >>>> md_unregister_thread(&mddev->sync_thread);
> >>>> + atomic_inc(&mddev->sync_seq);
> >>>> +
> >>>> if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
> >>>> !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
> >>>> mddev->degraded != mddev->raid_disks) {
> >>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >>>> index 2fa903de5bd0..7cab9c7c45b8 100644
> >>>> --- a/drivers/md/md.h
> >>>> +++ b/drivers/md/md.h
> >>>> @@ -539,6 +539,8 @@ struct mddev {
> >>>> /* Used to synchronize idle and frozen for action_store() */
> >>>> struct mutex sync_mutex;
> >>>> + /* The sequence number for sync thread */
> >>>> + atomic_t sync_seq;
> >>>> bool has_superblocks:1;
> >>>> bool fail_last_dev:1;
> >>>
> >>> --
> >>> dm-devel mailing list
> >>> [email protected]
> >>> https://listman.redhat.com/mailman/listinfo/dm-devel
> >>
> >
> > .
> >
>


2023-06-14 07:34:10

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 6/6] md: enhance checking in md_check_recovery()


在 2023/5/29 下午9:20, Yu Kuai 写道:
> From: Yu Kuai <[email protected]>
>
> For md_check_recovery():
>
> 1) if 'MD_RECOVERY_RUNNING' is not set, register a new sync_thread.
> 2) if 'MD_RECOVERY_RUNNING' is set:
> a) if 'MD_RECOVERY_DONE' is not set, don't do anything, wait for
> md_do_sync() to be done.
> b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. Current code
> expects that sync_thread is not NULL, otherwise a new sync_thread will
> be registered, which will corrupt the array.
>
> Make sure md_check_recovery() won't register a new sync_thread if
> 'MD_RECOVERY_RUNNING' is still set, and a new WARN_ON_ONCE() is added for
> the above corruption.
>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.c | 22 +++++++++++++++-------
> 1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f90226e6ddf8..9da0fc906bbd 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9397,16 +9397,24 @@ void md_check_recovery(struct mddev *mddev)
> if (mddev->sb_flags)
> md_update_sb(mddev, 0);
>
> - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> - !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> - /* resync/recovery still happening */
> - clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> - goto unlock;
> - }
> - if (mddev->sync_thread) {
> + /*
> + * Never start a new sync thread if MD_RECOVERY_RUNNING is
> + * still set.
> + */
> + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
> + if (!test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> + /* resync/recovery still happening */
> + clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> + goto unlock;
> + }
> +
> + if (WARN_ON_ONCE(!mddev->sync_thread))
> + goto unlock;
> +
> md_reap_sync_thread(mddev);
> goto unlock;
> }
> +
> /* Set RUNNING before clearing NEEDED to avoid
> * any transients in the value of "sync_action".
> */

It makes the logic clearer.

Reviewed-by: Xiao Ni <[email protected]>


2023-06-14 07:34:32

by Xiao Ni

[permalink] [raw]
Subject: Re: [PATCH -next v2 5/6] md: wake up 'resync_wait' at last in md_reap_sync_thread()


在 2023/5/29 下午9:20, Yu Kuai 写道:
> From: Yu Kuai <[email protected]>
>
> md_reap_sync_thread() is just replaced with wait_event(resync_wait, ...)
> from action_store(), just make sure action_store() will still wait for
> everything to be done in md_reap_sync_thread().
>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/md.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 7912de0e4d12..f90226e6ddf8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9531,7 +9531,6 @@ void md_reap_sync_thread(struct mddev *mddev)
> if (mddev_is_clustered(mddev) && is_reshaped
> && !test_bit(MD_CLOSING, &mddev->flags))
> md_cluster_ops->update_size(mddev, old_dev_sectors);
> - wake_up(&resync_wait);
> /* flag recovery needed just to double check */
> set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> sysfs_notify_dirent_safe(mddev->sysfs_completed);
> @@ -9539,6 +9538,7 @@ void md_reap_sync_thread(struct mddev *mddev)
> md_new_event();
> if (mddev->event_work.func)
> queue_work(md_misc_wq, &mddev->event_work);
> + wake_up(&resync_wait);
> }
> EXPORT_SYMBOL(md_reap_sync_thread);
>


Reviewed-by: Xiao Ni <[email protected]>


2023-06-14 07:39:51

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2023/06/14 9:48, Yu Kuai 写道:
>
>
> >>
> >> In the patch, sync_seq is added in md_reap_sync_thread. In
> >> idle_sync_thread, if sync_seq isn't equal
> >>
> >> mddev->sync_seq, it should mean there is someone that stops the sync
> >> thread already, right? Why do
> >>
> >> you say 'new started sync thread' here?
>
> If someone stops the sync thread, and new sync thread is not started,
> then this sync_seq won't make a difference, above wait_event() will not
> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> So 'sync_seq' is only used when the old sync thread stops and new sync
> thread starts, add 'sync_seq' will bypass this case.

Hi

If a new sync thread starts, why would sync_seq be different? sync_seq
is only incremented in md_reap_sync_thread. And when a new sync request
starts, it hasn't been stopped again yet, right?

At first, the sync_seq is 0

admin1
echo idle > sync_action
idle_sync_thread(sync_seq is 1)
echo resync > sync_action (new sync)

Then admin2 echoes idle > sync_action, sync_seq is still 1

Regards
Xiao

>
> Thanks,
> Kuai
>


2023-06-14 07:51:35

by Yu Kuai

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Hi,

在 2023/06/14 15:12, Xiao Ni 写道:
> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> 在 2023/06/14 9:48, Yu Kuai 写道:
>>
>>
>>>>
>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>>> idle_sync_thread, if sync_seq isn't equal
>>>>
>>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>>> thread already, right? Why do
>>>>
>>>> you say 'new started sync thread' here?
>>
>> If someone stops the sync thread, and new sync thread is not started,
>> then this sync_seq won't make a difference, above wait_event() will not
>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
>> So 'sync_seq' is only used when the old sync thread stops and new sync
>> thread starts, add 'sync_seq' will bypass this case.
>
> Hi
>
> If a new sync thread starts, why can sync_seq be different? sync_seq
> is only added in md_reap_sync_thread. And when a new sync request
> starts, it can't stop the sync request again?
>
> Af first, the sync_seq is 0
>
> admin1
> echo idle > sync_action
> idle_sync_thread(sync_seq is 1)

Wait, I'm confused here: how can sync_seq be 1 here? I suppose you
mean that a sync_thread has just finished?

Then the question is: if idle_sync_thread() reads sync_seq after the
old sync_thread is done, and a new sync_thread starts before
wait_event() is called, should we wait for this new sync_thread?

My answer is that we should, but I'm also ok with not waiting for this
new sync_thread; I don't think this behaviour matters. The key point is
that once wait_event() is called from idle_sync_thread(), this
wait_event() should not wait for a new sync_thread...

> echo resync > sync_action (new sync)

If this is behind "echo idle > sync_action", idle_sync_thread should not
see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.

Thanks,
Kuai
>
> Then admin2 echos idle > sync_action, sync_seq is still 1
>
> Regards
> Xiao
>
>>
>> Thanks,
>> Kuai
>>
>
> .
>


2023-06-14 08:16:43

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2023/06/14 15:12, Xiao Ni 写道:
> > On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> 在 2023/06/14 9:48, Yu Kuai 写道:
> >>
> >>
> >>>>
> >>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>>> idle_sync_thread, if sync_seq isn't equal
> >>>>
> >>>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>>> thread already, right? Why do
> >>>>
> >>>> you say 'new started sync thread' here?
> >>
> >> If someone stops the sync thread, and new sync thread is not started,
> >> then this sync_seq won't make a difference, above wait_event() will not
> >> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> >> So 'sync_seq' is only used when the old sync thread stops and new sync
> >> thread starts, add 'sync_seq' will bypass this case.
> >
> > Hi
> >
> > If a new sync thread starts, why can sync_seq be different? sync_seq
> > is only added in md_reap_sync_thread. And when a new sync request
> > starts, it can't stop the sync request again?
> >
> > Af first, the sync_seq is 0
> >
> > admin1
> > echo idle > sync_action
> > idle_sync_thread(sync_seq is 1)
>
> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> mean that there is a sync_thread just finished?

Hi Kuai

Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
finishes, and md_reap_sync_thread increments sync_seq. Do I understand
your patch right?

>
> Then the problem is that idle_sync_thread() read sync_seq after the old
> sync_thread is done, and new sync_thread start before wait_event() is
> called, should we wait for this new sync_thread?
>
> My answer here is that we should, but I'm also ok to not wait this new
> sync_thread, I don't think this behaviour matters. The key point here
> is that once wait_event() is called from idle_sync_thread(), this
> wait_event() should not wait for new sync_thread...

I think we should wait. If we don't wait for it, there is a problem:
someone echoes idle to sync_action and it sometimes doesn't take
effect. That's strange.

>
> > echo resync > sync_action (new sync)
>
> If this is behind "echo idle > sync_action", idle_sync_thread should not
> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.

`echo resync > sync_action` can't change the sync_seq. So 'echo idle >
sync_action' still waits until MD_RECOVERY_RUNNING is cleared?

Regards
Xiao

>
> Thanks,
> Kuai
> >
> > Then admin2 echos idle > sync_action, sync_seq is still 1
> >
> > Regards
> > Xiao
> >
> >>
> >> Thanks,
> >> Kuai
> >>
> >
> > .
> >
>


2023-06-14 09:07:31

by Yu Kuai

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Hi,

在 2023/06/14 15:57, Xiao Ni 写道:
> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> 在 2023/06/14 15:12, Xiao Ni 写道:
>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> 在 2023/06/14 9:48, Yu Kuai 写道:
>>>>
>>>>
>>>>>>
>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>>>>> idle_sync_thread, if sync_seq isn't equal
>>>>>>
>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>>>>> thread already, right? Why do
>>>>>>
>>>>>> you say 'new started sync thread' here?
>>>>
>>>> If someone stops the sync thread, and new sync thread is not started,
>>>> then this sync_seq won't make a difference, above wait_event() will not
>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
>>>> thread starts, add 'sync_seq' will bypass this case.
>>>
>>> Hi
>>>
>>> If a new sync thread starts, why can sync_seq be different? sync_seq
>>> is only added in md_reap_sync_thread. And when a new sync request
>>> starts, it can't stop the sync request again?
>>>
>>> Af first, the sync_seq is 0
>>>
>>> admin1
>>> echo idle > sync_action
>>> idle_sync_thread(sync_seq is 1)
>>
>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
>> mean that there is a sync_thread just finished?
>
> Hi Kuai
>
> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> patch right?

Yes; note that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
is set.

>
>>
>> Then the problem is that idle_sync_thread() read sync_seq after the old
>> sync_thread is done, and new sync_thread start before wait_event() is
>> called, should we wait for this new sync_thread?
>>
>> My answer here is that we should, but I'm also ok to not wait this new
>> sync_thread, I don't think this behaviour matters. The key point here
>> is that once wait_event() is called from idle_sync_thread(), this
>> wait_event() should not wait for new sync_thread...
>
> I think we should wait. If we don't wait for it, there is a problem.
> One person echos idle to sync_action and it doesn't work sometimes.
> It's a strange thing.
>

Ok. I'll add a new comment to emphasize that idle_sync_thread() won't
wait for a new sync_thread that is started after wait_event().

>>
>>> echo resync > sync_action (new sync)
>>
>> If this is behind "echo idle > sync_action", idle_sync_thread should not
>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
>
> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?

This is not accurate: if `echo resync > sync_action` triggers a new
sync_thread, then sync_seq is updated when this sync_thread is done.
During this period MD_RECOVERY_RUNNING is still set, so `echo idle >
sync_action` will wait for the sync_thread to be done.
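
To put the two pieces side by side (condensed from the diffs quoted in
patches 4 and 5 above; not the exact code):

/* md_reap_sync_thread(), condensed: */
md_unregister_thread(&mddev->sync_thread);
atomic_inc(&mddev->sync_seq);      /* counts every reaped sync_thread */
...
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
...
wake_up(&resync_wait);             /* moved to the end by patch 5 */

/* idle_sync_thread(), condensed: */
int sync_seq = atomic_read(&mddev->sync_seq); /* snapshot before stopping */
...
wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
           !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));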

Thanks,
Kuai
>
> Regards
> Xiao
>
>>
>> Thanks,
>> Kuai
>>>
>>> Then admin2 echos idle > sync_action, sync_seq is still 1
>>>
>>> Regards
>>> Xiao
>>>
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>
>>> .
>>>
>>
>
> .
>


2023-06-14 09:52:56

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2023/06/14 15:57, Xiao Ni 写道:
> > On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> 在 2023/06/14 15:12, Xiao Ni 写道:
> >>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> 在 2023/06/14 9:48, Yu Kuai 写道:
> >>>>
> >>>>
> >>>>>>
> >>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>>>>> idle_sync_thread, if sync_seq isn't equal
> >>>>>>
> >>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>>>>> thread already, right? Why do
> >>>>>>
> >>>>>> you say 'new started sync thread' here?
> >>>>
> >>>> If someone stops the sync thread, and new sync thread is not started,
> >>>> then this sync_seq won't make a difference, above wait_event() will not
> >>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> >>>> So 'sync_seq' is only used when the old sync thread stops and new sync
> >>>> thread starts, add 'sync_seq' will bypass this case.
> >>>
> >>> Hi
> >>>
> >>> If a new sync thread starts, why can sync_seq be different? sync_seq
> >>> is only added in md_reap_sync_thread. And when a new sync request
> >>> starts, it can't stop the sync request again?
> >>>
> >>> Af first, the sync_seq is 0
> >>>
> >>> admin1
> >>> echo idle > sync_action
> >>> idle_sync_thread(sync_seq is 1)
> >>
> >> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> >> mean that there is a sync_thread just finished?
> >
> > Hi Kuai
> >
> > Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> > finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> > patch right?
>
> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
> is set.
>
> >
> >>
> >> Then the problem is that idle_sync_thread() read sync_seq after the old
> >> sync_thread is done, and new sync_thread start before wait_event() is
> >> called, should we wait for this new sync_thread?
> >>
> >> My answer here is that we should, but I'm also ok to not wait this new
> >> sync_thread, I don't think this behaviour matters. The key point here
> >> is that once wait_event() is called from idle_sync_thread(), this
> >> wait_event() should not wait for new sync_thread...
> >
> > I think we should wait. If we don't wait for it, there is a problem.
> > One person echos idle to sync_action and it doesn't work sometimes.
> > It's a strange thing.
> >
>
> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
> for new sync_thread that is started after wait_event().

I suggest removing this function. Without this change, it's simpler
and it works well without problems. People who echo idle to sync_action
need to wait until the sync action finishes. The code semantics are
clear and simple.
>
> >>
> >>> echo resync > sync_action (new sync)
> >>
> >> If this is behind "echo idle > sync_action", idle_sync_thread should not
> >> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
> >
> > `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> > sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
>
> This is not accurate, if `echo resync > sync_action` triggers a new
> sync_thread, then sync_seq is updated when this sync_thread is done,
> during this period, MD_RECOVERY_RUNNING is still set, so `echo idle
> >sync_action` will wait for sync_thread to be done.

I can understand your comment, but sorry, I still can't see how
sync_seq works. Could you give a specific case that explains how it
works?

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Regards
> > Xiao
> >
> >>
> >> Thanks,
> >> Kuai
> >>>
> >>> Then admin2 echos idle > sync_action, sync_seq is still 1
> >>>
> >>> Regards
> >>> Xiao
> >>>
> >>>>
> >>>> Thanks,
> >>>> Kuai
> >>>>
> >>>
> >>> .
> >>>
> >>
> >
> > .
> >
>


2023-06-15 01:40:14

by Yu Kuai

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Hi,

在 2023/06/14 17:08, Xiao Ni 写道:
> On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> 在 2023/06/14 15:57, Xiao Ni 写道:
>>> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> 在 2023/06/14 15:12, Xiao Ni 写道:
>>>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> 在 2023/06/14 9:48, Yu Kuai 写道:
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
>>>>>>>> idle_sync_thread, if sync_seq isn't equal
>>>>>>>>
>>>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
>>>>>>>> thread already, right? Why do
>>>>>>>>
>>>>>>>> you say 'new started sync thread' here?
>>>>>>
>>>>>> If someone stops the sync thread, and new sync thread is not started,
>>>>>> then this sync_seq won't make a difference, above wait_event() will not
>>>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
>>>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
>>>>>> thread starts, add 'sync_seq' will bypass this case.
>>>>>
>>>>> Hi
>>>>>
>>>>> If a new sync thread starts, why can sync_seq be different? sync_seq
>>>>> is only added in md_reap_sync_thread. And when a new sync request
>>>>> starts, it can't stop the sync request again?
>>>>>
>>>>> Af first, the sync_seq is 0
>>>>>
>>>>> admin1
>>>>> echo idle > sync_action
>>>>> idle_sync_thread(sync_seq is 1)
>>>>
>>>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
>>>> mean that there is a sync_thread just finished?
>>>
>>> Hi Kuai
>>>
>>> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
>>> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
>>> patch right?
>>
>> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
>> is set.
>>
>>>
>>>>
>>>> Then the problem is that idle_sync_thread() read sync_seq after the old
>>>> sync_thread is done, and new sync_thread start before wait_event() is
>>>> called, should we wait for this new sync_thread?
>>>>
>>>> My answer here is that we should, but I'm also ok to not wait this new
>>>> sync_thread, I don't think this behaviour matters. The key point here
>>>> is that once wait_event() is called from idle_sync_thread(), this
>>>> wait_event() should not wait for new sync_thread...
>>>
>>> I think we should wait. If we don't wait for it, there is a problem.
>>> One person echos idle to sync_action and it doesn't work sometimes.
>>> It's a strange thing.
>>>
>>
>> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
>> for new sync_thread that is started after wait_event().
>
> I suggest removing this function. Without this change, it's more
> simple and it can work well without problem. The people that echo idle
> to sync_action needs to wait until the sync action finishes. The code
> semantic is clear and simple.
>>
>>>>
>>>>> echo resync > sync_action (new sync)
>>>>
>>>> If this is behind "echo idle > sync_action", idle_sync_thread should not
>>>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
>>>
>>> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
>>> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
>>
>> This is not accurate, if `echo resync > sync_action` triggers a new
>> sync_thread, then sync_seq is updated when this sync_thread is done,
>> during this period, MD_RECOVERY_RUNNING is still set, so `echo idle
>> >sync_action` will wait for sync_thread to be done.
>
> I can understand your comment, but sorry, I still can't get how
> sync_seq works. Could you give a specific case that explains how it
> works?

Ok, the problem is that echo idle is supposed to interrupt the
sync_thread and stop it quickly. Now that we don't hold the mutex here,
we can't prevent a new sync_thread from starting. For example:

1) a sync_thread A is running, MD_RECOVERY_RUNNING is set;

2) echo idle: A will be interrupted; the mutex is not held and
idle_sync_thread() is waiting for MD_RECOVERY_RUNNING to be cleared.

3) A is interrupted; it'll clear MD_RECOVERY_RUNNING and try to wake up
idle_sync_thread(). However, before idle_sync_thread() is woken, A can
be done and a new sync_thread B can be started, and MD_RECOVERY_RUNNING
will be set again.

4) idle_sync_thread() finally wakes up; however, MD_RECOVERY_RUNNING is
set and it will keep waiting. And this time B won't be interrupted.
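
With the sync_seq snapshot from the patch, the same sequence resolves
differently (a hand-written timeline, not code from the patch):

/* t0: sync_thread A running, sync_seq == N                         */
/* t1: idle_sync_thread() snapshots sync_seq = N, interrupts A      */
/* t2: A is reaped, md_reap_sync_thread() bumps sync_seq to N + 1,  */
/*     then sync_thread B starts, MD_RECOVERY_RUNNING is set again  */
/* t3: wait_event() wakes up: N != N + 1, so idle_sync_thread()     */
/*     returns instead of waiting for B                             */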

Thanks,
Kuai


2023-06-15 08:19:40

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Thu, Jun 15, 2023 at 9:29 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2023/06/14 17:08, Xiao Ni 写道:
> > On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> 在 2023/06/14 15:57, Xiao Ni 写道:
> >>> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> 在 2023/06/14 15:12, Xiao Ni 写道:
> >>>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> 在 2023/06/14 9:48, Yu Kuai 写道:
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> >>>>>>>> idle_sync_thread, if sync_seq isn't equal
> >>>>>>>>
> >>>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
> >>>>>>>> thread already, right? Why do
> >>>>>>>>
> >>>>>>>> you say 'new started sync thread' here?
> >>>>>>
> >>>>>> If someone stops the sync thread, and new sync thread is not started,
> >>>>>> then this sync_seq won't make a difference, above wait_event() will not
> >>>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> >>>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
> >>>>>> thread starts, add 'sync_seq' will bypass this case.
> >>>>>
> >>>>> Hi
> >>>>>
> >>>>> If a new sync thread starts, why can sync_seq be different? sync_seq
> >>>>> is only added in md_reap_sync_thread. And when a new sync request
> >>>>> starts, it can't stop the sync request again?
> >>>>>
> >>>>> Af first, the sync_seq is 0
> >>>>>
> >>>>> admin1
> >>>>> echo idle > sync_action
> >>>>> idle_sync_thread(sync_seq is 1)
> >>>>
> >>>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> >>>> mean that there is a sync_thread just finished?
> >>>
> >>> Hi Kuai
> >>>
> >>> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> >>> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> >>> patch right?
> >>
> >> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
> >> is set.
> >>
> >>>
> >>>>
> >>>> Then the problem is that idle_sync_thread() read sync_seq after the old
> >>>> sync_thread is done, and new sync_thread start before wait_event() is
> >>>> called, should we wait for this new sync_thread?
> >>>>
> >>>> My answer here is that we should, but I'm also ok to not wait this new
> >>>> sync_thread, I don't think this behaviour matters. The key point here
> >>>> is that once wait_event() is called from idle_sync_thread(), this
> >>>> wait_event() should not wait for new sync_thread...
> >>>
> >>> I think we should wait. If we don't wait for it, there is a problem.
> >>> One person echos idle to sync_action and it doesn't work sometimes.
> >>> It's a strange thing.
> >>>
> >>
> >> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
> >> for new sync_thread that is started after wait_event().
> >
> > I suggest removing this function. Without this change, it's more
> > simple and it can work well without problem. The people that echo idle
> > to sync_action needs to wait until the sync action finishes. The code
> > semantic is clear and simple.
> >>
> >>>>
> >>>>> echo resync > sync_action (new sync)
> >>>>
> >>>> If this is behind "echo idle > sync_action", idle_sync_thread should not
> >>>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
> >>>
> >>> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> >>> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
> >>
> >> This is not accurate, if `echo resync > sync_action` triggers a new
> >> sync_thread, then sync_seq is updated when this sync_thread is done,
> >> during this period, MD_RECOVERY_RUNNING is still set, so `echo idle
> >> >sync_action` will wait for sync_thread to be done.
> >
> > I can understand your comment, but sorry, I still can't get how
> > sync_seq works. Could you give a specific case that explains how it
> > works?
>
> Ok, the problem is that echo idle is supposed to interrupt sync_thread
> and stop sync_thread quickly. Now that we don't hold mutex here, we
> can't prevent new sync_thread from starting. For example:
>
> 1) a sync_thread A is running, MD_RECOVERY_RUNNING is set;
>
> 2) echo idle, A will be interrupted, mutex is not held and
> idle_sync_thread() is waiting for MD_RECOVERY_RUNNING to be cleared.
>
> 3) A is interrupted, it'll clear MD_RECOVERY_RUNNING and try to wakeup
> idle_sync_thread(), however, before idle_sync_thread() is woken, A can
> be done and a new sync_thread B can be started, and MD_RECOVERY_RUNNING
> will be set again.
>
> 4) idle_sync_thread() finally wakes up, however, MD_RECOVERY_RUNNING is
> set and it will keep waiting. And this time B won't be interrupted.

Thanks for the example. Now I can understand its purpose. It's a side
effect of removing the mutex protection from idle_sync_thread.

There is a problem, though: the new sync thread is started in
md_check_recovery, and after your patch md_reap_sync_thread is called
in md_check_recovery too. So it looks like they can't happen at the
same time?

Regards
Xiao

>
> Thanks,
> Kuai
>
> --
> dm-devel mailing list
> [email protected]
> https://listman.redhat.com/mailman/listinfo/dm-devel


2023-06-15 08:37:42

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Thu, Jun 15, 2023 at 4:01 PM Xiao Ni <[email protected]> wrote:
>
> On Thu, Jun 15, 2023 at 9:29 AM Yu Kuai <[email protected]> wrote:
> >
> > Hi,
> >
> > 在 2023/06/14 17:08, Xiao Ni 写道:
> > > On Wed, Jun 14, 2023 at 4:29 PM Yu Kuai <[email protected]> wrote:
> > >>
> > >> Hi,
> > >>
> > >> 在 2023/06/14 15:57, Xiao Ni 写道:
> > >>> On Wed, Jun 14, 2023 at 3:38 PM Yu Kuai <[email protected]> wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> 在 2023/06/14 15:12, Xiao Ni 写道:
> > >>>>> On Wed, Jun 14, 2023 at 10:04 AM Yu Kuai <[email protected]> wrote:
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> 在 2023/06/14 9:48, Yu Kuai 写道:
> > >>>>>>
> > >>>>>>
> > >>>>>>>>
> > >>>>>>>> In the patch, sync_seq is added in md_reap_sync_thread. In
> > >>>>>>>> idle_sync_thread, if sync_seq isn't equal
> > >>>>>>>>
> > >>>>>>>> mddev->sync_seq, it should mean there is someone that stops the sync
> > >>>>>>>> thread already, right? Why do
> > >>>>>>>>
> > >>>>>>>> you say 'new started sync thread' here?
> > >>>>>>
> > >>>>>> If someone stops the sync thread, and new sync thread is not started,
> > >>>>>> then this sync_seq won't make a difference, above wait_event() will not
> > >>>>>> wait because !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) will pass.
> > >>>>>> So 'sync_seq' is only used when the old sync thread stops and new sync
> > >>>>>> thread starts, add 'sync_seq' will bypass this case.
> > >>>>>
> > >>>>> Hi
> > >>>>>
> > >>>>> If a new sync thread starts, why can sync_seq be different? sync_seq
> > >>>>> is only added in md_reap_sync_thread. And when a new sync request
> > >>>>> starts, it can't stop the sync request again?
> > >>>>>
> > >>>>> Af first, the sync_seq is 0
> > >>>>>
> > >>>>> admin1
> > >>>>> echo idle > sync_action
> > >>>>> idle_sync_thread(sync_seq is 1)
> > >>>>
> > >>>> Wait, I'm confused here, how can sync_seq to be 1 here? I suppose you
> > >>>> mean that there is a sync_thread just finished?
> > >>>
> > >>> Hi Kuai
> > >>>
> > >>> Yes. Because idle_sync_thread needs to wait until md_reap_sync_thread
> > >>> finishes. And md_reap_sync_thread adds sync_seq. Do I understand your
> > >>> patch right?
> > >>
> > >> Yes, noted that idle_sync_thread() will only wait if MD_RECOVERY_RUNNING
> > >> is set.
> > >>
> > >>>
> > >>>>
> > >>>> Then the problem is that idle_sync_thread() read sync_seq after the old
> > >>>> sync_thread is done, and new sync_thread start before wait_event() is
> > >>>> called, should we wait for this new sync_thread?
> > >>>>
> > >>>> My answer here is that we should, but I'm also ok to not wait this new
> > >>>> sync_thread, I don't think this behaviour matters. The key point here
> > >>>> is that once wait_event() is called from idle_sync_thread(), this
> > >>>> wait_event() should not wait for new sync_thread...
> > >>>
> > >>> I think we should wait. If we don't wait for it, there is a problem.
> > >>> One person echos idle to sync_action and it doesn't work sometimes.
> > >>> It's a strange thing.
> > >>>
> > >>
> > >> Ok. I'll add new comment to emphasize that idle_sync_thread() won't wait
> > >> for new sync_thread that is started after wait_event().
> > >
> > > I suggest removing this function. Without this change, it's more
> > > simple and it can work well without problem. The people that echo idle
> > > to sync_action needs to wait until the sync action finishes. The code
> > > semantic is clear and simple.
> > >>
> > >>>>
> > >>>>> echo resync > sync_action (new sync)
> > >>>>
> > >>>> If this is behind "echo idle > sync_action", idle_sync_thread should not
> > >>>> see that MD_RECOVERY_RUNNING is set and wait_event() won't wait at all.
> > >>>
> > >>> `echo resync > sync_action` can't change the sync_seq. So 'echo idle >
> > >>> sync_action' still waits until MD_RECOVERY_RUNNING is cleared?
> > >>
> > >> This is not accurate, if `echo resync > sync_action` triggers a new
> > >> sync_thread, then sync_seq is updated when this sync_thread is done,
> > >> during this period, MD_RECOVERY_RUNNING is still set, so `echo idle
> > >> >sync_action` will wait for sync_thread to be done.
> > >
> > > I can understand your comment, but sorry, I still can't get how
> > > sync_seq works. Could you give a specific case that explains how it
> > > works?
> >
> > Ok, the problem is that echo idle is supposed to interrupt sync_thread
> > and stop sync_thread quickly. Now that we don't hold mutex here, we
> > can't prevent new sync_thread from starting. For example:
> >
> > 1) a sync_thread A is running, MD_RECOVERY_RUNNING is set;
> >
> > 2) echo idle, A will be interrupted, mutex is not held and
> > idle_sync_thread() is waiting for MD_RECOVERY_RUNNING to be cleared.
> >
> > 3) A is interrupted, it'll clear MD_RECOVERY_RUNNING and try to wakeup
> > idle_sync_thread(), however, before idle_sync_thread() is woken, A can
> > be done and a new sync_thread B can be started, and MD_RECOVERY_RUNNING
> > will be set again.
> >
> > 4) idle_sync_thread() finally wakes up, however, MD_RECOVERY_RUNNING is
> > set and it will keep waiting. And this time B won't be interrupted.
>
> Thanks for the example. I can understand the usage of it. It's a side
> effect of removing the mutex protection from idle_sync_thread.
>
> There is a problem. A new sync thread is started in md_check_recovery.
> After your patch, md_reap_sync_thread is called in md_check_recovery
> too. So it looks like they can't happen at the same time?

After thinking a while, there is still a race possibility.

md_reap_sync_thread is called in the pers daemon (e.g. raid10d ->
md_check_recovery) and md_check_recovery returns. Before
idle_sync_thread is woken, a new sync thread can be started in
md_check_recovery again.

But that is really strange for someone who echoes idle to sync_action.
It's better to add some messages to notify users that they may need to
echo idle to sync_action again and retry. Is there a way for
md_reap_sync_thread to wait for idle_sync_thread?
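
Roughly, the window I have in mind (a hypothetical interleaving to
illustrate, not a captured trace):

// md thread                            // echo idle > sync_action
raid10d
  md_check_recovery
    md_reap_sync_thread
      clear MD_RECOVERY_RUNNING
      wake_up(&resync_wait)             // idle_sync_thread not scheduled yet
raid10d
  md_check_recovery
    register new sync_thread
    set MD_RECOVERY_RUNNING
                                        // idle_sync_thread finally runs,
                                        // sees MD_RECOVERY_RUNNING set again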

Regards
Xiao
>
> Regards
> Xiao
>
> >
> > Thanks,
> > Kuai
> >


2023-06-15 09:09:55

by Yu Kuai

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

Hi,

On 2023/06/15 16:17, Xiao Ni wrote:
>> Thanks for the example. I can understand the usage of it. It's a side
>> effect of removing the mutex protection from idle_sync_thread.
>>
>> There is a problem. A new sync thread is started in md_check_recovery.
>> After your patch, md_reap_sync_thread is called in md_check_recovery
>> too. So it looks like they can't happen at the same time?

Of course they can't. md_check_recovery() can only do one thing at a
time.

>
> After thinking a while, there is still a race possibility.
>
> md_reap_sync_thread is called in the pers daemon (e.g. raid10d ->
> md_check_recovery) and md_check_recovery returns. Before
> idle_sync_thread is woken, a new sync thread can be started in
> md_check_recovery again.
>
> But that is really strange for someone who echoes idle to sync_action.
> It's better to add some messages to notify users that they may need to
> echo idle to sync_action again and retry. Is there a way for
> md_reap_sync_thread to wait for idle_sync_thread?

I don't think this is a problem. echo idle only makes sure to interrupt
the current sync_thread; there is no guarantee that no sync_thread is
running after "echo idle" is done, with or without this patchset.
Before this patchset, a new sync thread could still start after the
mutex was released.

Users should "echo frozen" instead of "echo idle" if they really want
to prevent a new sync_thread from starting.
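
To make the difference concrete, here is a sketch of the two paths with
this set applied. idle_sync_thread() is from patch 3 as posted; the
body of frozen_sync_thread() is my reading of the intended code (i.e.
set_bit() plus the mutex_lock() fix), not a verbatim quote:

static void idle_sync_thread(struct mddev *mddev)
{
	mutex_lock(&mddev->sync_mutex);
	/* allow md_check_recovery() to start a new sync_thread later */
	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
	stop_sync_thread(mddev);
	mutex_unlock(&mddev->sync_mutex);
}

static void frozen_sync_thread(struct mddev *mddev)
{
	mutex_lock(&mddev->sync_mutex);
	/* keep FROZEN set so md_check_recovery() won't start a new one */
	set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
	stop_sync_thread(mddev);
	mutex_unlock(&mddev->sync_mutex);
}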

Thanks,
Kuai
>
> Regards
> Xiao
>>
>> Regards
>> Xiao
>>
>>>
>>> Thanks,
>>> Kuai
>>>


2023-06-15 09:21:30

by Xiao Ni

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

On Thu, Jun 15, 2023 at 5:05 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/06/15 16:17, Xiao Ni wrote:
> >> Thanks for the example. I can understand the usage of it. It's a side
> >> effect of removing the mutex protection from idle_sync_thread.
> >>
> >> There is a problem. A new sync thread is started in md_check_recovery.
> >> After your patch, md_reap_sync_thread is called in md_check_recovery
> >> too. So it looks like they can't happen at the same time?
>
> Of course they can't. md_check_recovery() can only do one thing at a
> time.
>
> >
> > After thinking a while, there is still a race possibility.
> >
> > md_reap_sync_thread is called in the pers daemon (e.g. raid10d ->
> > md_check_recovery) and md_check_recovery returns. Before
> > idle_sync_thread is woken, a new sync thread can be started in
> > md_check_recovery again.
> >
> > But that is really strange for someone who echoes idle to sync_action.
> > It's better to add some messages to notify users that they may need to
> > echo idle to sync_action again and retry. Is there a way for
> > md_reap_sync_thread to wait for idle_sync_thread?
>
> I don't think this is a problem. echo idle only makes sure to interrupt
> the current sync_thread; there is no guarantee that no sync_thread is
> running after "echo idle" is done, with or without this patchset.
> Before this patchset, a new sync thread could still start after the
> mutex was released.
>
> Users should "echo frozen" instead of "echo idle" if they really want
> to prevent a new sync_thread from starting.

Thanks for all the explanations and patience.

Regards
Xiao
>
> Thanks,
> Kuai
> >
> > Regards
> > Xiao
> >>
> >> Regards
> >> Xiao
> >>
> >>>
> >>> Thanks,
> >>> Kuai
> >>>


2023-06-16 06:55:43

by Song Liu

[permalink] [raw]
Subject: Re: [dm-devel] [PATCH -next v2 3/6] md: add a mutex to synchronize idle and frozen in action_store()

On Tue, Jun 13, 2023 at 6:15 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/06/13 22:43, Xiao Ni wrote:
> >
> > On 2023/5/29 9:20 PM, Yu Kuai wrote:
> >> From: Yu Kuai <[email protected]>
> >>
> >> Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
> >> and call md_reap_sync_thread() to stop the sync thread; however, this
> >> can cause a deadlock (explained in the next patch). In order to fix the
> >> problem, the following patch will release 'reconfig_mutex' and wait on
> >> 'resync_wait', like md_set_readonly() and do_md_stop() do.
> >>
> >> Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
> >> unconditionally, which might cause unexpected problems: for example,
> >> 'frozen' has just set 'MD_RECOVERY_FROZEN' and is still in progress,
> >> while 'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is
> >> started, which might starve the in-progress 'frozen'. A mutex is added
> >> to synchronize idle and frozen from action_store().
> >>
> >> Signed-off-by: Yu Kuai <[email protected]>
> >> ---
> >> drivers/md/md.c | 5 +++++
> >> drivers/md/md.h | 3 +++
> >> 2 files changed, 8 insertions(+)
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index 23e8e7eae062..63a993b52cd7 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -644,6 +644,7 @@ void mddev_init(struct mddev *mddev)
> >> mutex_init(&mddev->open_mutex);
> >> mutex_init(&mddev->reconfig_mutex);
> >> mutex_init(&mddev->delete_mutex);
> >> + mutex_init(&mddev->sync_mutex);
> >> mutex_init(&mddev->bitmap_info.mutex);
> >> INIT_LIST_HEAD(&mddev->disks);
> >> INIT_LIST_HEAD(&mddev->all_mddevs);
> >> @@ -4785,14 +4786,18 @@ static void stop_sync_thread(struct mddev *mddev)
> >> static void idle_sync_thread(struct mddev *mddev)
> >> {
> >> + mutex_lock(&mddev->sync_mutex);
> >> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >> stop_sync_thread(mddev);
> >> + mutex_unlock(&mddev->sync_mutex);
> >> }
> >> static void frozen_sync_thread(struct mddev *mddev)
> >> {
> >> + mutex_init(&mddev->delete_mutex);
> >
> >
> > A typo? It should be mutex_lock(&mddev->sync_mutex); ?
> >
>
> Yes, and thanks for spotting this; it looks like I introduced it while
> rebasing.

I fixed this one and applied the set to md-next.
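
(For the archive, the fix is the one-liner Xiao pointed out, so that
frozen_sync_thread() takes the new mutex the same way idle_sync_thread()
does:

-	mutex_init(&mddev->delete_mutex);
+	mutex_lock(&mddev->sync_mutex);

with the rest of the hunk unchanged.)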

Thanks,
Song