2023-03-22 06:45:44

by Yu Kuai

Subject: [PATCH -next 0/6] md: fix that MD_RECOVERY_RUNNING can be cleared while sync_thread is still running

From: Yu Kuai <[email protected]>

Patch 1 reverts the commit because it causes MD_RECOVERY_RUNNING to be
cleared while sync_thread is still running. The deadlock that the reverted
commit tried to fix is fixed by patches 2-5.

Patch 6 enhances checking to prevent MD_RECOVERY_RUNNING from being cleared
while sync_thread is still running.

Yu Kuai (6):
Revert "md: unlock mddev before reap sync_thread in action_store"
md: refactor action_store() for 'idle' and 'frozen'
md: add a mutex to synchronize idle and frozen in action_store()
md: refactor idle/frozen_sync_thread()
md: wake up 'resync_wait' at last in md_reap_sync_thread()
md: enhance checking in md_check_recovery()

drivers/md/dm-raid.c | 1 -
drivers/md/md.c | 125 +++++++++++++++++++++++++++++--------------
drivers/md/md.h | 5 ++
3 files changed, 89 insertions(+), 42 deletions(-)

--
2.31.1


2023-03-22 07:04:18

by Yu Kuai

Subject: [PATCH -next 5/6] md: wake up 'resync_wait' at last in md_reap_sync_thread()

From: Yu Kuai <[email protected]>

The previous patch replaced md_reap_sync_thread() with
wait_event(resync_wait, ...) in action_store(); this patch makes sure
action_store() still waits for everything in md_reap_sync_thread() to be
done, by waking up 'resync_wait' only at the end.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ddf33a11d8de..cabdfd4ec001 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9511,7 +9511,6 @@ void md_reap_sync_thread(struct mddev *mddev)
if (mddev_is_clustered(mddev) && is_reshaped
&& !test_bit(MD_CLOSING, &mddev->flags))
md_cluster_ops->update_size(mddev, old_dev_sectors);
- wake_up(&resync_wait);
/* flag recovery needed just to double check */
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
sysfs_notify_dirent_safe(mddev->sysfs_completed);
@@ -9519,6 +9518,7 @@ void md_reap_sync_thread(struct mddev *mddev)
md_new_event();
if (mddev->event_work.func)
queue_work(md_misc_wq, &mddev->event_work);
+ wake_up(&resync_wait);
}
EXPORT_SYMBOL(md_reap_sync_thread);

--
2.31.1

2023-03-22 07:04:43

by Yu Kuai

Subject: [PATCH -next 3/6] md: add a mutex to synchronize idle and frozen in action_store()

From: Yu Kuai <[email protected]>

Currently, for 'idle' and 'frozen', action_store() holds 'reconfig_mutex'
and calls md_reap_sync_thread() to stop the sync thread; however, this can
deadlock (explained in the next patch). In order to fix the problem, the
next patch will release 'reconfig_mutex' and wait on 'resync_wait', like
md_set_readonly() and do_md_stop() do.

However, action_store() sets/clears 'MD_RECOVERY_FROZEN' unconditionally,
which might cause unexpected problems. For example, 'frozen' just set
'MD_RECOVERY_FROZEN' and is still in progress, while 'idle' clears
'MD_RECOVERY_FROZEN' and a new sync thread is started, which might starve
the in-progress 'frozen'.

This patch adds a mutex to synchronize 'idle' and 'frozen' from
action_store().

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 5 +++++
drivers/md/md.h | 3 +++
2 files changed, 8 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 076936fee65a..223c03149852 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -656,6 +656,7 @@ void mddev_init(struct mddev *mddev)
mutex_init(&mddev->open_mutex);
mutex_init(&mddev->reconfig_mutex);
mutex_init(&mddev->bitmap_info.mutex);
+ mutex_init(&mddev->sync_mutex);
INIT_LIST_HEAD(&mddev->disks);
INIT_LIST_HEAD(&mddev->all_mddevs);
timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
@@ -4783,14 +4784,18 @@ static void stop_sync_thread(struct mddev *mddev)

static void idle_sync_thread(struct mddev *mddev)
{
+ mutex_lock(&mddev->sync_mutex);
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+ mutex_unlock(&mddev->sync_mutex);
}

static void frozen_sync_thread(struct mddev *mddev)
{
+ mutex_lock(&mddev->sync_mutex);
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+ mutex_unlock(&mddev->sync_mutex);
}

static ssize_t
diff --git a/drivers/md/md.h b/drivers/md/md.h
index e148e3c83b0d..64474e458545 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -534,6 +534,9 @@ struct mddev {
bool has_superblocks:1;
bool fail_last_dev:1;
bool serialize_policy:1;
+
+ /* Used to synchronize idle and frozen for action_store() */
+ struct mutex sync_mutex;
};

enum recovery_flags {
--
2.31.1

2023-03-22 07:05:32

by Yu Kuai

Subject: [PATCH -next 2/6] md: refactor action_store() for 'idle' and 'frozen'

From: Yu Kuai <[email protected]>

Prepare to handle 'idle' and 'frozen' differently to fix a deadlock. There
are no functional changes except that MD_RECOVERY_RUNNING is checked again
after 'reconfig_mutex' is held.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 61 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 45 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index acf57a5156c7..076936fee65a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4753,6 +4753,46 @@ action_show(struct mddev *mddev, char *page)
return sprintf(page, "%s\n", type);
}

+static void stop_sync_thread(struct mddev *mddev)
+{
+ if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+ return;
+
+ if (mddev_lock(mddev))
+ return;
+
+ /*
+ * Check again in case MD_RECOVERY_RUNNING is cleared before lock is
+ * held.
+ */
+ if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
+ mddev_unlock(mddev);
+ return;
+ }
+
+ if (work_pending(&mddev->del_work))
+ flush_workqueue(md_misc_wq);
+
+ if (mddev->sync_thread) {
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ md_reap_sync_thread(mddev);
+ }
+
+ mddev_unlock(mddev);
+}
+
+static void idle_sync_thread(struct mddev *mddev)
+{
+ clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ stop_sync_thread(mddev);
+}
+
+static void frozen_sync_thread(struct mddev *mddev)
+{
+ set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+ stop_sync_thread(mddev);
+}
+
static ssize_t
action_store(struct mddev *mddev, const char *page, size_t len)
{
@@ -4760,22 +4800,11 @@ action_store(struct mddev *mddev, const char *page, size_t len)
return -EINVAL;


- if (cmd_match(page, "idle") || cmd_match(page, "frozen")) {
- if (cmd_match(page, "frozen"))
- set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
- else
- clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
- if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
- mddev_lock(mddev) == 0) {
- if (work_pending(&mddev->del_work))
- flush_workqueue(md_misc_wq);
- if (mddev->sync_thread) {
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev);
- }
- mddev_unlock(mddev);
- }
- } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+ if (cmd_match(page, "idle"))
+ idle_sync_thread(mddev);
+ else if (cmd_match(page, "frozen"))
+ frozen_sync_thread(mddev);
+ else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
return -EBUSY;
else if (cmd_match(page, "resync"))
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
--
2.31.1

2023-03-22 07:05:39

by Yu Kuai

Subject: [PATCH -next 4/6] md: refactor idle/frozen_sync_thread()

From: Yu Kuai <[email protected]>

Our tests found the following deadlock in raid10:

1) A normal write is issued and fails:

raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry

// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)

Dependency chain 1: normal I/O is waiting for the superblock update

2) Trigger a recovery:

raid10_sync_request
raise_barrier

Dependency chain 2: the sync thread is waiting for normal I/O

3) echo idle/frozen to sync_action:

action_store
mddev_lock
md_unregister_thread
kthread_stop

Dependency chain 3: the 'reconfig_mutex' holder is waiting for the sync thread

4) The md thread can't update the superblock:

raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb

Dependency chain 4: the superblock update is waiting for 'reconfig_mutex'

Hence a cyclic dependency exists; to fix the problem, one of the links must
be broken. Dependencies 1 and 2 can't be broken because they are fundamental
to the design. Breaking dependency 4 might be possible if it could be
guaranteed that no I/O is in flight, but that would require a new mechanism
and seems complex. Dependency 3 is the good choice, because 'idle'/'frozen'
only needs the sync thread to finish, which can be waited for without holding
'reconfig_mutex' (waiting on 'resync_wait' is already how md_set_readonly()
and do_md_stop() do it).

This patch switches 'idle' and 'frozen' to waiting for the sync thread to
finish asynchronously, and also adds a sequence counter that records how many
times a sync thread has finished. 'idle' returns once either the counter has
advanced or no sync thread is running, so it won't keep waiting on a newly
started sync thread.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 24 ++++++++++++++++++++----
drivers/md/md.h | 2 ++
2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 223c03149852..ddf33a11d8de 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -662,6 +662,7 @@ void mddev_init(struct mddev *mddev)
timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
atomic_set(&mddev->active, 1);
atomic_set(&mddev->openers, 0);
+ atomic_set(&mddev->sync_seq, 0);
spin_lock_init(&mddev->lock);
atomic_set(&mddev->flush_pending, 0);
init_waitqueue_head(&mddev->sb_wait);
@@ -4774,19 +4775,28 @@ static void stop_sync_thread(struct mddev *mddev)
if (work_pending(&mddev->del_work))
flush_workqueue(md_misc_wq);

- if (mddev->sync_thread) {
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_reap_sync_thread(mddev);
- }
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ /*
+ * Thread might be blocked waiting for metadata update which will now
+ * never happen.
+ */
+ if (mddev->sync_thread)
+ wake_up_process(mddev->sync_thread->tsk);

mddev_unlock(mddev);
}

static void idle_sync_thread(struct mddev *mddev)
{
+ int sync_seq = atomic_read(&mddev->sync_seq);
+
mutex_lock(&mddev->sync_mutex);
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+
+ wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+ !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
mutex_unlock(&mddev->sync_mutex);
}

@@ -4795,6 +4805,10 @@ static void frozen_sync_thread(struct mddev *mddev)
mutex_lock(&mddev->sync_mutex);
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
stop_sync_thread(mddev);
+
+ wait_event(resync_wait, mddev->sync_thread == NULL &&
+ !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
mutex_unlock(&mddev->sync_mutex);
}

@@ -9451,6 +9465,8 @@ void md_reap_sync_thread(struct mddev *mddev)

/* resync has finished, collect result */
md_unregister_thread(&mddev->sync_thread);
+ atomic_inc(&mddev->sync_seq);
+
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 64474e458545..10d425a3daa3 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -537,6 +537,8 @@ struct mddev {

/* Used to synchronize idle and frozen for action_store() */
struct mutex sync_mutex;
+ /* The sequence number for sync thread */
+ atomic_t sync_seq;
};

enum recovery_flags {
--
2.31.1

2023-03-22 07:06:22

by Yu Kuai

Subject: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

From: Yu Kuai <[email protected]>

This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.

The reverted commit introduces a defect: sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which causes unexpected problems, for
example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18

This happens because the work is requeued while it is still on the workqueue:

t1:                             t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
   // first sync_thread is done
                                md_check_recovery
                                 mddev_try_lock
                                 /*
                                  * once MD_RECOVERY_DONE is set, new sync_thread
                                  * can start.
                                  */
                                 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                                 INIT_WORK(&mddev->del_work, md_start_sync)
                                 queue_work(md_misc_wq, &mddev->del_work)
                                  test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
                                  // set pending bit
                                  insert_work
                                   list_add_tail
                                 mddev_unlock
   mddev_lock_nointr
   md_reap_sync_thread
   // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
 INIT_WORK(&mddev->del_work, md_start_sync)
  work->data = 0
  // work pending bit is cleared
 queue_work(md_misc_wq, &mddev->del_work)
  insert_work
   list_add_tail
   // list is corrupted

Revert the commit to fix the problem; the deadlock the reverted commit tried
to fix will be fixed in the following patches.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 1 -
drivers/md/md.c | 19 ++-----------------
2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 60632b409b80..0601edbf579f 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3729,7 +3729,6 @@ static int raid_message(struct dm_target *ti, unsigned int argc, char **argv,
if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
}
} else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 546b1b81eb28..acf57a5156c7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4770,19 +4770,6 @@ action_store(struct mddev *mddev, const char *page, size_t len)
if (work_pending(&mddev->del_work))
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
- sector_t save_rp = mddev->reshape_position;
-
- mddev_unlock(mddev);
- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
- mddev_lock_nointr(mddev);
- /*
- * set RECOVERY_INTR again and restore reshape
- * position in case others changed them after
- * got lock, eg, reshape_position_store and
- * md_check_recovery.
- */
- mddev->reshape_position = save_rp;
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
@@ -6173,7 +6160,6 @@ static void __md_stop_writes(struct mddev *mddev)
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
}

@@ -9315,7 +9301,6 @@ void md_check_recovery(struct mddev *mddev)
* ->spare_active and clear saved_raid_disk
*/
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -9351,7 +9336,6 @@ void md_check_recovery(struct mddev *mddev)
goto unlock;
}
if (mddev->sync_thread) {
- md_unregister_thread(&mddev->sync_thread);
md_reap_sync_thread(mddev);
goto unlock;
}
@@ -9431,7 +9415,8 @@ void md_reap_sync_thread(struct mddev *mddev)
sector_t old_dev_sectors = mddev->dev_sectors;
bool is_reshaped = false;

- /* sync_thread should be unregistered, collect result */
+ /* resync has finished, collect result */
+ md_unregister_thread(&mddev->sync_thread);
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
mddev->degraded != mddev->raid_disks) {
--
2.31.1

2023-03-22 07:20:48

by Guoqing Jiang

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"



On 3/22/23 14:41, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.
>
> Because it will introduce a defect that sync_thread can be running while
> MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
> for example:
>
> list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
> Call trace:
> __list_add_valid+0xfc/0x140
> insert_work+0x78/0x1a0
> __queue_work+0x500/0xcf4
> queue_work_on+0xe8/0x12c
> md_check_recovery+0xa34/0xf30
> raid10d+0xb8/0x900 [raid10]
> md_thread+0x16c/0x2cc
> kthread+0x1a4/0x1ec
> ret_from_fork+0x10/0x18
>
> This is because work is requeued while it's still inside workqueue:

If the workqueue subsystem can have such a problem because of an md flag,
then I have to think the workqueue is fragile.

> t1: t2:
> action_store
> mddev_lock
> if (mddev->sync_thread)
> mddev_unlock
> md_unregister_thread
> // first sync_thread is done
> md_check_recovery
> mddev_try_lock
> /*
> * once MD_RECOVERY_DONE is set, new sync_thread
> * can start.
> */
> set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
> INIT_WORK(&mddev->del_work, md_start_sync)
> queue_work(md_misc_wq, &mddev->del_work)
> test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)

Assume you mean below,

1551 if(!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
1552                 __queue_work(cpu, wq, work);
1553                 ret = true;
1554         }

Could you explain how the same work can be re-queued? Isn't the PENDING_BIT
already set in t3? I believe queue_work shouldn't do that per the comment,
but I am not an expert ...

Returns %false if @work was already on a queue, %true otherwise.

> // set pending bit
> insert_work
> list_add_tail
> mddev_unlock
> mddev_lock_nointr
> md_reap_sync_thread
> // MD_RECOVERY_RUNNING is cleared
> mddev_unlock
>
> t3:
>
> // before queued work started from t2
> md_check_recovery
> // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
> INIT_WORK(&mddev->del_work, md_start_sync)
> work->data = 0
> // work pending bit is cleared
> queue_work(md_misc_wq, &mddev->del_work)
> insert_work
> list_add_tail
> // list is corrupted
>
> This patch revert the commit to fix the problem, the deadlock this
> commit tries to fix will be fixed in following patches.

Please cc the previous users who encountered the problem so they can test
the second patch.

Also, can you share the test which can trigger the re-queue issue?
I'd like to try it with the latest mainline such as 6.3-rc3, and your test is
not only run against the 5.10 kernel as you described before, right?

Thanks,
Guoqing

2023-03-22 09:08:36

by Yu Kuai

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

Hi,

On 2023/03/22 15:19, Guoqing Jiang wrote:
>
>
> On 3/22/23 14:41, Yu Kuai wrote:
>> From: Yu Kuai <[email protected]>
>>
>> This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.
>>
>> Because it will introduce a defect that sync_thread can be running while
>> MD_RECOVERY_RUNNING is cleared, which will cause some unexpected
>> problems,
>> for example:
>>
>> list_add corruption. prev->next should be next (ffff0001ac1daba0), but
>> was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
>> Call trace:
>>   __list_add_valid+0xfc/0x140
>>   insert_work+0x78/0x1a0
>>   __queue_work+0x500/0xcf4
>>   queue_work_on+0xe8/0x12c
>>   md_check_recovery+0xa34/0xf30
>>   raid10d+0xb8/0x900 [raid10]
>>   md_thread+0x16c/0x2cc
>>   kthread+0x1a4/0x1ec
>>   ret_from_fork+0x10/0x18
>>
>> This is because work is requeued while it's still inside workqueue:
>
> If the workqueue subsystem can have such problem because of md flag,
> then I have to think workqueue is fragile.
>
>> t1:            t2:
>> action_store
>>   mddev_lock
>>    if (mddev->sync_thread)
>>     mddev_unlock
>>     md_unregister_thread
>>     // first sync_thread is done
>>             md_check_recovery
>>              mddev_try_lock
>>              /*
>>               * once MD_RECOVERY_DONE is set, new sync_thread
>>               * can start.
>>               */
>>              set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>>              INIT_WORK(&mddev->del_work, md_start_sync)
>>              queue_work(md_misc_wq, &mddev->del_work)
>>               test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
>
> Assume you mean below,
>
> 1551 if(!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> 1552                 __queue_work(cpu, wq, work);
> 1553                 ret = true;
> 1554         }
>
> Could you explain how the same work can be re-queued? Isn't the PENDING_BIT
> is already set in t3? I believe queue_work shouldn't do that per the
> comment
> but I am not expert ...

This is not related to the workqueue; it is just because raid10
reinitializes a work item that is already queued, as I described later
for t3:

t2:
md_check_recovery:
INIT_WORK -> clear pending
queue_work -> set pending
list_add_tail
...

t3: -> work is still pending
md_check_recovery:
INIT_WORK -> clear pending
queue_work -> set pending
list_add_tail -> list is corrupted
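
To make the failure mode concrete, the following is a minimal user-space
sketch (an assumed simplification, not the kernel code) of why re-initializing
a node that is still linked and then adding it again trips the prev->next
check:

#include <stdio.h>

struct node { struct node *prev, *next; };

static void node_init(struct node *n)
{
	n->prev = n->next = n;
}

/* roughly what list_add_tail() does */
static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

int main(void)
{
	struct node head, work;

	node_init(&head);
	node_add_tail(&work, &head);	/* t2: queue_work */
	node_init(&work);		/* t3: INIT_WORK while still queued */
	node_add_tail(&work, &head);	/* t3: queue_work again */

	/* head.prev->next should point back at head, but points at work */
	printf("prev->next == head? %s\n",
	       head.prev->next == &head ? "yes" : "no (corrupted)");
	return 0;
}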

>
> Returns %false if @work was already on a queue, %true otherwise.
>
>>               // set pending bit
>>               insert_work
>>                list_add_tail
>>              mddev_unlock
>>     mddev_lock_nointr
>>     md_reap_sync_thread
>>     // MD_RECOVERY_RUNNING is cleared
>>   mddev_unlock
>>
>> t3:
>>
>> // before queued work started from t2
>> md_check_recovery
>>   // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
>>   INIT_WORK(&mddev->del_work, md_start_sync)
>>    work->data = 0
>>    // work pending bit is cleared
>>   queue_work(md_misc_wq, &mddev->del_work)
>>    insert_work
>>     list_add_tail
>>     // list is corrupted
>>
>> This patch revert the commit to fix the problem, the deadlock this
>> commit tries to fix will be fixed in following patches.
>
> Pls cc the previous users who had encounter the problem to test the
> second patch.

Ok, cc Marc. Could you check whether this patchset fixes the problem you
reported in the following thread?

md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition
>
> And can you share your test which can trigger the re-queued issue?
> I'd like to try with latest mainline such as 6.3-rc3, and your test is
> not only run against 5.10 kernel as you described before, right?
>

Of course; our 5.10 and mainline are the same.

There are some tests:

First, the deadlock can be reproduced reliably; the test script is simple:

mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]

fio -filename=/dev/md0 -rw=randwrite -direct=1 -name=a -bs=4k -numjobs=16 -iodepth=16 &

echo -1 > /sys/kernel/debug/fail_make_request/times
echo 1 > /sys/kernel/debug/fail_make_request/probability
echo 1 > /sys/block/sda/make-it-fail

{
        while true; do
                mdadm -f /dev/md0 /dev/sda
                mdadm -r /dev/md0 /dev/sda
                mdadm --zero-superblock /dev/sda
                mdadm -a /dev/md0 /dev/sda
                sleep 2
        done
} &

{
        while true; do
                mdadm -f /dev/md0 /dev/sdd
                mdadm -r /dev/md0 /dev/sdd
                mdadm --zero-superblock /dev/sdd
                mdadm -a /dev/md0 /dev/sdd
                sleep 10
        done
} &

{
        while true; do
                echo frozen > /sys/block/md0/md/sync_action
                echo idle > /sys/block/md0/md/sync_action
                sleep 0.1
        done
} &

Then, the problem that MD_RECOVERY_RUNNING can be cleared can't be reproduced
reliably; it usually takes 2+ days to trigger, and the symptoms differ each
time. I hacked the kernel and added some BUG_ON()s that test
MD_RECOVERY_RUNNING in the attached patch; the following test can trigger the
BUG_ON:

mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
sleep 5
echo 1 > /sys/module/md_mod/parameters/set_delay
echo idle > /sys/block/md0/md/sync_action &
sleep 5
echo "want_replacement" > /sys/block/md0/md/dev-sdd/state

test result:

[ 228.390237] md_check_recovery: running is set
[ 228.391376] md_check_recovery: queue new sync thread
[ 233.671041] action_store unregister success! delay 10s
[ 233.689276] md_check_recovery: running is set
[ 238.722448] md_check_recovery: running is set
[ 238.723328] md_check_recovery: queue new sync thread
[ 238.724851] md_do_sync: before new wor, sleep 10s
[ 239.725818] md_do_sync: delay done
[ 243.674828] action_store delay done
[ 243.700102] md_reap_sync_thread: running is cleared!
[ 243.748703] ------------[ cut here ]------------
[ 243.749656] kernel BUG at drivers/md/md.c:9084!
[ 243.750548] invalid opcode: 0000 [#1] PREEMPT SMP
[ 243.752028] CPU: 6 PID: 1495 Comm: md0_resync Not tainted
6.3.0-rc1-next-20230310-00001-g4b3965bcb967-dirty #47
[ 243.755030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31
04/01/2014
[ 243.758516] RIP: 0010:md_do_sync+0x16a9/0x1b00
[ 243.759583] Code: ff 48 83 05 60 ce a7 0c 01 e9 8d f9 ff ff 48 83 05
13 ce a7 0c 01 48 c7 c6 e9 e0 29 83 e9 3b f9 ff ff 48 83 05 5f d0 a7 0c
01 <0f> 0b 48 83 05 5d d0 a7 0c 01 e8 f8 d5 0b0
[ 243.763661] RSP: 0018:ffffc90003847d50 EFLAGS: 00010202
[ 243.764212] RAX: 0000000000000028 RBX: ffff88817b529000 RCX:
0000000000000000
[ 243.764936] RDX: 0000000000000000 RSI: 0000000000000206 RDI:
ffff888100040740
[ 243.765648] RBP: 00000000002d6780 R08: 0101010101010101 R09:
ffff888165671d80
[ 243.766352] R10: ffffffff8ad6096c R11: ffff88816fcfa9f0 R12:
0000000000000001
[ 243.767066] R13: ffff888173920040 R14: ffff88817b529000 R15:
0000000000187100
[ 243.767781] FS: 0000000000000000(0000) GS:ffff888ffef80000(0000)
knlGS:0000000000000000
[ 243.768588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 243.769172] CR2: 00005599effa8451 CR3: 00000001663e6000 CR4:
00000000000006e0
[ 243.769888] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 243.770598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 243.771300] Call Trace:
[ 243.771555] <TASK>
[ 243.771779] ? kvm_clock_read+0x14/0x30
[ 243.772169] ? kvm_sched_clock_read+0x9/0x20
[ 243.772611] ? sched_clock_cpu+0x21/0x330
[ 243.773023] md_thread+0x2ec/0x300
[ 243.773373] ? md_write_start+0x420/0x420
[ 243.773845] kthread+0x13e/0x1a0
[ 243.774210] ? kthread_exit+0x50/0x50
[ 243.774591] ret_from_fork+0x1f/0x30

> Thanks,
> Guoqing
>
> .
>


Attachments:
0001-echo-idle.patch (3.68 kB)

2023-03-22 14:39:24

by Guoqing Jiang

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"



On 3/22/23 17:00, Yu Kuai wrote:
> Hi,
>
> On 2023/03/22 15:19, Guoqing Jiang wrote:
>>
>>
>> On 3/22/23 14:41, Yu Kuai wrote:
>>> From: Yu Kuai <[email protected]>
>>>
>>> This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.
>>>
>>> Because it will introduce a defect that sync_thread can be running
>>> while
>>> MD_RECOVERY_RUNNING is cleared, which will cause some unexpected
>>> problems,
>>> for example:
>>>
>>> list_add corruption. prev->next should be next (ffff0001ac1daba0),
>>> but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
>>> Call trace:
>>>   __list_add_valid+0xfc/0x140
>>>   insert_work+0x78/0x1a0
>>>   __queue_work+0x500/0xcf4
>>>   queue_work_on+0xe8/0x12c
>>>   md_check_recovery+0xa34/0xf30
>>>   raid10d+0xb8/0x900 [raid10]
>>>   md_thread+0x16c/0x2cc
>>>   kthread+0x1a4/0x1ec
>>>   ret_from_fork+0x10/0x18
>>>
>>> This is because work is requeued while it's still inside workqueue:
>>
>> If the workqueue subsystem can have such problem because of md flag,
>> then I have to think workqueue is fragile.
>>
>>> t1:            t2:
>>> action_store
>>>   mddev_lock
>>>    if (mddev->sync_thread)
>>>     mddev_unlock
>>>     md_unregister_thread
>>>     // first sync_thread is done
>>>             md_check_recovery
>>>              mddev_try_lock
>>>              /*
>>>               * once MD_RECOVERY_DONE is set, new sync_thread
>>>               * can start.
>>>               */
>>>              set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>>>              INIT_WORK(&mddev->del_work, md_start_sync)
>>>              queue_work(md_misc_wq, &mddev->del_work)
>>>               test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
>>
>> Assume you mean below,
>>
>> 1551 if(!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
>> work_data_bits(work))) {
>> 1552                 __queue_work(cpu, wq, work);
>> 1553                 ret = true;
>> 1554         }
>>
>> Could you explain how the same work can be re-queued? Isn't the
>> PENDING_BIT
>> is already set in t3? I believe queue_work shouldn't do that per the
>> comment
>> but I am not expert ...
>
> This is not related to workqueue, it is just because raid10
> reinitialize the work that is already queued,

I am trying to understand the possibility.

> like I discribed later in t3:
>
> t2:
> md_check_recovery:
>  INIT_WORK -> clear pending
>  queue_work -> set pending
>   list_add_tail
> ...
>
> t3: -> work is still pending
> md_check_recovery:
>  INIT_WORK -> clear pending
>  queue_work -> set pending
>   list_add_tail -> list is corrupted

First, t2 and t3 can't run in parallel since reconfig_mutex must be held.
And if sync_thread existed, the second process would unregister and reap
sync_thread, which means the second process will call INIT_WORK and
queue_work again.

Maybe your description is valid; I would prefer to call work_pending and
flush_workqueue instead of INIT_WORK and queue_work.
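
Roughly, an untested sketch of what I mean (one possible reading, not a
proposed patch) would be:

	/* let an already-queued md_start_sync run before re-initializing */
	if (work_pending(&mddev->del_work))
		flush_workqueue(md_misc_wq);
	INIT_WORK(&mddev->del_work, md_start_sync);
	queue_work(md_misc_wq, &mddev->del_work);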

>
>>
>> Returns %false if @work was already on a queue, %true otherwise.
>>
>>>               // set pending bit
>>>               insert_work
>>>                list_add_tail
>>>              mddev_unlock
>>>     mddev_lock_nointr
>>>     md_reap_sync_thread
>>>     // MD_RECOVERY_RUNNING is cleared
>>>   mddev_unlock
>>>
>>> t3:
>>>
>>> // before queued work started from t2
>>> md_check_recovery
>>>   // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
>>>   INIT_WORK(&mddev->del_work, md_start_sync)
>>>    work->data = 0
>>>    // work pending bit is cleared
>>>   queue_work(md_misc_wq, &mddev->del_work)
>>>    insert_work
>>>     list_add_tail
>>>     // list is corrupted
>>>
>>> This patch revert the commit to fix the problem, the deadlock this
>>> commit tries to fix will be fixed in following patches.
>>
>> Pls cc the previous users who had encounter the problem to test the
>> second patch.
>
> Ok, cc Marc. Can you try if this patchset fix the problem you reproted
> in the following thread?
>
> md_raid: mdX_raid6 looping after sync_action "check" to "idle"
> transition
>>
>> And can you share your test which can trigger the re-queued issue?
>> I'd like to try with latest mainline such as 6.3-rc3, and your test is
>> not only run against 5.10 kernel as you described before, right?
>>
>
> Of course, our 5.10 and mainline are the same,
>
> there are some tests:
>
> First the deadlock can be reporduced reliably, test script is simple:
>
> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]

So this is raid10 while the previous problem appeared in raid456; I am not
sure it is the same issue, but let's see.

>
> fio -filename=/dev/md0 -rw=randwrite -direct=1 -name=a -bs=4k
> -numjobs=16 -iodepth=16 &
>
> echo -1 > /sys/kernel/debug/fail_make_request/times
> echo 1 > /sys/kernel/debug/fail_make_request/probability
> echo 1 > /sys/block/sda/make-it-fail
>
> {
>         while true; do
>                 mdadm -f /dev/md0 /dev/sda
>                 mdadm -r /dev/md0 /dev/sda
>                 mdadm --zero-superblock /dev/sda
>                 mdadm -a /dev/md0 /dev/sda
>                 sleep 2
>         done
> } &
>
> {
>         while true; do
>                 mdadm -f /dev/md0 /dev/sdd
>                 mdadm -r /dev/md0 /dev/sdd
>                 mdadm --zero-superblock /dev/sdd
>                 mdadm -a /dev/md0 /dev/sdd
>                 sleep 10
>         done
> } &
>
> {
>         while true; do
>                 echo frozen > /sys/block/md0/md/sync_action
>                 echo idle > /sys/block/md0/md/sync_action
>                 sleep 0.1
>         done
> } &
>
> Then, the problem MD_RECOVERY_RUNNING can be cleared can't be reporduced
> reliably, usually it takes 2+ days to triggered a problem, and each time
> problem phenomenon can be different, I'm hacking the kernel and add
> some BUG_ON to test MD_RECOVERY_RUNNING in attached patch, following
> test can trigger the BUG_ON:

Also, your debug patch obviously added a large delay which makes the
calltrace happen; I doubt a user can hit it in real life. Anyway, I will try
the below test from my side.

> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
> sleep 5
> echo 1 > /sys/module/md_mod/parameters/set_delay
> echo idle > /sys/block/md0/md/sync_action &
> sleep 5
> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
>
> test result:
>
> [  228.390237] md_check_recovery: running is set
> [  228.391376] md_check_recovery: queue new sync thread
> [  233.671041] action_store unregister success! delay 10s
> [  233.689276] md_check_recovery: running is set
> [  238.722448] md_check_recovery: running is set
> [  238.723328] md_check_recovery: queue new sync thread
> [  238.724851] md_do_sync: before new wor, sleep 10s
> [  239.725818] md_do_sync: delay done
> [  243.674828] action_store delay done
> [  243.700102] md_reap_sync_thread: running is cleared!
> [  243.748703] ------------[ cut here ]------------
> [  243.749656] kernel BUG at drivers/md/md.c:9084!

With your debug patch applied, does L9084 point to the below?

9084                                 mddev->curr_resync = MaxSector;

I don't understand how it triggers the below calltrace, and it has nothing
to do with the list corruption, right?

>
> [  243.750548] invalid opcode: 0000 [#1] PREEMPT SMP
> [  243.752028] CPU: 6 PID: 1495 Comm: md0_resync Not tainted
> 6.3.0-rc1-next-20230310-00001-g4b3965bcb967-dirty #47
> [  243.755030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31
> 04/01/2014
> [  243.758516] RIP: 0010:md_do_sync+0x16a9/0x1b00
> [  243.759583] Code: ff 48 83 05 60 ce a7 0c 01 e9 8d f9 ff ff 48 83
> 05 13 ce a7 0c 01 48 c7 c6 e9 e0 29 83 e9 3b f9 ff ff 48 83 05 5f d0
> a7 0c 01 <0f> 0b 48 83 05 5d d0 a7 0c 01 e8 f8 d5 0b0
> [  243.763661] RSP: 0018:ffffc90003847d50 EFLAGS: 00010202
> [  243.764212] RAX: 0000000000000028 RBX: ffff88817b529000 RCX:
> 0000000000000000
> [  243.764936] RDX: 0000000000000000 RSI: 0000000000000206 RDI:
> ffff888100040740
> [  243.765648] RBP: 00000000002d6780 R08: 0101010101010101 R09:
> ffff888165671d80
> [  243.766352] R10: ffffffff8ad6096c R11: ffff88816fcfa9f0 R12:
> 0000000000000001
> [  243.767066] R13: ffff888173920040 R14: ffff88817b529000 R15:
> 0000000000187100
> [  243.767781] FS:  0000000000000000(0000) GS:ffff888ffef80000(0000)
> knlGS:0000000000000000
> [  243.768588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  243.769172] CR2: 00005599effa8451 CR3: 00000001663e6000 CR4:
> 00000000000006e0
> [  243.769888] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  243.770598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  243.771300] Call Trace:
> [  243.771555]  <TASK>
> [  243.771779]  ? kvm_clock_read+0x14/0x30
> [  243.772169]  ? kvm_sched_clock_read+0x9/0x20
> [  243.772611]  ? sched_clock_cpu+0x21/0x330
> [  243.773023]  md_thread+0x2ec/0x300
> [  243.773373]  ? md_write_start+0x420/0x420
> [  243.773845]  kthread+0x13e/0x1a0
> [  243.774210]  ? kthread_exit+0x50/0x50
> [  243.774591]  ret_from_fork+0x1f/0x30
>

Thanks,
Guoqing

2023-03-23 01:41:20

by Yu Kuai

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

Hi,

On 2023/03/22 22:32, Guoqing Jiang wrote:
>>> Could you explain how the same work can be re-queued? Isn't the
>>> PENDING_BIT
>>> is already set in t3? I believe queue_work shouldn't do that per the
>>> comment
>>> but I am not expert ...
>>
>> This is not related to workqueue, it is just because raid10
>> reinitialize the work that is already queued,
>
> I am trying to understand the possibility.
>
>> like I discribed later in t3:
>>
>> t2:
>> md_check_recovery:
>>  INIT_WORK -> clear pending
>>  queue_work -> set pending
>>   list_add_tail
>> ...
>>
>> t3: -> work is still pending
>> md_check_recovery:
>>  INIT_WORK -> clear pending
>>  queue_work -> set pending
>>   list_add_tail -> list is corrupted
>
> First, t2 and t3 can't be run in parallel since reconfig_mutex must be
> held. And if sync_thread existed,
> the second process would unregister and reap sync_thread which means the
> second process will
> call INIT_WORK and queue_work again.
>
> Maybe your description is valid, I would prefer call work_pending and
> flush_workqueue instead of
> INIT_WORK and queue_work.

This is not enough, it's right this can avoid list corruption, but the
worker function md_start_sync just register a sync_thread, and
md_do_sync() can still in progress, hence this can't prevent a new
sync_thread to start while the old one is not done, some other problems
like deadlock can still be triggered.
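
Schematically (a simplified sketch; the real md.c worker has more error
handling), the queued worker only hands off to a kthread, so flushing
md_misc_wq waits for the registration but not for md_do_sync() itself:

static void md_start_sync(struct work_struct *ws)
{
	struct mddev *mddev = container_of(ws, struct mddev, del_work);

	/* only registers the kthread ... */
	mddev->sync_thread = md_register_thread(md_do_sync, mddev, "resync");
	/* ... the worker returns here while md_do_sync() keeps running */
}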

>> Of course, our 5.10 and mainline are the same,
>>
>> there are some tests:
>>
>> First the deadlock can be reporduced reliably, test script is simple:
>>
>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
>
> So this is raid10 while the previous problem was appeared in raid456, I
> am not sure it is the same
> issue, but let's see.

Ok, I'm not quite familiar with raid456 yet; however, the problem is still
related to action_store holding the mutex to unregister sync_thread, right?

>> Then, the problem MD_RECOVERY_RUNNING can be cleared can't be reporduced
>> reliably, usually it takes 2+ days to triggered a problem, and each time
>> problem phenomenon can be different, I'm hacking the kernel and add
>> some BUG_ON to test MD_RECOVERY_RUNNING in attached patch, following
>> test can trigger the BUG_ON:
>
> Also your debug patch obviously added large delay which make the
> calltrace happen, I doubt
> if user can hit it in real life. Anyway, will try below test from my side.
>
>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>> sleep 5
>> echo 1 > /sys/module/md_mod/parameters/set_delay
>> echo idle > /sys/block/md0/md/sync_action &
>> sleep 5
>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state
>>
>> test result:
>>
>> [  228.390237] md_check_recovery: running is set
>> [  228.391376] md_check_recovery: queue new sync thread
>> [  233.671041] action_store unregister success! delay 10s
>> [  233.689276] md_check_recovery: running is set
>> [  238.722448] md_check_recovery: running is set
>> [  238.723328] md_check_recovery: queue new sync thread
>> [  238.724851] md_do_sync: before new wor, sleep 10s
>> [  239.725818] md_do_sync: delay done
>> [  243.674828] action_store delay done
>> [  243.700102] md_reap_sync_thread: running is cleared!
>> [  243.748703] ------------[ cut here ]------------
>> [  243.749656] kernel BUG at drivers/md/md.c:9084!
>
> After your debug patch applied, is L9084 points to below?
>
> 9084                                 mddev->curr_resync = MaxSector;

In my environment, it's a BUG_ON() that I added in md_do_sync:

9080 skip:
9081         /* set CHANGE_PENDING here since maybe another update is needed,
9082          * so other nodes are informed. It should be harmless for normal
9083          * raid */
9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
9085         set_mask_bits(&mddev->sb_flags, 0,
9086                       BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));

>
> I don't understand how it triggers below calltrace, and it has nothing
> to do with
> list corruption, right?

Yes, this is just an early BUG_ON() to detect whether MD_RECOVERY_RUNNING
is cleared while sync_thread is still in progress.

Thanks,
Kuai

2023-03-23 03:54:58

by Guoqing Jiang

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"



On 3/23/23 09:36, Yu Kuai wrote:
> Hi,
>
> On 2023/03/22 22:32, Guoqing Jiang wrote:
>>>> Could you explain how the same work can be re-queued? Isn't the
>>>> PENDING_BIT
>>>> is already set in t3? I believe queue_work shouldn't do that per
>>>> the comment
>>>> but I am not expert ...
>>>
>>> This is not related to workqueue, it is just because raid10
>>> reinitialize the work that is already queued,
>>
>> I am trying to understand the possibility.
>>
>>> like I discribed later in t3:
>>>
>>> t2:
>>> md_check_recovery:
>>>  INIT_WORK -> clear pending
>>>  queue_work -> set pending
>>>   list_add_tail
>>> ...
>>>
>>> t3: -> work is still pending
>>> md_check_recovery:
>>>  INIT_WORK -> clear pending
>>>  queue_work -> set pending
>>>   list_add_tail -> list is corrupted
>>
>> First, t2 and t3 can't be run in parallel since reconfig_mutex must
>> be held. And if sync_thread existed,
>> the second process would unregister and reap sync_thread which means
>> the second process will
>> call INIT_WORK and queue_work again.
>>
>> Maybe your description is valid, I would prefer call work_pending and
>> flush_workqueue instead of
>> INIT_WORK and queue_work.
>
> This is not enough, it's right this can avoid list corruption, but the
> worker function md_start_sync just register a sync_thread, and
> md_do_sync() can still in progress, hence this can't prevent a new
> sync_thread to start while the old one is not done, some other problems
> like deadlock can still be triggered.
>
>>> Of course, our 5.10 and mainline are the same,
>>>
>>> there are some tests:
>>>
>>> First the deadlock can be reporduced reliably, test script is simple:
>>>
>>> mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]
>>
>> So this is raid10 while the previous problem was appeared in raid456,
>> I am not sure it is the same
>> issue, but let's see.
>
> Ok, I'm not quite familiar with raid456 yet, however, the problem is
> still related to that action_store hold mutex to unregister sync_thread,
> right?

Yes and no; the previous raid456 bug also existed because it can't get a
stripe while the barrier is involved, as you mentioned in patch 4, which is
different.

>
>>> Then, the problem MD_RECOVERY_RUNNING can be cleared can't be
>>> reporduced
>>> reliably, usually it takes 2+ days to triggered a problem, and each
>>> time
>>> problem phenomenon can be different, I'm hacking the kernel and add
>>> some BUG_ON to test MD_RECOVERY_RUNNING in attached patch, following
>>> test can trigger the BUG_ON:
>>
>> Also your debug patch obviously added large delay which make the
>> calltrace happen, I doubt
>> if user can hit it in real life. Anyway, will try below test from my
>> side.
>>
>>> mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
>>> sleep 5
>>> echo 1 > /sys/module/md_mod/parameters/set_delay
>>> echo idle > /sys/block/md0/md/sync_action &
>>> sleep 5
>>> echo "want_replacement" > /sys/block/md0/md/dev-sdd/state

I combined your debug patch with the above steps. It seems you:

1. add a delay to action_store, so it can't get the lock in time.
2. echo "want_replacement", which triggers md_check_recovery, which can grab
   the lock to start the sync thread.
3. action_store finally holds the lock to clear RECOVERY_RUNNING in reap sync
   thread.
4. Then the newly added BUG_ON is invoked since RECOVERY_RUNNING was cleared
   in step 3.

>>>
>>> test result:
>>>
>>> [  228.390237] md_check_recovery: running is set
>>> [  228.391376] md_check_recovery: queue new sync thread
>>> [  233.671041] action_store unregister success! delay 10s
>>> [  233.689276] md_check_recovery: running is set
>>> [  238.722448] md_check_recovery: running is set
>>> [  238.723328] md_check_recovery: queue new sync thread
>>> [  238.724851] md_do_sync: before new wor, sleep 10s
>>> [  239.725818] md_do_sync: delay done
>>> [  243.674828] action_store delay done
>>> [  243.700102] md_reap_sync_thread: running is cleared!
>>> [  243.748703] ------------[ cut here ]------------
>>> [  243.749656] kernel BUG at drivers/md/md.c:9084!
>>
>> After your debug patch applied, is L9084 points to below?
>>
>> 9084                                 mddev->curr_resync = MaxSector;
>
> In my environment, it's a BUG_ON() that I added in md_do_sync:

Ok, so we are on different code bases ...

> 9080  skip:
> 9081         /* set CHANGE_PENDING here since maybe another update is
> needed,
> 9082         ┊* so other nodes are informed. It should be harmless for
> normal
> 9083         ┊* raid */
> 9084         BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
> 9085         set_mask_bits(&mddev->sb_flags, 0,
> 9086                 ┊     BIT(MD_SB_CHANGE_PENDING) |
> BIT(MD_SB_CHANGE_DEVS));
>
>>
>> I don't understand how it triggers below calltrace, and it has
>> nothing to do with
>> list corruption, right?
>
> Yes, this is just a early BUG_ON() to detect that if MD_RECOVERY_RUNNING
> is cleared while sync_thread is still in progress.

sync_thread can be interrupted once MD_RECOVERY_INTR is set, which means
RUNNING can be cleared, so I am not sure the added BUG_ON is reasonable.
Changing the BUG_ON like this makes more sense to me.

+       BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+              !test_bit(MD_RECOVERY_INTR, &mddev->recovery));

I think there might be a racy window like you described, but it should be
really small. I prefer to just add a few lines like this instead of reverting
and introducing a new lock to resolve the same issue (if it is the same
issue).

@@ -4792,9 +4793,15 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                        if (mddev->sync_thread) {
                                sector_t save_rp = mddev->reshape_position;

+                               set_bit(MD_RECOVERY_DONOT, &mddev->recovery);
@@ -4805,6 +4812,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
                                mddev->reshape_position = save_rp;
                                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                                md_reap_sync_thread(mddev);
+                               clear_bit(MD_RECOVERY_DONOT, &mddev->recovery);
                        }
                        mddev_unlock(mddev);
@@ -9296,6 +9313,9 @@ void md_check_recovery(struct mddev *mddev)
        if (!md_is_rdwr(mddev) &&
            !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
                return;
+       /* action_store is in the middle of reap sync thread, let's wait */
+       if (test_bit(MD_RECOVERY_DONOT, &mddev->recovery))
+               return;
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -553,6 +553,7 @@ enum recovery_flags {
        MD_RECOVERY_ERROR,      /* sync-action interrupted because io-error */
        MD_RECOVERY_WAIT,       /* waiting for pers->start() to finish */
        MD_RESYNCING_REMOTE,    /* remote node is running resync thread */
+       MD_RECOVERY_DONOT,      /* for a nasty racy issue */
 };

TBH, I am reluctant to see the changes in the series; it can only be
considered acceptable with these conditions:

1. the previous raid456 bug can be fixed in this way too; hopefully Marc or
   others can verify it.
2. it passes all the tests in mdadm.

Thanks,
Guoqing

2023-03-23 06:34:18

by Yu Kuai

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

Hi,

On 2023/03/23 11:50, Guoqing Jiang wrote:

> Combined your debug patch with above steps. Seems you are
>
> 1. add delay to action_store, so it can't get lock in time.
> 2. echo "want_replacement"**triggers md_check_recovery which can grab lock
>     to start sync thread.
> 3. action_store finally hold lock to clear RECOVERY_RUNNING in reap sync
> thread.
> 4. Then the new added BUG_ON is invoked since RECOVERY_RUNNING is cleared
>     in step 3.

Yes, this is exactly what I did.

> sync_thread can be interrupted once MD_RECOVERY_INTR is set which means
> the RUNNING
> can be cleared, so I am not sure the added BUG_ON is reasonable. And
> change BUG_ON

I think the BUG_ON() is reasonable because only md_reap_sync_thread can
clear it. md_do_sync will exit quickly if MD_RECOVERY_INTR is set, but
md_do_sync should not see that MD_RECOVERY_RUNNING is cleared; otherwise
there is no guarantee that only one sync_thread can be in progress.

> like this makes more sense to me.
>
> +BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> +!test_bit(MD_RECOVERY_INTR, &mddev->recovery));

I think this can be reproduced likewise: md_check_recovery clears
MD_RECOVERY_INTR, and the new sync_thread triggered by echo
"want_replacement" won't set this bit.

>
> I think there might be racy window like you described but it should be
> really small, I prefer
> to just add a few lines like this instead of revert and introduce new
> lock to resolve the same
> issue (if it is).

The new lock that I add in this patchset just tries to synchronize idle
and frozen from action_store() (patch 3); I can drop it if you think it is
not necessary.

The main change is patch 4; it doesn't add many new lines, and I really don't
like to add new flags unless we have to, the current code is already hard
to understand...

By the way, I'm concerned that dropping the mutex to unregister sync_thread
might not be safe, since the mutex protects lots of stuff, and there might
exist other implicit dependencies.

>
> TBH, I am reluctant to see the changes in the series, it can only be
> considered
> acceptable with conditions:
>
> 1. the previous raid456 bug can be fixed in this way too, hopefully Marc
> or others
>     can verify it.
> 2. pass all the tests in mdadm

I already tested this patchset with mdadm. If there is a reproducer for the
raid456 bug, I can try to verify it myself.

Thanks,
Kuai
>
> Thanks,
> Guoqing
> .
>

2023-03-29 00:02:10

by Song Liu

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

On Wed, Mar 22, 2023 at 11:32 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/03/23 11:50, Guoqing Jiang wrote:
>
> > Combined your debug patch with above steps. Seems you are
> >
> > 1. add delay to action_store, so it can't get lock in time.
> > 2. echo "want_replacement"**triggers md_check_recovery which can grab lock
> > to start sync thread.
> > 3. action_store finally hold lock to clear RECOVERY_RUNNING in reap sync
> > thread.
> > 4. Then the new added BUG_ON is invoked since RECOVERY_RUNNING is cleared
> > in step 3.
>
> Yes, this is exactly what I did.
>
> > sync_thread can be interrupted once MD_RECOVERY_INTR is set which means
> > the RUNNING
> > can be cleared, so I am not sure the added BUG_ON is reasonable. And
> > change BUG_ON
>
> I think BUG_ON() is reasonable because only md_reap_sync_thread can
> clear it, md_do_sync will exit quictly if MD_RECOVERY_INTR is set, but
> md_do_sync should not see that MD_RECOVERY_RUNNING is cleared, otherwise
> there is no gurantee that only one sync_thread can be in progress.
>
> > like this makes more sense to me.
> >
> > +BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> > +!test_bit(MD_RECOVERY_INTR, &mddev->recovery));
>
> I think this can be reporduced likewise, md_check_recovery clear
> MD_RECOVERY_INTR, and new sync_thread triggered by echo
> "want_replacement" won't set this bit.
>
> >
> > I think there might be racy window like you described but it should be
> > really small, I prefer
> > to just add a few lines like this instead of revert and introduce new
> > lock to resolve the same
> > issue (if it is).
>
> The new lock that I add in this patchset is just try to synchronize idle
> and forzen from action_store(patch 3), I can drop it if you think this
> is not necessary.
>
> The main changes is patch 4, new lines is not much and I really don't
> like to add new flags unless we have to, current code is already hard
> to understand...
>
> By the way, I'm concerned that drop the mutex to unregister sync_thread
> might not be safe, since the mutex protects lots of stuff, and there
> might exist other implicit dependencies.
>
> >
> > TBH, I am reluctant to see the changes in the series, it can only be
> > considered
> > acceptable with conditions:
> >
> > 1. the previous raid456 bug can be fixed in this way too, hopefully Marc
> > or others
> > can verify it.
> > 2. pass all the tests in mdadm

AFAICT, this set looks like a better solution for this problem. But I agree
that we need to make sure it fixes the original bug. mdadm tests are not in
very good shape at the moment; I will spend more time looking into these
tests.

Thanks,
Song

2023-04-06 08:58:15

by Yu Kuai

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

Hi,

On 2023/03/29 7:58, Song Liu wrote:
> On Wed, Mar 22, 2023 at 11:32 PM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> On 2023/03/23 11:50, Guoqing Jiang wrote:
>>
>>> Combined your debug patch with above steps. Seems you are
>>>
>>> 1. add delay to action_store, so it can't get lock in time.
>>> 2. echo "want_replacement"**triggers md_check_recovery which can grab lock
>>> to start sync thread.
>>> 3. action_store finally hold lock to clear RECOVERY_RUNNING in reap sync
>>> thread.
>>> 4. Then the new added BUG_ON is invoked since RECOVERY_RUNNING is cleared
>>> in step 3.
>>
>> Yes, this is exactly what I did.
>>
>>> sync_thread can be interrupted once MD_RECOVERY_INTR is set which means
>>> the RUNNING
>>> can be cleared, so I am not sure the added BUG_ON is reasonable. And
>>> change BUG_ON
>>
>> I think BUG_ON() is reasonable because only md_reap_sync_thread can
>> clear it, md_do_sync will exit quictly if MD_RECOVERY_INTR is set, but
>> md_do_sync should not see that MD_RECOVERY_RUNNING is cleared, otherwise
>> there is no gurantee that only one sync_thread can be in progress.
>>
>>> like this makes more sense to me.
>>>
>>> +BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>> +!test_bit(MD_RECOVERY_INTR, &mddev->recovery));
>>
>> I think this can be reporduced likewise, md_check_recovery clear
>> MD_RECOVERY_INTR, and new sync_thread triggered by echo
>> "want_replacement" won't set this bit.
>>
>>>
>>> I think there might be racy window like you described but it should be
>>> really small, I prefer
>>> to just add a few lines like this instead of revert and introduce new
>>> lock to resolve the same
>>> issue (if it is).
>>
>> The new lock that I add in this patchset is just try to synchronize idle
>> and forzen from action_store(patch 3), I can drop it if you think this
>> is not necessary.
>>
>> The main changes is patch 4, new lines is not much and I really don't
>> like to add new flags unless we have to, current code is already hard
>> to understand...
>>
>> By the way, I'm concerned that drop the mutex to unregister sync_thread
>> might not be safe, since the mutex protects lots of stuff, and there
>> might exist other implicit dependencies.
>>
>>>
>>> TBH, I am reluctant to see the changes in the series, it can only be
>>> considered
>>> acceptable with conditions:
>>>
>>> 1. the previous raid456 bug can be fixed in this way too, hopefully Marc
>>> or others
>>> can verify it.
>>> 2. pass all the tests in mdadm
>
> AFAICT, this set looks like a better solution for this problem. But I agree
> that we need to make sure it fixes the original bug. mdadm tests are not
> in a very good shape at the moment. I will spend more time to look into
> these tests.

While working on another thread to protect md_thread with RCU, I found that
the commit being reverted has another defect that can cause a
null-ptr-dereference in theory, where md_unregister_thread(&mddev->sync_thread)
can run concurrently with another context that accesses sync_thread, for
example:

t1: md_set_readonly                         t2: action_store
                                             md_unregister_thread
                                             // 'reconfig_mutex' is not held
// 'reconfig_mutex' is held by caller
if (mddev->sync_thread)
                                              thread = *threadp
                                              *threadp = NULL
 wake_up_process(mddev->sync_thread->tsk)
 // null-ptr-dereference

So, I think this revert makes more sense.

Thanks,
Kuai

2023-05-05 09:17:48

by Yu Kuai

Subject: Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

Hi, Song and Guoqing

On 2023/04/06 16:53, Yu Kuai wrote:
> Hi,
>
> On 2023/03/29 7:58, Song Liu wrote:
>> On Wed, Mar 22, 2023 at 11:32 PM Yu Kuai <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> On 2023/03/23 11:50, Guoqing Jiang wrote:
>>>
>>>> Combined your debug patch with above steps. Seems you are
>>>>
>>>> 1. add delay to action_store, so it can't get lock in time.
>>>> 2. echo "want_replacement"**triggers md_check_recovery which can
>>>> grab lock
>>>>       to start sync thread.
>>>> 3. action_store finally hold lock to clear RECOVERY_RUNNING in reap
>>>> sync
>>>> thread.
>>>> 4. Then the new added BUG_ON is invoked since RECOVERY_RUNNING is
>>>> cleared
>>>>       in step 3.
>>>
>>> Yes, this is exactly what I did.
>>>
>>>> sync_thread can be interrupted once MD_RECOVERY_INTR is set which means
>>>> the RUNNING
>>>> can be cleared, so I am not sure the added BUG_ON is reasonable. And
>>>> change BUG_ON
>>>
>>> I think BUG_ON() is reasonable because only md_reap_sync_thread can
>>> clear it, md_do_sync will exit quictly if MD_RECOVERY_INTR is set, but
>>> md_do_sync should not see that MD_RECOVERY_RUNNING is cleared, otherwise
>>> there is no gurantee that only one sync_thread can be in progress.
>>>
>>>> like this makes more sense to me.
>>>>
>>>> +BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>> +!test_bit(MD_RECOVERY_INTR, &mddev->recovery));
>>>
>>> I think this can be reporduced likewise, md_check_recovery clear
>>> MD_RECOVERY_INTR, and new sync_thread triggered by echo
>>> "want_replacement" won't set this bit.
>>>
>>>>
>>>> I think there might be racy window like you described but it should be
>>>> really small, I prefer
>>>> to just add a few lines like this instead of revert and introduce new
>>>> lock to resolve the same
>>>> issue (if it is).
>>>
>>> The new lock that I add in this patchset is just try to synchronize idle
>>> and forzen from action_store(patch 3), I can drop it if you think this
>>> is not necessary.
>>>
>>> The main changes is patch 4, new lines is not much and I really don't
>>> like to add new flags unless we have to, current code is already hard
>>> to understand...
>>>
>>> By the way, I'm concerned that drop the mutex to unregister sync_thread
>>> might not be safe, since the mutex protects lots of stuff, and there
>>> might exist other implicit dependencies.
>>>
>>>>
>>>> TBH, I am reluctant to see the changes in the series, it can only be
>>>> considered
>>>> acceptable with conditions:
>>>>
>>>> 1. the previous raid456 bug can be fixed in this way too, hopefully
>>>> Marc
>>>> or others
>>>>       can verify it.

After reading the thread:

https://lore.kernel.org/linux-raid/[email protected]/T/#t

The deadlock in raid456 has the same conditions as raid10:
1) echo idle holds the mutex to stop the sync thread;
2) the sync thread waits for io to complete;
3) io can't be handled by the daemon thread because the sb flag is set;
4) the sb flag can't be cleared because the daemon thread can't hold the mutex;

I tried to reproduce the deadlock with the reproducer provided in the thread;
however, the deadlock was not reproduced after running for more than a day.

I changed the reproducer to the below:

[root@fedora raid5]# cat test_deadlock.sh
#! /bin/bash

(
        while true; do
                echo check > /sys/block/md0/md/sync_action
                sleep 0.5
                echo idle > /sys/block/md0/md/sync_action
        done
) &

echo 0 > /proc/sys/vm/dirty_background_ratio
(
        while true; do
                fio -filename=/dev/md0 -bs=4k -rw=write -numjobs=1 -name=xxx
        done
) &

And I was finally able to reproduce the deadlock with this patch reverted
(after running for about an hour):

[root@fedora raid5]# ps -elf | grep " D " | grep -v grep
1 D root 156 2 16 80 0 - 0 md_wri 06:51 ?
00:19:15 [kworker/u8:11+flush-9:0]
5 D root 2239 1 2 80 0 - 992 kthrea 06:57 pts/0
00:02:15 sh test_deadlock.sh
1 D root 42791 2 0 80 0 - 0 raid5_ 07:45 ?
00:00:00 [md0_resync]
5 D root 42803 42797 0 80 0 - 92175 balanc 07:45 ?
00:00:06 fio -filename=/dev/md0 -bs=4k -rw=write -numjobs=1 -name=xxx

[root@fedora raid5]# cat /proc/2239/stack
[<0>] kthread_stop+0x96/0x2b0
[<0>] md_unregister_thread+0x5e/0xd0
[<0>] md_reap_sync_thread+0x27/0x370
[<0>] action_store+0x1fa/0x490
[<0>] md_attr_store+0xa7/0x120
[<0>] sysfs_kf_write+0x3a/0x60
[<0>] kernfs_fop_write_iter+0x144/0x2b0
[<0>] new_sync_write+0x140/0x210
[<0>] vfs_write+0x21a/0x350
[<0>] ksys_write+0x77/0x150
[<0>] __x64_sys_write+0x1d/0x30
[<0>] do_syscall_64+0x45/0x70
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[root@fedora raid5]# cat /proc/42791/stack
[<0>] raid5_get_active_stripe+0x606/0x960
[<0>] raid5_sync_request+0x508/0x570
[<0>] md_do_sync.cold+0xaa6/0xee7
[<0>] md_thread+0x266/0x280
[<0>] kthread+0x151/0x1b0
[<0>] ret_from_fork+0x1f/0x30

And with this patchset applied, I have been running the above reproducer for
more than a day now, and I think the deadlock in raid456 can be fixed.

Can this patchset be considered for the next merge window? If so, I'll rebase
this patchset.

Thanks,
Kuai
>>>> 2. pass all the tests in mdadm
>>
>> AFAICT, this set looks like a better solution for this problem. But I
>> agree
>> that we need to make sure it fixes the original bug. mdadm tests are not
>> in a very good shape at the moment. I will spend more time to look into
>> these tests.
>
> While I'm working on another thread to protect md_thread with rcu, I
> found that this patch has other defects that can cause null-ptr-
> deference in theory where md_unregister_thread(&mddev->sync_thread) can
> concurrent with other context to access sync_thread, for example:
>
> t1: md_set_readonly             t2: action_store
>                                 md_unregister_thread
>                                 // 'reconfig_mutex' is not held
> // 'reconfig_mutex' is held by caller
> if (mddev->sync_thread)
>                                  thread = *threadp
>                                  *threadp = NULL
>  wake_up_process(mddev->sync_thread->tsk)
>  // null-ptr-deference
>
> So, I think this revert makes more sense.
>
> Thanks,
> Kuai
>
> .
>