2023-09-28 06:33:23

by Yu Kuai

Subject: [PATCH -next v3 00/25] md: synchronize io with array reconfiguration

From: Yu Kuai <[email protected]>

Changes in v3:
- rebase with latest md-next;
- remove patch 2 from v2, and replace it with a new patch;
- fix a null-ptr-dereference in rdev_attr_store(), where mddev was used
before being checked;
- merge patch 20-22 from v1 into one patch;
- mddev_lock() used to be called first and can be interrupted; allow the new
api, which is now called before mddev_lock(), to be interrupted as well;
- improve some comments and coding;

Changes in v2:
- rebase with latest md-next;
- remove some follow-up cleanup patches; they will be sent later, after
this patchset.

After the previous four patchsets of preparatory work, this patchset
implements a new version of mddev_suspend(). With the new apis:
- reconfig_mutex is not required;
- the weird logic where suspending the array holds 'reconfig_mutex' so that
md_check_recovery() can update the superblock is not needed;
- the special handling, 'pers->prepare_suspend', for raid456 is not
needed;
- they are safe to call at any time once mddev is allocated, and are
designed to be used from the slow path where array configuration is changed;

The new api is then used to replace:

	mddev_lock
	mddev_suspend or not
	// array reconfiguration
	mddev_resume or not
	mddev_unlock

With:

	mddev_suspend
	mddev_lock
	// array reconfiguration
	mddev_unlock
	mddev_resume

However, the above change is not possible for raid5 and raid-cluster in
some corner cases; there, mddev_suspend/resume() is replaced with the
quiesce() callback, which suspends the array as well.
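
The ordering change described above (suspend first, then lock, and undo in reverse) can be sketched in userspace C. This is only an illustration under stated assumptions: `struct fake_mddev` and every `fake_*` function are hypothetical stand-ins that record call order, not the kernel implementation. The error path mirrors the series' mddev_suspend_and_lock() helper, which resumes the array again if taking the lock fails.

```c
/* Userspace sketch only: every fake_* name is a hypothetical stand-in. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

struct fake_mddev {
	bool suspended;
	bool locked;
	bool fail_lock;	/* simulate an interrupted lock attempt */
	char trace[8];	/* call order: S(uspend) L(ock) C(onfig) U(nlock) R(esume) */
	int n;
};

static void note(struct fake_mddev *m, char c)
{
	if (m->n < (int)sizeof(m->trace) - 1)
		m->trace[m->n++] = c;
}

static int fake_suspend(struct fake_mddev *m)
{
	m->suspended = true;
	note(m, 'S');
	return 0;
}

static void fake_resume(struct fake_mddev *m)
{
	/* the series asserts that reconfig_mutex is not held on resume */
	assert(!m->locked);
	m->suspended = false;
	note(m, 'R');
}

static int fake_lock(struct fake_mddev *m)
{
	if (m->fail_lock)
		return -EINTR;
	m->locked = true;
	note(m, 'L');
	return 0;
}

static void fake_unlock(struct fake_mddev *m)
{
	m->locked = false;
	note(m, 'U');
}

/* Suspend, then lock; undo the suspend if the lock cannot be taken. */
static int fake_suspend_and_lock(struct fake_mddev *m)
{
	int ret = fake_suspend(m);

	if (ret)
		return ret;
	ret = fake_lock(m);
	if (ret)
		fake_resume(m);
	return ret;
}

/* Release in reverse order: unlock first, then resume. */
static void fake_unlock_and_resume(struct fake_mddev *m)
{
	fake_unlock(m);
	fake_resume(m);
}

static int reconfigure(struct fake_mddev *m)
{
	int err = fake_suspend_and_lock(m);

	if (err)
		return err;
	assert(m->suspended && m->locked);	/* io is stopped while we reconfigure */
	note(m, 'C');
	fake_unlock_and_resume(m);
	return 0;
}
```

Calling reconfigure() on a zero-initialized `struct fake_mddev` leaves the trace "SLCUR"; with `fail_lock` set it returns -EINTR and the trace is "SR", so the array never stays suspended after a failed lock.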

This patchset was tested in my VM with the mdadm testsuite on loop devices,
except for the 10ddf tests (they already failed before this patchset).

A lot of cleanups will follow after this patchset.

Yu Kuai (25):
md: use READ_ONCE/WRITE_ONCE for 'suspend_lo' and 'suspend_hi'
md: replace is_md_suspended() with 'mddev->suspended' in
md_check_recovery()
md: add new helpers to suspend/resume array
md: add new helpers to suspend/resume and lock/unlock array
md: use new apis to suspend array for suspend_lo/hi_store()
md: use new apis to suspend array for level_store()
md: use new apis to suspend array for serialize_policy_store()
md/dm-raid: use new apis to suspend array
md/md-bitmap: use new apis to suspend array for location_store()
md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
md/raid5-cache: use new apis to suspend array for
r5c_disable_writeback_async()
md/raid5-cache: use new apis to suspend array for
r5c_journal_mode_store()
md/raid5: use new apis to suspend array for raid5_store_stripe_size()
md/raid5: use new apis to suspend array for raid5_store_skip_copy()
md/raid5: use new apis to suspend array for
raid5_store_group_thread_cnt()
md/raid5: use new apis to suspend array for
raid5_change_consistency_policy()
md/raid5: replace suspend with quiesce() callback
md: use new apis to suspend array for ioctls involed array
reconfiguration
md: use new apis to suspend array for adding/removing rdev from
state_store()
md: use new apis to suspend array before
mddev_create/destroy_serial_pool
md: cleanup mddev_create/destroy_serial_pool()
md/md-linear: cleanup linear_add()
md: suspend array in md_start_sync() if array need reconfiguration
md: remove old apis to suspend the array
md: rename __mddev_suspend/resume() back to mddev_suspend/resume()

drivers/md/dm-raid.c | 10 +-
drivers/md/md-autodetect.c | 4 +-
drivers/md/md-bitmap.c | 18 ++-
drivers/md/md-linear.c | 2 -
drivers/md/md.c | 233 ++++++++++++++++++++-----------------
drivers/md/md.h | 43 +++++--
drivers/md/raid5-cache.c | 64 +++++-----
drivers/md/raid5.c | 56 ++++-----
8 files changed, 226 insertions(+), 204 deletions(-)

--
2.39.2


2023-09-28 06:33:26

by Yu Kuai

Subject: [PATCH -next v3 07/25] md: use new apis to suspend array for serialize_policy_store()

From: Yu Kuai <[email protected]>

Convert to the new apis; the old apis will be removed eventually.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 740c477a6149..0c5a6169453c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5573,7 +5573,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
if (value == mddev->serialize_policy)
return len;

- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
if (mddev->pers == NULL || (mddev->pers->level != 1)) {
@@ -5582,15 +5582,13 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
goto unlock;
}

- mddev_suspend(mddev);
if (value)
mddev_create_serial_pool(mddev, NULL, true);
else
mddev_destroy_serial_pool(mddev, NULL, true);
mddev->serialize_policy = value;
- mddev_resume(mddev);
unlock:
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return err ?: len;
}

--
2.39.2

2023-09-28 06:33:28

by Yu Kuai

Subject: [PATCH -next v3 25/25] md: rename __mddev_suspend/resume() back to mddev_suspend/resume()

From: Yu Kuai <[email protected]>

Now that the old apis are removed, __mddev_suspend/resume() can be
renamed to their original names.

This is done by:

sed -i "s/__mddev_suspend/mddev_suspend/g" *.[ch]
sed -i "s/__mddev_resume/mddev_resume/g" *.[ch]

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 4 ++--
drivers/md/md.c | 18 +++++++++---------
drivers/md/md.h | 12 ++++++------
drivers/md/raid5-cache.c | 4 ++--
4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 05dd6ccf6f48..a4692f8f98ee 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3797,7 +3797,7 @@ static void raid_postsuspend(struct dm_target *ti)
if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery))
md_stop_writes(&rs->md);

- __mddev_suspend(&rs->md, false);
+ mddev_suspend(&rs->md, false);
}
}

@@ -4009,7 +4009,7 @@ static int raid_preresume(struct dm_target *ti)
}

/* Check for any resize/reshape on @rs and adjust/initiate */
- /* Be prepared for __mddev_resume() in raid_resume() */
+ /* Be prepared for mddev_resume() in raid_resume() */
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (mddev->recovery_cp && mddev->recovery_cp < MaxSector) {
set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 271d3f336026..b711eaf53e41 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -422,7 +422,7 @@ static void md_submit_bio(struct bio *bio)
* Make sure no new requests are submitted to the device, and any requests that
* have been submitted are completely handled.
*/
-int __mddev_suspend(struct mddev *mddev, bool interruptible)
+int mddev_suspend(struct mddev *mddev, bool interruptible)
{
int err = 0;

@@ -473,9 +473,9 @@ int __mddev_suspend(struct mddev *mddev, bool interruptible)
mutex_unlock(&mddev->suspend_mutex);
return 0;
}
-EXPORT_SYMBOL_GPL(__mddev_suspend);
+EXPORT_SYMBOL_GPL(mddev_suspend);

-void __mddev_resume(struct mddev *mddev)
+void mddev_resume(struct mddev *mddev)
{
lockdep_assert_not_held(&mddev->reconfig_mutex);

@@ -486,7 +486,7 @@ void __mddev_resume(struct mddev *mddev)
return;
}

- /* entred the memalloc scope from __mddev_suspend() */
+ /* entred the memalloc scope from mddev_suspend() */
memalloc_noio_restore(mddev->noio_flag);

percpu_ref_resurrect(&mddev->active_io);
@@ -498,7 +498,7 @@ void __mddev_resume(struct mddev *mddev)

mutex_unlock(&mddev->suspend_mutex);
}
-EXPORT_SYMBOL_GPL(__mddev_resume);
+EXPORT_SYMBOL_GPL(mddev_resume);

/*
* Generic flush handling for md
@@ -5216,12 +5216,12 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
if (new != (sector_t)new)
return -EINVAL;

- err = __mddev_suspend(mddev, true);
+ err = mddev_suspend(mddev, true);
if (err)
return err;

WRITE_ONCE(mddev->suspend_lo, new);
- __mddev_resume(mddev);
+ mddev_resume(mddev);

return len;
}
@@ -5247,12 +5247,12 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
if (new != (sector_t)new)
return -EINVAL;

- err = __mddev_suspend(mddev, true);
+ err = mddev_suspend(mddev, true);
if (err)
return err;

WRITE_ONCE(mddev->suspend_hi, new);
- __mddev_resume(mddev);
+ mddev_resume(mddev);

return len;
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4c5f3f032656..55d01d431418 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -804,8 +804,8 @@ extern int md_rdev_init(struct md_rdev *rdev);
extern void md_rdev_clear(struct md_rdev *rdev);

extern void md_handle_request(struct mddev *mddev, struct bio *bio);
-extern int __mddev_suspend(struct mddev *mddev, bool interruptible);
-extern void __mddev_resume(struct mddev *mddev);
+extern int mddev_suspend(struct mddev *mddev, bool interruptible);
+extern void mddev_resume(struct mddev *mddev);

extern void md_reload_sb(struct mddev *mddev, int raid_disk);
extern void md_update_sb(struct mddev *mddev, int force);
@@ -853,27 +853,27 @@ static inline int mddev_suspend_and_lock(struct mddev *mddev)
{
int ret;

- ret = __mddev_suspend(mddev, true);
+ ret = mddev_suspend(mddev, true);
if (ret)
return ret;

ret = mddev_lock(mddev);
if (ret)
- __mddev_resume(mddev);
+ mddev_resume(mddev);

return ret;
}

static inline void mddev_suspend_and_lock_nointr(struct mddev *mddev)
{
- __mddev_suspend(mddev, false);
+ mddev_suspend(mddev, false);
mutex_lock(&mddev->reconfig_mutex);
}

static inline void mddev_unlock_and_resume(struct mddev *mddev)
{
mddev_unlock(mddev);
- __mddev_resume(mddev);
+ mddev_resume(mddev);
}

struct mdu_array_info_s;
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9909110262ee..6157f5beb9fe 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -699,9 +699,9 @@ static void r5c_disable_writeback_async(struct work_struct *work)

log = READ_ONCE(conf->log);
if (log) {
- __mddev_suspend(mddev, false);
+ mddev_suspend(mddev, false);
log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
- __mddev_resume(mddev);
+ mddev_resume(mddev);
}
}

--
2.39.2

2023-09-28 06:33:28

by Yu Kuai

Subject: [PATCH -next v3 11/25] md/raid5-cache: use new apis to suspend array for r5c_disable_writeback_async()

From: Yu Kuai <[email protected]>

Convert to the new apis; the old apis will be removed eventually.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid5-cache.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 889bba60d6ff..01d33e5c19c1 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -686,7 +686,6 @@ static void r5c_disable_writeback_async(struct work_struct *work)
disable_writeback_work);
struct mddev *mddev = log->rdev->mddev;
struct r5conf *conf = mddev->private;
- int locked = 0;

if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
return;
@@ -696,13 +695,13 @@ static void r5c_disable_writeback_async(struct work_struct *work)
/* wait superblock change before suspend */
wait_event(mddev->sb_wait,
!READ_ONCE(conf->log) ||
- (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
- (locked = mddev_trylock(mddev))));
- if (locked) {
- mddev_suspend(mddev);
+ !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
+
+ log = READ_ONCE(conf->log);
+ if (log) {
+ __mddev_suspend(mddev, false);
log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
- mddev_resume(mddev);
- mddev_unlock(mddev);
+ __mddev_resume(mddev);
}
}

--
2.39.2

2023-09-28 06:33:28

by Yu Kuai

Subject: [PATCH -next v3 14/25] md/raid5: use new apis to suspend array for raid5_store_skip_copy()

From: Yu Kuai <[email protected]>

Convert to the new apis; the old apis will be removed eventually.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid5.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f1c32b4d190f..c937716fed01 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7151,7 +7151,7 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len)
return -EINVAL;
new = !!new;

- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
conf = mddev->private;
@@ -7160,15 +7160,13 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len)
else if (new != conf->skip_copy) {
struct request_queue *q = mddev->queue;

- mddev_suspend(mddev);
conf->skip_copy = new;
if (new)
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
else
blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
- mddev_resume(mddev);
}
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return err ?: len;
}

--
2.39.2

2023-09-28 06:33:28

by Yu Kuai

Subject: [PATCH -next v3 16/25] md/raid5: use new apis to suspend array for raid5_change_consistency_policy()

From: Yu Kuai <[email protected]>

Convert to the new apis; the old apis will be removed eventually.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid5.c | 19 ++++++-------------
1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8060d29e99d2..e6b8c0145648 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8967,12 +8967,12 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
struct r5conf *conf;
int err;

- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
conf = mddev->private;
if (!conf) {
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return -ENODEV;
}

@@ -8982,19 +8982,14 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
err = log_init(conf, NULL, true);
if (!err) {
err = resize_stripes(conf, conf->pool_size);
- if (err) {
- mddev_suspend(mddev);
+ if (err)
log_exit(conf);
- mddev_resume(mddev);
- }
}
} else
err = -EINVAL;
} else if (strncmp(buf, "resync", 6) == 0) {
if (raid5_has_ppl(conf)) {
- mddev_suspend(mddev);
log_exit(conf);
- mddev_resume(mddev);
err = resize_stripes(conf, conf->pool_size);
} else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) &&
r5l_log_disk_error(conf)) {
@@ -9007,11 +9002,9 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
break;
}

- if (!journal_dev_exists) {
- mddev_suspend(mddev);
+ if (!journal_dev_exists)
clear_bit(MD_HAS_JOURNAL, &mddev->flags);
- mddev_resume(mddev);
- } else /* need remove journal device first */
+ else /* need remove journal device first */
err = -EBUSY;
} else
err = -EINVAL;
@@ -9022,7 +9015,7 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
if (!err)
md_update_sb(mddev, 1);

- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);

return err;
}
--
2.39.2

2023-09-28 06:33:32

by Yu Kuai

Subject: [PATCH -next v3 20/25] md: use new apis to suspend array before mddev_create/destroy_serial_pool

From: Yu Kuai <[email protected]>

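mddev_create/destroy_serial_pool() is called from several places, and it
calls mddev_suspend() internally unless the caller reports that the array
is already suspended.

Suspend the array in those callers up front, in preparation for removing
mddev_suspend() from mddev_create/destroy_serial_pool().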

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md-autodetect.c | 4 ++--
drivers/md/md-bitmap.c | 8 ++++----
drivers/md/md.c | 22 ++++++++++++----------
3 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md-autodetect.c b/drivers/md/md-autodetect.c
index 6eaa0eab40f9..4b80165afd23 100644
--- a/drivers/md/md-autodetect.c
+++ b/drivers/md/md-autodetect.c
@@ -175,7 +175,7 @@ static void __init md_setup_drive(struct md_setup_args *args)
return;
}

- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err) {
pr_err("md: failed to lock array %s\n", name);
goto out_mddev_put;
@@ -221,7 +221,7 @@ static void __init md_setup_drive(struct md_setup_args *args)
if (err)
pr_warn("md: starting %s failed\n", name);
out_unlock:
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
out_mddev_put:
mddev_put(mddev);
}
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 7d21e2a5b06e..b3d701c5c461 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2537,7 +2537,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
if (backlog > COUNTER_MAX)
return -EINVAL;

- rv = mddev_lock(mddev);
+ rv = mddev_suspend_and_lock(mddev);
if (rv)
return rv;

@@ -2562,16 +2562,16 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
if (!backlog && mddev->serial_info_pool) {
/* serial_info_pool is not needed if backlog is zero */
if (!mddev->serialize_policy)
- mddev_destroy_serial_pool(mddev, NULL, false);
+ mddev_destroy_serial_pool(mddev, NULL, true);
} else if (backlog && !mddev->serial_info_pool) {
/* serial_info_pool is needed since backlog is not zero */
rdev_for_each(rdev, mddev)
- mddev_create_serial_pool(mddev, rdev, false);
+ mddev_create_serial_pool(mddev, rdev, true);
}
if (old_mwb != backlog)
md_bitmap_update_sb(mddev->bitmap);

- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return len;
}

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c5fb75b066b5..f8d92d745105 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2557,7 +2557,7 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
pr_debug("md: bind<%s>\n", b);

if (mddev->raid_disks)
- mddev_create_serial_pool(mddev, rdev, false);
+ mddev_create_serial_pool(mddev, rdev, true);

if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
goto fail;
@@ -3077,11 +3077,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
}
} else if (cmd_match(buf, "writemostly")) {
set_bit(WriteMostly, &rdev->flags);
- mddev_create_serial_pool(rdev->mddev, rdev, false);
+ mddev_create_serial_pool(rdev->mddev, rdev, true);
need_update_sb = true;
err = 0;
} else if (cmd_match(buf, "-writemostly")) {
- mddev_destroy_serial_pool(rdev->mddev, rdev, false);
+ mddev_destroy_serial_pool(rdev->mddev, rdev, true);
clear_bit(WriteMostly, &rdev->flags);
need_update_sb = true;
err = 0;
@@ -3707,7 +3707,9 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
if (entry->store == state_store) {
if (cmd_match(page, "remove"))
kn = sysfs_break_active_protection(kobj, attr);
- if (cmd_match(page, "remove") || cmd_match(page, "re-add"))
+ if (cmd_match(page, "remove") || cmd_match(page, "re-add") ||
+ cmd_match(page, "writemostly") ||
+ cmd_match(page, "-writemostly"))
suspend = true;
}

@@ -4681,7 +4683,7 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
minor != MINOR(dev))
return -EOVERFLOW;

- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;
if (mddev->persistent) {
@@ -4702,14 +4704,14 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
rdev = md_import_device(dev, -1, -1);

if (IS_ERR(rdev)) {
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return PTR_ERR(rdev);
}
err = bind_rdev_to_array(rdev, mddev);
out:
if (err)
export_rdev(rdev, mddev);
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
if (!err)
md_new_event();
return err ? err : len;
@@ -6646,13 +6648,13 @@ static void autorun_devices(int part)
if (IS_ERR(mddev))
break;

- if (mddev_lock(mddev))
+ if (mddev_suspend_and_lock(mddev))
pr_warn("md: %s locked, cannot run\n", mdname(mddev));
else if (mddev->raid_disks || mddev->major_version
|| !list_empty(&mddev->disks)) {
pr_warn("md: %s already running, cannot run %pg\n",
mdname(mddev), rdev0->bdev);
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
} else {
pr_debug("md: created %s\n", mdname(mddev));
mddev->persistent = 1;
@@ -6662,7 +6664,7 @@ static void autorun_devices(int part)
export_rdev(rdev, mddev);
}
autorun_array(mddev);
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
}
/* on success, candidates will be empty, on error
* it won't...
--
2.39.2

2023-09-28 06:33:34

by Yu Kuai

Subject: [PATCH -next v3 18/25] md: use new apis to suspend array for ioctls involed array reconfiguration

From: Yu Kuai <[email protected]>

'reconfig_mutex' is grabbed before these ioctls run; suspend the array
before taking the lock, so that io won't run concurrently with array
reconfiguration through ioctls.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 30 ++++++++++++++++++++----------
1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0c5a6169453c..957813b7d7e5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7209,7 +7209,6 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
struct bitmap *bitmap;

bitmap = md_bitmap_create(mddev, -1);
- mddev_suspend(mddev);
if (!IS_ERR(bitmap)) {
mddev->bitmap = bitmap;
err = md_bitmap_load(mddev);
@@ -7219,11 +7218,8 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
md_bitmap_destroy(mddev);
fd = -1;
}
- mddev_resume(mddev);
} else if (fd < 0) {
- mddev_suspend(mddev);
md_bitmap_destroy(mddev);
- mddev_resume(mddev);
}
}
if (fd < 0) {
@@ -7512,7 +7508,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
mddev->bitmap_info.space =
mddev->bitmap_info.default_space;
bitmap = md_bitmap_create(mddev, -1);
- mddev_suspend(mddev);
if (!IS_ERR(bitmap)) {
mddev->bitmap = bitmap;
rv = md_bitmap_load(mddev);
@@ -7520,7 +7515,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
rv = PTR_ERR(bitmap);
if (rv)
md_bitmap_destroy(mddev);
- mddev_resume(mddev);
} else {
/* remove the bitmap */
if (!mddev->bitmap) {
@@ -7545,9 +7539,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
module_put(md_cluster_mod);
mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
}
- mddev_suspend(mddev);
md_bitmap_destroy(mddev);
- mddev_resume(mddev);
mddev->bitmap_info.offset = 0;
}
}
@@ -7618,6 +7610,20 @@ static inline bool md_ioctl_valid(unsigned int cmd)
}
}

+static bool md_ioctl_need_suspend(unsigned int cmd)
+{
+ switch (cmd) {
+ case ADD_NEW_DISK:
+ case HOT_ADD_DISK:
+ case HOT_REMOVE_DISK:
+ case SET_BITMAP_FILE:
+ case SET_ARRAY_INFO:
+ return true;
+ default:
+ return false;
+ }
+}
+
static int __md_set_array_info(struct mddev *mddev, void __user *argp)
{
mdu_array_info_t info;
@@ -7750,7 +7756,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
if (!md_is_rdwr(mddev))
flush_work(&mddev->sync_work);

- err = mddev_lock(mddev);
+ err = md_ioctl_need_suspend(cmd) ? mddev_suspend_and_lock(mddev) :
+ mddev_lock(mddev);
if (err) {
pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n",
err, cmd);
@@ -7878,7 +7885,10 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
if (mddev->hold_active == UNTIL_IOCTL &&
err != -EINVAL)
mddev->hold_active = 0;
- mddev_unlock(mddev);
+
+ md_ioctl_need_suspend(cmd) ? mddev_unlock_and_resume(mddev) :
+ mddev_unlock(mddev);
+
out:
if(did_set_md_closing)
clear_bit(MD_CLOSING, &mddev->flags);
--
2.39.2
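
The per-command decision in patch 18 can be sketched in userspace C as follows. This is a hedged illustration, not the kernel code: the `FAKE_*` command values, `struct fake_array`, and `run_ioctl()` are hypothetical names, and the depth counters stand in for the real suspend and lock state. One difference is deliberate: the sketch evaluates the predicate once and reuses the result, whereas the patch calls md_ioctl_need_suspend(cmd) at both the lock and the unlock site.

```c
/* Userspace sketch only: FAKE_* values and fake_array are hypothetical. */
#include <assert.h>
#include <stdbool.h>

enum fake_cmd {
	FAKE_GET_ARRAY_INFO,	/* read-only query, no reconfiguration */
	FAKE_ADD_NEW_DISK,
	FAKE_HOT_ADD_DISK,
	FAKE_HOT_REMOVE_DISK,
	FAKE_SET_BITMAP_FILE,
	FAKE_SET_ARRAY_INFO,
};

/* Only commands that reconfigure the array need io to be stopped. */
static bool need_suspend(enum fake_cmd cmd)
{
	switch (cmd) {
	case FAKE_ADD_NEW_DISK:
	case FAKE_HOT_ADD_DISK:
	case FAKE_HOT_REMOVE_DISK:
	case FAKE_SET_BITMAP_FILE:
	case FAKE_SET_ARRAY_INFO:
		return true;
	default:
		return false;
	}
}

struct fake_array {
	int suspend_depth;	/* > 0 while io is stopped */
	int lock_depth;		/* > 0 while the reconfig lock is held */
	int suspends_seen;	/* how many times a suspend was taken */
};

static int run_ioctl(struct fake_array *a, enum fake_cmd cmd)
{
	/* decide once, so the entry and exit paths always pair up */
	bool suspend = need_suspend(cmd);

	if (suspend) {
		a->suspend_depth++;
		a->suspends_seen++;
	}
	a->lock_depth++;

	/* the command body would run here, under the lock */
	assert(a->lock_depth == 1);

	a->lock_depth--;
	if (suspend)
		a->suspend_depth--;
	return 0;
}
```

Running a query-style command through run_ioctl() leaves `suspends_seen` at 0, while a reconfiguring command such as FAKE_HOT_ADD_DISK raises it to 1; in both cases the depth counters return to 0 afterwards.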

2023-09-28 06:33:35

by Yu Kuai

Subject: [PATCH -next v3 19/25] md: use new apis to suspend array for adding/removing rdev from state_store()

From: Yu Kuai <[email protected]>

Users can write 'remove' and 're-add' to trigger array reconfiguration
through sysfs; suspend the array in this case so that io won't run
concurrently with array reconfiguration.

And now that all the callers of add_bound_rdev() already suspend the
array, remove mddev_suspend/resume() from add_bound_rdev() as well.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 957813b7d7e5..c5fb75b066b5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2940,11 +2940,7 @@ static int add_bound_rdev(struct md_rdev *rdev)
*/
super_types[mddev->major_version].
validate_super(mddev, rdev);
- if (add_journal)
- mddev_suspend(mddev);
err = mddev->pers->hot_add_disk(mddev, rdev);
- if (add_journal)
- mddev_resume(mddev);
if (err) {
md_kick_rdev_from_array(rdev);
return err;
@@ -3697,6 +3693,7 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
struct kernfs_node *kn = NULL;
+ bool suspend = false;
ssize_t rv;
struct mddev *mddev = rdev->mddev;

@@ -3704,17 +3701,23 @@ rdev_attr_store(struct kobject *kobj, struct attribute *attr,
return -EIO;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
+ if (!mddev)
+ return -ENODEV;

- if (entry->store == state_store && cmd_match(page, "remove"))
- kn = sysfs_break_active_protection(kobj, attr);
+ if (entry->store == state_store) {
+ if (cmd_match(page, "remove"))
+ kn = sysfs_break_active_protection(kobj, attr);
+ if (cmd_match(page, "remove") || cmd_match(page, "re-add"))
+ suspend = true;
+ }

- rv = mddev ? mddev_lock(mddev) : -ENODEV;
+ rv = suspend ? mddev_suspend_and_lock(mddev) : mddev_lock(mddev);
if (!rv) {
if (rdev->mddev == NULL)
rv = -ENODEV;
else
rv = entry->store(rdev, page, length);
- mddev_unlock(mddev);
+ suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
}

if (kn)
--
2.39.2
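
The sysfs-side decision in patch 19 (suspend only for writes that reconfigure the array) can be sketched in userspace C. This is a rough sketch under assumptions: `fake_cmd_match()` is a hypothetical simplified stand-in for the kernel's cmd_match(), assumed here to accept an exact match with at most one trailing newline, and `state_write_needs_suspend()` is an illustrative name that does not appear in the series. It covers only the two commands handled at this point in the series.

```c
/* Userspace sketch only: both function names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-in for cmd_match(): the written buffer must equal the command,
 * optionally followed by a single trailing newline.
 */
static bool fake_cmd_match(const char *buf, const char *cmd)
{
	while (*buf && *cmd && *buf == *cmd) {
		buf++;
		cmd++;
	}
	if (*buf == '\n')
		buf++;
	return *buf == '\0' && *cmd == '\0';
}

/* True only for state writes that add or remove an rdev. */
static bool state_write_needs_suspend(const char *page)
{
	return fake_cmd_match(page, "remove") ||
	       fake_cmd_match(page, "re-add");
}
```

With this predicate, a write of "remove\n" or "re-add" takes the suspend-and-lock path, while any other state string (for example "faulty\n") takes the plain lock path.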

2023-09-28 06:33:36

by Yu Kuai

Subject: [PATCH -next v3 13/25] md/raid5: use new apis to suspend array for raid5_store_stripe_size()

From: Yu Kuai <[email protected]>

Convert to the new apis; the old apis will be removed eventually.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid5.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6383723468e5..f1c32b4d190f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7025,7 +7025,7 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len)
new != roundup_pow_of_two(new))
return -EINVAL;

- err = mddev_lock(mddev);
+ err = mddev_suspend_and_lock(mddev);
if (err)
return err;

@@ -7049,7 +7049,6 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len)
goto out_unlock;
}

- mddev_suspend(mddev);
mutex_lock(&conf->cache_size_mutex);
size = conf->max_nr_stripes;

@@ -7064,10 +7063,9 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len)
err = -ENOMEM;
}
mutex_unlock(&conf->cache_size_mutex);
- mddev_resume(mddev);

out_unlock:
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return err ?: len;
}

--
2.39.2

2023-09-28 06:36:01

by Yu Kuai

Subject: [PATCH -next v3 21/25] md: cleanup mddev_create/destroy_serial_pool()

From: Yu Kuai <[email protected]>

Now that all the callers, except for stopping the array, already suspend
the array, there is no need to suspend it again; hence remove the
'is_suspend' parameter.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md-bitmap.c | 8 ++++----
drivers/md/md.c | 33 ++++++++++-----------------------
drivers/md/md.h | 7 +++----
3 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index b3d701c5c461..9672f75c3050 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1861,7 +1861,7 @@ void md_bitmap_destroy(struct mddev *mddev)

md_bitmap_wait_behind_writes(mddev);
if (!mddev->serialize_policy)
- mddev_destroy_serial_pool(mddev, NULL, true);
+ mddev_destroy_serial_pool(mddev, NULL);

mutex_lock(&mddev->bitmap_info.mutex);
spin_lock(&mddev->lock);
@@ -1977,7 +1977,7 @@ int md_bitmap_load(struct mddev *mddev)
goto out;

rdev_for_each(rdev, mddev)
- mddev_create_serial_pool(mddev, rdev, true);
+ mddev_create_serial_pool(mddev, rdev);

if (mddev_is_clustered(mddev))
md_cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes);
@@ -2562,11 +2562,11 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
if (!backlog && mddev->serial_info_pool) {
/* serial_info_pool is not needed if backlog is zero */
if (!mddev->serialize_policy)
- mddev_destroy_serial_pool(mddev, NULL, true);
+ mddev_destroy_serial_pool(mddev, NULL);
} else if (backlog && !mddev->serial_info_pool) {
/* serial_info_pool is needed since backlog is not zero */
rdev_for_each(rdev, mddev)
- mddev_create_serial_pool(mddev, rdev, true);
+ mddev_create_serial_pool(mddev, rdev);
}
if (old_mwb != backlog)
md_bitmap_update_sb(mddev->bitmap);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f8d92d745105..c3a51d309063 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -206,8 +206,7 @@ static int rdev_need_serial(struct md_rdev *rdev)
* 1. rdev is the first device which return true from rdev_enable_serial.
* 2. rdev is NULL, means we want to enable serialization for all rdevs.
*/
-void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
- bool is_suspend)
+void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
{
int ret = 0;

@@ -215,15 +214,12 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
!test_bit(CollisionCheck, &rdev->flags))
return;

- if (!is_suspend)
- mddev_suspend(mddev);
-
if (!rdev)
ret = rdevs_init_serial(mddev);
else
ret = rdev_init_serial(rdev);
if (ret)
- goto abort;
+ return;

if (mddev->serial_info_pool == NULL) {
/*
@@ -238,10 +234,6 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
pr_err("can't alloc memory pool for serialization\n");
}
}
-
-abort:
- if (!is_suspend)
- mddev_resume(mddev);
}

/*
@@ -250,8 +242,7 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
* 2. when bitmap is destroyed while policy is not enabled.
* 3. for disable policy, the pool is destroyed only when no rdev needs it.
*/
-void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
- bool is_suspend)
+void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
{
if (rdev && !test_bit(CollisionCheck, &rdev->flags))
return;
@@ -260,8 +251,6 @@ void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
struct md_rdev *temp;
int num = 0; /* used to track if other rdevs need the pool */

- if (!is_suspend)
- mddev_suspend(mddev);
rdev_for_each(temp, mddev) {
if (!rdev) {
if (!mddev->serialize_policy ||
@@ -283,8 +272,6 @@ void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
mempool_destroy(mddev->serial_info_pool);
mddev->serial_info_pool = NULL;
}
- if (!is_suspend)
- mddev_resume(mddev);
}
}

@@ -2557,7 +2544,7 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
pr_debug("md: bind<%s>\n", b);

if (mddev->raid_disks)
- mddev_create_serial_pool(mddev, rdev, true);
+ mddev_create_serial_pool(mddev, rdev);

if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
goto fail;
@@ -2610,7 +2597,7 @@ static void md_kick_rdev_from_array(struct md_rdev *rdev)
bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk);
list_del_rcu(&rdev->same_set);
pr_debug("md: unbind<%pg>\n", rdev->bdev);
- mddev_destroy_serial_pool(rdev->mddev, rdev, false);
+ mddev_destroy_serial_pool(rdev->mddev, rdev);
rdev->mddev = NULL;
sysfs_remove_link(&rdev->kobj, "block");
sysfs_put(rdev->sysfs_state);
@@ -3077,11 +3064,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
}
} else if (cmd_match(buf, "writemostly")) {
set_bit(WriteMostly, &rdev->flags);
- mddev_create_serial_pool(rdev->mddev, rdev, true);
+ mddev_create_serial_pool(rdev->mddev, rdev);
need_update_sb = true;
err = 0;
} else if (cmd_match(buf, "-writemostly")) {
- mddev_destroy_serial_pool(rdev->mddev, rdev, true);
+ mddev_destroy_serial_pool(rdev->mddev, rdev);
clear_bit(WriteMostly, &rdev->flags);
need_update_sb = true;
err = 0;
@@ -5588,9 +5575,9 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
}

if (value)
- mddev_create_serial_pool(mddev, NULL, true);
+ mddev_create_serial_pool(mddev, NULL);
else
- mddev_destroy_serial_pool(mddev, NULL, true);
+ mddev_destroy_serial_pool(mddev, NULL);
mddev->serialize_policy = value;
unlock:
mddev_unlock_and_resume(mddev);
@@ -6356,7 +6343,7 @@ static void __md_stop_writes(struct mddev *mddev)
}
/* disable policy to guarantee rdevs free resources for serialization */
mddev->serialize_policy = 0;
- mddev_destroy_serial_pool(mddev, NULL, true);
+ mddev_destroy_serial_pool(mddev, NULL);
}

void md_stop_writes(struct mddev *mddev)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5c8f3f045e78..63b4c393b1ee 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -817,10 +817,9 @@ extern void __mddev_resume(struct mddev *mddev);

extern void md_reload_sb(struct mddev *mddev, int raid_disk);
extern void md_update_sb(struct mddev *mddev, int force);
-extern void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
- bool is_suspend);
-extern void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
- bool is_suspend);
+extern void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev);
+extern void mddev_destroy_serial_pool(struct mddev *mddev,
+ struct md_rdev *rdev);
struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
struct md_rdev *md_find_rdev_rcu(struct mddev *mddev, dev_t dev);

--
2.39.2

2023-09-28 06:36:57

by Yu Kuai

Subject: [PATCH -next v3 09/25] md/md-bitmap: use new apis to suspend array for location_store()

From: Yu Kuai <[email protected]>

Convert to the new apis; the old apis will be removed eventually.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md-bitmap.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 0c661e5036bb..7d21e2a5b06e 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2348,11 +2348,10 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
{
int rv;

- rv = mddev_lock(mddev);
+ rv = mddev_suspend_and_lock(mddev);
if (rv)
return rv;

- mddev_suspend(mddev);
if (mddev->pers) {
if (mddev->recovery || mddev->sync_thread) {
rv = -EBUSY;
@@ -2429,8 +2428,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
}
rv = 0;
out:
- mddev_resume(mddev);
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
if (rv)
return rv;
return len;
--
2.39.2

2023-09-28 06:37:27

by Yu Kuai

Subject: [PATCH -next v3 12/25] md/raid5-cache: use new apis to suspend array for r5c_journal_mode_store()

From: Yu Kuai <[email protected]>

r5c_journal_mode_set() suspends the array itself and it has only 2 callers;
the other caller, raid_ctl(), already suspends the array with the new apis.

This is not a hot path, so performance is not a concern.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid5-cache.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 01d33e5c19c1..9909110262ee 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2585,9 +2585,7 @@ int r5c_journal_mode_set(struct mddev *mddev, int mode)
mode == R5C_JOURNAL_MODE_WRITE_BACK)
return -EINVAL;

- mddev_suspend(mddev);
conf->log->r5c_journal_mode = mode;
- mddev_resume(mddev);

pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
mdname(mddev), mode, r5c_journal_mode_str[mode]);
@@ -2612,11 +2610,11 @@ static ssize_t r5c_journal_mode_store(struct mddev *mddev,
if (strlen(r5c_journal_mode_str[mode]) == len &&
!strncmp(page, r5c_journal_mode_str[mode], len))
break;
- ret = mddev_lock(mddev);
+ ret = mddev_suspend_and_lock(mddev);
if (ret)
return ret;
ret = r5c_journal_mode_set(mddev, mode);
- mddev_unlock(mddev);
+ mddev_unlock_and_resume(mddev);
return ret ?: length;
}

--
2.39.2

2023-09-28 06:37:33

by Yu Kuai

Subject: [PATCH -next v3 04/25] md: add new helpers to suspend/resume and lock/unlock array

From: Yu Kuai <[email protected]>

The new helpers suspend the array first and then lock the array.

Prepare to refactor from:

mddev_lock/lock_nointr
mddev_suspend
...
mddev_resume
mddev_unlock

To:

mddev_suspend_and_lock/lock_nointr
...
mddev_unlock_and_resume

After all the use cases are refactored, mddev_suspend/resume() will be
removed.

mddev_suspend_and_lock() will also replace mddev_lock() in the cases
where the array will be reconfigured, in order to synchronize with io and
prevent problems in many corner cases.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index b5894dc64615..5c8f3f045e78 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -858,6 +858,33 @@ static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio
mddev->queue->limits.max_write_zeroes_sectors = 0;
}

+static inline int mddev_suspend_and_lock(struct mddev *mddev)
+{
+ int ret;
+
+ ret = __mddev_suspend(mddev, true);
+ if (ret)
+ return ret;
+
+ ret = mddev_lock(mddev);
+ if (ret)
+ __mddev_resume(mddev);
+
+ return ret;
+}
+
+static inline void mddev_suspend_and_lock_nointr(struct mddev *mddev)
+{
+ __mddev_suspend(mddev, false);
+ mutex_lock(&mddev->reconfig_mutex);
+}
+
+static inline void mddev_unlock_and_resume(struct mddev *mddev)
+{
+ mddev_unlock(mddev);
+ __mddev_resume(mddev);
+}
+
struct mdu_array_info_s;
struct mdu_disk_info_s;

--
2.39.2

2023-09-28 06:37:34

by Yu Kuai

Subject: [PATCH -next v3 24/25] md: remove old apis to suspend the array

From: Yu Kuai <[email protected]>

Now that mddev_suspend() and mddev_resume() are not used anywhere, remove
them, and remove 'MD_ALLOW_SB_UPDATE' and 'MD_UPDATING_SB' as well.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md.c | 82 ++-----------------------------------------------
drivers/md/md.h | 8 -----
2 files changed, 3 insertions(+), 87 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a3b62c6c5332..271d3f336026 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -418,74 +418,10 @@ static void md_submit_bio(struct bio *bio)
md_handle_request(mddev, bio);
}

-/* mddev_suspend makes sure no new requests are submitted
- * to the device, and that any requests that have been submitted
- * are completely handled.
- * Once mddev_detach() is called and completes, the module will be
- * completely unused.
+/*
+ * Make sure no new requests are submitted to the device, and any requests that
+ * have been submitted are completely handled.
*/
-void mddev_suspend(struct mddev *mddev)
-{
- struct md_thread *thread = rcu_dereference_protected(mddev->thread,
- lockdep_is_held(&mddev->reconfig_mutex));
-
- WARN_ON_ONCE(thread && current == thread->tsk);
-
- /* can't concurrent with __mddev_suspend() and __mddev_resume() */
- mutex_lock(&mddev->suspend_mutex);
- if (mddev->suspended++) {
- mutex_unlock(&mddev->suspend_mutex);
- return;
- }
-
- wake_up(&mddev->sb_wait);
- set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
- percpu_ref_kill(&mddev->active_io);
-
- /*
- * TODO: cleanup 'pers->prepare_suspend after all callers are replaced
- * by __mddev_suspend().
- */
- if (mddev->pers && mddev->pers->prepare_suspend)
- mddev->pers->prepare_suspend(mddev);
-
- wait_event(mddev->sb_wait, percpu_ref_is_zero(&mddev->active_io));
- clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags);
- wait_event(mddev->sb_wait, !test_bit(MD_UPDATING_SB, &mddev->flags));
-
- del_timer_sync(&mddev->safemode_timer);
- /* restrict memory reclaim I/O during raid array is suspend */
- mddev->noio_flag = memalloc_noio_save();
-
- mutex_unlock(&mddev->suspend_mutex);
-}
-EXPORT_SYMBOL_GPL(mddev_suspend);
-
-void mddev_resume(struct mddev *mddev)
-{
- lockdep_assert_held(&mddev->reconfig_mutex);
-
- /* can't concurrent with __mddev_suspend() and __mddev_resume() */
- mutex_lock(&mddev->suspend_mutex);
- if (--mddev->suspended) {
- mutex_unlock(&mddev->suspend_mutex);
- return;
- }
-
- /* entred the memalloc scope from mddev_suspend() */
- memalloc_noio_restore(mddev->noio_flag);
-
- percpu_ref_resurrect(&mddev->active_io);
- wake_up(&mddev->sb_wait);
-
- set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
- md_wakeup_thread(mddev->thread);
- md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
-
- mutex_unlock(&mddev->suspend_mutex);
-}
-EXPORT_SYMBOL_GPL(mddev_resume);
-
int __mddev_suspend(struct mddev *mddev, bool interruptible)
{
int err = 0;
@@ -9500,18 +9436,6 @@ static void md_start_sync(struct work_struct *ws)
*/
void md_check_recovery(struct mddev *mddev)
{
- if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
- /* Write superblock - thread that called mddev_suspend()
- * holds reconfig_mutex for us.
- */
- set_bit(MD_UPDATING_SB, &mddev->flags);
- smp_mb__after_atomic();
- if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags))
- md_update_sb(mddev, 0);
- clear_bit_unlock(MD_UPDATING_SB, &mddev->flags);
- wake_up(&mddev->sb_wait);
- }
-
if (READ_ONCE(mddev->suspended))
return;

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 63b4c393b1ee..4c5f3f032656 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -248,10 +248,6 @@ struct md_cluster_info;
* become failed.
* @MD_HAS_PPL: The raid array has PPL feature set.
* @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
- * @MD_ALLOW_SB_UPDATE: md_check_recovery is allowed to update the metadata
- * without taking reconfig_mutex.
- * @MD_UPDATING_SB: md_check_recovery is updating the metadata without
- * explicitly holding reconfig_mutex.
* @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
* array is ready yet.
* @MD_BROKEN: This is used to stop writes and mark array as failed.
@@ -268,8 +264,6 @@ enum mddev_flags {
MD_FAILFAST_SUPPORTED,
MD_HAS_PPL,
MD_HAS_MULTIPLE_PPLS,
- MD_ALLOW_SB_UPDATE,
- MD_UPDATING_SB,
MD_NOT_READY,
MD_BROKEN,
MD_DELETED,
@@ -810,8 +804,6 @@ extern int md_rdev_init(struct md_rdev *rdev);
extern void md_rdev_clear(struct md_rdev *rdev);

extern void md_handle_request(struct mddev *mddev, struct bio *bio);
-extern void mddev_suspend(struct mddev *mddev);
-extern void mddev_resume(struct mddev *mddev);
extern int __mddev_suspend(struct mddev *mddev, bool interruptible);
extern void __mddev_resume(struct mddev *mddev);

--
2.39.2

2023-09-28 06:37:46

by Yu Kuai

Subject: [PATCH -next v3 22/25] md/md-linear: cleanup linear_add()

From: Yu Kuai <[email protected]>

Now that the caller already suspends the array, there is no need to
suspend the array in linear_add().

Note that mddev_suspend/resume() is not used anymore.

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/md-linear.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index ae2826e9645b..8eca7693b793 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -183,7 +183,6 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
* in linear_congested(), therefore kfree_rcu() is used to free
* oldconf until no one uses it anymore.
*/
- mddev_suspend(mddev);
oldconf = rcu_dereference_protected(mddev->private,
lockdep_is_held(&mddev->reconfig_mutex));
mddev->raid_disks++;
@@ -192,7 +191,6 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
rcu_assign_pointer(mddev->private, newconf);
md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
set_capacity_and_notify(mddev->gendisk, mddev->array_sectors);
- mddev_resume(mddev);
kfree_rcu(oldconf, rcu);
return 0;
}
--
2.39.2

2023-09-28 19:21:50

by Song Liu

Subject: Re: [PATCH -next v3 00/25] md: synchronize io with array reconfiguration

Hi Kuai,

Thanks for the patchset!

A few high level questions/suggestions:

1. This is a big change that needs a lot of explanation. While you managed to
keep each patch relatively small (great job btw), it is not very clear why we
need these changes. Specifically, we are adding a new mutex, it is worth
mentioning why we cannot achieve the same goal without it. Please add
more information in the cover letter. We will put part of the cover letter in
the merge commit.

2. In the cover letter, please also highlight that we are removing
MD_ALLOW_SB_UPDATE and MD_UPDATING_SB. This is a big improvement.

3. Please rearrange the patch set so that the two "READ_ONCE/WRITE_ONCE"
patches are at the beginning.

4. Please consider merging some patches. Current "add-api => use-api =>
remove-old-api" makes it tricky to follow what is being changed. For this set,
I found the diff of the whole set easier to follow than some of the big patches.

Thanks again for your hard work into this!
Song

On Wed, Sep 27, 2023 at 11:22 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
[...]

2023-10-05 15:46:03

by Song Liu

Subject: Re: [PATCH -next v3 00/25] md: synchronize io with array reconfiguration

On Wed, Oct 4, 2023 at 8:42 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/09/29 3:15, Song Liu wrote:
> > Hi Kuai,
> >
> > Thanks for the patchset!
> >
> > A few high level questions/suggestions:
>
> Thanks a lot for these!
> >
> > 1. This is a big change that needs a lot of explanation. While you managed to
> > keep each patch relatively small (great job btw), it is not very clear why we
> > need these changes. Specifically, we are adding a new mutex, it is worth
> > mentioning why we cannot achieve the same goal without it. Please add
> > more information in the cover letter. We will put part of the cover letter in
> > the merge commit.
>
> Yeah, I realize that I explain too little. I will add background and
> design.
> >
> > 2. In the cover letter, please also highlight that we are removing
> > MD_ALLOW_SB_UPDATE and MD_UPDATING_SB. This is a big improvement.
> >
>
> Okay.
> > 3. Please rearrange the patch set so that the two "READ_ONCE/WRITE_ONCE"
> > patches are at the beginning.
>
> Okay.
> >
> > 4. Please consider merging some patches. Current "add-api => use-api =>
> > remove-old-api" makes it tricky to follow what is being changed. For this set,
> > I found the diff of the whole set easier to follow than some of the big patches.
> I refer to some other big patchset to replace an old api, for example:
>
> https://lore.kernel.org/all/[email protected]/

Yes, this is a safe way to replace old APIs. Since the scale of this
patchset is smaller, I was thinking it might not be necessary to go that
path. But I will let you make the decision.

> Currently I prefer to use one patch for each function point. I did
> merge some patches in this version; for the remaining patches, do you
> prefer one patch per file instead of one per function point? (For
> example, merge patches 10-12 for md/raid5-cache, and 13-16 for md/raid5.)

I think 10 should be a separate patch, and we can merge 11 and 12. We can
merge 13-16, and maybe also 5-7 and 18-20.

Thanks,
Song

2023-10-07 02:33:20

by Yu Kuai

Subject: Re: [PATCH -next v3 00/25] md: synchronize io with array reconfiguration

Hi,

On 2023/10/05 11:55, Song Liu wrote:
> On Wed, Oct 4, 2023 at 8:42 PM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> On 2023/09/29 3:15, Song Liu wrote:
>>> Hi Kuai,
>>>
>>> Thanks for the patchset!
>>>
>>> A few high level questions/suggestions:
>>
>> Thanks a lot for these!
>>>
>>> 1. This is a big change that needs a lot of explanation. While you managed to
>>> keep each patch relatively small (great job btw), it is not very clear why we
>>> need these changes. Specifically, we are adding a new mutex, it is worth
>>> mentioning why we cannot achieve the same goal without it. Please add
>>> more information in the cover letter. We will put part of the cover letter in
>>> the merge commit.
>>
>> Yeah, I realize that I explain too little. I will add background and
>> design.
>>>
Can you take a look at this new cover letter?

##### Background

Our testers started to test raid10 last year, and we found that there
are lots of problems in the following test scenario:

- add or remove disks to the array
- issue io to the array

At first, we fixed each problem independently, accepting that io can run
concurrently with array reconfiguration. However, new issues keep being
reported, and other personalities might have the same problems. So I
started thinking about how to fix these problems thoroughly.

Consider how the block layer protects io against queue reconfiguration
(for example, changing the elevator):

```
blk_mq_freeze_queue
-> wait for all io to be done, and prevent new io to be dispatched
// reconfiguration
blk_mq_unfreeze_queue
```

It then occurred to me that I could do something similar to synchronize
io with array reconfiguration.

##### rcu introduction

See details in https://www.kernel.org/doc/html/next/RCU/whatisRCU.html

- the writer should replace old data with new data first, and free the
old data after a grace period;
- the reader should handle both the case that old data is read and the
case that new data is read, and the data that is read should not be
dereferenced after the critical section;

##### Current synchronization

Adding or removing disks can be triggered by ioctl, sysfs, or the daemon
thread:

1. hold 'reconfig_mutex';

2. check that the rdev can be added/removed; one condition is that there
is no io, for example:

```
raid10_remove_disk
if (atomic_read(&rdev->nr_pending))
err = -EBUSY;
```

3. do the actual operation to add/remove the rdev; one step is to
set/clear a pointer to the rdev, for example:

```
raid10_remove_disk
p = conf->mirrors[xx]
rdevp = &p->rdev/replacement
*rdevp = NULL
```

4. check again whether there is io on this rdev; if there is, revert the
pointer to the rdev and return failure, for example:

```
raid10_remove_disk
synchronize_rcu()
if (atomic_read(&rdev->nr_pending))
err = -EBUSY
*rdevp = rdev
```

The IO path uses rcu_read_lock/unlock() to access rdev, for example:

```
raid10_write_request
rcu_read_lock
rdev = rcu_dereference(mirror->rdev/replacement)
rcu_read_unlock

raid10_end_write_request
rdev = conf->mirrors[dev].rdev/replacement
-> rdev/rrdev is still used after rcu_read_unlock()
```

##### Current problems

- rcu is used wrongly, for example, rdev is still used after
rcu_read_unlock();
- There are lots of places where the old value can be read, however,
many of them don't handle this correctly;
- Between step 3 and 4, if new io is dispatched, NULL will be read for
the rdev, and data will be lost.
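
The window between step 3 and step 4 can be illustrated with a small
userspace model. This is only a sketch of one possible interleaving, not
kernel code: the toy_* structures and functions are made up for the
example, and -1 stands in for -EBUSY.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the md structures. */
struct toy_rdev {
	int nr_pending;		/* io still in flight on this device */
	int writes_seen;	/* writes that actually reached this device */
};

struct toy_mirror {
	struct toy_rdev *rdev;
};

/* Step 3 of removal: clear the pointer and hand back the old rdev. */
static struct toy_rdev *toy_remove_clear(struct toy_mirror *m)
{
	struct toy_rdev *rdev = m->rdev;

	m->rdev = NULL;
	return rdev;
}

/* Step 4 of removal: revert the pointer if io is pending on the rdev. */
static int toy_remove_finish(struct toy_mirror *m, struct toy_rdev *rdev)
{
	assert(rdev != NULL);
	if (rdev->nr_pending) {
		m->rdev = rdev;
		return -1;	/* stands in for -EBUSY */
	}
	return 0;
}

/* IO path: returns 1 if the write reached a device, 0 if it was skipped. */
static int toy_write(struct toy_mirror *m)
{
	struct toy_rdev *rdev = m->rdev;

	if (!rdev)
		return 0;	/* device looks absent, so the write is skipped */
	rdev->writes_seen++;
	return 1;
}
```

If toy_write() runs after toy_remove_clear() and toy_remove_finish() then
reverts the removal, the device stays in the array but never saw that
write.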

##### New synchronization

Similar to how blk_mq_freeze_queue() works.

Add or remove disks:

1. suspend the array; this should guarantee that no new io is dispatched
and wait for dispatched io to be done;
2. add or remove rdevs from the array;
3. resume the array;
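
The three steps can be sketched as a single-threaded userspace model. It
is only an illustration under simplifying assumptions: struct toy_array
and the toy_* helpers are hypothetical, and real code would sleep until
the array is drained instead of polling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an array that can be suspended. */
struct toy_array {
	bool suspended;	/* no new io may enter while this is set */
	int active_io;	/* io that has entered and not yet completed */
	int disks;
};

/* IO path: returns false if the io has to wait because of a suspend. */
static bool toy_io_start(struct toy_array *a)
{
	if (a->suspended)
		return false;
	a->active_io++;
	return true;
}

static void toy_io_done(struct toy_array *a)
{
	assert(a->active_io > 0);
	a->active_io--;
}

/* Step 1, first half: stop new io from entering. */
static void toy_suspend_begin(struct toy_array *a)
{
	a->suspended = true;
}

/* Step 1, second half: true once all dispatched io is done. */
static bool toy_drained(const struct toy_array *a)
{
	return a->active_io == 0;
}

/* Step 2: reconfiguration is refused unless suspended and drained. */
static bool toy_add_disk(struct toy_array *a)
{
	if (!a->suspended || !toy_drained(a))
		return false;
	a->disks++;
	return true;
}

/* Step 3: let io in again. */
static void toy_resume(struct toy_array *a)
{
	a->suspended = false;
}
```

In this model the reconfiguration never overlaps with io, so the io path
does not need to handle a half-updated array.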

The IO path doesn't need to change for now, and all the rcu handling can
be removed.

There are already apis to suspend/resume the array; unfortunately, they
can't be used here because:

- old apis only wait for io to be dispatched, not to be done;
- old apis are only supported for the personalities that implement the
quiesce callback;
- old apis must be called after the array starts running;
- old apis must hold 'reconfig_mutex' and will wait for io to be done;
this behavior is risky because 'reconfig_mutex' is used by the daemon
thread to update the super_block and handle io. In order to prevent
potential problems, there is odd logic where the thread that suspends
the array holds 'reconfig_mutex' on behalf of md_check_recovery() so
that it can update the super_block;

The main work is then divided into 3 steps. First, make sure the new apis
to suspend the array are general:

- make sure suspending the array will wait for io to be done (Done by []);
- make sure suspending the array works for all personalities (Done by
[]);
- make sure suspending the array can be done at any time (Done by []);
- make sure suspending the array doesn't rely on 'reconfig_mutex';

The second step is to replace old apis with new apis:

```
From:
lock reconfig_mutex
suspend array
resume array
unlock reconfig_mutex

To:
suspend array
lock reconfig_mutex
unlock reconfig_mutex
resume array
```
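
The new ordering, including the rollback when taking the lock fails, can
be modelled in userspace as follows. This is a sketch that mirrors the
shape of mddev_suspend_and_lock() and mddev_unlock_and_resume() from the
patch; struct toy_mddev, the model_* functions, MODEL_EINTR and the two
fail_* knobs are assumptions made for the example.

```c
#include <assert.h>

#define MODEL_EINTR 4	/* stand-in error code for an interrupted wait */

/* Hypothetical model of the state touched by suspend and lock. */
struct toy_mddev {
	int suspended;		/* suspend nesting count */
	int locked;		/* 1 while the reconfig lock is held */
	int fail_suspend;	/* knob: pretend the suspend was interrupted */
	int fail_lock;		/* knob: pretend the lock was interrupted */
};

static int model_suspend(struct toy_mddev *m)
{
	if (m->fail_suspend)
		return -MODEL_EINTR;
	m->suspended++;
	return 0;
}

static void model_resume(struct toy_mddev *m)
{
	assert(m->suspended > 0);
	m->suspended--;
}

static int model_lock(struct toy_mddev *m)
{
	if (m->fail_lock)
		return -MODEL_EINTR;
	m->locked = 1;
	return 0;
}

static void model_unlock(struct toy_mddev *m)
{
	m->locked = 0;
}

/* Suspend first, then lock; undo the suspend if the lock fails. */
static int model_suspend_and_lock(struct toy_mddev *m)
{
	int ret = model_suspend(m);

	if (ret)
		return ret;

	ret = model_lock(m);
	if (ret)
		model_resume(m);

	return ret;
}

/* Release in the reverse order: unlock first, then resume. */
static void model_unlock_and_resume(struct toy_mddev *m)
{
	model_unlock(m);
	model_resume(m);
}
```

On any failure the caller sees the array neither suspended nor locked, so
the error path needs no extra cleanup.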

Finally, for the remaining paths that involve reconfiguration, suspend
the array first:

```
From:
// reconfiguration

To:
suspend array
// reconfiguration
resume array
```

>>> 2. In the cover letter, please also highlight that we are removing
>>> MD_ALLOW_SB_UPDATE and MD_UPDATING_SB. This is a big improvement.
>>>
>>
>> Okay.
>>> 3. Please rearrange the patch set so that the two "READ_ONCE/WRITE_ONCE"
>>> patches are at the beginning.
>>
>> Okay.
>>>
>>> 4. Please consider merging some patches. Current "add-api => use-api =>
>>> remove-old-api" makes it tricky to follow what is being changed. For this set,
>>> I found the diff of the whole set easier to follow than some of the big patches.
>> I refer to some other big patchset to replace an old api, for example:
>>
>> https://lore.kernel.org/all/[email protected]/
>
> Yes, this is a safe way to replace old APIs. Since the scale of this
> patchset is smaller, I was thinking it might not be necessary to go that
> path. But I will let you make the decision.
>
>> Currently I prefer to use one patch for each function point. I did
>> merge some patches in this version; for the remaining patches, do you
>> prefer one patch per file instead of one per function point? (For
>> example, merge patches 10-12 for md/raid5-cache, and 13-16 for md/raid5.)
>
> I think 10 should be a separate patch, and we can merge 11 and 12. We can
> merge 13-16, and maybe also 5-7 and 18-20.
>
> Thanks,
> Song
> .
>

2023-10-07 02:41:19

by Song Liu

Subject: Re: [PATCH -next v3 00/25] md: synchronize io with array reconfiguration

On Fri, Oct 6, 2023 at 7:32 PM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> On 2023/10/05 11:55, Song Liu wrote:
> > On Wed, Oct 4, 2023 at 8:42 PM Yu Kuai <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/09/29 3:15, Song Liu wrote:
> >>> Hi Kuai,
> >>>
> >>> Thanks for the patchset!
> >>>
> >>> A few high level questions/suggestions:
> >>
> >> Thanks a lot for these!
> >>>
> >>> 1. This is a big change that needs a lot of explanation. While you managed to
> >>> keep each patch relatively small (great job btw), it is not very clear why we
> >>> need these changes. Specifically, we are adding a new mutex, it is worth
> >>> mentioning why we cannot achieve the same goal without it. Please add
> >>> more information in the cover letter. We will put part of the cover letter in
> >>> the merge commit.
> >>
> >> Yeah, I realize that I explain too little. I will add background and
> >> design.
> >>>
> Can you take a look at this new cover letter?

I don't have time right now to look into all the details, but it looks
great at first glance. We can still edit it a little bit when applying the
patchset, but that may not be necessary.

Thanks,
Song

>
> ##### Background
>
> Our testers started to test raid10 last year, and we found that there
> are lots of problems in the following test scenario:
>
> - add or remove disks to the array
> - issue io to the array

2023-10-07 02:49:57

by Yu Kuai

Subject: Re: [PATCH -next v3 00/25] md: synchronize io with array reconfiguration

Hi,

On 2023/10/07 10:40, Song Liu wrote:
>> Can you take a look at this new cover letter?
>
> I don't have time right now to look into all the details, but it looks
> great at first glance. We can still edit it a little bit when applying the
> patchset, but that may not be necessary.

Yeah, it's not urgent so you can take it slow, I just want to make sure
that you're good with it. I'll edit this cover letter a bit and send v4
soon.

Thanks,
Kuai