LinuxLists.cc - [PATCH -next] md: split MD_RECOVERY_NEEDED out of mddev

2023-12-04 03:45:19

Subject: [PATCH -next] md: split MD_RECOVERY_NEEDED out of mddev_resume

From: Yu Kuai <[email protected]>

New mddev_resume() calls are added to synchroniza IO with array
reconfiguration, however, this introduce a regression while adding it in
md_start_sync():

1) someone set MD_RECOVERY_NEEDED first;
2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and
queue a new sync work;
3) daemon thread release reconfig_mutex;
4) in md_start_sync
a) check that there are spares that can be added/removed, then suspend
the array;
b) remove_and_add_spares may not be called, or called without really
add/remove spares;
c) resume the array, then set MD_RECOVERY_NEEDED again!

Loop between 2 - 4, then mddev_suspend() will be called quite often, for
consequence, normal IO will be quite slow.

Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so
that md_start_sync() won't set such flag and hence the loop will be broken.

Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Reported-and-tested-by: Janpieter Sollie <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218200
Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/dm-raid.c | 1 +
drivers/md/md-bitmap.c | 2 ++
drivers/md/md.c | 6 +++++-
drivers/md/raid5.c | 4 ++++
4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index eb009d6bb03a..e9c0d70f7fe5 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -4059,6 +4059,7 @@ static void raid_resume(struct dm_target *ti)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
mddev->ro = 0;
mddev->in_sync = 0;
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
}
}
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 9672f75c3050..16112750ee64 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2428,6 +2428,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
}
rv = 0;
out:
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
if (rv)
return rv;
@@ -2571,6 +2572,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
if (old_mwb != backlog)
md_bitmap_update_sb(mddev->bitmap);

+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
return len;
}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4b1e8007dd15..48a1b12f3c2c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -515,7 +515,6 @@ void mddev_resume(struct mddev *mddev)
percpu_ref_resurrect(&mddev->active_io);
wake_up(&mddev->sb_wait);

- set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
md_wakeup_thread(mddev->thread);
md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */

@@ -4146,6 +4145,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
md_new_event();
rv = len;
out_unlock:
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
return rv;
}
@@ -4652,6 +4652,8 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
out:
if (err)
export_rdev(rdev, mddev);
+ else
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
if (!err)
md_new_event();
@@ -5533,6 +5535,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
mddev_destroy_serial_pool(mddev, NULL);
mddev->serialize_policy = value;
unlock:
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
return err ?: len;
}
@@ -6593,6 +6596,7 @@ static void autorun_devices(int part)
export_rdev(rdev, mddev);
}
autorun_array(mddev);
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
}
/* on success, candidates will be empty, on error
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 42ba3581cfea..f88f92517a18 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6989,6 +6989,7 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len)
mutex_unlock(&conf->cache_size_mutex);

out_unlock:
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
return err ?: len;
}
@@ -7090,6 +7091,7 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len)
else
blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
}
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);
return err ?: len;
}
@@ -7169,6 +7171,7 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
kfree(old_groups);
}
}
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);

return err ?: len;
@@ -8920,6 +8923,7 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
if (!err)
md_update_sb(mddev, 1);

+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
mddev_unlock_and_resume(mddev);

return err;
--
2.39.2

2023-12-04 14:09:45

by Paul Menzel

[permalink] [raw]

Subject: Re: [PATCH -next] md: split MD_RECOVERY_NEEDED out of mddev_resume

Dear Yu,

Thank you for your patch.

Am 04.12.23 um 04:17 schrieb Yu Kuai:
> From: Yu Kuai <[email protected]>
>
> New mddev_resume() calls are added to synchroniza IO with array

synchronize

> reconfiguration, however, this introduce a regression while adding it in

1. Maybe: … performance regression …
2. introduce*s*

> md_start_sync():
>
> 1) someone set MD_RECOVERY_NEEDED first;

set*s*

> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and
> queue a new sync work;

grab*s*, clear*s*, queue*s*

> 3) daemon thread release reconfig_mutex;

release*s*

> 4) in md_start_sync
> a) check that there are spares that can be added/removed, then suspend
> the array;
> b) remove_and_add_spares may not be called, or called without really
> add/remove spares;
> c) resume the array, then set MD_RECOVERY_NEEDED again!
>
> Loop between 2 - 4, then mddev_suspend() will be called quite often, for
> consequence, normal IO will be quite slow.

It’d be great if you could document the exact “test case”, and numbers.

> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so

split*t*ing

> that md_start_sync() won't set such flag and hence the loop will be broken.
>
> Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
> Reported-and-tested-by: Janpieter Sollie <[email protected]>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218200
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/dm-raid.c | 1 +
> drivers/md/md-bitmap.c | 2 ++
> drivers/md/md.c | 6 +++++-
> drivers/md/raid5.c | 4 ++++
> 4 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index eb009d6bb03a..e9c0d70f7fe5 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -4059,6 +4059,7 @@ static void raid_resume(struct dm_target *ti)
> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> mddev->ro = 0;
> mddev->in_sync = 0;
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> }
> }
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index 9672f75c3050..16112750ee64 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -2428,6 +2428,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
> }
> rv = 0;
> out:
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> if (rv)
> return rv;
> @@ -2571,6 +2572,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
> if (old_mwb != backlog)
> md_bitmap_update_sb(mddev->bitmap);
>
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> return len;
> }
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4b1e8007dd15..48a1b12f3c2c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -515,7 +515,6 @@ void mddev_resume(struct mddev *mddev)
> percpu_ref_resurrect(&mddev->active_io);
> wake_up(&mddev->sb_wait);
>
> - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> md_wakeup_thread(mddev->thread);
> md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
>
> @@ -4146,6 +4145,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
> md_new_event();
> rv = len;
> out_unlock:
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> return rv;
> }
> @@ -4652,6 +4652,8 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
> out:
> if (err)
> export_rdev(rdev, mddev);
> + else
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> if (!err)
> md_new_event();
> @@ -5533,6 +5535,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
> mddev_destroy_serial_pool(mddev, NULL);
> mddev->serialize_policy = value;
> unlock:
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> return err ?: len;
> }
> @@ -6593,6 +6596,7 @@ static void autorun_devices(int part)
> export_rdev(rdev, mddev);
> }
> autorun_array(mddev);
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> }
> /* on success, candidates will be empty, on error
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 42ba3581cfea..f88f92517a18 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6989,6 +6989,7 @@ raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len)
> mutex_unlock(&conf->cache_size_mutex);
>
> out_unlock:
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> return err ?: len;
> }
> @@ -7090,6 +7091,7 @@ raid5_store_skip_copy(struct mddev *mddev, const char *page, size_t len)
> else
> blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
> }
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
> return err ?: len;
> }
> @@ -7169,6 +7171,7 @@ raid5_store_group_thread_cnt(struct mddev *mddev, const char *page, size_t len)
> kfree(old_groups);
> }
> }
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
>
> return err ?: len;
> @@ -8920,6 +8923,7 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
> if (!err)
> md_update_sb(mddev, 1);
>
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> mddev_unlock_and_resume(mddev);
>
> return err;

Acked-by: Paul Menzel <[email protected]>

Kind regards,

Paul

2023-12-06 08:30:55

by Song Liu

[permalink] [raw]

Subject: Re: [PATCH -next] md: split MD_RECOVERY_NEEDED out of mddev_resume

On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <[email protected]> wrote:
>
> From: Yu Kuai <[email protected]>
>
> New mddev_resume() calls are added to synchroniza IO with array
> reconfiguration, however, this introduce a regression while adding it in
> md_start_sync():
>
> 1) someone set MD_RECOVERY_NEEDED first;
> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and
> queue a new sync work;
> 3) daemon thread release reconfig_mutex;
> 4) in md_start_sync
> a) check that there are spares that can be added/removed, then suspend
> the array;
> b) remove_and_add_spares may not be called, or called without really
> add/remove spares;
> c) resume the array, then set MD_RECOVERY_NEEDED again!
>
> Loop between 2 - 4, then mddev_suspend() will be called quite often, for
> consequence, normal IO will be quite slow.
>
> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so
> that md_start_sync() won't set such flag and hence the loop will be broken.

I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call
sites of mddev_resume().

How about something like the following instead?

Please also incorporate feedback from Paul in the next version.

Thanks,
Song

diff --git i/drivers/md/md.c w/drivers/md/md.c
index c94373d64f2c..2d53e1b57070 100644
--- i/drivers/md/md.c
+++ w/drivers/md/md.c
@@ -490,7 +490,7 @@ int mddev_suspend(struct mddev *mddev, bool interruptible)
}
EXPORT_SYMBOL_GPL(mddev_suspend);

-void mddev_resume(struct mddev *mddev)
+static void __mddev_resume(struct mddev *mddev, bool recovery_needed)
{
lockdep_assert_not_held(&mddev->reconfig_mutex);

@@ -507,12 +507,18 @@ void mddev_resume(struct mddev *mddev)
percpu_ref_resurrect(&mddev->active_io);
wake_up(&mddev->sb_wait);

- set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ if (recovery_needed)
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
md_wakeup_thread(mddev->thread);
md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */

mutex_unlock(&mddev->suspend_mutex);
}
+
+void mddev_resume(struct mddev *mddev)
+{
+ __mddev_resume(mddev, true);
+}
EXPORT_SYMBOL_GPL(mddev_resume);

/*
@@ -9403,7 +9409,9 @@ static void md_start_sync(struct work_struct *ws)
goto not_running;
}

- suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
+ mddev_unlock(mddev);
+ if (suspend)
+ __mddev_resume(mddev, false);
md_wakeup_thread(mddev->sync_thread);
sysfs_notify_dirent_safe(mddev->sysfs_action);
md_new_event();
@@ -9415,7 +9423,9 @@ static void md_start_sync(struct work_struct *ws)
clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
+ mddev_unlock(mddev);
+ if (suspend)
+ __mddev_resume(mddev, false);

wake_up(&resync_wait);
if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&

2023-12-06 11:37:02

by Yu Kuai

[permalink] [raw]

Subject: Re: [PATCH -next] md: split MD_RECOVERY_NEEDED out of mddev_resume

Hi,

在 2023/12/06 16:30, Song Liu 写道:
> On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <[email protected]> wrote:
>>
>> From: Yu Kuai <[email protected]>
>>
>> New mddev_resume() calls are added to synchroniza IO with array
>> reconfiguration, however, this introduce a regression while adding it in
>> md_start_sync():
>>
>> 1) someone set MD_RECOVERY_NEEDED first;
>> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and
>> queue a new sync work;
>> 3) daemon thread release reconfig_mutex;
>> 4) in md_start_sync
>> a) check that there are spares that can be added/removed, then suspend
>> the array;
>> b) remove_and_add_spares may not be called, or called without really
>> add/remove spares;
>> c) resume the array, then set MD_RECOVERY_NEEDED again!
>>
>> Loop between 2 - 4, then mddev_suspend() will be called quite often, for
>> consequence, normal IO will be quite slow.
>>
>> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so
>> that md_start_sync() won't set such flag and hence the loop will be broken.
>
> I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call
> sites of mddev_resume().

There are also some other mddev_resume() that is added later and don't
need recovery, so md_start_sync() is not the only place:

- md_setup_drive
- rdev_attr_store
- suspend_lo_store
- suspend_hi_store
- autorun_devices
- md_ioct
- r5c_disable_writeback_async
- error path from new_dev_store(), ...

I'm not sure add a new helper is a good idea, because all above apis
should use new helper as well.

>
> How about something like the following instead?
>
> Please also incorporate feedback from Paul in the next version.

Of course.

Thanks,
Kuai

>
> Thanks,
> Song
>
> diff --git i/drivers/md/md.c w/drivers/md/md.c
> index c94373d64f2c..2d53e1b57070 100644
> --- i/drivers/md/md.c
> +++ w/drivers/md/md.c
> @@ -490,7 +490,7 @@ int mddev_suspend(struct mddev *mddev, bool interruptible)
> }
> EXPORT_SYMBOL_GPL(mddev_suspend);
>
> -void mddev_resume(struct mddev *mddev)
> +static void __mddev_resume(struct mddev *mddev, bool recovery_needed)
> {
> lockdep_assert_not_held(&mddev->reconfig_mutex);
>
> @@ -507,12 +507,18 @@ void mddev_resume(struct mddev *mddev)
> percpu_ref_resurrect(&mddev->active_io);
> wake_up(&mddev->sb_wait);
>
> - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> + if (recovery_needed)
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> md_wakeup_thread(mddev->thread);
> md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
>
> mutex_unlock(&mddev->suspend_mutex);
> }
> +
> +void mddev_resume(struct mddev *mddev)
> +{
> + __mddev_resume(mddev, true);
> +}
> EXPORT_SYMBOL_GPL(mddev_resume);
>
> /*
> @@ -9403,7 +9409,9 @@ static void md_start_sync(struct work_struct *ws)
> goto not_running;
> }
>
> - suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
> + mddev_unlock(mddev);
> + if (suspend)
> + __mddev_resume(mddev, false);
> md_wakeup_thread(mddev->sync_thread);
> sysfs_notify_dirent_safe(mddev->sysfs_action);
> md_new_event();
> @@ -9415,7 +9423,9 @@ static void md_start_sync(struct work_struct *ws)
> clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
> clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
> clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
> - suspend ? mddev_unlock_and_resume(mddev) : mddev_unlock(mddev);
> + mddev_unlock(mddev);
> + if (suspend)
> + __mddev_resume(mddev, false);
>
> wake_up(&resync_wait);
> if (test_and_clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery) &&
>
> .
>

2023-12-06 17:26:05

by Song Liu

[permalink] [raw]

Subject: Re: [PATCH -next] md: split MD_RECOVERY_NEEDED out of mddev_resume

On Wed, Dec 6, 2023 at 3:36 AM Yu Kuai <[email protected]> wrote:
>
> Hi,
>
> 在 2023/12/06 16:30, Song Liu 写道:
> > On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <[email protected]> wrote:
> >>
> >> From: Yu Kuai <[email protected]>
> >>
> >> New mddev_resume() calls are added to synchroniza IO with array
> >> reconfiguration, however, this introduce a regression while adding it in
> >> md_start_sync():
> >>
> >> 1) someone set MD_RECOVERY_NEEDED first;
> >> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and
> >> queue a new sync work;
> >> 3) daemon thread release reconfig_mutex;
> >> 4) in md_start_sync
> >> a) check that there are spares that can be added/removed, then suspend
> >> the array;
> >> b) remove_and_add_spares may not be called, or called without really
> >> add/remove spares;
> >> c) resume the array, then set MD_RECOVERY_NEEDED again!
> >>
> >> Loop between 2 - 4, then mddev_suspend() will be called quite often, for
> >> consequence, normal IO will be quite slow.
> >>
> >> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so
> >> that md_start_sync() won't set such flag and hence the loop will be broken.
> >
> > I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call
> > sites of mddev_resume().
>
> There are also some other mddev_resume() that is added later and don't
> need recovery, so md_start_sync() is not the only place:
>
> - md_setup_drive
> - rdev_attr_store
> - suspend_lo_store
> - suspend_hi_store
> - autorun_devices
> - md_ioct
> - r5c_disable_writeback_async
> - error path from new_dev_store(), ...
>
> I'm not sure add a new helper is a good idea, because all above apis
> should use new helper as well.

I think for most of these call sites, it is OK to set MD_RECOVERY_NEEDED
(although it is not needed), and md_start_sync() is the only one that may
trigger "loop between 2 - 4" scenario. Did I miss something?

It is already rc4, so we need to send the fix soon.

Thanks,
Song

2023-12-07 01:34:30

by Yu Kuai

[permalink] [raw]

Subject: Re: [PATCH -next] md: split MD_RECOVERY_NEEDED out of mddev_resume

Hi,

在 2023/12/07 1:24, Song Liu 写道:
> On Wed, Dec 6, 2023 at 3:36 AM Yu Kuai <[email protected]> wrote:
>>
>> Hi,
>>
>> 在 2023/12/06 16:30, Song Liu 写道:
>>> On Sun, Dec 3, 2023 at 7:18 PM Yu Kuai <[email protected]> wrote:
>>>>
>>>> From: Yu Kuai <[email protected]>
>>>>
>>>> New mddev_resume() calls are added to synchroniza IO with array
>>>> reconfiguration, however, this introduce a regression while adding it in
>>>> md_start_sync():
>>>>
>>>> 1) someone set MD_RECOVERY_NEEDED first;
>>>> 2) daemon thread grab reconfig_mutex, then clear MD_RECOVERY_NEEDED and
>>>> queue a new sync work;
>>>> 3) daemon thread release reconfig_mutex;
>>>> 4) in md_start_sync
>>>> a) check that there are spares that can be added/removed, then suspend
>>>> the array;
>>>> b) remove_and_add_spares may not be called, or called without really
>>>> add/remove spares;
>>>> c) resume the array, then set MD_RECOVERY_NEEDED again!
>>>>
>>>> Loop between 2 - 4, then mddev_suspend() will be called quite often, for
>>>> consequence, normal IO will be quite slow.
>>>>
>>>> Fix this problem by spliting MD_RECOVERY_NEEDED out of mddev_resume(), so
>>>> that md_start_sync() won't set such flag and hence the loop will be broken.
>>>
>>> I hope we don't leak set_bit MD_RECOVERY_NEEDED to all call
>>> sites of mddev_resume().
>>
>> There are also some other mddev_resume() that is added later and don't
>> need recovery, so md_start_sync() is not the only place:
>>
>> - md_setup_drive
>> - rdev_attr_store
>> - suspend_lo_store
>> - suspend_hi_store
>> - autorun_devices
>> - md_ioct
>> - r5c_disable_writeback_async
>> - error path from new_dev_store(), ...
>>
>> I'm not sure add a new helper is a good idea, because all above apis
>> should use new helper as well.
>
> I think for most of these call sites, it is OK to set MD_RECOVERY_NEEDED
> (although it is not needed), and md_start_sync() is the only one that may
> trigger "loop between 2 - 4" scenario. Did I miss something?

Yes, it's the only problematic one. I'll send v2.

Thanks,
Kuai

>
> It is already rc4, so we need to send the fix soon.
>
> Thanks,
> Song
> .
>