Hi,
This is the updated series with the feedback received in v3[1].
This series includes fixes to fix all the kernel panics in the mdadm
tests and some, related, sparse issues, plus some cleanup.
Patches 2 and 3 are cleanup from the original series that aren't
related now but I thought are worth including.
Patches 4 through 7 fix bugs with conf->log and remove the single,
unecessary, RCU access. This cleans up some sparse errors.
Patch 8 cleans up some sparse warnings with pslot usage.
Patch 9 is a cleanup which adds an enum so that patch 10 can
fix an mdadm hang. Patch 11 also fixes an mdadm hang.
I've also included Patch 1 in this series which fixes a recent
mistake in raid5-ppl that was reported by 0day which I don't think
has been fixed yet.
This series will be followed by another series for mdadm which fixes
the segfaults and annotates some failing tests to make mdadm tests
runnable fairly reliably, but I'll wait for a stable hash for this
series to note the kernel version tested against. Following that,
v3 of my lock contention series will be sent with more confidence
of its correctness.
This series is based on the current md/md-next branch as of yesterday
(42b805af10). A git branch is available here:
https://github.com/sbates130272/linux-p2pmem md-bug_v4
Thanks,
Logan
[1] https://lore.kernel.org/all/[email protected]
--
Changes since v3:
* Collected Reviewed-by Tags from Christoph
* Removed lockdep_assert_held() that didn't belong in patch
(per Christoph)
* Moved the last patch to the beginning
Changes since v2:
* Rework the RCU changes to remove the RCU usage instead of using
it every. This makes more sense seeing most accesses do not need
RCU due to them being on the IO path, or with mddev_lock() held
and the remaining ones are on the slow path and may use
mddev_lock(). (Per Christoph)
* Collect a couple more Reviewed-by tags from Christoph
Changes since v1:
* Add a patch to move the struct r5l_log to raid5-log.h in order
to fix a compiler error with rcu_access_pointer() in versions
prior to gcc-10
* Rework r5c_is_writeback() changes to make less churn (per Christoph)
* Change some 1s to trues in rcu_dereference_protected calls (per
Christoph)
* Fix an odd hunk mistake in the RCU protection patch (per Christoph)
* Fix an inverted conditional (noticed by Donald)
* Add a patch to add an enum for the overloaded values used by
mddev->curr_resync to make the status_resync() fixes clearer
(per Christoph)
--
Logan Gunthorpe (11):
md/raid5-ppl: Fix argument order in bio_alloc_bioset()
md/raid5-log: Drop extern decorators for function prototypes
md/raid5-ppl: Drop unused argument from ppl_handle_flush_request()
md/raid5: Ensure array is suspended for calls to log_exit()
md/raid5-cache: Take mddev_lock in r5c_journal_mode_show()
md/raid5-cache: Drop RCU usage of conf->log
md/raid5-cache: Clear conf->log after finishing work
md/raid5-cache: Annotate pslot with __rcu notation
md: Use enum for overloaded magic numbers used by mddev->curr_resync
md: Ensure resync is reported after it starts
md: Notify sysfs sync_completed in md_reap_sync_thread()
drivers/md/md.c | 55 +++++++++++++++-------------
drivers/md/md.h | 15 ++++++++
drivers/md/raid5-cache.c | 40 ++++++++++-----------
drivers/md/raid5-log.h | 77 +++++++++++++++++++---------------------
drivers/md/raid5-ppl.c | 6 ++--
drivers/md/raid5.c | 18 ++++------
6 files changed, 111 insertions(+), 100 deletions(-)
base-commit: 92a045f71a3c6767a7f30c0d80d2e4f6d5d49d77
--
2.30.2
radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the
slot returned and looked up to be marked with __rcu. Otherwise
sparse warnings are generated:
drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
assignment (different address spaces)
drivers/md/raid5-cache.c:2939:23: expected void **pslot
drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu **
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-cache.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ca57b5fac599..b2e6016934ee 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2637,7 +2637,7 @@ int r5c_try_caching_write(struct r5conf *conf,
int i;
struct r5dev *dev;
int to_cache = 0;
- void **pslot;
+ void __rcu **pslot;
sector_t tree_index;
int ret;
uintptr_t refcount;
@@ -2804,7 +2804,7 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
int i;
int do_wakeup = 0;
sector_t tree_index;
- void **pslot;
+ void __rcu **pslot;
uintptr_t refcount;
if (!log || !test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
--
2.30.2
ppl_handle_flush_request() takes an struct r5log argument but doesn't
use it. It has no buisiness taking this argument as it is only used
by raid5-cache and has no way to derference it anyway. Remove
the argument.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-log.h | 4 ++--
drivers/md/raid5-ppl.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 270ced4f770f..c8332502669e 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -41,7 +41,7 @@ void ppl_write_stripe_run(struct r5conf *conf);
void ppl_stripe_write_finished(struct stripe_head *sh);
int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
void ppl_quiesce(struct r5conf *conf, int quiesce);
-int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio);
+int ppl_handle_flush_request(struct bio *bio);
extern struct md_sysfs_entry ppl_write_hint;
static inline bool raid5_has_log(struct r5conf *conf)
@@ -108,7 +108,7 @@ static inline int log_handle_flush_request(struct r5conf *conf, struct bio *bio)
if (conf->log)
ret = r5l_handle_flush_request(conf->log, bio);
else if (raid5_has_ppl(conf))
- ret = ppl_handle_flush_request(conf->log, bio);
+ ret = ppl_handle_flush_request(bio);
return ret;
}
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 0a2e4806b1ec..db49edec362a 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -679,7 +679,7 @@ void ppl_quiesce(struct r5conf *conf, int quiesce)
}
}
-int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio)
+int ppl_handle_flush_request(struct bio *bio)
{
if (bio->bi_iter.bi_size == 0) {
bio_endio(bio);
--
2.30.2
bio_alloc_bioset() takes a block device, number of vectors, the
OP flags, the GFP mask and the bio set. However when the prototype
was changed, the callisite in ppl_do_flush() had the OP flags and
the GFP flags reversed. This introduced some sparse error:
drivers/md/raid5-ppl.c:632:57: warning: incorrect type in argument 3
(different base types)
drivers/md/raid5-ppl.c:632:57: expected unsigned int opf
drivers/md/raid5-ppl.c:632:57: got restricted gfp_t [usertype]
drivers/md/raid5-ppl.c:633:61: warning: incorrect type in argument 4
(different base types)
drivers/md/raid5-ppl.c:633:61: expected restricted gfp_t [usertype]
gfp_mask
drivers/md/raid5-ppl.c:633:61: got unsigned long long
The sparse error introduction may not have been reported correctly by
0day due to other work that was cleaning up other sparse errors in this
area.
Fixes: 609be1066731 ("block: pass a block_device and opf to bio_alloc_bioset")
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-ppl.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 973e2e06f19c..0a2e4806b1ec 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -629,9 +629,9 @@ static void ppl_do_flush(struct ppl_io_unit *io)
if (bdev) {
struct bio *bio;
- bio = bio_alloc_bioset(bdev, 0, GFP_NOIO,
+ bio = bio_alloc_bioset(bdev, 0,
REQ_OP_WRITE | REQ_PREFLUSH,
- &ppl_conf->flush_bs);
+ GFP_NOIO, &ppl_conf->flush_bs);
bio->bi_private = io;
bio->bi_end_io = ppl_flush_endio;
--
2.30.2
extern is not necessary and recommended against when defining prototype
functions in headers. checkpatch.pl complains about these. So remove
them.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-log.h | 75 ++++++++++++++++++++----------------------
1 file changed, 36 insertions(+), 39 deletions(-)
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 43c714a8798c..270ced4f770f 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -2,49 +2,46 @@
#ifndef _RAID5_LOG_H
#define _RAID5_LOG_H
-extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
-extern void r5l_exit_log(struct r5conf *conf);
-extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
-extern void r5l_write_stripe_run(struct r5l_log *log);
-extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
-extern void r5l_stripe_write_finished(struct stripe_head *sh);
-extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
-extern void r5l_quiesce(struct r5l_log *log, int quiesce);
-extern bool r5l_log_disk_error(struct r5conf *conf);
-extern bool r5c_is_writeback(struct r5l_log *log);
-extern int
-r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
- struct stripe_head_state *s, int disks);
-extern void
-r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
- struct stripe_head_state *s);
-extern void r5c_release_extra_page(struct stripe_head *sh);
-extern void r5c_use_extra_page(struct stripe_head *sh);
-extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
-extern void r5c_handle_cached_data_endio(struct r5conf *conf,
- struct stripe_head *sh, int disks);
-extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh);
-extern void r5c_make_stripe_write_out(struct stripe_head *sh);
-extern void r5c_flush_cache(struct r5conf *conf, int num);
-extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
-extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
+void r5l_exit_log(struct r5conf *conf);
+int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
+void r5l_write_stripe_run(struct r5l_log *log);
+void r5l_flush_stripe_to_raid(struct r5l_log *log);
+void r5l_stripe_write_finished(struct stripe_head *sh);
+int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
+void r5l_quiesce(struct r5l_log *log, int quiesce);
+bool r5l_log_disk_error(struct r5conf *conf);
+bool r5c_is_writeback(struct r5l_log *log);
+int r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
+ struct stripe_head_state *s, int disks);
+void r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
+ struct stripe_head_state *s);
+void r5c_release_extra_page(struct stripe_head *sh);
+void r5c_use_extra_page(struct stripe_head *sh);
+void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+void r5c_handle_cached_data_endio(struct r5conf *conf,
+ struct stripe_head *sh, int disks);
+int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh);
+void r5c_make_stripe_write_out(struct stripe_head *sh);
+void r5c_flush_cache(struct r5conf *conf, int num);
+void r5c_check_stripe_cache_usage(struct r5conf *conf);
+void r5c_check_cached_full_stripe(struct r5conf *conf);
extern struct md_sysfs_entry r5c_journal_mode;
-extern void r5c_update_on_rdev_error(struct mddev *mddev,
- struct md_rdev *rdev);
-extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
-extern int r5l_start(struct r5l_log *log);
+void r5c_update_on_rdev_error(struct mddev *mddev, struct md_rdev *rdev);
+bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+int r5l_start(struct r5l_log *log);
-extern struct dma_async_tx_descriptor *
+struct dma_async_tx_descriptor *
ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
struct dma_async_tx_descriptor *tx);
-extern int ppl_init_log(struct r5conf *conf);
-extern void ppl_exit_log(struct r5conf *conf);
-extern int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
-extern void ppl_write_stripe_run(struct r5conf *conf);
-extern void ppl_stripe_write_finished(struct stripe_head *sh);
-extern int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
-extern void ppl_quiesce(struct r5conf *conf, int quiesce);
-extern int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio);
+int ppl_init_log(struct r5conf *conf);
+void ppl_exit_log(struct r5conf *conf);
+int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
+void ppl_write_stripe_run(struct r5conf *conf);
+void ppl_stripe_write_finished(struct stripe_head *sh);
+int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
+void ppl_quiesce(struct r5conf *conf, int quiesce);
+int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern struct md_sysfs_entry ppl_write_hint;
static inline bool raid5_has_log(struct r5conf *conf)
--
2.30.2
The only place that uses RCU to access conf->log is in
r5l_log_disk_error(). This function is mostly used in the IO path
and once with mddev_lock() held in raid5_change_consistency_policy().
It is known that the IO will be suspended before the log is freed and
r5l_log_exit() is called with the mddev_lock() held.
This should mean that conf->log can not be freed while the function is
being called, so the RCU protection is not necessary. Drop the
rcu_read_lock() as well as the synchronize_rcu() and
rcu_assign_pointer() usage.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-cache.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ca8fc317da95..b1fc65e113f8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1590,18 +1590,13 @@ void r5l_quiesce(struct r5l_log *log, int quiesce)
bool r5l_log_disk_error(struct r5conf *conf)
{
- struct r5l_log *log;
- bool ret;
- /* don't allow write if journal disk is missing */
- rcu_read_lock();
- log = rcu_dereference(conf->log);
+ struct r5l_log *log = conf->log;
+ /* don't allow write if journal disk is missing */
if (!log)
- ret = test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
+ return test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
else
- ret = test_bit(Faulty, &log->rdev->flags);
- rcu_read_unlock();
- return ret;
+ return test_bit(Faulty, &log->rdev->flags);
}
#define R5L_RECOVERY_PAGE_POOL_SIZE 256
@@ -3148,7 +3143,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
spin_lock_init(&log->stripe_in_journal_lock);
atomic_set(&log->stripe_in_journal_count, 0);
- rcu_assign_pointer(conf->log, log);
+ conf->log = log;
set_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
return 0;
@@ -3171,7 +3166,6 @@ void r5l_exit_log(struct r5conf *conf)
struct r5l_log *log = conf->log;
conf->log = NULL;
- synchronize_rcu();
/* Ensure disable_writeback_work wakes up and exits */
wake_up(&conf->mddev->sb_wait);
--
2.30.2
The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)
When the background mdadm process hangs it, is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple times when
the reshape finishes but it is woken up before MD_RECOVERY_RUNNING
is cleared so sync_completed_show() reports 0 instead of "none".
To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/md.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ec46b83adf29..6e583f41caff 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9476,6 +9476,7 @@ void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
wake_up(&resync_wait);
/* flag recovery needed just to double check */
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ sysfs_notify_dirent_safe(mddev->sysfs_completed);
sysfs_notify_dirent_safe(mddev->sysfs_action);
md_new_event();
if (mddev->event_work.func)
--
2.30.2
The raid5-cache code relies on there being no IO in flight when
log_exit() is called. There are two places where this is not
guaranteed so add mddev_suspend() and mddev_resume() calls to these
sites.
The site in raid5_remove_disk() has a comment saying that it is
called in raid5d and thus cannot wait for pending writes; however that
does not appear to be correct anymore (if it ever was) as
raid5_remove_disk() is called from hot_remove_disk() which only
appears to be called in the md_ioctl(). Thus, the comment is removed,
as well as the racy check and replaced with calls to suspend/resume.
The site in raid5_change_consistency_policy() is in the error path,
and another similar call site already has suspend/resume calls just
below it; so it should be equally safe to make that change here.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5d09256d7f81..3ad37dd4c5cd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7938,18 +7938,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
print_raid5_conf(conf);
if (test_bit(Journal, &rdev->flags) && conf->log) {
- /*
- * we can't wait pending write here, as this is called in
- * raid5d, wait will deadlock.
- * neilb: there is no locking about new writes here,
- * so this cannot be safe.
- */
- if (atomic_read(&conf->active_stripes) ||
- atomic_read(&conf->r5c_cached_full_stripes) ||
- atomic_read(&conf->r5c_cached_partial_stripes)) {
- return -EBUSY;
- }
+ mddev_suspend(mddev);
log_exit(conf);
+ mddev_resume(mddev);
return 0;
}
if (rdev == rcu_access_pointer(p->rdev))
@@ -8697,8 +8688,11 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
err = log_init(conf, NULL, true);
if (!err) {
err = resize_stripes(conf, conf->pool_size);
- if (err)
+ if (err) {
+ mddev_suspend(mddev);
log_exit(conf);
+ mddev_resume(mddev);
+ }
}
} else
err = -EINVAL;
--
2.30.2
The mddev->lock spinlock doesn't protect against the removal of
conf->log in r5l_exit_log() so conf->log may be freed before it
is used.
To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-cache.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 83c184eddbda..ca8fc317da95 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2534,12 +2534,13 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
struct r5conf *conf;
int ret;
- spin_lock(&mddev->lock);
+ ret = mddev_lock(mddev);
+ if (ret)
+ return ret;
+
conf = mddev->private;
- if (!conf || !conf->log) {
- spin_unlock(&mddev->lock);
- return 0;
- }
+ if (!conf || !conf->log)
+ goto out_unlock;
switch (conf->log->r5c_journal_mode) {
case R5C_JOURNAL_MODE_WRITE_THROUGH:
@@ -2557,7 +2558,9 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
default:
ret = 0;
}
- spin_unlock(&mddev->lock);
+
+out_unlock:
+ mddev_unlock(mddev);
return ret;
}
--
2.30.2
A NULL pointer dereferlence on conf->log is seen randomly with
the mdadm test 21raid5cache. Kasan reporst:
BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140
Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086
Call Trace:
dump_stack_lvl+0x5a/0x74
kasan_report.cold+0x5f/0x1a9
__asan_load8+0x69/0x90
r5l_reclaimable_space+0xf5/0x140
r5l_do_reclaim+0xf4/0x5e0
r5l_reclaim_thread+0x69/0x3b0
md_thread+0x1a2/0x2c0
kthread+0x177/0x1b0
ret_from_fork+0x22/0x30
This is caused by conf->log being cleared in r5l_exit_log() before
stopping the reclaim thread.
To fix this, clear conf->log after the reclaim_thread is unregistered
and after flushing disable_writeback_work.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-cache.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index b1fc65e113f8..ca57b5fac599 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -3165,12 +3165,13 @@ void r5l_exit_log(struct r5conf *conf)
{
struct r5l_log *log = conf->log;
- conf->log = NULL;
-
/* Ensure disable_writeback_work wakes up and exits */
wake_up(&conf->mddev->sb_wait);
flush_work(&log->disable_writeback_work);
md_unregister_thread(&log->reclaim_thread);
+
+ conf->log = NULL;
+
mempool_exit(&log->meta_pool);
bioset_exit(&log->bs);
mempool_exit(&log->io_pool);
--
2.30.2
On 2022-06-08 11:59, Song Liu wrote:
> On Wed, Jun 8, 2022 at 9:28 AM Logan Gunthorpe <[email protected]> wrote:
>>
>> The raid5-cache code relies on there being no IO in flight when
>> log_exit() is called. There are two places where this is not
>> guaranteed so add mddev_suspend() and mddev_resume() calls to these
>> sites.
>>
>> The site in raid5_remove_disk() has a comment saying that it is
>> called in raid5d and thus cannot wait for pending writes; however that
>> does not appear to be correct anymore (if it ever was) as
>> raid5_remove_disk() is called from hot_remove_disk() which only
>> appears to be called in the md_ioctl(). Thus, the comment is removed,
>> as well as the racy check and replaced with calls to suspend/resume.
>>
>> The site in raid5_change_consistency_policy() is in the error path,
>> and another similar call site already has suspend/resume calls just
>> below it; so it should be equally safe to make that change here.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> ---
>> drivers/md/raid5.c | 18 ++++++------------
>> 1 file changed, 6 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 5d09256d7f81..3ad37dd4c5cd 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -7938,18 +7938,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>>
>> print_raid5_conf(conf);
>> if (test_bit(Journal, &rdev->flags) && conf->log) {
>> - /*
>> - * we can't wait pending write here, as this is called in
>> - * raid5d, wait will deadlock.
>> - * neilb: there is no locking about new writes here,
>> - * so this cannot be safe.
>> - */
>> - if (atomic_read(&conf->active_stripes) ||
>> - atomic_read(&conf->r5c_cached_full_stripes) ||
>> - atomic_read(&conf->r5c_cached_partial_stripes)) {
>> - return -EBUSY;
>> - }
>> + mddev_suspend(mddev);
>
> Unfortunately, the comment about deadlock is still true, and we cannot call
> mddev_suspend here. To trigger it:
Ah, yes. What a tangle. I think we can just drop this patch. Now that we
are removing RCU it isn't actually necessary to fix the bug I was
seeing. It's still probably broken as the comment notes though.
Thanks,
Logan
On Wed, Jun 8, 2022 at 9:28 AM Logan Gunthorpe <[email protected]> wrote:
>
> The raid5-cache code relies on there being no IO in flight when
> log_exit() is called. There are two places where this is not
> guaranteed so add mddev_suspend() and mddev_resume() calls to these
> sites.
>
> The site in raid5_remove_disk() has a comment saying that it is
> called in raid5d and thus cannot wait for pending writes; however that
> does not appear to be correct anymore (if it ever was) as
> raid5_remove_disk() is called from hot_remove_disk() which only
> appears to be called in the md_ioctl(). Thus, the comment is removed,
> as well as the racy check and replaced with calls to suspend/resume.
>
> The site in raid5_change_consistency_policy() is in the error path,
> and another similar call site already has suspend/resume calls just
> below it; so it should be equally safe to make that change here.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> drivers/md/raid5.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 5d09256d7f81..3ad37dd4c5cd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7938,18 +7938,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>
> print_raid5_conf(conf);
> if (test_bit(Journal, &rdev->flags) && conf->log) {
> - /*
> - * we can't wait pending write here, as this is called in
> - * raid5d, wait will deadlock.
> - * neilb: there is no locking about new writes here,
> - * so this cannot be safe.
> - */
> - if (atomic_read(&conf->active_stripes) ||
> - atomic_read(&conf->r5c_cached_full_stripes) ||
> - atomic_read(&conf->r5c_cached_partial_stripes)) {
> - return -EBUSY;
> - }
> + mddev_suspend(mddev);
Unfortunately, the comment about deadlock is still true, and we cannot call
mddev_suspend here. To trigger it:
# create raid5 with journal:
mdadm --create /dev/md0 -l 5 -n 3 /dev/nvme[0-2]n1 --write-journal
/dev/nvme3n1
# start some writes
fio ...
# fail the journal
mdadm --fail /dev/md0 /dev/nvme3n1
This will cause deadlock of the thread:
[<0>] raid5_quiesce+0x2a8/0x5f0
[<0>] mddev_suspend+0x26b/0x530
[<0>] raid5_remove_disk+0x12e/0x6f3
[<0>] remove_and_add_spares+0x5b2/0xef0
[<0>] md_check_recovery+0xe68/0x12b0
[<0>] raid5d+0xf4/0xf30
[<0>] md_thread+0x299/0x350
[<0>] kthread+0x29d/0x340
[<0>] ret_from_fork+0x22/0x30
Thanks,
Song
> log_exit(conf);
> + mddev_resume(mddev);
> return 0;
> }
> if (rdev == rcu_access_pointer(p->rdev))
> @@ -8697,8 +8688,11 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
> err = log_init(conf, NULL, true);
> if (!err) {
> err = resize_stripes(conf, conf->pool_size);
> - if (err)
> + if (err) {
> + mddev_suspend(mddev);
> log_exit(conf);
> + mddev_resume(mddev);
> + }
> }
> } else
> err = -EINVAL;
> --
> 2.30.2
>
On Wed, Jun 8, 2022 at 11:21 AM Logan Gunthorpe <[email protected]> wrote:
>
>
>
> On 2022-06-08 11:59, Song Liu wrote:
> > On Wed, Jun 8, 2022 at 9:28 AM Logan Gunthorpe <[email protected]> wrote:
> >>
> >> The raid5-cache code relies on there being no IO in flight when
> >> log_exit() is called. There are two places where this is not
> >> guaranteed so add mddev_suspend() and mddev_resume() calls to these
> >> sites.
> >>
> >> The site in raid5_remove_disk() has a comment saying that it is
> >> called in raid5d and thus cannot wait for pending writes; however that
> >> does not appear to be correct anymore (if it ever was) as
> >> raid5_remove_disk() is called from hot_remove_disk() which only
> >> appears to be called in the md_ioctl(). Thus, the comment is removed,
> >> as well as the racy check and replaced with calls to suspend/resume.
> >>
> >> The site in raid5_change_consistency_policy() is in the error path,
> >> and another similar call site already has suspend/resume calls just
> >> below it; so it should be equally safe to make that change here.
> >>
> >> Signed-off-by: Logan Gunthorpe <[email protected]>
> >> Reviewed-by: Christoph Hellwig <[email protected]>
> >> ---
> >> drivers/md/raid5.c | 18 ++++++------------
> >> 1 file changed, 6 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index 5d09256d7f81..3ad37dd4c5cd 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -7938,18 +7938,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
> >>
> >> print_raid5_conf(conf);
> >> if (test_bit(Journal, &rdev->flags) && conf->log) {
> >> - /*
> >> - * we can't wait pending write here, as this is called in
> >> - * raid5d, wait will deadlock.
> >> - * neilb: there is no locking about new writes here,
> >> - * so this cannot be safe.
> >> - */
> >> - if (atomic_read(&conf->active_stripes) ||
> >> - atomic_read(&conf->r5c_cached_full_stripes) ||
> >> - atomic_read(&conf->r5c_cached_partial_stripes)) {
> >> - return -EBUSY;
> >> - }
> >> + mddev_suspend(mddev);
> >
> > Unfortunately, the comment about deadlock is still true, and we cannot call
> > mddev_suspend here. To trigger it:
>
> Ah, yes. What a tangle. I think we can just drop this patch. Now that we
> are removing RCU it isn't actually necessary to fix the bug I was
> seeing. It's still probably broken as the comment notes though.
How about we keep the suspend/resume in raid5_change_consistency_policy()?
Like the one I just pushed to md-next:
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=ac1506992459fe45a085c1f93df74d51c381887b
Thanks,
Song
On 2022-06-08 16:02, Song Liu wrote:
> On Wed, Jun 8, 2022 at 11:21 AM Logan Gunthorpe <[email protected]> wrote:
>>
>>
>>
>> On 2022-06-08 11:59, Song Liu wrote:
>>> On Wed, Jun 8, 2022 at 9:28 AM Logan Gunthorpe <[email protected]> wrote:
>>>>
>>>> The raid5-cache code relies on there being no IO in flight when
>>>> log_exit() is called. There are two places where this is not
>>>> guaranteed so add mddev_suspend() and mddev_resume() calls to these
>>>> sites.
>>>>
>>>> The site in raid5_remove_disk() has a comment saying that it is
>>>> called in raid5d and thus cannot wait for pending writes; however that
>>>> does not appear to be correct anymore (if it ever was) as
>>>> raid5_remove_disk() is called from hot_remove_disk() which only
>>>> appears to be called in the md_ioctl(). Thus, the comment is removed,
>>>> as well as the racy check and replaced with calls to suspend/resume.
>>>>
>>>> The site in raid5_change_consistency_policy() is in the error path,
>>>> and another similar call site already has suspend/resume calls just
>>>> below it; so it should be equally safe to make that change here.
>>>>
>>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>>> Reviewed-by: Christoph Hellwig <[email protected]>
>>>> ---
>>>> drivers/md/raid5.c | 18 ++++++------------
>>>> 1 file changed, 6 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>> index 5d09256d7f81..3ad37dd4c5cd 100644
>>>> --- a/drivers/md/raid5.c
>>>> +++ b/drivers/md/raid5.c
>>>> @@ -7938,18 +7938,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>>>>
>>>> print_raid5_conf(conf);
>>>> if (test_bit(Journal, &rdev->flags) && conf->log) {
>>>> - /*
>>>> - * we can't wait pending write here, as this is called in
>>>> - * raid5d, wait will deadlock.
>>>> - * neilb: there is no locking about new writes here,
>>>> - * so this cannot be safe.
>>>> - */
>>>> - if (atomic_read(&conf->active_stripes) ||
>>>> - atomic_read(&conf->r5c_cached_full_stripes) ||
>>>> - atomic_read(&conf->r5c_cached_partial_stripes)) {
>>>> - return -EBUSY;
>>>> - }
>>>> + mddev_suspend(mddev);
>>>
>>> Unfortunately, the comment about deadlock is still true, and we cannot call
>>> mddev_suspend here. To trigger it:
>>
>> Ah, yes. What a tangle. I think we can just drop this patch. Now that we
>> are removing RCU it isn't actually necessary to fix the bug I was
>> seeing. It's still probably broken as the comment notes though.
>
> How about we keep the suspend/resume in raid5_change_consistency_policy()?
> Like the one I just pushed to md-next:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=ac1506992459fe45a085c1f93df74d51c381887b
I'm good with that. Thanks.
I'll do some testing on md-next shortly.
Thanks,
Logan
While I think this whole series needs to get in of course, this one
seems even morge urgent. Song, can you send it to Jens ASAP, please?
On Thu, Jun 16, 2022 at 11:28 PM Christoph Hellwig <[email protected]> wrote:
>
> While I think this whole series needs to get in of course, this one
> seems even morge urgent. Song, can you send it to Jens ASAP, please?
This is already in Jens' block-5.19 branch. :)
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=block-5.19&id=f34fdcd4a0e7a0b92340ad7e48e7bcff9393fab5
Thanks,
Song