Hi,
This is the updated series with the feedback received in v2[1].
This series fixes all the kernel panics seen in the mdadm tests and
some related sparse issues, plus some cleanup.
The first two patches are cleanups from the original series; they are
no longer strictly related but seemed worth including.
Patches 3 through 6 fix bugs with conf->log and remove the single
unnecessary RCU access. This cleans up some sparse errors.
Patch 7 cleans up some sparse warnings with pslot usage.
Patch 8 is a cleanup which adds an enum so that patch 9 can
fix an mdadm hang. Patch 10 also fixes an mdadm hang.
I've also included Patch 11 in this series, which fixes a recent
mistake in raid5-ppl that was reported by 0day and, as far as I know,
has not been fixed yet.
This series will be followed by another series for mdadm which fixes
the segfaults and annotates some failing tests so that the mdadm tests
can be run fairly reliably. I'll wait for a stable hash for this series
so I can note the kernel version tested against. Following that,
v3 of my lock contention series will be sent with more confidence
in its correctness.
This series is based on the current md/md-next branch as of yesterday
(42b805af10). A git branch is available here:
https://github.com/sbates130272/linux-p2pmem md-bug_v3
Thanks,
Logan
--
Changes since v2:
* Rework the RCU changes to remove the RCU usage instead of using
it everywhere. This makes more sense since most accesses do not need
RCU: they are either on the IO path or done with mddev_lock() held,
and the remaining ones are on the slow path and can take
mddev_lock(). (Per Christoph)
* Collect a couple more Reviewed-by tags from Christoph
Changes since v1:
* Add a patch to move the struct r5l_log to raid5-log.h in order
to fix a compiler error with rcu_access_pointer() in versions
prior to gcc-10
* Rework r5c_is_writeback() changes to make less churn (per Christoph)
* Change some 1s to trues in rcu_dereference_protected calls (per
Christoph)
* Fix an odd hunk mistake in the RCU protection patch (per Christoph)
* Fix an inverted conditional (noticed by Donald)
* Add a patch to add an enum for the overloaded values used by
mddev->curr_resync to make the status_resync() fixes clearer
(per Christoph)
--
[1] https://lore.kernel.org/all/[email protected]
Logan Gunthorpe (11):
md/raid5-log: Drop extern decorators for function prototypes
md/raid5-ppl: Drop unused argument from ppl_handle_flush_request()
md/raid5: Ensure array is suspended for calls to log_exit()
md/raid5-cache: Take mddev_lock in r5c_journal_mode_show()
md/raid5-cache: Drop RCU usage of conf->log
md/raid5-cache: Clear conf->log after finishing work
md/raid5-cache: Annotate pslot with __rcu notation
md: Use enum for overloaded magic numbers used by mddev->curr_resync
md: Ensure resync is reported after it starts
md: Notify sysfs sync_completed in md_reap_sync_thread()
md/raid5-ppl: Fix argument order in bio_alloc_bioset()
drivers/md/md.c | 55 +++++++++++++++-------------
drivers/md/md.h | 15 ++++++++
drivers/md/raid5-cache.c | 34 +++++++++---------
drivers/md/raid5-log.h | 77 +++++++++++++++++++---------------------
drivers/md/raid5-ppl.c | 6 ++--
drivers/md/raid5.c | 18 ++++------
6 files changed, 109 insertions(+), 96 deletions(-)
base-commit: 42b805af102471f53e3c7867b8c2b502ea4eef7e
--
2.30.2
The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the fragility of kernel
tasks that can be hung by buggy userspace processes...)
When the background mdadm process hangs, it is waiting (without a
timeout) for a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple of times when
the reshape finishes, but those wakeups happen before
MD_RECOVERY_RUNNING is cleared, so sync_completed_show() reports 0
instead of "none".
To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm so that it
continues and writes to suspend_lo/suspend_hi, allowing IO to resume.
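As an illustration of the ordering (simplified sketch, not the exact
kernel code):

    /* what mdadm polls via the sync_completed sysfs file */
    static ssize_t sync_completed_show(struct mddev *mddev, char *page)
    {
            if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
                    return sprintf(page, "none\n");  /* what mdadm waits for */
            /* ... otherwise a sector count is printed ... */
    }

    /* md_reap_sync_thread(), with this patch */
    clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
    /* ... */
    sysfs_notify_dirent_safe(mddev->sysfs_completed);  /* wake mdadm now
                                                          that "none" can
                                                          be read */

Before this change, the last notification mdadm received arrived while
MD_RECOVERY_RUNNING was still set, so it read a sector count and went
back to waiting indefinitely.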
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/md.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2be429874d18..2c07c9508222 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9476,6 +9476,7 @@ void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
wake_up(&resync_wait);
/* flag recovery needed just to double check */
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ sysfs_notify_dirent_safe(mddev->sysfs_completed);
sysfs_notify_dirent_safe(mddev->sysfs_action);
md_new_event();
if (mddev->event_work.func)
--
2.30.2
ppl_handle_flush_request() takes a struct r5l_log argument but doesn't
use it. It has no business taking this argument, as struct r5l_log is
only used by raid5-cache and raid5-ppl has no way to dereference it
anyway. Remove the argument.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-log.h | 4 ++--
drivers/md/raid5-ppl.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 270ced4f770f..c8332502669e 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -41,7 +41,7 @@ void ppl_write_stripe_run(struct r5conf *conf);
void ppl_stripe_write_finished(struct stripe_head *sh);
int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
void ppl_quiesce(struct r5conf *conf, int quiesce);
-int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio);
+int ppl_handle_flush_request(struct bio *bio);
extern struct md_sysfs_entry ppl_write_hint;
static inline bool raid5_has_log(struct r5conf *conf)
@@ -108,7 +108,7 @@ static inline int log_handle_flush_request(struct r5conf *conf, struct bio *bio)
if (conf->log)
ret = r5l_handle_flush_request(conf->log, bio);
else if (raid5_has_ppl(conf))
- ret = ppl_handle_flush_request(conf->log, bio);
+ ret = ppl_handle_flush_request(bio);
return ret;
}
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 973e2e06f19c..4f5bdb4cad2b 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -679,7 +679,7 @@ void ppl_quiesce(struct r5conf *conf, int quiesce)
}
}
-int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio)
+int ppl_handle_flush_request(struct bio *bio)
{
if (bio->bi_iter.bi_size == 0) {
bio_endio(bio);
--
2.30.2
The mddev->lock spinlock doesn't protect against the removal of
conf->log in r5l_exit_log() so conf->log may be freed before it
is used.
To fix this, take the mddev_lock() instead of the mddev->lock spinlock.
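For context, mddev_lock() takes the reconfig_mutex (roughly, from
drivers/md/md.h):

    static inline int mddev_lock(struct mddev *mddev)
    {
            return mutex_lock_interruptible(&mddev->reconfig_mutex);
    }

r5l_exit_log() runs with the same mutex held (see the lockdep
assertion added below), so conf->log cannot be freed while the journal
mode is being read.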
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5-cache.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 83c184eddbda..69b95005abca 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2534,10 +2534,13 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
struct r5conf *conf;
int ret;
- spin_lock(&mddev->lock);
+ ret = mddev_lock(mddev);
+ if (ret)
+ return ret;
+
conf = mddev->private;
if (!conf || !conf->log) {
- spin_unlock(&mddev->lock);
+ mddev_unlock(mddev);
return 0;
}
@@ -2557,7 +2560,7 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
default:
ret = 0;
}
- spin_unlock(&mddev->lock);
+ mddev_unlock(mddev);
return ret;
}
@@ -3167,6 +3170,8 @@ void r5l_exit_log(struct r5conf *conf)
{
struct r5l_log *log = conf->log;
+ lockdep_assert_held(&conf->mddev->reconfig_mutex);
+
conf->log = NULL;
synchronize_rcu();
--
2.30.2
bio_alloc_bioset() takes a block device, the number of vectors, the
OP flags, the GFP mask, and the bio set. However, when the prototype
was changed, the call site in ppl_do_flush() had the OP flags and
the GFP flags reversed. This introduced sparse errors:
drivers/md/raid5-ppl.c:632:57: warning: incorrect type in argument 3
(different base types)
drivers/md/raid5-ppl.c:632:57: expected unsigned int opf
drivers/md/raid5-ppl.c:632:57: got restricted gfp_t [usertype]
drivers/md/raid5-ppl.c:633:61: warning: incorrect type in argument 4
(different base types)
drivers/md/raid5-ppl.c:633:61: expected restricted gfp_t [usertype]
gfp_mask
drivers/md/raid5-ppl.c:633:61: got unsigned long long
These sparse errors may not have been reported by 0day due to other
in-flight work that was cleaning up sparse errors in the same area.
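For reference, the post-change prototype is (roughly):

    struct bio *bio_alloc_bioset(struct block_device *bdev,
                                 unsigned short nr_vecs, unsigned int opf,
                                 gfp_t gfp_mask, struct bio_set *bs);

so the opf and gfp_mask arguments at the ppl_do_flush() call site were
swapped.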
Fixes: 609be1066731 ("block: pass a block_device and opf to bio_alloc_bioset")
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5-ppl.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 4f5bdb4cad2b..db49edec362a 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -629,9 +629,9 @@ static void ppl_do_flush(struct ppl_io_unit *io)
if (bdev) {
struct bio *bio;
- bio = bio_alloc_bioset(bdev, 0, GFP_NOIO,
+ bio = bio_alloc_bioset(bdev, 0,
REQ_OP_WRITE | REQ_PREFLUSH,
- &ppl_conf->flush_bs);
+ GFP_NOIO, &ppl_conf->flush_bs);
bio->bi_private = io;
bio->bi_end_io = ppl_flush_endio;
--
2.30.2
Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.
The only functional change is that a couple of places used the wrong
comparison operator, implying that 3 is another special value. They
are all fixed so that 3 or greater means an active resync.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/md.c | 40 ++++++++++++++++++----------------------
drivers/md/md.h | 15 +++++++++++++++
2 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8273ac5eef06..0893029865eb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5001,7 +5001,7 @@ static ssize_t
sync_speed_show(struct mddev *mddev, char *page)
{
unsigned long resync, dt, db;
- if (mddev->curr_resync == 0)
+ if (mddev->curr_resync == MD_RESYNC_NONE)
return sprintf(page, "none\n");
resync = mddev->curr_mark_cnt - atomic_read(&mddev->recovery_active);
dt = (jiffies - mddev->resync_mark) / HZ;
@@ -5020,8 +5020,8 @@ sync_completed_show(struct mddev *mddev, char *page)
if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
return sprintf(page, "none\n");
- if (mddev->curr_resync == 1 ||
- mddev->curr_resync == 2)
+ if (mddev->curr_resync == MD_RESYNC_YIELDED ||
+ mddev->curr_resync == MD_RESYNC_DELAYED)
return sprintf(page, "delayed\n");
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
@@ -8018,7 +8018,7 @@ static int status_resync(struct seq_file *seq, struct mddev *mddev)
max_sectors = mddev->dev_sectors;
resync = mddev->curr_resync;
- if (resync <= 3) {
+ if (resync < MD_RESYNC_ACTIVE) {
if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
/* Still cleaning up */
resync = max_sectors;
@@ -8027,7 +8027,7 @@ static int status_resync(struct seq_file *seq, struct mddev *mddev)
else
resync -= atomic_read(&mddev->recovery_active);
- if (resync == 0) {
+ if (resync == MD_RESYNC_NONE) {
if (test_bit(MD_RESYNCING_REMOTE, &mddev->recovery)) {
struct md_rdev *rdev;
@@ -8051,7 +8051,7 @@ static int status_resync(struct seq_file *seq, struct mddev *mddev)
}
return 0;
}
- if (resync < 3) {
+ if (resync < MD_RESYNC_ACTIVE) {
seq_printf(seq, "\tresync=DELAYED");
return 1;
}
@@ -8729,13 +8729,7 @@ void md_do_sync(struct md_thread *thread)
mddev->last_sync_action = action ?: desc;
- /* we overload curr_resync somewhat here.
- * 0 == not engaged in resync at all
- * 2 == checking that there is no conflict with another sync
- * 1 == like 2, but have yielded to allow conflicting resync to
- * commence
- * other == active in resync - this many blocks
- *
+ /*
* Before starting a resync we must have set curr_resync to
* 2, and then checked that every "conflicting" array has curr_resync
* less than ours. When we find one that is the same or higher
@@ -8747,7 +8741,7 @@ void md_do_sync(struct md_thread *thread)
do {
int mddev2_minor = -1;
- mddev->curr_resync = 2;
+ mddev->curr_resync = MD_RESYNC_DELAYED;
try_again:
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
@@ -8759,12 +8753,14 @@ void md_do_sync(struct md_thread *thread)
&& mddev2->curr_resync
&& match_mddev_units(mddev, mddev2)) {
DEFINE_WAIT(wq);
- if (mddev < mddev2 && mddev->curr_resync == 2) {
+ if (mddev < mddev2 &&
+ mddev->curr_resync == MD_RESYNC_DELAYED) {
/* arbitrarily yield */
- mddev->curr_resync = 1;
+ mddev->curr_resync = MD_RESYNC_YIELDED;
wake_up(&resync_wait);
}
- if (mddev > mddev2 && mddev->curr_resync == 1)
+ if (mddev > mddev2 &&
+ mddev->curr_resync == MD_RESYNC_YIELDED)
/* no need to wait here, we can wait the next
* time 'round when curr_resync == 2
*/
@@ -8792,7 +8788,7 @@ void md_do_sync(struct md_thread *thread)
finish_wait(&resync_wait, &wq);
}
}
- } while (mddev->curr_resync < 2);
+ } while (mddev->curr_resync < MD_RESYNC_DELAYED);
j = 0;
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
@@ -8876,7 +8872,7 @@ void md_do_sync(struct md_thread *thread)
desc, mdname(mddev));
mddev->curr_resync = j;
} else
- mddev->curr_resync = 3; /* no longer delayed */
+ mddev->curr_resync = MD_RESYNC_ACTIVE; /* no longer delayed */
mddev->curr_resync_completed = j;
sysfs_notify_dirent_safe(mddev->sysfs_completed);
md_new_event();
@@ -9011,14 +9007,14 @@ void md_do_sync(struct md_thread *thread)
if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
- mddev->curr_resync > 3) {
+ mddev->curr_resync >= MD_RESYNC_ACTIVE) {
mddev->curr_resync_completed = mddev->curr_resync;
sysfs_notify_dirent_safe(mddev->sysfs_completed);
}
mddev->pers->sync_request(mddev, max_sectors, &skipped);
if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
- mddev->curr_resync > 3) {
+ mddev->curr_resync >= MD_RESYNC_ACTIVE) {
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
if (mddev->curr_resync >= mddev->recovery_cp) {
@@ -9082,7 +9078,7 @@ void md_do_sync(struct md_thread *thread)
} else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
mddev->resync_min = mddev->curr_resync_completed;
set_bit(MD_RECOVERY_DONE, &mddev->recovery);
- mddev->curr_resync = 0;
+ mddev->curr_resync = MD_RESYNC_NONE;
spin_unlock(&mddev->lock);
wake_up(&resync_wait);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 5f62c46ac2d3..2d06003a4c3f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -288,6 +288,21 @@ struct serial_info {
sector_t _subtree_last; /* highest sector in subtree of rb node */
};
+/*
+ * mddev->curr_resync stores the current sector of the resync but
+ * also has some overloaded values.
+ */
+enum {
+ /* No resync in progress */
+ MD_RESYNC_NONE = 0,
+ /* Yielded to allow another conflicting resync to commence */
+ MD_RESYNC_YIELDED = 1,
+ /* Delayed to check that there is no conflict with another sync */
+ MD_RESYNC_DELAYED = 2,
+ /* Any value greater than or equal to this is in an active resync */
+ MD_RESYNC_ACTIVE = 3,
+};
+
struct mddev {
void *private;
struct md_personality *pers;
--
2.30.2
A NULL pointer dereference on conf->log is seen randomly with
the mdadm test 21raid5cache. KASAN reports:
BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140
Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086
Call Trace:
dump_stack_lvl+0x5a/0x74
kasan_report.cold+0x5f/0x1a9
__asan_load8+0x69/0x90
r5l_reclaimable_space+0xf5/0x140
r5l_do_reclaim+0xf4/0x5e0
r5l_reclaim_thread+0x69/0x3b0
md_thread+0x1a2/0x2c0
kthread+0x177/0x1b0
ret_from_fork+0x22/0x30
This is caused by conf->log being cleared in r5l_exit_log() before
stopping the reclaim thread.
To fix this, clear conf->log after the reclaim_thread is unregistered
and after flushing disable_writeback_work.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5-cache.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 48020d36b334..d6461d4814ab 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -3167,12 +3167,13 @@ void r5l_exit_log(struct r5conf *conf)
lockdep_assert_held(&conf->mddev->reconfig_mutex);
- conf->log = NULL;
-
/* Ensure disable_writeback_work wakes up and exits */
wake_up(&conf->mddev->sb_wait);
flush_work(&log->disable_writeback_work);
md_unregister_thread(&log->reclaim_thread);
+
+ conf->log = NULL;
+
mempool_exit(&log->meta_pool);
bioset_exit(&log->bs);
mempool_exit(&log->io_pool);
--
2.30.2
The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:
mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
File exists
This is because the background mdadm process, which is responsible for
cleaning up this backup file, gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file to see whether a
reshape is running and, if it is not, waits for an event on the file
or times out after 5 seconds. On faster machines, the reshape may
complete before the 5-second timeout expires, and thus the background
mdadm process loops waiting for a reshape to start that has already
occurred.
When mdadm first reads the mdstat file, mdstat does not report that the
reshape has begun, even though it actually has, so the mdstat_wait()
call in mdadm that polls on the file never returns until it times out.
The reason mdstat does not report that the reshape has started is an
issue in status_resync(): recovery_active is subtracted from
curr_resync, which yields zero for the first chunk of reshaped data,
so the resulting read reports no reshape in progress.
To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.
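As a simplified illustration of the arithmetic (hypothetical values,
not the exact kernel code):

    sector_t curr_resync = MD_RESYNC_ACTIVE;  /* 3: reshape just started */
    sector_t recovery_active = 3;             /* first chunk still in flight */
    sector_t resync = curr_resync - recovery_active;

    /*
     * resync is now 0 == MD_RESYNC_NONE, so status_resync() would report
     * that no reshape is in progress even though one has started.
     */

Forcing any such result up to MD_RESYNC_ACTIVE keeps the report
truthful until real progress is made.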
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/md.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0893029865eb..2be429874d18 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8022,10 +8022,20 @@ static int status_resync(struct seq_file *seq, struct mddev *mddev)
if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
/* Still cleaning up */
resync = max_sectors;
- } else if (resync > max_sectors)
+ } else if (resync > max_sectors) {
resync = max_sectors;
- else
+ } else {
resync -= atomic_read(&mddev->recovery_active);
+ if (resync < MD_RESYNC_ACTIVE) {
+ /*
+ * Resync has started, but the subtraction has
+ * yielded one of the special values. Force it
+ * to active to ensure the status reports an
+ * active resync.
+ */
+ resync = MD_RESYNC_ACTIVE;
+ }
+ }
if (resync == MD_RESYNC_NONE) {
if (test_bit(MD_RESYNCING_REMOTE, &mddev->recovery)) {
--
2.30.2
On 2022-06-03 00:39, Christoph Hellwig wrote:
> On Thu, Jun 02, 2022 at 12:18:10PM -0600, Logan Gunthorpe wrote:
>> conf = mddev->private;
>> if (!conf || !conf->log) {
>> - spin_unlock(&mddev->lock);
>> + mddev_unlock(mddev);
>> return 0;
>> }
>>
>> @@ -2557,7 +2560,7 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
>> default:
>> ret = 0;
>> }
>> - spin_unlock(&mddev->lock);
>> + mddev_unlock(mddev);
>> return ret;
>
> Using a goto out_unlock would be nice here to keep the critical
> sections simple. But even as-is this looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>
>
>> + lockdep_assert_held(&conf->mddev->reconfig_mutex);
>> +
>
> .. but this looks unrelated and misplaced in this patch.
>
Ok, yup, I'll fix these issues up and send an updated series next week.
Logan
extern is not necessary, and is recommended against, when declaring
function prototypes in headers. checkpatch.pl complains about these,
so remove them.
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-log.h | 75 ++++++++++++++++++++----------------------
1 file changed, 36 insertions(+), 39 deletions(-)
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 43c714a8798c..270ced4f770f 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -2,49 +2,46 @@
#ifndef _RAID5_LOG_H
#define _RAID5_LOG_H
-extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
-extern void r5l_exit_log(struct r5conf *conf);
-extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
-extern void r5l_write_stripe_run(struct r5l_log *log);
-extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
-extern void r5l_stripe_write_finished(struct stripe_head *sh);
-extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
-extern void r5l_quiesce(struct r5l_log *log, int quiesce);
-extern bool r5l_log_disk_error(struct r5conf *conf);
-extern bool r5c_is_writeback(struct r5l_log *log);
-extern int
-r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
- struct stripe_head_state *s, int disks);
-extern void
-r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
- struct stripe_head_state *s);
-extern void r5c_release_extra_page(struct stripe_head *sh);
-extern void r5c_use_extra_page(struct stripe_head *sh);
-extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
-extern void r5c_handle_cached_data_endio(struct r5conf *conf,
- struct stripe_head *sh, int disks);
-extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh);
-extern void r5c_make_stripe_write_out(struct stripe_head *sh);
-extern void r5c_flush_cache(struct r5conf *conf, int num);
-extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
-extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
+void r5l_exit_log(struct r5conf *conf);
+int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
+void r5l_write_stripe_run(struct r5l_log *log);
+void r5l_flush_stripe_to_raid(struct r5l_log *log);
+void r5l_stripe_write_finished(struct stripe_head *sh);
+int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
+void r5l_quiesce(struct r5l_log *log, int quiesce);
+bool r5l_log_disk_error(struct r5conf *conf);
+bool r5c_is_writeback(struct r5l_log *log);
+int r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
+ struct stripe_head_state *s, int disks);
+void r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
+ struct stripe_head_state *s);
+void r5c_release_extra_page(struct stripe_head *sh);
+void r5c_use_extra_page(struct stripe_head *sh);
+void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+void r5c_handle_cached_data_endio(struct r5conf *conf,
+ struct stripe_head *sh, int disks);
+int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh);
+void r5c_make_stripe_write_out(struct stripe_head *sh);
+void r5c_flush_cache(struct r5conf *conf, int num);
+void r5c_check_stripe_cache_usage(struct r5conf *conf);
+void r5c_check_cached_full_stripe(struct r5conf *conf);
extern struct md_sysfs_entry r5c_journal_mode;
-extern void r5c_update_on_rdev_error(struct mddev *mddev,
- struct md_rdev *rdev);
-extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
-extern int r5l_start(struct r5l_log *log);
+void r5c_update_on_rdev_error(struct mddev *mddev, struct md_rdev *rdev);
+bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+int r5l_start(struct r5l_log *log);
-extern struct dma_async_tx_descriptor *
+struct dma_async_tx_descriptor *
ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
struct dma_async_tx_descriptor *tx);
-extern int ppl_init_log(struct r5conf *conf);
-extern void ppl_exit_log(struct r5conf *conf);
-extern int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
-extern void ppl_write_stripe_run(struct r5conf *conf);
-extern void ppl_stripe_write_finished(struct stripe_head *sh);
-extern int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
-extern void ppl_quiesce(struct r5conf *conf, int quiesce);
-extern int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio);
+int ppl_init_log(struct r5conf *conf);
+void ppl_exit_log(struct r5conf *conf);
+int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
+void ppl_write_stripe_run(struct r5conf *conf);
+void ppl_stripe_write_finished(struct stripe_head *sh);
+int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
+void ppl_quiesce(struct r5conf *conf, int quiesce);
+int ppl_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern struct md_sysfs_entry ppl_write_hint;
static inline bool raid5_has_log(struct r5conf *conf)
--
2.30.2
On Thu, Jun 02, 2022 at 12:18:10PM -0600, Logan Gunthorpe wrote:
> conf = mddev->private;
> if (!conf || !conf->log) {
> - spin_unlock(&mddev->lock);
> + mddev_unlock(mddev);
> return 0;
> }
>
> @@ -2557,7 +2560,7 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
> default:
> ret = 0;
> }
> - spin_unlock(&mddev->lock);
> + mddev_unlock(mddev);
> return ret;
Using a goto out_unlock would be nice here to keep the critical
sections simple. But even as-is this looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
> + lockdep_assert_held(&conf->mddev->reconfig_mutex);
> +
.. but this looks unrelated and misplaced in this patch.
The radix_tree_lookup_slot() and radix_tree_replace_slot() APIs expect
the slot that is looked up and replaced to be marked with __rcu.
Otherwise, sparse warnings are generated:
drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
assignment (different address spaces)
drivers/md/raid5-cache.c:2939:23: expected void **pslot
drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu **
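For reference, the relevant declarations in include/linux/radix-tree.h
are (roughly):

    void __rcu **radix_tree_lookup_slot(const struct radix_tree_root *,
                                        unsigned long index);
    void radix_tree_replace_slot(struct radix_tree_root *,
                                 void __rcu **slot, void *item);

so a plain void **pslot mismatches the __rcu-annotated address space.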
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5-cache.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index d6461d4814ab..15cab59adfcf 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2637,7 +2637,7 @@ int r5c_try_caching_write(struct r5conf *conf,
int i;
struct r5dev *dev;
int to_cache = 0;
- void **pslot;
+ void __rcu **pslot;
sector_t tree_index;
int ret;
uintptr_t refcount;
@@ -2804,7 +2804,7 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
int i;
int do_wakeup = 0;
sector_t tree_index;
- void **pslot;
+ void __rcu **pslot;
uintptr_t refcount;
if (!log || !test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
--
2.30.2
The only place that uses RCU to access conf->log is in
r5l_log_disk_error(). This function is mostly used in the IO path
and once with mddev_lock() held in raid5_change_consistency_policy().
It is known that the IO will be suspended before the log is freed, and
r5l_exit_log() is called with the mddev_lock() held.
This should mean that conf->log cannot be freed while the function is
being called, so the RCU protection is not necessary. Drop the
rcu_read_lock() as well as the synchronize_rcu() and
rcu_assign_pointer() usage.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5-cache.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 69b95005abca..48020d36b334 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1590,18 +1590,13 @@ void r5l_quiesce(struct r5l_log *log, int quiesce)
bool r5l_log_disk_error(struct r5conf *conf)
{
- struct r5l_log *log;
- bool ret;
- /* don't allow write if journal disk is missing */
- rcu_read_lock();
- log = rcu_dereference(conf->log);
+ struct r5l_log *log = conf->log;
+ /* don't allow write if journal disk is missing */
if (!log)
- ret = test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
+ return test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
else
- ret = test_bit(Faulty, &log->rdev->flags);
- rcu_read_unlock();
- return ret;
+ return test_bit(Faulty, &log->rdev->flags);
}
#define R5L_RECOVERY_PAGE_POOL_SIZE 256
@@ -3148,7 +3143,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
spin_lock_init(&log->stripe_in_journal_lock);
atomic_set(&log->stripe_in_journal_count, 0);
- rcu_assign_pointer(conf->log, log);
+ conf->log = log;
set_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
return 0;
@@ -3173,7 +3168,6 @@ void r5l_exit_log(struct r5conf *conf)
lockdep_assert_held(&conf->mddev->reconfig_mutex);
conf->log = NULL;
- synchronize_rcu();
/* Ensure disable_writeback_work wakes up and exits */
wake_up(&conf->mddev->sb_wait);
--
2.30.2
On Thu, Jun 02, 2022 at 12:18:12PM -0600, Logan Gunthorpe wrote:
> A NULL pointer dereference on conf->log is seen randomly with
> the mdadm test 21raid5cache. KASAN reports:
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>