Hi,
I've been doing some work trying to improve the bulk write performance of
raid5 on large systems with fast NVMe drives. The bottleneck appears
largely to be lock contention on the hash_lock and device_lock. This
series improves the situation slightly by addressing a couple of
low-hanging-fruit opportunities to take the locks fewer times in the
request path.
Patch 5 adjusts how batching works by keeping a reference to the
previous stripe_head in raid5_make_request(). In most situations,
this removes the need to take the hash_lock in stripe_add_to_batch_list()
which should reduce the number of times the lock is taken by a factor of
about 2.
Patch 8 pivots the way raid5_make_request() works. Before the patch, the
code must find the stripe_head for every 4KB page in the request, so each
stripe head must be found once for every data disk. The patch changes this
so that all the data disks can be added to a stripe_head at once and the
number of times the stripe_head must be found (and thus the number of
times the hash_lock is taken) should be reduced by a factor roughly equal
to the number of data disks.
The remaining patches are just cleanup and prep patches for those two
patches.
Doing apples-to-apples testing of this series on a small VM with 5 ram
disks, I saw a bandwidth increase of roughly 14%, and lock contention
on the hash_lock (as reported by lock_stat) dropped by more than a
factor of 5 (though it is still significantly contended).
Testing on larger systems with NVMe drives showed similar bandwidth
increases, from 3% to 20% depending on the parameters. Oddly, small
arrays saw larger gains, likely because they have lower starting
bandwidths; I would have expected larger gains with larger arrays
(given that even fewer locks should be taken in raid5_make_request()).
Logan
--
Logan Gunthorpe (8):
md/raid5: Refactor raid5_make_request loop
md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
md/raid5: Move common stripe count increment code into __find_stripe()
md/raid5: Make common label for schedule/retry in raid5_make_request()
md/raid5: Keep a reference to last stripe_head for batch
md/raid5: Refactor add_stripe_bio()
md/raid5: Check all disks in a stripe_head for reshape progress
md/raid5: Pivot raid5_make_request()
drivers/md/raid5.c | 442 +++++++++++++++++++++++++++------------------
drivers/md/raid5.h | 1 +
2 files changed, 270 insertions(+), 173 deletions(-)
base-commit: 3123109284176b1532874591f7c81f3837bbdc17
--
2.30.2
Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().
This makes the code easier to read and reduces the indentation level.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 111 +++++++++++++++++++++++----------------------
1 file changed, 56 insertions(+), 55 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 351d341a1ffa..b794253efd15 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5868,69 +5868,70 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
sh = raid5_get_active_stripe(conf, new_sector, previous,
(bi->bi_opf & REQ_RAHEAD), 0);
- if (sh) {
- if (unlikely(previous)) {
- /* expansion might have moved on while waiting for a
- * stripe, so we must do the range check again.
- * Expansion could still move past after this
- * test, but as we are holding a reference to
- * 'sh', we know that if that happens,
- * STRIPE_EXPANDING will get set and the expansion
- * won't proceed until we finish with the stripe.
- */
- int must_retry = 0;
- spin_lock_irq(&conf->device_lock);
- if (mddev->reshape_backwards
- ? logical_sector >= conf->reshape_progress
- : logical_sector < conf->reshape_progress)
- /* mismatch, need to try again */
- must_retry = 1;
- spin_unlock_irq(&conf->device_lock);
- if (must_retry) {
- raid5_release_stripe(sh);
- schedule();
- do_prepare = true;
- goto retry;
- }
- }
- if (read_seqcount_retry(&conf->gen_lock, seq)) {
- /* Might have got the wrong stripe_head
- * by accident
- */
- raid5_release_stripe(sh);
- goto retry;
- }
+ if (unlikely(!sh)) {
+ /* cannot get stripe, just give-up */
+ bi->bi_status = BLK_STS_IOERR;
+ break;
+ }
- if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
- /* Stripe is busy expanding or
- * add failed due to overlap. Flush everything
- * and wait a while
- */
- md_wakeup_thread(mddev->thread);
+ if (unlikely(previous)) {
+ /*
+ * Expansion might have moved on while waiting for a
+ * stripe, so we must do the range check again.
+ * Expansion could still move past after this
+ * test, but as we are holding a reference to
+ * 'sh', we know that if that happens,
+ * STRIPE_EXPANDING will get set and the expansion
+ * won't proceed until we finish with the stripe.
+ */
+ int must_retry = 0;
+ spin_lock_irq(&conf->device_lock);
+ if (mddev->reshape_backwards
+ ? logical_sector >= conf->reshape_progress
+ : logical_sector < conf->reshape_progress)
+ /* mismatch, need to try again */
+ must_retry = 1;
+ spin_unlock_irq(&conf->device_lock);
+ if (must_retry) {
raid5_release_stripe(sh);
schedule();
do_prepare = true;
goto retry;
}
- if (do_flush) {
- set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
- /* we only need flush for one stripe */
- do_flush = false;
- }
+ }
+ if (read_seqcount_retry(&conf->gen_lock, seq)) {
+ /* Might have got the wrong stripe_head by accident */
+ raid5_release_stripe(sh);
+ goto retry;
+ }
- set_bit(STRIPE_HANDLE, &sh->state);
- clear_bit(STRIPE_DELAYED, &sh->state);
- if ((!sh->batch_head || sh == sh->batch_head) &&
- (bi->bi_opf & REQ_SYNC) &&
- !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- atomic_inc(&conf->preread_active_stripes);
- release_stripe_plug(mddev, sh);
- } else {
- /* cannot get stripe for read-ahead, just give-up */
- bi->bi_status = BLK_STS_IOERR;
- break;
+ if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+ !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+ /*
+ * Stripe is busy expanding or add failed due to
+ * overlap. Flush everything and wait a while.
+ */
+ md_wakeup_thread(mddev->thread);
+ raid5_release_stripe(sh);
+ schedule();
+ do_prepare = true;
+ goto retry;
}
+
+ if (do_flush) {
+ set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+ /* we only need flush for one stripe */
+ do_flush = false;
+ }
+
+ set_bit(STRIPE_HANDLE, &sh->state);
+ clear_bit(STRIPE_DELAYED, &sh->state);
+ if ((!sh->batch_head || sh == sh->batch_head) &&
+ (bi->bi_opf & REQ_SYNC) &&
+ !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ atomic_inc(&conf->preread_active_stripes);
+
+ release_stripe_plug(mddev, sh);
}
finish_wait(&conf->wait_for_overlap, &w);
--
2.30.2
Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.
This will be useful in the next patch so that overlap can be checked
for multiple disks before adding any of them.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 79 ++++++++++++++++++++++++++++++----------------
1 file changed, 52 insertions(+), 27 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b852b6439898..52227dd91e89 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3371,39 +3371,33 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
s->locked, s->ops_request);
}
-/*
- * Each stripe/dev can have one or more bion attached.
- * toread/towrite point to the first in a chain.
- * The bi_next chain must be in order.
- */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
- int forwrite, int previous)
+static int stripe_bio_overlaps(struct stripe_head *sh, struct bio *bi,
+ int dd_idx, int forwrite)
{
- struct bio **bip;
struct r5conf *conf = sh->raid_conf;
- int firstwrite=0;
+ struct bio **bip;
- pr_debug("adding bi b#%llu to stripe s#%llu\n",
- (unsigned long long)bi->bi_iter.bi_sector,
- (unsigned long long)sh->sector);
+ pr_debug("checking bi b#%llu to stripe s#%llu\n",
+ (unsigned long long)bi->bi_iter.bi_sector,
+ (unsigned long long)sh->sector);
- spin_lock_irq(&sh->stripe_lock);
/* Don't allow new IO added to stripes in batch list */
if (sh->batch_head)
- goto overlap;
- if (forwrite) {
+ return 1;
+
+ if (forwrite)
bip = &sh->dev[dd_idx].towrite;
- if (*bip == NULL)
- firstwrite = 1;
- } else
+ else
bip = &sh->dev[dd_idx].toread;
+
while (*bip && (*bip)->bi_iter.bi_sector < bi->bi_iter.bi_sector) {
if (bio_end_sector(*bip) > bi->bi_iter.bi_sector)
- goto overlap;
+ return 1;
bip = & (*bip)->bi_next;
}
+
if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
- goto overlap;
+ return 1;
if (forwrite && raid5_has_ppl(conf)) {
/*
@@ -3432,7 +3426,25 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
}
if (first + conf->chunk_sectors * (count - 1) != last)
- goto overlap;
+ return 1;
+ }
+
+ return 0;
+}
+
+static void __add_stripe_bio(struct stripe_head *sh, struct bio *bi,
+ int dd_idx, int forwrite, int previous)
+{
+ struct r5conf *conf = sh->raid_conf;
+ struct bio **bip;
+ int firstwrite = 0;
+
+ if (forwrite) {
+ bip = &sh->dev[dd_idx].towrite;
+ if (!*bip)
+ firstwrite = 1;
+ } else {
+ bip = &sh->dev[dd_idx].toread;
}
if (!forwrite || previous)
@@ -3470,7 +3482,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
* we have added to the bitmap and set bm_seq.
* So set STRIPE_BITMAP_PENDING to prevent
* batching.
- * If multiple add_stripe_bio() calls race here they
+ * If multiple __add_stripe_bio() calls race here they
* much all set STRIPE_BITMAP_PENDING. So only the first one
* to complete "bitmap_startwrite" gets to set
* STRIPE_BIT_DELAY. This is important as once a stripe
@@ -3488,14 +3500,27 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
set_bit(STRIPE_BIT_DELAY, &sh->state);
}
}
- spin_unlock_irq(&sh->stripe_lock);
+}
- return 1;
+/*
+ * Each stripe/dev can have one or more bion attached.
+ * toread/towrite point to the first in a chain.
+ * The bi_next chain must be in order.
+ */
+static int add_stripe_bio(struct stripe_head *sh, struct bio *bi,
+ int dd_idx, int forwrite, int previous)
+{
+ spin_lock_irq(&sh->stripe_lock);
- overlap:
- set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+ if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+ set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+ spin_unlock_irq(&sh->stripe_lock);
+ return 0;
+ }
+
+ __add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
spin_unlock_irq(&sh->stripe_lock);
- return 0;
+ return 1;
}
static void end_reshape(struct r5conf *conf);
--
2.30.2
stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.
The call to add_stripe_bio() in retry_aligned_read() is for a read only
and thus wouldn't have added the batch anyway.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b794253efd15..e3c75b3b8923 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3504,8 +3504,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
}
spin_unlock_irq(&sh->stripe_lock);
- if (stripe_can_batch(sh))
- stripe_add_to_batch_list(conf, sh);
return 1;
overlap:
@@ -5918,6 +5916,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
goto retry;
}
+ if (stripe_can_batch(sh))
+ stripe_add_to_batch_list(conf, sh);
+
if (do_flush) {
set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
/* we only need flush for one stripe */
--
2.30.2
Cleanup the code to make a common label for the schedule,
prepare_to_wait() and retry path. This drops the do_prepare boolean.
This requires moving the prepare_to_wait() above the
read_seqcount_begin() call on the retry path. However, there's no
apparent requirement for ordering between these two calls.
This should hopefully be easier to read than following the extra
do_prepare boolean, but it will also be used in a subsequent patch to
add more code common to all schedule() calls.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index be01c4515f0e..f963ffb35484 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5742,7 +5742,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
struct stripe_head *sh;
const int rw = bio_data_dir(bi);
DEFINE_WAIT(w);
- bool do_prepare;
bool do_flush = false;
if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
@@ -5803,13 +5802,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
int previous;
int seq;
- do_prepare = false;
retry:
seq = read_seqcount_begin(&conf->gen_lock);
previous = 0;
- if (do_prepare)
- prepare_to_wait(&conf->wait_for_overlap, &w,
- TASK_UNINTERRUPTIBLE);
if (unlikely(conf->reshape_progress != MaxSector)) {
/* spinlock is needed as reshape_progress may be
* 64bit on a 32bit platform, and so it might be
@@ -5829,9 +5824,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
? logical_sector < conf->reshape_safe
: logical_sector >= conf->reshape_safe) {
spin_unlock_irq(&conf->device_lock);
- schedule();
- do_prepare = true;
- goto retry;
+ goto schedule_and_retry;
}
}
spin_unlock_irq(&conf->device_lock);
@@ -5872,9 +5865,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
spin_unlock_irq(&conf->device_lock);
if (must_retry) {
raid5_release_stripe(sh);
- schedule();
- do_prepare = true;
- goto retry;
+ goto schedule_and_retry;
}
}
if (read_seqcount_retry(&conf->gen_lock, seq)) {
@@ -5891,9 +5882,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
*/
md_wakeup_thread(mddev->thread);
raid5_release_stripe(sh);
- schedule();
- do_prepare = true;
- goto retry;
+ goto schedule_and_retry;
}
if (stripe_can_batch(sh))
@@ -5913,6 +5902,13 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
atomic_inc(&conf->preread_active_stripes);
release_stripe_plug(mddev, sh);
+ continue;
+
+schedule_and_retry:
+ schedule();
+ prepare_to_wait(&conf->wait_for_overlap, &w,
+ TASK_UNINTERRUPTIBLE);
+ goto retry;
}
finish_wait(&conf->wait_for_overlap, &w);
--
2.30.2
Both users of __find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common place in
__find_stripe().
No functional changes intended.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 118 +++++++++++++++++++--------------------------
1 file changed, 49 insertions(+), 69 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e3c75b3b8923..be01c4515f0e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -605,16 +605,42 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
}
static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
- short generation)
+ short generation, int hash)
{
+ int inc_empty_inactive_list_flag;
struct stripe_head *sh;
pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
hlist_for_each_entry(sh, stripe_hash(conf, sector), hash)
if (sh->sector == sector && sh->generation == generation)
- return sh;
+ goto found;
pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
return NULL;
+
+found:
+ if (!atomic_inc_not_zero(&sh->count)) {
+ spin_lock(&conf->device_lock);
+ if (!atomic_read(&sh->count)) {
+ if (!test_bit(STRIPE_HANDLE, &sh->state))
+ atomic_inc(&conf->active_stripes);
+ BUG_ON(list_empty(&sh->lru) &&
+ !test_bit(STRIPE_EXPANDING, &sh->state));
+ inc_empty_inactive_list_flag = 0;
+ if (!list_empty(conf->inactive_list + hash))
+ inc_empty_inactive_list_flag = 1;
+ list_del_init(&sh->lru);
+ if (list_empty(conf->inactive_list + hash) &&
+ inc_empty_inactive_list_flag)
+ atomic_inc(&conf->empty_inactive_list_nr);
+ if (sh->group) {
+ sh->group->stripes_cnt--;
+ sh->group = NULL;
+ }
+ }
+ atomic_inc(&sh->count);
+ spin_unlock(&conf->device_lock);
+ }
+ return sh;
}
/*
@@ -705,7 +731,6 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
{
struct stripe_head *sh;
int hash = stripe_hash_locks_hash(conf, sector);
- int inc_empty_inactive_list_flag;
pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
@@ -715,57 +740,34 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
wait_event_lock_irq(conf->wait_for_quiescent,
conf->quiesce == 0 || noquiesce,
*(conf->hash_locks + hash));
- sh = __find_stripe(conf, sector, conf->generation - previous);
- if (!sh) {
- if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
- sh = get_free_stripe(conf, hash);
- if (!sh && !test_bit(R5_DID_ALLOC,
- &conf->cache_state))
- set_bit(R5_ALLOC_MORE,
- &conf->cache_state);
- }
- if (noblock && sh == NULL)
- break;
+ sh = __find_stripe(conf, sector, conf->generation - previous,
+ hash);
+ if (sh)
+ break;
- r5c_check_stripe_cache_usage(conf);
- if (!sh) {
- set_bit(R5_INACTIVE_BLOCKED,
- &conf->cache_state);
- r5l_wake_reclaim(conf->log, 0);
- wait_event_lock_irq(
- conf->wait_for_stripe,
+ if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
+ sh = get_free_stripe(conf, hash);
+ if (!sh && !test_bit(R5_DID_ALLOC, &conf->cache_state))
+ set_bit(R5_ALLOC_MORE, &conf->cache_state);
+ }
+ if (noblock && !sh)
+ break;
+
+ r5c_check_stripe_cache_usage(conf);
+ if (!sh) {
+ set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+ r5l_wake_reclaim(conf->log, 0);
+ wait_event_lock_irq(conf->wait_for_stripe,
!list_empty(conf->inactive_list + hash) &&
(atomic_read(&conf->active_stripes)
< (conf->max_nr_stripes * 3 / 4)
|| !test_bit(R5_INACTIVE_BLOCKED,
&conf->cache_state)),
*(conf->hash_locks + hash));
- clear_bit(R5_INACTIVE_BLOCKED,
- &conf->cache_state);
- } else {
- init_stripe(sh, sector, previous);
- atomic_inc(&sh->count);
- }
- } else if (!atomic_inc_not_zero(&sh->count)) {
- spin_lock(&conf->device_lock);
- if (!atomic_read(&sh->count)) {
- if (!test_bit(STRIPE_HANDLE, &sh->state))
- atomic_inc(&conf->active_stripes);
- BUG_ON(list_empty(&sh->lru) &&
- !test_bit(STRIPE_EXPANDING, &sh->state));
- inc_empty_inactive_list_flag = 0;
- if (!list_empty(conf->inactive_list + hash))
- inc_empty_inactive_list_flag = 1;
- list_del_init(&sh->lru);
- if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
- atomic_inc(&conf->empty_inactive_list_nr);
- if (sh->group) {
- sh->group->stripes_cnt--;
- sh->group = NULL;
- }
- }
+ clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+ } else {
+ init_stripe(sh, sector, previous);
atomic_inc(&sh->count);
- spin_unlock(&conf->device_lock);
}
} while (sh == NULL);
@@ -819,7 +821,6 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
sector_t head_sector, tmp_sec;
int hash;
int dd_idx;
- int inc_empty_inactive_list_flag;
/* Don't cross chunks, so stripe pd_idx/qd_idx is the same */
tmp_sec = sh->sector;
@@ -829,28 +830,7 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
hash = stripe_hash_locks_hash(conf, head_sector);
spin_lock_irq(conf->hash_locks + hash);
- head = __find_stripe(conf, head_sector, conf->generation);
- if (head && !atomic_inc_not_zero(&head->count)) {
- spin_lock(&conf->device_lock);
- if (!atomic_read(&head->count)) {
- if (!test_bit(STRIPE_HANDLE, &head->state))
- atomic_inc(&conf->active_stripes);
- BUG_ON(list_empty(&head->lru) &&
- !test_bit(STRIPE_EXPANDING, &head->state));
- inc_empty_inactive_list_flag = 0;
- if (!list_empty(conf->inactive_list + hash))
- inc_empty_inactive_list_flag = 1;
- list_del_init(&head->lru);
- if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
- atomic_inc(&conf->empty_inactive_list_nr);
- if (head->group) {
- head->group->stripes_cnt--;
- head->group = NULL;
- }
- }
- atomic_inc(&head->count);
- spin_unlock(&conf->device_lock);
- }
+ head = __find_stripe(conf, head_sector, conf->generation, hash);
spin_unlock_irq(conf->hash_locks + hash);
if (!head)
--
2.30.2
raid5_make_request() loops through every page in the request, finds
the appropriate stripe and adds the bio for that page to the correct
disk in that stripe. This causes a great deal of contention on the
hash_lock, since the lock for that hash must be taken for every single
page.
The number of times the lock is taken can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the lock must be taken by
a factor equal to the number of data disks.
To accomplish this, store the most recent disk sector that has been
completed and continue to the next logical sector if its disk sector
has already been handled. Then add an add_all_stripe_bios() helper to
check all the disks in the stripe_head for overlap and add the bio to
all of them if none overlap.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
drivers/md/raid5.h | 1 +
2 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1ddce09970fa..6b098819f7db 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3523,6 +3523,48 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi,
return 1;
}
+static int add_all_stripe_bios(struct stripe_head *sh, struct bio *bi,
+ sector_t first_logical_sector, sector_t last_sector,
+ int forwrite, int previous)
+{
+ int dd_idx;
+ int ret = 1;
+
+ spin_lock_irq(&sh->stripe_lock);
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+ struct r5dev *dev = &sh->dev[dd_idx];
+
+ clear_bit(R5_BioReady, &dev->flags);
+
+ if (dd_idx == sh->pd_idx)
+ continue;
+
+ if (dev->sector < first_logical_sector ||
+ dev->sector >= last_sector)
+ continue;
+
+ if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+ set_bit(R5_Overlap, &dev->flags);
+ ret = 0;
+ continue;
+ }
+
+ set_bit(R5_BioReady, &dev->flags);
+ }
+
+ if (!ret)
+ goto out;
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++)
+ if (test_bit(R5_BioReady, &sh->dev[dd_idx].flags))
+ __add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
+
+out:
+ spin_unlock_irq(&sh->stripe_lock);
+ return ret;
+}
+
static void end_reshape(struct r5conf *conf);
static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
@@ -5796,7 +5838,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
struct stripe_head *batch_last = NULL;
struct r5conf *conf = mddev->private;
int dd_idx;
- sector_t new_sector;
+ sector_t new_sector, disk_sector = MaxSector;
sector_t logical_sector, last_sector;
struct stripe_head *sh;
const int rw = bio_data_dir(bi);
@@ -5855,6 +5897,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
md_write_end(mddev);
return true;
}
+
md_account_bio(mddev, &bi);
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
@@ -5892,6 +5935,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
new_sector = raid5_compute_sector(conf, logical_sector,
previous,
&dd_idx, NULL);
+ if (disk_sector != MaxSector && new_sector <= disk_sector)
+ continue;
+
pr_debug("raid456: raid5_make_request, sector %llu logical %llu\n",
(unsigned long long)new_sector,
(unsigned long long)logical_sector);
@@ -5924,7 +5970,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}
if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+ !add_all_stripe_bios(sh, bi, logical_sector,
+ last_sector, rw, previous)) {
/*
* Stripe is busy expanding or add failed due to
* overlap. Flush everything and wait a while.
@@ -5934,6 +5981,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
goto schedule_and_retry;
}
+ disk_sector = new_sector;
+
if (stripe_can_batch(sh)) {
stripe_add_to_batch_list(conf, sh, batch_last);
if (batch_last)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 9e8486a9e445..6f1ef9c75504 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -308,6 +308,7 @@ enum r5dev_flags {
R5_Wantwrite,
R5_Overlap, /* There is a pending overlapping request
* on this block */
+ R5_BioReady, /* The current bio can be added to this disk */
R5_ReadNoMerge, /* prevent bio from merging in block-layer */
R5_ReadError, /* seen a read error here recently */
R5_ReWrite, /* have tried to over-write the readerror */
--
2.30.2
When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.
Instead of finding the previous stripe_head each time, keep a
reference to the previous stripe_head so that the contended hash lock
does not need to be taken another time.
The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 51 +++++++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 12 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f963ffb35484..b852b6439898 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -815,7 +815,8 @@ static bool stripe_can_batch(struct stripe_head *sh)
}
/* we only do back search */
-static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
+static void stripe_add_to_batch_list(struct r5conf *conf,
+ struct stripe_head *sh, struct stripe_head *last_sh)
{
struct stripe_head *head;
sector_t head_sector, tmp_sec;
@@ -828,15 +829,20 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
return;
head_sector = sh->sector - RAID5_STRIPE_SECTORS(conf);
- hash = stripe_hash_locks_hash(conf, head_sector);
- spin_lock_irq(conf->hash_locks + hash);
- head = __find_stripe(conf, head_sector, conf->generation, hash);
- spin_unlock_irq(conf->hash_locks + hash);
-
- if (!head)
- return;
- if (!stripe_can_batch(head))
- goto out;
+ if (last_sh && head_sector == last_sh->sector) {
+ head = last_sh;
+ atomic_inc(&head->count);
+ } else {
+ hash = stripe_hash_locks_hash(conf, head_sector);
+ spin_lock_irq(conf->hash_locks + hash);
+ head = __find_stripe(conf, head_sector, conf->generation,
+ hash);
+ spin_unlock_irq(conf->hash_locks + hash);
+ if (!head)
+ return;
+ if (!stripe_can_batch(head))
+ goto out;
+ }
lock_two_stripes(head, sh);
/* clear_batch_ready clear the flag */
@@ -5735,6 +5741,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
{
+ struct stripe_head *batch_last = NULL;
struct r5conf *conf = mddev->private;
int dd_idx;
sector_t new_sector;
@@ -5885,8 +5892,13 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
goto schedule_and_retry;
}
- if (stripe_can_batch(sh))
- stripe_add_to_batch_list(conf, sh);
+ if (stripe_can_batch(sh)) {
+ stripe_add_to_batch_list(conf, sh, batch_last);
+ if (batch_last)
+ raid5_release_stripe(batch_last);
+ atomic_inc(&sh->count);
+ batch_last = sh;
+ }
if (do_flush) {
set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
@@ -5905,6 +5917,18 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
continue;
schedule_and_retry:
+ /*
+ * Must release the reference to batch_last before
+ * scheduling and waiting for work to be done, otherwise
+ * the batch_last stripe head could prevent
+ * raid5_activate_delayed() from making progress
+ * and thus deadlocking.
+ */
+ if (batch_last) {
+ raid5_release_stripe(batch_last);
+ batch_last = NULL;
+ }
+
schedule();
prepare_to_wait(&conf->wait_for_overlap, &w,
TASK_UNINTERRUPTIBLE);
@@ -5912,6 +5936,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}
finish_wait(&conf->wait_for_overlap, &w);
+ if (batch_last)
+ raid5_release_stripe(batch_last);
+
if (rw == WRITE)
md_write_end(mddev);
bio_endio(bi);
--
2.30.2
When testing if a previous stripe has had reshape expand past it, use
the earliest or latest logical sector in all the disks for that stripe
head. This will allow adding multiple disks at a time in a subsequent
patch.
To make this cleaner, refactor the check into a helper function called
stripe_ahead_of_reshape().
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 47 +++++++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 15 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 52227dd91e89..1ddce09970fa 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5764,6 +5764,33 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
bio_endio(bi);
}
+static bool stripe_ahead_of_reshape(struct mddev *mddev, struct r5conf *conf,
+ struct stripe_head *sh)
+{
+ sector_t max_sector = 0, min_sector = MaxSector;
+ int dd_idx, ret = 0;
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+ if (dd_idx == sh->pd_idx)
+ continue;
+
+ min_sector = min(min_sector, sh->dev[dd_idx].sector);
+ max_sector = max(max_sector, sh->dev[dd_idx].sector);
+ }
+
+ spin_lock_irq(&conf->device_lock);
+
+ if (mddev->reshape_backwards
+ ? max_sector >= conf->reshape_progress
+ : min_sector < conf->reshape_progress)
+ /* mismatch, need to try again */
+ ret = 1;
+
+ spin_unlock_irq(&conf->device_lock);
+
+ return ret;
+}
+
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
{
struct stripe_head *batch_last = NULL;
@@ -5877,28 +5904,18 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
break;
}
- if (unlikely(previous)) {
+ if (unlikely(previous) &&
+ stripe_ahead_of_reshape(mddev, conf, sh)) {
/*
- * Expansion might have moved on while waiting for a
- * stripe, so we must do the range check again.
+ * Expansion moved on while waiting for a stripe.
* Expansion could still move past after this
* test, but as we are holding a reference to
* 'sh', we know that if that happens,
* STRIPE_EXPANDING will get set and the expansion
* won't proceed until we finish with the stripe.
*/
- int must_retry = 0;
- spin_lock_irq(&conf->device_lock);
- if (mddev->reshape_backwards
- ? logical_sector >= conf->reshape_progress
- : logical_sector < conf->reshape_progress)
- /* mismatch, need to try again */
- must_retry = 1;
- spin_unlock_irq(&conf->device_lock);
- if (must_retry) {
- raid5_release_stripe(sh);
- goto schedule_and_retry;
- }
+ raid5_release_stripe(sh);
+ goto schedule_and_retry;
}
if (read_seqcount_retry(&conf->gen_lock, seq)) {
/* Might have got the wrong stripe_head by accident */
--
2.30.2
On Thu, Apr 07, 2022 at 10:45:04AM -0600, Logan Gunthorpe wrote:
> + spin_lock_irq(&conf->device_lock);
> + if (mddev->reshape_backwards
> + ? logical_sector >= conf->reshape_progress
> + : logical_sector < conf->reshape_progress)
Normal kernel style is to have the operators at the end of the previous
line (and the whole reshape_backwards handling is ripe for a bunch
of well-contained helpers).
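Just to illustrate what I mean, something like the below (the name and
exact shape are only a sketch, not something from this series):

	/*
	 * Hypothetical helper: wraps the check quoted above 1:1; the
	 * caller still holds device_lock.
	 */
	static bool stripe_needs_reshape_retry(struct mddev *mddev,
					       struct r5conf *conf,
					       sector_t logical_sector)
	{
		return mddev->reshape_backwards ?
			logical_sector >= conf->reshape_progress :
			logical_sector < conf->reshape_progress;
	}

so the condition at the call site reads as a single line and the
operator placement question goes away.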
Otherwise looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
On Thu, Apr 07, 2022 at 10:45:05AM -0600, Logan Gunthorpe wrote:
> The call to add_stripe_bio() in retry_aligned_read() is for a read only
> and thus wouldn't have added the batch anyway.
This sentence is missing an object.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
On Thu, Apr 07, 2022 at 10:45:06AM -0600, Logan Gunthorpe wrote:
> static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
> - short generation)
> + short generation, int hash)
> {
> + int inc_empty_inactive_list_flag;
> struct stripe_head *sh;
>
> pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
> hlist_for_each_entry(sh, stripe_hash(conf, sector), hash)
> if (sh->sector == sector && sh->generation == generation)
> - return sh;
> + goto found;
> pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
> return NULL;
> +
> +found:
> + if (!atomic_inc_not_zero(&sh->count)) {
There is a war on using list iterators outside the loop body raging
right now. So maybe just leave __find_stripe() as-is and add a new
find_get_stripe() that wraps it. And then just return early when the
atomic_inc_not_zero() does not succeed and save one level of
indentation.
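Something like this, maybe (untested sketch only; the find_get_stripe()
name is made up and the slow path below is just the code this patch
already moves):

	static struct stripe_head *find_get_stripe(struct r5conf *conf,
			sector_t sector, short generation, int hash)
	{
		int inc_empty_inactive_list_flag;
		struct stripe_head *sh;

		sh = __find_stripe(conf, sector, generation);
		if (!sh)
			return NULL;

		/* Fast path: got a reference, nothing else to do. */
		if (atomic_inc_not_zero(&sh->count))
			return sh;

		/*
		 * Slow path: the stripe sits on an inactive list, so take
		 * device_lock and pull it off, same as the existing code.
		 */
		spin_lock(&conf->device_lock);
		if (!atomic_read(&sh->count)) {
			if (!test_bit(STRIPE_HANDLE, &sh->state))
				atomic_inc(&conf->active_stripes);
			BUG_ON(list_empty(&sh->lru) &&
			       !test_bit(STRIPE_EXPANDING, &sh->state));
			inc_empty_inactive_list_flag = 0;
			if (!list_empty(conf->inactive_list + hash))
				inc_empty_inactive_list_flag = 1;
			list_del_init(&sh->lru);
			if (list_empty(conf->inactive_list + hash) &&
			    inc_empty_inactive_list_flag)
				atomic_inc(&conf->empty_inactive_list_nr);
			if (sh->group) {
				sh->group->stripes_cnt--;
				sh->group = NULL;
			}
		}
		atomic_inc(&sh->count);
		spin_unlock(&conf->device_lock);

		return sh;
	}

That keeps the hlist walk in __find_stripe() trivial and the refcount
dance in one flat function.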
On Thu, Apr 07, 2022 at 10:45:07AM -0600, Logan Gunthorpe wrote:
> Cleanup the code to make a common label for the schedule,
> prepare_to_wait() and retry path. This drops the do_prepare boolean.
>
> This requires moving the prepare_to_wait() above the
> read_seqcount_begin() call on the retry path. However, there's no
> apparent requirement for ordering between these two calls.
>
> This should hopefully be easier to read than following the extra
> do_prepare boolean, but it will also be used in a subsequent patch to
> add more code common to all schedule() calls.
Hm. Maybe it is marginally better, but I always hate gotos inside
of loop bodies. What prevents us from factoring most of the loop
body into a helper function?
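E.g. something along these lines (made-up names, just a sketch of the
shape, not tested):

	enum stripe_result {
		STRIPE_SUCCESS = 0,
		STRIPE_RETRY,
		STRIPE_SCHEDULE_AND_RETRY,
		STRIPE_FAIL,
	};

	/*
	 * Hypothetical helper that takes over everything the loop body
	 * currently does for one stripe, from read_seqcount_begin() to
	 * release_stripe_plug(), and reports back what to do next.
	 */
	static enum stripe_result make_stripe_request(struct mddev *mddev,
			struct r5conf *conf, struct bio *bi,
			sector_t logical_sector);

and then the loop in raid5_make_request() reduces to something like:

	for (; logical_sector < last_sector;
	     logical_sector += RAID5_STRIPE_SECTORS(conf)) {
		enum stripe_result res;

		do {
			res = make_stripe_request(mddev, conf, bi,
						  logical_sector);
			if (res == STRIPE_SCHEDULE_AND_RETRY) {
				schedule();
				prepare_to_wait(&conf->wait_for_overlap,
						&w, TASK_UNINTERRUPTIBLE);
			}
		} while (res == STRIPE_RETRY ||
			 res == STRIPE_SCHEDULE_AND_RETRY);

		if (res == STRIPE_FAIL)
			break;
	}

with no labels or gotos in the loop body at all.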
On Thu, Apr 07, 2022 at 10:45:09AM -0600, Logan Gunthorpe wrote:
> +static int stripe_bio_overlaps(struct stripe_head *sh, struct bio *bi,
> + int dd_idx, int forwrite)
This might be a good time to switch the int used as boolean return
value to an actual bool.
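I.e. just (sketch):

	static bool stripe_bio_overlaps(struct stripe_head *sh, struct bio *bi,
					int dd_idx, int forwrite);

with the 'return 1'/'return 0' statements becoming 'return true' and
'return false'.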
> + pr_debug("checking bi b#%llu to stripe s#%llu\n",
> + (unsigned long long)bi->bi_iter.bi_sector,
> + (unsigned long long)sh->sector);
sector_t has been a u64 for years now, so these casts are not
needed.
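I.e. this could simply be (assuming nothing else in the format string
changes):

	pr_debug("checking bi b#%llu to stripe s#%llu\n",
		 bi->bi_iter.bi_sector, sh->sector);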