2022-06-16 19:31:13

by Logan Gunthorpe

Subject: [PATCH v3 00/15] Improve Raid5 Lock Contention

Hi,

Now that I've done some cleanup of the mdadm testing infrastructure
as well as a lot of long run testing and bug fixes I'm much more
confident in the correctness of this series. The previous posting is
at [1].

Patch 13 has been completely reworked from the previous series (where
much of the feedback was focused). Rare bugs were found with the
original method: if the array changed shape at just the right time
while first_wrap was true, the algorithm would fail and blocks would
not be written correctly. To fix this, the new version uses a bitmap
to track which pages have been added to the stripe_head. This requires
limiting the size of the request, but to a size greater than the
current limit (which is based on the number of segments). I've also
included another patch to remove the limit on the number of segments
(since it is not needed); the new limit on the number of sectors is
higher and results in less bio splitting and fewer bios that are
unaligned with the chunk size.

--

I've been doing some work trying to improve the bulk write performance
of raid5 on large systems with fast NVMe drives. The bottleneck appears
largely to be lock contention on the hash_lock and device_lock. This
series improves the situation slightly by addressing a couple of
low-hanging-fruit opportunities to take the locks fewer times in the
request path.

Patch 10 adjusts how batching works by keeping a reference to the
previous stripe_head in raid5_make_request(). Under most situations,
this removes the need to take the hash_lock in stripe_add_to_batch_list(),
which should reduce the number of times the lock is taken by a factor
of about 2.
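
A rough standalone sketch of the caching idea (plain C; the lookup and
locking below are hypothetical stand-ins for illustration, not the md
API):

#include <stdio.h>

struct stripe { unsigned long long sector; };

/* Every call models one hash_lock acquisition. */
static struct stripe *find_stripe_locked(unsigned long long sector,
					 unsigned long *locks)
{
	static struct stripe s;

	(*locks)++;
	s.sector = sector;
	return &s;
}

int main(void)
{
	struct stripe *batch_last = NULL;
	unsigned long locks = 0;

	for (unsigned long long sector = 0; sector < 16; sector++) {
		/* First lookup: getting the active stripe (unavoidable). */
		struct stripe *sh = find_stripe_locked(sector, &locks);

		/*
		 * A second locked lookup used to be needed to find a
		 * batch head; with a cached reference to the previous
		 * stripe it can be skipped on every later iteration.
		 */
		if (!batch_last)
			batch_last = find_stripe_locked(sector, &locks);
		else
			batch_last = sh;
	}
	printf("hash_lock taken %lu times for 16 stripes "
	       "(32 without the cached reference)\n", locks);
	return 0;
}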

Patch 13 pivots the way raid5_make_request() works. Before the patch,
the code must find the stripe_head for every 4KB page in the request,
so each stripe_head must be found once for every data disk. The patch
changes this so that all the data disks can be added to a stripe_head
at once and the number of times the stripe_head must be found (and
thus the number of times the hash_lock is taken) should be reduced by
a factor roughly equal to the number of data disks.
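
A back-of-the-envelope model of the effect (standalone C; the disk and
stripe counts are invented for illustration):

#include <stdio.h>

int main(void)
{
	int data_disks = 8;	/* e.g. a 9-drive raid5 */
	int stripes = 256;	/* stripes covered by one large request */

	/* Before: one locked stripe_head lookup per 4KB data page. */
	int before = stripes * data_disks;
	/* After: one locked lookup per stripe_head. */
	int after = stripes;

	printf("lookups: %d -> %d (factor of %d)\n",
	       before, after, data_disks);
	return 0;
}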

Patch 15 increases the block layer limit on IO size (the maximum
number of segments per request) to reduce the amount of bio splitting,
which in turn reduces the number of batches broken by unnecessary
splitting of large IOs.

I've also included Patch 14, which changes some debug prints to make
debugging a bit easier.

The remaining patches are just cleanup and prep patches for those two
patches.

In apples-to-apples testing of this series on a small VM with 5 ram
disks, I saw a bandwidth increase of roughly 14%, and lock contention
on the hash_lock (as reported by lockstat) was reduced by more than a
factor of 5 (though the lock is still significantly contended).

Testing on larger systems with NVMe drives saw similar bandwidth
increases, from 3% to 20% depending on the parameters. Oddly, small
arrays had larger gains, likely because they have lower starting
bandwidths; I would have expected larger gains with larger arrays,
since even fewer locks should have been taken in raid5_make_request().

This series is based on the current md/md-next (facef3b96c5b9565). A git
branch is available here:

https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v3

Logan

[1] https://lkml.kernel.org/r/[email protected]

--

Changes since v2:
- Rebased on current md-next branch (facef3b96c5b9565)
- Reworked Pivot patch with bitmap due to unfixable bug
- Changed to a ternary operator in ahead_of_reshape() helper (per Paul)
- Separated out the functional change from the non-functional change
  in the first patch (per Paul)
- Dropped an unnecessary hash argument in __find_stripe() (per
  Christoph)
- Fixed some minor commit message and comment errors
- Collected tags from Christoph and Guoqing

Changes since v1:
- Rebased on current md-next branch (190a901246c69d79)
- Added patch to create a helper for checking if a sector
is ahead of the reshape (per Christoph)
- Reworked the __find_stripe() patch to create a find_get_stripe()
helper (per Christoph)
- Added more patches to further refactor raid5_make_request() and
pull most of the loop body into a helper function (per Christoph)
- A few other minor cleanups (boolean return, dropping casting when
  printing sectors, commit message grammar) as suggested by Christoph.
- Fixed two uncommon but serious data corruption bugs that were found.

--

Logan Gunthorpe (15):
md/raid5: Make logic blocking check consistent with logic that blocks
md/raid5: Factor out ahead_of_reshape() function
md/raid5: Refactor raid5_make_request loop
md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
md/raid5: Move common stripe get code into new find_get_stripe()
helper
md/raid5: Factor out helper from raid5_make_request() loop
md/raid5: Drop the do_prepare flag in raid5_make_request()
md/raid5: Move read_seqcount_begin() into make_stripe_request()
md/raid5: Refactor for loop in raid5_make_request() into while loop
md/raid5: Keep a reference to last stripe_head for batch
md/raid5: Refactor add_stripe_bio()
md/raid5: Check all disks in a stripe_head for reshape progress
md/raid5: Pivot raid5_make_request()
md/raid5: Improve debug prints
md/raid5: Increase restriction on max segments per request

drivers/md/raid5.c | 641 +++++++++++++++++++++++++++++----------------
1 file changed, 418 insertions(+), 223 deletions(-)


base-commit: facef3b96c5b9565fa0416d7701ef990ef96e5a6
--
2.30.2


2022-06-16 19:32:04

by Logan Gunthorpe

Subject: [PATCH v3 13/15] md/raid5: Pivot raid5_make_request()

raid5_make_request() loops through every page in the request,
finds the appropriate stripe and adds the bio for that page to the
corresponding disk.

This causes a great deal of contention on the hash_lock and extra
work, since each stripe must be found once for every data disk.

The number of times a stripe must be found can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the hash lock must be taken
by a factor equal to the number of data disks.

To accomplish this, the logical sectors that have already been added
must be tracked. Tracking them is done with a bitmap: the bits
for all pages are set at the start of the request and each bit
is cleared once the bio is added to a stripe.

Finding the next sector to be done is then just a call to
find_first_bit() so that sectors that have been done can simply be
skipped.
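
A standalone sketch of the bookkeeping (plain C, with hand-rolled bit
helpers standing in for the kernel's bitmap API; the sizing is assumed
to match the patch):

#include <stdio.h>
#include <string.h>

#define MAX_STRIPES	256
#define LONG_BITS	(8 * sizeof(unsigned long))

static void bitmap_set_first(unsigned long *map, int n)
{
	for (int i = 0; i < n; i++)
		map[i / LONG_BITS] |= 1UL << (i % LONG_BITS);
}

static int bitmap_first_bit(const unsigned long *map)
{
	for (int i = 0; i < MAX_STRIPES; i++)
		if (map[i / LONG_BITS] & (1UL << (i % LONG_BITS)))
			return i;
	return MAX_STRIPES;	/* find_first_bit()'s "none left" result */
}

int main(void)
{
	unsigned long sectors_to_do[MAX_STRIPES / LONG_BITS];
	int s;

	memset(sectors_to_do, 0, sizeof(sectors_to_do));
	bitmap_set_first(sectors_to_do, 10);	/* request spans 10 stripes */

	while ((s = bitmap_first_bit(sectors_to_do)) != MAX_STRIPES) {
		/* "add the bio to stripe s", then clear its bit so a
		 * later pass (e.g. after a retry) simply skips it */
		sectors_to_do[s / LONG_BITS] &= ~(1UL << (s % LONG_BITS));
		printf("handled stripe %d\n", s);
	}
	return 0;
}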

One minor downside is that the maximum sectors for a request must be
limited so that the bitmap can be appropriately sized on the stack.
This limit is arbitrarily chosen to be 256 stripe pages, which works
out to 1MB (256 * 4KB) if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This
doesn't actually restrict the maximum request size any further, since
the default block queue settings restrict the number of segments to
128 (which results in request sizes of approximately 512KB).
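
For a sense of the stack cost this bounds (standalone C mirroring how
DECLARE_BITMAP sizes the array; 64-bit longs assumed):

#include <stdio.h>

#define RAID5_MAX_REQ_STRIPES	256
#define LONG_BITS		(8 * sizeof(unsigned long))

int main(void)
{
	/* DECLARE_BITMAP(sectors_to_do, 256) boils down to this array. */
	unsigned long sectors_to_do[RAID5_MAX_REQ_STRIPES / LONG_BITS];

	/* 32 bytes on a 64-bit build: a trivial stack cost, which is
	 * why the limit can be a generous 256 stripe pages (1MB with
	 * 4KB pages). */
	printf("bitmap size: %zu bytes\n", sizeof(sectors_to_do));
	return 0;
}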

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 89 ++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 83 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b27b754ee18c..9e18dc9ae8b4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -61,6 +61,8 @@
#define cpu_to_group(cpu) cpu_to_node(cpu)
#define ANY_GROUP NUMA_NO_NODE

+#define RAID5_MAX_REQ_STRIPES 256
+
static bool devices_handle_discard_safely = false;
module_param(devices_handle_discard_safely, bool, 0644);
MODULE_PARM_DESC(devices_handle_discard_safely,
@@ -5862,10 +5864,69 @@ enum stripe_result {
struct stripe_request_ctx {
/* a reference to the last stripe_head for batching */
struct stripe_head *batch_last;
+
+ /* first sector in the request */
+ sector_t first_sector;
+
+ /* last sector in the request */
+ sector_t last_sector;
+
+ /* bitmap to track stripe sectors that have been added to stripes */
+ DECLARE_BITMAP(sectors_to_do, RAID5_MAX_REQ_STRIPES);
+
/* the request had REQ_PREFLUSH, cleared after the first stripe_head */
bool do_flush;
};

+static int add_all_stripe_bios(struct r5conf *conf,
+ struct stripe_request_ctx *ctx, struct stripe_head *sh,
+ struct bio *bi, int forwrite, int previous)
+{
+ int dd_idx;
+ int ret = 1;
+
+ spin_lock_irq(&sh->stripe_lock);
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+ struct r5dev *dev = &sh->dev[dd_idx];
+
+ if (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx)
+ continue;
+
+ if (dev->sector < ctx->first_sector ||
+ dev->sector >= ctx->last_sector)
+ continue;
+
+ if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+ set_bit(R5_Overlap, &dev->flags);
+ ret = 0;
+ continue;
+ }
+ }
+
+ if (!ret)
+ goto out;
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+ struct r5dev *dev = &sh->dev[dd_idx];
+
+ if (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx)
+ continue;
+
+ if (dev->sector < ctx->first_sector ||
+ dev->sector >= ctx->last_sector)
+ continue;
+
+ __add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
+ clear_bit((dev->sector - ctx->first_sector) >>
+ RAID5_STRIPE_SHIFT(conf), ctx->sectors_to_do);
+ }
+
+out:
+ spin_unlock_irq(&sh->stripe_lock);
+ return ret;
+}
+
static enum stripe_result make_stripe_request(struct mddev *mddev,
struct r5conf *conf, struct stripe_request_ctx *ctx,
sector_t logical_sector, struct bio *bi)
@@ -5937,7 +5998,7 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
}

if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+ !add_all_stripe_bios(conf, ctx, sh, bi, rw, previous)) {
/*
* Stripe is busy expanding or add failed due to
* overlap. Flush everything and wait a while.
@@ -5979,11 +6040,12 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
{
struct r5conf *conf = mddev->private;
- sector_t logical_sector, last_sector;
+ sector_t logical_sector;
struct stripe_request_ctx ctx = {};
const int rw = bio_data_dir(bi);
enum stripe_result res;
DEFINE_WAIT(w);
+ int s;

if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
int ret = log_handle_flush_request(conf, bi);
@@ -6023,9 +6085,14 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}

logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
- last_sector = bio_end_sector(bi);
+ ctx.first_sector = logical_sector;
+ ctx.last_sector = bio_end_sector(bi);
bi->bi_next = NULL;

+ bitmap_set(ctx.sectors_to_do, 0,
+ DIV_ROUND_UP(ctx.last_sector - logical_sector,
+ RAID5_STRIPE_SECTORS(conf)));
+
/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
if ((bi->bi_opf & REQ_NOWAIT) &&
(conf->reshape_progress != MaxSector) &&
@@ -6038,7 +6105,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}
md_account_bio(mddev, &bi);
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
- while (logical_sector < last_sector) {
+ while (1) {
res = make_stripe_request(mddev, conf, &ctx, logical_sector,
bi);
if (res == STRIPE_FAIL)
@@ -6066,7 +6133,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
continue;
}

- logical_sector += RAID5_STRIPE_SECTORS(conf);
+ s = find_first_bit(ctx.sectors_to_do, RAID5_MAX_REQ_STRIPES);
+ if (s == RAID5_MAX_REQ_STRIPES)
+ break;
+
+ logical_sector = ctx.first_sector +
+ (s << RAID5_STRIPE_SHIFT(conf));
}

finish_wait(&conf->wait_for_overlap, &w);
@@ -7923,7 +7995,12 @@ static int raid5_run(struct mddev *mddev)
mddev->queue->limits.discard_granularity < stripe)
blk_queue_max_discard_sectors(mddev->queue, 0);

- blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);
+ /*
+ * Requests require having a bitmap for each stripe.
+ * Limit the max sectors based on this.
+ */
+ blk_queue_max_hw_sectors(mddev->queue,
+ RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf));
}

if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
--
2.30.2

2022-06-16 19:32:05

by Logan Gunthorpe

Subject: [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request

The block layer defaults the maximum number of segments to 128, which
means requests tend to get split at around 512KB (128 segments * 4KB
pages), depending on how many pages can be merged. There's no such
restriction in the raid5 code, so increase the limit to USHRT_MAX so
that larger requests can be sent as one.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e48c16bfe657..5723a497108a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8005,6 +8005,9 @@ static int raid5_run(struct mddev *mddev)
*/
blk_queue_max_hw_sectors(mddev->queue,
RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf));
+
+ /* No restrictions on the number of segments in the request */
+ blk_queue_max_segments(mddev->queue, USHRT_MAX);
}

if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
--
2.30.2

2022-06-16 19:32:46

by Logan Gunthorpe

Subject: [PATCH v3 02/15] md/raid5: Factor out ahead_of_reshape() function

There are a few uses of an ugly ternary operator in raid5_make_request()
to check whether a sector is ahead of the reshape position.

Factor this out into a simple helper called ahead_of_reshape().

No functional changes intended.
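
The helper's semantics can be sanity-checked in isolation (standalone
C; mddev is replaced by a bare bool for the sketch):

#include <assert.h>
#include <stdbool.h>

typedef unsigned long long sector_t;

static bool ahead_of_reshape(bool reshape_backwards, sector_t sector,
			     sector_t reshape_sector)
{
	return reshape_backwards ? sector < reshape_sector :
		sector >= reshape_sector;
}

int main(void)
{
	/* Forward reshape: sectors at or past progress are "ahead"
	 * (not yet reshaped, still in the old geometry). */
	assert(ahead_of_reshape(false, 100, 100));
	assert(!ahead_of_reshape(false, 99, 100));

	/* Backward reshape: progress moves down, so the not-yet-reshaped
	 * region is below it. */
	assert(ahead_of_reshape(true, 99, 100));
	assert(!ahead_of_reshape(true, 100, 100));
	return 0;
}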

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b3d1f894f154..6e53b8490fff 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5784,6 +5784,13 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
bio_endio(bi);
}

+static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
+ sector_t reshape_sector)
+{
+ return mddev->reshape_backwards ? sector < reshape_sector :
+ sector >= reshape_sector;
+}
+
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
{
struct r5conf *conf = mddev->private;
@@ -5840,9 +5847,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
if ((bi->bi_opf & REQ_NOWAIT) &&
(conf->reshape_progress != MaxSector) &&
- (mddev->reshape_backwards
- ? (logical_sector >= conf->reshape_progress && logical_sector < conf->reshape_safe)
- : (logical_sector >= conf->reshape_safe && logical_sector < conf->reshape_progress))) {
+ !ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) &&
+ ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) {
bio_wouldblock_error(bi);
if (rw == WRITE)
md_write_end(mddev);
@@ -5871,14 +5877,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
* to check again.
*/
spin_lock_irq(&conf->device_lock);
- if (mddev->reshape_backwards
- ? logical_sector < conf->reshape_progress
- : logical_sector >= conf->reshape_progress) {
+ if (ahead_of_reshape(mddev, logical_sector,
+ conf->reshape_progress)) {
previous = 1;
} else {
- if (mddev->reshape_backwards
- ? logical_sector < conf->reshape_safe
- : logical_sector >= conf->reshape_safe) {
+ if (ahead_of_reshape(mddev, logical_sector,
+ conf->reshape_safe)) {
spin_unlock_irq(&conf->device_lock);
schedule();
do_prepare = true;
@@ -5909,9 +5913,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
*/
int must_retry = 0;
spin_lock_irq(&conf->device_lock);
- if (mddev->reshape_backwards
- ? logical_sector >= conf->reshape_progress
- : logical_sector < conf->reshape_progress)
+ if (!ahead_of_reshape(mddev, logical_sector,
+ conf->reshape_progress))
/* mismatch, need to try again */
must_retry = 1;
spin_unlock_irq(&conf->device_lock);
--
2.30.2

2022-06-16 19:33:45

by Logan Gunthorpe

Subject: [PATCH v3 09/15] md/raid5: Refactor for loop in raid5_make_request() into while loop

The for loop with the retry label can be expressed more cleanly as a
while loop by moving the logical_sector increment into the success path.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 345350d34623..17ddaa41147c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5974,22 +5974,23 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}
md_account_bio(mddev, &bi);
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
- for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
- retry:
+ while (logical_sector < last_sector) {
res = make_stripe_request(mddev, conf, &ctx, logical_sector,
bi);
if (res == STRIPE_FAIL)
break;

if (res == STRIPE_RETRY)
- goto retry;
+ continue;

if (res == STRIPE_SCHEDULE_AND_RETRY) {
schedule();
prepare_to_wait(&conf->wait_for_overlap, &w,
TASK_UNINTERRUPTIBLE);
- goto retry;
+ continue;
}
+
+ logical_sector += RAID5_STRIPE_SECTORS(conf);
}

finish_wait(&conf->wait_for_overlap, &w);
--
2.30.2

2022-06-16 19:34:32

by Logan Gunthorpe

Subject: [PATCH v3 12/15] md/raid5: Check all disks in a stripe_head for reshape progress

When testing whether a previous stripe has had the reshape expand past
it, use the earliest and latest logical sectors across all the disks
for that stripe head. This will allow adding multiple disks at a time
in a subsequent patch.

To make this cleaner, refactor the check into a helper function called
stripe_ahead_of_reshape().
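
The range variant can be checked the same way (standalone C sketch;
the sector numbers are invented):

#include <assert.h>
#include <stdbool.h>

typedef unsigned long long sector_t;

static bool range_ahead_of_reshape(bool reshape_backwards, sector_t min,
				   sector_t max, sector_t reshape_sector)
{
	return reshape_backwards ? max < reshape_sector :
		min >= reshape_sector;
}

int main(void)
{
	/* A stripe_head whose disks span sectors [96, 104). */
	sector_t min = 96, max = 103;

	/* Forward reshape at 100: part of the range is still behind
	 * progress, so the whole stripe is not ahead. */
	assert(!range_ahead_of_reshape(false, min, max, 100));
	assert(range_ahead_of_reshape(false, min, max, 96));

	/* Backward reshape at 104: every sector is below progress. */
	assert(range_ahead_of_reshape(true, min, max, 104));
	return 0;
}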

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5.c | 53 ++++++++++++++++++++++++++++++++++------------
1 file changed, 39 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f12773c2387a..b27b754ee18c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5818,6 +5818,40 @@ static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
sector >= reshape_sector;
}

+static bool range_ahead_of_reshape(struct mddev *mddev, sector_t min,
+ sector_t max, sector_t reshape_sector)
+{
+ return mddev->reshape_backwards ? max < reshape_sector :
+ min >= reshape_sector;
+}
+
+static bool stripe_ahead_of_reshape(struct mddev *mddev, struct r5conf *conf,
+ struct stripe_head *sh)
+{
+ sector_t max_sector = 0, min_sector = MaxSector;
+ bool ret = false;
+ int dd_idx;
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+ if (dd_idx == sh->pd_idx)
+ continue;
+
+ min_sector = min(min_sector, sh->dev[dd_idx].sector);
+ max_sector = max(max_sector, sh->dev[dd_idx].sector);
+ }
+
+ spin_lock_irq(&conf->device_lock);
+
+ if (!range_ahead_of_reshape(mddev, min_sector, max_sector,
+ conf->reshape_progress))
+ /* mismatch, need to try again */
+ ret = true;
+
+ spin_unlock_irq(&conf->device_lock);
+
+ return ret;
+}
+
enum stripe_result {
STRIPE_SUCCESS = 0,
STRIPE_RETRY,
@@ -5882,27 +5916,18 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
return STRIPE_FAIL;
}

- if (unlikely(previous)) {
+ if (unlikely(previous) &&
+ stripe_ahead_of_reshape(mddev, conf, sh)) {
/*
- * Expansion might have moved on while waiting for a
- * stripe, so we must do the range check again.
+ * Expansion moved on while waiting for a stripe.
* Expansion could still move past after this
* test, but as we are holding a reference to
* 'sh', we know that if that happens,
* STRIPE_EXPANDING will get set and the expansion
* won't proceed until we finish with the stripe.
*/
- int must_retry = 0;
- spin_lock_irq(&conf->device_lock);
- if (!ahead_of_reshape(mddev, logical_sector,
- conf->reshape_progress))
- /* mismatch, need to try again */
- must_retry = 1;
- spin_unlock_irq(&conf->device_lock);
- if (must_retry) {
- ret = STRIPE_SCHEDULE_AND_RETRY;
- goto out_release;
- }
+ ret = STRIPE_SCHEDULE_AND_RETRY;
+ goto out_release;
}

if (read_seqcount_retry(&conf->gen_lock, seq)) {
--
2.30.2

2022-06-16 19:36:28

by Logan Gunthorpe

Subject: [PATCH v3 07/15] md/raid5: Drop the do_prepare flag in raid5_make_request()

prepare_to_wait() can reasonably be called after schedule() instead of
setting a flag and preparing in the next loop iteration.

This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason the order
matters here. On the first iteration of the loop, prepare_to_wait() is
already called first.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Guoqing Jiang <[email protected]>
---
drivers/md/raid5.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 26ef292842de..c58e70db204a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5918,7 +5918,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
const int rw = bio_data_dir(bi);
enum stripe_result res;
DEFINE_WAIT(w);
- bool do_prepare;

if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
int ret = log_handle_flush_request(conf, bi);
@@ -5976,12 +5975,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
int seq;

- do_prepare = false;
retry:
seq = read_seqcount_begin(&conf->gen_lock);
- if (do_prepare)
- prepare_to_wait(&conf->wait_for_overlap, &w,
- TASK_UNINTERRUPTIBLE);

res = make_stripe_request(mddev, conf, &ctx, logical_sector,
bi, seq);
@@ -5993,7 +5988,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)

if (res == STRIPE_SCHEDULE_AND_RETRY) {
schedule();
- do_prepare = true;
+ prepare_to_wait(&conf->wait_for_overlap, &w,
+ TASK_UNINTERRUPTIBLE);
goto retry;
}
}
--
2.30.2

2022-06-16 19:46:17

by Logan Gunthorpe

Subject: [PATCH v3 06/15] md/raid5: Factor out helper from raid5_make_request() loop

Factor out the inner loop of raid5_make_request() into its own helper
called make_stripe_request().

The helper returns one of a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.

A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.

No functional changes intended. Subsequent patches will clean this up
further, untangling the gen_lock and do_prepare logic.
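
The resulting control-flow shape, reduced to a standalone C sketch
(the status names mirror the patch; the per-stripe work is a stub):

#include <stdio.h>

enum stripe_result {
	STRIPE_SUCCESS = 0,
	STRIPE_RETRY,
	STRIPE_SCHEDULE_AND_RETRY,
	STRIPE_FAIL,
};

/* Stub for make_stripe_request(): every third unit asks for one
 * immediate retry before succeeding. */
static enum stripe_result make_one(int i)
{
	static int retried = -1;

	if (i % 3 == 0 && retried != i) {
		retried = i;
		return STRIPE_RETRY;
	}
	return STRIPE_SUCCESS;
}

int main(void)
{
	for (int i = 0; i < 6; ) {
		switch (make_one(i)) {
		case STRIPE_SUCCESS:
			i++;		/* advance only on success */
			break;
		case STRIPE_RETRY:
			continue;	/* retry the same unit at once */
		case STRIPE_SCHEDULE_AND_RETRY:
			/* the real code sleeps here, then retries */
			continue;
		case STRIPE_FAIL:
			return 1;
		}
	}
	printf("all units submitted\n");
	return 0;
}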

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 231 ++++++++++++++++++++++++++-------------------
1 file changed, 133 insertions(+), 98 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1bbf87d15bc8..26ef292842de 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5786,17 +5786,139 @@ static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
sector >= reshape_sector;
}

+enum stripe_result {
+ STRIPE_SUCCESS = 0,
+ STRIPE_RETRY,
+ STRIPE_SCHEDULE_AND_RETRY,
+ STRIPE_FAIL,
+};
+
+struct stripe_request_ctx {
+ /* the request had REQ_PREFLUSH, cleared after the first stripe_head */
+ bool do_flush;
+};
+
+static enum stripe_result make_stripe_request(struct mddev *mddev,
+ struct r5conf *conf, struct stripe_request_ctx *ctx,
+ sector_t logical_sector, struct bio *bi, int seq)
+{
+ const int rw = bio_data_dir(bi);
+ enum stripe_result ret;
+ struct stripe_head *sh;
+ sector_t new_sector;
+ int previous = 0;
+ int dd_idx;
+
+ if (unlikely(conf->reshape_progress != MaxSector)) {
+ /*
+ * Spinlock is needed as reshape_progress may be
+ * 64bit on a 32bit platform, and so it might be
+ * possible to see a half-updated value
+ * Of course reshape_progress could change after
+ * the lock is dropped, so once we get a reference
+ * to the stripe that we think it is, we will have
+ * to check again.
+ */
+ spin_lock_irq(&conf->device_lock);
+ if (ahead_of_reshape(mddev, logical_sector,
+ conf->reshape_progress)) {
+ previous = 1;
+ } else {
+ if (ahead_of_reshape(mddev, logical_sector,
+ conf->reshape_safe)) {
+ spin_unlock_irq(&conf->device_lock);
+ return STRIPE_SCHEDULE_AND_RETRY;
+ }
+ }
+ spin_unlock_irq(&conf->device_lock);
+ }
+
+ new_sector = raid5_compute_sector(conf, logical_sector, previous,
+ &dd_idx, NULL);
+ pr_debug("raid456: %s, sector %llu logical %llu\n", __func__,
+ new_sector, logical_sector);
+
+ sh = raid5_get_active_stripe(conf, new_sector, previous,
+ (bi->bi_opf & REQ_RAHEAD), 0);
+ if (unlikely(!sh)) {
+ /* cannot get stripe, just give-up */
+ bi->bi_status = BLK_STS_IOERR;
+ return STRIPE_FAIL;
+ }
+
+ if (unlikely(previous)) {
+ /*
+ * Expansion might have moved on while waiting for a
+ * stripe, so we must do the range check again.
+ * Expansion could still move past after this
+ * test, but as we are holding a reference to
+ * 'sh', we know that if that happens,
+ * STRIPE_EXPANDING will get set and the expansion
+ * won't proceed until we finish with the stripe.
+ */
+ int must_retry = 0;
+ spin_lock_irq(&conf->device_lock);
+ if (!ahead_of_reshape(mddev, logical_sector,
+ conf->reshape_progress))
+ /* mismatch, need to try again */
+ must_retry = 1;
+ spin_unlock_irq(&conf->device_lock);
+ if (must_retry) {
+ ret = STRIPE_SCHEDULE_AND_RETRY;
+ goto out_release;
+ }
+ }
+
+ if (read_seqcount_retry(&conf->gen_lock, seq)) {
+ /* Might have got the wrong stripe_head by accident */
+ ret = STRIPE_RETRY;
+ goto out_release;
+ }
+
+ if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+ !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+ /*
+ * Stripe is busy expanding or add failed due to
+ * overlap. Flush everything and wait a while.
+ */
+ md_wakeup_thread(mddev->thread);
+ ret = STRIPE_SCHEDULE_AND_RETRY;
+ goto out_release;
+ }
+
+ if (stripe_can_batch(sh))
+ stripe_add_to_batch_list(conf, sh);
+
+ if (ctx->do_flush) {
+ set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+ /* we only need flush for one stripe */
+ ctx->do_flush = false;
+ }
+
+ set_bit(STRIPE_HANDLE, &sh->state);
+ clear_bit(STRIPE_DELAYED, &sh->state);
+ if ((!sh->batch_head || sh == sh->batch_head) &&
+ (bi->bi_opf & REQ_SYNC) &&
+ !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ atomic_inc(&conf->preread_active_stripes);
+
+ release_stripe_plug(mddev, sh);
+ return STRIPE_SUCCESS;
+
+out_release:
+ raid5_release_stripe(sh);
+ return ret;
+}
+
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
{
struct r5conf *conf = mddev->private;
- int dd_idx;
- sector_t new_sector;
sector_t logical_sector, last_sector;
- struct stripe_head *sh;
+ struct stripe_request_ctx ctx = {};
const int rw = bio_data_dir(bi);
+ enum stripe_result res;
DEFINE_WAIT(w);
bool do_prepare;
- bool do_flush = false;

if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
int ret = log_handle_flush_request(conf, bi);
@@ -5812,7 +5934,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
* if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
* we need to flush journal device
*/
- do_flush = bi->bi_opf & REQ_PREFLUSH;
+ ctx.do_flush = bi->bi_opf & REQ_PREFLUSH;
}

if (!md_write_start(mddev, bi))
@@ -5852,117 +5974,30 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
md_account_bio(mddev, &bi);
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
- int previous;
int seq;

do_prepare = false;
retry:
seq = read_seqcount_begin(&conf->gen_lock);
- previous = 0;
if (do_prepare)
prepare_to_wait(&conf->wait_for_overlap, &w,
TASK_UNINTERRUPTIBLE);
- if (unlikely(conf->reshape_progress != MaxSector)) {
- /* spinlock is needed as reshape_progress may be
- * 64bit on a 32bit platform, and so it might be
- * possible to see a half-updated value
- * Of course reshape_progress could change after
- * the lock is dropped, so once we get a reference
- * to the stripe that we think it is, we will have
- * to check again.
- */
- spin_lock_irq(&conf->device_lock);
- if (ahead_of_reshape(mddev, logical_sector,
- conf->reshape_progress)) {
- previous = 1;
- } else {
- if (ahead_of_reshape(mddev, logical_sector,
- conf->reshape_safe)) {
- spin_unlock_irq(&conf->device_lock);
- schedule();
- do_prepare = true;
- goto retry;
- }
- }
- spin_unlock_irq(&conf->device_lock);
- }
-
- new_sector = raid5_compute_sector(conf, logical_sector,
- previous,
- &dd_idx, NULL);
- pr_debug("raid456: raid5_make_request, sector %llu logical %llu\n",
- (unsigned long long)new_sector,
- (unsigned long long)logical_sector);

- sh = raid5_get_active_stripe(conf, new_sector, previous,
- (bi->bi_opf & REQ_RAHEAD), 0);
- if (unlikely(!sh)) {
- /* cannot get stripe, just give-up */
- bi->bi_status = BLK_STS_IOERR;
+ res = make_stripe_request(mddev, conf, &ctx, logical_sector,
+ bi, seq);
+ if (res == STRIPE_FAIL)
break;
- }
-
- if (unlikely(previous)) {
- /* expansion might have moved on while waiting for a
- * stripe, so we must do the range check again.
- * Expansion could still move past after this
- * test, but as we are holding a reference to
- * 'sh', we know that if that happens,
- * STRIPE_EXPANDING will get set and the expansion
- * won't proceed until we finish with the stripe.
- */
- int must_retry = 0;
- spin_lock_irq(&conf->device_lock);
- if (!ahead_of_reshape(mddev, logical_sector,
- conf->reshape_progress))
- /* mismatch, need to try again */
- must_retry = 1;
- spin_unlock_irq(&conf->device_lock);
- if (must_retry) {
- raid5_release_stripe(sh);
- schedule();
- do_prepare = true;
- goto retry;
- }
- }

- if (read_seqcount_retry(&conf->gen_lock, seq)) {
- /* Might have got the wrong stripe_head by accident */
- raid5_release_stripe(sh);
+ if (res == STRIPE_RETRY)
goto retry;
- }

- if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
- /*
- * Stripe is busy expanding or add failed due to
- * overlap. Flush everything and wait a while.
- */
- md_wakeup_thread(mddev->thread);
- raid5_release_stripe(sh);
+ if (res == STRIPE_SCHEDULE_AND_RETRY) {
schedule();
do_prepare = true;
goto retry;
}
-
- if (stripe_can_batch(sh))
- stripe_add_to_batch_list(conf, sh);
-
- if (do_flush) {
- set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
- /* we only need flush for one stripe */
- do_flush = false;
- }
-
- set_bit(STRIPE_HANDLE, &sh->state);
- clear_bit(STRIPE_DELAYED, &sh->state);
- if ((!sh->batch_head || sh == sh->batch_head) &&
- (bi->bi_opf & REQ_SYNC) &&
- !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- atomic_inc(&conf->preread_active_stripes);
-
- release_stripe_plug(mddev, sh);
}
+
finish_wait(&conf->wait_for_overlap, &w);

if (rw == WRITE)
--
2.30.2

2022-06-16 19:47:53

by Logan Gunthorpe

Subject: [PATCH v3 01/15] md/raid5: Make logic blocking check consistent with logic that blocks

The check in raid5_make_request() differs very slightly from the logic
that causes it to block lower down. This likely does not cause a bug,
as the check is fuzzy anyway (reshape may move on between the first
check and the subsequent one). However, make it consistent so it can
be cleaned up in a subsequent patch.

The condition which causes the schedule is:

!(mddev->reshape_backwards ? logical_sector < conf->reshape_progress :
logical_sector >= conf->reshape_progress) &&
(mddev->reshape_backwards ? logical_sector < conf->reshape_safe :
logical_sector >= conf->reshape_safe)

The condition that causes the early bailout is made to match this.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5d84bad8b854..b3d1f894f154 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5841,7 +5841,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
if ((bi->bi_opf & REQ_NOWAIT) &&
(conf->reshape_progress != MaxSector) &&
(mddev->reshape_backwards
- ? (logical_sector > conf->reshape_progress && logical_sector <= conf->reshape_safe)
+ ? (logical_sector >= conf->reshape_progress && logical_sector < conf->reshape_safe)
: (logical_sector >= conf->reshape_safe && logical_sector < conf->reshape_progress))) {
bio_wouldblock_error(bi);
if (rw == WRITE)
--
2.30.2

2022-06-16 19:55:11

by Logan Gunthorpe

Subject: [PATCH v3 03/15] md/raid5: Refactor raid5_make_request loop

Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().

This makes the code easier to read and reduces the indentation level.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Guoqing Jiang <[email protected]>
---
drivers/md/raid5.c | 109 +++++++++++++++++++++++----------------------
1 file changed, 55 insertions(+), 54 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6e53b8490fff..25db747c5856 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5901,68 +5901,69 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)

sh = raid5_get_active_stripe(conf, new_sector, previous,
(bi->bi_opf & REQ_RAHEAD), 0);
- if (sh) {
- if (unlikely(previous)) {
- /* expansion might have moved on while waiting for a
- * stripe, so we must do the range check again.
- * Expansion could still move past after this
- * test, but as we are holding a reference to
- * 'sh', we know that if that happens,
- * STRIPE_EXPANDING will get set and the expansion
- * won't proceed until we finish with the stripe.
- */
- int must_retry = 0;
- spin_lock_irq(&conf->device_lock);
- if (!ahead_of_reshape(mddev, logical_sector,
- conf->reshape_progress))
- /* mismatch, need to try again */
- must_retry = 1;
- spin_unlock_irq(&conf->device_lock);
- if (must_retry) {
- raid5_release_stripe(sh);
- schedule();
- do_prepare = true;
- goto retry;
- }
- }
- if (read_seqcount_retry(&conf->gen_lock, seq)) {
- /* Might have got the wrong stripe_head
- * by accident
- */
- raid5_release_stripe(sh);
- goto retry;
- }
+ if (unlikely(!sh)) {
+ /* cannot get stripe, just give-up */
+ bi->bi_status = BLK_STS_IOERR;
+ break;
+ }

- if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
- /* Stripe is busy expanding or
- * add failed due to overlap. Flush everything
- * and wait a while
- */
- md_wakeup_thread(mddev->thread);
+ if (unlikely(previous)) {
+ /* expansion might have moved on while waiting for a
+ * stripe, so we must do the range check again.
+ * Expansion could still move past after this
+ * test, but as we are holding a reference to
+ * 'sh', we know that if that happens,
+ * STRIPE_EXPANDING will get set and the expansion
+ * won't proceed until we finish with the stripe.
+ */
+ int must_retry = 0;
+ spin_lock_irq(&conf->device_lock);
+ if (!ahead_of_reshape(mddev, logical_sector,
+ conf->reshape_progress))
+ /* mismatch, need to try again */
+ must_retry = 1;
+ spin_unlock_irq(&conf->device_lock);
+ if (must_retry) {
raid5_release_stripe(sh);
schedule();
do_prepare = true;
goto retry;
}
- if (do_flush) {
- set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
- /* we only need flush for one stripe */
- do_flush = false;
- }
+ }

- set_bit(STRIPE_HANDLE, &sh->state);
- clear_bit(STRIPE_DELAYED, &sh->state);
- if ((!sh->batch_head || sh == sh->batch_head) &&
- (bi->bi_opf & REQ_SYNC) &&
- !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- atomic_inc(&conf->preread_active_stripes);
- release_stripe_plug(mddev, sh);
- } else {
- /* cannot get stripe for read-ahead, just give-up */
- bi->bi_status = BLK_STS_IOERR;
- break;
+ if (read_seqcount_retry(&conf->gen_lock, seq)) {
+ /* Might have got the wrong stripe_head by accident */
+ raid5_release_stripe(sh);
+ goto retry;
+ }
+
+ if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+ !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+ /*
+ * Stripe is busy expanding or add failed due to
+ * overlap. Flush everything and wait a while.
+ */
+ md_wakeup_thread(mddev->thread);
+ raid5_release_stripe(sh);
+ schedule();
+ do_prepare = true;
+ goto retry;
}
+
+ if (do_flush) {
+ set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+ /* we only need flush for one stripe */
+ do_flush = false;
+ }
+
+ set_bit(STRIPE_HANDLE, &sh->state);
+ clear_bit(STRIPE_DELAYED, &sh->state);
+ if ((!sh->batch_head || sh == sh->batch_head) &&
+ (bi->bi_opf & REQ_SYNC) &&
+ !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ atomic_inc(&conf->preread_active_stripes);
+
+ release_stripe_plug(mddev, sh);
}
finish_wait(&conf->wait_for_overlap, &w);

--
2.30.2

Subject: RE: [Non-DoD Source] [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request

Innocent question from a position of ignorance... I see these last 15 check-ins all as performance improvements. I tend to push hard on mdraid performance, but have RAID6 needs. Are these optimizations available for RAID6 as well, and are they in process, or did I just ask a silly question?

Regards,
Jim Finlayson
US Department of Defense

-----Original Message-----
From: Logan Gunthorpe <[email protected]>
Sent: Thursday, June 16, 2022 3:20 PM
To: [email protected]; [email protected]; Song Liu <[email protected]>
Cc: Christoph Hellwig <[email protected]>; Guoqing Jiang <[email protected]>; Stephen Bates <[email protected]>; Martin Oliveira <[email protected]>; David Sloan <[email protected]>; Logan Gunthorpe <[email protected]>
Subject: [Non-DoD Source] [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request

The block layer defaults the maximum number of segments to 128, which means requests tend to get split at around 512KB (128 segments * 4KB pages), depending on how many pages can be merged. There's no such restriction in the raid5 code, so increase the limit to USHRT_MAX so that larger requests can be sent as one.

Signed-off-by: Logan Gunthorpe <[email protected]>

2022-06-23 05:26:31

by Song Liu

Subject: Re: [PATCH v3 00/15] Improve Raid5 Lock Contention

On Thu, Jun 16, 2022 at 12:20 PM Logan Gunthorpe <[email protected]> wrote:
>
> Hi,
>
> Now that I've done some cleanup of the mdadm testing infrastructure
> as well as a lot of long run testing and bug fixes I'm much more
> confident in the correctness of this series. The previous posting is
> at [1].

Applied to md-next. Thanks for the great work!

Song

2022-07-03 08:15:48

by Christoph Hellwig

Subject: Re: [PATCH v3 00/15] Improve Raid5 Lock Contention

On Wed, Jun 22, 2022 at 09:48:02PM -0700, Song Liu wrote:
> > Now that I've done some cleanup of the mdadm testing infrastructure
> > as well as a lot of long run testing and bug fixes I'm much more
> > confident in the correctness of this series. The previous posting is
> > at [1].
>
> Applied to md-next. Thanks for the great work!

They still don't seem to have made it to linux-next.

2022-07-03 14:57:52

by Song Liu

Subject: Re: [PATCH v3 00/15] Improve Raid5 Lock Contention

On Sun, Jul 3, 2022 at 12:51 AM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Jun 22, 2022 at 09:48:02PM -0700, Song Liu wrote:
> > > Now that I've done some cleanup of the mdadm testing infrastructure
> > > as well as a lot of long run testing and bug fixes I'm much more
> > > confident in the correctness of this series. The previous posting is
> > > at [1].
> >
> > Applied to md-next. Thanks for the great work!
>
> They still don't seem to have made it to linux-next.

I will send pull request to Jens soon. Thanks for the reminder.

Song