2022-04-22 18:07:22

by Logan Gunthorpe

Subject: [PATCH v2 00/12] Improve Raid5 Lock Contention

Hi,

This is v2 of this series which addresses Christoph's feedback and
fixes some bugs. The first posting is at [1]. A git branch is
available at [2].

--

I've been doing some work trying to improve the bulk write performance
of raid5 on large systems with fast NVMe drives. The bottleneck appears
largely to be lock contention on the hash_lock and device_lock. This
series improves the situation slightly by addressing a couple of
low-hanging-fruit ways to take the lock fewer times in the request path.

Patch 9 adjusts how batching works by keeping a reference to the
previous stripe_head in raid5_make_request(). Under most situations,
this removes the need to take the hash_lock in stripe_add_to_batch_list()
which should reduce the number of times the lock is taken by a factor of
about 2.
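
The core of the idea is a cached-pointer fast path in front of the locked
hash lookup. As a rough stand-alone illustration only (toy types and a
made-up helper name, get_batch_head(); the real change is the diff in
patch 9 below), it looks something like this:

/*
 * Toy sketch of the patch 9 idea: keep a reference to the stripe handled
 * on the previous iteration and reuse it when the next stripe's batch
 * head is that same stripe, so the contended hash lock is only taken on
 * a miss.  All types and helpers here are simplified stand-ins.
 */
#include <stdio.h>
#include <pthread.h>

struct stripe {                         /* stand-in for struct stripe_head */
        unsigned long long sector;
        int count;                      /* toy refcount (kernel uses atomic_t) */
};

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct stripe table[8];          /* stand-in for the stripe hash table */

/* lookup that must be called under hash_lock (toy stand-in) */
static struct stripe *lookup_get_stripe(unsigned long long sector)
{
        struct stripe *sh = &table[(sector / 8) % 8];

        sh->sector = sector;
        sh->count++;
        return sh;
}

static struct stripe *get_batch_head(unsigned long long head_sector,
                                     struct stripe *last_sh)
{
        struct stripe *head;

        if (last_sh && last_sh->sector == head_sector) {
                last_sh->count++;       /* fast path: no hash lock needed */
                return last_sh;
        }

        pthread_mutex_lock(&hash_lock); /* slow path: locked lookup */
        head = lookup_get_stripe(head_sector);
        pthread_mutex_unlock(&hash_lock);
        return head;
}

int main(void)
{
        struct stripe *prev, *head;

        /* first stripe of a large write: the lookup must take the lock */
        pthread_mutex_lock(&hash_lock);
        prev = lookup_get_stripe(8);
        pthread_mutex_unlock(&hash_lock);

        /* the next stripe's batch head is the stripe just handled, so the
         * cached pointer is reused and the hash lock is never touched */
        head = get_batch_head(8, prev);

        printf("cached stripe reused: %s (refcount %d)\n",
               head == prev ? "yes" : "no", head->count);
        return 0;
}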

Patch 12 pivots the way raid5_make_request() works. Before the patch, the
code must find the stripe_head for every 4KB page in the request, so each
stripe head must be found once for every data disk. The patch changes this
so that all the data disks can be added to a stripe_head at once and the
number of times the stripe_head must be found (and thus the number of
times the hash_lock is taken) should be reduced by a factor roughly equal
to the number of data disks.
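
To make that factor concrete, here is a toy stand-alone sketch (not the
md code; it ignores the real chunk layout and simply assumes consecutive
4KB pages map to consecutive data disks of the same stripe) that counts
how many times the locked stripe lookup would run in the old per-page
loop versus the pivoted per-stripe loop:

#include <stdio.h>

#define DATA_DISKS 4
#define PAGES      64           /* 4KB pages in the example request */

static int locked_lookups;

/* stand-in for the hash_lock protected stripe lookup */
static void find_stripe(int stripe_nr)
{
        (void)stripe_nr;
        locked_lookups++;       /* every lookup takes the contended lock */
}

/* roughly the old shape: one locked lookup per 4KB page */
static void per_page_loop(void)
{
        for (int page = 0; page < PAGES; page++)
                find_stripe(page / DATA_DISKS);
}

/* roughly the pivoted shape: one locked lookup per stripe, then all the
 * data disks of that stripe are handled at once */
static void per_stripe_loop(void)
{
        for (int stripe = 0; stripe < PAGES / DATA_DISKS; stripe++)
                find_stripe(stripe);
}

int main(void)
{
        locked_lookups = 0;
        per_page_loop();
        printf("per-page loop:   %d locked lookups\n", locked_lookups);

        locked_lookups = 0;
        per_stripe_loop();
        printf("per-stripe loop: %d locked lookups\n", locked_lookups);
        return 0;
}

With 4 data disks this prints 64 locked lookups for the per-page loop and
16 for the pivoted loop, i.e. the reduction by a factor of the number of
data disks described above.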

The remaining patches are just cleanup and prep patches for those two
patches.

Doing apples-to-apples testing of this series on a small VM with 5 ram
disks, I saw a bandwidth increase of roughly 14%, and lock contention
on the hash_lock (as reported by lock stat) was reduced by more than a
factor of 5 (though it is still significantly contended).

Testing on larger systems with NVMe drives showed similar small bandwidth
increases, from 3% to 20% depending on the parameters. Oddly, smaller
arrays had larger gains, likely because they had lower starting
bandwidths; I would have expected larger gains with larger arrays (since
even fewer locks should have been taken in raid5_make_request()).

Logan

[1] https://lkml.kernel.org/r/[email protected]
[2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2

--

Changes since v1:
- Rebased on current md-next branch (190a901246c69d79)
- Added patch to create a helper for checking if a sector
is ahead of the reshape (per Christoph)
- Reworked the __find_stripe() patch to create a find_get_stripe()
helper (per Christoph)
- Added more patches to further refactor raid5_make_request() and
pull most of the loop body into a helper function (per Christoph)
- A few other minor cleanups (boolean return, dropping casting when
printing sectors, commit message grammar) as suggested by Christoph.
- Fixed two uncommon but bad data corruption bugs that were found.

--

Logan Gunthorpe (12):
md/raid5: Factor out ahead_of_reshape() function
md/raid5: Refactor raid5_make_request loop
md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
md/raid5: Move common stripe count increment code into __find_stripe()
md/raid5: Factor out helper from raid5_make_request() loop
md/raid5: Drop the do_prepare flag in raid5_make_request()
md/raid5: Move read_seqcount_begin() into make_stripe_request()
md/raid5: Refactor for loop in raid5_make_request() into while loop
md/raid5: Keep a reference to last stripe_head for batch
md/raid5: Refactor add_stripe_bio()
md/raid5: Check all disks in a stripe_head for reshape progress
md/raid5: Pivot raid5_make_request()

drivers/md/raid5.c | 632 +++++++++++++++++++++++++++++----------------
drivers/md/raid5.h | 1 +
2 files changed, 410 insertions(+), 223 deletions(-)


base-commit: 190a901246c69d79dadd1ab574022548da612724
--
2.30.2


2022-04-22 18:33:44

by Logan Gunthorpe

Subject: [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch

When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.

Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.

The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 51 +++++++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0c250cc3bfff..28ea7b9b6ab6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -843,7 +843,8 @@ static bool stripe_can_batch(struct stripe_head *sh)
}

/* we only do back search */
-static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
+static void stripe_add_to_batch_list(struct r5conf *conf,
+ struct stripe_head *sh, struct stripe_head *last_sh)
{
struct stripe_head *head;
sector_t head_sector, tmp_sec;
@@ -856,15 +857,20 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
return;
head_sector = sh->sector - RAID5_STRIPE_SECTORS(conf);

- hash = stripe_hash_locks_hash(conf, head_sector);
- spin_lock_irq(conf->hash_locks + hash);
- head = find_get_stripe(conf, head_sector, conf->generation, hash);
- spin_unlock_irq(conf->hash_locks + hash);
-
- if (!head)
- return;
- if (!stripe_can_batch(head))
- goto out;
+ if (last_sh && head_sector == last_sh->sector) {
+ head = last_sh;
+ atomic_inc(&head->count);
+ } else {
+ hash = stripe_hash_locks_hash(conf, head_sector);
+ spin_lock_irq(conf->hash_locks + hash);
+ head = find_get_stripe(conf, head_sector, conf->generation,
+ hash);
+ spin_unlock_irq(conf->hash_locks + hash);
+ if (!head)
+ return;
+ if (!stripe_can_batch(head))
+ goto out;
+ }

lock_two_stripes(head, sh);
/* clear_batch_ready clear the flag */
@@ -5800,6 +5806,7 @@ enum stripe_result {

struct stripe_request_ctx {
bool do_flush;
+ struct stripe_head *batch_last;
};

static enum stripe_result make_stripe_request(struct mddev *mddev,
@@ -5889,8 +5896,13 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
return STRIPE_SCHEDULE_AND_RETRY;
}

- if (stripe_can_batch(sh))
- stripe_add_to_batch_list(conf, sh);
+ if (stripe_can_batch(sh)) {
+ stripe_add_to_batch_list(conf, sh, ctx->batch_last);
+ if (ctx->batch_last)
+ raid5_release_stripe(ctx->batch_last);
+ atomic_inc(&sh->count);
+ ctx->batch_last = sh;
+ }

if (ctx->do_flush) {
set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
@@ -5979,6 +5991,18 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
} else if (res == STRIPE_RETRY) {
continue;
} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
+ /*
+ * Must release the reference to batch_last before
+ * scheduling and waiting for work to be done,
+ * otherwise the batch_last stripe head could prevent
+ * raid5_activate_delayed() from making progress
+ * and thus deadlocking.
+ */
+ if (ctx.batch_last) {
+ raid5_release_stripe(ctx.batch_last);
+ ctx.batch_last = NULL;
+ }
+
schedule();
prepare_to_wait(&conf->wait_for_overlap, &w,
TASK_UNINTERRUPTIBLE);
@@ -5990,6 +6014,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)

finish_wait(&conf->wait_for_overlap, &w);

+ if (ctx.batch_last)
+ raid5_release_stripe(ctx.batch_last);
+
if (rw == WRITE)
md_write_end(mddev);
bio_endio(bi);
--
2.30.2

2022-04-22 19:54:51

by Logan Gunthorpe

Subject: [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request()

Now that prepare_to_wait() isn't in the way, move read_seqcount_begin()
into make_stripe_request().

No functional changes intended.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b9f618356446..1bce9075e165 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5804,13 +5804,15 @@ struct stripe_request_ctx {

static enum stripe_result make_stripe_request(struct mddev *mddev,
struct r5conf *conf, struct stripe_request_ctx *ctx,
- sector_t logical_sector, struct bio *bi, int seq)
+ sector_t logical_sector, struct bio *bi)
{
const int rw = bio_data_dir(bi);
struct stripe_head *sh;
sector_t new_sector;
int previous = 0;
- int dd_idx;
+ int seq, dd_idx;
+
+ seq = read_seqcount_begin(&conf->gen_lock);

if (unlikely(conf->reshape_progress != MaxSector)) {
/* spinlock is needed as reshape_progress may be
@@ -5970,13 +5972,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
md_account_bio(mddev, &bi);
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
- int seq;
-
retry:
- seq = read_seqcount_begin(&conf->gen_lock);
-
res = make_stripe_request(mddev, conf, &ctx, logical_sector,
- bi, seq);
+ bi);
if (res == STRIPE_FAIL) {
break;
} else if (res == STRIPE_RETRY) {
--
2.30.2

2022-04-22 20:08:24

by Christoph Hellwig

Subject: Re: [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request()

On Wed, Apr 20, 2022 at 01:54:20PM -0600, Logan Gunthorpe wrote:
> Now that prepare_to_wait() isn't in the way, move read_sequcount_begin()
> into make_stripe_request().
>
> No functional changes intended.

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-04-22 20:56:52

by Logan Gunthorpe

Subject: [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe()

Both uses of find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common find_get_stripe()
helper.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 133 ++++++++++++++++++++++-----------------------
1 file changed, 65 insertions(+), 68 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8e1ece5ce984..a0946af5b1ac 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -612,7 +612,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
}

static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
- short generation)
+ short generation, int hash)
{
struct stripe_head *sh;

@@ -624,6 +624,49 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
return NULL;
}

+static struct stripe_head *find_get_stripe(struct r5conf *conf,
+ sector_t sector, short generation, int hash)
+{
+ int inc_empty_inactive_list_flag;
+ struct stripe_head *sh;
+
+ sh = __find_stripe(conf, sector, generation, hash);
+ if (!sh)
+ return NULL;
+
+ if (atomic_inc_not_zero(&sh->count))
+ return sh;
+
+ /*
+ * Slow path. The reference count is zero which means the stripe must
+ * be on a list (sh->lru). Must remove the stripe from the list that
+ * references it with the device_lock held.
+ */
+
+ spin_lock(&conf->device_lock);
+ if (!atomic_read(&sh->count)) {
+ if (!test_bit(STRIPE_HANDLE, &sh->state))
+ atomic_inc(&conf->active_stripes);
+ BUG_ON(list_empty(&sh->lru) &&
+ !test_bit(STRIPE_EXPANDING, &sh->state));
+ inc_empty_inactive_list_flag = 0;
+ if (!list_empty(conf->inactive_list + hash))
+ inc_empty_inactive_list_flag = 1;
+ list_del_init(&sh->lru);
+ if (list_empty(conf->inactive_list + hash) &&
+ inc_empty_inactive_list_flag)
+ atomic_inc(&conf->empty_inactive_list_nr);
+ if (sh->group) {
+ sh->group->stripes_cnt--;
+ sh->group = NULL;
+ }
+ }
+ atomic_inc(&sh->count);
+ spin_unlock(&conf->device_lock);
+
+ return sh;
+}
+
/*
* Need to check if array has failed when deciding whether to:
* - start an array
@@ -716,7 +759,6 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
{
struct stripe_head *sh;
int hash = stripe_hash_locks_hash(conf, sector);
- int inc_empty_inactive_list_flag;

pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);

@@ -726,57 +768,34 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
wait_event_lock_irq(conf->wait_for_quiescent,
conf->quiesce == 0 || noquiesce,
*(conf->hash_locks + hash));
- sh = __find_stripe(conf, sector, conf->generation - previous);
- if (!sh) {
- if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
- sh = get_free_stripe(conf, hash);
- if (!sh && !test_bit(R5_DID_ALLOC,
- &conf->cache_state))
- set_bit(R5_ALLOC_MORE,
- &conf->cache_state);
- }
- if (noblock && sh == NULL)
- break;
+ sh = find_get_stripe(conf, sector, conf->generation - previous,
+ hash);
+ if (sh)
+ break;

- r5c_check_stripe_cache_usage(conf);
- if (!sh) {
- set_bit(R5_INACTIVE_BLOCKED,
- &conf->cache_state);
- r5l_wake_reclaim(conf->log, 0);
- wait_event_lock_irq(
- conf->wait_for_stripe,
+ if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
+ sh = get_free_stripe(conf, hash);
+ if (!sh && !test_bit(R5_DID_ALLOC, &conf->cache_state))
+ set_bit(R5_ALLOC_MORE, &conf->cache_state);
+ }
+ if (noblock && !sh)
+ break;
+
+ r5c_check_stripe_cache_usage(conf);
+ if (!sh) {
+ set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+ r5l_wake_reclaim(conf->log, 0);
+ wait_event_lock_irq(conf->wait_for_stripe,
!list_empty(conf->inactive_list + hash) &&
(atomic_read(&conf->active_stripes)
< (conf->max_nr_stripes * 3 / 4)
|| !test_bit(R5_INACTIVE_BLOCKED,
&conf->cache_state)),
*(conf->hash_locks + hash));
- clear_bit(R5_INACTIVE_BLOCKED,
- &conf->cache_state);
- } else {
- init_stripe(sh, sector, previous);
- atomic_inc(&sh->count);
- }
- } else if (!atomic_inc_not_zero(&sh->count)) {
- spin_lock(&conf->device_lock);
- if (!atomic_read(&sh->count)) {
- if (!test_bit(STRIPE_HANDLE, &sh->state))
- atomic_inc(&conf->active_stripes);
- BUG_ON(list_empty(&sh->lru) &&
- !test_bit(STRIPE_EXPANDING, &sh->state));
- inc_empty_inactive_list_flag = 0;
- if (!list_empty(conf->inactive_list + hash))
- inc_empty_inactive_list_flag = 1;
- list_del_init(&sh->lru);
- if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
- atomic_inc(&conf->empty_inactive_list_nr);
- if (sh->group) {
- sh->group->stripes_cnt--;
- sh->group = NULL;
- }
- }
+ clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+ } else {
+ init_stripe(sh, sector, previous);
atomic_inc(&sh->count);
- spin_unlock(&conf->device_lock);
}
} while (sh == NULL);

@@ -830,7 +849,6 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
sector_t head_sector, tmp_sec;
int hash;
int dd_idx;
- int inc_empty_inactive_list_flag;

/* Don't cross chunks, so stripe pd_idx/qd_idx is the same */
tmp_sec = sh->sector;
@@ -840,28 +858,7 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh

hash = stripe_hash_locks_hash(conf, head_sector);
spin_lock_irq(conf->hash_locks + hash);
- head = __find_stripe(conf, head_sector, conf->generation);
- if (head && !atomic_inc_not_zero(&head->count)) {
- spin_lock(&conf->device_lock);
- if (!atomic_read(&head->count)) {
- if (!test_bit(STRIPE_HANDLE, &head->state))
- atomic_inc(&conf->active_stripes);
- BUG_ON(list_empty(&head->lru) &&
- !test_bit(STRIPE_EXPANDING, &head->state));
- inc_empty_inactive_list_flag = 0;
- if (!list_empty(conf->inactive_list + hash))
- inc_empty_inactive_list_flag = 1;
- list_del_init(&head->lru);
- if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
- atomic_inc(&conf->empty_inactive_list_nr);
- if (head->group) {
- head->group->stripes_cnt--;
- head->group = NULL;
- }
- }
- atomic_inc(&head->count);
- spin_unlock(&conf->device_lock);
- }
+ head = find_get_stripe(conf, head_sector, conf->generation, hash);
spin_unlock_irq(conf->hash_locks + hash);

if (!head)
--
2.30.2

2022-04-22 21:40:57

by Logan Gunthorpe

Subject: [PATCH v2 03/12] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()

stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.

The call to add_stripe_bio() in retry_aligned_read() is for a read,
and batching only applies to writes, so it's impossible for batching
to happen at that call site.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
drivers/md/raid5.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cda6857e6207..8e1ece5ce984 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3534,8 +3534,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
}
spin_unlock_irq(&sh->stripe_lock);

- if (stripe_can_batch(sh))
- stripe_add_to_batch_list(conf, sh);
return 1;

overlap:
@@ -5955,6 +5953,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
goto retry;
}

+ if (stripe_can_batch(sh))
+ stripe_add_to_batch_list(conf, sh);
+
if (do_flush) {
set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
/* we only need flush for one stripe */
--
2.30.2

2022-04-22 22:08:35

by Logan Gunthorpe

Subject: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()

raid5_make_request() loops through every page in the request,
finds the appropriate stripe and adds the bio for that page to the disk.

This causes a great deal of contention on the hash_lock since the
lock for that hash must be taken for every single page.

The number of times the lock is taken can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the lock must be taken by
a factor equal to the number of data disks.

To accomplish this, store the minimum and maximum disk sector that
has already been finished and continue to the next logical sector if
it is found that the disk sector has already been done. Then add an
add_all_stripe_bios() helper to check all the bios for overlap and add
them all if none of them overlap.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/md/raid5.c | 92 +++++++++++++++++++++++++++++++++++++++++++---
drivers/md/raid5.h | 1 +
2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 40a25c4b80bd..f86866cb15be 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3571,6 +3571,48 @@ static bool add_stripe_bio(struct stripe_head *sh, struct bio *bi,
return true;
}

+static int add_all_stripe_bios(struct stripe_head *sh, struct bio *bi,
+ sector_t first_logical_sector, sector_t last_sector,
+ int forwrite, int previous)
+{
+ int dd_idx;
+ int ret = 1;
+
+ spin_lock_irq(&sh->stripe_lock);
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+ struct r5dev *dev = &sh->dev[dd_idx];
+
+ clear_bit(R5_BioReady, &dev->flags);
+
+ if (dd_idx == sh->pd_idx)
+ continue;
+
+ if (dev->sector < first_logical_sector ||
+ dev->sector >= last_sector)
+ continue;
+
+ if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+ set_bit(R5_Overlap, &dev->flags);
+ ret = 0;
+ continue;
+ }
+
+ set_bit(R5_BioReady, &dev->flags);
+ }
+
+ if (!ret)
+ goto out;
+
+ for (dd_idx = 0; dd_idx < sh->disks; dd_idx++)
+ if (test_bit(R5_BioReady, &sh->dev[dd_idx].flags))
+ __add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
+
+out:
+ spin_unlock_irq(&sh->stripe_lock);
+ return ret;
+}
+
static void end_reshape(struct r5conf *conf);

static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
@@ -5869,6 +5911,10 @@ enum stripe_result {
struct stripe_request_ctx {
bool do_flush;
struct stripe_head *batch_last;
+ sector_t disk_sector_done;
+ sector_t start_disk_sector;
+ bool first_wrap;
+ sector_t last_sector;
};

static enum stripe_result make_stripe_request(struct mddev *mddev,
@@ -5908,6 +5954,36 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,

new_sector = raid5_compute_sector(conf, logical_sector, previous,
&dd_idx, NULL);
+
+ /*
+ * This is a tricky algorithm to figure out which stripe_heads have
+ * already been visited and exit early if the stripe_head has already
+ * been done (since all disks are added to a stripe_head at once in
+ * add_all_stripe_bios()).
+ *
+ * To start with, the disk sector of the last stripe that has been
+ * completed is stored in ctx->disk_sector_done. If the new_sector is
+ * less than this value, the stripe_head has already been done.
+ *
+ * There's one issue with this: if the request starts in the middle of
+ * a chunk, all the stripe heads before the starting offset will be
+ * missed. To account for this, set the first_wrap boolean to true
+ * if new_sector is less than the starting sector. Clear the
+ * boolean once the start sector is hit for the second time.
+ * When first_wrap is set, ignore the disk_sector_done.
+ */
+ if (ctx->start_disk_sector == MaxSector) {
+ ctx->start_disk_sector = new_sector;
+ } else if (new_sector < ctx->start_disk_sector) {
+ ctx->first_wrap = true;
+ } else if (new_sector == ctx->start_disk_sector) {
+ ctx->first_wrap = false;
+ ctx->start_disk_sector = 0;
+ return STRIPE_SUCCESS;
+ } else if (!ctx->first_wrap && new_sector <= ctx->disk_sector_done) {
+ return STRIPE_SUCCESS;
+ }
+
pr_debug("raid456: %s, sector %llu logical %llu\n", __func__,
new_sector, logical_sector);

@@ -5939,7 +6015,8 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
}

if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+ !add_all_stripe_bios(sh, bi, logical_sector, ctx->last_sector, rw,
+ previous)) {
/*
* Stripe is busy expanding or add failed due to
* overlap. Flush everything and wait a while.
@@ -5949,6 +6026,9 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
return STRIPE_SCHEDULE_AND_RETRY;
}

+ if (new_sector > ctx->disk_sector_done)
+ ctx->disk_sector_done = new_sector;
+
if (stripe_can_batch(sh)) {
stripe_add_to_batch_list(conf, sh, ctx->batch_last);
if (ctx->batch_last)
@@ -5977,8 +6057,10 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
{
struct r5conf *conf = mddev->private;
- sector_t logical_sector, last_sector;
- struct stripe_request_ctx ctx = {};
+ sector_t logical_sector;
+ struct stripe_request_ctx ctx = {
+ .start_disk_sector = MaxSector,
+ };
const int rw = bio_data_dir(bi);
enum stripe_result res;
DEFINE_WAIT(w);
@@ -6021,7 +6103,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}

logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
- last_sector = bio_end_sector(bi);
+ ctx.last_sector = bio_end_sector(bi);
bi->bi_next = NULL;

/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
@@ -6036,7 +6118,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}
md_account_bio(mddev, &bi);
prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
- while (logical_sector < last_sector) {
+ while (logical_sector < ctx.last_sector) {
res = make_stripe_request(mddev, conf, &ctx, logical_sector,
bi);
if (res == STRIPE_FAIL) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 638d29863503..e73b58844f83 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -308,6 +308,7 @@ enum r5dev_flags {
R5_Wantwrite,
R5_Overlap, /* There is a pending overlapping request
* on this block */
+ R5_BioReady, /* The current bio can be added to this disk */
R5_ReadNoMerge, /* prevent bio from merging in block-layer */
R5_ReadError, /* seen a read error here recently */
R5_ReWrite, /* have tried to over-write the readerror */
--
2.30.2

2022-04-25 06:00:30

by Guoqing Jiang

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Hi,
>
> This is v2 of this series which addresses Christoph's feedback and
> fixes some bugs. The first posting is at [1]. A git branch is
> available at [2].
>
> --
>
> I've been doing some work trying to improve the bulk write performance
> of raid5 on large systems with fast NVMe drives. The bottleneck appears
> largely to be lock contention on the hash_lock and device_lock. This
> series improves the situation slightly by addressing a couple of low
> hanging fruit ways to take the lock fewer times in the request path.
>
> Patch 9 adjusts how batching works by keeping a reference to the
> previous stripe_head in raid5_make_request(). Under most situtations,
> this removes the need to take the hash_lock in stripe_add_to_batch_list()
> which should reduce the number of times the lock is taken by a factor of
> about 2.
>
> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
> code must find the stripe_head for every 4KB page in the request, so each
> stripe head must be found once for every data disk. The patch changes this
> so that all the data disks can be added to a stripe_head at once and the
> number of times the stripe_head must be found (and thus the number of
> times the hash_lock is taken) should be reduced by a factor roughly equal
> to the number of data disks.
>
> The remaining patches are just cleanup and prep patches for those two
> patches.
>
> Doing apples to apples testing this series on a small VM with 5 ram
> disks, I saw a bandwidth increase of roughly 14% and lock contentions
> on the hash_lock (as reported by lock stat) reduced by more than a factor
> of 5 (though it is still significantly contended).
>
> Testing on larger systems with NVMe drives saw similar small bandwidth
> increases from 3% to 20% depending on the parameters. Oddly small arrays
> had larger gains, likely due to them having lower starting bandwidths; I
> would have expected larger gains with larger arrays (seeing there
> should have been even fewer locks taken in raid5_make_request()).
>
> Logan
>
> [1] https://lkml.kernel.org/r/[email protected]
> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>
> --
>
> Changes since v1:
> - Rebased on current md-next branch (190a901246c69d79)
> - Added patch to create a helper for checking if a sector
> is ahead of the reshape (per Christoph)
> - Reworked the __find_stripe() patch to create a find_get_stripe()
> helper (per Christoph)
> - Added more patches to further refactor raid5_make_request() and
> pull most of the loop body into a helper function (per Christoph)
> - A few other minor cleanups (boolean return, droping casting when
> printing sectors, commit message grammar) as suggested by Christoph.
> - Fixed two uncommon but bad data corruption bugs in that were found.
>
> --
>
> Logan Gunthorpe (12):
> md/raid5: Factor out ahead_of_reshape() function
> md/raid5: Refactor raid5_make_request loop
> md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
> md/raid5: Move common stripe count increment code into __find_stripe()
> md/raid5: Factor out helper from raid5_make_request() loop
> md/raid5: Drop the do_prepare flag in raid5_make_request()
> md/raid5: Move read_seqcount_begin() into make_stripe_request()
> md/raid5: Refactor for loop in raid5_make_request() into while loop
> md/raid5: Keep a reference to last stripe_head for batch
> md/raid5: Refactor add_stripe_bio()
> md/raid5: Check all disks in a stripe_head for reshape progress
> md/raid5: Pivot raid5_make_request()

Generally, I don't object to the cleanup patches since the code looks
cleaner. But my concern is that some additional function calls are added
to the hot path (raid5_make_request); could the performance be affected?

And I think patch 9 and patch 12 are helpful for the performance
improvement; did you measure the performance without those cleanup
patches?

Thanks,
Guoqing

2022-04-25 16:25:56

by Logan Gunthorpe

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 2022-04-24 01:53, Guoqing Jiang wrote:
>
>
> On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
>> Hi,
>>
>> This is v2 of this series which addresses Christoph's feedback and
>> fixes some bugs. The first posting is at [1]. A git branch is
>> available at [2].
>>
>> --
>>
>> I've been doing some work trying to improve the bulk write performance
>> of raid5 on large systems with fast NVMe drives. The bottleneck appears
>> largely to be lock contention on the hash_lock and device_lock. This
>> series improves the situation slightly by addressing a couple of low
>> hanging fruit ways to take the lock fewer times in the request path.
>>
>> Patch 9 adjusts how batching works by keeping a reference to the
>> previous stripe_head in raid5_make_request(). Under most situtations,
>> this removes the need to take the hash_lock in stripe_add_to_batch_list()
>> which should reduce the number of times the lock is taken by a factor of
>> about 2.
>>
>> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
>> code must find the stripe_head for every 4KB page in the request, so each
>> stripe head must be found once for every data disk. The patch changes this
>> so that all the data disks can be added to a stripe_head at once and the
>> number of times the stripe_head must be found (and thus the number of
>> times the hash_lock is taken) should be reduced by a factor roughly equal
>> to the number of data disks.
>>
>> The remaining patches are just cleanup and prep patches for those two
>> patches.
>>
>> Doing apples to apples testing this series on a small VM with 5 ram
>> disks, I saw a bandwidth increase of roughly 14% and lock contentions
>> on the hash_lock (as reported by lock stat) reduced by more than a factor
>> of 5 (though it is still significantly contended).
>>
>> Testing on larger systems with NVMe drives saw similar small bandwidth
>> increases from 3% to 20% depending on the parameters. Oddly small arrays
>> had larger gains, likely due to them having lower starting bandwidths; I
>> would have expected larger gains with larger arrays (seeing there
>> should have been even fewer locks taken in raid5_make_request()).
>>
>> Logan
>>
>> [1] https://lkml.kernel.org/r/[email protected]
>> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>>
>> --
>>
>> Changes since v1:
>> - Rebased on current md-next branch (190a901246c69d79)
>> - Added patch to create a helper for checking if a sector
>> is ahead of the reshape (per Christoph)
>> - Reworked the __find_stripe() patch to create a find_get_stripe()
>> helper (per Christoph)
>> - Added more patches to further refactor raid5_make_request() and
>> pull most of the loop body into a helper function (per Christoph)
>> - A few other minor cleanups (boolean return, droping casting when
>> printing sectors, commit message grammar) as suggested by Christoph.
>> - Fixed two uncommon but bad data corruption bugs in that were found.
>>
>> --
>>
>> Logan Gunthorpe (12):
>> md/raid5: Factor out ahead_of_reshape() function
>> md/raid5: Refactor raid5_make_request loop
>> md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
>> md/raid5: Move common stripe count increment code into __find_stripe()
>> md/raid5: Factor out helper from raid5_make_request() loop
>> md/raid5: Drop the do_prepare flag in raid5_make_request()
>> md/raid5: Move read_seqcount_begin() into make_stripe_request()
>> md/raid5: Refactor for loop in raid5_make_request() into while loop
>> md/raid5: Keep a reference to last stripe_head for batch
>> md/raid5: Refactor add_stripe_bio()
>> md/raid5: Check all disks in a stripe_head for reshape progress
>> md/raid5: Pivot raid5_make_request()
>
> Generally, I don't object the cleanup patches since the code looks more
> cleaner.
> But my concern is that since some additional function calls are added to
> hot path
> (raid5_make_request), could the performance be affected?

There's a bit of logic added to raid5_make_request(), but it is all
local and should be fast, and it reduces the number of calls to the slow,
contended locks.

> And I think patch 9 and patch 12 are helpful for performance
> improvement,  did
> you measure the performance without those cleanup patches?

Yes, I compared performance with and without this entire series.

Logan

2022-04-26 07:24:40

by Song Liu

Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention

On Wed, Apr 20, 2022 at 12:55 PM Logan Gunthorpe <[email protected]> wrote:
>
> Hi,
>
> This is v2 of this series which addresses Christoph's feedback and
> fixes some bugs. The first posting is at [1]. A git branch is
> available at [2].
>
> --
>
> I've been doing some work trying to improve the bulk write performance
> of raid5 on large systems with fast NVMe drives. The bottleneck appears
> largely to be lock contention on the hash_lock and device_lock. This
> series improves the situation slightly by addressing a couple of low
> hanging fruit ways to take the lock fewer times in the request path.
>
> Patch 9 adjusts how batching works by keeping a reference to the
> previous stripe_head in raid5_make_request(). Under most situtations,
> this removes the need to take the hash_lock in stripe_add_to_batch_list()
> which should reduce the number of times the lock is taken by a factor of
> about 2.
>
> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
> code must find the stripe_head for every 4KB page in the request, so each
> stripe head must be found once for every data disk. The patch changes this
> so that all the data disks can be added to a stripe_head at once and the
> number of times the stripe_head must be found (and thus the number of
> times the hash_lock is taken) should be reduced by a factor roughly equal
> to the number of data disks.
>
> The remaining patches are just cleanup and prep patches for those two
> patches.
>
> Doing apples to apples testing this series on a small VM with 5 ram
> disks, I saw a bandwidth increase of roughly 14% and lock contentions
> on the hash_lock (as reported by lock stat) reduced by more than a factor
> of 5 (though it is still significantly contended).
>
> Testing on larger systems with NVMe drives saw similar small bandwidth
> increases from 3% to 20% depending on the parameters. Oddly small arrays
> had larger gains, likely due to them having lower starting bandwidths; I
> would have expected larger gains with larger arrays (seeing there
> should have been even fewer locks taken in raid5_make_request()).
>
> Logan
>
> [1] https://lkml.kernel.org/r/[email protected]
> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>

The set looks good to me overall. Thanks everyone for the review and feedback.

Logan, please incorporate feedback and send v3.

Thanks,
Song

2022-04-27 09:09:19

by Guoqing Jiang

Subject: Re: [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request()



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Now that prepare_to_wait() isn't in the way, move read_sequcount_begin()
> into make_stripe_request().
>
> No functional changes intended.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/md/raid5.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index b9f618356446..1bce9075e165 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5804,13 +5804,15 @@ struct stripe_request_ctx {
>
> static enum stripe_result make_stripe_request(struct mddev *mddev,
> struct r5conf *conf, struct stripe_request_ctx *ctx,
> - sector_t logical_sector, struct bio *bi, int seq)
> + sector_t logical_sector, struct bio *bi)
> {
> const int rw = bio_data_dir(bi);
> struct stripe_head *sh;
> sector_t new_sector;
> int previous = 0;
> - int dd_idx;
> + int seq, dd_idx;
> +
> + seq = read_seqcount_begin(&conf->gen_lock);
>
> if (unlikely(conf->reshape_progress != MaxSector)) {
> /* spinlock is needed as reshape_progress may be
> @@ -5970,13 +5972,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> md_account_bio(mddev, &bi);
> prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
> for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
> - int seq;
> -
> retry:
> - seq = read_seqcount_begin(&conf->gen_lock);
> -
> res = make_stripe_request(mddev, conf, &ctx, logical_sector,
> - bi, seq);
> + bi);
> if (res == STRIPE_FAIL) {
> break;
> } else if (res == STRIPE_RETRY) {

Reviewed-by: Guoqing Jiang <[email protected]>

Thanks,
Guoqing

2022-04-27 09:32:03

by Guoqing Jiang

Subject: Re: [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> When batching, every stripe head has to find the previous stripe head to
> add to the batch list. This involves taking the hash lock which is
> highly contended during IO.
>
> Instead of finding the previous stripe_head each time, store a
> reference to the previous stripe_head in a pointer so that it doesn't
> require taking the contended lock another time.
>
> The reference to the previous stripe must be released before scheduling
> and waiting for work to get done. Otherwise, it can hold up
> raid5_activate_delayed() and deadlock.
>
> Signed-off-by: Logan Gunthorpe<[email protected]>
> ---
> drivers/md/raid5.c | 51 +++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 0c250cc3bfff..28ea7b9b6ab6 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -843,7 +843,8 @@ static bool stripe_can_batch(struct stripe_head *sh)
> }
>
> /* we only do back search */
> -static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
> +static void stripe_add_to_batch_list(struct r5conf *conf,
> + struct stripe_head *sh, struct stripe_head *last_sh)

Nit, from stripe_add_to_batch_list's view, I think "head_sh" makes more
sense than "last_sh".

> {
> struct stripe_head *head;
> sector_t head_sector, tmp_sec;
> @@ -856,15 +857,20 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
> return;
> head_sector = sh->sector - RAID5_STRIPE_SECTORS(conf);
>
> - hash = stripe_hash_locks_hash(conf, head_sector);
> - spin_lock_irq(conf->hash_locks + hash);
> - head = find_get_stripe(conf, head_sector, conf->generation, hash);
> - spin_unlock_irq(conf->hash_locks + hash);
> -
> - if (!head)
> - return;
> - if (!stripe_can_batch(head))
> - goto out;
> + if (last_sh && head_sector == last_sh->sector) {
> + head = last_sh;
> + atomic_inc(&head->count);
> + } else {
> + hash = stripe_hash_locks_hash(conf, head_sector);
> + spin_lock_irq(conf->hash_locks + hash);
> + head = find_get_stripe(conf, head_sector, conf->generation,
> + hash);
> + spin_unlock_irq(conf->hash_locks + hash);
> + if (!head)
> + return;
> + if (!stripe_can_batch(head))
> + goto out;
> + }
>
> lock_two_stripes(head, sh);
> /* clear_batch_ready clear the flag */
> @@ -5800,6 +5806,7 @@ enum stripe_result {
>
> struct stripe_request_ctx {
> bool do_flush;
> + struct stripe_head *batch_last;
> };
>
> static enum stripe_result make_stripe_request(struct mddev *mddev,
> @@ -5889,8 +5896,13 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
> return STRIPE_SCHEDULE_AND_RETRY;
> }
>
> - if (stripe_can_batch(sh))
> - stripe_add_to_batch_list(conf, sh);
> + if (stripe_can_batch(sh)) {
> + stripe_add_to_batch_list(conf, sh, ctx->batch_last);
> + if (ctx->batch_last)
> + raid5_release_stripe(ctx->batch_last);
> + atomic_inc(&sh->count);
> + ctx->batch_last = sh;
> + }
>
> if (ctx->do_flush) {
> set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
> @@ -5979,6 +5991,18 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> } else if (res == STRIPE_RETRY) {
> continue;
> } else if (res == STRIPE_SCHEDULE_AND_RETRY) {
> + /*
> + * Must release the reference to batch_last before
> + * scheduling and waiting for work to be done,
> + * otherwise the batch_last stripe head could prevent
> + * raid5_activate_delayed() from making progress
> + * and thus deadlocking.
> + */
> + if (ctx.batch_last) {
> + raid5_release_stripe(ctx.batch_last);
> + ctx.batch_last = NULL;
> + }
> +
> schedule();
> prepare_to_wait(&conf->wait_for_overlap, &w,
> TASK_UNINTERRUPTIBLE);
> @@ -5990,6 +6014,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>
> finish_wait(&conf->wait_for_overlap, &w);
>
> + if (ctx.batch_last)
> + raid5_release_stripe(ctx.batch_last);
> +
> if (rw == WRITE)
> md_write_end(mddev);
> bio_endio(bi);

Otherwise looks good, Acked-by: Guoqing Jiang <[email protected]>

Thanks,
Guoqing

2022-04-27 10:06:44

by Guoqing Jiang

Subject: Re: [PATCH v2 03/12] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> stripe_add_to_batch_list() is better done in the loop in make_request
> instead of inside add_stripe_bio(). This is clearer and allows for
> storing the batch_head state outside the loop in a subsequent patch.
>
> The call to add_stripe_bio() in retry_aligned_read() is for read
> and batching only applies to write. So it's impossible for batching
> to happen at that call site.
>
> No functional changes intended.
>
> Signed-off-by: Logan Gunthorpe<[email protected]>
> Reviewed-by: Christoph Hellwig<[email protected]>
> ---
> drivers/md/raid5.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cda6857e6207..8e1ece5ce984 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3534,8 +3534,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
> }
> spin_unlock_irq(&sh->stripe_lock);
>
> - if (stripe_can_batch(sh))
> - stripe_add_to_batch_list(conf, sh);
> return 1;
>
> overlap:
> @@ -5955,6 +5953,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> goto retry;
> }
>
> + if (stripe_can_batch(sh))
> + stripe_add_to_batch_list(conf, sh);
> +
> if (do_flush) {
> set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
> /* we only need flush for one stripe */


Reviewed-by: Guoqing Jiang <[email protected]>

Thanks,
Guoqing

2022-04-27 10:28:29

by Guoqing Jiang

Subject: Re: [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe()



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Both uses of find_stripe() require a fairly complicated dance to
> increment the reference count. Move this into a common find_get_stripe()
> helper.
>
> No functional changes intended.
>
> Signed-off-by: Logan Gunthorpe<[email protected]>
> ---
> drivers/md/raid5.c | 133 ++++++++++++++++++++++-----------------------
> 1 file changed, 65 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8e1ece5ce984..a0946af5b1ac 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -612,7 +612,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
> }

...

> @@ -624,6 +624,49 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
> return NULL;
> }
>
> +static struct stripe_head *find_get_stripe(struct r5conf *conf,
> + sector_t sector, short generation, int hash)
> +{

The subject doesn't match the change; maybe something like "md/raid5:
factor out common stripe count increment code into find_get_stripe()",
just FYI. Otherwise, it generally looks good.

Acked-by: Guoqing Jiang <[email protected]>


Thanks,
Guoqing

2022-04-27 11:33:26

by Guoqing Jiang

Subject: Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:

> The number of times the lock is taken can be reduced by pivoting
> raid5_make_request() so that it loops through every stripe and then
> loops through every disk in that stripe to see if the bio must be
> added. This reduces the number of times the lock must be taken by
> a factor equal to the number of data disks.
>
> To accomplish this, store the minimum and maxmimum disk sector that
> has already been finished and continue to the next logical sector if
> it is found that the disk sector has already been done. Then add a
> add_all_stripe_bios() to check all the bios for overlap and add them
> all if none of them overlap.
>
> Signed-off-by: Logan Gunthorpe<[email protected]>
> ---
> drivers/md/raid5.c | 92 +++++++++++++++++++++++++++++++++++++++++++---
> drivers/md/raid5.h | 1 +
> 2 files changed, 88 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 40a25c4b80bd..f86866cb15be 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3571,6 +3571,48 @@ static bool add_stripe_bio(struct stripe_head *sh, struct bio *bi,
> return true;
> }
>
> +static int add_all_stripe_bios(struct stripe_head *sh, struct bio *bi,
> + sector_t first_logical_sector, sector_t last_sector,
> + int forwrite, int previous)
> +{
> + int dd_idx;
> + int ret = 1;
> +
> + spin_lock_irq(&sh->stripe_lock);
> +
> + for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
> + struct r5dev *dev = &sh->dev[dd_idx];
> +
> + clear_bit(R5_BioReady, &dev->flags);
> +
> + if (dd_idx == sh->pd_idx)
> + continue;
> +
> + if (dev->sector < first_logical_sector ||
> + dev->sector >= last_sector)
> + continue;
> +
> + if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
> + set_bit(R5_Overlap, &dev->flags);
> + ret = 0;
> + continue;
> + }
> +
> + set_bit(R5_BioReady, &dev->flags);

Is it possible to just call __add_stripe_bio here? And change the above
"continue" to "return",

> + }
> +
> + if (!ret)
> + goto out;
> +
> + for (dd_idx = 0; dd_idx < sh->disks; dd_idx++)
> + if (test_bit(R5_BioReady, &sh->dev[dd_idx].flags))
> + __add_stripe_bio(sh, bi, dd_idx, forwrite, previous);

then we don't need another loop here, and there's no need to introduce
another flag. But I am not sure it is feasible, so just FYI.

> +
> +out:
> + spin_unlock_irq(&sh->stripe_lock);
> + return ret;
> +}
> +
> static void end_reshape(struct r5conf *conf);
>
> static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
> @@ -5869,6 +5911,10 @@ enum stripe_result {
> struct stripe_request_ctx {
> bool do_flush;
> struct stripe_head *batch_last;
> + sector_t disk_sector_done;
> + sector_t start_disk_sector;
> + bool first_wrap;
> + sector_t last_sector;
> };

Could you add some comments to the above members, if possible?

> static enum stripe_result make_stripe_request(struct mddev *mddev,
> @@ -5908,6 +5954,36 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
>
> new_sector = raid5_compute_sector(conf, logical_sector, previous,
> &dd_idx, NULL);
> +
> + /*
> + * This is a tricky algorithm to figure out which stripe_heads that
> + * have already been visited and exit early if the stripe_head has
> + * already been done. (Seeing all disks are added to a stripe_head
> + * once in add_all_stripe_bios().
> + *
> + * To start with, the disk sector of the last stripe that has been
> + * completed is stored in ctx->disk_sector_done. If the new_sector is
> + * less than this value, the stripe_head has already been done.
> + *
> + * There's one issue with this: if the request starts in the middle of
> + * a chunk, all the stripe heads before the starting offset will be
> + * missed. To account for this, set the first_wrap boolean to true
> + * if new_sector is less than the starting sector. Clear the
> + * boolean once the start sector is hit for the second time.
> + * When first_wrap is set, ignore the disk_sector_done.
> + */
> + if (ctx->start_disk_sector == MaxSector) {
> + ctx->start_disk_sector = new_sector;
> + } else if (new_sector < ctx->start_disk_sector) {
> + ctx->first_wrap = true;
> + } else if (new_sector == ctx->start_disk_sector) {
> + ctx->first_wrap = false;
> + ctx->start_disk_sector = 0;
> + return STRIPE_SUCCESS;
> + } else if (!ctx->first_wrap && new_sector <= ctx->disk_sector_done) {
> + return STRIPE_SUCCESS;
> + }
> +

Hmm, with the above tricky algorithm, I guess the point is that we can
avoid calling stripe_add_to_batch_list() below, which has hash_lock
contention. If so, maybe we can change stripe_can_batch() for that
purpose.

> if (stripe_can_batch(sh)) {
> stripe_add_to_batch_list(conf, sh, ctx->batch_last);
> if (ctx->batch_last)
> @@ -5977,8 +6057,10 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
> static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> {
> struct r5conf *conf = mddev->private;
> - sector_t logical_sector, last_sector;
> - struct stripe_request_ctx ctx = {};
> + sector_t logical_sector;
> + struct stripe_request_ctx ctx = {
> + .start_disk_sector = MaxSector,
> + };
> const int rw = bio_data_dir(bi);
> enum stripe_result res;
> DEFINE_WAIT(w);
> @@ -6021,7 +6103,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> }
>
> logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
> - last_sector = bio_end_sector(bi);
> + ctx.last_sector = bio_end_sector(bi);
> bi->bi_next = NULL;
>
> /* Bail out if conflicts with reshape and REQ_NOWAIT is set */
> @@ -6036,7 +6118,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> }
> md_account_bio(mddev, &bi);
> prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
> - while (logical_sector < last_sector) {
> + while (logical_sector < ctx.last_sector) {
> res = make_stripe_request(mddev, conf, &ctx, logical_sector,
> bi);
> if (res == STRIPE_FAIL) {
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 638d29863503..e73b58844f83 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -308,6 +308,7 @@ enum r5dev_flags {
> R5_Wantwrite,
> R5_Overlap, /* There is a pending overlapping request
> * on this block */
> + R5_BioReady, /* The current bio can be added to this disk */

This doesn't seem right to me, since the comment describes the bio's
status while the other flags are for the r5dev.

Thanks,
Guoqing

2022-04-27 17:20:25

by Logan Gunthorpe

Subject: Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()



On 2022-04-26 20:06, Guoqing Jiang wrote:
>>   +static int add_all_stripe_bios(struct stripe_head *sh, struct bio *bi,
>> +        sector_t first_logical_sector, sector_t last_sector,
>> +        int forwrite, int previous)
>> +{
>> +    int dd_idx;
>> +    int ret = 1;
>> +
>> +    spin_lock_irq(&sh->stripe_lock);
>> +
>> +    for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
>> +        struct r5dev *dev = &sh->dev[dd_idx];
>> +
>> +        clear_bit(R5_BioReady, &dev->flags);
>> +
>> +        if (dd_idx == sh->pd_idx)
>> +            continue;
>> +
>> +        if (dev->sector < first_logical_sector ||
>> +            dev->sector >= last_sector)
>> +            continue;
>> +
>> +        if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
>> +            set_bit(R5_Overlap, &dev->flags);
>> +            ret = 0;
>> +            continue;
>> +        }
>> +
>> +        set_bit(R5_BioReady, &dev->flags);
>
> Is  it possible to just call __add_stripe_bio here? And change above
> "continue"
> to "return",

No. The reason it was done this way is that we either have to add the
bio to all the disks or to none of them. Otherwise, if one fails and we
have to retry, we can't know which ones were already added.
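
For illustration, the check-then-commit shape is roughly the following
toy stand-alone sketch (made-up types, not the md code; the real
add_all_stripe_bios() additionally sets R5_Overlap on the conflicting
disk before giving up so the request can be retried once the overlap
clears):

#include <stdbool.h>
#include <stdio.h>

#define NDISKS 4

struct slot {
        bool busy;      /* stand-in for an overlapping pending request */
        bool added;
};

static bool add_all_or_none(struct slot *slots, int n)
{
        int i;

        for (i = 0; i < n; i++)         /* pass 1: check only, change nothing */
                if (slots[i].busy)
                        return false;   /* caller retries the whole stripe */

        for (i = 0; i < n; i++)         /* pass 2: commit to every slot */
                slots[i].added = true;

        return true;
}

int main(void)
{
        struct slot disks[NDISKS] = { { .busy = true } };

        printf("first try:  %s\n",
               add_all_or_none(disks, NDISKS) ? "added" : "retry");

        disks[0].busy = false;          /* the overlap has cleared */
        printf("second try: %s\n",
               add_all_or_none(disks, NDISKS) ? "added" : "retry");
        return 0;
}

Nothing is half-added after the failed first attempt, so the retry can
simply process the same stripe again from scratch.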

>> @@ -5869,6 +5911,10 @@ enum stripe_result {
>>   struct stripe_request_ctx {
>>       bool do_flush;
>>       struct stripe_head *batch_last;
>> +    sector_t disk_sector_done;
>> +    sector_t start_disk_sector;
>> +    bool first_wrap;
>> +    sector_t last_sector;
>>   };
>
> Could you add some comments to above members if possible?

Yes, Christoph already asked for this and I've reworked this patch to
make it much clearer for v3.

>>   static enum stripe_result make_stripe_request(struct mddev *mddev,
>> @@ -5908,6 +5954,36 @@ static enum stripe_result
>> make_stripe_request(struct mddev *mddev,
>>         new_sector = raid5_compute_sector(conf, logical_sector, previous,
>>                         &dd_idx, NULL);
>> +
>> +    /*
>> +     * This is a tricky algorithm to figure out which stripe_heads that
>> +     * have already been visited and exit early if the stripe_head has
>> +     * already been done. (Seeing all disks are added to a stripe_head
>> +     * once in add_all_stripe_bios().
>> +     *
>> +     * To start with, the disk sector of the last stripe that has been
>> +     * completed is stored in ctx->disk_sector_done. If the
>> new_sector is
>> +     * less than this value, the stripe_head has already been done.
>> +     *
>> +     * There's one issue with this: if the request starts in the
>> middle of
>> +     * a chunk, all the stripe heads before the starting offset will be
>> +     * missed. To account for this, set the first_wrap boolean to true
>> +     * if new_sector is less than the starting sector. Clear the
>> +     * boolean once the start sector is hit for the second time.
>> +     * When first_wrap is set, ignore the disk_sector_done.
>> +     */
>> +    if (ctx->start_disk_sector == MaxSector) {
>> +        ctx->start_disk_sector = new_sector;
>> +    } else if (new_sector < ctx->start_disk_sector) {
>> +        ctx->first_wrap = true;
>> +    } else if (new_sector == ctx->start_disk_sector) {
>> +        ctx->first_wrap = false;
>> +        ctx->start_disk_sector = 0;
>> +        return STRIPE_SUCCESS;
>> +    } else if (!ctx->first_wrap && new_sector <=
>> ctx->disk_sector_done) {
>> +        return STRIPE_SUCCESS;
>> +    }
>> +
>
> Hmm, with above tricky algorithm, I guess the point is that we can avoid
> to call below
> stripe_add_to_batch_list where has hash_lock contention. If so, maybe we
> can change
> stripe_can_batch for the purpose.

No, that's not the purpose. The purpose is to add the bio to the stripe
for every disk, so that we can avoid calling find_get_stripe() for every
single page and only call it once per stripe head.


>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index 638d29863503..e73b58844f83 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -308,6 +308,7 @@ enum r5dev_flags {
>>       R5_Wantwrite,
>>       R5_Overlap,    /* There is a pending overlapping request
>>                * on this block */
>> +    R5_BioReady,    /* The current bio can be added to this disk */
>
> This doesn't seem right to me, since the comment describes bio status
> while others
> are probably for r5dev.

I'm not sure I understand the objection. If you have a better option
please let me know.

I'm still working on this patch. Caught a couple more rare bugs that I'm
working to fix. The next version should also hopefully be clearer.

Thanks,

Logan


2022-04-28 01:08:50

by Logan Gunthorpe

Subject: Re: [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch



On 2022-04-26 19:36, Guoqing Jiang wrote:
> On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
>>     /* we only do back search */
>> -static void stripe_add_to_batch_list(struct r5conf *conf, struct
>> stripe_head *sh)
>> +static void stripe_add_to_batch_list(struct r5conf *conf,
>> +        struct stripe_head *sh, struct stripe_head *last_sh)
>
> Nit, from stripe_add_to_batch_list's view, I think "head_sh" makes more
> sense than
> "last_sh".

That made sense to me, but upon a closer look while making the change, I
think it's not a good idea:

stripe_add_to_batch_list() already has a stripe_head variable called
"head". If it now has an argument called "head_sh", it becomes very
confusing. This statement wouldn't make any sense:
+    if (last_sh && head_sector == last_sh->sector) {
+        head = last_sh;
+        atomic_inc(&head->count);
+    } else {

If it were changed to "head = head_sh", what would that even mean?

From stripe_add_to_batch_list's perspective, it is the "last" stripe
head. And it then decides whether it is the correct stripe to use as
the head of the list to add to.

So I decline to make this change.

Thanks,

Logan



2022-04-28 07:02:47

by Guoqing Jiang

Subject: Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()



On 4/28/22 12:18 AM, Logan Gunthorpe wrote:
>>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>>> index 638d29863503..e73b58844f83 100644
>>> --- a/drivers/md/raid5.h
>>> +++ b/drivers/md/raid5.h
>>> @@ -308,6 +308,7 @@ enum r5dev_flags {
>>>       R5_Wantwrite,
>>>       R5_Overlap,    /* There is a pending overlapping request
>>>                * on this block */
>>> +    R5_BioReady,    /* The current bio can be added to this disk */
>> This doesn't seem right to me, since the comment describes the bio's
>> status while the others are probably for the r5dev.
> I'm not sure I understand the objection. If you have a better option
> please let me know.

As I said, the flag is for the r5dev, not for the bio, so please don't
confuse people about it.

My personal suggestion would be to change it to R5_Ready_For_BIO or
something else, then update the comment accordingly.
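
For example, something along these lines (just a sketch; the final name
and comment wording are of course up to you):

enum r5dev_flags {
        /* ... existing flags ... */
        R5_Ready_For_BIO,       /* this r5dev can accept the bio that is
                                 * currently being added */
};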

> I'm still working on this patch. Caught a couple more rare bugs that I'm
> working to fix. The next version should also hopefully be clearer.

Thanks,
Guoqing

2022-04-29 14:16:53

by Guoqing Jiang

[permalink] [raw]
Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 4/29/22 5:22 AM, Logan Gunthorpe wrote:
>
> On 2022-04-25 10:12, Xiao Ni wrote:
>>> I do know that lkp-tests has run it on this series as I did get an error
>>> from it. But while I'm pretty sure that error has been resolved, I was
>>> never able to figure out how to run them locally.
>>>
>> Hi Logan
>>
>> You can clone the mdadm repo at
>> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>> Then you will find a script named "test" in the top-level directory; it
>> is not under the tests directory. The test cases themselves are under
>> the tests directory.
> So I've been fighting with this, and it seems there are just a ton of
> failures in these tests even without my changes. Running on the latest
> master (52c67fcdd6dad) with a stock v5.17.5 kernel I see major brokenness:
> about 17 out of 44 of the tests that ran failed. I had to run with
> --disable-integrity because those tests seem to hang in an infinite loop
> waiting for the md array to go into the U state (even though it appears idle).
>
> Even though I ran the tests with '--keep-going', the testing stopped
> after 07revert-grow reported errors in dmesg -- even though the only
> errors printed to dmesg were from mdadm segfaulting.
>
> Running on md/md-next seems to get a bit further (to
> 10ddf-create-fail-rebuild) and stops with the same segfaulting issue (or
> perhaps the 07 test only randomly fails first -- I haven't run it that
> many times). Though most of the tests between these points fail anyway.
>
> My upcoming v3 patches cause no failures that are different from the
> md/md-next branch. But it seems these tests have rotted to the point
> that they aren't all that useful; or maybe there are a ton of
> regressions in the kernel already and nobody was paying much attention.

I can't agree with you anymore. I would say some patches were submitted
without running enough tests, and then, kernel release after kernel
release, things have become worse.

This is also the reason that I recommend running the mdadm tests: since
md raid is a complex subsystem, even a simple change could cause a
regression. And considering there are very few developers and reviewers
in the community, the chance of causing a regression gets bigger.

> I have also tried to test certain cases that appear broken in recent
> kernels anyway (for example, reducing the number of disks in a raid5
> array hangs on the first stripe to reshape).
>
> In any case I have a very rough ad-hoc test suite I've been expanding
> that is targeted at testing my specific changes. Testing these changes
> has definitely been challenging. In any case, I've published my tests here:
>
> https://github.com/Eideticom/raid5-tests

If I may ask, is it possible to submit your tests to mdadm as well, so we
have one common place that contains enough tests?

Thanks,
Guoqing

2022-04-30 17:28:55

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 2022-04-28 18:49, Guoqing Jiang wrote:
> I can't agree with you anymore. I would say some patches were submitted
> without running enough tests, and then, kernel release after kernel
> release, things have become worse.

I'm not sure where we disagree here. I certainly don't want to introduce
regressions myself. I haven't submitted v3 yet because I've become less
certain that there are no regressions in it. The point of my last email
was to try to explain that I am taking testing seriously.

> This is also the reason that I recommend running the mdadm tests: since
> md raid is a complex subsystem, even a simple change could cause a
> regression. And considering there are very few developers and reviewers
> in the community, the chance of causing a regression gets bigger.

While I'd certainly like to run mdadm tests, they appear to be very
broken to me. Too broken for me to fix all of it -- I don't have time
for fixing that many issues. Seems I'm not the only one to run into this
problem recently:

https://lore.kernel.org/linux-raid/[email protected]/T/#t

And it's a shame nobody could even bother to remove the unsupported 0.9
metadata tests from the repo as a result of this conversation.

> If I may ask, is it possible to submit your tests to mdadm as well, so we
> have one common place that contains enough tests?

I'd certainly consider that if I could run the test suite. Though one
hitch is that I've found I need to run my tests repeatedly, for hours,
before hitting some rare bugs. Running the tests only once is much
easier to pass. It's hard to fully test things like this with so many
rare retry paths in a simple regression test.

Logan

2022-05-03 00:47:46

by Guoqing Jiang

[permalink] [raw]
Subject: Re: [PATCH v2 00/12] Improve Raid5 Lock Contention



On 4/30/22 12:01 AM, Logan Gunthorpe wrote:
>
> On 2022-04-28 18:49, Guoqing Jiang wrote:
>> I can't agree with you anymore. I would say some patches were submitted
>> without running enough tests, and then, kernel release after kernel
>> release, things have become worse.
> I'm not sure where we disagree here. I certainly don't want to introduce
> regressions myself. I haven't submitted v3 yet because I've become less
> certain that there are no regressions in it. The point of my last email
> was to try to explain that I am taking testing seriously.

That is my intention too: no more new regressions.

>> This is also the reason that I recommend running the mdadm tests: since
>> md raid is a complex subsystem, even a simple change could cause a
>> regression. And considering there are very few developers and reviewers
>> in the community, the chance of causing a regression gets bigger.
> While I'd certainly like to run mdadm tests, they appear to be very
> broken to me. Too broken for me to fix all of it -- I don't have time
> for fixing that many issues.

I do agree it is not reasonable to ask you to fix them; just compare the
test results with and without your set, so that at least there are no new
failures, as said.

> Seems I'm not the only one to run into this problem recently:
>
> https://lore.kernel.org/linux-raid/[email protected]/T/#t
>
> And it's a shame nobody could even bother to remove the unsupported 0.9
> metadata tests from the repo as a result of this conversation.
>
>> If I may ask, is it possible to submit your tests to mdadm as well, so we
>> have one common place that contains enough tests?
> I'd certainly consider that if I could run the test suite. Though one
> hitch is that I've found I need to run my tests repeatedly, for hours,
> before hitting some rare bugs. Running the tests only once is much
> easier to pass. It's hard to fully test things like this with so many
> rare retry paths in a simple regression test.

Let's focus on the raid456 tests, given the code only touches raid5. You
can pass arguments like this, FYI:

mdadm> ./test --raidtype=raid456 --dev=loop

Thanks,
Guoqing