2022-09-16 11:38:11

by Yu Kuai

[permalink] [raw]
Subject: [PATCH v3 0/5] md/raid10: reduce lock contention for io

From: Yu Kuai <[email protected]>

Changes in v3:
- split a patch from patch 1
- only modify hot path in patch 3
- wake up barrier if 'nr_pending' is decreased to 0 in
wait_barrier_nolock(), otherwise raise_barrier() might hung.

Changes in v2:
- add patch 1, as suggested by Logan Gunthorpe.
- in patch 4, instead of use spin_lock/unlock in wait_event, which will
confuse lockdep, use write_seqlock/unlock instead.
- in patch 4, use read_seqbegin() to get seqcount instead of unusual
usage of raw_read_seqcount().
- test result is different from v1 in aarch64 due to retest from different
environment.

This patchset tries to avoid that two locks are held unconditionally
in hot path.

Test environment:

Architecture:
aarch64 Huawei KUNPENG 920
x86 Intel(R) Xeon(R) Platinum 8380

Raid10 initialize:
mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Test cmd:
(task set -c 0-15) fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread

Test result:

aarch64:
before this patchset: 3.2 GiB/s
bind node before this patchset: 6.9 Gib/s
after this patchset: 7.9 Gib/s
bind node after this patchset: 8.0 Gib/s

x86:(bind node is not tested yet)
before this patchset: 7.0 GiB/s
after this patchset : 9.3 GiB/s

Please noted that in the test machine, memory access latency is very bad
across nodes compare to local node in aarch64, which is why bandwidth
while bind node is much better.

Yu Kuai (5):
md/raid10: Factor out code from wait_barrier() to
stop_waiting_barrier()
md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case
nowait
md/raid10: prevent unnecessary calls to wake_up() in fast path
md/raid10: fix improper BUG_ON() in raise_barrier()
md/raid10: convert resync_lock to use seqlock

drivers/md/raid10.c | 149 ++++++++++++++++++++++++++++----------------
drivers/md/raid10.h | 2 +-
2 files changed, 96 insertions(+), 55 deletions(-)

--
2.31.1


2022-09-16 11:40:18

by Yu Kuai

[permalink] [raw]
Subject: [PATCH v3 3/5] md/raid10: prevent unnecessary calls to wake_up() in fast path

From: Yu Kuai <[email protected]>

Currently, wake_up() is called unconditionally in fast path such as
raid10_make_request(), which will cause lock contention under high
concurrency:

raid10_make_request
wake_up
__wake_up_common_lock
spin_lock_irqsave

Improve performance by only call wake_up() if waitqueue is not empty
in allow_barrier() and raid10_make_request().

Signed-off-by: Yu Kuai <[email protected]>
---
drivers/md/raid10.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index df435d693637..cf452c25e1e5 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -274,6 +274,12 @@ static void put_buf(struct r10bio *r10_bio)
lower_barrier(conf);
}

+static void wake_up_barrier(struct r10conf *conf)
+{
+ if (wq_has_sleeper(&conf->wait_barrier))
+ wake_up(&conf->wait_barrier);
+}
+
static void reschedule_retry(struct r10bio *r10_bio)
{
unsigned long flags;
@@ -1015,7 +1021,7 @@ static void allow_barrier(struct r10conf *conf)
{
if ((atomic_dec_and_test(&conf->nr_pending)) ||
(conf->array_freeze_pending))
- wake_up(&conf->wait_barrier);
+ wake_up_barrier(conf);
}

static void freeze_array(struct r10conf *conf, int extra)
@@ -1891,7 +1897,7 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
__make_request(mddev, bio, sectors);

/* In case raid10d snuck in to freeze_array */
- wake_up(&conf->wait_barrier);
+ wake_up_barrier(conf);
return true;
}

--
2.31.1

2022-09-16 11:44:04

by Yu Kuai

[permalink] [raw]
Subject: [PATCH v3 1/5] md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()

From: Yu Kuai <[email protected]>

Currently the nasty condition in wait_barrier() is hard to read. This
patch factors out the condition into a function.

There are no functional changes.

Signed-off-by: Yu Kuai <[email protected]>
Acked-by: Paul Menzel <[email protected]>
---
drivers/md/raid10.c | 50 +++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 64d6e4cd8a3a..37fd9284054e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -957,41 +957,47 @@ static void lower_barrier(struct r10conf *conf)
wake_up(&conf->wait_barrier);
}

+static bool stop_waiting_barrier(struct r10conf *conf)
+{
+ struct bio_list *bio_list = current->bio_list;
+
+ /* barrier is dropped */
+ if (!conf->barrier)
+ return true;
+
+ /*
+ * If there are already pending requests (preventing the barrier from
+ * rising completely), and the pre-process bio queue isn't empty, then
+ * don't wait, as we need to empty that queue to get the nr_pending
+ * count down.
+ */
+ if (atomic_read(&conf->nr_pending) && bio_list &&
+ (!bio_list_empty(&bio_list[0]) || !bio_list_empty(&bio_list[1])))
+ return true;
+
+ /* move on if recovery thread is blocked by us */
+ if (conf->mddev->thread->tsk == current &&
+ test_bit(MD_RECOVERY_RUNNING, &conf->mddev->recovery) &&
+ conf->nr_queued > 0)
+ return true;
+
+ return false;
+}
+
static bool wait_barrier(struct r10conf *conf, bool nowait)
{
bool ret = true;

spin_lock_irq(&conf->resync_lock);
if (conf->barrier) {
- struct bio_list *bio_list = current->bio_list;
conf->nr_waiting++;
- /* Wait for the barrier to drop.
- * However if there are already pending
- * requests (preventing the barrier from
- * rising completely), and the
- * pre-process bio queue isn't empty,
- * then don't wait, as we need to empty
- * that queue to get the nr_pending
- * count down.
- */
/* Return false when nowait flag is set */
if (nowait) {
ret = false;
} else {
raid10_log(conf->mddev, "wait barrier");
wait_event_lock_irq(conf->wait_barrier,
- !conf->barrier ||
- (atomic_read(&conf->nr_pending) &&
- bio_list &&
- (!bio_list_empty(&bio_list[0]) ||
- !bio_list_empty(&bio_list[1]))) ||
- /* move on if recovery thread is
- * blocked by us
- */
- (conf->mddev->thread->tsk == current &&
- test_bit(MD_RECOVERY_RUNNING,
- &conf->mddev->recovery) &&
- conf->nr_queued > 0),
+ stop_waiting_barrier(conf),
conf->resync_lock);
}
conf->nr_waiting--;
--
2.31.1

2022-09-16 18:42:49

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()



On 2022-09-16 05:34, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Currently the nasty condition in wait_barrier() is hard to read. This
> patch factors out the condition into a function.
>
> There are no functional changes.
>
> Signed-off-by: Yu Kuai <[email protected]>
> Acked-by: Paul Menzel <[email protected]>

Reviewed-by: Logan Gunthorpe <[email protected]>

2022-09-16 18:49:32

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] md/raid10: prevent unnecessary calls to wake_up() in fast path



On 2022-09-16 05:34, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Currently, wake_up() is called unconditionally in fast path such as
> raid10_make_request(), which will cause lock contention under high
> concurrency:
>
> raid10_make_request
> wake_up
> __wake_up_common_lock
> spin_lock_irqsave
>
> Improve performance by only call wake_up() if waitqueue is not empty
> in allow_barrier() and raid10_make_request().
>
> Signed-off-by: Yu Kuai <[email protected]>

Reviewed-by: Logan Gunthorpe <[email protected]>

2022-09-18 12:16:58

by Guoqing Jiang

[permalink] [raw]
Subject: Re: [PATCH v3 3/5] md/raid10: prevent unnecessary calls to wake_up() in fast path



On 9/16/22 7:34 PM, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Currently, wake_up() is called unconditionally in fast path such as
> raid10_make_request(), which will cause lock contention under high
> concurrency:
>
> raid10_make_request
> wake_up
> __wake_up_common_lock
> spin_lock_irqsave
>
> Improve performance by only call wake_up() if waitqueue is not empty
> in allow_barrier() and raid10_make_request().
>
> Signed-off-by: Yu Kuai <[email protected]>
> ---
> drivers/md/raid10.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index df435d693637..cf452c25e1e5 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -274,6 +274,12 @@ static void put_buf(struct r10bio *r10_bio)
> lower_barrier(conf);
> }
>
> +static void wake_up_barrier(struct r10conf *conf)
> +{
> + if (wq_has_sleeper(&conf->wait_barrier))
> + wake_up(&conf->wait_barrier);
> +}
> +
> static void reschedule_retry(struct r10bio *r10_bio)
> {
> unsigned long flags;
> @@ -1015,7 +1021,7 @@ static void allow_barrier(struct r10conf *conf)
> {
> if ((atomic_dec_and_test(&conf->nr_pending)) ||
> (conf->array_freeze_pending))
> - wake_up(&conf->wait_barrier);
> + wake_up_barrier(conf);
> }
>
> static void freeze_array(struct r10conf *conf, int extra)
> @@ -1891,7 +1897,7 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
> __make_request(mddev, bio, sectors);
>
> /* In case raid10d snuck in to freeze_array */
> - wake_up(&conf->wait_barrier);
> + wake_up_barrier(conf);
> return true;
> }
>

Acked-by: Guoqing Jiang <[email protected]>

Thanks,
Guoqing

2022-09-18 12:26:59

by Guoqing Jiang

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()



On 9/16/22 7:34 PM, Yu Kuai wrote:
> From: Yu Kuai <[email protected]>
>
> Currently the nasty condition in wait_barrier() is hard to read. This
> patch factors out the condition into a function.
>
> There are no functional changes.
>
> Signed-off-by: Yu Kuai <[email protected]>
> Acked-by: Paul Menzel <[email protected]>
> ---
> drivers/md/raid10.c | 50 +++++++++++++++++++++++++--------------------
> 1 file changed, 28 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 64d6e4cd8a3a..37fd9284054e 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -957,41 +957,47 @@ static void lower_barrier(struct r10conf *conf)
> wake_up(&conf->wait_barrier);
> }
>
> +static bool stop_waiting_barrier(struct r10conf *conf)
> +{
> + struct bio_list *bio_list = current->bio_list;
> +
> + /* barrier is dropped */
> + if (!conf->barrier)
> + return true;
> +
> + /*
> + * If there are already pending requests (preventing the barrier from
> + * rising completely), and the pre-process bio queue isn't empty, then
> + * don't wait, as we need to empty that queue to get the nr_pending
> + * count down.
> + */
> + if (atomic_read(&conf->nr_pending) && bio_list &&
> + (!bio_list_empty(&bio_list[0]) || !bio_list_empty(&bio_list[1])))
> + return true;
> +
> + /* move on if recovery thread is blocked by us */
> + if (conf->mddev->thread->tsk == current &&
> + test_bit(MD_RECOVERY_RUNNING, &conf->mddev->recovery) &&
> + conf->nr_queued > 0)
> + return true;
> +
> + return false;
> +}
> +
> static bool wait_barrier(struct r10conf *conf, bool nowait)
> {
> bool ret = true;
>
> spin_lock_irq(&conf->resync_lock);
> if (conf->barrier) {
> - struct bio_list *bio_list = current->bio_list;
> conf->nr_waiting++;
> - /* Wait for the barrier to drop.
> - * However if there are already pending
> - * requests (preventing the barrier from
> - * rising completely), and the
> - * pre-process bio queue isn't empty,
> - * then don't wait, as we need to empty
> - * that queue to get the nr_pending
> - * count down.
> - */
> /* Return false when nowait flag is set */
> if (nowait) {
> ret = false;
> } else {
> raid10_log(conf->mddev, "wait barrier");
> wait_event_lock_irq(conf->wait_barrier,
> - !conf->barrier ||
> - (atomic_read(&conf->nr_pending) &&
> - bio_list &&
> - (!bio_list_empty(&bio_list[0]) ||
> - !bio_list_empty(&bio_list[1]))) ||
> - /* move on if recovery thread is
> - * blocked by us
> - */
> - (conf->mddev->thread->tsk == current &&
> - test_bit(MD_RECOVERY_RUNNING,
> - &conf->mddev->recovery) &&
> - conf->nr_queued > 0),
> + stop_waiting_barrier(conf),
> conf->resync_lock);
> }
> conf->nr_waiting--;

Acked-by: Guoqing Jiang <[email protected]>

Thanks,
Guoqing