2022-03-29 13:57:19

by Yu Kuai

Subject: [PATCH -next RFC 0/6] improve large random io for HDD

There is a defect in blk-mq compared to blk-sq: split io will end up
discontinuous if the device is under high io pressure, while split io
will still be continuous in sq. This is because:

1) split bios are issued one by one, and if one bio can't get a tag, it
will go to wait. - patch 2
2) each time 8 (or wake_batch) requests are done, 8 waiters will be woken up.
Thus if a thread is woken up, it is unlikely to get multiple tags.
- patch 3,4
3) new io can preempt tags even if there are lots of threads waiting for
tags. - patch 5

Test environment:
x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
max_sectors_kb is set to 128.

I haven't tested this patchset on a physical machine yet; I'll try later
if anyone thinks this approach is meaningful.

Fio test cmd:
[global]
filename=/dev/sda
ioengine=libaio
direct=1
offset_increment=100m

[test]
rw=randwrite
bs=512k
numjobs=256
iodepth=2
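
For reference, a hypothetical userspace sketch (not part of the patchset)
of the tag demand these parameters create; it only restates the
arithmetic: every 512k io is split on the 128k limit before tags are
allocated.

#include <stdio.h>

int main(void)
{
	/* parameters from the test environment and fio job above */
	unsigned int bs_kb = 512;		/* fio block size */
	unsigned int max_sectors_kb = 128;	/* queue limit */
	unsigned int queue_depth = 32;

	/* each io is split into ceil(bs / max_sectors_kb) bios, one tag each */
	unsigned int nr_split = (bs_kb + max_sectors_kb - 1) / max_sectors_kb;

	printf("bios (tags) per io: %u\n", nr_split);		/* 4 */
	printf("ios the queue can hold at once: %u\n",
	       queue_depth / nr_split);				/* 8 */
	return 0;
}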

Result: ratio of sequential io (calculated from blktrace logs)
original: 21%
patched (split io thoroughly, wake up based on required tags): 40%
patched with the flag set (tag preemption disabled): 69%

Yu Kuai (6):
blk-mq: add a new flag 'BLK_MQ_F_NO_TAG_PREEMPTION'
block: refactor to split bio thoroughly
blk-mq: record how many tags are needed for split bio
sbitmap: wake up the number of threads based on required tags
blk-mq: don't preempt tag except for split bios
sbitmap: force tag preemption if free tags are sufficient

block/bio.c | 2 +
block/blk-merge.c | 95 ++++++++++++++++++++++++++++-----------
block/blk-mq-debugfs.c | 1 +
block/blk-mq-tag.c | 39 +++++++++++-----
block/blk-mq.c | 14 +++++-
block/blk-mq.h | 7 +++
block/blk.h | 3 +-
include/linux/blk-mq.h | 7 ++-
include/linux/blk_types.h | 6 +++
include/linux/sbitmap.h | 8 ++++
lib/sbitmap.c | 33 +++++++++++++-
11 files changed, 173 insertions(+), 42 deletions(-)

--
2.31.1


2022-03-29 14:11:07

by Yu Kuai

Subject: [PATCH -next RFC 5/6] blk-mq: don't preempt tag except for split bios

In order to improve the sequentiality of split io, this patch disables
tag preemption for the first split bio and for other non-split bios if
the device is under high io pressure.

Note that this solution relies on the sbitmap waitqueues being balanced;
otherwise it may happen that 'wake_batch' tags are freed while the woken
waiters do not obtain 'wake_batch' new tags, so io concurrency drops.
The next patch works around that problem; however, fixing the unfairness
of the waitqueues might be better.
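
For readability, here is the core check added below, restated with
comments; the function and fields are the ones introduced by this series,
so this is just a commented copy of the hunk in blk-mq-tag.c, not extra
code:

static inline bool preempt_tag(struct blk_mq_alloc_data *data,
			       struct sbitmap_queue *bt)
{
	/*
	 * A caller may still grab a tag directly if it is marked
	 * preemptive (split bios after the first, a waiter that has
	 * already been woken once, or any caller on a queue that did
	 * not set BLK_MQ_F_NO_TAG_PREEMPTION), or if only a few
	 * waitqueues are active, i.e. io pressure is still low.
	 */
	return data->preemption ||
	       atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
}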

Signed-off-by: Yu Kuai <[email protected]>
---
block/blk-merge.c | 7 ++++++-
block/blk-mq-tag.c | 37 ++++++++++++++++++++++++++-----------
block/blk-mq.c | 6 ++++++
block/blk-mq.h | 1 +
include/linux/blk_types.h | 2 ++
lib/sbitmap.c | 14 ++++++++++----
6 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 340860746cac..fd4bbf773b45 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -357,6 +357,11 @@ static unsigned short blk_queue_split_all(struct request_queue *q,
if (!first)
first = split;

+ /*
+ * Except for the first split bio, the others always preempt
+ * tags, so that they can stay sequential.
+ */
+ split->bi_opf |= REQ_PREEMPTIVE;
nr_split++;
submit_bio_noacct(split);
}
@@ -387,7 +392,7 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio)

if (split) {
split->bi_nr_split = blk_queue_split_all(q, *bio);
- (*bio)->bi_opf |= REQ_SPLIT;
+ (*bio)->bi_opf |= (REQ_SPLIT | REQ_PREEMPTIVE);
submit_bio_noacct(*bio);
*bio = split;
}
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 83dfbe2f1cfc..4e485bcc5820 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -127,6 +127,13 @@ unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
return ret;
}

+static inline bool preempt_tag(struct blk_mq_alloc_data *data,
+ struct sbitmap_queue *bt)
+{
+ return data->preemption ||
+ atomic_read(&bt->ws_active) <= SBQ_WAIT_QUEUES;
+}
+
unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
{
struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
@@ -148,12 +155,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
tag_offset = tags->nr_reserved_tags;
}

- tag = __blk_mq_get_tag(data, bt);
- if (tag != BLK_MQ_NO_TAG)
- goto found_tag;
+ if (data->flags & BLK_MQ_REQ_NOWAIT || preempt_tag(data, bt)) {
+ tag = __blk_mq_get_tag(data, bt);
+ if (tag != BLK_MQ_NO_TAG)
+ goto found_tag;

- if (data->flags & BLK_MQ_REQ_NOWAIT)
- return BLK_MQ_NO_TAG;
+ if (data->flags & BLK_MQ_REQ_NOWAIT)
+ return BLK_MQ_NO_TAG;
+ }

wait.nr_tags += data->nr_split;
ws = bt_wait_ptr(bt, data->hctx);
@@ -171,20 +180,26 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
* Retry tag allocation after running the hardware queue,
* as running the queue may also have found completions.
*/
- tag = __blk_mq_get_tag(data, bt);
- if (tag != BLK_MQ_NO_TAG)
- break;
+ if (preempt_tag(data, bt)) {
+ tag = __blk_mq_get_tag(data, bt);
+ if (tag != BLK_MQ_NO_TAG)
+ break;
+ }

sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);

- tag = __blk_mq_get_tag(data, bt);
- if (tag != BLK_MQ_NO_TAG)
- break;
+ if (preempt_tag(data, bt)) {
+ tag = __blk_mq_get_tag(data, bt);
+ if (tag != BLK_MQ_NO_TAG)
+ break;
+ }

bt_prev = bt;
io_schedule();

sbitmap_finish_wait(bt, ws, &wait);
+ if (!blk_mq_is_tag_preemptive(data->hctx->flags))
+ data->preemption = true;

data->ctx = blk_mq_get_ctx(data->q);
data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9bace9e2c5ca..06ba6fa9ec1a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -470,6 +470,9 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
retry:
data->ctx = blk_mq_get_ctx(q);
data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
+ if (blk_mq_is_tag_preemptive(data->hctx->flags))
+ data->preemption = true;
+
if (!(data->rq_flags & RQF_ELV))
blk_mq_tag_busy(data->hctx);

@@ -577,6 +580,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
data.hctx = xa_load(&q->hctx_table, hctx_idx);
if (!blk_mq_hw_queue_mapped(data.hctx))
goto out_queue_exit;
+ if (blk_mq_is_tag_preemptive(data.hctx->flags))
+ data.preemption = true;
cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
data.ctx = __blk_mq_get_ctx(q, cpu);

@@ -2738,6 +2743,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
.nr_tags = 1,
.cmd_flags = bio->bi_opf,
.nr_split = bio->bi_nr_split,
+ .preemption = (bio->bi_opf & REQ_PREEMPTIVE),
};
struct request *rq;

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3eabe394a5a9..915bb710dd6f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -157,6 +157,7 @@ struct blk_mq_alloc_data {
/* allocate multiple requests/tags in one go */
unsigned int nr_tags;
unsigned int nr_split;
+ bool preemption;
struct request **cached_rq;

/* input & output parameter */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 702f6b83dc88..8fd9756f0a06 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -419,6 +419,7 @@ enum req_flag_bits {
__REQ_DRV,
__REQ_SWAP, /* swapping request. */
__REQ_SPLIT, /* io is split */
+ __REQ_PREEMPTIVE, /* io can preempt tag */
__REQ_NR_BITS, /* stops here */
};

@@ -444,6 +445,7 @@ enum req_flag_bits {
#define REQ_DRV (1ULL << __REQ_DRV)
#define REQ_SWAP (1ULL << __REQ_SWAP)
#define REQ_SPLIT (1ULL << __REQ_SPLIT)
+#define REQ_PREEMPTIVE (1ULL << __REQ_PREEMPTIVE)

#define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 9d04c0ecc8f7..1655c15ee11d 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -597,7 +597,8 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
return NULL;
}

-static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
+static unsigned int get_wake_nr(struct sbq_wait_state *ws,
+ unsigned int *nr_tags)
{
struct sbq_wait *wait;
struct wait_queue_entry *entry;
@@ -606,11 +607,13 @@ static unsigned int get_wake_nr(struct sbq_wait_state *ws, unsigned int nr_tags)
spin_lock_irq(&ws->wait.lock);
list_for_each_entry(entry, &ws->wait.head, entry) {
wait = container_of(entry, struct sbq_wait, wait);
- if (nr_tags <= wait->nr_tags)
+ if (*nr_tags <= wait->nr_tags) {
+ *nr_tags = 0;
break;
+ }

nr++;
- nr_tags -= wait->nr_tags;
+ *nr_tags -= wait->nr_tags;
}
spin_unlock_irq(&ws->wait.lock);

@@ -648,7 +651,10 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq)
ret = atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wake_batch);
if (ret == wait_cnt) {
sbq_index_atomic_inc(&sbq->wake_index);
- wake_up_nr(&ws->wait, get_wake_nr(ws, wake_batch));
+ wake_up_nr(&ws->wait, get_wake_nr(ws, &wake_batch));
+ if (wake_batch)
+ sbitmap_queue_wake_all(sbq);
+
return false;
}

--
2.31.1

2022-03-29 17:48:04

by Yu Kuai

Subject: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

Currently, the split bio is handled first, and then the original bio
continues to be split. This patch splits the original bio thoroughly up
front, so that it can be known in advance how many tags will be needed.
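
As a worked example of the bookkeeping (a sketch using the cover letter's
parameters, not part of the change itself):

/*
 * A 512k bio on a queue with max_sectors_kb = 128 ends up as four 128k
 * pieces: blk_queue_split_one() splits off the first piece, which the
 * caller keeps; blk_queue_split_all() submits the second and third
 * pieces and returns 3; __blk_queue_split() then submits the final
 * 128k remainder.  The first piece thus carries bi_nr_split = 3,
 * meaning three more tags will be wanted right behind its own, which
 * the later patches in the series use in the tag wakeup path.
 */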

Signed-off-by: Yu Kuai <[email protected]>
---
block/bio.c | 2 +
block/blk-merge.c | 90 ++++++++++++++++++++++++++++-----------
block/blk-mq.c | 7 ++-
block/blk.h | 3 +-
include/linux/blk_types.h | 4 ++
5 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index cdd7b2915c53..ac7ce8b4ba42 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -258,6 +258,8 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
bio->bi_flags = 0;
bio->bi_ioprio = 0;
bio->bi_status = 0;
+ bio->bi_nr_segs = 0;
+ bio->bi_nr_split = 0;
bio->bi_iter.bi_sector = 0;
bio->bi_iter.bi_size = 0;
bio->bi_iter.bi_idx = 0;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7771dacc99cb..340860746cac 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,44 +309,85 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
return bio_split(bio, sectors, GFP_NOIO, bs);
}

-/**
- * __blk_queue_split - split a bio and submit the second half
- * @q: [in] request_queue new bio is being queued at
- * @bio: [in, out] bio to be split
- * @nr_segs: [out] number of segments in the first bio
- *
- * Split a bio into two bios, chain the two bios, submit the second half and
- * store a pointer to the first half in *@bio. If the second bio is still too
- * big it will be split by a recursive call to this function. Since this
- * function may allocate a new bio from q->bio_split, it is the responsibility
- * of the caller to ensure that q->bio_split is only released after processing
- * of the split bio has finished.
- */
-void __blk_queue_split(struct request_queue *q, struct bio **bio,
- unsigned int *nr_segs)
+static struct bio *blk_queue_split_one(struct request_queue *q, struct bio *bio)
{
struct bio *split = NULL;
+ unsigned int nr_segs = 1;

- switch (bio_op(*bio)) {
+ switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
- split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
+ split = blk_bio_discard_split(q, bio, &q->bio_split, &nr_segs);
break;
case REQ_OP_WRITE_ZEROES:
- split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
- nr_segs);
+ split = blk_bio_write_zeroes_split(q, bio, &q->bio_split,
+ &nr_segs);
break;
default:
- split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
+ split = blk_bio_segment_split(q, bio, &q->bio_split, &nr_segs);
break;
}

if (split) {
/* there isn't chance to merge the splitted bio */
- split->bi_opf |= REQ_NOMERGE;
+ split->bi_opf |= (REQ_NOMERGE | REQ_SPLIT);
+ split->bi_nr_segs = nr_segs;
+
+ bio_chain(split, bio);
+ trace_block_split(split, bio->bi_iter.bi_sector);
+ } else {
+ bio->bi_nr_segs = nr_segs;
+ }
+
+ return split;
+}
+
+static unsigned short blk_queue_split_all(struct request_queue *q,
+ struct bio *bio)
+{
+ struct bio *split = NULL;
+ struct bio *first = NULL;
+ unsigned short nr_split = 1;
+ unsigned short total;

- bio_chain(split, *bio);
- trace_block_split(split, (*bio)->bi_iter.bi_sector);
+ if (!current->bio_list)
+ return 1;
+
+ while ((split = blk_queue_split_one(q, bio))) {
+ if (!first)
+ first = split;
+
+ nr_split++;
+ submit_bio_noacct(split);
+ }
+
+ total = nr_split;
+ while (first) {
+ first->bi_nr_split = --total;
+ first = first->bi_next;
+ }
+
+ return nr_split;
+}
+
+/**
+ * __blk_queue_split - split a bio, store the first and submit others
+ * @q: [in] request_queue new bio is being queued at
+ * @bio: [in, out] bio to be split
+ *
+ * Split a bio into several bios, chain all the bios, store a pointer to the
+ * first in *@bio, and submit others. Since this function may allocate a new
+ * bio from q->bio_split, it is the responsibility of the caller to ensure
+ * that q->bio_split is only released after processing of the split bio has
+ * finished.
+ */
+void __blk_queue_split(struct request_queue *q, struct bio **bio)
+{
+ struct bio *split = blk_queue_split_one(q, *bio);
+
+ if (split) {
+ split->bi_nr_split = blk_queue_split_all(q, *bio);
+ (*bio)->bi_opf |= REQ_SPLIT;
submit_bio_noacct(*bio);
*bio = split;
}
@@ -365,10 +406,9 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
void blk_queue_split(struct bio **bio)
{
struct request_queue *q = bdev_get_queue((*bio)->bi_bdev);
- unsigned int nr_segs;

if (blk_may_split(q, *bio))
- __blk_queue_split(q, bio, &nr_segs);
+ __blk_queue_split(q, bio);
}
EXPORT_SYMBOL(blk_queue_split);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e6f24fa4a4c2..cad207d2079e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2817,8 +2817,11 @@ void blk_mq_submit_bio(struct bio *bio)
blk_status_t ret;

blk_queue_bounce(q, &bio);
- if (blk_may_split(q, bio))
- __blk_queue_split(q, &bio, &nr_segs);
+ if (blk_may_split(q, bio)) {
+ if (!(bio->bi_opf & REQ_SPLIT))
+ __blk_queue_split(q, &bio);
+ nr_segs = bio->bi_nr_segs;
+ }

if (!bio_integrity_prep(bio))
return;
diff --git a/block/blk.h b/block/blk.h
index 8ccbc6e07636..cd478187b525 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -303,8 +303,7 @@ static inline bool blk_may_split(struct request_queue *q, struct bio *bio)
bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
}

-void __blk_queue_split(struct request_queue *q, struct bio **bio,
- unsigned int *nr_segs);
+void __blk_queue_split(struct request_queue *q, struct bio **bio);
int ll_back_merge_fn(struct request *req, struct bio *bio,
unsigned int nr_segs);
bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dd0763a1c674..702f6b83dc88 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -250,6 +250,8 @@ struct bio {
*/
unsigned short bi_flags; /* BIO_* below */
unsigned short bi_ioprio;
+ unsigned int bi_nr_segs;
+ unsigned int bi_nr_split;
blk_status_t bi_status;
atomic_t __bi_remaining;

@@ -416,6 +418,7 @@ enum req_flag_bits {
/* for driver use */
__REQ_DRV,
__REQ_SWAP, /* swapping request. */
+ __REQ_SPLIT, /* io is split */
__REQ_NR_BITS, /* stops here */
};

@@ -440,6 +443,7 @@ enum req_flag_bits {

#define REQ_DRV (1ULL << __REQ_DRV)
#define REQ_SWAP (1ULL << __REQ_SWAP)
+#define REQ_SPLIT (1ULL << __REQ_SPLIT)

#define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
--
2.31.1

2022-03-29 19:51:49

by Jens Axboe

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On 3/29/22 3:40 AM, Yu Kuai wrote:
> Currently, the split bio is handled first, and then the original bio
> continues to be split. This patch splits the original bio thoroughly up
> front, so that it can be known in advance how many tags will be needed.

This makes the bio almost 10% bigger in size, which is NOT something
that is just done casually and without strong reasoning.

So please provide that, your commit message is completely missing
justification for why this change is useful or necessary. A good
commit always explains WHY the change needs to be made, yours is
simply stating WHAT is being done. The latter can be gleaned from
the code change anyway.

--
Jens Axboe

2022-03-30 00:57:52

by Christoph Hellwig

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On Tue, Mar 29, 2022 at 08:35:29AM -0600, Jens Axboe wrote:
> > But more importantly why does your use case even have splits that get
> > submitted together? Is this a case of Linus' stupidly low default
> > max_sectors when the hardware supports more, or is the hardware limited
> > to a low number of sectors per request? Or do we hit another reason
> > for the split?
>
> See the posted use case, it's running 512kb ios on a 128kb device.

That is an awfully low limit these days. I'm really not sure we should
optimize the block layer for that.

2022-03-30 05:21:14

by Jens Axboe

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On 3/29/22 7:32 AM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2022 at 05:40:44PM +0800, Yu Kuai wrote:
>> Currently, the split bio is handled first, and then the original bio
>> continues to be split. This patch splits the original bio thoroughly up
>> front, so that it can be known in advance how many tags will be needed.
>
> How do you avoid the deadlock risk with all the split bios being
> submitted together?

That too

> But more importantly why does your use case even have splits that get
> submitted together? Is this a case of Linus' stupidly low default
> max_sectors when the hardware supports more, or is the hardware limited
> to a low number of sectors per request? Or do we hit another reason
> for the split?

See the posted use case, it's running 512kb ios on a 128kb device.

--
Jens Axboe

2022-03-30 08:20:29

by Yu Kuai

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On 2022/03/29 22:41, Jens Axboe wrote:
> On 3/29/22 8:40 AM, Christoph Hellwig wrote:
>> On Tue, Mar 29, 2022 at 08:35:29AM -0600, Jens Axboe wrote:
>>>> But more importantly why does your use case even have splits that get
>>>> submitted together? Is this a case of Linus' stupidly low default
>>>> max_sectors when the hardware supports more, or is the hardware limited
>>>> to a low number of sectors per request? Or do we hit another reason
>>>> for the split?
>>>
>>> See the posted use case, it's running 512kb ios on a 128kb device.
Hi,

The problem was first found during a kernel upgrade (v3.10 to v4.18). We
maintain a series of io performance test suites, and one of the tests is
fio random rw with a large bs. In that environment, 'max_sectors_kb' is
256kb and the fio bs is 1m.
>>
>> That is an awfully low limit these days. I'm really not sure we should
>> optimize the block layer for that.
>
> That's exactly what my replies have been saying. I don't think this is
> a relevant thing to optimize for.

If large ios being submitted together is not a common use case (probably
not, since it's been a long time without complaints), I agree that we
should not optimize the block layer for that.

Thanks,
Kuai
>
> Fixing fairness for wakeups seems useful, however.
>

2022-03-30 12:07:58

by Jens Axboe

Subject: Re: [PATCH -next RFC 0/6] improve large random io for HDD

On 3/29/22 3:40 AM, Yu Kuai wrote:
> There is a defect in blk-mq compared to blk-sq: split io will end up
> discontinuous if the device is under high io pressure, while split io
> will still be continuous in sq. This is because:
>
> 1) split bios are issued one by one, and if one bio can't get a tag, it
> will go to wait. - patch 2
> 2) each time 8 (or wake_batch) requests are done, 8 waiters will be woken up.
> Thus if a thread is woken up, it is unlikely to get multiple tags.
> - patch 3,4
> 3) new io can preempt tags even if there are lots of threads waiting for
> tags. - patch 5
>
> Test environment:
> x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
> max_sectors_kb is set to 128.
>
> I haven't tested this patchset on a physical machine yet; I'll try later
> if anyone thinks this approach is meaningful.

A real machine test would definitely be a requirement. What real world
use cases is this solving? These days most devices have plenty of tags,
and I would not really expect tag starvation to be much of a concern.

However, I do think there's merit in fixing the unfairness we have
here. But not at the cost of all of this. Why not just simply enforce
more strict ordering of tag allocations? If someone is waiting, you get
to wait too.

And I don't see much utility at all in tracking how many splits (and
hence tags) would be required. Is this really a common issue, tons of
splits and needing many tags? Why not just enforce the strict ordering
as mentioned above, not allowing new allocators to get a tag if others
are waiting, but perhaps allow someone submitting a string of splits to
indeed keep allocating.

Yes, it'll be less efficient to still wake one-by-one, but honestly do
we really care about that? If you're stalled on waiting for other IO to
finish and release a tag, that isn't very efficient to begin with and
doesn't seem like a case worth optimizing for me.
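
Something along these lines, as a rough userspace sketch of that policy;
all of the names here are hypothetical, none of this is existing block
layer code:

#include <stdbool.h>
#include <stdio.h>

struct tag_ctx {
	int nr_waiters;		/* threads already waiting for tags */
	bool continuing_splits;	/* caller already got the first tag of a split run */
};

/* Strict ordering: newcomers queue up behind existing waiters; only a
 * caller continuing a run of splits it already started keeps allocating. */
static bool may_allocate_directly(const struct tag_ctx *ctx)
{
	return ctx->continuing_splits || ctx->nr_waiters == 0;
}

int main(void)
{
	struct tag_ctx newcomer = { .nr_waiters = 5, .continuing_splits = false };
	struct tag_ctx splitter = { .nr_waiters = 5, .continuing_splits = true };

	printf("newcomer may allocate: %d\n", may_allocate_directly(&newcomer));
	printf("splitter may allocate: %d\n", may_allocate_directly(&splitter));
	return 0;
}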

--
Jens Axboe

2022-03-30 14:59:42

by Christoph Hellwig

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On Tue, Mar 29, 2022 at 05:40:44PM +0800, Yu Kuai wrote:
> Currently, the split bio is handled first, and then the original bio
> continues to be split. This patch splits the original bio thoroughly up
> front, so that it can be known in advance how many tags will be needed.

How do you avoid the deadlock risk with all the split bios being
submitted together?

But more importantly why does your use case even have splits that get
submitted together? Is this a case of Linus' stupidly low default
max_sectors when the hardware supports more, or is the hardware limited
to a low number of sectors per request? Or do we hit another reason
for the split?

2022-03-30 22:30:51

by Christoph Hellwig

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On Tue, Mar 29, 2022 at 08:41:57AM -0600, Jens Axboe wrote:
> Fixing fairness for wakeups seems useful, however.

Agreed.

2022-03-31 02:43:19

by Yu Kuai

Subject: Re: [PATCH -next RFC 0/6] improve large random io for HDD

On 2022/03/29 20:53, Jens Axboe wrote:
> On 3/29/22 3:40 AM, Yu Kuai wrote:
>> There is a defect in blk-mq compared to blk-sq: split io will end up
>> discontinuous if the device is under high io pressure, while split io
>> will still be continuous in sq. This is because:
>>
>> 1) split bios are issued one by one, and if one bio can't get a tag, it
>> will go to wait. - patch 2
>> 2) each time 8 (or wake_batch) requests are done, 8 waiters will be woken up.
>> Thus if a thread is woken up, it is unlikely to get multiple tags.
>> - patch 3,4
>> 3) new io can preempt tags even if there are lots of threads waiting for
>> tags. - patch 5
>>
>> Test environment:
>> x86 vm, nr_requests is set to 64, queue_depth is set to 32 and
>> max_sectors_kb is set to 128.
>>
>> I haven't tested this patchset on a physical machine yet; I'll try later
>> if anyone thinks this approach is meaningful.
>
> A real machine test would definitely be a requirement. What real world
> use cases is this solving? These days most devices have plenty of tags,
> and I would not really expect tag starvation to be much of a concern.
>
> However, I do think there's merit in fixing the unfairness we have
> here. But not at the cost of all of this. Why not just simply enforce
> more strict ordering of tag allocations? If someone is waiting, you get
> to wait too.
>
> And I don't see much utility at all in tracking how many splits (and
> hence tags) would be required. Is this really a common issue, tons of
> splits and needing many tags? Why not just enforce the strict ordering
> as mentioned above, not allowing new allocators to get a tag if others
> are waiting, but perhaps allow someone submitting a string of splits to
> indeed keep allocating.
>
> Yes, it'll be less efficient to still wake one-by-one, but honestly do
> we really care about that? If you're stalled on waiting for other IO to
> finish and release a tag, that isn't very efficient to begin with and
> doesn't seem like a case worth optimizing for me.
>

Hi,

Thanks for your advice, I'll do more work based on your suggestions.

Kuai

2022-03-31 03:08:03

by Jens Axboe

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On 3/29/22 8:40 AM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2022 at 08:35:29AM -0600, Jens Axboe wrote:
>>> But more importantly why does your use case even have splits that get
>>> submitted together? Is this a case of Linus' stupidly low default
>>> max_sectors when the hardware supports more, or is the hardware limited
>>> to a low number of sectors per request? Or do we hit another reason
>>> for the split?
>>
>> See the posted use case, it's running 512kb ios on a 128kb device.
>
> That is an awfully low limit these days. I'm really not sure we should
> optimize the block layer for that.

That's exactly what my replies have been saying. I don't think this is
a relevant thing to optimize for.

Fixing fairness for wakeups seems useful, however.

--
Jens Axboe

2022-03-31 04:46:59

by Yu Kuai

Subject: Re: [PATCH -next RFC 2/6] block: refactor to split bio thoroughly

On 2022/03/29 20:46, Jens Axboe wrote:
> On 3/29/22 3:40 AM, Yu Kuai wrote:
>> Currently, the split bio is handled first, and then the original bio
>> continues to be split. This patch splits the original bio thoroughly up
>> front, so that it can be known in advance how many tags will be needed.
>
> This makes the bio almost 10% bigger in size, which is NOT something
> that is just done casually and without strong reasoning.
Hi,

Thanks for taking the time on this patchset. You're right, this approach
is not appropriate.
>
> So please provide that, your commit message is completely missing
> justification for why this change is useful or necessary. A good
> commit always explains WHY the change needs to be made, yours is
> simply stating WHAT is being done. The latter can be gleaned from
> the code change anyway.
>
Thanks for the guidance, I'll pay attention to that carefully in the future.

For this patch, what I want to do is to learn how many tags will be
needed for the big io before getting the first tag, and use that
information to optimize the wake-up path. The problem with this patch is
that adding two fields to the bio is a bad move.

I was thinking that for the segment split, I could get that information
by calculating bio segments / queue max segments.
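
Roughly like the following, a hypothetical userspace sketch of that
calculation (estimate_nr_tags is a made-up name, not a kernel helper):

#include <stdio.h>

/* Estimate how many requests (tags) a bio will need once it is split on
 * the segment limit: ceil(bio segments / queue max segments). */
static unsigned int estimate_nr_tags(unsigned int bio_segs,
				     unsigned int max_segs)
{
	return (bio_segs + max_segs - 1) / max_segs;
}

int main(void)
{
	/* e.g. a bio with 512 segments on a queue limited to 128 per rq */
	printf("%u\n", estimate_nr_tags(512, 128));	/* prints 4 */
	return 0;
}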

Thanks,
Kuai