2022-04-08 07:56:03

by Yu Kuai

Subject: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load

Changes in v2:
- use a new title
- add patches to fix waitqueues' unfairness - patch 1-3
- delete patch to add queue flag
- delete patch to split big io thoroughly

There is a defect in blk-mq compared to blk-sq: split ios end up
discontinuous if the device is under heavy io pressure, while they remain
continuous in sq. This is because:

1) new io can preempt a tag even if there are lots of threads waiting.
2) split bios are issued one by one; if one bio can't get a tag, it will
go to wait.
3) each time 8 (or wake batch) requests are done, 8 waiters will be woken up.
Thus if a thread is woken up, it is unlikely to get multiple tags.

The problem was first found when upgrading the kernel from v3.10 to v4.18;
the test device is an HDD with 'max_sectors_kb' set to 256, and the test
case issues 1m ios with high concurrency.

Note that there is a precondition for this performance problem: there must
be a certain gap between the bandwidth of a single io with
bs=max_sectors_kb and the disk's upper limit.

During the test, I found that waitqueues can be extremely unbalanced under
heavy load. This is because 'wake_index' is not set properly in
__sbq_wake_up(), see details in patch 3.

In this patchset:
- patch 1-3 fix waitqueues' unfairness.
- patch 4,5 disable tag preemption under heavy load.
- patch 6 forces tag preemption for split bios.
- patch 7,8 improve large random io for HDD. As I mentioned above, we
do meet this problem and I'm trying to fix it at very low cost. However,
if anyone still thinks this is not a common case and not worth
optimizing, I'll drop them.

Test environment:
arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The default
'max_sectors_kb' is 1280 (sorry that I was unable to test on the machine
where 'max_sectors_kb' is 256).

The single io performance (randwrite):

| bs | 128k | 256k | 512k | 1m | 1280k | 2m | 4m |
| -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
| bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7 | 82.9 | 82.9 |

It can be seen that 1280k io is already close to the upper limit, and it
would be hard to see differences with the default value, thus I set
'max_sectors_kb' to 128 in the following test.

Test cmd:
fio \
-filename=/dev/$dev \
-name=test \
-ioengine=psync \
-allow_mounted_write=0 \
-group_reporting \
-direct=1 \
-offset_increment=1g \
-rw=randwrite \
-bs=1024k \
-numjobs={1,2,4,8,16,32,64,128,256,512} \
-runtime=110 \
-ramp_time=10

Test result: MiB/s

| numjobs | v5.18-rc1 | v5.18-rc1-patched |
| ------- | --------- | ----------------- |
| 1 | 67.7 | 67.7 |
| 2 | 67.7 | 67.7 |
| 4 | 67.7 | 67.7 |
| 8 | 67.7 | 67.7 |
| 16 | 64.8 | 65.2 |
| 32 | 59.8 | 62.8 |
| 64 | 54.9 | 58.6 |
| 128 | 49 | 55.8 |
| 256 | 37.7 | 52.3 |
| 512 | 31.8 | 51.4 |

Yu Kuai (8):
sbitmap: record the number of waiters for each waitqueue
blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
sbitmap: make sure waitqueues are balanced
blk-mq: don't preempt tag on heavy load
sbitmap: force tag preemption if free tags are sufficient
blk-mq: force tag preemption for split bios
blk-mq: record how many tags are needed for split bio
sbitmap: wake up the number of threads based on required tags

block/blk-merge.c | 9 ++-
block/blk-mq-tag.c | 42 +++++++++-----
block/blk-mq.c | 25 +++++++-
block/blk-mq.h | 2 +
include/linux/blk_types.h | 4 ++
include/linux/sbitmap.h | 9 +++
lib/sbitmap.c | 117 +++++++++++++++++++++++++-------------
7 files changed, 150 insertions(+), 58 deletions(-)

--
2.31.1


2022-04-08 07:59:24

by Yu Kuai

Subject: [PATCH -next RFC v2 7/8] blk-mq: record how many tags are needed for split bio

Currently, each time 8 (or wake batch) requests are done, 8 waiters are
woken up. This is not necessary, because we only need to make sure the
woken threads will use up 8 tags. For example, if we know in advance that
one thread needs 8 tags, then waking up a single thread is enough, and
this also avoids unnecessary context switches.

This patch provides the information of how many tags will be needed for
a huge io; it will be used in the next patch.

Signed-off-by: Yu Kuai <[email protected]>
---
block/blk-mq-tag.c | 1 +
block/blk-mq.c | 24 +++++++++++++++++++++---
block/blk-mq.h | 1 +
include/linux/sbitmap.h | 2 ++
4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index dfbb06edfbc3..f91879772dc8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -165,6 +165,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
return BLK_MQ_NO_TAG;
}

+ wait.nr_tags += data->nr_split;
do {
struct sbitmap_queue *bt_prev;

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 909420c5186c..65a3b11d5c9f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2731,12 +2731,14 @@ static bool blk_mq_attempt_bio_merge(struct request_queue *q,
static struct request *blk_mq_get_new_requests(struct request_queue *q,
struct blk_plug *plug,
struct bio *bio,
- unsigned int nsegs)
+ unsigned int nsegs,
+ unsigned int nr_split)
{
struct blk_mq_alloc_data data = {
.q = q,
.nr_tags = 1,
.cmd_flags = bio->bi_opf,
+ .nr_split = nr_split,
.preemption = (bio->bi_opf & REQ_PREEMPT),
};
struct request *rq;
@@ -2795,6 +2797,19 @@ static inline struct request *blk_mq_get_cached_request(struct request_queue *q,
return rq;
}

+static inline unsigned int caculate_sectors_split(struct bio *bio)
+{
+ switch (bio_op(bio)) {
+ case REQ_OP_DISCARD:
+ case REQ_OP_SECURE_ERASE:
+ case REQ_OP_WRITE_ZEROES:
+ return 0;
+ default:
+ return (bio_sectors(bio) - 1) /
+ queue_max_sectors(bio->bi_bdev->bd_queue);
+ }
+}
+
/**
* blk_mq_submit_bio - Create and send a request to block device.
* @bio: Bio pointer.
@@ -2815,11 +2830,14 @@ void blk_mq_submit_bio(struct bio *bio)
const int is_sync = op_is_sync(bio->bi_opf);
struct request *rq;
unsigned int nr_segs = 1;
+ unsigned int nr_split = 0;
blk_status_t ret;

blk_queue_bounce(q, &bio);
- if (blk_may_split(q, bio))
+ if (blk_may_split(q, bio)) {
+ nr_split = caculate_sectors_split(bio);
__blk_queue_split(q, &bio, &nr_segs);
+ }

if (!bio_integrity_prep(bio))
return;
@@ -2828,7 +2846,7 @@ void blk_mq_submit_bio(struct bio *bio)
if (!rq) {
if (!bio)
return;
- rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
+ rq = blk_mq_get_new_requests(q, plug, bio, nr_segs, nr_split);
if (unlikely(!rq))
return;
}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index b49b20e11350..dfb2f1b9bf06 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,7 @@ struct blk_mq_alloc_data {

/* allocate multiple requests/tags in one go */
unsigned int nr_tags;
+ unsigned int nr_split;
bool preemption;
struct request **cached_rq;

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index ca00ccb6af48..1abd8ed5d406 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -596,12 +596,14 @@ void sbitmap_queue_wake_up(struct sbitmap_queue *sbq);
void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m);

struct sbq_wait {
+ unsigned int nr_tags;
struct sbitmap_queue *sbq; /* if set, sbq_wait is accounted */
struct wait_queue_entry wait;
};

#define DEFINE_SBQ_WAIT(name) \
struct sbq_wait name = { \
+ .nr_tags = 1, \
.sbq = NULL, \
.wait = { \
.private = current, \
--
2.31.1

2022-04-08 08:00:19

by Yu Kuai

Subject: [PATCH -next RFC v2 6/8] blk-mq: force tag preemption for split bios

For HDD, sequential io is much faster than random io, thus it's better
to issue split ios continuously. However, this is broken when tag preemption
is disabled, because each woken waiter can only get one tag at a time.

Thus tag preemption should be forced for split bios: specifically, the
first bio won't preempt a tag, while the following split bios will.

Signed-off-by: Yu Kuai <[email protected]>
---
block/blk-merge.c | 9 ++++++++-
block/blk-mq.c | 1 +
include/linux/blk_types.h | 4 ++++
3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7771dacc99cb..cab6ca681513 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -343,12 +343,19 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,

if (split) {
/* there isn't chance to merge the splitted bio */
- split->bi_opf |= REQ_NOMERGE;
+ split->bi_opf |= (REQ_NOMERGE | REQ_SPLIT);
+ if ((*bio)->bi_opf & REQ_SPLIT)
+ split->bi_opf |= REQ_PREEMPT;
+ else
+ (*bio)->bi_opf |= REQ_SPLIT;

bio_chain(split, *bio);
trace_block_split(split, (*bio)->bi_iter.bi_sector);
submit_bio_noacct(*bio);
*bio = split;
+ } else {
+ if ((*bio)->bi_opf & REQ_SPLIT)
+ (*bio)->bi_opf |= REQ_PREEMPT;
}
}

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ed3ed86f7dd2..909420c5186c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2737,6 +2737,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
.q = q,
.nr_tags = 1,
.cmd_flags = bio->bi_opf,
+ .preemption = (bio->bi_opf & REQ_PREEMPT),
};
struct request *rq;

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c62274466e72..6b56e271f926 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -418,6 +418,8 @@ enum req_flag_bits {
/* for driver use */
__REQ_DRV,
__REQ_SWAP, /* swapping request. */
+ __REQ_SPLIT,
+ __REQ_PREEMPT,
__REQ_NR_BITS, /* stops here */
};

@@ -443,6 +445,8 @@ enum req_flag_bits {

#define REQ_DRV (1ULL << __REQ_DRV)
#define REQ_SWAP (1ULL << __REQ_SWAP)
+#define REQ_SPLIT (1ULL << __REQ_SPLIT)
+#define REQ_PREEMPT (1ULL << __REQ_PREEMPT)

#define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
--
2.31.1

2022-04-09 11:47:35

by Yu Kuai

Subject: Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load

On 2022/04/09 10:28, Jens Axboe wrote:
> On 4/8/22 8:26 PM, yukuai (C) wrote:
>> On 2022/04/09 3:10, Jens Axboe wrote:
>>> For this one, and other patches you send, they end up in spam because
>>> the sender can't be verified. I would encourage you to figure out what
>>> is going wrong here, because a lot of your patches end up getting
>>> dropped or missed because of it.
>>>
>>
>> Hi,
>>
>> Thanks for your notice, however, I have no clue what is going on right
>> now. I'll look for some help and hopefully that can be fixed.
>
> The easiest is probably to try and send patches to a gmail account. If
> you don't have one, just create one. That will help you see the issue
> and verify whatever the fix might be. It might be a company email
> server issue, however.
>
Thanks very much for your advice, I'll try that asap.

2022-04-09 15:26:31

by Yu Kuai

Subject: Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load

On 2022/04/09 3:10, Jens Axboe wrote:
> For this one, and other patches you send, they end up in spam because
> the sender can't be verified. I would encourage you to figure out what
> is going wrong here, because a lot of your patches end up getting
> dropped or missed because of it.
>

Hi,

Thanks for your notice, however, I have no clue what is going on right
now. I'll look for some help and hopefully that can be fixed.

Thanks,
Kuai

2022-04-10 16:32:22

by Bart Van Assche

Subject: Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load

On 4/8/22 19:28, Jens Axboe wrote:
> The easiest is probably to try and send patches to a gmail account. If
> you don't have one, just create one. That will help you see the issue
> and verify whatever the fix might be. It might be a company email
> server issue, however.

Hi Jens and Yu,

I think it's a company email issue. Many servers that receive email rely
on the SPF, DKIM & DMARC standards to determine whether or not to
classify email as spam. I had to add the following rule to my inbox to
receive Huawei.com emails:

Matches: from:(huawei.com)
Do this: Never send it to Spam

Thanks,

Bart.


2022-04-12 00:19:45

by Jens Axboe

Subject: Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load

For this one, and other patches you send, they end up in spam because
the sender can't be verified. I would encourage you to figure out what
is going wrong here, because a lot of your patches end up getting
dropped or missed because of it.

--
Jens Axboe

2022-04-12 21:12:47

by Jens Axboe

Subject: Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load

On 4/8/22 8:26 PM, yukuai (C) wrote:
> On 2022/04/09 3:10, Jens Axboe wrote:
>> For this one, and other patches you send, they end up in spam because
>> the sender can't be verified. I would encourage you to figure out what
>> is going wrong here, because a lot of your patches end up getting
>> dropped or missed because of it.
>>
>
> Hi,
>
> Thanks for your notice, however, I have no clue what is going on right
> now. I'll look for some help and hopefully that can be fixed.

The easiest is probably to try and send patches to a gmail account. If
you don't have one, just create one. That will help you see the issue
and verify whatever the fix might be. It might be a company email
server issue, however.

--
Jens Axboe

2022-04-12 21:32:49

by Yu Kuai

Subject: Re: [PATCH -next RFC v2 0/8] improve tag allocation under heavy load

On 2022/04/09 10:28, Jens Axboe wrote:
> On 4/8/22 8:26 PM, yukuai (C) wrote:
>> On 2022/04/09 3:10, Jens Axboe wrote:
>>> For this one, and other patches you send, they end up in spam because
>>> the sender can't be verified. I would encourage you to figure out what
>>> is going wrong here, because a lot of your patches end up getting
>>> dropped or missed because of it.
>>>
>>
>> Hi,
>>
>> Thanks for your notice, however, I have no clue what is going on right
>> now. I'll look for some help and hopefully that can be fixed.
>
> The easiest is probably to try and send patches to a gmail account. If
> you don't have one, just create one. That will help you see the issue
> and verify whatever the fix might be. It might be a company email
> server issue, however.
>

I tried sending a patch to gmail; however, no issues were found. I am
contacting our IT support and hope they can figure out what is going on.

By the way, I didn't see anything unusual for my patches in linux-block
patchwork:

https://patchwork.kernel.org/project/linux-block/list/?series=&submitter=187999&state=&q=&archive=both&delegate=

Is there a separate place to track patches?

Thanks,
Kuai