2023-10-13 19:56:22

by Gulam Mohamed

Subject: [PATCH V2] Consider inflight IO in io accounting for high latency devices

For high latency devices, we need to take the inflight IOs into account
when calculating disk utilization. In the block layer's current IO
accounting, io_ticks is incremented by 1 at the start of an IO, and the
delta (the time difference between the start and end of the IO) is added
at the end of the IO. This causes a problem for high latency devices.

Multiple IOs can arrive before the previous ones complete. Suppose IO
'A' arrives and, before it completes, IOs 'B', 'C', 'D' and 'E' arrive.
With the current implementation, each arrival adds only 1 to io_ticks
and updates the disk time stamp ('stamp') to the current time ('now').
Now suppose the completion of 'A' arrives just after 'E'. For this
completion, io_ticks is updated with a delta that is merely
"A(end_time) - E(start_time('stamp'))". Even though the disk was also
busy processing B, C and D, their processing time is missing from
io_ticks. This makes %util report disk utilization incorrectly, and
makes it impossible to drive the reported utilization to 100% even
under heavy load.

To fix this, take the inflight IOs into account when calculating disk
utilization: when updating io_ticks on IO arrival, check whether any
IOs are in flight. If there are, add the delta to io_ticks instead of 1.

This is not much of an issue for low latency devices, since there is
very little difference between the start and end of an IO. Moreover,
adding this inflight check to the IO accounting of low latency devices
would introduce overhead and hurt IO performance.

So the inflight check is added only for high latency devices such as
HDDs, and not for low latency devices such as NVMe. The fix makes this
distinction based on the number of hardware queues the disk supports:
HDDs usually support only 1 HW queue, while NVMe devices support
multiple HW queues.

To find whether any IOs are in flight, the fix iterates through both
the IO tags and the scheduler tags of the hardware context (hctx) to
see whether they are occupied. If more than one tag is taken (the
first one is held by the current IO), there are inflight IOs.

The following is the fio script used to test the fix with correct results:

[global]
bs=64k
iodepth=120
direct=1
ioengine=libaio
group_reporting
time_based
runtime=100
numjobs=1
name=raw-read
rw=randread
ignore_error=EIO:EIO
[job1]
filename=/dev/sdb

Results without fix
-------------------
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz %util
sda 4001.00 256064.00 0.00 0.00 26.20 64.00 69.00

lat (usec): min=1639, max=295085, avg=26327.48, stdev=60948.84

Result with fix
---------------

Device r/s rkB/s %rrqm r_await rareq-sz %util
sdb 3970.00 254080.00 0.00 26.32 64.00 100.00

lat (usec): min=1553, max=281217, avg=26319.47, stdev=60563.99

Changes V1->V2:
1. Created a new function, blk_mq_hctx_has_tags(), instead of using
the existing function part_in_flight()
2. The new function checks whether any IOs are in flight, returning
true if so and false otherwise

Signed-off-by: Gulam Mohamed <[email protected]>
---
block/blk-core.c | 11 ++++++++++-
block/blk-mq.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++-
block/blk-mq.h | 1 +
3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9d51e9894ece..ea59f4f3cbdf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -953,8 +953,17 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end)
unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
unsigned long start_time)
{
+ bool inflight = false;
+ struct blk_mq_hw_ctx *hctx;
+
part_stat_lock();
- update_io_ticks(bdev, start_time, false);
+
+ if (bdev->bd_queue->nr_hw_queues == 1) {
+ hctx = xa_load(&bdev->bd_queue->hctx_table, 0);
+ inflight = blk_mq_hctx_has_tags(hctx);
+ }
+
+ update_io_ticks(bdev, start_time, inflight);
part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
part_stat_unlock();

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fafd54dce3c..5748327de92d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1000,6 +1000,9 @@ static inline void blk_account_io_done(struct request *req, u64 now)

static inline void blk_account_io_start(struct request *req)
{
+ bool inflight = false;
+ struct blk_mq_hw_ctx *hctx;
+
trace_block_io_start(req);

if (blk_do_io_stat(req)) {
@@ -1015,7 +1018,13 @@ static inline void blk_account_io_start(struct request *req)
req->part = req->q->disk->part0;

part_stat_lock();
- update_io_ticks(req->part, jiffies, false);
+
+ if (req->q->nr_hw_queues == 1) {
+ hctx = xa_load(&req->q->hctx_table, 0);
+ inflight = blk_mq_hctx_has_tags(hctx);
+ }
+
+ update_io_ticks(req->part, jiffies, inflight);
part_stat_unlock();
}
}
@@ -3476,6 +3485,45 @@ static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
return data.has_rq;
}

+struct tags_iter_data {
+ struct blk_mq_hw_ctx *hctx;
+ unsigned tags;
+ bool has_tags;
+};
+
+static bool blk_mq_hctx_has_tag(struct request *rq, void *data)
+{
+ struct tags_iter_data *iter_data = data;
+
+ if (rq->mq_hctx != iter_data->hctx)
+ return true;
+
+ iter_data->tags++;
+
+ if (iter_data->tags > 1) {
+ iter_data->has_tags = true;
+ return false;
+ }
+ return true;
+}
+
+bool blk_mq_hctx_has_tags(struct blk_mq_hw_ctx *hctx)
+{
+ struct tags_iter_data data = {
+ .hctx = hctx,
+ .has_tags = false,
+ .tags = 0,
+ };
+
+ if (hctx->sched_tags)
+ blk_mq_all_tag_iter(hctx->sched_tags, blk_mq_hctx_has_tag, &data);
+
+ if (hctx->tags && !data.has_tags)
+ blk_mq_all_tag_iter(hctx->tags, blk_mq_hctx_has_tag, &data);
+
+ return data.has_tags;
+}
+
static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
struct blk_mq_hw_ctx *hctx)
{
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1743857e0b01..7707033dcaa2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -240,6 +240,7 @@ unsigned int blk_mq_in_flight(struct request_queue *q,
struct block_device *part);
void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
unsigned int inflight[2]);
+bool blk_mq_hctx_has_tags(struct blk_mq_hw_ctx *hctx);

static inline void blk_mq_put_dispatch_budget(struct request_queue *q,
int budget_token)
--
2.39.3


2023-10-13 21:13:26

by Bart Van Assche

Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency devices

On 10/13/23 12:55, Gulam Mohamed wrote:
> @@ -1015,7 +1018,13 @@ static inline void blk_account_io_start(struct request *req)
> req->part = req->q->disk->part0;
>
> part_stat_lock();
> - update_io_ticks(req->part, jiffies, false);
> +
> + if (req->q->nr_hw_queues == 1) {
> + hctx = xa_load(&req->q->hctx_table, 0);
> + inflight = blk_mq_hctx_has_tags(hctx);
> + }
> +
> + update_io_ticks(req->part, jiffies, inflight);
> part_stat_unlock();
> }
> }

blk_account_io_start() is called by blk_mq_bio_to_request(). So if I/O
statistics are enabled and if there is only a single hardware queue,
blk_mq_hctx_has_tags() will be called every time a bio is submitted?
The blk_mq_hctx_has_tags() function iterates over all tags. I would be
surprised if anyone would consider the overhead of this approach
acceptable.

Thanks,

Bart.

2023-10-16 20:02:10

by Gulam Mohamed

Subject: RE: [PATCH V2] Consider inflight IO in io accounting for high latency devices

Hi Bart,

Thanks for your review. Can you please see my inline comments?

Regards,
Gulam Mohamed.

-----Original Message-----
From: Bart Van Assche <[email protected]>
Sent: Saturday, October 14, 2023 2:43 AM
To: Gulam Mohamed <[email protected]>; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency devices

On 10/13/23 12:55, Gulam Mohamed wrote:
> @@ -1015,7 +1018,13 @@ static inline void blk_account_io_start(struct request *req)
> req->part = req->q->disk->part0;
>
> part_stat_lock();
> - update_io_ticks(req->part, jiffies, false);
> +
> + if (req->q->nr_hw_queues == 1) {
> + hctx = xa_load(&req->q->hctx_table, 0);
> + inflight = blk_mq_hctx_has_tags(hctx);
> + }
> +
> + update_io_ticks(req->part, jiffies, inflight);
> part_stat_unlock();
> }
> }

blk_account_io_start() is called by blk_mq_bio_to_request(). So if I/O statistics are enabled and if there is only a single hardware queue,
blk_mq_hctx_has_tags() will be called every time a bio is submitted?
The blk_mq_hctx_has_tags() function iterates over all tags. I would be surprised if anyone would consider the overhead of this approach acceptable.

[GULAM]: Yes, it will be called for every submitted bio but for the high latency devices it will not have much impact. This is indicated by the latency figures I provided in the review mail, with and without our patch.

Thanks,

Bart.

2023-10-16 21:16:27

by Bart Van Assche

Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency devices

On 10/16/23 13:01, Gulam Mohamed wrote:
> [GULAM]: Yes, it will be called for every submitted bio but for the
> high latency devices it will not have much impact. This is indicated
> by the latency figures I provided in the review mail, with and
> without our patch.

Calling tag iteration code from the hot path is wrong.

There are more devices than only hard disks that have a single hardware
queue, e.g. certain SSDs. I'm pretty sure that this patch will have a
significant negative performance impact for such devices.

Thanks,

Bart.

2023-10-17 04:13:51

by Gulam Mohamed

Subject: RE: [PATCH V2] Consider inflight IO in io accounting for high latency devices

Hi Bart,

Thanks for the review. I agree, for low latency devices if they have single hardware queue, this patch will have significant impact. Can you please let me know about what kind of low latency devices will have a single queue (just for my knowledge)?
Also I would be grateful if you have any suggestions to fix this issue?

Regards,
Gulam Mohamed.

-----Original Message-----
From: Bart Van Assche <[email protected]>
Sent: Tuesday, October 17, 2023 2:46 AM
To: Gulam Mohamed <[email protected]>; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency devices

On 10/16/23 13:01, Gulam Mohamed wrote:
> [GULAM]: Yes, it will be called for every submitted bio but for the
> high latency devices it will not have much impact. This is indicated
> by the latency figures I provided in the review mail, with and without
> our patch.

Calling tag iteration code from the hot path is wrong.

There are more devices than only hard disks that have a single hardware queue, e.g. certain SSDs. I'm pretty sure that this patch will have a significant negative performance impact for such devices.

Thanks,

Bart.

2023-10-17 14:23:51

by Jens Axboe

Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency devices

On 10/16/23 10:13 PM, Gulam Mohamed wrote:
> Hi Bart,
>
> Thanks for the review. I agree, for low latency devices if they have
> single hardware queue, this patch will have significant impact. Can
> you please let me know about what kind of low latency devices will
> have a single queue (just for my knowledge)? Also I would be grateful
> if you have any suggestions to fix this issue?

I mentioned this last time, I'll do it again - please follow normal
mailing list etiquette. Don't top post, and don't do those
unreadable/unquotable "see replies inline".

As I told you last time as well, the problem with the approach is
exactly as Bart says - it's expensive. Because it's expensive, you limit
the fix to single queue devices and then hope that this is fine because
of some ill perceived notion that single queue devices must be slow, and
hence this isn't a big problem. But this both misses the fact that you
could very well have single queue devices that are much faster than the
one you used to test with. Those do exist.

This doesn't even address the fact that presumably this problem isn't
specific to single queue devices at all. You may only care about those,
but as a fix, it's lacking as it doesn't apply to devices across the
board.

--
Jens Axboe

2023-10-18 18:44:47

by Gulam Mohamed

Subject: RE: [PATCH V2] Consider inflight IO in io accounting for high latency devices

-----Original Message-----
From: Jens Axboe <[email protected]>
Sent: Tuesday, October 17, 2023 7:54 PM
To: Gulam Mohamed <[email protected]>; Bart Van Assche <[email protected]>; [email protected]; [email protected]
Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency devices

On 10/16/23 10:13 PM, Gulam Mohamed wrote:
> Hi Bart,
>
> Thanks for the review. I agree, for low latency devices if they have
> single hardware queue, this patch will have significant impact. Can
> you please let me know about what kind of low latency devices will
> have a single queue (just for my knowledge)? Also I would be grateful
> if you have any suggestions to fix this issue?

I mentioned this last time, I'll do it again - please follow normal mailing list etiquette. Don't top post, and don't do those unreadable/unquotable "see replies inline".

As I told you last time as well, the problem with the approach is exactly as Bart says - it's expensive. Because it's expensive, you limit the fix to single queue devices and then hope that this is fine because of some ill perceived notion that single queue devices must be slow, and hence this isn't a big problem. But this both misses the fact that you could very well have single queue devices that are much faster than the one you used to test with. Those do exist.

This doesn't even address the fact that presumably this problem isn't specific to single queue devices at all. You may only care about those, but as a fix, it's lacking as it doesn't apply to devices across the board.

[GULAM]: Thanks, Jens, for the comments. I thought that most fast
devices would have more than 1 queue; sorry for that misinterpretation.
Now I understand the issue with the fix.
I have been trying to work out how to fix this issue. Can you please
provide your suggestions?

--
Jens Axboe

2023-10-18 18:56:50

by Jens Axboe

Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency devices

Please fix your quoting, it's impossible to have coherent threads
otherwise. This is the third time I'm bringing that up, but it's just
being ignored. Not the best way to have a fruitful discussion on how
best to fix this.

--
Jens Axboe

2023-10-19 16:44:48

by Gulam Mohamed

Subject: RE: [PATCH V2] Consider inflight IO in io accounting for high latency devices



> -----Original Message-----
> From: Jens Axboe <[email protected]>
> Sent: Thursday, October 19, 2023 12:26 AM
> To: Gulam Mohamed <[email protected]>; Bart Van Assche
> <[email protected]>; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH V2] Consider inflight IO in io accounting for high latency
> devices
>
> Please fix your quoting, it's impossible to have coherent threads otherwise.
> This is the third time I'm bringing that up, but it's just being ignored. Not the
> best way to have a fruitful discussion on how best to fix this.
>
> --
> Jens Axboe

Sorry, Jens. That was a problem with the configuration of the mail
client I was using; I am extremely sorry for the inconvenience.
Regarding the fix, I thought that most fast devices would have more
than 1 queue; sorry for that misinterpretation. Now I understand the
issue with the fix. I have been trying to work out how to fix it. Can
you please provide your suggestions?