2016-10-24 18:54:33

by Kashyap Desai

[permalink] [raw]
Subject: RE: Device or HBA level QD throttling creates randomness in sequential workload

> -----Original Message-----
> From: Omar Sandoval [mailto:[email protected]]
> Sent: Monday, October 24, 2016 9:11 PM
> To: Kashyap Desai
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Christoph Hellwig;
> [email protected]
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > >
> > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > Hi -
> > > >
> > > > I found the below conversation, and it is along the same lines as
> > > > the input I wanted from the mailing list.
> > > >
> > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > >
> > > > I can do testing on any WIP item, as Omar mentioned in the above
> > > > discussion.
> > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> >
> > I tried building a kernel using this repo, but it looks like it fails
> > to boot due to some changes in the <block> layer.
>
> Did you build the most up-to-date version of that branch? I've been force
> pushing to it, so the commit id that you built would be useful.
> What boot failure are you seeing?

Below is the latest commit on the repo.
commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
Author: Omar Sandoval <[email protected]>
Date: Tue Sep 20 11:20:03 2016 -0700

[WIP] blk-mq: limit bio queue depth

I have the latest 4.9/scsi-next repo maintained by Martin, which boots
fine. The only delta is that CONFIG_SBITMAP is enabled in the WIP
blk-mq-iosched branch. I could not capture any meaningful data on the boot
hang, so I am going to try one more time tomorrow.


>
> > >
> > > Are you using blk-mq for this disk? If not, then the work there
> > > won't
> > affect you.
> >
> > Yes, I am using blk-mq for my test. I also confirm that if use_blk_mq
> > is disabled, the sequential workload issue is not seen and <cfq>
> > scheduling works well.
>
> Ah, okay, perfect. Can you send the fio job file you're using? Hard to
> tell exactly what's going on without the details. A sequential workload
> with just one submitter is about as easy as it gets, so this _should_ be
> behaving nicely.

<FIO script>

; setup numa policy for each thread
; 'numactl --show' to determine the maximum numa nodes
[global]
ioengine=libaio
buffered=0
rw=write
bssplit=4K/100
iodepth=256
numjobs=1
direct=1
runtime=60s
allow_mounted_write=0

[job1]
filename=/dev/sdd
..
[job24]
filename=/dev/sdaa
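
For reference, this job file is run directly with fio; "seq-write.fio"
below is just a placeholder name for it:

# Hypothetical invocation of the job file quoted above.
fio seq-write.fio
# Optionally roll all 24 jobs into one summary for easier comparison:
fio --group_reporting seq-write.fio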

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
I/O scheduler detail (it is in blk-mq mode):
/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the I/O
scheduler picked by the SCSI midlayer (SML) is <cfq>:
/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]
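
(A minimal sketch of how these values are read, assuming the same sdq
device as above:)

# Check whether scsi-mq is enabled (Y = blk-mq, N = legacy path).
cat /sys/module/scsi_mod/parameters/use_blk_mq
# Show the device's I/O scheduler; the bracketed entry is the active one.
cat /sys/block/sdq/queue/scheduler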

I see that blk-mq performance is very low for the sequential write
workload, and I confirm that blk-mq converts the sequential workload into
a random stream due to the I/O scheduler difference between blk-mq and the
legacy block layer.

>
> > >
> > > > Is there any workaround/alternative in the latest upstream kernel,
> > > > if a user wants to see a limited penalty for sequential workloads
> > > > on HDD?
> > > >
> > > > ` Kashyap
> > > >
>
> P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> seems to mark just about everything I get from Broadcom as spam due to
> failed DMARC.
>
> --
> Omar


2016-10-26 20:56:12

by Omar Sandoval

[permalink] [raw]
Subject: Re: Device or HBA level QD throttling creates randomness in sequential workload

On Tue, Oct 25, 2016 at 12:24:24AM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Omar Sandoval [mailto:[email protected]]
> > Sent: Monday, October 24, 2016 9:11 PM
> > To: Kashyap Desai
> > Cc: [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Christoph Hellwig;
> > [email protected]
> > Subject: Re: Device or HBA level QD throttling creates randomness in
> > sequential workload
> >
> > On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > > >
> > > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > > Hi -
> > > > >
> > > > > I found the below conversation, and it is along the same lines
> > > > > as the input I wanted from the mailing list.
> > > > >
> > > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > > >
> > > > > I can do testing on any WIP item, as Omar mentioned in the above
> > > > > discussion.
> > > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> > >
> > > I tried building a kernel using this repo, but it looks like it
> > > fails to boot due to some changes in the <block> layer.
> >
> > Did you build the most up-to-date version of that branch? I've been
> > force pushing to it, so the commit id that you built would be useful.
> > What boot failure are you seeing?
>
> Below is the latest commit on the repo.
> commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
> Author: Omar Sandoval <[email protected]>
> Date: Tue Sep 20 11:20:03 2016 -0700
>
> [WIP] blk-mq: limit bio queue depth
>
> I have the latest 4.9/scsi-next repo maintained by Martin, which boots
> fine. The only delta is that CONFIG_SBITMAP is enabled in the WIP
> blk-mq-iosched branch. I could not capture any meaningful data on the
> boot hang, so I am going to try one more time tomorrow.

The blk-mq-bio-queueing branch has the latest work there separated out.
Not sure that it'll help in this case.
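
(For anyone following along, a sketch of checking out that branch,
assuming an existing kernel git tree; the remote name "osandov" is
arbitrary:)

# Fetch and check out the separated-out branch from Omar's tree.
git remote add osandov https://github.com/osandov/linux.git
git fetch osandov blk-mq-bio-queueing
git checkout -b blk-mq-bio-queueing osandov/blk-mq-bio-queueing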

> >
> > > >
> > > > Are you using blk-mq for this disk? If not, then the work there
> > > > won't
> > > affect you.
> > >
> > > Yes, I am using blk-mq for my test. I also confirm that if
> > > use_blk_mq is disabled, the sequential workload issue is not seen
> > > and <cfq> scheduling works well.
> >
> > Ah, okay, perfect. Can you send the fio job file you're using? Hard to
> > tell exactly what's going on without the details. A sequential
> > workload with just one submitter is about as easy as it gets, so this
> > _should_ be behaving nicely.
>
> <FIO script>
>
> ; setup numa policy for each thread
> ; 'numactl --show' to determine the maximum numa nodes
> [global]
> ioengine=libaio
> buffered=0
> rw=write
> bssplit=4K/100
> iodepth=256
> numjobs=1
> direct=1
> runtime=60s
> allow_mounted_write=0
>
> [job1]
> filename=/dev/sdd
> ..
> [job24]
> filename=/dev/sdaa

Okay, so you have one high-iodepth job per disk, got it.

> When I set /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
> I/O scheduler detail (it is in blk-mq mode):
> /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none
>
> When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the I/O
> scheduler picked by the SCSI midlayer (SML) is <cfq>:
> /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]
>
> I see that blk-mq performance is very low for the sequential write
> workload, and I confirm that blk-mq converts the sequential workload
> into a random stream due to the I/O scheduler difference between blk-mq
> and the legacy block layer.

Since this happens when the fio iodepth exceeds the per-device QD, my
best guess is that requests are getting requeued and scrambled when that
happens. Do you have the blktrace lying around?
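
(For completeness, a trace like that can be captured along these lines;
the device and runtime here are placeholders matching the fio job:)

# Trace one target disk for the 60s fio run, then decode the events.
blktrace -d /dev/sdq -w 60 -o sdq
blkparse -i sdq -o sdq.txt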

> > > > > Is there any workaround/alternative in the latest upstream
> > > > > kernel, if a user wants to see a limited penalty for sequential
> > > > > workloads on HDD?
> > > > >
> > > > > ` Kashyap
> > > > >
> >
> > P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> > seems to mark just about everything I get from Broadcom as spam due to
> > failed DMARC.
> >
> > --
> > Omar

--
Omar

2016-10-31 17:24:08

by Jens Axboe

[permalink] [raw]
Subject: Re: Device or HBA level QD throttling creates randomness in sequential workload

Hi,

One guess would be that this isn't around a requeue condition, but
rather the fact that we don't really guarantee any sort of hard FIFO
behavior between the software queues. Can you try this test patch to see
if it changes the behavior for you? Warning: untested...

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3d27a6dee09..5404ca9c71b2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
+static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct request *rqa = container_of(a, struct request, queuelist);
+	struct request *rqb = container_of(b, struct request, queuelist);
+
+	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
+}
+
 /*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
@@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	}
 
 	/*
+	 * If the device is rotational, sort the list sanely to avoid
+	 * unnecessary seeks. The software queues are roughly FIFO, but
+	 * only roughly, there are no hard guarantees.
+	 */
+	if (!blk_queue_nonrot(q))
+		list_sort(NULL, &rq_list, rq_pos_cmp);
+
+	/*
 	 * Start off with dptr being NULL, so we start the first request
 	 * immediately, even if we have more pending.
 	 */
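
(Side note: blk_queue_nonrot() reflects the queue's "rotational" flag, so
the sort above only runs for spinning disks. The flag can be inspected
or, if the device is misreported, forced from userspace; "sdq" here is a
placeholder:)

# 1 = rotational (HDD), 0 = non-rotational (SSD).
cat /sys/block/sdq/queue/rotational
# Force the rotational flag if the device is misdetected:
echo 1 > /sys/block/sdq/queue/rotational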

--
Jens Axboe

2016-11-01 05:41:04

by Kashyap Desai

[permalink] [raw]
Subject: RE: Device or HBA level QD throttling creates randomness in sequential workload

Jens - replied inline.


Omar - I tested your WIP repo and figured out that the system hangs only
if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
fine, but I am looking for scsi_mod.use_blk_mq=Y.
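
(A quick sketch of verifying that mode after boot, using the paths from
the earlier mails:)

# Confirm the boot parameter took effect.
grep -o scsi_mod.use_blk_mq=Y /proc/cmdline
cat /sys/module/scsi_mod/parameters/use_blk_mq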

Also, below is a snippet of the blktrace. In the case of a higher
per-device QD, I see requeue (R) events in the blktrace. (blkparse action
codes: Q = queued, G = get request, I = inserted, D = dispatched, R =
requeued, C = completed, P = plug, U = unplug, S = sleep.)

65,128 10 6268 2.432404509 18594 P N [fio]
65,128 10 6269 2.432405013 18594 U N [fio] 1
65,128 10 6270 2.432405143 18594 I WS 148800 + 8 [fio]
65,128 10 6271 2.432405740 18594 R WS 148800 + 8 [0]
65,128 10 6272 2.432409794 18594 Q WS 148808 + 8 [fio]
65,128 10 6273 2.432410234 18594 G WS 148808 + 8 [fio]
65,128 10 6274 2.432410424 18594 S WS 148808 + 8 [fio]
65,128 23 3626 2.432432595 16232 D WS 148800 + 8 [kworker/23:1H]
65,128 22 3279 2.432973482 0 C WS 147432 + 8 [0]
65,128 7 6126 2.433032637 18594 P N [fio]
65,128 7 6127 2.433033204 18594 U N [fio] 1
65,128 7 6128 2.433033346 18594 I WS 148808 + 8 [fio]
65,128 7 6129 2.433033871 18594 D WS 148808 + 8 [fio]
65,128 7 6130 2.433034559 18594 R WS 148808 + 8 [0]
65,128 7 6131 2.433039796 18594 Q WS 148816 + 8 [fio]
65,128 7 6132 2.433040206 18594 G WS 148816 + 8 [fio]
65,128 7 6133 2.433040351 18594 S WS 148816 + 8 [fio]
65,128 9 6392 2.433133729 0 C WS 147240 + 8 [0]
65,128 9 6393 2.433138166 905 D WS 148808 + 8 [kworker/9:1H]
65,128 7 6134 2.433167450 18594 P N [fio]
65,128 7 6135 2.433167911 18594 U N [fio] 1
65,128 7 6136 2.433168074 18594 I WS 148816 + 8 [fio]
65,128 7 6137 2.433168492 18594 D WS 148816 + 8 [fio]
65,128 7 6138 2.433174016 18594 Q WS 148824 + 8 [fio]
65,128 7 6139 2.433174282 18594 G WS 148824 + 8 [fio]
65,128 7 6140 2.433174613 18594 S WS 148824 + 8 [fio]
CPU0 (sdy):
 Reads Queued:           0,        0KiB   Writes Queued:          79,      316KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:       67,  18,446,744,073PiB
 Reads Requeued:         0                 Writes Requeued:        86
 Reads Completed:        0,        0KiB   Writes Completed:       98,      392KiB
 Read Merges:            0,        0KiB   Write Merges:            0,        0KiB
 Read depth:             0                 Write depth:             5
 IO unplugs:            79                 Timer unplugs:           0



` Kashyap

> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Monday, October 31, 2016 10:54 PM
> To: Kashyap Desai; Omar Sandoval
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Christoph Hellwig; [email protected]
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> Hi,
>
> One guess would be that this isn't around a requeue condition, but
> rather the fact that we don't really guarantee any sort of hard FIFO
> behavior between the software queues. Can you try this test patch to see
> if it changes the behavior for you? Warning: untested...

Jens - I tested the patch, but I still see a random I/O pattern for the
expected sequential run. I am intentionally running the requeue case and
seeing the issue at the time of requeue. If there is no requeue, I see no
issue at the LLD.


>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f3d27a6dee09..5404ca9c71b2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
>  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>  }
>  
> +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
> +{
> +	struct request *rqa = container_of(a, struct request, queuelist);
> +	struct request *rqb = container_of(b, struct request, queuelist);
> +
> +	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
> +}
> +
>  /*
>   * Run this hardware queue, pulling any software queues mapped to it in.
>   * Note that this function currently has various problems around ordering
> @@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	}
>  
>  	/*
> +	 * If the device is rotational, sort the list sanely to avoid
> +	 * unnecessary seeks. The software queues are roughly FIFO, but
> +	 * only roughly, there are no hard guarantees.
> +	 */
> +	if (!blk_queue_nonrot(q))
> +		list_sort(NULL, &rq_list, rq_pos_cmp);
> +
> +	/*
>  	 * Start off with dptr being NULL, so we start the first request
>  	 * immediately, even if we have more pending.
>  	 */
>
> --
> Jens Axboe