2017-08-03 08:51:21

by Mel Gorman

Subject: Switching to MQ by default may generate some bug reports

Hi Christoph,

I know the reasons for switching to MQ by default but just be aware that it's
not without hazards, although the biggest issues I've seen are in switching
from CFQ to BFQ. On my home grid, there is some experimental automatic testing
running every few weeks searching for regressions. Yesterday, it noticed
that creating some work files for a postgres simulator called pgioperf
was 38.33% slower and it auto-bisected to the switch to MQ. This is just
linearly writing two files for use by another benchmark and is not otherwise
remarkable. The relevant part of the report is:

Last good/First bad commit
==========================
Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Fri, 16 Jun 2017 10:27:55 +0200
Subject: [PATCH] scsi: default to scsi-mq
Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
path now that we had plenty of testing, and have I/O schedulers for
blk-mq. The module option to disable the blk-mq path is kept around for
now.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
drivers/scsi/Kconfig | 11 -----------
drivers/scsi/scsi.c | 4 ----
2 files changed, 15 deletions(-)

Comparison
==========
initial initial last penup first
good-v4.12 bad-16f73eb02d7e good-6d311fa7 good-d06c587d bad-5c279bd9
User min 0.06 ( 0.00%) 0.14 (-133.33%) 0.14 (-133.33%) 0.06 ( 0.00%) 0.19 (-216.67%)
User mean 0.06 ( 0.00%) 0.14 (-133.33%) 0.14 (-133.33%) 0.06 ( 0.00%) 0.19 (-216.67%)
User stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
User coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
User max 0.06 ( 0.00%) 0.14 (-133.33%) 0.14 (-133.33%) 0.06 ( 0.00%) 0.19 (-216.67%)
System min 10.04 ( 0.00%) 10.75 ( -7.07%) 10.05 ( -0.10%) 10.16 ( -1.20%) 10.73 ( -6.87%)
System mean 10.04 ( 0.00%) 10.75 ( -7.07%) 10.05 ( -0.10%) 10.16 ( -1.20%) 10.73 ( -6.87%)
System stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
System coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
System max 10.04 ( 0.00%) 10.75 ( -7.07%) 10.05 ( -0.10%) 10.16 ( -1.20%) 10.73 ( -6.87%)
Elapsed min 251.53 ( 0.00%) 351.05 ( -39.57%) 252.83 ( -0.52%) 252.96 ( -0.57%) 347.93 ( -38.33%)
Elapsed mean 251.53 ( 0.00%) 351.05 ( -39.57%) 252.83 ( -0.52%) 252.96 ( -0.57%) 347.93 ( -38.33%)
Elapsed stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Elapsed coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Elapsed max 251.53 ( 0.00%) 351.05 ( -39.57%) 252.83 ( -0.52%) 252.96 ( -0.57%) 347.93 ( -38.33%)
CPU min 4.00 ( 0.00%) 3.00 ( 25.00%) 4.00 ( 0.00%) 4.00 ( 0.00%) 3.00 ( 25.00%)
CPU mean 4.00 ( 0.00%) 3.00 ( 25.00%) 4.00 ( 0.00%) 4.00 ( 0.00%) 3.00 ( 25.00%)
CPU stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
CPU coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
CPU max 4.00 ( 0.00%) 3.00 ( 25.00%) 4.00 ( 0.00%) 4.00 ( 0.00%) 3.00 ( 25.00%)

The "Elapsed mean" line is what the testing and auto-bisection was paying
attention to. Commit 16f73eb02d7e is simply the head commit at the time
the continuous testing started. The first "bad commit" is the last column.

It's not the only slowdown that has been observed in other testing while
examining whether it's ok to switch to MQ by default. The biggest slowdown
observed was with a modified version of dbench4 -- the modifications use
shorter, but representative, load files to avoid timing artifacts and
report the time to complete a load file instead of throughput, as throughput
is kind of meaningless for dbench4.

dbench4 Loadfile Execution Time
4.12.0 4.12.0
legacy-cfq mq-bfq
Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%)
Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%)
Amean 4 102.72 ( 0.00%) 474.33 (-361.77%)
Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%)

The units are "milliseconds to complete a load file" so, as the thread count
increased, there were some fairly bad slowdowns. The most dramatic
slowdown was observed on a machine whose controller has an on-board cache:

4.12.0 4.12.0
legacy-cfq mq-bfq
Amean 1 289.09 ( 0.00%) 128.43 ( 55.57%)
Amean 2 491.32 ( 0.00%) 794.04 ( -61.61%)
Amean 4 875.26 ( 0.00%) 9331.79 (-966.17%)
Amean 8 2074.30 ( 0.00%) 317.79 ( 84.68%)
Amean 16 3380.47 ( 0.00%) 669.51 ( 80.19%)
Amean 32 7427.25 ( 0.00%) 8821.75 ( -18.78%)
Amean 256 53376.81 ( 0.00%) 69006.94 ( -29.28%)

The slowdown wasn't universal but at 4 threads, it was severe. There
are other examples but it'd just be a lot of noise and not change the
central point.

The major problems were all observed switching from CFQ to BFQ on single-disk
rotary storage. It's not machine-specific, as 5 separate machines noticed
problems with dbench and fio when switching to MQ on kernel 4.12. I've also
seen surprising cases of read starvation in the presence of heavy writers
when using fio to generate the workload. Jan Kara
suggested that it may be because the read workload is not being identified
as "interactive", but I didn't dig into the details myself and have zero
understanding of BFQ. I was only interested in answering the question "is
it safe to switch the default and will the performance be similar enough
to avoid bug reports?" and concluded that the answer is "no".

For what it's worth, I've noticed on SSDs that switching from legacy deadline
to mq-deadline also slowed things down, but in many cases the slowdown was small
enough that it may be tolerable and not generate many bug reports. Also,
mq-deadline appears to receive more attention so issues there are probably
going to be noticed faster.

I'm not suggesting for a second that you fix this or switch back to legacy
by default. As it's BFQ, Paolo is cc'd and it'll have to be fixed there
eventually, but you might see "workload foo is slower on 4.13" reports that
bisect to this commit. The filesystem used changes the results, but at
least btrfs, ext3, ext4 and xfs experience slowdowns.
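
If anyone wants to confirm that such a report is due to this switch, the module
option mentioned in the commit message above can be used to fall back to the
legacy path for a comparison run. As a rough sketch (exact handling of the
parameter may vary by distribution):

# Append to the kernel command line to use the legacy SCSI I/O path again
scsi_mod.use_blk_mq=0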

For Paolo, if you want to try preemptively dealing with regression reports
before 4.13 is released then all the tests in question can be reproduced with
https://github.com/gormanm/mmtests . The most relevant test configurations
I've seen so far are listed below, with an example invocation after the list

configs/config-global-dhp__io-dbench4-async
configs/config-global-dhp__io-fio-randread-async-randwrite
configs/config-global-dhp__io-fio-randread-async-seqwrite
configs/config-global-dhp__io-fio-randread-sync-heavywrite
configs/config-global-dhp__io-fio-randread-sync-randwrite
configs/config-global-dhp__pgioperf
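
As a rough sketch, one of these can be run with something like the following
(assuming the usual run-mmtests.sh entry point; check the mmtests README for
the exact invocation and where results are collected):

git clone https://github.com/gormanm/mmtests.git
cd mmtests
# edit the config first if a dedicated test partition should be formatted
./run-mmtests.sh --config configs/config-global-dhp__pgioperf test-mq-bfq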

--
Mel Gorman
SUSE Labs


2017-08-03 09:17:25

by Ming Lei

Subject: Re: Switching to MQ by default may generate some bug reports

Hi Mel Gorman,

On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <[email protected]> wrote:
> Hi Christoph,
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit it the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is

We saw some SCSI-MQ performance issue too, please see if the following
patchset fixes your issue:

http://marc.info/?l=linux-block&m=150151989915776&w=2

Thanks,
Ming

2017-08-03 09:22:15

by Paolo Valente

Subject: Re: Switching to MQ by default may generate some bug reports


> On 3 Aug 2017, at 10:51, Mel Gorman <[email protected]> wrote:
>
> Hi Christoph,
>
> I know the reasons for switching to MQ by default but just be aware that it's
> not without hazards albeit it the biggest issues I've seen are switching
> CFQ to BFQ. On my home grid, there is some experimental automatic testing
> running every few weeks searching for regressions. Yesterday, it noticed
> that creating some work files for a postgres simulator called pgioperf
> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> linearly writing two files for testing on another benchmark and is not
> remarkable. The relevant part of the report is
>
> Last good/First bad commit
> ==========================
> Last good commit: 6d311fa7d2c18659d040b9beba5e41fe24c2a6f5
> First bad commit: 5c279bd9e40624f4ab6e688671026d6005b066fa
> From 5c279bd9e40624f4ab6e688671026d6005b066fa Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <[email protected]>
> Date: Fri, 16 Jun 2017 10:27:55 +0200
> Subject: [PATCH] scsi: default to scsi-mq
> Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
> path now that we had plenty of testing, and have I/O schedulers for
> blk-mq. The module option to disable the blk-mq path is kept around for
> now.
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Martin K. Petersen <[email protected]>
> drivers/scsi/Kconfig | 11 -----------
> drivers/scsi/scsi.c | 4 ----
> 2 files changed, 15 deletions(-)
>
> Comparison
> ==========
> initial initial last penup first
> good-v4.12 bad-16f73eb02d7e good-6d311fa7 good-d06c587d bad-5c279bd9
> User min 0.06 ( 0.00%) 0.14 (-133.33%) 0.14 (-133.33%) 0.06 ( 0.00%) 0.19 (-216.67%)
> User mean 0.06 ( 0.00%) 0.14 (-133.33%) 0.14 (-133.33%) 0.06 ( 0.00%) 0.19 (-216.67%)
> User stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> User coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> User max 0.06 ( 0.00%) 0.14 (-133.33%) 0.14 (-133.33%) 0.06 ( 0.00%) 0.19 (-216.67%)
> System min 10.04 ( 0.00%) 10.75 ( -7.07%) 10.05 ( -0.10%) 10.16 ( -1.20%) 10.73 ( -6.87%)
> System mean 10.04 ( 0.00%) 10.75 ( -7.07%) 10.05 ( -0.10%) 10.16 ( -1.20%) 10.73 ( -6.87%)
> System stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> System coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> System max 10.04 ( 0.00%) 10.75 ( -7.07%) 10.05 ( -0.10%) 10.16 ( -1.20%) 10.73 ( -6.87%)
> Elapsed min 251.53 ( 0.00%) 351.05 ( -39.57%) 252.83 ( -0.52%) 252.96 ( -0.57%) 347.93 ( -38.33%)
> Elapsed mean 251.53 ( 0.00%) 351.05 ( -39.57%) 252.83 ( -0.52%) 252.96 ( -0.57%) 347.93 ( -38.33%)
> Elapsed stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> Elapsed coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> Elapsed max 251.53 ( 0.00%) 351.05 ( -39.57%) 252.83 ( -0.52%) 252.96 ( -0.57%) 347.93 ( -38.33%)
> CPU min 4.00 ( 0.00%) 3.00 ( 25.00%) 4.00 ( 0.00%) 4.00 ( 0.00%) 3.00 ( 25.00%)
> CPU mean 4.00 ( 0.00%) 3.00 ( 25.00%) 4.00 ( 0.00%) 4.00 ( 0.00%) 3.00 ( 25.00%)
> CPU stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> CPU coeffvar 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> CPU max 4.00 ( 0.00%) 3.00 ( 25.00%) 4.00 ( 0.00%) 4.00 ( 0.00%) 3.00 ( 25.00%)
>
> The "Elapsed mean" line is what the testing and auto-bisection was paying
> attention to. Commit 16f73eb02d7e is simply the head commit at the time
> the continuous testing started. The first "bad commit" is the last column.
>
> It's not the only slowdown that has been observed from other testing when
> examining whether it's ok to switch to MQ by default. The biggest slowdown
> observed was with a modified version of dbench4 -- the modifications use
> shorter, but representative, load files to avoid timing artifacts and
> reports time to complete a load file instead of throughput as throughput
> is kind of meaningless for dbench4
>
> dbench4 Loadfile Execution Time
> 4.12.0 4.12.0
> legacy-cfq mq-bfq
> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%)
> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%)
> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%)
> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%)
>
> The units are "milliseconds to complete a load file" so as thread count
> increased, there were some fairly bad slowdowns. The most dramatic
> slowdown was observed on a machine with a controller with on-board cache
>
> 4.12.0 4.12.0
> legacy-cfq mq-bfq
> Amean 1 289.09 ( 0.00%) 128.43 ( 55.57%)
> Amean 2 491.32 ( 0.00%) 794.04 ( -61.61%)
> Amean 4 875.26 ( 0.00%) 9331.79 (-966.17%)
> Amean 8 2074.30 ( 0.00%) 317.79 ( 84.68%)
> Amean 16 3380.47 ( 0.00%) 669.51 ( 80.19%)
> Amean 32 7427.25 ( 0.00%) 8821.75 ( -18.78%)
> Amean 256 53376.81 ( 0.00%) 69006.94 ( -29.28%)
>
> The slowdown wasn't universal but at 4 threads, it was severe. There
> are other examples but it'd just be a lot of noise and not change the
> central point.
>
> The major problems were all observed switching from CFQ to BFQ on single disk
> rotary storage. It's not machine specific as 5 separate machines noticed
> problems with dbench and fio when switching to MQ on kernel 4.12. Weirdly,
> I've seen cases of read starvation in the presence of heavy writers
> using fio to generate the workload which was surprising to me. Jan Kara
> suggested that it may be because the read workload is not being identified
> as "interactive" but I didn't dig into the details myself and have zero
> understanding of BFQ. I was only interested in answering the question "is
> it safe to switch the default and will the performance be similar enough
> to avoid bug reports?" and concluded that the answer is "no".
>
> For what it's worth, I've noticed on SSDs that switching from legacy-mq
> to deadline-mq also slowed down but in many cases the slowdown was small
> enough that it may be tolerable and not generate many bug reports. Also,
> mq-deadline appears to receive more attention so issues there are probably
> going to be noticed faster.
>
> I'm not suggesting for a second that you fix this or switch back to legacy
> by default because it's BFQ, Paulo is cc'd and it'll have to be fixed
> eventually but you might see "workload foo is slower on 4.13" reports that
> bisect to this commit. What filesystem is used changes the results but at
> least btrfs, ext3, ext4 and xfs experience slowdowns.
>
> For Paulo, if you want to try preemptively dealing with regression reports
> before 4.13 releases then all the tests in question can be reproduced with
> https://github.com/gormanm/mmtests . The most relevant test configurations
> I've seen so far are
>
> configs/config-global-dhp__io-dbench4-async
> configs/config-global-dhp__io-fio-randread-async-randwrite
> configs/config-global-dhp__io-fio-randread-async-seqwrite
> configs/config-global-dhp__io-fio-randread-sync-heavywrite
> configs/config-global-dhp__io-fio-randread-sync-randwrite
> configs/config-global-dhp__pgioperf
>

Hi Mel,
as already happened with the latest Phoronix benchmark article (and
with other test results reported several months ago on this list), bad
results may be caused (also) by the fact that the low-latency, default
configuration of BFQ is being used. This configuration is the default
one because the motivation for yet another scheduler such as BFQ is that it
drastically reduces latency for interactive and soft real-time tasks
(e.g., opening an app or playing/streaming a video) when there is
some background I/O. The low-latency heuristics are willing to sacrifice
throughput when this provides a large benefit in terms of the above
latency.

Things do change if, instead, one wants to use BFQ for tasks that
don't need this kind of low-latency guarantee, but only the
highest possible sustained throughput. This seems to be the case for
all the tests you have listed above. In that case, it doesn't make
much sense to leave the low-latency heuristics on: throughput can only
get worse for these tests, and the elapsed time can only increase.

To switch the low-latency heuristics off:
echo 0 > /sys/block/<dev>/queue/iosched/low_latency
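
For completeness, the active scheduler for a device can also be checked and
switched at runtime through sysfs (the device name below is only an example):

cat /sys/block/sda/queue/scheduler        # active scheduler shown in brackets
echo bfq > /sys/block/sda/queue/scheduler # select BFQ on a blk-mq queue
echo 0 > /sys/block/sda/queue/iosched/low_latency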

Of course, BFQ may not be optimal for every workload, even with
low-latency mode switched off. In addition, there may still be
bugs. I'll repeat your tests on a machine of mine ASAP.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs


2017-08-03 09:32:51

by Ming Lei

Subject: Re: Switching to MQ by default may generate some bug reports

On Thu, Aug 3, 2017 at 5:17 PM, Ming Lei <[email protected]> wrote:
> Hi Mel Gorman,
>
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <[email protected]> wrote:
>> Hi Christoph,
>>
>> I know the reasons for switching to MQ by default but just be aware that it's
>> not without hazards albeit it the biggest issues I've seen are switching
>> CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> running every few weeks searching for regressions. Yesterday, it noticed
>> that creating some work files for a postgres simulator called pgioperf
>> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> linearly writing two files for testing on another benchmark and is not
>> remarkable. The relevant part of the report is
>
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block&m=150151989915776&w=2

BTW, the above patches (V1) can be found in the following tree:

https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V1

V2 is already done but not posted yet because the performance testing
on SRP isn't complete:

https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V2
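
If anyone wants to test those branches directly, something along these lines
should work (branch names taken from the URLs above):

git remote add ming https://github.com/ming1/linux.git
git fetch ming
git checkout -b scsi-mq-dispatch-v1 ming/blk-mq-dispatch_for_scsi.V1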


Thanks,
Ming Lei

2017-08-03 09:42:45

by Mel Gorman

Subject: Re: Switching to MQ by default may generate some bug reports

On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> Hi Mel Gorman,
>
> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <[email protected]> wrote:
> > Hi Christoph,
> >
> > I know the reasons for switching to MQ by default but just be aware that it's
> > not without hazards albeit it the biggest issues I've seen are switching
> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> > running every few weeks searching for regressions. Yesterday, it noticed
> > that creating some work files for a postgres simulator called pgioperf
> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> > linearly writing two files for testing on another benchmark and is not
> > remarkable. The relevant part of the report is
>
> We saw some SCSI-MQ performance issue too, please see if the following
> patchset fixes your issue:
>
> http://marc.info/?l=linux-block&m=150151989915776&w=2
>

That series is dealing with problems with legacy-deadline vs mq-none whereas
the bulk of the problems reported in this mail are related to
legacy-CFQ vs mq-BFQ.

--
Mel Gorman
SUSE Labs

2017-08-03 09:44:12

by Paolo Valente

Subject: Re: Switching to MQ by default may generate some bug reports


> On 3 Aug 2017, at 11:42, Mel Gorman <[email protected]> wrote:
>
> On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>>
>> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <[email protected]> wrote:
>>> Hi Christoph,
>>>
>>> I know the reasons for switching to MQ by default but just be aware that it's
>>> not without hazards albeit it the biggest issues I've seen are switching
>>> CFQ to BFQ. On my home grid, there is some experimental automatic testing
>>> running every few weeks searching for regressions. Yesterday, it noticed
>>> that creating some work files for a postgres simulator called pgioperf
>>> was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>>> linearly writing two files for testing on another benchmark and is not
>>> remarkable. The relevant part of the report is
>>
>> We saw some SCSI-MQ performance issue too, please see if the following
>> patchset fixes your issue:
>>
>> http://marc.info/?l=linux-block&m=150151989915776&w=2
>>
>
> That series is dealing with problems with legacy-deadline vs mq-none where
> as the bulk of the problems reported in this mail are related to
> legacy-CFQ vs mq-BFQ.
>

Out of curiosity: do you get no regression with mq-none or mq-deadline?

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs


2017-08-03 10:08:36

by Ming Lei

Subject: Re: Switching to MQ by default may generate some bug reports

On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman <[email protected]> wrote:
> On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>>
>> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <[email protected]> wrote:
>> > Hi Christoph,
>> >
>> > I know the reasons for switching to MQ by default but just be aware that it's
>> > not without hazards albeit it the biggest issues I've seen are switching
>> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> > running every few weeks searching for regressions. Yesterday, it noticed
>> > that creating some work files for a postgres simulator called pgioperf
>> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> > linearly writing two files for testing on another benchmark and is not
>> > remarkable. The relevant part of the report is
>>
>> We saw some SCSI-MQ performance issue too, please see if the following
>> patchset fixes your issue:
>>
>> http://marc.info/?l=linux-block&m=150151989915776&w=2
>>
>
> That series is dealing with problems with legacy-deadline vs mq-none where
> as the bulk of the problems reported in this mail are related to
> legacy-CFQ vs mq-BFQ.

The series deals with none and all mq schedulers, and you can see
the improvement on mq-deadline in the cover letter. :-)

Thanks,
Ming Lei

2017-08-03 10:47:04

by Mel Gorman

Subject: Re: Switching to MQ by default may generate some bug reports

On Thu, Aug 03, 2017 at 11:44:06AM +0200, Paolo Valente wrote:
> > That series is dealing with problems with legacy-deadline vs mq-none where
> > as the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
> >
>
> Out-of-curiosity: you get no regression with mq-none or mq-deadline?
>

I didn't test mq-none as the underlying storage was not fast enough to
make a legacy-noop vs mq-none comparison meaningful. legacy-deadline vs
mq-deadline did show small regressions on some workloads, but they were not
as dramatic and were small enough that they might go unnoticed in some cases.

--
Mel Gorman
SUSE Labs

2017-08-03 10:48:02

by Mel Gorman

Subject: Re: Switching to MQ by default may generate some bug reports

On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman <[email protected]> wrote:
> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
> >> Hi Mel Gorman,
> >>
> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <[email protected]> wrote:
> >> > Hi Christoph,
> >> >
> >> > I know the reasons for switching to MQ by default but just be aware that it's
> >> > not without hazards albeit it the biggest issues I've seen are switching
> >> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
> >> > running every few weeks searching for regressions. Yesterday, it noticed
> >> > that creating some work files for a postgres simulator called pgioperf
> >> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
> >> > linearly writing two files for testing on another benchmark and is not
> >> > remarkable. The relevant part of the report is
> >>
> >> We saw some SCSI-MQ performance issue too, please see if the following
> >> patchset fixes your issue:
> >>
> >> http://marc.info/?l=linux-block&m=150151989915776&w=2
> >>
> >
> > That series is dealing with problems with legacy-deadline vs mq-none where
> > as the bulk of the problems reported in this mail are related to
> > legacy-CFQ vs mq-BFQ.
>
> The serials deals with none and all mq schedulers, and you can see
> the improvement on mq-deadline in cover letter, :-)
>

Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
that was not observed on other schedulers?

--
Mel Gorman
SUSE Labs

2017-08-03 11:01:47

by Mel Gorman

Subject: Re: Switching to MQ by default may generate some bug reports

On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote:
> > For Paulo, if you want to try preemptively dealing with regression reports
> > before 4.13 releases then all the tests in question can be reproduced with
> > https://github.com/gormanm/mmtests . The most relevant test configurations
> > I've seen so far are
> >
> > configs/config-global-dhp__io-dbench4-async
> > configs/config-global-dhp__io-fio-randread-async-randwrite
> > configs/config-global-dhp__io-fio-randread-async-seqwrite
> > configs/config-global-dhp__io-fio-randread-sync-heavywrite
> > configs/config-global-dhp__io-fio-randread-sync-randwrite
> > configs/config-global-dhp__pgioperf
> >
>
> Hi Mel,
> as it already happened with the latest Phoronix benchmark article (and
> with other test results reported several months ago on this list), bad
> results may be caused (also) by the fact that the low-latency, default
> configuration of BFQ is being used.

I took that into account: BFQ with low_latency disabled was also tested and the
impact was not a universal improvement, although it can be a noticeable
improvement. From the same machine:

dbench4 Loadfile Execution Time
4.12.0 4.12.0 4.12.0
legacy-cfq mq-bfq mq-bfq-tput
Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)

However, it's not a universal gain and there are also fairness issues.
For example, this is a fio configuration with a single random reader and
a single random writer on the same machine

fio Throughput
4.12.0 4.12.0 4.12.0
legacy-cfq mq-bfq mq-bfq-tput
Hmean kb/sec-writer-write 398.15 ( 0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%)
Hmean kb/sec-reader-read 507.00 ( 0.00%) 66.36 ( -86.91%) 14.68 ( -97.10%)
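
For reference, that reader/writer mix can be approximated with a plain fio
invocation along the following lines (illustrative only, not the exact
mmtests job file; the path and sizes are made up):

# one buffered random writer and one random reader on the same filesystem
fio --directory=/mnt/test --size=1G --runtime=60 --time_based --ioengine=psync \
    --name=writer --rw=randwrite \
    --name=reader --rw=randread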

With CFQ, there is some fairness between the readers and writers and
with BFQ, there is a strong preference to writers. Again, this is not
universal. It'll be a mix and sometimes it'll be classed as a gain and
sometimes a regression.

While I accept that BFQ can be tuned, tuning IO schedulers is not something
that normal users get right and they'll only look at "out of box" performance
which, right now, will trigger bug reports. This is neither good nor bad,
it simply is.

> This configuration is the default
> one because the motivation for yet-another-scheduler as BFQ is that it
> drastically reduces latency for interactive and soft real-time tasks
> (e.g., opening an app or playing/streaming a video), when there is
> some background I/O. Low-latency heuristics are willing to sacrifice
> throughput when this provides a large benefit in terms of the above
> latency.
>

I had seen this assertion, so one of the fio configurations had multiple
heavy writers in the background and a random reader of small files to
simulate that scenario. The intent was to simulate heavy IO in the presence
of application startup:

4.12.0 4.12.0 4.12.0
legacy-cfq mq-bfq mq-bfq-tput
Hmean kb/sec-writer-write 1997.75 ( 0.00%) 2035.65 ( 1.90%) 2014.50 ( 0.84%)
Hmean kb/sec-reader-read 128.50 ( 0.00%) 79.46 ( -38.16%) 12.78 ( -90.06%)

Write throughput is steady-ish across each IO scheduler but readers get
starved badly, which I expect would slow application startup, and disabling
low_latency makes it much worse. The mmtests configuration in question
is global-dhp__io-fio-randread-sync-heavywrite, albeit edited to create
a fresh XFS filesystem on a test partition.

This is not exactly equivalent to real application startup but that can
be difficult to quantify properly.

> Of course, BFQ may not be optimal for every workload, even if
> low-latency mode is switched off. In addition, there may still be
> some bug. I'll repeat your tests on a machine of mine ASAP.
>

The intent here is not to rag on BFQ because I know it's going to have some
wins and some losses and will take time to fix up. The primary intent was
to flag that 4.13 might have some "blah blah blah is slower on 4.13" reports
due to the switching of defaults that will bisect to a misleading commit.

--
Mel Gorman
SUSE Labs

2017-08-03 11:48:48

by Ming Lei

Subject: Re: Switching to MQ by default may generate some bug reports

On Thu, Aug 3, 2017 at 6:47 PM, Mel Gorman <[email protected]> wrote:
> On Thu, Aug 03, 2017 at 05:57:50PM +0800, Ming Lei wrote:
>> On Thu, Aug 3, 2017 at 5:42 PM, Mel Gorman <[email protected]> wrote:
>> > On Thu, Aug 03, 2017 at 05:17:21PM +0800, Ming Lei wrote:
>> >> Hi Mel Gorman,
>> >>
>> >> On Thu, Aug 3, 2017 at 4:51 PM, Mel Gorman <[email protected]> wrote:
>> >> > Hi Christoph,
>> >> >
>> >> > I know the reasons for switching to MQ by default but just be aware that it's
>> >> > not without hazards albeit it the biggest issues I've seen are switching
>> >> > CFQ to BFQ. On my home grid, there is some experimental automatic testing
>> >> > running every few weeks searching for regressions. Yesterday, it noticed
>> >> > that creating some work files for a postgres simulator called pgioperf
>> >> > was 38.33% slower and it auto-bisected to the switch to MQ. This is just
>> >> > linearly writing two files for testing on another benchmark and is not
>> >> > remarkable. The relevant part of the report is
>> >>
>> >> We saw some SCSI-MQ performance issue too, please see if the following
>> >> patchset fixes your issue:
>> >>
>> >> http://marc.info/?l=linux-block&m=150151989915776&w=2
>> >>
>> >
>> > That series is dealing with problems with legacy-deadline vs mq-none where
>> > as the bulk of the problems reported in this mail are related to
>> > legacy-CFQ vs mq-BFQ.
>>
>> The serials deals with none and all mq schedulers, and you can see
>> the improvement on mq-deadline in cover letter, :-)
>>
>
> Would it be expected to fix a 2x to 4x slowdown as experienced by BFQ
> that was not observed on other schedulers?

Actually, if you look at the cover letter, you will see this patchset
increases sequential I/O IOPS on mq-deadline by more than 10X, so it would be
reasonable to expect it to help with a 2x to 4x BFQ slowdown, but I didn't
test BFQ.

Thanks,
Ming Lei

2017-08-04 07:26:26

by Paolo Valente

Subject: Re: Switching to MQ by default may generate some bug reports


> On 3 Aug 2017, at 13:01, Mel Gorman <[email protected]> wrote:
>
> On Thu, Aug 03, 2017 at 11:21:59AM +0200, Paolo Valente wrote:
>>> For Paulo, if you want to try preemptively dealing with regression reports
>>> before 4.13 releases then all the tests in question can be reproduced with
>>> https://github.com/gormanm/mmtests . The most relevant test configurations
>>> I've seen so far are
>>>
>>> configs/config-global-dhp__io-dbench4-async
>>> configs/config-global-dhp__io-fio-randread-async-randwrite
>>> configs/config-global-dhp__io-fio-randread-async-seqwrite
>>> configs/config-global-dhp__io-fio-randread-sync-heavywrite
>>> configs/config-global-dhp__io-fio-randread-sync-randwrite
>>> configs/config-global-dhp__pgioperf
>>>
>>
>> Hi Mel,
>> as it already happened with the latest Phoronix benchmark article (and
>> with other test results reported several months ago on this list), bad
>> results may be caused (also) by the fact that the low-latency, default
>> configuration of BFQ is being used.
>
> I took that into account BFQ with low-latency was also tested and the
> impact was not a universal improvement although it can be a noticable
> improvement. From the same machine;
>
> dbench4 Loadfile Execution Time
> 4.12.0 4.12.0 4.12.0
> legacy-cfq mq-bfq mq-bfq-tput
> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>

Thanks for trying with low_latency disabled. If I read the numbers
correctly, we move from a worst case of 361% higher execution time to
a worst case of 11%, with a best case of 20% lower execution time.

I asked you about none and mq-deadline in a previous email because
we actually have a double change here: a change of the I/O stack and a
change of the scheduler, with the first change probably not irrelevant
with respect to the second.

Are we sure that part of the small losses and gains with bfq-mq-tput
isn't due to the change of I/O stack? My problem is that it may be
hard to find issues or anomalies in BFQ that justify a 5% or 11% loss
in two cases, while the same scheduler has a 4% and a 20% gain in the
other two cases.

By chance, according to what you have measured so far, is there any
test where, instead, you expect or have seen bfq-mq-tput to always
lose? I could start from there.

> However, it's not a universal gain and there are also fairness issues.
> For example, this is a fio configuration with a single random reader and
> a single random writer on the same machine
>
> fio Throughput
> 4.12.0 4.12.0 4.12.0
> legacy-cfq mq-bfq mq-bfq-tput
> Hmean kb/sec-writer-write 398.15 ( 0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%)
> Hmean kb/sec-reader-read 507.00 ( 0.00%) 66.36 ( -86.91%) 14.68 ( -97.10%)
>
> With CFQ, there is some fairness between the readers and writers and
> with BFQ, there is a strong preference to writers. Again, this is not
> universal. It'll be a mix and sometimes it'll be classed as a gain and
> sometimes a regression.
>

Yes, that's why I haven't paid too much attention to this issue so far.
I preferred to tune for maximum responsiveness and minimal
latency for soft real-time applications, rather than for reducing a kind of
unfairness about which no user has happened to complain (so far). Do you
have some real application (or a benchmark simulating a real
application) in which we can see actual problems because of this form
of unfairness? I was thinking of, e.g., two virtual machines, one
doing heavy writes and the other heavy reads. But in that case,
cgroups have to be used, and I'm not sure we would still see this
problem. Any suggestion is welcome.

In any case, if needed, changing read/write throughput ratio should
not be a problem.

> While I accept that BFQ can be tuned, tuning IO schedulers is not something
> that normal users get right and they'll only look at "out of box" performance
> which, right now, will trigger bug reports. This is neither good nor bad,
> it simply is.
>
>> This configuration is the default
>> one because the motivation for yet-another-scheduler as BFQ is that it
>> drastically reduces latency for interactive and soft real-time tasks
>> (e.g., opening an app or playing/streaming a video), when there is
>> some background I/O. Low-latency heuristics are willing to sacrifice
>> throughput when this provides a large benefit in terms of the above
>> latency.
>>
>
> I had seen this assertion so one of the fio configurations had multiple
> heavy writers in the background and a random reader of small files to
> simulate that scenario. The intent was to simulate heavy IO in the presence
> of application startup
>
> 4.12.0 4.12.0 4.12.0
> legacy-cfq mq-bfq mq-bfq-tput
> Hmean kb/sec-writer-write 1997.75 ( 0.00%) 2035.65 ( 1.90%) 2014.50 ( 0.84%)
> Hmean kb/sec-reader-read 128.50 ( 0.00%) 79.46 ( -38.16%) 12.78 ( -90.06%)
>
> Write throughput is steady-ish across each IO scheduler but readers get
> starved badly which I expect would slow application startup and disabling
> low_latency makes it much worse.

A greedy random reader that goes on steadily mimics an application startup
only for the first handful of seconds.

Where can I find the exact script/configuration you used, to check
more precisely what is going on and whether BFQ is actually behaving very
badly for some reason?

> The mmtests configuration in question
> is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
> a fresh XFS filesystem on a test partition.
>
> This is not exactly equivalent to real application startup but that can
> be difficult to quantify properly.
>

If you do want to check application startup, then just 1) start some
background workload, 2) drop caches, 3) start the app, 4) measure how
long it takes to start. Otherwise, the comm_startup_lat test in the
S suite [1] does all of this for you.

[1] https://github.com/Algodev-github/S
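
A minimal sketch of the manual procedure, with the background workload and
the application chosen purely as examples:

# 1) start some background writes
dd if=/dev/zero of=/mnt/test/bigfile bs=1M count=8192 &
# 2) drop caches so the application start-up actually hits the disk
sync; echo 3 > /proc/sys/vm/drop_caches
# 3) and 4) start the app and time how long it takes to become usable
time gnome-terminal -e /bin/true
wait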

>> Of course, BFQ may not be optimal for every workload, even if
>> low-latency mode is switched off. In addition, there may still be
>> some bug. I'll repeat your tests on a machine of mine ASAP.
>>
>
> The intent here is not to rag on BFQ because I know it's going to have some
> wins and some losses and will take time to fix up. The primary intent was
> to flag that 4.13 might have some "blah blah blah is slower on 4.13" reports
> due to the switching of defaults that will bisect to a misleading commit.
>

I see, and being ready in advance is extremely helpful for me.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs


2017-08-04 11:01:09

by Mel Gorman

Subject: Re: Switching to MQ by default may generate some bug reports

On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
> > I took that into account BFQ with low-latency was also tested and the
> > impact was not a universal improvement although it can be a noticable
> > improvement. From the same machine;
> >
> > dbench4 Loadfile Execution Time
> > 4.12.0 4.12.0 4.12.0
> > legacy-cfq mq-bfq mq-bfq-tput
> > Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
> > Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
> > Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
> > Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
> >
>
> Thanks for trying with low_latency disabled. If I read numbers
> correctly, we move from a worst case of 361% higher execution time to
> a worst case of 11%. With a best case of 20% of lower execution time.
>

Yes.

> I asked you about none and mq-deadline in a previous email, because
> actually we have a double change here: change of the I/O stack, and
> change of the scheduler, with the first change probably not irrelevant
> with respect to the second one.
>

True. However, the difference between legacy-deadline and mq-deadline is
roughly around the 5-10% mark across workloads for SSD. It's not
universally true but the impact is not as severe. While this is not
proof that the stack change is the sole root cause, it makes it less
likely.

> By chance, according to what you have measured so far, is there any
> test where, instead, you expect or have seen bfq-mq-tput to always
> lose? I could start from there.
>

global-dhp__io-fio-randread-async-randwrite-xfs, but it is marginal enough that
it could be the stack change.

global-dhp__io-dbench4-fsync-ext4 was a universal loss across all
machines tested. This is global-dhp__io-dbench4-fsync from mmtests using
ext4 as the filesystem. The same is not true for XFS, so the filesystem
matters.

> > However, it's not a universal gain and there are also fairness issues.
> > For example, this is a fio configuration with a single random reader and
> > a single random writer on the same machine
> >
> > fio Throughput
> > 4.12.0 4.12.0 4.12.0
> > legacy-cfq mq-bfq mq-bfq-tput
> > Hmean kb/sec-writer-write 398.15 ( 0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%)
> > Hmean kb/sec-reader-read 507.00 ( 0.00%) 66.36 ( -86.91%) 14.68 ( -97.10%)
> >
> > With CFQ, there is some fairness between the readers and writers and
> > with BFQ, there is a strong preference to writers. Again, this is not
> > universal. It'll be a mix and sometimes it'll be classed as a gain and
> > sometimes a regression.
> >
>
> Yes, that's why I didn't pay too much attention so far to such an
> issue. I preferred to tune for maximum responsiveness and minimal
> latency for soft real-time applications, w.r.t. to reducing a kind of
> unfairness for which no user happened to complain (so far). Do you
> have some real application (or benchmark simulating a real
> application) in which we can see actual problems because of this form
> of unfairness?

I don't have data on that. This was a preliminary study only to see if
a switch was safe running workloads that would appear in internal bug
reports related to benchmarking.

> I was thinking of, e.g., two virtual machines, one
> doing heavy writes and the other heavy reads. But in that case,
> cgroups have to be used, and I'm not sure we would still see this
> problem. Any suggestion is welcome.
>

I haven't spent time designing such a thing. Even if I did, I know I would
get hit within weeks of a switch during distro development with reports
related to fio, dbench and other basic IO benchmarks.

> > I had seen this assertion so one of the fio configurations had multiple
> > heavy writers in the background and a random reader of small files to
> > simulate that scenario. The intent was to simulate heavy IO in the presence
> > of application startup
> >
> > 4.12.0 4.12.0 4.12.0
> > legacy-cfq mq-bfq mq-bfq-tput
> > Hmean kb/sec-writer-write 1997.75 ( 0.00%) 2035.65 ( 1.90%) 2014.50 ( 0.84%)
> > Hmean kb/sec-reader-read 128.50 ( 0.00%) 79.46 ( -38.16%) 12.78 ( -90.06%)
> >
> > Write throughput is steady-ish across each IO scheduler but readers get
> > starved badly which I expect would slow application startup and disabling
> > low_latency makes it much worse.
>
> A greedy random reader that goes on steadily mimics an application startup
> only for the first handful of seconds.
>

Sure, but if during those handful of seconds the throughput is 10% of
what it used to be, it'll still be noticeable.

> Where can I find the exact script/configuration you used, to check
> more precisely what is going on and whether BFQ is actually behaving very
> badly for some reason?
>

https://github.com/gormanm/mmtests

All the configuration files are in configs/ so
global-dhp__io-dbench4-fsync-ext4 maps to global-dhp__io-dbench4-fsync, but
it has to be edited if you want to format a test partition. Otherwise,
you'd just need to make sure the current directory is on ext4 and ignore
any filesystem aging artifacts.
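
For what it's worth, the editing usually amounts to setting the test-partition
variables near the top of the config, along these lines (the variable names may
differ slightly between mmtests versions, and the partition is hypothetical):

export TESTDISK_PARTITION=/dev/sdb1   # will be reformatted
export TESTDISK_FILESYSTEM=ext4
export TESTDISK_MKFS_PARAM=""
export TESTDISK_MOUNT_ARGS=""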

> > The mmtests configuration in question
> > is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
> > a fresh XFS filesystem on a test partition.
> >
> > This is not exactly equivalent to real application startup but that can
> > be difficult to quantify properly.
> >
>
> If you do want to check application startup, then just 1) start some
> background workload, 2) drop caches, 3) start the app, 4) measure how
> long it takes to start. Otherwise, the comm_startup_lat test in the
> S suite [1] does all of this for you.
>

I did have something like this before but found it unreliable because it
couldn't tell the difference between when an application has a window
and when it's ready for use. Evolution, for example, may start up and
begin displaying, but then clicking on a mail may stall for a few seconds.
It's difficult to quantify meaningfully, which is why I eventually gave
up and relied instead on proxy measures.

--
Mel Gorman
SUSE Labs

2017-08-04 22:05:09

by Paolo Valente

Subject: Re: Switching to MQ by default may generate some bug reports


> On 4 Aug 2017, at 13:01, Mel Gorman <[email protected]> wrote:
>
> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>> I took that into account BFQ with low-latency was also tested and the
>>> impact was not a universal improvement although it can be a noticable
>>> improvement. From the same machine;
>>>
>>> dbench4 Loadfile Execution Time
>>> 4.12.0 4.12.0 4.12.0
>>> legacy-cfq mq-bfq mq-bfq-tput
>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>>>
>>
>> Thanks for trying with low_latency disabled. If I read numbers
>> correctly, we move from a worst case of 361% higher execution time to
>> a worst case of 11%. With a best case of 20% of lower execution time.
>>
>
> Yes.
>
>> I asked you about none and mq-deadline in a previous email, because
>> actually we have a double change here: change of the I/O stack, and
>> change of the scheduler, with the first change probably not irrelevant
>> with respect to the second one.
>>
>
> True. However, the difference between legacy-deadline mq-deadline is
> roughly around the 5-10% mark across workloads for SSD. It's not
> universally true but the impact is not as severe. While this is not
> proof that the stack change is the sole root cause, it makes it less
> likely.
>

I'm getting a little lost here. If I'm not mistaken, you are saying that,
since the difference between two virtually identical schedulers
(legacy-deadline and mq-deadline) is only around 5-10% while the
difference between CFQ and mq-bfq-tput is higher, in the latter
case it is not the stack's fault. Yet the loss with mq-bfq-tput in the
above test is exactly in the 5-10% range. What am I missing? Are there
other tests with mq-bfq-tput not yet reported?

>> By chance, according to what you have measured so far, is there any
>> test where, instead, you expect or have seen bfq-mq-tput to always
>> lose? I could start from there.
>>
>
> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
> it could be the stack change.
>
> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
> ext4 as a filesystem. The same is not true for XFS so the filesystem
> matters.
>

Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
soon as I can, thanks.


>>> However, it's not a universal gain and there are also fairness issues.
>>> For example, this is a fio configuration with a single random reader and
>>> a single random writer on the same machine
>>>
>>> fio Throughput
>>> 4.12.0 4.12.0 4.12.0
>>> legacy-cfq mq-bfq mq-bfq-tput
>>> Hmean kb/sec-writer-write 398.15 ( 0.00%) 4659.18 (1070.21%) 4934.52 (1139.37%)
>>> Hmean kb/sec-reader-read 507.00 ( 0.00%) 66.36 ( -86.91%) 14.68 ( -97.10%)
>>>
>>> With CFQ, there is some fairness between the readers and writers and
>>> with BFQ, there is a strong preference to writers. Again, this is not
>>> universal. It'll be a mix and sometimes it'll be classed as a gain and
>>> sometimes a regression.
>>>
>>
>> Yes, that's why I didn't pay too much attention so far to such an
>> issue. I preferred to tune for maximum responsiveness and minimal
>> latency for soft real-time applications, w.r.t. to reducing a kind of
>> unfairness for which no user happened to complain (so far). Do you
>> have some real application (or benchmark simulating a real
>> application) in which we can see actual problems because of this form
>> of unfairness?
>
> I don't have data on that. This was a preliminary study only to see if
> a switch was safe running workloads that would appear in internal bug
> reports related to benchmarking.
>
>> I was thinking of, e.g., two virtual machines, one
>> doing heavy writes and the other heavy reads. But in that case,
>> cgroups have to be used, and I'm not sure we would still see this
>> problem. Any suggestion is welcome.
>>
>
> I haven't spent time designing such a thing. Even if I did, I know I would
> get hit within weeks of a switch during distro development with reports
> related to fio, dbench and other basic IO benchmarks.
>

I see.

>>> I had seen this assertion so one of the fio configurations had multiple
>>> heavy writers in the background and a random reader of small files to
>>> simulate that scenario. The intent was to simulate heavy IO in the presence
>>> of application startup
>>>
>>> 4.12.0 4.12.0 4.12.0
>>> legacy-cfq mq-bfq mq-bfq-tput
>>> Hmean kb/sec-writer-write 1997.75 ( 0.00%) 2035.65 ( 1.90%) 2014.50 ( 0.84%)
>>> Hmean kb/sec-reader-read 128.50 ( 0.00%) 79.46 ( -38.16%) 12.78 ( -90.06%)
>>>
>>> Write throughput is steady-ish across each IO scheduler but readers get
>>> starved badly which I expect would slow application startup and disabling
>>> low_latency makes it much worse.
>>
>> A greedy random reader that goes on steadily mimics an application startup
>> only for the first handful of seconds.
>>
>
> Sure, but if during those handful of seconds the throughput is 10% of
> what is used to be, it'll still be noticeable.
>

I have not yet had time to repeat this test (I will try soon), but
I did have time to think about it a little. I soon realized that
this is not really a responsiveness test against a background
workload, or at most it is an extreme corner case of one. Both the
write and the read thread start at the same time. So we are
mimicking a user starting, e.g., a file copy and, exactly at the same
time, an app (and, in addition, the file copy starts to cause heavy writes
immediately).

BFQ uses time patterns to guess which processes to privilege, and the
time patterns of the writer and reader are indistinguishable here.
Only tagging processes with extra information would help, but that is
a different story. And in this case tagging would help for a
not-so-frequent use case.

In addition, a greedy random reader may mimic the start-up of only
very simple applications. Even a simple terminal such as xterm does
some I/O (not completely random, but I guess we don't need to be
overpicky), then it stops doing I/O and passes the ball to the X
server, which does some I/O, stops and passes the ball back to xterm
for its final start-up phase. More and more processes are involved,
and more and more complex I/O patterns are issued as applications
become more complex. This is the reason why we strived to benchmark
application start-up by truly starting real applications and measuring
their start-up time (see below).

>> Where can I find the exact script/configuration you used, to check
>> more precisely what is going on and whether BFQ is actually behaving very
>> badly for some reason?
>>
>
> https://github.com/gormanm/mmtests
>
> All the configuration files are in configs/ so
> global-dhp__io-dbench4-fsync-ext4 maps to global-dhp__io-dbench4-fsync but
> it has to be editted if you want to format a test partition. Otherwise,
> you'd just need to make sure the current directory was ext4 and ignore
> any filesystem aging artifacts.
>

Thank you, I'll do it ASAP.

>>> The mmtests configuration in question
>>> is global-dhp__io-fio-randread-sync-heavywrite albeit editted to create
>>> a fresh XFS filesystem on a test partition.
>>>
>>> This is not exactly equivalent to real application startup but that can
>>> be difficult to quantify properly.
>>>
>>
>> If you do want to check application startup, then just 1) start some
>> background workload, 2) drop caches, 3) start the app, 4) measure how
>> long it takes to start. Otherwise, the comm_startup_lat test in the
>> S suite [1] does all of this for you.
>>
>
> I did have something like this before but found it unreliable because it
> couldn't tell the difference between when an application has a window
> and when it's ready for use. Evolution for example may start up and
> start displaing but then clicking on a mail may stall for a few seconds.
> It's difficult to quantify meaningfully which is why I eventually gave
> up and relied instead on proxy measures.
>

Right, that's why we looked for other applications that are just as
popular, but for which we could get reliable and precise measures.
One such application is a terminal, another a shell. At the
opposite end of the size spectrum, other such applications are
libreoffice/openoffice.

For, e.g., gnome-terminal, it is enough to invoke "time gnome-terminal
-e /bin/true". By the stopwatch, such a command measures very
precisely the time that elapses from when you start the terminal to
when you can start typing a command in its window. Similarly, "xterm
/bin/true", "ssh localhost exit", "bash -c exit", "lowriter
--terminate-after-init". Of course, these tricks certainly cause a
few more block reads than the real, bare application start-up, but,
even if the difference were noticeable in terms of time, what matters
is to measure the execution time of these commands without background
workload, and then compare it against their execution time with some
background workload. If it takes, say, 5 seconds without background
workload, and still about 5 seconds with background workload and a
given scheduler, but, with another scheduler, it takes 40 seconds with
background workload (all real numbers, actually), then you can draw
some sound conclusions on responsiveness for each of the two
schedulers.
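
As a concrete sketch of that comparison (the background writer and the target
command are only examples):

# without background workload
sync; echo 3 > /proc/sys/vm/drop_caches
time xterm /bin/true

# with a heavy background writer
dd if=/dev/zero of=bigfile bs=1M count=8192 & DD_PID=$!
sync; echo 3 > /proc/sys/vm/drop_caches
time xterm /bin/true
kill $DD_PID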

In addition, as for coverage, we made the empirical assumption that
the start-up time measured with each of the above easy-to-benchmark
applications gives an idea of the time it would take with any
application of the same size and complexity. User feedback has confirmed
this assumption so far. Of course there may well be exceptions.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs


2017-08-05 11:54:09

by Mel Gorman

Subject: Re: Switching to MQ by default may generate some bug reports

On Sat, Aug 05, 2017 at 12:05:00AM +0200, Paolo Valente wrote:
> >
> > True. However, the difference between legacy-deadline mq-deadline is
> > roughly around the 5-10% mark across workloads for SSD. It's not
> > universally true but the impact is not as severe. While this is not
> > proof that the stack change is the sole root cause, it makes it less
> > likely.
> >
>
> I'm getting a little lost here. If I'm not mistaken, you are saying,
> since the difference between two virtually identical schedulers
> (legacy-deadline and mq-deadline) is only around 5-10%, while the
> difference between cfq and mq-bfq-tput is higher, then in the latter
> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the
> above test is exactly in the 5-10% range? What am I missing? Other
> tests with mq-bfq-tput not yet reported?
>

Unfortunately it's due to very broad generalisations. 10 configurations
from mmtests were used in total when I was checking this. Multiply those by
4 for each tested filesystem and then multiply again for each IO scheduler
across a total of 7 machines, taking 3-4 weeks to execute all tests. The deltas
between each configuration on different machines vary a lot. It's also
an impractical amount of information to present and discuss, and the
point of the original mail was to highlight that switching the default
may create some bug reports so that no one is too surprised or panics.

The general trend observed was that legacy-deadline vs mq-deadline generally
showed a small regression switching to mq-deadline but it was not universal
and it wasn't consistent. If nothing else, IO tests that are borderline
are difficult to test for significance as distributions are multimodal.
However, it was generally close enough to conclude "this could be tolerated
and more mq work is on the way". However, it's impossible to give a precise
range of how much of a hit it would take but it generally seemed to be
around the 5% mark.

CFQ switching to BFQ was often more dramatic. Sometimes it doesn't really
matter and sometimes turning off low_latency helped enough. bonnie, which
is a single IO issuer, didn't show much difference in throughput. It had
a few problems with file create/delete, but the absolute times there are
so small that tiny differences look relatively large and were ignored.
For the moment, I'll be temporarily ignoring bonnie because it was a
sniff-test only and I didn't expect many surprises from a single IO issuer.

The workload that cropped up as being most alarming was dbench, which is ironic
given that it's not actually that IO intensive and tends to be limited by
fsync times. The benchmark has a number of other weaknesses. It's more
often dominated by scheduler performance, can be gamed by starving all
but one thread of IO to give "better" results, and is sensitive to the
exact timing of when writeback occurs, which mmtests tries to mitigate by
reducing the loadfile size. If it turns out that it's the only benchmark
that really suffers then I think we would live with it or find ways of tuning
around it, but fio concerned me.

The fio results were a concern because of the different read/write
throughputs, and because it was not consistently reads or writes that were
favoured. Such changes are not necessarily good or bad, but I've seen in
the past that starved writes tend to impact workloads that periodically
fsync dirty data (think databases) and had to be tuned around by reducing
dirty_ratio. I've also seen cases where syncing of metadata on some
filesystems would cause large stalls if there was a lot of write starvation.
I regretted not adding pgioperf (a basic simulator of postgres IO behaviour)
to the original set of tests because it tends to be very good at detecting
fsync stalls due to write starvation.
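
The dirty_ratio tuning referred to above is the usual writeback sysctl;
the values below are purely illustrative, not a recommendation:

  sysctl vm.dirty_ratio vm.dirty_background_ratio
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10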

> > <SNIP>
> > Sure, but if during those handful of seconds the throughput is 10% of
> > what is used to be, it'll still be noticeable.
> >
>
> I did not have the time yet to repeat this test (I will try soon), but
> I had the time think about it a little bit. And I soon realized that
> actually this is not a responsiveness test against background
> workload, or, it is at most an extreme corner case for it. Both the
> write and the read thread start at the same time. So, we are
> mimicking a user starting, e.g., a file copy, and, exactly at the same
> time, an app(in addition, the file copy starts to cause heavy writes
> immediately).
>

Yes, although it's not entirely unrealistic to have light random readers
and heavy writers starting at the same time. A write-intensive database
can behave like this.
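
That kind of clash is easy to approximate with fio if anyone wants to
experiment; this is only an illustrative sketch, not the configuration
behind any of the numbers in this thread:

  fio --directory=/mnt/test --direct=1 --runtime=60 --time_based \
      --name=heavy-writer --rw=write --bs=1M --size=8G \
      --name=light-reader --rw=randread --bs=4k --size=1G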

Also, I wouldn't panic about needing time to repeat this test. It's not
blocking me as such; all I was interested in was checking whether the
switch could be safely made now or whether it should be deferred while
keeping an eye on how it's doing. It's perfectly possible others will make
the switch and find the majority of their workloads are fine. If others
report bugs and they're using rotary storage then it should be obvious to
ask them to test with the legacy block layer and work from there. At least
then, there would be better reference workloads to work from. Unfortunately,
given the scope and the time it takes to test, I had little choice except
to shotgun a few workloads and see what happened.

> BFQ uses time patterns to guess which processes to privilege, and the
> time patterns of the writer and reader are indistinguishable here.
> Only tagging processes with extra information would help, but that is
> a different story. And in this case tagging would help for a
> not-so-frequent use case.
>

Hopefully there will not be a reliance on tagging processes. If we're
lucky, I just happened to pick a few IO workloads that seemed to suffer
particularly badly.

> In addition, a greedy random reader may mimick the start-up of only
> very simple applications. Even a simple terminal such as xterm does
> some I/O (not completely random, but I guess we don't need to be
> overpicky), then it stops doing I/O and passes the ball to the X
> server, which does some I/O, stops and passes the ball back to xterm
> for its final start-up phase. More and more processes are involved,
> and more and more complex I/O patterns are issued as applications
> become more complex. This is the reason why we strived to benchmark
> application start-up by truly starting real applications and measuring
> their start-up time (see below).
>

Which is fair enough, I can't argue with that. Again, the intent here is
not to rag on BFQ. I had a few configurations that looked alarming, which I
sometimes use as an early warning that complex workloads may have problems
that are harder to debug. It's not always true; sometimes the early warnings
are red herrings. I've had a long dislike for dbench4 too, but each time I
got rid of it, it showed up again in some random bug report, which is the
only reason I included it in this evaluation.

> > I did have something like this before but found it unreliable because it
> > couldn't tell the difference between when an application has a window
> > and when it's ready for use. Evolution for example may start up and
> > start displaing but then clicking on a mail may stall for a few seconds.
> > It's difficult to quantify meaningfully which is why I eventually gave
> > up and relied instead on proxy measures.
> >
>
> Right, that's why we looked for other applications that were as
> popular, but for which we could get reliable and precise measures.
> One such application is a terminal, another one a shell. On the
> opposite end of the size spectrum, another other such applications are
> libreoffice/openoffice.
>

Seems reasonable.

> For, e.g, gnome-terminal, it is enough to invoke "time gnome-terminal
> -e /bin/true". By the stopwatch, such a command measures very
> precisely the time that elapses from when you start the terminal, to
> when you can start typing a command in its window. Similarly, "xterm
> /bin/true", "ssh localhost exit", "bash -c exit", "lowriter
> --terminate-after-init". Of course, these tricks certainly cause a
> few more block reads than the real, bare application start-up, but,
> even if the difference were noticeable in terms of time, what matters
> is to measure the execution time of these commands without background
> workload, and then compare it against their execution time with some
> background workload. If it takes, say, 5 seconds without background
> workload, and still about 5 seconds with background workload and a
> given scheduler, but, with another scheduler, it takes 40 seconds with
> background workload (all real numbers, actually), then you can draw
> some sound conclusion on responsiveness for the each of the two
> schedulers.
>

Again, that is a fair enough methodology and will work in many cases.
It's somewhat impractical for me though. When I'm checking patches (be they
new patches I developed, backports, or new kernels), I'm usually checking a
range of workloads across multiple machines, and it's only when I'm doing
live analysis of a problem that I'm directly using a machine.
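
For completeness, the kind of measurement described above can be scripted
in a few lines; a rough sketch, assuming root (for the cache drop), xterm
installed, and an arbitrary file name and size for the background writer:

  dd if=/dev/zero of=/mnt/test/ballast bs=1M count=16384 conv=fdatasync &
  sync; echo 3 > /proc/sys/vm/drop_caches   # cold-cache start-up
  time xterm /bin/true                      # repeat without the writer to compare
  wait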

> In addition, as for coverage, we made the empiric assumption that
> start-up time measured with each of the above easy-to-benchmark
> applications gives an idea of the time that it would take with any
> application of the same size and complexity. User feedback confirmed
> this assumptions so far. Of course there may well be exceptions.
>

FWIW, I also have anecdotal evidence from at least one user that using
BFQ is way better on their desktop than CFQ ever was, even under the best
of circumstances. I've had problems directly measuring it empirically, but
this was also the first time I switched on BFQ to see what fell out, so
it's early days yet.

--
Mel Gorman
SUSE Labs

2017-08-07 17:32:48

by Paolo Valente

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports


> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <[email protected]> ha scritto:
>
>>
>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <[email protected]> ha scritto:
>>
>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>> I took that into account BFQ with low-latency was also tested and the
>>>> impact was not a universal improvement although it can be a noticable
>>>> improvement. From the same machine;
>>>>
>>>> dbench4 Loadfile Execution Time
>>>> 4.12.0 4.12.0 4.12.0
>>>> legacy-cfq mq-bfq mq-bfq-tput
>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>>>>
>>>
>>> Thanks for trying with low_latency disabled. If I read numbers
>>> correctly, we move from a worst case of 361% higher execution time to
>>> a worst case of 11%. With a best case of 20% of lower execution time.
>>>
>>
>> Yes.
>>
>>> I asked you about none and mq-deadline in a previous email, because
>>> actually we have a double change here: change of the I/O stack, and
>>> change of the scheduler, with the first change probably not irrelevant
>>> with respect to the second one.
>>>
>>
>> True. However, the difference between legacy-deadline mq-deadline is
>> roughly around the 5-10% mark across workloads for SSD. It's not
>> universally true but the impact is not as severe. While this is not
>> proof that the stack change is the sole root cause, it makes it less
>> likely.
>>
>
> I'm getting a little lost here. If I'm not mistaken, you are saying,
> since the difference between two virtually identical schedulers
> (legacy-deadline and mq-deadline) is only around 5-10%, while the
> difference between cfq and mq-bfq-tput is higher, then in the latter
> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the
> above test is exactly in the 5-10% range? What am I missing? Other
> tests with mq-bfq-tput not yet reported?
>
>>> By chance, according to what you have measured so far, is there any
>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>> lose? I could start from there.
>>>
>>
>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>> it could be the stack change.
>>
>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>> matters.
>>
>
> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
> soon as I can, thanks.
>
>

I've run this test and tried to further investigate this regression.
For the moment, the gist seems to be that blk-mq plays an important
role, not only with bfq (unless I'm considering the wrong numbers).
Even if your main purpose in this thread was just to give a heads-up,
I guess it may be useful to share what I have found out. In addition,
I want to ask for some help, to try to get closer to the possible
causes of at least this regression. If you think it would be better
to open a new thread on this stuff, I'll do it.

First, I got mixed results on my system. I'll focus only on the
case where mq-bfq-tput achieves its worst relative performance w.r.t.
cfq, which happens with 64 clients. Still, even in this case
mq-bfq is better than cfq in all average values but Flush. I don't
know which are the best/right values to look at, so here's the final
report for both schedulers:

CFQ

Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13120 20.069 348.594
Close 133696 0.008 14.642
LockX 512 0.009 0.059
Rename 7552 1.857 415.418
ReadX 270720 0.141 535.632
WriteX 89591 421.961 6363.271
Unlink 34048 1.281 662.467
UnlockX 512 0.007 0.057
FIND_FIRST 62016 0.086 25.060
SET_FILE_INFORMATION 15616 0.995 176.621
QUERY_FILE_INFORMATION 28734 0.004 1.372
QUERY_PATH_INFORMATION 170240 0.163 820.292
QUERY_FS_INFORMATION 28736 0.017 4.110
NTCreateX 178688 0.437 905.567

MQ-BFQ-TPUT

Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13504 75.828 11196.035
Close 136896 0.004 3.855
LockX 640 0.005 0.031
Rename 8064 1.020 288.989
ReadX 297600 0.081 685.850
WriteX 93515 391.637 12681.517
Unlink 34880 0.500 146.928
UnlockX 640 0.004 0.032
FIND_FIRST 63680 0.045 222.491
SET_FILE_INFORMATION 16000 0.436 686.115
QUERY_FILE_INFORMATION 30464 0.003 0.773
QUERY_PATH_INFORMATION 175552 0.044 148.449
QUERY_FS_INFORMATION 29888 0.009 1.984
NTCreateX 183152 0.289 300.867

Are these results in line with yours for this test?

Anyway, to investigate this regression more in depth, I took two
further steps. First, I repeated the same test with bfq-sq, my
out-of-tree version of bfq for legacy block (identical to mq-bfq apart
from the changes needed for bfq to live in blk-mq). I got:

BFQ-SQ-TPUT

Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 12618 30.212 484.099
Close 123884 0.008 10.477
LockX 512 0.010 0.170
Rename 7296 2.032 426.409
ReadX 262179 0.251 985.478
WriteX 84072 461.398 7283.003
Unlink 33076 1.685 848.734
UnlockX 512 0.007 0.036
FIND_FIRST 58690 0.096 220.720
SET_FILE_INFORMATION 14976 1.792 466.435
QUERY_FILE_INFORMATION 26575 0.004 2.194
QUERY_PATH_INFORMATION 158125 0.112 614.063
QUERY_FS_INFORMATION 28224 0.017 1.385
NTCreateX 167877 0.827 945.644

So, the worst-case regression is now around 15%. This made me suspect
that blk-mq influences results a lot for this test. To crosscheck, I
compared legacy-deadline and mq-deadline too.

LEGACY-DEADLINE

Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13267 9.622 298.206
Close 135692 0.007 10.627
LockX 640 0.008 0.066
Rename 7827 0.544 481.123
ReadX 285929 0.220 2698.442
WriteX 92309 430.867 5191.608
Unlink 34534 1.133 619.235
UnlockX 640 0.008 0.724
FIND_FIRST 63289 0.086 56.851
SET_FILE_INFORMATION 16000 1.254 844.065
QUERY_FILE_INFORMATION 29883 0.004 0.618
QUERY_PATH_INFORMATION 173232 0.089 1295.651
QUERY_FS_INFORMATION 29632 0.017 4.813
NTCreateX 181464 0.479 2214.343


MQ-DEADLINE

Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13760 90.542 13221.495
Close 137654 0.008 27.133
LockX 640 0.009 0.115
Rename 8064 1.062 246.759
ReadX 297956 0.051 347.018
WriteX 94698 425.636 15090.020
Unlink 35077 0.580 208.462
UnlockX 640 0.007 0.291
FIND_FIRST 66630 0.566 530.339
SET_FILE_INFORMATION 16000 1.419 811.494
QUERY_FILE_INFORMATION 30717 0.004 1.108
QUERY_PATH_INFORMATION 176153 0.182 517.419
QUERY_FS_INFORMATION 30857 0.018 18.562
NTCreateX 184145 0.281 582.076

So, with both bfq and deadline there seems to be a serious regression,
especially on MaxLat, when moving from legacy block to blk-mq. The
regression is much worse with deadline, as legacy-deadline has the
lowest max latency among all the schedulers, whereas mq-deadline has
the highest one.

Regardless of the actual culprit of this regression, I would like to
investigate this issue further. In this respect, I would like to ask
for a little help: I would like to isolate the workloads generating
the highest latencies. To this end, I had a look at the loadfile
client-tiny.txt, and I still have a doubt: is every item in the
loadfile executed several times (once for each value of the number
of clients), or is it executed only once? More precisely, IIUC, for
each operation reported in the above results, there are several items
(lines) in the loadfile. So, is each of these items executed only
once?

I'm asking because, if it is executed only once, then I guess I can
find the critical tasks more easily. Finally, if it is actually
executed only once, is it expected that the latency for such a task is
one order of magnitude higher than the average latency for that group
of tasks? I mean, is such a task intrinsically much heavier, and
therefore expected to take much longer, or is the fact that its latency
is so much higher a sign that something in the kernel misbehaves for
that task?

While waiting for some feedback, I'm going to execute your test
showing great unfairness between writes and reads, and to also check
whether responsiveness does worsen if the write workload for that test
is being executed in the background.

Thanks,
Paolo

> ...
>> --
>> Mel Gorman
>> SUSE Labs


2017-08-07 17:35:39

by Paolo Valente

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports


> Il giorno 05 ago 2017, alle ore 13:54, Mel Gorman <[email protected]> ha scritto:
> ...
>
>> In addition, as for coverage, we made the empiric assumption that
>> start-up time measured with each of the above easy-to-benchmark
>> applications gives an idea of the time that it would take with any
>> application of the same size and complexity. User feedback confirmed
>> this assumptions so far. Of course there may well be exceptions.
>>
>
> FWIW, I also have anecdotal evidence from at least one user that using
> BFQ is way better on their desktop than CFQ ever was even under the best
> of circumstances. I've had problems directly measuring it empirically but
> this was also the first time I switched on BFQ to see what fell out so
> it's early days yet.
>

Yeah, I'm constantly trying (without great success so far :) ) to turn
this folklore into shared, repeatable tests and numbers. The latter
could then be reliably evaluated, questioned or defended.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs


2017-08-07 18:42:26

by Paolo Valente

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports


> Il giorno 07 ago 2017, alle ore 19:32, Paolo Valente <[email protected]> ha scritto:
>
>>
>> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <[email protected]> ha scritto:
>>
>>>
>>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <[email protected]> ha scritto:
>>>
>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>> I took that into account BFQ with low-latency was also tested and the
>>>>> impact was not a universal improvement although it can be a noticable
>>>>> improvement. From the same machine;
>>>>>
>>>>> dbench4 Loadfile Execution Time
>>>>> 4.12.0 4.12.0 4.12.0
>>>>> legacy-cfq mq-bfq mq-bfq-tput
>>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
>>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
>>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
>>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>>>>>
>>>>
>>>> Thanks for trying with low_latency disabled. If I read numbers
>>>> correctly, we move from a worst case of 361% higher execution time to
>>>> a worst case of 11%. With a best case of 20% of lower execution time.
>>>>
>>>
>>> Yes.
>>>
>>>> I asked you about none and mq-deadline in a previous email, because
>>>> actually we have a double change here: change of the I/O stack, and
>>>> change of the scheduler, with the first change probably not irrelevant
>>>> with respect to the second one.
>>>>
>>>
>>> True. However, the difference between legacy-deadline mq-deadline is
>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>> universally true but the impact is not as severe. While this is not
>>> proof that the stack change is the sole root cause, it makes it less
>>> likely.
>>>
>>
>> I'm getting a little lost here. If I'm not mistaken, you are saying,
>> since the difference between two virtually identical schedulers
>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>> difference between cfq and mq-bfq-tput is higher, then in the latter
>> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the
>> above test is exactly in the 5-10% range? What am I missing? Other
>> tests with mq-bfq-tput not yet reported?
>>
>>>> By chance, according to what you have measured so far, is there any
>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>> lose? I could start from there.
>>>>
>>>
>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>> it could be the stack change.
>>>
>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>> matters.
>>>
>>
>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>> soon as I can, thanks.
>>
>>
>
> I've run this test and tried to further investigate this regression.
> For the moment, the gist seems to be that blk-mq plays an important
> role, not only with bfq (unless I'm considering the wrong numbers).
> Even if your main purpose in this thread was just to give a heads-up,
> I guess it may be useful to share what I have found out. In addition,
> I want to ask for some help, to try to get closer to the possible
> causes of at least this regression. If you think it would be better
> to open a new thread on this stuff, I'll do it.
>
> First, I got mixed results on my system. I'll focus only on the the
> case where mq-bfq-tput achieves its worst relative performance w.r.t.
> to cfq, which happens with 64 clients. Still, also in this case
> mq-bfq is better than cfq in all average values, but Flush. I don't
> know which are the best/right values to look at, so, here's the final
> report for both schedulers:
>
> CFQ
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13120 20.069 348.594
> Close 133696 0.008 14.642
> LockX 512 0.009 0.059
> Rename 7552 1.857 415.418
> ReadX 270720 0.141 535.632
> WriteX 89591 421.961 6363.271
> Unlink 34048 1.281 662.467
> UnlockX 512 0.007 0.057
> FIND_FIRST 62016 0.086 25.060
> SET_FILE_INFORMATION 15616 0.995 176.621
> QUERY_FILE_INFORMATION 28734 0.004 1.372
> QUERY_PATH_INFORMATION 170240 0.163 820.292
> QUERY_FS_INFORMATION 28736 0.017 4.110
> NTCreateX 178688 0.437 905.567
>
> MQ-BFQ-TPUT
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13504 75.828 11196.035
> Close 136896 0.004 3.855
> LockX 640 0.005 0.031
> Rename 8064 1.020 288.989
> ReadX 297600 0.081 685.850
> WriteX 93515 391.637 12681.517
> Unlink 34880 0.500 146.928
> UnlockX 640 0.004 0.032
> FIND_FIRST 63680 0.045 222.491
> SET_FILE_INFORMATION 16000 0.436 686.115
> QUERY_FILE_INFORMATION 30464 0.003 0.773
> QUERY_PATH_INFORMATION 175552 0.044 148.449
> QUERY_FS_INFORMATION 29888 0.009 1.984
> NTCreateX 183152 0.289 300.867
>
> Are these results in line with yours for this test?
>
> Anyway, to investigate this regression more in depth, I took two
> further steps. First, I repeated the same test with bfq-sq, my
> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
> from the changes needed for bfq to live in blk-mq). I got:
>
> BFQ-SQ-TPUT
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 12618 30.212 484.099
> Close 123884 0.008 10.477
> LockX 512 0.010 0.170
> Rename 7296 2.032 426.409
> ReadX 262179 0.251 985.478
> WriteX 84072 461.398 7283.003
> Unlink 33076 1.685 848.734
> UnlockX 512 0.007 0.036
> FIND_FIRST 58690 0.096 220.720
> SET_FILE_INFORMATION 14976 1.792 466.435
> QUERY_FILE_INFORMATION 26575 0.004 2.194
> QUERY_PATH_INFORMATION 158125 0.112 614.063
> QUERY_FS_INFORMATION 28224 0.017 1.385
> NTCreateX 167877 0.827 945.644
>
> So, the worst-case regression is now around 15%. This made me suspect
> that blk-mq influences results a lot for this test. To crosscheck, I
> compared legacy-deadline and mq-deadline too.
>

Ok, found the problem behind the 15% loss in bfq-sq. bfq-sq occasionally
gets confused by the workload, and grants device idling to processes
that, for this specific workload, would be better de-scheduled
immediately. If we set slice_idle to 0, then bfq-sq becomes more or less
equivalent to cfq (for some operations apparently even much better):

bfq-sq-tput-0idle

Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13013 17.888 280.517
Close 133004 0.008 20.698
LockX 512 0.008 0.088
Rename 7427 2.041 193.232
ReadX 270534 0.138 408.534
WriteX 88598 429.615 6272.212
Unlink 33734 1.205 559.152
UnlockX 512 0.011 1.808
FIND_FIRST 61762 0.087 23.012
SET_FILE_INFORMATION 15337 1.322 220.155
QUERY_FILE_INFORMATION 28415 0.004 0.559
QUERY_PATH_INFORMATION 169423 0.150 580.570
QUERY_FS_INFORMATION 28547 0.019 24.466
NTCreateX 177618 0.544 681.795
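
For reference, the only tuning for this run was the usual iosched
attribute (sdX is a placeholder for the tested device):

  echo 0 > /sys/block/sdX/queue/iosched/slice_idle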

I'll try soon with mq-bfq too, for which I expect however a deeper
investigation to be needed.

Thanks,
Paolo

> LEGACY-DEADLINE
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13267 9.622 298.206
> Close 135692 0.007 10.627
> LockX 640 0.008 0.066
> Rename 7827 0.544 481.123
> ReadX 285929 0.220 2698.442
> WriteX 92309 430.867 5191.608
> Unlink 34534 1.133 619.235
> UnlockX 640 0.008 0.724
> FIND_FIRST 63289 0.086 56.851
> SET_FILE_INFORMATION 16000 1.254 844.065
> QUERY_FILE_INFORMATION 29883 0.004 0.618
> QUERY_PATH_INFORMATION 173232 0.089 1295.651
> QUERY_FS_INFORMATION 29632 0.017 4.813
> NTCreateX 181464 0.479 2214.343
>
>
> MQ-DEADLINE
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13760 90.542 13221.495
> Close 137654 0.008 27.133
> LockX 640 0.009 0.115
> Rename 8064 1.062 246.759
> ReadX 297956 0.051 347.018
> WriteX 94698 425.636 15090.020
> Unlink 35077 0.580 208.462
> UnlockX 640 0.007 0.291
> FIND_FIRST 66630 0.566 530.339
> SET_FILE_INFORMATION 16000 1.419 811.494
> QUERY_FILE_INFORMATION 30717 0.004 1.108
> QUERY_PATH_INFORMATION 176153 0.182 517.419
> QUERY_FS_INFORMATION 30857 0.018 18.562
> NTCreateX 184145 0.281 582.076
>
> So, with both bfq and deadline there seems to be a serious regression,
> especially on MaxLat, when moving from legacy block to blk-mq. The
> regression is much worse with deadline, as legacy-deadline has the
> lowest max latency among all the schedulers, whereas mq-deadline has
> the highest one.
>
> Regardless of the actual culprit of this regression, I would like to
> investigate further this issue. In this respect, I would like to ask
> for a little help. I would like to isolate the workloads generating
> the highest latencies. To this purpose, I had a look at the loadfile
> client-tiny.txt, and I still have a doubt: is every item in the
> loadfile executed somehow several times (for each value of the number
> of clients), or is it executed only once? More precisely, IIUC, for
> each operation reported in the above results, there are several items
> (lines) in the loadfile. So, is each of these items executed only
> once?
>
> I'm asking because, if it is executed only once, then I guess I can
> find the critical tasks ore easily. Finally, if it is actually
> executed only once, is it expected that the latency for such a task is
> one order of magnitude higher than that of the average latency for
> that group of tasks? I mean, is such a task intrinsically much
> heavier, and then expectedly much longer, or is the fact that latency
> is much higher for this task a sign that something in the kernel
> misbehaves for that task?
>
> While waiting for some feedback, I'm going to execute your test
> showing great unfairness between writes and reads, and to also check
> whether responsiveness does worsen if the write workload for that test
> is being executed in the background.
>
> Thanks,
> Paolo
>
>> ...
>>> --
>>> Mel Gorman
>>> SUSE Labs


2017-08-08 08:06:09

by Paolo Valente

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports


> Il giorno 07 ago 2017, alle ore 20:42, Paolo Valente <[email protected]> ha scritto:
>
>>
>> Il giorno 07 ago 2017, alle ore 19:32, Paolo Valente <[email protected]> ha scritto:
>>
>>>
>>> Il giorno 05 ago 2017, alle ore 00:05, Paolo Valente <[email protected]> ha scritto:
>>>
>>>>
>>>> Il giorno 04 ago 2017, alle ore 13:01, Mel Gorman <[email protected]> ha scritto:
>>>>
>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>>> I took that into account BFQ with low-latency was also tested and the
>>>>>> impact was not a universal improvement although it can be a noticable
>>>>>> improvement. From the same machine;
>>>>>>
>>>>>> dbench4 Loadfile Execution Time
>>>>>> 4.12.0 4.12.0 4.12.0
>>>>>> legacy-cfq mq-bfq mq-bfq-tput
>>>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
>>>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
>>>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
>>>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>>>>>>
>>>>>
>>>>> Thanks for trying with low_latency disabled. If I read numbers
>>>>> correctly, we move from a worst case of 361% higher execution time to
>>>>> a worst case of 11%. With a best case of 20% of lower execution time.
>>>>>
>>>>
>>>> Yes.
>>>>
>>>>> I asked you about none and mq-deadline in a previous email, because
>>>>> actually we have a double change here: change of the I/O stack, and
>>>>> change of the scheduler, with the first change probably not irrelevant
>>>>> with respect to the second one.
>>>>>
>>>>
>>>> True. However, the difference between legacy-deadline mq-deadline is
>>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>>> universally true but the impact is not as severe. While this is not
>>>> proof that the stack change is the sole root cause, it makes it less
>>>> likely.
>>>>
>>>
>>> I'm getting a little lost here. If I'm not mistaken, you are saying,
>>> since the difference between two virtually identical schedulers
>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>>> difference between cfq and mq-bfq-tput is higher, then in the latter
>>> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the
>>> above test is exactly in the 5-10% range? What am I missing? Other
>>> tests with mq-bfq-tput not yet reported?
>>>
>>>>> By chance, according to what you have measured so far, is there any
>>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>>> lose? I could start from there.
>>>>>
>>>>
>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>>> it could be the stack change.
>>>>
>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>> matters.
>>>>
>>>
>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>> soon as I can, thanks.
>>>
>>>
>>
>> I've run this test and tried to further investigate this regression.
>> For the moment, the gist seems to be that blk-mq plays an important
>> role, not only with bfq (unless I'm considering the wrong numbers).
>> Even if your main purpose in this thread was just to give a heads-up,
>> I guess it may be useful to share what I have found out. In addition,
>> I want to ask for some help, to try to get closer to the possible
>> causes of at least this regression. If you think it would be better
>> to open a new thread on this stuff, I'll do it.
>>
>> First, I got mixed results on my system. I'll focus only on the the
>> case where mq-bfq-tput achieves its worst relative performance w.r.t.
>> to cfq, which happens with 64 clients. Still, also in this case
>> mq-bfq is better than cfq in all average values, but Flush. I don't
>> know which are the best/right values to look at, so, here's the final
>> report for both schedulers:
>>
>> CFQ
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13120 20.069 348.594
>> Close 133696 0.008 14.642
>> LockX 512 0.009 0.059
>> Rename 7552 1.857 415.418
>> ReadX 270720 0.141 535.632
>> WriteX 89591 421.961 6363.271
>> Unlink 34048 1.281 662.467
>> UnlockX 512 0.007 0.057
>> FIND_FIRST 62016 0.086 25.060
>> SET_FILE_INFORMATION 15616 0.995 176.621
>> QUERY_FILE_INFORMATION 28734 0.004 1.372
>> QUERY_PATH_INFORMATION 170240 0.163 820.292
>> QUERY_FS_INFORMATION 28736 0.017 4.110
>> NTCreateX 178688 0.437 905.567
>>
>> MQ-BFQ-TPUT
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13504 75.828 11196.035
>> Close 136896 0.004 3.855
>> LockX 640 0.005 0.031
>> Rename 8064 1.020 288.989
>> ReadX 297600 0.081 685.850
>> WriteX 93515 391.637 12681.517
>> Unlink 34880 0.500 146.928
>> UnlockX 640 0.004 0.032
>> FIND_FIRST 63680 0.045 222.491
>> SET_FILE_INFORMATION 16000 0.436 686.115
>> QUERY_FILE_INFORMATION 30464 0.003 0.773
>> QUERY_PATH_INFORMATION 175552 0.044 148.449
>> QUERY_FS_INFORMATION 29888 0.009 1.984
>> NTCreateX 183152 0.289 300.867
>>
>> Are these results in line with yours for this test?
>>
>> Anyway, to investigate this regression more in depth, I took two
>> further steps. First, I repeated the same test with bfq-sq, my
>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>> from the changes needed for bfq to live in blk-mq). I got:
>>
>> BFQ-SQ-TPUT
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 12618 30.212 484.099
>> Close 123884 0.008 10.477
>> LockX 512 0.010 0.170
>> Rename 7296 2.032 426.409
>> ReadX 262179 0.251 985.478
>> WriteX 84072 461.398 7283.003
>> Unlink 33076 1.685 848.734
>> UnlockX 512 0.007 0.036
>> FIND_FIRST 58690 0.096 220.720
>> SET_FILE_INFORMATION 14976 1.792 466.435
>> QUERY_FILE_INFORMATION 26575 0.004 2.194
>> QUERY_PATH_INFORMATION 158125 0.112 614.063
>> QUERY_FS_INFORMATION 28224 0.017 1.385
>> NTCreateX 167877 0.827 945.644
>>
>> So, the worst-case regression is now around 15%. This made me suspect
>> that blk-mq influences results a lot for this test. To crosscheck, I
>> compared legacy-deadline and mq-deadline too.
>>
>
> Ok, found the problem for the 15% loss in bfq-sq. bfq-sq gets
> occasionally confused by the workload, and grants device idling to
> processes that, for this specific workload, would be better to
> de-schedule immediately. If we set slice_idle to 0, then bfq-sq
> becomes more or less equivalent to cfq (for some operations apparently
> even much better):
>
> bfq-sq-tput-0idle
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13013 17.888 280.517
> Close 133004 0.008 20.698
> LockX 512 0.008 0.088
> Rename 7427 2.041 193.232
> ReadX 270534 0.138 408.534
> WriteX 88598 429.615 6272.212
> Unlink 33734 1.205 559.152
> UnlockX 512 0.011 1.808
> FIND_FIRST 61762 0.087 23.012
> SET_FILE_INFORMATION 15337 1.322 220.155
> QUERY_FILE_INFORMATION 28415 0.004 0.559
> QUERY_PATH_INFORMATION 169423 0.150 580.570
> QUERY_FS_INFORMATION 28547 0.019 24.466
> NTCreateX 177618 0.544 681.795
>
> I'll try soon with mq-bfq too, for which I expect however a deeper
> investigation to be needed.
>

Hi,
to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also
applied Ming's patches, and ah, victory!

Regardless of the value of slice_idle:

mq-bfq-tput

Operation Count AvgLat MaxLat
--------------------------------------------------
Flush 13183 70.381 1025.407
Close 134539 0.004 1.011
LockX 512 0.005 0.025
Rename 7721 0.740 404.979
ReadX 274422 0.126 873.364
WriteX 90535 408.371 7400.585
Unlink 34276 0.634 581.067
UnlockX 512 0.003 0.029
FIND_FIRST 62664 0.052 321.027
SET_FILE_INFORMATION 15981 0.234 124.739
QUERY_FILE_INFORMATION 29042 0.003 1.731
QUERY_PATH_INFORMATION 171769 0.032 522.415
QUERY_FS_INFORMATION 28958 0.009 3.043
NTCreateX 179643 0.298 687.466

Throughput 9.11183 MB/sec 64 clients 64 procs max_latency=7400.588 ms

Differently from bfq-sq, setting slice_idle to 0 doesn't provide any
benefit, which leads me to suspect that there is some other issue in
blk-mq (only a suspicion). I think I may have already understood how to
guarantee that bfq almost never idles the device uselessly for this
workload too. Yet, since in blk-mq there is no gain even after excluding
useless idling, I'll wait for at least Ming's patches to be merged before
possibly proposing this contribution. Maybe some other little issue
related to this lack of gain in blk-mq will be found and solved in the
meantime.

Moving to the read-write unfairness problem.

Thanks,
Paolo

> Thanks,
> Paolo
>
>> LEGACY-DEADLINE
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13267 9.622 298.206
>> Close 135692 0.007 10.627
>> LockX 640 0.008 0.066
>> Rename 7827 0.544 481.123
>> ReadX 285929 0.220 2698.442
>> WriteX 92309 430.867 5191.608
>> Unlink 34534 1.133 619.235
>> UnlockX 640 0.008 0.724
>> FIND_FIRST 63289 0.086 56.851
>> SET_FILE_INFORMATION 16000 1.254 844.065
>> QUERY_FILE_INFORMATION 29883 0.004 0.618
>> QUERY_PATH_INFORMATION 173232 0.089 1295.651
>> QUERY_FS_INFORMATION 29632 0.017 4.813
>> NTCreateX 181464 0.479 2214.343
>>
>>
>> MQ-DEADLINE
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13760 90.542 13221.495
>> Close 137654 0.008 27.133
>> LockX 640 0.009 0.115
>> Rename 8064 1.062 246.759
>> ReadX 297956 0.051 347.018
>> WriteX 94698 425.636 15090.020
>> Unlink 35077 0.580 208.462
>> UnlockX 640 0.007 0.291
>> FIND_FIRST 66630 0.566 530.339
>> SET_FILE_INFORMATION 16000 1.419 811.494
>> QUERY_FILE_INFORMATION 30717 0.004 1.108
>> QUERY_PATH_INFORMATION 176153 0.182 517.419
>> QUERY_FS_INFORMATION 30857 0.018 18.562
>> NTCreateX 184145 0.281 582.076
>>
>> So, with both bfq and deadline there seems to be a serious regression,
>> especially on MaxLat, when moving from legacy block to blk-mq. The
>> regression is much worse with deadline, as legacy-deadline has the
>> lowest max latency among all the schedulers, whereas mq-deadline has
>> the highest one.
>>
>> Regardless of the actual culprit of this regression, I would like to
>> investigate further this issue. In this respect, I would like to ask
>> for a little help. I would like to isolate the workloads generating
>> the highest latencies. To this purpose, I had a look at the loadfile
>> client-tiny.txt, and I still have a doubt: is every item in the
>> loadfile executed somehow several times (for each value of the number
>> of clients), or is it executed only once? More precisely, IIUC, for
>> each operation reported in the above results, there are several items
>> (lines) in the loadfile. So, is each of these items executed only
>> once?
>>
>> I'm asking because, if it is executed only once, then I guess I can
>> find the critical tasks ore easily. Finally, if it is actually
>> executed only once, is it expected that the latency for such a task is
>> one order of magnitude higher than that of the average latency for
>> that group of tasks? I mean, is such a task intrinsically much
>> heavier, and then expectedly much longer, or is the fact that latency
>> is much higher for this task a sign that something in the kernel
>> misbehaves for that task?
>>
>> While waiting for some feedback, I'm going to execute your test
>> showing great unfairness between writes and reads, and to also check
>> whether responsiveness does worsen if the write workload for that test
>> is being executed in the background.
>>
>> Thanks,
>> Paolo
>>
>>> ...
>>>> --
>>>> Mel Gorman
>>>> SUSE Labs


2017-08-08 10:30:27

by Mel Gorman

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports

On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
> >> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
> >> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
> >> ext4 as a filesystem. The same is not true for XFS so the filesystem
> >> matters.
> >>
> >
> > Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
> > soon as I can, thanks.
> >
> >
>
> I've run this test and tried to further investigate this regression.
> For the moment, the gist seems to be that blk-mq plays an important
> role, not only with bfq (unless I'm considering the wrong numbers).
> Even if your main purpose in this thread was just to give a heads-up,
> I guess it may be useful to share what I have found out. In addition,
> I want to ask for some help, to try to get closer to the possible
> causes of at least this regression. If you think it would be better
> to open a new thread on this stuff, I'll do it.
>

I don't think it's necessary unless Christoph or Jens object and I doubt
they will.

> First, I got mixed results on my system.

For what it's worth, this is standard. In my experience, IO benchmarks
are always multi-modal, particularly on rotary storage. Cases of universal
win or universal loss for a scheduler or a set of tunings are rare.

> I'll focus only on the the
> case where mq-bfq-tput achieves its worst relative performance w.r.t.
> to cfq, which happens with 64 clients. Still, also in this case
> mq-bfq is better than cfq in all average values, but Flush. I don't
> know which are the best/right values to look at, so, here's the final
> report for both schedulers:
>

For what it's worth, it has often been observed that dbench overall
performance was dominated by flush costs. This is also true for the
standard reported throughput figures rather than the modified load file
elapsed time that mmtests reports. In dbench3 it was even worse where the
"performance" was dominated by whether the temporary files were deleted
before writeback started.

> CFQ
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13120 20.069 348.594
> Close 133696 0.008 14.642
> LockX 512 0.009 0.059
> Rename 7552 1.857 415.418
> ReadX 270720 0.141 535.632
> WriteX 89591 421.961 6363.271
> Unlink 34048 1.281 662.467
> UnlockX 512 0.007 0.057
> FIND_FIRST 62016 0.086 25.060
> SET_FILE_INFORMATION 15616 0.995 176.621
> QUERY_FILE_INFORMATION 28734 0.004 1.372
> QUERY_PATH_INFORMATION 170240 0.163 820.292
> QUERY_FS_INFORMATION 28736 0.017 4.110
> NTCreateX 178688 0.437 905.567
>
> MQ-BFQ-TPUT
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13504 75.828 11196.035
> Close 136896 0.004 3.855
> LockX 640 0.005 0.031
> Rename 8064 1.020 288.989
> ReadX 297600 0.081 685.850
> WriteX 93515 391.637 12681.517
> Unlink 34880 0.500 146.928
> UnlockX 640 0.004 0.032
> FIND_FIRST 63680 0.045 222.491
> SET_FILE_INFORMATION 16000 0.436 686.115
> QUERY_FILE_INFORMATION 30464 0.003 0.773
> QUERY_PATH_INFORMATION 175552 0.044 148.449
> QUERY_FS_INFORMATION 29888 0.009 1.984
> NTCreateX 183152 0.289 300.867
>
> Are these results in line with yours for this test?
>

Very broadly speaking yes, but it varies. On a small machine with only a
few CPUs, the differences in flush latency are visible but not as dramatic.
On a machine that tops out at 32 CPUs, it is more noticeable. On the one
machine I have that topped out with CFQ/BFQ at 64 threads, the flush
latency is vaguely similar:

CFQ BFQ BFQ-TPUT
latency avg-Flush-64 287.05 ( 0.00%) 389.14 ( -35.57%) 349.90 ( -21.90%)
latency avg-Close-64 0.00 ( 0.00%) 0.00 ( -33.33%) 0.00 ( 0.00%)
latency avg-LockX-64 0.01 ( 0.00%) 0.01 ( -16.67%) 0.01 ( 0.00%)
latency avg-Rename-64 0.18 ( 0.00%) 0.21 ( -16.39%) 0.18 ( 3.28%)
latency avg-ReadX-64 0.10 ( 0.00%) 0.15 ( -40.95%) 0.15 ( -40.95%)
latency avg-WriteX-64 0.86 ( 0.00%) 0.81 ( 6.18%) 0.74 ( 13.75%)
latency avg-Unlink-64 1.49 ( 0.00%) 1.52 ( -2.28%) 1.14 ( 23.69%)
latency avg-UnlockX-64 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
latency avg-NTCreateX-64 0.26 ( 0.00%) 0.30 ( -16.15%) 0.21 ( 19.62%)

So, different figures to yours but the general observation that flush
latency is higher holds.

> Anyway, to investigate this regression more in depth, I took two
> further steps. First, I repeated the same test with bfq-sq, my
> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
> from the changes needed for bfq to live in blk-mq). I got:
>
> <SNIP>
>
> So, with both bfq and deadline there seems to be a serious regression,
> especially on MaxLat, when moving from legacy block to blk-mq. The
> regression is much worse with deadline, as legacy-deadline has the
> lowest max latency among all the schedulers, whereas mq-deadline has
> the highest one.
>

I wouldn't worry too much about max latency simply because a large
outlier can be due to multiple factors and it will be variable.
However, I accept that deadline is not necessarily great either.

> Regardless of the actual culprit of this regression, I would like to
> investigate further this issue. In this respect, I would like to ask
> for a little help. I would like to isolate the workloads generating
> the highest latencies. To this purpose, I had a look at the loadfile
> client-tiny.txt, and I still have a doubt: is every item in the
> loadfile executed somehow several times (for each value of the number
> of clients), or is it executed only once? More precisely, IIUC, for
> each operation reported in the above results, there are several items
> (lines) in the loadfile. So, is each of these items executed only
> once?
>

The loadfile is executed multiple times. The normal loadfile was
basically just the same commands, or very similar commands, run multiple
times within a single loadfile. That made the workload too sensitive to
the exact time the workload finished, and too coarse.
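
For reference, an individual dbench4 run with the modified loadfile looks
roughly like the following; the loadfile path, duration and client count
are placeholders rather than the exact mmtests invocation:

  dbench -D /mnt/test -c client-tiny.txt -t 180 64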

> I'm asking because, if it is executed only once, then I guess I can
> find the critical tasks ore easily. Finally, if it is actually
> executed only once, is it expected that the latency for such a task is
> one order of magnitude higher than that of the average latency for
> that group of tasks? I mean, is such a task intrinsically much
> heavier, and then expectedly much longer, or is the fact that latency
> is much higher for this task a sign that something in the kernel
> misbehaves for that task?
>

I don't think it's quite so easily isolated. It's all the operations in
combination that replicate the behaviour. If it was just a single operation
like "fsync" then it would be fairly straightforward, but the full mix
is relevant as it matters when writeback kicks off, when merges happen,
how much dirty data was outstanding when writeback or sync started, etc.

I see you've made other responses to the thread, so rather than respond
individually:

o I've queued a subset of tests with Ming's v3 patchset as that was the
latest branch at the time I looked. It'll take quite some time to execute
as the grid I use to collect data is backlogged with other work

o I've included pgioperf this time because it is good at demonstrating
  oddities related to fsync. Granted, it mostly simulates a database
  workload for which the deadline scheduler is typically recommended, but I
  think it's still a useful demonstration

o If you want a patch set queued that may improve workload pattern
detection for dbench then I can add that to the grid with the caveat that
results take time. It'll be a blind test as I'm not actively debugging
IO-related problems right now.

o I'll keep an eye out for other workloads that demonstrate the better
  performance empirically, given that desktop performance measured with a
  stopwatch is tough to quantify, even though I'm typically working in other
  areas. While I don't spend a lot of time on IO-related problems, it would
  still be preferable if switching to MQ by default were a safe option, so
  I'm interested enough to keep it in mind.

--
Mel Gorman
SUSE Labs

2017-08-08 10:43:06

by Ming Lei

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports

Hi Mel Gorman,

On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <[email protected]> wrote:
....
>
> o I've queued a subset of tests with Ming's v3 patchset as that was the
> latest branch at the time I looked. It'll take quite some time to execute
> as the grid I use to collect data is backlogged with other work

The latest patchset is in the following post:

http://marc.info/?l=linux-block&m=150191624318513&w=2

And you can find it in my github:

https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4

--
Ming Lei

2017-08-08 11:27:14

by Mel Gorman

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports

On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote:
> Hi Mel Gorman,
>
> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <[email protected]> wrote:
> ....
> >
> > o I've queued a subset of tests with Ming's v3 patchset as that was the
> > latest branch at the time I looked. It'll take quite some time to execute
> > as the grid I use to collect data is backlogged with other work
>
> The latest patchset is in the following post:
>
> http://marc.info/?l=linux-block&m=150191624318513&w=2
>
> And you can find it in my github:
>
> https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4
>

Unfortunately, the tests were queued last Friday and are partially complete
depending on when machines become available. As it is, v3 will take a few
days to complete and a requeue would incur further delays. If you believe
the results will be substantially different then I'll discard v3 and requeue.

--
Mel Gorman
SUSE Labs

2017-08-08 11:49:57

by Ming Lei

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports

On Tue, Aug 8, 2017 at 7:27 PM, Mel Gorman <[email protected]> wrote:
> On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote:
>> Hi Mel Gorman,
>>
>> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <[email protected]> wrote:
>> ....
>> >
>> > o I've queued a subset of tests with Ming's v3 patchset as that was the
>> > latest branch at the time I looked. It'll take quite some time to execute
>> > as the grid I use to collect data is backlogged with other work
>>
>> The latest patchset is in the following post:
>>
>> http://marc.info/?l=linux-block&m=150191624318513&w=2
>>
>> And you can find it in my github:
>>
>> https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4
>>
>
> Unfortunately, the tests were queued last Friday and are partially complete
> depending on when machines become available. As it is, v3 will take a few
> days to complete and a requeue would incur further delays. If you believe
> the results will be substantially different then I'll discard v3 and requeue.

Firstly, V3 on github (never posted out) causes a boot hang if the machine
has >= 16 CPU cores, so you need to check whether the test is still
running :-(

Also, V3 on github may not perform well on IB SRP (or other low-latency
SCSI disks), so I improved bio merging in V4, which makes IB SRP's
performance better too; how much it helps depends on the device.

I suggest focusing on the V2 posted to the mailing list (V4 on github).

--
Ming Lei

2017-08-08 11:55:17

by Mel Gorman

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports

On Tue, Aug 08, 2017 at 07:49:53PM +0800, Ming Lei wrote:
> On Tue, Aug 8, 2017 at 7:27 PM, Mel Gorman <[email protected]> wrote:
> > On Tue, Aug 08, 2017 at 06:43:03PM +0800, Ming Lei wrote:
> >> Hi Mel Gorman,
> >>
> >> On Tue, Aug 8, 2017 at 6:30 PM, Mel Gorman <[email protected]> wrote:
> >> ....
> >> >
> >> > o I've queued a subset of tests with Ming's v3 patchset as that was the
> >> > latest branch at the time I looked. It'll take quite some time to execute
> >> > as the grid I use to collect data is backlogged with other work
> >>
> >> The latest patchset is in the following post:
> >>
> >> http://marc.info/?l=linux-block&m=150191624318513&w=2
> >>
> >> And you can find it in my github:
> >>
> >> https://github.com/ming1/linux/commits/blk-mq-dispatch_for_scsi.V4
> >>
> >
> > Unfortunately, the tests were queued last Friday and are partially complete
> > depending on when machines become available. As it is, v3 will take a few
> > days to complete and a requeue would incur further delays. If you believe
> > the results will be substantially different then I'll discard v3 and requeue.
>
> Firstly V3 on github(never posted out) causes boot hang if CPU cores is >= 16,
> so you need to check if the test is still running, :-(
>

By coincidence, the few machines that have completed had core counts
below this, so I'll discard the existing results and requeue.

Thanks.

--
Mel Gorman
SUSE Labs

2017-08-08 17:16:30

by Paolo Valente

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports


> Il giorno 08 ago 2017, alle ore 12:30, Mel Gorman <[email protected]> ha scritto:
>
> On Mon, Aug 07, 2017 at 07:32:41PM +0200, Paolo Valente wrote:
>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across any
>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>> matters.
>>>>
>>>
>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>> soon as I can, thanks.
>>>
>>>
>>
>> I've run this test and tried to further investigate this regression.
>> For the moment, the gist seems to be that blk-mq plays an important
>> role, not only with bfq (unless I'm considering the wrong numbers).
>> Even if your main purpose in this thread was just to give a heads-up,
>> I guess it may be useful to share what I have found out. In addition,
>> I want to ask for some help, to try to get closer to the possible
>> causes of at least this regression. If you think it would be better
>> to open a new thread on this stuff, I'll do it.
>>
>
> I don't think it's necessary unless Christoph or Jens object and I doubt
> they will.
>
>> First, I got mixed results on my system.
>
> For what it's worth, this is standard. In my experience, IO benchmarks
> are always multi-modal, particularly on rotary storage. Cases of universal
> win or universal loss for a scheduler or set of tuning are rare.
>
>> I'll focus only on the the
>> case where mq-bfq-tput achieves its worst relative performance w.r.t.
>> to cfq, which happens with 64 clients. Still, also in this case
>> mq-bfq is better than cfq in all average values, but Flush. I don't
>> know which are the best/right values to look at, so, here's the final
>> report for both schedulers:
>>
>
> For what it's worth, it has often been observed that dbench overall
> performance was dominated by flush costs. This is also true for the
> standard reported throughput figures rather than the modified load file
> elapsed time that mmtests reports. In dbench3 it was even worse where the
> "performance" was dominated by whether the temporary files were deleted
> before writeback started.
>
>> CFQ
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13120 20.069 348.594
>> Close 133696 0.008 14.642
>> LockX 512 0.009 0.059
>> Rename 7552 1.857 415.418
>> ReadX 270720 0.141 535.632
>> WriteX 89591 421.961 6363.271
>> Unlink 34048 1.281 662.467
>> UnlockX 512 0.007 0.057
>> FIND_FIRST 62016 0.086 25.060
>> SET_FILE_INFORMATION 15616 0.995 176.621
>> QUERY_FILE_INFORMATION 28734 0.004 1.372
>> QUERY_PATH_INFORMATION 170240 0.163 820.292
>> QUERY_FS_INFORMATION 28736 0.017 4.110
>> NTCreateX 178688 0.437 905.567
>>
>> MQ-BFQ-TPUT
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13504 75.828 11196.035
>> Close 136896 0.004 3.855
>> LockX 640 0.005 0.031
>> Rename 8064 1.020 288.989
>> ReadX 297600 0.081 685.850
>> WriteX 93515 391.637 12681.517
>> Unlink 34880 0.500 146.928
>> UnlockX 640 0.004 0.032
>> FIND_FIRST 63680 0.045 222.491
>> SET_FILE_INFORMATION 16000 0.436 686.115
>> QUERY_FILE_INFORMATION 30464 0.003 0.773
>> QUERY_PATH_INFORMATION 175552 0.044 148.449
>> QUERY_FS_INFORMATION 29888 0.009 1.984
>> NTCreateX 183152 0.289 300.867
>>
>> Are these results in line with yours for this test?
>>
>
> Very broadly speaking yes, but it varies. On a small machine, the differences
> in flush latency are visible but not as dramatic. It only has a few
> CPUs. On a machine that tops out with 32 CPUs, it is more noticable. On
> the one machine I have that topped out with CFQ/BFQ at 64 threads, the
> latency of flush is vaguely similar
>
> CFQ BFQ BFQ-TPUT
> latency avg-Flush-64 287.05 ( 0.00%) 389.14 ( -35.57%) 349.90 ( -21.90%)
> latency avg-Close-64 0.00 ( 0.00%) 0.00 ( -33.33%) 0.00 ( 0.00%)
> latency avg-LockX-64 0.01 ( 0.00%) 0.01 ( -16.67%) 0.01 ( 0.00%)
> latency avg-Rename-64 0.18 ( 0.00%) 0.21 ( -16.39%) 0.18 ( 3.28%)
> latency avg-ReadX-64 0.10 ( 0.00%) 0.15 ( -40.95%) 0.15 ( -40.95%)
> latency avg-WriteX-64 0.86 ( 0.00%) 0.81 ( 6.18%) 0.74 ( 13.75%)
> latency avg-Unlink-64 1.49 ( 0.00%) 1.52 ( -2.28%) 1.14 ( 23.69%)
> latency avg-UnlockX-64 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> latency avg-NTCreateX-64 0.26 ( 0.00%) 0.30 ( -16.15%) 0.21 ( 19.62%)
>
> So, different figures to yours but the general observation that flush
> latency is higher holds.
>
>> Anyway, to investigate this regression more in depth, I took two
>> further steps. First, I repeated the same test with bfq-sq, my
>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>> from the changes needed for bfq to live in blk-mq). I got:
>>
>> <SNIP>
>>
>> So, with both bfq and deadline there seems to be a serious regression,
>> especially on MaxLat, when moving from legacy block to blk-mq. The
>> regression is much worse with deadline, as legacy-deadline has the
>> lowest max latency among all the schedulers, whereas mq-deadline has
>> the highest one.
>>
>
> I wouldn't worry too much about max latency simply because a large
> outliier can be due to multiple factors and it will be variable.
> However, I accept that deadline is not necessarily great either.
>
>> Regardless of the actual culprit of this regression, I would like to
>> investigate this issue further. In this respect, I would like to ask
>> for a little help. I would like to isolate the workloads generating
>> the highest latencies. To this end, I had a look at the loadfile
>> client-tiny.txt, and I still have a doubt: is every item in the
>> loadfile executed several times (for each value of the number
>> of clients), or is it executed only once? More precisely, IIUC, for
>> each operation reported in the above results, there are several items
>> (lines) in the loadfile. So, is each of these items executed only
>> once?
>>
>
> The load file is executed multiple times. The normal loadfile was
> basically just the same commands, or very similar commands, run multiple
> times within a single load file. This made the workload too sensitive to
> the exact time the workload finished and too coarse.
>
>> I'm asking because, if it is executed only once, then I guess I can
>> find the critical tasks more easily. Finally, if it is actually
>> executed only once, is it expected that the latency for such a task is
>> one order of magnitude higher than the average latency for
>> that group of tasks? I mean, is such a task intrinsically much
>> heavier, and therefore expected to take much longer, or is the fact that
>> latency is much higher for this task a sign that something in the kernel
>> misbehaves for that task?
>>
>
> I don't think it's quite as easily isolated. It's all the operations in
> combination that replicate the behaviour. If it were just a single operation
> like "fsync" then it would be fairly straightforward, but the full mix
> is relevant as it matters when writeback kicks off, when merges happen,
> how much dirty data was outstanding when writeback or sync started, etc.
>
> I see you've made other responses to the thread so rather than respond
> individually
>
> o I've queued a subset of tests with Ming's v3 patchset as that was the
> latest branch at the time I looked. It'll take quite some time to execute
> as the grid I use to collect data is backlogged with other work
>
> o I've included pgioperf this time because it is good at demonstrating
>   oddities related to fsync. Granted, it's mostly simulating a database
>   workload that is typically recommended to use the deadline scheduler, but I
>   think it's still a useful demonstration
>
> o If you want a patch set queued that may improve workload pattern
> detection for dbench then I can add that to the grid with the caveat that
> results take time. It'll be a blind test as I'm not actively debugging
> IO-related problems right now.
>
> o I'll keep an eye out for other workloads that empirically demonstrate
>   better performance, given that desktop performance is tough to quantify
>   with a stopwatch and I'm typically working in other areas. While
>   I don't spend a lot of time on IO-related problems, it would still
>   be preferable if switching to MQ by default were a safe option, so I'm
>   interested enough to keep it in mind.
>

Hi Mel,
thanks for your thorough responses (I'm about to write something about
the read-write unfairness issue, with, again, some surprise).

I want to reply only to your last point above. With our
responsiveness benchmark you of course don't need a stopwatch, but,
yes, to get minimally comprehensive results you do need a machine
with at least one desktop application, such as a terminal, installed.

Thanks,
Paolo

> --
> Mel Gorman
> SUSE Labs


2017-08-08 17:33:43

by Paolo Valente

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports


> On 08 Aug 2017, at 10:06, Paolo Valente <[email protected]> wrote:
>
>>
>> On 07 Aug 2017, at 20:42, Paolo Valente <[email protected]> wrote:
>>
>>>
>>> On 07 Aug 2017, at 19:32, Paolo Valente <[email protected]> wrote:
>>>
>>>>
>>>> On 05 Aug 2017, at 00:05, Paolo Valente <[email protected]> wrote:
>>>>
>>>>>
>>>>> On 04 Aug 2017, at 13:01, Mel Gorman <[email protected]> wrote:
>>>>>
>>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>>>> I took that into account. BFQ with low-latency was also tested and the
>>>>>>> impact was not a universal improvement, although it can be a noticeable
>>>>>>> improvement. From the same machine:
>>>>>>>
>>>>>>> dbench4 Loadfile Execution Time
>>>>>>> 4.12.0 4.12.0 4.12.0
>>>>>>> legacy-cfq mq-bfq mq-bfq-tput
>>>>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
>>>>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
>>>>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
>>>>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>>>>>>>
>>>>>>
>>>>>> Thanks for trying with low_latency disabled. If I read the numbers
>>>>>> correctly, we move from a worst case of 361% higher execution time to
>>>>>> a worst case of 11%, with a best case of 20% lower execution time.
>>>>>>
>>>>>
>>>>> Yes.
>>>>>
>>>>>> I asked you about none and mq-deadline in a previous email, because
>>>>>> actually we have a double change here: change of the I/O stack, and
>>>>>> change of the scheduler, with the first change probably not irrelevant
>>>>>> with respect to the second one.
>>>>>>
>>>>>
>>>>> True. However, the difference between legacy-deadline and mq-deadline is
>>>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>>>> universally true but the impact is not as severe. While this is not
>>>>> proof that the stack change is the sole root cause, it makes it less
>>>>> likely.
>>>>>
>>>>
>>>> I'm getting a little lost here. If I'm not mistaken, you are saying that,
>>>> since the difference between two virtually identical schedulers
>>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>>>> difference between cfq and mq-bfq-tput is higher, in the latter
>>>> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the
>>>> above test is exactly in the 5-10% range? What am I missing? Are there
>>>> other tests with mq-bfq-tput not yet reported?
>>>>
>>>>>> By chance, according to what you have measured so far, is there any
>>>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>>>> lose? I could start from there.
>>>>>>
>>>>>
>>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>>>> it could be the stack change.
>>>>>
>>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across every
>>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>>> matters.
>>>>>
>>>>
>>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>>> soon as I can, thanks.
>>>>
>>>>
>>>
>>> I've run this test and tried to further investigate this regression.
>>> For the moment, the gist seems to be that blk-mq plays an important
>>> role, not only with bfq (unless I'm considering the wrong numbers).
>>> Even if your main purpose in this thread was just to give a heads-up,
>>> I guess it may be useful to share what I have found out. In addition,
>>> I want to ask for some help, to try to get closer to the possible
>>> causes of at least this regression. If you think it would be better
>>> to open a new thread on this stuff, I'll do it.
>>>
>>> First, I got mixed results on my system. I'll focus only on the
>>> case where mq-bfq-tput achieves its worst relative performance w.r.t.
>>> cfq, which happens with 64 clients. Still, even in this case
>>> mq-bfq is better than cfq in all average values except Flush. I don't
>>> know which are the best/right values to look at, so here's the final
>>> report for both schedulers:
>>>
>>> CFQ
>>>
>>> Operation Count AvgLat MaxLat
>>> --------------------------------------------------
>>> Flush 13120 20.069 348.594
>>> Close 133696 0.008 14.642
>>> LockX 512 0.009 0.059
>>> Rename 7552 1.857 415.418
>>> ReadX 270720 0.141 535.632
>>> WriteX 89591 421.961 6363.271
>>> Unlink 34048 1.281 662.467
>>> UnlockX 512 0.007 0.057
>>> FIND_FIRST 62016 0.086 25.060
>>> SET_FILE_INFORMATION 15616 0.995 176.621
>>> QUERY_FILE_INFORMATION 28734 0.004 1.372
>>> QUERY_PATH_INFORMATION 170240 0.163 820.292
>>> QUERY_FS_INFORMATION 28736 0.017 4.110
>>> NTCreateX 178688 0.437 905.567
>>>
>>> MQ-BFQ-TPUT
>>>
>>> Operation Count AvgLat MaxLat
>>> --------------------------------------------------
>>> Flush 13504 75.828 11196.035
>>> Close 136896 0.004 3.855
>>> LockX 640 0.005 0.031
>>> Rename 8064 1.020 288.989
>>> ReadX 297600 0.081 685.850
>>> WriteX 93515 391.637 12681.517
>>> Unlink 34880 0.500 146.928
>>> UnlockX 640 0.004 0.032
>>> FIND_FIRST 63680 0.045 222.491
>>> SET_FILE_INFORMATION 16000 0.436 686.115
>>> QUERY_FILE_INFORMATION 30464 0.003 0.773
>>> QUERY_PATH_INFORMATION 175552 0.044 148.449
>>> QUERY_FS_INFORMATION 29888 0.009 1.984
>>> NTCreateX 183152 0.289 300.867
>>>
>>> Are these results in line with yours for this test?
>>>
>>> Anyway, to investigate this regression more in depth, I took two
>>> further steps. First, I repeated the same test with bfq-sq, my
>>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>>> from the changes needed for bfq to live in blk-mq). I got:
>>>
>>> BFQ-SQ-TPUT
>>>
>>> Operation Count AvgLat MaxLat
>>> --------------------------------------------------
>>> Flush 12618 30.212 484.099
>>> Close 123884 0.008 10.477
>>> LockX 512 0.010 0.170
>>> Rename 7296 2.032 426.409
>>> ReadX 262179 0.251 985.478
>>> WriteX 84072 461.398 7283.003
>>> Unlink 33076 1.685 848.734
>>> UnlockX 512 0.007 0.036
>>> FIND_FIRST 58690 0.096 220.720
>>> SET_FILE_INFORMATION 14976 1.792 466.435
>>> QUERY_FILE_INFORMATION 26575 0.004 2.194
>>> QUERY_PATH_INFORMATION 158125 0.112 614.063
>>> QUERY_FS_INFORMATION 28224 0.017 1.385
>>> NTCreateX 167877 0.827 945.644
>>>
>>> So, the worst-case regression is now around 15%. This made me suspect
>>> that blk-mq influences results a lot for this test. To crosscheck, I
>>> compared legacy-deadline and mq-deadline too.
>>>
>>
>> Ok, I found the cause of the 15% loss in bfq-sq. bfq-sq gets
>> occasionally confused by the workload, and grants device idling to
>> processes that, for this specific workload, would be better
>> de-scheduled immediately. If we set slice_idle to 0, then bfq-sq
>> becomes more or less equivalent to cfq (for some operations apparently
>> even much better):
>>
>> bfq-sq-tput-0idle
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13013 17.888 280.517
>> Close 133004 0.008 20.698
>> LockX 512 0.008 0.088
>> Rename 7427 2.041 193.232
>> ReadX 270534 0.138 408.534
>> WriteX 88598 429.615 6272.212
>> Unlink 33734 1.205 559.152
>> UnlockX 512 0.011 1.808
>> FIND_FIRST 61762 0.087 23.012
>> SET_FILE_INFORMATION 15337 1.322 220.155
>> QUERY_FILE_INFORMATION 28415 0.004 0.559
>> QUERY_PATH_INFORMATION 169423 0.150 580.570
>> QUERY_FS_INFORMATION 28547 0.019 24.466
>> NTCreateX 177618 0.544 681.795
>>
>> I'll try soon with mq-bfq too, for which I expect however a deeper
>> investigation to be needed.
>>
>
> Hi,
> to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also
> applied Ming's patches, and, ah, victory!
>
> Regardless of the value of slice_idle:
>
> mq-bfq-tput
>
> Operation Count AvgLat MaxLat
> --------------------------------------------------
> Flush 13183 70.381 1025.407
> Close 134539 0.004 1.011
> LockX 512 0.005 0.025
> Rename 7721 0.740 404.979
> ReadX 274422 0.126 873.364
> WriteX 90535 408.371 7400.585
> Unlink 34276 0.634 581.067
> UnlockX 512 0.003 0.029
> FIND_FIRST 62664 0.052 321.027
> SET_FILE_INFORMATION 15981 0.234 124.739
> QUERY_FILE_INFORMATION 29042 0.003 1.731
> QUERY_PATH_INFORMATION 171769 0.032 522.415
> QUERY_FS_INFORMATION 28958 0.009 3.043
> NTCreateX 179643 0.298 687.466
>
> Throughput 9.11183 MB/sec 64 clients 64 procs max_latency=7400.588 ms
>
> Unlike with bfq-sq, setting slice_idle to 0 doesn't provide any
> benefit, which leads me to suspect that there is some other issue in
> blk-mq (only a suspicion). I think I may have already understood how to
> guarantee that bfq almost never idles the device uselessly for this
> workload too. Yet, since in blk-mq there is no gain even after
> excluding useless idling, I'll wait for at least Ming's patches to be
> merged before possibly proposing this contribution. Maybe some other
> little issue related to this lack of gain in blk-mq will be found and
> solved in the meantime.
>
> Moving to the read-write unfairness problem.
>

I've reproduced the unfairness issue (rand reader throttled by heavy
writers) with bfq, using
configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with
an important side problem: cfq suffers from exactly the same
unfairness (785kB/s writers, 13.4kB/s reader). Of course, this
happens in my system, with a HITACHI HTS727550A9E364.

This discrepancy with your results makes it a little bit harder for me to
understand how best to proceed, as I see no regression. Anyway,
since this reader-throttling issue seems relevant, I have investigated
it a little more in depth. The cause of the throttling is that the
fdatasync frequently performed by the writers in this test turns the
writers' I/O into 100% sync I/O. And neither bfq nor cfq
differentiates bandwidth between sync reads and sync writes. Basically,
both cfq and bfq are willing to dispatch the I/O requests of each
writer for a time slot equal to that devoted to the reader. But write
requests, after reaching the device, occupy it for much more time
than reads do. This delays the completion of the reader's requests
and, the I/O being sync, the issuing of the reader's next requests.
The final result is that the device spends most of the time
serving write requests, while the reader issues its read requests very
slowly.
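
To make the I/O pattern concrete, a minimal fio job along these lines would
approximate the scenario just described (purely illustrative: the job names,
sizes and fdatasync period below are my own guesses, not the values used by
the mmtests configuration):

  [global]
  directory=/mnt/testdisk
  runtime=60
  time_based

  [heavy-writer]
  # buffered sequential writes with a periodic fdatasync(); the frequent
  # fdatasync() calls are what turn the writers' I/O into sync I/O
  rw=write
  bs=128k
  size=2g
  numjobs=4
  fdatasync=256

  [rand-reader]
  # one synchronous random reader, a single request outstanding at a time
  rw=randread
  bs=4k
  size=1g
  ioengine=psync
  iodepth=1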

It might not be so difficult to balance this unfairness, although I'm
a little worried about changing bfq without being able to see the
regression you report. In case I give it a try, could I then count on
some testing on your machines?

Thanks,
Paolo

> Thanks,
> Paolo
>
>> Thanks,
>> Paolo
>>
>>> LEGACY-DEADLINE
>>>
>>> Operation Count AvgLat MaxLat
>>> --------------------------------------------------
>>> Flush 13267 9.622 298.206
>>> Close 135692 0.007 10.627
>>> LockX 640 0.008 0.066
>>> Rename 7827 0.544 481.123
>>> ReadX 285929 0.220 2698.442
>>> WriteX 92309 430.867 5191.608
>>> Unlink 34534 1.133 619.235
>>> UnlockX 640 0.008 0.724
>>> FIND_FIRST 63289 0.086 56.851
>>> SET_FILE_INFORMATION 16000 1.254 844.065
>>> QUERY_FILE_INFORMATION 29883 0.004 0.618
>>> QUERY_PATH_INFORMATION 173232 0.089 1295.651
>>> QUERY_FS_INFORMATION 29632 0.017 4.813
>>> NTCreateX 181464 0.479 2214.343
>>>
>>>
>>> MQ-DEADLINE
>>>
>>> Operation Count AvgLat MaxLat
>>> --------------------------------------------------
>>> Flush 13760 90.542 13221.495
>>> Close 137654 0.008 27.133
>>> LockX 640 0.009 0.115
>>> Rename 8064 1.062 246.759
>>> ReadX 297956 0.051 347.018
>>> WriteX 94698 425.636 15090.020
>>> Unlink 35077 0.580 208.462
>>> UnlockX 640 0.007 0.291
>>> FIND_FIRST 66630 0.566 530.339
>>> SET_FILE_INFORMATION 16000 1.419 811.494
>>> QUERY_FILE_INFORMATION 30717 0.004 1.108
>>> QUERY_PATH_INFORMATION 176153 0.182 517.419
>>> QUERY_FS_INFORMATION 30857 0.018 18.562
>>> NTCreateX 184145 0.281 582.076
>>>
>>> So, with both bfq and deadline there seems to be a serious regression,
>>> especially on MaxLat, when moving from legacy block to blk-mq. The
>>> regression is much worse with deadline, as legacy-deadline has the
>>> lowest max latency among all the schedulers, whereas mq-deadline has
>>> the highest one.
>>>
>>> Regardless of the actual culprit of this regression, I would like to
>>> investigate this issue further. In this respect, I would like to ask
>>> for a little help. I would like to isolate the workloads generating
>>> the highest latencies. To this end, I had a look at the loadfile
>>> client-tiny.txt, and I still have a doubt: is every item in the
>>> loadfile executed several times (for each value of the number
>>> of clients), or is it executed only once? More precisely, IIUC, for
>>> each operation reported in the above results, there are several items
>>> (lines) in the loadfile. So, is each of these items executed only
>>> once?
>>>
>>> I'm asking because, if it is executed only once, then I guess I can
>>> find the critical tasks more easily. Finally, if it is actually
>>> executed only once, is it expected that the latency for such a task is
>>> one order of magnitude higher than the average latency for
>>> that group of tasks? I mean, is such a task intrinsically much
>>> heavier, and therefore expected to take much longer, or is the fact that
>>> latency is much higher for this task a sign that something in the kernel
>>> misbehaves for that task?
>>>
>>> While waiting for some feedback, I'm going to execute your test
>>> showing great unfairness between writes and reads, and to also check
>>> whether responsiveness does worsen if the write workload for that test
>>> is being executed in the background.
>>>
>>> Thanks,
>>> Paolo
>>>
>>>> ...
>>>>> --
>>>>> Mel Gorman
>>>>> SUSE Labs


2017-08-08 18:27:39

by Mel Gorman

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports

On Tue, Aug 08, 2017 at 07:33:37PM +0200, Paolo Valente wrote:
> > Unlike with bfq-sq, setting slice_idle to 0 doesn't provide any
> > benefit, which leads me to suspect that there is some other issue in
> > blk-mq (only a suspicion). I think I may have already understood how to
> > guarantee that bfq almost never idles the device uselessly for this
> > workload too. Yet, since in blk-mq there is no gain even after
> > excluding useless idling, I'll wait for at least Ming's patches to be
> > merged before possibly proposing this contribution. Maybe some other
> > little issue related to this lack of gain in blk-mq will be found and
> > solved in the meantime.
> >
> > Moving to the read-write unfairness problem.
> >
>
> I've reproduced the unfairness issue (rand reader throttled by heavy
> writers) with bfq, using
> configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with
> an important side problem: cfq suffers from exactly the same
> unfairness (785kB/s writers, 13.4kB/s reader). Of course, this
> happens in my system, with a HITACHI HTS727550A9E364.
>

It's interesting that CFQ suffers the same on your system. It's possible
that this is down to luck and the results depend not only on the disk but
on the number of CPUs. At absolute minimum, we saw different latency figures
from dbench, even if the only observation is "different machines behave
differently, news at 11". If the results are inconsistent, then the value of
the benchmark can be dropped as a basis of comparison between IO schedulers
(although I'll be keeping it for detecting regressions between releases).

When the v4 results from Ming's patches complete, I'll double check the
results from this config.

> This discrepancy with your results makes it a little bit harder for me to
> understand how best to proceed, as I see no regression. Anyway,
> since this reader-throttling issue seems relevant, I have investigated
> it a little more in depth. The cause of the throttling is that the
> fdatasync frequently performed by the writers in this test turns the
> writers' I/O into 100% sync I/O. And neither bfq nor cfq
> differentiates bandwidth between sync reads and sync writes. Basically,
> both cfq and bfq are willing to dispatch the I/O requests of each
> writer for a time slot equal to that devoted to the reader. But write
> requests, after reaching the device, occupy it for much more time
> than reads do. This delays the completion of the reader's requests
> and, the I/O being sync, the issuing of the reader's next requests.
> The final result is that the device spends most of the time
> serving write requests, while the reader issues its read requests very
> slowly.
>

That is certainly plausible and implies that the actual results depend
too heavily on random timing factors and disk model to be really useful.

> It might not be so difficult to balance this unfairness, although I'm
> a little worried about changing bfq without being able to see the
> regression you report. In case I give it a try, could I then count on
> some testing on your machines?
>

Yes with the caveat that results take a variable amount of time depending
on how many problems I'm juggling in the air and how many of them are
occupying time on the machines.

--
Mel Gorman
SUSE Labs

2017-08-09 21:49:24

by Paolo Valente

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports


> On 08 Aug 2017, at 19:33, Paolo Valente <[email protected]> wrote:
>
>>
>> On 08 Aug 2017, at 10:06, Paolo Valente <[email protected]> wrote:
>>
>>>
>>> On 07 Aug 2017, at 20:42, Paolo Valente <[email protected]> wrote:
>>>
>>>>
>>>> On 07 Aug 2017, at 19:32, Paolo Valente <[email protected]> wrote:
>>>>
>>>>>
>>>>> On 05 Aug 2017, at 00:05, Paolo Valente <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On 04 Aug 2017, at 13:01, Mel Gorman <[email protected]> wrote:
>>>>>>
>>>>>> On Fri, Aug 04, 2017 at 09:26:20AM +0200, Paolo Valente wrote:
>>>>>>>> I took that into account. BFQ with low-latency was also tested and the
>>>>>>>> impact was not a universal improvement, although it can be a noticeable
>>>>>>>> improvement. From the same machine:
>>>>>>>>
>>>>>>>> dbench4 Loadfile Execution Time
>>>>>>>> 4.12.0 4.12.0 4.12.0
>>>>>>>> legacy-cfq mq-bfq mq-bfq-tput
>>>>>>>> Amean 1 80.67 ( 0.00%) 83.68 ( -3.74%) 84.70 ( -5.00%)
>>>>>>>> Amean 2 92.87 ( 0.00%) 121.63 ( -30.96%) 88.74 ( 4.45%)
>>>>>>>> Amean 4 102.72 ( 0.00%) 474.33 (-361.77%) 113.97 ( -10.95%)
>>>>>>>> Amean 32 2543.93 ( 0.00%) 1927.65 ( 24.23%) 2038.74 ( 19.86%)
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for trying with low_latency disabled. If I read the numbers
>>>>>>> correctly, we move from a worst case of 361% higher execution time to
>>>>>>> a worst case of 11%, with a best case of 20% lower execution time.
>>>>>>>
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> I asked you about none and mq-deadline in a previous email, because
>>>>>>> actually we have a double change here: change of the I/O stack, and
>>>>>>> change of the scheduler, with the first change probably not irrelevant
>>>>>>> with respect to the second one.
>>>>>>>
>>>>>>
>>>>>> True. However, the difference between legacy-deadline and mq-deadline is
>>>>>> roughly around the 5-10% mark across workloads for SSD. It's not
>>>>>> universally true but the impact is not as severe. While this is not
>>>>>> proof that the stack change is the sole root cause, it makes it less
>>>>>> likely.
>>>>>>
>>>>>
>>>>> I'm getting a little lost here. If I'm not mistaken, you are saying that,
>>>>> since the difference between two virtually identical schedulers
>>>>> (legacy-deadline and mq-deadline) is only around 5-10%, while the
>>>>> difference between cfq and mq-bfq-tput is higher, in the latter
>>>>> case it is not the stack's fault. Yet the loss of mq-bfq-tput in the
>>>>> above test is exactly in the 5-10% range? What am I missing? Are there
>>>>> other tests with mq-bfq-tput not yet reported?
>>>>>
>>>>>>> By chance, according to what you have measured so far, is there any
>>>>>>> test where, instead, you expect or have seen bfq-mq-tput to always
>>>>>>> lose? I could start from there.
>>>>>>>
>>>>>>
>>>>>> global-dhp__io-fio-randread-async-randwrite-xfs but marginal enough that
>>>>>> it could be the stack change.
>>>>>>
>>>>>> global-dhp__io-dbench4-fsync-ext4 was a universal loss across every
>>>>>> machine tested. This is global-dhp__io-dbench4-fsync from mmtests using
>>>>>> ext4 as a filesystem. The same is not true for XFS so the filesystem
>>>>>> matters.
>>>>>>
>>>>>
>>>>> Ok, then I will try to repeat global-dhp__io-dbench4-fsync-ext4 as
>>>>> soon as I can, thanks.
>>>>>
>>>>>
>>>>
>>>> I've run this test and tried to further investigate this regression.
>>>> For the moment, the gist seems to be that blk-mq plays an important
>>>> role, not only with bfq (unless I'm considering the wrong numbers).
>>>> Even if your main purpose in this thread was just to give a heads-up,
>>>> I guess it may be useful to share what I have found out. In addition,
>>>> I want to ask for some help, to try to get closer to the possible
>>>> causes of at least this regression. If you think it would be better
>>>> to open a new thread on this stuff, I'll do it.
>>>>
>>>> First, I got mixed results on my system. I'll focus only on the
>>>> case where mq-bfq-tput achieves its worst relative performance w.r.t.
>>>> cfq, which happens with 64 clients. Still, even in this case
>>>> mq-bfq is better than cfq in all average values except Flush. I don't
>>>> know which are the best/right values to look at, so here's the final
>>>> report for both schedulers:
>>>>
>>>> CFQ
>>>>
>>>> Operation Count AvgLat MaxLat
>>>> --------------------------------------------------
>>>> Flush 13120 20.069 348.594
>>>> Close 133696 0.008 14.642
>>>> LockX 512 0.009 0.059
>>>> Rename 7552 1.857 415.418
>>>> ReadX 270720 0.141 535.632
>>>> WriteX 89591 421.961 6363.271
>>>> Unlink 34048 1.281 662.467
>>>> UnlockX 512 0.007 0.057
>>>> FIND_FIRST 62016 0.086 25.060
>>>> SET_FILE_INFORMATION 15616 0.995 176.621
>>>> QUERY_FILE_INFORMATION 28734 0.004 1.372
>>>> QUERY_PATH_INFORMATION 170240 0.163 820.292
>>>> QUERY_FS_INFORMATION 28736 0.017 4.110
>>>> NTCreateX 178688 0.437 905.567
>>>>
>>>> MQ-BFQ-TPUT
>>>>
>>>> Operation Count AvgLat MaxLat
>>>> --------------------------------------------------
>>>> Flush 13504 75.828 11196.035
>>>> Close 136896 0.004 3.855
>>>> LockX 640 0.005 0.031
>>>> Rename 8064 1.020 288.989
>>>> ReadX 297600 0.081 685.850
>>>> WriteX 93515 391.637 12681.517
>>>> Unlink 34880 0.500 146.928
>>>> UnlockX 640 0.004 0.032
>>>> FIND_FIRST 63680 0.045 222.491
>>>> SET_FILE_INFORMATION 16000 0.436 686.115
>>>> QUERY_FILE_INFORMATION 30464 0.003 0.773
>>>> QUERY_PATH_INFORMATION 175552 0.044 148.449
>>>> QUERY_FS_INFORMATION 29888 0.009 1.984
>>>> NTCreateX 183152 0.289 300.867
>>>>
>>>> Are these results in line with yours for this test?
>>>>
>>>> Anyway, to investigate this regression more in depth, I took two
>>>> further steps. First, I repeated the same test with bfq-sq, my
>>>> out-of-tree version of bfq for legacy block (identical to mq-bfq apart
>>>> from the changes needed for bfq to live in blk-mq). I got:
>>>>
>>>> BFQ-SQ-TPUT
>>>>
>>>> Operation Count AvgLat MaxLat
>>>> --------------------------------------------------
>>>> Flush 12618 30.212 484.099
>>>> Close 123884 0.008 10.477
>>>> LockX 512 0.010 0.170
>>>> Rename 7296 2.032 426.409
>>>> ReadX 262179 0.251 985.478
>>>> WriteX 84072 461.398 7283.003
>>>> Unlink 33076 1.685 848.734
>>>> UnlockX 512 0.007 0.036
>>>> FIND_FIRST 58690 0.096 220.720
>>>> SET_FILE_INFORMATION 14976 1.792 466.435
>>>> QUERY_FILE_INFORMATION 26575 0.004 2.194
>>>> QUERY_PATH_INFORMATION 158125 0.112 614.063
>>>> QUERY_FS_INFORMATION 28224 0.017 1.385
>>>> NTCreateX 167877 0.827 945.644
>>>>
>>>> So, the worst-case regression is now around 15%. This made me suspect
>>>> that blk-mq influences results a lot for this test. To crosscheck, I
>>>> compared legacy-deadline and mq-deadline too.
>>>>
>>>
>>> Ok, I found the cause of the 15% loss in bfq-sq. bfq-sq gets
>>> occasionally confused by the workload, and grants device idling to
>>> processes that, for this specific workload, would be better
>>> de-scheduled immediately. If we set slice_idle to 0, then bfq-sq
>>> becomes more or less equivalent to cfq (for some operations apparently
>>> even much better):
>>>
>>> bfq-sq-tput-0idle
>>>
>>> Operation Count AvgLat MaxLat
>>> --------------------------------------------------
>>> Flush 13013 17.888 280.517
>>> Close 133004 0.008 20.698
>>> LockX 512 0.008 0.088
>>> Rename 7427 2.041 193.232
>>> ReadX 270534 0.138 408.534
>>> WriteX 88598 429.615 6272.212
>>> Unlink 33734 1.205 559.152
>>> UnlockX 512 0.011 1.808
>>> FIND_FIRST 61762 0.087 23.012
>>> SET_FILE_INFORMATION 15337 1.322 220.155
>>> QUERY_FILE_INFORMATION 28415 0.004 0.559
>>> QUERY_PATH_INFORMATION 169423 0.150 580.570
>>> QUERY_FS_INFORMATION 28547 0.019 24.466
>>> NTCreateX 177618 0.544 681.795
>>>
>>> I'll try soon with mq-bfq too, for which I expect however a deeper
>>> investigation to be needed.
>>>
>>
>> Hi,
>> to test mq-bfq (with both slice_idle==0 and slice_idle>0), I have also
>> applied Ming's patches, and, ah, victory!
>>
>> Regardless of the value of slice_idle:
>>
>> mq-bfq-tput
>>
>> Operation Count AvgLat MaxLat
>> --------------------------------------------------
>> Flush 13183 70.381 1025.407
>> Close 134539 0.004 1.011
>> LockX 512 0.005 0.025
>> Rename 7721 0.740 404.979
>> ReadX 274422 0.126 873.364
>> WriteX 90535 408.371 7400.585
>> Unlink 34276 0.634 581.067
>> UnlockX 512 0.003 0.029
>> FIND_FIRST 62664 0.052 321.027
>> SET_FILE_INFORMATION 15981 0.234 124.739
>> QUERY_FILE_INFORMATION 29042 0.003 1.731
>> QUERY_PATH_INFORMATION 171769 0.032 522.415
>> QUERY_FS_INFORMATION 28958 0.009 3.043
>> NTCreateX 179643 0.298 687.466
>>
>> Throughput 9.11183 MB/sec 64 clients 64 procs max_latency=7400.588 ms
>>
>> Unlike with bfq-sq, setting slice_idle to 0 doesn't provide any
>> benefit, which leads me to suspect that there is some other issue in
>> blk-mq (only a suspicion). I think I may have already understood how to
>> guarantee that bfq almost never idles the device uselessly for this
>> workload too. Yet, since in blk-mq there is no gain even after
>> excluding useless idling, I'll wait for at least Ming's patches to be
>> merged before possibly proposing this contribution. Maybe some other
>> little issue related to this lack of gain in blk-mq will be found and
>> solved in the meantime.
>>
>> Moving to the read-write unfairness problem.
>>
>
> I've reproduced the unfairness issue (rand reader throttled by heavy
> writers) with bfq, using
> configs/config-global-dhp__io-fio-randread-sync-heavywrite, but with
> an important side problem: cfq suffers from exactly the same
> unfairness (785kB/s writers, 13.4kB/s reader). Of course, this
> happens in my system, with a HITACHI HTS727550A9E364.
>
> This discrepancy with your results makes it a little bit harder for me to
> understand how best to proceed, as I see no regression. Anyway,
> since this reader-throttling issue seems relevant, I have investigated
> it a little more in depth. The cause of the throttling is that the
> fdatasync frequently performed by the writers in this test turns the
> writers' I/O into 100% sync I/O. And neither bfq nor cfq
> differentiates bandwidth between sync reads and sync writes. Basically,
> both cfq and bfq are willing to dispatch the I/O requests of each
> writer for a time slot equal to that devoted to the reader. But write
> requests, after reaching the device, occupy it for much more time
> than reads do. This delays the completion of the reader's requests
> and, the I/O being sync, the issuing of the reader's next requests.
> The final result is that the device spends most of the time
> serving write requests, while the reader issues its read requests very
> slowly.
>
> It might not be so difficult to balance this unfairness, although I'm
> a little worried about changing bfq without being able to see the
> regression you report. In case I give it a try, could I then count on
> some testing on your machines?
>

Hi Mel,
I've investigated this test case a little bit more, and the outcome is
unfortunately rather drastic, unless I'm missing some important point.
It is impossible to control the rate of the reader with the exact
configuration of this test. In fact, since iodepth is equal to 1, the
reader issues one I/O request at a time. When one such request is
dispatched, after some write requests have already been dispatched
(and then queued in the device), the time to serve the request is
controlled only by the device. The longer the device makes the read
request wait before being served, the later the reader will see the
completion of its request, and then the later the reader will issue a
new request, and so on. So, for this test, it is mainly the device
controller that decides the rate of the reader.
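
To make the constraint concrete, the reader in such a configuration boils
down to something like the first job below, whereas only a variant along
the lines of the second one, with several requests in flight, gives the
scheduler a queue of reads to work with (made-up values, not the actual
fio configuration of the test):

  [reader-sync]
  # one outstanding read at a time: the device dictates the pace
  rw=randread
  bs=4k
  ioengine=psync
  iodepth=1

  [reader-queued]
  # several reads in flight: the scheduler has requests to choose from
  rw=randread
  bs=4k
  ioengine=libaio
  direct=1
  iodepth=8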

On the other hand, the scheduler can regain control of the
bandwidth of the reader if the reader issues more than one request at
a time. Anyway, before analyzing this second, controllable case, I
wanted to test responsiveness with this heavy write workload in the
background. And it was very bad! After some hours of mild panic, I
found out that this failure is due to a bug in bfq, a bug that,
luckily, happens to be triggered by these heavy writes as a background
workload ...

I've already found and am testing a fix for this bug. Yet, it will
probably take me some weeks to submit this fix, because I'm finally
going on vacation.

Thanks,
Paolo

> Thanks,
> Paolo
>
>> Thanks,
>> Paolo
>>
>>> Thanks,
>>> Paolo
>>>
>>>> LEGACY-DEADLINE
>>>>
>>>> Operation Count AvgLat MaxLat
>>>> --------------------------------------------------
>>>> Flush 13267 9.622 298.206
>>>> Close 135692 0.007 10.627
>>>> LockX 640 0.008 0.066
>>>> Rename 7827 0.544 481.123
>>>> ReadX 285929 0.220 2698.442
>>>> WriteX 92309 430.867 5191.608
>>>> Unlink 34534 1.133 619.235
>>>> UnlockX 640 0.008 0.724
>>>> FIND_FIRST 63289 0.086 56.851
>>>> SET_FILE_INFORMATION 16000 1.254 844.065
>>>> QUERY_FILE_INFORMATION 29883 0.004 0.618
>>>> QUERY_PATH_INFORMATION 173232 0.089 1295.651
>>>> QUERY_FS_INFORMATION 29632 0.017 4.813
>>>> NTCreateX 181464 0.479 2214.343
>>>>
>>>>
>>>> MQ-DEADLINE
>>>>
>>>> Operation Count AvgLat MaxLat
>>>> --------------------------------------------------
>>>> Flush 13760 90.542 13221.495
>>>> Close 137654 0.008 27.133
>>>> LockX 640 0.009 0.115
>>>> Rename 8064 1.062 246.759
>>>> ReadX 297956 0.051 347.018
>>>> WriteX 94698 425.636 15090.020
>>>> Unlink 35077 0.580 208.462
>>>> UnlockX 640 0.007 0.291
>>>> FIND_FIRST 66630 0.566 530.339
>>>> SET_FILE_INFORMATION 16000 1.419 811.494
>>>> QUERY_FILE_INFORMATION 30717 0.004 1.108
>>>> QUERY_PATH_INFORMATION 176153 0.182 517.419
>>>> QUERY_FS_INFORMATION 30857 0.018 18.562
>>>> NTCreateX 184145 0.281 582.076
>>>>
>>>> So, with both bfq and deadline there seems to be a serious regression,
>>>> especially on MaxLat, when moving from legacy block to blk-mq. The
>>>> regression is much worse with deadline, as legacy-deadline has the
>>>> lowest max latency among all the schedulers, whereas mq-deadline has
>>>> the highest one.
>>>>
>>>> Regardless of the actual culprit of this regression, I would like to
>>>> investigate this issue further. In this respect, I would like to ask
>>>> for a little help. I would like to isolate the workloads generating
>>>> the highest latencies. To this end, I had a look at the loadfile
>>>> client-tiny.txt, and I still have a doubt: is every item in the
>>>> loadfile executed several times (for each value of the number
>>>> of clients), or is it executed only once? More precisely, IIUC, for
>>>> each operation reported in the above results, there are several items
>>>> (lines) in the loadfile. So, is each of these items executed only
>>>> once?
>>>>
>>>> I'm asking because, if it is executed only once, then I guess I can
>>>> find the critical tasks more easily. Finally, if it is actually
>>>> executed only once, is it expected that the latency for such a task is
>>>> one order of magnitude higher than the average latency for
>>>> that group of tasks? I mean, is such a task intrinsically much
>>>> heavier, and therefore expected to take much longer, or is the fact that
>>>> latency is much higher for this task a sign that something in the kernel
>>>> misbehaves for that task?
>>>>
>>>> While waiting for some feedback, I'm going to execute your test
>>>> showing great unfairness between writes and reads, and to also check
>>>> whether responsiveness does worsen if the write workload for that test
>>>> is being executed in the background.
>>>>
>>>> Thanks,
>>>> Paolo
>>>>
>>>>> ...
>>>>>> --
>>>>>> Mel Gorman
>>>>>> SUSE Labs


2017-08-10 08:44:14

by Mel Gorman

[permalink] [raw]
Subject: Re: Switching to MQ by default may generate some bug reports

On Wed, Aug 09, 2017 at 11:49:17PM +0200, Paolo Valente wrote:
> > This discrepancy with your results makes it a little bit harder for me to
> > understand how best to proceed, as I see no regression. Anyway,
> > since this reader-throttling issue seems relevant, I have investigated
> > it a little more in depth. The cause of the throttling is that the
> > fdatasync frequently performed by the writers in this test turns the
> > writers' I/O into 100% sync I/O. And neither bfq nor cfq
> > differentiates bandwidth between sync reads and sync writes. Basically,
> > both cfq and bfq are willing to dispatch the I/O requests of each
> > writer for a time slot equal to that devoted to the reader. But write
> > requests, after reaching the device, occupy it for much more time
> > than reads do. This delays the completion of the reader's requests
> > and, the I/O being sync, the issuing of the reader's next requests.
> > The final result is that the device spends most of the time
> > serving write requests, while the reader issues its read requests very
> > slowly.
> >
> > It might not be so difficult to balance this unfairness, although I'm
> > a little worried about changing bfq without being able to see the
> > regression you report. In case I give it a try, could I then count on
> > some testing on your machines?
> >
>
> Hi Mel,
> I've investigated this test case a little bit more, and the outcome is
> unfortunately rather drastic, unless I'm missing some important point.
> It is impossible to control the rate of the reader with the exact
> configuration of this test.

Correct, both are simply competing for access to IO. Very broadly speaking,
it's only checking for loose (but not perfect) fairness with different IO
patterns. While it's not a recent problem, historically (2+ years ago) we
had problems whereby a heavy reader or writer could starve IO completely. It
had odd effects like some multi-threaded benchmarks being artificially good
simply because one thread would dominate and artificially complete faster and
exit prematurely. "Fixing" it had a tendency to help real workloads while
hurting some benchmarks, so it's not straightforward to control for properly.
Bottom line, I'm not necessarily worried if a particular benchmark shows
an apparent regression once I understand why and can convince myself that a
"real" workload benefits from it (preferably proving it).

> In fact, since iodepth is equal to 1, the
> reader issues one I/O request at a time. When one such request is
> dispatched, after some write requests have already been dispatched
> (and then queued in the device), the time to serve the request is
> controlled only by the device. The longer the device makes the read
> request wait before being served, the later the reader will see the
> completion of its request, and then the later the reader will issue a
> new request, and so on. So, for this test, it is mainly the device
> controller that decides the rate of the reader.
>

Understood. It's less than ideal but not a completely silly test either.
That said, the fio tests are relatively new compared to some of the tests
monitored by mmtests looking for issues. It can take time to finalise a
test configuration before it's giving useful data 100% of the time.

> On the other hand, the scheduler can regain control of the
> bandwidth of the reader if the reader issues more than one request at
> a time.

Ok, I'll take it as a todo item to increase the depth, as a depth of 1 is
not that interesting as such. It's also on my todo list to add fio
configs that add think time.
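
For what it's worth, the kind of job I have in mind would look roughly like
the sketch below (placeholder values, not a finalised mmtests configuration):

  [randread-depth-think]
  rw=randread
  bs=4k
  ioengine=libaio
  direct=1
  # keep several requests in flight so the scheduler has work to do
  iodepth=8
  # pause for 1ms after every 16 blocks to model a consumer doing some work
  thinktime=1000
  thinktime_blocks=16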

> Anyway, before analyzing this second, controllable case, I
> wanted to test responsiveness with this heavy write workload in the
> background. And it was very bad! After some hours of mild panic, I
> found out that this failure is due to a bug in bfq, a bug that,
> luckily, happens to be triggered by these heavy writes as a background
> workload ...
>
> I've already found and am testing a fix for this bug. Yet, it will
> probably take me some weeks to submit this fix, because I'm finally
> going on vacation.
>

This is obviously both good and bad. Bad in that the bug exists at all,
good in that you detected it and a fix is possible. I don't think you have
to panic, considering that some of the pending fixes include Ming's work,
which won't be merged for quite some time, and tests take a long time anyway.
Whenever you get around to a fix after your vacation, just cc me and I'll
queue it across a range of machines so you have some independent tests.
A review from me would not be worth much as I haven't spent the time to
fully understand BFQ yet.

If the fixes do not hit until the next merge window or the window after that
then someone who cares enough can do a performance-based -stable backport. If
there are any bugs in the meantime (e.g. after 4.13 comes out) then there
will be a series for the reporter to test. I think it's still reasonably
positive that issues with MQ being enabled by default were detected within
weeks with potential fixes in the pipeline. It's better than months passing
before a distro picked up a suitable kernel and enough time passed for a
coherent bug report to show up that's better than "my computer is slow".

Thanks for the hard work and prompt research.

--
Mel Gorman
SUSE Labs