From: Rafal Mielniczuk <rafal.mielniczuk@citrix.com>
To: Bob Liu <bob.liu@oracle.com>
CC: Jens Axboe <axboe@fb.com>, Marcus Granado <Marcus.Granado@citrix.com>,
        Arianna Avanzini <avanzini.arianna@gmail.com>,
        Felipe Franciosi <felipe.franciosi@citrix.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Christoph Hellwig <hch@infradead.org>,
        "David Vrabel" <david.vrabel@citrix.com>,
        "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
        "boris.ostrovsky@oracle.com" <boris.ostrovsky@oracle.com>,
        Jonathan Davies <Jonathan.Davies@citrix.com>
Subject: Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for
 xen-blkfront and xen-blkback
Thread-Topic: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for
 xen-blkfront and xen-blkback
Thread-Index: AQHQs0ASjp9idKelmUi7KHWADehexw==
Date: Fri, 14 Aug 2015 12:30:03 +0000
Message-ID: <A1D98E0E70C35541AEBDE192A520C5434DD27A@AMSPEX01CL03.citrite.net>
References: <1410479844-2864-1-git-send-email-avanzini.arianna@gmail.com>
 <20141001202721.GF12581@laptop.dumpdata.com>
 <20150428073646.GA16022@infradead.org> <553F3ADF.3000301@gmail.com>
 <555327A5.1060200@oracle.com> <5592A5EF.2050005@citrix.com>
 <55935848.7080909@fb.com>
 <A1D98E0E70C35541AEBDE192A520C5434DABD6@AMSPEX01CL03.citrite.net>
 <55C8C8CE.7020301@fb.com> <55C99130.3020501@oracle.com>
 <A1D98E0E70C35541AEBDE192A520C5434DB3BC@AMSPEX01CL03.citrite.net>
 <55CA31AB.3030308@fb.com> <55CB1CF7.20102@oracle.com>
 <A1D98E0E70C35541AEBDE192A520C5434DC0B9@AMSPEX01CL03.citrite.net>
 <55CDA6FB.1090707@oracle.com>
Accept-Language: en-US
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6854
Lines: 144

On 14/08/15 09:31, Bob Liu wrote:
> On 08/13/2015 12:46 AM, Rafal Mielniczuk wrote:
>> On 12/08/15 11:17, Bob Liu wrote:
>>> On 08/12/2015 01:32 AM, Jens Axboe wrote:
>>>> On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
>>>>> On 11/08/15 07:08, Bob Liu wrote:
>>>>>> On 08/10/2015 11:52 PM, Jens Axboe wrote:
>>>>>>> On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
>>> ...
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We reran the tests for sequential reads with identical settings but with Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
>>>>>>>> The results we obtained were *better* than the results we got with the multiqueue patches applied:
>>>>>>>>
>>>>>>>> fio_threads  io_depth  block_size   1-queue_iops  8-queue_iops  *no-mq-patches_iops*
>>>>>>>>        8           32       512           158K         264K         321K
>>>>>>>>        8           32        1K           157K         260K         328K
>>>>>>>>        8           32        2K           157K         258K         336K
>>>>>>>>        8           32        4K           148K         257K         308K
>>>>>>>>        8           32        8K           124K         207K         188K
>>>>>>>>        8           32       16K            84K         105K          82K
>>>>>>>>        8           32       32K            50K          54K          36K
>>>>>>>>        8           32       64K            24K          27K          16K
>>>>>>>>        8           32      128K            11K          13K          11K
>>>>>>>>
>>>>>>>> We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
>>>>>>>> which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).
>>>>>>>>
>>>>>>>> We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD.
>>>>>>>>
>>>>>>>> As I understand it, the blk-mq layer bypasses the I/O scheduler, which also effectively disables merges.
>>>>>>>> Could you explain why it is difficult to enable merging in the blk-mq layer?
>>>>>>>> That could help close the performance gap we observed.
>>>>>>>>
>>>>>>>> Otherwise, the tests show that the multiqueue patches do not improve performance,
>>>>>>>> at least when it comes to sequential read/write operations.
>>>>>>> blk-mq still provides merging, so there should be no difference there. Do the xen patches set BLK_MQ_F_SHOULD_MERGE?
>>>>>>>
>>>>>> Yes.
>>>>>> Is it possible that the xen-blkfront driver dequeues requests too fast once we have multiple hardware queues?
>>>>>> New requests would then have no chance to merge with old requests that were already dequeued and issued.
>>>>>>
>>>>> For some reason we don't see merges even when we set multiqueue to 1.
>>>>> Below are some stats from the guest system when doing sequential 4KB reads:
>>>>>
>>>>> $ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8 \
>>>>>       --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
>>>>>       --filename=/dev/xvdb
>>>>>
>>>>> $ iostat -xt 5 /dev/xvdb
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>             0.50    0.00    2.73   85.14    2.00    9.63
>>>>>
>>>>> Device:   rrqm/s  wrqm/s        r/s   w/s      rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
>>>>> xvdb        0.00    0.00  156926.00  0.00  627704.00   0.00      8.00     30.06   0.19     0.19     0.00   0.01  100.48
>>>>>
>>>>> $ cat /sys/block/xvdb/queue/scheduler
>>>>> none
>>>>>
>>>>> $ cat /sys/block/xvdb/queue/nomerges
>>>>> 0
>>>>>
>>>>> Relevant bits from the xenstore configuration on the dom0:
>>>>>
>>>>> /local/domain/0/backend/vbd/2/51728/dev = "xvdb"
>>>>> /local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
>>>>> /local/domain/0/backend/vbd/2/51728/type = "phy"
>>>>> /local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"
>>>>>
>>>>> /local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
>>>>> /local/domain/2/device/vbd/51728/ring-ref = "9"
>>>>> /local/domain/2/device/vbd/51728/event-channel = "60"
>>>> What if you add --iodepth-batch=16 to that fio command line? Both mq and non-mq rely on plugging to get
>>>> batching in the use case above; otherwise IO is dispatched immediately. O_DIRECT is immediate.
>>>> I'd be more interested in seeing a test case with buffered IO of a file system on top of the xvdb device.
>>>> If we're missing merging for that case, then that's a much bigger issue.
>>>>
>>>  
>>> I was using the null block driver for the xen blk-mq test.
>>>
>>> Merges no longer happened even with just this patch applied:
>>> https://lkml.org/lkml/2015/7/13/185
>>> (which only converts the xen block driver to use the blk-mq APIs).
>>>
>>> I will try a file system soon.
>>>
>> I have more results for the guest with and without the patch
>> https://lkml.org/lkml/2015/7/13/185
>> applied to the latest stable kernel (4.1.5).
>>
> Thank you.
>
>> Command line used was:
>> fio --name=test --ioengine=libaio --rw=read --numjobs=8 \
>>     --iodepth=32 --time_based=1 --runtime=300 --bs=4KB \
>>     --filename=/dev/xvdb --direct=(0 and 1) --iodepth_batch=16
>>
>> without patch (--direct=1):
>>   xvdb: ios=18696304/0, merge=75763177/0, ticks=11323872/0, in_queue=11344352, util=100.00%
>>
>> with patch (--direct=1):
>>   xvdb: ios=43709976/0, merge=97/0, ticks=8851972/0, in_queue=8902928, util=100.00%
>>
> So request merges can still happen; they are just more difficult to trigger.
> What are the IOPS in both cases?

Without the patch it is 318K IOPS; with the patch it is 146K IOPS.

>> without patch buffered (--direct=0):
>>   xvdb: ios=1079051/0, merge=76/0, ticks=749364/0, in_queue=748840, util=94.60%
>>
>> with patch buffered (--direct=0):
>>   xvdb: ios=1132932/0, merge=0/0, ticks=689108/0, in_queue=688488, util=93.32%
>>

There seems to be very little difference when we measure buffered
sequential reads. Although iostat shows almost no merges in either
case, the avgrq-sz is around 250 sectors (125KB). Does that mean the
merges are actually happening, but at some other layer that is not
visible to iostat?
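
For reference, the request-size arithmetic behind those figures can be
sketched in Python as below. It assumes avgrq-sz is reported in 512-byte
sectors; the helper names are my own and purely illustrative.

```python
SECTOR_BYTES = 512  # assumption: avgrq-sz is in 512-byte sectors


def sectors_to_kib(sectors):
    """Convert a size in 512-byte sectors to KiB."""
    return sectors * SECTOR_BYTES / 1024


def avg_request_kib(rkb_per_s, reads_per_s):
    """Average read request size in KiB, from iostat's rkB/s and r/s."""
    return rkb_per_s / reads_per_s


# Buffered case: avgrq-sz of about 250 sectors.
buffered_kib = sectors_to_kib(250)

# Direct 4KB case, from the earlier iostat sample (rkB/s and r/s columns).
direct_kib = avg_request_kib(627704.00, 156926.00)

print(f"buffered: {buffered_kib:.0f} KiB, direct: {direct_kib:.0f} KiB")
```

So the buffered requests reaching the device are about 125KB each, while
the direct 4KB reads stay at 4KB (8 sectors), as in the iostat sample.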

There is a big discrepancy for direct sequential reads with small
block sizes, where we are missing merges that were happening before
the patch. It looks like requests do not stay in the queue long
enough to get merged.
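
To put a number on the missing merges, here is a small Python sketch that
parses the two --direct=1 disk-stats lines quoted above. The estimate
(ios + merges) / ios assumes each counted merge folds one extra 4KB bio
into an issued request, which is my simplification; the function names
are hypothetical.

```python
import re


def parse_disk_stats(line):
    """Extract the read ios and read merges from a fio disk-stats line."""
    ios = int(re.search(r"ios=(\d+)/", line).group(1))
    merges = int(re.search(r"merge=(\d+)/", line).group(1))
    return ios, merges


def bios_per_request(ios, merges):
    """Rough estimate, assuming one merge adds one bio to a request."""
    return (ios + merges) / ios


without_patch = ("xvdb: ios=18696304/0, merge=75763177/0, "
                 "ticks=11323872/0, in_queue=11344352, util=100.00%")
with_patch = ("xvdb: ios=43709976/0, merge=97/0, "
              "ticks=8851972/0, in_queue=8902928, util=100.00%")

for label, line in (("without patch", without_patch),
                    ("with patch", with_patch)):
    ios, merges = parse_disk_stats(line)
    print(f"{label}: {bios_per_request(ios, merges):.2f} bios per request")
```

Under that assumption, each request carries roughly five 4KB bios
without the patch and about one with it.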

One thing I noticed is in block/blk-mq.c, in the function

bool blk_mq_attempt_merge(struct request_queue *q,
                          struct blk_mq_ctx *ctx, struct bio *bio)

The ctx->rq_list queue is mostly empty, so the for loop inside the
body of the function is almost never executed.
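
To illustrate why an empty list means no merges, here is a toy model in
Python. It is not the kernel's algorithm: count_merges, the batch
parameter and the pending list (standing in for ctx->rq_list) are
hypothetical, and it only models back-merges of a sequential 4KB stream.

```python
def count_merges(starts, batch):
    """Toy model: queue bios until `batch` have arrived, then dispatch.

    A new bio is back-merged when it starts exactly where a pending
    request ends. Returns (merges, dispatched_requests).
    """
    pending = []      # [start, end) sector ranges waiting for dispatch
    merges = 0
    dispatched = 0
    queued = 0
    for start in starts:
        end = start + 8  # one 4KB bio covers 8 sectors of 512 bytes
        for req in pending:
            if req[1] == start:
                req[1] = end      # extend the pending request
                merges += 1
                break
        else:
            # Nothing to merge with (always the case if the list is empty).
            pending.append([start, end])
        queued += 1
        if queued == batch:
            dispatched += len(pending)
            pending.clear()
            queued = 0
    dispatched += len(pending)
    return merges, dispatched


sequential = range(0, 8 * 64, 8)           # 64 sequential 4KB bios
print(count_merges(sequential, batch=1))   # dispatch immediately
print(count_merges(sequential, batch=16))  # hold 16 bios before dispatch
```

With batch=1 the pending list is empty whenever a new bio arrives, so
the merge loop never finds a candidate; with batch=16 most bios merge.
That matches what I see in ctx->rq_list, although the real dispatch
path is of course more involved than this model.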