Subject: Re: [RFC] blk-mq and I/O scheduling
To: Andreas Herrmann <aherrmann@suse.com>, Christoph Hellwig <hch@lst.de>
References: <20151119120235.GA7966@suselix.suse.de>
Cc: linux-kernel@vger.kernel.org
From: Jens Axboe <axboe@kernel.dk>
Message-ID: <56561049.7000201@kernel.dk>
Date: Wed, 25 Nov 2015 12:47:21 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.3.0
MIME-Version: 1.0
In-Reply-To: <20151119120235.GA7966@suselix.suse.de>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4120
Lines: 85

On 11/19/2015 05:02 AM, Andreas Herrmann wrote:
> Hi,
>
> I've looked into blk-mq and possible support for I/O scheduling.
>
> The reason for this is to minimize performance degradation with
> rotational devices when scsi_mod.use_blk_mq=1 is switched on.
>
> I think that the degradation is well reflected with fio measurements.
> With an increasing number of jobs you'll encounter a significant
> performance drop for sequential reads and writes with blk-mq in
> contrast to CFQ. blk-mq ensures that requests from different processes
> (CPUs) are "perfectly shuffled" in a hardware queue. This is no
> problem for non-rotational devices, at which blk-mq is aimed, but
> not so nice for rotational disks.
>
>    (i) I've done some tests with patch c2ed2f2dcf92 (blk-mq: first cut
>        deadline scheduling) from branch mq-deadline of linux-block
>        repository. I've not seen a significant performance impact when
>        enabling it (neither for non-rotational nor for rotational
>        disks).
>
>   (ii) I've played with code to enable sorting/merging of requests. I
>        did this in flush_busy_ctxs. This didn't have a performance
>        impact either. On a closer look this was due to high frequency
>        of calls to __blk_mq_run_hw_queue. There was almost nothing to
>        sort (too few requests). I guess that's also the reason why (i)
>        had not much impact.
>
>  (iii) With CFQ I've observed similar performance patterns to blk-mq if
>        slice_idle was set to 0.
>
>   (iv) I thought about introducing a per software queue time slice
>        during which blk-mq will service only one software queue (one
>        CPU) and not flush all software queues. This could help to
>        enqueue multiple requests belonging to the same process (as long
>        as it runs on same CPU) into a hardware queue.  A minimal patch
>        to implement this is attached below.
>
> The latter helped to improve performance for sequential reads and
> writes. But it's not on a par with CFQ. Increasing the time slice is
> suboptimal (as shown with the 2ms results, see below). It might be
> possible to get better performance when further reducing the initial
> time slice and adapting it up to a maximum value if there are
> repeatedly pending requests for a CPU.
>
> After these observations and assuming that non-rotational devices are
> most likely fine using blk-mq without I/O scheduling support I wonder
> whether
>
> - it's really a good idea to re-implement scheduling support for
>    blk-mq that eventually behaves like CFQ for rotational devices.
>
> - it's technically possible to support both blk-mq and CFQ for different
>    devices on the same host adapter. This would allow using "good old"
>    code for "good old" rotational devices. (But this might not be a
>    choice if in the long run a goal is to get rid of non-blk-mq code --
>    not sure what the plans are.)
>
> What do you think about this?

Sorry, I did not get around to looking at this properly this week; I'll 
tend to it next week. I think the concept of tying the idling to a 
specific CPU is likely fine, though I wonder if there are cases where we 
preempt more heavily and subsequently miss breaking the idling properly. 
I don't think we want/need cfq for blk-mq, but basic idling could 
potentially be enough. That's still a far cry from a full cfq 
implementation. The long term plans are still to move away from the 
legacy IO path, though with things like scheduling, that's sure to take 
some time.

That is actually where the mq-deadline work comes in. The mq-deadline 
work is missing a test patch to limit tag allocations, and a bunch of 
other little things to truly make it functional. There might be some 
options for folding it all together, with idling, as that would still be 
important on rotating storage going forward.

-- 
Jens Axboe
