From: Jens Axboe
To: Jeff Moyer, linux-kernel@vger.kernel.org
Date: Wed, 08 Apr 2015 11:38:14 -0600
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
Message-ID: <55256786.8000608@kernel.dk>
In-Reply-To: <1428347694-17704-2-git-send-email-jmoyer@redhat.com>

On 04/06/2015 01:14 PM, Jeff Moyer wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os. This can be less than optimal (read:
> awful) for certain workloads. For example, consider an application
> that issues asynchronous O_DIRECT I/Os. It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the direct I/O code.
> The reason is that there are blk_plugs instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO. The latter will
> submit as few as one I/O at a time (if the I/O size is small enough)
> instead of performing the batching that the plugging infrastructure
> is supposed to provide.
>
> Now, for the case where there is an elevator involved, this doesn't
> really matter too much. The elevator will keep the I/O around long
> enough for it to be merged. However, in cases where there is no
> elevator (like blk-mq), I/Os are simply dispatched immediately.
>
> Try this, for example (note I'm using a virtio-blk device, so it's
> using the blk-mq single queue path, though I've also reproduced this
> with the Micron P320h):
>
> fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
>
> If you run that on a current kernel, you will get zero merges. Zero!
> After this patch, you will get many merges (the actual number depends
> on how fast your storage is, obviously), and much better throughput.
> Here are results from my test rig:
>
> Unpatched kernel:
>   Read B/W:    283,638 KB/s
>   Read Merges: 0
>
> Patched kernel:
>   Read B/W:    873,224 KB/s
>   Read Merges: 2,046K
>
> I considered several approaches to solving the problem:
>  1) get rid of the innermost plugs
>  2) handle nesting by using only one on-stack plug
>  2a) #2, except use a per-cpu blk_plug struct, which may clean up the
>      code a bit at the expense of memory footprint
>
> Option 1 will be tricky or impossible to do, since innermost plug
> lists are sometimes the only plug lists, depending on the call path.
> Option 2 is what this patch implements. Option 2a is perhaps a better
> idea, but since I already implemented option 2, I figured I'd post it
> for comments and opinions before rewriting it.
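In rough terms, the nesting described above has the shape below. This is
a simplified sketch of the problem, not the actual fs/aio.c or
fs/direct-io.c code; the do_io_submit()/do_direct_IO() references are
only in the comments, and it assumes the current (pre-patch)
blk_start_plug() interface:

/*
 * Simplified sketch only: an outer plug in the io_submit path and an
 * inner plug per dispatched I/O, as in do_io_submit()/do_direct_IO().
 */
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>

static void nested_plug_sketch(struct bio **bios, int nr)
{
	struct blk_plug outer;
	int i;

	blk_start_plug(&outer);			/* outer call site */
	for (i = 0; i < nr; i++) {
		struct blk_plug inner;

		blk_start_plug(&inner);		/* inner call site */
		submit_bio(READ, bios[i]);
		/*
		 * Today this flushes the inner list immediately, so the
		 * outer plug never accumulates a batch it could merge.
		 */
		blk_finish_plug(&inner);
	}
	blk_finish_plug(&outer);		/* nothing left to flush */
}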
> Much of the patch involves modifying call sites to blk_start_plug,
> since its signature is changed. The meat of the patch is actually
> pretty simple and constrained to block/blk-core.c and
> include/linux/blkdev.h. The only tricky bits were places where plugs
> were finished and then restarted to flush out I/O. There, I went
> ahead and exported blk_flush_plug_list and called that directly.
>
> Comments would be greatly appreciated.

It's hard to argue with the increased merging for your case. The task
plugs did originally work the way you have changed them to, not
flushing until the outermost plug was flushed. Unfortunately I don't
quite remember why I changed that; I'll have to do a bit of digging to
refresh my memory.

For cases where we don't do any merging (like nvme), we always want to
flush. Well, almost: if we start to utilize batched submission, then
the plug would still potentially help (just for reasons other than
merging).

And I agree with Ming, this can be cleaned up substantially.

I'd also like to see some test results from the other end of the
spectrum. Your posted case is clearly the best case (we missed tons of
merging, now we don't); I'd like to see a normal case and a worst case
result as well, so we have an idea of what this would do to latencies.

--
Jens Axboe
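For reference, the "finished and then restarted to flush out I/O"
pattern mentioned in the patch description, versus calling the exported
blk_flush_plug_list() directly, looks roughly like the sketch below.
This is a hedged illustration rather than the actual diff, and the
wrapper name is made up; blk_flush_plug_list() is the helper that
blk_finish_plug() already uses internally, and the second argument
indicates whether the flush comes from the scheduler:

#include <linux/blkdev.h>

/*
 * Sketch only: flush a plug's pending requests without tearing the
 * plug down, instead of the finish-then-restart dance.
 */
static void flush_plug_in_place(struct blk_plug *plug)
{
	/* Old pattern: blk_finish_plug(plug); blk_start_plug(plug); */

	/* With blk_flush_plug_list() callable directly, flush in place: */
	blk_flush_plug_list(plug, false);	/* false: not from the scheduler */
}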