Date: Tue, 7 Apr 2015 09:09:58 -0700
From: Shaohua Li
To: Jens Axboe
Cc: Jeff Moyer, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
Message-ID: <20150407160958.GA105179@kernel.org>
References: <1428347694-17704-1-git-send-email-jmoyer@redhat.com>
 <1428347694-17704-2-git-send-email-jmoyer@redhat.com>
 <55256786.8000608@kernel.dk>
In-Reply-To: <55256786.8000608@kernel.dk>

On Wed, Apr 08, 2015 at 11:38:14AM -0600, Jens Axboe wrote:
> On 04/06/2015 01:14 PM, Jeff Moyer wrote:
> >The way the on-stack plugging currently works, each nesting level
> >flushes its own list of I/Os. This can be less than optimal (read:
> >awful) for certain workloads. For example, consider an application
> >that issues asynchronous O_DIRECT I/Os. It can send down a bunch of
> >I/Os together in a single io_submit call, only to have each of them
> >dispatched individually down in the bowels of the direct I/O code.
> >The reason is that there are blk_plug's instantiated both at the upper
> >call site in do_io_submit and down in do_direct_IO. The latter will
> >submit as few as one I/O at a time (if the I/O size is small enough)
> >instead of performing the batching that the plugging infrastructure
> >is supposed to provide.
> >
> >Now, for the case where there is an elevator involved, this doesn't
> >really matter too much. The elevator will keep the I/O around long
> >enough for it to be merged. However, in cases where there is no
> >elevator (like blk-mq), I/Os are simply dispatched immediately.
> >
> >Try this, for example (note I'm using a virtio-blk device, so it's
> >using the blk-mq single queue path, though I've also reproduced this
> >with the Micron p320h):
> >
> >fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
> >
> >If you run that on a current kernel, you will get zero merges. Zero!
> >After this patch, you will get many merges (the actual number depends
> >on how fast your storage is, obviously), and much better throughput.
> >Here are results from my test rig:
> >
> >Unpatched kernel:
> >Read B/W:    283,638 KB/s
> >Read Merges: 0
> >
> >Patched kernel:
> >Read B/W:    873,224 KB/s
> >Read Merges: 2,046K
> >
> >I considered several approaches to solving the problem:
> >1) get rid of the innermost plugs
> >2) handle nesting by using only one on-stack plug
> >2a) #2, except use a per-cpu blk_plug struct, which may clean up the
> >    code a bit at the expense of memory footprint
> >
> >Option 1 will be tricky or impossible to do, since innermost plug
> >lists are sometimes the only plug lists, depending on the call path.
> >Option 2 is what this patch implements. Option 2a is perhaps a better
> >idea, but since I already implemented option 2, I figured I'd post it
> >for comments and opinions before rewriting it.
> >
> >Much of the patch involves modifying call sites to blk_start_plug,
> >since its signature is changed. The meat of the patch is actually
> >pretty simple and constrained to block/blk-core.c and
> >include/linux/blkdev.h. The only tricky bits were places where plugs
> >were finished and then restarted to flush out I/O. There, I went
> >ahead and exported blk_flush_plug_list and called that directly.
> >
> >Comments would be greatly appreciated.
>
> It's hard to argue with the increased merging for your case. The task
> plugs did originally work like you changed them to, not flushing until
> the outermost plug was flushed. Unfortunately I don't quite remember
> why I changed them; I will have to do a bit of digging to refresh my
> memory.

The behavior never changed: the current code only flushes the outermost
plug. blk_start_plug() doesn't assign an inner plug to the current task,
so requests are all added to the outermost plug. Maybe the code can be
cleaned up as:

start_plug(plug)
{
	if (current->plug)
		return;
	current->plug = plug;
}

end_plug(plug)
{
	if (plug != current->plug)
		return;
	flush_plug(plug);
	current->plug = NULL;
}
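
For illustration, here is a minimal, self-contained userspace sketch of
the nesting behaviour described above. It is not the kernel code:
plug_start/plug_add_request/plug_end and the request counter are made-up
stand-ins for current->plug and the plug's request list. It just shows
that, under this scheme, only the outermost start/end pair owns
current->plug, so an inner pair is a no-op and everything queued inside
the nest is flushed in one batch by the outermost end.

#include <stdio.h>

struct plug {
	int nr_requests;		/* stand-in for the plug's request list */
};

static struct plug *current_plug;	/* stand-in for current->plug */

static void plug_start(struct plug *plug)
{
	if (current_plug)		/* nested plug: leave the outer one in place */
		return;
	plug->nr_requests = 0;
	current_plug = plug;
}

static void plug_add_request(void)
{
	if (current_plug)
		current_plug->nr_requests++;	/* always queued on the outermost plug */
}

static void plug_end(struct plug *plug)
{
	if (plug != current_plug)	/* inner end: nothing to flush */
		return;
	printf("flushing %d request(s) in one batch\n", plug->nr_requests);
	current_plug = NULL;
}

int main(void)
{
	struct plug outer, inner;

	plug_start(&outer);		/* outer plug, e.g. the io_submit path */
	plug_start(&inner);		/* inner plug, e.g. direct I/O; becomes a no-op */
	plug_add_request();
	plug_add_request();
	plug_end(&inner);		/* flushes nothing */
	plug_add_request();
	plug_end(&outer);		/* flushes all three requests at once */
	return 0;
}

Built with any C compiler, it prints nothing when the inner plug is
ended and "flushing 3 request(s) in one batch" when the outer one is.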