Message-ID: <1430142520.26534.16.camel@fedoraproject.org>
Subject: Re: loop block-mq conversion scalability issues
From: "Justin M. Forbes"
To: Ming Lei
Cc: linux-kernel, tom.leiming@gmail.com
Date: Mon, 27 Apr 2015 08:48:40 -0500
In-Reply-To: <20150426232732.32e3d6cc@tom-T450>
References: <1429823050.26534.9.camel@redhat.com> <20150424105908.47de489c@tom-T450>
	 <1429911962.26534.13.camel@fedoraproject.org> <20150426232732.32e3d6cc@tom-T450>

On Sun, 2015-04-26 at 23:27 +0800, Ming Lei wrote:
> Hi Justin,
> 
> On Fri, 24 Apr 2015 16:46:02 -0500
> "Justin M. Forbes" wrote:
> 
> > On Fri, 2015-04-24 at 10:59 +0800, Ming Lei wrote:
> > > Hi Justin,
> > > 
> > > Thanks for the report.
> > > 
> > > On Thu, 23 Apr 2015 16:04:10 -0500
> > > "Justin M. Forbes" wrote:
> > > 
> > > > The block-mq conversion for loop in 4.0 kernels is showing us an
> > > > interesting scalability problem with live CDs (ro, squashfs). It was
> > > > noticed when testing the Fedora beta that the more CPUs a liveCD image
> > > > was given, the slower it would boot. A 4 core qemu instance or bare
> > > > metal instance took more than twice as long to boot compared to a
> > > > single CPU instance. After investigating, this was traced directly to
> > > > the block-mq conversion; reverting these 4 patches restores
> > > > performance. More details are available at
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1210857
> > > > I don't think that reverting the patches is the ideal solution, so I
> > > > am looking for other options. Since you know this code a bit better
> > > > than I do, I thought I would run it by you while I am looking as well.
> > > 
> > > I can understand the issue, because the default @max_active for
> > > alloc_workqueue() is quite big (512), which may cause too many
> > > context switches and decrease loop I/O performance.
> > > 
> > > Actually I have written a kernel dio/aio based patch for decreasing
> > > both CPU and memory utilization without sacrificing I/O performance,
> > > and I will try to improve and push the patch during this cycle in the
> > > hope that it can be merged (the kernel/aio.c change has been dropped;
> > > only the fs change in fs/direct-io.c is needed).
> > > 
> > > But the following change should help in your case; could you test it?
> > > 
> > > ---
> > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > index c6b3726..b1cb41d 100644
> > > --- a/drivers/block/loop.c
> > > +++ b/drivers/block/loop.c
> > > @@ -1831,7 +1831,7 @@ static int __init loop_init(void)
> > >  	}
> > >  
> > >  	loop_wq = alloc_workqueue("kloopd",
> > > -			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
> > > +			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 32);
> > >  	if (!loop_wq) {
> > >  		err = -ENOMEM;
> > >  		goto misc_out;
> > > 
> > Patch tested, it made things worse (I gave up after 5 minutes and boot
> > still seemed hung). I also tried values of 1, 16, 64, and 128.
> > Everything below 128 was much worse than the current situation. Setting
> > it at 128 seemed about the same as booting without the patch. I can do
> > some more testing over the weekend, but I don't think this is the
> > correct solution.
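
As a side note, for anyone who wants to experiment with the max_active
value outside of the loop driver, a throwaway module along the following
lines creates a "kloopd"-style workqueue with the cap exposed as a module
parameter. The module and parameter names are made up for illustration,
and 0 selects the kernel's default cap:

/* wq_cap_test.c: create a workqueue with the same flags the loop driver
 * uses for kloopd, but with a configurable max_active, so the effect of
 * the concurrency cap can be tried in isolation.
 */
#include <linux/module.h>
#include <linux/workqueue.h>

static int max_active = 32;
module_param(max_active, int, 0444);
MODULE_PARM_DESC(max_active, "cap passed to alloc_workqueue(), 0 = kernel default");

static struct workqueue_struct *test_wq;

static int __init wq_cap_init(void)
{
	test_wq = alloc_workqueue("kloopd_test",
				  WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND,
				  max_active);
	if (!test_wq)
		return -ENOMEM;
	pr_info("kloopd_test workqueue created, max_active=%d\n", max_active);
	return 0;
}

static void __exit wq_cap_exit(void)
{
	destroy_workqueue(test_wq);
}

module_init(wq_cap_init);
module_exit(wq_cap_exit);
MODULE_LICENSE("GPL");
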
> 
> For describing the problem easily, here is the Fedora live CD file
> structure first:
> 
> Fedora-Live-Workstation-x86_64-22_Beta-TC8.iso
>     => LiveOS/
>            squashfs.img
>                => LiveOS/
>                       ext3fs.img
> 
> It looks like at least two reasons are related to the problem:
> 
> - unlike other filesystems (such as ext4), squashfs is a bit special,
>   and I observed that increasing the number of I/O jobs accessing a file
>   in squashfs can't improve I/O performance at all, but it can for ext4
> 
> - nested loop: both squashfs.img and ext3fs.img are mounted as loop
>   block devices
> 
> One key idea in commit b5dd2f60 ("block: loop: improve performance via
> blk-mq") is to submit I/O concurrently from more than one context
> (worker), in POSIX AIO style. Unfortunately this approach can't improve
> I/O performance for squashfs, it comes with the extra cost of kworker
> threads, and the nested loop makes it worse. Meanwhile, during booting
> there are lots of concurrent tasks requiring CPU, so the high priority
> kworker threads for loop can affect other boot tasks, and boot time
> increases.
> 
> I think the problem may be improved by removing the nested loop, for
> example by extracting the files in ext3fs.img directly into
> squashfs.img.
> 
> > I would be interested in testing your dio/aio patches as well though.
> 
> squashfs doesn't support dio, so the dio/aio patch can't help much, but
> the motivation for introducing dio/aio is really to avoid double caching
> and decrease CPU utilization[1].
> 
> [1] http://marc.info/?l=linux-kernel&m=142116397525668&w=2
> 
> The following patch may help the situation, but for this case I am
> wondering whether it can compete with the previous loop implementation.
> ---
> From 0af95571a2a066b4f3bacaac2c75b39e3c701c6e Mon Sep 17 00:00:00 2001
> From: Ming Lei
> Date: Sun, 26 Apr 2015 17:53:56 +0800
> Subject: [PATCH] block: loop: avoid too many pending per-work I/Os
> 
> If there are too many pending per-work I/Os, too many
> high priority worker threads can be generated, so that
> system performance is affected.
> 
> This patch limits the max pending per-work I/Os to 16,
> and degrades to single queue mode when the max number
> is reached.
> 
> This patch fixes the Fedora 22 live booting performance
> regression when it is booted from squashfs over dm
> based on loop.
> 
> Signed-off-by: Ming Lei
> ---
>  drivers/block/loop.c | 21 ++++++++++++++++++---
>  drivers/block/loop.h |  2 ++
>  2 files changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c6b3726..55bd04f 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1448,13 +1448,24 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		const struct blk_mq_queue_data *bd)
>  {
>  	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> +	struct loop_device *lo = cmd->rq->q->queuedata;
> +	bool single_queue = !!(cmd->rq->cmd_flags & REQ_WRITE);
> +
> +	/*
> +	 * Degrade to single queue mode if the pending per-work
> +	 * I/O number reaches 16, otherwise too many high priority
> +	 * worker threads may affect system performance, as reported
> +	 * in Fedora live booting from squashfs over loop.
> +	 */
> +	if (atomic_read(&lo->pending_per_work_io) >= 16)
> +		single_queue = true;
>  
>  	blk_mq_start_request(bd->rq);
>  
> -	if (cmd->rq->cmd_flags & REQ_WRITE) {
> -		struct loop_device *lo = cmd->rq->q->queuedata;
> +	if (single_queue) {
>  		bool need_sched = true;
>  
> +		cmd->per_work_io = false;
>  		spin_lock_irq(&lo->lo_lock);
>  		if (lo->write_started)
>  			need_sched = false;
> @@ -1466,6 +1477,8 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		if (need_sched)
>  			queue_work(loop_wq, &lo->write_work);
>  	} else {
> +		cmd->per_work_io = true;
> +		atomic_inc(&lo->pending_per_work_io);
>  		queue_work(loop_wq, &cmd->read_work);
>  	}
>  
> @@ -1490,6 +1503,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
>  	if (ret)
>  		cmd->rq->errors = -EIO;
>  	blk_mq_complete_request(cmd->rq);
> +	if (cmd->per_work_io)
> +		atomic_dec(&lo->pending_per_work_io);
>  }
>  
>  static void loop_queue_write_work(struct work_struct *work)
> @@ -1831,7 +1846,7 @@ static int __init loop_init(void)
>  	}
>  
>  	loop_wq = alloc_workqueue("kloopd",
> -			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
> +			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
>  	if (!loop_wq) {
>  		err = -ENOMEM;
>  		goto misc_out;
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index ffb6dd6..06d8f1a 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -57,6 +57,7 @@ struct loop_device {
>  	struct list_head	write_cmd_head;
>  	struct work_struct	write_work;
>  	bool			write_started;
> +	atomic_t		pending_per_work_io;
>  	int			lo_state;
>  	struct mutex		lo_ctl_mutex;
>  
> @@ -68,6 +69,7 @@ struct loop_device {
>  struct loop_cmd {
>  	struct work_struct	read_work;
>  	struct request		*rq;
> +	bool			per_work_io;
>  	struct list_head	list;
>  };
> 
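
For readers who are not deep in the driver internals, the gating the
patch adds can be modelled in plain userspace. The toy below only
illustrates the limit-then-serialize behaviour (up to 16 requests run
concurrently, and any overflow falls back to a serialized path); the
numbers and names are invented, and it is not the driver logic itself:

/* toy_gate.c: userspace model of the limit-then-serialize idea above.
 * Build with: cc -pthread toy_gate.c -o toy_gate
 */
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define PENDING_LIMIT 16	/* mirrors the >= 16 check in loop_queue_rq() */
#define NUM_REQUESTS  64

static atomic_int pending;	/* in-flight concurrent requests */

static void handle_request(int id, const char *path)
{
	usleep(1000);		/* stand-in for the actual I/O */
	printf("req %2d handled on %s path\n", id, path);
}

static void *concurrent_worker(void *arg)
{
	handle_request((int)(long)arg, "concurrent");
	atomic_fetch_sub(&pending, 1);	/* like the atomic_dec() in loop_handle_cmd() */
	return NULL;
}

int main(void)
{
	pthread_t tids[NUM_REQUESTS];
	int nthreads = 0;

	for (int i = 0; i < NUM_REQUESTS; i++) {
		if (atomic_load(&pending) < PENDING_LIMIT) {
			/* under the cap: hand the request to its own worker */
			atomic_fetch_add(&pending, 1);
			pthread_create(&tids[nthreads++], NULL,
				       concurrent_worker, (void *)(long)i);
		} else {
			/* over the cap: serialize (here the submitter itself
			 * plays the role of the single write worker)
			 */
			handle_request(i, "serialized");
		}
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	return 0;
}
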
This patch tests well. It gives comparable boot times to the 3.19 loop
implementation and does not seem to have problems scaling with SMP. I
would love to see this make 4.1 and 4.0.x stable as well.

Thanks,
Justin
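
P.S. On the dio/double-cache point above: for anyone who has not used it,
this is roughly what direct I/O against a backing file looks like from
userspace. The path is only a placeholder, and as Ming notes squashfs
itself does not support dio, so this only applies to a backing file that
lives on a filesystem which does:

/* toy_dio.c: read a loop backing file with O_DIRECT so the read bypasses
 * the page cache (the double caching mentioned above).
 * Build with: cc toy_dio.c -o toy_dio
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* placeholder path; point it at any backing image file */
	const char *path = argc > 1 ? argv[1] : "ext3fs.img";
	int fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open(O_DIRECT)");	/* fails where dio is unsupported */
		return 1;
	}

	void *buf;
	/* O_DIRECT needs block-aligned buffers; 4096 covers common cases */
	if (posix_memalign(&buf, 4096, 4096)) {
		close(fd);
		return 1;
	}

	ssize_t n = pread(fd, buf, 4096, 0);	/* does not populate the page cache */
	printf("read %zd bytes directly from %s\n", n, path);

	free(buf);
	close(fd);
	return 0;
}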