Message-ID: <1430142520.26534.16.camel@fedoraproject.org>
Subject: Re: loop block-mq conversion scalability issues
From: "Justin M. Forbes"
To: Ming Lei
Cc: linux-kernel, tom.leiming@gmail.com
Date: Mon, 27 Apr 2015 08:48:40 -0500
In-Reply-To: <20150426232732.32e3d6cc@tom-T450>
References: <1429823050.26534.9.camel@redhat.com> <20150424105908.47de489c@tom-T450>
	 <1429911962.26534.13.camel@fedoraproject.org> <20150426232732.32e3d6cc@tom-T450>

On Sun, 2015-04-26 at 23:27 +0800, Ming Lei wrote:
> Hi Justin,
> 
> On Fri, 24 Apr 2015 16:46:02 -0500
> "Justin M. Forbes" wrote:
> 
> > On Fri, 2015-04-24 at 10:59 +0800, Ming Lei wrote:
> > > Hi Justin,
> > > 
> > > Thanks for the report.
> > > 
> > > On Thu, 23 Apr 2015 16:04:10 -0500
> > > "Justin M. Forbes" wrote:
> > > 
> > > > The block-mq conversion for loop in 4.0 kernels is showing us an
> > > > interesting scalability problem with live CDs (ro, squashfs). It was
> > > > noticed when testing the Fedora beta that the more CPUs a liveCD image
> > > > was given, the slower it would boot. A 4 core qemu instance or bare
> > > > metal instance took more than twice as long to boot compared to a
> > > > single CPU instance. After investigating, this was traced directly to
> > > > the block-mq conversion; reverting these 4 patches restores
> > > > performance. More details are available at
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1210857
> > > > I don't think that reverting the patches is the ideal solution, so I
> > > > am looking for other options. Since you know this code a bit better
> > > > than I do, I thought I would run it by you while I am looking as well.
> > > 
> > > I can understand the issue, because the default @max_active for
> > > alloc_workqueue() is quite big (512), which may cause too many
> > > context switches and decrease loop I/O performance.
> > > 
> > > Actually I have written a kernel dio/aio based patch for decreasing
> > > both CPU and memory utilization without sacrificing I/O performance,
> > > and I will try to improve and push the patch during this cycle in the
> > > hope that it can be merged (the kernel/aio.c change has been dropped;
> > > only the fs change in fs/direct-io.c is needed).
> > > 
> > > But the following change should help in your case; could you test it?
> > > 
> > > ---
> > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > index c6b3726..b1cb41d 100644
> > > --- a/drivers/block/loop.c
> > > +++ b/drivers/block/loop.c
> > > @@ -1831,7 +1831,7 @@ static int __init loop_init(void)
> > >  	}
> > >  
> > >  	loop_wq = alloc_workqueue("kloopd",
> > > -			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
> > > +			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 32);
> > >  	if (!loop_wq) {
> > >  		err = -ENOMEM;
> > >  		goto misc_out;
> > > 
> > Patch tested, it made things worse (I gave up after 5 minutes and boot
> > still seemed hung). I also tried values of 1, 16, 64, and 128.
> > Everything below 128 was much worse than the current situation. Setting
> > it at 128 seemed about the same as booting without the patch. I can do
> > some more testing over the weekend, but I don't think this is the
> > correct solution.
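
As a side note, for anyone who wants to experiment with the max_active
value outside of the loop driver, a throwaway module along the following
lines creates a "kloopd"-style workqueue with the cap exposed as a module
parameter. The module and parameter names are made up for illustration,
and 0 selects the kernel's default cap:

/* wq_cap_test.c: create a workqueue with the same flags the loop driver
 * uses for kloopd, but with a configurable max_active, so the effect of
 * the concurrency cap can be tried in isolation.
 */
#include <linux/module.h>
#include <linux/workqueue.h>

static int max_active = 32;
module_param(max_active, int, 0444);
MODULE_PARM_DESC(max_active, "cap passed to alloc_workqueue(), 0 = kernel default");

static struct workqueue_struct *test_wq;

static int __init wq_cap_init(void)
{
	test_wq = alloc_workqueue("kloopd_test",
				  WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND,
				  max_active);
	if (!test_wq)
		return -ENOMEM;
	pr_info("kloopd_test workqueue created, max_active=%d\n", max_active);
	return 0;
}

static void __exit wq_cap_exit(void)
{
	destroy_workqueue(test_wq);
}

module_init(wq_cap_init);
module_exit(wq_cap_exit);
MODULE_LICENSE("GPL");
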
> 
> For describing the problem easily, here is the Fedora live CD file
> structure first:
> 
> Fedora-Live-Workstation-x86_64-22_Beta-TC8.iso
>     => LiveOS/
>            squashfs.img
>                => LiveOS/
>                       ext3fs.img
> 
> It looks like at least two reasons are related to the problem:
> 
> - unlike other filesystems (such as ext4), squashfs is a bit special,
>   and I observed that increasing the number of I/O jobs accessing a file
>   in squashfs can't improve I/O performance at all, but it can for ext4
> 
> - nested loop: both squashfs.img and ext3fs.img are mounted as loop
>   block devices
> 
> One key idea in commit b5dd2f60 ("block: loop: improve performance via
> blk-mq") is to submit I/O concurrently from more than one context
> (worker), in POSIX AIO style. Unfortunately this approach can't improve
> I/O performance for squashfs, it comes with the extra cost of kworker
> threads, and the nested loop makes it worse. Meanwhile, during booting
> there are lots of concurrent tasks requiring CPU, so the high priority
> kworker threads for loop can affect other boot tasks, and boot time
> increases.
> 
> I think the problem may be improved by removing the nested loop, for
> example by extracting the files in ext3fs.img directly into
> squashfs.img.
> 
> > I would be interested in testing your dio/aio patches as well though.
> 
> squashfs doesn't support dio, so the dio/aio patch can't help much, but
> the motivation for introducing dio/aio is really to avoid double caching
> and decrease CPU utilization[1].
> 
> [1] http://marc.info/?l=linux-kernel&m=142116397525668&w=2
> 
> The following patch may help the situation, but for this case I am
> wondering whether it can compete with the previous loop implementation.
> ---
> From 0af95571a2a066b4f3bacaac2c75b39e3c701c6e Mon Sep 17 00:00:00 2001
> From: Ming Lei
> Date: Sun, 26 Apr 2015 17:53:56 +0800
> Subject: [PATCH] block: loop: avoid too many pending per-work I/Os
> 
> If there are too many pending per-work I/Os, too many
> high priority worker threads can be generated, so that
> system performance is affected.
> 
> This patch limits the max pending per-work I/Os to 16,
> and degrades to single queue mode when the max number
> is reached.
> 
> This patch fixes the Fedora 22 live booting performance
> regression when it is booted from squashfs over dm
> based on loop.
> 
> Signed-off-by: Ming Lei
> ---
>  drivers/block/loop.c | 21 ++++++++++++++++++---
>  drivers/block/loop.h |  2 ++
>  2 files changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c6b3726..55bd04f 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1448,13 +1448,24 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		const struct blk_mq_queue_data *bd)
>  {
>  	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> +	struct loop_device *lo = cmd->rq->q->queuedata;
> +	bool single_queue = !!(cmd->rq->cmd_flags & REQ_WRITE);
> +
> +	/*
> +	 * Degrade to single queue mode if the pending per-work
> +	 * I/O number reaches 16, otherwise too many high priority
> +	 * worker threads may affect system performance, as reported
> +	 * in Fedora live booting from squashfs over loop.
> +	 */
> +	if (atomic_read(&lo->pending_per_work_io) >= 16)
> +		single_queue = true;
>  
>  	blk_mq_start_request(bd->rq);
>  
> -	if (cmd->rq->cmd_flags & REQ_WRITE) {
> -		struct loop_device *lo = cmd->rq->q->queuedata;
> +	if (single_queue) {
>  		bool need_sched = true;
>  
> +		cmd->per_work_io = false;
>  		spin_lock_irq(&lo->lo_lock);
>  		if (lo->write_started)
>  			need_sched = false;
> @@ -1466,6 +1477,8 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		if (need_sched)
>  			queue_work(loop_wq, &lo->write_work);
>  	} else {
> +		cmd->per_work_io = true;
> +		atomic_inc(&lo->pending_per_work_io);
>  		queue_work(loop_wq, &cmd->read_work);
>  	}
>  
> @@ -1490,6 +1503,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
>  	if (ret)
>  		cmd->rq->errors = -EIO;
>  	blk_mq_complete_request(cmd->rq);
> +	if (cmd->per_work_io)
> +		atomic_dec(&lo->pending_per_work_io);
>  }
>  
>  static void loop_queue_write_work(struct work_struct *work)
> @@ -1831,7 +1846,7 @@ static int __init loop_init(void)
>  	}
>  
>  	loop_wq = alloc_workqueue("kloopd",
> -			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
> +			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
>  	if (!loop_wq) {
>  		err = -ENOMEM;
>  		goto misc_out;
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index ffb6dd6..06d8f1a 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -57,6 +57,7 @@ struct loop_device {
>  	struct list_head	write_cmd_head;
>  	struct work_struct	write_work;
>  	bool			write_started;
> +	atomic_t		pending_per_work_io;
>  	int			lo_state;
>  	struct mutex		lo_ctl_mutex;
>  
> @@ -68,6 +69,7 @@ struct loop_device {
>  struct loop_cmd {
>  	struct work_struct	read_work;
>  	struct request		*rq;
> +	bool			per_work_io;
>  	struct list_head	list;
>  };
> 
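
For readers who are not deep in the driver internals, the gating the
patch adds can be modelled in plain userspace. The toy below only
illustrates the limit-then-serialize behaviour (up to 16 requests run
concurrently, and any overflow falls back to a serialized path); the
numbers and names are invented, and it is not the driver logic itself:

/* toy_gate.c: userspace model of the limit-then-serialize idea above.
 * Build with: cc -pthread toy_gate.c -o toy_gate
 */
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define PENDING_LIMIT 16	/* mirrors the >= 16 check in loop_queue_rq() */
#define NUM_REQUESTS  64

static atomic_int pending;	/* in-flight concurrent requests */

static void handle_request(int id, const char *path)
{
	usleep(1000);		/* stand-in for the actual I/O */
	printf("req %2d handled on %s path\n", id, path);
}

static void *concurrent_worker(void *arg)
{
	handle_request((int)(long)arg, "concurrent");
	atomic_fetch_sub(&pending, 1);	/* like the atomic_dec() in loop_handle_cmd() */
	return NULL;
}

int main(void)
{
	pthread_t tids[NUM_REQUESTS];
	int nthreads = 0;

	for (int i = 0; i < NUM_REQUESTS; i++) {
		if (atomic_load(&pending) < PENDING_LIMIT) {
			/* under the cap: hand the request to its own worker */
			atomic_fetch_add(&pending, 1);
			pthread_create(&tids[nthreads++], NULL,
				       concurrent_worker, (void *)(long)i);
		} else {
			/* over the cap: serialize (here the submitter itself
			 * plays the role of the single write worker)
			 */
			handle_request(i, "serialized");
		}
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	return 0;
}
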
This patch tests well. It gives comparable boot times to the 3.19 loop
implementation and does not seem to have problems scaling with SMP. I
would love to see this make 4.1 and 4.0.x stable as well.

Thanks,
Justin
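
P.S. On the dio/double-cache point above: for anyone who has not used it,
this is roughly what direct I/O against a backing file looks like from
userspace. The path is only a placeholder, and as Ming notes squashfs
itself does not support dio, so this only applies to a backing file that
lives on a filesystem which does:

/* toy_dio.c: read a loop backing file with O_DIRECT so the read bypasses
 * the page cache (the double caching mentioned above).
 * Build with: cc toy_dio.c -o toy_dio
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* placeholder path; point it at any backing image file */
	const char *path = argc > 1 ? argv[1] : "ext3fs.img";
	int fd = open(path, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open(O_DIRECT)");	/* fails where dio is unsupported */
		return 1;
	}

	void *buf;
	/* O_DIRECT needs block-aligned buffers; 4096 covers common cases */
	if (posix_memalign(&buf, 4096, 4096)) {
		close(fd);
		return 1;
	}

	ssize_t n = pread(fd, buf, 4096, 0);	/* does not populate the page cache */
	printf("read %zd bytes directly from %s\n", n, path);

	free(buf);
	close(fd);
	return 0;
}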