Date: Fri, 26 Jun 2009 12:44:06 +0200
From: Jens Axboe
To: Wu Fengguang
Cc: Jeff Moyer, Ralf Gross, "linux-kernel@vger.kernel.org",
 linux-fsdevel@vger.kernel.org
Subject: Re: io-scheduler tuning for better read/write ratio
Message-ID: <20090626104406.GK23611@kernel.dk>
References: <20090616154342.GA7043@p15145560.pureserver.info>
 <4A37CB2A.6010209@davidnewall.com>
 <20090616184027.GB7043@p15145560.pureserver.info>
 <4A37E7DB.7030100@redhat.com>
 <20090616185600.GC7043@p15145560.pureserver.info>
 <20090622163113.GD12483@p15145560.pureserver.info>
 <20090626021905.GA23981@localhost>
In-Reply-To: <20090626021905.GA23981@localhost>

On Fri, Jun 26 2009, Wu Fengguang wrote:
> On Tue, Jun 23, 2009 at 03:42:46AM +0800, Jeff Moyer wrote:
> > Ralf Gross writes:
> > 
> > > Jeff Moyer schrieb:
> > >> Jeff Moyer writes:
> > >> 
> > >> > Ralf Gross writes:
> > >> > 
> > >> >> Casey Dahlin schrieb:
> > >> >>> On 06/16/2009 02:40 PM, Ralf Gross wrote:
> > >> >>> > David Newall schrieb:
> > >> >>> >> Ralf Gross wrote:
> > >> >>> >>> write throughput is much higher than the read throughput (40 MB/s
> > >> >>> >>> read, 90 MB/s write).
> > >> >>> > 
> > >> >>> > Hm, but I get higher read throughput (160-200 MB/s) if I don't write
> > >> >>> > to the device at the same time.
> > >> >>> > 
> > >> >>> > Ralf
> > >> >>> 
> > >> >>> How specifically are you testing? It could depend a lot on the
> > >> >>> particular access patterns you're using to test.
> > >> >> 
> > >> >> I did the basic tests with tiobench. The real test is a test backup
> > >> >> (bacula) with 2 jobs that create 2 30 GB spool files on that device.
> > >> >> The jobs partially write to the device in parallel. Depending on which
> > >> >> spool file reaches the 30 GB first, one job starts reading from that
> > >> >> file and writing to tape, while the other is still spooling.
> > >> > 
> > >> > We are missing a lot of details here. I guess the first thing I'd try
> > >> > would be bumping up the max_readahead_kb parameter, since I'm guessing
> > >> > that your backup application isn't driving very deep queue depths. If
> > >> > that doesn't work, then please provide the exact invocations of tiobench
> > >> > that reproduce the problem, or some blktrace output for your real test.
> > >> 
> > >> Any news, Ralf?
> > > 
> > > Sorry for the delay. At the moment there are large backups running and
> > > using the raid device for spooling, so I can't do any tests.
> > > 
> > > Re. read-ahead: I tested different settings from 8 KB to 65 KB; this
> > > didn't help.
> > > 
> > > I'll do some more tests when the backups are done (3-4 more days).
> > 
> > The default is 128 KB, I believe, so it's strange that you would test
> > smaller values. ;) I would try something along the lines of 1 or 2 MB.
> > 
> > I'm CCing Fengguang in case he has any suggestions.
> 
> Jeff, thank you for forwarding this (and sorry for the long delay)!
> 
> The read:write (or rather sync:async) ratio control is an IO scheduler
> feature. CFQ has the parameters slice_sync and slice_async for that.
> What's more, CFQ will make async IO wait if there is any sync IO in
> flight. This is good, but not quite enough. Normally sync IOs come one
> by one, with a small idle time window in between. If we only start
> dispatching async IOs, e.g., 1 ms after the last sync IO has completed,
> then we can hold off the async background write IOs while there is an
> active sync foreground read IO stream.
> 
> This simple patch aims to address the writes-push-aside-reads problem.
> Ralf, you can try applying this patch and running your workload with
> this (huge) CFQ parameter:
> 
> echo 1000 > /sys/block/sda/queue/iosched/slice_sync
> 
> The patch is based on 2.6.30, but can be trivially backported if you
> want to use an older kernel.
> 
> It may impact overall (sync+async) IO throughput when there are one or
> more ongoing sync IO streams, so it requires considerable benchmarking
> and adjustment.
> 
> Thanks,
> Fengguang
> ---
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index a55a9bd..14011b7 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1064,7 +1064,6 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
>  		return;
>  
> -	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
>  	WARN_ON(cfq_cfqq_slice_new(cfqq));
>  
>  	/*
> @@ -2175,8 +2174,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  	 * or if we want to idle in case it has no pending requests.
>  	 */
>  	if (cfqd->active_queue == cfqq) {
> -		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
> -
>  		if (cfq_cfqq_slice_new(cfqq)) {
>  			cfq_set_prio_slice(cfqd, cfqq);
>  			cfq_clear_cfqq_slice_new(cfqq);
> @@ -2190,8 +2187,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  		 */
>  		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
>  			cfq_slice_expired(cfqd, 1);
> -		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
> -			 sync && !rq_noidle(rq))
> +		else if (sync && !rq_noidle(rq) &&
> +			 !cfq_close_cooperator(cfqd, cfqq, 1))
>  			cfq_arm_slice_timer(cfqd);
>  	}

What's the purpose of this patch? If you have requests pending, you don't
want to arm the idle timer and wait, you want to dispatch those.

-- 
Jens Axboe
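
For completeness, the two suggestions in this thread come down to a handful of
sysfs writes. A minimal sketch follows; the device name sdb and the 2048 KB
read-ahead value are placeholders rather than values taken from the thread, and
the iosched/ directory only exists while CFQ is the active scheduler:

  #!/bin/sh
  # Sketch only: "sdb" and the 2048 KB read-ahead value are assumed placeholders.
  dev=sdb

  # Confirm CFQ is the active scheduler (the one shown in brackets).
  cat /sys/block/$dev/queue/scheduler

  # Jeff's suggestion: raise read-ahead from the 128 KB default to 1-2 MB.
  echo 2048 > /sys/block/$dev/queue/read_ahead_kb

  # Fengguang's suggestion: a (huge) 1000 ms sync slice, on top of his patch.
  echo 1000 > /sys/block/$dev/queue/iosched/slice_sync

  # Current async slice for comparison (defaults: slice_sync=100, slice_async=40).
  cat /sys/block/$dev/queue/iosched/slice_async

These settings are per device and are lost on reboot, so they would need to be
reapplied (e.g. from an init script) if they turn out to help.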
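Since exact test invocations were asked for earlier, a crude way to approximate
the bacula spooling pattern (one job still writing its 30 GB spool file while
another streams a finished spool file out) is a pair of dd processes on the
spool file system. The mount point and file names below are illustrative
assumptions only, not paths from the thread:

  #!/bin/sh
  # Rough stand-in for the spool workload, not the real bacula jobs:
  # stream a previously written file back in while a 30 GB sequential
  # write runs in the background.
  sync
  echo 3 > /proc/sys/vm/drop_caches   # make the read hit the disk, not the page cache

  dd if=/dev/zero of=/mnt/spool/writer.tmp bs=1M count=30720 &
  dd if=/mnt/spool/finished-spool.tmp of=/dev/null bs=1M
  wait

dd reports the throughput of each stream when it finishes, so the read figure
can be compared with and without the background write running, and again after
changing read_ahead_kb or slice_sync.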