Date: Wed, 9 Sep 2009 22:37:53 +0800
From: Wu Fengguang
To: Jan Kara
Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com, hch@infradead.org, akpm@linux-foundation.org,
	Theodore Ts'o
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Message-ID: <20090909143753.GA2071@localhost>
In-Reply-To: <20090909142315.GA7949@duck.suse.cz>

On Wed, Sep 09, 2009 at 10:23:15PM +0800, Jan Kara wrote:
> On Tue 08-09-09 20:32:26, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 19:55 +0200, Peter Zijlstra wrote:
> > >
> > > I think I'm somewhat confused here though..
> > >
> > > There's kernel threads doing writeout, and there's apps getting stuck in
> > > balance_dirty_pages().
> > >
> > > If we want all writeout to be done by kernel threads (bdi/pd-flush like
> > > things) then we still need to manage the actual apps and delay them.
> > >
> > > As things stand now, we kick pdflush into action when dirty levels are
> > > above the background level, and start writing out from the app task when
> > > we hit the full dirty level.
> > >
> > > Moving all writeout to a kernel thread sounds good from writing linear
> > > stuff pov, but what do we make apps wait on then?
> >
> > OK, so like said in the previous email, we could have these app tasks
> > simply sleep on a waitqueue which gets periodic wakeups from
> > __bdi_writeback_inc() every time the dirty threshold drops.
> >
> > The woken tasks would then check their bdi dirty limit (its task
> > dependent) against the current values and either go back to sleep or
> > back to work.
>
> Well, what I imagined we could do is:
> Have a per-bdi variable 'pages_written' - that would reflect the amount of
> pages written to the bdi since boot (OK, we'd have to handle overflows but
> that's doable).
>
> There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> in balance_dirty_pages() because we are over limits, it kicks the writeback
> thread and does:
>   to_wait = max(pages_waited, pages_written) + sync_dirty_pages()
>             (or whatever number we decide)
>   pages_waited = to_wait
>   sleep until pages_written reaches to_wait, or we drop below dirty limits.
>
> That will make sure each thread will sleep until writeback threads have done
> their duty for the writing thread.
>
> If we make sure sleeping threads are properly ordered on the wait queue,
> we could always wake up just the first one and thus avoid the herding
> effect. When we drop below dirty limits, we would just wake up the whole
> waitqueue.
>
> Does this sound reasonable?

Yup!
I have a similar idea: for each chunk the kernel writeback thread syncs, it
grants that many pages of quota to some waiting/sleeping dirtier task, which
can then consume the quota by dirtying that many more pages. This makes it
possible to control the relative/absolute writeback bandwidth of each
dirtier task, something like an IO controller.

Thanks,
Fengguang

> > The only problem would be the mass wakeups when lots of tasks are
> > blocked on dirty, but I'm guessing there's no way around that anyway,
> > and its better to have a limited number of writers than have everybody
> > write something, which would result in massive write fragmentation.
>
> 								Honza
> --
> Jan Kara
> SUSE Labs, CR