Date: Sat, 26 Sep 2009 11:02:30 +0800
From: Wu Fengguang
To: Dave Chinner
Cc: Chris Mason, Andrew Morton, Peter Zijlstra, "Li, Shaohua",
    linux-kernel@vger.kernel.org, richard@rsk.demon.co.uk,
    jens.axboe@oracle.com, Jan Kara, linux-fsdevel@vger.kernel.org
Subject: Re: regression in page writeback
Message-ID: <20090926030230.GA7770@localhost>
In-Reply-To: <20090926014715.GD9464@discord.disaster>

On Sat, Sep 26, 2009 at 09:47:15AM +0800, Dave Chinner wrote:
> On Fri, Sep 25, 2009 at 11:19:20AM +0800, Wu Fengguang wrote:
> > On Fri, Sep 25, 2009 at 08:11:17AM +0800, Dave Chinner wrote:
> > > On Thu, Sep 24, 2009 at 11:15:08AM +0800, Wu Fengguang wrote:
> > > > On Wed, Sep 23, 2009 at 10:00:58PM +0800, Chris Mason wrote:
> > > > > The only place that actually honors the congestion flag is pdflush.
> > > > > It's trivial to get pdflush backed up and make it sit down without
> > > > > making any progress because once the queue congests, pdflush goes away.
> > > >
> > > > Right. I guess that's more or less intentional - to give lowest priority
> > > > to periodic/background writeback.
> > >
> > > IMO, this is the wrong design. Background writeback should
> > > have higher CPU/scheduler priority than normal tasks. If there are
> > > sufficient dirty pages in the system for background writeback to
> > > be active, it should be running *now* to start as much IO as it can
> > > without being held up by other, lower priority tasks.
> > >
> > > Cleaning pages is important to keeping the system running smoothly.
> > > Given that IO takes time to clean pages, it is therefore important
> > > to issue as much as possible as quickly as possible without delays
> > > before going back to sleep. Delaying issue of the IO or doing
> > > sub-optimal issue simply reduces performance of the system because
> > > it takes longer to clean the same number of dirty pages.
> > >
> > > > > Nothing stops other procs from keeping the queue congested forever.
> > > > > This can only be fixed by making everyone wait for congestion, at which
> > > > > point we might as well wait for requests.
> > > >
> > > > Yes. That gives everyone a somewhat equal opportunity; this is a policy
> > > > change that may lead to interesting effects, as well as presenting a
> > > > challenge to get_request_wait(). That said, I'm not against the change
> > > > to a wait queue in general.
> > >
> > > If you block all threads doing _writebehind caching_ (synchronous IO
> > > is self-throttling) to the same BDI on the same queue as the bdi
> > > flusher, then when congestion clears the higher priority background
> > > flusher thread should run first and issue more IO. This should
> > > happen as a natural side-effect of our scheduling algorithms and it
> > > gives preference to efficient background writeback over inefficient
> > > foreground writeback. Indeed, with this approach we can even avoid
> > > foreground writeback altogether...
> >
> > I don't see how balance_dirty_pages() writeout is less efficient than
> > pdflush writeout.
> >
> > They all call the same routines to do the job.
> > balance_dirty_pages() sets nr_to_write=1536 at least for ext4 and xfs
> > (unless memory is tight; btrfs is 1540), which is in fact 50% bigger
> > than the 1024 pages used by pdflush.
>
> Sure, but the problem now is that once you are above the
> bdi->dirty_exceeded threshold, foreground writeback tries to issue
> 1536 pages of IO for every 8 pages that are dirtied. That means you'll

I'd suggest increasing that "8 pages". It may have been a good value
when ratelimit_pages was statically set to 32. Now that ratelimit_pages
is dynamically set to a much larger value at boot time, we'd better use
ratelimit_pages/4 for the dirty_exceeded case.

> block just about every writing process in writeback at the same time
> and they will all be blocked in congestion trying to write different
> inodes....

Ah, got it: balance_dirty_pages() could lead to too much concurrency
and thus seeks!
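
By the way, the ratelimit change I'm suggesting above amounts to
something like the sketch below - plain user-space C rather than
kernel code, with the ratelimit_pages value picked arbitrarily for
illustration:

#include <stdio.h>

/*
 * Toy model of the dirtier-side ratelimit discussed above -- plain
 * user-space C, not kernel code.  In the real kernel ratelimit_pages
 * is scaled up at boot; 1024 here is just an assumed example value.
 */
static unsigned long ratelimit_pages = 1024;

/* How many pages a task may dirty between two throttle checks. */
static unsigned long task_ratelimit(int dirty_exceeded, int use_suggested)
{
	if (!dirty_exceeded)
		return ratelimit_pages;

	if (!use_suggested)
		return 8;		/* current: hard-coded 8 pages */

	return ratelimit_pages / 4;	/* suggested: scale with ratelimit_pages */
}

int main(void)
{
	printf("normal:                     %lu pages between checks\n",
	       task_ratelimit(0, 0));
	printf("dirty_exceeded (current):   %lu pages between checks\n",
	       task_ratelimit(1, 0));
	printf("dirty_exceeded (suggested): %lu pages between checks\n",
	       task_ratelimit(1, 1));
	return 0;
}

The point is only that the recheck interval scales with
ratelimit_pages instead of staying at a hard-coded 8.
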
> > And it won't back off on congestion.
>
> And that is, IMO, a major problem.

Not necessarily; the above concurrency problem occurs even if it backs
off on congestion. But anyway, we are switching to get_request_wait()
now.

> > The s_io/b_io queues are shared, so a balance_dirty_pages() will just
> > continue from where the last sync thread exited. So it does not make
> > much difference who initiates the IO. Did I miss something?
>
> The current implementation uses the request queue to do that
> blocking at IO submission time. This is based on the premise that if
> we write a certain number of pages, we're guaranteed to have waited
> long enough for that many pages to come clean.

Right.

> However, every other
> thread doing writes and being throttled does the same thing. This
> leads to N IO submitters from at least N different inodes at the
> same time. Which inode gets written when congestion clears is
> anyone's guess - it's a thundering herd, IIUC the congestion
> implementation correctly.

Good insight, thanks for pointing this out!

> The result is that we end up with N different sets of IO being
> issued with potentially zero locality to each other, resulting in
> much lower elevator sort/merge efficiency and hence we seek the disk
> all over the place to service the different sets of IO.

Right. But note that its negative effects are not that common given
the current parameters MAX_WRITEBACK_PAGES=1024, max_sectors_kb=512
and nr_requests=128, as we move on to the next inode anyway once 4MB
of it has been enqueued. So a request queue can hold up to
128/(4096/512) = 16 inodes, and the problem will only turn up when
there are >= 16 throttled processes. When we increase
MAX_WRITEBACK_PAGES to 128MB worth of pages, though, even _one_
foreground writeout will hurt.
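
To spell that arithmetic out (a stand-alone user-space calculation;
the numbers are the defaults mentioned in this thread, nothing is read
from a live system):

#include <stdio.h>

/*
 * The 128/(4096/512) = 16 arithmetic above, spelled out.  Plain
 * user-space C; the parameter values are the defaults quoted in this
 * thread, not read from a running kernel.
 */
int main(void)
{
	unsigned long nr_requests = 128;	/* block queue depth */
	unsigned long max_sectors_kb = 512;	/* max KB per request */
	unsigned long page_size_kb = 4;		/* 4KB pages */
	unsigned long chunks[] = { 1024, 32768 }; /* MAX_WRITEBACK_PAGES: 4MB and 128MB */
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long chunk_kb = chunks[i] * page_size_kb;
		unsigned long requests_per_inode = chunk_kb / max_sectors_kb;
		unsigned long inodes_in_queue = nr_requests / requests_per_inode;

		printf("MAX_WRITEBACK_PAGES=%lu (%lu MB): %lu requests/inode, "
		       "queue holds %lu inode(s)\n",
		       chunks[i], chunk_kb / 1024,
		       requests_per_inode, inodes_in_queue);
	}
	return 0;
}

With a 128MB chunk a single inode already needs 256 requests, twice
the whole queue, which is why even one foreground writeout hurts
there.
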
> OTOH, if there is only one submission thread, it doesn't jump
> between inodes in the same way when congestion clears - it keeps
> writing to the same inode, resulting in large related chunks of
> sequential IOs being issued to the disk. This is more efficient than
> the above foreground writeback because the elevator works better and
> the disk seeks less.
>
> As you can probably guess, I think foreground writeout is the wrong
> architecture because of the behaviour it induces under heavy
> multithreaded IO patterns. I agree that it works OK if we continue
> tweaking it to fix problems.

Agreed.

> However, my concern is that if it isn't constantly observed, tweaked
> and maintained, performance goes backwards as other code changes.
> i.e. there is a significant maintenance burden, and looking at the
> problems once every couple of years (last big maintenance rounds were
> 2.6.15/16, 2.6.23/24, now 2.6.31/32) isn't good enough to prevent
> performance from sliding backwards from release to release.

One problem is that the queues are very tightly coupled. Every change
of behavior leads to reevaluation of other factors.

> ----
>
> The rest of this is an idea I've been kicking around for a while
> which is derived from IO throttling work I've done during a
> past life. I haven't had time to research and prototype it to see
> if it performs any better under really heavy load, but I'm going to
> throw it out anyway so that everyone can see a little bit more about
> what I'm thinking.
>
> My fundamental premise is that throttling does not require IO to be
> issued from the thread to be throttled. The essence of write
> throttling is waiting for more pages to be cleaned in a period of
> time than have been dirtied, i.e. what we are really waiting on is
> pages to be cleaned.

Yes, Peter's __bdi_writeout_inc() is a good watch point.

> Based on this observation, if we have a flusher thread working in
> the background, we don't need to submit more IO when we need to
> throttle, as all we need to do is wait for a certain number of pages
> to transition to the clean state.

It's great that you and Mason (and others) are advocating the same
idea :) Jan and I proposed some possible solutions recently:
http://lkml.org/lkml/2009/9/14/126

> If we take a leaf from XFS's book by doing work at IO completion
> rather than submission, we can keep a count of the number of pages
> cleaned on the bdi. This can be used to implement a FIFO-like
> throttle. If we add a simple ticket system to the bdi, when a
> process needs to be throttled it can submit a ticket with the number
> of pages it has dirtied to the bdi, and the bdi can then decide what
> the page cleaned count needs to reach before the process can be
> woken.

Exactly.

> i.e. Take the following ascii art showing the bdi flusher
> thread running and issuing IO in the background:
>
> bdi write thread: +---X------Y---+-A-----ZB----+------C--------+
>        1st proc:      o............o
>        2nd proc:             o............o
>        3rd proc:                         o............o
>
> When the 1st process comes in to be throttled, it samples the page
> clean count and gets X. It submits a ticket to be woken at A (X +
> some number of pages). If the flusher thread is not running, it gets
> kicked. Processes 2 and 3 do the same at Y and Z, to be woken at B
> and C. At IO completion, the number of pages cleaned is counted and
> the tickets that are now under the clean count are pulled from the
> queue and the processes that own them are woken.
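
Just to make sure I read the mechanism right, here is a toy user-space
model of it - pthreads, made-up names and numbers, purely illustrative
and not meant as kernel code:

#include <pthread.h>
#include <stdio.h>

/*
 * Toy user-space model of the ticket throttle described above: a
 * throttled writer never submits IO itself, it queues a ticket saying
 * "wake me when this bdi has cleaned N more pages", and the (simulated)
 * IO completion path releases every ticket the cleaned-page counter has
 * passed.  All names and numbers are invented; this is not kernel code.
 */
struct ticket {
	unsigned long wake_at;		/* pages_cleaned value to wake at */
	pthread_cond_t cond;
	struct ticket *next;
};

static struct {
	pthread_mutex_t lock;
	pthread_cond_t queued_cond;	/* demo only: main waits for all tickets */
	int queued;
	unsigned long pages_cleaned;	/* bumped at "IO completion" */
	struct ticket *head, *tail;	/* FIFO of waiting writers */
} bdi = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.queued_cond = PTHREAD_COND_INITIALIZER,
};

/* Called by a writer instead of doing foreground writeback. */
static void throttle(unsigned long nr_dirtied)
{
	struct ticket t = { .next = NULL };

	pthread_cond_init(&t.cond, NULL);
	pthread_mutex_lock(&bdi.lock);
	t.wake_at = bdi.pages_cleaned + nr_dirtied;
	if (bdi.tail)
		bdi.tail->next = &t;
	else
		bdi.head = &t;
	bdi.tail = &t;
	bdi.queued++;
	pthread_cond_signal(&bdi.queued_cond);
	while (bdi.pages_cleaned < t.wake_at)
		pthread_cond_wait(&t.cond, &bdi.lock);
	pthread_mutex_unlock(&bdi.lock);
	pthread_cond_destroy(&t.cond);
}

/* Called from the flusher's (simulated) IO completion path. */
static void pages_written_back(unsigned long nr)
{
	pthread_mutex_lock(&bdi.lock);
	bdi.pages_cleaned += nr;
	/* FIFO: a ticket is only checked once everything ahead of it is released. */
	while (bdi.head && bdi.head->wake_at <= bdi.pages_cleaned) {
		struct ticket *t = bdi.head;

		bdi.head = t->next;
		if (!bdi.head)
			bdi.tail = NULL;
		pthread_cond_signal(&t->cond);
	}
	pthread_mutex_unlock(&bdi.lock);
}

static void *writer(void *arg)
{
	long id = (long)arg;

	throttle(100 * (id + 1));	/* pretend we dirtied some pages */
	printf("writer %ld woken\n", id);
	return NULL;
}

int main(void)
{
	pthread_t w[3];
	long i;

	for (i = 0; i < 3; i++)
		pthread_create(&w[i], NULL, writer, (void *)i);

	pthread_mutex_lock(&bdi.lock);
	while (bdi.queued < 3)		/* wait until every ticket is queued */
		pthread_cond_wait(&bdi.queued_cond, &bdi.lock);
	pthread_mutex_unlock(&bdi.lock);

	for (i = 0; i < 6; i++)
		pages_written_back(128);	/* 768 pages cover all tickets */
	for (i = 0; i < 3; i++)
		pthread_join(w[i], NULL);
	return 0;
}

The FIFO property here is that a ticket is only released after every
ticket queued ahead of it, which is how I read your "FIFO-like
throttle".
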
> This avoids the thundering herd problem and applies throttling in
> a deterministic, predictable fashion. And by relying on background
> writeback, we only have one writeback path to optimise, not two
> different paths that interchange unpredictably.
>
> In essence, this mechanism replaces the complex path of IO
> submission and congestion with a simple, deterministic counter and
> queue system that probably doesn't even require any memory
> allocation to implement. I think the simpler a throttle mechanism
> is, the more likely it is to work effectively....
>
> I know that words without code aren't going to convince anyone, but I
> hope I've given you some food for thought about alternatives to what
> we currently do. ;)

Not at all, it's clear enough - your idea is very similar to Jan's
proposal. And I'd like to present some ascii art for another scheme:

                            \                /     one bdi sync thread
                             \............../
=========================>    \............/       working on dirty pages
                                \........../
                                 \......../
                                  \....../
                                   \..../
                                    \../

 throttled      |  |    |  |    |  |    |  |    |  |      wakeup when enough
 ============>  |  |    |  |    |  |    |  |    |::|  ==>  pages are put to io
 task list      +--+    +--+    +--+    +--+    +--+       on behalf of this task
               task 5  task 4  task 3  task 2  task 1

 [.] dirty page        [:] writeback page

One benefit of this scheme is that, when necessary, the task list
could be converted to some kind of tree to support priorities, and
thus an IO controller for buffered writes :)

Thanks,
Fengguang
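
P.S. The task-list scheme in the ascii art above, reduced to a toy
single-threaded walk-through (made-up quotas and batch size, no real
IO or wakeups), just to show the bookkeeping it implies:

#include <stdio.h>

/*
 * Toy single-threaded walk-through of the task-list scheme above: one
 * flusher works through dirty pages and charges each batch it puts to
 * IO against the task at the head of the throttled-task list; a task
 * is "woken" (here: just printed) once its quota has been charged.
 * Sizes and names are made up; nothing really sleeps or does IO.
 */
struct throttled_task {
	const char *name;
	unsigned long quota;	/* pages to put to IO on its behalf */
};

int main(void)
{
	struct throttled_task list[] = {	/* FIFO: task 1 was queued first */
		{ "task 1", 96 }, { "task 2", 64 }, { "task 3", 128 },
		{ "task 4", 32 }, { "task 5", 64 },
	};
	int head = 0, ntasks = 5;
	const unsigned long batch = 16;		/* pages per submission batch */
	unsigned long charged = 0;

	while (head < ntasks) {
		/* The flusher picks the next batch from its own dirty queues. */
		charged += batch;
		printf("flusher put %lu pages to IO, charged to %s\n",
		       batch, list[head].name);

		/*
		 * Wake every head task whose quota is now covered; leftover
		 * pages carry over to the next task in line.
		 */
		while (head < ntasks && charged >= list[head].quota) {
			charged -= list[head].quota;
			printf("  -> wake up %s\n", list[head].name);
			head++;
		}
	}
	return 0;
}
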