Date: Sat, 26 Sep 2009 11:02:30 +0800
From: Wu Fengguang
To: Dave Chinner
Cc: Chris Mason, Andrew Morton, Peter Zijlstra, "Li, Shaohua",
    linux-kernel@vger.kernel.org, richard@rsk.demon.co.uk,
    jens.axboe@oracle.com, Jan Kara, linux-fsdevel@vger.kernel.org
Subject: Re: regression in page writeback
Message-ID: <20090926030230.GA7770@localhost>
In-Reply-To: <20090926014715.GD9464@discord.disaster>

On Sat, Sep 26, 2009 at 09:47:15AM +0800, Dave Chinner wrote:
> On Fri, Sep 25, 2009 at 11:19:20AM +0800, Wu Fengguang wrote:
> > On Fri, Sep 25, 2009 at 08:11:17AM +0800, Dave Chinner wrote:
> > > On Thu, Sep 24, 2009 at 11:15:08AM +0800, Wu Fengguang wrote:
> > > > On Wed, Sep 23, 2009 at 10:00:58PM +0800, Chris Mason wrote:
> > > > > The only place that actually honors the congestion flag is pdflush.
> > > > > It's trivial to get pdflush backed up and make it sit down without
> > > > > making any progress because once the queue congests, pdflush goes away.
> > > >
> > > > Right. I guess that's more or less intentional - to give lowest priority
> > > > to periodic/background writeback.
> > >
> > > IMO, this is the wrong design. Background writeback should
> > > have higher CPU/scheduler priority than normal tasks. If there are
> > > sufficient dirty pages in the system for background writeback to
> > > be active, it should be running *now* to start as much IO as it can
> > > without being held up by other, lower priority tasks.
> > >
> > > Cleaning pages is important to keeping the system running smoothly.
> > > Given that IO takes time to clean pages, it is therefore important
> > > to issue as much as possible as quickly as possible without delays
> > > before going back to sleep. Delaying issue of the IO or doing
> > > sub-optimal issue simply reduces performance of the system because
> > > it takes longer to clean the same number of dirty pages.
> > >
> > > > > Nothing stops other procs from keeping the queue congested forever.
> > > > > This can only be fixed by making everyone wait for congestion, at which
> > > > > point we might as well wait for requests.
> > > >
> > > > Yes. That gives everyone a somewhat equal opportunity; this is a policy
> > > > change that may lead to interesting effects, as well as presenting a
> > > > challenge to get_request_wait(). That said, I'm not against the change
> > > > to a wait queue in general.
> > >
> > > If you block all threads doing _writebehind caching_ (synchronous IO
> > > is self-throttling) to the same BDI on the same queue as the bdi
> > > flusher, then when congestion clears the higher priority background
> > > flusher thread should run first and issue more IO. This should
> > > happen as a natural side-effect of our scheduling algorithms and it
> > > gives preference to efficient background writeback over inefficient
> > > foreground writeback. Indeed, with this approach we can even avoid
> > > foreground writeback altogether...
> >
> > I don't see how balance_dirty_pages() writeout is less efficient than
> > pdflush writeout.
> >
> > They all call the same routines to do the job.
> > balance_dirty_pages() sets nr_to_write=1536 at least for ext4 and xfs
> > (unless memory is tight; btrfs is 1540), which is in fact 50% bigger
> > than the 1024 pages used by pdflush.
>
> Sure, but the problem now is that once you are above the
> bdi->dirty_exceeded threshold, foreground writeback tries to issue
> 1536 pages of IO for every 8 pages that are dirtied. That means you'll

I'd suggest increasing that "8 pages". It may have been a good value
when ratelimit_pages was statically set to 32. Now that ratelimit_pages
is dynamically set to a much larger value at boot time, we'd better use
ratelimit_pages/4 for the dirty_exceeded case.

> block just about every writing process in writeback at the same time
> and they will all be blocked in congestion trying to write different
> inodes....

Ah, got it: balance_dirty_pages() could lead to too much concurrency
and thus seeks!
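
By the way, the ratelimit change I'm suggesting above amounts to
something like the sketch below - plain user-space C rather than
kernel code, with the ratelimit_pages value picked arbitrarily for
illustration:

#include <stdio.h>

/*
 * Toy model of the dirtier-side ratelimit discussed above -- plain
 * user-space C, not kernel code.  In the real kernel ratelimit_pages
 * is scaled up at boot; 1024 here is just an assumed example value.
 */
static unsigned long ratelimit_pages = 1024;

/* How many pages a task may dirty between two throttle checks. */
static unsigned long task_ratelimit(int dirty_exceeded, int use_suggested)
{
	if (!dirty_exceeded)
		return ratelimit_pages;

	if (!use_suggested)
		return 8;		/* current: hard-coded 8 pages */

	return ratelimit_pages / 4;	/* suggested: scale with ratelimit_pages */
}

int main(void)
{
	printf("normal:                     %lu pages between checks\n",
	       task_ratelimit(0, 0));
	printf("dirty_exceeded (current):   %lu pages between checks\n",
	       task_ratelimit(1, 0));
	printf("dirty_exceeded (suggested): %lu pages between checks\n",
	       task_ratelimit(1, 1));
	return 0;
}

The point is only that the recheck interval scales with
ratelimit_pages instead of staying at a hard-coded 8.
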
> > And it won't back off on congestion.
>
> And that is, IMO, a major problem.

Not necessarily; the above concurrency problem occurs even if it backs
off on congestion. But anyway, we are switching to get_request_wait()
now.

> > The s_io/b_io queues are shared, so a balance_dirty_pages() will just
> > continue from where the last sync thread exited. So it does not make
> > much difference who initiates the IO. Did I miss something?
>
> The current implementation uses the request queue to do that
> blocking at IO submission time. This is based on the premise that if
> we write a certain number of pages, we're guaranteed to have waited
> long enough for that many pages to come clean.

Right.

> However, every other
> thread doing writes and being throttled does the same thing. This
> leads to N IO submitters from at least N different inodes at the
> same time. Which inode gets written when congestion clears is
> anyone's guess - it's a thundering herd, IIUC the congestion
> implementation correctly.

Good insight, thanks for pointing this out!

> The result is that we end up with N different sets of IO being
> issued with potentially zero locality to each other, resulting in
> much lower elevator sort/merge efficiency and hence we seek the disk
> all over the place to service the different sets of IO.

Right. But note that its negative effects are not that common given
the current parameters MAX_WRITEBACK_PAGES=1024, max_sectors_kb=512
and nr_requests=128, as we move on to the next inode anyway once 4MB
of it has been enqueued. So a request queue can hold up to
128/(4096/512) = 16 inodes, and the problem will only turn up when
there are >= 16 throttled processes. When we increase
MAX_WRITEBACK_PAGES to 128MB worth of pages, though, even _one_
foreground writeout will hurt.
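
To spell that arithmetic out (a stand-alone user-space calculation;
the numbers are the defaults mentioned in this thread, nothing is read
from a live system):

#include <stdio.h>

/*
 * The 128/(4096/512) = 16 arithmetic above, spelled out.  Plain
 * user-space C; the parameter values are the defaults quoted in this
 * thread, not read from a running kernel.
 */
int main(void)
{
	unsigned long nr_requests = 128;	/* block queue depth */
	unsigned long max_sectors_kb = 512;	/* max KB per request */
	unsigned long page_size_kb = 4;		/* 4KB pages */
	unsigned long chunks[] = { 1024, 32768 }; /* MAX_WRITEBACK_PAGES: 4MB and 128MB */
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long chunk_kb = chunks[i] * page_size_kb;
		unsigned long requests_per_inode = chunk_kb / max_sectors_kb;
		unsigned long inodes_in_queue = nr_requests / requests_per_inode;

		printf("MAX_WRITEBACK_PAGES=%lu (%lu MB): %lu requests/inode, "
		       "queue holds %lu inode(s)\n",
		       chunks[i], chunk_kb / 1024,
		       requests_per_inode, inodes_in_queue);
	}
	return 0;
}

With a 128MB chunk a single inode already needs 256 requests, twice
the whole queue, which is why even one foreground writeout hurts
there.
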
> OTOH, if there is only one submission thread, it doesn't jump
> between inodes in the same way when congestion clears - it keeps
> writing to the same inode, resulting in large related chunks of
> sequential IOs being issued to the disk. This is more efficient than
> the above foreground writeback because the elevator works better and
> the disk seeks less.
>
> As you can probably guess, I think foreground writeout is the wrong
> architecture because of the behaviour it induces under heavy
> multithreaded IO patterns. I agree that it works OK if we continue
> tweaking it to fix problems.

Agreed.

> However, my concern is that if it isn't constantly observed, tweaked
> and maintained, performance goes backwards as other code changes.
> i.e. there is a significant maintenance burden, and looking at the
> problems once every couple of years (last big maintenance rounds were
> 2.6.15/16, 2.6.23/24, now 2.6.31/32) isn't good enough to prevent
> performance from sliding backwards from release to release.

One problem is that the queues are very tightly coupled. Every change
of behavior leads to reevaluation of other factors.

> ----
>
> The rest of this is an idea I've been kicking around for a while
> which is derived from IO throttling work I've done during a
> past life. I haven't had time to research and prototype it to see
> if it performs any better under really heavy load, but I'm going to
> throw it out anyway so that everyone can see a little bit more about
> what I'm thinking.
>
> My fundamental premise is that throttling does not require IO to be
> issued from the thread to be throttled. The essence of write
> throttling is waiting for more pages to be cleaned in a period of
> time than have been dirtied, i.e. what we are really waiting on is
> pages to be cleaned.

Yes, Peter's __bdi_writeout_inc() is a good watch point.

> Based on this observation, if we have a flusher thread working in
> the background, we don't need to submit more IO when we need to
> throttle, as all we need to do is wait for a certain number of pages
> to transition to the clean state.

It's great that you and Mason (and others) are advocating the same
idea :) Jan and I proposed some possible solutions recently:
http://lkml.org/lkml/2009/9/14/126

> If we take a leaf from XFS's book by doing work at IO completion
> rather than submission, we can keep a count of the number of pages
> cleaned on the bdi. This can be used to implement a FIFO-like
> throttle. If we add a simple ticket system to the bdi, when a
> process needs to be throttled it can submit a ticket with the number
> of pages it has dirtied to the bdi, and the bdi can then decide what
> the page cleaned count needs to reach before the process can be
> woken.

Exactly.

> i.e. Take the following ascii art showing the bdi flusher
> thread running and issuing IO in the background:
>
> bdi write thread: +---X------Y---+-A-----ZB----+------C--------+
>        1st proc:      o............o
>        2nd proc:             o............o
>        3rd proc:                         o............o
>
> When the 1st process comes in to be throttled, it samples the page
> clean count and gets X. It submits a ticket to be woken at A (X +
> some number of pages). If the flusher thread is not running, it gets
> kicked. Processes 2 and 3 do the same at Y and Z, to be woken at B
> and C. At IO completion, the number of pages cleaned is counted and
> the tickets that are now under the clean count are pulled from the
> queue and the processes that own them are woken.
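
Just to make sure I read the mechanism right, here is a toy user-space
model of it - pthreads, made-up names and numbers, purely illustrative
and not meant as kernel code:

#include <pthread.h>
#include <stdio.h>

/*
 * Toy user-space model of the ticket throttle described above: a
 * throttled writer never submits IO itself, it queues a ticket saying
 * "wake me when this bdi has cleaned N more pages", and the (simulated)
 * IO completion path releases every ticket the cleaned-page counter has
 * passed.  All names and numbers are invented; this is not kernel code.
 */
struct ticket {
	unsigned long wake_at;		/* pages_cleaned value to wake at */
	pthread_cond_t cond;
	struct ticket *next;
};

static struct {
	pthread_mutex_t lock;
	pthread_cond_t queued_cond;	/* demo only: main waits for all tickets */
	int queued;
	unsigned long pages_cleaned;	/* bumped at "IO completion" */
	struct ticket *head, *tail;	/* FIFO of waiting writers */
} bdi = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.queued_cond = PTHREAD_COND_INITIALIZER,
};

/* Called by a writer instead of doing foreground writeback. */
static void throttle(unsigned long nr_dirtied)
{
	struct ticket t = { .next = NULL };

	pthread_cond_init(&t.cond, NULL);
	pthread_mutex_lock(&bdi.lock);
	t.wake_at = bdi.pages_cleaned + nr_dirtied;
	if (bdi.tail)
		bdi.tail->next = &t;
	else
		bdi.head = &t;
	bdi.tail = &t;
	bdi.queued++;
	pthread_cond_signal(&bdi.queued_cond);
	while (bdi.pages_cleaned < t.wake_at)
		pthread_cond_wait(&t.cond, &bdi.lock);
	pthread_mutex_unlock(&bdi.lock);
	pthread_cond_destroy(&t.cond);
}

/* Called from the flusher's (simulated) IO completion path. */
static void pages_written_back(unsigned long nr)
{
	pthread_mutex_lock(&bdi.lock);
	bdi.pages_cleaned += nr;
	/* FIFO: a ticket is only checked once everything ahead of it is released. */
	while (bdi.head && bdi.head->wake_at <= bdi.pages_cleaned) {
		struct ticket *t = bdi.head;

		bdi.head = t->next;
		if (!bdi.head)
			bdi.tail = NULL;
		pthread_cond_signal(&t->cond);
	}
	pthread_mutex_unlock(&bdi.lock);
}

static void *writer(void *arg)
{
	long id = (long)arg;

	throttle(100 * (id + 1));	/* pretend we dirtied some pages */
	printf("writer %ld woken\n", id);
	return NULL;
}

int main(void)
{
	pthread_t w[3];
	long i;

	for (i = 0; i < 3; i++)
		pthread_create(&w[i], NULL, writer, (void *)i);

	pthread_mutex_lock(&bdi.lock);
	while (bdi.queued < 3)		/* wait until every ticket is queued */
		pthread_cond_wait(&bdi.queued_cond, &bdi.lock);
	pthread_mutex_unlock(&bdi.lock);

	for (i = 0; i < 6; i++)
		pages_written_back(128);	/* 768 pages cover all tickets */
	for (i = 0; i < 3; i++)
		pthread_join(w[i], NULL);
	return 0;
}

The FIFO property here is that a ticket is only released after every
ticket queued ahead of it, which is how I read your "FIFO-like
throttle".
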
> This avoids the thundering herd problem and applies throttling in
> a deterministic, predictable fashion. And by relying on background
> writeback, we only have one writeback path to optimise, not two
> different paths that interchange unpredictably.
>
> In essence, this mechanism replaces the complex path of IO
> submission and congestion with a simple, deterministic counter and
> queue system that probably doesn't even require any memory
> allocation to implement. I think the simpler a throttle mechanism
> is, the more likely it is to work effectively....
>
> I know that words without code aren't going to convince anyone, but I
> hope I've given you some food for thought about alternatives to what
> we currently do. ;)

Not at all, it's clear enough - your idea is very similar to Jan's
proposal. And I'd like to present some ascii art for another scheme:

                            \                /     one bdi sync thread
                             \............../
=========================>    \............/       working on dirty pages
                                \........../
                                 \......../
                                  \....../
                                   \..../
                                    \../

 throttled      |  |    |  |    |  |    |  |    |  |      wakeup when enough
 ============>  |  |    |  |    |  |    |  |    |::|  ==>  pages are put to io
 task list      +--+    +--+    +--+    +--+    +--+       on behalf of this task
               task 5  task 4  task 3  task 2  task 1

 [.] dirty page        [:] writeback page

One benefit of this scheme is that, when necessary, the task list
could be converted to some kind of tree to support priorities, and
thus an IO controller for buffered writes :)

Thanks,
Fengguang
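
P.S. The task-list scheme in the ascii art above, reduced to a toy
single-threaded walk-through (made-up quotas and batch size, no real
IO or wakeups), just to show the bookkeeping it implies:

#include <stdio.h>

/*
 * Toy single-threaded walk-through of the task-list scheme above: one
 * flusher works through dirty pages and charges each batch it puts to
 * IO against the task at the head of the throttled-task list; a task
 * is "woken" (here: just printed) once its quota has been charged.
 * Sizes and names are made up; nothing really sleeps or does IO.
 */
struct throttled_task {
	const char *name;
	unsigned long quota;	/* pages to put to IO on its behalf */
};

int main(void)
{
	struct throttled_task list[] = {	/* FIFO: task 1 was queued first */
		{ "task 1", 96 }, { "task 2", 64 }, { "task 3", 128 },
		{ "task 4", 32 }, { "task 5", 64 },
	};
	int head = 0, ntasks = 5;
	const unsigned long batch = 16;		/* pages per submission batch */
	unsigned long charged = 0;

	while (head < ntasks) {
		/* The flusher picks the next batch from its own dirty queues. */
		charged += batch;
		printf("flusher put %lu pages to IO, charged to %s\n",
		       batch, list[head].name);

		/*
		 * Wake every head task whose quota is now covered; leftover
		 * pages carry over to the next task in line.
		 */
		while (head < ntasks && charged >= list[head].quota) {
			charged -= list[head].quota;
			printf("  -> wake up %s\n", list[head].name);
			head++;
		}
	}
	return 0;
}
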