Date: Mon, 21 Sep 2009 23:12:42 +0800
From: Wu Fengguang
To: Jan Kara
Cc: Theodore Tso, Jens Axboe, Christoph Hellwig,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    chris.mason@oracle.com, akpm@linux-foundation.org
Subject: Re: [PATCH 0/7] Per-bdi writeback flusher threads v20
Message-ID: <20090921151242.GA6333@localhost>
In-Reply-To: <20090921124251.GD1099@duck.suse.cz>

On Mon, Sep 21, 2009 at 08:42:51PM +0800, Jan Kara wrote:
> On Mon 21-09-09 11:04:02, Wu Fengguang wrote:
> > On Mon, Sep 21, 2009 at 03:00:06AM +0800, Jan Kara wrote:
> > > On Sat 19-09-09 23:03:51, Wu Fengguang wrote:
> ...
> > > Fengguang, could we maybe write down how the logic should look,
> > > and then look at the code and modify it as needed to fit the logic?
> > > Because I couldn't find a compact description of the logic anywhere
> > > in the code.
> >
> > Good idea. It makes sense to write something down in Documentation/
> > or embed it as code comments.
> Yes, that would be useful. I'd probably vote for comments in the code.

OK.

> > > Here is how I'd imagine the writeout logic should work:
> > > We would have just two lists - b_dirty and b_more_io. Both would be
> > > ordered by dirtied_when.
> >
> > Andrew has a very good description of the dirty/io/more_io queues:
> >
> > http://lkml.org/lkml/2006/2/7/5
> >
> > | So the protocol would be:
> > |
> > | s_io: contains expired and non-expired dirty inodes, with expired
> > | ones at the head. Unexpired ones (at least) are in time order.
> > |
> > | s_more_io: contains dirty expired inodes which haven't been fully
> > | written. Ordering doesn't matter (unless someone goes and changes
> > | dirty_expire_centisecs - but as long as we don't do anything really
> > | bad in response to this we'll be OK).
> > |
> > | s_dirty: contains expired and non-expired dirty inodes. The
> > | non-expired ones are in time-of-dirtying order.
> >
> > Since then, s_io was changed to hold only _expired_ dirty inodes at
> > the beginning of a full scan. It serves as a bounded set of dirty
> > inodes, so that once a full scan of it has finished, writeback can go
> > on to the next superblock, and old dirty files' writeback won't be
> > delayed indefinitely by newly dirtied files pouring in.
> >
> > It seems that the boundary could also be provided by some
> > older_than_this timestamp. So removal of b_io is possible, at least
> > for this purpose.
> >
> > > A thread doing WB_SYNC_ALL writeback will just walk the list and
> > > clean up everything (we should be resistant against livelocks
> > > because we stop at an inode which has been dirtied after the sync
> > > has started).
> > Yes, that would mean
> >
> > - older_than_this=now      for WB_SYNC_ALL
> > - older_than_this=now-30s  for WB_SYNC_NONE
> Exactly.

> > > A thread doing WB_SYNC_NONE writeback will start walking the list.
> > > If the inode has I_SYNC set, it puts it on b_more_io. Otherwise it
> > > takes I_SYNC and writes as much as it finds necessary from the
> > > first inode. If it stopped before it wrote everything, it puts the
> > > inode at the end of b_more_io.
> >
> > Agreed. The current code is doing that, and it should be reasonably
> > easy to reuse the code path for both WB_SYNC_NONE and WB_SYNC_ALL.
> I'm not sure we do exactly that. The I_SYNC part is fine. But looking
> at the code in writeback_single_inode(), we put the inode at b_more_io
> only if wbc->for_kupdate is true and wbc->nr_to_write is <= 0.
> Otherwise we put the inode at the tail of the dirty list.

Ah yes. I have actually posted a patch to unify the !for_kupdate and
for_kupdate cases:

        http://patchwork.kernel.org/patch/46399/

For the (wbc->nr_to_write > 0) case, we have to delay the inode for some
time because it somehow cannot be written for now, hence moving it back
to b_dirty. Otherwise we could busy loop.

> > > If it wrote everything (writeback_index cycled or scanned the whole
> > > range) but the inode is dirty, it puts the inode at the end of
> > > b_dirty and resets dirtied_when to the current time. Then it
> > > continues with the next inode.
> >
> > Agreed. I think it makes sense to reset dirtied_when (thus delay 30s)
> > if an inode still has dirty pages when we have finished a full scan
> > of it, in order to
> > - prevent pointless writeback IO of overwritten pages
> > - somehow throttle IO for busy inodes
> OK, but currently the logic is subtly different. It does:
> If the inode wasn't redirtied during writeback and still has dirty
> pages, queue it somewhere (requeue_io or redirty_tail depending on
> other things). If the inode was redirtied, do redirty_tail.

Yup.
> Probably, the current logic is safer in the sense that kupdate-style
> writeback cannot take forever when an inode is permanently redirtied.
> In my proposed logic, kupdate writeback would run forever (which makes
> some sense as well but probably isn't really convenient).

Yes, the current code is safer. Running kupdate forever on an inode that
is being busily overwritten is obviously undesirable behavior.

> Also if we skip some pages (call redirty_page_for_writepage()), the
> inode will get redirtied as well and hence we'll put the inode at the
> back of the dirty list and thus delay further writeback by 30s. Again,
> this makes some sense (it prevents busylooping while waiting for a page
> to get prepared for a proper writeback) although I'm not sure it's
> always desirable. For now we should probably just document this
> somewhere.

Agreed. Again, the current code is safe, but may be delaying too much.
I have a patch that adds another queue, b_more_io_wait, which delays the
inode for a shorter 5s (or whatever). We could try that if 30s is
reported to be unacceptable in some real workloads.

> > > kupdate style writeback stops scanning the dirty list when
> > > dirtied_when is new enough. Then, if b_more_io is nonempty, it
> > > splices it into the beginning of the dirty list and restarts.
> >
> > Right.
> But currently we don't do the splicing. We just set more_io and return
> from writeback_inodes_wb(). Should that be changed?

Yes, in fact I changed that in the b_io removal patch, to do the splice
and retry. Returning used to be correct and required behavior, to give
other superblocks a chance. Now with per-bdi writeback we don't have to
worry about that, so it's safe to just splice and restart.

> > > Other types of writeback splice b_more_io to b_dirty when b_dirty
> > > gets empty. pdflush style writeback writes until we drop below the
> > > background dirty limit. Other kinds of writeback (throttled
> > > threads, writeback submitted by the filesystem itself) write while
> > > nr_to_write > 0.
> > I'd propose to always check older_than_this. For non-kupdate sync,
> > it still makes sense to give some priority to expired inodes
> > (generally it's suboptimal to sync those dirtied-just-now inodes).
> > That is, to sync expired inodes first, if there are any.
> Well, the expired inodes are handled with priority because they are at
> the beginning of the list. So we write them first, and only if writing
> them was not enough do we proceed with inodes that were dirtied later.
> You are

The list order is not enough for large files :) Consider one newly
dirtied file and one 100MB expired dirty file. The current code will
sync only 4MB of the expired file, go on to sync the newly dirtied file,
and _never_ return to serve the 100MB file as long as new inodes keep
being dirtied, which is not optimal.

> right that we can get to later-dirtied inodes even if there is still
> dirty data in the old ones, because we just refuse to write too much
> from a single inode. So maybe it would be good to splice b_more_io to
> b_dirty already when we get to an unexpired inode in the b_dirty list.
> The good thing is that it won't livelock on a few expired inodes, even
> in the case where new data is written to one of them while we work on
> the others - the other inodes on the s_dirty list will eventually
> expire, and from that moment on we include them in a fair pdflush
> writeback.

Right. I modified wb_writeback() to first use

        wbc.older_than_this = jiffies -
                        msecs_to_jiffies(dirty_expire_interval * 10);

unconditionally, and then, if no more writeback is possible, relax it
for !kupdate:

        wbc.older_than_this = jiffies;

> > > If we didn't write anything during the b_dirty scan, we wait until
> > > I_SYNC of the first inode on b_more_io gets cleared before
> > > starting the next scan. Does this look reasonably complete and
> > > cover all the cases?
> >
> > What about the congested case?
> With per-bdi threads, we just have to make sure we don't busyloop when
> the device is congested.
> Just blocking is perfectly fine since the thread has nothing to do
> anyway.

Right.

> The question is how normal processes that are forced to do writeback,
> or that do writeback for page allocation, should behave. There it
> probably makes sense to bail out from the writeback and let the caller
> decide. That seems to be implemented by the current code just fine,
> but you are right, I forgot about it.

No, the current code is not fine for the pageout and migrate paths,
which set nonblocking=1 and could return on congestion and then busy
loop. (This is being discussed in another thread with Chris Mason.)

> Probably, we should just splice b_more_io to the b_dirty list before
> bailing out because of congestion...

I'd vote for putting the inode back at the tail of b_dirty, so that it
will be served once congestion stops: it's not the inode's fault :)

Thanks,
Fengguang