Date: Tue, 27 Jul 2010 23:21:47 +0800
From: Wu Fengguang
To: Mel Gorman
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
 Rik van Riel, Johannes Weiner, Christoph Hellwig,
 KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli
Subject: Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
Message-ID: <20100727152147.GB5184@localhost>
References: <1279545090-19169-1-git-send-email-mel@csn.ul.ie>
 <1279545090-19169-9-git-send-email-mel@csn.ul.ie>
 <20100726072832.GB13076@localhost>
 <20100726092616.GG5300@csn.ul.ie>
 <20100726112709.GB6284@localhost>
 <20100726125717.GS5300@csn.ul.ie>
 <20100726131008.GE11947@localhost>
 <20100727133513.GZ5300@csn.ul.ie>
 <20100727142412.GA4771@localhost>
 <20100727143805.GB5300@csn.ul.ie>
In-Reply-To: <20100727143805.GB5300@csn.ul.ie>

On Tue, Jul 27, 2010 at 10:38:05PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > > >  	}
> > > > > > > > > 
> > > > > > > > > +	/*
> > > > > > > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > > > > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > > > > > > +	 * the dirty_ratio may be satisfied. In this case, wake
> > > > > > > > > +	 * flusher threads to pro-actively clean some pages
> > > > > > > > > +	 */
> > > > > > > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > > > 
> > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are driving the
> > > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > > 
> > > > > > > True, this has been fixed to only wake up flusher threads when this is
> > > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > > sc->may_writepage.
> > > > > > 
> > > > > > OK.
> > > > > > 
> > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, which is
> > > > > > > > good for efficiency. And it seems good to let the flusher write
> > > > > > > > much more than nr_dirty pages, to safeguard a reasonably large
> > > > > > > > vmscan-head-to-first-dirty-LRU-page margin.
> > > > > > > > So it would be enough to update the comments.
> > > > > > > 
> > > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > > use MAX_WRITEBACK_PAGES.
> > > > > > 
> > > > > > It would be better to pass something like (nr_dirty * N).
> > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > > obviously too large as a parameter. When the batch size is increased
> > > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > > nr_pages limit too much.
> > > > > 
> > > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > > arbitrary.
> > > > 
> > > > I'd recommend N to be a large value. It's no longer relevant now since
> > > > we'll call the flusher to sync some range containing the target page.
> > > > The flusher will then choose an N large enough (e.g. 4MB) for efficient
> > > > IO. It needs to be a large value, otherwise the vmscan code will
> > > > quickly run into dirty pages again.
> > > 
> > > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > > not cause congestion".
> > 
> > Under memory pressure, the disk should be busy/congested anyway.
> 
> Not necessarily. It could be streaming reads where pages are being added
> to the LRU quickly but not necessarily dominated by dirty pages. Due to the
> scanning rate, a dirty page may be encountered but it could be rare.

Right.

> > The big 4MB adds much work, but many of those pages may need to be
> > synced in the near future anyway. It also requires more time to do
> > the bigger IO, hence adding some latency, but the latency should be
> > a small factor compared to the IO queue time (which will be long for
> > a busy disk).
> > 
> > The overall expectation is: the more efficient the IO, the more progress :)
> 
> Ok.
> 
> > > The end result is
> > > 
> > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > {
> > > 	return laptop_mode ? 0 :
> > > 		min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > }
> > > 
> > > nr_writeback_pages(nr_dirty) is what gets passed to
> > > wakeup_flusher_threads(). Does that seem sensible?
> > 
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, e.g.
> > 
> > 	laptop_mode ? 0 : (nr_dirty * 16)
> 
> I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because
> while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be due to
> lumpy reclaim. I wanted to firmly bound how much writeback was being
> requested - hence the mild complexity.

OK.

> > On top of this, we may write another patch to convert the
> > wakeup_flusher_threads(bdi, nr_pages) call to some
> > bdi_start_inode_writeback(inode, offset) call, to start more targeted
> > writeback.
> 
> I did a first pass at optimising based on prioritising inodes related to
> dirty pages. It's incredibly primitive and I have to sit down and see
> how the whole of writeback is put together to improve on it. Maybe
> you'll spot something simple or see if it's the totally wrong direction.
> Patch is below.
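Before getting to the patch: to double check what that nr_writeback_pages()
clamp asks for, here is a stand-alone user-space sketch of the same
arithmetic. The 4k page size and SWAP_CLUSTER_MAX=32 are illustrative
assumptions, not values taken from your tree:

	/*
	 * Stand-alone demo of the nr_writeback_pages() clamp quoted above.
	 * Assumes PAGE_SHIFT=12 and SWAP_CLUSTER_MAX=32 for illustration.
	 */
	#include <stdio.h>

	#define PAGE_SHIFT		12
	#define SWAP_CLUSTER_MAX	32UL
	#define MAX_WRITEBACK		(4194304UL >> PAGE_SHIFT)	/* 4MB = 1024 pages */
	#define WRITEBACK_FACTOR	(MAX_WRITEBACK / SWAP_CLUSTER_MAX)

	static unsigned long nr_writeback_pages(unsigned long nr_dirty)
	{
		unsigned long want = nr_dirty * WRITEBACK_FACTOR;

		return want < MAX_WRITEBACK ? want : MAX_WRITEBACK;
	}

	int main(void)
	{
		unsigned long nr;

		/* nr_dirty values one shrink_page_list() pass might see */
		for (nr = 0; nr <= 64; nr += 8)
			printf("nr_dirty=%2lu -> ask flusher for %4lu pages\n",
			       nr, nr_writeback_pages(nr));
		return 0;
	}

So the request scales linearly up to nr_dirty == SWAP_CLUSTER_MAX and stays
firmly capped at 4MB worth of pages for the larger nr_dirty that lumpy
reclaim can produce, which matches your intent.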
The simplest style may be

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= MAX_WRITEBACK_PAGES,
	};

	mapping->writeback_index = offset;
	return do_writepages(mapping, &wbc);

But surely there will be many details to handle.

> > When talking about the 4MB optimization, I was referring to the internal
> > implementation of bdi_start_inode_writeback(). Sorry for the missing
> > context in the previous email.
> 
> No worries, I was assuming it was something in mainline I didn't know
> yet :)
> 
> > It may need a big patch to implement bdi_start_inode_writeback().
> > Would you like to try it, or leave the task to me?
> 
> If you send me a patch, I can try it out but it's not my highest
> priority right now. I'm still looking to get writeback-from-reclaim down
> to a reasonable level without causing a large amount of churn.

OK. That's already great work.

> Here is the first pass anyway at kicking wakeup_flusher_threads() for
> inodes belonging to a list of pages. You'll note that I do nothing with
> the page offset because I didn't spot a simple way of taking that
> information into account. It's also horrible from a locking perspective.
> So far, its testing has been "it didn't crash".

It seems a neat way to prioritize inodes with the new I_DIRTY_RECLAIM
flag. However, it may require a vastly different implementation once the
page offset is taken into account. I'll try to work up a prototype
tomorrow.

Thanks,
Fengguang

> ==== CUT HERE ====
> writeback: Prioritise dirty inodes encountered by reclaim for background flushing
> 
> It is preferable that as few dirty pages as possible are dispatched for
> cleaning from the page reclaim path. When dirty pages are encountered by
> page reclaim, this patch marks their inodes so that they are dispatched
> immediately. When the background flusher runs, it moves such inodes to
> the dispatch queue immediately, regardless of inode age.
> 
> This is an early prototype. It could be optimised to not take the inode
> lock repeatedly, and ideally the page offset would also be taken into
> account.
> 
> Signed-off-by: Mel Gorman
> ---
>  fs/fs-writeback.c         |   52 ++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/fs.h        |    5 ++-
>  include/linux/writeback.h |    1 +
>  mm/vmscan.c               |    6 +++-
>  4 files changed, 59 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5a3c764..27a8b75 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
> -	struct inode *inode;
> +	struct inode *inode, *tinode;
>  	int do_sb_sort = 0;
>  
>  	if (wbc->for_kupdate || wbc->for_background) {
> @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>  		older_than_this = jiffies - expire_interval;
>  	}
>  
> +	/* Move inodes reclaim found at end of LRU to dispatch queue */
> +	list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
> +		if (inode->i_state & I_DIRTY_RECLAIM) {
> +			inode->i_state &= ~I_DIRTY_RECLAIM;
> +			list_move(&inode->i_list, &tmp);
> +		}
> +	}
> +
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
>  		if (expire_interval &&
> @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
>  	rcu_read_unlock();
>  }
>  
> +/*
> + * Similar to wakeup_flusher_threads except prioritise inodes contained
> + * in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> +	struct page *page;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +
> +	list_for_each_entry(page, page_list, lru) {
> +		if (!PageDirty(page))
> +			continue;
> +
> +		lock_page(page);
> +		mapping = page_mapping(page);
> +		if (!mapping || mapping == &swapper_space)
> +			goto unlock;
> +
> +		/*
> +		 * Test outside the lock to see if it is already set; taking
> +		 * the inode lock is a waste and the inode should be pinned by
> +		 * the lock_page
> +		 */
> +		inode = mapping->host;
> +		if (inode->i_state & I_DIRTY_RECLAIM)
> +			goto unlock;
> +
> +		/*
> +		 * XXX: Yuck, has to be a way of batching this by not requiring
> +		 * the page lock to pin the inode
> +		 */
> +		spin_lock(&inode_lock);
> +		inode->i_state |= I_DIRTY_RECLAIM;
> +		spin_unlock(&inode_lock);
> +unlock:
> +		unlock_page(page);
> +	}
> +
> +	wakeup_flusher_threads(nr_pages);
> +}
> +
>  static noinline void block_dump___mark_inode_dirty(struct inode *inode)
>  {
>  	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e29f0ed..8836698 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1585,8 +1585,8 @@ struct super_operations {
>  /*
>   * Inode state bits.  Protected by inode_lock.
>   *
> - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
> - * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
> + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
> + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
>   *
>   * Four bits define the lifetime of an inode.  Initially, inodes are I_NEW,
>   * until that flag is cleared.  I_WILL_FREE, I_FREEING and I_CLEAR are set at
> @@ -1633,6 +1633,7 @@ struct super_operations {
>  #define I_DIRTY_SYNC		1
>  #define I_DIRTY_DATASYNC	2
>  #define I_DIRTY_PAGES		4
> +#define I_DIRTY_RECLAIM		256
>  #define __I_NEW			3
>  #define I_NEW			(1 << __I_NEW)
>  #define I_WILL_FREE		16
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 494edd6..73a4df2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
>  			struct writeback_control *wbc);
>  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>  void wakeup_flusher_threads(long nr_pages);
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b66d1f5..bad1abf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,7 +901,8 @@ keep:
>  	 * laptop mode avoiding disk spin-ups
>  	 */
>  	if (file && nr_dirty_seen && sc->may_writepage)
> -		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> +		wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
> +					     page_list);
>  
>  	*nr_still_dirty = nr_dirty;
>  	count_vm_events(PGACTIVATE, pgactivate);
> @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  			list_add(&page->lru, &putback_list);
>  		}
>  
> -		wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> +		wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
> +					     &page_list);
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
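P.S. To make the bdi_start_inode_writeback() idea a little more concrete,
here is a minimal, untested sketch in the do_writepages() style from
earlier in this mail. The function name, the lifetime assumptions and the
hardcoded 4MB window are all hypothetical, not code from any existing tree:

	#include <linux/fs.h>
	#include <linux/writeback.h>

	/*
	 * Hypothetical sketch, not a real API: kick off WB_SYNC_NONE
	 * writeback of about 4MB of @inode's dirty pages, starting near
	 * @offset (the page that reclaim stumbled on).  Assumes the
	 * caller holds a reference that keeps @inode alive.
	 */
	static int bdi_start_inode_writeback(struct inode *inode, pgoff_t offset)
	{
		struct address_space *mapping = inode->i_mapping;
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= 4194304UL >> PAGE_SHIFT,	/* ~4MB */
			.range_cyclic	= 1,	/* honour ->writeback_index */
		};

		/*
		 * write_cache_pages() only consults ->writeback_index in
		 * range_cyclic mode; point it at the offending page so the
		 * flusher cleans a window ahead of the LRU scanning head.
		 */
		mapping->writeback_index = offset;
		return do_writepages(mapping, &wbc);
	}

Of course the real thing should hand the work to the flusher thread rather
than doing the IO in the caller's context, but it shows the intended
4MB-around-the-target behaviour.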