Date: Mon, 6 Feb 2006 16:48:15 +1100
From: David Chinner <dgc@sgi.com>
To: Andrew Morton <akpm@osdl.org>
Cc: David Chinner <dgc@sgi.com>, linux-kernel@vger.kernel.org,
       linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] Prevent large file writeback starvation
Message-ID: <20060206054815.GJ43335175@melbourne.sgi.com>
References: <20060206040027.GI43335175@melbourne.sgi.com> <20060205202733.48a02dbe.akpm@osdl.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20060205202733.48a02dbe.akpm@osdl.org>
User-Agent: Mutt/1.4.2.1i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4628
Lines: 113

On Sun, Feb 05, 2006 at 08:27:33PM -0800, Andrew Morton wrote:
> David Chinner <dgc@sgi.com> wrote:
> > The problem is that when you are creating thousands of files per second
> > with some data in them, the superblock I/O list blows out to approximately
> > (create rate * expiry age) inodes, and any one inode in this list will
> > get a maximum of 1024 pages written back per iteration on the list.
> 
> That code does so many different things it ain't funny.  This is why when
> one thing gets changed, something else gets broken.
> 
> The intention here is that once an inode has "expired" (dirtied_when is
> more than dirty_expire_centisecs ago), the inode will get fully synced.

Sure. And it works just fine when you're not creating lots of small
files at the same time because we iterate across s_io and s_dirty
very quickly.

However, you can't fully sync anything from pdflush with the
writeback parameters it uses without multiple passes through
this code.

> >From a quick peek, this code:
> 
> 			if (wbc->for_kupdate) {
> 				/*
> 				 * For the kupdate function we leave the inode
> 				 * at the head of sb_dirty so it will get more
> 				 * writeout as soon as the queue becomes
> 				 * uncongested.
> 				 */
> 				inode->i_state |= I_DIRTY_PAGES;
> 				list_move_tail(&inode->i_list, &sb->s_dirty);
> 
> 
> isn't working right any more.

If the intent is to continue writing it back until fully
sync'd, then shouldn't we be moving that to the tail of I/O list so
we don't have to iterate over the dirty list again before we try to
write another chunk out?

FWIW, we've never seen this problem before with XFS because prior to
2.6.15 XFS ignored wbc and block device congestion and just wrote as
much as it could cluster into a single extent in a single
do_writepages() call.  hence it would have written the 8GB in one
hit, hence we never saw this problem.

We made XFS behave nicely because it solved several problems
including preventing pdflush from sleeping on full block device
queues during writeback...

> > Looking at this comment in __sync_single_inode():
> > 
> >     196             if (wbc->for_kupdate) {
> >     197                 /*              
> >     198                  * For the kupdate function we leave the inode
> >     199                  * at the head of sb_dirty so it will get more
> >     200                  * writeout as soon as the queue becomes
> >     201                  * uncongested.
> >     202                  */
> >     203                 inode->i_state |= I_DIRTY_PAGES;
> >     204                 list_move_tail(&inode->i_list, &sb->s_dirty);
> >     205             } else {    
> > 
> > It appears that it is intended to handle congested devices. The thing
> > is, 1024 pages on writeback is not enough to congest a single disk,
> > let alone a RAID box 10 or 100 times faster than a single disk.
> > Hence we're stopping writeback long before we congest the device.
> 
> I think the comment is misleading.  The writeout pass can terminate because
> wbc->nr_to_write was satisfied, as well as for queue congestion.

Exactly my point and what the patch addresses - it allows writeback on
that inode to continue from where it left off if the device was not
congested.

> I suspect what's happened here is that someone other than pdflush has tried
> to do some writeback and didn't set for_kupdate, so we ended up resetting
> dirtied_when.

If it's not wb_kupdate that is trying to write it back, and we have little
memory pressure, and we completed writing the file long ago, then what behaves
exactly like wb_kupdate for hours on end apart from wb_kupdate?

> > Therefore, lets only move the inode back onto the dirty list if the device
> > really is congested. Patch against 2.6.15-rc2 below.
> 
> This'll break something else, I bet :(

Wonderful. What needs testing to indicate something else hasn't broken?
Does anyone have any regression tests for this code?

> I'll take a look.   Another approach would be to look at nr_to_write. ie:
> 
> 	if (wbc->for_kupdate || wmb->nr_to_write <= 0)

I just tested this and it doesn't change the default behaviour.
After writing the 4GB file ~5 minutes ago, I've seen ~10k pages go to
disk, and I still have another 140k to go. IOWs, exactly the same 
behaviour as the current code.

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Enginner
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/