2007-08-27 11:33:00

by Wu Fengguang

Subject: [PATCH 0/3] [RFC][PATCH] clustered writeback

Chris,

This is one possible implementation of the clustered writeback idea.
It runs OK on ext3 (compiling, syncing, etc.).

The patch is based on 2.6.23-rc3-mm1 and the writeback patches here:
http://lkml.org/lkml/2007/8/19/10

By default, with many dirty inodes, it works as follows:
- store dirty inodes in a radix tree, indexed by their inode numbers
- sweep the whole inode number space every 25s, split into 5 passes
- each pass walks only 1/5 of the inode number space
- pull all inodes with a dirty age larger than 5s into the I/O dispatch queue
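
As a rough illustration of the steps above (not code from the patch), the sweep can be modelled in a few lines of Python. A plain dict stands in for the radix tree, and the names `pass_range`, `expired_in_pass` and `ino_space` are hypothetical helpers invented for this sketch; only the 25s / 5 passes / 5s figures come from the mail.

```python
SWEEP_SECONDS = 25   # time to cover the whole inode number space (from the mail)
PASSES = 5           # the sweep is split into 5 passes (from the mail)
EXPIRE_SECONDS = 5   # dirty-age threshold (from the mail)


def pass_range(pass_index, ino_space):
    """Return the half-open inode-number range [lo, hi) walked by one pass."""
    width = -(-ino_space // PASSES)  # ceiling division so the passes cover everything
    lo = pass_index * width
    return lo, min(lo + width, ino_space)


def expired_in_pass(dirtied_when, pass_index, now, ino_space):
    """Pick inodes in this pass's range whose dirty age exceeds the threshold.

    dirtied_when maps inode number -> time the inode was dirtied; a plain
    dict stands in for the radix tree here (an assumption of this sketch).
    """
    lo, hi = pass_range(pass_index, ino_space)
    return sorted(
        ino
        for ino, when in dirtied_when.items()
        if lo <= ino < hi and now - when > EXPIRE_SECONDS
    )


# One pass would run every SWEEP_SECONDS / PASSES = 5 seconds; for brevity
# all five passes are evaluated at the same instant here.
dirty = {3: 0.0, 42: 1.0, 57: 9.0, 98: 2.0}
queue = [expired_in_pass(dirty, p, now=10.0, ino_space=100) for p in range(PASSES)]
print(queue)  # [[3], [], [42], [], [98]]
```

Inode 57 is skipped because it was dirtied only 1s before the pass, so it waits for the next sweep of its range.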

Because it does the work in small batches of 10 inodes, when the system has
<=10 dirty inodes, its behavior will reduce to:
- do a full sweep *at once* every 25s
Which means the disk will flicker once every 25s, not bad :)


The implications for the majority of users could be:
- medium-to-heavy writes become less seeky
- dirty inodes get synced earlier (before: 30s; now: 5-30s)
- less panic over the 'atime' mount option (future work)

Fengguang
--


2007-08-27 12:04:18

by Arjan van de Ven

Subject: Re: [PATCH 0/3] [RFC][PATCH] clustered writeback

On Mon, 27 Aug 2007 19:21:52 +0800
>
> Because it does the work in small batches of 10 inodes, when the
> system has <=10 dirty inodes, its behavior will reduce to:
> - do a full sweep *at once* every 25s
> Which means the disk will flicker once every 25s, not bad :)

25 seconds is already not very good though.... it takes a disk a
second or two of no activity to go into low power mode, every 25
seconds means you now have at least a 10% constant power cost....

I don't know the right answer (well other than "make sure inodes aren't
dirty", which involves fixing apps to not do as many file operations,
as well as relatime) but just "every 25s is no big deal" isn't really
the case ;-(
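
The "at least 10%" estimate can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope model: `awake_fraction` is a hypothetical helper, the 0.5s of write activity per wakeup is an assumed figure, and spin-up energy is ignored entirely.

```python
def awake_fraction(period_s, idle_timeout_s, active_s=0.0):
    """Fraction of time the disk stays out of low power mode.

    Each wakeup keeps the disk awake for the write itself (active_s) plus
    the idle time it needs before dropping back to low power.
    """
    awake = min(period_s, active_s + idle_timeout_s)
    return awake / period_s


# A 2s idle timeout and one wakeup every 25s, with no write time at all.
print(f"{awake_fraction(25, 2):.0%}")       # 8%
# The same with an assumed 0.5s of actual write activity per wakeup.
print(f"{awake_fraction(25, 2, 0.5):.0%}")  # 10%
```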

2007-08-27 12:27:50

by Wu Fengguang

Subject: Re: [PATCH 0/3] [RFC][PATCH] clustered writeback

On Mon, Aug 27, 2007 at 05:03:36AM -0700, Arjan van de Ven wrote:
> On Mon, 27 Aug 2007 19:21:52 +0800
> >
> > Because it does the work in small batches of 10 inodes, when the
> > system has <=10 dirty inodes, its behavior will reduce to:
> > - do a full sweep *at once* every 25s
> > Which means the disk will flicker once every 25s, not bad :)
>
> 25 seconds is already not very good though.... it takes a disk a
> second or two of no activity to go into low power mode, every 25
> seconds means you now have at least a 10% constant power cost....
>
> I don't know the right answer (well other than "make sure inodes aren't
> dirty", which involves fixing apps to not do as many file operations,
> as well as relatime) but just "every 25s is no big deal" isn't really
> the case ;-(

Yeah, 25s may be too frequent... What I meant is that the old behavior
could be "write 1-3 inodes every 5s" if the inodes are dirtied at
random times. Now it becomes "write 10 inodes every 25s". So it is
actually better ;-)

It's interesting that we want writeback to be smooth under heavy loads
and 'bursty' under light loads. Increasing dirty_expire_centisecs
and decreasing dirty_writeback_centisecs could help somewhat.
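
For readers who want to check these two tunables on their own machine, they are exposed under /proc/sys/vm on Linux. The snippet below is a minimal sketch that only reads them and converts centiseconds to seconds; `read_centisecs` is a hypothetical helper, and it returns None where the files are unavailable (for example on a non-Linux system).

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")


def read_centisecs(name, vm_dir=VM_DIR):
    """Return a vm tunable in seconds, or None if it cannot be read."""
    try:
        return int((vm_dir / name).read_text().strip()) / 100
    except (OSError, ValueError):
        return None


for name in ("dirty_expire_centisecs", "dirty_writeback_centisecs"):
    print(name, read_centisecs(name))
```

The 30s and 5s figures mentioned in this thread correspond to values of 3000 and 500 in these files. Changing them requires root, for example with `sysctl -w`.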

2007-08-27 12:54:18

by Chris Mason

Subject: Re: [PATCH 0/3] [RFC][PATCH] clustered writeback

On Mon, 27 Aug 2007 05:03:36 -0700
Arjan van de Ven wrote:

> On Mon, 27 Aug 2007 19:21:52 +0800
> >
> > Because it does the work in small batches of 10 inodes, when the
> > system has <=10 dirty inodes, its behavior will reduce to:
> > - do a full sweep *at once* every 25s
> > Which means the disk will flicker once every 25s, not bad :)
>
> 25 seconds is already not very good though.... it takes a disk a
> second or two of no activity to go into low power mode, every 25
> seconds means you now have at least a 10% constant power cost....
>
> I don't know the right answer (well other than "make sure inodes
> aren't dirty", which involves fixing apps to not do as many file
> operations, as well as relatime) but just "every 25s is no big deal"
> isn't really the case ;-(

But fixing this isn't the job of this patch... It needs something like
the laptop mode logic where it says ohhhh, the disk is awake, let's send
stuff down.

kupdate hitting the disk isn't really a new problem, so I'd rather
address it with a different patch series.

-chris