2007-08-12 09:22:42

by Wu Fengguang

Subject: [PATCH 0/6] writeback time order/delay fixes take 3

Andrew and Ken,

Here are some more experiments on the writeback stuff.
Comments are highly welcome~

writeback fixes:
[PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8
[PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes()
[PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()

debugging patches:
[PATCH 4/6] check dirty inode list
[PATCH 5/6] prevent time-ordering warnings
[PATCH 6/6] track redirty_tail() calls

Thank you,
Fengguang
--


2007-08-22 00:24:21

by Chris Mason

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Sun, 12 Aug 2007 17:11:20 +0800
Fengguang Wu <[email protected]> wrote:

> Andrew and Ken,
>
> Here are some more experiments on the writeback stuff.
> Comments are highly welcome~

I've been doing benchmarks lately to try and trigger fragmentation, and
one of them is a simulation of make -j N. It takes a list of all
the .o files in the kernel tree, randomly sorts them and then
creates bogus files with the same names and sizes in clean kernel trees.

This is basically creating a whole bunch of files in random order in a
whole bunch of subdirectories.
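
Roughly, the generator does something like this (a self-contained sketch, not
the actual compilebench code; the file list and sizes below are invented, and
the real tool also recreates the directory layout):

/*
 * Take a list of object-file names and sizes, shuffle it, and create
 * dummy files of those sizes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct obj { const char *name; size_t size; };

static struct obj objs[] = {            /* stand-ins for kernel .o files */
    { "fs_buffer.o",     91230 },
    { "mm_filemap.o",    74112 },
    { "kernel_sched.o", 132800 },
    { "fs_inode.o",      48870 },
    { "mm_rmap.o",       39424 },
};

int main(void)
{
    size_t n = sizeof(objs) / sizeof(objs[0]);
    size_t i;

    srand((unsigned)time(NULL));
    for (i = n - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        size_t j = rand() % (i + 1);
        struct obj tmp = objs[i];
        objs[i] = objs[j];
        objs[j] = tmp;
    }

    for (i = 0; i < n; i++) {           /* bogus file of the right size */
        FILE *f = fopen(objs[i].name, "w");
        if (!f)
            continue;
        fseek(f, (long)objs[i].size - 1, SEEK_SET);
        fputc(0, f);
        fclose(f);
    }
    return 0;
}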

The results aren't pretty:

http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png

The top graph shows one dot for each write over time. It shows that
ext3 is basically writing all over the place the whole time. But, ext3
actually wins the read phase, so the layout isn't horrible. My guess
is that if we introduce some write clustering by sending a group of
inodes down at the same time, it'll go much much better.

Andrew has mentioned bringing a few radix trees into the writeback paths
before; it seems like file servers and other general uses would benefit
from better clustering here.

I'm hoping to talk you into trying it out ;)

-chris

2007-08-22 01:18:54

by Wu Fengguang

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> On Sun, 12 Aug 2007 17:11:20 +0800
> Fengguang Wu <[email protected]> wrote:
>
> > Andrew and Ken,
> >
> > Here are some more experiments on the writeback stuff.
> > Comments are highly welcome~
>
> I've been doing benchmarks lately to try and trigger fragmentation, and
> one of them is a simulation of make -j N. It takes a list of all
> the .o files in the kernel tree, randomly sorts them and then
> creates bogus files with the same names and sizes in clean kernel trees.
>
> This is basically creating a whole bunch of files in random order in a
> whole bunch of subdirectories.
>
> The results aren't pretty:
>
> http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
>
> The top graph shows one dot for each write over time. It shows that
> ext3 is basically writing all over the place the whole time. But, ext3
> actually wins the read phase, so the layout isn't horrible. My guess
> is that if we introduce some write clustering by sending a group of
> inodes down at the same time, it'll go much much better.
>
> Andrew has mentioned bringing a few radix trees into the writeback paths
> before, it seems like file servers and other general uses will benefit
> from better clustering here.
>
> I'm hoping to talk you into trying it out ;)

Thank you for the description of the problem. So far I have a similar one
in mind: if we are to delay writeback of atime-dirty-only inodes by more
than an hour, some grouping/piggy-backing scheme would be beneficial.
(Which I guess no longer justifies the complexity now that we have
Ingo's make-relatime-default patch.)

My vague idea is to
- keep s_io/s_more_io as a FIFO/cyclic writeback dispatching queue;
- convert s_dirty to a radix-tree/rbtree-based data structure.
It would then serve dual functions: delayed-writeback and clustered-writeback.

clustered-writeback:
- Use the inode number as a clue to locality, hence as the key for the
sorted tree.
- Drain some more s_dirty inodes into s_io on every kupdate wakeup,
but do it in ascending inode-number order instead of by
->dirtied_when.

delayed-writeback:
- Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
dirty_expire_interval.

Notes:
(1) I'm not sure the inode number is correlated with disk location in
filesystems other than ext2/3/4. Or should we key on the parent dir?
(2) It duplicates some of the elevators' work. Why is it necessary?
Maybe because we have no clue about the exact data location at this stage?
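
To make the drain-by-inode-number part concrete, here is a rough userspace
model (not kernel code; plain arrays and qsort() stand in for the radix
tree/rbtree, and all names and numbers are made up):

/*
 * s_dirty kept sorted by inode number; each "kupdate wakeup" drains a
 * batch into s_io in ascending i_ino order, resuming where the previous
 * sweep stopped and wrapping around at the end.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct dinode { unsigned long ino; long dirtied_when; };

static int by_ino(const void *a, const void *b)
{
    unsigned long x = ((const struct dinode *)a)->ino;
    unsigned long y = ((const struct dinode *)b)->ino;
    return (x > y) - (x < y);
}

/* Move up to 'batch' inodes with ino > *cursor from s_dirty to s_io. */
static int drain_by_ino(struct dinode *s_dirty, int *ndirty,
                        struct dinode *s_io, int batch, unsigned long *cursor)
{
    int moved = 0, wrapped = 0, i;

    qsort(s_dirty, *ndirty, sizeof(*s_dirty), by_ino);
    while (moved < batch && *ndirty > 0) {
        for (i = 0; i < *ndirty; i++)
            if (s_dirty[i].ino > *cursor)
                break;
        if (i == *ndirty) {             /* past the largest ino: wrap once */
            if (wrapped++)
                break;
            *cursor = 0;
            continue;
        }
        *cursor = s_dirty[i].ino;
        s_io[moved++] = s_dirty[i];
        memmove(&s_dirty[i], &s_dirty[i + 1],
                (*ndirty - i - 1) * sizeof(*s_dirty));
        (*ndirty)--;
    }
    return moved;
}

int main(void)
{
    struct dinode s_dirty[] = { {811, 0}, {13, 0}, {529, 0}, {14, 0}, {530, 0} };
    struct dinode s_io[8];
    int ndirty = 5, n, i;
    unsigned long cursor = 0;

    n = drain_by_ino(s_dirty, &ndirty, s_io, 3, &cursor);
    for (i = 0; i < n; i++)             /* expect 13, 14, 529 in order */
        printf("writeback ino %lu\n", s_io[i].ino);
    return 0;
}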

Fengguang

2007-08-22 12:43:42

by Chris Mason

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Wed, 22 Aug 2007 09:18:41 +0800
Fengguang Wu <[email protected]> wrote:

> On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > On Sun, 12 Aug 2007 17:11:20 +0800
> > Fengguang Wu <[email protected]> wrote:
> >
> > > Andrew and Ken,
> > >
> > > Here are some more experiments on the writeback stuff.
> > > Comments are highly welcome~
> >
> > I've been doing benchmarks lately to try and trigger fragmentation,
> > and one of them is a simulation of make -j N. It takes a list of
> > all the .o files in the kernel tree, randomly sorts them and then
> > creates bogus files with the same names and sizes in clean kernel
> > trees.
> >
> > This is basically creating a whole bunch of files in random order
> > in a whole bunch of subdirectories.
> >
> > The results aren't pretty:
> >
> > http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
> >
> > The top graph shows one dot for each write over time. It shows that
> > ext3 is basically writing all over the place the whole time. But,
> > ext3 actually wins the read phase, so the layout isn't horrible.
> > My guess is that if we introduce some write clustering by sending a
> > group of inodes down at the same time, it'll go much much better.
> >
> > Andrew has mentioned bringing a few radix trees into the writeback
> > paths before, it seems like file servers and other general uses
> > will benefit from better clustering here.
> >
> > I'm hoping to talk you into trying it out ;)
>
> Thank you for the description of problem. So far I have a similar one
> in mind: if we are to delay writeback of atime-dirty-only inodes to
> above 1 hour, some grouping/piggy-backing scenario would be
> beneficial. (Which I guess does not deserve the complexity now that
> we have Ingo's make-reltime-default patch.)

Good clustering would definitely help some delayed atime writeback
scheme.

>
> My vague idea is to
> - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching
> queue.
> - convert s_dirty to some radix-tree/rbtree based data structure.
> It would have dual functions: delayed-writeback and
> clustered-writeback.
> clustered-writeback:
> - Use inode number as clue of locality, hence the key for the sorted
> tree.
> - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
> but do it in the ascending order of inode number instead of
> ->dirtied_when.
>
> delayed-writeback:
> - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
> dirty_expire_interval.

I think we should assume a full scan of s_dirty is impossible in the
presence of concurrent writers. We want to be able to pick a start
time (right now) and find all the inodes older than that start time.
New things will come in while we're scanning. But perhaps that's what
you're saying...

At any rate, we've got two types of lists now. One keeps track of age
and the other two keep track of what is currently being written. I
would try two things:

1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
indexes by inode number (or some arbitrary field the FS can set in the
inode). Radix tree tags are used to indicate which things in s_io are
already in progress or are pending (hand waving because I'm not sure
exactly).

inodes are pulled off s_dirty and the corresponding slot in s_io is
tagged to indicate IO has started. Any nearby inodes in s_io are also
sent down.

2) s_dirty and s_io both become radix trees. s_dirty is indexed by a
sequence number that corresponds to age. It is treated as a big
circular indexed list that can wrap around over time. Radix tree tags
are used both on s_dirty and s_io to flag which inodes are in progress.

>
> Notes:
> (1) I'm not sure inode number is correlated to disk location in
> filesystems other than ext2/3/4. Or parent dir?

In general, it is a better assumption than sorting by time. It may
make sense to one day let the FS provide a clustering hint
(corresponding to the first block in the file?), but for starters it
makes sense to just go with the inode number.

> (2) It duplicates some function of elevators. Why is it necessary?
> Maybe we have no clue on the exact data location at this time?

The elevator can only sort the pending IO, and we send down a
relatively small window of all the dirty pages at a time. If we sent
down all the dirty pages and let the elevator sort it out, we wouldn't
need this clustering at all.

But, that has other issues ;)

-chris


2007-08-23 02:33:38

by David Chinner

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> Notes:
> (1) I'm not sure inode number is correlated to disk location in
> filesystems other than ext2/3/4. Or parent dir?

They correspond to the exact location on disk on XFS. But XFS has its
own inode clustering (see xfs_iflush), and it can't be moved up
into the generic layers because of locking and integration into
the transaction subsystem.

> (2) It duplicates some function of elevators. Why is it necessary?

The elevators have no clue as to how the filesystem might treat adjacent
inodes. In XFS, inode clustering is a fundamental feature of the inode
reading and writing and that is something no elevator can hope to
achieve....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-08-23 02:48:05

by David Chinner

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> I think we should assume a full scan of s_dirty is impossible in the
> presence of concurrent writers. We want to be able to pick a start
> time (right now) and find all the inodes older than that start time.
> New things will come in while we're scanning. But perhaps that's what
> you're saying...
>
> At any rate, we've got two types of lists now. One keeps track of age
> and the other two keep track of what is currently being written. I
> would try two things:
>
> 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> indexes by inode number (or some arbitrary field the FS can set in the
> inode). Radix tree tags are used to indicate which things in s_io are
> already in progress or are pending (hand waving because I'm not sure
> exactly).
>
> inodes are pulled off s_dirty and the corresponding slot in s_io is
> tagged to indicate IO has started. Any nearby inodes in s_io are also
> sent down.

the problem with this approach is that it only looks at inode locality.
Data locality is ignored completely here and the data for all the
inodes that are close together could be splattered all over the drive.
In that case, clustering by inode location is exactly the wrong
thing to do.

For example, XFS changes allocation strategy at 1TB for 32-bit-inode
filesystems, which places the data far away from the inodes,
i.e. inodes in AGs below 1TB, all data in AGs above 1TB. Clustering
by inode number for data writeback is mostly useless in the >1TB
case.

The inode32 for <1TB and inode64 allocators both try to keep data
close to the inode (i.e. in the same AG), so clustering by inode number
might work better here.

Also, it might be worthwhile allowing the filesystem to supply a
hint or mask for "closeness" for inode clustering. This would help
the generic code only try to cluster inode writes to inodes that
fall into the same cluster as the first inode....
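
For example, a supplied mask could be checked like this (the mask value is
invented; the filesystem would presumably derive it from its inode-cluster
geometry):

/*
 * Two inodes belong to the same writeback cluster iff their inode
 * numbers agree above the masked-off low bits.
 */
#include <stdio.h>

static int same_cluster(unsigned long a, unsigned long b, unsigned long mask)
{
    return (a & ~mask) == (b & ~mask);
}

int main(void)
{
    unsigned long closeness_mask = 0x1f;        /* say, 32 inodes per cluster */

    printf("%d\n", same_cluster(1000, 1010, closeness_mask));  /* 1: same */
    printf("%d\n", same_cluster(1000, 1050, closeness_mask));  /* 0: not  */
    return 0;
}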

> > Notes:
> > (1) I'm not sure inode number is correlated to disk location in
> > filesystems other than ext2/3/4. Or parent dir?
>
> In general, it is a better assumption than sorting by time. It may
> make sense to one day let the FS provide a clustering hint
> (corresponding to the first block in the file?), but for starters it
> makes sense to just go with the inode number.

Perhaps multiple hints are needed - one for data locality and one
for inode cluster locality.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-08-23 12:16:39

by Chris Mason

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Thu, 23 Aug 2007 12:47:23 +1000
David Chinner <[email protected]> wrote:

> On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > I think we should assume a full scan of s_dirty is impossible in the
> > presence of concurrent writers. We want to be able to pick a start
> > time (right now) and find all the inodes older than that start time.
> > New things will come in while we're scanning. But perhaps that's
> > what you're saying...
> >
> > At any rate, we've got two types of lists now. One keeps track of
> > age and the other two keep track of what is currently being
> > written. I would try two things:
> >
> > 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> > indexes by inode number (or some arbitrary field the FS can set in
> > the inode). Radix tree tags are used to indicate which things in
> > s_io are already in progress or are pending (hand waving because
> > I'm not sure exactly).
> >
> > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > tagged to indicate IO has started. Any nearby inodes in s_io are
> > also sent down.
>
> the problem with this approach is that it only looks at inode
> locality. Data locality is ignored completely here and the data for
> all the inodes that are close together could be splattered all over
> the drive. In that case, clustering by inode location is exactly the
> wrong thing to do.

Usually it won't be any more wrong than clustering by time.

>
> For example, XFs changes allocation strategy at 1TB for 32bit inode
> filesystems which makes the data get placed way away from the inodes.
> i.e. inodes in AGs below 1TB, all data in AGs > 1TB. clustering
> by inode number for data writeback is mostly useless in the >1TB
> case.

I agree we'll want a way to let the FS provide the clustering key. But
for the first cut on the patch, I would suggest keeping it simple.

>
> The inode32 for <1Tb and inode64 allocators both try to keep data
> close to the inode (i.e. in the same AG) so clustering by inode number
> might work better here.
>
> Also, it might be worthwhile allowing the filesystem to supply a
> hint or mask for "closeness" for inode clustering. This would help
> the gernic code only try to cluster inode writes to inodes that
> fall into the same cluster as the first inode....

Yes, also a good idea after things are working.

>
> > > Notes:
> > > (1) I'm not sure inode number is correlated to disk location in
> > > filesystems other than ext2/3/4. Or parent dir?
> >
> > In general, it is a better assumption than sorting by time. It may
> > make sense to one day let the FS provide a clustering hint
> > (corresponding to the first block in the file?), but for starters it
> > makes sense to just go with the inode number.
>
> Perhaps multiple hints are needed - one for data locality and one
> for inode cluster locality.

So, my feature creep idea would have been more data clustering. I'm
mainly trying to solve this graph:

http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png

Where background writing of the block device inode is making ext3 do
seeky writes while creating directory trees. My simple idea was to kick
off an 'I've just written block X' callback to the FS, where it may
decide to send down chunks of the block device inode that also
happen to be dirty.
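
In rough, purely hypothetical form (no such hook exists in the VFS today;
names and types are invented):

/*
 * After submitting data around a block, writeback would give the
 * filesystem a chance to piggy-back nearby dirty metadata
 * (e.g. block-device-inode pages) in the same pass.
 */
#include <stdio.h>

typedef unsigned long long blocknr_t;   /* stand-in for sector_t */

struct wb_cluster_ops {
    /* called after data around 'block' has been submitted */
    void (*block_written)(void *fs_private, blocknr_t block);
};

static void ext3_like_block_written(void *fs_private, blocknr_t block)
{
    /* an ext3-style FS could queue dirty bitmap/inode-table blocks near here */
    printf("fs: consider flushing metadata near block %llu\n", block);
}

int main(void)
{
    struct wb_cluster_ops ops = { .block_written = ext3_like_block_written };

    /* writeback path, after submitting data around block 123456: */
    ops.block_written(NULL, 123456ULL);
    return 0;
}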

But, maintaining the kupdate max dirty time and congestion limits in
the face of all this clustering gets tricky. So, I wasn't going to
suggest it until the basic machinery was working.

Fengguang, this isn't a small project ;) But, lots of people will be
interested in the results.

-chris

2007-08-24 12:57:29

by Wu Fengguang

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
> On Thu, 23 Aug 2007 12:47:23 +1000
> David Chinner <[email protected]> wrote:
>
> > On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > > I think we should assume a full scan of s_dirty is impossible in the
> > > presence of concurrent writers. We want to be able to pick a start
> > > time (right now) and find all the inodes older than that start time.
> > > New things will come in while we're scanning. But perhaps that's
> > > what you're saying...
> > >
> > > At any rate, we've got two types of lists now. One keeps track of
> > > age and the other two keep track of what is currently being
> > > written. I would try two things:
> > >
> > > 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> > > indexes by inode number (or some arbitrary field the FS can set in
> > > the inode). Radix tree tags are used to indicate which things in
> > > s_io are already in progress or are pending (hand waving because
> > > I'm not sure exactly).
> > >
> > > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > > tagged to indicate IO has started. Any nearby inodes in s_io are
> > > also sent down.
> >
> > the problem with this approach is that it only looks at inode
> > locality. Data locality is ignored completely here and the data for
> > all the inodes that are close together could be splattered all over
> > the drive. In that case, clustering by inode location is exactly the
> > wrong thing to do.
>
> Usually it won't be less wrong than clustering by time.
>
> >
> > For example, XFs changes allocation strategy at 1TB for 32bit inode
> > filesystems which makes the data get placed way away from the inodes.
> > i.e. inodes in AGs below 1TB, all data in AGs > 1TB. clustering
> > by inode number for data writeback is mostly useless in the >1TB
> > case.
>
> I agree we'll want a way to let the FS provide the clustering key. But
> for the first cut on the patch, I would suggest keeping it simple.
>
> >
> > The inode32 for <1Tb and inode64 allocators both try to keep data
> > close to the inode (i.e. in the same AG) so clustering by inode number
> > might work better here.
> >
> > Also, it might be worthwhile allowing the filesystem to supply a
> > hint or mask for "closeness" for inode clustering. This would help
> > the gernic code only try to cluster inode writes to inodes that
> > fall into the same cluster as the first inode....
>
> Yes, also a good idea after things are working.
>
> >
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > > filesystems other than ext2/3/4. Or parent dir?
> > >
> > > In general, it is a better assumption than sorting by time. It may
> > > make sense to one day let the FS provide a clustering hint
> > > (corresponding to the first block in the file?), but for starters it
> > > makes sense to just go with the inode number.
> >
> > Perhaps multiple hints are needed - one for data locality and one
> > for inode cluster locality.
>
> So, my feature creep idea would have been more data clustering. I'm
> mainly trying to solve this graph:
>
> http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png
>
> Where background writing of the block device inode is making ext3 do
> seeky writes while directory trees. My simple idea was to kick
> off a 'I've just written block X' call back to the FS, where it may
> decide to send down dirty chunks of the block device inode that also
> happen to be dirty.
>
> But, maintaining the kupdate max dirty time and congestion limits in
> the face of all this clustering gets tricky. So, I wasn't going to
> suggest it until the basic machinery was working.
>
> Fengguang, this isn't a small project ;) But, lots of people will be
> interested in the results.

Exactly, the current writeback logic is unsatisfactory in many ways.
As for writeback clustering, inode and data localities can differ.
But I'll follow your suggestion to start simple and give the
idea a spin on ext3.

-fengguang

2007-08-24 13:25:20

by Wu Fengguang

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > My vague idea is to
> > - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching
> > queue.
> > - convert s_dirty to some radix-tree/rbtree based data structure.
> > It would have dual functions: delayed-writeback and
> > clustered-writeback.
> > clustered-writeback:
> > - Use inode number as clue of locality, hence the key for the sorted
> > tree.
> > - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
> > but do it in the ascending order of inode number instead of
> > ->dirtied_when.
> >
> > delayed-writeback:
> > - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
> > dirty_expire_interval.
>
> I think we should assume a full scan of s_dirty is impossible in the
> presence of concurrent writers. We want to be able to pick a start
> time (right now) and find all the inodes older than that start time.
> New things will come in while we're scanning. But perhaps that's what
> you're saying...

Yeah, I was thinking about elevators :)
Or call it sweeping based on an address hint (the inode number).

> At any rate, we've got two types of lists now. One keeps track of age
> and the other two keep track of what is currently being written. I
> would try two things:
>
> 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> indexes by inode number (or some arbitrary field the FS can set in the
> inode). Radix tree tags are used to indicate which things in s_io are
> already in progress or are pending (hand waving because I'm not sure
> exactly).
>
> inodes are pulled off s_dirty and the corresponding slot in s_io is
> tagged to indicate IO has started. Any nearby inodes in s_io are also
> sent down.
>
> 2) s_dirty and s_io both become radix trees. s_dirty is indexed by a
> sequence number that corresponds to age. It is treated as a big
> circular indexed list that can wrap around over time. Radix tree tags
> are used both on s_dirty and s_io to flag which inodes are in progress.

It's meaningless to convert s_io to a radix tree, because inodes on s_io
will normally be sent to the block-layer elevators at the same time.

Also, s_dirty holds 30 seconds' worth of inodes, while s_io holds only 5
seconds' worth. The more inodes, the better the chances of good
clustering. That's the general rule.

s_dirty is the right place to do address-clustering.
As for the dirty_expire_interval parameter on dirty age,
we can apply a simple rule: do one full scan/sweep over the
fs address space every 30s, syncing all inodes encountered
and sparing those dirtied less than 5s ago. With that rule,
any inode will get synced 5-35 seconds after being dirtied.
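
A userspace model of that rule, with made-up numbers, just to show the
5-35s bound:

/*
 * One sweep of the whole inode-number space every DIRTY_EXPIRE seconds,
 * skipping inodes dirtied within the last MIN_AGE seconds.
 */
#include <stdio.h>

#define DIRTY_EXPIRE 30                 /* full sweep period, seconds */
#define MIN_AGE       5                 /* spare inodes dirtied more recently */

struct dirty_inode { unsigned long ino; int dirtied_when; int synced; };

/* One kupdate-style pass over the slice [start_ino, end_ino). */
static void sweep(struct dirty_inode *v, int n,
                  unsigned long start_ino, unsigned long end_ino, int now)
{
    int i;

    for (i = 0; i < n; i++) {
        if (v[i].synced || v[i].ino < start_ino || v[i].ino >= end_ino)
            continue;
        if (now - v[i].dirtied_when < MIN_AGE)
            continue;                   /* dirtied too recently: next sweep */
        v[i].synced = 1;
        printf("t=%2ds: sync ino %lu (age %ds)\n",
               now, v[i].ino, now - v[i].dirtied_when);
    }
}

int main(void)
{
    struct dirty_inode inodes[] = {
        { 120, 0, 0 }, { 480, 2, 0 }, { 900, 28, 0 },
    };
    unsigned long nr_slices = DIRTY_EXPIRE / 5;
    int t;

    /* sweep 1/6 of a 1200-inode space every 5s: one full pass per 30s */
    for (t = 5; t <= 55; t += 5) {
        unsigned long slice = (t / 5 - 1) % nr_slices;
        sweep(inodes, 3, slice * 200, (slice + 1) * 200, t);
    }
    return 0;       /* printed ages are 5s, 13s and 27s: all within 5-35s */
}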

-fengguang

2007-08-24 13:55:25

by Wu Fengguang

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > Notes:
> > (1) I'm not sure inode number is correlated to disk location in
> > filesystems other than ext2/3/4. Or parent dir?
>
> The correspond to the exact location on disk on XFS. But, XFS has it's
> own inode clustering (see xfs_iflush) and it can't be moved up
> into the generic layers because of locking and integration into
> the transaction subsystem.
>
> > (2) It duplicates some function of elevators. Why is it necessary?
>
> The elevators have no clue as to how the filesystem might treat adjacent
> inodes. In XFS, inode clustering is a fundamental feature of the inode
> reading and writing and that is something no elevator can hope to
> acheive....

Thank you. That explains the linear write curve (perfect!) in Chris's graph.

I wonder if XFS can benefit any more from the general writeback clustering.
How large would a typical XFS cluster be?

-fengguang

2007-08-24 14:38:36

by Chris Mason

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Fri, 24 Aug 2007 21:24:58 +0800
Fengguang Wu <[email protected]> wrote:

> > 2) s_dirty and s_io both become radix trees. s_dirty is indexed by
> > a sequence number that corresponds to age. It is treated as a big
> > circular indexed list that can wrap around over time. Radix tree
> > tags are used both on s_dirty and s_io to flag which inodes are in
> > progress.
>
> It's meaningless to convert s_io to radix tree. Because inodes on s_io
> will normally be sent to block layer elevators at the same time.

Not entirely; using a radix tree lets you tag things instead of
doing the current backflips across three lists.

>
> Also s_dirty holds 30 seconds of inodes, while s_io only 5 seconds.
> The more inodes, the more chances of good clustering. That's the
> general rule.
>
> s_dirty is the right place to do address-clustering.
> As for the dirty_expire_interval parameter on dirty age,
> we can apply a simple rule: do one full scan/sweep over the
> fs-address-space in every 30s, syncing all inodes encountered,
> and sparing those newly dirtied in less than 5s. With that rule,
> any inode will get synced after being dirtied for 5-35 seconds.

This gives you an O(inodes dirty) behavior instead of the current O(old
inodes). It might not matter, but walking the radix tree is more
expensive than walking a list.

But, I look forward to your patches, we can tune from there.

-chris

2007-08-28 14:56:07

by David Chinner

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > Notes:
> > > (1) I'm not sure inode number is correlated to disk location in
> > > filesystems other than ext2/3/4. Or parent dir?
> >
> > The correspond to the exact location on disk on XFS. But, XFS has it's
> > own inode clustering (see xfs_iflush) and it can't be moved up
> > into the generic layers because of locking and integration into
> > the transaction subsystem.
> >
> > > (2) It duplicates some function of elevators. Why is it necessary?
> >
> > The elevators have no clue as to how the filesystem might treat adjacent
> > inodes. In XFS, inode clustering is a fundamental feature of the inode
> > reading and writing and that is something no elevator can hope to
> > acheive....
>
> Thank you. That explains the linear write curve(perfect!) in Chris' graph.
>
> I wonder if XFS can benefit any more from the general writeback clustering.
> How large would be a typical XFS cluster?

Depends on inode size. Typically the clusters are 8k in size, so anything from
4-32 inodes. The inode writeback clustering is pretty tightly integrated into
the transaction subsystem and has some intricate locking, so it's not likely
to be easy (or perhaps even possible) to make it more generic.
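
For reference, assuming the usual 256-byte to 2KB mkfs.xfs inode sizes, the
4-32 figure is just the 8k cluster divided by the inode size:

#include <stdio.h>

int main(void)
{
    int cluster = 8192;
    int sizes[] = { 256, 512, 1024, 2048 };
    int i;

    for (i = 0; i < 4; i++)     /* 8192/256 = 32 ... 8192/2048 = 4 */
        printf("inode size %4d bytes -> %2d inodes per cluster\n",
               sizes[i], cluster / sizes[i]);
    return 0;
}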

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-08-28 15:10:54

by Chris Mason

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Wed, 29 Aug 2007 00:55:30 +1000
David Chinner <[email protected]> wrote:

> On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > > filesystems other than ext2/3/4. Or parent dir?
> > >
> > > The correspond to the exact location on disk on XFS. But, XFS has
> > > it's own inode clustering (see xfs_iflush) and it can't be moved
> > > up into the generic layers because of locking and integration into
> > > the transaction subsystem.
> > >
> > > > (2) It duplicates some function of elevators. Why is it
> > > > necessary?
> > >
> > > The elevators have no clue as to how the filesystem might treat
> > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > feature of the inode reading and writing and that is something no
> > > elevator can hope to acheive....
> >
> > Thank you. That explains the linear write curve(perfect!) in Chris'
> > graph.
> >
> > I wonder if XFS can benefit any more from the general writeback
> > clustering. How large would be a typical XFS cluster?
>
> Depends on inode size. typically they are 8k in size, so anything
> from 4-32 inodes. The inode writeback clustering is pretty tightly
> integrated into the transaction subsystem and has some intricate
> locking, so it's not likely to be easy (or perhaps even possible) to
> make it more generic.

When I talked to hch about this, he said the order in which file data pages
got written in XFS was still dictated by the order the higher layers sent
things down. Shouldn't the clustering still help by getting delalloc done
in inode order instead of whatever random order pdflush sends things
down in now?

-chris

2007-08-28 16:33:36

by David Chinner

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> On Wed, 29 Aug 2007 00:55:30 +1000
> David Chinner <[email protected]> wrote:
> > On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > > Notes:
> > > > > (1) I'm not sure inode number is correlated to disk location in
> > > > > filesystems other than ext2/3/4. Or parent dir?
> > > >
> > > > The correspond to the exact location on disk on XFS. But, XFS has
> > > > it's own inode clustering (see xfs_iflush) and it can't be moved
> > > > up into the generic layers because of locking and integration into
> > > > the transaction subsystem.
> > > >
> > > > > (2) It duplicates some function of elevators. Why is it
> > > > > necessary?
> > > >
> > > > The elevators have no clue as to how the filesystem might treat
> > > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > > feature of the inode reading and writing and that is something no
> > > > elevator can hope to acheive....
> > >
> > > Thank you. That explains the linear write curve(perfect!) in Chris'
> > > graph.
> > >
> > > I wonder if XFS can benefit any more from the general writeback
> > > clustering. How large would be a typical XFS cluster?
> >
> > Depends on inode size. typically they are 8k in size, so anything
> > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > integrated into the transaction subsystem and has some intricate
> > locking, so it's not likely to be easy (or perhaps even possible) to
> > make it more generic.
>
> When I talked to hch about this, he said the order file data pages got
> written in XFS was still dictated by the order the higher layers sent
> things down.

Sure, that's file data. I was talking about the inode writeback, not the
data writeback.

> Shouldn't the clustering still help to have delalloc done
> in inode order instead of in whatever random order pdflush sends things
> down now?

Depends on how things are being allocated. If you've got inode32 allocation
and a >1TB filesystem, then data is nowhere near the inodes. If you've got large
allocation groups, then data is typically nowhere near the inodes, either. If
you've got full AGs, data will be nowhere near the inodes. If you've got
large files and lots of data to write, then clustering multiple files together
for writing is not needed. So in many cases, clustering delalloc writes by
inode number doesn't provide any better I/O patterns than not clustering...

The only difference we may see is that if we flush all the data on inodes
in a single cluster, we can get away with a single inode cluster write
for all of the inodes....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-08-28 16:58:51

by Chris Mason

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Wed, 29 Aug 2007 02:33:08 +1000
David Chinner <[email protected]> wrote:

> On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> > > >
> > > > I wonder if XFS can benefit any more from the general writeback
> > > > clustering. How large would be a typical XFS cluster?
> > >
> > > Depends on inode size. typically they are 8k in size, so anything
> > > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > > integrated into the transaction subsystem and has some intricate
> > > locking, so it's not likely to be easy (or perhaps even possible)
> > > to make it more generic.
> >
> > When I talked to hch about this, he said the order file data pages
> > got written in XFS was still dictated by the order the higher
> > layers sent things down.
>
> Sure, that's file data. I was talking about the inode writeback, not
> the data writeback.

I think we're trying to gain different things from inode-based
clustering... I'm not worried about the inode being next to the data. I'm
going under the assumption that most of the time, the FS will try to
allocate a directory's inodes in groups, and so most of the time the
data blocks for inode N will be close to those for inode N+1.

So what I'm really trying for here is data block clustering when
writing multiple inodes at once. This matters most when files are
relatively small and written in groups, which is a common workload.

It may make the most sense to change the patch to supply some key for
the data block clustering instead of the inode number, but it's an easy
first pass.

-chris

2007-08-29 07:53:46

by Wu Fengguang

Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Wed, Aug 29, 2007 at 02:33:08AM +1000, David Chinner wrote:
> On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:
> > On Wed, 29 Aug 2007 00:55:30 +1000
> > David Chinner <[email protected]> wrote:
> > > On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote:
> > > > On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
> > > > > On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
> > > > > > On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> > > > > > Notes:
> > > > > > (1) I'm not sure inode number is correlated to disk location in
> > > > > > filesystems other than ext2/3/4. Or parent dir?
> > > > >
> > > > > The correspond to the exact location on disk on XFS. But, XFS has
> > > > > it's own inode clustering (see xfs_iflush) and it can't be moved
> > > > > up into the generic layers because of locking and integration into
> > > > > the transaction subsystem.
> > > > >
> > > > > > (2) It duplicates some function of elevators. Why is it
> > > > > > necessary?
> > > > >
> > > > > The elevators have no clue as to how the filesystem might treat
> > > > > adjacent inodes. In XFS, inode clustering is a fundamental
> > > > > feature of the inode reading and writing and that is something no
> > > > > elevator can hope to acheive....
> > > >
> > > > Thank you. That explains the linear write curve(perfect!) in Chris'
> > > > graph.
> > > >
> > > > I wonder if XFS can benefit any more from the general writeback
> > > > clustering. How large would be a typical XFS cluster?
> > >
> > > Depends on inode size. typically they are 8k in size, so anything
> > > from 4-32 inodes. The inode writeback clustering is pretty tightly
> > > integrated into the transaction subsystem and has some intricate
> > > locking, so it's not likely to be easy (or perhaps even possible) to
> > > make it more generic.
> >
> > When I talked to hch about this, he said the order file data pages got
> > written in XFS was still dictated by the order the higher layers sent
> > things down.
>
> Sure, that's file data. I was talking about the inode writeback, not the
> data writeback.
>
> > Shouldn't the clustering still help to have delalloc done
> > in inode order instead of in whatever random order pdflush sends things
> > down now?
>
> Depends on how things are being allocated. if you've got inode32 allocation
> and >1TB filesytsem, then data is nowhere near the inodes. If you've got large
> allocation groups, then data is typically nowhere near the inodes, either. If
> you've got full AGs, data will be nowehere near the inodes. If you've got
> large files and lots of data to write, then clustering multiple files together
> for writing is not needed. So in many cases, clustering delalloc writes by
> inode number doesn't provide any better I/o patterns than not clustering...
>
> The only difference we may see is that if we flush all the data on inodes
> in a single cluster, we can get away with a single inode cluster write
> for all of the inodes....

So we end up with two major cases:

- small files: the inode and its data are expected to be close enough,
hence clustering can help I_DIRTY_SYNC and/or I_DIRTY_PAGES writeback

- big files: the inode and its data may or may not be separated
  - I_DIRTY_SYNC: could be improved
  - I_DIRTY_PAGES: no better, no worse (it's big I/O, so the seek
    cost is not relevant in any case)

Conclusion: _inode_ writeback clustering is enough.

Isn't it simple? ;-)

Fengguang