2006-10-23 14:16:41

by Theodore Ts'o

Subject: Re: [RFC] Ext3 online defrag

On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
> Hello,
>
> I've written a simple patch implementing ext3 ioctl for file
> relocation. Basically you call ioctl on a file, give it list of blocks
> and it relocates the file into given blocks (provided they are still
> free). The idea is to use it as a kernel part of ext3 online
> defragmenter (or generally disk access optimizer). Now I don't have the
> user space part that finds larger runs of free blocks and so on so that
> it can really be used as a defragmenter. I just send this as a kind of
> proof-of-concept to hear some comments. Attached is also a simple
> program that demonstrates the use of the ioctl.

As a suggestion, I would pass the inode number and inode generation
number into the ext3_file_move_data structure:

struct ext3_file_move_data {
        int extents;
        struct ext3_reloc_extent __user *ext_array;
};

This will be much more efficient for the userspace relocator, since it
won't need to translate from an inode number to a pathname, and then
try to open the file before relocating it.

I'd also use an explicit 64-bit block numbers type so that we don't
have to worry about the ABI changing when we support 64-bit block
numbers.
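
A minimal sketch of what such a request could look like, with the inode
number and generation carried in the request and every block number held
in an explicit 64-bit type. The struct and field names below are
hypothetical (the thread does not show the layout of
ext3_reloc_extent), and the code is plain userspace C rather than
kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: field names are illustrative, not taken from the patch. */
struct reloc_extent_sketch {
	uint64_t logical;   /* first logical block of the range within the file */
	uint64_t physical;  /* first destination block on disk */
	uint32_t len;       /* number of blocks in the range */
	uint32_t pad;       /* explicit padding so the size is the same on every ABI */
};

struct file_move_data_sketch {
	uint64_t ino;         /* inode number of the file to relocate */
	uint32_t generation;  /* inode generation, guards against inode reuse */
	uint32_t extents;     /* number of entries in ext_array */
	struct reloc_extent_sketch *ext_array; /* still a pointer, as in the struct above */
};

/* Returns 1 when the (ino, generation) pair in the request names the live inode. */
int inode_matches(const struct file_move_data_sketch *req,
		  uint64_t live_ino, uint32_t live_generation)
{
	return req->ino == live_ino && req->generation == live_generation;
}
```

The generation check is what makes passing a bare inode number safe: if
the inode was freed and reused between the scan and the call, the
generation no longer matches and the request can be rejected.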


The other problem I see with this patch is that there will be cache
coherency problems between the buffer cache and the page cache. I
think you will want to pull the data blocks of the file into the page
cache, and then write them out from the page cache, and only *then*
update the indirect blocks and commit the transaction.

So what needs to happen is the following:

1) Validate the inode and generation number. Make sure the new
(destination) blocks passed in are valid and not in use. Allocate
them to prevent anyone else from using those blocks.

2) Pull the blocks into the page cache (if they are not already
there), and then write them out to the new location on disk. If any of
the I/O's fail, abort.

3) Update the indirect blocks or extent tree to point at the newly
allocated and copied data blocks.

In the current patch, it looks like you add the inode being relocated
to the orphan list, and then update the direct/indirect blocks first
--- and if you fail the inode gets truncated. That's bad since we
don't want to lose any data if we crash in the middle of the defrag
operation....
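
The ordering in the three steps above can be illustrated with a toy,
in-memory model. This is only a sketch under stated assumptions: the
"disk", the bitmap and the block map are made-up userspace structures,
there is no journal or page cache, and none of the names come from the
patch. What it shows is that the block pointers are switched only after
every copy has succeeded, so a failure leaves the original mapping
untouched:

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS   16
#define BLOCKSIZE 8
#define MAXFILE   4

struct toy_disk {
	char data[NBLOCKS][BLOCKSIZE];
	int  in_use[NBLOCKS];      /* toy block bitmap */
};

struct toy_file {
	int nblocks;
	int map[MAXFILE];          /* logical block -> physical block */
};

/*
 * Relocate every block of f to dest[]. fail_at >= 0 simulates an I/O error
 * while copying that logical block. Returns 0 on success, -1 on failure;
 * on failure f->map is left exactly as it was.
 */
int toy_relocate(struct toy_disk *d, struct toy_file *f, const int *dest, int fail_at)
{
	int i;

	/* Step 1: validate each destination block and reserve it. */
	for (i = 0; i < f->nblocks; i++) {
		if (dest[i] < 0 || dest[i] >= NBLOCKS || d->in_use[dest[i]]) {
			while (--i >= 0)
				d->in_use[dest[i]] = 0;   /* undo our reservations */
			return -1;
		}
		d->in_use[dest[i]] = 1;
	}

	/* Step 2: copy the data; the old mapping is still the live one. */
	for (i = 0; i < f->nblocks; i++) {
		if (i == fail_at) {
			for (i = 0; i < f->nblocks; i++)
				d->in_use[dest[i]] = 0;   /* abort: release the new blocks */
			return -1;
		}
		memcpy(d->data[dest[i]], d->data[f->map[i]], BLOCKSIZE);
	}

	/* Step 3: only now switch the pointers and free the old blocks. */
	for (i = 0; i < f->nblocks; i++) {
		d->in_use[f->map[i]] = 0;
		f->map[i] = dest[i];
	}
	return 0;
}
```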

Great to see that you're working on this problem! I'd love to see
this functionality go into ext4.

Regards,

- Ted

P.S. There is also the question of whether we'll be able to get this
interface past the ioctl() police, but the atomicity requirements of
such an interface are a poster child for why we really, REALLY, can't
do this via a sysfs interface. We might be forced to create a new
filesystem, or create a pseudo inode which we open via a magic
pathname, though. That in my opinion is uglier than an ioctl, but the
ioctl police really don't like the problem of needing to maintain
32/64 bit translation functions, and this interface would surely cause
problems for the x86_64 and PPC platforms, since they have to support
32-bit and 64-bit system ABI's.




2006-10-23 14:31:40

by Alex Tomas

Subject: Re: [RFC] Ext3 online defrag

>>>>> Theodore Tso (TT) writes:

TT> On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
>> Hello,
>>
>> I've written a simple patch implementing ext3 ioctl for file
>> relocation. Basically you call ioctl on a file, give it list of blocks
>> and it relocates the file into given blocks (provided they are still
>> free). The idea is to use it as a kernel part of ext3 online
>> defragmenter (or generally disk access optimizer).

isn't that a kernel responsibility to find/allocate target blocks?
wouldn't it be better to specify desirable target group and minimal
acceptable chunk of free blocks?

thanks, Alex

2006-10-23 14:44:26

by Jan Kara

Subject: Re: [RFC] Ext3 online defrag

Hello,

> > I've written a simple patch implementing ext3 ioctl for file
> > relocation. Basically you call ioctl on a file, give it list of blocks
> > and it relocates the file into given blocks (provided they are still
> > free). The idea is to use it as a kernel part of ext3 online
> > defragmenter (or generally disk access optimizer). Now I don't have the
> > user space part that finds larger runs of free blocks and so on so that
> > it can really be used as a defragmenter. I just send this as a kind of
> > proof-of-concept to hear some comments. Attached is also a simple
> > program that demonstrates the use of the ioctl.
>
> As a suggestion, I would pass the inode number and inode generation
> number into the ext3_file_move_data structure:
>
> struct ext3_file_move_data {
> int extents;
> struct ext3_reloc_extent __user *ext_array;
> };
>
> This will be much more efficient for the userspace relocator, since it
> won't need to translate from an inode number to a pathname, and then
> try to open the file before relocating it.
Hmm, I was also thinking about it. Probably you're right. It just
seemed elegant to call ioctl on a file and *plop* it's relocated ;).

> I'd also use an explicit 64-bit block numbers type so that we don't
> have to worry about the ABI changing when we support 64-bit block
> numbers.
Right, will fix.

> The other problem I see with this patch is that there will be cache
> coherency problems between the buffer cache and the page cache. I
> think you will want to pull the data blocks of the file into the page
> cache, and then write them out from the page cache, and only *then*
> update the indirect blocks and commit the transaction.
Hmm, I thought I got this right. We build a new tree, copy all data to
it (no writes happen so trees remain consistent), we switch block
pointers from inode. So from now on, any get_block() will correctly
return new block number and block will be read from disk (hmm, probably
I'm missing sync after writing out all the data). Now we call
invalidate_inode_pages2() so all buffers mapped to old blocks are freed
from memory. So there should not be problems with this... OTOH doing the
data copy via page-cache (of the temporarily set-up inode) should not be
a big problem either and we can avoid one sync which should be a win.

> So what needs to happen is the following:
>
> 1) Validate the inode and generation number. Make sure the new
> (destination) blocks passed in are valid and not in use. Allocate
> them to prevent anyone else from using those blocks.
>
> 2) Pull the blocks into the page cache (if they are not already
> there), and then write them out to the new location on disk. If any of
> the I/O's fail, abort.
>
> 3) Update the indirect blocks or extent tree to point at the newly
> allocated and copied data blocks.
>
> In the current patch, it looks like you add the inode being relocated
> to the orphan list, and then update the direct/indirect blocks first
No, I create temporary inode that holds allocated blocks and that is
added to the orphan list. Hence if we crash in the middle of relocation,
all blocks are correctly freed.

> --- and if you fail the inode gets truncated. That's bad since we
> don't want to lose any data if we crash in the middle of the defrag
> operation....
>
> Great to see that you're working on this problem! I'd love to see
> this functionality go into ext4.
Thanks for comments.

> P.S. There is also the question of whether we'll be able to get this
> interface past the ioctl() police, but the atomicity requirements of
> such an interface are a poster child for why we really, REALLY, can't
> do this via a sysfs interface. We might be forced to create a new
> filesystem, or create a pseudo inode which we open via a magic
> pathname, though. That in my opinion is uglier than an ioctl, but the
> ioctl police really don't like the problem of needing to maintain
> 32/64 bit translation functions, and this interface would surely cause
> problems for the x86_64 and PPC platforms, since they have to support
> 32-bit and 64-bit system ABI's.
Umm, yes. I'm open to suggestions with respect to which interface to
choose. ioctl() was just the easiest to code ;).

Bye
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-10-23 14:48:25

by Andreas Dilger

Subject: Re: [RFC] Ext3 online defrag

On Oct 23, 2006 18:31 +0400, Alex Tomas wrote:
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify desirable target group and minimal
> acceptable chunk of free blocks?

In some cases this is useful (e.g. if file has small fragments after
being written in small pieces or in a fragmented free space). In other
cases the user tool HAS to be able to specify the new mapping in order
to make progress.

Consider if there are two very large fragmented files and user-space
defrag tool wants to make contiguous free space. If kernel is left to do
allocation it will always consume the largest chunk of free space first,
even if it is not yet optimal (e.g. large 1MB aligned extent).

I would make this interface optionally allow the target extent to be
specified, but if target block == 0 then the kernel is free to do its
own allocation.
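
A small sketch of that convention, assuming a toy bitmap in place of
the real allocator (the function names and the first-fit search are
illustrative only). Block 0 is never handed out in this toy, which is
what lets 0 double as the "no preference" value:

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 32

/*
 * First-fit search for `len` contiguous free blocks, starting at block 1.
 * Returns the first block of the run, or 0 if nothing fits.
 */
uint64_t toy_find_free_run(const uint8_t *in_use, uint32_t len)
{
	uint64_t start, i;

	if (len == 0)
		return 0;
	for (start = 1; start + len <= NBLOCKS; start++) {
		for (i = 0; i < len && !in_use[start + i]; i++)
			;
		if (i == len)
			return start;
	}
	return 0;
}

/*
 * requested == 0: the allocator chooses.
 * requested != 0: use exactly that run, or fail (return 0) if any block
 * in it is out of range or already taken.
 */
uint64_t toy_pick_target(const uint8_t *in_use, uint64_t requested, uint32_t len)
{
	uint64_t i;

	if (requested == 0)
		return toy_find_free_run(in_use, len);

	if (len == 0 || requested >= NBLOCKS || len > NBLOCKS - requested)
		return 0;
	for (i = 0; i < len; i++)
		if (in_use[requested + i])
			return 0;
	return requested;
}
```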

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-23 14:50:36

by Jan Kara

Subject: Re: [RFC] Ext3 online defrag

> >>>>> Theodore Tso (TT) writes:
>
> TT> On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
> >> Hello,
> >>
> >> I've written a simple patch implementing ext3 ioctl for file
> >> relocation. Basically you call ioctl on a file, give it list of blocks
> >> and it relocates the file into given blocks (provided they are still
> >> free). The idea is to use it as a kernel part of ext3 online
> >> defragmenter (or generally disk access optimizer).
>
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify desirable target group and minimal
> acceptable chunk of free blocks?
Kernel definitely allocates those blocks (because it's the only
reasonably race-free way). The problem of finding those blocks is a bit
harder - it may be quite a complicated decision where to put the file
(also given that sometimes you may need to "shift away" some file to
make space for some other one). Also what I'm aiming for is, that
userspace defragmenter could be fed some "access patterns" and it
optimizes layout of several files to speedup startup (i.e. blocks of
those several files would be interleaved so that their sequence is close
to the one seen during start-up).

Honza

2006-10-23 14:54:44

by Jan Kara

Subject: Re: [RFC] Ext3 online defrag

> On Oct 23, 2006 18:31 +0400, Alex Tomas wrote:
> I would make this interface optionally allow the target extent to be
> specified, but if target block == 0 then the kernel is free to do its
> own allocation.
That's a good idea! I'll change the handling so that if block==0 we
just allocate blocks of given extent as we wish...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-10-23 15:01:49

by Eric Sandeen

Subject: Re: [RFC] Ext3 online defrag

Alex Tomas wrote:
>>>>>> Theodore Tso (TT) writes:
>
> TT> On Mon, Oct 23, 2006 at 02:27:10PM +0200, Jan Kara wrote:
> >> Hello,
> >>
> >> I've written a simple patch implementing ext3 ioctl for file
> >> relocation. Basically you call ioctl on a file, give it list of blocks
> >> and it relocates the file into given blocks (provided they are still
> >> free). The idea is to use it as a kernel part of ext3 online
> >> defragmenter (or generally disk access optimizer).
>
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify desirable target group and minimal
> acceptable chunk of free blocks?

XFS does this by allocating new blocks for a temporary file (initiated from
userspace, implemented in kernelspace of course), then just checks to see if the
result is better than what we had before; if so, then swap the storage space &
throw away the temporary file (which now has the original, more-fragmented file
blocks).

see xfs_swapext() in xfs_dfrag.c for the extent swapping part of this.
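
The "swap only if the result is better" idea can be sketched with a toy
block map. This is not XFS code and makes no claim about how
xfs_swapext() is written; the structures and names are invented for
illustration, and the data is assumed to have been copied into the
temporary file's blocks already:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
	size_t    nblocks;
	uint64_t *map;      /* logical block -> physical block */
};

/* Number of physically contiguous runs in the mapping. */
size_t toy_count_extents(const struct toy_inode *ino)
{
	size_t i, n = ino->nblocks ? 1 : 0;

	for (i = 1; i < ino->nblocks; i++)
		if (ino->map[i] != ino->map[i - 1] + 1)
			n++;
	return n;
}

/*
 * Swap block maps only if the temporary copy is less fragmented.
 * Returns 1 if swapped, 0 if the original layout was kept.
 */
int toy_swap_if_better(struct toy_inode *file, struct toy_inode *tmp)
{
	uint64_t *m;

	if (file->nblocks != tmp->nblocks)
		return 0;
	if (toy_count_extents(tmp) >= toy_count_extents(file))
		return 0;
	m = file->map;
	file->map = tmp->map;
	tmp->map = m;       /* tmp now owns the old, more fragmented blocks */
	return 1;
}
```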

You probably want to avoid the page cache in all of this too, doing O_DIRECT IO
if possible; I don't think there's any reason to churn the page cache while the
defragmenter runs over a filesystem.

-Eric

2006-10-23 15:14:49

by Andreas Dilger

Subject: Re: [RFC] Ext3 online defrag

On Oct 23, 2006 10:16 -0400, Theodore Tso wrote:
> As a suggestion, I would pass the inode number and inode generation
> number into the ext3_file_move_data structure:
>
> struct ext3_file_move_data {
> int extents;
> struct ext3_reloc_extent __user *ext_array;
> };
>
> This will be much more efficient for the userspace relocator, since it
> won't need to translate from an inode number to a pathname, and then
> try to open the file before relocating it.
>
> I'd also use an explicit 64-bit block numbers type so that we don't
> have to worry about the ABI changing when we support 64-bit block
> numbers.

I would in fact go so far as to allow only a single extent to be specified
per call. This is to avoid the passing of any pointers as part of the
interface (hello ioctl police :-), and also makes the kernel code simpler.
I don't think the syscall/ioctl overhead is significant compared to the
journal and IO overhead.

Also, I would specify both the source extent and the target extent in
the inode. This first allows defragmenting only part of the file
instead of (it appears) requiring the whole file to be relocated. That
would be a killer if the file being defragmented is larger than free
space. It secondly provides a level of insurance that what the kernel
is relocating matches what userspace thinks it is doing. It would
protect against problems if the kernel ever does block relocation
itself (e.g. merge fragments into a single extent on (re)write, or for
snapshot/COW).
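
One way to picture a single-extent, pointer-free request that names
both source and target is sketched below. The struct and its field
names are hypothetical, and the check runs against a toy in-memory
block map rather than a real inode. The point is the "insurance" step:
the request is refused if the file no longer lives where userspace
thinks it does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request: fixed-width fields only, one extent per call. */
struct move_extent_sketch {
	uint64_t logical;   /* file offset of the range, in blocks */
	uint64_t source;    /* where userspace believes the range lives now */
	uint64_t target;    /* where it should be moved to */
	uint32_t len;       /* length of the range in blocks */
	uint32_t pad;       /* explicit padding, must be zero */
};

/*
 * Compare the request with the file's current mapping.
 * map[] is logical block -> physical block, nblocks entries.
 * Returns 0 if the source extent matches, -1 if the request is malformed
 * or the file changed underneath the caller.
 */
int toy_check_source(const uint64_t *map, uint64_t nblocks,
		     const struct move_extent_sketch *req)
{
	uint64_t i;

	if (req->pad != 0 || req->len == 0)
		return -1;
	if (req->logical >= nblocks || req->len > nblocks - req->logical)
		return -1;
	for (i = 0; i < req->len; i++)
		if (map[req->logical + i] != req->source + i)
			return -1;
	return 0;
}
```

Because the struct holds no pointers and is padded explicitly, it has
the same size and layout for 32-bit and 64-bit callers, so no
translation layer would be needed for it.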

> The other problem I see with this patch is that there will be cache
> coherency problems between the buffer cache and the page cache. I
> think you will want to pull the data blocks of the file into the page
> cache, and then write them out from the page cache, and only *then*
> update the indirect blocks and commit the transaction.

Alternately (maybe even better) is to treat it as O_DIRECT and ensure
the page cache is flushed. This also avoids polluting the whole page
cache while running a defragmenter on the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-23 16:02:21

by Jan Kara

Subject: Re: [RFC] Ext3 online defrag

> On Oct 23, 2006 10:16 -0400, Theodore Tso wrote:
> > As a suggestion, I would pass the inode number and inode generation
> > number into the ext3_file_move_data structure:
> >
> > struct ext3_file_move_data {
> > int extents;
> > struct ext3_reloc_extent __user *ext_array;
> > };
> >
> > This will be much more efficient for the userspace relocator, since it
> > won't need to translate from an inode number to a pathname, and then
> > try to open the file before relocating it.
> >
> > I'd also use an explicit 64-bit block numbers type so that we don't
> > have to worry about the ABI changing when we support 64-bit block
> > numbers.
>
> I would in fact go so far as to allow only a single extent to be specified
> per call. This is to avoid the passing of any pointers as part of the
> interface (hello ioctl police :-), and also makes the kernel code simpler.
> I don't think the syscall/ioctl overhead is significant compared to the
> journal and IO overhead.
I'm not sure it makes the kernel code simpler - if we have to replace
just a part of the file, we have to rewrite references to blocks at
several places inside the indirect tree. If we relocate the whole file, we just
replace block pointers from inode. Furthermore it makes it kind of
harder to tell where indirect blocks would go - and it would be
impossible for the defragmenter to force some unusual placement of
indirect blocks... Currently blocks (including indirect ones) are just
being allocated in the DFS order from the given list.

> Also, I would specify both the source extent and the target extent in
> the inode. This first allows defragmenting only part of the file
> instead of (it appears) requiring the whole file to be relocated. That
> would be a killer if the file being defragmented is larger than free
> space. It secondly provides a level of insurance that what the kernel
> is relocating matches what userspace thinks it is doing. It would
> protect against problems if the kernel ever does block relocation
> itself (e.g. merge fragments into a single extent on (re)write, or for
> snapshot/COW).
I agree that this is the positive side of your approach :).

> > The other problem I see with this patch is that there will be cache
> > coherency problems between the buffer cache and the page cache. I
> > think you will want to pull the data blocks of the file into the page
> > cache, and then write them out from the page cache, and only *then*
> > update the indirect blocks and commit the transaction.
>
> Alternately (maybe even better) is to treat it as O_DIRECT and ensure
> the page cache is flushed. This also avoids polluting the whole page
> cache while running a defragmenter on the filesystem.
That's what I'm trying to do (but maybe my code is buggy).

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-10-23 17:29:11

by Andreas Dilger

Subject: Re: [RFC] Ext3 online defrag

On Oct 23, 2006 18:03 +0200, Jan Kara wrote:
> Andreas Dilger wrote:
> > I would in fact go so far as to allow only a single extent to be specified
> > per call. This is to avoid the passing of any pointers as part of the
> > interface (hello ioctl police :-), and also makes the kernel code simpler.
> > I don't think the syscall/ioctl overhead is significant compared to the
> > journal and IO overhead.
>
> ...it makes it kind of
> harder to tell where indirect blocks would go - and it would be
> impossible for the defragmenter to force some unusual placement of
> indirect blocks...

It would be possible to specify indirect block relocation in same manner
as regular block relocation I think. Allocate a new block, copy contents,
flush block from cache, fix up reference (inode, dindirect), commit.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-24 04:14:37

by Jeff Garzik

Subject: Re: [RFC] Ext3 online defrag

On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
> isn't that a kernel responsibility to find/allocate target blocks?
> wouldn't it be better to specify desirable target group and minimal
> acceptable chunk of free blocks?

The kernel doesn't have enough knowledge to know whether or not the
defragger prefers one blkdev location over another.

When you are trying to consolidate blocks, you must specify the
destination as well as source blocks.

Certainly, to prevent corruption and other nastiness, you must fail if
the destination isn't available...

(ext2meta did all this...)

Jeff

2006-10-24 14:02:08

by David Chinner

Subject: Re: [RFC] Ext3 online defrag

On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote:
> On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
> > isn't that a kernel responsibility to find/allocate target blocks?
> > wouldn't it be better to specify desirable target group and minimal
> > acceptable chunk of free blocks?
>
> The kernel doesn't have enough knowledge to know whether or not the
> defragger prefers one blkdev location over another.
>
> When you are trying to consolidate blocks, you must specify the
> destination as well as source blocks.
>
> Certainly, to prevent corruption and other nastiness, you must fail if
> the destination isn't available...

That's the wrong way to look at it. If you want the userspace
process to specify a location, then you should preallocate it first
before doing anything else. There is no need to clutter a simple
data mover interface with all sorts of unnecessary error handling.

Once you've separated the destination allocation from the data
mover, the mover is basically a splice copy from source to
destination, an fsync and then an atomic swap blocks/extents operation.
Most of this code is generic, and a per-fs swap-extents vector
could be easily provided for the one bit that is not....

The allocation interface, OTOH, is anything but simple and is really
a filesystem specific interface. Seems logical to me to separate
the two.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-24 14:51:56

by Dave Kleikamp

Subject: Re: [RFC] Ext3 online defrag

On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
> On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote:
> > On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
> > > isn't that a kernel responsibility to find/allocate target blocks?
> > > wouldn't it be better to specify desirable target group and minimal
> > > acceptable chunk of free blocks?
> >
> > The kernel doesn't have enough knowledge to know whether or not the
> > defragger prefers one blkdev location over another.
> >
> > When you are trying to consolidate blocks, you must specify the
> > destination as well as source blocks.
> >
> > Certainly, to prevent corruption and other nastiness, you must fail if
> > the destination isn't available...
>
> That's the wrong way to look at it. if you want the userspace
> process to specify a location, then you should preallocate it first
> before doing anything else. There is no need to clutter a simple
> data mover interface with all sorts of unnecessary error handling.

You are implying that the 2-step interface, creating a new inode then
swapping the contents, is the only way to implement this.
>
> Once you've separated the destination allocation from the data
> mover, the mover is basically a splice copy from source to
> destination, an fsync and then an atomic swap blocks/extents operation.
> Most of this code is generic, and a per-fs swap-extents vector
> could be easily provided for the one bit that is not....

The benefit of having such a simple data mover is negated by moving the
complexity into the allocator.

A single interface that would move a part of a file at a time has the
advantage that a large file which is only fragmented in a few areas does
not need to be completely moved.

> The allocation interface, OTOH, is anything but simple and is really
> a filesystem specific interface. Seems logical to me to separate
> the two.

So what then is the benefit of having a simple generic data mover if
every file system needs to implement its own interface to allocate a
copy of the data?

Shaggy
--
David Kleikamp
IBM Linux Technology Center

2006-10-24 14:52:20

by Eric Sandeen

Subject: Re: [RFC] Ext3 online defrag

David Chinner wrote:
> The allocation interface, OTOH, is anything but simple and is really
> a filesystem specific interface. Seems logical to me to separate
> the two.

And ext[234] preallocation would be a very nice feature in its own right.

-Eric

2006-10-24 16:02:49

by David Chinner

Subject: Re: [RFC] Ext3 online defrag

On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
> On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
> > On Tue, Oct 24, 2006 at 12:14:33AM -0400, Jeff Garzik wrote:
> > > On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:
> > > > isn't that a kernel responsibility to find/allocate target blocks?
> > > > wouldn't it be better to specify desirable target group and minimal
> > > > acceptable chunk of free blocks?
> > >
> > > The kernel doesn't have enough knowledge to know whether or not the
> > > defragger prefers one blkdev location over another.
> > >
> > > When you are trying to consolidate blocks, you must specify the
> > > destination as well as source blocks.
> > >
> > > Certainly, to prevent corruption and other nastiness, you must fail if
> > > the destination isn't available...
> >
> > That's the wrong way to look at it. if you want the userspace
> > process to specify a location, then you should preallocate it first
> > before doing anything else. There is no need to clutter a simple
> > data mover interface with all sorts of unnecessary error handling.
>
> > You are implying that the 2-step interface, creating a new inode then
> swapping the contents, is the only way to implement this.

No, it's not the only way to implement it, but it seems the cleanest
way to me when you have to consider crash recovery. With a temporary
inode, you can create it, hold a reference and then unlink it so
that any crash at that point will free the inode and any extents
it has on it.
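
The create / hold a reference / unlink pattern is the usual POSIX idiom
for a file that cleans itself up, and a userspace sketch of it is
short. The path below is purely illustrative (a real defragmenter
would need the temporary file on the same filesystem as the file being
moved), and the function name is made up:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create a temporary file and unlink it immediately, keeping only the
 * open descriptor. Returns the descriptor, or -1 on error.
 */
int open_unlinked_temp(void)
{
	char path[] = "/tmp/defrag-sketch-XXXXXX";  /* illustrative location */
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	if (unlink(path) != 0) {
		close(fd);
		return -1;
	}
	return fd;   /* the inode stays alive until this descriptor is closed */
}
```

Once the name is gone, the inode and its blocks are released when the
last descriptor is closed, including when the process dies, so no
cleanup pass is needed for the blocks held by the temporary file.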

The only way I can see anything different working is having the
filesystem hold extents somewhere internally that provides us the
same recovery guarantees while we copy the data and insert the new
extents. This is obviously a filesystem specific solution and is
more complex to implement than a swap extent transaction. It
probably also needs on-disk format changes to support properly....

> > Once you've separated the destination allocation from the data
> > mover, the mover is basically a splice copy from source to
> > destination, an fsync and then an atomic swap blocks/extents operation.
> > Most of this code is generic, and a per-fs swap-extents vector
> > could be easily provided for the one bit that is not....
>
> The benefit of having such a simple data mover is negated by moving the
> complexity into the allocator.

What complexity does it introduce that the allocator doesn't already
have or needs to provide for the single call interface to work?

> A single interface that would move a part of a file at a time has the
> advantage that a large file which is only fragmented in a few areas does
> not need to be completely moved.

And the two-step process can do exactly this as well - splice can
work on any offset within the file...

> > The allocation interface, OTOH, is anything but simple and is really
> > a filesystem specific interface. Seems logical to me to separate
> > the two.
>
> So what then is the benefit of having a simple generic data mover if
> > every file system needs to implement its own interface to allocate a
> copy of the data?

I assume you meant "....allocate the space to store the copy of the data."

The allocation interface needs to be able to be extended
independently of the data mover interface. XFS already exposes
allocation ioctls to userspace for preallocation and we've got plans
to extend this further to allow userspace controlled allocation for
smart defrag tools for XFS. Tying allocation to the data mover
just makes the interface less flexible and harder to do anything
smart with....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-24 16:27:00

by Dave Kleikamp

Subject: Re: [RFC] Ext3 online defrag

On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
> On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
> > On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
> > > That's the wrong way to look at it. if you want the userspace
> > > process to specify a location, then you should preallocate it first
> > > before doing anything else. There is no need to clutter a simple
> > > data mover interface with all sorts of unnecessary error handling.
> >
> > > You are implying that the 2-step interface, creating a new inode then
> > swapping the contents, is the only way to implement this.
>
> No, it's not the only way to implement it, but it seems the cleanest
> way to me when you have to consider crash recovery. With a temporary
> inode, you can create it, hold a reference and then unlink it so
> that any crash at that point will free the inode and any extents
> it has on it.
>
> The only way I can see anything different working is having the
> filesystem hold extents somewhere internally that provides us the
> same recovery guarantees while we copy the data and insert the new
> extents. This is obviously a filesystem specific solution and is
> more complex to implement than a swap extent transaction. it
> probably also needs on disk format changes to support properly....

This is definitely filesystem-dependent. I would think allocating an
extent would be like any other allocation done by the filesystem, and
there are already recovery mechanisms for that.

> > > Once you've separated the destination allocation from the data
> > > mover, the mover is basically a splice copy from source to
> > > destination, an fsync and then an atomic swap blocks/extents operation.
> > > Most of this code is generic, and a per-fs swap-extents vector
> > > could be easily provided for the one bit that is not....
> >
> > The benefit of having such a simple data mover is negated by moving the
> > complexity into the allocator.
>
> What complexity does it introduce that the allocator doesn't already
> have or needs to provide for the single call interface to work?

I don't see it as any more or less complex than a single interface.

> > A single interface that would move a part of a file at a time has the
> > advantage that a large file which is only fragmented in a few areas does
> > not need to be completely moved.
>
> And the two-step process can do exactly this as well - splice can
> work on any offset within the file...

I wasn't aware of that. That makes your proposal sound a lot better.

> > > The allocation interface, OTOH, is anything but simple and is really
> > > a filesystem specific interface. Seems logical to me to separate
> > > the two.
> >
> > So what then is the benefit of having a simple generic data mover if
> > > every file system needs to implement its own interface to allocate a
> > copy of the data?
>
> I assume you meant "....allocate the space to store the copy of the data."

Yeah.

> The allocation interface needs to be able to be extended
> independently of the data mover interface. XFS already exposes
> allocation ioctls to userspace for preallocation and we've got plans
> to extend this further to allow userspace controlled allocation for
> smart defrag tools for XFS. Tying allocation to the data mover
> just makes the interface less flexible and harder to do anything
> smart with....

Okay. It would be nice to standardize the interface so we don't have
every filesystem introducing new ioctls.

> Cheers,
>
> Dave.
--
David Kleikamp
IBM Linux Technology Center

2006-10-24 19:44:16

by Theodore Ts'o

Subject: Re: [RFC] Ext3 online defrag

On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote:
> That's the wrong way to look at it. if you want the userspace
> process to specify a location, then you should preallocate it first
> before doing anything else. There is no need to clutter a simple
> data mover interface with all sorts of unnecessary error handling.

This is doable, but it adds a huge amount of complexity before we
could implement on-line defragmentation.

First of all, we would need a way of allowing userspace to specify
which blocks should be used in the preallocation.

Secondly, we would need a way of marking blocks as "preallocated but
not pre-zeroed"; otherwise we would have to zero out all of the blocks
in order to assure security (don't want userspace programs seeing the
previous contents of the data blocks), only to do the copy and the
extents vector swap.
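
What a "preallocated but not pre-zeroed" marker buys can be shown with
a toy extent that carries a flag. This is a sketch only: the flag, the
struct and the flat "disk" buffer are assumptions made for
illustration, not an on-disk format. Reads of a flagged extent return
zeros without the blocks ever having been zeroed on disk, and the flag
is cleared once real data has been copied in:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE 8

struct toy_extent {
	uint64_t physical;   /* first block on the toy disk */
	uint32_t len;        /* number of blocks in the extent */
	int      uninit;     /* 1 = preallocated, never written */
};

/* Read block idx of the extent into out[]; stale disk contents never leak. */
void toy_read_block(const char *disk, const struct toy_extent *e,
		    uint32_t idx, char *out)
{
	if (e->uninit)
		memset(out, 0, BLOCKSIZE);   /* reads see zeros */
	else
		memcpy(out, disk + (e->physical + idx) * BLOCKSIZE, BLOCKSIZE);
}

/* Copy len blocks of real data into the extent, then clear the flag. */
void toy_fill_extent(char *disk, struct toy_extent *e, const char *src)
{
	memcpy(disk + e->physical * BLOCKSIZE, src, (size_t)e->len * BLOCKSIZE);
	e->uninit = 0;   /* every block now holds real data */
}
```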

That's a huge amount of work, and while the above two features can be
useful for other things, it's not clear it's worth it to require this
as the only way to implement on-line defragging. You're right that
it's a way of making things be more generic, but it means that each
filesystem needs to have a huge amount of additional complexity and
potential filesystem format changes before they could take advantage
of this general framework.

(For example, you'd never be able to do this with the FAT filesystem,
or the ext2 or ext3 filesystems; it would work for ext4 only *after*
we implement the above mentioned new features and the associated
filesystem format changes.)

Regards,

- Ted

2006-10-24 20:32:48

by Russell Cattelan

Subject: Re: [RFC] Ext3 online defrag

On Tue, 2006-10-24 at 15:44 -0400, Theodore Tso wrote:
> On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote:
> > That's the wrong way to look at it. if you want the userspace
> > process to specify a location, then you should preallocate it first
> > before doing anything else. There is no need to clutter a simple
> > data mover interface with all sorts of unnecessary error handling.
>
> This is doable, but it adds a huge amount of complexity before we
> could implement on-line defragmentation.
>
> First of all, we would need a way of allowing userspace to specify
> which blocks should be used in the preallocation.
>
> Secondly, we would need a way of marking blocks as "preallocated but
> not pre-zeroed"; otherwise we would have to zero out all of the blocks
> in order to assure security (don't want userspace programs seeing the
> previous contents of the data blocks), only to do the copy and the
> extents vector swap

Shouldn't Chris Mason's page placeholder work for direct I/O be applicable to
any preallocations?


--
Russell Cattelan



2006-10-24 23:00:23

by Andreas Dilger

Subject: Re: [RFC] Ext3 online defrag

On Oct 24, 2006 15:44 -0400, Theodore Tso wrote:
> First of all, we would need a way of allowing userspace to specify
> which blocks should be used in the preallocation.

Presumably it could do this in the same way it will be specifying
which blocks to relocate in the defragger - by passing an extent.
You would be required to pass the file offset for which to preallocate,
and optionally an extent for the on-disk allocation itself (if none is
supplied the kernel will allocate the best extent it can).

> Secondly, we would need a way of marking blocks as "preallocated but
> not pre-zeroed"; otherwise we would have to zero out all of the blocks
> in order to assure security (don't want userspace programs seeing the
> previous contents of the data blocks), only to do the copy and the
> extents vector swap.

This could be mitigated by having the preallocation be done (in the
defragment case) against a temporary inode in the orphan list (as
the initial patch did) so if there is a crash it will be released.
The temporary inode will not be linked into the namespace so it cannot
be read - only used to hold preallocation. If this was a write-only
file handle then we should be OK?

For defragger purposes this would need:

- "allocate new temporary inode" (VFS + fs, returns write-only fh if
fs can't properly handle uninitialized extents, or doesn't request
full-extent zeroing)

for each extent to defragment {
- "preallocate extents on temp inode" (fs specific internals)
- "copy data from orig to temp at offset X" (VFS, splice or
e.g. sys_copyfile(src, dst, offset, count) which Linus agreed
to at KS '05 for network filesystems)
- "migrate copied extent to original inode" (fs specific internals)
}

- "free temporary inode" (just close of temp fh, frees unmigrated extents).
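
The "copy data from orig to temp at offset X" step could look roughly like
the sketch below. It is only an illustration: it uses plain pread()/pwrite()
as a stand-in for splice or the proposed sys_copyfile(), and copy_range() is
a hypothetical helper name, not an existing interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define COPY_BUF_SIZE 65536

/*
 * Hypothetical helper: copy `count` bytes starting at `offset` in `src`
 * to the same offset in `dst`. Returns 0 on success, -1 on error or if
 * the source is shorter than requested.
 */
static int copy_range(int src, int dst, off_t offset, size_t count)
{
    assert(src >= 0 && dst >= 0);

    char *buf = malloc(COPY_BUF_SIZE);
    if (buf == NULL)
        return -1;

    int ret = 0;
    while (count > 0 && ret == 0) {
        size_t want = count < COPY_BUF_SIZE ? count : COPY_BUF_SIZE;
        ssize_t got = pread(src, buf, want, offset);
        if (got < 0 && errno == EINTR)
            continue;                 /* interrupted, retry the read */
        if (got <= 0) {               /* read error or unexpected EOF */
            ret = -1;
            break;
        }

        /* pwrite() may write less than asked, so loop until done. */
        ssize_t done = 0;
        while (done < got) {
            ssize_t put = pwrite(dst, buf + done, (size_t)(got - done),
                                 offset + done);
            if (put < 0 && errno == EINTR)
                continue;
            if (put <= 0) {
                ret = -1;
                break;
            }
            done += put;
        }

        offset += got;
        count -= (size_t)got;
    }

    free(buf);
    return ret;
}
```

A real data mover would also need an fsync() of the destination before the
extents are migrated, which this sketch leaves out.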

I don't think this is much more work than implementing all of this
functionality as part of a monolithic online defrag function, assuming
we don't require full-file copies in order to do defrag.

> (For example, you'd never be able to do this with the FAT filesystem,
> or the ext2 or ext3 filesystems; it would work for ext4 only *after*
> we implement the above mentioned new features and the associated
> filesystem format changes.)

Well, ext4 already has stub support for "allocated but uninitialized"
extents. But regardless, I think if we structure the operations as
above we don't need to do very much crazy stuff. It just boils down
to exposing some fs internals (create open-unlink inode, block allocation
with sanity check if on-disk extents are given) via new userspace methods,
and one new bit of code (extent migration with sanity check).

Virtually all of the VFS bits are generally useful and it doesn't require
any funky ability on the part of the filesystem in order to work. We
don't need this to be super performant, so it can do as much locking &
page flushing as it needs to get things correct.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-25 01:20:27

by David Chinner

Subject: Re: [RFC] Ext3 online defrag

On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote:
> On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
> > On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
> > > On Tue, 2006-10-24 at 23:59 +1000, David Chinner wrote:
> > > > That's the wrong way to look at it. if you want the userspace
> > > > process to specify a location, then you should preallocate it first
> > > > before doing anything else. There is no need to clutter a simple
> > > > data mover interface with all sorts of unnecessary error handling.
> > >
> > > You are implying that the 2-step interface, creating a new inode then
> > > swapping the contents, is the only way to implement this.
> >
> > No, it's not the only way to implement it, but it seems the cleanest
> > way to me when you have to consider crash recovery. With a temporary
> > inode, you can create it, hold a reference and then unlink it so
> > that any crash at that point will free the inode and any extents
> > it has on it.
> >
> > The only way I can see anything different working is having the
> > filesystem hold extents somewhere internally that provides us the
> > same recovery guarantees while we copy the data and insert the new
> > extents. This is obviously a filesystem specific solution and is
> > more complex to implement than a swap extent transaction. It
> > probably also needs on disk format changes to support properly....
>
> This is definitely filesystem-dependent. I would think allocating an
> extent would be like any other allocation done by the filesystem, and
> there are already recovery mechanisms for that.

Yes, the allocation would be the same, but that isn't the problem
I was talking about.

The problem is holding a reference to the extent once it has been
allocated while it is having the data copied into it (i.e. before it
is swapped with the original extents) and then holding the original
extents until they are freed. These references need to be
persistent so they can be freed correctly during crash recovery
i.e. roll back the allocation if the extent swap has not been
logged, or free the original blocks if the extent swap has been
logged.

The obvious way to do this is to use an unlinked (orphan) inode....
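
As a rough illustration of that recovery rule, here is a minimal sketch in C.
The names (reloc_state, recovery_action, recover) are hypothetical and do not
correspond to real XFS or ext3 code; the only thing modelled is the decision
itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: which set of blocks crash recovery has to free. */
enum recovery_action {
    FREE_NEW_EXTENTS,       /* swap never logged: roll back the allocation */
    FREE_ORIGINAL_EXTENTS   /* swap logged: the old blocks are now unused */
};

/* Hypothetical persistent state of one in-flight relocation. */
struct reloc_state {
    bool swap_logged;       /* did the extent swap transaction reach the log? */
};

/* Decide what to free after a crash, following the rule described above. */
static enum recovery_action recover(const struct reloc_state *st)
{
    assert(st != NULL);
    return st->swap_logged ? FREE_ORIGINAL_EXTENTS : FREE_NEW_EXTENTS;
}
```

With an unlinked temporary inode no explicit decision is needed: the inode
holds the new extents before the swap and the original ones after it, so
freeing whatever it holds at recovery time gives the same result.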

> > > > Once you've separated the destination allocation from the data
> > > > mover, the mover is basically a splice copy from source to
> > > > destination, an fsync and then an atomic swap blocks/extents operation.
> > > > Most of this code is generic, and a per-fs swap-extents vector
> > > > could be easily provided for the one bit that is not....
> > >
> > > The benefit of having such a simple data mover is negated by moving the
> > > complexity into the allocator.
> >
> > What complexity does it introduce that the allocator doesn't already
> > have or needs to provide for the single call interface to work?
>
> I don't see it as any more or less complex than a single interface.

Ok, I thought I was missing something there.

> > The allocation interface needs to be able to be extended
> > independently of the data mover interface. XFS already exposes
> > allocation ioctls to userspace for preallocation and we've got plans
> > to extend this further to allow userspace controlled allocation for
> > smart defrag tools for XFS. Tying allocation to the data mover
> > just makes the interface less flexible and harder to do anything
> > smart with....
>
> Okay. It would be nice to standardize the interface so we don't have
> every filesystem introducing new ioctls.

Well, that will be an interesting challenge. I'm sure that there
is a common subset that all filesystems can implement e.g. per
file preallocation (something like XFS's allocate/reserve/free space
ioctls) to provide kernel support for posix_fallocate(), etc.

However, we may end up exposing enough of XFS's current allocation
semantics to do things like telling the filesystem to allocate in
allocation group 6, near block number 0x32482 within the AG, falling
back to searching for the nearest match to the size requirement,
failing that look for something larger than the minimum size
specified, and then fail if you can't find a match in that AG.

That makes little sense to any filesystem but XFS, which is really
why I think that the smarter allocation interfaces are going to
remain filesystem specific....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-25 02:09:59

by David Chinner

Subject: Re: [RFC] Ext3 online defrag

On Tue, Oct 24, 2006 at 03:44:16PM -0400, Theodore Tso wrote:
> On Tue, Oct 24, 2006 at 11:59:28PM +1000, David Chinner wrote:
> > That's the wrong way to look at it. if you want the userspace
> > process to specify a location, then you should preallocate it first
> > before doing anything else. There is no need to clutter a simple
> > data mover interface with all sorts of unnecessary error handling.
>
> This is doable, but it adds a huge amount of complexity before we
> could implement on-line defragmentation.
>
> First of all, we would need a way of allowing userspace to specify
> which blocks should be used in the preallocation.

Not initially. Create a file, and call posix_fallocate() on it.
Later, the filesystem can provide something that the defrag tool can
use for fine-grained control of where the preallocated blocks are on
disk.

> Secondly, we would need a way of marking blocks as "preallocated but
> not pre-zeroed"; otherwise we would have to zero out all of the blocks
> in order to assure security (don't want userspace programs seeing the
> previous contents of the data blocks), only to do the copy and the
> extents vector swap.

The unlinked inode method avoids this problem because no user space
process can see the inode to open it. Also, posix_fallocate() zeroes
the disk blocks so even this protects against data exposure.
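
A minimal userspace sketch of this "unlinked inode plus posix_fallocate()"
approach, assuming the temporary file is created in a directory on the same
filesystem as the file being defragmented. make_prealloc_tmp() and the
"defrag-tmp-" name are hypothetical; mkstemp(), unlink() and
posix_fallocate() are the standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file in `dir`, unlink it right
 * away so that no other process can open it by name, and preallocate `len`
 * bytes. Returns an open file descriptor, or -1 with errno set.
 */
static int make_prealloc_tmp(const char *dir, off_t len)
{
    assert(dir != NULL && len > 0);

    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/defrag-tmp-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* From here on only the open descriptor keeps the inode alive. */
    if (unlink(path) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* posix_fallocate() returns an error number and does not set errno. */
    int err = posix_fallocate(fd, 0, len);
    if (err != 0) {
        close(fd);
        errno = err;
        return -1;
    }
    return fd;
}
```

If the process dies, closing the descriptor releases the inode and its
blocks. Recovery after a system crash depends on the filesystem's own
orphan handling.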

So, now all that remains for an initial implementation is the swap
extents transaction and the data mover syscall.

For a smart, fast implementation, I agree that you need unwritten
extents (which XFS already has), then a fast filesystem
implementation of posix_fallocate() that utilises unwritten extents
(which XFS already has), and finally another interface that allows
you to allocate unwritten extents in an arbitrary location within
the filesystem (which no filesystem currently has).

> That's a huge amount of work, and while the above two features can be
> useful for other things, it's not clear it's worth it to require this
> as the only way to implement on-line defragging. You're right that
> it's a way of making things be more generic, but it means that each
> filesystem needs to have a huge amount of additional complexity and
> potential filesystem format changes before they could take advantage
> of this general framework.

I disagree - it's not a huge amount of work to get something
working and to solidify the generic interfaces, and the only format
change is a new transaction. Any filesystem that supports the swap
extent/blocks method would then work better than XFS's current
online defrag tool, which currently does not use preallocation,
nor does it use splice.....

> (For example, you'd never be able to do this with the FAT filesystem,
> or the ext2 or ext3 filesystems; it would work for ext4 only *after*
> we implement the above mentioned new features and the associated
> filesystem format changes.)

Sure, but they can use the slow, unoptimised posix_fallocate() method
for allocating disk space....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-25 02:26:34

by Barry Naujok

Subject: RE: [RFC] Ext3 online defrag



On Wed, 25 Oct 2006 11:19 AM, David Chinner wrote:
> On Tue, Oct 24, 2006 at 11:26:26AM -0500, Dave Kleikamp wrote:
> > On Wed, 2006-10-25 at 02:01 +1000, David Chinner wrote:
> > > On Tue, Oct 24, 2006 at 09:51:41AM -0500, Dave Kleikamp wrote:
> > > The allocation interface needs to be able to be extended
> > > independently of the data mover interface. XFS already exposes
> > > allocation ioctls to userspace for preallocation and we've got plans
> > > to extend this further to allow userspace controlled allocation for
> > > smart defrag tools for XFS. Tying allocation to the data mover
> > > just makes the interface less flexible and harder to do anything
> > > smart with....
> >
> > Okay. It would be nice to standardize the interface so we don't have
> > every filesystem introducing new ioctls.
>
> Well, that will be an interesting challenge. I'm sure that there
> is a common subset that all filesystems can implement e.g. per
> file preallocation (something like XFS's allocate/reserve/free space
> ioctls) to provide kernel support for posix_fallocate(), etc.
>
> However, we may end up exposing enough of XFS's current allocation
> semantics to do things like telling the filesystem to allocate in
> allocation group 6, near block number 0x32482 within the AG, falling
> back to searching for the nearest match to the size requirement,
> failing that look for something larger than the minimum size
> specified, and then fail if you can't find a match in that AG.
>
> That makes little sense to any filesystem but XFS, which is really
> why I think that the smarter allocation interfaces are going to
> remain filesystem specific....

Could we have a more abstract method for asking the filesystem where the
free blocks are and then using the same block addressing to tell the
fs where to allocate/move the file's data to?

2006-10-25 02:42:57

by Jeff Garzik

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote:
> Could we have a more abstract method for asking the filesystem where the
> free blocks are and then using the same block addressing to tell the
> fs where to allocate/move the file's data to?

That's fundamentally racy, so you might as well just read the
filesystem metadata from userspace. No need to go through the kernel
for that.

Jeff




2006-10-25 04:27:53

by David Chinner

Subject: Re: [RFC] Ext3 online defrag

On Tue, Oct 24, 2006 at 10:42:57PM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote:
> > Could we have a more abstract method for asking the filesystem where the
> > free blocks are and then using the same block addressing to tell the
> > fs where to allocate/move the file's data to?
>
> That's fundamentally racy, so you might as well just read the
> filesystem metadata from userspace. No need to go through the kernel
> for that.

But it is a race that is _easily_ handled, and applications only need to
implement one interface, not a different method for every
filesystem that requires deep filesystem knowledge.

Besides, you still have to handle the case where the block you want
has already been allocated because reading the metadata from
userspace doesn't prevent the kernel from allocating the block you
want before you ask for it...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-25 04:48:44

by Jeff Garzik

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
> But it is a race that is _easily_ handled, and applications only need to
> implement one interface, not a different method for every
> filesystem that requires deep filesystem knowledge.
>
> Besides, you still have to handle the case where the block you want
> has already been allocated because reading the metadata from
> userspace doesn't prevent the kernel from allocating the block you
> want before you ask for it...

The race is easily handled either way, by having the block move fail
when you tell the kernel the destination blocks.
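
The "fail the move" handling can be pictured with a toy in-memory bitmap.
This is a sketch only: toy_bitmap and claim_blocks() are hypothetical names,
and a real filesystem would do the check and the allocation under its own
locking, inside one transaction.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 64

/* Toy stand-in for a filesystem block bitmap; true means "in use". */
struct toy_bitmap {
    bool used[NBLOCKS];
};

/*
 * Claim blocks [start, start + count) all-or-nothing.
 * Returns 0 on success, -EBUSY if any requested block is already in use
 * (userspace lost the race and must pick other blocks), or -EINVAL if the
 * range is empty or out of bounds.
 */
static int claim_blocks(struct toy_bitmap *bm, size_t start, size_t count)
{
    assert(bm != NULL);

    if (count == 0 || start >= NBLOCKS || count > NBLOCKS - start)
        return -EINVAL;

    /* First pass: check only, so a failure leaves the bitmap untouched. */
    for (size_t i = start; i < start + count; i++)
        if (bm->used[i])
            return -EBUSY;

    /* Second pass: every block was free, so mark them all allocated. */
    for (size_t i = start; i < start + count; i++)
        bm->used[i] = true;

    return 0;
}
```

Whether userspace found the free blocks through a kernel interface or by
reading the metadata itself, the final check looks the same.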

The difference is that you don't unnecessarily bloat the kernel.

Every major filesystem has a libfoofs library that makes it trivial to
read the metadata, so all you need to do is use an existing lib.

Jeff




2006-10-25 05:39:50

by David Chinner

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
> > But it is a race that is _easily_ handled, and applications only need to
> > implement one interface, not a different method for every
> > filesystem that requires deep filesystem knowledge.
> >
> > Besides, you still have to handle the case where the block you want
> > has already been allocated because reading the metadata from
> > userspace doesn't prevent the kernel from allocating the block you
> > want before you ask for it...
>
> The race is easily handled either way, by having the block move fail
> when you tell the kernel the destination blocks.

So why are you arguing that an interface is no good because it
is fundamentally racy? ;)

> The difference is that you don't unnecessarily bloat the kernel.

By that argument, we should rip out the bmap interface (FIBMAP)
because you can get all that information by reading the metadata
from userspace.....

> Every major filesystem has a libfoofs library that makes it trivial to
> read the metadata, so all you need to do is use an existing lib.

IOWs, you are advocating that any application that wants to use this
special allocation technique needs to link against every different
filesystem library and it then needs to implement filesystem
specific searches through their metadata? Nobody in their right
mind would ever want to use an interface like this.

Also, this simply doesn't work for XFS because the cached metadata
is in a different address space to the block device. Hence it can be
tens of seconds between the kernel modifying a metadata buffer and
userspace being able to see that modification. You need to freeze
the filesystem for the XFS userspace tools to guarantee a
consistent view of an online filesystem from the block device.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-25 06:01:44

by Jeff Garzik

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:
> > > But it is a race that is _easily_ handled, and applications only need to
> > > implement one interface, not a different method for every
> > > filesystem that requires deep filesystem knowledge.
> > >
> > > Besides, you still have to handle the case where the block you want
> > > has already been allocated because reading the metadata from
> > > userspace doesn't prevent the kernel from allocating the block you
> > > want before you ask for it...
> >
> > The race is easily handled either way, by having the block move fail
> > when you tell the kernel the destination blocks.
>
> So why are you arguing that an interface is no good because it
> is fundamentally racy? ;)

My point was that it is silly to introduce obviously racy code into the
kernel, when -- inside the kernel -- it could be handled race-free.

If you accept a racy solution, you might as well do it outside the
kernel, where you get the same results, but without adding silliness and
bloat to the kernel.


> > Every major filesystem has a libfoofs library that makes it trivial to
> > read the metadata, so all you need to do is use an existing lib.
>
> IOWs, you are advocating that any application that wants to use this
> special allocation technique needs to link against every different
> filesystem library and it then needs to implement filesystem
> specific searches through their metadata? Nobody in their right
> mind would ever want to use an interface like this.

Online defrag is OBVIOUSLY highly filesystem specific. You have to link
against filesystem specific code somewhere, whether it's inside the
kernel or outside the kernel.

Further, in the case being discussed in this thread, ext2meta has
already been proven a workable solution.

Jeff

2006-10-25 08:16:03

by David Chinner

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > So why are you arguing that an interface is no good because it
> > is fundamentally racy? ;)
>
> My point was that it is silly to introduce obviously racy code into the
> kernel, when -- inside the kernel -- it could be handled race-free.

So how do you then get the generic interface to allocate blocks
specified by userspace race free?

> > > Every major filesystem has a libfoofs library that makes it trivial to
> > > read the metadata, so all you need to do is use an existing lib.
> >
> > IOWs, you are advocating that any application that wants to use this
> > special allocation technique needs to link against every different
> > filesystem library and it then needs to implement filesystem
> > specific searches through their metadata? Nobody in their right
> > mind would ever want to use an interface like this.
>
> Online defrag is OBVIOUSLY highly filesystem specific.

Parts of it are, but data movement and allocation hints need to be
provided by every filesystem that wants to implement this
efficiently. These features are also useful outside of defrag as
well - I can think of several applications that would benefit from
being able to direct where in the filesystem they want data to
reside.

If userspace directed allocation requires deep knowledge of the
filesystem metadata (this is what you are saying they need to do,
right?), then these applications will never, ever make use of this
interface and we'll continue to have problems with them.

I guess my point is that we are going to implement features like
this in XFS and if other filesystems are going to be doing the same
thing then we should try to come up with generic solutions rather
than reinvent the wheel over and over again.

> Further, in the case being discussed in this thread, ext2meta has
> already been proven a workable solution.

Sure, but that's not a generic solution to a problem common to
all filesystems....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-25 14:54:51

by Jan Kara

Subject: Re: [RFC] Ext3 online defrag

> On Oct 24, 2006 15:44 -0400, Theodore Tso wrote:
> > First of all, we would need a way of allowing userspace to specify
> > which blocks should be used in the preallocation.
>
> Presumably it could do this in the same way it will be specifying
> which blocks to relocate in the defragger - by passing an extent.
> You would be required to pass the file offset for which to preallocate,
> and optionally an extent for the on-disk allocation itself (if none is
> supplied the kernel will allocate the best extent it can).
>
> > Secondly, we would need a way of marking blocks as "preallocated but
> > not pre-zeroed"; otherwise we would have to zero out all of the blocks
> > in order to assure security (don't want userspace programs seeing the
> > previous contents of the data blocks), only to do the copy and the
> > extents vector swap.
>
> This could be mitigated by having the preallocation be done (in the
> defragment case) against a temporary inode in the orphan list (as
> the initial patch did) so if there is a crash it will be released.
> The temporary inode will not be linked into the namespace so it cannot
> be read - only used to hold preallocation. If this was a write-only
> file handle then we should be OK?
>
> For defragger purposes this would need:
>
> - "allocate new temporary inode" (VFS + fs, returns write-only fh if
> fs can't properly handle uninitialized extents, or doesn't request
> full-extent zeroing)
>
> for each extent to defragment {
> - "preallocate extents on temp inode" (fs specific internals)
> - "copy data from orig to temp at offset X" (VFS, splice or
> e.g. sys_copyfile(src, dst, offset, count) which Linus agreed
> to at KS '05 for network filesystems)
> - "migrate copied extent to original inode" (fs specific internals)
> }
>
> - "free temporary inode" (just close of temp fh, frees unmigrated extents).
Yes, this sounds feasible. We could split the defrag ioctl into two
pieces (addition of a given extent to a file and swapping of extents), which
can have a generic interface...

> I don't think this is much more work than implementing all of this
> functionality as part of a monolithic online defrag function, assuming
> we don't require full-file copies in order to do defrag.
Yes, it's not more work than supporting swapping of extents in the
middle of the file. I just haven't decided yet how to handle indirect
blocks when relocating in the middle of the file. Should they be
relocated or shouldn't they? Probably they should be relocated at least
when they are fully contained in the relocated interval or, better
said, when all the blocks they reference are also in the interval
(this also handles the case of EOF). But still, if you would like to
relocate the file in parts, this is not quite what you want (you won't be
able to relocate indirect blocks on the boundaries of intervals) :(.
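
The rule for indirect blocks could be sketched roughly as below. This is an
illustration under assumptions, not code from the patch: blk_range and
should_move_indirect() are hypothetical names, and ranges are half-open
intervals of logical file blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open range [start, end) of logical file blocks. */
struct blk_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Hypothetical check: relocate an indirect block only if every block it
 * references lies inside the relocated interval.
 *   covers      - logical blocks this indirect block can map
 *   reloc       - interval of the file being relocated
 *   file_blocks - file length in blocks; slots past EOF reference nothing
 */
static bool should_move_indirect(struct blk_range covers,
                                 struct blk_range reloc,
                                 uint64_t file_blocks)
{
    assert(covers.start <= covers.end && reloc.start <= reloc.end);

    /* Clamp to EOF so a partially filled last indirect block qualifies. */
    uint64_t end = covers.end < file_blocks ? covers.end : file_blocks;
    if (covers.start >= end)
        return false;           /* maps nothing that exists */

    return reloc.start <= covers.start && end <= reloc.end;
}
```

For example, with 1 KiB blocks the first indirect block maps logical blocks
12 to 267. Relocating blocks 0 to 199 of a 1000-block file leaves that
indirect block in place, which is the boundary problem with relocating a
file in parts.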

Honza
--
Jan Kara
SuSE CR Labs

2006-10-25 17:00:58

by Jeff Garzik

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > > So why are you arguing that an interface is no good because it
> > > is fundamentally racy? ;)
> >
> > My point was that it is silly to introduce obviously racy code into the
> > kernel, when -- inside the kernel -- it could be handled race-free.
>
> So how do you then get the generic interface to allocate blocks
> specified by userspace race free?

As has been repeatedly stated, there is no "generic". There MUST be
filesystem-specific knowledge during these operations.


> If userspace directed allocation requires deep knowledge of the
> filesystem metadata (this is what you are saying they need to do,
> right?), then these applications will never, ever make use of this
> interface and we'll continue to have problems with them.

Completely false assumptions. There is no difference in handling of
knowledge, be it kernel space or userspace.


> > Further, in the case being discussed in this thread, ext2meta has
> > already been proven a workable solution.
>
> Sure, but that's not a generic solution to a problem common to
> all filesystems....

You clearly don't know what I'm talking about. ext2meta is an example
of a filesystem-specific metadata access method, applicable to tasks
such as online optimization.

Implement that tiny kernel module for each filesystem, and you have
everything you need, without races. This was discussed years ago;
review the mailing lists. Google for 'Alexander Viro' and 'ext2meta'.

Jeff

2006-10-25 17:02:24

by Jeff Garzik

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
> Yes, this sounds feasible. We could split the defrag ioctl into two
> pieces (addition of a given extent to a file and swapping of extents), which
> can have a generic interface...

An ioctl is UGLY.

This was discussed years ago. Google for 'Alexander Viro' and
'ext2meta'. That's a clean, flexible, extensible way to access metadata
online. No need for ioctl binary translation across 32bit<->64bit, or
any other ioctl issue.
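
To illustrate the 32bit<->64bit translation issue: the ext3_file_move_data
structure proposed earlier in the thread embeds a userspace pointer, whose
size differs between 32-bit and 64-bit processes, so a 64-bit kernel has to
translate it for 32-bit callers. One common way to sidestep that in an ioctl
argument is a fixed-width layout. The sketch below is hypothetical
(move_data_fixed, its fields and ptr_to_u64() are invented for illustration)
and is not part of the posted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ioctl argument with the same layout on 32-bit and 64-bit. */
struct move_data_fixed {
    uint32_t extents;    /* number of entries in the extent array */
    uint32_t reserved;   /* explicit padding instead of compiler padding */
    uint64_t ext_array;  /* userspace address of the array, as an integer */
};

/* Fails the build if the layout ever stops being ABI independent. */
_Static_assert(sizeof(struct move_data_fixed) == 16,
               "layout must not depend on pointer size");

/* Carry a pointer in a 64-bit field regardless of the caller's word size. */
static uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```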

Jeff




2006-10-25 17:58:51

by Jan Kara

Subject: Re: [RFC] Ext3 online defrag

> On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:
> > Yes, this sounds feasible. We could split the defrag ioctl into two
> > pieces (addition of a given extent to a file and swapping of extents), which
> > can have a generic interface...
>
> An ioctl is UGLY.
Agreed.

> This was discussed years ago. Google for 'Alexander Viro' and
> 'ext2meta'. That's a clean, flexible, extensible way to access metadata
> online. No need for ioctl binary translation across 32bit<->64bit, or
> any other ioctl issue.
I've briefly looked at this and this kind of interface has some
appeal. On the other hand, it's not obvious to me how to implement in
this interface an *atomic* operation "copy data from file F to a given set
of blocks and replace the pointers to the original blocks with pointers to
the new blocks". Something like this is needed for what we want to do...
Also, if we'd like to implement an operation like "add this block to file F
at position P", we have to make sure that all the necessary updates
(bitmap updates, inode updates, indirect block updates) go into one
transaction. That basically means that either ext3meta has to have a way
to do this in a single operation, or we have to give userspace a way
to start/stop a transaction, and that starts to be a real mess because of
various deadlocks and so on.

Honza
--
Jan Kara
SuSE CR Labs

2006-10-25 18:08:21

by Jeff Garzik

Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote:
> I've briefly looked at this and this kind of interface has some
> appeal. On the other hand, it's not obvious to me how to implement in
> this interface an *atomic* operation "copy data from file F to a given set
> of blocks and replace the pointers to the original blocks with pointers to
> the new blocks". Something like this is needed for what we want to do...
> Also, if we'd like to implement an operation like "add this block to file F
> at position P", we have to make sure that all the necessary updates
> (bitmap updates, inode updates, indirect block updates) go into one
> transaction. That basically means that either ext3meta has to have a way
> to do this in a single operation, or we have to give userspace a way
> to start/stop a transaction, and that starts to be a real mess because of
> various deadlocks and so on.

Agreed, these issues exist. But these issues exist independent of
whether an ioctl or ext3meta is used. It's all the responsibility
of the implementor to define the interface.

My contention is that the ext3meta interface method would be much more
robust than ioctl. It's a namespace inside which you can define any
inodes/dirents you wish, for the operations you desire.

Heck, according to my sf.net/projects/gkernel CVS log, you offered
some helpful review comments to me when I was implementing ext2meta ;-)

Jeff




2006-10-25 18:25:30

by Jan Kara

Subject: Re: [RFC] Ext3 online defrag

> On Wed, Oct 25, 2006 at 07:58:51PM +0200, Jan Kara wrote:
> > I've briefly looked at this and this kind of interface has some
> > appeal. On the other hand it's not obvious to me, how to implement in
> > this interface *atomic* operation "copy data from file F to given set of
> > blocks and rewrite pointers to original blocks with pointers to new
> > blocks". Something like this is needed for what we want to do...
> > Also if we'd like to implement operation like "add this block to file F
> > at position P" we have to make sure that all the necessary updates
> > (bitmap updates, inode updates, indirect block updates) go into one
> > transaction. Which basically means that either ext3meta has to have a way
> > how to do this in a single operation, or we have to give userspace a way
> > to start/stop transaction and that starts to be really a mess because of
> > various deadlocks and so on.
>
> Agreed, these issues exist. But they exist independently of
> whether an ioctl or ext3meta is used. It's all the responsibility
> of the implementor to define the interface.
>
> My contention is that ext3meta interface method would be much more
> robust than ioctl. It's a namespace inside which you can define any
> inodes/dirents you wish, for the operations you desire.
I see. So you mean that in our ext3meta filesystem we'd have a file
named "add_this_extent_to_inode" and a file "reloc_inode_interval" and
they'd be fed essentially the same info as the current ioctl interface and
do the same thing as we currently do. Hmm, I don't find it that nice any
more but yes, this would work.

> Heck, according to my sf.net/projects/gkernel CVS log, you offered
> some helpful review comments to me when I was implementing ext2meta ;-)
Looking at those mails it was already quite some time ago so I
forgot about it ;)
Honza
--
Jan Kara
SuSE CR Labs

2006-10-25 18:33:06

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 08:25:30PM +0200, Jan Kara wrote:
> I see. So you mean that in our ext3meta filesystem we'd have a file
> named "add_this_extent_to_inode" and a file "reloc_inode_interval" and
> they'd be fed essentially the same info as the current ioctl interface and
> do the same thing as we currently do. Hmm, I don't find it that nice any
> more but yes, this would work.

It depends on the operation. ext2meta[1] works fine for online
defrag, just exporting metadata objects and providing read(2)
and write(2) operations on them. Adding 'trigger' files (like your
add_this_extent_to_inode) may make sense for some operations, indeed,
but we need to see the whole picture before really understanding
whether that interface is optimal.
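
For illustration only, here is a minimal sketch of the kind of fixed-layout
record a 'trigger' file might accept through write(2). Everything in it is an
assumption: the struct name, the field names and the little-endian wire format
are hypothetical and are not taken from ext2meta or from Jan's patch. It
carries the inode number and generation and uses explicit 64-bit block
numbers, as suggested earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for a trigger file such as "reloc_inode_interval".
 * Field names and layout are illustrative only. */
struct reloc_request {
    uint32_t ino;         /* inode to operate on */
    uint32_t generation;  /* guards against the inode being reused */
    uint64_t logical;     /* first logical block of the interval */
    uint64_t physical;    /* first destination block on disk */
    uint32_t count;       /* number of blocks to relocate */
};

#define RELOC_WIRE_SIZE 28  /* 4 + 4 + 8 + 8 + 4 bytes, no padding */

/* Store the low nbytes of v at p, least significant byte first. */
static void put_le(uint8_t *p, uint64_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Serialize r into buf as little-endian fields.
 * Returns the number of bytes produced, or 0 if buf is too small. */
size_t reloc_request_encode(const struct reloc_request *r,
                            uint8_t *buf, size_t len)
{
    assert(r != NULL && buf != NULL);
    if (len < RELOC_WIRE_SIZE)
        return 0;
    put_le(buf + 0,  r->ino,        4);
    put_le(buf + 4,  r->generation, 4);
    put_le(buf + 8,  r->logical,    8);
    put_le(buf + 16, r->physical,   8);
    put_le(buf + 24, r->count,      4);
    return RELOC_WIRE_SIZE;
}
```

Userspace would then hand the encoded buffer to a single write(2) on the
trigger file, so the kernel sees the whole request at once and can perform it
in one transaction.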

Jeff


[1] http://linux.yyz.us/misc/ext2meta.c

2006-10-25 18:36:57

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

> On Oct 23, 2006 18:03 +0200, Jan Kara wrote:
> > Andreas Dilger wrote:
> > > I would in fact go so far as to allow only a single extent to be specified
> > > per call. This is to avoid the passing of any pointers as part of the
> > > interface (hello ioctl police :-), and also makes the kernel code simpler.
> > > I don't think the syscall/ioctl overhead is significant compared to the
> > > journal and IO overhead.
> >
> > ...it makes it kind of
> > harder to tell where indirect blocks would go - and it would be
> > impossible for the defragmenter to force some unusual placement of
> > indirect blocks...
>
> It would be possible to specify indirect block relocation in same manner
> as regular block relocation I think. Allocate a new block, copy contents,
> flush block from cache, fix up reference (inode, dindirect), commit.
Yes, but there's a question of the interface to this operation. How to
specify which indirect block I mean? Obviously we could introduce
separate call for remapping indirect blocks but I find this solution
kind of clumsy...

Bye
Honza
--
Jan Kara
SuSE CR Labs

2006-10-25 18:41:18

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
> Yes, but there's a question of the interface to this operation. How to
> specify which indirect block I mean? Obviously we could introduce
> separate call for remapping indirect blocks but I find this solution
> kind of clumsy...

Agreed... that gets nasty real quick.

Jeff




2006-10-26 01:41:39

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > > On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
> > > > On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
> > > > So why are you arguing that an interface is no good because it
> > > > is fundamentally racy? ;)
> > >
> > > My point was that it is silly to introduce obviously racy code into the
> > > kernel, when -- inside the kernel -- it could be handled race-free.
> >
> > So how do you then get the generic interface to allocate blocks
> > specified by userspace race free?
>
> As has been repeatedly stated, there is no "generic". There MUST be
> filesystem-specific knowledge during these operations.

What information? All we need to know is where the free disk space
is, and have a method to attempt to allocate from it. That's _easy_
to abstract into a common interface via the VFS....

> > > Further, in the case being discussed in this thread, ext2meta has
> > > already been proven a workable solution.
> >
> > Sure, but that's not a generic solution to a problem common to
> > all filesystems....
>
> You clearly don't know what I'm talking about. ext2meta is an example
> of a filesystem-specific metadata access method, applicable to tasks
> such as online optimization.

I know exactly what ext2meta is. I said it's not a generic solution
and you say it's a filesystem-specific solution. I think we're
agreeing here. ;)

We don't need to expose anything filesystem specific to userspace to
implement this. Online data movement (i.e. the defrag mechanism)
becomes something like:

    do {
        get_free_list(dst_fd, location, len, list)
        /* select extent to use */
        alloc_from_list(dst_fd, list[X], off, len)
    } while (ENOALLOC)
    move_data(src_fd, dst_fd, off, len);

And this would work on any filesystem type that implemented these
interfaces. Hence tools like a startup file optimiser would
only need to be written once, rather than needing a different
tool for every different filesystem type.....
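
A minimal sketch of the allocation retry loop above, run against a toy
in-memory block map rather than a real filesystem. The names get_free_list()
and alloc_from_list() and the ENOALLOC error come from the pseudocode; the
signatures, the toyfs structure, the "steal" field that simulates a concurrent
allocator, and the alloc_near() wrapper are all assumptions made for
illustration. move_data() is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define NBLOCKS  64
#define ENOALLOC 1000  /* hypothetical error code, named in the pseudocode */

struct extent { int start, len; };

/* Toy stand-in for a filesystem: one byte per block, nonzero = in use. */
struct toyfs {
    unsigned char used[NBLOCKS];
    int steal;  /* block a "concurrent" allocator takes before our next
                   allocation attempt, or -1 for none */
};

/* Report every free run of at least len blocks; returns the run count. */
int get_free_list(const struct toyfs *fs, int len,
                  struct extent *list, int max)
{
    int n = 0;
    for (int i = 0; i < NBLOCKS && n < max; ) {
        if (fs->used[i]) { i++; continue; }
        int j = i;
        while (j < NBLOCKS && !fs->used[j])
            j++;
        if (j - i >= len) {
            list[n].start = i;
            list[n].len = j - i;
            n++;
        }
        i = j;
    }
    return n;
}

/* Claim len blocks at start; fails if any of them was taken meanwhile. */
int alloc_from_list(struct toyfs *fs, int start, int len)
{
    assert(start >= 0 && len > 0 && start + len <= NBLOCKS);
    if (fs->steal >= 0) {           /* simulate losing a race, once */
        fs->used[fs->steal] = 1;
        fs->steal = -1;
    }
    for (int i = start; i < start + len; i++)
        if (fs->used[i])
            return -ENOALLOC;
    for (int i = start; i < start + len; i++)
        fs->used[i] = 1;
    return 0;
}

/* The retry loop: returns the first allocated block, or -ENOSPC. */
int alloc_near(struct toyfs *fs, int near, int len)
{
    struct extent list[NBLOCKS];
    for (;;) {
        int n = get_free_list(fs, len, list, NBLOCKS);
        if (n == 0)
            return -ENOSPC;
        int best = 0;               /* select the run starting closest to near */
        for (int i = 1; i < n; i++)
            if (abs(list[i].start - near) < abs(list[best].start - near))
                best = i;
        if (alloc_from_list(fs, list[best].start, len) == 0)
            return list[best].start;
        /* -ENOALLOC: the free list went stale, fetch a fresh one and retry */
    }
}
```

The point of the design is that the free list is only a hint: the allocation
call is the one place where the race is resolved, and userspace simply
refreshes the list and tries again when it loses.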

Remember, I'm not just talking about defrag - I'm talking about
an interface that is actually useful to apps that might care
about how data is laid out on disk but the application writers
don't know anything about how filesystem X or Y or Z is
implemented. Putting the burden of learning about filesystem
internals on application developers is not the correct solution.

I see substantial benefit moving forward from having filesystem
independent interfaces. Many features that filesystems implement
are common, and as time goes on the common feature set of the
different filesystems gets larger. So why shouldn't we be
trying to make common operations generic so that every filesystem
can benefit from the latest and greatest tool?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-26 03:33:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
> We don't need to expose anything filesystem specific to userspace to
> implement this. Online data movement (i.e. the defrag mechanism)
> becomes something like:
>
> do {
> get_free_list(dst_fd, location, len, list)
> /* select extent to use */
> alloc_from_list(dst_fd, list[X], off, len)
> } while (ENOALLOC)
> move_data(src_fd, dst_fd, off, len);
>
> And this would work on any filesystem type that implemented these
> interfaces. Hence tools like a startup file optimiser would
> only need to be written once, rather than needing a different
> tool for every different filesystem type.....

Yeah, but that's simply not enough. A good defragger needs to know
about a filesystem's allocation policies, and move files so they are
optimally located, given the filesystem layout. For example, in
ext2/3/4 we will want to move blocks so they are in the same block group
as the inode. That's filesystem specific information; other
filesystems will require different policies.

> Remember, I'm not just talking about defrag - I'm talking about
> an interface that is actually useful to apps that might care
> about how data is laid out on disk but the application writers
> don't know anything about how filesystem X or Y or Z is
> implemented. Putting the burden of learning about filesystem
> internals on application developers is not the correct solution.

Unfortunately, a defragger *has* to know about some very low-level,
filesystem-specific information if it wants to do a good job.

- Ted

2006-10-26 06:37:56

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Wed, Oct 25, 2006 at 11:33:16PM -0400, Theodore Tso wrote:
> On Thu, Oct 26, 2006 at 11:40:20AM +1000, David Chinner wrote:
> > We don't need to expose anything filesystem specific to userspace to
> > implement this. Online data movement (i.e. the defrag mechanism)
> > becomes something like:
> >
> > do {
> > get_free_list(dst_fd, location, len, list)
> > /* select extent to use */
> > alloc_from_list(dst_fd, list[X], off, len)
> > } while (ENOALLOC)
> > move_data(src_fd, dst_fd, off, len);
> >
> > And this would work on any filesystem type that implemented these
> > interfaces. Hence tools like a startup file optimiser would
> > only need to be written once, rather than needing a different
> > tool for every different filesystem type.....
>
> Yeah, but that's simply not enough.

Not enough for what?

> A good defragger needs to know

Oh, we're back to defrag again. :/

> about a filesystem's allocation policies, and move files so they are
> optimally located, given the filesystem layout. For example, in
> ext2/3/4 we will want to move blocks so they are in the same block group
> as the inode. That's filesystem specific information; other
> filesystems will require different policies.

A good chunk of those policies will be common. The above policy
has been around for many, many years and is implemented in many, many
filesystems (even XFS).

> > get_free_list(dst_fd, location, len, list)

location == allocation policy, e.g. give me a list of free blocks:

- anywhere (default filesystem policy applies)
- near block number X
- at block X
- in block/allocation group Y
- of the largest contiguous regions in (one of the above)
- at least N blocks in length
- near inode src_fd
- in storage tier 3

Then you select one of the regions that was returned and attempt
to allocate it.
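
A rough sketch of how such a "location" argument and the selection step could
look. This is not an existing kernel interface: the enum, the struct names and
loc_pick() are hypothetical, only a few of the policies listed above are
modelled, and BLOCKS_PER_GROUP assumes an ext3-like layout with 4 KiB blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCKS_PER_GROUP 32768u  /* assumption: ext3-like, 4 KiB blocks */

enum loc_kind { LOC_ANYWHERE, LOC_NEAR_BLOCK, LOC_AT_BLOCK, LOC_IN_GROUP };

struct location {
    enum loc_kind kind;
    uint64_t arg;      /* block number or group number, depending on kind */
    uint64_t min_len;  /* "at least N blocks in length"; 0 = no minimum */
};

struct free_extent { uint64_t start, len; };

static uint64_t distance(uint64_t a, uint64_t b)
{
    return a > b ? a - b : b - a;
}

/* Nonzero if extent e satisfies the hard constraints of loc.
 * LOC_IN_GROUP only looks at where the extent starts. */
int loc_matches(const struct location *loc, const struct free_extent *e)
{
    if (e->len < loc->min_len)
        return 0;
    switch (loc->kind) {
    case LOC_AT_BLOCK:
        return e->start <= loc->arg && loc->arg < e->start + e->len;
    case LOC_IN_GROUP:
        return e->start / BLOCKS_PER_GROUP == loc->arg;
    default:
        return 1;  /* ANYWHERE and NEAR_BLOCK only affect ranking */
    }
}

/* Pick one extent from the returned list: the closest start for
 * LOC_NEAR_BLOCK, otherwise the longest match. Returns an index or -1. */
int loc_pick(const struct location *loc,
             const struct free_extent *list, size_t n)
{
    assert(loc != NULL && (list != NULL || n == 0));
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!loc_matches(loc, &list[i]))
            continue;
        if (best < 0) {
            best = (int)i;
        } else if (loc->kind == LOC_NEAR_BLOCK) {
            if (distance(list[i].start, loc->arg) <
                distance(list[best].start, loc->arg))
                best = (int)i;
        } else if (list[i].len > list[best].len) {
            best = (int)i;
        }
    }
    return best;
}
```

Any filesystem-specific reasoning stays in the caller, which decides which
policy and argument to pass; the selection itself needs no knowledge of the
on-disk format.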

You can put whatever filesystem-specific stuff you need around this
to arrive at the decision of where to put the file, but you've got
to allocate the new blocks, move the data to them, and swap them
over. Every defragger needs to do this, regardless of the filesystem
type. So why not provide a framework for it, especially as the
framework is useful for far more than just as the data movement part
of a defrag application.

> > Remember, I'm not just talking about defrag - I'm talking about
> > an interface that is actually useful to apps that might care
> > about how data is laid out on disk but the application writers
> > don't know anything about how filesystem X or Y or Z is
> > implemented. Putting the burden of learning about filesystem
> > internals on application developers is not the correct solution.
>
> Unfortunately, a defragger *has* to know about some very low-level,
> filesystem-specific information if it wants to do a good job.

Back to defrag. Again. Bigger picture, guys, bigger picture.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-26 09:30:30

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Oct 25, 2006 16:54 +0200, Jan Kara wrote:
> I've just not yet decided how to handle indirect
> blocks in case of relocation in the middle of the file. Should they be
> relocated or shouldn't they? Probably they should be relocated at least
> in case they are fully contained in relocated interval or maybe better
> said when all the blocks they reference to are also in the interval
> (this handles also the case of EOF). But still if you would like to
> relocate the file by parts this is not quite what you want (you won't be
> able to relocate indirect blocks in the boundary of intervals) :(.

I suspect that the natural choice for metadata blocks is to keep the
block which has the most metadata unchanged. For example, if you are
doing a full-file relocation then you would naturally keep all of the
new {dt}indirect blocks. If you are relocating a small chunk of the
file you would keep the old {dt}indirect blocks and just copy a few
block pointers over.
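
One possible reading of this heuristic, sketched in code: for each indirect
block, count how many of its live pointers fall inside the relocated interval
and keep whichever copy then needs fewer pointers rewritten. The "more than
half" threshold, the type names and the function names are assumptions made
for illustration, not something the patch implements.

```c
#include <assert.h>
#include <stdint.h>

enum ind_choice { KEEP_OLD_INDIRECT, USE_NEW_INDIRECT };

/* Logical range mapped by the live pointers of one indirect block.
 * nptrs may be less than the block's capacity at end of file. */
struct ind_span {
    uint64_t first;   /* first logical block it maps */
    uint32_t nptrs;   /* number of pointers in use */
};

/* How many of ib's pointers fall inside the interval [from, to). */
uint32_t moved_ptrs(const struct ind_span *ib, uint64_t from, uint64_t to)
{
    uint64_t lo  = ib->first > from ? ib->first : from;
    uint64_t end = ib->first + ib->nptrs;
    uint64_t hi  = end < to ? end : to;
    return hi > lo ? (uint32_t)(hi - lo) : 0;
}

/* Keep the copy that leaves the most pointers untouched: switch to the
 * new indirect block only if more than half of the pointers move. */
enum ind_choice choose_indirect(const struct ind_span *ib,
                                uint64_t from, uint64_t to)
{
    assert(from <= to);
    uint32_t moved = moved_ptrs(ib, from, to);
    return (uint64_t)moved * 2 > ib->nptrs ? USE_NEW_INDIRECT
                                           : KEEP_OLD_INDIRECT;
}
```

Because only pointers actually in use are counted, a partially filled indirect
block at end of file is treated as fully covered when the interval reaches
EOF, which matches the EOF case Jan mentions.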

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-26 11:37:24

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

> On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> > On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
> > > On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
> > > So how do you then get the generic interface to allocate blocks
> > > specified by userspace race free?
> >
> > As has been repeatedly stated, there is no "generic". There MUST be
> > filesystem-specific knowledge during these operations.
>
> What information? All we need to know is where the free disk space
> is, and have a method to attempt to allocate from it. That's _easy_
> to abstract into a common interface via the VFS....
>
> > > > Further, in the case being discussed in this thread, ext2meta has
> > > > already been proven a workable solution.
> > >
> > > Sure, but that's not a generic solution to a problem common to
> > > all filesystems....
> >
> > You clearly don't know what I'm talking about. ext2meta is an example
> > of a filesystem-specific metadata access method, applicable to tasks
> > such as online optimization.
>
> I know exactly what ext2meta is. I said it's not a generic solution
> and you say it's a filesystem-specific solution. I think we're
> agreeing here. ;)
>
> We don't need to expose anything filesystem specific to userspace to
> implement this. Online data movement (i.e. the defrag mechanism)
> becomes something like:
>
> do {
> get_free_list(dst_fd, location, len, list)
> /* select extent to use */
Up to this point I can imagine we can be perfectly generic.

> alloc_from_list(dst_fd, list[X], off, len)
> } while (ENOALLOC)
> move_data(src_fd, dst_fd, off, len);
With these two it's not clear how well we can do with just a generic
interface. Every filesystem needs to have some additional metadata to
keep list of data blocks. In case of ext2/ext3/reiserfs this is not
a negligible amount of space and placement of these metadata is important
for performance. So either we focus only on data blocks and let
implementation of alloc_from_list() allocate metadata wherever it wants
(but then we get suboptimal performance because there need not be space
for indirect blocks close before our provided extent) or we allocate
metadata from the provided list, but then we need some knowledge of fs
to know how much should we expect to spend on metadata and where these
metadata should be placed. For example if you know that indirect block
for your interval is at block B, then you'd like to allocate somewhere
close after this point or to relocate that indirect block (and all the
data it references to). But for that you need to know you have something
like indirect blocks => filesystem knowledge.
So I think that to get this working, we also need some way to tell
the program that if it wants to allocate some data, it also needs to
account for this amount of metadata and some of it is already allocated
in given blocks...

> I see substantial benefit moving forward from having filesystem
> independent interfaces. Many features that filesystems implement
> are common, and as time goes on the common feature set of the
> different filesystems gets larger. So why shouldn't we be
> trying to make common operations generic so that every filesystem
> can benefit from the latest and greatest tool?
So you prefer to handle only "data blocks" part of the problem and let
filesystem sort out metadata?

Honza
--
Jan Kara
SuSE CR Labs

2006-10-26 13:39:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
> > > Remember, I'm not just talking about defrag - I'm talking about
> > > an interface that is actually useful to apps that might care
> > > about how data is laid out on disk but the application writers
> > > don't know anything about how filesystem X or Y or Z is
> > > implemented. Putting the burden of learning about filesystem
> > > internals on application developers is not the correct solution.

If all you want is something for application developers, about all you
can do is to tell the filesystem, "create the file so that it will be
quickly accessed after accessing this file or this directory". I
really don't see the point of having the application specify block
numbers if you're also claiming the application isn't going to know
anything about the filesystem layout --- or even the RAID layout of
the filesystem. I don't think it's at **all** useful to be
half-pregnant on this score.

- Ted

2006-10-26 14:40:55

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Thu, 2006-10-26 at 09:37 -0400, Theodore Tso wrote:
> On Thu, Oct 26, 2006 at 04:36:48PM +1000, David Chinner wrote:
> > > > Remember, I'm not just talking about defrag - I'm talking about
> > > > an interface that is actually useful to apps that might care
> > > > about how data is laid out on disk but the application writers
> > > > don't know anything about how filesystem X or Y or Z is
> > > > implemented. Putting the burden of learning about filesystem
> > > > internals on application developers is not the correct solution.
>
> If all you want is something for application developers, about all you
> can do is to tell the filesystem, "create the file so that it will be
> quickly accessed after accessing this file or this directory". I
> really don't see the point of having the application specify block
> numbers if you're also claiming the application isn't going to know
> anything about the filesystem layout --- or even the RAID layout of
> the filesystem. I don't think it's at **all** useful to be
> half-pregnant on this score.

I think a utility such as a defragmenter should know about the
filesystem layout. I also think that it would be a good thing to have a
consistent interface so that every filesystem isn't implementing a
completely different one.
--
David Kleikamp
IBM Linux Technology Center

2006-10-26 15:25:22

by Jörn Engel

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Wed, 25 October 2006 14:41:18 -0400, Jeff Garzik wrote:
> On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:
> > Yes, but there's a question of the interface to this operation. How to
> > specify which indirect block I mean? Obviously we could introduce
> > separate call for remapping indirect blocks but I find this solution
> > kind of clumsy...
>
> Agreed... that gets nasty real quick.

Logfs has a similar problem and I introduced a "level". Without going
into all the gory details, data blocks reside on level 0, indirect
blocks on level 1, doubly indirect blocks on level 2, etc. With this,
the tuple of (ino, pos, level) can specify any block on the
filesystem, provided it is used for some inode.
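
A small sketch of how such a tuple can address both data and indirect blocks.
It is not Logfs code: it assumes a uniform tree with a fixed fanout, and it
assumes that "pos" for an indirect block is the position of the first data
block it covers. FANOUT, the struct and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FANOUT 1024u  /* assumption: pointers per indirect block */

struct block_ref {
    uint64_t ino;     /* owning inode */
    uint64_t pos;     /* position, in blocks, of the first data block covered */
    unsigned level;   /* 0 = data, 1 = indirect, 2 = doubly indirect, ... */
};

/* Number of data blocks covered by one block at the given level. */
uint64_t span_of(unsigned level)
{
    assert(level <= 6);  /* FANOUT^6 = 2^60 still fits in 64 bits */
    uint64_t s = 1;
    while (level--)
        s *= FANOUT;
    return s;
}

/* The block at `level` whose subtree contains data position pos. */
struct block_ref covering_block(uint64_t ino, uint64_t pos, unsigned level)
{
    uint64_t span = span_of(level);
    struct block_ref r = { ino, pos - pos % span, level };
    return r;
}

/* Nonzero if child lies in the subtree rooted at parent. */
int ref_covers(const struct block_ref *parent, const struct block_ref *child)
{
    if (parent->ino != child->ino || parent->level < child->level)
        return 0;
    return child->pos >= parent->pos &&
           child->pos - parent->pos < span_of(parent->level);
}
```

With this addressing, a relocation call could name an indirect block the same
way it names a data block, which is the interface question raised above.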

Logfs needs this for Garbage Collection, which is a fairly similar
problem.

Jörn

--
Joern's library part 3:
http://inst.eecs.berkeley.edu/~cs152/fa05/handouts/clark-test.pdf

2006-10-27 01:32:52

by David Chinner

[permalink] [raw]
Subject: Re: [RFC] Ext3 online defrag

On Thu, Oct 26, 2006 at 01:37:22PM +0200, Jan Kara wrote:
> > On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> > We don't need to expose anything filesystem specific to userspace to
> > implement this. Online data movement (i.e. the defrag mechanism)
> > becomes something like:
> >
> > do {
> > get_free_list(dst_fd, location, len, list)
> > /* select extent to use */
> Up to this point I can imagine we can be perfectly generic.
>
> > alloc_from_list(dst_fd, list[X], off, len)
> > } while (ENOALLOC)
> > move_data(src_fd, dst_fd, off, len);
> With these two it's not clear how well we can do with just a generic
> interface. Every filesystem needs to have some additional metadata to
> keep list of data blocks. In case of ext2/ext3/reiserfs this is not
> a negligible amount of space and placement of these metadata is important
> for performance.

Yes, the same can be said for XFS. However, XFS's extent btree implementation
uses readahead to hide a lot of the latency involved with reading the extent
map, and it only needs to read it once per inode lifecycle.

> So either we focus only on data blocks and let
> implementation of alloc_from_list() allocate metadata wherever it wants
> (but then we get suboptimal performance because there need not be space
> for indirect blocks close before our provided extent)

I think the first step would be to focus on data blocks using something
like the above. There are many steps to full filesystem defragmentation,
but data fragmentation is typically the most common symptom of
fragmentation that we see.

> or we allocate
> metadata from the provided list, but then we need some knowledge of fs
> to know how much should we expect to spend on metadata and where these
> metadata should be placed.

That's the second step, I think. For example, we could count the metadata blocks
used in a metadata structure (say a block list), allocate a new chunk
like above, and then execute a "move_metadata()" type of operation,
which the filesystem does internally in a transactionally safe
manner. Once again, generic interface, filesystem specific implementations.

> For example if you know that indirect block
> for your interval is at block B, then you'd like to allocate somewhere
> close after this point or to relocate that indirect block (and all the
> data it references to). But for that you need to know you have something
> like indirect blocks => filesystem knowledge.

*nod*

This is far less of a problem with extent based filesystems -
coalescing all the fragments into a single extent removes the need
for indirect blocks and you get the extent list for free when you
read the inode. When we do have a fragmented file, XFS uses
readahead to speed btree searching and reading, so it hides a lot of
the latency overhead that fragmented metadata can cause.

Either way, these lists can still be optimised by allocating a
set of contiguous blocks and copying the metadata into them and
updating the pointers to the new blocks. It can be done separately
to the data moving and really should be done after the data has
been defragmented....

> So I think that to get this working, we also need some way to tell
> the program that if it wants to allocate some data, it also needs to
> account for this amount of metadata and some of it is already allocated
> in given blocks...

If you want to do it all in one step.

However, it's not quite that simple for something like XFS. An
allocation may require a btree split (or three, actually) and the
number of blocks required is dependent on the height of the btrees.
So we don't know how many blocks we'll need ahead of time, and we'd
have to reach deep into the allocator and abuse it badly to do
anything like this. It's not something I want to even contemplate
doing. :/

Also, we don't want to be mingling global metadata with inode
specific metadata so we don't want to put most of the new metadata
blocks near the extent we are putting the data into.

That means I'd prefer to be able to optimise metadata objects
separately. e.g. rewrite a btree into a single contiguous extent
with the btree blocks laid out so the readahead patterns result
in sequential I/O. The kernel would need to do this in XFS because
we'd have to lock the entire btree a block at a time, copy it
and then issue a "swap btree" transaction. Most other journalling
filesystems will have similar requirements, I think, for doing
this online....

That's a very similar concept to the move_data() interface...

> > I see substantial benefit moving forward from having filesystem
> > independent interfaces. Many features that filesystems implement
> > are common, and as time goes on the common feature set of the
> > different filesystems gets larger. So why shouldn't we be
> > trying to make common operations generic so that every filesystem
> > can benefit from the latest and greatest tool?
> So you prefer to handle only "data blocks" part of the problem and let
> filesystem sort out metadata?

The filesystem should already be attempting to do the right thing
with the metadata blocks. If it can't then I don't think we should
complicate the interface to do this, as a separate metadata
optimisation pass can do it for us. Also, it would make the
interface harder to use for general applications that really only
need to guarantee that their datafiles are unfragmented (e.g.
bittorrent).

FWIW, the only user of metadata optimisation that I can see is defrag
applications. Hence I think it is best to keep it separate and let
the filesystem do best effort metadata placement on all allocations
whether they are directed by userspace or not. The block/extent
lists can be further optimised as a separate phase of a defrag
application if the fs isn't able to do it right the first time....

No, wait, I just thought of another - online shrinking of a
filesystem requires you to move lots of metadata in a similarly
safe manner.... :)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group