2008-09-27 07:26:34

by Akira Fujita

[permalink] [raw]
Subject: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

Hi,

I've updated the ext4 online defragmentation patches.
In this version, mainly there are following two changes and some fixes:
- Support 1KB and 2KB block size.
- Implement EXT4_IOC_FIEMAP_INO instead of EXT4_IOC_EXTENTS_INFO
to get extetns information.

Changelog:
- 0.95(Sep. 26, 2008)
- Support 1KB and 2KB block size.
- Implement EXT4_IOC_FIEMAP_INO which calls ext4_fiemap() for specified inode number
instead of EXT4_IOC_EXTENTS_INFO to get extents information.
- Merge "JC6-defrag-alloc-contiguous-blks-credit" and
"ext4-request-for-blocks-with-ar.excepted_group-1.patch" in the ext4 patch queue
into this version.
- Handle s_dirtyblocks_counter correctly in defrag when delalloc enabled.
- Remove unneeded copy_from_user() in EXT4_IOC_GROUP_INFO ioctl. Ted pointed this out.
- 0.9 (May 30, 2008)
- Create some new functions (ext4_defrag_fill_ar(),
ext4_defrag_check_phase() ...) to separate block allocation function,
since the phase mode plug into the allocation function isn't good.
- Add the description of ext4_defrag() which is main function.
- Add the capability check.
- Some cleanups.
- 0.8 (Apr. 8, 2008)
Fix sparse warnings and change the construction of patches.

Outline for ext4 online defragmentation:
Ext4 online defrag has the following three functions.

no option Solving a single file fragmentation.
Single file fragmentation is solved by moving file
data to contiguous free blocks.

-r Solving a relevant file fragmentation.
Relevant file fragmentation could be solved by moving
the files under the specified directory close together with
the block containing the directory data.

-f Solving a free space fragmentation.
If there is no contiguous free blocks in the filesystem,
the other files are moved to make sufficient space to allocate
contiguous blocks for the target file.

Notes:
- Ext4 online defarg needs "mballoc" and "extents" mount options.
- "ext4-fiemap.patch" in the ext4 patch queue is necessary
for EXT4_IOC_FIEMAP_INO ioctl. We have to apply fiemap patches previously
then apply defrag patches, so it is necessary to change
the order of series in the ext4 patch queue.

Next steps:
1. Address the following items that Ted pointed out.
- Move all of the defrag ioctls from ext4_defrag_ioctl() to ext4_ioctl()
so that defrag does not have a double layer ioctl's dispatch.
- Remove the block reservation in the force defrag (-f).
- Remove the EXT4_IOC_FIBMAP ioctl and use the EXT4_IOC_FIEMAP instead.
- Get super block information with opening block device in user space
so that we can remove the EXT4_IOC_GROUP_INFO ioctl.

Summary of patches:
* The followings are new ext4 online defrag patches and they consist
of ioctl unit except 1-4. Because the EXT4_IOC_DEFRAG is too big to review,
I divided it into 4 patches.

[PATCH 1/12] EXT4_IOC_DEFRAG ioctl and main functions of defrag
- Create the temporary inode and do defrag per
defrag_size (defalut 64MB).

[PATCH 2/12] Allocate new contiguous blocks with mballoc
- Search contiguous free blocks with mutil-block allocation
and allocate them for the temporary inode.

[PATCH 3/12] Read and write file data with memory page
- Read the file data from the old blocks to the page cache and
write the file data on the page into the new blocks.

[PATCH 4/12] Exchange the blocks between two inodes
- Exchange the data blocks between the temporary inode and
the original inode.

[PATCH 5/12] EXT4_IOC_FIBMAP ioctl
- The EXT4_IOC_FIBMAP ioctl gets the physical block offset of target inode
with ext4_bmap(). This ioctl is used only in the relevant defrag (-r).

[PATCH 6/12] EXT4_IOC_GROUP_INFO ioctl
- The EXT4_IOC_GROUP_INFO ioctl gets the block group information
of target inode is located in. This ioctl is used only in the force defrag (-f).

[PATCH 7/12] EXT4_IOC_FREE_BLOCKS_INFO ioctl
- The EXT4_IOC_FREE_BLOCKS_INFO ioctl gets free extents
information of the target block group.
This ioctl is used only in the force defrag (-f).

[PATCH 8/12] EXT4_IOC_FIEMAP_INO ioctl
- The EXT4_IOC_FIEMAP_INO is used to get extents information of
inode which set to ioctl.
The defragger uses this ioctl to check the fragment condition
and to get extents information in the specified block group.

[PATCH 9/12] EXT4_IOC_RESERVE_BLOCK ioctl
- The EXT4_IOC_RESERVE_BLOCK ioctl reserves the specified
contiguous free space with ext4 block reservation function.
This ioctl is used only in the force defrag (-f).

[PATCH 10/12] EXT4_IOC_MOVE_VICTIM ioctl
- The EXT4_IOC_MOVE_VICTIM moves the victim extents into other block group.
Therefore the contiguous free space is made in the target block group.
This ioctl is used only in the force defrag (-f).

[PATCH 11/12] EXT4_IOC_BLOCK_RELEASE ioctl
- The EXT4_IOC_BLOCK_RELEASE releases the reserved blocks
which target inode holds.
This ioctl is used if defrag failed and it was after the block reservation.

[PATCH 12/12] Online defrag command
- The defrag command. Usage is as follows:
- Defrag for a single file.
# e4defrag file-name
- Defrag for all files on ext4.
# e4defrag device-name
- Put the multiple files closer together.
# e4defrag -r directory-name
- Defrag for free space fragmentation.
# e4defrag -f file-name

Any comments and tests are welcome.

Best regards,
Akira Fujita


2008-09-27 14:49:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

On Sat, Sep 27, 2008 at 04:26:29PM +0900, Akira Fujita wrote:
> Summary of patches:
> * The followings are new ext4 online defrag patches and they consist
> of ioctl unit except 1-4. Because the EXT4_IOC_DEFRAG is too big to review,
> I divided it into 4 patches.

It would make it *much* easier to review the first four patches if you
did the following:

*) Refector the patches so that low-level functions are first. Right
now it's a little hard to review the patches because in a number of
cases the documentation is in a different patch than the one where
"return 0;" is replaced with the actual code.

*) In many cases, there is extraneous code in the patch which is not
described in the commit comments. For example, in [PATCH 3/12]
there is a large amount of extent manipulation code that has
nothing to do with "read and write file data with memory page". In
fact, from inspection it looked like there was more extent
manipulation code than code that was responsible for reading and
writing the data blocks.

*) It would be much better if extent-related functions are moved to
the extent.c file, instead of taking low-level extent.c functions
and making them to be non-static. (Similarly, if there are
functions that are more about block allocation, it's better for
them to go into mballoc.c, instead of putting everything in
defrag.c). The idea is to make the code more readable by having
the correct abstractions, ideally that could be useful for more
than just defragging --- for example, the code for making sure we
merge adjacent extents could also be useful when we write into a
middle of a sparse file. If it turns out we have code in extent.c
that does that, let's only have one function, appropriately
abstracted, that does that, instead of one version for normal use,
and one version which is tweaked just for defrag.c's specific
requirements.

*) Could we have at least a few lines description of what the function
does, instead of just a one-liner and "this function returns 0 on
success, or an error on failure". That would be very helpful.
Many thanks!!

- Ted

2008-09-28 23:46:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

I've added online defrag version 0.95 into the ext4 patch queue.
Because of the recent patch I've added to remove the old legacy block
allocator patches 7 through 11 would no longer apply, and so they are
commented out in the patch queue. In some cases, the patches are ones
that would go away soon anyway. In the case of EXT4_IOC_RESERVE_BLOCK
ioctl, it was non-functional in the common case where mballoc was in
use, since mballoc never used any of the legacy block reservation
code; so any user space that depended on EXT4_IOC_RESERVE_BLOCK ioctl
couldn't possibly have worked.

If there are needed functions that were removed from mballoc.c that
really are necessary, we can certainly look at adding them back ---
but I would want to check to see if there is similar code in mballoc
that we can use instead. One of my concerns is that there is been
duplicated code in balloc.c, mballoc.c, extents.c, and defrag.c and
one of the reasons why I wanted to pull out the legacy block
allocation code was to eliminate some of the duplicated code, and to
hopefully expose more of this duplicated code.

In the end it should make all the ext4 code base and the defrag code
more easy to understand and much more robust.

Regards,

- Ted

2008-10-01 00:40:54

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

On Sep 27, 2008 16:26 +0900, Akira Fujita wrote:
> I've updated the ext4 online defragmentation patches.
> In this version, mainly there are following two changes and some fixes:
> - Support 1KB and 2KB block size.
> - Implement EXT4_IOC_FIEMAP_INO instead of EXT4_IOC_EXTENTS_INFO
> to get extetns information.
>
> Changelog:
> - 0.95(Sep. 26, 2008)
> - Support 1KB and 2KB block size.
> - Implement EXT4_IOC_FIEMAP_INO which calls ext4_fiemap() for specified inode number

Instead of implementing an EXT4_IOC_FIEMAP_INO ioctl, what we had implemented
is an EXT4_IOC_WRAPPER, which takes as arguments the inode number and the
ioctl command + original ioctl data. This allows inode ioctls to be called
against the filesystem root for arbitrary inodes, and doesn't require new
implementation for each ioctl:

struct ll_ioctl_wrapper {
__u32 ioctl_cmd;
__u32 padding;
__u64 ioctl_ino;
char ioctl_data[0];
};

int ext4_ioctl(struct inode *inode, struct file *filp, unsigned int cmd,
unsigned long arg)
:
:
case EXT4_IOC_WRAPPER: {
struct inode *active_inode;
struct ll_ioctl_wrapper buf;
struct file_operations *fop;

if (!capable(CAP_SYS_ADMIN))
return -EACCES;

if (copy_from_user(&buf, (struct ll_ioctl_wrapper __user *)arg,
sizeof(buf)))
return -EFAULT;

if (buf.ioctl_cmd == EXT4_IOC_WRAPPER)
return -EDEADLK;

active_inode = iget(lfs_sb, buf.ioctl_ino);
if (!active_inode)
return -EACCES;

err = -ENOTTY;
fop = inode->i_sb->s_root->d_inode->i_fop;
if (fop && fop->unlocked_ioctl)
err = fop->unlocked_ioctl(NULL, buf.ioctl_cmd,
arg + sizeof(buf));
if ((!err || err != -ENOTTY))
goto out_dput;

if (fop && fop->ioctl)
err = fop->ioctl(NULL, buf.ioctl_cmd, arg +sizeof(buf));
out_dput:
iput(active_dentry);
return err;
}

> instead of EXT4_IOC_EXTENTS_INFO to get extents information.
> - Merge "JC6-defrag-alloc-contiguous-blks-credit" and
> "ext4-request-for-blocks-with-ar.excepted_group-1.patch" in the ext4 patch queue
> into this version.
> - Handle s_dirtyblocks_counter correctly in defrag when delalloc enabled.
> - Remove unneeded copy_from_user() in EXT4_IOC_GROUP_INFO ioctl. Ted pointed this out.
> - 0.9 (May 30, 2008)
> - Create some new functions (ext4_defrag_fill_ar(),
> ext4_defrag_check_phase() ...) to separate block allocation function,
> since the phase mode plug into the allocation function isn't good.
> - Add the description of ext4_defrag() which is main function.
> - Add the capability check.
> - Some cleanups.
> - 0.8 (Apr. 8, 2008)
> Fix sparse warnings and change the construction of patches.
>
> Outline for ext4 online defragmentation:
> Ext4 online defrag has the following three functions.
>
> no option Solving a single file fragmentation.
> Single file fragmentation is solved by moving file
> data to contiguous free blocks.
>
> -r Solving a relevant file fragmentation.
> Relevant file fragmentation could be solved by moving
> the files under the specified directory close together with
> the block containing the directory data.
>
> -f Solving a free space fragmentation.
> If there is no contiguous free blocks in the filesystem,
> the other files are moved to make sufficient space to allocate
> contiguous blocks for the target file.
>
> Notes:
> - Ext4 online defarg needs "mballoc" and "extents" mount options.
> - "ext4-fiemap.patch" in the ext4 patch queue is necessary
> for EXT4_IOC_FIEMAP_INO ioctl. We have to apply fiemap patches previously
> then apply defrag patches, so it is necessary to change
> the order of series in the ext4 patch queue.
>
> Next steps:
> 1. Address the following items that Ted pointed out.
> - Move all of the defrag ioctls from ext4_defrag_ioctl() to ext4_ioctl()
> so that defrag does not have a double layer ioctl's dispatch.
> - Remove the block reservation in the force defrag (-f).
> - Remove the EXT4_IOC_FIBMAP ioctl and use the EXT4_IOC_FIEMAP instead.
> - Get super block information with opening block device in user space
> so that we can remove the EXT4_IOC_GROUP_INFO ioctl.
>
> Summary of patches:
> * The followings are new ext4 online defrag patches and they consist
> of ioctl unit except 1-4. Because the EXT4_IOC_DEFRAG is too big to review,
> I divided it into 4 patches.
>
> [PATCH 1/12] EXT4_IOC_DEFRAG ioctl and main functions of defrag
> - Create the temporary inode and do defrag per
> defrag_size (defalut 64MB).
>
> [PATCH 2/12] Allocate new contiguous blocks with mballoc
> - Search contiguous free blocks with mutil-block allocation
> and allocate them for the temporary inode.
>
> [PATCH 3/12] Read and write file data with memory page
> - Read the file data from the old blocks to the page cache and
> write the file data on the page into the new blocks.
>
> [PATCH 4/12] Exchange the blocks between two inodes
> - Exchange the data blocks between the temporary inode and
> the original inode.
>
> [PATCH 5/12] EXT4_IOC_FIBMAP ioctl
> - The EXT4_IOC_FIBMAP ioctl gets the physical block offset of target inode
> with ext4_bmap(). This ioctl is used only in the relevant defrag (-r).
>
> [PATCH 6/12] EXT4_IOC_GROUP_INFO ioctl
> - The EXT4_IOC_GROUP_INFO ioctl gets the block group information
> of target inode is located in. This ioctl is used only in the force defrag (-f).
>
> [PATCH 7/12] EXT4_IOC_FREE_BLOCKS_INFO ioctl
> - The EXT4_IOC_FREE_BLOCKS_INFO ioctl gets free extents
> information of the target block group.
> This ioctl is used only in the force defrag (-f).
>
> [PATCH 8/12] EXT4_IOC_FIEMAP_INO ioctl
> - The EXT4_IOC_FIEMAP_INO is used to get extents information of
> inode which set to ioctl.
> The defragger uses this ioctl to check the fragment condition
> and to get extents information in the specified block group.
>
> [PATCH 9/12] EXT4_IOC_RESERVE_BLOCK ioctl
> - The EXT4_IOC_RESERVE_BLOCK ioctl reserves the specified
> contiguous free space with ext4 block reservation function.
> This ioctl is used only in the force defrag (-f).
>
> [PATCH 10/12] EXT4_IOC_MOVE_VICTIM ioctl
> - The EXT4_IOC_MOVE_VICTIM moves the victim extents into other block group.
> Therefore the contiguous free space is made in the target block group.
> This ioctl is used only in the force defrag (-f).
>
> [PATCH 11/12] EXT4_IOC_BLOCK_RELEASE ioctl
> - The EXT4_IOC_BLOCK_RELEASE releases the reserved blocks
> which target inode holds.
> This ioctl is used if defrag failed and it was after the block reservation.
>
> [PATCH 12/12] Online defrag command
> - The defrag command. Usage is as follows:
> - Defrag for a single file.
> # e4defrag file-name
> - Defrag for all files on ext4.
> # e4defrag device-name
> - Put the multiple files closer together.
> # e4defrag -r directory-name
> - Defrag for free space fragmentation.
> # e4defrag -f file-name
>
> Any comments and tests are welcome.
>
> Best regards,
> Akira Fujita
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-10-01 18:45:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

On Tue, Sep 30, 2008 at 05:40:54PM -0700, Andreas Dilger wrote:
> Instead of implementing an EXT4_IOC_FIEMAP_INO ioctl, what we had implemented
> is an EXT4_IOC_WRAPPER, which takes as arguments the inode number and the
> ioctl command + original ioctl data. This allows inode ioctls to be called
> against the filesystem root for arbitrary inodes, and doesn't require new
> implementation for each ioctl:

Or just provide more generic open by handle functionality. Shouldn't be
too much of a problem to do it in the VFS by reusing the exportfs code.


2008-10-01 21:20:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

On Wed, Oct 01, 2008 at 02:45:45PM -0400, Christoph Hellwig wrote:
> On Tue, Sep 30, 2008 at 05:40:54PM -0700, Andreas Dilger wrote:
> > Instead of implementing an EXT4_IOC_FIEMAP_INO ioctl, what we had implemented
> > is an EXT4_IOC_WRAPPER, which takes as arguments the inode number and the
> > ioctl command + original ioctl data. This allows inode ioctls to be called
> > against the filesystem root for arbitrary inodes, and doesn't require new
> > implementation for each ioctl:
>
> Or just provide more generic open by handle functionality. Shouldn't be
> too much of a problem to do it in the VFS by reusing the exportfs code.
>

A while back I had implemented an "open by inode" patch for a friend
who needed it for their startup. I never posted it because (a) even
though it was only optionally enabled via a mount option, if you allow
non-root users to access it, it blows a whole through traditional unix
permissions semantics (i.e., a mode 700 directory no longer protects
files underneath that directory), and (b) I was sure that Al Viro
would consider the hacks that I needed to make it work to be far too
ugly to live. :-)

- Ted

2008-10-02 08:10:59

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

On Oct 01, 2008 17:20 -0400, Theodore Ts'o wrote:
> On Wed, Oct 01, 2008 at 02:45:45PM -0400, Christoph Hellwig wrote:
> > On Tue, Sep 30, 2008 at 05:40:54PM -0700, Andreas Dilger wrote:
> > > Instead of implementing an EXT4_IOC_FIEMAP_INO ioctl, what we had implemented
> > > is an EXT4_IOC_WRAPPER, which takes as arguments the inode number and the
> > > ioctl command + original ioctl data. This allows inode ioctls to be called
> > > against the filesystem root for arbitrary inodes, and doesn't require new
> > > implementation for each ioctl:
> >
> > Or just provide more generic open by handle functionality. Shouldn't be
> > too much of a problem to do it in the VFS by reusing the exportfs code.
> >
>
> A while back I had implemented an "open by inode" patch for a friend
> who needed it for their startup. I never posted it because (a) even
> though it was only optionally enabled via a mount option, if you allow
> non-root users to access it, it blows a whole through traditional unix
> permissions semantics (i.e., a mode 700 directory no longer protects
> files underneath that directory), and (b) I was sure that Al Viro
> would consider the hacks that I needed to make it work to be far too
> ugly to live. :-)

We've been using that patch for a long time now in Lustre, but will soon
be replacing it with code that calls fh_to_dentry() and just craft fake
fh to have the filesystem open the inode. That avoids all kinds of hacks
in place internally.

Definitely the __iopen__ directory should only be allowed for root...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-10-02 08:26:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/12]ext4: online defrag (ver 0.95)

On Thu, Oct 02, 2008 at 01:10:59AM -0700, Andreas Dilger wrote:
> We've been using that patch for a long time now in Lustre, but will soon
> be replacing it with code that calls fh_to_dentry() and just craft fake
> fh to have the filesystem open the inode. That avoids all kinds of hacks
> in place internally.
>
> Definitely the __iopen__ directory should only be allowed for root...

Well, the right way to do it is to actually pass in the FH from
userspace. Open by inode # alone is not enough, you alwasy want the
generation, too. And for some filesystems (e.g. reiserfs) the FH is
more complicated. The interface for it would be two ioctls (or better
syscalls) to generate the FH and convert it back to an FD, and possibly
a third one for readlink by handle.

XFS had these since day 1, but I would neither recommend copying the API
or implementation as they are just messed up badly.