2008-09-13 22:16:48

by Theodore Ts'o

[permalink] [raw]
Subject: Review of ext4-online-defrag-check-for-freespace-fragmentation.patch

Hi,

First, some general comments about your patch series --- I would prefer
if we didn't have a double layer of ioctl's dispatch. That is, don't
have have ext4_ioctl() do one level of case statements, only to then
call ext4_defrag_ioctl(), which dispatches on the ioctl's a second time.
It's much better to do all of the dispatching out of the top-level
ext4_ioctl() function.

Secondly, some of the ioctl numbers chosen by the defrag patches overlap
other, already existing patches. This is something we will need to fix,
long term. For now, folks should know that we can't count on the ioctl
numbers being stable, since we will probably need to move them.

Finally, the patch comments don't always describe all of the changes in
the patch. For example, in the patch
"ext4-online-defrag-check-for-freespace-fragmentation.patch", the patch
contains *far* more than just code to return the free space
fragmentation. It defines three new ioctl:

EXT4_IOC_GROUP_INFO
EXT4_IOC_FREE_BLOCKS_INFO
EXT4_IOC_EXTENTS_INFO

These ioctl's are someone odd. First of all, EXT4_IOC_GROUP_INFO only
returns two bits of information:

struct ext4_group_data_info {
int s_blocks_per_group; /* blocks per group */
int s_inodes_per_group; /* inodes per group */
};

Why just those two bits? Userspace can easily get this information by
opening the block device directly, but if you don't want to do that, why
not just return the full 1024 byte ext4 superblock? That would be far
more generally useful, instead of just returning these two fields.

Also, the code implementing EXT4_IOC_GROUP_INFO needlessly does a
copy_from_user() before filling in the data structure (and overwriting
all of the user data copied using copy_from_user); that's not necessary,
and should be elimiated.


The second ioctl, EXT4_IOC_FREE_BLOCKS_INFO, is very strange in that it
returns the free space available in a particular block group, but
instead of passing in a block group number, instead an inode number is
passed in. This seems very strange, and unnecessary.

(Looking at the defrag command, it appears that the defragger only tries
to defragment an inode if can find free space in the block group
associated with the inode? That seems highly limited. Note
particularly that with ext4's flex_bg feature, we now freely try to
allocate blocks within a flex_bg which is composed of multiple block
groups. Also, some of the files that most badly need defragmentation
will be the larger ones that can't possibly fit in a single block
group.)


The last ioctl, EXT4_IOC_EXTENTS_INFO, duplicates the generic FIEMAP
ioctl, and most of the implementation duplicates what is found in the
ext4-fiemap patch. If this patch had been separated into three separate
patches, one for each ioctl, it would be easier to eliminate the
EXT4_IOC_EXTENTS_INFO changes as duplicated by other code.


More generally, I'm beginning to think the best way to make progress
with the defragmentation patches is to break out the patches into
separate ioctl's, and we should merge the read-only ioctls that return
information about the filesystem first. These ioctl's should be made as
generally useful as possible, and if they overlap ioctl's that already
exist or are planned to be merged, such as the FIBMAP or FIEMAP
interfaces, we should figure out why we can't use the pre-existing
ioctl's first.

For example, if we change EXT4_IOC_GROUP_INFO into something which
returns the entire ext4 superblock, then in the future if the defrag
command gains the ability to take the flex_bg feature or the RAID layout
into account, we won't need to create new ioctl's to return those
filesystem parameters.


Best regards,

- Ted



2008-09-14 02:10:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Review of ext4-online-defrag-check-for-freespace-fragmentation.patch

On Sat, Sep 13, 2008 at 06:16:42PM -0400, Theodore Ts'o wrote:
>
> Secondly, some of the ioctl numbers chosen by the defrag patches overlap
> other, already existing patches. This is something we will need to fix,
> long term. For now, folks should know that we can't count on the ioctl
> numbers being stable, since we will probably need to move them.

Note: I have done this in the version of the defrag patches in the
ext4 patch queue, where all of the ioctl's introduced by the defrag
patches have been renumbered starting at 15, to avoid conflicts with
other ioctls that either exist already or could be introduced in the
future.

If you don't mind, we can also merge up some of the patches that have
been inserted in the patch queue, and then you can start working from
the modified version in the patch queue. The other way to move
forward is that we can start creating new ioctl's that are more
general and have carefully designed interfaces, and then as the
replacement ioctl's are created, the defrag patches can be shortened
and the on-line defrag command modified to use the new ioctl's.

- Ted

2008-09-17 06:44:07

by Akira Fujita

[permalink] [raw]
Subject: Re: Review of ext4-online-defrag-check-for-freespace-fragmentation.patch

Hi Ted,
Thank you for your review comments.

Theodore Ts'o wrote:
> Hi,
>
> First, some general comments about your patch series --- I would prefer
> if we didn't have a double layer of ioctl's dispatch. That is, don't
> have have ext4_ioctl() do one level of case statements, only to then
> call ext4_defrag_ioctl(), which dispatches on the ioctl's a second time.
> It's much better to do all of the dispatching out of the top-level
> ext4_ioctl() function.

I will fix defrag to call each of functions directly from ext4_ioctl().

> Secondly, some of the ioctl numbers chosen by the defrag patches overlap
> other, already existing patches. This is something we will need to fix,
> long term. For now, folks should know that we can't count on the ioctl
> numbers being stable, since we will probably need to move them.

I will use renumbered defrag patches in the patch queue.

> Finally, the patch comments don't always describe all of the changes in
> the patch. For example, in the patch
> "ext4-online-defrag-check-for-freespace-fragmentation.patch", the patch
> contains *far* more than just code to return the free space
> fragmentation. It defines three new ioctl:
>
> EXT4_IOC_GROUP_INFO
> EXT4_IOC_FREE_BLOCKS_INFO
> EXT4_IOC_EXTENTS_INFO

OK, I will split defrag patches into each ioctls
and add more descriptions about them.


> These ioctl's are someone odd. First of all, EXT4_IOC_GROUP_INFO only
> returns two bits of information:
>
> struct ext4_group_data_info {
> int s_blocks_per_group; /* blocks per group */
> int s_inodes_per_group; /* inodes per group */
> };
>
> Why just those two bits? Userspace can easily get this information by
> opening the block device directly, but if you don't want to do that, why
> not just return the full 1024 byte ext4 superblock? That would be far
> more generally useful, instead of just returning these two fields.

Either will be fine for me.
But if defrag opens the block device directly to get superblock related information,
it will make sense since we can reduce the number of ioctl.
Or as you mentioned later, if defrag implements new ioctl
which gets superblock information, it will be useful.
So I am puzzled a little about which is better.

> Also, the code implementing EXT4_IOC_GROUP_INFO needlessly does a
> copy_from_user() before filling in the data structure (and overwriting
> all of the user data copied using copy_from_user); that's not necessary,
> and should be elimiated.

You are right. I will remove it.

>
> The second ioctl, EXT4_IOC_FREE_BLOCKS_INFO, is very strange in that it
> returns the free space available in a particular block group, but
> instead of passing in a block group number, instead an inode number is
> passed in. This seems very strange, and unnecessary.

EXT4_IOC_FREE_BLOCKS_INFO is only used in the force defrag mode (-f)
and it is necessary to get free block information.
It reads block bitmap of target block group then sets
free block distribution to ext4_extents_info structure
as extents array and returns it to userspace.
After that in userspace, defragger calculates the combination
of extents which should be moved to other block group as victim extents
with the results of EXT4_IOC_FREE_BLOCKS_INFO and EXT4_IOC_EXTENTS_INFO.


> (Looking at the defrag command, it appears that the defragger only tries
> to defragment an inode if can find free space in the block group
> associated with the inode? That seems highly limited. Note
> particularly that with ext4's flex_bg feature, we now freely try to
> allocate blocks within a flex_bg which is composed of multiple block
> groups. Also, some of the files that most badly need defragmentation
> will be the larger ones that can't possibly fit in a single block
> group.)

In the force defrag mode (-f), the maximum file defrag size is limited to 128MB
(It is the maximum size we can store blocks into a single
block group when blocksize is 4KB).
Because if defragger scans all of block groups and calculates the combination
of free-blocks and used-blocks for victim extents, it will cost a lot.

On the other hand, in usual defrag (no option) and the relevant defrag (-r),
defragger scans all of block groups in ext4 for block allocation with mballoc
(It means no need to calculate the combination of used/unused blocks).
So there is no file size limitation in this case.

>
> The last ioctl, EXT4_IOC_EXTENTS_INFO, duplicates the generic FIEMAP
> ioctl, and most of the implementation duplicates what is found in the
> ext4-fiemap patch. If this patch had been separated into three separate
> patches, one for each ioctl, it would be easier to eliminate the
> EXT4_IOC_EXTENTS_INFO changes as duplicated by other code.

There was no ioctl to get extents information and thus
defrag implemented EXT4_IOC_EXTENTS_INFO originally at that time.
But FIEMAP is already in the patch queue now, so defrag should use it
instead of EXT4_IOC_EXTENTS_INFO. I am working on this now.

>
> More generally, I'm beginning to think the best way to make progress
> with the defragmentation patches is to break out the patches into
> separate ioctl's, and we should merge the read-only ioctls that return
> information about the filesystem first. These ioctl's should be made as

I agree with you.

> generally useful as possible, and if they overlap ioctl's that already
> exist or are planned to be merged, such as the FIBMAP or FIEMAP
> interfaces, we should figure out why we can't use the pre-existing
> ioctl's first.
>
> For example, if we change EXT4_IOC_GROUP_INFO into something which
> returns the entire ext4 superblock, then in the future if the defrag
> command gains the ability to take the flex_bg feature or the RAID layout
> into account, we won't need to create new ioctl's to return those
> filesystem parameters.


Best regards,
Akira Fujita

2008-09-17 06:46:34

by Akira Fujita

[permalink] [raw]
Subject: Re: Review of ext4-online-defrag-check-for-freespace-fragmentation.patch

Hi Ted,

Theodore Tso wrote:
> On Sat, Sep 13, 2008 at 06:16:42PM -0400, Theodore Ts'o wrote:
>> Secondly, some of the ioctl numbers chosen by the defrag patches overlap
>> other, already existing patches. This is something we will need to fix,
>> long term. For now, folks should know that we can't count on the ioctl
>> numbers being stable, since we will probably need to move them.
>
> Note: I have done this in the version of the defrag patches in the
> ext4 patch queue, where all of the ioctl's introduced by the defrag
> patches have been renumbered starting at 15, to avoid conflicts with
> other ioctls that either exist already or could be introduced in the
> future.
>
> If you don't mind, we can also merge up some of the patches that have
> been inserted in the patch queue, and then you can start working from
> the modified version in the patch queue. The other way to move
> forward is that we can start creating new ioctl's that are more
> general and have carefully designed interfaces, and then as the
> replacement ioctl's are created, the defrag patches can be shortened
> and the on-line defrag command modified to use the new ioctl's.

I see, I will use the renumbered defrag patches in the patch queue.
(Currently, I am working on the defrag to use FS_IOC_FIEMAP
instead of EXT4_IOC_EXTENTS_INFO and remove block size limitation
to support 1KB and 2KB block size. I will release them soon.)

Thanks,
Akira Fujita