2009-01-30 22:33:53

by Greg Freemyer

Subject: Re: [RFC][PATCH 0/3] ext4: online defrag (ver 1.0)

On Fri, Jan 30, 2009 at 1:11 AM, Akira Fujita <[email protected]> wrote:
> Hi,
>
> I have rewritten ext4 online defrag patches based on the comments from Ted.
> In the new defrag, create donor inode in the user space instead of kernel space,
> and then allocate contiguous blocks to it with fallocate().
> In kernel space, exchange the blocks between target inode and donor inode,
> and then copy the file data of target inode to donor inode every 64MB.
> The EXT4_IOC_DEFRAG ioctl becomes simpler than the old one,
> so it may be useful for other purposes.
>
> #define EXT4_IOC_DEFRAG _IOW('f', 15, struct move_extent)
>
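The user-space half of the flow quoted above (create a donor file, then preallocate blocks for it) might look roughly like the sketch below. It is not code from the patch set: the helper name create_donor() and the use of a mkstemp() temporary file are assumptions, and posix_fallocate() is used as a portable stand-in for the fallocate() call the mail names. Preallocation only gives the allocator the chance to choose contiguous blocks; it does not guarantee contiguity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary donor file from a mkstemp()
 * template and preallocate `size` bytes for it.
 * Returns an open descriptor, or -1 on failure (the partially created
 * file is removed in that case).
 */
int create_donor(char *path_template, off_t size)
{
    assert(path_template != NULL);
    assert(size > 0);

    int fd = mkstemp(path_template);    /* fills in the XXXXXX suffix */
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns 0 on success or an errno value. */
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        unlink(path_template);
        return -1;
    }
    return fd;
}
```

The caller keeps ownership of the descriptor and of the path written back into the template, so it can unlink the donor once the exchange is done.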

Do we want the ioctl name to be specific to defrag? I thought Ted's
goal was to make it more generic? I can also envision this same ioctl
being implemented by other file systems so EXT4 seems an inappropriate
prefix.

Thoughts?

> struct move_extent {
> int org_fd; /* original file descriptor */
> int dest_fd; /* destination file descriptor */
> ext4_lblk_t start; /* logical offset of org_fd and dest_fd */
> ext4_lblk_t len; /* exchange block length */
> };

I would also like to see .dest_fd changed to .donor_fd.

I would like to see the ABI be more flexible and have .start be broken
into 2 fields:

.start_orig
.start_donor

And I don't think they should be of type ext4_lblk_t. Something more
generic seems appropriate.

Thoughts?

Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


2009-02-04 08:07:48

by Akira Fujita

Subject: Re: [RFC][PATCH 0/3] ext4: online defrag (ver 1.0)

Hi Greg,

Greg Freemyer wrote:
> On Fri, Jan 30, 2009 at 1:11 AM, Akira Fujita <[email protected]> wrote:
>> Hi,
>>
>> I have rewritten ext4 online defrag patches based on the comments from Ted.
>> In the new defrag, create donor inode in the user space instead of kernel space,
>> and then allocate contiguous blocks to it with fallocate().
>> In kernel space, exchange the blocks between target inode and donor inode,
>> and then copy the file data of target inode to donor inode every 64MB.
>> The EXT4_IOC_DEFRAG ioctl becomes simpler than the old one,
>> so it may be useful for other purposes.
>>
>> #define EXT4_IOC_DEFRAG _IOW('f', 15, struct move_extent)
>>
>

I see. Does EXT4_IOC_MOVE_EXT sound better to you?

#define EXT4_IOC_MOVE_EXT _IOW('f', 15, struct move_extent)

> Do we want the ioctl name to be specific to defrag? I thought Ted's
> goal was to make it more generic? I can also envision this same ioctl
> being implemented by other file systems so EXT4 seems an inappropriate
> prefix.

Other filesystems (e.g. xfs, btrfs) have their own defrag ioctls,
and ext2/3 cannot use this ioctl anyway because they do not handle
extent-based files.
What advantage do you see in moving this ioctl
to the VFS layer?


> Thoughts?
>
>> struct move_extent {
>> int org_fd; /* original file descriptor */
>> int dest_fd; /* destination file descriptor */
>> ext4_lblk_t start; /* logical offset of org_fd and dest_fd */
>> ext4_lblk_t len; /* exchange block length */
>> };
>
> I would also like to see .dest_fd changed to .donor_fd.
>
> I would like to see the ABI be more flexible and have .start be broken
> into 2 fields:
>
> .start_orig
> .start_donor
>
> And I don't think they should be of type ext4_lblk_t. Something more
> generic seems appropriate.
>
OK, I broke .start into .orig_start and .donor_start
and changed the entry type from ext4_lblk_t to __u64.
The new move_extent structure is as follows:

struct move_extent {
int orig_fd; /* original file descriptor */
int donor_fd; /* donor file descriptor */
__u64 orig_start; /* logical start offset in block for orig */
__u64 donor_start; /* logical start offset in block for donor */
__u64 len; /* exchange block length */
};

Any comments?
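For readers who want to experiment with the layout, here is a compilable sketch of the structure as proposed in this message. It reflects the proposal in this thread only, not necessarily what was eventually merged. uint64_t stands in for the kernel's __u64 so the snippet builds in plain user space, and move_extent_init() is a hypothetical helper that is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>   /* _IOW() */

/* Layout proposed in this thread; uint64_t stands in for __u64. */
struct move_extent {
    int      orig_fd;      /* original file descriptor */
    int      donor_fd;     /* donor file descriptor */
    uint64_t orig_start;   /* logical start offset in block for orig */
    uint64_t donor_start;  /* logical start offset in block for donor */
    uint64_t len;          /* exchange block length */
};

#define EXT4_IOC_MOVE_EXT _IOW('f', 15, struct move_extent)

/* Hypothetical helper: build one request; all offsets are in blocks. */
struct move_extent move_extent_init(int orig_fd, int donor_fd,
                                    uint64_t orig_start,
                                    uint64_t donor_start, uint64_t len)
{
    assert(orig_fd >= 0 && donor_fd >= 0);
    assert(len > 0);

    struct move_extent me = {
        .orig_fd     = orig_fd,
        .donor_fd    = donor_fd,
        .orig_start  = orig_start,
        .donor_start = donor_start,
        .len         = len,
    };
    return me;
}
```

With two 4-byte ints followed by three 64-bit fields, the structure has no padding holes on common ABIs, which keeps the ioctl size encoding the same for 32-bit and 64-bit callers.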

Regards,
Akira Fujita

2009-02-04 12:25:19

by Greg Freemyer

Subject: Re: [RFC][PATCH 0/3] ext4: online defrag (ver 1.0)

Hi Akira,

On Wed, Feb 4, 2009 at 3:07 AM, Akira Fujita <[email protected]> wrote:
> Hi Greg,
>
> Greg Freemyer wrote:
>>
>> On Fri, Jan 30, 2009 at 1:11 AM, Akira Fujita <[email protected]> wrote:
>>> Hi,
>>>
>>> I have rewritten ext4 online defrag patches based on the comments from Ted.
>>> In the new defrag, create donor inode in the user space instead of kernel space,
>>> and then allocate contiguous blocks to it with fallocate().
>>> In kernel space, exchange the blocks between target inode and donor inode,
>>> and then copy the file data of target inode to donor inode every 64MB.
>>> The EXT4_IOC_DEFRAG ioctl becomes simpler than the old one,
>>> so it may be useful for other purposes.
>>>
>>> #define EXT4_IOC_DEFRAG _IOW('f', 15, struct move_extent)
>>>
>>
>
> I see. Does EXT4_IOC_MOVE_EXT sound better to you?
>
> #define EXT4_IOC_MOVE_EXT _IOW('f', 15, struct move_extent)

I like it better, but a core developer should weigh in.

>> Do we want the ioctl name to be specific to defrag? I thought Ted's
>> goal was to make it more generic? I can also envision this same ioctl
>> being implemented by other file systems so EXT4 seems an inappropriate
>> prefix.
>
> Other filesystems (e.g. xfs, btrfs) have their own defrag ioctls,
> and ext2/3 cannot use this ioctl anyway because they do not handle
> extent-based files.

I don't want ext2/3 to share any kernel code. I do hope that
userspace code could eventually be written to exercise
EXT4_IOC_MOVE_EXT type functionality for all 3 filesystems.

Do we really need a new ioctl for each one?

> What advantage do you see in moving this ioctl
> to the VFS layer?

I only got interested in this code because I started monitoring the
OHSM project (http://code.google.com/p/fscops/).

They don't need defrag, but they do need the functionality of
EXT4_IOC_MOVE_EXT. They are currently writing their code around ext2
and have a proof of concept implementation almost ready. Each time
they add a filesystem (ext3, ext4, etc.) they will need to have a way
to trigger the block re-org from userspace. Having a single ioctl
that can be expanded to handle more and more underlying filesystems
would benefit them.

Equally important, if other users of EXT4_IOC_MOVE_EXT come along, they
may want it to be more filesystem-generic as well.

>> Thoughts?
>>
>>> struct move_extent {
>>> int org_fd; /* original file descriptor */
>>> int dest_fd; /* destination file descriptor */
>>> ext4_lblk_t start; /* logical offset of org_fd and dest_fd */
>>> ext4_lblk_t len; /* exchange block length */
>>> };
>>
>> I would also like to see .dest_fd changed to .donor_fd.
>>
>> I would like to see the ABI be more flexible and have .start be broken
>> into 2 fields:
>>
>> .start_orig
>> .start_donor
>>
>> And I don't think they should be of type ext4_lblk_t. Something more
>> generic seems appropriate.
>>
> OK, I broke .start into .orig_start and .donor_start
> and changed the entry type from ext4_lblk_t to __u64.
> The new move_extent structure is as follows:
>
> struct move_extent {
> int orig_fd; /* original file descriptor */
> int donor_fd; /* donor file descriptor */
> __u64 orig_start; /* logical start offset in block for orig */
> __u64 donor_start; /* logical start offset in block for donor */
> __u64 len; /* exchange block length */
> };
>
> Any comments?

I like that much better. With OHSM as an example, this gives them the
flexibility to re-org a large file even if there is not enough
free space to allocate a full redundant copy.
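One way to read the point about limited free space: with separate offsets, a small donor can be reused for each piece of a large file, so only one chunk's worth of spare blocks is needed at a time. The sketch below only plans the per-chunk offsets. The chunking scheme, the struct chunk type and plan_chunk() are assumptions made for illustration and are not described in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One planned exchange; all values are in filesystem blocks. */
struct chunk {
    uint64_t orig_start;   /* where the chunk begins in the original file */
    uint64_t donor_start;  /* where the chunk begins in the donor file */
    uint64_t len;          /* number of blocks to exchange */
};

/*
 * Fill *out with the i-th chunk of a file of total_blocks blocks that is
 * reorganized chunk_blocks at a time. The donor is assumed to be reused
 * for every chunk, so donor_start is always 0.
 * Returns 1 if chunk i exists, 0 once i is past the end of the file.
 */
int plan_chunk(uint64_t total_blocks, uint64_t chunk_blocks,
               uint64_t i, struct chunk *out)
{
    assert(chunk_blocks > 0);
    assert(out != NULL);

    /* Reject out-of-range indexes first so i * chunk_blocks cannot overflow. */
    if (i > total_blocks / chunk_blocks)
        return 0;

    uint64_t start = i * chunk_blocks;
    if (start >= total_blocks)
        return 0;

    uint64_t remaining = total_blocks - start;
    out->orig_start  = start;
    out->donor_start = 0;
    out->len         = remaining < chunk_blocks ? remaining : chunk_blocks;
    return 1;
}
```

For a 10-block file moved 4 blocks at a time this yields chunks of 4, 4 and 2 blocks. With a single shared .start field the donor would instead have to mirror the original file's offsets.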

> Regards,
> Akira Fujita
>

Greg

2009-02-04 14:09:11

by Theodore Ts'o

Subject: Re: [RFC][PATCH 0/3] ext4: online defrag (ver 1.0)

On Wed, Feb 04, 2009 at 05:07:48PM +0900, Akira Fujita wrote:
>> Do we want the ioctl name to be specific to defrag? I thought Ted's
>> goal was to make it more generic? I can also envision this same ioctl
>> being implemented by other file systems so EXT4 seems an inappropriate
>> prefix.

When I said generic I meant in terms of decomposing the functionality
into multiple ioctls which each could be useful for multiple purposes.
Not necessarily in terms of being used by other filesystems, because
they will almost certainly have their own requirements.

So for example, primitives like "allocate blocks for this inode from
this region of the disk", or "don't allocate blocks for any inode in
this region of disk", can be used for multiple things (such as on-line
shrink), and not just defragmentation.

I don't want to move this to the VFS layer, since it will involve huge
amounts of time while people argue over generic issues regarding the
interface. Look at how long it took to settle on the FIEMAP
interface; that's not an experience I care to repeat.

>>> struct move_extent {
>>> int org_fd; /* original file descriptor */
>>> int dest_fd; /* destination file descriptor */
>>> ext4_lblk_t start; /* logical offset of org_fd and dest_fd */
>>> ext4_lblk_t len; /* exchange block length */
>>> };
>>
>> I would also like to see .dest_fd changed to .donor_fd.

Agreed --- dest_fd is very confusing, because while the data is moving
to the blocks contributed by the donor_fd, the actual inode which
remains pointed to by all of the directory entries is the org_fd.
People who think of the operation as the blocks moving to the
"destination fd" will get completely confused. "Donor" makes more
sense, since it carries the connotation of an organ transplant.
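The "organ transplant" picture can be made concrete with a toy model in which each file is just an array mapping logical block numbers to physical block numbers. This is an illustration only, not the ext4 extent code: exchange_blocks() is a hypothetical name, the two offsets follow the orig_start/donor_start fields proposed earlier in the thread, and the data copy that the real operation performs is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model: map[i] is the physical block backing logical block i.
 * Swap `len` mappings between the original file and the donor.
 * Afterwards the original file (the one every directory entry still
 * points to) is backed by the donor's blocks, and the donor holds the
 * old, fragmented ones.
 */
void exchange_blocks(uint64_t *orig_map, uint64_t *donor_map,
                     size_t orig_start, size_t donor_start, size_t len)
{
    assert(orig_map != NULL && donor_map != NULL);

    for (size_t i = 0; i < len; i++) {
        uint64_t tmp = orig_map[orig_start + i];
        orig_map[orig_start + i] = donor_map[donor_start + i];
        donor_map[donor_start + i] = tmp;
    }
}
```

Starting from an original file backed by blocks {900, 17, 403} and a donor backed by {100, 101, 102}, one full exchange leaves the original on {100, 101, 102}. Nothing is moved "to" the donor as a destination; it only gives up its blocks.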

- Ted

2009-02-04 14:51:07

by Greg Freemyer

Subject: Re: [RFC][PATCH 0/3] ext4: online defrag (ver 1.0)

On Wed, Feb 4, 2009 at 9:09 AM, Theodore Tso <[email protected]> wrote:
> On Wed, Feb 04, 2009 at 05:07:48PM +0900, Akira Fujita wrote:
>>> Do we want the ioctl name to be specific to defrag? I thought Ted's
>>> goal was to make it more generic? I can also envision this same ioctl
>>> being implemented by other file systems so EXT4 seems an inappropriate
>>> prefix.
>
> When I said generic I meant in terms of decomposing the functionality
> into multiple ioctls which each could be useful for multiple purposes.
> Not necessarily in terms of being used by other filesystems, because
> they will almost certainly have their own requirements.
>
> So for example, primitives like "allocate blocks for this inode from
> this region of the disk", or "don't allocate blocks for any inode in
> this region of disk", can be used for multiple things (such as on-line
> shrink), and not just defragmentation.
>
> I don't want to move this to the VFS layer, since it will involve huge
> amounts of time while people argue over generic issues regarding the
> interface. Look at how long it took to settle on the FIEMAP
> interface; that's not an experience I care to repeat.

Convinced and request withdrawn.

Talking about this ioctl, can anyone say:

If the OHSM team implements a similar ioctl for ext2 and ext3 and
submits them for mainline at some point, do they have a chance of
being accepted or are ext2 and ext3 feature frozen?

Thanks
Greg

2009-02-04 15:32:08

by Theodore Ts'o

Subject: Re: [RFC][PATCH 0/3] ext4: online defrag (ver 1.0)

On Wed, Feb 04, 2009 at 09:51:07AM -0500, Greg Freemyer wrote:
>
> If the OHSM team implements a similar ioctl for ext2 and ext3 and
> submits them for mainline at some point, do they have a chance of
> being accepted or are ext2 and ext3 feature frozen?

It seems unlikely it would be accepted. If the patch could be done in
a way that seriously minimized the chances of destabilizing the code,
maybe --- but consider also that the OHSM design is a pretty terrible
hack. I'm not at all convinced they will be able to stabilize it for
production use, given a scheme that involves using dmapi across multiple
block devices.

Note that they apparently need to make other changes to the core
filesystem code besides just the ioctl --- to the block allocation
code, at the very least.

The right answer is really to use a stackable filesystem, and to use
separate filesystems for each different tier, and then build on top of
unionfs to give it its policy support. I suspect that OHSM will be a
cute student project, but it won't become anything serious given its
architecture/design, unfortunately.

- Ted

2009-03-25 11:53:07

by SandeepKsinha

Subject: Re: [RFC][PATCH 0/3] ext4: online defrag (ver 1.0)



Theodore Ts'o wrote:
> On Wed, Feb 04, 2009 at 09:51:07AM -0500, Greg Freemyer wrote:
>> If the OHSM team implements a similar ioctl for ext2 and ext3 and
>> submits them for mainline at some point, do they have a chance of
>> being accepted or are ext2 and ext3 feature frozen?
>
> It seems unlikely it would be accepted. If the patch could be done in
> a way that seriously minimized the chances of destabilizing the code,
> maybe --- but consider also that the OHSM design is a pretty terrible
> hack. I'm not at all convinced they will be able to stabilize it for

I do not understand what makes you feel that way. The idea of OHSM is
simply to exploit the topology of the underlying device on which the
file system resides and to derive a better block allocation policy from
it: a kind of handshake between the filesystem and the logical volume.

Can you be a bit more specific about what makes you feel that it is not
the right way to achieve such goals?


> production use, given a scheme that involves using dmapi across multiple
> block devices.
>

Well, dmapi already provides ioctls that expose the underlying topology
of the logical device, and I believe they are pretty stable at this point.



> Note that they apparently need to make other changes to the core
> filesystem code besides just the ioctl --- to the block allocation
> code, at the very least.
>

Agreed. There would be considerable changes to the block allocation
code, but the major changes would be confined to block allocation ONLY.
The motivation for OHSM to look in the direction of ext4 was the mail
from Akira which mentions that a similar ioctl-based interface is
planned for ext4.
Just to quote:
"(2) An (ioctl-based) interface which associates with an inode
preferred range(s) of blocks which the block allocator will try using
first; if those blocks are not available, or the block range(s) is
exhausted, the block allocator use its normal algorithms to pick the
best available block. The set of preferred blocks is only guaranteed
to persist while the inode is in memory.".

http://patchwork.ozlabs.org/patch/12877/
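The behaviour described in the quoted item (2), try the preferred ranges first and otherwise fall back to the normal allocator, can be sketched with a toy in-memory model. This is not ext4 allocator code: the byte-per-block `used` array, struct block_range, pick_block() and the first-free fallback scan are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A preferred range of blocks: [start, start + len). */
struct block_range {
    size_t start;
    size_t len;
};

/*
 * used[b] != 0 means block b is already allocated.
 * Try each preferred range in order; if none has a free block, fall
 * back to a plain first-free scan, which stands in here for the
 * allocator's "normal algorithms".
 * Returns the chosen block number, or -1 if every block is in use.
 */
long pick_block(const unsigned char *used, size_t nblocks,
                const struct block_range *pref, size_t npref)
{
    assert(used != NULL);
    assert(pref != NULL || npref == 0);

    for (size_t r = 0; r < npref; r++) {
        for (size_t k = 0; k < pref[r].len; k++) {
            size_t b = pref[r].start + k;
            if (b >= nblocks)
                break;              /* range runs past the device */
            if (!used[b])
                return (long)b;
        }
    }

    /* Preferred ranges unavailable or exhausted: normal allocation. */
    for (size_t b = 0; b < nblocks; b++) {
        if (!used[b])
            return (long)b;
    }
    return -1;
}
```

In this model the preference is only a hint: an exhausted or fully occupied range silently degrades to ordinary allocation, matching the "will try using first" wording of the quote.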

Other than these, most of the implementation resides in a separate
module.

Don't you think that we are at least clean with respect to block
allocation and dmapi, the major concerns that you had?



> The right answer is really to use a stackable filesystem, and to use
> separate filesystems for each different tier, and then build on top of
> unionfs to give it its policy support. I suspect that OHSM will be a
> cute student project, but it won't become anything serious given its
> architecture/design, unfortunately.
>

This is one possible way of achieving it, and it may be a better
one.

Regards,
Sandeep.



> - Ted