2008-10-24 10:10:26

by Akira Fujita

[permalink] [raw]
Subject: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

ext4: online defrag -- Add the EXT4_IOC_FIEMAP_INO ioctl.

From: Akira Fujita <[email protected]>

The EXT4_IOC_FIEMAP_INO is used to get extents information of
inode which set to ioctl.
The defragger uses this ioctl to check the fragment condition
and to get extents information in the specified block group.

Signed-off-by: Akira Fujita <[email protected]>
Signed-off-by: Takashi Sato <[email protected]>
---
fs/ext4/defrag.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 7 +++
fs/ext4/ioctl.c | 7 +++
3 files changed, 127 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/defrag.c b/fs/ext4/defrag.c
index 891e599..67030bc 100644
--- a/fs/ext4/defrag.c
+++ b/fs/ext4/defrag.c
@@ -20,6 +20,8 @@
#include "ext4_extents.h"
#include "group.h"

+#define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent))
+
/**
* ext4_defrag_next_extent - Search for the next extent and set it to "extent"
*
@@ -91,6 +93,117 @@ err:
}

/**
+ * ext4_defrag_fiemap_ino - Get extents information by inode number
+ *
+ * @filp: pointer to file
+ * @arg: pointer to fiemap_ino
+ * @fiemap_ino->ino: an inode number which is used to get
+ * extent information
+ * @fiemap_ino->fiemap: request for fiemap ioctl
+ *
+ * This function returns 0 if succeed, otherwise returns error value.
+ */
+int
+ext4_defrag_fiemap_ino(struct file *filp, unsigned long arg)
+{
+ struct fiemap_ino fiemap_ino;
+ struct fiemap_extent_info fieinfo = { 0, };
+ struct inode *inode;
+ struct super_block *sb = filp->f_dentry->d_inode->i_sb;
+ u64 len = 0;
+ int err = 0;
+
+ if (copy_from_user(&fiemap_ino, (struct fiemap_ino __user *)arg,
+ sizeof(struct fiemap_ino)))
+ return -EFAULT;
+
+ /* Special inodes shouldn't be choiced */
+ if (fiemap_ino.ino < EXT4_GOOD_OLD_FIRST_INO)
+ return -ENOENT;
+
+ inode = ext4_iget(sb, fiemap_ino.ino);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+
+ /*
+ * If we do not have the permission to access the inode,
+ * just skip it.
+ */
+ if (!capable(CAP_DAC_OVERRIDE)) {
+ if ((inode->i_mode & S_IRUSR) != S_IRUSR)
+ return -EACCES;
+ if (current->fsuid != inode->i_uid)
+ return -EACCES;
+ }
+
+ /* Return -ENOENT if a file does not exist */
+ if (!inode->i_nlink || !S_ISREG(inode->i_mode)) {
+ err = -ENOENT;
+ goto out;
+ }
+
+ if (!inode->i_op->fiemap) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
+ if (fiemap_ino.fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (fiemap_ino.fiemap.fm_length == 0) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (fiemap_ino.fiemap.fm_start > sb->s_maxbytes) {
+ err = -EFBIG;
+ goto out;
+ }
+
+ /*
+ * Check offset and length
+ * If the specified range exceeds the max file size,
+ * adjust the length.
+ */
+ if ((fiemap_ino.fiemap.fm_length > sb->s_maxbytes) ||
+ (sb->s_maxbytes - fiemap_ino.fiemap.fm_length)
+ < fiemap_ino.fiemap.fm_start)
+ len = sb->s_maxbytes - fiemap_ino.fiemap.fm_start;
+ else
+ len = fiemap_ino.fiemap.fm_length;
+
+ fieinfo.fi_flags = fiemap_ino.fiemap.fm_flags;
+ fieinfo.fi_extents_max = fiemap_ino.fiemap.fm_extent_count;
+ fieinfo.fi_extents_start =
+ (struct fiemap_extent *)(arg + sizeof(fiemap_ino));
+
+ if (fiemap_ino.fiemap.fm_extent_count != 0 &&
+ !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
+ fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC)
+ filemap_write_and_wait(inode->i_mapping);
+
+ err = inode->i_op->fiemap(inode, &fieinfo,
+ fiemap_ino.fiemap.fm_start, len);
+ if (!err) {
+ fiemap_ino.fiemap.fm_flags = fieinfo.fi_flags;
+ fiemap_ino.fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped;
+ if (copy_to_user((char *)arg, &fiemap_ino, sizeof(fiemap_ino)))
+ err = -EFAULT;
+ }
+
+out:
+ iput(inode);
+ return err;
+}
+
+/**
* ext4_defrag_merge_across_blocks - Merge extents across leaf block
*
* @handle: journal handle
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f7b092d..c72703f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -305,6 +305,7 @@ struct ext4_new_group_data {
#define EXT4_IOC_DEFRAG _IOW('f', 15, struct ext4_ext_defrag_data)
#define EXT4_IOC_SUPER_INFO _IOR('f', 16, struct ext4_super_block)
#define EXT4_IOC_FREE_BLOCKS_INFO _IOWR('f', 17, struct ext4_extents_info)
+#define EXT4_IOC_FIEMAP_INO _IOWR('f', 18, struct fiemap_ino)

/*
* ioctl commands in 32 bit emulation
@@ -352,6 +353,11 @@ struct ext4_extents_info {
struct ext4_extent_data ext[DEFRAG_MAX_ENT];
};

+struct fiemap_ino {
+ __u64 ino;
+ struct fiemap fiemap;
+};
+
#define EXT4_TRANS_META_BLOCKS 4 /* bitmap + group desc + sb + inode */

/*
@@ -1185,6 +1191,7 @@ extern int ext4_ext_journal_restart(handle_t *handle, int needed);
/* defrag.c */
extern int ext4_defrag(struct file *filp, ext4_lblk_t block_start,
ext4_lblk_t defrag_size, ext4_fsblk_t goal);
+extern int ext4_defrag_fiemap_ino(struct file *filp, unsigned long arg);

static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
{
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 5709574..b69b54a 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -256,6 +256,13 @@ setversion_out:
return 0;
}

+ case EXT4_IOC_FIEMAP_INO: {
+ int err;
+
+ err = ext4_defrag_fiemap_ino(filp, arg);
+ return err;
+ }
+
case EXT4_IOC_FREE_BLOCKS_INFO: {
struct ext4_extents_info ext_info;
int err;


2008-10-26 08:41:03

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:
> The EXT4_IOC_FIEMAP_INO is used to get extents information of
> inode which set to ioctl.
> The defragger uses this ioctl to check the fragment condition
> and to get extents information in the specified block group.

Instead of having a separate IOC number for each such ioctl, instead
we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that
passes in an inode number and a second IOC number so that arbitrary file
ioctls can be run on any inode by root.

This was mentioned last time these patches were posted, but there was
no reply from you. Christoph suggested a more generic VFS open-by-inum,
which isn't impossible to do but would cause a lot of controversy I
think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
but is more generically useful than EXT4_IOC_FIEMAP_INO.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-10-26 08:48:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote:
> This was mentioned last time these patches were posted, but there was
> no reply from you. Christoph suggested a more generic VFS open-by-inum,
> which isn't impossible to do but would cause a lot of controversy I
> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
> but is more generically useful than EXT4_IOC_FIEMAP_INO.

I'll hack up a generic open_by_handle and then we can gather the
reaction - it shouldn't be more than about one or two hundred lines of
code. Note that you really want an open by handle and not just inum for
a defragmentation tool - without the generation you can easily run into
races.

Btw, any reason the XFS approach of passing in filehandles for both
the inode to be defragmented and the "donor" inode doesn't work for you?


2008-10-26 08:49:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote:
> Btw, any reason the XFS approach of passing in filehandles for both
> the inode to be defragmented and the "donor" inode doesn't work for you?

Sorry should be file descriptor and not filehandle above.


2008-10-27 10:21:57

by Akira Fujita

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

Hi Andreas,

Andreas Dilger wrote:
> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:
>> The EXT4_IOC_FIEMAP_INO is used to get extents information of
>> inode which set to ioctl.
>> The defragger uses this ioctl to check the fragment condition
>> and to get extents information in the specified block group.
>
> Instead of having a separate IOC number for each such ioctl, instead
> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that
> passes in an inode number and a second IOC number so that arbitrary file
> ioctls can be run on any inode by root.

The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations.
But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also
non-root user to call fiemap,
so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag.

> This was mentioned last time these patches were posted, but there was
> no reply from you. Christoph suggested a more generic VFS open-by-inum,
> which isn't impossible to do but would cause a lot of controversy I
> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
> but is more generically useful than EXT4_IOC_FIEMAP_INO.
>

Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue?


Regards,
Akira Fujita

2008-10-27 19:55:52

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

On Oct 27, 2008 19:21 +0900, Akira Fujita wrote:
> Andreas Dilger wrote:
>> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:
>>> The EXT4_IOC_FIEMAP_INO is used to get extents information of
>>> inode which set to ioctl.
>>> The defragger uses this ioctl to check the fragment condition
>>> and to get extents information in the specified block group.
>>
>> Instead of having a separate IOC number for each such ioctl, instead
>> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that
>> passes in an inode number and a second IOC number so that arbitrary file
>> ioctls can be run on any inode by root.
>
> The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations.
> But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also
> non-root user to call fiemap,
> so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag.

Why does a regular user need to do the ioctl on a file that it may not
have read permission to access? I can see this is useful for root
doing a defrag of the whole filesystem instead of opening and closing
all of the files, but for regular users we need to validate via the
full path to ensure they can even access the file before defragmenting it.

>> This was mentioned last time these patches were posted, but there was
>> no reply from you. Christoph suggested a more generic VFS open-by-inum,
>> which isn't impossible to do but would cause a lot of controversy I
>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
>> but is more generically useful than EXT4_IOC_FIEMAP_INO.
>
> Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue?

If there is interest, yes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-10-31 09:46:46

by Akira Fujita

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl


Andreas Dilger Wrote:
> On Oct 27, 2008 19:21 +0900, Akira Fujita wrote:
>> Andreas Dilger wrote:
>>> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote:
>>>> The EXT4_IOC_FIEMAP_INO is used to get extents information of
>>>> inode which set to ioctl.
>>>> The defragger uses this ioctl to check the fragment condition
>>>> and to get extents information in the specified block group.
>>> Instead of having a separate IOC number for each such ioctl, instead
>>> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that
>>> passes in an inode number and a second IOC number so that arbitrary file
>>> ioctls can be run on any inode by root.
>> The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations.
>> But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also
>> non-root user to call fiemap,
>> so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag.
>
> Why does a regular user need to do the ioctl on a file that it may not
> have read permission to access? I can see this is useful for root
> doing a defrag of the whole filesystem instead of opening and closing
> all of the files, but for regular users we need to validate via the
> full path to ensure they can even access the file before defragmenting it.

The FIEMAP_INO ioctl just passes a inode number belongs to
the target block group from user space to kernel space
and then the owner check is done in the kernel space.
If the regular user (defrag -f excecutant) is owner of a file,
defrag handles this file as the candidate of victim file which would be moved
to the other block group to make free space.

So I think the full path check is unneeded because the owner check
is done in the kernel space (I'm not sure it's good enough).
If it's not good in the security point of view,
I will make defrag -f mode be done only by root user.

>>> This was mentioned last time these patches were posted, but there was
>>> no reply from you. Christoph suggested a more generic VFS open-by-inum,
>>> which isn't impossible to do but would cause a lot of controversy I
>>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
>>> but is more generically useful than EXT4_IOC_FIEMAP_INO.
>> Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue?
>
> If there is interest, yes.

How do the other ext4 developers think about
implementing EXT4_IOC_WRAPPER?
Will it be used only for defrag so far?

Regards,
Akira Fujita

2008-10-31 10:05:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

Akira, can you please comment on these issues before going on?
I think the generation issue is a particularly important one if you
want to allow defrag by normal users.

On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote:
> On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote:
> > This was mentioned last time these patches were posted, but there was
> > no reply from you. Christoph suggested a more generic VFS open-by-inum,
> > which isn't impossible to do but would cause a lot of controversy I
> > think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
> > but is more generically useful than EXT4_IOC_FIEMAP_INO.
>
> I'll hack up a generic open_by_handle and then we can gather the
> reaction - it shouldn't be more than about one or two hundred lines of
> code. Note that you really want an open by handle and not just inum for
> a defragmentation tool - without the generation you can easily run into
> races.
>
> Btw, any reason the XFS approach of passing in *file descriptors* for both
> the inode to be defragmented and the "donor" inode doesn't work for you?

2008-11-04 21:43:17

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

On Oct 31, 2008 18:46 +0900, Akira Fujita wrote:
>> Why does a regular user need to do the ioctl on a file that it may not
>> have read permission to access? I can see this is useful for root
>> doing a defrag of the whole filesystem instead of opening and closing
>> all of the files, but for regular users we need to validate via the
>> full path to ensure they can even access the file before defragmenting it.
>
> The FIEMAP_INO ioctl just passes a inode number belongs to
> the target block group from user space to kernel space
> and then the owner check is done in the kernel space.
>
> If the regular user (defrag -f excecutant) is owner of a file,
> defrag handles this file as the candidate of victim file which would
> be moved to the other block group to make free space.
>
> So I think the full path check is unneeded because the owner check
> is done in the kernel space (I'm not sure it's good enough).
> If it's not good in the security point of view,
> I will make defrag -f mode be done only by root user.

If the defrag operation is limited to the owner of the file (or root
via CAP_DAC_OVERRIDE) then this is probably OK also. The data never
gets to userspace so there is relatively little risk to this operation.

>>>> This was mentioned last time these patches were posted, but there was
>>>> no reply from you. Christoph suggested a more generic VFS open-by-inum,
>>>> which isn't impossible to do but would cause a lot of controversy I
>>>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
>>>> but is more generically useful than EXT4_IOC_FIEMAP_INO.
>
> How do the other ext4 developers think about
> implementing EXT4_IOC_WRAPPER?
> Will it be used only for defrag so far?

I expect the initial users of this ioctl will be FIEMAP and DEFRAG, but
it might also be useful for other ioctls in the future.

I haven't really asked other ext4 developers about it yet, and nobody
else has commented the last time I posted the patch.

I don't have an objection to Christoph's open-by-FH API, if there is
acceptance of this from other kernel developers (Al Viro in particular),
but that exposes a lot more security issues than just the ioctl wrapper.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-11-06 07:39:41

by Akira Fujita

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

Hi Christoph,
Christoph Hellwig wrote:
> Akira, can you please comment on these issues before going on?
> I think the generation issue is a particularly important one if you
> want to allow defrag by normal users.

For a regular user defrag (-f), I'll add the check
to solve the generation issue in the following procedures (1-4).
Please check whether my approach is right or not.

1. Acquire the extents information of inode in kernel space
and then return it to user space with EXT4_IOC_FIEMAP_INO.

2. Calculate the victim extents from the combination of
extents (the result of 1) and free space extents.

3. Pass the victim extents (move to the other block group to make free space)
from user space to kernel space with EXT4_IOC_MOVE_VICTIM.

4. In kernel space, make sure the permission and the extents construction
of the passed inode (the passed inode still covers
the victim extent area or not).
If there is no problem, defrag continues its process.

If check (4) failed, it means victim extents was changed
and that area might be already used by other users, so defrag will fail.

If my approach doesn't seem to be a problem, I will work on it.

Regards,
Akira Fujita

> On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote:
>> On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote:
>>> This was mentioned last time these patches were posted, but there was
>>> no reply from you. Christoph suggested a more generic VFS open-by-inum,
>>> which isn't impossible to do but would cause a lot of controversy I
>>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4,
>>> but is more generically useful than EXT4_IOC_FIEMAP_INO.
>> I'll hack up a generic open_by_handle and then we can gather the
>> reaction - it shouldn't be more than about one or two hundred lines of
>> code. Note that you really want an open by handle and not just inum for
>> a defragmentation tool - without the generation you can easily run into
>> races.
>>
>> Btw, any reason the XFS approach of passing in *file descriptors* for both
>> the inode to be defragmented and the "donor" inode doesn't work for you?


2008-11-06 16:16:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

On Fri, Oct 31, 2008 at 06:05:47AM -0400, Christoph Hellwig wrote:
> > I'll hack up a generic open_by_handle and then we can gather the
> > reaction - it shouldn't be more than about one or two hundred lines of
> > code. Note that you really want an open by handle and not just inum for
> > a defragmentation tool - without the generation you can easily run into
> > races.

I think having a generic open_by_handle is a Good Thing, but it
actually isn't quite enough for defrag, because that brings up the
question of how defrag can create the handle in the first place.

In the case of Aryan's desire to get the list of files that were read
during boot, it's pretty obvious how we can define an interface which
would make available a set of file handles corresponding to the files
that were opened during the boot process, and then that list of file
handles can be saved to a file and used on the subsequent boot to do
the readahead. Fairly straight forward.

In the case of the defrag situation, though, we need to step back and
figure out what we are trying to do. What the userspace program is
apparently trying to do is to get the block extent maps used by all of
the inodes in the block group. The problem is we aren't opening the
inodes by pathname, so we couldn't create a handle in the first place
(since in order to create a handle, we need the containing directory).

The bigger question is whether the defrag code is asking the right
question in the first place. The issue is that is that it currently
assumes that in order to find the owner of a particular block (or more
generally, extent), you should search the inodes in the block's
blockgroup. The problem is that for a very fragmented filesystem,
most of the blocks' owners might not be in their block group. In
fact, over time, if you use defrag -f frequently, it will move blocks
belonging to inodes in one block group to other block groups, which
will make defrag -f's job even harder, and eventually impossible, for
inodes belonging to other block groups.

> Akira, can you please comment on these issues before going on?
> I think the generation issue is a particularly important one if you
> want to allow defrag by normal users.

I'm not at all sure that it makes sense to allow "defrag -f" to be
used by normal users. The problem here is we're fighting multiple
constraints. First of all, we want to keep policy in the kernel to an
absolute minimum, since debugging kernel code is a mess, and I don't
think we want the complexity of a full-fledge defragger in the kernel.
Secondly, though, if we are going to do this in userspace, we need to
push a huge amount of information to the userspace defrag program,
that immediately raises some very serious security issues, because we
don't want to leak information unduly to random userspace programs.

> > Btw, any reason the XFS approach of passing in *file descriptors* for both
> > the inode to be defragmented and the "donor" inode doesn't work for you?

I agree this is probably the better approach; it would certainly
reduce the amount of new code that needs to be added to the kernel.
Right now the "donor"/temporary inode is created and allocated in the
kernel, and then the kernel decides whether or not the temporary inode
is an improvement. If we make the userspace code responsible for
creating the temporary inode and then using fallocate() to allocate
the new blocks, then userspace can call FIEMAP and decide whether it
is an improvement.

- Ted

P.S. I've been looking at ext4_defrag_partial(), and the page locking
looks wrong. The page is only locked if it it hasn't been read into
memory yet? And at the end of the function, we do this?

if (PageLocked(page))
unlock_page(page);

2008-11-13 11:35:06

by Akira Fujita

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/9]ext4: Add the EXT4_IOC_FIEMAP_INO ioctl

Hi Ted,

Theodore Tso wrote:
> On Fri, Oct 31, 2008 at 06:05:47AM -0400, Christoph Hellwig wrote:
>>> I'll hack up a generic open_by_handle and then we can gather the
>>> reaction - it shouldn't be more than about one or two hundred lines of
>>> code. Note that you really want an open by handle and not just inum for
>>> a defragmentation tool - without the generation you can easily run into
>>> races.
>
> I think having a generic open_by_handle is a Good Thing, but it
> actually isn't quite enough for defrag, because that brings up the
> question of how defrag can create the handle in the first place.
>
> In the case of Aryan's desire to get the list of files that were read
> during boot, it's pretty obvious how we can define an interface which
> would make available a set of file handles corresponding to the files
> that were opened during the boot process, and then that list of file
> handles can be saved to a file and used on the subsequent boot to do
> the readahead. Fairly straight forward.
>
> In the case of the defrag situation, though, we need to step back and
> figure out what we are trying to do. What the userspace program is
> apparently trying to do is to get the block extent maps used by all of
> the inodes in the block group. The problem is we aren't opening the
> inodes by pathname, so we couldn't create a handle in the first place
> (since in order to create a handle, we need the containing directory).
>
> The bigger question is whether the defrag code is asking the right
> question in the first place. The issue is that is that it currently
> assumes that in order to find the owner of a particular block (or more
> generally, extent), you should search the inodes in the block's
> blockgroup. The problem is that for a very fragmented filesystem,
> most of the blocks' owners might not be in their block group. In
> fact, over time, if you use defrag -f frequently, it will move blocks
> belonging to inodes in one block group to other block groups, which
> will make defrag -f's job even harder, and eventually impossible, for
> inodes belonging to other block groups.
>
>> Akira, can you please comment on these issues before going on?
>> I think the generation issue is a particularly important one if you
>> want to allow defrag by normal users.
>
> I'm not at all sure that it makes sense to allow "defrag -f" to be
> used by normal users. The problem here is we're fighting multiple
> constraints. First of all, we want to keep policy in the kernel to an
> absolute minimum, since debugging kernel code is a mess, and I don't
> think we want the complexity of a full-fledge defragger in the kernel.
> Secondly, though, if we are going to do this in userspace, we need to
> push a huge amount of information to the userspace defrag program,
> that immediately raises some very serious security issues, because we
> don't want to leak information unduly to random userspace programs.

All right.
I will change "defrag -f" to admit only root user.

>>> Btw, any reason the XFS approach of passing in *file descriptors* for both
>>> the inode to be defragmented and the "donor" inode doesn't work for you?
>
> I agree this is probably the better approach; it would certainly
> reduce the amount of new code that needs to be added to the kernel.
> Right now the "donor"/temporary inode is created and allocated in the
> kernel, and then the kernel decides whether or not the temporary inode
> is an improvement. If we make the userspace code responsible for
> creating the temporary inode and then using fallocate() to allocate
> the new blocks, then userspace can call FIEMAP and decide whether it
> is an improvement.

There is no problem if it is only for normal defrag,
but I think fallocate is not enough for the other defrag mode (-r and -f)
because we can't specify physical block offset.
For example, in defrag -r, physical block offset of parent directory is
set to the goal for mballoc.
Also in defrag -f, physical block offset is used to allocate specified
range as well.

Do you mean that defrag -r and -f are unnecessary?
I think these options are advantages
if we compare ext4 defrag with other FS's defrag feature.

>
> P.S. I've been looking at ext4_defrag_partial(), and the page locking
> looks wrong. The page is only locked if it it hasn't been read into
> memory yet? And at the end of the function, we do this?
>
> if (PageLocked(page))
> unlock_page(page);

In case defrag failed between write_begin() and write_end(),
we have to call unlock_page() because we've already have the page lock
with write_begin().
So unlock_page() is called in the end of ext4_defrag_partial() as error case.
Is there something wrong?

Regards,
Akira Fujita