2008-09-13 18:47:14

by Theodore Ts'o

[permalink] [raw]
Subject: YET ANOTHER resend of the fiemap patches


So the last round of the on-going round-and-round of the FIEMAP patches
went to the linux-ext4 mailing list, because of a query there about why
these patches had been stalled for so long. This resulted in a private
set of e-mails attacking an ext4 developer and putting forward a
conspiracy theory that these patches were trying to be snuck in, despite
the fact that they have been sent on linux-fsdevel at least 4 times
already.

Mark has at this point I believe answered all of Christoph Hellwig's
demands, and put forward a request to Andrew Morton to include them in
-mm (minus the ext4 patch, since the version he had no longer compiled).

As of this writing, Andrew has not put them into -mm, I suspect because
he perceives there is still controversies around these patches, despite
the fact that most of the e-mails and time wasted has been over
individual bitflags and other really tiny nit-picky details. So I am
re-sending these patches out for review on linux-fsdevel. The first
three patches are taken from Mark's fiemap branch on the ocfs2 branch.
The last patch is an updated ext4 patch from the ext4 patch queue, which
is provided so that folks can see that the fiemap patches work just fine
on ext4.

Given that filesystem designers seem to love nit-picking tiny details,
and I am personally starting to lose patience, if the fiemap patches
stalls any further, my plan is to take a page from the XFS playbook and
simply take the ext4-fiemap patch and implement an ext4-specific ioctl.
If and when the linux-fsdevel community manages come to consensus on the
fiemap patches, whether it happens in 2.6.28 or Linux 2.6.87, it will be
easy enough to wire the ext4 support to the generic fiemap ioctl.

"It's *just* an ioctl, folks" --- Andreas Dilger

Yours disgustedly,

- Ted



2008-09-13 18:49:27

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 4/4] Hook ext4 to the vfs fiemap interface.

From: Eric Sandeen <[email protected]>

ext4_ext_walk_space() was reinstated to be used for iterating over file
extents with a callback; it is used by the ext4 fiemap implementation.

Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/ext4_extents.h | 15 +++
fs/ext4/extents.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/file.c | 4 +
4 files changed, 269 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9fccf13..8ad9980 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1127,6 +1127,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *,
ext4_lblk_t, int, int *);
struct buffer_head *ext4_bread(handle_t *, struct inode *,
ext4_lblk_t, int, int *);
+int ext4_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create);
int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock, unsigned long maxblocks,
struct buffer_head *bh_result,
diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index 0f2c744..fbe34b4 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -124,6 +124,19 @@ struct ext4_ext_path {
#define EXT4_EXT_CACHE_GAP 1
#define EXT4_EXT_CACHE_EXTENT 2

+/*
+ * to be called by ext4_ext_walk_space()
+ * negative retcode - error
+ * positive retcode - signal for ext4_ext_walk_space(), see below
+ * callback must return valid extent (passed or newly created)
+ */
+typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_path *,
+ struct ext4_ext_cache *,
+ struct ext4_extent *, void *);
+
+#define EXT_CONTINUE 0
+#define EXT_BREAK 1
+#define EXT_REPEAT 2

#define EXT_MAX_BLOCK 0xffffffff

@@ -224,6 +237,8 @@ extern int ext4_ext_try_to_merge(struct inode *inode,
struct ext4_extent *);
extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
+extern int ext4_ext_walk_space(struct inode *, ext4_lblk_t, ext4_lblk_t,
+ ext_prepare_callback, void *);
extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t,
struct ext4_ext_path *);
extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6ff30ff..32c1aa9 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -40,6 +40,7 @@
#include <linux/slab.h>
#include <linux/falloc.h>
#include <asm/uaccess.h>
+#include <linux/fiemap.h>
#include "ext4_jbd2.h"
#include "ext4_extents.h"

@@ -1657,6 +1658,113 @@ cleanup:
return err;
}

+int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block,
+ ext4_lblk_t num, ext_prepare_callback func,
+ void *cbdata)
+{
+ struct ext4_ext_path *path = NULL;
+ struct ext4_ext_cache cbex;
+ struct ext4_extent *ex;
+ ext4_lblk_t next, start = 0, end = 0;
+ ext4_lblk_t last = block + num;
+ int depth, exists, err = 0;
+
+ BUG_ON(func == NULL);
+ BUG_ON(inode == NULL);
+
+ while (block < last && block != EXT_MAX_BLOCK) {
+ num = last - block;
+ /* find extent for this block */
+ path = ext4_ext_find_extent(inode, block, path);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ path = NULL;
+ break;
+ }
+
+ depth = ext_depth(inode);
+ BUG_ON(path[depth].p_hdr == NULL);
+ ex = path[depth].p_ext;
+ next = ext4_ext_next_allocated_block(path);
+
+ exists = 0;
+ if (!ex) {
+ /* there is no extent yet, so try to allocate
+ * all requested space */
+ start = block;
+ end = block + num;
+ } else if (le32_to_cpu(ex->ee_block) > block) {
+ /* need to allocate space before found extent */
+ start = block;
+ end = le32_to_cpu(ex->ee_block);
+ if (block + num < end)
+ end = block + num;
+ } else if (block >= le32_to_cpu(ex->ee_block)
+ + ext4_ext_get_actual_len(ex)) {
+ /* need to allocate space after found extent */
+ start = block;
+ end = block + num;
+ if (end >= next)
+ end = next;
+ } else if (block >= le32_to_cpu(ex->ee_block)) {
+ /*
+ * some part of requested space is covered
+ * by found extent
+ */
+ start = block;
+ end = le32_to_cpu(ex->ee_block)
+ + ext4_ext_get_actual_len(ex);
+ if (block + num < end)
+ end = block + num;
+ exists = 1;
+ } else {
+ BUG();
+ }
+ BUG_ON(end <= start);
+
+ if (!exists) {
+ cbex.ec_block = start;
+ cbex.ec_len = end - start;
+ cbex.ec_start = 0;
+ cbex.ec_type = EXT4_EXT_CACHE_GAP;
+ } else {
+ cbex.ec_block = le32_to_cpu(ex->ee_block);
+ cbex.ec_len = ext4_ext_get_actual_len(ex);
+ cbex.ec_start = ext_pblock(ex);
+ cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
+ }
+
+ BUG_ON(cbex.ec_len == 0);
+ err = func(inode, path, &cbex, ex, cbdata);
+ ext4_ext_drop_refs(path);
+
+ if (err < 0)
+ break;
+
+ if (err == EXT_REPEAT)
+ continue;
+ else if (err == EXT_BREAK) {
+ err = 0;
+ break;
+ }
+
+ if (ext_depth(inode) != depth) {
+ /* depth was changed. we have to realloc path */
+ kfree(path);
+ path = NULL;
+ }
+
+ block = cbex.ec_block + cbex.ec_len;
+ }
+
+ if (path) {
+ ext4_ext_drop_refs(path);
+ kfree(path);
+ }
+
+ return err;
+}
+
static void
ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
__u32 len, ext4_fsblk_t start, int type)
@@ -3006,3 +3114,143 @@ retry:
mutex_unlock(&inode->i_mutex);
return ret > 0 ? ret2 : ret;
}
+
+/*
+ * Callback function called for each extent to gather FIEMAP information.
+ */
+int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_path *path,
+ struct ext4_ext_cache *newex, struct ext4_extent *ex,
+ void *data)
+{
+ struct fiemap_extent_info *fieinfo = data;
+ unsigned long blksize_bits = inode->i_sb->s_blocksize_bits;
+ __u64 logical;
+ __u64 physical;
+ __u64 length;
+ __u32 flags = 0;
+ int error;
+
+ logical = (__u64)newex->ec_block << blksize_bits;
+
+ if (newex->ec_type == EXT4_EXT_CACHE_GAP) {
+ pgoff_t offset;
+ struct page *page;
+ struct buffer_head *bh = NULL;
+
+ offset = logical >> PAGE_SHIFT;
+ page = find_get_page(inode->i_mapping, offset);
+ if (!page || !page_has_buffers(page))
+ return EXT_CONTINUE;
+
+ bh = page_buffers(page);
+
+ if (!bh)
+ return EXT_CONTINUE;
+
+ if (buffer_delay(bh)) {
+ flags |= FIEMAP_EXTENT_DELALLOC;
+ page_cache_release(page);
+ } else {
+ page_cache_release(page);
+ return EXT_CONTINUE;
+ }
+ }
+
+ physical = (__u64)newex->ec_start << blksize_bits;
+ length = (__u64)newex->ec_len << blksize_bits;
+
+ if (ex && ext4_ext_is_uninitialized(ex))
+ flags |= FIEMAP_EXTENT_UNWRITTEN;
+
+ /*
+ * If this extent reaches EXT_MAX_BLOCK, it must be last.
+ *
+ * Or if ext4_ext_next_allocated_block is EXT_MAX_BLOCK,
+ * this also indicates no more allocated blocks.
+ *
+ * XXX this might miss a single-block extent at EXT_MAX_BLOCK
+ */
+ if (logical + length - 1 == EXT_MAX_BLOCK ||
+ ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK)
+ flags |= FIEMAP_EXTENT_LAST;
+
+ error = fiemap_fill_next_extent(fieinfo, logical, physical,
+ length, flags);
+ if (error < 0)
+ return error;
+ if (error == 1)
+ return EXT_BREAK;
+
+ return EXT_CONTINUE;
+}
+
+/* fiemap flags we can handle specified here */
+#define EXT4_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR)
+
+int ext4_xattr_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo)
+{
+ __u64 physical = 0;
+ __u64 length;
+ __u32 flags = FIEMAP_EXTENT_LAST;
+ int blockbits = inode->i_sb->s_blocksize_bits;
+ int error = 0;
+
+ /* in-inode? */
+ if (EXT4_I(inode)->i_state & EXT4_STATE_XATTR) {
+ struct ext4_iloc iloc;
+ int offset; /* offset of xattr in inode */
+
+ error = ext4_get_inode_loc(inode, &iloc);
+ if (error)
+ return error;
+ physical = iloc.bh->b_blocknr << blockbits;
+ offset = EXT4_GOOD_OLD_INODE_SIZE +
+ EXT4_I(inode)->i_extra_isize;
+ physical += offset;
+ length = EXT4_SB(inode->i_sb)->s_inode_size - offset;
+ flags |= FIEMAP_EXTENT_DATA_INLINE;
+ } else { /* external block */
+ physical = EXT4_I(inode)->i_file_acl << blockbits;
+ length = inode->i_sb->s_blocksize;
+ }
+
+ if (physical)
+ error = fiemap_fill_next_extent(fieinfo, 0, physical,
+ length, flags);
+ return (error < 0 ? error : 0);
+}
+
+int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ __u64 start, __u64 len)
+{
+ ext4_lblk_t start_blk;
+ ext4_lblk_t len_blks;
+ int error = 0;
+
+ /* fallback to generic here if not in extents fmt */
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ return generic_block_fiemap(inode, fieinfo, start, len,
+ ext4_get_block);
+
+ if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS))
+ return -EBADR;
+
+ if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) {
+ error = ext4_xattr_fiemap(inode, fieinfo);
+ } else {
+ start_blk = start >> inode->i_sb->s_blocksize_bits;
+ len_blks = len >> inode->i_sb->s_blocksize_bits;
+
+ /*
+ * Walk the extent tree gathering extent information.
+ * ext4_ext_fiemap_cb will push extents back to user.
+ */
+ down_write(&EXT4_I(inode)->i_data_sem);
+ error = ext4_ext_walk_space(inode, start_blk, len_blks,
+ ext4_ext_fiemap_cb, fieinfo);
+ up_write(&EXT4_I(inode)->i_data_sem);
+ }
+
+ return error;
+}
+
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 11b289f..8749d8e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -140,6 +140,9 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}

+extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ __u64 start, __u64 len);
+
const struct file_operations ext4_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -170,5 +173,6 @@ const struct inode_operations ext4_file_inode_operations = {
#endif
.permission = ext4_permission,
.fallocate = ext4_fallocate,
+ .fiemap = ext4_fiemap,
};

--
1.5.6.1.205.ge2c7.dirty


2008-09-13 18:49:26

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 3/4] generic block based fiemap implementation

From: Josef Bacik <[email protected]>

Any block based fs (this patch includes ext3) just has to declare its own
fiemap() function and then call this generic function with its own
get_block_t. This works well for block based filesystems that will map
multiple contiguous blocks at one time, but will work for filesystems that
only map one block at a time, you will just end up with an "extent" for each
block. One gotcha is this will not play nicely where there is hole+data
after the EOF. This function will assume its hit the end of the data as soon
as it hits a hole after the EOF, so if there is any data past that it will
not pick that up. AFAIK no block based fs does this anyway, but its in the
comments of the function anyway just in case.

Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Mark Fasheh <[email protected]>
---
fs/ext2/ext2.h | 2 +
fs/ext2/file.c | 1 +
fs/ext2/inode.c | 8 +++
fs/ext3/file.c | 1 +
fs/ext3/inode.c | 8 +++
fs/ioctl.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ext3_fs.h | 2 +
include/linux/fs.h | 3 +
8 files changed, 143 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 47d88da..bae998c 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -133,6 +133,8 @@ extern void ext2_truncate (struct inode *);
extern int ext2_setattr (struct dentry *, struct iattr *);
extern void ext2_set_inode_flags(struct inode *inode);
extern void ext2_get_inode_flags(struct ext2_inode_info *);
+extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 len);
int __ext2_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 5f2fa9c..45ed071 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -86,4 +86,5 @@ const struct inode_operations ext2_file_inode_operations = {
#endif
.setattr = ext2_setattr,
.permission = ext2_permission,
+ .fiemap = ext2_fiemap,
};
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 991d6df..7658b33 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -31,6 +31,7 @@
#include <linux/writeback.h>
#include <linux/buffer_head.h>
#include <linux/mpage.h>
+#include <linux/fiemap.h>
#include "ext2.h"
#include "acl.h"
#include "xip.h"
@@ -704,6 +705,13 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_

}

+int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 len)
+{
+ return generic_block_fiemap(inode, fieinfo, start, len,
+ ext2_get_block);
+}
+
static int ext2_writepage(struct page *page, struct writeback_control *wbc)
{
return block_write_full_page(page, ext2_get_block, wbc);
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index acc4913..3be1e06 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -134,5 +134,6 @@ const struct inode_operations ext3_file_inode_operations = {
.removexattr = generic_removexattr,
#endif
.permission = ext3_permission,
+ .fiemap = ext3_fiemap,
};

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 507d868..ebfec4d 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -36,6 +36,7 @@
#include <linux/mpage.h>
#include <linux/uio.h>
#include <linux/bio.h>
+#include <linux/fiemap.h>
#include "xattr.h"
#include "acl.h"

@@ -981,6 +982,13 @@ out:
return ret;
}

+int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 len)
+{
+ return generic_block_fiemap(inode, fieinfo, start, len,
+ ext3_get_block);
+}
+
/*
* `handle' can be NULL if create is zero
*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 274625a..682c252 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -13,6 +13,8 @@
#include <linux/security.h>
#include <linux/module.h>
#include <linux/uaccess.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>

#include <asm/ioctls.h>

@@ -223,6 +225,122 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
return error;
}

+#define blk_to_logical(inode, blk) (blk << (inode)->i_blkbits)
+#define logical_to_blk(inode, offset) (offset >> (inode)->i_blkbits);
+
+/*
+ * @inode - the inode to map
+ * @arg - the pointer to userspace where we copy everything to
+ * @get_block - the fs's get_block function
+ *
+ * This does FIEMAP for block based inodes. Basically it will just loop
+ * through get_block until we hit the number of extents we want to map, or we
+ * go past the end of the file and hit a hole.
+ *
+ * If it is possible to have data blocks beyond a hole past @inode->i_size, then
+ * please do not use this function, it will stop at the first unmapped block
+ * beyond i_size
+ */
+int generic_block_fiemap(struct inode *inode,
+ struct fiemap_extent_info *fieinfo, u64 start,
+ u64 len, get_block_t *get_block)
+{
+ struct buffer_head tmp;
+ unsigned int start_blk;
+ long long length = 0, map_len = 0;
+ u64 logical = 0, phys = 0, size = 0;
+ u32 flags = FIEMAP_EXTENT_MERGED;
+ int ret = 0;
+
+ if ((ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC)))
+ return ret;
+
+ start_blk = logical_to_blk(inode, start);
+
+ /* guard against change */
+ mutex_lock(&inode->i_mutex);
+
+ length = (long long)min_t(u64, len, i_size_read(inode));
+ map_len = length;
+
+ do {
+ /*
+ * we set b_size to the total size we want so it will map as
+ * many contiguous blocks as possible at once
+ */
+ memset(&tmp, 0, sizeof(struct buffer_head));
+ tmp.b_size = map_len;
+
+ ret = get_block(inode, start_blk, &tmp, 0);
+ if (ret)
+ break;
+
+ /* HOLE */
+ if (!buffer_mapped(&tmp)) {
+ /*
+ * first hole after going past the EOF, this is our
+ * last extent
+ */
+ if (length <= 0) {
+ flags = FIEMAP_EXTENT_MERGED|FIEMAP_EXTENT_LAST;
+ ret = fiemap_fill_next_extent(fieinfo, logical,
+ phys, size,
+ flags);
+ break;
+ }
+
+ length -= blk_to_logical(inode, 1);
+
+ /* if we have holes up to/past EOF then we're done */
+ if (length <= 0)
+ break;
+
+ start_blk++;
+ } else {
+ if (length <= 0 && size) {
+ ret = fiemap_fill_next_extent(fieinfo, logical,
+ phys, size,
+ flags);
+ if (ret)
+ break;
+ }
+
+ logical = blk_to_logical(inode, start_blk);
+ phys = blk_to_logical(inode, tmp.b_blocknr);
+ size = tmp.b_size;
+ flags = FIEMAP_EXTENT_MERGED;
+
+ length -= tmp.b_size;
+ start_blk += logical_to_blk(inode, size);
+
+ /*
+ * if we are past the EOF we need to loop again to see
+ * if there is a hole so we can mark this extent as the
+ * last one, and if not keep mapping things until we
+ * find a hole, or we run out of slots in the extent
+ * array
+ */
+ if (length <= 0)
+ continue;
+
+ ret = fiemap_fill_next_extent(fieinfo, logical, phys,
+ size, flags);
+ if (ret)
+ break;
+ }
+ cond_resched();
+ } while (1);
+
+ mutex_unlock(&inode->i_mutex);
+
+ /* if ret is 1 then we just hit the end of the extent array */
+ if (ret == 1)
+ ret = 0;
+
+ return ret;
+}
+EXPORT_SYMBOL(generic_block_fiemap);
+
static int file_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
{
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 80171ee..8120fa1 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -837,6 +837,8 @@ extern void ext3_truncate (struct inode *);
extern void ext3_set_inode_flags(struct inode *);
extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
+extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 len);

/* ioctl.c */
extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 194fb23..385c9a1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1998,6 +1998,9 @@ extern int vfs_fstat(unsigned int, struct kstat *);

extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
unsigned long arg);
+extern int generic_block_fiemap(struct inode *inode,
+ struct fiemap_extent_info *fieinfo, u64 start,
+ u64 len, get_block_t *get_block);

extern void get_filesystem(struct file_system_type *fs);
extern void put_filesystem(struct file_system_type *fs);
--
1.5.6.1.205.ge2c7.dirty


2008-09-13 18:49:25

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 2/4] ocfs2: fiemap support

From: Mark Fasheh <[email protected]>

Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be
refactored so that the extent cache can be skipped in favor of going
directly to the on-disk records. This makes it easier for us to determine
which extent is the last one in the btree. Also, I'm not sure we want to be
caching fiemap lookups anyway as they're not directly related to data
read/write.

Signed-off-by: Mark Fasheh <[email protected]>
---
fs/ocfs2/alloc.c | 9 --
fs/ocfs2/alloc.h | 9 ++
fs/ocfs2/extent_map.c | 346 +++++++++++++++++++++++++++++++++++++++++--------
fs/ocfs2/extent_map.h | 3 +
fs/ocfs2/file.c | 1 +
5 files changed, 306 insertions(+), 62 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 10bfb46..29ff57e 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -990,15 +990,6 @@ out:
}

/*
- * This is only valid for leaf nodes, which are the only ones that can
- * have empty extents anyway.
- */
-static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec)
-{
- return !rec->e_leaf_clusters;
-}
-
-/*
* This function will discard the rightmost extent record.
*/
static void ocfs2_shift_records_right(struct ocfs2_extent_list *el)
diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h
index 42ff94b..60cd3d5 100644
--- a/fs/ocfs2/alloc.h
+++ b/fs/ocfs2/alloc.h
@@ -146,4 +146,13 @@ static inline unsigned int ocfs2_rec_clusters(struct ocfs2_extent_list *el,
return le16_to_cpu(rec->e_leaf_clusters);
}

+/*
+ * This is only valid for leaf nodes, which are the only ones that can
+ * have empty extents anyway.
+ */
+static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec)
+{
+ return !rec->e_leaf_clusters;
+}
+
#endif /* OCFS2_ALLOC_H */
diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
index c58668a..aed268e 100644
--- a/fs/ocfs2/extent_map.c
+++ b/fs/ocfs2/extent_map.c
@@ -25,6 +25,7 @@
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/types.h>
+#include <linux/fiemap.h>

#define MLOG_MASK_PREFIX ML_EXTENT_MAP
#include <cluster/masklog.h>
@@ -32,6 +33,7 @@
#include "ocfs2.h"

#include "alloc.h"
+#include "dlmglue.h"
#include "extent_map.h"
#include "inode.h"
#include "super.h"
@@ -282,6 +284,51 @@ out:
kfree(new_emi);
}

+static int ocfs2_last_eb_is_empty(struct inode *inode,
+ struct ocfs2_dinode *di)
+{
+ int ret, next_free;
+ u64 last_eb_blk = le64_to_cpu(di->i_last_eb_blk);
+ struct buffer_head *eb_bh = NULL;
+ struct ocfs2_extent_block *eb;
+ struct ocfs2_extent_list *el;
+
+ ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), last_eb_blk,
+ &eb_bh, OCFS2_BH_CACHED, inode);
+ if (ret) {
+ mlog_errno(ret);
+ goto out;
+ }
+
+ eb = (struct ocfs2_extent_block *) eb_bh->b_data;
+ el = &eb->h_list;
+
+ if (!OCFS2_IS_VALID_EXTENT_BLOCK(eb)) {
+ ret = -EROFS;
+ OCFS2_RO_ON_INVALID_EXTENT_BLOCK(inode->i_sb, eb);
+ goto out;
+ }
+
+ if (el->l_tree_depth) {
+ ocfs2_error(inode->i_sb,
+ "Inode %lu has non zero tree depth in "
+ "leaf block %llu\n", inode->i_ino,
+ (unsigned long long)eb_bh->b_blocknr);
+ ret = -EROFS;
+ goto out;
+ }
+
+ next_free = le16_to_cpu(el->l_next_free_rec);
+
+ if (next_free == 0 ||
+ (next_free == 1 && ocfs2_is_empty_extent(&el->l_recs[0])))
+ ret = 1;
+
+out:
+ brelse(eb_bh);
+ return ret;
+}
+
/*
* Return the 1st index within el which contains an extent start
* larger than v_cluster.
@@ -373,42 +420,28 @@ out:
return ret;
}

-int ocfs2_get_clusters(struct inode *inode, u32 v_cluster,
- u32 *p_cluster, u32 *num_clusters,
- unsigned int *extent_flags)
+static int ocfs2_get_clusters_nocache(struct inode *inode,
+ struct buffer_head *di_bh,
+ u32 v_cluster, unsigned int *hole_len,
+ struct ocfs2_extent_rec *ret_rec,
+ unsigned int *is_last)
{
- int ret, i;
- unsigned int flags = 0;
- struct buffer_head *di_bh = NULL;
- struct buffer_head *eb_bh = NULL;
+ int i, ret, tree_height, len;
struct ocfs2_dinode *di;
- struct ocfs2_extent_block *eb;
+ struct ocfs2_extent_block *uninitialized_var(eb);
struct ocfs2_extent_list *el;
struct ocfs2_extent_rec *rec;
- u32 coff;
-
- if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
- ret = -ERANGE;
- mlog_errno(ret);
- goto out;
- }
-
- ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster,
- num_clusters, extent_flags);
- if (ret == 0)
- goto out;
+ struct buffer_head *eb_bh = NULL;

- ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno,
- &di_bh, OCFS2_BH_CACHED, inode);
- if (ret) {
- mlog_errno(ret);
- goto out;
- }
+ memset(ret_rec, 0, sizeof(*ret_rec));
+ if (is_last)
+ *is_last = 0;

di = (struct ocfs2_dinode *) di_bh->b_data;
el = &di->id2.i_list;
+ tree_height = le16_to_cpu(el->l_tree_depth);

- if (el->l_tree_depth) {
+ if (tree_height > 0) {
ret = ocfs2_find_leaf(inode, el, v_cluster, &eb_bh);
if (ret) {
mlog_errno(ret);
@@ -431,46 +464,143 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster,
i = ocfs2_search_extent_list(el, v_cluster);
if (i == -1) {
/*
- * A hole was found. Return some canned values that
- * callers can key on. If asked for, num_clusters will
- * be populated with the size of the hole.
+ * Holes can be larger than the maximum size of an
+ * extent, so we return their lengths in a seperate
+ * field.
*/
- *p_cluster = 0;
- if (num_clusters) {
+ if (hole_len) {
ret = ocfs2_figure_hole_clusters(inode, el, eb_bh,
- v_cluster,
- num_clusters);
+ v_cluster, &len);
if (ret) {
mlog_errno(ret);
goto out;
}
+
+ *hole_len = len;
}
- } else {
- rec = &el->l_recs[i];
+ goto out_hole;
+ }

- BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos));
+ rec = &el->l_recs[i];

- if (!rec->e_blkno) {
- ocfs2_error(inode->i_sb, "Inode %lu has bad extent "
- "record (%u, %u, 0)", inode->i_ino,
- le32_to_cpu(rec->e_cpos),
- ocfs2_rec_clusters(el, rec));
- ret = -EROFS;
- goto out;
+ BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos));
+
+ if (!rec->e_blkno) {
+ ocfs2_error(inode->i_sb, "Inode %lu has bad extent "
+ "record (%u, %u, 0)", inode->i_ino,
+ le32_to_cpu(rec->e_cpos),
+ ocfs2_rec_clusters(el, rec));
+ ret = -EROFS;
+ goto out;
+ }
+
+ *ret_rec = *rec;
+
+ /*
+ * Checking for last extent is potentially expensive - we
+ * might have to look at the next leaf over to see if it's
+ * empty.
+ *
+ * The first two checks are to see whether the caller even
+ * cares for this information, and if the extent is at least
+ * the last in it's list.
+ *
+ * If those hold true, then the extent is last if any of the
+ * additional conditions hold true:
+ * - Extent list is in-inode
+ * - Extent list is right-most
+ * - Extent list is 2nd to rightmost, with empty right-most
+ */
+ if (is_last) {
+ if (i == (le16_to_cpu(el->l_next_free_rec) - 1)) {
+ if (tree_height == 0)
+ *is_last = 1;
+ else if (eb->h_blkno == di->i_last_eb_blk)
+ *is_last = 1;
+ else if (eb->h_next_leaf_blk == di->i_last_eb_blk) {
+ ret = ocfs2_last_eb_is_empty(inode, di);
+ if (ret < 0) {
+ mlog_errno(ret);
+ goto out;
+ }
+ if (ret == 1)
+ *is_last = 1;
+ }
}
+ }
+
+out_hole:
+ ret = 0;
+out:
+ brelse(eb_bh);
+ return ret;
+}
+
+static void ocfs2_relative_extent_offsets(struct super_block *sb,
+ u32 v_cluster,
+ struct ocfs2_extent_rec *rec,
+ u32 *p_cluster, u32 *num_clusters)
+
+{
+ u32 coff = v_cluster - le32_to_cpu(rec->e_cpos);
+
+ *p_cluster = ocfs2_blocks_to_clusters(sb, le64_to_cpu(rec->e_blkno));
+ *p_cluster = *p_cluster + coff;
+
+ if (num_clusters)
+ *num_clusters = le16_to_cpu(rec->e_leaf_clusters) - coff;
+}
+
+int ocfs2_get_clusters(struct inode *inode, u32 v_cluster,
+ u32 *p_cluster, u32 *num_clusters,
+ unsigned int *extent_flags)
+{
+ int ret;
+ unsigned int uninitialized_var(hole_len), flags = 0;
+ struct buffer_head *di_bh = NULL;
+ struct ocfs2_extent_rec rec;

- coff = v_cluster - le32_to_cpu(rec->e_cpos);
+ if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+ ret = -ERANGE;
+ mlog_errno(ret);
+ goto out;
+ }

- *p_cluster = ocfs2_blocks_to_clusters(inode->i_sb,
- le64_to_cpu(rec->e_blkno));
- *p_cluster = *p_cluster + coff;
+ ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster,
+ num_clusters, extent_flags);
+ if (ret == 0)
+ goto out;

- if (num_clusters)
- *num_clusters = ocfs2_rec_clusters(el, rec) - coff;
+ ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno,
+ &di_bh, OCFS2_BH_CACHED, inode);
+ if (ret) {
+ mlog_errno(ret);
+ goto out;
+ }

- flags = rec->e_flags;
+ ret = ocfs2_get_clusters_nocache(inode, di_bh, v_cluster, &hole_len,
+ &rec, NULL);
+ if (ret) {
+ mlog_errno(ret);
+ goto out;
+ }

- ocfs2_extent_map_insert_rec(inode, rec);
+ if (rec.e_blkno == 0ULL) {
+ /*
+ * A hole was found. Return some canned values that
+ * callers can key on. If asked for, num_clusters will
+ * be populated with the size of the hole.
+ */
+ *p_cluster = 0;
+ if (num_clusters) {
+ *num_clusters = hole_len;
+ }
+ } else {
+ ocfs2_relative_extent_offsets(inode->i_sb, v_cluster, &rec,
+ p_cluster, num_clusters);
+ flags = rec.e_flags;
+
+ ocfs2_extent_map_insert_rec(inode, &rec);
}

if (extent_flags)
@@ -478,7 +608,6 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster,

out:
brelse(di_bh);
- brelse(eb_bh);
return ret;
}

@@ -521,3 +650,114 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno,
out:
return ret;
}
+
+static int ocfs2_fiemap_inline(struct inode *inode, struct buffer_head *di_bh,
+ struct fiemap_extent_info *fieinfo,
+ u64 map_start)
+{
+ int ret;
+ unsigned int id_count;
+ struct ocfs2_dinode *di;
+ u64 phys;
+ u32 flags = FIEMAP_EXTENT_DATA_INLINE|FIEMAP_EXTENT_LAST;
+ struct ocfs2_inode_info *oi = OCFS2_I(inode);
+
+ di = (struct ocfs2_dinode *)di_bh->b_data;
+ id_count = le16_to_cpu(di->id2.i_data.id_count);
+
+ if (map_start < id_count) {
+ phys = oi->ip_blkno << inode->i_sb->s_blocksize_bits;
+ phys += offsetof(struct ocfs2_dinode, id2.i_data.id_data);
+
+ ret = fiemap_fill_next_extent(fieinfo, 0, phys, id_count,
+ flags);
+ if (ret < 0)
+ return ret;
+ }
+
+ return 0;
+}
+
+#define OCFS2_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC)
+
+int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 map_start, u64 map_len)
+{
+ int ret, is_last;
+ u32 mapping_end, cpos;
+ unsigned int hole_size;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ u64 len_bytes, phys_bytes, virt_bytes;
+ struct buffer_head *di_bh = NULL;
+ struct ocfs2_extent_rec rec;
+
+ ret = fiemap_check_flags(fieinfo, OCFS2_FIEMAP_FLAGS);
+ if (ret)
+ return ret;
+
+ ret = ocfs2_inode_lock(inode, &di_bh, 0);
+ if (ret) {
+ mlog_errno(ret);
+ goto out;
+ }
+
+ down_read(&OCFS2_I(inode)->ip_alloc_sem);
+
+ /*
+ * Handle inline-data separately.
+ */
+ if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+ ret = ocfs2_fiemap_inline(inode, di_bh, fieinfo, map_start);
+ goto out_unlock;
+ }
+
+ cpos = map_start >> osb->s_clustersize_bits;
+ mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
+ map_start + map_len);
+ mapping_end -= cpos;
+ is_last = 0;
+ while (cpos < mapping_end && !is_last) {
+ u32 fe_flags;
+
+ ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
+ &hole_size, &rec, &is_last);
+ if (ret) {
+ mlog_errno(ret);
+ goto out;
+ }
+
+ if (rec.e_blkno == 0ULL) {
+ cpos += hole_size;
+ continue;
+ }
+
+ fe_flags = 0;
+ if (rec.e_flags & OCFS2_EXT_UNWRITTEN)
+ fe_flags |= FIEMAP_EXTENT_UNWRITTEN;
+ if (is_last)
+ fe_flags |= FIEMAP_EXTENT_LAST;
+ len_bytes = (u64)le16_to_cpu(rec.e_leaf_clusters) << osb->s_clustersize_bits;
+ phys_bytes = le64_to_cpu(rec.e_blkno) << osb->sb->s_blocksize_bits;
+ virt_bytes = (u64)le32_to_cpu(rec.e_cpos) << osb->s_clustersize_bits;
+
+ ret = fiemap_fill_next_extent(fieinfo, virt_bytes, phys_bytes,
+ len_bytes, fe_flags);
+ if (ret)
+ break;
+
+ cpos = le32_to_cpu(rec.e_cpos)+ le16_to_cpu(rec.e_leaf_clusters);
+ }
+
+ if (ret > 0)
+ ret = 0;
+
+out_unlock:
+ brelse(di_bh);
+
+ up_read(&OCFS2_I(inode)->ip_alloc_sem);
+
+ ocfs2_inode_unlock(inode, 0);
+out:
+
+ return ret;
+}
diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
index de91e3e..1b97490 100644
--- a/fs/ocfs2/extent_map.h
+++ b/fs/ocfs2/extent_map.h
@@ -50,4 +50,7 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, u32 *p_cluster,
int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno,
u64 *ret_count, unsigned int *extent_flags);

+int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 map_start, u64 map_len);
+
#endif /* _EXTENT_MAP_H */
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index ec2ed15..ed38796 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2228,6 +2228,7 @@ const struct inode_operations ocfs2_file_iops = {
.getattr = ocfs2_getattr,
.permission = ocfs2_permission,
.fallocate = ocfs2_fallocate,
+ .fiemap = ocfs2_fiemap,
};

const struct inode_operations ocfs2_special_file_iops = {
--
1.5.6.1.205.ge2c7.dirty


2008-09-13 18:49:24

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 1/4] vfs: vfs-level fiemap interface

From: Mark Fasheh <[email protected]>

Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
inode operation.

Userspace can get extent information on a file via fiemap ioctl. As input,
the fiemap ioctl takes a struct fiemap which includes an array of struct
fiemap_extent (fm_extents). Size of the extent array is passed as
fm_extent_count and number of extents returned will be written into
fm_mapped_extents. Offset and length fields on the fiemap structure
(fm_start, fm_length) describe a logical range which will be searched for
extents. All extents returned will at least partially contain this range.
The actual extent offsets and ranges returned will be unmodified from their
offset and range on-disk.

The fiemap ioctl returns '0' on success. On error, -1 is returned and errno
is set. If errno is equal to EBADR, then fm_flags will contain those flags
which were passed in which the kernel did not understand. On all other
errors, the contents of fm_extents is undefined.

As fiemap evolved, there have been many authors of the vfs patch. As far as
I can tell, the list includes:
Kalpak Shah <[email protected]>
Andreas Dilger <[email protected]>
Eric Sandeen <[email protected]>
Mark Fasheh <[email protected]>

Signed-off-by: Mark Fasheh <[email protected]>
---
Documentation/filesystems/fiemap.txt | 216 ++++++++++++++++++++++++++++++++++
fs/ioctl.c | 154 ++++++++++++++++++++++++
include/linux/fiemap.h | 62 ++++++++++
include/linux/fs.h | 18 +++
4 files changed, 450 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/fiemap.txt
create mode 100644 include/linux/fiemap.h

diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt
new file mode 100644
index 0000000..5e95e8f
--- /dev/null
+++ b/Documentation/filesystems/fiemap.txt
@@ -0,0 +1,216 @@
+============
+Fiemap Ioctl
+============
+
+The fiemap ioctl is an efficient method for userspace to get file
+extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
+returns a list of extents.
+
+
+Request Basics
+--------------
+
+A fiemap request is encoded within struct fiemap:
+
+struct fiemap {
+ __u64 fm_start; /* logical offset (inclusive) at
+ * which to start mapping (in) */
+ __u64 fm_length; /* logical length of mapping which
+ * userspace cares about (in) */
+ __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
+ __u32 fm_mapped_extents; /* number of extents that were
+ * mapped (out) */
+ __u32 fm_extent_count; /* size of fm_extents array (in) */
+ __u32 fm_reserved;
+ struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
+};
+
+
+fm_start, and fm_length specify the logical range within the file
+which the process would like mappings for. Extents returned mirror
+those on disk - that is, the logical offset of the 1st returned extent
+may start before fm_start, and the range covered by the last returned
+extent may end after fm_length. All offsets and lengths are in bytes.
+
+Certain flags to modify the way in which mappings are looked up can be
+set in fm_flags. If the kernel doesn't understand some particular
+flags, it will return EBADR and the contents of fm_flags will contain
+the set of flags which caused the error. If the kernel is compatible
+with all flags passed, the contents of fm_flags will be unmodified.
+It is up to userspace to determine whether rejection of a particular
+flag is fatal to it's operation. This scheme is intended to allow the
+fiemap interface to grow in the future but without losing
+compatibility with old software.
+
+fm_extent_count specifies the number of elements in the fm_extents[] array
+that can be used to return extents. If fm_extent_count is zero, then the
+fm_extents[] array is ignored (no extents will be returned), and the
+fm_mapped_extents count will hold the number of extents needed in
+fm_extents[] to hold the file's current mapping. Note that there is
+nothing to prevent the file from changing between calls to FIEMAP.
+
+Currently, there are three flags which can be set in fm_flags:
+
+* FIEMAP_FLAG_SYNC
+If this flag is set, the kernel will sync the file before mapping extents.
+
+* FIEMAP_FLAG_XATTR
+If this flag is set, the extents returned will describe the inodes
+extended attribute lookup tree, instead of it's data tree.
+
+
+Extent Mapping
+--------------
+
+Extent information is returned within the embedded fm_extents array
+which userspace must allocate along with the fiemap structure. The
+number of elements in the fiemap_extents[] array should be passed via
+fm_extent_count. The number of extents mapped by kernel will be
+returned via fm_mapped_extents. If the number of fiemap_extents
+allocated is less than would be required to map the requested range,
+the maximum number of extents that can be mapped in the fm_extent[]
+array will be returned and fm_mapped_extents will be equal to
+fm_extent_count. In that case, the last extent in the array will not
+complete the requested range and will not have the FIEMAP_EXTENT_LAST
+flag set (see the next section on extent flags).
+
+Each extent is described by a single fiemap_extent structure as
+returned in fm_extents.
+
+struct fiemap_extent {
+ __u64 fe_logical; /* logical offset in bytes for the start of
+ * the extent */
+ __u64 fe_physical; /* physical offset in bytes for the start
+ * of the extent */
+ __u64 fe_length; /* length in bytes for the extent */
+ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
+ __u32 fe_reserved;
+};
+
+All offsets and lengths are in bytes and mirror those on disk. It is valid
+for an extents logical offset to start before the request or it's logical
+length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is
+returned, fe_logical, fe_physical, and fe_length will be aligned to the
+block size of the file system. With the exception of extents flagged as
+FIEMAP_EXTENT_MERGED, adjacent extents will not be merged.
+
+The fe_flags field contains flags which describe the extent returned.
+A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in
+the file so that the process making fiemap calls can determine when no
+more extents are available, without having to call the ioctl again.
+
+Some flags are intentionally vague and will always be set in the
+presence of other more specific flags. This way a program looking for
+a general property does not have to know all existing and future flags
+which imply that property.
+
+For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
+are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
+for inline or tail-packed data can key on the specific flag. Software
+which simply cares not to try operating on non-aligned extents
+however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
+worry about all present and future flags which might imply unaligned
+data. Note that the opposite is not true - it would be valid for
+FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
+
+* FIEMAP_EXTENT_LAST
+This is the last extent in the file. A mapping attempt past this
+extent will return nothing.
+
+* FIEMAP_EXTENT_UNKNOWN
+The location of this extent is currently unknown. This may indicate
+the data is stored on an inaccessible volume or that no storage has
+been allocated for the file yet.
+
+* FIEMAP_EXTENT_DELALLOC
+ - This will also set FIEMAP_EXTENT_UNKNOWN.
+Delayed allocation - while there is data for this extent, it's
+physical location has not been allocated yet.
+
+* FIEMAP_EXTENT_NO_BYPASS
+Direct access to the data in this extent is illegal or will have
+undefined results.
+
+* FIEMAP_EXTENT_DATA_ENCRYPTED
+ - This will also set FIEMAP_EXTENT_NO_BYPASS
+The data in this extent has been encrypted by the file system.
+
+* FIEMAP_EXTENT_NOT_ALIGNED
+Extent offsets and length are not guaranteed to be block aligned.
+
+* FIEMAP_EXTENT_DATA_INLINE
+ This will also set FIEMAP_EXTENT_NOT_ALIGNED
+Data is located within a meta data block.
+
+* FIEMAP_EXTENT_DATA_TAIL
+ This will also set FIEMAP_EXTENT_NOT_ALIGNED
+Data is packed into a block with data from other files.
+
+* FIEMAP_EXTENT_UNWRITTEN
+Unwritten extent - the extent is allocated but it's data has not been
+initialized. This indicates the extent's data will be all zero if read
+through the filesystem but the contents are undefined if read directly from
+the device.
+
+* FIEMAP_EXTENT_MERGED
+This will be set when a file does not support extents, i.e., it uses a block
+based addressing scheme. Since returning an extent for each block back to
+userspace would be highly inefficient, the kernel will try to merge most
+adjacent blocks into 'extents'.
+
+
+VFS -> File System Implementation
+---------------------------------
+
+File systems wishing to support fiemap must implement a ->fiemap callback on
+their inode_operations structure. The fs ->fiemap call is responsible for
+defining it's set of supported fiemap flags, and calling a helper function on
+each discovered extent:
+
+struct inode_operations {
+ ...
+
+ int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
+ u64 len);
+
+->fiemap is passed struct fiemap_extent_info which describes the
+fiemap request:
+
+struct fiemap_extent_info {
+ unsigned int fi_flags; /* Flags as passed from user */
+ unsigned int fi_extents_mapped; /* Number of mapped extents */
+ unsigned int fi_extents_max; /* Size of fiemap_extent array */
+ struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
+};
+
+It is intended that the file system should not need to access any of this
+structure directly.
+
+
+Flag checking should be done at the beginning of the ->fiemap callback via the
+fiemap_check_flags() helper:
+
+int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
+
+The struct fieinfo should be passed in as recieved from ioctl_fiemap(). The
+set of fiemap flags which the fs understands should be passed via fs_flags. If
+fiemap_check_flags finds invalid user flags, it will place the bad values in
+fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from
+fiemap_check_flags(), it should immediately exit, returning that error back to
+ioctl_fiemap().
+
+
+For each extent in the request range, the file system should call
+the helper function, fiemap_fill_next_extent():
+
+int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
+ u64 phys, u64 len, u32 flags, u32 dev);
+
+fiemap_fill_next_extent() will use the passed values to populate the
+next free extent in the fm_extents array. 'General' extent flags will
+automatically be set from specific flags on behalf of the calling file
+system so that the userspace API is not broken.
+
+fiemap_fill_next_extent() returns 0 on success, and 1 when the
+user-supplied fm_extents array is full. If an error is encountered
+while copying the extent to user memory, -EFAULT will be returned.
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7db32b3..274625a 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -16,6 +16,9 @@

#include <asm/ioctls.h>

+/* So that the fiemap access checks can't overflow on 32 bit machines. */
+#define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent))
+
/**
* vfs_ioctl - call filesystem specific ioctl methods
* @filp: open file to invoke ioctl method on
@@ -71,6 +74,155 @@ static int ioctl_fibmap(struct file *filp, int __user *p)
return put_user(res, p);
}

+/**
+ * fiemap_fill_next_extent - Fiemap helper function
+ * @fieinfo: Fiemap context passed into ->fiemap
+ * @logical: Extent logical start offset, in bytes
+ * @phys: Extent physical start offset, in bytes
+ * @len: Extent length, in bytes
+ * @flags: FIEMAP_EXTENT flags that describe this extent
+ *
+ * Called from file system ->fiemap callback. Will populate extent
+ * info as passed in via arguments and copy to user memory. On
+ * success, extent count on fieinfo is incremented.
+ *
+ * Returns 0 on success, -errno on error, 1 if this was the last
+ * extent that will fit in user array.
+ */
+#define SET_UNKNOWN_FLAGS (FIEMAP_EXTENT_DELALLOC)
+#define SET_NO_BYPASS_FLAGS (FIEMAP_EXTENT_DATA_ENCRYPTED)
+#define SET_NOT_ALIGNED_FLAGS (FIEMAP_EXTENT_DATA_TAIL|FIEMAP_EXTENT_DATA_INLINE)
+int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical,
+ u64 phys, u64 len, u32 flags)
+{
+ struct fiemap_extent extent;
+ struct fiemap_extent *dest = fieinfo->fi_extents_start;
+
+ /* only count the extents */
+ if (fieinfo->fi_extents_max == 0) {
+ fieinfo->fi_extents_mapped++;
+ return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
+ }
+
+ if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
+ return 1;
+
+ if (flags & SET_UNKNOWN_FLAGS)
+ flags |= FIEMAP_EXTENT_UNKNOWN;
+ if (flags & SET_NO_BYPASS_FLAGS)
+ flags |= FIEMAP_EXTENT_NO_BYPASS;
+ if (flags & SET_NOT_ALIGNED_FLAGS)
+ flags |= FIEMAP_EXTENT_NOT_ALIGNED;
+
+ extent.fe_logical = logical;
+ extent.fe_physical = phys;
+ extent.fe_length = len;
+ extent.fe_flags = flags;
+
+ dest += fieinfo->fi_extents_mapped;
+ if (copy_to_user(dest, &extent, sizeof(extent)))
+ return -EFAULT;
+
+ fieinfo->fi_extents_mapped++;
+ if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
+ return 1;
+ return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
+}
+EXPORT_SYMBOL(fiemap_fill_next_extent);
+
+/**
+ * fiemap_check_flags - check validity of requested flags for fiemap
+ * @fieinfo: Fiemap context passed into ->fiemap
+ * @fs_flags: Set of fiemap flags that the file system understands
+ *
+ * Called from file system ->fiemap callback. This will compute the
+ * intersection of valid fiemap flags and those that the fs supports. That
+ * value is then compared against the user supplied flags. In case of bad user
+ * flags, the invalid values will be written into the fieinfo structure, and
+ * -EBADR is returned, which tells ioctl_fiemap() to return those values to
+ * userspace. For this reason, a return code of -EBADR should be preserved.
+ *
+ * Returns 0 on success, -EBADR on bad flags.
+ */
+int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags)
+{
+ u32 incompat_flags;
+
+ incompat_flags = fieinfo->fi_flags & ~(FIEMAP_FLAGS_COMPAT & fs_flags);
+ if (incompat_flags) {
+ fieinfo->fi_flags = incompat_flags;
+ return -EBADR;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(fiemap_check_flags);
+
+static int fiemap_check_ranges(struct super_block *sb,
+ u64 start, u64 len, u64 *new_len)
+{
+ *new_len = len;
+
+ if (len == 0)
+ return -EINVAL;
+
+ if (start > sb->s_maxbytes)
+ return -EFBIG;
+
+ /*
+ * Shrink request scope to what the fs can actually handle.
+ */
+ if ((len > sb->s_maxbytes) ||
+ (sb->s_maxbytes - len) < start)
+ *new_len = sb->s_maxbytes - start;
+
+ return 0;
+}
+
+static int ioctl_fiemap(struct file *filp, unsigned long arg)
+{
+ struct fiemap fiemap;
+ struct fiemap_extent_info fieinfo = { 0, };
+ struct inode *inode = filp->f_path.dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
+ u64 len;
+ int error;
+
+ if (!inode->i_op->fiemap)
+ return -EOPNOTSUPP;
+
+ if (copy_from_user(&fiemap, (struct fiemap __user *)arg,
+ sizeof(struct fiemap)))
+ return -EFAULT;
+
+ if (fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS)
+ return -EINVAL;
+
+ error = fiemap_check_ranges(sb, fiemap.fm_start, fiemap.fm_length,
+ &len);
+ if (error)
+ return error;
+
+ fieinfo.fi_flags = fiemap.fm_flags;
+ fieinfo.fi_extents_max = fiemap.fm_extent_count;
+ fieinfo.fi_extents_start = (struct fiemap_extent *)(arg + sizeof(fiemap));
+
+ if (fiemap.fm_extent_count != 0 &&
+ !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
+ fieinfo.fi_extents_max * sizeof(struct fiemap_extent)))
+ return -EFAULT;
+
+ if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC)
+ filemap_write_and_wait(inode->i_mapping);
+
+ error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len);
+ fiemap.fm_flags = fieinfo.fi_flags;
+ fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped;
+ if (copy_to_user((char *)arg, &fiemap, sizeof(fiemap)))
+ error = -EFAULT;
+
+ return error;
+}
+
static int file_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg)
{
@@ -80,6 +232,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
switch (cmd) {
case FIBMAP:
return ioctl_fibmap(filp, p);
+ case FS_IOC_FIEMAP:
+ return ioctl_fiemap(filp, arg);
case FIGETBSZ:
return put_user(inode->i_sb->s_blocksize, p);
case FIONREAD:
diff --git a/include/linux/fiemap.h b/include/linux/fiemap.h
new file mode 100644
index 0000000..29e5862
--- /dev/null
+++ b/include/linux/fiemap.h
@@ -0,0 +1,62 @@
+/*
+ * FS_IOC_FIEMAP ioctl infrastructure.
+ *
+ * Some portions copyright (C) 2007 Cluster File Systems, Inc
+ *
+ * Authors: Mark Fasheh <[email protected]>
+ * Kalpak Shah <[email protected]>
+ * Andreas Dilger <[email protected]>
+ */
+
+#ifndef _LINUX_FIEMAP_H
+#define _LINUX_FIEMAP_H
+
+struct fiemap_extent {
+ __u64 fe_logical; /* logical offset in bytes for the start of
+ * the extent from the beginning of the file */
+ __u64 fe_physical; /* physical offset in bytes for the start
+ * of the extent from the beginning of the disk */
+ __u64 fe_length; /* length in bytes for this extent */
+ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
+ __u32 fe_reserved;
+};
+
+struct fiemap {
+ __u64 fm_start; /* logical offset (inclusive) at
+ * which to start mapping (in) */
+ __u64 fm_length; /* logical length of mapping which
+ * userspace wants (in) */
+ __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
+ __u32 fm_mapped_extents;/* number of extents that were mapped (out) */
+ __u32 fm_extent_count; /* size of fm_extents array (in) */
+ __u32 fm_reserved;
+ struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
+};
+
+#define FIEMAP_MAX_OFFSET (~0ULL)
+
+#define FIEMAP_FLAG_SYNC 0x00000001 /* sync file data before map */
+#define FIEMAP_FLAG_XATTR 0x00000002 /* map extended attribute tree */
+
+#define FIEMAP_FLAGS_COMPAT (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_XATTR)
+
+#define FIEMAP_EXTENT_LAST 0x00000001 /* Last extent in file. */
+#define FIEMAP_EXTENT_UNKNOWN 0x00000002 /* Data location unknown. */
+#define FIEMAP_EXTENT_DELALLOC 0x00000004 /* Location still pending.
+ * Sets EXTENT_UNKNOWN. */
+#define FIEMAP_EXTENT_NO_BYPASS 0x00000008 /* Data mapping undefined */
+#define FIEMAP_EXTENT_DATA_ENCRYPTED 0x00000080 /* Data is encrypted by fs.
+ * Sets EXTENT_NO_BYPASS. */
+#define FIEMAP_EXTENT_NOT_ALIGNED 0x00000100 /* Extent offsets may not be
+ * block aligned. */
+#define FIEMAP_EXTENT_DATA_INLINE 0x00000200 /* Data mixed with metadata.
+ * Sets EXTENT_NOT_ALIGNED.*/
+#define FIEMAP_EXTENT_DATA_TAIL 0x00000400 /* Multiple files in block.
+ * Sets EXTENT_NOT_ALIGNED.*/
+#define FIEMAP_EXTENT_UNWRITTEN 0x00000800 /* Space allocated, but
+ * no data (i.e. zero). */
+#define FIEMAP_EXTENT_MERGED 0x00001000 /* File does not natively
+ * support extents. Result
+ * merged for efficiency. */
+
+#endif /* _LINUX_FIEMAP_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 580b513..194fb23 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -231,6 +231,7 @@ extern int dir_notify_enable;
#define FS_IOC_SETFLAGS _IOW('f', 2, long)
#define FS_IOC_GETVERSION _IOR('v', 1, long)
#define FS_IOC_SETVERSION _IOW('v', 2, long)
+#define FS_IOC_FIEMAP _IOWR('f', 11, struct fiemap)
#define FS_IOC32_GETFLAGS _IOR('f', 1, int)
#define FS_IOC32_SETFLAGS _IOW('f', 2, int)
#define FS_IOC32_GETVERSION _IOR('v', 1, int)
@@ -291,6 +292,7 @@ extern int dir_notify_enable;
#include <linux/mutex.h>
#include <linux/capability.h>
#include <linux/semaphore.h>
+#include <linux/fiemap.h>

#include <asm/atomic.h>
#include <asm/byteorder.h>
@@ -1179,6 +1181,20 @@ extern void dentry_unhash(struct dentry *dentry);
extern int file_permission(struct file *, int);

/*
+ * VFS FS_IOC_FIEMAP helper definitions.
+ */
+struct fiemap_extent_info {
+ unsigned int fi_flags; /* Flags as passed from user */
+ unsigned int fi_extents_mapped; /* Number of mapped extents */
+ unsigned int fi_extents_max; /* Size of fiemap_extent array */
+ struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent
+ * array */
+};
+int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
+ u64 phys, u64 len, u32 flags);
+int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
+
+/*
* File types
*
* NOTE! These match bits 12..15 of stat.st_mode
@@ -1287,6 +1303,8 @@ struct inode_operations {
void (*truncate_range)(struct inode *, loff_t, loff_t);
long (*fallocate)(struct inode *inode, int mode, loff_t offset,
loff_t len);
+ int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
+ u64 len);
};

struct seq_file;
--
1.5.6.1.205.ge2c7.dirty


2008-09-13 21:29:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 13, 2008 at 10:17:07PM +0100, Anton Altaparmakov wrote:
> Hi,
>
> Just a quick question (not a criticism of the patches in any way):
>
> Is there a specific reason that the flag FIEMAP_EXTENT_DATA_ENCRYPTED
> exists? I am only asking because there isn't a
> FIEMAP_EXTENT_DATA_COMPRESSED flag yet the effect on the data is the
> same. I suppose one could argue that compression is a form of
> encryption (as one cannot read the data on disk without decompressing
> it) and thus for compressed files I can just set
> FIEMAP_EXTENT_DATA_ENCRYPTED, too?

You could, but actually, the interface isn't really set up well to
deal with compressed files. In order to deal with compressed data you
really would need two fields in struct fiemap_extent:
fe_length_physical, fe_length_logical, like this:

struct fiemap_extent {
__u64 fe_logical; /* logical offset in bytes for the start of
* the extent from the beginning of the file */
__u64 fe_physical; /* physical offset in bytes for the start
* of the extent from the beginning of the disk */
__u64 fe_length; /* length in bytes for this extent */
__u64 fe_length_physical; /* length in bytes for this extent as
* encoded/compressed on disk */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_reserved;
};

Part of the problem/frustration as we go around and round fs-devel is
that some people would request new features to support new
filesystems, and other people have been rejecting new features because
it makes the interface more complicated.

In fact, the previous version of the patch had a flag for compression,
but Cristoph Hellwig objected because he claimed there were no users
for it.

I don't care how this interface ends up, at least as far as this point
is concerned, but it's really annoying when some people want to add
features, and other people are objecting and trying to remove
features, and months and months and months go by while the patch
languishes while people play tug of war on the d*mned thing.

Probably the best thing to do is to add a much larger number of
reserved fields, of size __u32 and __u64, appropriately aligned, and
then when someone adds a compressed filesystem, we can add the
structure definition without changing the size of the structure.
Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future-
users-of-the-interface nazi's will strike again, and more months will
go by with absolutely no progress....

- Ted

2008-09-13 21:17:07

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

Hi,

Just a quick question (not a criticism of the patches in any way):

Is there a specific reason that the flag FIEMAP_EXTENT_DATA_ENCRYPTED
exists? I am only asking because there isn't a
FIEMAP_EXTENT_DATA_COMPRESSED flag yet the effect on the data is the
same. I suppose one could argue that compression is a form of
encryption (as one cannot read the data on disk without decompressing
it) and thus for compressed files I can just set
FIEMAP_EXTENT_DATA_ENCRYPTED, too?

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


2008-09-13 21:45:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 13, 2008 at 05:29:03PM -0400, Theodore Tso wrote:
> Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future-
> users-of-the-interface nazi's will strike again, and more months will
> go by with absolutely no progress....

Ted, I appreciate you're feeling frustration, but there's no call for
this kind of language. Please moderate your tone.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-09-13 21:59:43

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

Hi,

On 13 Sep 2008, at 22:29, Theodore Tso wrote:
> On Sat, Sep 13, 2008 at 10:17:07PM +0100, Anton Altaparmakov wrote:
>> Just a quick question (not a criticism of the patches in any way):
>>
>> Is there a specific reason that the flag FIEMAP_EXTENT_DATA_ENCRYPTED
>> exists? I am only asking because there isn't a
>> FIEMAP_EXTENT_DATA_COMPRESSED flag yet the effect on the data is the
>> same. I suppose one could argue that compression is a form of
>> encryption (as one cannot read the data on disk without decompressing
>> it) and thus for compressed files I can just set
>> FIEMAP_EXTENT_DATA_ENCRYPTED, too?
>
> You could, but actually, the interface isn't really set up well to
> deal with compressed files. In order to deal with compressed data you
> really would need two fields in struct fiemap_extent:
> fe_length_physical, fe_length_logical, like this:
>
> struct fiemap_extent {
> __u64 fe_logical; /* logical offset in bytes for the start of
> * the extent from the beginning of the file */
> __u64 fe_physical; /* physical offset in bytes for the start
> * of the extent from the beginning of the disk */
> __u64 fe_length; /* length in bytes for this extent */
> __u64 fe_length_physical; /* length in bytes for this extent as
> * encoded/compressed on disk */
> __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
> __u32 fe_reserved;
> };

Yes, that could make sense for compressed files although for NTFS at
least it does not matter because the logical regions which do not
exist on disk (due to compression having made the corresponding
physical regions smaller by at least one block) are present in the
extent list as sparse extents, so the extents for a compressed file
could look like this for example (lengths are in blocks):

logical ofs physical ofs length
0 X 14
14 SPARSE 2
16 Y 4
20 SPARSE 12
32 Z 9
41 SPARSE 7
... ... ...

As you may have noticed from the above example in NTFS each block of
16 blocks is compressed independently, hence the sparse extents
interspersed with the compressed extents to bring the logical offset
in line with the physical offset.

> Part of the problem/frustration as we go around and round fs-devel is
> that some people would request new features to support new
> filesystems, and other people have been rejecting new features because
> it makes the interface more complicated.
>
> In fact, the previous version of the patch had a flag for compression,
> but Cristoph Hellwig objected because he claimed there were no users
> for it.

He is plain wrong then! NTFS supports compressed files and always has
done! We only support them read-only at present but that does not
matter for the purposes of FIEMAP. And in any case one day we will
have write support for compressed files, too...

> I don't care how this interface ends up, at least as far as this point
> is concerned, but it's really annoying when some people want to add
> features, and other people are objecting and trying to remove
> features, and months and months and months go by while the patch
> languishes while people play tug of war on the d*mned thing.
>
> Probably the best thing to do is to add a much larger number of
> reserved fields, of size __u32 and __u64, appropriately aligned, and
> then when someone adds a compressed filesystem, we can add the
> structure definition without changing the size of the structure.
> Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future-
> users-of-the-interface nazi's will strike again, and more months will
> go by with absolutely no progress....


We already have two file systems in the mainline kernel tree that
support transparent compression/decompression: ntfs and zisofs so
there is no "future" anywhere in sight... (-;

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


2008-09-13 22:42:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 13, 2008 at 03:45:33PM -0600, Matthew Wilcox wrote:
> On Sat, Sep 13, 2008 at 05:29:03PM -0400, Theodore Tso wrote:
> > Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future-
> > users-of-the-interface nazi's will strike again, and more months will
> > go by with absolutely no progress....
>
> Ted, I appreciate you're feeling frustration, but there's no call for
> this kind of language. Please moderate your tone.

Well, being polite and us trying to be constructive hasn't gotten us
anywhere, except getting (completely groundless) accusations that one
of the ext4 developers was trying to "sneak this in" for the benefit
of his company --- and months and months and months of time wasted.

I really am beginning to wish that I had ignored Cristoph's whinging
and just pushed an ext4-specific ioctl to Linus, instead of trying to
develop a filesystem-generic interface. It's just an ioctl. Let's
get this done, and move on. If we need to allocate another ioctl, we
can do so. Heck, we have five or six different stat interfaces...

I have no problem adding support so we can support filesystems with
compressed extents. What I don't want to have happen is one patch
version where we add a feature, and then get another complaint from
someone else demanding that we remove that self-same feature from the
interface, repeating, ad infinitum. We shouldn't tolerate this kind
of sh*t.

- Ted

2008-09-14 13:47:13

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 13, 2008 at 02:49:24PM -0400, Theodore Ts'o wrote:
> +* FIEMAP_EXTENT_NO_BYPASS
> +Direct access to the data in this extent is illegal or will have
> +undefined results.

This one is sitll misnamed and for sure utterly misdocumented. Direct
access to the data is always illegal and has undefined results, and
programs that do it anyway (e.g. grub) are in really deep trouble.

I'm also not sure what happened to my request a few rounds ago to Cc
Michael Kerrisk on the thread. He's been writing manpages for all
important kernel interfaces, and his review resulting from that has been
extremly useful.

2008-09-14 13:48:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 13, 2008 at 05:29:03PM -0400, Theodore Tso wrote:
> Part of the problem/frustration as we go around and round fs-devel is
> that some people would request new features to support new
> filesystems, and other people have been rejecting new features because
> it makes the interface more complicated.
>
> In fact, the previous version of the patch had a flag for compression,
> but Cristoph Hellwig objected because he claimed there were no users
> for it.

Let's make it clear, I've said to not add it unless we have users. And
What Anton brought up is exactly the reason for that - to support
encrypted extents we actually need more information in the structure.
That's why we need to have this broad and sometimes a little slow
discussion on fsdevel instead of just rushing in some flag for future
use that won't make any sense in the end.

(and btw, it's Christoph)


2008-09-14 13:51:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 13, 2008 at 10:59:40PM +0100, Anton Altaparmakov wrote:
>> In fact, the previous version of the patch had a flag for compression,
>> but Cristoph Hellwig objected because he claimed there were no users
>> for it.
>
> He is plain wrong then! NTFS supports compressed files and always has
> done! We only support them read-only at present but that does not
> matter for the purposes of FIEMAP. And in any case one day we will have
> write support for compressed files, too...

Note that I didn't object it because there are not user, this is just
Ted making up BS because he's frustrated. I said don't add it until we
actually have users so we know their requirements. The nice part about
this interface is that there is no problem adding more flags once we
need them, if it's just flags, and if we need mor information than just
the flags we're better of discussing these bits beforehand.

> We already have two file systems in the mainline kernel tree that
> support transparent compression/decompression: ntfs and zisofs so there
> is no "future" anywhere in sight... (-;

or cramfs. But no one involved with this patch srries bothered actually
looking into the requirements.


2008-09-14 18:02:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sun, Sep 14, 2008 at 09:47:11AM -0400, Christoph Hellwig wrote:
> On Sat, Sep 13, 2008 at 02:49:24PM -0400, Theodore Ts'o wrote:
> > +* FIEMAP_EXTENT_NO_BYPASS
> > +Direct access to the data in this extent is illegal or will have
> > +undefined results.
>
> This one is sitll misnamed and for sure utterly misdocumented. Direct
> access to the data is always illegal and has undefined results, and
> programs that do it anyway (e.g. grub) are in really deep trouble.

Programs like Grub *have* to. And in the case of a boot-loader, Lilo
and Grub have been able to do so safely for over 15 years. This is
basically a generic flag that indicates that they can't, but it
reflects the reality that for many filesystems, they *can*. The main
legal use is LILO and/or Grub, in fact. One could argue that programs
that try accessing data blocks directly while the filesystem is
mounted are doomed, but boot loaders do so while the filesystem are
unmounted.

The name isn't particularly important, but what it indicates is very
clearly useful.

- Ted



2008-09-14 18:02:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sun, Sep 14, 2008 at 09:48:59AM -0400, Christoph Hellwig wrote:
> Let's make it clear, I've said to not add it unless we have users. And
> What Anton brought up is exactly the reason for that - to support
> encrypted extents we actually need more information in the structure.
> That's why we need to have this broad and sometimes a little slow
> discussion on fsdevel instead of just rushing in some flag for future
> use that won't make any sense in the end.

You're incorrect here. Encrypted extents does not require any
additional information in the structure. Compressed extents are a bit
more useful if we allow the the filesystem to return the amount of
space used on the storage device, but as Anton has pointed out, it's
not strictly required for ntfs (although it would be more useful for
cramfs). But that being said, the fundamental question here is
whether we should try to plan for future users of the data structure,
and reserve space now for the, or not. Your approach of saying Nein!
Nein! Nein! for every single feature where we don't have
implementation pretty much guarantees that we will need to expand the
structure later to make room for these extra fields, and then we'll
need to define a new ioctl and have similar complexity to the stat
system call to support multiple userspace interfaces. If we try to
anticipate new users in advance, then there is at least a *chance*
we'll get it right up front.

The big problem here is that if we try for a generic interface, we
*will* end up stalling for every as some people ask for new features,
and other people (like you) push back on it. And in the mean time, we
make no progress.

I'd like to break through the logjam, *somehow*. I can see three
possible paths:

* We reserve space for likely features that could likely be
used by real life filesystems that exist today.

* We don't, and accept the fact that later on we will need to
revise the interface, with the resulting hair that this
will mean to the kernel.

* I push an ext4-specific ioctl to Linus.

I don't particularly care *which* one we choose, but I'd like for us
to make a choice, and then move forward, instead of spinning endlessly.

And no, I haven't made anything up. I'm leaving names out because the
accusation was really out of line, and really unfair, so I want to
spare the embarassment of the accuser and the accused, but ask Andrew
Morton for confirmation if you want. This has gotten beyond
ridiculous, and I'd really like us to make a decision, and move *on*.

- Ted

2008-09-14 18:08:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sun, Sep 14, 2008 at 02:01:32PM -0400, Theodore Tso wrote:
> On Sun, Sep 14, 2008 at 09:47:11AM -0400, Christoph Hellwig wrote:
> > On Sat, Sep 13, 2008 at 02:49:24PM -0400, Theodore Ts'o wrote:
> > > +* FIEMAP_EXTENT_NO_BYPASS
> > > +Direct access to the data in this extent is illegal or will have
> > > +undefined results.
> >
> > This one is sitll misnamed and for sure utterly misdocumented. Direct
> > access to the data is always illegal and has undefined results, and
> > programs that do it anyway (e.g. grub) are in really deep trouble.
>
> Programs like Grub *have* to.

Depends on which grub you mean. Grub that boot loader has a filesystem
driver and accesses it, that's fine. /usr/sbin/grub, the installation
program doesn't have to. In fact the only reason why it does is to
"verify" that blocks are in place. This breaks down depending on the
phase of the moon, but especially badly on filesystems using a different
address space than the block device one.

> And in the case of a boot-loader, Lilo
> and Grub have been able to do so safely for over 15 years.

I think your arguing 100% profs my point that the flag name and
description are utterly confusing. What lilo does is load a kernel
from an offset into the partition. This is obviously fine, and a
perfect use case for an extent map / block map interface. I think
(but not one could clarify to me) that this is the primary use case
of the no bypass flag, which is supposed to tell lilo that this extent
is not actually a real block but some sort of meshup and lilo should
return an error that it gets. But this flag is totally unrelated
to beeing able to bypass the filesystem driver when running the kernel
which is always bogus, and has caused endless amounts of trouble with
the only widespread tool that does it (/usr/sbin/grub).

> The name isn't particularly important, but what it indicates is very
> clearly useful.

Sometimes I really wish people would read email before replying to them.
I did not say anywhere that this flag (at least for the use case I think
it is which no one even bothered to document) is good, but the name
and description are utterly wrong, and misleading.

2008-09-14 19:58:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sun, Sep 14, 2008 at 02:08:44PM -0400, Christoph Hellwig wrote:
> Sometimes I really wish people would read email before replying to them.
> I did not say anywhere that this flag (at least for the use case I think
> it is which no one even bothered to document) is good, but the name
> and description are utterly wrong, and misleading.

Want to pick a name, then?

- Ted

2008-09-15 14:47:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sun, Sep 14, 2008 at 03:58:11PM -0400, Theodore Tso wrote:
> On Sun, Sep 14, 2008 at 02:08:44PM -0400, Christoph Hellwig wrote:
> > Sometimes I really wish people would read email before replying to them.
> > I did not say anywhere that this flag (at least for the use case I think
> > it is which no one even bothered to document) is good, but the name
> > and description are utterly wrong, and misleading.
>
> Want to pick a name, then?

Again, I'd love it when people would read my mails :( Can we please
first agree on what the flag is actually supposed to mean? I made one
guess in this mail, but I'm not sure if that's even what Andreas
intended. And of course the description for it is even more important
than the name.


2008-09-15 14:49:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sun, Sep 14, 2008 at 01:58:10PM -0400, Theodore Tso wrote:
> On Sun, Sep 14, 2008 at 09:48:59AM -0400, Christoph Hellwig wrote:
> > Let's make it clear, I've said to not add it unless we have users. And
> > What Anton brought up is exactly the reason for that - to support
> > encrypted extents we actually need more information in the structure.
> > That's why we need to have this broad and sometimes a little slow
> > discussion on fsdevel instead of just rushing in some flag for future
> > use that won't make any sense in the end.
>
> You're incorrect here. Encrypted extents does not require any
> additional information in the structure. Compressed extents are a bit
> more useful if we allow the the filesystem to return the amount of
> space used on the storage device,

Sorry, should have written compressed and not encrypted above.

> cramfs). But that being said, the fundamental question here is
> whether we should try to plan for future users of the data structure,
> and reserve space now for the, or not. Your approach of saying Nein!
> Nein! Nein! for every single feature where we don't have
> implementation pretty much guarantees that we will need to expand the
> structure later to make room for these extra fields, and then we'll
> need to define a new ioctl and have similar complexity to the stat
> system call to support multiple userspace interfaces. If we try to
> anticipate new users in advance, then there is at least a *chance*
> we'll get it right up front.

I agree to you (or someone elses - don't remember anymore) suggestion
to put in more padding so we can add fields later. I strongly disagree
putting in features now that we neither have a user, nor a usecase or
testcase for.


2008-09-15 17:53:09

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote:
> I agree to you (or someone elses - don't remember anymore) suggestion
> to put in more padding so we can add fields later. I strongly disagree
> putting in features now that we neither have a user, nor a usecase or
> testcase for.

So, how about another 64 bits of padding in struct fiemap_extent? That could
help cover future uses like compression, which might require another 64 bit
offset field - we only have 32 bits of reserved space there right now.
--Mark

--
Mark Fasheh

2008-09-16 06:49:55

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sep 15, 2008 10:47 -0400, Christoph Hellwig wrote:
> Again, I'd love it when people would read my mails :( Can we please
> first agree on what the flag is actually supposed to mean? I made one
> guess in this mail, but I'm not sure if that's even what Andreas
> intended. And of course the description for it is even more important
> than the name.

The intent of this flag was a "catch-all" to indicate it isn't safe
to try and read this block from disk, either because it is encrypted,
compressed, on a remote system (HSM or over a network), or maybe not
even written to disk yet (delalloc).

In some cases (e.g. dump on a snapshot, or boot with LILO) it IS ok to
read directly from a block device underneath the filesystem, but that
would completely fail for the above cases.

Note that the NO_BYPASS (formerly NO_DIRECT) flag is meant to be used
in conjunction with other flags that specify more clearly the reason
that this block is not directly accessible. Having a "generic" flag
cover these different flag allows simple applications to know whether
the block is readable or not, without having to understand each flag
itself, and allows the flags to be expanded in the future (e.g. HSM or
whatever that doesn't get included right away).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-09-16 06:51:40

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sep 15, 2008 10:53 -0700, Mark Fasheh wrote:
> On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote:
> > I agree to you (or someone elses - don't remember anymore) suggestion
> > to put in more padding so we can add fields later. I strongly disagree
> > putting in features now that we neither have a user, nor a usecase or
> > testcase for.
>
> So, how about another 64 bits of padding in struct fiemap_extent? That could
> help cover future uses like compression, which might require another 64 bit
> offset field - we only have 32 bits of reserved space there right now.

I'd be happy to see it.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-09-16 21:31:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Mon, Sep 15, 2008 at 10:53:05AM -0700, Mark Fasheh wrote:
> On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote:
> > I agree to you (or someone elses - don't remember anymore) suggestion
> > to put in more padding so we can add fields later. I strongly disagree
> > putting in features now that we neither have a user, nor a usecase or
> > testcase for.
>
> So, how about another 64 bits of padding in struct fiemap_extent? That could
> help cover future uses like compression, which might require another 64 bit
> offset field - we only have 32 bits of reserved space there right now.

What I'd recommend is a 56 byte structure:

struct fiemap_extent {
__u64 fe_logical; /* logical offset in bytes for the start of
* the extent from the beginning of the file */
__u64 fe_physical; /* physical offset in bytes for the start
* of the extent from the beginning of the disk */
__u64 fe_length; /* length in bytes for this extent */
__u64 fe_reserved64[2];
* encoded/compressed on disk */
__u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
__u32 fe_reserved[3];
};

Yeah, it's a little extra memory per extent, but filesystems seem to
always invent new things, and it seem spretty clear that we have at
least two extensions on deck (compression, multiple storage devices)
both of which have at least one implementation that are either in the
kernel or will likely enter the kernel. So it's likely that there is
something that we may have missed, and leaving a little extra space
doesn't actually cost us that much.

- Ted

2008-09-16 22:03:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Mon, Sep 15, 2008 at 11:49:43PM -0700, Andreas Dilger wrote:
> The intent of this flag was a "catch-all" to indicate it isn't safe
> to try and read this block from disk, either because it is encrypted,
> compressed, on a remote system (HSM or over a network), or maybe not
> even written to disk yet (delalloc).
>
> In some cases (e.g. dump on a snapshot, or boot with LILO) it IS ok to
> read directly from a block device underneath the filesystem, but that
> would completely fail for the above cases.

Indeed, I thought it was pretty clear and obvious, but let me give an
quick but formal definition, and a potential name: DATA_ENCODED

If this flag is not set, then applications that who wish to access the
file data may do so by accessing the block device at the indicated
offset when the filesystem is unmounted. If the filesystem is
mounted, it is undefined whether accessing via the block device will
return valid data. If the flag DATA_ENCODED flag is set, it is almost
certain that an application will never be able to access the file data
via the block device.

Would this make people happy?

- Ted

2008-09-17 14:18:42

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Tue, 16 September 2008 18:03:46 -0400, Theodore Tso wrote:
> On Mon, Sep 15, 2008 at 11:49:43PM -0700, Andreas Dilger wrote:
> > The intent of this flag was a "catch-all" to indicate it isn't safe
> > to try and read this block from disk, either because it is encrypted,
> > compressed, on a remote system (HSM or over a network), or maybe not
> > even written to disk yet (delalloc).
> >
> > In some cases (e.g. dump on a snapshot, or boot with LILO) it IS ok to
> > read directly from a block device underneath the filesystem, but that
> > would completely fail for the above cases.
>
> Indeed, I thought it was pretty clear and obvious, but let me give an
> quick but formal definition, and a potential name: DATA_ENCODED
>
> If this flag is not set, then applications that who wish to access the
^^^^^^^^
> file data may do so by accessing the block device at the indicated
> offset when the filesystem is unmounted. If the filesystem is
> mounted, it is undefined whether accessing via the block device will
> return valid data. If the flag DATA_ENCODED flag is set, it is almost
> certain that an application will never be able to access the file data
> via the block device.
>
> Would this make people happy?

Apart from the typo above, here is a more discouraging version:

In general, accessing the block device directly is strongly discouraged.
Exceptions exist mainly in the form of boot loaders like lilo and grub,
at a time when the filesystem is not (cannot be) mounted.

If the flag DATA_ENCODED is set, however, even this exception is no
longer valid. The content is encoded in some form. Details are
unknown, it could be compressed, encrypted or something else.

Jörn

--
Man darf nicht das, was uns unwahrscheinlich und unnatürlich erscheint,
mit dem verwechseln, was absolut unmöglich ist.
-- Carl Friedrich Gauß

2008-09-17 15:02:24

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

J?rn Engel wrote:
> Apart from the typo above, here is a more discouraging version:
>
> In general, accessing the block device directly is strongly discouraged.
> Exceptions exist mainly in the form of boot loaders like lilo and grub,
> at a time when the filesystem is not (cannot be) mounted.
>
> If the flag DATA_ENCODED is set, however, even this exception is no
> longer valid. The content is encoded in some form. Details are
> unknown, it could be compressed, encrypted or something else.

I'm not clear about something from the above description.

If I were writing a journalling / tree-like filesystem, and I did
store data in blocks without encoding, but fsync() only waits for them
to be committed to journal, not their final destination, and also they
might be moved around - should I set DATA_ENCODED or not? (And should
I return the temporary location in the long-running journal since
that's the only place the data is committed at the time of the call?)

Assume that even reading after unmounting is not 100% safe, because
the data blocks could be relocated after calling FIEMAP (when the
filesystem must be mounted), and before the unmount.

-- Jamie

2008-09-17 15:25:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote:
>
> I'm not clear about something from the above description.
>
> If I were writing a journalling / tree-like filesystem, and I did
> store data in blocks without encoding, but fsync() only waits for them
> to be committed to journal, not their final destination, and also they
> might be moved around - should I set DATA_ENCODED or not? (And should
> I return the temporary location in the long-running journal since
> that's the only place the data is committed at the time of the call?)

I can imagine two ways of requiring this. One would be that you
should only set DATA_ENCODED if you knew that the data had reached its
final destination, possibly with some way of inducing the filesystem
to wait until this had happened. The other approach would be to
require in the specification that the filesystem would have to be
unmounted at least once. I prefer the latter as it is much simpler,
and the number of users who actually care about DATA_ENCODED is quite
small.

The other observation I would make is that for filesystems where
"final location" has no meaning (i.e., a log structured filesystem
where its log cleaner is constantly moving data blocks around for
compaction and/or flash wear leveling purposes), it will probably need
to set some bit that means LOCATION_UNSTABLE, as well as setting
DATA_ENCODED. This is would be a little confusing since what
DATA_ENCODED really means is, "application that might want to read
this file contents on an unmounted filesystem --- think again". So
maybe a better name would be NO_UNMOUNTED_IO.

- Ted


P.S. If such filesystems want to be used by boot loaders that need
fixed bootstrap blocks, they would probably need to have a flag that
pinned the data blocks and caused them not be moved by the log
cleaner. Reiser3 had to have something like this to work for LILO,
but iirc this was just to disable tail packing. Same concept, though;
it was a specialized per-filesystem flag that was needed if that
filesystem wanted to be used by a particular boot loader with direct
(unmounted) I/O requirements to access bootstrap code/data.


2008-09-19 14:05:03

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote:
> J?rn Engel wrote:
> > Apart from the typo above, here is a more discouraging version:
> >
> > In general, accessing the block device directly is strongly discouraged.
> > Exceptions exist mainly in the form of boot loaders like lilo and grub,
> > at a time when the filesystem is not (cannot be) mounted.
> >
> > If the flag DATA_ENCODED is set, however, even this exception is no
> > longer valid. The content is encoded in some form. Details are
> > unknown, it could be compressed, encrypted or something else.
>
> I'm not clear about something from the above description.
>
> If I were writing a journalling / tree-like filesystem, and I did
> store data in blocks without encoding, but fsync() only waits for them
> to be committed to journal, not their final destination, and also they
> might be moved around - should I set DATA_ENCODED or not? (And should
> I return the temporary location in the long-running journal since
> that's the only place the data is committed at the time of the call?)
>
> Assume that even reading after unmounting is not 100% safe, because
> the data blocks could be relocated after calling FIEMAP (when the
> filesystem must be mounted), and before the unmount.

For the journal case at least, grub can walk through the log of the FS
looking for up to date copies of things. It does this already for
reiserfs because the btree can't be trusted at all without a log replay.

-chris

2008-09-19 17:38:02

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

Chris Mason wrote:
> On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote:
> > J?rn Engel wrote:
> > > Apart from the typo above, here is a more discouraging version:
> > >
> > > In general, accessing the block device directly is strongly discouraged.
> > > Exceptions exist mainly in the form of boot loaders like lilo and grub,
> > > at a time when the filesystem is not (cannot be) mounted.
> > >
> > > If the flag DATA_ENCODED is set, however, even this exception is no
> > > longer valid. The content is encoded in some form. Details are
> > > unknown, it could be compressed, encrypted or something else.
> >
> > I'm not clear about something from the above description.
> >
> > If I were writing a journalling / tree-like filesystem, and I did
> > store data in blocks without encoding, but fsync() only waits for them
> > to be committed to journal, not their final destination, and also they
> > might be moved around - should I set DATA_ENCODED or not? (And should
> > I return the temporary location in the long-running journal since
> > that's the only place the data is committed at the time of the call?)
> >
> > Assume that even reading after unmounting is not 100% safe, because
> > the data blocks could be relocated after calling FIEMAP (when the
> > filesystem must be mounted), and before the unmount.
>
> For the journal case at least, grub can walk through the log of the FS
> looking for up to date copies of things. It does this already for
> reiserfs because the btree can't be trusted at all without a log replay.

Ok, that's good - grub doesn't need FIEMAP, it reads the filesystem properly.

So if I were writing a filesystem, what am I expected to return in
FIEMAP for these cases? I'm thinking I should set DATA_ENCODED, even
though the examples in J?rn's description don't cover this.

I'm thinking there are three main uses for FIEMAP:

1. LILO and similar. LILO itself is fine with FIBMAP though.

2. Fragmentation measurement and possibly defragmentation tools.

3. Something wants to have an idea of which areas of disk will be
accessed, so it can optimise I/O at a higher level - i.e. a database.
This isn't foolproof, especially for writes on recent filesystems
which don't overwrite in place.

1 means DATA_ENCODED should be set whenever there's any likelihood
that the result isn't reliable, so that would include when data is
stored in a journal or other temporary place and not a permanent place
on disk.

2 and 3 don't care about DATA_ENCODED at all.

-- Jamie

2008-09-20 07:43:17

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Fri, Sep 19, 2008 at 06:38:02PM +0100, Jamie Lokier wrote:
> Chris Mason wrote:
> > On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote:
> > > Assume that even reading after unmounting is not 100% safe, because
> > > the data blocks could be relocated after calling FIEMAP (when the
> > > filesystem must be mounted), and before the unmount.
> >
> > For the journal case at least, grub can walk through the log of the FS
> > looking for up to date copies of things. It does this already for
> > reiserfs because the btree can't be trusted at all without a log replay.

OMFG.

> Ok, that's good - grub doesn't need FIEMAP, it reads the filesystem properly.

That's garbage. The whole point of using an API like FIEMAP is to
avoid the problems with having to parse the raw disk structures to
do stuff.

Grub has extremely hacky implementation of just the stuff it needs
to read stuff directly off disk. The grub filesystem code is not
reviewed by the various filesystem teams, it doesn't even use the
same code as the filesystems, and last time I looked at it I came
away with bleeding eyes and didn't want to look any further.

Why was I even looking at the grub code? Well, grub doesn't
implement everyhting that is needed to parse XFS metadata
structures. In particular, the problem was out-of-line symlinks
in XFS, or traversing symlinks that point to symlinks was simply
not implemented in the grub XFS code. Of course, this was considered
to be an XFS problem, not a grub problem....

Not to mention that if we change the metadata disk format which is
conditional on a superblock feature bit set at mkfs time, grub knows
*nothing* about that new on disk format, how to parse it and will
break horribly on such a filesystem.

All the APIs are there to prevent such problems that grub has. I
just wish that they would be used rather than using grub's broken
design as justification for not needing or using such APIs.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-09-20 13:52:08

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 20, 2008 at 05:43:12PM +1000, Dave Chinner wrote:
> On Fri, Sep 19, 2008 at 06:38:02PM +0100, Jamie Lokier wrote:
> > Chris Mason wrote:
> > > On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote:
> > > > Assume that even reading after unmounting is not 100% safe, because
> > > > the data blocks could be relocated after calling FIEMAP (when the
> > > > filesystem must be mounted), and before the unmount.
> > >
> > > For the journal case at least, grub can walk through the log of the FS
> > > looking for up to date copies of things. It does this already for
> > > reiserfs because the btree can't be trusted at all without a log replay.
>
> OMFG.

Shrug, either we force the user to do something to get the blocks on
disk in a stable location or we have a bootloader that can parse the
disk format. It's either pain on the developers (patching grub) or pain
on every user, and the support people who have to remind them to rerun
grub-find-the-disk-blocks-just-like-lilo

-chris

2008-09-20 15:37:11

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

Chris Mason wrote:
> > > > For the journal case at least, grub can walk through the log of the FS
> > > > looking for up to date copies of things. It does this already for
> > > > reiserfs because the btree can't be trusted at all without a log replay.
> > OMFG.
>
> Shrug, either we force the user to do something to get the blocks on
> disk in a stable location or we have a bootloader that can parse the
> disk format. It's either pain on the developers (patching grub) or pain
> on every user, and the support people who have to remind them to rerun
> grub-find-the-disk-blocks-just-like-lilo

I think this thread may be confusing /usr/bin/grub with grub the bootloader.

One of the useful features of grub the bootloader is that it can boot
from a real filesystem without needing "run lilo before it's safe to
boot - or you'll regret shutting down".

That feature needs to use some kind of bleeding-eyes filesystem code,
or it will be as large as the filesystem code in Linux.

On the other hand, any dodgy direct access done by /usr/bin/grub has
no justification and is fair game for deletion, imho.

-- Jamie

2008-09-20 15:45:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sat, Sep 20, 2008 at 04:36:34PM +0100, Jamie Lokier wrote:
> > Shrug, either we force the user to do something to get the blocks on
> > disk in a stable location or we have a bootloader that can parse the
> > disk format. It's either pain on the developers (patching grub) or pain
> > on every user, and the support people who have to remind them to rerun
> > grub-find-the-disk-blocks-just-like-lilo
>
> I think this thread may be confusing /usr/bin/grub with grub the bootloader.
>
> One of the useful features of grub the bootloader is that it can boot
> from a real filesystem without needing "run lilo before it's safe to
> boot - or you'll regret shutting down".
>
> That feature needs to use some kind of bleeding-eyes filesystem code,
> or it will be as large as the filesystem code in Linux.

If the lilo or grub approach is better is at least debatable, but that's
not really the point here.

> On the other hand, any dodgy direct access done by /usr/bin/grub has
> no justification and is fair game for deletion, imho.

That _is_ the real problem. But we're getting quite offtopic as non
of that is related to FIEMAP anymore ;-)

2008-09-20 16:47:59

by Josef 'Jeff' Sipek

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Tue, Sep 16, 2008 at 05:31:07PM -0400, Theodore Tso wrote:
> On Mon, Sep 15, 2008 at 10:53:05AM -0700, Mark Fasheh wrote:
> > On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote:
> > > I agree to you (or someone elses - don't remember anymore) suggestion
> > > to put in more padding so we can add fields later. I strongly disagree
> > > putting in features now that we neither have a user, nor a usecase or
> > > testcase for.
> >
> > So, how about another 64 bits of padding in struct fiemap_extent? That could
> > help cover future uses like compression, which might require another 64 bit
> > offset field - we only have 32 bits of reserved space there right now.
>
> What I'd recommend is a 56 byte structure:

Why not just make it 64 bytes? Sure, that's 8 extra bytes, but I find the
power-of-2 size (and the extra space) comforting. (AFAIK, slab allocators
will give you 64 bytes anyway; and I expect something similar on the
user-space side of things.)

...
> Yeah, it's a little extra memory per extent, but filesystems seem to
> always invent new things, and it seem spretty clear that we have at
> least two extensions on deck (compression, multiple storage devices)
> both of which have at least one implementation that are either in the
> kernel or will likely enter the kernel. So it's likely that there is
> something that we may have missed, and leaving a little extra space
> doesn't actually cost us that much.

Right.

Josef 'Jeff' Sipek.

--
Once you have their hardware. Never give it back.
(The First Rule of Hardware Acquisition)

2008-09-29 01:07:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

Mark,

Do you have time to update the fiemap patchset so we can hopefully
come to consensus and get this pushed to Linus in time for the
upcoming merge window?

If not, I can take over for you and take over the care and feeding of
this patchset so it can get pushed to Linus ASAP; I think we are
pretty close to consensus on this interface, and I'd rather that we
not lose momentum.

- Ted

2008-09-29 21:13:43

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Sun, Sep 28, 2008 at 09:07:20PM -0400, Theodore Tso wrote:
> Do you have time to update the fiemap patchset so we can hopefully
> come to consensus and get this pushed to Linus in time for the
> upcoming merge window?

I can probably do this early next week. If that's not soon enough, I won't
be offended if you or someone else does the update. I agree, it seems like
we're pretty close to consensus.


> If not, I can take over for you and take over the care and feeding of
> this patchset so it can get pushed to Linus ASAP; I think we are
> pretty close to consensus on this interface, and I'd rather that we
> not lose momentum.

Why not let this sit in -mm for a cycle? I know how that sounds, considering
how long it's taken everyone to decide what color to paint the bicycle shed,
but at the least it'd give other file systems a chance to write their own
support patches. There's also the benefit of a little extra time for
ext4/ocfs2 to debug their own support patches. Do we lose any momentum if
the patches are 'complete' and in -mm with a target for 2.6.29?
--Mark

--
Mark Fasheh

2008-09-29 22:10:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] vfs: vfs-level fiemap interface

On Mon, Sep 29, 2008 at 02:13:40PM -0700, Mark Fasheh wrote:
> Why not let this sit in -mm for a cycle? I know how that sounds, considering
> how long it's taken everyone to decide what color to paint the bicycle shed,
> but at the least it'd give other file systems a chance to write their own
> support patches. There's also the benefit of a little extra time for
> ext4/ocfs2 to debug their own support patches. Do we lose any momentum if
> the patches are 'complete' and in -mm with a target for 2.6.29?

Quite frankly, the reason why I'd rather not is because I'd rather not
have someone else showing up asking what color to paint the bicycle
shed. We don't have to have all filesystems with support patches
before we merge the core patches; in fact I don't think there's any
real benefit in holding back until other filesystems are ready. At
least the ext4 patches has been ready for quite some time --- and
getting it merged would help us with our online defrag patches.

- Ted