2011-05-23 21:43:09

by Josef Bacik

Subject: [PATCH 1/3] fs: add SEEK_HOLE and SEEK_DATA flags V4

This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
using fiemap in things like cp causes more problems than it solves, so let's try
and give userspace an interface that doesn't suck. We need to match Solaris
here, and the definitions are:

  *o* If /whence/ is SEEK_HOLE, the offset of the start of the
      next hole greater than or equal to the supplied offset
      is returned. The definition of a hole is provided near
      the end of the DESCRIPTION.

  *o* If /whence/ is SEEK_DATA, the file pointer is set to the
      start of the next non-hole file region greater than or
      equal to the supplied offset.

So in the generic case the entire file is data and there is a virtual hole at
the end. That means we will just return i_size for SEEK_HOLE and will return
the same offset for SEEK_DATA. This is how Solaris does it so we have to do it
the same way.
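
As a rough illustration of the generic behaviour described above, here is a
small userspace model. It is only a sketch: generic_seek_model() and the
MODEL_SEEK_* names are made up for this example and are not part of the patch,
and it assumes a non-negative offset.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for SEEK_DATA/SEEK_HOLE, used only by this model. */
enum model_whence { MODEL_SEEK_DATA, MODEL_SEEK_HOLE };

/*
 * Model of the generic fallback: the whole file is data and there is one
 * virtual hole at i_size.  Returns the resulting offset, or -ENXIO when
 * the supplied offset is at or beyond the end of the file.
 */
long long generic_seek_model(long long offset, long long i_size,
			     enum model_whence whence)
{
	/* Assumption of this sketch: callers pass sane, non-negative values. */
	assert(offset >= 0 && i_size >= 0);

	if (offset >= i_size)
		return -ENXIO;

	/* SEEK_HOLE jumps to the virtual hole, SEEK_DATA stays put. */
	return whence == MODEL_SEEK_HOLE ? i_size : offset;
}
```

For a 100 byte file, this model reports data at whatever offset is passed in
(as long as it is below 100) and the only hole at offset 100.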

Thanks,

Signed-off-by: Josef Bacik <[email protected]>
---
V3->V4:
-Fix the SEEK_HOLE/SEEK_DATA values to match Solaris
-Fix the generic case to work the same way Solaris works as best as possible

fs/read_write.c | 17 +++++++++++++++++
include/linux/fs.h | 4 +++-
2 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 5520f8a..9c3b453 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -64,6 +64,23 @@ generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
 			return file->f_pos;
 		offset += file->f_pos;
 		break;
+	case SEEK_DATA:
+		/*
+		 * In the generic case the entire file is data, so as long as
+		 * offset isn't at the end of the file then the offset is data.
+		 */
+		if (offset >= inode->i_size)
+			return -ENXIO;
+		break;
+	case SEEK_HOLE:
+		/*
+		 * There is a virtual hole at the end of the file, so as long as
+		 * offset isn't i_size or larger, return i_size.
+		 */
+		if (offset >= inode->i_size)
+			return -ENXIO;
+		offset = inode->i_size;
+		break;
 	}

 	if (offset < 0 && !unsigned_offsets(file))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cdf9495..fe1e250 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -31,7 +31,9 @@
 #define SEEK_SET	0	/* seek relative to beginning of file */
 #define SEEK_CUR	1	/* seek relative to current file position */
 #define SEEK_END	2	/* seek relative to end of file */
-#define SEEK_MAX	SEEK_END
+#define SEEK_DATA	3	/* seek to the next data */
+#define SEEK_HOLE	4	/* seek to the next hole */
+#define SEEK_MAX	SEEK_HOLE

 struct fstrim_range {
 	__u64 start;
--
1.7.2.3


2011-05-23 21:43:34

by Josef Bacik

Subject: [PATCH 2/3] Btrfs: implement our own ->llseek V4

In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
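
For context, this is roughly how a cp-like tool might consume the interface
from userspace, whichever filesystem implements it. This is a sketch under
assumptions, not code from this series: list_data_regions() and struct
data_region are hypothetical names, and the fallback defines simply reuse the
SEEK_DATA/SEEK_HOLE values from patch 1 in case the libc headers lack them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Fall back to the values from patch 1 if the headers don't define them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Hypothetical helper type: one data region, covering [start, end). */
struct data_region {
	off_t start;
	off_t end;
};

/*
 * Record up to max data regions of fd in regions[].  Returns the number of
 * regions found, or -1 with errno set on failure.
 */
int list_data_regions(int fd, struct data_region *regions, int max)
{
	off_t pos = 0;
	int n = 0;

	assert(regions != NULL && max > 0);

	while (n < max) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0) {
			if (errno == ENXIO)
				break;	/* no data at or after pos */
			return -1;
		}

		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;
		if (hole <= data) {
			/* Guard against a seek that makes no progress. */
			errno = EINVAL;
			return -1;
		}

		regions[n].start = data;
		regions[n].end = hole;
		n++;
		pos = hole;
	}
	return n;
}
```

With the prealloc handling described above, a preallocated range would show up
in this loop as data rather than as a hole.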

Signed-off-by: Josef Bacik <[email protected]>
---
V3->V4:
-Fix everything to pass Sunil's seek test that passes on Solaris

fs/btrfs/ctree.h | 3 +
fs/btrfs/file.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 150 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 522a39b..75e3ba6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2475,6 +2475,9 @@ int btrfs_csum_truncate(struct btrfs_trans_handle *trans,
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start,
 			     u64 end, struct list_head *list);
 /* inode.c */
+struct extent_map *btrfs_get_extent_fiemap(struct inode *inode, struct page *page,
+					   size_t pg_offset, u64 start, u64 len,
+					   int create);

 /* RHEL and EL kernels have a patch that renames PG_checked to FsMisc */
 #if defined(ClearPageFsMisc) && !defined(ClearPageChecked)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cd5e82e..15e4719 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1406,8 +1406,154 @@ out:
 	return ret;
 }

+static int find_desired_extent(struct inode *inode, loff_t *offset, int origin)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct extent_map *em;
+	struct extent_state *cached_state = NULL;
+	u64 lockstart = *offset;
+	u64 lockend = i_size_read(inode);
+	u64 start = *offset;
+	u64 orig_start = *offset;
+	u64 len = i_size_read(inode);
+	u64 last_end = 0;
+	int ret = 0;
+
+	lockend = max_t(u64, root->sectorsize, lockend);
+	if (lockend <= lockstart)
+		lockend = lockstart + root->sectorsize;
+
+	len = lockend - lockstart + 1;
+
+	len = max_t(u64, len, root->sectorsize);
+	if (inode->i_size == 0)
+		return -ENXIO;
+
+	lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0,
+			 &cached_state, GFP_NOFS);
+
+	/*
+	 * Delalloc is such a pain. If we have a hole and we have pending
+	 * delalloc for a portion of the hole we will get back a hole that
+	 * exists for the entire range since it hasn't been actually written
+	 * yet. So to take care of this case we need to look for an extent just
+	 * before the position we want in case there is outstanding delalloc
+	 * going on here.
+	 */
+	if (origin == SEEK_HOLE && start != 0) {
+		if (start <= root->sectorsize)
+			em = btrfs_get_extent_fiemap(inode, NULL, 0, 0,
+						     root->sectorsize, 0);
+		else
+			em = btrfs_get_extent_fiemap(inode, NULL, 0,
+						     start - root->sectorsize,
+						     root->sectorsize, 0);
+		if (IS_ERR(em)) {
+			ret = -ENXIO;
+			goto out;
+		}
+		last_end = em->start + em->len;
+		if (em->block_start == EXTENT_MAP_DELALLOC)
+			last_end = min_t(u64, last_end, inode->i_size);
+		free_extent_map(em);
+	}
+
+	while (1) {
+		em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0);
+		if (IS_ERR(em)) {
+			ret = -ENXIO;
+			break;
+		}
+
+		if (em->block_start == EXTENT_MAP_HOLE) {
+			if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
+				if (last_end <= orig_start) {
+					free_extent_map(em);
+					ret = -ENXIO;
+					break;
+				}
+			}
+
+			if (origin == SEEK_HOLE) {
+				*offset = start;
+				free_extent_map(em);
+				break;
+			}
+		} else {
+			if (origin == SEEK_DATA) {
+				if (em->block_start == EXTENT_MAP_DELALLOC) {
+					if (start >= inode->i_size) {
+						free_extent_map(em);
+						ret = -ENXIO;
+						break;
+					}
+				}
+
+				*offset = start;
+				free_extent_map(em);
+				break;
+			}
+		}
+
+		start = em->start + em->len;
+		last_end = em->start + em->len;
+
+		if (em->block_start == EXTENT_MAP_DELALLOC)
+			last_end = min_t(u64, last_end, inode->i_size);
+
+		if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
+			free_extent_map(em);
+			ret = -ENXIO;
+			break;
+		}
+		free_extent_map(em);
+		cond_resched();
+	}
+	if (!ret)
+		*offset = min(*offset, inode->i_size);
+out:
+	unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
+			     &cached_state, GFP_NOFS);
+	return ret;
+}
+
+static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin)
+{
+	struct inode *inode = file->f_mapping->host;
+	int ret;
+
+	mutex_lock(&inode->i_mutex);
+	switch (origin) {
+	case SEEK_END:
+	case SEEK_CUR:
+		offset = generic_file_llseek_unlocked(file, offset, origin);
+		goto out;
+	case SEEK_DATA:
+	case SEEK_HOLE:
+		ret = find_desired_extent(inode, &offset, origin);
+		if (ret) {
+			mutex_unlock(&inode->i_mutex);
+			return ret;
+		}
+	}
+
+	if (offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
+		return -EINVAL;
+	if (offset > inode->i_sb->s_maxbytes)
+		return -EINVAL;
+
+	/* Special lock needed here? */
+	if (offset != file->f_pos) {
+		file->f_pos = offset;
+		file->f_version = 0;
+	}
+out:
+	mutex_unlock(&inode->i_mutex);
+	return offset;
+}
+
 const struct file_operations btrfs_file_operations = {
-	.llseek		= generic_file_llseek,
+	.llseek		= btrfs_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
--
1.7.2.3


2011-05-23 21:43:11

by Josef Bacik

Subject: [PATCH 3/3] Ext4: handle SEEK_HOLE/SEEK_DATA generically V4

Since Ext4 has its own llseek we need to make sure it handles
SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
case; somebody else can come along and make it do fancy things later. Thanks,

Signed-off-by: Josef Bacik <[email protected]>
---
V3->V4:
-update to work the same as the generic way

fs/ext4/file.c | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 7b80d54..148270e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -236,6 +236,27 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int origin)
 		}
 		offset += file->f_pos;
 		break;
+	case SEEK_DATA:
+		/*
+		 * In the generic case the entire file is data, so as long as
+		 * offset isn't at the end of the file then the offset is data.
+		 */
+		if (offset >= inode->i_size) {
+			mutex_unlock(&inode->i_mutex);
+			return -ENXIO;
+		}
+		break;
+	case SEEK_HOLE:
+		/*
+		 * There is a virtual hole at the end of the file, so as long as
+		 * offset isn't i_size or larger, return i_size.
+		 */
+		if (offset >= inode->i_size) {
+			mutex_unlock(&inode->i_mutex);
+			return -ENXIO;
+		}
+		offset = inode->i_size;
+		break;
 	}

 	if (offset < 0 || offset > maxbytes) {
--
1.7.2.3

2011-05-25 19:45:56

by Andreas Dilger

Subject: Re: [PATCH 1/3] fs: add SEEK_HOLE and SEEK_DATA flags V4

On May 23, 2011, at 15:43, Josef Bacik wrote:
> This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
> using fiemap in things like cp cause more problems than it solves, so lets try
> and give userspace an interface that doesn't suck. We need to match solaris
> here, and the definitions are
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 5520f8a..9c3b453 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -64,6 +64,23 @@ generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
>  			return file->f_pos;
>  		offset += file->f_pos;
>  		break;
> +	case SEEK_DATA:
> +		/*
> +		 * In the generic case the entire file is data, so as long as
> +		 * offset isn't at the end of the file then the offset is data.
> +		 */
> +		if (offset >= inode->i_size)
> +			return -ENXIO;
> +		break;
> +	case SEEK_HOLE:
> +		/*
> +		 * There is a virtual hole at the end of the file, so as long as
> +		 * offset isn't i_size or larger, return i_size.
> +		 */
> +		if (offset >= inode->i_size)
> +			return -ENXIO;
> +		offset = inode->i_size;
> +		break;
>  	}

What about all of the existing filesystems that currently just ignore
values of "origin" that they don't understand? Looking through those
it appears that most of them will return "offset" for unknown values
of "origin", which I guess is OK for SEEK_DATA, but is confusing for
SEEK_HOLE. Some filesystems will return -EINVAL for values of origin
that are unknown.

Most of the filesystem-specific ->llseek() methods don't do any error
checking on "origin" because this is handled at the sys_llseek() level,
and hasn't changed in many years.

I assume this patch is also dependent upon the "remove default_llseek()"
patch, so that the implementation of SEEK_DATA and SEEK_HOLE can be done
in only generic_file_llseek()?

Finally, while looking through the various ->llseek() methods I notice
that many filesystems return "i_size" for SEEK_END, which clearly does
not make sense for filesystems like ext3/ext4 htree, btrfs, etc that
use hash keys instead of byte offsets for doing directory traversal.
The comment at generic_file_llseek() is that it is intended for use by
regular files.

Should the ext4_llseek() code be changed to return 0x7ffffffff for the
SEEK_END value? That makes more sense compared to values returned for
SEEK_CUR so that an application can compare the current "offset" with
the final value for a progress bar.

Another interesting use is for N threads to process a large directory
in parallel by using max_off = llseek(dirfd, 0, SEEK_END) and then
each thread calls llseek(dirfd, thread_nr * max_off / N, SEEK_SET)
to process 1/N of the directory.
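
A minimal sketch of the split described above, assuming max_off has already
been obtained from llseek(dirfd, 0, SEEK_END) and that the filesystem's
directory offsets are usable as a linear range (which is exactly the open
question here). slice_start() is a hypothetical helper name; the arithmetic
is arranged so that thread_nr * max_off cannot overflow for large offsets.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return floor(thread_nr * max_off / nthreads) without forming the
 * possibly overflowing product thread_nr * max_off.  Thread i would
 * process offsets in [slice_start(i), slice_start(i + 1)).
 */
int64_t slice_start(int64_t max_off, int thread_nr, int nthreads)
{
	assert(max_off >= 0 && nthreads > 0);
	assert(thread_nr >= 0 && thread_nr <= nthreads);

	int64_t q = max_off / nthreads;	/* whole part of each slice */
	int64_t r = max_off % nthreads;	/* remainder spread across slices */

	return q * thread_nr + (r * thread_nr) / nthreads;
}
```

Each thread would then seek to its own slice_start() value with SEEK_SET and
stop once it reaches the next thread's starting offset.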


Cheers, Andreas

2011-05-25 20:46:56

by Josef Bacik

Subject: Re: [PATCH 1/3] fs: add SEEK_HOLE and SEEK_DATA flags V4

On 05/25/2011 03:45 PM, Andreas Dilger wrote:
> On May 23, 2011, at 15:43, Josef Bacik wrote:
>> This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
>> using fiemap in things like cp cause more problems than it solves, so lets try
>> and give userspace an interface that doesn't suck. We need to match solaris
>> here, and the definitions are
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 5520f8a..9c3b453 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -64,6 +64,23 @@ generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
>>  			return file->f_pos;
>>  		offset += file->f_pos;
>>  		break;
>> +	case SEEK_DATA:
>> +		/*
>> +		 * In the generic case the entire file is data, so as long as
>> +		 * offset isn't at the end of the file then the offset is data.
>> +		 */
>> +		if (offset >= inode->i_size)
>> +			return -ENXIO;
>> +		break;
>> +	case SEEK_HOLE:
>> +		/*
>> +		 * There is a virtual hole at the end of the file, so as long as
>> +		 * offset isn't i_size or larger, return i_size.
>> +		 */
>> +		if (offset >= inode->i_size)
>> +			return -ENXIO;
>> +		offset = inode->i_size;
>> +		break;
>>  	}
>
> What about all of the existing filesystems that currently just ignore
> values of "origin" that they don't understand? Looking through those
> it appears that most of them will return "offset" for unknown values
> of "origin", which I guess is OK for SEEK_DATA, but is confusing for
> SEEK_HOLE. Some filesystems will return -EINVAL for values of origin
> that are unknown.
>

Yeah, I just didn't want to do all that work until I was sure the base of
what I had was acceptable. If people think this set is good to go then
I will go through and fix everybody who does their own llseek.

> Most of the filesystem-specific ->llseek() methods don't do any error
> checking on "origin" because this is handled at the sys_llseek() level,
> and hasn't changed in many years.
>
> I assume this patch is also dependent upon the "remove default_llseek()"
> patch, so that the implementation of SEEK_DATA and SEEK_HOLE can be done
> in only generic_file_llseek()?
>
> Finally, while looking through the various ->llseek() methods I notice
> that many filesystems return "i_size" for SEEK_END, which clearly does
> not make sense for filesystems like ext3/ext4 htree, btrfs, etc that
> use hash keys instead of byte offsets for doing directory traversal.
> The comment at generic_file_llseek() is that it is intended for use by
> regular files.
>
> Should the ext4_llseek() code be changed to return 0x7ffffffff for the
> SEEK_END value? That makes more sense compared to values returned for
> SEEK_CUR so that an application can compare the current "offset" with
> the final value for a progress bar.

So maybe we make SEEK_DATA/SEEK_HOLE only work on regular files and not
directories? Sunil, what does Solaris do? Thanks,

Josef

2011-05-25 22:07:46

by Sunil Mushran

Subject: Re: [PATCH 1/3] fs: add SEEK_HOLE and SEEK_DATA flags V4

On 05/25/2011 01:46 PM, Josef Bacik wrote:
> On 05/25/2011 03:45 PM, Andreas Dilger wrote:
>> Most of the filesystem-specific ->llseek() methods don't do any error
>> checking on "origin" because this is handled at the sys_llseek() level,
>> and hasn't changed in many years.
>>
>> I assume this patch is also dependent upon the "remove default_llseek()"
>> patch, so that the implementation of SEEK_DATA and SEEK_HOLE can be done
>> in only generic_file_llseek()?
>>
>> Finally, while looking through the various ->llseek() methods I notice
>> that many filesystems return "i_size" for SEEK_END, which clearly does
>> not make sense for filesystems like ext3/ext4 htree, btrfs, etc that
>> use hash keys instead of byte offsets for doing directory traversal.
>> The comment at generic_file_llseek() is that it is intended for use by
>> regular files.
>>
>> Should the ext4_llseek() code be changed to return 0x7ffffffff for the
>> SEEK_END value? That makes more sense compared to values returned for
>> SEEK_CUR so that an application can compare the current "offset" with
>> the final value for a progress bar.
> So maybe we make SEEK_DATA/SEEK_HOLE only work on regular files and not
> directories? Sunil what does solaris do? Thanks,

In Solaris the size of the directory appears to be equal to the number
of entries and the offset is the file#, so to speak. SEEK_DATA returns
the current offset and SEEK_HOLE the last one.

Just to be clear, I am not a Solaris expert. I just happen to have access
to it. ;)