This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
using fiemap in things like cp cause more problems than it solves, so lets try
and give userspace an interface that doesn't suck. So we have
-SEEK_HOLE: this moves the file pos to the nearest hole in the file from the
given position. If the given position is a hole then pos won't move. A "hole"
is defined by whatever the fs feels like defining it to be. In simple things
like ext2/3 it will simplly mean an unallocated space in the file. For more
complex things where you have preallocated space then that is left up to the
filesystem. Since preallocated space is supposed to return all 0's it is
perfectly legitimate to have SEEK_HOLE dump you out at the start of a
preallocated extent, but then again if this is not something you can do and be
sure the extent isn't in the middle of being converted to a real extent then it
is also perfectly legitimate to skip preallocated extents and only park f_pos at
a truly unallocated section.
-SEEK_DATA: this is obviously a little more self-explanatory. Again the only
ambiguity comes in with preallocated extents. If you have an fs that can't
reliably tell that the preallocated extent is in the process of turning into a
real extent, it is correct for SEEK_DATA to park you at a preallocated extent.
In the generic case we will just assume the entire file is data and there is a
virtual hole at i_size, so SEEK_DATA will return -ENXIO unless you provide an
offset of 0 and the file size is larger than 0, and SEEK_HOLE will put you at
i_size unless pos is i_size or larger, and i_size is larger than 0.
Thanks,
Signed-off-by: Josef Bacik <[email protected]>
---
v1->v2: Make the generic case assume that the entire file is data and there is a
virtual hole at the end of the file.
fs/read_write.c | 22 ++++++++++++++++++++++
include/linux/fs.h | 4 +++-
2 files changed, 25 insertions(+), 1 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 5520f8a..6ee63a4 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -64,6 +64,28 @@ generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
return file->f_pos;
offset += file->f_pos;
break;
+ case SEEK_DATA:
+ /*
+ * In the generic case the entire file is data, so data only
+ * starts at position 0 provided the file has an i_size,
+ * otherwise it's an empty file and will always be ENXIO.
+ */
+ if (offset != 0 || i_size_read(inode)) {
+ mutex_unlock(&inode->i_mutex);
+ return -ENXIO;
+ }
+ break;
+ case SEEK_HOLE:
+ /*
+ * There is a virtual hole at the end of the file, so as long as
+ * offset isn't i_size or larger, return i_size.
+ */
+ if (offset >= i_size_read(inode)) {
+ mutex_unlock(&inode->i_mutex);
+ return -ENXIO;
+ }
+ offset = i_size_read(inode);
+ break;
}
if (offset < 0 && !unsigned_offsets(file))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dbd860a..1b72e0c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -31,7 +31,9 @@
#define SEEK_SET 0 /* seek relative to beginning of file */
#define SEEK_CUR 1 /* seek relative to current file position */
#define SEEK_END 2 /* seek relative to end of file */
-#define SEEK_MAX SEEK_END
+#define SEEK_HOLE 3 /* seek to the closest hole */
+#define SEEK_DATA 4 /* seek to the closest data */
+#define SEEK_MAX SEEK_DATA
struct fstrim_range {
__u64 start;
--
1.7.2.3
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
Signed-off-by: Josef Bacik <[email protected]>
---
v1->v2: fixed some weirdness with finding the next hole when we had delalloc
ranges.
fs/btrfs/ctree.h | 3 +
fs/btrfs/file.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 130 insertions(+), 1 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d5f043e..d2991c8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2474,6 +2474,9 @@ int btrfs_csum_truncate(struct btrfs_trans_handle *trans,
int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start,
u64 end, struct list_head *list);
/* inode.c */
+struct extent_map *btrfs_get_extent_fiemap(struct inode *inode, struct page *page,
+ size_t pg_offset, u64 start, u64 len,
+ int create);
/* RHEL and EL kernels have a patch that renames PG_checked to FsMisc */
#if defined(ClearPageFsMisc) && !defined(ClearPageChecked)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cd5e82e..ad5c3c8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1406,8 +1406,134 @@ out:
return ret;
}
+static int find_desired_extent(struct inode *inode, loff_t *offset, int origin)
+{
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct extent_map *em;
+ struct extent_state *cached_state = NULL;
+ u64 lockstart = *offset;
+ u64 lockend = i_size_read(inode);
+ u64 start = *offset;
+ u64 len = i_size_read(inode);
+ u64 last_end = 0;
+ int ret = 0;
+
+ lockend = max_t(u64, BTRFS_I(inode)->root->sectorsize, lockend);
+ len = lockend;
+
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0,
+ &cached_state, GFP_NOFS);
+
+ /*
+ * Delalloc is such a pain. If we have a hole and we have pending
+ * delalloc for a portion of the hole we will get back a hole that
+ * exists for the entire range since it hasn't been actually written
+ * yet. So to take care of this case we need to look for an extent just
+ * before the position we want in case there is outstanding delalloc
+ * going on here.
+ */
+ if (origin == SEEK_HOLE && start != 0) {
+ if (start < root->sectorsize)
+ em = btrfs_get_extent_fiemap(inode, NULL, 0, 0,
+ start, 0);
+ else
+ em = btrfs_get_extent_fiemap(inode, NULL, 0,
+ start - root->sectorsize,
+ root->sectorsize, 0);
+ if (IS_ERR(em)) {
+ ret = -ENXIO;
+ goto out;
+ }
+ last_end = em->start + em->len;
+ free_extent_map(em);
+ }
+
+ while (1) {
+ em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0);
+ if (IS_ERR(em)) {
+ ret = -ENXIO;
+ break;
+ }
+
+ if (em->block_start == EXTENT_MAP_HOLE) {
+ if (origin == SEEK_HOLE) {
+ u64 new_offset = max(em->start, start);
+ /*
+ * If we are in the middle of a hole then move
+ * to the next one, unless the previous range we
+ * found ended here, which means we have a
+ * delalloc range that is going to convert part
+ * of this hole into data.
+ */
+ if (new_offset == em->start ||
+ (new_offset > em->start &&
+ new_offset == last_end)) {
+ *offset = new_offset;
+ free_extent_map(em);
+ break;
+ }
+ }
+ } else {
+ if (origin == SEEK_DATA) {
+ if (em->start == start) {
+ *offset = em->start;
+ free_extent_map(em);
+ break;
+ }
+ }
+ }
+ start = em->start + em->len;
+ last_end = em->start + em->len;
+ if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
+ free_extent_map(em);
+ ret = -ENXIO;
+ break;
+ }
+ free_extent_map(em);
+ }
+out:
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
+ &cached_state, GFP_NOFS);
+ return ret;
+}
+
+static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct inode *inode = file->f_mapping->host;
+ int ret;
+
+ mutex_lock(&inode->i_mutex);
+ switch (origin) {
+ case SEEK_END:
+ case SEEK_CUR:
+ offset = generic_file_llseek_unlocked(file, offset, origin);
+ goto out;
+ case SEEK_DATA:
+ case SEEK_HOLE:
+ ret = find_desired_extent(inode, &offset, origin);
+ if (ret) {
+ mutex_unlock(&inode->i_mutex);
+ return ret;
+ }
+ }
+
+ if (offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
+ return -EINVAL;
+ if (offset > inode->i_sb->s_maxbytes)
+ return -EINVAL;
+
+ /* Special lock needed here? */
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+out:
+ mutex_unlock(&inode->i_mutex);
+ return offset;
+}
+
const struct file_operations btrfs_file_operations = {
- .llseek = generic_file_llseek,
+ .llseek = btrfs_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
.aio_read = generic_file_aio_read,
--
1.7.2.3
On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
> -SEEK_HOLE: this moves the file pos to the nearest hole in the file from the
> given position.
Nearest, or next? Solaris defines it as "next", for a good reason - otherwise
you can get stuck in a case where the "nearest" hole is back towards the
start of the file - and "seek data" will bounce back to the next byte at
the other end of the hole.
Consider a file with this layout:
< 40K of data> A < 32K hole> B < 32K data> C < 8K hole> D <32K data> E ....
If you're in the range between "8K-1 before C" and "8K-1 after D", there's no
application of seeks to "nearest" data/hole that doesn't leave you oscillating
between C and D, and unable to reach B or E. If youre at C, "nearest hole" is
where you are, and "nearest data" is at D, not B. Similarly for D - nearest
data is C, not E.
However, this is easily dealt with if you define it as "next", as then it is
simple to discover exactly where A/B/C/D/E are.
On 05/04/2011 03:04 PM, [email protected] wrote:
> On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
>
>> -SEEK_HOLE: this moves the file pos to the nearest hole in the file from the
>> given position.
>
> Nearest, or next? Solaris defines it as "next", for a good reason - otherwise
> you can get stuck in a case where the "nearest" hole is back towards the
> start of the file - and "seek data" will bounce back to the next byte at
> the other end of the hole.
>
Yeah sorry the log says "nearest" but the code says "next", if you look
at it thats how it works. Thanks,
Josef
On Wed, 04 May 2011 15:10:20 EDT, Josef Bacik said:
> Yeah sorry the log says "nearest" but the code says "next", if you look
> at it thats how it works. Thanks,
Oh good - the changelog is usually easier to fix than the code is. :)
Probably want to fix the changelog before it gets committed, as there's a fair
chance that text will end up being used as the basis for a manpage or other
documentation.
On 05/04/2011 03:20 PM, [email protected] wrote:
> On Wed, 04 May 2011 15:10:20 EDT, Josef Bacik said:
>
>> Yeah sorry the log says "nearest" but the code says "next", if you look
>> at it thats how it works. Thanks,
>
> Oh good - the changelog is usually easier to fix than the code is. :)
>
> Probably want to fix the changelog before it gets committed, as there's a fair
> chance that text will end up being used as the basis for a manpage or other
> documentation.
>
Yeah agreed I meant to change it this time around but forgot, I will
make the log all nice and pretty next time around, as I doubt this will
be the last iteration of these patches ;). Thanks,
Josef
On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
> +#define SEEK_HOLE 3 /* seek to the closest hole */
> +#define SEEK_DATA 4 /* seek to the closest data */
Comments here need nearest/next fixing as well - otherwise the ext[34] crew may
actually implement the commented semantics. ;)
Other than that, patch 1/2 looks OK to me (not that there's much code to
review), and 2/2 *seems* sane and implement the "next" semantics, though I only
examined the while/if structure and am assuming the btrfs innards are done
correctly. In particular, that 'while (1)' looks like it can be painful for a
sufficiently large and fragmented file (think a gigabyte file in 4K chunks,
producing a million extents), but I'll let a btrfs expert analyse that
performance issue ;)
On 05/04/2011 03:31 PM, [email protected] wrote:
> On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
>
>> +#define SEEK_HOLE 3 /* seek to the closest hole */
>> +#define SEEK_DATA 4 /* seek to the closest data */
>
> Comments here need nearest/next fixing as well - otherwise the ext[34] crew may
> actually implement the commented semantics. ;)
>
Balls, thanks I'll fix that.
> Other than that, patch 1/2 looks OK to me (not that there's much code to
> review), and 2/2 *seems* sane and implement the "next" semantics, though I only
> examined the while/if structure and am assuming the btrfs innards are done
> correctly. In particular, that 'while (1)' looks like it can be painful for a
> sufficiently large and fragmented file (think a gigabyte file in 4K chunks,
> producing a million extents), but I'll let a btrfs expert analyse that
> performance issue ;)
>
Heh well we do while (1) in btrfs _everywhere_, so this isn't anything
new, tho I should probably throw a cond_resched() in there. Thanks,
Josef
On 05/04/2011 02:10 PM, Josef Bacik wrote:
> On 05/04/2011 03:04 PM, [email protected] wrote:
>> On Wed, 04 May 2011 13:58:39 EDT, Josef Bacik said:
>>
>>> -SEEK_HOLE: this moves the file pos to the nearest hole in the file
>>> from the
>>> given position.
>>
>> Nearest, or next? Solaris defines it as "next", for a good reason -
>> otherwise
>> you can get stuck in a case where the "nearest" hole is back towards the
>> start of the file - and "seek data" will bounce back to the next byte at
>> the other end of the hole.
>>
>
> Yeah sorry the log says "nearest" but the code says "next", if you look
> at it thats how it works. Thanks,
The comments in fs.h say "closest". You may want to change them to
"next" as well.
Thanks,
Shaggy
On 05/04/2011 04:54 PM, Dave Kleikamp wrote:
> The comments in fs.h say "closest". You may want to change them to
> "next" as well.
Sorry. Missed some of the replies before I responded. Already addressed.
Shaggy
Il 04/05/2011 19:58, Josef Bacik ha scritto:
> + if (offset>= i_size_read(inode)) {
> + mutex_unlock(&inode->i_mutex);
> + return -ENXIO;
> + }
> + offset = i_size_read(inode);
> + break;
Here maybe it's possible to use offset bigger than i_size, because
i_size_read is "atomic" but something can happen between two calls,
isn't it?
Marco
Il 05/05/2011 21:01, Josef Bacik ha scritto:
> On 05/05/2011 02:54 PM, Marco Stornelli wrote:
>> Il 04/05/2011 19:58, Josef Bacik ha scritto:
>>> + if (offset>= i_size_read(inode)) {
>>> + mutex_unlock(&inode->i_mutex);
>>> + return -ENXIO;
>>> + }
>>> + offset = i_size_read(inode);
>>> + break;
>>
>> Here maybe it's possible to use offset bigger than i_size, because
>> i_size_read is "atomic" but something can happen between two calls,
>> isn't it?
>>
>
> We're holding the i_mutex so we are safe, i_size_read is used just for
> consistency sake. Thanks,
>
> Josef
>
Oh, I'm sorry, I misread the patch, ok. Maybe we can use i_size at this
point without i_size_read.
Marco
On 05/05/2011 02:54 PM, Marco Stornelli wrote:
> Il 04/05/2011 19:58, Josef Bacik ha scritto:
>> + if (offset>= i_size_read(inode)) {
>> + mutex_unlock(&inode->i_mutex);
>> + return -ENXIO;
>> + }
>> + offset = i_size_read(inode);
>> + break;
>
> Here maybe it's possible to use offset bigger than i_size, because
> i_size_read is "atomic" but something can happen between two calls,
> isn't it?
>
We're holding the i_mutex so we are safe, i_size_read is used just for
consistency sake. Thanks,
Josef
Il 04/05/2011 19:58, Josef Bacik ha scritto:
> + if (offset>= i_size_read(inode)) {
> + mutex_unlock(&inode->i_mutex);
> + return -ENXIO;
> + }
> + offset = i_size_read(inode);
> + break;
I can add that generic_file_llseek_unlocked means *unlocked* so you
shouldn't unlock any mutex but only return a value. The current version,
in case of SEEK_END uses directly i_size indeed, so maybe I'm missing
something.
Marco
On 05/05/2011 03:19 PM, Marco Stornelli wrote:
> Il 04/05/2011 19:58, Josef Bacik ha scritto:
>> + if (offset>= i_size_read(inode)) {
>> + mutex_unlock(&inode->i_mutex);
>> + return -ENXIO;
>> + }
>> + offset = i_size_read(inode);
>> + break;
>
> I can add that generic_file_llseek_unlocked means *unlocked* so you
> shouldn't unlock any mutex but only return a value. The current version,
> in case of SEEK_END uses directly i_size indeed, so maybe I'm missing
> something.
Yeah this was a copy+paste mistake, ext4 has it's own llseek that I
modified to run my tests against and then I just copied and pasted it
over to the generic things. I've fixed this earlier, I'll be sending a
refreshed set out soon. Thanks,
Josef