2011-05-19 02:44:42

by Sunil Mushran

[permalink] [raw]
Subject: SEEK_DATA/HOLE on ocfs2 - v2


Two patches follow this message. One fixes the default implementation
of SEEK_HOLE/DATA. This patch applies atop Josef's last posted patch.
The second patch implements the same on ocfs2.

The test tool for the same is available here.
http://oss.oracle.com/~smushran/seek_data/seek_test.c

It is improved since the last post. It runs cleanly on zfs, ocfs2 and ext3
(default behavior). Users testing on zfs will need to flip the values of
SEEK_HOLE/DATA.

The results on ext4 and btrfs are also available here.
http://oss.oracle.com/~smushran/seek_data/

Thanks
Sunil


2011-05-19 02:44:43

by Sunil Mushran

[permalink] [raw]
Subject: [PATCH] fs: Fix default SEEK_DATA and SEEK_HOLE implementation

In the default SEEK_DATA implementation, as the entire file is treated as data,
all supplied offsets less than the size of the file are valid.

For empty files, SEEK_DATA and SEEK_HOLE returns -ENXIO irrespective of the
input offset.

Signed-off-by: Sunil Mushran <[email protected]>
---
fs/read_write.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index af9cc51..4b991e9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -66,11 +66,10 @@ generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
break;
case SEEK_DATA:
/*
- * In the generic case the entire file is data, so data only
- * starts at position 0 provided the file has an i_size,
- * otherwise it's an empty file and will always be ENXIO.
+ * In the generic case the entire file is data. So all offsets
+ * less than the file size are valid.
*/
- if (offset != 0 || inode->i_size == 0)
+ if (inode->i_size == 0 || offset >= inode->i_size)
return -ENXIO;
break;
case SEEK_HOLE:
@@ -78,7 +77,7 @@ generic_file_llseek_unlocked(struct file *file, loff_t offset, int origin)
* There is a virtual hole at the end of the file, so as long as
* offset isn't i_size or larger, return i_size.
*/
- if (offset >= inode->i_size)
+ if (inode->i_size == 0 || offset >= inode->i_size)
return -ENXIO;
offset = inode->i_size;
break;
--
1.7.4.1

2011-05-19 02:44:44

by Sunil Mushran

[permalink] [raw]
Subject: [PATCH] ocfs2: Implement llseek()

ocfs2 implements its own llseek() to provide the SEEK_HOLE/SEEK_DATA
functionality.

SEEK_HOLE sets the file pointer to the start of either a hole or an unwritten
(preallocated) extent, that is greater than or equal to the supplied offset.

SEEK_DATA sets the file pointer to the start of an allocated extent (not
unwritten) that is greater than or equal to the supplied offset.

If the supplied offset is on a desired region, then the file pointer is set
to it. Offsets greater than or equal to the file size return -ENXIO.

Unwritten (preallocated) extents are considered holes because the file system
treats reads to such regions in the same way as it does to holes.

Signed-off-by: Sunil Mushran <[email protected]>
---
fs/ocfs2/extent_map.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/ocfs2/extent_map.h | 2 +
fs/ocfs2/file.c | 53 ++++++++++++++++++++++++++-
3 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
index 23457b4..6942c21 100644
--- a/fs/ocfs2/extent_map.c
+++ b/fs/ocfs2/extent_map.c
@@ -832,6 +832,103 @@ out:
return ret;
}

+int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int origin)
+{
+ struct inode *inode = file->f_mapping->host;
+ int ret;
+ unsigned int is_last = 0, is_data = 0;
+ u16 cs_bits = OCFS2_SB(inode->i_sb)->s_clustersize_bits;
+ u32 cpos, cend, clen, hole_size;
+ u64 extoff, extlen;
+ struct buffer_head *di_bh = NULL;
+ struct ocfs2_extent_rec rec;
+
+ BUG_ON(origin != SEEK_DATA && origin != SEEK_HOLE);
+
+ ret = ocfs2_inode_lock(inode, &di_bh, 0);
+ if (ret) {
+ mlog_errno(ret);
+ goto out;
+ }
+
+ down_read(&OCFS2_I(inode)->ip_alloc_sem);
+
+ if (inode->i_size == 0 || *offset >= inode->i_size) {
+ ret = -ENXIO;
+ goto out_unlock;
+ }
+
+ if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+ if (origin == SEEK_HOLE)
+ *offset = inode->i_size;
+ goto out_unlock;
+ }
+
+ clen = 0;
+ cpos = *offset >> cs_bits;
+ cend = ocfs2_clusters_for_bytes(inode->i_sb, inode->i_size);
+
+ while (cpos < cend && !is_last) {
+ ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos, &hole_size,
+ &rec, &is_last);
+ if (ret) {
+ mlog_errno(ret);
+ goto out_unlock;
+ }
+
+ extoff = cpos;
+ extoff <<= cs_bits;
+
+ if (rec.e_blkno == 0ULL) {
+ clen = hole_size;
+ is_data = 0;
+ } else {
+ BUG_ON(cpos < le32_to_cpu(rec.e_cpos));
+ clen = le16_to_cpu(rec.e_leaf_clusters) -
+ (cpos - le32_to_cpu(rec.e_cpos));
+ is_data = (rec.e_flags & OCFS2_EXT_UNWRITTEN) ? 0 : 1;
+ }
+
+ if ((!is_data && origin == SEEK_HOLE) ||
+ (is_data && origin == SEEK_DATA)) {
+ if (extoff > *offset)
+ *offset = extoff;
+ goto out_unlock;
+ }
+
+ if (!is_last)
+ cpos += clen;
+ }
+
+ if (origin == SEEK_HOLE) {
+ extoff = cpos;
+ extoff <<= cs_bits;
+ extlen = clen;
+ extlen <<= cs_bits;
+
+ if ((extoff + extlen) > inode->i_size)
+ extlen = inode->i_size - extoff;
+ extoff += extlen;
+ if (extoff > *offset)
+ *offset = extoff;
+ goto out_unlock;
+ }
+
+ ret = -ENXIO;
+
+out_unlock:
+
+ brelse(di_bh);
+
+ up_read(&OCFS2_I(inode)->ip_alloc_sem);
+
+ ocfs2_inode_unlock(inode, 0);
+out:
+ if (ret && ret != -ENXIO)
+ ret = -ENXIO;
+ return ret;
+}
+
int ocfs2_read_virt_blocks(struct inode *inode, u64 v_block, int nr,
struct buffer_head *bhs[], int flags,
int (*validate)(struct super_block *sb,
diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
index e79d41c..67ea57d 100644
--- a/fs/ocfs2/extent_map.h
+++ b/fs/ocfs2/extent_map.h
@@ -53,6 +53,8 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno,
int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 map_start, u64 map_len);

+int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int origin);
+
int ocfs2_xattr_get_clusters(struct inode *inode, u32 v_cluster,
u32 *p_cluster, u32 *num_clusters,
struct ocfs2_extent_list *el,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index cce8c2b..5349a4b 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2572,6 +2572,55 @@ bail:
return ret;
}

+/* Refer generic_file_llseek_unlocked() */
+static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct inode *inode = file->f_mapping->host;
+ int ret = 0;
+
+ mutex_lock(&inode->i_mutex);
+
+ switch (origin) {
+ case SEEK_END:
+ offset += inode->i_size;
+ break;
+ case SEEK_CUR:
+ if (offset == 0) {
+ offset = file->f_pos;
+ goto out;
+ }
+ offset += file->f_pos;
+ break;
+ case SEEK_DATA:
+ case SEEK_HOLE:
+ ret = ocfs2_seek_data_hole_offset(file, &offset, origin);
+ if (ret)
+ goto out;
+ break;
+ default:
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
+ ret = -EINVAL;
+ if (!ret && offset > inode->i_sb->s_maxbytes)
+ ret = -EINVAL;
+ if (ret)
+ goto out;
+
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+
+out:
+ mutex_unlock(&inode->i_mutex);
+ if (ret)
+ return ret;
+ return offset;
+}
+
const struct inode_operations ocfs2_file_iops = {
.setattr = ocfs2_setattr,
.getattr = ocfs2_getattr,
@@ -2594,7 +2643,7 @@ const struct inode_operations ocfs2_special_file_iops = {
* ocfs2_fops_no_plocks and ocfs2_dops_no_plocks!
*/
const struct file_operations ocfs2_fops = {
- .llseek = generic_file_llseek,
+ .llseek = ocfs2_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
.mmap = ocfs2_mmap,
@@ -2642,7 +2691,7 @@ const struct file_operations ocfs2_dops = {
* the cluster.
*/
const struct file_operations ocfs2_fops_no_plocks = {
- .llseek = generic_file_llseek,
+ .llseek = ocfs2_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
.mmap = ocfs2_mmap,
--
1.7.4.1

2011-05-19 09:13:48

by Tristan Ye

[permalink] [raw]
Subject: Re: [PATCH] ocfs2: Implement llseek()

Sunil Mushran wrote:
> ocfs2 implements its own llseek() to provide the SEEK_HOLE/SEEK_DATA
> functionality.
>
> SEEK_HOLE sets the file pointer to the start of either a hole or an unwritten
> (preallocated) extent, that is greater than or equal to the supplied offset.
>
> SEEK_DATA sets the file pointer to the start of an allocated extent (not
> unwritten) that is greater than or equal to the supplied offset.
>
> If the supplied offset is on a desired region, then the file pointer is set
> to it. Offsets greater than or equal to the file size return -ENXIO.
>
> Unwritten (preallocated) extents are considered holes because the file system
> treats reads to such regions in the same way as it does to holes.
>
> Signed-off-by: Sunil Mushran <[email protected]>
> ---
> fs/ocfs2/extent_map.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ocfs2/extent_map.h | 2 +
> fs/ocfs2/file.c | 53 ++++++++++++++++++++++++++-
> 3 files changed, 150 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
> index 23457b4..6942c21 100644
> --- a/fs/ocfs2/extent_map.c
> +++ b/fs/ocfs2/extent_map.c
> @@ -832,6 +832,103 @@ out:
> return ret;
> }
>
> +int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int origin)
> +{
> + struct inode *inode = file->f_mapping->host;
> + int ret;
> + unsigned int is_last = 0, is_data = 0;
> + u16 cs_bits = OCFS2_SB(inode->i_sb)->s_clustersize_bits;
> + u32 cpos, cend, clen, hole_size;
> + u64 extoff, extlen;
> + struct buffer_head *di_bh = NULL;
> + struct ocfs2_extent_rec rec;
> +
> + BUG_ON(origin != SEEK_DATA && origin != SEEK_HOLE);
> +
> + ret = ocfs2_inode_lock(inode, &di_bh, 0);
> + if (ret) {
> + mlog_errno(ret);
> + goto out;
> + }
> +
> + down_read(&OCFS2_I(inode)->ip_alloc_sem);
> +
> + if (inode->i_size == 0 || *offset >= inode->i_size) {
> + ret = -ENXIO;
> + goto out_unlock;
> + }

Why not using if (*offset >= inode->i_size) directly?

> +
> + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
> + if (origin == SEEK_HOLE)
> + *offset = inode->i_size;
> + goto out_unlock;
> + }
> +
> + clen = 0;
> + cpos = *offset >> cs_bits;
> + cend = ocfs2_clusters_for_bytes(inode->i_sb, inode->i_size);
> +
> + while (cpos < cend && !is_last) {
> + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos, &hole_size,
> + &rec, &is_last);
> + if (ret) {
> + mlog_errno(ret);
> + goto out_unlock;
> + }
> +
> + extoff = cpos;
> + extoff <<= cs_bits;
> +
> + if (rec.e_blkno == 0ULL) {
> + clen = hole_size;
> + is_data = 0;
> + } else {
> + BUG_ON(cpos < le32_to_cpu(rec.e_cpos));


A same assert has already been performed inside ocfs2_get_clusters_nocache(),
does it make sense to do it again here?


> + clen = le16_to_cpu(rec.e_leaf_clusters) -
> + (cpos - le32_to_cpu(rec.e_cpos));
> + is_data = (rec.e_flags & OCFS2_EXT_UNWRITTEN) ? 0 : 1;
> + }
> +
> + if ((!is_data && origin == SEEK_HOLE) ||
> + (is_data && origin == SEEK_DATA)) {
> + if (extoff > *offset)
> + *offset = extoff;
> + goto out_unlock;

Seems above logic is going to stop at the first time we find a hole.

How about the offset was within the range of a hole already when we doing
SEEK_HOLE, shouldn't we proceed detecting until the next hole gets found, whose
start_offset was greater than supplied offset, according to semantics described
by the the header of this patch, should it be like following?

if (extoff > *offset) {
*offset = extoff;
goto out_unlock;
}

> + }
> +
> + if (!is_last)
> + cpos += clen;
> + }
> +
> + if (origin == SEEK_HOLE) {
> + extoff = cpos;
> + extoff <<= cs_bits;

extoff already has been assigned properly above in while loop?

> + extlen = clen;
> + extlen <<= cs_bits;
> +
> + if ((extoff + extlen) > inode->i_size)
> + extlen = inode->i_size - extoff;
> + extoff += extlen;
> + if (extoff > *offset)
> + *offset = extoff;
> + goto out_unlock;
> + }
> +
> + ret = -ENXIO;
> +
> +out_unlock:
> +
> + brelse(di_bh);
> +
> + up_read(&OCFS2_I(inode)->ip_alloc_sem);
> +
> + ocfs2_inode_unlock(inode, 0);
> +out:
> + if (ret && ret != -ENXIO)
> + ret = -ENXIO;
> + return ret;
> +}
> +
> int ocfs2_read_virt_blocks(struct inode *inode, u64 v_block, int nr,
> struct buffer_head *bhs[], int flags,
> int (*validate)(struct super_block *sb,
> diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
> index e79d41c..67ea57d 100644
> --- a/fs/ocfs2/extent_map.h
> +++ b/fs/ocfs2/extent_map.h
> @@ -53,6 +53,8 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno,
> int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> u64 map_start, u64 map_len);
>
> +int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int origin);
> +
> int ocfs2_xattr_get_clusters(struct inode *inode, u32 v_cluster,
> u32 *p_cluster, u32 *num_clusters,
> struct ocfs2_extent_list *el,
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index cce8c2b..5349a4b 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2572,6 +2572,55 @@ bail:
> return ret;
> }
>
> +/* Refer generic_file_llseek_unlocked() */
> +static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int origin)
> +{
> + struct inode *inode = file->f_mapping->host;
> + int ret = 0;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + switch (origin) {
> + case SEEK_END:
> + offset += inode->i_size;
> + break;
> + case SEEK_CUR:
> + if (offset == 0) {
> + offset = file->f_pos;
> + goto out;
> + }
> + offset += file->f_pos;
> + break;
> + case SEEK_DATA:
> + case SEEK_HOLE:
> + ret = ocfs2_seek_data_hole_offset(file, &offset, origin);
> + if (ret)
> + goto out;
> + break;
> + default:
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + if (offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
> + ret = -EINVAL;
> + if (!ret && offset > inode->i_sb->s_maxbytes)
> + ret = -EINVAL;
> + if (ret)
> + goto out;
> +
> + if (offset != file->f_pos) {
> + file->f_pos = offset;
> + file->f_version = 0;
> + }
> +
> +out:
> + mutex_unlock(&inode->i_mutex);
> + if (ret)
> + return ret;
> + return offset;
> +}
> +
> const struct inode_operations ocfs2_file_iops = {
> .setattr = ocfs2_setattr,
> .getattr = ocfs2_getattr,
> @@ -2594,7 +2643,7 @@ const struct inode_operations ocfs2_special_file_iops = {
> * ocfs2_fops_no_plocks and ocfs2_dops_no_plocks!
> */
> const struct file_operations ocfs2_fops = {
> - .llseek = generic_file_llseek,
> + .llseek = ocfs2_file_llseek,
> .read = do_sync_read,
> .write = do_sync_write,
> .mmap = ocfs2_mmap,
> @@ -2642,7 +2691,7 @@ const struct file_operations ocfs2_dops = {
> * the cluster.
> */
> const struct file_operations ocfs2_fops_no_plocks = {
> - .llseek = generic_file_llseek,
> + .llseek = ocfs2_file_llseek,
> .read = do_sync_read,
> .write = do_sync_write,
> .mmap = ocfs2_mmap,

2011-05-19 11:04:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Ocfs2-devel] SEEK_DATA/HOLE on ocfs2 - v2

On Wed, May 18, 2011 at 07:44:42PM -0700, Sunil Mushran wrote:
> It is improved since the last post. It runs cleanly on zfs, ocfs2 and ext3
> (default behavior). Users testing on zfs will need to flip the values of
> SEEK_HOLE/DATA.

sounds like we should switch the around, just to cause the least
amount of confusion.


2011-05-19 11:05:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

On Wed, May 18, 2011 at 07:44:44PM -0700, Sunil Mushran wrote:
> Unwritten (preallocated) extents are considered holes because the file system
> treats reads to such regions in the same way as it does to holes.

How does this work for the case of an unwrittent extent that has been
written to in the pagecache but not converted yet? Y'know the big data
corruption and flamewar that started all this?


2011-05-19 17:29:08

by Sunil Mushran

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

On 05/19/2011 04:05 AM, Christoph Hellwig wrote:
> On Wed, May 18, 2011 at 07:44:44PM -0700, Sunil Mushran wrote:
>> Unwritten (preallocated) extents are considered holes because the file system
>> treats reads to such regions in the same way as it does to holes.
> How does this work for the case of an unwrittent extent that has been
> written to in the pagecache but not converted yet? Y'know the big data
> corruption and flamewar that started all this?

We don't delay splitting the extent. It is split in ->write_begin(). Delaying
it will be a challenge as we have to provide cache coherency across the
cluster.


2011-05-19 17:45:52

by Sunil Mushran

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: Implement llseek()

On 05/19/2011 02:13 AM, Tristan Ye wrote:
>> + if (inode->i_size == 0 || *offset>= inode->i_size) {
>> + ret = -ENXIO;
>> + goto out_unlock;
>> + }
> Why not using if (*offset>= inode->i_size) directly?

duh!

> + BUG_ON(cpos< le32_to_cpu(rec.e_cpos));
> A same assert has already been performed inside ocfs2_get_clusters_nocache(),
> does it make sense to do it again here?

good catch

>> +
>> + if ((!is_data&& origin == SEEK_HOLE) ||
>> + (is_data&& origin == SEEK_DATA)) {
>> + if (extoff> *offset)
>> + *offset = extoff;
>> + goto out_unlock;
> Seems above logic is going to stop at the first time we find a hole.
>
> How about the offset was within the range of a hole already when we doing
> SEEK_HOLE, shouldn't we proceed detecting until the next hole gets found, whose
> start_offset was greater than supplied offset, according to semantics described
> by the the header of this patch, should it be like following?
>
> if (extoff> *offset) {
> *offset = extoff;
> goto out_unlock;
> }

So if the offset is in a hole, then we set the file pointer to it. Same for
data. The file pointer is set to the region asked at an offset that is equal
to or greater than the supplied offset.

>> + if (origin == SEEK_HOLE) {
>> + extoff = cpos;
>> + extoff<<= cs_bits;
> extoff already has been assigned properly above in while loop?

To handle the case when supplied cpos == cend.

As always, excellent review.

Thanks
Sunil