2014-02-25 19:14:55

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

It can be used to convert a range of file to zeros preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferable converted to
unwritten extents - even though file system may choose to zero out the
extent or do whatever which will result in reading zeros from the range
while the range remains allocated for the file.

This can be also used to preallocate blocks past EOF in the same way as
with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
size to remain the same.

You can test this feature yourself using xfstests, of fallocate(1) however
you'll need patches for util_linux, xfsprogs and xfstests which are going to
follow soon.

I tested this mostly with a subset of xfstests using fsx and fsstress and
even with new generic/290 which is just a copy of xfs/290 using fzero
command for xfs_io instead of zero (which uses ioctl). I was testing on
x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.

./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083
generic/117 generic/068 generic/231 generic/127 generic/091 generic/075
generic/112 generic/263 generic/091 generic/075 generic/256 generic/255
generic/316 generic/300 generic/290 ext4/242;

Note that there is a work in progress on FALLOC_FL_COLLAPSE_RANGE which
touches the same area as this pach set does, so we should figure out
which one should go first and modify the other on top of it.

This has been based on top of xfs-collapse-range so it does not contain ext4
collapse range changes.

Thanks!
-Lukas

Changelog:

v2:
- rebased on top of collapse_range changes
- make varisou ext4 functions static
- flag testing changes in do_fallocate

--
[PATCH 1/6] ext4: Update inode i_size after the preallocation
[PATCH 2/6] ext4: refactor ext4_fallocate code
[PATCH 3/6] ext4: translate fallocate mode bits to strings
[PATCH 4/6] fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
[PATCH 5/6] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
[PATCH 6/6] xfs: Add support for FALLOC_FL_ZERO_RANGE


2014-02-25 19:14:34

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 1/6 v2] ext4: Update inode i_size after the preallocation

Currently in ext4_fallocate we would update inode size, c_time and sync
the file with every partial allocation which is entirely unnecessary. It
is true that if the crash happens in the middle of truncate we might end
up with unchanged i size, or c_time which I do not think is really a
problem - it does not mean file system corruption in any way. Note that
xfs is doing things the same way e.g. update all of the mentioned after
the allocation is done.

This commit moves all the updates after the allocation is done. In
addition we also need to change m_time as not only inode has been change
bot also data regions might have changed (unwritten extents). However
m_time will be only updated when i_size changed.

Also we do not need to be paranoid about changing the c_time only if the
actual allocation have happened, we can change it even if we try to
allocate only to find out that there are already block allocated. It's
not really a big deal and it will save us some additional complexity.

Also use ext4_debug, instead of ext4_warning in #ifdef EXT4FS_DEBUG
section.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/extents.c | 89 ++++++++++++++++++++++---------------------------------
1 file changed, 35 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10cff47..67c7917 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4513,36 +4513,6 @@ retry:
ext4_std_error(inode->i_sb, err);
}

-static void ext4_falloc_update_inode(struct inode *inode,
- int mode, loff_t new_size, int update_ctime)
-{
- struct timespec now;
-
- if (update_ctime) {
- now = current_fs_time(inode->i_sb);
- if (!timespec_equal(&inode->i_ctime, &now))
- inode->i_ctime = now;
- }
- /*
- * Update only when preallocation was requested beyond
- * the file size.
- */
- if (!(mode & FALLOC_FL_KEEP_SIZE)) {
- if (new_size > i_size_read(inode))
- i_size_write(inode, new_size);
- if (new_size > EXT4_I(inode)->i_disksize)
- ext4_update_i_disksize(inode, new_size);
- } else {
- /*
- * Mark that we allocate beyond EOF so the subsequent truncate
- * can proceed even if the new size is the same as i_size.
- */
- if (new_size > i_size_read(inode))
- ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
- }
-
-}
-
/*
* preallocate space for a file. This implements ext4's fallocate file
* operation, which gets called from sys_fallocate system call.
@@ -4554,13 +4524,14 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
{
struct inode *inode = file_inode(file);
handle_t *handle;
- loff_t new_size;
+ loff_t new_size = 0;
unsigned int max_blocks;
int ret = 0;
int ret2 = 0;
int retries = 0;
int flags;
struct ext4_map_blocks map;
+ struct timespec tv;
unsigned int credits, blkbits = inode->i_blkbits;

/* Return error if mode is not supported */
@@ -4594,12 +4565,15 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
*/
credits = ext4_chunk_trans_blocks(inode, max_blocks);
mutex_lock(&inode->i_mutex);
- ret = inode_newsize_ok(inode, (len + offset));
- if (ret) {
- mutex_unlock(&inode->i_mutex);
- trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
- return ret;
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ offset + len > i_size_read(inode)) {
+ new_size = offset + len;
+ ret = inode_newsize_ok(inode, new_size);
+ if (ret)
+ goto out;
}
+
flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
@@ -4623,28 +4597,14 @@ retry:
}
ret = ext4_map_blocks(handle, inode, &map, flags);
if (ret <= 0) {
-#ifdef EXT4FS_DEBUG
- ext4_warning(inode->i_sb,
- "inode #%lu: block %u: len %u: "
- "ext4_ext_map_blocks returned %d",
- inode->i_ino, map.m_lblk,
- map.m_len, ret);
-#endif
+ ext4_debug("inode #%lu: block %u: len %u: "
+ "ext4_ext_map_blocks returned %d",
+ inode->i_ino, map.m_lblk,
+ map.m_len, ret);
ext4_mark_inode_dirty(handle, inode);
ret2 = ext4_journal_stop(handle);
break;
}
- if ((map.m_lblk + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
- blkbits) >> blkbits))
- new_size = offset + len;
- else
- new_size = ((loff_t) map.m_lblk + ret) << blkbits;
-
- ext4_falloc_update_inode(inode, mode, new_size,
- (map.m_flags & EXT4_MAP_NEW));
- ext4_mark_inode_dirty(handle, inode);
- if ((file->f_flags & O_SYNC) && ret >= max_blocks)
- ext4_handle_sync(handle);
ret2 = ext4_journal_stop(handle);
if (ret2)
break;
@@ -4654,6 +4614,27 @@ retry:
ret = 0;
goto retry;
}
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle))
+ goto out;
+
+ tv = inode->i_ctime = ext4_current_time(inode);
+
+ if (ret > 0 && new_size) {
+ if (new_size > i_size_read(inode)) {
+ i_size_write(inode, new_size);
+ inode->i_mtime = tv;
+ }
+ if (new_size > EXT4_I(inode)->i_disksize)
+ ext4_update_i_disksize(inode, new_size);
+ }
+ ext4_mark_inode_dirty(handle, inode);
+ if (file->f_flags & O_SYNC)
+ ext4_handle_sync(handle);
+
+ ext4_journal_stop(handle);
+out:
mutex_unlock(&inode->i_mutex);
trace_ext4_fallocate_exit(inode, offset, max_blocks,
ret > 0 ? ret2 : ret);
--
1.8.3.1

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-02-25 19:14:35

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 2/6 v2] ext4: refactor ext4_fallocate code

Move block allocation out of the ext4_fallocate into separate function
called ext4_alloc_file_blocks(). This will allow us to use the same
allocation code for other allocation operations such as zero range which
is commit in the next patch.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/extents.c | 127 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 73 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 67c7917..0e675bc 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4513,6 +4513,64 @@ retry:
ext4_std_error(inode->i_sb, err);
}

+static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+ ext4_lblk_t len, int flags, int mode)
+{
+ struct inode *inode = file_inode(file);
+ handle_t *handle;
+ int ret = 0;
+ int ret2 = 0;
+ int retries = 0;
+ struct ext4_map_blocks map;
+ unsigned int credits;
+
+ map.m_lblk = offset;
+ /*
+ * Don't normalize the request if it can fit in one extent so
+ * that it doesn't get unnecessarily split into multiple
+ * extents.
+ */
+ if (len <= EXT_UNINIT_MAX_LEN)
+ flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
+
+ /*
+ * credits to insert 1 extent into extent tree
+ */
+ credits = ext4_chunk_trans_blocks(inode, len);
+
+retry:
+ while (ret >= 0 && ret < len) {
+ map.m_lblk = map.m_lblk + ret;
+ map.m_len = len = len - ret;
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+ credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ break;
+ }
+ ret = ext4_map_blocks(handle, inode, &map, flags);
+ if (ret <= 0) {
+ ext4_debug("inode #%lu: block %u: len %u: "
+ "ext4_ext_map_blocks returned %d",
+ inode->i_ino, map.m_lblk,
+ map.m_len, ret);
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
+ ret2 = ext4_journal_stop(handle);
+ if (ret2)
+ break;
+ }
+ if (ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries)) {
+ ret = 0;
+ goto retry;
+ }
+
+ return ret > 0 ? ret2 : ret;
+}
+
/*
* preallocate space for a file. This implements ext4's fallocate file
* operation, which gets called from sys_fallocate system call.
@@ -4527,12 +4585,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
loff_t new_size = 0;
unsigned int max_blocks;
int ret = 0;
- int ret2 = 0;
- int retries = 0;
int flags;
- struct ext4_map_blocks map;
+ ext4_lblk_t lblk;
struct timespec tv;
- unsigned int credits, blkbits = inode->i_blkbits;
+ unsigned int blkbits = inode->i_blkbits;

/* Return error if mode is not supported */
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -4553,17 +4609,18 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EOPNOTSUPP;

trace_ext4_fallocate_enter(inode, offset, len, mode);
- map.m_lblk = offset >> blkbits;
+ lblk = offset >> blkbits;
/*
* We can't just convert len to max_blocks because
* If blocksize = 4096 offset = 3072 and len = 2048
*/
max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
- - map.m_lblk;
- /*
- * credits to insert 1 extent into extent tree
- */
- credits = ext4_chunk_trans_blocks(inode, max_blocks);
+ - lblk;
+
+ flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
+ if (mode & FALLOC_FL_KEEP_SIZE)
+ flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
+
mutex_lock(&inode->i_mutex);

if (!(mode & FALLOC_FL_KEEP_SIZE) &&
@@ -4574,46 +4631,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
goto out;
}

- flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
- if (mode & FALLOC_FL_KEEP_SIZE)
- flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
- /*
- * Don't normalize the request if it can fit in one extent so
- * that it doesn't get unnecessarily split into multiple
- * extents.
- */
- if (len <= EXT_UNINIT_MAX_LEN << blkbits)
- flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
-
-retry:
- while (ret >= 0 && ret < max_blocks) {
- map.m_lblk = map.m_lblk + ret;
- map.m_len = max_blocks = max_blocks - ret;
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
- credits);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- break;
- }
- ret = ext4_map_blocks(handle, inode, &map, flags);
- if (ret <= 0) {
- ext4_debug("inode #%lu: block %u: len %u: "
- "ext4_ext_map_blocks returned %d",
- inode->i_ino, map.m_lblk,
- map.m_len, ret);
- ext4_mark_inode_dirty(handle, inode);
- ret2 = ext4_journal_stop(handle);
- break;
- }
- ret2 = ext4_journal_stop(handle);
- if (ret2)
- break;
- }
- if (ret == -ENOSPC &&
- ext4_should_retry_alloc(inode->i_sb, &retries)) {
- ret = 0;
- goto retry;
- }
+ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags, mode);
+ if (ret)
+ goto out;

handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
if (IS_ERR(handle))
@@ -4621,7 +4641,7 @@ retry:

tv = inode->i_ctime = ext4_current_time(inode);

- if (ret > 0 && new_size) {
+ if (!ret && new_size) {
if (new_size > i_size_read(inode)) {
i_size_write(inode, new_size);
inode->i_mtime = tv;
@@ -4636,9 +4656,8 @@ retry:
ext4_journal_stop(handle);
out:
mutex_unlock(&inode->i_mutex);
- trace_ext4_fallocate_exit(inode, offset, max_blocks,
- ret > 0 ? ret2 : ret);
- return ret > 0 ? ret2 : ret;
+ trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
+ return ret;
}

/*
--
1.8.3.1


2014-02-25 19:14:36

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 3/6 v2] ext4: translate fallocate mode bits to strings

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 1 -
include/trace/events/ext4.h | 9 +++++++--
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ece5556..3b9601c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -31,6 +31,7 @@
#include <linux/percpu_counter.h>
#include <linux/ratelimit.h>
#include <crypto/hash.h>
+#include <linux/falloc.h>
#ifdef __KERNEL__
#include <linux/compat.h>
#endif
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e675bc..e5485eb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -37,7 +37,6 @@
#include <linux/quotaops.h>
#include <linux/string.h>
#include <linux/slab.h>
-#include <linux/falloc.h>
#include <asm/uaccess.h>
#include <linux/fiemap.h>
#include "ext4_jbd2.h"
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 197d312..451e020 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -68,6 +68,11 @@ struct extent_status;
{ EXTENT_STATUS_DELAYED, "D" }, \
{ EXTENT_STATUS_HOLE, "H" })

+#define show_falloc_mode(mode) __print_flags(mode, "|", \
+ { FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \
+ { FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \
+ { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"})
+

TRACE_EVENT(ext4_free_inode,
TP_PROTO(struct inode *inode),
@@ -1349,10 +1354,10 @@ TRACE_EVENT(ext4_fallocate_enter,
__entry->mode = mode;
),

- TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %d",
+ TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino, __entry->pos,
- __entry->len, __entry->mode)
+ __entry->len, show_falloc_mode(__entry->mode))
);

TRACE_EVENT(ext4_fallocate_exit,
--
1.8.3.1


2014-02-25 19:15:01

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 4/6 v2] fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

It can be used to convert a range of file to zeros preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferable converted to
unwritten extents - even though file system may choose to zero out the
extent or do whatever which will result in reading zeros from the range
while the range remains allocated for the file.

This can be also used to preallocate blocks past EOF in the same way as
with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
size to remain the same.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/open.c | 7 ++++++-
include/uapi/linux/falloc.h | 14 ++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index 4a923a5..c4465b2 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -232,7 +232,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)

/* Return error if mode is not supported */
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
- FALLOC_FL_COLLAPSE_RANGE))
+ FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE))
+ return -EOPNOTSUPP;
+
+ /* Punch hole and zero range are mutually exclusive */
+ if ((mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) ==
+ (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;

/* Punch hole must have keep size set */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 5ff562d..d1197ae 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -27,4 +27,18 @@
*/
#define FALLOC_FL_COLLAPSE_RANGE 0x08

+/*
+ * FALLOC_FL_ZERO_RANGE is used to convert a range of file to zeros preferably
+ * without issuing data IO. Blocks should be preallocated for the regions that
+ * span holes in the file, and the entire range is preferable converted to
+ * unwritten extents - even though file system may choose to zero out the
+ * extent or do whatever which will result in reading zeros from the range
+ * while the range remains allocated for the file.
+ *
+ * This can be also used to preallocate blocks past EOF in the same way as
+ * with fallocate. Flag FALLOC_FL_KEEP_SIZE should cause the inode
+ * size to remain the same.
+ */
+#define FALLOC_FL_ZERO_RANGE 0x10
+
#endif /* _UAPI_FALLOC_H_ */
--
1.8.3.1


2014-02-25 19:15:06

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 6/6 v2] xfs: Add support for FALLOC_FL_ZERO_RANGE

Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

We can also preallocate blocks past EOF in the same was as with
fallocate. Flag FALLOC_FL_KEEP_SIZE will cause the inode size to remain
the same even if we preallocate blocks past EOF.

It uses the same code to zero range as it is used by the
XFS_IOC_ZERO_RANGE ioctl.

Signed-off-by: Lukas Czerner <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_file.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 52f96e1..8fb97a6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -824,7 +824,7 @@ xfs_file_fallocate(
if (!S_ISREG(inode->i_mode))
return -EINVAL;
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
- FALLOC_FL_COLLAPSE_RANGE))
+ FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;

xfs_ilock(ip, XFS_IOLOCK_EXCL);
@@ -855,8 +855,11 @@ xfs_file_fallocate(
goto out_unlock;
}

- error = xfs_alloc_file_space(ip, offset, len,
- XFS_BMAPI_PREALLOC);
+ if (mode & FALLOC_FL_ZERO_RANGE)
+ error = xfs_zero_file_space(ip, offset, len);
+ else
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_BMAPI_PREALLOC);
if (error)
goto out_unlock;
}
--
1.8.3.1


2014-02-25 19:14:38

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH 5/6 v2] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

It can be used to convert a range of file to zeros preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferable converted to
unwritten extents

This can be also used to preallocate blocks past EOF in the same way as
with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
size to remain the same.

Also add appropriate tracepoints.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/extents.c | 270 +++++++++++++++++++++++++++++++++++++++++---
fs/ext4/inode.c | 17 ++-
include/trace/events/ext4.h | 64 +++++------
4 files changed, 300 insertions(+), 53 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3b9601c..a649abe 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -568,6 +568,8 @@ enum {
#define EXT4_GET_BLOCKS_NO_LOCK 0x0100
/* Do not put hole in extent cache */
#define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200
+ /* Convert written extents to unwritten */
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0400

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e5485eb..017b4fb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3568,6 +3568,8 @@ out:
* b> Splits in two extents: Write is happening at either end of the extent
* c> Splits in three extents: Somone is writing in middle of the extent
*
+ * This works the same way in the case of initialized -> unwritten conversion.
+ *
* One of more index blocks maybe needed if the extent tree grow after
* the uninitialized extent split. To prevent ENOSPC occur at the IO
* complete, we need to split the uninitialized extent before DIO submit
@@ -3578,7 +3580,7 @@ out:
*
* Returns the size of uninitialized extent to be written on success.
*/
-static int ext4_split_unwritten_extents(handle_t *handle,
+static int ext4_split_convert_extents(handle_t *handle,
struct inode *inode,
struct ext4_map_blocks *map,
struct ext4_ext_path *path,
@@ -3590,9 +3592,9 @@ static int ext4_split_unwritten_extents(handle_t *handle,
unsigned int ee_len;
int split_flag = 0, depth;

- ext_debug("ext4_split_unwritten_extents: inode %lu, logical"
- "block %llu, max_blocks %u\n", inode->i_ino,
- (unsigned long long)map->m_lblk, map->m_len);
+ ext_debug("%s: inode %lu, logical block %llu, max_blocks %u\n",
+ __func__, inode->i_ino,
+ (unsigned long long)map->m_lblk, map->m_len);

eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
inode->i_sb->s_blocksize_bits;
@@ -3607,14 +3609,73 @@ static int ext4_split_unwritten_extents(handle_t *handle,
ee_block = le32_to_cpu(ex->ee_block);
ee_len = ext4_ext_get_actual_len(ex);

- split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
- split_flag |= EXT4_EXT_MARK_UNINIT2;
- if (flags & EXT4_GET_BLOCKS_CONVERT)
- split_flag |= EXT4_EXT_DATA_VALID2;
+ /* Convert to unwritten */
+ if (flags | EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
+ split_flag |= EXT4_EXT_DATA_VALID1;
+ /* Convert to initialized */
+ } else if (flags | EXT4_GET_BLOCKS_CONVERT) {
+ split_flag |= ee_block + ee_len <= eof_block ?
+ EXT4_EXT_MAY_ZEROOUT : 0;
+ split_flag |= (EXT4_EXT_MARK_UNINIT2 & EXT4_EXT_DATA_VALID2);
+ }
flags |= EXT4_GET_BLOCKS_PRE_IO;
return ext4_split_extent(handle, inode, path, map, split_flag, flags);
}

+static int ext4_convert_initialized_extents(handle_t *handle,
+ struct inode *inode,
+ struct ext4_map_blocks *map,
+ struct ext4_ext_path *path)
+{
+ struct ext4_extent *ex;
+ ext4_lblk_t ee_block;
+ unsigned int ee_len;
+ int depth;
+ int err = 0;
+
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+ ee_block = le32_to_cpu(ex->ee_block);
+ ee_len = ext4_ext_get_actual_len(ex);
+
+ ext_debug("%s: inode %lu, logical"
+ "block %llu, max_blocks %u\n", __func__, inode->i_ino,
+ (unsigned long long)ee_block, ee_len);
+
+ if (ee_block != map->m_lblk || ee_len > map->m_len) {
+ err = ext4_split_convert_extents(handle, inode, map, path,
+ EXT4_GET_BLOCKS_CONVERT_UNWRITTEN);
+ if (err < 0)
+ goto out;
+ ext4_ext_drop_refs(path);
+ path = ext4_ext_find_extent(inode, map->m_lblk, path, 0);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ goto out;
+ }
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+ }
+
+ err = ext4_ext_get_access(handle, inode, path + depth);
+ if (err)
+ goto out;
+ /* first mark the extent as uninitialized */
+ ext4_ext_mark_uninitialized(ex);
+
+ /* note: ext4_ext_correct_indexes() isn't needed here because
+ * borders are not changed
+ */
+ ext4_ext_try_to_merge(handle, inode, path, ex);
+
+ /* Mark modified extent as dirty */
+ err = ext4_ext_dirty(handle, inode, path + path->p_depth);
+out:
+ ext4_ext_show_leaf(inode, path);
+ return err;
+}
+
+
static int ext4_convert_unwritten_extents_endio(handle_t *handle,
struct inode *inode,
struct ext4_map_blocks *map,
@@ -3648,8 +3709,8 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
inode->i_ino, (unsigned long long)ee_block, ee_len,
(unsigned long long)map->m_lblk, map->m_len);
#endif
- err = ext4_split_unwritten_extents(handle, inode, map, path,
- EXT4_GET_BLOCKS_CONVERT);
+ err = ext4_split_convert_extents(handle, inode, map, path,
+ EXT4_GET_BLOCKS_CONVERT);
if (err < 0)
goto out;
ext4_ext_drop_refs(path);
@@ -3850,6 +3911,42 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
}

static int
+ext4_ext_convert_initialized_extent(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map,
+ struct ext4_ext_path *path, int flags,
+ unsigned int allocated, ext4_fsblk_t newblock)
+{
+ int ret = 0;
+ int err = 0;
+
+ /*
+ * Make sure that the extent is no bigger than we support with
+ * uninitialized extent
+ */
+ if (map->m_len > EXT_UNINIT_MAX_LEN)
+ map->m_len = EXT_UNINIT_MAX_LEN / 2;
+
+ ret = ext4_convert_initialized_extents(handle, inode, map,
+ path);
+ if (ret >= 0) {
+ ext4_update_inode_fsync_trans(handle, inode, 1);
+ err = check_eofblocks_fl(handle, inode, map->m_lblk,
+ path, map->m_len);
+ } else
+ err = ret;
+ map->m_flags |= EXT4_MAP_UNWRITTEN;
+ if (allocated > map->m_len)
+ allocated = map->m_len;
+ map->m_len = allocated;
+
+ if (path) {
+ ext4_ext_drop_refs(path);
+ kfree(path);
+ }
+ return err ? err : allocated;
+}
+
+static int
ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map,
struct ext4_ext_path *path, int flags,
@@ -3876,8 +3973,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,

/* get_block() before submit the IO, split the extent */
if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
- ret = ext4_split_unwritten_extents(handle, inode, map,
- path, flags);
+ ret = ext4_split_convert_extents(handle, inode, map,
+ path, flags | EXT4_GET_BLOCKS_CONVERT);
if (ret <= 0)
goto out;
/*
@@ -4168,6 +4265,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
unsigned short ee_len;

+
/*
* Uninitialized extents are treated as holes, except that
* we split out initialized portions during a write.
@@ -4184,7 +4282,17 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
ee_block, ee_len, newblock);

- if (!ext4_ext_is_uninitialized(ex))
+ /*
+ * If the extent is initialized check whether the
+ * caller wants to convert it to unwritten.
+ */
+ if ((!ext4_ext_is_uninitialized(ex)) &&
+ (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
+ allocated = ext4_ext_convert_initialized_extent(
+ handle, inode, map, path, flags,
+ allocated, newblock);
+ goto out3;
+ } else if (!ext4_ext_is_uninitialized(ex))
goto out;

allocated = ext4_ext_handle_uninitialized_extents(
@@ -4570,6 +4678,136 @@ retry:
return ret > 0 ? ret2 : ret;
}

+static long ext4_zero_range(struct file *file, loff_t offset,
+ loff_t len, int mode)
+{
+ struct inode *inode = file_inode(file);
+ handle_t *handle = NULL;
+ unsigned int max_blocks;
+ loff_t new_size = 0;
+ int ret = 0;
+ int flags;
+ int partial;
+ loff_t start, end;
+ ext4_lblk_t lblk;
+ struct address_space *mapping = inode->i_mapping;
+ unsigned int blkbits = inode->i_blkbits;
+
+ trace_ext4_zero_range(inode, offset, len, mode);
+
+ /*
+ * Write out all dirty pages to avoid race conditions
+ * Then release them.
+ */
+ if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+ ret = filemap_write_and_wait_range(mapping, offset,
+ offset + len - 1);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Round up offset. This is not fallocate, we neet to zero out
+ * blocks, so convert interior block aligned part of the range to
+ * unwritten and possibly manually zero out unaligned parts of the
+ * range.
+ */
+ start = round_up(offset, 1 << blkbits);
+ end = round_down((offset + len), 1 << blkbits);
+
+ if (start < offset || end > offset + len)
+ return -EINVAL;
+ partial = (offset + len) & ((1 << blkbits) - 1);
+
+ lblk = start >> blkbits;
+ max_blocks = (end >> blkbits);
+ if (max_blocks < lblk)
+ max_blocks = 0;
+ else
+ max_blocks -= lblk;
+
+ flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
+ EXT4_GET_BLOCKS_CONVERT_UNWRITTEN;
+ if (mode & FALLOC_FL_KEEP_SIZE)
+ flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
+
+ mutex_lock(&inode->i_mutex);
+
+ /*
+ * Indirect files do not support unwritten extnets
+ */
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
+ ret = -EOPNOTSUPP;
+ goto out_mutex;
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ offset + len > i_size_read(inode)) {
+ new_size = offset + len;
+ ret = inode_newsize_ok(inode, new_size);
+ if (ret)
+ goto out_mutex;
+ /*
+ * If we have a partial block after EOF we have to allocate
+ * the entire block.
+ */
+ if (partial)
+ max_blocks += 1;
+ }
+
+ if (max_blocks > 0) {
+
+ /* Now release the pages and zero block aligned part of pages*/
+ truncate_pagecache_range(inode, start, end - 1);
+
+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
+ /*
+ * Remove entire range from the extent status tree.
+ */
+ ret = ext4_es_remove_extent(inode, lblk, max_blocks);
+ if (ret)
+ goto out_dio;
+
+ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags,
+ mode);
+ if (ret)
+ goto out_dio;
+ }
+
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ ext4_std_error(inode->i_sb, ret);
+ goto out_dio;
+ }
+
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
+ if (!ret && new_size) {
+ if (new_size > i_size_read(inode))
+ i_size_write(inode, new_size);
+ if (new_size > EXT4_I(inode)->i_disksize)
+ ext4_update_i_disksize(inode, new_size);
+ }
+ ext4_mark_inode_dirty(handle, inode);
+
+ /* Zero out partial block at the edges of the range */
+ ret = ext4_zero_partial_blocks(handle, inode, offset, len);
+
+ if (file->f_flags & O_SYNC)
+ ext4_handle_sync(handle);
+
+ ext4_journal_stop(handle);
+out_dio:
+ ext4_inode_resume_unlocked_dio(inode);
+out_mutex:
+ mutex_unlock(&inode->i_mutex);
+ return ret;
+}
+
/*
* preallocate space for a file. This implements ext4's fallocate file
* operation, which gets called from sys_fallocate system call.
@@ -4590,7 +4828,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
unsigned int blkbits = inode->i_blkbits;

/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;

if (mode & FALLOC_FL_PUNCH_HOLE)
@@ -4607,6 +4846,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
return -EOPNOTSUPP;

+ if (mode & FALLOC_FL_ZERO_RANGE)
+ return ext4_zero_range(file, offset, len, mode);
+
trace_ext4_fallocate_enter(inode, offset, len, mode);
lblk = offset >> blkbits;
/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895..e64807f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -503,6 +503,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
{
struct extent_status es;
int retval;
+ int ret = 0;
#ifdef ES_AGGRESSIVE_TEST
struct ext4_map_blocks orig_map;

@@ -552,7 +553,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
EXT4_GET_BLOCKS_KEEP_SIZE);
}
if (retval > 0) {
- int ret;
unsigned int status;

if (unlikely(retval != map->m_len)) {
@@ -579,7 +579,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,

found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret = check_block_validity(inode, map);
+ ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
}
@@ -596,7 +596,13 @@ found:
* with buffer head unmapped.
*/
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED)
- return retval;
+ /*
+ * If we need to convert extent to unwritten
+ * we continue and do the actual work in
+ * ext4_ext_map_blocks()
+ */
+ if (!(flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN))
+ return retval;

/*
* Here we clear m_flags because after allocating an new extent,
@@ -652,7 +658,6 @@ found:
ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);

if (retval > 0) {
- int ret;
unsigned int status;

if (unlikely(retval != map->m_len)) {
@@ -687,7 +692,7 @@ found:
has_zeroout:
up_write((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret = check_block_validity(inode, map);
+ ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
}
@@ -3501,7 +3506,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
if (!S_ISREG(inode->i_mode))
return -EOPNOTSUPP;

- trace_ext4_punch_hole(inode, offset, length);
+ trace_ext4_punch_hole(inode, offset, length, 0);

/*
* Write out all dirty pages to avoid race conditions
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 451e020..7bb26aa 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -71,7 +71,8 @@ struct extent_status;
#define show_falloc_mode(mode) __print_flags(mode, "|", \
{ FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \
{ FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \
- { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"})
+ { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"}, \
+ { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"})


TRACE_EVENT(ext4_free_inode,
@@ -1333,7 +1334,7 @@ TRACE_EVENT(ext4_direct_IO_exit,
__entry->rw, __entry->ret)
);

-TRACE_EVENT(ext4_fallocate_enter,
+DECLARE_EVENT_CLASS(ext4__fallocate_mode,
TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),

TP_ARGS(inode, offset, len, mode),
@@ -1341,23 +1342,45 @@ TRACE_EVENT(ext4_fallocate_enter,
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( loff_t, pos )
- __field( loff_t, len )
+ __field( loff_t, offset )
+ __field( loff_t, len )
__field( int, mode )
),

TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->pos = offset;
+ __entry->offset = offset;
__entry->len = len;
__entry->mode = mode;
),

- TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %s",
+ TP_printk("dev %d,%d ino %lu offset %lld len %lld mode %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
- (unsigned long) __entry->ino, __entry->pos,
- __entry->len, show_falloc_mode(__entry->mode))
+ (unsigned long) __entry->ino,
+ __entry->offset, __entry->len,
+ show_falloc_mode(__entry->mode))
+);
+
+DEFINE_EVENT(ext4__fallocate_mode, ext4_fallocate_enter,
+
+ TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
+
+ TP_ARGS(inode, offset, len, mode)
+);
+
+DEFINE_EVENT(ext4__fallocate_mode, ext4_punch_hole,
+
+ TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
+
+ TP_ARGS(inode, offset, len, mode)
+);
+
+DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range,
+
+ TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
+
+ TP_ARGS(inode, offset, len, mode)
);

TRACE_EVENT(ext4_fallocate_exit,
@@ -1389,31 +1412,6 @@ TRACE_EVENT(ext4_fallocate_exit,
__entry->ret)
);

-TRACE_EVENT(ext4_punch_hole,
- TP_PROTO(struct inode *inode, loff_t offset, loff_t len),
-
- TP_ARGS(inode, offset, len),
-
- TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( loff_t, offset )
- __field( loff_t, len )
- ),
-
- TP_fast_assign(
- __entry->dev = inode->i_sb->s_dev;
- __entry->ino = inode->i_ino;
- __entry->offset = offset;
- __entry->len = len;
- ),
-
- TP_printk("dev %d,%d ino %lu offset %lld len %lld",
- MAJOR(__entry->dev), MINOR(__entry->dev),
- (unsigned long) __entry->ino,
- __entry->offset, __entry->len)
-);
-
TRACE_EVENT(ext4_unlink_enter,
TP_PROTO(struct inode *parent, struct dentry *dentry),

--
1.8.3.1


2014-02-26 06:00:49

by jon ernst

[permalink] [raw]
Subject: Re: [PATCH 5/6 v2] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Tue, Feb 25, 2014 at 2:14 PM, Lukas Czerner <[email protected]> wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of file to zeros preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferable converted to
> unwritten extents
>
> This can be also used to preallocate blocks past EOF in the same way as
> with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> size to remain the same.
>
> Also add appropriate tracepoints.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/ext4/ext4.h | 2 +
> fs/ext4/extents.c | 270 +++++++++++++++++++++++++++++++++++++++++---
> fs/ext4/inode.c | 17 ++-
> include/trace/events/ext4.h | 64 +++++------
> 4 files changed, 300 insertions(+), 53 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3b9601c..a649abe 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -568,6 +568,8 @@ enum {
> #define EXT4_GET_BLOCKS_NO_LOCK 0x0100
> /* Do not put hole in extent cache */
> #define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200
> + /* Convert written extents to unwritten */
> +#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0400
>
> /*
> * The bit position of these flags must not overlap with any of the
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index e5485eb..017b4fb 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3568,6 +3568,8 @@ out:
> * b> Splits in two extents: Write is happening at either end of the extent
> * c> Splits in three extents: Somone is writing in middle of the extent
> *
> + * This works the same way in the case of initialized -> unwritten conversion.
> + *
> * One of more index blocks maybe needed if the extent tree grow after
> * the uninitialized extent split. To prevent ENOSPC occur at the IO
> * complete, we need to split the uninitialized extent before DIO submit
> @@ -3578,7 +3580,7 @@ out:
> *
> * Returns the size of uninitialized extent to be written on success.
> */
> -static int ext4_split_unwritten_extents(handle_t *handle,
> +static int ext4_split_convert_extents(handle_t *handle,
> struct inode *inode,
> struct ext4_map_blocks *map,
> struct ext4_ext_path *path,
> @@ -3590,9 +3592,9 @@ static int ext4_split_unwritten_extents(handle_t *handle,
> unsigned int ee_len;
> int split_flag = 0, depth;
>
> - ext_debug("ext4_split_unwritten_extents: inode %lu, logical"
> - "block %llu, max_blocks %u\n", inode->i_ino,
> - (unsigned long long)map->m_lblk, map->m_len);
> + ext_debug("%s: inode %lu, logical block %llu, max_blocks %u\n",
> + __func__, inode->i_ino,
> + (unsigned long long)map->m_lblk, map->m_len);
>
> eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
> inode->i_sb->s_blocksize_bits;
> @@ -3607,14 +3609,73 @@ static int ext4_split_unwritten_extents(handle_t *handle,
> ee_block = le32_to_cpu(ex->ee_block);
> ee_len = ext4_ext_get_actual_len(ex);
>
> - split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
> - split_flag |= EXT4_EXT_MARK_UNINIT2;
> - if (flags & EXT4_GET_BLOCKS_CONVERT)
> - split_flag |= EXT4_EXT_DATA_VALID2;
> + /* Convert to unwritten */
> + if (flags | EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
> + split_flag |= EXT4_EXT_DATA_VALID1;
> + /* Convert to initialized */
> + } else if (flags | EXT4_GET_BLOCKS_CONVERT) {
> + split_flag |= ee_block + ee_len <= eof_block ?
> + EXT4_EXT_MAY_ZEROOUT : 0;
> + split_flag |= (EXT4_EXT_MARK_UNINIT2 & EXT4_EXT_DATA_VALID2);
> + }
> flags |= EXT4_GET_BLOCKS_PRE_IO;
> return ext4_split_extent(handle, inode, path, map, split_flag, flags);
> }
>
> +static int ext4_convert_initialized_extents(handle_t *handle,
> + struct inode *inode,
> + struct ext4_map_blocks *map,
> + struct ext4_ext_path *path)
> +{
> + struct ext4_extent *ex;
> + ext4_lblk_t ee_block;
> + unsigned int ee_len;
> + int depth;
> + int err = 0;
> +
> + depth = ext_depth(inode);
> + ex = path[depth].p_ext;
> + ee_block = le32_to_cpu(ex->ee_block);
> + ee_len = ext4_ext_get_actual_len(ex);
> +
> + ext_debug("%s: inode %lu, logical"
> + "block %llu, max_blocks %u\n", __func__, inode->i_ino,
> + (unsigned long long)ee_block, ee_len);
> +
> + if (ee_block != map->m_lblk || ee_len > map->m_len) {
> + err = ext4_split_convert_extents(handle, inode, map, path,
> + EXT4_GET_BLOCKS_CONVERT_UNWRITTEN);
> + if (err < 0)
> + goto out;
> + ext4_ext_drop_refs(path);
> + path = ext4_ext_find_extent(inode, map->m_lblk, path, 0);
> + if (IS_ERR(path)) {
> + err = PTR_ERR(path);
> + goto out;
> + }
> + depth = ext_depth(inode);
> + ex = path[depth].p_ext;
> + }
> +
> + err = ext4_ext_get_access(handle, inode, path + depth);
> + if (err)
> + goto out;
> + /* first mark the extent as uninitialized */
> + ext4_ext_mark_uninitialized(ex);
> +
> + /* note: ext4_ext_correct_indexes() isn't needed here because
> + * borders are not changed
> + */
> + ext4_ext_try_to_merge(handle, inode, path, ex);
> +
> + /* Mark modified extent as dirty */
> + err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> +out:
> + ext4_ext_show_leaf(inode, path);
> + return err;
> +}
> +
> +
> static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> struct inode *inode,
> struct ext4_map_blocks *map,
> @@ -3648,8 +3709,8 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> inode->i_ino, (unsigned long long)ee_block, ee_len,
> (unsigned long long)map->m_lblk, map->m_len);
> #endif
> - err = ext4_split_unwritten_extents(handle, inode, map, path,
> - EXT4_GET_BLOCKS_CONVERT);
> + err = ext4_split_convert_extents(handle, inode, map, path,
> + EXT4_GET_BLOCKS_CONVERT);
> if (err < 0)
> goto out;
> ext4_ext_drop_refs(path);
> @@ -3850,6 +3911,42 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
> }
>
> static int
> +ext4_ext_convert_initialized_extent(handle_t *handle, struct inode *inode,
> + struct ext4_map_blocks *map,
> + struct ext4_ext_path *path, int flags,
> + unsigned int allocated, ext4_fsblk_t newblock)
> +{
> + int ret = 0;
> + int err = 0;
> +
> + /*
> + * Make sure that the extent is no bigger than we support with
> + * uninitialized extent
> + */
> + if (map->m_len > EXT_UNINIT_MAX_LEN)
> + map->m_len = EXT_UNINIT_MAX_LEN / 2;

Pardon my possible dumb question. Why do you use "EXT_UNINIT_MAX_LEN
/ 2;" here instead of "EXT_UNINIT_MAX_LEN"
I don't see the reason why we can't use EXT_UNINIT_MAX_LEN here.




Thanks!
Jon


> +
> + ret = ext4_convert_initialized_extents(handle, inode, map,
> + path);
> + if (ret >= 0) {
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> + err = check_eofblocks_fl(handle, inode, map->m_lblk,
> + path, map->m_len);
> + } else
> + err = ret;
> + map->m_flags |= EXT4_MAP_UNWRITTEN;
> + if (allocated > map->m_len)
> + allocated = map->m_len;
> + map->m_len = allocated;
> +
> + if (path) {
> + ext4_ext_drop_refs(path);
> + kfree(path);
> + }
> + return err ? err : allocated;
> +}
> +
> +static int
> ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
> struct ext4_map_blocks *map,
> struct ext4_ext_path *path, int flags,
> @@ -3876,8 +3973,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
>
> /* get_block() before submit the IO, split the extent */
> if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
> - ret = ext4_split_unwritten_extents(handle, inode, map,
> - path, flags);
> + ret = ext4_split_convert_extents(handle, inode, map,
> + path, flags | EXT4_GET_BLOCKS_CONVERT);
> if (ret <= 0)
> goto out;
> /*
> @@ -4168,6 +4265,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
> unsigned short ee_len;
>
> +
> /*
> * Uninitialized extents are treated as holes, except that
> * we split out initialized portions during a write.
> @@ -4184,7 +4282,17 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
> ee_block, ee_len, newblock);
>
> - if (!ext4_ext_is_uninitialized(ex))
> + /*
> + * If the extent is initialized check whether the
> + * caller wants to convert it to unwritten.
> + */
> + if ((!ext4_ext_is_uninitialized(ex)) &&
> + (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
> + allocated = ext4_ext_convert_initialized_extent(
> + handle, inode, map, path, flags,
> + allocated, newblock);
> + goto out3;
> + } else if (!ext4_ext_is_uninitialized(ex))
> goto out;
>
> allocated = ext4_ext_handle_uninitialized_extents(
> @@ -4570,6 +4678,136 @@ retry:
> return ret > 0 ? ret2 : ret;
> }
>
> +static long ext4_zero_range(struct file *file, loff_t offset,
> + loff_t len, int mode)
> +{
> + struct inode *inode = file_inode(file);
> + handle_t *handle = NULL;
> + unsigned int max_blocks;
> + loff_t new_size = 0;
> + int ret = 0;
> + int flags;
> + int partial;
> + loff_t start, end;
> + ext4_lblk_t lblk;
> + struct address_space *mapping = inode->i_mapping;
> + unsigned int blkbits = inode->i_blkbits;
> +
> + trace_ext4_zero_range(inode, offset, len, mode);
> +
> + /*
> + * Write out all dirty pages to avoid race conditions
> + * Then release them.
> + */
> + if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> + ret = filemap_write_and_wait_range(mapping, offset,
> + offset + len - 1);
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Round up offset. This is not fallocate, we neet to zero out
> + * blocks, so convert interior block aligned part of the range to
> + * unwritten and possibly manually zero out unaligned parts of the
> + * range.
> + */
> + start = round_up(offset, 1 << blkbits);
> + end = round_down((offset + len), 1 << blkbits);
> +
> + if (start < offset || end > offset + len)
> + return -EINVAL;
> + partial = (offset + len) & ((1 << blkbits) - 1);
> +
> + lblk = start >> blkbits;
> + max_blocks = (end >> blkbits);
> + if (max_blocks < lblk)
> + max_blocks = 0;
> + else
> + max_blocks -= lblk;
> +
> + flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
> + EXT4_GET_BLOCKS_CONVERT_UNWRITTEN;
> + if (mode & FALLOC_FL_KEEP_SIZE)
> + flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + /*
> + * Indirect files do not support unwritten extnets
> + */
> + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> + ret = -EOPNOTSUPP;
> + goto out_mutex;
> + }
> +
> + if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> + offset + len > i_size_read(inode)) {
> + new_size = offset + len;
> + ret = inode_newsize_ok(inode, new_size);
> + if (ret)
> + goto out_mutex;
> + /*
> + * If we have a partial block after EOF we have to allocate
> + * the entire block.
> + */
> + if (partial)
> + max_blocks += 1;
> + }
> +
> + if (max_blocks > 0) {
> +
> + /* Now release the pages and zero block aligned part of pages*/
> + truncate_pagecache_range(inode, start, end - 1);
> +
> + /* Wait all existing dio workers, newcomers will block on i_mutex */
> + ext4_inode_block_unlocked_dio(inode);
> + inode_dio_wait(inode);
> +
> + /*
> + * Remove entire range from the extent status tree.
> + */
> + ret = ext4_es_remove_extent(inode, lblk, max_blocks);
> + if (ret)
> + goto out_dio;
> +
> + ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags,
> + mode);
> + if (ret)
> + goto out_dio;
> + }
> +
> + handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + ext4_std_error(inode->i_sb, ret);
> + goto out_dio;
> + }
> +
> + inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> +
> + if (!ret && new_size) {
> + if (new_size > i_size_read(inode))
> + i_size_write(inode, new_size);
> + if (new_size > EXT4_I(inode)->i_disksize)
> + ext4_update_i_disksize(inode, new_size);
> + }
> + ext4_mark_inode_dirty(handle, inode);
> +
> + /* Zero out partial block at the edges of the range */
> + ret = ext4_zero_partial_blocks(handle, inode, offset, len);
> +
> + if (file->f_flags & O_SYNC)
> + ext4_handle_sync(handle);
> +
> + ext4_journal_stop(handle);
> +out_dio:
> + ext4_inode_resume_unlocked_dio(inode);
> +out_mutex:
> + mutex_unlock(&inode->i_mutex);
> + return ret;
> +}
> +
> /*
> * preallocate space for a file. This implements ext4's fallocate file
> * operation, which gets called from sys_fallocate system call.
> @@ -4590,7 +4828,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> unsigned int blkbits = inode->i_blkbits;
>
> /* Return error if mode is not supported */
> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> + FALLOC_FL_ZERO_RANGE))
> return -EOPNOTSUPP;
>
> if (mode & FALLOC_FL_PUNCH_HOLE)
> @@ -4607,6 +4846,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> return -EOPNOTSUPP;
>
> + if (mode & FALLOC_FL_ZERO_RANGE)
> + return ext4_zero_range(file, offset, len, mode);
> +
> trace_ext4_fallocate_enter(inode, offset, len, mode);
> lblk = offset >> blkbits;
> /*
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6e39895..e64807f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -503,6 +503,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> {
> struct extent_status es;
> int retval;
> + int ret = 0;
> #ifdef ES_AGGRESSIVE_TEST
> struct ext4_map_blocks orig_map;
>
> @@ -552,7 +553,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> EXT4_GET_BLOCKS_KEEP_SIZE);
> }
> if (retval > 0) {
> - int ret;
> unsigned int status;
>
> if (unlikely(retval != map->m_len)) {
> @@ -579,7 +579,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>
> found:
> if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> - int ret = check_block_validity(inode, map);
> + ret = check_block_validity(inode, map);
> if (ret != 0)
> return ret;
> }
> @@ -596,7 +596,13 @@ found:
> * with buffer head unmapped.
> */
> if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED)
> - return retval;
> + /*
> + * If we need to convert extent to unwritten
> + * we continue and do the actual work in
> + * ext4_ext_map_blocks()
> + */
> + if (!(flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN))
> + return retval;
>
> /*
> * Here we clear m_flags because after allocating an new extent,
> @@ -652,7 +658,6 @@ found:
> ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
>
> if (retval > 0) {
> - int ret;
> unsigned int status;
>
> if (unlikely(retval != map->m_len)) {
> @@ -687,7 +692,7 @@ found:
> has_zeroout:
> up_write((&EXT4_I(inode)->i_data_sem));
> if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> - int ret = check_block_validity(inode, map);
> + ret = check_block_validity(inode, map);
> if (ret != 0)
> return ret;
> }
> @@ -3501,7 +3506,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> if (!S_ISREG(inode->i_mode))
> return -EOPNOTSUPP;
>
> - trace_ext4_punch_hole(inode, offset, length);
> + trace_ext4_punch_hole(inode, offset, length, 0);
>
> /*
> * Write out all dirty pages to avoid race conditions
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 451e020..7bb26aa 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -71,7 +71,8 @@ struct extent_status;
> #define show_falloc_mode(mode) __print_flags(mode, "|", \
> { FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \
> { FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \
> - { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"})
> + { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"}, \
> + { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"})
>
>
> TRACE_EVENT(ext4_free_inode,
> @@ -1333,7 +1334,7 @@ TRACE_EVENT(ext4_direct_IO_exit,
> __entry->rw, __entry->ret)
> );
>
> -TRACE_EVENT(ext4_fallocate_enter,
> +DECLARE_EVENT_CLASS(ext4__fallocate_mode,
> TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
>
> TP_ARGS(inode, offset, len, mode),
> @@ -1341,23 +1342,45 @@ TRACE_EVENT(ext4_fallocate_enter,
> TP_STRUCT__entry(
> __field( dev_t, dev )
> __field( ino_t, ino )
> - __field( loff_t, pos )
> - __field( loff_t, len )
> + __field( loff_t, offset )
> + __field( loff_t, len )
> __field( int, mode )
> ),
>
> TP_fast_assign(
> __entry->dev = inode->i_sb->s_dev;
> __entry->ino = inode->i_ino;
> - __entry->pos = offset;
> + __entry->offset = offset;
> __entry->len = len;
> __entry->mode = mode;
> ),
>
> - TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %s",
> + TP_printk("dev %d,%d ino %lu offset %lld len %lld mode %s",
> MAJOR(__entry->dev), MINOR(__entry->dev),
> - (unsigned long) __entry->ino, __entry->pos,
> - __entry->len, show_falloc_mode(__entry->mode))
> + (unsigned long) __entry->ino,
> + __entry->offset, __entry->len,
> + show_falloc_mode(__entry->mode))
> +);
> +
> +DEFINE_EVENT(ext4__fallocate_mode, ext4_fallocate_enter,
> +
> + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
> +
> + TP_ARGS(inode, offset, len, mode)
> +);
> +
> +DEFINE_EVENT(ext4__fallocate_mode, ext4_punch_hole,
> +
> + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
> +
> + TP_ARGS(inode, offset, len, mode)
> +);
> +
> +DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range,
> +
> + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
> +
> + TP_ARGS(inode, offset, len, mode)
> );
>
> TRACE_EVENT(ext4_fallocate_exit,
> @@ -1389,31 +1412,6 @@ TRACE_EVENT(ext4_fallocate_exit,
> __entry->ret)
> );
>
> -TRACE_EVENT(ext4_punch_hole,
> - TP_PROTO(struct inode *inode, loff_t offset, loff_t len),
> -
> - TP_ARGS(inode, offset, len),
> -
> - TP_STRUCT__entry(
> - __field( dev_t, dev )
> - __field( ino_t, ino )
> - __field( loff_t, offset )
> - __field( loff_t, len )
> - ),
> -
> - TP_fast_assign(
> - __entry->dev = inode->i_sb->s_dev;
> - __entry->ino = inode->i_ino;
> - __entry->offset = offset;
> - __entry->len = len;
> - ),
> -
> - TP_printk("dev %d,%d ino %lu offset %lld len %lld",
> - MAJOR(__entry->dev), MINOR(__entry->dev),
> - (unsigned long) __entry->ino,
> - __entry->offset, __entry->len)
> -);
> -
> TRACE_EVENT(ext4_unlink_enter,
> TP_PROTO(struct inode *parent, struct dentry *dentry),
>
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-02-27 04:41:15

by jon ernst

[permalink] [raw]
Subject: Re: [PATCH 5/6 v2] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Wed, Feb 26, 2014 at 1:00 AM, jon ernst <[email protected]> wrote:
> On Tue, Feb 25, 2014 at 2:14 PM, Lukas Czerner <[email protected]> wrote:
>> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
>> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>>
>> It can be used to convert a range of file to zeros preferably without
>> issuing data IO. Blocks should be preallocated for the regions that span
>> holes in the file, and the entire range is preferable converted to
>> unwritten extents
>>
>> This can be also used to preallocate blocks past EOF in the same way as
>> with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
>> size to remain the same.
>>
>> Also add appropriate tracepoints.
>>
>> Signed-off-by: Lukas Czerner <[email protected]>
>> ---
>> fs/ext4/ext4.h | 2 +
>> fs/ext4/extents.c | 270 +++++++++++++++++++++++++++++++++++++++++---
>> fs/ext4/inode.c | 17 ++-
>> include/trace/events/ext4.h | 64 +++++------

>> static int
>> +ext4_ext_convert_initialized_extent(handle_t *handle, struct inode *inode,
>> + struct ext4_map_blocks *map,
>> + struct ext4_ext_path *path, int flags,
>> + unsigned int allocated, ext4_fsblk_t newblock)
>> +{
>> + int ret = 0;
>> + int err = 0;
>> +
>> + /*
>> + * Make sure that the extent is no bigger than we support with
>> + * uninitialized extent
>> + */
>> + if (map->m_len > EXT_UNINIT_MAX_LEN)
>> + map->m_len = EXT_UNINIT_MAX_LEN / 2;
>
Pardon my possible dumb question. Why do you use
"EXT_UNINIT_MAX_LEN/ 2;" here instead of "EXT_UNINIT_MAX_LEN"
I don't see the reason why we can't use EXT_UNINIT_MAX_LEN here.


(resend, Ping on this question, thank you!)

Thanks!
Jon

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-02-27 11:56:54

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 5/6 v2] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Wed, 26 Feb 2014, jon ernst wrote:

> Date: Wed, 26 Feb 2014 23:41:15 -0500
> From: jon ernst <[email protected]>
> To: Lukas Czerner <[email protected]>
> Cc: "[email protected] List" <[email protected]>,
> Theodore Ts'o <[email protected]>, [email protected],
> [email protected]
> Subject: Re: [PATCH 5/6 v2] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> On Wed, Feb 26, 2014 at 1:00 AM, jon ernst <[email protected]> wrote:
> > On Tue, Feb 25, 2014 at 2:14 PM, Lukas Czerner <[email protected]> wrote:
> >> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> >> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> >>
> >> It can be used to convert a range of file to zeros preferably without
> >> issuing data IO. Blocks should be preallocated for the regions that span
> >> holes in the file, and the entire range is preferable converted to
> >> unwritten extents
> >>
> >> This can be also used to preallocate blocks past EOF in the same way as
> >> with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> >> size to remain the same.
> >>
> >> Also add appropriate tracepoints.
> >>
> >> Signed-off-by: Lukas Czerner <[email protected]>
> >> ---
> >> fs/ext4/ext4.h | 2 +
> >> fs/ext4/extents.c | 270 +++++++++++++++++++++++++++++++++++++++++---
> >> fs/ext4/inode.c | 17 ++-
> >> include/trace/events/ext4.h | 64 +++++------
>
> >> static int
> >> +ext4_ext_convert_initialized_extent(handle_t *handle, struct inode *inode,
> >> + struct ext4_map_blocks *map,
> >> + struct ext4_ext_path *path, int flags,
> >> + unsigned int allocated, ext4_fsblk_t newblock)
> >> +{
> >> + int ret = 0;
> >> + int err = 0;
> >> +
> >> + /*
> >> + * Make sure that the extent is no bigger than we support with
> >> + * uninitialized extent
> >> + */
> >> + if (map->m_len > EXT_UNINIT_MAX_LEN)
> >> + map->m_len = EXT_UNINIT_MAX_LEN / 2;
> >
> Pardon my possible dumb question. Why do you use
> "EXT_UNINIT_MAX_LEN/ 2;" here instead of "EXT_UNINIT_MAX_LEN"
> I don't see the reason why we can't use EXT_UNINIT_MAX_LEN here.
>
>
> (resend, Ping on this question, thank you!)

Wow, that's an early ping :) I am sorry to disappoint you, my answer
is not going to be that exciting :)


Yes, we can just use EXT_UNINIT_MAX_LEN here. But
EXT_UNINIT_MAX_LEN/2 would make it much more evenly spread out.

I do not think there is any real world advantage to this and the
behaviour should be the same in both cases.

Thanks!
-Lukas


>
> Thanks!
> Jon
>

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-03-13 08:49:39

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Tue, Feb 25, 2014 at 08:14:33PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of file to zeros preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferable converted to
> unwritten extents - even though file system may choose to zero out the
> extent or do whatever which will result in reading zeros from the range
> while the range remains allocated for the file.
>
> This can be also used to preallocate blocks past EOF in the same way as
> with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> size to remain the same.
>
> You can test this feature yourself using xfstests, of fallocate(1) however
> you'll need patches for util_linux, xfsprogs and xfstests which are going to
> follow soon.
>
> I tested this mostly with a subset of xfstests using fsx and fsstress and
> even with new generic/290 which is just a copy of xfs/290 using fzero
> command for xfs_io instead of zero (which uses ioctl). I was testing on
> x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
>
> ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083
> generic/117 generic/068 generic/231 generic/127 generic/091 generic/075
> generic/112 generic/263 generic/091 generic/075 generic/256 generic/255
> generic/316 generic/300 generic/290 ext4/242;
>
> Note that there is a work in progress on FALLOC_FL_COLLAPSE_RANGE which
> touches the same area as this pach set does, so we should figure out
> which one should go first and modify the other on top of it.
>
> This has been based on top of xfs-collapse-range so it does not contain ext4
> collapse range changes.

Lukas, I have merged patches 4 and 6 into the xfs-collapse-branch.
That branch should now remain stable, so you can rebase all the ext4
collapse range and zero range bits on top of tha branch and Ted can
pull it into the ext4 tree if he wants to pull it in.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-03-13 10:14:59

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Thu, 13 Mar 2014, Dave Chinner wrote:

> Date: Thu, 13 Mar 2014 19:49:36 +1100
> From: Dave Chinner <[email protected]>
> To: Lukas Czerner <[email protected]>
> Cc: [email protected], [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Tue, Feb 25, 2014 at 08:14:33PM +0100, Lukas Czerner wrote:
> > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> >
> > It can be used to convert a range of file to zeros preferably without
> > issuing data IO. Blocks should be preallocated for the regions that span
> > holes in the file, and the entire range is preferable converted to
> > unwritten extents - even though file system may choose to zero out the
> > extent or do whatever which will result in reading zeros from the range
> > while the range remains allocated for the file.
> >
> > This can be also used to preallocate blocks past EOF in the same way as
> > with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> > size to remain the same.
> >
> > You can test this feature yourself using xfstests, of fallocate(1) however
> > you'll need patches for util_linux, xfsprogs and xfstests which are going to
> > follow soon.
> >
> > I tested this mostly with a subset of xfstests using fsx and fsstress and
> > even with new generic/290 which is just a copy of xfs/290 using fzero
> > command for xfs_io instead of zero (which uses ioctl). I was testing on
> > x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
> >
> > ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083
> > generic/117 generic/068 generic/231 generic/127 generic/091 generic/075
> > generic/112 generic/263 generic/091 generic/075 generic/256 generic/255
> > generic/316 generic/300 generic/290 ext4/242;
> >
> > Note that there is a work in progress on FALLOC_FL_COLLAPSE_RANGE which
> > touches the same area as this pach set does, so we should figure out
> > which one should go first and modify the other on top of it.
> >
> > This has been based on top of xfs-collapse-range so it does not contain ext4
> > collapse range changes.
>
> Lukas, I have merged patches 4 and 6 into the xfs-collapse-branch.
> That branch should now remain stable, so you can rebase all the ext4
> collapse range and zero range bits on top of tha branch and Ted can
> pull it into the ext4 tree if he wants to pull it in.
>
> Cheers,
>
> Dave.

Hi Dave,

that's great thanks! Btw you probably meant branch
xfs-collapse-range and those patches are already based on that
branch so it should apply cleanly. Now we have to figure out the
order in which we're going to take zero-range and collapse-range
into ext4.

Thanks!
-Lukas

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-03-16 03:27:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/6 v2] ext4: Update inode i_size after the preallocation

On Tue, Feb 25, 2014 at 08:14:34PM +0100, Lukas Czerner wrote:
> Currently in ext4_fallocate we would update inode size, c_time and sync
> the file with every partial allocation which is entirely unnecessary. It
> is true that if the crash happens in the middle of truncate we might end
> up with unchanged i size, or c_time which I do not think is really a
> problem - it does not mean file system corruption in any way. Note that
> xfs is doing things the same way e.g. update all of the mentioned after
> the allocation is done.
>
> This commit moves all the updates after the allocation is done. In
> addition we also need to change m_time as not only inode has been change
> bot also data regions might have changed (unwritten extents). However
> m_time will be only updated when i_size changed.
>
> Also we do not need to be paranoid about changing the c_time only if the
> actual allocation have happened, we can change it even if we try to
> allocate only to find out that there are already block allocated. It's
> not really a big deal and it will save us some additional complexity.
>
> Also use ext4_debug, instead of ext4_warning in #ifdef EXT4FS_DEBUG
> section.
>
> Signed-off-by: Lukas Czerner <[email protected]>

Thanks, applied.

- Ted

2014-03-16 03:28:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/6 v2] ext4: refactor ext4_fallocate code

On Tue, Feb 25, 2014 at 08:14:35PM +0100, Lukas Czerner wrote:
> Move block allocation out of the ext4_fallocate into separate function
> called ext4_alloc_file_blocks(). This will allow us to use the same
> allocation code for other allocation operations such as zero range which
> is commit in the next patch.
>
> Signed-off-by: Lukas Czerner <[email protected]>

Thanks, applied.

- Ted

2014-03-16 04:13:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 5/6 v2] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Tue, Feb 25, 2014 at 08:14:38PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of file to zeros preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferable converted to
> unwritten extents
>
> This can be also used to preallocate blocks past EOF in the same way as
> with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> size to remain the same.
>
> Also add appropriate tracepoints.
>
> Signed-off-by: Lukas Czerner <[email protected]>

Thanks, applied.

- Ted

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2014-03-16 19:08:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Tue, Feb 25, 2014 at 08:14:33PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

Hi Lukas,

I've been trying to merge these patches into the ext4 tree, and I'm
running into a large number of test failures. Could you take a quick
eyeball and see if there's anything obvious wrong?

The "dev2" branch in the ext4 git tree has all the patches in this
series (1, 2, and 5) --- 3 was included earlier applied on top of the
"dev" branch. The "test" branch is the dev2 branch with the
xfs-collapse-range branch pulled in, which actually enables the
ZERO_RANGE flags (as well as the collapse range patches).

I have tested the "dev" branch both standalone and with the
xfs-collapse-range branch pulled in, and things seem to be pretty
stable. I'm doing more comprehensive testing now to confirm things.
(I'm using xfstests commit id 3948694eb12 which has the latest tests
to exercise the zero_range codepath.)

When I tried testing with the "test" branch, things failed pretty
quickly. I've attached two of these in this patch set. I'm guessing
it's some kind of memory corruption problem. These failures are
pretty repeatable, and it fails fast.

If I try running the "dev2" branch, without the xfstests collapse
range branch pulled in, things are much better (so there's clearly a
bug in the ZERO_RANGE code path), but there was still a few more
errors than the baseline. I'm rerunning those tests so I can be sure
that the results are repeatable.

I suspect the problem is that something in the dev branch isn't
playing well with your patches, or I screwed up while fixing up some
merge conflicts -- but the merge conflicts were pretty minimal, so
that seems a bit unlikely.

Anyway, if you could take a look, I'd really appreciate it. Thanks!!

- Ted


Attachments:
(No filename) (1.85 kB)
test-fail.tar.gz (33.53 kB)
Download all attachments

2014-03-17 02:19:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Sun, Mar 16, 2014 at 03:08:20PM -0400, [email protected] wrote:
> If I try running the "dev2" branch, without the xfstests collapse
> range branch pulled in, things are much better (so there's clearly a
> bug in the ZERO_RANGE code path), but there was still a few more
> errors than the baseline. I'm rerunning those tests so I can be sure
> that the results are repeatable.

Running the tests with the dev2 branch, which includes all of the
ext4-specific ZERO_RANGE patches, we see a regression with shared/243
with 4k and 1k block sizes (as well as 4k in no-journal mode):

shared/243 3s ... [15:09:34] [15:09:35] [failed, exit status 1] - output mismatch (see /results/results-4k/shared/243.out.bad)
--- tests/shared/243.out 2014-03-15 13:45:11.000000000 -0400
+++ /results/results-4k/shared/243.out.bad 2014-03-16 15:09:35.470862837 -0400
@@ -1,13 +1,3 @@
QA output created by 243
wrote 4096/4096 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 4096/4096 bytes at offset 0
-XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 40960/40960 bytes at offset 0
-XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
...
(Run 'diff -u tests/shared/243.out /results/results-4k/shared/243.out.bad' to see the entire diff)

- Ted

2014-03-17 03:02:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/6 v2] ext4: Update inode i_size after the preallocation

On Tue, Feb 25, 2014 at 08:14:34PM +0100, Lukas Czerner wrote:
> Currently in ext4_fallocate we would update inode size, c_time and sync
> the file with every partial allocation which is entirely unnecessary. It
> is true that if the crash happens in the middle of truncate we might end
> up with unchanged i size, or c_time which I do not think is really a
> problem - it does not mean file system corruption in any way. Note that
> xfs is doing things the same way e.g. update all of the mentioned after
> the allocation is done.
>
> This commit moves all the updates after the allocation is done. In
> addition we also need to change m_time as not only inode has been change
> bot also data regions might have changed (unwritten extents). However
> m_time will be only updated when i_size changed.
>
> Also we do not need to be paranoid about changing the c_time only if the
> actual allocation have happened, we can change it even if we try to
> allocate only to find out that there are already block allocated. It's
> not really a big deal and it will save us some additional complexity.
>
> Also use ext4_debug, instead of ext4_warning in #ifdef EXT4FS_DEBUG
> section.
>
> Signed-off-by: Lukas Czerner <[email protected]>

Further testing has shown that this patch (applied on top of the ext4
dev branch) is causing a regression failure of xfstests shared/243.

Could you take a look?

- Ted

2014-03-17 10:48:30

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 1/6 v2] ext4: Update inode i_size after the preallocation

On Sun, 16 Mar 2014, [email protected] wrote:

> Date: Sun, 16 Mar 2014 23:02:01 -0400
> From: [email protected]
> To: Lukas Czerner <[email protected]>
> Cc: [email protected], [email protected], [email protected]
> Subject: Re: [PATCH 1/6 v2] ext4: Update inode i_size after the preallocation
>
> On Tue, Feb 25, 2014 at 08:14:34PM +0100, Lukas Czerner wrote:
> > Currently in ext4_fallocate we would update inode size, c_time and sync
> > the file with every partial allocation which is entirely unnecessary. It
> > is true that if the crash happens in the middle of truncate we might end
> > up with unchanged i size, or c_time which I do not think is really a
> > problem - it does not mean file system corruption in any way. Note that
> > xfs is doing things the same way e.g. update all of the mentioned after
> > the allocation is done.
> >
> > This commit moves all the updates after the allocation is done. In
> > addition we also need to change m_time as not only inode has been change
> > bot also data regions might have changed (unwritten extents). However
> > m_time will be only updated when i_size changed.
> >
> > Also we do not need to be paranoid about changing the c_time only if the
> > actual allocation have happened, we can change it even if we try to
> > allocate only to find out that there are already block allocated. It's
> > not really a big deal and it will save us some additional complexity.
> >
> > Also use ext4_debug, instead of ext4_warning in #ifdef EXT4FS_DEBUG
> > section.
> >
> > Signed-off-by: Lukas Czerner <[email protected]>
>
> Further testing has shown that this patch (applied on top of the ext4
> dev branch) is causing a regression failure of xfstests shared/243.
>
> Could you take a look?
>
> - Ted

Hi Ted,

I am looking into this. Thanks!

-Lukas

2014-03-17 12:50:53

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Sun, 16 Mar 2014, [email protected] wrote:

> Date: Sun, 16 Mar 2014 22:19:09 -0400
> From: [email protected]
> To: Lukas Czerner <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Sun, Mar 16, 2014 at 03:08:20PM -0400, [email protected] wrote:
> > If I try running the "dev2" branch, without the xfstests collapse
> > range branch pulled in, things are much better (so there's clearly a
> > bug in the ZERO_RANGE code path), but there was still a few more
> > errors than the baseline. I'm rerunning those tests so I can be sure
> > that the results are repeatable.
>
> Running the tests with the dev2 branch, which includes all of the
> ext4-specific ZERO_RANGE patches, we see a regression with shared/243
> with 4k and 1k block sizes (as well as 4k in no-journal mode):

Oh, right. This fails because the test really should be deprecated
since we already removed the check in e2fsck - see e2fsprogs commit
010dc7b90d97b93907cbf57b3b44f1c1cad234f6.

In this patch I removed setting the EXT4_INODE_EOFBLOCKS, however I
forgot the mention that in the description. Sorry about that.

Now I am not sure how we want to handle this. Either having it as a
part of this patch just update description, or having the
EXT4_INODE_EOFBLOCKS removal from the fallocate as a separate patch,
which probably make more sense to me. What would you prefer ?

Thanks!
-Lukas

>
> shared/243 3s ... [15:09:34] [15:09:35] [failed, exit status 1] - output mismatch (see /results/results-4k/shared/243.out.bad)
> --- tests/shared/243.out 2014-03-15 13:45:11.000000000 -0400
> +++ /results/results-4k/shared/243.out.bad 2014-03-16 15:09:35.470862837 -0400
> @@ -1,13 +1,3 @@
> QA output created by 243
> wrote 4096/4096 bytes at offset 0
> XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -wrote 4096/4096 bytes at offset 0
> -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -wrote 40960/40960 bytes at offset 0
> -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> ...
> (Run 'diff -u tests/shared/243.out /results/results-4k/shared/243.out.bad' to see the entire diff)
>
> - Ted
>

2014-03-17 12:59:45

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Sun, 16 Mar 2014, [email protected] wrote:

> Date: Sun, 16 Mar 2014 15:08:20 -0400
> From: [email protected]
> To: Lukas Czerner <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Tue, Feb 25, 2014 at 08:14:33PM +0100, Lukas Czerner wrote:
> > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> Hi Lukas,
>
> I've been trying to merge these patches into the ext4 tree, and I'm
> running into a large number of test failures. Could you take a quick
> eyeball and see if there's anything obvious wrong?
>
> The "dev2" branch in the ext4 git tree has all the patches in this
> series (1, 2, and 5) --- 3 was included earlier applied on top of the
> "dev" branch. The "test" branch is the dev2 branch with the
> xfs-collapse-range branch pulled in, which actually enables the
> ZERO_RANGE flags (as well as the collapse range patches).
>
> I have tested the "dev" branch both standalone and with the
> xfs-collapse-range branch pulled in, and things seem to be pretty
> stable. I'm doing more comprehensive testing now to confirm things.
> (I'm using xfstests commit id 3948694eb12 which has the latest tests
> to exercise the zero_range codepath.)
>
> When I tried testing with the "test" branch, things failed pretty
> quickly. I've attached two of these in this patch set. I'm guessing
> it's some kind of memory corruption problem. These failures are
> pretty repeatable, and it fails fast.
>
> If I try running the "dev2" branch, without the xfstests collapse
> range branch pulled in, things are much better (so there's clearly a
> bug in the ZERO_RANGE code path), but there was still a few more
> errors than the baseline. I'm rerunning those tests so I can be sure
> that the results are repeatable.
>
> I suspect the problem is that something in the dev branch isn't
> playing well with your patches, or I screwed up while fixing up some
> merge conflicts -- but the merge conflicts were pretty minimal, so
> that seems a bit unlikely.
>
> Anyway, if you could take a look, I'd really appreciate it. Thanks!!
>
> - Ted

That's definitely very weird and I have not seen that before. i am
looking into this right not.

Thanks!
-Lukas

>
>

2014-03-17 13:29:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Mon, Mar 17, 2014 at 01:50:46PM +0100, Lukáš Czerner wrote:
> > Running the tests with the dev2 branch, which includes all of the
> > ext4-specific ZERO_RANGE patches, we see a regression with shared/243
> > with 4k and 1k block sizes (as well as 4k in no-journal mode):
>
> Oh, right. This fails because the test really should be deprecated
> since we already removed the check in e2fsck - see e2fsprogs commit
> 010dc7b90d97b93907cbf57b3b44f1c1cad234f6.
>
> In this patch I removed setting the EXT4_INODE_EOFBLOCKS, however I
> forgot the mention that in the description. Sorry about that.

I'm not sure how removing setting the EXT4_INODE_EOFBLOCKS flag would
result in shared/243 failing in this particular way. The EOFBLOCKS
flag never influenced how the userspace-visible behavior of the
kernel; it only set a flag which told e2fsck that it was OK to have
blocks mapped beyond i_size.

So removing EOFBLOCKS could potentially cause false positives by
e2fsck for e2fsprogs previous to 1.42.2 (or which do not have the
above mentioned commit pulled on). That's the main reason to keep
support for setting EOFBLOCKS in the kernel --- to avoid causing user
help desk reports if they try using a newer kernel w/o updating the
version of e2fsprogs on their enterprise kernel distro (not that users
_ever_ upgrade the kernel on their own ;-).

But I'm not sure how this would cause the xfstests failure detailed
below? And how would just updating the commit description deal with
the fact that shared/243 is failing?

- Ted

2014-03-17 15:11:08

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Mon, 17 Mar 2014, [email protected] wrote:

> Date: Mon, 17 Mar 2014 09:29:01 -0400
> From: [email protected]
> To: =?utf-8?B?THVrw6HFoSBDemVybmVyIDxsY3plcm5lckByZWRoYXQuY29tPg==?=@thunk.org
> Cc: [email protected]
> Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Mon, Mar 17, 2014 at 01:50:46PM +0100, Lukáš Czerner wrote:
> > > Running the tests with the dev2 branch, which includes all of the
> > > ext4-specific ZERO_RANGE patches, we see a regression with shared/243
> > > with 4k and 1k block sizes (as well as 4k in no-journal mode):
> >
> > Oh, right. This fails because the test really should be deprecated
> > since we already removed the check in e2fsck - see e2fsprogs commit
> > 010dc7b90d97b93907cbf57b3b44f1c1cad234f6.
> >
> > In this patch I removed setting the EXT4_INODE_EOFBLOCKS, however I
> > forgot the mention that in the description. Sorry about that.
>
> I'm not sure how removing setting the EXT4_INODE_EOFBLOCKS flag would
> result in shared/243 failing in this particular way. The EOFBLOCKS
> flag never influenced how the userspace-visible behavior of the
> kernel; it only set a flag which told e2fsck that it was OK to have
> blocks mapped beyond i_size.

Yes, shared/243 is actually looking at the file system using debugfs
trying to figure out whether this flag is set or not. And that is
what is failing.

See the 243.full output:

Test 1: Fallocate 40960 bytes and write 4096 bytes (buffered io).
EOFBLOCK_FL bit is not set.
Error: Current bit state incorrect.

However I am not really sure why is this in shared since this is
only useful for ext4.

>
> So removing EOFBLOCKS could potentially cause false positives by
> e2fsck for e2fsprogs previous to 1.42.2 (or which do not have the
> above mentioned commit pulled on). That's the main reason to keep
> support for setting EOFBLOCKS in the kernel --- to avoid causing user
> help desk reports if they try using a newer kernel w/o updating the
> version of e2fsprogs on their enterprise kernel distro (not that users
> _ever_ upgrade the kernel on their own ;-).

I believe that we're not very consistent with that anyway. That was
part of the reason why we got rid of it in e2fsck. However I agree
that this might cause additional problems. So it might be better to
just keep this in kernel for now...

>
> But I'm not sure how this would cause the xfstests failure detailed
> below? And how would just updating the commit description deal with
> the fact that shared/243 is failing?

See above. If we decide to remote the EXT4_INODE_EOFBLOCKS from
kernel then this tests has to go as well.

Btw. I am not able to call in for the ext4 call today, sorry.

Thanks!
-Lukas

>
> - Ted
>

2014-03-17 20:57:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Mon, Mar 17, 2014 at 04:10:52PM +0100, Lukáš Czerner wrote:
> Yes, shared/243 is actually looking at the file system using debugfs
> trying to figure out whether this flag is set or not. And that is
> what is failing.
>
> See the 243.full output:
>
> Test 1: Fallocate 40960 bytes and write 4096 bytes (buffered io).
> EOFBLOCK_FL bit is not set.
> Error: Current bit state incorrect.
>
> However I am not really sure why is this in shared since this is
> only useful for ext4.

Ah, I agree, it's confusing/wrong for this test to be in shared/243;
we should probably rename it to ext4/243 for now.

> I believe that we're not very consistent with that anyway. That was
> part of the reason why we got rid of it in e2fsck. However I agree
> that this might cause additional problems. So it might be better to
> just keep this in kernel for now...

Yeah, there are some really otherwise hard to solve races were we
might not set EOFBLOCK_FL. We kept tripping over them when running
xfstests, but it was otherwise pretty hard to hit.

Perhaps we should set a date, say in another two years, when we make
this go away.

- Ted

2014-03-17 21:00:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Mon, Mar 17, 2014 at 01:59:35PM +0100, Lukáš Czerner wrote:
> > The "dev2" branch in the ext4 git tree has all the patches in this
> > series (1, 2, and 5) --- 3 was included earlier applied on top of the
> > "dev" branch. The "test" branch is the dev2 branch with the
> > xfs-collapse-range branch pulled in, which actually enables the
> > ZERO_RANGE flags (as well as the collapse range patches).
> >
> > When I tried testing with the "test" branch, things failed pretty
> > quickly. I've attached two of these in this patch set. I'm guessing
> > it's some kind of memory corruption problem. These failures are
> > pretty repeatable, and it fails fast.

Can you let me know when you've had a chance to take a look at this
failure? Once we put back the EOFBLOCK_FL code, I suspect we should
be fine with the first two patches in this patch series, but patch #5
plus the xfs-collapse-range branch appears to still have issues.

Thanks!!

- Ted

2014-03-18 09:38:07

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Mon, 17 Mar 2014, [email protected] wrote:

> Date: Mon, 17 Mar 2014 17:00:30 -0400
> From: [email protected]
> To: =?utf-8?B?THVrw6HFoSBDemVybmVyIDxsY3plcm5lckByZWRoYXQuY29tPg==?=@thunk.org
> Cc: [email protected]
> Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Mon, Mar 17, 2014 at 01:59:35PM +0100, Lukáš Czerner wrote:
> > > The "dev2" branch in the ext4 git tree has all the patches in this
> > > series (1, 2, and 5) --- 3 was included earlier applied on top of the
> > > "dev" branch. The "test" branch is the dev2 branch with the
> > > xfs-collapse-range branch pulled in, which actually enables the
> > > ZERO_RANGE flags (as well as the collapse range patches).
> > >
> > > When I tried testing with the "test" branch, things failed pretty
> > > quickly. I've attached two of these in this patch set. I'm guessing
> > > it's some kind of memory corruption problem. These failures are
> > > pretty repeatable, and it fails fast.
>
> Can you let me know when you've had a chance to take a look at this
> failure? Once we put back the EOFBLOCK_FL code, I suspect we should
> be fine with the first two patches in this patch series, but patch #5
> plus the xfs-collapse-range branch appears to still have issues.
>
> Thanks!!
>
> - Ted

Yes, I am looking into this since yesterday, but unfortunately I
have not found a problem yet. I am still trying to figure it out.

-Lukas

2014-03-18 11:37:55

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Mon, 17 Mar 2014, [email protected] wrote:

> Date: Mon, 17 Mar 2014 17:00:30 -0400
> From: [email protected]
> To: =?utf-8?B?THVrw6HFoSBDemVybmVyIDxsY3plcm5lckByZWRoYXQuY29tPg==?=@thunk.org
> Cc: [email protected]
> Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Mon, Mar 17, 2014 at 01:59:35PM +0100, Lukáš Czerner wrote:
> > > The "dev2" branch in the ext4 git tree has all the patches in this
> > > series (1, 2, and 5) --- 3 was included earlier applied on top of the
> > > "dev" branch. The "test" branch is the dev2 branch with the
> > > xfs-collapse-range branch pulled in, which actually enables the
> > > ZERO_RANGE flags (as well as the collapse range patches).
> > >
> > > When I tried testing with the "test" branch, things failed pretty
> > > quickly. I've attached two of these in this patch set. I'm guessing
> > > it's some kind of memory corruption problem. These failures are
> > > pretty repeatable, and it fails fast.
>
> Can you let me know when you've had a chance to take a look at this
> failure? Once we put back the EOFBLOCK_FL code, I suspect we should
> be fine with the first two patches in this patch series, but patch #5
> plus the xfs-collapse-range branch appears to still have issues.
>
> Thanks!!
>
> - Ted
>

Ok, finally I got it. The problem is that we now have commit

97d39798f77aef626130db8590cc79195300227b ext4: delete path dealloc
code in ext4_ext_handle_uninitialized_extents

which I was not aware of before. And when merging you have used the
same out2 label out of the function. However when creating my new
function ext4_ext_convert_initialized_exten() so I've done the same
thing as with ext4_ext_handle_uninitialized_extents() and freed the
path. And since we do not set path to NULL in ext4_ext_map_blocks
after calling ext4_ext_convert_initialized_extent() when we hit the
condition at the out2:

if (path) {
ext4_ext_drop_refs(path);
kfree(path);
}

we will double-free possibly destroying data from someone else. That
is why we've seen what looked like a random memory corruption.

I'll resend the patches including the EOFBLOCK_FL fixes.

Thanks!
-Lukas

2014-03-18 12:39:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Tue, Mar 18, 2014 at 12:37:47PM +0100, Lukáš Czerner wrote:
> Ok, finally I got it. The problem is that we now have commit
>
> 97d39798f77aef626130db8590cc79195300227b ext4: delete path dealloc
> code in ext4_ext_handle_uninitialized_extents
>
> which I was not aware of before. And when merging you have used the
> same out2 label out of the function. However when creating my new
> function ext4_ext_convert_initialized_exten() so I've done the same
> thing as with ext4_ext_handle_uninitialized_extents() and freed the
> path. And since we do not set path to NULL in ext4_ext_map_blocks
> after calling ext4_ext_convert_initialized_extent() when we hit the
> condition at the out2:
>
> if (path) {
> ext4_ext_drop_refs(path);
> kfree(path);
> }
>
> we will double-free possibly destroying data from someone else. That
> is why we've seen what looked like a random memory corruption.

My bad! I remember noticing that particular semantic conflict, and I
*thought* I had fixed it up. The fixup must have gotten lost when I
was doing some patch wrangling (I was moving aronud some patch hunks
around to be the most logical with respect to the COLLAPSE RANGE, and
I must have dropped the fixup somewhere along the way).

Thanks for finding it!

- Ted

2014-03-18 12:53:30

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate

On Tue, 18 Mar 2014, [email protected] wrote:

> Date: Tue, 18 Mar 2014 08:39:19 -0400
> From: [email protected]
> To: =?utf-8?B?THVrw6HFoSBDemVybmVyIDxsY3plcm5lckByZWRoYXQuY29tPg==?=@thunk.org
> Cc: [email protected]
> Subject: Re: [PATCH 0/6 v2] Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Tue, Mar 18, 2014 at 12:37:47PM +0100, Lukáš Czerner wrote:
> > Ok, finally I got it. The problem is that we now have commit
> >
> > 97d39798f77aef626130db8590cc79195300227b ext4: delete path dealloc
> > code in ext4_ext_handle_uninitialized_extents
> >
> > which I was not aware of before. And when merging you have used the
> > same out2 label out of the function. However when creating my new
> > function ext4_ext_convert_initialized_exten() so I've done the same
> > thing as with ext4_ext_handle_uninitialized_extents() and freed the
> > path. And since we do not set path to NULL in ext4_ext_map_blocks
> > after calling ext4_ext_convert_initialized_extent() when we hit the
> > condition at the out2:
> >
> > if (path) {
> > ext4_ext_drop_refs(path);
> > kfree(path);
> > }
> >
> > we will double-free possibly destroying data from someone else. That
> > is why we've seen what looked like a random memory corruption.
>
> My bad! I remember noticing that particular semantic conflict, and I
> *thought* I had fixed it up. The fixup must have gotten lost when I
> was doing some patch wrangling (I was moving aronud some patch hunks
> around to be the most logical with respect to the COLLAPSE RANGE, and
> I must have dropped the fixup somewhere along the way).
>
> Thanks for finding it!
>
> - Ted
>

No problem. I am running some tests right now to make sure that
everything is in order and will send out patches 1, 2, and 5 once
again in a separate patch set based on ext4/dev branch.

Thanks!
-Lukas