Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
It can be used to convert a range of a file to zeros, preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferably converted to
unwritten extents, although the file system may choose to zero out the
extent or do anything else that results in reading zeros from the range
while the range remains allocated for the file.
This can also be used to preallocate blocks past EOF in the same way as
with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
size to remain the same.
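For illustration, here is a minimal user-space sketch of how a caller might use the flag. The helper name zero_file_range() is hypothetical and not part of this series. It takes FALLOC_FL_ZERO_RANGE from whatever the installed headers define rather than hard-coding a value, and it falls back to writing zeros when the headers or the file system do not know the flag.

```c
#define _GNU_SOURCE             /* exposes fallocate() in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_* flags, if the headers have them */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: make [offset, offset + len) of fd read back as zeros.
 * Tries FALLOC_FL_ZERO_RANGE first and falls back to plain writes when the
 * flag is missing from the headers or rejected as unsupported.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int zero_file_range(int fd, off_t offset, off_t len)
{
	static const char zeros[4096];

	if (offset < 0 || len <= 0) {
		errno = EINVAL;
		return -1;
	}
#ifdef FALLOC_FL_ZERO_RANGE
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) == 0)
		return 0;
	if (errno != EOPNOTSUPP && errno != ENOSYS)
		return -1;
#endif
	/* Fallback: issue the data IO that the flag is meant to avoid. */
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
			       (size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, offset);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) {
			errno = EIO;
			return -1;
		}
		assert((size_t)n <= chunk);
		offset += n;
		len -= n;
	}
	return 0;
}
```

Either path leaves the range allocated and reading back as zeros; only the fallocate path can avoid the data IO.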
You can test this feature yourself using xfstests or fallocate(1), however
you'll need patches for util-linux, xfsprogs and xfstests which you
can find here:
http://people.redhat.com/lczerner/zero_range/
I'll post the patches after we agree and merge the kernel functionality.
I tested this mostly with a subset of xfstests using fsx and fsstress, and
also with the new generic/290, which is just a copy of xfs/290 using the fzero
command for xfs_io instead of zero (which uses the ioctl). I was testing on
x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083 generic/117 generic/068 generic/231 generic/127 generic/091 generic/075 generic/112 generic/263 generic/091 generic/075 generic/256 generic/255 generic/316 generic/300 generic/290;
Note that there is work in progress on FALLOC_FL_COLLAPSE_RANGE which
touches the same area as this patch set does, so we should figure out
which one should go first and modify the other on top of it.
Thanks!
-Lukas
--
[PATCH 1/6] ext4: Update inode i_size after the preallocation
[PATCH 2/6] ext4: refactor ext4_fallocate code
[PATCH 3/6] ext4: translate fallocate mode bits to strings
[PATCH 4/6] fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
[PATCH 5/6] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
[PATCH 6/6] xfs: Add support for FALLOC_FL_ZERO_RANGE
fs/ext4/ext4.h | 3 +
fs/ext4/extents.c | 430 ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
fs/ext4/inode.c | 17 ++-
fs/open.c | 7 +-
fs/xfs/xfs_file.c | 10 +-
include/trace/events/ext4.h | 67 ++++++-----
include/uapi/linux/falloc.h | 1 +
7 files changed, 393 insertions(+), 142 deletions(-)
Currently in ext4_fallocate we update the inode size and c_time and sync
the file with every partial allocation, which is entirely unnecessary. It
is true that if a crash happens in the middle of truncate we might end
up with an unchanged i_size or c_time, which I do not think is really a
problem - it does not mean file system corruption in any way. Note that
xfs does things the same way, i.e. it updates all of the above after
the allocation is done.
This commit moves all the updates after the allocation is done. In
addition we also need to change m_time, as not only the inode has been
changed but data regions might have changed as well (unwritten extents).
Also we do not need to be paranoid about changing c_time and m_time only
if an actual allocation has happened; we can change them even if we try to
allocate only to find out that the blocks are already allocated. It's
not really a big deal and it will save us some additional complexity.
Also use ext4_debug instead of ext4_warning in the #ifdef EXT4FS_DEBUG
section.
Signed-off-by: Lukas Czerner <[email protected]>
---
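To make the intended end state concrete, here is a small stand-alone sketch of the size decision that is now taken once, after the whole allocation has finished. The helper size_after_falloc() and the KEEP_SIZE macro are hypothetical stand-ins for illustration; the real code works on the inode under i_mutex.

```c
#include <assert.h>
#include <stdint.h>

#define KEEP_SIZE 0x01	/* stand-in mirroring FALLOC_FL_KEEP_SIZE */

/*
 * Hypothetical helper: file size after a successful preallocation of
 * [offset, offset + len). The size only grows, and only when KEEP_SIZE
 * is not set and the range ends past the current i_size.
 */
static int64_t size_after_falloc(int mode, int64_t offset, int64_t len,
				 int64_t i_size)
{
	assert(offset >= 0 && len > 0 && i_size >= 0);

	if (!(mode & KEEP_SIZE) && offset + len > i_size)
		return offset + len;
	return i_size;
}
```

With a 4096 byte file, preallocating 8192 bytes from offset 0 grows the file to 8192 bytes, while the same call with KEEP_SIZE leaves it at 4096.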
fs/ext4/extents.c | 86 +++++++++++++++++++++----------------------------------
1 file changed, 32 insertions(+), 54 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10cff47..6a52851 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4513,36 +4513,6 @@ retry:
ext4_std_error(inode->i_sb, err);
}
-static void ext4_falloc_update_inode(struct inode *inode,
- int mode, loff_t new_size, int update_ctime)
-{
- struct timespec now;
-
- if (update_ctime) {
- now = current_fs_time(inode->i_sb);
- if (!timespec_equal(&inode->i_ctime, &now))
- inode->i_ctime = now;
- }
- /*
- * Update only when preallocation was requested beyond
- * the file size.
- */
- if (!(mode & FALLOC_FL_KEEP_SIZE)) {
- if (new_size > i_size_read(inode))
- i_size_write(inode, new_size);
- if (new_size > EXT4_I(inode)->i_disksize)
- ext4_update_i_disksize(inode, new_size);
- } else {
- /*
- * Mark that we allocate beyond EOF so the subsequent truncate
- * can proceed even if the new size is the same as i_size.
- */
- if (new_size > i_size_read(inode))
- ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
- }
-
-}
-
/*
* preallocate space for a file. This implements ext4's fallocate file
* operation, which gets called from sys_fallocate system call.
@@ -4554,7 +4524,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
{
struct inode *inode = file_inode(file);
handle_t *handle;
- loff_t new_size;
+ loff_t new_size = 0;
unsigned int max_blocks;
int ret = 0;
int ret2 = 0;
@@ -4594,12 +4564,15 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
*/
credits = ext4_chunk_trans_blocks(inode, max_blocks);
mutex_lock(&inode->i_mutex);
- ret = inode_newsize_ok(inode, (len + offset));
- if (ret) {
- mutex_unlock(&inode->i_mutex);
- trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
- return ret;
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ offset + len > i_size_read(inode)) {
+ new_size = offset + len;
+ ret = inode_newsize_ok(inode, new_size);
+ if (ret)
+ goto out;
}
+
flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
@@ -4623,28 +4596,14 @@ retry:
}
ret = ext4_map_blocks(handle, inode, &map, flags);
if (ret <= 0) {
-#ifdef EXT4FS_DEBUG
- ext4_warning(inode->i_sb,
- "inode #%lu: block %u: len %u: "
- "ext4_ext_map_blocks returned %d",
- inode->i_ino, map.m_lblk,
- map.m_len, ret);
-#endif
+ ext4_debug("inode #%lu: block %u: len %u: "
+ "ext4_ext_map_blocks returned %d",
+ inode->i_ino, map.m_lblk,
+ map.m_len, ret);
ext4_mark_inode_dirty(handle, inode);
ret2 = ext4_journal_stop(handle);
break;
}
- if ((map.m_lblk + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
- blkbits) >> blkbits))
- new_size = offset + len;
- else
- new_size = ((loff_t) map.m_lblk + ret) << blkbits;
-
- ext4_falloc_update_inode(inode, mode, new_size,
- (map.m_flags & EXT4_MAP_NEW));
- ext4_mark_inode_dirty(handle, inode);
- if ((file->f_flags & O_SYNC) && ret >= max_blocks)
- ext4_handle_sync(handle);
ret2 = ext4_journal_stop(handle);
if (ret2)
break;
@@ -4654,6 +4613,25 @@ retry:
ret = 0;
goto retry;
}
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle))
+ goto out;
+
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
+ if (ret > 0 && new_size) {
+ if (new_size > i_size_read(inode))
+ i_size_write(inode, new_size);
+ if (new_size > EXT4_I(inode)->i_disksize)
+ ext4_update_i_disksize(inode, new_size);
+ }
+ ext4_mark_inode_dirty(handle, inode);
+ if (file->f_flags & O_SYNC)
+ ext4_handle_sync(handle);
+
+ ext4_journal_stop(handle);
+out:
mutex_unlock(&inode->i_mutex);
trace_ext4_fallocate_exit(inode, offset, max_blocks,
ret > 0 ? ret2 : ret);
--
1.8.3.1
Move block allocation out of ext4_fallocate into a separate function
called ext4_alloc_file_blocks(). This will allow us to use the same
allocation code for other allocation operations such as zero range, which
is coming in a later patch in this series.
Signed-off-by: Lukas Czerner <[email protected]>
---
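The new helper takes its range in blocks rather than bytes, so the caller has to round the byte range outward first. A minimal sketch of that conversion follows, using the example from the comment in ext4_fallocate (blocksize 4096, offset 3072, len 2048). The names blk_range and bytes_to_blocks() are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, not part of the patch. */
struct blk_range {
	uint32_t lblk;		/* first logical block touched */
	uint32_t nr_blocks;	/* number of blocks touched */
};

/*
 * Round the byte range [offset, offset + len) outward to whole blocks.
 * Converting len alone is not enough: a short range that straddles a
 * block boundary still touches two blocks.
 */
static struct blk_range bytes_to_blocks(uint64_t offset, uint64_t len,
					unsigned int blkbits)
{
	uint64_t blksize = 1ULL << blkbits;
	uint64_t first = offset >> blkbits;
	uint64_t end_blk = (offset + len + blksize - 1) >> blkbits;
	struct blk_range r;

	assert(len > 0 && blkbits < 32);
	r.lblk = (uint32_t)first;
	r.nr_blocks = (uint32_t)(end_blk - first);
	return r;
}
```

For offset 3072 and len 2048 with 4096 byte blocks this yields block 0 and two blocks, even though len itself is smaller than one block.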
fs/ext4/extents.c | 127 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 73 insertions(+), 54 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6a52851..2d68a46 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4513,6 +4513,64 @@ retry:
ext4_std_error(inode->i_sb, err);
}
+int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+ ext4_lblk_t len, int flags, int mode)
+{
+ struct inode *inode = file_inode(file);
+ handle_t *handle;
+ int ret = 0;
+ int ret2 = 0;
+ int retries = 0;
+ struct ext4_map_blocks map;
+ unsigned int credits;
+
+ map.m_lblk = offset;
+ /*
+ * Don't normalize the request if it can fit in one extent so
+ * that it doesn't get unnecessarily split into multiple
+ * extents.
+ */
+ if (len <= EXT_UNINIT_MAX_LEN)
+ flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
+
+ /*
+ * credits to insert 1 extent into extent tree
+ */
+ credits = ext4_chunk_trans_blocks(inode, len);
+
+retry:
+ while (ret >= 0 && ret < len) {
+ map.m_lblk = map.m_lblk + ret;
+ map.m_len = len = len - ret;
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+ credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ break;
+ }
+ ret = ext4_map_blocks(handle, inode, &map, flags);
+ if (ret <= 0) {
+ ext4_debug("inode #%lu: block %u: len %u: "
+ "ext4_ext_map_blocks returned %d",
+ inode->i_ino, map.m_lblk,
+ map.m_len, ret);
+ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
+ ret2 = ext4_journal_stop(handle);
+ if (ret2)
+ break;
+ }
+ if (ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries)) {
+ ret = 0;
+ goto retry;
+ }
+
+ return ret > 0 ? ret2 : ret;
+}
+
/*
* preallocate space for a file. This implements ext4's fallocate file
* operation, which gets called from sys_fallocate system call.
@@ -4527,11 +4585,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
loff_t new_size = 0;
unsigned int max_blocks;
int ret = 0;
- int ret2 = 0;
- int retries = 0;
int flags;
- struct ext4_map_blocks map;
- unsigned int credits, blkbits = inode->i_blkbits;
+ ext4_lblk_t lblk;
+ unsigned int blkbits = inode->i_blkbits;
/* Return error if mode is not supported */
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -4552,17 +4608,18 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EOPNOTSUPP;
trace_ext4_fallocate_enter(inode, offset, len, mode);
- map.m_lblk = offset >> blkbits;
+ lblk = offset >> blkbits;
/*
* We can't just convert len to max_blocks because
* If blocksize = 4096 offset = 3072 and len = 2048
*/
max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
- - map.m_lblk;
- /*
- * credits to insert 1 extent into extent tree
- */
- credits = ext4_chunk_trans_blocks(inode, max_blocks);
+ - lblk;
+
+ flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
+ if (mode & FALLOC_FL_KEEP_SIZE)
+ flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
+
mutex_lock(&inode->i_mutex);
if (!(mode & FALLOC_FL_KEEP_SIZE) &&
@@ -4573,46 +4630,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
goto out;
}
- flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
- if (mode & FALLOC_FL_KEEP_SIZE)
- flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
- /*
- * Don't normalize the request if it can fit in one extent so
- * that it doesn't get unnecessarily split into multiple
- * extents.
- */
- if (len <= EXT_UNINIT_MAX_LEN << blkbits)
- flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
-
-retry:
- while (ret >= 0 && ret < max_blocks) {
- map.m_lblk = map.m_lblk + ret;
- map.m_len = max_blocks = max_blocks - ret;
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
- credits);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- break;
- }
- ret = ext4_map_blocks(handle, inode, &map, flags);
- if (ret <= 0) {
- ext4_debug("inode #%lu: block %u: len %u: "
- "ext4_ext_map_blocks returned %d",
- inode->i_ino, map.m_lblk,
- map.m_len, ret);
- ext4_mark_inode_dirty(handle, inode);
- ret2 = ext4_journal_stop(handle);
- break;
- }
- ret2 = ext4_journal_stop(handle);
- if (ret2)
- break;
- }
- if (ret == -ENOSPC &&
- ext4_should_retry_alloc(inode->i_sb, &retries)) {
- ret = 0;
- goto retry;
- }
+ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags, mode);
+ if (ret)
+ goto out;
handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
if (IS_ERR(handle))
@@ -4620,7 +4640,7 @@ retry:
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
- if (ret > 0 && new_size) {
+ if (!ret && new_size) {
if (new_size > i_size_read(inode))
i_size_write(inode, new_size);
if (new_size > EXT4_I(inode)->i_disksize)
@@ -4633,9 +4653,8 @@ retry:
ext4_journal_stop(handle);
out:
mutex_unlock(&inode->i_mutex);
- trace_ext4_fallocate_exit(inode, offset, max_blocks,
- ret > 0 ? ret2 : ret);
- return ret > 0 ? ret2 : ret;
+ trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
+ return ret;
}
/*
--
1.8.3.1
Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents.c | 1 -
include/trace/events/ext4.h | 9 +++++++--
3 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ece5556..3b9601c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -31,6 +31,7 @@
#include <linux/percpu_counter.h>
#include <linux/ratelimit.h>
#include <crypto/hash.h>
+#include <linux/falloc.h>
#ifdef __KERNEL__
#include <linux/compat.h>
#endif
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2d68a46..4bfa870 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -37,7 +37,6 @@
#include <linux/quotaops.h>
#include <linux/string.h>
#include <linux/slab.h>
-#include <linux/falloc.h>
#include <asm/uaccess.h>
#include <linux/fiemap.h>
#include "ext4_jbd2.h"
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 197d312..451e020 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -68,6 +68,11 @@ struct extent_status;
{ EXTENT_STATUS_DELAYED, "D" }, \
{ EXTENT_STATUS_HOLE, "H" })
+#define show_falloc_mode(mode) __print_flags(mode, "|", \
+ { FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \
+ { FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \
+ { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"})
+
TRACE_EVENT(ext4_free_inode,
TP_PROTO(struct inode *inode),
@@ -1349,10 +1354,10 @@ TRACE_EVENT(ext4_fallocate_enter,
__entry->mode = mode;
),
- TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %d",
+ TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino, __entry->pos,
- __entry->len, __entry->mode)
+ __entry->len, show_falloc_mode(__entry->mode))
);
TRACE_EVENT(ext4_fallocate_exit,
--
1.8.3.1
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
It can be used to convert a range of a file to zeros, preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferably converted to
unwritten extents.
This can also be used to preallocate blocks past EOF in the same way as
with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
size to remain the same.
Also add appropriate tracepoints.
Signed-off-by: Lukas Czerner <[email protected]>
---
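Unlike plain preallocation, zero range rounds inward: only blocks fully covered by the range can be converted to unwritten, and the unaligned edges have to be zeroed by hand. Here is a minimal sketch of that split, with the hypothetical names zero_split and split_zero_range(). It deliberately ignores the special case where a partial block past EOF gets allocated as a whole block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type, not part of the patch. */
struct zero_split {
	uint64_t lblk;		/* first fully covered block */
	uint64_t nr_blocks;	/* fully covered blocks (convert to unwritten) */
	uint64_t head_bytes;	/* unaligned bytes before the first full block */
	uint64_t tail_bytes;	/* unaligned bytes after the last full block */
};

static struct zero_split split_zero_range(uint64_t offset, uint64_t len,
					  unsigned int blkbits)
{
	uint64_t blksize = 1ULL << blkbits;
	/* Round the start up and the end down to block boundaries. */
	uint64_t start = (offset + blksize - 1) & ~(blksize - 1);
	uint64_t end = (offset + len) & ~(blksize - 1);
	/* Default: the range sits inside one block, so all of it is "head". */
	struct zero_split s = { start >> blkbits, 0, len, 0 };

	assert(len > 0 && blkbits < 32);
	if (end >= start) {
		s.nr_blocks = (end - start) >> blkbits;
		s.head_bytes = start - offset;
		s.tail_bytes = offset + len - end;
	}
	return s;
}
```

For offset 1000 and len 10000 with 4096 byte blocks, block 1 is fully covered, while 3096 bytes at the head and 2808 bytes at the tail need manual zeroing.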
fs/ext4/ext4.h | 2 +
fs/ext4/extents.c | 262 +++++++++++++++++++++++++++++++++++++++++---
fs/ext4/inode.c | 17 ++-
include/trace/events/ext4.h | 64 ++++++-----
4 files changed, 292 insertions(+), 53 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3b9601c..a649abe 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -568,6 +568,8 @@ enum {
#define EXT4_GET_BLOCKS_NO_LOCK 0x0100
/* Do not put hole in extent cache */
#define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200
+ /* Convert written extents to unwritten */
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0400
/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 4bfa870..af0e8af 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3568,6 +3568,8 @@ out:
* b> Splits in two extents: Write is happening at either end of the extent
* c> Splits in three extents: Somone is writing in middle of the extent
*
+ * This works the same way in the case of initialized -> unwritten conversion.
+ *
* One of more index blocks maybe needed if the extent tree grow after
* the uninitialized extent split. To prevent ENOSPC occur at the IO
* complete, we need to split the uninitialized extent before DIO submit
@@ -3578,7 +3580,7 @@ out:
*
* Returns the size of uninitialized extent to be written on success.
*/
-static int ext4_split_unwritten_extents(handle_t *handle,
+static int ext4_split_convert_extents(handle_t *handle,
struct inode *inode,
struct ext4_map_blocks *map,
struct ext4_ext_path *path,
@@ -3590,9 +3592,9 @@ static int ext4_split_unwritten_extents(handle_t *handle,
unsigned int ee_len;
int split_flag = 0, depth;
- ext_debug("ext4_split_unwritten_extents: inode %lu, logical"
- "block %llu, max_blocks %u\n", inode->i_ino,
- (unsigned long long)map->m_lblk, map->m_len);
+ ext_debug("%s: inode %lu, logical block %llu, max_blocks %u\n",
+ __func__, inode->i_ino,
+ (unsigned long long)map->m_lblk, map->m_len);
eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
inode->i_sb->s_blocksize_bits;
@@ -3607,14 +3609,73 @@ static int ext4_split_unwritten_extents(handle_t *handle,
ee_block = le32_to_cpu(ex->ee_block);
ee_len = ext4_ext_get_actual_len(ex);
- split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
- split_flag |= EXT4_EXT_MARK_UNINIT2;
- if (flags & EXT4_GET_BLOCKS_CONVERT)
- split_flag |= EXT4_EXT_DATA_VALID2;
+ /* Convert to unwritten */
+ if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
+ split_flag |= EXT4_EXT_DATA_VALID1;
+ /* Convert to initialized */
+ } else if (flags & EXT4_GET_BLOCKS_CONVERT) {
+ split_flag |= ee_block + ee_len <= eof_block ?
+ EXT4_EXT_MAY_ZEROOUT : 0;
+ split_flag |= (EXT4_EXT_MARK_UNINIT2 | EXT4_EXT_DATA_VALID2);
+ }
flags |= EXT4_GET_BLOCKS_PRE_IO;
return ext4_split_extent(handle, inode, path, map, split_flag, flags);
}
+static int ext4_convert_initialized_extents(handle_t *handle,
+ struct inode *inode,
+ struct ext4_map_blocks *map,
+ struct ext4_ext_path *path)
+{
+ struct ext4_extent *ex;
+ ext4_lblk_t ee_block;
+ unsigned int ee_len;
+ int depth;
+ int err = 0;
+
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+ ee_block = le32_to_cpu(ex->ee_block);
+ ee_len = ext4_ext_get_actual_len(ex);
+
+ ext_debug("%s: inode %lu, logical "
+ "block %llu, max_blocks %u\n", __func__, inode->i_ino,
+ (unsigned long long)ee_block, ee_len);
+
+ if (ee_block != map->m_lblk || ee_len > map->m_len) {
+ err = ext4_split_convert_extents(handle, inode, map, path,
+ EXT4_GET_BLOCKS_CONVERT_UNWRITTEN);
+ if (err < 0)
+ goto out;
+ ext4_ext_drop_refs(path);
+ path = ext4_ext_find_extent(inode, map->m_lblk, path, 0);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ goto out;
+ }
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+ }
+
+ err = ext4_ext_get_access(handle, inode, path + depth);
+ if (err)
+ goto out;
+ /* first mark the extent as uninitialized */
+ ext4_ext_mark_uninitialized(ex);
+
+ /* note: ext4_ext_correct_indexes() isn't needed here because
+ * borders are not changed
+ */
+ ext4_ext_try_to_merge(handle, inode, path, ex);
+
+ /* Mark modified extent as dirty */
+ err = ext4_ext_dirty(handle, inode, path + path->p_depth);
+out:
+ ext4_ext_show_leaf(inode, path);
+ return err;
+}
+
+
static int ext4_convert_unwritten_extents_endio(handle_t *handle,
struct inode *inode,
struct ext4_map_blocks *map,
@@ -3648,8 +3709,8 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
inode->i_ino, (unsigned long long)ee_block, ee_len,
(unsigned long long)map->m_lblk, map->m_len);
#endif
- err = ext4_split_unwritten_extents(handle, inode, map, path,
- EXT4_GET_BLOCKS_CONVERT);
+ err = ext4_split_convert_extents(handle, inode, map, path,
+ EXT4_GET_BLOCKS_CONVERT);
if (err < 0)
goto out;
ext4_ext_drop_refs(path);
@@ -3850,6 +3911,35 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
}
static int
+ext4_ext_convert_initialized_extent(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map,
+ struct ext4_ext_path *path, int flags,
+ unsigned int allocated, ext4_fsblk_t newblock)
+{
+ int ret = 0;
+ int err = 0;
+
+ ret = ext4_convert_initialized_extents(handle, inode, map,
+ path);
+ if (ret >= 0) {
+ ext4_update_inode_fsync_trans(handle, inode, 1);
+ err = check_eofblocks_fl(handle, inode, map->m_lblk,
+ path, map->m_len);
+ } else
+ err = ret;
+ map->m_flags |= EXT4_MAP_UNWRITTEN;
+ if (allocated > map->m_len)
+ allocated = map->m_len;
+ map->m_len = allocated;
+
+ if (path) {
+ ext4_ext_drop_refs(path);
+ kfree(path);
+ }
+ return err ? err : allocated;
+}
+
+static int
ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map,
struct ext4_ext_path *path, int flags,
@@ -3876,8 +3966,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
/* get_block() before submit the IO, split the extent */
if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
- ret = ext4_split_unwritten_extents(handle, inode, map,
- path, flags);
+ ret = ext4_split_convert_extents(handle, inode, map,
+ path, flags | EXT4_GET_BLOCKS_CONVERT);
if (ret <= 0)
goto out;
/*
@@ -4168,6 +4258,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
unsigned short ee_len;
+
/*
* Uninitialized extents are treated as holes, except that
* we split out initialized portions during a write.
@@ -4184,7 +4275,17 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
ee_block, ee_len, newblock);
- if (!ext4_ext_is_uninitialized(ex))
+ /*
+ * If the extent is initialized check whether the
+ * caller wants to convert it to unwritten.
+ */
+ if ((!ext4_ext_is_uninitialized(ex)) &&
+ (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
+ allocated = ext4_ext_convert_initialized_extent(
+ handle, inode, map, path, flags,
+ allocated, newblock);
+ goto out3;
+ } else if (!ext4_ext_is_uninitialized(ex))
goto out;
allocated = ext4_ext_handle_uninitialized_extents(
@@ -4570,6 +4671,135 @@ retry:
return ret > 0 ? ret2 : ret;
}
+long ext4_zero_range(struct file *file, loff_t offset, loff_t len, int mode)
+{
+ struct inode *inode = file_inode(file);
+ handle_t *handle = NULL;
+ unsigned int max_blocks;
+ loff_t new_size = 0;
+ int ret = 0;
+ int flags;
+ int partial;
+ loff_t start, end;
+ ext4_lblk_t lblk;
+ struct address_space *mapping = inode->i_mapping;
+ unsigned int blkbits = inode->i_blkbits;
+
+ trace_ext4_zero_range(inode, offset, len, mode);
+
+ /*
+ * Write out all dirty pages to avoid race conditions
+ * Then release them.
+ */
+ if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+ ret = filemap_write_and_wait_range(mapping, offset,
+ offset + len - 1);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Round up offset. This is not fallocate, we need to zero out
+ * blocks, so convert interior block aligned part of the range to
+ * unwritten and possibly manually zero out unaligned parts of the
+ * range.
+ */
+ start = round_up(offset, 1 << blkbits);
+ end = round_down((offset + len), 1 << blkbits);
+
+ if (start < offset || end > offset + len)
+ return -EINVAL;
+ partial = (offset + len) & ((1 << blkbits) - 1);
+
+ lblk = start >> blkbits;
+ max_blocks = (end >> blkbits);
+ if (max_blocks < lblk)
+ max_blocks = 0;
+ else
+ max_blocks -= lblk;
+
+ flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
+ EXT4_GET_BLOCKS_CONVERT_UNWRITTEN;
+ if (mode & FALLOC_FL_KEEP_SIZE)
+ flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
+
+ mutex_lock(&inode->i_mutex);
+
+ /*
+ * Indirect files do not support unwritten extents
+ */
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
+ ret = -EOPNOTSUPP;
+ goto out_mutex;
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+ offset + len > i_size_read(inode)) {
+ new_size = offset + len;
+ ret = inode_newsize_ok(inode, new_size);
+ if (ret)
+ goto out_mutex;
+ /*
+ * If we have a partial block after EOF we have to allocate
+ * the entire block.
+ */
+ if (partial)
+ max_blocks += 1;
+ }
+
+ if (max_blocks > 0) {
+
+ /* Now release the pages and zero block aligned part of pages*/
+ truncate_pagecache_range(inode, start, end - 1);
+
+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
+ /*
+ * Remove entire range from the extent status tree.
+ */
+ ret = ext4_es_remove_extent(inode, lblk, max_blocks);
+ if (ret)
+ goto out_dio;
+
+ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags,
+ mode);
+ if (ret)
+ goto out_dio;
+ }
+
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ ext4_std_error(inode->i_sb, ret);
+ goto out_dio;
+ }
+
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
+ if (!ret && new_size) {
+ if (new_size > i_size_read(inode))
+ i_size_write(inode, new_size);
+ if (new_size > EXT4_I(inode)->i_disksize)
+ ext4_update_i_disksize(inode, new_size);
+ }
+ ext4_mark_inode_dirty(handle, inode);
+
+ /* Zero out partial block at the edges of the range */
+ ret = ext4_zero_partial_blocks(handle, inode, offset, len);
+
+ if (file->f_flags & O_SYNC)
+ ext4_handle_sync(handle);
+
+ ext4_journal_stop(handle);
+out_dio:
+ ext4_inode_resume_unlocked_dio(inode);
+out_mutex:
+ mutex_unlock(&inode->i_mutex);
+ return ret;
+}
+
/*
* preallocate space for a file. This implements ext4's fallocate file
* operation, which gets called from sys_fallocate system call.
@@ -4589,7 +4819,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
unsigned int blkbits = inode->i_blkbits;
/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;
if (mode & FALLOC_FL_PUNCH_HOLE)
@@ -4606,6 +4837,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
return -EOPNOTSUPP;
+ if (mode & FALLOC_FL_ZERO_RANGE)
+ return ext4_zero_range(file, offset, len, mode);
+
trace_ext4_fallocate_enter(inode, offset, len, mode);
lblk = offset >> blkbits;
/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895..e64807f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -503,6 +503,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
{
struct extent_status es;
int retval;
+ int ret = 0;
#ifdef ES_AGGRESSIVE_TEST
struct ext4_map_blocks orig_map;
@@ -552,7 +553,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
EXT4_GET_BLOCKS_KEEP_SIZE);
}
if (retval > 0) {
- int ret;
unsigned int status;
if (unlikely(retval != map->m_len)) {
@@ -579,7 +579,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret = check_block_validity(inode, map);
+ ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
}
@@ -596,7 +596,13 @@ found:
* with buffer head unmapped.
*/
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED)
- return retval;
+ /*
+ * If we need to convert extent to unwritten
+ * we continue and do the actual work in
+ * ext4_ext_map_blocks()
+ */
+ if (!(flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN))
+ return retval;
/*
* Here we clear m_flags because after allocating an new extent,
@@ -652,7 +658,6 @@ found:
ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
if (retval > 0) {
- int ret;
unsigned int status;
if (unlikely(retval != map->m_len)) {
@@ -687,7 +692,7 @@ found:
has_zeroout:
up_write((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret = check_block_validity(inode, map);
+ ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
}
@@ -3501,7 +3506,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
if (!S_ISREG(inode->i_mode))
return -EOPNOTSUPP;
- trace_ext4_punch_hole(inode, offset, length);
+ trace_ext4_punch_hole(inode, offset, length, 0);
/*
* Write out all dirty pages to avoid race conditions
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 451e020..7bb26aa 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -71,7 +71,8 @@ struct extent_status;
#define show_falloc_mode(mode) __print_flags(mode, "|", \
{ FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \
{ FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \
- { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"})
+ { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"}, \
+ { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"})
TRACE_EVENT(ext4_free_inode,
@@ -1333,7 +1334,7 @@ TRACE_EVENT(ext4_direct_IO_exit,
__entry->rw, __entry->ret)
);
-TRACE_EVENT(ext4_fallocate_enter,
+DECLARE_EVENT_CLASS(ext4__fallocate_mode,
TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
TP_ARGS(inode, offset, len, mode),
@@ -1341,23 +1342,45 @@ TRACE_EVENT(ext4_fallocate_enter,
TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
- __field( loff_t, pos )
- __field( loff_t, len )
+ __field( loff_t, offset )
+ __field( loff_t, len )
__field( int, mode )
),
TP_fast_assign(
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
- __entry->pos = offset;
+ __entry->offset = offset;
__entry->len = len;
__entry->mode = mode;
),
- TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %s",
+ TP_printk("dev %d,%d ino %lu offset %lld len %lld mode %s",
MAJOR(__entry->dev), MINOR(__entry->dev),
- (unsigned long) __entry->ino, __entry->pos,
- __entry->len, show_falloc_mode(__entry->mode))
+ (unsigned long) __entry->ino,
+ __entry->offset, __entry->len,
+ show_falloc_mode(__entry->mode))
+);
+
+DEFINE_EVENT(ext4__fallocate_mode, ext4_fallocate_enter,
+
+ TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
+
+ TP_ARGS(inode, offset, len, mode)
+);
+
+DEFINE_EVENT(ext4__fallocate_mode, ext4_punch_hole,
+
+ TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
+
+ TP_ARGS(inode, offset, len, mode)
+);
+
+DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range,
+
+ TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
+
+ TP_ARGS(inode, offset, len, mode)
);
TRACE_EVENT(ext4_fallocate_exit,
@@ -1389,31 +1412,6 @@ TRACE_EVENT(ext4_fallocate_exit,
__entry->ret)
);
-TRACE_EVENT(ext4_punch_hole,
- TP_PROTO(struct inode *inode, loff_t offset, loff_t len),
-
- TP_ARGS(inode, offset, len),
-
- TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( loff_t, offset )
- __field( loff_t, len )
- ),
-
- TP_fast_assign(
- __entry->dev = inode->i_sb->s_dev;
- __entry->ino = inode->i_ino;
- __entry->offset = offset;
- __entry->len = len;
- ),
-
- TP_printk("dev %d,%d ino %lu offset %lld len %lld",
- MAJOR(__entry->dev), MINOR(__entry->dev),
- (unsigned long) __entry->ino,
- __entry->offset, __entry->len)
-);
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
It can be used to convert a range of a file to zeros, preferably without
issuing data IO. Blocks should be preallocated for the regions that span
holes in the file, and the entire range is preferably converted to
unwritten extents, although the file system may choose to zero out the
extent or do anything else that results in reading zeros from the range
while the range remains allocated for the file.
This can also be used to preallocate blocks past EOF in the same way as
with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
size to remain the same.
Signed-off-by: Lukas Czerner <[email protected]>
---
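The mode checks in do_fallocate after this patch can be sketched in isolation as follows. The FL_* macros are local stand-ins for the FALLOC_FL_* flags, using the values from the falloc.h hunk in this patch, and check_falloc_mode() is a hypothetical name. The last check models the existing "punch hole must have keep size set" rule, which this patch does not change.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for FALLOC_FL_*, values as proposed in this patch. */
#define FL_KEEP_SIZE	0x01
#define FL_PUNCH_HOLE	0x02
#define FL_ZERO_RANGE	0x08

/* Hypothetical helper: 0 if the mode is acceptable, -EOPNOTSUPP if not. */
static int check_falloc_mode(int mode)
{
	/* Reject any bit outside the supported set. */
	if (mode & ~(FL_KEEP_SIZE | FL_PUNCH_HOLE | FL_ZERO_RANGE))
		return -EOPNOTSUPP;

	/* Punch hole and zero range are mutually exclusive. */
	if ((mode & FL_PUNCH_HOLE) && (mode & FL_ZERO_RANGE))
		return -EOPNOTSUPP;

	/* Punch hole must have keep size set (pre-existing rule). */
	if ((mode & FL_PUNCH_HOLE) && !(mode & FL_KEEP_SIZE))
		return -EOPNOTSUPP;

	return 0;
}

/* A few combinations and what the checks above make of them. */
static void falloc_mode_examples(void)
{
	assert(check_falloc_mode(FL_ZERO_RANGE) == 0);
	assert(check_falloc_mode(FL_ZERO_RANGE | FL_KEEP_SIZE) == 0);
	assert(check_falloc_mode(FL_PUNCH_HOLE | FL_KEEP_SIZE) == 0);
	assert(check_falloc_mode(FL_PUNCH_HOLE | FL_ZERO_RANGE | FL_KEEP_SIZE)
	       == -EOPNOTSUPP);
}
```

Zero range combines with keep size, but never with punch hole.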
fs/open.c | 7 ++++++-
include/uapi/linux/falloc.h | 1 +
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/open.c b/fs/open.c
index 4b3e1ed..6dc46c1 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -231,7 +231,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EINVAL;
/* Return error if mode is not supported */
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_ZERO_RANGE))
+ return -EOPNOTSUPP;
+
+ /* Punch hole and zero range are mutually exclusive */
+ if (mode & FALLOC_FL_PUNCH_HOLE && mode & FALLOC_FL_ZERO_RANGE)
return -EOPNOTSUPP;
/* Punch hole must have keep size set */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 990c4cc..49951ea 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -4,6 +4,7 @@
#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
#define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */
+#define FALLOC_FL_ZERO_RANGE 0x08 /* zero range */
#endif /* _UAPI_FALLOC_H_ */
--
1.8.3.1
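For reference, a minimal userspace sketch of how the new flag might be used
once the series is applied. The helper name zero_file_range() is hypothetical.
The value 0x08 is taken from the falloc.h hunk above and is defined locally
only in case the installed headers predate it. File systems that do not
implement the flag are expected to fail with EOPNOTSUPP.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01
#endif
#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x08 /* value from the falloc.h hunk above */
#endif

/*
 * Hypothetical helper: make [offset, offset + len) of fd read back as zeros.
 * If keep_size is non-zero, i_size is left alone even when the range
 * extends past EOF. Returns 0 on success, -errno on failure
 * (-EOPNOTSUPP when the file system does not implement the flag).
 */
static int zero_file_range(int fd, off_t offset, off_t len, int keep_size)
{
	int mode = FALLOC_FL_ZERO_RANGE;

	if (keep_size)
		mode |= FALLOC_FL_KEEP_SIZE;
	if (fallocate(fd, mode, offset, len) < 0)
		return -errno;
	return 0;
}
```

A caller that needs to work on older kernels or other file systems would
treat -EOPNOTSUPP as a hint to fall back to writing zeros by hand.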
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
We can also preallocate blocks past EOF in the same way as with
fallocate. The FALLOC_FL_KEEP_SIZE flag will cause the inode size to remain
the same even if we preallocate blocks past EOF.
It uses the same code to zero the range as is used by the
XFS_IOC_ZERO_RANGE ioctl.
Signed-off-by: Lukas Czerner <[email protected]>
---
fs/xfs/xfs_file.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48ea..aec5f64 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -823,7 +823,8 @@ xfs_file_fallocate(
if (!S_ISREG(inode->i_mode))
return -EINVAL;
- if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+ FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;
xfs_ilock(ip, XFS_IOLOCK_EXCL);
@@ -840,8 +841,11 @@ xfs_file_fallocate(
goto out_unlock;
}
- error = xfs_alloc_file_space(ip, offset, len,
- XFS_BMAPI_PREALLOC);
+ if (mode & FALLOC_FL_ZERO_RANGE)
+ error = xfs_zero_file_space(ip, offset, len);
+ else
+ error = xfs_alloc_file_space(ip, offset, len,
+ XFS_BMAPI_PREALLOC);
if (error)
goto out_unlock;
}
--
1.8.3.1
On Feb 17, 2014, at 8:08 AM, Lukas Czerner <[email protected]> wrote:
> Currently in ext4_fallocate we would update inode size, c_time and sync
> the file with every partial allocation which is entirely unnecessary. It
> is true that if the crash happens in the middle of truncate we might end
> up with unchanged i_size or c_time, which I do not think is really a
> problem - it does not mean file system corruption in any way. Note that
> xfs is doing things the same way e.g. update all of the mentioned after
> the allocation is done.
I'm OK with this part.
> This commit moves all the updates after the allocation is done. In
> addition we also need to change m_time, as not only the inode has been
> changed but data regions might have changed as well (unwritten extents).
I don't necessarily agree about this. Calling fallocate() will not
change the user-visible data at all, so there is no reason to e.g.
do a new backup of the file or reprocess the contents, or any other
reason that an application cares about a changed mtime.
Cheers, Andreas
> Also we do
> not need to be paranoid about changing the c_time and m_time only if the
> actual allocation has happened; we can change them even if we try to
> allocate only to find out that the blocks are already allocated. It's
> not really a big deal and it will save us some additional complexity.
>
> Also use ext4_debug, instead of ext4_warning in #ifdef EXT4FS_DEBUG
> section.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/ext4/extents.c | 86 +++++++++++++++++++++----------------------------------
> 1 file changed, 32 insertions(+), 54 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 10cff47..6a52851 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4513,36 +4513,6 @@ retry:
> ext4_std_error(inode->i_sb, err);
> }
>
> -static void ext4_falloc_update_inode(struct inode *inode,
> - int mode, loff_t new_size, int update_ctime)
> -{
> - struct timespec now;
> -
> - if (update_ctime) {
> - now = current_fs_time(inode->i_sb);
> - if (!timespec_equal(&inode->i_ctime, &now))
> - inode->i_ctime = now;
> - }
> - /*
> - * Update only when preallocation was requested beyond
> - * the file size.
> - */
> - if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> - if (new_size > i_size_read(inode))
> - i_size_write(inode, new_size);
> - if (new_size > EXT4_I(inode)->i_disksize)
> - ext4_update_i_disksize(inode, new_size);
> - } else {
> - /*
> - * Mark that we allocate beyond EOF so the subsequent truncate
> - * can proceed even if the new size is the same as i_size.
> - */
> - if (new_size > i_size_read(inode))
> - ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
> - }
> -
> -}
> -
> /*
> * preallocate space for a file. This implements ext4's fallocate file
> * operation, which gets called from sys_fallocate system call.
> @@ -4554,7 +4524,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> {
> struct inode *inode = file_inode(file);
> handle_t *handle;
> - loff_t new_size;
> + loff_t new_size = 0;
> unsigned int max_blocks;
> int ret = 0;
> int ret2 = 0;
> @@ -4594,12 +4564,15 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> */
> credits = ext4_chunk_trans_blocks(inode, max_blocks);
> mutex_lock(&inode->i_mutex);
> - ret = inode_newsize_ok(inode, (len + offset));
> - if (ret) {
> - mutex_unlock(&inode->i_mutex);
> - trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
> - return ret;
> +
> + if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> + offset + len > i_size_read(inode)) {
> + new_size = offset + len;
> + ret = inode_newsize_ok(inode, new_size);
> + if (ret)
> + goto out;
> }
> +
> flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
> if (mode & FALLOC_FL_KEEP_SIZE)
> flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
> @@ -4623,28 +4596,14 @@ retry:
> }
> ret = ext4_map_blocks(handle, inode, &map, flags);
> if (ret <= 0) {
> -#ifdef EXT4FS_DEBUG
> - ext4_warning(inode->i_sb,
> - "inode #%lu: block %u: len %u: "
> - "ext4_ext_map_blocks returned %d",
> - inode->i_ino, map.m_lblk,
> - map.m_len, ret);
> -#endif
> + ext4_debug("inode #%lu: block %u: len %u: "
> + "ext4_ext_map_blocks returned %d",
> + inode->i_ino, map.m_lblk,
> + map.m_len, ret);
> ext4_mark_inode_dirty(handle, inode);
> ret2 = ext4_journal_stop(handle);
> break;
> }
> - if ((map.m_lblk + ret) >= (EXT4_BLOCK_ALIGN(offset + len,
> - blkbits) >> blkbits))
> - new_size = offset + len;
> - else
> - new_size = ((loff_t) map.m_lblk + ret) << blkbits;
> -
> - ext4_falloc_update_inode(inode, mode, new_size,
> - (map.m_flags & EXT4_MAP_NEW));
> - ext4_mark_inode_dirty(handle, inode);
> - if ((file->f_flags & O_SYNC) && ret >= max_blocks)
> - ext4_handle_sync(handle);
> ret2 = ext4_journal_stop(handle);
> if (ret2)
> break;
> @@ -4654,6 +4613,25 @@ retry:
> ret = 0;
> goto retry;
> }
> +
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle))
> + goto out;
> +
> + inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> +
> + if (ret > 0 && new_size) {
> + if (new_size > i_size_read(inode))
> + i_size_write(inode, new_size);
> + if (new_size > EXT4_I(inode)->i_disksize)
> + ext4_update_i_disksize(inode, new_size);
> + }
> + ext4_mark_inode_dirty(handle, inode);
> + if (file->f_flags & O_SYNC)
> + ext4_handle_sync(handle);
> +
> + ext4_journal_stop(handle);
> +out:
> mutex_unlock(&inode->i_mutex);
> trace_ext4_fallocate_exit(inode, offset, max_blocks,
> ret > 0 ? ret2 : ret);
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas
On Mon, Feb 17, 2014 at 04:12:14PM -0700, Andreas Dilger wrote:
>
> I don't necessarily agree about this. Calling fallocate() will not
> change the user-visible data at all, so there is no reason to e.g.
> do a new backup of the file or reprocess the contents, or any other
> reason that an application cares about a changed mtime.
Well, if i_size has changed, then the visible results of reading from
the file will change, so in that case I'd argue m_time should change.
If the results of reading the file don't change then we can keep m_time
unchanged --- but since the inode is changing, c_time *should* always
change any time we've made any changes to the extent tree.
- Ted
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of a file to zeros, preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferably converted to
> unwritten extents, although the file system may choose to zero out the
> extent or do whatever else results in reading zeros from the range
> while the range remains allocated for the file.
>
> This can also be used to preallocate blocks past EOF in the same way as
> with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
> size to remain the same.
>
> You can test this feature yourself using xfstests or fallocate(1); however,
> you'll need patches for util_linux, xfsprogs and xfstests, which you
> can find here:
>
> http://people.redhat.com/lczerner/zero_range/
>
> I'll post the patches after we agree and merge the kernel functionality.
>
> I tested this mostly with a subset of xfstests using fsx and fsstress and
> even with a new generic/290, which is just a copy of xfs/290 using the fzero
> command for xfs_io instead of zero (which uses ioctl). I was testing on
> x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
You also want to convert xfs/242 to be a generic test - it uses the
_generic_test_punch helper to test all the corner cases across
different extent type transitions.
> ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083 generic/117 generic/068 generic/231 generic/127 generic/091 generic/075 generic/112 generic/263 generic/091 generic/075 generic/256 generic/255 generic/316 generic/300 generic/290;
>
> Note that there is work in progress on FALLOC_FL_COLLAPSE_RANGE which
> touches the same area as this patch set does, so we should figure out
> which one should go first and modify the other on top of it.
I was going to push the FALLOC_FL_COLLAPSE_RANGE stuff through
the XFS tree once it was done - perhaps you and Namjae can get
together and work out which order the patch series should go.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Feb 17, 2014 at 04:08:21PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of a file to zeros, preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferably converted to
> unwritten extents, although the file system may choose to zero out the
> extent or do whatever else results in reading zeros from the range
> while the range remains allocated for the file.
>
> This can also be used to preallocate blocks past EOF in the same way as
> with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
> size to remain the same.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/open.c | 7 ++++++-
> include/uapi/linux/falloc.h | 1 +
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/open.c b/fs/open.c
> index 4b3e1ed..6dc46c1 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -231,7 +231,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return -EINVAL;
>
> /* Return error if mode is not supported */
> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> + FALLOC_FL_ZERO_RANGE))
> + return -EOPNOTSUPP;
> +
> + /* Punch hole and zero range are mutually exclusive */
> + if (mode & FALLOC_FL_PUNCH_HOLE && mode & FALLOC_FL_ZERO_RANGE)
I would have expected gcc to throw a warning on this. Even if it
doesn't, it's so easy to mix up & and && that it needs parentheses
around it to make it obvious what you actually meant, and that it doesn't
have a && where an & should be or vice versa. Better, IMO, is this:
/* Punch hole and zero range are mutually exclusive */
if ((mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) ==
(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;
because it's obvious what the intent is and easy to spot typos.
Cheers,
Dave.
--
Dave Chinner
[email protected]
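To make the point about the two spellings concrete, here is a small
standalone sketch. The function and macro names are hypothetical, and the
flag values are local copies of those in the falloc.h hunk. Both forms are
true only when both bits are set, because & binds tighter than && in C, so
the difference is one of readability rather than behaviour.

```c
#include <stdbool.h>

/* Local copies of the flag values from include/uapi/linux/falloc.h. */
#define FL_PUNCH_HOLE 0x02
#define FL_ZERO_RANGE 0x08

/* Form used in the patch: relies on & binding tighter than &&. */
static bool both_set_patch(int mode)
{
	return mode & FL_PUNCH_HOLE && mode & FL_ZERO_RANGE;
}

/* Form suggested in review: compare the masked bits against the mask. */
static bool both_set_mask(int mode)
{
	const int excl = FL_PUNCH_HOLE | FL_ZERO_RANGE;

	return (mode & excl) == excl;
}
```

A typo such as && in place of & inside both_set_mask() would change the
type of the comparison operand and tends to stand out, which is the
argument made for the second form.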
On Mon, Feb 17, 2014 at 04:08:23PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> We can also preallocate blocks past EOF in the same way as with
> fallocate. The FALLOC_FL_KEEP_SIZE flag will cause the inode size to remain
> the same even if we preallocate blocks past EOF.
>
> It uses the same code to zero the range as is used by the
> XFS_IOC_ZERO_RANGE ioctl.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/xfs/xfs_file.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 64b48ea..aec5f64 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -823,7 +823,8 @@ xfs_file_fallocate(
>
> if (!S_ISREG(inode->i_mode))
> return -EINVAL;
> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> + FALLOC_FL_ZERO_RANGE))
> return -EOPNOTSUPP;
>
> xfs_ilock(ip, XFS_IOLOCK_EXCL);
> @@ -840,8 +841,11 @@ xfs_file_fallocate(
> goto out_unlock;
> }
>
> - error = xfs_alloc_file_space(ip, offset, len,
> - XFS_BMAPI_PREALLOC);
> + if (mode & FALLOC_FL_ZERO_RANGE)
> + error = xfs_zero_file_space(ip, offset, len);
> + else
> + error = xfs_alloc_file_space(ip, offset, len,
> + XFS_BMAPI_PREALLOC);
> if (error)
> goto out_unlock;
> }
Looks OK.
Reviewed-by: Dave Chinner <[email protected]>
--
Dave Chinner
[email protected]
On Tue, 18 Feb 2014, Dave Chinner wrote:
> Date: Tue, 18 Feb 2014 13:51:12 +1100
> From: Dave Chinner <[email protected]>
> To: Lukas Czerner <[email protected]>
> Cc: [email protected], [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 4/6] fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
>
> On Mon, Feb 17, 2014 at 04:08:21PM +0100, Lukas Czerner wrote:
> > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> >
> > It can be used to convert a range of a file to zeros, preferably without
> > issuing data IO. Blocks should be preallocated for the regions that span
> > holes in the file, and the entire range is preferably converted to
> > unwritten extents, although the file system may choose to zero out the
> > extent or do whatever else results in reading zeros from the range
> > while the range remains allocated for the file.
> >
> > This can also be used to preallocate blocks past EOF in the same way as
> > with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
> > size to remain the same.
> >
> > Signed-off-by: Lukas Czerner <[email protected]>
> > ---
> > fs/open.c | 7 ++++++-
> > include/uapi/linux/falloc.h | 1 +
> > 2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/open.c b/fs/open.c
> > index 4b3e1ed..6dc46c1 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -231,7 +231,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> > return -EINVAL;
> >
> > /* Return error if mode is not supported */
> > - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> > + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> > + FALLOC_FL_ZERO_RANGE))
> > + return -EOPNOTSUPP;
> > +
> > + /* Punch hole and zero range are mutually exclusive */
> > + if (mode & FALLOC_FL_PUNCH_HOLE && mode & FALLOC_FL_ZERO_RANGE)
>
> I would have expected gcc to throw a warning on this. Even if it
> doesn't, it's so easy to mix up & and && that it needs parentheses
> around it to make it obvious what you actually meant, and that it doesn't
> have a && where an & should be or vice versa. Better, IMO, is this:
>
> /* Punch hole and zero range are mutually exclusive */
> if ((mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)) ==
> (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))
> return -EOPNOTSUPP;
>
> because it's obvious what the intent is and easy to spot typos.
Fair enough, I'll change it.
Thanks!
-Lukas
>
> Cheers,
>
> Dave.
>
On Mon, 17 Feb 2014, Theodore Ts'o wrote:
> Date: Mon, 17 Feb 2014 18:21:00 -0500
> From: Theodore Ts'o <[email protected]>
> To: Andreas Dilger <[email protected]>
> Cc: Lukas Czerner <[email protected]>,
> Ext4 Developers List <[email protected]>,
> linux-fsdevel <[email protected]>, [email protected]
> Subject: Re: [PATCH 1/6] ext4: Update inode i_size after the preallocation
>
> On Mon, Feb 17, 2014 at 04:12:14PM -0700, Andreas Dilger wrote:
> >
> > I don't necessarily agree about this. Calling fallocate() will not
> > change the user-visible data at all, so there is no reason to e.g.
> > do a new backup of the file or reprocess the contents, or any other
> > reason that an application cares about a changed mtime.
>
> Well, if i_size has changed, then the visible results of reading from
> the file will change, so in that case I'd argue m_time should change.
> If the results of reading the file don't change then we can keep m_time
> unchanged --- but since the inode is changing, c_time *should* always
> change any time we've made any changes to the extent tree.
>
> - Ted
So I guess the consensus is to update m_time only when the inode size
changes in the fallocate case. I'll change that in the code.
Thanks!
-Lukas
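A sketch of the timestamp policy discussed above, written as a standalone
function. The structs and names are hypothetical and only model the
decision, not the ext4 code: c_time is bumped whenever the extent tree or
i_size changed, m_time only when i_size changed.

```c
#include <stdbool.h>

/* Hypothetical model of the outcome of one fallocate call. */
struct falloc_result {
	long long old_size;   /* i_size before the call */
	long long new_size;   /* i_size after the call */
	bool extents_changed; /* extent tree was modified */
};

/* Which timestamps should be updated afterwards. */
struct time_update {
	bool ctime;
	bool mtime;
};

static struct time_update falloc_time_update(const struct falloc_result *r)
{
	struct time_update u;

	/* Reads return different data only if the file size changed. */
	u.mtime = r->new_size != r->old_size;
	/* Any change to the inode, including its extents, bumps c_time. */
	u.ctime = r->extents_changed || u.mtime;
	return u;
}
```

With FALLOC_FL_KEEP_SIZE the size stays put, so under this model only
c_time would move even when blocks were allocated past EOF.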
seems "int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,"
should be "static int ext4_alloc_file_blocks(struct file *file,
ext4_lblk_t offset,"
Jon
On Mon, Feb 17, 2014 at 3:08 PM, Lukas Czerner <[email protected]> wrote:
> Move block allocation out of ext4_fallocate into a separate function
> called ext4_alloc_file_blocks(). This will allow us to use the same
> allocation code for other allocation operations such as zero range, which
> comes in the next patch.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/ext4/extents.c | 127 +++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 73 insertions(+), 54 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 6a52851..2d68a46 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4513,6 +4513,64 @@ retry:
> ext4_std_error(inode->i_sb, err);
> }
>
> +int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
> + ext4_lblk_t len, int flags, int mode)
> +{
> + struct inode *inode = file_inode(file);
> + handle_t *handle;
> + int ret = 0;
> + int ret2 = 0;
> + int retries = 0;
> + struct ext4_map_blocks map;
> + unsigned int credits;
> +
> + map.m_lblk = offset;
> + /*
> + * Don't normalize the request if it can fit in one extent so
> + * that it doesn't get unnecessarily split into multiple
> + * extents.
> + */
> + if (len <= EXT_UNINIT_MAX_LEN)
> + flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
> +
> + /*
> + * credits to insert 1 extent into extent tree
> + */
> + credits = ext4_chunk_trans_blocks(inode, len);
> +
> +retry:
> + while (ret >= 0 && ret < len) {
> + map.m_lblk = map.m_lblk + ret;
> + map.m_len = len = len - ret;
> + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> + credits);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + break;
> + }
> + ret = ext4_map_blocks(handle, inode, &map, flags);
> + if (ret <= 0) {
> + ext4_debug("inode #%lu: block %u: len %u: "
> + "ext4_ext_map_blocks returned %d",
> + inode->i_ino, map.m_lblk,
> + map.m_len, ret);
> + ext4_mark_inode_dirty(handle, inode);
> + ret2 = ext4_journal_stop(handle);
> + break;
> + }
> + ret2 = ext4_journal_stop(handle);
> + if (ret2)
> + break;
> + }
> + if (ret == -ENOSPC &&
> + ext4_should_retry_alloc(inode->i_sb, &retries)) {
> + ret = 0;
> + goto retry;
> + }
> +
> + return ret > 0 ? ret2 : ret;
> +}
> +
> /*
> * preallocate space for a file. This implements ext4's fallocate file
> * operation, which gets called from sys_fallocate system call.
> @@ -4527,11 +4585,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> loff_t new_size = 0;
> unsigned int max_blocks;
> int ret = 0;
> - int ret2 = 0;
> - int retries = 0;
> int flags;
> - struct ext4_map_blocks map;
> - unsigned int credits, blkbits = inode->i_blkbits;
> + ext4_lblk_t lblk;
> + unsigned int blkbits = inode->i_blkbits;
>
> /* Return error if mode is not supported */
> if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> @@ -4552,17 +4608,18 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return -EOPNOTSUPP;
>
> trace_ext4_fallocate_enter(inode, offset, len, mode);
> - map.m_lblk = offset >> blkbits;
> + lblk = offset >> blkbits;
> /*
> * We can't just convert len to max_blocks because
> * If blocksize = 4096 offset = 3072 and len = 2048
> */
> max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> - - map.m_lblk;
> - /*
> - * credits to insert 1 extent into extent tree
> - */
> - credits = ext4_chunk_trans_blocks(inode, max_blocks);
> + - lblk;
> +
> + flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
> + if (mode & FALLOC_FL_KEEP_SIZE)
> + flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
> +
> mutex_lock(&inode->i_mutex);
>
> if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> @@ -4573,46 +4630,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> goto out;
> }
>
> - flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
> - if (mode & FALLOC_FL_KEEP_SIZE)
> - flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
> - /*
> - * Don't normalize the request if it can fit in one extent so
> - * that it doesn't get unnecessarily split into multiple
> - * extents.
> - */
> - if (len <= EXT_UNINIT_MAX_LEN << blkbits)
> - flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
> -
> -retry:
> - while (ret >= 0 && ret < max_blocks) {
> - map.m_lblk = map.m_lblk + ret;
> - map.m_len = max_blocks = max_blocks - ret;
> - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> - credits);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - break;
> - }
> - ret = ext4_map_blocks(handle, inode, &map, flags);
> - if (ret <= 0) {
> - ext4_debug("inode #%lu: block %u: len %u: "
> - "ext4_ext_map_blocks returned %d",
> - inode->i_ino, map.m_lblk,
> - map.m_len, ret);
> - ext4_mark_inode_dirty(handle, inode);
> - ret2 = ext4_journal_stop(handle);
> - break;
> - }
> - ret2 = ext4_journal_stop(handle);
> - if (ret2)
> - break;
> - }
> - if (ret == -ENOSPC &&
> - ext4_should_retry_alloc(inode->i_sb, &retries)) {
> - ret = 0;
> - goto retry;
> - }
> + ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags, mode);
> + if (ret)
> + goto out;
>
> handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> if (IS_ERR(handle))
> @@ -4620,7 +4640,7 @@ retry:
>
> inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
>
> - if (ret > 0 && new_size) {
> + if (!ret && new_size) {
> if (new_size > i_size_read(inode))
> i_size_write(inode, new_size);
> if (new_size > EXT4_I(inode)->i_disksize)
> @@ -4633,9 +4653,8 @@ retry:
> ext4_journal_stop(handle);
> out:
> mutex_unlock(&inode->i_mutex);
> - trace_ext4_fallocate_exit(inode, offset, max_blocks,
> - ret > 0 ? ret2 : ret);
> - return ret > 0 ? ret2 : ret;
> + trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
> + return ret;
> }
>
> /*
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
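The comment in the ext4_fallocate hunk above ("blocksize = 4096 offset =
3072 and len = 2048") can be worked through with a small standalone sketch.
block_align() is a local stand-in for EXT4_BLOCK_ALIGN() and
blocks_spanned() is a hypothetical name for the max_blocks computation; the
point is that the range touches two blocks even though len alone is smaller
than one block.

```c
#include <stdint.h>

/* Local stand-in for EXT4_BLOCK_ALIGN(): round size up to a block boundary. */
static uint64_t block_align(uint64_t size, unsigned int blkbits)
{
	uint64_t mask = ((uint64_t)1 << blkbits) - 1;

	return (size + mask) & ~mask;
}

/* Number of blocks touched by the byte range [offset, offset + len). */
static uint64_t blocks_spanned(uint64_t offset, uint64_t len,
			       unsigned int blkbits)
{
	uint64_t lblk = offset >> blkbits; /* first block touched */

	return (block_align(offset + len, blkbits) >> blkbits) - lblk;
}
```

For the values in the comment, offset + len is 5120, which aligns up to
8192, so the range ends in block 1 and starts in block 0: two blocks, while
len >> blkbits would give zero.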
long ext4_zero_range(struct file *file, loff_t offset, loff_t len, int mode)
needs "static" too.
static long ext4_zero_range(struct file *file, loff_t offset, loff_t
len, int mode)
On Mon, Feb 17, 2014 at 3:08 PM, Lukas Czerner <[email protected]> wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of a file to zeros, preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferably converted to
> unwritten extents.
>
> This can also be used to preallocate blocks past EOF in the same way as
> with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
> size to remain the same.
>
> Also add appropriate tracepoints.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/ext4/ext4.h | 2 +
> fs/ext4/extents.c | 262 +++++++++++++++++++++++++++++++++++++++++---
> fs/ext4/inode.c | 17 ++-
> include/trace/events/ext4.h | 64 ++++++-----
> 4 files changed, 292 insertions(+), 53 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3b9601c..a649abe 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -568,6 +568,8 @@ enum {
> #define EXT4_GET_BLOCKS_NO_LOCK 0x0100
> /* Do not put hole in extent cache */
> #define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200
> + /* Convert written extents to unwritten */
> +#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0400
>
> /*
> * The bit position of these flags must not overlap with any of the
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 4bfa870..af0e8af 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3568,6 +3568,8 @@ out:
> * b> Splits in two extents: Write is happening at either end of the extent
> * c> Splits in three extents: Somone is writing in middle of the extent
> *
> + * This works the same way in the case of initialized -> unwritten conversion.
> + *
> * One of more index blocks maybe needed if the extent tree grow after
> * the uninitialized extent split. To prevent ENOSPC occur at the IO
> * complete, we need to split the uninitialized extent before DIO submit
> @@ -3578,7 +3580,7 @@ out:
> *
> * Returns the size of uninitialized extent to be written on success.
> */
> -static int ext4_split_unwritten_extents(handle_t *handle,
> +static int ext4_split_convert_extents(handle_t *handle,
> struct inode *inode,
> struct ext4_map_blocks *map,
> struct ext4_ext_path *path,
> @@ -3590,9 +3592,9 @@ static int ext4_split_unwritten_extents(handle_t *handle,
> unsigned int ee_len;
> int split_flag = 0, depth;
>
> - ext_debug("ext4_split_unwritten_extents: inode %lu, logical"
> - "block %llu, max_blocks %u\n", inode->i_ino,
> - (unsigned long long)map->m_lblk, map->m_len);
> + ext_debug("%s: inode %lu, logical block %llu, max_blocks %u\n",
> + __func__, inode->i_ino,
> + (unsigned long long)map->m_lblk, map->m_len);
>
> eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
> inode->i_sb->s_blocksize_bits;
> @@ -3607,14 +3609,73 @@ static int ext4_split_unwritten_extents(handle_t *handle,
> ee_block = le32_to_cpu(ex->ee_block);
> ee_len = ext4_ext_get_actual_len(ex);
>
> - split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
> - split_flag |= EXT4_EXT_MARK_UNINIT2;
> - if (flags & EXT4_GET_BLOCKS_CONVERT)
> - split_flag |= EXT4_EXT_DATA_VALID2;
> + /* Convert to unwritten */
> + if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) {
> + split_flag |= EXT4_EXT_DATA_VALID1;
> + /* Convert to initialized */
> + } else if (flags & EXT4_GET_BLOCKS_CONVERT) {
> + split_flag |= ee_block + ee_len <= eof_block ?
> + EXT4_EXT_MAY_ZEROOUT : 0;
> + split_flag |= (EXT4_EXT_MARK_UNINIT2 | EXT4_EXT_DATA_VALID2);
> + }
> flags |= EXT4_GET_BLOCKS_PRE_IO;
> return ext4_split_extent(handle, inode, path, map, split_flag, flags);
> }
>
> +static int ext4_convert_initialized_extents(handle_t *handle,
> + struct inode *inode,
> + struct ext4_map_blocks *map,
> + struct ext4_ext_path *path)
> +{
> + struct ext4_extent *ex;
> + ext4_lblk_t ee_block;
> + unsigned int ee_len;
> + int depth;
> + int err = 0;
> +
> + depth = ext_depth(inode);
> + ex = path[depth].p_ext;
> + ee_block = le32_to_cpu(ex->ee_block);
> + ee_len = ext4_ext_get_actual_len(ex);
> +
> + ext_debug("%s: inode %lu, logical "
> + "block %llu, max_blocks %u\n", __func__, inode->i_ino,
> + (unsigned long long)ee_block, ee_len);
> +
> + if (ee_block != map->m_lblk || ee_len > map->m_len) {
> + err = ext4_split_convert_extents(handle, inode, map, path,
> + EXT4_GET_BLOCKS_CONVERT_UNWRITTEN);
> + if (err < 0)
> + goto out;
> + ext4_ext_drop_refs(path);
> + path = ext4_ext_find_extent(inode, map->m_lblk, path, 0);
> + if (IS_ERR(path)) {
> + err = PTR_ERR(path);
> + goto out;
> + }
> + depth = ext_depth(inode);
> + ex = path[depth].p_ext;
> + }
> +
> + err = ext4_ext_get_access(handle, inode, path + depth);
> + if (err)
> + goto out;
> + /* first mark the extent as uninitialized */
> + ext4_ext_mark_uninitialized(ex);
> +
> + /* note: ext4_ext_correct_indexes() isn't needed here because
> + * borders are not changed
> + */
> + ext4_ext_try_to_merge(handle, inode, path, ex);
> +
> + /* Mark modified extent as dirty */
> + err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> +out:
> + ext4_ext_show_leaf(inode, path);
> + return err;
> +}
> +
> +
> static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> struct inode *inode,
> struct ext4_map_blocks *map,
> @@ -3648,8 +3709,8 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> inode->i_ino, (unsigned long long)ee_block, ee_len,
> (unsigned long long)map->m_lblk, map->m_len);
> #endif
> - err = ext4_split_unwritten_extents(handle, inode, map, path,
> - EXT4_GET_BLOCKS_CONVERT);
> + err = ext4_split_convert_extents(handle, inode, map, path,
> + EXT4_GET_BLOCKS_CONVERT);
> if (err < 0)
> goto out;
> ext4_ext_drop_refs(path);
> @@ -3850,6 +3911,35 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
> }
>
> static int
> +ext4_ext_convert_initialized_extent(handle_t *handle, struct inode *inode,
> + struct ext4_map_blocks *map,
> + struct ext4_ext_path *path, int flags,
> + unsigned int allocated, ext4_fsblk_t newblock)
> +{
> + int ret = 0;
> + int err = 0;
> +
> + ret = ext4_convert_initialized_extents(handle, inode, map,
> + path);
> + if (ret >= 0) {
> + ext4_update_inode_fsync_trans(handle, inode, 1);
> + err = check_eofblocks_fl(handle, inode, map->m_lblk,
> + path, map->m_len);
> + } else
> + err = ret;
> + map->m_flags |= EXT4_MAP_UNWRITTEN;
> + if (allocated > map->m_len)
> + allocated = map->m_len;
> + map->m_len = allocated;
> +
> + if (path) {
> + ext4_ext_drop_refs(path);
> + kfree(path);
> + }
> + return err ? err : allocated;
> +}
> +
> +static int
> ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
> struct ext4_map_blocks *map,
> struct ext4_ext_path *path, int flags,
> @@ -3876,8 +3966,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
>
> /* get_block() before submit the IO, split the extent */
> if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
> - ret = ext4_split_unwritten_extents(handle, inode, map,
> - path, flags);
> + ret = ext4_split_convert_extents(handle, inode, map,
> + path, flags | EXT4_GET_BLOCKS_CONVERT);
> if (ret <= 0)
> goto out;
> /*
> @@ -4168,6 +4258,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
> unsigned short ee_len;
>
> +
> /*
> * Uninitialized extents are treated as holes, except that
> * we split out initialized portions during a write.
> @@ -4184,7 +4275,17 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk,
> ee_block, ee_len, newblock);
>
> - if (!ext4_ext_is_uninitialized(ex))
> + /*
> + * If the extent is initialized check whether the
> + * caller wants to convert it to unwritten.
> + */
> + if ((!ext4_ext_is_uninitialized(ex)) &&
> + (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) {
> + allocated = ext4_ext_convert_initialized_extent(
> + handle, inode, map, path, flags,
> + allocated, newblock);
> + goto out3;
> + } else if (!ext4_ext_is_uninitialized(ex))
> goto out;
>
> allocated = ext4_ext_handle_uninitialized_extents(
> @@ -4570,6 +4671,135 @@ retry:
> return ret > 0 ? ret2 : ret;
> }
>
> +long ext4_zero_range(struct file *file, loff_t offset, loff_t len, int mode)
> +{
> + struct inode *inode = file_inode(file);
> + handle_t *handle = NULL;
> + unsigned int max_blocks;
> + loff_t new_size = 0;
> + int ret = 0;
> + int flags;
> + int partial;
> + loff_t start, end;
> + ext4_lblk_t lblk;
> + struct address_space *mapping = inode->i_mapping;
> + unsigned int blkbits = inode->i_blkbits;
> +
> + trace_ext4_zero_range(inode, offset, len, mode);
> +
> + /*
> + * Write out all dirty pages to avoid race conditions
> + * Then release them.
> + */
> + if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> + ret = filemap_write_and_wait_range(mapping, offset,
> + offset + len - 1);
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Round up offset. This is not fallocate, we need to zero out
> + * blocks, so convert interior block aligned part of the range to
> + * unwritten and possibly manually zero out unaligned parts of the
> + * range.
> + */
> + start = round_up(offset, 1 << blkbits);
> + end = round_down((offset + len), 1 << blkbits);
> +
> + if (start < offset || end > offset + len)
> + return -EINVAL;
> + partial = (offset + len) & ((1 << blkbits) - 1);
> +
> + lblk = start >> blkbits;
> + max_blocks = (end >> blkbits);
> + if (max_blocks < lblk)
> + max_blocks = 0;
> + else
> + max_blocks -= lblk;
> +
> + flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
> + EXT4_GET_BLOCKS_CONVERT_UNWRITTEN;
> + if (mode & FALLOC_FL_KEEP_SIZE)
> + flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + /*
> + * Indirect files do not support unwritten extents
> + */
> + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> + ret = -EOPNOTSUPP;
> + goto out_mutex;
> + }
> +
> + if (!(mode & FALLOC_FL_KEEP_SIZE) &&
> + offset + len > i_size_read(inode)) {
> + new_size = offset + len;
> + ret = inode_newsize_ok(inode, new_size);
> + if (ret)
> + goto out_mutex;
> + /*
> + * If we have a partial block after EOF we have to allocate
> + * the entire block.
> + */
> + if (partial)
> + max_blocks += 1;
> + }
> +
> + if (max_blocks > 0) {
> +
> + /* Now release the pages and zero block aligned part of pages*/
> + truncate_pagecache_range(inode, start, end - 1);
> +
> + /* Wait for all existing dio workers, newcomers will block on i_mutex */
> + ext4_inode_block_unlocked_dio(inode);
> + inode_dio_wait(inode);
> +
> + /*
> + * Remove entire range from the extent status tree.
> + */
> + ret = ext4_es_remove_extent(inode, lblk, max_blocks);
> + if (ret)
> + goto out_dio;
> +
> + ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags,
> + mode);
> + if (ret)
> + goto out_dio;
> + }
> +
> + handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + ext4_std_error(inode->i_sb, ret);
> + goto out_dio;
> + }
> +
> + inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> +
> + if (!ret && new_size) {
> + if (new_size > i_size_read(inode))
> + i_size_write(inode, new_size);
> + if (new_size > EXT4_I(inode)->i_disksize)
> + ext4_update_i_disksize(inode, new_size);
> + }
> + ext4_mark_inode_dirty(handle, inode);
> +
> + /* Zero out partial block at the edges of the range */
> + ret = ext4_zero_partial_blocks(handle, inode, offset, len);
> +
> + if (file->f_flags & O_SYNC)
> + ext4_handle_sync(handle);
> +
> + ext4_journal_stop(handle);
> +out_dio:
> + ext4_inode_resume_unlocked_dio(inode);
> +out_mutex:
> + mutex_unlock(&inode->i_mutex);
> + return ret;
> +}
> +
> /*
> * preallocate space for a file. This implements ext4's fallocate file
> * operation, which gets called from sys_fallocate system call.
> @@ -4589,7 +4819,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> unsigned int blkbits = inode->i_blkbits;
>
> /* Return error if mode is not supported */
> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> + FALLOC_FL_ZERO_RANGE))
> return -EOPNOTSUPP;
>
> if (mode & FALLOC_FL_PUNCH_HOLE)
> @@ -4606,6 +4837,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> return -EOPNOTSUPP;
>
> + if (mode & FALLOC_FL_ZERO_RANGE)
> + return ext4_zero_range(file, offset, len, mode);
> +
> trace_ext4_fallocate_enter(inode, offset, len, mode);
> lblk = offset >> blkbits;
> /*
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6e39895..e64807f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -503,6 +503,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> {
> struct extent_status es;
> int retval;
> + int ret = 0;
> #ifdef ES_AGGRESSIVE_TEST
> struct ext4_map_blocks orig_map;
>
> @@ -552,7 +553,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> EXT4_GET_BLOCKS_KEEP_SIZE);
> }
> if (retval > 0) {
> - int ret;
> unsigned int status;
>
> if (unlikely(retval != map->m_len)) {
> @@ -579,7 +579,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>
> found:
> if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> - int ret = check_block_validity(inode, map);
> + ret = check_block_validity(inode, map);
> if (ret != 0)
> return ret;
> }
> @@ -596,7 +596,13 @@ found:
> * with buffer head unmapped.
> */
> if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED)
> - return retval;
> + /*
> + * If we need to convert extent to unwritten
> + * we continue and do the actual work in
> + * ext4_ext_map_blocks()
> + */
> + if (!(flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN))
> + return retval;
>
> /*
> * Here we clear m_flags because after allocating an new extent,
> @@ -652,7 +658,6 @@ found:
> ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);
>
> if (retval > 0) {
> - int ret;
> unsigned int status;
>
> if (unlikely(retval != map->m_len)) {
> @@ -687,7 +692,7 @@ found:
> has_zeroout:
> up_write((&EXT4_I(inode)->i_data_sem));
> if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
> - int ret = check_block_validity(inode, map);
> + ret = check_block_validity(inode, map);
> if (ret != 0)
> return ret;
> }
> @@ -3501,7 +3506,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> if (!S_ISREG(inode->i_mode))
> return -EOPNOTSUPP;
>
> - trace_ext4_punch_hole(inode, offset, length);
> + trace_ext4_punch_hole(inode, offset, length, 0);
>
> /*
> * Write out all dirty pages to avoid race conditions
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 451e020..7bb26aa 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -71,7 +71,8 @@ struct extent_status;
> #define show_falloc_mode(mode) __print_flags(mode, "|", \
> { FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \
> { FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \
> - { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"})
> + { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"}, \
> + { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"})
>
>
> TRACE_EVENT(ext4_free_inode,
> @@ -1333,7 +1334,7 @@ TRACE_EVENT(ext4_direct_IO_exit,
> __entry->rw, __entry->ret)
> );
>
> -TRACE_EVENT(ext4_fallocate_enter,
> +DECLARE_EVENT_CLASS(ext4__fallocate_mode,
> TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
>
> TP_ARGS(inode, offset, len, mode),
> @@ -1341,23 +1342,45 @@ TRACE_EVENT(ext4_fallocate_enter,
> TP_STRUCT__entry(
> __field( dev_t, dev )
> __field( ino_t, ino )
> - __field( loff_t, pos )
> - __field( loff_t, len )
> + __field( loff_t, offset )
> + __field( loff_t, len )
> __field( int, mode )
> ),
>
> TP_fast_assign(
> __entry->dev = inode->i_sb->s_dev;
> __entry->ino = inode->i_ino;
> - __entry->pos = offset;
> + __entry->offset = offset;
> __entry->len = len;
> __entry->mode = mode;
> ),
>
> - TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %s",
> + TP_printk("dev %d,%d ino %lu offset %lld len %lld mode %s",
> MAJOR(__entry->dev), MINOR(__entry->dev),
> - (unsigned long) __entry->ino, __entry->pos,
> - __entry->len, show_falloc_mode(__entry->mode))
> + (unsigned long) __entry->ino,
> + __entry->offset, __entry->len,
> + show_falloc_mode(__entry->mode))
> +);
> +
> +DEFINE_EVENT(ext4__fallocate_mode, ext4_fallocate_enter,
> +
> + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
> +
> + TP_ARGS(inode, offset, len, mode)
> +);
> +
> +DEFINE_EVENT(ext4__fallocate_mode, ext4_punch_hole,
> +
> + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
> +
> + TP_ARGS(inode, offset, len, mode)
> +);
> +
> +DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range,
> +
> + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode),
> +
> + TP_ARGS(inode, offset, len, mode)
> );
>
> TRACE_EVENT(ext4_fallocate_exit,
> @@ -1389,31 +1412,6 @@ TRACE_EVENT(ext4_fallocate_exit,
> __entry->ret)
> );
>
> -TRACE_EVENT(ext4_punch_hole,
> - TP_PROTO(struct inode *inode, loff_t offset, loff_t len),
> -
> - TP_ARGS(inode, offset, len),
> -
> - TP_STRUCT__entry(
> - __field( dev_t, dev )
> - __field( ino_t, ino )
> - __field( loff_t, offset )
> - __field( loff_t, len )
> - ),
> -
> - TP_fast_assign(
> - __entry->dev = inode->i_sb->s_dev;
> - __entry->ino = inode->i_ino;
> - __entry->offset = offset;
> - __entry->len = len;
> - ),
> -
> - TP_printk("dev %d,%d ino %lu offset %lld len %lld",
> - MAJOR(__entry->dev), MINOR(__entry->dev),
> - (unsigned long) __entry->ino,
> - __entry->offset, __entry->len)
> -);
> -
> TRACE_EVENT(ext4_unlink_enter,
> TP_PROTO(struct inode *parent, struct dentry *dentry),
>
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
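The alignment arithmetic in ext4_zero_range() in the patch quoted above can be followed more easily with concrete numbers. Below is a minimal userspace sketch of that split, not the kernel code itself: the struct zero_split type and the split_zero_range() helper are hypothetical names made up for illustration, and the sketch only mirrors the round_up/round_down, lblk, max_blocks and partial computations as they appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the range split in ext4_zero_range(). */
struct zero_split {
	int64_t  start;      /* offset rounded up to a block boundary */
	int64_t  end;        /* offset + len rounded down to a block boundary */
	uint32_t lblk;       /* first logical block that is fully covered */
	uint32_t max_blocks; /* number of fully covered blocks, may be 0 */
	uint32_t partial;    /* bytes of the range in the trailing partial block */
};

static struct zero_split split_zero_range(int64_t offset, int64_t len,
					  unsigned int blkbits)
{
	struct zero_split s;
	int64_t blksize = (int64_t)1 << blkbits;
	uint32_t last;

	assert(offset >= 0 && len > 0 && blkbits < 31);

	/* Interior, block-aligned part: candidate for unwritten conversion. */
	s.start = (offset + blksize - 1) & ~(blksize - 1);
	s.end = (offset + len) & ~(blksize - 1);

	/* Non-zero when the range does not end on a block boundary. */
	s.partial = (uint32_t)((offset + len) & (blksize - 1));

	s.lblk = (uint32_t)(s.start >> blkbits);
	last = (uint32_t)(s.end >> blkbits);
	/* A range inside a single block has no fully covered block at all. */
	s.max_blocks = last < s.lblk ? 0 : last - s.lblk;
	return s;
}
```

With 4096-byte blocks, offset 1000 and len 10000 give start 4096, end 8192, one full block starting at logical block 1, and 2808 trailing bytes that would have to be zeroed by hand rather than converted.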
On Tue, Feb 18, 2014 at 12:01:38PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> >
> > It can be used to convert a range of file to zeros preferably without
> > issuing data IO. Blocks should be preallocated for the regions that span
> > holes in the file, and the entire range is preferable converted to
> > unwritten extents - even though file system may choose to zero out the
> > extent or do whatever which will result in reading zeros from the range
> > while the range remains allocated for the file.
> >
> > This can be also used to preallocate blocks past EOF in the same way as
> > with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> > size to remain the same.
> >
> > You can test this feature yourself using xfstests, of fallocate(1) however
> > you'll need patches for util_linux, xfsprogs and xfstests which you
> > can find here:
> >
> > http://people.redhat.com/lczerner/zero_range/
> >
> > I'll post the patches after we agree and merge the kernel functionality.
> >
> > I tested this mostly with a subset of xfstests using fsx and fsstress and
> > even with new generic/290 which is just a copy of xfs/290 usinz fzero
> > command for xfs_io instead of zero (which uses ioctl). I was testing on
> > x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
>
> You also want to convert xfs/242 to be a generic test - it uses the
> _generic_test_punch helper to test all the corner cases across
> different extent type transitions.
>
> > ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083 generic/117 generic/068 generic/231 generic/127 generic/091 generic/075 generic/112 generic/263 generic/091 generic/075 generic/256 generic/255 generic/316 generic/300 generic/290;
FWIW, if that's a group of tests you consider good for testing
extent tree modifications, then can you create a test group for
these by adding "extent" to each of the tests in the group file?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, 18 Feb 2014, Dave Chinner wrote:
> Date: Tue, 18 Feb 2014 19:33:24 +1100
> From: Dave Chinner <[email protected]>
> To: Lukas Czerner <[email protected]>
> Cc: [email protected], [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> On Tue, Feb 18, 2014 at 12:01:38PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> > > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> > >
> > > It can be used to convert a range of file to zeros preferably without
> > > issuing data IO. Blocks should be preallocated for the regions that span
> > > holes in the file, and the entire range is preferable converted to
> > > unwritten extents - even though file system may choose to zero out the
> > > extent or do whatever which will result in reading zeros from the range
> > > while the range remains allocated for the file.
> > >
> > > This can be also used to preallocate blocks past EOF in the same way as
> > > with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> > > size to remain the same.
> > >
> > > You can test this feature yourself using xfstests, of fallocate(1) however
> > > you'll need patches for util_linux, xfsprogs and xfstests which you
> > > can find here:
> > >
> > > http://people.redhat.com/lczerner/zero_range/
> > >
> > > I'll post the patches after we agree and merge the kernel functionality.
> > >
> > > I tested this mostly with a subset of xfstests using fsx and fsstress and
> > > even with new generic/290 which is just a copy of xfs/290 usinz fzero
> > > command for xfs_io instead of zero (which uses ioctl). I was testing on
> > > x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
> >
> > You also want to convert xfs/242 to be a generic test - it uses the
> > _generic_test_punch helper to test all the corner cases across
> > different extent type transitions.
That was the plan originally, however it uses xfs bmap, which is not
supported on other file systems. But I can take a closer look and
possibly port it to generic as well.
> >
> > > ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083 generic/117 generic/068 generic/231 generic/127 generic/091 generic/075 generic/112 generic/263 generic/091 generic/075 generic/256 generic/255 generic/316 generic/300 generic/290;
>
> FWIW. if that's a group of tests you consider good for testing
> extent tree modifications, then can you create a test group for
> these by adding "extent" to each of the tests in the group file?
I've made patches adding support for FALLOC_FL_ZERO_RANGE to fsx
and fsstress, so those tests are mostly the ones that use fsx and
fsstress.
It would require a more careful look to identify tests which are
useful for extent tree modification. I'll see what I can do.
>
> Cheers,
>
> Dave.
>
On Tue, Feb 18, 2014 at 10:09:48AM +0100, Lukáš Czerner wrote:
> On Tue, 18 Feb 2014, Dave Chinner wrote:
>
> > Date: Tue, 18 Feb 2014 19:33:24 +1100
> > From: Dave Chinner <[email protected]>
> > To: Lukas Czerner <[email protected]>
> > Cc: [email protected], [email protected], [email protected],
> > [email protected]
> > Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> > fallocate
> >
> > On Tue, Feb 18, 2014 at 12:01:38PM +1100, Dave Chinner wrote:
> > > On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> > > > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > > > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> > > >
> > > > It can be used to convert a range of file to zeros preferably without
> > > > issuing data IO. Blocks should be preallocated for the regions that span
> > > > holes in the file, and the entire range is preferable converted to
> > > > unwritten extents - even though file system may choose to zero out the
> > > > extent or do whatever which will result in reading zeros from the range
> > > > while the range remains allocated for the file.
> > > >
> > > > This can be also used to preallocate blocks past EOF in the same way as
> > > > with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> > > > size to remain the same.
> > > >
> > > > You can test this feature yourself using xfstests, of fallocate(1) however
> > > > you'll need patches for util_linux, xfsprogs and xfstests which you
> > > > can find here:
> > > >
> > > > http://people.redhat.com/lczerner/zero_range/
> > > >
> > > > I'll post the patches after we agree and merge the kernel functionality.
> > > >
> > > > I tested this mostly with a subset of xfstests using fsx and fsstress and
> > > > even with new generic/290 which is just a copy of xfs/290 usinz fzero
> > > > command for xfs_io instead of zero (which uses ioctl). I was testing on
> > > > x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
> > >
> > > You also want to convert xfs/242 to be a generic test - it uses the
> > > _generic_test_punch helper to test all the corner cases across
> > > different extent type transitions.
>
> That was the plan originally, however it uses xfs bmap which is not
> supported for other file systems. But I can take a better look and
> possibly port it to generic as well.
Simply pass fiemap rather than "bmap -v" like all the other falloc
tests do. The output of the xfs_io fiemap and bmap commands is
pretty much identical so this shouldn't be an issue.
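For readers who want to check the extent state outside of xfs_io, here is a minimal sketch of a FIEMAP query in C. It assumes a Linux system with <linux/fiemap.h> available; the extent_state() and map_extents() helpers are hypothetical names for illustration and are not part of xfstests or xfsprogs. After a zero-range operation on a file system that converts extents, the affected range would be expected to show up with FIEMAP_EXTENT_UNWRITTEN set.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Hypothetical helper: classify one extent from its FIEMAP flags. */
static const char *extent_state(uint32_t fe_flags)
{
	if (fe_flags & FIEMAP_EXTENT_UNWRITTEN)
		return "unwritten";
	if (fe_flags & FIEMAP_EXTENT_DELALLOC)
		return "delalloc";
	return "data";
}

/*
 * Hypothetical helper: copy up to max extents of the whole file into out.
 * Returns the number of extents copied, or -1 with errno set.
 */
static int map_extents(int fd, struct fiemap_extent *out, unsigned int max)
{
	size_t sz = sizeof(struct fiemap) + max * sizeof(struct fiemap_extent);
	struct fiemap *fm;
	int n, saved;

	assert(out != NULL && max > 0);

	fm = calloc(1, sz);
	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;    /* flush dirty data first */
	fm->fm_extent_count = max;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		saved = errno;
		free(fm);
		errno = saved;
		return -1;
	}

	n = (int)fm->fm_mapped_extents;
	memcpy(out, fm->fm_extents, (size_t)n * sizeof(*out));
	free(fm);
	return n;
}
```

Each returned struct fiemap_extent carries fe_logical, fe_physical, fe_length and fe_flags, which is roughly the information a test would compare before and after the operation.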
> > > > ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083 generic/117 generic/068 generic/231 generic/127 generic/091 generic/075 generic/112 generic/263 generic/091 generic/075 generic/256 generic/255 generic/316 generic/300 generic/290;
> >
> > FWIW. if that's a group of tests you consider good for testing
> > extent tree modifications, then can you create a test group for
> > these by adding "extent" to each of the tests in the group file?
>
> I've made patches adding support for FALLOC_FL_ZERO_RANGE into fsx
> and fsstress so those tests are mostly tests which are using fsx and
> fsstress.
Ok, so it's a "fallocate" test group, then?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, 18 Feb 2014, Dave Chinner wrote:
> Date: Tue, 18 Feb 2014 20:41:42 +1100
> From: Dave Chinner <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: [email protected], [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> On Tue, Feb 18, 2014 at 10:09:48AM +0100, Lukáš Czerner wrote:
> > On Tue, 18 Feb 2014, Dave Chinner wrote:
> >
> > > Date: Tue, 18 Feb 2014 19:33:24 +1100
> > > From: Dave Chinner <[email protected]>
> > > To: Lukas Czerner <[email protected]>
> > > Cc: [email protected], [email protected], [email protected],
> > > [email protected]
> > > Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> > > fallocate
> > >
> > > On Tue, Feb 18, 2014 at 12:01:38PM +1100, Dave Chinner wrote:
> > > > On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> > > > > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > > > > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> > > > >
> > > > > It can be used to convert a range of file to zeros preferably without
> > > > > issuing data IO. Blocks should be preallocated for the regions that span
> > > > > holes in the file, and the entire range is preferable converted to
> > > > > unwritten extents - even though file system may choose to zero out the
> > > > > extent or do whatever which will result in reading zeros from the range
> > > > > while the range remains allocated for the file.
> > > > >
> > > > > This can be also used to preallocate blocks past EOF in the same way as
> > > > > with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> > > > > size to remain the same.
> > > > >
> > > > > You can test this feature yourself using xfstests, of fallocate(1) however
> > > > > you'll need patches for util_linux, xfsprogs and xfstests which you
> > > > > can find here:
> > > > >
> > > > > http://people.redhat.com/lczerner/zero_range/
> > > > >
> > > > > I'll post the patches after we agree and merge the kernel functionality.
> > > > >
> > > > > I tested this mostly with a subset of xfstests using fsx and fsstress and
> > > > > even with new generic/290 which is just a copy of xfs/290 usinz fzero
> > > > > command for xfs_io instead of zero (which uses ioctl). I was testing on
> > > > > x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
> > > >
> > > > You also want to convert xfs/242 to be a generic test - it uses the
> > > > _generic_test_punch helper to test all the corner cases across
> > > > different extent type transitions.
> >
> > That was the plan originally, however it uses xfs bmap which is not
> > supported for other file systems. But I can take a better look and
> > possibly port it to generic as well.
>
> Simply pass fiemap rather than "bmap -v" like all the other falloc
> tests do. The output of the xfs_io fiemap and bmap commands is
> pretty much identical so this shouldn't be an issue.
ok, will do.
>
> > > > > ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083 generic/117 generic/068 generic/231 generic/127 generic/091 generic/075 generic/112 generic/263 generic/091 generic/075 generic/256 generic/255 generic/316 generic/300 generic/290;
> > >
> > > FWIW. if that's a group of tests you consider good for testing
> > > extent tree modifications, then can you create a test group for
> > > these by adding "extent" to each of the tests in the group file?
> >
> > I've made patches adding support for FALLOC_FL_ZERO_RANGE into fsx
> > and fsstress so those tests are mostly tests which are using fsx and
> > fsstress.
>
> Ok, so it's a "fallocate" test group, then?
More like an "fsx_fsstress" group, which might sound like a terrible name
for the group, but it explains it quite well. So if you do not have
anything against that, I'll call the new group "fsx_fsstress".
Thanks!
-Lukas
>
> Cheers,
>
> Dave.
>
On Tue, Feb 18, 2014 at 01:04:24PM +0100, Lukáš Czerner wrote:
>
> > Ok, so it's a "fallocate" test group, then?
>
> More like "fsx_fsstress" group, which might sound as a terrible name
> for the group but it explains it quite well. So if you do not have
> anything against that I'll call the new group "fsx_fsstress"
How about a "block_map" group? I like Dave's suggestion about naming
the group after what it is trying to test, as opposed to how it does
that testing. This is also consistent with how the other test groups
are named in xfstests.
However, extents are an implementation strategy, and you might just as
easily use this test to verify whether or not the punch hole
functionality for indirect block maps works correctly.
What the tests using fsx and fsstress have in common is that they are
a great way of stress testing whatever the file system uses for
creating and maintaining the translation map from (inode, logical
block) to physical block, so that's why perhaps "block_map" might be a
good test group name.
Regards,
- Ted
On Tue, 18 Feb 2014, Theodore Ts'o wrote:
> Date: Tue, 18 Feb 2014 09:23:05 -0500
> From: Theodore Ts'o <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: Dave Chinner <[email protected]>, [email protected],
> [email protected], [email protected]
> Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> On Tue, Feb 18, 2014 at 01:04:24PM +0100, Lukáš Czerner wrote:
> >
> > > Ok, so it's a "fallocate" test group, then?
> >
> > More like "fsx_fsstress" group, which might sound as a terrible name
> > for the group but it explains it quite well. So if you do not have
> > anything against that I'll call the new group "fsx_fsstress"
>
> How about "block_map" group? I like Dave's suggestion about naming
> the group after what it is trying to test, as opposed to how it does
> that testing. This is also consistent with how the other tests groups
> are named in xfstests.
>
> However, extents are an implementation strategy, and you might just as
> easily use this test to verify whether or not the punch hole
> functionality for indirect block maps worked correctly.
(It does not :) But I am still having trouble deciphering Al Viro's
code ;)
>
> What I think using fsx and fstress together have in common is that
> it's a great way of stress testing whatever the file system uses for
> creating and maintaining the translation map between (inode, logical
> block) to physical block, so that's why perhaps "block_map" might be a
> good test group name.
To be honest, the "block_map" group name does not mean anything to me.
- "fallocate" is not really the right name, as the group does much more
than that.
- "extents" is not the right name, as there is not really anything
extent-specific.
- "fsx_fsstress" gives information about how things are tested, but
it's not immediately clear what it is good for.
So I do not know, and frankly I do not care very much about the name
of this group. If anyone has a strong opinion about the name, feel
free to create such a group.
Thanks!
-Lukas
>
> Regards,
>
> - Ted
>
On Tue, Feb 18, 2014 at 03:42:10PM +0100, Lukáš Czerner wrote:
>
> To be honest "block_map" group name does not mean anything to me.
Well, traditionally the "bmap" function is what is used to map from an
inode and a logical block to a physical block. I thought block_map
was more obvious, but maybe this is a case where "bmap" would be more
understandable? I dunno.
- Ted
Hi Lukas,
On 17.02.2014 16:08, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of file to zeros preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferable converted to
> unwritten extents - even though file system may choose to zero out the
> extent or do whatever which will result in reading zeros from the range
> while the range remains allocated for the file.
>
> This can be also used to preallocate blocks past EOF in the same way as
> with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
> size to remain the same.
>
> You can test this feature yourself using xfstests, of fallocate(1) however
> you'll need patches for util_linux, xfsprogs and xfstests which you
> can find here:
>
> http://people.redhat.com/lczerner/zero_range/
Thank you for your great work!
I've tested it on both xfs and ext4.
(Test environment: Fedora 20, Kernel 3.14-rc3 + your patches,
util-linux v2.24-232-g3c7ed4a + your patches)
It seems to work on xfs without problems.
On ext4, however, immediately after running "fallocate -z",
the kernel crashes with the following error:
------------[ cut here ]------------
kernel BUG at fs/ext4/ext4_extents.h:193!
invalid opcode: 0000 [#1] SMP
Modules linked in: 9pnet_virtio virtio_net 9pnet virtio_blk virtio_pci virtio_ring virtio
CPU: 2 PID: 2959 Comm: fallocate Not tainted 3.14.0-rc3+ #34
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800da97da10 ti: ffff880119068000 task.ti: ffff880119068000
RIP: 0010:[<ffffffff813694c9>] [<ffffffff813694c9>] ext4_ext_map_blocks+0x2899/0x2940
RSP: 0018:ffff880119069c50 EFLAGS: 00010202
RAX: 0000000000000003 RBX: ffff880036fa8470 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff82120e98
RBP: ffff880119069d30 R08: ffff88011975d900 R09: 011ad15618080000
R10: fec72ef09c4d8602 R11: 0000000000008000 R12: ffff880119069dd0
R13: 0000000000000403 R14: 0000000000000001 R15: ffff880118c6700c
FS: 00007fa54a0ba740(0000) GS:ffff88011fc40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003cdbf6f7e0 CR3: 0000000119077000 CR4: 00000000000006e0
Stack:
0000000000000000 0000000000008000 ffff880036fa86c8 0000000000000000
ffff880100000000 0000800081384dee 0000000000000001 ffff880000000000
0000000000008800 0000000000000000 ffff880036f6f000 ffff88011975d900
Call Trace:
[<ffffffff81385baa>] ? ext4_es_insert_extent+0x15a/0x240
[<ffffffff813669ae>] ? ext4_find_delalloc_range+0x1e/0xb0
[<ffffffff81322d3f>] ext4_map_blocks+0x25f/0x830
[<ffffffff81369764>] ? ext4_alloc_file_blocks+0xc4/0x1e0
[<ffffffff813697da>] ext4_alloc_file_blocks+0x13a/0x1e0
[<ffffffff81369e9f>] ext4_zero_range+0x61f/0x870
[<ffffffff8136a5d3>] ext4_fallocate+0x4e3/0x6c0
[<ffffffff81239675>] ? __sb_start_write+0x145/0x1a0
[<ffffffff8120ef00>] ? kmem_cache_free+0x2f0/0x3f0
[<ffffffff81246ca0>] ? final_putname+0x30/0x60
[<ffffffff812326a7>] do_fallocate+0x1e7/0x290
[<ffffffff812327c9>] SyS_fallocate+0x79/0xc0
[<ffffffff81ae7de9>] system_call_fastpath+0x16/0x1b
Code: ba dc 05 00 00 48 c7 c6 b0 91 c7 81 48 89 df 89 04 24 31 c0 e8 99 83 fe ff e9 f5 f8 ff ff 48 83 05 34 b3 f5 00 01 e9 0a db ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 45 89 d1 49 c7 c0 48 22 e5 81 31
RIP [<ffffffff813694c9>] ext4_ext_map_blocks+0x2899/0x2940
RSP <ffff880119069c50>
---[ end trace ba21204a3a98fbdc ]---
Regards,
Dongsu
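For anyone reproducing this without the patched util-linux, the operation behind "fallocate -z" can be sketched as a direct fallocate(2) call. This is only a minimal sketch under stated assumptions: zero_range() is a hypothetical helper name, the fallback of writing zeros by hand when the kernel or file system reports the flag as unsupported is an addition of this sketch and not something the patch set does, and the 0x10 fallback definition assumes the flag value used in mainline headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/falloc.h>

#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10 /* assumed value, for older headers */
#endif

/*
 * Hypothetical helper: make [offset, offset + len) read back as zeros.
 * Tries FALLOC_FL_ZERO_RANGE first; if that is not supported, falls back
 * to writing zeros, which does issue data IO. Returns 0 or -1 with errno.
 */
static int zero_range(int fd, off_t offset, off_t len)
{
	char zeros[4096];

	assert(offset >= 0 && len > 0);

	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len) == 0)
		return 0;
	if (errno != EOPNOTSUPP && errno != ENOSYS)
		return -1;

	/* Fallback path: plain zero writes over the same range. */
	memset(zeros, 0, sizeof(zeros));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
			       (size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, offset);

		if (n <= 0)
			return -1;
		offset += n;
		len -= n;
	}
	return 0;
}
```

FALLOC_FL_KEEP_SIZE can be ORed into the mode when the range extends past EOF and the inode size should stay the same; the sketch leaves it out because the zero-writing fallback could not honour it.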
> I'll post the patches after we agree and merge the kernel functionality.
>
> I tested this mostly with a subset of xfstests using fsx and fsstress and
> even with new generic/290 which is just a copy of xfs/290 usinz fzero
> command for xfs_io instead of zero (which uses ioctl). I was testing on
> x86_64 and ppc64 with block sizes of 1024, 2048 and 4096.
>
> ./check generic/076 generic/232 generic/013 generic/070 generic/269 generic/083 generic/117 generic/068 generic/231 generic/127 generic/091 generic/075 generic/112 generic/263 generic/091 generic/075 generic/256 generic/255 generic/316 generic/300 generic/290;
>
> Note that there is a work in progress on FALLOC_FL_COLLAPSE_RANGE which
> touches the same area as this pach set does, so we should figure out
> which one should go first and modify the other on top of it.
>
> Thanks!
> -Lukas
>
> --
> [PATCH 1/6] ext4: Update inode i_size after the preallocation
> [PATCH 2/6] ext4: refactor ext4_fallocate code
> [PATCH 3/6] ext4: translate fallocate mode bits to strings
> [PATCH 4/6] fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
> [PATCH 5/6] ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
> [PATCH 6/6] xfs: Add support for FALLOC_FL_ZERO_RANGE
>
> fs/ext4/ext4.h | 3 +
> fs/ext4/extents.c | 430 ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
> fs/ext4/inode.c | 17 ++-
> fs/open.c | 7 +-
> fs/xfs/xfs_file.c | 10 +-
> include/trace/events/ext4.h | 67 ++++++-----
> include/uapi/linux/falloc.h | 1 +
> 7 files changed, 393 insertions(+), 142 deletions(-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 19 Feb 2014, Dongsu Park wrote:
> Date: Wed, 19 Feb 2014 15:52:39 +0100
> From: Dongsu Park <[email protected]>
> To: Lukas Czerner <[email protected]>
> Cc: [email protected], [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> Hi Lukas,
>
> On 17.02.2014 16:08, Lukas Czerner wrote:
> > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> >
> > It can be used to convert a range of a file to zeros, preferably without
> > issuing data IO. Blocks should be preallocated for the regions that span
> > holes in the file, and the entire range is preferably converted to
> > unwritten extents, even though the file system may choose to zero out the
> > extent or do whatever else results in reading zeros from the range
> > while the range remains allocated for the file.
> >
> > This can also be used to preallocate blocks past EOF in the same way as
> > with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
> > size to remain the same.
> >
> > You can test this feature yourself using xfstests or fallocate(1); however,
> > you'll need patches for util-linux, xfsprogs and xfstests, which you
> > can find here:
> >
> > http://people.redhat.com/lczerner/zero_range/
>
> Thank you for your great work!
> I've tested it both on xfs and on ext4.
> (Test environment: Fedora 20, Kernel 3.14-rc3 + your patches,
> util-linux v2.24-232-g3c7ed4a + your patches)
>
> It seems to work with xfs without problem.
> On ext4, however, immediately after doing "fallocate -z",
> kernel crashes with the following error:
That's weird. I have not seen that before, even after running tests
for several days, and fallocate -z works as expected for me.
Are you able to reproduce it? Can you tell me the steps to
reproduce this? The problem is that the extent we're trying to mark
as uninitialized has zero length....
Ah...I can probably see what is going on. For some inexplicable
reason I am forgetting to take i_data_sem which means that we're
probably racing with truncate or something else.
Thanks a lot for letting me know. If you can, please send me a
reproducer for your case because, as I said, I have not seen this
before.
Thanks!
-Lukas
On Wed, Feb 19, 2014 at 4:18 PM, Lukáš Czerner <[email protected]> wrote:
> On Wed, 19 Feb 2014, Dongsu Park wrote:
> Are you able to reproduce it? Can you tell me the steps to
> reproduce this? The problem is that the extent we're trying to mark
> as uninitialized has zero length....
>
> Ah...I can probably see what is going on. For some inexplicable
> reason I am forgetting to take i_data_sem which means that we're
> probably racing with truncate or something else.
>
> Thanks a lot for letting me know. If you can, please send me a
> reproducer for your case because, as I said, I have not seen this
> before.
Yes, it's reliably reproducible.
What I'm doing for testing is quite simple:
(/dev/vdb is a test block device, 16GiB in size)
# mke2fs -t ext4 /dev/vdb
# mkdir -p /mnt/test1
# mount -t ext4 -o discard /dev/vdb /mnt/test1
# dd if=/dev/urandom of=/mnt/test1/file1 bs=2G count=1
# fallocate -z -l 2G /mnt/test1/file1
Then the kernel crashes immediately.
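For anyone who wants to trigger the same operation without the patched
fallocate(1), the request can be issued directly through fallocate(2).
The following is only a minimal sketch, assuming a kernel and headers
that know FALLOC_FL_ZERO_RANGE (the fallback value 0x10 is the one the
UAPI header uses). The helper names zero_range() and range_is_zero()
are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE              /* exposes fallocate() in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>        /* FALLOC_FL_ZERO_RANGE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>
#include <unistd.h>

/* Older kernel headers lack the flag; 0x10 is the UAPI value. */
#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10
#endif

/*
 * Hypothetical helper: ask the file system to make [offset, offset+len)
 * read back as zeros while keeping the blocks allocated.
 * With keep_size != 0 the inode size is left unchanged even if the
 * range extends past EOF.
 * Returns 0 on success or a negative errno (for example -EOPNOTSUPP
 * when the file system does not implement the flag).
 */
int zero_range(int fd, off_t offset, off_t len, int keep_size)
{
    int mode = FALLOC_FL_ZERO_RANGE;

    assert(fd >= 0);
    if (keep_size)
        mode |= FALLOC_FL_KEEP_SIZE;
    if (fallocate(fd, mode, offset, len) == -1)
        return -errno;
    return 0;
}

/*
 * Hypothetical helper: read the range back and check it.
 * Returns 1 if every byte read is zero (a range cut short by EOF
 * counts as zero), 0 if a non-zero byte is found, negative errno
 * on a read error.
 */
int range_is_zero(int fd, off_t offset, off_t len)
{
    unsigned char buf[4096];

    while (len > 0) {
        size_t want = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);
        ssize_t got = pread(fd, buf, want, offset);

        if (got < 0)
            return -errno;
        if (got == 0)
            return 1;            /* reached EOF, nothing more to check */
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        offset += got;
        len -= got;
    }
    return 1;
}
```

Calling zero_range(fd, 0, 2LL << 30, 0) on the file written by dd above
would correspond roughly to the "fallocate -z -l 2G" step, so on an
unfixed kernel it should be expected to hit the same BUG.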
Cheers,
Dongsu
_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs
On Wed, 19 Feb 2014, Dongsu Park wrote:
> Date: Wed, 19 Feb 2014 16:51:23 +0100
> From: Dongsu Park <[email protected]>
> To: Lukáš Czerner <[email protected]>
> Cc: linux-ext4 <[email protected]>, tytso <[email protected]>,
> linux-fsdevel <[email protected]>, xfs <[email protected]>
> Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> On Wed, Feb 19, 2014 at 4:18 PM, Lukáš Czerner <[email protected]> wrote:
> > On Wed, 19 Feb 2014, Dongsu Park wrote:
> > Are you able to reproduce it? Can you tell me the steps to
> > reproduce this? The problem is that the extent we're trying to mark
> > as uninitialized has zero length....
> >
> > Ah...I can probably see what is going on. For some inexplicable
> > reason I am forgetting to take i_data_sem which means that we're
> > probably racing with truncate or something else.
> >
> > Thanks a lot for letting me know. If you can, please send me a
> > reproducer for your case because, as I said, I have not seen this
> > before.
>
> Yes, it's reliably reproducible.
> What I'm doing for testing is quite simple:
> (/dev/vdb is a test block device, 16GiB in size)
>
> # mke2fs -t ext4 /dev/vdb
> # mkdir -p /mnt/test1
> # mount -t ext4 -o discard /dev/vdb /mnt/test1
> # dd if=/dev/urandom of=/mnt/test1/file1 bs=2G count=1
> # fallocate -z -l 2G /mnt/test1/file1
>
> Then the kernel crashes immediately.
Oh, now I know where the problem really is. It's not about the
locking at all. Initialized and uninitialized extents have different
maximum sizes.
So we cannot convert an initialized extent of the maximum size to an
uninitialized extent right away. We have to split it.
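
To illustrate the size mismatch, here is a small userspace sketch. It
assumes the ext4 on-disk convention that the 16-bit ee_len field stores
lengths up to 32768 for initialized extents and uses values above 32768
to flag an uninitialized extent. The two constants are recalled from
fs/ext4/ext4_extents.h of that era and are not quoted from this thread;
all function names are hypothetical and only model the arithmetic, not
the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror fs/ext4/ext4_extents.h around 3.14. */
#define EXT_INIT_MAX_LEN   (1U << 15)              /* 32768 blocks */
#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)  /* 32767 blocks */

/* An ee_len above EXT_INIT_MAX_LEN denotes an uninitialized extent. */
int ext_is_uninit(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

/* Real length in blocks, whatever the initialized state. */
unsigned ext_len(uint16_t ee_len)
{
    return ee_len <= EXT_INIT_MAX_LEN ? ee_len
                                      : ee_len - EXT_INIT_MAX_LEN;
}

/*
 * An extent can be flagged in place only if its length fits the
 * smaller uninitialized limit. A full 32768-block extent cannot:
 * OR-ing in the flag bit leaves 32768, which still decodes as
 * initialized, and the remaining length bits are zero.
 */
int can_mark_uninit_in_place(unsigned len)
{
    return len >= 1 && len <= EXT_UNINIT_MAX_LEN;
}

/* Models "ee_len |= EXT_INIT_MAX_LEN" for lengths that fit. */
uint16_t mark_uninit(uint16_t ee_len)
{
    assert(can_mark_uninit_in_place(ee_len));
    return (uint16_t)(ee_len | EXT_INIT_MAX_LEN);
}

/* How many uninitialized extents are needed to cover len blocks. */
unsigned uninit_extents_needed(unsigned long long len)
{
    return (unsigned)((len + EXT_UNINIT_MAX_LEN - 1) / EXT_UNINIT_MAX_LEN);
}
```

With these assumptions, a maximum-size initialized extent of 32768
blocks needs two uninitialized extents, which is why the conversion
has to split the extent first.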
Thank you, your testing is very useful!
-Lukas
On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> It can be used to convert a range of a file to zeros, preferably without
> issuing data IO. Blocks should be preallocated for the regions that span
> holes in the file, and the entire range is preferably converted to
> unwritten extents, even though the file system may choose to zero out the
> extent or do whatever else results in reading zeros from the range
> while the range remains allocated for the file.
>
> This can also be used to preallocate blocks past EOF in the same way as
> with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
> size to remain the same.
>
> You can test this feature yourself using xfstests or fallocate(1); however,
> you'll need patches for util-linux, xfsprogs and xfstests, which you
> can find here:
>
> http://people.redhat.com/lczerner/zero_range/
>
> I'll post the patches after we agree and merge the kernel functionality.
Lukas, can you post the xfstests and xfs_io changes so that they can
be reviewed? Once I can verify the behaviour is the same as
XFS_IOC_ZERO_RANGE, I'm happy to commit the VFS and XFS kernel
changes along with the xfsprogs and xfstests changes like I've just
done for the FALLOC_FL_COLLAPSE_RANGE changes.
I'd like to get all the changes to the VFS into the XFS tree so that
you can handle the ext4 integration of the two pieces of
functionality as you and Ted see fit....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, 24 Feb 2014, Dave Chinner wrote:
> Date: Mon, 24 Feb 2014 12:07:14 +1100
> From: Dave Chinner <[email protected]>
> To: Lukas Czerner <[email protected]>
> Cc: [email protected], [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
> >
> > It can be used to convert a range of a file to zeros, preferably without
> > issuing data IO. Blocks should be preallocated for the regions that span
> > holes in the file, and the entire range is preferably converted to
> > unwritten extents, even though the file system may choose to zero out the
> > extent or do whatever else results in reading zeros from the range
> > while the range remains allocated for the file.
> >
> > This can also be used to preallocate blocks past EOF in the same way as
> > with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
> > size to remain the same.
> >
> > You can test this feature yourself using xfstests or fallocate(1); however,
> > you'll need patches for util-linux, xfsprogs and xfstests, which you
> > can find here:
> >
> > http://people.redhat.com/lczerner/zero_range/
> >
> > I'll post the patches after we agree and merge the kernel functionality.
>
> Lukas, can you post the xfstests and xfs_io changes so that they can
> be reviewed? Once I can verify the behaviour is the same as
> XFS_IOC_ZERO_RANGE, I'm happy to commit the VFS and XFS kernel
> changes along with the xfsprogs and xfstests changes like I've just
> done for the FALLOC_FL_COLLAPSE_RANGE changes.
>
> I'd like to get all the changes to the VFS into the XFS tree so that
> you can handle the ext4 integration of the two pieces of
> functionality as you and Ted see fit....
>
> Cheers,
>
> Dave.
Hi Dave,
OK, I'll rebase and resend the whole series with the xfstests and
xfsprogs patches as well.
Thanks!
-Lukas
On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
Lukas, can you please also send a man page update for
FALLOC_FL_ZERO_RANGE now that it has been merged?
On Tue, 15 Apr 2014, Christoph Hellwig wrote:
> Date: Tue, 15 Apr 2014 23:36:18 -0700
> From: Christoph Hellwig <[email protected]>
> To: Lukas Czerner <[email protected]>
> Cc: Michael Kerrisk <[email protected]>, [email protected],
> [email protected], [email protected], [email protected],
> [email protected]
> Subject: Re: [PATCH 0/6][RFC] Introduce FALLOC_FL_ZERO_RANGE flag for
> fallocate
>
> On Mon, Feb 17, 2014 at 04:08:17PM +0100, Lukas Czerner wrote:
> > Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
> > functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
>
> Lukas, can you please also send a man page update for
> FALLOC_FL_ZERO_RANGE now that it has been merged?
Right, I'll do that.
Thanks!
-Lukas