2022-01-04 21:24:35

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH 1/6] f2fs: rework write preallocations

From: Eric Biggers <[email protected]>

f2fs_write_begin() assumes that all blocks were preallocated by
default unless FI_NO_PREALLOC is explicitly set. This invites data
corruption, as there are cases in which not all blocks are preallocated.
Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
buffered_io") fixed one case, but there are others remaining.

Fix up this logic by replacing this flag with FI_PREALLOCATED_ALL, which
only gets set if all blocks for the current write were preallocated.

Also clean up f2fs_preallocate_blocks(), move it to file.c, and make it
handle some of the logic that was previously in write_iter() directly.

Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/data.c | 55 ++-------------------
fs/f2fs/f2fs.h | 3 +-
fs/f2fs/file.c | 131 +++++++++++++++++++++++++++++++------------------
3 files changed, 88 insertions(+), 101 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index d8190e836a96..3db0f3049b90 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1384,53 +1384,6 @@ static int __allocate_data_block(struct dnode_of_data *dn, int seg_type)
return 0;
}

-int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
-{
- struct inode *inode = file_inode(iocb->ki_filp);
- struct f2fs_map_blocks map;
- int flag;
- int err = 0;
- bool direct_io = iocb->ki_flags & IOCB_DIRECT;
-
- map.m_lblk = F2FS_BLK_ALIGN(iocb->ki_pos);
- map.m_len = F2FS_BYTES_TO_BLK(iocb->ki_pos + iov_iter_count(from));
- if (map.m_len > map.m_lblk)
- map.m_len -= map.m_lblk;
- else
- map.m_len = 0;
-
- map.m_next_pgofs = NULL;
- map.m_next_extent = NULL;
- map.m_seg_type = NO_CHECK_TYPE;
- map.m_may_create = true;
-
- if (direct_io) {
- map.m_seg_type = f2fs_rw_hint_to_seg_type(iocb->ki_hint);
- flag = f2fs_force_buffered_io(inode, iocb, from) ?
- F2FS_GET_BLOCK_PRE_AIO :
- F2FS_GET_BLOCK_PRE_DIO;
- goto map_blocks;
- }
- if (iocb->ki_pos + iov_iter_count(from) > MAX_INLINE_DATA(inode)) {
- err = f2fs_convert_inline_inode(inode);
- if (err)
- return err;
- }
- if (f2fs_has_inline_data(inode))
- return err;
-
- flag = F2FS_GET_BLOCK_PRE_AIO;
-
-map_blocks:
- err = f2fs_map_blocks(inode, &map, 1, flag);
- if (map.m_len > 0 && err == -ENOSPC) {
- if (!direct_io)
- set_inode_flag(inode, FI_NO_PREALLOC);
- err = 0;
- }
- return err;
-}
-
void f2fs_do_map_lock(struct f2fs_sb_info *sbi, int flag, bool lock)
{
if (flag == F2FS_GET_BLOCK_PRE_AIO) {
@@ -3340,12 +3293,10 @@ static int prepare_write_begin(struct f2fs_sb_info *sbi,
int flag;

/*
- * we already allocated all the blocks, so we don't need to get
- * the block addresses when there is no need to fill the page.
+ * If a whole page is being written and we already preallocated all the
+ * blocks, then there is no need to get a block address now.
*/
- if (!f2fs_has_inline_data(inode) && len == PAGE_SIZE &&
- !is_inode_flag_set(inode, FI_NO_PREALLOC) &&
- !f2fs_verity_in_progress(inode))
+ if (len == PAGE_SIZE && is_inode_flag_set(inode, FI_PREALLOCATED_ALL))
return 0;

/* f2fs_lock_op avoids race between write CP and convert_inline_page */
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index ff37cdd7a6b7..6f196621f772 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -715,7 +715,7 @@ enum {
FI_INLINE_DOTS, /* indicate inline dot dentries */
FI_DO_DEFRAG, /* indicate defragment is running */
FI_DIRTY_FILE, /* indicate regular/symlink has dirty pages */
- FI_NO_PREALLOC, /* indicate skipped preallocated blocks */
+ FI_PREALLOCATED_ALL, /* all blocks for write were preallocated */
FI_HOT_DATA, /* indicate file is hot */
FI_EXTRA_ATTR, /* indicate file has extra attribute */
FI_PROJ_INHERIT, /* indicate file inherits projectid */
@@ -3615,7 +3615,6 @@ void f2fs_update_data_blkaddr(struct dnode_of_data *dn, block_t blkaddr);
int f2fs_reserve_new_blocks(struct dnode_of_data *dn, blkcnt_t count);
int f2fs_reserve_new_block(struct dnode_of_data *dn);
int f2fs_get_block(struct dnode_of_data *dn, pgoff_t index);
-int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from);
int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index);
struct page *f2fs_get_read_data_page(struct inode *inode, pgoff_t index,
int op_flags, bool for_write);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 92ec2699bc85..fc87d0f5b82b 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4235,10 +4235,77 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}

+/*
+ * Preallocate blocks for a write request, if it is possible and helpful to do
+ * so. Returns a positive number if blocks may have been preallocated, 0 if no
+ * blocks were preallocated, or a negative errno value if something went
+ * seriously wrong. Also sets FI_PREALLOCATED_ALL on the inode if *all* the
+ * requested blocks (not just some of them) have been allocated.
+ */
+static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+ const loff_t pos = iocb->ki_pos;
+ const size_t count = iov_iter_count(iter);
+ struct f2fs_map_blocks map = {};
+ bool dio = (iocb->ki_flags & IOCB_DIRECT) &&
+ !f2fs_force_buffered_io(inode, iocb, iter);
+ int flag;
+ int ret;
+
+ /* If it will be an out-of-place direct write, don't bother. */
+ if (dio && f2fs_lfs_mode(sbi))
+ return 0;
+
+ /* No-wait I/O can't allocate blocks. */
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return 0;
+
+ /* If it will be a short write, don't bother. */
+ if (fault_in_iov_iter_readable(iter, count))
+ return 0;
+
+ if (f2fs_has_inline_data(inode)) {
+ /* If the data will fit inline, don't bother. */
+ if (pos + count <= MAX_INLINE_DATA(inode))
+ return 0;
+ ret = f2fs_convert_inline_inode(inode);
+ if (ret)
+ return ret;
+ }
+
+ /* Do not preallocate blocks that will be written partially in 4KB. */
+ map.m_lblk = F2FS_BLK_ALIGN(pos);
+ map.m_len = F2FS_BYTES_TO_BLK(pos + count);
+ if (map.m_len > map.m_lblk)
+ map.m_len -= map.m_lblk;
+ else
+ map.m_len = 0;
+ map.m_may_create = true;
+ if (dio) {
+ map.m_seg_type = f2fs_rw_hint_to_seg_type(inode->i_write_hint);
+ flag = F2FS_GET_BLOCK_PRE_DIO;
+ } else {
+ map.m_seg_type = NO_CHECK_TYPE;
+ flag = F2FS_GET_BLOCK_PRE_AIO;
+ }
+
+ ret = f2fs_map_blocks(inode, &map, 1, flag);
+ /* -ENOSPC is only a fatal error if no blocks could be allocated. */
+ if (ret < 0 && !(ret == -ENOSPC && map.m_len > 0))
+ return ret;
+ if (ret == 0)
+ set_inode_flag(inode, FI_PREALLOCATED_ALL);
+ return map.m_len;
+}
+
static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
+ loff_t target_size;
+ int preallocated;
ssize_t ret;

if (unlikely(f2fs_cp_error(F2FS_I_SB(inode)))) {
@@ -4262,84 +4329,54 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)

if (unlikely(IS_IMMUTABLE(inode))) {
ret = -EPERM;
- goto unlock;
+ goto out_unlock;
}

if (is_inode_flag_set(inode, FI_COMPRESS_RELEASED)) {
ret = -EPERM;
- goto unlock;
+ goto out_unlock;
}

ret = generic_write_checks(iocb, from);
if (ret > 0) {
- bool preallocated = false;
- size_t target_size = 0;
- int err;
-
- if (fault_in_iov_iter_readable(from, iov_iter_count(from)))
- set_inode_flag(inode, FI_NO_PREALLOC);
-
- if ((iocb->ki_flags & IOCB_NOWAIT)) {
+ if (iocb->ki_flags & IOCB_NOWAIT) {
if (!f2fs_overwrite_io(inode, iocb->ki_pos,
iov_iter_count(from)) ||
f2fs_has_inline_data(inode) ||
f2fs_force_buffered_io(inode, iocb, from)) {
- clear_inode_flag(inode, FI_NO_PREALLOC);
- inode_unlock(inode);
ret = -EAGAIN;
- goto out;
+ goto out_unlock;
}
- goto write;
}
-
- if (is_inode_flag_set(inode, FI_NO_PREALLOC))
- goto write;
-
if (iocb->ki_flags & IOCB_DIRECT) {
- /*
- * Convert inline data for Direct I/O before entering
- * f2fs_direct_IO().
- */
- err = f2fs_convert_inline_inode(inode);
- if (err)
- goto out_err;
- /*
- * If force_buffere_io() is true, we have to allocate
- * blocks all the time, since f2fs_direct_IO will fall
- * back to buffered IO.
- */
- if (!f2fs_force_buffered_io(inode, iocb, from) &&
- f2fs_lfs_mode(F2FS_I_SB(inode)))
- goto write;
+ ret = f2fs_convert_inline_inode(inode);
+ if (ret)
+ goto out_unlock;
}
- preallocated = true;
+ /* Possibly preallocate the blocks for the write. */
target_size = iocb->ki_pos + iov_iter_count(from);
-
- err = f2fs_preallocate_blocks(iocb, from);
- if (err) {
-out_err:
- clear_inode_flag(inode, FI_NO_PREALLOC);
- inode_unlock(inode);
- ret = err;
- goto out;
+ preallocated = f2fs_preallocate_blocks(iocb, from);
+ if (preallocated < 0) {
+ ret = preallocated;
+ goto out_unlock;
}
-write:
+
ret = __generic_file_write_iter(iocb, from);
- clear_inode_flag(inode, FI_NO_PREALLOC);

- /* if we couldn't write data, we should deallocate blocks. */
- if (preallocated && i_size_read(inode) < target_size) {
+ /* Don't leave any preallocated blocks around past i_size. */
+ if (preallocated > 0 && i_size_read(inode) < target_size) {
down_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
filemap_invalidate_lock(inode->i_mapping);
f2fs_truncate(inode);
filemap_invalidate_unlock(inode->i_mapping);
up_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
}
+ clear_inode_flag(inode, FI_PREALLOCATED_ALL);

if (ret > 0)
f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
}
-unlock:
+out_unlock:
inode_unlock(inode);
out:
trace_f2fs_file_write_iter(inode, iocb->ki_pos,
--
2.34.1.448.ga2b2bfdf31-goog



2022-01-04 21:24:41

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH 2/6] f2fs: reduce indentation in f2fs_file_write_iter()

From: Eric Biggers <[email protected]>

Replace 'if (ret > 0)' with 'if (ret <= 0) goto out_unlock;'.
No change in behavior.

Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/file.c | 64 +++++++++++++++++++++++++++-----------------------
1 file changed, 34 insertions(+), 30 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index fc87d0f5b82b..808a7c24d993 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4338,44 +4338,48 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
}

ret = generic_write_checks(iocb, from);
- if (ret > 0) {
- if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!f2fs_overwrite_io(inode, iocb->ki_pos,
- iov_iter_count(from)) ||
+ if (ret <= 0)
+ goto out_unlock;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!f2fs_overwrite_io(inode, iocb->ki_pos,
+ iov_iter_count(from)) ||
f2fs_has_inline_data(inode) ||
f2fs_force_buffered_io(inode, iocb, from)) {
- ret = -EAGAIN;
- goto out_unlock;
- }
- }
- if (iocb->ki_flags & IOCB_DIRECT) {
- ret = f2fs_convert_inline_inode(inode);
- if (ret)
- goto out_unlock;
- }
- /* Possibly preallocate the blocks for the write. */
- target_size = iocb->ki_pos + iov_iter_count(from);
- preallocated = f2fs_preallocate_blocks(iocb, from);
- if (preallocated < 0) {
- ret = preallocated;
+ ret = -EAGAIN;
goto out_unlock;
}
+ }

- ret = __generic_file_write_iter(iocb, from);
+ if (iocb->ki_flags & IOCB_DIRECT) {
+ ret = f2fs_convert_inline_inode(inode);
+ if (ret)
+ goto out_unlock;
+ }
+ /* Possibly preallocate the blocks for the write. */
+ target_size = iocb->ki_pos + iov_iter_count(from);
+ preallocated = f2fs_preallocate_blocks(iocb, from);
+ if (preallocated < 0) {
+ ret = preallocated;
+ goto out_unlock;
+ }

- /* Don't leave any preallocated blocks around past i_size. */
- if (preallocated > 0 && i_size_read(inode) < target_size) {
- down_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
- filemap_invalidate_lock(inode->i_mapping);
- f2fs_truncate(inode);
- filemap_invalidate_unlock(inode->i_mapping);
- up_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
- }
- clear_inode_flag(inode, FI_PREALLOCATED_ALL);
+ ret = __generic_file_write_iter(iocb, from);

- if (ret > 0)
- f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
+ /* Don't leave any preallocated blocks around past i_size. */
+ if (preallocated > 0 && i_size_read(inode) < target_size) {
+ down_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
+ filemap_invalidate_lock(inode->i_mapping);
+ f2fs_truncate(inode);
+ filemap_invalidate_unlock(inode->i_mapping);
+ up_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
}
+
+ clear_inode_flag(inode, FI_PREALLOCATED_ALL);
+
+ if (ret > 0)
+ f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
+
out_unlock:
inode_unlock(inode);
out:
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 21:24:41

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

DIO preallocates physical blocks before writing data, but if an error occurrs
or power-cut happens, we can see block contents from the disk. This patch tries
to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
unwritten blocks from error or power-cut.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/data.c | 5 ++++-
fs/f2fs/f2fs.h | 5 +++++
fs/f2fs/file.c | 27 ++++++++++++++++++---------
fs/f2fs/inode.c | 8 ++++++++
4 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 3db0f3049b90..9c867de1ec29 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1543,8 +1543,11 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map,
flag != F2FS_GET_BLOCK_DIO);
err = __allocate_data_block(&dn,
map->m_seg_type);
- if (!err)
+ if (!err) {
+ if (flag == F2FS_GET_BLOCK_PRE_DIO)
+ file_need_truncate(inode);
set_inode_flag(inode, FI_APPEND_WRITE);
+ }
}
if (err)
goto sync_out;
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 6f196621f772..d7435fcb9658 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -654,6 +654,7 @@ enum {
#define FADVISE_KEEP_SIZE_BIT 0x10
#define FADVISE_HOT_BIT 0x20
#define FADVISE_VERITY_BIT 0x40
+#define FADVISE_TRUNC_BIT 0x80

#define FADVISE_MODIFIABLE_BITS (FADVISE_COLD_BIT | FADVISE_HOT_BIT)

@@ -681,6 +682,10 @@ enum {
#define file_is_verity(inode) is_file(inode, FADVISE_VERITY_BIT)
#define file_set_verity(inode) set_file(inode, FADVISE_VERITY_BIT)

+#define file_should_truncate(inode) is_file(inode, FADVISE_TRUNC_BIT)
+#define file_need_truncate(inode) set_file(inode, FADVISE_TRUNC_BIT)
+#define file_dont_truncate(inode) clear_file(inode, FADVISE_TRUNC_BIT)
+
#define DEF_DIR_LEVEL 0

enum {
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 808a7c24d993..e1445cf915ea 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1687,6 +1687,7 @@ static int expand_inode_data(struct inode *inode, loff_t offset,

map.m_seg_type = CURSEG_COLD_DATA_PINNED;
err = f2fs_map_blocks(inode, &map, 1, F2FS_GET_BLOCK_PRE_DIO);
+ file_dont_truncate(inode);

up_write(&sbi->pin_sem);

@@ -4257,6 +4258,13 @@ static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
/* If it will be an out-of-place direct write, don't bother. */
if (dio && f2fs_lfs_mode(sbi))
return 0;
+ /*
+ * Don't preallocate holes aligned to DIO_SKIP_HOLES which turns into
+ * buffered IO, if DIO meets any holes.
+ */
+ if (dio && i_size_read(inode) &&
+ (F2FS_BYTES_TO_BLK(pos) < F2FS_BLK_ALIGN(i_size_read(inode))))
+ return 0;

/* No-wait I/O can't allocate blocks. */
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -4292,8 +4300,8 @@ static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
}

ret = f2fs_map_blocks(inode, &map, 1, flag);
- /* -ENOSPC is only a fatal error if no blocks could be allocated. */
- if (ret < 0 && !(ret == -ENOSPC && map.m_len > 0))
+ /* -ENOSPC|-EDQUOT are fine to report the number of allocated blocks. */
+ if (ret < 0 && !((ret == -ENOSPC || ret == -EDQUOT) && map.m_len > 0))
return ret;
if (ret == 0)
set_inode_flag(inode, FI_PREALLOCATED_ALL);
@@ -4359,20 +4367,21 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
/* Possibly preallocate the blocks for the write. */
target_size = iocb->ki_pos + iov_iter_count(from);
preallocated = f2fs_preallocate_blocks(iocb, from);
- if (preallocated < 0) {
+ if (preallocated < 0)
ret = preallocated;
- goto out_unlock;
- }
-
- ret = __generic_file_write_iter(iocb, from);
+ else
+ ret = __generic_file_write_iter(iocb, from);

/* Don't leave any preallocated blocks around past i_size. */
- if (preallocated > 0 && i_size_read(inode) < target_size) {
+ if (preallocated && i_size_read(inode) < target_size) {
down_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
filemap_invalidate_lock(inode->i_mapping);
- f2fs_truncate(inode);
+ if (!f2fs_truncate(inode))
+ file_dont_truncate(inode);
filemap_invalidate_unlock(inode->i_mapping);
up_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
+ } else {
+ file_dont_truncate(inode);
}

clear_inode_flag(inode, FI_PREALLOCATED_ALL);
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 0f8b2df3e1e0..6998eb1d6bdb 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -544,6 +544,14 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
goto bad_inode;
}
f2fs_set_inode_flags(inode);
+
+ if (file_should_truncate(inode)) {
+ ret = f2fs_truncate(inode);
+ if (ret)
+ goto bad_inode;
+ file_dont_truncate(inode);
+ }
+
unlock_new_inode(inode);
trace_f2fs_iget(inode);
return inode;
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 21:24:47

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH 4/6] f2fs: fix the f2fs_file_write_iter tracepoint

From: Eric Biggers <[email protected]>

Pass in the original position and count rather than the position and
count that were updated by the write. Also use the correct types for
all arguments, in particular the file offset which was being truncated
to 32 bits on 32-bit platforms.

Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/file.c | 5 +++--
include/trace/events/f2fs.h | 12 ++++++------
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index e1445cf915ea..048db4852b28 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4312,6 +4312,8 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
+ const loff_t orig_pos = iocb->ki_pos;
+ const size_t orig_count = iov_iter_count(from);
loff_t target_size;
int preallocated;
ssize_t ret;
@@ -4392,8 +4394,7 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
out_unlock:
inode_unlock(inode);
out:
- trace_f2fs_file_write_iter(inode, iocb->ki_pos,
- iov_iter_count(from), ret);
+ trace_f2fs_file_write_iter(inode, orig_pos, orig_count, ret);
if (ret > 0)
ret = generic_write_sync(iocb, ret);
return ret;
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index f8cb916f3595..dcb94d740e12 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -540,17 +540,17 @@ TRACE_EVENT(f2fs_truncate_partial_nodes,

TRACE_EVENT(f2fs_file_write_iter,

- TP_PROTO(struct inode *inode, unsigned long offset,
- unsigned long length, int ret),
+ TP_PROTO(struct inode *inode, loff_t offset, size_t length,
+ ssize_t ret),

TP_ARGS(inode, offset, length, ret),

TP_STRUCT__entry(
__field(dev_t, dev)
__field(ino_t, ino)
- __field(unsigned long, offset)
- __field(unsigned long, length)
- __field(int, ret)
+ __field(loff_t, offset)
+ __field(size_t, length)
+ __field(ssize_t, ret)
),

TP_fast_assign(
@@ -562,7 +562,7 @@ TRACE_EVENT(f2fs_file_write_iter,
),

TP_printk("dev = (%d,%d), ino = %lu, "
- "offset = %lu, length = %lu, written(err) = %d",
+ "offset = %lld, length = %zu, written(err) = %zd",
show_dev_ino(__entry),
__entry->offset,
__entry->length,
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 21:24:47

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH 6/6] f2fs: use iomap for direct I/O

From: Eric Biggers <[email protected]>

Make f2fs_file_read_iter() and f2fs_file_write_iter() use the iomap
direct I/O implementation instead of the fs/direct-io.c one.

The iomap implementation is more efficient, and it also avoids the need
to add new features and optimizations to the old implementation.

This new implementation also eliminates the need for f2fs to hook bio
submission and completion and to allocate memory per-bio. This is
because it's possible to correctly update f2fs's in-flight DIO counters
using __iomap_dio_rw() in combination with an implementation of
iomap_dio_ops::end_io() (as suggested by Christoph Hellwig).

When possible, this new implementation preserves existing f2fs behavior
such as the conditions for falling back to buffered I/O.

This patch has been tested with xfstests by running 'gce-xfstests -c
f2fs -g auto -X generic/017' with and without this patch; no regressions
were seen. (Some tests fail both before and after. generic/017 hangs
both before and after, so it had to be excluded.)

Signed-off-by: Eric Biggers <[email protected]>
[Jaegeuk Kim: use spin_lock_bh for f2fs_update_iostat in softirq]
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/data.c | 205 +---------------------------
fs/f2fs/f2fs.h | 8 +-
fs/f2fs/file.c | 342 +++++++++++++++++++++++++++++++++++++++++------
fs/f2fs/iostat.c | 40 +++---
4 files changed, 322 insertions(+), 273 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 57e6a6f0daf9..a9652a8e669b 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1377,11 +1377,6 @@ static int __allocate_data_block(struct dnode_of_data *dn, int seg_type)
f2fs_invalidate_compress_page(sbi, old_blkaddr);
}
f2fs_update_data_blkaddr(dn, dn->data_blkaddr);
-
- /*
- * i_size will be updated by direct_IO. Otherwise, we'll get stale
- * data from unwritten block via dio_read.
- */
return 0;
}

@@ -1743,50 +1738,6 @@ static inline u64 blks_to_bytes(struct inode *inode, u64 blks)
return (blks << inode->i_blkbits);
}

-static int __get_data_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create, int flag,
- pgoff_t *next_pgofs, int seg_type, bool may_write)
-{
- struct f2fs_map_blocks map;
- int err;
-
- map.m_lblk = iblock;
- map.m_len = bytes_to_blks(inode, bh->b_size);
- map.m_next_pgofs = next_pgofs;
- map.m_next_extent = NULL;
- map.m_seg_type = seg_type;
- map.m_may_create = may_write;
-
- err = f2fs_map_blocks(inode, &map, create, flag);
- if (!err) {
- map_bh(bh, inode->i_sb, map.m_pblk);
- bh->b_state = (bh->b_state & ~F2FS_MAP_FLAGS) | map.m_flags;
- bh->b_size = blks_to_bytes(inode, map.m_len);
-
- if (map.m_multidev_dio)
- bh->b_bdev = map.m_bdev;
- }
- return err;
-}
-
-static int get_data_block_dio_write(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- return __get_data_block(inode, iblock, bh_result, create,
- F2FS_GET_BLOCK_DIO, NULL,
- f2fs_rw_hint_to_seg_type(inode->i_write_hint),
- true);
-}
-
-static int get_data_block_dio(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- return __get_data_block(inode, iblock, bh_result, create,
- F2FS_GET_BLOCK_DIO, NULL,
- f2fs_rw_hint_to_seg_type(inode->i_write_hint),
- false);
-}
-
static int f2fs_xattr_fiemap(struct inode *inode,
struct fiemap_extent_info *fieinfo)
{
@@ -3263,7 +3214,7 @@ static int f2fs_write_data_pages(struct address_space *mapping,
FS_CP_DATA_IO : FS_DATA_IO);
}

-static void f2fs_write_failed(struct inode *inode, loff_t to)
+void f2fs_write_failed(struct inode *inode, loff_t to)
{
loff_t i_size = i_size_read(inode);

@@ -3551,158 +3502,6 @@ static int f2fs_write_end(struct file *file,
return copied;
}

-static int check_direct_IO(struct inode *inode, struct iov_iter *iter,
- loff_t offset)
-{
- unsigned i_blkbits = READ_ONCE(inode->i_blkbits);
- unsigned blkbits = i_blkbits;
- unsigned blocksize_mask = (1 << blkbits) - 1;
- unsigned long align = offset | iov_iter_alignment(iter);
- struct block_device *bdev = inode->i_sb->s_bdev;
-
- if (iov_iter_rw(iter) == READ && offset >= i_size_read(inode))
- return 1;
-
- if (align & blocksize_mask) {
- if (bdev)
- blkbits = blksize_bits(bdev_logical_block_size(bdev));
- blocksize_mask = (1 << blkbits) - 1;
- if (align & blocksize_mask)
- return -EINVAL;
- return 1;
- }
- return 0;
-}
-
-static void f2fs_dio_end_io(struct bio *bio)
-{
- struct f2fs_private_dio *dio = bio->bi_private;
-
- dec_page_count(F2FS_I_SB(dio->inode),
- dio->write ? F2FS_DIO_WRITE : F2FS_DIO_READ);
-
- bio->bi_private = dio->orig_private;
- bio->bi_end_io = dio->orig_end_io;
-
- kfree(dio);
-
- bio_endio(bio);
-}
-
-static void f2fs_dio_submit_bio(struct bio *bio, struct inode *inode,
- loff_t file_offset)
-{
- struct f2fs_private_dio *dio;
- bool write = (bio_op(bio) == REQ_OP_WRITE);
-
- dio = f2fs_kzalloc(F2FS_I_SB(inode),
- sizeof(struct f2fs_private_dio), GFP_NOFS);
- if (!dio)
- goto out;
-
- dio->inode = inode;
- dio->orig_end_io = bio->bi_end_io;
- dio->orig_private = bio->bi_private;
- dio->write = write;
-
- bio->bi_end_io = f2fs_dio_end_io;
- bio->bi_private = dio;
-
- inc_page_count(F2FS_I_SB(inode),
- write ? F2FS_DIO_WRITE : F2FS_DIO_READ);
-
- submit_bio(bio);
- return;
-out:
- bio->bi_status = BLK_STS_IOERR;
- bio_endio(bio);
-}
-
-static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct address_space *mapping = iocb->ki_filp->f_mapping;
- struct inode *inode = mapping->host;
- struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
- struct f2fs_inode_info *fi = F2FS_I(inode);
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- int rw = iov_iter_rw(iter);
- int err;
- enum rw_hint hint = iocb->ki_hint;
- int whint_mode = F2FS_OPTION(sbi).whint_mode;
- bool do_opu;
-
- err = check_direct_IO(inode, iter, offset);
- if (err)
- return err < 0 ? err : 0;
-
- if (f2fs_force_buffered_io(inode, iocb, iter))
- return 0;
-
- do_opu = rw == WRITE && f2fs_lfs_mode(sbi);
-
- trace_f2fs_direct_IO_enter(inode, offset, count, rw);
-
- if (rw == WRITE && whint_mode == WHINT_MODE_OFF)
- iocb->ki_hint = WRITE_LIFE_NOT_SET;
-
- if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!down_read_trylock(&fi->i_gc_rwsem[rw])) {
- iocb->ki_hint = hint;
- err = -EAGAIN;
- goto out;
- }
- if (do_opu && !down_read_trylock(&fi->i_gc_rwsem[READ])) {
- up_read(&fi->i_gc_rwsem[rw]);
- iocb->ki_hint = hint;
- err = -EAGAIN;
- goto out;
- }
- } else {
- down_read(&fi->i_gc_rwsem[rw]);
- if (do_opu)
- down_read(&fi->i_gc_rwsem[READ]);
- }
-
- err = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
- iter, rw == WRITE ? get_data_block_dio_write :
- get_data_block_dio, NULL, f2fs_dio_submit_bio,
- rw == WRITE ? DIO_LOCKING | DIO_SKIP_HOLES :
- DIO_SKIP_HOLES);
-
- if (do_opu)
- up_read(&fi->i_gc_rwsem[READ]);
-
- up_read(&fi->i_gc_rwsem[rw]);
-
- if (rw == WRITE) {
- if (whint_mode == WHINT_MODE_OFF)
- iocb->ki_hint = hint;
- if (err > 0) {
- f2fs_update_iostat(F2FS_I_SB(inode), APP_DIRECT_IO,
- err);
- if (!do_opu)
- set_inode_flag(inode, FI_UPDATE_WRITE);
- } else if (err == -EIOCBQUEUED) {
- f2fs_update_iostat(F2FS_I_SB(inode), APP_DIRECT_IO,
- count - iov_iter_count(iter));
- } else if (err < 0) {
- f2fs_write_failed(inode, offset + count);
- }
- } else {
- if (err > 0)
- f2fs_update_iostat(sbi, APP_DIRECT_READ_IO, err);
- else if (err == -EIOCBQUEUED)
- f2fs_update_iostat(F2FS_I_SB(inode), APP_DIRECT_READ_IO,
- count - iov_iter_count(iter));
- }
-
-out:
- trace_f2fs_direct_IO_exit(inode, offset, count, rw, err);
-
- return err;
-}
-
void f2fs_invalidate_page(struct page *page, unsigned int offset,
unsigned int length)
{
@@ -4158,7 +3957,7 @@ const struct address_space_operations f2fs_dblock_aops = {
.set_page_dirty = f2fs_set_data_page_dirty,
.invalidatepage = f2fs_invalidate_page,
.releasepage = f2fs_release_page,
- .direct_IO = f2fs_direct_IO,
+ .direct_IO = noop_direct_IO,
.bmap = f2fs_bmap,
.swap_activate = f2fs_swap_activate,
.swap_deactivate = f2fs_swap_deactivate,
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 8242f47304a5..ac6dda6c4c5a 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1807,13 +1807,6 @@ struct f2fs_sb_info {
#endif
};

-struct f2fs_private_dio {
- struct inode *inode;
- void *orig_private;
- bio_end_io_t *orig_end_io;
- bool write;
-};
-
#ifdef CONFIG_F2FS_FAULT_INJECTION
#define f2fs_show_injection_info(sbi, type) \
printk_ratelimited("%sF2FS-fs (%s) : inject %s in %s of %pS\n", \
@@ -3642,6 +3635,7 @@ int f2fs_write_single_data_page(struct page *page, int *submitted,
struct writeback_control *wbc,
enum iostat_type io_type,
int compr_blocks, bool allow_balance);
+void f2fs_write_failed(struct inode *inode, loff_t to);
void f2fs_invalidate_page(struct page *page, unsigned int offset,
unsigned int length);
int f2fs_release_page(struct page *page, gfp_t wait);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 048db4852b28..7516d97d5016 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -24,6 +24,7 @@
#include <linux/sched/signal.h>
#include <linux/fileattr.h>
#include <linux/fadvise.h>
+#include <linux/iomap.h>

#include "f2fs.h"
#include "node.h"
@@ -4219,23 +4220,145 @@ long f2fs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
return __f2fs_ioctl(filp, cmd, arg);
}

-static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+/*
+ * Return %true if the given read or write request should use direct I/O, or
+ * %false if it should use buffered I/O.
+ */
+static bool f2fs_should_use_dio(struct inode *inode, struct kiocb *iocb,
+ struct iov_iter *iter)
+{
+ unsigned int align;
+
+ if (!(iocb->ki_flags & IOCB_DIRECT))
+ return false;
+
+ if (f2fs_force_buffered_io(inode, iocb, iter))
+ return false;
+
+ /*
+ * Direct I/O not aligned to the disk's logical_block_size will be
+ * attempted, but will fail with -EINVAL.
+ *
+ * f2fs additionally requires that direct I/O be aligned to the
+ * filesystem block size, which is often a stricter requirement.
+ * However, f2fs traditionally falls back to buffered I/O on requests
+ * that are logical_block_size-aligned but not fs-block aligned.
+ *
+ * The below logic implements this behavior.
+ */
+ align = iocb->ki_pos | iov_iter_alignment(iter);
+ if (!IS_ALIGNED(align, i_blocksize(inode)) &&
+ IS_ALIGNED(align, bdev_logical_block_size(inode->i_sb->s_bdev)))
+ return false;
+
+ return true;
+}
+
+static int f2fs_dio_read_end_io(struct kiocb *iocb, ssize_t size, int error,
+ unsigned int flags)
+{
+ struct f2fs_sb_info *sbi = F2FS_I_SB(file_inode(iocb->ki_filp));
+
+ dec_page_count(sbi, F2FS_DIO_READ);
+ if (error)
+ return error;
+ f2fs_update_iostat(sbi, APP_DIRECT_READ_IO, size);
+ return 0;
+}
+
+static const struct iomap_dio_ops f2fs_iomap_dio_read_ops = {
+ .end_io = f2fs_dio_read_end_io,
+};
+
+static ssize_t f2fs_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
- int ret;
+ struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ const loff_t pos = iocb->ki_pos;
+ const size_t count = iov_iter_count(to);
+ struct iomap_dio *dio;
+ ssize_t ret;
+
+ if (count == 0)
+ return 0; /* skip atime update */
+
+ trace_f2fs_direct_IO_enter(inode, pos, count, READ);
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!down_read_trylock(&fi->i_gc_rwsem[READ])) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ } else {
+ down_read(&fi->i_gc_rwsem[READ]);
+ }
+
+ /*
+ * We have to use __iomap_dio_rw() and iomap_dio_complete() instead of
+ * the higher-level function iomap_dio_rw() in order to ensure that the
+ * F2FS_DIO_READ counter will be decremented correctly in all cases.
+ */
+ inc_page_count(sbi, F2FS_DIO_READ);
+ dio = __iomap_dio_rw(iocb, to, &f2fs_iomap_ops,
+ &f2fs_iomap_dio_read_ops, 0, 0);
+ if (IS_ERR_OR_NULL(dio)) {
+ ret = PTR_ERR_OR_ZERO(dio);
+ if (ret != -EIOCBQUEUED)
+ dec_page_count(sbi, F2FS_DIO_READ);
+ } else {
+ ret = iomap_dio_complete(dio);
+ }
+
+ up_read(&fi->i_gc_rwsem[READ]);
+
+ file_accessed(file);
+out:
+ trace_f2fs_direct_IO_exit(inode, pos, count, READ, ret);
+ return ret;
+}
+
+static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;

if (!f2fs_is_compress_backend_ready(inode))
return -EOPNOTSUPP;

- ret = generic_file_read_iter(iocb, iter);
+ if (f2fs_should_use_dio(inode, iocb, to))
+ return f2fs_dio_read_iter(iocb, to);

+ ret = filemap_read(iocb, to, 0);
if (ret > 0)
- f2fs_update_iostat(F2FS_I_SB(inode), APP_READ_IO, ret);
-
+ f2fs_update_iostat(F2FS_I_SB(inode), APP_BUFFERED_READ_IO, ret);
return ret;
}

+static ssize_t f2fs_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ ssize_t count;
+ int err;
+
+ if (IS_IMMUTABLE(inode))
+ return -EPERM;
+
+ if (is_inode_flag_set(inode, FI_COMPRESS_RELEASED))
+ return -EPERM;
+
+ count = generic_write_checks(iocb, from);
+ if (count <= 0)
+ return count;
+
+ err = file_modified(file);
+ if (err)
+ return err;
+ return count;
+}
+
/*
* Preallocate blocks for a write request, if it is possible and helpful to do
* so. Returns a positive number if blocks may have been preallocated, 0 if no
@@ -4243,15 +4366,14 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
* seriously wrong. Also sets FI_PREALLOCATED_ALL on the inode if *all* the
* requested blocks (not just some of them) have been allocated.
*/
-static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
+static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter,
+ bool dio)
{
struct inode *inode = file_inode(iocb->ki_filp);
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
const loff_t pos = iocb->ki_pos;
const size_t count = iov_iter_count(iter);
struct f2fs_map_blocks map = {};
- bool dio = (iocb->ki_flags & IOCB_DIRECT) &&
- !f2fs_force_buffered_io(inode, iocb, iter);
int flag;
int ret;

@@ -4308,13 +4430,174 @@ static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
return map.m_len;
}

-static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t f2fs_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
+ ssize_t ret;
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EOPNOTSUPP;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = generic_perform_write(file, from, iocb->ki_pos);
+ current->backing_dev_info = NULL;
+
+ if (ret > 0) {
+ iocb->ki_pos += ret;
+ f2fs_update_iostat(F2FS_I_SB(inode), APP_BUFFERED_IO, ret);
+ }
+ return ret;
+}
+
+static int f2fs_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
+ unsigned int flags)
+{
+ struct f2fs_sb_info *sbi = F2FS_I_SB(file_inode(iocb->ki_filp));
+
+ dec_page_count(sbi, F2FS_DIO_WRITE);
+ if (error)
+ return error;
+ f2fs_update_iostat(sbi, APP_DIRECT_IO, size);
+ return 0;
+}
+
+static const struct iomap_dio_ops f2fs_iomap_dio_write_ops = {
+ .end_io = f2fs_dio_write_end_io,
+};
+
+static ssize_t f2fs_dio_write_iter(struct kiocb *iocb, struct iov_iter *from,
+ bool *may_need_sync)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ struct f2fs_inode_info *fi = F2FS_I(inode);
+ struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+ const bool do_opu = f2fs_lfs_mode(sbi);
+ const int whint_mode = F2FS_OPTION(sbi).whint_mode;
+ const loff_t pos = iocb->ki_pos;
+ const ssize_t count = iov_iter_count(from);
+ const enum rw_hint hint = iocb->ki_hint;
+ unsigned int dio_flags;
+ struct iomap_dio *dio;
+ ssize_t ret;
+
+ trace_f2fs_direct_IO_enter(inode, pos, count, WRITE);
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ /* f2fs_convert_inline_inode() and block allocation can block */
+ if (f2fs_has_inline_data(inode) ||
+ !f2fs_overwrite_io(inode, pos, count)) {
+ ret = -EAGAIN;
+ goto out;
+ }
+
+ if (!down_read_trylock(&fi->i_gc_rwsem[WRITE])) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ if (do_opu && !down_read_trylock(&fi->i_gc_rwsem[READ])) {
+ up_read(&fi->i_gc_rwsem[WRITE]);
+ ret = -EAGAIN;
+ goto out;
+ }
+ } else {
+ ret = f2fs_convert_inline_inode(inode);
+ if (ret)
+ goto out;
+
+ down_read(&fi->i_gc_rwsem[WRITE]);
+ if (do_opu)
+ down_read(&fi->i_gc_rwsem[READ]);
+ }
+ if (whint_mode == WHINT_MODE_OFF)
+ iocb->ki_hint = WRITE_LIFE_NOT_SET;
+
+ /*
+ * We have to use __iomap_dio_rw() and iomap_dio_complete() instead of
+ * the higher-level function iomap_dio_rw() in order to ensure that the
+ * F2FS_DIO_WRITE counter will be decremented correctly in all cases.
+ */
+ inc_page_count(sbi, F2FS_DIO_WRITE);
+ dio_flags = 0;
+ if (pos + count > inode->i_size)
+ dio_flags |= IOMAP_DIO_FORCE_WAIT;
+ dio = __iomap_dio_rw(iocb, from, &f2fs_iomap_ops,
+ &f2fs_iomap_dio_write_ops, dio_flags, 0);
+ if (IS_ERR_OR_NULL(dio)) {
+ ret = PTR_ERR_OR_ZERO(dio);
+ if (ret == -ENOTBLK)
+ ret = 0;
+ if (ret != -EIOCBQUEUED)
+ dec_page_count(sbi, F2FS_DIO_WRITE);
+ } else {
+ ret = iomap_dio_complete(dio);
+ }
+
+ if (whint_mode == WHINT_MODE_OFF)
+ iocb->ki_hint = hint;
+ if (do_opu)
+ up_read(&fi->i_gc_rwsem[READ]);
+ up_read(&fi->i_gc_rwsem[WRITE]);
+
+ if (ret < 0)
+ goto out;
+ if (pos + ret > inode->i_size)
+ f2fs_i_size_write(inode, pos + ret);
+ if (!do_opu)
+ set_inode_flag(inode, FI_UPDATE_WRITE);
+
+ if (iov_iter_count(from)) {
+ ssize_t ret2;
+ loff_t bufio_start_pos = iocb->ki_pos;
+
+ /*
+ * The direct write was partial, so we need to fall back to a
+ * buffered write for the remainder.
+ */
+
+ ret2 = f2fs_buffered_write_iter(iocb, from);
+ if (iov_iter_count(from))
+ f2fs_write_failed(inode, iocb->ki_pos);
+ if (ret2 < 0)
+ goto out;
+
+ /*
+ * Ensure that the pagecache pages are written to disk and
+ * invalidated to preserve the expected O_DIRECT semantics.
+ */
+ if (ret2 > 0) {
+ loff_t bufio_end_pos = bufio_start_pos + ret2 - 1;
+
+ ret += ret2;
+
+ ret2 = filemap_write_and_wait_range(file->f_mapping,
+ bufio_start_pos,
+ bufio_end_pos);
+ if (ret2 < 0)
+ goto out;
+ invalidate_mapping_pages(file->f_mapping,
+ bufio_start_pos >> PAGE_SHIFT,
+ bufio_end_pos >> PAGE_SHIFT);
+ }
+ } else {
+ /* iomap_dio_rw() already handled the generic_write_sync(). */
+ *may_need_sync = false;
+ }
+out:
+ trace_f2fs_direct_IO_exit(inode, pos, count, WRITE, ret);
+ return ret;
+}
+
+static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
const loff_t orig_pos = iocb->ki_pos;
const size_t orig_count = iov_iter_count(from);
loff_t target_size;
+ bool dio;
+ bool may_need_sync = true;
int preallocated;
ssize_t ret;

@@ -4337,42 +4620,23 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
inode_lock(inode);
}

- if (unlikely(IS_IMMUTABLE(inode))) {
- ret = -EPERM;
- goto out_unlock;
- }
-
- if (is_inode_flag_set(inode, FI_COMPRESS_RELEASED)) {
- ret = -EPERM;
- goto out_unlock;
- }
-
- ret = generic_write_checks(iocb, from);
+ ret = f2fs_write_checks(iocb, from);
if (ret <= 0)
goto out_unlock;

- if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!f2fs_overwrite_io(inode, iocb->ki_pos,
- iov_iter_count(from)) ||
- f2fs_has_inline_data(inode) ||
- f2fs_force_buffered_io(inode, iocb, from)) {
- ret = -EAGAIN;
- goto out_unlock;
- }
- }
+ /* Determine whether we will do a direct write or a buffered write. */
+ dio = f2fs_should_use_dio(inode, iocb, from);

- if (iocb->ki_flags & IOCB_DIRECT) {
- ret = f2fs_convert_inline_inode(inode);
- if (ret)
- goto out_unlock;
- }
/* Possibly preallocate the blocks for the write. */
target_size = iocb->ki_pos + iov_iter_count(from);
- preallocated = f2fs_preallocate_blocks(iocb, from);
+ preallocated = f2fs_preallocate_blocks(iocb, from, dio);
if (preallocated < 0)
ret = preallocated;
else
- ret = __generic_file_write_iter(iocb, from);
+ /* Do the actual write. */
+ ret = dio ?
+ f2fs_dio_write_iter(iocb, from, &may_need_sync):
+ f2fs_buffered_write_iter(iocb, from);

/* Don't leave any preallocated blocks around past i_size. */
if (preallocated && i_size_read(inode) < target_size) {
@@ -4387,15 +4651,11 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
}

clear_inode_flag(inode, FI_PREALLOCATED_ALL);
-
- if (ret > 0)
- f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
-
out_unlock:
inode_unlock(inode);
out:
trace_f2fs_file_write_iter(inode, orig_pos, orig_count, ret);
- if (ret > 0)
+ if (ret > 0 && may_need_sync)
ret = generic_write_sync(iocb, ret);
return ret;
}
diff --git a/fs/f2fs/iostat.c b/fs/f2fs/iostat.c
index cdcf54ae0db8..be599f31d3c4 100644
--- a/fs/f2fs/iostat.c
+++ b/fs/f2fs/iostat.c
@@ -92,7 +92,7 @@ static inline void __record_iostat_latency(struct f2fs_sb_info *sbi)
struct f2fs_iostat_latency iostat_lat[MAX_IO_TYPE][NR_PAGE_TYPE];
struct iostat_lat_info *io_lat = sbi->iostat_io_lat;

- spin_lock_irq(&sbi->iostat_lat_lock);
+ spin_lock_bh(&sbi->iostat_lat_lock);
for (idx = 0; idx < MAX_IO_TYPE; idx++) {
for (io = 0; io < NR_PAGE_TYPE; io++) {
cnt = io_lat->bio_cnt[idx][io];
@@ -106,7 +106,7 @@ static inline void __record_iostat_latency(struct f2fs_sb_info *sbi)
io_lat->bio_cnt[idx][io] = 0;
}
}
- spin_unlock_irq(&sbi->iostat_lat_lock);
+ spin_unlock_bh(&sbi->iostat_lat_lock);

trace_f2fs_iostat_latency(sbi, iostat_lat);
}
@@ -120,9 +120,9 @@ static inline void f2fs_record_iostat(struct f2fs_sb_info *sbi)
return;

/* Need double check under the lock */
- spin_lock(&sbi->iostat_lock);
+ spin_lock_bh(&sbi->iostat_lock);
if (time_is_after_jiffies(sbi->iostat_next_period)) {
- spin_unlock(&sbi->iostat_lock);
+ spin_unlock_bh(&sbi->iostat_lock);
return;
}
sbi->iostat_next_period = jiffies +
@@ -133,7 +133,7 @@ static inline void f2fs_record_iostat(struct f2fs_sb_info *sbi)
sbi->prev_rw_iostat[i];
sbi->prev_rw_iostat[i] = sbi->rw_iostat[i];
}
- spin_unlock(&sbi->iostat_lock);
+ spin_unlock_bh(&sbi->iostat_lock);

trace_f2fs_iostat(sbi, iostat_diff);

@@ -145,16 +145,16 @@ void f2fs_reset_iostat(struct f2fs_sb_info *sbi)
struct iostat_lat_info *io_lat = sbi->iostat_io_lat;
int i;

- spin_lock(&sbi->iostat_lock);
+ spin_lock_bh(&sbi->iostat_lock);
for (i = 0; i < NR_IO_TYPE; i++) {
sbi->rw_iostat[i] = 0;
sbi->prev_rw_iostat[i] = 0;
}
- spin_unlock(&sbi->iostat_lock);
+ spin_unlock_bh(&sbi->iostat_lock);

- spin_lock_irq(&sbi->iostat_lat_lock);
+ spin_lock_bh(&sbi->iostat_lat_lock);
memset(io_lat, 0, sizeof(struct iostat_lat_info));
- spin_unlock_irq(&sbi->iostat_lat_lock);
+ spin_unlock_bh(&sbi->iostat_lat_lock);
}

void f2fs_update_iostat(struct f2fs_sb_info *sbi,
@@ -163,19 +163,16 @@ void f2fs_update_iostat(struct f2fs_sb_info *sbi,
if (!sbi->iostat_enable)
return;

- spin_lock(&sbi->iostat_lock);
+ spin_lock_bh(&sbi->iostat_lock);
sbi->rw_iostat[type] += io_bytes;

- if (type == APP_WRITE_IO || type == APP_DIRECT_IO)
- sbi->rw_iostat[APP_BUFFERED_IO] =
- sbi->rw_iostat[APP_WRITE_IO] -
- sbi->rw_iostat[APP_DIRECT_IO];
+ if (type == APP_BUFFERED_IO || type == APP_DIRECT_IO)
+ sbi->rw_iostat[APP_WRITE_IO] += io_bytes;

- if (type == APP_READ_IO || type == APP_DIRECT_READ_IO)
- sbi->rw_iostat[APP_BUFFERED_READ_IO] =
- sbi->rw_iostat[APP_READ_IO] -
- sbi->rw_iostat[APP_DIRECT_READ_IO];
- spin_unlock(&sbi->iostat_lock);
+ if (type == APP_BUFFERED_READ_IO || type == APP_DIRECT_READ_IO)
+ sbi->rw_iostat[APP_READ_IO] += io_bytes;
+
+ spin_unlock_bh(&sbi->iostat_lock);

f2fs_record_iostat(sbi);
}
@@ -185,7 +182,6 @@ static inline void __update_iostat_latency(struct bio_iostat_ctx *iostat_ctx,
{
unsigned long ts_diff;
unsigned int iotype = iostat_ctx->type;
- unsigned long flags;
struct f2fs_sb_info *sbi = iostat_ctx->sbi;
struct iostat_lat_info *io_lat = sbi->iostat_io_lat;
int idx;
@@ -206,12 +202,12 @@ static inline void __update_iostat_latency(struct bio_iostat_ctx *iostat_ctx,
idx = WRITE_ASYNC_IO;
}

- spin_lock_irqsave(&sbi->iostat_lat_lock, flags);
+ spin_lock_bh(&sbi->iostat_lat_lock);
io_lat->sum_lat[idx][iotype] += ts_diff;
io_lat->bio_cnt[idx][iotype]++;
if (ts_diff > io_lat->peak_lat[idx][iotype])
io_lat->peak_lat[idx][iotype] = ts_diff;
- spin_unlock_irqrestore(&sbi->iostat_lat_lock, flags);
+ spin_unlock_bh(&sbi->iostat_lat_lock);
}

void iostat_update_and_unbind_ctx(struct bio *bio, int rw)
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-04 21:26:30

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH 5/6] f2fs: implement iomap operations

From: Eric Biggers <[email protected]>

Implement 'struct iomap_ops' for f2fs, in preparation for making f2fs
use iomap for direct I/O.

Note that this may be used for other things besides direct I/O in the
future; however, for now I've only tested it for direct I/O.

Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
---
fs/f2fs/Kconfig | 1 +
fs/f2fs/data.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/f2fs/f2fs.h | 1 +
3 files changed, 58 insertions(+)

diff --git a/fs/f2fs/Kconfig b/fs/f2fs/Kconfig
index 7eea3cfd894d..f46a7339d6cf 100644
--- a/fs/f2fs/Kconfig
+++ b/fs/f2fs/Kconfig
@@ -7,6 +7,7 @@ config F2FS_FS
select CRYPTO_CRC32
select F2FS_FS_XATTR if FS_ENCRYPTION
select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
+ select FS_IOMAP
select LZ4_COMPRESS if F2FS_FS_LZ4
select LZ4_DECOMPRESS if F2FS_FS_LZ4
select LZ4HC_COMPRESS if F2FS_FS_LZ4HC
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 9c867de1ec29..57e6a6f0daf9 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -21,6 +21,7 @@
#include <linux/cleancache.h>
#include <linux/sched/signal.h>
#include <linux/fiemap.h>
+#include <linux/iomap.h>

#include "f2fs.h"
#include "node.h"
@@ -4237,3 +4238,58 @@ void f2fs_destroy_bio_entry_cache(void)
{
kmem_cache_destroy(bio_entry_slab);
}
+
+static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+ unsigned int flags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct f2fs_map_blocks map = {};
+ pgoff_t next_pgofs = 0;
+ int err;
+
+ map.m_lblk = bytes_to_blks(inode, offset);
+ map.m_len = bytes_to_blks(inode, offset + length - 1) - map.m_lblk + 1;
+ map.m_next_pgofs = &next_pgofs;
+ map.m_seg_type = f2fs_rw_hint_to_seg_type(inode->i_write_hint);
+ if (flags & IOMAP_WRITE)
+ map.m_may_create = true;
+
+ err = f2fs_map_blocks(inode, &map, flags & IOMAP_WRITE,
+ F2FS_GET_BLOCK_DIO);
+ if (err)
+ return err;
+
+ iomap->offset = blks_to_bytes(inode, map.m_lblk);
+
+ if (map.m_flags & (F2FS_MAP_MAPPED | F2FS_MAP_UNWRITTEN)) {
+ iomap->length = blks_to_bytes(inode, map.m_len);
+ if (map.m_flags & F2FS_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
+ iomap->flags |= IOMAP_F_MERGED;
+ } else {
+ iomap->type = IOMAP_UNWRITTEN;
+ }
+ if (WARN_ON_ONCE(!__is_valid_data_blkaddr(map.m_pblk)))
+ return -EINVAL;
+
+ iomap->bdev = map.m_bdev;
+ iomap->addr = blks_to_bytes(inode, map.m_pblk);
+ } else {
+ iomap->length = blks_to_bytes(inode, next_pgofs) -
+ iomap->offset;
+ iomap->type = IOMAP_HOLE;
+ iomap->addr = IOMAP_NULL_ADDR;
+ }
+
+ if (map.m_flags & F2FS_MAP_NEW)
+ iomap->flags |= IOMAP_F_NEW;
+ if ((inode->i_state & I_DIRTY_DATASYNC) ||
+ offset + length > i_size_read(inode))
+ iomap->flags |= IOMAP_F_DIRTY;
+
+ return 0;
+}
+
+const struct iomap_ops f2fs_iomap_ops = {
+ .iomap_begin = f2fs_iomap_begin,
+};
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index d7435fcb9658..8242f47304a5 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -3655,6 +3655,7 @@ int f2fs_init_post_read_processing(void);
void f2fs_destroy_post_read_processing(void);
int f2fs_init_post_read_wq(struct f2fs_sb_info *sbi);
void f2fs_destroy_post_read_wq(struct f2fs_sb_info *sbi);
+extern const struct iomap_ops f2fs_iomap_ops;

/*
* gc.c
--
2.34.1.448.ga2b2bfdf31-goog


2022-01-05 13:19:24

by Chao Yu

[permalink] [raw]
Subject: Re: [f2fs-dev] [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

On 2022/1/5 5:24, Jaegeuk Kim wrote:
> DIO preallocates physical blocks before writing data, but if an error occurrs
> or power-cut happens, we can see block contents from the disk. This patch tries
> to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
> unwritten blocks from error or power-cut.
>
> Signed-off-by: Jaegeuk Kim <[email protected]>

Reviewed-by: Chao Yu <[email protected]>

Thanks,

2022-01-05 13:20:17

by Chao Yu

[permalink] [raw]
Subject: Re: [f2fs-dev] [PATCH 5/6] f2fs: implement iomap operations

On 2022/1/5 5:24, Jaegeuk Kim wrote:
> From: Eric Biggers <[email protected]>
>
> Implement 'struct iomap_ops' for f2fs, in preparation for making f2fs
> use iomap for direct I/O.
>
> Note that this may be used for other things besides direct I/O in the
> future; however, for now I've only tested it for direct I/O.
>
> Signed-off-by: Eric Biggers <[email protected]>
> Signed-off-by: Jaegeuk Kim <[email protected]>

Reviewed-by: Chao Yu <[email protected]>

Thanks,

2022-01-05 18:25:43

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [f2fs-dev] [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

On 01/05, Chao Yu wrote:
> On 2022/1/5 5:24, Jaegeuk Kim wrote:
> > DIO preallocates physical blocks before writing data, but if an error occurrs
> > or power-cut happens, we can see block contents from the disk. This patch tries
> > to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
> > unwritten blocks from error or power-cut.
> >
> > Signed-off-by: Jaegeuk Kim <[email protected]>
>
> Reviewed-by: Chao Yu <[email protected]>

Thank you for the review tho, I can't ruin the one month git history. Please
chime in if there's any bug in the patch.

>
> Thanks,

2022-01-05 18:26:05

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [f2fs-dev] [PATCH 5/6] f2fs: implement iomap operations

On 01/05, Chao Yu wrote:
> On 2022/1/5 5:24, Jaegeuk Kim wrote:
> > From: Eric Biggers <[email protected]>
> >
> > Implement 'struct iomap_ops' for f2fs, in preparation for making f2fs
> > use iomap for direct I/O.
> >
> > Note that this may be used for other things besides direct I/O in the
> > future; however, for now I've only tested it for direct I/O.
> >
> > Signed-off-by: Eric Biggers <[email protected]>
> > Signed-off-by: Jaegeuk Kim <[email protected]>
>
> Reviewed-by: Chao Yu <[email protected]>

Ditto. Thanks,

>
> Thanks,

2022-01-07 23:13:10

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

Hi Jaegeuk,

On Tue, Jan 04, 2022 at 01:24:16PM -0800, Jaegeuk Kim wrote:
> DIO preallocates physical blocks before writing data, but if an error occurrs
> or power-cut happens, we can see block contents from the disk. This patch tries
> to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
> unwritten blocks from error or power-cut.
>
> Signed-off-by: Jaegeuk Kim <[email protected]>
> ---
> fs/f2fs/data.c | 5 ++++-
> fs/f2fs/f2fs.h | 5 +++++
> fs/f2fs/file.c | 27 ++++++++++++++++++---------
> fs/f2fs/inode.c | 8 ++++++++
> 4 files changed, 35 insertions(+), 10 deletions(-)

Unfortunately, this patch doesn't completely fix the uninitialized data
exposure. The problem is that it only makes DIO writes fall back to buffered
writes for holes, and not for reserved blocks (NEW_ADDR). f2fs's reserved
blocks are *not* the same as the unwritten extents that other filesystems have;
f2fs's reserved blocks have to be turned into regular blocks before DIO can
write to them. That immediately exposes them to concurrent reads (at least
buffered reads, but I think DIO reads too).

This can be reproduced using the aiodio_sparse program from LTP, as follows:

aiodio_sparse -i 4 -a 8k -w 1024k -s 4096k -n 6

This was reported at
https://lore.kernel.org/r/20211226132851.GC34518@xsang-OptiPlex-9020 as a
regression from the commit "f2fs: use iomap for direct I/O", but it actually
failed before too. Note that it's nondeterministic; writing random data to the
block device before creating the filesystem helps with reproduction.

I see only three possible solutions:

* Make DIO writes to reserved blocks fall back to buffered writes, just like
writes to holes. This would mean that a file would have to be written to
before direct writes would work; fallocate() wouldn't be enough. Note that my
patch https://lore.kernel.org/r/[email protected]
would have done this.

* Or, change the f2fs on-disk format to support unwritten extents.

* Or, split up block allocation into two parts, so that blocks could be
preliminarily allocated and not exposed to reads yet. This would be like
Ted's suggestion here: https://lore.kernel.org/r/[email protected]

- Eric

2022-01-08 01:52:54

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

On 01/07, Eric Biggers wrote:
> Hi Jaegeuk,
>
> On Tue, Jan 04, 2022 at 01:24:16PM -0800, Jaegeuk Kim wrote:
> > DIO preallocates physical blocks before writing data, but if an error occurrs
> > or power-cut happens, we can see block contents from the disk. This patch tries
> > to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
> > unwritten blocks from error or power-cut.
> >
> > Signed-off-by: Jaegeuk Kim <[email protected]>
> > ---
> > fs/f2fs/data.c | 5 ++++-
> > fs/f2fs/f2fs.h | 5 +++++
> > fs/f2fs/file.c | 27 ++++++++++++++++++---------
> > fs/f2fs/inode.c | 8 ++++++++
> > 4 files changed, 35 insertions(+), 10 deletions(-)
>
> Unfortunately, this patch doesn't completely fix the uninitialized data
> exposure. The problem is that it only makes DIO writes fall back to buffered
> writes for holes, and not for reserved blocks (NEW_ADDR). f2fs's reserved
> blocks are *not* the same as the unwritten extents that other filesystems have;
> f2fs's reserved blocks have to be turned into regular blocks before DIO can
> write to them. That immediately exposes them to concurrent reads (at least
> buffered reads, but I think DIO reads too).

Isn't it resolved by i_size which gives the written blocks only?

>
> This can be reproduced using the aiodio_sparse program from LTP, as follows:
>
> aiodio_sparse -i 4 -a 8k -w 1024k -s 4096k -n 6
>
> This was reported at
> https://lore.kernel.org/r/20211226132851.GC34518@xsang-OptiPlex-9020 as a
> regression from the commit "f2fs: use iomap for direct I/O", but it actually
> failed before too. Note that it's nondeterministic; writing random data to the
> block device before creating the filesystem helps with reproduction.
>
> I see only three possible solutions:
>
> * Make DIO writes to reserved blocks fall back to buffered writes, just like
> writes to holes. This would mean that a file would have to be written to
> before direct writes would work; fallocate() wouldn't be enough. Note that my
> patch https://lore.kernel.org/r/[email protected]
> would have done this.
>
> * Or, change the f2fs on-disk format to support unwritten extents.
>
> * Or, split up block allocation into two parts, so that blocks could be
> preliminarily allocated and not exposed to reads yet. This would be like
> Ted's suggestion here: https://lore.kernel.org/r/[email protected]
>
> - Eric

2022-01-08 02:21:34

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

On Fri, Jan 07, 2022 at 05:52:48PM -0800, Jaegeuk Kim wrote:
> On 01/07, Eric Biggers wrote:
> > Hi Jaegeuk,
> >
> > On Tue, Jan 04, 2022 at 01:24:16PM -0800, Jaegeuk Kim wrote:
> > > DIO preallocates physical blocks before writing data, but if an error occurrs
> > > or power-cut happens, we can see block contents from the disk. This patch tries
> > > to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
> > > unwritten blocks from error or power-cut.
> > >
> > > Signed-off-by: Jaegeuk Kim <[email protected]>
> > > ---
> > > fs/f2fs/data.c | 5 ++++-
> > > fs/f2fs/f2fs.h | 5 +++++
> > > fs/f2fs/file.c | 27 ++++++++++++++++++---------
> > > fs/f2fs/inode.c | 8 ++++++++
> > > 4 files changed, 35 insertions(+), 10 deletions(-)
> >
> > Unfortunately, this patch doesn't completely fix the uninitialized data
> > exposure. The problem is that it only makes DIO writes fall back to buffered
> > writes for holes, and not for reserved blocks (NEW_ADDR). f2fs's reserved
> > blocks are *not* the same as the unwritten extents that other filesystems have;
> > f2fs's reserved blocks have to be turned into regular blocks before DIO can
> > write to them. That immediately exposes them to concurrent reads (at least
> > buffered reads, but I think DIO reads too).
>
> Isn't it resolved by i_size which gives the written blocks only?
>

I'm not sure what you mean, but this is for non-extending writes, so i_size
isn't relevant.

- Eric

2022-01-08 03:36:06

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

On 01/07, Eric Biggers wrote:
> On Fri, Jan 07, 2022 at 05:52:48PM -0800, Jaegeuk Kim wrote:
> > On 01/07, Eric Biggers wrote:
> > > Hi Jaegeuk,
> > >
> > > On Tue, Jan 04, 2022 at 01:24:16PM -0800, Jaegeuk Kim wrote:
> > > > DIO preallocates physical blocks before writing data, but if an error occurrs
> > > > or power-cut happens, we can see block contents from the disk. This patch tries
> > > > to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
> > > > unwritten blocks from error or power-cut.
> > > >
> > > > Signed-off-by: Jaegeuk Kim <[email protected]>
> > > > ---
> > > > fs/f2fs/data.c | 5 ++++-
> > > > fs/f2fs/f2fs.h | 5 +++++
> > > > fs/f2fs/file.c | 27 ++++++++++++++++++---------
> > > > fs/f2fs/inode.c | 8 ++++++++
> > > > 4 files changed, 35 insertions(+), 10 deletions(-)
> > >
> > > Unfortunately, this patch doesn't completely fix the uninitialized data
> > > exposure. The problem is that it only makes DIO writes fall back to buffered
> > > writes for holes, and not for reserved blocks (NEW_ADDR). f2fs's reserved
> > > blocks are *not* the same as the unwritten extents that other filesystems have;
> > > f2fs's reserved blocks have to be turned into regular blocks before DIO can
> > > write to them. That immediately exposes them to concurrent reads (at least
> > > buffered reads, but I think DIO reads too).
> >
> > Isn't it resolved by i_size which gives the written blocks only?
> >
>
> I'm not sure what you mean, but this is for non-extending writes, so i_size
> isn't relevant.

Ah, do you mean the file has NEW_ADDR within i_size? If so, let me continue
to investigate further based on the current -dev, as it's quite hard to remove
the old commits.

>
> - Eric

2022-01-08 04:26:36

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH 3/6] f2fs: do not expose unwritten blocks to user by DIO

On Fri, Jan 07, 2022 at 07:35:56PM -0800, Jaegeuk Kim wrote:
> On 01/07, Eric Biggers wrote:
> > On Fri, Jan 07, 2022 at 05:52:48PM -0800, Jaegeuk Kim wrote:
> > > On 01/07, Eric Biggers wrote:
> > > > Hi Jaegeuk,
> > > >
> > > > On Tue, Jan 04, 2022 at 01:24:16PM -0800, Jaegeuk Kim wrote:
> > > > > DIO preallocates physical blocks before writing data, but if an error occurrs
> > > > > or power-cut happens, we can see block contents from the disk. This patch tries
> > > > > to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
> > > > > unwritten blocks from error or power-cut.
> > > > >
> > > > > Signed-off-by: Jaegeuk Kim <[email protected]>
> > > > > ---
> > > > > fs/f2fs/data.c | 5 ++++-
> > > > > fs/f2fs/f2fs.h | 5 +++++
> > > > > fs/f2fs/file.c | 27 ++++++++++++++++++---------
> > > > > fs/f2fs/inode.c | 8 ++++++++
> > > > > 4 files changed, 35 insertions(+), 10 deletions(-)
> > > >
> > > > Unfortunately, this patch doesn't completely fix the uninitialized data
> > > > exposure. The problem is that it only makes DIO writes fall back to buffered
> > > > writes for holes, and not for reserved blocks (NEW_ADDR). f2fs's reserved
> > > > blocks are *not* the same as the unwritten extents that other filesystems have;
> > > > f2fs's reserved blocks have to be turned into regular blocks before DIO can
> > > > write to them. That immediately exposes them to concurrent reads (at least
> > > > buffered reads, but I think DIO reads too).
> > >
> > > Isn't it resolved by i_size which gives the written blocks only?
> > >
> >
> > I'm not sure what you mean, but this is for non-extending writes, so i_size
> > isn't relevant.
>
> Ah, do you mean the file has NEW_ADDR within i_size? If so, let me continue
> to investigate further based on the current -dev, as it's quite hard to remove
> the old commits.
>

Yes, "NEW_ADDR within i_size" is the intended result of fallocate() on f2fs,
right? The problem is that DIO writes convert the NEW_ADDR blocks to real
blocks, which makes uninitialized data immediately visible to reads before the
write actually happens.

- Eric