LinuxLists.cc - [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read

2021-04-12 10:24:29

Subject: [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read

Hello,

This patch series fixes a bug manifesting as occasional generic/418 failure
when direct IO write extending a file races with buffered read. Patch 1
extends iomap DIO code to pass original IO size to ->end_dio handler so that
ext4 can tell whether everything has succeeded and we don't need expensive
DIO cleanup (possible truncate of leftover blocks beyond i_size). Patch 2
fixes the ext4 bug, patch 3 fixes unrelated problem I've found in ext4 DIO
code.

Honza

2021-04-12 10:24:52

by Jan Kara

[permalink] [raw]

Subject: [PATCH 2/3] ext4: Fix occasional generic/418 failure

Eric has noticed that after pagecache read rework, generic/418 is
occasionally failing for ext4 when blocksize < pagesize. In fact, the
pagecache rework just made hard to hit race in ext4 more likely. The
problem is that since ext4 conversion of direct IO writes to iomap
framework (commit 378f32bab371), we update inode size after direct IO
write only after invalidating page cache. Thus if buffered read sneaks
at unfortunate moment like:

CPU1 - write at offset 1k CPU2 - read from offset 0
iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
ext4_readpage();
ext4_handle_inode_extension()

the read will zero out tail of the page as it still sees smaller inode
size and thus page cache becomes inconsistent with on-disk contents with
all the consequences.

Fix the problem by moving inode size update into end_io handler which
gets called before the page cache is invalidated.

Reported-by: Eric Whitney <[email protected]>
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 185 ++++++++++++++++++++++++++-----------------------
1 file changed, 99 insertions(+), 86 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2505313d96b0..ec59ea078f53 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -280,11 +280,67 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
return ret;
}

+static int ext4_prepare_dio_extend(struct inode *inode)
+{
+ handle_t *handle;
+ int ret;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ext4_fc_start_update(inode);
+ ret = ext4_orphan_add(handle, inode);
+ ext4_fc_stop_update(inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ return ret;
+ }
+ ext4_journal_stop(handle);
+ return 0;
+}
+
+static void ext4_cleanup_dio_extend(struct inode *inode, loff_t offset,
+ ssize_t written, size_t count)
+{
+ handle_t *handle;
+ u8 blkbits = inode->i_blkbits;
+ ext4_lblk_t written_blk, end_blk;
+
+ written_blk = ALIGN(offset + written, 1 << blkbits);
+ end_blk = ALIGN(offset + count, 1 << blkbits);
+ if (written_blk < end_blk && ext4_can_truncate(inode)) {
+ ext4_truncate_failed_write(inode);
+ /*
+ * If the truncate operation failed early, then the inode may
+ * still be on the orphan list. In that case, we need to try
+ * remove the inode from the in-memory linked list.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ return;
+ }
+
+ /*
+ * We need to ensure that the inode is removed from the orphan
+ * list if it has been added prematurely, due to writeback of
+ * delalloc blocks.
+ */
+ if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ext4_orphan_del(NULL, inode);
+ return;
+ }
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
+}
+
static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
ssize_t written, size_t count)
{
handle_t *handle;
- bool truncate = false;
u8 blkbits = inode->i_blkbits;
ext4_lblk_t written_blk, end_blk;
int ret;
@@ -298,73 +354,31 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
* as much as we intended.
*/
WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
- if (offset + count <= EXT4_I(inode)->i_disksize) {
- /*
- * We need to ensure that the inode is removed from the orphan
- * list if it has been added prematurely, due to writeback of
- * delalloc blocks.
- */
- if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-
- if (IS_ERR(handle)) {
- ext4_orphan_del(NULL, inode);
- return PTR_ERR(handle);
- }
-
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- }
-
+ if (offset + count <= EXT4_I(inode)->i_disksize)
return written;
- }
-
- if (written < 0)
- goto truncate;

handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- written = PTR_ERR(handle);
- goto truncate;
- }
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);

if (ext4_update_inode_size(inode, offset + written)) {
ret = ext4_mark_inode_dirty(handle, inode);
if (unlikely(ret)) {
- written = ret;
ext4_journal_stop(handle);
- goto truncate;
+ return ret;
}
}

/*
- * We may need to truncate allocated but not written blocks beyond EOF.
+ * If there cannot be allocated but unused blocks beyond EOF, remove
+ * the inode from the orphan list as we are done with it.
*/
written_blk = ALIGN(offset + written, 1 << blkbits);
end_blk = ALIGN(offset + count, 1 << blkbits);
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
-
- /*
- * Remove the inode from the orphan list if it has been extended and
- * everything went OK.
- */
- if (!truncate && inode->i_nlink)
+ if (written_blk == end_blk && inode->i_nlink)
ext4_orphan_del(handle, inode);
ext4_journal_stop(handle);

- if (truncate) {
-truncate:
- ext4_truncate_failed_write(inode);
- /*
- * If the truncate operation failed early, then the inode may
- * still be on the orphan list. In that case, we need to try
- * remove the inode from the in-memory linked list.
- */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
- }
-
return written;
}

@@ -385,10 +399,28 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
return 0;
}

+static int ext4_dio_extending_write_end_io(struct kiocb *iocb, ssize_t size,
+ ssize_t orig_size, int error,
+ unsigned int flags)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ int ret;
+
+ ret = ext4_dio_write_end_io(iocb, size, orig_size, error, flags);
+ if (ret < 0)
+ return ret;
+ return ext4_handle_inode_extension(inode, iocb->ki_pos, size,
+ orig_size);
+}
+
static const struct iomap_dio_ops ext4_dio_write_ops = {
.end_io = ext4_dio_write_end_io,
};

+static const struct iomap_dio_ops ext4_dio_extending_write_ops = {
+ .end_io = ext4_dio_extending_write_end_io,
+};
+
/*
* The intention here is to start with shared lock acquired then see if any
* condition requires an exclusive inode lock. If yes, then we restart the
@@ -455,11 +487,11 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
ssize_t ret;
- handle_t *handle;
struct inode *inode = file_inode(iocb->ki_filp);
loff_t offset = iocb->ki_pos;
size_t count = iov_iter_count(from);
const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
+ const struct iomap_dio_ops *dio_ops = &ext4_dio_write_ops;
bool extend = false, unaligned_io = false;
bool ilock_shared = true;

@@ -530,33 +562,21 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
inode_dio_wait(inode);

if (extend) {
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
+ ret = ext4_prepare_dio_extend(inode);
+ if (ret)
goto out;
- }
-
- ext4_fc_start_update(inode);
- ret = ext4_orphan_add(handle, inode);
- ext4_fc_stop_update(inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
-
- ext4_journal_stop(handle);
}

if (ilock_shared)
iomap_ops = &ext4_iomap_overwrite_ops;
- ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
- (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
- if (ret == -ENOTBLK)
- ret = 0;
-
if (extend)
- ret = ext4_handle_inode_extension(inode, offset, ret, count);
+ dio_ops = &ext4_dio_extending_write_ops;

+ ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
+ (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
+ if (ret == -ENOTBLK)
+ ret = 0;
+ ext4_cleanup_dio_extend(inode, offset, ret, count);
out:
if (ilock_shared)
inode_unlock_shared(inode);
@@ -599,7 +619,6 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
ssize_t ret;
size_t count;
loff_t offset;
- handle_t *handle;
bool extend = false;
struct inode *inode = file_inode(iocb->ki_filp);

@@ -618,26 +637,20 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
count = iov_iter_count(from);

if (offset + count > EXT4_I(inode)->i_disksize) {
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
-
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
+ ret = ext4_prepare_dio_extend(inode);
+ if (ret < 0)
goto out;
- }
-
extend = true;
- ext4_journal_stop(handle);
}

ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);

- if (extend)
- ret = ext4_handle_inode_extension(inode, offset, ret, count);
+ if (extend) {
+ if (ret > 0)
+ ret = ext4_handle_inode_extension(inode, offset, ret,
+ count);
+ ext4_cleanup_dio_extend(inode, offset, ret, count);
+ }
out:
inode_unlock(inode);
if (ret > 0)
--
2.26.2

2021-04-12 10:24:53

by Jan Kara

[permalink] [raw]

Subject: [PATCH 1/3] iomap: Pass original DIO size to completion handler

When extending a file with direct IO write, ext4 needs to know whether IO
has succeeded for the whole range that was prepared for it (the common fast
path). In that case we can piggy back the orphan list update with the
inode size update. In case write was actually shorter than originally
intended, we must leave inode on the orphan list until we cleanup blocks
beyond i_size. So provide the original IO size to the direct IO ->end_io
handler.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 3 ++-
fs/iomap/direct-io.c | 5 ++++-
fs/xfs/xfs_file.c | 1 +
fs/zonefs/super.c | 3 ++-
include/linux/iomap.h | 4 ++--
5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 194f5d00fa32..2505313d96b0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -369,7 +369,8 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
}

static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
- int error, unsigned int flags)
+ ssize_t orig_size, int error,
+ unsigned int flags)
{
loff_t offset = iocb->ki_pos;
struct inode *inode = file_inode(iocb->ki_filp);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bdd0d89bbf0a..a4bbf22f69bc 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -28,6 +28,7 @@ struct iomap_dio {
const struct iomap_dio_ops *dops;
loff_t i_size;
loff_t size;
+ loff_t orig_size;
atomic_t ref;
unsigned flags;
int error;
@@ -85,7 +86,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
ssize_t ret = dio->error;

if (dops && dops->end_io)
- ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+ ret = dops->end_io(iocb, dio->size, dio->orig_size, ret,
+ dio->flags);

if (likely(!ret)) {
ret = dio->size;
@@ -473,6 +475,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
dio->iocb = iocb;
atomic_set(&dio->ref, 1);
dio->size = 0;
+ dio->orig_size = count;
dio->i_size = i_size_read(inode);
dio->dops = dops;
dio->error = 0;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a007ca0711d9..ed23ed56e345 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -442,6 +442,7 @@ static int
xfs_dio_write_end_io(
struct kiocb *iocb,
ssize_t size,
+ ssize_t orig_size,
int error,
unsigned flags)
{
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 049e36c69ed7..e3e0ee4b8c6e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -652,7 +652,8 @@ static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence)
}

static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
- int error, unsigned int flags)
+ ssize_t orig_size, int error,
+ unsigned int flags)
{
struct inode *inode = file_inode(iocb->ki_filp);
struct zonefs_inode_info *zi = ZONEFS_I(inode);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index d202fd2d0f91..b175641e9540 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -252,8 +252,8 @@ int iomap_writepages(struct address_space *mapping,
#define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */

struct iomap_dio_ops {
- int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
- unsigned flags);
+ int (*end_io)(struct kiocb *iocb, ssize_t size, ssize_t orig_size,
+ int error, unsigned flags);
blk_qc_t (*submit_io)(struct inode *inode, struct iomap *iomap,
struct bio *bio, loff_t file_offset);
};
--
2.26.2

2021-04-12 14:12:00

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler

Hi Jan,

I love your patch! Yet something to improve:

[auto build test ERROR on ext4/dev]
[also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: riscv-randconfig-r004-20210412 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install riscv cross compiling tool for clang build
# apt-get install binutils-riscv64-linux-gnu
# https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
git checkout 0d289243d061378ac42188ff5079287885575bb3
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> fs/zonefs/super.c:732:49: error: too few arguments to function call, expected 5, have 4
zonefs_file_write_dio_end_io(iocb, size, ret, 0);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
fs/zonefs/super.c:654:12: note: 'zonefs_file_write_dio_end_io' declared here
static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
^
>> fs/zonefs/super.c:961:14: error: incompatible function pointer types initializing 'int (*)(struct kiocb *, ssize_t, ssize_t, int, unsigned int)' (aka 'int (*)(struct kiocb *, long, long, int, unsigned int)') with an expression of type 'int (struct kiocb *, ssize_t, int, unsigned int)' (aka 'int (struct kiocb *, long, int, unsigned int)') [-Werror,-Wincompatible-function-pointer-types]
.end_io = zonefs_file_read_dio_end_io,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.

vim +732 fs/zonefs/super.c

8dcc1a9d90c10f Damien Le Moal 2019-12-25 688
02ef12a663c7ac Johannes Thumshirn 2020-05-12 689 static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 690 {
02ef12a663c7ac Johannes Thumshirn 2020-05-12 691 struct inode *inode = file_inode(iocb->ki_filp);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 692 struct zonefs_inode_info *zi = ZONEFS_I(inode);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 693 struct block_device *bdev = inode->i_sb->s_bdev;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 694 unsigned int max;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 695 struct bio *bio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 696 ssize_t size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 697 int nr_pages;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 698 ssize_t ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 699
02ef12a663c7ac Johannes Thumshirn 2020-05-12 700 max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
02ef12a663c7ac Johannes Thumshirn 2020-05-12 701 max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 702 iov_iter_truncate(from, max);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 703
a8affc03a9b375 Christoph Hellwig 2021-03-11 704 nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
89ee72376be23a Johannes Thumshirn 2020-07-16 705 if (!nr_pages)
89ee72376be23a Johannes Thumshirn 2020-07-16 706 return 0;
89ee72376be23a Johannes Thumshirn 2020-07-16 707
f91ca2a370bec5 Christoph Hellwig 2021-01-26 708 bio = bio_alloc(GFP_NOFS, nr_pages);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 709 if (!bio)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 710 return -ENOMEM;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 711
02ef12a663c7ac Johannes Thumshirn 2020-05-12 712 bio_set_dev(bio, bdev);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 713 bio->bi_iter.bi_sector = zi->i_zsector;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 714 bio->bi_write_hint = iocb->ki_hint;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 715 bio->bi_ioprio = iocb->ki_ioprio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 716 bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 717 if (iocb->ki_flags & IOCB_DSYNC)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 718 bio->bi_opf |= REQ_FUA;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 719
02ef12a663c7ac Johannes Thumshirn 2020-05-12 720 ret = bio_iov_iter_get_pages(bio, from);
6bea0225a4bf14 Damien Le Moal 2020-12-09 721 if (unlikely(ret))
6bea0225a4bf14 Damien Le Moal 2020-12-09 722 goto out_release;
6bea0225a4bf14 Damien Le Moal 2020-12-09 723
02ef12a663c7ac Johannes Thumshirn 2020-05-12 724 size = bio->bi_iter.bi_size;
6bea0225a4bf14 Damien Le Moal 2020-12-09 725 task_io_account_write(size);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 726
02ef12a663c7ac Johannes Thumshirn 2020-05-12 727 if (iocb->ki_flags & IOCB_HIPRI)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 728 bio_set_polled(bio, iocb);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 729
02ef12a663c7ac Johannes Thumshirn 2020-05-12 730 ret = submit_bio_wait(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 731
6bea0225a4bf14 Damien Le Moal 2020-12-09 @732 zonefs_file_write_dio_end_io(iocb, size, ret, 0);
62ab1aadcccd03 Johannes Thumshirn 2021-01-27 733 trace_zonefs_file_dio_append(inode, size, ret);
6bea0225a4bf14 Damien Le Moal 2020-12-09 734
6bea0225a4bf14 Damien Le Moal 2020-12-09 735 out_release:
6bea0225a4bf14 Damien Le Moal 2020-12-09 736 bio_release_pages(bio, false);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 737 bio_put(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 738
02ef12a663c7ac Johannes Thumshirn 2020-05-12 739 if (ret >= 0) {
02ef12a663c7ac Johannes Thumshirn 2020-05-12 740 iocb->ki_pos += size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 741 return size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 742 }
02ef12a663c7ac Johannes Thumshirn 2020-05-12 743
02ef12a663c7ac Johannes Thumshirn 2020-05-12 744 return ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 745 }
02ef12a663c7ac Johannes Thumshirn 2020-05-12 746

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

Attachments:

(No filename) (7.06 kB)
.config.gz (33.19 kB)
Download all attachments

2021-04-12 14:15:17

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler

Attachments:

(No filename) (2.96 kB)
.config.gz (23.81 kB)
Download all attachments

2021-04-12 16:51:16

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler

Hi Jan,

I love your patch! Yet something to improve:

[auto build test ERROR on ext4/dev]
[also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210412]
[cannot apply to tytso-fscrypt/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: openrisc-randconfig-r032-20210412 (attached as .config)
compiler: or1k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
git checkout 0d289243d061378ac42188ff5079287885575bb3
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=openrisc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

fs/zonefs/super.c: In function 'zonefs_file_dio_append':
>> fs/zonefs/super.c:732:2: error: too few arguments to function 'zonefs_file_write_dio_end_io'
732 | zonefs_file_write_dio_end_io(iocb, size, ret, 0);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/zonefs/super.c:654:12: note: declared here
654 | static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/zonefs/super.c: At top level:
>> fs/zonefs/super.c:961:14: error: initialization of 'int (*)(struct kiocb *, ssize_t, ssize_t, int, unsigned int)' {aka 'int (*)(struct kiocb *, int, int, int, unsigned int)'} from incompatible pointer type 'int (*)(struct kiocb *, ssize_t, int, unsigned int)' {aka 'int (*)(struct kiocb *, int, int, unsigned int)'} [-Werror=incompatible-pointer-types]
961 | .end_io = zonefs_file_read_dio_end_io,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/zonefs/super.c:961:14: note: (near initialization for 'zonefs_read_dio_ops.end_io')
cc1: some warnings being treated as errors

vim +/zonefs_file_write_dio_end_io +732 fs/zonefs/super.c

8dcc1a9d90c10f Damien Le Moal 2019-12-25 688
02ef12a663c7ac Johannes Thumshirn 2020-05-12 689 static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 690 {
02ef12a663c7ac Johannes Thumshirn 2020-05-12 691 struct inode *inode = file_inode(iocb->ki_filp);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 692 struct zonefs_inode_info *zi = ZONEFS_I(inode);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 693 struct block_device *bdev = inode->i_sb->s_bdev;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 694 unsigned int max;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 695 struct bio *bio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 696 ssize_t size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 697 int nr_pages;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 698 ssize_t ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 699
02ef12a663c7ac Johannes Thumshirn 2020-05-12 700 max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
02ef12a663c7ac Johannes Thumshirn 2020-05-12 701 max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 702 iov_iter_truncate(from, max);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 703
a8affc03a9b375 Christoph Hellwig 2021-03-11 704 nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
89ee72376be23a Johannes Thumshirn 2020-07-16 705 if (!nr_pages)
89ee72376be23a Johannes Thumshirn 2020-07-16 706 return 0;
89ee72376be23a Johannes Thumshirn 2020-07-16 707
f91ca2a370bec5 Christoph Hellwig 2021-01-26 708 bio = bio_alloc(GFP_NOFS, nr_pages);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 709 if (!bio)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 710 return -ENOMEM;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 711
02ef12a663c7ac Johannes Thumshirn 2020-05-12 712 bio_set_dev(bio, bdev);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 713 bio->bi_iter.bi_sector = zi->i_zsector;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 714 bio->bi_write_hint = iocb->ki_hint;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 715 bio->bi_ioprio = iocb->ki_ioprio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 716 bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 717 if (iocb->ki_flags & IOCB_DSYNC)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 718 bio->bi_opf |= REQ_FUA;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 719
02ef12a663c7ac Johannes Thumshirn 2020-05-12 720 ret = bio_iov_iter_get_pages(bio, from);
6bea0225a4bf14 Damien Le Moal 2020-12-09 721 if (unlikely(ret))
6bea0225a4bf14 Damien Le Moal 2020-12-09 722 goto out_release;
6bea0225a4bf14 Damien Le Moal 2020-12-09 723
02ef12a663c7ac Johannes Thumshirn 2020-05-12 724 size = bio->bi_iter.bi_size;
6bea0225a4bf14 Damien Le Moal 2020-12-09 725 task_io_account_write(size);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 726
02ef12a663c7ac Johannes Thumshirn 2020-05-12 727 if (iocb->ki_flags & IOCB_HIPRI)
02ef12a663c7ac Johannes Thumshirn 2020-05-12 728 bio_set_polled(bio, iocb);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 729
02ef12a663c7ac Johannes Thumshirn 2020-05-12 730 ret = submit_bio_wait(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 731
6bea0225a4bf14 Damien Le Moal 2020-12-09 @732 zonefs_file_write_dio_end_io(iocb, size, ret, 0);
62ab1aadcccd03 Johannes Thumshirn 2021-01-27 733 trace_zonefs_file_dio_append(inode, size, ret);
6bea0225a4bf14 Damien Le Moal 2020-12-09 734
6bea0225a4bf14 Damien Le Moal 2020-12-09 735 out_release:
6bea0225a4bf14 Damien Le Moal 2020-12-09 736 bio_release_pages(bio, false);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 737 bio_put(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12 738
02ef12a663c7ac Johannes Thumshirn 2020-05-12 739 if (ret >= 0) {
02ef12a663c7ac Johannes Thumshirn 2020-05-12 740 iocb->ki_pos += size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 741 return size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 742 }
02ef12a663c7ac Johannes Thumshirn 2020-05-12 743
02ef12a663c7ac Johannes Thumshirn 2020-05-12 744 return ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12 745 }
02ef12a663c7ac Johannes Thumshirn 2020-05-12 746

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

Attachments:

(No filename) (7.11 kB)
.config.gz (21.05 kB)
Download all attachments

2021-04-13 07:18:05

by Dave Chinner

[permalink] [raw]

Subject: Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure

On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> Eric has noticed that after pagecache read rework, generic/418 is
> occasionally failing for ext4 when blocksize < pagesize. In fact, the
> pagecache rework just made hard to hit race in ext4 more likely. The
> problem is that since ext4 conversion of direct IO writes to iomap
> framework (commit 378f32bab371), we update inode size after direct IO
> write only after invalidating page cache. Thus if buffered read sneaks
> at unfortunate moment like:
>
> CPU1 - write at offset 1k CPU2 - read from offset 0
> iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> ext4_readpage();
> ext4_handle_inode_extension()
>
> the read will zero out tail of the page as it still sees smaller inode
> size and thus page cache becomes inconsistent with on-disk contents with
> all the consequences.
>
> Fix the problem by moving inode size update into end_io handler which
> gets called before the page cache is invalidated.

Confused.

This moves all the inode extension stuff into the completion
handler, when all that really needs to be done is extending
inode->i_size to tell the world there is data up to where the
IO completed. Actually removing the inode from the orphan list
does not need to be done in the IO completion callback, because...

> if (ilock_shared)
> iomap_ops = &ext4_iomap_overwrite_ops;
> - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> - if (ret == -ENOTBLK)
> - ret = 0;
> -
> if (extend)
> - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> + dio_ops = &ext4_dio_extending_write_ops;
>
> + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
^^^^^^ ^^^^^^^^^^^^^^^^^^^

.... if we are doing an extending write, we force DIO to complete
before returning. Hence even AIO will block here on an extending
write, and hence we can -always- do the correct post-IO completion
orphan list cleanup here because we know a) the original IO size and
b) the amount of data that was actually written.

Hence all that remains is closing the buffered read vs invalidation
race. All this requires is for the dio write completion to behave
like XFS where it just does the inode->i_size update for extending
writes. THis means the size is updated before the invalidation, and
hence any read that occurs after the invalidation but before the
post-eof blocks have been removed will see the correct size and read
the tail page(s) correctly. This closes the race window, and the
caller can still handle the post-eof block cleanup as it does now.

Hence I don't see any need for changing the iomap infrastructure to
solve this problem. This seems like the obvious solution to me, so
what am I missing?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-04-13 09:45:43

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure

On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > Eric has noticed that after pagecache read rework, generic/418 is
> > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > pagecache rework just made hard to hit race in ext4 more likely. The
> > problem is that since ext4 conversion of direct IO writes to iomap
> > framework (commit 378f32bab371), we update inode size after direct IO
> > write only after invalidating page cache. Thus if buffered read sneaks
> > at unfortunate moment like:
> >
> > CPU1 - write at offset 1k CPU2 - read from offset 0
> > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> > ext4_readpage();
> > ext4_handle_inode_extension()
> >
> > the read will zero out tail of the page as it still sees smaller inode
> > size and thus page cache becomes inconsistent with on-disk contents with
> > all the consequences.
> >
> > Fix the problem by moving inode size update into end_io handler which
> > gets called before the page cache is invalidated.
>
> Confused.
>
> This moves all the inode extension stuff into the completion
> handler, when all that really needs to be done is extending
> inode->i_size to tell the world there is data up to where the
> IO completed. Actually removing the inode from the orphan list
> does not need to be done in the IO completion callback, because...
>
> > if (ilock_shared)
> > iomap_ops = &ext4_iomap_overwrite_ops;
> > - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > - if (ret == -ENOTBLK)
> > - ret = 0;
> > -
> > if (extend)
> > - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > + dio_ops = &ext4_dio_extending_write_ops;
> >
> > + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
> ^^^^^^ ^^^^^^^^^^^^^^^^^^^
>
> .... if we are doing an extending write, we force DIO to complete
> before returning. Hence even AIO will block here on an extending
> write, and hence we can -always- do the correct post-IO completion
> orphan list cleanup here because we know a) the original IO size and
> b) the amount of data that was actually written.
>
> Hence all that remains is closing the buffered read vs invalidation
> race. All this requires is for the dio write completion to behave
> like XFS where it just does the inode->i_size update for extending
> writes. THis means the size is updated before the invalidation, and
> hence any read that occurs after the invalidation but before the
> post-eof blocks have been removed will see the correct size and read
> the tail page(s) correctly. This closes the race window, and the
> caller can still handle the post-eof block cleanup as it does now.
>
> Hence I don't see any need for changing the iomap infrastructure to
> solve this problem. This seems like the obvious solution to me, so
> what am I missing?

All that you write above is correct. The missing piece is: If everything
succeeded and all the cleanup we need is removing inode from the orphan
list (common case), we want to piggyback that orphan list removal into the
same transaction handle as the update of the inode size. This is just a
performance thing, you are absolutely right we could also do the orphan
cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes
to the iomap framework.

OK, now that I write about this, maybe I was just too hung up on the
performance improvement. Probably a better way forward is that I just fix
the data corruption bug only inside ext4 (that will be also much easier to
backport) and then submit the performance improvement modifying iomap if I
can actually get performance data justifying it. Thanks for poking into
this :)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-04-14 07:24:26

by Dave Chinner

[permalink] [raw]

Subject: Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure

On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > > Eric has noticed that after pagecache read rework, generic/418 is
> > > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > > pagecache rework just made hard to hit race in ext4 more likely. The
> > > problem is that since ext4 conversion of direct IO writes to iomap
> > > framework (commit 378f32bab371), we update inode size after direct IO
> > > write only after invalidating page cache. Thus if buffered read sneaks
> > > at unfortunate moment like:
> > >
> > > CPU1 - write at offset 1k CPU2 - read from offset 0
> > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> > > ext4_readpage();
> > > ext4_handle_inode_extension()
> > >
> > > the read will zero out tail of the page as it still sees smaller inode
> > > size and thus page cache becomes inconsistent with on-disk contents with
> > > all the consequences.
> > >
> > > Fix the problem by moving inode size update into end_io handler which
> > > gets called before the page cache is invalidated.
> >
> > Confused.
> >
> > This moves all the inode extension stuff into the completion
> > handler, when all that really needs to be done is extending
> > inode->i_size to tell the world there is data up to where the
> > IO completed. Actually removing the inode from the orphan list
> > does not need to be done in the IO completion callback, because...
> >
> > > if (ilock_shared)
> > > iomap_ops = &ext4_iomap_overwrite_ops;
> > > - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > - if (ret == -ENOTBLK)
> > > - ret = 0;
> > > -
> > > if (extend)
> > > - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > + dio_ops = &ext4_dio_extending_write_ops;
> > >
> > > + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > > + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
> > ^^^^^^ ^^^^^^^^^^^^^^^^^^^
> >
> > .... if we are doing an extending write, we force DIO to complete
> > before returning. Hence even AIO will block here on an extending
> > write, and hence we can -always- do the correct post-IO completion
> > orphan list cleanup here because we know a) the original IO size and
> > b) the amount of data that was actually written.
> >
> > Hence all that remains is closing the buffered read vs invalidation
> > race. All this requires is for the dio write completion to behave
> > like XFS where it just does the inode->i_size update for extending
> > writes. THis means the size is updated before the invalidation, and
> > hence any read that occurs after the invalidation but before the
> > post-eof blocks have been removed will see the correct size and read
> > the tail page(s) correctly. This closes the race window, and the
> > caller can still handle the post-eof block cleanup as it does now.
> >
> > Hence I don't see any need for changing the iomap infrastructure to
> > solve this problem. This seems like the obvious solution to me, so
> > what am I missing?
>
> All that you write above is correct. The missing piece is: If everything
> succeeded and all the cleanup we need is removing inode from the orphan
> list (common case), we want to piggyback that orphan list removal into the
> same transaction handle as the update of the inode size. This is just a
> performance thing, you are absolutely right we could also do the orphan
> cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes
> to the iomap framework.

Doesn't ext4, like XFS, keep two copies of the inode size? One for
the on-disk size and one for the in-memory size?

/me looks...

Yeah, there's ei->i_disksize that reflects the on-disk size.

And I note that the first thing that ext4_handle_inode_extension()
is already checking that the write is extending past the current
on-disk inode size before running the extension transaction.

The page cache only cares about the inode->i_size value, not the
ei->i_disksize value, so you can update them independently and still
have things work correctly. That's what XFS does in
xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
runs a transaction to atomically update the inode on-disk inode
size. Updating the VFS inode size first protects against buffered
read races while updating the on-disk size...

So for ext4, the two separate size updates don't need to be done at
the same time - you have all the state you need in the ext4 dio
write path to extend the on-disk file size on successful extending
write, and it is not dependent in any way on the current in-memory
VFS inode size that you'd update in the ->end_io callback....

> OK, now that I write about this, maybe I was just too hung up on the
> performance improvement. Probably a better way forward is that I just fix
> the data corruption bug only inside ext4 (that will be also much easier to
> backport) and then submit the performance improvement modifying iomap if I
> can actually get performance data justifying it. Thanks for poking into
> this :)

Sounds like a good plan :)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-04-14 15:36:12

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure

On Wed 14-04-21 08:45:31, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> > On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> > > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > > > Eric has noticed that after pagecache read rework, generic/418 is
> > > > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > > > pagecache rework just made hard to hit race in ext4 more likely. The
> > > > problem is that since ext4 conversion of direct IO writes to iomap
> > > > framework (commit 378f32bab371), we update inode size after direct IO
> > > > write only after invalidating page cache. Thus if buffered read sneaks
> > > > at unfortunate moment like:
> > > >
> > > > CPU1 - write at offset 1k CPU2 - read from offset 0
> > > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> > > > ext4_readpage();
> > > > ext4_handle_inode_extension()
> > > >
> > > > the read will zero out tail of the page as it still sees smaller inode
> > > > size and thus page cache becomes inconsistent with on-disk contents with
> > > > all the consequences.
> > > >
> > > > Fix the problem by moving inode size update into end_io handler which
> > > > gets called before the page cache is invalidated.
> > >
> > > Confused.
> > >
> > > This moves all the inode extension stuff into the completion
> > > handler, when all that really needs to be done is extending
> > > inode->i_size to tell the world there is data up to where the
> > > IO completed. Actually removing the inode from the orphan list
> > > does not need to be done in the IO completion callback, because...
> > >
> > > > if (ilock_shared)
> > > > iomap_ops = &ext4_iomap_overwrite_ops;
> > > > - ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > > - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > > - if (ret == -ENOTBLK)
> > > > - ret = 0;
> > > > -
> > > > if (extend)
> > > > - ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > > + dio_ops = &ext4_dio_extending_write_ops;
> > > >
> > > > + ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > > > + (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > ^^^^^^ ^^^^^^^^^^^^^^^^^^^
> > >
> > > .... if we are doing an extending write, we force DIO to complete
> > > before returning. Hence even AIO will block here on an extending
> > > write, and hence we can -always- do the correct post-IO completion
> > > orphan list cleanup here because we know a) the original IO size and
> > > b) the amount of data that was actually written.
> > >
> > > Hence all that remains is closing the buffered read vs invalidation
> > > race. All this requires is for the dio write completion to behave
> > > like XFS where it just does the inode->i_size update for extending
> > > writes. THis means the size is updated before the invalidation, and
> > > hence any read that occurs after the invalidation but before the
> > > post-eof blocks have been removed will see the correct size and read
> > > the tail page(s) correctly. This closes the race window, and the
> > > caller can still handle the post-eof block cleanup as it does now.
> > >
> > > Hence I don't see any need for changing the iomap infrastructure to
> > > solve this problem. This seems like the obvious solution to me, so
> > > what am I missing?
> >
> > All that you write above is correct. The missing piece is: If everything
> > succeeded and all the cleanup we need is removing inode from the orphan
> > list (common case), we want to piggyback that orphan list removal into the
> > same transaction handle as the update of the inode size. This is just a
> > performance thing, you are absolutely right we could also do the orphan
> > cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes
> > to the iomap framework.
>
> Doesn't ext4, like XFS, keep two copies of the inode size? One for
> the on-disk size and one for the in-memory size?
>
> /me looks...
>
> Yeah, there's ei->i_disksize that reflects the on-disk size.
>
> And I note that the first thing that ext4_handle_inode_extension()
> is already checking that the write is extending past the current
> on-disk inode size before running the extension transaction.

Yes.

> The page cache only cares about the inode->i_size value, not the
> ei->i_disksize value, so you can update them independently and still
> have things work correctly. That's what XFS does in
> xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
> runs a transaction to atomically update the inode on-disk inode
> size. Updating the VFS inode size first protects against buffered
> read races while updating the on-disk size...
>
> So for ext4, the two separate size updates don't need to be done at
> the same time - you have all the state you need in the ext4 dio
> write path to extend the on-disk file size on successful extending
> write, and it is not dependent in any way on the current in-memory
> VFS inode size that you'd update in the ->end_io callback....

Right, that's a nice trick that will allow us to keep the original
performance (I've verified that indeed splitting inode size and orphan
updates into separate transactions costs us ~10% of performance when
appending 512-byte chunks) without touching generic dio code. Thanks for
the idea!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR