2013-08-05 13:52:28

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/4 v2] ext4: Fix races between writeback and truncate

Hello,

this patch set fixes two races between truncate and writeback code.
The first one was introduced by my rewrite of ext4 writeback path and
Dave and Zheng have spotted them in their testing. It could result in
some leaked fs blocks and delalloc accounting mismatch. The first two
patches fix that bug.

The second race has been there for ages and could result in i_size being
wrong on disk. The last two patches fix that race.

Honza


2013-08-05 13:52:29

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate

Inode size can arbitrarily change while writeback is in progress. When
ext4_writepages() has prepared a long extent for mapping and truncate
then reduces i_size, mpage_map_and_submit_buffers() will always map just
one buffer in a page instead of all of them due to lblk < blocks check.
So we end up not using all blocks we've allocated (thus leaking them)
and also delalloc accounting goes wrong manifesting as a warning like:

ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
with only 0 reserved data blocks

Note that the problem can happen only when blocksize < pagesize because
otherwise we have only a single buffer in the page.

Fix the problem by removing the size check from the mapping loop. We
have an extent allocated so we have to use it all before checking for
i_size. We also rename add_page_bufs_to_extent() to
mpage_process_page_bufs() and make that function submit the page for IO
if all buffers (upto EOF) in it are mapped.

Reported-by: Dave Jones <[email protected]>
Reported-by: Zheng Liu <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 107 ++++++++++++++++++++++++++++++++++----------------------
1 file changed, 66 insertions(+), 41 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index afaf26b..4263878 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1893,6 +1893,26 @@ static int ext4_writepage(struct page *page,
return ret;
}

+static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
+{
+ int len;
+ loff_t size = i_size_read(mpd->inode);
+ int err;
+
+ BUG_ON(page->index != mpd->first_page);
+ if (page->index == size >> PAGE_CACHE_SHIFT)
+ len = size & ~PAGE_CACHE_MASK;
+ else
+ len = PAGE_CACHE_SIZE;
+ clear_page_dirty_for_io(page);
+ err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc);
+ if (!err)
+ mpd->wbc->nr_to_write--;
+ mpd->first_page++;
+
+ return err;
+}
+
#define BH_FLAGS ((1 << BH_Unwritten) | (1 << BH_Delay))

/*
@@ -1951,12 +1971,29 @@ static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
return false;
}

-static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
- struct buffer_head *head,
- struct buffer_head *bh,
- ext4_lblk_t lblk)
+/*
+ * mpage_process_page_bufs - submit page buffers for IO or add them to extent
+ *
+ * @mpd - extent of blocks for mapping
+ * @head - the first buffer in the page
+ * @bh - buffer we should start processing from
+ * @lblk - logical number of the block in the file corresponding to @bh
+ *
+ * Walk through page buffers from @bh upto @head (exclusive) and either submit
+ * the page for IO if all buffers in this page were mapped and there's no
+ * accumulated extent of buffers to map or add buffers in the page to the
+ * extent of buffers to map. The function returns 1 if the caller can continue
+ * by processing the next page, 0 if it should stop adding buffers to the
+ * extent to map because we cannot extend it anymore. It can also return value
+ * < 0 in case of error during IO submission.
+ */
+static int mpage_process_page_bufs(struct mpage_da_data *mpd,
+ struct buffer_head *head,
+ struct buffer_head *bh,
+ ext4_lblk_t lblk)
{
struct inode *inode = mpd->inode;
+ int err;
ext4_lblk_t blocks = (i_size_read(inode) + (1 << inode->i_blkbits) - 1)
>> inode->i_blkbits;

@@ -1966,32 +2003,18 @@ static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
if (lblk >= blocks || !mpage_add_bh_to_extent(mpd, lblk, bh)) {
/* Found extent to map? */
if (mpd->map.m_len)
- return false;
+ return 0;
/* Everything mapped so far and we hit EOF */
- return true;
+ break;
}
} while (lblk++, (bh = bh->b_this_page) != head);
- return true;
-}
-
-static int mpage_submit_page(struct mpage_da_data *mpd, struct page *page)
-{
- int len;
- loff_t size = i_size_read(mpd->inode);
- int err;
-
- BUG_ON(page->index != mpd->first_page);
- if (page->index == size >> PAGE_CACHE_SHIFT)
- len = size & ~PAGE_CACHE_MASK;
- else
- len = PAGE_CACHE_SIZE;
- clear_page_dirty_for_io(page);
- err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc);
- if (!err)
- mpd->wbc->nr_to_write--;
- mpd->first_page++;
-
- return err;
+ /* So far everything mapped? Submit the page for IO. */
+ if (mpd->map.m_len == 0) {
+ err = mpage_submit_page(mpd, head->b_page);
+ if (err < 0)
+ return err;
+ }
+ return lblk < blocks;
}

/*
@@ -2015,8 +2038,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
struct inode *inode = mpd->inode;
struct buffer_head *head, *bh;
int bpp_bits = PAGE_CACHE_SHIFT - inode->i_blkbits;
- ext4_lblk_t blocks = (i_size_read(inode) + (1 << inode->i_blkbits) - 1)
- >> inode->i_blkbits;
pgoff_t start, end;
ext4_lblk_t lblk;
sector_t pblock;
@@ -2051,18 +2072,26 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
*/
mpd->map.m_len = 0;
mpd->map.m_flags = 0;
- add_page_bufs_to_extent(mpd, head, bh,
- lblk);
+ /*
+ * FIXME: If dioread_nolock supports
+ * blocksize < pagesize, we need to make
+ * sure we add size mapped so far to
+ * io_end->size as the following call
+ * can submit the page for IO.
+ */
+ err = mpage_process_page_bufs(mpd, head,
+ bh, lblk);
pagevec_release(&pvec);
- return 0;
+ if (err > 0)
+ err = 0;
+ return err;
}
if (buffer_delay(bh)) {
clear_buffer_delay(bh);
bh->b_blocknr = pblock++;
}
clear_buffer_unwritten(bh);
- } while (++lblk < blocks &&
- (bh = bh->b_this_page) != head);
+ } while (lblk++, (bh = bh->b_this_page) != head);

/*
* FIXME: This is going to break if dioread_nolock
@@ -2331,14 +2360,10 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
lblk = ((ext4_lblk_t)page->index) <<
(PAGE_CACHE_SHIFT - blkbits);
head = page_buffers(page);
- if (!add_page_bufs_to_extent(mpd, head, head, lblk))
+ err = mpage_process_page_bufs(mpd, head, head, lblk);
+ if (err <= 0)
goto out;
- /* So far everything mapped? Submit the page for IO. */
- if (mpd->map.m_len == 0) {
- err = mpage_submit_page(mpd, page);
- if (err < 0)
- goto out;
- }
+ err = 0;

/*
* Accumulated enough dirty pages? This doesn't apply
--
1.8.1.4


2013-08-05 13:52:28

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback

The following race can lead to a loss of i_disksize update from truncate
thus resulting in a wrong inode size if the inode size isn't updated
again before inode is reclaimed:

ext4_setattr() mpage_map_and_submit_extent()
EXT4_I(inode)->i_disksize = attr->ia_size;
... ...
disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
/* False because i_size isn't
* updated yet */
if (disksize > i_size_read(inode))
/* True, because i_disksize is
* already truncated */
if (disksize > EXT4_I(inode)->i_disksize)
/* Overwrite i_disksize
* update from truncate */
ext4_update_i_disksize()
i_size_write(inode, attr->ia_size);

For other places updating i_disksize such race cannot happen because
i_mutex prevents these races. Writeback is the only place where we do
not hold i_mutex and we cannot grab it there because of lock ordering.

We fix the race by doing both i_disksize and i_size update in truncate
atomically under i_data_sem and in mpage_map_and_submit_extent() we move
the check against i_size under i_data_sem as well.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 24 ++++++++++++++++++++----
fs/ext4/inode.c | 17 ++++++++++++-----
2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..648c5e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2416,16 +2416,32 @@ do { \
#define EXT4_FREECLUSTERS_WATERMARK 0
#endif

+/* Update i_disksize. Requires i_mutex to avoid races with truncate */
static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
{
- /*
- * XXX: replace with spinlock if seen contended -bzzz
- */
+ WARN_ON_ONCE(S_ISREG(inode->i_mode) &&
+ !mutex_is_locked(&inode->i_mutex));
+ down_write(&EXT4_I(inode)->i_data_sem);
+ if (newsize > EXT4_I(inode)->i_disksize)
+ EXT4_I(inode)->i_disksize = newsize;
+ up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+/*
+ * Update i_disksize after writeback has been started. Races with truncate
+ * are avoided by checking i_size under i_data_sem.
+ */
+static inline void ext4_wb_update_i_disksize(struct inode *inode, loff_t newsize)
+{
+ loff_t i_size;
+
down_write(&EXT4_I(inode)->i_data_sem);
+ i_size = i_size_read(inode);
+ if (newsize > i_size)
+ newsize = i_size;
if (newsize > EXT4_I(inode)->i_disksize)
EXT4_I(inode)->i_disksize = newsize;
up_write(&EXT4_I(inode)->i_data_sem);
- return ;
}

struct ext4_group_info {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e7d98d2..5d3706e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2240,12 +2240,10 @@ static int mpage_map_and_submit_extent(handle_t *handle,

/* Update on-disk size after IO is submitted */
disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT;
- if (disksize > i_size_read(inode))
- disksize = i_size_read(inode);
if (disksize > EXT4_I(inode)->i_disksize) {
int err2;

- ext4_update_i_disksize(inode, disksize);
+ ext4_wb_update_i_disksize(inode, disksize);
err2 = ext4_mark_inode_dirty(handle, inode);
if (err2)
ext4_error(inode->i_sb,
@@ -4587,18 +4585,27 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
error = ext4_orphan_add(handle, inode);
orphan = 1;
}
+ down_write(&EXT4_I(inode)->i_data_sem);
EXT4_I(inode)->i_disksize = attr->ia_size;
rc = ext4_mark_inode_dirty(handle, inode);
if (!error)
error = rc;
+ /*
+ * We have to update i_size under i_data_sem together
+ * with i_disksize to avoid races with writeback code
+ * running ext4_wb_update_i_disksize().
+ */
+ if (!error)
+ i_size_write(inode, attr->ia_size);
+ up_write(&EXT4_I(inode)->i_data_sem);
ext4_journal_stop(handle);
if (error) {
ext4_orphan_del(NULL, inode);
goto err_out;
}
- }
+ } else
+ i_size_write(inode, attr->ia_size);

- i_size_write(inode, attr->ia_size);
/*
* Blocks are going to be removed from the inode. Wait
* for dio in flight. Temporarily disable
--
1.8.1.4


2013-08-05 13:52:28

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place

Currently the logic whether the current buffer can be added to an extent
of buffers to map is split between mpage_add_bh_to_extent() and
add_page_bufs_to_extent(). Move the whole logic to
mpage_add_bh_to_extent() which makes things a bit more straightforward
and make following i_size fixes easier.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 49 +++++++++++++++++++++++++++++--------------------
1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba33c67..afaf26b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1907,34 +1907,48 @@ static int ext4_writepage(struct page *page,
*
* @mpd - extent of blocks
* @lblk - logical number of the block in the file
- * @b_state - b_state of the buffer head added
+ * @bh - buffer head we want to add to the extent
*
- * the function is used to collect contig. blocks in same state
+ * The function is used to collect contig. blocks in the same state. If the
+ * buffer doesn't require mapping for writeback and we haven't started the
+ * extent of buffers to map yet, the function returns 'true' immediately - the
+ * caller can write the buffer right away. Otherwise the function returns true
+ * if the block has been added to the extent, false if the block couldn't be
+ * added.
*/
-static int mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
- unsigned long b_state)
+static bool mpage_add_bh_to_extent(struct mpage_da_data *mpd, ext4_lblk_t lblk,
+ struct buffer_head *bh)
{
struct ext4_map_blocks *map = &mpd->map;

- /* Don't go larger than mballoc is willing to allocate */
- if (map->m_len >= MAX_WRITEPAGES_EXTENT_LEN)
- return 0;
+ /* Buffer that doesn't need mapping for writeback? */
+ if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
+ (!buffer_delay(bh) && !buffer_unwritten(bh))) {
+ /* So far no extent to map => we write the buffer right away */
+ if (map->m_len == 0)
+ return true;
+ return false;
+ }

/* First block in the extent? */
if (map->m_len == 0) {
map->m_lblk = lblk;
map->m_len = 1;
- map->m_flags = b_state & BH_FLAGS;
- return 1;
+ map->m_flags = bh->b_state & BH_FLAGS;
+ return true;
}

+ /* Don't go larger than mballoc is willing to allocate */
+ if (map->m_len >= MAX_WRITEPAGES_EXTENT_LEN)
+ return false;
+
/* Can we merge the block to our big extent? */
if (lblk == map->m_lblk + map->m_len &&
- (b_state & BH_FLAGS) == map->m_flags) {
+ (bh->b_state & BH_FLAGS) == map->m_flags) {
map->m_len++;
- return 1;
+ return true;
}
- return 0;
+ return false;
}

static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
@@ -1949,18 +1963,13 @@ static bool add_page_bufs_to_extent(struct mpage_da_data *mpd,
do {
BUG_ON(buffer_locked(bh));

- if (!buffer_dirty(bh) || !buffer_mapped(bh) ||
- (!buffer_delay(bh) && !buffer_unwritten(bh)) ||
- lblk >= blocks) {
+ if (lblk >= blocks || !mpage_add_bh_to_extent(mpd, lblk, bh)) {
/* Found extent to map? */
if (mpd->map.m_len)
return false;
- if (lblk >= blocks)
- return true;
- continue;
+ /* Everything mapped so far and we hit EOF */
+ return true;
}
- if (!mpage_add_bh_to_extent(mpd, lblk, bh->b_state))
- return false;
} while (lblk++, (bh = bh->b_this_page) != head);
return true;
}
--
1.8.1.4


2013-08-05 13:52:28

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr()

Merge conditions in ext4_setattr() handling inode size changes, also
move ext4_begin_ordered_truncate() call somewhat earlier because it
simplifies error recovery in case of failure. Also add error handling in
case i_disksize update fails.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 109 +++++++++++++++++++++++++-------------------------------
1 file changed, 49 insertions(+), 60 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4263878..e7d98d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4560,7 +4560,9 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
ext4_journal_stop(handle);
}

- if (attr->ia_valid & ATTR_SIZE) {
+ if (attr->ia_valid & ATTR_SIZE && attr->ia_size != inode->i_size) {
+ handle_t *handle;
+ loff_t oldsize = inode->i_size;

if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -4568,73 +4570,60 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
if (attr->ia_size > sbi->s_bitmap_maxbytes)
return -EFBIG;
}
- }
-
- if (S_ISREG(inode->i_mode) &&
- attr->ia_valid & ATTR_SIZE &&
- (attr->ia_size < inode->i_size)) {
- handle_t *handle;
-
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
- if (IS_ERR(handle)) {
- error = PTR_ERR(handle);
- goto err_out;
- }
- if (ext4_handle_valid(handle)) {
- error = ext4_orphan_add(handle, inode);
- orphan = 1;
- }
- EXT4_I(inode)->i_disksize = attr->ia_size;
- rc = ext4_mark_inode_dirty(handle, inode);
- if (!error)
- error = rc;
- ext4_journal_stop(handle);
-
- if (ext4_should_order_data(inode)) {
- error = ext4_begin_ordered_truncate(inode,
+ if (S_ISREG(inode->i_mode) &&
+ (attr->ia_size < inode->i_size)) {
+ if (ext4_should_order_data(inode)) {
+ error = ext4_begin_ordered_truncate(inode,
attr->ia_size);
- if (error) {
- /* Do as much error cleanup as possible */
- handle = ext4_journal_start(inode,
- EXT4_HT_INODE, 3);
- if (IS_ERR(handle)) {
- ext4_orphan_del(NULL, inode);
+ if (error)
goto err_out;
- }
- ext4_orphan_del(handle, inode);
- orphan = 0;
- ext4_journal_stop(handle);
+ }
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
+ if (IS_ERR(handle)) {
+ error = PTR_ERR(handle);
+ goto err_out;
+ }
+ if (ext4_handle_valid(handle)) {
+ error = ext4_orphan_add(handle, inode);
+ orphan = 1;
+ }
+ EXT4_I(inode)->i_disksize = attr->ia_size;
+ rc = ext4_mark_inode_dirty(handle, inode);
+ if (!error)
+ error = rc;
+ ext4_journal_stop(handle);
+ if (error) {
+ ext4_orphan_del(NULL, inode);
goto err_out;
}
}
- }
-
- if (attr->ia_valid & ATTR_SIZE) {
- if (attr->ia_size != inode->i_size) {
- loff_t oldsize = inode->i_size;

- i_size_write(inode, attr->ia_size);
- /*
- * Blocks are going to be removed from the inode. Wait
- * for dio in flight. Temporarily disable
- * dioread_nolock to prevent livelock.
- */
- if (orphan) {
- if (!ext4_should_journal_data(inode)) {
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
- ext4_inode_resume_unlocked_dio(inode);
- } else
- ext4_wait_for_tail_page_commit(inode);
- }
- /*
- * Truncate pagecache after we've waited for commit
- * in data=journal mode to make pages freeable.
- */
- truncate_pagecache(inode, oldsize, inode->i_size);
+ i_size_write(inode, attr->ia_size);
+ /*
+ * Blocks are going to be removed from the inode. Wait
+ * for dio in flight. Temporarily disable
+ * dioread_nolock to prevent livelock.
+ */
+ if (orphan) {
+ if (!ext4_should_journal_data(inode)) {
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+ ext4_inode_resume_unlocked_dio(inode);
+ } else
+ ext4_wait_for_tail_page_commit(inode);
}
- ext4_truncate(inode);
+ /*
+ * Truncate pagecache after we've waited for commit
+ * in data=journal mode to make pages freeable.
+ */
+ truncate_pagecache(inode, oldsize, inode->i_size);
}
+ /*
+ * We want to call ext4_truncate() even if attr->ia_size ==
+ * inode->i_size for cases like truncation of fallocated space
+ */
+ if (attr->ia_valid & ATTR_SIZE)
+ ext4_truncate(inode);

if (!rc) {
setattr_copy(inode, attr);
--
1.8.1.4


2013-08-17 13:59:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/4] ext4: Move test whether extent to map can be extended to one place

On Mon, Aug 05, 2013 at 03:52:21PM +0200, Jan Kara wrote:
> Currently the logic whether the current buffer can be added to an extent
> of buffers to map is split between mpage_add_bh_to_extent() and
> add_page_bufs_to_extent(). Move the whole logic to
> mpage_add_bh_to_extent() which makes things a bit more straightforward
> and make following i_size fixes easier.
>
> Signed-off-by: Jan Kara <[email protected]>

Thanks, applied.

- Ted

2013-08-17 14:08:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/4] ext4: Fix ext4_writepages() in presence of truncate

On Mon, Aug 05, 2013 at 03:52:22PM +0200, Jan Kara wrote:
> Inode size can arbitrarily change while writeback is in progress. When
> ext4_writepages() has prepared a long extent for mapping and truncate
> then reduces i_size, mpage_map_and_submit_buffers() will always map just
> one buffer in a page instead of all of them due to lblk < blocks check.
> So we end up not using all blocks we've allocated (thus leaking them)
> and also delalloc accounting goes wrong manifesting as a warning like:
>
> ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
> with only 0 reserved data blocks
>
> Note that the problem can happen only when blocksize < pagesize because
> otherwise we have only a single buffer in the page.
>
> Fix the problem by removing the size check from the mapping loop. We
> have an extent allocated so we have to use it all before checking for
> i_size. We also rename add_page_bufs_to_extent() to
> mpage_process_page_bufs() and make that function submit the page for IO
> if all buffers (upto EOF) in it are mapped.
>
> Reported-by: Dave Jones <[email protected]>
> Reported-by: Zheng Liu <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>

Thanks, applied.

- Ted

2013-08-17 14:08:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/4] ext4: Simplify truncation code in ext4_setattr()

On Mon, Aug 05, 2013 at 03:52:23PM +0200, Jan Kara wrote:
> Merge conditions in ext4_setattr() handling inode size changes, also
> move ext4_begin_ordered_truncate() call somewhat earlier because it
> simplifies error recovery in case of failure. Also add error handling in
> case i_disksize update fails.
>
> Signed-off-by: Jan Kara <[email protected]>

Thanks, applied with one minor whitespace fixup.

- Ted

2013-08-17 14:12:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback

On Mon, Aug 05, 2013 at 03:52:24PM +0200, Jan Kara wrote:
> The following race can lead to a loss of i_disksize update from truncate
> thus resulting in a wrong inode size if the inode size isn't updated
> again before inode is reclaimed:
>
> ext4_setattr() mpage_map_and_submit_extent()
> EXT4_I(inode)->i_disksize = attr->ia_size;
> ... ...
> disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
> /* False because i_size isn't
> * updated yet */
> if (disksize > i_size_read(inode))
> /* True, because i_disksize is
> * already truncated */
> if (disksize > EXT4_I(inode)->i_disksize)
> /* Overwrite i_disksize
> * update from truncate */
> ext4_update_i_disksize()
> i_size_write(inode, attr->ia_size);
>
> For other places updating i_disksize such race cannot happen because
> i_mutex prevents these races. Writeback is the only place where we do
> not hold i_mutex and we cannot grab it there because of lock ordering.
>
> We fix the race by doing both i_disksize and i_size update in truncate
> atomically under i_data_sem and in mpage_map_and_submit_extent() we move
> the check against i_size under i_data_sem as well.
>
> Signed-off-by: Jan Kara <[email protected]>

Applied, thanks.

- Ted

2013-08-26 19:02:04

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback

On Sat, Aug 17, 2013 at 10:12:27AM -0400, Theodore Ts'o wrote:
> On Mon, Aug 05, 2013 at 03:52:24PM +0200, Jan Kara wrote:
> > The following race can lead to a loss of i_disksize update from truncate
> > thus resulting in a wrong inode size if the inode size isn't updated
> > again before inode is reclaimed:
> >
> > ext4_setattr() mpage_map_and_submit_extent()
> > EXT4_I(inode)->i_disksize = attr->ia_size;
> > ... ...
> > disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
> > /* False because i_size isn't
> > * updated yet */
> > if (disksize > i_size_read(inode))
> > /* True, because i_disksize is
> > * already truncated */
> > if (disksize > EXT4_I(inode)->i_disksize)
> > /* Overwrite i_disksize
> > * update from truncate */
> > ext4_update_i_disksize()
> > i_size_write(inode, attr->ia_size);
> >
> > For other places updating i_disksize such race cannot happen because
> > i_mutex prevents these races. Writeback is the only place where we do
> > not hold i_mutex and we cannot grab it there because of lock ordering.
> >
> > We fix the race by doing both i_disksize and i_size update in truncate
> > atomically under i_data_sem and in mpage_map_and_submit_extent() we move
> > the check against i_size under i_data_sem as well.
> >
> > Signed-off-by: Jan Kara <[email protected]>
>
> Applied, thanks.

Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.

Dave


2013-08-28 22:55:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback

On Mon, Aug 26, 2013 at 03:01:48PM -0400, Dave Jones wrote:
>
> Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.

These patches fixed races that have been around for a while; it's not
a regression. Given that they are fairly involved, I was nervous
sending them to Linus for 3.11, given the late date.

They are queued for the next merge window, and I'll mark them cc:
[email protected].

- Ted

2013-08-28 23:02:03

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH 4/4] ext4: Fix lost truncate due to race with writeback

On Wed, Aug 28, 2013 at 06:55:04PM -0400, Theodore Ts'o wrote:
> On Mon, Aug 26, 2013 at 03:01:48PM -0400, Dave Jones wrote:
> >
> > Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.
>
> These patches fixed races that have been around for a while; it's not
> a regression. Given that they are fairly involved, I was nervous
> sending them to Linus for 3.11, given the late date.

That's odd, because I can't reproduce the problem I'm seeing on 3.10

> They are queued for the next merge window, and I'll mark them cc:
> [email protected].

Fair enough.

Dave