Hi, Ted and Andreas
These two patches are the second version of the PATCH "ext4: fix unsafe
extent initialization". Instead of split extent in the first iteration,
it update the i_disksize if the zeroout region exceeds current
i_disksize, please review.
Thanks,
Yi.
-------
Changes since v1:
- Add a patch to correct inode->i_size reading when converting
initialized extent.
- Instead of split extent, update i_disksize if necessary.
zhangyi (F) (2):
ext4: use i_size_read() when converting initialized extent
ext4: fix unsafe extent initialization
fs/ext4/extents.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
--
2.7.4
Current ext4 will call ext4_ext_convert_to_initialized() to split and
initialize an unwritten extent if someone write something to it. It
may also zeroout the nearby blocks and expand the split range if the
allocated extent is smaller than the max zeroout threshold and fully
inside i_size or new_size. But it may lead to inode inconsistency
when system crash or the power fails for the below case.
- Create an empty file and buffer write from block A to D (with delay
allocate).
- Buffer write from X to Z, now the i_size of this inode is updated
to Z.
- Zero range from part of block B to D, it will allocate an unwritten
extent from B to D. Note that it also will skip disksize update in
ext4_zero_range() -> ext4_update_disksize_before_punch() because
the i_size is large than the end of this zero range.
- The write-back kworker write block B and initialize the whole
unwritten extent from B to D, and then update the i_disksize to the
end of B.
- ext4_journal_stop()
- kjournald2 process weakup and call jbd2_journal_commit_transaction()
to commit journal and send FUA.
- System crash.
- System reboot and fsck complain about the extent size exceeds the
inode size.
So it's not safe enough only checking i_size and new_size, we should
also take care of the i_disksize. This patch add checking i_disksize
and updating i_disksize when zeroout the tail of the allocated extent
to avoid this problem.
---------------------
This problem can reproduce by xfstests generic/482 with fsstress seed
-s 1544025012.
Fsck output:
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Inode 15, end of extent exceeds allowed value
(logical block 294, physical block 34028, len 3)
Clear? no
Inode 15, i_blocks is 3784, should be 3760. Fix? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(34028--34030)
Fix? no
And the size of inode 15 is 0x127000, the extent tree is:
Level Entries Logical Physical Length Flags
1/ 1 1/ 8 14 - 38 36110 - 36134 25
1/ 1 2/ 8 128 - 137 34688 - 34697 10
1/ 1 3/ 8 219 - 231 36305 - 36317 13
1/ 1 4/ 8 284 - 293 36370 - 36379 10
1/ 1 5/ 8 294 - 296 34028 - 34030 3
1/ 1 6/ 8 297 - 511 35182 - 35396 215 Uninit
1/ 1 7/ 8 512 - 523 34096 - 34107 12 Uninit
1/ 1 8/ 8 630 - 813 35746 - 35929 184 Uninit
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/extents.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0307fc6..a054f51 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3472,6 +3472,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
struct ext4_map_blocks split_map;
struct ext4_extent zero_ex1, zero_ex2;
struct ext4_extent *ex, *abut_ex;
+ loff_t i_size, new_size;
ext4_lblk_t ee_block, eof_block;
unsigned int ee_len, depth, map_len = map->m_len;
int allocated = 0, max_zeroout = 0;
@@ -3483,7 +3484,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
(unsigned long long)map->m_lblk, map_len);
sbi = EXT4_SB(inode->i_sb);
- eof_block = (i_size_read(inode) + inode->i_sb->s_blocksize - 1) >>
+ i_size = i_size_read(inode);
+ eof_block = (i_size + inode->i_sb->s_blocksize - 1) >>
inode->i_sb->s_blocksize_bits;
if (eof_block < map->m_lblk + map_len)
eof_block = map->m_lblk + map_len;
@@ -3663,6 +3665,24 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
if (err)
goto out;
split_map.m_len = allocated;
+
+ /*
+ * Update the i_disksize if zeroout the tail of
+ * the second extent. Otherwise i_disksize update
+ * can be lost as the region may have been marked
+ * unwritten before writing back.
+ */
+ new_size = ((loff_t)(split_map.m_lblk +
+ split_map.m_len)) <<
+ PAGE_SHIFT;
+ if (new_size > i_size)
+ new_size = i_size;
+ if (new_size > EXT4_I(inode)->i_disksize) {
+ EXT4_I(inode)->i_disksize = new_size;
+ err = ext4_mark_inode_dirty(handle, inode);
+ if (err)
+ goto out;
+ }
}
if (split_map.m_lblk - ee_block + split_map.m_len <
max_zeroout) {
--
2.7.4
Use i_size_read() in ext4_ext_convert_to_initialized() and
ext4_split_convert_extents() because we haven't taken inode->i_rwsem
in these two functions.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/extents.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 240b6de..0307fc6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3483,7 +3483,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
(unsigned long long)map->m_lblk, map_len);
sbi = EXT4_SB(inode->i_sb);
- eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
+ eof_block = (i_size_read(inode) + inode->i_sb->s_blocksize - 1) >>
inode->i_sb->s_blocksize_bits;
if (eof_block < map->m_lblk + map_len)
eof_block = map->m_lblk + map_len;
@@ -3739,7 +3739,7 @@ static int ext4_split_convert_extents(handle_t *handle,
__func__, inode->i_ino,
(unsigned long long)map->m_lblk, map->m_len);
- eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
+ eof_block = (i_size_read(inode) + inode->i_sb->s_blocksize - 1) >>
inode->i_sb->s_blocksize_bits;
if (eof_block < map->m_lblk + map->m_len)
eof_block = map->m_lblk + map->m_len;
--
2.7.4
On Tue 18-12-18 22:03:41, zhangyi (F) wrote:
> Use i_size_read() in ext4_ext_convert_to_initialized() and
> ext4_split_convert_extents() because we haven't taken inode->i_rwsem
> in these two functions.
>
> Signed-off-by: zhangyi (F) <[email protected]>
The patch looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/extents.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 240b6de..0307fc6 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3483,7 +3483,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
> (unsigned long long)map->m_lblk, map_len);
>
> sbi = EXT4_SB(inode->i_sb);
> - eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
> + eof_block = (i_size_read(inode) + inode->i_sb->s_blocksize - 1) >>
> inode->i_sb->s_blocksize_bits;
> if (eof_block < map->m_lblk + map_len)
> eof_block = map->m_lblk + map_len;
> @@ -3739,7 +3739,7 @@ static int ext4_split_convert_extents(handle_t *handle,
> __func__, inode->i_ino,
> (unsigned long long)map->m_lblk, map->m_len);
>
> - eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
> + eof_block = (i_size_read(inode) + inode->i_sb->s_blocksize - 1) >>
> inode->i_sb->s_blocksize_bits;
> if (eof_block < map->m_lblk + map->m_len)
> eof_block = map->m_lblk + map->m_len;
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 18-12-18 22:03:42, zhangyi (F) wrote:
> Current ext4 will call ext4_ext_convert_to_initialized() to split and
> initialize an unwritten extent if someone write something to it. It
> may also zeroout the nearby blocks and expand the split range if the
> allocated extent is smaller than the max zeroout threshold and fully
> inside i_size or new_size. But it may lead to inode inconsistency
> when system crash or the power fails for the below case.
>
> - Create an empty file and buffer write from block A to D (with delay
> allocate).
> - Buffer write from X to Z, now the i_size of this inode is updated
> to Z.
> - Zero range from part of block B to D, it will allocate an unwritten
> extent from B to D. Note that it also will skip disksize update in
> ext4_zero_range() -> ext4_update_disksize_before_punch() because
> the i_size is large than the end of this zero range.
> - The write-back kworker write block B and initialize the whole
> unwritten extent from B to D, and then update the i_disksize to the
> end of B.
> - ext4_journal_stop()
> - kjournald2 process weakup and call jbd2_journal_commit_transaction()
> to commit journal and send FUA.
> - System crash.
> - System reboot and fsck complain about the extent size exceeds the
> inode size.
>
> So it's not safe enough only checking i_size and new_size, we should
> also take care of the i_disksize. This patch add checking i_disksize
> and updating i_disksize when zeroout the tail of the allocated extent
> to avoid this problem.
>
> ---------------------
>
> This problem can reproduce by xfstests generic/482 with fsstress seed
> -s 1544025012.
>
> Fsck output:
>
> fsck from util-linux 2.23.2
> e2fsck 1.42.9 (28-Dec-2013)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 15, end of extent exceeds allowed value
> (logical block 294, physical block 34028, len 3)
> Clear? no
>
> Inode 15, i_blocks is 3784, should be 3760. Fix? no
>
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: -(34028--34030)
> Fix? no
>
> And the size of inode 15 is 0x127000, the extent tree is:
>
> Level Entries Logical Physical Length Flags
> 1/ 1 1/ 8 14 - 38 36110 - 36134 25
> 1/ 1 2/ 8 128 - 137 34688 - 34697 10
> 1/ 1 3/ 8 219 - 231 36305 - 36317 13
> 1/ 1 4/ 8 284 - 293 36370 - 36379 10
> 1/ 1 5/ 8 294 - 296 34028 - 34030 3
> 1/ 1 6/ 8 297 - 511 35182 - 35396 215 Uninit
> 1/ 1 7/ 8 512 - 523 34096 - 34107 12 Uninit
> 1/ 1 8/ 8 630 - 813 35746 - 35929 184 Uninit
>
> Signed-off-by: zhangyi (F) <[email protected]>
Thanks for debugging this! The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
I just have to say that this i_disksize updating is really fragile and we
really need to consider how to clean that up because otherwise we are
likely do create similar bugs in the future again. But that's a separate
cleanup.
Honza
> ---
> fs/ext4/extents.c | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 0307fc6..a054f51 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3472,6 +3472,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
> struct ext4_map_blocks split_map;
> struct ext4_extent zero_ex1, zero_ex2;
> struct ext4_extent *ex, *abut_ex;
> + loff_t i_size, new_size;
> ext4_lblk_t ee_block, eof_block;
> unsigned int ee_len, depth, map_len = map->m_len;
> int allocated = 0, max_zeroout = 0;
> @@ -3483,7 +3484,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
> (unsigned long long)map->m_lblk, map_len);
>
> sbi = EXT4_SB(inode->i_sb);
> - eof_block = (i_size_read(inode) + inode->i_sb->s_blocksize - 1) >>
> + i_size = i_size_read(inode);
> + eof_block = (i_size + inode->i_sb->s_blocksize - 1) >>
> inode->i_sb->s_blocksize_bits;
> if (eof_block < map->m_lblk + map_len)
> eof_block = map->m_lblk + map_len;
> @@ -3663,6 +3665,24 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
> if (err)
> goto out;
> split_map.m_len = allocated;
> +
> + /*
> + * Update the i_disksize if zeroout the tail of
> + * the second extent. Otherwise i_disksize update
> + * can be lost as the region may have been marked
> + * unwritten before writing back.
> + */
> + new_size = ((loff_t)(split_map.m_lblk +
> + split_map.m_len)) <<
> + PAGE_SHIFT;
> + if (new_size > i_size)
> + new_size = i_size;
> + if (new_size > EXT4_I(inode)->i_disksize) {
> + EXT4_I(inode)->i_disksize = new_size;
> + err = ext4_mark_inode_dirty(handle, inode);
> + if (err)
> + goto out;
> + }
> }
> if (split_map.m_lblk - ee_block + split_map.m_len <
> max_zeroout) {
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR