Current ext4 will call ext4_ext_convert_to_initialized() to split and
initialize an unwritten extent if someone write something to it. It may
also zeroout the nearby blocks and expand the split extent if the
allocated extent is fully inside i_size or new_size. But it may lead to
inode inconsistency when system crash or the power fails.
Consider the following case:
- Create an empty file and buffer write from block A to D (with delay
allocate). It will update the i_size to D.
- Zero range from part of block B to D. It will allocate an unwritten
extent from B to D.
- The write back worker write block B and initialize the unwritten
extent from B to D, and then update the i_disksize to B.
- System crash.
- Remount and fsck complain about the extent size exceeds the inode
size.
This patch add checking i_disksize and chose the small one between
i_size to make sure it's safe to convert extent to initialized.
---------------------
This problem can reproduce by xfstests generic/482 with fsstress seed
1544025012.
Fsck output:
fsck from util-linux 2.23.2
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Inode 15, end of extent exceeds allowed value
(logical block 294, physical block 34028, len 3)
Clear? no
Inode 15, i_blocks is 3784, should be 3760. Fix? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(34028--34030)
Fix? no
The sizeof inode 15 is 0x127000, and the extent tree is:
Level Entries Logical Physical Length Flags
1/ 1 1/ 8 14 - 38 36110 - 36134 25
1/ 1 2/ 8 128 - 137 34688 - 34697 10
1/ 1 3/ 8 219 - 231 36305 - 36317 13
1/ 1 4/ 8 284 - 293 36370 - 36379 10
1/ 1 5/ 8 294 - 296 34028 - 34030 3
1/ 1 6/ 8 297 - 511 35182 - 35396 215 Uninit
1/ 1 7/ 8 512 - 523 34096 - 34107 12 Uninit
1/ 1 8/ 8 630 - 813 35746 - 35929 184 Uninit
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/extents.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 240b6de..7c9abab 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3473,6 +3473,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
struct ext4_extent zero_ex1, zero_ex2;
struct ext4_extent *ex, *abut_ex;
ext4_lblk_t ee_block, eof_block;
+ loff_t eof_size;
unsigned int ee_len, depth, map_len = map->m_len;
int allocated = 0, max_zeroout = 0;
int err = 0;
@@ -3483,7 +3484,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
(unsigned long long)map->m_lblk, map_len);
sbi = EXT4_SB(inode->i_sb);
- eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
+ eof_size = min(inode->i_size, EXT4_I(inode)->i_disksize);
+ eof_block = (eof_size + inode->i_sb->s_blocksize - 1) >>
inode->i_sb->s_blocksize_bits;
if (eof_block < map->m_lblk + map_len)
eof_block = map->m_lblk + map_len;
@@ -3623,7 +3625,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
WARN_ON(map->m_lblk < ee_block);
/*
* It is safe to convert extent to initialized via explicit
- * zeroout only if extent is fully inside i_size or new_size.
+ * zeroout only if extent is fully inside min(i_size, i_disksize)
+ * or new_size.
*/
split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
--
2.5.0
On Dec 10, 2018, at 2:26 AM, zhangyi (F) <[email protected]> wrote:
>
> On 2018/12/10 13:10, Theodore Y. Ts'o Wrote:
>> On Sat, Dec 08, 2018 at 10:25:58PM +0800, zhangyi (F) wrote:
>>> Current ext4 will call ext4_ext_convert_to_initialized() to split and
>>> initialize an unwritten extent if someone write something to it. It may
>>> also zeroout the nearby blocks and expand the split extent if the
>>> allocated extent is fully inside i_size or new_size. But it may lead to
>>> inode inconsistency when system crash or the power fails.
>>>
>>> Consider the following case:
>>> - Create an empty file and buffer write from block A to D (with delay
>>> allocate). It will update the i_size to D.
>>> - Zero range from part of block B to D. It will allocate an unwritten
>>> extent from B to D.
>>> - The write back worker write block B and initialize the unwritten
>>> extent from B to D, and then update the i_disksize to B.
>>> - System crash.
>>> - Remount and fsck complain about the extent size exceeds the inode
>>> size.
>>>
>>> This patch add checking i_disksize and chose the small one between
>>> i_size to make sure it's safe to convert extent to initialized.
>>>
>>> ---------------------
>>>
>>> This problem can reproduce by xfstests generic/482 with fsstress seed
>>> 1544025012.
>>
>> Hmm, your explanation is great and the patch makes sense. I haven't
>> been able to reproduce the problem by adding -s 1544025012 to the
>> fsstress arguments. This may be because fsstress being run with two
>> processes (-p 2) and the failure may be timing dependent?
>>
> Yes, it is timing dependent and not quite easy to make a simple fast reproducer.
> The default parameter of fsstress (tested by generic/482) on my machine is:
>
> ./ltp/fsstress -w -d /mnt/scratch -n 512 -p 8 -s 1544025012 -v
>
>> How easily can you replicate the problem?
>
> It is about 2~5% probability to replicate this problem under generic/482 on my
> machine, and it can also appear in generic/019 and generic/455.
>
> After reproducing and checking this problem again, I think the above explanation
> is not accurate. I simplify the running process in generic/482 and the real case
> could be:
>
> - Create an empty file and buffer write from block A to D (with delay allocate).
> - Buffer write from X to Z, now the i_size of this inode is updated to Z.
> - Zero range from part of block B to D, it will allocate an unwritten extent
> from B to D. Note that it also will skip disksize update in
> ext4_zero_range() -> ext4_update_disksize_before_punch() because the i_size
> is large than the end of this zero range.
> - The write back kworker write block B and initialize the whole unwritten
> extent from B to D, and then update the i_disksize to the end of B.
On a related note, should this just refuse to allocate uninitialized extents
for regions that are smaller than the threshold that they would immediately
be expanded into initialized extents on when they are modified? This would
avoid churn in the extent tree at the expense of (potentially) a few extra
zero blocks written to disk.
Cheers, Andreas
On 2018/12/11 4:14, Andreas Dilger Wrote:
> On Dec 10, 2018, at 2:26 AM, zhangyi (F) <[email protected]> wrote:
>>
>> On 2018/12/10 13:10, Theodore Y. Ts'o Wrote:
>>> On Sat, Dec 08, 2018 at 10:25:58PM +0800, zhangyi (F) wrote:
>>>> Current ext4 will call ext4_ext_convert_to_initialized() to split and
>>>> initialize an unwritten extent if someone write something to it. It may
>>>> also zeroout the nearby blocks and expand the split extent if the
>>>> allocated extent is fully inside i_size or new_size. But it may lead to
>>>> inode inconsistency when system crash or the power fails.
>>>>
>>>> Consider the following case:
>>>> - Create an empty file and buffer write from block A to D (with delay
>>>> allocate). It will update the i_size to D.
>>>> - Zero range from part of block B to D. It will allocate an unwritten
>>>> extent from B to D.
>>>> - The write back worker write block B and initialize the unwritten
>>>> extent from B to D, and then update the i_disksize to B.
>>>> - System crash.
>>>> - Remount and fsck complain about the extent size exceeds the inode
>>>> size.
>>>>
>>>> This patch add checking i_disksize and chose the small one between
>>>> i_size to make sure it's safe to convert extent to initialized.
>>>>
>>>> ---------------------
>>>>
>>>> This problem can reproduce by xfstests generic/482 with fsstress seed
>>>> 1544025012.
>>>
>>> Hmm, your explanation is great and the patch makes sense. I haven't
>>> been able to reproduce the problem by adding -s 1544025012 to the
>>> fsstress arguments. This may be because fsstress being run with two
>>> processes (-p 2) and the failure may be timing dependent?
>>>
>> Yes, it is timing dependent and not quite easy to make a simple fast reproducer.
>> The default parameter of fsstress (tested by generic/482) on my machine is:
>>
>> ./ltp/fsstress -w -d /mnt/scratch -n 512 -p 8 -s 1544025012 -v
>>
>>> How easily can you replicate the problem?
>>
>> It is about 2~5% probability to replicate this problem under generic/482 on my
>> machine, and it can also appear in generic/019 and generic/455.
>>
>> After reproducing and checking this problem again, I think the above explanation
>> is not accurate. I simplify the running process in generic/482 and the real case
>> could be:
>>
>> - Create an empty file and buffer write from block A to D (with delay allocate).
>> - Buffer write from X to Z, now the i_size of this inode is updated to Z.
>> - Zero range from part of block B to D, it will allocate an unwritten extent
>> from B to D. Note that it also will skip disksize update in
>> ext4_zero_range() -> ext4_update_disksize_before_punch() because the i_size
>> is large than the end of this zero range.
>> - The write back kworker write block B and initialize the whole unwritten
>> extent from B to D, and then update the i_disksize to the end of B.
>
> On a related note, should this just refuse to allocate uninitialized extents
> for regions that are smaller than the threshold that they would immediately
> be expanded into initialized extents on when they are modified? This would
> avoid churn in the extent tree at the expense of (potentially) a few extra
> zero blocks written to disk.
>
IIUC, allocating uninitialized extents are required for the zero_range
operaion (ext4_zero_range()) and also the normal fallocate operaion, we
should allocate uninitialized extents no matter the regions are smaller
than the threshold or not. So I think we cannot just refuse to allocate
uninitialized extents, am I missing something?
If we want to keep the extent tree, maybe we can update i_disksize instead
of split the extent. something like that.
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 240b6de..4b63b84 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3650,6 +3650,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
if (max_zeroout && (allocated > split_map.m_len)) {
if (allocated <= max_zeroout) {
+ loff_t disksize;
+
/* case 3 or 5 */
zero_ex1.ee_block =
cpu_to_le32(split_map.m_lblk +
@@ -3663,6 +3665,22 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
if (err)
goto out;
split_map.m_len = allocated;
+
+ /*
+ * Update the i_disksize if zeroout the tail of
+ * the second extent. Otherwise i_disksize update
+ * can be lost as the region may have been marked
+ * unwritten before writeback.
+ */
+ disksize = ((loff_t)(split_map.m_lblk +
+ split_map.m_len)) <<
+ PAGE_SHIFT;
+ if (disksize > EXT4_I(inode)->i_disksize) {
+ EXT4_I(inode)->i_disksize = disksize;
+ err = ext4_mark_inode_dirty(handle, inode);
+ if (err)
+ goto out;
+ }
}
if (split_map.m_lblk - ee_block + split_map.m_len <
max_zeroout) {
Thoughts?
Thanks,
Yi.
On 2018/12/10 13:10, Theodore Y. Ts'o Wrote:
> On Sat, Dec 08, 2018 at 10:25:58PM +0800, zhangyi (F) wrote:
>> Current ext4 will call ext4_ext_convert_to_initialized() to split and
>> initialize an unwritten extent if someone write something to it. It may
>> also zeroout the nearby blocks and expand the split extent if the
>> allocated extent is fully inside i_size or new_size. But it may lead to
>> inode inconsistency when system crash or the power fails.
>>
>> Consider the following case:
>> - Create an empty file and buffer write from block A to D (with delay
>> allocate). It will update the i_size to D.
>> - Zero range from part of block B to D. It will allocate an unwritten
>> extent from B to D.
>> - The write back worker write block B and initialize the unwritten
>> extent from B to D, and then update the i_disksize to B.
>> - System crash.
>> - Remount and fsck complain about the extent size exceeds the inode
>> size.
>>
>> This patch add checking i_disksize and chose the small one between
>> i_size to make sure it's safe to convert extent to initialized.
>>
>> ---------------------
>>
>> This problem can reproduce by xfstests generic/482 with fsstress seed
>> 1544025012.
>
> Hmm, your explanation is great and the patch makes sense. I haven't
> been able to reproduce the problem by adding -s 1544025012 to the
> fsstress arguments. This may be because fsstress being run with two
> processes (-p 2) and the failure may be timing dependent?
>
Yes, it is timing dependent and not quite easy to make a simple fast reproducer.
The default parameter of fsstress (tested by generic/482) on my machine is:
./ltp/fsstress -w -d /mnt/scratch -n 512 -p 8 -s 1544025012 -v
> How easily can you replicate the problem?
It is about 2~5% probability to replicate this problem under generic/482 on my
machine, and it can also appear in generic/019 and generic/455.
After reproducing and checking this problem again, I think the above explanation
is not accurate. I simplify the running process in generic/482 and the real case
could be:
- Create an empty file and buffer write from block A to D (with delay allocate).
- Buffer write from X to Z, now the i_size of this inode is updated to Z.
- Zero range from part of block B to D, it will allocate an unwritten extent
from B to D. Note that it also will skip disksize update in
ext4_zero_range() -> ext4_update_disksize_before_punch() because the i_size
is large than the end of this zero range.
- The write back kworker write block B and initialize the whole unwritten
extent from B to D, and then update the i_disksize to the end of B.
- ext4_journal_stop()
- kjournald2 process weakup and call jbd2_journal_commit_transaction() to commit
journal and send FUA.
- System crash.
- System reboot and fsck complain about the extent size exceeds the inode size.
Thanks,
Yi.
On Sat, Dec 08, 2018 at 10:25:58PM +0800, zhangyi (F) wrote:
> Current ext4 will call ext4_ext_convert_to_initialized() to split and
> initialize an unwritten extent if someone write something to it. It may
> also zeroout the nearby blocks and expand the split extent if the
> allocated extent is fully inside i_size or new_size. But it may lead to
> inode inconsistency when system crash or the power fails.
>
> Consider the following case:
> - Create an empty file and buffer write from block A to D (with delay
> allocate). It will update the i_size to D.
> - Zero range from part of block B to D. It will allocate an unwritten
> extent from B to D.
> - The write back worker write block B and initialize the unwritten
> extent from B to D, and then update the i_disksize to B.
> - System crash.
> - Remount and fsck complain about the extent size exceeds the inode
> size.
>
> This patch add checking i_disksize and chose the small one between
> i_size to make sure it's safe to convert extent to initialized.
>
> ---------------------
>
> This problem can reproduce by xfstests generic/482 with fsstress seed
> 1544025012.
Hmm, your explanation is great and the patch makes sense. I haven't
been able to reproduce the problem by adding -s 1544025012 to the
fsstress arguments. This may be because fsstress being run with two
processes (-p 2) and the failure may be timing dependent?
How easily can you replicate the problem?
- Ted