2022-03-16 12:25:12

by zhanchengbin

Subject: e2fsck: do not skip deeper checkers when s_last_orphan list has truncated inodes

If the system crashes while a file is being truncated, we get a
problematic inode, and it is added to the fs->super->s_last_orphan list.
When we run `e2fsck -a img`, the s_last_orphan list is traversed and
cleared. During this pass, orphan inodes with i_links_count == 0 are
deleted, while orphan inodes with i_links_count != 0 (e.g. a truncated
inode) cannot be deleted. However, even when there are orphan inodes with
i_links_count != 0, EXT2_VALID_FS is still set in fs->super->s_state, so
the deeper checks are skipped and some inconsistencies are left behind.
This patch clears the EXT2_VALID_FS flag when there are orphan inodes with
i_links_count != 0, so that the deeper checks are run.

The problem with truncated files:
[root@localhost ~]# e2fsck -a img
img: recovering journal
img: Truncating orphaned inode 188 (uid=0, gid=0, mode=0100666, size=0)
img: Truncating orphaned inode 174 (uid=0, gid=0, mode=0100666, size=0)
img: clean, 484/128016 files, 118274/512000 blocks
[root@localhost ~]# e2fsck -fn img
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Inode 174, i_blocks is 2, should be 0. Fix? no

Inode 188, i_blocks is 2, should be 0. Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

img: ********** WARNING: Filesystem still has errors **********

img: 484/128016 files (24.6% non-contiguous), 118274/512000 blocks
[root@localhost ~]# e2fsck -a img
img: clean, 484/128016 files, 118274/512000 blocks

However, running `e2fsck -f img` clears the EXT2_VALID_FS flag, so running
`e2fsck -a img` again afterwards fixes the problem.

[root@localhost ~]# e2fsck -f img
e2fsck 1.46.5 (30-Dec-2021)
Pass 1: Checking inodes, blocks, and sizes
Inode 174, i_blocks is 2, should be 0. Fix<y>? no
Inode 188, i_blocks is 2, should be 0. Fix<y>? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

img: ********** WARNING: Filesystem still has errors **********

img: 484/128016 files (24.6% non-contiguous), 118274/512000 blocks
[root@localhost ~]# e2fsck -a img
img was not cleanly unmounted, check forced.
img: Inode 174, i_blocks is 2, should be 0. FIXED.
img: Inode 188, i_blocks is 2, should be 0. FIXED.
img: 484/128016 files (24.6% non-contiguous), 118274/512000 blocks
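
The "check forced" in the last run follows from how the state flag is tested. As a rough sketch, assuming a simplified model of the skip decision (the function names are hypothetical and are not e2fsck's; only the two flag values are taken from ext2_fs.h), clearing EXT2_VALID_FS turns a "clean" result into a full check:

```bash
# Superblock state bits, with the values defined in ext2_fs.h.
EXT2_VALID_FS=1   # 0x0001: file system was cleanly unmounted
EXT2_ERROR_FS=2   # 0x0002: errors were detected

# Hypothetical helper: the effect of "s_state &= ~EXT2_VALID_FS".
clear_valid_fs() {
    echo $(( $1 & ~EXT2_VALID_FS ))
}

# Hypothetical, simplified model of the decision to skip the full check.
# Arguments: s_state, force (1 if -f was given, otherwise 0).
# The real decision also looks at other superblock fields; this is a sketch.
check_decision() {
    s_state=$1
    force=$2
    if [ "$force" -eq 1 ]; then
        echo "full check (forced by -f)"
    elif [ $(( s_state & EXT2_ERROR_FS )) -ne 0 ]; then
        echo "full check (file system has errors)"
    elif [ $(( s_state & EXT2_VALID_FS )) -eq 0 ]; then
        echo "full check (not cleanly unmounted)"
    else
        echo "skip (clean)"
    fi
}
```

For example, `check_decision 1 0` models the first `e2fsck -a img` run above, while `check_decision "$(clear_valid_fs 1)" 0` models the same run once the flag has been cleared.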

Signed-off-by: zhanchengbin <[email protected]>
---
e2fsck/super.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/e2fsck/super.c b/e2fsck/super.c
index 9495e029..f4a414b7 100644
--- a/e2fsck/super.c
+++ b/e2fsck/super.c
@@ -351,6 +351,7 @@ static int release_orphan_inode(e2fsck_t ctx, ext2_ino_t *ino, char *block_buf)
 		inode.i_dtime = ctx->now;
 	} else {
 		inode.i_dtime = 0;
+		fs->super->s_state &= ~EXT2_VALID_FS;
 	}
 	e2fsck_write_inode_full(ctx, *ino, EXT2_INODE(&inode),
 				sizeof(inode), "delete_file");




2022-03-16 14:17:16

by Theodore Ts'o

Subject: Re: e2fsck: do not skip deeper checkers when s_last_orphan list has truncated inodes

On Tue, Mar 15, 2022 at 04:01:45PM +0800, zhanchengbin wrote:
> If the system crashes when a file is being truncated, we will get a
> problematic inode,
> and it will be added into fs->super->s_last_orphan.
> When we run `e2fsck -a img`, the s_last_orphan list will be traversed and
> deleted.
> During this period, orphan inodes in the s_last_orphan list with
> i_links_count==0 can
> be deleted, and orphan inodes with i_links_count !=0 (ex. the truncated
> inode)
> cannot be deleted. However, when there are some orphan inodes with
> i_links_count !=0,
> the EXT2_VALID_FS is still assigned to fs->super->s_state, the deeper
> checkers are skipped
> with some inconsistency problems.

That's not supposed to happen. We regularly put inodes on the orphan
list when they are being truncated so that if we crash, the truncation
operation can be completed as part of the journal recovery and remount
operation. This is true regardless of whether the recovery is done by
e2fsck or by the kernel.

If a crash during a truncate leads to an inconsistent file system
after the file system is mounted, or after e2fsck does the journal
replay and orphan inode list processing, that's a kernel bug, and we
should fix the bug in the kernel.

Do you have a reliable reproducer for this situation?

Thanks,

- Ted

2022-03-18 10:48:38

by zhanchengbin

Subject: Re: e2fsck: do not skip deeper checkers when s_last_orphan list has truncated inodes



On 2022/3/16 1:54, Theodore Ts'o wrote:
> On Tue, Mar 15, 2022 at 04:01:45PM +0800, zhanchengbin wrote:
>> If the system crashes when a file is being truncated, we will get a
>> problematic inode,
>> and it will be added into fs->super->s_last_orphan.
>> When we run `e2fsck -a img`, the s_last_orphan list will be traversed and
>> deleted.
>> During this period, orphan inodes in the s_last_orphan list with
>> i_links_count==0 can
>> be deleted, and orphan inodes with i_links_count !=0 (ex. the truncated
>> inode)
>> cannot be deleted. However, when there are some orphan inodes with
>> i_links_count !=0,
>> the EXT2_VALID_FS is still assigned to fs->super->s_state, the deeper
>> checkers are skipped
>> with some inconsistency problems.
>
> That's not supposed to happen. We regularly put inodes on the orphan
> list when they are being truncated so that if we crash, the truncation
> operation can be completed as part of the journal recovery and remount
> operation. This is true regardless of whether the recovery is done by
> e2fsck or by the kernel.

Yes, you are right.
The truncation has completed, and the file ACL has been set to zero in
release_inode_blocks(), but the ACL block was not subtracted from i_blocks,
so i_blocks is inconsistent.
Li Jinlin sent a patch yesterday to fix it.
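
To make the accounting concrete: i_blocks normally counts 512-byte units, so a single 1 KiB ACL (extended attribute) block contributes 2, which matches the "i_blocks is 2, should be 0" messages. The 1 KiB block size is inferred from those numbers and is not stated in the thread, and the helper names below are hypothetical. A minimal sketch of the missing step:

```bash
# i_blocks is kept in 512-byte units (the huge_file case is ignored here).
sectors_per_block() {
    echo $(( $1 / 512 ))
}

# Hypothetical model of releasing an inode's ACL block.
# Arguments: current i_blocks, file system block size in bytes.
# Prints the i_blocks value that is consistent once the file ACL is zeroed.
release_acl_block() {
    i_blocks=$1
    block_size=$2
    echo $(( i_blocks - $(sectors_per_block "$block_size") ))
}
```

With i_blocks of 2 and a 1024-byte block, `release_acl_block 2 1024` gives the value that Pass 1 expects for inodes 174 and 188.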

>
> If a crash during a truncate leads to an inconsistent file system
> after the file system is mounted, or after e2fsck does the journal
> replay and orphan inode list processing, that's a kernel bug, and we
> should fix the bug in the kernel.
>
> Do you have a reliable reproducer for this situation?

I have a reproducer, but it does not necessarily trigger the problem every time:
#!/bin/bash
# Multipath devices whose names match "filedisk".
disk_list=$(multipath -ll | grep filedisk | awk '{print $1}')

for disk in ${disk_list}
do
    mkfs.ext4 -F /dev/mapper/$disk
    mkdir ${disk}
done

# Inject I/O errors by logging out of and back in to both iSCSI portals.
function err_inject()
{
    iscsiadm -m node -p 127.0.0.1 -u &> /dev/null
    iscsiadm -m node -p 127.0.0.1 -l &> /dev/null
    sleep 1
    iscsiadm -m node -p 9.82.236.206 -u &> /dev/null
    iscsiadm -m node -p 9.82.236.206 -l &> /dev/null
    sleep 1

    iscsiadm -m node -p 127.0.0.1 -u &> /dev/null
    iscsiadm -m node -p 127.0.0.1 -l &> /dev/null
    iscsiadm -m node -p 9.82.236.206 -u &> /dev/null
    iscsiadm -m node -p 9.82.236.206 -l &> /dev/null
    sleep 1
}

count=0
while true
do
    ((count=count+1))

    # Mount every disk and start fsstress on it in the background.
    for disk in ${disk_list}
    do
        while true
        do
            mount -o data_err=abort,errors=remount-ro /dev/mapper/$disk $disk && break
            sleep 0.1
        done
        nohup fsstress -d $(pwd)/$disk -l 10 -n 1000 -p 10 &>/dev/null &
    done

    sleep 5

    # Stop if any disk has already been remounted read-only.
    for disk in ${disk_list}
    do
        dm=$(multipath -ll | grep -w $disk | awk '{print $2}')
        aqu_sz=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $(NF-1)}')
        util=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $NF}')
        #if [ "${aqu_sz}" == "0.00" -o "$util" == "0.00" ];then
        #    iostat -x 1 -d 2
        #    exit 1
        #fi
        mount | grep $disk | grep '(ro' && exit 1
    done

    err_inject

    # Wait for all fsstress processes to finish.
    while [ -n "`pidof fsstress`" ]
    do
        sleep 1
    done

    for disk in ${disk_list}
    do
        umount $disk
        dm=$(multipath -ll | grep -w $disk | awk '{print $2}')
        aqu_sz=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $(NF-1)}')
        util=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $NF}')
        # Stop if the device still shows queued I/O or utilization after umount.
        if [ "${aqu_sz}" != "0.00" -o "$util" != "0.00" ];then
            iostat -x 1 -d 2
            exit 1
        fi

        # Save an image of the device before running fsck.
        dd bs=1M if=/dev/mapper/$disk of=/root/dockerback

        # Preen: exit status 0 (no errors) and 1 (errors corrected) are acceptable.
        fsck.ext4 -a /dev/mapper/$disk
        ret=$?
        if [ $ret -ne 0 -a $ret -ne 1 ]; then
            exit 1
        fi

        # A forced read-only check must then find nothing; otherwise stop.
        fsck.ext4 -fn /dev/mapper/$disk
        ret=$?
        if [ $ret -ne 0 ]; then
            exit 1
        fi
    done

    # Every few iterations, drop caches and record memory and slab usage.
    if [ $count -gt 5 ];then
        echo 3 > /proc/sys/vm/drop_caches
        sleep 1
        cat /proc/meminfo >> mem.txt
        echo "" >> mem.txt
        slabtop -o >> slab.txt
        echo "" >> slab.txt
        count=0
    fi
done
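
The two status checks in the script rely on the e2fsck exit status being a bit mask, as documented in e2fsck(8). A small helper that decodes it can make a failing run easier to read. This is only a sketch: `decode_fsck_status` is a hypothetical name and is not part of the reproducer.

```bash
# Decode an e2fsck exit status into the meanings listed in e2fsck(8).
# Argument: the numeric exit status (for example "$?" after fsck.ext4).
decode_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    result=""
    # Each entry is "bit:meaning"; the status is the sum of the bits that apply.
    for pair in \
        "1:errors corrected" \
        "2:reboot recommended" \
        "4:errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:canceled by user" \
        "128:shared library error"
    do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $(( $status & $bit )) -ne 0 ]; then
            result="${result:+$result, }$text"
        fi
    done
    echo "$result"
}
```

In the loop above it could be called as `decode_fsck_status $ret` just before each `exit 1`. The inconsistency discussed in this thread would show up in the `fsck.ext4 -fn` step as status 4.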

>
> Thanks,
>
> - Ted
> .
>