Hi,
We found a potential concurrency bug in linux kernel 5.3.11. We were able to reproduce this bug in x86 under specific thread interleavings. This bug causes a “checksum invalid” EXT4-fs error.
------------------------------------------
Kernel console output
EXT4-fs error (device sda1): swap_inode_boot_loader:124: inode #5: comm ski-executor:iget: checksum invalid
------------------------------------------
Test input
This bug occurs when a kernel test program is executed twice in different threads and ran concurrently. Our analysis has located that it happens when syscall ioctl with the EXT4_IOC_SWAP_BOOT flag is called twice and interleaves with itself.
The test program is generated by Syzkaller as follows:
r0 = creat(&(0x7f0000000080)='./file0\x00', 0x0)
ioctl$FS_IOC_SETFLAGS(r0, 0x40046602, &(0x7f0000000040))
r1 = creat(&(0x7f0000000000)='./file0\x00', 0x0)
pwrite64(r1, &(0x7f00000000c0)='\x00', 0x1, 0x1010000)
r2 = creat(&(0x7f0000000000)='./file0\x00', 0x0)
ioctl$EXT4_IOC_SWAP_BOOT(r2, 0x6611)
------------------------------------------
Interleaving
Our analysis revealed that the following interleaving triggers the bug.
CPU0 CPU1
swap_inode_boot_loader()
…
bytes = inode_bl->i_bytes;
inode_bl->i_blocks = inode->i_blocks;
inode_bl->i_bytes = inode->i_bytes;
---> err = ext4_mark_inode_dirty(handle, inode_bl);
ext4_mark_iloc_dirty() (fs/ext4/ioctl.c: 223)
ext4_do_update_inode()
ext4_inode_csum_set()
ext4_has_metadata_sum()
ext4_inode_csum()
ext4_chksum()
crypto_shash_update()
chksum_update()
[context switch]
swap_inode_boot_loader()
ext4_iget()
ext4_inode_csum_verify(fs/ext4/inode.c:4927)
[EXT4-fs error]
Thanks,
Sishuai
On Sep 4, 2020, at 9:57 AM, Gong, Sishuai <[email protected]> wrote:
> We found a potential concurrency bug in linux kernel 5.3.11. We were able to reproduce this bug in x86 under specific thread interleavings. This bug causes a “checksum invalid” EXT4-fs error.
I think there are two separate problems here, one that your test exposes,
and a larger an architectural conflict between swap_inode_boot_loader()
and metadata_csum.
The easier problem that your test code exposes is that the inode swap is
not "atomic" w.r.t. changing the data and updating the *inode* checksum, as
no sane user is going to be doing concurrent swaps on their bootloader inode
(race conditions would make it hard to know what the end result is). This
also has an easy fix - add a mutex to ext4_sb_info and held at the start of swap_inode_boot_loader() to provide exclusion for the whole thing to avoid
access to the inode until the process is complete. This is not performance
critical code.
The more complex issue (which you haven't hit, but initially I thought was
the culprit) is that there are checksums in the extent/index blocks that
use the inode number as part of the checksum seed, to allow detection of
incorrectly accessing extent metadata from the wrong inode, even if the
checksum of the contents would otherwise appear correct.
This means that the checksums for extents on the swapped inodes may need
to be updated as part of the swap to ensure that they are again valid, but
I don't see that being done anywhere. The "swap" is merely copying the
contents of the in-inode block pointers between the two inodes, so it may
have the wrong checksums, if that feature is enabled.
If the bootloader inode is very large (over 512MB), or the allocation is
very fragmented and needs many extents to describe, it may also overflow
the maximum 4 extents that can fit directly into the inode. That means
a traversal of the extent tree could be needed to update the checksums
in the index/extent blocks outside the inode itself. However, this is
unlikely to be an issue for most systems, but could be hit in real usage.
One important question is whether the boot loader inode is actually used
by (m)any distros or not? This would indicate how important this bug is,
or if this is currently only an academic exercise. The easy workaround
would be to disable metadata_csum on the boot filesystem, but that isn't
very appealing. Since concurrent swaps are already only a 0.000001%
issue, the mutex will fix those cases, but is unlikely to affect anyone.
The "fragmented and large" issue is more work to fix, and potentially an
issue that could be hit in real life. I checked one of my systems and
see that /boot has kernels and initrd/initramfs in the tens of MB, and
filefrag shows them as having 1-4 extents. Granted that my /boot fs is
small (190MB) it is harder to allocate contiguous extents. When using
a boot inode inside a larger filesystem, that may be less of a concern,
but could still be hit, so will eventually also need to be fixed.
Cheers, Andreas
> ------------------------------------------
> Kernel console output
>
> EXT4-fs error (device sda1): swap_inode_boot_loader:124: inode #5: comm ski-executor:iget: checksum invalid
>
> ------------------------------------------
> Test input
>
> This bug occurs when a kernel test program is executed twice in different threads and ran concurrently. Our analysis has located that it happens when syscall ioctl with the EXT4_IOC_SWAP_BOOT flag is called twice and interleaves with itself.
> The test program is generated by Syzkaller as follows:
> r0 = creat(&(0x7f0000000080)='./file0\x00', 0x0)
> ioctl$FS_IOC_SETFLAGS(r0, 0x40046602, &(0x7f0000000040))
> r1 = creat(&(0x7f0000000000)='./file0\x00', 0x0)
> pwrite64(r1, &(0x7f00000000c0)='\x00', 0x1, 0x1010000)
> r2 = creat(&(0x7f0000000000)='./file0\x00', 0x0)
> ioctl$EXT4_IOC_SWAP_BOOT(r2, 0x6611)
>
> ------------------------------------------
> Interleaving
>
> Our analysis revealed that the following interleaving triggers the bug.
> CPU0 CPU1
> swap_inode_boot_loader()
> …
> bytes = inode_bl->i_bytes;
> inode_bl->i_blocks = inode->i_blocks;
> inode_bl->i_bytes = inode->i_bytes;
> ---> err = ext4_mark_inode_dirty(handle, inode_bl);
>
> ext4_mark_iloc_dirty() (fs/ext4/ioctl.c: 223)
> ext4_do_update_inode()
> ext4_inode_csum_set()
> ext4_has_metadata_sum()
> ext4_inode_csum()
> ext4_chksum()
> crypto_shash_update()
> chksum_update()
> [context switch]
> swap_inode_boot_loader()
> ext4_iget()
> ext4_inode_csum_verify(fs/ext4/inode.c:4927)
> [EXT4-fs error]
>
>
>
>
> Thanks,
> Sishuai
>
Cheers, Andreas