2022-11-17 15:32:30

by Eric Whitney

[permalink] [raw]
Subject: [PATCH] ext4: fix delayed allocation bug in ext4_clu_mapped for bigalloc + inline

When converting files with inline data to extents, delayed allocations
made on a file system created with both the bigalloc and inline options
can result in invalid extent status cache content, incorrect reserved
cluster counts, kernel memory leaks, and potential kernel panics.

With bigalloc, the code that determines whether a block must be
delayed allocated searches the extent tree to see if that block maps
to a previously allocated cluster. If not, the block is delayed
allocated, and otherwise, it isn't. However, if the inline option is
also used, and if the file containing the block is marked as able to
store data inline, there isn't a valid extent tree associated with
the file. The current code in ext4_clu_mapped() calls
ext4_find_extent() to search the non-existent tree for a previously
allocated cluster anyway, which typically finds nothing, as desired.
However, a side effect of the search can be to cache invalid content
from the non-existent tree (garbage) in the extent status tree,
including bogus entries in the pending reservation tree.

To fix this, avoid searching the extent tree when allocating blocks
for bigalloc + inline files that are being converted from inline to
extent mapped.

Signed-off-by: Eric Whitney <[email protected]>
---
fs/ext4/extents.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f1956288307f..a8928d6d47ac 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5791,6 +5791,14 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu)
struct ext4_extent *extent;
ext4_lblk_t first_lblk, first_lclu, last_lclu;

+ /*
+ * if data can be stored inline, the logical cluster isn't
+ * mapped - no physical clusters have been allocated, and the
+ * file has no extents
+ */
+ if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
+ return 0;
+
/* search for the extent closest to the first block in the cluster */
path = ext4_find_extent(inode, EXT4_C2B(sbi, lclu), NULL, 0);
if (IS_ERR(path)) {
--
2.30.2



2022-11-22 00:49:42

by Eric Whitney

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix delayed allocation bug in ext4_clu_mapped for bigalloc + inline

* Eric Whitney <[email protected]>:
> When converting files with inline data to extents, delayed allocations
> made on a file system created with both the bigalloc and inline options
> can result in invalid extent status cache content, incorrect reserved
> cluster counts, kernel memory leaks, and potential kernel panics.
>
> With bigalloc, the code that determines whether a block must be
> delayed allocated searches the extent tree to see if that block maps
> to a previously allocated cluster. If not, the block is delayed
> allocated, and otherwise, it isn't. However, if the inline option is
> also used, and if the file containing the block is marked as able to
> store data inline, there isn't a valid extent tree associated with
> the file. The current code in ext4_clu_mapped() calls
> ext4_find_extent() to search the non-existent tree for a previously
> allocated cluster anyway, which typically finds nothing, as desired.
> However, a side effect of the search can be to cache invalid content
> from the non-existent tree (garbage) in the extent status tree,
> including bogus entries in the pending reservation tree.
>
> To fix this, avoid searching the extent tree when allocating blocks
> for bigalloc + inline files that are being converted from inline to
> extent mapped.
>
> Signed-off-by: Eric Whitney <[email protected]>
> ---
> fs/ext4/extents.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index f1956288307f..a8928d6d47ac 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -5791,6 +5791,14 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu)
> struct ext4_extent *extent;
> ext4_lblk_t first_lblk, first_lclu, last_lclu;
>
> + /*
> + * if data can be stored inline, the logical cluster isn't
> + * mapped - no physical clusters have been allocated, and the
> + * file has no extents
> + */
> + if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
> + return 0;
> +
> /* search for the extent closest to the first block in the cluster */
> path = ext4_find_extent(inode, EXT4_C2B(sbi, lclu), NULL, 0);
> if (IS_ERR(path)) {
> --
> 2.30.2
>

The bug this patch addresses was a topic in last week's concall, and on
request here's some more detailed information describing the circumstances
of the test failure I saw.

generic/478 produced a kernel panic during a -g auto run of kvm-xfstests'
bigalloc_inline test configuration. This was caused by a BUG_ON in
ext4_es_cache_extent, where the value of lblk passed in from
ext4_find_extent -> ext4_cache_extents was 0xffffffff and len was a much
smaller non-zero value. Running generic/478 alone isn't sufficient to
reproduce the problem - generic/192 must be run first. That sequence of
tests can be reduced to the following reproducer (in fstest terms):

echo test >$TEST_DIR/testfile

_test_cycle_mount

$XFS_IO_PROG -f -c "pwrite -S 0xFF 0 4096" \
$TEST_DIR/testfile >> $seqres.full 2>&1

What happens as a result of the xfs_io pwrite is that the inline data area is
read from the test file and treated as extents which ext4_find_extent attempts
to cache before proceeding to search for an allocated cluster, as noted above,
at the point in time where the file is being converted from inline to extents.
(All this descends from ext4_da_convert_inline_data_to_extent). The BUG_ON
is triggered because there happens to be an 0xffffffff value in the inline
data area that is interpreted as an lblk. Before that happens, a few
bogus extents are inserted in the extent status cache. It seems clear that if
the BUG_ON hadn't been triggered with an appropriate value, execution
would have continued with the possibility of the consequences noted above.

A look at the extent header passed to ext4_cache_extents by
ext4_find_extent reveals a bad magic number which reflects a portion of
the string "test" rather than EXT4_EXT_MAGIC (0xf30a).

Removing the _test_cycle_mount leads to a different result. The
reproducer executes without triggering a panic. The difference in this
case is that there is an empty extent tree in place of the inline data;
the magic number is valid. ext4_find_extent proceeds without caching
anything, and fails to find an allocated cluster for that empty tree.
All at least okay.

So, the difference appears to be that the new extent tree hasn't been set
up before the delayed allocation activity resulting from conversion in
the umount/mount case, and it has when that doesn't take place in the
other case. I've been looking for the reason.

In either case, there's no point in going through the motions of
searching for a potential cluster mapping at the time of conversion when
there can be no extent tree yet - hence the patch above. There's also
the possibility of shortcutting this higher in the call chain in
ext4_da_map_blocks (where there's an odd twist - extent status cache
gets searched but extent tree does not before delayed allocation). For
the moment, doing this in ext4_clu_mapped seems lowest risk.

An aside: wouldn't it make sense to harden ext4_find_extent a little more by
checking eh_magic?

Eric