Hi all,
Recently I've been digging into a corruption issue which I think is just about
pinned, but I'd appreciate some more expert EXT4 eyes to confirm we're on the
right path.
What we have boils down to a system with
- An ext4 filesystem with the journal disabled
- A workload[0] which, in a loop:
  - Creates a lot of small files
  - Occasionally deletes these files and collects them into a single larger
    "compound" file
  - Checks the header of all of these files periodically to ensure they're
    correct
After a while this check fails, and inspecting the "bad" file shows that its
contents are actually an EXT4 extent structure, for example:
[ec2-user@ip-172-31-0-206 ~]$ hexdump -C _2w.si
00000000 0a f3 05 00 54 01 00 00 00 00 00 00 00 00 00 00 |....T...........|
00000010 01 00 00 00 63 84 08 05 01 00 00 00 ff 01 00 00 |....c...........|
00000020 75 8a 1c 02 00 02 00 00 00 02 00 00 00 9c 1c 02 |u...............|
00000030 00 04 00 00 dc 00 00 00 00 ac 1c 02 dc 04 00 00 |................|
00000040 08 81 00 00 dc ac 1c 02 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000170 00 00 00 |...|
00000173
This starts with EXT4_EXT_MAGIC (cpu_to_le16(0xf30a)), and when parsed as an
extent header plus entry array it describes 5 extent entries at depth 0.
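Decoding the first 12 bytes by hand against the on-disk layout of struct
ext4_extent_header matches that; a minimal userspace sketch of the decode
(assuming a little-endian host, struct layout copied from
fs/ext4/ext4_extents.h):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct ext4_extent_header {
	uint16_t eh_magic;	/* 0xf30a */
	uint16_t eh_entries;	/* number of valid entries */
	uint16_t eh_max;	/* capacity of the entry array */
	uint16_t eh_depth;	/* 0 == leaf, entries point at data blocks */
	uint32_t eh_generation;
};

int main(void)
{
	/* First 12 bytes of the hexdump above. */
	const uint8_t buf[12] = { 0x0a, 0xf3, 0x05, 0x00, 0x54, 0x01,
				  0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
	struct ext4_extent_header eh;

	memcpy(&eh, buf, sizeof(eh));	/* on-disk fields are little-endian */
	printf("magic=0x%04x entries=%u max=%u depth=%u gen=%u\n",
	       eh.eh_magic, eh.eh_entries, eh.eh_max, eh.eh_depth,
	       eh.eh_generation);
	/* Prints: magic=0xf30a entries=5 max=340 depth=0 gen=0
	 * (340 == (4096 - 12) / 12, a full 4k block's worth of entries). */
	return 0;
}

So the start of the file really does look like a complete leaf extent block
from a 4k-blocksize filesystem, not just a few stray bytes.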
By the time the file is checked, the file that these extents presumably
pointed to appears to have been deleted, but reading the physical blocks
they reference shows what looks like the data of one of the larger files
this test creates.
Based on that what I think is happening is
- A file with separate (i.e. non-inline) extents is synced / written to disk
(in this case, one of the large "compound" files)
- ext4_end_io_end() kicks off writeback of extent metadata
- AIUI this marks the related buffers dirty but does not wait on them in the
no-journal case
- The file is deleted, causing the extents to be "removed" and the blocks where
  they were stored to be marked unused
- A new file is created (any file, separate extents not required)
- The new file is allocated the block that was just freed (the physical block
where the old extents were located)
Some time between this point and when the file is next read, the dirty extent
buffer hits the disk instead of the intended data for the new file.
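To make the suspected sequence concrete, a hypothetical minimal reproducer
following the steps above would look something like this (the real workload
is the Lucene/esrally one in [0]; the filenames and sizes here are made up,
error checking is omitted, and the block reuse in step 3 is of course not
guaranteed on any given iteration):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BIG_SZ	(8 << 20)

int main(void)
{
	char header[4096] = "HEADERMAGIC";
	char check[16];

	for (;;) {
		/* 1. Write and fsync a "compound" file made of enough
		 *    discontiguous extents (8 here) that its extent tree
		 *    spills out of the inode into a separate extent block. */
		int fd = open("compound", O_CREAT | O_TRUNC | O_WRONLY, 0644);
		for (off_t off = 0; off < BIG_SZ; off += 1 << 20)
			pwrite(fd, header, 4096, off);
		fsync(fd);
		close(fd);

		/* 2. Delete it: the extent block is freed, but (per the
		 *    theory above) may still be dirty in the buffer cache. */
		unlink("compound");

		/* 3. Create a new small file, which may be allocated the
		 *    block that was just freed. */
		fd = open("small.si", O_CREAT | O_TRUNC | O_RDWR, 0644);
		pwrite(fd, header, sizeof(header), 0);
		fsync(fd);

		/* 4. Later, re-read and verify the header; on a bad run the
		 *    file starts with an extent block (0xf30a) instead. */
		pread(fd, check, sizeof(check), 0);
		close(fd);
		if (memcmp(check, header, 12) != 0) {
			fprintf(stderr, "header mismatch!\n");
			return 1;
		}
		unlink("small.si");
	}
}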
A big-hammer hack in __ext4_handle_dirty_metadata() to always sync metadata
blocks appears to avoid the issue but isn't ideal - most likely a better
solution would be to ensure any dirty metadata buffers are synced before the
inode is dropped.
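For reference, the big hammer amounts to making the no-journal branch of
__ext4_handle_dirty_metadata() write the buffer out synchronously every
time, roughly along these lines (a simplified sketch rather than the exact
patch; it assumes the stock branch only calls sync_dirty_buffer() when
inode_needs_sync() is true):

	} else {
		if (inode)
			mark_buffer_dirty_inode(bh, inode);
		else
			mark_buffer_dirty(bh);
		/* Big hammer: sync the metadata buffer unconditionally,
		 * not only when the inode itself needs sync. */
		sync_dirty_buffer(bh);
		if (buffer_req(bh) && !buffer_uptodate(bh))
			err = -EIO;
	}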
Overall does this summary sound valid, or have I wandered into the weeds somewhere?
Cheers,
Sam Mendoza-Jonas
[0] This is an Elasticsearch/Lucene workload, running the esrally tests to hit the issue.
On Mon, Nov 08, 2021 at 09:35:20AM -0800, Samuel Mendoza-Jonas wrote:
> Based on that what I think is happening is
> - A file with separate (i.e. non-inline) extents is synced / written to disk
> (in this case, one of the large "compound" files)
> - ext4_end_io_end() kicks off writeback of extent metadata
> - AIUI this marks the related buffers dirty but does not wait on them in the
> no-journal case
> - The file is deleted, causing the extents to be "removed" and the blocks where
>   they were stored to be marked unused
> - A new file is created (any file, separate extents not required)
> - The new file is allocated the block that was just freed (the physical block
> where the old extents were located)
>
> Some time between this point and when the file is next read, the dirty extent
> buffer hits the disk instead of the intended data for the new file.
> A big-hammer hack in __ext4_handle_dirty_metadata() to always sync metadata
> blocks appears to avoid the issue but isn't ideal - most likely a better
> solution would be to ensure any dirty metadata buffers are synced before the
> inode is dropped.
>
> Overall does this summary sound valid, or have I wandered into the
> weeds somewhere?
Hmm... well, I can tell you what's *supposed* to happen. When the
extent block is freed, ext4_free_blocks() gets called with the
EXT4_FREE_BLOCKS_FORGET flag set. ext4_free_blocks() calls
ext4_forget() in two places; one when bh passed to ext4_free_blocks()
is NULL, and one where it is non-NULL. And then ext4_free_blocks()
calls bforget(), which should cause the dirty extent block to get
thrown away.
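Roughly, the intended path in the no-journal case is (paraphrased call
chain, not verbatim kernel code):

	ext4_ext_rm_idx()          /* freeing an extent tree block */
	  -> ext4_free_blocks(handle, inode, bh, block, count,
	                      EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET)
	       -> ext4_forget(handle, is_metadata, inode, bh, block)
	            -> bforget(bh)  /* no journal: drop the dirty bit so the
	                               stale extent buffer never hits disk */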
This *should* have prevented your failure scenario from taking place,
since after the call to bforget() the dirty extent buffer *shouldn't*
have hit the disk. If your theory is correct, then somehow either (a)
the bforget() wasn't called, or (b) the bforget() didn't work, and
then the page writeback for the new page happened first, and then
buffer cache writeback happened second, overwriting the intended data
for the new file.
Have you tried enabling the blktrace tracer in combination with some
of the ext4 tracepoints, to see if you can catch the double write
happening? Another thing to try would be enabling some tracepoints,
such as ext4_forget and ext4_free_blocks. Unfortunately we don't have
any tracepoints in fs/ext4/page-io.c to get a tracepoint which
includes the physical block ranges coming from the writeback path.
And the tracepoints in fs/fs-writeback.c won't have the physical block
number (just the inode and logical block numbers).
- Ted
On Mon, Nov 08, 2021 at 10:14:33PM -0500, Theodore Ts'o wrote:
> On Mon, Nov 08, 2021 at 09:35:20AM -0800, Samuel Mendoza-Jonas wrote:
> > Based on that what I think is happening is
> > - A file with separate (i.e. non-inline) extents is synced / written to disk
> > (in this case, one of the large "compound" files)
> > - ext4_end_io_end() kicks off writeback of extent metadata
> > - AIUI this marks the related buffers dirty but does not wait on them in the
> > no-journal case
> > - The file is deleted, causing the extents to be "removed" and the blocks where
> >   they were stored to be marked unused
> > - A new file is created (any file, separate extents not required)
> > - The new file is allocated the block that was just freed (the physical block
> > where the old extents were located)
> >
> > Some time between this point and when the file is next read, the dirty extent
> > buffer hits the disk instead of the intended data for the new file.
> > A big-hammer hack in __ext4_handle_dirty_metadata() to always sync metadata
> > blocks appears to avoid the issue but isn't ideal - most likely a better
> > solution would be to ensure any dirty metadata buffers are synced before the
> > inode is dropped.
> >
> > Overall does this summary sound valid, or have I wandered into the
> > weeds somewhere?
>
> Hmm... well, I can tell you what's *supposed* to happen. When the
> extent block is freed, ext4_free_blocks() gets called with the
> EXT4_FREE_BLOCKS_FORGET flag set. ext4_free_blocks() calls
> ext4_forget() in two places; one when bh passed to ext4_free_blocks()
> is NULL, and one where it is non-NULL. And then ext4_free_blocks()
> calls bforget(), which should cause the dirty extent block to get
> thrown away.
>
> This *should* have prevented your failure scenario from taking place,
> since after the call to bforget() the dirty extent buffer *shouldn't*
> have hit the disk. If your theory is correct, then somehow either (a)
> the bforget() wasn't called, or (b) the bforget() didn't work, and
> then the page writeback for the new page happened first, and then
> buffer cache writeback happened second, overwriting the intended data
> for the new file.
>
> Have you tried enabling the blktrace tracer in combination with some
> of the ext4 tracepoints, to see if you can catch the double write
> happening? Another thing to try would be enabling some tracepoints,
> such as ext4_forget and ext4_free_blocks. Unfortunately we don't have
> any tracepoints in fs/ext4/page-io.c to get a tracepoint which
> includes the physical block ranges coming from the writeback path.
> And the tracepoints in fs/fs-writeback.c won't have the physical block
> number (just the inode and logical block numbers).
>
> - Ted
I got unlucky and my test took about a week to hit the issue - in any
case, I did manage to pull out some blktrace info. Picking out the
interesting bits (all I/O below is to sector 746864240; in the RWBS column
W/WS/WSM are plain, sync, and sync metadata writes, and RA is readahead),
it looks like there's a point where some amount of actual data is written
to that block:
253,0 0 35872893 186704.840109354 15324 Q W 746864240 + 912 [mandb]
253,0 0 35872934 186704.861662488 15324 C W 746864240 + 912 [0]
253,0 1 35959519 187579.674191021 16452 Q RA 746864240 + 256 [elasticsearch[i]
253,0 1 35959520 187579.675556807 12335 C RA 746864240 + 256 [0]
253,0 3 10900958 188058.929475561 15532 Q W 746864240 + 784 [kworker/u8:5]
253,0 1 36047096 188059.058090710 4331 C W 746864240 + 784 [0]
253,0 1 36062826 188207.602295365 15865 Q W 746864240 + 960 [kworker/u8:6]
253,0 1 36062874 188207.657179910 16261 C W 746864240 + 960 [0]
253,0 0 38087456 198349.655260245 4346 Q WS 746864240 + 8 [elasticsearch[i]
253,0 0 38087457 198349.655735664 4331 C WS 746864240 + 8 [0]
253,0 0 38847537 202142.902362813 22525 Q RA 746864240 + 88 [elasticsearch[i]
253,0 0 38847538 202142.902870365 21686 C RA 746864240 + 88 [0]
253,0 1 38837970 202413.537374242 21099 Q W 746864240 + 8 [kworker/u8:6]
253,0 0 38917625 202413.560449306 21686 C W 746864240 + 8 [0]
253,0 3 11752576 202500.553085492 21858 Q W 746864240 + 8 [esrally]
253,0 0 38926880 202500.579232976 20195 C W 746864240 + 8 [0]
253,0 0 40475924 210529.224170625 19908 Q W 746864240 + 8 [kworker/u8:2]
253,0 1 40376745 210529.261411675 4329 C W 746864240 + 8 [0]
253,0 1 40410932 210831.620562489 26075 Q W 746864240 + 8 [elasticsearch[i]
253,0 0 40507813 210831.652311756 26102 C W 746864240 + 8 [0]
253,0 0 40517885 210930.628927045 23308 Q W 746864240 + 8 [elasticsearch[i]
253,0 1 40422327 210930.649373285 26075 C W 746864240 + 8 [0]
253,0 0 40549442 211195.656050786 26075 Q W 746864240 + 8 [elasticsearch[i]
253,0 1 40452319 211195.694039437 4330 C W 746864240 + 8 [0]
253,0 2 12367808 211371.325361922 4331 Q WSM 746864240 + 8 [elasticsearch[i]
253,0 0 40573575 211371.325815019 23887 C WSM 746864240 + 8 [0]
253,0 0 40573606 211371.595792860 4329 Q WSM 746864240 + 8 [elasticsearch[i]
253,0 0 40573607 211371.596251886 23887 C WSM 746864240 + 8 [0]
Those sync writes are repeated quite a bit, which would seem to line up
with the fsync+check that Elasticsearch does, and then some of the last
operations before the issue is detected are:
253,0 0 40575124 211388.516004566 4330 Q WSM 746864240 + 8 [elasticsearch[i]
253,0 1 40477703 211388.516615080 25975 C WSM 746864240 + 8 [0]
253,0 3 12268358 211388.842309745 4329 Q WSM 746864240 + 8 [elasticsearch[i]
253,0 1 40478089 211388.842725344 4319 C WSM 746864240 + 8 [0]
253,0 0 40601789 211508.676476007 19908 Q WM 746864240 + 8 [kworker/u8:2]
253,0 0 40601985 211508.763710372 19908 Q W 746864240 + 8 [kworker/u8:2]
253,0 1 40508182 211508.945657647 25975 C W 746864240 + 8 [0]
253,0 1 40508191 211508.946310947 574 C WM 746864240 + 8 [0]
253,0 1 40557027 211648.172734351 4345 Q RA 746864240 + 8 [elasticsearch[i]
253,0 1 40557028 211648.173102632 26203 C RA 746864240 + 8 [0]
That last write followed by a read could be the stale "extent" block being
written out and then read back by the checking program.
My script that was watching the tracepoints unfortunately failed somewhere
along the way; I'll set up the repro again and see if I can line the
tracepoint output up with the operations seen in the trace.
Cheers,
Sam