2022-03-11 22:20:42

by Detlev Casanova

Subject: PROBLEM: EXT4 checksumming issues

Hello!

We are experiencing issues with the ext4 file system in our automated tests.
Here is the required information:

[2.] Full description of the problem/report:

Sometimes, accessing a file on the EXT4 file system fails with an error message
in the kernel log. So far, we have observed three different kinds of messages:

- EXT4-fs error (device mmcblk1p2): ext4_lookup:1785: inode #10287: comm
ostree: iget: checksum invalid
- EXT4-fs error (device mmcblk0p3): __ext4_find_entry:1623: inode #258562:
comm gst-launch-1.0: checksumming directory block 0
- EXT4-fs error (device mmcblk0p3): ext4_validate_block_bitmap:390: comm
fstrim: bg 16: bad block bitmap checksum
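
To compare failures across test runs, the three messages above can be pulled
out of a kernel log and grouped by the reporting function. The following is a
minimal sketch, not a tool we currently use: the names EXT4_ERROR,
parse_ext4_errors and count_by_function are hypothetical, the pattern is
derived only from the three messages quoted here, and it assumes each message
sits on a single log line (the quotes above are wrapped by the mail client).

```python
import re
from collections import Counter

# Pattern derived from the three messages quoted in this report (assumption:
# other EXT4-fs errors may use a layout this pattern does not cover).
# The "inode #N:" field is optional: the block bitmap error has none.
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):(?P<line>\d+): "
    r"(?:inode #(?P<inode>\d+): )?"
    r"comm (?P<comm>.+?): "
    r"(?P<message>.*)"
)


def parse_ext4_errors(log_text):
    """Return one dict per EXT4-fs error line found in log_text."""
    errors = []
    for line in log_text.splitlines():
        # search() rather than match(), so a leading dmesg timestamp is fine.
        match = EXT4_ERROR.search(line)
        if match:
            errors.append(match.groupdict())
    return errors


def count_by_function(errors):
    """Count errors per reporting kernel function."""
    return Counter(error["function"] for error in errors)


# The three messages from this report, each joined onto a single line.
SAMPLE_LOG = """\
EXT4-fs error (device mmcblk1p2): ext4_lookup:1785: inode #10287: comm ostree: iget: checksum invalid
EXT4-fs error (device mmcblk0p3): __ext4_find_entry:1623: inode #258562: comm gst-launch-1.0: checksumming directory block 0
EXT4-fs error (device mmcblk0p3): ext4_validate_block_bitmap:390: comm fstrim: bg 16: bad block bitmap checksum
"""

if __name__ == "__main__":
    for error in parse_ext4_errors(SAMPLE_LOG):
        print(error["device"], error["function"], error["comm"], error["message"])
```

Grouping by function rather than by full message keeps the inode and block
group numbers, which differ on every run, from splitting one kind of failure
into many.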

The first issue was apparently fixed by patching our kernel with this patchset:
https://lore.kernel.org/all/[email protected]/

The second issue seems to happen with all kinds of programs. In this
instance, it was gstreamer opening a file. It can also happen when mkdir
creates a directory.

The third issue seems to only happen with fstrim.

The issue appears to be random: it cannot be reproduced easily, and we have no
procedure that triggers it reliably.

Each time a test suite is run, the image is freshly written to the device. The
same test, run multiple times, sometimes fails and sometimes passes.

[3.] Keywords (i.e., modules, networking, kernel):
ext4, checksum

[4.] Kernel information
We use a modified version of the Debian kernel. The source code is here:
https://gitlab.apertis.org/pkg/linux

None of our patches modify the ext4 filesystem code.

[4.1.] Kernel version (from /proc/version):
It is hard to determine when the issue started appearing. Our best guess is
that it started when we upgraded from 5.15.1 to 5.15.22.
One version where this failed is:

Linux version 5.15.0-trunk-amd64 ([email protected]) (gcc-10
(Apertis 10.2.1-6+apertis6bv2023dev1b2) 10.2.1 20210110, GNU ld (GNU Binutils
for Apertis) 2.35.2) #1 SMP Debian 5.15.22-0~apertis2 (2022-02-16)
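
Note that the banner reports 5.15.0-trunk-amd64 as the release while the
package version at the end carries 5.15.22. The sketch below shows one way to
extract both when comparing test runs. It assumes the banner ends with
"Debian <package version> (<build date>)" as in the sample above; the names
BANNER and parse_version_banner are hypothetical, and the sample is the banner
above joined onto one line.

```python
import re

# Assumption: the banner ends with "Debian <package version> (<build date>)",
# as in the sample from this report. Other distributions differ.
BANNER = re.compile(
    r"^Linux version (?P<release>\S+) .* Debian (?P<package>\S+) "
    r"\((?P<date>\d{4}-\d{2}-\d{2})\)\s*$",
    re.DOTALL,
)


def parse_version_banner(banner):
    """Split a /proc/version banner into release, package, date, upstream."""
    match = BANNER.match(banner.strip())
    if match is None:
        return None
    info = match.groupdict()
    # The upstream version is the part of the package version before the
    # first "-" (the rest is the distribution revision).
    info["upstream"] = info["package"].split("-", 1)[0]
    return info


# The banner quoted in this report, joined onto a single line.
SAMPLE_BANNER = (
    "Linux version 5.15.0-trunk-amd64 ([email protected]) (gcc-10 "
    "(Apertis 10.2.1-6+apertis6bv2023dev1b2) 10.2.1 20210110, GNU ld (GNU Binutils "
    "for Apertis) 2.35.2) #1 SMP Debian 5.15.22-0~apertis2 (2022-02-16)"
)

if __name__ == "__main__":
    print(parse_version_banner(SAMPLE_BANNER))
```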

[4.2.] Kernel .config file:
See the attached config.txt file

[5.] Most recent kernel version which did not have the bug:
My guess is 5.15.1, but I cannot be sure of this.

[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/admin-guide/bug-hunting.rst)
N/A

[7.] A small shell script or example program which triggers the
problem (if possible)
N/A, although you can check the full output here, for example:
https://lava.collabora.co.uk/scheduler/job/5756873#L12901
(the link points at the line of the error)

[8.] Environment
We use two deployment types for our images: APT (classic Debian apt) and
OSTree. The issue seems to happen only with OSTree images.

Also, the issue has happened on multiple different boards, with multiple
architectures (amd64, armhf and arm64), so failing hardware is unlikely to be
at fault here.

[8.1.] Software (add the output of the ver_linux script here)
[8.2.] Processor information (from /proc/cpuinfo):
Not related

[8.3.] Module information (from /proc/modules):
See attached modules.txt file

[8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
Happens on different HW

[8.5.] PCI information ('lspci -vvv' as root)
Not related

[8.6.] SCSI information (from /proc/scsi/scsi)
Not related

[8.7.] Other information that might be relevant to the problem
(please look in /proc and include all information that you
think to be relevant):
See the output of mount (amd64) in mount.txt
The issues can happen on the rootfs or the home partition.

[X.] Other notes, patches, fixes, workarounds:
Because I am not familiar with the internals of the ext4 file system and the
issue is random and hard to reproduce, I am mainly asking for pointers or for
patches in review to try. I can get more information as needed.

Regards,

Detlev.


Attachments:
config.txt (249.62 kB)
modules.txt (5.89 kB)
mount.txt (2.26 kB)