2022-09-12 13:21:35

by JunChao Sun

Subject: How does newbie find bugs in ext4?

Hi Ted,

I am new to ext4; may I ask a few questions about it?

I am very interested in the ext4 file system and have been reading and
debugging the ext4 source code for 2-3 months (only the basic
open(create)/close and read/write paths, not the advanced features). I
want to contribute to ext4, but it seems hard to do so only by reading
and debugging on my own; I can't even find bugs. I have sent only two
patches so far. The reason I cannot find bugs in ext4 may be that I
cannot understand the code deeply enough just by reading and debugging
it...

I also see that many contributors fix bugs which they found at work,
but my company will not give me the opportunity to trace bugs in the
kernel; they just tell users "this is a bug in linux", and I could not
reproduce them by myself...

Could you please provide some suggestions for people like me who want
to contribute to ext4? Any suggestions about how to start contributing
step by step? I mean real bug fixes rather than documentation
corrections (those are also very important, and one of the patches I
have sent is a documentation correction, but I want to learn ext4 more
deeply). I know that there is the xfstests project, which is used for
testing ext4/xfs, but I think ext4 developers make sure all test cases
pass before releasing a new version, so is it necessary to retest ext4
using xfstests?

Best regards.


2022-09-12 16:39:22

by Theodore Ts'o

Subject: Re: How does newbie find bugs in ext4?

Hi,

So first of all, I would recommend that you learn how to use
kvm-xfstests. The reason for this is that kvm-xfstests is very useful
for testing any changes that you make. The same test appliance can be
used for testing file systems for Android and using Google Compute
Engine VM's (which is one of the best ways to use it). Please take a
look at these references:

https://thunk.org/gce-xfstests
https://github.com/tytso/xfstests-bld/blob/master/Documentation/what-is-xfstests.md
https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

In addition to using this as a way of a quick "playground" where you
can test patches, this can also be a good way to (for example) test
syzbot reports.
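
A minimal sketch of how such test runs might be scripted, assuming the
-c (file system config) and -g (test group) options described in the
kvm-xfstests documentation linked above. The helper name
kvm_xfstests_cmd is hypothetical, and the sketch only builds the
command line; it does not start a VM.

```python
import shlex


def kvm_xfstests_cmd(config=None, group=None, tests=()):
    """Build an argv list for kvm-xfstests; nothing is executed here."""
    argv = ["kvm-xfstests"]
    if config:
        # Assumed option: -c selects the file system configuration.
        argv += ["-c", config]
    if group:
        # Assumed option: -g selects an xfstests group such as "auto".
        argv += ["-g", group]
    argv += list(tests)
    return argv


# Run the whole "auto" group on the 4k config.
print(shlex.join(kvm_xfstests_cmd(config="4k", group="auto")))
# Run one specific test on the 4k config.
print(shlex.join(kvm_xfstests_cmd(config="4k", tests=["generic/475"])))
```

The resulting list can be passed to subprocess.run() once the exact
options have been checked against the documentation for the installed
version.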

Another thing you could do is to manually backport ext4 patches to the
stable kernels: patches which didn't get applied automatically because
they required some adjustments (or required backporting some
additional commits, etc.) to fix a particular problem. So for
example, you could try running xfstests using the latest 5.10.y or
5.15.y stable kernels, since as we fix bugs, we often add tests to
check for regressions. For example, if you look at the header of the
test ext4/058, you'll find:

# Set 256 blocks in a block group, then inject I/O pressure,
# it will trigger off kernel BUG in ext4_mb_mark_diskspace_used
#
# Regression test for commit
# a08f789d2ab5 ext4: fix bug_on ext4_mb_use_inode_pa
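
A header like the one above can be scanned mechanically for the fixing
commit. The following is a rough sketch, assuming the convention seen
in ext4/058 (a "Regression test for commit" line followed by a line
with the abbreviated hash and the commit subject); other tests may
word their headers differently, and the function name is hypothetical.

```python
import re

# Header text as quoted above from ext4/058.
HEADER = """\
# Set 256 blocks in a block group, then inject I/O pressure,
# it will trigger off kernel BUG in ext4_mb_mark_diskspace_used
#
# Regression test for commit
# a08f789d2ab5 ext4: fix bug_on ext4_mb_use_inode_pa
"""

# Assumed layout: marker line, then "# <hash> <subject>" on the next line.
COMMIT_RE = re.compile(
    r"^#\s*Regression test for commit\s*\n#\s*([0-9a-f]{7,40})\s+(.*)$",
    re.MULTILINE,
)


def find_regression_commit(header):
    """Return (hash, subject) if the header names a commit, else None."""
    m = COMMIT_RE.search(header)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()


print(find_regression_commit(HEADER))
```

With the hash and subject in hand, one can check whether the stable
branch being tested already contains a commit with the same subject.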

So if you find out that a particular test fails on an LTS kernel
(e.g., 5.15.y or 5.10.y), but it passes on upstream, it could be that
a missing commit needs to be backported. We don't currently have
anyone doing this on a regular basis for the LTS kernels (I might do
this once every few months, when I have time), so this could be a
good way for you to contribute and also learn more about ext4 as you
go.
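
The comparison described above amounts to a set difference. A small
sketch follows; the test lists are hypothetical and for illustration
only, not taken from a real run.

```python
def backport_candidates(lts_failures, upstream_failures):
    """Tests that fail on the LTS kernel but not upstream, sorted by name."""
    return sorted(set(lts_failures) - set(upstream_failures))


# Hypothetical results, for illustration only.
lts = ["ext4/058", "generic/455", "generic/475"]
upstream = ["generic/455"]

print(backport_candidates(lts, upstream))
```

A test that shows up here is only a candidate: flaky tests should be
rerun a few times before concluding that a fix is missing from the LTS
branch.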

Finally, I'll note that although I do run xfstests regularly, and will
reject patches that cause regressions, there are still some tests
that fail. For example, here is my latest test report:

TESTRUNID: ltm-20220912073217
KERNEL: kernel 6.0.0-rc4-xfstests #760 SMP PREEMPT_DYNAMIC Mon Sep 12 07:23:13 EDT 2022 x86_64
CMDLINE: full --kernel gs://gce-xfstests/kernel.deb
CPUS: 4
MEM: 7680

ext4/4k: 515 tests, 27 skipped, 4093 seconds
ext4/1k: 511 tests, 2 failures, 40 skipped, 5095 seconds
  Flaky: generic/475: 40% (2/5)   generic/476: 40% (2/5)
ext4/ext3: 507 tests, 115 skipped, 3514 seconds
ext4/encrypt: 493 tests, 3 failures, 129 skipped, 2583 seconds
  Failures: generic/681 generic/682 generic/691
ext4/nojournal: 510 tests, 4 failures, 94 skipped, 3610 seconds
  Failures: ext4/301 ext4/304 generic/455
  Flaky: generic/077: 40% (2/5)
ext4/ext3conv: 512 tests, 27 skipped, 3650 seconds
ext4/adv: 512 tests, 3 failures, 34 skipped, 3860 seconds
  Failures: generic/475 generic/477
  Flaky: generic/455: 80% (4/5)
ext4/dioread_nolock: 513 tests, 27 skipped, 4235 seconds
ext4/data_journal: 511 tests, 2 failures, 87 skipped, 3647 seconds
  Failures: generic/231 generic/455
ext4/bigalloc: 489 tests, 2 failures, 34 skipped, 3904 seconds
  Failures: generic/455 shared/298
ext4/bigalloc_1k: 488 tests, 2 failures, 51 skipped, 3826 seconds
  Failures: generic/455 shared/298
ext4/dax: 502 tests, 127 skipped, 2520 seconds
Totals: 6135 tests, 792 skipped, 80 failures, 0 errors, 44288s

(This was done by using gce-xfstests, which is a cloud VM variant of
kvm-xfstests. The equivalent run would take roughly 12 to 24 hours
using kvm-xfstests; gce-xfstests runs the test configs on multiple VMs
in parallel, so the wall clock time needed is perhaps two to two and a
half hours.)
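
A report in this format can be summarized with a few lines of Python
to see which tests fail across several configs. This is a sketch that
assumes the line layout shown in the report above (a config line
followed by optional "Failures:" and "Flaky:" lines); the function
name is hypothetical, and only hard failures are collected.

```python
import re
from collections import defaultdict

# A short excerpt of the report above.
REPORT = """\
ext4/nojournal: 510 tests, 4 failures, 94 skipped, 3610 seconds
Failures: ext4/301 ext4/304 generic/455
Flaky: generic/077: 40% (2/5)
ext4/data_journal: 511 tests, 2 failures, 87 skipped, 3647 seconds
Failures: generic/231 generic/455
ext4/bigalloc: 489 tests, 2 failures, 34 skipped, 3904 seconds
Failures: generic/455 shared/298
"""

# A config line looks like "<config>: <N> tests, ...".
CONFIG_RE = re.compile(r"^(\S+): \d+ tests,")


def failures_by_test(report):
    """Map each hard-failing test to the configs in which it failed."""
    result = defaultdict(list)
    config = None
    for raw in report.splitlines():
        line = raw.strip()  # tolerate indented Failures/Flaky lines
        m = CONFIG_RE.match(line)
        if m:
            config = m.group(1)
        elif line.startswith("Failures:") and config is not None:
            for test in line[len("Failures:"):].split():
                result[test].append(config)
    return dict(result)


by_test = failures_by_test(REPORT)
# Show tests that fail in more than one config.
for test, configs in sorted(by_test.items()):
    if len(configs) > 1:
        print(test, configs)
```

In this excerpt only generic/455 fails in more than one config. As a
side note on the timing, the 44288 seconds on the Totals line is about
12.3 hours, which matches the lower end of the serial estimate.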

In general, I try very hard to make sure that ext4/4k (ext4 with the
default 4k block size) is free of failures when running the xfstests
"auto" group. However, you'll see that there are other configs where
there are failures, some of which have been around for a while.
The challenge is that these are often bugs that more senior ext4
developers have tried looking at for, say, an hour or two, and then
said, "I have higher priority fires to fight". So these might not be
the best test failures to ask an ext4 newbie to debug. That being
said, if you don't mind a bit (or a lot) of frustration, it could be
that you might be able to root-cause some of these failed tests.

(But starting with testing the LTS kernels might be a better place to
start.)

Cheers,

- Ted

2022-09-13 13:08:54

by JunChao Sun

Subject: Re: How does newbie find bugs in ext4?

Thanks a lot for your suggestions and patience. This is great
guidance for an ext4 newbie!


