Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Subject: Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head
 removing while doing checkpoint
To:     Theodore Ts'o <tytso@mit.edu>
Cc:     Zhihao Cheng <chengzhihao1@huawei.com>, linux-ext4@vger.kernel.org,
        adilger.kernel@dilger.ca, jack@suse.cz, yi.zhang@huawei.com,
        yukuai3@huawei.com
References: <20230606135928.434610-1-yi.zhang@huaweicloud.com>
 <20230606135928.434610-5-yi.zhang@huaweicloud.com>
 <20230613043120.GB1584772@mit.edu>
 <20002902-39c5-914b-75b0-5a21b5cee25c@huawei.com>
 <20230613172749.GA18303@mit.edu> <20230614054222.GD51259@mit.edu>
 <1033cd3b-e41f-e4e0-c2ee-c4b23979208a@huaweicloud.com>
 <20230614203740.GE51259@mit.edu>
From:   Zhang Yi <yi.zhang@huaweicloud.com>
Message-ID: <22da163b-e6ab-d0b6-af75-ea1c3f4909ec@huaweicloud.com>
Date:   Mon, 26 Jun 2023 15:36:39 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.12.0
MIME-Version: 1.0
In-Reply-To: <20230614203740.GE51259@mit.edu>
Content-Type: text/plain; charset=windows-1252
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Precedence: bulk

On 2023/6/15 4:37, Theodore Ts'o wrote:
> On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>>
>> Sorry about the regression, I found that this issue is not introduced
>> by the first patch in this patch series ("jbd2: recheck chechpointing
>> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
>> initialization from journal_get_superblock()") [1] on your dev branch.
>>
>> The problem is the journal super block had been failed to write out
>> due to IO fault, it's uptodate bit was cleared by
>> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
>> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
>> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
>> unfortunately, the read IO is also fail, so the error handling in
>> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
>> to above NULL pointer dereference issue.
> 
> Thanks for looking into this.  What I believe you are saying is that
> the root cause is that earlier patch, but it is still something about
> the patch "jbd2: recheck chechpointing non-dirty buffer" which is
> changing the timing enough that we're hitting this buffer (because
> without the "recheck checkpointing" patch, I'm not seeing the NULL
> pointer dereference.
> 
> As far as the e2fsck bug that was causing it to hang in the ext4/adv
> test scenario, the patch was a simple one, and I have also checked in
> a test case which was a reliable reproducer of the problem.  (See
> attached for the patches and more detail.)
> 
> It is really interesting that "recheck checkpointing" patch is making
> enough of a difference that it is unmasking these bugs.  If you could
> take a look at these changes and perhaps think about how this patch
> series could be changing the nature of the corruption (specifically,
> how symlink inodes referenced from inline directories could be
> corupted with "rechecking checkpointing", thus unmasking the
> e2fsprogs, I'd really appreciate it.
> 

Hello Ted.

I have found the root cause of which trigger the e2fsck bug, it's a
real kernel bug which was introduced in 5b849b5f96b4 ("jbd2: fast
commit recovery path") when merging fast commit feature.

The problem is that when fast commit is enabled, kernel doesn't replay
the journal completely at mount time (there a bug in do_one_pass(),
see below for details) if the unrecovered transactions loop back to
the head of the normal journal area. If the missing transactions
contain a symlink block revoke record and a reuse record of the same
block, and the reuse record have been write back to the disk before
it's last umount, it could trigger the "Symlink xxx is invalid" after
the incomplete journal replay. Meanwhile it the symlink belongs to a
inline directory, it will trigger the e2fsck bug.

For example, we have a not cleaned image below.


 | normal journal area                             | fast commit area |
 +-------------------------------------------------+------------------+
 | tX+1(rere half)|tX+2|...| tX | tX+1(front half) |       ....       |
 +-------------------------------------------------+------------------+
                       /        /                  /                  /
             (real head)  s_start    journal->j_last journal->j_fc_last

Transaction X(checkpointed):
 - Create symlink 'foo' (use block A, contain 500 length of link name)
   in inline directory 'dir';
Transaction X+1(uncheckpoint):
 - Remove symlink 'foo' (block A is also freed);
Transaction X+2(uncheckpoint):
 - Create symlink 'bar' (reuse block A, contain 400 length of link
   name).

The above transactions are recorded to the journal, block A is also
successfully written back by the background write-back process.

If fast_commit feature is enabled, the journal->j_last point to the
first unused block behind the normal journal area instead of the whole
log area, and the journal->j_fc_last point to the first unused block
behind the fast_commit journal area. While doing journal recovery,
do_one_pass(PASS_SCAN) should first scan tX+1 in the normal journal
area and turn around to the first block once it meet journal->j_last,
but the wrap() macro misuse the journal->j_fc_last, so it could not
read tX+1's commit block, the recovery quit early mistakenly and
missing tX+1 and tX+2. So, after we mount the filesystem, we could
leftover an invalid symlink 'foo' in the inline directory and trigger
issue.

I can reproduce this issue either with or without this patch series
("jbd2: recheck chechpointing non-dirty buffer"), sometimes it takes
longer, sometimes it takes less. It's easy to reproduce the "Symlink
xxx is invalid" issue, but it's a little hard to trigger the e2fsck
bug (which means the invalid symlink should in a inline dir).
Especially I could 100% reproduce the fast_commit recovery bug when
running generic/475. So I couldn't find any relations between this
issue and this patch series.

I've send a patch to fix the fast commit bug separately[1]. I cannot
reproduce this issue again with this patch, please take a look at
that.

[1] https://lore.kernel.org/linux-ext4/20230626073322.3956567-1-yi.zhang@huaweicloud.com/T/#u

Thanks,
Yi.