2022-11-09 14:49:19

by Ryusuke Konishi

[permalink] [raw]
Subject: Re: [PATCH] nilfs2: fix NULL pointer dereference in nilfs_segctor_prepare_write()

On Wed, Nov 9, 2022 at 12:26 PM Liu Shixin wrote:
> On 2022/11/9 1:50, Ryusuke Konishi wrote:
> > Hi Liu Shixin,
> >
> > I'm sorry for my repeated emails.
> >
> > On Wed, Nov 9, 2022 at 12:13 AM Ryusuke Konishi wrote:
> >>> I think the most likely cause is metadata corruption. If so, we
> >>> should fix it by adding a proper sanity check, yes.
> >>> However, there is still the possibility that the error retry logic
> >>> after detecting errors has a flaw. (I believe at least this is not a
> >>> problem with normal paths.)
> >>> The sufile state inconsistency above is hypothetical for now.
> >>> Either way, I'd like to make sure what's actually happening (and where
> >>> the anomaly is coming from) so we can decide how to fix it.
> >> I noticed that this syzbot report has a disk image "mount_0.gz", so I
> >> tried to mount it read-only.
> >> The result was as follows:
> >>
> >> $ sudo mount -t nilfs2 -r ./mount_0 /mnt/test
> >> $ lssu
> >> SEGNUM DATE TIME STAT NBLOCKS
> >> 0 26760730-10-29 19:40:00 -de 527958016
> >> 1 26760730-11-01 20:29:04 -de 527958016
> >> 2 1176433641-11-01 02:01:52 -de 2983038235
> >> 3 -1158249729-11-01 04:57:19 a-- 25375
> >> 4 1970-01-02 21:24:32 a-- 0
> >> 5 -1415215929-01-02 00:19:15 --e 1631451365
> >> 6 841067190-01-02 13:02:59 -d- 3101461260
> >> 7 1495233207-01-02 01:31:11 -de 1697748441
> >> 8 1875753393-01-02 21:54:14 -de 337757600
> >> 9 648878284-01-02 21:06:08 --- 2133388152
> >> 10 2122778986-01-02 17:49:43 --e 874605800
> >> 11 55846197-01-02 09:50:53 -de 4262365368
> >> 12 1872396026-01-02 06:45:18 --- 1051768258
> >> 13 820920473-01-02 19:28:35 --e 2932336675
> >> 14 2128306755-01-02 10:17:45 --- 3568501216
> >> 15 1457063063-01-02 01:24:17 --e 2330511560
> >> 16 586864775-01-02 16:08:15 --- 355283425
> >> 17 -824870041-01-02 22:47:26 -d- 4177999057
> >> 18 1562176264-01-02 08:06:45 --- 1312597355
> >> 19 -392925420-01-02 17:08:27 -d- 3811075948
> >> 20 1927575458-01-02 21:26:51 -de 1384562525
> >> 21 2139923505-01-02 08:16:04 -d- 41861305
> >>
> >> Here, we can see that neither segment #3 (= ns_segnum) nor #4 (=
> >> ns_nextnum) has the dirty flag
> >> ("d" in STAT). So, as expected, this seems to be the root cause of
> >> the duplicate assignments and oops.
> >> The proper correction would be, therefore, to check and reject (or
> >> fix) this anomaly while outputting an error message
> >> (or warning if fix it at the same time).
> >> It should be inserted in mount time logic because the segments that
> >> nilfs2 itself allocates are always marked dirty with
> >> nilfs_sufile_alloc().
> > I will change my opinion.
> >
> > Considering the possibility of sufile corruption at runtime, it seems
> > that the sanity check should be done on every nilfs_sufile_alloc()
> > call.
> >
> > I now think nilfs_sufile_alloc() should call nilfs_error() and return
> > -EIO if the number of a newly found vacant segment matches
> > nilfs->ns_segnum or nilfs->ns_nextnum.

> Before we add the first segbuf into sci->sc_segbufs in nilfs_segctor_begin_construction(),
> there is no possibility existing duplicate segnum. And the subsequent process should
> not be affected by sufile corruption. So I think it's enough to only check for case alloc==0.
> I don't find any other possible wrong scenarios.

I think the problem is not yet fixed.
With your patch, I still think there are scenarios that cause
duplicate segment assignments:
(1) If the segment at nilfs->ns_segnum is in a clean state due to
sufile data corruption, and
if sci->sc_write_logs is empty and nilfs->ns_segnum == nilfs->nextnum
to start writing a segment from the beginning, then
nilfs_sufile_alloc() is called in nilfs_segctor_begin_construction()
and nextnum can be
equal to nilfs->ns_segnum.
This nextnum is stored in segbuf->sb_nextnum, and if segbuf is spliced
with nilfs_segctor_extend_segments(), the segbuf->sb_nextnum will be
used for sb_segnum of the added segbuf, and the list corruption of
sb_segsum_buffers can happen between these two segbufs.

(2) If the segment at nilfs->ns_segnum is in a clean state due to
sufile data corruption, and
if sci->sc_write_logs is not empty and segbuf->sb_rest_blocks <
NILFS_PSEG_MIN_BLOCKS,
then nilfs_sufile_alloc() is called in
nilfs_segctor_begin_construction(), and the retrieved nextnum is set
to
sb_nextnum of the appended segbuf there, and this sb_nextnum can be
equal to nilfs->ns_segnum.
Then, if the segbufs are spliced with nilfs_segctor_extend_segments(),
the last segbuf->sb_nextnum will be used for sb_segnum of the appended
segbuf. Therefore, list corruption of sb_segsum_buffers can happen.

(3) If the segment at nilfs->ns_segnum is in a clean state due to
sufile data corruption, and then the segbuf is spliced with
nilfs_segctor_extend_segments(), nilfs_sufile_alloc() is called, and
the retrieved "nextnextnum" is set to sb_nextnum of the appended
segbuf, but this can be equal to nilfs->ns_segnum.
If one more segbuf is appended to the segbufs in
nilfs_segctor_extend_segments(), list corruption of
sb_segsum_buffers can happen due to the segnum duplication.

And please call nilfs_error() after detecting the metadata anomaly to
output this critical situation and degrade the filesystem
appropriately (read-only for example). This metadata corruption is a
critical situation and writes will continually fail. File system
operations should not continue as if nothing happened.

I would like to ask you to reconsider how to fix it, but will also
comment on the current patch below.

>
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 7be632c15f91..7b91c790b798 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1326,7 +1326,13 @@ static int nilfs_segctor_begin_construction(struct nilfs_sc_info *sci,
> err = nilfs_sufile_alloc(nilfs->ns_sufile, &nextnum);
> if (err)
> goto failed;
> + } else {
> + /* Check the next segment has alreadly been allocated */

Here is a typo: "alreadly" -> "already"

> + err = nilfs_sufile_test_allocated(nilfs->ns_sufile, nextnum);
> + if (err)
> + goto failed;
> }
> +
> nilfs_segbuf_set_next_segnum(segbuf, nextnum, nilfs);
>
> BUG_ON(!list_empty(&sci->sc_segbufs));
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 853a8212114f..8dff12c56891 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -399,6 +399,36 @@ int nilfs_sufile_alloc(struct inode *sufile, __u64 *segnump)
> return ret;
> }
>
> +int nilfs_sufile_test_allocated(struct inode *sufile, __u64 *segnump)

The second argument should be "__u64 segnum". The type is different
from the above caller.

> +{
> + struct buffer_head *su_bh;
> + struct nilfs_segment_usage *su;
> + void *kaddr;
> + int ret;
> +
> + down_write(&NILFS_MDT(sufile)->mi_sem);

Please use down_read()/up_read() since the operation of this function
is not mutative.
See nilfs_sufile_get_stat() for an example.

> +
> + ret = nilfs_sufile_get_segment_usage_block(sufile, segnump, 1,

"segnump" should be "segnum". If you use a pointer, this must be "*segnump".
Either way this is wrong.

> + &su_bh);
> + if (ret < 0)
> + goto out_sem;
> +
> + kaddr = kmap_atomic(su_bh->b_page);
> + su = nilfs_sufile_block_get_segment_usage(
> + sufile, segnump, su_bh, kaddr);

Ditto. "segnump" here should be "segnum".

> +
> + if (nilfs_segment_usage_clean(su))
> + ret = -EIO;
> +
> + kunmap_atomic(kaddr);
> +
> + brelse(su_bh);
> +
> +out_sem:
> + up_write(&NILFS_MDT(sufile)->mi_sem);
> + return ret;
> +}
> +
> void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
> struct buffer_head *header_bh,
> struct buffer_head *su_bh)
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index 8e8a1a5a0402..02b61ca6f318 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -24,6 +24,7 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile);
>
> int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
> int nilfs_sufile_alloc(struct inode *, __u64 *);

> +int nilfs_sufile_test_allocated(struct inode *, __u64 *);

This should be:

int nilfs_sufile_test_allocated(struct inode *sufile, __u64 segnum);

Please include the variable name in the argument for newly added or
modified declarations.

Regards,
Ryusuke Konishi


> int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
> int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
> unsigned long nblocks, time64_t modtime);
>