LinuxLists.cc - [syzbot] Monthly xfs report

2023-03-30 10:00:20

Subject: [syzbot] Monthly xfs report

Hello xfs maintainers/developers,

This is a 30-day syzbot report for the xfs subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/xfs

During the period, 5 new issues were detected and 0 were fixed.
In total, 23 issues are still open and 15 have been fixed so far.

Some of the still happening issues:

Crashes Repro Title
327 Yes INFO: task hung in xlog_grant_head_check
https://syzkaller.appspot.com/bug?extid=568245b88fbaedcb1959
85 Yes KASAN: stack-out-of-bounds Read in xfs_buf_lock
https://syzkaller.appspot.com/bug?extid=0bc698a422b5e4ac988c
81 Yes WARNING in xfs_qm_dqget_cache_insert
https://syzkaller.appspot.com/bug?extid=6ae213503fb12e87934f
47 Yes WARNING in xfs_bmapi_convert_delalloc
https://syzkaller.appspot.com/bug?extid=53b443b5c64221ee8bad
44 Yes INFO: task hung in xfs_buf_item_unpin
https://syzkaller.appspot.com/bug?extid=3f083e9e08b726fcfba2
13 Yes general protection fault in __xfs_free_extent
https://syzkaller.appspot.com/bug?extid=bfbc1eecdfb9b10e5792
5 Yes KASAN: use-after-free Read in xfs_btree_lookup_get_block
https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

2023-04-11 01:48:52

by Dave Chinner

Subject: Re: [syzbot] Monthly xfs report

On Thu, Mar 30, 2023 at 02:58:43AM -0700, syzbot wrote:
> Hello xfs maintainers/developers,
>
> This is a 30-day syzbot report for the xfs subsystem.
> All related reports/information can be found at:
> https://syzkaller.appspot.com/upstream/s/xfs
>
> During the period, 5 new issues were detected and 0 were fixed.
> In total, 23 issues are still open and 15 have been fixed so far.
>
> Some of the still happening issues:
>
> Crashes Repro Title
> 327 Yes INFO: task hung in xlog_grant_head_check
> https://syzkaller.appspot.com/bug?extid=568245b88fbaedcb1959

[ 501.289306][ T5098] XFS (loop0): Mounting V4 Filesystem 5e6273b8-2167-42bb-911b-418aa14a1261
[ 501.299015][ T5098] XFS (loop0): Log size 128 blocks too small, minimum size is 2880 blocks
[ 501.307608][ T5098] XFS (loop0): Log size out of supported range.
[ 501.313866][ T5098] XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.

Syzbot doing something stupid - syzbot needs to stop testing the
deprecated and soon to be unsupported v4 filesystem format.

Invalid.
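For context, the messages above come from a mount-time log size check that, as far as I can tell from the current code, fails the mount on V5 (CRC) filesystems but only warns on V4. A minimal sketch of that decision follows. `toy_check_log_size()` and its parameters are hypothetical stand-ins, not the XFS code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the mount-time log size check.  Returns 0 if the
 * mount may proceed, or -EINVAL if it must be refused.  *warned is set when
 * the mount continues with an out-of-range log (the V4 behaviour above).
 */
static int toy_check_log_size(uint32_t log_blocks, uint32_t min_blocks,
			      bool is_v5, bool *warned)
{
	assert(warned != NULL);
	*warned = false;
	if (log_blocks >= min_blocks)
		return 0;
	if (is_v5)
		return -EINVAL;	/* assumed: too-small log is fatal on V5 */
	*warned = true;		/* V4: "Continuing onwards, but ..." */
	return 0;
}
```

With the values from the log above (128 blocks against a 2880-block minimum), the V4 path returns 0 with the warning flag set, which is the "Continuing onwards" case.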

> 85 Yes KASAN: stack-out-of-bounds Read in xfs_buf_lock
> https://syzkaller.appspot.com/bug?extid=0bc698a422b5e4ac988c

Bisection result is garbage.

Looks like a race between dquot shrinker grabbing a dquot buffer to
write back a dquot and the dquot buffer being reclaimed before it is
submitted from the delwri list. Something is dropping a buffer
reference on the floor...

More investigation needed.
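The reference rule implied above can be modelled in a few lines: if a buffer queued on the delwri list carries a reference owned by that list, a reclaimer dropping its own reference cannot free the buffer before it is submitted. This is a toy model under that assumption. `struct toy_buf` and the `toy_*` helpers are hypothetical and are not `struct xfs_buf` or its API; "freed" is just a flag so the model never touches released memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified buffer: not the real struct xfs_buf. */
struct toy_buf {
	int  refcount;	/* number of holders */
	bool freed;	/* set when the last reference is dropped */
};

static void toy_buf_hold(struct toy_buf *bp)
{
	assert(!bp->freed);
	bp->refcount++;
}

/* Drop a reference; "reclaim" the buffer when the count reaches zero. */
static void toy_buf_rele(struct toy_buf *bp)
{
	assert(bp->refcount > 0);
	if (--bp->refcount == 0)
		bp->freed = true;
}

/* Queueing takes a reference owned by the delwri list. */
static void toy_delwri_queue(struct toy_buf *bp)
{
	toy_buf_hold(bp);
}

/* Submission consumes the list's reference; reports if bp was still live. */
static bool toy_delwri_submit(struct toy_buf *bp)
{
	bool was_live = !bp->freed;

	toy_buf_rele(bp);
	return was_live;
}
```

If any path drops the list's reference without removing the buffer from the list (the "dropping a buffer reference on the floor" case), the count reaches zero while the buffer is still queued, which is the shape of the race described above.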

> 81 Yes WARNING in xfs_qm_dqget_cache_insert
> https://syzkaller.appspot.com/bug?extid=6ae213503fb12e87934f

That'll be an ENOMEM warning on radix tree insert.

No big deal, the code cleans up and retries the lookup/insert
process cleanly. Could just remove the warning.

Low priority, low severity.
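The "clean up and retry" pattern mentioned above looks roughly like this: a transient -ENOMEM from the insert sends the caller back to the lookup, so no warning is needed. A minimal sketch with a hypothetical fixed-size cache standing in for the dquot radix tree; none of the `toy_*` names are XFS or radix tree APIs.

```c
#include <assert.h>
#include <errno.h>

#define TOY_CACHE_SLOTS 16

/* Hypothetical cache standing in for the dquot radix tree. */
struct toy_cache {
	int ids[TOY_CACHE_SLOTS];
	int nr;
	int fail_next;	/* number of inserts to fail with -ENOMEM */
};

static int toy_cache_lookup(const struct toy_cache *c, int id)
{
	for (int i = 0; i < c->nr; i++)
		if (c->ids[i] == id)
			return 0;
	return -ENOENT;
}

static int toy_cache_insert(struct toy_cache *c, int id)
{
	if (c->fail_next > 0) {
		c->fail_next--;
		return -ENOMEM;	/* transient allocation failure */
	}
	if (c->nr == TOY_CACHE_SLOTS)
		return -ENOSPC;
	c->ids[c->nr++] = id;
	return 0;
}

/*
 * Lookup-or-insert.  -ENOMEM is treated as transient: nothing to undo in
 * this model, so just go back to the lookup.  Returns the number of
 * attempts taken, or a negative errno for a hard failure.
 */
static int toy_get(struct toy_cache *c, int id)
{
	for (int attempt = 1; ; attempt++) {
		if (toy_cache_lookup(c, id) == 0)
			return attempt;
		int error = toy_cache_insert(c, id);
		if (error == 0)
			return attempt;
		if (error != -ENOMEM)
			return error;
		assert(error == -ENOMEM);
	}
}
```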

> 47 Yes WARNING in xfs_bmapi_convert_delalloc
> https://syzkaller.appspot.com/bug?extid=53b443b5c64221ee8bad

Unexpected ENOSPC because syzbot has created an inconsistency between
superblock counters and the free space btrees. The warning is expected,
as it indicates user data loss is going to occur. It doesn't happen in
typical production operation and generally requires malicious
corruption of the filesystem to trigger.

Not a bug, won't fix.
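Why the ENOSPC shows up late: delayed allocation reserves space against the superblock counter at write time, and only consults the free space btrees when the extent is converted at writeback. If the two disagree, the reservation succeeds and the conversion fails. A toy model of that split; `struct toy_fs` and both functions are hypothetical, and real XFS accounting (per-cpu counters, reserve pools) is more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified free space accounting. */
struct toy_fs {
	uint64_t sb_fdblocks;	/* free blocks per the superblock counter */
	uint64_t btree_free;	/* free blocks actually in the free space btrees */
};

/* write(): delayed allocation only reserves against the counter. */
static int toy_delalloc_reserve(struct toy_fs *fs, uint64_t len)
{
	if (len > fs->sb_fdblocks)
		return -ENOSPC;
	fs->sb_fdblocks -= len;
	return 0;
}

/* Writeback: converting the delalloc extent allocates from the btrees. */
static int toy_convert_delalloc(struct toy_fs *fs, uint64_t len)
{
	assert(len > 0);
	if (len > fs->btree_free)
		return -ENOSPC;	/* reservation said yes, btrees say no */
	fs->btree_free -= len;
	return 0;
}
```

With consistent counters both steps succeed; with a counter that over-reports free space, the write is accepted and the data cannot be placed at writeback, which is the data loss the warning flags.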

> 44 Yes INFO: task hung in xfs_buf_item_unpin
> https://syzkaller.appspot.com/bug?extid=3f083e9e08b726fcfba2

Yup, that's a deadlock on the superblock buffer.

xfs_sync_sb_buf() is called from an ioctl of some kind, gets stuck
in the log force waiting for iclogs to complete. xfs_sync_sb_buf()
holds the buffer across the transaction commit, so the sb buffer is
locked while waiting for the log force.

At just the wrong time, the filesystem gets shut down:

[ 484.946965][ T5959] syz-executor360: attempt to access beyond end of device
[ 484.946965][ T5959] loop0: rw=432129, sector=65536, nr_sectors = 64 limit=65536
[ 484.950756][ T52] XFS (loop0): log I/O error -5
[ 484.952017][ T52] XFS (loop0): Filesystem has been shut down due to log error (0x2).
[ 484.953902][ T52] XFS (loop0): Please unmount the filesystem and rectify the problem(s).
[ 714.735393][ T28] INFO: task kworker/1:1H:52 blocked for more than 143 seconds.

And the iclog IO completion tries to unpin and abort all the log
items in the current checkpoint. One of those is the superblock
buffer, and because this is an abort:

[ 714.754433][ T28] xfs_buf_lock+0x264/0xa68
[ 714.755623][ T28] xfs_buf_item_unpin+0x2c4/0xc18
[ 714.756875][ T28] xfs_trans_committed_bulk+0x2d8/0x73c
[ 714.758236][ T28] xlog_cil_committed+0x210/0xef8

The unpin code tries to lock the buffer to pass it through to IO
completion to mark it as failed.

It's a real deadlock, and I think it could occur on any synchronous
transaction commit that holds a buffer locked across it. No
immediate fix comes to mind right now. It can only occur on a
shutdown triggered by a journal IO error, so it's not something that
happens typically in production systems.

Low priority, medium severity.
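The hang is a two-party cycle in a wait-for graph: the committing task holds the superblock buffer lock and waits for the log force, while the aborting iclog completion waits for that same buffer lock. A minimal cycle check over hypothetical actors illustrates it; the enum and `has_deadlock()` are illustrative only and do not correspond to any kernel lock-debugging API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors from the hang above, not real XFS objects. */
enum actor {
	COMMITTER,		/* holds the sb buffer lock, waits on the log force */
	ICLOG_COMPLETION,	/* aborting checkpoint, wants the sb buffer lock */
	NR_ACTORS
};

/*
 * waits_for[a] is the actor that a is blocked on, or -1 if a can run.
 * Each actor waits on at most one other, so following edges from each
 * start node for at most NR_ACTORS steps is enough to find any cycle.
 */
static bool has_deadlock(const int waits_for[NR_ACTORS])
{
	for (int start = 0; start < NR_ACTORS; start++) {
		int cur = start;

		for (int steps = 0; steps < NR_ACTORS; steps++) {
			cur = waits_for[cur];
			if (cur < 0)
				break;		/* chain ends at a runnable actor */
			assert(cur < NR_ACTORS);
			if (cur == start)
				return true;	/* back where we began: a cycle */
		}
	}
	return false;
}
```

Removing either edge, for example by not holding the buffer lock while waiting for the log force, leaves a chain that ends at a runnable actor instead of a cycle.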

> 13 Yes general protection fault in __xfs_free_extent
> https://syzkaller.appspot.com/bug?extid=bfbc1eecdfb9b10e5792

Growfs issue. Looks like a NULL pag, which means the fsbno passed
to __xfs_free_extent() is invalid. Without looking further, this
looks like it's a corrupt AGF length or superblock size and this has
resulted in the calculated fsbno starting beyond the end of the last
AG that we are about to grow. That means the agno is beyond EOFS,
xfs_perag_get(agno) ends up NULL, and __xfs_free_extent() goes
splat. Likely requires corruption to trigger.

Low priority, low severity.
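The failure chain above (garbage fsbno, AG number past the last AG, NULL perag, dereference) can be sketched with a simplified geometry in which the AG number is the high bits of the fsbno, shifted by the log2 AG size. All `toy_*` names and the error constant are hypothetical; the point is only where a NULL check would turn the splat into a corruption error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of EUCLEAN on Linux, which XFS uses for EFSCORRUPTED. */
#define TOY_EFSCORRUPTED 117

/* Hypothetical, simplified geometry; not the real struct xfs_mount. */
struct toy_geo {
	uint32_t agcount;	/* number of allocation groups */
	uint8_t  agblklog;	/* log2 of AG size in blocks, rounded up */
};

struct toy_perag {
	uint32_t agno;
};

/* The AG number lives in the high bits of a filesystem block number. */
static uint32_t toy_fsb_to_agno(const struct toy_geo *g, uint64_t fsbno)
{
	return (uint32_t)(fsbno >> g->agblklog);
}

/* Perag-style lookup: NULL for an AG that does not exist. */
static struct toy_perag *toy_perag_get(struct toy_perag *pags,
				       const struct toy_geo *g, uint32_t agno)
{
	if (agno >= g->agcount)
		return NULL;
	return &pags[agno];
}

/* Free-extent entry point that checks for NULL instead of going splat. */
static int toy_free_extent(struct toy_perag *pags, const struct toy_geo *g,
			   uint64_t fsbno)
{
	uint32_t agno = toy_fsb_to_agno(g, fsbno);
	struct toy_perag *pag = toy_perag_get(pags, g, agno);

	if (pag == NULL)
		return -TOY_EFSCORRUPTED;
	assert(pag->agno == agno);
	return 0;
}
```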

> 5 Yes KASAN: use-after-free Read in xfs_btree_lookup_get_block
> https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e

Recovery of reflink COW extents; we have a corrupted journal:

[ 52.495566][ T5067] XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
[ 52.599681][ T5067] XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
[ 52.636680][ T5067] XFS (loop0): Starting recovery (logdev: internal)

And then it looks to have a UAF on the refcountbt cursor that is
first initialised in xfs_refcount_recover_cow_leftovers(). Likely
tripping over a corrupted refcount btree of some kind. Probably one
for Darrick to look into.

Low priority, low severity.

-Dave.
--
Dave Chinner
[email protected]

2023-04-11 04:42:35

by Darrick J. Wong

Subject: Re: [syzbot] Monthly xfs report

On Tue, Apr 11, 2023 at 11:35:12AM +1000, Dave Chinner wrote:
> On Thu, Mar 30, 2023 at 02:58:43AM -0700, syzbot wrote:
> > Hello xfs maintainers/developers,
> >
> > This is a 30-day syzbot report for the xfs subsystem.
> > All related reports/information can be found at:
> > https://syzkaller.appspot.com/upstream/s/xfs
> >
> > During the period, 5 new issues were detected and 0 were fixed.
> > In total, 23 issues are still open and 15 have been fixed so far.
> >
> > Some of the still happening issues:
> >
> > Crashes Repro Title
> > 327 Yes INFO: task hung in xlog_grant_head_check
> > https://syzkaller.appspot.com/bug?extid=568245b88fbaedcb1959
>
> [ 501.289306][ T5098] XFS (loop0): Mounting V4 Filesystem 5e6273b8-2167-42bb-911b-418aa14a1261
> [ 501.299015][ T5098] XFS (loop0): Log size 128 blocks too small, minimum size is 2880 blocks
> [ 501.307608][ T5098] XFS (loop0): Log size out of supported range.
> [ 501.313866][ T5098] XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
>
> Syzbot doing something stupid - syzbot needs to stop testing the
> deprecated and soon to be unsupported v4 filesystem format.
>
> Invalid.
>
> > 85 Yes KASAN: stack-out-of-bounds Read in xfs_buf_lock
> > https://syzkaller.appspot.com/bug?extid=0bc698a422b5e4ac988c
>
> Bisection result is garbage.
>
> Looks like a race between dquot shrinker grabbing a dquot buffer to
> write back a dquot and the dquot buffer being reclaimed before it is
> submitted from the delwri list. Something is dropping a buffer
> reference on the floor...
>
> More investigation needed.
>
> > 81 Yes WARNING in xfs_qm_dqget_cache_insert
> > https://syzkaller.appspot.com/bug?extid=6ae213503fb12e87934f
>
> That'll be an ENOMEM warning on radix tree insert.
>
> No big deal, the code cleans up and retries the lookup/insert
> process cleanly. Could just remove the warning.
>
> Low priority, low severity.
>
> > 47 Yes WARNING in xfs_bmapi_convert_delalloc
> > https://syzkaller.appspot.com/bug?extid=53b443b5c64221ee8bad
>
> Unexpected ENOSPC because syzbot has created an inconsistency between
> superblock counters and the free space btrees. The warning is expected,
> as it indicates user data loss is going to occur. It doesn't happen in
> typical production operation and generally requires malicious
> corruption of the filesystem to trigger.
>
> Not a bug, won't fix.
>
> > 44 Yes INFO: task hung in xfs_buf_item_unpin
> > https://syzkaller.appspot.com/bug?extid=3f083e9e08b726fcfba2
>
> Yup, that's a deadlock on the superblock buffer.
>
> xfs_sync_sb_buf() is called from an ioctl of some kind, gets stuck
> in the log force waiting for iclogs to complete. xfs_sync_sb_buf()
> holds the buffer across the transaction commit, so the sb buffer is
> locked while waiting for the log force.
>
> At just the wrong time, the filesystem gets shut down:
>
> [ 484.946965][ T5959] syz-executor360: attempt to access beyond end of device
> [ 484.946965][ T5959] loop0: rw=432129, sector=65536, nr_sectors = 64 limit=65536
> [ 484.950756][ T52] XFS (loop0): log I/O error -5
> [ 484.952017][ T52] XFS (loop0): Filesystem has been shut down due to log error (0x2).
> [ 484.953902][ T52] XFS (loop0): Please unmount the filesystem and rectify the problem(s).
> [ 714.735393][ T28] INFO: task kworker/1:1H:52 blocked for more than 143 seconds.
>
> And the iclog IO completion tries to unpin and abort all the log
> items in the current checkpoint. One of those is the superblock
> buffer, and because this is an abort:
>
> [ 714.754433][ T28] xfs_buf_lock+0x264/0xa68
> [ 714.755623][ T28] xfs_buf_item_unpin+0x2c4/0xc18
> [ 714.756875][ T28] xfs_trans_committed_bulk+0x2d8/0x73c
> [ 714.758236][ T28] xlog_cil_committed+0x210/0xef8
>
> The unpin code tries to lock the buffer to pass it through to IO
> completion to mark it as failed.
>
> It's a real deadlock, and I think it could occur on any synchronous
> transaction commit that holds a buffer locked across it. No
> immediate fix comes to mind right now. It can only occur on a
> shutdown triggered by a journal IO error, so it's not something that
> happens typically in production systems.

Force log, then xfs_ail_push_all_sync()?

It's SETLABEL, who cares how slow it is?

> Low priority, medium severity.
>
>
> > 13 Yes general protection fault in __xfs_free_extent
> > https://syzkaller.appspot.com/bug?extid=bfbc1eecdfb9b10e5792
>
> Growfs issue. Looks like a NULL pag, which means the fsbno passed
> to __xfs_free_extent() is invalid. Without looking further, this
> looks like it's a corrupt AGF length or superblock size and this has
> resulted in the calculated fsbno starting beyond the end of the last
> AG that we are about to grow. That means the agno is beyond EOFS,
> xfs_perag_get(agno) ends up NULL, and __xfs_free_extent() goes
> splat. Likely requires corruption to trigger.
>
> Low priority, low severity.

I've been wondering for quite a while if the code that creates those
defer items ought to be shutting down the fs if they can't get a perag
to stuff in the intent. xfs_perag_intent_get seems like a reasonable
place to shut down the fs with a corruption warning if someone feeds in
a totally garbage fsblock range.
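Darrick's suggestion amounts to validating the extent before building the intent and shutting the filesystem down when it is garbage. A minimal sketch of such a check, assuming a simplified mount where the AG number and AG block number are split from the fsbno by a shift; `struct toy_mount` and `toy_intent_extent_ok()` are hypothetical and are not `xfs_perag_intent_get()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified mount; not the real struct xfs_mount. */
struct toy_mount {
	uint32_t agcount;
	uint32_t agblocks;	/* blocks per AG */
	uint8_t  agblklog;	/* log2(agblocks), rounded up */
	bool     shut_down;
};

/*
 * Return true if [fsbno, fsbno + len) lies inside one existing AG.
 * Otherwise mark the filesystem shut down and return false.
 */
static bool toy_intent_extent_ok(struct toy_mount *mp, uint64_t fsbno,
				 uint32_t len)
{
	assert(mp->agblklog < 64);

	uint64_t agno = fsbno >> mp->agblklog;
	uint64_t agbno = fsbno & ((1ULL << mp->agblklog) - 1);

	if (len == 0 || agno >= mp->agcount ||
	    agbno >= mp->agblocks || agbno + len > mp->agblocks) {
		mp->shut_down = true;	/* corruption: stop before the intent */
		return false;
	}
	return true;
}
```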

> > 5 Yes KASAN: use-after-free Read in xfs_btree_lookup_get_block
> > https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
>
> Recovery of reflink COW extents; we have a corrupted journal:
>
> [ 52.495566][ T5067] XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
> [ 52.599681][ T5067] XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
> [ 52.636680][ T5067] XFS (loop0): Starting recovery (logdev: internal)
>
> And then it looks to have a UAF on the refcountbt cursor that is
> first initialised in xfs_refcount_recover_cow_leftovers(). Likely
> tripping over a corrupted refcount btree of some kind. Probably one
> for Darrick to look into.

Somehow the bogus refcount level field in the AGF is getting past the
verifiers. I'll look into this later.
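For reference, the kind of check a verifier applies to a btree height field is a simple range test against the maximum height the geometry allows. A sketch with hypothetical names; `struct toy_agf` is not the on-disk `struct xfs_agf`, and the `refc_maxlevels` parameter stands in for whatever maximum the mount computes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of AGF fields; not the on-disk struct xfs_agf. */
struct toy_agf {
	uint32_t refcount_level;	/* height of the refcount btree */
};

/*
 * Verifier-style check: a btree height must be at least 1 and no larger
 * than the maximum height computed from the filesystem geometry.
 */
static bool toy_agf_refcount_level_ok(const struct toy_agf *agf,
				      uint32_t refc_maxlevels)
{
	assert(refc_maxlevels >= 1);
	return agf->refcount_level >= 1 &&
	       agf->refcount_level <= refc_maxlevels;
}
```

A bogus level that passes such a check can still be produced later, for example by log recovery replaying a corrupted AGF update, which would be consistent with the corruption getting past the verifiers as described.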

--D

> Low priority, low severity.
>
> -Dave.
> --
> Dave Chinner
> [email protected]

2023-04-12 22:05:37

by Dave Chinner

Subject: Re: [syzbot] Monthly xfs report

On Mon, Apr 10, 2023 at 09:35:17PM -0700, Darrick J. Wong wrote:
> On Tue, Apr 11, 2023 at 11:35:12AM +1000, Dave Chinner wrote:
> > On Thu, Mar 30, 2023 at 02:58:43AM -0700, syzbot wrote:
> >
> > > 13 Yes general protection fault in __xfs_free_extent
> > > https://syzkaller.appspot.com/bug?extid=bfbc1eecdfb9b10e5792
> >
> > Growfs issue. Looks like a NULL pag, which means the fsbno passed
> > to __xfs_free_extent() is invalid. Without looking further, this
> > looks like it's a corrupt AGF length or superblock size and this has
> > resulted in the calculated fsbno starting beyond the end of the last
> > AG that we are about to grow. That means the agno is beyond EOFS,
> > xfs_perag_get(agno) ends up NULL, and __xfs_free_extent() goes
> > splat. Likely requires corruption to trigger.
> >
> > Low priority, low severity.
>
> I've been wondering for quite a while if the code that creates those
> defer items ought to be shutting down the fs if they can't get a perag
> to stuff in the intent. xfs_perag_intent_get seems like a reasonable
> place to shut down the fs with a corruption warning if someone feeds in
> a totally garbage fsblock range.

You know, I think this might be the same as the case below, where
a bogus AGF field is getting past the verifiers in recovery...

>
> > > 5 Yes KASAN: use-after-free Read in xfs_btree_lookup_get_block
> > > https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
> >
> > Recovery of reflink COW extents; we have a corrupted journal:
> >
> > [ 52.495566][ T5067] XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
> > [ 52.599681][ T5067] XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
> > [ 52.636680][ T5067] XFS (loop0): Starting recovery (logdev: internal)
> >
> > And then it looks to have a UAF on the refcountbt cursor that is
> > first initialised in xfs_refcount_recover_cow_leftovers(). Likely
> > tripping over a corrupted refcount btree of some kind. Probably one
> > for Darrick to look into.
>
> Somehow the bogus refcount level field in the AGF is getting past the
> verifiers. I'll look into this later.

... because like this one, it seems to require corruption getting
deep into the modification operation without being detected.

As for shutdown when a perag cannot be obtained by defer items, I'm
hoping that the perag get operations slowly disappear from those as
we move the perag references higher up the hierarchy. The
perag should not go away in the middle of a defer chain, so I don't
think we should ever get a NULL from a lookup except in the case of
buggy code...

Cheers,

Dave.
--
Dave Chinner
[email protected]