2020-08-27 02:48:57

by syzbot

[permalink] [raw]
Subject: kernel BUG at fs/ext4/inode.c:LINE!

Hello,

syzbot found the following issue on:

HEAD commit: c3d8f220 Merge tag 'kbuild-fixes-v5.9' of git://git.kernel..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15f83cb6900000
kernel config: https://syzkaller.appspot.com/x/.config?x=bb68b9e8a8cc842f
dashboard link: https://syzkaller.appspot.com/bug?extid=3622cea378100f45d59f
compiler: clang version 10.0.0 (https://github.com/llvm/llvm-project/ c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1287ac96900000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11c7ac46900000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

------------[ cut here ]------------
kernel BUG at fs/ext4/inode.c:2598!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 27612 Comm: syz-executor879 Not tainted 5.9.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:mpage_prepare_extent_to_map+0xd34/0xd40 fs/ext4/inode.c:2598
Code: 89 e8 a0 74 9f ff 0f 0b e8 59 81 70 ff 4c 89 e7 48 c7 c6 da a3 07 89 e8 8a 74 9f ff 0f 0b e8 43 81 70 ff 0f 0b e8 3c 81 70 ff <0f> 0b e8 c5 b3 25 06 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53
RSP: 0018:ffffc9000a0974d8 EFLAGS: 00010293
RAX: ffffffff820476e4 RBX: 00fffe000000a01f RCX: ffff88808ea5c340
RDX: 0000000000000000 RSI: 000000000000a01f RDI: 000000000000ffff
RBP: ffffea00020d4f80 R08: ffffffff820471e4 R09: fffff9400041a9f1
R10: fffff9400041a9f1 R11: 0000000000000000 R12: ffffea00020d4f80
R13: ffffc9000a0977b0 R14: 1ffffd400041a9f1 R15: dffffc0000000000
FS: 00007f5055f25700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004ae4b0 CR3: 00000000a2d7f000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
ext4_writepages+0xa98/0x3750 fs/ext4/inode.c:2735
do_writepages+0xda/0x1f0 mm/page-writeback.c:2352
__filemap_fdatawrite_range+0x2a5/0x350 mm/filemap.c:422
filemap_write_and_wait_range+0xca/0x160 mm/filemap.c:655
iomap_dio_rw+0x5a7/0xeb0 fs/iomap/direct-io.c:478
ext4_dio_read_iter fs/ext4/file.c:77 [inline]
ext4_file_read_iter+0x544/0x6d0 fs/ext4/file.c:129
call_read_iter include/linux/fs.h:1876 [inline]
generic_file_splice_read+0x3c5/0x640 fs/splice.c:312
do_splice_to fs/splice.c:870 [inline]
splice_direct_to_actor+0x3bd/0xb60 fs/splice.c:950
do_splice_direct+0x201/0x340 fs/splice.c:1059
do_sendfile+0x86d/0x1210 fs/read_write.c:1540
__do_sys_sendfile64 fs/read_write.c:1601 [inline]
__se_sys_sendfile64 fs/read_write.c:1587 [inline]
__x64_sys_sendfile64+0x164/0x1a0 fs/read_write.c:1587
do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x448bc9
Code: e8 1c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b 06 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f5055f24ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00000000006e4a28 RCX: 0000000000448bc9
RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000004
RBP: 00000000006e4a20 R08: 0000000000000000 R09: 0000000000000000
R10: 00008400fffffffb R11: 0000000000000246 R12: 00000000006e4a2c
R13: 00007ffdc73a76df R14: 00007f5055f259c0 R15: 20c49ba5e353f7cf
Modules linked in:
---[ end trace aba5fca59eda2183 ]---
RIP: 0010:mpage_prepare_extent_to_map+0xd34/0xd40 fs/ext4/inode.c:2598
Code: 89 e8 a0 74 9f ff 0f 0b e8 59 81 70 ff 4c 89 e7 48 c7 c6 da a3 07 89 e8 8a 74 9f ff 0f 0b e8 43 81 70 ff 0f 0b e8 3c 81 70 ff <0f> 0b e8 c5 b3 25 06 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53
RSP: 0018:ffffc9000a0974d8 EFLAGS: 00010293
RAX: ffffffff820476e4 RBX: 00fffe000000a01f RCX: ffff88808ea5c340
RDX: 0000000000000000 RSI: 000000000000a01f RDI: 000000000000ffff
RBP: ffffea00020d4f80 R08: ffffffff820471e4 R09: fffff9400041a9f1
R10: fffff9400041a9f1 R11: 0000000000000000 R12: ffffea00020d4f80
R13: ffffc9000a0977b0 R14: 1ffffd400041a9f1 R15: dffffc0000000000
FS: 00007f5055f25700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004ae4b0 CR3: 00000000a2d7f000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches


2020-08-28 10:09:00

by Jan Kara

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Wed 26-08-20 19:48:16, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: c3d8f220 Merge tag 'kbuild-fixes-v5.9' of git://git.kernel..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15f83cb6900000
> kernel config: https://syzkaller.appspot.com/x/.config?x=bb68b9e8a8cc842f
> dashboard link: https://syzkaller.appspot.com/bug?extid=3622cea378100f45d59f
> compiler: clang version 10.0.0 (https://github.com/llvm/llvm-project/ c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1287ac96900000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11c7ac46900000
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: [email protected]
>
> ------------[ cut here ]------------
> kernel BUG at fs/ext4/inode.c:2598!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 27612 Comm: syz-executor879 Not tainted 5.9.0-rc1-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:mpage_prepare_extent_to_map+0xd34/0xd40 fs/ext4/inode.c:2598

Doh, so this is:

wait_on_page_writeback(page);
>>> BUG_ON(PageWriteback(page));

in mpage_prepare_extent_to_map(). So we have PageWriteback() page after we
have called wait_on_page_writeback() on a locked page. Not sure how this
could ever happen even less how ext4 could cause this...

Anyway, syzbot has found a reproducer so I'll try if it reproduces in my
KVM...

Honza


> Code: 89 e8 a0 74 9f ff 0f 0b e8 59 81 70 ff 4c 89 e7 48 c7 c6 da a3 07 89 e8 8a 74 9f ff 0f 0b e8 43 81 70 ff 0f 0b e8 3c 81 70 ff <0f> 0b e8 c5 b3 25 06 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53
> RSP: 0018:ffffc9000a0974d8 EFLAGS: 00010293
> RAX: ffffffff820476e4 RBX: 00fffe000000a01f RCX: ffff88808ea5c340
> RDX: 0000000000000000 RSI: 000000000000a01f RDI: 000000000000ffff
> RBP: ffffea00020d4f80 R08: ffffffff820471e4 R09: fffff9400041a9f1
> R10: fffff9400041a9f1 R11: 0000000000000000 R12: ffffea00020d4f80
> R13: ffffc9000a0977b0 R14: 1ffffd400041a9f1 R15: dffffc0000000000
> FS: 00007f5055f25700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000004ae4b0 CR3: 00000000a2d7f000 CR4: 00000000001506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> ext4_writepages+0xa98/0x3750 fs/ext4/inode.c:2735
> do_writepages+0xda/0x1f0 mm/page-writeback.c:2352
> __filemap_fdatawrite_range+0x2a5/0x350 mm/filemap.c:422
> filemap_write_and_wait_range+0xca/0x160 mm/filemap.c:655
> iomap_dio_rw+0x5a7/0xeb0 fs/iomap/direct-io.c:478
> ext4_dio_read_iter fs/ext4/file.c:77 [inline]
> ext4_file_read_iter+0x544/0x6d0 fs/ext4/file.c:129
> call_read_iter include/linux/fs.h:1876 [inline]
> generic_file_splice_read+0x3c5/0x640 fs/splice.c:312
> do_splice_to fs/splice.c:870 [inline]
> splice_direct_to_actor+0x3bd/0xb60 fs/splice.c:950
> do_splice_direct+0x201/0x340 fs/splice.c:1059
> do_sendfile+0x86d/0x1210 fs/read_write.c:1540
> __do_sys_sendfile64 fs/read_write.c:1601 [inline]
> __se_sys_sendfile64 fs/read_write.c:1587 [inline]
> __x64_sys_sendfile64+0x164/0x1a0 fs/read_write.c:1587
> do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x448bc9
> Code: e8 1c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b 06 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007f5055f24ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
> RAX: ffffffffffffffda RBX: 00000000006e4a28 RCX: 0000000000448bc9
> RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000004
> RBP: 00000000006e4a20 R08: 0000000000000000 R09: 0000000000000000
> R10: 00008400fffffffb R11: 0000000000000246 R12: 00000000006e4a2c
> R13: 00007ffdc73a76df R14: 00007f5055f259c0 R15: 20c49ba5e353f7cf
> Modules linked in:
> ---[ end trace aba5fca59eda2183 ]---
> RIP: 0010:mpage_prepare_extent_to_map+0xd34/0xd40 fs/ext4/inode.c:2598
> Code: 89 e8 a0 74 9f ff 0f 0b e8 59 81 70 ff 4c 89 e7 48 c7 c6 da a3 07 89 e8 8a 74 9f ff 0f 0b e8 43 81 70 ff 0f 0b e8 3c 81 70 ff <0f> 0b e8 c5 b3 25 06 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53
> RSP: 0018:ffffc9000a0974d8 EFLAGS: 00010293
> RAX: ffffffff820476e4 RBX: 00fffe000000a01f RCX: ffff88808ea5c340
> RDX: 0000000000000000 RSI: 000000000000a01f RDI: 000000000000ffff
> RBP: ffffea00020d4f80 R08: ffffffff820471e4 R09: fffff9400041a9f1
> R10: fffff9400041a9f1 R11: 0000000000000000 R12: ffffea00020d4f80
> R13: ffffc9000a0977b0 R14: 1ffffd400041a9f1 R15: dffffc0000000000
> FS: 00007f5055f25700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000004ae4b0 CR3: 00000000a2d7f000 CR4: 00000000001506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at [email protected].
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-08-31 10:04:28

by Jan Kara

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Fri 28-08-20 12:07:55, Jan Kara wrote:
> On Wed 26-08-20 19:48:16, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: c3d8f220 Merge tag 'kbuild-fixes-v5.9' of git://git.kernel..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=15f83cb6900000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=bb68b9e8a8cc842f
> > dashboard link: https://syzkaller.appspot.com/bug?extid=3622cea378100f45d59f
> > compiler: clang version 10.0.0 (https://github.com/llvm/llvm-project/ c2443155a0fb245c8f17f2c1c72b6ea391e86e81)
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1287ac96900000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11c7ac46900000
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: [email protected]
> >
> > ------------[ cut here ]------------
> > kernel BUG at fs/ext4/inode.c:2598!
> > invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> > CPU: 1 PID: 27612 Comm: syz-executor879 Not tainted 5.9.0-rc1-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > RIP: 0010:mpage_prepare_extent_to_map+0xd34/0xd40 fs/ext4/inode.c:2598
>
> Doh, so this is:
>
> wait_on_page_writeback(page);
> >>> BUG_ON(PageWriteback(page));
>
> in mpage_prepare_extent_to_map(). So we have PageWriteback() page after we
> have called wait_on_page_writeback() on a locked page. Not sure how this
> could ever happen even less how ext4 could cause this...

I was poking a bit into this and there were actually recent changes into
page bit waiting logic by Linus. Linus, any idea?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-08-31 18:24:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Mon, Aug 31, 2020 at 3:03 AM Jan Kara <[email protected]> wrote:
>
> On Fri 28-08-20 12:07:55, Jan Kara wrote:
> >
> > Doh, so this is:
> >
> > wait_on_page_writeback(page);
> > >>> BUG_ON(PageWriteback(page));
> >
> > in mpage_prepare_extent_to_map(). So we have PageWriteback() page after we
> > have called wait_on_page_writeback() on a locked page. Not sure how this
> > could ever happen even less how ext4 could cause this...
>
> I was poking a bit into this and there were actually recent changes into
> page bit waiting logic by Linus. Linus, any idea?

So the main change is that now if somebody does a wake_up_page(), the
page waiter will be released - even if somebody else then set the bit
again (or possible if the waker never cleared it!).

It used to be that the waiter went back to sleep.

Which really shouldn't matter, but if we had any code that did something like

end_page_writeback();
.. something does set_page_writeback() on the page again ..

then the old BUG_ON() would likely never have triggered (because the
waiter would have seen the writeback bit being set again and gone back
to sleep), but now it will.

So I would suspect a pre-existing issue that was just hidden by the
old behavior and was basically impossible to trigger unless you hit
*just* the right timing.

And now it's easy to trigger, because the first time somebody clears
PG_writeback, the wait_on_page_writeback() will just return *without*
re-testing and *without* going back to sleep.

Could there be somebody who does set_page_writeback() without holding
the page lock?

Maybe adding a

WARN_ON_ONCE(!PageLocked(page));

at the top of __test_set_page_writeback() might find something?

Note that it looks like this problem has been reported on Android
before according to that syzbot thing. Ie, this thing:

https://groups.google.com/g/syzkaller-android-bugs/c/2CfEdQd4EE0/m/xk_GRJEHBQAJ

looks very similar, and predates the wake_up_page() changes.

So it was probably just much _harder_ to hit before, and got easier to hit.

Hmm. In fact, googling for

mpage_prepare_extent_to_map "kernel BUG"

seems to find stuff going back years. Here's a patchwork discussion
where you had a debug patch to try to figure it out back in 2016:

https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/

although that one seems to be a different BUG_ON() in the same area.

Maybe entirely unrelated, but the fact that this function shows up a
fair amount is perhaps a sign of some long-running issue..

Linus

2020-11-24 04:10:33

by Hugh Dickins

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Mon, 30 Aug 2020, Linus Torvalds wrote:
> On Mon, Aug 31, 2020 at 3:03 AM Jan Kara <[email protected]> wrote:
> >
> > On Fri 28-08-20 12:07:55, Jan Kara wrote:
> > >
> > > Doh, so this is:
> > >
> > > wait_on_page_writeback(page);
> > > >>> BUG_ON(PageWriteback(page));
> > >
> > > in mpage_prepare_extent_to_map(). So we have PageWriteback() page after we
> > > have called wait_on_page_writeback() on a locked page. Not sure how this
> > > could ever happen even less how ext4 could cause this...
> >
> > I was poking a bit into this and there were actually recent changes into
> > page bit waiting logic by Linus. Linus, any idea?
>
> So the main change is that now if somebody does a wake_up_page(), the
> page waiter will be released - even if somebody else then set the bit
> again (or possible if the waker never cleared it!).
>
> It used to be that the waiter went back to sleep.
>
> Which really shouldn't matter, but if we had any code that did something like
>
> end_page_writeback();
> .. something does set_page_writeback() on the page again ..
>
> then the old BUG_ON() would likely never have triggered (because the
> waiter would have seen the writeback bit being set again and gone back
> to sleep), but now it will.
>
> So I would suspect a pre-existing issue that was just hidden by the
> old behavior and was basically impossible to trigger unless you hit
> *just* the right timing.
>
> And now it's easy to trigger, because the first time somebody clears
> PG_writeback, the wait_on_page_writeback() will just return *without*
> re-testing and *without* going back to sleep.
>
> Could there be somebody who does set_page_writeback() without holding
> the page lock?
>
> Maybe adding a
>
> WARN_ON_ONCE(!PageLocked(page));
>
> at the top of __test_set_page_writeback() might find something?
>
> Note that it looks like this problem has been reported on Android
> before according to that syzbot thing. Ie, this thing:
>
> https://groups.google.com/g/syzkaller-android-bugs/c/2CfEdQd4EE0/m/xk_GRJEHBQAJ
>
> looks very similar, and predates the wake_up_page() changes.
>
> So it was probably just much _harder_ to hit before, and got easier to hit.
>
> Hmm. In fact, googling for
>
> mpage_prepare_extent_to_map "kernel BUG"
>
> seems to find stuff going back years. Here's a patchwork discussion
> where you had a debug patch to try to figure it out back in 2016:
>
> https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/
>
> although that one seems to be a different BUG_ON() in the same area.
>
> Maybe entirely unrelated, but the fact that this function shows up a
> fair amount is perhaps a sign of some long-running issue..

No recent updates here, nor in
https://lore.kernel.org/linux-mm/[email protected]/
but I believe I've found the answer (or an answer) to this issue.

You may not care for this patch, but I haven't thought of a better,
so let me explain in its commit message. And its "Fixes:" tag is unfair
to your patch, sorry: I agree the issue has probably lurked there longer.

[PATCH] mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)

Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.

The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.

https://lore.kernel.org/linux-mm/[email protected]/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

Then on crashing a second time, realized there's a stronger reason against
that approach. If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()? And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon? What
would that look like?

It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).

And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.

I have not thought of a fix that does not add a little overhead.
It would be safe to wake_up_page() after TestClearPageWriteback()
while still holding i_pages lock (since that lock is needed to remove
the page from cache), but the history of long page lock hash chains
cautions against; so the patch below does get_page() when PageWaiters
there, and put_page() after wake_up_page_bit() at the end. And in
any case, we do need the i_pages lock, even though it was skipped
before when !mapping_use_writeback_tags() i.e. swap. Can mapping be
NULL? I don't see how, but allow for that with a WARN_ON_ONCE(): this
patch is no worse than before, but does not fix the issue if !mapping.

The bulk of the patch below is cleanup: it was not helpful to separate
test_clear_page_writeback() from end_page_writeback(), especially with
the latter declaring BUG() on a condition which the former was working
around: combine them into end_page_writeback() in mm/page-writeback.c.

Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared? I think not,
because each waiter does hold a reference on the page: this bug comes
not from real waiters, but from when PageWaiters is a false positive.

Reported-by: [email protected]
Reported-by: Qian Cai <[email protected]>
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # v5.8+
---

include/linux/page-flags.h | 1
include/linux/pagemap.h | 1
kernel/sched/wait_bit.c | 5 -
mm/filemap.c | 35 ------------
mm/page-writeback.c | 96 ++++++++++++++++++++++-------------
5 files changed, 67 insertions(+), 71 deletions(-)

--- 5.10-rc5/include/linux/page-flags.h 2020-10-25 16:45:47.061817039 -0700
+++ linux/include/linux/page-flags.h 2020-11-22 18:31:21.303046924 -0800
@@ -550,7 +550,6 @@ static __always_inline void SetPageUptod

CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)

-int test_clear_page_writeback(struct page *page);
int __test_set_page_writeback(struct page *page, bool keep_write);

#define test_set_page_writeback(page) \
--- 5.10-rc5/include/linux/pagemap.h 2020-11-22 17:43:01.585279333 -0800
+++ linux/include/linux/pagemap.h 2020-11-22 18:31:21.303046924 -0800
@@ -660,6 +660,7 @@ static inline int lock_page_or_retry(str
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern void wake_up_page_bit(struct page *page, int bit_nr);

/*
* Wait for a page to be unlocked.
--- 5.10-rc5/kernel/sched/wait_bit.c 2020-03-29 15:25:41.000000000 -0700
+++ linux/kernel/sched/wait_bit.c 2020-11-22 18:31:21.303046924 -0800
@@ -90,9 +90,8 @@ __wait_on_bit_lock(struct wait_queue_hea
ret = action(&wbq_entry->key, mode);
/*
* See the comment in prepare_to_wait_event().
- * finish_wait() does not necessarily takes wwq_head->lock,
- * but test_and_set_bit() implies mb() which pairs with
- * smp_mb__after_atomic() before wake_up_page().
+ * finish_wait() does not necessarily take
+ * wwq_head->lock, but test_and_set_bit() implies mb().
*/
if (ret)
finish_wait(wq_head, &wbq_entry->wq_entry);
--- 5.10-rc5/mm/filemap.c 2020-11-22 17:43:01.637279974 -0800
+++ linux/mm/filemap.c 2020-11-22 18:31:21.303046924 -0800
@@ -1093,7 +1093,7 @@ static int wake_page_function(wait_queue
return (flags & WQ_FLAG_EXCLUSIVE) != 0;
}

-static void wake_up_page_bit(struct page *page, int bit_nr)
+void wake_up_page_bit(struct page *page, int bit_nr)
{
wait_queue_head_t *q = page_waitqueue(page);
struct wait_page_key key;
@@ -1147,13 +1147,6 @@ static void wake_up_page_bit(struct page
spin_unlock_irqrestore(&q->lock, flags);
}

-static void wake_up_page(struct page *page, int bit)
-{
- if (!PageWaiters(page))
- return;
- wake_up_page_bit(page, bit);
-}
-
/*
* A choice of three behaviors for wait_on_page_bit_common():
*/
@@ -1466,32 +1459,6 @@ void unlock_page(struct page *page)
}
EXPORT_SYMBOL(unlock_page);

-/**
- * end_page_writeback - end writeback against a page
- * @page: the page
- */
-void end_page_writeback(struct page *page)
-{
- /*
- * TestClearPageReclaim could be used here but it is an atomic
- * operation and overkill in this particular case. Failing to
- * shuffle a page marked for immediate reclaim is too mild to
- * justify taking an atomic operation penalty at the end of
- * ever page writeback.
- */
- if (PageReclaim(page)) {
- ClearPageReclaim(page);
- rotate_reclaimable_page(page);
- }
-
- if (!test_clear_page_writeback(page))
- BUG();
-
- smp_mb__after_atomic();
- wake_up_page(page, PG_writeback);
-}
-EXPORT_SYMBOL(end_page_writeback);
-
/*
* After completing I/O on a page, call this routine to update the page
* flags appropriately
--- 5.10-rc5/mm/page-writeback.c 2020-10-25 16:45:47.977843485 -0700
+++ linux/mm/page-writeback.c 2020-11-22 18:31:21.303046924 -0800
@@ -589,7 +589,7 @@ static void wb_domain_writeout_inc(struc

/*
* Increment @wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
+ * completion count. Called from end_page_writeback().
*/
static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
@@ -2719,55 +2719,85 @@ int clear_page_dirty_for_io(struct page
}
EXPORT_SYMBOL(clear_page_dirty_for_io);

-int test_clear_page_writeback(struct page *page)
+/**
+ * end_page_writeback - end writeback against a page
+ * @page: the page
+ */
+void end_page_writeback(struct page *page)
{
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping;
struct mem_cgroup *memcg;
struct lruvec *lruvec;
- int ret;
+ unsigned long flags;
+ int writeback;
+ int waiters;
+
+ /*
+ * TestClearPageReclaim could be used here but it is an atomic
+ * operation and overkill in this particular case. Failing to
+ * shuffle a page marked for immediate reclaim is too mild to
+ * justify taking an atomic operation penalty at the end of
+ * every page writeback.
+ */
+ if (PageReclaim(page)) {
+ ClearPageReclaim(page);
+ rotate_reclaimable_page(page);
+ }

+ mapping = page_mapping(page);
+ WARN_ON_ONCE(!mapping);
memcg = lock_page_memcg(page);
lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+ dec_lruvec_state(lruvec, NR_WRITEBACK);
+ dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
+ inc_node_page_state(page, NR_WRITTEN);
+
+ if (mapping)
+ xa_lock_irqsave(&mapping->i_pages, flags);
+
+ writeback = TestClearPageWriteback(page);
+ /* No need for smp_mb__after_atomic() after TestClear */
+ waiters = PageWaiters(page);
+ if (waiters) {
+ /*
+ * Writeback doesn't hold a page reference on its own, relying
+ * on truncation to wait for the clearing of PG_writeback.
+ * We could safely wake_up_page_bit(page, PG_writeback) here,
+ * while holding i_pages lock: but that would be a poor choice
+ * if the page is on a long hash chain; so instead choose to
+ * get_page+put_page - though atomics will add some overhead.
+ */
+ get_page(page);
+ }
+
if (mapping && mapping_use_writeback_tags(mapping)) {
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
- unsigned long flags;

- xa_lock_irqsave(&mapping->i_pages, flags);
- ret = TestClearPageWriteback(page);
- if (ret) {
- __xa_clear_mark(&mapping->i_pages, page_index(page),
+ __xa_clear_mark(&mapping->i_pages, page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
- struct bdi_writeback *wb = inode_to_wb(inode);
+ if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
+ struct bdi_writeback *wb = inode_to_wb(inode);

- dec_wb_stat(wb, WB_WRITEBACK);
- __wb_writeout_inc(wb);
- }
+ dec_wb_stat(wb, WB_WRITEBACK);
+ __wb_writeout_inc(wb);
}
+ if (inode && !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+ sb_clear_inode_writeback(inode);
+ }

- if (mapping->host && !mapping_tagged(mapping,
- PAGECACHE_TAG_WRITEBACK))
- sb_clear_inode_writeback(mapping->host);
-
+ if (mapping)
xa_unlock_irqrestore(&mapping->i_pages, flags);
- } else {
- ret = TestClearPageWriteback(page);
- }
- /*
- * NOTE: Page might be free now! Writeback doesn't hold a page
- * reference on its own, it relies on truncation to wait for
- * the clearing of PG_writeback. The below can only access
- * page state that is static across allocation cycles.
- */
- if (ret) {
- dec_lruvec_state(lruvec, NR_WRITEBACK);
- dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
- inc_node_page_state(page, NR_WRITTEN);
- }
__unlock_page_memcg(memcg);
- return ret;
+
+ if (waiters) {
+ wake_up_page_bit(page, PG_writeback);
+ put_page(page);
+ }
+ BUG_ON(!writeback);
}
+EXPORT_SYMBOL(end_page_writeback);

int __test_set_page_writeback(struct page *page, bool keep_write)
{

2020-11-24 04:49:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins <[email protected]> wrote:
>
> The problem is that PageWriteback is not accompanied by a page reference
> (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> soon as TestClearPageWriteback has been done, that page could be removed
> from page cache, freed, and reused for something else by the time that
> wake_up_page() is reached.

Ugh.

Would it be possible to instead just make PageWriteback take the ref?

I don't hate your patch per se, but looking at that long explanation,
and looking at the gyrations end_page_writeback() does, I go "why
don't we do that?"

IOW, why couldn't we just make the __test_set_page_writeback()
increment the page count if the writeback flag wasn't already set, and
then make the end_page_writeback() do a put_page() after it all?

Linus

2020-11-24 05:06:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins <[email protected]> wrote:
>
> Then on crashing a second time, realized there's a stronger reason against
> that approach. If my testing just occasionally crashes on that check,
> when the page is reused for part of a compound page, wouldn't it be much
> more common for the page to get reused as an order-0 page before reaching
> wake_up_page()? And on rare occasions, might that reused page already be
> marked PageWriteback by its new user, and already be waited upon? What
> would that look like?
>
> It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> in write_cache_pages() (though I have never seen that crash myself).

So looking more at the patch, I started looking at this part:

> + writeback = TestClearPageWriteback(page);
> + /* No need for smp_mb__after_atomic() after TestClear */
> + waiters = PageWaiters(page);
> + if (waiters) {
> + /*
> + * Writeback doesn't hold a page reference on its own, relying
> + * on truncation to wait for the clearing of PG_writeback.
> + * We could safely wake_up_page_bit(page, PG_writeback) here,
> + * while holding i_pages lock: but that would be a poor choice
> + * if the page is on a long hash chain; so instead choose to
> + * get_page+put_page - though atomics will add some overhead.
> + */
> + get_page(page);
> + }

and thinking more about this, my first reaction was "but that has the
same race, just a smaller window".

And then reading the comment more, I realize you relied on the i_pages
lock, and that this odd ordering was to avoid the possible latency.

But what about the non-mapping case? I'm not sure how that happens,
but this does seem very fragile.

I'm wondering why you didn't want to just do the get_page()
unconditionally and early. Is avoiding the refcount really such a big
optimization?

Linus

2020-11-24 06:37:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Mon, 23 Nov 2020, Linus Torvalds wrote:
> On Mon, Nov 23, 2020 at 8:07 PM Hugh Dickins <[email protected]> wrote:
> >
> > The problem is that PageWriteback is not accompanied by a page reference
> > (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> > soon as TestClearPageWriteback has been done, that page could be removed
> > from page cache, freed, and reused for something else by the time that
> > wake_up_page() is reached.
>
> Ugh.
>
> Would it be possible to instead just make PageWriteback take the ref?
>
> I don't hate your patch per se, but looking at that long explanation,
> and looking at the gyrations end_page_writeback() does, I go "why
> don't we do that?"
>
> IOW, why couldn't we just make the __test_set_page_writeback()
> increment the page count if the writeback flag wasn't already set, and
> then make the end_page_writeback() do a put_page() after it all?

Right, that should be a lot simpler, and will not require any of the
cleanup (much as I liked that). If you're reasonably confident that
adding the extra get_page+put_page to every writeback (instead of
just to the waited case, which I presume significantly less common)
will get lost in the noise - I was not confident of that, nor
confident of devising realistic tests to decide it.

What I did look into before sending, was whether in the filesystems
there was a pattern of doing a put_page() after *set_page_writeback(),
when it would just be a matter of deleting that put_page() and doing
it instead at the end of end_page_writeback(). But no: there were a
few cases like that, but in general no such pattern.

Though, what I think I'll try is not quite what you suggest there,
but instead do both get_page() and put_page() in end_page_writeback().
The reason being, there are a number of places (in mm at least) where
we judge what to do by the expected refcount: places that know to add
1 on when PagePrivate is set (for buffers), but do not expect to add
1 on when PageWriteback is set. Now, all of those places probably
have to have their own wait_on_page_writeback() too, but I'd rather
narrow the window when the refcount is raised, than work through
what if any change would be needed in those places.

> >
> > Then on crashing a second time, realized there's a stronger reason against
> > that approach. If my testing just occasionally crashes on that check,
> > when the page is reused for part of a compound page, wouldn't it be much
> > more common for the page to get reused as an order-0 page before reaching
> > wake_up_page()? And on rare occasions, might that reused page already be
> > marked PageWriteback by its new user, and already be waited upon? What
> > would that look like?
> >
> > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> > in write_cache_pages() (though I have never seen that crash myself).
>
> So looking more at the patch, I started looking at this part:
>
> > + writeback = TestClearPageWriteback(page);
> > + /* No need for smp_mb__after_atomic() after TestClear */
> > + waiters = PageWaiters(page);
> > + if (waiters) {
> > + /*
> > + * Writeback doesn't hold a page reference on its own, relying
> > + * on truncation to wait for the clearing of PG_writeback.
> > + * We could safely wake_up_page_bit(page, PG_writeback) here,
> > + * while holding i_pages lock: but that would be a poor choice
> > + * if the page is on a long hash chain; so instead choose to
> > + * get_page+put_page - though atomics will add some overhead.
> > + */
> > + get_page(page);
> > + }
>
> and thinking more about this, my first reaction was "but that has the
> same race, just a smaller window".
>
> And then reading the comment more, I realize you relied on the i_pages
> lock, and that this odd ordering was to avoid the possible latency.

Yes. I decided to send the get_page+put_page variant, rather than the
wake_up_page_bit while holding i_pages variant (also tested), in part
because it's easier to edit the get_page+put_page one to the other.

>
> But what about the non-mapping case? I'm not sure how that happens,
> but this does seem very fragile.

I don't see how the non-mapping case would ever occur: I think it
probably comes from a general pattern of caution about NULL mapping
when akpm (I think) originally wrote these functions.

>
> I'm wondering why you didn't want to just do the get_page()
> unconditionally and early. Is avoiding the refcount really such a big
> optimization?

I don't know: I trust your judgement more than mine.

Hugh

2020-11-24 12:22:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Mon, Nov 23, 2020 at 08:07:24PM -0800, Hugh Dickins wrote:
> Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
> on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
> end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
> no longer an ext4 page at all.
>
> The problem is that PageWriteback is not accompanied by a page reference
> (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> soon as TestClearPageWriteback has been done, that page could be removed
> from page cache, freed, and reused for something else by the time that
> wake_up_page() is reached.
>
> https://lore.kernel.org/linux-mm/[email protected]/
> Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
> check; but I'm paranoid about even looking at an unreferenced struct page,
> lest its memory might itself have already been reused or hotremoved (and
> wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
>
> Then on crashing a second time, realized there's a stronger reason against
> that approach. If my testing just occasionally crashes on that check,
> when the page is reused for part of a compound page, wouldn't it be much
> more common for the page to get reused as an order-0 page before reaching
> wake_up_page()? And on rare occasions, might that reused page already be
> marked PageWriteback by its new user, and already be waited upon? What
> would that look like?
>
> It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> in write_cache_pages() (though I have never seen that crash myself).

I don't think this is it. write_cache_pages() holds a reference to the
page -- indeed, it holds the page lock! So this particular race cannot
cause the page to get recycled. I still have no good ideas what this
is :-(

2020-11-24 16:30:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, 24 Nov 2020, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 08:07:24PM -0800, Hugh Dickins wrote:
> >
> > Then on crashing a second time, realized there's a stronger reason against
> > that approach. If my testing just occasionally crashes on that check,
> > when the page is reused for part of a compound page, wouldn't it be much
> > more common for the page to get reused as an order-0 page before reaching
> > wake_up_page()? And on rare occasions, might that reused page already be
> > marked PageWriteback by its new user, and already be waited upon? What
> > would that look like?
> >
> > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> > in write_cache_pages() (though I have never seen that crash myself).
>
> I don't think this is it. write_cache_pages() holds a reference to the
> page -- indeed, it holds the page lock! So this particular race cannot
> cause the page to get recycled. I still have no good ideas what this
> is :-(

It is confusing. I tried to explain that in the final paragraph:

> > Was there a chance of missed wakeups before, since a page freed before
> > reaching wake_up_page() would have PageWaiters cleared? I think not,
> > because each waiter does hold a reference on the page: this bug comes
> > not from real waiters, but from when PageWaiters is a false positive.

but got lost in between the original end_page_writeback() and the patched
version when writing that last part - false positive PageWaiters are not
relevant. I'll try rewording that in the simpler version, following.

The BUG_ON(PageWriteback) would occur when the old use of the page, the
one we do TestClearPageWriteback on, had *no* waiters, so no additional
page reference beyond the page cache (and whoever racily frees it). The
reuse of the page definitely has a waiter holding a reference, as you
point out, and PageWriteback still set; but our belated wake_up_page()
has woken it to hit the BUG_ON.

Hugh

2020-11-24 16:48:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Mon, 23 Nov 2020, Hugh Dickins wrote:
> On Mon, 23 Nov 2020, Linus Torvalds wrote:
> >
> > IOW, why couldn't we just make the __test_set_page_writeback()
> > increment the page count if the writeback flag wasn't already set, and
> > then make the end_page_writeback() do a put_page() after it all?
>
> Right, that should be a lot simpler, and will not require any of the
> cleanup (much as I liked that). If you're reasonably confident that
> adding the extra get_page+put_page to every writeback (instead of
> just to the waited case, which I presume significantly less common)
> will get lost in the noise - I was not confident of that, nor
> confident of devising realistic tests to decide it.
>
> What I did look into before sending, was whether in the filesystems
> there was a pattern of doing a put_page() after *set_page_writeback(),
> when it would just be a matter of deleting that put_page() and doing
> it instead at the end of end_page_writeback(). But no: there were a
> few cases like that, but in general no such pattern.
>
> Though, what I think I'll try is not quite what you suggest there,
> but instead do both get_page() and put_page() in end_page_writeback().
> The reason being, there are a number of places (in mm at least) where
> we judge what to do by the expected refcount: places that know to add
> 1 on when PagePrivate is set (for buffers), but do not expect to add
> 1 on when PageWriteback is set. Now, all of those places probably
> have to have their own wait_on_page_writeback() too, but I'd rather
> narrow the window when the refcount is raised, than work through
> what if any change would be needed in those places.

This ran fine overnight on several machines - just to check I hadn't
screwed it up. Vanishingly unlikely to have hit either condition,
nor would I have noticed any difference in performance.

[PATCH] mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)

Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.

The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.

https://lore.kernel.org/linux-mm/[email protected]/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

Then on crashing a second time, realized there's a stronger reason against
that approach. If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()? And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon? What
would that look like?

It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).

And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.

I have not thought of a fix that does not add a little overhead: the
simplest fix is for end_page_writeback() to get_page() before calling
test_clear_page_writeback(), then put_page() after wake_up_page().

Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared? I think not,
because each waiter does hold a reference on the page. This bug comes
when the old use of the page, the one we do TestClearPageWriteback on,
had *no* waiters, so no additional page reference beyond the page cache
(and whoever racily freed it). The reuse of the page has a waiter
holding a reference, and its own PageWriteback set; but the belated
wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

Reported-by: [email protected]
Reported-by: Qian Cai <[email protected]>
Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # v5.8+
---

mm/filemap.c | 8 ++++++++
mm/page-writeback.c | 6 ------
2 files changed, 8 insertions(+), 6 deletions(-)

--- 5.10-rc5/mm/filemap.c 2020-11-22 17:43:01.637279974 -0800
+++ linux/mm/filemap.c 2020-11-23 23:08:20.141851113 -0800
@@ -1484,11 +1484,19 @@ void end_page_writeback(struct page *pag
rotate_reclaimable_page(page);
}

+ /*
+ * Writeback does not hold a page reference of its own, relying
+ * on truncation to wait for the clearing of PG_writeback.
+ * But here we must make sure that the page is not freed and
+ * reused before the wake_up_page().
+ */
+ get_page(page);
if (!test_clear_page_writeback(page))
BUG();

smp_mb__after_atomic();
wake_up_page(page, PG_writeback);
+ put_page(page);
}
EXPORT_SYMBOL(end_page_writeback);

--- 5.10-rc5/mm/page-writeback.c 2020-10-25 16:45:47.977843485 -0700
+++ linux/mm/page-writeback.c 2020-11-23 23:08:20.141851113 -0800
@@ -2754,12 +2754,6 @@ int test_clear_page_writeback(struct pag
} else {
ret = TestClearPageWriteback(page);
}
- /*
- * NOTE: Page might be free now! Writeback doesn't hold a page
- * reference on its own, it relies on truncation to wait for
- * the clearing of PG_writeback. The below can only access
- * page state that is static across allocation cycles.
- */
if (ret) {
dec_lruvec_state(lruvec, NR_WRITEBACK);
dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);

2020-11-24 19:09:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, Nov 24, 2020 at 10:33 AM Matthew Wilcox <[email protected]> wrote:
>
> We could fix this by turning that 'if' into a 'while' in
> write_cache_pages().

That might be the simplest patch indeed.

At the same time, I do worry about other cases like this: while
spurious wakeup events are normal and happen in other places, this is
a bit different.

This is literally a wakeup that leaks from a previous use of a page,
and makes us think that something could have happened to the new use.

The unlock_page() case presumably never hits that, because even if we
have some unlock without a page ref (which I don't think can happen,
but whatever..), the exclusive nature of "lock_page()" means that no
locker can care - once you get the lock, you own the page./

The writeback code is special in that the writeback bit isn't some
kind of exclusive bit, but this code kind of expected it to be that.

So I'd _like_ to have something like

WARN_ON_ONCE(!page_count(page));

in the wake_up_page_bit() function, to catch things that wake up a
page that has already been released and might be reused..

And that would require the "get_page()" to be done when we set the
writeback bit and queue the page up for IO (so that then
end_page_writeback() would clear the bit, do the wakeup, and then drop
the ref).

Hugh's second patch isn't pretty - I think the "get_page()" is
conceptually in the wrong place - but it "works" in that it keeps that
"implicit page reference" being kept by the PG_writeback bit, and then
it takes an explicit page reference before it clears the bit.

So while I don't love the whole "PG_writeback is an implicit reference
to the page" model, Hugh's patch at least makes that model much more
straightforward: we really either have that PG_writeback, _or_ we have
a real ref to the page, and we never have that odd "we could actually
lose the page" situation.

So I think I prefer Hugh's two-liner over your one-liner suggestion.

But your one-liner is technically not just smaller, it obviously also
avoids the whole mucking with the atomic page ref.

I don't _think_ that the extra get/put overhead could possibly really
matter: doing the writeback is going to be a lot more expensive
anyway. And an atomic access to a 'struct page' sounds expensive, but
that cacheline is already likely dirty in the L1 cache because we've
touch page->flags and done other things to it).

So I'd personally be inclined to go with Hugh's patch. Comments?

Linus

2020-11-24 23:56:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, Nov 24, 2020 at 12:16 PM Matthew Wilcox <[email protected]> wrote:
>
> So my s/if/while/ suggestion is wrong and we need to do something to
> prevent spurious wakeups. Unless we bury the spurious wakeup logic
> inside wait_on_page_writeback() ...

We can certainly make the "if()" in that loop be a "while()'.

That's basically what the old code did - simply by virtue of the
wakeup not happening if the writeback bit was set in
wake_page_function():

if (test_bit(key->bit_nr, &key->page->flags))
return -1;

of course, the race was still there - because the writeback bit might
be clear at that point, but another CPU would reallocate and dirty it,
and then autoremove_wake_function() would happen anyway.

But back in the bad old days, the wait_on_page_bit_common() code would
then double-check in a loop, so it would catch that case, re-insert
itself on the wait queue, and try again. Except for the DROP case,
which isn't used by writeback.

Anyway, making that "if()" be a "while()" in wait_on_page_writeback()
would basically re-introduce that old behavior. I don't really care,
because it was the lock bit that really mattered, the writeback bit is
not really all that interesting (except from a "let's fix this bug"
angle)

I'm not 100% sure I like the fragility of this writeback thing.

Anyway, I'm certainly happy with either model, whether it be an added
while() in wait_on_page_writeback(), or it be the page reference count
in end_page_writeback().

Strong opinions?

Linus

2020-11-24 23:56:37

by Matthew Wilcox

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, Nov 24, 2020 at 11:00:42AM -0800, Linus Torvalds wrote:
> On Tue, Nov 24, 2020 at 10:33 AM Matthew Wilcox <[email protected]> wrote:
> >
> > We could fix this by turning that 'if' into a 'while' in
> > write_cache_pages().
>
> That might be the simplest patch indeed.
>
> At the same time, I do worry about other cases like this: while
> spurious wakeup events are normal and happen in other places, this is
> a bit different.
>
> This is literally a wakeup that leaks from a previous use of a page,
> and makes us think that something could have happened to the new use.
>
> The unlock_page() case presumably never hits that, because even if we
> have some unlock without a page ref (which I don't think can happen,
> but whatever..), the exclusive nature of "lock_page()" means that no
> locker can care - once you get the lock, you own the page./
>
> The writeback code is special in that the writeback bit isn't some
> kind of exclusive bit, but this code kind of expected it to be that.
>
> So I'd _like_ to have something like
>
> WARN_ON_ONCE(!page_count(page));
>
> in the wake_up_page_bit() function, to catch things that wake up a
> page that has already been released and might be reused..
>
> And that would require the "get_page()" to be done when we set the
> writeback bit and queue the page up for IO (so that then
> end_page_writeback() would clear the bit, do the wakeup, and then drop
> the ref).
>
> Hugh's second patch isn't pretty - I think the "get_page()" is
> conceptually in the wrong place - but it "works" in that it keeps that
> "implicit page reference" being kept by the PG_writeback bit, and then
> it takes an explicit page reference before it clears the bit.
>
> So while I don't love the whole "PG_writeback is an implicit reference
> to the page" model, Hugh's patch at least makes that model much more
> straightforward: we really either have that PG_writeback, _or_ we have
> a real ref to the page, and we never have that odd "we could actually
> lose the page" situation.
>
> So I think I prefer Hugh's two-liner over your one-liner suggestion.
>
> But your one-liner is technically not just smaller, it obviously also
> avoids the whole mucking with the atomic page ref.
>
> I don't _think_ that the extra get/put overhead could possibly really
> matter: doing the writeback is going to be a lot more expensive
> anyway. And an atomic access to a 'struct page' sounds expensive, but
> that cacheline is already likely dirty in the L1 cache because we've
> touch page->flags and done other things to it).
>
> So I'd personally be inclined to go with Hugh's patch. Comments?

My only objection to Hugh's patch is that it may cause us to fail
to split pages when we can currently split them. That is, we do:

wait_on_page_writeback()
if (page_has_private(page))
do_invalidatepage(page, offset, length);
split_huge_page()

(at least we do in my THP patchset; not sure if there's any of that
in the kernel today), and the extra reference held for a few nanoseconds
after calling wake_up_page() will cause us to fail to split the page.
It probably doesn't matter; there has to be a fallback path anyway.

Now I'm looking at that codepath, and the race that Hugh uncovered now
looks like a real bug. Consider this sequence:

page allocated, added to page cache, dirtied, writeback started

--- thread A ---
end_page_writeback()
test_clear_page_writeback
--- ctx switch to thread B ---
alloc page, add to page cache, dirty page, start page writeback,
truncate_inode_pages_range()
wait_on_page_writeback()
--- ctx switch to thread A ---
wake_up_page()
--- ctx switch to thread B ---
free page
alloc page
write new data to page

... now the DMA actually starts to do page writeback, and it's writing
the new data.

So my s/if/while/ suggestion is wrong and we need to do something to
prevent spurious wakeups. Unless we bury the spurious wakeup logic
inside wait_on_page_writeback() ...

2020-11-25 00:04:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, Nov 24, 2020 at 1:47 PM Hugh Dickins <[email protected]> wrote:
>
> I think the unreferenced struct page asks for trouble.

I do agree.

I've applied your second patch (the smaller one that just takes a ref
around the critical section). If somebody comes up with some great
alternative, we can always revisit this.

Linus

2020-11-25 09:22:53

by Jan Kara

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue 24-11-20 12:19:12, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 08:07:24PM -0800, Hugh Dickins wrote:
> > Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
> > on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
> > end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
> > no longer an ext4 page at all.
> >
> > The problem is that PageWriteback is not accompanied by a page reference
> > (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
> > soon as TestClearPageWriteback has been done, that page could be removed
> > from page cache, freed, and reused for something else by the time that
> > wake_up_page() is reached.
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
> > Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
> > check; but I'm paranoid about even looking at an unreferenced struct page,
> > lest its memory might itself have already been reused or hotremoved (and
> > wake_up_page_bit() may modify that memory with its ClearPageWaiters()).
> >
> > Then on crashing a second time, realized there's a stronger reason against
> > that approach. If my testing just occasionally crashes on that check,
> > when the page is reused for part of a compound page, wouldn't it be much
> > more common for the page to get reused as an order-0 page before reaching
> > wake_up_page()? And on rare occasions, might that reused page already be
> > marked PageWriteback by its new user, and already be waited upon? What
> > would that look like?
> >
> > It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
> > in write_cache_pages() (though I have never seen that crash myself).
>
> I don't think this is it. write_cache_pages() holds a reference to the
> page -- indeed, it holds the page lock! So this particular race cannot
> cause the page to get recycled. I still have no good ideas what this
> is :-(

But does it really matter what write_cache_pages() does? I mean we start
page writeback. I mean struct bio holds no reference to the page it writes.
The only thing that prevents the page from being freed under bio's hands is
PageWriteback bit. So when the bio is completing we do (e.g. in
ext4_end_bio()), we usually walk all pages in a bio
bio_for_each_segment_all() and for each page call end_page_writeback(), now
once end_page_writeback() calls test_clear_page_writeback() which clears
PageWriteback(), the page can get freed. And that can happen before the
wake_up_page() call in end_page_writeback(). So a race will be like:

CPU1 CPU2
ext4_end_bio()
...
end_page_writeback(page)
test_clear_page_writeback(page)
free page
reallocate page for something else
we can even dirty & start to
writeback 'page'
wake_up_page(page)

and we have a "spurious" wake up on 'page'.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-11-25 23:25:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: kernel BUG at fs/ext4/inode.c:LINE!

On Tue, Nov 24, 2020 at 3:24 PM Linus Torvalds
<[email protected]> wrote:
>
> I've applied your second patch (the smaller one that just takes a ref
> around the critical section). If somebody comes up with some great
> alternative, we can always revisit this.

Hmm.

I'm not sure about "great alternative", but it strikes me that we
*could* move the clearing of the PG_writeback bit _into_
wake_up_page_bit(), under the page waitqueue lock.

IOW, we could make the rule be that the bit isn't actually cleared
before calling wake_up_page() at all, and we'd clear it with something
like

unsigned long flags = READ_ONCE(page->flags);

// We can clear PG_writeback directly if PG_waiters isn't set
while (!(flags & (1ul << PG_waiters))) {
unsigned long new = flags & ~(1ul << PG_writeback);
// PG_writeback was already clear??!!?
if (WARN_ON_ONCE(new == flags))
return;
new = cmpxchg(&page->flags, flags, new);
if (likely(flags == new))
return;
flags = new;
}

// Otherwise, clear the bit at the end - but under the
// page waitqueue lock - inside wake_up_page_bit()
return wake_up_page_bit(..);

instead.

That would basically make the bit clearing atomic wrt the PG_waiters
flags - either using that atomic cmpxchg, or by doing it under the
page queue lock so that it's atomic wrt any new waiters.

This seems conceptually like the right thing to do - and if would also
make the (fair) exclusive lock hand-off case atomic too, because the
bit we're waking up on would never be cleared if it gets handed off
directly.

The above is entirely untested crap written in my MUA, and obviously
requires that all callers of wake_up_page() be moved to that new world
order, but I think we only have two cases: unlock_page() and
end_page_writeback().

And unlock_page() already has that
"clear_bit_unlock_is_negative_byte()" special case that is an ugly
special case of PG_waiters atomicity. So we'd get rid of that, because
the cmpxchg loop would be the better model.

I'm not sure I'm willing to write and test the real patch, but it
doesn't look _too_ nasty from just looking at the code. The bookmark
thing makes it important to only actually clear the bit at the end (as
does the handoff case anyway), but the way wake_up_page_bit() is
written, that's actually very straightforward - just after the
while-loop. That's when we've woken up everybody.

So I'm sending this idea out to see if somebody can shoot it down, or
even wants to possibly even try to do it..

Linus