2023-01-25 23:59:34

by Sanan Hasanov

Subject: kernel BUG in page_add_anon_rmap

Good day, dear maintainers,

We found a bug using a modified kernel configuration file used by syzbot.

We enhanced the coverage of the configuration file using our tool, klocalizer.

Kernel Branch: 6.2.0-rc5-next-20230124
Kernel config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing

Thank you!

Best regards,
Sanan Hasanov

head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL))) << (0)))))
------------[ cut here ]------------
kernel BUG at mm/rmap.c:1248!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73 c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
FS: 00007f717898a700(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
Call Trace:
<TASK>
remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261
rmap_walk_anon+0x23a/0x5b0 mm/rmap.c:2433
rmap_walk+0x96/0xd0 mm/rmap.c:2509
remove_migration_ptes mm/migrate.c:294 [inline]
migrate_folio_move mm/migrate.c:1307 [inline]
migrate_pages_batch+0x15f8/0x3910 mm/migrate.c:1809
migrate_pages+0x1b56/0x22e0 mm/migrate.c:1961
do_mbind mm/mempolicy.c:1329 [inline]
kernel_mbind+0x4c0/0x790 mm/mempolicy.c:1476
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0x80 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f717788edcd
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f7178989bf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ed
RAX: ffffffffffffffda RBX: 00007f71779bc050 RCX: 00007f717788edcd
RDX: 0000000000000004 RSI: 0000000000001000 RDI: 0000000020002000
RBP: 00007f71778fc59c R08: 00000000000007ff R09: 0000000000000002
R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd2c012f6f R14: 00007ffd2c013110 R15: 00007f7178989d80
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73 c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
FS: 00007f717898a700(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0


2023-01-26 00:13:36

by Andrew Morton

Subject: Re: kernel BUG in page_add_anon_rmap

On Wed, 25 Jan 2023 23:59:16 +0000 Sanan Hasanov <[email protected]> wrote:

> Good day, dear maintainers,
>
> We found a bug using a modified kernel configuration file used by syzbot.
>
> We enhanced the coverage of the configuration file using our tool, klocalizer.
>
> Kernel Branch: 6.2.0-rc5-next-20230124
> Kernel config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing

Helpful.

>
> Thank you!
>
> Best regards,
> Sanan Hasanov
>
> head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
> page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL))) << (0)))))
> ------------[ cut here ]------------
> kernel BUG at mm/rmap.c:1248!

I assume this is recent?

Matthew and Liam were in here recently.

> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73 c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
> Call Trace:
> <TASK>
> remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261
> rmap_walk_anon+0x23a/0x5b0 mm/rmap.c:2433
> rmap_walk+0x96/0xd0 mm/rmap.c:2509
> remove_migration_ptes mm/migrate.c:294 [inline]
> migrate_folio_move mm/migrate.c:1307 [inline]
> migrate_pages_batch+0x15f8/0x3910 mm/migrate.c:1809
> migrate_pages+0x1b56/0x22e0 mm/migrate.c:1961
> do_mbind mm/mempolicy.c:1329 [inline]
> kernel_mbind+0x4c0/0x790 mm/mempolicy.c:1476
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x39/0x80 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f717788edcd
> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f7178989bf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ed
> RAX: ffffffffffffffda RBX: 00007f71779bc050 RCX: 00007f717788edcd
> RDX: 0000000000000004 RSI: 0000000000001000 RDI: 0000000020002000
> RBP: 00007f71778fc59c R08: 00000000000007ff R09: 0000000000000002
> R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007ffd2c012f6f R14: 00007ffd2c013110 R15: 00007f7178989d80
> </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73 c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0

2023-01-26 18:57:30

by Matthew Wilcox

Subject: Re: kernel BUG in page_add_anon_rmap

On Wed, Jan 25, 2023 at 11:59:16PM +0000, Sanan Hasanov wrote:
> Good day, dear maintainers,
>
> We found a bug using a modified kernel configuration file used by syzbot.
>
> We enhanced the coverage of the configuration file using our tool, klocalizer.
>
> Kernel Branch: 6.2.0-rc5-next-20230124
> Kernel config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing
>
> Thank you!
>
> Best regards,
> Sanan Hasanov
>
> head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
> page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL))) << (0)))))
> ------------[ cut here ]------------

I know it says "cut here" and you did that, but including just a few
lines above that would be so much more helpful. I can infer that this
is a multi-page folio, but more than that is hard to tell.

> kernel BUG at mm/rmap.c:1248!

That tracks with VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);

> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73 c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
> Call Trace:
> <TASK>
> remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261

        if (folio_test_anon(folio))
                page_add_anon_rmap(new, vma, pvmw.address,
                                   rmap_flags);

Earlier in that function, we had:
        if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
                rmap_flags |= RMAP_EXCLUSIVE;

so that also makes sense. We can also infer that RMAP_COMPOUND wasn't
set, so we're trying to do just one page from the folio.

All right, back to rmap.c:

        first = atomic_inc_and_test(&page->_mapcount);

So first is clearly false (ie _mapcount was not -1), implying somebody
else already mapped this page. Not really sure what's going on at
this point. Seems unlikely that the folio changes in
remove_migration_pte() are responsible since they're from last January.
Huang has some more changes to migrate.c that I don't feel qualified
to judge.
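
(Condensed: by the time we get here RMAP_EXCLUSIVE is set and _mapcount was
already >= 0, so the check that fires is

        VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);

i.e. line 1248 in this tree. That's just the fragments above stitched
together; the surrounding code in -next may differ slightly.)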

Nothing's jumping out at me as obviously wrong. Is it possible to
do a bisect?

> rmap_walk_anon+0x23a/0x5b0 mm/rmap.c:2433
> rmap_walk+0x96/0xd0 mm/rmap.c:2509
> remove_migration_ptes mm/migrate.c:294 [inline]
> migrate_folio_move mm/migrate.c:1307 [inline]
> migrate_pages_batch+0x15f8/0x3910 mm/migrate.c:1809
> migrate_pages+0x1b56/0x22e0 mm/migrate.c:1961
> do_mbind mm/mempolicy.c:1329 [inline]
> kernel_mbind+0x4c0/0x790 mm/mempolicy.c:1476
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x39/0x80 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f717788edcd
> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f7178989bf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ed
> RAX: ffffffffffffffda RBX: 00007f71779bc050 RCX: 00007f717788edcd
> RDX: 0000000000000004 RSI: 0000000000001000 RDI: 0000000020002000
> RBP: 00007f71778fc59c R08: 00000000000007ff R09: 0000000000000002
> R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007ffd2c012f6f R14: 00007ffd2c013110 R15: 00007f7178989d80
> </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73 c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
>

2023-01-27 12:09:52

by David Hildenbrand

Subject: Re: kernel BUG in page_add_anon_rmap

On 26.01.23 19:57, Matthew Wilcox wrote:
> On Wed, Jan 25, 2023 at 11:59:16PM +0000, Sanan Hasanov wrote:
>> Good day, dear maintainers,
>>
>> We found a bug using a modified kernel configuration file used by syzbot.
>>
>> We enhanced the coverage of the configuration file using our tool, klocalizer.
>>
>> Kernel Branch: 6.2.0-rc5-next-20230124
>> Kernel config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
>> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing
>>
>> Thank you!
>>
>> Best regards,
>> Sanan Hasanov
>>
>> head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
>> page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL))) << (0)))))
>> ------------[ cut here ]------------
>
> I know it says "cut here" and you did that, but including just a few
> lines above that would be so much more helpful. I can infer that this
> is a multi-page folio, but more than that is hard to tell.
>
>> kernel BUG at mm/rmap.c:1248!
>
> That tracks with VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
>
>> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
>> CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124 #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
>> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73 c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
>> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
>> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
>> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
>> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
>> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
>> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
>> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
>> Call Trace:
>> <TASK>
>> remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261
>
> if (folio_test_anon(folio))
> page_add_anon_rmap(new, vma, pvmw.address,
> rmap_flags);
>
> Earlier in that function, we had:
> if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
> rmap_flags |= RMAP_EXCLUSIVE;
>
> so that also makes sense. We can also infer that RMAP_COMPOUND wasn't
> set, so we're trying to do just one page from the folio.
>
> All right, back to rmap.c:
>
> first = atomic_inc_and_test(&page->_mapcount);
>
> So first is clearly false (ie _mapcount was not -1), implying somebody
> else already mapped this page. Not really sure what's going on at
> this point. Seems unlikely that the folio changes in
> remove_migration_pte() are responsible since they're from last January.
> Huang has some more changes to migrate.c that I don't feel qualified
> to judge.
>
> Nothing's jumping out at me as obviously wrong. Is it possible to
> do a bisect?

I reproduced on next-20230127 (did not try upstream yet).

I think two key things are that a) THP are set to "always" and b) we have a NUMA setup [I assume].

The relevant bits:

[ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2 mapping:0000000000000000 index:0x20003 pfn:0x14ee03
[ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0 nr_pages_mapped:511 pincount:0
[ 439.899611] memcg:ffff986dc4689000
[ 439.902207] anon flags: 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
[ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8 dead000000000400
[ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
[ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010 ffff98714338a001
[ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff ffff986dc4689000
[ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL))) << (0)))))


Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only mapped into a single
page table (no fork() or similar).
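
For double-checking the per-subpage mapcount from userspace while playing
with this, here is a quick sketch that can be dropped into the reproducer
below (needs root, and relies on the documented /proc/self/pagemap and
/proc/kpagecount layouts; call it on the addresses of interest at the
various steps):

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* number of times the page backing vaddr is currently mapped */
static long page_mapcount_of(void *vaddr)
{
        uint64_t ent = 0, cnt = 0;
        long psize = sysconf(_SC_PAGESIZE);
        int pm = open("/proc/self/pagemap", O_RDONLY);
        int kc = open("/proc/kpagecount", O_RDONLY);

        if (pm < 0 || kc < 0)
                return -1;
        pread(pm, &ent, sizeof(ent), (uintptr_t)vaddr / psize * sizeof(ent));
        if (ent & (1ULL << 63))         /* page present */
                pread(kc, &cnt, sizeof(cnt),
                      (ent & ((1ULL << 55) - 1)) * sizeof(cnt));
        close(pm);
        close(kc);
        return (long)cnt;
}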

I created this reduced reproducer that triggers 100%:


#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
        mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
             MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
        madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);

        *(uint32_t*)0x20000080 = 0x80000;
        mlock((void*)0x20001000ul, 0x2000ul);
        mlock((void*)0x20000000ul, 0x3000ul);
        mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
              MPOL_MF_MOVE);
        return 0;
}

We map a large-enough area for a THP and then populate a fresh anon THP (PMD-mapped)
by writing to it.

The first mlock() will trigger PTE-mapping the THP and mlocking that subpage.
The second mlock() seems to cause the issue. The final mbind() triggers page migration.

Removing one of the mlock() calls makes it work. Note that we do a double
mlock() of the same page -- the one we are then trying to migrate.

Somehow, the double mlock() of the same page seems to affect our mapcount.

CCing Hugh.

--
Thanks,

David / dhildenb


2023-01-27 17:02:20

by Hugh Dickins

Subject: Re: kernel BUG in page_add_anon_rmap

On Fri, 27 Jan 2023, David Hildenbrand wrote:
> On 26.01.23 19:57, Matthew Wilcox wrote:
> > On Wed, Jan 25, 2023 at 11:59:16PM +0000, Sanan Hasanov wrote:
> >> Good day, dear maintainers,
> >>
> >> We found a bug using a modified kernel configuration file used by syzbot.
> >>
> >> We enhanced the coverage of the configuration file using our tool,
> >> klocalizer.
> >>
> >> Kernel Branch: 6.2.0-rc5-next-20230124
> >> Kernel
> >> config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
> >> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing
> >>
> >> Thank you!
> >>
> >> Best regards,
> >> Sanan Hasanov
> >>
> >> head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
> >> page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL)))
> >> << (0)))))
> >> ------------[ cut here ]------------
> >
> > I know it says "cut here" and you did that, but including just a few
> > lines above that would be so much more helpful. I can infer that this
> > is a multi-page folio, but more than that is hard to tell.
> >
> >> kernel BUG at mm/rmap.c:1248!
> >
> > That tracks with VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> >
> >> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> >> CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124
> >> #1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> >> 04/01/2014
> >> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> >> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8
> >> 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73
> >> c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> >> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> >> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> >> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> >> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> >> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> >> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> >> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000)
> >> knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
> >> Call Trace:
> >> <TASK>
> >> remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261
> >
> > if (folio_test_anon(folio))
> > page_add_anon_rmap(new, vma, pvmw.address,
> > rmap_flags);
> >
> > Earlier in that function, we had:
> > if (folio_test_anon(folio) &&
> > !is_readable_migration_entry(entry))
> > rmap_flags |= RMAP_EXCLUSIVE;
> >
> > so that also makes sense. We can also infer that RMAP_COMPOUND wasn't
> > set, so we're trying to do just one page from the folio.
> >
> > All right, back to rmap.c:
> >
> > first = atomic_inc_and_test(&page->_mapcount);
> >
> > So first is clearly false (ie _mapcount was not -1), implying somebody
> > else already mapped this page. Not really sure what's going on at
> > this point. Seems unlikely that the folio changes in
> > remove_migration_pte() are responsible since they're from last January.
> > Huang has some more changes to migrate.c that I don't feel qualified
> > to judge.
> >
> > Nothing's jumping out at me as obviously wrong. Is it possible to
> > do a bisect?
>
> I reproduced on next-20230127 (did not try upstream yet).
>
> I think two key things are that a) THP are set to "always" and b) we have a
> NUMA setup [I assume].
>
> The relevant bits:
>
> [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2
> mapping:0000000000000000 index:0x20003 pfn:0x14ee03
> [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0
> nr_pages_mapped:511 pincount:0
> [ 439.899611] memcg:ffff986dc4689000
> [ 439.902207] anon flags:
> 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
> [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8
> dead000000000400
> [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001
> 0000000000000000
> [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010
> ffff98714338a001
> [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff
> ffff986dc4689000
> [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((
> rmap_t)((((1UL))) << (0)))))
>
>
> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
> mapped into a single
> page table (no fork() or similar).
>
> I created this reduced reproducer that triggers 100%:
>
>
> #include <stdint.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <numaif.h>
>
> int main(void)
> {
> mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
> MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
> madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>
> *(uint32_t*)0x20000080 = 0x80000;
> mlock((void*)0x20001000ul, 0x2000ul);
> mlock((void*)0x20000000ul, 0x3000ul);
> mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
> MPOL_MF_MOVE);
> return 0;
> }
>
> We map a large-enough are for a THP and then populate a fresh anon THP (PMD
> mapped)
> to write to it.
>
> The first mlock() will trigger PTE-mapping the THP and mlocking that subpage.
> The second mlock() seems to cause the issue. The final mbind() triggers page
> migration.
>
> Removing one of the mlock() makes it work. Note that we do a double
> mlock() of the same page -- the one we are then trying to migrate.
>
> Somehow, the double mlock() of the same page seems to affect our mapcount.
>
> CCing Hugh.

Thanks David - most especially for the reproducer, not tried here yet.
I'll assume this is my bug, and get into it later in the day.

Hugh

2023-01-29 06:49:47

by Hugh Dickins

Subject: Re: kernel BUG in page_add_anon_rmap

On Fri, 27 Jan 2023, Hugh Dickins wrote:
> On Fri, 27 Jan 2023, David Hildenbrand wrote:
> > On 26.01.23 19:57, Matthew Wilcox wrote:
> > > On Wed, Jan 25, 2023 at 11:59:16PM +0000, Sanan Hasanov wrote:
> > >> Good day, dear maintainers,
> > >>
> > >> We found a bug using a modified kernel configuration file used by syzbot.
> > >>
> > >> We enhanced the coverage of the configuration file using our tool,
> > >> klocalizer.
> > >>
> > >> Kernel Branch: 6.2.0-rc5-next-20230124
> > >> Kernel
> > >> config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
> > >> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing
> > >>
> > >> Thank you!
> > >>
> > >> Best regards,
> > >> Sanan Hasanov

This is a very interesting find: the thanks go to you.

> > >>
> > >> head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
> > >> page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL)))
> > >> << (0)))))
> > >> ------------[ cut here ]------------
> > >
> > > I know it says "cut here" and you did that, but including just a few
> > > lines above that would be so much more helpful. I can infer that this
> > > is a multi-page folio, but more than that is hard to tell.
> > >
> > >> kernel BUG at mm/rmap.c:1248!
> > >
> > > That tracks with VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> > >
> > >> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> > >> CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124
> > >> #1
> > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> > >> 04/01/2014
> > >> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> > >> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8
> > >> 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73
> > >> c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> > >> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> > >> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> > >> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> > >> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> > >> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> > >> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> > >> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000)
> > >> knlGS:0000000000000000
> > >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
> > >> Call Trace:
> > >> <TASK>
> > >> remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261
> > >
> > > if (folio_test_anon(folio))
> > > page_add_anon_rmap(new, vma, pvmw.address,
> > > rmap_flags);
> > >
> > > Earlier in that function, we had:
> > > if (folio_test_anon(folio) &&
> > > !is_readable_migration_entry(entry))
> > > rmap_flags |= RMAP_EXCLUSIVE;
> > >
> > > so that also makes sense. We can also infer that RMAP_COMPOUND wasn't
> > > set, so we're trying to do just one page from the folio.
> > >
> > > All right, back to rmap.c:
> > >
> > > first = atomic_inc_and_test(&page->_mapcount);
> > >
> > > So first is clearly false (ie _mapcount was not -1), implying somebody
> > > else already mapped this page. Not really sure what's going on at
> > > this point. Seems unlikely that the folio changes in
> > > remove_migration_pte() are responsible since they're from last January.
> > > Huang has some more changes to migrate.c that I don't feel qualified
> > > to judge.
> > >
> > > Nothing's jumping out at me as obviously wrong. Is it possible to
> > > do a bisect?
> >
> > I reproduced on next-20230127 (did not try upstream yet).

Upstream's fine; on next-20230127 (with David's repro) it bisects to
5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
hand on to Liam, rather than delay you by puzzling over it further myself.

> >
> > I think two key things are that a) THP are set to "always" and b) we have a
> > NUMA setup [I assume].
> >
> > The relevant bits:
> >
> > [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2
> > mapping:0000000000000000 index:0x20003 pfn:0x14ee03
> > [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0
> > nr_pages_mapped:511 pincount:0
> > [ 439.899611] memcg:ffff986dc4689000
> > [ 439.902207] anon flags:
> > 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
> > [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8
> > dead000000000400
> > [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001
> > 0000000000000000
> > [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010
> > ffff98714338a001
> > [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff
> > ffff986dc4689000
> > [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((
> > rmap_t)((((1UL))) << (0)))))
> >
> >
> > Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
> > mapped into a single
> > page table (no fork() or similar).

Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
index:0x20003?

My guess is that the remove-__vma_adjust() commit is not properly updating
vm_pgoff into non_vma in some case: so that when remove_migration_pte()
looks for where to insert the new pte, it's off by one page.
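
(For reference, the address that remove_migration_pte() re-inserts at comes
from the rmap walk's vma_address() calculation, which boils down to

        address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

with pgoff taken from the page -- that's a paraphrase of the mm/internal.h
helper, which has extra handling for compound pages. So a vm_pgoff that is
off by one after a split shifts the computed pte address by exactly one page.)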

> >
> > I created this reduced reproducer that triggers 100%:

Very helpful, thank you.

> >
> >
> > #include <stdint.h>
> > #include <unistd.h>
> > #include <sys/mman.h>
> > #include <numaif.h>
> >
> > int main(void)
> > {
> > mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
> > MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
> > madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
> >
> > *(uint32_t*)0x20000080 = 0x80000;
> > mlock((void*)0x20001000ul, 0x2000ul);
> > mlock((void*)0x20000000ul, 0x3000ul);

It's not an mlock() issue in particular: quickly established by
substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
Looks like a vma splitting issue now.
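
(i.e. a substitution along these lines, assuming the same two ranges:

        madvise((void*)0x20001000ul, 0x2000ul, MADV_NOHUGEPAGE); /* was mlock() */
        madvise((void*)0x20000000ul, 0x3000ul, MADV_NOHUGEPAGE); /* was mlock() */

the point being that it's the resulting vma splits, not mlock itself,
that matter.)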

> > mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
> > MPOL_MF_MOVE);

I guess it will turn out not to be relevant to this particular syzbug,
but what do we expect an mbind() of just 0x1000 of a THP to do?

It's a subject I've wrestled with unsuccessfully in the past: I found
myself arriving at one conclusion (split THP) in one place, and a contrary
conclusion (widen range) in another place, and never had time to work out
one unified answer.

So I do wonder what pte replaces the migration entry when the bug here
is fixed: is it a pte pointing into the THP as before, in which case
what was the point of "migration"? is it a Copy-On-Bind page?
or has the whole THP been migrated?

I ought to read through those "estimated mapcount" threads more
carefully: might be relevant, but I've not paid enough attention.

Hugh

> > return 0;
> > }
> >
> > We map a large-enough are for a THP and then populate a fresh anon THP (PMD
> > mapped)
> > to write to it.
> >
> > The first mlock() will trigger PTE-mapping the THP and mlocking that subpage.
> > The second mlock() seems to cause the issue. The final mbind() triggers page
> > migration.
> >
> > Removing one of the mlock() makes it work. Note that we do a double
> > mlock() of the same page -- the one we are then trying to migrate.
> >
> > Somehow, the double mlock() of the same page seems to affect our mapcount.
> >
> > CCing Hugh.
>
> Thanks David - most especially for the reproducer, not tried here yet.
> I'll assume this is my bug, and get into it later in the day.
>
> Hugh

2023-01-30 09:04:33

by David Hildenbrand

Subject: Re: kernel BUG in page_add_anon_rmap

>>>
>>> I reproduced on next-20230127 (did not try upstream yet).
>
> Upstream's fine; on next-20230127 (with David's repro) it bisects to
> 5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
> hand on to Liam, rather than delay you by puzzling over it further myself.
>

Thanks for identifying the problematic commit! ...

>>>
>>> I think two key things are that a) THP are set to "always" and b) we have a
>>> NUMA setup [I assume].
>>>
>>> The relevant bits:
>>>
>>> [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2
>>> mapping:0000000000000000 index:0x20003 pfn:0x14ee03
>>> [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0
>>> nr_pages_mapped:511 pincount:0
>>> [ 439.899611] memcg:ffff986dc4689000
>>> [ 439.902207] anon flags:
>>> 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
>>> [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8
>>> dead000000000400
>>> [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001
>>> 0000000000000000
>>> [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010
>>> ffff98714338a001
>>> [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff
>>> ffff986dc4689000
>>> [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((
>>> rmap_t)((((1UL))) << (0)))))
>>>
>>>
>>> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
>>> mapped into a single
>>> page table (no fork() or similar).
>
> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
> index:0x20003?

I was assuming the whole folio would get migrated. As you raise below,
it's all a bit unclear what to expect once THPs get involved in mbind()
and page migration.

>>>
>>> I created this reduced reproducer that triggers 100%:
>
> Very helpful, thank you.
>
>>>
>>>
>>> #include <stdint.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>> #include <numaif.h>
>>>
>>> int main(void)
>>> {
>>> mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
>>> MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
>>> madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>>>
>>> *(uint32_t*)0x20000080 = 0x80000;
>>> mlock((void*)0x20001000ul, 0x2000ul);
>>> mlock((void*)0x20000000ul, 0x3000ul);
>
> It's not an mlock() issue in particular: quickly established by
> substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
> Looks like a vma splitting issue now.

Gah, should have tried something like that first before suspecting it's
mlock related. :)

>
>>> mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
>>> MPOL_MF_MOVE);
>
> I guess it will turn out not to be relevant to this particular syzbug,
> but what do we expect an mbind() of just 0x1000 of a THP to do?
>
> It's a subject I've wrestled with unsuccessfully in the past: I found
> myself arriving at one conclusion (split THP) in one place, and a contrary
> conclusion (widen range) in another place, and never had time to work out
> one unified answer.

I'm aware of a similar issue with long-term page pinning: we might want
to pin a 4k portion of a THP, but will end up blocking the whole THP
from getting migrated/swapped/split/freed/ ... until we unpin (ever?). I
wrote a reproducer [1] a while ago to show how you can effectively steal
most THPs in the system with a comparatively small memlock limit using
io_uring ...
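
The gist of that experiment, heavily trimmed (a liburing-based sketch, not
the actual reproducer from [1] below):

#include <sys/mman.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
        struct io_uring ring;
        size_t len = 2UL * 1024 * 1024;         /* one PMD-sized area */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        madvise(buf, len, MADV_HUGEPAGE);
        memset(buf, 1, len);                    /* populate, ideally as a THP */

        io_uring_queue_init(8, &ring, 0);

        /* register (long-term pin) only the first 4k of that area */
        struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
        io_uring_register_buffers(&ring, &iov, 1);

        pause();                                /* keep the pin alive */
        return 0;
}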

In theory, we could split the THP before long-term pinning only a
subregion ... but what if we cannot split the THP because it's already
pinned (previous pinning request that covered the whole THP)? Copying
instead of splitting would also not be possible, if the page is already
pinned ... so we'd never want to allow long-term pinning a THP ... but
that means that we would have to fail pinning if splitting the THP fails
and that there would be performance consequences for THP users :/

Non-trivial ... just like mlocking only a part of a THP or mbinding
different parts of a THP to different nodes ...

[1]
https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c

--
Thanks,

David / dhildenb


2023-01-30 09:30:44

by David Hildenbrand

Subject: Re: kernel BUG in page_add_anon_rmap

On 30.01.23 10:03, David Hildenbrand wrote:
>>>>
>>>> I reproduced on next-20230127 (did not try upstream yet).
>>
>> Upstream's fine; on next-20230127 (with David's repro) it bisects to
>> 5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
>> hand on to Liam, rather than delay you by puzzling over it further myself.
>>
>
> Thanks for identifying the problematic commit! ...
>
>>>>
>>>> I think two key things are that a) THP are set to "always" and b) we have a
>>>> NUMA setup [I assume].
>>>>
>>>> The relevant bits:
>>>>
>>>> [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2
>>>> mapping:0000000000000000 index:0x20003 pfn:0x14ee03
>>>> [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0
>>>> nr_pages_mapped:511 pincount:0
>>>> [ 439.899611] memcg:ffff986dc4689000
>>>> [ 439.902207] anon flags:
>>>> 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
>>>> [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8
>>>> dead000000000400
>>>> [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001
>>>> 0000000000000000
>>>> [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010
>>>> ffff98714338a001
>>>> [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff
>>>> ffff986dc4689000
>>>> [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((
>>>> rmap_t)((((1UL))) << (0)))))
>>>>
>>>>
>>>> Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
>>>> mapped into a single
>>>> page table (no fork() or similar).
>>
>> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
>> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
>> index:0x20003?
>
> I was assuming the whole folio would get migrated. As you raise below,
> it's all a bit unclear once THP get involved and dealing with mbind()
> and page migration.
>
>>>>
>>>> I created this reduced reproducer that triggers 100%:
>>
>> Very helpful, thank you.
>>
>>>>
>>>>
>>>> #include <stdint.h>
>>>> #include <unistd.h>
>>>> #include <sys/mman.h>
>>>> #include <numaif.h>
>>>>
>>>> int main(void)
>>>> {
>>>> mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
>>>> MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
>>>> madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
>>>>
>>>> *(uint32_t*)0x20000080 = 0x80000;
>>>> mlock((void*)0x20001000ul, 0x2000ul);
>>>> mlock((void*)0x20000000ul, 0x3000ul);
>>
>> It's not an mlock() issue in particular: quickly established by
>> substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
>> Looks like a vma splitting issue now.
>
> Gah, should have tried something like that first before suspecting it's
> mlock related. :)
>
>>
>>>> mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
>>>> MPOL_MF_MOVE);
>>
>> I guess it will turn out not to be relevant to this particular syzbug,
>> but what do we expect an mbind() of just 0x1000 of a THP to do?
>>
>> It's a subject I've wrestled with unsuccessfully in the past: I found
>> myself arriving at one conclusion (split THP) in one place, and a contrary
>> conclusion (widen range) in another place, and never had time to work out
>> one unified answer.
>
> I'm aware of a similar issue with long-term page pinning: we might want
> to pin a 4k portion of a THP, but will end up blocking the whole THP
> from getting migrated/swapped/split/freed/ ... until we unpin (ever?). I
> wrote a reproducer [1] a while ago to show how you can effectively steal
> most THP in the system using comparatively small memlock limit using
> io_uring ...
>

Correction: my reproducer already triggers a compound page split to
really only pin a 4k page, to then free the remaining 4k pages of the
previous THP. As a single 4k page is allocated and pinned, we cannot get
a THP at these physical memory locations until the page is unpinned.

--
Thanks,

David / dhildenb


2023-01-30 16:12:25

by Matthew Wilcox

Subject: Re: kernel BUG in page_add_anon_rmap

On Sat, Jan 28, 2023 at 10:49:31PM -0800, Hugh Dickins wrote:
> I guess it will turn out not to be relevant to this particular syzbug,
> but what do we expect an mbind() of just 0x1000 of a THP to do?
>
> It's a subject I've wrestled with unsuccessfully in the past: I found
> myself arriving at one conclusion (split THP) in one place, and a contrary
> conclusion (widen range) in another place, and never had time to work out
> one unified answer.
>
> So I do wonder what pte replaces the migration entry when the bug here
> is fixed: is it a pte pointing into the THP as before, in which case
> what was the point of "migration"? is it a Copy-On-Bind page?
> or has the whole THP been migrated?

I have an Opinion!

The important thing about THP (IMO) is the Transparency part.
Applications don't need to do anything special to get memory managed
in larger chunks; the only difference is in performance. That is, they
get better performance if the kernel can do it, and thinks it worthwhile.

The tradeoff with THP is that we treat all memory in this 2MB chunk the
same way; we track its dirtiness and age as a single thing (position
on LRU, etc). That assumes we're doing no harm, or less harm than we
would by tracking each page independently.

If userspace gives us a hint like "I want this range of memory on that
node", that's a strong signal that *this* range of memory is considered
by userspace to be a single unit. So my opinion is that userspace is
letting us know that we previously made a bad decision and we should
rectify it by splitting now.

Zi Yan has a patch to allow pages to be split to arbitrary orders instead
of 0. We should probably give that a review so that we're not making
the opposite mistake of tracking at too fine a granularity.

> I ought to read through those "estimated mapcount" threads more
> carefully: might be relevant, but I've not paid enough attention.

I'm not sure they're relevant to this, although obviously I'd love
your thoughts on how we could handle mapcount more efficiently.


2023-01-30 19:20:40

by Yang Shi

Subject: Re: kernel BUG in page_add_anon_rmap

On Sat, Jan 28, 2023 at 10:49 PM Hugh Dickins <[email protected]> wrote:
>
> On Fri, 27 Jan 2023, Hugh Dickins wrote:
> > On Fri, 27 Jan 2023, David Hildenbrand wrote:
> > > On 26.01.23 19:57, Matthew Wilcox wrote:
> > > > On Wed, Jan 25, 2023 at 11:59:16PM +0000, Sanan Hasanov wrote:
> > > >> Good day, dear maintainers,
> > > >>
> > > >> We found a bug using a modified kernel configuration file used by syzbot.
> > > >>
> > > >> We enhanced the coverage of the configuration file using our tool,
> > > >> klocalizer.
> > > >>
> > > >> Kernel Branch: 6.2.0-rc5-next-20230124
> > > >> Kernel
> > > >> config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
> > > >> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing
> > > >>
> > > >> Thank you!
> > > >>
> > > >> Best regards,
> > > >> Sanan Hasanov
>
> This is a very interesting find: the thanks go to you.
>
> > > >>
> > > >> head: 0000000000020000 0000000000000000 00000004ffffffff ffff8881002b8000
> > > >> page dumped because: VM_BUG_ON_PAGE(!first && (flags & (( rmap_t)((((1UL)))
> > > >> << (0)))))
> > > >> ------------[ cut here ]------------
> > > >
> > > > I know it says "cut here" and you did that, but including just a few
> > > > lines above that would be so much more helpful. I can infer that this
> > > > is a multi-page folio, but more than that is hard to tell.
> > > >
> > > >> kernel BUG at mm/rmap.c:1248!
> > > >
> > > > That tracks with VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> > > >
> > > >> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> > > >> CPU: 7 PID: 14932 Comm: syz-executor.1 Not tainted 6.2.0-rc5-next-20230124
> > > >> #1
> > > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> > > >> 04/01/2014
> > > >> RIP: 0010:page_add_anon_rmap+0xddd/0x11c0 mm/rmap.c:1248
> > > >> Code: c0 ff 48 8b 34 24 48 89 df e8 1f ff 07 00 49 89 c6 e9 85 f6 ff ff e8
> > > >> 52 73 c0 ff 48 c7 c6 c0 3c d8 89 48 89 ef e8 b3 23 f8 ff <0f> 0b e8 3c 73
> > > >> c0 ff 48 c7 c6 00 3b d8 89 48 89 ef e8 9d 23 f8 ff
> > > >> RSP: 0018:ffffc9000c56f7b0 EFLAGS: 00010293
> > > >> RAX: 0000000000000000 RBX: ffff88807efc6f30 RCX: 0000000000000000
> > > >> RDX: ffff8880464fd7c0 RSI: ffffffff81be733d RDI: fffff520018adedb
> > > >> RBP: ffffea0000c68080 R08: 0000000000000056 R09: 0000000000000000
> > > >> R10: 0000000000000001 R11: 0000000000000001 R12: ffffea0000c68000
> > > >> R13: 0000000000000001 R14: ffffea0000c68088 R15: 0000000000000000
> > > >> FS: 00007f717898a700(0000) GS:ffff888119f80000(0000)
> > > >> knlGS:0000000000000000
> > > >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> CR2: 00007f7178947d78 CR3: 000000004a9e6000 CR4: 0000000000350ee0
> > > >> Call Trace:
> > > >> <TASK>
> > > >> remove_migration_pte+0xaa6/0x1390 mm/migrate.c:261
> > > >
> > > > if (folio_test_anon(folio))
> > > > page_add_anon_rmap(new, vma, pvmw.address,
> > > > rmap_flags);
> > > >
> > > > Earlier in that function, we had:
> > > > if (folio_test_anon(folio) &&
> > > > !is_readable_migration_entry(entry))
> > > > rmap_flags |= RMAP_EXCLUSIVE;
> > > >
> > > > so that also makes sense. We can also infer that RMAP_COMPOUND wasn't
> > > > set, so we're trying to do just one page from the folio.
> > > >
> > > > All right, back to rmap.c:
> > > >
> > > > first = atomic_inc_and_test(&page->_mapcount);
> > > >
> > > > So first is clearly false (ie _mapcount was not -1), implying somebody
> > > > else already mapped this page. Not really sure what's going on at
> > > > this point. Seems unlikely that the folio changes in
> > > > remove_migration_pte() are responsible since they're from last January.
> > > > Huang has some more changes to migrate.c that I don't feel qualified
> > > > to judge.
> > > >
> > > > Nothing's jumping out at me as obviously wrong. Is it possible to
> > > > do a bisect?
> > >
> > > I reproduced on next-20230127 (did not try upstream yet).
>
> Upstream's fine; on next-20230127 (with David's repro) it bisects to
> 5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
> hand on to Liam, rather than delay you by puzzling over it further myself.
>
> > >
> > > I think two key things are that a) THP are set to "always" and b) we have a
> > > NUMA setup [I assume].
> > >
> > > The relevant bits:
> > >
> > > [ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2
> > > mapping:0000000000000000 index:0x20003 pfn:0x14ee03
> > > [ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0
> > > nr_pages_mapped:511 pincount:0
> > > [ 439.899611] memcg:ffff986dc4689000
> > > [ 439.902207] anon flags:
> > > 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
> > > [ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8
> > > dead000000000400
> > > [ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001
> > > 0000000000000000
> > > [ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010
> > > ffff98714338a001
> > > [ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff
> > > ffff986dc4689000
> > > [ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((
> > > rmap_t)((((1UL))) << (0)))))
> > >
> > >
> > > Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
> > > mapped into a single
> > > page table (no fork() or similar).
>
> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
> index:0x20003?
>
> My guess is that the remove-__vma_adjust() commit is not properly updating
> vm_pgoff into non_vma in some case: so that when remove_migration_pte()
> looks for where to insert the new pte, it's off by one page.
>
> > >
> > > I created this reduced reproducer that triggers 100%:
>
> Very helpful, thank you.
>
> > >
> > >
> > > #include <stdint.h>
> > > #include <unistd.h>
> > > #include <sys/mman.h>
> > > #include <numaif.h>
> > >
> > > int main(void)
> > > {
> > > mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
> > > MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
> > > madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
> > >
> > > *(uint32_t*)0x20000080 = 0x80000;
> > > mlock((void*)0x20001000ul, 0x2000ul);
> > > mlock((void*)0x20000000ul, 0x3000ul);
>
> It's not an mlock() issue in particular: quickly established by
> substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
> Looks like a vma splitting issue now.
>
> > > mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
> > > MPOL_MF_MOVE);
>
> I guess it will turn out not to be relevant to this particular syzbug,
> but what do we expect an mbind() of just 0x1000 of a THP to do?
>
> It's a subject I've wrestled with unsuccessfully in the past: I found
> myself arriving at one conclusion (split THP) in one place, and a contrary
> conclusion (widen range) in another place, and never had time to work out
> one unified answer.
>
> So I do wonder what pte replaces the migration entry when the bug here
> is fixed: is it a pte pointing into the THP as before, in which case
> what was the point of "migration"? is it a Copy-On-Bind page?
> or has the whole THP been migrated?

IIRC, mbind() for a partial THP would migrate the whole THP as long as
there is enough memory on the target node. mbind() doesn't split the THP,
but only splits the PMD, and the PMD split happens after the THP is queued
for migration. So the migration actually migrates a PTE-mapped THP.
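
For anyone who wants to see that from userspace: after the mbind() you can
query which node each 4k page of the range ended up on with move_pages() in
query mode (a sketch; link with -lnuma):

#include <numaif.h>
#include <stdio.h>

/* print the node of each 4k page in [base, base + npages * 4096) */
static void dump_nodes(void *base, int npages)
{
        void *pages[npages];
        int status[npages];

        for (int i = 0; i < npages; i++)
                pages[i] = (char *)base + (long)i * 4096;
        /* nodes == NULL means "just report the current node of each page" */
        if (move_pages(0, npages, pages, NULL, status, 0) == 0)
                for (int i = 0; i < npages; i++)
                        printf("page %d: node %d\n", i, status[i]);
}

If the whole THP moved, all 512 subpages should report the same new node.
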

>
> I ought to read through those "estimated mapcount" threads more
> carefully: might be relevant, but I've not paid enough attention.
>
> Hugh
>
> > > return 0;
> > > }
> > >
> > > We map a large-enough are for a THP and then populate a fresh anon THP (PMD
> > > mapped)
> > > to write to it.
> > >
> > > The first mlock() will trigger PTE-mapping the THP and mlocking that subpage.
> > > The second mlock() seems to cause the issue. The final mbind() triggers page
> > > migration.
> > >
> > > Removing one of the mlock() makes it work. Note that we do a double
> > > mlock() of the same page -- the one we are then trying to migrate.
> > >
> > > Somehow, the double mlock() of the same page seems to affect our mapcount.
> > >
> > > CCing Hugh.
> >
> > Thanks David - most especially for the reproducer, not tried here yet.
> > I'll assume this is my bug, and get into it later in the day.
> >
> > Hugh

2023-01-30 19:27:01

by Liam R. Howlett

Subject: Re: kernel BUG in page_add_anon_rmap

* Hugh Dickins <[email protected]> [230129 01:49]:
> On Fri, 27 Jan 2023, Hugh Dickins wrote:
> > On Fri, 27 Jan 2023, David Hildenbrand wrote:
> > > On 26.01.23 19:57, Matthew Wilcox wrote:
> > > > On Wed, Jan 25, 2023 at 11:59:16PM +0000, Sanan Hasanov wrote:
> > > >> Good day, dear maintainers,
> > > >>
> > > >> We found a bug using a modified kernel configuration file used by syzbot.
> > > >>
> > > >> We enhanced the coverage of the configuration file using our tool,
> > > >> klocalizer.
> > > >>
> > > >> Kernel Branch: 6.2.0-rc5-next-20230124
> > > >> Kernel
> > > >> config: https://drive.google.com/file/d/1MZSgIF4R9QfikEuF5siUIZVPce-GiJQK/view?usp=sharing
> > > >> Reproducer: https://drive.google.com/file/d/1H5KWkT9VVMWTUVVgIaZi6J-fmukRx-BM/view?usp=sharing
> > > >>
> > > >> Thank you!
> > > >>
> > > >> Best regards,
> > > >> Sanan Hasanov
>
> This is a very interesting find: the thanks go to you.
>

...

> Upstream's fine; on next-20230127 (with David's repro) it bisects to
> 5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
> hand on to Liam, rather than delay you by puzzling over it further myself.

Thanks Hugh!

...

> > > Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
> > > mapped into a single
> > > page table (no fork() or similar).
>
> Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
> what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
> index:0x20003?
>
> My guess is that the remove-__vma_adjust() commit is not properly updating
> vm_pgoff into non_vma in some case: so that when remove_migration_pte()
> looks for where to insert the new pte, it's off by one page.

That looks to be exactly correct. I am setting the vm_pgoff to the
wrong value in case 8 (for lack of a better name).

>
> > >
> > > I created this reduced reproducer that triggers 100%:
>
> Very helpful, thank you.

Yes, thank you very much for finding this bug and the reproducer.

...

Thanks,
Liam