2022-05-03 01:19:46

by syzbot

Subject: [syzbot] BUG: Bad page map (5)

Hello,

syzbot found the following issue on:

HEAD commit: 0966d385830d riscv: Fix auipc+jalr relocation range checks
git tree: git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git fixes
console output: https://syzkaller.appspot.com/x/log.txt?x=10e1526cf00000
kernel config: https://syzkaller.appspot.com/x/.config?x=6295d67591064921
dashboard link: https://syzkaller.appspot.com/bug?extid=915f3e317adb0e85835f
compiler: riscv64-linux-gnu-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
userspace arch: riscv64

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

netdevsim netdevsim0 netdevsim1: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim0 netdevsim2: set [1, 0] type 2 family 0 port 6081 - 0
netdevsim netdevsim0 netdevsim3: set [1, 0] type 2 family 0 port 6081 - 0
BUG: Bad page map in process syz-executor.0 pte:ffffaf80215a00f0 pmd:285e7c01
addr:00007fffbd3e6000 vm_flags:100400fb anon_vma:0000000000000000 mapping:ffffaf800ab1e058 index:3c
file:kcov fault:0x0 mmap:kcov_mmap readpage:0x0
CPU: 1 PID: 2051 Comm: syz-executor.0 Not tainted 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff803cdcdc>] print_bad_pte+0x3d4/0x4a0 mm/memory.c:563
[<ffffffff803d1622>] vm_normal_page+0x20c/0x22a mm/memory.c:626
[<ffffffff803dbb4e>] copy_present_pte mm/memory.c:949 [inline]
[<ffffffff803dbb4e>] copy_pte_range mm/memory.c:1074 [inline]
[<ffffffff803dbb4e>] copy_pmd_range mm/memory.c:1160 [inline]
[<ffffffff803dbb4e>] copy_pud_range mm/memory.c:1197 [inline]
[<ffffffff803dbb4e>] copy_p4d_range mm/memory.c:1221 [inline]
[<ffffffff803dbb4e>] copy_page_range+0x828/0x236c mm/memory.c:1294
[<ffffffff80049bcc>] dup_mmap kernel/fork.c:612 [inline]
[<ffffffff80049bcc>] dup_mm+0xb5c/0xe10 kernel/fork.c:1451
[<ffffffff8004c7c6>] copy_mm kernel/fork.c:1503 [inline]
[<ffffffff8004c7c6>] copy_process+0x25da/0x3c34 kernel/fork.c:2164
[<ffffffff8004e106>] kernel_clone+0xee/0x920 kernel/fork.c:2555
[<ffffffff8004ea2a>] __do_sys_clone+0xf2/0x12e kernel/fork.c:2672
[<ffffffff8004ee4e>] sys_clone+0x32/0x44 kernel/fork.c:2640
[<ffffffff80005716>] ret_from_syscall+0x0/0x2
BUG: Bad page map in process syz-executor.0 pte:ffffffff801110e4 pmd:285e7c01
addr:00007fffbd3e7000 vm_flags:100400fb anon_vma:0000000000000000 mapping:ffffaf800ab1e058 index:3d
file:kcov fault:0x0 mmap:kcov_mmap readpage:0x0
CPU: 1 PID: 2051 Comm: syz-executor.0 Tainted: G B 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff803cdcdc>] print_bad_pte+0x3d4/0x4a0 mm/memory.c:563
[<ffffffff803d1622>] vm_normal_page+0x20c/0x22a mm/memory.c:626
[<ffffffff803dbb4e>] copy_present_pte mm/memory.c:949 [inline]
[<ffffffff803dbb4e>] copy_pte_range mm/memory.c:1074 [inline]
[<ffffffff803dbb4e>] copy_pmd_range mm/memory.c:1160 [inline]
[<ffffffff803dbb4e>] copy_pud_range mm/memory.c:1197 [inline]
[<ffffffff803dbb4e>] copy_p4d_range mm/memory.c:1221 [inline]
[<ffffffff803dbb4e>] copy_page_range+0x828/0x236c mm/memory.c:1294
[<ffffffff80049bcc>] dup_mmap kernel/fork.c:612 [inline]
[<ffffffff80049bcc>] dup_mm+0xb5c/0xe10 kernel/fork.c:1451
[<ffffffff8004c7c6>] copy_mm kernel/fork.c:1503 [inline]
[<ffffffff8004c7c6>] copy_process+0x25da/0x3c34 kernel/fork.c:2164
[<ffffffff8004e106>] kernel_clone+0xee/0x920 kernel/fork.c:2555
[<ffffffff8004ea2a>] __do_sys_clone+0xf2/0x12e kernel/fork.c:2672
[<ffffffff8004ee4e>] sys_clone+0x32/0x44 kernel/fork.c:2640
[<ffffffff80005716>] ret_from_syscall+0x0/0x2
BUG: Bad page map in process syz-executor.0 pte:ffffffff801110e4 pmd:285e7c01
addr:00007fffbd3ef000 vm_flags:100400fb anon_vma:0000000000000000 mapping:ffffaf800ab1e058 index:45
file:kcov fault:0x0 mmap:kcov_mmap readpage:0x0
CPU: 1 PID: 2051 Comm: syz-executor.0 Tainted: G B 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff803cdcdc>] print_bad_pte+0x3d4/0x4a0 mm/memory.c:563
[<ffffffff803d1622>] vm_normal_page+0x20c/0x22a mm/memory.c:626
[<ffffffff803dbb4e>] copy_present_pte mm/memory.c:949 [inline]
[<ffffffff803dbb4e>] copy_pte_range mm/memory.c:1074 [inline]
[<ffffffff803dbb4e>] copy_pmd_range mm/memory.c:1160 [inline]
[<ffffffff803dbb4e>] copy_pud_range mm/memory.c:1197 [inline]
[<ffffffff803dbb4e>] copy_p4d_range mm/memory.c:1221 [inline]
[<ffffffff803dbb4e>] copy_page_range+0x828/0x236c mm/memory.c:1294
[<ffffffff80049bcc>] dup_mmap kernel/fork.c:612 [inline]
[<ffffffff80049bcc>] dup_mm+0xb5c/0xe10 kernel/fork.c:1451
[<ffffffff8004c7c6>] copy_mm kernel/fork.c:1503 [inline]
[<ffffffff8004c7c6>] copy_process+0x25da/0x3c34 kernel/fork.c:2164
[<ffffffff8004e106>] kernel_clone+0xee/0x920 kernel/fork.c:2555
[<ffffffff8004ea2a>] __do_sys_clone+0xf2/0x12e kernel/fork.c:2672
[<ffffffff8004ee4e>] sys_clone+0x32/0x44 kernel/fork.c:2640
[<ffffffff80005716>] ret_from_syscall+0x0/0x2
BUG: Bad page map in process syz-executor.0 pte:41b58ab3 pmd:285e7c01
addr:00007fffbd3f4000 vm_flags:100400fb anon_vma:0000000000000000 mapping:ffffaf800ab1e058 index:4a
file:kcov fault:0x0 mmap:kcov_mmap readpage:0x0
CPU: 1 PID: 2051 Comm: syz-executor.0 Tainted: G B 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff803cdcdc>] print_bad_pte+0x3d4/0x4a0 mm/memory.c:563
[<ffffffff803d1622>] vm_normal_page+0x20c/0x22a mm/memory.c:626
[<ffffffff803dbb4e>] copy_present_pte mm/memory.c:949 [inline]
[<ffffffff803dbb4e>] copy_pte_range mm/memory.c:1074 [inline]
[<ffffffff803dbb4e>] copy_pmd_range mm/memory.c:1160 [inline]
[<ffffffff803dbb4e>] copy_pud_range mm/memory.c:1197 [inline]
[<ffffffff803dbb4e>] copy_p4d_range mm/memory.c:1221 [inline]
[<ffffffff803dbb4e>] copy_page_range+0x828/0x236c mm/memory.c:1294
[<ffffffff80049bcc>] dup_mmap kernel/fork.c:612 [inline]
[<ffffffff80049bcc>] dup_mm+0xb5c/0xe10 kernel/fork.c:1451
[<ffffffff8004c7c6>] copy_mm kernel/fork.c:1503 [inline]
[<ffffffff8004c7c6>] copy_process+0x25da/0x3c34 kernel/fork.c:2164
[<ffffffff8004e106>] kernel_clone+0xee/0x920 kernel/fork.c:2555
[<ffffffff8004ea2a>] __do_sys_clone+0xf2/0x12e kernel/fork.c:2672
[<ffffffff8004ee4e>] sys_clone+0x32/0x44 kernel/fork.c:2640
[<ffffffff80005716>] ret_from_syscall+0x0/0x2
BUG: Bad page map in process syz-executor.0 pte:ffffffff8451f630 pmd:285e7c01
addr:00007fffbd3f5000 vm_flags:100400fb anon_vma:0000000000000000 mapping:ffffaf800ab1e058 index:4b
file:kcov fault:0x0 mmap:kcov_mmap readpage:0x0
CPU: 1 PID: 2051 Comm: syz-executor.0 Tainted: G B 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff8000a228>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:113
[<ffffffff831668cc>] show_stack+0x34/0x40 arch/riscv/kernel/stacktrace.c:119
[<ffffffff831756ba>] __dump_stack lib/dump_stack.c:88 [inline]
[<ffffffff831756ba>] dump_stack_lvl+0xe4/0x150 lib/dump_stack.c:106
[<ffffffff83175742>] dump_stack+0x1c/0x24 lib/dump_stack.c:113
[<ffffffff803cdcdc>] print_bad_pte+0x3d4/0x4a0 mm/memory.c:563
[<ffffffff803d1622>] vm_normal_page+0x20c/0x22a mm/memory.c:626
[<ffffffff803dbb4e>] copy_present_pte mm/memory.c:949 [inline]
[<ffffffff803dbb4e>] copy_pte_range mm/memory.c:1074 [inline]
[<ffffffff803dbb4e>] copy_pmd_range mm/memory.c:1160 [inline]
[<ffffffff803dbb4e>] copy_pud_range mm/memory.c:1197 [inline]
[<ffffffff803dbb4e>] copy_p4d_range mm/memory.c:1221 [inline]
[<ffffffff803dbb4e>] copy_page_range+0x828/0x236c mm/memory.c:1294
[<ffffffff80049bcc>] dup_mmap kernel/fork.c:612 [inline]
[<ffffffff80049bcc>] dup_mm+0xb5c/0xe10 kernel/fork.c:1451
[<ffffffff8004c7c6>] copy_mm kernel/fork.c:1503 [inline]
[<ffffffff8004c7c6>] copy_process+0x25da/0x3c34 kernel/fork.c:2164
[<ffffffff8004e106>] kernel_clone+0xee/0x920 kernel/fork.c:2555
[<ffffffff8004ea2a>] __do_sys_clone+0xf2/0x12e kernel/fork.c:2672
[<ffffffff8004ee4e>] sys_clone+0x32/0x44 kernel/fork.c:2640
[<ffffffff80005716>] ret_from_syscall+0x0/0x2
Unable to handle kernel paging request at virtual address ffffaf847c9ffff8
Oops [#1]
Modules linked in:
CPU: 1 PID: 2051 Comm: syz-executor.0 Tainted: G B 5.17.0-rc1-syzkaller-00002-g0966d385830d #0
Hardware name: riscv-virtio,qemu (DT)
epc : __nr_to_section include/linux/mmzone.h:1396 [inline]
epc : __pfn_to_section include/linux/mmzone.h:1480 [inline]
epc : pfn_swap_entry_to_page include/linux/swapops.h:252 [inline]
epc : copy_nonpresent_pte mm/memory.c:798 [inline]
epc : copy_pte_range mm/memory.c:1053 [inline]
epc : copy_pmd_range mm/memory.c:1160 [inline]
epc : copy_pud_range mm/memory.c:1197 [inline]
epc : copy_p4d_range mm/memory.c:1221 [inline]
epc : copy_page_range+0x1ade/0x236c mm/memory.c:1294
ra : __nr_to_section include/linux/mmzone.h:1396 [inline]
ra : __pfn_to_section include/linux/mmzone.h:1480 [inline]
ra : pfn_swap_entry_to_page include/linux/swapops.h:252 [inline]
ra : copy_nonpresent_pte mm/memory.c:798 [inline]
ra : copy_pte_range mm/memory.c:1053 [inline]
ra : copy_pmd_range mm/memory.c:1160 [inline]
ra : copy_pud_range mm/memory.c:1197 [inline]
ra : copy_p4d_range mm/memory.c:1221 [inline]
ra : copy_page_range+0x1ade/0x236c mm/memory.c:1294
epc : ffffffff803dce04 ra : ffffffff803dce04 sp : ffffaf80215a3680
gp : ffffffff85863ac0 tp : ffffaf8007409840 t0 : ffffaf80215a3830
t1 : fffff5ef042b4705 t2 : 00007fff83b1f010 s0 : ffffaf80215a38e0
s1 : ffffffff80110fdc a0 : ffffaf847c9ffff8 a1 : 0000000000000007
a2 : 1ffff5f08f93ffff a3 : ffffffff803dce04 a4 : 0000000000000000
a5 : ffffaf847c9ffff8 a6 : 0000000000f00000 a7 : ffffaf80215a382f
s2 : ffffaf802159ffb0 s3 : ffffaf800f182fb0 s4 : 0000000000000000
s5 : 7c1ffffffff00221 s6 : 001ffffffff00221 s7 : ffffaf847c9ffff8
s8 : 000000000000001f s9 : 00007fffbd400000 s10: ffffaf800e521840
s11: 00007fffbd3f6000 t3 : 000000000001fffe t4 : fffff5ef042b4704
t5 : fffff5ef042b4706 t6 : 000000000002463c
status: 0000000000000120 badaddr: ffffaf847c9ffff8 cause: 000000000000000d
[<ffffffff80049bcc>] dup_mmap kernel/fork.c:612 [inline]
[<ffffffff80049bcc>] dup_mm+0xb5c/0xe10 kernel/fork.c:1451
[<ffffffff8004c7c6>] copy_mm kernel/fork.c:1503 [inline]
[<ffffffff8004c7c6>] copy_process+0x25da/0x3c34 kernel/fork.c:2164
[<ffffffff8004e106>] kernel_clone+0xee/0x920 kernel/fork.c:2555
[<ffffffff8004ea2a>] __do_sys_clone+0xf2/0x12e kernel/fork.c:2672
[<ffffffff8004ee4e>] sys_clone+0x32/0x44 kernel/fork.c:2640
[<ffffffff80005716>] ret_from_syscall+0x0/0x2
---[ end trace 0000000000000000 ]---


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


2022-09-12 04:34:08

by syzbot

Subject: Re: [syzbot] BUG: Bad page map (5)

syzbot has found a reproducer for the following issue on:

HEAD commit: e47eb90a0a9a Add linux-next specific files for 20220901
git tree: linux-next
console+strace: https://syzkaller.appspot.com/x/log.txt?x=17330430880000
kernel config: https://syzkaller.appspot.com/x/.config?x=7933882276523081
dashboard link: https://syzkaller.appspot.com/bug?extid=915f3e317adb0e85835f
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=13397b77080000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1793564f080000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067
addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
file:(null) fault:0x0 mmap:0x0 read_folio:0x0
CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
vm_normal_page+0x10c/0x2a0 mm/memory.c:636
hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
do_madvise mm/madvise.c:1428 [inline]
__do_sys_madvise mm/madvise.c:1428 [inline]
__se_sys_madvise mm/madvise.c:1426 [inline]
__x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f770ba87929
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
</TASK>

2022-09-12 22:07:56

by Yang Shi

Subject: Re: [syzbot] BUG: Bad page map (5)

On Sun, Sep 11, 2022 at 9:27 PM syzbot
<[email protected]> wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit: e47eb90a0a9a Add linux-next specific files for 20220901
> git tree: linux-next
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=17330430880000
> kernel config: https://syzkaller.appspot.com/x/.config?x=7933882276523081
> dashboard link: https://syzkaller.appspot.com/bug?extid=915f3e317adb0e85835f
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=13397b77080000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1793564f080000
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: [email protected]
>
> BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067
> addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
> file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
> vm_normal_page+0x10c/0x2a0 mm/memory.c:636
> hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
> madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
> madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
> madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
> do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
> do_madvise mm/madvise.c:1428 [inline]
> __do_sys_madvise mm/madvise.c:1428 [inline]
> __se_sys_madvise mm/madvise.c:1426 [inline]
> __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f770ba87929
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
> RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
> R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
> R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
> </TASK>

I think I figured out the problem. The reproducer actually triggered
the below race in madvise_collapse():

CPU A                                 CPU B
mmap 0x20000000 - 0x21000000 as anon
                                      madvise_collapse is called on this area
                                      retrieve start and end address from the
                                      vma (NEVER updated later!)
                                      collapse the first 2M area and drop
                                      mmap_lock
acquire mmap_lock
mmap io_uring file at 0x20563000
release mmap_lock
                                      reacquire mmap_lock
                                      vma revalidation passes, since
                                      0x20200000 + 0x200000 <= 0x20563000
                                      scan the next 2M (0x20200000 -
                                      0x20400000), but for whatever reason it
                                      didn't release mmap_lock
                                      scan the 3rd 2M area (start from
                                      0x20400000)
                                      actually scan the new vma created by
                                      io_uring since the end was never updated

The below patch should be able to fix the problem (untested):

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5f7c60b8b269..e708c5d62325 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2441,8 +2441,10 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
                 memset(cc->node_load, 0, sizeof(cc->node_load));
                 result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
                                                  cc);
-                if (!mmap_locked)
+                if (!mmap_locked) {
                         *prev = NULL; /* Tell caller we dropped mmap_lock */
+                        hend = vma->end & HPAGE_PMD_MASK;
+                }

                 switch (result) {
                 case SCAN_SUCCEED:
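
For readers not staring at mm/khugepaged.c, here is a condensed sketch of
the loop shape in question (hand-simplified: signatures are approximate,
error handling and the result switch are elided, so this is not the
literal upstream code):

/*
 * Sketch only: hstart/hend are computed from the original vma once,
 * while revalidation inside the loop can hand back a different,
 * shorter vma.
 */
int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
                     unsigned long start, unsigned long end)
{
        struct collapse_control *cc;    /* allocation elided */
        struct mm_struct *mm = vma->vm_mm;
        unsigned long hstart, hend, addr;
        bool mmap_locked = true;
        int result = SCAN_FAIL;

        /* The scan range is derived from the vma ONCE, before the loop. */
        hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
        hend = end & HPAGE_PMD_MASK;

        for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
                if (!mmap_locked) {
                        mmap_read_lock(mm);
                        mmap_locked = true;
                        /* Can find a NEW, shorter vma after a racing mmap()... */
                        result = hugepage_vma_revalidate(mm, addr, &vma, cc);
                        if (result != SCAN_SUCCEED)
                                goto out_nolock;
                        /* ...but hend is never recomputed from that vma. */
                }
                /* Scans one PMD; may drop mmap_lock and reports it here. */
                result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
                if (!mmap_locked)
                        *prev = NULL;   /* tell caller we dropped mmap_lock */
                /* per-result handling elided */
        }
out_nolock:
        return result;
}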


>
>

2022-09-13 17:55:10

by Yang Shi

Subject: Re: [syzbot] BUG: Bad page map (5)

On Mon, Sep 12, 2022 at 2:47 PM Yang Shi <[email protected]> wrote:
>
> On Sun, Sep 11, 2022 at 9:27 PM syzbot
> <[email protected]> wrote:
> >
> > syzbot has found a reproducer for the following issue on:
> >
> > HEAD commit: e47eb90a0a9a Add linux-next specific files for 20220901
> > git tree: linux-next
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=17330430880000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=7933882276523081
> > dashboard link: https://syzkaller.appspot.com/bug?extid=915f3e317adb0e85835f
> > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=13397b77080000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1793564f080000
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: [email protected]
> >
> > BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067
> > addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
> > file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> > CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
> > Call Trace:
> > <TASK>
> > __dump_stack lib/dump_stack.c:88 [inline]
> > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
> > vm_normal_page+0x10c/0x2a0 mm/memory.c:636
> > hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
> > madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
> > madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
> > madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
> > do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
> > do_madvise mm/madvise.c:1428 [inline]
> > __do_sys_madvise mm/madvise.c:1428 [inline]
> > __se_sys_madvise mm/madvise.c:1426 [inline]
> > __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
> > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > RIP: 0033:0x7f770ba87929
> > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> > RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
> > RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> > RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
> > R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
> > R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
> > </TASK>
>
> I think I figured out the problem. The reproducer actually triggered
> the below race in madvise_collapse():
>
> CPU A                                 CPU B
> mmap 0x20000000 - 0x21000000 as anon
>                                       madvise_collapse is called on this area
>                                       retrieve start and end address from the
>                                       vma (NEVER updated later!)
>                                       collapse the first 2M area and drop
>                                       mmap_lock
> acquire mmap_lock
> mmap io_uring file at 0x20563000
> release mmap_lock
>                                       reacquire mmap_lock
>                                       vma revalidation passes, since
>                                       0x20200000 + 0x200000 <= 0x20563000
>                                       scan the next 2M (0x20200000 -
>                                       0x20400000), but for whatever reason it
>                                       didn't release mmap_lock
>                                       scan the 3rd 2M area (start from
>                                       0x20400000)
>                                       actually scan the new vma created by
>                                       io_uring since the end was never updated
>
> The below patch should be able to fix the problem (untested):
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5f7c60b8b269..e708c5d62325 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2441,8 +2441,10 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>                  memset(cc->node_load, 0, sizeof(cc->node_load));
>                  result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
>                                                   cc);
> -                if (!mmap_locked)
> +                if (!mmap_locked) {
>                          *prev = NULL; /* Tell caller we dropped mmap_lock */
> +                        hend = vma->end & HPAGE_PMD_MASK;
> +                }

This is wrong. We should refetch the vma end after
hugepage_vma_revalidate(); otherwise the vma is still the old one.

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a3acd3e5e0f3..1860be232a26 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2592,6 +2592,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
                                 last_fail = result;
                                 goto out_nolock;
                         }
+
+                        hend = vma->vm_end & HPAGE_PMD_MASK;
                 }
                 mmap_assert_locked(mm);
                 memset(cc->node_load, 0, sizeof(cc->node_load));
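
With the refetch placed there, every iteration that has to re-take
mmap_lock and re-look-up the vma refreshes the cached end before it is
used again. A minimal sketch of the resulting loop head, assuming the
names from the diff above:

        for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
                if (!mmap_locked) {
                        mmap_read_lock(mm);
                        mmap_locked = true;
                        result = hugepage_vma_revalidate(mm, addr, &vma, cc);
                        if (result != SCAN_SUCCEED) {
                                last_fail = result;
                                goto out_nolock;
                        }

                        /* vma may be a different, shorter vma now. */
                        hend = vma->vm_end & HPAGE_PMD_MASK;
                }
                mmap_assert_locked(mm);
                /*
                 * addr < hend once again implies the 2M region at addr
                 * lies inside the current vma; scan proceeds as before.
                 */
        }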


>
>                  switch (result) {
>                  case SCAN_SUCCEED:
>
>
> >
> >

2022-09-13 19:28:55

by Zach O'Keefe

Subject: Re: [syzbot] BUG: Bad page map (5)

On Sep 13 09:14, Yang Shi wrote:
> On Mon, Sep 12, 2022 at 2:47 PM Yang Shi <[email protected]> wrote:
> >
> > On Sun, Sep 11, 2022 at 9:27 PM syzbot
> > <[email protected]> wrote:
> > >
> > > syzbot has found a reproducer for the following issue on:
> > >
> > > HEAD commit: e47eb90a0a9a Add linux-next specific files for 20220901
> > > git tree: linux-next
> > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=17330430880000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=7933882276523081
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=915f3e317adb0e85835f
> > > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=13397b77080000
> > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1793564f080000
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: [email protected]
> > >
> > > BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067
> > > addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
> > > file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> > > CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
> > > Call Trace:
> > > <TASK>
> > > __dump_stack lib/dump_stack.c:88 [inline]
> > > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > > print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
> > > vm_normal_page+0x10c/0x2a0 mm/memory.c:636
> > > hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
> > > madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
> > > madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
> > > madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
> > > do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
> > > do_madvise mm/madvise.c:1428 [inline]
> > > __do_sys_madvise mm/madvise.c:1428 [inline]
> > > __se_sys_madvise mm/madvise.c:1426 [inline]
> > > __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
> > > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > RIP: 0033:0x7f770ba87929
> > > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > > RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> > > RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
> > > RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> > > RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
> > > R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
> > > R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
> > > </TASK>
> >
> > I think I figured out the problem. The reproducer actually triggered
> > the below race in madvise_collapse():
> >
> > CPU A                                 CPU B
> > mmap 0x20000000 - 0x21000000 as anon
> >                                       madvise_collapse is called on this area
> >                                       retrieve start and end address from the
> >                                       vma (NEVER updated later!)
> >                                       collapse the first 2M area and drop
> >                                       mmap_lock
> > acquire mmap_lock
> > mmap io_uring file at 0x20563000
> > release mmap_lock
> >                                       reacquire mmap_lock
> >                                       vma revalidation passes, since
> >                                       0x20200000 + 0x200000 <= 0x20563000
> >                                       scan the next 2M (0x20200000 -
> >                                       0x20400000), but for whatever reason it
> >                                       didn't release mmap_lock
> >                                       scan the 3rd 2M area (start from
> >                                       0x20400000)
> >                                       actually scan the new vma created by
> >                                       io_uring since the end was never updated
> >
> > The below patch should be able to fix the problem (untested):
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5f7c60b8b269..e708c5d62325 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2441,8 +2441,10 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                  memset(cc->node_load, 0, sizeof(cc->node_load));
> >                  result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> >                                                   cc);
> > -                if (!mmap_locked)
> > +                if (!mmap_locked) {
> >                          *prev = NULL; /* Tell caller we dropped mmap_lock */
> > +                        hend = vma->end & HPAGE_PMD_MASK;
> > +                }
>
> This is wrong. We should refetch the vma end after
> hugepage_vma_revalidate() otherwise the vma is still the old one.
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a3acd3e5e0f3..1860be232a26 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2592,6 +2592,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>                                  last_fail = result;
>                                  goto out_nolock;
>                          }
> +
> +                        hend = vma->vm_end & HPAGE_PMD_MASK;
>                  }
>                  mmap_assert_locked(mm);
>                  memset(cc->node_load, 0, sizeof(cc->node_load));
>
>
> >
> >                  switch (result) {
> >                  case SCAN_SUCCEED:
> >
> >

Hey Yang,

Thanks for triaging this, and apologies for intro'ing this bug.

Also thank you for the repro explanation - I believe you are correct here.

Generalizing the issue of:

1) hugepage_vma_revalidate() pmd X
2) collapse of pmd X doesn't drop mmap_lock
3) don't revalidate pmd X+1
4) attempt collapse of pmd X+1

I think the only problem is that

  hugepage_vma_revalidate()
    transhuge_vma_suitable()

only checks if a single hugepage-sized/aligned region properly fits / is aligned
in the VMA (i.e. the issue you found here). All other checks should be
intrinsic to the VMA itself and should be safe to skip if mmap_lock isn't
dropped since last hugepage_vma_revalidate().
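
Concretely, the containment check in question amounts to something like
the following (a paraphrase of the relevant part of
transhuge_vma_suitable(), not the verbatim helper; the real one also
checks file-offset alignment for non-anonymous VMAs):

/* Hypothetical helper name, for illustration only. */
static bool pmd_region_fits(struct vm_area_struct *vma, unsigned long addr)
{
        unsigned long haddr = addr & HPAGE_PMD_MASK;    /* round down to 2M */

        /* The PMD-sized region covering addr must lie inside the vma. */
        return haddr >= vma->vm_start &&
               haddr + HPAGE_PMD_SIZE <= vma->vm_end;
}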

As for the fix, I think your fix will work. If a VMA's size changes inside the
main for-loop of madvise_collapse, then at some point we will lock mmap_lock and
call hugepage_vma_revalidate(), which might fail itself if the next
hugepage-aligned/sized region is now not contained in the VMA. By updating
"hend" as you propose (i.e. using vma->m_end of the just-found VMA), we also
ensure that for "addr" < "hend", the hugepage-aligned/sized region at "addr"
will fit into the VMA. Note that we don't need to worry about the VMA being
shrank from the other direction, so updating "hend" should be enough.

I think the fix is fine as-is. I briefly thought a comment would be nice, but I
think the code is self-evident. The alternative is intro'ing another
transhuge_vma_suitable() call in the "if (!mmap_locked) { .. } else { .. }"
failure path, but I think your approach is easier to read.
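
For comparison, a rough sketch of that alternative (hypothetical code
using the names from the patches above, not something proposed for
merge):

                result = hpage_collapse_scan_pmd(mm, vma, addr,
                                                 &mmap_locked, cc);
                if (!mmap_locked) {
                        *prev = NULL;   /* mmap_lock was dropped */
                } else {
                        /*
                         * mmap_lock was held across the scan, so no
                         * revalidation happens before the next iteration:
                         * explicitly re-check that the next PMD region
                         * still fits the (possibly re-looked-up) vma.
                         */
                        if (addr + HPAGE_PMD_SIZE < hend &&
                            !transhuge_vma_suitable(vma, addr + HPAGE_PMD_SIZE))
                                break;
                }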

Thanks again for taking the time to debug this, and hopefully I can be more
careful in the future.

Best,
Zach

Reviewed-by: Zach O'Keefe <[email protected]>

2022-09-14 16:47:42

by Yang Shi

Subject: Re: [syzbot] BUG: Bad page map (5)

On Tue, Sep 13, 2022 at 11:39 AM Zach O'Keefe <[email protected]> wrote:
>
> On Sep 13 09:14, Yang Shi wrote:
> > On Mon, Sep 12, 2022 at 2:47 PM Yang Shi <[email protected]> wrote:
> > >
> > > On Sun, Sep 11, 2022 at 9:27 PM syzbot
> > > <[email protected]> wrote:
> > > >
> > > > syzbot has found a reproducer for the following issue on:
> > > >
> > > > HEAD commit: e47eb90a0a9a Add linux-next specific files for 20220901
> > > > git tree: linux-next
> > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=17330430880000
> > > > kernel config: https://syzkaller.appspot.com/x/.config?x=7933882276523081
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=915f3e317adb0e85835f
> > > > compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=13397b77080000
> > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1793564f080000
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > Reported-by: [email protected]
> > > >
> > > > BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067
> > > > addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
> > > > file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> > > > CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
> > > > Call Trace:
> > > > <TASK>
> > > > __dump_stack lib/dump_stack.c:88 [inline]
> > > > dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > > > print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
> > > > vm_normal_page+0x10c/0x2a0 mm/memory.c:636
> > > > hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
> > > > madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
> > > > madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
> > > > madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
> > > > do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
> > > > do_madvise mm/madvise.c:1428 [inline]
> > > > __do_sys_madvise mm/madvise.c:1428 [inline]
> > > > __se_sys_madvise mm/madvise.c:1426 [inline]
> > > > __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
> > > > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > > > entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > > RIP: 0033:0x7f770ba87929
> > > > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > > > RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> > > > RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
> > > > RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> > > > RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
> > > > R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
> > > > R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
> > > > </TASK>
> > >
> > > I think I figured out the problem. The reproducer actually triggered
> > > the below race in madvise_collapse():
> > >
> > > CPU A                                 CPU B
> > > mmap 0x20000000 - 0x21000000 as anon
> > >                                       madvise_collapse is called on this area
> > >                                       retrieve start and end address from the
> > >                                       vma (NEVER updated later!)
> > >                                       collapse the first 2M area and drop
> > >                                       mmap_lock
> > > acquire mmap_lock
> > > mmap io_uring file at 0x20563000
> > > release mmap_lock
> > >                                       reacquire mmap_lock
> > >                                       vma revalidation passes, since
> > >                                       0x20200000 + 0x200000 <= 0x20563000
> > >                                       scan the next 2M (0x20200000 -
> > >                                       0x20400000), but for whatever reason it
> > >                                       didn't release mmap_lock
> > >                                       scan the 3rd 2M area (start from
> > >                                       0x20400000)
> > >                                       actually scan the new vma created by
> > >                                       io_uring since the end was never updated
> > >
> > > The below patch should be able to fix the problem (untested):
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 5f7c60b8b269..e708c5d62325 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -2441,8 +2441,10 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >                  memset(cc->node_load, 0, sizeof(cc->node_load));
> > >                  result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> > >                                                   cc);
> > > -                if (!mmap_locked)
> > > +                if (!mmap_locked) {
> > >                          *prev = NULL; /* Tell caller we dropped mmap_lock */
> > > +                        hend = vma->end & HPAGE_PMD_MASK;
> > > +                }
> >
> > This is wrong. We should refetch the vma end after
> > hugepage_vma_revalidate() otherwise the vma is still the old one.
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index a3acd3e5e0f3..1860be232a26 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2592,6 +2592,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >                                  last_fail = result;
> >                                  goto out_nolock;
> >                          }
> > +
> > +                        hend = vma->vm_end & HPAGE_PMD_MASK;
> >                  }
> >                  mmap_assert_locked(mm);
> >                  memset(cc->node_load, 0, sizeof(cc->node_load));
> >
> >
> > >
> > >                  switch (result) {
> > >                  case SCAN_SUCCEED:
> > >
> > >
>
> Hey Yang,
>
> Thanks for triaging this, and apologies for intro'ing this bug.
>
> Also thank you for the repro explanation - I believe you are correct here.
>
> Generalizing the issue of:
>
> 1) hugepage_vma_revalidate() pmd X
> 2) collapse of pmd X doesn't drop mmap_lock
> 3) don't revalidate pmd X+1
> 4) attempt collapse of pmd X+1
>
> I think the only problem is that
>
>   hugepage_vma_revalidate()
>     transhuge_vma_suitable()
>
> only checks if a single hugepage-sized/aligned region properly fits / is aligned

I think it is what transhuge_vma_suitable() is designed for. As long
as one hugepage fits, it is suitable.

> in the VMA (i.e. the issue you found here). All other checks should be
> intrinsic to the VMA itself and should be safe to skip if mmap_lock isn't
> dropped since last hugepage_vma_revalidate().
>
> As for the fix, I think your fix will work. If a VMA's size changes inside the
> main for-loop of madvise_collapse, then at some point we will lock mmap_lock and
> call hugepage_vma_revalidate(), which might fail itself if the next
> hugepage-aligned/sized region is now not contained in the VMA. By updating
> "hend" as you propose (i.e. using vma->m_end of the just-found VMA), we also
> ensure that for "addr" < "hend", the hugepage-aligned/sized region at "addr"
> will fit into the VMA. Note that we don't need to worry about the VMA being
> shrank from the other direction, so updating "hend" should be enough.

Yeah, we don't have to worry about the other direction.
hugepage_vma_revalidate() handles it correctly: either no valid vma is
found, or the vma doesn't fit anymore.

>
> I think the fix is fine as-is. I briefly thought a comment would be nice, but I
> think the code is self-evident. The alternative is intro'ing another
> transhuge_vma_suitable() call in the "if (!mmap_locked) { .. } else { .. }"
> failure path, but I think your approach is easier to read.
>
> Thanks again for taking the time to debug this, and hopefully I can be more
> careful in the future.

It is fine.

>
> Best,
> Zach
>
> Reviewed-by: Zach O'Keefe <[email protected]>

Thanks.

>