2024-05-31 08:24:44

by kernel test robot

Subject: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h



Hello,

kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:

commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
[test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]

in testcase: trinity
version: trinity-x86_64-6a17c218-1_20240527
with the following parameters:

runtime: 300s
group: group-00
nr_groups: 5



compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to the attached dmesg/kmsg for the entire log/backtrace)


we noticed the issue does not always happen: it reproduced in 34 out of 50 runs,
as shown below. the parent commit is clean.

1803d0c5ee1a3bbe  efa7df3e3bb5da8e6abbe377274
----------------  ---------------------------
       fail:runs   %reproduction   fail:runs
           |             |             |
            :50         68%          34:50   dmesg.Kernel_panic-not_syncing:Fatal_exception
            :50         68%          34:50   dmesg.RIP:try_get_folio
            :50         68%          34:50   dmesg.invalid_opcode:#[##]
            :50         68%          34:50   dmesg.kernel_BUG_at_include/linux/page_ref.h



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


[ 275.267158][ T4335] ------------[ cut here ]------------
[ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
[ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
[ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
[ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
[ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
All code
========
0: c3 ret
1: cc int3
2: cc int3
3: cc int3
4: cc int3
5: 44 89 e6 mov %r12d,%esi
8: 48 89 df mov %rbx,%rdi
b: e8 e4 54 11 00 call 0x1154f4
10: eb ae jmp 0xffffffffffffffc0
12: 90 nop
13: 0f 0b ud2
15: 90 nop
16: 31 db xor %ebx,%ebx
18: eb d5 jmp 0xffffffffffffffef
1a: 9c pushf
1b: 58 pop %rax
1c: 0f 1f 40 00 nopl 0x0(%rax)
20: f6 c4 02 test $0x2,%ah
23: 0f 84 46 ff ff ff je 0xffffffffffffff6f
29: 90 nop
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 c7 c6 a0 54 d2 87 mov $0xffffffff87d254a0,%rsi
33: 48 89 df mov %rbx,%rdi
36: e8 a9 e9 ff ff call 0xffffffffffffe9e4
3b: 90 nop
3c: 0f 0b ud2
3e: be .byte 0xbe
3f: 04 .byte 0x4

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 c7 c6 a0 54 d2 87 mov $0xffffffff87d254a0,%rsi
9: 48 89 df mov %rbx,%rdi
c: e8 a9 e9 ff ff call 0xffffffffffffe9ba
11: 90 nop
12: 0f 0b ud2
14: be .byte 0xbe
15: 04 .byte 0x4
[ 275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202
[ 275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000
[ 275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034
[ 275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006
[ 275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136
[ 275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008
[ 275.276790][ T4335] FS: 00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000
[ 275.277570][ T4335] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0
[ 275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 275.280201][ T4335] Call Trace:
[ 275.280499][ T4335] <TASK>
[ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
[ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153)
[ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
[ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
[ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174)
[ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
[ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212)
[ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
[ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264)
[ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
[ 275.285278][ T4335] try_grab_folio (mm/gup.c:148)
[ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1))
[ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188)
[ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825)
[ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1))
[ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209)
[ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204)
[ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722)
[ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106)
[ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350)
[ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350)
[ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1))
[ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573)
[ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360
[ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10
[ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573)
[ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0
[ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228)
[ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10
[ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284)
[ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259)
[ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
[ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291)
[ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112)
[ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1))
[ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
[ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
[ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
[ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
[ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
[ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
[ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
[ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
[ 275.300622][ T4335] RIP: 0033:0x7fa2f9c65719
[ 275.301011][ T4335] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
All code
========
0: 08 89 e8 5b 5d c3 or %cl,-0x3ca2a418(%rcx)
6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
d: 00 00 00
10: 90 nop
11: 48 89 f8 mov %rdi,%rax
14: 48 89 f7 mov %rsi,%rdi
17: 48 89 d6 mov %rdx,%rsi
1a: 48 89 ca mov %rcx,%rdx
1d: 4d 89 c2 mov %r8,%r10
20: 4d 89 c8 mov %r9,%r8
23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 ret
33: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06f1
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W

Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 ret
9: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06c7
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240531/[email protected]



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



2024-05-31 16:51:37

by Yang Shi

Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
>
> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
>
> in testcase: trinity
> version: trinity-x86_64-6a17c218-1_20240527
> with following parameters:
>
> runtime: 300s
> group: group-00
> nr_groups: 5
>
>
>
> compiler: gcc-13
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> the parent is clean.
>
> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
> :50 68% 34:50 dmesg.RIP:try_get_folio
> :50 68% 34:50 dmesg.invalid_opcode:#[##]
> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>
>
> [ 275.267158][ T4335] ------------[ cut here ]------------
> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04

If I read this BUG correctly, it is:

VM_BUG_ON(!in_atomic() && !irqs_disabled());

try_grab_folio() actually assumes it is called in an atomic context (IRQs
disabled or preemption disabled) on this call path. That is guaranteed by
disabling IRQs in GUP-fast, or by calling it inside an RCU critical section
on the page cache lookup path.
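
For reference, that check sits in folio_ref_try_add_rcu(); a simplified
paraphrase (not the verbatim header, and the exact #ifdef layout depends
on the kernel version and config) looks like:

/* Paraphrased sketch of folio_ref_try_add_rcu() from include/linux/page_ref.h;
 * not the verbatim source. */
static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
{
#ifdef CONFIG_TINY_RCU
	/* !SMP builds: the caller only needs preemption/IRQs disabled. */
	VM_BUG_ON(!in_atomic() && !irqs_disabled());	/* <-- the BUG that fires */
	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
	folio_ref_add(folio, count);
#else
	/* SMP builds: a speculative, RCU-safe reference that can fail. */
	if (unlikely(!folio_ref_add_unless(folio, count, 0)))
		return false;
#endif
	return true;
}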

And try_grab_folio() is used when the folio is a large folio. The
bisected commit makes the fuzz test more likely than before to get a
PMD-aligned address and hence a large folio, and process_vm_readv/writev
doesn't handle the large folio case at all. A properly aligned address,
for example one allocated with posix_memalign(), should be able to
trigger this BUG even without the bisected commit.

We can't call pin_user_pages_remote() inside an RCU critical section
since it may sleep, and I don't think we have a GUP-fast remote variant,
if I remember correctly. It also doesn't make sense to disallow large
folios for process_vm_readv/writev.

Maybe a new GUP flag, or just use FOLL_LONGTERM, to let GUP call
try_grab_folio() in an RCU critical section? Adding more GUP folks to
the loop.


> All code
> ========
> 0: c3 ret
> 1: cc int3
> 2: cc int3
> 3: cc int3
> 4: cc int3
> 5: 44 89 e6 mov %r12d,%esi
> 8: 48 89 df mov %rbx,%rdi
> b: e8 e4 54 11 00 call 0x1154f4
> 10: eb ae jmp 0xffffffffffffffc0
> 12: 90 nop
> 13: 0f 0b ud2
> 15: 90 nop
> 16: 31 db xor %ebx,%ebx
> 18: eb d5 jmp 0xffffffffffffffef
> 1a: 9c pushf
> 1b: 58 pop %rax
> 1c: 0f 1f 40 00 nopl 0x0(%rax)
> 20: f6 c4 02 test $0x2,%ah
> 23: 0f 84 46 ff ff ff je 0xffffffffffffff6f
> 29: 90 nop
> 2a:* 0f 0b ud2 <-- trapping instruction
> 2c: 48 c7 c6 a0 54 d2 87 mov $0xffffffff87d254a0,%rsi
> 33: 48 89 df mov %rbx,%rdi
> 36: e8 a9 e9 ff ff call 0xffffffffffffe9e4
> 3b: 90 nop
> 3c: 0f 0b ud2
> 3e: be .byte 0xbe
> 3f: 04 .byte 0x4
>
> Code starting with the faulting instruction
> ===========================================
> 0: 0f 0b ud2
> 2: 48 c7 c6 a0 54 d2 87 mov $0xffffffff87d254a0,%rsi
> 9: 48 89 df mov %rbx,%rdi
> c: e8 a9 e9 ff ff call 0xffffffffffffe9ba
> 11: 90 nop
> 12: 0f 0b ud2
> 14: be .byte 0xbe
> 15: 04 .byte 0x4
> [ 275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202
> [ 275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000
> [ 275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034
> [ 275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006
> [ 275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136
> [ 275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008
> [ 275.276790][ T4335] FS: 00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000
> [ 275.277570][ T4335] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0
> [ 275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 275.280201][ T4335] Call Trace:
> [ 275.280499][ T4335] <TASK>
> [ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> [ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153)
> [ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174)
> [ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212)
> [ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264)
> [ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
> [ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> [ 275.285278][ T4335] try_grab_folio (mm/gup.c:148)
> [ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1))
> [ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188)
> [ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825)
> [ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1))
> [ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209)
> [ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204)
> [ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722)
> [ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106)
> [ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350)
> [ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350)
> [ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1))
> [ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573)
> [ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360
> [ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10
> [ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573)
> [ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0
> [ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228)
> [ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10
> [ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284)
> [ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259)
> [ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
> [ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291)
> [ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112)
> [ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1))
> [ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
> [ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
> [ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
> [ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
> [ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
> [ 275.300622][ T4335] RIP: 0033:0x7fa2f9c65719
> [ 275.301011][ T4335] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
> All code
> ========
> 0: 08 89 e8 5b 5d c3 or %cl,-0x3ca2a418(%rcx)
> 6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
> d: 00 00 00
> 10: 90 nop
> 11: 48 89 f8 mov %rdi,%rax
> 14: 48 89 f7 mov %rsi,%rdi
> 17: 48 89 d6 mov %rdx,%rsi
> 1a: 48 89 ca mov %rcx,%rdx
> 1d: 4d 89 c2 mov %r8,%r10
> 20: 4d 89 c8 mov %r9,%r8
> 23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
> 28: 0f 05 syscall
> 2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
> 30: 73 01 jae 0x33
> 32: c3 ret
> 33: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06f1
> 3a: f7 d8 neg %eax
> 3c: 64 89 01 mov %eax,%fs:(%rcx)
> 3f: 48 rex.W
>
> Code starting with the faulting instruction
> ===========================================
> 0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
> 6: 73 01 jae 0x9
> 8: c3 ret
> 9: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06c7
> 10: f7 d8 neg %eax
> 12: 64 89 01 mov %eax,%fs:(%rcx)
> 15: 48 rex.W
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240531/[email protected]
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>

2024-05-31 17:46:58

by David Hildenbrand

Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On 31.05.24 18:50, Yang Shi wrote:
> On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
>>
>>
>>
>> Hello,
>>
>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
>>
>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
>>
>> in testcase: trinity
>> version: trinity-x86_64-6a17c218-1_20240527
>> with following parameters:
>>
>> runtime: 300s
>> group: group-00
>> nr_groups: 5
>>
>>
>>
>> compiler: gcc-13
>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>>
>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>
>>
>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
>> the parent is clean.
>>
>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
>> ---------------- ---------------------------
>> fail:runs %reproduction fail:runs
>> | | |
>> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
>> :50 68% 34:50 dmesg.RIP:try_get_folio
>> :50 68% 34:50 dmesg.invalid_opcode:#[##]
>> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
>>
>>
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <[email protected]>
>> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>>
>>
>> [ 275.267158][ T4335] ------------[ cut here ]------------
>> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
>> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
>> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
>> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
>
> If I read this BUG correctly, it is:
>
> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>

Yes, that seems to be the one.

> try_grab_folio() actually assumes it is in an atomic context (irq
> disabled or preempt disabled) for this call path. This is achieved by
> disabling irq in gup fast or calling it in rcu critical section in
> page cache lookup path

try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()

is called (mm-unstable) from:

(1) the gup_fast functions, where IRQs are disabled
(2) gup_hugepte(), possibly problematic
(3) memfd_pin_folios(), possibly problematic
(4) __get_user_pages(), likely problematic

(1) should be fine.

(2) is possibly problematic on the !fast path. If so, likely due to commit
a12083d721d7 ("mm/gup: handle hugepd for follow_page()"). CCing Peter.

(3) is possibly wrong. CCing Vivek.

(4) is what we hit here

>
> And try_grab_folio() is used when the folio is a large folio. The


We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()

That code was added in

commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
Author: Peter Xu <[email protected]>
Date: Wed Jun 28 17:53:07 2023 -0400

mm/gup: accelerate thp gup even for "pages != NULL"

The acceleration of THP was done with ctx.page_mask, however it'll be
ignored if **pages is non-NULL.


Likely the try_grab_folio() in __get_user_pages() is wrong?

As documented, we already hold a refcount there. We should probably just
do a folio_ref_add() and sanity-check the refcount.
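
A rough sketch of that direction (hypothetical only, reusing the
surrounding __get_user_pages() variables and ignoring the FOLL_PIN
accounting the real helper also has to do):

	/* Hypothetical sketch, not a tested patch: we already hold a
	 * reference on the head page, so the extra references for the
	 * remaining subpages of a large folio could be a plain,
	 * sanity-checked ref bump instead of the RCU-style grab. */
	if (page_increm > 1) {
		struct folio *folio = page_folio(page);

		if (WARN_ON_ONCE(folio_ref_count(folio) <= 0)) {
			ret = -EFAULT;		/* hypothetical error handling */
			goto out;
		}
		folio_ref_add(folio, page_increm - 1);
	}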


In essence, I think: try_grab_folio() should only be called from GUP-fast where
IRQs are disabled.

(2), (3) and (4) are possible offenders of that.


Or am I getting it all wrong? :)

--
Cheers,

David / dhildenb


2024-05-31 18:08:29

by Yang Shi

Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <[email protected]> wrote:
>
> On 31.05.24 18:50, Yang Shi wrote:
> > On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
> >>
> >>
> >>
> >> Hello,
> >>
> >> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> >>
> >> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>
> >> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> >> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> >>
> >> in testcase: trinity
> >> version: trinity-x86_64-6a17c218-1_20240527
> >> with following parameters:
> >>
> >> runtime: 300s
> >> group: group-00
> >> nr_groups: 5
> >>
> >>
> >>
> >> compiler: gcc-13
> >> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >>
> >> (please refer to attached dmesg/kmsg for entire log/backtrace)
> >>
> >>
> >> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> >> the parent is clean.
> >>
> >> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> >> ---------------- ---------------------------
> >> fail:runs %reproduction fail:runs
> >> | | |
> >> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
> >> :50 68% 34:50 dmesg.RIP:try_get_folio
> >> :50 68% 34:50 dmesg.invalid_opcode:#[##]
> >> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
> >>
> >>
> >>
> >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >> the same patch/commit), kindly add following tags
> >> | Reported-by: kernel test robot <[email protected]>
> >> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
> >>
> >>
> >> [ 275.267158][ T4335] ------------[ cut here ]------------
> >> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> >> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> >> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> >> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> >> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> >> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> >
> > If I read this BUG correctly, it is:
> >
> > VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >
>
> Yes, that seems to be the one.
>
> > try_grab_folio() actually assumes it is in an atomic context (irq
> > disabled or preempt disabled) for this call path. This is achieved by
> > disabling irq in gup fast or calling it in rcu critical section in
> > page cache lookup path
>
> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>
> Is called (mm-unstable) from:
>
> (1) gup_fast function, here IRQs are disable
> (2) gup_hugepte(), possibly problematic
> (3) memfd_pin_folios(), possibly problematic
> (4) __get_user_pages(), likely problematic
>
> (1) should be fine.
>
> (2) is possibly problematic on the !fast path. If so, due to commit
> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>
> (3) is possibly wrong. CCing Vivek.
>
> (4) is what we hit here
>
> >
> > And try_grab_folio() is used when the folio is a large folio. The
>
>
> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
>
> That code was added in
>
> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> Author: Peter Xu <[email protected]>
> Date: Wed Jun 28 17:53:07 2023 -0400
>
> mm/gup: accelerate thp gup even for "pages != NULL"
>
> The acceleration of THP was done with ctx.page_mask, however it'll be
> ignored if **pages is non-NULL.
>
>
> Likely the try_grab_folio() in __get_user_pages() is wrong?
>
> As documented, we already hold a refcount. Likely we should better do a
> folio_ref_add() and sanity check the refcount.

Yes, a plain folio_ref_add() seems ok for these cases.

In addition, the comment of folio_try_get_rcu(), which is just a
wrapper of folio_ref_try_add_rcu(), says:

You can also use this function if you're holding a lock that prevents
pages being frozen & removed; eg the i_pages lock for the page cache
or the mmap_lock or page table lock for page tables. In this case, it
will always succeed, and you could have used a plain folio_get(), but
it's sometimes more convenient to have a common function called from
both locked and RCU-protected contexts.

So IIUC we can use a plain folio_get(), at least for
process_vm_readv/writev, since mmap_lock is held on this path.
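
I.e., something like this sketch (not actual kernel code), assuming we
already looked the page up under mmap_lock:

	struct folio *folio = page_folio(page);

	/* Sketch only: with mmap_lock held the folio cannot be frozen and
	 * freed under us, so a plain, non-failing reference bump suffices. */
	folio_ref_add(folio, refs);	/* or folio_get(folio) for a single ref */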

>
>
> In essence, I think: try_grab_folio() should only be called from GUP-fast where
> IRQs are disabled.

Yes, I agree. Only the fast path should need to call try_grab_folio().

>
> (2), (3) and (4) are possible offenders of that.
>
>
> Or am I getting it all wrong? :)
>
> --
> Cheers,
>
> David / dhildenb
>

2024-05-31 18:13:45

by Yang Shi

Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 11:07 AM Yang Shi <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <[email protected]> wrote:
> >
> > On 31.05.24 18:50, Yang Shi wrote:
> > > On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >> Hello,
> > >>
> > >> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> > >>
> > >> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> > >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >>
> > >> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> > >> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> > >>
> > >> in testcase: trinity
> > >> version: trinity-x86_64-6a17c218-1_20240527
> > >> with following parameters:
> > >>
> > >> runtime: 300s
> > >> group: group-00
> > >> nr_groups: 5
> > >>
> > >>
> > >>
> > >> compiler: gcc-13
> > >> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > >>
> > >> (please refer to attached dmesg/kmsg for entire log/backtrace)
> > >>
> > >>
> > >> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> > >> the parent is clean.
> > >>
> > >> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> > >> ---------------- ---------------------------
> > >> fail:runs %reproduction fail:runs
> > >> | | |
> > >> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
> > >> :50 68% 34:50 dmesg.RIP:try_get_folio
> > >> :50 68% 34:50 dmesg.invalid_opcode:#[##]
> > >> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
> > >>
> > >>
> > >>
> > >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > >> the same patch/commit), kindly add following tags
> > >> | Reported-by: kernel test robot <[email protected]>
> > >> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > >>
> > >>
> > >> [ 275.267158][ T4335] ------------[ cut here ]------------
> > >> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> > >> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> > >> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> > >> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > >> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> > >> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> > >
> > > If I read this BUG correctly, it is:
> > >
> > > VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > >
> >
> > Yes, that seems to be the one.
> >
> > > try_grab_folio() actually assumes it is in an atomic context (irq
> > > disabled or preempt disabled) for this call path. This is achieved by
> > > disabling irq in gup fast or calling it in rcu critical section in
> > > page cache lookup path
> >
> > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >
> > Is called (mm-unstable) from:
> >
> > (1) gup_fast function, here IRQs are disable
> > (2) gup_hugepte(), possibly problematic
> > (3) memfd_pin_folios(), possibly problematic
> > (4) __get_user_pages(), likely problematic
> >
> > (1) should be fine.
> >
> > (2) is possibly problematic on the !fast path. If so, due to commit
> > a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >
> > (3) is possibly wrong. CCing Vivek.
> >
> > (4) is what we hit here
> >
> > >
> > > And try_grab_folio() is used when the folio is a large folio. The
> >
> >
> > We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
> >
> > That code was added in
> >
> > commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> > Author: Peter Xu <[email protected]>
> > Date: Wed Jun 28 17:53:07 2023 -0400
> >
> > mm/gup: accelerate thp gup even for "pages != NULL"
> >
> > The acceleration of THP was done with ctx.page_mask, however it'll be
> > ignored if **pages is non-NULL.
> >
> >
> > Likely the try_grab_folio() in __get_user_pages() is wrong?
> >
> > As documented, we already hold a refcount. Likely we should better do a
> > folio_ref_add() and sanity check the refcount.
>
> Yes, a plain folio_ref_add() seems ok for these cases.
>
> In addition, the comment of folio_try_get_rcu() says, which is just a
> wrapper of folio_ref_try_add_rcu():
>
> You can also use this function if you're holding a lock that prevents
> pages being frozen & removed; eg the i_pages lock for the page cache
> or the mmap_lock or page table lock for page tables. In this case, it
> will always succeed, and you could have used a plain folio_get(), but
> it's sometimes more convenient to have a common function called from
> both locked and RCU-protected contexts.
>
> So IIUC we can use the plain folio_get() at least for
> process_vm_readv/writev since mmap_lock is held in this path.
>
> >
> >
> > In essence, I think: try_grab_folio() should only be called from GUP-fast where
> > IRQs are disabled.
>
> Yes, I agree. Just the fast path should need to call try_grab_folio().

try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
keep calling it and add a flag to try_grab_folio(), something like:

    if flag is true
        folio_ref_add()
    else
        try_get_folio()

>
> >
> > (2), (3) and (4) are possible offenders of that.
> >
> >
> > Or am I getting it all wrong? :)
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

2024-05-31 18:32:08

by David Hildenbrand

Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On 31.05.24 20:13, Yang Shi wrote:
> On Fri, May 31, 2024 at 11:07 AM Yang Shi <[email protected]> wrote:
>>
>> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 31.05.24 18:50, Yang Shi wrote:
>>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
>>>>>
>>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
>>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
>>>>>
>>>>> in testcase: trinity
>>>>> version: trinity-x86_64-6a17c218-1_20240527
>>>>> with following parameters:
>>>>>
>>>>> runtime: 300s
>>>>> group: group-00
>>>>> nr_groups: 5
>>>>>
>>>>>
>>>>>
>>>>> compiler: gcc-13
>>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>>>>>
>>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>>>>
>>>>>
>>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
>>>>> the parent is clean.
>>>>>
>>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
>>>>> ---------------- ---------------------------
>>>>> fail:runs %reproduction fail:runs
>>>>> | | |
>>>>> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
>>>>> :50 68% 34:50 dmesg.RIP:try_get_folio
>>>>> :50 68% 34:50 dmesg.invalid_opcode:#[##]
>>>>> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
>>>>>
>>>>>
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <[email protected]>
>>>>> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>>>>>
>>>>>
>>>>> [ 275.267158][ T4335] ------------[ cut here ]------------
>>>>> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
>>>>> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
>>>>> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
>>>>> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
>>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
>>>>
>>>> If I read this BUG correctly, it is:
>>>>
>>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>>>>
>>>
>>> Yes, that seems to be the one.
>>>
>>>> try_grab_folio() actually assumes it is in an atomic context (irq
>>>> disabled or preempt disabled) for this call path. This is achieved by
>>>> disabling irq in gup fast or calling it in rcu critical section in
>>>> page cache lookup path
>>>
>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>>>
>>> Is called (mm-unstable) from:
>>>
>>> (1) gup_fast function, here IRQs are disable
>>> (2) gup_hugepte(), possibly problematic
>>> (3) memfd_pin_folios(), possibly problematic
>>> (4) __get_user_pages(), likely problematic
>>>
>>> (1) should be fine.
>>>
>>> (2) is possibly problematic on the !fast path. If so, due to commit
>>> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>>>
>>> (3) is possibly wrong. CCing Vivek.
>>>
>>> (4) is what we hit here
>>>
>>>>
>>>> And try_grab_folio() is used when the folio is a large folio. The
>>>
>>>
>>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
>>>
>>> That code was added in
>>>
>>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
>>> Author: Peter Xu <[email protected]>
>>> Date: Wed Jun 28 17:53:07 2023 -0400
>>>
>>> mm/gup: accelerate thp gup even for "pages != NULL"
>>>
>>> The acceleration of THP was done with ctx.page_mask, however it'll be
>>> ignored if **pages is non-NULL.
>>>
>>>
>>> Likely the try_grab_folio() in __get_user_pages() is wrong?
>>>
>>> As documented, we already hold a refcount. Likely we should better do a
>>> folio_ref_add() and sanity check the refcount.
>>
>> Yes, a plain folio_ref_add() seems ok for these cases.
>>
>> In addition, the comment of folio_try_get_rcu() says, which is just a
>> wrapper of folio_ref_try_add_rcu():
>>
>> You can also use this function if you're holding a lock that prevents
>> pages being frozen & removed; eg the i_pages lock for the page cache
>> or the mmap_lock or page table lock for page tables. In this case, it
>> will always succeed, and you could have used a plain folio_get(), but
>> it's sometimes more convenient to have a common function called from
>> both locked and RCU-protected contexts.
>>
>> So IIUC we can use the plain folio_get() at least for
>> process_vm_readv/writev since mmap_lock is held in this path.
>>
>>>
>>>
>>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
>>> IRQs are disabled.
>>
>> Yes, I agree. Just the fast path should need to call try_grab_folio().
>
> try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
> keep calling it and add a flag to try_grab_folio, just like:
>
> if flag is true
> folio_ref_add()
> else
> try_get_folio()


try_grab_page() is what we use on the GUP-slow path. We'd likely want a
folio variant of that.

We might want to call that gup_try_grab_folio() and rename the other one
to gup_fast_try_grab_folio().

Or something like that :)

--
Cheers,

David / dhildenb


2024-05-31 18:32:23

by Yang Shi

Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 11:24 AM David Hildenbrand <[email protected]> wrote:
>
> On 31.05.24 20:13, Yang Shi wrote:
> > On Fri, May 31, 2024 at 11:07 AM Yang Shi <[email protected]> wrote:
> >>
> >> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 31.05.24 18:50, Yang Shi wrote:
> >>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> >>>>>
> >>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> >>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>>>
> >>>>> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> >>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> >>>>>
> >>>>> in testcase: trinity
> >>>>> version: trinity-x86_64-6a17c218-1_20240527
> >>>>> with following parameters:
> >>>>>
> >>>>> runtime: 300s
> >>>>> group: group-00
> >>>>> nr_groups: 5
> >>>>>
> >>>>>
> >>>>>
> >>>>> compiler: gcc-13
> >>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >>>>>
> >>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
> >>>>>
> >>>>>
> >>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> >>>>> the parent is clean.
> >>>>>
> >>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> >>>>> ---------------- ---------------------------
> >>>>> fail:runs %reproduction fail:runs
> >>>>> | | |
> >>>>> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
> >>>>> :50 68% 34:50 dmesg.RIP:try_get_folio
> >>>>> :50 68% 34:50 dmesg.invalid_opcode:#[##]
> >>>>> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
> >>>>>
> >>>>>
> >>>>>
> >>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >>>>> the same patch/commit), kindly add following tags
> >>>>> | Reported-by: kernel test robot <[email protected]>
> >>>>> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
> >>>>>
> >>>>>
> >>>>> [ 275.267158][ T4335] ------------[ cut here ]------------
> >>>>> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> >>>>> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> >>>>> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> >>>>> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> >>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> >>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> >>>>
> >>>> If I read this BUG correctly, it is:
> >>>>
> >>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >>>>
> >>>
> >>> Yes, that seems to be the one.
> >>>
> >>>> try_grab_folio() actually assumes it is in an atomic context (irq
> >>>> disabled or preempt disabled) for this call path. This is achieved by
> >>>> disabling irq in gup fast or calling it in rcu critical section in
> >>>> page cache lookup path
> >>>
> >>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >>>
> >>> Is called (mm-unstable) from:
> >>>
> >>> (1) gup_fast function, here IRQs are disable
> >>> (2) gup_hugepte(), possibly problematic
> >>> (3) memfd_pin_folios(), possibly problematic
> >>> (4) __get_user_pages(), likely problematic
> >>>
> >>> (1) should be fine.
> >>>
> >>> (2) is possibly problematic on the !fast path. If so, due to commit
> >>> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >>>
> >>> (3) is possibly wrong. CCing Vivek.
> >>>
> >>> (4) is what we hit here
> >>>
> >>>>
> >>>> And try_grab_folio() is used when the folio is a large folio. The
> >>>
> >>>
> >>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
> >>>
> >>> That code was added in
> >>>
> >>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> >>> Author: Peter Xu <[email protected]>
> >>> Date: Wed Jun 28 17:53:07 2023 -0400
> >>>
> >>> mm/gup: accelerate thp gup even for "pages != NULL"
> >>>
> >>> The acceleration of THP was done with ctx.page_mask, however it'll be
> >>> ignored if **pages is non-NULL.
> >>>
> >>>
> >>> Likely the try_grab_folio() in __get_user_pages() is wrong?
> >>>
> >>> As documented, we already hold a refcount. Likely we should better do a
> >>> folio_ref_add() and sanity check the refcount.
> >>
> >> Yes, a plain folio_ref_add() seems ok for these cases.
> >>
> >> In addition, the comment of folio_try_get_rcu() says, which is just a
> >> wrapper of folio_ref_try_add_rcu():
> >>
> >> You can also use this function if you're holding a lock that prevents
> >> pages being frozen & removed; eg the i_pages lock for the page cache
> >> or the mmap_lock or page table lock for page tables. In this case, it
> >> will always succeed, and you could have used a plain folio_get(), but
> >> it's sometimes more convenient to have a common function called from
> >> both locked and RCU-protected contexts.
> >>
> >> So IIUC we can use the plain folio_get() at least for
> >> process_vm_readv/writev since mmap_lock is held in this path.
> >>
> >>>
> >>>
> >>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
> >>> IRQs are disabled.
> >>
> >> Yes, I agree. Just the fast path should need to call try_grab_folio().
> >
> > try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
> > keep calling it and add a flag to try_grab_folio, just like:
> >
> > if flag is true
> > folio_ref_add()
> > else
> > try_get_folio()
>
>
> try_grab_page() is what we use on the GUP-slow path. We'd likely want a
> folio variant of that.
>
> We might want to call that gup_try_grab_folio() and rename the other one
> to gup_fast_try_grab_folio().

Won't we duplicate most of the code with two versions of try_grab_folio()?

I meant something like:

try_grab_folio(struct page *page, int refs, unsigned int flags, bool fast)
{
        if fast
                try_get_folio()
        else
                folio_ref_add()
}

We can keep the duplicated code minimum in this way.
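
Fleshed out a little, a hypothetical sketch of that idea (not actual
kernel code, and the FOLL_PIN/FOLL_GET accounting is left out):

	/* Hypothetical single entry point; "fast" says whether we are on
	 * the GUP-fast path, where the folio may be freed concurrently
	 * and the speculative grab can fail. */
	static struct folio *try_grab_folio(struct page *page, int refs,
					    unsigned int flags, bool fast)
	{
		struct folio *folio;

		if (fast)
			return try_get_folio(page, refs);	/* RCU-safe, may fail */

		/* Slow path: the caller holds mmap_lock/PTL and a reference
		 * already, so a plain reference bump cannot fail. */
		folio = page_folio(page);
		folio_ref_add(folio, refs);

		/* FOLL_PIN vs. FOLL_GET accounting would follow here, as today. */
		return folio;
	}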

>
> Or something like that :)
>
> --
> Cheers,
>
> David / dhildenb
>

2024-05-31 18:38:56

by David Hildenbrand

Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On 31.05.24 20:30, Yang Shi wrote:
> On Fri, May 31, 2024 at 11:24 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 31.05.24 20:13, Yang Shi wrote:
>>> On Fri, May 31, 2024 at 11:07 AM Yang Shi <[email protected]> wrote:
>>>>
>>>> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 31.05.24 18:50, Yang Shi wrote:
>>>>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
>>>>>>>
>>>>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>>>
>>>>>>> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
>>>>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
>>>>>>>
>>>>>>> in testcase: trinity
>>>>>>> version: trinity-x86_64-6a17c218-1_20240527
>>>>>>> with following parameters:
>>>>>>>
>>>>>>> runtime: 300s
>>>>>>> group: group-00
>>>>>>> nr_groups: 5
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> compiler: gcc-13
>>>>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>>>>>>>
>>>>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>>>>>>
>>>>>>>
>>>>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
>>>>>>> the parent is clean.
>>>>>>>
>>>>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
>>>>>>> ---------------- ---------------------------
>>>>>>> fail:runs %reproduction fail:runs
>>>>>>> | | |
>>>>>>> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
>>>>>>> :50 68% 34:50 dmesg.RIP:try_get_folio
>>>>>>> :50 68% 34:50 dmesg.invalid_opcode:#[##]
>>>>>>> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>>>>> the same patch/commit), kindly add following tags
>>>>>>> | Reported-by: kernel test robot <[email protected]>
>>>>>>> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>>>>>>>
>>>>>>>
>>>>>>> [ 275.267158][ T4335] ------------[ cut here ]------------
>>>>>>> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
>>>>>>> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
>>>>>>> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
>>>>>>> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>>>>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
>>>>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
>>>>>>
>>>>>> If I read this BUG correctly, it is:
>>>>>>
>>>>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>>>>>>
>>>>>
>>>>> Yes, that seems to be the one.
>>>>>
>>>>>> try_grab_folio() actually assumes it is in an atomic context (irq
>>>>>> disabled or preempt disabled) for this call path. This is achieved by
>>>>>> disabling irq in gup fast or calling it in rcu critical section in
>>>>>> page cache lookup path
>>>>>
>>>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>>>>>
>>>>> Is called (mm-unstable) from:
>>>>>
>>>>> (1) gup_fast function, here IRQs are disable
>>>>> (2) gup_hugepte(), possibly problematic
>>>>> (3) memfd_pin_folios(), possibly problematic
>>>>> (4) __get_user_pages(), likely problematic
>>>>>
>>>>> (1) should be fine.
>>>>>
>>>>> (2) is possibly problematic on the !fast path. If so, due to commit
>>>>> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>>>>>
>>>>> (3) is possibly wrong. CCing Vivek.
>>>>>
>>>>> (4) is what we hit here
>>>>>
>>>>>>
>>>>>> And try_grab_folio() is used when the folio is a large folio. The
>>>>>
>>>>>
>>>>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
>>>>>
>>>>> That code was added in
>>>>>
>>>>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
>>>>> Author: Peter Xu <[email protected]>
>>>>> Date: Wed Jun 28 17:53:07 2023 -0400
>>>>>
>>>>> mm/gup: accelerate thp gup even for "pages != NULL"
>>>>>
>>>>> The acceleration of THP was done with ctx.page_mask, however it'll be
>>>>> ignored if **pages is non-NULL.
>>>>>
>>>>>
>>>>> Likely the try_grab_folio() in __get_user_pages() is wrong?
>>>>>
>>>>> As documented, we already hold a refcount. Likely we should better do a
>>>>> folio_ref_add() and sanity check the refcount.
>>>>
>>>> Yes, a plain folio_ref_add() seems ok for these cases.
>>>>
>>>> In addition, the comment of folio_try_get_rcu() says, which is just a
>>>> wrapper of folio_ref_try_add_rcu():
>>>>
>>>> You can also use this function if you're holding a lock that prevents
>>>> pages being frozen & removed; eg the i_pages lock for the page cache
>>>> or the mmap_lock or page table lock for page tables. In this case, it
>>>> will always succeed, and you could have used a plain folio_get(), but
>>>> it's sometimes more convenient to have a common function called from
>>>> both locked and RCU-protected contexts.
>>>>
>>>> So IIUC we can use the plain folio_get() at least for
>>>> process_vm_readv/writev since mmap_lock is held in this path.
>>>>
>>>>>
>>>>>
>>>>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
>>>>> IRQs are disabled.
>>>>
>>>> Yes, I agree. Just the fast path should need to call try_grab_folio().
>>>
>>> try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
>>> keep calling it and add a flag to try_grab_folio, just like:
>>>
>>> if flag is true
>>> folio_ref_add()
>>> else
>>> try_get_folio()
>>
>>
>> try_grab_page() is what we use on the GUP-slow path. We'd likely want a
>> folio variant of that.
>>
>> We might want to call that gup_try_grab_folio() and rename the other one
>> to gup_fast_try_grab_folio().
>
> Won't we duplicate the most code with two versions try_grab_folio()?
>
> I meant something like:
>
> try_grab_folio(struct page *page, int refs, unsigned int flags, bool fast)
> {
> if fast
> try_get_folio()
> else
> folio_ref_add()
> }
>

That's insufficient to handle FOLL_PIN. Likely we should do this:

diff --git a/mm/gup.c b/mm/gup.c
index 231711efa390d..fea93a64bf235 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -203,8 +203,8 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 }
 
 /**
- * try_grab_page() - elevate a page's refcount by a flag-dependent amount
- * @page: pointer to page to be grabbed
+ * gup_try_grab_folio() - elevate a folio's refcount by a flag-dependent amount
+ * @folio: pointer to folio to be grabbed
  * @flags: gup flags: these are the FOLL_* flag values.
  *
  * This might not do anything at all, depending on the flags argument.
@@ -216,16 +216,16 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
  * time. Cases: please see the try_grab_folio() documentation, with
  * "refs=1".
  *
+ * Must not be called from GUP-fast: the folio must not get freed concurrently.
+ *
  * Return: 0 for success, or if no action was required (if neither FOLL_PIN
  * nor FOLL_GET was set, nothing is done). A negative error code for failure:
  *
  *   -ENOMEM		FOLL_GET or FOLL_PIN was set, but the page could not
  *			be grabbed.
  */
-int __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check gup_try_grab_folio(struct folio *folio, unsigned int flags)
 {
-	struct folio *folio = page_folio(page);
-
 	if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
 		return -ENOMEM;
 
@@ -239,7 +239,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	 * Don't take a pin on the zero page - it's not going anywhere
 	 * and it is used in a *lot* of places.
 	 */
-	if (is_zero_page(page))
+	if (is_zero_folio(folio))
 		return 0;
 
 	/*
@@ -260,6 +260,11 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	return 0;
 }
 
+int __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+	return gup_try_grab_folio(page_folio(page), flags);
+}
+
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page: pointer to page to be released


Then, fix the callers and rename the other one to gup_fast_*.


--
Cheers,

David / dhildenb


2024-05-31 19:07:24

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 11:38 AM David Hildenbrand <[email protected]> wrote:
>
> On 31.05.24 20:30, Yang Shi wrote:
> > On Fri, May 31, 2024 at 11:24 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 31.05.24 20:13, Yang Shi wrote:
> >>> On Fri, May 31, 2024 at 11:07 AM Yang Shi <[email protected]> wrote:
> >>>>
> >>>> On Fri, May 31, 2024 at 10:46 AM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 31.05.24 18:50, Yang Shi wrote:
> >>>>>> On Fri, May 31, 2024 at 1:24 AM kernel test robot <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> kernel test robot noticed "kernel_BUG_at_include/linux/page_ref.h" on:
> >>>>>>>
> >>>>>>> commit: efa7df3e3bb5da8e6abbe37727417f32a37fba47 ("mm: align larger anonymous mappings on THP boundaries")
> >>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>>>>>
> >>>>>>> [test failed on linus/master e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc]
> >>>>>>> [test failed on linux-next/master 6dc544b66971c7f9909ff038b62149105272d26a]
> >>>>>>>
> >>>>>>> in testcase: trinity
> >>>>>>> version: trinity-x86_64-6a17c218-1_20240527
> >>>>>>> with following parameters:
> >>>>>>>
> >>>>>>> runtime: 300s
> >>>>>>> group: group-00
> >>>>>>> nr_groups: 5
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> compiler: gcc-13
> >>>>>>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >>>>>>>
> >>>>>>> (please refer to attached dmesg/kmsg for entire log/backtrace)
> >>>>>>>
> >>>>>>>
> >>>>>>> we noticed the issue does not always happen. 34 times out of 50 runs as below.
> >>>>>>> the parent is clean.
> >>>>>>>
> >>>>>>> 1803d0c5ee1a3bbe efa7df3e3bb5da8e6abbe377274
> >>>>>>> ---------------- ---------------------------
> >>>>>>> fail:runs %reproduction fail:runs
> >>>>>>> | | |
> >>>>>>> :50 68% 34:50 dmesg.Kernel_panic-not_syncing:Fatal_exception
> >>>>>>> :50 68% 34:50 dmesg.RIP:try_get_folio
> >>>>>>> :50 68% 34:50 dmesg.invalid_opcode:#[##]
> >>>>>>> :50 68% 34:50 dmesg.kernel_BUG_at_include/linux/page_ref.h
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >>>>>>> the same patch/commit), kindly add following tags
> >>>>>>> | Reported-by: kernel test robot <[email protected]>
> >>>>>>> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
> >>>>>>>
> >>>>>>>
> >>>>>>> [ 275.267158][ T4335] ------------[ cut here ]------------
> >>>>>>> [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
> >>>>>>> [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
> >>>>>>> [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
> >>>>>>> [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> >>>>>>> [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
> >>>>>>> [ 275.271159][ T4335] Code: c3 cc cc cc cc 44 89 e6 48 89 df e8 e4 54 11 00 eb ae 90 0f 0b 90 31 db eb d5 9c 58 0f 1f 40 00 f6 c4 02 0f 84 46 ff ff ff 90 <0f> 0b 48 c7 c6 a0 54 d2 87 48 89 df e8 a9 e9 ff ff 90 0f 0b be 04
> >>>>>>
> >>>>>> If I read this BUG correctly, it is:
> >>>>>>
> >>>>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >>>>>>
> >>>>>
> >>>>> Yes, that seems to be the one.
> >>>>>
> >>>>>> try_grab_folio() actually assumes it is in an atomic context (irq
> >>>>>> disabled or preempt disabled) for this call path. This is achieved by
> >>>>>> disabling irq in gup fast or calling it in rcu critical section in
> >>>>>> page cache lookup path
> >>>>>
> >>>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >>>>>
> >>>>> Is called (mm-unstable) from:
> >>>>>
> >>>>> (1) gup_fast function, here IRQs are disable
> >>>>> (2) gup_hugepte(), possibly problematic
> >>>>> (3) memfd_pin_folios(), possibly problematic
> >>>>> (4) __get_user_pages(), likely problematic
> >>>>>
> >>>>> (1) should be fine.
> >>>>>
> >>>>> (2) is possibly problematic on the !fast path. If so, due to commit
> >>>>> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >>>>>
> >>>>> (3) is possibly wrong. CCing Vivek.
> >>>>>
> >>>>> (4) is what we hit here
> >>>>>
> >>>>>>
> >>>>>> And try_grab_folio() is used when the folio is a large folio. The
> >>>>>
> >>>>>
> >>>>> We come via process_vm_rw()->pin_user_pages_remote()->__get_user_pages()->try_grab_folio()
> >>>>>
> >>>>> That code was added in
> >>>>>
> >>>>> commit 57edfcfd3419b4799353d8cbd6ce49da075cfdbd
> >>>>> Author: Peter Xu <[email protected]>
> >>>>> Date: Wed Jun 28 17:53:07 2023 -0400
> >>>>>
> >>>>> mm/gup: accelerate thp gup even for "pages != NULL"
> >>>>>
> >>>>> The acceleration of THP was done with ctx.page_mask, however it'll be
> >>>>> ignored if **pages is non-NULL.
> >>>>>
> >>>>>
> >>>>> Likely the try_grab_folio() in __get_user_pages() is wrong?
> >>>>>
> >>>>> As documented, we already hold a refcount. Likely we should better do a
> >>>>> folio_ref_add() and sanity check the refcount.
> >>>>
> >>>> Yes, a plain folio_ref_add() seems ok for these cases.
> >>>>
> >>>> In addition, the comment of folio_try_get_rcu() says, which is just a
> >>>> wrapper of folio_ref_try_add_rcu():
> >>>>
> >>>> You can also use this function if you're holding a lock that prevents
> >>>> pages being frozen & removed; eg the i_pages lock for the page cache
> >>>> or the mmap_lock or page table lock for page tables. In this case, it
> >>>> will always succeed, and you could have used a plain folio_get(), but
> >>>> it's sometimes more convenient to have a common function called from
> >>>> both locked and RCU-protected contexts.
> >>>>
> >>>> So IIUC we can use the plain folio_get() at least for
> >>>> process_vm_readv/writev since mmap_lock is held in this path.
> >>>>
> >>>>>
> >>>>>
> >>>>> In essence, I think: try_grab_folio() should only be called from GUP-fast where
> >>>>> IRQs are disabled.
> >>>>
> >>>> Yes, I agree. Just the fast path should need to call try_grab_folio().
> >>>
> >>> try_grab_folio() also handles FOLL_PIN and FOLL_GET, so we may just
> >>> keep calling it and add a flag to try_grab_folio, just like:
> >>>
> >>> if flag is true
> >>> folio_ref_add()
> >>> else
> >>> try_get_folio()
> >>
> >>
> >> try_grab_page() is what we use on the GUP-slow path. We'd likely want a
> >> folio variant of that.
> >>
> >> We might want to call that gup_try_grab_folio() and rename the other one
> >> to gup_fast_try_grab_folio().
> >
> > Won't we duplicate the most code with two versions try_grab_folio()?
> >
> > I meant something like:
> >
> > try_grab_folio(struct page *page, int refs, unsigned int flags, bool fast)
> > {
> > if fast
> > try_get_folio()
> > else
> > folio_ref_add()
> > }
> >
>
> That's insufficient to handle FOLL_PIN. Likely we should do this:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 231711efa390d..fea93a64bf235 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -203,8 +203,8 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> }
>
> /**
> - * try_grab_page() - elevate a page's refcount by a flag-dependent amount
> - * @page: pointer to page to be grabbed
> + * try_grab_folio() - elevate a folios's refcount by a flag-dependent amount
> + * @folio: pointer to folio to be grabbed
> * @flags: gup flags: these are the FOLL_* flag values.
> *
> * This might not do anything at all, depending on the flags argument.
> @@ -216,16 +216,16 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> * time. Cases: please see the try_grab_folio() documentation, with
> * "refs=1".
> *
> + * Must not be called from GUP-fast: the folio must not get freed concurrently.
> + *
> * Return: 0 for success, or if no action was required (if neither FOLL_PIN
> * nor FOLL_GET was set, nothing is done). A negative error code for failure:
> *
> * -ENOMEM FOLL_GET or FOLL_PIN was set, but the page could not
> * be grabbed.
> */
> -int __must_check try_grab_page(struct page *page, unsigned int flags)
> +int __must_check try_grab_page(struct folio *folio, unsigned int flags)
> {
> - struct folio *folio = page_folio(page);
> -
> if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
> return -ENOMEM;
>
> @@ -239,7 +239,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
> * Don't take a pin on the zero page - it's not going anywhere
> * and it is used in a *lot* of places.
> */
> - if (is_zero_page(page))
> + if (is_zero_folio(folio))
> return 0;
>
> /*
> @@ -260,6 +260,11 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
> return 0;
> }
>
> +int __must_check try_grab_page(struct page *page, unsigned int flags)
> +{
> + return gup_try_grab_folio(page_folio(page), flags);
> +}
> +
> /**
> * unpin_user_page() - release a dma-pinned page
> * @page: pointer to page to be released
>
>
> Then, fix the callers and rename the other one to gup_fast_*.

I see your point. Replace try_grab_page() with a try_grab_folio() for the
slow path; it returns 0 or a negative errno, but it should never fail in
the slow path since we already hold at least one reference IIUC. The fast
version should be just like the old try_grab_folio(), which returns the
pointer to the folio or NULL.
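
Roughly something like the below, just to illustrate the intended split
(names, exact signatures and the FOLL_PIN/FOLL_GET handling are not final):

/* Sketch only -- not a real patch; details to be worked out. */
#include <linux/mm.h>

/*
 * Slow path: the caller already holds a reference on the folio, so this
 * only adds the extra refs/pins and returns 0 or a negative errno; in
 * practice it is not expected to fail.
 */
static int slow_try_grab_folio(struct folio *folio, int refs,
			       unsigned int flags)
{
	if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
		return -ENOMEM;
	folio_ref_add(folio, refs);
	return 0;
}

/*
 * Fast path: just like the old try_grab_folio() -- the folio may be freed
 * concurrently, so take a speculative reference and return the folio, or
 * NULL if that fails.  (The real GUP-fast code also has to re-check the
 * page after grabbing the reference.)
 */
static struct folio *fast_try_grab_folio(struct page *page, int refs)
{
	struct folio *folio = page_folio(page);

	if (!folio_ref_add_unless(folio, refs, 0))
		return NULL;
	return folio;
}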

>
>
> --
> Cheers,
>
> David / dhildenb
>

2024-05-31 20:57:31

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

Hi Oliver,

I just came up with a quick patch (build tested only) per the discussion,
attached; can you please give it a try? Once it is verified, I will
refine the patch and submit it for review.

Thanks,
Yang

>
> >
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >


Attachments:
try_grab_folio_fix.patch (12.70 kB)

2024-05-31 23:25:11

by Peter Xu

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>
> Is called (mm-unstable) from:
>
> (1) gup_fast function, here IRQs are disable
> (2) gup_hugepte(), possibly problematic
> (3) memfd_pin_folios(), possibly problematic
> (4) __get_user_pages(), likely problematic
>
> (1) should be fine.
>
> (2) is possibly problematic on the !fast path. If so, due to commit
> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>
> (3) is possibly wrong. CCing Vivek.
>
> (4) is what we hit here

I guess it was overlooked because try_grab_folio() didn't have any comment
or hint that RCU- or IRQ-dependent internal helpers are used, hence it is a
bit confusing. E.g. it has a different context requirement from
try_grab_page(), even though they look like sister functions. It might be
helpful to have a better name, something like try_grab_folio_rcu(), in this
case.

Btw, none of the above cases (2-4) have a real bug; we're just looking for
some way to avoid triggering the sanity check, am I right? I hope that
besides the host splash I didn't overlook any other side effect this issue
would cause, and the splash IIUC should so far be benign, as both gup slow
(2,4) and the newly added memfd_pin_folios() (3) appear to have the
refcount stabilized anyway.

Yang's patch in the other email looks sane to me; it's just that we would
then add quite some code only to avoid this sanity check in paths 2-4,
which seems like slight overkill.

One thing I'm thinking about is whether folio_ref_try_add_rcu() can get rid
of its RCU limitation. It boils down to whether we can use
atomic_add_unless() on a TINY_RCU / UP setup too. I mean, we do plenty of
similar things (get_page_unless_zero(), etc.) in generic code, and I don't
understand why we need to treat folio_ref_try_add_rcu() specially here.

IOW, the assertion we added here:

VM_BUG_ON(!in_atomic() && !irqs_disabled());

is there because we need atomicity of the sequence below:

VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
folio_ref_add(folio, count);

But an atomic op avoids that.

This chunk of code was (mostly) originally added in 2008 in commit
e286781d5f2e ("mm: speculative page references").

In short, I'm wondering whether something like the below would make sense
and be easier:

===8<===
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 1acf5bac7f50..c89a67d239d1 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -258,26 +258,9 @@ static inline bool folio_try_get(struct folio *folio)
return folio_ref_add_unless(folio, 1, 0);
}

-static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
-{
-#ifdef CONFIG_TINY_RCU
- /*
- * The caller guarantees the folio will not be freed from interrupt
- * context, so (on !SMP) we only need preemption to be disabled
- * and TINY_RCU does that for us.
- */
-# ifdef CONFIG_PREEMPT_COUNT
- VM_BUG_ON(!in_atomic() && !irqs_disabled());
-# endif
- VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
- folio_ref_add(folio, count);
-#else
- if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
- /* Either the folio has been freed, or will be freed. */
- return false;
- }
-#endif
- return true;
+static inline bool folio_ref_try_add(struct folio *folio, int count)
+{
+ return folio_ref_add_unless(folio, count, 0);
}

/**
@@ -305,7 +288,7 @@ static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
*/
static inline bool folio_try_get_rcu(struct folio *folio)
{
- return folio_ref_try_add_rcu(folio, 1);
+ return folio_ref_try_add(folio, 1);
}

static inline int page_ref_freeze(struct page *page, int count)
diff --git a/mm/gup.c b/mm/gup.c
index e17466fd62bb..17f89e8d31f1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -78,7 +78,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
folio = page_folio(page);
if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
return NULL;
- if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
+ if (unlikely(!folio_ref_try_add(folio, refs)))
return NULL;

/*
===8<===

So instead of adding new code, we fix it by removing some. There might be
some implication for the TINY_RCU / UP config from using atomic_add_unless()
to replace a plain atomic_add(), but I'm not sure whether that's a major
issue.

The benefit is that try_get_folio() then doesn't need a renaming, because
the RCU implication just goes away.

Thanks,

--
Peter Xu


2024-06-01 00:02:10

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 4:25 PM Peter Xu <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> >
> > Is called (mm-unstable) from:
> >
> > (1) gup_fast function, here IRQs are disable
> > (2) gup_hugepte(), possibly problematic
> > (3) memfd_pin_folios(), possibly problematic
> > (4) __get_user_pages(), likely problematic
> >
> > (1) should be fine.
> >
> > (2) is possibly problematic on the !fast path. If so, due to commit
> > a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> >
> > (3) is possibly wrong. CCing Vivek.
> >
> > (4) is what we hit here
>
> I guess it was overlooked because try_grab_folio() didn't have any comment
> or implication on RCU or IRQ internal helpers being used, hence a bit
> confusing. E.g. it has different context requirement on try_grab_page(),
> even though they look like sister functions. It might be helpful to have a
> better name, something like try_grab_folio_rcu() in this case.
>
> Btw, none of above cases (2-4) have real bug, but we're looking at some way
> to avoid triggering the sanity check, am I right? I hope besides the host
> splash I didn't overlook any other side effect this issue would cause, and
> the splash IIUC should so far be benign, as either gup slow (2,4) or the
> newly added memfd_pin_folios() (3) look like to have the refcount stablized
> anyway.
>
> Yang's patch in the other email looks sane to me, just that then we'll add
> quite some code just to avoid this sanity check in paths 2-4 which seems
> like an slight overkill.
>
> One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
> its RCU limitation. It boils down to whether we can use atomic_add_unless()
> on TINY_RCU / UP setup too? I mean, we do plenty of similar things
> (get_page_unless_zero, etc.) in generic code and I don't understand why
> here we need to treat folio_ref_try_add_rcu() specially.
>
> IOW, the assertions here we added:
>
> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>
> Is because we need atomicity of below sequences:
>
> VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> folio_ref_add(folio, count);
>
> But atomic ops avoids it.

Yeah, I can't think of a reason why an atomic op can't do it either. But is
it written this way because we want to catch the refcount == 0 case, since
that would mean a severe bug? Did we miss something?

>
> This chunk of code was (mostly) originally added in 2008 in commit
> e286781d5f2e ("mm: speculative page references").
>
> In short, I'm wondering whether something like below would make sense and
> easier:
>
> ===8<===
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 1acf5bac7f50..c89a67d239d1 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -258,26 +258,9 @@ static inline bool folio_try_get(struct folio *folio)
> return folio_ref_add_unless(folio, 1, 0);
> }
>
> -static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> -{
> -#ifdef CONFIG_TINY_RCU
> - /*
> - * The caller guarantees the folio will not be freed from interrupt
> - * context, so (on !SMP) we only need preemption to be disabled
> - * and TINY_RCU does that for us.
> - */
> -# ifdef CONFIG_PREEMPT_COUNT
> - VM_BUG_ON(!in_atomic() && !irqs_disabled());
> -# endif
> - VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> - folio_ref_add(folio, count);
> -#else
> - if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> - /* Either the folio has been freed, or will be freed. */
> - return false;
> - }
> -#endif
> - return true;
> +static inline bool folio_ref_try_add(struct folio *folio, int count)
> +{
> + return folio_ref_add_unless(folio, count, 0);
> }
>
> /**
> @@ -305,7 +288,7 @@ static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> */
> static inline bool folio_try_get_rcu(struct folio *folio)
> {
> - return folio_ref_try_add_rcu(folio, 1);
> + return folio_ref_try_add(folio, 1);
> }
>
> static inline int page_ref_freeze(struct page *page, int count)
> diff --git a/mm/gup.c b/mm/gup.c
> index e17466fd62bb..17f89e8d31f1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -78,7 +78,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> folio = page_folio(page);
> if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
> return NULL;
> - if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
> + if (unlikely(!folio_ref_try_add(folio, refs)))
> return NULL;
>
> /*
> ===8<===
>
> So instead of adding new code, we fix it by removing some. There might be
> some implication on TINY_RCU / UP config on using the atomic_add_unless()
> to replace one atomic_add(), but I'm not sure whether that's a major issue.
>
> The benefit is try_get_folio() then don't need a renaming then, because the
> rcu implication just went away.
>
> Thanks,
>
> --
> Peter Xu
>

2024-06-01 00:59:47

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Fri, May 31, 2024 at 5:01 PM Yang Shi <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 4:25 PM Peter Xu <[email protected]> wrote:
> >
> > On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> > > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> > >
> > > Is called (mm-unstable) from:
> > >
> > > (1) gup_fast function, here IRQs are disable
> > > (2) gup_hugepte(), possibly problematic
> > > (3) memfd_pin_folios(), possibly problematic
> > > (4) __get_user_pages(), likely problematic
> > >
> > > (1) should be fine.
> > >
> > > (2) is possibly problematic on the !fast path. If so, due to commit
> > > a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> > >
> > > (3) is possibly wrong. CCing Vivek.
> > >
> > > (4) is what we hit here
> >
> > I guess it was overlooked because try_grab_folio() didn't have any comment
> > or implication on RCU or IRQ internal helpers being used, hence a bit
> > confusing. E.g. it has different context requirement on try_grab_page(),
> > even though they look like sister functions. It might be helpful to have a
> > better name, something like try_grab_folio_rcu() in this case.
> >
> > Btw, none of above cases (2-4) have real bug, but we're looking at some way
> > to avoid triggering the sanity check, am I right? I hope besides the host
> > splash I didn't overlook any other side effect this issue would cause, and
> > the splash IIUC should so far be benign, as either gup slow (2,4) or the
> > newly added memfd_pin_folios() (3) look like to have the refcount stablized
> > anyway.
> >
> > Yang's patch in the other email looks sane to me, just that then we'll add
> > quite some code just to avoid this sanity check in paths 2-4 which seems
> > like an slight overkill.
> >
> > One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
> > its RCU limitation. It boils down to whether we can use atomic_add_unless()
> > on TINY_RCU / UP setup too? I mean, we do plenty of similar things
> > (get_page_unless_zero, etc.) in generic code and I don't understand why
> > here we need to treat folio_ref_try_add_rcu() specially.
> >
> > IOW, the assertions here we added:
> >
> > VM_BUG_ON(!in_atomic() && !irqs_disabled());
> >
> > Is because we need atomicity of below sequences:
> >
> > VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> > folio_ref_add(folio, count);
> >
> > But atomic ops avoids it.
>
> Yeah, I didn't think of why atomic can't do it either. But is it
> written in this way because we want to catch the refcount == 0 case
> since it means a severe bug? Did we miss something?

I thought more about it and disassembled the code. IIUC, this is an
optimization for non-SMP kernels. When in an RCU critical section or with
IRQs disabled, we just need a single atomic add instruction.
folio_ref_add_unless() would yield more instructions, including a branch.
But I'm wondering how useful that is nowadays. Is it really worth the
complexity? AFAIK ARM64, for example, has not supported non-SMP kernels
for years.

My patch actually replaces folio_ref_add_unless() with folio_ref_add() for
the slow paths, so it is supposed to run faster, but since we are already
in the slow path the difference may not be measurable at all. So having
simpler and more readable code may outweigh the potential slight
performance gain in this case?
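
For reference, here is roughly what the two flavors boil down to. This is
just a plain C11 sketch of the semantics (not the kernel's page_ref code):

#include <stdatomic.h>
#include <stdbool.h>

/* Plain add: one atomic RMW, no branch; cannot detect a zero refcount. */
static void ref_add(atomic_int *ref, int count)
{
	atomic_fetch_add(ref, count);
}

/*
 * add_unless(ref, count, 0): only add if the current value is not zero.
 * In the worst case this is a compare-exchange loop with a branch, which
 * is the extra cost being discussed.
 */
static bool ref_add_unless_zero(atomic_int *ref, int count)
{
	int old = atomic_load(ref);

	do {
		if (old == 0)
			return false;	/* already (being) freed */
	} while (!atomic_compare_exchange_weak(ref, &old, old + count));

	return true;
}

int main(void)
{
	atomic_int ref = 1;

	ref_add(&ref, 1);				/* ref == 2 */
	return ref_add_unless_zero(&ref, 1) ? 0 : 1;	/* ref == 3 */
}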

>
> >
> > This chunk of code was (mostly) originally added in 2008 in commit
> > e286781d5f2e ("mm: speculative page references").
> >
> > In short, I'm wondering whether something like below would make sense and
> > easier:
> >
> > ===8<===
> > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > index 1acf5bac7f50..c89a67d239d1 100644
> > --- a/include/linux/page_ref.h
> > +++ b/include/linux/page_ref.h
> > @@ -258,26 +258,9 @@ static inline bool folio_try_get(struct folio *folio)
> > return folio_ref_add_unless(folio, 1, 0);
> > }
> >
> > -static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> > -{
> > -#ifdef CONFIG_TINY_RCU
> > - /*
> > - * The caller guarantees the folio will not be freed from interrupt
> > - * context, so (on !SMP) we only need preemption to be disabled
> > - * and TINY_RCU does that for us.
> > - */
> > -# ifdef CONFIG_PREEMPT_COUNT
> > - VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > -# endif
> > - VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> > - folio_ref_add(folio, count);
> > -#else
> > - if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> > - /* Either the folio has been freed, or will be freed. */
> > - return false;
> > - }
> > -#endif
> > - return true;
> > +static inline bool folio_ref_try_add(struct folio *folio, int count)
> > +{
> > + return folio_ref_add_unless(folio, count, 0);
> > }
> >
> > /**
> > @@ -305,7 +288,7 @@ static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
> > */
> > static inline bool folio_try_get_rcu(struct folio *folio)
> > {
> > - return folio_ref_try_add_rcu(folio, 1);
> > + return folio_ref_try_add(folio, 1);
> > }
> >
> > static inline int page_ref_freeze(struct page *page, int count)
> > diff --git a/mm/gup.c b/mm/gup.c
> > index e17466fd62bb..17f89e8d31f1 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -78,7 +78,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > folio = page_folio(page);
> > if (WARN_ON_ONCE(folio_ref_count(folio) < 0))
> > return NULL;
> > - if (unlikely(!folio_ref_try_add_rcu(folio, refs)))
> > + if (unlikely(!folio_ref_try_add(folio, refs)))
> > return NULL;
> >
> > /*
> > ===8<===
> >
> > So instead of adding new code, we fix it by removing some. There might be
> > some implication on TINY_RCU / UP config on using the atomic_add_unless()
> > to replace one atomic_add(), but I'm not sure whether that's a major issue.
> >
> > The benefit is try_get_folio() then don't need a renaming then, because the
> > rcu implication just went away.
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >

2024-06-01 06:25:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On 01.06.24 02:59, Yang Shi wrote:
> On Fri, May 31, 2024 at 5:01 PM Yang Shi <[email protected]> wrote:
>>
>> On Fri, May 31, 2024 at 4:25 PM Peter Xu <[email protected]> wrote:
>>>
>>> On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
>>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>>>>
>>>> Is called (mm-unstable) from:
>>>>
>>>> (1) gup_fast function, here IRQs are disable
>>>> (2) gup_hugepte(), possibly problematic
>>>> (3) memfd_pin_folios(), possibly problematic
>>>> (4) __get_user_pages(), likely problematic
>>>>
>>>> (1) should be fine.
>>>>
>>>> (2) is possibly problematic on the !fast path. If so, due to commit
>>>> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>>>>
>>>> (3) is possibly wrong. CCing Vivek.
>>>>
>>>> (4) is what we hit here
>>>
>>> I guess it was overlooked because try_grab_folio() didn't have any comment
>>> or implication on RCU or IRQ internal helpers being used, hence a bit
>>> confusing. E.g. it has different context requirement on try_grab_page(),
>>> even though they look like sister functions. It might be helpful to have a
>>> better name, something like try_grab_folio_rcu() in this case.
>>>
>>> Btw, none of above cases (2-4) have real bug, but we're looking at some way
>>> to avoid triggering the sanity check, am I right? I hope besides the host
>>> splash I didn't overlook any other side effect this issue would cause, and
>>> the splash IIUC should so far be benign, as either gup slow (2,4) or the
>>> newly added memfd_pin_folios() (3) look like to have the refcount stablized
>>> anyway.
>>>
>>> Yang's patch in the other email looks sane to me, just that then we'll add
>>> quite some code just to avoid this sanity check in paths 2-4 which seems
>>> like an slight overkill.
>>>
>>> One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
>>> its RCU limitation. It boils down to whether we can use atomic_add_unless()
>>> on TINY_RCU / UP setup too? I mean, we do plenty of similar things
>>> (get_page_unless_zero, etc.) in generic code and I don't understand why
>>> here we need to treat folio_ref_try_add_rcu() specially.
>>>
>>> IOW, the assertions here we added:
>>>
>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>>>
>>> Is because we need atomicity of below sequences:
>>>
>>> VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
>>> folio_ref_add(folio, count);
>>>
>>> But atomic ops avoids it.
>>
>> Yeah, I didn't think of why atomic can't do it either. But is it
>> written in this way because we want to catch the refcount == 0 case
>> since it means a severe bug? Did we miss something?
>
> Thought more about it and disassembled the code. IIUC, this is an
> optimization for non-SMP kernel. When in rcu critical section or irq
> is disabled, we just need an atomic add instruction.
> folio_ref_add_unless() would yield more instructions, including branch
> instruction. But I'm wondering how useful it would be nowadays. Is it
> really worth the complexity? AFAIK, for example, ARM64 has not
> supported non-SMP kernel for years.
>
> My patch actually replaced all folio_ref_add_unless() to
> folio_ref_add() for slow paths, so it is supposed to run faster, but
> we are already in slow path, it may be not measurable at all. So
> having more simple and readable code may outweigh the potential slight
> performance gain in this case?

Yes, we don't want to use atomic RMW operations that return values where we
can use atomic RMW operations that don't return values. The former are
slower and imply a memory barrier, which can otherwise be optimized out on
some architectures (arm64 IIRC).

We should clean that up here, and make it clearer that the old function
is only for grabbing a folio if it can be freed concurrently -- GUP-fast.

--
Cheers,

David / dhildenb


2024-06-03 14:02:48

by kernel test robot

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

hi, Yang Shi,

On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> Hi Oliver,
>
> I just came up with a quick patch (just build test) per the discussion
> and attached, can you please to give it a try? Once it is verified, I
> will refine the patch and submit for review.

What's the base of this patch? I tried to apply it on top of efa7df3e3b and
of v6.10-rc2; both attempts failed.

>
> Thanks,
> Yang
>
> >
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >



2024-06-03 15:09:59

by Peter Xu

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Sat, Jun 01, 2024 at 08:22:21AM +0200, David Hildenbrand wrote:
> On 01.06.24 02:59, Yang Shi wrote:
> > On Fri, May 31, 2024 at 5:01 PM Yang Shi <[email protected]> wrote:
> > >
> > > On Fri, May 31, 2024 at 4:25 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> > > > > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> > > > >
> > > > > Is called (mm-unstable) from:
> > > > >
> > > > > (1) gup_fast function, here IRQs are disable
> > > > > (2) gup_hugepte(), possibly problematic
> > > > > (3) memfd_pin_folios(), possibly problematic
> > > > > (4) __get_user_pages(), likely problematic
> > > > >
> > > > > (1) should be fine.
> > > > >
> > > > > (2) is possibly problematic on the !fast path. If so, due to commit
> > > > > a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> > > > >
> > > > > (3) is possibly wrong. CCing Vivek.
> > > > >
> > > > > (4) is what we hit here
> > > >
> > > > I guess it was overlooked because try_grab_folio() didn't have any comment
> > > > or implication on RCU or IRQ internal helpers being used, hence a bit
> > > > confusing. E.g. it has different context requirement on try_grab_page(),
> > > > even though they look like sister functions. It might be helpful to have a
> > > > better name, something like try_grab_folio_rcu() in this case.
> > > >
> > > > Btw, none of above cases (2-4) have real bug, but we're looking at some way
> > > > to avoid triggering the sanity check, am I right? I hope besides the host
> > > > splash I didn't overlook any other side effect this issue would cause, and
> > > > the splash IIUC should so far be benign, as either gup slow (2,4) or the
> > > > newly added memfd_pin_folios() (3) look like to have the refcount stablized
> > > > anyway.
> > > >
> > > > Yang's patch in the other email looks sane to me, just that then we'll add
> > > > quite some code just to avoid this sanity check in paths 2-4 which seems
> > > > like an slight overkill.
> > > >
> > > > One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
> > > > its RCU limitation. It boils down to whether we can use atomic_add_unless()
> > > > on TINY_RCU / UP setup too? I mean, we do plenty of similar things
> > > > (get_page_unless_zero, etc.) in generic code and I don't understand why
> > > > here we need to treat folio_ref_try_add_rcu() specially.
> > > >
> > > > IOW, the assertions here we added:
> > > >
> > > > VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > > >
> > > > Is because we need atomicity of below sequences:
> > > >
> > > > VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> > > > folio_ref_add(folio, count);
> > > >
> > > > But atomic ops avoids it.
> > >
> > > Yeah, I didn't think of why atomic can't do it either. But is it
> > > written in this way because we want to catch the refcount == 0 case
> > > since it means a severe bug? Did we miss something?
> >
> > Thought more about it and disassembled the code. IIUC, this is an
> > optimization for non-SMP kernel. When in rcu critical section or irq
> > is disabled, we just need an atomic add instruction.
> > folio_ref_add_unless() would yield more instructions, including branch
> > instruction. But I'm wondering how useful it would be nowadays. Is it
> > really worth the complexity? AFAIK, for example, ARM64 has not
> > supported non-SMP kernel for years.
> >
> > My patch actually replaced all folio_ref_add_unless() to
> > folio_ref_add() for slow paths, so it is supposed to run faster, but
> > we are already in slow path, it may be not measurable at all. So
> > having more simple and readable code may outweigh the potential slight
> > performance gain in this case?
>
> Yes, we don't want to use atomic RMW that return values where we can use
> atomic RMW that don't return values. The former is slower and implies a
> memory barrier, that can be optimized out on some arcitectures (arm64 IIRC)
>
> We should clean that up here, and make it clearer that the old function is
> only for grabbing a folio if it can be freed concurrently -- GUP-fast.

Note again that this only affects TINY_RCU, which mostly implies
!PREEMPTION and UP. It's a matter of whether we prefer adding this bunch of
code to optimize that.

Also, we haven't yet measured this in a real workload to see how that
"unless" plays out when buried in other paths; we'd need a special kernel
build first, and again I'm not sure whether it'll be worthwhile.

Thanks,

--
Peter Xu


2024-06-03 15:43:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On 03.06.24 17:08, Peter Xu wrote:
> On Sat, Jun 01, 2024 at 08:22:21AM +0200, David Hildenbrand wrote:
>> On 01.06.24 02:59, Yang Shi wrote:
>>> On Fri, May 31, 2024 at 5:01 PM Yang Shi <[email protected]> wrote:
>>>>
>>>> On Fri, May 31, 2024 at 4:25 PM Peter Xu <[email protected]> wrote:
>>>>>
>>>>> On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
>>>>>> try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
>>>>>>
>>>>>> Is called (mm-unstable) from:
>>>>>>
>>>>>> (1) gup_fast function, here IRQs are disable
>>>>>> (2) gup_hugepte(), possibly problematic
>>>>>> (3) memfd_pin_folios(), possibly problematic
>>>>>> (4) __get_user_pages(), likely problematic
>>>>>>
>>>>>> (1) should be fine.
>>>>>>
>>>>>> (2) is possibly problematic on the !fast path. If so, due to commit
>>>>>> a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
>>>>>>
>>>>>> (3) is possibly wrong. CCing Vivek.
>>>>>>
>>>>>> (4) is what we hit here
>>>>>
>>>>> I guess it was overlooked because try_grab_folio() didn't have any comment
>>>>> or implication on RCU or IRQ internal helpers being used, hence a bit
>>>>> confusing. E.g. it has different context requirement on try_grab_page(),
>>>>> even though they look like sister functions. It might be helpful to have a
>>>>> better name, something like try_grab_folio_rcu() in this case.
>>>>>
>>>>> Btw, none of above cases (2-4) have real bug, but we're looking at some way
>>>>> to avoid triggering the sanity check, am I right? I hope besides the host
>>>>> splash I didn't overlook any other side effect this issue would cause, and
>>>>> the splash IIUC should so far be benign, as either gup slow (2,4) or the
>>>>> newly added memfd_pin_folios() (3) look like to have the refcount stablized
>>>>> anyway.
>>>>>
>>>>> Yang's patch in the other email looks sane to me, just that then we'll add
>>>>> quite some code just to avoid this sanity check in paths 2-4 which seems
>>>>> like an slight overkill.
>>>>>
>>>>> One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
>>>>> its RCU limitation. It boils down to whether we can use atomic_add_unless()
>>>>> on TINY_RCU / UP setup too? I mean, we do plenty of similar things
>>>>> (get_page_unless_zero, etc.) in generic code and I don't understand why
>>>>> here we need to treat folio_ref_try_add_rcu() specially.
>>>>>
>>>>> IOW, the assertions here we added:
>>>>>
>>>>> VM_BUG_ON(!in_atomic() && !irqs_disabled());
>>>>>
>>>>> Is because we need atomicity of below sequences:
>>>>>
>>>>> VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
>>>>> folio_ref_add(folio, count);
>>>>>
>>>>> But atomic ops avoids it.
>>>>
>>>> Yeah, I didn't think of why atomic can't do it either. But is it
>>>> written in this way because we want to catch the refcount == 0 case
>>>> since it means a severe bug? Did we miss something?
>>>
>>> Thought more about it and disassembled the code. IIUC, this is an
>>> optimization for non-SMP kernel. When in rcu critical section or irq
>>> is disabled, we just need an atomic add instruction.
>>> folio_ref_add_unless() would yield more instructions, including branch
>>> instruction. But I'm wondering how useful it would be nowadays. Is it
>>> really worth the complexity? AFAIK, for example, ARM64 has not
>>> supported non-SMP kernel for years.
>>>
>>> My patch actually replaced all folio_ref_add_unless() to
>>> folio_ref_add() for slow paths, so it is supposed to run faster, but
>>> we are already in slow path, it may be not measurable at all. So
>>> having more simple and readable code may outweigh the potential slight
>>> performance gain in this case?
>>
>> Yes, we don't want to use atomic RMW that return values where we can use
>> atomic RMW that don't return values. The former is slower and implies a
>> memory barrier, that can be optimized out on some arcitectures (arm64 IIRC)
>>
>> We should clean that up here, and make it clearer that the old function is
>> only for grabbing a folio if it can be freed concurrently -- GUP-fast.
>
> Note again that this only affects TINY_RCU, which mostly implies
> !PREEMPTION and UP. It's a matter of whether we prefer adding these bunch
> of code to optimize that.
>
> Also we didn't yet measure that in a real workload and see how that
> "unless" plays when buried in other paths, but then we'll need a special
> kernel build first, and again I'm not sure whether it'll be worthwhile.

try_get_folio() is all about grabbing a folio that might get freed
concurrently. That's why it calls folio_ref_try_add_rcu() and does
complicated stuff.

On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
essentially an atomic_add_unless(), which in the worst case ends up being
a cmpxchg loop.


Stating that we should be using try_get_folio() in paths where we are
sure the folio refcount is not 0 is the same as using folio_try_get()
where folio_get() would be sufficient.

The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we
are using a function in the wrong context, although in our case it is
safe to use (there is no BUG). Which is true, because we know we have a
folio reference and can simply use a plain folio_ref_add().

Again, just like we have folio_get() and folio_try_get(), we should
distinguish in GUP whether we are adding more references to a folio (and
effectively do what folio_get() would), or whether we are actually
grabbing a folio that could be freed concurrently (what folio_try_get()
would do).
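
Spelled out with the existing page_ref helpers, that rule is roughly the
following (illustrative sketch only, not a proposed patch):

#include <linux/mm.h>

/* GUP-slow: a reference is already held, a plain add is enough. */
static void grab_already_held(struct folio *folio, int refs)
{
	folio_ref_add(folio, refs);
}

/* GUP-fast: the folio may be freed concurrently, so the grab may fail. */
static bool grab_possibly_freed(struct folio *folio)
{
	return folio_try_get(folio);
}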

--
Cheers,

David / dhildenb


2024-06-03 17:01:14

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <[email protected]> wrote:
>
> hi, Yang Shi,
>
> On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > Hi Oliver,
> >
> > I just came up with a quick patch (just build test) per the discussion
> > and attached, can you please to give it a try? Once it is verified, I
> > will refine the patch and submit for review.
>
> what's the base of this patch? I tried to apply it upon efa7df3e3b or
> v6.10-rc2. both failed.

Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
have mentioned this.

>
> >
> > Thanks,
> > Yang
> >
> > >
> > > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >
>
>

2024-06-03 20:27:25

by Peter Xu

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Mon, Jun 03, 2024 at 05:42:44PM +0200, David Hildenbrand wrote:
> On 03.06.24 17:08, Peter Xu wrote:
> > On Sat, Jun 01, 2024 at 08:22:21AM +0200, David Hildenbrand wrote:
> > > On 01.06.24 02:59, Yang Shi wrote:
> > > > On Fri, May 31, 2024 at 5:01 PM Yang Shi <[email protected]> wrote:
> > > > >
> > > > > On Fri, May 31, 2024 at 4:25 PM Peter Xu <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
> > > > > > > try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()
> > > > > > >
> > > > > > > Is called (mm-unstable) from:
> > > > > > >
> > > > > > > (1) gup_fast function, here IRQs are disable
> > > > > > > (2) gup_hugepte(), possibly problematic
> > > > > > > (3) memfd_pin_folios(), possibly problematic
> > > > > > > (4) __get_user_pages(), likely problematic
> > > > > > >
> > > > > > > (1) should be fine.
> > > > > > >
> > > > > > > (2) is possibly problematic on the !fast path. If so, due to commit
> > > > > > > a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.
> > > > > > >
> > > > > > > (3) is possibly wrong. CCing Vivek.
> > > > > > >
> > > > > > > (4) is what we hit here
> > > > > >
> > > > > > I guess it was overlooked because try_grab_folio() didn't have any comment
> > > > > > or implication on RCU or IRQ internal helpers being used, hence a bit
> > > > > > confusing. E.g. it has different context requirement on try_grab_page(),
> > > > > > even though they look like sister functions. It might be helpful to have a
> > > > > > better name, something like try_grab_folio_rcu() in this case.
> > > > > >
> > > > > > Btw, none of above cases (2-4) have real bug, but we're looking at some way
> > > > > > to avoid triggering the sanity check, am I right? I hope besides the host
> > > > > > splash I didn't overlook any other side effect this issue would cause, and
> > > > > > the splash IIUC should so far be benign, as either gup slow (2,4) or the
> > > > > > newly added memfd_pin_folios() (3) look like to have the refcount stablized
> > > > > > anyway.
> > > > > >
> > > > > > Yang's patch in the other email looks sane to me, just that then we'll add
> > > > > > quite some code just to avoid this sanity check in paths 2-4 which seems
> > > > > > like an slight overkill.
> > > > > >
> > > > > > One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
> > > > > > its RCU limitation. It boils down to whether we can use atomic_add_unless()
> > > > > > on TINY_RCU / UP setup too? I mean, we do plenty of similar things
> > > > > > (get_page_unless_zero, etc.) in generic code and I don't understand why
> > > > > > here we need to treat folio_ref_try_add_rcu() specially.
> > > > > >
> > > > > > IOW, the assertions here we added:
> > > > > >
> > > > > > VM_BUG_ON(!in_atomic() && !irqs_disabled());
> > > > > >
> > > > > > Is because we need atomicity of below sequences:
> > > > > >
> > > > > > VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
> > > > > > folio_ref_add(folio, count);
> > > > > >
> > > > > > But atomic ops avoids it.
> > > > >
> > > > > Yeah, I didn't think of why atomic can't do it either. But is it
> > > > > written in this way because we want to catch the refcount == 0 case
> > > > > since it means a severe bug? Did we miss something?
> > > >
> > > > Thought more about it and disassembled the code. IIUC, this is an
> > > > optimization for non-SMP kernel. When in rcu critical section or irq
> > > > is disabled, we just need an atomic add instruction.
> > > > folio_ref_add_unless() would yield more instructions, including branch
> > > > instruction. But I'm wondering how useful it would be nowadays. Is it
> > > > really worth the complexity? AFAIK, for example, ARM64 has not
> > > > supported non-SMP kernel for years.
> > > >
> > > > My patch actually replaced all folio_ref_add_unless() to
> > > > folio_ref_add() for slow paths, so it is supposed to run faster, but
> > > > we are already in slow path, it may be not measurable at all. So
> > > > having more simple and readable code may outweigh the potential slight
> > > > performance gain in this case?
> > >
> > > Yes, we don't want to use atomic RMW that return values where we can use
> > > atomic RMW that don't return values. The former is slower and implies a
> > > memory barrier, that can be optimized out on some arcitectures (arm64 IIRC)
> > >
> > > We should clean that up here, and make it clearer that the old function is
> > > only for grabbing a folio if it can be freed concurrently -- GUP-fast.
> >
> > Note again that this only affects TINY_RCU, which mostly implies
> > !PREEMPTION and UP. It's a matter of whether we prefer adding these bunch
> > of code to optimize that.
> >
> > Also we didn't yet measure that in a real workload and see how that
> > "unless" plays when buried in other paths, but then we'll need a special
> > kernel build first, and again I'm not sure whether it'll be worthwhile.
>
> try_get_folio() is all about grabbing a folio that might get freed
> concurrently. That's why it calls folio_ref_try_add_rcu() and does
> complicated stuff.

IMHO we can define it that way... e.g. try_get_page() wasn't defined as
such.

If we want to be crystal clear on that, and if we think that's what we want,
again I would suggest we rename it to something different from
try_get_page() to avoid future misuse, then add documentation. We may even
want to assert the RCU/IRQ implications at the entrance of try_get_folio();
then this will be detected even without the TINY_RCU config.
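
For instance, something in the spirit of the below (a rough, untested
sketch; the exact predicate and placement would need some care):

#include <linux/irqflags.h>
#include <linux/rcupdate.h>

/* Sketch only: assert the expected GUP-fast calling context up front. */
static inline void assert_gup_fast_context(void)
{
	RCU_LOCKDEP_WARN(!irqs_disabled() && !rcu_read_lock_held(),
			 "try_get_folio() used outside GUP-fast context");
}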

>
> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
> essentially a atomic_add_unless(), which in the worst case ends up being a
> cmpxchg loop.
>
>
> Stating that we should be using try_get_folio() in paths where we are sure
> the folio refcount is not 0 is the same as using folio_try_get() where
> folio_get() would be sufficient.
>
> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
> using a function in the wrong context, although in our case, it is safe to
> use (there is now BUG). Which is true, because we know we have a folio
> reference and can simply use a simple folio_ref_add().
>
> Again, just like we have folio_get() and folio_try_get(), we should
> distinguish in GUP whether we are adding more reference to a folio (and
> effectively do what folio_get() would), or whether we are actually grabbing
> a folio that could be freed concurrently (what folio_try_get() would do).

Yes we can. Again, IMHO it's a matter of whether it will be worth it.

Note that even with SMP, and even if we keep this code, the
atomic_add_unless only affects gup slow on THP, and even with that
overhead it is much faster than before that path was introduced.. and
per my previous experience we don't care too much there, really.

So it's literally only three paths that are relevant here for the "unless"
overhead:

- gup slow on THP (I just added it; it used to be even slower..)

- Vivek's new path

- hugepd (which may be gone for good in a few months..)

IMHO none of them has perf concerns. The real perf-concern path is
gup-fast when the pgtable entry exists, but that must use
atomic_add_unless() anyway. Even the gup-slow !thp case won't regress, as
that uses try_get_page().

So again, IMHO the easiest way to fix this WARN is to drop the TINY_RCU
bit, if nobody worries about UP perf.

I don't have a strong opinion: if any of us really worries about the
"unless" overhead in the above three use cases, and thinks it worthwhile
to add the code to support it, I won't object. But to me it's adding pain
with no real benefit we could ever measure, plus complexity to backport,
since we'll need a fix for kernels as old as 6.6.

Thanks,

--
Peter Xu


2024-06-03 20:38:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

>> try_get_folio() is all about grabbing a folio that might get freed
>> concurrently. That's why it calls folio_ref_try_add_rcu() and does
>> complicated stuff.
>
> IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
>
> If we want to be crystal clear on that and if we think that's what we want,
> again I would suggest we rename it differently from try_get_page() to avoid
> future misuses, then add documents. We may want to also even assert the

Yes, absolutely.

> rcu/irq implications in try_get_folio() at entrance, then that'll be
> detected even without TINY_RCU config.
>
>>
>> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
>> essentially a atomic_add_unless(), which in the worst case ends up being a
>> cmpxchg loop.
>>
>>
>> Stating that we should be using try_get_folio() in paths where we are sure
>> the folio refcount is not 0 is the same as using folio_try_get() where
>> folio_get() would be sufficient.
>>
>> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
>> using a function in the wrong context, although in our case, it is safe to
>> use (there is now BUG). Which is true, because we know we have a folio
>> reference and can simply use a simple folio_ref_add().
>>
>> Again, just like we have folio_get() and folio_try_get(), we should
>> distinguish in GUP whether we are adding more reference to a folio (and
>> effectively do what folio_get() would), or whether we are actually grabbing
>> a folio that could be freed concurrently (what folio_try_get() would do).
>
> Yes we can. Again, IMHO it's a matter of whether it will worth it.
>
> Note that even with SMP and even if we keep this code, the
> atomic_add_unless only affects gup slow on THP only, and even with that
> overhead it is much faster than before when that path was introduced.. and
> per my previous experience we don't care too much there, really.
>
> So it's literally only three paths that are relevant here on the "unless"
> overhead:
>
> - gup slow on THP (I just added it; used to be even slower..)
>
> - vivik's new path
>
> - hugepd (which may be gone for good in a few months..)
>
> IMHO none of them has perf concerns. The real perf concern paths is
> gup-fast when pgtable entry existed, but that must use atomic_add_unless()
> anyway. Even gup-slow !thp case won't regress as that uses try_get_page().

My point is primarily that we should be clear that the one thing is for
GUP-fast, and the other is for GUP-slow.

Sooner or later we'll see more code that uses try_grab_page() being
converted to folios, and people might naturally reach for try_grab_folio(),
just like we did with Vivek's code.

And I don't think we'll want to make GUP-slow in general use
try_grab_folio() in the future.

So ...

>
> So again, IMHO the easist way to fix this WARN is we drop the TINY_RCU bit,
> if nobody worries on UP perf.
>
> I don't have a strong opinion, if any of us really worry about above three
> use cases on "unless" overhead, and think it worthwhile to add the code to
> support it, I won't object. But to me it's adding pain with no real benefit
> we could ever measure, and adding complexity to backport too since we'll
> need a fix for as old as 6.6.

... for the sake of fixing this WARN, I don't primarily care. Adjusting the
TINY_RCU code feels wrong because I suspect somebody had good reasons to do
it like that, and it actually reported something valuable (using the wrong
function for the job).

In any case, if we take the easy route to fix the WARN, I'll come back
and clean the functions here up properly.

--
Cheers,

David / dhildenb


2024-06-03 20:45:17

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Mon, Jun 3, 2024 at 1:38 PM David Hildenbrand <[email protected]> wrote:
>
> >> try_get_folio() is all about grabbing a folio that might get freed
> >> concurrently. That's why it calls folio_ref_try_add_rcu() and does
> >> complicated stuff.
> >
> > IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
> >
> > If we want to be crystal clear on that and if we think that's what we want,
> > again I would suggest we rename it differently from try_get_page() to avoid
> > future misuses, then add documents. We may want to also even assert the
>
> Yes, absolutely.
>
> > rcu/irq implications in try_get_folio() at entrance, then that'll be
> > detected even without TINY_RCU config.
> >
> >>
> >> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
> >> essentially a atomic_add_unless(), which in the worst case ends up being a
> >> cmpxchg loop.
> >>
> >>
> >> Stating that we should be using try_get_folio() in paths where we are sure
> >> the folio refcount is not 0 is the same as using folio_try_get() where
> >> folio_get() would be sufficient.
> >>
> >> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
> >> using a function in the wrong context, although in our case, it is safe to
> >> use (there is now BUG). Which is true, because we know we have a folio
> >> reference and can simply use a simple folio_ref_add().
> >>
> >> Again, just like we have folio_get() and folio_try_get(), we should
> >> distinguish in GUP whether we are adding more reference to a folio (and
> >> effectively do what folio_get() would), or whether we are actually grabbing
> >> a folio that could be freed concurrently (what folio_try_get() would do).
> >
> > Yes we can. Again, IMHO it's a matter of whether it will worth it.
> >
> > Note that even with SMP and even if we keep this code, the
> > atomic_add_unless only affects gup slow on THP only, and even with that
> > overhead it is much faster than before when that path was introduced.. and
> > per my previous experience we don't care too much there, really.
> >
> > So it's literally only three paths that are relevant here on the "unless"
> > overhead:
> >
> > - gup slow on THP (I just added it; used to be even slower..)
> >
> > - vivik's new path
> >
> > - hugepd (which may be gone for good in a few months..)
> >
> > IMHO none of them has perf concerns. The real perf concern paths is
> > gup-fast when pgtable entry existed, but that must use atomic_add_unless()
> > anyway. Even gup-slow !thp case won't regress as that uses try_get_page().
>
> My point is primarily that we should be clear that the one thing is
> GUP-fast, and the other is for GUP-slow.
>
> Sooner or later we'll see more code that uses try_grab_page() to be
> converted to folios, and people might naturally use try_grab_folio(),
> just like we did with Vivik's code.
>
> And I don't think we'll want to make GUP-slow in general using
> try_grab_folio() in the future.
>
> So ...
>
> >
> > So again, IMHO the easist way to fix this WARN is we drop the TINY_RCU bit,
> > if nobody worries on UP perf.
> >
> > I don't have a strong opinion, if any of us really worry about above three
> > use cases on "unless" overhead, and think it worthwhile to add the code to
> > support it, I won't object. But to me it's adding pain with no real benefit
> > we could ever measure, and adding complexity to backport too since we'll
> > need a fix for as old as 6.6.
>
> ... for the sake of fixing this WARN, I don't primarily care. Adjusting
> the TINY_RCU feels wrong because I suspect somebody had good reasons to
> do it like that, and it actually reported something valuable (using the
> wrong function for the job).

I think this is the major concern about which fix we should apply. If that
TINY_RCU optimization still makes sense and is useful, we'd better
keep it. But I can't tell; leaving it as-is may be safer.
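
(For reference while reading this thread: the TINY_RCU optimization under
discussion is the CONFIG_TINY_RCU fastpath in folio_ref_try_add_rcu() in
include/linux/page_ref.h. Roughly, paraphrased from the 6.7-era source, so
details may differ between kernel versions:

        static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
        {
        #ifdef CONFIG_TINY_RCU
                /*
                 * UP with TINY_RCU: rcu_read_lock() disables preemption, and
                 * the caller guarantees the folio cannot be freed from
                 * interrupt context, so the refcount cannot drop to zero here
                 * and a plain add is enough.
                 */
        # ifdef CONFIG_PREEMPT_COUNT
                VM_BUG_ON(!in_atomic() && !irqs_disabled());
        # endif
                VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
                folio_ref_add(folio, count);
        #else
                /* SMP: the folio may be freed concurrently, so add only if != 0. */
                if (unlikely(!folio_ref_add_unless(folio, count, 0)))
                        return false;
        #endif
                return true;
        }

The in_atomic()/irqs_disabled() assertion in the TINY_RCU branch appears to
be the check behind the reported kernel BUG at page_ref.h.)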

>
> In any case, if we take the easy route to fix the WARN, I'll come back
> and clean the functions here up properly.
>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-03 21:02:10

by David Hildenbrand

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On 03.06.24 22:44, Yang Shi wrote:
> On Mon, Jun 3, 2024 at 1:38 PM David Hildenbrand <[email protected]> wrote:
>>
>>>> try_get_folio() is all about grabbing a folio that might get freed
>>>> concurrently. That's why it calls folio_ref_try_add_rcu() and does
>>>> complicated stuff.
>>>
>>> IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
>>>
>>> If we want to be crystal clear on that and if we think that's what we want,
>>> again I would suggest we rename it differently from try_get_page() to avoid
>>> future misuses, then add documents. We may want to also even assert the
>>
>> Yes, absolutely.
>>
>>> rcu/irq implications in try_get_folio() at entrance, then that'll be
>>> detected even without TINY_RCU config.
>>>
>>>>
>>>> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
>>>> essentially an atomic_add_unless(), which in the worst case ends up being a
>>>> cmpxchg loop.
>>>>
>>>>
>>>> Stating that we should be using try_get_folio() in paths where we are sure
>>>> the folio refcount is not 0 is the same as using folio_try_get() where
>>>> folio_get() would be sufficient.
>>>>
>>>> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
>>>> using a function in the wrong context, although in our case, it is safe to
>>>> use (there is no BUG). Which is true, because we know we have a folio
>>>> reference and can simply use folio_ref_add().
>>>>
>>>> Again, just like we have folio_get() and folio_try_get(), we should
>>>> distinguish in GUP whether we are adding more references to a folio (and
>>>> effectively do what folio_get() would), or whether we are actually grabbing
>>>> a folio that could be freed concurrently (what folio_try_get() would do).
>>>
>>> Yes we can. Again, IMHO it's a matter of whether it will be worth it.
>>>
>>> Note that even with SMP and even if we keep this code, the
>>> atomic_add_unless only affects gup slow on THP, and even with that
>>> overhead it is much faster than before that path was introduced.. and
>>> per my previous experience we don't care too much there, really.
>>>
>>> So it's literally only three paths that are relevant here on the "unless"
>>> overhead:
>>>
>>> - gup slow on THP (I just added it; used to be even slower..)
>>>
>>> - vivik's new path
>>>
>>> - hugepd (which may be gone for good in a few months..)
>>>
>>> IMHO none of them has perf concerns. The real perf concern path is
>>> gup-fast when the pgtable entry exists, but that must use atomic_add_unless()
>>> anyway. Even the gup-slow !thp case won't regress as that uses try_get_page().
>>
>> My point is primarily that we should be clear that the one thing is
>> GUP-fast, and the other is for GUP-slow.
>>
>> Sooner or later we'll see more code that uses try_grab_page() to be
>> converted to folios, and people might naturally use try_grab_folio(),
>> just like we did with Vivik's code.
>>
>> And I don't think we'll want to make GUP-slow in general using
>> try_grab_folio() in the future.
>>
>> So ...
>>
>>>
>>> So again, IMHO the easiest way to fix this WARN is to drop the TINY_RCU bit,
>>> if nobody worries about UP perf.
>>>
>>> I don't have a strong opinion, if any of us really worry about above three
>>> use cases on "unless" overhead, and think it worthwhile to add the code to
>>> support it, I won't object. But to me it's adding pain with no real benefit
>>> we could ever measure, and adding complexity to backport too since we'll
>>> need a fix for as old as 6.6.
>>
>> ... for the sake of fixing this WARN, I don't primarily care. Adjusting
>> the TINY_RCU feels wrong because I suspect somebody had good reasons to
>> do it like that, and it actually reported something valuable (using the
>> wrong function for the job).
>
> I think this is the major concern about which fix we should apply. If that
> TINY_RCU optimization still makes sense and is useful, we'd better
> keep it. But I can't tell; leaving it as-is may be safer.

Willy moved that code in 020853b6f5e and I think it dates back to e286781d5f2e.

That contained:

+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
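
(For comparison with today's helpers: the SMP "speculative get" branch of
that 2008 code survives as folio_try_get(), while the case where the caller
already holds a reference is plain folio_get(). Roughly, paraphrased from
the current include/linux/page_ref.h and include/linux/mm.h:

        /* The folio may be freed concurrently; succeed only if refcount != 0. */
        static inline bool folio_try_get(struct folio *folio)
        {
                return folio_ref_add_unless(folio, 1, 0);
        }

        /* The caller already holds a reference; an unconditional increment. */
        static inline void folio_get(struct folio *folio)
        {
                VM_BUG_ON_FOLIO(folio_ref_zero_or_close_to_overflow(folio), folio);
                folio_ref_inc(folio);
        }

This is the same folio_get()/folio_try_get() distinction being argued for
the GUP helpers above.)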


--
Cheers,

David / dhildenb


2024-06-03 21:11:55

by Peter Xu

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Mon, Jun 03, 2024 at 10:37:55PM +0200, David Hildenbrand wrote:
> > > try_get_folio() is all about grabbing a folio that might get freed
> > > concurrently. That's why it calls folio_ref_try_add_rcu() and does
> > > complicated stuff.
> >
> > IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
> >
> > If we want to be crystal clear on that and if we think that's what we want,
> > again I would suggest we rename it differently from try_get_page() to avoid
> > future misuses, then add documents. We may want to also even assert the
>
> Yes, absolutely.
>
> > rcu/irq implications in try_get_folio() at entrance, then that'll be
> > detected even without TINY_RCU config.
> >
> > >
> > > On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
> > > essentially an atomic_add_unless(), which in the worst case ends up being a
> > > cmpxchg loop.
> > >
> > >
> > > Stating that we should be using try_get_folio() in paths where we are sure
> > > the folio refcount is not 0 is the same as using folio_try_get() where
> > > folio_get() would be sufficient.
> > >
> > > The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
> > > using a function in the wrong context, although in our case, it is safe to
> > > use (there is no BUG). Which is true, because we know we have a folio
> > > reference and can simply use folio_ref_add().
> > >
> > > Again, just like we have folio_get() and folio_try_get(), we should
> > > distinguish in GUP whether we are adding more references to a folio (and
> > > effectively do what folio_get() would), or whether we are actually grabbing
> > > a folio that could be freed concurrently (what folio_try_get() would do).
> >
> > Yes we can. Again, IMHO it's a matter of whether it will be worth it.
> >
> > Note that even with SMP and even if we keep this code, the
> > atomic_add_unless only affects gup slow on THP, and even with that
> > overhead it is much faster than before that path was introduced.. and
> > per my previous experience we don't care too much there, really.
> >
> > So it's literally only three paths that are relevant here on the "unless"
> > overhead:
> >
> > - gup slow on THP (I just added it; used to be even slower..)
> >
> > - vivik's new path
> >
> > - hugepd (which may be gone for good in a few months..)
> > IMHO none of them has perf concerns. The real perf concern path is
> > gup-fast when the pgtable entry exists, but that must use atomic_add_unless()
> > anyway. Even the gup-slow !thp case won't regress as that uses try_get_page().
>
> My point is primarily that we should be clear that the one thing is
> GUP-fast, and the other is for GUP-slow.

Yes, understood.

>
> Sooner or later we'll see more code that uses try_grab_page() to be
> converted to folios, and people might naturally use try_grab_folio(), just
> like we did with Vivik's code.
>
> And I don't think we'll want to make GUP-slow in general using
> try_grab_folio() in the future.
>
> So ...
>
> >
> > So again, IMHO the easiest way to fix this WARN is to drop the TINY_RCU bit,
> > if nobody worries about UP perf.
> >
> > I don't have a strong opinion, if any of us really worry about above three
> > use cases on "unless" overhead, and think it worthwhile to add the code to
> > support it, I won't object. But to me it's adding pain with no real benefit
> > we could ever measure, and adding complexity to backport too since we'll
> > need a fix for as old as 6.6.
>
> ... for the sake of fixing this WARN, I don't primarily care. Adjusting the
> TINY_RCU feels wrong because I suspect somebody had good reasons to do it
> like that, and it actually reported something valuable (using the wrong
> function for the job).
>
> In any case, if we take the easy route to fix the WARN, I'll come back and
> clean the functions here up properly.

Definitely, then there can also be some measurements which will be even
better. I mean, if the diff is minimal, we can be clearer on the path we
choose; while if it shows improvements we have more solid results than
predictions and discussions.

Yes I do worry about the UP change too, hence I sincerely was trying to
collect some feedback. My current guess is that UP was still important in
2008 when the code was first written, and maybe that has changed over the 16 years. I
just noticed Nicholas wrote it; I know he's still active so I've added him
into the loop too.

Just for easier reference, the commit that introduced the UP change is:

commit e286781d5f2e9c846e012a39653a166e9d31777d
Author: Nicholas Piggin <[email protected]>
Date: Fri Jul 25 19:45:30 2008 -0700

mm: speculative page references

+static inline int page_cache_get_speculative(struct page *page)
+{
+ VM_BUG_ON(in_interrupt());
+
+#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
+# ifdef CONFIG_PREEMPT
+ VM_BUG_ON(!in_atomic());
+# endif
+ /*
+ * Preempt must be disabled here - we rely on rcu_read_lock doing
+ * this for us.
+ *
+ * Pagecache won't be truncated from interrupt context, so if we have
+ * found a page in the radix tree here, we have pinned its refcount by
+ * disabling preempt, and hence no need for the "speculative get" that
+ * SMP requires.
+ */
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_inc(&page->_count);
+
+#else
+ if (unlikely(!get_page_unless_zero(page))) {
+ /*
+ * Either the page has been freed, or will be freed.
+ * In either case, retry here and the caller should
+ * do the right thing (see comments above).
+ */
+ return 0;
+ }
+#endif
+ VM_BUG_ON(PageTail(page));
+
+ return 1;
+}

Thanks,

--
Peter Xu


2024-06-03 21:13:49

by David Hildenbrand

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On 03.06.24 23:03, Peter Xu wrote:
> On Mon, Jun 03, 2024 at 10:37:55PM +0200, David Hildenbrand wrote:
>>>> try_get_folio() is all about grabbing a folio that might get freed
>>>> concurrently. That's why it calls folio_ref_try_add_rcu() and does
>>>> complicated stuff.
>>>
>>> IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
>>>
>>> If we want to be crystal clear on that and if we think that's what we want,
>>> again I would suggest we rename it differently from try_get_page() to avoid
>>> future misuses, then add documents. We may want to also even assert the
>>
>> Yes, absolutely.
>>
>>> rcu/irq implications in try_get_folio() at entrance, then that'll be
>>> detected even without TINY_RCU config.
>>>
>>>>
>>>> On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
>>>> essentially an atomic_add_unless(), which in the worst case ends up being a
>>>> cmpxchg loop.
>>>>
>>>>
>>>> Stating that we should be using try_get_folio() in paths where we are sure
>>>> the folio refcount is not 0 is the same as using folio_try_get() where
>>>> folio_get() would be sufficient.
>>>>
>>>> The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
>>>> using a function in the wrong context, although in our case, it is safe to
>>>> use (there is no BUG). Which is true, because we know we have a folio
>>>> reference and can simply use folio_ref_add().
>>>>
>>>> Again, just like we have folio_get() and folio_try_get(), we should
>>>> distinguish in GUP whether we are adding more references to a folio (and
>>>> effectively do what folio_get() would), or whether we are actually grabbing
>>>> a folio that could be freed concurrently (what folio_try_get() would do).
>>>
>>> Yes we can. Again, IMHO it's a matter of whether it will be worth it.
>>>
>>> Note that even with SMP and even if we keep this code, the
>>> atomic_add_unless only affects gup slow on THP, and even with that
>>> overhead it is much faster than before that path was introduced.. and
>>> per my previous experience we don't care too much there, really.
>>>
>>> So it's literally only three paths that are relevant here on the "unless"
>>> overhead:
>>>
>>> - gup slow on THP (I just added it; used to be even slower..)
>>>
>>> - vivik's new path
>>>
>>> - hugepd (which may be gone for good in a few months..)
>>> IMHO none of them has perf concerns. The real perf concern path is
>>> gup-fast when the pgtable entry exists, but that must use atomic_add_unless()
>>> anyway. Even the gup-slow !thp case won't regress as that uses try_get_page().
>>
>> My point is primarily that we should be clear that the one thing is
>> GUP-fast, and the other is for GUP-slow.
>
> Yes, understood.
>
>>
>> Sooner or later we'll see more code that uses try_grab_page() to be
>> converted to folios, and people might naturally use try_grab_folio(), just
>> like we did with Vivik's code.
>>
>> And I don't think we'll want to make GUP-slow in general using
>> try_grab_folio() in the future.
>>
>> So ...
>>
>>>
>>> So again, IMHO the easiest way to fix this WARN is to drop the TINY_RCU bit,
>>> if nobody worries about UP perf.
>>>
>>> I don't have a strong opinion, if any of us really worry about above three
>>> use cases on "unless" overhead, and think it worthwhile to add the code to
>>> support it, I won't object. But to me it's adding pain with no real benefit
>>> we could ever measure, and adding complexity to backport too since we'll
>>> need a fix for as old as 6.6.
>>
>> ... for the sake of fixing this WARN, I don't primarily care. Adjusting the
>> TINY_RCU feels wrong because I suspect somebody had good reasons to do it
>> like that, and it actually reported something valuable (using the wrong
>> function for the job).
>>
>> In any case, if we take the easy route to fix the WARN, I'll come back and
>> clean the functions here up properly.
>
> Definitely, then there can also be some measurements which will be even
> better. I mean, if the diff is minimal, we can be clearer on the path we
> choose; while if it shows improvements we have more solid results than
> predictions and discussions.
>
> Yes I do worry about the UP change too, hence I sincerely was trying to
> collect some feedback. My current guess is that UP was still important in
> 2008 when the code was first written, and maybe that has changed over the 16 years. I
> just noticed Nicholas wrote it; I know he's still active so I've added him
> into the loop too.
>
> Just for easier reference, the commit that introduced the UP change is:
>
> commit e286781d5f2e9c846e012a39653a166e9d31777d
> Author: Nicholas Piggin <[email protected]>
> Date: Fri Jul 25 19:45:30 2008 -0700
>
> mm: speculative page references
>
> +static inline int page_cache_get_speculative(struct page *page)
> +{
> + VM_BUG_ON(in_interrupt());
> +
> +#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
> +# ifdef CONFIG_PREEMPT
> + VM_BUG_ON(!in_atomic());
> +# endif
> + /*
> + * Preempt must be disabled here - we rely on rcu_read_lock doing
> + * this for us.
> + *
> + * Pagecache won't be truncated from interrupt context, so if we have
> + * found a page in the radix tree here, we have pinned its refcount by
> + * disabling preempt, and hence no need for the "speculative get" that
> + * SMP requires.
> + */
> + VM_BUG_ON(page_count(page) == 0);
> + atomic_inc(&page->_count);
> +
> +#else
> + if (unlikely(!get_page_unless_zero(page))) {
> + /*
> + * Either the page has been freed, or will be freed.
> + * In either case, retry here and the caller should
> + * do the right thing (see comments above).
> + */
> + return 0;
> + }
> +#endif
> + VM_BUG_ON(PageTail(page));
> +
> + return 1;
> +}
>
> Thanks,
>

I chased it further to:

commit 8375ad98cc1defc36adf4a77d9ea1e71db51a371
Author: Paul E. McKenney <[email protected]>
Date: Mon Apr 29 15:06:13 2013 -0700

vm: adjust ifdef for TINY_RCU

There is an ifdef in page_cache_get_speculative() that checks for !SMP
and TREE_RCU, which has been an impossible combination since the advent
of TINY_RCU. The ifdef enables a fastpath that is valid when preemption
is disabled by rcu_read_lock() in UP systems, which is the case when
TINY_RCU is enabled. This commit therefore adjusts the ifdef to
generate the fastpath when TINY_RCU is enabled.


Where Paul explicitly restored that fastpath for TINY_RCU instead of removing that code.

So maybe Paul can comment if that is still worth having. CCing him.

--
Cheers,

David / dhildenb


2024-06-03 22:44:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Mon, Jun 03, 2024 at 11:05:44PM +0200, David Hildenbrand wrote:
> On 03.06.24 23:03, Peter Xu wrote:
> > On Mon, Jun 03, 2024 at 10:37:55PM +0200, David Hildenbrand wrote:
> > > > > try_get_folio() is all about grabbing a folio that might get freed
> > > > > concurrently. That's why it calls folio_ref_try_add_rcu() and does
> > > > > complicated stuff.
> > > >
> > > > IMHO we can define it.. e.g. try_get_page() wasn't defined as so.
> > > >
> > > > If we want to be crystal clear on that and if we think that's what we want,
> > > > again I would suggest we rename it differently from try_get_page() to avoid
> > > > future misuses, then add documents. We may want to also even assert the
> > >
> > > Yes, absolutely.
> > >
> > > > rcu/irq implications in try_get_folio() at entrance, then that'll be
> > > > detected even without TINY_RCU config.
> > > >
> > > > >
> > > > > On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's
> > > > > essentially an atomic_add_unless(), which in the worst case ends up being a
> > > > > cmpxchg loop.
> > > > >
> > > > >
> > > > > Stating that we should be using try_get_folio() in paths where we are sure
> > > > > the folio refcount is not 0 is the same as using folio_try_get() where
> > > > > folio_get() would be sufficient.
> > > > >
> > > > > The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are
> > > > > using a function in the wrong context, although in our case, it is safe to
> > > > > use (there is no BUG). Which is true, because we know we have a folio
> > > > > reference and can simply use folio_ref_add().
> > > > >
> > > > > Again, just like we have folio_get() and folio_try_get(), we should
> > > > > distinguish in GUP whether we are adding more references to a folio (and
> > > > > effectively do what folio_get() would), or whether we are actually grabbing
> > > > > a folio that could be freed concurrently (what folio_try_get() would do).
> > > >
> > > > Yes we can. Again, IMHO it's a matter of whether it will be worth it.
> > > >
> > > > Note that even with SMP and even if we keep this code, the
> > > > atomic_add_unless only affects gup slow on THP, and even with that
> > > > overhead it is much faster than before that path was introduced.. and
> > > > per my previous experience we don't care too much there, really.
> > > >
> > > > So it's literally only three paths that are relevant here on the "unless"
> > > > overhead:
> > > >
> > > > - gup slow on THP (I just added it; used to be even slower..)
> > > >
> > > > - vivik's new path
> > > >
> > > > - hugepd (which may be gone for good in a few months..)
> > > > IMHO none of them has perf concerns. The real perf concern path is
> > > > gup-fast when the pgtable entry exists, but that must use atomic_add_unless()
> > > > anyway. Even the gup-slow !thp case won't regress as that uses try_get_page().
> > >
> > > My point is primarily that we should be clear that the one thing is
> > > GUP-fast, and the other is for GUP-slow.
> >
> > Yes, understood.
> >
> > >
> > > Sooner or later we'll see more code that uses try_grab_page() to be
> > > converted to folios, and people might naturally use try_grab_folio(), just
> > > like we did with Vivik's code.
> > >
> > > And I don't think we'll want to make GUP-slow in general using
> > > try_grab_folio() in the future.
> > >
> > > So ...
> > >
> > > >
> > > > So again, IMHO the easiest way to fix this WARN is to drop the TINY_RCU bit,
> > > > if nobody worries about UP perf.
> > > >
> > > > I don't have a strong opinion, if any of us really worry about above three
> > > > use cases on "unless" overhead, and think it worthwhile to add the code to
> > > > support it, I won't object. But to me it's adding pain with no real benefit
> > > > we could ever measure, and adding complexity to backport too since we'll
> > > > need a fix for as old as 6.6.
> > >
> > > ... for the sake of fixing this WARN, I don't primarily care. Adjusting the
> > > TINY_RCU feels wrong because I suspect somebody had good reasons to do it
> > > like that, and it actually reported something valuable (using the wrong
> > > function for the job).
> > >
> > > In any case, if we take the easy route to fix the WARN, I'll come back and
> > > clean the functions here up properly.
> >
> > Definitely, then there can also be some measurements which will be even
> > better. I mean, if the diff is minimal, we can be clearer on the path we
> > choose; while if it shows improvements we have more solid results than
> > predictions and discussions.
> >
> > Yes I do worry about the UP change too, hence I sincerely was trying to
> > collect some feedback. My current guess is that UP was still important in
> > 2008 when the code was first written, and maybe that has changed over the 16 years. I
> > just noticed Nicholas wrote it; I know he's still active so I've added him
> > into the loop too.
> >
> > Just for easier reference, the commit that introduced the UP change is:
> >
> > commit e286781d5f2e9c846e012a39653a166e9d31777d
> > Author: Nicholas Piggin <[email protected]>
> > Date: Fri Jul 25 19:45:30 2008 -0700
> >
> > mm: speculative page references
> >
> > +static inline int page_cache_get_speculative(struct page *page)
> > +{
> > + VM_BUG_ON(in_interrupt());
> > +
> > +#if !defined(CONFIG_SMP) && defined(CONFIG_CLASSIC_RCU)
> > +# ifdef CONFIG_PREEMPT
> > + VM_BUG_ON(!in_atomic());
> > +# endif
> > + /*
> > + * Preempt must be disabled here - we rely on rcu_read_lock doing
> > + * this for us.
> > + *
> > + * Pagecache won't be truncated from interrupt context, so if we have
> > + * found a page in the radix tree here, we have pinned its refcount by
> > + * disabling preempt, and hence no need for the "speculative get" that
> > + * SMP requires.
> > + */
> > + VM_BUG_ON(page_count(page) == 0);
> > + atomic_inc(&page->_count);
> > +
> > +#else
> > + if (unlikely(!get_page_unless_zero(page))) {
> > + /*
> > + * Either the page has been freed, or will be freed.
> > + * In either case, retry here and the caller should
> > + * do the right thing (see comments above).
> > + */
> > + return 0;
> > + }
> > +#endif
> > + VM_BUG_ON(PageTail(page));
> > +
> > + return 1;
> > +}
> >
> > Thanks,
> >
>
> I chased it further to:
>
> commit 8375ad98cc1defc36adf4a77d9ea1e71db51a371
> Author: Paul E. McKenney <[email protected]>
> Date: Mon Apr 29 15:06:13 2013 -0700
>
> vm: adjust ifdef for TINY_RCU
> There is an ifdef in page_cache_get_speculative() that checks for !SMP
> and TREE_RCU, which has been an impossible combination since the advent
> of TINY_RCU. The ifdef enables a fastpath that is valid when preemption
> is disabled by rcu_read_lock() in UP systems, which is the case when
> TINY_RCU is enabled. This commit therefore adjusts the ifdef to
> generate the fastpath when TINY_RCU is enabled.
>
>
> Where Paul explicitly restored that fastpath for TINY_RCU instead of removing that code.
>
> So maybe Paul can comment if that is still worth having. CCing him.

It is currently an atomic operation either way, though the folio_ref_add()
avoids full ordering, but that is immaterial on x86. Some say that it is
in the noise on server-class ARMv8 as well, though they have also said
a great many things in the past. But if that is true, the big benefit
of the TINY_RCU check is that folio_ref_try_add_rcu() is guaranteed not
to fail in that case (single CPU with preemption disabled). Except that
everyone has to check the return value anyway, right?

So the usual advice, unsatisfying though it might be, is to remove that
#ifdef and see if anyone notices.

After all, both 2013 and 2008 were quite some time ago. ;-)

Thanx, Paul
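
(A minimal illustration of the two operations Paul contrasts, written as a
hypothetical helper rather than real kernel code; only folio_ref_add() and
folio_ref_add_unless() are the actual page_ref.h primitives discussed here:

        /* Hypothetical, for illustration only. */
        static struct folio *grab_refs(struct folio *folio, int refs,
                                       bool ref_already_held)
        {
                if (ref_already_held) {
                        /*
                         * Unconditional add: returns no value, so
                         * architectures such as arm64 need not give it full
                         * ordering.
                         */
                        folio_ref_add(folio, refs);
                        return folio;
                }
                /*
                 * The refcount may concurrently reach zero: a conditional,
                 * value-returning RMW (atomic_add_unless() underneath),
                 * fully ordered where that matters; on x86 both forms are
                 * LOCK-prefixed, so the difference is immaterial there.
                 */
                if (!folio_ref_add_unless(folio, refs, 0))
                        return NULL;    /* being freed; the caller must retry */
                return folio;
        }
)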

2024-06-04 17:36:19

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

> >
> > I chased it further to:
> >
> > commit 8375ad98cc1defc36adf4a77d9ea1e71db51a371
> > Author: Paul E. McKenney <[email protected]>
> > Date: Mon Apr 29 15:06:13 2013 -0700
> >
> > vm: adjust ifdef for TINY_RCU
> > There is an ifdef in page_cache_get_speculative() that checks for !SMP
> > and TREE_RCU, which has been an impossible combination since the advent
> > of TINY_RCU. The ifdef enables a fastpath that is valid when preemption
> > is disabled by rcu_read_lock() in UP systems, which is the case when
> > TINY_RCU is enabled. This commit therefore adjusts the ifdef to
> > generate the fastpath when TINY_RCU is enabled.
> >
> >
> > Where Paul explicitly restored that fastpath for TINY_RCU instead of removing that code.
> >
> > So maybe Paul can comment if that is still worth having. CCing him.
>
> It is currently an atomic operation either way, though the folio_ref_add()
> avoids full ordering, but that is immaterial on x86. Some say that it is
> in the noise on server-class ARMv8 as well, though they have also said
> a great many things in the past. But if that is true, the big benefit
> of the TINY_RCU check is that folio_ref_try_add_rcu() is guaranteed not
> to fail in that case (single CPU with preemption disabled). Except that
> everyone has to check the return value anyway, right?
>
> So the usual advice, unsatisfying though it might be, is to remove that
> #ifdef and see if anyone notices.
>
> After all, both 2013 and 2008 were quite some time ago. ;-)

Thanks, Paul.

I will submit a patch to remove the #ifdef as the fix for the bug
report, and do the cleanup in a separate patch, as David prefers.
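
(A sketch of what removing that #ifdef might look like against the 6.7-era
include/linux/page_ref.h, with the comment lines of the removed branch
omitted for brevity; this only illustrates the approach, it is not the patch
that was actually submitted:

         static inline bool folio_ref_try_add_rcu(struct folio *folio, int count)
         {
        -#ifdef CONFIG_TINY_RCU
        -# ifdef CONFIG_PREEMPT_COUNT
        -       VM_BUG_ON(!in_atomic() && !irqs_disabled());
        -# endif
        -       VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
        -       folio_ref_add(folio, count);
        -#else
                if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
                        /* Either the folio has been freed, or will be freed. */
                        return false;
                }
        -#endif
                return true;
         }
)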

>
> Thanx, Paul

2024-06-04 23:54:20

by Yang Shi

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <[email protected]> wrote:
>
> On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <[email protected]> wrote:
> >
> > hi, Yang Shi,
> >
> > On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > > Hi Oliver,
> > >
> > > I just came up with a quick patch (just build test) per the discussion
> > > and attached, can you please give it a try? Once it is verified, I
> > > will refine the patch and submit for review.
> >
> > what's the base of this patch? I tried to apply it on top of efa7df3e3b or
> > v6.10-rc2; both failed.
>
> Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> have mentioned this.

I just figured out a bug in the patch. Anyway, we are going to take a
different approach to fix the issue per the discussion. I already sent
the series to the mailing list. Please refer to
https://lore.kernel.org/linux-mm/[email protected]/

>
> >
> > >
> > > Thanks,
> > > Yang
> > >
> > > >
> > > > >
> > > > >
> > > > > --
> > > > > Cheers,
> > > > >
> > > > > David / dhildenb
> > > > >
> >
> >

2024-06-06 02:16:33

by kernel test robot

[permalink] [raw]
Subject: Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

hi, Yang Shi,

On Tue, Jun 04, 2024 at 04:53:56PM -0700, Yang Shi wrote:
> On Mon, Jun 3, 2024 at 9:54 AM Yang Shi <[email protected]> wrote:
> >
> > On Mon, Jun 3, 2024 at 7:02 AM Oliver Sang <[email protected]> wrote:
> > >
> > > hi, Yang Shi,
> > >
> > > On Fri, May 31, 2024 at 01:57:06PM -0700, Yang Shi wrote:
> > > > Hi Oliver,
> > > >
> > > > I just came up with a quick patch (just build test) per the discussion
> > > > and attached, can you please give it a try? Once it is verified, I
> > > > will refine the patch and submit for review.
> > >
> > > what's the base of this patch? I tried to apply it on top of efa7df3e3b or
> > > v6.10-rc2; both failed.
> >
> > Its base is mm-unstable. The head commit is 8e06d6b9274d ("mm: add
> > swappiness= arg to memory.reclaim"). Sorry for the confusion, I should
> > have mentioned this.
>
> I just figured out a bug in the patch. Anyway, we are going to take a
> different approach to fix the issue per the discussion. I already sent
> the series to the mailing list. Please refer to
> https://lore.kernel.org/linux-mm/[email protected]/

got it. it seems you will submit a v2? should we wait for v2 to do the tests?

sorry that due to resource constraints, we cannot respond to test requests very
quickly now.

>
> >
> > >
> > > >
> > > > Thanks,
> > > > Yang
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Cheers,
> > > > > >
> > > > > > David / dhildenb
> > > > > >
> > >
> > >
>