2022-12-31 15:34:42

by Oliver Sang

[permalink] [raw]
Subject: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h


Greeting,

FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):

commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]

in testcase: rcutorture
version:
with following parameters:

runtime: 300s
test: default
torture_type: tasks-tracing

test-description: rcutorture is rcutorture kernel module load/unload test.
test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


[ 25.804432][ T214] ------------[ cut here ]------------
[ 25.804917][ T214] kernel BUG at include/linux/mm.h:825!
[ 25.805402][ T214] invalid opcode: 0000 [#1] SMP
[ 25.805820][ T214] CPU: 0 PID: 214 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #2 1c4d7707ec0ce574ed62a77e82a8580202758048
[ 25.806944][ T214] EIP: __dump_page.cold (include/linux/mm.h:825 mm/debug.c:97)
[ 25.807376][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
All code
========
0: ff (bad)
1: ff 83 05 e8 5d bb incl -0x44a217fb(%rbx)
7: c5 01 ba (bad)
a: 4c c4 rex.WR (bad)
c: 2f (bad)
d: c4 (bad)
e: 89 f8 mov %edi,%eax
10: 83 15 ec 5d bb c5 00 adcl $0x0,-0x3a44a214(%rip) # 0xffffffffc5bb5e03
17: e8 f2 92 ed fd callq 0xfffffffffded930e
1c: 83 05 f8 5d bb c5 01 addl $0x1,-0x3a44a208(%rip) # 0xffffffffc5bb5e1b
23: 83 15 fc 5d bb c5 00 adcl $0x0,-0x3a44a204(%rip) # 0xffffffffc5bb5e26
2a:* 0f 0b ud2 <-- trapping instruction
2c: 83 05 00 5e bb c5 01 addl $0x1,-0x3a44a200(%rip) # 0xffffffffc5bb5e33
33: b8 ac 85 a3 c4 mov $0xc4a385ac,%eax
38: 83 15 04 5e bb c5 00 adcl $0x0,-0x3a44a1fc(%rip) # 0xffffffffc5bb5e43
3f: e8 .byte 0xe8

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 83 05 00 5e bb c5 01 addl $0x1,-0x3a44a200(%rip) # 0xffffffffc5bb5e09
9: b8 ac 85 a3 c4 mov $0xc4a385ac,%eax
e: 83 15 04 5e bb c5 00 adcl $0x0,-0x3a44a1fc(%rip) # 0xffffffffc5bb5e19
15: e8 .byte 0xe8
[ 25.808960][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
[ 25.809578][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
[ 25.810168][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
[ 25.810803][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
[ 25.811407][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 25.811999][ T214] DR6: fffe0ff0 DR7: 00000400
[ 25.812390][ T214] Call Trace:
[ 25.812675][ T214] dump_page (mm/debug.c:131)
[ 25.813025][ T214] ? _raw_spin_lock_irqsave (kernel/locking/spinlock.c:162)
[ 25.813492][ T214] folio_flags+0x23/0x70
[ 25.813945][ T214] get_partial_node (include/linux/page-flags.h:483 mm/slab.h:140 mm/slub.c:2967 mm/slub.c:2225)
[ 25.814357][ T214] __slab_alloc_node+0xbb/0x270
[ 25.814860][ T214] kmem_cache_alloc_lru (mm/slub.c:3404 mm/slub.c:3418 mm/slub.c:3425 mm/slub.c:3441)
[ 25.815289][ T214] ? __lock_release (kernel/locking/lockdep.c:355 kernel/locking/lockdep.c:5350)
[ 25.815697][ T214] ? iget_locked (fs/inode.c:1275)
[ 25.816096][ T214] alloc_inode (include/linux/fs.h:3117 fs/inode.c:261)
[ 25.816469][ T214] iget_locked (fs/inode.c:1286)
[ 25.816829][ T214] ? lock_is_held_type (kernel/locking/lockdep.c:5409 kernel/locking/lockdep.c:5711)
[ 25.817264][ T214] kernfs_get_inode (fs/kernfs/inode.c:255)
[ 25.817670][ T214] kernfs_iop_lookup (fs/kernfs/dir.c:1154)
[ 25.818087][ T214] __lookup_slow (fs/namei.c:1685)
[ 25.818479][ T214] lookup_slow (fs/namei.c:1702)
[ 25.818847][ T214] walk_component (fs/namei.c:1993)
[ 25.819244][ T214] path_lookupat (fs/namei.c:2450 fs/namei.c:2474)
[ 25.819627][ T214] path_openat (fs/namei.c:3684 fs/namei.c:3706)
[ 25.820007][ T214] do_filp_open (fs/namei.c:3740)
[ 25.820409][ T214] do_sys_openat2 (fs/open.c:1311)
[ 25.820807][ T214] do_sys_open (fs/open.c:1326)
[ 25.821211][ T214] __ia32_sys_openat (fs/open.c:1337)
[ 25.821622][ T214] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178)
[ 25.822057][ T214] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 19))
[ 25.822480][ T214] ? __fput (fs/file_table.c:59 fs/file_table.c:333)
[ 25.822842][ T214] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4262 kernel/locking/lockdep.c:4321)
[ 25.823346][ T214] ? syscall_exit_to_user_mode (kernel/entry/common.c:299)
[ 25.823823][ T214] ? __do_fast_syscall_32 (arch/x86/entry/common.c:183)
[ 25.824259][ T214] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4262 kernel/locking/lockdep.c:4321)
[ 25.824767][ T214] ? syscall_exit_to_user_mode (kernel/entry/common.c:299)
[ 25.825254][ T214] ? __do_fast_syscall_32 (arch/x86/entry/common.c:183)
[ 25.825696][ T214] ? __do_fast_syscall_32 (arch/x86/entry/common.c:183)
[ 25.826155][ T214] ? syscall_exit_to_user_mode (kernel/entry/common.c:299)
[ 25.826627][ T214] ? __do_fast_syscall_32 (arch/x86/entry/common.c:183)
[ 25.827056][ T214] ? __do_fast_syscall_32 (arch/x86/entry/common.c:183)
[ 25.827486][ T214] ? __do_fast_syscall_32 (arch/x86/entry/common.c:183)
[ 25.827929][ T214] ? irqentry_exit_to_user_mode (kernel/entry/common.c:312)
[ 25.828423][ T214] ? irqentry_exit (kernel/entry/common.c:445)
[ 25.828812][ T214] do_fast_syscall_32 (arch/x86/entry/common.c:203)
[ 25.829223][ T214] do_SYSENTER_32 (arch/x86/entry/common.c:247)
[ 25.829589][ T214] entry_SYSENTER_32 (arch/x86/entry/entry_32.S:867)
[ 25.830003][ T214] EIP: 0xb7f8c549
[ 25.830330][ T214] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
All code
========
0: 03 74 c0 01 add 0x1(%rax,%rax,8),%esi
4: 10 05 03 74 b8 01 adc %al,0x1b87403(%rip) # 0x1b8740d
a: 10 06 adc %al,(%rsi)
c: 03 74 b4 01 add 0x1(%rsp,%rsi,4),%esi
10: 10 07 adc %al,(%rdi)
12: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi
16: 10 08 adc %cl,(%rax)
18: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi
1c: 00 00 add %al,(%rax)
1e: 00 00 add %al,(%rax)
20: 00 51 52 add %dl,0x52(%rcx)
23: 55 push %rbp
24: 89 e5 mov %esp,%ebp
26: 0f 34 sysenter
28: cd 80 int $0x80
2a:* 5d pop %rbp <-- trapping instruction
2b: 5a pop %rdx
2c: 59 pop %rcx
2d: c3 retq
2e: 90 nop
2f: 90 nop
30: 90 nop
31: 90 nop
32: 8d 76 00 lea 0x0(%rsi),%esi
35: 58 pop %rax
36: b8 77 00 00 00 mov $0x77,%eax
3b: cd 80 int $0x80
3d: 90 nop
3e: 8d .byte 0x8d
3f: 76 .byte 0x76

Code starting with the faulting instruction
===========================================
0: 5d pop %rbp
1: 5a pop %rdx
2: 59 pop %rcx
3: c3 retq
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 8d 76 00 lea 0x0(%rsi),%esi
b: 58 pop %rax
c: b8 77 00 00 00 mov $0x77,%eax
11: cd 80 int $0x80
13: 90 nop
14: 8d .byte 0x8d
15: 76 .byte 0x76


To reproduce:

# build kernel
cd linux
cp config-6.1.0-rc2-00014-g0af8489b0216 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



Attachments:
(No filename) (9.38 kB)
config-6.1.0-rc2-00014-g0af8489b0216 (149.80 kB)
job-script (5.63 kB)
dmesg.xz (55.53 kB)
rcutorture (272.20 kB)
Download all attachments

2023-01-01 06:16:43

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On Sat, Dec 31, 2022 at 11:26:25PM +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):
>
> commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]
>
> in testcase: rcutorture
> version:
> with following parameters:
>
> runtime: 300s
> test: default
> torture_type: tasks-tracing
>
> test-description: rcutorture is rcutorture kernel module load/unload test.
> test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-lkp/[email protected]
>
>

<snip>

> Failed to start Update UTMP about System Boot/Shutdown.
> See 'systemctl status systemd-update-utmp.service' for details.
> page:e660911a refcount:0 mapcount:0 mapping:00000000 index:0xedaeef00 pfn:0x2daee
> page:0946d53a refcount:0 mapcount:0 mapping:00000000 index:0x0 pfn:0x2daec
> flags: 0x0(zone=0)
> raw: 00000000 e764d494 e6f205b4 00000000 00000000 00020000 ffffffff 00000000
> raw: 00000000 00000000
> page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
> page_owner tracks the page as freed

the page is freed state.

> page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 208, tgid 208 (systemd-udevd), ts 25780391126, free_ts 25780421356
> post_alloc_hook+0x1fa/0x280
> get_page_from_freelist+0x226/0x310
> __alloc_pages+0xdd/0x360
> alloc_slab_page+0x12d/0x200
> allocate_slab+0x6a/0x350
> new_slab+0x48/0xc0
> __slab_alloc_node+0xfb/0x270
> kmem_cache_alloc+0x8f/0x4e0
> getname_flags+0x33/0x2f0
> getname+0x1a/0x30
> do_sys_openat2+0xa5/0x1f0
> do_sys_open+0x8e/0xe0
> __ia32_sys_openat+0x2b/0x40
> __do_fast_syscall_32+0x72/0xd0
> do_fast_syscall_32+0x32/0x70
> do_SYSENTER_32+0x15/0x20

allocated by slab

> page last free stack trace:
> free_pcp_prepare+0x34f/0x940
> free_unref_page_prepare+0x29/0x210
> free_unref_page+0x3a/0x3b0
> __free_pages+0x187/0x1f0
> __free_slab+0x1fd/0x350
> free_slab+0x22/0x70
> free_to_partial_list+0x125/0x260
> do_slab_free+0x30/0x70
> kmem_cache_free+0x171/0x1e0
> putname+0x9f/0xf0
> do_sys_openat2+0xe2/0x1f0
> do_sys_open+0x8e/0xe0
> __ia32_sys_openat+0x2b/0x40
> __do_fast_syscall_32+0x72/0xd0
> do_fast_syscall_32+0x32/0x70
> do_SYSENTER_32+0x15/0x20

freed by slab

> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:825!
> invalid opcode: 0000 [#1] SMP
> CPU: 0 PID: 214 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #2 1c4d7707ec0ce574ed62a77e82a8580202758048
> EIP: __dump_page.cold+0x282/0x369
> Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: fffe0ff0 DR7: 00000400
> Call Trace:
> dump_page+0x2a/0xc0
> ? _raw_spin_lock_irqsave+0x16/0x30
> folio_flags+0x23/0x70
> get_partial_node+0x89/0x290

a page freed by slab is in the partial list?
Sounds like use-after-free from SLUB_TINY but not sure yet how
that could happen :/

> __slab_alloc_node+0xbb/0x270
> kmem_cache_alloc_lru+0x8d/0x4e0
> ? __lock_release+0x3ec/0x410
> ? iget_locked+0x78/0x310
> alloc_inode+0x93/0x150
> iget_locked+0xdd/0x310
> ? lock_is_held_type+0x80/0xf0
> kernfs_get_inode+0x24/0xb0
> kernfs_iop_lookup+0xb5/0x1a0
> __lookup_slow+0xd9/0x2a0
> lookup_slow+0x50/0x90
> walk_component+0x19c/0x2c0
> path_lookupat+0xa3/0x270
> path_openat+0x307/0x3e0
> do_filp_open+0x7c/0x130
> do_sys_openat2+0x113/0x1f0
> do_sys_open+0x8e/0xe0
> __ia32_sys_openat+0x2b/0x40
> __do_fast_syscall_32+0x72/0xd0
> ? trace_hardirqs_on+0xa2/0x110
> ? __fput+0x19f/0x390
> ? lockdep_hardirqs_on_prepare+0x242/0x400
> ? syscall_exit_to_user_mode+0x5f/0x90
> ? __do_fast_syscall_32+0x7c/0xd0
> ? lockdep_hardirqs_on_prepare+0x242/0x400
> ? syscall_exit_to_user_mode+0x5f/0x90
> ? __do_fast_syscall_32+0x7c/0xd0
> ? __do_fast_syscall_32+0x7c/0xd0
> ? syscall_exit_to_user_mode+0x5f/0x90
> ? __do_fast_syscall_32+0x7c/0xd0
> ? __do_fast_syscall_32+0x7c/0xd0
> ? __do_fast_syscall_32+0x7c/0xd0
> ? irqentry_exit_to_user_mode+0x23/0x30
> ? irqentry_exit+0x7f/0xc0
> do_fast_syscall_32+0x32/0x70
> do_SYSENTER_32+0x15/0x20
> entry_SYSENTER_32+0xa2/0xfb
> EIP: 0xb7f8c549
> Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> EAX: ffffffda EBX: 00000006 ECX: 006142a1 EDX: 002a8000
> ESI: 00000000 EDI: 00000001 EBP: 00614024 ESP: bff3c4a0
> DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> EIP: __dump_page.cold+0x282/0x369
> Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: fffe0ff0 DR7: 00000400
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: disabled

--
Thanks,
Hyeonggon

2023-01-01 08:26:32

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On Sat, Dec 31, 2022 at 11:26:25PM +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):
>
> commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]
>
> in testcase: rcutorture
> version:
> with following parameters:
>
> runtime: 300s
> test: default
> torture_type: tasks-tracing
>
> test-description: rcutorture is rcutorture kernel module load/unload test.
> test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-lkp/[email protected]
>
>

<snip>

>
>
> To reproduce:
>
> # build kernel
> cd linux
> cp config-6.1.0-rc2-00014-g0af8489b0216 .config
> make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
> make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
> cd <mod-install-dir>
> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
>
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
>
> # if come across any failure that blocks the test,
> # please remove ~/.lkp and /lkp dir to run from a clean state.


I was unable to reproduce in the same way as described above
because some files referenced in job-script couldn't be downloaded from
download.01.org/0day :(

So I just built rcutorture module as builtin
and I got weird spinlock bug on commit: 0af8489b0216
("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")

full dmesg added as attachment

[ 1387.564837][ T57] BUG: unable to handle page fault for address: c108f5f4
[ 1387.566649][ T57] #PF: supervisor write access in kernel mode
[ 1387.567965][ T57] #PF: error_code(0x0003) - permissions violation
[ 1387.569439][ T57] *pde = 010001e1
[ 1387.570276][ T57] Oops: 0003 [#1] SMP
[ 1387.571149][ T57] CPU: 2 PID: 57 Comm: rcu_torture_rea Tainted: G S 6.1.0-rc2-00010-g0af8489b0216 #2130 63d19ac2b985fca570c354d8750f489755de37ed
[ 1387.574673][ T57] EIP: kvm_kick_cpu+0x54/0x90
[ 1387.575802][ T57] Code: 2f c5 01 8b 04 9d e0 d4 4e c4 83 15 14 7b 2f c5 00 83 05 08 6d 2f c5 01 0f b7 0c 30 b8 05 00 00 00 83 15 0c 6d 2f c5 00 31 db <0f> 01 c1 83 05 10 6d 2f c5 01 8b 5d f8 8b 75 fc 83 15 14 6d 2f c5
[ 1387.580456][ T57] EAX: 00000005 EBX: 00000000 ECX: 00000003 EDX: c108f5a0
[ 1387.582071][ T57] ESI: c5153580 EDI: 00000046 EBP: c69cddf8 ESP: c69cddf0
[ 1387.583775][ T57] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010046
[ 1387.585643][ T57] CR0: 80050033 CR2: c108f5f4 CR3: 0776b000 CR4: 00350e90
[ 1387.587492][ T57] Call Trace:
[ 1387.588365][ T57] __pv_queued_spin_unlock_slowpath+0x66/0x110
[ 1387.589898][ T57] __pv_queued_spin_unlock+0x4b/0x60
[ 1387.591040][ T57] __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[ 1387.592771][ T57] do_raw_spin_unlock+0x49/0xa0
[ 1387.593805][ T57] _raw_spin_unlock_irqrestore+0x53/0xd0
[ 1387.594927][ T57] swake_up_one+0x4f/0x70
[ 1387.595739][ T57] __rcu_report_exp_rnp+0x26b/0x470
[ 1387.596730][ T57] rcu_report_exp_cpu_mult+0x82/0x2f0
[ 1387.597770][ T57] rcu_qs+0xac/0x160
[ 1387.598503][ T57] rcu_note_context_switch+0x31/0x1e0
[ 1387.599460][ T57] __schedule+0xc5/0x770
[ 1387.600195][ T57] __cond_resched+0x7a/0x100
[ 1387.600996][ T57] stutter_wait+0x9e/0x2c0
[ 1387.601956][ T57] rcu_torture_reader+0x162/0x3e0
[ 1387.603048][ T57] ? rcu_torture_reader+0x3e0/0x3e0
[ 1387.604269][ T57] ? __kthread_parkme+0xab/0xf0
[ 1387.605420][ T57] kthread+0x167/0x1d0
[ 1387.606383][ T57] ? rcu_torture_read_exit_child+0xa0/0xa0
[ 1387.607516][ T57] ? kthread_exit+0x50/0x50
[ 1387.608517][ T57] ret_from_fork+0x19/0x24
[ 1387.609548][ T57] Modules linked in:
[ 1387.610187][ T57] CR2: 00000000c108f5f4
[ 1387.610873][ T57] ---[ end trace 0000000000000000 ]---
[ 1387.611829][ T57] EIP: kvm_kick_cpu+0x54/0x90
[ 1387.612653][ T57] Code: 2f c5 01 8b 04 9d e0 d4 4e c4 83 15 14 7b 2f c5 00 83 05 08 6d 2f c5 01 0f b7 0c 30 b8 05 00 00 00 83 15 0c 6d 2f c5 00 31 db <0f> 01 c1 83 05 10 6d 2f c5 01 8b 5d f8 8b 75 fc 83 15 14 6d 2f c5
[ 1387.616715][ T57] EAX: 00000005 EBX: 00000000 ECX: 00000003 EDX: c108f5a0
[ 1387.618242][ T57] ESI: c5153580 EDI: 00000046 EBP: c69cddf8 ESP: c69cddf0
[ 1387.619912][ T57] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010046
[ 1387.621666][ T57] CR0: 80050033 CR2: c108f5f4 CR3: 0776b000 CR4: 00350e90
[ 1387.623128][ T57] Kernel panic - not syncing: Fatal exception
[ 1389.285045][ T57] Shutting down cpus with NMI
[ 1389.297949][ T57] Kernel Offset: disabled
[ 1389.299174][ T57] ---[ end Kernel panic - not syncing: Fatal exception ]---


Attachments:
(No filename) (5.33 kB)
dmesg (95.21 kB)
Download all attachments

2023-01-03 10:53:07

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On 12/31/22 16:26, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):
>
> commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]
>
> in testcase: rcutorture
> version:
> with following parameters:
>
> runtime: 300s
> test: default
> torture_type: tasks-tracing
>
> test-description: rcutorture is rcutorture kernel module load/unload test.
> test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

Looking at the full dump:

> [ 25.780445][ T214] page:e660911a refcount:0 mapcount:0 mapping:00000000 index:0xedaeef00 pfn:0x2daee
> [ 25.781618][ T214] page:0946d53a refcount:0 mapcount:0 mapping:00000000 index:0x0 pfn:0x2daec

OK, so we started dumping one page but then immediately went dumping
another. This means page_dump() tripped on itself.

Why did we dump the first page? The callback is

> [ 25.812675][ T214] dump_page+0x2a/0xc0
> [ 25.813025][ T214] ? _raw_spin_lock_irqsave+0x16/0x30
> [ 25.813492][ T214] folio_flags+0x23/0x70
> [ 25.813945][ T214] get_partial_node+0x89/0x290

We walk the partial list, and for a struct slab we check
pfmemalloc_match() thus slab_test_pfmemalloc() thus folio_test_active(),
which will do folio_flags(). That has two VM_BUG... checks but one requires
n > 0 which wouldn't be true here, so it has to be:
VM_BUG_ON_PGFLAGS(PageTail(page), page);

So page with pfn pfn:0x2daee is supposed to be a folio head of order 0 or
order 1 (probably), not larger due to the pfn. But it's a tail page, oops.
We start dumping it, but also create a folio of that, and then trip on
!folio_test_large(folio), because we try to evaluate folio_entire_mapcount()
in __dump_page(), which has that check.
So we start dumping the folio page, and that has pfn:0x2daec. Which means
the page pfn:0x2daee was tail with compound_head pointing to 0x2daec. So
0x2daec was order-2 at some point, with pfn:0x2daee being a 3rd tail page of 4.

> [ 25.782360][ T214] flags: 0x0(zone=0)
> [ 25.782705][ T214] raw: 00000000 e764d494 e6f205b4 00000000 00000000 00020000 ffffffff 00000000
> [ 25.783445][ T214] raw: 00000000 00000000
> [ 25.783809][ T214] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
> [ 25.784455][ T214] page_owner tracks the page as freed
> [ 25.785164][ T214] page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 208, tgid 208 (systemd-udevd), ts 25780391126, free_ts 25780421356
> [ 25.786859][ T214] post_alloc_hook+0x1fa/0x280
> [ 25.787266][ T214] get_page_from_freelist+0x226/0x310
> [ 25.787723][ T214] __alloc_pages+0xdd/0x360
> [ 25.788114][ T214] alloc_slab_page+0x12d/0x200
> [ 25.788517][ T214] allocate_slab+0x6a/0x350
> [ 25.788900][ T214] new_slab+0x48/0xc0
> [ 25.789258][ T214] __slab_alloc_node+0xfb/0x270
> [ 25.789749][ T214] kmem_cache_alloc+0x8f/0x4e0
> [ 25.790154][ T214] getname_flags+0x33/0x2f0
> [ 25.790538][ T214] getname+0x1a/0x30
> [ 25.790881][ T214] do_sys_openat2+0xa5/0x1f0
> [ 25.791278][ T214] do_sys_open+0x8e/0xe0
> [ 25.791636][ T214] __ia32_sys_openat+0x2b/0x40
> [ 25.792040][ T214] __do_fast_syscall_32+0x72/0xd0
> [ 25.792486][ T214] do_fast_syscall_32+0x32/0x70
> [ 25.792899][ T214] do_SYSENTER_32+0x15/0x20
> [ 25.793307][ T214] page last free stack trace:
> [ 25.793701][ T214] free_pcp_prepare+0x34f/0x940
> [ 25.794123][ T214] free_unref_page_prepare+0x29/0x210
> [ 25.794577][ T214] free_unref_page+0x3a/0x3b0
> [ 25.794988][ T214] __free_pages+0x187/0x1f0
> [ 25.795378][ T214] __free_slab+0x1fd/0x350
> [ 25.795842][ T214] free_slab+0x22/0x70
> [ 25.796233][ T214] free_to_partial_list+0x125/0x260
> [ 25.796761][ T214] do_slab_free+0x30/0x70
> [ 25.801249][ T214] kmem_cache_free+0x171/0x1e0
> [ 25.801686][ T214] putname+0x9f/0xf0
> [ 25.802023][ T214] do_sys_openat2+0xe2/0x1f0
> [ 25.802411][ T214] do_sys_open+0x8e/0xe0
> [ 25.802781][ T214] __ia32_sys_openat+0x2b/0x40
> [ 25.803192][ T214] __do_fast_syscall_32+0x72/0xd0
> [ 25.803613][ T214] do_fast_syscall_32+0x32/0x70
> [ 25.804027][ T214] do_SYSENTER_32+0x15/0x20

Yet the page owner info tells us, 0x2daee was most recently used as order-1,
and now it seems to be (based on the raw struct page dump) sitting on a pcplist.

So the events leading up to this could be something like:

- 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
- despite being on partial list, it's freed ???
- somebody else allocates order-2 page 0x2daec and uses it for whatever,
then frees it
- 0x2daec is reallocated as order-1 slab from names_cache, then freed
- we try to allocate from the slab page 0x2daee and trip on the PageTail

Except, the freeing of order-2 page would have reset the PageTail and
compound_head in 0x2daec, so this is even more complicated or involves some
extra race?

In any case, this is something a debug_pagealloc kernel could have a chance
of catching earlier. Would it be possible to enable CONFIG_DEBUG_PAGEALLOC
and DEBUG_PAGEALLOC_ENABLE_DEFAULT additionally to the rest of the
configuration, and repeat the test?

Separately we should also make the __dump_page() more resilient.

Thanks,
Vlastimil

> [ 25.804432][ T214] ------------[ cut here ]------------
> [ 25.804917][ T214] kernel BUG at include/linux/mm.h:825!
> [ 25.805402][ T214] invalid opcode: 0000 [#1] SMP
> [ 25.805820][ T214] CPU: 0 PID: 214 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #2 1c4d7707ec0ce574ed62a77e82a8580202758048
> [ 25.806944][ T214] EIP: __dump_page.cold+0x282/0x369
> [ 25.807376][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> [ 25.808960][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> [ 25.809578][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> [ 25.810168][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> [ 25.810803][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> [ 25.811407][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [ 25.811999][ T214] DR6: fffe0ff0 DR7: 00000400
> [ 25.812390][ T214] Call Trace:
> [ 25.812675][ T214] dump_page+0x2a/0xc0
> [ 25.813025][ T214] ? _raw_spin_lock_irqsave+0x16/0x30
> [ 25.813492][ T214] folio_flags+0x23/0x70
> [ 25.813945][ T214] get_partial_node+0x89/0x290
> [ 25.814357][ T214] __slab_alloc_node+0xbb/0x270
> [ 25.814860][ T214] kmem_cache_alloc_lru+0x8d/0x4e0
> [ 25.815289][ T214] ? __lock_release+0x3ec/0x410
> [ 25.815697][ T214] ? iget_locked+0x78/0x310
> [ 25.816096][ T214] alloc_inode+0x93/0x150
> [ 25.816469][ T214] iget_locked+0xdd/0x310
> [ 25.816829][ T214] ? lock_is_held_type+0x80/0xf0
> [ 25.817264][ T214] kernfs_get_inode+0x24/0xb0
> [ 25.817670][ T214] kernfs_iop_lookup+0xb5/0x1a0
> [ 25.818087][ T214] __lookup_slow+0xd9/0x2a0
> [ 25.818479][ T214] lookup_slow+0x50/0x90
> [ 25.818847][ T214] walk_component+0x19c/0x2c0
> [ 25.819244][ T214] path_lookupat+0xa3/0x270
> [ 25.819627][ T214] path_openat+0x307/0x3e0
> [ 25.820007][ T214] do_filp_open+0x7c/0x130
> [ 25.820409][ T214] do_sys_openat2+0x113/0x1f0
> [ 25.820807][ T214] do_sys_open+0x8e/0xe0
> [ 25.821211][ T214] __ia32_sys_openat+0x2b/0x40
> [ 25.821622][ T214] __do_fast_syscall_32+0x72/0xd0
> [ 25.822057][ T214] ? trace_hardirqs_on+0xa2/0x110
> [ 25.822480][ T214] ? __fput+0x19f/0x390
> [ 25.822842][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> [ 25.823346][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> [ 25.823823][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> [ 25.824259][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> [ 25.824767][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> [ 25.825254][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> [ 25.825696][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> [ 25.826155][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> [ 25.826627][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> [ 25.827056][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> [ 25.827486][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> [ 25.827929][ T214] ? irqentry_exit_to_user_mode+0x23/0x30
> [ 25.828423][ T214] ? irqentry_exit+0x7f/0xc0
> [ 25.828812][ T214] do_fast_syscall_32+0x32/0x70
> [ 25.829223][ T214] do_SYSENTER_32+0x15/0x20
> [ 25.829589][ T214] entry_SYSENTER_32+0xa2/0xfb
> [ 25.830003][ T214] EIP: 0xb7f8c549
> [ 25.830330][ T214] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> [ 25.831929][ T214] EAX: ffffffda EBX: 00000006 ECX: 006142a1 EDX: 002a8000
> [ 25.832522][ T214] ESI: 00000000 EDI: 00000001 EBP: 00614024 ESP: bff3c4a0
> [ 25.833123][ T214] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
> [ 25.833738][ T214] Modules linked in:
> [ 25.834062][ T214] ---[ end trace 0000000000000000 ]---
> [ 25.834522][ T214] EIP: __dump_page.cold+0x282/0x369
> [ 25.834960][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> [ 25.836574][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> [ 25.837183][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> [ 25.837772][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> [ 25.838414][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> [ 25.839011][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [ 25.839597][ T214] DR6: fffe0ff0 DR7: 00000400
> [ 25.839995][ T214] Kernel panic - not syncing: Fatal exception
> [ 25.840554][ T214] Kernel Offset: disabled


2023-01-03 14:06:36

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

hi,

On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> On 12/31/22 16:26, kernel test robot wrote:
> >
> > Greeting,
> >
> > FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):
> >
> > commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > [test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]
> >
> > in testcase: rcutorture
> > version:
> > with following parameters:
> >
> > runtime: 300s
> > test: default
> > torture_type: tasks-tracing
> >
> > test-description: rcutorture is rcutorture kernel module load/unload test.
> > test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
> >
> >
> > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
> Looking at the full dump:
>
> > [ 25.780445][ T214] page:e660911a refcount:0 mapcount:0 mapping:00000000 index:0xedaeef00 pfn:0x2daee
> > [ 25.781618][ T214] page:0946d53a refcount:0 mapcount:0 mapping:00000000 index:0x0 pfn:0x2daec
>
> OK, so we started dumping one page but then immediately went dumping
> another. This means page_dump() tripped on itself.
>
> Why did we dump the first page? The callback is
>
> > [ 25.812675][ T214] dump_page+0x2a/0xc0
> > [ 25.813025][ T214] ? _raw_spin_lock_irqsave+0x16/0x30
> > [ 25.813492][ T214] folio_flags+0x23/0x70
> > [ 25.813945][ T214] get_partial_node+0x89/0x290
>
> We walk the partial list, and for a struct slab we check
> pfmemalloc_match() thus slab_test_pfmemalloc() thus folio_test_active(),
> which will do folio_flags(). That has two VM_BUG... checks but one requires
> n > 0 which wouldn't be true here, so it has to be:
> VM_BUG_ON_PGFLAGS(PageTail(page), page);
>
> So page with pfn pfn:0x2daee is supposed to be a folio head of order 0 or
> order 1 (probably), not larger due to the pfn. But it's a tail page, oops.
> We start dumping it, but also create a folio of that, and then trip on
> !folio_test_large(folio), because we try to evaluate folio_entire_mapcount()
> in __dump_page(), which has that check.
> So we start dumping the folio page, and that has pfn:0x2daec. Which means
> the page pfn:0x2daee was tail with compound_head pointing to 0x2daec. So
> 0x2daec was order-2 at some point, with pfn:0x2daee being a 3rd tail page of 4.
>
> > [ 25.782360][ T214] flags: 0x0(zone=0)
> > [ 25.782705][ T214] raw: 00000000 e764d494 e6f205b4 00000000 00000000 00020000 ffffffff 00000000
> > [ 25.783445][ T214] raw: 00000000 00000000
> > [ 25.783809][ T214] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
> > [ 25.784455][ T214] page_owner tracks the page as freed
> > [ 25.785164][ T214] page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 208, tgid 208 (systemd-udevd), ts 25780391126, free_ts 25780421356
> > [ 25.786859][ T214] post_alloc_hook+0x1fa/0x280
> > [ 25.787266][ T214] get_page_from_freelist+0x226/0x310
> > [ 25.787723][ T214] __alloc_pages+0xdd/0x360
> > [ 25.788114][ T214] alloc_slab_page+0x12d/0x200
> > [ 25.788517][ T214] allocate_slab+0x6a/0x350
> > [ 25.788900][ T214] new_slab+0x48/0xc0
> > [ 25.789258][ T214] __slab_alloc_node+0xfb/0x270
> > [ 25.789749][ T214] kmem_cache_alloc+0x8f/0x4e0
> > [ 25.790154][ T214] getname_flags+0x33/0x2f0
> > [ 25.790538][ T214] getname+0x1a/0x30
> > [ 25.790881][ T214] do_sys_openat2+0xa5/0x1f0
> > [ 25.791278][ T214] do_sys_open+0x8e/0xe0
> > [ 25.791636][ T214] __ia32_sys_openat+0x2b/0x40
> > [ 25.792040][ T214] __do_fast_syscall_32+0x72/0xd0
> > [ 25.792486][ T214] do_fast_syscall_32+0x32/0x70
> > [ 25.792899][ T214] do_SYSENTER_32+0x15/0x20
> > [ 25.793307][ T214] page last free stack trace:
> > [ 25.793701][ T214] free_pcp_prepare+0x34f/0x940
> > [ 25.794123][ T214] free_unref_page_prepare+0x29/0x210
> > [ 25.794577][ T214] free_unref_page+0x3a/0x3b0
> > [ 25.794988][ T214] __free_pages+0x187/0x1f0
> > [ 25.795378][ T214] __free_slab+0x1fd/0x350
> > [ 25.795842][ T214] free_slab+0x22/0x70
> > [ 25.796233][ T214] free_to_partial_list+0x125/0x260
> > [ 25.796761][ T214] do_slab_free+0x30/0x70
> > [ 25.801249][ T214] kmem_cache_free+0x171/0x1e0
> > [ 25.801686][ T214] putname+0x9f/0xf0
> > [ 25.802023][ T214] do_sys_openat2+0xe2/0x1f0
> > [ 25.802411][ T214] do_sys_open+0x8e/0xe0
> > [ 25.802781][ T214] __ia32_sys_openat+0x2b/0x40
> > [ 25.803192][ T214] __do_fast_syscall_32+0x72/0xd0
> > [ 25.803613][ T214] do_fast_syscall_32+0x32/0x70
> > [ 25.804027][ T214] do_SYSENTER_32+0x15/0x20
>
> Yet the page owner info tells us, 0x2daee was most recently used as order-1,
> and now it seems to be (based on the raw struct page dump) sitting on a pcplist.
>
> So the events leading up to this could be something like:
>
> - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
> - despite being on partial list, it's freed ???
> - somebody else allocates order-2 page 0x2daec and uses it for whatever,
> then frees it
> - 0x2daec is reallocated as order-1 slab from names_cache, then freed
> - we try to allocate from the slab page 0x2daee and trip on the PageTail
>
> Except, the freeing of order-2 page would have reset the PageTail and
> compound_head in 0x2daec, so this is even more complicated or involves some
> extra race?

FYI, we ran tests more up to 500 times, then saw different issues but rate is
actually low

56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:500 12% 61:500 dmesg.invalid_opcode:#[##]
:500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
:500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
:500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
:500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
:500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c

>
> In any case, this is something a debug_pagealloc kernel could have a chance
> of catching earlier. Would it be possible to enable CONFIG_DEBUG_PAGEALLOC
> and DEBUG_PAGEALLOC_ENABLE_DEFAULT additionally to the rest of the
> configuration, and repeat the test?

ok, we are starting to test by these 2 additional configs now.

>
> Separately we should also make the __dump_page() more resilient.
>
> Thanks,
> Vlastimil
>
> > [ 25.804432][ T214] ------------[ cut here ]------------
> > [ 25.804917][ T214] kernel BUG at include/linux/mm.h:825!
> > [ 25.805402][ T214] invalid opcode: 0000 [#1] SMP
> > [ 25.805820][ T214] CPU: 0 PID: 214 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #2 1c4d7707ec0ce574ed62a77e82a8580202758048
> > [ 25.806944][ T214] EIP: __dump_page.cold+0x282/0x369
> > [ 25.807376][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> > [ 25.808960][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> > [ 25.809578][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> > [ 25.810168][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> > [ 25.810803][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> > [ 25.811407][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > [ 25.811999][ T214] DR6: fffe0ff0 DR7: 00000400
> > [ 25.812390][ T214] Call Trace:
> > [ 25.812675][ T214] dump_page+0x2a/0xc0
> > [ 25.813025][ T214] ? _raw_spin_lock_irqsave+0x16/0x30
> > [ 25.813492][ T214] folio_flags+0x23/0x70
> > [ 25.813945][ T214] get_partial_node+0x89/0x290
> > [ 25.814357][ T214] __slab_alloc_node+0xbb/0x270
> > [ 25.814860][ T214] kmem_cache_alloc_lru+0x8d/0x4e0
> > [ 25.815289][ T214] ? __lock_release+0x3ec/0x410
> > [ 25.815697][ T214] ? iget_locked+0x78/0x310
> > [ 25.816096][ T214] alloc_inode+0x93/0x150
> > [ 25.816469][ T214] iget_locked+0xdd/0x310
> > [ 25.816829][ T214] ? lock_is_held_type+0x80/0xf0
> > [ 25.817264][ T214] kernfs_get_inode+0x24/0xb0
> > [ 25.817670][ T214] kernfs_iop_lookup+0xb5/0x1a0
> > [ 25.818087][ T214] __lookup_slow+0xd9/0x2a0
> > [ 25.818479][ T214] lookup_slow+0x50/0x90
> > [ 25.818847][ T214] walk_component+0x19c/0x2c0
> > [ 25.819244][ T214] path_lookupat+0xa3/0x270
> > [ 25.819627][ T214] path_openat+0x307/0x3e0
> > [ 25.820007][ T214] do_filp_open+0x7c/0x130
> > [ 25.820409][ T214] do_sys_openat2+0x113/0x1f0
> > [ 25.820807][ T214] do_sys_open+0x8e/0xe0
> > [ 25.821211][ T214] __ia32_sys_openat+0x2b/0x40
> > [ 25.821622][ T214] __do_fast_syscall_32+0x72/0xd0
> > [ 25.822057][ T214] ? trace_hardirqs_on+0xa2/0x110
> > [ 25.822480][ T214] ? __fput+0x19f/0x390
> > [ 25.822842][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> > [ 25.823346][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > [ 25.823823][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > [ 25.824259][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> > [ 25.824767][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > [ 25.825254][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > [ 25.825696][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > [ 25.826155][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > [ 25.826627][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > [ 25.827056][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > [ 25.827486][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > [ 25.827929][ T214] ? irqentry_exit_to_user_mode+0x23/0x30
> > [ 25.828423][ T214] ? irqentry_exit+0x7f/0xc0
> > [ 25.828812][ T214] do_fast_syscall_32+0x32/0x70
> > [ 25.829223][ T214] do_SYSENTER_32+0x15/0x20
> > [ 25.829589][ T214] entry_SYSENTER_32+0xa2/0xfb
> > [ 25.830003][ T214] EIP: 0xb7f8c549
> > [ 25.830330][ T214] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> > [ 25.831929][ T214] EAX: ffffffda EBX: 00000006 ECX: 006142a1 EDX: 002a8000
> > [ 25.832522][ T214] ESI: 00000000 EDI: 00000001 EBP: 00614024 ESP: bff3c4a0
> > [ 25.833123][ T214] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
> > [ 25.833738][ T214] Modules linked in:
> > [ 25.834062][ T214] ---[ end trace 0000000000000000 ]---
> > [ 25.834522][ T214] EIP: __dump_page.cold+0x282/0x369
> > [ 25.834960][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> > [ 25.836574][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> > [ 25.837183][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> > [ 25.837772][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> > [ 25.838414][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> > [ 25.839011][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > [ 25.839597][ T214] DR6: fffe0ff0 DR7: 00000400
> > [ 25.839995][ T214] Kernel panic - not syncing: Fatal exception
> > [ 25.840554][ T214] Kernel Offset: disabled
>
>

2023-01-03 14:56:25

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On 1/3/23 14:46, Oliver Sang wrote:
> hi,
>
>>
>> Yet the page owner info tells us, 0x2daee was most recently used as order-1,
>> and now it seems to be (based on the raw struct page dump) sitting on a pcplist.
>>
>> So the events leading up to this could be something like:
>>
>> - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
>> - despite being on partial list, it's freed ???
>> - somebody else allocates order-2 page 0x2daec and uses it for whatever,
>> then frees it
>> - 0x2daec is reallocated as order-1 slab from names_cache, then freed
>> - we try to allocate from the slab page 0x2daee and trip on the PageTail
>>
>> Except, the freeing of order-2 page would have reset the PageTail and
>> compound_head in 0x2daec, so this is even more complicated or involves some
>> extra race?
>
> FYI, we ran tests more up to 500 times, then saw different issues but rate is
> actually low
>
> 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :500 12% 61:500 dmesg.invalid_opcode:#[##]
> :500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
> :500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
> :500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
> :500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
> :500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c

Hm even if rate is low, the different kinds of reports could be useful to
see, if all of that is caused by the commit.

>>
>> In any case, this is something a debug_pagealloc kernel could have a chance
>> of catching earlier. Would it be possible to enable CONFIG_DEBUG_PAGEALLOC
>> and DEBUG_PAGEALLOC_ENABLE_DEFAULT additionally to the rest of the
>> configuration, and repeat the test?
>
> ok, we are starting to test by these 2 additional configs now.

Great, thanks!


2023-01-03 16:07:27

by Matthew Wilcox

[permalink] [raw]
Subject: A better dump_page()

On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> Separately we should also make the __dump_page() more resilient.

Right. It's not ideal when one of our best debugging tools obfuscates
the problem we're trying to debug. I've seen probems like this before,
and the problem is that somebody calls dump_page() on a page that they
don't own a refcount on. That lets the page mutate under us in some
fairly awkward ways (as you've seen here, it seems to be part of several
different compound allocations at various points during the dump
process).

One possibility I thought about was taking our own refcount on the
page at the start of dump_page(). That would kill off the possibility
of ever passing in a const struct page, and it would confuse people.
Also, what if somebody passes in a pointer to something that's not a
struct page? Then we've (tried to) modify memory that's not a refcount.

I think the best we can do is to snapshot the struct page and the folio
it appears to belong to at the start of dump_page(). It'll take a
little care (for example, folio_pfn() must be passed the original
folio, and not the snapshot), but I think it's doable.

2023-01-03 23:30:05

by David Rientjes

[permalink] [raw]
Subject: Re: A better dump_page()

On Tue, 3 Jan 2023, Matthew Wilcox wrote:

> On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> > Separately we should also make the __dump_page() more resilient.
>
> Right. It's not ideal when one of our best debugging tools obfuscates
> the problem we're trying to debug. I've seen probems like this before,
> and the problem is that somebody calls dump_page() on a page that they
> don't own a refcount on. That lets the page mutate under us in some
> fairly awkward ways (as you've seen here, it seems to be part of several
> different compound allocations at various points during the dump
> process).
>
> One possibility I thought about was taking our own refcount on the
> page at the start of dump_page(). That would kill off the possibility
> of ever passing in a const struct page, and it would confuse people.
> Also, what if somebody passes in a pointer to something that's not a
> struct page? Then we've (tried to) modify memory that's not a refcount.
>
> I think the best we can do is to snapshot the struct page and the folio
> it appears to belong to at the start of dump_page(). It'll take a
> little care (for example, folio_pfn() must be passed the original
> folio, and not the snapshot), but I think it's doable.
>

By snapshot do you mean memcpy() of the metadata to the stack? I assume
this still leaves the opportunity for the underlying mutation of the page
but makes the window more narrow.

2023-01-04 00:10:17

by Matthew Wilcox

[permalink] [raw]
Subject: Re: A better dump_page()

On Tue, Jan 03, 2023 at 03:07:12PM -0800, David Rientjes wrote:
> On Tue, 3 Jan 2023, Matthew Wilcox wrote:
>
> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> > > Separately we should also make the __dump_page() more resilient.
> >
> > Right. It's not ideal when one of our best debugging tools obfuscates
> > the problem we're trying to debug. I've seen probems like this before,
> > and the problem is that somebody calls dump_page() on a page that they
> > don't own a refcount on. That lets the page mutate under us in some
> > fairly awkward ways (as you've seen here, it seems to be part of several
> > different compound allocations at various points during the dump
> > process).
> >
> > One possibility I thought about was taking our own refcount on the
> > page at the start of dump_page(). That would kill off the possibility
> > of ever passing in a const struct page, and it would confuse people.
> > Also, what if somebody passes in a pointer to something that's not a
> > struct page? Then we've (tried to) modify memory that's not a refcount.
> >
> > I think the best we can do is to snapshot the struct page and the folio
> > it appears to belong to at the start of dump_page(). It'll take a
> > little care (for example, folio_pfn() must be passed the original
> > folio, and not the snapshot), but I think it's doable.
> >
>
> By snapshot do you mean memcpy() of the metadata to the stack? I assume
> this still leaves the opportunity for the underlying mutation of the page
> but makes the window more narrow.

Right. We'd need to memcpy() both the page and the folio, so around 192
bytes. It doesn't make it consistent since it could be mutated during
the memcpy(), but it will be consistent throughout the execution of the
function, so we won't get calls like folio_entire_mapcount() aborting
due to the folio having become a tail page halfway through.

2023-01-04 09:36:42

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On Tue, Jan 03, 2023 at 09:46:33PM +0800, Oliver Sang wrote:
> On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> > So the events leading up to this could be something like:
> >
> > - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
> > - despite being on partial list, it's freed ???
> > - somebody else allocates order-2 page 0x2daec and uses it for whatever,
> > then frees it
> > - 0x2daec is reallocated as order-1 slab from names_cache, then freed
> > - we try to allocate from the slab page 0x2daee and trip on the PageTail
> >
> > Except, the freeing of order-2 page would have reset the PageTail and
> > compound_head in 0x2daec, so this is even more complicated or involves some
> > extra race?
>
> FYI, we ran tests more up to 500 times, then saw different issues but rate is
> actually low
>
> 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :500 12% 61:500 dmesg.invalid_opcode:#[##]
> :500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
> :500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
> :500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
> :500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
> :500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c
>
> >
> > In any case, this is something a debug_pagealloc kernel could have a chance
> > of catching earlier. Would it be possible to enable CONFIG_DEBUG_PAGEALLOC
> > and DEBUG_PAGEALLOC_ENABLE_DEFAULT additionally to the rest of the
> > configuration, and repeat the test?
>
> ok, we are starting to test by these 2 additional configs now.

BTW it seems to be totally unrelated to rcutorture tests.
Are there similar reports in boot tests with the same config?

> >
> > Separately we should also make the __dump_page() more resilient.
> >
> > Thanks,
> > Vlastimil
> >
> > > [ 25.804432][ T214] ------------[ cut here ]------------
> > > [ 25.804917][ T214] kernel BUG at include/linux/mm.h:825!
> > > [ 25.805402][ T214] invalid opcode: 0000 [#1] SMP
> > > [ 25.805820][ T214] CPU: 0 PID: 214 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #2 1c4d7707ec0ce574ed62a77e82a8580202758048
> > > [ 25.806944][ T214] EIP: __dump_page.cold+0x282/0x369
> > > [ 25.807376][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> > > [ 25.808960][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> > > [ 25.809578][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> > > [ 25.810168][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> > > [ 25.810803][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> > > [ 25.811407][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > > [ 25.811999][ T214] DR6: fffe0ff0 DR7: 00000400
> > > [ 25.812390][ T214] Call Trace:
> > > [ 25.812675][ T214] dump_page+0x2a/0xc0
> > > [ 25.813025][ T214] ? _raw_spin_lock_irqsave+0x16/0x30
> > > [ 25.813492][ T214] folio_flags+0x23/0x70
> > > [ 25.813945][ T214] get_partial_node+0x89/0x290
> > > [ 25.814357][ T214] __slab_alloc_node+0xbb/0x270
> > > [ 25.814860][ T214] kmem_cache_alloc_lru+0x8d/0x4e0
> > > [ 25.815289][ T214] ? __lock_release+0x3ec/0x410
> > > [ 25.815697][ T214] ? iget_locked+0x78/0x310
> > > [ 25.816096][ T214] alloc_inode+0x93/0x150
> > > [ 25.816469][ T214] iget_locked+0xdd/0x310
> > > [ 25.816829][ T214] ? lock_is_held_type+0x80/0xf0
> > > [ 25.817264][ T214] kernfs_get_inode+0x24/0xb0
> > > [ 25.817670][ T214] kernfs_iop_lookup+0xb5/0x1a0
> > > [ 25.818087][ T214] __lookup_slow+0xd9/0x2a0
> > > [ 25.818479][ T214] lookup_slow+0x50/0x90
> > > [ 25.818847][ T214] walk_component+0x19c/0x2c0
> > > [ 25.819244][ T214] path_lookupat+0xa3/0x270
> > > [ 25.819627][ T214] path_openat+0x307/0x3e0
> > > [ 25.820007][ T214] do_filp_open+0x7c/0x130
> > > [ 25.820409][ T214] do_sys_openat2+0x113/0x1f0
> > > [ 25.820807][ T214] do_sys_open+0x8e/0xe0
> > > [ 25.821211][ T214] __ia32_sys_openat+0x2b/0x40
> > > [ 25.821622][ T214] __do_fast_syscall_32+0x72/0xd0
> > > [ 25.822057][ T214] ? trace_hardirqs_on+0xa2/0x110
> > > [ 25.822480][ T214] ? __fput+0x19f/0x390
> > > [ 25.822842][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> > > [ 25.823346][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > > [ 25.823823][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > [ 25.824259][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> > > [ 25.824767][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > > [ 25.825254][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > [ 25.825696][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > [ 25.826155][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > > [ 25.826627][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > [ 25.827056][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > [ 25.827486][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > [ 25.827929][ T214] ? irqentry_exit_to_user_mode+0x23/0x30
> > > [ 25.828423][ T214] ? irqentry_exit+0x7f/0xc0
> > > [ 25.828812][ T214] do_fast_syscall_32+0x32/0x70
> > > [ 25.829223][ T214] do_SYSENTER_32+0x15/0x20
> > > [ 25.829589][ T214] entry_SYSENTER_32+0xa2/0xfb
> > > [ 25.830003][ T214] EIP: 0xb7f8c549
> > > [ 25.830330][ T214] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> > > [ 25.831929][ T214] EAX: ffffffda EBX: 00000006 ECX: 006142a1 EDX: 002a8000
> > > [ 25.832522][ T214] ESI: 00000000 EDI: 00000001 EBP: 00614024 ESP: bff3c4a0
> > > [ 25.833123][ T214] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
> > > [ 25.833738][ T214] Modules linked in:
> > > [ 25.834062][ T214] ---[ end trace 0000000000000000 ]---
> > > [ 25.834522][ T214] EIP: __dump_page.cold+0x282/0x369
> > > [ 25.834960][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> > > [ 25.836574][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> > > [ 25.837183][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> > > [ 25.837772][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> > > [ 25.838414][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> > > [ 25.839011][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > > [ 25.839597][ T214] DR6: fffe0ff0 DR7: 00000400
> > > [ 25.839995][ T214] Kernel panic - not syncing: Fatal exception
> > > [ 25.840554][ T214] Kernel Offset: disabled
> >
> >

--
Thanks,
Hyeonggon

2023-01-05 02:54:52

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

hi, Hyeonggon, hi, Vlastimil,

On Wed, Jan 04, 2023 at 06:04:20PM +0900, Hyeonggon Yoo wrote:
> On Tue, Jan 03, 2023 at 09:46:33PM +0800, Oliver Sang wrote:
> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> > > So the events leading up to this could be something like:
> > >
> > > - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
> > > - despite being on partial list, it's freed ???
> > > - somebody else allocates order-2 page 0x2daec and uses it for whatever,
> > > then frees it
> > > - 0x2daec is reallocated as order-1 slab from names_cache, then freed
> > > - we try to allocate from the slab page 0x2daee and trip on the PageTail
> > >
> > > Except, the freeing of order-2 page would have reset the PageTail and
> > > compound_head in 0x2daec, so this is even more complicated or involves some
> > > extra race?
> >
> > FYI, we ran tests more up to 500 times, then saw different issues but rate is
> > actually low
> >
> > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> > ---------------- ---------------------------
> > fail:runs %reproduction fail:runs
> > | | |
> > :500 12% 61:500 dmesg.invalid_opcode:#[##]
> > :500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
> > :500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > :500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c
> >

hi Vlastimil,

as you mentioned
> Hm even if rate is low, the different kinds of reports could be useful to
> see, if all of that is caused by the commit.

we tried to run tests even more times, but with the config which enable
CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
(config is attached as
config-6.1.0-rc2-00014-g0af8489b0216+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
the only diff with previous config is
@@ -5601,7 +5601,8 @@ CONFIG_HAVE_KCSAN_COMPILER=y
# Memory Debugging
#
CONFIG_PAGE_EXTENSION=y
-# CONFIG_DEBUG_PAGEALLOC is not set
+CONFIG_DEBUG_PAGEALLOC=y
+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_PAGE_OWNER=y
# CONFIG_PAGE_POISONING is not set
CONFIG_DEBUG_PAGE_REF=y
)

what we found now is some issues are also reproduced on parent now (still by
rcutorture tests here), though seems lower rate on parent.

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing

56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
8:985 19% 199:990 dmesg.invalid_opcode:#[##]
:985 5% 51:990 dmesg.kernel_BUG_at_include/linux/mm.h
3:985 4% 41:990 dmesg.kernel_BUG_at_include/linux/page-flags.h
4:985 10% 102:990 dmesg.kernel_BUG_at_lib/list_debug.c
:985 0% 2:990 dmesg.kernel_BUG_at_mm/page_alloc.c
1:985 0% 3:990 dmesg.kernel_BUG_at_mm/usercopy.c

however, we noticed dmesg.kernel_BUG_at_include/linux/mm.h still have
relatively high rate on this commit but keeps clean on parent.

so I attached dmesg-rcutorture-bug-at-mm-h.xz
since it has
[ 33.586931][ T183] ------------[ cut here ]------------
[ 33.590089][ T183] kernel BUG at include/linux/mm.h:825!
[ 33.591219][ T183] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
not sure if this is helpful.

> > >
> > > In any case, this is something a debug_pagealloc kernel could have a chance
> > > of catching earlier. Would it be possible to enable CONFIG_DEBUG_PAGEALLOC
> > > and DEBUG_PAGEALLOC_ENABLE_DEFAULT additionally to the rest of the
> > > configuration, and repeat the test?
> >
> > ok, we are starting to test by these 2 additional configs now.
>
> BTW it seems to be totally unrelated to rcutorture tests.
> Are there similar reports in boot tests with the same config?

hi Hyeonggon,

per your suggestion (Thanks a lot!) we used same config as above which
enabled CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
to do boot tests. the results have some differences:

=========================================================================================
compiler/kconfig/rootfs/sleep/tbox_group/testcase:
gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot

56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
11:999 20% 208:999 dmesg.invalid_opcode:#[##]
2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
:999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c

here the dmesg.kernel_BUG_at_include/linux/mm.h is also reproduced on parent,
only issue shows on this commit but not on parent is (but rate is very low)
dmesg.kernel_BUG_at_mm/page_alloc.c

I also attached dmesg-boot-bug-at-page_alloc-c.xz
which has
[ 17.251777][ C0] ------------[ cut here ]------------
[ 17.252218][ C0] kernel BUG at mm/page_alloc.c:1406!
[ 17.252647][ C0] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC

If you need more information, please let us know. Thanks

>
> > >
> > > Separately we should also make the __dump_page() more resilient.
> > >
> > > Thanks,
> > > Vlastimil
> > >
> > > > [ 25.804432][ T214] ------------[ cut here ]------------
> > > > [ 25.804917][ T214] kernel BUG at include/linux/mm.h:825!
> > > > [ 25.805402][ T214] invalid opcode: 0000 [#1] SMP
> > > > [ 25.805820][ T214] CPU: 0 PID: 214 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #2 1c4d7707ec0ce574ed62a77e82a8580202758048
> > > > [ 25.806944][ T214] EIP: __dump_page.cold+0x282/0x369
> > > > [ 25.807376][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> > > > [ 25.808960][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> > > > [ 25.809578][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> > > > [ 25.810168][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> > > > [ 25.810803][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> > > > [ 25.811407][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > > > [ 25.811999][ T214] DR6: fffe0ff0 DR7: 00000400
> > > > [ 25.812390][ T214] Call Trace:
> > > > [ 25.812675][ T214] dump_page+0x2a/0xc0
> > > > [ 25.813025][ T214] ? _raw_spin_lock_irqsave+0x16/0x30
> > > > [ 25.813492][ T214] folio_flags+0x23/0x70
> > > > [ 25.813945][ T214] get_partial_node+0x89/0x290
> > > > [ 25.814357][ T214] __slab_alloc_node+0xbb/0x270
> > > > [ 25.814860][ T214] kmem_cache_alloc_lru+0x8d/0x4e0
> > > > [ 25.815289][ T214] ? __lock_release+0x3ec/0x410
> > > > [ 25.815697][ T214] ? iget_locked+0x78/0x310
> > > > [ 25.816096][ T214] alloc_inode+0x93/0x150
> > > > [ 25.816469][ T214] iget_locked+0xdd/0x310
> > > > [ 25.816829][ T214] ? lock_is_held_type+0x80/0xf0
> > > > [ 25.817264][ T214] kernfs_get_inode+0x24/0xb0
> > > > [ 25.817670][ T214] kernfs_iop_lookup+0xb5/0x1a0
> > > > [ 25.818087][ T214] __lookup_slow+0xd9/0x2a0
> > > > [ 25.818479][ T214] lookup_slow+0x50/0x90
> > > > [ 25.818847][ T214] walk_component+0x19c/0x2c0
> > > > [ 25.819244][ T214] path_lookupat+0xa3/0x270
> > > > [ 25.819627][ T214] path_openat+0x307/0x3e0
> > > > [ 25.820007][ T214] do_filp_open+0x7c/0x130
> > > > [ 25.820409][ T214] do_sys_openat2+0x113/0x1f0
> > > > [ 25.820807][ T214] do_sys_open+0x8e/0xe0
> > > > [ 25.821211][ T214] __ia32_sys_openat+0x2b/0x40
> > > > [ 25.821622][ T214] __do_fast_syscall_32+0x72/0xd0
> > > > [ 25.822057][ T214] ? trace_hardirqs_on+0xa2/0x110
> > > > [ 25.822480][ T214] ? __fput+0x19f/0x390
> > > > [ 25.822842][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> > > > [ 25.823346][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > > > [ 25.823823][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > > [ 25.824259][ T214] ? lockdep_hardirqs_on_prepare+0x242/0x400
> > > > [ 25.824767][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > > > [ 25.825254][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > > [ 25.825696][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > > [ 25.826155][ T214] ? syscall_exit_to_user_mode+0x5f/0x90
> > > > [ 25.826627][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > > [ 25.827056][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > > [ 25.827486][ T214] ? __do_fast_syscall_32+0x7c/0xd0
> > > > [ 25.827929][ T214] ? irqentry_exit_to_user_mode+0x23/0x30
> > > > [ 25.828423][ T214] ? irqentry_exit+0x7f/0xc0
> > > > [ 25.828812][ T214] do_fast_syscall_32+0x32/0x70
> > > > [ 25.829223][ T214] do_SYSENTER_32+0x15/0x20
> > > > [ 25.829589][ T214] entry_SYSENTER_32+0xa2/0xfb
> > > > [ 25.830003][ T214] EIP: 0xb7f8c549
> > > > [ 25.830330][ T214] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> > > > [ 25.831929][ T214] EAX: ffffffda EBX: 00000006 ECX: 006142a1 EDX: 002a8000
> > > > [ 25.832522][ T214] ESI: 00000000 EDI: 00000001 EBP: 00614024 ESP: bff3c4a0
> > > > [ 25.833123][ T214] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
> > > > [ 25.833738][ T214] Modules linked in:
> > > > [ 25.834062][ T214] ---[ end trace 0000000000000000 ]---
> > > > [ 25.834522][ T214] EIP: __dump_page.cold+0x282/0x369
> > > > [ 25.834960][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8
> > > > [ 25.836574][ T214] EAX: 00000000 EBX: e764d530 ECX: 00000003 EDX: 4108888f
> > > > [ 25.837183][ T214] ESI: e764d4e0 EDI: e764d4e0 EBP: ed89db3c ESP: ed89db00
> > > > [ 25.837772][ T214] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210046
> > > > [ 25.838414][ T214] CR0: 80050033 CR2: 00616abc CR3: 2d878000 CR4: 000406d0
> > > > [ 25.839011][ T214] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > > > [ 25.839597][ T214] DR6: fffe0ff0 DR7: 00000400
> > > > [ 25.839995][ T214] Kernel panic - not syncing: Fatal exception
> > > > [ 25.840554][ T214] Kernel Offset: disabled
> > >
> > >
>
> --
> Thanks,
> Hyeonggon


Attachments:
(No filename) (11.58 kB)
config-6.1.0-rc2-00014-g0af8489b0216+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT (149.83 kB)
dmesg-rcutorture-bug-at-mm-h.xz (55.51 kB)
dmesg-boot-bug-at-page_alloc-c.xz (55.27 kB)
Download all attachments

2023-01-05 14:48:24

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

From: Hyeonggon Yoo <[email protected]>
To: Oliver Sang <[email protected]>
Cc: Vlastimil Babka <[email protected]>, [email protected], [email protected],
Mike Rapoport <[email protected]>,
Christoph Lameter <[email protected]>, [email protected],
[email protected], Matthew Wilcox <[email protected]>
Bcc:
Subject: Re: [linus:master] [mm, slub] 0af8489b02:
kernel_BUG_at_include/linux/mm.h
Reply-To:
In-Reply-To: <Y7Yr3kEkDEd51xns@xsang-OptiPlex-9020>

On Thu, Jan 05, 2023 at 09:46:06AM +0800, Oliver Sang wrote:
> hi, Hyeonggon, hi, Vlastimil,
>
> On Wed, Jan 04, 2023 at 06:04:20PM +0900, Hyeonggon Yoo wrote:
> > On Tue, Jan 03, 2023 at 09:46:33PM +0800, Oliver Sang wrote:
> > > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> > > > So the events leading up to this could be something like:
> > > >
> > > > - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
> > > > - despite being on partial list, it's freed ???
> > > > - somebody else allocates order-2 page 0x2daec and uses it for whatever,
> > > > then frees it
> > > > - 0x2daec is reallocated as order-1 slab from names_cache, then freed
> > > > - we try to allocate from the slab page 0x2daee and trip on the PageTail
> > > >
> > > > Except, the freeing of order-2 page would have reset the PageTail and
> > > > compound_head in 0x2daec, so this is even more complicated or involves some
> > > > extra race?
> > >
> > > FYI, we ran tests more up to 500 times, then saw different issues but rate is
> > > actually low
> > >
> > > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> > > ---------------- ---------------------------
> > > fail:runs %reproduction fail:runs
> > > | | |
> > > :500 12% 61:500 dmesg.invalid_opcode:#[##]
> > > :500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
> > > :500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > > :500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
> > > :500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
> > > :500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c
> > >
>
> hi Vlastimil,
>
> as you mentioned
> > Hm even if rate is low, the different kinds of reports could be useful to
> > see, if all of that is caused by the commit.
>
> we tried to run tests even more times, but with the config which enable
> CONFIG_DEBUG_PAGEALLOC
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> (config is attached as
> config-6.1.0-rc2-00014-g0af8489b0216+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> the only diff with previous config is
> @@ -5601,7 +5601,8 @@ CONFIG_HAVE_KCSAN_COMPILER=y
> # Memory Debugging
> #
> CONFIG_PAGE_EXTENSION=y
> -# CONFIG_DEBUG_PAGEALLOC is not set
> +CONFIG_DEBUG_PAGEALLOC=y
> +CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
> CONFIG_PAGE_OWNER=y
> # CONFIG_PAGE_POISONING is not set
> CONFIG_DEBUG_PAGE_REF=y
> )
>
> what we found now is some issues are also reproduced on parent now (still by
> rcutorture tests here), though seems lower rate on parent.
>
> =========================================================================================
> compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
>
> 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> 8:985 19% 199:990 dmesg.invalid_opcode:#[##]
> :985 5% 51:990 dmesg.kernel_BUG_at_include/linux/mm.h
> 3:985 4% 41:990 dmesg.kernel_BUG_at_include/linux/page-flags.h
> 4:985 10% 102:990 dmesg.kernel_BUG_at_lib/list_debug.c
> :985 0% 2:990 dmesg.kernel_BUG_at_mm/page_alloc.c
> 1:985 0% 3:990 dmesg.kernel_BUG_at_mm/usercopy.c
>
> however, we noticed dmesg.kernel_BUG_at_include/linux/mm.h still have
> relatively high rate on this commit but keeps clean on parent.
>
> so I attached dmesg-rcutorture-bug-at-mm-h.xz
> since it has
> [ 33.586931][ T183] ------------[ cut here ]------------
> [ 33.590089][ T183] kernel BUG at include/linux/mm.h:825!
> [ 33.591219][ T183] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> not sure if this is helpful.
>
> > > >
> > > > In any case, this is something a debug_pagealloc kernel could have a chance
> > > > of catching earlier. Would it be possible to enable CONFIG_DEBUG_PAGEALLOC
> > > > and DEBUG_PAGEALLOC_ENABLE_DEFAULT additionally to the rest of the
> > > > configuration, and repeat the test?
> > >
> > > ok, we are starting to test by these 2 additional configs now.
> >
> > BTW it seems to be totally unrelated to rcutorture tests.
> > Are there similar reports in boot tests with the same config?
>
> hi Hyeonggon,
>
> per your suggestion (Thanks a lot!) we used same config as above which
> enabled CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> to do boot tests. the results have some differences:
>
> =========================================================================================
> compiler/kconfig/rootfs/sleep/tbox_group/testcase:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
>
> 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> 11:999 20% 208:999 dmesg.invalid_opcode:#[##]
> 2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
> 4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> 4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
> :999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> 1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
>
> here the dmesg.kernel_BUG_at_include/linux/mm.h is also reproduced on parent,
> only issue shows on this commit but not on parent is (but rate is very low)
> dmesg.kernel_BUG_at_mm/page_alloc.c
>
>
> I also attached dmesg-boot-bug-at-page_alloc-c.xz
> which has
> [ 17.251777][ C0] ------------[ cut here ]------------
> [ 17.252218][ C0] kernel BUG at mm/page_alloc.c:1406!
> [ 17.252647][ C0] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>
> If you need more information, please let us know. Thanks

[dmesg-rcutorture-but-at-mm-h.xz]

[ 33.533541][ T183] page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x8df2
[ 33.535615][ T183] page:(ptrval) refcount:0 mapcount:0 mapping:00000000 index:0x0 pfn:0x8df0
[ 33.537599][ T183] flags: 0x0(zone=0)
[ 33.538516][ T183] raw: 00010200 c0100b70 c0100b70 c019b680 c8df1000 00020001 ffffffff 00000001
[ 33.540538][ T183] raw: 00000000 00000000
[ 33.541514][ T183] page dumped because: VM_BUG_ON_FOLIO(!folio_test_large(folio))
[ 33.547407][ T183] page_owner tracks the page as freed
[ 33.548569][ T183] page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 180, tgid 180 (systemd-journal), ts 33548330257, free_ts 33548381837
[ 33.551910][ T183] post_alloc_hook+0x29a/0x320
[ 33.552602][ T183] get_page_from_freelist+0x226/0x310
[ 33.553469][ T183] __alloc_pages+0xdd/0x360
[ 33.554387][ T183] alloc_slab_page+0x12d/0x200
[ 33.555375][ T183] allocate_slab+0x6a/0x350
[ 33.556334][ T183] new_slab+0x48/0xc0
[ 33.557199][ T183] __slab_alloc_node+0xfb/0x270
[ 33.558510][ T183] kmem_cache_alloc+0x8f/0x4e0
[ 33.559599][ T183] getname_flags+0x33/0x2f0
[ 33.560750][ T183] user_path_at_empty+0x2d/0x90
[ 33.561910][ T183] do_faccessat+0xb5/0x410
[ 33.562993][ T183] __ia32_sys_access+0x2a/0x40
[ 33.564112][ T183] __do_fast_syscall_32+0x72/0xd0
[ 33.565340][ T183] do_fast_syscall_32+0x32/0x70
[ 33.566379][ T183] do_SYSENTER_32+0x15/0x20
[ 33.567307][ T183] entry_SYSENTER_32+0xa2/0xfb
[ 33.568331][ T183] page last free stack trace:
[ 33.569321][ T183] free_pcp_prepare+0x39f/0x970
[ 33.570383][ T183] free_unref_page_prepare+0x29/0x210
[ 33.571559][ T183] free_unref_page+0x3a/0x3b0
[ 33.572586][ T183] __free_pages+0x187/0x1f0
[ 33.573628][ T183] __free_slab+0x1fd/0x350
[ 33.574890][ T183] free_slab+0x22/0x70
[ 33.575888][ T183] free_to_partial_list+0x125/0x260
[ 33.577142][ T183] do_slab_free+0x30/0x70
[ 33.578176][ T183] kmem_cache_free+0x171/0x1e0
[ 33.579266][ T183] putname+0x9f/0xf0
[ 33.580198][ T183] do_sys_openat2+0xe2/0x1f0
[ 33.581327][ T183] do_sys_open+0x8e/0xe0
[ 33.582408][ T183] __ia32_sys_openat+0x2b/0x40
[ 33.583508][ T183] __do_fast_syscall_32+0x72/0xd0
[ 33.584743][ T183] do_fast_syscall_32+0x32/0x70
[ 33.585888][ T183] do_SYSENTER_32+0x15/0x20
[ 33.586931][ T183] ------------[ cut here ]------------
[ 33.590089][ T183] kernel BUG at include/linux/mm.h:825!
[ 33.591219][ T183] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 33.592301][ T183] CPU: 0 PID: 183 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #1 7084bf045416bf9c79a76bd063d103f14dfaa815
[ 33.594693][ T183] EIP: __dump_page.cold+0x282/0x369
[ 33.595580][ T183] Code: ff ff 83 05 c8 9e bf c5 01 ba 94 31 30 c4 89 f8 83 15 cc 9e bf c5 00 e8 12 89 ed fd 83 05 d8 9e bf c5 01 83 15 dc 9e bf c5 00 <0f> 0b 83 05 e0 9e bf c5 01 b8 ac c5 a7 c4 83 15 e4 9e bf c5 00 e8
[ 33.598957][ T183] EAX: 00000000 EBX: e6bf1dd0 ECX: 00000000 EDX: ffffffff
[ 33.600184][ T183] ESI: e6bf1d80 EDI: e6bf1d80 EBP: c8bbdd84 ESP: c8bbdd48
[ 33.601409][ T183] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[ 33.602791][ T183] CR0: 80050033 CR2: 0063209c CR3: 08bdd000 CR4: 000406d0
[ 33.604116][ T183] Call Trace:
[ 33.604821][ T183] dump_page+0x2a/0xc0
[ 33.605616][ T183] folio_flags+0x23/0x70


This time for newly allocated slab page, PageTail(page) returns true.

[ 33.610604][ T183] alloc_slab_page+0x7f/0x200
[ 33.611402][ T183] allocate_slab+0x6a/0x350
[ 33.612162][ T183] new_slab+0x48/0xc0
[ 33.612850][ T183] __slab_alloc_node+0xfb/0x270
[ 33.613878][ T183] kmem_cache_alloc+0x8f/0x4e0
[ 33.614799][ T183] ? __lock_acquire+0x427/0xad0
[ 33.615757][ T183] getname_flags+0x33/0x2f0
[ 33.616631][ T183] getname+0x1a/0x30
[ 33.617385][ T183] do_sys_openat2+0xa5/0x1f0
[ 33.618230][ T183] do_sys_open+0x8e/0xe0
[ 33.618978][ T183] __ia32_sys_openat+0x2b/0x40
[ 33.619753][ T183] __do_fast_syscall_32+0x72/0xd0
[ 33.620669][ T183] ? lockdep_hardirqs_on_prepare+0x242/0x400
[ 33.621632][ T183] ? syscall_exit_to_user_mode+0x5f/0x90
[ 33.622540][ T183] ? __do_fast_syscall_32+0x7c/0xd0
[ 33.623395][ T183] ? lockdep_hardirqs_on_prepare+0x242/0x400
[ 33.624281][ T183] ? syscall_exit_to_user_mode+0x5f/0x90
[ 33.625202][ T183] ? __do_fast_syscall_32+0x7c/0xd0
[ 33.626069][ T183] ? syscall_exit_to_user_mode+0x5f/0x90
[ 33.627016][ T183] ? __do_fast_syscall_32+0x7c/0xd0
[ 33.627996][ T183] ? __do_fast_syscall_32+0x7c/0xd0
[ 33.628916][ T183] ? irqentry_exit_to_user_mode+0x23/0x30
[ 33.629958][ T183] ? irqentry_exit+0x7f/0xc0
[ 33.630728][ T183] do_fast_syscall_32+0x32/0x70
[ 33.631583][ T183] do_SYSENTER_32+0x15/0x20
[ 33.632361][ T183] entry_SYSENTER_32+0xa2/0xfb
[ 33.633179][ T183] EIP: 0xb7f10549
[ 33.633783][ T183] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
[ 33.636795][ T183] EAX: ffffffda EBX: 00000005 ECX: 00631b01 EDX: 002a8000
[ 33.642042][ T183] ESI: 00000000 EDI: 00000001 EBP: 00631adc ESP: bfdff390
[ 33.643329][ T183] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
[ 33.644726][ T183] Modules linked in:
[ 33.646532][ T183] ---[ end trace 0000000000000000 ]---
[ 33.647477][ T183] EIP: __dump_page.cold+0x282/0x369
[ 33.648422][ T183] Code: ff ff 83 05 c8 9e bf c5 01 ba 94 31 30 c4 89 f8 83 15 cc 9e bf c5 00 e8 12 89 ed fd 83 05 d8 9e bf c5 01 83 15 dc 9e bf c5 00 <0f> 0b 83 05 e0 9e bf c5 01 b8 ac c5 a7 c4 83 15 e4 9e bf c5 00 e8
[ 33.651543][ T183] EAX: 00000000 EBX: e6bf1dd0 ECX: 00000000 EDX: ffffffff
[ 33.652761][ T183] ESI: e6bf1d80 EDI: e6bf1d80 EBP: c8bbdd84 ESP: c8bbdd48
[ 33.653995][ T183] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[ 33.655452][ T183] CR0: 80050033 CR2: 0063209c CR3: 08bdd000 CR4: 000406d0
[ 33.656826][ T183] Kernel panic - not syncing: Fatal exception
[ 33.658167][ T183] Kernel Offset: disabled



[dmesg-boot-bug-at-page_alloc-c.xz]

[ 17.236089][ C0] page:(ptrval) refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x8c96
[ 17.236862][ C0] head:(ptrval) order:1 compound_mapcount:0 compound_pincount:0

In this case the page (pfn 0x8c96) is head page of order-1 page
because it's pfn is even number, and the page is used for kernel stack.

According to the configuration, the page (whose pfn is 0x8c96)
is allocated for kernel stack from the buddy allocator and is
just about to be freed after RCU grace period.

[ 17.237503][ C0] memcg:c8a98c81
[ 17.237789][ C0] flags: 0x10200(slab|head|zone=0)
[ 17.238217][ C0] raw: 00000000 e6bef721 00000122 00000400 00000000 00000000 ffffffff 00000000
[ 17.238932][ C0] raw: c8b1c1e2 00000000
[ 17.239260][ C0] head: 00010200 00000100 00000122 c019b780 00000000 000a000a ffffffff 00000001
[ 17.239959][ C0] head: c8a98c81 00000000

The page's (again pfn 0x8c96) page->flags and page->_refcount are zero
like freed page when it's not freed yet. These two fields must be
overwritten after its allocation, but who overwrote these fields is
still a question.

also the value of page->compound_head (0xe6bef721) does not make sense at all.
This value cannot be set by buddy, because head page of order-1
page cannot be tail page of another order-1 page.

If this value is set by slab allocator's list operation, the last bit
must not be set because struct page is at least word-aligned.

[ 17.240300][ C0] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[ 17.244977][ C0] page_owner tracks the page as allocated
[ 17.245471][ C0] page last allocated via order 1, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 1, tgid 1 (systemd), ts 17077181323, free_ts 0
[ 17.246702][ C0] post_alloc_hook+0x29a/0x320
[ 17.247126][ C0] get_page_from_freelist+0x226/0x310
[ 17.247550][ C0] __alloc_pages+0xdd/0x360
[ 17.247907][ C0] alloc_thread_stack_node+0x48/0xf0
[ 17.248342][ C0] dup_task_struct+0x73/0x330
[ 17.248722][ C0] copy_process+0x2d6/0x2f90
[ 17.249088][ C0] kernel_clone+0x95/0x4f0
[ 17.249426][ C0] __ia32_sys_clone+0x8a/0xc0
[ 17.249785][ C0] __do_fast_syscall_32+0x72/0xd0
[ 17.250209][ C0] do_fast_syscall_32+0x32/0x70
[ 17.250589][ C0] do_SYSENTER_32+0x15/0x20
[ 17.250964][ C0] entry_SYSENTER_32+0xa2/0xfb
[ 17.251339][ C0] page_owner free stack trace missing
[ 17.251777][ C0] ------------[ cut here ]------------
[ 17.252218][ C0] kernel BUG at mm/page_alloc.c:1406!
[ 17.252647][ C0] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 17.253148][ C0] CPU: 0 PID: 185 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #1 7084bf045416bf9c79a76bd063d103f14dfaa815
[ 17.254174][ C0] EIP: free_pcp_prepare+0x70/0x970
[ 17.254561][ C0] Code: 74 48 83 05 b0 8e c1 c5 01 ba 50 5e 2f c4 89 d8 83 15 b4 8e c1 c5 00 e8 2e 65 fa ff 83 05 c0 8e c1 c5 01 83 15 c4 8e c1 c5 00 <0f> 0b 83 05 c8 8e c1 c5 01 b8 d8 7a a8 c4 83 15 cc 8e c1 c5 00 e8
[ 17.256029][ C0] EAX: 00000000 EBX: e6bef770 ECX: 00000002 EDX: c36c5dbd
[ 17.256571][ C0] ESI: 00000001 EDI: c019ff68 EBP: c019fed0 ESP: c019fea8
[ 17.257122][ C0] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[ 17.257705][ C0] CR0: 80050033 CR2: 0062f8bc CR3: 08ba7000 CR4: 000406d0
[ 17.258241][ C0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 17.258768][ C0] DR6: fffe0ff0 DR7: 00000400
[ 17.259115][ C0] Call Trace:
[ 17.259367][ C0] <SOFTIRQ>
[ 17.259615][ C0] free_unref_page_prepare+0x29/0x210
[ 17.260019][ C0] free_unref_page+0x3a/0x3b0
[ 17.260381][ C0] __free_pages+0x187/0x1f0
[ 17.260746][ C0] thread_stack_free_rcu+0x2e/0x40
[ 17.261149][ C0] rcu_do_batch+0x267/0xab0
[ 17.261509][ C0] ? rcu_do_batch+0x1f3/0xab0
[ 17.261877][ C0] rcu_core+0x21b/0x450
[ 17.262204][ C0] rcu_core_si+0x16/0x30
[ 17.262535][ C0] __do_softirq+0x178/0x53b
[ 17.262892][ C0] ? __softirqentry_text_start+0x8/0x8
[ 17.263322][ C0] do_softirq_own_stack+0x32/0x50
[ 17.263711][ C0] </SOFTIRQ>

2023-01-05 15:07:18

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On Thu, Jan 05, 2023 at 09:46:06AM +0800, Oliver Sang wrote:
> If you need more information, please let us know. Thanks

Can we get a vmcore dump using kdump functionality?

--
Thanks,
Hyeonggon

2023-01-05 15:53:04

by Matthew Wilcox

[permalink] [raw]
Subject: Re: A better dump_page()

On Thu, Jan 05, 2023 at 04:19:06PM +0100, Vlastimil Babka wrote:
> On 1/4/23 00:29, Matthew Wilcox wrote:
> > On Tue, Jan 03, 2023 at 03:07:12PM -0800, David Rientjes wrote:
> >> On Tue, 3 Jan 2023, Matthew Wilcox wrote:
> >>
> >> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> >> > > Separately we should also make the __dump_page() more resilient.
> >> >
> >> > Right. It's not ideal when one of our best debugging tools obfuscates
> >> > the problem we're trying to debug. I've seen probems like this before,
> >> > and the problem is that somebody calls dump_page() on a page that they
> >> > don't own a refcount on. That lets the page mutate under us in some
> >> > fairly awkward ways (as you've seen here, it seems to be part of several
> >> > different compound allocations at various points during the dump
> >> > process).
> >> >
> >> > One possibility I thought about was taking our own refcount on the
> >> > page at the start of dump_page(). That would kill off the possibility
> >> > of ever passing in a const struct page, and it would confuse people.
> >> > Also, what if somebody passes in a pointer to something that's not a
> >> > struct page? Then we've (tried to) modify memory that's not a refcount.
> >> >
> >> > I think the best we can do is to snapshot the struct page and the folio
> >> > it appears to belong to at the start of dump_page(). It'll take a
> >> > little care (for example, folio_pfn() must be passed the original
> >> > folio, and not the snapshot), but I think it's doable.
> >> >
> >>
> >> By snapshot do you mean memcpy() of the metadata to the stack? I assume
> >> this still leaves the opportunity for the underlying mutation of the page
> >> but makes the window more narrow.
> >
> > Right. We'd need to memcpy() both the page and the folio, so around 192
> > bytes. It doesn't make it consistent since it could be mutated during
> > the memcpy(), but it will be consistent throughout the execution of the
> > function, so we won't get calls like folio_entire_mapcount() aborting
> > due to the folio having become a tail page halfway through.
>
> I'm afraid this problem can still happen if the snapshot is inconsistent in
> the first place and you e.g. snapshot the tail page as tail page, and the
> supposed folio head page as not head page.

Oh, there would have to be an explicit check that the head page pointed
to is actually a head page, and further that it really is the head page
of this page. But the advantage is that we can check that we have a
consistent snapshot _once_ rather than in every helper that we use in
this function.

The question is what to do if we have an inconsistent snapshot.
The answer we usually use (eg in filemap, gup, etc) is to try again until
we get a consistent answer. But I think that's the wrong answer here; if
we get an inconsistent snapshot, we should give up. Whatever information
might have been gleaned from the head page is now gone, because we don't
have the original head page any more. So we should just dump the page
as a non-compound page.

> The local copying still makes a lot of sense though, as anything checked on
> the copy that determines its further evaluation can be trusted to remain
> consistent without complicated piecemeal READ_ONCE()'s etc. And as you
> mentioned it will allow us to proceed with constification, where the
> possibility of dump_page() through VM_BUG_ON_PAGE() is IIRC a big blocker
> for constifying the various flags checking helpers etc.

Yes, that's a nice side-benefit ;-)

2023-01-05 16:06:23

by Vlastimil Babka

[permalink] [raw]
Subject: Re: A better dump_page()

On 1/4/23 00:29, Matthew Wilcox wrote:
> On Tue, Jan 03, 2023 at 03:07:12PM -0800, David Rientjes wrote:
>> On Tue, 3 Jan 2023, Matthew Wilcox wrote:
>>
>> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
>> > > Separately we should also make the __dump_page() more resilient.
>> >
>> > Right. It's not ideal when one of our best debugging tools obfuscates
>> > the problem we're trying to debug. I've seen probems like this before,
>> > and the problem is that somebody calls dump_page() on a page that they
>> > don't own a refcount on. That lets the page mutate under us in some
>> > fairly awkward ways (as you've seen here, it seems to be part of several
>> > different compound allocations at various points during the dump
>> > process).
>> >
>> > One possibility I thought about was taking our own refcount on the
>> > page at the start of dump_page(). That would kill off the possibility
>> > of ever passing in a const struct page, and it would confuse people.
>> > Also, what if somebody passes in a pointer to something that's not a
>> > struct page? Then we've (tried to) modify memory that's not a refcount.
>> >
>> > I think the best we can do is to snapshot the struct page and the folio
>> > it appears to belong to at the start of dump_page(). It'll take a
>> > little care (for example, folio_pfn() must be passed the original
>> > folio, and not the snapshot), but I think it's doable.
>> >
>>
>> By snapshot do you mean memcpy() of the metadata to the stack? I assume
>> this still leaves the opportunity for the underlying mutation of the page
>> but makes the window more narrow.
>
> Right. We'd need to memcpy() both the page and the folio, so around 192
> bytes. It doesn't make it consistent since it could be mutated during
> the memcpy(), but it will be consistent throughout the execution of the
> function, so we won't get calls like folio_entire_mapcount() aborting
> due to the folio having become a tail page halfway through.

I'm afraid this problem can still happen if the snapshot is inconsistent in
the first place and you e.g. snapshot the tail page as tail page, and the
supposed folio head page as not head page. I'm afraid the only way to
prevent this is to make sure any helpers __dump_page() uses simply must not
contain any VM_BUG/WARN_ON sanity checks, and anything treated as a pointer
must be checked before dereferencing.

The local copying still makes a lot of sense though, as anything checked on
the copy that determines its further evaluation can be trusted to remain
consistent without complicated piecemeal READ_ONCE()'s etc. And as you
mentioned it will allow us to proceed with constification, where the
possibility of dump_page() through VM_BUG_ON_PAGE() is IIRC a big blocker
for constifying the various flags checking helpers etc.

2023-01-06 10:39:24

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On 1/5/23 02:46, Oliver Sang wrote:
> hi, Hyeonggon, hi, Vlastimil,
>
> On Wed, Jan 04, 2023 at 06:04:20PM +0900, Hyeonggon Yoo wrote:
>> On Tue, Jan 03, 2023 at 09:46:33PM +0800, Oliver Sang wrote:
>> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
>> > > So the events leading up to this could be something like:
>> > >
>> > > - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
>> > > - despite being on partial list, it's freed ???
>> > > - somebody else allocates order-2 page 0x2daec and uses it for whatever,
>> > > then frees it
>> > > - 0x2daec is reallocated as order-1 slab from names_cache, then freed
>> > > - we try to allocate from the slab page 0x2daee and trip on the PageTail
>> > >
>> > > Except, the freeing of order-2 page would have reset the PageTail and
>> > > compound_head in 0x2daec, so this is even more complicated or involves some
>> > > extra race?
>> >
>> > FYI, we ran tests more up to 500 times, then saw different issues but rate is
>> > actually low
>> >
>> > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
>> > ---------------- ---------------------------
>> > fail:runs %reproduction fail:runs
>> > | | |
>> > :500 12% 61:500 dmesg.invalid_opcode:#[##]
>> > :500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
>> > :500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
>> > :500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
>> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
>> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c
>> >
>
> hi Vlastimil,
>
> as you mentioned
>> Hm even if rate is low, the different kinds of reports could be useful to
>> see, if all of that is caused by the commit.
>
> we tried to run tests even more times, but with the config which enable
> CONFIG_DEBUG_PAGEALLOC
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> (config is attached as
> config-6.1.0-rc2-00014-g0af8489b0216+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> the only diff with previous config is
> @@ -5601,7 +5601,8 @@ CONFIG_HAVE_KCSAN_COMPILER=y
> # Memory Debugging
> #
> CONFIG_PAGE_EXTENSION=y
> -# CONFIG_DEBUG_PAGEALLOC is not set
> +CONFIG_DEBUG_PAGEALLOC=y
> +CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
> CONFIG_PAGE_OWNER=y
> # CONFIG_PAGE_POISONING is not set
> CONFIG_DEBUG_PAGE_REF=y
> )
>
> what we found now is some issues are also reproduced on parent now (still by
> rcutorture tests here), though seems lower rate on parent.
>
> =========================================================================================
> compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
>
> 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> 8:985 19% 199:990 dmesg.invalid_opcode:#[##]
> :985 5% 51:990 dmesg.kernel_BUG_at_include/linux/mm.h
> 3:985 4% 41:990 dmesg.kernel_BUG_at_include/linux/page-flags.h
> 4:985 10% 102:990 dmesg.kernel_BUG_at_lib/list_debug.c
> :985 0% 2:990 dmesg.kernel_BUG_at_mm/page_alloc.c
> 1:985 0% 3:990 dmesg.kernel_BUG_at_mm/usercopy.c
>
> however, we noticed dmesg.kernel_BUG_at_include/linux/mm.h still have
> relatively high rate on this commit but keeps clean on parent.

Well that's interesting. As long as any bugs happen in the parent, it could
mean the commit we suspect is just changing the circumstances and creating
conditions that increase the bug happening - e.g. because it causes slab
pages to be always immediately freed when the last object is freed.

So I would be curiou about how some of the reports from the parent look like
in detail. And if the rate at the parent (has it increased thanks to the
DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!

2023-01-06 17:43:46

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On Sat, Dec 31, 2022 at 11:26:25PM +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):
>
> commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]
>
> in testcase: rcutorture
> version:
> with following parameters:
>
> runtime: 300s
> test: default
> torture_type: tasks-tracing
>
> test-description: rcutorture is rcutorture kernel module load/unload test.
> test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-lkp/[email protected]

Adding list_debug oops on same commit and config,
and to me it seems to be related with the reported issue.

Looks like something is corrupting struct pages that are in pcp list...

[ 9.271595][ T271] list_del corruption. next->prev should be ee3aac04, but was 22000001. (next=ee3aabb1)
[ 9.271943][ T271] ------------[ cut here ]------------
[ 9.272117][ T271] kernel BUG at lib/list_debug.c:62!
[ 9.272296][ T271] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 9.272498][ T271] CPU: 1 PID: 271 Comm: systemd-udevd Tainted: G S 6.1.0-rc2-00010-g0af8489b0216 #2144 30f8917077604b53250250d749a1d46e4be1d31a
[ 9.272961][ T271] EIP: __list_del_entry_valid.cold+0x7a/0x1c4
[ 9.273162][ T271] Code: 0c 89 5c 24 04 89 44 24 08 c7 04 24 2c 33 32 c4 83 15 a4 6a e5 c5 00 e8 0c 02 f9 ff 83 05 a8 6a e5 c5 01 83 15 ac 6a e5 c5 00 <0f> 0b 83 05 b0 6a e5 c5 01 b8 bc 1f b5 c4 83 15 b4 6a e5 c5 00 e8
[ 9.273783][ T271] EAX: 00000055 EBX: ee3aac04 ECX: f6ff05c0 EDX: 00000001
[ 9.274007][ T271] ESI: f6ff4560 EDI: ee3aac00 EBP: c8dd3be0 ESP: c8dd3bcc
[ 9.274234][ T271] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010046
[ 9.274476][ T271] CR0: 80050033 CR2: bfa85bf0 CR3: 08a0e000 CR4: 00040690
[ 9.274707][ T271] Call Trace:
[ 9.274816][ T271] __rmqueue_pcplist+0x5d/0x1c0
[ 9.274972][ T271] rmqueue_pcplist.constprop.0+0xc7/0x240
[ 9.275153][ T271] rmqueue.isra.0+0x57f/0xc60
[ 9.275305][ T271] ? zone_watermark_fast+0x118/0x230
[ 9.275479][ T271] get_page_from_freelist+0xe7/0x310
[ 9.275651][ T271] __alloc_pages+0xdd/0x360
[ 9.275799][ T271] alloc_slab_page+0x12d/0x200
[ 9.275952][ T271] allocate_slab+0x6a/0x350
[ 9.276098][ T271] new_slab+0x48/0xc0
[ 9.276228][ T271] __slab_alloc_node.constprop.0+0xf3/0x260
[ 9.276416][ T271] __kmem_cache_alloc_node+0x75/0x490
[ 9.276591][ T271] ? lock_is_held_type+0x80/0xf0
[ 9.276753][ T271] __kmalloc_node+0x7a/0x170
[ 9.276902][ T271] ? kvmalloc_node+0x42/0x1e0
[ 9.277053][ T271] ? seq_read_iter+0x55/0x770
[ 9.277205][ T271] ? kvmalloc_node+0x42/0x1e0
[ 9.277355][ T271] kvmalloc_node+0x42/0x1e0
[ 9.277503][ T271] seq_read_iter+0x359/0x770
[ 9.277653][ T271] proc_reg_read_iter+0xab/0x140
[ 9.277813][ T271] vfs_read+0x294/0x3c0
[ 9.277947][ T271] ksys_read+0x82/0x1c0
[ 9.278077][ T271] __ia32_sys_read+0x1e/0x30
[ 9.278227][ T271] __do_fast_syscall_32+0x72/0xd0
[ 9.278388][ T271] ? __do_fast_syscall_32+0x7c/0xd0
[ 9.278557][ T271] ? lockdep_hardirqs_on_prepare+0x242/0x400
[ 9.278759][ T271] ? syscall_exit_to_user_mode+0x5f/0x90
[ 9.278940][ T271] ? __do_fast_syscall_32+0x7c/0xd0
[ 9.279106][ T271] ? syscall_exit_to_user_mode+0x5f/0x90
[ 9.279285][ T271] ? __do_fast_syscall_32+0x7c/0xd0
[ 9.279453][ T271] ? __do_fast_syscall_32+0x7c/0xd0
[ 9.279621][ T271] ? irqentry_exit_to_user_mode+0x23/0x30
[ 9.279806][ T271] do_fast_syscall_32+0x32/0x70
[ 9.279960][ T271] do_SYSENTER_32+0x15/0x20
[ 9.280107][ T271] entry_SYSENTER_32+0xa2/0xfb
[ 9.280262][ T271] EIP: 0xb7f31549
[ 9.280379][ T271] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
[ 9.280996][ T271] EAX: ffffffda EBX: 00000007 ECX: 004dd3d0 EDX: 00000400
[ 9.281223][ T271] ESI: 004cfa90 EDI: b7e3a960 EBP: bfa86888 ESP: bfa86838

2023-01-09 14:20:57

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

hi, Vlastimil,

On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
> On 1/5/23 02:46, Oliver Sang wrote:
> > hi, Hyeonggon, hi, Vlastimil,
> >
> > On Wed, Jan 04, 2023 at 06:04:20PM +0900, Hyeonggon Yoo wrote:
> >> On Tue, Jan 03, 2023 at 09:46:33PM +0800, Oliver Sang wrote:
> >> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> >> > > So the events leading up to this could be something like:
> >> > >
> >> > > - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
> >> > > - despite being on partial list, it's freed ???
> >> > > - somebody else allocates order-2 page 0x2daec and uses it for whatever,
> >> > > then frees it
> >> > > - 0x2daec is reallocated as order-1 slab from names_cache, then freed
> >> > > - we try to allocate from the slab page 0x2daee and trip on the PageTail
> >> > >
> >> > > Except, the freeing of order-2 page would have reset the PageTail and
> >> > > compound_head in 0x2daec, so this is even more complicated or involves some
> >> > > extra race?
> >> >
> >> > FYI, we ran tests more up to 500 times, then saw different issues but rate is
> >> > actually low
> >> >
> >> > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> >> > ---------------- ---------------------------
> >> > fail:runs %reproduction fail:runs
> >> > | | |
> >> > :500 12% 61:500 dmesg.invalid_opcode:#[##]
> >> > :500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
> >> > :500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
> >> > :500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
> >> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
> >> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c
> >> >
> >
> > hi Vlastimil,
> >
> > as you mentioned
> >> Hm even if rate is low, the different kinds of reports could be useful to
> >> see, if all of that is caused by the commit.
> >
> > we tried to run tests even more times, but with the config which enable
> > CONFIG_DEBUG_PAGEALLOC
> > CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > (config is attached as
> > config-6.1.0-rc2-00014-g0af8489b0216+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > the only diff with previous config is
> > @@ -5601,7 +5601,8 @@ CONFIG_HAVE_KCSAN_COMPILER=y
> > # Memory Debugging
> > #
> > CONFIG_PAGE_EXTENSION=y
> > -# CONFIG_DEBUG_PAGEALLOC is not set
> > +CONFIG_DEBUG_PAGEALLOC=y
> > +CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
> > CONFIG_PAGE_OWNER=y
> > # CONFIG_PAGE_POISONING is not set
> > CONFIG_DEBUG_PAGE_REF=y
> > )
> >
> > what we found now is some issues are also reproduced on parent now (still by
> > rcutorture tests here), though seems lower rate on parent.
> >
> > =========================================================================================
> > compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
> >
> > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> > ---------------- ---------------------------
> > fail:runs %reproduction fail:runs
> > | | |
> > 8:985 19% 199:990 dmesg.invalid_opcode:#[##]
> > :985 5% 51:990 dmesg.kernel_BUG_at_include/linux/mm.h
> > 3:985 4% 41:990 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > 4:985 10% 102:990 dmesg.kernel_BUG_at_lib/list_debug.c
> > :985 0% 2:990 dmesg.kernel_BUG_at_mm/page_alloc.c
> > 1:985 0% 3:990 dmesg.kernel_BUG_at_mm/usercopy.c
> >
> > however, we noticed dmesg.kernel_BUG_at_include/linux/mm.h still have
> > relatively high rate on this commit but keeps clean on parent.
>
> Well that's interesting. As long as any bugs happen in the parent, it could
> mean the commit we suspect is just changing the circumstances and creating
> conditions that increase the bug happening - e.g. because it causes slab
> pages to be always immediately freed when the last object is freed.
>
> So I would be curiou about how some of the reports from the parent look like
> in detail.

since now we have below 3 also for parent:
3:985 4% 41:990 dmesg.kernel_BUG_at_include/linux/page-flags.h
4:985 10% 102:990 dmesg.kernel_BUG_at_lib/list_debug.c
1:985 0% 3:990 dmesg.kernel_BUG_at_mm/usercopy.c
I pick one dmesg for each case from parent commit (56d5a2b9ba85a390) as
attached:
dmesg-parent-bug-at-page-flags-h.xz
dmesg-parent-bug-at-list_debug-c.xz
dmesg-parent-bug-at-usercopy-c.xz
FYI

> And if the rate at the parent (has it increased thanks to the
> DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!

got it. Thanks for suggestion!

since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
with same config upon v6.1-rc2 to see if it's really clean there.
if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.

will keep you updated. Thanks

>
>

2023-01-09 14:34:09

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

hi, Hyeonggon,

On Thu, Jan 05, 2023 at 11:47:22PM +0900, Hyeonggon Yoo wrote:
> On Thu, Jan 05, 2023 at 09:46:06AM +0800, Oliver Sang wrote:
> > If you need more information, please let us know. Thanks
>
> Can we get a vmcore dump using kdump functionality?

it's kind of hard for us to enable kdump in our test framework, which we
need put some efforts to investigate, so maybe we cannot supply vmcore dump
for now.

but as you complained as:

> > To reproduce:
> >
> > # build kernel
> > cd linux
> > cp config-6.1.0-rc2-00014-g0af8489b0216 .config
> > make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
> > make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
> > cd <mod-install-dir>
> > find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
> >
> >
> > git clone https://github.com/intel/lkp-tests.git
> > cd lkp-tests
> > bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
> >
> > # if come across any failure that blocks the test,
> > # please remove ~/.lkp and /lkp dir to run from a clean state.
>
>
> I was unable to reproduce in the same way as described above
> because some files referenced in job-script couldn't be downloaded from
> download.01.org/0day :(

sorry for this, download.01.org is in migration stage, now we uploaded missed
images manually, above reproducer should be ok now.

could you have a try? and not sure if it's easy for you to enable kdump while
you try local reproduce?

anyway, this is just FYI.

at the same time, as Vlastimil gave us a great suggestion to try to do more
bisection to see if there is a real fbc, we now start to try to find a clean
base. thanks

>
> --
> Thanks,
> Hyeonggon
>

2023-01-09 14:55:02

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

sorry, missed the attachment...

On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
> hi, Vlastimil,
>
> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
> > On 1/5/23 02:46, Oliver Sang wrote:
> > > hi, Hyeonggon, hi, Vlastimil,
> > >
> > > On Wed, Jan 04, 2023 at 06:04:20PM +0900, Hyeonggon Yoo wrote:
> > >> On Tue, Jan 03, 2023 at 09:46:33PM +0800, Oliver Sang wrote:
> > >> > On Tue, Jan 03, 2023 at 11:42:11AM +0100, Vlastimil Babka wrote:
> > >> > > So the events leading up to this could be something like:
> > >> > >
> > >> > > - 0x2daee is order-1 slab folio of the inode cache, sitting on the partial list
> > >> > > - despite being on partial list, it's freed ???
> > >> > > - somebody else allocates order-2 page 0x2daec and uses it for whatever,
> > >> > > then frees it
> > >> > > - 0x2daec is reallocated as order-1 slab from names_cache, then freed
> > >> > > - we try to allocate from the slab page 0x2daee and trip on the PageTail
> > >> > >
> > >> > > Except, the freeing of order-2 page would have reset the PageTail and
> > >> > > compound_head in 0x2daec, so this is even more complicated or involves some
> > >> > > extra race?
> > >> >
> > >> > FYI, we ran tests more up to 500 times, then saw different issues but rate is
> > >> > actually low
> > >> >
> > >> > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> > >> > ---------------- ---------------------------
> > >> > fail:runs %reproduction fail:runs
> > >> > | | |
> > >> > :500 12% 61:500 dmesg.invalid_opcode:#[##]
> > >> > :500 3% 14:500 dmesg.kernel_BUG_at_include/linux/mm.h
> > >> > :500 3% 17:500 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > >> > :500 5% 26:500 dmesg.kernel_BUG_at_lib/list_debug.c
> > >> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/page_alloc.c
> > >> > :500 0% 2:500 dmesg.kernel_BUG_at_mm/usercopy.c
> > >> >
> > >
> > > hi Vlastimil,
> > >
> > > as you mentioned
> > >> Hm even if rate is low, the different kinds of reports could be useful to
> > >> see, if all of that is caused by the commit.
> > >
> > > we tried to run tests even more times, but with the config which enable
> > > CONFIG_DEBUG_PAGEALLOC
> > > CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > > (config is attached as
> > > config-6.1.0-rc2-00014-g0af8489b0216+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > > the only diff with previous config is
> > > @@ -5601,7 +5601,8 @@ CONFIG_HAVE_KCSAN_COMPILER=y
> > > # Memory Debugging
> > > #
> > > CONFIG_PAGE_EXTENSION=y
> > > -# CONFIG_DEBUG_PAGEALLOC is not set
> > > +CONFIG_DEBUG_PAGEALLOC=y
> > > +CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
> > > CONFIG_PAGE_OWNER=y
> > > # CONFIG_PAGE_POISONING is not set
> > > CONFIG_DEBUG_PAGE_REF=y
> > > )
> > >
> > > what we found now is some issues are also reproduced on parent now (still by
> > > rcutorture tests here), though seems lower rate on parent.
> > >
> > > =========================================================================================
> > > compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> > > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
> > >
> > > 56d5a2b9ba85a390 0af8489b0216fa1dd83e264bef8
> > > ---------------- ---------------------------
> > > fail:runs %reproduction fail:runs
> > > | | |
> > > 8:985 19% 199:990 dmesg.invalid_opcode:#[##]
> > > :985 5% 51:990 dmesg.kernel_BUG_at_include/linux/mm.h
> > > 3:985 4% 41:990 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > > 4:985 10% 102:990 dmesg.kernel_BUG_at_lib/list_debug.c
> > > :985 0% 2:990 dmesg.kernel_BUG_at_mm/page_alloc.c
> > > 1:985 0% 3:990 dmesg.kernel_BUG_at_mm/usercopy.c
> > >
> > > however, we noticed dmesg.kernel_BUG_at_include/linux/mm.h still have
> > > relatively high rate on this commit but keeps clean on parent.
> >
> > Well that's interesting. As long as any bugs happen in the parent, it could
> > mean the commit we suspect is just changing the circumstances and creating
> > conditions that increase the bug happening - e.g. because it causes slab
> > pages to be always immediately freed when the last object is freed.
> >
> > So I would be curiou about how some of the reports from the parent look like
> > in detail.
>
> since now we have below 3 also for parent:
> 3:985 4% 41:990 dmesg.kernel_BUG_at_include/linux/page-flags.h
> 4:985 10% 102:990 dmesg.kernel_BUG_at_lib/list_debug.c
> 1:985 0% 3:990 dmesg.kernel_BUG_at_mm/usercopy.c
> I pick one dmesg for each case from parent commit (56d5a2b9ba85a390) as
> attached:
> dmesg-parent-bug-at-page-flags-h.xz
> dmesg-parent-bug-at-list_debug-c.xz
> dmesg-parent-bug-at-usercopy-c.xz
> FYI
>
> > And if the rate at the parent (has it increased thanks to the
> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
>
> got it. Thanks for suggestion!
>
> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
> with same config upon v6.1-rc2 to see if it's really clean there.
> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
>
> will keep you updated. Thanks
>
> >
> >


Attachments:
(No filename) (5.76 kB)
dmesg-parent-bug-at-page-flags-h.xz (55.72 kB)
dmesg-parent-bug-at-list_debug-c.xz (57.27 kB)
dmesg-parent-bug-at-usercopy-c.xz (54.45 kB)
Download all attachments

2023-01-10 14:06:37

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

hi all,

On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
>
> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
>
> > And if the rate at the parent (has it increased thanks to the
> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
>
> got it. Thanks for suggestion!
>
> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
> with same config upon v6.1-rc2 to see if it's really clean there.
> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
>
> will keep you updated. Thanks

by more tests, we cannot make sure the v6.1-rc2 is clean, so we also checked
v6.1-rc1 and v6.0. from results, we have low confidence that we can make a
successful bisection based on them [1][2]. could you suggest?

a further information not sure if it's helpful, [1][2] are both i386 based.
we also tried to run boot tests on x86_64 upon commit 0af8489b02, whatever
with or without CONFIG_DEBUG_PAGEALLOC/CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT,
we never obseve similar issues (also run 999 times).


[1]
boot results:
=========================================================================================
compiler/kconfig/rootfs/sleep/tbox_group/testcase:
gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot

v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
---------------- --------------------------- --------------------------- --------------------------- ---------------------------
fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
| | | | | | | | |
:999 0% 2:999 0% 1:999 1% 11:999 21% 208:999 dmesg.invalid_opcode:#[##]
:999 0% :999 0% :999 0% 2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
:999 0% 1:999 0% :999 0% 4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
:999 0% 1:999 0% 1:999 0% 4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
:999 0% :999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
:999 0% :999 0% :999 0% 1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c

[2]
rcutorture results:
=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing

v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
---------------- --------------------------- --------------------------- --------------------------- ---------------------------
fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
| | | | | | | | |
:999 0% 3:999 0% :999 1% 8:998 20% 200:999 dmesg.invalid_opcode:#[##]
:999 0% :999 0% :999 0% :998 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
:999 0% :999 0% :999 0% 3:998 4% 42:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
:999 0% 3:999 0% :999 0% 4:998 10% 102:999 dmesg.kernel_BUG_at_lib/list_debug.c
:999 0% :999 0% :999 0% :998 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
:999 0% :999 0% :999 0% 1:998 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c

>
> >
> >

2023-01-10 14:13:44

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On 1/10/23 14:53, Oliver Sang wrote:
> hi all,
>
> On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
>>
>> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
>>
>> > And if the rate at the parent (has it increased thanks to the
>> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
>>
>> got it. Thanks for suggestion!
>>
>> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
>> with same config upon v6.1-rc2 to see if it's really clean there.
>> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
>>
>> will keep you updated. Thanks
>
> by more tests, we cannot make sure the v6.1-rc2 is clean, so we also checked
> v6.1-rc1 and v6.0. from results, we have low confidence that we can make a
> successful bisection based on them [1][2]. could you suggest?

So am I reading it right, that the probleam appears to be introduced between
v6.0 (0 failures) and v6.1-rc1 (>0 failures)? But agree that with such low
incidence, it's hard to bisect.

> a further information not sure if it's helpful, [1][2] are both i386 based.
> we also tried to run boot tests on x86_64 upon commit 0af8489b02, whatever
> with or without CONFIG_DEBUG_PAGEALLOC/CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT,
> we never obseve similar issues (also run 999 times).

Yeah it looks very much like something that manifests only on i386 (perhaps
only in QEMU as well?) and never x86_64.

What might be interesting then is v6.1-rc1 with further modified config to
enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON. Maybe it will catch the
culprit sooner. Or maybe it will obscure the bug instead, unfortunately.

Thanks for all your effort!

> [1]
> boot results:
> =========================================================================================
> compiler/kconfig/rootfs/sleep/tbox_group/testcase:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
>
> v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
> ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
> fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> | | | | | | | | |
> :999 0% 2:999 0% 1:999 1% 11:999 21% 208:999 dmesg.invalid_opcode:#[##]
> :999 0% :999 0% :999 0% 2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
> :999 0% 1:999 0% :999 0% 4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> :999 0% 1:999 0% 1:999 0% 4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
> :999 0% :999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> :999 0% :999 0% :999 0% 1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
>
> [2]
> rcutorture results:
> =========================================================================================
> compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
>
> v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
> ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
> fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> | | | | | | | | |
> :999 0% 3:999 0% :999 1% 8:998 20% 200:999 dmesg.invalid_opcode:#[##]
> :999 0% :999 0% :999 0% :998 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
> :999 0% :999 0% :999 0% 3:998 4% 42:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> :999 0% 3:999 0% :999 0% 4:998 10% 102:999 dmesg.kernel_BUG_at_lib/list_debug.c
> :999 0% :999 0% :999 0% :998 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> :999 0% :999 0% :999 0% 1:998 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
>
>>
>> >
>> >

2023-01-11 03:04:29

by Feng Tang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On Tue, Jan 10, 2023 at 10:09:36PM +0800, Vlastimil Babka wrote:
> On 1/10/23 14:53, Oliver Sang wrote:
> > hi all,
> >
> > On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
> >>
> >> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
> >>
> >> > And if the rate at the parent (has it increased thanks to the
> >> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
> >>
> >> got it. Thanks for suggestion!
> >>
> >> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
> >> with same config upon v6.1-rc2 to see if it's really clean there.
> >> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
> >>
> >> will keep you updated. Thanks
> >
> > by more tests, we cannot make sure the v6.1-rc2 is clean, so we also checked
> > v6.1-rc1 and v6.0. from results, we have low confidence that we can make a
> > successful bisection based on them [1][2]. could you suggest?
>
> So am I reading it right, that the probleam appears to be introduced between
> v6.0 (0 failures) and v6.1-rc1 (>0 failures)? But agree that with such low
> incidence, it's hard to bisect.

Yes, it's hard for bisection, as it happens at a higher reproduce rate
on several commits, hope it could be bisected soon.

Before Oliver found the parent commit also trigger the problem, I did
some debug, and I think related factors could be compound_page, folio,
slub and i386(struct page size and layout), which cause some compound
page structures corrupted.

* I tried change the slub_max_order to 0 for SLUB_TINY (much less
compound-page operations), and no issue is found for 350 runs.

* Disable order>=1 in pcp list, the issue's reproduce rate keeps
similar, which means the pcp operation is not related.

* Tried add some debug info in unused members of struct page of tail
pages (not first tail page), or add one more member at the end of
'struct page', can't reproduce it.

From Oliver's reports, there are several kinds of errors: list
operation failure (list_head structure corruption), compound page
check failure (PageTail warning) etc, which seem to indicate the
struct pages are randomly corrupted. In dump_page(), I tired to dump
more pages befor and after 'raw' and 'head' pages

[ 18.265162][ T186] page:43caddc9 refcount:1 mapcount:0 mapping:00000000 index:0xedece900 pfn:0x2dece
[ 18.266205][ T186] head:5e92e8e2 order:1 compound_mapcount:0 compound_pincount:0
[ 18.266928][ T186] flags: 0x10200(slab|head|zone=0)
[ 18.267339][ T186] raw-2: 00010200 c0100b70 c0100b70 c01a0680 edecd000 00020001 ffffffff 00000001
[ 18.268055][ T186] raw-2: 00000000 00000000
[ 18.268419][ T186] raw-1: 00000000 e7652fe1 00000101 ffffffff 00000000 00000000 ffffffff 00000000
[ 18.269124][ T186] raw-1: 00000000 00000000
[ 18.269488][ T186] raw: 00010200 e7652fe1 c0100bf0 00000400 edece900 000a0003 ffffffff 00000001
[ 18.270187][ T186] raw: edce5f81 00000000
[ 18.270534][ T186] raw1: 00000000 e7653031 00000101 ffffffff 00000000 00000000 ffffffff 00000000
[ 18.271257][ T186] raw1: 00000000 00000000
[ 18.271617][ T186] raw2: 00010200 00000100 00000122 c01a0780 00000000 000a000a ffffffff 00000001
[ 18.272330][ T186] raw2: edd5bf01 00000000
[ 18.272690][ T186] raw3: 00000000 e7653081 00000101 ffffffff 00000000 00000000 ffffffff 00000000
[ 18.273424][ T186] raw3: 00000000 00000000

[ 18.273793][ T186] head: 00010200 c0100b70 c0100b70 c01a0680 edecd000 00020001 ffffffff 00000001
[ 18.274534][ T186] head: 00000000 00000000
[ 18.274909][ T186] head1: 00000000 e7652fe1 00000101 ffffffff 00000000 00000000 ffffffff 00000000
[ 18.275633][ T186] head1: 00000000 00000000
[ 18.275996][ T186] head2: 00010200 e7652fe1 c0100bf0 00000400 edece900 000a0003 ffffffff 00000001
[ 18.276725][ T186] head2: edce5f81 00000000
[ 18.277093][ T186] head3: 00000000 e7653031 00000101 ffffffff 00000000 00000000 ffffffff 00000000
[ 18.277820][ T186] head3: 00000000 00000000
[ 18.278182][ T186] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[ 18.278747][ T186] page_owner tracks the page as allocated
[ 18.279447][ T186] page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 186, tgid 186 (udevadm), ts 18262675212, free_ts 18262044539
...

The page 'head' is actually 'raw' - 2, while page 'raw' seems
corrupted as it has PG_head bit set, while PageTail() is also true.

Another problem is 'raw-1' and 'raw' both points to 'raw-2', and
if it is an order-2 compound page, the 'raw+1' should also points
to 'raw-2' as 3 tail pages point to one head page. Or there is some
switch(corrupted) between order-1 and order-2, as Vlastimil
supsected in earlier reply?

Thanks,
Feng

> > a further information not sure if it's helpful, [1][2] are both i386 based.
> > we also tried to run boot tests on x86_64 upon commit 0af8489b02, whatever
> > with or without CONFIG_DEBUG_PAGEALLOC/CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT,
> > we never obseve similar issues (also run 999 times).
>
> Yeah it looks very much like something that manifests only on i386 (perhaps
> only in QEMU as well?) and never x86_64.
>
> What might be interesting then is v6.1-rc1 with further modified config to
> enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON. Maybe it will catch the
> culprit sooner. Or maybe it will obscure the bug instead, unfortunately.
>
> Thanks for all your effort!
>
> > [1]
> > boot results:
> > =========================================================================================
> > compiler/kconfig/rootfs/sleep/tbox_group/testcase:
> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
> >
> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> > | | | | | | | | |
> > :999 0% 2:999 0% 1:999 1% 11:999 21% 208:999 dmesg.invalid_opcode:#[##]
> > :999 0% :999 0% :999 0% 2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
> > :999 0% 1:999 0% :999 0% 4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > :999 0% 1:999 0% 1:999 0% 4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
> > :999 0% :999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> > :999 0% :999 0% :999 0% 1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
> >
> > [2]
> > rcutorture results:
> > =========================================================================================
> > compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
> >
> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> > | | | | | | | | |
> > :999 0% 3:999 0% :999 1% 8:998 20% 200:999 dmesg.invalid_opcode:#[##]
> > :999 0% :999 0% :999 0% :998 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
> > :999 0% :999 0% :999 0% 3:998 4% 42:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > :999 0% 3:999 0% :999 0% 4:998 10% 102:999 dmesg.kernel_BUG_at_lib/list_debug.c
> > :999 0% :999 0% :999 0% :998 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> > :999 0% :999 0% :999 0% 1:998 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
> >
> >>
> >> >
> >> >
>
>

2023-01-11 09:59:19

by Hyeonggon Yoo

[permalink] [raw]
Subject: BUG: unable to handle page fault for address: f6ffe000

On Sat, Dec 31, 2022 at 11:26:25PM +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):
>
> commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> [test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]
>
> in testcase: rcutorture
> version:
> with following parameters:
>
> runtime: 300s
> test: default
> torture_type: tasks-tracing
>
> test-description: rcutorture is rcutorture kernel module load/unload test.
> test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-lkp/[email protected]
>
>
> [ 25.804432][ T214] ------------[ cut here ]------------
> [ 25.804917][ T214] kernel BUG at include/linux/mm.h:825!
> [ 25.805402][ T214] invalid opcode: 0000 [#1] SMP
> [ 25.805820][ T214] CPU: 0 PID: 214 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #2 1c4d7707ec0ce574ed62a77e82a8580202758048
> [ 25.806944][ T214] EIP: __dump_page.cold (include/linux/mm.h:825 mm/debug.c:97)
> [ 25.807376][ T214] Code: ff ff 83 05 e8 5d bb c5 01 ba 4c c4 2f c4 89 f8 83 15 ec 5d bb c5 00 e8 f2 92 ed fd 83 05 f8 5d bb c5 01 83 15 fc 5d bb c5 00 <0f> 0b 83 05 00 5e bb c5 01 b8 ac 85 a3 c4 83 15 04 5e bb c5 00 e8


FYI, I found a panic from trying to read non-present page,
on the same commit.

I'm not sure if this BUG is related to what we're looking at.

[ 26.970492][ T180] systemd-journald[180]: Received request to flush runtime journal from PID 1
[ 28.117157][ T187] BUG: unable to handle page fault for address: f6ffe000
[ 28.117686][ T187] #PF: supervisor read access in kernel mode
[ 28.117686][ T187] #PF: error_code(0x0000) - not-present page
[ 28.117686][ T187] *pde = 065b9067 *pte = 00000000
[ 28.117686][ T187] Oops: 0000 [#1] SMP
[ 28.117686][ T187] CPU: 0 PID: 187 Comm: udevadm Tainted: G S 6.1.0-rc2-00014-g0af8489b0216 #13 6ab653106aca2d29fecfea6da29eac9984fd4186
[ 28.117686][ T187] EIP: kernfs_name_hash+0x60/0xe0
[ 28.117686][ T187] Code: c6 01 83 d7 00 31 d2 01 c6 89 c8 89 75 ec 11 d7 8b 35 f0 9e c1 c5 89 7d f0 8b 3d f4 9e c1 c5 29 f0 8d b4 26 00 00 00 00 66 90 <0f> be 0c 30 89 ca c1 e2 04 c1 e9 04 01 d1 01 cb 8b 4d f0 83 c6 01
[ 28.117686][ T187] EAX: f0eadd57 EBX: ec9ced85 ECX: 05c29e82 EDX: 05c29e82
[ 28.117686][ T187] ESI: 061502a9 EDI: 00000000 EBP: f5fe1c88 ESP: f5fe1c70
[ 28.117686][ T187] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010206
[ 28.117686][ T187] CR0: 80050033 CR2: f6ffe000 CR3: 35f29000 CR4: 000406d0
[ 28.117686][ T187] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 28.117686][ T187] DR6: fffe0ff0 DR7: 00000400
[ 28.117686][ T187] Call Trace:
[ 28.117686][ T187] kernfs_find_ns+0xb3/0x200
[ 28.117686][ T187] ? kernfs_iop_lookup+0x3c/0x1a0
[ 28.117686][ T187] kernfs_iop_lookup+0x76/0x1a0
[ 28.117686][ T187] lookup_open+0x308/0x970
[ 28.117686][ T187] open_last_lookups+0xa4/0x680
[ 28.117686][ T187] path_openat+0xa9/0x410
[ 28.117686][ T187] do_filp_open+0x99/0x180
[ 28.117686][ T187] ? lock_release+0xe8/0x350
[ 28.117686][ T187] ? _raw_spin_unlock+0x50/0x90
[ 28.117686][ T187] ? alloc_fd+0x252/0x3e0
[ 28.117686][ T187] do_sys_openat2+0xf3/0x2e0
[ 28.117686][ T187] ? find_held_lock+0x37/0x110
[ 28.117686][ T187] do_sys_open+0x8e/0xe0
[ 28.117686][ T187] __ia32_sys_open+0x2a/0x40
[ 28.117686][ T187] __do_fast_syscall_32+0x72/0xd0
[ 28.117686][ T187] ? lockdep_hardirqs_on_prepare+0x232/0x3f0
[ 28.117686][ T187] ? syscall_exit_to_user_mode+0x35/0x50
[ 28.117686][ T187] ? __do_fast_syscall_32+0x7c/0xd0
[ 28.117686][ T187] ? lockdep_hardirqs_on_prepare+0x232/0x3f0
[ 28.117686][ T187] ? irqentry_exit_to_user_mode+0x23/0x30
[ 28.117686][ T187] ? irqentry_exit+0x7d/0xc0
[ 28.117686][ T187] do_fast_syscall_32+0x32/0x70
[ 28.117686][ T187] do_SYSENTER_32+0x15/0x20
[ 28.117686][ T187] entry_SYSENTER_32+0xa2/0xfb
[ 28.117686][ T187] EIP: 0xb7f26549
[ 28.117686][ T187] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
[ 28.117686][ T187] EAX: ffffffda EBX: bf83a680 ECX: 00088000 EDX: 000001b6
[ 28.117686][ T187] ESI: 00ab30b0 EDI: 00000008 EBP: b7e31000 ESP: bf83a4c8
[ 28.117686][ T187] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[ 28.117686][ T187] Modules linked in:
[ 28.117686][ T187] CR2: 00000000f6ffe000
[ 28.117686][ T187] ---[ end trace 0000000000000000 ]---
[ 28.117686][ T187] EIP: kernfs_name_hash+0x60/0xe0
[ 28.117686][ T187] Code: c6 01 83 d7 00 31 d2 01 c6 89 c8 89 75 ec 11 d7 8b 35 f0 9e c1 c5 89 7d f0 8b 3d f4 9e c1 c5 29 f0 8d b4 26 00 00 00 00 66 90 <0f> be 0c 30 89 ca c1 e2 04 c1 e9 04 01 d1 01 cb 8b 4d f0 83 c6 01

2023-01-11 11:07:56

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On 1/11/23 03:26, Feng Tang wrote:
> On Tue, Jan 10, 2023 at 10:09:36PM +0800, Vlastimil Babka wrote:
>> On 1/10/23 14:53, Oliver Sang wrote:
>> > hi all,
>> >
>> > On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
>> >>
>> >> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
>> >>
>> >> > And if the rate at the parent (has it increased thanks to the
>> >> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
>> >>
>> >> got it. Thanks for suggestion!
>> >>
>> >> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
>> >> with same config upon v6.1-rc2 to see if it's really clean there.
>> >> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
>> >>
>> >> will keep you updated. Thanks
>> >
>> > by more tests, we cannot make sure the v6.1-rc2 is clean, so we also checked
>> > v6.1-rc1 and v6.0. from results, we have low confidence that we can make a
>> > successful bisection based on them [1][2]. could you suggest?
>>
>> So am I reading it right, that the probleam appears to be introduced between
>> v6.0 (0 failures) and v6.1-rc1 (>0 failures)? But agree that with such low
>> incidence, it's hard to bisect.
>
> Yes, it's hard for bisection, as it happens at a higher reproduce rate
> on several commits, hope it could be bisected soon.
>
> Before Oliver found the parent commit also trigger the problem, I did
> some debug, and I think related factors could be compound_page, folio,
> slub and i386(struct page size and layout), which cause some compound
> page structures corrupted.
>
> * I tried change the slub_max_order to 0 for SLUB_TINY (much less
> compound-page operations), and no issue is found for 350 runs.
>
> * Disable order>=1 in pcp list, the issue's reproduce rate keeps
> similar, which means the pcp operation is not related.
>
> * Tried add some debug info in unused members of struct page of tail
> pages (not first tail page), or add one more member at the end of
> 'struct page', can't reproduce it.
>
> From Oliver's reports, there are several kinds of errors: list
> operation failure (list_head structure corruption), compound page
> check failure (PageTail warning) etc, which seem to indicate the
> struct pages are randomly corrupted. In dump_page(), I tired to dump
> more pages befor and after 'raw' and 'head' pages
>
> [ 18.265162][ T186] page:43caddc9 refcount:1 mapcount:0 mapping:00000000 index:0xedece900 pfn:0x2dece
> [ 18.266205][ T186] head:5e92e8e2 order:1 compound_mapcount:0 compound_pincount:0
> [ 18.266928][ T186] flags: 0x10200(slab|head|zone=0)
> [ 18.267339][ T186] raw-2: 00010200 c0100b70 c0100b70 c01a0680 edecd000 00020001 ffffffff 00000001
> [ 18.268055][ T186] raw-2: 00000000 00000000
> [ 18.268419][ T186] raw-1: 00000000 e7652fe1 00000101 ffffffff 00000000 00000000 ffffffff 00000000
> [ 18.269124][ T186] raw-1: 00000000 00000000
> [ 18.269488][ T186] raw: 00010200 e7652fe1 c0100bf0 00000400 edece900 000a0003 ffffffff 00000001
> [ 18.270187][ T186] raw: edce5f81 00000000

Yeah this just doesn't make sense at all, it looks like prep_compound_tail()
was called on two pages instead of one when creating the raw-2 order-1
compound page, corrupting the folowing slab order-1 page's head.
Even if we assumed the compound_head "e7652fe1" was wrongly copied or
something, we have mapping == 00000400 which his very specifically done by
prep_compound_tail() "p->mapping = TAIL_MAPPING;"
In the raw-1 (first tail page) it's then rewritten to ffffffff by
prep_compound_head()'s atomic_set(compound_mapcount_ptr(page), -1)". but in
'raw' it remains 00000400, so this means the operation
prep_compound_tail("raw-2", 2); had to happen, but not
prep_compound_tail("raw-2", 3); as we don't have the same kind of corruption
in raw1 below.
Hey maybe the CPU speculated wrongly on the number of iterations in
prep_compound_page() and accidently written the result of one extra
prep_compound_tail() to memory...

> [ 18.270534][ T186] raw1: 00000000 e7653031 00000101 ffffffff 00000000 00000000 ffffffff 00000000
> [ 18.271257][ T186] raw1: 00000000 00000000
> [ 18.271617][ T186] raw2: 00010200 00000100 00000122 c01a0780 00000000 000a000a ffffffff 00000001
> [ 18.272330][ T186] raw2: edd5bf01 00000000
> [ 18.272690][ T186] raw3: 00000000 e7653081 00000101 ffffffff 00000000 00000000 ffffffff 00000000
> [ 18.273424][ T186] raw3: 00000000 00000000
>
> [ 18.273793][ T186] head: 00010200 c0100b70 c0100b70 c01a0680 edecd000 00020001 ffffffff 00000001
> [ 18.274534][ T186] head: 00000000 00000000
> [ 18.274909][ T186] head1: 00000000 e7652fe1 00000101 ffffffff 00000000 00000000 ffffffff 00000000
> [ 18.275633][ T186] head1: 00000000 00000000
> [ 18.275996][ T186] head2: 00010200 e7652fe1 c0100bf0 00000400 edece900 000a0003 ffffffff 00000001
> [ 18.276725][ T186] head2: edce5f81 00000000
> [ 18.277093][ T186] head3: 00000000 e7653031 00000101 ffffffff 00000000 00000000 ffffffff 00000000
> [ 18.277820][ T186] head3: 00000000 00000000
> [ 18.278182][ T186] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
> [ 18.278747][ T186] page_owner tracks the page as allocated
> [ 18.279447][ T186] page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 186, tgid 186 (udevadm), ts 18262675212, free_ts 18262044539
> ...
>
> The page 'head' is actually 'raw' - 2, while page 'raw' seems
> corrupted as it has PG_head bit set, while PageTail() is also true.
>
> Another problem is 'raw-1' and 'raw' both points to 'raw-2', and
> if it is an order-2 compound page, the 'raw+1' should also points
> to 'raw-2' as 3 tail pages point to one head page. Or there is some
> switch(corrupted) between order-1 and order-2, as Vlastimil
> supsected in earlier reply?
>
> Thanks,
> Feng
>
>> > a further information not sure if it's helpful, [1][2] are both i386 based.
>> > we also tried to run boot tests on x86_64 upon commit 0af8489b02, whatever
>> > with or without CONFIG_DEBUG_PAGEALLOC/CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT,
>> > we never obseve similar issues (also run 999 times).
>>
>> Yeah it looks very much like something that manifests only on i386 (perhaps
>> only in QEMU as well?) and never x86_64.
>>
>> What might be interesting then is v6.1-rc1 with further modified config to
>> enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON. Maybe it will catch the
>> culprit sooner. Or maybe it will obscure the bug instead, unfortunately.
>>
>> Thanks for all your effort!
>>
>> > [1]
>> > boot results:
>> > =========================================================================================
>> > compiler/kconfig/rootfs/sleep/tbox_group/testcase:
>> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
>> >
>> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
>> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
>> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
>> > | | | | | | | | |
>> > :999 0% 2:999 0% 1:999 1% 11:999 21% 208:999 dmesg.invalid_opcode:#[##]
>> > :999 0% :999 0% :999 0% 2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
>> > :999 0% 1:999 0% :999 0% 4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
>> > :999 0% 1:999 0% 1:999 0% 4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
>> > :999 0% :999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
>> > :999 0% :999 0% :999 0% 1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
>> >
>> > [2]
>> > rcutorture results:
>> > =========================================================================================
>> > compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
>> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
>> >
>> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
>> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
>> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
>> > | | | | | | | | |
>> > :999 0% 3:999 0% :999 1% 8:998 20% 200:999 dmesg.invalid_opcode:#[##]
>> > :999 0% :999 0% :999 0% :998 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
>> > :999 0% :999 0% :999 0% 3:998 4% 42:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
>> > :999 0% 3:999 0% :999 0% 4:998 10% 102:999 dmesg.kernel_BUG_at_lib/list_debug.c
>> > :999 0% :999 0% :999 0% :998 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
>> > :999 0% :999 0% :999 0% 1:998 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
>> >
>> >>
>> >> >
>> >> >
>>
>>

2023-01-12 07:58:11

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

hi, Vlastimil,

On Tue, Jan 10, 2023 at 03:09:36PM +0100, Vlastimil Babka wrote:
> On 1/10/23 14:53, Oliver Sang wrote:
> > hi all,
> >
> > On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
> >>
> >> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
> >>
> >> > And if the rate at the parent (has it increased thanks to the
> >> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
> >>
> >> got it. Thanks for suggestion!
> >>
> >> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
> >> with same config upon v6.1-rc2 to see if it's really clean there.
> >> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
> >>
> >> will keep you updated. Thanks
> >
> > by more tests, we cannot make sure the v6.1-rc2 is clean, so we also checked
> > v6.1-rc1 and v6.0. from results, we have low confidence that we can make a
> > successful bisection based on them [1][2]. could you suggest?
>
> So am I reading it right, that the probleam appears to be introduced between
> v6.0 (0 failures) and v6.1-rc1 (>0 failures)? But agree that with such low
> incidence, it's hard to bisect.

yeah, you are reading it right :)

>
> > a further information not sure if it's helpful, [1][2] are both i386 based.
> > we also tried to run boot tests on x86_64 upon commit 0af8489b02, whatever
> > with or without CONFIG_DEBUG_PAGEALLOC/CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT,
> > we never obseve similar issues (also run 999 times).
>
> Yeah it looks very much like something that manifests only on i386 (perhaps
> only in QEMU as well?) and never x86_64.
>
> What might be interesting then is v6.1-rc1 with further modified config to
> enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON. Maybe it will catch the
> culprit sooner. Or maybe it will obscure the bug instead, unfortunately.

oh, seems, unfortunalately, 'obscure' happen :(

we enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON, along with
CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT

boot (we also add the test for v6.2-rc3):
=========================================================================================
compiler/kconfig/rootfs/sleep/tbox_group/testcase:
gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT+CONFIG_SLUB_DEBUG_ON/debian-11.1-i386-20220923.cgz/1/vm-snb/boot

v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8 v6.2-rc3
---------------- --------------------------- --------------------------- --------------------------- --------------------------- ---------------------------
fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
| | | | | | | | | | |
43:999 3% 68:999 4% 84:999 6% 99:999 5% 94:999 4% 86:999 dmesg.invalid_opcode:#[##]
4:999 -0% 2:999 0% 7:999 0% 8:999 0% 4:999 -0% :999 dmesg.kernel_BUG_at_include/linux/mm.h
3:999 0% 4:999 0% 3:999 0% 7:999 0% 5:999 1% 9:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
34:999 3% 61:999 4% 73:999 5% 81:999 5% 85:999 4% 74:999 dmesg.kernel_BUG_at_lib/list_debug.c
:999 0% :999 0% :999 0% 1:999 0% :999 0% :999 dmesg.kernel_BUG_at_mm/internal.h
3:999 -0% 1:999 -0% :999 -0% 2:999 -0% :999 -0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
:999 0% :999 0% 2:999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/usercopy.c


since now even the v6.0 is not clean, attached one dmesg FYI


below is from rcutorture:
=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT+CONFIG_SLUB_DEBUG_ON/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing

v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8 v6.2-rc3
---------------- --------------------------- --------------------------- --------------------------- --------------------------- ---------------------------
fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
| | | | | | | | | | |
47:999 3% 72:999 4% 91:999 4% 88:999 3% 76:999 4% 84:999 dmesg.invalid_opcode:#[##]
4:999 0% 8:999 1% 10:999 0% 5:999 0% 4:999 -0% :999 dmesg.kernel_BUG_at_include/linux/mm.h
3:999 -0% 2:999 0% 5:999 0% 5:999 -0% 2:999 1% 8:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
38:999 2% 61:999 4% 75:999 4% 78:999 3% 68:999 4% 73:999 dmesg.kernel_BUG_at_lib/list_debug.c
1:999 0% 1:999 0% 1:999 -0% :999 0% 2:999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
1:999 -0% :999 -0% :999 -0% :999 -0% :999 0% 1:999 dmesg.kernel_BUG_at_mm/usercopy.c
>
> Thanks for all your effort!
>
> > [1]
> > boot results:
> > =========================================================================================
> > compiler/kconfig/rootfs/sleep/tbox_group/testcase:
> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
> >
> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> > | | | | | | | | |
> > :999 0% 2:999 0% 1:999 1% 11:999 21% 208:999 dmesg.invalid_opcode:#[##]
> > :999 0% :999 0% :999 0% 2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
> > :999 0% 1:999 0% :999 0% 4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > :999 0% 1:999 0% 1:999 0% 4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
> > :999 0% :999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> > :999 0% :999 0% :999 0% 1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
> >
> > [2]
> > rcutorture results:
> > =========================================================================================
> > compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
> >
> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> > | | | | | | | | |
> > :999 0% 3:999 0% :999 1% 8:998 20% 200:999 dmesg.invalid_opcode:#[##]
> > :999 0% :999 0% :999 0% :998 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
> > :999 0% :999 0% :999 0% 3:998 4% 42:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> > :999 0% 3:999 0% :999 0% 4:998 10% 102:999 dmesg.kernel_BUG_at_lib/list_debug.c
> > :999 0% :999 0% :999 0% :998 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> > :999 0% :999 0% :999 0% 1:998 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
> >
> >>
> >> >
> >> >
>
>


Attachments:
(No filename) (10.40 kB)
dmesg.xz (56.02 kB)
Download all attachments

2023-01-12 08:19:39

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On 1/12/23 08:47, Oliver Sang wrote:
> hi, Vlastimil,
>
> On Tue, Jan 10, 2023 at 03:09:36PM +0100, Vlastimil Babka wrote:
>> On 1/10/23 14:53, Oliver Sang wrote:
>> > hi all,
>> >
>> > On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
>> >>
>> >> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
>> >>
>> >> > And if the rate at the parent (has it increased thanks to the
>> >> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
>> >>
>> >> got it. Thanks for suggestion!
>> >>
>> >> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
>> >> with same config upon v6.1-rc2 to see if it's really clean there.
>> >> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
>> >>
>> >> will keep you updated. Thanks
>> >
>> > by more tests, we cannot make sure the v6.1-rc2 is clean, so we also checked
>> > v6.1-rc1 and v6.0. from results, we have low confidence that we can make a
>> > successful bisection based on them [1][2]. could you suggest?
>>
>> So am I reading it right, that the probleam appears to be introduced between
>> v6.0 (0 failures) and v6.1-rc1 (>0 failures)? But agree that with such low
>> incidence, it's hard to bisect.
>
> yeah, you are reading it right :)
>
>>
>> > a further information not sure if it's helpful, [1][2] are both i386 based.
>> > we also tried to run boot tests on x86_64 upon commit 0af8489b02, whatever
>> > with or without CONFIG_DEBUG_PAGEALLOC/CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT,
>> > we never obseve similar issues (also run 999 times).
>>
>> Yeah it looks very much like something that manifests only on i386 (perhaps
>> only in QEMU as well?) and never x86_64.
>>
>> What might be interesting then is v6.1-rc1 with further modified config to
>> enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON. Maybe it will catch the
>> culprit sooner. Or maybe it will obscure the bug instead, unfortunately.
>
> oh, seems, unfortunalately, 'obscure' happen :(

Actually no, by "obscure" means with CONFIG_SLUB_DEBUG it wouldn't happen
anymore. But this is the opposite, it seems to happen a lot. I would have
preferred that slub debugging catches some slab misuse, but this seems
useful too. With such fail rates you can perhaps try ealier kernels than 6.0
and eventually find the truly clean and first bad release and bisect?

> we enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON, along with
> CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
>
> boot (we also add the test for v6.2-rc3):
> =========================================================================================
> compiler/kconfig/rootfs/sleep/tbox_group/testcase:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT+CONFIG_SLUB_DEBUG_ON/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
>
> v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8 v6.2-rc3
> ---------------- --------------------------- --------------------------- --------------------------- --------------------------- ---------------------------
> fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> | | | | | | | | | | |
> 43:999 3% 68:999 4% 84:999 6% 99:999 5% 94:999 4% 86:999 dmesg.invalid_opcode:#[##]
> 4:999 -0% 2:999 0% 7:999 0% 8:999 0% 4:999 -0% :999 dmesg.kernel_BUG_at_include/linux/mm.h
> 3:999 0% 4:999 0% 3:999 0% 7:999 0% 5:999 1% 9:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> 34:999 3% 61:999 4% 73:999 5% 81:999 5% 85:999 4% 74:999 dmesg.kernel_BUG_at_lib/list_debug.c
> :999 0% :999 0% :999 0% 1:999 0% :999 0% :999 dmesg.kernel_BUG_at_mm/internal.h
> 3:999 -0% 1:999 -0% :999 -0% 2:999 -0% :999 -0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> :999 0% :999 0% 2:999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/usercopy.c
>
>
> since now even the v6.0 is not clean, attached one dmesg FYI
>
>
> below is from rcutorture:
> =========================================================================================
> compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT+CONFIG_SLUB_DEBUG_ON/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
>
> v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8 v6.2-rc3
> ---------------- --------------------------- --------------------------- --------------------------- --------------------------- ---------------------------
> fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> | | | | | | | | | | |
> 47:999 3% 72:999 4% 91:999 4% 88:999 3% 76:999 4% 84:999 dmesg.invalid_opcode:#[##]
> 4:999 0% 8:999 1% 10:999 0% 5:999 0% 4:999 -0% :999 dmesg.kernel_BUG_at_include/linux/mm.h
> 3:999 -0% 2:999 0% 5:999 0% 5:999 -0% 2:999 1% 8:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> 38:999 2% 61:999 4% 75:999 4% 78:999 3% 68:999 4% 73:999 dmesg.kernel_BUG_at_lib/list_debug.c
> 1:999 0% 1:999 0% 1:999 -0% :999 0% 2:999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> 1:999 -0% :999 -0% :999 -0% :999 -0% :999 0% 1:999 dmesg.kernel_BUG_at_mm/usercopy.c
>>
>> Thanks for all your effort!
>>
>> > [1]
>> > boot results:
>> > =========================================================================================
>> > compiler/kconfig/rootfs/sleep/tbox_group/testcase:
>> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
>> >
>> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
>> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
>> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
>> > | | | | | | | | |
>> > :999 0% 2:999 0% 1:999 1% 11:999 21% 208:999 dmesg.invalid_opcode:#[##]
>> > :999 0% :999 0% :999 0% 2:999 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
>> > :999 0% 1:999 0% :999 0% 4:999 4% 40:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
>> > :999 0% 1:999 0% 1:999 0% 4:999 11% 111:999 dmesg.kernel_BUG_at_lib/list_debug.c
>> > :999 0% :999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
>> > :999 0% :999 0% :999 0% 1:999 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
>> >
>> > [2]
>> > rcutorture results:
>> > =========================================================================================
>> > compiler/kconfig/rootfs/runtime/tbox_group/test/testcase/torture_type:
>> > gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT/debian-11.1-i386-20220923.cgz/300s/vm-snb/default/rcutorture/tasks-tracing
>> >
>> > v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8
>> > ---------------- --------------------------- --------------------------- --------------------------- ---------------------------
>> > fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
>> > | | | | | | | | |
>> > :999 0% 3:999 0% :999 1% 8:998 20% 200:999 dmesg.invalid_opcode:#[##]
>> > :999 0% :999 0% :999 0% :998 5% 51:999 dmesg.kernel_BUG_at_include/linux/mm.h
>> > :999 0% :999 0% :999 0% 3:998 4% 42:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
>> > :999 0% 3:999 0% :999 0% 4:998 10% 102:999 dmesg.kernel_BUG_at_lib/list_debug.c
>> > :999 0% :999 0% :999 0% :998 0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
>> > :999 0% :999 0% :999 0% 1:998 0% 3:999 dmesg.kernel_BUG_at_mm/usercopy.c
>> >
>> >>
>> >> >
>> >> >
>>
>>

2023-01-12 09:43:57

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

On 1/12/23 08:47, Oliver Sang wrote:
> hi, Vlastimil,
>
> On Tue, Jan 10, 2023 at 03:09:36PM +0100, Vlastimil Babka wrote:
>> On 1/10/23 14:53, Oliver Sang wrote:
>> > hi all,
>> >
>> > On Mon, Jan 09, 2023 at 10:01:15PM +0800, Oliver Sang wrote:
>> >>
>> >> On Fri, Jan 06, 2023 at 11:13:15AM +0100, Vlastimil Babka wrote:
>> >>
>> >> > And if the rate at the parent (has it increased thanks to the
>> >> > DEBUG_PAGEALLOC?) is sufficient to bisect to the truly first bad commit. Thanks!
>> >>
>> >> got it. Thanks for suggestion!
>> >>
>> >> since 0af8489b02 is based on v6.1-rc2, we will test (both rectorture and boot)
>> >> with same config upon v6.1-rc2 to see if it's really clean there.
>> >> if so we will use dmesg.invalid_opcode:#[##] to trigger new bisect.
>> >>
>> >> will keep you updated. Thanks
>> >
>> > by more tests, we cannot make sure the v6.1-rc2 is clean, so we also checked
>> > v6.1-rc1 and v6.0. from results, we have low confidence that we can make a
>> > successful bisection based on them [1][2]. could you suggest?
>>
>> So am I reading it right, that the probleam appears to be introduced between
>> v6.0 (0 failures) and v6.1-rc1 (>0 failures)? But agree that with such low
>> incidence, it's hard to bisect.
>
> yeah, you are reading it right :)
>
>>
>> > a further information not sure if it's helpful, [1][2] are both i386 based.
>> > we also tried to run boot tests on x86_64 upon commit 0af8489b02, whatever
>> > with or without CONFIG_DEBUG_PAGEALLOC/CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT,
>> > we never obseve similar issues (also run 999 times).
>>
>> Yeah it looks very much like something that manifests only on i386 (perhaps
>> only in QEMU as well?) and never x86_64.
>>
>> What might be interesting then is v6.1-rc1 with further modified config to
>> enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON. Maybe it will catch the
>> culprit sooner. Or maybe it will obscure the bug instead, unfortunately.
>
> oh, seems, unfortunalately, 'obscure' happen :(
>
> we enabled CONFIG_SLUB_DEBUG and CONFIG_SLUB_DEBUG_ON, along with
> CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
>
> boot (we also add the test for v6.2-rc3):
> =========================================================================================
> compiler/kconfig/rootfs/sleep/tbox_group/testcase:
> gcc-11/i386-randconfig-a012-20221226+CONFIG_DEBUG_PAGEALLOC+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT+CONFIG_SLUB_DEBUG_ON/debian-11.1-i386-20220923.cgz/1/vm-snb/boot
>
> v6.0 v6.1-rc1 v6.1-rc2 56d5a2b9ba85a390473e86b4fe4 0af8489b0216fa1dd83e264bef8 v6.2-rc3
> ---------------- --------------------------- --------------------------- --------------------------- --------------------------- ---------------------------
> fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs %reproduction fail:runs
> | | | | | | | | | | |
> 43:999 3% 68:999 4% 84:999 6% 99:999 5% 94:999 4% 86:999 dmesg.invalid_opcode:#[##]
> 4:999 -0% 2:999 0% 7:999 0% 8:999 0% 4:999 -0% :999 dmesg.kernel_BUG_at_include/linux/mm.h
> 3:999 0% 4:999 0% 3:999 0% 7:999 0% 5:999 1% 9:999 dmesg.kernel_BUG_at_include/linux/page-flags.h
> 34:999 3% 61:999 4% 73:999 5% 81:999 5% 85:999 4% 74:999 dmesg.kernel_BUG_at_lib/list_debug.c
> :999 0% :999 0% :999 0% 1:999 0% :999 0% :999 dmesg.kernel_BUG_at_mm/internal.h
> 3:999 -0% 1:999 -0% :999 -0% 2:999 -0% :999 -0% 2:999 dmesg.kernel_BUG_at_mm/page_alloc.c
> :999 0% :999 0% 2:999 0% :999 0% :999 0% 2:999 dmesg.kernel_BUG_at_mm/usercopy.c
>
>
> since now even the v6.0 is not clean, attached one dmesg FYI

Interesting one, btw.

> [ 38.392164][ T193] page:e6c3e400 refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x9e80
> [ 38.393441][ T193] head:e6c3e2c0 order:3 compound_mapcount:0 compound_pincount:0

Yay, struct page pointers no longer hashed for printing, thanks to slub_debug.
Head page is order-3 (8 pages), yet e6c3e400 would be 9th page, nonsense.

> [ 38.394491][ T193] flags: 0x10200(slab|head|zone=0)
> [ 38.395181][ T193] raw: 00010000 e6c3e2c1 00000122 00000400 00000000 00000000 ffffffff 00000001

We have head flag, compound_head is e6c3e2c0+PageTail, there's again
00000400 in mapping that's specifically set by prep_compound_tail().

> [ 38.396386][ T193] raw: 00000000 00000000
> [ 38.396956][ T193] head: 00010200 e6ac7904 c01007f0 c0101a40 c9e79a00 00150004 ffffffff 00000001

00150004 are slub's counters suggesting it's been used for a while, we
should not be racing with e6c3e2c0's allocation here!

> [ 38.398161][ T193] head: 00000000 00000000
> [ 38.398775][ T193] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
> [ 38.399724][ T193] page_owner tracks the page as allocated
> [ 38.400525][ T193] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 193, tgid 193 (systemd-udevd), ts 38392157598, free_ts 0

This is a dump of e6c3e400, not the "head".

> [ 38.403183][ T193] post_alloc_hook+0x29a/0x320
> [ 38.403891][ T193] get_page_from_freelist+0x226/0x310
> [ 38.404644][ T193] __alloc_pages+0xdd/0x360
> [ 38.405256][ T193] alloc_slab_page+0x12d/0x200
> [ 38.405936][ T193] allocate_slab+0x6a/0x670
> [ 38.406586][ T193] new_slab+0x48/0xc0
> [ 38.407127][ T193] ___slab_alloc+0x9ba/0x1610
> [ 38.407942][ T193] __slab_alloc+0x41/0x70
> [ 38.408776][ T193] __kmalloc_track_caller+0x57b/0x710
> [ 38.409538][ T193] kmalloc_reserve+0x3f/0x100
> [ 38.410339][ T193] __alloc_skb+0xad/0x310
> [ 38.410925][ T193] netlink_sendmsg+0x384/0x6f0
> [ 38.411602][ T193] sock_sendmsg+0x79/0xe0
> [ 38.420425][ T193] ____sys_sendmsg+0x1ef/0x390
> [ 38.421069][ T193] ___sys_sendmsg+0xb0/0x120
> [ 38.421726][ T193] __sys_sendmsg+0x84/0x100
> [ 38.422372][ T193] page_owner free stack trace missing
> [ 38.423113][ T193] ------------[ cut here ]------------
> [ 38.423899][ T193] kernel BUG at include/linux/page-flags.h:319!
> [ 38.424778][ T193] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 38.425610][ T193] CPU: 1 PID: 193 Comm: systemd-udevd Tainted: G S 6.0.0 #1 b46e496fab9f0be3d12299e0bf71444aabf1c433
> [ 38.427187][ T193] EIP: folio_flags+0x31/0x70
> [ 38.427951][ T193] Code: 48 83 05 e8 89 bd c5 01 ba a4 c0 2b c4 55 89 e5 83 15 ec 89 bd c5 00 e8 8d 3f f6 ff 83 05 f8 89 bd c5 01 83 15 fc 89 bd c5 00 <0f> 0b 83 05 00 8a bd c5 01 b8 38 1e a6 c4 83 15 04 8a bd c5 00 e8
> [ 38.430464][ T193] EAX: 00000000 EBX: e6c3e400 ECX: 00000000 EDX: ee7ead00
> [ 38.431382][ T193] ESI: 00000000 EDI: 000c2cc0 EBP: c995fb30 ESP: c995fb30
> [ 38.432350][ T193] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> [ 38.433348][ T193] CR0: 80050033 CR2: b5c4f018 CR3: 09cab000 CR4: 000006d0
> [ 38.434288][ T193] Call Trace:
> [ 38.434737][ T193] alloc_slab_page+0x7f/0x200

Aha, we have crashed while allocating a fresh new slab page, probably at
__folio_set_slab(folio), the page doesn't have het PageSlab() flag set.

So we have just took the page from a freelist (that modifies lru.next
overlapping compound_head, we can still see the poison in lru.prev),
verified that it's page->mapping is 0 (in page_expected_state()). Few
moments later there's a compound_head set and mapping set in a way that only
prep_compound_tail() does.
But the head that would be connected seems to have done its
prep_compound_tail()'s a while ago (nevermind that it would have to be an
off-by-one). So that just makes no sense.

> [ 38.435361][ T193] allocate_slab+0x6a/0x670
> [ 38.436007][ T193] new_slab+0x48/0xc0
> [ 38.436536][ T193] ___slab_alloc+0x9ba/0x1610
> [ 38.437293][ T193] ? __might_sleep+0x5a/0x100
> [ 38.437912][ T193] ? __alloc_skb+0xad/0x310
> [ 38.438513][ T193] ? lock_release+0xa1/0x190
> [ 38.439127][ T193] __slab_alloc+0x41/0x70
> [ 38.439845][ T193] __kmalloc_track_caller+0x57b/0x710
> [ 38.440557][ T193] ? kmem_cache_alloc+0x2d6/0x690
> [ 38.441217][ T193] ? __alloc_skb+0xad/0x310
> [ 38.441821][ T193] kmalloc_reserve+0x3f/0x100
> [ 38.442569][ T193] ? security_capable+0x40/0xd0
> [ 38.443214][ T193] __alloc_skb+0xad/0x310
> [ 38.443801][ T193] ? ns_capable_common+0x41/0x80
> [ 38.444452][ T193] netlink_sendmsg+0x384/0x6f0
> [ 38.445089][ T193] ? iovec_from_user+0xea/0x1b0
> [ 38.445737][ T193] sock_sendmsg+0x79/0xe0
> [ 38.446313][ T193] ____sys_sendmsg+0x1ef/0x390
> [ 38.446955][ T193] ___sys_sendmsg+0xb0/0x120
> [ 38.447586][ T193] __sys_sendmsg+0x84/0x100
> [ 38.448225][ T193] __do_sys_socketcall+0x20d/0x5e0
> [ 38.448908][ T193] __ia32_sys_socketcall+0x1b/0x30
> [ 38.449593][ T193] __do_fast_syscall_32+0x72/0xd0
> [ 38.450264][ T193] ? syscall_exit_to_user_mode+0x5f/0x90
> [ 38.451001][ T193] ? __do_fast_syscall_32+0x7c/0xd0
> [ 38.451702][ T193] ? __do_fast_syscall_32+0x7c/0xd0
> [ 38.452388][ T193] ? lockdep_hardirqs_on_prepare+0x242/0x400
> [ 38.453184][ T193] ? syscall_exit_to_user_mode+0x5f/0x90
> [ 38.453922][ T193] ? __do_fast_syscall_32+0x7c/0xd0
> [ 38.454609][ T193] ? lockdep_hardirqs_on_prepare+0x242/0x400
> [ 38.455390][ T193] ? syscall_exit_to_user_mode+0x5f/0x90
> [ 38.460224][ T193] ? __do_fast_syscall_32+0x7c/0xd0
> [ 38.460922][ T193] ? irqentry_exit_to_user_mode+0x23/0x30
> [ 38.461675][ T193] do_fast_syscall_32+0x32/0x70
> [ 38.462327][ T193] do_SYSENTER_32+0x15/0x20
> [ 38.462936][ T193] entry_SYSENTER_32+0xa2/0xfb
> [ 38.463567][ T193] EIP: 0xb7ed4549
> [ 38.464077][ T193] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
> [ 38.466578][ T193] EAX: ffffffda EBX: 00000010 ECX: bfc1aa40 EDX: 00000000
> [ 38.467499][ T193] ESI: b7e0f000 EDI: 00000000 EBP: 00534540 ESP: bfc1aa30
> [ 38.468437][ T193] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200282
> [ 38.469442][ T193] Modules linked in:
> [ 38.470183][ T193] ---[ end trace 0000000000000000 ]---
> [ 38.470910][ T193] EIP: folio_flags+0x31/0x70
> [ 38.471704][ T193] Code: 48 83 05 e8 89 bd c5 01 ba a4 c0 2b c4 55 89 e5 83 15 ec 89 bd c5 00 e8 8d 3f f6 ff 83 05 f8 89 bd c5 01 83 15 fc 89 bd c5 00 <0f> 0b 83 05 00 8a bd c5 01 b8 38 1e a6 c4 83 15 04 8a bd c5 00 e8
> [ 38.474274][ T193] EAX: 00000000 EBX: e6c3e400 ECX: 00000000 EDX: ee7ead00
> [ 38.475233][ T193] ESI: 00000000 EDI: 000c2cc0 EBP: c995fb30 ESP: c995fb30
> [ 38.476218][ T193] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> [ 38.477246][ T193] CR0: 80050033 CR2: b5c4f018 CR3: 09cab000 CR4: 000006d0
> [ 38.478206][ T193] Kernel panic - not syncing: Fatal exception
> [ 38.479153][ T193] Kernel Offset: disabled

2023-01-17 07:30:27

by Oliver Sang

[permalink] [raw]
Subject: Re: [linus:master] [mm, slub] 0af8489b02: kernel_BUG_at_include/linux/mm.h

Hi, Vlastimil,

On Thu, Jan 12, 2023 at 08:56:59AM +0100, Vlastimil Babka wrote:
>
> Actually no, by "obscure" means with CONFIG_SLUB_DEBUG it wouldn't happen
> anymore. But this is the opposite, it seems to happen a lot. I would have
> preferred that slub debugging catches some slab misuse, but this seems
> useful too. With such fail rates you can perhaps try ealier kernels than 6.0
> and eventually find the truly clean and first bad release and bisect?

Thanks a lot for guidance!

yeah, we reached back to until v5.14-rc1 which still has similar issue,
and v5.13 is clean. new bisection was triggered then we got '7118fc2906'

this was already reported as
"[linus:master] [hugetlb] 7118fc2906: kernel_BUG_at_lib/list_debug.c"
at https://lore.kernel.org/all/[email protected]/
and I add you, Hyeonggon, Feng and Fengwei there.

hope that would be helpful.