2023-12-19 15:47:17

by Oliver Sang

Subject: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c



Hello,

kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:

commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master aa4db8324c4d0e67aa4670356df4e9fae14b4d37]

in testcase: vm-scalability
version: vm-scalability-x86_64-1.0-0_20220518
with following parameters:

runtime: 300
thp_enabled: always
thp_defrag: always
nr_task: 32
nr_ssd: 1
priority: 1
test: swap-w-rand
cpufreq_governor: performance

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/


compiler: gcc-12
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


[ 61.404380][ T5947] ------------[ cut here ]------------
[ 61.409984][ T5947] kernel BUG at mm/memory.c:3990!
[ 61.415085][ T5947] invalid opcode: 0000 [#1] SMP NOPTI
[ 61.420506][ T5947] CPU: 32 PID: 5947 Comm: usemem Tainted: G S 6.7.0-rc4-00252-gbbcbf2a3f05f #1
[ 61.430881][ T5947] Hardware name: Intel Corporation M50CYP2SB1U/M50CYP2SB1U, BIOS SE5C620.86B.01.01.0003.2104260124 04/26/2021
[ 61.442761][ T5947] RIP: 0010:do_swap_page (mm/memory.c:3990 (discriminator 3))
[ 61.448112][ T5947] Code: 6f 28 31 d2 be 01 00 00 00 4c 89 ff e8 9b 43 03 00 49 c7 47 28 00 00 00 00 4c 89 f9 48 c7 44 24 08 00 00 00 00 e9 cf fb ff ff <0f> 0b 49 8b 45 08 f0 48 83 28 01 0f 85 3f fc ff ff 49 8b 45 08 4c
All code
========
0: 6f outsl %ds:(%rsi),(%dx)
1: 28 31 sub %dh,(%rcx)
3: d2 be 01 00 00 00 sarb %cl,0x1(%rsi)
9: 4c 89 ff mov %r15,%rdi
c: e8 9b 43 03 00 call 0x343ac
11: 49 c7 47 28 00 00 00 movq $0x0,0x28(%r15)
18: 00
19: 4c 89 f9 mov %r15,%rcx
1c: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
23: 00 00
25: e9 cf fb ff ff jmp 0xfffffffffffffbf9
2a:* 0f 0b ud2 <-- trapping instruction
2c: 49 8b 45 08 mov 0x8(%r13),%rax
30: f0 48 83 28 01 lock subq $0x1,(%rax)
35: 0f 85 3f fc ff ff jne 0xfffffffffffffc7a
3b: 49 8b 45 08 mov 0x8(%r13),%rax
3f: 4c rex.WR

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 49 8b 45 08 mov 0x8(%r13),%rax
6: f0 48 83 28 01 lock subq $0x1,(%rax)
b: 0f 85 3f fc ff ff jne 0xfffffffffffffc50
11: 49 8b 45 08 mov 0x8(%r13),%rax
15: 4c rex.WR
[ 61.468016][ T5947] RSP: 0000:ffa000000bb5fd98 EFLAGS: 00010206
[ 61.474169][ T5947] RAX: ff11000111a47c99 RBX: ffa000000bb5fe08 RCX: 0000002064ac7000
[ 61.482233][ T5947] RDX: 0057ffffc00a106d RSI: 0000000000000043 RDI: ffd400008192b1e8
[ 61.490296][ T5947] RBP: 000000000100c13b R08: 0000000000000000 R09: ffa000000bb5fe08
[ 61.498366][ T5947] R10: 0000000055555554 R11: ff1100018bebbd0c R12: ffd4000044128000
[ 61.506438][ T5947] R13: ff1100205d33d800 R14: ff11000130cd2da8 R15: ffd4000044128000
[ 61.514508][ T5947] FS: 00007f49c900c740(0000) GS:ff11002001000000(0000) knlGS:0000000000000000
[ 61.523534][ T5947] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 61.530225][ T5947] CR2: 00007f4966b3b6b8 CR3: 00000010af786004 CR4: 0000000000771ef0
[ 61.538307][ T5947] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 61.546387][ T5947] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 61.554471][ T5947] PKRU: 55555554
[ 61.558137][ T5947] Call Trace:
[ 61.561544][ T5947] <TASK>
[ 61.564599][ T5947] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
[ 61.568429][ T5947] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153)
[ 61.572692][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
[ 61.577475][ T5947] ? do_error_trap (arch/x86/include/asm/traps.h:59 arch/x86/kernel/traps.c:174)
[ 61.582172][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
[ 61.586966][ T5947] ? exc_invalid_op (arch/x86/kernel/traps.c:265)
[ 61.591743][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
[ 61.596515][ T5947] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 61.601638][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
[ 61.606412][ T5947] ? do_swap_page (mm/memory.c:3971)
[ 61.611179][ T5947] __handle_mm_fault (mm/memory.c:5274)
[ 61.616203][ T5947] handle_mm_fault (mm/memory.c:5439)
[ 61.621051][ T5947] do_user_addr_fault (arch/x86/mm/fault.c:1365)
[ 61.626151][ T5947] exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1561)
[ 61.630824][ T5947] asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
[ 61.635748][ T5947] RIP: 0033:0x5612d5878ad6
[ 61.640229][ T5947] Code: 01 00 00 00 e8 1b f9 ff ff 89 c7 e8 6c ff ff ff bf 00 00 00 00 e8 0a f9 ff ff 85 d2 74 08 48 8d 04 f7 48 8b 00 c3 48 8d 04 f7 <48> 89 30 b8 00 00 00 00 c3 41 54 55 53 48 85 ff 0f 84 21 01 00 00
All code
========
0: 01 00 add %eax,(%rax)
2: 00 00 add %al,(%rax)
4: e8 1b f9 ff ff call 0xfffffffffffff924
9: 89 c7 mov %eax,%edi
b: e8 6c ff ff ff call 0xffffffffffffff7c
10: bf 00 00 00 00 mov $0x0,%edi
15: e8 0a f9 ff ff call 0xfffffffffffff924
1a: 85 d2 test %edx,%edx
1c: 74 08 je 0x26
1e: 48 8d 04 f7 lea (%rdi,%rsi,8),%rax
22: 48 8b 00 mov (%rax),%rax
25: c3 ret
26: 48 8d 04 f7 lea (%rdi,%rsi,8),%rax
2a:* 48 89 30 mov %rsi,(%rax) <-- trapping instruction
2d: b8 00 00 00 00 mov $0x0,%eax
32: c3 ret
33: 41 54 push %r12
35: 55 push %rbp
36: 53 push %rbx
37: 48 85 ff test %rdi,%rdi
3a: 0f 84 21 01 00 00 je 0x161

Code starting with the faulting instruction
===========================================
0: 48 89 30 mov %rsi,(%rax)
3: b8 00 00 00 00 mov $0x0,%eax
8: c3 ret
9: 41 54 push %r12
b: 55 push %rbp
c: 53 push %rbx
d: 48 85 ff test %rdi,%rdi
10: 0f 84 21 01 00 00 je 0x137
[ 61.660112][ T5947] RSP: 002b:00007ffd09f037d8 EFLAGS: 00010246
[ 61.666250][ T5947] RAX: 00007f4966b3b6b8 RBX: 000000000000358f RCX: 00000005deece66d
[ 61.674295][ T5947] RDX: 0000000000000000 RSI: 000000002fa0f0d7 RDI: 00007f47e9ac3000
[ 61.682347][ T5947] RBP: 000000002fa0f0d7 R08: 00007ffd09f0386c R09: 0000000000000001
[ 61.690401][ T5947] R10: 00007ffd09f037c0 R11: 0000000000000000 R12: 000000000001ac78
[ 61.698449][ T5947] R13: 00007f47e9ac3000 R14: 00007ffd09f0386c R15: 00007ffd09f03970
[ 61.706500][ T5947] </TASK>
[ 61.709607][ T5947] Modules linked in: kmem xfs loop device_dax nd_pmem dax_pmem nd_btt btrfs blake2b_generic xor raid6_pq libcrc32c intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp sd_mod t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 kvm_intel sg kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 ahci ipmi_ssif rapl intel_cstate ast libahci mei_me drm_shmem_helper i2c_i801 ioatdma acpi_ipmi libata drm_kms_helper mei intel_uncore joydev i2c_smbus intel_pch_thermal dax_hmem dca wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_pad drm fuse ip_tables
[ 61.768510][ T5947] ---[ end trace 0000000000000000 ]---
[ 61.786010][ T5947] pstore: backend (erst) writing error (-28)
[ 61.792055][ T5947] RIP: 0010:do_swap_page (mm/memory.c:3990 (discriminator 3))
[ 61.797397][ T5947] Code: 6f 28 31 d2 be 01 00 00 00 4c 89 ff e8 9b 43 03 00 49 c7 47 28 00 00 00 00 4c 89 f9 48 c7 44 24 08 00 00 00 00 e9 cf fb ff ff <0f> 0b 49 8b 45 08 f0 48 83 28 01 0f 85 3f fc ff ff 49 8b 45 08 4c
All code
========
0: 6f outsl %ds:(%rsi),(%dx)
1: 28 31 sub %dh,(%rcx)
3: d2 be 01 00 00 00 sarb %cl,0x1(%rsi)
9: 4c 89 ff mov %r15,%rdi
c: e8 9b 43 03 00 call 0x343ac
11: 49 c7 47 28 00 00 00 movq $0x0,0x28(%r15)
18: 00
19: 4c 89 f9 mov %r15,%rcx
1c: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
23: 00 00
25: e9 cf fb ff ff jmp 0xfffffffffffffbf9
2a:* 0f 0b ud2 <-- trapping instruction
2c: 49 8b 45 08 mov 0x8(%r13),%rax
30: f0 48 83 28 01 lock subq $0x1,(%rax)
35: 0f 85 3f fc ff ff jne 0xfffffffffffffc7a
3b: 49 8b 45 08 mov 0x8(%r13),%rax
3f: 4c rex.WR

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 49 8b 45 08 mov 0x8(%r13),%rax
6: f0 48 83 28 01 lock subq $0x1,(%rax)
b: 0f 85 3f fc ff ff jne 0xfffffffffffffc50
11: 49 8b 45 08 mov 0x8(%r13),%rax
15: 4c rex.WR
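
A note for anyone decoding the dump by hand: on the "Code:" line, the byte wrapped in <..> is the first byte of the trapping instruction, and its position in the dump is what the disassembly marks as "2a:*". Below is a minimal user-space C sketch that counts the bytes preceding that marker. The helper name trapping_offset() is made up for illustration and is not part of any kernel tool; it assumes the dump is a list of space-separated two-digit hex bytes, as in the log above.

```c
/*
 * Return the number of bytes that precede the <..>-marked byte in a
 * kernel "Code:" dump, or -1 if the dump carries no marker.
 * Hypothetical helper for illustration only.
 */
static long trapping_offset(const char *code)
{
	long offset = 0;

	while (*code) {
		/* Skip the separators between hex bytes. */
		while (*code == ' ')
			code++;
		if (*code == '\0')
			break;
		/* The marked byte starts with '<': report how many came before. */
		if (*code == '<')
			return offset;
		/* Skip one unmarked token (one hex byte) and count it. */
		while (*code && *code != ' ')
			code++;
		offset++;
	}
	return -1;
}
```

Fed the kernel-side "Code:" bytes above, this returns 42, i.e. 0x2a, which matches the offset of the ud2 (0f 0b) in the disassembly.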


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231219/[email protected]



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



2023-12-20 22:11:50

by Andrew Morton

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:

>
>
> Hello,
>
> kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
>
> commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")

I assume this is a bisection result, so it's quite repeatable?

> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> [test failed on linux-next/master aa4db8324c4d0e67aa4670356df4e9fae14b4d37]
>
> in testcase: vm-scalability
> version: vm-scalability-x86_64-1.0-0_20220518
> with following parameters:
>
> runtime: 300
> thp_enabled: always
> thp_defrag: always
> nr_task: 32
> nr_ssd: 1
> priority: 1
> test: swap-w-rand
> cpufreq_governor: performance
>
> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>
>
> compiler: gcc-12
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>
>
> [ 61.404380][ T5947] ------------[ cut here ]------------
> [ 61.409984][ T5947] kernel BUG at mm/memory.c:3990!
> [ 61.415085][ T5947] invalid opcode: 0000 [#1] SMP NOPTI

This is

BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));

and I don't believe that the error path fix
(https://lkml.kernel.org/r/[email protected]) will
address this.

Matthew, have you had a chance to consider?

Thanks.


2023-12-20 22:29:26

by David Hildenbrand

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On 20.12.23 23:11, Andrew Morton wrote:
> On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:
>
>>
>>
>> Hello,
>>
>> kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
>>
>> commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
>
> I assume this is a bisection result, so it's quite repeatable?
>
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> [test failed on linux-next/master aa4db8324c4d0e67aa4670356df4e9fae14b4d37]
>>
>> in testcase: vm-scalability
>> version: vm-scalability-x86_64-1.0-0_20220518
>> with following parameters:
>>
>> runtime: 300
>> thp_enabled: always
>> thp_defrag: always
>> nr_task: 32
>> nr_ssd: 1
>> priority: 1
>> test: swap-w-rand
>> cpufreq_governor: performance
>>
>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>>
>>
>> compiler: gcc-12
>> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
>>
>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>
>>
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <[email protected]>
>> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>>
>>
>> [ 61.404380][ T5947] ------------[ cut here ]------------
>> [ 61.409984][ T5947] kernel BUG at mm/memory.c:3990!
>> [ 61.415085][ T5947] invalid opcode: 0000 [#1] SMP NOPTI
>
> This is
>
> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>
> and I don't believe that the error path fix
> (https://lkml.kernel.org/r/[email protected]) will
> address this.
>
> Matthew, have you had a chance to consider?

Isn't the

page = folio_page(folio, 0);

just wrong?

We must not do that if the folio didn't change. Otherwise we're in
trouble if we had a large folio in the swapcache: page would then point
at the folio's first page rather than the subpage being faulted in.


Maybe something like the following?

diff --git a/mm/memory.c b/mm/memory.c
index d995ead7a3933..3aca5e33c6f81 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3961,7 +3961,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			folio = swapcache;
 			goto out_page;
 		}
-		page = folio_page(folio, 0);
+		if (folio != swapcache)
+			page = folio_page(folio, 0);
 
 		/*
 		 * If we want to map a page that's in the swapcache writable, we




--
Cheers,

David / dhildenb


2023-12-21 02:57:20

by Aithal, Srikanth

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On 12/19/2023 9:16 PM, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
>
> commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> [test failed on linux-next/master aa4db8324c4d0e67aa4670356df4e9fae14b4d37]
>
> in testcase: vm-scalability
> version: vm-scalability-x86_64-1.0-0_20220518
> with following parameters:
>
> runtime: 300
> thp_enabled: always
> thp_defrag: always
> nr_task: 32
> nr_ssd: 1
> priority: 1
> test: swap-w-rand
> cpufreq_governor: performance
>
> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>
>
> compiler: gcc-12
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-lkp/[email protected]
>
>
> [ 61.404380][ T5947] ------------[ cut here ]------------
> [ 61.409984][ T5947] kernel BUG at mm/memory.c:3990!
> [ 61.415085][ T5947] invalid opcode: 0000 [#1] SMP NOPTI
>
Hello,

From next-20231214 through next-20231220 I have noticed a regression
where the kernel hangs randomly while running virtualization tests
(multiple start-shutdown and reboot cycles) against AMD SEV guests,
with the call trace below:

[ 6251.931094] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 6251.931297] #PF: supervisor read access in kernel mode
[ 6251.931456] #PF: error_code(0x0000) - not-present page
[ 6251.931604] PGD 800011016c067 P4D 800011016c067 PUD 0
[ 6251.931757] Oops: 0000 [#2] PREEMPT SMP NOPTI
[ 6251.931910] CPU: 20 PID: 11025 Comm: GC Thread#42 Kdump: loaded Tainted: G D 6.7.0-rc6-next-20231219-next-20231219- #1
[ 6251.932259] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.8.5 08/18/2022
[ 6251.932434] RIP: 0010:swapin_readahead+0x8f/0x4f0
[ 6251.932616] Code: ff 48 8b 8d 68 ff ff ff 8b 75 84 4c 89 e2 48 8b bd 70 ff ff ff e8 91 fb ff ff 48 89 c3 4d 85 e4 74 08 41 f6 44 24 06 01 75 46 <48> 8b 13 48 89 d8 83 e2 40 74 15 8b 53 64 48 83 ea 01 48 23 95 70
[ 6251.933025] RSP: 0018:ffffc9003607fc88 EFLAGS: 00010246
[ 6251.933229] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000001b
[ 6251.933453] RDX: ffff88812c45b380 RSI: 00000000000d021b RDI: 0000000000000000
[ 6251.933668] RBP: ffffc9003607fd28 R08: ffffc9003607fcbf R09: 00000000000d021b
[ 6251.933885] R10: 0000000000000008 R11: 0000000000000003 R12: ffffffff9ef16040
[ 6251.934116] R13: ffffc9003607fdf0 R14: 00000000000d021b R15: ffff88a09ce680b8
[ 6251.934339] FS: 00007fe71cdd9640(0000) GS:ffff88a000300000(0000) knlGS:0000000000000000
[ 6251.934569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6251.934802] CR2: 0000000000000000 CR3: 000800011b43a003 CR4: 0000000000770ef0
[ 6251.935067] PKRU: 55555554
[ 6251.935306] Call Trace:
[ 6251.935560] <TASK>
[ 6251.935844] ? show_regs+0x6d/0x80
[ 6251.936098] ? __die+0x29/0x70
[ 6251.936339] ? page_fault_oops+0x15f/0x460
[ 6251.936585] ? psi_group_change+0x175/0x3b0
[ 6251.936830] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6251.937080] ? do_user_addr_fault+0x30f/0x690
[ 6251.937331] ? exc_page_fault+0x7c/0x190
[ 6251.937580] ? asm_exc_page_fault+0x2b/0x30
[ 6251.937834] ? swapin_readahead+0x8f/0x4f0
[ 6251.938087] ? swapin_readahead+0x3c7/0x4f0
[ 6251.938341] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6251.938599] do_swap_page+0x3ae/0xca0
[ 6251.938854] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6251.939110] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6251.939366] ? __pte_offset_map+0x20/0x190
[ 6251.939626] __handle_mm_fault+0x879/0xe80
[ 6251.939890] handle_mm_fault+0xc6/0x2f0
[ 6251.940151] do_user_addr_fault+0x220/0x690
[ 6251.940413] exc_page_fault+0x7c/0x190
[ 6251.940677] asm_exc_page_fault+0x2b/0x30
[ 6251.940941] RIP: 0033:0x7fe7a57ddf24
[ 6251.941205] Code: 16 4d 01 7c c6 68 e9 8b fd ff ff 48 8b 43 38 49 89 44 24 38 48 8b 43 30 49 89 44 24 30 48 8b 43 28 49 89 44 24 28 48 8b 43 20 <49> 89 44 24 20 48 8b 43 18 49 89 44 24 18 48 8b 43 10 49 89 44 24
[ 6251.941777] RSP: 002b:00007fe71cdd8ab0 EFLAGS: 00010213
[ 6251.942072] RAX: 0000000056293b73 RBX: 00000000fbad2aa8 RCX: 00000000fc31eadb
[ 6251.942368] RDX: 00007fe7a5fad308 RSI: 0000000000000010 RDI: 00007fe7a003b950
[ 6251.942662] RBP: 00007fe71cdd8b20 R08: 0000000000040000 R09: 0000000000000000
[ 6251.942957] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000fc31ead8
[ 6251.943253] R13: 0000000000000005 R14: 00007fe550000d50 R15: 0000000000000005
[ 6251.943552] </TASK>
[ 6251.943838] Modules linked in: binfmt_misc tls overlay ib_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat x_tables nf_nat_tftp nf_conntrack_tftp bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac kvm_amd kvm rapl joydev input_leds wmi_bmof efi_pstore pcspkr acpi_ipmi i2c_piix4 k10temp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter evbug mac_hid sch_fq_codel xfs libcrc32c mgag200 drm_kms_helper i2c_algo_bit drm_shmem_helper hid_generic drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 usbmouse usbkbd sha256_ssse3 mpt3sas sha1_ssse3 tg3 usbhid ccp hid raid_class scsi_transport_sas sp5100_tco wmi dm_mirror dm_region_hash dm_log msr autofs4 aesni_intel crypto_simd cryptd
[ 6251.946304] CR2: 0000000000000000
[ 6251.946607] ---[ end trace 0000000000000000 ]---


Thanks,
"Aithal, Srikanth" <[email protected]>

2023-12-21 04:31:21

by Matthew Wilcox

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On Thu, Dec 21, 2023 at 08:26:37AM +0530, Aithal, Srikanth wrote:
> From next-20231214 through next-20231220, I have noticed a regression
> where the kernel hangs randomly while running virtualization tests (multiple
> start-shutdown and reboot cycles) against the AMD SEV guest type, with the
> call trace below:

This is an entirely different problem; it has already been reported, and
a fix was sent about 24 hours ago.

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 793b5b9e4f96..8a3a8f1ab20a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -894,6 +894,9 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
 		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
 	mpol_cond_put(mpol);
+
+	if (!folio)
+		return NULL;
 	return folio_file_page(folio, swp_offset(entry));
 }
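
For readers following along, the guard added by the diff above can be modelled in user space. This is only a sketch under stated assumptions: `struct page`, `struct folio`, `model_readahead()`, `model_folio_page()` and `model_swapin()` are hypothetical stand-ins, not the kernel's definitions. It shows why a readahead helper that may return NULL has to be checked before a page is derived from the folio, which appears consistent with the NULL dereference at `swapin_readahead+0x8f` in the report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct page and struct folio. */
struct page { int id; };
struct folio { struct page pages[4]; };

/* Models a readahead helper that may fail and hand back NULL. */
static struct folio *model_readahead(struct folio *backing, int fail)
{
	return fail ? NULL : backing;
}

/*
 * Models deriving a page from a folio. It dereferences the folio,
 * so it must never be called with NULL.
 */
static struct page *model_folio_page(struct folio *folio, size_t index)
{
	return &folio->pages[index % 4];
}

/* Models the fixed caller: bail out before touching a NULL folio. */
static struct page *model_swapin(struct folio *backing, size_t offset, int fail)
{
	struct folio *folio = model_readahead(backing, fail);

	if (!folio)
		return NULL;
	return model_folio_page(folio, offset);
}
```

With the check in place, a failed readahead in this model propagates NULL to the caller instead of faulting inside the page lookup.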


2023-12-21 11:23:46

by Oliver Sang

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

hi, Andrew Morton,

On Wed, Dec 20, 2023 at 02:11:35PM -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:
>
> >
> >
> > Hello,
> >
> > kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
> >
> > commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
>
> I assume this is a bisection result, so it's quite repeatable?

Yes, we bisected to this commit, and it's quite repeatable:

ddd06bb63d9793ce bbcbf2a3f05f74f9d268eab57ab
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :6          100%           6:6     dmesg.Kernel_panic-not_syncing:Fatal_exception
           :6          100%           6:6     dmesg.RIP:do_swap_page
           :6          100%           6:6     dmesg.invalid_opcode:#[##]
           :6          100%           6:6     dmesg.kernel_BUG_at_mm/memory.c


>
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > [test failed on linux-next/master aa4db8324c4d0e67aa4670356df4e9fae14b4d37]
> >
> > in testcase: vm-scalability
> > version: vm-scalability-x86_64-1.0-0_20220518
> > with following parameters:
> >
> > runtime: 300
> > thp_enabled: always
> > thp_defrag: always
> > nr_task: 32
> > nr_ssd: 1
> > priority: 1
> > test: swap-w-rand
> > cpufreq_governor: performance
> >
> > test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> > test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
> >
> >
> > compiler: gcc-12
> > test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
> >
> > (please refer to attached dmesg/kmsg for entire log/backtrace)
> >
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <[email protected]>
> > | Closes: https://lore.kernel.org/oe-lkp/[email protected]
> >
> >
> > [ 61.404380][ T5947] ------------[ cut here ]------------
> > [ 61.409984][ T5947] kernel BUG at mm/memory.c:3990!
> > [ 61.415085][ T5947] invalid opcode: 0000 [#1] SMP NOPTI
>
> This is
>
> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>
> and I don't believe that the error path fix
> (https://lkml.kernel.org/r/[email protected]) will
> address this.
>
> Matthew, have you had a chance to consider?
>
> Thanks.
>
> > [ 61.420506][ T5947] CPU: 32 PID: 5947 Comm: usemem Tainted: G S 6.7.0-rc4-00252-gbbcbf2a3f05f #1
> > [ 61.430881][ T5947] Hardware name: Intel Corporation M50CYP2SB1U/M50CYP2SB1U, BIOS SE5C620.86B.01.01.0003.2104260124 04/26/2021
> > [ 61.442761][ T5947] RIP: 0010:do_swap_page (mm/memory.c:3990 (discriminator 3))
> > [ 61.448112][ T5947] Code: 6f 28 31 d2 be 01 00 00 00 4c 89 ff e8 9b 43 03 00 49 c7 47 28 00 00 00 00 4c 89 f9 48 c7 44 24 08 00 00 00 00 e9 cf fb ff ff <0f> 0b 49 8b 45 08 f0 48 83 28 01 0f 85 3f fc ff ff 49 8b 45 08 4c
> > All code
> > ========
> > 0: 6f outsl %ds:(%rsi),(%dx)
> > 1: 28 31 sub %dh,(%rcx)
> > 3: d2 be 01 00 00 00 sarb %cl,0x1(%rsi)
> > 9: 4c 89 ff mov %r15,%rdi
> > c: e8 9b 43 03 00 call 0x343ac
> > 11: 49 c7 47 28 00 00 00 movq $0x0,0x28(%r15)
> > 18: 00
> > 19: 4c 89 f9 mov %r15,%rcx
> > 1c: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
> > 23: 00 00
> > 25: e9 cf fb ff ff jmp 0xfffffffffffffbf9
> > 2a:* 0f 0b ud2 <-- trapping instruction
> > 2c: 49 8b 45 08 mov 0x8(%r13),%rax
> > 30: f0 48 83 28 01 lock subq $0x1,(%rax)
> > 35: 0f 85 3f fc ff ff jne 0xfffffffffffffc7a
> > 3b: 49 8b 45 08 mov 0x8(%r13),%rax
> > 3f: 4c rex.WR
> >
> > Code starting with the faulting instruction
> > ===========================================
> > 0: 0f 0b ud2
> > 2: 49 8b 45 08 mov 0x8(%r13),%rax
> > 6: f0 48 83 28 01 lock subq $0x1,(%rax)
> > b: 0f 85 3f fc ff ff jne 0xfffffffffffffc50
> > 11: 49 8b 45 08 mov 0x8(%r13),%rax
> > 15: 4c rex.WR
> > [ 61.468016][ T5947] RSP: 0000:ffa000000bb5fd98 EFLAGS: 00010206
> > [ 61.474169][ T5947] RAX: ff11000111a47c99 RBX: ffa000000bb5fe08 RCX: 0000002064ac7000
> > [ 61.482233][ T5947] RDX: 0057ffffc00a106d RSI: 0000000000000043 RDI: ffd400008192b1e8
> > [ 61.490296][ T5947] RBP: 000000000100c13b R08: 0000000000000000 R09: ffa000000bb5fe08
> > [ 61.498366][ T5947] R10: 0000000055555554 R11: ff1100018bebbd0c R12: ffd4000044128000
> > [ 61.506438][ T5947] R13: ff1100205d33d800 R14: ff11000130cd2da8 R15: ffd4000044128000
> > [ 61.514508][ T5947] FS: 00007f49c900c740(0000) GS:ff11002001000000(0000) knlGS:0000000000000000
> > [ 61.523534][ T5947] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 61.530225][ T5947] CR2: 00007f4966b3b6b8 CR3: 00000010af786004 CR4: 0000000000771ef0
> > [ 61.538307][ T5947] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 61.546387][ T5947] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 61.554471][ T5947] PKRU: 55555554
> > [ 61.558137][ T5947] Call Trace:
> > [ 61.561544][ T5947] <TASK>
> > [ 61.564599][ T5947] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> > [ 61.568429][ T5947] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153)
> > [ 61.572692][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
> > [ 61.577475][ T5947] ? do_error_trap (arch/x86/include/asm/traps.h:59 arch/x86/kernel/traps.c:174)
> > [ 61.582172][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
> > [ 61.586966][ T5947] ? exc_invalid_op (arch/x86/kernel/traps.c:265)
> > [ 61.591743][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
> > [ 61.596515][ T5947] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
> > [ 61.601638][ T5947] ? do_swap_page (mm/memory.c:3990 (discriminator 3))
> > [ 61.606412][ T5947] ? do_swap_page (mm/memory.c:3971)
> > [ 61.611179][ T5947] __handle_mm_fault (mm/memory.c:5274)
> > [ 61.616203][ T5947] handle_mm_fault (mm/memory.c:5439)
> > [ 61.621051][ T5947] do_user_addr_fault (arch/x86/mm/fault.c:1365)
> > [ 61.626151][ T5947] exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1561)
> > [ 61.630824][ T5947] asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
> > [ 61.635748][ T5947] RIP: 0033:0x5612d5878ad6
> > [ 61.640229][ T5947] Code: 01 00 00 00 e8 1b f9 ff ff 89 c7 e8 6c ff ff ff bf 00 00 00 00 e8 0a f9 ff ff 85 d2 74 08 48 8d 04 f7 48 8b 00 c3 48 8d 04 f7 <48> 89 30 b8 00 00 00 00 c3 41 54 55 53 48 85 ff 0f 84 21 01 00 00
> > All code
> > ========
> > 0: 01 00 add %eax,(%rax)
> > 2: 00 00 add %al,(%rax)
> > 4: e8 1b f9 ff ff call 0xfffffffffffff924
> > 9: 89 c7 mov %eax,%edi
> > b: e8 6c ff ff ff call 0xffffffffffffff7c
> > 10: bf 00 00 00 00 mov $0x0,%edi
> > 15: e8 0a f9 ff ff call 0xfffffffffffff924
> > 1a: 85 d2 test %edx,%edx
> > 1c: 74 08 je 0x26
> > 1e: 48 8d 04 f7 lea (%rdi,%rsi,8),%rax
> > 22: 48 8b 00 mov (%rax),%rax
> > 25: c3 ret
> > 26: 48 8d 04 f7 lea (%rdi,%rsi,8),%rax
> > 2a:* 48 89 30 mov %rsi,(%rax) <-- trapping instruction
> > 2d: b8 00 00 00 00 mov $0x0,%eax
> > 32: c3 ret
> > 33: 41 54 push %r12
> > 35: 55 push %rbp
> > 36: 53 push %rbx
> > 37: 48 85 ff test %rdi,%rdi
> > 3a: 0f 84 21 01 00 00 je 0x161
> >
> > Code starting with the faulting instruction
> > ===========================================
> > 0: 48 89 30 mov %rsi,(%rax)
> > 3: b8 00 00 00 00 mov $0x0,%eax
> > 8: c3 ret
> > 9: 41 54 push %r12
> > b: 55 push %rbp
> > c: 53 push %rbx
> > d: 48 85 ff test %rdi,%rdi
> > 10: 0f 84 21 01 00 00 je 0x137
> > [ 61.660112][ T5947] RSP: 002b:00007ffd09f037d8 EFLAGS: 00010246
> > [ 61.666250][ T5947] RAX: 00007f4966b3b6b8 RBX: 000000000000358f RCX: 00000005deece66d
> > [ 61.674295][ T5947] RDX: 0000000000000000 RSI: 000000002fa0f0d7 RDI: 00007f47e9ac3000
> > [ 61.682347][ T5947] RBP: 000000002fa0f0d7 R08: 00007ffd09f0386c R09: 0000000000000001
> > [ 61.690401][ T5947] R10: 00007ffd09f037c0 R11: 0000000000000000 R12: 000000000001ac78
> > [ 61.698449][ T5947] R13: 00007f47e9ac3000 R14: 00007ffd09f0386c R15: 00007ffd09f03970
> > [ 61.706500][ T5947] </TASK>
> > [ 61.709607][ T5947] Modules linked in: kmem xfs loop device_dax nd_pmem dax_pmem nd_btt btrfs blake2b_generic xor raid6_pq libcrc32c intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp sd_mod t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 kvm_intel sg kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 ahci ipmi_ssif rapl intel_cstate ast libahci mei_me drm_shmem_helper i2c_i801 ioatdma acpi_ipmi libata drm_kms_helper mei intel_uncore joydev i2c_smbus intel_pch_thermal dax_hmem dca wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_pad drm fuse ip_tables
> > [ 61.768510][ T5947] ---[ end trace 0000000000000000 ]---
> > [ 61.786010][ T5947] pstore: backend (erst) writing error (-28)
> > [ 61.792055][ T5947] RIP: 0010:do_swap_page (mm/memory.c:3990 (discriminator 3))
> > [ 61.797397][ T5947] Code: 6f 28 31 d2 be 01 00 00 00 4c 89 ff e8 9b 43 03 00 49 c7 47 28 00 00 00 00 4c 89 f9 48 c7 44 24 08 00 00 00 00 e9 cf fb ff ff <0f> 0b 49 8b 45 08 f0 48 83 28 01 0f 85 3f fc ff ff 49 8b 45 08 4c
> > All code
> > ========
> > 0: 6f outsl %ds:(%rsi),(%dx)
> > 1: 28 31 sub %dh,(%rcx)
> > 3: d2 be 01 00 00 00 sarb %cl,0x1(%rsi)
> > 9: 4c 89 ff mov %r15,%rdi
> > c: e8 9b 43 03 00 call 0x343ac
> > 11: 49 c7 47 28 00 00 00 movq $0x0,0x28(%r15)
> > 18: 00
> > 19: 4c 89 f9 mov %r15,%rcx
> > 1c: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
> > 23: 00 00
> > 25: e9 cf fb ff ff jmp 0xfffffffffffffbf9
> > 2a:* 0f 0b ud2 <-- trapping instruction
> > 2c: 49 8b 45 08 mov 0x8(%r13),%rax
> > 30: f0 48 83 28 01 lock subq $0x1,(%rax)
> > 35: 0f 85 3f fc ff ff jne 0xfffffffffffffc7a
> > 3b: 49 8b 45 08 mov 0x8(%r13),%rax
> > 3f: 4c rex.WR
> >
> > Code starting with the faulting instruction
> > ===========================================
> > 0: 0f 0b ud2
> > 2: 49 8b 45 08 mov 0x8(%r13),%rax
> > 6: f0 48 83 28 01 lock subq $0x1,(%rax)
> > b: 0f 85 3f fc ff ff jne 0xfffffffffffffc50
> > 11: 49 8b 45 08 mov 0x8(%r13),%rax
> > 15: 4c rex.WR
> >
> >
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20231219/[email protected]
> >
> >
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
> >

2023-12-21 11:32:36

by David Hildenbrand

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On 21.12.23 12:23, Oliver Sang wrote:
> hi, Andrew Morton,
>
> On Wed, Dec 20, 2023 at 02:11:35PM -0800, Andrew Morton wrote:
>> On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:
>>
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
>>>
>>> commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
>>
>> I assume this is a bisection result, so it's quite repeatable?
>
> Yes, we bisected to this commit, and it's quite repeatable:
>
> ddd06bb63d9793ce bbcbf2a3f05f74f9d268eab57ab
> ---------------- ---------------------------
> fail:runs %reproduction fail:runs
> | | |
> :6 100% 6:6 dmesg.Kernel_panic-not_syncing:Fatal_exception
> :6 100% 6:6 dmesg.RIP:do_swap_page
> :6 100% 6:6 dmesg.invalid_opcode:#[##]
> :6 100% 6:6 dmesg.kernel_BUG_at_mm/memory.c
>
>

Can you try with the snippet I sent? Please let me know if you need a
full patch for testing purposes.

--
Cheers,

David / dhildenb


2023-12-21 21:58:41

by Andrew Morton

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On Thu, 21 Dec 2023 12:32:04 +0100 David Hildenbrand <[email protected]> wrote:

> On 21.12.23 12:23, Oliver Sang wrote:
> > hi, Andrew Morton,
> >
> > On Wed, Dec 20, 2023 at 02:11:35PM -0800, Andrew Morton wrote:
> >> On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> Hello,
> >>>
> >>> kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
> >>>
> >>> commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
> >>
> >> I assume this is a bisection result, so it's quite repeatable?
> >
> > Yes, we bisected to this commit, and it's quite repeatable:
> >
> > ddd06bb63d9793ce bbcbf2a3f05f74f9d268eab57ab
> > ---------------- ---------------------------
> > fail:runs %reproduction fail:runs
> > | | |
> > :6 100% 6:6 dmesg.Kernel_panic-not_syncing:Fatal_exception
> > :6 100% 6:6 dmesg.RIP:do_swap_page
> > :6 100% 6:6 dmesg.invalid_opcode:#[##]
> > :6 100% 6:6 dmesg.kernel_BUG_at_mm/memory.c
> >
> >
>
> Can you try with the snippet I sent? Please let me know if you need a
> full patch for testing purposes.

I think a full patch would be better, please.

2023-12-21 22:07:37

by David Hildenbrand

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On 21.12.23 22:58, Andrew Morton wrote:
> On Thu, 21 Dec 2023 12:32:04 +0100 David Hildenbrand <[email protected]> wrote:
>
>> On 21.12.23 12:23, Oliver Sang wrote:
>>> hi, Andrew Morton,
>>>
>>> On Wed, Dec 20, 2023 at 02:11:35PM -0800, Andrew Morton wrote:
>>>> On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
>>>>>
>>>>> commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
>>>>
>>>> I assume this is a bisection result, so it's quite repeatable?
>>>
>>> Yes, we bisected to this commit, and it's quite repeatable:
>>>
>>> ddd06bb63d9793ce bbcbf2a3f05f74f9d268eab57ab
>>> ---------------- ---------------------------
>>> fail:runs %reproduction fail:runs
>>> | | |
>>> :6 100% 6:6 dmesg.Kernel_panic-not_syncing:Fatal_exception
>>> :6 100% 6:6 dmesg.RIP:do_swap_page
>>> :6 100% 6:6 dmesg.invalid_opcode:#[##]
>>> :6 100% 6:6 dmesg.kernel_BUG_at_mm/memory.c
>>>
>>>
>>
>> Can you try with the snippet I sent? Please let me know if you need a
>> full patch for testing purposes.
>
> I think a full patch would be better, please.
>

From b82e309096abde6c0f24bba50a281e8d3855c132 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <[email protected]>
Date: Thu, 21 Dec 2023 23:04:30 +0100
Subject: [PATCH] Fixup: mm: convert ksm_might_need_to_copy() to work on folios

We must only adjust the page if the folio changed. Otherwise, if we
had a large folio in the swapcache and the folio didn't change, we'd
suddenly change the page to-be-mapped.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 149f779910fd5..2f9668d357f5c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3952,7 +3952,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			folio = swapcache;
 			goto out_page;
 		}
-		page = folio_page(folio, 0);
+		if (folio != swapcache)
+			page = folio_page(folio, 0);

 		/*
 		 * If we want to map a page that's in the swapcache writable, we
--
2.43.0


--
Cheers,

David / dhildenb
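
The reasoning in the patch description above can be illustrated with a small user-space model. This is a sketch only: `struct page`, `struct folio`, `model_folio_page()`, `pick_page_buggy()` and `pick_page_fixed()` are hypothetical names made up for illustration, not kernel code. It assumes a large folio in the swapcache whose faulting page is not the first one, and compares the unconditional reset with the conditional one from the fixup.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8

/* Hypothetical stand-ins for the kernel's struct page and struct folio. */
struct page { int index; };
struct folio { struct page pages[MODEL_NR_PAGES]; };

/* Models folio_page(): the n-th page of a folio. */
static struct page *model_folio_page(struct folio *folio, size_t n)
{
	return &folio->pages[n];
}

/*
 * Models the behaviour before the fixup: the page to be mapped is always
 * reset to the first page of the folio, even when the folio is still the
 * swapcache folio and 'page' already pointed at the faulting subpage.
 */
static struct page *pick_page_buggy(struct folio *folio,
				    struct folio *swapcache,
				    struct page *page)
{
	(void)swapcache;
	(void)page;
	return model_folio_page(folio, 0);
}

/*
 * Models the fixup: only adjust the page when a different folio came back,
 * in which case the first page of that new folio is used.
 */
static struct page *pick_page_fixed(struct folio *folio,
				    struct folio *swapcache,
				    struct page *page)
{
	if (folio != swapcache)
		page = model_folio_page(folio, 0);
	return page;
}
```

In this model, passing the same large folio as both `folio` and `swapcache` with a faulting page at index 5 makes the buggy variant return page 0, while the fixed variant keeps page 5. When a different folio is passed, the fixed variant returns that folio's first page, as the unconditional version did.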


2023-12-21 22:13:51

by Matthew Wilcox

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On Thu, Dec 21, 2023 at 11:07:21PM +0100, David Hildenbrand wrote:
> Subject: [PATCH] Fixup: mm: convert ksm_might_need_to_copy() to work on folios
>
> We must only adjust the page if the folio changed. Otherwise, if we
> had a large folio in the swapcache and the folio didn't change, we'd
> suddenly change the page to-be-mapped.

Heh, I was expecting you to be done for the day ;-)

2023-12-21 22:15:10

by David Hildenbrand

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On 21.12.23 23:13, Matthew Wilcox wrote:
> On Thu, Dec 21, 2023 at 11:07:21PM +0100, David Hildenbrand wrote:
>> Subject: [PATCH] Fixup: mm: convert ksm_might_need_to_copy() to work on folios
>>
>> We must only adjust the page if the folio changed. Otherwise, if we
>> had a large folio in the swapcache and the folio didn't change, we'd
>> suddenly change the page to-be-mapped.
>
> Heh, I was expecting you to be done for the day ;-)

I was expecting that myself, but here I am ... :)

--
Cheers,

David / dhildenb


2023-12-22 08:14:18

by Oliver Sang

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

hi, David,

On Thu, Dec 21, 2023 at 11:07:21PM +0100, David Hildenbrand wrote:
> On 21.12.23 22:58, Andrew Morton wrote:
> > On Thu, 21 Dec 2023 12:32:04 +0100 David Hildenbrand <[email protected]> wrote:
> >
> > > On 21.12.23 12:23, Oliver Sang wrote:
> > > > hi, Andrew Morton,
> > > >
> > > > On Wed, Dec 20, 2023 at 02:11:35PM -0800, Andrew Morton wrote:
> > > > > On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
> > > > > >
> > > > > > commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
> > > > >
> > > > > I assume this is a bisection result, so it's quite repeatable?
> > > >
> > > > Yes, we bisected to this commit, and it's quite repeatable:
> > > >
> > > > ddd06bb63d9793ce bbcbf2a3f05f74f9d268eab57ab
> > > > ---------------- ---------------------------
> > > > fail:runs %reproduction fail:runs
> > > > | | |
> > > > :6 100% 6:6 dmesg.Kernel_panic-not_syncing:Fatal_exception
> > > > :6 100% 6:6 dmesg.RIP:do_swap_page
> > > > :6 100% 6:6 dmesg.invalid_opcode:#[##]
> > > > :6 100% 6:6 dmesg.kernel_BUG_at_mm/memory.c
> > > >
> > > >
> > >
> > > Can you try with the snippet I sent? Please let me know if you need a
> > > full patch for testing purposes.
> >
> > I think a full patch would be better, please.
> >

We can no longer reproduce the previously reported issue after applying the patch below.
Thanks

>
> From b82e309096abde6c0f24bba50a281e8d3855c132 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <[email protected]>
> Date: Thu, 21 Dec 2023 23:04:30 +0100
> Subject: [PATCH] Fixup: mm: convert ksm_might_need_to_copy() to work on folios
>
> We must only adjust the page if the folio changed. Otherwise, if we
> had a large folio in the swapcache and the folio didn't change, we'd
> suddenly change the page to-be-mapped.
>
> Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/oe-lkp/[email protected]
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/memory.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 149f779910fd5..2f9668d357f5c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3952,7 +3952,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio = swapcache;
> goto out_page;
> }
> - page = folio_page(folio, 0);
> + if (folio != swapcache)
> + page = folio_page(folio, 0);
> /*
> * If we want to map a page that's in the swapcache writable, we
> --
> 2.43.0
>
>
> --
> Cheers,
>
> David / dhildenb
>

2023-12-22 09:38:52

by David Hildenbrand

Subject: Re: [linux-next:master] [mm] bbcbf2a3f0: kernel_BUG_at_mm/memory.c

On 22.12.23 09:13, Oliver Sang wrote:
> hi, David,
>
> On Thu, Dec 21, 2023 at 11:07:21PM +0100, David Hildenbrand wrote:
>> On 21.12.23 22:58, Andrew Morton wrote:
>>> On Thu, 21 Dec 2023 12:32:04 +0100 David Hildenbrand <[email protected]> wrote:
>>>
>>>> On 21.12.23 12:23, Oliver Sang wrote:
>>>>> hi, Andrew Morton,
>>>>>
>>>>> On Wed, Dec 20, 2023 at 02:11:35PM -0800, Andrew Morton wrote:
>>>>>> On Tue, 19 Dec 2023 23:46:50 +0800 kernel test robot <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> kernel test robot noticed "kernel_BUG_at_mm/memory.c" on:
>>>>>>>
>>>>>>> commit: bbcbf2a3f05f74f9d268eab57abbdce6a65a94ad ("mm: convert ksm_might_need_to_copy() to work on folios")
>>>>>>
>>>>>> I assume this is a bisection result, so it's quite repeatable?
>>>>>
>>>>> Yes, we bisected to this commit, and it's quite repeatable:
>>>>>
>>>>> ddd06bb63d9793ce bbcbf2a3f05f74f9d268eab57ab
>>>>> ---------------- ---------------------------
>>>>> fail:runs %reproduction fail:runs
>>>>> | | |
>>>>> :6 100% 6:6 dmesg.Kernel_panic-not_syncing:Fatal_exception
>>>>> :6 100% 6:6 dmesg.RIP:do_swap_page
>>>>> :6 100% 6:6 dmesg.invalid_opcode:#[##]
>>>>> :6 100% 6:6 dmesg.kernel_BUG_at_mm/memory.c
>>>>>
>>>>>
>>>>
>>>> Can you try with the snippet I sent? Please let me know if you need a
>>>> full patch for testing purposes.
>>>
>>> I think a full patch would be better, please.
>>>
>
> We can no longer reproduce the previously reported issue after applying the patch below.

Thanks for verifying and happy holidays!

--
Cheers,

David / dhildenb