2024-03-05 06:09:01

by Oliver Sang

Subject: [linux-next:master] [mm,page_owner] 4bedfb314b: BUG:KASAN:null-ptr-deref_in_init_page_owner



Hello,

kernel test robot noticed "BUG:KASAN:null-ptr-deref_in_init_page_owner" on:

commit: 4bedfb314bdd85c1662ecc46fa25b33b998f994d ("mm,page_owner: maintain own list of stack_records structs")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

[test failed on linux-next/master 67908bf6954b7635d33760ff6dfc189fc26ccc89]

in testcase: boot

compiler: clang-17
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+---------------------------------------------+------------+------------+
|                                             | 8151c7a35d | 4bedfb314b |
+---------------------------------------------+------------+------------+
| BUG:KASAN:null-ptr-deref_in_init_page_owner | 0          | 24         |
| canonical_address#:#[##]                    | 0          | 24         |
| RIP:init_page_owner                         | 0          | 24         |
| Kernel_panic-not_syncing:Fatal_exception    | 0          | 24         |
+---------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


[ 6.582562][ T0] Node 0, zone DMA32: page owner found early allocated 0 pages
[ 6.612136][ T0] Node 0, zone Normal: page owner found early allocated 73871 pages
[ 6.612762][ T0] ==================================================================
[ 6.613351][ T0] BUG: KASAN: null-ptr-deref in init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.613893][ T0] Write of size 4 at addr 000000000000001c by task swapper/0
[ 6.614434][ T0]
[ 6.614600][ T0] CPU: 0 PID: 0 Comm: swapper Tainted: G T 6.8.0-rc5-00256-g4bedfb314bdd #1 29e70169ace75ef72d53825e983f3dcb1d5756d9
[ 6.615605][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 6.616367][ T0] Call Trace:
[ 6.616604][ T0] <TASK>
[ 6.616816][ T0] ? dump_stack_lvl (lib/dump_stack.c:?)
[ 6.617161][ T0] ? print_report (mm/kasan/report.c:?)
[ 6.617499][ T0] ? init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.617863][ T0] ? kasan_report (mm/kasan/report.c:603)
[ 6.618206][ T0] ? init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.618567][ T0] ? kasan_check_range (mm/kasan/generic.c:?)
[ 6.618940][ T0] ? init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.619301][ T0] ? mm_core_init (mm/mm_init.c:2790)
[ 6.619627][ T0] ? start_kernel (init/main.c:934)
[ 6.619969][ T0] ? x86_64_start_reservations (??:?)
[ 6.620380][ T0] ? x86_64_start_kernel (??:?)
[ 6.620751][ T0] ? secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:461)
[ 6.621204][ T0] </TASK>
[ 6.621420][ T0] ==================================================================
[ 6.622015][ T0] Disabling lock debugging due to kernel taint
[ 6.622474][ T0] general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT KASAN PTI
[ 6.623342][ T0] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[ 6.623960][ T0] CPU: 0 PID: 0 Comm: swapper Tainted: G B T 6.8.0-rc5-00256-g4bedfb314bdd #1 29e70169ace75ef72d53825e983f3dcb1d5756d9
[ 6.624959][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 6.625725][ T0] RIP: 0010:init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.626133][ T0] Code: 9c 8e ee fb 48 89 05 55 8f 2d 01 48 8b 1d 0e 8f 2d 01 48 83 c3 1c 48 89 df be 04 00 00 00 e8 dd 5c 8b fa 48 89 d8 48 c1 e8 03 <8a> 04 28 84 c0 0f 85 8a 00 00 00 c7 03 01 00 00 00 48 8b 1d 1e 8f
All code
========
0: 9c pushf
1: 8e ee mov %esi,%gs
3: fb sti
4: 48 89 05 55 8f 2d 01 mov %rax,0x12d8f55(%rip) # 0x12d8f60
b: 48 8b 1d 0e 8f 2d 01 mov 0x12d8f0e(%rip),%rbx # 0x12d8f20
12: 48 83 c3 1c add $0x1c,%rbx
16: 48 89 df mov %rbx,%rdi
19: be 04 00 00 00 mov $0x4,%esi
1e: e8 dd 5c 8b fa call 0xfffffffffa8b5d00
23: 48 89 d8 mov %rbx,%rax
26: 48 c1 e8 03 shr $0x3,%rax
2a:* 8a 04 28 mov (%rax,%rbp,1),%al <-- trapping instruction
2d: 84 c0 test %al,%al
2f: 0f 85 8a 00 00 00 jne 0xbf
35: c7 03 01 00 00 00 movl $0x1,(%rbx)
3b: 48 rex.W
3c: 8b .byte 0x8b
3d: 1d .byte 0x1d
3e: 1e (bad)
3f: 8f .byte 0x8f

Code starting with the faulting instruction
===========================================
0: 8a 04 28 mov (%rax,%rbp,1),%al
3: 84 c0 test %al,%al
5: 0f 85 8a 00 00 00 jne 0x95
b: c7 03 01 00 00 00 movl $0x1,(%rbx)
11: 48 rex.W
12: 8b .byte 0x8b
13: 1d .byte 0x1d
14: 1e (bad)
15: 8f .byte 0x8f
[ 6.627591][ T0] RSP: 0000:ffffffff85e07eb8 EFLAGS: 00010007
[ 6.628035][ T0] RAX: 0000000000000003 RBX: 000000000000001c RCX: ffffffff811f54d8
[ 6.628619][ T0] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff85f3f220
[ 6.629202][ T0] RBP: dffffc0000000000 R08: ffffffff85f3f227 R09: 1ffffffff0be7e44
[ 6.629788][ T0] R10: dffffc0000000000 R11: fffffbfff0be7e45 R12: ffffffff86d96298
[ 6.630372][ T0] R13: 1ffffd40021ffff8 R14: ffffffff86d96888 R15: 0000000000440000
[ 6.630956][ T0] FS: 0000000000000000(0000) GS:ffffffff85f0e000(0000) knlGS:0000000000000000
[ 6.631610][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.632091][ T0] CR2: ffff88843ffff000 CR3: 0000000005ef2000 CR4: 00000000000000b0
[ 6.632677][ T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.633261][ T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.633849][ T0] Call Trace:
[ 6.634085][ T0] <TASK>
[ 6.634296][ T0] ? __die_body (arch/x86/kernel/dumpstack.c:421)
[ 6.634614][ T0] ? die_addr (arch/x86/kernel/dumpstack.c:?)
[ 6.634930][ T0] ? exc_general_protection (arch/x86/kernel/traps.c:?)
[ 6.635339][ T0] ? kasan_report (mm/kasan/report.c:?)
[ 6.635682][ T0] ? asm_exc_general_protection (arch/x86/include/asm/idtentry.h:564)
[ 6.636104][ T0] ? add_taint (arch/x86/include/asm/bitops.h:60 include/asm-generic/bitops/instrumented-atomic.h:29 kernel/panic.c:543)
[ 6.636413][ T0] ? init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.636775][ T0] ? init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.637136][ T0] ? mm_core_init (mm/mm_init.c:2790)
[ 6.637465][ T0] ? start_kernel (init/main.c:934)
[ 6.637810][ T0] ? x86_64_start_reservations (??:?)
[ 6.638222][ T0] ? x86_64_start_kernel (??:?)
[ 6.638594][ T0] ? secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:461)
[ 6.639046][ T0] </TASK>
[ 6.639263][ T0] Modules linked in:
[ 6.639547][ T0] ---[ end trace 0000000000000000 ]---
[ 6.639942][ T0] RIP: 0010:init_page_owner (arch/x86/include/asm/atomic.h:28)
[ 6.640348][ T0] Code: 9c 8e ee fb 48 89 05 55 8f 2d 01 48 8b 1d 0e 8f 2d 01 48 83 c3 1c 48 89 df be 04 00 00 00 e8 dd 5c 8b fa 48 89 d8 48 c1 e8 03 <8a> 04 28 84 c0 0f 85 8a 00 00 00 c7 03 01 00 00 00 48 8b 1d 1e 8f
All code
========
0: 9c pushf
1: 8e ee mov %esi,%gs
3: fb sti
4: 48 89 05 55 8f 2d 01 mov %rax,0x12d8f55(%rip) # 0x12d8f60
b: 48 8b 1d 0e 8f 2d 01 mov 0x12d8f0e(%rip),%rbx # 0x12d8f20
12: 48 83 c3 1c add $0x1c,%rbx
16: 48 89 df mov %rbx,%rdi
19: be 04 00 00 00 mov $0x4,%esi
1e: e8 dd 5c 8b fa call 0xfffffffffa8b5d00
23: 48 89 d8 mov %rbx,%rax
26: 48 c1 e8 03 shr $0x3,%rax
2a:* 8a 04 28 mov (%rax,%rbp,1),%al <-- trapping instruction
2d: 84 c0 test %al,%al
2f: 0f 85 8a 00 00 00 jne 0xbf
35: c7 03 01 00 00 00 movl $0x1,(%rbx)
3b: 48 rex.W
3c: 8b .byte 0x8b
3d: 1d .byte 0x1d
3e: 1e (bad)
3f: 8f .byte 0x8f

Code starting with the faulting instruction
===========================================
0: 8a 04 28 mov (%rax,%rbp,1),%al
3: 84 c0 test %al,%al
5: 0f 85 8a 00 00 00 jne 0x95
b: c7 03 01 00 00 00 movl $0x1,(%rbx)
11: 48 rex.W
12: 8b .byte 0x8b
13: 1d .byte 0x1d
14: 1e (bad)
15: 8f .byte 0x8f
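
The non-canonical address in the general protection fault can be recomputed by hand from the register dump. The sketch below assumes the usual x86-64 generic KASAN mapping (shadow = (addr >> 3) + offset); the offset value is taken from RBP in the dump, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed x86-64 generic KASAN parameters; the offset matches RBP in the dump. */
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Shadow byte address that the KASAN check reads before an access to addr. */
uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* First byte covered by the shadow byte that guards addr. */
uint64_t shadow_range_start(uint64_t addr)
{
	return addr & ~7ULL;
}

/* Last byte covered by the shadow byte that guards addr. */
uint64_t shadow_range_end(uint64_t addr)
{
	return addr | 7ULL;
}
```

With addr = 0x1c (RBX), this yields 0xdffffc0000000003, the address named in the fault, and the covered range is 0x18-0x1f, as printed in the "null-ptr-deref in range" line.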


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240305/[email protected]



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



2024-03-05 09:26:44

by Oscar Salvador

Subject: Re: [linux-next:master] [mm,page_owner] 4bedfb314b: BUG:KASAN:null-ptr-deref_in_init_page_owner

On Tue, Mar 05, 2024 at 02:08:23PM +0800, kernel test robot wrote:
>
> [ 6.582562][ T0] Node 0, zone DMA32: page owner found early allocated 0 pages
> [ 6.612136][ T0] Node 0, zone Normal: page owner found early allocated 73871 pages
> [ 6.612762][ T0] ==================================================================
> [ 6.613351][ T0] BUG: KASAN: null-ptr-deref in init_page_owner (arch/x86/include/asm/atomic.h:28)
> [ 6.613893][ T0] Write of size 4 at addr 000000000000001c by task swapper/0
> [ 6.614434][ T0]
> [ 6.614600][ T0] CPU: 0 PID: 0 Comm: swapper Tainted: G T 6.8.0-rc5-00256-g4bedfb314bdd #1 29e70169ace75ef72d53825e983f3dcb1d5756d9
> [ 6.615605][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 6.616367][ T0] Call Trace:
> [ 6.616604][ T0] <TASK>
> [ 6.616816][ T0] ? dump_stack_lvl (lib/dump_stack.c:?)
> [ 6.617161][ T0] ? print_report (mm/kasan/report.c:?)
> [ 6.617499][ T0] ? init_page_owner (arch/x86/include/asm/atomic.h:28)

So, we are crashing here:

/* Initialize dummy and failure stacks and link them to stack_list */
dummy_stack.stack_record = __stack_depot_get_stack_record(dummy_handle);
failure_stack.stack_record = __stack_depot_get_stack_record(failure_handle);
refcount_set(&dummy_stack.stack_record->count, 1);
refcount_set(&failure_stack.stack_record->count, 1);

when trying to set the refcount, presumably because dummy_handle is 0.
I thought we fixed that with

commit 3ee34eabac2abb6b1b6fcdebffe18870719ad000
Author: Oscar Salvador <[email protected]>
Date: Thu Feb 15 22:59:01 2024 +0100

lib/stackdepot: fix first entry having a 0-handle


But I guess this is different.
The obvious way out is to only set the refcount and link the stacks
if their handles are not 0.
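
A minimal userspace sketch of such a guard, not the actual fixup. The struct and function names are hypothetical stand-ins for the page_owner types, and a plain int replaces refcount_t.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct stack_record_model {
	int count;
};

struct stack_model {
	struct stack_record_model *stack_record;
	struct stack_model *next;
};

/*
 * Set the refcount and link the stack into the list only when the
 * lookup produced a record. Returns 1 if linked, 0 if skipped.
 */
int link_stack_if_valid(struct stack_model **list, struct stack_model *stack,
			struct stack_record_model *rec)
{
	stack->stack_record = rec;
	if (!rec)
		return 0;	/* 0-handle: nothing to refcount or link */

	rec->count = 1;
	stack->next = *list;
	*list = stack;
	return 1;
}
```

The point of the sketch is only that a NULL record is never dereferenced; whether skipping the link is the right behaviour for page_owner is a separate question.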

Marco, could it be that stackdepot was so overloaded that, by the time
page_owner gets initialized, there is no more space for its stacks, and
hence it returns 0-handles?


--
Oscar Salvador
SUSE Labs

2024-03-05 13:03:22

by Marco Elver

Subject: Re: [linux-next:master] [mm,page_owner] 4bedfb314b: BUG:KASAN:null-ptr-deref_in_init_page_owner

On Tue, 5 Mar 2024 at 10:26, Oscar Salvador <[email protected]> wrote:
>
> On Tue, Mar 05, 2024 at 02:08:23PM +0800, kernel test robot wrote:
> >
> > [ 6.582562][ T0] Node 0, zone DMA32: page owner found early allocated 0 pages
> > [ 6.612136][ T0] Node 0, zone Normal: page owner found early allocated 73871 pages
> > [ 6.612762][ T0] ==================================================================
> > [ 6.613351][ T0] BUG: KASAN: null-ptr-deref in init_page_owner (arch/x86/include/asm/atomic.h:28)
> > [ 6.613893][ T0] Write of size 4 at addr 000000000000001c by task swapper/0
> > [ 6.614434][ T0]
> > [ 6.614600][ T0] CPU: 0 PID: 0 Comm: swapper Tainted: G T 6.8.0-rc5-00256-g4bedfb314bdd #1 29e70169ace75ef72d53825e983f3dcb1d5756d9
> > [ 6.615605][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [ 6.616367][ T0] Call Trace:
> > [ 6.616604][ T0] <TASK>
> > [ 6.616816][ T0] ? dump_stack_lvl (lib/dump_stack.c:?)
> > [ 6.617161][ T0] ? print_report (mm/kasan/report.c:?)
> > [ 6.617499][ T0] ? init_page_owner (arch/x86/include/asm/atomic.h:28)
>
> So, we are crashing here:
>
> /* Initialize dummy and failure stacks and link them to stack_list */
> dummy_stack.stack_record = __stack_depot_get_stack_record(dummy_handle);
> failure_stack.stack_record = __stack_depot_get_stack_record(failure_handle);
> refcount_set(&dummy_stack.stack_record->count, 1);
> refcount_set(&failure_stack.stack_record->count, 1);
>
> when trying to set the refcount. Allegedly, because dummy_handle is 0.
> I thought we fixed that with
>
> commit 3ee34eabac2abb6b1b6fcdebffe18870719ad000
> Author: Oscar Salvador <[email protected]>
> Date: Thu Feb 15 22:59:01 2024 +0100
>
> lib/stackdepot: fix first entry having a 0-handle
>
>
> But I guess this is different.
> The obvious way out is to only set the refcount and link the stacks
> if their handles are not 0.
>
> Marco, could it be that stackdepot was too overloaded, that by the time
> page_owner gets initialized, there are no more space for its stacks, and
> hence return 0-handles?.

That's possible. But it's unclear to me what exactly happens. Are you
able to reproduce the issue? (I haven't been able to because the
config enables CFI which seems to cause other issues for me,
presumably toolchain related. :-/ )

2024-03-05 18:31:36

by Oscar Salvador

Subject: Re: [linux-next:master] [mm,page_owner] 4bedfb314b: BUG:KASAN:null-ptr-deref_in_init_page_owner

On Tue, Mar 05, 2024 at 02:02:35PM +0100, Marco Elver wrote:
> On Tue, 5 Mar 2024 at 10:26, Oscar Salvador <[email protected]> wrote:
> > Marco, could it be that stackdepot was too overloaded, that by the time
> > page_owner gets initialized, there are no more space for its stacks, and
> > hence return 0-handles?.
>
> That's possible. But it's unclear to me what exactly happens. Are you
> able to reproduce the issue? (I haven't been able to because the
> config enables CFI which seems to cause other issues for me,
> presumably toolchain related. :-/ )

I am out of luck here, I cannot reproduce the issue.
I set up the environment just as [1] says, building the kernel with
their config and launching bin/lkp just as [1] states, but it
boots fine here.

[1] https://download.01.org/0day-ci/archive/20240305/[email protected]/reproduce

--
Oscar Salvador
SUSE Labs

2024-03-05 18:38:28

by Oscar Salvador

Subject: Re: [linux-next:master] [mm,page_owner] 4bedfb314b: BUG:KASAN:null-ptr-deref_in_init_page_owner

On Tue, Mar 05, 2024 at 07:32:02PM +0100, Oscar Salvador wrote:
> On Tue, Mar 05, 2024 at 02:02:35PM +0100, Marco Elver wrote:
> > On Tue, 5 Mar 2024 at 10:26, Oscar Salvador <[email protected]> wrote:
> > > Marco, could it be that stackdepot was too overloaded, that by the time
> > > page_owner gets initialized, there are no more space for its stacks, and
> > > hence return 0-handles?.
> >
> > That's possible. But it's unclear to me what exactly happens. Are you
> > able to reproduce the issue? (I haven't been able to because the
> > config enables CFI which seems to cause other issues for me,
> > presumably toolchain related. :-/ )
>
> I am out of luck here, I cannot reproduce the issue.
> I set up the environment just as [1] says, building the kernel with
> their config and launching bin/lkp just as [1] states, but it
> boots fine here.

But they point to

commit 4bedfb314bdd85c1662ecc46fa25b33b998f994d (HEAD, bisection)
Author: Oscar Salvador <[email protected]>
Date: Thu Feb 15 22:59:03 2024 +0100

mm,page_owner: maintain own list of stack_records structs

whose only job is to retrieve the stack_record for
{dummy,failure}_handle, set their refcount and link them.
I am pretty sure the problem comes from either dummy_handle or
failure_handle being 0, so that the stack_record we get is NULL.
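
The reported write address fits that theory: the disassembly adds 0x1c to the loaded pointer before storing 1, so a NULL stack_record gives exactly "Write of size 4 at addr 000000000000001c". The sketch below uses an assumed, simplified mirror of the leading fields of struct stack_record on a 64-bit build; the field list is an assumption for illustration, only the 0x1c offset comes from the report.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct list_head on a 64-bit build: two pointers. */
struct list_head_model {
	uint64_t next;
	uint64_t prev;
};

/* Assumed, simplified mirror of the leading fields of struct stack_record. */
struct stack_record_model {
	struct list_head_model hash_list;	/* offset 0x00 */
	uint32_t hash;				/* offset 0x10 */
	uint32_t size;				/* offset 0x14 */
	uint32_t handle;			/* offset 0x18 */
	uint32_t count;				/* offset 0x1c, refcount_t in the kernel */
};

/* Address that a store to rec->count would hit for a given record address. */
uint64_t count_write_addr(uint64_t rec)
{
	return rec + offsetof(struct stack_record_model, count);
}
```

For rec = 0 this returns 0x1c, which is the address KASAN complained about.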

I will come up with a patch to guard against this scenario, although I
did not think this could happen at such an early stage (the stack_record
lookup returning NULL).


--
Oscar Salvador
SUSE Labs

2024-03-06 07:18:50

by Oscar Salvador

Subject: Re: [linux-next:master] [mm,page_owner] 4bedfb314b: BUG:KASAN:null-ptr-deref_in_init_page_owner

On Tue, Mar 05, 2024 at 07:38:42PM +0100, Oscar Salvador wrote:
> But they point out to
>
> commit 4bedfb314bdd85c1662ecc46fa25b33b998f994d (HEAD, bisection)
> Author: Oscar Salvador <[email protected]>
> Date: Thu Feb 15 22:59:03 2024 +0100
>
> mm,page_owner: maintain own list of stack_records structs
>
> which the only thing it does is to retrieve the stack_record for
> {dummy,failure}.handle and increment their refcount and link them.
> I am pretty sure the problem comes from either dummy_handle or
> failure_handle being 0 and the stack_record we get is NULL.

Yes, jfyi: I "artificially" reproduced this by making
dummy_handle explicitly = 0 again.
And I see that KASAN points to the same location.

I am kind of surprised stackdepot ran out of space that early, but I
guess we cannot take anything for granted.

I am already working on a fixup to not blow up here.


--
Oscar Salvador
SUSE Labs