2023-02-27 06:23:33

by Sanan Hasanov

Subject: BUG: unable to handle kernel NULL pointer dereference in rcu_core

Good day, dear maintainers,

We found a bug using a modified kernel configuration file used by syzbot.

We enhanced the coverage of the configuration file using our tool, klocalizer.

Kernel Branch: 6.2.0-next-20230221
Kernel config: https://drive.google.com/file/d/1QKAQV11zjOwISifUc-skRBoTo3EXhutY/view?usp=share_link
C Reproducer: Unfortunately, there is no reproducer yet.

Thank you!

Best regards,
Sanan Hasanov

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 53756067 P4D 53756067 PUD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.2.0-next-20230221 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
Call Trace:
<IRQ>
rcu_core+0x85d/0x1960
__do_softirq+0x2e5/0xae2
__irq_exit_rcu+0x11d/0x190
irq_exit_rcu+0x9/0x20
sysvec_apic_timer_interrupt+0x97/0xc0
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:default_idle+0xf/0x20
Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
default_idle_call+0x67/0xa0
do_idle+0x361/0x440
cpu_startup_entry+0x18/0x20
start_secondary+0x256/0x300
secondary_startup_64_no_verify+0xce/0xdb
</TASK>
Modules linked in:
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246

RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
----------------
Code disassembly (best guess):
0: 89 07 mov %eax,(%rdi)
2: 49 c7 c0 08 00 00 00 mov $0x8,%r8
9: 4d 29 c8 sub %r9,%r8
c: 4c 01 c7 add %r8,%rdi
f: 4c 29 c2 sub %r8,%rdx
12: e9 76 ff ff ff jmp 0xffffff8d
17: cc int3
18: cc int3
19: cc int3
1a: cc int3
1b: f3 0f 1e fa endbr64
1f: eb 07 jmp 0x28
21: 0f 00 2d e3 8a 34 00 verw 0x348ae3(%rip) # 0x348b0b
28: fb sti
29: f4 hlt
* 2a: fa cli <-- trapping instruction
2b: c3 ret
2c: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
33: 00 00 00 00
37: 0f 1f 40 00 nopl 0x0(%rax)
3b: f3 0f 1e fa endbr64
3f: 65 gs



2023-02-27 08:03:11

by Zhouyi Zhou

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

Hi

On Mon, Feb 27, 2023 at 2:30 PM Sanan Hasanov
<[email protected]> wrote:
>
> Good day, dear maintainers,
>
> We found a bug using a modified kernel configuration file used by syzbot.
>
> We enhanced the coverage of the configuration file using our tool, klocalizer.
>
> Kernel Branch: 6.2.0-next-20230221
> Kernel config: https://drive.google.com/file/d/1QKAQV11zjOwISifUc-skRBoTo3EXhutY/view?usp=share_link
> C Reproducer: Unfortunately, there is no reproducer yet.
I downloaded 6.2.0-next-20230221 (wget
https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/snapshot/linux-next-next-20230221.tar.gz)
and compiled the kernel using the above kernel config, and started syzkaller:
http://154.220.3.120:56700/

I hope to reproduce the bug and track down its cause.

You are welcome ;-)
Thanks
Zhouyi
>
> Thank you!
>
> Best regards,
> Sanan Hasanov
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 53756067 P4D 53756067 PUD 0
> Oops: 0010 [#1] PREEMPT SMP KASAN
> CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.2.0-next-20230221 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> RIP: 0010:0x0
> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> Call Trace:
> <IRQ>
> rcu_core+0x85d/0x1960
> __do_softirq+0x2e5/0xae2
> __irq_exit_rcu+0x11d/0x190
> irq_exit_rcu+0x9/0x20
> sysvec_apic_timer_interrupt+0x97/0xc0
> </IRQ>
> <TASK>
> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> RIP: 0010:default_idle+0xf/0x20
> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
> default_idle_call+0x67/0xa0
> do_idle+0x361/0x440
> cpu_startup_entry+0x18/0x20
> start_secondary+0x256/0x300
> secondary_startup_64_no_verify+0xce/0xdb
> </TASK>
> Modules linked in:
> CR2: 0000000000000000
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:0x0
> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
>
> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> ----------------
> Code disassembly (best guess):
> 0: 89 07 mov %eax,(%rdi)
> 2: 49 c7 c0 08 00 00 00 mov $0x8,%r8
> 9: 4d 29 c8 sub %r9,%r8
> c: 4c 01 c7 add %r8,%rdi
> f: 4c 29 c2 sub %r8,%rdx
> 12: e9 76 ff ff ff jmp 0xffffff8d
> 17: cc int3
> 18: cc int3
> 19: cc int3
> 1a: cc int3
> 1b: f3 0f 1e fa endbr64
> 1f: eb 07 jmp 0x28
> 21: 0f 00 2d e3 8a 34 00 verw 0x348ae3(%rip) # 0x348b0b
> 28: fb sti
> 29: f4 hlt
> * 2a: fa cli <-- trapping instruction
> 2b: c3 ret
> 2c: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
> 33: 00 00 00 00
> 37: 0f 1f 40 00 nopl 0x0(%rax)
> 3b: f3 0f 1e fa endbr64
> 3f: 65 gs
>

2023-02-27 13:15:56

by Joel Fernandes

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core



> On Feb 27, 2023, at 3:03 AM, Zhouyi Zhou <[email protected]> wrote:
>
> Hi
>
>> On Mon, Feb 27, 2023 at 2:30 PM Sanan Hasanov
>> <[email protected]> wrote:
>>
>> Good day, dear maintainers,
>>
>> We found a bug using a modified kernel configuration file used by syzbot.
>>
>> We enhanced the coverage of the configuration file using our tool, klocalizer.
>>
>> Kernel Branch: 6.2.0-next-20230221
>> Kernel config: https://drive.google.com/file/d/1QKAQV11zjOwISifUc-skRBoTo3EXhutY/view?usp=share_link
>> C Reproducer: Unfortunately, there is no reproducer yet.

Sanan/Zhouyi,
Could you also provide the full kernel dmesg? Could you enable CONFIG_DEBUG_INFO_DWARF5 and provide the vmlinux after the crash?

More comments below:

>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> #PF: supervisor instruction fetch in kernel mode
>> #PF: error_code(0x0010) - not-present page
>> PGD 53756067 P4D 53756067 PUD 0
>> Oops: 0010 [#1] PREEMPT SMP KASAN
>> CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.2.0-next-20230221 #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>> RIP: 0010:0x0
>> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
>> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
>> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
>> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
>> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
>> FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
>> Call Trace:
>> <IRQ>
>> rcu_core+0x85d/0x1960
>> __do_softirq+0x2e5/0xae2
>> __irq_exit_rcu+0x11d/0x190
>> irq_exit_rcu+0x9/0x20
>> sysvec_apic_timer_interrupt+0x97/0xc0
>> </IRQ>
>> <TASK>
>> asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> RIP: 0010:default_idle+0xf/0x20
>> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
>> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
>> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
>> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
>> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
>> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
>> default_idle_call+0x67/0xa0
>> do_idle+0x361/0x440
>> cpu_startup_entry+0x18/0x20
>> start_secondary+0x256/0x300
>> secondary_startup_64_no_verify+0xce/0xdb
>> </TASK>
>> Modules linked in:
>> CR2: 0000000000000000
>> ---[ end trace 0000000000000000 ]---
>> RIP: 0010:0x0
>> Code: Unable to access opcode bytes at 0xffffffffffffffd6.

I have seen this exact signature when the processor tries to execute a function that has a NULL address. That sends IP to 0 and triggers the exception. Sounds like something corrupted rcu_head (just a guess).
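
For reference, the signature can be read straight off the "#PF: error_code(0x0010)" and "address: 0000000000000000" lines of the report. The following is a minimal userspace sketch that decodes the low bits of an x86 page-fault error code; the bit meanings are the architectural ones, while the helper names pf_describe() and looks_like_null_call() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Low bits of the x86 page-fault error code. */
#define PF_PROT  (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1)  /* 0: read access,      1: write access */
#define PF_USER  (1u << 2)  /* 0: supervisor mode,  1: user mode */
#define PF_RSVD  (1u << 3)  /* reserved bit set in a paging-structure entry */
#define PF_INSTR (1u << 4)  /* the faulting access was an instruction fetch */

/* Hypothetical helper: render an error code roughly the way the oops does. */
static void pf_describe(uint32_t error_code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%s %s, %s",
             (error_code & PF_USER) ? "user" : "supervisor",
             (error_code & PF_INSTR) ? "instruction fetch"
             : (error_code & PF_WRITE) ? "write access" : "read access",
             (error_code & PF_PROT) ? "protection violation"
                                    : "not-present page");
}

/*
 * Hypothetical heuristic: a supervisor instruction fetch from a
 * not-present page at address 0 is what a call through a NULL
 * function pointer looks like.
 */
static bool looks_like_null_call(uint32_t error_code, uint64_t fault_addr)
{
    return (error_code & PF_INSTR) && !(error_code & PF_PROT) &&
           !(error_code & PF_USER) && fault_addr == 0;
}
```

Calling pf_describe(0x0010, buf, sizeof(buf)) fills buf with a string describing a supervisor instruction fetch on a not-present page, and looks_like_null_call(0x0010, 0) is true for the values in this report.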

>> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
>>
>> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
>> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
>> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
>> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
>> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
>> FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
>> ----------------
>> Code disassembly (best guess):
>> 0: 89 07 mov %eax,(%rdi)
>> 2: 49 c7 c0 08 00 00 00 mov $0x8,%r8
>> 9: 4d 29 c8 sub %r9,%r8
>> c: 4c 01 c7 add %r8,%rdi
>> f: 4c 29 c2 sub %r8,%rdx
>> 12: e9 76 ff ff ff jmp 0xffffff8d
>> 17: cc int3
>> 18: cc int3
>> 19: cc int3
>> 1a: cc int3
>> 1b: f3 0f 1e fa endbr64
>> 1f: eb 07 jmp 0x28
>> 21: 0f 00 2d e3 8a 34 00 verw 0x348ae3(%rip) # 0x348b0b
>> 28: fb sti
>> 29: f4 hlt
>> * 2a: fa cli <-- trapping instruction

This probably happened before the crash and it is likely unrelated IMO. cli just means interrupts were enabled, the actual problem happened after softirq fired (likely at the tail end of the interrupt).

Thanks,

- Joel


>> 2b: c3 ret
>> 2c: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
>> 33: 00 00 00 00
>> 37: 0f 1f 40 00 nopl 0x0(%rax)
>> 3b: f3 0f 1e fa endbr64
>> 3f: 65 gs
>>

2023-02-27 14:59:14

by Paul E. McKenney

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Mon, Feb 27, 2023 at 08:15:26AM -0500, Joel Fernandes wrote:
>
>
> > On Feb 27, 2023, at 3:03 AM, Zhouyi Zhou <[email protected]> wrote:
> >
> > Hi
> >
> >> On Mon, Feb 27, 2023 at 2:30 PM Sanan Hasanov
> >> <[email protected]> wrote:
> >>
> >> Good day, dear maintainers,
> >>
> >> We found a bug using a modified kernel configuration file used by syzbot.
> >>
> >> We enhanced the coverage of the configuration file using our tool, klocalizer.
> >>
> >> Kernel Branch: 6.2.0-next-20230221
> >> Kernel config: https://drive.google.com/file/d/1QKAQV11zjOwISifUc-skRBoTo3EXhutY/view?usp=share_link
> >> C Reproducer: Unfortunately, there is no reproducer yet.
>
> Sanan/Zhouyi,
> Could you also provide the full kernel dmesg? Could you enable CONFIG_DEBUG_INFO_DWARF5 and provide the vmlinux after the crash?
>
> More comments below:
>
> >>
> >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> #PF: supervisor instruction fetch in kernel mode
> >> #PF: error_code(0x0010) - not-present page
> >> PGD 53756067 P4D 53756067 PUD 0
> >> Oops: 0010 [#1] PREEMPT SMP KASAN
> >> CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.2.0-next-20230221 #1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> >> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
> >> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> >> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> >> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> >> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> >> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> >> FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> >> Call Trace:
> >> <IRQ>
> >> rcu_core+0x85d/0x1960
> >> __do_softirq+0x2e5/0xae2
> >> __irq_exit_rcu+0x11d/0x190
> >> irq_exit_rcu+0x9/0x20
> >> sysvec_apic_timer_interrupt+0x97/0xc0
> >> </IRQ>
> >> <TASK>
> >> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> RIP: 0010:default_idle+0xf/0x20
> >> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
> >> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
> >> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
> >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
> >> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
> >> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
> >> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
> >> default_idle_call+0x67/0xa0
> >> do_idle+0x361/0x440
> >> cpu_startup_entry+0x18/0x20
> >> start_secondary+0x256/0x300
> >> secondary_startup_64_no_verify+0xce/0xdb
> >> </TASK>
> >> Modules linked in:
> >> CR2: 0000000000000000
> >> ---[ end trace 0000000000000000 ]---
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>
> I have seen this exact signature when the processor tries to execute a function that has a NULL address. That sends IP to 0 and triggers the exception. Sounds like something corrupted rcu_head (just a guess).

Quite possibly! If so, then building with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
might be helpful.

Once a reproducer is found, of course...
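
To illustrate the kind of check such a debug option makes possible, here is a rough userspace sketch that tracks whether a callback head is already queued and reports a second queueing attempt instead of silently corrupting the list. This is only a model under stated assumptions: the names tracked_head, tracked_call_rcu() and friends are hypothetical, and the state is stored inside the head purely to keep the example short, which is not necessarily how the kernel keeps its bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum head_state { HEAD_IDLE, HEAD_QUEUED };

/* Hypothetical callback head with debug-only state tracking. */
struct tracked_head {
    void (*func)(struct tracked_head *head);
    enum head_state state;
};

static int double_queue_reports;  /* number of misuse reports so far */

/* Stand-in for call_rcu(): refuse and report a head that is already pending. */
static bool tracked_call_rcu(struct tracked_head *head,
                             void (*func)(struct tracked_head *head))
{
    assert(head != NULL && func != NULL);
    if (head->state == HEAD_QUEUED) {
        double_queue_reports++;   /* misuse caught at queueing time */
        return false;
    }
    head->state = HEAD_QUEUED;
    head->func = func;
    return true;
}

/* Stand-in for callback invocation after a grace period. */
static void tracked_invoke(struct tracked_head *head)
{
    head->state = HEAD_IDLE;
    head->func(head);
}

static void noop_cb(struct tracked_head *head)
{
    (void)head;
}

/* Queue the same head twice and return how many reports were made. */
static int model_double_queue(void)
{
    struct tracked_head h = { 0 };

    tracked_call_rcu(&h, noop_cb);
    tracked_call_rcu(&h, noop_cb);  /* bug: the head is still pending */
    tracked_invoke(&h);
    return double_queue_reports;
}
```

The point of the sketch is that the misuse is flagged where it happens, in the caller, rather than much later when the callback list is walked.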

Thanx, Paul

> >> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
> >>
> >> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> >> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> >> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> >> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> >> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> >> FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> >> ----------------
> >> Code disassembly (best guess):
> >> 0: 89 07 mov %eax,(%rdi)
> >> 2: 49 c7 c0 08 00 00 00 mov $0x8,%r8
> >> 9: 4d 29 c8 sub %r9,%r8
> >> c: 4c 01 c7 add %r8,%rdi
> >> f: 4c 29 c2 sub %r8,%rdx
> >> 12: e9 76 ff ff ff jmp 0xffffff8d
> >> 17: cc int3
> >> 18: cc int3
> >> 19: cc int3
> >> 1a: cc int3
> >> 1b: f3 0f 1e fa endbr64
> >> 1f: eb 07 jmp 0x28
> >> 21: 0f 00 2d e3 8a 34 00 verw 0x348ae3(%rip) # 0x348b0b
> >> 28: fb sti
> >> 29: f4 hlt
> >> * 2a: fa cli <-- trapping instruction
>
> This probably happened before the crash and it is likely unrelated IMO. cli just means interrupts were enabled, the actual problem happened after softirq fired (likely at the tail end of the interrupt).
>
> Thanks,
>
> - Joel
>
>
> >> 2b: c3 ret
> >> 2c: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
> >> 33: 00 00 00 00
> >> 37: 0f 1f 40 00 nopl 0x0(%rax)
> >> 3b: f3 0f 1e fa endbr64
> >> 3f: 65 gs
> >>

2023-02-27 15:14:07

by Joel Fernandes

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Mon, Feb 27, 2023 at 8:15 AM Joel Fernandes <[email protected]> wrote:
[..]
> >> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
> >>
> >> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> >> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> >> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> >> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> >> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> >> FS: 0000000000000000(0000) GS:ffff888119f80000(0000) knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> >> ----------------
> >> Code disassembly (best guess):
> >> 0: 89 07 mov %eax,(%rdi)
> >> 2: 49 c7 c0 08 00 00 00 mov $0x8,%r8
> >> 9: 4d 29 c8 sub %r9,%r8
> >> c: 4c 01 c7 add %r8,%rdi
> >> f: 4c 29 c2 sub %r8,%rdx
> >> 12: e9 76 ff ff ff jmp 0xffffff8d
> >> 17: cc int3
> >> 18: cc int3
> >> 19: cc int3
> >> 1a: cc int3
> >> 1b: f3 0f 1e fa endbr64
> >> 1f: eb 07 jmp 0x28
> >> 21: 0f 00 2d e3 8a 34 00 verw 0x348ae3(%rip) # 0x348b0b
> >> 28: fb sti
> >> 29: f4 hlt
> >> * 2a: fa cli <-- trapping instruction
>
> This probably happened before the crash and it is likely unrelated IMO. cli just means interrupts were enabled, the actual problem happened after softirq fired (likely at the tail end of the interrupt).
>

And just to correct myself for completeness, CLI clears the IF flag,
which ends up *disabling maskable interrupts*, not enabling. Still, I
can't see that as a possible reason for the crash.
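
As a quick check of the flag semantics, the interrupt flag (IF) is bit 9 of EFLAGS, and both EFLAGS values in the report (00010246 and 00000202) have it set. A small userspace sketch, with helper names that are only illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_IF (1u << 9)  /* interrupt enable flag */

/* True if maskable interrupts are enabled in this EFLAGS value. */
static bool irqs_enabled(uint32_t eflags)
{
    return (eflags & X86_EFLAGS_IF) != 0;
}

/* Model of STI: set IF, enabling maskable interrupts. */
static uint32_t model_sti(uint32_t eflags)
{
    return eflags | X86_EFLAGS_IF;
}

/* Model of CLI: clear IF, disabling maskable interrupts. */
static uint32_t model_cli(uint32_t eflags)
{
    uint32_t result = eflags & ~X86_EFLAGS_IF;

    assert(!irqs_enabled(result));
    return result;
}
```

With these helpers, irqs_enabled(0x00000202) and irqs_enabled(0x00010246) are both true, and model_cli(0x00000202) yields 0x00000002.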

- Joel

2023-02-27 15:33:33

by Steven Rostedt

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Mon, 27 Feb 2023 08:15:26 -0500
Joel Fernandes <[email protected]> wrote:

> >> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> RIP: 0010:default_idle+0xf/0x20
> >> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
> >> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
> >> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
> >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
> >> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
> >> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
> >> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
> >> default_idle_call+0x67/0xa0
> >> do_idle+0x361/0x440
> >> cpu_startup_entry+0x18/0x20
> >> start_secondary+0x256/0x300
> >> secondary_startup_64_no_verify+0xce/0xdb
> >> </TASK>
> >> Modules linked in:
> >> CR2: 0000000000000000
> >> ---[ end trace 0000000000000000 ]---
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>
> I have seen this exact signature when the processor tries to execute a function that has a NULL address. That sends IP to 0 and triggers the exception. Sounds like something corrupted rcu_head (just a guess).

[ Joel, you need to line wrap your emails ;-) ]

This looks like a call_rcu() was called on something that later got freed
or reused. That is, the bug is not with RCU but with something using RCU.

OR it could be a bug with RCU if the synchronize_rcu() ended before the
grace periods have finished.
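
The first scenario can be modelled in userspace as follows. This is a sketch, not kernel code: cb_head mirrors the two fields of struct rcu_head, model_call_rcu() and model_invoke() are hypothetical stand-ins for call_rcu() and the callback loop, and the struct session object is invented for the example. The invoke loop checks for a NULL func only so that the demo can report the problem; without that check it would jump to address 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in with the same two fields as struct rcu_head. */
struct cb_head {
    struct cb_head *next;
    void (*func)(struct cb_head *head);
};

/* Hypothetical object that embeds its callback head. */
struct session {
    int id;
    struct cb_head rcu;
};

static int reclaimed;  /* counts callbacks that actually ran */

static void session_reclaim(struct cb_head *head)
{
    (void)head;
    reclaimed++;
}

/* Stand-in for call_rcu(): remember the callback, run it later. */
static void model_call_rcu(struct cb_head **list, struct cb_head *head,
                           void (*func)(struct cb_head *head))
{
    assert(head != NULL && func != NULL);
    head->func = func;
    head->next = *list;
    *list = head;
}

/* Stand-in for the callback loop; returns false if a NULL func was seen. */
static bool model_invoke(struct cb_head **list)
{
    bool ok = true;

    while (*list) {
        struct cb_head *head = *list;

        *list = head->next;
        if (!head->func) {
            ok = false;  /* calling this would fetch from address 0 */
            continue;
        }
        head->func(head);
    }
    return ok;
}

/* The buggy pattern: the owner reuses the object while the callback is pending. */
static bool model_reuse_while_pending(void)
{
    struct cb_head *list = NULL;
    struct session s = { .id = 1 };

    model_call_rcu(&list, &s.rcu, session_reclaim);
    memset(&s, 0, sizeof(s));  /* premature reuse zeroes rcu.func */
    return model_invoke(&list);
}
```

Here model_reuse_while_pending() returns false and the reclaim callback never runs, because the pending head was overwritten before the callback loop reached it.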

-- Steve



2023-02-27 15:49:05

by Joel Fernandes

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

Hey Steve,

On Mon, Feb 27, 2023 at 10:33 AM Steven Rostedt <[email protected]> wrote:
>
> On Mon, 27 Feb 2023 08:15:26 -0500
> Joel Fernandes <[email protected]> wrote:
>
> > >> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > >> RIP: 0010:default_idle+0xf/0x20
> > >> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
> > >> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
> > >> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
> > >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
> > >> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
> > >> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
> > >> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
> > >> default_idle_call+0x67/0xa0
> > >> do_idle+0x361/0x440
> > >> cpu_startup_entry+0x18/0x20
> > >> start_secondary+0x256/0x300
> > >> secondary_startup_64_no_verify+0xce/0xdb
> > >> </TASK>
> > >> Modules linked in:
> > >> CR2: 0000000000000000
> > >> ---[ end trace 0000000000000000 ]---
> > >> RIP: 0010:0x0
> > >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> >
> > I have seen this exact signature when the processor tries to execute a function that has a NULL address. That sends IP to 0 and triggers the exception. Sounds like something corrupted rcu_head (just a guess).
>
> [ Joel, you need to line wrap your emails ;-) ]

Ok I will try. The thing is, I have not figured out yet how to
plaintext-reply from my iPhone without having it wrap :-(

> This looks like a call_rcu() was called on something that later got freed
> or reused. That is, the bug is not with RCU but with something using RCU.

Yes certainly, the rcu_head is allocated on the caller side so it
could have been trampled while the callback was still in flight.
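
To make the "allocated on the caller side" point concrete, here is a small userspace sketch of the usual embedding pattern: the head lives inside the caller's object and the callback recovers that object with container_of(). The struct node type and the node_*() helpers are hypothetical, and the grace period is reduced to a direct call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in with the same two fields as struct rcu_head. */
struct cb_head {
    struct cb_head *next;
    void (*func)(struct cb_head *head);
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical caller-owned object; the head shares its lifetime. */
struct node {
    int key;
    struct cb_head rcu;
};

static int last_freed_key = -1;

/* Callback: recover the enclosing node from the embedded head, then free it. */
static void node_free_cb(struct cb_head *head)
{
    struct node *n = container_of(head, struct node, rcu);

    last_freed_key = n->key;
    free(n);
}

static struct node *node_alloc(int key)
{
    struct node *n = calloc(1, sizeof(*n));

    assert(n != NULL);
    n->key = key;
    return n;
}

/* Deferred free: the caller supplies both the storage and the callback. */
static void node_retire(struct node *n)
{
    n->rcu.func = node_free_cb;
}

/* Stand-in for the end of a grace period: invoke the stored callback. */
static void node_grace_period_elapsed(struct node *n)
{
    n->rcu.func(&n->rcu);
}

/* Allocate, retire and reclaim one node; returns the key that was freed. */
static int model_retire_and_reclaim(int key)
{
    struct node *n = node_alloc(key);

    node_retire(n);
    node_grace_period_elapsed(n);  /* n must not be touched after this */
    return last_freed_key;
}
```

Because the head is just a field of the caller's object, anything that writes to or frees that object between node_retire() and the callback also overwrites the stored function pointer.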

> OR it could be a bug with RCU if the synchronize_rcu() ended before the
> grace periods have finished.

Good point..

Thanks,

- Joel

2023-02-27 16:12:07

by Zhouyi Zhou

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Mon, Feb 27, 2023 at 11:49 PM Joel Fernandes <[email protected]> wrote:
>
> Hey Steve,
>
> On Mon, Feb 27, 2023 at 10:33 AM Steven Rostedt <[email protected]> wrote:
> >
> > On Mon, 27 Feb 2023 08:15:26 -0500
> > Joel Fernandes <[email protected]> wrote:
> >
> > > >> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > > >> RIP: 0010:default_idle+0xf/0x20
> > > >> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
> > > >> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
> > > >> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
> > > >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
> > > >> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
> > > >> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
> > > >> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
> > > >> default_idle_call+0x67/0xa0
> > > >> do_idle+0x361/0x440
> > > >> cpu_startup_entry+0x18/0x20
> > > >> start_secondary+0x256/0x300
> > > >> secondary_startup_64_no_verify+0xce/0xdb
> > > >> </TASK>
> > > >> Modules linked in:
> > > >> CR2: 0000000000000000
> > > >> ---[ end trace 0000000000000000 ]---
> > > >> RIP: 0010:0x0
> > > >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> > >
> > > I have seen this exact signature when the processor tries to execute a function that has a NULL address. That sends IP to 0 and triggers the exception. Sounds like something corrupted rcu_head (just a guess).
> >
> > [ Joel, you need to line wrap your emails ;-) ]
>
> Ok I will try. The thing is, I have not figured out yet how to
> plaintext-reply from my iPhone without having it wrap :-(
>
> > This looks like a call_rcu() was called on something that later got freed
> > or reused. That is, the bug is not with RCU but with something using RCU.
>
> Yes certainly, the rcu_head is allocated on the caller side so it
> could have been trampled while the callback was still in flight.
Thank you all for your guidance, I learned a lot during this process
>
> > OR it could be a bug with RCU if the synchronize_rcu() ended before the
> > grace periods have finished.
Thanks again.

By the way, the syzkaller on my local machine has been running for 8
hours, only three bugs reported[1][2][3], but they don't seem to be
related to Sanan's original report.
Maybe there are some configuration mismatches between us. The test
continues, I will report to you once I have any new discovery.

[1] http://154.220.3.120:56700/
[2] https://kernel.source.codeaurora.cn/pub/scm/linux/kernel/git/next/linux-next.git/snapshot/linux-next-next-20230221.tar.gz
[3] http://154.220.3.120/configs/linux-next-config-20230221.txt
Thanks
Zhouyi
>
> Good point..
>
> Thanks,
>
> - Joel

2023-02-27 16:16:35

by Steven Rostedt

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Tue, 28 Feb 2023 00:11:51 +0800
Zhouyi Zhou <[email protected]> wrote:

> > Yes certainly, the rcu_head is allocated on the caller side so it
> > could have been trampled while the callback was still in flight.
> Thank you all for your guidance, I learned a lot during this process
> >
> > > OR it could be a bug with RCU if the synchronize_rcu() ended before the
> > > grace periods have finished.
> Thanks again.
>
> By the way, the syzkaller on my local machine has been running for 8
> hours, only three bugs reported[1][2][3], but they don't seem to be
> related to Sanan's original report.
> Maybe there are some configuration mismatches between us. The test
> continues, I will report to you once I have any new discovery.

Note, the above races (either bug, the one that tramples on something in
RCU flight, or a synchronize_sched() returning early) may be extremely hard
to hit. It could have been the planets were lined up just right to hit the
bug, and won't happen for another 27,000 years.

-- Steve

2023-02-27 16:33:04

by Paul E. McKenney

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Mon, Feb 27, 2023 at 11:16:26AM -0500, Steven Rostedt wrote:
> On Tue, 28 Feb 2023 00:11:51 +0800
> Zhouyi Zhou <[email protected]> wrote:
>
> > > Yes certainly, the rcu_head is allocated on the caller side so it
> > > could have been trampled while the callback was still in flight.
> > Thank you all for your guidance, I learned a lot during this process
> > >
> > > > OR it could be a bug with RCU if the synchronize_rcu() ended before the
> > > > grace periods have finished.
> > Thanks again.
> >
> > By the way, the syzkaller on my local machine has been running for 8
> > hours, only three bugs reported[1][2][3], but they don't seem to be
> > related to Sanan's original report.
> > Maybe there are some configuration mismatches between us. The test
> > continues, I will report to you once I have any new discovery.
>
> Note, the above races (either bug, the one that tramples on something in
> RCU flight, or a synchronize_sched() returning early) may be extremely hard
> to hit. It could have been the planets were lined up just right to hit the
> bug, and won't happen for another 27,000 years.

Which turns into once per week or two across a million-system fleet. ;-)
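
The arithmetic behind that, taking the 27,000-year figure literally and assuming a million independent systems that each hit the race at that rate (both assumptions are for illustration only), can be sketched as:

```c
#include <assert.h>

/*
 * Expected number of days between hits across a fleet, given the
 * expected years between hits on one system. Assumes the systems
 * are independent and identical, and a 365.25-day year.
 */
static double fleet_days_between_hits(double years_per_hit_per_system,
                                      double systems)
{
    assert(years_per_hit_per_system > 0 && systems > 0);
    return years_per_hit_per_system * 365.25 / systems;
}
```

For 27,000 years and 1,000,000 systems this comes to a little under ten days, which is where "once per week or two" comes from.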

Not that I know of any fleets running syzkaller...

Thanx, Paul

2023-02-28 07:58:35

by Zhouyi Zhou

Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Tue, Feb 28, 2023 at 12:33 AM Paul E. McKenney <[email protected]> wrote:
>
> On Mon, Feb 27, 2023 at 11:16:26AM -0500, Steven Rostedt wrote:
> > On Tue, 28 Feb 2023 00:11:51 +0800
> > Zhouyi Zhou <[email protected]> wrote:
> >
> > > > Yes certainly, the rcu_head is allocated on the caller side so it
> > > > could have been trampled while the callback was still in flight.
> > > Thank you all for your guidance, I learned a lot during this process
> > > >
> > > > > OR it could be a bug with RCU if the synchronize_rcu() ended before the
> > > > > grace periods have finished.
> > > Thanks again.
> > >
> > > By the way, the syzkaller on my local machine has been running for 8
> > > hours, only three bugs reported[1][2][3], but they don't seem to be
> > > related to Sanan's original report.
> > > Maybe there are some configuration mismatches between us. The test
> > > continues, I will report to you once I have any new discovery.
> >
> > Note, the above races (either bug, the one that tramples on something in
> > RCU flight, or a synchronize_sched() returning early) may be extremely hard
> > to hit. It could have been the planets were lined up just right to hit the
> > bug, and won't happen for another 27,000 years.
>
> Which turns into once per week or two across a million-system fleet. ;-)
>
> Not that I know of any fleets running syzkaller...
My syzkaller has been running for 24 hours, but the bug has not been reproduced.
Yes, the above races are extremely hard to hit.
I learned a lot during the process ;-)

Please inform me if there are any more clues.

Thank you all for your guidance ;-)
Zhouyi
>
> Thanx, Paul

2023-02-28 08:55:36

by Qiuxu Zhuo

Subject: RE: BUG: unable to handle kernel NULL pointer dereference in rcu_core

> From: Joel Fernandes <[email protected]>
> Sent: Monday, February 27, 2023 9:15 PM
> To: Zhouyi Zhou <[email protected]>
> Cc: Sanan Hasanov <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: BUG: unable to handle kernel NULL pointer dereference in
> rcu_core
>
> ...
> >> BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> #PF: supervisor instruction fetch in kernel mode
> >> #PF: error_code(0x0010) - not-present page PGD 53756067 P4D 53756067
> >> PUD 0
> >> Oops: 0010 [#1] PREEMPT SMP KASAN
> >> CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.2.0-next-20230221 #1
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> >> 04/01/2014
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> >> RSP: 0018:ffffc900003f8e48 EFLAGS: 00010246
> >> RAX: 0000000000000000 RBX: ffff888100833900 RCX: 00000000b9582f6c
> >> RDX: 1ffff11020106853 RSI: ffffffff816b2769 RDI: ffff888043f64708
> >> RBP: 000000000000000c R08: 0000000000000000 R09: ffffffff900b895f
> >> R10: fffffbfff201712b R11: 000000000008e001 R12: dffffc0000000000
> >> R13: ffffc900003f8ec8 R14: ffff888043f64708 R15: 000000000000000b
> >> FS: 0000000000000000(0000) GS:ffff888119f80000(0000)
> >> knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffffffffd6 CR3: 0000000054e64000 CR4: 0000000000350ee0
> >> Call Trace:
> >> <IRQ>
> >> rcu_core+0x85d/0x1960
> >> __do_softirq+0x2e5/0xae2
> >> __irq_exit_rcu+0x11d/0x190
> >> irq_exit_rcu+0x9/0x20
> >> sysvec_apic_timer_interrupt+0x97/0xc0
> >> </IRQ>
> >> <TASK>
> >> asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> RIP: 0010:default_idle+0xf/0x20
> >> Code: 89 07 49 c7 c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff
> >> ff ff cc cc cc cc f3 0f 1e fa eb 07 0f 00 2d e3 8a 34 00 fb f4 <fa>
> >> c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 65
> >> RSP: 0018:ffffc9000017fe00 EFLAGS: 00000202
> >> RAX: 0000000000dfbea1 RBX: dffffc0000000000 RCX: ffffffff89b1da9c
> >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
> >> RBP: 0000000000000007 R08: 0000000000000001 R09: ffff888119fb6c23
> >> R10: ffffed10233f6d84 R11: dffffc0000000000 R12: 0000000000000003
> >> R13: ffff888100833900 R14: ffffffff8e112850 R15: 0000000000000000
> >> default_idle_call+0x67/0xa0
> >> do_idle+0x361/0x440
> >> cpu_startup_entry+0x18/0x20
> >> start_secondary+0x256/0x300
> >> secondary_startup_64_no_verify+0xce/0xdb
> >> </TASK>
> >> Modules linked in:
> >> CR2: 0000000000000000
> >> ---[ end trace 0000000000000000 ]---
> >> RIP: 0010:0x0
> >> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>
> I have seen this exact signature when the processor tries to execute a
> function that has a NULL address. That sends IP to 0 and triggers the
> exception. Sounds like something corrupted rcu_head (just a guess).

I did a quick test that directly invokes "call_rcu(head, NULL)". The kernel then
panicked with almost the same call trace as above and with the same RIP:

RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.

Invoking "call_rcu(head, NULL + 1)" instead gives:

RIP: 0010:0x1
Code: Unable to access opcode bytes at 0xffffffffffffffd7.

Invoking "call_rcu(head, NULL + 2)" instead gives:

RIP: 0010:0x2
Code: Unable to access opcode bytes at 0xffffffffffffffd8.
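In all three logs the address in the "Code:" line sits exactly 42 bytes below RIP, wrapping around the top of the 64-bit address space. That fits the x86 oops code starting its opcode dump a fixed number of bytes before the faulting instruction pointer. A small sketch of the arithmetic follows; the constant name PROLOGUE_BYTES_MODEL and the helper opcode_dump_start() are made up for illustration, and the value 42 is inferred from the logs rather than quoted from the kernel source.

```c
#include <stdint.h>

/*
 * Assumption: the oops code begins its "Code:" dump this many bytes
 * before the faulting RIP. 42 is the value that fits all three logs.
 */
#define PROLOGUE_BYTES_MODEL 42u

/*
 * Address reported by "Unable to access opcode bytes at ..." for a given
 * RIP. Unsigned subtraction wraps modulo 2^64, which is why a RIP of 0
 * yields an address at the very top of the address space.
 */
static uint64_t opcode_dump_start(uint64_t rip)
{
	return rip - PROLOGUE_BYTES_MODEL;
}
```

Calling opcode_dump_start() with 0, 1 and 2 gives the three addresses shown in the logs, so the "Code:" address is just a function of RIP and carries no extra information about the corruption.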

The logs above suggest that your guess (a corrupted rcu_head) is reasonable.
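To illustrate why a NULL (or overwritten) callback pointer only blows up later, inside rcu_core, here is a minimal userspace model of a deferred callback queue. It is a sketch only and not the kernel's implementation: rcu_head_model, call_rcu_model(), run_callbacks_model() and count_cb() are hypothetical names, there is no grace period or locking, and the only detail borrowed from the kernel is that the head carries a next pointer and a func pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's rcu_head: a link to the next
 * queued callback plus the function to invoke later. */
struct rcu_head_model {
	struct rcu_head_model *next;
	void (*func)(struct rcu_head_model *head);
};

typedef void (*rcu_callback_model_t)(struct rcu_head_model *head);

/* Pending callbacks (hypothetical, single-threaded, no grace period). */
static struct rcu_head_model *pending;
static struct rcu_head_model **pending_tail = &pending;

/* Counts callbacks that actually ran, so the effect is observable. */
static int invoked;

static void count_cb(struct rcu_head_model *head)
{
	(void)head;
	invoked++;
}

/* Queue a callback. As in the quick test above, a NULL func is accepted
 * here and only matters when the queue is drained. */
static void call_rcu_model(struct rcu_head_model *head,
			   rcu_callback_model_t func)
{
	assert(head != NULL);
	head->func = func;
	head->next = NULL;
	*pending_tail = head;
	pending_tail = &head->next;
}

/*
 * Drain the queue. An unconditional indirect call through a NULL func
 * would become an instruction fetch from address 0, matching RIP 0x0 in
 * the report. This model checks first and reports instead of crashing.
 * Returns the number of entries whose func was NULL.
 */
static int run_callbacks_model(void)
{
	struct rcu_head_model *head = pending;
	int null_funcs = 0;

	pending = NULL;
	pending_tail = &pending;

	while (head) {
		/* Read the link first: a callback may free or reuse head. */
		struct rcu_head_model *next = head->next;

		if (head->func == NULL) {
			printf("NULL func: an indirect call would fetch from address 0\n");
			null_funcs++;
		} else {
			head->func(head);
		}
		head = next;
	}
	return null_funcs;
}
```

Queuing one head with count_cb and another with NULL, then calling run_callbacks_model(), runs the first callback and reports the second. The same outcome follows if func was valid at queue time and the memory holding the head was overwritten before the drain, which is the corruption scenario being guessed at in this thread.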

-Qiuxu

2023-02-28 23:51:38

by Joel Fernandes

[permalink] [raw]
Subject: Re: BUG: unable to handle kernel NULL pointer dereference in rcu_core

On Tue, Feb 28, 2023 at 3:55 AM Zhuo, Qiuxu <[email protected]> wrote:
>
> > From: Joel Fernandes <[email protected]>
> > Sent: Monday, February 27, 2023 9:15 PM
> > To: Zhouyi Zhou <[email protected]>
> > Cc: Sanan Hasanov <[email protected]>; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]
> > Subject: Re: BUG: unable to handle kernel NULL pointer dereference in
> > rcu_core
> >
> > ...
> >
> > I have seen this exact signature when the processor tries to execute a
> > function that has a NULL address. That sends IP to 0 and triggers the
> > exception. Sounds like something corrupted rcu_head (just a guess).
>
> I did a quick test that directly invokes "call_rcu(head, NULL)". The kernel then
> panicked with almost the same call trace as above and with the same RIP:
>
> RIP: 0010:0x0
> Code: Unable to access opcode bytes at 0xffffffffffffffd6.
>
> Invoking "call_rcu(head, NULL + 1)" instead gives:
>
> RIP: 0010:0x1
> Code: Unable to access opcode bytes at 0xffffffffffffffd7.
>
> Invoking "call_rcu(head, NULL + 2)" instead gives:
>
> RIP: 0010:0x2
> Code: Unable to access opcode bytes at 0xffffffffffffffd8.
>
> The logs above suggest that your guess (a corrupted rcu_head) is reasonable.
>

Good that you double checked and kept me honest about my analysis. ;-)

- Joel