2020-06-05 08:40:27

by Christian Kujau

[permalink] [raw]
Subject: 5.7.0 / BUG: kernel NULL pointer dereference / setup_cpu_watcher

Hi,

I'm running a small Xen PVH domain and upgrading from vanilla 5.6.0 to
5.7.0 caused the splat below, really early during boot. The configuration
has not changed, all new "make oldconfig" prompts have been answered with
"N". Old and new config, dmesg are here:

http://nerdbynature.de/bits/5.7.0/

Searching the interwebs for similar reports didn't return much:

* drm_sched_get_cleanup_job: BUG: kernel NULL pointer dereference
https://bugzilla.redhat.com/show_bug.cgi?id=1822984 -- but this
appears to be really DRM related. - https://lkml.org/lkml/2020/4/10/545

* A recent mm/vmstat patch, mentioning "device_offline" in its output
https://patchwork.kernel.org/patch/11563009/

But other than a few overlapping strings, I guess all of that is totally
unrelated :(

Thanks,
Christian.


Note: that "Xen Platform PCI: unrecognised magic value" on the top appears
in 5.6 kernels as well, but no ill effects so far.

---------------------------------------------------------------
Xen Platform PCI: unrecognised magic value
ACPI: No IOAPIC entries present
BUG: kernel NULL pointer dereference, address: 00000000000002d0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0 #2
RIP: 0010:device_offline+0x8/0xf0
Code: 48 89 e7 e8 3a ee f3 ff 4c 89 e0 48 83 c4 10 5b 41 5c c3 45 31 e4 48 83 c4 10 4c 89 e0 5b 41 5c c3 90 41 54 55 53 48 83 ec 10 <f6> 87 d0 02 00 00 01 0f 85 ca 00 00 00 48 89 fb 48 8b 7f 48 48 85
RSP: 0000:ffffbd9100013e78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000820001fa
RDX: ffff9c9c3dd00000 RSI: 00000000820001fa RDI: 0000000000000000
RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
R10: ffff9c9c3d5072a8 R11: 0000000000000000 R12: ffff9c9c3d594720
R13: ffffffff8a57e5a8 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff9c9c3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002d0 CR3: 000000006b00a001 CR4: 00000000001606b0
Call Trace:
setup_cpu_watcher+0x44/0x60
? plt_clk_driver_init+0xe/0xe
setup_vcpu_hotplug_event+0x23/0x26
do_one_initcall+0x47/0x180
kernel_init_freeable+0x13b/0x19d
? rest_init+0x95/0x95
kernel_init+0x5/0xeb
ret_from_fork+0x35/0x40
Modules linked in:
CR2: 00000000000002d0
---[ end trace b0cc587db609787f ]---

--
BOFH excuse #440:

Cache miss - please take better aim next time


2020-06-05 09:21:04

by Juergen Gross

[permalink] [raw]
Subject: Re: 5.7.0 / BUG: kernel NULL pointer dereference / setup_cpu_watcher

On 05.06.20 10:36, Christian Kujau wrote:
> Hi,
>
> I'm running a small Xen PVH domain and upgrading from vanilla 5.6.0 to
> 5.7.0 caused the splat below, really early during boot. The configuration
> has not changed, all new "make oldconfig" prompts have been answered with
> "N". Old and new config, dmesg are here:
>
> http://nerdbynature.de/bits/5.7.0/
>
> Searching the interwebs for similar reports didn't return much:
>
> * drm_sched_get_cleanup_job: BUG: kernel NULL pointer dereference
> https://bugzilla.redhat.com/show_bug.cgi?id=1822984 -- but this
> appears to be really DRM related. - https://lkml.org/lkml/2020/4/10/545
>
> * A recent mm/vmstat patch, mentioning "device_offline" in its output
> https://patchwork.kernel.org/patch/11563009/
>
> But other than a few overlapping strings, I guess all of that is totally
> unrelated :(

Do you happen to start the guest with vcpus < maxvcpus?

If yes there is already a patch pending for 5.8:

https://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git/commit/?h=for-linus-5.8&id=c54b071c192dfe8061336f650ceaf358e6386e0b


Juergen

>
> Thanks,
> Christian.
>
>
> Note: that "Xen Platform PCI: unrecognised magic value" on the top appears
> in 5.6 kernels as well, but no ill effects so far.
>
> ---------------------------------------------------------------
> Xen Platform PCI: unrecognised magic value
> ACPI: No IOAPIC entries present
> BUG: kernel NULL pointer dereference, address: 00000000000002d0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP PTI
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.7.0 #2
> RIP: 0010:device_offline+0x8/0xf0
> Code: 48 89 e7 e8 3a ee f3 ff 4c 89 e0 48 83 c4 10 5b 41 5c c3 45 31 e4 48 83 c4 10 4c 89 e0 5b 41 5c c3 90 41 54 55 53 48 83 ec 10 <f6> 87 d0 02 00 00 01 0f 85 ca 00 00 00 48 89 fb 48 8b 7f 48 48 85
> RSP: 0000:ffffbd9100013e78 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000820001fa
> RDX: ffff9c9c3dd00000 RSI: 00000000820001fa RDI: 0000000000000000
> RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
> R10: ffff9c9c3d5072a8 R11: 0000000000000000 R12: ffff9c9c3d594720
> R13: ffffffff8a57e5a8 R14: 0000000000000000 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff9c9c3dc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000002d0 CR3: 000000006b00a001 CR4: 00000000001606b0
> Call Trace:
> setup_cpu_watcher+0x44/0x60
> ? plt_clk_driver_init+0xe/0xe
> setup_vcpu_hotplug_event+0x23/0x26
> do_one_initcall+0x47/0x180
> kernel_init_freeable+0x13b/0x19d
> ? rest_init+0x95/0x95
> kernel_init+0x5/0xeb
> ret_from_fork+0x35/0x40
> Modules linked in:
> CR2: 00000000000002d0
> ---[ end trace b0cc587db609787f ]---
>

2020-06-05 18:05:27

by Christian Kujau

[permalink] [raw]
Subject: Re: 5.7.0 / BUG: kernel NULL pointer dereference / setup_cpu_watcher

On Fri, 5 Jun 2020, Jürgen Groß wrote:
> Do you happen to start the guest with vcpus < maxvcpus?

Indeed, I was booting with vcpus=2, maxvcpus=4. Setting both to the same
value made the domU boot.

> If yes there is already a patch pending for 5.8:
> https://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git/commit/?h=for-linus-5.8&id=c54b071c192dfe8061336f650ceaf358e6386e0b

Applied that manually and now the system boots even with vcpus < maxvcpus.
So, if this still matters:

Tested-by: Christian Kujau <[email protected]>

Thank you for your response, and the fix!

Christian.
--
BOFH excuse #146:

Communications satellite used by the military for star wars.

2020-06-05 18:12:55

by Andrew Cooper

[permalink] [raw]
Subject: Re: 5.7.0 / BUG: kernel NULL pointer dereference / setup_cpu_watcher

On 05/06/2020 09:36, Christian Kujau wrote:
> Hi,
>
> I'm running a small Xen PVH domain and upgrading from vanilla 5.6.0 to
> <snip>
>
> Note: that "Xen Platform PCI: unrecognised magic value" on the top appears
> in 5.6 kernels as well, but no ill effects so far.
>
> ---------------------------------------------------------------
> Xen Platform PCI: unrecognised magic value

PVH domains don't have the emulated platform device, so Linux will be
finding ~0 when it goes looking in config space.

The diagnostic should be skipped in that case, to avoid giving the false
impression that something is wrong.

~Andrew

2020-06-05 22:24:10

by Christian Kujau

[permalink] [raw]
Subject: Re: 5.7.0 / BUG: kernel NULL pointer dereference / setup_cpu_watcher

On Fri, 5 Jun 2020, Andrew Cooper wrote:
> PVH domains don't have the emulated platform device, so Linux will be
> finding ~0 when it goes looking in config space.
>
> The diagnostic should be skipped in that case, to avoid giving the false
> impression that something is wrong.

Understood, thanks for explaining that. I won't be able to edit
arch/x86/xen/platform-pci-unplug.c to fix that though :-\

Christian.
--
BOFH excuse #134:

because of network lag due to too many people playing deathmatch