2024-03-01 21:36:32

by Mark Brown

Subject: x86 boot issues in -next

Hi,

For the past few days -next has been failing to boot an x86_64 defconfig
on the x86 machine Linaro has available in their lab. DMI says it's a
"Dell Inc. PowerEdge R200/0TY019, BIOS 1.4.3 05/15/2009" and the CPU is
described as "Intel(R) Xeon(R) CPU X3220 @ 2.40GHz (family: 0x6, model:
0xf, stepping: 0xb)". It's running happily with mainline and
pending-fixes.

The kernel crashes with:

[ 2.012730] PCI: CLS 64 bytes, default 64
[ 2.016743] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 2.023181] software IO TLB: mapped [mem 0x00000000cbeb2000-0x00000000cfeb2000] (64MB)
[ 2.032236] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 2.032914] #PF: supervisor read access in kernel mode
[ 2.032914] #PF: error_code(0x0000) - not-present page
[ 2.032914] PGD 0 P4D 0
[ 2.032914] Oops: 0000 [#1] PREEMPT SMP PTI
[ 2.032914] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc6-next-20240229 #1
[ 2.032914] Hardware name: Dell Inc. PowerEdge R200/0TY019, BIOS 1.4.3 05/15/2009
[ 2.032914] RIP: 0010:exra_is_visible+0xf/0x20
[ 2.032914] Code: b7 46 08 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 31 c0 83 3d 1b f0 c9 01 01 7e 04 <0f> b7 46 08 c3 cc cc cc cc 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[ 2.032914] RSP: 0000:ffffac7cc001bdd0 EFLAGS: 00010202
[ 2.032914] RAX: 0000000000000000 RBX: ffffffff90812ea0 RCX: ffffffff8ee0ef00
[ 2.032914] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97a681064c00
[ 2.032914] RBP: 0000000000000001 R08: 0000000000000228 R09: ffff97a680127310
[ 2.032914] R10: ffff97a68049ddd0 R11: 0000000000000000 R12: ffff97a681064c00
[ 2.032914] R13: ffffffff90812dc0 R14: 0000000000000001 R15: 0000000000000000
[ 2.032914] FS: 0000000000000000(0000) GS:ffff97a7a7c00000(0000) knlGS:0000000000000000
[ 2.032914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.032914] CR2: 0000000000000008 CR3: 00000000bf42e000 CR4: 00000000000006f0
[ 2.032914] Call Trace:
[ 2.032914] <TASK>
[ 2.032914] ? __die+0x1e/0x60
[ 2.032914] ? page_fault_oops+0x17b/0x480
[ 2.032914] ? search_module_extables+0x14/0x50
[ 2.032914] ? exc_page_fault+0x6b/0x150
[ 2.032914] ? asm_exc_page_fault+0x26/0x30
[ 2.032914] ? __pfx_exra_is_visible+0x10/0x10
[ 2.032914] ? exra_is_visible+0xf/0x20
[ 2.032914] internal_create_group+0x9c/0x400
[ 2.032914] internal_create_groups+0x3d/0xa0
[ 2.032914] pmu_dev_alloc+0xbb/0xe0
[ 2.032914] perf_event_sysfs_init+0x51/0xa0
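
As a cross-check on the oops above, the faulting bytes marked in the Code: line can be decoded by hand. The sketch below is a minimal, hypothetical helper, not a general x86 disassembler: `decode_movzx_disp8` and the constant names are introduced here for illustration. It handles only the one encoding seen in this trace (`0f b7` with a base register and an 8-bit displacement), and it assumes the trapping instruction starts at the `<..>` marker. The byte string is an excerpt of the Code: line, and the register values are copied from the dump.

```python
# Excerpt of the "Code:" line from the oops; <..> marks the faulting byte.
CODE_EXCERPT = "31 c0 83 3d 1b f0 c9 01 01 7e 04 <0f> b7 46 08 c3"

# Values copied from the register dump in the same oops.
REGISTERS = {"rsi": 0x0000000000000000}
CR2 = 0x0000000000000008

# ModRM r/m field to 64-bit base register (r/m == 4 means a SIB byte follows).
RM_TO_REG = ["rax", "rcx", "rdx", "rbx", None, "rbp", "rsi", "rdi"]


def decode_movzx_disp8(code):
    """Decode 'movzwl disp8(%base), %reg32' starting at the <..> marker."""
    tokens = code.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    insn = [int(t.strip("<>"), 16) for t in tokens[start:]]
    if insn[:2] != [0x0F, 0xB7]:
        raise ValueError("not a 16-bit zero-extending load (0f b7)")
    modrm = insn[2]
    mod, rm = modrm >> 6, modrm & 0b111
    if mod != 0b01 or RM_TO_REG[rm] is None:
        raise ValueError("only the base+disp8 form is handled here")
    # The 8-bit displacement is sign-extended before being added to the base.
    disp = insn[3] - 0x100 if insn[3] >= 0x80 else insn[3]
    return RM_TO_REG[rm], disp


base, disp = decode_movzx_disp8(CODE_EXCERPT)
fault_address = REGISTERS[base] + disp
print(f"load from {disp:#x}(%{base}) -> address {fault_address:#x}")
```

The computed address matches CR2: a 16-bit read at offset 8 from a NULL pointer held in RSI, which in the x86-64 calling convention carries the second function argument.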

A full boot log for a sample failure can be seen at:

https://validation.linaro.org/scheduler/job/4045256

I bisected this; the bisect seemed to run smoothly and landed on commit
f031242dbf22 ("Merge branch into tip/master: 'x86/apic'") with both
parents being fine (full log below). An issue in the x86 tree does seem
plausible but I haven't investigated further.

git bisect start
# bad: [1870cdc0e8dee32e3c221704a2977898ba4c10e8] Add linux-next specific files for 20240301
git bisect bad 1870cdc0e8dee32e3c221704a2977898ba4c10e8
# good: [d1e87c1d8f90f27a1ca3c90d9de048602beabc61] Merge branch 'for-linux-next-fixes' of git://anongit.freedesktop.org/drm/drm-misc
git bisect good d1e87c1d8f90f27a1ca3c90d9de048602beabc61
# good: [907d374fa897fbbcdce1e027297d933bbab025e1] Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
git bisect good 907d374fa897fbbcdce1e027297d933bbab025e1
# good: [c32a1272f32bf6189357816f510edf7411ecd0ba] Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
git bisect good c32a1272f32bf6189357816f510edf7411ecd0ba
# bad: [2faf0484495c5288200f710966ec252fee2415a9] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt.git
git bisect bad 2faf0484495c5288200f710966ec252fee2415a9
# bad: [0fa7bf8bf39eff1d304d2d46c92d94131205036f] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect bad 0fa7bf8bf39eff1d304d2d46c92d94131205036f
# bad: [f031242dbf22fc9c850946253324c72611a8b253] Merge branch into tip/master: 'x86/apic'
git bisect bad f031242dbf22fc9c850946253324c72611a8b253
# good: [f631c66c30faa6900e05aec06e7d0a541040dbde] Merge branch into tip/master: 'irq/msi'
git bisect good f631c66c30faa6900e05aec06e7d0a541040dbde
# good: [7f204eefe00c7a6677fc1dd515a02eb8b9c57495] Merge branch into tip/master: 'timers/core'
git bisect good 7f204eefe00c7a6677fc1dd515a02eb8b9c57495
# good: [c0a66c2847908e41c771ca2355fba935a82a9f62] x86/cpu/topology: Move registration out of APIC code
git bisect good c0a66c2847908e41c771ca2355fba935a82a9f62
# good: [882e0cff9ef340e7a47659a9aab9da64f4b9b847] x86/cpu/topology: Mop up primary thread mask handling
git bisect good 882e0cff9ef340e7a47659a9aab9da64f4b9b847
# good: [6be4ec29685c216ebec61d35f56c3808092498aa] x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
git bisect good 6be4ec29685c216ebec61d35f56c3808092498aa
# good: [9be3b2f057d7a6752e8cf25c1d456198b4d3bd6a] ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
git bisect good 9be3b2f057d7a6752e8cf25c1d456198b4d3bd6a
# good: [27f6a9c87a97f5ea7459be08d5be231af6b32c20] kvmclock: Unexport kvmclock clocksource
git bisect good 27f6a9c87a97f5ea7459be08d5be231af6b32c20
# good: [9b9c280b9af2aa851d83e7d0b79f36a3d869d745] Merge branch 'x86/urgent' into x86/apic, to resolve conflicts
git bisect good 9b9c280b9af2aa851d83e7d0b79f36a3d869d745
# good: [670c000c11ec5c5131cdf502d062075a803214af] Merge branch into tip/master: 'timers/ptp'
git bisect good 670c000c11ec5c5131cdf502d062075a803214af
# first bad commit: [f031242dbf22fc9c850946253324c72611a8b253] Merge branch into tip/master: 'x86/apic'
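
For anyone who wants to post-process a log like the one above, here is a minimal sketch that pulls out the verdicts and the final result. It assumes the `# good:`, `# bad:` and `# first bad commit:` comment lines keep the layout shown in this log. `summarize` is a hypothetical helper name, the embedded log is a shortened excerpt, and treating a subject that starts with "Merge branch" as a merge commit is only a heuristic.

```python
import re

# Shortened excerpt of the bisect log above (comment lines only).
BISECT_LOG = """\
git bisect start
# bad: [1870cdc0e8dee32e3c221704a2977898ba4c10e8] Add linux-next specific files for 20240301
# good: [d1e87c1d8f90f27a1ca3c90d9de048602beabc61] Merge branch 'for-linux-next-fixes' of git://anongit.freedesktop.org/drm/drm-misc
# bad: [f031242dbf22fc9c850946253324c72611a8b253] Merge branch into tip/master: 'x86/apic'
# good: [9b9c280b9af2aa851d83e7d0b79f36a3d869d745] Merge branch 'x86/urgent' into x86/apic, to resolve conflicts
# first bad commit: [f031242dbf22fc9c850946253324c72611a8b253] Merge branch into tip/master: 'x86/apic'
"""

LINE_RE = re.compile(r"^# (good|bad|first bad commit): \[([0-9a-f]{7,40})\] (.*)$")


def summarize(log):
    """Return (list of (verdict, sha, subject), first-bad (sha, subject) or None)."""
    verdicts = []
    first_bad = None
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip the 'git bisect ...' command lines
        kind, sha, subject = match.groups()
        if kind == "first bad commit":
            first_bad = (sha, subject)
        else:
            verdicts.append((kind, sha, subject))
    return verdicts, first_bad


verdicts, first_bad = summarize(BISECT_LOG)
good = sum(1 for kind, _, _ in verdicts if kind == "good")
bad = sum(1 for kind, _, _ in verdicts if kind == "bad")
looks_like_merge = first_bad is not None and first_bad[1].startswith("Merge branch")
print(f"{good} good, {bad} bad, first bad: {first_bad[0][:12]} (merge: {looks_like_merge})")
```

When the first bad commit is a merge, the log alone cannot say which side is at fault. Testing each parent separately, as was done here, is the usual next step.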



2024-03-01 21:48:52

by Dave Hansen

Subject: Re: x86 boot issues in -next

On 3/1/24 13:29, Mark Brown wrote:
> For the past few days -next has been failing to boot an x86_64 defconfig
> on the x86 machine Linaro has available in their lab. DMI says it's a
> "Dell Inc. PowerEdge R200/0TY019, BIOS 1.4.3 05/15/2009" and the CPU is
> described as "Intel(R) Xeon(R) CPU X3220 @ 2.40GHz (family: 0x6, model:
> 0xf, stepping: 0xb)", it's running happily with mainline and
> pending-fixes.

This wouldn't explain the bisect results, but there's been a crash fixed
in here:

> https://lore.kernel.org/all/170863445442.1479840.1818801787239831650.stgit@dwillia2-xfh.jf.intel.com/

that looks pretty similar to your signature.

Could you give Dan's patch a shot?



2024-03-01 22:21:50

by Dan Williams

Subject: Re: x86 boot issues in -next

[ add Greg ]

Dave Hansen wrote:
> On 3/1/24 13:29, Mark Brown wrote:
> > For the past few days -next has been failing to boot an x86_64 defconfig
> > on the x86 machine Linaro has available in their lab. DMI says it's a
> > "Dell Inc. PowerEdge R200/0TY019, BIOS 1.4.3 05/15/2009" and the CPU is
> > described as "Intel(R) Xeon(R) CPU X3220 @ 2.40GHz (family: 0x6, model:
> > 0xf, stepping: 0xb)", it's running happily with mainline and
> > pending-fixes.
>
> This wouldn't explain the bisect results, but there's been a crash fixed
> in here:
>
> > https://lore.kernel.org/all/170863445442.1479840.1818801787239831650.stgit@dwillia2-xfh.jf.intel.com/
>
> that looks pretty similar to your signature.
>
> Could you give Dan's patch a shot?
>

Hey Greg, this indeed looks like something that will be fixed when you
update driver-core-next.

http://lore.kernel.org/r/2024022342-unbroken-september-e58d@gregkh

2024-03-02 15:22:49

by Greg KH

Subject: Re: x86 boot issues in -next

On Fri, Mar 01, 2024 at 02:17:40PM -0800, Dan Williams wrote:
> [ add Greg ]
>
> Dave Hansen wrote:
> > On 3/1/24 13:29, Mark Brown wrote:
> > > For the past few days -next has been failing to boot an x86_64 defconfig
> > > on the x86 machine Linaro has available in their lab. DMI says it's a
> > > "Dell Inc. PowerEdge R200/0TY019, BIOS 1.4.3 05/15/2009" and the CPU is
> > > described as "Intel(R) Xeon(R) CPU X3220 @ 2.40GHz (family: 0x6, model:
> > > 0xf, stepping: 0xb)", it's running happily with mainline and
> > > pending-fixes.
> >
> > This wouldn't explain the bisect results, but there's been a crash fixed
> > in here:
> >
> > > https://lore.kernel.org/all/170863445442.1479840.1818801787239831650.stgit@dwillia2-xfh.jf.intel.com/
> >
> > that looks pretty similar to your signature.
> >
> > Could you give Dan's patch a shot?
> >
>
> Hey Greg, this indeed looks like something that will be fixed when you
> update driver-core-next.
>
> http://lore.kernel.org/r/2024022342-unbroken-september-e58d@gregkh

Ick, I forgot to push out my local tree, sorry about that!

Now done.

greg k-h

2024-03-04 18:26:51

by Mark Brown

Subject: Re: x86 boot issues in -next

On Fri, Mar 01, 2024 at 01:48:41PM -0800, Dave Hansen wrote:
> On 3/1/24 13:29, Mark Brown wrote:

> > For the past few days -next has been failing to boot an x86_64 defconfig
> > on the x86 machine Linaro has available in their lab. DMI says it's a
> > "Dell Inc. PowerEdge R200/0TY019, BIOS 1.4.3 05/15/2009" and the CPU is
> > described as "Intel(R) Xeon(R) CPU X3220 @ 2.40GHz (family: 0x6, model:
> > 0xf, stepping: 0xb)", it's running happily with mainline and
> > pending-fixes.

> This wouldn't explain the bisect results, but there's been a crash fixed
> in here:

> > https://lore.kernel.org/all/170863445442.1479840.1818801787239831650.stgit@dwillia2-xfh.jf.intel.com/

> that looks pretty similar to your signature.

> Could you give Dan's patch a shot?

Whatever the issue was, it's gone today. Dan's patch is in -next so I'm
guessing it may well have been it. I'm guessing the bisection result
might've been due to some combination of the two trees causing an empty
group to get added? Thanks for looking into it.
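
To make the "empty group" guess concrete: the trace shows internal_create_group() calling exra_is_visible() with a NULL second argument (RSI is 0). The toy model below is neither kernel code nor Dan's patch, and `Attribute`, `Group`, `mode_if_visible`, `create_group_unguarded` and `create_group_guarded` are all hypothetical names. It assumes attribute arrays are NULL-terminated (modelled as a trailing `None`) and that a group-level visibility probe passes the first array entry to the callback. For an empty array, that first entry is the terminator itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attribute:
    name: str
    mode: int


@dataclass
class Group:
    name: str
    attrs: List[Optional[Attribute]]  # terminated by None, like a C array
    is_visible: Callable[[Optional[Attribute]], int]


def mode_if_visible(attr):
    # Stand-in for a callback that reads attr.mode without checking for None.
    return attr.mode


def create_group_unguarded(grp):
    # Probes visibility with the first entry, even if it is the terminator.
    return grp.is_visible(grp.attrs[0])


def create_group_guarded(grp):
    # Skips the probe when the array is empty (first entry is the terminator).
    if grp.attrs and grp.attrs[0] is not None:
        return grp.is_visible(grp.attrs[0])
    return 0


empty = Group("empty", [None], mode_if_visible)
populated = Group("populated", [Attribute("attr0", 0o444), None], mode_if_visible)

try:
    create_group_unguarded(empty)
    crashed = False
except AttributeError:  # Python's analogue of the NULL pointer dereference
    crashed = True

print("unguarded probe on empty group crashed:", crashed)
print("guarded probe on empty group returned:", create_group_guarded(empty))
```

In this model neither ingredient crashes alone. The unguarded probe is fine with a populated group, and an empty group is fine if nothing probes it. That would be consistent with a merge being the first bad commit while both parents boot.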

