2022-08-02 20:52:57

by Rik van Riel

Subject: [PATCH] x86,mm: print likely CPU at segfault time

In a large enough fleet of computers, it is common to have a few bad CPUs.
Those can often be identified by seeing that some commonly run kernel code,
which runs fine everywhere else, keeps crashing on the same CPU core on one
particular bad system.

However, the failure modes in CPUs that have gone bad over the years are
often oddly specific, and the only bad behavior seen might be segfaults
in programs like bash, python, or various system daemons that run fine
everywhere else.

Add a printk to show_signal_msg() to print the CPU, core, and socket
at segfault time. This is not perfect, since the task might get rescheduled
on another CPU between when the fault hit and when the message is printed,
but in practice this has been good enough to help us identify several bad
CPU cores.

segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)

Signed-off-by: Rik van Riel <[email protected]>
CC: Dave Jones <[email protected]>
---
arch/x86/mm/fault.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fad8faa29d04..47faf7c0041e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -782,6 +782,12 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 
 	print_vma_addr(KERN_CONT " in ", regs->ip);
 
+	printk(KERN_CONT " on CPU %d (core %d, socket %d)",
+	       raw_smp_processor_id(),
+	       topology_core_id(raw_smp_processor_id()),
+	       topology_physical_package_id(raw_smp_processor_id()));
+
+
 	printk(KERN_CONT "\n");
 
 	show_opcodes(regs, loglvl);
--
2.37.1
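
For reference, the "segfault at 0 ... error 4" line quoted in the changelog
above is just what a trivial user-space NULL read produces; a minimal test
program along these lines (an illustrative sketch, not part of the patch --
the binary name in the log is simply whatever the test program is called)
can be used to see the new suffix:

	/* Fault on a user-space read of address 0 (page fault error code 4). */
	int main(void)
	{
		volatile int *p = (volatile int *)0;

		return *p;	/* SIGSEGV: read through a NULL pointer */
	}

Compiled with something like "cc -o segfault segfault.c" and run on a kernel
with this patch applied, it should log the added "on CPU %d (core %d,
socket %d)" tail after the usual segfault message.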




2022-08-03 15:17:21

by Dave Hansen

Subject: Re: [PATCH] x86,mm: print likely CPU at segfault time

On 8/2/22 13:09, Rik van Riel wrote:
> Add a printk to show_signal_msg() to print the CPU, core, and socket

Nit:    ^ printk(), please

> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -782,6 +782,12 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>
> print_vma_addr(KERN_CONT " in ", regs->ip);
>
> +	printk(KERN_CONT " on CPU %d (core %d, socket %d)",
> +	       raw_smp_processor_id(),
> +	       topology_core_id(raw_smp_processor_id()),
> +	       topology_physical_package_id(raw_smp_processor_id()));

This seems totally sane to me. I have found myself looking through
customer-provided *oopses* more than once trying to figure out if the
same CPU cores were at fault. This extends that to userspace crashes
too. I've also found myself trying to map back from logical CPU numbers
to core and package.
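
For what it's worth, the same mapping is exposed to userspace through
/sys/devices/system/cpu/cpuN/topology/, so when digging through old logs
something like the sketch below (a hypothetical little helper, not part of
this patch) should report the same core and socket numbers the new printk
prints:

	/* cputopo.c: map a logical CPU number to its core and package IDs */
	#include <stdio.h>
	#include <stdlib.h>

	static int read_topology(int cpu, const char *name)
	{
		char path[128];
		FILE *f;
		int val = -1;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, name);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%d", &val) != 1)
			val = -1;
		fclose(f);
		return val;
	}

	int main(int argc, char **argv)
	{
		int cpu = argc > 1 ? atoi(argv[1]) : 0;

		printf("CPU %d (core %d, socket %d)\n", cpu,
		       read_topology(cpu, "core_id"),
		       read_topology(cpu, "physical_package_id"));
		return 0;
	}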

One nit: Preempt is enabled here, right? I understand that this thing
is fundamentally racy, but if we did:

	int cpu = READ_ONCE(raw_smp_processor_id());

it would make it internally *consistent*. Without that, we could
theoretically get three different raw_smp_processor_id()'s. It might
even make the code look a wee bit nicer.

The changelog here is great, but a couple of comments would also be nice:

	/* This is a racy snapshot, but is better than nothing: */
	int cpu = READ_ONCE(raw_smp_processor_id());
...
	/*
	 * Dump the likely CPU where the fatal segfault happened. This
	 * can help identify buggy pieces of hardware.
	 */
	printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
	       topology_core_id(cpu),
	       topology_physical_package_id(cpu));

If you want to wait a bit and see if you get any other comments, this
seems like something we can suck in after the merge window.