LinuxLists.cc - MCE bug?

2015-06-17 02:15:08

Subject: MCE bug?

Hi Boris & Tony,

While injecting MCEs using einj, I encountered a panic:

[ 0.305697] mce: CPU supports 22 MCE banks
[ 0.310288] BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
[ 0.319057] IP: [<ffffffff8107d0f2>] __queue_work+0x32/0x370
[ 0.325398] PGD 0
[ 0.327656] Oops: 0000 [#1] SMP

...

[ 0.484045] Call Trace:
[ 0.486780] [<ffffffff8107d66b>] queue_work_on+0x2b/0x50
[ 0.492821] [<ffffffff8102e019>] mce_schedule_work.part.16+0x29/0x30
[ 0.500020] [<ffffffff8102f0d9>] machine_check_poll+0x249/0x260
[ 0.506733] [<ffffffff8102f123>] __mcheck_cpu_init_generic+0x33/0x100
[ 0.514018] [<ffffffff81030061>] mcheck_cpu_init+0x161/0x4b0
[ 0.520443] [<ffffffff81016095>] identify_cpu+0x365/0x450
[ 0.526576] [<ffffffff81b6144c>] identify_boot_cpu+0x10/0x7e
[ 0.532994] [<ffffffff81b614ee>] check_bugs+0x9/0x2d
[ 0.538643] [<ffffffff81b5b0a7>] start_kernel+0x469/0x495
[ 0.544771] [<ffffffff81b5aa2e>] ? set_init_arg+0x55/0x55
[ 0.550900] [<ffffffff81b5a120>] ? early_idt_handlers+0x120/0x120
[ 0.557805] [<ffffffff81b5a5ca>] x86_64_start_reservations+0x2a/0x2c
[ 0.565001] [<ffffffff81b5a709>] x86_64_start_kernel+0x13d/0x14c

It happened after the machine rebooted (due to an injected fatal error). During boot, the kernel checked the banks for leftover errors and then called mce_schedule_work() from machine_check_poll(), but that seems to be too early: system_wq wasn't allocated yet, hence the NULL pointer dereference.
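
The ordering problem described above can be modeled in user space. What follows is only a sketch under stated assumptions: `struct workqueue`, `system_wq_ptr`, `wq_init()`, `schedule_work_guarded()` and `flush_early_work()` are hypothetical names made up for illustration, not kernel APIs, and this is not necessarily how the kernel handles it. It shows one way to avoid dereferencing a queue pointer that is still NULL: record the request and replay it once the queue exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
struct workqueue {
    int queued;                         /* number of work items queued so far */
};

static struct workqueue boot_wq;        /* backing storage for the queue */
static struct workqueue *system_wq_ptr; /* stays NULL until wq_init() runs */
static bool work_pending_early;         /* set when a request arrived too early */

/* Models workqueue setup, which happens later in boot than the early caller. */
static void wq_init(void)
{
    boot_wq.queued = 0;
    system_wq_ptr = &boot_wq;
}

/*
 * Queue work if the workqueue exists; otherwise remember the request
 * instead of dereferencing a NULL pointer.
 * Returns true if the work was queued immediately.
 */
static bool schedule_work_guarded(void)
{
    if (system_wq_ptr == NULL) {
        work_pending_early = true;
        return false;
    }
    system_wq_ptr->queued++;
    return true;
}

/* Replays a deferred request; must only be called after wq_init(). */
static void flush_early_work(void)
{
    assert(system_wq_ptr != NULL);
    if (work_pending_early) {
        system_wq_ptr->queued++;
        work_pending_early = false;
    }
}
```

In this model, a call to `schedule_work_guarded()` made before `wq_init()` returns false and only sets the flag; calling `flush_early_work()` after `wq_init()` then queues the deferred item.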

Is this a known problem? My kernel is based on Linux 4.1.0-rc3-7.

Thanks
Rui

2015-06-18 09:36:40

by Rui Wang


Subject: Re: MCE bug?

> On Wed, Jun 17, 2015 at 11:41:56AM +0200, Borislav Petkov wrote:
>> And I was waiting in line to get a chance to do some injection on our
>> EINJ box here too. But it seems you have the required setup already so
>> if you want to give those changes a run, I've uploaded them here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git#tip-ras
>>
>> It'll be much appreciated.
>
> and the answer is <drum roll> ....
>
>
>
> no. :-(

I see a different panic with this kernel. Not seen every time.
It was after reboot due to injected errors.

[ 0.234672] mce: CPU supports 22 MCE banks
[ 0.239291] CPU0: Thermal monitoring enabled (TM1)
[ 0.244680] process: using mwait in idle threads
[ 0.249844] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[ 0.256654] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[ 0.264330] Freeing SMP alternatives memory: 20K (ffffffff81d1e000 - ffffffff81d23000)
[ 0.274057] ftrace: allocating 22650 entries in 89 pages
[ 0.289946] x2apic: IRQ remapping doesn't support X2APIC mode
[ 0.296505] Switched APIC routing to physical flat.
[ 0.302838] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.349289] smpboot: CPU0: Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz (fam: 06, model: 3f, stepping: 03)
[ 0.359844] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Intel PMU driver.
[ 0.371173] ... version: 3
[ 0.375649] ... bit width: 48
[ 0.380222] ... generic registers: 4
[ 0.384698] ... value mask: 0000ffffffffffff
[ 0.390632] ... max period: 0000ffffffffffff
[ 0.396566] ... fixed-purpose events: 3
[ 0.401043] ... event mask: 000000070000000f
[ 0.410260] x86: Booting SMP configuration:
[ 0.414933] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17
[ 0.706763] .... node #1, CPUs: #18
[ 0.822565] mce: [Hardware Error]: Machine check events logged
[ 0.822801] #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33
[ 1.078660] mce: [Hardware Error]: Machine check events logged
[ 1.093416] #34
[ 1.095433] BUG: unable to handle kernel
[ 1.100045] #35
[ 1.102193] NULL pointer dereference at 0000000000000008
[ 1.108126] IP: [<ffffffff8107ed01>] pool_mayday_timeout+0x81/0x150
[ 1.111969]
[ 1.116818] .... node #0, CPUs: #36
[ 1.121101] PGD 0
[ 1.123348] Oops: 0000 [#1] SMP
[ 1.126975] Modules linked in:
[ 1.130402] CPU: 33 PID: 0 Comm: swapper/33 Not tainted 4.1.0-rc3-7-default+ #1
[ 1.138570] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0059.R00.1501081238 01/08/2015
[ 1.150134] task: ffff88046e86e0d0 ti: ffff88046e874000 task.ti: ffff88046e874000
[ 1.158496] RIP: 0010:[<ffffffff8107ed01>] [<ffffffff8107ed01>] pool_mayday_timeout+0x81/0x150
[ 1.168228] RSP: 0000:ffff88087f5e3e08 EFLAGS: 00010046
[ 1.174164] RAX: 0000000fffffffe0 RBX: 0000000000000000 RCX: 0000000000000000
[ 1.182135] RDX: ffff88087f5f4898 RSI: ffffffff8107ec80 RDI: ffffffff81dd332c
[ 1.190108] RBP: ffff88087f5e3e48 R08: 0000000000000000 R09: ffff88087f5ed8c0
[ 1.198080] R10: 0000000000000004 R11: 0000000000000005 R12: ffffffff81d4d880
[ 1.206052] R13: 0000000000000101 R14: ffffffff8107ec80 R15: ffff88087f5f4880
[ 1.214026] FS: 0000000000000000(0000) GS:ffff88087f5e0000(0000) knlGS:0000000000000000
[ 1.223066] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.229486] #37
[ 1.229486] CR2: 0000000000000008 CR3: 0000000001a0e000 CR4: 00000000001406e0
[ 1.239605] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.247578] #38
[ 1.247578] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.257697] #39
[ 1.257697] Stack:
[ 1.262090] ffff88087f5e3e48 ffffffff810bf45b 0000000000000021 ffff88087f5ed8c0
[ 1.270398] ffff88087f5f4910[ 1.273400] #40
0000000000000101 ffffffff8107ec80 ffff88087f5f4880
[ 1.280867] ffff88087f5e3e88 ffffffff810cf559 ffff88087f5e3e88 ffff88087f5ed8c0
[ 1.289177] Call Trace:
[ 1.291910] #41
[ 1.294068] <IRQ>
[ 1.294068] [<ffffffff810bf45b>] ? console_unlock+0x1fb/0x460
[ 1.302927] [<ffffffff8107ec80>] ? wq_unbind_fn+0x130/0x130
[ 1.309242] #42
[ 1.309242] [<ffffffff810cf559>] call_timer_fn+0x39/0x130
[ 1.317509] [<ffffffff8107ec80>] ? wq_unbind_fn+0x130/0x130
[ 1.323833] #43
[ 1.323834] [<ffffffff810d1041>] run_timer_softirq+0x211/0x300
[ 1.332598] [<ffffffff8106a874>] __do_softirq+0xe4/0x290
[ 1.338629] [<ffffffff8106ac8d>] irq_exit+0x9d/0xb0
[ 1.344177] #44
[ 1.344177] [<ffffffff8103daba>] smp_apic_timer_interrupt+0x4a/0x60
[ 1.353424] [<ffffffff815b53fe>] apic_timer_interrupt+0x6e/0x80
[ 1.360135] #45
[ 1.362292] <EOI>
[ 1.362292] [<ffffffff8100d7ad>] ? mwait_idle+0x6d/0x90
[ 1.370568] [<ffffffff8100e0cf>] arch_cpu_idle+0xf/0x20
[ 1.376507] #46
[ 1.376507] [<ffffffff810aafe4>] cpu_startup_entry+0x2f4/0x3c0
[ 1.385274] [<ffffffff8103b7e3>] start_secondary+0x143/0x170
[ 1.391694] #47
[ 1.391694] Code: 49 83 ec 08 31 c9 eb 14 66 90 49 8b 44 24 08 48 39 c2 4c 8d 60 f8 0f 84 8e 00 00 00 49 8b 04 [ 1.404957] #48
24 48 89 c3 30 db a8 04 48 0f 44 d9 <4c> 8b 6b 08 49 83 bd 90 00 00 00 00 74 d1 4c 8d b3 80 00 00 00
[ 1.417801] RIP [<ffffffff8107ed01>] pool_mayday_timeout+0x81/0x150
[ 1.424914] #49
[ 1.424914] RSP <ffff88087f5e3e08>
[ 1.430955] CR2: 0000000000000008
[ 1.434665] ---[ end trace 4b134008a4be60b6 ]---
[ 1.439823] #50
[ 1.439824] Kernel panic - not syncing: Fatal exception in interrupt
[ 1.449088] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

2015-06-18 10:03:10

by Borislav Petkov


Subject: Re: MCE bug?

On Thu, Jun 18, 2015 at 05:18:45PM +0800, Rui Wang wrote:
> I see a different panic with this kernel. Not seen every time.
> It was after reboot due to injected errors.

Yeah, Tony and I did a bit of debugging last night - this is all in the
workqueue code, which we're apparently calling into too early. I got a
box here and am trying to reproduce.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.