From: Rui Wang <rui.y.wang@intel.com>
To: bp@suse.de
Cc: tony.luck@intel.com, gong.chen@intel.com, linux-kernel@vger.kernel.org,
        Rui Wang <rui.y.wang@intel.com>
Subject: Re: MCE bug?
Date: Thu, 18 Jun 2015 17:18:45 +0800
Message-Id: <1434619125-7142-1-git-send-email-rui.y.wang@intel.com>
Sender: linux-kernel-owner@vger.kernel.org

> On Wed, Jun 17, 2015 at 11:41:56AM +0200, Borislav Petkov wrote:
>> And I was waiting in line to get a chance to do some injection on our 
>> EINJ box here too. But it seems you have the required setup already so 
>> if you want to give those changes a run, I've uploaded them here:
>> 
>> git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git#tip-ras
>> 
>> It'll be much appreciated.
>
> and the answer is <drum roll> ....
> 
> 
> 
> no. :-(

I see a different panic with this kernel, although not on every boot.
It happened while booting after a reboot caused by injected errors.

[    0.234672] mce: CPU supports 22 MCE banks
[    0.239291] CPU0: Thermal monitoring enabled (TM1)
[    0.244680] process: using mwait in idle threads
[    0.249844] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[    0.256654] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[    0.264330] Freeing SMP alternatives memory: 20K (ffffffff81d1e000 - ffffffff81d23000)
[    0.274057] ftrace: allocating 22650 entries in 89 pages
[    0.289946] x2apic: IRQ remapping doesn't support X2APIC mode
[    0.296505] Switched APIC routing to physical flat.
[    0.302838] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.349289] smpboot: CPU0: Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz (fam: 06, model: 3f, stepping: 03)
[    0.359844] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Intel PMU driver.
[    0.371173] ... version:                3
[    0.375649] ... bit width:              48
[    0.380222] ... generic registers:      4
[    0.384698] ... value mask:             0000ffffffffffff
[    0.390632] ... max period:             0000ffffffffffff
[    0.396566] ... fixed-purpose events:   3
[    0.401043] ... event mask:             000000070000000f
[    0.410260] x86: Booting SMP configuration:
[    0.414933] .... node  #0, CPUs:          #1   #2   #3   #4   #5   #6   #7   #8   #9  #10  #11  #12  #13  #14  #15  #16  #17
[    0.706763] .... node  #1, CPUs:    #18
[    0.822565] mce: [Hardware Error]: Machine check events logged
[    0.822801]   #19  #20  #21  #22  #23  #24  #25  #26  #27  #28  #29  #30  #31  #32  #33
[    1.078660] mce: [Hardware Error]: Machine check events logged
[    1.093416]   #34
[    1.095433] BUG: unable to handle kernel
[    1.100045]   #35
[    1.102193] NULL pointer dereference at 0000000000000008
[    1.108126] IP: [<ffffffff8107ed01>] pool_mayday_timeout+0x81/0x150
[    1.111969]
[    1.116818] .... node  #0, CPUs:    #36
[    1.121101] PGD 0
[    1.123348] Oops: 0000 [#1] SMP
[    1.126975] Modules linked in:
[    1.130402] CPU: 33 PID: 0 Comm: swapper/33 Not tainted 4.1.0-rc3-7-default+ #1
[    1.138570] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0059.R00.1501081238 01/08/2015
[    1.150134] task: ffff88046e86e0d0 ti: ffff88046e874000 task.ti: ffff88046e874000
[    1.158496] RIP: 0010:[<ffffffff8107ed01>]  [<ffffffff8107ed01>] pool_mayday_timeout+0x81/0x150
[    1.168228] RSP: 0000:ffff88087f5e3e08  EFLAGS: 00010046
[    1.174164] RAX: 0000000fffffffe0 RBX: 0000000000000000 RCX: 0000000000000000
[    1.182135] RDX: ffff88087f5f4898 RSI: ffffffff8107ec80 RDI: ffffffff81dd332c
[    1.190108] RBP: ffff88087f5e3e48 R08: 0000000000000000 R09: ffff88087f5ed8c0
[    1.198080] R10: 0000000000000004 R11: 0000000000000005 R12: ffffffff81d4d880
[    1.206052] R13: 0000000000000101 R14: ffffffff8107ec80 R15: ffff88087f5f4880
[    1.214026] FS:  0000000000000000(0000) GS:ffff88087f5e0000(0000) knlGS:0000000000000000
[    1.223066] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.229486]   #37
[    1.229486] CR2: 0000000000000008 CR3: 0000000001a0e000 CR4: 00000000001406e0
[    1.239605] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.247578]   #38
[    1.247578] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    1.257697]   #39
[    1.257697] Stack:
[    1.262090]  ffff88087f5e3e48 ffffffff810bf45b 0000000000000021 ffff88087f5ed8c0
[    1.270398]  ffff88087f5f4910[    1.273400]   #40
 0000000000000101 ffffffff8107ec80 ffff88087f5f4880
[    1.280867]  ffff88087f5e3e88 ffffffff810cf559 ffff88087f5e3e88 ffff88087f5ed8c0
[    1.289177] Call Trace:
[    1.291910]   #41
[    1.294068]  <IRQ>
[    1.294068]  [<ffffffff810bf45b>] ? console_unlock+0x1fb/0x460
[    1.302927]  [<ffffffff8107ec80>] ? wq_unbind_fn+0x130/0x130
[    1.309242]   #42
[    1.309242]  [<ffffffff810cf559>] call_timer_fn+0x39/0x130
[    1.317509]  [<ffffffff8107ec80>] ? wq_unbind_fn+0x130/0x130
[    1.323833]   #43
[    1.323834]  [<ffffffff810d1041>] run_timer_softirq+0x211/0x300
[    1.332598]  [<ffffffff8106a874>] __do_softirq+0xe4/0x290
[    1.338629]  [<ffffffff8106ac8d>] irq_exit+0x9d/0xb0
[    1.344177]   #44
[    1.344177]  [<ffffffff8103daba>] smp_apic_timer_interrupt+0x4a/0x60
[    1.353424]  [<ffffffff815b53fe>] apic_timer_interrupt+0x6e/0x80
[    1.360135]   #45
[    1.362292]  <EOI>
[    1.362292]  [<ffffffff8100d7ad>] ? mwait_idle+0x6d/0x90
[    1.370568]  [<ffffffff8100e0cf>] arch_cpu_idle+0xf/0x20
[    1.376507]   #46
[    1.376507]  [<ffffffff810aafe4>] cpu_startup_entry+0x2f4/0x3c0
[    1.385274]  [<ffffffff8103b7e3>] start_secondary+0x143/0x170
[    1.391694]   #47
[    1.391694] Code: 49 83 ec 08 31 c9 eb 14 66 90 49 8b 44 24 08 48 39 c2 4c 8d 60 f8 0f 84 8e 00 00 00 49 8b 04 [    1.404957]   #48
24 48 89 c3 30 db a8 04 48 0f 44 d9 <4c> 8b 6b 08 49 83 bd 90 00 00 00 00 74 d1 4c 8d b3 80 00 00 00
[    1.417801] RIP  [<ffffffff8107ed01>] pool_mayday_timeout+0x81/0x150
[    1.424914]   #49
[    1.424914]  RSP <ffff88087f5e3e08>
[    1.430955] CR2: 0000000000000008
[    1.434665] ---[ end trace 4b134008a4be60b6 ]---
[    1.439823]   #50
[    1.439824] Kernel panic - not syncing: Fatal exception in interrupt
[    1.449088] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

