Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1755321AbbKWR7p (ORCPT ); Mon, 23 Nov 2015 12:59:45 -0500
Received: from mail.skyhub.de ([78.46.96.112]:50057 "EHLO mail.skyhub.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751742AbbKWR7m (ORCPT ); Mon, 23 Nov 2015 12:59:42 -0500
Date: Mon, 23 Nov 2015 18:59:32 +0100
From: Borislav Petkov
To: "Luck, Tony"
Cc: "Chen, Gong", "linux-edac@vger.kernel.org",
        "linux-kernel@vger.kernel.org"
Subject: Re: [UNTESTED PATCH] x86, mce: Avoid double entry of deferred errors into the genpool.
Message-ID: <20151123175932.GG5134@pd.tnic>
References: <20151111193845.GA9055@agluck-desk.sc.intel.com>
        <3165a4989dcb45fc0306438d40d0cf2ace429c4c.1447280215.git.tony.luck@intel.com>
        <20151119161521.GF6065@pd.tnic>
        <3908561D78D1C84285E8C5FCA982C28F39E9C367@ORSMSX114.amr.corp.intel.com>
        <20151119203920.GH6065@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20151119203920.GH6065@pd.tnic>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 5167
Lines: 111

On Thu, Nov 19, 2015 at 09:39:20PM +0100, Borislav Petkov wrote:
> On Thu, Nov 19, 2015 at 07:33:58PM +0000, Luck, Tony wrote:
> > > Applied, thanks.
> >
> > Did you test it (note the "UNTESTED" in the subject!). My usual system for this is getting upgrades and being
> > flaky at the moment.
>
> Bah, it builds, should be enough. Ship it. :-)
>
> Lemme get a box...

Here some results:

# grep . /sys/kernel/debug/apei/einj/*
/sys/kernel/debug/apei/einj/available_error_type:0x00000002 Processor Uncorrectable non-fatal
/sys/kernel/debug/apei/einj/available_error_type:0x00000008 Memory Correctable
/sys/kernel/debug/apei/einj/available_error_type:0x00000010 Memory Uncorrectable non-fatal
grep: /sys/kernel/debug/apei/einj/error_inject: Permission denied
/sys/kernel/debug/apei/einj/error_type:0x0

Looks like some old EINJ without all the features. Oh well, let's see
what'll happen anyway:

# echo 0x8 > error_type
# echo 1 > error_inject

[ 840.461666] mce: [Hardware Error]: Machine check events logged
[ 840.476221] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 840.489214] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[ 840.507685] EDAC sbridge MC0: TSC 0
[ 840.515223] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 840.532477] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[ 840.551279] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 840.563872] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8800004100800090
[ 840.581970] EDAC sbridge MC0: TSC 0
[ 840.589513] EDAC sbridge MC0: ADDR 0 EDAC sbridge MC0: MISC 4908400040004200
[ 840.606267] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[ 841.499090] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)

So yeah, mce_notify_irq() is visible there, i.e. we did mce_log() here
which sets mce_need_notify.

# echo 0x2 > error_type
# echo 1 > error_inject
bash: echo: write error: Invalid argument

[ 885.272000] [Firmware Warn]: APEI: Invalid action table, unknown instruction type: 5

ACPI_EINJ_FLUSH_CACHELINE??

Yeah, we're missing some functionality.
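
For anyone wanting to repeat this, the inject sequence used above could be wrapped roughly as follows. This is only a sketch: einj_inject and its optional directory argument are made-up names, not part of any existing tool. It assumes the debugfs layout shown in the grep output and only checks that the firmware advertises the requested type. An advertised type can still fail on the write, as 0x2 did above.

```shell
#!/bin/bash
# Hypothetical helper (the name and arguments are illustrative only).
# usage: einj_inject <error type, e.g. 0x8> [einj debugfs dir]
einj_inject() {
    local type="$1"
    local dir="${2:-/sys/kernel/debug/apei/einj}"
    local code desc found=0

    # available_error_type has one "<hex code> <description>" pair per line.
    while read -r code desc; do
        if [ -n "$code" ] && [ $((code)) -eq $((type)) ]; then
            found=1
            break
        fi
    done < "$dir/available_error_type"

    if [ "$found" -ne 1 ]; then
        echo "error type $type not advertised by firmware" >&2
        return 1
    fi

    # Same two writes as in the manual runs: select the type, then trigger.
    echo "$type" > "$dir/error_type" || return 1
    echo 1 > "$dir/error_inject"
}
```

Called as "einj_inject 0x8" from a root shell, this does the same two writes as the manual steps. The second argument only exists so the function can be pointed at a scratch directory for a dry run.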

# echo 0x10 > error_type
# echo 1 > error_inject

That went BOOM:

[ 1296.233435] Disabling lock debugging due to kernel taint
[ 1296.248010] mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.269245] mce: [Hardware Error]: RIP !INEXACT! 10: {intel_idle+0xbf/0x130}
[ 1296.290735] mce: [Hardware Error]: TSC 37c1fb53beb ADDR bb68f400 MISC 20401a9a86
[ 1296.309772] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c microcode 710
[ 1296.332058] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 1296.346094] EDAC sbridge MC0: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.366517] EDAC sbridge MC0: TSC 37c1fb53beb
[ 1296.375974] EDAC sbridge MC0: ADDR bb68f400 EDAC sbridge MC0: MISC 20401a9a86
[ 1296.394493] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c
[ 1296.416153] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset: 0x400 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
...

judging by the CPU numbers, looks like node 0 got that error in the shared bank:

.... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
.... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39

finishing with

[ 1299.907994] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 1299.926783] Kernel panic - not syncing: Fatal machine check
[ 1299.959632] Kernel Offset: disabled
[ 1299.984254] Rebooting in 100 seconds..
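
The "Processor context corrupt" verdict matches the Bank 5 status word, which has PCC set. A quick way to see that is to decode the top byte of the two status values logged above. The snippet below is a rough sketch: decode_mci_status is a made-up helper name, and the bit positions (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57) are taken from the IA32_MCi_STATUS layout in the Intel SDM.

```shell
#!/bin/bash
# Hypothetical helper: decode bits 63..57 of an IA32_MCi_STATUS value
# passed as 16 hex digits without a 0x prefix.
decode_mci_status() {
    local status="$1"
    # Only the top byte (bits 63..56) is needed, which avoids 64-bit
    # signed arithmetic in the shell.
    local top=$((16#${status:0:2}))

    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        $(( (top >> 7) & 1 )) $(( (top >> 6) & 1 )) $(( (top >> 5) & 1 )) \
        $(( (top >> 4) & 1 )) $(( (top >> 3) & 1 )) $(( (top >> 2) & 1 )) \
        $(( (top >> 1) & 1 ))
}

decode_mci_status 8c00004000010090   # Bank 5 after the 0x8 (CE) injection
decode_mci_status be00000000010090   # Bank 5 after the 0x10 (UE) injection
```

This prints "VAL=1 OVER=0 UC=0 EN=0 MISCV=1 ADDRV=1 PCC=0" for the corrected error and "VAL=1 OVER=0 UC=1 EN=1 MISCV=1 ADDRV=1 PCC=1" for the uncorrected one. The UC and PCC bits are what separate the logged event from the panic.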

dont_log_ce:

$ for i in $(seq 0 63); do echo 1 > /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; cat /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; done | uniq
1

# echo 0x8 > error_type
# echo 1 > error_inject

[ 318.263797] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 318.277029] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[ 318.295631] EDAC sbridge MC0: TSC 0
[ 318.303143] EDAC sbridge MC0: ADDR bb68f000 EDAC sbridge MC0: MISC 2040262686
[ 318.320473] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448300397 SOCKET 0 APIC 0
[ 318.809112] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)

This looks ok, we're missing the mce_notify_irq() line

"mce: [Hardware Error]: Machine check events logged"

which is as expected but the EDAC lines are there because we sent the
error on the notify chain.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/