Date: Tue, 24 Nov 2015 19:56:26 +0100
From: Borislav Petkov
To: "Luck, Tony"
Cc: "Chen, Gong", "linux-edac@vger.kernel.org", "linux-kernel@vger.kernel.org"
Subject: Re: [UNTESTED PATCH] x86, mce: Avoid double entry of deferred errors into the genpool.
Message-ID: <20151124185626.GC21613@pd.tnic>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39EA08E1@ORSMSX114.amr.corp.intel.com>

On Tue, Nov 24, 2015 at 03:51:21PM +0000, Luck, Tony wrote:
> >> Ok ... applied those two on top of my "UNTESTED" patch and injected an error to force a UCNA log.
> >
> > Ok, what error type is that in EINJ nomenclature? I had only
> >
> > /sys/kernel/debug/apei/einj/available_error_type:0x00000002 Processor Uncorrectable non-fatal
> > /sys/kernel/debug/apei/einj/available_error_type:0x00000008 Memory Correctable
> > /sys/kernel/debug/apei/einj/available_error_type:0x00000010 Memory Uncorrectable non-fatal
> >
> > and I would've guessed it is the 0x10 type, i.e., the memory
> > uncorrectable which is non-fatal - assuming here - but that one got
> > promoted to a #MC on my box.
>
> I juggled with the type of the injection and the instruction sequence to access the target
> location. I used 0x10 to inject an uncorrected memory error with "# echo 1 > notrigger"
> to make sure the EINJ driver skipped the trigger actions. Then I had a user mode test program
> write a byte to the cache line. That pulled the uncorrected data into the cache (which logged
> the UCNA error signaled with CMCI). But the processor didn't actually consume the poison
> (no registers had corrupted data), so there was no machine check.
>
> Sneaky, huh?

That reminds me of the whitepaper:

https://software.intel.com/sites/default/files/managed/b3/d1/MCA_Recovery_Validation_Guide.pdf

Btw, should we take those tools here:

https://git.kernel.org/cgit/linux/kernel/git/aegl/ras-tools.git

and glue them together with a Python or a shell script or so which goes
and automatically takes care of loading einj.ko, injects the proper
error type and thus abstracts away all that detail which makes me look
at Documentation/acpi/apei/einj.txt every time?

Something like

./einject.py --ucna

which would do all the fun?

That would simplify our testing a lot, methinks.

Hmmm?

Oh, and btw, the box here didn't have the notrigger node, which means
it'll always do the trigger actions. :-\

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
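The `./einject.py --ucna` idea could look roughly like the sketch below. It is a minimal, hypothetical sketch, not code from ras-tools: the preset names, the `--addr` option, the page-sized default for `param2` and all helper names are assumptions. The debugfs file names (`available_error_type`, `error_type`, `param1`, `param2`, `notrigger`, `error_inject`) are the ones described in Documentation/acpi/apei/einj.txt. The `ucna` preset encodes the recipe quoted above: type 0x10 with `notrigger` set, followed by a user-space write to the poisoned line. The script refuses to proceed when the platform doesn't offer the requested type or when a node such as `notrigger` is missing, rather than silently letting EINJ run its trigger actions.

```python
#!/usr/bin/env python3
"""einject.py: hypothetical sketch of a wrapper around the ACPI EINJ debugfs files."""
import argparse
import os
import subprocess

EINJ_DIR = "/sys/kernel/debug/apei/einj"  # assumes debugfs is mounted here

# Hypothetical presets: (EINJ error type, skip the trigger phase?).
# The type codes match the available_error_type listing quoted in the mail.
PRESETS = {
    "ucna": (0x10, True),     # memory UC non-fatal; consume it later from user space
    "mem-uc": (0x10, False),  # memory UC non-fatal; EINJ runs its trigger actions
    "mem-ce": (0x8, False),   # memory correctable
}


def parse_available(text):
    """Map error-type code -> description from available_error_type."""
    types = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            types[int(parts[0], 16)] = parts[1].strip()
    return types


def plan(preset, addr, mask=0xFFFFFFFFFFFFF000):
    """Return the ordered (file, value) writes for one injection."""
    error_type, notrigger = PRESETS[preset]
    steps = [("error_type", hex(error_type)),
             ("param1", hex(addr)),   # target physical address
             ("param2", hex(mask))]   # address mask (assumption: whole page)
    if notrigger:
        steps.append(("notrigger", "1"))
    steps.append(("error_inject", "1"))  # written last: performs the injection
    return steps


def apply(steps, einj_dir=EINJ_DIR):
    """Validate the type and every file first, then do the writes in order."""
    with open(os.path.join(einj_dir, "available_error_type")) as f:
        available = parse_available(f.read())
    wanted = int(dict(steps)["error_type"], 16)
    if wanted not in available:
        raise ValueError(f"error type {wanted:#x} not offered by this platform")
    missing = [n for n, _ in steps if not os.path.exists(os.path.join(einj_dir, n))]
    if missing:  # e.g. a kernel without the notrigger node
        raise FileNotFoundError(f"missing EINJ files: {', '.join(missing)}")
    for name, value in steps:
        with open(os.path.join(einj_dir, name), "w") as f:
            f.write(value + "\n")


def main(argv=None):
    p = argparse.ArgumentParser(description="Inject an error via ACPI EINJ")
    kind = p.add_mutually_exclusive_group(required=True)
    for name in PRESETS:
        kind.add_argument("--" + name, dest="preset", action="store_const", const=name)
    p.add_argument("--addr", type=lambda s: int(s, 0), required=True,
                   help="physical address to poison")
    args = p.parse_args(argv)
    if not os.path.isdir(EINJ_DIR):
        subprocess.run(["modprobe", "einj"], check=True)  # load einj.ko
    apply(plan(args.preset, args.addr))
    if PRESETS[args.preset][1]:
        print("injected without trigger; now write to that cache line to get a UCNA")


if __name__ == "__main__":
    main()
```

Run as root, e.g. `./einject.py --ucna --addr 0x12345000`, then have a user-mode program write a byte to that cache line as Tony describes. Obtaining a suitable physical address is left out here; the ras-tools programs are the place to look for that part.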