From: "Verma, Vishal L"
To: "Luck, Tony", "bp@suse.de"
CC: "tglx@linutronix.de", "Williams, Dan J", "linux-kernel@vger.kernel.org", "ross.zwisler@linux.intel.com", "x86@kernel.org", "linux-nvdimm@ml01.01.org"
Subject: Re: [RFC PATCH] x86, mce: change the mce notifier to 'blocking' from 'atomic'
Date: Fri, 21 Apr 2017 21:39:45 +0000
Message-ID: <1492810703.2738.27.camel@intel.com>
In-Reply-To: <20170413113159.rc32ebiswn64nzrr@pd.tnic>

On Thu, 2017-04-13 at 13:31 +0200, Borislav Petkov wrote:
> On Thu, Apr 13, 2017 at 12:29:25AM +0200, Borislav Petkov wrote:
> > On Wed, Apr 12, 2017 at 03:26:19PM -0700, Luck, Tony wrote:
> > > We can futz with that and have them specify which chain (or both)
> > > that they want to be added to.
> >
> > Well, I didn't want the atomic chain to be a notifier because we can
> > keep it simple and non-blocking. Only the process context one will
> > be.
> >
> > So the question is, do we even have a use case for outside consumers
> > hanging on the atomic chain? Because if not, we're good to go.
>
> Ok, new day, new patch.
>
> Below is what we could do: we don't call the notifier at all on the
> atomic path but only print the MCEs. We do log them and if the machine
> survives, we process them accordingly. This is only a fix for upstream
> so that the current issue at hand is addressed.
>
> For later, we'd need to split the paths in:
>
>     critical_print_mce()
>
> or somesuch which immediately dumps the MCE to dmesg, and
>
>     mce_log()
>
> which does the slow path of logging MCEs and calling the blocking
> notifier.
>
> Now, I'd want to have decoding of the MCE on the critical path too so
> I have to think about how to do that nicely. Maybe move the decoding
> bits which are the same between Intel and AMD in mce.c and have some
> vendor-specific, fast calls. We'll see. Btw, this is something Ingo has
> been mentioning for a while.
>
> Anyway, here's just the urgent fix for now.
>
> Thanks.
>
> ---
> From: Vishal Verma
> Date: Tue, 11 Apr 2017 16:44:57 -0600
> Subject: [PATCH] x86/mce: Make the MCE notifier a blocking one
>
> The NFIT MCE handler callback (for handling media errors on NVDIMMs)
> takes a mutex to add the location of a memory error to a list. But since
> the notifier call chain for machine checks (x86_mce_decoder_chain) is
> atomic, we get a lockdep splat like:
>
>   BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
>   in_atomic(): 1, irqs_disabled(): 0, pid: 4, name: kworker/0:0
>   [..]
>   Call Trace:
>    dump_stack
>    ___might_sleep
>    __might_sleep
>    mutex_lock_nested
>    ? __lock_acquire
>    nfit_handle_mce
>    notifier_call_chain
>    atomic_notifier_call_chain
>    ? atomic_notifier_call_chain
>    mce_gen_pool_process
>
> Convert the notifier to a blocking one which gets to run only in process
> context.
>
> Boris: remove the notifier call in atomic context in print_mce(). For
> now, let's print the MCE on the atomic path so that we can make sure it
> goes out. We still log it for process context later.
>
> Reported-by: Ross Zwisler
> Signed-off-by: Vishal Verma
> Cc: Tony Luck
> Cc: Dan Williams
> Cc: linux-edac
> Cc: x86-ml
> Cc:
> Link: http://lkml.kernel.org/r/20170411224457.24777-1-vishal.l.verma@intel.com
> Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
> Signed-off-by: Borislav Petkov
> ---
>  arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
>  arch/x86/kernel/cpu/mcheck/mce.c          | 18 ++++--------------
>  3 files changed, 6 insertions(+), 16 deletions(-)
>

I noticed this patch was picked up in tip, in ras/urgent, but didn't see
a pull request for 4.11 - was this the intention? Or will it just be
added for 4.12?

    -Vishal
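
P.S. For anyone following the thread without the diff in front of them,
here is a rough userspace model of the split described in the quoted
patch: the atomic path only stashes the record, and the callback that
takes a mutex runs later from process context. This is a sketch under
assumptions, not the kernel code. Every name in it (fake_mce,
atomic_path_log, process_context_drain, consumer_cb, model_might_sleep)
is made up for illustration, the in_atomic_ctx flag merely stands in for
in_atomic(), and a pthread mutex stands in for the kernel mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these names are kernel APIs. */
#define POOL_SIZE 8

struct fake_mce {
	unsigned long addr;
};

static struct fake_mce pool[POOL_SIZE];	/* records waiting for process context */
static size_t pool_len;
static bool in_atomic_ctx;			/* hypothetical stand-in for in_atomic() */

static pthread_mutex_t bad_addr_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long bad_addrs[POOL_SIZE];
static size_t bad_addr_count;

/* Stand-in for might_sleep(): trips if reached from the "atomic" path. */
static void model_might_sleep(void)
{
	assert(!in_atomic_ctx);
}

/* Consumer callback: takes a sleeping lock to record the error address. */
static void consumer_cb(const struct fake_mce *m)
{
	model_might_sleep();
	pthread_mutex_lock(&bad_addr_lock);
	if (bad_addr_count < POOL_SIZE)
		bad_addrs[bad_addr_count++] = m->addr;
	pthread_mutex_unlock(&bad_addr_lock);
}

/* "Atomic" path: only stash the record; no callbacks, no sleeping locks. */
static bool atomic_path_log(unsigned long addr)
{
	bool stored = false;

	in_atomic_ctx = true;
	if (pool_len < POOL_SIZE) {
		pool[pool_len++].addr = addr;
		stored = true;
	}
	in_atomic_ctx = false;
	return stored;
}

/* "Process context" path: drain the pool and run callbacks that may sleep. */
static size_t process_context_drain(void)
{
	size_t handled = 0;

	for (size_t i = 0; i < pool_len; i++) {
		consumer_cb(&pool[i]);
		handled++;
	}
	pool_len = 0;
	return handled;
}
```

Calling atomic_path_log() a few times and then process_context_drain()
runs the callback once per stashed record. Calling consumer_cb() while
in_atomic_ctx is set would trip the assert, which is the model's
analogue of the "sleeping function called from invalid context" splat
quoted above.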