Date: Fri, 4 Dec 2015 17:51:12 +0100
From: Borislav Petkov
To: "Raj, Ashok"
Cc: linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org, Tony Luck
Subject: Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
Message-ID: <20151204165112.GI21177@pd.tnic>
In-Reply-To: <20151204171419.GA4870@otc-brkl-03.jf.intel.com>

On Fri, Dec 04, 2015 at 12:14:20PM -0500, Raj, Ashok wrote:
> Yes, thats possible to not do ist_enter() and the exception count.
>
> I tried to keep most of the part as is and leveraging code already
> doing the reading of MCG_STATUS. Architecturally we need to also check RIPV
> and if clear we should initiate shutdown. So add that check too.
> When we add the logging from offline cpus as next step it would be safe to
> use interrupt stack, and the offline

Frankly, I'm not sure at all and very very wary of adding *any* code which
runs on an offlined CPU. Because *no one* does that and it hasn't been
tested at all. So who knows what happens.

What we should be doing is execute the *minimal* amount of code possible
and get out. No counting, no per-cpu variables. No nothing.

> I liked the observability part keeping the exception count.
> if and when we online the cpu again, it might look as it noticed nothing.
> Now we can check /proc/interrupts and see the offline cpu also observed
> the MCE.

And? Tell us what? That SMM fondled the hardware under our feet.

TBH, I'd tend to be much more drastic here and even taint the kernel.
I mean, seriously, what kind of MCEs which happen as a result of OS
execution are you expecting to get reported on an offlined CPU? I can't
think of very many.

Because we have been considering offlining a core as one possible RAS
action. So what happens is a user or a RAS agent offlines a core and yet,
that offlined core still reports MCEs. Something's terribly wrong with
that picture, IMO.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.