Date: Fri, 4 Dec 2015 18:36:33 +0100
From: Borislav Petkov
To: "Luck, Tony"
Cc: "Raj, Ashok", "linux-kernel@vger.kernel.org", "linux-edac@vger.kernel.org"
Subject: Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
Message-ID: <20151204173633.GK21177@pd.tnic>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F78AD9@ORSMSX114.amr.corp.intel.com>

On Fri, Dec 04, 2015 at 05:23:18PM +0000, Luck, Tony wrote:
> > Frankly, I'm not sure at all and very very wary of adding *any* code
> > which runs on an offlined CPU. Because *no one* does that and it hasn't
> > been tested at all. So who knows what happens.
> >
> > What we should be doing is execute the *minimal* amount of code possible
> > and get out. No counting, no per-cpu variables. No nothing.
>
> The minimal code requires we use:
>
> 	smp_processor_id() [to get our cpu number]
> 	cpu_is_offline() [to find out the cpu is offline]
>
> The first of those looks more dangerous in that it accesses a per-cpu variable.
>
> I don't think we need to be totally paranoid here. We know that the offline cpus
> were once online and went through normal kernel initialization code (if they didn't,
> then we can't possibly be executing this code ... their CR4.MCE bit would be zero so their
> response to a machine check would have been to reset the system).

I don't mean that - I mean the stuff we do before we call
cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count), etc.

Then we do a whole other bunch of stuff at the "out:" label like printk
and whatnot which shouldn't run on an offlined CPU.

I.e., the check whether a CPU is offline should be the first thing we
do in do_machine_check and get the hell out if so.

> Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
> including telling the rest of the system not to send it any more broadcast events
> like MCE, SMI. But the h/w guys like to give the s/w guys job security by making
> these corner cases that we have to work around in s/w :-)

Mind you, this is unintentional from the hw guys. But ha(!), I know
*exactly* what you mean. :-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
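
For illustration only, the ordering argued for above (offline check first, before any counting or logging) could be sketched in userspace roughly like this. Everything here is a hypothetical stand-in: the stub_* functions, the offline map, and the counters are not the real kernel code, and this is not the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* assumption: small fixed CPU count for the sketch */

/* Hypothetical stand-ins for kernel state; CPU 2 is "offline". */
static bool cpu_offline_map[NR_CPUS] = { false, false, true, false };
static int current_cpu;				/* what the stub "smp_processor_id()" returns */
static unsigned long mce_exception_count;	/* stands in for the per-cpu counter */
static unsigned long out_label_runs;		/* stands in for the work at the "out:" label */

static int stub_smp_processor_id(void)
{
	return current_cpu;
}

static bool stub_cpu_is_offline(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_offline_map[cpu];
}

static void stub_do_machine_check(void)
{
	int cpu = stub_smp_processor_id();

	/* First thing: bail out on an offlined CPU. No counting, no logging. */
	if (stub_cpu_is_offline(cpu))
		return;

	/* Only online CPUs get this far. */
	mce_exception_count++;	/* stands in for this_cpu_inc(mce_exception_count) */

	/* ... the real rendezvous and bank scanning would happen here ... */

	out_label_runs++;	/* stands in for printk etc. at "out:" */
}
```

With current_cpu set to 2, stub_do_machine_check() returns without touching either counter; with an online CPU both are incremented. Note that even this minimal path still calls the stand-in for smp_processor_id(), which is the per-cpu access Tony flags as the more dangerous of the two calls.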