Date: Fri, 4 Dec 2015 12:14:20 -0500
From: "Raj, Ashok"
To: Borislav Petkov
Cc: linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org, Tony Luck, Ashok Raj
Subject: Re: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
Message-ID: <20151204171419.GA4870@otc-brkl-03.jf.intel.com>
References: <1449188170-3909-1-git-send-email-ashok.raj@intel.com> <20151204143404.GF21177@pd.tnic>
In-Reply-To: <20151204143404.GF21177@pd.tnic>

Hi Boris

On Fri, Dec 04, 2015 at 03:34:04PM +0100, Borislav Petkov wrote:
> > @@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> > +	if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
> > +		goto out;
>
> This CPU - it being offline and all - is not doing the minimal amount of
> work possible IMO.
>
> Why does it have to do ist_enter(), this_cpu_inc(mce_exception_count),
> etc?

Yes, it is possible to skip ist_enter() and the exception count. I tried
to keep most of the code as is and to leverage the code that already reads
MCG_STATUS. Architecturally we also need to check RIPV, and if it is clear
we should initiate shutdown.

When we add the logging from offline cpus as next step it would be safe
to use interrupt stack, and the offline

I liked the observability part of keeping the exception count. If and
when we online the cpu again, it might otherwise look as if it noticed
nothing. Now we can check /proc/interrupts and see that the offline cpu
also observed the MCE.

>
> IMO the only things it should do is this:
>
> 	if (cpu_is_offline(smp_processor_id())) {
> 		mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> 		return;
> 	}
>
> and that should be at the very beginning of do_machine_check(). So
> that the hardware is happy. Concerning Linux, it is offline so no data
> structures on it are valid.
>
> P.S., please don't put stable@ to CC - add it as a "CC: " line in the
> SOB section instead.

Let me know what you think; I can resend with the Cc: stable line. I did
add the stable line in the right section in an earlier version, but
deleting some extraneous commit messages accidentally got to this one :(.

Cheers,
Ashok
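
For readers following the thread, the decision being discussed can be sketched in user-space C. This is only an illustration of the reasoning above, not the patch or the kernel code: `mce_decide()` and the `mce_action` enum are hypothetical names, and the only value taken from the architecture is that RIPV is bit 0 of MCG_STATUS. It assumes an offline CPU with RIPV set may simply clear MCG_STATUS and return, an offline CPU with RIPV clear must lead to shutdown, and an online CPU joins the usual rendezvous.

```c
#include <stdbool.h>
#include <stdint.h>

/* RIPV ("restart IP valid") is bit 0 of IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Hypothetical outcomes, named only for this illustration. */
enum mce_action {
	MCE_CLEAR_AND_RETURN,	/* offline CPU, safe to resume: clear MCG_STATUS, return */
	MCE_SHUTDOWN,		/* offline CPU, RIPV clear: cannot resume safely */
	MCE_RENDEZVOUS		/* online CPU: take part in the normal MCE rendezvous */
};

/*
 * Hypothetical helper modelling the check discussed in the thread.
 * cpu_offline stands in for cpu_is_offline(smp_processor_id()),
 * mcgstatus for the value read from MSR_IA32_MCG_STATUS.
 */
static enum mce_action mce_decide(bool cpu_offline, uint64_t mcgstatus)
{
	if (!cpu_offline)
		return MCE_RENDEZVOUS;

	if (mcgstatus & MCG_STATUS_RIPV)
		return MCE_CLEAR_AND_RETURN;

	return MCE_SHUTDOWN;
}
```

Boris's shorter variant corresponds to returning MCE_CLEAR_AND_RETURN for every offline CPU without looking at RIPV; the sketch keeps the RIPV check only to show the difference between the two positions.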