From: "Luck, Tony"
To: Borislav Petkov, "Raj, Ashok"
CC: "linux-kernel@vger.kernel.org", "linux-edac@vger.kernel.org"
Subject: RE: [Patch V0] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.
Date: Fri, 4 Dec 2015 17:23:18 +0000
Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F78AD9@ORSMSX114.amr.corp.intel.com>
References: <1449188170-3909-1-git-send-email-ashok.raj@intel.com>
 <20151204143404.GF21177@pd.tnic>
 <20151204171419.GA4870@otc-brkl-03.jf.intel.com>
 <20151204165112.GI21177@pd.tnic>
In-Reply-To: <20151204165112.GI21177@pd.tnic>
X-Mailing-List: linux-kernel@vger.kernel.org

> Franky, I'm not sure at all and very very wary of adding *any* code
> which runs on an offlined CPU. Because *no one* does that and it hasn't
> been tested at all. So who knows what happens.
>
> What we should be doing is execute the *minimal* amount of code possible
> and get out. No counting, no per-cpu variables. No nothing.

The minimal code requires we use:

	smp_processor_id()	[to get our cpu number]
	cpu_is_offline()	[to find out the cpu is offline]

The first of those looks more dangerous in that it accesses a per-cpu
variable.

I don't think we need to be totally paranoid here. We know that the
offline cpus were once online and went through normal kernel
initialization code (if they didn't, then we can't possibly be executing
this code ... their CR4.MCE bit would be zero so their response to a
machine check would have been to reset the system).

> Because we have been considering offlining a core as one possible RAS
> action. So what happens is a user or a RAS agent offlines a core and
> yet, that offlined core still reports MCEs. Something's terribly wrong
> with that picture, IMO.

Agreed. It would be more pleasant if we had some way to *really* offline
a cpu, including telling the rest of the system not to send it any more
broadcast events like MCE, SMI. But the h/w guys like to give the s/w
guys job security by making these corner cases that we have to work
around in s/w :-)

-Tony
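
The early-exit path discussed in this message (get the cpu number, check whether that cpu is offline, and get out without touching any rendezvous state) can be modelled in user-space C. This is only an illustrative sketch, not the patch under discussion. Every name below (`mock_smp_processor_id`, `mock_cpu_is_offline`, `mock_do_machine_check`, `online_mask`, `current_cpu`, `rendezvous_participants`) is a hypothetical stand-in. In the kernel, `smp_processor_id()` reads a per-cpu variable and `cpu_is_offline()` consults the cpu online mask, and a real handler would presumably need hardware-specific cleanup that is omitted here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's online mask:
 * bit n set means cpu n is online. */
static unsigned long online_mask = 0xffUL;

/* Hypothetical stand-in for the per-cpu "which cpu am I" value. */
static int current_cpu;

/* Counts the cpus that joined the (modelled) rendezvous. */
static int rendezvous_participants;

/* Models smp_processor_id(): returns the number of the executing cpu. */
static int mock_smp_processor_id(void)
{
	return current_cpu;
}

/* Models cpu_is_offline(): true when the cpu's bit is clear in the mask. */
static bool mock_cpu_is_offline(int cpu)
{
	return !(online_mask & (1UL << cpu));
}

/*
 * Models the entry of a machine check handler.
 * Returns true if this cpu took part in the rendezvous,
 * false if it left early because it is offline.
 */
static bool mock_do_machine_check(void)
{
	int cpu = mock_smp_processor_id();

	if (mock_cpu_is_offline(cpu))
		return false;	/* minimal path: no counting, just get out */

	rendezvous_participants++;
	return true;
}
```

With `online_mask` set to 0x0f, a call made while `current_cpu` is 2 joins the rendezvous, while a call made while `current_cpu` is 5 returns immediately and leaves `rendezvous_participants` unchanged.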