From: "Luck, Tony"
To: Andrew Morton
CC: Borislav Petkov, "Hansen, Dave", Naoya Horiguchi, "Elliott, Robert (Persistent Memory)", "x86@kernel.org", "linux-mm@kvack.org", "linux-kernel@vger.kernel.org"
Subject: RE: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
Date: Thu, 17 Aug 2017 23:32:16 +0000
Message-ID: <3908561D78D1C84285E8C5FCA982C28F61342363@ORSMSX114.amr.corp.intel.com>
References: <20170816171803.28342-1-tony.luck@intel.com> <20170817150942.017f87537b6cbb48e9cfc082@linux-foundation.org>
In-Reply-To: <20170817150942.017f87537b6cbb48e9cfc082@linux-foundation.org>

> It's unclear (to lil ole me) what the end-user-visible effects of this
> are.
>
> Could we please have a description of that?  So a) people can
> understand your decision to cc:stable and b) people whose kernels are
> misbehaving can use your description to decide whether your patch might
> fix the issue their users are reporting.
Ingo already applied this to the tip tree, so too late to fix the
commit message :-(

A very, very unlucky end user with a system that supports machine
check recovery (Xeon E7 or Xeon-SP-platinum) that has recovered from
one or more uncorrected memory errors (lucky so far) might find a
subsequent uncorrected memory error flagged as fatal, because the
machine check bank that should log the error is already occupied by
a log caused by a speculative access to one of the earlier uncorrected
errors (the unlucky part).

We haven't seen this happen at the Linux OS level, but it is a
theoretical possibility. [Some BIOSes that map physical memory 1:1
have seen this when doing eMCA processing for the first error ... as
soon as they load the address of the error from the MCi_ADDR register
they are vulnerable to some speculative access dereferencing the
register with the address and setting the overflow bit in the machine
check bank that still holds the original log].

-Tony
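
The sequence described above can be illustrated with a toy model of a single machine check bank. This is only a sketch under stated assumptions, not kernel code. The VAL/OVER/UC bit positions follow the architectural MCi_STATUS layout. Everything else (the `Bank` class, `is_recoverable`, `run`, the page addresses, and the simplified rule that a second log into an occupied bank just sets the overflow bit) is hypothetical and exists only for illustration. The speculative access is modelled as another uncorrected log, which is a simplification.

```python
# Toy model of one machine check bank. Not kernel code.
MCI_STATUS_VAL = 1 << 63   # the log in this bank is valid
MCI_STATUS_OVER = 1 << 62  # another error arrived while the bank was occupied
MCI_STATUS_UC = 1 << 61    # the logged error is uncorrected


class Bank:
    """Hypothetical stand-in for one MCi_STATUS / MCi_ADDR pair."""

    def __init__(self):
        self.status = 0
        self.addr = 0

    def log_uncorrected(self, addr):
        if self.status & MCI_STATUS_VAL:
            # Simplified overwrite rule: keep the old log, flag overflow.
            self.status |= MCI_STATUS_OVER
        else:
            self.status = MCI_STATUS_VAL | MCI_STATUS_UC
            self.addr = addr

    def clear(self):
        self.status = 0
        self.addr = 0


def is_recoverable(bank):
    # Assumed severity rule: an overflowed uncorrected log means
    # information was lost, so the error is treated as fatal.
    valid = bool(bank.status & MCI_STATUS_VAL)
    overflowed = bool(bank.status & MCI_STATUS_OVER)
    return valid and not overflowed


def run(unmap_poison):
    bank = Bank()
    poisoned_page = 0x1000            # made-up addresses
    direct_map = {0x1000, 0x2000}     # pages present in the 1:1 mapping

    # First uncorrected error: logged, recovered from, bank cleared.
    bank.log_uncorrected(poisoned_page)
    bank.clear()

    # The idea of the patch: drop the poison page from the 1:1 mapping.
    if unmap_poison:
        direct_map.discard(poisoned_page)

    # A later speculative access can only reach pages that are still mapped.
    if poisoned_page in direct_map:
        bank.log_uncorrected(poisoned_page)

    # A new, unrelated uncorrected error arrives in the same bank.
    bank.log_uncorrected(0x2000)
    return is_recoverable(bank)
```

In this model `run(False)` ends with the overflow bit set, so the second error is treated as fatal, while `run(True)` leaves the bank free for the new error and it stays recoverable.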