From: "Luck, Tony" <tony.luck@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
CC: Borislav Petkov <bp@suse.de>, "Hansen, Dave" <dave.hansen@intel.com>,
        Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
        "Elliott, Robert (Persistent Memory)" <elliott@hpe.com>,
        "x86@kernel.org" <x86@kernel.org>,
        "linux-mm@kvack.org" <linux-mm@kvack.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1
 mappings of poison pages
Thread-Topic: [PATCH-resend] mm/hwpoison: Clear PRESENT bit for kernel 1:1
 mappings of poison pages
Thread-Index: AQHTFrOk7XhQ3140O0ypvNxbZnavRKKJkv4A//+fLLA=
Date: Thu, 17 Aug 2017 23:32:16 +0000
Message-ID: <3908561D78D1C84285E8C5FCA982C28F61342363@ORSMSX114.amr.corp.intel.com>
References: <CAPcyv4gC_6TpwVSjuOzxrz3OdVZCVWD0QVWhBzAuOxUNHJHRMQ@mail.gmail.com>
        <20170816171803.28342-1-tony.luck@intel.com>
 <20170817150942.017f87537b6cbb48e9cfc082@linux-foundation.org>
In-Reply-To: <20170817150942.017f87537b6cbb48e9cfc082@linux-foundation.org>
Accept-Language: en-US
Content-Language: en-US
dlp-product: dlpe-windows
dlp-version: 10.0.102.7
dlp-reaction: no-action
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1314
Lines: 25

> It's unclear (to lil ole me) what the end-user-visible effects of this
> are.
>
> Could we please have a description of that?  So a) people can
> understand your decision to cc:stable and b) people whose kernels are
> misbehaving can use your description to decide whether your patch might
> fix the issue their users are reporting.

Ingo already applied this to the tip tree, so too late to fix the commit message :-(

A very, very unlucky end user with a system that supports machine check recovery
(Xeon E7, or Xeon-SP-platinum) that has recovered from one or more uncorrected
memory errors (lucky so far) might find a subsequent uncorrected memory error flagged
as fatal. That happens because the machine check bank that should log the new error is
already occupied by a log caused by a speculative access to one of the earlier
uncorrected errors (the unlucky part).

We haven't seen this happen at the Linux OS level, but it is a theoretical possibility.
[Some BIOSes that map physical memory 1:1 have seen this when doing eMCA processing
for the first error ... as soon as they load the address of the error from the MCi_ADDR
register they are vulnerable to some speculative access dereferencing the register that
holds the address and setting the overflow bit in the machine check bank that still holds
the original log].
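
To make the failure mode concrete, here is a rough toy model of a single
machine check bank. It is only a sketch, not the kernel's code: struct toy_bank,
toy_log_uc_error() and toy_is_recoverable() are made-up names, and the model
assumes that an uncorrected error arriving while the bank still holds a valid
log leaves the old log in place and just sets the overflow bit. The three bit
positions are the architectural VAL/OVER/UC bits of MCi_STATUS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MCi_STATUS bits: valid, overflow, uncorrected. */
#define MCI_STATUS_VAL  (1ULL << 63)
#define MCI_STATUS_OVER (1ULL << 62)
#define MCI_STATUS_UC   (1ULL << 61)

/* Hypothetical stand-in for one bank's MCi_STATUS and MCi_ADDR registers. */
struct toy_bank {
	uint64_t status;
	uint64_t addr;
};

/* Model the hardware logging an uncorrected error at 'addr'. */
void toy_log_uc_error(struct toy_bank *b, uint64_t addr)
{
	if (b->status & MCI_STATUS_VAL) {
		/*
		 * Assumption: the bank is still occupied, so the earlier
		 * log is kept and only the overflow bit is set.
		 */
		b->status |= MCI_STATUS_OVER;
		return;
	}
	b->status = MCI_STATUS_VAL | MCI_STATUS_UC;
	b->addr = addr;
}

/*
 * Simplified policy: a valid uncorrected log is only treated as
 * recoverable here if the bank did not overflow.
 */
bool toy_is_recoverable(const struct toy_bank *b)
{
	return (b->status & MCI_STATUS_VAL) &&
	       (b->status & MCI_STATUS_UC) &&
	       !(b->status & MCI_STATUS_OVER);
}
```

In this model, if a speculative access to an old poison page fills the bank
first, the next toy_log_uc_error() call for a genuinely new error only sets
MCI_STATUS_OVER, and toy_is_recoverable() then returns false. That is the
"unlucky" sequence described above.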

-Tony