2018-12-24 23:12:19

by Ingo Molnar

[permalink] [raw]
Subject: [GIT PULL] x86/mm changes for v4.21

Linus,

Please pull the latest x86-mm-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-for-linus

# HEAD: 6848ac7ca39a226ede5df7af0efcc4ef0611e99c x86/mm/dump_pagetables: Use DEFINE_SHOW_ATTRIBUTE()

The main changes in this cycle were:

- Update and clean up x86 fault handling, by Andy Lutomirski.

- Drop usage of __flush_tlb_all() in kernel_physical_mapping_init() and
related fallout, by Dan Williams.

- CPA cleanups and reorganization by Peter Zijlstra: simplify the flow
and remove a few warts.

- Other misc cleanups.

out-of-topic modifications in x86-mm-for-linus:
-------------------------------------------------
include/asm-generic/5level-fixup.h # 0a9fe8ca844d: x86/mm: Validate kernel_phys
include/asm-generic/pgtable-nop4d-hack.h # 0a9fe8ca844d: x86/mm: Validate kernel_phys
include/asm-generic/pgtable-nop4d.h # 0a9fe8ca844d: x86/mm: Validate kernel_phys
include/asm-generic/pgtable-nopud.h # 0a9fe8ca844d: x86/mm: Validate kernel_phys
include/asm-generic/pgtable.h # 4369deaa2f02: generic/pgtable: Introduce s
# 0cebbb60f759: generic/pgtable: Introduce {
# c683c37cd132: generic/pgtable: Make {pmd,

Thanks,

Ingo

------------------>
Andy Lutomirski (12):
x86/fault: Check user_mode(regs) when avoiding an mmap_sem deadlock
x86/cpufeatures, x86/fault: Mark SMAP as disabled when configured out
x86/fault: Fold smap_violation() into do_user_addr_fault()
x86/fault: Fix SMAP #PF handling buglet for implicit supervisor accesses
x86/fault: Improve the condition for signalling vs OOPSing
x86/fault: Make error_code sanitization more robust
x86/fault: Don't set thread.cr2, etc before OOPSing
x86/fault: Remove sw_error_code
x86/fault: Don't try to recover from an implicit supervisor access
x86/oops: Show the correct CS value in show_regs()
x86/vsyscall/64: Use X86_PF constants in the simulated #PF error code
x86/fault: Decode page fault OOPSes better

Dan Williams (5):
generic/pgtable: Make {pmd, pud}_same() unconditionally available
generic/pgtable: Introduce {p4d,pgd}_same()
generic/pgtable: Introduce set_pte_safe()
x86/mm: Validate kernel_physical_mapping_init() PTE population
x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()

Ingo Molnar (1):
x86/fault: Clean up the page fault oops decoder a bit

Peter Zijlstra (9):
x86/mm/cpa: Add ARRAY and PAGES_ARRAY selftests
x86/mm/cpa: Add __cpa_addr() helper
x86/mm/cpa: Make cpa_data::vaddr invariant
x86/mm/cpa: Simplify the code after making cpa->vaddr invariant
x86/mm/cpa: Optimize cpa_flush_array() TLB invalidation
x86/mm/cpa: Make cpa_data::numpages invariant
x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function
x86/mm/cpa: Better use CLFLUSHOPT
x86/mm/cpa: Rename @addrinarray to @numpages

Waiman Long (1):
x86/mm/fault: Allow stack access below %rsp

Yangtao Li (1):
x86/mm/dump_pagetables: Use DEFINE_SHOW_ATTRIBUTE()


arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/pgalloc.h | 27 +++
arch/x86/kernel/process_64.c | 5 +-
arch/x86/mm/debug_pagetables.c | 58 +------
arch/x86/mm/fault.c | 244 +++++++++++++++++-----------
arch/x86/mm/init_64.c | 30 ++--
arch/x86/mm/mm_internal.h | 2 +
arch/x86/mm/pageattr-test.c | 31 +++-
arch/x86/mm/pageattr.c | 271 +++++++++++++------------------
arch/x86/mm/tlb.c | 4 +-
include/asm-generic/5level-fixup.h | 1 +
include/asm-generic/pgtable-nop4d-hack.h | 1 +
include/asm-generic/pgtable-nop4d.h | 1 +
include/asm-generic/pgtable-nopud.h | 1 +
include/asm-generic/pgtable.h | 56 ++++++-
16 files changed, 396 insertions(+), 346 deletions(-)



2018-12-27 15:17:10

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

The pull request you sent on Tue, 25 Dec 2018 00:11:06 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e57d9f638af9673f38d9f09de66fa0a28303127d

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker

2019-02-07 00:18:26

by Luck, Tony

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Tue, Dec 25, 2018 at 12:11:06AM +0100, Ingo Molnar wrote:
> Peter Zijlstra (9):
> x86/mm/cpa: Add ARRAY and PAGES_ARRAY selftests
> x86/mm/cpa: Add __cpa_addr() helper
> x86/mm/cpa: Make cpa_data::vaddr invariant
> x86/mm/cpa: Simplify the code after making cpa->vaddr invariant
> x86/mm/cpa: Optimize cpa_flush_array() TLB invalidation
> x86/mm/cpa: Make cpa_data::numpages invariant
> x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function
> x86/mm/cpa: Better use CLFLUSHOPT
> x86/mm/cpa: Rename @addrinarray to @numpages

Something in this series from Peter is causing problems with
machine check recovery. The kernel dies with a #GP fault:

[ 93.363295] Disabling lock debugging due to kernel taint
[ 93.369700] mce: Uncorrected hardware memory error in user-access at 3fbeeab400
[ 93.369709] mce: [Hardware Error]: Machine check events logged
[ 93.384415] mce: [Hardware Error]: Machine check events logged
[ 93.390973] EDAC MC2: 1 UE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x3fbeeab offset:0x400 grain:32 - OVERFLOW recoverable area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
[ 93.413569] Memory failure: 0x3fbeeab: Killing einj_mem_uc:4810 due to hardware memory corruption
[ 93.423501] Memory failure: 0x3fbeeab: recovery action for dirty LRU page: Recovered
[ 93.432508] general protection fault: 0000 [#1] SMP PTI
[ 93.438359] CPU: 11 PID: 0 Comm: swapper/11 Tainted: G M 4.20.0-rc5+ #13
[ 93.447294] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRBDXSD1.86B.0338.V01.1603162127 03/16/2016
[ 93.458869] RIP: 0010:native_flush_tlb_one_user+0x8c/0xa0
[ 93.464899] Code: 02 48 8b 44 24 18 65 48 33 04 25 28 00 00 00 75 20 c9 c3 83 c0 01 48 89 7c 24 08 48 89 e1 80 cc 08 0f b7 c0 48 89 04 24 31 c0 <66> 0f 38 82 01 eb d0 e8 78 0e 05 00 0f 1f 84 00 00 00 00 00 0f 1f
[ 93.485859] RSP: 0018:ffff99623f2c3f70 EFLAGS: 00010046
[ 93.491692] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff99623f2c3f70
[ 93.499658] RDX: 2e6b58da00000121 RSI: 0000000000000000 RDI: 7fff9981feeab000
[ 93.507623] RBP: ffff99623f2c3f98 R08: 0000000000000002 R09: 0000000000021640
[ 93.515587] R10: 000ecaed3e716d58 R11: 0000000000000000 R12: ffffffff84fe9920
[ 93.523550] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 93.531518] FS: 0000000000000000(0000) GS:ffff99623f2c0000(0000) knlGS:0000000000000000
[ 93.540551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.546966] CR2: 00005566b2cd5470 CR3: 00000049bee0a006 CR4: 00000000003606e0
[ 93.554927] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 93.562892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 93.570857] Call Trace:
[ 93.573593] <IRQ>
[ 93.575846] ? recalibrate_cpu_khz+0x10/0x10
[ 93.580628] __cpa_flush_tlb+0x2e/0x50
[ 93.584830] flush_smp_call_function_queue+0x35/0xe0
[ 93.590390] smp_call_function_interrupt+0x3a/0xd0
[ 93.595740] call_function_interrupt+0xf/0x20
[ 93.600604] </IRQ>

Build errors during bisection meant I couldn't pin it to a single commit,
but it did narrow down to:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
83b4e39146aa70913580966e0f2b78b7c3492760
935f5839827ef54b53406e80906f7c355eb73c1b
fe0937b24ff5d7b343b9922201e469f9a6009d9d
We cannot bisect more!

so (more descriptively):
83b4e39146aa ("x86/mm/cpa: Make cpa_data::numpages invariant")
935f5839827e ("x86/mm/cpa: Optimize cpa_flush_array() TLB invalidation")
fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function")


If I revert those three (together with the following three from this
merge, because I didn't want to run into more build problems), then
machine check recovery starts working again.

The problem might be a non-canonical address passed down by the machine
check recovery code to switch the page with the error to uncacheable.
Perhaps the refactored code is now using that address in the

invpcid (%rcx),%rax

instruction that gets the #GP fault?
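
For reference, the decoy trick from ce0fa3e56ad2 works along these lines
(an illustrative userspace sketch, not the actual kernel code, and the
helper names are made up):

#include <stdbool.h>
#include <stdint.h>

#define VADDR_BITS	48	/* 4-level paging; 57 with 5-level */

/* Canonical means bits 63..47 all equal bit 47. */
static bool is_canonical(uint64_t addr)
{
	int shift = 64 - VADDR_BITS;

	return (uint64_t)((int64_t)(addr << shift) >> shift) == addr;
}

/*
 * real  address: 0xffff9981feeab000  (canonical 1:1-map address)
 * decoy address: 0x7fff9981feeab000  (top bit cleared -> non-canonical)
 *
 * The software page-table updates only use the low address bits, so
 * set_memory_uc() on the decoy still hits the right PTE.  But INVLPG,
 * the linear address in an INVPCID descriptor, and CLFLUSH are checked
 * for canonicality and raise #GP on the decoy, which would match the
 * oops above.
 */
static uint64_t make_decoy(uint64_t addr)
{
	return addr & ~(1ULL << 63);		/* clear the top bit */
}

static uint64_t make_canonical(uint64_t addr)
{
	return (uint64_t)((int64_t)(addr << 1) >> 1);	/* bit 63 := bit 62 */
}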

-Tony

2019-02-07 00:34:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On 2/6/19 4:17 PM, Luck, Tony wrote:
> [ 93.491692] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff99623f2c3f70
> [ 93.499658] RDX: 2e6b58da00000121 RSI: 0000000000000000 RDI: 7fff9981feeab000
...
> Potentially the problem might be a non-canonical address passed down
> by the machine check recovery code to switch the page with the error
> to uncacheable. Perhaps the refactored code is now using that in the
>
> invpcid (%rcx),%rax
>
> instruction that gets the #GP fault?

That looks probable. RDI even has the non-canonical address still in it
(RCX is pointing to the stack, btw).

I think there's an option to dump the trace buffers at panic time. You
might want to enable the TLB flush tracing:

/sys/kernel/debug/tracing/events/tlb/tlb_flush

and see if we get a suspect flush captured just before the #GP.

I wonder if the patches that you bisected to just changed the flushing
from being CR3-based (and not taking an address) to being INVPCID-based,
and taking an address that is sensitive to canonicality.
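
To make that concrete, the difference is roughly the following (a heavily
simplified sketch, not the actual arch/x86/mm/tlb.c code; the function
names here are illustrative):

/*
 * Old ARRAY behaviour: flush everything.  No linear address is ever fed
 * to an instruction, so a non-canonical decoy address is harmless.
 */
static void flush_tlb_everything(void)
{
	unsigned long cr3;

	/* Reloading CR3 invalidates the non-global TLB entries. */
	asm volatile("mov %%cr3, %0" : "=r" (cr3));
	asm volatile("mov %0, %%cr3" : : "r" (cr3) : "memory");
}

/*
 * New per-page behaviour: the address is an operand of the instruction.
 * INVLPG (and the INVPCID used for the user ASID under PTI) #GP on a
 * non-canonical address, which is where the oops points:
 * native_flush_tlb_one_user().
 */
static void flush_tlb_one_page(unsigned long addr)
{
	asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}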

2019-02-07 09:52:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Wed, Feb 06, 2019 at 04:33:54PM -0800, Dave Hansen wrote:

> I wonder if the patches that you bisected to just changed the flushing
> from being CR3-based (and not taking an address) to being INVPCID-based,
> and taking an address that is sensitive to canonicality.

That is indeed one of the things that patch series does; before this it
would always flush the world for the ARRAY interfaces.

I'll have a look at the code; I seem to remember there being tests for
canonical addresses; maybe they're not in the right place.

2019-02-07 10:19:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Wed, Feb 06, 2019 at 04:17:42PM -0800, Luck, Tony wrote:

> fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function")

Boris pointed me at this gem:

c7486104a5ce ("x86/mce: Fix set_mce_nospec() to avoid #GP fault")

(can I just revel at the pure awesome grossness of that)

But we then see my above commit having:

@@ -1732,11 +1685,6 @@ static int change_page_attr_set_clr(unsigned long *addr, int numpages,
*/
WARN_ON_ONCE(1);
}
- /*
- * Save address for cache flush. *addr is modified in the call
- * to __change_page_attr_set_clr() below.
- */
- baddr = make_addr_canonical_again(*addr);

Where I clearly got distracted by that excellent comment..


So now the question is where to put it back in; I'm thinking this might
want to be in __cpa_addr().

Something like so?

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 4f8972311a77..319b767484fb 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -230,6 +230,28 @@ static bool __cpa_pfn_in_highmap(unsigned long pfn)

#endif

+/*
+ * Machine check recovery code needs to change cache mode of poisoned
+ * pages to UC to avoid speculative access logging another error. But
+ * passing the address of the 1:1 mapping to set_memory_uc() is a fine
+ * way to encourage a speculative access. So we cheat and flip the top
+ * bit of the address. This works fine for the code that updates the
+ * page tables. But at the end of the process we need to flush the cache
+ * and the non-canonical address causes a #GP fault when used by the
+ * CLFLUSH instruction.
+ *
+ * But in the common case we already have a canonical address. This code
+ * will fix the top bit if needed and is a no-op otherwise.
+ */
+static inline unsigned long make_addr_canonical_again(unsigned long addr)
+{
+#ifdef CONFIG_X86_64
+ return (long)(addr << 1) >> 1;
+#else
+ return addr;
+#endif
+}
+
static unsigned long __cpa_addr(struct cpa_data *cpa, unsigned long idx)
{
if (cpa->flags & CPA_PAGES_ARRAY) {
@@ -244,7 +266,7 @@ static unsigned long __cpa_addr(struct cpa_data *cpa, unsigned long idx)
if (cpa->flags & CPA_ARRAY)
return cpa->vaddr[idx];

- return *cpa->vaddr + idx * PAGE_SIZE;
+ return make_addr_canonical_again(*cpa->vaddr + idx * PAGE_SIZE);
}

/*
@@ -1627,29 +1649,6 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias)
return ret;
}

-/*
- * Machine check recovery code needs to change cache mode of poisoned
- * pages to UC to avoid speculative access logging another error. But
- * passing the address of the 1:1 mapping to set_memory_uc() is a fine
- * way to encourage a speculative access. So we cheat and flip the top
- * bit of the address. This works fine for the code that updates the
- * page tables. But at the end of the process we need to flush the cache
- * and the non-canonical address causes a #GP fault when used by the
- * CLFLUSH instruction.
- *
- * But in the common case we already have a canonical address. This code
- * will fix the top bit if needed and is a no-op otherwise.
- */
-static inline unsigned long make_addr_canonical_again(unsigned long addr)
-{
-#ifdef CONFIG_X86_64
- return (long)(addr << 1) >> 1;
-#else
- return addr;
-#endif
-}
-
-
static int change_page_attr_set_clr(unsigned long *addr, int numpages,
pgprot_t mask_set, pgprot_t mask_clr,
int force_split, int in_flag,

2019-02-07 11:52:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Thu, Feb 7, 2019 at 10:18 AM Peter Zijlstra <[email protected]> wrote:
>
> So now the question is where to put it back in, I'm thinking this might
> want to be in __cpa_addr().

Wasn't one of the goals to *not* keep the real virtual address around
to avoid speculation, since we don't want some random speculation
using that address to then trigger more MCE's?

If you re-generate the canonical address in __cpa_addr(), now we'll
actually have the real virtual address around for a lot of code-paths
(pte lookup etc), which was what people wanted to avoid in the first
place.

NOTE! I may be *entirely* off on this, and just confused about exactly
which speculative accesses people were worried about.

But at the same time, looking at all the other cases, it does look
like we need to do it in __cpa_addr(), because we also have other tests
for an actual valid virtual address, i.e. all the

	vaddr = __cpa_addr(cpa, cpa->curpage);
	if (!(within(vaddr, PAGE_OFFSET,
		     PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT)))) {

kind of checks.

What's the exact rule and path from "set_mce_nospec()" (which is what
sets that "decoy" address with the high bit clear) to when we can then
use the address for TLB flushing?

Adding Dan Williams to the participants, because it's really only
set_mce_nospec() that triggers this, and we should try to be
careful that that path really doesn't use the actual real virtual
address. He touched it last.

But I guess the whole decoy_addr thing goes back to Tony Luck and
commit ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for
kernel 1:1 mappings of poison pages").

Do we really need it?

Linus

2019-02-07 14:02:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Thu, Feb 07, 2019 at 11:50:52AM +0000, Linus Torvalds wrote:
> If you re-generate the canonical address in __cpa_addr(), now we'll
> actually have the real virtual address around for a lot of code-paths
> (pte lookup etc), which was what people wanted to avoid in the first
> place.

Note that it's an 'unsigned long' address, not an actual pointer, and
(afaict) none of the code paths use it as a pointer. This _should_ keep
the CPU from following said pointer and doing a deref on it.

But yes, I didn't go all the way down, maybe I missed some.

2019-02-07 17:38:32

by Luck, Tony

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Thu, Feb 07, 2019 at 03:01:31PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 11:50:52AM +0000, Linus Torvalds wrote:
> > If you re-generate the canonical address in __cpa_addr(), now we'll
> > actually have the real virtual address around for a lot of code-paths
> > (pte lookup etc), which was what people wanted to avoid in the first
> > place.
>
> Note that it's an 'unsigned long' address, not an actual pointer, and
> (afaict) none of the code paths use it as a pointer. This _should_ keep
> the CPU from following said pointer and doing a deref on it.

The type doesn't matter. You want to avoid having the
true value in the register as long as possible. Ideal
spot would be the instruction before the TLB is flushed.

The speculative issue is that any branch you encounter
while you have the address in a register may be mispredicted.
You might also get a bogus hit in the branch target cache
and speculatively jump into the weeds. While there you
could find an instruction that loads using that register, and
even though it is speculative and the instruction won't
retire, a machine check log will be created in a bank (no
machine check is signalled).

Once the TLB is updated, you are safe. A speculative
access to an uncached address will not load or log anything.
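
In code terms that means something like the below (a kernel-context
sketch; fix_addr() is just a placeholder name for the conversion helper,
and it is essentially what the follow-up patch below does at the flush
call sites):

static void cpa_flush_tlb_pages(struct cpa_data *cpa)
{
	unsigned int i;

	for (i = 0; i < cpa->numpages; i++) {
		/*
		 * __cpa_addr() keeps handing out the decoy form; the
		 * canonical value only exists inside this one expression,
		 * immediately before the INVLPG, which is the window
		 * described above as safe enough.
		 */
		__flush_tlb_one_kernel(fix_addr(__cpa_addr(cpa, i)));
	}
}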

-Tony

2019-02-07 17:58:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Thu, Feb 07, 2019 at 09:36:00AM -0800, Luck, Tony wrote:
> On Thu, Feb 07, 2019 at 03:01:31PM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 07, 2019 at 11:50:52AM +0000, Linus Torvalds wrote:
> > > If you re-generate the canonical address in __cpa_addr(), now we'll
> > > actually have the real virtual address around for a lot of code-paths
> > > (pte lookup etc), which was what people wanted to avoid in the first
> > > place.
> >
> > Note that it's an 'unsigned long' address, not an actual pointer, and
> > (afaict) none of the code paths use it as a pointer. This _should_ keep
> > the CPU from following said pointer and doing a deref on it.
>
> The type doesn't matter. You want to avoid having the
> true value in the register as long as possible. Ideal
> spot would be the instruction before the TLB is flushed.
>
> The speculative issue is that any branch you encounter
> while you have the address in a register may be mispredicted.
> You might also get a bogus hit in the branch target cache
> and speculatively jump into the weeds. While there you
> could find an instruction that loads using that register, and
> even though it is speculative and the instruction won't
> retire, a machine check log will be created in a bank (no
> machine check is signalled).
>
> Once the TLB is updated, you are safe. A speculative
> access to an uncached address will not load or log anything.

Something like so then? AFAICT CLFLUSH will also #GP if you feed it crap.


diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 4f8972311a77..d3ae92ad72a6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -230,6 +230,28 @@ static bool __cpa_pfn_in_highmap(unsigned long pfn)

#endif

+/*
+ * Machine check recovery code needs to change cache mode of poisoned
+ * pages to UC to avoid speculative access logging another error. But
+ * passing the address of the 1:1 mapping to set_memory_uc() is a fine
+ * way to encourage a speculative access. So we cheat and flip the top
+ * bit of the address. This works fine for the code that updates the
+ * page tables. But at the end of the process we need to flush the cache
+ * and the non-canonical address causes a #GP fault when used by the
+ * CLFLUSH instruction.
+ *
+ * But in the common case we already have a canonical address. This code
+ * will fix the top bit if needed and is a no-op otherwise.
+ */
+static inline unsigned long fix_addr(unsigned long addr)
+{
+#ifdef CONFIG_X86_64
+ return (long)(addr << 1) >> 1;
+#else
+ return addr;
+#endif
+}
+
static unsigned long __cpa_addr(struct cpa_data *cpa, unsigned long idx)
{
if (cpa->flags & CPA_PAGES_ARRAY) {
@@ -313,7 +335,7 @@ void __cpa_flush_tlb(void *data)
unsigned int i;

for (i = 0; i < cpa->numpages; i++)
- __flush_tlb_one_kernel(__cpa_addr(cpa, i));
+ __flush_tlb_one_kernel(fix_addr(__cpa_addr(cpa, i)));
}

static void cpa_flush(struct cpa_data *data, int cache)
@@ -347,7 +369,7 @@ static void cpa_flush(struct cpa_data *data, int cache)
* Only flush present addresses:
*/
if (pte && (pte_val(*pte) & _PAGE_PRESENT))
- clflush_cache_range_opt((void *)addr, PAGE_SIZE);
+ clflush_cache_range_opt((void *)fix_addr(addr), PAGE_SIZE);
}
mb();
}
@@ -1627,29 +1649,6 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias)
return ret;
}

-/*
- * Machine check recovery code needs to change cache mode of poisoned
- * pages to UC to avoid speculative access logging another error. But
- * passing the address of the 1:1 mapping to set_memory_uc() is a fine
- * way to encourage a speculative access. So we cheat and flip the top
- * bit of the address. This works fine for the code that updates the
- * page tables. But at the end of the process we need to flush the cache
- * and the non-canonical address causes a #GP fault when used by the
- * CLFLUSH instruction.
- *
- * But in the common case we already have a canonical address. This code
- * will fix the top bit if needed and is a no-op otherwise.
- */
-static inline unsigned long make_addr_canonical_again(unsigned long addr)
-{
-#ifdef CONFIG_X86_64
- return (long)(addr << 1) >> 1;
-#else
- return addr;
-#endif
-}
-
-
static int change_page_attr_set_clr(unsigned long *addr, int numpages,
pgprot_t mask_set, pgprot_t mask_clr,
int force_split, int in_flag,

2019-02-07 18:08:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21



> On Feb 7, 2019, at 9:57 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Thu, Feb 07, 2019 at 09:36:00AM -0800, Luck, Tony wrote:
>>> On Thu, Feb 07, 2019 at 03:01:31PM +0100, Peter Zijlstra wrote:
>>>> On Thu, Feb 07, 2019 at 11:50:52AM +0000, Linus Torvalds wrote:
>>>> If you re-generate the canonical address in __cpa_addr(), now we'll
>>>> actually have the real virtual address around for a lot of code-paths
>>>> (pte lookup etc), which was what people wanted to avoid in the first
>>>> place.
>>>
>>> Note that it's an 'unsigned long' address, not an actual pointer, and
>>> (afaict) none of the code paths use it as a pointer. This _should_ keep
>>> the CPU from following said pointer and doing a deref on it.
>>
>> The type doesn't matter. You want to avoid having the
>> true value in the register as long as possible. Ideal
>> spot would be the instruction before the TLB is flushed.
>>
>> The speculative issue is that any branch you encounter
>> while you have the address in a register may be mispredicted.
>> You might also get a bogus hit in the branch target cache
>> and speculatively jump into the weeds. While there you
>> could find an instruction that loads using that register, and
>> even though it is speculative and the instruction won't
>> retire, a machine check log will be created in a bank (no
>> machine check is signalled).
>>
>> Once the TLB is updated, you are safe. A speculative
>> access to an uncached address will not load or log anything.
>
> Something like so then? AFAICT CLFLUSH will also #GP if you feed it crap.
>
>
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index 4f8972311a77..d3ae92ad72a6 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -230,6 +230,28 @@ static bool __cpa_pfn_in_highmap(unsigned long pfn)
>
> #endif
>
> +/*
> + * Machine check recovery code needs to change cache mode of poisoned
> + * pages to UC to avoid speculative access logging another error. But
> + * passing the address of the 1:1 mapping to set_memory_uc() is a fine
> + * way to encourage a speculative access. So we cheat and flip the top
> + * bit of the address. This works fine for the code that updates the
> + * page tables. But at the end of the process we need to flush the cache
> + * and the non-canonical address causes a #GP fault when used by the
> + * CLFLUSH instruction.
> + *
> + * But in the common case we already have a canonical address. This code
> + * will fix the top bit if needed and is a no-op otherwise.

Joining this thread late...

This is all IMO rather crazy. How about we fiddle with CR0 to turn off the cache, then fiddle with page tables, then turn caching on? Or, heck, see if there’s some chicken bit we can set to improve the situation while we’re in the MCE handler.
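
(For concreteness, the CR0 fiddling meant here is roughly the
prepare_set()/post_set() pattern from the MTRR code; the sketch below
assumes that pattern and omits locking, interrupts and the MTRR/PAT
details entirely.)

/* Kernel-context sketch only, not a proposal-quality implementation. */
#define X86_CR0_CD	(1UL << 30)	/* CR0.CD: cache disable */

static unsigned long saved_cr0;

static void cache_disable(void)
{
	asm volatile("mov %%cr0, %0" : "=r" (saved_cr0));
	/* Set CD, then write back and invalidate the caches. */
	asm volatile("mov %0, %%cr0" : : "r" (saved_cr0 | X86_CR0_CD) : "memory");
	asm volatile("wbinvd" : : : "memory");
}

static void cache_enable(void)
{
	asm volatile("mov %0, %%cr0" : : "r" (saved_cr0) : "memory");
}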

Also, since I don’t really want to dig into the code to answer this, how exactly do we do a broadcast TLB flush from MCE context? We’re super-duper-atomic, and locks might be held on various CPUs. Shouldn’t we be telling the cpa code to skip the flush and then just have the MCE code do a full flush manually? The MCE code has already taken over all CPUs on non-LMCE systems.

Or, better yet, get Intel to fix the hardware. A failed speculative access while already in MCE context should just be ignored.

2019-02-07 18:42:25

by Luck, Tony

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Thu, Feb 07, 2019 at 06:57:20PM +0100, Peter Zijlstra wrote:
> Something like so then? AFAICT CLFLUSH will also #GP if you feed it crap.

Correct. CLFLUSH will also #GP on a non-canonical address.

> - __flush_tlb_one_kernel(__cpa_addr(cpa, i));
> + __flush_tlb_one_kernel(fix_addr(__cpa_addr(cpa, i)));


> - clflush_cache_range_opt((void *)addr, PAGE_SIZE);
> + clflush_cache_range_opt((void *)fix_addr(addr), PAGE_SIZE);

Looks like that's as close as you can get without being crazy. Many/most
of the apparent function calls here are going to be inlined. So
likely safe.

Tried it out and it works!

So:

Tested-by: Tony Luck <[email protected]>

Thanks

-Tony

2019-02-07 18:46:41

by Luck, Tony

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Thu, Feb 07, 2019 at 10:07:28AM -0800, Andy Lutomirski wrote:
> Joining this thread late...
>
> This is all IMO rather crazy. How about we fiddle with CR0 to turn off
> the cache, then fiddle with page tables, then turn caching on? Or, heck,
> see if there’s some chicken bit we can set to improve the situation
> while we’re in the MCE handler.

> Also, since I don’t really want
> to dig into the code to answer this, how exactly do we do a broadcast TLB
> flush from MCE context? We’re super-duper-atomic, and locks might be
> held on various CPUs. Shouldn’t we be telling the cpa code to skip
> the flush and then just have the MCE code do a full flush manually?
> The MCE code has already taken over all CPUs on non-LMCE systems.

The MCE code doesn't do this while still in MCE context. You helped
restructure this code so the recovery bits happen after we call

ist_begin_non_atomic(regs);

on just the CPU that hit the error (in the broadcast case we've
let the others out of MCE jail by this point).

So there is a small window where we know the broken page is still
mapped WB in the kernel 1:1 map. But we just live dangerously for
a few more microseconds until we can fix the map.

> Or, better yet, get Intel to fix the hardware. A failed speculative
> access while already in MCE context should just be ignored.

Good idea.

-Tony

2019-02-07 20:25:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21



> On Feb 7, 2019, at 10:46 AM, Luck, Tony <[email protected]> wrote:
>
>> On Thu, Feb 07, 2019 at 10:07:28AM -0800, Andy Lutomirski wrote:
>> Joining this thread late...
>>
>> This is all IMO rather crazy. How about we fiddle with CR0 to turn off
>> the cache, then fiddle with page tables, then turn caching on? Or, heck,
>> see if there’s some chicken bit we can set to improve the situation
>> while we’re in the MCE handler.
>
>> Also, since I don’t really want
>> to dig into the code to answer this, how exactly do we do a broadcast TLB
>> flush from MCE context? We’re super-duper-atomic, and locks might be
>> held on various CPUs. Shouldn’t we be telling the cpa code to skip
>> the flush and then just have the MCE code do a full flush manually?
>> The MCE code has already taken over all CPUs on non-LMCE systems.
>
> MCE code doesn't do this while still in MCE context. You helped
> restructure this code so the recovery bits happen after we call
>
> ist_begin_non_atomic(regs);
>
> on just the CPU that hit the error (in the broadcast case we've
> let the other out of MCE jail by this point).
>
> So there is a small window where we know the broken page is still
> mapped WB in the kernel 1:1 map. But we just live dangerously for
> a few more microseconds until we can fix the map.

Hmm.

How bad would it be to set CR0.CD while fiddling with the page tables rather than masking the address?

>
>> Or, better yet, get Intel to fix the hardware. A failed speculative
>> access while already in MCE context should just be ignored.
>
> Good idea.
>
> -Tony

2019-02-07 22:54:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21

On Thu, Feb 7, 2019 at 1:24 PM Andy Lutomirski <[email protected]> wrote:
>
> How bad would it be to set CR0.CD while fiddling with the page tables rather than masking the address?

I would suggest against it. When you disable caching, things don't
just go "slightly slower". Everything comes to a screeching halt, with
instruction fetch etc becoming a serious problem.

So disabling caching for a few instructions in a very simple asm
snippet might be reasonable (particularly since you could pre-fetch
the cache and just rely on CR0.CD not fetching *new* lines). But doing
it in C code that might have things like tracing enabled etc? I'd be
very very leery.

Linus

2019-02-07 23:08:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [GIT PULL] x86/mm changes for v4.21



> On Feb 7, 2019, at 2:53 PM, Linus Torvalds <[email protected]> wrote:
>
>> On Thu, Feb 7, 2019 at 1:24 PM Andy Lutomirski <[email protected]> wrote:
>>
>> How bad would it be to set CR0.CD while fiddling with the page tables rather than masking the address?
>
> I would suggest against it. When you disable caching, things don't
> just go "slightly slower". Everything comes to a screeching halt, with
> instruction fetch etc becoming a serious problem.
>
> So disabling caching for a few instructions in a very simple asm
> snippet might be reasonable (particularly since you could pre-fetch
> the cache and just rely on CR0.CD not fetching *new* lines). But doing
> it in C code that might have things like tracing enabled etc? I'd be
> very very leery.
>
>

In principle, it’s just the code that walks the page tables and changes the mode. But we could get a perf NMI and take who knows how long to process it if caching is off.

2019-02-08 12:09:32

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] x86/mm/cpa: Fix set_mce_nospec()

On Thu, Feb 07, 2019 at 10:40:21AM -0800, Luck, Tony wrote:

> Tried it out and it works!

Excellent, find below a proper patch!

---
Subject: x86/mm/cpa: Fix set_mce_nospec()

In commit:

fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array()
into a single cpa_flush() function")

I accidentally made the call to make_addr_canonical_again() go away,
which breaks set_mce_nospec().

Re-instate the call to convert the address back into canonical form
right before we do either CLFLUSH or INVLPG. Rename the function while
at it to be shorter (and less MAGA).

Reported-by: Tony Luck <[email protected]>
Tested-by: Tony Luck <[email protected]>
Fixes: fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/mm/pageattr.c | 50 +++++++++++++++++++++++++-------------------------
1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 4f8972311a77..14e6119838a6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -230,6 +230,29 @@ static bool __cpa_pfn_in_highmap(unsigned long pfn)

#endif

+/*
+ * See set_mce_nospec().
+ *
+ * Machine check recovery code needs to change cache mode of poisoned pages to
+ * UC to avoid speculative access logging another error. But passing the
+ * address of the 1:1 mapping to set_memory_uc() is a fine way to encourage a
+ * speculative access. So we cheat and flip the top bit of the address. This
+ * works fine for the code that updates the page tables. But at the end of the
+ * process we need to flush the TLB and cache and the non-canonical address
+ * causes a #GP fault when used by the INVLPG and CLFLUSH instructions.
+ *
+ * But in the common case we already have a canonical address. This code
+ * will fix the top bit if needed and is a no-op otherwise.
+ */
+static inline unsigned long fix_addr(unsigned long addr)
+{
+#ifdef CONFIG_X86_64
+ return (long)(addr << 1) >> 1;
+#else
+ return addr;
+#endif
+}
+
static unsigned long __cpa_addr(struct cpa_data *cpa, unsigned long idx)
{
if (cpa->flags & CPA_PAGES_ARRAY) {
@@ -313,7 +336,7 @@ void __cpa_flush_tlb(void *data)
unsigned int i;

for (i = 0; i < cpa->numpages; i++)
- __flush_tlb_one_kernel(__cpa_addr(cpa, i));
+ __flush_tlb_one_kernel(fix_addr(__cpa_addr(cpa, i)));
}

static void cpa_flush(struct cpa_data *data, int cache)
@@ -347,7 +370,7 @@ static void cpa_flush(struct cpa_data *data, int cache)
* Only flush present addresses:
*/
if (pte && (pte_val(*pte) & _PAGE_PRESENT))
- clflush_cache_range_opt((void *)addr, PAGE_SIZE);
+ clflush_cache_range_opt((void *)fix_addr(addr), PAGE_SIZE);
}
mb();
}
@@ -1627,29 +1650,6 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias)
return ret;
}

-/*
- * Machine check recovery code needs to change cache mode of poisoned
- * pages to UC to avoid speculative access logging another error. But
- * passing the address of the 1:1 mapping to set_memory_uc() is a fine
- * way to encourage a speculative access. So we cheat and flip the top
- * bit of the address. This works fine for the code that updates the
- * page tables. But at the end of the process we need to flush the cache
- * and the non-canonical address causes a #GP fault when used by the
- * CLFLUSH instruction.
- *
- * But in the common case we already have a canonical address. This code
- * will fix the top bit if needed and is a no-op otherwise.
- */
-static inline unsigned long make_addr_canonical_again(unsigned long addr)
-{
-#ifdef CONFIG_X86_64
- return (long)(addr << 1) >> 1;
-#else
- return addr;
-#endif
-}
-
-
static int change_page_attr_set_clr(unsigned long *addr, int numpages,
pgprot_t mask_set, pgprot_t mask_clr,
int force_split, int in_flag,

Subject: [tip:x86/urgent] x86/mm/cpa: Fix set_mce_nospec()

Commit-ID: 0521e8be211cd20d547bff9da2534b7ed6f2c1b9
Gitweb: https://git.kernel.org/tip/0521e8be211cd20d547bff9da2534b7ed6f2c1b9
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 8 Feb 2019 13:08:59 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 8 Feb 2019 14:31:56 +0100

x86/mm/cpa: Fix set_mce_nospec()

The recent commit fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and
cpa_flush_array() into a single cpa_flush() function") accidentally made
the call to make_addr_canonical_again() go away, which breaks
set_mce_nospec().

Re-instate the call to convert the address back into canonical form right
before invoking either CLFLUSH or INVLPG. Rename the function while at it
to be shorter (and less MAGA).

Fixes: fe0937b24ff5 ("x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function")
Reported-by: Tony Luck <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Tony Luck <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/mm/pageattr.c | 50 +++++++++++++++++++++++++-------------------------
1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 4f8972311a77..14e6119838a6 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -230,6 +230,29 @@ static bool __cpa_pfn_in_highmap(unsigned long pfn)

#endif

+/*
+ * See set_mce_nospec().
+ *
+ * Machine check recovery code needs to change cache mode of poisoned pages to
+ * UC to avoid speculative access logging another error. But passing the
+ * address of the 1:1 mapping to set_memory_uc() is a fine way to encourage a
+ * speculative access. So we cheat and flip the top bit of the address. This
+ * works fine for the code that updates the page tables. But at the end of the
+ * process we need to flush the TLB and cache and the non-canonical address
+ * causes a #GP fault when used by the INVLPG and CLFLUSH instructions.
+ *
+ * But in the common case we already have a canonical address. This code
+ * will fix the top bit if needed and is a no-op otherwise.
+ */
+static inline unsigned long fix_addr(unsigned long addr)
+{
+#ifdef CONFIG_X86_64
+ return (long)(addr << 1) >> 1;
+#else
+ return addr;
+#endif
+}
+
static unsigned long __cpa_addr(struct cpa_data *cpa, unsigned long idx)
{
if (cpa->flags & CPA_PAGES_ARRAY) {
@@ -313,7 +336,7 @@ void __cpa_flush_tlb(void *data)
unsigned int i;

for (i = 0; i < cpa->numpages; i++)
- __flush_tlb_one_kernel(__cpa_addr(cpa, i));
+ __flush_tlb_one_kernel(fix_addr(__cpa_addr(cpa, i)));
}

static void cpa_flush(struct cpa_data *data, int cache)
@@ -347,7 +370,7 @@ static void cpa_flush(struct cpa_data *data, int cache)
* Only flush present addresses:
*/
if (pte && (pte_val(*pte) & _PAGE_PRESENT))
- clflush_cache_range_opt((void *)addr, PAGE_SIZE);
+ clflush_cache_range_opt((void *)fix_addr(addr), PAGE_SIZE);
}
mb();
}
@@ -1627,29 +1650,6 @@ out:
return ret;
}

-/*
- * Machine check recovery code needs to change cache mode of poisoned
- * pages to UC to avoid speculative access logging another error. But
- * passing the address of the 1:1 mapping to set_memory_uc() is a fine
- * way to encourage a speculative access. So we cheat and flip the top
- * bit of the address. This works fine for the code that updates the
- * page tables. But at the end of the process we need to flush the cache
- * and the non-canonical address causes a #GP fault when used by the
- * CLFLUSH instruction.
- *
- * But in the common case we already have a canonical address. This code
- * will fix the top bit if needed and is a no-op otherwise.
- */
-static inline unsigned long make_addr_canonical_again(unsigned long addr)
-{
-#ifdef CONFIG_X86_64
- return (long)(addr << 1) >> 1;
-#else
- return addr;
-#endif
-}
-
-
static int change_page_attr_set_clr(unsigned long *addr, int numpages,
pgprot_t mask_set, pgprot_t mask_clr,
int force_split, int in_flag,