2023-07-14 18:49:30

by Dave Hansen

[permalink] [raw]
Subject: [PATCH] x86/mm: Remove "INVPCID single" feature tracking


From: Dave Hansen <[email protected]>

tl;dr: Replace a synthetic X86_FEATURE with a hardware X86_FEATURE
and check of existing per-cpu state.

== Background ==

There are three features in play here:
1. Good old Page Table Isolation (PTI)
2. Process Context IDentifiers (PCIDs) which allow entries from
multiple address spaces to be in the TLB at once.
3. Support for the "Invalidate PCID" (INVPCID) instruction,
specifically the "individual address" mode (aka. mode 0).

When all *three* of these are in place, INVPCID can and should be used
to flush out individual addresses in the PTI user address space.

But there's a wrinkle or two: First, this INVPCID mode is dependent on
CR4.PCIDE. Even if X86_FEATURE_INVPCID==1, the instruction may #GP
without setting up CR4. Second, TLB flushing is done very early, even
before CR4 is fully set up. That means even if PTI, PCID and INVPCID
are supported, there is *still* a window where INVPCID can #GP.

== Problem ==

The current code seems to work, but mostly by chance and there are a
bunch of ways it can go wrong. It's also somewhat hard to follow
since X86_FEATURE_INVPCID_SINGLE is set far away from its lone user.

== Solution ==

Make "INVPCID single" more robust and easier to follow by placing all
the logic in one place. Remove X86_FEATURE_INVPCID_SINGLE.

Make two explicit checks before using INVPCID:
1. Check that the system supports INVPCID itself (boot_cpu_has())
2. Then check the CR4.PCIDE shadow to ensures that the CPU
can safely use INVPCID for individual address invalidation.

The CR4 check *always* works and is not affected by any X86_FEATURE_*
twiddling or inconsistencies between the boot and secondary CPUs.

This has been tested on non-Meltdown hardware by using pti=on and
then flipping PCID and INVPCID support with qemu.

== Aside ==

How does this code even work today? By chance, I think. First, PTI
is initialized around the same time that the boot CPU sets
CR4.PCIDE=1. There are currently no TLB invalidations when PTI=1 but
CR4.PCIDE=0. That means that the X86_FEATURE_INVPCID_SINGLE check is
never even reached.

this_cpu_has() is also very nasty to use in this context because the
boot CPU reaches here before cpu_data(0) has been initialized. It
happens to work for X86_FEATURE_INVPCID_SINGLE since it's a
software-defined feature but it would fall over for a hardware-
derived X86_FEATURE.

Signed-off-by: Dave Hansen <[email protected]>
Reported-by: Jann Horn <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Cooper <[email protected]>
---

b/arch/x86/include/asm/cpufeatures.h | 1 -
b/arch/x86/mm/init.c | 9 ---------
b/arch/x86/mm/tlb.c | 14 +++++++++-----
3 files changed, 9 insertions(+), 15 deletions(-)

diff -puN arch/x86/include/asm/cpufeatures.h~remove-invpcid-single arch/x86/include/asm/cpufeatures.h
--- a/arch/x86/include/asm/cpufeatures.h~remove-invpcid-single 2023-07-14 08:29:08.661225942 -0700
+++ b/arch/x86/include/asm/cpufeatures.h 2023-07-14 08:29:08.673225955 -0700
@@ -198,7 +198,6 @@
#define X86_FEATURE_CAT_L3 ( 7*32+ 4) /* Cache Allocation Technology L3 */
#define X86_FEATURE_CAT_L2 ( 7*32+ 5) /* Cache Allocation Technology L2 */
#define X86_FEATURE_CDP_L3 ( 7*32+ 6) /* Code and Data Prioritization L3 */
-#define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
#define X86_FEATURE_HW_PSTATE ( 7*32+ 8) /* AMD HW-PState */
#define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
#define X86_FEATURE_XCOMPACTED ( 7*32+10) /* "" Use compacted XSTATE (XSAVES or XSAVEC) */
diff -puN arch/x86/mm/init.c~remove-invpcid-single arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~remove-invpcid-single 2023-07-14 08:29:08.665225945 -0700
+++ b/arch/x86/mm/init.c 2023-07-14 08:29:08.673225955 -0700
@@ -307,15 +307,6 @@ static void setup_pcid(void)
* start_secondary().
*/
cr4_set_bits(X86_CR4_PCIDE);
-
- /*
- * INVPCID's single-context modes (2/3) only work if we set
- * X86_CR4_PCIDE, *and* we INVPCID support. It's unusable
- * on systems that have X86_CR4_PCIDE clear, or that have
- * no INVPCID support at all.
- */
- if (boot_cpu_has(X86_FEATURE_INVPCID))
- setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
} else {
/*
* flush_tlb_all(), as currently implemented, won't work if
diff -puN arch/x86/mm/tlb.c~remove-invpcid-single arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~remove-invpcid-single 2023-07-14 08:29:08.665225945 -0700
+++ b/arch/x86/mm/tlb.c 2023-07-14 08:29:08.673225955 -0700
@@ -1141,20 +1141,24 @@ void flush_tlb_one_kernel(unsigned long
STATIC_NOPV void native_flush_tlb_one_user(unsigned long addr)
{
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ bool cpu_pcide = this_cpu_read(cpu_tlbstate.cr4) & X86_CR4_PCIDE;

+ /* Flush 'addr' from the kernel PCID: */
asm volatile("invlpg (%0)" ::"r" (addr) : "memory");

+ /* If PTI is off there is no user PCID and nothing to flush. */
if (!static_cpu_has(X86_FEATURE_PTI))
return;

/*
- * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
- * Just use invalidate_user_asid() in case we are called early.
+ * invpcid_flush_one(pcid>0) will #GP if CR4.PCIDE==0. Check
+ * 'cpu_pcide' to ensure that *this* CPU will not trigger those
+ * #GP's even if called before CR4.PCIDE has been initialized.
*/
- if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE))
- invalidate_user_asid(loaded_mm_asid);
- else
+ if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_pcide)
invpcid_flush_one(user_pcid(loaded_mm_asid), addr);
+ else
+ invalidate_user_asid(loaded_mm_asid);
}

void flush_tlb_one_user(unsigned long addr)
_


2023-07-15 00:15:10

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/mm: Remove "INVPCID single" feature tracking

On 7/14/23 13:27, [email protected] wrote:
>> + /* If PTI is off there is no user PCID and nothing to flush. */
>> if (!static_cpu_has(X86_FEATURE_PTI))
>> return;
>
> As a minor observation, the common case is for the function to exit
> here, but you've got both this_cpu_read()'s ahead of a full compiler
> memory barrier.
>
> If you move them here, you'll drop the reads from the common case.
> But...

That's a good point. I was depending on the generosity of the compiler
but the invlpg throws that out the window. I'll move them around.

>> /*
>> - * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
>> - * Just use invalidate_user_asid() in case we are called early.
>> + * invpcid_flush_one(pcid>0) will #GP if CR4.PCIDE==0. Check
>> + * 'cpu_pcide' to ensure that *this* CPU will not trigger those
>> + * #GP's even if called before CR4.PCIDE has been initialized.
>> */
>> - if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE))
>> - invalidate_user_asid(loaded_mm_asid);
>> - else
>> + if (boot_cpu_has(X86_FEATURE_INVPCID) && cpu_pcide)
> ... why can't this just be && loaded_mm_asid ?
>
> There's no plausible way the asid can be nonzero here without CR4.PCIDE
> being set, and that avoids looking at cr4 directly.

Except that 0 is a valid, normal 'loaded_mm_asid' value. It would be
quite possible to have loaded_mm_asid==0 during normal runtime which
would drop down into the invalidate_user_asid() case. It would work,
but it would be unnecessarily destructive to the TLB.

I guess we _could_ adjust the asids to go from 1=>TLB_NR_DYN_ASIDS
instead of 0=>TLB_NR_DYN_ASIDS-1. *But*, PTI is slow path code these
days. I'd rather read one more (presumably) cache hot variable that's
logically clear than go messing with the ASID allocation code making
ASID 0 even more special.