*** PATCHES 1 and 2 MAY BE 4.9 MATERIAL ***
Alan Cox pointed out that the 486 isn't the only supported CPU that
doesn't have CPUID. Let's clean up the mess and make everything
faster while we're at it.
Patch 1 is intended to be an easy fix: it makes sync_core() work
without CPUID on all 32-bit kernels. It should be quite safe. This
will have a negligible performance cost during boot on kernels built
for newer CPUs. With this in place, patch 2 reverts the buggy 486
check I added.
Patches 3-4 are meant to improve the situation. Patch 3 cleans up
the Intel microcode loader and patch 4 (which depends on patch 3
to work correctly) stops using CPUID in sync_core() altogether.
Changes from v2:
- Switch to IRET-to-self and get rid of all the paravirt code.
- Further improve the sync_core() comment.
Changes from v1:
- Fix Xen
- Add timing info to the changelog (hint: 2x speedup)
- Document patch 1 a bit better.
Andy Lutomirski (4):
x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit
kernels
Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"
x86/microcode/intel: Replace sync_core() with native_cpuid()
x86/asm: Rewrite sync_core() to use IRET-to-self
arch/x86/boot/cpu.c | 6 ---
arch/x86/include/asm/processor.h | 77 +++++++++++++++++++++++++----------
arch/x86/kernel/cpu/microcode/intel.c | 26 ++++++++++--
3 files changed, 78 insertions(+), 31 deletions(-)
--
2.9.3
Aside from being excessively slow, CPUID is problematic: Linux runs
on a handful of CPUs that don't have CPUID. Use IRET-to-self
instead. IRET-to-self works everywhere, so it makes testing easy.
For reference, on my laptop, IRET-to-self is ~110ns,
CPUID(eax=1, ecx=0) is ~83ns on native and very very slow under KVM,
and MOV-to-CR2 is ~42ns.
While we're at it: sync_core() serves a very specific purpose.
Document it.
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
1 file changed, 55 insertions(+), 22 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 64fbc937d586..201a956e345f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
#define cpu_relax_lowlatency() cpu_relax()
-/* Stop speculative execution and prefetching of modified code. */
+/*
+ * This function forces the icache and prefetched instruction stream to
+ * catch up with reality in two very specific cases:
+ *
+ * a) Text was modified using one virtual address and is about to be executed
+ * from the same physical page at a different virtual address.
+ *
+ * b) Text was modified on a different CPU, may subsequently be
+ * executed on this CPU, and you want to make sure the new version
+ * gets executed. This generally means you're calling this in an IPI.
+ *
+ * If you're calling this for a different reason, you're probably doing
+ * it wrong.
+ */
static inline void sync_core(void)
{
- int tmp;
-
-#ifdef CONFIG_X86_32
/*
- * Do a CPUID if available, otherwise do a jump. The jump
- * can conveniently enough be the jump around CPUID.
+ * There are quite a few ways to do this. IRET-to-self is nice
+ * because it works on every CPU, at any CPL (so it's compatible
+ * with paravirtualization), and it never exits to a hypervisor.
+ * The only down sides are that it's a bit slow (it seems to be
+ * a bit more than 2x slower than the fastest options) and that
+ * it unmasks NMIs. The "push %cs" is needed because, in
+ * paravirtual environments, __KERNEL_CS may not be a valid CS
+ * value when we do IRET directly.
+ *
+ * In case NMI unmasking or performance ever becomes a problem,
+ * the next best option appears to be MOV-to-CR2 and an
+ * unconditional jump. That sequence also works on all CPUs,
+ * but it will fault at CPL3.
+ *
+ * CPUID is the conventional way, but it's nasty: it doesn't
+ * exist on some 486-like CPUs, and it usually exits to a
+ * hypervisor.
*/
- asm volatile("cmpl %2,%1\n\t"
- "jl 1f\n\t"
- "cpuid\n"
- "1:"
- : "=a" (tmp)
- : "rm" (boot_cpu_data.cpuid_level), "ri" (0), "0" (1)
- : "ebx", "ecx", "edx", "memory");
+ register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_32
+ asm volatile (
+ "pushfl\n\t"
+ "pushl %%cs\n\t"
+ "pushl $1f\n\t"
+ "iret\n\t"
+ "1:"
+ : "+r" (__sp) : : "cc", "memory");
#else
- /*
- * CPUID is a barrier to speculative execution.
- * Prefetched instructions are automatically
- * invalidated when modified.
- */
- asm volatile("cpuid"
- : "=a" (tmp)
- : "0" (1)
- : "ebx", "ecx", "edx", "memory");
+ unsigned long tmp;
+
+ asm volatile (
+ "movq %%ss, %0\n\t"
+ "pushq %0\n\t"
+ "pushq %%rsp\n\t"
+ "addq $8, (%%rsp)\n\t"
+ "pushfq\n\t"
+ "movq %%cs, %0\n\t"
+ "pushq %0\n\t"
+ "pushq $1f\n\t"
+ "iretq\n\t"
+ "1:"
+ : "=r" (tmp), "+r" (__sp) : : "cc", "memory");
#endif
}
--
2.9.3
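
The timing figures in the changelog above can be approximated from user space. The following is a minimal sketch and not the method used for the numbers quoted: it times only CPUID(eax=1, ecx=0), since MOV-to-CR2 faults outside ring 0. The helper names cpuid_leaf1() and bench_ns_per_call() are made up for illustration, and on non-x86 builds the CPUID helper is an empty stub.

```c
#include <time.h>

/* Hypothetical helper: execute CPUID leaf 1, subleaf 0 and discard the result. */
static void cpuid_leaf1(void)
{
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax = 1, ebx, ecx = 0, edx;

	/* "memory" keeps the compiler from moving accesses across the CPUID. */
	__asm__ __volatile__("cpuid"
			     : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
			     : : "memory");

	/* Results are intentionally unused; only the cost is of interest. */
	(void)eax; (void)ebx; (void)ecx; (void)edx;
#endif
}

/*
 * Hypothetical helper: average cost of fn() in nanoseconds over iters calls.
 * clock() reports CPU time with coarse granularity, so use many iterations.
 */
static double bench_ns_per_call(void (*fn)(void), unsigned long iters)
{
	clock_t start, end;
	unsigned long i;

	if (iters == 0)
		return 0.0;

	start = clock();
	for (i = 0; i < iters; i++)
		fn();
	end = clock();

	return (double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}
```

Calling bench_ns_per_call(cpuid_leaf1, 1000000) returns an average in nanoseconds. Under KVM or a similar hypervisor the CPUID figure is expected to be much larger, because CPUID usually exits to the hypervisor.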
We support various non-Intel CPUs that don't have the CPUID
instruction, so the M486 test was wrong. For now, fix it with a big
hammer: handle missing CPUID on all 32-bit CPUs.
Reported-by: One Thousand Gnomes <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 984a7bf17f6a..64fbc937d586 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -595,7 +595,7 @@ static inline void sync_core(void)
{
int tmp;
-#ifdef CONFIG_M486
+#ifdef CONFIG_X86_32
/*
* Do a CPUID if available, otherwise do a jump. The jump
* can conveniently enough be the jump around CPUID.
--
2.9.3
The Intel microcode driver is using sync_core() to mean "do CPUID
with EAX=1". I want to rework sync_core(), but first the Intel
microcode driver needs to stop depending on its current behavior.
Reported-by: Henrique de Moraes Holschuh <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/cpu/microcode/intel.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index cdc0deab00c9..e0981bb2a351 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -356,6 +356,26 @@ get_matching_model_microcode(unsigned long start, void *data, size_t size,
return state;
}
+static void cpuid_1(void)
+{
+ /*
+ * According to the Intel SDM, Volume 3, 9.11.7:
+ *
+ * CPUID returns a value in a model specific register in
+ * addition to its usual register return values. The
+ * semantics of CPUID cause it to deposit an update ID value
+ * in the 64-bit model-specific register at address 08BH
+ * (IA32_BIOS_SIGN_ID). If no update is present in the
+ * processor, the value in the MSR remains unmodified.
+ *
+ * Use native_cpuid -- this code runs very early and we don't
+ * want to mess with paravirt.
+ */
+ unsigned int eax = 1, ebx, ecx = 0, edx;
+
+ native_cpuid(&eax, &ebx, &ecx, &edx);
+}
+
static int collect_cpu_info_early(struct ucode_cpu_info *uci)
{
unsigned int val[2];
@@ -385,7 +405,7 @@ static int collect_cpu_info_early(struct ucode_cpu_info *uci)
native_wrmsrl(MSR_IA32_UCODE_REV, 0);
/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_1();
/* get the current revision from MSR 0x8B */
native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -627,7 +647,7 @@ static int apply_microcode_early(struct ucode_cpu_info *uci, bool early)
native_wrmsrl(MSR_IA32_UCODE_REV, 0);
/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_1();
/* get the current revision from MSR 0x8B */
native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -927,7 +947,7 @@ static int apply_microcode_intel(int cpu)
wrmsrl(MSR_IA32_UCODE_REV, 0);
/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_1();
/* get the current revision from MSR 0x8B */
rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
--
2.9.3
This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.
The patch wasn't quite correct -- there are non-Intel (and hence
non-486) CPUs that we support that don't have CPUID. Since we no
longer require CPUID for sync_core(), just revert the patch.
I think the relevant CPUs are Geode and Elan, but I'm not sure.
In principle, we should try to do better at identifying CPUID-less
CPUs in early boot, but that's more complicated.
Reported-by: One Thousand Gnomes <[email protected]>
Cc: Matthew Whitehead <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/boot/cpu.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/arch/x86/boot/cpu.c b/arch/x86/boot/cpu.c
index 4224ede43b4e..26240dde081e 100644
--- a/arch/x86/boot/cpu.c
+++ b/arch/x86/boot/cpu.c
@@ -87,12 +87,6 @@ int validate_cpu(void)
return -1;
}
- if (CONFIG_X86_MINIMUM_CPU_FAMILY <= 4 && !IS_ENABLED(CONFIG_M486) &&
- !has_eflag(X86_EFLAGS_ID)) {
- printf("This kernel requires a CPU with the CPUID instruction. Build with CONFIG_M486=y to run on this CPU.\n");
- return -1;
- }
-
if (err_flags) {
puts("This kernel requires the following features "
"not present on the CPU:\n");
--
2.9.3
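
The check removed by this revert uses has_eflag(X86_EFLAGS_ID), which probes for CPUID by trying to flip bit 21 of EFLAGS. A rough user-space sketch of that probe follows. It is not the boot code itself: the names id_bit_toggled() and cpu_has_cpuid() are hypothetical, and the 64-bit and non-x86 branches simply return a constant (CPUID always exists in long mode).

```c
#include <stdbool.h>

#define X86_EFLAGS_ID (1UL << 21)	/* EFLAGS.ID, writable only if CPUID exists */

/* Did the ID bit differ between two EFLAGS snapshots? */
static bool id_bit_toggled(unsigned long before, unsigned long after)
{
	return ((before ^ after) & X86_EFLAGS_ID) != 0;
}

/* Hypothetical probe: true if the CPUID instruction can be used. */
static bool cpu_has_cpuid(void)
{
#if defined(__x86_64__)
	return true;			/* guaranteed on every 64-bit CPU */
#elif defined(__i386__)
	unsigned long f0, f1;

	__asm__ __volatile__(
		"pushfl\n\t"		/* saved copy, restored by the last popfl */
		"pushfl\n\t"
		"popl %0\n\t"		/* f0 = current EFLAGS */
		"movl %0, %1\n\t"
		"xorl %2, %1\n\t"	/* flip the ID bit */
		"pushl %1\n\t"
		"popfl\n\t"		/* try to write it back */
		"pushfl\n\t"
		"popl %1\n\t"		/* f1 = what actually stuck */
		"popfl"			/* restore the original EFLAGS */
		: "=&r" (f0), "=&r" (f1)
		: "ri" (X86_EFLAGS_ID)
		: "cc");

	return id_bit_toggled(f0, f1);
#else
	return false;			/* not an x86 build */
#endif
}
```

On a CPU without CPUID the write to bit 21 is silently dropped, so f0 and f1 come back identical and the probe returns false.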
On Mon, Dec 05, 2016 at 01:32:43PM -0800, Andy Lutomirski wrote:
> Aside from being excessively slow, CPUID is problematic: Linux runs
> on a handful of CPUs that don't have CPUID. Use IRET-to-self
> instead. IRET-to-self works everywhere, so it makes testing easy.
>
> For reference, on my laptop, IRET-to-self is ~110ns,
> CPUID(eax=1, ecx=0) is ~83ns on native and very very slow under KVM,
> and MOV-to-CR2 is ~42ns.
>
> While we're at it: sync_core() serves a very specific purpose.
> Document it.
>
> Cc: "H. Peter Anvin" <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
> 1 file changed, 55 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 64fbc937d586..201a956e345f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
>
> #define cpu_relax_lowlatency() cpu_relax()
>
> -/* Stop speculative execution and prefetching of modified code. */
> +/*
> + * This function forces the icache and prefetched instruction stream to
> + * catch up with reality in two very specific cases:
> + *
> + * a) Text was modified using one virtual address and is about to be executed
> + * from the same physical page at a different virtual address.
> + *
> + * b) Text was modified on a different CPU, may subsequently be
> + * executed on this CPU, and you want to make sure the new version
> + * gets executed. This generally means you're calling this in an IPI.
> + *
> + * If you're calling this for a different reason, you're probably doing
> + * it wrong.
"... and think hard before you call this - it is slow."
I'd add that now that it is even slower than CPUID.
> + */
> static inline void sync_core(void)
> {
> - int tmp;
> -
> -#ifdef CONFIG_X86_32
> /*
> - * Do a CPUID if available, otherwise do a jump. The jump
> - * can conveniently enough be the jump around CPUID.
> + * There are quite a few ways to do this. IRET-to-self is nice
> + * because it works on every CPU, at any CPL (so it's compatible
> + * with paravirtualization), and it never exits to a hypervisor.
> + * The only down sides are that it's a bit slow (it seems to be
> + * a bit more than 2x slower than the fastest options) and that
> + * it unmasks NMIs.
Ewww, I hadn't thought of that angle. Aren't we going to get in all
kinds of hard to debug issues due to that couple of cycles window of
unmasked NMIs?
We sync_core in some NMI handler and then right in the middle of it we
get another NMI. Yeah, we have the nested NMI stuff still but I'd like
to avoid complications if possible.
> The "push %cs" is needed because, in
> + * paravirtual environments, __KERNEL_CS may not be a valid CS
> + * value when we do IRET directly.
> + *
> + * In case NMI unmasking or performance ever becomes a problem,
> + * the next best option appears to be MOV-to-CR2 and an
> + * unconditional jump. That sequence also works on all CPUs,
> + * but it will fault at CPL3.
Does it really have to be non-privileged?
If not, there are a couple more serializing insns:
"• Privileged serializing instructions — INVD, INVEPT, INVLPG,
INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the
exception of MOV CR8), MOV (to debug register), WBINVD, and WRMSR"
What about INVD? It is expensive too :-)
Can't we use, MOV %dr or so? With previously saving/restoring the reg?
WBINVD could be another candidate, albeit a big hammer.
WRMSR maybe too. If it faults, it's fine too because then you have the
IRET automagically. Hell, we could even make it fault on purpose by
writing into an invalid MSR but then we're back to the unmasking NMIs...
:-\
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
>>> On 05.12.16 at 22:32, <[email protected]> wrote:
> static inline void sync_core(void)
> {
> - int tmp;
> -
> -#ifdef CONFIG_X86_32
> /*
> - * Do a CPUID if available, otherwise do a jump. The jump
> - * can conveniently enough be the jump around CPUID.
> + * There are quite a few ways to do this. IRET-to-self is nice
> + * because it works on every CPU, at any CPL (so it's compatible
> + * with paravirtualization), and it never exits to a hypervisor.
> + * The only down sides are that it's a bit slow (it seems to be
> + * a bit more than 2x slower than the fastest options) and that
> + * it unmasks NMIs. The "push %cs" is needed because, in
> + * paravirtual environments, __KERNEL_CS may not be a valid CS
> + * value when we do IRET directly.
> + *
> + * In case NMI unmasking or performance ever becomes a problem,
> + * the next best option appears to be MOV-to-CR2 and an
> + * unconditional jump. That sequence also works on all CPUs,
> + * but it will fault at CPL3.
CPL > 0 I think.
> + * CPUID is the conventional way, but it's nasty: it doesn't
> + * exist on some 486-like CPUs, and it usually exits to a
> + * hypervisor.
> */
> - asm volatile("cmpl %2,%1\n\t"
> - "jl 1f\n\t"
> - "cpuid\n"
> - "1:"
> - : "=a" (tmp)
> - : "rm" (boot_cpu_data.cpuid_level), "ri" (0), "0" (1)
> - : "ebx", "ecx", "edx", "memory");
> + register void *__sp asm(_ASM_SP);
> +
> +#ifdef CONFIG_X86_32
> + asm volatile (
> + "pushfl\n\t"
> + "pushl %%cs\n\t"
> + "pushl $1f\n\t"
> + "iret\n\t"
> + "1:"
> + : "+r" (__sp) : : "cc", "memory");
I don't think EFLAGS (i.e. "cc") gets modified anywhere here. And
the memory clobber would perhaps better be pulled out into an
explicit barrier() invocation (making it more obvious what it's needed
for)?
> #else
> - /*
> - * CPUID is a barrier to speculative execution.
> - * Prefetched instructions are automatically
> - * invalidated when modified.
> - */
> - asm volatile("cpuid"
> - : "=a" (tmp)
> - : "0" (1)
> - : "ebx", "ecx", "edx", "memory");
> + unsigned long tmp;
> +
> + asm volatile (
> + "movq %%ss, %0\n\t"
> + "pushq %0\n\t"
> + "pushq %%rsp\n\t"
> + "addq $8, (%%rsp)\n\t"
> + "pushfq\n\t"
> + "movq %%cs, %0\n\t"
> + "pushq %0\n\t"
> + "pushq $1f\n\t"
> + "iretq\n\t"
> + "1:"
> + : "=r" (tmp), "+r" (__sp) : : "cc", "memory");
The first output needs to be "=&r". And is movq really a good
idea for selector reads? Why don't you make tmp unsigned int,
use plain mov, and use %q0 as pushq's operands?
Jan
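
For illustration, the 64-bit sequence with the three changes suggested above (early-clobber output, a 32-bit tmp with plain mov, and %q0 for the pushes) might look like the sketch below. It is adapted so that it can be built and run as an ordinary user-space program, which is an assumption of this example and not part of the patch: it steps over the 128-byte red zone, uses a RIP-relative lea instead of "pushq $1f" so that it links as PIE, and drops the __sp operand. The name iret_to_self() is made up, and builds for anything other than x86-64 get an empty stub.

```c
/* Hypothetical user-space variant of the 64-bit IRET-to-self sequence. */
static inline void iret_to_self(void)
{
#if defined(__x86_64__)
	unsigned int tmp;

	__asm__ __volatile__(
		"lea -128(%%rsp), %%rsp\n\t"	/* skip the user-space red zone */
		"mov %%ss, %0\n\t"		/* plain mov, 32-bit destination */
		"pushq %q0\n\t"			/* SS */
		"pushq %%rsp\n\t"
		"addq $8, (%%rsp)\n\t"		/* RSP as it was before the SS push */
		"pushfq\n\t"			/* RFLAGS */
		"mov %%cs, %0\n\t"
		"pushq %q0\n\t"			/* CS */
		"lea 1f(%%rip), %q0\n\t"
		"pushq %q0\n\t"			/* RIP: the label after IRETQ */
		"iretq\n\t"
		"1:\n\t"
		"lea 128(%%rsp), %%rsp"		/* undo the red-zone adjustment */
		: "=&r" (tmp) : : "memory");
#endif
}
```

With no input operands left, the early-clobber on tmp is not strictly needed in this variant; it is kept to mirror the suggestion, which matters in the patch because tmp is written while __sp is still live.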
On Tue, Dec 06, 2016 at 01:46:37AM -0700, Jan Beulich wrote:
> > + asm volatile (
> > + "pushfl\n\t"
> > + "pushl %%cs\n\t"
> > + "pushl $1f\n\t"
> > + "iret\n\t"
> > + "1:"
> > + : "+r" (__sp) : : "cc", "memory");
>
> I don't think EFLAGS (i.e. "cc") gets modified anywhere here. And
> the memory clobber would perhaps better be pulled out into an
> explicit barrier() invocation (making it more obvious what it's needed
> for)?
Everything that implies a memory barrier (and I think serializing
instructions do that) also implies a compiler barrier.
Not doing the memory clobber gets you inconsistency wrt everything else.
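
For readers following the clobber discussion: with GCC, the kernel's barrier() is an empty asm statement with a "memory" clobber, so folding the clobber into the IRET asm or calling barrier() next to it constrains the compiler in the same way. A minimal stand-alone sketch, with hypothetical function and variable names:

```c
/* Same shape as the kernel's GCC definition of barrier(). */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int patched_word;	/* hypothetical stand-in for modified memory */

static int store_then_reload(int val)
{
	patched_word = val;
	barrier();		/* the store has to be emitted before this point */
	return patched_word;	/* and the value is re-read from memory after it */
}
```

The barrier emits no instruction; it only stops the compiler from caching memory values in registers or reordering memory accesses across it.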
On Mon, Dec 5, 2016 at 11:52 PM, Borislav Petkov <[email protected]> wrote:
> On Mon, Dec 05, 2016 at 01:32:43PM -0800, Andy Lutomirski wrote:
>> Aside from being excessively slow, CPUID is problematic: Linux runs
>> on a handful of CPUs that don't have CPUID. Use IRET-to-self
>> instead. IRET-to-self works everywhere, so it makes testing easy.
>>
>> For reference, on my laptop, IRET-to-self is ~110ns,
>> CPUID(eax=1, ecx=0) is ~83ns on native and very very slow under KVM,
>> and MOV-to-CR2 is ~42ns.
>>
>> While we're at it: sync_core() serves a very specific purpose.
>> Document it.
>>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
>> 1 file changed, 55 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index 64fbc937d586..201a956e345f 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
>>
>> #define cpu_relax_lowlatency() cpu_relax()
>>
>> -/* Stop speculative execution and prefetching of modified code. */
>> +/*
>> + * This function forces the icache and prefetched instruction stream to
>> + * catch up with reality in two very specific cases:
>> + *
>> + * a) Text was modified using one virtual address and is about to be executed
>> + * from the same physical page at a different virtual address.
>> + *
>> + * b) Text was modified on a different CPU, may subsequently be
>> + * executed on this CPU, and you want to make sure the new version
>> + * gets executed. This generally means you're calling this in an IPI.
>> + *
>> + * If you're calling this for a different reason, you're probably doing
>> + * it wrong.
>
> "... and think hard before you call this - it is slow."
>
> I'd add that now that it is even slower than CPUID.
But only barely. And it's hugely faster than CPUID under KVM or
similar. And it works on all CPUs.
>
>> + */
>> static inline void sync_core(void)
>> {
>> - int tmp;
>> -
>> -#ifdef CONFIG_X86_32
>> /*
>> - * Do a CPUID if available, otherwise do a jump. The jump
>> - * can conveniently enough be the jump around CPUID.
>> + * There are quite a few ways to do this. IRET-to-self is nice
>> + * because it works on every CPU, at any CPL (so it's compatible
>> + * with paravirtualization), and it never exits to a hypervisor.
>> + * The only down sides are that it's a bit slow (it seems to be
>> + * a bit more than 2x slower than the fastest options) and that
>> + * it unmasks NMIs.
>
> Ewww, I hadn't thought of that angle. Aren't we going to get in all
> kinds of hard to debug issues due to that couple of cycles window of
> unmasked NMIs?
>
> We sync_core in some NMI handler and then right in the middle of it we
> get another NMI. Yeah, we have the nested NMI stuff still but I'd like
> to avoid complications if possible.
I'll counter with:
1. Why on earth would an NMI call sync_core()?
2. We have very careful code to handle this issue, and NMIs really
do cause page faults which have exactly the same problem.
So I'm not too worried.
>
>> The "push %cs" is needed because, in
>> + * paravirtual environments, __KERNEL_CS may not be a valid CS
>> + * value when we do IRET directly.
>> + *
>> + * In case NMI unmasking or performance ever becomes a problem,
>> + * the next best option appears to be MOV-to-CR2 and an
>> + * unconditional jump. That sequence also works on all CPUs,
>> + * but it will fault at CPL3.
>
> Does it really have to be non-privileged?
Unless we want to declare lguest unsupported, delete it from the tree,
or, sigh, actually maintain it, then yes :( native_write_cr2() would
work on Xen, but it's slow.
>
> If not, there are a couple more serializing insns:
>
> "• Privileged serializing instructions — INVD, INVEPT, INVLPG,
> INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the
> exception of MOV CR8), MOV (to debug register), WBINVD, and WRMSR"
>
> What about INVD? It is expensive too :-)
Only if you write the patch and label it:
Snickered-at-by: Andy Lutomirski <[email protected]>
>
> Can't we use, MOV %dr or so? With previously saving/restoring the reg?
>
> WBINVD could be another candidate, albeit a big hammer.
>
> WRMSR maybe too. If it faults, it's fine too because then you have the
> IRET automagically. Hell, we could even make it fault on purpose by
> writing into an invalid MSR but then we're back to the unmasking NMIs...
> :-\
I still like MOV-to-CR2 better than all of these.
--Andy
On 12/06/16 00:46, Jan Beulich wrote:
>> +
>> +#ifdef CONFIG_X86_32
>> + asm volatile (
>> + "pushfl\n\t"
>> + "pushl %%cs\n\t"
>> + "pushl $1f\n\t"
>> + "iret\n\t"
>> + "1:"
>> + : "+r" (__sp) : : "cc", "memory");
>
> I don't think EFLAGS (i.e. "cc") gets modified anywhere here. And
> the memory clobber would perhaps better be pulled out into an
> explicit barrier() invocation (making it more obvious what it's needed
> for)?
>
Not to mention "cc" doesn't do anything on x86 at all.
-hpa