2016-11-30 20:35:07

by Andy Lutomirski

Subject: [PATCH 0/4] CPUID-less CPU fixes and improvements

*** PATCHES 1 and 2 MAY BE 4.9 MATERIAL ***

Alan Cox pointed out that the 486 isn't the only supported CPU that
doesn't have CPUID. Let's clean up the mess.

Patch 1 is intended to be an easy fix: it makes sync_core() work
without CPUID on all 32-bit kernels. It should be quite safe. This
will have a negligible performance cost during boot on kernels built
for newer CPUs. With this in place, patch 2 reverts the buggy 486
check I added.

Patches 3 and 4 are meant to improve the situation. Patch 3 cleans
up the Intel microcode loader and patch 4 (which depends on patch 3)
stops using CPUID in sync_core() altogether.

Andy Lutomirski (4):
x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit
kernels
Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"
x86/microcode/intel: Replace sync_core() with cpuid_eax(1)
x86/asm: Change sync_core() to use MOV to CR2 to serialize

arch/x86/boot/cpu.c | 6 ------
arch/x86/include/asm/processor.h | 31 ++++++++-----------------------
arch/x86/kernel/cpu/microcode/intel.c | 6 +++---
3 files changed, 11 insertions(+), 32 deletions(-)

--
2.9.3


2016-11-30 20:35:13

by Andy Lutomirski

Subject: [PATCH 1/4] x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels

We support various non-Intel CPUs that don't have the CPUID
instruction, so the M486 test was wrong. For now, fix it with a big
hammer: handle missing CPUID on all 32-bit CPUs.

Reported-by: One Thousand Gnomes <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 984a7bf17f6a..64fbc937d586 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -595,7 +595,7 @@ static inline void sync_core(void)
{
int tmp;

-#ifdef CONFIG_M486
+#ifdef CONFIG_X86_32
/*
* Do a CPUID if available, otherwise do a jump. The jump
* can conveniently enough be the jump around CPUID.
--
2.9.3

2016-11-30 20:35:22

by Andy Lutomirski

Subject: [PATCH 2/4] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"

This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.

The patch wasn't quite correct -- there are non-Intel (and hence
non-486) CPUs that we support that don't have CPUID. Since we no
longer require CPUID for sync_core(), just revert the patch.

In principle, we should try to do better at identifying CPUID-less
CPUs in early boot, but that's more complicated.

Reported-by: One Thousand Gnomes <[email protected]>
Cc: Matthew Whitehead <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/boot/cpu.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/arch/x86/boot/cpu.c b/arch/x86/boot/cpu.c
index 4224ede43b4e..26240dde081e 100644
--- a/arch/x86/boot/cpu.c
+++ b/arch/x86/boot/cpu.c
@@ -87,12 +87,6 @@ int validate_cpu(void)
return -1;
}

- if (CONFIG_X86_MINIMUM_CPU_FAMILY <= 4 && !IS_ENABLED(CONFIG_M486) &&
- !has_eflag(X86_EFLAGS_ID)) {
- printf("This kernel requires a CPU with the CPUID instruction. Build with CONFIG_M486=y to run on this CPU.\n");
- return -1;
- }
-
if (err_flags) {
puts("This kernel requires the following features "
"not present on the CPU:\n");
--
2.9.3

2016-11-30 20:35:33

by Andy Lutomirski

Subject: [PATCH 3/4] x86/microcode/intel: Replace sync_core() with cpuid_eax(1)

The Intel microcode driver is using sync_core() to mean "do CPUID
with EAX=1". I want to rework sync_core(), but first the Intel
microcode driver needs to stop depending on its current behavior.

Reported-by: Henrique de Moraes Holschuh <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/cpu/microcode/intel.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index cdc0deab00c9..2542d036a30e 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -385,7 +385,7 @@ static int collect_cpu_info_early(struct ucode_cpu_info *uci)
native_wrmsrl(MSR_IA32_UCODE_REV, 0);

/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_eax(1);

/* get the current revision from MSR 0x8B */
native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -627,7 +627,7 @@ static int apply_microcode_early(struct ucode_cpu_info *uci, bool early)
native_wrmsrl(MSR_IA32_UCODE_REV, 0);

/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_eax(1);

/* get the current revision from MSR 0x8B */
native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -927,7 +927,7 @@ static int apply_microcode_intel(int cpu)
wrmsrl(MSR_IA32_UCODE_REV, 0);

/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_eax(1);

/* get the current revision from MSR 0x8B */
rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
--
2.9.3

2016-11-30 20:35:53

by Andy Lutomirski

Subject: [PATCH 4/4] x86/asm: Change sync_core() to use MOV to CR2 to serialize

Aside from being excessively slow, CPUID is problematic: Linux runs
on a handful of CPUs that don't have CPUID. MOV to CR2 is always
available, so use it instead.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 31 ++++++++-----------------------
1 file changed, 8 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 64fbc937d586..0388f3d85700 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -593,31 +593,16 @@ static __always_inline void cpu_relax(void)
/* Stop speculative execution and prefetching of modified code. */
static inline void sync_core(void)
{
- int tmp;
-
-#ifdef CONFIG_X86_32
- /*
- * Do a CPUID if available, otherwise do a jump. The jump
- * can conveniently enough be the jump around CPUID.
- */
- asm volatile("cmpl %2,%1\n\t"
- "jl 1f\n\t"
- "cpuid\n"
- "1:"
- : "=a" (tmp)
- : "rm" (boot_cpu_data.cpuid_level), "ri" (0), "0" (1)
- : "ebx", "ecx", "edx", "memory");
-#else
/*
- * CPUID is a barrier to speculative execution.
- * Prefetched instructions are automatically
- * invalidated when modified.
+ * MOV to CR2 is architecturally defined as a serializing
+ * instruction. It's nice because it works on all CPUs, it
+ * doesn't clobber registers, and (unlike CPUID) it won't force
+ * a VM exit.
+ *
+ * 0xbf172b23 is random poison just in case something ends up
+ * caring about this value.
*/
- asm volatile("cpuid"
- : "=a" (tmp)
- : "0" (1)
- : "ebx", "ecx", "edx", "memory");
-#endif
+ write_cr2(0xbf172b23);
}

extern void select_idle_routine(const struct cpuinfo_x86 *c);
--
2.9.3

2016-12-01 05:54:35

by Peter Zijlstra

Subject: Re: [PATCH 2/4] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"

On Wed, Nov 30, 2016 at 12:34:53PM -0800, Andy Lutomirski wrote:
> This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.
>
> The patch wasn't quite correct -- there are non-Intel (and hence
> non-486) CPUs that we support that don't have CPUID. Since we no
> longer require CPUID for sync_core(), just revert the patch.

Might be useful to enumerate which special parts these are.

2016-12-01 09:03:55

by Borislav Petkov

Subject: Re: [PATCH 2/4] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"

On Thu, Dec 01, 2016 at 06:53:50AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 30, 2016 at 12:34:53PM -0800, Andy Lutomirski wrote:
> > This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.
> >
> > The patch wasn't quite correct -- there are non-Intel (and hence
> > non-486) CPUs that we support that don't have CPUID. Since we no
> > longer require CPUID for sync_core(), just revert the patch.
>
> Might be useful to enumerate which special parts these are.

Right, and since we test for CPUID support at early boot, I think we
should use the X86_FEATURE_CPUID aspect from what I proposed earlier:

https://lkml.kernel.org/r/[email protected]

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 09:11:26

by Borislav Petkov

Subject: Re: [PATCH 3/4] x86/microcode/intel: Replace sync_core() with cpuid_eax(1)

On Wed, Nov 30, 2016 at 12:34:54PM -0800, Andy Lutomirski wrote:
> The Intel microcode driver is using sync_core() to mean "do CPUID
> with EAX=1". I want to rework sync_core(), but first the Intel
> microcode driver needs to stop depending on its current behavior.
>
> Reported-by: Henrique de Moraes Holschuh <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/kernel/cpu/microcode/intel.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)

Acked-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 09:22:34

by Borislav Petkov

Subject: Re: [PATCH 4/4] x86/asm: Change sync_core() to use MOV to CR2 to serialize

On Wed, Nov 30, 2016 at 12:34:55PM -0800, Andy Lutomirski wrote:
> Aside from being excessively slow, CPUID is problematic: Linux runs
> on a handful of CPUs that don't have CPUID. MOV to CR2 is always
> available, so use it instead.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/processor.h | 31 ++++++++-----------------------
> 1 file changed, 8 insertions(+), 23 deletions(-)

Looks nice.

I'm wondering if we should leave this one in tip for an additional cycle
to have it tested on more hw. I know, it is architectural and so on but
who knows what every implementation actually does...

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 10:11:06

by Thomas Gleixner

Subject: Re: [PATCH 2/4] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"

On Thu, 1 Dec 2016, Borislav Petkov wrote:
> On Thu, Dec 01, 2016 at 06:53:50AM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 30, 2016 at 12:34:53PM -0800, Andy Lutomirski wrote:
> > > This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.
> > >
> > > The patch wasn't quite correct -- there are non-Intel (and hence
> > > non-486) CPUs that we support that don't have CPUID. Since we no
> > > longer require CPUID for sync_core(), just revert the patch.
> >
> > Might be useful to enumerate which special parts these are.
>
> Right, and since we test for CPUID support at early boot, I think we
> should use the X86_FEATURE_CPUID aspect from what I proposed earlier:
>
> https://lkml.kernel.org/r/[email protected]

Yes, that makes a lot of sense.

Thanks,

tglx

2016-12-01 11:15:44

by Borislav Petkov

Subject: [PATCH] x86/CPU: Add X86_FEATURE_CPUID

On Thu, Dec 01, 2016 at 11:07:21AM +0100, Thomas Gleixner wrote:
> Yes, that makes a lot of sense.

Here it is:

---
From: Borislav Petkov <[email protected]>
Date: Thu, 1 Dec 2016 12:11:37 +0100
Subject: [PATCH] x86/CPU: Add X86_FEATURE_CPUID

Add a synthetic CPUID flag denoting whether the CPU sports the CPUID
instruction or not. This will come in useful later when accommodating
CPUID-less CPUs.

Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/kernel/cpu/common.c | 7 ++++---
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 53e27a12ceb6..cb49dd31c3b2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -100,7 +100,7 @@
#define X86_FEATURE_XTOPOLOGY ( 3*32+22) /* cpu topology enum extensions */
#define X86_FEATURE_TSC_RELIABLE ( 3*32+23) /* TSC is known to be reliable */
#define X86_FEATURE_NONSTOP_TSC ( 3*32+24) /* TSC does not stop in C states */
-/* free, was #define X86_FEATURE_CLFLUSH_MONITOR ( 3*32+25) * "" clflush reqd with monitor */
+#define X86_FEATURE_CPUID ( 3*32+25) /* CPU has CPUID instruction itself */
#define X86_FEATURE_EXTD_APICID ( 3*32+26) /* has extended APICID (8 bits) */
#define X86_FEATURE_AMD_DCM ( 3*32+27) /* multi-node processor */
#define X86_FEATURE_APERFMPERF ( 3*32+28) /* APERFMPERF */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 90c007447507..8dcdcdeec569 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -800,14 +800,12 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
memset(&c->x86_capability, 0, sizeof c->x86_capability);
c->extended_cpuid_level = 0;

- if (!have_cpuid_p())
- identify_cpu_without_cpuid(c);
-
/* cyrix could have cpuid enabled via c_identify()*/
if (have_cpuid_p()) {
cpu_detect(c);
get_cpu_vendor(c);
get_cpu_cap(c);
+ setup_force_cpu_cap(X86_FEATURE_CPUID);

if (this_cpu->c_early_init)
this_cpu->c_early_init(c);
@@ -817,6 +815,9 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)

if (this_cpu->c_bsp_init)
this_cpu->c_bsp_init(c);
+ } else {
+ identify_cpu_without_cpuid(c);
+ setup_clear_cpu_cap(X86_FEATURE_CPUID);
}

setup_force_cpu_cap(X86_FEATURE_ALWAYS);
--
2.10.0

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 17:00:34

by Andy Lutomirski

Subject: Re: [PATCH] x86/CPU: Add X86_FEATURE_CPUID

On Thu, Dec 1, 2016 at 3:15 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Dec 01, 2016 at 11:07:21AM +0100, Thomas Gleixner wrote:
>> Yes, that makes a lot of sense.
>
> Here it is:
>
> ---
> From: Borislav Petkov <[email protected]>
> Date: Thu, 1 Dec 2016 12:11:37 +0100
> Subject: [PATCH] x86/CPU: Add X86_FEATURE_CPUID
>
> Add a synthetic CPUID flag denoting whether the CPU sports the CPUID
> instruction or not. This will come in useful later when accommodating
> CPUID-less CPUs.

Reviewed-by: Andy Lutomirski <[email protected]>

2016-12-01 17:01:15

by Andy Lutomirski

Subject: Re: [PATCH 2/4] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"

On Wed, Nov 30, 2016 at 9:53 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Nov 30, 2016 at 12:34:53PM -0800, Andy Lutomirski wrote:
>> This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.
>>
>> The patch wasn't quite correct -- there are non-Intel (and hence
>> non-486) CPUs that we support that don't have CPUID. Since we no
>> longer require CPUID for sync_core(), just revert the patch.
>
> Might be useful to enumerate which special parts these are.

Alan thought it was Geode and Elan, right? I can add that to the
commit message if I do a v2.

--Andy

2016-12-01 17:08:25

by Andy Lutomirski

Subject: Re: [PATCH 4/4] x86/asm: Change sync_core() to use MOV to CR2 to serialize

On Thu, Dec 1, 2016 at 1:22 AM, Borislav Petkov <[email protected]> wrote:
> On Wed, Nov 30, 2016 at 12:34:55PM -0800, Andy Lutomirski wrote:
>> Aside from being excessively slow, CPUID is problematic: Linux runs
>> on a handful of CPUs that don't have CPUID. MOV to CR2 is always
>> available, so use it instead.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/include/asm/processor.h | 31 ++++++++-----------------------
>> 1 file changed, 8 insertions(+), 23 deletions(-)
>
> Looks nice.
>
> I'm wondering if we should leave this one in tip for an additional cycle
> to have it tested on more hw. I know, it is architectural and so on but
> who knows what every implementation actually does...

I want the Xen opinion as well.

Xen folks, can Linux use write_cr2 to serialize the CPU core on Xen PV
or do we need something a bit heavier weight like native_write_cr2?

--Andy

2016-12-01 17:46:12

by Andrew Cooper

Subject: Re: [PATCH 4/4] x86/asm: Change sync_core() to use MOV to CR2 to serialize

On 01/12/16 17:08, Andy Lutomirski wrote:
> On Thu, Dec 1, 2016 at 1:22 AM, Borislav Petkov <[email protected]> wrote:
>> On Wed, Nov 30, 2016 at 12:34:55PM -0800, Andy Lutomirski wrote:
>>> Aside from being excessively slow, CPUID is problematic: Linux runs
>>> on a handful of CPUs that don't have CPUID. MOV to CR2 is always
>>> available, so use it instead.
>>>
>>> Signed-off-by: Andy Lutomirski <[email protected]>
>>> ---
>>> arch/x86/include/asm/processor.h | 31 ++++++++-----------------------
>>> 1 file changed, 8 insertions(+), 23 deletions(-)
>> Looks nice.
>>
>> I'm wondering if we should leave this one in tip for an additional cycle
>> to have it tested on more hw. I know, it is architectural and so on but
>> who knows what every implementation actually does...
> I want the Xen opinion as well.
>
> Xen folks, can Linux use write_cr2 to serialize the CPU core on Xen PV
> or do we need something a bit heavier weight like native_write_cr2?

To sum up our conversation on IRC.

xen_write_cr2() is not serialising; it is just a write into a shared page.

native_write_cr2() would trap and be emulated. This will incur #GP[0]
due to CPL, although not necessarily an iret on the way back out of Xen.

Something like an iret-to-self would be far quicker, and avoid trapping
into the hypervisor.

~Andrew

2016-12-02 00:33:40

by Andy Lutomirski

Subject: Re: [PATCH 2/4] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"

On Thu, Dec 1, 2016 at 1:02 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Dec 01, 2016 at 06:53:50AM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 30, 2016 at 12:34:53PM -0800, Andy Lutomirski wrote:
>> > This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.
>> >
>> > The patch wasn't quite correct -- there are non-Intel (and hence
>> > non-486) CPUs that we support that don't have CPUID. Since we no
>> > longer require CPUID for sync_core(), just revert the patch.
>>
>> Might be useful to enumerate which special parts these are.
>
> Right, and since we test for CPUID support at early boot, I think we
> should use the X86_FEATURE_CPUID aspect from what I proposed earlier:
>
> https://lkml.kernel.org/r/[email protected]

Maybe we should just make the kernel work fully without CPUID. There
isn't much to do on top of your patch -- I think we mainly want to
make the cpuid driver refuse to load.

--Andy

2016-12-02 00:35:20

by Andy Lutomirski

Subject: [PATCH v2 1/6] x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels

We support various non-Intel CPUs that don't have the CPUID
instruction, so the M486 test was wrong. For now, fix it with a big
hammer: handle missing CPUID on all 32-bit CPUs.

Reported-by: One Thousand Gnomes <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 984a7bf17f6a..64fbc937d586 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -595,7 +595,7 @@ static inline void sync_core(void)
{
int tmp;

-#ifdef CONFIG_M486
+#ifdef CONFIG_X86_32
/*
* Do a CPUID if available, otherwise do a jump. The jump
* can conveniently enough be the jump around CPUID.
--
2.9.3

2016-12-02 00:35:27

by Andy Lutomirski

Subject: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
serialize on Xen PV, but, in practice, any trap it generates will
serialize.)

On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
should end up being a nice speedup.

Cc: Andrew Cooper <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/xen/enlighten.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index bdd855685403..1f765b41eee7 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
static __read_mostly unsigned int cpuid_leaf5_ecx_val;
static __read_mostly unsigned int cpuid_leaf5_edx_val;

+static void xen_sync_core(void)
+{
+ register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_32
+ asm volatile (
+ "pushl %%ss\n\t"
+ "pushl %%esp\n\t"
+ "addl $4, (%%esp)\n\t"
+ "pushfl\n\t"
+ "pushl %%cs\n\t"
+ "pushl $1f\n\t"
+ "iret\n\t"
+ "1:"
+ : "+r" (__sp) : : "cc");
+#else
+ unsigned long tmp;
+
+ asm volatile (
+ "movq %%ss, %0\n\t"
+ "pushq %0\n\t"
+ "pushq %%rsp\n\t"
+ "addq $8, (%%rsp)\n\t"
+ "pushfq\n\t"
+ "movq %%cs, %0\n\t"
+ "pushq %0\n\t"
+ "pushq $1f\n\t"
+ "iretq\n\t"
+ "1:"
+ : "=r" (tmp), "+r" (__sp) : : "cc");
+#endif
+}
+
static void xen_cpuid(unsigned int *ax, unsigned int *bx,
unsigned int *cx, unsigned int *dx)
{
@@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {

.start_context_switch = paravirt_start_context_switch,
.end_context_switch = xen_end_context_switch,
+
+ .sync_core = xen_sync_core,
};

static void xen_reboot(int reason)
--
2.9.3

2016-12-02 00:35:25

by Andy Lutomirski

Subject: [PATCH v2 4/6] x86/paravirt: Make sync_core() be a paravirt op

I want to change sync_core() to use MOV to CR2, but that won't work
the way we want on Xen PV, and the easiest fix is to make
sync_core() be a paravirt op. Make it so.

A real paravirt guru could probably micro-optimize this. I doubt it
matters much, though, as sync_core() is mostly used during boot.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/paravirt.h | 5 +++++
arch/x86/include/asm/paravirt_types.h | 2 ++
arch/x86/include/asm/processor.h | 3 ++-
arch/x86/kernel/paravirt.c | 2 ++
4 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ce932812f142..7e76b72aa698 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -28,6 +28,11 @@ static inline void __cpuid(unsigned int *eax, unsigned int *ebx,
PVOP_VCALL4(pv_cpu_ops.cpuid, eax, ebx, ecx, edx);
}

+static inline void sync_core(void)
+{
+ PVOP_VCALL0(pv_cpu_ops.sync_core);
+}
+
/*
* These special macros can be used to get or set a debugging register
*/
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0e4979..e4d2cb2c0165 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -177,6 +177,8 @@ struct pv_cpu_ops {

void (*start_context_switch)(struct task_struct *prev);
void (*end_context_switch)(struct task_struct *next);
+
+ void (*sync_core)(void);
};

struct pv_irq_ops {
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 64fbc937d586..c4402053c663 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -507,6 +507,7 @@ static inline void load_sp0(struct tss_struct *tss,
}

#define set_iopl_mask native_set_iopl_mask
+#define sync_core native_sync_core
#endif /* CONFIG_PARAVIRT */

/* Free all resources held by a thread. */
@@ -591,7 +592,7 @@ static __always_inline void cpu_relax(void)
#define cpu_relax_lowlatency() cpu_relax()

/* Stop speculative execution and prefetching of modified code. */
-static inline void sync_core(void)
+static inline void native_sync_core(void)
{
int tmp;

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bbf3d5933eaa..4d6a20ecbc78 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -373,6 +373,8 @@ __visible struct pv_cpu_ops pv_cpu_ops = {

.start_context_switch = paravirt_nop,
.end_context_switch = paravirt_nop,
+
+ .sync_core = native_sync_core,
};

/* At this point, native_get/set_debugreg has real function entries */
--
2.9.3

2016-12-02 00:35:45

by Andy Lutomirski

Subject: [PATCH v2 3/6] x86/microcode/intel: Replace sync_core() with cpuid_eax(1)

The Intel microcode driver is using sync_core() to mean "do CPUID
with EAX=1". I want to rework sync_core(), but first the Intel
microcode driver needs to stop depending on its current behavior.

Reported-by: Henrique de Moraes Holschuh <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/cpu/microcode/intel.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index cdc0deab00c9..2542d036a30e 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -385,7 +385,7 @@ static int collect_cpu_info_early(struct ucode_cpu_info *uci)
native_wrmsrl(MSR_IA32_UCODE_REV, 0);

/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_eax(1);

/* get the current revision from MSR 0x8B */
native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -627,7 +627,7 @@ static int apply_microcode_early(struct ucode_cpu_info *uci, bool early)
native_wrmsrl(MSR_IA32_UCODE_REV, 0);

/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_eax(1);

/* get the current revision from MSR 0x8B */
native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -927,7 +927,7 @@ static int apply_microcode_intel(int cpu)
wrmsrl(MSR_IA32_UCODE_REV, 0);

/* As documented in the SDM: Do a CPUID 1 here */
- sync_core();
+ cpuid_eax(1);

/* get the current revision from MSR 0x8B */
rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
--
2.9.3

2016-12-02 00:35:59

by Andy Lutomirski

Subject: [PATCH v2 6/6] x86/asm: Change sync_core() to use MOV to CR2 to serialize

Aside from being excessively slow, CPUID is problematic: Linux runs
on a handful of CPUs that don't have CPUID. MOV to CR2 is always
available, so use it instead.

On my laptop, CPUID(eax=1, ecx=0) is ~83ns and MOV-to-CR2 is ~42ns,
so this should be a nice speedup.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/processor.h | 31 ++++++++-----------------------
1 file changed, 8 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c4402053c663..6727ed1c0ca0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -594,31 +594,16 @@ static __always_inline void cpu_relax(void)
/* Stop speculative execution and prefetching of modified code. */
static inline void native_sync_core(void)
{
- int tmp;
-
-#ifdef CONFIG_X86_32
- /*
- * Do a CPUID if available, otherwise do a jump. The jump
- * can conveniently enough be the jump around CPUID.
- */
- asm volatile("cmpl %2,%1\n\t"
- "jl 1f\n\t"
- "cpuid\n"
- "1:"
- : "=a" (tmp)
- : "rm" (boot_cpu_data.cpuid_level), "ri" (0), "0" (1)
- : "ebx", "ecx", "edx", "memory");
-#else
/*
- * CPUID is a barrier to speculative execution.
- * Prefetched instructions are automatically
- * invalidated when modified.
+ * MOV to CR2 is architecturally defined as a serializing
+ * instruction. It's nice because it works on all CPUs, it
+ * doesn't clobber registers, and (unlike CPUID) it won't force
+ * a VM exit.
+ *
+ * 0xbf172b23 is random poison just in case something ends up
+ * caring about this value.
*/
- asm volatile("cpuid"
- : "=a" (tmp)
- : "0" (1)
- : "ebx", "ecx", "edx", "memory");
-#endif
+ native_write_cr2(0xbf172b23);
}

extern void select_idle_routine(const struct cpuinfo_x86 *c);
--
2.9.3

2016-12-02 00:36:47

by Andy Lutomirski

Subject: [PATCH v2 2/6] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"

This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.

The patch wasn't quite correct -- there are non-Intel (and hence
non-486) CPUs that we support that don't have CPUID. Since we no
longer require CPUID for sync_core(), just revert the patch.

I think the relevant CPUs are Geode and Elan, but I'm not sure.

In principle, we should try to do better at identifying CPUID-less
CPUs in early boot, but that's more complicated.

Reported-by: One Thousand Gnomes <[email protected]>
Cc: Matthew Whitehead <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/boot/cpu.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/arch/x86/boot/cpu.c b/arch/x86/boot/cpu.c
index 4224ede43b4e..26240dde081e 100644
--- a/arch/x86/boot/cpu.c
+++ b/arch/x86/boot/cpu.c
@@ -87,12 +87,6 @@ int validate_cpu(void)
return -1;
}

- if (CONFIG_X86_MINIMUM_CPU_FAMILY <= 4 && !IS_ENABLED(CONFIG_M486) &&
- !has_eflag(X86_EFLAGS_ID)) {
- printf("This kernel requires a CPU with the CPUID instruction. Build with CONFIG_M486=y to run on this CPU.\n");
- return -1;
- }
-
if (err_flags) {
puts("This kernel requires the following features "
"not present on the CPU:\n");
--
2.9.3

2016-12-02 00:36:53

by Andy Lutomirski

Subject: Re: [PATCH 0/4] CPUID-less CPU fixes and improvements

Ugh, please ignore the accidental re-send of v1. I suck at using git.
v2 is what I meant to send, but I accidentally sent both versions. :-(

--Andy

2016-12-02 00:37:21

by Andy Lutomirski

Subject: [PATCH v2 0/6] CPUID-less CPU/sync_core fixes and improvements

*** PATCHES 1 and 2 MAY BE 4.9 MATERIAL ***

Alan Cox pointed out that the 486 isn't the only supported CPU that
doesn't have CPUID. Let's clean up the mess and make everything
faster while we're at it.

Patch 1 is intended to be an easy fix: it makes sync_core() work
without CPUID on all 32-bit kernels. It should be quite safe. This
will have a negligible performance cost during boot on kernels built
for newer CPUs. With this in place, patch 2 reverts the buggy 486
check I added.

Patches 3-6 are meant to improve the situation. Patch 3 cleans
up the Intel microcode loader, and the rest (which depend on patch 3)
stop using CPUID in sync_core() altogether.

Changes from v1:
- Fix Xen
- Add timing info to the changelog (hint: 2x speedup)
- Document patch 1 a bit better.

Andy Lutomirski (6):
x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit
kernels
Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"
x86/microcode/intel: Replace sync_core() with cpuid_eax(1)
x86/paravirt: Make sync_core() be a paravirt op
x86/xen: Add a Xen-specific sync_core() implementation
x86/asm: Change sync_core() to use MOV to CR2 to serialize

arch/x86/boot/cpu.c | 6 ------
arch/x86/include/asm/paravirt.h | 5 +++++
arch/x86/include/asm/paravirt_types.h | 2 ++
arch/x86/include/asm/processor.h | 34 ++++++++++------------------------
arch/x86/kernel/cpu/microcode/intel.c | 6 +++---
arch/x86/kernel/paravirt.c | 2 ++
arch/x86/xen/enlighten.c | 35 +++++++++++++++++++++++++++++++++++
7 files changed, 57 insertions(+), 33 deletions(-)

--
2.9.3

2016-12-02 07:34:41

by Ingo Molnar

Subject: Re: [PATCH 4/4] x86/asm: Change sync_core() to use MOV to CR2 to serialize


* Borislav Petkov <[email protected]> wrote:

> On Wed, Nov 30, 2016 at 12:34:55PM -0800, Andy Lutomirski wrote:
> > Aside from being excessively slow, CPUID is problematic: Linux runs
> > on a handful of CPUs that don't have CPUID. MOV to CR2 is always
> > available, so use it instead.
> >
> > Signed-off-by: Andy Lutomirski <[email protected]>
> > ---
> > arch/x86/include/asm/processor.h | 31 ++++++++-----------------------
> > 1 file changed, 8 insertions(+), 23 deletions(-)
>
> Looks nice.
>
> I'm wondering if we should leave this one in tip for an additional cycle
> to have it tested on more hw. I know, it is architectural and so on but
> who knows what every implementation actually does...

I think -tip and "upstream of the day" mostly get tested on relatively recent x86
hardware - proven by the fact that these regressions are many months old.

The reason v4.9 got extra testing is the announced Long Term Support (LTS) aspect:
more, older, weirder hardware is being tested because it's going to be a very
popular base kernel.

So the best option would be to get these fixes into -tip, make sure it's sane all
around and works on hardware that gets tested on bleeding edge kernels, then push
it upstream sooner rather than later and also have Cc:stable tags on the obvious
fixes, and handle any eventual fallout as it happens.

That's the best we can do I think.

Thanks,

Ingo

2016-12-02 10:17:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/4] CPUID-less CPU fixes and improvements


* Andy Lutomirski <[email protected]> wrote:

> Ugh, please ignore the accidental re-send of v1. I suck at using git.
> v2 is what I meant to send, but I accidentally sent both versions. :-(

Would be nice to get an Ack for the Xen bits - can pick -v2 up after that.

Thanks,

Ingo

2016-12-02 11:44:35

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On 02/12/16 00:35, Andy Lutomirski wrote:
> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
> serialize on Xen PV, but, in practice, any trap it generates will
> serialize.)

Well, Xen will enable CPUID Faulting wherever it can, which is
realistically all IvyBridge hardware and newer.

All hypercalls are a privilege change to cpl0. I'd hope this condition
is serialising, but I can't actually find any documentation proving or
disproving this.

>
> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
> should end up being a nice speedup.
>
> Cc: Andrew Cooper <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>

CC'ing xen-devel and the Xen maintainers in Linux.

As this is the only email from this series in my inbox, I will say this
here, but it should really be against patch 6.

A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
serialising on the 486, but I don't have a manual to hand to check.

~Andrew

> ---
> arch/x86/xen/enlighten.c | 35 +++++++++++++++++++++++++++++++++++
> 1 file changed, 35 insertions(+)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index bdd855685403..1f765b41eee7 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
> static __read_mostly unsigned int cpuid_leaf5_ecx_val;
> static __read_mostly unsigned int cpuid_leaf5_edx_val;
>
> +static void xen_sync_core(void)
> +{
> + register void *__sp asm(_ASM_SP);
> +
> +#ifdef CONFIG_X86_32
> + asm volatile (
> + "pushl %%ss\n\t"
> + "pushl %%esp\n\t"
> + "addl $4, (%%esp)\n\t"
> + "pushfl\n\t"
> + "pushl %%cs\n\t"
> + "pushl $1f\n\t"
> + "iret\n\t"
> + "1:"
> + : "+r" (__sp) : : "cc");
> +#else
> + unsigned long tmp;
> +
> + asm volatile (
> + "movq %%ss, %0\n\t"
> + "pushq %0\n\t"
> + "pushq %%rsp\n\t"
> + "addq $8, (%%rsp)\n\t"
> + "pushfq\n\t"
> + "movq %%cs, %0\n\t"
> + "pushq %0\n\t"
> + "pushq $1f\n\t"
> + "iretq\n\t"
> + "1:"
> + : "=r" (tmp), "+r" (__sp) : : "cc");
> +#endif
> +}
> +
> static void xen_cpuid(unsigned int *ax, unsigned int *bx,
> unsigned int *cx, unsigned int *dx)
> {
> @@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
>
> .start_context_switch = paravirt_start_context_switch,
> .end_context_switch = xen_end_context_switch,
> +
> + .sync_core = xen_sync_core,
> };
>
> static void xen_reboot(int reason)

2016-12-02 17:07:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Dec 2, 2016 3:44 AM, "Andrew Cooper" <[email protected]> wrote:
>
> On 02/12/16 00:35, Andy Lutomirski wrote:
> > On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> > guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
> > serialize on Xen PV, but, in practice, any trap it generates will
> > serialize.)
>
> Well, Xen will enable CPUID Faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0. I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.

I don't know for sure. IRET is serializing, and if Xen returns using
IRET, we're fine.

>
> >
> > On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> > ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
> > should end up being a nice speedup.
> >
> > Cc: Andrew Cooper <[email protected]>
> > Signed-off-by: Andy Lutomirski <[email protected]>
>
> CC'ing xen-devel and the Xen maintainers in Linux.
>
> As this is the only email from this series in my inbox, I will say this
> here, but it should really be against patch 6.
>
> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> serialising on the 486, but I don't have a manual to hand to check.

I'll quote the (modern) SDM. For self-modifying code "The use of one
of these options is not required for programs intended to run on the
Pentium or Intel486 processors,
but are recommended to ensure compatibility with the P6 and more
recent processor families.". For cross-modifying code "The use of
this option is not required for programs intended to run on the
Intel486 processor, but is recommended
to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
Pentium processors." So I'm not sure there's a problem.

I can add an unconditional jump just to make sure. It costs basically
nothing on modern CPUs. (Also, CPUID isn't serializing on the 486
according to the table.)

--Andy
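Concretely, the MOV-to-CR2-plus-jump combination Andy is proposing would look roughly like this. This is a kernel-context sketch of the idea, not the literal patch 6 hunk (MOV to CR2 is privileged, so this only runs in ring 0; reading CR2 back before rewriting it is my addition, to avoid clobbering the recorded page-fault address):

```c
static inline void sync_core(void)
{
	unsigned long tmp;

	/* MOV to a control register is architecturally serializing;
	 * the trailing unconditional jump flushes the prefetch queue
	 * on 486-class parts where the MOV alone may not suffice.
	 * Sketch only -- not the exact patch text. */
	asm volatile("mov %%cr2, %0\n\t"
		     "mov %0, %%cr2\n\t"
		     "jmp 1f\n\t"
		     "1:"
		     : "=r" (tmp) : : "memory");
}
```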

2016-12-02 17:17:23

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On 02/12/16 17:07, Andy Lutomirski wrote:
> On Dec 2, 2016 3:44 AM, "Andrew Cooper" <[email protected]> wrote:
>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>>> serialize on Xen PV, but, in practice, any trap it generates will
>>> serialize.)
>> Well, Xen will enable CPUID Faulting wherever it can, which is
>> realistically all IvyBridge hardware and newer.
>>
>> All hypercalls are a privilege change to cpl0. I'd hope this condition
>> is serialising, but I can't actually find any documentation proving or
>> disproving this.
> I don't know for sure. IRET is serializing, and if Xen returns using
> IRET, we're fine.

All returns to a 64bit PV guest at defined points (hypercall return,
exception entry, etc) are from SYSRET, not IRET.

Talking of, I still have a patch to remove
PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.

>
>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>>> should end up being a nice speedup.
>>>
>>> Cc: Andrew Cooper <[email protected]>
>>> Signed-off-by: Andy Lutomirski <[email protected]>
>> CC'ing xen-devel and the Xen maintainers in Linux.
>>
>> As this is the only email from this series in my inbox, I will say this
>> here, but it should really be against patch 6.
>>
>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>> serialising on the 486, but I don't have a manual to hand to check.
> I'll quote the (modern) SDM. For self-modifying code "The use of one
> of these options is not required for programs intended to run on the
> Pentium or Intel486 processors,
> but are recommended to ensure compatibility with the P6 and more
> recent processor families.". For cross-modifying code "The use of
> this option is not required for programs intended to run on the
> Intel486 processor, but is recommended
> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
> Pentium processors." So I'm not sure there's a problem.

Fair enough. (Assuming similar properties hold for the older processors
of other vendors.)

~Andrew

2016-12-02 17:23:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 9:16 AM, Andrew Cooper <[email protected]> wrote:
> On 02/12/16 17:07, Andy Lutomirski wrote:
>> On Dec 2, 2016 3:44 AM, "Andrew Cooper" <[email protected]> wrote:
>>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>>>> serialize on Xen PV, but, in practice, any trap it generates will
>>>> serialize.)
>>> Well, Xen will enable CPUID Faulting wherever it can, which is
>>> realistically all IvyBridge hardware and newer.
>>>
>>> All hypercalls are a privilege change to cpl0. I'd hope this condition
>>> is serialising, but I can't actually find any documentation proving or
>>> disproving this.
>> I don't know for sure. IRET is serializing, and if Xen returns using
>> IRET, we're fine.
>
> All returns to a 64bit PV guest at defined points (hypercall return,
> exception entry, etc) are from SYSRET, not IRET.

But CPUID faulting isn't like this, right? Unless Xen does
opportunistic SYSRET.

>
> Talking of, I still have a patch to remove
> PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.
>
>>
>>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>>>> should end up being a nice speedup.
>>>>
>>>> Cc: Andrew Cooper <[email protected]>
>>>> Signed-off-by: Andy Lutomirski <[email protected]>
>>> CC'ing xen-devel and the Xen maintainers in Linux.
>>>
>>> As this is the only email from this series in my inbox, I will say this
>>> here, but it should really be against patch 6.
>>>
>>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>>> serialising on the 486, but I don't have a manual to hand to check.
>> I'll quote the (modern) SDM. For self-modifying code "The use of one
>> of these options is not required for programs intended to run on the
>> Pentium or Intel486 processors,
>> but are recommended to ensure compatibility with the P6 and more
>> recent processor families.". For cross-modifying code "The use of
>> this option is not required for programs intended to run on the
>> Intel486 processor, but is recommended
>> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
>> Pentium processors." So I'm not sure there's a problem.
>
> Fair enough. (Assuming similar properties hold for the older processors
> of other vendors.)

No, you were right -- a different section of the SDM contradicts it:

For Intel486 processors, a write to an instruction in the cache will
modify it in both the cache and memory, but if
the instruction was prefetched before the write, the old version of
the instruction could be the one executed. To
prevent the old instruction from being executed, flush the instruction
prefetch unit by coding a jump instruction
immediately after any write that modifies an instruction.

--Andy

2016-12-02 17:26:30

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On 02/12/16 17:23, Andy Lutomirski wrote:
> On Fri, Dec 2, 2016 at 9:16 AM, Andrew Cooper <[email protected]> wrote:
>> On 02/12/16 17:07, Andy Lutomirski wrote:
>>> On Dec 2, 2016 3:44 AM, "Andrew Cooper" <[email protected]> wrote:
>>>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>>>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>>>>> serialize on Xen PV, but, in practice, any trap it generates will
>>>>> serialize.)
>>>> Well, Xen will enable CPUID Faulting wherever it can, which is
>>>> realistically all IvyBridge hardware and newer.
>>>>
>>>> All hypercalls are a privilege change to cpl0. I'd hope this condition
>>>> is serialising, but I can't actually find any documentation proving or
>>>> disproving this.
>>> I don't know for sure. IRET is serializing, and if Xen returns using
>>> IRET, we're fine.
>> All returns to a 64bit PV guest at defined points (hypercall return,
>> exception entry, etc) are from SYSRET, not IRET.
> But CPUID faulting isn't like this, right? Unless Xen does
> opportunistic SYSRET.

Correct. Xen doesn't do opportunistic SYSRET.

>
>> Talking of, I still have a patch to remove
>> PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.
>>
>>>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>>>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>>>>> should end up being a nice speedup.
>>>>>
>>>>> Cc: Andrew Cooper <[email protected]>
>>>>> Signed-off-by: Andy Lutomirski <[email protected]>
>>>> CC'ing xen-devel and the Xen maintainers in Linux.
>>>>
>>>> As this is the only email from this series in my inbox, I will say this
>>>> here, but it should really be against patch 6.
>>>>
>>>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>>>> serialising on the 486, but I don't have a manual to hand to check.
>>> I'll quote the (modern) SDM. For self-modifying code "The use of one
>>> of these options is not required for programs intended to run on the
>>> Pentium or Intel486 processors,
>>> but are recommended to ensure compatibility with the P6 and more
>>> recent processor families.". For cross-modifying code "The use of
>>> this option is not required for programs intended to run on the
>>> Intel486 processor, but is recommended
>>> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
>>> Pentium processors." So I'm not sure there's a problem.
>> Fair enough. (Assuming similar properties hold for the older processors
>> of other vendors.)
> No, you were right -- a different section of the SDM contradicts it:
>
> For Intel486 processors, a write to an instruction in the cache will
> modify it in both the cache and memory, but if
> the instruction was prefetched before the write, the old version of
> the instruction could be the one executed. To
> prevent the old instruction from being executed, flush the instruction
> prefetch unit by coding a jump instruction
> immediately after any write that modifies an instruction.

:(

Presumably this means patching has been subtly broken forever on the 486?

~Andrew

2016-12-02 17:32:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Thu, Dec 1, 2016 at 4:35 PM, Andy Lutomirski <[email protected]> wrote:
>
> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
> should end up being a nice speedup.

So if we care deeply about the performance of this, we should really
ask ourselves how much we need this...

There are *very* few places where we really need to do a full
serializing instruction, and I'd worry that we really don't need it in
many of the places we do this.

The only real case I'm aware of is modifying code that is modified
through a different linear address than it's executed.

Is there anything else where we _really_ need this sync-core thing?
Sure, the microcode loader looks fine, but that doesn't look
particularly performance-critical either.

So I'd like to know which sync_core is actually so
performance-critical that we care about it, and then I'd like to
understand why it's needed at all, because I suspect a number of them
have been added with the model of "sprinkle random things around and
hope".

Looking at ftrace, for example, which is one of the users, does it
actually _really_ need sync_core() at all? It seems to use the kernel
virtual address, and then the instruction stream will be coherent.

Adding Peter Anvin to the participants list, because iirc he was the
one who really talked to hardware engineers about the synchronization
issues with serializing kernel code.

Linus

2016-12-02 17:39:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 9:32 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Dec 1, 2016 at 4:35 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>> should end up being a nice speedup.
>
> So if we care deeply about the performance of this, we should really
> ask ourselves how much we need this...
>
> There are *very* few places where we really need to do a full
> serializing instruction, and I'd worry that we really don't need it in
> many of the places we do this.
>
> The only real case I'm aware of is modifying code that is modified
> through a different linear address than it's executed.

TBH, I didn't start down this path for performance. I did it because
I wanted to kill off a CPUID that was breaking on old CPUs that don't
have CPUID. So I propose MOV-to-CR2 followed by an unconditional
jump. My goal here is to make the #*!& thing work reliably and not be
ludicrously slow. Borislav and I mulled over using an alternative to
use CPUID if and only if we have CPUID, but that doesn't work because
we call sync_core() before we're done applying alternatives.

>
> Is there anything else where we _really_ need this sync-core thing?
> Sure, the microcode loader looks fine, but that doesn't look
> particularly performance-critical either.
>
> So I'd like to know which sync_core is actually so
> performance-critical that we care about it, and then I'd like to
> understand why it's needed at all, because I suspect a number of them
> have been added with the model of "sprinkle random things around and
> hope".

apply_alternatives, unfortunately. It's performance-critical because
it's intensely stupid and does sync_core() for every single patch.
Fixing that would be nice, too.

> Adding Peter Anvin to the participants list, because iirc he was the
> one who really talked to hardware engineers about the synchronization
> issues with serializing kernel code.
>
> Linus

2016-12-02 18:11:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 9:38 AM, Andy Lutomirski <[email protected]> wrote:
>
> apply_alternatives, unfortunately. It's performance-critical because
> it's intensely stupid and does sync_core() for every single patch.
> Fixing that would be nice, too.

So looking at text_poke_early(), that's very much a case that really
shouldn't need any "sync_core()" at all as far as I can tell.

Only the current CPU is running, and for local CPU I$ coherence all
you need is a jump instruction, and even that is only on really old
CPU's. From the PPro onwards (maybe even Pentium?) the I$ is entirely
serialized as long as you change the data using the same linear
address.

So at most, that function could mark itself "noinline" just to
guarantee that it will cause a control flow change before returning.
The sync_core() seems entirely bogus.

Same goes for optimize_nops() too.

Linus

2016-12-02 18:11:57

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 02, 2016 at 09:38:38AM -0800, Andy Lutomirski wrote:
> apply_alternatives, unfortunately. It's performance-critical because
> it's intensely stupid and does sync_core() for every single patch.
> Fixing that would be nice, too.

So I did experiment at the time to batch that sync_core() and interrupts
disabling in text_poke_early() but that amounted to almost nothing.
That's why I didn't bother chasing this further. I can try to dig out
that old thread...

/me goes and searches...

ah, here's something:

SNB:
* before:
[ 0.011707] SMP alternatives: Starting apply_alternatives...
[ 0.011784] SMP alternatives: done.
[ 0.011806] Freeing SMP alternatives: 20k freed

* after:
[ 0.011699] SMP alternatives: Starting apply_alternatives...
[ 0.011721] SMP alternatives: done.
[ 0.011743] Freeing SMP alternatives: 20k freed

-> 63 microseconds speedup


* kvm guest:
* before:
[ 0.017005] SMP alternatives: Starting apply_alternatives...
[ 0.019024] SMP alternatives: done.
[ 0.020119] Freeing SMP alternatives: 20k freed

* after:
[ 0.015008] SMP alternatives: Starting apply_alternatives...
[ 0.016029] SMP alternatives: done.
[ 0.017118] Freeing SMP alternatives: 20k freed

-> ~3 milliseconds speedup

---

I tried something simple like this:

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 1850592f4700..f4d5689ea503 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -253,10 +253,13 @@ void __init_or_module apply_alternatives(struct alt_instr *start,
struct alt_instr *end)
{
struct alt_instr *a;
+ unsigned long flags;
u8 *instr, *replacement;
u8 insnbuf[MAX_PATCH_LEN];

DPRINTK("%s: alt table %p -> %p\n", __func__, start, end);
+
+ local_irq_save(flags);
/*
* The scan order should be from start to end. A later scanned
* alternative code can overwrite a previous scanned alternative code.
@@ -284,8 +287,10 @@ void __init_or_module apply_alternatives(struct alt_instr *start,
add_nops(insnbuf + a->replacementlen,
a->instrlen - a->replacementlen);

- text_poke_early(instr, insnbuf, a->instrlen);
+ memcpy(instr, insnbuf, a->instrlen);
}
+ sync_core();
+ local_irq_restore(flags);
}

#ifdef CONFIG_SMP
---

I could try and redo the measurements again but I doubt it'll be any
different. Unless I haven't made a mistake somewhere...

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-02 18:29:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 10:03 AM, Borislav Petkov <[email protected]> wrote:
>
> SNB:
> * before:
> * after:

I suspect it's entirely invisible on raw hardware. But quite possibly
more noticeable in a VM that takes slow faults for every case.

But yes, even there it's probably not *that* noticeable.

I'd prefer to get rid of them just because I don't like voodoo
programming. If there is no actual reason for "sync_core()", we
shouldn't have one there. Even if it were entirely free and the
compiler optimized it away because the compiler was so smart that it
would see that it's pointless, it's misleading and wrong on a source
level.

That is, of course, assuming that there is no really subtle reason
why that stupid sync_core() is there.

Linus

2016-12-02 18:49:02

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> On 02/12/16 00:35, Andy Lutomirski wrote:
>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>> serialize on Xen PV, but, in practice, any trap it generates will
>> serialize.)
> Well, Xen will enable CPUID Faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0. I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.
>
>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>> should end up being a nice speedup.
>>
>> Cc: Andrew Cooper <[email protected]>
>> Signed-off-by: Andy Lutomirski <[email protected]>
> CC'ing xen-devel and the Xen maintainers in Linux.
>
> As this is the only email from this series in my inbox, I will say this
> here, but it should really be against patch 6.
>
> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> serialising on the 486, but I don't have a manual to hand to check.
>
> ~Andrew
>
>> ---
>> arch/x86/xen/enlighten.c | 35 +++++++++++++++++++++++++++++++++++
>> 1 file changed, 35 insertions(+)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index bdd855685403..1f765b41eee7 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
>> static __read_mostly unsigned int cpuid_leaf5_ecx_val;
>> static __read_mostly unsigned int cpuid_leaf5_edx_val;
>>
>> +static void xen_sync_core(void)
>> +{
>> + register void *__sp asm(_ASM_SP);
>> +
>> +#ifdef CONFIG_X86_32
>> + asm volatile (
>> + "pushl %%ss\n\t"
>> + "pushl %%esp\n\t"
>> + "addl $4, (%%esp)\n\t"
>> + "pushfl\n\t"
>> + "pushl %%cs\n\t"
>> + "pushl $1f\n\t"
>> + "iret\n\t"
>> + "1:"
>> + : "+r" (__sp) : : "cc");

This breaks 32-bit PV guests.

Why are we pushing %ss? We are not changing privilege levels so why not
just flags, cs and eip (which, incidentally, does work)?

-boris

>> +#else
>> + unsigned long tmp;
>> +
>> + asm volatile (
>> + "movq %%ss, %0\n\t"
>> + "pushq %0\n\t"
>> + "pushq %%rsp\n\t"
>> + "addq $8, (%%rsp)\n\t"
>> + "pushfq\n\t"
>> + "movq %%cs, %0\n\t"
>> + "pushq %0\n\t"
>> + "pushq $1f\n\t"
>> + "iretq\n\t"
>> + "1:"
>> + : "=r" (tmp), "+r" (__sp) : : "cc");
>> +#endif
>> +}
>> +
>> static void xen_cpuid(unsigned int *ax, unsigned int *bx,
>> unsigned int *cx, unsigned int *dx)
>> {
>> @@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
>>
>> .start_context_switch = paravirt_start_context_switch,
>> .end_context_switch = xen_end_context_switch,
>> +
>> + .sync_core = xen_sync_core,
>> };
>>
>> static void xen_reboot(int reason)
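For reference, the same-privilege 32-bit frame Boris is suggesting (flags, cs, and eip only, since IRET pops %esp/%ss only on a privilege change) would be roughly the sketch below. This is a hypothetical reconstruction with a name of my own, not a posted patch:

```c
static void xen_sync_core_32(void)	/* hypothetical name */
{
	/* Same-privilege IRET-to-self on 32-bit: three-word frame,
	 * no %ss/%esp, so it works unchanged under a 32-bit PV guest. */
	asm volatile("pushfl\n\t"
		     "pushl %%cs\n\t"
		     "pushl $1f\n\t"
		     "iret\n\t"
		     "1:"
		     : : : "memory", "cc");
}
```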

2016-12-02 18:50:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 02, 2016 at 10:27:29AM -0800, Linus Torvalds wrote:
> That is, of course, assuming that there is no really subtle reason
> why that stupid sync_core() is there.

Right, we can try to do something like invalidate_icache() or so in
there with the JMP so that the BSP refetches modified code and see where
it gets us.

The good thing is, the early patching paths run before SMP is
up but from looking at load_module(), for example, which does
post_relocation()->module_finalize()->apply_alternatives(), this can
happen late.

Now there I'd like to avoid other cores walking into that address being
patched. Or are we "safe" there in the sense that load_module() happens
on one CPU only sequentially? (I haven't looked at that code to see
what's going on there, actually).

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-02 19:03:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 10:50 AM, Borislav Petkov <[email protected]> wrote:
>
> Right, we can try to do something like invalidate_icache() or so in
> there with the JMP so that the BSP refetches modified code and see where
> it gets us.

I'd really rather just mark it noinline with a comment. That way the
return from the function acts as the control flow change.

> The good thing is, the early patching paths run before SMP is
> up but from looking at load_module(), for example, which does
> post_relocation()->module_finalize()->apply_alternatives(), this can
> happen late.
>
> Now there I'd like to avoid other cores walking into that address being
> patched. Or are we "safe" there in the sense that load_module() happens
> on one CPU only sequentially? (I haven't looked at that code to see
> what's going on there, actually).

'sync_core()' doesn't help for other CPU's anyway, you need to do the
cross-call IPI. So worrying about other CPU's is *not* a valid reason
to keep a "sync_core()" call.

Seriously, the only reason I can see for "sync_core()" really is:

- some deep non-serialized MSR access or similar (ie things like
firmware loading etc really might want it, and a machine check might
want it)

- writing to one virtual address, and then executing from another. We
do this for (a) user mode (of course) and (b) text_poke(), but
text_poke() is a whole different dance.

but I may have forgotten some other case.

The issues with modifying code while another CPU may be just about to
access it is a separate issue. And as noted, "sync_core()" is not
sufficient for that, you have to do a whole careful dance with
single-byte debug instruction writes and then a final cross-call.

See the whole "text_poke_bp()" and "text_poke()" for *that* whole
dance. That's a much more complex thing from the normal
apply_alternatives().

Linus

2016-12-02 19:20:59

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 02, 2016 at 11:03:50AM -0800, Linus Torvalds wrote:
> I'd really rather rjust mark it noinline with a comment. That way the
> return from the function acts as the control flow change.

Something like below?

It boots in a guest but that doesn't mean anything.

> 'sync_core()' doesn't help for other CPU's anyway, you need to do the
> cross-call IPI. So worrying about other CPU's is *not* a valid reason
> to keep a "sync_core()" call.

Yeah, no, I'm not crazy about it either - I was just sanity-checking all
call sites of apply_alternatives(). But as you say, we would've gotten
much bigger problems if other CPUs would walk in there on us.

> Seriously, the only reason I can see for "sync_core()" really is:
>
> - some deep non-serialized MSR access or similar (ie things like
> firmware loading etc really might want it, and a machine check might
> want it)

Yah, we do it in the #MC handler - apparently we need it there - and
in the microcode loader to tickle out the version of the microcode
currently applied into the MSR.

> The issues with modifying code while another CPU may be just about to
> access it is a separate issue. And as noted, "sync_core()" is not
> sufficient for that, you have to do a whole careful dance with
> single-byte debug instruction writes and then a final cross-call.
>
> See the whole "text_poke_bp()" and "text_poke()" for *that* whole
> dance. That's a much more complex thing from the normal
> apply_alternatives().

Yeah.

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cb272a7a5a3..b1d0c35e6dcb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -346,7 +346,6 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)

local_irq_save(flags);
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
- sync_core();
local_irq_restore(flags);

DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
@@ -359,9 +358,12 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
* This implies that asymmetric systems where APs have less capabilities than
* the boot processor are not handled. Tough. Make sure you disable such
* features by hand.
+ *
+ * Marked "noinline" to cause control flow change and thus insn cache
+ * to refetch changed I$ lines.
*/
-void __init_or_module apply_alternatives(struct alt_instr *start,
- struct alt_instr *end)
+void __init_or_module noinline apply_alternatives(struct alt_instr *start,
+ struct alt_instr *end)
{
struct alt_instr *a;
u8 *instr, *replacement;
@@ -667,7 +669,6 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
unsigned long flags;
local_irq_save(flags);
memcpy(addr, opcode, len);
- sync_core();
local_irq_restore(flags);
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-02 19:23:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 11:03 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 2, 2016 at 10:50 AM, Borislav Petkov <[email protected]> wrote:
>>
>> Right, we can try to do something like invalidate_icache() or so in
>> there with the JMP so that the BSP refetches modified code and see where
>> it gets us.
>
> I'd really rather rjust mark it noinline with a comment. That way the
> return from the function acts as the control flow change.
>
>> The good thing is, the early patching paths run before SMP is
>> up but from looking at load_module(), for example, which does
>> post_relocation()->module_finalize()->apply_alternatives(), this can
>> happen late.
>>
>> Now there I'd like to avoid other cores walking into that address being
>> patched. Or are we "safe" there in the sense that load_module() happens
>> on one CPU only sequentially? (I haven't looked at that code to see
>> what's going on there, actually).
>
> 'sync_core()' doesn't help for other CPU's anyway, you need to do the
> cross-call IPI. So worrying about other CPU's is *not* a valid reason
> to keep a "sync_core()" call.
>
> Seriously, the only reason I can see for "sync_core()" really is:
>
> - some deep non-serialized MSR access or similar (ie things like
> firmware loading etc really might want it, and a machine check might
> want it)

Not even firmware loading wants it. Firmware loading needs
specifically cpuid(eax=1). It has nothing to do with serializing
anything -- it's just a CPU bug that was turned into "architecture".
I think it really is just cross-address or cross-core modification,
and I'll add a comment to that effect.

--Andy

2016-12-02 19:24:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 11:20 AM, Borislav Petkov <[email protected]> wrote:
>
> Something like below?

The optimize-nops thing needs it too, I think.

Again, this will never matter in practice (even if somebody has an i486
still, the prefetch window size is like 16 bytes or something), but
from a documentation standpoint it's good.

Linus

2016-12-02 19:28:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 02, 2016 at 11:24:09AM -0800, Linus Torvalds wrote:
> The optimize-nops thing needs it too, I think.

Ah, it is called only from apply_alternatives() but sure, it is safer
this way. Lemme do that and run it through the boxes to see whether
anything catches fire.

> Again, this will never matter in practice (even if somebody has an i486
> still, the prefetch window size is like 16 bytes or something), but
> from a documentation standpoint it's good.

Right.

Thanks.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-02 19:30:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 02, 2016 at 11:23:09AM -0800, Andy Lutomirski wrote:
> Not even firmware loading wants it. Firmware loading needs

Microcode...

> specifically cpuid(eax=1). It has nothing to do with serializing

... but yes, of course. NOT sync_core() but CPUID(1).

Thanks!

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-02 19:32:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 11:24 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 2, 2016 at 11:20 AM, Borislav Petkov <[email protected]> wrote:
>>
>> Something like below?
>
> The optimize-nops thing needs it too, I think.
>
> Again, this will never matter in practice (even if somebody has an i486
> still, the prefetch window size is like 16 bytes or something), but
> from a documentation standpoint it's good.

How's this?

/*
* This function forces the icache and prefetched instruction stream to
* catch up with reality in two very specific cases:
*
* a) Text was modified using one virtual address and is about to be executed
* from the same physical page at a different virtual address.
*
* b) Text was modified on a different CPU, may subsequently be
* executed on this CPU, and you want to make sure the new version
* gets executed. This generally means you're calling this in an IPI.
*
* If you're calling this for a different reason, you're probably doing
* it wrong.
*/
static inline void native_sync_core(void) { ... }

The body will do a MOV-to-CR2 followed by jmp 1f; 1:. This sequence
should be guaranteed to flush the pipeline on any real CPU. On Xen it
will do IRET-to-self.

I suppose it could be an unconditional IRET-to-self, but that's a good
deal slower and not a whole lot simpler. Although if we start doing
it right, performance won't really matter here.

--Andy

2016-12-02 19:34:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Dec 2, 2016 10:48 AM, "Boris Ostrovsky" <[email protected]> wrote:
>
> On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> > On 02/12/16 00:35, Andy Lutomirski wrote:
> >> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> >> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
> >> serialize on Xen PV, but, in practice, any trap it generates will
> >> serialize.)
> > Well, Xen will enable CPUID Faulting wherever it can, which is
> > realistically all IvyBridge hardware and newer.
> >
> > All hypercalls are a privilege change to cpl0. I'd hope this condition
> > is serialising, but I can't actually find any documentation proving or
> > disproving this.
> >
> >> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> >> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
> >> should end up being a nice speedup.
> >>
> >> Cc: Andrew Cooper <[email protected]>
> >> Signed-off-by: Andy Lutomirski <[email protected]>
> > CC'ing xen-devel and the Xen maintainers in Linux.
> >
> > As this is the only email from this series in my inbox, I will say this
> > here, but it should really be against patch 6.
> >
> > A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> > serialising on the 486, but I don't have a manual to hand to check.
> >
> > ~Andrew
> >
> >> ---
> >> arch/x86/xen/enlighten.c | 35 +++++++++++++++++++++++++++++++++++
> >> 1 file changed, 35 insertions(+)
> >>
> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> >> index bdd855685403..1f765b41eee7 100644
> >> --- a/arch/x86/xen/enlighten.c
> >> +++ b/arch/x86/xen/enlighten.c
> >> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
> >> static __read_mostly unsigned int cpuid_leaf5_ecx_val;
> >> static __read_mostly unsigned int cpuid_leaf5_edx_val;
> >>
> >> +static void xen_sync_core(void)
> >> +{
> >> + register void *__sp asm(_ASM_SP);
> >> +
> >> +#ifdef CONFIG_X86_32
> >> + asm volatile (
> >> + "pushl %%ss\n\t"
> >> + "pushl %%esp\n\t"
> >> + "addl $4, (%%esp)\n\t"
> >> + "pushfl\n\t"
> >> + "pushl %%cs\n\t"
> >> + "pushl $1f\n\t"
> >> + "iret\n\t"
> >> + "1:"
> >> + : "+r" (__sp) : : "cc");
>
> This breaks 32-bit PV guests.
>
> Why are we pushing %ss? We are not changing privilege levels so why not
> just flags, cs and eip (which, incidentally, does work)?
>

Doh! I carefully tested 64-bit on Xen and 32-bit in user mode.

2016-12-02 19:36:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 11:30 AM, Andy Lutomirski <[email protected]> wrote:
>
> How's this?

Looks ok. I do think that

> I suppose it could be an unconditional IRET-to-self, but that's a good
> deal slower and not a whole lot simpler. Although if we start doing
> it right, performance won't really matter here.

Considering you already got the iret-to-self wrong in the first
version, I really like the "one single unconditional version" so that
everybody tests that _one_ thing and there isn't anything subtle going
on.

Hmm?

And yes, if it turns out that performance matters, we almost certainly
are doing something really wrong, and we shouldn't be using that
sync_core() thing in that place anyway.

Linus

2016-12-02 20:09:22

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> On 02/12/16 00:35, Andy Lutomirski wrote:
>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>> serialize on Xen PV, but, in practice, any trap it generates will
>> serialize.)
> Well, Xen will enable CPUID Faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0. I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.
>
>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>> should end up being a nice speedup.
>>
>> Cc: Andrew Cooper <[email protected]>
>> Signed-off-by: Andy Lutomirski <[email protected]>

Executing CPUID in an HVM guest is quite expensive since it will cause a
VMEXIT. (And that should be true for any hypervisor, at least on Intel.
On AMD it's configurable)

-boris

2016-12-02 20:41:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 11:35 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 2, 2016 at 11:30 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> How's this?
>
> Looks ok. I do think that
>
>> I suppose it could be an unconditional IRET-to-self, but that's a good
>> deal slower and not a whole lot simpler. Although if we start doing
>> it right, performance won't really matter here.
>
> Considering you already got the iret-to-self wrong in the first
> version, I really like the "one single unconditional version" so that
> everybody tests that _one_ thing and there isn't anything subtle going
> on.
>
> Hmm?

Okay, sold. It makes the patchset much much shorter, too.

>
> And yes, if it turns out that performance matters, we almost certainly
> are doing something really wrong, and we shouldn't be using that
> sync_core() thing in that place anyway.

To play devil's advocate (and definitely out of scope for this
particular patchset), is user code permitted to do:

1. Map a page RX at one address and RW somewhere else (for nice ASLR).
2. Write to the RW mapping.
3. CPUID or IRET-to-self.
4. Jump to the RX mapping.

Because, if so, we should maybe serialize whenever we migrate a
process to a different CPU. (We *definitely* need to flush the store
buffer when migrating, because the x86 architecture makes some memory
ordering promises that get broken if a store from a thread stays in
the store buffer of a different CPU when the thread gets migrated.)
And if we're going to start serializing when migrating a thread, then
we actually care about performance, in which case we should optimize
the crap out of this thing, which probably means using MFENCE on AMD
CPUs (AMD promises that MFENCE flushes the pipeline. Intel seems to
be confused as to exactly what effect MFENCE has, or at least I'm
confused as to what Intel thinks MFENCE does.) And we should make
sure that we only do the extra flush when we don't switch mms.

--Andy

2016-12-02 21:10:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 12:41 PM, Andy Lutomirski <[email protected]> wrote:
>
> Because, if so, we should maybe serialize whenever we migrate a
> process to a different CPU.

The intel docs are bad on this issue.

Technically what we do could fall under the "cross-modifying code"
case, where one CPU does the write, and then we run it on another CPU.

And no, we do *not* do a serializing instruction before returning to
user space. Sure, we might do an iret (which is serializing), but we
equally well might be doing a systret (which is not).

Honestly, I think Intel should clean up their documentation.

> (We *definitely* need to flush the store buffer when migrating,

There is no such thing as flushing the store buffer.

But we do end up doing a memory barrier which gives you the required
semantics. That's not a problem. Those operations are fast. The
serializing instructions are not.

Linus

2016-12-02 22:56:23

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 1:10 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 2, 2016 at 12:41 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Because, if so, we should maybe serialize whenever we migrate a
>> process to a different CPU.
>
> The intel docs are bad on this issue.
>
> Technically what we do could fall under the "cross-modifying code"
> case, where one CPU does the write, and then we run it on another CPU.
>
> And no, we do *not* do a serializing instruction before returning to
> user space. Sure, we might do an iret (which is serializing), but we
> equally well might be doing a systret (which is not).
>
> Honestly, I think Intel should clean up their documentation.
>

I'm not sure I follow. If a user program gets migrated, it might end
up doing cross-modification when it expects self-modification. If
that trips the program up, is that a user bug or a kernel bug?

Admittedly, I'd be very surprised if this happened in practice.
Migration is *slow*, caches tend to get blown away, lots of code gets
executed, etc. Presumably any prefetched / trace cached / decoded /
i-cached user code is long gone when we migrate.

2016-12-02 23:10:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 2, 2016 at 2:55 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> Honestly, I think Intel should clean up their documentation.
>
> I'm not sure I follow. If a user program gets migrated, it might end
> up doing cross-modification when it expects self-modification. If
> that trips the program up, is that a user bug or a kernel bug?

Well, the user may not see it as a cross-modification.

Example: user compiles a program, and writes out the new binary. That
write goes to the page cache.

The user then immediately executes that program.

It's technically a "cross modification", because the build that wrote
the page cache ran on one CPU, and then it gets loaded on another.

Now, page faulting the binary does bring in a known serializing
instruction: iret.

But let's theorize that we have your "optimistic sysret" return path
because sometimes it can happen. So the "iret" isn't exactly
fundamental.

But we know we will write %cr2, which is a serializing instruction.

But that's not fundamental either, because we *could* just have a
program just load the object file into its own address space using the
dynamic linker. And if you don't unmap anything, there won't be any
TLB flushes.

Now, that is safe *too*, but by then we're not even relying on
simply the fact that the code couldn't even have been in any virtual
caches in the running environment, so it _must_ have come from the
physically indexed data cache. So no actual serializing instruction
even _needed_.

So there is no room for any cache I$ coherency issue at any point, but
note how we got to the point where we're now basically depending on
some fairly fundamental logic that is not in the Intel documentation?

THAT is what I don't like. I don't doubt for a moment that what we're
doing is entirely coherent, and we're fine. But the intel memory
ordering documentation simply doesn't cover this situation at all. The
"real" memory ordering documentation only covers the normal data
cache. And then they handwave the "self-modifying code" situation with
incomplete examples and just bullshit "you need a serializing
instruction", which clearly isn't actually the case, and is also
something that we very much don't even do.

It would be better if there was actual documentation, and we had some
nice clear "yeah, we don't need any stinking serializing instructions,
because we're already doing X".

Linus

2016-12-03 12:44:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

On Fri, Dec 02, 2016 at 09:38:38AM -0800, Andy Lutomirski wrote:
> TBH, I didn't start down this path for performance. I did it because
> I wanted to kill off a CPUID that was breaking on old CPUs that don't
> have CPUID. So I propose MOV-to-CR2 followed by an unconditional
> jump. My goal here is to make the #*!& thing work reliably and not be
> ludicrously slow. Borislav and I mulled over using an alternative to
> use CPUID if and only if we have CPUID, but that doesn't work because
> we call sync_core() before we're done applying alternatives.

Btw if the noinline thing which Linus suggested, works out, we can still
do the alternatives thing with CPUID in sync_core() because we won't
need it in alternatives.c itself anymore.

Just putting it on the table, I know you complained about the mess
yesterday on IRC and in case the CR2 move looks painful on xen, we can
still do what we initially considered. I.e., that thing:

+ /* Do a CPUID if available, otherwise do a forward jump. */
+ alternative_io("jmp 1f\n\t1:", "cpuid",
+ X86_FEATURE_CPUID,
+ "=a" (tmp),
+ "0" (1)
+ : "ebx", "ecx", "edx", "memory");

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-03 15:03:04

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH] x86/alternatives: Do not use sync_core() to serialize I$

On Fri, Dec 02, 2016 at 08:28:44PM +0100, Borislav Petkov wrote:
> Ah, it is called only from apply_alternatives() but sure, it is safer
> this way. Lemme do that and run it through the boxes to see whether
> anything catches fire.

Looks good, ran it on a bunch of machines, two of them huge AMD and
Intel, 8-socket monsters.

Here is a version with a commit message:

---
From: Borislav Petkov <[email protected]>
Date: Fri, 2 Dec 2016 23:18:51 +0100
Subject: [PATCH] x86/alternatives: Do not use sync_core() to serialize I$

We use sync_core() in the alternatives code to stop speculative
execution of prefetched instructions because we are potentially changing
them and don't want to execute stale bytes.

What it does on most machines is call CPUID which is a serializing
instruction. And that's expensive.

However, the instruction cache is serialized when we're on the local CPU
and are changing the data through the same virtual address. So then, we
don't need the serializing CPUID but a simple control flow change. The
latter is accomplished with the CALL/RET that the noinline forces.

So let's try it.

Signed-off-by: Borislav Petkov <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/alternative.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cb272a7a5a3..c5b8f760473c 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -337,7 +337,11 @@ recompute_jump(struct alt_instr *a, u8 *orig_insn, u8 *repl_insn, u8 *insnbuf)
n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
}

-static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
+/*
+ * "noinline" to cause control flow change and thus invalidate I$ and
+ * cause refetch after modification.
+ */
+static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *instr)
{
unsigned long flags;

@@ -346,7 +350,6 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)

local_irq_save(flags);
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
- sync_core();
local_irq_restore(flags);

DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
@@ -359,9 +362,12 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
* This implies that asymmetric systems where APs have less capabilities than
* the boot processor are not handled. Tough. Make sure you disable such
* features by hand.
+ *
+ * Marked "noinline" to cause control flow change and thus insn cache
+ * to refetch changed I$ lines.
*/
-void __init_or_module apply_alternatives(struct alt_instr *start,
- struct alt_instr *end)
+void __init_or_module noinline apply_alternatives(struct alt_instr *start,
+ struct alt_instr *end)
{
struct alt_instr *a;
u8 *instr, *replacement;
@@ -667,7 +673,6 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
unsigned long flags;
local_irq_save(flags);
memcpy(addr, opcode, len);
- sync_core();
local_irq_restore(flags);
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
--
2.11.0

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-03 17:13:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] x86/alternatives: Do not use sync_core() to serialize I$

On Sat, Dec 3, 2016 at 7:02 AM, Borislav Petkov <[email protected]> wrote:
> On Fri, Dec 02, 2016 at 08:28:44PM +0100, Borislav Petkov wrote:
>> Ah, it is called only from apply_alternatives() but sure, it is safer
>> this way. Lemme do that and run it through the boxes to see whether
>> anything catches fire.
>
> Looks good, ran it on a bunch of machines, two of them huge AMD and
> Intel, 8-socket monsters.
>
> Here is a version with a commit message:
>
> ---
> From: Borislav Petkov <[email protected]>
> Date: Fri, 2 Dec 2016 23:18:51 +0100
> Subject: [PATCH] x86/alternatives: Do not use sync_core() to serialize I$
>
> We use sync_core() in the alternatives code to stop speculative
> execution of prefetched instructions because we are potentially changing
> them and don't want to execute stale bytes.
>
> What it does on most machines is call CPUID which is a serializing
> instruction. And that's expensive.
>
> However, the instruction cache is serialized when we're on the local CPU
> and are changing the data through the same virtual address. So then, we
> don't need the serializing CPUID but a simple control flow change. The
> latter is accomplished with the CALL/RET that the noinline forces.
>
> So let's try it.

I like it.

Reviewed-by: Andy Lutomirski <[email protected]>

Subject: [tip:x86/urgent] x86/alternatives: Do not use sync_core() to serialize I$

Commit-ID: 16f69e06c610a594dc182b9cca19257345598153
Gitweb: http://git.kernel.org/tip/16f69e06c610a594dc182b9cca19257345598153
Author: Borislav Petkov <[email protected]>
AuthorDate: Sat, 3 Dec 2016 16:02:58 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 20 Dec 2016 08:51:46 +0100

x86/alternatives: Do not use sync_core() to serialize I$

We use sync_core() in the alternatives code to stop speculative
execution of prefetched instructions because we are potentially changing
them and don't want to execute stale bytes.

What it does on most machines is call CPUID which is a serializing
instruction. And that's expensive.

However, the instruction cache is serialized when we're on the local CPU
and are changing the data through the same virtual address. So then, we
don't need the serializing CPUID but a simple control flow change. The
latter is accomplished with the CALL/RET that the noinline forces.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Henrique de Moraes Holschuh <[email protected]>
Cc: Matthew Whitehead <[email protected]>
Cc: One Thousand Gnomes <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/alternative.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cb272a..c5b8f76 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -337,7 +337,11 @@ done:
n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
}

-static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
+/*
+ * "noinline" to cause control flow change and thus invalidate I$ and
+ * cause refetch after modification.
+ */
+static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *instr)
{
unsigned long flags;

@@ -346,7 +350,6 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)

local_irq_save(flags);
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
- sync_core();
local_irq_restore(flags);

DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
@@ -359,9 +362,12 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
* This implies that asymmetric systems where APs have less capabilities than
* the boot processor are not handled. Tough. Make sure you disable such
* features by hand.
+ *
+ * Marked "noinline" to cause control flow change and thus insn cache
+ * to refetch changed I$ lines.
*/
-void __init_or_module apply_alternatives(struct alt_instr *start,
- struct alt_instr *end)
+void __init_or_module noinline apply_alternatives(struct alt_instr *start,
+ struct alt_instr *end)
{
struct alt_instr *a;
u8 *instr, *replacement;
@@ -667,7 +673,6 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
unsigned long flags;
local_irq_save(flags);
memcpy(addr, opcode, len);
- sync_core();
local_irq_restore(flags);
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */

Subject: [tip:x86/urgent] x86/alternatives: Do not use sync_core() to serialize I$

Commit-ID: 34bfab0eaf0fb5c6fb14c6b4013b06cdc7984466
Gitweb: http://git.kernel.org/tip/34bfab0eaf0fb5c6fb14c6b4013b06cdc7984466
Author: Borislav Petkov <[email protected]>
AuthorDate: Sat, 3 Dec 2016 16:02:58 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 20 Dec 2016 09:36:42 +0100

x86/alternatives: Do not use sync_core() to serialize I$

We use sync_core() in the alternatives code to stop speculative
execution of prefetched instructions because we are potentially changing
them and don't want to execute stale bytes.

What it does on most machines is call CPUID which is a serializing
instruction. And that's expensive.

However, the instruction cache is serialized when we're on the local CPU
and are changing the data through the same virtual address. So then, we
don't need the serializing CPUID but a simple control flow change. The
latter is accomplished with the CALL/RET that the noinline forces.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Henrique de Moraes Holschuh <[email protected]>
Cc: Matthew Whitehead <[email protected]>
Cc: One Thousand Gnomes <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/alternative.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cb272a..c5b8f76 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -337,7 +337,11 @@ done:
n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
}

-static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
+/*
+ * "noinline" to cause control flow change and thus invalidate I$ and
+ * cause refetch after modification.
+ */
+static void __init_or_module noinline optimize_nops(struct alt_instr *a, u8 *instr)
{
unsigned long flags;

@@ -346,7 +350,6 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)

local_irq_save(flags);
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
- sync_core();
local_irq_restore(flags);

DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
@@ -359,9 +362,12 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
* This implies that asymmetric systems where APs have less capabilities than
* the boot processor are not handled. Tough. Make sure you disable such
* features by hand.
+ *
+ * Marked "noinline" to cause control flow change and thus insn cache
+ * to refetch changed I$ lines.
*/
-void __init_or_module apply_alternatives(struct alt_instr *start,
- struct alt_instr *end)
+void __init_or_module noinline apply_alternatives(struct alt_instr *start,
+ struct alt_instr *end)
{
struct alt_instr *a;
u8 *instr, *replacement;
@@ -667,7 +673,6 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
unsigned long flags;
local_irq_save(flags);
memcpy(addr, opcode, len);
- sync_core();
local_irq_restore(flags);
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */