2022-07-16 23:45:28

by Thomas Gleixner

Subject: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

load_percpu_segment() is using wrmsr() which is paravirtualized. That's an
issue because the code sequence is:

__loadsegment_simple(gs, 0);
wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));

So anything which uses a per CPU variable between setting GS to 0 and
writing GSBASE is going to end up in a NULL pointer dereference. That
can be triggered with instrumentation and is guaranteed to be triggered
with callthunks for call depth tracking.

Use native_wrmsrl() instead. XEN_PV will trap and emulate, but that's not a
hot path.

Also make it static and mark it noinstr so that neither kprobes,
sanitizers nor anything else can touch it.

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/processor.h |  1 -
arch/x86/kernel/cpu/common.c     | 12 ++++++++++--
2 files changed, 10 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -673,7 +673,6 @@ extern struct desc_ptr early_gdt_descr;
extern void switch_to_new_gdt(int);
extern void load_direct_gdt(int);
extern void load_fixmap_gdt(int);
-extern void load_percpu_segment(int);
extern void cpu_init(void);
extern void cpu_init_secondary(void);
extern void cpu_init_exception_handling(void);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -701,13 +701,21 @@ static const char *table_lookup_model(st
__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));

-void load_percpu_segment(int cpu)
+static noinstr void load_percpu_segment(int cpu)
{
#ifdef CONFIG_X86_32
loadsegment(fs, __KERNEL_PERCPU);
#else
__loadsegment_simple(gs, 0);
- wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
+ /*
+ * Because of the __loadsegment_simple(gs, 0) above, any GS-prefixed
+ * instruction will explode right about here. As such, we must not have
+ * any CALL-thunks using per-cpu data.
+ *
+ * Therefore, use native_wrmsrl() and have XenPV take the fault and
+ * emulate.
+ */
+ native_wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
#endif
}
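
[ For reference, native_wrmsrl() bypasses the paravirt indirection and
issues the WRMSR instruction directly. A rough sketch of what it boils
down to (simplified from arch/x86/include/asm/msr.h, leaving out the
tracing and exception-table handling):

  static __always_inline void native_wrmsrl(unsigned int msr, u64 val)
  {
          /* WRMSR takes the MSR index in %ecx, the value in %edx:%eax */
          asm volatile("wrmsr"
                       : /* no outputs */
                       : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                       : "memory");
  }
]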



2022-07-17 15:53:42

by Linus Torvalds

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Sat, Jul 16, 2022 at 5:22 PM Andrew Cooper <[email protected]> wrote:
>
> It's only 32bit where the percpu pointer is tied to the GDT. On 64bit,
> gsbase is good before this, and remains good after.

That sounds sensible to me, but somebody should check that there's
nothing that accidentally relied on the MSR_GS_BASE setting (or the
segment selector clearing, for that matter).

Not that I can necessarily see how anything could work with it wrong, but..

Linus

2022-07-17 19:15:11

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Sun, Jul 17 2022 at 00:22, Andrew Cooper wrote:
>> -void load_percpu_segment(int cpu)
>> +static noinstr void load_percpu_segment(int cpu)
>> {
>> #ifdef CONFIG_X86_32
>> loadsegment(fs, __KERNEL_PERCPU);
>> #else
>> __loadsegment_simple(gs, 0);
>> - wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
>> + /*
>> + * Because of the __loadsegment_simple(gs, 0) above, any GS-prefixed
>> + * instruction will explode right about here. As such, we must not have
>> + * any CALL-thunks using per-cpu data.
>> + *
>> + * Therefore, use native_wrmsrl() and have XenPV take the fault and
>> + * emulate.
>> + */
>> + native_wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
>> #endif
>
> Lovely :-/
>
> But I still don't see how that works, because __loadsegment_simple() is
> a memory clobber and cpu_kernelmode_gs_base() has a per-cpu lookup in
> it.

No. It uses an array lookup :)
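
[ cpu_kernelmode_gs_base() uses per_cpu() with an explicit CPU number,
which resolves through the __per_cpu_offset[] array rather than through
a %gs-prefixed this_cpu access, so it stays safe while GS is 0. A rough
sketch, simplified from the percpu and processor.h headers:

  /* per_cpu(var, cpu) indexes a plain array of per CPU offsets ... */
  #define per_cpu_offset(x) (__per_cpu_offset[x])

  /* ... so this compiles to an ordinary memory access, no GS prefix: */
  static inline unsigned long cpu_kernelmode_gs_base(int cpu)
  {
          return (unsigned long)per_cpu(fixed_percpu_data.gs_base, cpu);
  }
]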

> That said, this only has a sole caller, and in context, it's bogus for
> 64bit.  Can't we fix all the problems by just doing this:
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 736262a76a12..6f393bc9d89d 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -701,16 +701,6 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
>  __u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
>  __u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
>  
> -void load_percpu_segment(int cpu)
> -{
> -#ifdef CONFIG_X86_32
> -       loadsegment(fs, __KERNEL_PERCPU);
> -#else
> -       __loadsegment_simple(gs, 0);
> -       wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
> -#endif
> -}
> -
>  #ifdef CONFIG_X86_32
>  /* The 32-bit entry code needs to find cpu_entry_area. */
>  DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
> @@ -742,12 +732,15 @@ EXPORT_SYMBOL_GPL(load_fixmap_gdt);
>   * Current gdt points %fs at the "master" per-cpu area: after this,
>   * it's on the real one.
>   */
> -void switch_to_new_gdt(int cpu)
> +void __noinstr switch_to_new_gdt(int cpu)
>  {
>         /* Load the original GDT */
>         load_direct_gdt(cpu);
> +
> +#ifdef CONFIG_X86_32
>         /* Reload the per-cpu base */
> -       load_percpu_segment(cpu);
> +       loadsegment(fs, __KERNEL_PERCPU);
> +#endif
>  }
>  
>  static const struct cpu_dev *cpu_devs[X86_VENDOR_NUM] = {};
>
>
> It's only 32bit where the percpu pointer is tied to the GDT.  On 64bit,
> gsbase is good before this, and remains good after.
>
> With this change,
>
> # Make sure load_percpu_segment has no stackprotector
> CFLAGS_common.o         := -fno-stack-protector
>
> comes up for re-evaluation too.

Good point. Let me stare at it some more.

Thanks,

tglx

2022-07-17 20:10:22

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Sun, Jul 17 2022 at 21:08, Thomas Gleixner wrote:
> On Sun, Jul 17 2022 at 00:22, Andrew Cooper wrote:
>>  #ifdef CONFIG_X86_32
>>  /* The 32-bit entry code needs to find cpu_entry_area. */
>>  DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
>> @@ -742,12 +732,15 @@ EXPORT_SYMBOL_GPL(load_fixmap_gdt);
>>   * Current gdt points %fs at the "master" per-cpu area: after this,
>>   * it's on the real one.
>>   */
>> -void switch_to_new_gdt(int cpu)
>> +void __noinstr switch_to_new_gdt(int cpu)
>>  {
>>         /* Load the original GDT */
>>         load_direct_gdt(cpu);
>> +
>> +#ifdef CONFIG_X86_32
>>         /* Reload the per-cpu base */
>> -       load_percpu_segment(cpu);
>> +       loadsegment(fs, __KERNEL_PERCPU);
>> +#endif
>>  }
>>  
>>  static const struct cpu_dev *cpu_devs[X86_VENDOR_NUM] = {};
>>
>>
>> It's only 32bit where the percpu pointer is tied to the GDT.  On 64bit,
>> gsbase is good before this, and remains good after.
>>
>> With this change,
>>
>> # Make sure load_percpu_segment has no stackprotector
>> CFLAGS_common.o         := -fno-stack-protector
>>
>> comes up for re-evaluation too.
>
> Good point. Let me stare at it some more.

If only it were that simple :)

loadsegment_simple() was a red herring. The gs segment is already zero.

So what explodes here is the early boot when switching from early per
CPU to the real per CPU area.

start_kernel()
  .....
  setup_per_cpu_areas();
  smp_prepare_boot_cpu()
    switch_to_new_gdt()
      load_direct_gdt(cpu);
      load_percpu_segment(cpu)
        GS: 0
        GS_BASE: 0xffffffff829d0000 (early PERCPU)
        wrmsrl()
        GS_BASE: 0xffff888237c00000 (real PERCPU)

So the explosion happens when accessing a per CPU variable after loading
the GDT and before GS_BASE is fixed up.
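
[ The faulting access itself is the usual GS-relative per CPU load. A
minimal sketch of what a this_cpu read boils down to on 64bit
(simplified; the real macros live in asm/percpu.h):

  /* this_cpu_read(var) compiles down to a %gs-prefixed load: */
  asm volatile("movq %%gs:%1, %0" : "=r" (val) : "m" (var));

  /*
   * With GS_BASE still pointing at the early per CPU area instead of
   * the real one, that load goes through the wrong base address.
   */
]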

That's the only case AFAICT where this matters. In all other invocations
GS_BASE is already correct.

Let me fix this proper.

Thanks,

tglx

2022-07-17 20:27:29

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Sun, Jul 17 2022 at 22:08, Thomas Gleixner wrote:
> On Sun, Jul 17 2022 at 21:08, Thomas Gleixner wrote:
> loadsegment_simple() was a red herring. The gs segment is already zero.
>
> So what explodes here is the early boot when switching from early per
> CPU to the real per CPU area.
>
> start_kernel()
> .....
> setup_per_cpu_areas();
> smp_prepare_boot_cpu()

Bah. switch_to_new_gdt() is already invoked from setup_per_cpu_areas()
and then again in smp_prepare_boot_cpu() and once more in cpu_init().

What a mess.

2022-07-17 21:56:44

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Sun, Jul 17 2022 at 22:13, Thomas Gleixner wrote:
> On Sun, Jul 17 2022 at 22:08, Thomas Gleixner wrote:
>> On Sun, Jul 17 2022 at 21:08, Thomas Gleixner wrote:
>> loadsegment_simple() was a red herring. The gs segment is already zero.
>>
>> So what explodes here is the early boot when switching from early per
>> CPU to the real per CPU area.
>>
>> start_kernel()
>> .....
>> setup_per_cpu_areas();
>> smp_prepare_boot_cpu()
>
> Bah. switch_to_new_gdt() is already invoked from setup_per_cpu_areas()
> and then again in smp_prepare_boot_cpu() and once more in cpu_init().
>
> What a mess.

So the below builds and boots at least on 64bit. I'll stare at it some
more tomorrow. I have no idea whether native_load_gdt() works with
XEN_PV. It should, but what do I know.

Thanks,

tglx
---
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -205,7 +205,7 @@ static inline void native_set_ldt(const
}
}

-static inline void native_load_gdt(const struct desc_ptr *dtr)
+static __always_inline void native_load_gdt(const struct desc_ptr *dtr)
{
asm volatile("lgdt %0"::"m" (*dtr));
}
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -670,10 +670,9 @@ extern int sysenter_setup(void);
/* Defined in head.S */
extern struct desc_ptr early_gdt_descr;

-extern void switch_to_new_gdt(int);
+extern void switch_to_real_gdt(int);
extern void load_direct_gdt(int);
extern void load_fixmap_gdt(int);
-extern void load_percpu_segment(int);
extern void cpu_init(void);
extern void cpu_init_secondary(void);
extern void cpu_init_exception_handling(void);
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -7,20 +7,24 @@
ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_common.o = -pg
CFLAGS_REMOVE_perf_event.o = -pg
+CFLAGS_REMOVE_switch_gdt.o = -pg
endif

# If these files are instrumented, boot hangs during the first second.
KCOV_INSTRUMENT_common.o := n
KCOV_INSTRUMENT_perf_event.o := n
+KCOV_INSTRUMENT_switch_gdt.o := n

# As above, instrumenting secondary CPU boot code causes boot hangs.
KCSAN_SANITIZE_common.o := n
+KCSAN_SANITIZE_switch_gdt.o := n

-# Make sure load_percpu_segment has no stackprotector
-CFLAGS_common.o := -fno-stack-protector
+# Make sure that switching the GDT and the per CPU segment
+# does not have stack protector enabled.
+CFLAGS_switch_gdt.o := -fno-stack-protector

obj-y := cacheinfo.o scattered.o topology.o
-obj-y += common.o
+obj-y += common.o switch_gdt.o
obj-y += rdrand.o
obj-y += match.o
obj-y += bugs.o
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -701,16 +701,6 @@ static const char *table_lookup_model(st
__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));

-void load_percpu_segment(int cpu)
-{
-#ifdef CONFIG_X86_32
- loadsegment(fs, __KERNEL_PERCPU);
-#else
- __loadsegment_simple(gs, 0);
- wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
-#endif
-}
-
#ifdef CONFIG_X86_32
/* The 32-bit entry code needs to find cpu_entry_area. */
DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
@@ -738,18 +728,6 @@ void load_fixmap_gdt(int cpu)
}
EXPORT_SYMBOL_GPL(load_fixmap_gdt);

-/*
- * Current gdt points %fs at the "master" per-cpu area: after this,
- * it's on the real one.
- */
-void switch_to_new_gdt(int cpu)
-{
- /* Load the original GDT */
- load_direct_gdt(cpu);
- /* Reload the per-cpu base */
- load_percpu_segment(cpu);
-}
-
static const struct cpu_dev *cpu_devs[X86_VENDOR_NUM] = {};

static void get_model_name(struct cpuinfo_x86 *c)
@@ -2228,12 +2206,6 @@ void cpu_init(void)
boot_cpu_has(X86_FEATURE_TSC) || boot_cpu_has(X86_FEATURE_DE))
cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);

- /*
- * Initialize the per-CPU GDT with the boot GDT,
- * and set up the GDT descriptor:
- */
- switch_to_new_gdt(cpu);
-
if (IS_ENABLED(CONFIG_X86_64)) {
loadsegment(fs, 0);
memset(cur->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
--- /dev/null
+++ b/arch/x86/kernel/cpu/switch_gdt.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <asm/processor.h>
+#include <asm/segment.h>
+#include <asm/desc.h>
+
+/*
+ * Invoked during early boot to switch from early GDT and early per CPU
+ * (%fs on 32bit, GS_BASE on 64bit) to the real GDT and the runtime per CPU
+ * area.
+ *
+ * This has to be done atomically because after switching from early GDT to
+ * the real one any per cpu variable access is going to fault because
+ * %fs resp. GS_BASE is not yet pointing to the real per CPU data.
+ *
+ * As a consequence this uses the native variants of load_gdt() and
+ * wrmsrl(). So XEN_PV has to take the fault and emulate.
+ */
+void __init switch_to_real_gdt(int cpu)
+{
+ struct desc_ptr gdt_descr;
+
+ gdt_descr.address = (long)get_cpu_gdt_rw(cpu);
+ gdt_descr.size = GDT_SIZE - 1;
+ native_load_gdt(&gdt_descr);
+
+#ifdef CONFIG_X86_32
+ loadsegment(fs, __KERNEL_PERCPU);
+#else
+ native_wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
+#endif
+}
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -211,7 +211,7 @@ void __init setup_per_cpu_areas(void)
* area. Reload any changed state for the boot CPU.
*/
if (!cpu)
- switch_to_new_gdt(cpu);
+ switch_to_real_gdt(cpu);
}

/* indicate the early static arrays will soon be gone */
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1457,7 +1457,11 @@ void arch_thaw_secondary_cpus_end(void)
void __init native_smp_prepare_boot_cpu(void)
{
int me = smp_processor_id();
- switch_to_new_gdt(me);
+
+ /* SMP invokes this from setup_per_cpu_areas() */
+ if (!IS_ENABLED(CONFIG_SMP))
+ switch_to_real_gdt(me);
+
/* already set me in cpu_online_mask in boot_cpu_init() */
cpumask_set_cpu(me, cpu_callout_mask);
cpu_set_state_online(me);
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1164,7 +1164,7 @@ static void __init xen_setup_gdt(int cpu
pv_ops.cpu.write_gdt_entry = xen_write_gdt_entry_boot;
pv_ops.cpu.load_gdt = xen_load_gdt_boot;

- switch_to_new_gdt(cpu);
+ switch_to_real_gdt(cpu);

pv_ops.cpu.write_gdt_entry = xen_write_gdt_entry;
pv_ops.cpu.load_gdt = xen_load_gdt;

2022-07-18 05:37:01

by Jürgen Groß

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On 17.07.22 23:54, Thomas Gleixner wrote:
> On Sun, Jul 17 2022 at 22:13, Thomas Gleixner wrote:
>> On Sun, Jul 17 2022 at 22:08, Thomas Gleixner wrote:
>>> On Sun, Jul 17 2022 at 21:08, Thomas Gleixner wrote:
>>> loadsegment_simple() was a red herring. The gs segment is already zero.
>>>
>>> So what explodes here is the early boot when switching from early per
>>> CPU to the real per CPU area.
>>>
>>> start_kernel()
>>> .....
>>> setup_per_cpu_areas();
>>> smp_prepare_boot_cpu()
>>
>> Bah. switch_to_new_gdt() is already invoked from setup_per_cpu_areas()
>> and then again in smp_prepare_boot_cpu() and once more in cpu_init().
>>
>> What a mess.
>
> So the below builds and boots at least on 64bit. I'll stare at it some
> more tomorrow. I have no idea whether native_load_gdt() works with
> XEN_PV. It should, but what do I know.

No, shouldn't work. But ...

>
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -205,7 +205,7 @@ static inline void native_set_ldt(const
> }
> }
>
> -static inline void native_load_gdt(const struct desc_ptr *dtr)
> +static __always_inline void native_load_gdt(const struct desc_ptr *dtr)
> {
> asm volatile("lgdt %0"::"m" (*dtr));
> }
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -670,10 +670,9 @@ extern int sysenter_setup(void);
> /* Defined in head.S */
> extern struct desc_ptr early_gdt_descr;
>
> -extern void switch_to_new_gdt(int);
> +extern void switch_to_real_gdt(int);
> extern void load_direct_gdt(int);
> extern void load_fixmap_gdt(int);
> -extern void load_percpu_segment(int);
> extern void cpu_init(void);
> extern void cpu_init_secondary(void);
> extern void cpu_init_exception_handling(void);
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -7,20 +7,24 @@
> ifdef CONFIG_FUNCTION_TRACER
> CFLAGS_REMOVE_common.o = -pg
> CFLAGS_REMOVE_perf_event.o = -pg
> +CFLAGS_REMOVE_switch_gdt.o = -pg
> endif
>
> # If these files are instrumented, boot hangs during the first second.
> KCOV_INSTRUMENT_common.o := n
> KCOV_INSTRUMENT_perf_event.o := n
> +KCOV_INSTRUMENT_switch_gdt.o := n
>
> # As above, instrumenting secondary CPU boot code causes boot hangs.
> KCSAN_SANITIZE_common.o := n
> +KCSAN_SANITIZE_switch_gdt.o := n
>
> -# Make sure load_percpu_segment has no stackprotector
> -CFLAGS_common.o := -fno-stack-protector
> +# Make sure that switching the GDT and the per CPU segment
> +# does not have stack protector enabled.
> +CFLAGS_switch_gdt.o := -fno-stack-protector
>
> obj-y := cacheinfo.o scattered.o topology.o
> -obj-y += common.o
> +obj-y += common.o switch_gdt.o
> obj-y += rdrand.o
> obj-y += match.o
> obj-y += bugs.o
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -701,16 +701,6 @@ static const char *table_lookup_model(st
> __u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> __u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
>
> -void load_percpu_segment(int cpu)
> -{
> -#ifdef CONFIG_X86_32
> - loadsegment(fs, __KERNEL_PERCPU);
> -#else
> - __loadsegment_simple(gs, 0);
> - wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
> -#endif
> -}
> -
> #ifdef CONFIG_X86_32
> /* The 32-bit entry code needs to find cpu_entry_area. */
> DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
> @@ -738,18 +728,6 @@ void load_fixmap_gdt(int cpu)
> }
> EXPORT_SYMBOL_GPL(load_fixmap_gdt);
>
> -/*
> - * Current gdt points %fs at the "master" per-cpu area: after this,
> - * it's on the real one.
> - */
> -void switch_to_new_gdt(int cpu)
> -{
> - /* Load the original GDT */
> - load_direct_gdt(cpu);
> - /* Reload the per-cpu base */
> - load_percpu_segment(cpu);
> -}
> -
> static const struct cpu_dev *cpu_devs[X86_VENDOR_NUM] = {};
>
> static void get_model_name(struct cpuinfo_x86 *c)
> @@ -2228,12 +2206,6 @@ void cpu_init(void)
> boot_cpu_has(X86_FEATURE_TSC) || boot_cpu_has(X86_FEATURE_DE))
> cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
>
> - /*
> - * Initialize the per-CPU GDT with the boot GDT,
> - * and set up the GDT descriptor:
> - */
> - switch_to_new_gdt(cpu);
> -
> if (IS_ENABLED(CONFIG_X86_64)) {
> loadsegment(fs, 0);
> memset(cur->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/switch_gdt.c
> @@ -0,0 +1,31 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <asm/processor.h>
> +#include <asm/segment.h>
> +#include <asm/desc.h>
> +
> +/*
> + * Invoked during early boot to switch from early GDT and early per CPU
> + * (%fs on 32bit, GS_BASE on 64bit) to the real GDT and the runtime per CPU
> + * area.
> + *
> + * This has to be done atomically because after switching from early GDT to
> + * the real one any per cpu variable access is going to fault because
> + * %fs resp. GS_BASE is not yet pointing to the real per CPU data.
> + *
> + * As a consequence this uses the native variants of load_gdt() and
> + * wrmsrl(). So XEN_PV has to take the fault and emulate.
> + */
> +void __init switch_to_real_gdt(int cpu)
> +{
> + struct desc_ptr gdt_descr;
> +
> + gdt_descr.address = (long)get_cpu_gdt_rw(cpu);
> + gdt_descr.size = GDT_SIZE - 1;
> + native_load_gdt(&gdt_descr);
> +
> +#ifdef CONFIG_X86_32
> + loadsegment(fs, __KERNEL_PERCPU);
> +#else
> + native_wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
> +#endif
> +}
> --- a/arch/x86/kernel/setup_percpu.c
> +++ b/arch/x86/kernel/setup_percpu.c
> @@ -211,7 +211,7 @@ void __init setup_per_cpu_areas(void)
> * area. Reload any changed state for the boot CPU.
> */
> if (!cpu)
> - switch_to_new_gdt(cpu);
> + switch_to_real_gdt(cpu);
> }
>
> /* indicate the early static arrays will soon be gone */
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1457,7 +1457,11 @@ void arch_thaw_secondary_cpus_end(void)
> void __init native_smp_prepare_boot_cpu(void)
> {
> int me = smp_processor_id();
> - switch_to_new_gdt(me);
> +
> + /* SMP invokes this from setup_per_cpu_areas() */
> + if (!IS_ENABLED(CONFIG_SMP))
> + switch_to_real_gdt(me);
> +
> /* already set me in cpu_online_mask in boot_cpu_init() */
> cpumask_set_cpu(me, cpu_callout_mask);
> cpu_set_state_online(me);
> --- a/arch/x86/xen/enlighten_pv.c
> +++ b/arch/x86/xen/enlighten_pv.c
> @@ -1164,7 +1164,7 @@ static void __init xen_setup_gdt(int cpu
> pv_ops.cpu.write_gdt_entry = xen_write_gdt_entry_boot;
> pv_ops.cpu.load_gdt = xen_load_gdt_boot;
>
> - switch_to_new_gdt(cpu);
> + switch_to_real_gdt(cpu);

... can't you use the paravirt variant of load_gdt in switch_to_real_gdt() ?

>
> pv_ops.cpu.write_gdt_entry = xen_write_gdt_entry;
> pv_ops.cpu.load_gdt = xen_load_gdt;
>

Juergen



2022-07-18 07:16:34

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Mon, Jul 18 2022 at 07:11, Juergen Gross wrote:
> On 17.07.22 23:54, Thomas Gleixner wrote:
>> void __init native_smp_prepare_boot_cpu(void)
>> {
>> int me = smp_processor_id();
>> - switch_to_new_gdt(me);
>> +
>> + /* SMP invokes this from setup_per_cpu_areas() */
>> + if (!IS_ENABLED(CONFIG_SMP))
>> + switch_to_real_gdt(me);
>> +
>> /* already set me in cpu_online_mask in boot_cpu_init() */
>> cpumask_set_cpu(me, cpu_callout_mask);
>> cpu_set_state_online(me);
>> --- a/arch/x86/xen/enlighten_pv.c
>> +++ b/arch/x86/xen/enlighten_pv.c
>> @@ -1164,7 +1164,7 @@ static void __init xen_setup_gdt(int cpu
>> pv_ops.cpu.write_gdt_entry = xen_write_gdt_entry_boot;
>> pv_ops.cpu.load_gdt = xen_load_gdt_boot;
>>
>> - switch_to_new_gdt(cpu);
>> + switch_to_real_gdt(cpu);
>
> ... can't you use the paravirt variant of load_gdt in switch_to_real_gdt() ?

That does not solve the problem of having a disagreement between GDT and
GS_BASE. Let me dig into this some more.

Thanks,

tglx

2022-07-18 09:33:11

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Mon, Jul 18 2022 at 08:54, Thomas Gleixner wrote:
> On Mon, Jul 18 2022 at 07:11, Juergen Gross wrote:
>>> - switch_to_new_gdt(cpu);
>>> + switch_to_real_gdt(cpu);
>>
>> ... can't you use the paravirt variant of load_gdt in switch_to_real_gdt() ?
>
> That does not solve the problem of having a disagreement between GDT and
> GS_BASE. Let me dig into this some more.

Bah. The real problem is __loadsegment_simple(gs, 0). After that GS_BASE
is 0. So any per CPU access before setting MSR_GS_BASE back to working
state is going into lala land.

So it's not the GDT. It's the mov 0, %gs which makes stuff go south, but
as %gs is already 0, we can keep the paravirt load_gdt() and use
native_write_msr() and everything should be happy.
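
[ A minimal sketch of that shape, reusing the switch_to_real_gdt() name
from the earlier draft (an assumption, not the final committed form):

  void __init switch_to_real_gdt(int cpu)
  {
          /* %gs is already 0, so the paravirt aware GDT load is fine */
          load_direct_gdt(cpu);

  #ifdef CONFIG_X86_32
          loadsegment(fs, __KERNEL_PERCPU);
  #else
          /* Raw WRMSR; XEN_PV takes the fault and emulates it */
          native_wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
  #endif
  }
]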

Thanks,

tglx

2022-07-18 09:54:28

by Peter Zijlstra

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Mon, Jul 18, 2022 at 10:55:29AM +0200, Thomas Gleixner wrote:
> On Mon, Jul 18 2022 at 08:54, Thomas Gleixner wrote:
> > On Mon, Jul 18 2022 at 07:11, Juergen Gross wrote:
> >>> - switch_to_new_gdt(cpu);
> >>> + switch_to_real_gdt(cpu);
> >>
> >> ... can't you use the paravirt variant of load_gdt in switch_to_real_gdt() ?
> >
> > That does not solve the problem of having a disagreement between GDT and
> > GS_BASE. Let me dig into this some more.
>
> Bah. The real problem is __loadsegment_simple(gs, 0). After that GS_BASE
> is 0. So any per CPU access before setting MSR_GS_BASE back to working
> state is going into lala land.
>
> So it's not the GDT. It's the mov 0, %gs which makes stuff go south, but
> as %gs is already 0, we can keep the paravirt load_gdt() and use
> native_write_msr() and everything should be happy.

How is the ret from xen_load_gdt() not going to explode?

2022-07-18 10:41:00

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Mon, Jul 18 2022 at 11:31, Peter Zijlstra wrote:
> On Mon, Jul 18, 2022 at 10:55:29AM +0200, Thomas Gleixner wrote:
>> On Mon, Jul 18 2022 at 08:54, Thomas Gleixner wrote:
>> > On Mon, Jul 18 2022 at 07:11, Juergen Gross wrote:
>> >>> - switch_to_new_gdt(cpu);
>> >>> + switch_to_real_gdt(cpu);
>> >>
>> >> ... can't you use the paravirt variant of load_gdt in switch_to_real_gdt() ?
>> >
>> > That does not solve the problem of having a disagreement between GDT and
>> > GS_BASE. Let me dig into this some more.
>>
>> Bah. The real problem is __loadsegment_simple(gs, 0). After that GS_BASE
>> is 0. So any per CPU access before setting MSR_GS_BASE back to working
>> state is going into lala land.
>>
>> So it's not the GDT. It's the mov 0, %gs which makes stuff go south, but
>> as %gs is already 0, we can keep the paravirt load_gdt() and use
>> native_write_msr() and everything should be happy.
>
> How is the ret from xen_load_gdt() not going to explode?

This is only for the early boot _before_ all the patching happens. So
that goes through the default retthunk.

Secondary CPUs do not need that as they set up GDT and GS_BASE in the
low level asm code before coming out to C.

I'm still trying to figure out how this works on XENPV and on 32bit.

Sigh...

2022-07-18 11:51:32

by Thomas Gleixner

Subject: Re: [patch 02/38] x86/cpu: Use native_wrmsrl() in load_percpu_segment()

On Mon, Jul 18 2022 at 12:33, Thomas Gleixner wrote:
> On Mon, Jul 18 2022 at 11:31, Peter Zijlstra wrote:
>> On Mon, Jul 18, 2022 at 10:55:29AM +0200, Thomas Gleixner wrote:
>>> On Mon, Jul 18 2022 at 08:54, Thomas Gleixner wrote:
>>> > On Mon, Jul 18 2022 at 07:11, Juergen Gross wrote:
>>> >>> - switch_to_new_gdt(cpu);
>>> >>> + switch_to_real_gdt(cpu);
>>> >>
>>> >> ... can't you use the paravirt variant of load_gdt in switch_to_real_gdt() ?
>>> >
>>> > That does not solve the problem of having a disagreement between GDT and
>>> > GS_BASE. Let me dig into this some more.
>>>
>>> Bah. The real problem is __loadsegment_simple(gs, 0). After that GS_BASE
>>> is 0. So any per CPU access before setting MSR_GS_BASE back to working
>>> state is going into lala land.
>>>
>>> So it's not the GDT. It's the mov 0, %gs which makes stuff go south, but
>>> as %gs is already 0, we can keep the paravirt load_gdt() and use
>>> native_write_msr() and everything should be happy.
>>
>> How is the ret from xen_load_gdt() not going to explode?
>
> This is only for the early boot _before_ all the patching happens. So
> that goes through the default retthunk.
>
> Secondary CPUs do not need that as they set up GDT and GS_BASE in the
> low level asm code before coming out to C.
>
> I'm still trying to figure out how this works on XENPV and on 32bit.

On 32bit the CPU comes out with GDT and FS correctly set too.

For XEN_PV it looks like cpu_initialize_context() hands down GDT and
GSBASE info to the hypercall which kicks the CPU, so we should be
good there as well. Emphasis on should. Jürgen?
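
[ For reference, a rough sketch of the relevant bits of
cpu_initialize_context() in arch/x86/xen/smp_pv.c (simplified; field
names from the v5.19-era code): the vcpu context handed to the
hypervisor already carries the GDT frames and the kernel GS base, so a
secondary CPU never sees the inconsistent window:

  ctxt->gdt_frames[0]  = gdt_mfn;
  ctxt->gdt_ents       = GDT_ENTRIES;
  ctxt->gs_base_kernel = per_cpu_offset(cpu);
  ...
  if (HYPERVISOR_vcpu_op(VCPUOP_initialise, xen_vcpu_nr(cpu), ctxt))
          BUG();
]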

Thanks,

tglx