2020-10-01 21:02:47

by Gabriel Krisman Bertazi

Subject: [PATCH v2 0/9] Reclaim TIF_IA32 and TIF_X32

TIF_IA32 and TIF_X32 are not strictly necessary and are only set at
task creation time, which doesn't fit processes that transition between
64/32 bits. In addition, we are running out of TI flags for x86, and it
is generally a good idea to reduce architecture-specific TI flags before
moving the generic ones to common code.

Many of the ideas for this patchset came from Andy Lutomirski (Thank
you!)

The only difference between v2 and v1 is the addition of the final 3
patches, which handle the last 3 remaining uses of TIF_IA32 and TIF_X32
and actually remove the TI flags.

The testing for this patchset was done by exercising the code paths of
each case where the flags were used, with x32, ia32 and x64
applications. In addition, the x86 selftests showed no regressions.


Gabriel Krisman Bertazi (9):
x86: events: Avoid TIF_IA32 when checking 64bit mode
x86: Simplify compat syscall userspace allocation
x86: oprofile: Avoid TIF_IA32 when checking 64bit mode
x86: elf: Use e_machine to choose DLINFO in compat
x86: elf: Use e_machine to select start_thread for x32
x86: elf: Use e_machine to select setup_additional_pages for x32
x86: Use current USER_CS to setup correct context on vmx entry
x86: Convert mmu context ia32_compat into a proper flags field
x86: Reclaim TIF_IA32 and TIF_X32

arch/x86/entry/vdso/vma.c | 21 ++++++++++-------
arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
arch/x86/events/core.c | 2 +-
arch/x86/events/intel/ds.c | 2 +-
arch/x86/events/intel/lbr.c | 2 +-
arch/x86/include/asm/compat.h | 15 ++++++------
arch/x86/include/asm/elf.h | 24 ++++++++++++++-----
arch/x86/include/asm/mmu.h | 6 +++--
arch/x86/include/asm/mmu_context.h | 2 +-
arch/x86/include/asm/thread_info.h | 4 ----
arch/x86/kernel/perf_regs.c | 2 +-
arch/x86/kernel/process_64.c | 34 ++++++++++++++-------------
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/oprofile/backtrace.c | 2 +-
14 files changed, 67 insertions(+), 53 deletions(-)

--
2.28.0


2020-10-01 21:02:57

by Gabriel Krisman Bertazi

Subject: [PATCH v2 9/9] x86: Reclaim TIF_IA32 and TIF_X32

Now that these flags are no longer used, reclaim those TI bits.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
arch/x86/include/asm/thread_info.h | 4 ----
arch/x86/kernel/process_64.c | 6 ------
2 files changed, 10 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..6888aa39c4d6 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -91,7 +91,6 @@ struct thread_info {
#define TIF_NEED_FPU_LOAD 14 /* load FPU on return to userspace */
#define TIF_NOCPUID 15 /* CPUID is not accessible in userland */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
-#define TIF_IA32 17 /* IA32 compatibility process */
#define TIF_SLD 18 /* Restore split lock detection on context switch */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
@@ -101,7 +100,6 @@ struct thread_info {
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
-#define TIF_X32 30 /* 32-bit native x86-64 binary */
#define TIF_FSCHECK 31 /* Check FS is USER_DS on return */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
@@ -121,7 +119,6 @@ struct thread_info {
#define _TIF_NEED_FPU_LOAD (1 << TIF_NEED_FPU_LOAD)
#define _TIF_NOCPUID (1 << TIF_NOCPUID)
#define _TIF_NOTSC (1 << TIF_NOTSC)
-#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_SLD (1 << TIF_SLD)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
@@ -130,7 +127,6 @@ struct thread_info {
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
-#define _TIF_X32 (1 << TIF_X32)
#define _TIF_FSCHECK (1 << TIF_FSCHECK)

/* flags to check in __switch_to() */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3226ceed409c..b557312aa9cb 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -644,9 +644,7 @@ void set_personality_64bit(void)
/* inherit personality from parent */

/* Make sure to be in 64bit mode */
- clear_thread_flag(TIF_IA32);
clear_thread_flag(TIF_ADDR32);
- clear_thread_flag(TIF_X32);
/* Pretend that this comes from a 64bit execve */
task_pt_regs(current)->orig_ax = __NR_execve;
current_thread_info()->status &= ~TS_COMPAT;
@@ -663,8 +661,6 @@ void set_personality_64bit(void)
static void __set_personality_x32(void)
{
#ifdef CONFIG_X86_X32
- clear_thread_flag(TIF_IA32);
- set_thread_flag(TIF_X32);
if (current->mm)
current->mm->context.flags = 0;

@@ -685,8 +681,6 @@ static void __set_personality_x32(void)
static void __set_personality_ia32(void)
{
#ifdef CONFIG_IA32_EMULATION
- set_thread_flag(TIF_IA32);
- clear_thread_flag(TIF_X32);
if (current->mm) {
/*
* uprobes applied to this MM need to know this and
--
2.28.0

2020-10-01 21:03:44

by Gabriel Krisman Bertazi

Subject: [PATCH v2 6/9] x86: elf: Use e_machine to select setup_additional_pages for x32

Since TIF_X32 is going away, avoid using it to find the ELF type when
choosing which additional pages to set up.

According to the SysV AMD64 ABI draft, an AMD64 ELF object using ILP32 must
have ELFCLASS32 with (E_MACHINE == EM_X86_64), so use that ELF field to
differentiate an x32 object from an IA32 object when selecting
setup_additional_pages in compat mode.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
arch/x86/entry/vdso/vma.c | 21 ++++++++++++---------
arch/x86/include/asm/elf.h | 11 ++++++++---
2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 9185cb1d13b9..7a3cda8294a3 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -412,22 +412,25 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
}

#ifdef CONFIG_COMPAT
-int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
- int uses_interp)
+int compat_arch_setup_additional_pages_ia32(struct linux_binprm *bprm,
+ int uses_interp)
{
-#ifdef CONFIG_X86_X32_ABI
- if (test_thread_flag(TIF_X32)) {
- if (!vdso64_enabled)
- return 0;
- return map_vdso_randomized(&vdso_image_x32);
- }
-#endif
#ifdef CONFIG_IA32_EMULATION
return load_vdso32();
#else
return 0;
#endif
}
+
+int compat_arch_setup_additional_pages_x32(struct linux_binprm *bprm,
+ int uses_interp)
+{
+#ifdef CONFIG_X86_X32_ABI
+ if (vdso64_enabled)
+ return map_vdso_randomized(&vdso_image_x32);
+#endif
+ return 0;
+}
#endif
#else
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 33c1c9be2e07..4d91f5b1079f 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -388,9 +388,14 @@ struct linux_binprm;
#define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
extern int arch_setup_additional_pages(struct linux_binprm *bprm,
int uses_interp);
-extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
- int uses_interp);
-#define compat_arch_setup_additional_pages compat_arch_setup_additional_pages
+extern int compat_arch_setup_additional_pages_ia32(struct linux_binprm *bprm,
+ int uses_interp);
+extern int compat_arch_setup_additional_pages_x32(struct linux_binprm *bprm,
+ int uses_interp);
+
+#define compat_arch_setup_additional_pages \
+ ((elf_ex->e_machine == EM_X86_64) ? \
+ compat_arch_setup_additional_pages_x32 : compat_arch_setup_additional_pages_ia32)

/* Do not change the values. See get_align_mask() */
enum align_flags {
--
2.28.0

2020-10-01 21:04:30

by Gabriel Krisman Bertazi

Subject: [PATCH v2 8/9] x86: Convert mmu context ia32_compat into a proper flags field

The ia32_compat attribute is a weird thing. It mirrors TIF_IA32 and
TIF_X32 and is used in only two very unrelated places: (1) to decide if
the vsyscall page is accessible, and (2) for uprobes to find whether the
patched instruction is 32 or 64 bit. In preparation for removing the TI
flags, we want new values for ia32_compat, but given its odd semantics,
I'd rather make it a real flags field that configures these specific
behaviours. So, set_personality_64bit() can ask for the vsyscall page,
which is not available in x32/ia32, and set_personality_ia32() can
configure the uprobe code as needed.

uprobes cannot rely on other methods like user_64bit_mode() to decide how
to patch, so it needs a specific flag like this.
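
A compressed sketch of the intended semantics (the flag names are the ones
introduced here; the standalone helpers are hypothetical, only to show how
the two behaviours decouple):

	#include <stdbool.h>

	#define MM_CONTEXT_UPROBE_IA32	1	/* uprobes on this mm patch 32-bit code */
	#define MM_CONTEXT_GATE_PAGE	2	/* mm maps the vsyscall gate page */

	/* 64-bit execve sets MM_CONTEXT_GATE_PAGE, ia32 execve sets
	 * MM_CONTEXT_UPROBE_IA32, x32 execve sets neither. */

	static bool mm_has_gate_page(unsigned short flags)
	{
		return flags & MM_CONTEXT_GATE_PAGE;
	}

	static bool mm_needs_ia32_uprobes(unsigned short flags)
	{
		return flags & MM_CONTEXT_UPROBE_IA32;
	}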

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
arch/x86/include/asm/mmu.h | 6 ++++--
arch/x86/include/asm/mmu_context.h | 2 +-
arch/x86/kernel/process_64.c | 17 +++++++++++------
4 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 44c33103a955..20abc396dbe0 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -316,7 +316,7 @@ static struct vm_area_struct gate_vma __ro_after_init = {
struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
{
#ifdef CONFIG_COMPAT
- if (!mm || mm->context.ia32_compat)
+ if (!mm || !(mm->context.flags & MM_CONTEXT_GATE_PAGE))
return NULL;
#endif
if (vsyscall_mode == NONE)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 9257667d13c5..76ab742a0e39 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -7,6 +7,9 @@
#include <linux/mutex.h>
#include <linux/atomic.h>

+#define MM_CONTEXT_UPROBE_IA32 1 /* Uprobes on this MM assume 32-bit code */
+#define MM_CONTEXT_GATE_PAGE 2 /* Whether MM has gate page */
+
/*
* x86 has arch-specific MMU state beyond what lives in mm_struct.
*/
@@ -33,8 +36,7 @@ typedef struct {
#endif

#ifdef CONFIG_X86_64
- /* True if mm supports a task running in 32 bit compatibility mode. */
- unsigned short ia32_compat;
+ unsigned short flags;
#endif

struct mutex lock;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index d98016b83755..054a79157323 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -177,7 +177,7 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
static inline bool is_64bit_mm(struct mm_struct *mm)
{
return !IS_ENABLED(CONFIG_IA32_EMULATION) ||
- !(mm->context.ia32_compat == TIF_IA32);
+ !(mm->context.flags & MM_CONTEXT_UPROBE_IA32);
}
#else
static inline bool is_64bit_mm(struct mm_struct *mm)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 56e882c339e6..3226ceed409c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -650,10 +650,8 @@ void set_personality_64bit(void)
/* Pretend that this comes from a 64bit execve */
task_pt_regs(current)->orig_ax = __NR_execve;
current_thread_info()->status &= ~TS_COMPAT;
-
- /* Ensure the corresponding mm is not marked. */
if (current->mm)
- current->mm->context.ia32_compat = 0;
+ current->mm->context.flags = MM_CONTEXT_GATE_PAGE;

/* TBD: overwrites user setup. Should have two bits.
But 64bit processes have always behaved this way,
@@ -668,7 +666,8 @@ static void __set_personality_x32(void)
clear_thread_flag(TIF_IA32);
set_thread_flag(TIF_X32);
if (current->mm)
- current->mm->context.ia32_compat = TIF_X32;
+ current->mm->context.flags = 0;
+
current->personality &= ~READ_IMPLIES_EXEC;
/*
* in_32bit_syscall() uses the presence of the x32 syscall bit
@@ -688,8 +687,14 @@ static void __set_personality_ia32(void)
#ifdef CONFIG_IA32_EMULATION
set_thread_flag(TIF_IA32);
clear_thread_flag(TIF_X32);
- if (current->mm)
- current->mm->context.ia32_compat = TIF_IA32;
+ if (current->mm) {
+ /*
+ * uprobes applied to this MM need to know this and
+ * cannot use user_64bit_mode() at that time.
+ */
+ current->mm->context.flags = MM_CONTEXT_UPROBE_IA32;
+ }
+
current->personality |= force_personality32;
/* Prepare the first "return" to user space */
task_pt_regs(current)->orig_ax = __NR_ia32_execve;
--
2.28.0

2020-10-01 21:04:31

by Gabriel Krisman Bertazi

Subject: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

vmx_prepare_switch_to_guest() shouldn't use is_64bit_mm(), which has a
very specific use in uprobes. Use the user_64bit_mode() helper instead.
This reduces the usage of is_64bit_mm(), which is awkward since it relies
on the personality at load time; that is fine for uprobes, but doesn't
seem right here.
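
To make the distinction concrete: user_64bit_mode() keys off the saved CS
in pt_regs, i.e. the mode the task will actually return to, not the
personality recorded at execve time. A standalone sketch (the selector
values are the usual Linux/x86-64 GDT ones, hard-coded here purely for
illustration):

	#include <stdbool.h>
	#include <stdint.h>

	#define SAMPLE_USER_CS		0x33	/* 64-bit user code segment */
	#define SAMPLE_USER32_CS	0x23	/* 32-bit compat user code segment */

	/* Roughly what user_64bit_mode() answers, given a saved user CS. */
	static bool returns_to_64bit_user(uint16_t saved_cs)
	{
		return saved_cs == SAMPLE_USER_CS;
	}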

I tested this by running VMs with 64-bit and 32-bit payloads from 64-bit
and 32-bit programs.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7b2a068f08c1..b5aafd9e5f5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
savesegment(es, host_state->es_sel);

gs_base = cpu_kernelmode_gs_base(cpu);
- if (likely(is_64bit_mm(current->mm))) {
+ if (likely(user_64bit_mode(current_pt_regs()))) {
current_save_fsgs();
fs_sel = current->thread.fsindex;
gs_sel = current->thread.gsindex;
--
2.28.0

2020-10-01 21:54:14

by Andy Lutomirski

Subject: Re: [PATCH v2 6/9] x86: elf: Use e_machine to select setup_additional_pages for x32

On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Since TIF_X32 is going away, avoid using it to find the ELF type when
> choosing which additional pages to set up.
>
> According to SysV AMD64 ABI Draft, an AMD64 ELF object using ILP32 must
> have ELFCLASS32 with (E_MACHINE == EM_X86_64), so use that ELF field to
> differentiate a x32 object from a IA32 object when executing
> start_thread in compat mode.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> arch/x86/entry/vdso/vma.c | 21 ++++++++++++---------
> arch/x86/include/asm/elf.h | 11 ++++++++---
> 2 files changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index 9185cb1d13b9..7a3cda8294a3 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -412,22 +412,25 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> }
>
> #ifdef CONFIG_COMPAT
> -int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
> - int uses_interp)
> +int compat_arch_setup_additional_pages_ia32(struct linux_binprm *bprm,
> + int uses_interp)
> {
> -#ifdef CONFIG_X86_X32_ABI
> - if (test_thread_flag(TIF_X32)) {
> - if (!vdso64_enabled)
> - return 0;
> - return map_vdso_randomized(&vdso_image_x32);
> - }
> -#endif
> #ifdef CONFIG_IA32_EMULATION
> return load_vdso32();
> #else
> return 0;
> #endif
> }
> +
> +int compat_arch_setup_additional_pages_x32(struct linux_binprm *bprm,
> + int uses_interp)
> +{
> +#ifdef CONFIG_X86_X32_ABI
> + if (vdso64_enabled)
> + return map_vdso_randomized(&vdso_image_x32);
> +#endif
> + return 0;
> +}
> #endif
> #else
> int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index 33c1c9be2e07..4d91f5b1079f 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -388,9 +388,14 @@ struct linux_binprm;
> #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
> extern int arch_setup_additional_pages(struct linux_binprm *bprm,
> int uses_interp);
> -extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
> - int uses_interp);
> -#define compat_arch_setup_additional_pages compat_arch_setup_additional_pages
> +extern int compat_arch_setup_additional_pages_ia32(struct linux_binprm *bprm,
> + int uses_interp);
> +extern int compat_arch_setup_additional_pages_x32(struct linux_binprm *bprm,
> + int uses_interp);
> +
> +#define compat_arch_setup_additional_pages \
> + ((elf_ex->e_machine == EM_X86_64) ? \
> + compat_arch_setup_additional_pages_x32 : compat_arch_setup_additional_pages_ia32)
>

As in the previous patch, can you wire up the new argument for real, please?

2020-10-01 21:56:23

by Andy Lutomirski

Subject: Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> vmx_prepare_switch_to_guest shouldn't use is_64bit_mm, which has a
> very specific use in uprobes. Use the user_64bit_mode helper instead.
> This reduces the usage of is_64bit_mm, which is awkward, since it relies
> on the personality at load time, which is fine for uprobes, but doesn't
> seem fine here.
>
> I tested this by running VMs with 64 and 32 bits payloads from 64/32
> programs.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7b2a068f08c1..b5aafd9e5f5d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> savesegment(es, host_state->es_sel);
>
> gs_base = cpu_kernelmode_gs_base(cpu);
> - if (likely(is_64bit_mm(current->mm))) {
> + if (likely(user_64bit_mode(current_pt_regs()))) {
> current_save_fsgs();
> fs_sel = current->thread.fsindex;
> gs_sel = current->thread.gsindex;

I disagree with this one. This whole code path is nonsense. Can you
just remove the condition entirely and use the 64-bit path
unconditionally?
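
As I read it, the unconditional version would be roughly the following
(untested sketch; the last two assignments come from the surrounding
vmx_prepare_switch_to_guest() code and are not visible in the quoted hunk):

	gs_base = cpu_kernelmode_gs_base(cpu);
	current_save_fsgs();
	fs_sel = current->thread.fsindex;
	gs_sel = current->thread.gsindex;
	fs_base = current->thread.fsbase;
	vmx->msr_host_kernel_gs_base = current->thread.gsbase;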

2020-10-01 22:13:26

by Andy Lutomirski

Subject: Re: [PATCH v2 8/9] x86: Convert mmu context ia32_compat into a proper flags field

On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> The ia32_compat attribute is a weird thing. It mirrors TIF_IA32 and
> TIF_X32 and is used only in two very unrelated places: (1) to decide if
> the vsyscall page is accessible (2) for uprobes to find whether the
> patched instruction is 32 or 64 bit. In preparation to remove the TI
> flags, we want new values for ia32_compat, but given its odd semantics,
> I'd rather make it a real flags field that configures these specific
> behaviours. So, set_personality_x64 can ask for the vsyscall page,
> which is not available in x32/ia32 and set_personality_ia32 can
> configure the uprobe code as needed.
>
> uprobe cannot rely on other methods like user_64bit_mode() to decide how
> to patch, so it needs some specific flag like this.

I like this quite a bit, but can you rename MM_CONTEXT_GATE_PAGE to
MM_CONTEXT_HAS_VSYSCALL?

--Andy

2020-10-02 22:42:54

by Sean Christopherson

Subject: Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

On Thu, Oct 01, 2020 at 02:52:32PM -0700, Andy Lutomirski wrote:
> On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
> <[email protected]> wrote:
> >
> > vmx_prepare_switch_to_guest shouldn't use is_64bit_mm, which has a
> > very specific use in uprobes. Use the user_64bit_mode helper instead.
> > This reduces the usage of is_64bit_mm, which is awkward, since it relies
> > on the personality at load time, which is fine for uprobes, but doesn't
> > seem fine here.
> >
> > I tested this by running VMs with 64 and 32 bits payloads from 64/32
> > programs.
> >
> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 7b2a068f08c1..b5aafd9e5f5d 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > savesegment(es, host_state->es_sel);
> >
> > gs_base = cpu_kernelmode_gs_base(cpu);
> > - if (likely(is_64bit_mm(current->mm))) {
> > + if (likely(user_64bit_mode(current_pt_regs()))) {
> > current_save_fsgs();
> > fs_sel = current->thread.fsindex;
> > gs_sel = current->thread.gsindex;
>
> I disagree with this one. This whole code path is nonsense. Can you
> just remove the condition entirely and use the 64-bit path
> unconditionally?

I finally came back to this one with fresh eyes. I've read through the code
a bajillion times and typed up half a dozen responses. I think, finally, I
understand what's broken.

I believe your original assertion that the bug was misdiagnosed is correct
(can't link because LKML wasn't in the loop). I'm pretty sure your analysis
that KVM's handling of things works mostly by coincidence is also correct.

The coincidence is that "real" VMMs all use arch_prctl(), and
do_arch_prctl_64() ensures thread.{fs,gs}base are accurate. save_base_legacy()
detects sel==0 and intentionally does nothing, knowing the base is already
accurate.
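
For reference, the arch_prctl() path looks roughly like this from userspace
(hypothetical snippet; going through the syscall is what keeps
thread.fsbase in sync):

	#include <asm/prctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdint.h>

	/* Illustrative only: a VMM-style FS base update that goes through
	 * do_arch_prctl_64(), so the kernel's cached thread.fsbase stays
	 * accurate. */
	static long set_fs_base(uint64_t base)
	{
		return syscall(SYS_arch_prctl, ARCH_SET_FS, base);
	}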

Userspaces that don't use arch_prctl(), in the bug report's case a 32-bit
compat test, may or may not have accurate thread.{fs,gs}base values. This is
especially true if sel != 0, as save_base_legacy() explicitly zeros the base
in that case, knowing load_seg_legacy() will restore the segment on the back
end.

KVM on the other hand assumes thread.{fs,gs}base are always fresh. When that
didn't hold true for userspace that didn't use arch_prctl(), the fix of
detecting a !64-bit mm just so happened to work because all 64-bit VMMs use
arch_prctl().

It's tempting to just open code this and use RD{FS,GS}BASE when possible,
i.e. avoid any guesswork. Maybe with a module param that userspace can set
to tell KVM it doesn't do anything fancy with FS/GS base (will userspace still
use arch_prctl() even if FSGSBASE is available?).

	savesegment(fs, fs_sel);
	savesegment(gs, gs_sel);
	if (use_current_fsgs_base) {
		fs_base = current->thread.fsbase;
		vmx->msr_host_kernel_gs_base = current->thread.gsbase;
	} else if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
		fs_base = rdfsbase();
		vmx->msr_host_kernel_gs_base = __rdgsbase_inactive();
	} else {
		fs_base = read_msr(MSR_FS_BASE);
		vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
	}

I'll revisit this on Monday and run a patch by Paolo.

2020-10-02 23:14:20

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

Sean Christopherson <[email protected]> writes:

> On Thu, Oct 01, 2020 at 02:52:32PM -0700, Andy Lutomirski wrote:
>> On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
>> <[email protected]> wrote:
>> >
>> > vmx_prepare_switch_to_guest shouldn't use is_64bit_mm, which has a
>> > very specific use in uprobes. Use the user_64bit_mode helper instead.
>> > This reduces the usage of is_64bit_mm, which is awkward, since it relies
>> > on the personality at load time, which is fine for uprobes, but doesn't
>> > seem fine here.
>> >
>> > I tested this by running VMs with 64 and 32 bits payloads from 64/32
>> > programs.
>> >
>> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>> > ---
>> > arch/x86/kvm/vmx/vmx.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> > index 7b2a068f08c1..b5aafd9e5f5d 100644
>> > --- a/arch/x86/kvm/vmx/vmx.c
>> > +++ b/arch/x86/kvm/vmx/vmx.c
>> > @@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>> > savesegment(es, host_state->es_sel);
>> >
>> > gs_base = cpu_kernelmode_gs_base(cpu);
>> > - if (likely(is_64bit_mm(current->mm))) {
>> > + if (likely(user_64bit_mode(current_pt_regs()))) {
>> > current_save_fsgs();
>> > fs_sel = current->thread.fsindex;
>> > gs_sel = current->thread.gsindex;
>>
>> I disagree with this one. This whole code path is nonsense. Can you
>> just remove the condition entirely and use the 64-bit path
>> unconditionally?
>
> I finally came back to this one with fresh eyes. I've read through the code
> a bajllion times and typed up half a dozen responses. I think, finally, I
> understand what's broken.
>
> I believe your original assertion that the bug was misdiagnosed is correct
> (can't link because LKML wasn't in the loop). I'm pretty sure your analysis
> that KVM's handling of things works mostly by coincidence is also correct.
>
> The coincidence is that "real" VMMs all use arch_prctl(), and
> do_arch_prctl_64() ensures thread.{fs,gs}base are accurate. save_base_legacy()
> detects sel==0 and intentionally does nothing, knowing the the base is already
> accurate.
>
> Userspaces that don't use arch_prctl(), in the bug report case a 32-bit compat
> test, may or may not have accurate thread.{fs,gs}base values. This is
> especially true if sel!=0 as save_base_legacy() explicitly zeros the base in
> this case, as load_seg_legacy() will restore the seg on the backend.
>
> KVM on the other hand assumes thread.{fs,gs}base are always fresh. When that
> didn't hold true for userspace that didn't use arch_prctl(), the fix of
> detecting a !64-bit mm just so happened to work because all 64-bit VMMs use
> arch_prctl().
>
> It's tempting to just open code this and use RD{FS,GS}BASE when possible,
> i.e. avoid any guesswork. Maybe with a module param that userspace can set
> to tell KVM it doesn't do anything fancy with FS/GS base (will userspace still
> use arch_prctl() even if FSGSABSE is available?).
>
> savesegment(fs, fs_sel);
> savesegment(gs, gs_sel);
> if (use_current_fsgs_base) {
> fs_base = current->thread.fsbase;
> vmx->msr_host_kernel_gs_base = current->thread.gsbase;
> } else if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> fs_base = rdfsbase()
> vmx->msr_host_kernel_gs_base = __rdgsbase_inactive();
> } else {
> fs_base = read_msr(MSR_FS_BASE);
> vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> }
>
> I'll revisit this on Monday and run a patch by Paolo.

If this is the case, I will just drop the current patch from my series
and leave it to you. Given that is_64bit_mm() is not going away, this
use case can be fixed separately.

--
Gabriel Krisman Bertazi

2020-10-03 00:18:55

by Andy Lutomirski

Subject: Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

On Fri, Oct 2, 2020 at 3:40 PM Sean Christopherson
<[email protected]> wrote:
>
> On Thu, Oct 01, 2020 at 02:52:32PM -0700, Andy Lutomirski wrote:
> > On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
> > <[email protected]> wrote:
> > >
> > > vmx_prepare_switch_to_guest shouldn't use is_64bit_mm, which has a
> > > very specific use in uprobes. Use the user_64bit_mode helper instead.
> > > This reduces the usage of is_64bit_mm, which is awkward, since it relies
> > > on the personality at load time, which is fine for uprobes, but doesn't
> > > seem fine here.
> > >
> > > I tested this by running VMs with 64 and 32 bits payloads from 64/32
> > > programs.
> > >
> > > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 7b2a068f08c1..b5aafd9e5f5d 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > > savesegment(es, host_state->es_sel);
> > >
> > > gs_base = cpu_kernelmode_gs_base(cpu);
> > > - if (likely(is_64bit_mm(current->mm))) {
> > > + if (likely(user_64bit_mode(current_pt_regs()))) {
> > > current_save_fsgs();
> > > fs_sel = current->thread.fsindex;
> > > gs_sel = current->thread.gsindex;
> >
> > I disagree with this one. This whole code path is nonsense. Can you
> > just remove the condition entirely and use the 64-bit path
> > unconditionally?
>
> I finally came back to this one with fresh eyes. I've read through the code
> a bajllion times and typed up half a dozen responses. I think, finally, I
> understand what's broken.
>
> I believe your original assertion that the bug was misdiagnosed is correct
> (can't link because LKML wasn't in the loop). I'm pretty sure your analysis
> that KVM's handling of things works mostly by coincidence is also correct.
>
> The coincidence is that "real" VMMs all use arch_prctl(), and
> do_arch_prctl_64() ensures thread.{fs,gs}base are accurate. save_base_legacy()
> detects sel==0 and intentionally does nothing, knowing the the base is already
> accurate.
>
> Userspaces that don't use arch_prctl(), in the bug report case a 32-bit compat
> test, may or may not have accurate thread.{fs,gs}base values. This is
> especially true if sel!=0 as save_base_legacy() explicitly zeros the base in
> this case, as load_seg_legacy() will restore the seg on the backend.
>
> KVM on the other hand assumes thread.{fs,gs}base are always fresh. When that
> didn't hold true for userspace that didn't use arch_prctl(), the fix of
> detecting a !64-bit mm just so happened to work because all 64-bit VMMs use
> arch_prctl().
>
> It's tempting to just open code this and use RD{FS,GS}BASE when possible,
> i.e. avoid any guesswork. Maybe with a module param that userspace can set
> to tell KVM it doesn't do anything fancy with FS/GS base (will userspace still
> use arch_prctl() even if FSGSABSE is available?).
>
> savesegment(fs, fs_sel);
> savesegment(gs, gs_sel);
> if (use_current_fsgs_base) {
> fs_base = current->thread.fsbase;
> vmx->msr_host_kernel_gs_base = current->thread.gsbase;

I don't like this. The FSGSBASE case is fast, and the !FSGSBASE case
is mostly obsolete. I see no great reason to have a module parameter
asking for incorrect behavior. There have been too many bugs in this
area -- let's not add more please.

> } else if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> fs_base = rdfsbase()
> vmx->msr_host_kernel_gs_base = __rdgsbase_inactive();
> } else {
> fs_base = read_msr(MSR_FS_BASE);
> vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> }

I'm okay with this, but I think you should fix the corresponding bugs
on the restore side as well. The current code is:

	if (host_state->ldt_sel || (host_state->gs_sel & 7)) {
		kvm_load_ldt(host_state->ldt_sel);
#ifdef CONFIG_X86_64
		load_gs_index(host_state->gs_sel);
#else
		loadsegment(gs, host_state->gs_sel);
#endif
	}
	if (host_state->fs_sel & 7)
		loadsegment(fs, host_state->fs_sel);

which is blatantly wrong in the case where fs_sel.TI == 1, gs_sel.TI
== 0, and ldt_sel != 0. But it's also more subtly wrong -- this
corrupts all the segment attributes in the case where a segment points
to the GDT and the GDT attributes are non-default. So I would suggest
making the code radically simpler and more correct:

	if (host_state->ldt_sel)
		kvm_load_ldt(host_state->ldt_sel);  // see below
	if (host_state->ds_sel)
		loadsegment(ds, host_state->ds_sel);
	if (host_state->es_sel)
		loadsegment(es, host_state->es_sel);
	if (host_state->fs_sel) {
		loadsegment(fs, host_state->fs_sel);
		x86_fsbase_write_cpu(host_state->fs_base);
	}
	if (host_state->gs_sel) {
		load_gs_index(host_state->gs_sel);
		x86_gsbase_write_cpu_inactive(host_state->msr_host_kernel_gs_base);
	}

In the IMO unlikely event that any performance-critical KVM userspace
runs with these selectors != 0, you could also skip the load if they
are set to __USER_DS.

You can also simplify this crud:

	if (unlikely(fs_sel != host->fs_sel)) {
		if (!(fs_sel & 7))
			vmcs_write16(HOST_FS_SELECTOR, fs_sel);
		else
			vmcs_write16(HOST_FS_SELECTOR, 0);
		host->fs_sel = fs_sel;
	}

There is nothing special about segments with TI set according to the
SDM (AFAICT) nor is anything particularly special about them in
Linux's segment tables, so this code makes little sense. It could
just be:

	host->fs_sel = fs_sel;

and leave the VMCS field set to 0. Or if you do the __USER_DS
optimization above, you could do:

	if (unlikely(fs_sel != host->fs_sel)) {
		vmcs_write16(HOST_FS_SELECTOR, fs_sel == __USER_DS ? __USER_DS : 0);
		host->fs_sel = fs_sel;
	}

I suspect that the only users who care about performance (or for whom
we should care about performance) are normal userspace, and normal
64-bit userspace has DS == ES == FS == GS == 0.
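
(That claim is easy to double-check from userspace with a throwaway
snippet like the one below; it just reads the selectors back and is not
part of any patch:)

	#include <stdio.h>

	int main(void)
	{
		unsigned short ds, es, fs, gs;

		__asm__ volatile("mov %%ds, %0" : "=r"(ds));
		__asm__ volatile("mov %%es, %0" : "=r"(es));
		__asm__ volatile("mov %%fs, %0" : "=r"(fs));
		__asm__ volatile("mov %%gs, %0" : "=r"(gs));
		printf("ds=%#x es=%#x fs=%#x gs=%#x\n", ds, es, fs, gs);
		return 0;
	}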


I would also be okay with making the KVM code match the context switch
code, but this may be distinctly nontrivial.

2020-10-03 23:16:56

by Andy Lutomirski

Subject: Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

On Fri, Oct 2, 2020 at 5:15 PM Andy Lutomirski <[email protected]> wrote:
>
> On Fri, Oct 2, 2020 at 3:40 PM Sean Christopherson
> <[email protected]> wrote:
> >
> > On Thu, Oct 01, 2020 at 02:52:32PM -0700, Andy Lutomirski wrote:
> > > On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
> > > <[email protected]> wrote:
> > > >
> > > > vmx_prepare_switch_to_guest shouldn't use is_64bit_mm, which has a
> > > > very specific use in uprobes. Use the user_64bit_mode helper instead.
> > > > This reduces the usage of is_64bit_mm, which is awkward, since it relies
> > > > on the personality at load time, which is fine for uprobes, but doesn't
> > > > seem fine here.
> > > >
> > > > I tested this by running VMs with 64 and 32 bits payloads from 64/32
> > > > programs.
> > > >
> > > > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > > > ---
> > > > arch/x86/kvm/vmx/vmx.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 7b2a068f08c1..b5aafd9e5f5d 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > > > savesegment(es, host_state->es_sel);
> > > >
> > > > gs_base = cpu_kernelmode_gs_base(cpu);
> > > > - if (likely(is_64bit_mm(current->mm))) {
> > > > + if (likely(user_64bit_mode(current_pt_regs()))) {
> > > > current_save_fsgs();
> > > > fs_sel = current->thread.fsindex;
> > > > gs_sel = current->thread.gsindex;
> > >
> > > I disagree with this one. This whole code path is nonsense. Can you
> > > just remove the condition entirely and use the 64-bit path
> > > unconditionally?
> >
> > I finally came back to this one with fresh eyes. I've read through the code
> > a bajllion times and typed up half a dozen responses. I think, finally, I
> > understand what's broken.
> >
> > I believe your original assertion that the bug was misdiagnosed is correct
> > (can't link because LKML wasn't in the loop). I'm pretty sure your analysis
> > that KVM's handling of things works mostly by coincidence is also correct.
> >
> > The coincidence is that "real" VMMs all use arch_prctl(), and
> > do_arch_prctl_64() ensures thread.{fs,gs}base are accurate. save_base_legacy()
> > detects sel==0 and intentionally does nothing, knowing the the base is already
> > accurate.
> >
> > Userspaces that don't use arch_prctl(), in the bug report case a 32-bit compat
> > test, may or may not have accurate thread.{fs,gs}base values. This is
> > especially true if sel!=0 as save_base_legacy() explicitly zeros the base in
> > this case, as load_seg_legacy() will restore the seg on the backend.
> >
> > KVM on the other hand assumes thread.{fs,gs}base are always fresh. When that
> > didn't hold true for userspace that didn't use arch_prctl(), the fix of
> > detecting a !64-bit mm just so happened to work because all 64-bit VMMs use
> > arch_prctl().
> >
> > It's tempting to just open code this and use RD{FS,GS}BASE when possible,
> > i.e. avoid any guesswork. Maybe with a module param that userspace can set
> > to tell KVM it doesn't do anything fancy with FS/GS base (will userspace still
> > use arch_prctl() even if FSGSABSE is available?).
> >
> > savesegment(fs, fs_sel);
> > savesegment(gs, gs_sel);
> > if (use_current_fsgs_base) {
> > fs_base = current->thread.fsbase;
> > vmx->msr_host_kernel_gs_base = current->thread.gsbase;
>
> I don't like this. The FSGSBASE case is fast, and the !FSGSBASE case
> is mostly obsolete. I see no great reason to have a module parameter
> asking for incorrect behavior. There have been too many bugs in this
> area -- let's not add more please.
>
> > } else if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> > fs_base = rdfsbase()
> > vmx->msr_host_kernel_gs_base = __rdgsbase_inactive();
> > } else {
> > fs_base = read_msr(MSR_FS_BASE);
> > vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> > }
>
> I'm okay with this, but I think you should fix the corresponding bugs
> on the restore side as well. The current code is:
>
> if (host_state->ldt_sel || (host_state->gs_sel & 7)) {
> kvm_load_ldt(host_state->ldt_sel);
> #ifdef CONFIG_X86_64
> load_gs_index(host_state->gs_sel);
> #else
> loadsegment(gs, host_state->gs_sel);
> #endif
> }
> if (host_state->fs_sel & 7)
> loadsegment(fs, host_state->fs_sel);
>
> which is blatantly wrong in the case where fs_set.TI == 1, gs_set.TI
> == 0, and ldt_sel != 0. But it's also more subtly wrong -- this
> corrupts all the segment attributes in the case where a segment points
> to the GDT and the GDT attributes are non-default. So I would suggest
> making the code radically simpler and more correct:
>
> if (host_state->idt_sel)
> kvm_load_idt(host_state->idt_sel); // see below
> if (host_state->ds_sel)
> loadsegment(ds, host_state->ds_sel);
> if (host_state->es_sel)
> loadsegment(es, host_state->es_sel);
> if (host_state->fs_sel) {
> loadsegment(fs, host_state->fs_sel);
> x86_fsbase_write_cpu(host_state->fs_base);
> }
> if (host_state->gs_sel) {
> load_gs_index(host_state->gs_sel);
> x86_gsbase_write_cpu_inactive(host_state->msr_host_kernel_gs_base);
> }
>
> In the IMO unlikely event that any performance-critical KVM userspace
> runs with these selectors != 0, you could also skip the load if they
> are set to __USER_DS.
>
> You can also simplify this crud:
>
> if (unlikely(fs_sel != host->fs_sel)) {
> if (!(fs_sel & 7))
> vmcs_write16(HOST_FS_SELECTOR, fs_sel);
> else
> vmcs_write16(HOST_FS_SELECTOR, 0);
> host->fs_sel = fs_sel;
> }
>
> There is nothing special about segments with TI set according to the
> SDM (AFAICT) nor is anything particularly special about them in
> Linux's segment tables, so this code makes little sense. It could
> just be:
>
> host->fs_sel = fs_sel;
>
> and leave the VMCS field set to 0. Or if you do the __USER_DS
> optimization above, you could do:
>
> if (unlikely(fs_sel != host->fs_sel)) {
> vmcs_write16(HOST_FS_SELECTOR, fs_sel == __USER_DS ? __USER_DS : 0);
> host->fs_sel = fs_sel;
> }
>
> I suspect that the only users who care about performance (or for whom
> we should care about performance) are normal userspace, and normal
> 64-bit userspace has DS == ES == FS == GS == 0.
>
>
> I would also be okay with making the KVM code match the context switch
> code, but this may be distinctly nontrivial.

If you're okay waiting for a couple days, I'll just do this. I have
it 2/3-done already, except I'm running into the utter catastrophe
that is 32-bit stackprotector, so I'm going to fix that first. (Or
delete it if I get too sick of it.)

2020-10-05 20:34:09

by Andy Lutomirski

Subject: Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

On Sat, Oct 3, 2020 at 4:04 PM Andy Lutomirski <[email protected]> wrote:
>
> On Fri, Oct 2, 2020 at 5:15 PM Andy Lutomirski <[email protected]> wrote:
> >
> > On Fri, Oct 2, 2020 at 3:40 PM Sean Christopherson
> > <[email protected]> wrote:
> > >
> > > On Thu, Oct 01, 2020 at 02:52:32PM -0700, Andy Lutomirski wrote:
> > > > On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
> > > > <[email protected]> wrote:
> > > > >
> > > > > vmx_prepare_switch_to_guest shouldn't use is_64bit_mm, which has a
> > > > > very specific use in uprobes. Use the user_64bit_mode helper instead.
> > > > > This reduces the usage of is_64bit_mm, which is awkward, since it relies
> > > > > on the personality at load time, which is fine for uprobes, but doesn't
> > > > > seem fine here.
> > > > >
> > > > > I tested this by running VMs with 64 and 32 bits payloads from 64/32
> > > > > programs.
> > > > >
> > > > > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > > > > ---
> > > > > arch/x86/kvm/vmx/vmx.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > > index 7b2a068f08c1..b5aafd9e5f5d 100644
> > > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > > @@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > > > > savesegment(es, host_state->es_sel);
> > > > >
> > > > > gs_base = cpu_kernelmode_gs_base(cpu);
> > > > > - if (likely(is_64bit_mm(current->mm))) {
> > > > > + if (likely(user_64bit_mode(current_pt_regs()))) {
> > > > > current_save_fsgs();
> > > > > fs_sel = current->thread.fsindex;
> > > > > gs_sel = current->thread.gsindex;
> > > >
> > > > I disagree with this one. This whole code path is nonsense. Can you
> > > > just remove the condition entirely and use the 64-bit path
> > > > unconditionally?
> > >
> > > I finally came back to this one with fresh eyes. I've read through the code
> > > a bajllion times and typed up half a dozen responses. I think, finally, I
> > > understand what's broken.
> > >
> > > I believe your original assertion that the bug was misdiagnosed is correct
> > > (can't link because LKML wasn't in the loop). I'm pretty sure your analysis
> > > that KVM's handling of things works mostly by coincidence is also correct.
> > >
> > > The coincidence is that "real" VMMs all use arch_prctl(), and
> > > do_arch_prctl_64() ensures thread.{fs,gs}base are accurate. save_base_legacy()
> > > detects sel==0 and intentionally does nothing, knowing the the base is already
> > > accurate.
> > >
> > > Userspaces that don't use arch_prctl(), in the bug report case a 32-bit compat
> > > test, may or may not have accurate thread.{fs,gs}base values. This is
> > > especially true if sel!=0 as save_base_legacy() explicitly zeros the base in
> > > this case, as load_seg_legacy() will restore the seg on the backend.
> > >
> > > KVM on the other hand assumes thread.{fs,gs}base are always fresh. When that
> > > didn't hold true for userspace that didn't use arch_prctl(), the fix of
> > > detecting a !64-bit mm just so happened to work because all 64-bit VMMs use
> > > arch_prctl().
> > >
> > > It's tempting to just open code this and use RD{FS,GS}BASE when possible,
> > > i.e. avoid any guesswork. Maybe with a module param that userspace can set
> > > to tell KVM it doesn't do anything fancy with FS/GS base (will userspace still
> > > use arch_prctl() even if FSGSABSE is available?).
> > >
> > > savesegment(fs, fs_sel);
> > > savesegment(gs, gs_sel);
> > > if (use_current_fsgs_base) {
> > > fs_base = current->thread.fsbase;
> > > vmx->msr_host_kernel_gs_base = current->thread.gsbase;
> >
> > I don't like this. The FSGSBASE case is fast, and the !FSGSBASE case
> > is mostly obsolete. I see no great reason to have a module parameter
> > asking for incorrect behavior. There have been too many bugs in this
> > area -- let's not add more please.
> >
> > > } else if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> > > fs_base = rdfsbase()
> > > vmx->msr_host_kernel_gs_base = __rdgsbase_inactive();
> > > } else {
> > > fs_base = read_msr(MSR_FS_BASE);
> > > vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> > > }
> >
> > I'm okay with this, but I think you should fix the corresponding bugs
> > on the restore side as well. The current code is:
> >
> > if (host_state->ldt_sel || (host_state->gs_sel & 7)) {
> > kvm_load_ldt(host_state->ldt_sel);
> > #ifdef CONFIG_X86_64
> > load_gs_index(host_state->gs_sel);
> > #else
> > loadsegment(gs, host_state->gs_sel);
> > #endif
> > }
> > if (host_state->fs_sel & 7)
> > loadsegment(fs, host_state->fs_sel);
> >
> > which is blatantly wrong in the case where fs_set.TI == 1, gs_set.TI
> > == 0, and ldt_sel != 0. But it's also more subtly wrong -- this
> > corrupts all the segment attributes in the case where a segment points
> > to the GDT and the GDT attributes are non-default. So I would suggest
> > making the code radically simpler and more correct:
> >
> > if (host_state->idt_sel)
> > kvm_load_idt(host_state->idt_sel); // see below
> > if (host_state->ds_sel)
> > loadsegment(ds, host_state->ds_sel);
> > if (host_state->es_sel)
> > loadsegment(es, host_state->es_sel);
> > if (host_state->fs_sel) {
> > loadsegment(fs, host_state->fs_sel);
> > x86_fsbase_write_cpu(host_state->fs_base);
> > }
> > if (host_state->gs_sel) {
> > load_gs_index(host_state->gs_sel);
> > x86_gsbase_write_cpu_inactive(host_state->msr_host_kernel_gs_base);
> > }
> >
> > In the IMO unlikely event that any performance-critical KVM userspace
> > runs with these selectors != 0, you could also skip the load if they
> > are set to __USER_DS.
> >
> > You can also simplify this crud:
> >
> > if (unlikely(fs_sel != host->fs_sel)) {
> > if (!(fs_sel & 7))
> > vmcs_write16(HOST_FS_SELECTOR, fs_sel);
> > else
> > vmcs_write16(HOST_FS_SELECTOR, 0);
> > host->fs_sel = fs_sel;
> > }
> >
> > There is nothing special about segments with TI set according to the
> > SDM (AFAICT) nor is anything particularly special about them in
> > Linux's segment tables, so this code makes little sense. It could
> > just be:
> >
> > host->fs_sel = fs_sel;
> >
> > and leave the VMCS field set to 0. Or if you do the __USER_DS
> > optimization above, you could do:
> >
> > if (unlikely(fs_sel != host->fs_sel)) {
> > vmcs_write16(HOST_FS_SELECTOR, fs_sel == __USER_DS ? __USER_DS : 0);
> > host->fs_sel = fs_sel;
> > }
> >
> > I suspect that the only users who care about performance (or for whom
> > we should care about performance) are normal userspace, and normal
> > 64-bit userspace has DS == ES == FS == GS == 0.
> >
> >
> > I would also be okay with making the KVM code match the context switch
> > code, but this may be distinctly nontrivial.
>
> If you're okay waiting for a couple days, I'll just do this. I have
> it 2/3-done already, except I'm running into the utter catastrophe
> that is 32-bit stackprotector, so I'm going to fix that first. (Or
> delete it if I get toosick of it.)

Rough draft. This needs the rest of that branch applied to have any
chance of being correct.

I was contemplating whether we could get a meaningful speedup if we
declared that entering a VM would wipe segments, but I don't think
there's actually much speedup available.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/unify_stack_canary&id=f5a2dd07cd222bb674bd947f5d9c698ab50ccb88

2020-10-05 22:27:04

by Sean Christopherson

Subject: Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

On Sat, Oct 03, 2020 at 04:04:22PM -0700, Andy Lutomirski wrote:
> On Fri, Oct 2, 2020 at 5:15 PM Andy Lutomirski <[email protected]> wrote:
> > But it's also more subtly wrong -- this corrupts all the segment attributes
> > in the case where a segment points to the GDT and the GDT attributes are
> > non-default.

Part of me wants to ask if it's even possible to get into such a scenario,
but a much larger part of me doesn't want to think about segmentation any
more :-)

> > I would also be okay with making the KVM code match the context switch
> > code, but this may be distinctly nontrivial.

Ya.

> If you're okay waiting for a couple days, I'll just do this. I have
> it 2/3-done already, except I'm running into the utter catastrophe
> that is 32-bit stackprotector, so I'm going to fix that first. (Or
> delete it if I get toosick of it.)

By all means. I dragged my feet for several months, I can certainly do
nothing for a few more days.