2018-02-16 09:05:20

by Nadav Amit

Subject: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

Compatibility mode should be safe from Meltdown, since the pointers are only
32 bits long. This allows us to disable PTI selectively while x86-32
processes are running and to enable global pages throughout this time. These
patches may also serve as a basis for later disabling PTI selectively for
"trusted" processes.

The patches are marked as an RFC since they (specifically the last one) do not
coexist with Dave Hansen's enabling of global pages, and might have conflicts
with Joerg's work on 32-bit (although in this case, it should be easily
resolvable). They are also based on Linux 4.15.

I removed the PTI disabling while SMEP is unsupported, although I must admit I
did not fully understand why it is required. IIUC, Intel's indirect branch
prediction only predicts the low 32 bits of the target, which would still not
allow an attacker to manipulate the kernel into jumping to userspace code
through Spectre v2.

RFC v1 -> RFC v2:
- Handling the use of CS64 in compatibility mode (Andy)
- Holding the PTI disable indication per mm and not task (Andy)
- No PTI disabling if SMEP is unsupported (Dave, Ingo)
- Self-test and cleanup
- Enabling global pages while running in compatibility mode

Nadav Amit (6):
x86: Skip PTI when disable indication is set
x86: Save pti_disable for each mm_context
x86: Switching page-table isolation
x86: Disable PTI on compatibility mode
x86: Use global pages when PTI is disabled
selftest: x86: test using CS64 on compatibility-mode

arch/x86/entry/calling.h | 33 ++++++++
arch/x86/include/asm/mmu.h | 3 +
arch/x86/include/asm/pti.h | 70 +++++++++++++++++
arch/x86/include/asm/tlbflush.h | 35 ++++++++-
arch/x86/kernel/asm-offsets.c | 1 +
arch/x86/kernel/process_64.c | 13 +++-
arch/x86/kernel/traps.c | 23 +++++-
arch/x86/mm/init.c | 14 ++--
arch/x86/mm/pgtable.c | 4 +-
arch/x86/mm/pti.c | 139 +++++++++++++++++++++++++++++++++-
arch/x86/mm/tlb.c | 28 ++++++-
tools/testing/selftests/x86/ldt_gdt.c | 41 ++++++++++
12 files changed, 386 insertions(+), 18 deletions(-)

--
2.14.1



2018-02-15 16:38:31

by Nadav Amit

Subject: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

While PTI is disabled, it is possible to use global pages, provided we
remove them once PTI is enabled again. To do so, return the global bit to
__supported_pte_mask and enable or disable global pages via CR4.PGE.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 6 ++++++
arch/x86/mm/init.c | 14 ++++++--------
arch/x86/mm/tlb.c | 3 ++-
3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index ea65cf951c49..3a44cb0a9f56 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -319,6 +319,12 @@ static inline void set_cpu_pti_disable(unsigned short disable)
WARN_ON_ONCE(preemptible());

pti_update_user_cs64(cpu_pti_disable(), disable);
+ if (__supported_pte_mask & _PAGE_GLOBAL) {
+ if (disable)
+ cr4_set_bits(X86_CR4_PGE);
+ else
+ cr4_clear_bits(X86_CR4_PGE);
+ }
this_cpu_write(cpu_tlbstate.pti_disable, disable);
}

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 82f5252c723a..7f918a59c536 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,12 +161,6 @@ struct map_range {

static int page_size_mask;

-static void enable_global_pages(void)
-{
- if (!static_cpu_has(X86_FEATURE_PTI))
- __supported_pte_mask |= _PAGE_GLOBAL;
-}
-
static void __init probe_page_size_mask(void)
{
/*
@@ -186,8 +180,10 @@ static void __init probe_page_size_mask(void)
/* Enable PGE if available */
__supported_pte_mask &= ~_PAGE_GLOBAL;
if (boot_cpu_has(X86_FEATURE_PGE)) {
- cr4_set_bits_and_update_boot(X86_CR4_PGE);
- enable_global_pages();
+ __supported_pte_mask |= _PAGE_GLOBAL;
+
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ cr4_set_bits_and_update_boot(X86_CR4_PGE);
}

/* Enable 1 GB linear kernel mappings if available: */
@@ -683,6 +679,8 @@ void __init init_mem_mapping(void)
#else
early_ioremap_page_table_range_init();
#endif
+ if (static_cpu_has(X86_FEATURE_PTI))
+ cr4_clear_bits(X86_CR4_PGE);

load_cr3(swapper_pg_dir);
__flush_tlb_all();
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c67ef3fb4f35..979c7ec6baab 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -74,7 +74,8 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
return;
}

- if (this_cpu_read(cpu_tlbstate.invalidate_other))
+ if (this_cpu_read(cpu_tlbstate.invalidate_other) &&
+ !mm_pti_disable(next))
clear_asid_other();

for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
--
2.14.1


2018-02-15 16:38:35

by Nadav Amit

Subject: [PATCH RFC v2 3/6] x86: Switching page-table isolation

On context switch, switch page-table isolation on or off according to the
new task's mm. Accordingly, restore or remove CS64.

The different types of disabling are kept as a bitmap in order to
quickly check whether a certain type of disabling has changed, although
it is assumed only a single type is set at a given time. The code
prepares the ground for future disabling of PTI by other means
(e.g., prctl). To that end, a greater "disabling" value means stronger
disabling and overrides lower ones.
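
For illustration only (this helper is hypothetical and not part of the
series; the actual check lives in __pti_disable() in a later patch), the
priority rule boils down to:

	/* A numerically greater disable type is stronger and wins. */
	static void pti_set_mm_disable(struct mm_struct *mm, unsigned short type)
	{
		if (mm->context.pti_disable >= type)
			return;	/* already disabled at least as strongly */
		mm->context.pti_disable = type;
	}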

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pti.h | 22 ++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
arch/x86/mm/tlb.c | 25 +++++++++++++++++++++++++
3 files changed, 59 insertions(+)

diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 96a5fbfedf7a..78a333699874 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -3,6 +3,11 @@
#define _ASM_X86_PTI_H
#ifndef __ASSEMBLY__

+#include <asm/desc.h>
+
+#define PTI_DISABLE_OFF (0)
+#define PTI_DISABLE_IA32 (1 << 0)
+
#ifdef CONFIG_PAGE_TABLE_ISOLATION
static inline unsigned short mm_pti_disable(struct mm_struct *mm)
{
@@ -12,10 +17,27 @@ static inline unsigned short mm_pti_disable(struct mm_struct *mm)
return mm->context.pti_disable;
}

+static inline void pti_update_user_cs64(unsigned short prev_pti_disable,
+ unsigned short next_pti_disable)
+{
+ struct desc_struct user_cs, *d;
+
+ if ((prev_pti_disable ^ next_pti_disable) & PTI_DISABLE_IA32)
+ return;
+
+ d = get_cpu_gdt_rw(smp_processor_id());
+ user_cs = d[GDT_ENTRY_DEFAULT_USER_CS];
+ user_cs.p = !(next_pti_disable & PTI_DISABLE_IA32);
+ write_gdt_entry(d, GDT_ENTRY_DEFAULT_USER_CS, &user_cs, DESCTYPE_S);
+}
+
extern void pti_init(void);
extern void pti_check_boottime_disable(void);
#else
static inline unsigned short mm_pti_disable(struct mm_struct *mm) { return 0; }
+static inline unsigned short mm_pti_disable(struct mm_struct *mm);
+static inline void pti_update_user_cs64(unsigned short prev_pti_disable,
+ unsigned short next_pti_disable) { }
static inline void pti_check_boottime_disable(void) { }
#endif

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cf91a484bb41..ea65cf951c49 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -310,6 +310,18 @@ static inline unsigned short cpu_pti_disable(void)
return this_cpu_read(cpu_tlbstate.pti_disable);
}

+static inline void set_cpu_pti_disable(unsigned short disable)
+{
+ /*
+ * Enabling/disabling CS64 and updating the state must be done
+ * atomically
+ */
+ WARN_ON_ONCE(preemptible());
+
+ pti_update_user_cs64(cpu_pti_disable(), disable);
+ this_cpu_write(cpu_tlbstate.pti_disable, disable);
+}
+
/*
* Save some of cr4 feature set we're using (e.g. Pentium 4MB
* enable and PPro Global page enable), so that any CPU's that boot
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5bfe61a5e8e3..c67ef3fb4f35 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -178,6 +178,28 @@ static void sync_current_stack_to_mm(struct mm_struct *mm)
}
}
}
+static void switch_pti_disable(struct mm_struct *mm)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ unsigned short prev_pti_disable, next_pti_disable;
+
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ prev_pti_disable = cpu_pti_disable();
+
+ /*
+ * Avoid concurrent changes to mm_pti_disable(), since we need to
+ * ensure both CS64 and the CPU indication are identical
+ */
+ next_pti_disable = READ_ONCE(mm->context.pti_disable);
+
+ if (prev_pti_disable == next_pti_disable)
+ return;
+
+ set_cpu_pti_disable(next_pti_disable);
+#endif
+}

void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
@@ -292,6 +314,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
}

+ /* Disable/reenable page-table isolation as needed */
+ switch_pti_disable(next);
+
this_cpu_write(cpu_tlbstate.loaded_mm, next);
this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
}
--
2.14.1


2018-02-15 16:39:35

by Nadav Amit

Subject: [PATCH RFC v2 6/6] selftest: x86: test using CS64 on compatibility-mode

Since we mask the 64-bit code segment in compatibility mode, and since
applications, most notably CRIU, might still use it, add a test to ensure
that switching to it does not break.

Signed-off-by: Nadav Amit <[email protected]>
---
tools/testing/selftests/x86/ldt_gdt.c | 41 +++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/x86/ldt_gdt.c b/tools/testing/selftests/x86/ldt_gdt.c
index 1aef72df20a1..40b442e5c514 100644
--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -50,6 +50,11 @@
# define INT80_CLOBBERS "r8", "r9", "r10", "r11"
#else
# define INT80_CLOBBERS
+
+/* copied from asm/segment.h */
+#define GDT_ENTRY_DEFAULT_USER_CS 14
+#define GDT_ENTRY_DEFAULT_USER_DS 15
+
#endif

static int nerrs;
@@ -907,6 +912,40 @@ static void test_gdt_invalidation(void)
#endif
}

+struct far_jmp_addr {
+ unsigned long addr;
+ unsigned short seg;
+} __packed;
+
+static void switch_to_cs64(void)
+{
+#ifdef __i386__
+ struct far_jmp_addr far_jmp_compat, far_jmp_64;
+ unsigned short ds;
+
+ /* Poor man's detection of compatibility mode */
+ asm volatile ("mov %%ds, %0" : [ds]"=r"(ds));
+ if (ds >> 3 != 5)
+ return;
+
+ far_jmp_64.seg = (6 << 3) | 3; /* __USER_CS */
+ far_jmp_compat.seg = (4 << 3) | 3; /* __USER32_CS */
+
+ asm volatile ("movl $.cs64_target, (%[target64])\n\t"
+ "movl $.cs32_target, (%[target32])\n\t"
+ "ljmp *%[jmp_addr_64]\n\t"
+ ".cs64_target:\n\t"
+ "ljmp *%[jmp_addr_32]\n\t"
+ ".cs32_target:\n\t" : :
+ [jmp_addr_64]"m"(far_jmp_64),
+ [jmp_addr_32]"m"(far_jmp_compat),
+ [target64]"r"(&far_jmp_64.addr),
+ [target32]"r"(&far_jmp_compat.addr) : "memory");
+
+ printf("[OK]\tSwitching to CS64 and back\n");
+#endif
+}
+
int main(int argc, char **argv)
{
if (argc == 1 && !strcmp(argv[0], "ldt_gdt_test_exec"))
@@ -923,5 +962,7 @@ int main(int argc, char **argv)

test_gdt_invalidation();

+ switch_to_cs64();
+
return nerrs ? 1 : 0;
}
--
2.14.1


2018-02-15 17:49:28

by Nadav Amit

Subject: Re: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

Dave Hansen <[email protected]> wrote:

>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index c67ef3fb4f35..979c7ec6baab 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -74,7 +74,8 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>> return;
>> }
>>
>> - if (this_cpu_read(cpu_tlbstate.invalidate_other))
>> + if (this_cpu_read(cpu_tlbstate.invalidate_other) &&
>> + !mm_pti_disable(next))
>> clear_asid_other();
>
> This isn't obviously correct. Don't we still need to invalidate other
> user asids?

I forgot to address this question: when you reenable PTI (after switching back
to a 64-bit process), you flush the global pages, so no kernel mappings for the
32-bit process are left.

As for kernel mappings of 64-bit processes, you will flush them later, when
you switch back to a 64-bit process (the indication is left set).


2018-02-15 18:10:23

by Dave Hansen

Subject: Re: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

On 02/15/2018 09:47 AM, Nadav Amit wrote:
> Dave Hansen <[email protected]> wrote:
>>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>>> index c67ef3fb4f35..979c7ec6baab 100644
>>> --- a/arch/x86/mm/tlb.c
>>> +++ b/arch/x86/mm/tlb.c
>>> @@ -74,7 +74,8 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>>> return;
>>> }
>>>
>>> - if (this_cpu_read(cpu_tlbstate.invalidate_other))
>>> + if (this_cpu_read(cpu_tlbstate.invalidate_other) &&
>>> + !mm_pti_disable(next))
>>> clear_asid_other();
>>
>> This isn't obviously correct. Don't we still need to invalidate other
>> user asids?
>
> I forgot to regard this question: When you reenable PTI (after switching back
> to 64-bit process), you flush the global pages, so no kernel mappings for the
> 32-bit process are left.

Can you please write up a proper description for this? It's horribly
complicated, intertwined with global pages, and sets up a dependency
that *ALL* TLB entries invalidated via __flush_tlb_one_kernel() must be
_PAGE_GLOBAL.

How about you actually clear cpu_tlbstate.invalidate_other when you do
the CR4.PGE switching? That seems a much more direct way and is much
more self-documenting.

That brings up another point: these patches rather ignore cpu_tlbstate.
That leads to confusing code (this) and the double-flushing on context
switch I brought up earlier. Was this intentional, or is it something
you can reconsider going forward?
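
For reference, a minimal sketch of how that suggestion might fold into the
set_cpu_pti_disable() added in patches 3/6 and 5/6 (illustrative only, not
part of the posted series):

	static inline void set_cpu_pti_disable(unsigned short disable)
	{
		WARN_ON_ONCE(preemptible());

		pti_update_user_cs64(cpu_pti_disable(), disable);
		if (__supported_pte_mask & _PAGE_GLOBAL) {
			if (disable)
				cr4_set_bits(X86_CR4_PGE);
			else
				cr4_clear_bits(X86_CR4_PGE);
			/*
			 * A write that toggles CR4.PGE flushes the entire
			 * TLB, including global entries and all PCIDs, so
			 * the deferred invalidation of other ASIDs can be
			 * dropped here.
			 */
			this_cpu_write(cpu_tlbstate.invalidate_other, false);
		}
		this_cpu_write(cpu_tlbstate.pti_disable, disable);
	}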

2018-02-15 18:12:19

by Dave Hansen

Subject: Re: [PATCH RFC v2 1/6] x86: Skip PTI when disable indication is set

On 02/15/2018 08:35 AM, Nadav Amit wrote:
> +.Lno_spec_\@:
> + lfence
> + jmp .Lend_\@
> +
> .Lnoflush_\@:
> SET_NOFLUSH_BIT \save_reg
>

How expensive is this?

2018-02-16 08:39:12

by Nadav Amit

Subject: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

Based on the understanding that there should be no way for userspace to
address the kernel-space from compatibility mode, disable it while
running in compatibility mode as long as the 64-bit code segment of the
user is not used.

Reenabling PTI is performed by restoring the NX-bits of the userspace
mappings, flushing the TLBs, and notifying all the CPUs that use the
affected mm. Each core responds by restoring the present bit of the
64-bit code segment and marking that PTI is enabled on that core.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pti.h | 39 +++++++++++++
arch/x86/kernel/process_64.c | 13 ++++-
arch/x86/kernel/traps.c | 23 +++++++-
arch/x86/mm/pti.c | 130 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 201 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 78a333699874..d04954ebb0d4 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -31,6 +31,42 @@ static inline void pti_update_user_cs64(unsigned short prev_pti_disable,
write_gdt_entry(d, GDT_ENTRY_DEFAULT_USER_CS, &user_cs, DESCTYPE_S);
}

+void __pti_reenable(void);
+
+static inline void pti_reenable(void)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI) || !mm_pti_disable(current->mm))
+ return;
+
+ __pti_reenable();
+}
+
+void __pti_disable(unsigned short type);
+
+static inline void pti_disable(unsigned short type)
+{
+ /*
+ * To allow PTI to be disabled, we must:
+ *
+ * 1. Have PTI enabled.
+ * 2. Have SMEP enabled, since the lack of NX-bit on user mappings
+ * raises general security concerns.
+ * 3. Have NX-bit enabled, since reenabling PTI has a corner case in
+ * which the kernel tables are restored instead of those of the user.
+ * Having the NX-bit causes this scenario to trigger a spurious
+ * page-fault when control is returned to the user, and allows the
+ * entry code to restore the page-tables to their correct state.
+ */
+ if (!static_cpu_has(X86_FEATURE_PTI) ||
+ !static_cpu_has(X86_FEATURE_SMEP) ||
+ !static_cpu_has(X86_FEATURE_NX))
+ return;
+
+ __pti_disable(type);
+}
+
+bool pti_handle_segment_not_present(long error_code);
+
extern void pti_init(void);
extern void pti_check_boottime_disable(void);
#else
@@ -38,6 +74,9 @@ static inline unsigned short mm_pti_disable(struct mm_struct *mm) { return 0; }
static inline unsigned short mm_pti_disable(struct mm_struct *mm);
static inline void pti_update_user_cs64(unsigned short prev_pti_disable,
unsigned short next_pti_disable) { }
+static inline void pti_disable(unsigned short type) { }
+static inline void pti_reenable(void) { }
+static inline bool pti_handle_segment_not_present(long error_code) { return false; }
static inline void pti_check_boottime_disable(void) { }
#endif

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c75466232016..24d3429b4191 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -54,6 +54,7 @@
#include <asm/vdso.h>
#include <asm/intel_rdt_sched.h>
#include <asm/unistd.h>
+#include <asm/pti.h>
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */
#include <asm/unistd_32_ia32.h>
@@ -530,8 +531,10 @@ void set_personality_64bit(void)
task_pt_regs(current)->orig_ax = __NR_execve;

/* Ensure the corresponding mm is not marked. */
- if (current->mm)
+ if (current->mm) {
current->mm->context.ia32_compat = 0;
+ pti_reenable();
+ }

/* TBD: overwrites user setup. Should have two bits.
But 64bit processes have always behaved this way,
@@ -545,8 +548,10 @@ static void __set_personality_x32(void)
#ifdef CONFIG_X86_X32
clear_thread_flag(TIF_IA32);
set_thread_flag(TIF_X32);
- if (current->mm)
+ if (current->mm) {
current->mm->context.ia32_compat = TIF_X32;
+ pti_reenable();
+ }
current->personality &= ~READ_IMPLIES_EXEC;
/*
* in_compat_syscall() uses the presence of the x32 syscall bit
@@ -566,8 +571,10 @@ static void __set_personality_ia32(void)
#ifdef CONFIG_IA32_EMULATION
set_thread_flag(TIF_IA32);
clear_thread_flag(TIF_X32);
- if (current->mm)
+ if (current->mm) {
current->mm->context.ia32_compat = TIF_IA32;
+ pti_disable(PTI_DISABLE_IA32);
+ }
current->personality |= force_personality32;
/* Prepare the first "return" to user space */
task_pt_regs(current)->orig_ax = __NR_ia32_execve;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 446c9ef8cfc3..65d8ccb20175 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -61,6 +61,7 @@
#include <asm/mpx.h>
#include <asm/vm86.h>
#include <asm/umip.h>
+#include <asm/pti.h>

#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -315,7 +316,6 @@ DO_ERROR(X86_TRAP_OF, SIGSEGV, "overflow", overflow)
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun",coprocessor_segment_overrun)
DO_ERROR(X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS)
-DO_ERROR(X86_TRAP_NP, SIGBUS, "segment not present", segment_not_present)
DO_ERROR(X86_TRAP_SS, SIGBUS, "stack segment", stack_segment)
DO_ERROR(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check)

@@ -529,6 +529,27 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
}

+dotraplinkage void
+do_segment_not_present(struct pt_regs *regs, long error_code)
+{
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+ cond_local_irq_enable(regs);
+
+ /*
+ * 64-bit mode was disabled to prevent unnecessary page table isolation.
+ * Enable it, and from now on page-tables will be switched on kernel
+ * entry. Due to potential race conditions, we check the error code to
+ * see whether it references the __USER_CS, and ensure we only handle a
+ * single event per thread.
+ */
+ if (pti_handle_segment_not_present(error_code))
+ return;
+
+ do_trap(X86_TRAP_NP, SIGTRAP, "segment not present", regs, error_code,
+ NULL);
+}
+NOKPROBE_SYMBOL(do_segment_not_present);
+
dotraplinkage void
do_general_protection(struct pt_regs *regs, long error_code)
{
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index a973a291a34d..18d936c5aa31 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -148,6 +148,136 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
return pgd;
}

+static void pti_update_user_pgds(struct mm_struct *mm, bool pti_enable)
+{
+ int i;
+
+ if (!(__supported_pte_mask & _PAGE_NX))
+ return;
+
+ for (i = 0; i < PTRS_PER_PGD / 2; i++) {
+ pgd_t pgd, *pgdp = &mm->pgd[i];
+
+ pgd = *pgdp;
+
+ if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) !=
+ (_PAGE_USER|_PAGE_PRESENT))
+ continue;
+
+ if (pti_enable)
+ pgd.pgd |= _PAGE_NX;
+ else
+ pgd.pgd &= ~_PAGE_NX;
+
+ *pgdp = pgd;
+ }
+}
+
+static void pti_cpu_update_func(void *info)
+{
+ struct mm_struct *mm = (struct mm_struct *)info;
+
+ if (mm != this_cpu_read(cpu_tlbstate.loaded_mm))
+ return;
+
+ /*
+ * Keep CS64 and CPU settings in sync despite potential concurrent
+ * updates.
+ */
+ set_cpu_pti_disable(READ_ONCE(mm->context.pti_disable));
+}
+
+/*
+ * Reenable PTI after it was selectively disabled. Since the mm is in use, and
+ * the NX-bit of the PGD may be set while the user still uses the kernel PGD, it
+ * may lead to spurious page-faults. The page-fault handler should be able to
+ * handle them gracefully.
+ */
+void __pti_reenable(void)
+{
+ struct mm_struct *mm = current->mm;
+ int cpu;
+
+ if (!mm_pti_disable(mm))
+ return;
+
+ /*
+ * Prevent a spurious page-fault storm while we set the NX-bit and have
+ * not yet updated the per-CPU pti_disable flag.
+ */
+ down_write(&mm->mmap_sem);
+
+ if (!mm_pti_disable(mm))
+ goto out;
+
+ /*
+ * First, mark PTI as enabled. Although we have not done anything yet,
+ * we are safe as long as we do not reenable CS64. Since we did not
+ * update the page tables yet, this may lead to spurious page-faults,
+ * but we need pti_disable in the mm to be updated for
+ * __pti_set_user_pgd() to do the right thing. Holding mmap_sem
+ * prevents these spurious page-faults from swamping the system.
+ */
+ mm->context.pti_disable = PTI_DISABLE_OFF;
+
+ /* Second, restore the NX bits. */
+ pti_update_user_pgds(mm, true);
+
+ /*
+ * Third, flush the entire mm. By doing so we also force the processes
+ * to reload the correct page-table on return. This also provides a
+ * barrier before we restore USER_CS, ensuring we see the updated
+ * cpumask.
+ */
+ flush_tlb_mm(mm);
+
+ /*
+ * Finally, restore CS64 to its correct state and mark that PTI is
+ * reenabled.
+ */
+ cpu = get_cpu();
+ pti_cpu_update_func(mm);
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+ smp_call_function_many(mm_cpumask(mm), pti_cpu_update_func,
+ mm, 1);
+ put_cpu();
+
+out:
+ up_write(&mm->mmap_sem);
+}
+
+void __pti_disable(unsigned short type)
+{
+ struct mm_struct *mm = current->mm;
+
+ /*
+ * Give disabling options with higher value higher priority, as they are
+ * permanent and not transient. This also avoids re-disabling.
+ */
+ if (mm_pti_disable(mm) >= type)
+ return;
+
+ mm->context.pti_disable = type;
+
+ pti_update_user_pgds(mm, false);
+
+ preempt_disable();
+ set_cpu_pti_disable(type);
+ preempt_enable();
+}
+
+bool pti_handle_segment_not_present(long error_code)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return false;
+
+ if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
+ return false;
+
+ pti_reenable();
+ return true;
+}
+
/*
* Walk the user copy of the page tables (optionally) trying to allocate
* page table pages on the way down.
--
2.14.1


2018-02-16 09:05:06

by Nadav Amit

Subject: [PATCH RFC v2 1/6] x86: Skip PTI when disable indication is set

If PTI is disabled, we do not want to switch page-tables. On entry to
the kernel, this decision is made based on the CR3 value. On return, it
is made according to a per-CPU indication.

To be on the safe side, avoid speculatively skipping the page-table switch
when returning to userspace. The fence could be omitted if the CPU cannot
speculatively execute code without the proper permissions. When switching
to the kernel page-tables, this is anyhow not an issue: if PTI is enabled
and the page-tables were not switched, the kernel part of the user
page-tables would not be set.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/entry/calling.h | 33 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/tlbflush.h | 17 +++++++++++++++--
arch/x86/kernel/asm-offsets.c | 1 +
3 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3f48f695d5e6..5e9895f44d11 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -216,7 +216,14 @@ For 32-bit we have the following conventions - kernel is built with

.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+ /*
+ * Do not switch on compatibility mode.
+ */
mov %cr3, \scratch_reg
+ testq $PTI_USER_PGTABLE_MASK, \scratch_reg
+ jz .Lend_\@
+
ADJUST_KERNEL_CR3 \scratch_reg
mov \scratch_reg, %cr3
.Lend_\@:
@@ -225,8 +232,20 @@ For 32-bit we have the following conventions - kernel is built with
#define THIS_CPU_user_pcid_flush_mask \
PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask

+#define THIS_CPU_pti_disable \
+ PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_pti_disable
+
.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+ /*
+ * Do not switch on compatibility mode. If there is no need for a
+ * flush, run lfence to avoid speculative execution returning to user
+ * with the wrong CR3.
+ */
+ cmpw $(0), THIS_CPU_pti_disable
+ jnz .Lno_spec_\@
+
mov %cr3, \scratch_reg

ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
@@ -244,6 +263,10 @@ For 32-bit we have the following conventions - kernel is built with
movq \scratch_reg2, \scratch_reg
jmp .Lwrcr3_pcid_\@

+.Lno_spec_\@:
+ lfence
+ jmp .Lend_\@
+
.Lnoflush_\@:
movq \scratch_reg2, \scratch_reg
SET_NOFLUSH_BIT \scratch_reg
@@ -288,6 +311,12 @@ For 32-bit we have the following conventions - kernel is built with

ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID

+ /*
+ * Do not restore if PTI is disabled.
+ */
+ cmpw $(0), THIS_CPU_pti_disable
+ jnz .Lno_spec_\@
+
/*
* KERNEL pages can always resume with NOFLUSH as we do
* explicit flushes.
@@ -307,6 +336,10 @@ For 32-bit we have the following conventions - kernel is built with
btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
jmp .Lwrcr3_\@

+.Lno_spec_\@:
+ lfence
+ jmp .Lend_\@
+
.Lnoflush_\@:
SET_NOFLUSH_BIT \save_reg

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d33e4a26dc7e..cf91a484bb41 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -216,6 +216,12 @@ struct tlb_state {
*/
unsigned long cr4;

+ /*
+ * Cached value of mm->context.pti_disable to simplify and speed up
+ * kernel entry code.
+ */
+ unsigned short pti_disable;
+
/*
* This is a list of all contexts that might exist in the TLB.
* There is one per ASID that we use, and the ASID (what the
@@ -298,6 +304,12 @@ static inline void invalidate_other_asid(void)
this_cpu_write(cpu_tlbstate.invalidate_other, true);
}

+/* Return whether page-table isolation is disabled on this CPU */
+static inline unsigned short cpu_pti_disable(void)
+{
+ return this_cpu_read(cpu_tlbstate.pti_disable);
+}
+
/*
* Save some of cr4 feature set we're using (e.g. Pentium 4MB
* enable and PPro Global page enable), so that any CPU's that boot
@@ -355,7 +367,8 @@ static inline void __native_flush_tlb(void)
*/
WARN_ON_ONCE(preemptible());

- invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ if (!cpu_pti_disable())
+ invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));

/* If current->mm == NULL then the read_cr3() "borrows" an mm */
native_write_cr3(__native_read_cr3());
@@ -404,7 +417,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)

asm volatile("invlpg (%0)" ::"r" (addr) : "memory");

- if (!static_cpu_has(X86_FEATURE_PTI))
+ if (!static_cpu_has(X86_FEATURE_PTI) || cpu_pti_disable())
return;

/*
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 76417a9aab73..435bb5cdfd66 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -97,6 +97,7 @@ void common(void) {

/* TLB state for the entry code */
OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+ OFFSET(TLB_STATE_pti_disable, tlb_state, pti_disable);

/* Layout info for cpu_entry_area */
OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
--
2.14.1


2018-02-16 09:05:15

by Nadav Amit

Subject: [PATCH RFC v2 2/6] x86: Save pti_disable for each mm_context

Save for each mm->context whether page-table isolation should be
disabled. In order not to make intrusive changes, the flag is reached
through the PGD page metadata: the PGD's struct page records its owning
mm, so the per-mm value can be looked up from a pgd pointer alone.
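
For illustration (this helper is hypothetical and not part of the patch),
the lookup this enables goes through the PGD's struct page:

	static unsigned short pgd_pti_disable(pgd_t *pgdp)
	{
		/* pgd_set_mm() recorded the owning mm in the PGD page */
		return mm_pti_disable(pgd_page_get_mm(virt_to_page(pgdp)));
	}

which is essentially what __pti_set_user_pgd() does in the diff below.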

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/mmu.h | 3 +++
arch/x86/include/asm/pti.h | 9 +++++++++
arch/x86/mm/pgtable.c | 4 +++-
arch/x86/mm/pti.c | 9 +++++++--
4 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5ff3e8af2c20..fb8db4a09d8a 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -36,6 +36,9 @@ typedef struct {
/* True if mm supports a task running in 32 bit compatibility mode. */
unsigned short ia32_compat;
#endif
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ unsigned short pti_disable;
+#endif

struct mutex lock;
void __user *vdso; /* vdso base address */
diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
index 0b5ef05b2d2d..96a5fbfedf7a 100644
--- a/arch/x86/include/asm/pti.h
+++ b/arch/x86/include/asm/pti.h
@@ -4,9 +4,18 @@
#ifndef __ASSEMBLY__

#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static inline unsigned short mm_pti_disable(struct mm_struct *mm)
+{
+ if (mm == NULL)
+ return 0;
+
+ return mm->context.pti_disable;
+}
+
extern void pti_init(void);
extern void pti_check_boottime_disable(void);
#else
+static inline unsigned short mm_pti_disable(struct mm_struct *mm) { return 0; }
static inline void pti_check_boottime_disable(void) { }
#endif

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 004abf9ebf12..ef0394a97adb 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -139,7 +139,9 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
if (!SHARED_KERNEL_PMD) {
pgd_set_mm(pgd, mm);
pgd_list_add(pgd);
- }
+ } else if (IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION) &&
+ static_cpu_has(X86_FEATURE_PTI))
+ pgd_set_mm(pgd, mm);
}

static void pgd_dtor(pgd_t *pgd)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index ce38f165489b..a973a291a34d 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -134,10 +134,15 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
* may execute from it
* - we don't have NX support
* - we're clearing the PGD (i.e. the new pgd is not present).
+ * - PTI is disabled.
*/
if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
- (__supported_pte_mask & _PAGE_NX))
- pgd.pgd |= _PAGE_NX;
+ (__supported_pte_mask & _PAGE_NX)) {
+ struct mm_struct *mm = pgd_page_get_mm(virt_to_page(pgdp));
+
+ if (!mm_pti_disable(mm))
+ pgd.pgd |= _PAGE_NX;
+ }

/* return the copy of the PGD we want the kernel to use: */
return pgd;
--
2.14.1


2018-02-16 10:08:53

by Dave Hansen

Subject: Re: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

On 02/15/2018 08:36 AM, Nadav Amit wrote:
> As long as PTI is disabled, it is possible to use global pages, as long
> as we remove them once PTI is enabled again. To do so, return the global
> bit to __supported_pte_mask and disable global pages using CR4.
>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> arch/x86/include/asm/tlbflush.h | 6 ++++++
> arch/x86/mm/init.c | 14 ++++++--------
> arch/x86/mm/tlb.c | 3 ++-
> 3 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index ea65cf951c49..3a44cb0a9f56 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -319,6 +319,12 @@ static inline void set_cpu_pti_disable(unsigned short disable)
> WARN_ON_ONCE(preemptible());
>
> pti_update_user_cs64(cpu_pti_disable(), disable);
> + if (__supported_pte_mask & _PAGE_GLOBAL) {
> + if (disable)
> + cr4_set_bits(X86_CR4_PGE);
> + else
> + cr4_clear_bits(X86_CR4_PGE);
> + }
> this_cpu_write(cpu_tlbstate.pti_disable, disable);
> }

The TLB invalidations when doing this switch are *CRITICAL*. Otherwise,
we end up globally-mapped kernel entries persisting to other processes
that are then vulnerable to Meltdown.

So, where are the TLB flushes?

They're hidden in the cr4_set/clear_bits() functions, of course. This is
dangerous for two reasons: it makes them non-obvious and hard to find, and
it has no interaction with the existing TLB invalidation infrastructure.
That's _safe_, of course, because extra flushing is OK, but it feels really
funky because you're going to end up double-flushing on context switches,
which is rather unfortunate.

This also needs some heavy commenting about the fact that _PAGE_GLOBAL
is ignored when CR4.PGE=0. That's key to this working and not mentioned
anywhere.

While this looks OK to me, it still makes me rather nervous. The
changelog and commenting definitely need a lot of work. I'm also still
rather unconvinced that the added complexity here is worth it.

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index c67ef3fb4f35..979c7ec6baab 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -74,7 +74,8 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> return;
> }
>
> - if (this_cpu_read(cpu_tlbstate.invalidate_other))
> + if (this_cpu_read(cpu_tlbstate.invalidate_other) &&
> + !mm_pti_disable(next))
> clear_asid_other();

This isn't obviously correct. Don't we still need to invalidate other
user asids?

2018-02-16 10:48:41

by Nadav Amit

Subject: Re: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

Dave Hansen <[email protected]> wrote:

> On 02/15/2018 08:36 AM, Nadav Amit wrote:
>> As long as PTI is disabled, it is possible to use global pages, as long
>> as we remove them once PTI is enabled again. To do so, return the global
>> bit to __supported_pte_mask and disable global pages using CR4.
>>
>> Signed-off-by: Nadav Amit <[email protected]>
>> ---
>> arch/x86/include/asm/tlbflush.h | 6 ++++++
>> arch/x86/mm/init.c | 14 ++++++--------
>> arch/x86/mm/tlb.c | 3 ++-
>> 3 files changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index ea65cf951c49..3a44cb0a9f56 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -319,6 +319,12 @@ static inline void set_cpu_pti_disable(unsigned short disable)
>> WARN_ON_ONCE(preemptible());
>>
>> pti_update_user_cs64(cpu_pti_disable(), disable);
>> + if (__supported_pte_mask & _PAGE_GLOBAL) {
>> + if (disable)
>> + cr4_set_bits(X86_CR4_PGE);
>> + else
>> + cr4_clear_bits(X86_CR4_PGE);
>> + }
>> this_cpu_write(cpu_tlbstate.pti_disable, disable);
>> }
>
> The TLB invalidations when doing this switch are *CRITICAL*. Otherwise,
> we end up globally-mapped kernel entries persisting to other processes
> that are then vulnerable to Meltdown.
>
> So, where are the TLB flushes?
>
> They're hidden in the cr4_set/clear_bits() function, of course. This is
> dangerous for two reasons because it makes them non-obvious and hard to
> find. It also has no interactions with the existing TLB invalidation
> infrastructure. That's _safe_ of course because extra flushing is OK,
> but it feels really funky because you're going to end up double-flushing
> on context switches which is rather unfortunate.
>
> This also needs some heavy commenting about the fact that _PAGE_GLOBAL
> is ignored when CR4.PGE=0. That's key to this working and not mentioned
> anywhere.
>
> While this looks OK to me, it still makes me rather nervous. The
> changelog and commenting definitely need a lot of work. I'm also still
> rather unconvinced that the added complexity here is worth it.

I agree that comments, and perhaps some wrapper functions to clarify when a
flush takes place, are necessary. I actually sent the patches before making
another pass since I saw they somewhat conflict with your recent patches.

I think that there are several reasons why these patches are not as bad as
they look:

1) Legacy mode performance (with PTI) is bad, but apparently x86-32
applications are still widely used: https://lkml.org/lkml/2018/2/10/38

2) It is likely that at some point you will want to disable PTI selectively
for processes, similarly to the way Windows “trusts” Microsoft SQL. At this
point you are likely to end up with most of the code that these patches
introduce.


2018-02-16 15:04:22

by Dave Hansen

Subject: Re: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

On 02/15/2018 11:53 AM, Andy Lutomirski wrote:
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -319,6 +319,12 @@ static inline void set_cpu_pti_disable(unsigned short disable)
>> WARN_ON_ONCE(preemptible());
>>
>> pti_update_user_cs64(cpu_pti_disable(), disable);
>> + if (__supported_pte_mask & _PAGE_GLOBAL) {
>> + if (disable)
>> + cr4_set_bits(X86_CR4_PGE);
>> + else
>> + cr4_clear_bits(X86_CR4_PGE);
>> + }
> This will be *extremely* slow, and I don't see the point at all. What
> are you accomplishing here?

It won't be slow if you always run compat processes, I guess.

But mixing these in here will eat a big chunk of the benefit of having
global pages (or PCIDs for that matter) in the first place.

2018-02-16 15:12:41

by Nadav Amit

Subject: Re: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

Dave Hansen <[email protected]> wrote:

> On 02/15/2018 11:53 AM, Andy Lutomirski wrote:
>>> --- a/arch/x86/include/asm/tlbflush.h
>>> +++ b/arch/x86/include/asm/tlbflush.h
>>> @@ -319,6 +319,12 @@ static inline void set_cpu_pti_disable(unsigned short disable)
>>> WARN_ON_ONCE(preemptible());
>>>
>>> pti_update_user_cs64(cpu_pti_disable(), disable);
>>> + if (__supported_pte_mask & _PAGE_GLOBAL) {
>>> + if (disable)
>>> + cr4_set_bits(X86_CR4_PGE);
>>> + else
>>> + cr4_clear_bits(X86_CR4_PGE);
>>> + }
>> This will be *extremely* slow, and I don't see the point at all. What
>> are you accomplishing here?
>
> It won't be slow if you always run compat processes, I guess.
>
> But mixing these in here will eat a big chunk of the benefit of having
> global pages (or PCIDs for that matter) in the first place.

These are all good points. The idea was to allow workloads like Apache, which
spawn multiple processes and context-switch frequently, to avoid incurring
TLB misses on kernel pages.

The double-flushing was not intentional - I missed it. Anyhow, based on your
comments, and because I don’t see an easy way to make the global
cpu_entry_area (your recent patches) work with this patch, I think I will
drop this patch.


2018-02-16 15:15:04

by Nadav Amit

Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

Andy Lutomirski <[email protected]> wrote:

> On Thu, Feb 15, 2018 at 4:36 PM, Nadav Amit <[email protected]> wrote:
>> Based on the understanding that there should be no way for userspace to
>> address the kernel-space from compatibility mode, disable it while
>> running in compatibility mode as long as the 64-bit code segment of the
>> user is not used.
>>
>> Reenabling PTI is performed by restoring NX-bits to the userspace
>> mappings, flushing the TLBs, and notifying all the CPUs that use the
>> affected mm to disable PTI. Each core responds by removing the present
>> bit for the 64-bit code-segment, and marking that PTI is disabled on
>> that core.
>
> I dislike this patch because it's conflating two things. The patch
> claims to merely disable PTI for compat tasks, whatever those are.
> But it's also introducing a much stronger concept of what a compat
> task is. The kernel currently mostly doesn't care whether a task is
> "compat" or not, and I think that most remaining code paths that do
> care are buggy and should be removed.
>
> I think the right way to approach this is to add a new arch_prctl()
> that changes allowable bitness, like this:
>
> arch_prctl(ARCH_SET_ALLOWED_GDT_CS, X86_ALLOW_CS32 | X86_ALLOW_CS64);
>
> this would set the current task to work the normal way, where 32-bit
> and 64-bit CS are available. You could set just X86_ALLOW_CS32 to
> deny 64-bit mode and just X86_ALLOW_CS64 to deny 32-bit mode. This
> would make nice attack surface reduction tools for the more paranoid
> sandbox users to use. Doing arch_prctl(ARCH_SET_ALLOWED_GDT_CS, 0)
> would return -EINVAL.
>
> A separate patch would turn PTI off if you set X86_ALLOW_CS32.
>
> This has the downside that old code doesn't get the benefit without
> some code change, but that's not the end of the world.
>
>> +static void pti_cpu_update_func(void *info)
>> +{
>> + struct mm_struct *mm = (struct mm_struct *)info;
>> +
>> + if (mm != this_cpu_read(cpu_tlbstate.loaded_mm))
>> + return;
>> +
>> + /*
>> + * Keep CS64 and CPU settings in sync despite potential concurrent
>> + * updates.
>> + */
>> + set_cpu_pti_disable(READ_ONCE(mm->context.pti_disable));
>> +}
>
> I don't like this at all. IMO a sane implementation should never
> change PTI status on a remote CPU. Just track it per task.

From your comments in the past, I understood that this requirement
(enabling/disabling PTI per mm) came from Linus. I can do it per task, but
that means that the NX-bit will always be cleared on the kernel page-table
for user mappings.

>> +void __pti_reenable(void)
>> +{
>> + struct mm_struct *mm = current->mm;
>> + int cpu;
>> +
>> + if (!mm_pti_disable(mm))
>> + return;
>> +
>> + /*
>> + * Prevent spurious page-fault storm while we set the NX-bit and have
>> + * yet not updated the per-CPU pti_disable flag.
>> + */
>> + down_write(&mm->mmap_sem);
>> +
>> + if (!mm_pti_disable(mm))
>> + goto out;
>> +
>> + /*
>> + * First, mark the PTI is enabled. Although we do anything yet, we are
>> + * safe as long as we do not reenable CS64. Since we did not update the
>> + * page tables yet, this may lead to spurious page-faults, but we need
>> + * the pti_disable in mm to be set for __pti_set_user_pgd() to do the
>> + * right thing. Holding mmap_sem would ensure matter we hold the
>> + * mmap_sem to prevent them from swamping the system.
>> + */
>> + mm->context.pti_disable = PTI_DISABLE_OFF;
>> +
>> + /* Second, restore the NX bits. */
>> + pti_update_user_pgds(mm, true);
>
> You're holding mmap_sem, but there are code paths that touch page
> tables that don't hold mmap_sem, such as the stack extension code.

Do they change user mappings? These are the only ones pti_update_user_pgds()
changes.

>> +
>> +bool pti_handle_segment_not_present(long error_code)
>> +{
>> + if (!static_cpu_has(X86_FEATURE_PTI))
>> + return false;
>> +
>> + if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
>> + return false;
>> +
>> + pti_reenable();
>> + return true;
>> +}
>
> Please don't. You're trying to emulate the old behavior here, but
> you're emulating it wrong. In particular, you won't trap on LAR.

Yes, I thought I would manage to address LAR, but failed. I thought you said
this is not a “show-stopper”. I’ll adapt your approach of using prctl, although
it really limits the benefit of this mechanism.


2018-02-16 15:43:35

by Andy Lutomirski

Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

On Thu, Feb 15, 2018 at 8:58 PM, Nadav Amit <[email protected]> wrote:
> Andy Lutomirski <[email protected]> wrote:
>
>> On Thu, Feb 15, 2018 at 4:36 PM, Nadav Amit <[email protected]> wrote:
>>> Based on the understanding that there should be no way for userspace to
>>> address the kernel-space from compatibility mode, disable it while
>>> running in compatibility mode as long as the 64-bit code segment of the
>>> user is not used.
>>>
>>> Reenabling PTI is performed by restoring NX-bits to the userspace
>>> mappings, flushing the TLBs, and notifying all the CPUs that use the
>>> affected mm to disable PTI. Each core responds by removing the present
>>> bit for the 64-bit code-segment, and marking that PTI is disabled on
>>> that core.
>>
>> I dislike this patch because it's conflating two things. The patch
>> claims to merely disable PTI for compat tasks, whatever those are.
>> But it's also introducing a much stronger concept of what a compat
>> task is. The kernel currently mostly doesn't care whether a task is
>> "compat" or not, and I think that most remaining code paths that do
>> care are buggy and should be removed.
>>
>> I think the right way to approach this is to add a new arch_prctl()
>> that changes allowable bitness, like this:
>>
>> arch_prctl(ARCH_SET_ALLOWED_GDT_CS, X86_ALLOW_CS32 | X86_ALLOW_CS64);
>>
>> this would set the current task to work the normal way, where 32-bit
>> and 64-bit CS are available. You could set just X86_ALLOW_CS32 to
>> deny 64-bit mode and just X86_ALLOW_CS64 to deny 32-bit mode. This
>> would make nice attack surface reduction tools for the more paranoid
>> sandbox users to use. Doing arch_prctl(ARCH_SET_ALLOWED_GDT_CS, 0)
>> would return -EINVAL.
>>
>> A separate patch would turn PTI off if you set X86_ALLOW_CS32.
>>
>> This has the downside that old code doesn't get the benefit without
>> some code change, but that's not the end of the world.
>>
>>> +static void pti_cpu_update_func(void *info)
>>> +{
>>> + struct mm_struct *mm = (struct mm_struct *)info;
>>> +
>>> + if (mm != this_cpu_read(cpu_tlbstate.loaded_mm))
>>> + return;
>>> +
>>> + /*
>>> + * Keep CS64 and CPU settings in sync despite potential concurrent
>>> + * updates.
>>> + */
>>> + set_cpu_pti_disable(READ_ONCE(mm->context.pti_disable));
>>> +}
>>
>> I don't like this at all. IMO a sane implementation should never
>> change PTI status on a remote CPU. Just track it per task.
>
> From your comments in the past, I understood that this requirement
> (enabling/disabling PTI per mm) came from Linus. I can do it per task, but
> that means that the NX-bit will always be cleared on the kernel page-table
> for user mappings.

I disagree with Linus here, although I could be convinced. But
there's still no need for an IPI -- even if it were per-mm, PTI could
stay on until the next kernel entry.

>
>>> +void __pti_reenable(void)
>>> +{
>>> + struct mm_struct *mm = current->mm;
>>> + int cpu;
>>> +
>>> + if (!mm_pti_disable(mm))
>>> + return;
>>> +
>>> + /*
>>> + * Prevent spurious page-fault storm while we set the NX-bit and have
>>> + * yet not updated the per-CPU pti_disable flag.
>>> + */
>>> + down_write(&mm->mmap_sem);
>>> +
>>> + if (!mm_pti_disable(mm))
>>> + goto out;
>>> +
>>> + /*
>>> + * First, mark the PTI is enabled. Although we do anything yet, we are
>>> + * safe as long as we do not reenable CS64. Since we did not update the
>>> + * page tables yet, this may lead to spurious page-faults, but we need
>>> + * the pti_disable in mm to be set for __pti_set_user_pgd() to do the
>>> + * right thing. Holding mmap_sem would ensure matter we hold the
>>> + * mmap_sem to prevent them from swamping the system.
>>> + */
>>> + mm->context.pti_disable = PTI_DISABLE_OFF;
>>> +
>>> + /* Second, restore the NX bits. */
>>> + pti_update_user_pgds(mm, true);
>>
>> You're holding mmap_sem, but there are code paths that touch page
>> tables that don't hold mmap_sem, such as the stack extension code.
>
> Do they change user mappings? These are the only ones pti_update_user_pgds()
> changes.

I think they do -- it's the user stack. page_table_lock may be more
appropriate, although I'm far from an expert in this.

>
>>> +
>>> +bool pti_handle_segment_not_present(long error_code)
>>> +{
>>> + if (!static_cpu_has(X86_FEATURE_PTI))
>>> + return false;
>>> +
>>> + if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
>>> + return false;
>>> +
>>> + pti_reenable();
>>> + return true;
>>> +}
>>
>> Please don't. You're trying to emulate the old behavior here, but
>> you're emulating it wrong. In particular, you won't trap on LAR.
>
> Yes, I thought I’ll manage to address LAR, but failed. I thought you said
> this is not a “show-stopper”. I’ll adapt your approach of using prctl, although
> it really limits the benefit of this mechanism.
>

It's possible we could get away with adding the prctl but making the
default be that only the bitness that matches the program being run is
allowed. After all, it's possible that CRIU is literally the only
program that switches bitness using the GDT. (DOSEMU2 definitely does
cross-bitness stuff, but it uses the LDT as far as I know.) And I've
never been entirely sure that CRIU fully counts toward the Linux
"don't break ABI" guarantee.

Linus, how would you feel about, by default, preventing 64-bit
programs from long-jumping to __USER32_CS and vice versa? I think it
has some value as a hardening measure. I've certainly engaged in some
exploit shenanigans myself that took advantage of the ability to long
jump/ret to change bitness at will. This wouldn't affect users of
modify_ldt() -- 64-bit programs could still create and use their own
private 32-bit segments with modify_ldt(), and seccomp can (and
should!) prevent that in sandboxed programs.

In general, I prefer an approach where everything is explicit to an
approach where we almost, but not quite, emulate the weird historical
behavior.

Pavel and Cyrill, how annoying would it be if CRIU had to do an extra
arch_prctl() to enable its cross-bitness shenanigans when
checkpointing and restoring a 32-bit program?
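
For concreteness, a hypothetical userspace sketch of the proposed interface
(ARCH_SET_ALLOWED_GDT_CS, X86_ALLOW_CS32 and X86_ALLOW_CS64 are only the
names suggested in this thread, not existing ABI, and the constant values
below are made up; it assumes an x86-64 build where SYS_arch_prctl exists):

	#include <sys/syscall.h>
	#include <unistd.h>

	#define ARCH_SET_ALLOWED_GDT_CS	0x1021		/* made-up value */
	#define X86_ALLOW_CS32			(1 << 0)
	#define X86_ALLOW_CS64			(1 << 1)

	int main(void)
	{
		/*
		 * A sandbox that never needs 64-bit user code gives up
		 * CS64, which would let the kernel drop PTI for it.
		 */
		if (syscall(SYS_arch_prctl, ARCH_SET_ALLOWED_GDT_CS,
			    X86_ALLOW_CS32))
			return 1;
		return 0;
	}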

2018-02-16 16:01:42

by Dave Hansen

Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

On 02/15/2018 08:35 AM, Nadav Amit wrote:
> I removed the PTI disabling while SMEP is unsupported, although I
> must admit I did not fully understand why it is required.

Do you mean you don't fully understand how PTI gives SMEP-like behavior
on non-SMEP hardware?

2018-02-16 16:04:05

by Nadav Amit

Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

Linus Torvalds <[email protected]> wrote:

> On Thu, Feb 15, 2018 at 3:29 PM, Andy Lutomirski <[email protected]> wrote:
>> It's possible we could get away with adding the prctl but making the
>> default be that only the bitness that matches the program being run is
>> allowed. After all, it's possible that CRIU is literally the only
>> program that switches bitness using the GDT. (DOSEMU2 definitely does
>> cross-bitness stuff, but it uses the LDT as far as I know.) And I've
>> never been entirely sure that CRIU fully counts toward the Linux
>> "don't break ABI" guarantee.
>
> Ugh.
>
> There are just _so_ many reasons to dislike that.
>
> It's not that I don't think we could try to encourage it, but this
> whole "security depends on it being in sync" seems really like a
> fundamentally bad design.
>
>> Linus, how would you feel about, by default, preventing 64-bit
>> programs from long-jumping to __USER32_CS and vice versa?
>
> How? It's a standard GDT entry. Are you going to start switching the
> GDT around every context switch?
>
> I *thought* that user space can just do a far jump on its own. But
> it's so long since I had to care that I may have forgotten all the
> requirements for going between "compatibility mode" and real long
> mode.
>
> I just feel this all is a nightmare. I can see how you would want to
> think that compatibility mode doesn't need PTI, but at the same time
> it feels like a really risky move to do this.
>
> I can see one thread being in compatibiilty mode, and another being in
> long mode, and sharing the address space. But even with just one
> thread, I'm not seeing how you keep user mode from going from
> compatibility mode to L mode with just a far jump.
>
> But maybe you have some clever scheme in mind that guarantees that
> there are no issues, or maybe I've just forgotten all the details of
> long mode vs compat mode.

It is not too pretty, I agree, but it should do the job. There is only one
problematic descriptor in the GDT that can be used to switch from
compatibility mode to long mode (LDT descriptors always have the L-bit
cleared). Changing that descriptor's present bit on context switch, when
needed, does the trick.

I tried to do it transparently: if long-mode is entered by any thread, PTI is
restored. There is one corner case I did not cover (LAR), and Andy felt this
scheme is too complicated. Unfortunately, I don’t have a better scheme in
mind.



2018-02-16 16:05:45

by Nadav Amit

Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

Dave Hansen <[email protected]> wrote:

> On 02/15/2018 08:35 AM, Nadav Amit wrote:
>> I removed the PTI disabling while SMEP is unsupported, although I
>> must admit I did not fully understand why it is required.
>
> Do you mean you don't fully understand how PTI gives SMEP-like behavior
> on non-SMEP hardware?

No. I understand how it provides SMEP-like behavior, and I understand the value
of SMEP by itself.

However, I do not understand why SMEP-like protection is required to protect
processes that run in compatibility mode from Meltdown/Spectre attacks. As
far as I understand, such a process should not be able to manipulate the
kernel into executing code in the low 4GB.


2018-02-16 16:07:07

by Dave Hansen

Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

On 02/15/2018 04:25 PM, Nadav Amit wrote:
> Dave Hansen <[email protected]> wrote:
>
>> On 02/15/2018 08:35 AM, Nadav Amit wrote:
>>> I removed the PTI disabling while SMEP is unsupported, although I
>>> must admit I did not fully understand why it is required.
>>
>> Do you mean you don't fully understand how PTI gives SMEP-like behavior
>> on non-SMEP hardware?
>
> No. I understand how it provide SMEP-like behavior, and I understand the value
> of SMEP by itself.
>
> However, I do not understand why SMEP-like protection is required to protect
> processes that run in compatibility-mode from Meltdown/Spectre attacks. As
> far as I understand, the process should not be able to manipulate the kernel
> to execute code in the low 4GB.

There are two problems: one is that regardless of Meltdown/Spectre, SMEP
is valuable. It's valuable to everything, compatibility-mode or not.

The second problem is the RSB. Its entries hold full-width virtual addresses
and, unlike the other indirect branch predictors, it can steer you anywhere,
including to the low 4GB.


2018-02-16 16:08:39

by Linus Torvalds

Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

On Thu, Feb 15, 2018 at 4:22 PM, Nadav Amit <[email protected]> wrote:
>
> It is not too pretty, I agree, but it should do the work. There is only one
> problematic descriptor that can be used to switch from compatibility-mode to
> long-mode in the GDT (LDT descriptors always have the L-bit cleared).
> Changing the descriptor's present bit on context switch when needed can do
> the work.

Sure, I can see it working, but it's some really shady stuff, and now
the scheduler needs to save/restore/check one more subtle bit.

And if you get it wrong, things will happily work, except you've now
defeated PTI. But you'll never notice, because you won't be testing
for it, and the only people who will are the black hats.

This is exactly the "security depends on it being in sync" thing that
makes me go "eww" about the whole model. Get one thing wrong, and
you'll blow all the PTI code out of the water.

So now you tried to optimize one small case that most people won't
use, but the downside is that you may make all our PTI work (and all
the overhead for all the _normal_ cases) pointless.

Linus

2018-02-16 16:47:06

by Andy Lutomirski

Subject: Re: [PATCH RFC v2 1/6] x86: Skip PTI when disable indication is set

On Thu, Feb 15, 2018 at 4:35 PM, Nadav Amit <[email protected]> wrote:
> If PTI is disabled, we do not want to switch page-tables. On entry to
> the kernel, this is done based on CR3 value. On return, do it according
> to per core indication.
>
> To be on the safe side, avoid speculative skipping of page-tables
> switching when returning the userspace. This can be avoided if the CPU
> cannot execute speculatively code without the proper permissions. When
> switching to the kernel page-tables, this is anyhow not an issue: if PTI
> is enabled and page-tables were not switched, the kernel part of the
> user page-tables would not be set.
>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> arch/x86/entry/calling.h | 33 +++++++++++++++++++++++++++++++++
> arch/x86/include/asm/tlbflush.h | 17 +++++++++++++++--
> arch/x86/kernel/asm-offsets.c | 1 +
> 3 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 3f48f695d5e6..5e9895f44d11 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -216,7 +216,14 @@ For 32-bit we have the following conventions - kernel is built with
>
> .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /*
> + * Do not switch on compatibility mode.
> + */

That comment should just say "if we're already using kernel CR3, don't
switch" or something like that.

> mov %cr3, \scratch_reg
> + testq $PTI_USER_PGTABLE_MASK, \scratch_reg
> + jz .Lend_\@
> +
> ADJUST_KERNEL_CR3 \scratch_reg
> mov \scratch_reg, %cr3
> .Lend_\@:
> @@ -225,8 +232,20 @@ For 32-bit we have the following conventions - kernel is built with
> #define THIS_CPU_user_pcid_flush_mask \
> PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
>
> +#define THIS_CPU_pti_disable \
> + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_pti_disable
> +
> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> +
> + /*
> + * Do not switch on compatibility mode. If there is no need for a
> + * flush, run lfence to avoid speculative execution returning to user
> + * with the wrong CR3.
> + */

Nix the "compatibility mode" stuff please. Also, can someone confirm
whether the affected CPUs actually speculate through SYSRET? Because
your LFENCE might be so expensive that it negates a decent chunk of
the benefit.

> + /*
> + * Cached value of mm.pti_enable to simplify and speed up kernel entry
> + * code.
> + */
> + unsigned short pti_disable;

Why unsigned short?

IIRC a lot of CPUs use a slow path when decoding instructions with
16-bit operands like cmpw, so u8 or u32 could be waaaay faster than
u16.

> +/* Return whether page-table isolation is disabled on this CPU */
> +static inline unsigned short cpu_pti_disable(void)
> +{
> + return this_cpu_read(cpu_tlbstate.pti_disable);
> +}

This should return bool regardless of what type lives in the struct.

> - invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
> + if (!cpu_pti_disable())
> + invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));

This will go badly wrong if pti_disable becomes dynamic. Can you just
leave the code as it was?

>
> /* If current->mm == NULL then the read_cr3() "borrows" an mm */
> native_write_cr3(__native_read_cr3());
> @@ -404,7 +417,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>
> asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
>
> - if (!static_cpu_has(X86_FEATURE_PTI))
> + if (!static_cpu_has(X86_FEATURE_PTI) || cpu_pti_disable())
> return;

Ditto.

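The shape being asked for here is roughly the following: keep the cached
flag as a byte in cpu_tlbstate (a field this series adds; it does not exist
in mainline) and expose it as a bool:

/* Sketch: u8 storage avoids the 16-bit operand-size decode penalty. */
struct tlb_state {
	/* ... existing fields ... */
	u8 pti_disable;		/* cached copy of mm->context.pti_disable */
};

/* Return whether page-table isolation is disabled on this CPU. */
static inline bool cpu_pti_disable(void)
{
	return this_cpu_read(cpu_tlbstate.pti_disable);
}
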
2018-02-16 16:47:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC v2 5/6] x86: Use global pages when PTI is disabled

On Thu, Feb 15, 2018 at 4:36 PM, Nadav Amit <[email protected]> wrote:
> As long as PTI is disabled, it is possible to use global pages, as long
> as we remove them once PTI is enabled again. To do so, return the global
> bit to __supported_pte_mask and disable global pages using CR4.
>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> arch/x86/include/asm/tlbflush.h | 6 ++++++
> arch/x86/mm/init.c | 14 ++++++--------
> arch/x86/mm/tlb.c | 3 ++-
> 3 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index ea65cf951c49..3a44cb0a9f56 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -319,6 +319,12 @@ static inline void set_cpu_pti_disable(unsigned short disable)
> WARN_ON_ONCE(preemptible());
>
> pti_update_user_cs64(cpu_pti_disable(), disable);
> + if (__supported_pte_mask & _PAGE_GLOBAL) {
> + if (disable)
> + cr4_set_bits(X86_CR4_PGE);
> + else
> + cr4_clear_bits(X86_CR4_PGE);
> + }

This will be *extremely* slow, and I don't see the point at all. What
are you accomplishing here?

At best, you might gain something by adjusting the page table entries
to get globalness back, but I see no reason to fiddle with CR4. Also,
if you do any of this at all and you run on K8, you're at risk of
getting machine checks unless you're very careful to get the
invalidation right.

I would just drop this patch, personally.

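The cost being pointed out is architectural: any CR4 write that changes the
PGE bit invalidates every TLB entry, global ones included. A sketch of what
the toggle amounts to (this is essentially how a full global TLB flush is
implemented, so doing it on each PTI enable/disable is a full flush each
time; the real helper also requires interrupts to be off):

/* Sketch: flipping CR4.PGE is the canonical "flush everything" operation. */
static inline void sketch_full_flush_via_pge(void)
{
	unsigned long cr4 = native_read_cr4();

	native_write_cr4(cr4 ^ X86_CR4_PGE);	/* drops all TLB entries */
	native_write_cr4(cr4);			/* toggle back; flushes again */
}
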
2018-02-16 16:48:20

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

On Thu, Feb 15, 2018 at 4:36 PM, Nadav Amit <[email protected]> wrote:
> Based on the understanding that there should be no way for userspace to
> address the kernel-space from compatibility mode, disable it while
> running in compatibility mode as long as the 64-bit code segment of the
> user is not used.
>
> Reenabling PTI is performed by restoring NX-bits to the userspace
> mappings, flushing the TLBs, and notifying all the CPUs that use the
> affected mm to disable PTI. Each core responds by removing the present
> bit for the 64-bit code-segment, and marking that PTI is disabled on
> that core.
>

I dislike this patch because it's conflating two things. The patch
claims to merely disable PTI for compat tasks, whatever those are.
But it's also introducing a much stronger concept of what a compat
task is. The kernel currently mostly doesn't care whether a task is
"compat" or not, and I think that most remaining code paths that do
care are buggy and should be removed.

I think the right way to approach this is to add a new arch_prctl()
that changes allowable bitness, like this:

arch_prctl(ARCH_SET_ALLOWED_GDT_CS, X86_ALLOW_CS32 | X86_ALLOW_CS64);

this would set the current task to work the normal way, where 32-bit
and 64-bit CS are available. You could set just X86_ALLOW_CS32 to
deny 64-bit mode and just X86_ALLOW_CS64 to deny 32-bit mode. This
would make nice attack surface reduction tools for the more paranoid
sandbox users to use. Doing arch_prctl(ARCH_SET_ALLOWED_GDT_CS, 0)
would return -EINVAL.

A separate patch would turn PTI off if you set X86_ALLOW_CS32.

This has the downside that old code doesn't get the benefit without
some code change, but that's not the end of the world.
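
For concreteness, the proposed knob would be used from userspace roughly
like this; ARCH_SET_ALLOWED_GDT_CS and the X86_ALLOW_* flags are
hypothetical values invented for this sketch, not an existing ABI:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_ALLOWED_GDT_CS	0x1021	/* hypothetical */
#define X86_ALLOW_CS32		0x1	/* hypothetical */
#define X86_ALLOW_CS64		0x2	/* hypothetical */

int main(void)
{
	/*
	 * A paranoid 32-bit sandbox would drop the 64-bit code segment
	 * entirely. arch_prctl() has no glibc wrapper for new codes, so
	 * go through syscall().
	 */
	if (syscall(SYS_arch_prctl, ARCH_SET_ALLOWED_GDT_CS, X86_ALLOW_CS32))
		perror("arch_prctl(ARCH_SET_ALLOWED_GDT_CS)");
	return 0;
}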

> +static void pti_cpu_update_func(void *info)
> +{
> + struct mm_struct *mm = (struct mm_struct *)info;
> +
> + if (mm != this_cpu_read(cpu_tlbstate.loaded_mm))
> + return;
> +
> + /*
> + * Keep CS64 and CPU settings in sync despite potential concurrent
> + * updates.
> + */
> + set_cpu_pti_disable(READ_ONCE(mm->context.pti_disable));
> +}

I don't like this at all. IMO a sane implementation should never
change PTI status on a remote CPU. Just track it per task.

> +void __pti_reenable(void)
> +{
> + struct mm_struct *mm = current->mm;
> + int cpu;
> +
> + if (!mm_pti_disable(mm))
> + return;
> +
> + /*
> + * Prevent spurious page-fault storm while we set the NX-bit and have
> + * yet not updated the per-CPU pti_disable flag.
> + */
> + down_write(&mm->mmap_sem);
> +
> + if (!mm_pti_disable(mm))
> + goto out;
> +
> + /*
> + * First, mark PTI as enabled. Although we have not done anything yet,
> + * we are safe as long as we do not reenable CS64. Since we did not
> + * update the page tables yet, this may lead to spurious page-faults,
> + * but we need the pti_disable in mm to be set for __pti_set_user_pgd()
> + * to do the right thing. Holding mmap_sem prevents these spurious
> + * faults from swamping the system.
> + */
> + mm->context.pti_disable = PTI_DISABLE_OFF;
> +
> + /* Second, restore the NX bits. */
> + pti_update_user_pgds(mm, true);

You're holding mmap_sem, but there are code paths that touch page
tables that don't hold mmap_sem, such as the stack extension code.

> +
> +bool pti_handle_segment_not_present(long error_code)
> +{
> + if (!static_cpu_has(X86_FEATURE_PTI))
> + return false;
> +
> + if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
> + return false;
> +
> + pti_reenable();
> + return true;
> +}

Please don't. You're trying to emulate the old behavior here, but
you're emulating it wrong. In particular, you won't trap on LAR.

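The LAR point deserves spelling out: LAR reads a descriptor's access rights
(present bit included) without ever raising #NP, so a task that probes
__USER_CS this way observes the hidden segment directly, and the kernel
never gets a fault it could use to transparently restore it. A userspace
sketch of such a probe (illustration only):

/*
 * Returns 1 if LAR accepted the selector; the access rights land in *ar
 * (undefined if LAR rejected it).
 */
static inline int lar_access_rights(unsigned int sel, unsigned int *ar)
{
	unsigned char valid;

	asm volatile("lar %2, %0\n\t"
		     "setz %1"
		     : "=r" (*ar), "=qm" (valid)
		     : "r" (sel)
		     : "cc");
	return valid;
}
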
2018-02-16 17:05:00

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/6] x86: Skip PTI when disable indication is set

Andy Lutomirski <[email protected]> wrote:

> On Thu, Feb 15, 2018 at 4:35 PM, Nadav Amit <[email protected]> wrote:
>> If PTI is disabled, we do not want to switch page-tables. On entry to
>> the kernel, this is done based on CR3 value. On return, do it according
>> to per core indication.
>>
>> To be on the safe side, avoid speculative skipping of page-tables
>> switching when returning the userspace. This can be avoided if the CPU
>> cannot execute speculatively code without the proper permissions. When
>> switching to the kernel page-tables, this is anyhow not an issue: if PTI
>> is enabled and page-tables were not switched, the kernel part of the
>> user page-tables would not be set.
>>
>> Signed-off-by: Nadav Amit <[email protected]>
>> ---
>> arch/x86/entry/calling.h | 33 +++++++++++++++++++++++++++++++++
>> arch/x86/include/asm/tlbflush.h | 17 +++++++++++++++--
>> arch/x86/kernel/asm-offsets.c | 1 +
>> 3 files changed, 49 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> index 3f48f695d5e6..5e9895f44d11 100644
>> --- a/arch/x86/entry/calling.h
>> +++ b/arch/x86/entry/calling.h
>> @@ -216,7 +216,14 @@ For 32-bit we have the following conventions - kernel is built with
>>
>> .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
>> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>> +
>> + /*
>> + * Do not switch on compatibility mode.
>> + */
>
> That comment should just say "if we're already using kernel CR3, don't
> switch" or something like that.

ok.

>
>> mov %cr3, \scratch_reg
>> + testq $PTI_USER_PGTABLE_MASK, \scratch_reg
>> + jz .Lend_\@
>> +
>> ADJUST_KERNEL_CR3 \scratch_reg
>> mov \scratch_reg, %cr3
>> .Lend_\@:
>> @@ -225,8 +232,20 @@ For 32-bit we have the following conventions - kernel is built with
>> #define THIS_CPU_user_pcid_flush_mask \
>> PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
>>
>> +#define THIS_CPU_pti_disable \
>> + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_pti_disable
>> +
>> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
>> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>> +
>> + /*
>> + * Do not switch on compatibility mode. If there is no need for a
>> + * flush, run lfence to avoid speculative execution returning to user
>> + * with the wrong CR3.
>> + */
>
> Nix the "compatibility mode" stuff please. Also, can someone confirm
> whether the affected CPUs actually speculate through SYSRET? Because
> your LFENCE might be so expensive that it negates a decent chunk of
> the benefit.

I will send performance numbers in the next iteration. The LFENCE did
not introduce high overheads. Anyhow, it would surely be nice to remove it.

>> + /*
>> + * Cached value of mm.pti_enable to simplify and speed up kernel entry
>> + * code.
>> + */
>> + unsigned short pti_disable;
>
> Why unsigned short?
>
> IIRC a lot of CPUs use a slow path when decoding instructions with
> 16-bit operands like cmpw, so u8 or u32 could be waaaay faster than
> u16.

Will do.

>
>> +/* Return whether page-table isolation is disabled on this CPU */
>> +static inline unsigned short cpu_pti_disable(void)
>> +{
>> + return this_cpu_read(cpu_tlbstate.pti_disable);
>> +}
>
> This should return bool regardless of what type lives in the struct.

Ok. I think that it was so because I tried to support both CS64 and CS32.

>
>> - invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
>> + if (!cpu_pti_disable())
>> + invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
>
> This will go badly wrong if pti_disable becomes dynamic. Can you just
> leave the code as it was?

Will do.

>
>> /* If current->mm == NULL then the read_cr3() "borrows" an mm */
>> native_write_cr3(__native_read_cr3());
>> @@ -404,7 +417,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>>
>> asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
>>
>> - if (!static_cpu_has(X86_FEATURE_PTI))
>> + if (!static_cpu_has(X86_FEATURE_PTI) || cpu_pti_disable())
>> return;
>
> Ditto.

As for this last one - I don’t see why. Can you please explain? If you are
only worried about enabling/disabling PTI dynamically, I can address this
specific issue by flushing the TLB when it happens.

Thanks,
Nadav


2018-02-16 18:25:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC v2 1/6] x86: Skip PTI when disable indication is set

On Thu, Feb 15, 2018 at 8:51 PM, Nadav Amit <[email protected]> wrote:
> Andy Lutomirski <[email protected]> wrote:
>
>> On Thu, Feb 15, 2018 at 4:35 PM, Nadav Amit <[email protected]> wrote:
>>> If PTI is disabled, we do not want to switch page-tables. On entry to
>>> the kernel, this is done based on CR3 value. On return, do it according
>>> to per core indication.
>>>
>>> To be on the safe side, avoid speculative skipping of page-tables
>>> switching when returning the userspace. This can be avoided if the CPU
>>> cannot execute speculatively code without the proper permissions. When
>>> switching to the kernel page-tables, this is anyhow not an issue: if PTI
>>> is enabled and page-tables were not switched, the kernel part of the
>>> user page-tables would not be set.
>>>
>>> Signed-off-by: Nadav Amit <[email protected]>
>>> ---
>>> arch/x86/entry/calling.h | 33 +++++++++++++++++++++++++++++++++
>>> arch/x86/include/asm/tlbflush.h | 17 +++++++++++++++--
>>> arch/x86/kernel/asm-offsets.c | 1 +
>>> 3 files changed, 49 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>>> index 3f48f695d5e6..5e9895f44d11 100644
>>> --- a/arch/x86/entry/calling.h
>>> +++ b/arch/x86/entry/calling.h
>>> @@ -216,7 +216,14 @@ For 32-bit we have the following conventions - kernel is built with
>>>
>>> .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
>>> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>>> +
>>> + /*
>>> + * Do not switch on compatibility mode.
>>> + */
>>
>> That comment should just say "if we're already using kernel CR3, don't
>> switch" or something like that.
>
> ok.
>
>>
>>> mov %cr3, \scratch_reg
>>> + testq $PTI_USER_PGTABLE_MASK, \scratch_reg
>>> + jz .Lend_\@
>>> +
>>> ADJUST_KERNEL_CR3 \scratch_reg
>>> mov \scratch_reg, %cr3
>>> .Lend_\@:
>>> @@ -225,8 +232,20 @@ For 32-bit we have the following conventions - kernel is built with
>>> #define THIS_CPU_user_pcid_flush_mask \
>>> PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
>>>
>>> +#define THIS_CPU_pti_disable \
>>> + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_pti_disable
>>> +
>>> .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
>>> ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>>> +
>>> + /*
>>> + * Do not switch on compatibility mode. If there is no need for a
>>> + * flush, run lfence to avoid speculative execution returning to user
>>> + * with the wrong CR3.
>>> + */
>>
>> Nix the "compatibility mode" stuff please. Also, can someone confirm
>> whether the affected CPUs actually speculate through SYSRET? Because
>> your LFENCE might be so expensive that it negates a decent chunk of
>> the benefit.
>
> I will send performance numbers with in the next iteration. The LFENCE did
> not introduce high overheads. Anyhow, it would surely be nice to remove it.
>
>>> + /*
>>> + * Cached value of mm.pti_enable to simplify and speed up kernel entry
>>> + * code.
>>> + */
>>> + unsigned short pti_disable;
>>
>> Why unsigned short?
>>
>> IIRC a lot of CPUs use a slow path when decoding instructions with
>> 16-bit operands like cmpw, so u8 or u32 could be waaaay faster than
>> u16.
>
> Will do.
>
>>
>>> +/* Return whether page-table isolation is disabled on this CPU */
>>> +static inline unsigned short cpu_pti_disable(void)
>>> +{
>>> + return this_cpu_read(cpu_tlbstate.pti_disable);
>>> +}
>>
>> This should return bool regardless of what type lives in the struct.
>
> Ok. I think that it was so because I tried to support both CS64 and CS32.
>
>>
>>> - invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
>>> + if (!cpu_pti_disable())
>>> + invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
>>
>> This will go badly wrong if pti_disable becomes dynamic. Can you just
>> leave the code as it was?
>
> Will do.
>
>>
>>> /* If current->mm == NULL then the read_cr3() "borrows" an mm */
>>> native_write_cr3(__native_read_cr3());
>>> @@ -404,7 +417,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>>>
>>> asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
>>>
>>> - if (!static_cpu_has(X86_FEATURE_PTI))
>>> + if (!static_cpu_has(X86_FEATURE_PTI) || cpu_pti_disable())
>>> return;
>>
>> Ditto.
>
> As for this last one - I don’t see why. Can you please explain? If you are
> only worried about enabling/disabling PTI dynamically, I can address this
> specific issue by flushing the TLB when it happens.
>

Simplicity. But if the code is careful to get all the flushing right
when the mode switches, I'm okay with keeping the optimization.

--Andy

2018-02-16 18:30:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

On Thu, Feb 15, 2018 at 3:29 PM, Andy Lutomirski <[email protected]> wrote:
>
> It's possible we could get away with adding the prctl but making the
> default be that only the bitness that matches the program being run is
> allowed. After all, it's possible that CRIU is literally the only
> program that switches bitness using the GDT. (DOSEMU2 definitely does
> cross-bitness stuff, but it uses the LDT as far as I know.) And I've
> never been entirely sure that CRIU fully counts toward the Linux
> "don't break ABI" guarantee.

Ugh.

There are just _so_ many reasons to dislike that.

It's not that I don't think we could try to encourage it, but this
whole "security depends on it being in sync" seems really like a
fundamentally bad design.

> Linus, how would you feel about, by default, preventing 64-bit
> programs from long-jumping to __USER32_CS and vice versa?

How? It's a standard GDT entry. Are you going to start switching the
GDT around every context switch?

I *thought* that user space can just do a far jump on its own. But
it's so long since I had to care that I may have forgotten all the
requirements for going between "compatibility mode" and real long
mode.

I just feel this all is a nightmare. I can see how you would want to
think that compatibility mode doesn't need PTI, but at the same time
it feels like a really risky move to do this.

I can see one thread being in compatibility mode, and another being in
long mode, and sharing the address space. But even with just one
thread, I'm not seeing how you keep user mode from going from
compatibility mode to long mode with just a far jump.

But maybe you have some clever scheme in mind that guarantees that
there are no issues, or maybe I've just forgotten all the details of
long mode vs compat mode.

Linus

2018-02-16 18:31:14

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

On 16/02/2018 00:08, Linus Torvalds wrote:
> On Thu, Feb 15, 2018 at 3:29 PM, Andy Lutomirski <[email protected]> wrote:
>> Linus, how would you feel about, by default, preventing 64-bit
>> programs from long-jumping to __USER32_CS and vice versa?
> How? It's a standard GDT entry. Are you going to start switching the
> GDT around every context switch?
>
> I *thought* that user space can just do a far jump on its own. But
> it's so long since I had to care that I may have forgotten all the
> requirements for going between "compatibility mode" and real long
> mode.

Yes - it is just a straight far jump to switch between compat and long mode.

An evil^W cunning programmer can use the 286 world view and disable
segments by clearing the present bit to yield #NP[sel] on use, which is
liable to be rather faster than LGDT on a context switch.

Alternatively, setting both the L and D bits (code segments only), or
playing with DPL/type, can also yield #GP[sel] on use, but these probably
aren't as good options.

~Andrew
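
A variant of the same per-CPU GDT edit, along the lines described above
(sketch only): setting both L and D in a code segment descriptor is
architecturally invalid in IA-32e mode, so loading it faults with #GP
rather than #NP.

static void sketch_poison_user_cs64(void)
{
	struct desc_struct *gdt = get_cpu_gdt_rw(smp_processor_id());

	/* L=1 together with D=1 is reserved; far transfers yield #GP. */
	gdt[GDT_ENTRY_DEFAULT_USER_CS].l = 1;
	gdt[GDT_ENTRY_DEFAULT_USER_CS].d = 1;
}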

2018-02-16 18:32:03

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

On 16/02/2018 00:25, Nadav Amit wrote:
> Dave Hansen <[email protected]> wrote:
>
>> On 02/15/2018 08:35 AM, Nadav Amit wrote:
>>> I removed the PTI disabling while SMEP is unsupported, although I
>>> must admit I did not fully understand why it is required.
>> Do you mean you don't fully understand how PTI gives SMEP-like behavior
>> on non-SMEP hardware?
> No. I understand how it provide SMEP-like behavior, and I understand the value
> of SMEP by itself.
>
> However, I do not understand why SMEP-like protection is required to protect
> processes that run in compatibility-mode from Meltdown/Spectre attacks. As
> far as I understand, the process should not be able to manipulate the kernel
> to execute code in the low 4GB.

Being 32bit is itself sufficient protection against Meltdown (as long as
there is nothing interesting of the kernel's mapped below the 4G boundary).

However, a 32bit compatibility process can try to attack with Spectre/SP2 to
redirect speculation back into userspace, at which point (if successful)
the pipeline will be speculating in 64bit mode, and Meltdown is back on
the table. SMEP will block this attack vector, irrespective of other
SP2 defences the kernel may employ, but a fully SP2-defended kernel
doesn't require SMEP to be safe in this case.

~Andrew

2018-02-16 18:32:37

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

Dave Hansen <[email protected]> wrote:

> On 02/15/2018 04:25 PM, Nadav Amit wrote:
>> Dave Hansen <[email protected]> wrote:
>>
>>> On 02/15/2018 08:35 AM, Nadav Amit wrote:
>>>> I removed the PTI disabling while SMEP is unsupported, although I
>>>> must admit I did not fully understand why it is required.
>>>
>>> Do you mean you don't fully understand how PTI gives SMEP-like behavior
>>> on non-SMEP hardware?
>>
>> No. I understand how it provide SMEP-like behavior, and I understand the value
>> of SMEP by itself.
>>
>> However, I do not understand why SMEP-like protection is required to protect
>> processes that run in compatibility-mode from Meltdown/Spectre attacks. As
>> far as I understand, the process should not be able to manipulate the kernel
>> to execute code in the low 4GB.
>
> There are two problems: one is that regardless of Meltdown/Spectre, SMEP
> is valuable. It's valuable to everything, compatibility-mode or not.
>
> The second problem is the RSB. It has a full-width virtual address and,
> unlike the other indirect branch prediction, can steer you anywhere
> including to the low 4GB.

Thanks for the explanation. Based on Linus' response, I guess this series is
nak’d, but still thanks for your patience.

I suspected the RSB might be the reason, but it seemed to me that all the ROP
opportunities are still there, so I assumed it was not the reason.

Anyhow, thanks again.



2018-02-16 18:33:02

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

Andrew Cooper <[email protected]> wrote:

> On 16/02/2018 00:25, Nadav Amit wrote:
>> Dave Hansen <[email protected]> wrote:
>>
>>> On 02/15/2018 08:35 AM, Nadav Amit wrote:
>>>> I removed the PTI disabling while SMEP is unsupported, although I
>>>> must admit I did not fully understand why it is required.
>>> Do you mean you don't fully understand how PTI gives SMEP-like behavior
>>> on non-SMEP hardware?
>> No. I understand how it provide SMEP-like behavior, and I understand the value
>> of SMEP by itself.
>>
>> However, I do not understand why SMEP-like protection is required to protect
>> processes that run in compatibility-mode from Meltdown/Spectre attacks. As
>> far as I understand, the process should not be able to manipulate the kernel
>> to execute code in the low 4GB.
>
> Being 32bit is itself sufficient protection against Meltdown (as long as
> there nothing interesting of the kernels mapped below the 4G boundary).
>
> However, a 32bit compatibility process try to attack with Spectre/SP2 to
> redirect speculation back into userspace, at which point (if successful)
> the pipeline will be speculating in 64bit mode, and Meltdown is back on
> the table. SMEP will block this attack vector, irrespective of other
> SP2 defences the kernel may employ, but a fully SP2-defended kernel
> doesn't require SMEP to be safe in this case.

Based on Jann Horn’s description of the branch predictor, it basically only
holds the lowest 31 bits of the target address. There might be a subtle
problem if the prediction wraps around, but excluding this case, I do not see
how Spectre v2 can be used to jump into running user code.


2018-02-16 18:34:36

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

On 16/02/2018 00:51, Nadav Amit wrote:
> Andrew Cooper <[email protected]> wrote:
>
>> On 16/02/2018 00:25, Nadav Amit wrote:
>>> Dave Hansen <[email protected]> wrote:
>>>
>>>> On 02/15/2018 08:35 AM, Nadav Amit wrote:
>>>>> I removed the PTI disabling while SMEP is unsupported, although I
>>>>> must admit I did not fully understand why it is required.
>>>> Do you mean you don't fully understand how PTI gives SMEP-like behavior
>>>> on non-SMEP hardware?
>>> No. I understand how it provide SMEP-like behavior, and I understand the value
>>> of SMEP by itself.
>>>
>>> However, I do not understand why SMEP-like protection is required to protect
>>> processes that run in compatibility-mode from Meltdown/Spectre attacks. As
>>> far as I understand, the process should not be able to manipulate the kernel
>>> to execute code in the low 4GB.
>> Being 32bit is itself sufficient protection against Meltdown (as long as
>> there nothing interesting of the kernels mapped below the 4G boundary).
>>
>> However, a 32bit compatibility process try to attack with Spectre/SP2 to
>> redirect speculation back into userspace, at which point (if successful)
>> the pipeline will be speculating in 64bit mode, and Meltdown is back on
>> the table. SMEP will block this attack vector, irrespective of other
>> SP2 defences the kernel may employ, but a fully SP2-defended kernel
>> doesn't require SMEP to be safe in this case.
> Based on Jann Horn’s description of the branch predictor, it basically only
> holds the lowest 31-bits of the target address. There might be a subtle
> problem if the prediction wrapsaround, but excluding this case, I do not see
> how Spectre v2 can be used to jump into running user code.

RSB poisoning is also part of SP2, and does have full width addresses.

~Andrew

P.S. Consider yourself lucky that the 32bit code isn't running in
Ring1. Xen has a substantially more interesting time in this regard.

2018-02-16 18:42:58

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

On Fri, Feb 16, 2018 at 12:42 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 15, 2018 at 4:22 PM, Nadav Amit <[email protected]> wrote:
>>
>> It is not too pretty, I agree, but it should do the work. There is only one
>> problematic descriptor that can be used to switch from compatibility-mode to
>> long-mode in the GDT (LDT descriptors always have the L-bit cleared).
>> Changing the descriptor's present bit on context switch when needed can do
>> the work.
>
> Sure, I can see it working, but it's some really shady stuff, and now
> the scheduler needs to save/restore/check one more subtle bit.
>
> And if you get it wrong, things will happily work, except you've now
> defeated PTI. But you'll never notice, because you won't be testing
> for it, and the only people who will are the black hats.
>
> This is exactly the "security depends on it being in sync" thing that
> makes me go "eww" about the whole model. Get one thing wrong, and
> you'll blow all the PTI code out of the water.
>
> So now you tried to optimize one small case that most people won't
> use, but the downside is that you may make all our PTI work (and all
> the overhead for all the _normal_ cases) pointless.
>

There's also the fact that, if this stuff goes in, we'll be
encouraging people to deploy 32-bit binaries. Then they'll buy
Meltdown-fixed CPUs (or AMD CPUs!) and they may well continue running
32-bit binaries. Sigh. I'm not totally a fan of this.

2018-02-16 18:43:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC v2 0/6] x86: Disabling PTI in compatibility mode

On Fri, Feb 16, 2018 at 12:51 AM, Nadav Amit <[email protected]> wrote:
> Andrew Cooper <[email protected]> wrote:
>
>> On 16/02/2018 00:25, Nadav Amit wrote:
>>> Dave Hansen <[email protected]> wrote:
>>>
>>>> On 02/15/2018 08:35 AM, Nadav Amit wrote:
>>>>> I removed the PTI disabling while SMEP is unsupported, although I
>>>>> must admit I did not fully understand why it is required.
>>>> Do you mean you don't fully understand how PTI gives SMEP-like behavior
>>>> on non-SMEP hardware?
>>> No. I understand how it provide SMEP-like behavior, and I understand the value
>>> of SMEP by itself.
>>>
>>> However, I do not understand why SMEP-like protection is required to protect
>>> processes that run in compatibility-mode from Meltdown/Spectre attacks. As
>>> far as I understand, the process should not be able to manipulate the kernel
>>> to execute code in the low 4GB.
>>
>> Being 32bit is itself sufficient protection against Meltdown (as long as
>> there nothing interesting of the kernels mapped below the 4G boundary).
>>
>> However, a 32bit compatibility process try to attack with Spectre/SP2 to
>> redirect speculation back into userspace, at which point (if successful)
>> the pipeline will be speculating in 64bit mode, and Meltdown is back on
>> the table. SMEP will block this attack vector, irrespective of other
>> SP2 defences the kernel may employ, but a fully SP2-defended kernel
>> doesn't require SMEP to be safe in this case.
>
> Based on Jann Horn’s description of the branch predictor, it basically only
> holds the lowest 31-bits of the target address. There might be a subtle
> problem if the prediction wrapsaround, but excluding this case, I do not see
> how Spectre v2 can be used to jump into running user code.
>

If you can make the *kernel* speculate into user code, you can create
whatever gadget you want. A 32-bit task can poison the branch
predictor and use this to attack a non-retpolined kernel.

2018-02-16 18:50:13

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

On Thu, Feb 15, 2018 at 11:29:42PM +0000, Andy Lutomirski wrote:
...
> >>> +bool pti_handle_segment_not_present(long error_code)
> >>> +{
> >>> + if (!static_cpu_has(X86_FEATURE_PTI))
> >>> + return false;
> >>> +
> >>> + if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
> >>> + return false;
> >>> +
> >>> + pti_reenable();
> >>> + return true;
> >>> +}
> >>
> >> Please don't. You're trying to emulate the old behavior here, but
> >> you're emulating it wrong. In particular, you won't trap on LAR.
> >
> > Yes, I thought I’ll manage to address LAR, but failed. I thought you said
> > this is not a “show-stopper”. I’ll adapt your approach of using prctl, although
> > it really limits the benefit of this mechanism.
> >
>
> It's possible we could get away with adding the prctl but making the
> default be that only the bitness that matches the program being run is
> allowed. After all, it's possible that CRIU is literally the only
> program that switches bitness using the GDT. (DOSEMU2 definitely does
> cross-bitness stuff, but it uses the LDT as far as I know.) And I've
> never been entirely sure that CRIU fully counts toward the Linux
> "don't break ABI" guarantee.
>
> Linus, how would you feel about, by default, preventing 64-bit
> programs from long-jumping to __USER32_CS and vice versa? I think it
> has some value as a hardening measure. I've certainly engaged in some
> exploit shenanigans myself that took advantage of the ability to long
> jump/ret to change bitness at will. This wouldn't affect users of
> modify_ldt() -- 64-bit programs could still create and use their own
> private 32-bit segments with modify_ldt(), and seccomp can (and
> should!) prevent that in sandboxed programs.
>
> In general, I prefer an approach where everything is explicit to an
> approach where we almost, but not quite, emulate the weird historical
> behavior.
>
> Pavel and Cyrill, how annoying would it be if CRIU had to do an extra
> arch_prctl() to enable its cross-bitness shenanigans when
> checkpointing and restoring a 32-bit program?

I think this should not be a problem for criu (CC'ing Dima, who has
been working on compat mode support in criu). As far as I remember,
we initiate restoring of 32 bit tasks in native 64 bit mode (well,
IA-32e mode, to be precise :) and then, once everything is ready,
we change the mode by doing a return to the __USER32_CS descriptor.
So it won't be painful to add an additional prctl call here.
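
The bitness switch mentioned here is nothing more exotic than a far
transfer through the 32-bit user code selector. A userspace sketch of that
final step (not CRIU's actual restorer; __USER32_CS is selector 0x23 on
x86-64 Linux, and both the target code and its stack must live below 4GB):

#define USER32_CS	0x23	/* GDT_ENTRY_DEFAULT_USER32_CS*8 | RPL 3 */

/* Far-return into 32-bit (compatibility) mode; never returns here. */
static void __attribute__((noreturn)) enter_compat_mode(unsigned int entry32)
{
	asm volatile("pushq %0\n\t"	/* new CS */
		     "pushq %1\n\t"	/* new RIP (below 4GB) */
		     "lretq"		/* pops RIP, then CS */
		     : : "r" ((unsigned long)USER32_CS),
		         "r" ((unsigned long)entry32));
	__builtin_unreachable();
}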

2018-02-16 19:14:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode



>> On Feb 15, 2018, at 4:08 PM, Linus Torvalds <[email protected]> wrote:
>>
>> On Thu, Feb 15, 2018 at 3:29 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> It's possible we could get away with adding the prctl but making the
>> default be that only the bitness that matches the program being run is
>> allowed. After all, it's possible that CRIU is literally the only
>> program that switches bitness using the GDT. (DOSEMU2 definitely does
>> cross-bitness stuff, but it uses the LDT as far as I know.) And I've
>> never been entirely sure that CRIU fully counts toward the Linux
>> "don't break ABI" guarantee.
>
> Ugh.
>
> There are just _so_ many reasons to dislike that.
>
> It's not that I don't think we could try to encourage it, but this
> whole "security depends on it being in sync" seems really like a
> fundamentally bad design.

If we're going to do Nadav's thing, I think we have no choice. We could say that Nadav's idea of turning off PTI for 32-bit is just too messy, though.

>
>> Linus, how would you feel about, by default, preventing 64-bit
>> programs from long-jumping to __USER32_CS and vice versa?
>
> How? It's a standard GDT entry. Are you going to start switching the
> GDT around every context switch?

That's the idea. We already switch out three GDT entries for TLS. Switching two more isn't going to kill us.
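
The precedent cited here is the TLS reload on context switch, which already
rewrites three GDT entries per task; a per-task allowed-CS policy could be
applied at the same point. Roughly what that existing reload looks like (a
sketch mirroring the 4.15-era native_load_tls()):

static inline void sketch_load_tls(struct thread_struct *t, unsigned int cpu)
{
	struct desc_struct *gdt = get_cpu_gdt_rw(cpu);
	int i;

	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
		gdt[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i];
}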

>
> I *thought* that user space can just do a far jump on its own. But
> it's so long since I had to care that I may have forgotten all the
> requirements for going between "compatibility mode" and real long
> mode.
>
> I just feel this all is a nightmare. I can see how you would want to
> think that compatibility mode doesn't need PTI, but at the same time
> it feels like a really risky move to do this.
>
> I can see one thread being in compatibiilty mode, and another being in
> long mode, and sharing the address space. But even with just one
> thread, I'm not seeing how you keep user mode from going from
> compatibility mode to L mode with just a far jump.
>
> But maybe you have some clever scheme in mind that guarantees that
> there are no issues, or maybe I've just forgotten all the details of
> long mode vs compat mode.

The clever scheme is that we have a new (maybe default) compat-and-i-mean-it mode that removes the DPL=3 L code segment from the GDT and prevents opportunistic SYSRET.

2018-02-16 19:34:42

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

Andy Lutomirski <[email protected]> wrote:

> On Fri, Feb 16, 2018 at 12:42 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Feb 15, 2018 at 4:22 PM, Nadav Amit <[email protected]> wrote:
>>> It is not too pretty, I agree, but it should do the work. There is only one
>>> problematic descriptor that can be used to switch from compatibility-mode to
>>> long-mode in the GDT (LDT descriptors always have the L-bit cleared).
>>> Changing the descriptor's present bit on context switch when needed can do
>>> the work.
>>
>> Sure, I can see it working, but it's some really shady stuff, and now
>> the scheduler needs to save/restore/check one more subtle bit.
>>
>> And if you get it wrong, things will happily work, except you've now
>> defeated PTI. But you'll never notice, because you won't be testing
>> for it, and the only people who will are the black hats.
>>
>> This is exactly the "security depends on it being in sync" thing that
>> makes me go "eww" about the whole model. Get one thing wrong, and
>> you'll blow all the PTI code out of the water.
>>
>> So now you tried to optimize one small case that most people won't
>> use, but the downside is that you may make all our PTI work (and all
>> the overhead for all the _normal_ cases) pointless.
>
> There's also the fact that, if this stuff goes in, we'll be
> encouraging people to deploy 32-bit binaries. Then they'll buy
> Meltdown-fixed CPUs (or AMD CPUs!) and they may well continue running
> 32-bit binaries. Sigh. I'm not totally a fan of this.

Ok, ok. Stop kicking the dead body...

2018-02-16 19:35:34

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

2018-02-15 20:02 GMT+00:00 Andy Lutomirski <[email protected]>:
> On Thu, Feb 15, 2018 at 4:36 PM, Nadav Amit <[email protected]> wrote:
>> Based on the understanding that there should be no way for userspace to
>> address the kernel-space from compatibility mode, disable it while
>> running in compatibility mode as long as the 64-bit code segment of the
>> user is not used.
>>
>> Reenabling PTI is performed by restoring NX-bits to the userspace
>> mappings, flushing the TLBs, and notifying all the CPUs that use the
>> affected mm to disable PTI. Each core responds by removing the present
>> bit for the 64-bit code-segment, and marking that PTI is disabled on
>> that core.
>>
>
> I dislike this patch because it's conflating two things. The patch
> claims to merely disable PTI for compat tasks, whatever those are.
> But it's also introducing a much stronger concept of what a compat
> task is. The kernel currently mostly doesn't care whether a task is
> "compat" or not, and I think that most remaining code paths that do
> care are buggy and should be removed.

Yes, please, don't do a stronger concept...
Speaking from the CRIU side, it C/Rs ia32 tasks with x86_64 binaries.

--
Dmitry

2018-02-16 22:08:54

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

2018-02-16 7:11 GMT+00:00 Cyrill Gorcunov <[email protected]>:
> On Thu, Feb 15, 2018 at 11:29:42PM +0000, Andy Lutomirski wrote:
> ...
>> >>> +bool pti_handle_segment_not_present(long error_code)
>> >>> +{
>> >>> + if (!static_cpu_has(X86_FEATURE_PTI))
>> >>> + return false;
>> >>> +
>> >>> + if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
>> >>> + return false;
>> >>> +
>> >>> + pti_reenable();
>> >>> + return true;
>> >>> +}
>> >>
>> >> Please don't. You're trying to emulate the old behavior here, but
>> >> you're emulating it wrong. In particular, you won't trap on LAR.
>> >
>> > Yes, I thought I’ll manage to address LAR, but failed. I thought you said
>> > this is not a “show-stopper”. I’ll adapt your approach of using prctl, although
>> > it really limits the benefit of this mechanism.
>> >
>>
>> It's possible we could get away with adding the prctl but making the
>> default be that only the bitness that matches the program being run is
>> allowed. After all, it's possible that CRIU is literally the only
>> program that switches bitness using the GDT. (DOSEMU2 definitely does
>> cross-bitness stuff, but it uses the LDT as far as I know.) And I've
>> never been entirely sure that CRIU fully counts toward the Linux
>> "don't break ABI" guarantee.
>>
>> Linus, how would you feel about, by default, preventing 64-bit
>> programs from long-jumping to __USER32_CS and vice versa? I think it
>> has some value as a hardening measure. I've certainly engaged in some
>> exploit shenanigans myself that took advantage of the ability to long
>> jump/ret to change bitness at will. This wouldn't affect users of
>> modify_ldt() -- 64-bit programs could still create and use their own
>> private 32-bit segments with modify_ldt(), and seccomp can (and
>> should!) prevent that in sandboxed programs.
>>
>> In general, I prefer an approach where everything is explicit to an
>> approach where we almost, but not quite, emulate the weird historical
>> behavior.
>>
>> Pavel and Cyrill, how annoying would it be if CRIU had to do an extra
>> arch_prctl() to enable its cross-bitness shenanigans when
>> checkpointing and restoring a 32-bit program?
>
> I think this should not be a problem for criu (CC'ing Dima, who has
> been working on compat mode support in criu). As far as I remember
> we initiate restoring of 32 bit tasks in native 64 bit mode (well,
> ia32e to be precise :) mode and then, once everything is ready,
> we changing the mode by doing a return to __USER32_CS descriptor.
> So this won't be painful to add additional prctl call here.

Yeah, restoring will still be easy...
But checkpointing will be harder if we can't switch to 64-bit mode.
ATM we have one 64-bit parasite binary, which does all the seizing work
for both 64 and 32 bit binaries.
So, if you can't switch back to 64-bit from 32-bit mode, we'll need
to keep two parasites.

--
Dmitry

2018-02-16 22:13:49

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode

Dmitry Safonov <[email protected]> wrote:

> 2018-02-16 7:11 GMT+00:00 Cyrill Gorcunov <[email protected]>:
>> On Thu, Feb 15, 2018 at 11:29:42PM +0000, Andy Lutomirski wrote:
>> ...
>>>>>> +bool pti_handle_segment_not_present(long error_code)
>>>>>> +{
>>>>>> + if (!static_cpu_has(X86_FEATURE_PTI))
>>>>>> + return false;
>>>>>> +
>>>>>> + if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
>>>>>> + return false;
>>>>>> +
>>>>>> + pti_reenable();
>>>>>> + return true;
>>>>>> +}
>>>>>
>>>>> Please don't. You're trying to emulate the old behavior here, but
>>>>> you're emulating it wrong. In particular, you won't trap on LAR.
>>>>
>>>> Yes, I thought I’ll manage to address LAR, but failed. I thought you said
>>>> this is not a “show-stopper”. I’ll adapt your approach of using prctl, although
>>>> it really limits the benefit of this mechanism.
>>>
>>> It's possible we could get away with adding the prctl but making the
>>> default be that only the bitness that matches the program being run is
>>> allowed. After all, it's possible that CRIU is literally the only
>>> program that switches bitness using the GDT. (DOSEMU2 definitely does
>>> cross-bitness stuff, but it uses the LDT as far as I know.) And I've
>>> never been entirely sure that CRIU fully counts toward the Linux
>>> "don't break ABI" guarantee.
>>>
>>> Linus, how would you feel about, by default, preventing 64-bit
>>> programs from long-jumping to __USER32_CS and vice versa? I think it
>>> has some value as a hardening measure. I've certainly engaged in some
>>> exploit shenanigans myself that took advantage of the ability to long
>>> jump/ret to change bitness at will. This wouldn't affect users of
>>> modify_ldt() -- 64-bit programs could still create and use their own
>>> private 32-bit segments with modify_ldt(), and seccomp can (and
>>> should!) prevent that in sandboxed programs.
>>>
>>> In general, I prefer an approach where everything is explicit to an
>>> approach where we almost, but not quite, emulate the weird historical
>>> behavior.
>>>
>>> Pavel and Cyrill, how annoying would it be if CRIU had to do an extra
>>> arch_prctl() to enable its cross-bitness shenanigans when
>>> checkpointing and restoring a 32-bit program?
>>
>> I think this should not be a problem for criu (CC'ing Dima, who has
>> been working on compat mode support in criu). As far as I remember
>> we initiate restoring of 32 bit tasks in native 64 bit mode (well,
>> ia32e to be precise :) mode and then, once everything is ready,
>> we changing the mode by doing a return to __USER32_CS descriptor.
>> So this won't be painful to add additional prctl call here.
>
> Yeah, restoring will still be easy..
> But checkpointing will be harder if we can't switch to 64-bit mode.
> ATM we have one 64-bit parasite binary, which does all seizing job
> for both 64 and 32 bit binaries.
> So, if you can't switch back to 64-bit from 32-bit mode, we'll need
> to keep two parasites.

I can allow switching back and forth by dynamically enabling/disabling PTI.
Andy, Dave, do you think that makes it a viable option? Should I respin
another version of the patch set?