2018-07-10 22:35:49

by Yu-cheng Yu

Subject: [RFC PATCH v2 00/27] Control Flow Enforcement (CET)

The first version of the CET patches was divided into four series,
which can be found at the following links:

https://lkml.org/lkml/2018/6/7/795
https://lkml.org/lkml/2018/6/7/782
https://lkml.org/lkml/2018/6/7/771
https://lkml.org/lkml/2018/6/7/739

Summary of changes in v2:

Small fixes in the XSAVES patches.
Improve Documentation/x86/intel_cet.txt.
Remove TLB flushing from ptep/pmdp_set_wrprotect for SHSTK.
Use shadow stack restore token for signals; save SHSTK pointer after FPU.
Rewrite ELF header parsing.
Backward compatibility is now handled via GLIBC tunables.
Add a new patch modifying can_follow_write_pte/pmd for SHSTK.
Remove blocking of mremap/madvise/munmap.
Add Makefile checks for assembler/compiler CET support.

H.J. Lu (1):
x86: Insert endbr32/endbr64 to vDSO

Yu-cheng Yu (26):
x86/cpufeatures: Add CPUIDs for Control-flow Enforcement Technology
(CET)
x86/fpu/xstate: Change some names to separate XSAVES system and user
states
x86/fpu/xstate: Enable XSAVES system states
x86/fpu/xstate: Add XSAVES system states for shadow stack
Documentation/x86: Add CET description
x86/cet: Control protection exception handler
x86/cet/shstk: Add Kconfig option for user-mode shadow stack
mm: Introduce VM_SHSTK for shadow stack memory
x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
x86/mm: Introduce _PAGE_DIRTY_SW
x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for
_PAGE_DIRTY_SW
x86/mm: Shadow stack page fault error checking
mm: Handle shadow stack page fault
mm: Handle THP/HugeTLB shadow stack page fault
mm/mprotect: Prevent mprotect from changing shadow stack
mm: Modify can_follow_write_pte/pmd for shadow stack
x86/cet/shstk: User-mode shadow stack support
x86/cet/shstk: Introduce WRUSS instruction
x86/cet/shstk: Signal handling for shadow stack
x86/cet/shstk: ELF header parsing of CET
x86/cet/ibt: Add Kconfig option for user-mode Indirect Branch Tracking
x86/cet/ibt: User-mode indirect branch tracking support
mm/mmap: Add IBT bitmap size to address space limit check
x86/cet: Add PTRACE interface for CET
x86/cet/shstk: Handle thread shadow stack
x86/cet: Add arch_prctl functions for CET

.../admin-guide/kernel-parameters.txt | 6 +
Documentation/x86/intel_cet.txt | 250 ++++++++++++
arch/x86/Kconfig | 40 ++
arch/x86/Makefile | 14 +
arch/x86/entry/entry_64.S | 2 +-
arch/x86/entry/vdso/.gitignore | 4 +
arch/x86/entry/vdso/Makefile | 12 +-
arch/x86/entry/vdso/vdso-layout.lds.S | 1 +
arch/x86/ia32/ia32_signal.c | 13 +
arch/x86/include/asm/cet.h | 50 +++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 16 +-
arch/x86/include/asm/elf.h | 5 +
arch/x86/include/asm/fpu/internal.h | 6 +-
arch/x86/include/asm/fpu/regset.h | 7 +-
arch/x86/include/asm/fpu/types.h | 22 +
arch/x86/include/asm/fpu/xstate.h | 31 +-
arch/x86/include/asm/mmu_context.h | 3 +
arch/x86/include/asm/msr-index.h | 14 +
arch/x86/include/asm/pgtable.h | 135 ++++++-
arch/x86/include/asm/pgtable_types.h | 31 +-
arch/x86/include/asm/processor.h | 5 +
arch/x86/include/asm/sighandling.h | 5 +
arch/x86/include/asm/special_insns.h | 45 +++
arch/x86/include/asm/traps.h | 5 +
arch/x86/include/uapi/asm/elf_property.h | 16 +
arch/x86/include/uapi/asm/prctl.h | 6 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/include/uapi/asm/resource.h | 5 +
arch/x86/include/uapi/asm/sigcontext.h | 17 +
arch/x86/kernel/Makefile | 4 +
arch/x86/kernel/cet.c | 375 ++++++++++++++++++
arch/x86/kernel/cet_prctl.c | 141 +++++++
arch/x86/kernel/cpu/common.c | 42 ++
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kernel/elf.c | 280 +++++++++++++
arch/x86/kernel/fpu/core.c | 11 +-
arch/x86/kernel/fpu/init.c | 10 -
arch/x86/kernel/fpu/regset.c | 41 ++
arch/x86/kernel/fpu/signal.c | 6 +-
arch/x86/kernel/fpu/xstate.c | 152 ++++---
arch/x86/kernel/idt.c | 4 +
arch/x86/kernel/process.c | 10 +
arch/x86/kernel/process_64.c | 7 +
arch/x86/kernel/ptrace.c | 16 +
arch/x86/kernel/relocate_kernel_64.S | 2 +-
arch/x86/kernel/signal.c | 96 +++++
arch/x86/kernel/traps.c | 58 +++
arch/x86/kvm/vmx.c | 2 +-
arch/x86/lib/x86-opcode-map.txt | 2 +-
arch/x86/mm/fault.c | 24 +-
fs/binfmt_elf.c | 16 +
fs/proc/task_mmu.c | 3 +
include/asm-generic/pgtable.h | 21 +
include/linux/mm.h | 8 +
include/uapi/asm-generic/resource.h | 3 +
include/uapi/linux/elf.h | 2 +
mm/gup.c | 11 +-
mm/huge_memory.c | 18 +-
mm/internal.h | 8 +
mm/memory.c | 38 +-
mm/mmap.c | 12 +-
mm/mprotect.c | 9 +
tools/objtool/arch/x86/lib/x86-opcode-map.txt | 2 +-
64 files changed, 2070 insertions(+), 135 deletions(-)
create mode 100644 Documentation/x86/intel_cet.txt
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/include/uapi/asm/elf_property.h
create mode 100644 arch/x86/kernel/cet.c
create mode 100644 arch/x86/kernel/cet_prctl.c
create mode 100644 arch/x86/kernel/elf.c

--
2.17.1



2018-07-10 22:32:14

by Yu-cheng Yu

Subject: [RFC PATCH v2 27/27] x86/cet: Add arch_prctl functions for CET

arch_prctl(ARCH_CET_STATUS, unsigned long *addr)
Return CET feature status.

The parameter 'addr' is a pointer to a user buffer.
On returning to the caller, the kernel fills the following
information:

*addr = SHSTK/IBT status
*(addr + 1) = SHSTK base address
*(addr + 2) = SHSTK size

arch_prctl(ARCH_CET_DISABLE, unsigned long features)
Disable SHSTK and/or IBT specified in 'features'. Return -EPERM
if CET is locked out.

arch_prctl(ARCH_CET_LOCK)
Lock out the CET features: subsequent ARCH_CET_DISABLE calls
fail with -EPERM.

arch_prctl(ARCH_CET_ALLOC_SHSTK, unsigned long *addr)
Allocate a new SHSTK.

The parameter 'addr' is a pointer to a user buffer that, on
input, holds the desired SHSTK size. On returning to the caller,
the buffer contains the address of the new SHSTK.

arch_prctl(ARCH_CET_LEGACY_BITMAP, unsigned long *addr)
Allocate an IBT legacy code bitmap if the current task does not
have one.

The parameter 'addr' is a pointer to a user buffer.
On returning to the caller, the kernel fills the following
information:

*addr = IBT bitmap base address
*(addr + 1) = IBT bitmap size
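
As an illustration only (a sketch, not part of the patch: the ARCH_CET_*
values come from the uapi asm/prctl.h added here, the GNU_PROPERTY_* bits
from asm/elf_property.h, and error handling is omitted), a program could
query its CET status and then lock the configuration:

    #include <asm/prctl.h>          /* ARCH_CET_* */
    #include <asm/elf_property.h>   /* GNU_PROPERTY_X86_FEATURE_1_* */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned long buf[3];

            /* buf[0] = features, buf[1] = SHSTK base, buf[2] = SHSTK size */
            if (!syscall(SYS_arch_prctl, ARCH_CET_STATUS, buf) &&
                (buf[0] & GNU_PROPERTY_X86_FEATURE_1_SHSTK))
                    printf("SHSTK at %#lx, %lu bytes\n", buf[1], buf[2]);

            /* no further ARCH_CET_DISABLE is allowed after this */
            return syscall(SYS_arch_prctl, ARCH_CET_LOCK, 0);
    }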

Signed-off-by: H.J. Lu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/cet.h | 5 ++
arch/x86/include/uapi/asm/prctl.h | 6 ++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/cet.c | 26 ++++++
arch/x86/kernel/cet_prctl.c | 141 ++++++++++++++++++++++++++++++
arch/x86/kernel/elf.c | 4 +
arch/x86/kernel/process.c | 6 ++
7 files changed, 189 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cet_prctl.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index d5737f3346f2..50b5284c6667 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -16,11 +16,14 @@ struct cet_status {
unsigned long ibt_bitmap_size;
unsigned int shstk_enabled:1;
unsigned int ibt_enabled:1;
+ unsigned int locked:1;
};

#ifdef CONFIG_X86_INTEL_CET
+int prctl_cet(int option, unsigned long arg2);
int cet_setup_shstk(void);
int cet_setup_thread_shstk(struct task_struct *p);
+int cet_alloc_shstk(unsigned long *arg);
void cet_disable_shstk(void);
void cet_disable_free_shstk(struct task_struct *p);
int cet_restore_signal(unsigned long ssp);
@@ -29,8 +32,10 @@ int cet_setup_ibt(void);
int cet_setup_ibt_bitmap(void);
void cet_disable_ibt(void);
#else
+static inline int prctl_cet(int option, unsigned long arg2) { return 0; }
static inline int cet_setup_shstk(void) { return 0; }
static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
+static inline int cet_alloc_shstk(unsigned long *arg) { return -EINVAL; }
static inline void cet_disable_shstk(void) {}
static inline void cet_disable_free_shstk(struct task_struct *p) {}
static inline int cet_restore_signal(unsigned long ssp) { return 0; }
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..52ec04e443c5 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,10 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003

+#define ARCH_CET_STATUS 0x3001
+#define ARCH_CET_DISABLE 0x3002
+#define ARCH_CET_LOCK 0x3003
+#define ARCH_CET_LEGACY_BITMAP 0x3004
+#define ARCH_CET_ALLOC_SHSTK 0x3005
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 36b14ef410c8..b9e6cdc6b4f7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -139,7 +139,7 @@ obj-$(CONFIG_UNWINDER_ORC) += unwind_orc.o
obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o

-obj-$(CONFIG_X86_INTEL_CET) += cet.o
+obj-$(CONFIG_X86_INTEL_CET) += cet.o cet_prctl.o

obj-$(CONFIG_ARCH_HAS_PROGRAM_PROPERTIES) += elf.o

diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 2a366a5ccf20..20ce9ac8a0df 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -132,6 +132,32 @@ static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
return addr;
}

+int cet_alloc_shstk(unsigned long *arg)
+{
+ unsigned long len = *arg;
+ unsigned long addr;
+ unsigned long token;
+ unsigned long ssp;
+
+ addr = shstk_mmap(0, len);
+ if (addr >= TASK_SIZE_MAX)
+ return -ENOMEM;
+
+ /* Restore token is 8 bytes and aligned to 8 bytes */
+ ssp = addr + len;
+ token = ssp;
+
+ if (!in_ia32_syscall())
+ token |= 1;
+ ssp -= 8;
+
+ if (write_user_shstk_64(ssp, token))
+ return -EINVAL;
+
+ *arg = addr;
+ return 0;
+}
+
int cet_setup_shstk(void)
{
unsigned long addr, size;
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
new file mode 100644
index 000000000000..86bb78ae656d
--- /dev/null
+++ b/arch/x86/kernel/cet_prctl.c
@@ -0,0 +1,141 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/errno.h>
+#include <linux/uaccess.h>
+#include <linux/prctl.h>
+#include <linux/compat.h>
+#include <asm/processor.h>
+#include <asm/prctl.h>
+#include <asm/elf.h>
+#include <asm/elf_property.h>
+#include <asm/cet.h>
+
+/* See Documentation/x86/intel_cet.txt. */
+
+static int handle_get_status(unsigned long arg2)
+{
+ unsigned int features = 0;
+ unsigned long shstk_base, shstk_size;
+
+ if (current->thread.cet.shstk_enabled)
+ features |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
+ if (current->thread.cet.ibt_enabled)
+ features |= GNU_PROPERTY_X86_FEATURE_1_IBT;
+
+ shstk_base = current->thread.cet.shstk_base;
+ shstk_size = current->thread.cet.shstk_size;
+
+ if (in_ia32_syscall()) {
+ unsigned int buf[3];
+
+ buf[0] = features;
+ buf[1] = (unsigned int)shstk_base;
+ buf[2] = (unsigned int)shstk_size;
+ return copy_to_user((unsigned int __user *)arg2, buf,
+ sizeof(buf)) ? -EFAULT : 0;
+ } else {
+ unsigned long buf[3];
+
+ buf[0] = (unsigned long)features;
+ buf[1] = shstk_base;
+ buf[2] = shstk_size;
+ return copy_to_user((unsigned long __user *)arg2, buf,
+ sizeof(buf)) ? -EFAULT : 0;
+ }
+}
+
+static int handle_alloc_shstk(unsigned long arg2)
+{
+ int err = 0;
+ unsigned long shstk_size = 0;
+
+ if (in_ia32_syscall()) {
+ unsigned int size;
+
+ err = get_user(size, (unsigned int __user *)arg2);
+ if (!err)
+ shstk_size = size;
+ } else {
+ err = get_user(shstk_size, (unsigned long __user *)arg2);
+ }
+
+ if (err)
+ return -EFAULT;
+
+ err = cet_alloc_shstk(&shstk_size);
+ if (err)
+ return err;
+
+ if (in_ia32_syscall()) {
+ if (put_user(shstk_size, (unsigned int __user *)arg2))
+ return -EFAULT;
+ } else {
+ if (put_user(shstk_size, (unsigned long __user *)arg2))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+static int handle_bitmap(unsigned long arg2)
+{
+ unsigned long addr, size;
+
+ if (current->thread.cet.ibt_enabled) {
+ if (!current->thread.cet.ibt_bitmap_addr)
+ cet_setup_ibt_bitmap();
+ addr = current->thread.cet.ibt_bitmap_addr;
+ size = current->thread.cet.ibt_bitmap_size;
+ } else {
+ addr = 0;
+ size = 0;
+ }
+
+ if (in_compat_syscall()) {
+ if (put_user(addr, (unsigned int __user *)arg2) ||
+ put_user(size, (unsigned int __user *)arg2 + 1))
+ return -EFAULT;
+ } else {
+ if (put_user(addr, (unsigned long __user *)arg2) ||
+ put_user(size, (unsigned long __user *)arg2 + 1))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+int prctl_cet(int option, unsigned long arg2)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+ !cpu_feature_enabled(X86_FEATURE_IBT))
+ return -EINVAL;
+
+ switch (option) {
+ case ARCH_CET_STATUS:
+ return handle_get_status(arg2);
+
+ case ARCH_CET_DISABLE:
+ if (current->thread.cet.locked)
+ return -EPERM;
+ if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+ cet_disable_free_shstk(current);
+ if (arg2 & GNU_PROPERTY_X86_FEATURE_1_IBT)
+ cet_disable_ibt();
+
+ return 0;
+
+ case ARCH_CET_LOCK:
+ current->thread.cet.locked = 1;
+ return 0;
+
+ case ARCH_CET_ALLOC_SHSTK:
+ return handle_alloc_shstk(arg2);
+
+ /*
+ * Allocate legacy bitmap and return address & size to user.
+ */
+ case ARCH_CET_LEGACY_BITMAP:
+ return handle_bitmap(arg2);
+
+ default:
+ return -EINVAL;
+ }
+}
diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
index 42e08d3b573e..3d4934fdac7f 100644
--- a/arch/x86/kernel/elf.c
+++ b/arch/x86/kernel/elf.c
@@ -8,7 +8,10 @@

#include <asm/cet.h>
#include <asm/elf_property.h>
+#include <asm/prctl.h>
+#include <asm/processor.h>
#include <uapi/linux/elf-em.h>
+#include <uapi/linux/prctl.h>
#include <linux/binfmts.h>
#include <linux/elf.h>
#include <linux/slab.h>
@@ -255,6 +258,7 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,
current->thread.cet.ibt_enabled = 0;
current->thread.cet.ibt_bitmap_addr = 0;
current->thread.cet.ibt_bitmap_size = 0;
+ current->thread.cet.locked = 0;
if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
if (shstk) {
err = cet_setup_shstk();
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 43a57d284a22..259b92664981 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -795,6 +795,12 @@ long do_arch_prctl_common(struct task_struct *task, int option,
return get_cpuid_mode();
case ARCH_SET_CPUID:
return set_cpuid_mode(task, cpuid_enabled);
+ case ARCH_CET_STATUS:
+ case ARCH_CET_DISABLE:
+ case ARCH_CET_LOCK:
+ case ARCH_CET_ALLOC_SHSTK:
+ case ARCH_CET_LEGACY_BITMAP:
+ return prctl_cet(option, cpuid_enabled);
}

return -EINVAL;
--
2.17.1


2018-07-10 22:32:23

by Yu-cheng Yu

Subject: [RFC PATCH v2 26/27] x86/cet/shstk: Handle thread shadow stack

The shadow stack for clone/fork is handled as follows:

(1) If ((clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM),
the kernel allocates (and frees on thread exit) a new SHSTK
for the child.

It is possible for the kernel to complete the clone syscall
with the child's SHSTK pointer set to NULL and let the child
allocate a SHSTK for itself. There are two issues with that
approach: it is not compatible with existing code that makes
the clone syscall inline, and the child cannot handle signals
before it has successfully allocated a SHSTK.

(2) For (clone_flags & CLONE_VFORK), the child uses the existing
SHSTK.

(3) For all other cases, the SHSTK pages are shared and then
copied when the parent or the child writes them with a
call/ret.

This patch handles cases (1) & (2). Case (3) is handled in
the SHSTK page fault patches.
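
For reference, a sketch of how common library calls map to these cases
(the clone flag combinations are those conventionally used by glibc and
are listed only for illustration):

    fork():           SIGCHLD (no CLONE_VM)         -> case (3)
    vfork():          CLONE_VFORK | CLONE_VM | ...  -> case (2)
    pthread_create(): CLONE_VM | CLONE_THREAD | ... -> case (1), new SHSTK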

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/cet.h | 2 ++
arch/x86/include/asm/mmu_context.h | 3 +++
arch/x86/kernel/cet.c | 33 ++++++++++++++++++++++++++++++
arch/x86/kernel/process.c | 1 +
arch/x86/kernel/process_64.c | 7 +++++++
5 files changed, 46 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 71da2cccba16..d5737f3346f2 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -20,6 +20,7 @@ struct cet_status {

#ifdef CONFIG_X86_INTEL_CET
int cet_setup_shstk(void);
+int cet_setup_thread_shstk(struct task_struct *p);
void cet_disable_shstk(void);
void cet_disable_free_shstk(struct task_struct *p);
int cet_restore_signal(unsigned long ssp);
@@ -29,6 +30,7 @@ int cet_setup_ibt_bitmap(void);
void cet_disable_ibt(void);
#else
static inline int cet_setup_shstk(void) { return 0; }
+static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
static inline void cet_disable_shstk(void) {}
static inline void cet_disable_free_shstk(struct task_struct *p) {}
static inline int cet_restore_signal(unsigned long ssp) { return 0; }
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index bbc796eb0a3b..662755048598 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -13,6 +13,7 @@
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
#include <asm/mpx.h>
+#include <asm/cet.h>

extern atomic64_t last_mm_ctx_id;

@@ -228,6 +229,8 @@ do { \
#else
#define deactivate_mm(tsk, mm) \
do { \
+ if (!tsk->vfork_done) \
+ cet_disable_free_shstk(tsk); \
load_gs_index(0); \
loadsegment(fs, 0); \
} while (0)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 8bbd63e1a2ba..2a366a5ccf20 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -155,6 +155,39 @@ int cet_setup_shstk(void)
return 0;
}

+int cet_setup_thread_shstk(struct task_struct *tsk)
+{
+ unsigned long addr, size;
+ struct cet_user_state *state;
+
+ if (!current->thread.cet.shstk_enabled)
+ return 0;
+
+ state = get_xsave_addr(&tsk->thread.fpu.state.xsave,
+ XFEATURE_MASK_SHSTK_USER);
+
+ if (!state)
+ return -EINVAL;
+
+ size = tsk->thread.cet.shstk_size;
+ if (size == 0)
+ size = in_ia32_syscall() ? SHSTK_SIZE_32 : SHSTK_SIZE_64;
+
+ addr = shstk_mmap(0, size);
+
+ if (addr >= TASK_SIZE_MAX) {
+ tsk->thread.cet.shstk_base = 0;
+ tsk->thread.cet.shstk_size = 0;
+ tsk->thread.cet.shstk_enabled = 0;
+ return -ENOMEM;
+ }
+
+ state->user_ssp = (u64)(addr + size - sizeof(u64));
+ tsk->thread.cet.shstk_base = addr;
+ tsk->thread.cet.shstk_size = size;
+ return 0;
+}
+
void cet_disable_shstk(void)
{
u64 r;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 309ebb7f9d8d..43a57d284a22 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -127,6 +127,7 @@ void exit_thread(struct task_struct *tsk)

free_vm86(t);

+ cet_disable_free_shstk(tsk);
fpu__drop(fpu);
}

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 12bb445fb98d..6e493b0bcedd 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -317,6 +317,13 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
if (sp)
childregs->sp = sp;

+ /* Allocate a new shadow stack for pthread */
+ if ((clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM) {
+ err = cet_setup_thread_shstk(p);
+ if (err)
+ goto out;
+ }
+
err = -ENOMEM;
if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
p->thread.io_bitmap_ptr = kmemdup(me->thread.io_bitmap_ptr,
--
2.17.1


2018-07-10 22:32:32

by Yu-cheng Yu

Subject: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET

Add a PTRACE interface (regset NT_X86_CET) for the CET MSR state.
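
For illustration, a debugger could read a tracee's CET state through the
new regset with PTRACE_GETREGSET. A minimal sketch, assuming NT_X86_CET
from this patch and struct cet_user_state from the earlier XSAVES patches
are visible to user space:

    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    static void dump_cet(pid_t pid)
    {
            struct cet_user_state cet;
            struct iovec iov = { .iov_base = &cet, .iov_len = sizeof(cet) };

            if (!ptrace(PTRACE_GETREGSET, pid, (void *)NT_X86_CET, &iov))
                    printf("tracee SSP: %#llx\n", cet.user_ssp);
    }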

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/fpu/regset.h | 7 +++---
arch/x86/kernel/fpu/regset.c | 41 +++++++++++++++++++++++++++++++
arch/x86/kernel/ptrace.c | 16 ++++++++++++
include/uapi/linux/elf.h | 1 +
4 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index d5bdffb9d27f..edad0d889084 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@

#include <linux/regset.h>

-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+ cetregs_active;
extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
- xstateregs_get;
+ xstateregs_get, cetregs_get;
extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
- xstateregs_set;
+ xstateregs_set, cetregs_set;

/*
* xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index bc02f5144b95..7008eb084d36 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -160,6 +160,47 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
return ret;
}

+int cetregs_active(struct task_struct *target, const struct user_regset *regset)
+{
+#ifdef CONFIG_X86_INTEL_CET
+ if (target->thread.cet.shstk_enabled || target->thread.cet.ibt_enabled)
+ return regset->n;
+#endif
+ return 0;
+}
+
+int cetregs_get(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ void *kbuf, void __user *ubuf)
+{
+ struct fpu *fpu = &target->thread.fpu;
+ struct cet_user_state *cetregs;
+
+ if (!boot_cpu_has(X86_FEATURE_SHSTK))
+ return -ENODEV;
+
+ cetregs = get_xsave_addr(&fpu->state.xsave, XFEATURE_MASK_SHSTK_USER);
+
+ fpu__prepare_read(fpu);
+ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, cetregs, 0, -1);
+}
+
+int cetregs_set(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ struct fpu *fpu = &target->thread.fpu;
+ struct cet_user_state *cetregs;
+
+ if (!boot_cpu_has(X86_FEATURE_SHSTK))
+ return -ENODEV;
+
+ cetregs = get_xsave_addr(&fpu->state.xsave, XFEATURE_MASK_SHSTK_USER);
+
+ fpu__prepare_write(fpu);
+ return user_regset_copyin(&pos, &count, &kbuf, &ubuf, cetregs, 0, -1);
+}
+
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION

/*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index e2ee403865eb..ac2bc3a18427 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -49,7 +49,9 @@ enum x86_regset {
REGSET_IOPERM64 = REGSET_XFP,
REGSET_XSTATE,
REGSET_TLS,
+ REGSET_CET64 = REGSET_TLS,
REGSET_IOPERM32,
+ REGSET_CET32,
};

struct pt_regs_offset {
@@ -1276,6 +1278,13 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
.size = sizeof(long), .align = sizeof(long),
.active = ioperm_active, .get = ioperm_get
},
+ [REGSET_CET64] = {
+ .core_note_type = NT_X86_CET,
+ .n = sizeof(struct cet_user_state) / sizeof(u64),
+ .size = sizeof(u64), .align = sizeof(u64),
+ .active = cetregs_active, .get = cetregs_get,
+ .set = cetregs_set
+ },
};

static const struct user_regset_view user_x86_64_view = {
@@ -1331,6 +1340,13 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
.size = sizeof(u32), .align = sizeof(u32),
.active = ioperm_active, .get = ioperm_get
},
+ [REGSET_CET32] = {
+ .core_note_type = NT_X86_CET,
+ .n = sizeof(struct cet_user_state) / sizeof(u64),
+ .size = sizeof(u64), .align = sizeof(u64),
+ .active = cetregs_active, .get = cetregs_get,
+ .set = cetregs_set
+ },
};

static const struct user_regset_view user_x86_32_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index dc93982b9664..0898ba719fd7 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -401,6 +401,7 @@ typedef struct elf64_shdr {
#define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */
#define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */
#define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
+#define NT_X86_CET 0x203 /* x86 cet state */
#define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
#define NT_S390_TIMER 0x301 /* s390 timer register */
#define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
--
2.17.1


2018-07-10 22:32:51

by Yu-cheng Yu

Subject: [RFC PATCH v2 23/27] mm/mmap: Add IBT bitmap size to address space limit check

The indirect branch tracking legacy code bitmap takes up a large
address range. This can cause may_expand_vm() to fail the address
space limit check. For an IBT-enabled task, add the bitmap size
to the address space limit.
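
To see the scale (a sketch, assuming 4-level paging where TASK_SIZE_MAX
is about 2^47): the bitmap needs one bit per user page, so

    2^47 bytes / 4096 bytes per page / 8 bits per byte = 4 GiB

which by itself can exceed RLIMIT_AS even though most of the bitmap is
never populated.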

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/uapi/asm/resource.h | 5 +++++
include/uapi/asm-generic/resource.h | 3 +++
mm/mmap.c | 12 +++++++++++-
3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/resource.h b/arch/x86/include/uapi/asm/resource.h
index 04bc4db8921b..0741b2a6101a 100644
--- a/arch/x86/include/uapi/asm/resource.h
+++ b/arch/x86/include/uapi/asm/resource.h
@@ -1 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifdef CONFIG_X86_INTEL_CET
+#define rlimit_as_extra() current->thread.cet.ibt_bitmap_size
+#endif
+
#include <asm-generic/resource.h>
diff --git a/include/uapi/asm-generic/resource.h b/include/uapi/asm-generic/resource.h
index f12db7a0da64..8a7608a09700 100644
--- a/include/uapi/asm-generic/resource.h
+++ b/include/uapi/asm-generic/resource.h
@@ -58,5 +58,8 @@
# define RLIM_INFINITY (~0UL)
#endif

+#ifndef rlimit_as_extra
+#define rlimit_as_extra() 0
+#endif

#endif /* _UAPI_ASM_GENERIC_RESOURCE_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87ef4b1a..fad41b291ae1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3242,7 +3242,17 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
*/
bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages)
{
- if (mm->total_vm + npages > rlimit(RLIMIT_AS) >> PAGE_SHIFT)
+ unsigned long as_limit = rlimit(RLIMIT_AS);
+ unsigned long as_limit_plus = as_limit + rlimit_as_extra();
+
+ /* as_limit_plus overflowed */
+ if (as_limit_plus < as_limit)
+ as_limit_plus = RLIM_INFINITY;
+
+ if (as_limit_plus > as_limit)
+ as_limit = as_limit_plus;
+
+ if (mm->total_vm + npages > as_limit >> PAGE_SHIFT)
return false;

if (is_data_mapping(flags) &&
--
2.17.1


2018-07-10 22:32:54

by Yu-cheng Yu

Subject: [RFC PATCH v2 24/27] x86: Insert endbr32/endbr64 to vDSO

From: "H.J. Lu" <[email protected]>

When Intel indirect branch tracking is enabled, vDSO functions that
may be called indirectly must have endbr32 or endbr64 as their first
instruction. The compiler must support -fcf-protection=branch so
that it can be used to compile the vDSO.
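
As an illustration (not part of the diff below), an indirectly callable
vDSO function then begins with the ENDBR marker:

    __vdso_clock_gettime:
            endbr64         # first instruction; marks a legal indirect-branch target
            ...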

Signed-off-by: H.J. Lu <[email protected]>
---
arch/x86/entry/vdso/.gitignore | 4 ++++
arch/x86/entry/vdso/Makefile | 12 +++++++++++-
arch/x86/entry/vdso/vdso-layout.lds.S | 1 +
3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/.gitignore b/arch/x86/entry/vdso/.gitignore
index aae8ffdd5880..552941fdfae0 100644
--- a/arch/x86/entry/vdso/.gitignore
+++ b/arch/x86/entry/vdso/.gitignore
@@ -5,3 +5,7 @@ vdso32-sysenter-syms.lds
vdso32-int80-syms.lds
vdso-image-*.c
vdso2c
+vclock_gettime.S
+vgetcpu.S
+vclock_gettime.asm
+vgetcpu.asm
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 261802b1cc50..d49548ebec6f 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -108,13 +108,17 @@ vobjx32s := $(foreach F,$(vobjx32s-y),$(obj)/$F)

# Convert 64bit object file to x32 for x32 vDSO.
quiet_cmd_x32 = X32 $@
- cmd_x32 = $(OBJCOPY) -O elf32-x86-64 $< $@
+ cmd_x32 = $(OBJCOPY) -R .note.gnu.property -O elf32-x86-64 $< $@

$(obj)/%-x32.o: $(obj)/%.o FORCE
$(call if_changed,x32)

targets += vdsox32.lds $(vobjx32s-y)

+ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+ $(obj)/vclock_gettime.o $(obj)/vgetcpu.o $(obj)/vdso32/vclock_gettime.o: KBUILD_CFLAGS += -fcf-protection=branch
+endif
+
$(obj)/%.so: OBJCOPYFLAGS := -S
$(obj)/%.so: $(obj)/%.so.dbg
$(call if_changed,objcopy)
@@ -164,6 +168,12 @@ quiet_cmd_vdso = VDSO $@

VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=both) \
$(call cc-ldoption, -Wl$(comma)--build-id) -Wl,-Bsymbolic $(LTO_CFLAGS)
+ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+ VDSO_LDFLAGS += $(call cc-ldoption, -Wl$(comma)-z$(comma)ibt)
+endif
+ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+ VDSO_LDFLAGS += $(call cc-ldoption, -Wl$(comma)-z$(comma)shstk)
+endif
GCOV_PROFILE := n

#
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index acfd5ba7d943..cabaeedfed78 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -74,6 +74,7 @@ SECTIONS
.fake_shstrtab : { *(.fake_shstrtab) } :text


+ .note.gnu.property : { *(.note.gnu.property) } :text :note
.note : { *(.note.*) } :text :note

.eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr
--
2.17.1


2018-07-10 22:33:05

by Yu-cheng Yu

Subject: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

WRUSS is a new kernel-mode instruction that writes directly
to user shadow stack memory. It is used to construct a return
address on the shadow stack for the signal handler.

This instruction can fault if the target address is not valid
shadow stack memory. In that case, the kernel performs an
exception fixup.
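
For example (a sketch mirroring cet_setup_signal() in the signal patch of
this series), the helper added here pushes the handler's restore address
onto the user shadow stack:

    ssp = *new_ssp - sizeof(u64);
    err = write_user_shstk_64(ssp, rstor_addr);  /* WRUSS under the hood */
    if (err)
            return err;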

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/special_insns.h | 45 +++++++++++++++++++
arch/x86/lib/x86-opcode-map.txt | 2 +-
arch/x86/mm/fault.c | 13 +++++-
tools/objtool/arch/x86/lib/x86-opcode-map.txt | 2 +-
4 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 317fc59b512c..c69d8d6b457f 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -237,6 +237,51 @@ static inline void clwb(volatile void *__p)
: [pax] "a" (p));
}

+#ifdef CONFIG_X86_INTEL_CET
+
+#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+ int err;
+
+ asm volatile("1: wrussd %[val], (%[addr])\n"
+ "xor %[err], %[err]\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: mov $-1, %[err]; jmp 2b\n"
+ ".previous\n"
+ _ASM_EXTABLE(1b, 3b)
+ : [err] "=a" (err)
+ : [val] "S" (val), [addr] "D" (addr));
+
+ return err;
+}
+#else
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+ BUG();
+ return 0;
+}
+#endif
+
+static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
+{
+ int err = 0;
+
+ asm volatile("1: wrussq %[val], (%[addr])\n"
+ "xor %[err], %[err]\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: mov $-1, %[err]; jmp 2b\n"
+ ".previous\n"
+ _ASM_EXTABLE(1b, 3b)
+ : [err] "=a" (err)
+ : [val] "S" (val), [addr] "D" (addr));
+
+ return err;
+}
+#endif /* CONFIG_X86_INTEL_CET */
+
#define nop() asm volatile ("nop")


diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index e0b85930dd77..72bb7c48a7df 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
f2: ANDN Gy,By,Ey (v)
f3: Grp17 (1A)
-f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
+f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
EndTable
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fcd5739151f9..92f178b8b598 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,6 +641,17 @@ static int is_f00f_bug(struct pt_regs *regs, unsigned long address)
return 0;
}

+/*
+ * WRUSS is a kernel-mode instruction but writes to user
+ * shadow stack memory. When a fault occurs, both
+ * X86_PF_USER and X86_PF_SHSTK are set.
+ */
+static int is_wruss(struct pt_regs *regs, unsigned long error_code)
+{
+ return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
+ (X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
+}
+
static void
show_fault_oops(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
@@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
struct task_struct *tsk = current;

/* User mode accesses just cause a SIGSEGV */
- if (error_code & X86_PF_USER) {
+ if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
/*
* It's possible to have interrupts off here:
*/
diff --git a/tools/objtool/arch/x86/lib/x86-opcode-map.txt b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
index e0b85930dd77..72bb7c48a7df 100644
--- a/tools/objtool/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
@@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
f2: ANDN Gy,By,Ey (v)
f3: Grp17 (1A)
-f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
+f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
EndTable
--
2.17.1


2018-07-10 22:33:12

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 19/27] x86/cet/shstk: Signal handling for shadow stack

When setting up a signal, the kernel creates a shadow stack
restore token at the current SHSTK address and then stores the
token's address in the signal frame, right after the FPU state.
Before restoring a signal, the kernel verifies and then uses the
restore token to set the SHSTK pointer.
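
For reference, the resulting layout at the end of the signal frame's FPU
area (a sketch; struct sc_ext is defined in this patch):

    sigcontext->fpstate:
            XSAVE area (fpu_user_xstate_size bytes)
            FP_XSTATE_MAGIC2
            <pad to an 8-byte boundary>
            struct sc_ext {
                    unsigned long total_size;  /* sizeof(struct sc_ext) */
                    unsigned long ssp;         /* points at the restore token */
            };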

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/ia32/ia32_signal.c | 13 +++
arch/x86/include/asm/cet.h | 5 ++
arch/x86/include/asm/sighandling.h | 5 ++
arch/x86/include/uapi/asm/sigcontext.h | 17 ++++
arch/x86/kernel/cet.c | 115 +++++++++++++++++++++++++
arch/x86/kernel/signal.c | 96 +++++++++++++++++++++
6 files changed, 251 insertions(+)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 86b1341cba9a..cea28d2a946e 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,6 +34,7 @@
#include <asm/sigframe.h>
#include <asm/sighandling.h>
#include <asm/smap.h>
+#include <asm/cet.h>

/*
* Do a signal return; undo the signal stack.
@@ -108,6 +109,9 @@ static int ia32_restore_sigcontext(struct pt_regs *regs,

err |= fpu__restore_sig(buf, 1);

+ if (!err)
+ err = restore_sigcontext_ext(buf);
+
force_iret();

return err;
@@ -234,6 +238,10 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
if (fpu->initialized) {
unsigned long fx_aligned, math_size;

+ /* sigcontext extension */
+ if (boot_cpu_has(X86_FEATURE_SHSTK))
+ sp -= (sizeof(struct sc_ext) + 8);
+
sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
*fpstate = (struct _fpstate_32 __user *) sp;
if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
@@ -277,6 +285,8 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,

if (ia32_setup_sigcontext(&frame->sc, fpstate, regs, set->sig[0]))
return -EFAULT;
+ if (setup_sigcontext_ext(ksig, fpstate))
+ return -EFAULT;

if (_COMPAT_NSIG_WORDS > 1) {
if (__copy_to_user(frame->extramask, &set->sig[1],
@@ -384,6 +394,9 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig,
regs, set->sig[0]);
err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));

+ if (!err)
+ err = setup_sigcontext_ext(ksig, fpstate);
+
if (err)
return -EFAULT;

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index ad278c520414..d9ae3d86cdd7 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -19,10 +19,15 @@ struct cet_status {
int cet_setup_shstk(void);
void cet_disable_shstk(void);
void cet_disable_free_shstk(struct task_struct *p);
+int cet_restore_signal(unsigned long ssp);
+int cet_setup_signal(bool ia32, unsigned long rstor, unsigned long *new_ssp);
#else
static inline int cet_setup_shstk(void) { return 0; }
static inline void cet_disable_shstk(void) {}
static inline void cet_disable_free_shstk(struct task_struct *p) {}
+static inline int cet_restore_signal(unsigned long ssp) { return 0; }
+static inline int cet_setup_signal(bool ia32, unsigned long rstor,
+ unsigned long *new_ssp) { return 0; }
#endif

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/sighandling.h b/arch/x86/include/asm/sighandling.h
index bd26834724e5..23014b4082de 100644
--- a/arch/x86/include/asm/sighandling.h
+++ b/arch/x86/include/asm/sighandling.h
@@ -17,4 +17,9 @@ void signal_fault(struct pt_regs *regs, void __user *frame, char *where);
int setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate,
struct pt_regs *regs, unsigned long mask);

+#ifdef CONFIG_X86_64
+int setup_sigcontext_ext(struct ksignal *ksig, void __user *fpu);
+int restore_sigcontext_ext(void __user *fpu);
+#endif
+
#endif /* _ASM_X86_SIGHANDLING_H */
diff --git a/arch/x86/include/uapi/asm/sigcontext.h b/arch/x86/include/uapi/asm/sigcontext.h
index 844d60eb1882..74f5ea5dcd24 100644
--- a/arch/x86/include/uapi/asm/sigcontext.h
+++ b/arch/x86/include/uapi/asm/sigcontext.h
@@ -196,6 +196,23 @@ struct _xstate {
/* New processor state extensions go here: */
};

+#ifdef __x86_64__
+/*
+ * Sigcontext extension (struct sc_ext) is located after
+ * sigcontext->fpstate. Because currently only the shadow
+ * stack pointer is saved there and the shadow stack depends
+ * on XSAVES, we can find sc_ext from sigcontext->fpstate.
+ *
+ * The 64-bit fpstate has a size of fpu_user_xstate_size, plus
+ * FP_XSTATE_MAGIC2_SIZE when XSAVE* is used. The struct sc_ext
+ * is located at the end of sigcontext->fpstate, aligned to 8.
+ */
+struct sc_ext {
+ unsigned long total_size;
+ unsigned long ssp;
+};
+#endif
+
/*
* The 32-bit signal frame:
*/
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 96bf69db7da7..4eba7790c4e4 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -18,6 +18,7 @@
#include <asm/fpu/types.h>
#include <asm/compat.h>
#include <asm/cet.h>
+#include <asm/special_insns.h>

#define SHSTK_SIZE_64 (0x8000 * 8)
#define SHSTK_SIZE_32 (0x8000 * 4)
@@ -49,6 +50,69 @@ static unsigned long get_shstk_addr(void)
return ptr;
}

+/*
+ * Verify that the restore token at address 'ssp' is valid,
+ * then set the shadow stack pointer according to the
+ * token.
+ */
+static int verify_rstor_token(bool ia32, unsigned long ssp,
+ unsigned long *new_ssp)
+{
+ unsigned long token;
+
+ *new_ssp = 0;
+
+ if (!IS_ALIGNED(ssp, 8))
+ return -EINVAL;
+
+ if (get_user(token, (unsigned long __user *)ssp))
+ return -EFAULT;
+
+ /* Is 64-bit mode flag correct? */
+ if (ia32 && (token & 3) != 0)
+ return -EINVAL;
+ else if (!ia32 && (token & 3) != 1)
+ return -EINVAL;
+
+ token &= ~(1UL);
+
+ if ((!ia32 && !IS_ALIGNED(token, 8)) || !IS_ALIGNED(token, 4))
+ return -EINVAL;
+
+ if ((ALIGN_DOWN(token, 8) - 8) != ssp)
+ return -EINVAL;
+
+ *new_ssp = token;
+ return 0;
+}
+
+/*
+ * Create a restore token on the shadow stack.
+ * A token is always 8 bytes, aligned to an 8-byte boundary.
+ */
+static int create_rstor_token(bool ia32, unsigned long ssp,
+ unsigned long *new_ssp)
+{
+ unsigned long addr;
+
+ *new_ssp = 0;
+
+ if ((!ia32 && !IS_ALIGNED(ssp, 8)) || !IS_ALIGNED(ssp, 4))
+ return -EINVAL;
+
+ addr = ALIGN_DOWN(ssp, 8) - 8;
+
+ /* Is the token for 64-bit? */
+ if (!ia32)
+ ssp |= 1;
+
+ if (write_user_shstk_64(addr, ssp))
+ return -EFAULT;
+
+ *new_ssp = addr;
+ return 0;
+}
+
static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
{
struct mm_struct *mm = current->mm;
@@ -126,3 +190,54 @@ void cet_disable_free_shstk(struct task_struct *tsk)

tsk->thread.cet.shstk_enabled = 0;
}
+
+int cet_restore_signal(unsigned long ssp)
+{
+ unsigned long new_ssp;
+ int err;
+
+ if (!current->thread.cet.shstk_enabled)
+ return 0;
+
+ err = verify_rstor_token(in_ia32_syscall(), ssp, &new_ssp);
+
+ if (err)
+ return err;
+
+ return set_shstk_ptr(new_ssp);
+}
+
+/*
+ * Setup the shadow stack for the signal handler: first,
+ * create a restore token to keep track of the current ssp,
+ * and then the return address of the signal handler.
+ */
+int cet_setup_signal(bool ia32, unsigned long rstor_addr,
+ unsigned long *new_ssp)
+{
+ unsigned long ssp;
+ int err;
+
+ if (!current->thread.cet.shstk_enabled)
+ return 0;
+
+ ssp = get_shstk_addr();
+ err = create_rstor_token(ia32, ssp, new_ssp);
+
+ if (err)
+ return err;
+
+ if (ia32) {
+ ssp = *new_ssp - sizeof(u32);
+ err = write_user_shstk_32(ssp, (unsigned int)rstor_addr);
+ } else {
+ ssp = *new_ssp - sizeof(u64);
+ err = write_user_shstk_64(ssp, rstor_addr);
+ }
+
+ if (err)
+ return err;
+
+ set_shstk_ptr(ssp);
+ return 0;
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 92a3b312a53c..31f45d8d794a 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -46,6 +46,7 @@

#include <asm/sigframe.h>
#include <asm/signal.h>
+#include <asm/cet.h>

#define COPY(x) do { \
get_user_ex(regs->x, &sc->x); \
@@ -152,6 +153,10 @@ static int restore_sigcontext(struct pt_regs *regs,

err |= fpu__restore_sig(buf, IS_ENABLED(CONFIG_X86_32));

+#ifdef CONFIG_X86_64
+ err |= restore_sigcontext_ext(buf);
+#endif
+
force_iret();

return err;
@@ -266,6 +271,11 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
}

if (fpu->initialized) {
+#ifdef CONFIG_X86_64
+ /* sigcontext extension */
+ if (boot_cpu_has(X86_FEATURE_SHSTK))
+ sp -= sizeof(struct sc_ext) + 8;
+#endif
sp = fpu__alloc_mathframe(sp, IS_ENABLED(CONFIG_X86_32),
&buf_fx, &math_size);
*fpstate = (void __user *)sp;
@@ -493,6 +503,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
err |= setup_sigcontext(&frame->uc.uc_mcontext, fp, regs, set->sig[0]);
err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));

+ if (!err)
+ err = setup_sigcontext_ext(ksig, fp);
+
if (err)
return -EFAULT;

@@ -576,6 +589,9 @@ static int x32_setup_rt_frame(struct ksignal *ksig,
regs, set->sig[0]);
err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));

+ if (!err)
+ err = setup_sigcontext_ext(ksig, fpstate);
+
if (err)
return -EFAULT;

@@ -707,6 +723,86 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
}
}

+#ifdef CONFIG_X86_64
+static int copy_ext_from_user(struct sc_ext *ext, void __user *fpu)
+{
+ void __user *p;
+
+ if (!fpu)
+ return -EINVAL;
+
+ p = fpu + fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+ p = (void __user *)ALIGN((unsigned long)p, 8);
+
+ if (!access_ok(VERIFY_READ, p, sizeof(*ext)))
+ return -EFAULT;
+
+ if (__copy_from_user(ext, p, sizeof(*ext)))
+ return -EFAULT;
+
+ if (ext->total_size != sizeof(*ext))
+ return -EINVAL;
+ return 0;
+}
+
+static int copy_ext_to_user(void __user *fpu, struct sc_ext *ext)
+{
+ void __user *p;
+
+ if (!fpu)
+ return -EINVAL;
+
+ if (ext->total_size != sizeof(*ext))
+ return -EINVAL;
+
+ p = fpu + fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+ p = (void __user *)ALIGN((unsigned long)p, 8);
+
+ if (!access_ok(VERIFY_WRITE, p, sizeof(*ext)))
+ return -EFAULT;
+
+ if (__copy_to_user(p, ext, sizeof(*ext)))
+ return -EFAULT;
+
+ return 0;
+}
+
+int restore_sigcontext_ext(void __user *fp)
+{
+ int err = 0;
+
+ if (boot_cpu_has(X86_FEATURE_SHSTK) && fp) {
+ struct sc_ext ext = {0, 0};
+
+ err = copy_ext_from_user(&ext, fp);
+
+ if (!err)
+ err = cet_restore_signal(ext.ssp);
+ }
+
+ return err;
+}
+
+int setup_sigcontext_ext(struct ksignal *ksig, void __user *fp)
+{
+ int err = 0;
+
+ if (boot_cpu_has(X86_FEATURE_SHSTK) && fp) {
+ struct sc_ext ext;
+ unsigned long rstor;
+
+ rstor = (unsigned long)ksig->ka.sa.sa_restorer;
+ err = cet_setup_signal(is_ia32_frame(ksig), rstor, &ext.ssp);
+ if (!err) {
+ ext.total_size = sizeof(ext);
+ err = copy_ext_to_user(fp, &ext);
+ }
+ }
+
+ return err;
+}
+#endif
+
static void
handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
--
2.17.1


2018-07-10 22:33:22

by Yu-cheng Yu

Subject: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

Add routines to enable, disable, and otherwise support user-mode
indirect branch tracking.
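
For reference, the legacy code bitmap set up below holds one bit per user
page. A sketch of how user space could mark a page as legacy (non-ENDBR)
code; the exact bit ordering within a byte is an assumption here:

    /* bitmap address/size come from the ARCH_CET_LEGACY_BITMAP arch_prctl */
    unsigned long page = addr / PAGE_SIZE;

    bitmap[page / 8] |= 1U << (page % 8);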

Signed-off-by: H.J. Lu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/cet.h | 8 +++
arch/x86/include/asm/disabled-features.h | 8 ++-
arch/x86/kernel/cet.c | 73 ++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 20 ++++++-
arch/x86/kernel/elf.c | 16 +++++-
arch/x86/kernel/process.c | 1 +
6 files changed, 123 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index d9ae3d86cdd7..71da2cccba16 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -12,7 +12,10 @@ struct task_struct;
struct cet_status {
unsigned long shstk_base;
unsigned long shstk_size;
+ unsigned long ibt_bitmap_addr;
+ unsigned long ibt_bitmap_size;
unsigned int shstk_enabled:1;
+ unsigned int ibt_enabled:1;
};

#ifdef CONFIG_X86_INTEL_CET
@@ -21,6 +24,9 @@ void cet_disable_shstk(void);
void cet_disable_free_shstk(struct task_struct *p);
int cet_restore_signal(unsigned long ssp);
int cet_setup_signal(bool ia32, unsigned long rstor, unsigned long *new_ssp);
+int cet_setup_ibt(void);
+int cet_setup_ibt_bitmap(void);
+void cet_disable_ibt(void);
#else
static inline int cet_setup_shstk(void) { return 0; }
static inline void cet_disable_shstk(void) {}
@@ -28,6 +34,8 @@ static inline void cet_disable_free_shstk(struct task_struct *p) {}
static inline int cet_restore_signal(unsigned long ssp) { return 0; }
static inline int cet_setup_signal(bool ia32, unsigned long rstor,
unsigned long *new_ssp) { return 0; }
+static inline int cet_setup_ibt(void) { return 0; }
+static inline void cet_disable_ibt(void) {}
#endif

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 3624a11e5ba6..ce5bdaf0f1ff 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -62,6 +62,12 @@
#define DISABLE_SHSTK (1<<(X86_FEATURE_SHSTK & 31))
#endif

+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+#define DISABLE_IBT 0
+#else
+#define DISABLE_IBT (1<<(X86_FEATURE_IBT & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -72,7 +78,7 @@
#define DISABLED_MASK4 (DISABLE_PCID)
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
-#define DISABLED_MASK7 (DISABLE_PTI)
+#define DISABLED_MASK7 (DISABLE_PTI|DISABLE_IBT)
#define DISABLED_MASK8 0
#define DISABLED_MASK9 (DISABLE_MPX)
#define DISABLED_MASK10 0
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 4eba7790c4e4..8bbd63e1a2ba 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -12,6 +12,8 @@
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/sched/signal.h>
+#include <linux/vmalloc.h>
+#include <linux/bitops.h>
#include <asm/msr.h>
#include <asm/user.h>
#include <asm/fpu/xstate.h>
@@ -241,3 +243,74 @@ int cet_setup_signal(bool ia32, unsigned long rstor_addr,
set_shstk_ptr(ssp);
return 0;
}
+
+static unsigned long ibt_mmap(unsigned long addr, unsigned long len)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long populate;
+
+ down_write(&mm->mmap_sem);
+ addr = do_mmap(NULL, addr, len, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ VM_DONTDUMP, 0, &populate, NULL);
+ up_write(&mm->mmap_sem);
+
+ if (populate)
+ mm_populate(addr, populate);
+
+ return addr;
+}
+
+int cet_setup_ibt(void)
+{
+ u64 r;
+
+ if (!cpu_feature_enabled(X86_FEATURE_IBT))
+ return -EOPNOTSUPP;
+
+ rdmsrl(MSR_IA32_U_CET, r);
+ r |= (MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_NO_TRACK_EN);
+ wrmsrl(MSR_IA32_U_CET, r);
+ current->thread.cet.ibt_enabled = 1;
+ return 0;
+}
+
+int cet_setup_ibt_bitmap(void)
+{
+ u64 r;
+ unsigned long bitmap;
+ unsigned long size;
+
+ if (!cpu_feature_enabled(X86_FEATURE_IBT))
+ return -EOPNOTSUPP;
+
+ size = TASK_SIZE_MAX / PAGE_SIZE / BITS_PER_BYTE;
+ bitmap = ibt_mmap(0, size);
+
+ if (bitmap >= TASK_SIZE_MAX)
+ return -ENOMEM;
+
+ bitmap &= PAGE_MASK;
+
+ rdmsrl(MSR_IA32_U_CET, r);
+ r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
+ wrmsrl(MSR_IA32_U_CET, r);
+
+ current->thread.cet.ibt_bitmap_addr = bitmap;
+ current->thread.cet.ibt_bitmap_size = size;
+ return 0;
+}
+
+void cet_disable_ibt(void)
+{
+ u64 r;
+
+ if (!cpu_feature_enabled(X86_FEATURE_IBT))
+ return;
+
+ rdmsrl(MSR_IA32_U_CET, r);
+ r &= ~(MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_LEG_IW_EN |
+ MSR_IA32_CET_NO_TRACK_EN);
+ wrmsrl(MSR_IA32_U_CET, r);
+ current->thread.cet.ibt_enabled = 0;
+}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 705467839ce8..c609c9ce5691 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -413,7 +413,8 @@ __setup("nopku", setup_disable_pku);

static __always_inline void setup_cet(struct cpuinfo_x86 *c)
{
- if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ cpu_feature_enabled(X86_FEATURE_IBT))
cr4_set_bits(X86_CR4_CET);
}

@@ -434,6 +435,23 @@ static __init int setup_disable_shstk(char *s)
__setup("no_cet_shstk", setup_disable_shstk);
#endif

+#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+static __init int setup_disable_ibt(char *s)
+{
+ /* require an exact match without trailing characters */
+ if (strlen(s))
+ return 0;
+
+ if (!boot_cpu_has(X86_FEATURE_IBT))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_IBT);
+ pr_info("x86: 'no_cet_ibt' specified, disabling Branch Tracking\n");
+ return 1;
+}
+__setup("no_cet_ibt", setup_disable_ibt);
+#endif
+
/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
index 233f6dad9c1f..42e08d3b573e 100644
--- a/arch/x86/kernel/elf.c
+++ b/arch/x86/kernel/elf.c
@@ -15,6 +15,7 @@
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/string.h>
+#include <linux/compat.h>

/*
* The .note.gnu.property layout:
@@ -222,7 +223,8 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,

struct elf64_hdr *ehdr64 = ehdr_p;

- if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+ !cpu_feature_enabled(X86_FEATURE_IBT))
return 0;

if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
@@ -250,6 +252,9 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,
current->thread.cet.shstk_enabled = 0;
current->thread.cet.shstk_base = 0;
current->thread.cet.shstk_size = 0;
+ current->thread.cet.ibt_enabled = 0;
+ current->thread.cet.ibt_bitmap_addr = 0;
+ current->thread.cet.ibt_bitmap_size = 0;
if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
if (shstk) {
err = cet_setup_shstk();
@@ -257,6 +262,15 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,
goto out;
}
}
+
+ if (cpu_feature_enabled(X86_FEATURE_IBT)) {
+ if (ibt) {
+ err = cet_setup_ibt();
+ if (err < 0)
+ goto out;
+ }
+ }
+
out:
return err;
}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b3b0b482983a..309ebb7f9d8d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -138,6 +138,7 @@ void flush_thread(void)
memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));

cet_disable_shstk();
+ cet_disable_ibt();
fpu__clear(&tsk->thread.fpu);
}

--
2.17.1


2018-07-10 22:33:22

by Yu-cheng Yu

Subject: [RFC PATCH v2 20/27] x86/cet/shstk: ELF header parsing of CET

Look at the .note.gnu.property section of an ELF file and check
whether shadow stack needs to be enabled for the task.
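
Such a note is emitted when a binary is built with -fcf-protection. For
illustration (output abridged and approximate), readelf displays it as:

    $ readelf -n a.out
    Displaying notes found in: .note.gnu.property
      Owner    Data size    Description
      GNU      0x00000010   NT_GNU_PROPERTY_TYPE_0
          Properties: x86 feature: IBT, SHSTK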

Signed-off-by: H.J. Lu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/Kconfig | 4 +
arch/x86/include/asm/elf.h | 5 +
arch/x86/include/uapi/asm/elf_property.h | 16 ++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/elf.c | 262 +++++++++++++++++++++++
fs/binfmt_elf.c | 16 ++
include/uapi/linux/elf.h | 1 +
7 files changed, 306 insertions(+)
create mode 100644 arch/x86/include/uapi/asm/elf_property.h
create mode 100644 arch/x86/kernel/elf.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 44af5e1aaa4a..768343768643 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1923,12 +1923,16 @@ config X86_INTEL_CET
config ARCH_HAS_SHSTK
def_bool n

+config ARCH_HAS_PROGRAM_PROPERTIES
+ def_bool n
+
config X86_INTEL_SHADOW_STACK_USER
prompt "Intel Shadow Stack for user-mode"
def_bool n
depends on CPU_SUP_INTEL && X86_64
select X86_INTEL_CET
select ARCH_HAS_SHSTK
+ select ARCH_HAS_PROGRAM_PROPERTIES
---help---
Shadow stack provides hardware protection against program stack
corruption. Only when all the following are true will an application
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 0d157d2a1e2a..5b5f169c5c07 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -382,4 +382,9 @@ struct va_alignment {

extern struct va_alignment va_align;
extern unsigned long align_vdso_addr(unsigned long);
+
+#ifdef CONFIG_ARCH_HAS_PROGRAM_PROPERTIES
+extern int arch_setup_features(void *ehdr, void *phdr, struct file *file,
+ bool interp);
+#endif
#endif /* _ASM_X86_ELF_H */
diff --git a/arch/x86/include/uapi/asm/elf_property.h b/arch/x86/include/uapi/asm/elf_property.h
new file mode 100644
index 000000000000..343a871b8fc1
--- /dev/null
+++ b/arch/x86/include/uapi/asm/elf_property.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_ASM_X86_ELF_PROPERTY_H
+#define _UAPI_ASM_X86_ELF_PROPERTY_H
+
+/*
+ * pr_type
+ */
+#define GNU_PROPERTY_X86_FEATURE_1_AND (0xc0000002)
+
+/*
+ * Bits for GNU_PROPERTY_X86_FEATURE_1_AND
+ */
+#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (0x00000002)
+#define GNU_PROPERTY_X86_FEATURE_1_IBT (0x00000001)
+
+#endif /* _UAPI_ASM_X86_ELF_PROPERTY_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index fbb2d91fb756..36b14ef410c8 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -141,6 +141,8 @@ obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o

obj-$(CONFIG_X86_INTEL_CET) += cet.o

+obj-$(CONFIG_ARCH_HAS_PROGRAM_PROPERTIES) += elf.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
new file mode 100644
index 000000000000..233f6dad9c1f
--- /dev/null
+++ b/arch/x86/kernel/elf.c
@@ -0,0 +1,262 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Look at an ELF file's .note.gnu.property and determine if the file
+ * supports shadow stack and/or indirect branch tracking.
+ * The path from the ELF header to the note section is the following:
+ * elfhdr->elf_phdr->elf_note->property[].
+ */
+
+#include <asm/cet.h>
+#include <asm/elf_property.h>
+#include <uapi/linux/elf-em.h>
+#include <linux/binfmts.h>
+#include <linux/elf.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/string.h>
+
+/*
+ * The .note.gnu.property layout:
+ *
+ * struct elf_note {
+ * u32 n_namesz; --> sizeof(n_name[]); always (4)
+ * u32 n_ndescsz;--> sizeof(property[])
+ * u32 n_type; --> always NT_GNU_PROPERTY_TYPE_0
+ * };
+ *
+ * char n_name[4]; --> always 'GNU\0'
+ *
+ * struct {
+ * u32 pr_type;
+ * u32 pr_datasz;--> sizeof(pr_data[])
+ * u8 pr_data[pr_datasz];
+ * } property[];
+ */
+
+#define ELF_NOTE_DESC_OFFSET(n, align) \
+ round_up(sizeof(*n) + n->n_namesz, (align))
+
+#define ELF_NOTE_NEXT_OFFSET(n, align) \
+ round_up(ELF_NOTE_DESC_OFFSET(n, align) + n->n_descsz, (align))
+
+#define NOTE_PROPERTY_TYPE_0(n) \
+ ((n->n_namesz == 4) && (memcmp(n + 1, "GNU", 4) == 0) && \
+ (n->n_type == NT_GNU_PROPERTY_TYPE_0))
+
+#define NOTE_SIZE_BAD(n, align, max) \
+ ((n->n_descsz < 8) || ((n->n_descsz % align) != 0) || \
+ (((u8 *)(n + 1) + 4 + n->n_descsz) > (max)))
+
+/*
+ * Go through the property array and look for the one
+ * with pr_type of GNU_PROPERTY_X86_FEATURE_1_AND.
+ */
+static u32 find_x86_feature_1(u8 *buf, u32 size, u32 align)
+{
+ u8 *end = buf + size;
+ u8 *ptr = buf;
+
+ while (1) {
+ u32 pr_type, pr_datasz;
+
+ if ((ptr + 4) >= end)
+ break;
+
+ pr_type = *(u32 *)ptr;
+ pr_datasz = *(u32 *)(ptr + 4);
+ ptr += 8;
+
+ if ((ptr + pr_datasz) >= end)
+ break;
+
+ if (pr_type == GNU_PROPERTY_X86_FEATURE_1_AND &&
+ pr_datasz == 4)
+ return *(u32 *)ptr;
+
+ ptr += pr_datasz;
+ }
+ return 0;
+}
+
+static int find_cet(u8 *buf, u32 size, u32 align, int *shstk, int *ibt)
+{
+ struct elf_note *note = (struct elf_note *)buf;
+ *shstk = 0;
+ *ibt = 0;
+
+ /*
+ * Go through the note section and find the note
+ * with n_type of NT_GNU_PROPERTY_TYPE_0.
+ */
+ while ((unsigned long)(note + 1) - (unsigned long)buf < size) {
+ if (NOTE_PROPERTY_TYPE_0(note)) {
+ u32 p;
+
+ if (NOTE_SIZE_BAD(note, align, buf + size))
+ return 0;
+
+ /*
+ * Found the note; look at its property array.
+ */
+ p = find_x86_feature_1((u8 *)(note + 1) + 4,
+ note->n_descsz, align);
+
+ if (p & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+ *shstk = 1;
+ if (p & GNU_PROPERTY_X86_FEATURE_1_IBT)
+ *ibt = 1;
+ return 1;
+ }
+
+ /*
+ * Note sections like .note.ABI-tag and .note.gnu.build-id
+ * are aligned to 4 bytes in 64-bit ELF objects. So always
+ * use phdr->p_align.
+ */
+ note = (void *)note + ELF_NOTE_NEXT_OFFSET(note, align);
+ }
+
+ return 0;
+}
+
+static int check_pt_note_segment(struct file *file,
+ unsigned long note_size, loff_t *pos,
+ u32 align, int *shstk, int *ibt)
+{
+ int retval;
+ char *note_buf;
+
+ /*
+ * PT_NOTE segment is small. Read at most
+ * PAGE_SIZE.
+ */
+ if (note_size > PAGE_SIZE)
+ note_size = PAGE_SIZE;
+
+ /*
+ * Try to read in the whole PT_NOTE segment.
+ */
+ note_buf = kmalloc(note_size, GFP_KERNEL);
+ if (!note_buf)
+ return -ENOMEM;
+ retval = kernel_read(file, note_buf, note_size, pos);
+ if (retval != note_size) {
+ kfree(note_buf);
+ return (retval < 0) ? retval : -EIO;
+ }
+
+ retval = find_cet(note_buf, note_size, align, shstk, ibt);
+ kfree(note_buf);
+ return retval;
+}
+
+#ifdef CONFIG_COMPAT
+static int check_pt_note_32(struct file *file, struct elf32_phdr *phdr,
+ int phnum, int *shstk, int *ibt)
+{
+ int i;
+ int found = 0;
+
+ /*
+ * Go through all PT_NOTE segments and find NT_GNU_PROPERTY_TYPE_0.
+ */
+ for (i = 0; i < phnum; i++, phdr++) {
+ loff_t pos;
+
+ /*
+ * NT_GNU_PROPERTY_TYPE_0 note is aligned to 4 bytes
+ * in 32-bit binaries.
+ */
+ if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 4))
+ continue;
+
+ pos = phdr->p_offset;
+ found = check_pt_note_segment(file, phdr->p_filesz,
+ &pos, phdr->p_align,
+ shstk, ibt);
+ if (found)
+ break;
+ }
+ return found;
+}
+#endif
+
+#ifdef CONFIG_X86_64
+static int check_pt_note_64(struct file *file, struct elf64_phdr *phdr,
+ int phnum, int *shstk, int *ibt)
+{
+ int found = 0;
+
+ /*
+ * Go through all PT_NOTE segments.
+ */
+ for (; phnum > 0; phnum--, phdr++) {
+ loff_t pos;
+
+ /*
+ * NT_GNU_PROPERTY_TYPE_0 note is aligned to 8 bytes
+ * in 64-bit binaries.
+ */
+ if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 8))
+ continue;
+
+ pos = phdr->p_offset;
+ found = check_pt_note_segment(file, phdr->p_filesz,
+ &pos, phdr->p_align,
+ shstk, ibt);
+
+ if (found)
+ break;
+ }
+ return found;
+}
+#endif
+
+int arch_setup_features(void *ehdr_p, void *phdr_p,
+ struct file *file, bool interp)
+{
+ int err = 0;
+ int shstk = 0;
+ int ibt = 0;
+
+ struct elf64_hdr *ehdr64 = ehdr_p;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return 0;
+
+ if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
+ struct elf64_phdr *phdr64 = phdr_p;
+
+ err = check_pt_note_64(file, phdr64, ehdr64->e_phnum,
+ &shstk, &ibt);
+ if (err < 0)
+ goto out;
+ } else {
+#ifdef CONFIG_COMPAT
+ struct elf32_hdr *ehdr32 = ehdr_p;
+
+ if (ehdr32->e_ident[EI_CLASS] == ELFCLASS32) {
+ struct elf32_phdr *phdr32 = phdr_p;
+
+ err = check_pt_note_32(file, phdr32, ehdr32->e_phnum,
+ &shstk, &ibt);
+ if (err < 0)
+ goto out;
+ }
+#endif
+ }
+
+ current->thread.cet.shstk_enabled = 0;
+ current->thread.cet.shstk_base = 0;
+ current->thread.cet.shstk_size = 0;
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (shstk) {
+ err = cet_setup_shstk();
+ if (err < 0)
+ goto out;
+ }
+ }
+out:
+ return err;
+}
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 0ac456b52bdd..3395f6a631d5 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1081,6 +1081,22 @@ static int load_elf_binary(struct linux_binprm *bprm)
goto out_free_dentry;
}

+#ifdef CONFIG_ARCH_HAS_PROGRAM_PROPERTIES
+
+ if (interpreter) {
+ retval = arch_setup_features(&loc->interp_elf_ex,
+ interp_elf_phdata,
+ interpreter, true);
+ } else {
+ retval = arch_setup_features(&loc->elf_ex,
+ elf_phdata,
+ bprm->file, false);
+ }
+
+ if (retval < 0)
+ goto out_free_dentry;
+#endif
+
if (elf_interpreter) {
unsigned long interp_map_addr = 0;

diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 4e12c423b9fe..dc93982b9664 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -372,6 +372,7 @@ typedef struct elf64_shdr {
#define NT_PRFPREG 2
#define NT_PRPSINFO 3
#define NT_TASKSTRUCT 4
+#define NT_GNU_PROPERTY_TYPE_0 5
#define NT_AUXV 6
/*
* Note to userspace developers: size of NT_SIGINFO note may increase
--
2.17.1


2018-07-10 22:33:52

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

There are three possible shadow stack PTE settings:

Normal SHSTK PTE: (R/O + DIRTY_HW)
SHSTK PTE COW'ed: (R/O + DIRTY_HW)
SHSTK PTE shared as R/O data: (R/O + DIRTY_SW)

Update can_follow_write_pte/pmd for the shadow stack: a shadow
stack PTE/PMD that has gone through a COW cycle is identified with
is_shstk_pte()/is_shstk_pmd() instead of pte_dirty()/pmd_dirty().

Signed-off-by: Yu-cheng Yu <[email protected]>
---
mm/gup.c | 11 ++++++++---
mm/huge_memory.c | 10 +++++++---
2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b70d7ba7cc13..00171ee847af 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -64,10 +64,13 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
* FOLL_FORCE can write to even unwritable pte's, but only
* after we've gone through a COW cycle and they are dirty.
*/
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
+ bool shstk)
{
+ bool pte_cowed = shstk ? is_shstk_pte(pte) : pte_dirty(pte);
+
return pte_write(pte) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_cowed);
}

static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -78,7 +81,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
struct page *page;
spinlock_t *ptl;
pte_t *ptep, pte;
+ bool shstk;

+ shstk = is_shstk_mapping(vma->vm_flags);
retry:
if (unlikely(pmd_bad(*pmd)))
return no_page_table(vma, flags);
@@ -105,7 +110,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
- if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+ if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags, shstk)) {
pte_unmap_unlock(ptep, ptl);
return NULL;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7f3e11d3b64a..db4c689a960a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1389,10 +1389,13 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
* FOLL_FORCE can write to even unwritable pmd's, but only
* after we've gone through a COW cycle and they are dirty.
*/
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags,
+ bool shstk)
{
+ bool pmd_cowed = shstk ? is_shstk_pmd(pmd) : pmd_dirty(pmd);
+
return pmd_write(pmd) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_cowed);
}

struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1402,10 +1405,11 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
+ bool shstk = is_shstk_mapping(vma->vm_flags);

assert_spin_locked(pmd_lockptr(mm, pmd));

- if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+ if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags, shstk))
goto out;

/* Avoid dumping huge zero page */
--
2.17.1


2018-07-10 22:34:09

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 21/27] x86/cet/ibt: Add Kconfig option for user-mode Indirect Branch Tracking

User-mode indirect branch tracking is implemented mostly by the
compiler: GCC inserts ENDBR64/ENDBR32 instructions at indirect
branch targets. The kernel provides CPUID enumeration, feature MSR
setup, and allocation of the legacy code bitmap.
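
For illustration only (this sketch is not part of the patch): when a
program is built with GCC's -fcf-protection=branch, every function
that can be reached through an indirect branch starts with an
ENDBR64 (or ENDBR32) instruction in the generated code. Assuming
typical codegen:

    /*
     * Build with: gcc -fcf-protection=branch -O2 example.c
     * The generated code for do_work() begins with endbr64, so the
     * indirect call through 'fn' lands on a valid branch target.
     */
    static void do_work(void)
    {
    }

    int main(void)
    {
        void (*volatile fn)(void) = do_work; /* indirect branch target */

        fn();
        return 0;
    }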

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/Kconfig | 12 ++++++++++++
arch/x86/Makefile | 7 +++++++
2 files changed, 19 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 768343768643..01de9743efe6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1945,6 +1945,18 @@ config X86_INTEL_SHADOW_STACK_USER

If unsure, say y.

+config X86_INTEL_BRANCH_TRACKING_USER
+ prompt "Intel Indirect Branch Tracking for user-mode"
+ def_bool n
+ depends on CPU_SUP_INTEL && X86_64
+ select X86_INTEL_CET
+ select ARCH_HAS_PROGRAM_PROPERTIES
+ ---help---
+ Indirect Branch Tracking provides hardware protection against
+ return-/jmp-oriented programming attacks.
+
+ If unsure, say y.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index ad1314e5ef65..a7913de34866 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -164,6 +164,13 @@ ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
endif
endif

+# Check compiler ibt support
+ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
+ ifeq ($(call cc-option-yn, -fcf-protection=branch), n)
+ $(error CONFIG_X86_INTEL_BRANCH_TRACKING_USER not supported by compiler)
+ endif
+endif
+
#
# If the function graph tracer is used with mcount instead of fentry,
# '-maccumulate-outgoing-args' is needed to prevent a GCC bug
--
2.17.1


2018-07-10 22:34:17

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 13/27] mm: Handle shadow stack page fault

When a task does fork(), its shadow stack must be duplicated for
the child. However, the child may not actually use all pages of
the copied shadow stack. This patch implements a flow similar to
copy-on-write of an anonymous page, but for shadow stack memory.
A shadow stack PTE must be read-only and dirty; we use this dirty
bit requirement to effect the copying of shadow stack pages.

In copy_one_pte(), we clear the dirty bit from the shadow stack
PTE. On the next shadow stack access to the PTE, a page fault
occurs. At that time, we copy/re-use the page and fix the PTE.
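
The copy_one_pte() change itself is not quoted in this diff. As a
minimal sketch of what it presumably looks like (assuming the
pte_move_flags() helper from the _PAGE_DIRTY_SW patch), the
fork-time transfer would be:

    /*
     * In copy_one_pte(), for a shadow stack mapping: move the
     * hardware dirty bit to the software dirty bit, so that the
     * child's next shadow stack access faults and goes through
     * the copy-on-access flow below.
     */
    if (is_shstk_mapping(vm_flags))
        pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);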

Signed-off-by: Yu-cheng Yu <[email protected]>
---
mm/memory.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..a2695dbc0418 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2453,7 +2453,13 @@ static inline void wp_page_reuse(struct vm_fault *vmf)

flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = pte_mkyoung(vmf->orig_pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+ if (is_shstk_mapping(vma->vm_flags))
+ entry = pte_mkdirty_shstk(entry);
+ else
+ entry = pte_mkdirty(entry);
+
+ entry = maybe_mkwrite(entry, vma);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2526,7 +2532,11 @@ static int wp_page_copy(struct vm_fault *vmf)
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (is_shstk_mapping(vma->vm_flags))
+ entry = pte_mkdirty_shstk(entry);
+ else
+ entry = pte_mkdirty(entry);
+ entry = maybe_mkwrite(entry, vma);
/*
* Clear the pte entry and flush it first, before updating the
* pte with the new entry. This will avoid a race condition
@@ -3201,6 +3211,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
+ /*
+ * If this is within a shadow stack mapping, mark
+ * the PTE dirty. We don't use pte_mkdirty(),
+ * because the PTE must have _PAGE_DIRTY_HW set.
+ */
+ if (is_shstk_mapping(vma->vm_flags))
+ entry = pte_mkdirty_shstk(entry);
+
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

/* No need to invalidate - it was non-present before */
@@ -3983,6 +4001,14 @@ static int handle_pte_fault(struct vm_fault *vmf)
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
+
+ /*
+ * Shadow stack PTEs are copy-on-access, so do do_wp_page()
+ * handling on them regardless of whether this is a write fault.
+ */
+ if (is_shstk_mapping(vmf->vma->vm_flags))
+ return do_wp_page(vmf);
+
if (vmf->flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
return do_wp_page(vmf);
--
2.17.1


2018-07-10 22:34:25

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 08/27] mm: Introduce VM_SHSTK for shadow stack memory

VM_SHSTK indicates a shadow stack memory area.

A shadow stack PTE must be read-only and dirty. For non-shadow-
stack memory, we use a spare bit of the 64-bit PTE for the dirty
state. The PTE changes are in the next patch.

There are no spare bits left in the 32-bit PTE (except with PAE),
so the shadow stack is not implemented for the 32-bit kernel.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
include/linux/mm.h | 8 ++++++++
mm/internal.h | 8 ++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..d7b338b41593 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -222,11 +222,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -264,6 +266,12 @@ extern unsigned int kobjsize(const void *objp);
# define VM_MPX VM_NONE
#endif

+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+# define VM_SHSTK VM_HIGH_ARCH_5
+#else
+# define VM_SHSTK VM_NONE
+#endif
+
#ifndef VM_GROWSUP
# define VM_GROWSUP VM_NONE
#endif
diff --git a/mm/internal.h b/mm/internal.h
index 9e3654d70289..b09c29762d85 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -280,6 +280,14 @@ static inline bool is_data_mapping(vm_flags_t flags)
return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
}

+/*
+ * Shadow stack area
+ */
+static inline bool is_shstk_mapping(vm_flags_t flags)
+{
+ return (flags & VM_SHSTK);
+}
+
/* mm/util.c */
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node *rb_parent);
--
2.17.1


2018-07-10 22:34:29

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 15/27] mm/mprotect: Prevent mprotect from changing shadow stack

Signed-off-by: Yu-cheng Yu <[email protected]>
---
mm/mprotect.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 625608bc8962..128dcb880c12 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -446,6 +446,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
error = -ENOMEM;
if (!vma)
goto out;
+
+ /*
+ * Do not allow changing shadow stack memory.
+ */
+ if (vma->vm_flags & VM_SHSTK) {
+ error = -EINVAL;
+ goto out;
+ }
+
prev = vma->vm_prev;
if (unlikely(grows & PROT_GROWSDOWN)) {
if (vma->vm_start >= end)
--
2.17.1


2018-07-10 22:34:34

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

This patch adds basic shadow stack enabling/disabling routines.
A task's shadow stack is allocated from memory with the VM_SHSTK
flag set and read-only protection, and has a fixed size.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/cet.h | 30 ++++++
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/msr-index.h | 14 +++
arch/x86/include/asm/processor.h | 5 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cet.c | 128 +++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 24 +++++
arch/x86/kernel/process.c | 2 +
fs/proc/task_mmu.c | 3 +
9 files changed, 215 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..ad278c520414
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+/*
+ * Per-thread CET status
+ */
+struct cet_status {
+ unsigned long shstk_base;
+ unsigned long shstk_size;
+ unsigned int shstk_enabled:1;
+};
+
+#ifdef CONFIG_X86_INTEL_CET
+int cet_setup_shstk(void);
+void cet_disable_shstk(void);
+void cet_disable_free_shstk(struct task_struct *p);
+#else
+static inline int cet_setup_shstk(void) { return 0; }
+static inline void cet_disable_shstk(void) {}
+static inline void cet_disable_free_shstk(struct task_struct *p) {}
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 33833d1909af..3624a11e5ba6 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@
# define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
#endif

+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define DISABLE_SHSTK 0
+#else
+#define DISABLE_SHSTK (1<<(X86_FEATURE_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -75,7 +81,7 @@
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
-#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 68b2c3150de1..66849230712e 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -770,4 +770,18 @@
#define MSR_VM_IGNNE 0xc0010115
#define MSR_VM_HSAVE_PA 0xc0010117

+/* Control-flow Enforcement Technology MSRs */
+#define MSR_IA32_U_CET 0x6a0 /* user mode cet setting */
+#define MSR_IA32_S_CET 0x6a2 /* kernel mode cet setting */
+#define MSR_IA32_PL0_SSP 0x6a4 /* kernel shstk pointer */
+#define MSR_IA32_PL3_SSP 0x6a7 /* user shstk pointer */
+#define MSR_IA32_INT_SSP_TAB 0x6a8 /* exception shstk table */
+
+/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
+#define MSR_IA32_CET_SHSTK_EN 0x0000000000000001
+#define MSR_IA32_CET_WRSS_EN 0x0000000000000002
+#define MSR_IA32_CET_ENDBR_EN 0x0000000000000004
+#define MSR_IA32_CET_LEG_IW_EN 0x0000000000000008
+#define MSR_IA32_CET_NO_TRACK_EN 0x0000000000000010
+
#endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index cfd29ee8c3da..edf94393bf7e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -24,6 +24,7 @@ struct vm86;
#include <asm/special_insns.h>
#include <asm/fpu/types.h>
#include <asm/unwind_hints.h>
+#include <asm/cet.h>

#include <linux/personality.h>
#include <linux/cache.h>
@@ -498,6 +499,10 @@ struct thread_struct {
unsigned int sig_on_uaccess_err:1;
unsigned int uaccess_err:1; /* uaccess failed */

+#ifdef CONFIG_X86_INTEL_CET
+ struct cet_status cet;
+#endif
+
/* Floating point and extended processor state */
struct fpu fpu;
/*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8824d01c0c35..fbb2d91fb756 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -139,6 +139,8 @@ obj-$(CONFIG_UNWINDER_ORC) += unwind_orc.o
obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o

+obj-$(CONFIG_X86_INTEL_CET) += cet.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..96bf69db7da7
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * cet.c - Control Flow Enforcement (CET)
+ *
+ * Copyright (c) 2018, Intel Corporation.
+ * Yu-cheng Yu <[email protected]>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <asm/msr.h>
+#include <asm/user.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/compat.h>
+#include <asm/cet.h>
+
+#define SHSTK_SIZE_64 (0x8000 * 8)
+#define SHSTK_SIZE_32 (0x8000 * 4)
+
+static int set_shstk_ptr(unsigned long addr)
+{
+ u64 r;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return -1;
+
+ if ((addr >= TASK_SIZE_MAX) || (!IS_ALIGNED(addr, 4)))
+ return -1;
+
+ rdmsrl(MSR_IA32_U_CET, r);
+ wrmsrl(MSR_IA32_PL3_SSP, addr);
+ wrmsrl(MSR_IA32_U_CET, r | MSR_IA32_CET_SHSTK_EN);
+ return 0;
+}
+
+static unsigned long get_shstk_addr(void)
+{
+ unsigned long ptr;
+
+ if (!current->thread.cet.shstk_enabled)
+ return 0;
+
+ rdmsrl(MSR_IA32_PL3_SSP, ptr);
+ return ptr;
+}
+
+static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long populate;
+
+ down_write(&mm->mmap_sem);
+ addr = do_mmap(NULL, addr, len, PROT_READ,
+ MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
+ 0, &populate, NULL);
+ up_write(&mm->mmap_sem);
+
+ if (populate)
+ mm_populate(addr, populate);
+
+ return addr;
+}
+
+int cet_setup_shstk(void)
+{
+ unsigned long addr, size;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return -EOPNOTSUPP;
+
+ size = in_ia32_syscall() ? SHSTK_SIZE_32 : SHSTK_SIZE_64;
+ addr = shstk_mmap(0, size);
+
+ /*
+ * Return actual error from do_mmap().
+ */
+ if (addr >= TASK_SIZE_MAX)
+ return addr;
+
+ set_shstk_ptr(addr + size - sizeof(u64));
+ current->thread.cet.shstk_base = addr;
+ current->thread.cet.shstk_size = size;
+ current->thread.cet.shstk_enabled = 1;
+ return 0;
+}
+
+void cet_disable_shstk(void)
+{
+ u64 r;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return;
+
+ rdmsrl(MSR_IA32_U_CET, r);
+ r &= ~(MSR_IA32_CET_SHSTK_EN);
+ wrmsrl(MSR_IA32_U_CET, r);
+ wrmsrl(MSR_IA32_PL3_SSP, 0);
+ current->thread.cet.shstk_enabled = 0;
+}
+
+void cet_disable_free_shstk(struct task_struct *tsk)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ !tsk->thread.cet.shstk_enabled)
+ return;
+
+ if (tsk == current)
+ cet_disable_shstk();
+
+ /*
+ * Free only when tsk is current or shares mm
+ * with current but has its own shstk.
+ */
+ if (tsk->mm && (tsk->mm == current->mm) &&
+ (tsk->thread.cet.shstk_base)) {
+ vm_munmap(tsk->thread.cet.shstk_base,
+ tsk->thread.cet.shstk_size);
+ tsk->thread.cet.shstk_base = 0;
+ tsk->thread.cet.shstk_size = 0;
+ }
+
+ tsk->thread.cet.shstk_enabled = 0;
+}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eb4cb3efd20e..705467839ce8 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -411,6 +411,29 @@ static __init int setup_disable_pku(char *arg)
__setup("nopku", setup_disable_pku);
#endif /* CONFIG_X86_64 */

+static __always_inline void setup_cet(struct cpuinfo_x86 *c)
+{
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+ cr4_set_bits(X86_CR4_CET);
+}
+
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+static __init int setup_disable_shstk(char *s)
+{
+ /* require an exact match without trailing characters */
+ if (strlen(s))
+ return 0;
+
+ if (!boot_cpu_has(X86_FEATURE_SHSTK))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+ pr_info("x86: 'no_cet_shstk' specified, disabling Shadow Stack\n");
+ return 1;
+}
+__setup("no_cet_shstk", setup_disable_shstk);
+#endif
+
/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
@@ -1358,6 +1381,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
x86_init_rdrand(c);
x86_init_cache_qos(c);
setup_pku(c);
+ setup_cet(c);

/*
* Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 30ca2d1a9231..b3b0b482983a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,7 @@
#include <asm/desc.h>
#include <asm/prctl.h>
#include <asm/spec-ctrl.h>
+#include <asm/cet.h>

/*
* per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -136,6 +137,7 @@ void flush_thread(void)
flush_ptrace_hw_breakpoint(tsk);
memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));

+ cet_disable_shstk();
fpu__clear(&tsk->thread.fpu);
}

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e9679016271f..a76739499e25 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -684,6 +684,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_PKEY_BIT4)] = "",
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+ [ilog2(VM_SHSTK)] = "ss"
+#endif
};
size_t i;

--
2.17.1


2018-07-10 22:34:41

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 11/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW

Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
_PAGE_DIRTY_SW: when write-protecting an entry, move _PAGE_DIRTY_HW
to _PAGE_DIRTY_SW so that the resulting read-only entry does not
look like a shadow stack PTE/PMD to the CPU.
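
The pte_move_flags() helper used below is introduced by the
_PAGE_DIRTY_SW patch and is not quoted here. A minimal sketch of
its assumed semantics, expressed with existing pgtable helpers:

    /* Assumed behavior: if 'from' is set, clear it and set 'to'. */
    static inline pte_t pte_move_flags(pte_t pte, pteval_t from,
                                       pteval_t to)
    {
        if (pte_flags(pte) & from) {
            pte = pte_clear_flags(pte, from);
            pte = pte_set_flags(pte, to);
        }
        return pte;
    }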

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ecbd3539a864..456a864aa605 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1170,7 +1170,18 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
+ pte_t pte;
+
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
+ pte = *ptep;
+
+ /*
+ * On platforms before CET, other threads could race to
+ * create a RO and _PAGE_DIRTY_HW PTE again. However,
+ * on CET platforms, this is safe without a TLB flush.
+ */
+ pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
+ set_pte_at(mm, addr, ptep, pte);
}

#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
@@ -1220,7 +1231,18 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
{
+ pmd_t pmd;
+
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+ pmd = *pmdp;
+
+ /*
+ * On platforms before CET, other threads could race to
+ * create a RO and _PAGE_DIRTY_HW PMD again. However,
+ * on CET platforms, this is safe without a TLB flush.
+ */
+ pmd = pmd_move_flags(pmd, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
+ set_pmd_at(mm, addr, pmdp, pmd);
}

#define pud_write pud_write
--
2.17.1


2018-07-10 22:34:49

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 14/27] mm: Handle THP/HugeTLB shadow stack page fault

This patch implements THP shadow stack memory copying in the same
way as the previous patch does for regular PTEs.

In copy_huge_pmd(), we clear the dirty bit from the PMD. On the
next shadow stack access to the PMD, a page fault occurs. At
that time, the page is copied/re-used and the PMD is fixed.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
mm/huge_memory.c | 8 ++++++++
mm/memory.c | 8 +++++++-
2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..7f3e11d3b64a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -597,6 +597,8 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,

entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (is_shstk_mapping(vma->vm_flags))
+ entry = pmd_mkdirty_shstk(entry);
page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
@@ -1193,6 +1195,8 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
pte_t entry;
entry = mk_pte(pages[i], vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (is_shstk_mapping(vma->vm_flags))
+ entry = pte_mkdirty_shstk(entry);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
@@ -1277,6 +1281,8 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (is_shstk_mapping(vma->vm_flags))
+ entry = pmd_mkdirty_shstk(entry);
if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
ret |= VM_FAULT_WRITE;
@@ -1347,6 +1353,8 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
pmd_t entry;
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (is_shstk_mapping(vma->vm_flags))
+ entry = pmd_mkdirty_shstk(entry);
pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
diff --git a/mm/memory.c b/mm/memory.c
index a2695dbc0418..f7c46d61eaea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4108,7 +4108,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf, orig_pmd);

- if (dirty && !pmd_write(orig_pmd)) {
+ /*
+ * Shadow stack trans huge PMDs are copy-on-access,
+ * so wp_huge_pmd() on them no mater if we have a
+ * write fault or not.
+ */
+ if (is_shstk_mapping(vma->vm_flags) ||
+ (dirty && !pmd_write(orig_pmd))) {
ret = wp_huge_pmd(&vmf, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
--
2.17.1


2018-07-10 22:34:49

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 07/27] x86/cet/shstk: Add Kconfig option for user-mode shadow stack

Introduce Kconfig option X86_INTEL_SHADOW_STACK_USER.

An application has shadow stack protection when all the following are
true:

(1) The kernel has X86_INTEL_SHADOW_STACK_USER enabled,
(2) The running processor supports the shadow stack,
(3) The application is built with shadow-stack-enabled tools and
libs, and at runtime all dependent shared libs can support the
shadow stack.

If this kernel config option is enabled, but (2) or (3) above is not
true, the application runs without the shadow stack protection.
Existing legacy applications will continue to work without the shadow
stack protection.

The user-mode shadow stack protection is implemented only for the
64-bit kernel. 32-bit applications are supported in compatibility
mode.
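
As a quick way to check condition (2) from user space, the shadow
stack bit is enumerated in CPUID.(EAX=7,ECX=0):ECX[7]. A minimal
sketch (not part of this patch):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 7, sub-leaf 0: structured extended feature flags */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 1;
        printf("CPU shadow stack support: %s\n",
               (ecx & (1u << 7)) ? "yes" : "no");
        return 0;
    }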

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/Kconfig | 24 ++++++++++++++++++++++++
arch/x86/Makefile | 7 +++++++
2 files changed, 31 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f1dbb4ee19d7..44af5e1aaa4a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1917,6 +1917,30 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS

If unsure, say y.

+config X86_INTEL_CET
+ def_bool n
+
+config ARCH_HAS_SHSTK
+ def_bool n
+
+config X86_INTEL_SHADOW_STACK_USER
+ prompt "Intel Shadow Stack for user-mode"
+ def_bool n
+ depends on CPU_SUP_INTEL && X86_64
+ select X86_INTEL_CET
+ select ARCH_HAS_SHSTK
+ ---help---
+ Shadow stack provides hardware protection against program stack
+ corruption. Only when all the following are true will an application
+ have the shadow stack protection: the kernel supports it (i.e. this
+ feature is enabled), the application is compiled and linked with
+ shadow stack enabled, and the processor supports this feature.
+ When the kernel has this configuration enabled, existing non shadow
+ stack applications will continue to work, but without shadow stack
+ protection.
+
+ If unsure, say y.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index a08e82856563..ad1314e5ef65 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -157,6 +157,13 @@ ifdef CONFIG_X86_X32
endif
export CONFIG_X86_X32_ABI

+# Check assembler shadow stack support
+ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+ ifeq ($(call as-instr, saveprevssp, y),)
+ $(error CONFIG_X86_INTEL_SHADOW_STACK_USER not supported by the assembler)
+ endif
+endif
+
#
# If the function graph tracer is used with mcount instead of fentry,
# '-maccumulate-outgoing-args' is needed to prevent a GCC bug
--
2.17.1


2018-07-10 22:35:02

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 06/27] x86/cet: Control protection exception handler

A control protection exception is triggered when a control flow transfer
attempt violates shadow stack or indirect branch tracking constraints.
For example, the return address for a RET instruction differs from the
safe copy on the shadow stack, or an indirect JMP instruction arrives at
a non-ENDBR instruction.

The control protection exception handler works in a similar way as the
general protection fault handler.
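
For illustration only (hypothetical, not part of this patch): with
the shadow stack enabled, code that clobbers the on-stack return
address no longer redirects control. The mismatch with the shadow
stack copy raises a "near-ret" control protection fault instead:

    /*
     * Assumes an x86-64 build with frame pointers, where the saved
     * return address sits just above the saved frame pointer.
     */
    void clobber_return_address(void)
    {
        unsigned long *frame = __builtin_frame_address(0);

        frame[1] = 0; /* corrupt the saved return address */
    } /* the RET here would raise #CP instead of jumping to 0 */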

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/entry/entry_64.S | 2 +-
arch/x86/include/asm/traps.h | 3 ++
arch/x86/kernel/idt.c | 4 +++
arch/x86/kernel/traps.c | 58 ++++++++++++++++++++++++++++++++++++
4 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 73a522d53b53..99398a27fe0b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -999,7 +999,7 @@ idtentry spurious_interrupt_bug do_spurious_interrupt_bug has_error_code=0
idtentry coprocessor_error do_coprocessor_error has_error_code=0
idtentry alignment_check do_alignment_check has_error_code=1
idtentry simd_coprocessor_error do_simd_coprocessor_error has_error_code=0
-
+idtentry control_protection do_control_protection has_error_code=1

/*
* Reload gs selector with exception handling
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 3de69330e6c5..5196050ff3d5 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -26,6 +26,7 @@ asmlinkage void invalid_TSS(void);
asmlinkage void segment_not_present(void);
asmlinkage void stack_segment(void);
asmlinkage void general_protection(void);
+asmlinkage void control_protection(void);
asmlinkage void page_fault(void);
asmlinkage void async_page_fault(void);
asmlinkage void spurious_interrupt_bug(void);
@@ -77,6 +78,7 @@ dotraplinkage void do_stack_segment(struct pt_regs *, long);
dotraplinkage void do_double_fault(struct pt_regs *, long);
#endif
dotraplinkage void do_general_protection(struct pt_regs *, long);
+dotraplinkage void do_control_protection(struct pt_regs *, long);
dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
dotraplinkage void do_spurious_interrupt_bug(struct pt_regs *, long);
dotraplinkage void do_coprocessor_error(struct pt_regs *, long);
@@ -142,6 +144,7 @@ enum {
X86_TRAP_AC, /* 17, Alignment Check */
X86_TRAP_MC, /* 18, Machine Check */
X86_TRAP_XF, /* 19, SIMD Floating-Point Exception */
+ X86_TRAP_CP = 21, /* 21, Control Protection Fault */
X86_TRAP_IRET = 32, /* 32, IRET Exception */
};

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 74383a3780dc..aa0229e1962d 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -103,6 +103,10 @@ static const __initconst struct idt_data def_idts[] = {
#elif defined(CONFIG_X86_32)
SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32),
#endif
+
+#ifdef CONFIG_X86_INTEL_CET
+ INTG(X86_TRAP_CP, control_protection),
+#endif
};

/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e6db475164ed..21a713b96148 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -578,6 +578,64 @@ do_general_protection(struct pt_regs *regs, long error_code)
}
NOKPROBE_SYMBOL(do_general_protection);

+static const char *control_protection_err[] =
+{
+ "unknown",
+ "near-ret",
+ "far-ret/iret",
+ "endbranch",
+ "rstorssp",
+ "setssbsy",
+};
+
+/*
+ * When a control protection exception occurs, send a signal
+ * to the responsible application. Currently, control
+ * protection is enabled only for user mode, so this
+ * exception should not come from kernel mode.
+ */
+dotraplinkage void
+do_control_protection(struct pt_regs *regs, long error_code)
+{
+ struct task_struct *tsk;
+
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+ if (notify_die(DIE_TRAP, "control protection fault", regs,
+ error_code, X86_TRAP_CP, SIGSEGV) == NOTIFY_STOP)
+ return;
+ cond_local_irq_enable(regs);
+
+ if (!user_mode(regs))
+ die("kernel control protection fault", regs, error_code);
+
+ if (!static_cpu_has(X86_FEATURE_SHSTK) &&
+ !static_cpu_has(X86_FEATURE_IBT))
+ WARN_ONCE(1, "CET is disabled but got control protection fault\n");
+
+ tsk = current;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_CP;
+
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ printk_ratelimit()) {
+ unsigned int max_err;
+
+ max_err = ARRAY_SIZE(control_protection_err) - 1;
+ if ((error_code < 0) || (error_code > max_err))
+ error_code = 0;
+ pr_info("%s[%d] control protection ip:%lx sp:%lx error:%lx(%s)",
+ tsk->comm, task_pid_nr(tsk),
+ regs->ip, regs->sp, error_code,
+ control_protection_err[error_code]);
+ print_vma_addr(" in ", regs->ip);
+ pr_cont("\n");
+ }
+
+ force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
+}
+NOKPROBE_SYMBOL(do_control_protection);
+
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
#ifdef CONFIG_DYNAMIC_FTRACE
--
2.17.1


2018-07-10 22:35:14

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 12/27] x86/mm: Shadow stack page fault error checking

If a page fault is triggered by a shadow stack access (e.g.
call/ret) or shadow stack management instructions (e.g.
wrussq), then bit[6] of the page fault error code is set.

In access_error(), we check if a shadow stack page fault
is within a shadow stack memory area.
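
A minimal sketch of the decode (illustrative only; the real check is
in the access_error() hunk below):

    /*
     * True if the fault came from a shadow stack access or a shadow
     * stack management instruction (page fault error code bit 6).
     */
    static inline bool fault_is_shstk(unsigned long error_code)
    {
        return error_code & X86_PF_SHSTK;
    }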

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/traps.h | 2 ++
arch/x86/mm/fault.c | 11 +++++++++++
2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 5196050ff3d5..58ea2f5722e9 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -157,6 +157,7 @@ enum {
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
+ * bit 6 == 1: shadow stack access fault
*/
enum x86_pf_error_code {
X86_PF_PROT = 1 << 0,
@@ -165,5 +166,6 @@ enum x86_pf_error_code {
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
+ X86_PF_SHSTK = 1 << 6,
};
#endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 2aafa6ab6103..fcd5739151f9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1163,6 +1163,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;

+ /*
+ * Verify X86_PF_SHSTK is within a shadow stack VMA.
+ * It is always an error if there is a shadow stack
+ * fault outside a shadow stack VMA.
+ */
+ if (error_code & X86_PF_SHSTK) {
+ if (!(vma->vm_flags & VM_SHSTK))
+ return 1;
+ return 0;
+ }
+
if (error_code & X86_PF_WRITE) {
/* write, present and write, not present: */
if (unlikely(!(vma->vm_flags & VM_WRITE)))
--
2.17.1


2018-07-10 22:35:26

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 02/27] x86/fpu/xstate: Change some names to separate XSAVES system and user states

To support XSAVES system states, change some names to distinguish
user and system states.

Change:
supervisor to system
copy_init_fpstate_to_fpregs() to copy_init_user_fpstate_to_fpregs()
xfeatures_mask to xfeatures_mask_user
XCNTXT_MASK to SUPPORTED_XFEATURES_MASK (states supported)

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 5 +-
arch/x86/include/asm/fpu/xstate.h | 24 ++++----
arch/x86/kernel/fpu/core.c | 4 +-
arch/x86/kernel/fpu/init.c | 2 +-
arch/x86/kernel/fpu/signal.c | 6 +-
arch/x86/kernel/fpu/xstate.c | 88 +++++++++++++++--------------
6 files changed, 66 insertions(+), 63 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a1e37a..f1f9bf91a0ab 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -93,7 +93,8 @@ static inline void fpstate_init_xstate(struct xregs_state *xsave)
* XRSTORS requires these bits set in xcomp_bv, or it will
* trigger #GP:
*/
- xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask;
+ xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
+ xfeatures_mask_user;
}

static inline void fpstate_init_fxstate(struct fxregs_state *fx)
@@ -233,7 +234,7 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)

/*
* If XSAVES is enabled, it replaces XSAVEOPT because it supports a compact
- * format and supervisor states in addition to modified optimization in
+ * format and system states in addition to modified optimization in
* XSAVEOPT.
*
* Otherwise, if XSAVEOPT is enabled, XSAVEOPT replaces XSAVE because XSAVEOPT
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 48581988d78c..9b382e5157ed 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -19,19 +19,19 @@
#define XSAVE_YMM_SIZE 256
#define XSAVE_YMM_OFFSET (XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET)

-/* Supervisor features */
-#define XFEATURE_MASK_SUPERVISOR (XFEATURE_MASK_PT)
+/* System features */
+#define XFEATURE_MASK_SYSTEM (XFEATURE_MASK_PT)

/* All currently supported features */
-#define XCNTXT_MASK (XFEATURE_MASK_FP | \
- XFEATURE_MASK_SSE | \
- XFEATURE_MASK_YMM | \
- XFEATURE_MASK_OPMASK | \
- XFEATURE_MASK_ZMM_Hi256 | \
- XFEATURE_MASK_Hi16_ZMM | \
- XFEATURE_MASK_PKRU | \
- XFEATURE_MASK_BNDREGS | \
- XFEATURE_MASK_BNDCSR)
+#define SUPPORTED_XFEATURES_MASK (XFEATURE_MASK_FP | \
+ XFEATURE_MASK_SSE | \
+ XFEATURE_MASK_YMM | \
+ XFEATURE_MASK_OPMASK | \
+ XFEATURE_MASK_ZMM_Hi256 | \
+ XFEATURE_MASK_Hi16_ZMM | \
+ XFEATURE_MASK_PKRU | \
+ XFEATURE_MASK_BNDREGS | \
+ XFEATURE_MASK_BNDCSR)

#ifdef CONFIG_X86_64
#define REX_PREFIX "0x48, "
@@ -39,7 +39,7 @@
#define REX_PREFIX
#endif

-extern u64 xfeatures_mask;
+extern u64 xfeatures_mask_user;
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];

extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index f92a6593de1e..2627e18dcbb5 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -362,7 +362,7 @@ void fpu__drop(struct fpu *fpu)
* Clear FPU registers by setting them up from
* the init fpstate:
*/
-static inline void copy_init_fpstate_to_fpregs(void)
+static inline void copy_init_user_fpstate_to_fpregs(void)
{
if (use_xsave())
copy_kernel_to_xregs(&init_fpstate.xsave, -1);
@@ -394,7 +394,7 @@ void fpu__clear(struct fpu *fpu)
preempt_disable();
fpu__initialize(fpu);
user_fpu_begin();
- copy_init_fpstate_to_fpregs();
+ copy_init_user_fpstate_to_fpregs();
preempt_enable();
}
}
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 6abd83572b01..761c3a5a9e07 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -229,7 +229,7 @@ static void __init fpu__init_system_xstate_size_legacy(void)
*/
u64 __init fpu__get_supported_xfeatures_mask(void)
{
- return XCNTXT_MASK;
+ return SUPPORTED_XFEATURES_MASK;
}

/* Legacy code to initialize eager fpu mode. */
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 23f1691670b6..f77aa76ba675 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -249,11 +249,11 @@ static inline int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_
{
if (use_xsave()) {
if ((unsigned long)buf % 64 || fx_only) {
- u64 init_bv = xfeatures_mask & ~XFEATURE_MASK_FPSSE;
+ u64 init_bv = xfeatures_mask_user & ~XFEATURE_MASK_FPSSE;
copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
return copy_user_to_fxregs(buf);
} else {
- u64 init_bv = xfeatures_mask & ~xbv;
+ u64 init_bv = xfeatures_mask_user & ~xbv;
if (unlikely(init_bv))
copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
return copy_user_to_xregs(buf, xbv);
@@ -417,7 +417,7 @@ void fpu__init_prepare_fx_sw_frame(void)

fx_sw_reserved.magic1 = FP_XSTATE_MAGIC1;
fx_sw_reserved.extended_size = size;
- fx_sw_reserved.xfeatures = xfeatures_mask;
+ fx_sw_reserved.xfeatures = xfeatures_mask_user;
fx_sw_reserved.xstate_size = fpu_user_xstate_size;

if (IS_ENABLED(CONFIG_IA32_EMULATION) ||
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7642d3..19f8df54c72a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -53,11 +53,11 @@ static short xsave_cpuid_features[] __initdata = {
/*
* Mask of xstate features supported by the CPU and the kernel:
*/
-u64 xfeatures_mask __read_mostly;
+u64 xfeatures_mask_user __read_mostly;

static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
-static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask)*8];
+static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask_user)*8];

/*
* The XSAVE area of kernel can be in standard or compacted format;
@@ -82,7 +82,7 @@ void fpu__xstate_clear_all_cpu_caps(void)
*/
int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name)
{
- u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask;
+ u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_user;

if (unlikely(feature_name)) {
long xfeature_idx, max_idx;
@@ -113,14 +113,14 @@ int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name)
}
EXPORT_SYMBOL_GPL(cpu_has_xfeatures);

-static int xfeature_is_supervisor(int xfeature_nr)
+static int xfeature_is_system(int xfeature_nr)
{
/*
- * We currently do not support supervisor states, but if
+ * We currently do not support system states, but if
* we did, we could find out like this.
*
* SDM says: If state component 'i' is a user state component,
- * ECX[0] return 0; if state component i is a supervisor
+ * ECX[0] return 0; if state component i is a system
* state component, ECX[0] returns 1.
*/
u32 eax, ebx, ecx, edx;
@@ -131,7 +131,7 @@ static int xfeature_is_supervisor(int xfeature_nr)

static int xfeature_is_user(int xfeature_nr)
{
- return !xfeature_is_supervisor(xfeature_nr);
+ return !xfeature_is_system(xfeature_nr);
}

/*
@@ -164,7 +164,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
* None of the feature bits are in init state. So nothing else
* to do for us, as the memory layout is up to date.
*/
- if ((xfeatures & xfeatures_mask) == xfeatures_mask)
+ if ((xfeatures & xfeatures_mask_user) == xfeatures_mask_user)
return;

/*
@@ -191,7 +191,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
* in a special way already:
*/
feature_bit = 0x2;
- xfeatures = (xfeatures_mask & ~xfeatures) >> 2;
+ xfeatures = (xfeatures_mask_user & ~xfeatures) >> 2;

/*
* Update all the remaining memory layouts according to their
@@ -219,20 +219,20 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
*/
void fpu__init_cpu_xstate(void)
{
- if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask)
+ if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_user)
return;
/*
- * Make it clear that XSAVES supervisor states are not yet
+ * Make it clear that XSAVES system states are not yet
* implemented should anyone expect it to work by changing
* bits in XFEATURE_MASK_* macros and XCR0.
*/
- WARN_ONCE((xfeatures_mask & XFEATURE_MASK_SUPERVISOR),
- "x86/fpu: XSAVES supervisor states are not yet implemented.\n");
+ WARN_ONCE((xfeatures_mask_user & XFEATURE_MASK_SYSTEM),
+ "x86/fpu: XSAVES system states are not yet implemented.\n");

- xfeatures_mask &= ~XFEATURE_MASK_SUPERVISOR;
+ xfeatures_mask_user &= ~XFEATURE_MASK_SYSTEM;

cr4_set_bits(X86_CR4_OSXSAVE);
- xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user);
}

/*
@@ -242,7 +242,7 @@ void fpu__init_cpu_xstate(void)
*/
static int xfeature_enabled(enum xfeature xfeature)
{
- return !!(xfeatures_mask & (1UL << xfeature));
+ return !!(xfeatures_mask_user & BIT_ULL(xfeature));
}

/*
@@ -272,7 +272,7 @@ static void __init setup_xstate_features(void)
cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);

/*
- * If an xfeature is supervisor state, the offset
+ * If an xfeature is system state, the offset
* in EBX is invalid. We leave it to -1.
*/
if (xfeature_is_user(i))
@@ -348,7 +348,7 @@ static int xfeature_is_aligned(int xfeature_nr)
*/
static void __init setup_xstate_comp(void)
{
- unsigned int xstate_comp_sizes[sizeof(xfeatures_mask)*8];
+ unsigned int xstate_comp_sizes[sizeof(xfeatures_mask_user)*8];
int i;

/*
@@ -421,7 +421,8 @@ static void __init setup_init_fpu_buf(void)
print_xstate_features();

if (boot_cpu_has(X86_FEATURE_XSAVES))
- init_fpstate.xsave.header.xcomp_bv = (u64)1 << 63 | xfeatures_mask;
+ init_fpstate.xsave.header.xcomp_bv =
+ BIT_ULL(63) | xfeatures_mask_user;

/*
* Init all the features state with header.xfeatures being 0x0
@@ -440,11 +441,11 @@ static int xfeature_uncompacted_offset(int xfeature_nr)
u32 eax, ebx, ecx, edx;

/*
- * Only XSAVES supports supervisor states and it uses compacted
- * format. Checking a supervisor state's uncompacted offset is
+ * Only XSAVES supports system states and it uses compacted
+ * format. Checking a system state's uncompacted offset is
* an error.
*/
- if (XFEATURE_MASK_SUPERVISOR & (1 << xfeature_nr)) {
+ if (XFEATURE_MASK_SYSTEM & (1 << xfeature_nr)) {
WARN_ONCE(1, "No fixed offset for xstate %d\n", xfeature_nr);
return -1;
}
@@ -465,7 +466,7 @@ static int xfeature_size(int xfeature_nr)

/*
* 'XSAVES' implies two different things:
- * 1. saving of supervisor/system state
+ * 1. saving of system state
* 2. using the compacted format
*
* Use this function when dealing with the compacted format so
@@ -480,8 +481,8 @@ int using_compacted_format(void)
/* Validate an xstate header supplied by userspace (ptrace or sigreturn) */
int validate_xstate_header(const struct xstate_header *hdr)
{
- /* No unknown or supervisor features may be set */
- if (hdr->xfeatures & (~xfeatures_mask | XFEATURE_MASK_SUPERVISOR))
+ /* No unknown or system features may be set */
+ if (hdr->xfeatures & (~xfeatures_mask_user | XFEATURE_MASK_SYSTEM))
return -EINVAL;

/* Userspace must use the uncompacted format */
@@ -588,11 +589,11 @@ static void do_extra_xstate_size_checks(void)

check_xstate_against_struct(i);
/*
- * Supervisor state components can be managed only by
+ * System state components can be managed only by
* XSAVES, which is compacted-format only.
*/
if (!using_compacted_format())
- XSTATE_WARN_ON(xfeature_is_supervisor(i));
+ XSTATE_WARN_ON(xfeature_is_system(i));

/* Align from the end of the previous feature */
if (xfeature_is_aligned(i))
@@ -616,7 +617,7 @@ static void do_extra_xstate_size_checks(void)


/*
- * Get total size of enabled xstates in XCR0/xfeatures_mask.
+ * Get total size of enabled xstates in XCR0/xfeatures_mask_user.
*
* Note the SDM's wording here. "sub-function 0" only enumerates
* the size of the *user* states. If we use it to size a buffer
@@ -706,7 +707,7 @@ static int init_xstate_size(void)
*/
static void fpu__init_disable_system_xstate(void)
{
- xfeatures_mask = 0;
+ xfeatures_mask_user = 0;
cr4_clear_bits(X86_CR4_OSXSAVE);
fpu__xstate_clear_all_cpu_caps();
}
@@ -742,15 +743,15 @@ void __init fpu__init_system_xstate(void)
}

cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
- xfeatures_mask = eax + ((u64)edx << 32);
+ xfeatures_mask_user = eax + ((u64)edx << 32);

- if ((xfeatures_mask & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {
+ if ((xfeatures_mask_user & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {
/*
* This indicates that something really unexpected happened
* with the enumeration. Disable XSAVE and try to continue
* booting without it. This is too early to BUG().
*/
- pr_err("x86/fpu: FP/SSE not present amongst the CPU's xstate features: 0x%llx.\n", xfeatures_mask);
+ pr_err("x86/fpu: FP/SSE not present amongst the CPU's xstate features: 0x%llx.\n", xfeatures_mask_user);
goto out_disable;
}

@@ -759,10 +760,10 @@ void __init fpu__init_system_xstate(void)
*/
for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
if (!boot_cpu_has(xsave_cpuid_features[i]))
- xfeatures_mask &= ~BIT(i);
+ xfeatures_mask_user &= ~BIT_ULL(i);
}

- xfeatures_mask &= fpu__get_supported_xfeatures_mask();
+ xfeatures_mask_user &= fpu__get_supported_xfeatures_mask();

/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
@@ -772,9 +773,10 @@ void __init fpu__init_system_xstate(void)

/*
* Update info used for ptrace frames; use standard-format size and no
- * supervisor xstates:
+ * system xstates:
*/
- update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask & ~XFEATURE_MASK_SUPERVISOR);
+ update_regset_xstate_info(fpu_user_xstate_size,
+ xfeatures_mask_user & ~XFEATURE_MASK_SYSTEM);

fpu__init_prepare_fx_sw_frame();
setup_init_fpu_buf();
@@ -782,7 +784,7 @@ void __init fpu__init_system_xstate(void)
print_xstate_offset_size();

pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
- xfeatures_mask,
+ xfeatures_mask_user,
fpu_kernel_xstate_size,
boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
return;
@@ -801,7 +803,7 @@ void fpu__resume_cpu(void)
* Restore XCR0 on xsave capable CPUs:
*/
if (boot_cpu_has(X86_FEATURE_XSAVE))
- xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user);
}

/*
@@ -853,7 +855,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
* have not enabled. Remember that pcntxt_mask is
* what we write to the XCR0 register.
*/
- WARN_ONCE(!(xfeatures_mask & xstate_feature),
+ WARN_ONCE(!(xfeatures_mask_user & xstate_feature),
"get of unsupported state");
/*
* This assumes the last 'xsave*' instruction to
@@ -1003,7 +1005,7 @@ int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int of
*/
memset(&header, 0, sizeof(header));
header.xfeatures = xsave->header.xfeatures;
- header.xfeatures &= ~XFEATURE_MASK_SUPERVISOR;
+ header.xfeatures &= ~XFEATURE_MASK_SYSTEM;

/*
* Copy xregs_state->header:
@@ -1087,7 +1089,7 @@ int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned i
*/
memset(&header, 0, sizeof(header));
header.xfeatures = xsave->header.xfeatures;
- header.xfeatures &= ~XFEATURE_MASK_SUPERVISOR;
+ header.xfeatures &= ~XFEATURE_MASK_SYSTEM;

/*
* Copy xregs_state->header:
@@ -1180,7 +1182,7 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
* The state that came in from userspace was user-state only.
* Mask all the user states out of 'xfeatures':
*/
- xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR;
+ xsave->header.xfeatures &= XFEATURE_MASK_SYSTEM;

/*
* Add back in the features that came in from userspace:
@@ -1236,7 +1238,7 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
* The state that came in from userspace was user-state only.
* Mask all the user states out of 'xfeatures':
*/
- xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR;
+ xsave->header.xfeatures &= XFEATURE_MASK_SYSTEM;

/*
* Add back in the features that came in from userspace:
--
2.17.1


2018-07-10 22:35:36

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 09/27] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW

We are going to create _PAGE_DIRTY_SW for non-hardware memory
management purposes. Rename _PAGE_DIRTY to _PAGE_DIRTY_HW and
_PAGE_BIT_DIRTY to _PAGE_BIT_DIRTY_HW to make these PTE dirty
bits clearer. There are no functional changes in this patch.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 6 +++---
arch/x86/include/asm/pgtable_types.h | 17 +++++++++--------
arch/x86/kernel/relocate_kernel_64.S | 2 +-
arch/x86/kvm/vmx.c | 2 +-
4 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5715647fc4fe..28806f8f36c3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -303,7 +303,7 @@ static inline pte_t pte_mkexec(pte_t pte)

static inline pte_t pte_mkdirty(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

static inline pte_t pte_mkyoung(pte_t pte)
@@ -377,7 +377,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)

static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -436,7 +436,7 @@ static inline pud_t pud_wrprotect(pud_t pud)

static inline pud_t pud_mkdirty(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

static inline pud_t pud_mkdevmap(pud_t pud)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 99fff853c944..806abf530f50 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -15,7 +15,7 @@
#define _PAGE_BIT_PWT 3 /* page write through */
#define _PAGE_BIT_PCD 4 /* page cache disabled */
#define _PAGE_BIT_ACCESSED 5 /* was accessed (raised by CPU) */
-#define _PAGE_BIT_DIRTY 6 /* was written to (raised by CPU) */
+#define _PAGE_BIT_DIRTY_HW 6 /* was written to (raised by CPU) */
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
#define _PAGE_BIT_PAT 7 /* on 4KB pages */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
@@ -45,7 +45,7 @@
#define _PAGE_PWT (_AT(pteval_t, 1) << _PAGE_BIT_PWT)
#define _PAGE_PCD (_AT(pteval_t, 1) << _PAGE_BIT_PCD)
#define _PAGE_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
-#define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
+#define _PAGE_DIRTY_HW (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_HW)
#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
#define _PAGE_SOFTW1 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
@@ -72,7 +72,7 @@
_PAGE_PKEY_BIT3)

#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
-#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
+#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY_HW | _PAGE_ACCESSED)
#else
#define _PAGE_KNL_ERRATUM_MASK 0
#endif
@@ -111,9 +111,9 @@
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

#define _PAGE_TABLE_NOENC (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
- _PAGE_ACCESSED | _PAGE_DIRTY)
+ _PAGE_ACCESSED | _PAGE_DIRTY_HW)
#define _KERNPG_TABLE_NOENC (_PAGE_PRESENT | _PAGE_RW | \
- _PAGE_ACCESSED | _PAGE_DIRTY)
+ _PAGE_ACCESSED | _PAGE_DIRTY_HW)

/*
* Set of bits not changed in pte_modify. The pte's
@@ -122,7 +122,7 @@
* pte_modify() does modify it.
*/
#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
- _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
+ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW | \
_PAGE_SOFT_DIRTY)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)

@@ -167,7 +167,8 @@ enum page_cache_mode {
_PAGE_ACCESSED)

#define __PAGE_KERNEL_EXEC \
- (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL)
+ (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY_HW | _PAGE_ACCESSED | \
+ _PAGE_GLOBAL)
#define __PAGE_KERNEL (__PAGE_KERNEL_EXEC | _PAGE_NX)

#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
@@ -186,7 +187,7 @@ enum page_cache_mode {
#define _PAGE_ENC (_AT(pteval_t, sme_me_mask))

#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
- _PAGE_DIRTY | _PAGE_ENC)
+ _PAGE_DIRTY_HW | _PAGE_ENC)
#define _PAGE_TABLE (_KERNPG_TABLE | _PAGE_USER)

#define __PAGE_KERNEL_ENC (__PAGE_KERNEL | _PAGE_ENC)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 11eda21eb697..e7665a4767b3 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -17,7 +17,7 @@
*/

#define PTR(x) (x << 3)
-#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY_HW)

/*
* control_page + KEXEC_CONTROL_CODE_MAX_SIZE
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1689f433f3a0..faef36473105 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5467,7 +5467,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
/* Set up identity-mapping pagetable for EPT in real mode */
for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
- _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+ _PAGE_ACCESSED | _PAGE_DIRTY_HW | _PAGE_PSE);
r = kvm_write_guest_page(kvm, identity_map_pfn,
&tmp, i * sizeof(tmp), sizeof(tmp));
if (r < 0)
--
2.17.1


2018-07-10 22:35:40

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 03/27] x86/fpu/xstate: Enable XSAVES system states

XSAVES saves both system and user states. The Linux kernel
currently does not save/restore any system states. This patch
creates the framework for supporting system states.
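
In short, the enumeration added here boils down to the following
(a condensed sketch of the code in this patch):

	/* User states from CPUID.(EAX=0xd, ECX=0); these go into XCR0. */
	cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
	user = eax | ((u64)edx << 32);

	/* System states from CPUID.(EAX=0xd, ECX=1); these go into IA32_XSS. */
	cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
	system = ecx | ((u64)edx << 32);

	xfeatures_mask_all  = (user | system) & SUPPORTED_XFEATURES_MASK;
	xfeatures_mask_user = xfeatures_mask_all & user;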

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 3 +-
arch/x86/include/asm/fpu/xstate.h | 9 ++-
arch/x86/kernel/fpu/core.c | 7 +-
arch/x86/kernel/fpu/init.c | 10 ---
arch/x86/kernel/fpu/xstate.c | 112 +++++++++++++++++-----------
5 files changed, 80 insertions(+), 61 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index f1f9bf91a0ab..1f447865db3a 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -45,7 +45,6 @@ extern void fpu__init_cpu_xstate(void);
extern void fpu__init_system(struct cpuinfo_x86 *c);
extern void fpu__init_check_bugs(void);
extern void fpu__resume_cpu(void);
-extern u64 fpu__get_supported_xfeatures_mask(void);

/*
* Debugging facility:
@@ -94,7 +93,7 @@ static inline void fpstate_init_xstate(struct xregs_state *xsave)
* trigger #GP:
*/
xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
- xfeatures_mask_user;
+ xfeatures_mask_all;
}

static inline void fpstate_init_fxstate(struct fxregs_state *fx)
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 9b382e5157ed..a32dc5f8c963 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -19,10 +19,10 @@
#define XSAVE_YMM_SIZE 256
#define XSAVE_YMM_OFFSET (XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET)

-/* System features */
-#define XFEATURE_MASK_SYSTEM (XFEATURE_MASK_PT)
-
-/* All currently supported features */
+/*
+ * SUPPORTED_XFEATURES_MASK indicates all features
+ * implemented in and supported by the kernel.
+ */
#define SUPPORTED_XFEATURES_MASK (XFEATURE_MASK_FP | \
XFEATURE_MASK_SSE | \
XFEATURE_MASK_YMM | \
@@ -40,6 +40,7 @@
#endif

extern u64 xfeatures_mask_user;
+extern u64 xfeatures_mask_all;
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];

extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2627e18dcbb5..6250e857e764 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -364,8 +364,13 @@ void fpu__drop(struct fpu *fpu)
*/
static inline void copy_init_user_fpstate_to_fpregs(void)
{
+ /*
+ * Only XSAVES user states are copied.
+ * System states are preserved.
+ */
if (use_xsave())
- copy_kernel_to_xregs(&init_fpstate.xsave, -1);
+ copy_kernel_to_xregs(&init_fpstate.xsave,
+ xfeatures_mask_user);
else if (static_cpu_has(X86_FEATURE_FXSR))
copy_kernel_to_fxregs(&init_fpstate.fxsave);
else
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 761c3a5a9e07..eaf9d9d479a5 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -222,16 +222,6 @@ static void __init fpu__init_system_xstate_size_legacy(void)
fpu_user_xstate_size = fpu_kernel_xstate_size;
}

-/*
- * Find supported xfeatures based on cpu features and command-line input.
- * This must be called after fpu__init_parse_early_param() is called and
- * xfeatures_mask is enumerated.
- */
-u64 __init fpu__get_supported_xfeatures_mask(void)
-{
- return SUPPORTED_XFEATURES_MASK;
-}
-
/* Legacy code to initialize eager fpu mode. */
static void __init fpu__init_system_ctx_switch(void)
{
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 19f8df54c72a..dd2c561c4544 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -51,13 +51,16 @@ static short xsave_cpuid_features[] __initdata = {
};

/*
- * Mask of xstate features supported by the CPU and the kernel:
+ * Mask of xstate features supported by the CPU and the kernel.
+ * This is derived from the CPUID enumeration, SUPPORTED_XFEATURES_MASK,
+ * and boot_cpu_has() checks.
*/
u64 xfeatures_mask_user __read_mostly;
+u64 xfeatures_mask_all __read_mostly;

static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
-static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask_user)*8];
+static unsigned int xstate_comp_offsets[sizeof(xfeatures_mask_all)*8];

/*
* The XSAVE area of kernel can be in standard or compacted format;
@@ -82,7 +85,7 @@ void fpu__xstate_clear_all_cpu_caps(void)
*/
int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name)
{
- u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_user;
+ u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_all;

if (unlikely(feature_name)) {
long xfeature_idx, max_idx;
@@ -164,7 +167,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
* None of the feature bits are in init state. So nothing else
* to do for us, as the memory layout is up to date.
*/
- if ((xfeatures & xfeatures_mask_user) == xfeatures_mask_user)
+ if ((xfeatures & xfeatures_mask_all) == xfeatures_mask_all)
return;

/*
@@ -219,30 +222,31 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
*/
void fpu__init_cpu_xstate(void)
{
- if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_user)
+ if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_all)
return;
+
+ cr4_set_bits(X86_CR4_OSXSAVE);
+
/*
- * Make it clear that XSAVES system states are not yet
- * implemented should anyone expect it to work by changing
- * bits in XFEATURE_MASK_* macros and XCR0.
+ * XCR_XFEATURE_ENABLED_MASK sets the features that are managed
+ * by XSAVE{C, OPT} and XRSTOR. Only XSAVE user states can be
+ * set here.
*/
- WARN_ONCE((xfeatures_mask_user & XFEATURE_MASK_SYSTEM),
- "x86/fpu: XSAVES system states are not yet implemented.\n");
+ xsetbv(XCR_XFEATURE_ENABLED_MASK,
+ xfeatures_mask_user);

- xfeatures_mask_user &= ~XFEATURE_MASK_SYSTEM;
-
- cr4_set_bits(X86_CR4_OSXSAVE);
- xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user);
+ /*
+ * MSR_IA32_XSS selects which XSAVES system states are managed by
+ * XSAVES. Only XSAVES system states can be set here.
+ */
+ if (boot_cpu_has(X86_FEATURE_XSAVES))
+ wrmsrl(MSR_IA32_XSS,
+ xfeatures_mask_all & ~xfeatures_mask_user);
}

-/*
- * Note that in the future we will likely need a pair of
- * functions here: one for user xstates and the other for
- * system xstates. For now, they are the same.
- */
static int xfeature_enabled(enum xfeature xfeature)
{
- return !!(xfeatures_mask_user & BIT_ULL(xfeature));
+ return !!(xfeatures_mask_all & BIT_ULL(xfeature));
}

/*
@@ -348,7 +352,7 @@ static int xfeature_is_aligned(int xfeature_nr)
*/
static void __init setup_xstate_comp(void)
{
- unsigned int xstate_comp_sizes[sizeof(xfeatures_mask_user)*8];
+ unsigned int xstate_comp_sizes[sizeof(xfeatures_mask_all)*8];
int i;

/*
@@ -422,7 +426,7 @@ static void __init setup_init_fpu_buf(void)

if (boot_cpu_has(X86_FEATURE_XSAVES))
init_fpstate.xsave.header.xcomp_bv =
- BIT_ULL(63) | xfeatures_mask_user;
+ BIT_ULL(63) | xfeatures_mask_all;

/*
* Init all the features state with header.xfeatures being 0x0
@@ -441,11 +445,10 @@ static int xfeature_uncompacted_offset(int xfeature_nr)
u32 eax, ebx, ecx, edx;

/*
- * Only XSAVES supports system states and it uses compacted
- * format. Checking a system state's uncompacted offset is
- * an error.
+ * Checking a system or unsupported state's uncompacted offset
+ * is an error.
*/
- if (XFEATURE_MASK_SYSTEM & (1 << xfeature_nr)) {
+ if (~xfeatures_mask_user & BIT_ULL(xfeature_nr)) {
WARN_ONCE(1, "No fixed offset for xstate %d\n", xfeature_nr);
return -1;
}
@@ -482,7 +485,7 @@ int using_compacted_format(void)
int validate_xstate_header(const struct xstate_header *hdr)
{
/* No unknown or system features may be set */
- if (hdr->xfeatures & (~xfeatures_mask_user | XFEATURE_MASK_SYSTEM))
+ if (hdr->xfeatures & ~xfeatures_mask_user)
return -EINVAL;

/* Userspace must use the uncompacted format */
@@ -617,15 +620,12 @@ static void do_extra_xstate_size_checks(void)


/*
- * Get total size of enabled xstates in XCR0/xfeatures_mask_user.
+ * Get total size of enabled xstates in XCR0 | IA32_XSS.
*
* Note the SDM's wording here. "sub-function 0" only enumerates
* the size of the *user* states. If we use it to size a buffer
* that we use 'XSAVES' on, we could potentially overflow the
* buffer because 'XSAVES' saves system states too.
- *
- * Note that we do not currently set any bits on IA32_XSS so
- * 'XCR0 | IA32_XSS == XCR0' for now.
*/
static unsigned int __init get_xsaves_size(void)
{
@@ -707,6 +707,7 @@ static int init_xstate_size(void)
*/
static void fpu__init_disable_system_xstate(void)
{
+ xfeatures_mask_all = 0;
xfeatures_mask_user = 0;
cr4_clear_bits(X86_CR4_OSXSAVE);
fpu__xstate_clear_all_cpu_caps();
@@ -722,6 +723,8 @@ void __init fpu__init_system_xstate(void)
static int on_boot_cpu __initdata = 1;
int err;
int i;
+ u64 cpu_user_xfeatures_mask;
+ u64 cpu_system_xfeatures_mask;

WARN_ON_FPU(!on_boot_cpu);
on_boot_cpu = 0;
@@ -742,10 +745,24 @@ void __init fpu__init_system_xstate(void)
return;
}

+ /*
+ * Find user states supported by the processor.
+ * Only these bits can be set in XCR0.
+ */
cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
- xfeatures_mask_user = eax + ((u64)edx << 32);
+ cpu_user_xfeatures_mask = eax + ((u64)edx << 32);
+
+ /*
+ * Find system states supported by the processor.
+ * Only these bits can be set in IA32_XSS MSR.
+ */
+ cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
+ cpu_system_xfeatures_mask = ecx + ((u64)edx << 32);

- if ((xfeatures_mask_user & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {
+ xfeatures_mask_all = cpu_user_xfeatures_mask |
+ cpu_system_xfeatures_mask;
+
+ if ((xfeatures_mask_all & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {
/*
* This indicates that something really unexpected happened
* with the enumeration. Disable XSAVE and try to continue
@@ -760,10 +777,11 @@ void __init fpu__init_system_xstate(void)
*/
for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
if (!boot_cpu_has(xsave_cpuid_features[i]))
- xfeatures_mask_user &= ~BIT_ULL(i);
+ xfeatures_mask_all &= ~BIT_ULL(i);
}

- xfeatures_mask_user &= fpu__get_supported_xfeatures_mask();
+ xfeatures_mask_all &= SUPPORTED_XFEATURES_MASK;
+ xfeatures_mask_user = xfeatures_mask_all & cpu_user_xfeatures_mask;

/* Enable xstate instructions to be able to continue with initialization: */
fpu__init_cpu_xstate();
@@ -775,8 +793,7 @@ void __init fpu__init_system_xstate(void)
* Update info used for ptrace frames; use standard-format size and no
* system xstates:
*/
- update_regset_xstate_info(fpu_user_xstate_size,
- xfeatures_mask_user & ~XFEATURE_MASK_SYSTEM);
+ update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_user);

fpu__init_prepare_fx_sw_frame();
setup_init_fpu_buf();
@@ -784,7 +801,7 @@ void __init fpu__init_system_xstate(void)
print_xstate_offset_size();

pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
- xfeatures_mask_user,
+ xfeatures_mask_all,
fpu_kernel_xstate_size,
boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
return;
@@ -804,6 +821,13 @@ void fpu__resume_cpu(void)
*/
if (boot_cpu_has(X86_FEATURE_XSAVE))
xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user);
+
+ /*
+ * Restore IA32_XSS
+ */
+ if (boot_cpu_has(X86_FEATURE_XSAVES))
+ wrmsrl(MSR_IA32_XSS,
+ xfeatures_mask_all & ~xfeatures_mask_user);
}

/*
@@ -853,9 +877,9 @@ void *get_xsave_addr(struct xregs_state *xsave, int xstate_feature)
/*
* We should not ever be requesting features that we
* have not enabled. Remember that pcntxt_mask is
- * what we write to the XCR0 register.
+ * what we write to the XCR0 | IA32_XSS registers.
*/
- WARN_ONCE(!(xfeatures_mask_user & xstate_feature),
+ WARN_ONCE(!(xfeatures_mask_all & xstate_feature),
"get of unsupported state");
/*
* This assumes the last 'xsave*' instruction to
@@ -1005,7 +1029,7 @@ int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int of
*/
memset(&header, 0, sizeof(header));
header.xfeatures = xsave->header.xfeatures;
- header.xfeatures &= ~XFEATURE_MASK_SYSTEM;
+ header.xfeatures &= xfeatures_mask_user;

/*
* Copy xregs_state->header:
@@ -1089,7 +1113,7 @@ int copy_xstate_to_user(void __user *ubuf, struct xregs_state *xsave, unsigned i
*/
memset(&header, 0, sizeof(header));
header.xfeatures = xsave->header.xfeatures;
- header.xfeatures &= ~XFEATURE_MASK_SYSTEM;
+ header.xfeatures &= xfeatures_mask_user;

/*
* Copy xregs_state->header:
@@ -1182,7 +1206,7 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
* The state that came in from userspace was user-state only.
* Mask all the user states out of 'xfeatures':
*/
- xsave->header.xfeatures &= XFEATURE_MASK_SYSTEM;
+ xsave->header.xfeatures &= (xfeatures_mask_all & ~xfeatures_mask_user);

/*
* Add back in the features that came in from userspace:
@@ -1238,7 +1262,7 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
* The state that came in from userspace was user-state only.
* Mask all the user states out of 'xfeatures':
*/
- xsave->header.xfeatures &= XFEATURE_MASK_SYSTEM;
+ xsave->header.xfeatures &= (xfeatures_mask_all & ~xfeatures_mask_user);

/*
* Add back in the features that came in from userspace:
--
2.17.1


2018-07-10 22:36:05

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 01/27] x86/cpufeatures: Add CPUIDs for Control-flow Enforcement Technology (CET)

Add CPUIDs for Control-flow Enforcement Technology (CET).

CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect branch tracking
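
These CPUID bits can also be probed from user space; a minimal
sketch using GCC's <cpuid.h> (illustrative only, not part of this
patch):

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID.(EAX=7,ECX=0): structured extended feature flags */
		__cpuid_count(7, 0, eax, ebx, ecx, edx);

		printf("SHSTK: %u\n", (ecx >> 7) & 1);	/* ECX[bit 7] */
		printf("IBT:   %u\n", (edx >> 20) & 1);	/* EDX[bit 20] */
		return 0;
	}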

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/kernel/cpu/scattered.c | 1 +
2 files changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 5701f5cecd31..f479345e344b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -219,6 +219,7 @@
#define X86_FEATURE_IBPB ( 7*32+26) /* Indirect Branch Prediction Barrier */
#define X86_FEATURE_STIBP ( 7*32+27) /* Single Thread Indirect Branch Predictors */
#define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
+#define X86_FEATURE_IBT ( 7*32+29) /* Indirect Branch Tracking */

/* Virtualization flags: Linux defined, word 8 */
#define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
@@ -319,6 +320,7 @@
#define X86_FEATURE_PKU (16*32+ 3) /* Protection Keys for Userspace */
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 772c219b6889..63cbb4d9938e 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -21,6 +21,7 @@ struct cpuid_bit {
static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_APERFMPERF, CPUID_ECX, 0, 0x00000006, 0 },
{ X86_FEATURE_EPB, CPUID_ECX, 3, 0x00000006, 0 },
+ { X86_FEATURE_IBT, CPUID_EDX, 20, 0x00000007, 0},
{ X86_FEATURE_CAT_L3, CPUID_EBX, 1, 0x00000010, 0 },
{ X86_FEATURE_CAT_L2, CPUID_EBX, 2, 0x00000010, 0 },
{ X86_FEATURE_CDP_L3, CPUID_ECX, 2, 0x00000010, 1 },
--
2.17.1


2018-07-10 22:36:05

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 05/27] Documentation/x86: Add CET description

Explain how CET works and document the no_cet_shstk/no_cet_ibt
kernel parameters.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 6 +
Documentation/x86/intel_cet.txt | 250 ++++++++++++++++++
2 files changed, 256 insertions(+)
create mode 100644 Documentation/x86/intel_cet.txt

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index efc7aa7a0670..dc787facdcde 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2661,6 +2661,12 @@
noexec=on: enable non-executable mappings (default)
noexec=off: disable non-executable mappings

+ no_cet_ibt [X86-64] Disable indirect branch tracking for user-mode
+ applications
+
+ no_cet_shstk [X86-64] Disable shadow stack support for user-mode
+ applications
+
nosmap [X86]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
diff --git a/Documentation/x86/intel_cet.txt b/Documentation/x86/intel_cet.txt
new file mode 100644
index 000000000000..974bb8262146
--- /dev/null
+++ b/Documentation/x86/intel_cet.txt
@@ -0,0 +1,250 @@
+=========================================
+Control Flow Enforcement Technology (CET)
+=========================================
+
+[1] Overview
+============
+
+Control Flow Enforcement Technology (CET) provides protection against
+return/jump-oriented programming (ROP/JOP) attacks. It can be
+implemented to protect both the kernel and applications. In the
+first phase, only user-mode protection is implemented for the 64-bit
+kernel. 32-bit applications are supported under the compatibility
+mode.
+
+CET includes shadow stack (SHSTK) and indirect branch tracking (IBT),
+which are enabled by two kernel configuration options:
+
+ X86_INTEL_SHADOW_STACK_USER, and
+ X86_INTEL_BRANCH_TRACKING_USER.
+
+To build a CET-enabled kernel, Binutils v2.30 and GCC v8.1 or later
+are required. To build a CET-enabled application, GLIBC v2.29 or
+later is also required.
+
+There are two command-line options for disabling CET features:
+
+ no_cet_shstk - disables SHSTK, and
+ no_cet_ibt - disables IBT.
+
+At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.
+
+[2] CET assembly instructions
+=============================
+
+RDSSP %r
+ Read the SHSTK pointer into %r.
+
+INCSSP %r
+ Unwind (increment) the SHSTK pointer by 0 to 255 steps, as
+ indicated in the operand register. The GLIBC longjmp uses INCSSP
+ to unwind the SHSTK until it matches the program stack. When it is
+ necessary to unwind beyond 255 steps, longjmp divides and repeats
+ the process.
+
+RSTORSSP (%r)
+ Switch to the SHSTK indicated by the 'restore token' pointed to by
+ the operand register and replace the 'restore token' with a new
+ token to be saved (with SAVEPREVSSP) for the outgoing SHSTK.
+
+ Before RSTORSSP
+
+ Incoming SHSTK Current/Outgoing SHSTK
+
+ |----------------------| |----------------------|
+ addr=x | | ssp-> | |
+ |----------------------| |----------------------|
+ (%r)-> | rstor_token=(x|Lg) | addr=y-8 | |
+ |----------------------| |----------------------|
+
+ After RSTORSSP
+
+ |----------------------| |----------------------|
+ ssp-> | | | |
+ |----------------------| |----------------------|
+ | rstor_token=(y|Bz|Lg)| addr=y-8 | |
+ |----------------------| |----------------------|
+
+ note:
+ 1. Only valid addresses and restore tokens can be on the
+ user-mode SHSTK.
+ 2. A token is always of type u64 and must align to u64.
+ 3. The incoming SHSTK pointer in a rstor_token must point to
+ immediately above the token.
+ 4. 'Lg' is bit[0] of a rstor_token indicating a 64-bit SHSTK.
+ 5. 'Bz' is bit[1] of a rstor_token indicating the token is to
+ be used only for the next SAVEPREVSSP and invalid for the
+ RSTORSSP.
+
+SAVEPREVSSP
+ Store the SHSTK 'restore token' pointed to by
+ (current_SHSTK_pointer + 8).
+
+ After SAVEPREVSSP
+
+ |----------------------| |----------------------|
+ ssp-> | | | |
+ |----------------------| |----------------------|
+ | rstor_token=(y|Bz|Lg)| addr=y-8 | rstor_token(y|Lg) |
+ |----------------------| |----------------------|
+
+WRUSS %r0, (%r1)
+ Write the value in %r0 to the SHSTK address pointed to by (%r1).
+ This is a kernel-mode-only instruction.
+
+ENDBR
+ The compiler inserts an ENDBR at all valid branch targets. Any
+ CALL/JMP to a target without an ENDBR triggers a control
+ protection fault.
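+
+Example: RDSSP is encoded in the NOP space, so on processors (or
+tasks) without SHSTK enabled it leaves its operand unchanged. A
+user-space sketch (illustrative; needs Binutils v2.30+ to assemble):
+
+  unsigned long ssp = 0;
+
+  /* ssp remains 0 when SHSTK is not enabled */
+  asm volatile ("rdsspq %0" : "+r" (ssp));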
+
+[3] Application Enabling
+========================
+
+An application's CET capability is marked in its ELF header and can
+be verified from the following command output, in the
+NT_GNU_PROPERTY_TYPE_0 field:
+
+ readelf -n <application>
+
+If an application supports CET and is statically linked, it will run
+with CET protection. If the application needs any shared libraries,
+the loader checks all dependencies and enables CET only when all
+requirements are met.
+
+[4] Legacy Libraries
+====================
+
+GLIBC provides a few tunables for backward compatibility.
+
+GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
+ Turn off SHSTK/IBT for the current shell.
+
+GLIBC_TUNABLES=glibc.tune.x86_shstk=<on, permissive>
+ This controls how dlopen() handles SHSTK legacy libraries:
+ on: continue with SHSTK enabled;
+ permissive: continue with SHSTK off.
+
+[5] CET system calls
+====================
+
+The following arch_prctl() system calls are added for CET:
+
+arch_prctl(ARCH_CET_STATUS, unsigned long *addr)
+ Return CET feature status.
+
+ The parameter 'addr' is a pointer to a user buffer.
+ On returning to the caller, the kernel fills the following
+ information:
+
+ *addr = SHSTK/IBT status
+ *(addr + 1) = SHSTK base address
+ *(addr + 2) = SHSTK size
+
+arch_prctl(ARCH_CET_DISABLE, unsigned long features)
+ Disable SHSTK and/or IBT specified in 'features'. Return -EPERM
+ if CET is locked out.
+
+arch_prctl(ARCH_CET_LOCK)
+ Lock out CET feature.
+
+arch_prctl(ARCH_CET_ALLOC_SHSTK, unsigned long *addr)
+ Allocate a new SHSTK.
+
+ The parameter 'addr' is a pointer to a user buffer and indicates
+ the desired SHSTK size to allocate. On returning to the caller,
+ the buffer contains the address of the new SHSTK.
+
+arch_prctl(ARCH_CET_LEGACY_BITMAP, unsigned long *addr)
+ Allocate an IBT legacy code bitmap if the current task does not
+ have one.
+
+ The parameter 'addr' is a pointer to a user buffer.
+ On returning to the caller, the kernel fills the following
+ information:
+
+ *addr = IBT bitmap base address
+ *(addr + 1) = IBT bitmap size
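+
+Example: querying CET status from user space (an illustrative
+sketch; ARCH_CET_STATUS comes from this series' <asm/prctl.h>):
+
+  #include <stdio.h>
+  #include <unistd.h>
+  #include <sys/syscall.h>
+  #include <asm/prctl.h>
+
+  int main(void)
+  {
+          unsigned long buf[3] = { 0 };
+
+          if (syscall(SYS_arch_prctl, ARCH_CET_STATUS, buf))
+                  return 1;
+
+          printf("features: %#lx\n", buf[0]);
+          printf("SHSTK base: %#lx, size: %#lx\n", buf[1], buf[2]);
+          return 0;
+  }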
+
+[6] The implementation of the SHSTK
+===================================
+
+SHSTK size
+----------
+
+A task's SHSTK is allocated with a fixed size that can support
+32 K (32,768) nested function calls: 256 KB for a 64-bit application
+and 128 KB for a 32-bit application. The system admin can change
+the default size.
+
+Signal
+------
+
+The main program and its signal handlers use the same SHSTK. Because
+the SHSTK stores only return addresses, it can be sized large enough
+to cover the case where both the program stack and the sigaltstack
+run out.
+
+The kernel creates a restore token at the SHSTK restore address and
+verifies that token when returning from the signal handler.
+
+Fork
+----
+
+The SHSTK's VMA has the VM_SHSTK flag set; its PTEs are required to
+be read-only and dirty. When a SHSTK access hits a PTE that is not
+present, or not both read-only and dirty, it triggers a page fault
+with an additional SHSTK bit set in the page-fault error code.
+
+When a task forks a child, its SHSTK PTEs are copied and both the
+parent's and the child's SHSTK PTEs are cleared of the dirty bit.
+Upon the next SHSTK access, the resulting SHSTK page fault is handled
+by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new SHSTK for
+the new thread.
+
+Setjmp/Longjmp
+--------------
+
+Longjmp unwinds SHSTK until it matches the program stack.
+
+Ucontext
+--------
+
+In GLIBC, getcontext/setcontext is implemented in a similar way to
+setjmp/longjmp.
+
+When makecontext creates a new ucontext, a new SHSTK is allocated
+for that context with the ARCH_CET_ALLOC_SHSTK arch_prctl() call.
+The kernel creates a restore token at the top of the new SHSTK, and
+the user-mode code switches to it with the RSTORSSP instruction.
+
+[7] The management of read-only & dirty PTEs for SHSTK
+======================================================
+
+A RO and dirty PTE exists in the following cases:
+
+(a) A page is modified and then shared with a fork()'ed child;
+(b) A R/O page that has been COW'ed;
+(c) A SHSTK page.
+
+The processor only checks the dirty bit for (c). To prevent the use
+of non-SHSTK memory as SHSTK, we use a spare bit of the 64-bit PTE as
+DIRTY_SW for (a) and (b) above. This results in the following PTE
+settings:
+
+Modified PTE: (R/W + DIRTY_HW)
+Modified and shared PTE: (R/O + DIRTY_SW)
+R/O PTE, COW'ed: (R/O + DIRTY_SW)
+SHSTK PTE: (R/O + DIRTY_HW)
+SHSTK PTE, COW'ed: (R/O + DIRTY_HW)
+SHSTK PTE, shared: (R/O + DIRTY_SW)
+
+Note that DIRTY_SW is only used in R/O PTEs but not R/W PTEs.
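+
+For example, write-protecting a PTE moves DIRTY_HW to DIRTY_SW so
+that the result cannot be mistaken for a SHSTK PTE (a sketch of the
+logic introduced later in this series):
+
+  static inline pte_t pte_wrprotect(pte_t pte)
+  {
+          /* R/O + DIRTY_HW would look like a SHSTK PTE */
+          pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
+          return pte_clear_flags(pte, _PAGE_RW);
+  }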
+
+[8] The implementation of IBT
+=============================
+
+The kernel provides IBT support by mmap()'ing the legacy code
+bitmap. However, the management of the bitmap is done by GLIBC or
+the application.
--
2.17.1


2018-07-10 22:36:21

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 10/27] x86/mm: Introduce _PAGE_DIRTY_SW

A RO and dirty PTE exists in the following cases:

(a) A page is modified and then shared with a fork()'ed child;
(b) A R/O page that has been COW'ed;
(c) A SHSTK page.

The processor does not read the dirty bit for (a) and (b), but
checks the dirty bit for (c). To prevent the use of non-SHSTK
memory as SHSTK, we introduce a spare bit of the 64-bit PTE as
_PAGE_BIT_DIRTY_SW and use that for (a) and (b). This results
in the following possible PTE settings:

Modified PTE: (R/W + DIRTY_HW)
Modified and shared PTE: (R/O + DIRTY_SW)
R/O PTE COW'ed: (R/O + DIRTY_SW)
SHSTK PTE: (R/O + DIRTY_HW)
SHSTK PTE COW'ed: (R/O + DIRTY_HW)
SHSTK PTE shared: (R/O + DIRTY_SW)

Note that _PAGE_BIT_DIRTY_SW is only used in R/O PTEs but
not in R/W PTEs.

When this patch is applied, there are six free bits left in
the 64-bit PTE. There are no free bits left in the 32-bit
PTE (except for PAE), so shadow stack is not implemented
for the 32-bit kernel.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 109 ++++++++++++++++++++++++---
arch/x86/include/asm/pgtable_types.h | 14 +++-
include/asm-generic/pgtable.h | 21 ++++++
3 files changed, 132 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 28806f8f36c3..ecbd3539a864 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -116,9 +116,9 @@ extern pmdval_t early_pmd_flags;
* The following only work if pte_present() is true.
* Undefined behaviour if not..
*/
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
{
- return pte_flags(pte) & _PAGE_DIRTY;
+ return pte_flags(pte) & _PAGE_DIRTY_BITS;
}


@@ -140,9 +140,9 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}

-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_DIRTY;
+ return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
}

static inline int pmd_young(pmd_t pmd)
@@ -150,9 +150,9 @@ static inline int pmd_young(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_ACCESSED;
}

-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
{
- return pud_flags(pud) & _PAGE_DIRTY;
+ return pud_flags(pud) & _PAGE_DIRTY_BITS;
}

static inline int pud_young(pud_t pud)
@@ -281,9 +281,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
return native_make_pte(v & ~clear);
}

+#if defined(CONFIG_X86_INTEL_SHADOW_STACK_USER)
+static inline pte_t pte_move_flags(pte_t pte, pteval_t from, pteval_t to)
+{
+ if (pte_flags(pte) & from)
+ pte = pte_set_flags(pte_clear_flags(pte, from), to);
+ return pte;
+}
+#else
+static inline pte_t pte_move_flags(pte_t pte, pteval_t from, pteval_t to)
+{
+ return pte;
+}
+#endif
+
static inline pte_t pte_mkclean(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_DIRTY);
+ return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
}

static inline pte_t pte_mkold(pte_t pte)
@@ -293,6 +307,7 @@ static inline pte_t pte_mkold(pte_t pte)

static inline pte_t pte_wrprotect(pte_t pte)
{
+ pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
return pte_clear_flags(pte, _PAGE_RW);
}

@@ -303,9 +318,27 @@ static inline pte_t pte_mkexec(pte_t pte)

static inline pte_t pte_mkdirty(pte_t pte)
{
+ pteval_t dirty = (!IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) ||
+ pte_write(pte)) ? _PAGE_DIRTY_HW : _PAGE_DIRTY_SW;
+ return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+#ifdef CONFIG_ARCH_HAS_SHSTK
+static inline pte_t pte_mkdirty_shstk(pte_t pte)
+{
+ pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

+static inline bool is_shstk_pte(pte_t pte)
+{
+ pteval_t val;
+
+ val = pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY_HW);
+ return (val == _PAGE_DIRTY_HW);
+}
+#endif
+
static inline pte_t pte_mkyoung(pte_t pte)
{
return pte_set_flags(pte, _PAGE_ACCESSED);
@@ -313,6 +346,7 @@ static inline pte_t pte_mkyoung(pte_t pte)

static inline pte_t pte_mkwrite(pte_t pte)
{
+ pte = pte_move_flags(pte, _PAGE_DIRTY_SW, _PAGE_DIRTY_HW);
return pte_set_flags(pte, _PAGE_RW);
}

@@ -360,6 +394,20 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
return native_make_pmd(v & ~clear);
}

+#if defined(CONFIG_X86_INTEL_SHADOW_STACK_USER)
+static inline pmd_t pmd_move_flags(pmd_t pmd, pmdval_t from, pmdval_t to)
+{
+ if (pmd_flags(pmd) & from)
+ pmd = pmd_set_flags(pmd_clear_flags(pmd, from), to);
+ return pmd;
+}
+#else
+static inline pmd_t pmd_move_flags(pmd_t pmd, pmdval_t from, pmdval_t to)
+{
+ return pmd;
+}
+#endif
+
static inline pmd_t pmd_mkold(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
@@ -367,19 +415,39 @@ static inline pmd_t pmd_mkold(pmd_t pmd)

static inline pmd_t pmd_mkclean(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_DIRTY);
+ return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
}

static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
+ pmd = pmd_move_flags(pmd, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
return pmd_clear_flags(pmd, _PAGE_RW);
}

static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
+ pmdval_t dirty = (!IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) ||
+ (pmd_flags(pmd) & _PAGE_RW)) ?
+ _PAGE_DIRTY_HW : _PAGE_DIRTY_SW;
+ return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+#ifdef CONFIG_ARCH_HAS_SHSTK
+static inline pmd_t pmd_mkdirty_shstk(pmd_t pmd)
+{
+ pmd = pmd_clear_flags(pmd, _PAGE_DIRTY_SW);
return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

+static inline bool is_shstk_pmd(pmd_t pmd)
+{
+ pmdval_t val;
+
+ val = pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY_HW);
+ return (val == _PAGE_DIRTY_HW);
+}
+#endif
+
static inline pmd_t pmd_mkdevmap(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_DEVMAP);
@@ -397,6 +465,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)

static inline pmd_t pmd_mkwrite(pmd_t pmd)
{
+ pmd = pmd_move_flags(pmd, _PAGE_DIRTY_SW, _PAGE_DIRTY_HW);
return pmd_set_flags(pmd, _PAGE_RW);
}

@@ -419,6 +488,20 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
return native_make_pud(v & ~clear);
}

+#if defined(CONFIG_X86_INTEL_SHADOW_STACK_USER)
+static inline pud_t pud_move_flags(pud_t pud, pudval_t from, pudval_t to)
+{
+ if (pud_flags(pud) & from)
+ pud = pud_set_flags(pud_clear_flags(pud, from), to);
+ return pud;
+}
+#else
+static inline pud_t pud_move_flags(pud_t pud, pudval_t from, pudval_t to)
+{
+ return pud;
+}
+#endif
+
static inline pud_t pud_mkold(pud_t pud)
{
return pud_clear_flags(pud, _PAGE_ACCESSED);
@@ -426,17 +509,22 @@ static inline pud_t pud_mkold(pud_t pud)

static inline pud_t pud_mkclean(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_DIRTY);
+ return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
}

static inline pud_t pud_wrprotect(pud_t pud)
{
+ pud = pud_move_flags(pud, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
return pud_clear_flags(pud, _PAGE_RW);
}

static inline pud_t pud_mkdirty(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
+ pudval_t dirty = (!IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER) ||
+ (pud_flags(pud) & _PAGE_RW)) ?
+ _PAGE_DIRTY_HW : _PAGE_DIRTY_SW;
+
+ return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
}

static inline pud_t pud_mkdevmap(pud_t pud)
@@ -456,6 +544,7 @@ static inline pud_t pud_mkyoung(pud_t pud)

static inline pud_t pud_mkwrite(pud_t pud)
{
+ pud = pud_move_flags(pud, _PAGE_DIRTY_SW, _PAGE_DIRTY_HW);
return pud_set_flags(pud, _PAGE_RW);
}

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 806abf530f50..4bad635beaab 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,6 +23,7 @@
#define _PAGE_BIT_SOFTW2 10 /* " */
#define _PAGE_BIT_SOFTW3 11 /* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
+#define _PAGE_BIT_SOFTW5 57 /* available for programmer */
#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
#define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
#define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
@@ -34,6 +35,7 @@
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
+#define _PAGE_BIT_DIRTY_SW _PAGE_BIT_SOFTW5 /* was written to */

/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -108,6 +110,14 @@
#define _PAGE_DEVMAP (_AT(pteval_t, 0))
#endif

+#if defined(CONFIG_X86_INTEL_SHADOW_STACK_USER)
+#define _PAGE_DIRTY_SW (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_SW)
+#else
+#define _PAGE_DIRTY_SW (_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_DIRTY_SW)
+
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

#define _PAGE_TABLE_NOENC (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |\
@@ -121,9 +131,9 @@
* instance, and is *not* included in this mask since
* pte_modify() does modify it.
*/
-#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
+#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW | \
- _PAGE_SOFT_DIRTY)
+ _PAGE_DIRTY_SW | _PAGE_SOFT_DIRTY)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)

/*
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f59639afaa39..4ee683c9ac19 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1097,4 +1097,25 @@ static inline void init_espfix_bsp(void) { }
#endif
#endif

+#ifndef CONFIG_ARCH_HAS_SHSTK
+static inline pte_t pte_mkdirty_shstk(pte_t pte)
+{
+ return pte;
+}
+static inline bool is_shstk_pte(pte_t pte)
+{
+ return false;
+}
+
+static inline pmd_t pmd_mkdirty_shstk(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline bool is_shstk_pmd(pmd_t pmd)
+{
+ return false;
+}
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
--
2.17.1


2018-07-10 22:36:40

by Yu-cheng Yu

[permalink] [raw]
Subject: [RFC PATCH v2 04/27] x86/fpu/xstate: Add XSAVES system states for shadow stack

Intel Control-flow Enforcement Technology (CET) introduces the
following MSRs into the XSAVES system states.

IA32_U_CET (user-mode CET settings),
IA32_PL3_SSP (user-mode shadow stack),
IA32_PL0_SSP (kernel-mode shadow stack),
IA32_PL1_SSP (ring-1 shadow stack),
IA32_PL2_SSP (ring-2 shadow stack).

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/fpu/types.h | 22 +++++++++++++++++++++
arch/x86/include/asm/fpu/xstate.h | 4 +++-
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/fpu/xstate.c | 10 ++++++++++
4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c53918ecf..e55d51d172f1 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -114,6 +114,9 @@ enum xfeature {
XFEATURE_Hi16_ZMM,
XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
XFEATURE_PKRU,
+ XFEATURE_RESERVED,
+ XFEATURE_SHSTK_USER,
+ XFEATURE_SHSTK_KERNEL,

XFEATURE_MAX,
};
@@ -128,6 +131,8 @@ enum xfeature {
#define XFEATURE_MASK_Hi16_ZMM (1 << XFEATURE_Hi16_ZMM)
#define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
+#define XFEATURE_MASK_SHSTK_USER (1 << XFEATURE_SHSTK_USER)
+#define XFEATURE_MASK_SHSTK_KERNEL (1 << XFEATURE_SHSTK_KERNEL)

#define XFEATURE_MASK_FPSSE (XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
#define XFEATURE_MASK_AVX512 (XFEATURE_MASK_OPMASK \
@@ -229,6 +234,23 @@ struct pkru_state {
u32 pad;
} __packed;

+/*
+ * State component 11 is Control flow Enforcement user states
+ */
+struct cet_user_state {
+ u64 u_cet; /* user control flow settings */
+ u64 user_ssp; /* user shadow stack pointer */
+} __packed;
+
+/*
+ * State component 12 is Control flow Enforcement kernel states
+ */
+struct cet_kernel_state {
+ u64 kernel_ssp; /* kernel shadow stack */
+ u64 pl1_ssp; /* ring-1 shadow stack */
+ u64 pl2_ssp; /* ring-2 shadow stack */
+} __packed;
+
struct xstate_header {
u64 xfeatures;
u64 xcomp_bv;
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index a32dc5f8c963..662562cbafe9 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -31,7 +31,9 @@
XFEATURE_MASK_Hi16_ZMM | \
XFEATURE_MASK_PKRU | \
XFEATURE_MASK_BNDREGS | \
- XFEATURE_MASK_BNDCSR)
+ XFEATURE_MASK_BNDCSR | \
+ XFEATURE_MASK_SHSTK_USER | \
+ XFEATURE_MASK_SHSTK_KERNEL)

#ifdef CONFIG_X86_64
#define REX_PREFIX "0x48, "
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..25311ec4b731 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
#define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
#define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */
#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_CET_BIT 23 /* enable Control flow Enforcement */
+#define X86_CR4_CET _BITUL(X86_CR4_CET_BIT)

/*
* x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index dd2c561c4544..91c0f665567b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -35,6 +35,9 @@ static const char *xfeature_names[] =
"Processor Trace (unused)" ,
"Protection Keys User registers",
"unknown xstate feature" ,
+ "Control flow User registers" ,
+ "Control flow Kernel registers" ,
+ "unknown xstate feature" ,
};

static short xsave_cpuid_features[] __initdata = {
@@ -48,6 +51,9 @@ static short xsave_cpuid_features[] __initdata = {
X86_FEATURE_AVX512F,
X86_FEATURE_INTEL_PT,
X86_FEATURE_PKU,
+ 0, /* Unused */
+ X86_FEATURE_SHSTK, /* XFEATURE_SHSTK_USER */
+ X86_FEATURE_SHSTK, /* XFEATURE_SHSTK_KERNEL */
};

/*
@@ -316,6 +322,8 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
+ print_xstate_feature(XFEATURE_MASK_SHSTK_USER);
+ print_xstate_feature(XFEATURE_MASK_SHSTK_KERNEL);
}

/*
@@ -562,6 +570,8 @@ static void check_xstate_against_struct(int nr)
XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
+ XCHECK_SZ(sz, nr, XFEATURE_SHSTK_USER, struct cet_user_state);
+ XCHECK_SZ(sz, nr, XFEATURE_SHSTK_KERNEL, struct cet_kernel_state);

/*
* Make *SURE* to add any feature numbers in below if
--
2.17.1


2018-07-10 22:47:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 11/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> + /*
> + * On platforms before CET, other threads could race to
> + * create a RO and _PAGE_DIRTY_HW PMD again. However,
> + * on CET platforms, this is safe without a TLB flush.
> + */

If I didn't work for Intel, I'd wonder what the heck CET is and what the
heck it has to do with _PAGE_DIRTY_HW. I think we need a better comment
than this. How about:

Some processors can _start_ a write, but end up seeing
a read-only PTE by the time they get to setting the
Dirty bit. In this case, they will set the Dirty bit,
leaving a read-only, Dirty PTE which looks like a Shadow
Stack PTE.

However, this behavior has been improved and will *not* occur on
processors supporting Shadow Stacks. Without this guarantee, a
transition to a non-present PTE and flush the TLB would be
needed.


2018-07-10 22:54:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 12/27] x86/mm: Shadow stack page fault error checking

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> +++ b/arch/x86/include/asm/traps.h
> @@ -157,6 +157,7 @@ enum {
> * bit 3 == 1: use of reserved bit detected
> * bit 4 == 1: fault was an instruction fetch
> * bit 5 == 1: protection keys block access
> + * bit 6 == 1: shadow stack access fault
> */

Could we document this bit better?

Is this a fault where the *processor* thought it should be a shadow
stack fault? Or is it also set on faults to valid shadow stack PTEs
that just happen to fault for other reasons, say protection keys?

2018-07-10 23:08:01

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 13/27] mm: Handle shadow stack page fault

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> + if (is_shstk_mapping(vma->vm_flags))
> + entry = pte_mkdirty_shstk(entry);
> + else
> + entry = pte_mkdirty(entry);
> +
> + entry = maybe_mkwrite(entry, vma);
> if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
> update_mmu_cache(vma, vmf->address, vmf->pte);
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2526,7 +2532,11 @@ static int wp_page_copy(struct vm_fault *vmf)
> }
> flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> entry = mk_pte(new_page, vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + if (is_shstk_mapping(vma->vm_flags))
> + entry = pte_mkdirty_shstk(entry);
> + else
> + entry = pte_mkdirty(entry);
> + entry = maybe_mkwrite(entry, vma);

Do we want to lift this hunk of code and put it elsewhere? Maybe:

entry = pte_set_vma_features(entry, vma);

and then:

pte_t pte_set_vma_features(pte_t entry, struct vm_area_struct)
{
/*
* Shadow stack PTEs are always dirty and always
* writable. They have a different encoding for
* this than normal PTEs, though.
*/
if (is_shstk_mapping(vma->vm_flags))
entry = pte_mkdirty_shstk(entry);
else
entry = pte_mkdirty(entry);

entry = maybe_mkwrite(entry, vma);

return entry;
}

> /*
> * Clear the pte entry and flush it first, before updating the
> * pte with the new entry. This will avoid a race condition
> @@ -3201,6 +3211,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
> mem_cgroup_commit_charge(page, memcg, false, false);
> lru_cache_add_active_or_unevictable(page, vma);
> setpte:
> + /*
> + * If this is within a shadow stack mapping, mark
> + * the PTE dirty. We don't use pte_mkdirty(),
> + * because the PTE must have _PAGE_DIRTY_HW set.
> + */
> + if (is_shstk_mapping(vma->vm_flags))
> + entry = pte_mkdirty_shstk(entry);
> +
> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

I'm not sure this is the right spot to do this.

The other code does pte_mkdirty_shstk() near where we do the
pte_mkwrite(). Why not here? I think you might have picked this
because it's a common path used by both allocated pages and zero pages.
But, we can't have the zero pages in shadow stack areas since they can't
be read-only. I think you need to move this up. Can you even
consolidate it with the other two pte_mkdirty_shstk() call sites?

> /* No need to invalidate - it was non-present before */
> @@ -3983,6 +4001,14 @@ static int handle_pte_fault(struct vm_fault *vmf)
> entry = vmf->orig_pte;
> if (unlikely(!pte_same(*vmf->pte, entry)))
> goto unlock;
> +
> + /*
> + * Shadow stack PTEs are copy-on-access, so do_wp_page()
> + * handling on them no matter if we have write fault or not.
> + */

I'd say this differently:

Shadow stack PTEs can not be read-only and because of that can
not have traditional copy-on-write semantics. This essentially
performs a copy-on-write operation, but on *any* access, not
just actual writes.

2018-07-10 23:09:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/27] mm: Handle THP/HugeTLB shadow stack page fault

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> @@ -1347,6 +1353,8 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
> pmd_t entry;
> entry = mk_huge_pmd(new_page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + if (is_shstk_mapping(vma->vm_flags))
> + entry = pmd_mkdirty_shstk(entry);

This pattern is repeated enough that it makes me wonder if we should
just be doing the shadowstack PTE creation in mk_huge_pmd() itself.

Or, should we just be setting the shadowstack pte bit combination in
vma->vm_page_prot so we don't have to go set it explicitly every time?
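
For the pmd side, the analog of the pte_set_vma_features() idea from
the patch 13 thread would look something like this (untested):

	static inline pmd_t pmd_set_vma_features(pmd_t pmd,
						 struct vm_area_struct *vma)
	{
		/*
		 * Shadow stack PMDs are always dirty, but they encode
		 * dirty differently than normal PMDs.
		 */
		if (is_shstk_mapping(vma->vm_flags))
			pmd = pmd_mkdirty_shstk(pmd);
		else
			pmd = pmd_mkdirty(pmd);

		return maybe_pmd_mkwrite(pmd, vma);
	}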

2018-07-10 23:12:01

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 15/27] mm/mprotect: Prevent mprotect from changing shadow stack

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> Signed-off-by: Yu-cheng Yu <[email protected]>

This still needs a changelog, even if you think it's simple.
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -446,6 +446,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
> error = -ENOMEM;
> if (!vma)
> goto out;
> +
> + /*
> + * Do not allow changing shadow stack memory.
> + */
> + if (vma->vm_flags & VM_SHSTK) {
> + error = -EINVAL;
> + goto out;
> + }
> +

I think this is a _bit_ draconian. Why shouldn't we be able to use
protection keys with a shadow stack? Or, set it to PROT_NONE?

2018-07-10 23:25:06

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH v2 11/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW

at 6:44 PM, Dave Hansen <[email protected]> wrote:

> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
>> + /*
>> + * On platforms before CET, other threads could race to
>> + * create a RO and _PAGE_DIRTY_HW PMD again. However,
>> + * on CET platforms, this is safe without a TLB flush.
>> + */
>
> If I didn't work for Intel, I'd wonder what the heck CET is and what the
> heck it has to do with _PAGE_DIRTY_HW. I think we need a better comment
> than this. How about:
>
> Some processors can _start_ a write, but end up seeing
> a read-only PTE by the time they get to getting the
> Dirty bit. In this case, they will set the Dirty bit,
> leaving a read-only, Dirty PTE which looks like a Shadow
> Stack PTE.
>
> However, this behavior has been improved and will *not* occur on
> processors supporting Shadow Stacks. Without this guarantee, a
> transition to a non-present PTE and flush the TLB would be
> needed.

Interesting. Does that relate to the Knights Landing bug or something more
general?

Will the write succeed or trigger a page-fault in this case?

[ I know it is not related to the patch, but I would appreciate it if you
shared your knowledge ]

Regards,
Nadav

2018-07-10 23:26:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 12/27] x86/mm: Shadow stack page fault error checking

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> + /*
> + * Verify X86_PF_SHSTK is within a shadow stack VMA.
> + * It is always an error if there is a shadow stack
> + * fault outside a shadow stack VMA.
> + */
> + if (error_code & X86_PF_SHSTK) {
> + if (!(vma->vm_flags & VM_SHSTK))
> + return 1;
> + return 0;
> + }

It turns out that an X86_PF_SHSTK fault just means that the processor
faulted while accessing something it thinks should be a shadow-stack
virtual address.

But, we *can* have faults on shadow stack accesses for non-shadow-stack
reasons.

I think you need to remove the 'return 0' and let it fall through to the
other access checks that we might be failing. If it's a shadow stack
access, it has to be a shadow stack VMA. But, a shadow-stack access
fault to a shadow stack VMA isn't _necessarily_ OK.
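
Roughly (untested):

	/*
	 * A shadow-stack access must be within a shadow stack VMA.
	 * Beyond that, it is subject to the same access checks as
	 * any other fault.
	 */
	if ((error_code & X86_PF_SHSTK) && !(vma->vm_flags & VM_SHSTK))
		return 1;
	/* fall through to the remaining access checks */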

2018-07-10 23:38:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> There are three possible shadow stack PTE settings:
>
> Normal SHSTK PTE: (R/O + DIRTY_HW)
> SHSTK PTE COW'ed: (R/O + DIRTY_HW)
> SHSTK PTE shared as R/O data: (R/O + DIRTY_SW)
>
> Update can_follow_write_pte/pmd for the shadow stack.

First of all, thanks for the excellent patch headers. It's nice to have
that reference every time even though it's repeated.

> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> +static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
> + bool shstk)
> {
> + bool pte_cowed = shstk ? is_shstk_pte(pte):pte_dirty(pte);
> +
> return pte_write(pte) ||
> - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> + ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_cowed);
> }

Can we just pass the VMA in here? This use is OK-ish, but I generally
detest true/false function arguments because you can't tell what they
are when they show up without a named variable.

But... Why does this even matter? Your own example showed that all
shadowstack PTEs have either DIRTY_HW or DIRTY_SW set, and pte_dirty()
checks both.

That makes this check seem a bit superfluous.
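
If the distinction does end up mattering, passing the VMA would look
something like this (untested):

	static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
						struct vm_area_struct *vma)
	{
		bool cowed = is_shstk_mapping(vma->vm_flags) ?
			     is_shstk_pte(pte) : pte_dirty(pte);

		return pte_write(pte) ||
		       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && cowed);
	}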

2018-07-10 23:41:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> +static __init int setup_disable_shstk(char *s)
> +{
> + /* require an exact match without trailing characters */
> + if (strlen(s))
> + return 0;
> +
> + if (!boot_cpu_has(X86_FEATURE_SHSTK))
> + return 1;
> +
> + setup_clear_cpu_cap(X86_FEATURE_SHSTK);
> + pr_info("x86: 'no_cet_shstk' specified, disabling Shadow Stack\n");
> + return 1;
> +}
> +__setup("no_cet_shstk", setup_disable_shstk);

Why do we need a boot-time disable for this?

2018-07-10 23:50:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

> +/*
> + * WRUSS is a kernel instruction but writes to user
> + * shadow stack memory. When a fault occurs, both
> + * X86_PF_USER and X86_PF_SHSTK are set.
> + */
> +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> +{
> + return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> + (X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> +}

I thought X86_PF_USER was set based on the mode in which the fault
occurred. Does this mean that the architecture of this bit is different
now?

That seems like something we need to call out if so. It also means we
need to update the SDM because some of the text is wrong.

> static void
> show_fault_oops(struct pt_regs *regs, unsigned long error_code,
> unsigned long address)
> @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
> struct task_struct *tsk = current;
>
> /* User mode accesses just cause a SIGSEGV */
> - if (error_code & X86_PF_USER) {
> + if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
> /*
> * It's possible to have interrupts off here:
> */

This needs a comment explaining why is_wruss() is special.

2018-07-10 23:53:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 11/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW

On 07/10/2018 04:23 PM, Nadav Amit wrote:
> at 6:44 PM, Dave Hansen <[email protected]> wrote:
>
>> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
>>> + /*
>>> + * On platforms before CET, other threads could race to
>>> + * create a RO and _PAGE_DIRTY_HW PMD again. However,
>>> + * on CET platforms, this is safe without a TLB flush.
>>> + */
>>
>> If I didn't work for Intel, I'd wonder what the heck CET is and what the
>> heck it has to do with _PAGE_DIRTY_HW. I think we need a better comment
>> than this. How about:
>>
>> Some processors can _start_ a write, but end up seeing
>> a read-only PTE by the time they get to getting the
>> Dirty bit. In this case, they will set the Dirty bit,
>> leaving a read-only, Dirty PTE which looks like a Shadow
>> Stack PTE.
>>
>> However, this behavior has been improved and will *not* occur on
>> processors supporting Shadow Stacks. Without this guarantee, a
>> transition to a non-present PTE and flush the TLB would be
>> needed.
>
> Interesting. Does that regard the knights landing bug or something more
> general?

It's more general.

> Will the write succeed or trigger a page-fault in this case?

It will trigger a page fault.

2018-07-10 23:58:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 23/27] mm/mmap: Add IBT bitmap size to address space limit check

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> The indirect branch tracking legacy bitmap takes a large address
> space. This causes may_expand_vm() failure on the address limit
> check. For an IBT-enabled task, add the bitmap size to the
> address limit.

This appears to require that we set up
current->thread.cet.ibt_bitmap_size _before_ calling may_expand_vm().
What keeps the ibt_mmap() itself from hitting the address limit?

2018-07-11 00:12:34

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

Is this feature *integral* to shadow stacks? Or, should it just be in a
different series?

> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index d9ae3d86cdd7..71da2cccba16 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -12,7 +12,10 @@ struct task_struct;
> struct cet_status {
> unsigned long shstk_base;
> unsigned long shstk_size;
> + unsigned long ibt_bitmap_addr;
> + unsigned long ibt_bitmap_size;
> unsigned int shstk_enabled:1;
> + unsigned int ibt_enabled:1;
> };

Is there a reason we're not using pointers here? This seems like the
kind of place that we probably want __user pointers.


> +static unsigned long ibt_mmap(unsigned long addr, unsigned long len)
> +{
> + struct mm_struct *mm = current->mm;
> + unsigned long populate;
> +
> + down_write(&mm->mmap_sem);
> + addr = do_mmap(NULL, addr, len, PROT_READ | PROT_WRITE,
> + MAP_ANONYMOUS | MAP_PRIVATE,
> + VM_DONTDUMP, 0, &populate, NULL);
> + up_write(&mm->mmap_sem);
> +
> + if (populate)
> + mm_populate(addr, populate);
> +
> + return addr;
> +}

We're going to have to start consolidating these at some point. We have
at least three of them now, maybe more.

> +int cet_setup_ibt_bitmap(void)
> +{
> + u64 r;
> + unsigned long bitmap;
> + unsigned long size;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_IBT))
> + return -EOPNOTSUPP;
> +
> + size = TASK_SIZE_MAX / PAGE_SIZE / BITS_PER_BYTE;

Just a note: this table is going to be gigantic on 5-level paging
systems, and userspace won't, by default use any of that extra address
space. I think it ends up being a 512GB allocation in a 128TB address
space.

Is that a problem?

On 5-level paging systems, maybe we should just stick it up in the high
part of the address space.
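
For scale, a back-of-the-envelope sketch of that arithmetic; the
TASK_SIZE_MAX values below are assumptions about 4-level vs. 5-level
paging, not numbers taken from the patch.

        #include <stdio.h>

        /* the legacy bitmap holds one bit per 4 KiB page of user space */
        static unsigned long long bitmap_bytes(unsigned long long task_size)
        {
                return task_size / 4096 / 8;
        }

        int main(void)
        {
                /* 4-level paging, 2^47 bytes of user space: 4 GiB bitmap */
                printf("%llu GiB\n", bitmap_bytes(1ULL << 47) >> 30);
                /* 5-level paging, 2^56 bytes: the bitmap grows 512x */
                printf("%llu GiB\n", bitmap_bytes(1ULL << 56) >> 30);
                return 0;
        }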

> + bitmap = ibt_mmap(0, size);
> +
> + if (bitmap >= TASK_SIZE_MAX)
> + return -ENOMEM;
> +
> + bitmap &= PAGE_MASK;

We're page-aligning the result of an mmap()? Why?

> + rdmsrl(MSR_IA32_U_CET, r);
> + r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> + wrmsrl(MSR_IA32_U_CET, r);

Comments, please. What is this doing, logically? Also, why are we
OR'ing the results into this MSR? What are we trying to preserve?

> + current->thread.cet.ibt_bitmap_addr = bitmap;
> + current->thread.cet.ibt_bitmap_size = size;
> + return 0;
> +}
> +
> +void cet_disable_ibt(void)
> +{
> + u64 r;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_IBT))
> + return;

Does this need a check for being already disabled?

> + rdmsrl(MSR_IA32_U_CET, r);
> + r &= ~(MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_LEG_IW_EN |
> + MSR_IA32_CET_NO_TRACK_EN);
> + wrmsrl(MSR_IA32_U_CET, r);
> + current->thread.cet.ibt_enabled = 0;
> +}

What's the locking for current->thread.cet?

> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 705467839ce8..c609c9ce5691 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -413,7 +413,8 @@ __setup("nopku", setup_disable_pku);
>
> static __always_inline void setup_cet(struct cpuinfo_x86 *c)
> {
> - if (cpu_feature_enabled(X86_FEATURE_SHSTK))
> + if (cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> + cpu_feature_enabled(X86_FEATURE_IBT))
> cr4_set_bits(X86_CR4_CET);
> }
>
> @@ -434,6 +435,23 @@ static __init int setup_disable_shstk(char *s)
> __setup("no_cet_shstk", setup_disable_shstk);
> #endif
>
> +#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
> +static __init int setup_disable_ibt(char *s)
> +{
> + /* require an exact match without trailing characters */
> + if (strlen(s))
> + return 0;
> +
> + if (!boot_cpu_has(X86_FEATURE_IBT))
> + return 1;
> +
> + setup_clear_cpu_cap(X86_FEATURE_IBT);
> + pr_info("x86: 'no_cet_ibt' specified, disabling Branch Tracking\n");
> + return 1;
> +}
> +__setup("no_cet_ibt", setup_disable_ibt);
> +#endif
> /*
> * Some CPU features depend on higher CPUID levels, which may not always
> * be available due to CPUID level capping or broken virtualization
> diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
> index 233f6dad9c1f..42e08d3b573e 100644
> --- a/arch/x86/kernel/elf.c
> +++ b/arch/x86/kernel/elf.c
> @@ -15,6 +15,7 @@
> #include <linux/fs.h>
> #include <linux/uaccess.h>
> #include <linux/string.h>
> +#include <linux/compat.h>
>
> /*
> * The .note.gnu.property layout:
> @@ -222,7 +223,8 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,
>
> struct elf64_hdr *ehdr64 = ehdr_p;
>
> - if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> + !cpu_feature_enabled(X86_FEATURE_IBT))
> return 0;
>
> if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
> @@ -250,6 +252,9 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,
> current->thread.cet.shstk_enabled = 0;
> current->thread.cet.shstk_base = 0;
> current->thread.cet.shstk_size = 0;
> + current->thread.cet.ibt_enabled = 0;
> + current->thread.cet.ibt_bitmap_addr = 0;
> + current->thread.cet.ibt_bitmap_size = 0;
> if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> if (shstk) {
> err = cet_setup_shstk();
> @@ -257,6 +262,15 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,
> goto out;
> }
> }
> +
> + if (cpu_feature_enabled(X86_FEATURE_IBT)) {
> + if (ibt) {
> + err = cet_setup_ibt();
> + if (err < 0)
> + goto out;
> + }
> + }

You introduced 'ibt' before it was used. Please wait to introduce it
until you actually use it to make it easier to review.

Also, what's wrong with:

if (cpu_feature_enabled(X86_FEATURE_IBT) && ibt) {
...
}

?


2018-07-11 08:29:29

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH v2 05/27] Documentation/x86: Add CET description

On Tue 2018-07-10 15:26:17, Yu-cheng Yu wrote:
> Explain how CET works and the no_cet_shstk/no_cet_ibt kernel
> parameters.
>

> --- /dev/null
> +++ b/Documentation/x86/intel_cet.txt
> @@ -0,0 +1,250 @@
> +=========================================
> +Control Flow Enforcement Technology (CET)
> +=========================================

We normally use .rst for this kind of formatted text.


> +[6] The implementation of the SHSTK
> +===================================
> +
> +SHSTK size
> +----------
> +
> +A task's SHSTK is allocated from memory at a fixed size that can
> +support 32 K nested function calls; that is 256 KB for a 64-bit
> +application and 128 KB for a 32-bit application. The system admin
> +can change the default size.

How does admin change that? We already have ulimit for stack size,
should those be somehow tied together?

$ ulimit -a
...
stack size (kbytes, -s) 8192


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-11 08:36:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/27] mm: Introduce VM_SHSTK for shadow stack memory

On Tue, Jul 10, 2018 at 03:26:20PM -0700, Yu-cheng Yu wrote:
> VM_SHSTK indicates a shadow stack memory area.
>
> A shadow stack PTE must be read-only and dirty. For non shadow
> stack, we use a spare bit of the 64-bit PTE for dirty. The PTE
> changes are in the next patch.

This doesn't make any sense.. the $subject and the patch seem completely
unrelated to this Changelog.

2018-07-11 08:46:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/27] x86/mm: Introduce _PAGE_DIRTY_SW

On Tue, Jul 10, 2018 at 03:26:22PM -0700, Yu-cheng Yu wrote:
> + pte = pte_clear_flags(pte, _PAGE_DIRTY_SW);
> return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);

Having both _PAGE_DIRTY_SW and _PAGE_SOFT_DIRTY is really confusing.

I'm not sure I have an answer for this, but urggh.

2018-07-11 08:49:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 11/27] x86/mm: Modify ptep_set_wrprotect and pmdp_set_wrprotect for _PAGE_DIRTY_SW

On Tue, Jul 10, 2018 at 03:44:32PM -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> > + /*
> > + * On platforms before CET, other threads could race to
> > + * create a RO and _PAGE_DIRTY_HW PMD again. However,
> > + * on CET platforms, this is safe without a TLB flush.
> > + */
>
> If I didn't work for Intel, I'd wonder what the heck CET is and what the
> heck it has to do with _PAGE_DIRTY_HW. I think we need a better comment

And Changelog, the provided one is abysmal.

> than this. How about:
>
> Some processors can _start_ a write, but end up seeing
> a read-only PTE by the time they get to getting the
> Dirty bit. In this case, they will set the Dirty bit,
> leaving a read-only, Dirty PTE which looks like a Shadow
> Stack PTE.
>
> However, this behavior has been improved and will *not* occur on
> processors supporting Shadow Stacks. Without this guarantee, a
> transition to a non-present PTE and a TLB flush would be
> needed.

I'm still struggling. I think I get the first paragraph, but then what?

2018-07-11 09:08:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 13/27] mm: Handle shadow stack page fault

On Tue, Jul 10, 2018 at 04:06:25PM -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> > + if (is_shstk_mapping(vma->vm_flags))
> > + entry = pte_mkdirty_shstk(entry);
> > + else
> > + entry = pte_mkdirty(entry);
> > +
> > + entry = maybe_mkwrite(entry, vma);
> > if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
> > update_mmu_cache(vma, vmf->address, vmf->pte);
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > @@ -2526,7 +2532,11 @@ static int wp_page_copy(struct vm_fault *vmf)
> > }
> > flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> > entry = mk_pte(new_page, vma->vm_page_prot);
> > - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > + if (is_shstk_mapping(vma->vm_flags))
> > + entry = pte_mkdirty_shstk(entry);
> > + else
> > + entry = pte_mkdirty(entry);
> > + entry = maybe_mkwrite(entry, vma);
>
> Do we want to lift this hunk of code and put it elsewhere? Maybe:
>
> entry = pte_set_vma_features(entry, vma);
>
> and then:
>
> pte_t pte_set_vma_features(pte_t entry, struct vm_area_struct)
> {
> /*
> * Shadow stack PTEs are always dirty and always
> * writable. They have a different encoding for
> * this than normal PTEs, though.
> */
> if (is_shstk_mapping(vma->vm_flags))
> entry = pte_mkdirty_shstk(entry);
> else
> entry = pte_mkdirty(entry);
>
> entry = maybe_mkwrite(entry, vma);
>
> return entry;
> }

Yes, that wants a helper like that. Not sold on the name, but whatever.

Is there any way we can hide all the shadow stack magic in arch code?
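
One possible shape for that, sketched with an invented hook name
following the weak-default pattern of the generic pgtable headers; the
x86 override would be the only place that knows about shadow stacks:

        /* include/asm-generic/pgtable.h: generic fallback */
        #ifndef pte_mkdirty_vma
        static inline pte_t pte_mkdirty_vma(pte_t pte, struct vm_area_struct *vma)
        {
                return pte_mkdirty(pte);
        }
        #endif

        /* arch/x86/include/asm/pgtable.h: shadow-stack-aware override */
        static inline pte_t pte_mkdirty_vma(pte_t pte, struct vm_area_struct *vma)
        {
                return is_shstk_mapping(vma->vm_flags) ?
                       pte_mkdirty_shstk(pte) : pte_mkdirty(pte);
        }
        #define pte_mkdirty_vma pte_mkdirty_vma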

2018-07-11 09:12:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/27] mm: Handle THP/HugeTLB shadow stack page fault

On Tue, Jul 10, 2018 at 03:26:26PM -0700, Yu-cheng Yu wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index a2695dbc0418..f7c46d61eaea 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4108,7 +4108,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&vmf, orig_pmd);
>
> - if (dirty && !pmd_write(orig_pmd)) {
> + /*
> + * Shadow stack trans huge PMDs are copy-on-access,
> > + * so wp_huge_pmd() on them no matter if we have a
> + * write fault or not.
> + */
> + if (is_shstk_mapping(vma->vm_flags) ||
> + (dirty && !pmd_write(orig_pmd))) {
> ret = wp_huge_pmd(&vmf, orig_pmd);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;

Can't we do this (and the do_wp_page thing) by setting FAULT_FLAG_WRITE
in the arch fault handler on shadow stack faults?
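
Concretely, that could be a two-line change in the x86 fault handler; a
sketch, where X86_PF_SHSTK is the error-code bit added earlier in this
series:

        /* in __do_page_fault(), before handle_mm_fault() sees 'flags' */
        if (error_code & X86_PF_SHSTK)
                flags |= FAULT_FLAG_WRITE; /* shadow stack faults behave as writes */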

2018-07-11 09:13:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 15/27] mm/mprotect: Prevent mprotect from changing shadow stack

On Tue, Jul 10, 2018 at 04:10:08PM -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> > Signed-off-by: Yu-cheng Yu <[email protected]>
>
> This still needs a changelog, even if you think it's simple.
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -446,6 +446,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
> > error = -ENOMEM;
> > if (!vma)
> > goto out;
> > +
> > + /*
> > + * Do not allow changing shadow stack memory.
> > + */
> > + if (vma->vm_flags & VM_SHSTK) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > +
>
> I think this is a _bit_ draconian. Why shouldn't we be able to use
> protection keys with a shadow stack? Or, set it to PROT_NONE?

Right, and then there's also madvise() and some of the other accessors.

Why do we need to disallow this? AFAICT the worst that can happen is
that a process wrecks itself, so what?

2018-07-11 09:22:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/27] x86/mm: Introduce _PAGE_DIRTY_SW

On Tue, Jul 10, 2018 at 03:26:22PM -0700, Yu-cheng Yu wrote:
> +static inline bool is_shstk_pte(pte_t pte)
> +{
> + pteval_t val;
> +
> + val = pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY_HW);
> + return (val == _PAGE_DIRTY_HW);
> +}

That's against naming convention here.

static inline bool pte_shstk(pte_t pte)
{
return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY_HW)) == _PAGE_DIRTY_HW;
}

would be more in style with the rest of this code.

2018-07-11 09:37:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

On Tue, Jul 10, 2018 at 03:26:29PM -0700, Yu-cheng Yu wrote:
> +/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
> +#define MSR_IA32_CET_SHSTK_EN 0x0000000000000001
> +#define MSR_IA32_CET_WRSS_EN 0x0000000000000002
> +#define MSR_IA32_CET_ENDBR_EN 0x0000000000000004
> +#define MSR_IA32_CET_LEG_IW_EN 0x0000000000000008
> +#define MSR_IA32_CET_NO_TRACK_EN 0x0000000000000010

Do those want a ULL literal suffix?
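
That is, something like:

        #define MSR_IA32_CET_SHSTK_EN           0x0000000000000001ULL
        #define MSR_IA32_CET_WRSS_EN            0x0000000000000002ULL
        #define MSR_IA32_CET_ENDBR_EN           0x0000000000000004ULL
        #define MSR_IA32_CET_LEG_IW_EN          0x0000000000000008ULL
        #define MSR_IA32_CET_NO_TRACK_EN        0x0000000000000010ULL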

2018-07-11 09:38:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

On Tue, Jul 10, 2018 at 03:26:29PM -0700, Yu-cheng Yu wrote:
> +struct cet_status {
> + unsigned long shstk_base;
> + unsigned long shstk_size;
> + unsigned int shstk_enabled:1;
> +};

> @@ -498,6 +499,10 @@ struct thread_struct {
> unsigned int sig_on_uaccess_err:1;
> unsigned int uaccess_err:1; /* uaccess failed */
>
> +#ifdef CONFIG_X86_INTEL_CET
> + struct cet_status cet;
> +#endif
> +
> /* Floating point and extended processor state */
> struct fpu fpu;
> /*

Why does that need a structure? That avoids folding the bitfields.

2018-07-11 09:46:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> WRUSS is a new kernel-mode instruction but writes directly
> to user shadow stack memory. This is used to construct
> a return address on the shadow stack for the signal
> handler.
>
> This instruction can fault if the user shadow stack memory
> is invalid. In that case, the kernel handles the fixup.
>

> +static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
> +{
> + int err = 0;
> +
> + asm volatile("1: wrussq %[val], (%[addr])\n"
> + "xor %[err], %[err]\n"

this XOR is superfluous, you already cleared @err above.

> + "2:\n"
> + ".section .fixup,\"ax\"\n"
> + "3: mov $-1, %[err]; jmp 2b\n"
> + ".previous\n"
> + _ASM_EXTABLE(1b, 3b)
> + : [err] "=a" (err)
> + : [val] "S" (val), [addr] "D" (addr));
> +
> + return err;
> +}
> +#endif /* CONFIG_X86_INTEL_CET */
> +
> #define nop() asm volatile ("nop")

What happened to:

https://lkml.kernel.org/r/[email protected]
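
For reference, a sketch of the helper with the redundant XOR dropped;
the output constraint would then need to be "+a" rather than "=a" so
the pre-initialized err is live across the asm (that constraint change
is an editorial assumption, not part of the posted patch):

        static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
        {
                int err = 0;

                asm volatile("1: wrussq %[val], (%[addr])\n"
                             "2:\n"
                             ".section .fixup,\"ax\"\n"
                             "3: mov $-1, %[err]; jmp 2b\n"
                             ".previous\n"
                             _ASM_EXTABLE(1b, 3b)
                             : [err] "+a" (err)
                             : [val] "S" (val), [addr] "D" (addr));

                return err;
        }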

2018-07-11 09:47:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
> index e0b85930dd77..72bb7c48a7df 100644
> --- a/arch/x86/lib/x86-opcode-map.txt
> +++ b/arch/x86/lib/x86-opcode-map.txt
> @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
> f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
> f2: ANDN Gy,By,Ey (v)
> f3: Grp17 (1A)
> -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
> +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
> f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
> f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
> EndTable

Where are all the other instructions? ISTR that documentation patch
listing a whole bunch of new instructions, not just WRUSS.

2018-07-11 09:59:45

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH v2 05/27] Documentation/x86: Add CET description

On 07/11/2018 12:26 AM, Yu-cheng Yu wrote:

> +To build a CET-enabled kernel, Binutils v2.30 and GCC v8.1 or later
> +are required. To build a CET-enabled application, GLIBC v2.29 or
> +later is also required.

Have you given up on getting the required changes into glibc 2.28?

Thanks,
Florian

2018-07-11 10:13:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On Tue, Jul 10, 2018 at 03:26:28PM -0700, Yu-cheng Yu wrote:
> There are three possible shadow stack PTE settings:
>
> Normal SHSTK PTE: (R/O + DIRTY_HW)
> SHSTK PTE COW'ed: (R/O + DIRTY_HW)
> SHSTK PTE shared as R/O data: (R/O + DIRTY_SW)

I count _2_ distinct states there.

> Update can_follow_write_pte/pmd for the shadow stack.

So the below disallows can_follow_write when shstk && _PAGE_DIRTY_SW,
but this here Changelog doesn't explain why. Doesn't even get close.

Also, the code is a right mess :/ Can't we try harder to not let this
shadow stack stuff escape arch code.

2018-07-11 10:25:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET


* Yu-cheng Yu <[email protected]> wrote:

> Add PTRACE interface for CET MSRs.

Please *always* describe new ABIs in the changelog, in a precise, well-documented
way.

> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index e2ee403865eb..ac2bc3a18427 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -49,7 +49,9 @@ enum x86_regset {
> REGSET_IOPERM64 = REGSET_XFP,
> REGSET_XSTATE,
> REGSET_TLS,
> + REGSET_CET64 = REGSET_TLS,
> REGSET_IOPERM32,
> + REGSET_CET32,
> };

Why does REGSET_CET64 alias on REGSET_TLS?

> struct pt_regs_offset {
> @@ -1276,6 +1278,13 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
> .size = sizeof(long), .align = sizeof(long),
> .active = ioperm_active, .get = ioperm_get
> },
> + [REGSET_CET64] = {
> + .core_note_type = NT_X86_CET,
> + .n = sizeof(struct cet_user_state) / sizeof(u64),
> + .size = sizeof(u64), .align = sizeof(u64),
> + .active = cetregs_active, .get = cetregs_get,
> + .set = cetregs_set
> + },

Ok, could we first please make this part of the regset code more readable and
start the series with a standalone clean-up patch that changes these initializers
to something more readable:

[REGSET_CET64] = {
.core_note_type = NT_X86_CET,
.n = sizeof(struct cet_user_state) / sizeof(u64),
.size = sizeof(u64),
.align = sizeof(u64),
.active = cetregs_active,
.get = cetregs_get,
.set = cetregs_set
},

? (I'm demonstrating the cleanup based on REGSET_CET64, but this should be done on
every other entry first.)


> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -401,6 +401,7 @@ typedef struct elf64_shdr {
> #define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */
> #define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */
> #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
> +#define NT_X86_CET 0x203 /* x86 cet state */

Acronyms in comments should be in capital letters.

Also, I think I asked this before: why does "Control Flow Enforcement" abbreviate
to "CET" (which is a well-known acronym for "Central European Time"), not to CFE?

Thanks,

Ingo

2018-07-11 11:14:41

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH v2 20/27] x86/cet/shstk: ELF header parsing of CET

On 07/11/2018 12:26 AM, Yu-cheng Yu wrote:
> + /*
> + * PT_NOTE segment is small. Read at most
> + * PAGE_SIZE.
> + */
> + if (note_size > PAGE_SIZE)
> + note_size = PAGE_SIZE;

That's not really true. There are some huge PT_NOTE segments out there.

Why can't you check the notes after the executable has been mapped?

Thanks,
Florian

2018-07-11 15:11:16

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH v2 27/27] x86/cet: Add arch_prctl functions for CET

On 07/11/2018 12:26 AM, Yu-cheng Yu wrote:
> arch_prctl(ARCH_CET_DISABLE, unsigned long features)
> Disable SHSTK and/or IBT specified in 'features'. Return -EPERM
> if CET is locked out.
>
> arch_prctl(ARCH_CET_LOCK)
> Lock out CET feature.

Isn't it a “lock in” rather than a “lock out”?

Thanks,
Florian

2018-07-11 16:35:46

by H.J. Lu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 05/27] Documentation/x86: Add CET description

On Wed, Jul 11, 2018 at 2:57 AM, Florian Weimer <[email protected]> wrote:
> On 07/11/2018 12:26 AM, Yu-cheng Yu wrote:
>
>> +To build a CET-enabled kernel, Binutils v2.30 and GCC v8.1 or later
>> +are required. To build a CET-enabled application, GLIBC v2.29 or
>> +later is also required.
>
>
> Have you given up on getting the required changes into glibc 2.28?
>

This is a typo. We are still targeting for 2.28. All pieces are there.

--
H.J.

2018-07-11 17:29:21

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Wed, 2018-07-11 at 11:45 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> >
> > diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-
> > opcode-map.txt
> > index e0b85930dd77..72bb7c48a7df 100644
> > --- a/arch/x86/lib/x86-opcode-map.txt
> > +++ b/arch/x86/lib/x86-opcode-map.txt
> > @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32
> > Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
> >  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32
> > Gd,Ew (66&F2)
> >  f2: ANDN Gy,By,Ey (v)
> >  f3: Grp17 (1A)
> > -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > (F2),(v)
> > +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > (F2),(v) | WRUSS Pq,Qq (66),REX.W
> >  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
> >  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By
> > (F3),(v) | SHRX Gy,Ey,By (F2),(v)
> >  EndTable
> Where are all the other instructions? ISTR that documentation patch
> listing a whole bunch of new instructions, not just WRUSS.

Currently we only use WRUSS in the kernel code.  Do we want to add all
instructions here?

Yu-cheng

2018-07-11 17:32:53

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Wed, 2018-07-11 at 11:44 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> >
> > WRUSS is a new kernel-mode instruction but writes directly
> > to user shadow stack memory.  This is used to construct
> > a return address on the shadow stack for the signal
> > handler.
> >
> > This instruction can fault if the user shadow stack memory
> > is invalid.  In that case, the kernel handles the fixup.
> >
> >
> > +static inline int write_user_shstk_64(unsigned long addr, unsigned
> > long val)
> > +{
> > + int err = 0;
> > +
> > + asm volatile("1: wrussq %[val], (%[addr])\n"
> > +      "xor %[err], %[err]\n"
> this XOR is superfluous, you already cleared @err above.

I will fix it.

>
> >
> > +      "2:\n"
> > +      ".section .fixup,\"ax\"\n"
> > +      "3: mov $-1, %[err]; jmp 2b\n"
> > +      ".previous\n"
> > +      _ASM_EXTABLE(1b, 3b)
> > +      : [err] "=a" (err)
> > +      : [val] "S" (val), [addr] "D" (addr));
> > +
> > + return err;
> > +}
> > +#endif /* CONFIG_X86_INTEL_CET */
> > +
> >  #define nop() asm volatile ("nop")
> What happened to:
>
>   https://lkml.kernel.org/r/[email protected]

Yes, I put that in once and realized we only need to skip the
instruction and return err.  Do you think we still need a handler for
that?

Yu-cheng

2018-07-11 17:50:09

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 05/27] Documentation/x86: Add CET description

On Wed, 2018-07-11 at 10:27 +0200, Pavel Machek wrote:
> On Tue 2018-07-10 15:26:17, Yu-cheng Yu wrote:
> >
> > Explain how CET works and the no_cet_shstk/no_cet_ibt kernel
> > parameters.
> >
> >
> > --- /dev/null
> > +++ b/Documentation/x86/intel_cet.txt
> > @@ -0,0 +1,250 @@
> > +=========================================
> > +Control Flow Enforcement Technology (CET)
> > +=========================================
> We normally use .rst for this kind of formatted text.

I will change this to a .rst file.

>
>
> >
> > +[6] The implementation of the SHSTK
> > +===================================
> > +
> > +SHSTK size
> > +----------
> > +
> > +A task's SHSTK is allocated from memory at a fixed size that can
> > +support 32 K nested function calls; that is 256 KB for a 64-bit
> > +application and 128 KB for a 32-bit application.  The system admin
> > +can change the default size.
> How does admin change that? We already have ulimit for stack size,
> should those be somehow tied together?
>
> $ ulimit -a
> ...
> stack size              (kbytes, -s) 8192
>

We can do that.  This makes sense to me.

Yu-cheng


2018-07-11 18:30:22

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Wed, 2018-07-11 at 17:27 +0200, Peter Zijlstra wrote:
> On Wed, Jul 11, 2018 at 07:58:09AM -0700, Yu-cheng Yu wrote:
> >
> > On Wed, 2018-07-11 at 11:45 +0200, Peter Zijlstra wrote:
> > >
> > > On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> > > >
> > > >
> > > > diff --git a/arch/x86/lib/x86-opcode-map.txt
> > > > b/arch/x86/lib/x86-
> > > > opcode-map.txt
> > > > index e0b85930dd77..72bb7c48a7df 100644
> > > > --- a/arch/x86/lib/x86-opcode-map.txt
> > > > +++ b/arch/x86/lib/x86-opcode-map.txt
> > > > @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32
> > > > Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
> > > >  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32
> > > > Gd,Ew (66&F2)
> > > >  f2: ANDN Gy,By,Ey (v)
> > > >  f3: Grp17 (1A)
> > > > -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > > (F2),(v)
> > > > +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > > (F2),(v) | WRUSS Pq,Qq (66),REX.W
> > > >  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey
> > > > (F2),(v)
> > > >  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX
> > > > Gy,Ey,By
> > > > (F3),(v) | SHRX Gy,Ey,By (F2),(v)
> > > >  EndTable
> > > Where are all the other instructions? ISTR that documentation
> > > patch
> > > listing a whole bunch of new instructions, not just WRUSS.
> > Currently we only use WRUSS in the kernel code.  Do we want to add
> > all
> > instructions here?
> Yes, since we also use the in-kernel decoder to decode random
> userspace
> code.

I will add other instructions.

Yu-cheng


2018-07-11 19:12:47

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/27] mm: Introduce VM_SHSTK for shadow stack memory

On Wed, 2018-07-11 at 10:34 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:20PM -0700, Yu-cheng Yu wrote:
> >
> > VM_SHSTK indicates a shadow stack memory area.
> >
> > A shadow stack PTE must be read-only and dirty.  For non shadow
> > stack, we use a spare bit of the 64-bit PTE for dirty.  The PTE
> > changes are in the next patch.
> This doesn't make any sense.. the $subject and the patch seem
> completely
> unrelated to this Changelog.

I was trying to say why this is only defined for 64-bit.  I will fix
it.

Yu-cheng

2018-07-11 19:31:23

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 23/27] mm/mmap: Add IBT bitmap size to address space limit check

On Tue, 2018-07-10 at 16:57 -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> >
> > The indirect branch tracking legacy bitmap takes a large address
> > space.  This causes may_expand_vm() failure on the address limit
> > check.  For an IBT-enabled task, add the bitmap size to the
> > address limit.
> This appears to require that we set up
> current->thread.cet.ibt_bitmap_size _before_ calling may_expand_vm().
> What keeps the ibt_mmap() itself from hitting the address limit?

Yes, that is overlooked.  I will fix it.

Thanks,
Yu-cheng

2018-07-11 19:38:10

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On Tue, 2018-07-10 at 16:37 -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> >
> > There are three possible shadow stack PTE settings:
> >
> >   Normal SHSTK PTE: (R/O + DIRTY_HW)
> >   SHSTK PTE COW'ed: (R/O + DIRTY_HW)
> >   SHSTK PTE shared as R/O data: (R/O + DIRTY_SW)
> >
> > Update can_follow_write_pte/pmd for the shadow stack.
> First of all, thanks for the excellent patch headers.  It's nice to
> have
> that reference every time even though it's repeated.
>
> >
> > -static inline bool can_follow_write_pte(pte_t pte, unsigned int
> > flags)
> > +static inline bool can_follow_write_pte(pte_t pte, unsigned int
> > flags,
> > + bool shstk)
> >  {
> > + bool pte_cowed = shstk ? is_shstk_pte(pte):pte_dirty(pte);
> > +
> >   return pte_write(pte) ||
> > - ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> > pte_dirty(pte));
> > + ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> > pte_cowed);
> >  }
> Can we just pass the VMA in here?  This use is OK-ish, but I
> generally
> detest true/false function arguments because you can't tell what they
> are when they show up without a named variable.
>
> But...  Why does this even matter?  Your own example showed that all
> shadowstack PTEs have either DIRTY_HW or DIRTY_SW set, and
> pte_dirty()
> checks both.
>
> That makes this check seem a bit superfluous.

My understanding is that we don't want to follow write pte if the page
is shared as read-only.  For a SHSTK page, that is (R/O + DIRTY_SW),
which means the SHSTK page has not been COW'ed.  Is that right?

Thanks,
Yu-cheng
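
A sketch of the pass-the-VMA variant Dave suggests, using the helpers
from this series:

        static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
                                                struct vm_area_struct *vma)
        {
                /* a shadow stack page is COW'ed only once it is R/O + DIRTY_HW */
                bool cowed = is_shstk_mapping(vma->vm_flags) ?
                             is_shstk_pte(pte) : pte_dirty(pte);

                return pte_write(pte) ||
                       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && cowed);
        }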

2018-07-11 20:16:45

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH v2 27/27] x86/cet: Add arch_prctl functions for CET

On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
>
> arch_prctl(ARCH_CET_STATUS, unsigned long *addr)
> Return CET feature status.
>
> The parameter 'addr' is a pointer to a user buffer.
> On returning to the caller, the kernel fills the following
> information:
>
> *addr = SHSTK/IBT status
> *(addr + 1) = SHSTK base address
> *(addr + 2) = SHSTK size
>
> arch_prctl(ARCH_CET_DISABLE, unsigned long features)
> Disable SHSTK and/or IBT specified in 'features'. Return -EPERM
> if CET is locked out.
>
> arch_prctl(ARCH_CET_LOCK)
> Lock out CET feature.
>
> arch_prctl(ARCH_CET_ALLOC_SHSTK, unsigned long *addr)
> Allocate a new SHSTK.
>
> The parameter 'addr' is a pointer to a user buffer and indicates
> the desired SHSTK size to allocate. On returning to the caller
> the buffer contains the address of the new SHSTK.
>
> arch_prctl(ARCH_CET_LEGACY_BITMAP, unsigned long *addr)
> Allocate an IBT legacy code bitmap if the current task does not
> have one.
>
> The parameter 'addr' is a pointer to a user buffer.
> On returning to the caller, the kernel fills the following
> information:
>
> *addr = IBT bitmap base address
> *(addr + 1) = IBT bitmap size
>
> Signed-off-by: H.J. Lu <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
[...]
> diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
> new file mode 100644
> index 000000000000..86bb78ae656d
> --- /dev/null
> +++ b/arch/x86/kernel/cet_prctl.c
> @@ -0,0 +1,141 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/errno.h>
> +#include <linux/uaccess.h>
> +#include <linux/prctl.h>
> +#include <linux/compat.h>
> +#include <asm/processor.h>
> +#include <asm/prctl.h>
> +#include <asm/elf.h>
> +#include <asm/elf_property.h>
> +#include <asm/cet.h>
> +
> +/* See Documentation/x86/intel_cet.txt. */
> +
> +static int handle_get_status(unsigned long arg2)
> +{
> + unsigned int features = 0;
> + unsigned long shstk_base, shstk_size;
> +
> + if (current->thread.cet.shstk_enabled)
> + features |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
> + if (current->thread.cet.ibt_enabled)
> + features |= GNU_PROPERTY_X86_FEATURE_1_IBT;
> +
> + shstk_base = current->thread.cet.shstk_base;
> + shstk_size = current->thread.cet.shstk_size;
> +
> + if (in_ia32_syscall()) {
> + unsigned int buf[3];
> +
> + buf[0] = features;
> + buf[1] = (unsigned int)shstk_base;
> + buf[2] = (unsigned int)shstk_size;
> + return copy_to_user((unsigned int __user *)arg2, buf,
> + sizeof(buf));
> + } else {
> + unsigned long buf[3];
> +
> + buf[0] = (unsigned long)features;
> + buf[1] = shstk_base;
> + buf[2] = shstk_size;
> + return copy_to_user((unsigned long __user *)arg2, buf,
> + sizeof(buf));
> + }

Other places in the kernel (e.g. the BPF subsystem) just
unconditionally use u64 instead of unsigned long to avoid having to
switch between different sizes. I wonder whether that would make sense
here?
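
A sketch of what that would look like here, reusing the field and
constant names from the patch above; it also maps copy_to_user()'s
bytes-not-copied return value onto -EFAULT, which the original appears
to return directly:

        static int handle_get_status(unsigned long arg2)
        {
                u64 buf[3] = { 0 };

                if (current->thread.cet.shstk_enabled)
                        buf[0] |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
                if (current->thread.cet.ibt_enabled)
                        buf[0] |= GNU_PROPERTY_X86_FEATURE_1_IBT;
                buf[1] = current->thread.cet.shstk_base;
                buf[2] = current->thread.cet.shstk_size;

                return copy_to_user((u64 __user *)arg2, buf, sizeof(buf)) ?
                       -EFAULT : 0;
        }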

> +}
> +
> +static int handle_alloc_shstk(unsigned long arg2)
> +{
> + int err = 0;
> + unsigned long shstk_size = 0;
> +
> + if (in_ia32_syscall()) {
> + unsigned int size;
> +
> + err = get_user(size, (unsigned int __user *)arg2);
> + if (!err)
> + shstk_size = size;
> + } else {
> + err = get_user(shstk_size, (unsigned long __user *)arg2);
> + }

As above.

> + if (err)
> + return -EFAULT;
> +
> + err = cet_alloc_shstk(&shstk_size);
> + if (err)
> + return -err;
> +
> + if (in_ia32_syscall()) {
> + if (put_user(shstk_size, (unsigned int __user *)arg2))
> + return -EFAULT;
> + } else {
> + if (put_user(shstk_size, (unsigned long __user *)arg2))
> + return -EFAULT;
> + }
> + return 0;
> +}
> +
> +static int handle_bitmap(unsigned long arg2)
> +{
> + unsigned long addr, size;
> +
> + if (current->thread.cet.ibt_enabled) {
> + if (!current->thread.cet.ibt_bitmap_addr)
> + cet_setup_ibt_bitmap();
> + addr = current->thread.cet.ibt_bitmap_addr;
> + size = current->thread.cet.ibt_bitmap_size;
> + } else {
> + addr = 0;
> + size = 0;
> + }
> +
> + if (in_compat_syscall()) {
> + if (put_user(addr, (unsigned int __user *)arg2) ||
> + put_user(size, (unsigned int __user *)arg2 + 1))
> + return -EFAULT;
> + } else {
> + if (put_user(addr, (unsigned long __user *)arg2) ||
> + put_user(size, (unsigned long __user *)arg2 + 1))
> + return -EFAULT;
> + }
> + return 0;
> +}
> +
> +int prctl_cet(int option, unsigned long arg2)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> + !cpu_feature_enabled(X86_FEATURE_IBT))
> + return -EINVAL;
> +
> + switch (option) {
> + case ARCH_CET_STATUS:
> + return handle_get_status(arg2);
> +
> + case ARCH_CET_DISABLE:
> + if (current->thread.cet.locked)
> + return -EPERM;
> + if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
> + cet_disable_free_shstk(current);
> + if (arg2 & GNU_PROPERTY_X86_FEATURE_1_IBT)
> + cet_disable_ibt();
> +
> + return 0;
> +
> + case ARCH_CET_LOCK:
> + current->thread.cet.locked = 1;
> + return 0;
> +
> + case ARCH_CET_ALLOC_SHSTK:
> + return handle_alloc_shstk(arg2);
> +
> + /*
> + * Allocate legacy bitmap and return address & size to user.
> + */
> + case ARCH_CET_LEGACY_BITMAP:
> + return handle_bitmap(arg2);
> +
> + default:
> + return -EINVAL;
> + }
> +}
> diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
> index 42e08d3b573e..3d4934fdac7f 100644
> --- a/arch/x86/kernel/elf.c
> +++ b/arch/x86/kernel/elf.c
> @@ -8,7 +8,10 @@
>
> #include <asm/cet.h>
> #include <asm/elf_property.h>
> +#include <asm/prctl.h>
> +#include <asm/processor.h>
> #include <uapi/linux/elf-em.h>
> +#include <uapi/linux/prctl.h>
> #include <linux/binfmts.h>
> #include <linux/elf.h>
> #include <linux/slab.h>
> @@ -255,6 +258,7 @@ int arch_setup_features(void *ehdr_p, void *phdr_p,
> current->thread.cet.ibt_enabled = 0;
> current->thread.cet.ibt_bitmap_addr = 0;
> current->thread.cet.ibt_bitmap_size = 0;
> + current->thread.cet.locked = 0;
> if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> if (shstk) {
> err = cet_setup_shstk();
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 43a57d284a22..259b92664981 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -795,6 +795,12 @@ long do_arch_prctl_common(struct task_struct *task, int option,
> return get_cpuid_mode();
> case ARCH_SET_CPUID:
> return set_cpuid_mode(task, cpuid_enabled);
> + case ARCH_CET_STATUS:
> + case ARCH_CET_DISABLE:
> + case ARCH_CET_LOCK:
> + case ARCH_CET_ALLOC_SHSTK:
> + case ARCH_CET_LEGACY_BITMAP:
> + return prctl_cet(option, cpuid_enabled);
> }
>
> return -EINVAL;
> --
> 2.17.1
>

2018-07-11 21:00:38

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 05/27] Documentation/x86: Add CET description

On Wed, 2018-07-11 at 06:47 -0700, H.J. Lu wrote:
> On Wed, Jul 11, 2018 at 2:57 AM, Florian Weimer <[email protected]>
> wrote:
> >
> > On 07/11/2018 12:26 AM, Yu-cheng Yu wrote:
> >
> > >
> > > +To build a CET-enabled kernel, Binutils v2.30 and GCC v8.1 or
> > > later
> > > +are required.  To build a CET-enabled application, GLIBC v2.29
> > > or
> > > +later is also required.
> >
> > Have you given up on getting the required changes into glibc 2.28?
> >
> This is a typo.  We are still targeting for 2.28.  All pieces are
> there.
>

Ok, I will fix it.

Yu-cheng

2018-07-11 21:04:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Wed, Jul 11, 2018 at 08:06:55AM -0700, Yu-cheng Yu wrote:
> On Wed, 2018-07-11 at 11:44 +0200, Peter Zijlstra wrote:

> > What happened to:
> >
> > https://lkml.kernel.org/r/[email protected]
>
> Yes, I put that in once and realized we only need to skip the
> instruction and return err. Do you think we still need a handler for
> that?

I find that other form more readable, but then there's Nadav doing asm
macros to shrink inline asm thingies so maybe he has another suggestion.

2018-07-11 21:04:57

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET

On Wed, 2018-07-11 at 12:20 +0200, Ingo Molnar wrote:
> * Yu-cheng Yu <[email protected]> wrote:
>
> >
> > Add PTRACE interface for CET MSRs.
> Please *always* describe new ABIs in the changelog, in a precise,
> well-documented 
> way.

Ok!

> >
> > diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> > index e2ee403865eb..ac2bc3a18427 100644
> > --- a/arch/x86/kernel/ptrace.c
> > +++ b/arch/x86/kernel/ptrace.c
> > @@ -49,7 +49,9 @@ enum x86_regset {
> >   REGSET_IOPERM64 = REGSET_XFP,
> >   REGSET_XSTATE,
> >   REGSET_TLS,
> > + REGSET_CET64 = REGSET_TLS,
> >   REGSET_IOPERM32,
> > + REGSET_CET32,
> >  };
> Why does REGSET_CET64 alias on REGSET_TLS?

In x86_64_regsets[], there is no [REGSET_TLS].  The core dump code
cannot handle holes in the array.

>
> >
> >  struct pt_regs_offset {
> > @@ -1276,6 +1278,13 @@ static struct user_regset x86_64_regsets[]
> > __ro_after_init = {
> >   .size = sizeof(long), .align = sizeof(long),
> >   .active = ioperm_active, .get = ioperm_get
> >   },
> > + [REGSET_CET64] = {
> > + .core_note_type = NT_X86_CET,
> > + .n = sizeof(struct cet_user_state) / sizeof(u64),
> > + .size = sizeof(u64), .align = sizeof(u64),
> > + .active = cetregs_active, .get = cetregs_get,
> > + .set = cetregs_set
> > + },
> Ok, could we first please make this part of the regset code more
> readable and 
> start the series with a standalone clean-up patch that changes these
> initializers 
> to something more readable:
>
> [REGSET_CET64] = {
> .core_note_type = NT_X86_CET,
> .n = sizeof(struct cet_user_state) /
> sizeof(u64),
> .size = sizeof(u64),
> .align = sizeof(u64),
> .active = cetregs_active,
> .get = cetregs_get,
> .set = cetregs_set
> },
>
> ? (I'm demonstrating the cleanup based on REGSET_CET64, but this
> should be done on 
> every other entry first.)
>

I will fix it.

>
> >
> > --- a/include/uapi/linux/elf.h
> > +++ b/include/uapi/linux/elf.h
> > @@ -401,6 +401,7 @@ typedef struct elf64_shdr {
> >  #define NT_386_TLS 0x200 /* i386 TLS slots
> > (struct user_desc) */
> >  #define NT_386_IOPERM 0x201 /* x86 io
> > permission bitmap (1=deny) */
> >  #define NT_X86_XSTATE 0x202 /* x86 extended
> > state using xsave */
> > +#define NT_X86_CET 0x203 /* x86 cet state */
> Acronyms in comments should be in capital letters.
>
> Also, I think I asked this before: why does "Control Flow
> Enforcement" abbreviate 
> to "CET" (which is a well-known acronym for "Central European Time"),
> not to CFE?
>

I don't know if I can change that, will find out.

Thanks,
Yu-cheng


2018-07-11 21:06:00

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

On Wed, 2018-07-11 at 11:34 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:29PM -0700, Yu-cheng Yu wrote:
> >
> > +/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
> > +#define MSR_IA32_CET_SHSTK_EN 0x0000000000000001
> > +#define MSR_IA32_CET_WRSS_EN 0x0000000000000002
> > +#define MSR_IA32_CET_ENDBR_EN 0x0000000000000004
> > +#define MSR_IA32_CET_LEG_IW_EN 0x0000000000000008
> > +#define MSR_IA32_CET_NO_TRACK_EN 0x0000000000000010
> Do those want a ULL literal suffix?

I will fix it.

2018-07-11 21:37:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Wed, Jul 11, 2018 at 07:58:09AM -0700, Yu-cheng Yu wrote:
> On Wed, 2018-07-11 at 11:45 +0200, Peter Zijlstra wrote:
> > On Tue, Jul 10, 2018 at 03:26:30PM -0700, Yu-cheng Yu wrote:
> > >
> > > diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-
> > > opcode-map.txt
> > > index e0b85930dd77..72bb7c48a7df 100644
> > > --- a/arch/x86/lib/x86-opcode-map.txt
> > > +++ b/arch/x86/lib/x86-opcode-map.txt
> > > @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32
> > > Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
> > >  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32
> > > Gd,Ew (66&F2)
> > >  f2: ANDN Gy,By,Ey (v)
> > >  f3: Grp17 (1A)
> > > -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > (F2),(v)
> > > +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey
> > > (F2),(v) | WRUSS Pq,Qq (66),REX.W
> > >  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
> > >  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By
> > > (F3),(v) | SHRX Gy,Ey,By (F2),(v)
> > >  EndTable
> > Where are all the other instructions? ISTR that documentation patch
> > listing a whole bunch of new instructions, not just wuss.
>
> Currently we only use WRUSS in the kernel code. Do we want to add all
> instructions here?

Yes, since we also use the in-kernel decoder to decode random userspace
code.

2018-07-11 22:11:06

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 15/27] mm/mprotect: Prevent mprotect from changing shadow stack

On Wed, 2018-07-11 at 11:12 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 04:10:08PM -0700, Dave Hansen wrote:
> >
> > On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> > >
> > > Signed-off-by: Yu-cheng Yu <[email protected]>
> > This still needs a changelog, even if you think it's simple.
> > >
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -446,6 +446,15 @@ static int do_mprotect_pkey(unsigned long
> > > start, size_t len,
> > >   error = -ENOMEM;
> > >   if (!vma)
> > >   goto out;
> > > +
> > > + /*
> > > +  * Do not allow changing shadow stack memory.
> > > +  */
> > > + if (vma->vm_flags & VM_SHSTK) {
> > > + error = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > I think this is a _bit_ draconian.  Why shouldn't we be able to use
> > protection keys with a shadow stack?  Or, set it to PROT_NONE?
> Right, and then there's also madvise() and some of the other
> accessors.
>
> Why do we need to disallow this? AFAICT the worst that can happen is
> that a process wrecks itself, so what?

Agree.  I will remove the patch.

2018-07-11 22:13:09

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/27] mm: Handle THP/HugeTLB shadow stack page fault

On Wed, 2018-07-11 at 11:10 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:26PM -0700, Yu-cheng Yu wrote:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a2695dbc0418..f7c46d61eaea 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4108,7 +4108,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >   if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> >   return do_huge_pmd_numa_page(&vmf, orig_pmd);
> >  
> > - if (dirty && !pmd_write(orig_pmd)) {
> > + /*
> > +  * Shadow stack trans huge PMDs are copy-on-access,
> > +  * so wp_huge_pmd() on them no matter if we have a
> > +  * write fault or not.
> > +  */
> > + if (is_shstk_mapping(vma->vm_flags) ||
> > +     (dirty && !pmd_write(orig_pmd))) {
> >   ret = wp_huge_pmd(&vmf, orig_pmd);
> >   if (!(ret & VM_FAULT_FALLBACK))
> >   return ret;
> Can't we do this (and the do_wp_page thing) by setting
> FAULT_FLAG_WRITE
> in the arch fault handler on shadow stack faults?

This can work.  I don't know if that will create other issues.
Let me think about that.

Yu-cheng

2018-07-11 22:43:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 15/27] mm/mprotect: Prevent mprotect from changing shadow stack

On 07/11/2018 09:07 AM, Yu-cheng Yu wrote:
>> Why do we need to disallow this? AFAICT the worst that can happen is
>> that a process wrecks itself, so what?
> Agree.  I will remove the patch.

No so quick. :)

We still need to find out a way to handle things that ask for an
mprotect() which is incompatible with shadow stacks. PROT_READ without
PROT_WRITE comes to mind. We also need to be careful that we don't
copy-on-write/copy-on-access pages which fault on PROT_NONE. I *think*
it'll get done correctly but we have to be sure.

BTW, where are all the selftests for this code? We're slowly building
up a list of pathological things that need to get tested.

I don't think this can or should get merged before we have selftests.
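
As a starting point, one pathological case from this thread could be
exercised like this; a sketch against the series' proposed ABI, where
the ARCH_CET_ALLOC_SHSTK value is an assumption taken from the uapi
patch and the right mprotect() outcome is exactly what is being
debated:

        #include <stdio.h>
        #include <errno.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>

        #ifndef ARCH_CET_ALLOC_SHSTK
        #define ARCH_CET_ALLOC_SHSTK 0x3004     /* assumed value */
        #endif

        int main(void)
        {
                unsigned long buf = 0x1000; /* in: size; out: SHSTK address */

                if (syscall(SYS_arch_prctl, ARCH_CET_ALLOC_SHSTK, &buf))
                        return 1;

                /* what should PROT_READ without PROT_WRITE do here? */
                if (mprotect((void *)buf, 0x1000, PROT_READ))
                        printf("mprotect(PROT_READ): errno=%d\n", errno);

                return 0;
        }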

2018-07-12 00:07:34

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 12/27] x86/mm: Shadow stack page fault error checking

On Tue, 2018-07-10 at 15:52 -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> >
> > +++ b/arch/x86/include/asm/traps.h
> > @@ -157,6 +157,7 @@ enum {
> >   *   bit 3 == 1: use of reserved
> > bit detected
> >   *   bit 4 == 1: fault was an
> > instruction fetch
> >   *   bit 5 == 1: protection keys
> > block access
> > + *   bit 6 == 1: shadow stack
> > access fault
> >   */
> Could we document this bit better?
>
> Is this a fault where the *processor* thought it should be a shadow
> stack fault?  Or is it also set on faults to valid shadow stack PTEs
> that just happen to fault for other reasons, say protection keys?

Thanks Vedvyas for explaining this to me.
I will add this to comments:

This flag is 1 if (1) CR4.CET = 1; and (2) the access causing the page-
fault exception was a shadow-stack data access.

So this bit does not report the reason for the fault. It reports the
type of access; i.e. it was a shadow-stack-load or a shadow-stack-store
that took the page fault. The fault could have been caused by any
variety of reasons including protection keys.
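
With that, the expanded comment in traps.h might read:

         *   bit 6 == 1: CR4.CET = 1 and the access causing the page-fault
         *               exception was a shadow-stack data access (a
         *               shadow-stack load or store). The bit encodes the
         *               type of access, not the reason for the fault.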


2018-07-12 01:28:07

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH v2 20/27] x86/cet/shstk: ELF header parsing of CET

On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
>
> Look in .note.gnu.property of an ELF file and check if shadow stack needs
> to be enabled for the task.
>
> Signed-off-by: H.J. Lu <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
[...]
> diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
> new file mode 100644
> index 000000000000..233f6dad9c1f
> --- /dev/null
> +++ b/arch/x86/kernel/elf.c
[...]
> +#define NOTE_SIZE_BAD(n, align, max) \
> + ((n->n_descsz < 8) || ((n->n_descsz % align) != 0) || \
> + (((u8 *)(n + 1) + 4 + n->n_descsz) > (max)))

Please do not compute out-of-bounds pointers and then compare them
against an expected maximum pointer. Computing an out-of-bounds
pointer is undefined behavior according to the C99 specification,
section "6.5.6 Additive operators", paragraph 8; and in this case,
n->n_descsz is 32 bits wide, which means that even if the compiler
isn't doing anything funny, if you're operating on addresses in the
last 4GiB of virtual memory and the pointer wraps around, this could
break.
In particular, if anyone ever uses this code in a 32-bit kernel, this
is going to blow up.
Please use size comparisons instead of pointer comparisons.
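
For instance, a size-based version of that check might look like the
following sketch; 'avail', the number of bytes from n to the end of
the buffer, is an invented parameter:

        static bool note_size_bad(const Elf64_Nhdr *n, u32 align, size_t avail)
        {
                if (n->n_descsz < 8 || (n->n_descsz % align) != 0)
                        return true;

                /* the header plus the 4-byte aligned name must fit first */
                if (avail < sizeof(*n) + 4)
                        return true;

                /* compare sizes; never compute an out-of-bounds pointer */
                return n->n_descsz > avail - sizeof(*n) - 4;
        }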

> +
> +/*
> + * Go through the property array and look for the one
> + * with pr_type of GNU_PROPERTY_X86_FEATURE_1_AND.
> + */
> +static u32 find_x86_feature_1(u8 *buf, u32 size, u32 align)
> +{
> + u8 *end = buf + size;
> + u8 *ptr = buf;
> +
> + while (1) {
> + u32 pr_type, pr_datasz;
> +
> + if ((ptr + 4) >= end)
> + break;

Theoretical UB.

> + pr_type = *(u32 *)ptr;
> + pr_datasz = *(u32 *)(ptr + 4);
> + ptr += 8;
> +
> + if ((ptr + pr_datasz) >= end)
> + break;

UB, like in NOTE_SIZE_BAD().

> + if (pr_type == GNU_PROPERTY_X86_FEATURE_1_AND &&
> + pr_datasz == 4)
> + return *(u32 *)ptr;
> +
> + ptr += pr_datasz;
> + }
> + return 0;
> +}
[...]
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 0ac456b52bdd..3395f6a631d5 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1081,6 +1081,22 @@ static int load_elf_binary(struct linux_binprm *bprm)
> goto out_free_dentry;
> }
>
> +#ifdef CONFIG_ARCH_HAS_PROGRAM_PROPERTIES
> +
> + if (interpreter) {
> + retval = arch_setup_features(&loc->interp_elf_ex,
> + interp_elf_phdata,
> + interpreter, true);
> + } else {
> + retval = arch_setup_features(&loc->elf_ex,
> + elf_phdata,
> + bprm->file, false);
> + }

So for non-static binaries, the ELF headers of ld.so determine whether
CET will be on or off for the entire system, right? Is the intent here
that ld.so should start with CET enabled, and then either use the
compatibility bitmap or turn CET off at runtime if the executable or
one of the libraries doesn't actually work with CET?

2018-07-12 02:54:53

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 27/27] x86/cet: Add arch_prctl functions for CET

On Wed, 2018-07-11 at 12:45 -0700, Jann Horn wrote:
> On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]>
> wrote:
> >
> >
> > arch_prctl(ARCH_CET_STATUS, unsigned long *addr)
> >     Return CET feature status.
> >
> >     The parameter 'addr' is a pointer to a user buffer.
> >     On returning to the caller, the kernel fills the following
> >     information:
> >
> >     *addr = SHSTK/IBT status
> >     *(addr + 1) = SHSTK base address
> >     *(addr + 2) = SHSTK size
> >
> > arch_prctl(ARCH_CET_DISABLE, unsigned long features)
> >     Disable SHSTK and/or IBT specified in 'features'.  Return
> > -EPERM
> >     if CET is locked out.
> >
> > arch_prctl(ARCH_CET_LOCK)
> >     Lock out CET feature.
> >
> > arch_prctl(ARCH_CET_ALLOC_SHSTK, unsigned long *addr)
> >     Allocate a new SHSTK.
> >
> >     The parameter 'addr' is a pointer to a user buffer and
> > indicates
> >     the desired SHSTK size to allocate.  On returning to the caller
> >     the buffer contains the address of the new SHSTK.
> >
> > arch_prctl(ARCH_CET_LEGACY_BITMAP, unsigned long *addr)
> >     Allocate an IBT legacy code bitmap if the current task does not
> >     have one.
> >
> >     The parameter 'addr' is a pointer to a user buffer.
> >     On returning to the caller, the kernel fills the following
> >     information:
> >
> >     *addr = IBT bitmap base address
> >     *(addr + 1) = IBT bitmap size
> >
> > Signed-off-by: H.J. Lu <[email protected]>
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> [...]
> >
> > diff --git a/arch/x86/kernel/cet_prctl.c
> > b/arch/x86/kernel/cet_prctl.c
> > new file mode 100644
> > index 000000000000..86bb78ae656d
> > --- /dev/null
> > +++ b/arch/x86/kernel/cet_prctl.c
> > @@ -0,0 +1,141 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#include <linux/errno.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/prctl.h>
> > +#include <linux/compat.h>
> > +#include <asm/processor.h>
> > +#include <asm/prctl.h>
> > +#include <asm/elf.h>
> > +#include <asm/elf_property.h>
> > +#include <asm/cet.h>
> > +
> > +/* See Documentation/x86/intel_cet.txt. */
> > +
> > +static int handle_get_status(unsigned long arg2)
> > +{
> > +       unsigned int features = 0;
> > +       unsigned long shstk_base, shstk_size;
> > +
> > +       if (current->thread.cet.shstk_enabled)
> > +               features |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
> > +       if (current->thread.cet.ibt_enabled)
> > +               features |= GNU_PROPERTY_X86_FEATURE_1_IBT;
> > +
> > +       shstk_base = current->thread.cet.shstk_base;
> > +       shstk_size = current->thread.cet.shstk_size;
> > +
> > +       if (in_ia32_syscall()) {
> > +               unsigned int buf[3];
> > +
> > +               buf[0] = features;
> > +               buf[1] = (unsigned int)shstk_base;
> > +               buf[2] = (unsigned int)shstk_size;
> > +               return copy_to_user((unsigned int __user *)arg2,
> > buf,
> > +                                   sizeof(buf));
> > +       } else {
> > +               unsigned long buf[3];
> > +
> > +               buf[0] = (unsigned long)features;
> > +               buf[1] = shstk_base;
> > +               buf[2] = shstk_size;
> > +               return copy_to_user((unsigned long __user *)arg2,
> > buf,
> > +                                   sizeof(buf));
> > +       }
> Other places in the kernel (e.g. the BPF subsystem) just
> unconditionally use u64 instead of unsigned long to avoid having to
> switch between different sizes. I wonder whether that would make
> sense
> here?

Yes, that simplifies the code.  I will make that change.

Yu-cheng

2018-07-12 02:55:25

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 20/27] x86/cet/shstk: ELF header parsing of CET

On Wed, 2018-07-11 at 12:37 -0700, Jann Horn wrote:
> On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]>
> wrote:
> >
> >
> > Look in .note.gnu.property of an ELF file and check if shadow stack
> > needs
> > to be enabled for the task.
> >
> > Signed-off-by: H.J. Lu <[email protected]>
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> [...]
> >
> > diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
> > new file mode 100644
> > index 000000000000..233f6dad9c1f
> > --- /dev/null
> > +++ b/arch/x86/kernel/elf.c
> [...]
> >
> > +#define NOTE_SIZE_BAD(n, align, max) \
> > +       ((n->n_descsz < 8) || ((n->n_descsz % align) != 0) || \
> > +        (((u8 *)(n + 1) + 4 + n->n_descsz) > (max)))
> Please do not compute out-of-bounds pointers and then compare them
> against an expected maximum pointer. Computing an out-of-bounds
> pointer is undefined behavior according to the C99 specification,
> section "6.5.6 Additive operators", paragraph 8; and in this case,
> n->n_descsz is 32 bits wide, which means that even if the compiler
> isn't doing anything funny, if you're operating on addresses in the
> last 4GiB of virtual memory and the pointer wraps around, this could
> break.
> In particular, if anyone ever uses this code in a 32-bit kernel, this
> is going to blow up.
> Please use size comparisons instead of pointer comparisons.

I will fix it.

> [...]
> >
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 0ac456b52bdd..3395f6a631d5 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -1081,6 +1081,22 @@ static int load_elf_binary(struct
> > linux_binprm *bprm)
> >                 goto out_free_dentry;
> >         }
> >
> > +#ifdef CONFIG_ARCH_HAS_PROGRAM_PROPERTIES
> > +
> > +       if (interpreter) {
> > +               retval = arch_setup_features(&loc->interp_elf_ex,
> > +                                            interp_elf_phdata,
> > +                                            interpreter, true);
> > +       } else {
> > +               retval = arch_setup_features(&loc->elf_ex,
> > +                                            elf_phdata,
> > +                                            bprm->file, false);
> > +       }
> So for non-static binaries, the ELF headers of ld.so determine
> whether
> CET will be on or off for the entire system, right? Is the intent
> here
> that ld.so should start with CET enabled, and then either use the
> compatibility bitmap or turn CET off at runtime if the executable or
> one of the libraries doesn't actually work with CET?


The kernel command-line options "no_cet_shstk" and "no_cet_ibt" turn
off CET features for the whole system.  The GLIBC tunable
"glibc.tune.hwcap=-SHSTK,-IBT" turns off CET features for the current
shell.  Another GLIBC tunable, "glibc.tune.x86_shstk=<on, permissive>",
determines, in the current shell, how dlopen() deals with SHSTK legacy
libs.
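
(For example, assuming the tunable names above, starting a shell as
"GLIBC_TUNABLES=glibc.tune.hwcap=-SHSTK,-IBT bash" would run that
shell and its children with both features turned off.)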

So, if ld.so's ELF header has SHSTK/IBT, and CET is enabled in the
current shell, ld.so itself runs with CET enabled.  If the application
executable and all its dependent libraries have CET, ld.so runs the
application with CET enabled.  Otherwise, ld.so turns off SHSTK (and/or
sets up the legacy bitmap for IBT) before passing control to the
application.

Yu-cheng

2018-07-12 02:57:46

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 27/27] x86/cet: Add arch_prctl functions for CET

On Wed, 2018-07-11 at 14:19 +0200, Florian Weimer wrote:
> On 07/11/2018 12:26 AM, Yu-cheng Yu wrote:
> >
> > arch_prctl(ARCH_CET_DISABLE, unsigned long features)
> >      Disable SHSTK and/or IBT specified in 'features'.  Return
> > -EPERM
> >      if CET is locked out.
> >
> > arch_prctl(ARCH_CET_LOCK)
> >      Lock out CET feature.
> Isn't it a “lock in” rather than a “lock out”?

Yes, that makes more sense.  I will fix it.

2018-07-12 02:58:22

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
>
> Add user-mode indirect branch tracking enabling/disabling
> and supporting routines.
>
> Signed-off-by: H.J. Lu <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
[...]
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> index 4eba7790c4e4..8bbd63e1a2ba 100644
> --- a/arch/x86/kernel/cet.c
> +++ b/arch/x86/kernel/cet.c
[...]
> +static unsigned long ibt_mmap(unsigned long addr, unsigned long len)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long populate;
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap(NULL, addr, len, PROT_READ | PROT_WRITE,
> +		       MAP_ANONYMOUS | MAP_PRIVATE,
> +		       VM_DONTDUMP, 0, &populate, NULL);
> +	up_write(&mm->mmap_sem);
> +
> +	if (populate)
> +		mm_populate(addr, populate);
> +
> +	return addr;
> +}

Is this thing going to stay writable? Will any process with an IBT
bitmap be able to disable protections by messing with the bitmap even
if the lock-out mode is active? If so, would it perhaps make sense to
forbid lock-out mode if an IBT bitmap is active, to make it clear that
effective lock-out is impossible in that state?

2018-07-12 02:58:52

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
>
> This patch adds basic shadow stack enabling/disabling routines.
> A task's shadow stack is allocated from memory with VM_SHSTK
> flag set and read-only protection. The shadow stack is
> allocated to a fixed size.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
[...]
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> new file mode 100644
> index 000000000000..96bf69db7da7
> --- /dev/null
> +++ b/arch/x86/kernel/cet.c
[...]
> +static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long populate;
> +
> +	down_write(&mm->mmap_sem);
> +	addr = do_mmap(NULL, addr, len, PROT_READ,
> +		       MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
> +		       0, &populate, NULL);
> +	up_write(&mm->mmap_sem);
> +
> +	if (populate)
> +		mm_populate(addr, populate);
> +
> +	return addr;
> +}

How does this interact with UFFDIO_REGISTER?

Is there an explicit design decision on whether FOLL_FORCE should be
able to write to shadow stacks? I'm guessing the answer is "yes,
FOLL_FORCE should be able to write to shadow stacks"? It might make
sense to add documentation for this.

Should the kernel enforce that two shadow stacks must have a guard
page between them so that they can not be directly adjacent, so that
if you have too much recursion, you can't end up corrupting an
adjacent shadow stack?

> +int cet_setup_shstk(void)
> +{
> +	unsigned long addr, size;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return -EOPNOTSUPP;
> +
> +	size = in_ia32_syscall() ? SHSTK_SIZE_32 : SHSTK_SIZE_64;
> +	addr = shstk_mmap(0, size);
> +
> +	/*
> +	 * Return actual error from do_mmap().
> +	 */
> +	if (addr >= TASK_SIZE_MAX)
> +		return addr;
> +
> +	set_shstk_ptr(addr + size - sizeof(u64));
> +	current->thread.cet.shstk_base = addr;
> +	current->thread.cet.shstk_size = size;
> +	current->thread.cet.shstk_enabled = 1;
> +	return 0;
> +}
> [...]
> +void cet_disable_free_shstk(struct task_struct *tsk)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> +	    !tsk->thread.cet.shstk_enabled)
> +		return;
> +
> +	if (tsk == current)
> +		cet_disable_shstk();
> +
> +	/*
> +	 * Free only when tsk is current or shares mm
> +	 * with current but has its own shstk.
> +	 */
> +	if (tsk->mm && (tsk->mm == current->mm) &&
> +	    (tsk->thread.cet.shstk_base)) {
> +		vm_munmap(tsk->thread.cet.shstk_base,
> +			  tsk->thread.cet.shstk_size);
> +		tsk->thread.cet.shstk_base = 0;
> +		tsk->thread.cet.shstk_size = 0;
> +	}
> +
> +	tsk->thread.cet.shstk_enabled = 0;
> +}

2018-07-12 03:00:04

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support



> On Jul 11, 2018, at 2:10 PM, Jann Horn <[email protected]> wrote:
>
>> On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
>>
>> This patch adds basic shadow stack enabling/disabling routines.
>> A task's shadow stack is allocated from memory with VM_SHSTK
>> flag set and read-only protection. The shadow stack is
>> allocated to a fixed size.
>>
>> Signed-off-by: Yu-cheng Yu <[email protected]>
> [...]
>> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
>> new file mode 100644
>> index 000000000000..96bf69db7da7
>> --- /dev/null
>> +++ b/arch/x86/kernel/cet.c
> [...]
>> +static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
>> +{
>> + struct mm_struct *mm = current->mm;
>> + unsigned long populate;
>> +
>> + down_write(&mm->mmap_sem);
>> + addr = do_mmap(NULL, addr, len, PROT_READ,
>> + MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
>> + 0, &populate, NULL);
>> + up_write(&mm->mmap_sem);
>> +
>> + if (populate)
>> + mm_populate(addr, populate);
>> +
>> + return addr;
>> +}
>
> How does this interact with UFFDIO_REGISTER?
>
> Is there an explicit design decision on whether FOLL_FORCE should be
> able to write to shadow stacks? I'm guessing the answer is "yes,
> FOLL_FORCE should be able to write to shadow stacks"? It might make
> sense to add documentation for this.

FOLL_FORCE should be able to write them, IMO. Otherwise we’ll need a whole new debugging API.

By the time an attacker can do FOLL_FORCE writes, the attacker can directly modify *text*, and CET is useless. We should probably audit all uses of FOLL_FORCE and remove as many as we can get away with.

>
> Should the kernel enforce that two shadow stacks must have a guard
> page between them so that they can not be directly adjacent, so that
> if you have too much recursion, you can't end up corrupting an
> adjacent shadow stack?

I think the answer is a qualified “no”. I would like to instead enforce a general guard page on all mmaps that don’t use MAP_FORCE. We *might* need to exempt any mmap with an address hint for compatibility.

My commercial software has been manually adding guard pages on every single mmap done by tcmalloc for years, and it has caught a couple bugs and costs essentially nothing.

Hmm. Linux should maybe add something like Windows’ “reserved” virtual memory. It’s basically a way to ask for a VA range that explicitly contains nothing and can be subsequently be turned into something useful with the equivalent of MAP_FORCE.

>
>> +int cet_setup_shstk(void)
>> +{
>> + unsigned long addr, size;
>> +
>> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
>> + return -EOPNOTSUPP;
>> +
>> + size = in_ia32_syscall() ? SHSTK_SIZE_32:SHSTK_SIZE_64;
>> + addr = shstk_mmap(0, size);
>> +
>> + /*
>> + * Return actual error from do_mmap().
>> + */
>> + if (addr >= TASK_SIZE_MAX)
>> + return addr;
>> +
>> + set_shstk_ptr(addr + size - sizeof(u64));
>> + current->thread.cet.shstk_base = addr;
>> + current->thread.cet.shstk_size = size;
>> + current->thread.cet.shstk_enabled = 1;
>> + return 0;
>> +}
> [...]
>> +void cet_disable_free_shstk(struct task_struct *tsk)
>> +{
>> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
>> + !tsk->thread.cet.shstk_enabled)
>> + return;
>> +
>> + if (tsk == current)
>> + cet_disable_shstk();
>> +
>> + /*
>> + * Free only when tsk is current or shares mm
>> + * with current but has its own shstk.
>> + */
>> + if (tsk->mm && (tsk->mm == current->mm) &&
>> + (tsk->thread.cet.shstk_base)) {
>> + vm_munmap(tsk->thread.cet.shstk_base,
>> + tsk->thread.cet.shstk_size);
>> + tsk->thread.cet.shstk_base = 0;
>> + tsk->thread.cet.shstk_size = 0;
>> + }
>> +
>> + tsk->thread.cet.shstk_enabled = 0;
>> +}

2018-07-12 03:01:04

by Jann Horn

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

On Wed, Jul 11, 2018 at 2:34 PM Andy Lutomirski <[email protected]> wrote:
> > On Jul 11, 2018, at 2:10 PM, Jann Horn <[email protected]> wrote:
> >
> >> On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
> >>
> >> This patch adds basic shadow stack enabling/disabling routines.
> >> A task's shadow stack is allocated from memory with VM_SHSTK
> >> flag set and read-only protection. The shadow stack is
> >> allocated to a fixed size.
> >>
> >> Signed-off-by: Yu-cheng Yu <[email protected]>
> > [...]
> >> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> >> new file mode 100644
> >> index 000000000000..96bf69db7da7
> >> --- /dev/null
> >> +++ b/arch/x86/kernel/cet.c
> > [...]
> >> +static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
> >> +{
> >> + struct mm_struct *mm = current->mm;
> >> + unsigned long populate;
> >> +
> >> + down_write(&mm->mmap_sem);
> >> + addr = do_mmap(NULL, addr, len, PROT_READ,
> >> + MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
> >> + 0, &populate, NULL);
> >> + up_write(&mm->mmap_sem);
> >> +
> >> + if (populate)
> >> + mm_populate(addr, populate);
> >> +
> >> + return addr;
> >> +}
[...]
> > Should the kernel enforce that two shadow stacks must have a guard
> > page between them so that they can not be directly adjacent, so that
> > if you have too much recursion, you can't end up corrupting an
> > adjacent shadow stack?
>
> I think the answer is a qualified “no”. I would like to instead enforce a general guard page on all mmaps that don’t use MAP_FORCE. We *might* need to exempt any mmap with an address hint for compatibility.

I like this idea a lot.

> My commercial software has been manually adding guard pages on every single mmap done by tcmalloc for years, and it has caught a couple bugs and costs essentially nothing.
>
> Hmm. Linux should maybe add something like Windows’ “reserved” virtual memory. It’s basically a way to ask for a VA range that explicitly contains nothing and can be subsequently be turned into something useful with the equivalent of MAP_FORCE.

What's the benefit over creating an anonymous PROT_NONE region? That
the kernel won't have to scan through the corresponding PTEs when
tearing down the mapping?

2018-07-12 03:02:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support


> On Jul 11, 2018, at 2:51 PM, Jann Horn <[email protected]> wrote:
>
> On Wed, Jul 11, 2018 at 2:34 PM Andy Lutomirski <[email protected]> wrote:
>>> On Jul 11, 2018, at 2:10 PM, Jann Horn <[email protected]> wrote:
>>>
>>>> On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
>>>>
>>>> This patch adds basic shadow stack enabling/disabling routines.
>>>> A task's shadow stack is allocated from memory with VM_SHSTK
>>>> flag set and read-only protection. The shadow stack is
>>>> allocated to a fixed size.
>>>>
>>>> Signed-off-by: Yu-cheng Yu <[email protected]>
>>> [...]
>>>> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
>>>> new file mode 100644
>>>> index 000000000000..96bf69db7da7
>>>> --- /dev/null
>>>> +++ b/arch/x86/kernel/cet.c
>>> [...]
>>>> +static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
>>>> +{
>>>> + struct mm_struct *mm = current->mm;
>>>> + unsigned long populate;
>>>> +
>>>> + down_write(&mm->mmap_sem);
>>>> + addr = do_mmap(NULL, addr, len, PROT_READ,
>>>> + MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
>>>> + 0, &populate, NULL);
>>>> + up_write(&mm->mmap_sem);
>>>> +
>>>> + if (populate)
>>>> + mm_populate(addr, populate);
>>>> +
>>>> + return addr;
>>>> +}
> [...]
>>> Should the kernel enforce that two shadow stacks must have a guard
>>> page between them so that they can not be directly adjacent, so that
>>> if you have too much recursion, you can't end up corrupting an
>>> adjacent shadow stack?
>>
>> I think the answer is a qualified “no”. I would like to instead enforce a general guard page on all mmaps that don’t use MAP_FORCE. We *might* need to exempt any mmap with an address hint for compatibility.
>
> I like this idea a lot.
>
>> My commercial software has been manually adding guard pages on every single mmap done by tcmalloc for years, and it has caught a couple bugs and costs essentially nothing.
>>
>> Hmm. Linux should maybe add something like Windows’ “reserved” virtual memory. It’s basically a way to ask for a VA range that explicitly contains nothing and can be subsequently be turned into something useful with the equivalent of MAP_FORCE.
>
> What's the benefit over creating an anonymous PROT_NONE region? That
> the kernel won't have to scan through the corresponding PTEs when
> tearing down the mapping?

Make it more obvious what’s happening and avoid accounting issues? What I’ve actually used is MAP_NORESERVE | PROT_NONE, but I think this still counts against the VA rlimit. But maybe that’s actually the desired behavior.
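
In userspace terms, the pattern is roughly this (a sketch using plain
mmap(2); nothing here is specific to this patch set):

	#include <sys/mman.h>

	/* Reserve a VA range without committing memory to it. */
	static void *reserve_va(size_t size)
	{
		return mmap(NULL, size, PROT_NONE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
			    -1, 0);
	}

	/* Later, carve a usable region out of the reservation. */
	static void *commit_va(void *addr, size_t size)
	{
		return mmap(addr, size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
			    -1, 0);
	}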



2018-07-12 03:02:38

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

On Tue, 2018-07-10 at 17:11 -0700, Dave Hansen wrote:
> Is this feature *integral* to shadow stacks?  Or, should it just be
> in a
> different series?

The whole CET series is mostly about SHSTK and only a minority for IBT.
The IBT changes cannot be applied by themselves without first applying
the SHSTK changes.  Would better patch titles help, e.g. x86/cet/ibt,
x86/cet/shstk, etc.?

>
> >
> > diff --git a/arch/x86/include/asm/cet.h
> > b/arch/x86/include/asm/cet.h
> > index d9ae3d86cdd7..71da2cccba16 100644
> > --- a/arch/x86/include/asm/cet.h
> > +++ b/arch/x86/include/asm/cet.h
> > @@ -12,7 +12,10 @@ struct task_struct;
> >  struct cet_status {
> >   unsigned long shstk_base;
> >   unsigned long shstk_size;
> > + unsigned long ibt_bitmap_addr;
> > + unsigned long ibt_bitmap_size;
> >   unsigned int shstk_enabled:1;
> > + unsigned int ibt_enabled:1;
> >  };
> Is there a reason we're not using pointers here?  This seems like the
> kind of place that we probably want __user pointers.

Yes, I will change that.

>
>
> >
> > +static unsigned long ibt_mmap(unsigned long addr, unsigned long len)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	unsigned long populate;
> > +
> > +	down_write(&mm->mmap_sem);
> > +	addr = do_mmap(NULL, addr, len, PROT_READ | PROT_WRITE,
> > +		       MAP_ANONYMOUS | MAP_PRIVATE,
> > +		       VM_DONTDUMP, 0, &populate, NULL);
> > +	up_write(&mm->mmap_sem);
> > +
> > +	if (populate)
> > +		mm_populate(addr, populate);
> > +
> > +	return addr;
> > +}
> We're going to have to start consolidating these at some point.  We
> have
> at least three of them now, maybe more.

Maybe we can do the following in linux/mm.h?

+static inline unsigned long do_mmap_locked(unsigned long addr,
+		unsigned long len, unsigned long prot,
+		unsigned long flags, vm_flags_t vm_flags)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long populate;
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap(NULL, addr, len, prot, flags, vm_flags,
+		       0, &populate, NULL);
+	up_write(&mm->mmap_sem);
+
+	if (populate)
+		mm_populate(addr, populate);
+
+	return addr;
+}
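
With a helper like that, shstk_mmap() and ibt_mmap() would each reduce
to a single call, e.g.:

	addr = do_mmap_locked(addr, len, PROT_READ,
			      MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK);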

> >
> > +int cet_setup_ibt_bitmap(void)
> > +{
> > +	u64 r;
> > +	unsigned long bitmap;
> > +	unsigned long size;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_IBT))
> > +		return -EOPNOTSUPP;
> > +
> > +	size = TASK_SIZE_MAX / PAGE_SIZE / BITS_PER_BYTE;
> Just a note: this table is going to be gigantic on 5-level paging
> systems, and userspace won't, by default use any of that extra
> address
> space.  I think it ends up being a 512GB allocation in a 128TB
> address
> space.
>
> Is that a problem?
>
> On 5-level paging systems, maybe we should just stick it up in the
> high
> part of the address space.

We do not know in advance if dlopen() needs to create the bitmap.  Do
we always reserve high address or force legacy libs to low address?

>
> >
> > +	bitmap = ibt_mmap(0, size);
> > +
> > +	if (bitmap >= TASK_SIZE_MAX)
> > +		return -ENOMEM;
> > +
> > +	bitmap &= PAGE_MASK;
> We're page-aligning the result of an mmap()?  Why?

This may not be necessary.  The lower bits of MSR_IA32_U_CET are
settings and not part of the bitmap address.  Is this safer?

>
> >
> > +	rdmsrl(MSR_IA32_U_CET, r);
> > +	r |= (MSR_IA32_CET_LEG_IW_EN | bitmap);
> > +	wrmsrl(MSR_IA32_U_CET, r);
> Comments, please.  What is this doing, logically?  Also, why are we
> OR'ing the results into this MSR?  What are we trying to preserve?

I will add comments.

>
> >
> > +	current->thread.cet.ibt_bitmap_addr = bitmap;
> > +	current->thread.cet.ibt_bitmap_size = size;
> > +	return 0;
> > +}
> > +
> > +void cet_disable_ibt(void)
> > +{
> > +	u64 r;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_IBT))
> > +		return;
> Does this need a check for being already disabled?

We need that.  We cannot write to those MSRs if the CPU does not
support it.

>
> >
> > +	rdmsrl(MSR_IA32_U_CET, r);
> > +	r &= ~(MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_LEG_IW_EN |
> > +	       MSR_IA32_CET_NO_TRACK_EN);
> > +	wrmsrl(MSR_IA32_U_CET, r);
> > +	current->thread.cet.ibt_enabled = 0;
> > +}
> What's the locking for current->thread.cet?

Now CET is not locked until the application calls ARCH_CET_LOCK.

>
> >
> > diff --git a/arch/x86/kernel/cpu/common.c
> > b/arch/x86/kernel/cpu/common.c
> > index 705467839ce8..c609c9ce5691 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -413,7 +413,8 @@ __setup("nopku", setup_disable_pku);
> >  
> >  static __always_inline void setup_cet(struct cpuinfo_x86 *c)
> >  {
> > - if (cpu_feature_enabled(X86_FEATURE_SHSTK))
> > + if (cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> > +     cpu_feature_enabled(X86_FEATURE_IBT))
> >   cr4_set_bits(X86_CR4_CET);
> >  }
> >  
> > @@ -434,6 +435,23 @@ static __init int setup_disable_shstk(char *s)
> >  __setup("no_cet_shstk", setup_disable_shstk);
> >  #endif
> >  
> > +#ifdef CONFIG_X86_INTEL_BRANCH_TRACKING_USER
> > +static __init int setup_disable_ibt(char *s)
> > +{
> > + /* require an exact match without trailing characters */
> > + if (strlen(s))
> > + return 0;
> > +
> > + if (!boot_cpu_has(X86_FEATURE_IBT))
> > + return 1;
> > +
> > + setup_clear_cpu_cap(X86_FEATURE_IBT);
> > + pr_info("x86: 'no_cet_ibt' specified, disabling Branch
> > Tracking\n");
> > + return 1;
> > +}
> > +__setup("no_cet_ibt", setup_disable_ibt);
> > +#endif
> >  /*
> >   * Some CPU features depend on higher CPUID levels, which may not
> > always
> >   * be available due to CPUID level capping or broken
> > virtualization
> > diff --git a/arch/x86/kernel/elf.c b/arch/x86/kernel/elf.c
> > index 233f6dad9c1f..42e08d3b573e 100644
> > --- a/arch/x86/kernel/elf.c
> > +++ b/arch/x86/kernel/elf.c
> > @@ -15,6 +15,7 @@
> >  #include <linux/fs.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/string.h>
> > +#include <linux/compat.h>
> >  
> >  /*
> >   * The .note.gnu.property layout:
> > @@ -222,7 +223,8 @@ int arch_setup_features(void *ehdr_p, void
> > *phdr_p,
> >  
> >   struct elf64_hdr *ehdr64 = ehdr_p;
> >  
> > - if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> > +     !cpu_feature_enabled(X86_FEATURE_IBT))
> >   return 0;
> >  
> >   if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
> > @@ -250,6 +252,9 @@ int arch_setup_features(void *ehdr_p, void
> > *phdr_p,
> >   current->thread.cet.shstk_enabled = 0;
> >   current->thread.cet.shstk_base = 0;
> >   current->thread.cet.shstk_size = 0;
> > + current->thread.cet.ibt_enabled = 0;
> > + current->thread.cet.ibt_bitmap_addr = 0;
> > + current->thread.cet.ibt_bitmap_size = 0;
> >   if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> >   if (shstk) {
> >   err = cet_setup_shstk();
> > @@ -257,6 +262,15 @@ int arch_setup_features(void *ehdr_p, void
> > *phdr_p,
> >   goto out;
> >   }
> >   }
> > +
> > + if (cpu_feature_enabled(X86_FEATURE_IBT)) {
> > + if (ibt) {
> > + err = cet_setup_ibt();
> > + if (err < 0)
> > + goto out;
> > + }
> > + }
> You introduced 'ibt' before it was used.  Please wait to introduce it
> until you actually use it to make it easier to review.
>
> Also, what's wrong with:
>
> if (cpu_feature_enabled(X86_FEATURE_IBT) && ibt) {
> ...
> }
>
> ?

I will fix it.


2018-07-12 03:03:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

On 07/11/2018 03:10 PM, Yu-cheng Yu wrote:
> On Tue, 2018-07-10 at 17:11 -0700, Dave Hansen wrote:
>> Is this feature *integral* to shadow stacks?  Or, should it just be
>> in a
>> different series?
>
> The whole CET series is mostly about SHSTK and only a minority for IBT.
> IBT changes cannot be applied by itself without first applying SHSTK
> changes.  Would the titles help, e.g. x86/cet/ibt, x86/cet/shstk, etc.?

That doesn't really answer what I asked, though.

Do shadow stacks *require* IBT? Or, should we concentrate on merging
shadow stacks themselves first and then do IBT at a later time, in a
different patch series?

But, yes, better patch titles would help, although I'm not sure that's
quite the format that Ingo and Thomas prefer.

>>> +int cet_setup_ibt_bitmap(void)
>>> +{
>>> + u64 r;
>>> + unsigned long bitmap;
>>> + unsigned long size;
>>> +
>>> + if (!cpu_feature_enabled(X86_FEATURE_IBT))
>>> + return -EOPNOTSUPP;
>>> +
>>> + size = TASK_SIZE_MAX / PAGE_SIZE / BITS_PER_BYTE;
>> Just a note: this table is going to be gigantic on 5-level paging
>> systems, and userspace won't, by default use any of that extra
>> address
>> space.  I think it ends up being a 512GB allocation in a 128TB
>> address
>> space.
>>
>> Is that a problem?
>>
>> On 5-level paging systems, maybe we should just stick it up in the
>> high part of the address space.
>
> We do not know in advance if dlopen() needs to create the bitmap.  Do
> we always reserve high address or force legacy libs to low address?

Does it matter? Does code ever get pointers to this area? Might they
be depending on high address bits for the IBT being clear?


>>> + bitmap = ibt_mmap(0, size);
>>> +
>>> + if (bitmap >= TASK_SIZE_MAX)
>>> + return -ENOMEM;
>>> +
>>> + bitmap &= PAGE_MASK;
>> We're page-aligning the result of an mmap()?  Why?
>
> This may not be necessary.  The lower bits of MSR_IA32_U_CET are
> settings and not part of the bitmap address.  Is this safer?

No. If we have mmap() returning non-page-aligned addresses, we have
bigger problems. Worst-case, do

WARN_ON_ONCE(bitmap & ~PAGE_MASK);

>>> + current->thread.cet.ibt_bitmap_addr = bitmap;
>>> + current->thread.cet.ibt_bitmap_size = size;
>>> + return 0;
>>> +}
>>> +
>>> +void cet_disable_ibt(void)
>>> +{
>>> + u64 r;
>>> +
>>> + if (!cpu_feature_enabled(X86_FEATURE_IBT))
>>> + return;
>> Does this need a check for being already disabled?
>
> We need that.  We cannot write to those MSRs if the CPU does not
> support it.

No, I mean for code doing cet_disable_ibt() twice in a row.

>>> + rdmsrl(MSR_IA32_U_CET, r);
>>> + r &= ~(MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_LEG_IW_EN |
>>> +        MSR_IA32_CET_NO_TRACK_EN);
>>> + wrmsrl(MSR_IA32_U_CET, r);
>>> + current->thread.cet.ibt_enabled = 0;
>>> +}
>> What's the locking for current->thread.cet?
>
> Now CET is not locked until the application calls ARCH_CET_LOCK.

No, I mean what is the in-kernel locking for the current->thread.cet
data structure? Is there none because it's only ever modified via
current->thread and it's entirely thread-local?



2018-07-12 03:04:34

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

On 07/11/2018 04:00 PM, Yu-cheng Yu wrote:
> On Wed, 2018-07-11 at 15:40 -0700, Dave Hansen wrote:
>> On 07/11/2018 03:10 PM, Yu-cheng Yu wrote:
>>>
>>> On Tue, 2018-07-10 at 17:11 -0700, Dave Hansen wrote:
>>>>
>>>> Is this feature *integral* to shadow stacks?  Or, should it just
>>>> be
>>>> in a
>>>> different series?
>>> The whole CET series is mostly about SHSTK and only a minority for
>>> IBT.
>>> IBT changes cannot be applied by itself without first applying
>>> SHSTK
>>> changes.  Would the titles help, e.g. x86/cet/ibt, x86/cet/shstk,
>>> etc.?
>> That doesn't really answer what I asked, though.
>>
>> Do shadow stacks *require* IBT?  Or, should we concentrate on merging
>> shadow stacks themselves first and then do IBT at a later time, in a
>> different patch series?
>>
>> But, yes, better patch titles would help, although I'm not sure
>> that's
>> quite the format that Ingo and Thomas prefer.
>
> Shadow stack does not require IBT, but they complement each other.  If
> we can resolve the legacy bitmap, both features can be merged at the
> same time.

As large as this patch set is, I'd really prefer to see you get shadow
stacks merged and then move on to IBT. I say separate them.

> GLIBC does the bitmap setup.  It sets bits in there.
> I thought you wanted a smaller bitmap?  One way is forcing legacy libs
> to low address, or not having the bitmap at all, i.e. turn IBT off.

I'm concerned with two things:
1. the virtual address space consumption, especially the *default* case
which will be apps using 4-level address space amounts, but having
5-level-sized tables.
2. the driving a truck-sized hole in the address space limits

You can force legacy libs to low addresses, but you can't stop anyone
from putting code into a high address *later*, at least with the code we
have today.

>>>>> + rdmsrl(MSR_IA32_U_CET, r);
>>>>> + r &= ~(MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_LEG_IW_EN
>>>>> |
>>>>> +        MSR_IA32_CET_NO_TRACK_EN);
>>>>> + wrmsrl(MSR_IA32_U_CET, r);
>>>>> + current->thread.cet.ibt_enabled = 0;
>>>>> +}
>>>> What's the locking for current->thread.cet?
>>> Now CET is not locked until the application calls ARCH_CET_LOCK.
>> No, I mean what is the in-kernel locking for the current->thread.cet
> data structure?  Is there none because it's only ever modified via
>> current->thread and it's entirely thread-local?
>
> Yes, that is the case.



2018-07-12 03:04:36

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

On Wed, 2018-07-11 at 15:40 -0700, Dave Hansen wrote:
> On 07/11/2018 03:10 PM, Yu-cheng Yu wrote:
> >
> > On Tue, 2018-07-10 at 17:11 -0700, Dave Hansen wrote:
> > >
> > > Is this feature *integral* to shadow stacks?  Or, should it just
> > > be
> > > in a
> > > different series?
> > The whole CET series is mostly about SHSTK and only a minority for
> > IBT.
> > IBT changes cannot be applied by itself without first applying
> > SHSTK
> > changes.  Would the titles help, e.g. x86/cet/ibt, x86/cet/shstk,
> > etc.?
> That doesn't really answer what I asked, though.
>
> Do shadow stacks *require* IBT?  Or, should we concentrate on merging
> shadow stacks themselves first and then do IBT at a later time, in a
> different patch series?
>
> But, yes, better patch titles would help, although I'm not sure
> that's
> quite the format that Ingo and Thomas prefer.

Shadow stack does not require IBT, but they complement each other.  If
we can resolve the legacy bitmap, both features can be merged at the
same time.

>
> >
> > >
> > > >
> > > > +int cet_setup_ibt_bitmap(void)
> > > > +{
> > > > + u64 r;
> > > > + unsigned long bitmap;
> > > > + unsigned long size;
> > > > +
> > > > + if (!cpu_feature_enabled(X86_FEATURE_IBT))
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > + size = TASK_SIZE_MAX / PAGE_SIZE / BITS_PER_BYTE;
> > > Just a note: this table is going to be gigantic on 5-level paging
> > > systems, and userspace won't, by default use any of that extra
> > > address
> > > space.  I think it ends up being a 512GB allocation in a 128TB
> > > address
> > > space.
> > >
> > > Is that a problem?
> > >
> > > On 5-level paging systems, maybe we should just stick it up in
> > > the 
> > > high part of the address space.
> > We do not know in advance if dlopen() needs to create the bitmap.
> >  Do
> > we always reserve high address or force legacy libs to low address?
> Does it matter?  Does code ever get pointers to this area?  Might
> they
> be depending on high address bits for the IBT being clear?

GLIBC does the bitmap setup.  It sets bits in there.
I thought you wanted a smaller bitmap?  One way is forcing legacy libs
to low address, or not having the bitmap at all, i.e. turn IBT off.

>
>
> >
> > >
> > > >
> > > > + bitmap = ibt_mmap(0, size);
> > > > +
> > > > + if (bitmap >= TASK_SIZE_MAX)
> > > > + return -ENOMEM;
> > > > +
> > > > + bitmap &= PAGE_MASK;
> > > We're page-aligning the result of an mmap()?  Why?
> > This may not be necessary.  The lower bits of MSR_IA32_U_CET are
> > settings and not part of the bitmap address.  Is this safer?
> No.  If we have mmap() returning non-page-aligned addresses, we have
> bigger problems.  Worst-case, do
>
> WARN_ON_ONCE(bitmap & ~PAGE_MASK);
>

Ok.

> >
> > >
> > > >
> > > > + current->thread.cet.ibt_bitmap_addr = bitmap;
> > > > + current->thread.cet.ibt_bitmap_size = size;
> > > > + return 0;
> > > > +}
> > > > +
> > > > +void cet_disable_ibt(void)
> > > > +{
> > > > + u64 r;
> > > > +
> > > > + if (!cpu_feature_enabled(X86_FEATURE_IBT))
> > > > + return;
> > > Does this need a check for being already disabled?
> > We need that.  We cannot write to those MSRs if the CPU does not
> > support it.
> No, I mean for code doing cet_disable_ibt() twice in a row.

Got it.
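
For instance (a sketch based on the quoted patch; the ibt_enabled test
is the addition), an early-out would make the function safe to call
twice in a row:

	void cet_disable_ibt(void)
	{
		u64 r;

		/* Nothing to do if IBT is unsupported or already off. */
		if (!cpu_feature_enabled(X86_FEATURE_IBT) ||
		    !current->thread.cet.ibt_enabled)
			return;

		rdmsrl(MSR_IA32_U_CET, r);
		r &= ~(MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_LEG_IW_EN |
		       MSR_IA32_CET_NO_TRACK_EN);
		wrmsrl(MSR_IA32_U_CET, r);
		current->thread.cet.ibt_enabled = 0;
	}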

>
> >
> > >
> > > >
> > > > + rdmsrl(MSR_IA32_U_CET, r);
> > > > + r &= ~(MSR_IA32_CET_ENDBR_EN | MSR_IA32_CET_LEG_IW_EN
> > > > |
> > > > +        MSR_IA32_CET_NO_TRACK_EN);
> > > > + wrmsrl(MSR_IA32_U_CET, r);
> > > > + current->thread.cet.ibt_enabled = 0;
> > > > +}
> > > What's the locking for current->thread.cet?
> > Now CET is not locked until the application calls ARCH_CET_LOCK.
> No, I mean what is the in-kernel locking for the current->thread.cet
> data structure?  Is there none because it's only ever modified via
> current->thread and it's entirely thread-local?

Yes, that is the case.


2018-07-12 14:05:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET


* Yu-cheng Yu <[email protected]> wrote:

> > > diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> > > index e2ee403865eb..ac2bc3a18427 100644
> > > --- a/arch/x86/kernel/ptrace.c
> > > +++ b/arch/x86/kernel/ptrace.c
> > > @@ -49,7 +49,9 @@ enum x86_regset {
> > >  	REGSET_IOPERM64 = REGSET_XFP,
> > >  	REGSET_XSTATE,
> > >  	REGSET_TLS,
> > > +	REGSET_CET64 = REGSET_TLS,
> > >  	REGSET_IOPERM32,
> > > +	REGSET_CET32,
> > >  };
> > Why does REGSET_CET64 alias on REGSET_TLS?
>
> In x86_64_regsets[], there is no [REGSET_TLS].  The core dump code
> cannot handle holes in the array.

Is there a fundamental (ABI) reason for that?

> > to "CET" (which is a well-known acronym for "Central European Time"),
> > not to CFE?
> >
>
> I don't know if I can change that, will find out.

So what I'd suggest is something pretty simple: to use CFT/cft in kernel internal
names, except for the Intel feature bit and any MSR enumeration which can be CET
if Intel named it that way, and a short comment explaining the acronym difference.

Or something like that.

Thanks,

Ingo

2018-07-12 22:41:48

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET

On Thu, 2018-07-12 at 16:03 +0200, Ingo Molnar wrote:
> * Yu-cheng Yu <[email protected]> wrote:
>
> >
> > >
> > > >
> > > > diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> > > > index e2ee403865eb..ac2bc3a18427 100644
> > > > --- a/arch/x86/kernel/ptrace.c
> > > > +++ b/arch/x86/kernel/ptrace.c
> > > > @@ -49,7 +49,9 @@ enum x86_regset {
> > > >   REGSET_IOPERM64 = REGSET_XFP,
> > > >   REGSET_XSTATE,
> > > >   REGSET_TLS,
> > > > + REGSET_CET64 = REGSET_TLS,
> > > >   REGSET_IOPERM32,
> > > > + REGSET_CET32,
> > > >  };
> > > Why does REGSET_CET64 alias on REGSET_TLS?
> > In x86_64_regsets[], there is no [REGSET_TLS].  The core dump code
> > cannot handle holes in the array.
> Is there a fundamental (ABI) reason for that?

What I did was, ran Linux with 'slub_debug', and forced a core dump
(kill -abrt <pid>), then there was a red zone warning in the dmesg.
My feeling is there could be issues in the core dump code.  These
enum's are only local to arch/x86/kernel/ptrace.c and not exported.
I am not aware this is in the ABI.

>
> >
> > >
> > > to "CET" (which is a well-known acronym for "Central European Time"),
> > > not to CFE?
> > >
> > I don't know if I can change that, will find out.
> So what I'd suggest is something pretty simple: to use CFT/cft in kernel internal 
> names, except for the Intel feature bit and any MSR enumeration which can be CET 
> if Intel named it that way, and a short comment explaining the acronym difference.
>
> Or something like that.

Ok, I will make changes in the next version and probably revise
from that if still not optimal.

Yu-cheng


2018-07-12 23:04:27

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Tue, 2018-07-10 at 16:48 -0700, Dave Hansen wrote:
> >
> > +/*
> > + * WRUSS is a kernel instruction that writes to user
> > + * shadow stack memory.  When a fault occurs, both
> > + * X86_PF_USER and X86_PF_SHSTK are set.
> > + */
> > +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> > +{
> > +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> > +		 (X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> > +}
> I thought X86_PF_USER was set based on the mode in which the fault
> occurred.  Does this mean that the architecture of this bit is different
> now?

Yes.

> That seems like something we need to call out if so.  It also means we
> need to update the SDM because some of the text is wrong.

It needs to mention the WRUSS case.

>
> >
> >  static void
> >  show_fault_oops(struct pt_regs *regs, unsigned long error_code,
> >  		unsigned long address)
> > @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
> >  	struct task_struct *tsk = current;
> >
> >  	/* User mode accesses just cause a SIGSEGV */
> > -	if (error_code & X86_PF_USER) {
> > +	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
> >  		/*
> >  		 * It's possible to have interrupts off here:
> >  		 */
> This needs commenting about why is_wruss() is special.

Ok.

2018-07-12 23:11:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET

On Thu, 12 Jul 2018, Yu-cheng Yu wrote:
> On Thu, 2018-07-12 at 16:03 +0200, Ingo Molnar wrote:
> > * Yu-cheng Yu <[email protected]> wrote:
> > > > > diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> > > > > index e2ee403865eb..ac2bc3a18427 100644
> > > > > --- a/arch/x86/kernel/ptrace.c
> > > > > +++ b/arch/x86/kernel/ptrace.c
> > > > > @@ -49,7 +49,9 @@ enum x86_regset {
> > > > >   REGSET_IOPERM64 = REGSET_XFP,
> > > > >   REGSET_XSTATE,
> > > > >   REGSET_TLS,
> > > > > + REGSET_CET64 = REGSET_TLS,
> > > > >   REGSET_IOPERM32,
> > > > > + REGSET_CET32,
> > > > >  };
> > > > Why does REGSET_CET64 alias on REGSET_TLS?
> > > In x86_64_regsets[], there is no [REGSET_TLS].  The core dump code
> > > cannot handle holes in the array.
> > Is there a fundamental (ABI) reason for that?
>
> What I did was, ran Linux with 'slub_debug', and forced a core dump
> (kill -abrt <pid>), then there was a red zone warning in the dmesg.
> My feeling is there could be issues in the core dump code.  These

Kernel development is not about feelings.

Either you can track down the root cause or you cannot. There is no place
for feelings and no place in between. And if you cannot track down the root
cause and explain it properly then the resulting patch is just papering over
the symptoms and will come back to haunt you (or others) sooner than later.

No if, no could, no feelings. Facts is what matters. Really.

Thanks,

tglx

2018-07-12 23:52:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On 07/12/2018 03:59 PM, Yu-cheng Yu wrote:
> On Tue, 2018-07-10 at 16:48 -0700, Dave Hansen wrote:
>>>
>>> +/*
>>> + * WRUSS is a kernel instruction that writes to user
>>> + * shadow stack memory.  When a fault occurs, both
>>> + * X86_PF_USER and X86_PF_SHSTK are set.
>>> + */
>>> +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
>>> +{
>>> + return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
>>> + (X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
>>> +}
>> I thought X86_PF_USER was set based on the mode in which the fault
>> occurred.  Does this mean that the architecture of this bit is different
>> now?
>
> Yes.
>
>> That seems like something we need to call out if so.  It also means we
>> need to update the SDM because some of the text is wrong.
>
> It needs to mention the WRUSS case.

Ugh. The documentation for this is not pretty. But, I guess this is
not fundamentally different from access to U=1 pages when SMAP is in
place and we've set EFLAGS.AC=1.

But, sheesh, we need to call this out really explicitly and make it
crystal clear what is going on.

We need to go through the page fault code very carefully and audit all
the X86_PF_USER spots and make sure there's no impact to those. SMAP
should mean that we already dealt with these, but we still need an audit.

The docs[1] are clear as mud on this though: "Page entry has user
privilege (U=1) for a supervisor-level shadow-stack-load,
shadow-stack-store-intent or shadow-stack-store access except those that
originate from the WRUSS instruction."

Or, in short:

"Page has U=1 ... except those that originate from the WRUSS
instruction."

Which is backwards from what you said. I really wish those docs had
reused the established SDM language instead of reinventing their own way
of saying things.

1.
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

2018-07-13 01:51:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On 07/12/2018 04:49 PM, Dave Hansen wrote:
>>> That seems like something we need to call out if so.  It also means we
>>> need to update the SDM because some of the text is wrong.
>> It needs to mention the WRUSS case.
> Ugh. The documentation for this is not pretty. But, I guess this is
> not fundamentally different from access to U=1 pages when SMAP is in
> place and we've set EFLAGS.AC=1.

I was wrong and misread the docs. We do not get X86_PF_USER set when
EFLAGS.AC=1.

But, we *do* get X86_PF_USER (otherwise defined to be set when in ring3)
when running in ring0 with the WRUSS instruction and some other various
shadow-stack-access-related things. I'm sure folks had a good reason
for this architecture, but it is a pretty fundamentally *new*
architecture that we have to account for.

This new architecture is also not spelled out or accounted for in the
SDM as of yet. It's only called out here as far as I know:
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

Which reminds me: Yu-cheng, do you have a link to the docs anywhere in
your set? If not, you really should.

2018-07-13 02:22:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction



> On Jul 12, 2018, at 6:50 PM, Dave Hansen <[email protected]> wrote:
>
> On 07/12/2018 04:49 PM, Dave Hansen wrote:
>>>> That seems like something we need to call out if so. It also means we
>>>> need to update the SDM because some of the text is wrong.
>>> It needs to mention the WRUSS case.
>> Ugh. The documentation for this is not pretty. But, I guess this is
>> not fundamentally different from access to U=1 pages when SMAP is in
>> place and we've set EFLAGS.AC=1.
>
> I was wrong and misread the docs. We do not get X86_PF_USER set when
> EFLAGS.AC=1.
>
> But, we *do* get X86_PF_USER (otherwise defined to be set when in ring3)
> when running in ring0 with the WRUSS instruction and some other various
> shadow-stack-access-related things. I'm sure folks had a good reason
> for this architecture, but it is a pretty fundamentally *new*
> architecture that we have to account for.

I think it makes (some) sense. The USER bit is set for a page fault that was done with user privilege. So a descriptor table fault at CPL 3 has USER clear (regardless of the cause of the fault) and WRUSS has USER set.

>
> This new architecture is also not spelled out or accounted for in the
> SDM as of yet. It's only called out here as far as I know:
> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
>
> Which reminds me: Yu-cheng, do you have a link to the docs anywhere in
> your set? If not, you really should.

I am tempted to suggest that the whole series not be merged until there are actual docs. It’s not a fantastic precedent.

2018-07-13 04:17:35

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
> I am tempted to suggest that the whole series not be merged until
> there are actual docs. It’s not a fantastic precedent.

Do you mean Documentation or manpages, or are you talking about hardware
documentation?
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

2018-07-13 04:19:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On 07/12/2018 09:16 PM, Dave Hansen wrote:
> On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
>> I am tempted to suggest that the whole series not be merged until
>> there are actual docs. It’s not a fantastic precedent.
>
> Do you mean Documentation or manpages, or are you talking about hardware
> documentation?
> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

Hit send too soon...

We do need manpages as well. If I had to do it for protection keys,
everyone else has to suffer too. :)

Yu-cheng, I really do think selftests are a necessity before this gets
merged.


2018-07-13 05:56:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction



> On Jul 12, 2018, at 9:16 PM, Dave Hansen <[email protected]> wrote:
>
>> On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
>> I am tempted to suggest that the whole series not be merged until
>> there are actual docs. It’s not a fantastic precedent.
>
> Do you mean Documentation or manpages, or are you talking about hardware
> documentation?
> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

I mean hardware docs. The “preview” is a little bit dubious IMO.

2018-07-13 06:28:58

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET


> > > to "CET" (which is a well-known acronym for "Central European Time"),
> > > not to CFE?
> > >
> >
> > I don't know if I can change that, will find out.
>
> So what I'd suggest is something pretty simple: to use CFT/cft in kernel internal
> names, except for the Intel feature bit and any MSR enumeration which can be CET
> if Intel named it that way, and a short comment explaining the acronym difference.
>
> Or something like that.

Actually, I don't think CFT is much better -- there's a limited number
of TLAs (*). "ENFORCE_FLOW"? "FLOWE"? "EFLOW"?

Pavel

(*) Three-letter acronyms.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-13 12:13:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> +{
> +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> +		 (X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> +}
> +
>  static void
>  show_fault_oops(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address)
> @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
>  	struct task_struct *tsk = current;
>
>  	/* User mode accesses just cause a SIGSEGV */
> -	if (error_code & X86_PF_USER) {
> +	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
>  		/*
>  		 * It's possible to have interrupts off here:
>  		 */

Please don't do it this way.

We have two styles of page fault:
1. User page faults: find a VMA, try to handle (allocate memory et al.),
kill process if we can't handle.
2. Kernel page faults: search for a *discrete* set of conditions that
can be handled, including faults in instructions marked in exception
tables.

X86_PF_USER *means*: do user page fault handling. In the places where
the hardware doesn't set it, but we still want user page fault handling,
we manually set it, like this where we "downgrade" an implicit
supervisor access to a user access:

	if (user_mode(regs)) {
		local_irq_enable();
		error_code |= X86_PF_USER;
		flags |= FAULT_FLAG_USER;

So, just please *clear* X86_PF_USER if !user_mode(regs) and X86_PF_SHSTK
is set. We do not want user page fault handling, thus we should not keep
the bit set.
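
For example, a minimal sketch of that downgrade (the exact placement in
the fault entry path is left open):

	/*
	 * WRUSS faults report X86_PF_USER even though the access was
	 * made from ring 0.  Clear the bit so the fault takes the
	 * kernel handling path (exception tables) rather than the
	 * user path.
	 */
	if (!user_mode(regs) && (error_code & X86_PF_SHSTK))
		error_code &= ~X86_PF_USER;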

2018-07-13 13:35:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET


* Pavel Machek <[email protected]> wrote:

>
> > > > to "CET" (which is a well-known acronym for "Central European Time"),
> > > > not to CFE?
> > > >
> > >
> > > I don't know if I can change that, will find out.
> >
> > So what I'd suggest is something pretty simple: to use CFT/cft in kernel internal
> > names, except for the Intel feature bit and any MSR enumeration which can be CET
> > if Intel named it that way, and a short comment explaining the acronym difference.
> >
> > Or something like that.
>
> Actually, I don't think CFT is much better -- there's limited number
> of TLAs (*). "ENFORCE_FLOW"? "FLOWE"? "EFLOW"?

Erm, I wanted to say 'CFE', i.e. the abbreviation of 'Control Flow Enforcement'.

But I guess I can live with CET as well ...

Thanks,

Ingo

2018-07-13 16:13:09

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET

On Fri, 2018-07-13 at 01:08 +0200, Thomas Gleixner wrote:
> On Thu, 12 Jul 2018, Yu-cheng Yu wrote:
> >
> > On Thu, 2018-07-12 at 16:03 +0200, Ingo Molnar wrote:
> > >
> > > * Yu-cheng Yu <[email protected]> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> > > > > > index e2ee403865eb..ac2bc3a18427 100644
> > > > > > --- a/arch/x86/kernel/ptrace.c
> > > > > > +++ b/arch/x86/kernel/ptrace.c
> > > > > > @@ -49,7 +49,9 @@ enum x86_regset {
> > > > > >   REGSET_IOPERM64 = REGSET_XFP,
> > > > > >   REGSET_XSTATE,
> > > > > >   REGSET_TLS,
> > > > > > + REGSET_CET64 = REGSET_TLS,
> > > > > >   REGSET_IOPERM32,
> > > > > > + REGSET_CET32,
> > > > > >  };
> > > > > Why does REGSET_CET64 alias on REGSET_TLS?
> > > > In x86_64_regsets[], there is no [REGSET_TLS].  The core dump code
> > > > cannot handle holes in the array.
> > > Is there a fundamental (ABI) reason for that?
> > What I did was, ran Linux with 'slub_debug', and forced a core dump
> > (kill -abrt <pid>), then there was a red zone warning in the dmesg.
> > My feeling is there could be issues in the core dump code.  These
> Kernel development is not about feelings.

I got that :-)

>
> Either you can track down the root cause or you cannot. There is no place
> for feelings and no place in between. And if you cannot track down the root
> cause and explain it proper then the resulting patch is just papering over
> the symptoms and will come back to hunt you (or others) sooner than later.
>
> No if, no could, no feelings. Facts is what matters. Really.

In kernel/ptrace.c,

static const struct user_regset *
find_regset(const struct user_regset_view *view, unsigned int type)
{
	const struct user_regset *regset;
	int n;

	for (n = 0; n < view->n; ++n) {
		regset = view->regsets + n;
		if (regset->core_note_type == type)
			return regset;
	}

	return NULL;
}

If there is a hole in the regset array, the empty slot is
zero-initialized, so its regset->core_note_type is not a valid note type.

We can add some comments near those enums.
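
An illustrative stub of such a hole (names hypothetical, not the
kernel's actual table):

	struct regset_stub { unsigned int core_note_type; };

	static const struct regset_stub regsets[] = {
		[0] = { .core_note_type = 1 },
		[2] = { .core_note_type = 3 },
		/* Index 1 is a hole: its core_note_type is 0, so a
		 * lookup loop like find_regset() would treat it as a
		 * real entry of type 0. */
	};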

Yu-cheng


2018-07-13 17:41:55

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Fri, 2018-07-13 at 05:12 -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> >
> > +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> > +{
> > + return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> > + (X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> > +}
> > +
> >  static void
> >  show_fault_oops(struct pt_regs *regs, unsigned long error_code,
> >   unsigned long address)
> > @@ -848,7 +859,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
> >   struct task_struct *tsk = current;
> >  
> >   /* User mode accesses just cause a SIGSEGV */
> > - if (error_code & X86_PF_USER) {
> > + if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
> >   /*
> >    * It's possible to have interrupts off here:
> >    */
> Please don't do it this way.
>
> We have two styles of page fault:
> 1. User page faults: find a VMA, try to handle (allocate memory et al.),
>    kill process if we can't handle.
> 2. Kernel page faults: search for a *discrete* set of conditions that
>    can be handled, including faults in instructions marked in exception
>    tables.
>
> X86_PF_USER *means*: do user page fault handling.  In the places where
> the hardware doesn't set it, but we still want user page fault handling,
> we manually set it, like this where we "downgrade" an implicit
> supervisor access to a user access:
>
>         if (user_mode(regs)) {
>                 local_irq_enable();
>                 error_code |= X86_PF_USER;
>                 flags |= FAULT_FLAG_USER;
>
> So, just please *clear* X86_PF_USER if !user_mode(regs) and X86_PF_SHSTK
> is set.  We do not want user page fault handling, thus we should not keep
> the bit set.

Agree.  I will change that.

Yu-cheng


2018-07-13 17:43:36

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 18/27] x86/cet/shstk: Introduce WRUSS instruction

On Thu, 2018-07-12 at 21:18 -0700, Dave Hansen wrote:
> On 07/12/2018 09:16 PM, Dave Hansen wrote:
> >
> > On 07/12/2018 07:21 PM, Andy Lutomirski wrote:
> > >
> > > I am tempted to suggest that the whole series not be merged until
> > > there are actual docs. It’s not a fantastic precedent.
> > Do you mean Documentation or manpages, or are you talking about hardware
> > documentation?
> > https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
> Hit send too soon...
>
> We do need manpages as well.  If I had to do it for protection keys,
> everyone else has to suffer too. :)
>
> Yu-cheng, I really do think selftests are a necessity before this gets
> merged.
>

We already have some.  I will put those in patches.

Yu-cheng


2018-07-13 18:01:30

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

On Wed, 2018-07-11 at 16:16 -0700, Dave Hansen wrote:
> On 07/11/2018 04:00 PM, Yu-cheng Yu wrote:
> >
> > On Wed, 2018-07-11 at 15:40 -0700, Dave Hansen wrote:
> > >
> > > On 07/11/2018 03:10 PM, Yu-cheng Yu wrote:
> > > >
> > > >
> > > > On Tue, 2018-07-10 at 17:11 -0700, Dave Hansen wrote:
> > > > >
> > > > >
> > > > > Is this feature *integral* to shadow stacks?  Or, should it just
> > > > > be
> > > > > in a
> > > > > different series?
> > > > The whole CET series is mostly about SHSTK and only a minority for
> > > > IBT.
> > > > IBT changes cannot be applied by itself without first applying
> > > > SHSTK
> > > > changes.  Would the titles help, e.g. x86/cet/ibt, x86/cet/shstk,
> > > > etc.?
> > > That doesn't really answer what I asked, though.
> > >
> > > Do shadow stacks *require* IBT?  Or, should we concentrate on merging
> > > shadow stacks themselves first and then do IBT at a later time, in a
> > > different patch series?
> > >
> > > But, yes, better patch titles would help, although I'm not sure
> > > that's
> > > quite the format that Ingo and Thomas prefer.
> > Shadow stack does not require IBT, but they complement each other.  If
> > we can resolve the legacy bitmap, both features can be merged at the
> > same time.
> As large as this patch set is, I'd really prefer to see you get shadow
> stacks merged and then move on to IBT.  I say separate them.

Ok, separate them.

>
> >
> > GLIBC does the bitmap setup.  It sets bits in there.
> > I thought you wanted a smaller bitmap?  One way is forcing legacy libs
> > to low address, or not having the bitmap at all, i.e. turn IBT off.
> I'm concerned with two things:
> 1. the virtual address space consumption, especially the *default* case
>    which will be apps using 4-level address space amounts, but having
>    5-level-sized tables.
> 2. the driving a truck-sized hole in the address space limits
>
> You can force legacy libs to low addresses, but you can't stop anyone
> from putting code into a high address *later*, at least with the code we
> have today.

So we will always reserve a big space for all CET tasks?

Currently if an application does dlopen() a legacy lib, it will have only
partial IBT protection and no SHSTK.  Do we want to consider simply turning
off IBT in that case?

Yu-cheng


2018-07-13 18:06:34

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 22/27] x86/cet/ibt: User-mode indirect branch tracking support

On 07/13/2018 10:56 AM, Yu-cheng Yu wrote:
>>> GLIBC does the bitmap setup.  It sets bits in there.
>>> I thought you wanted a smaller bitmap?  One way is forcing legacy libs
>>> to low address, or not having the bitmap at all, i.e. turn IBT off.
>> I'm concerned with two things:
>> 1. the virtual address space consumption, especially the *default* case
>>    which will be apps using 4-level address space amounts, but having
>>    5-level-sized tables.
>> 2. the driving a truck-sized hole in the address space limits
>>
>> You can force legacy libs to low addresses, but you can't stop anyone
>> from putting code into a high address *later*, at least with the code we
>> have today.
> So we will always reserve a big space for all CET tasks?

Yes. You either hard-restrict the address space (which we can't do
currently) or you reserve a big space.

> Currently if an application does dlopen() a legacy lib, it will have only
> partial IBT protection and no SHSTK.  Do we want to consider simply turning
> off IBT in that case?

I don't know. I honestly don't understand the threat model enough to
give you a good answer. Is there background on this in the docs?

2018-07-13 18:07:48

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 17/27] x86/cet/shstk: User-mode shadow stack support

On Wed, 2018-07-11 at 15:21 -0700, Andy Lutomirski wrote:
> >
> > On Jul 11, 2018, at 2:51 PM, Jann Horn <[email protected]> wrote:
> >
> > On Wed, Jul 11, 2018 at 2:34 PM Andy Lutomirski <[email protected]> wrote:
> > >
> > > >
> > > > On Jul 11, 2018, at 2:10 PM, Jann Horn <[email protected]> wrote:
> > > >
> > > > >
> > > > > On Tue, Jul 10, 2018 at 3:31 PM Yu-cheng Yu <[email protected]> wrote:
> > > > >
> > > > > This patch adds basic shadow stack enabling/disabling routines.
> > > > > A task's shadow stack is allocated from memory with VM_SHSTK
> > > > > flag set and read-only protection.  The shadow stack is
> > > > > allocated to a fixed size.
> > > > >
> > > > > Signed-off-by: Yu-cheng Yu <[email protected]>
> > > > [...]
> > > > >
> > > > > diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> > > > > new file mode 100644
> > > > > index 000000000000..96bf69db7da7
> > > > > --- /dev/null
> > > > > +++ b/arch/x86/kernel/cet.c
> > > > [...]
> > > > >
> > > > > +static unsigned long shstk_mmap(unsigned long addr, unsigned long len)
> > > > > +{
> > > > > +       struct mm_struct *mm = current->mm;
> > > > > +       unsigned long populate;
> > > > > +
> > > > > +       down_write(&mm->mmap_sem);
> > > > > +       addr = do_mmap(NULL, addr, len, PROT_READ,
> > > > > +                      MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
> > > > > +                      0, &populate, NULL);
> > > > > +       up_write(&mm->mmap_sem);
> > > > > +
> > > > > +       if (populate)
> > > > > +               mm_populate(addr, populate);
> > > > > +
> > > > > +       return addr;
> > > > > +}
> > [...]
> > >
> > > >
> > > > Should the kernel enforce that two shadow stacks must have a guard
> > > > page between them so that they can not be directly adjacent, so that
> > > > if you have too much recursion, you can't end up corrupting an
> > > > adjacent shadow stack?
> > > I think the answer is a qualified “no”. I would like to instead enforce a general guard page on all mmaps that don’t use MAP_FORCE. We *might* need to exempt any mmap with an address hint for
> > > compatibility.
> > I like this idea a lot.
> >
> > >
> > > My commercial software has been manually adding guard pages on every single mmap done by tcmalloc for years, and it has caught a couple bugs and costs essentially nothing.
> > >
> > > Hmm. Linux should maybe add something like Windows’ “reserved” virtual memory. It’s basically a way to ask for a VA range that explicitly contains nothing and can subsequently be turned into
> > > something useful with the equivalent of MAP_FORCE.
> > What's the benefit over creating an anonymous PROT_NONE region? That
> > the kernel won't have to scan through the corresponding PTEs when
> > tearing down the mapping?
> Make it more obvious what’s happening and avoid accounting issues?  What I’ve actually used is MAP_NORESERVE | PROT_NONE, but I think this still counts against the VA rlimit. But maybe that’s
> actually the desired behavior.

We can put a NULL at both ends of a SHSTK to guard against corruption.

Yu-cheng 
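
As a rough illustration of the "reserve, then commit" idea Andy describes, a minimal userspace sketch using only plain POSIX mmap(2) (MAP_FORCE, discussed above, is a proposed kernel flag and is not used here):

#include <stddef.h>
#include <sys/mman.h>

/* Reserve a VA range that maps nothing and faults on any access. */
static void *reserve_va(size_t len)
{
	return mmap(NULL, len, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Later, commit part of the reservation; MAP_FIXED replaces the
 * PROT_NONE pages in place. */
static void *commit_va(void *addr, size_t len)
{
	return mmap(addr, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}

As Andy notes, such a PROT_NONE reservation may still count against the address-space rlimit.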


2018-07-13 18:27:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On 07/11/2018 10:05 AM, Yu-cheng Yu wrote:
> My understanding is that we don't want to follow write pte if the page
> is shared as read-only.  For a SHSTK page, that is (R/O + DIRTY_SW),
> which means the SHSTK page has not been COW'ed.  Is that right?

Let's look at the code again:

> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> +static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
> +					bool shstk)
>  {
> +	bool pte_cowed = shstk ? is_shstk_pte(pte) : pte_dirty(pte);
> +
>  	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_cowed);
>  }

This is another case where the naming of pte_*() is biting us vs. the
perversion of the PTE bits. The lack of comments and explanation in the
patch is compounding the confusion.

We need to find a way to differentiate "someone can write to this PTE"
from "the write bit is set in this PTE".

In this particular hunk, we need to make it clear that pte_write() is
*never* true for shadowstack PTEs. In other words, shadow stack VMAs
will (should?) never even *see* a pte_write() PTE.

I think this is a case where you just need to bite the bullet and
bifurcate can_follow_write_pte(). Just separate the shadowstack and
non-shadowstack parts.
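
One way to state the invariant above in code is an assertion; a hypothetical check built from the helpers in this series:

	/* Shadow stack VMAs should never see a hardware-writable PTE. */
	VM_BUG_ON(is_shstk_mapping(vma->vm_flags) && pte_write(pte));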

2018-07-14 06:29:10

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/27] x86/cet: Add PTRACE interface for CET

On Fri 2018-07-13 15:33:58, Ingo Molnar wrote:
>
> * Pavel Machek <[email protected]> wrote:
>
> >
> > > > > to "CET" (which is a well-known acronym for "Central European Time"),
> > > > > not to CFE?
> > > > >
> > > >
> > > > I don't know if I can change that, will find out.
> > >
> > > So what I'd suggest is something pretty simple: to use CFT/cft in kernel internal
> > > names, except for the Intel feature bit and any MSR enumeration which can be CET
> > > if Intel named it that way, and a short comment explaining the acronym difference.
> > >
> > > Or something like that.
> >
> > Actually, I don't think CFT is much better -- there's limited number
> > of TLAs (*). "ENFORCE_FLOW"? "FLOWE"? "EFLOW"?
>
> Erm, I wanted to say 'CFE', i.e. the abbreviation of 'Control Flow Enforcement'.
>
> But I guess I can live with CET as well ...

Yeah, and I am trying to say that perhaps we should use something
longer than three letters. It will make code longer but easier to
read.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-07-17 23:05:13

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On Wed, 2018-07-11 at 11:29 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:28PM -0700, Yu-cheng Yu wrote:
> >
> > There are three possible shadow stack PTE settings:
> >
> >   Normal SHSTK PTE: (R/O + DIRTY_HW)
> >   SHSTK PTE COW'ed: (R/O + DIRTY_HW)
> >   SHSTK PTE shared as R/O data: (R/O + DIRTY_SW)
> I count _2_ distinct states there.
>
> >
> > Update can_follow_write_pte/pmd for the shadow stack.
> So the below disallows can_follow_write when shstk && _PAGE_DIRTY_SW,
> but this here Changelog doesn't explain why. Doesn't even get close.

Can we add the following to the log:

When a SHSTK PTE is shared, it is (R/O + DIRTY_SW); otherwise it is
(R/O + DIRTY_HW).

When we (FOLL_WRITE | FOLL_FORCE) on a SHSTK PTE, the following
must be true:

  - It has been COW'ed at least once (FOLL_COW is set);
  - It still is not shared, i.e. PTE is (R/O + DIRTY_HW);

>
> Also, the code is a right mess :/ Can't we try harder to not let this
> shadow stack stuff escape arch code.

We either check here if the VMA is a SHSTK mapping or move the logic
to pte_dirty().  The latter would be less obvious.  Or can we
create a can_follow_write_shstk_pte()?

Yu-cheng

2018-07-17 23:08:49

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On Fri, 2018-07-13 at 11:26 -0700, Dave Hansen wrote:
> On 07/11/2018 10:05 AM, Yu-cheng Yu wrote:
> >
> > My understanding is that we don't want to follow write pte if the page
> > is shared as read-only.  For a SHSTK page, that is (R/O + DIRTY_SW),
> > which means the SHSTK page has not been COW'ed.  Is that right?
> Let's look at the code again:
>
> >
> > -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> > +static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
> > +					bool shstk)
> >  {
> > +	bool pte_cowed = shstk ? is_shstk_pte(pte) : pte_dirty(pte);
> > +
> >  	return pte_write(pte) ||
> > -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> > +		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_cowed);
> >  }
> This is another case where the naming of pte_*() is biting us vs. the
> perversion of the PTE bits.  The lack of comments and explanation in the
> patch is compounding the confusion.
>
> We need to find a way to differentiate "someone can write to this PTE"
> from "the write bit is set in this PTE".
>
> In this particular hunk, we need to make it clear that pte_write() is
> *never* true for shadowstack PTEs.  In other words, shadow stack VMAs
> will (should?) never even *see* a pte_write() PTE.
>
> I think this is a case where you just need to bite the bullet and
> bifurcate can_follow_write_pte().  Just separate the shadowstack and
> non-shadowstack parts.

In case I don't understand the exact issue, what about the following?

diff --git a/mm/gup.c b/mm/gup.c
index fc5f98069f4e..45a0837b27f9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -70,6 +70,12 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
 }
 
+static inline bool can_follow_write_shstk_pte(pte_t pte, unsigned int flags)
+{
+	return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+		is_shstk_pte(pte));
+}
+
 static struct page *follow_page_pte(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd, unsigned int flags)
 {
@@ -105,9 +111,16 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
-		pte_unmap_unlock(ptep, ptl);
-		return NULL;
+	if (flags & FOLL_WRITE) {
+		if (is_shstk_mapping(vma->vm_flags)) {
+			if (!can_follow_write_shstk_pte(pte, flags)) {
+				pte_unmap_unlock(ptep, ptl);
+				return NULL;
+			}
+		} else if (!can_follow_write_pte(pte, flags)) {
+			pte_unmap_unlock(ptep, ptl);
+			return NULL;
+		}
 	}
 
 	page = vm_normal_page(vma, address, pte);


2018-07-17 23:12:53

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On 07/17/2018 04:03 PM, Yu-cheng Yu wrote:
> On Fri, 2018-07-13 at 11:26 -0700, Dave Hansen wrote:
>> On 07/11/2018 10:05 AM, Yu-cheng Yu wrote:
>>>
>>> My understanding is that we don't want to follow write pte if the page
>>> is shared as read-only.  For a SHSTK page, that is (R/O + DIRTY_SW),
>>> which means the SHSTK page has not been COW'ed.  Is that right?
>> Let's look at the code again:
>>
>>>
>>> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>>> +static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
>>> +					bool shstk)
>>>  {
>>> +	bool pte_cowed = shstk ? is_shstk_pte(pte) : pte_dirty(pte);
>>> +
>>>  	return pte_write(pte) ||
>>> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
>>> +		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_cowed);
>>>  }
>> This is another case where the naming of pte_*() is biting us vs. the
>> perversion of the PTE bits.  The lack of comments and explanation in the
>> patch is compounding the confusion.
>>
>> We need to find a way to differentiate "someone can write to this PTE"
>> from "the write bit is set in this PTE".
>>
>> In this particular hunk, we need to make it clear that pte_write() is
>> *never* true for shadowstack PTEs.  In other words, shadow stack VMAs
>> will (should?) never even *see* a pte_write() PTE.
>>
>> I think this is a case where you just need to bite the bullet and
>> bifurcate can_follow_write_pte().  Just separate the shadowstack and
>> non-shadowstack parts.
>
> In case I don't understand the exact issue, what about the following?
>
> diff --git a/mm/gup.c b/mm/gup.c
> index fc5f98069f4e..45a0837b27f9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -70,6 +70,12 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
>  }
>  
> +static inline bool can_follow_write_shstk_pte(pte_t pte, unsigned int flags)
> +{
> +	return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> +		is_shstk_pte(pte));
> +}
> +
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		unsigned long address, pmd_t *pmd, unsigned int flags)
>  {
> @@ -105,9 +111,16 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	}
>  	if ((flags & FOLL_NUMA) && pte_protnone(pte))
>  		goto no_page;
> -	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
> -		pte_unmap_unlock(ptep, ptl);
> -		return NULL;
> +	if (flags & FOLL_WRITE) {
> +		if (is_shstk_mapping(vma->vm_flags)) {
> +			if (!can_follow_write_shstk_pte(pte, flags)) {
> +				pte_unmap_unlock(ptep, ptl);
> +				return NULL;
> +			}
> +		} else if (!can_follow_write_pte(pte, flags)) {
> +			pte_unmap_unlock(ptep, ptl);
> +			return NULL;
> +		}

That looks pretty horrible. :(

We need:

bool can_follow_write(struct vm_area_struct *vma, pte_t pte, unsigned int flags)
{
	if (!is_shstk_mapping(vma->vm_flags)) {
		// vanilla case here
	} else {
		// shadowstack case here
	}
}


2018-07-17 23:16:09

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On 07/17/2018 04:03 PM, Yu-cheng Yu wrote:
> We need to find a way to differentiate "someone can write to this PTE"
> from "the write bit is set in this PTE".

Please think about this:

Should pte_write() tell us whether PTE.W=1, or should it tell us
that *something* can write to the PTE, which would include
PTE.W=0/D=1?

2018-07-18 20:20:33

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On Tue, 2018-07-17 at 16:15 -0700, Dave Hansen wrote:
> On 07/17/2018 04:03 PM, Yu-cheng Yu wrote:
> >
> > We need to find a way to differentiate "someone can write to this PTE"
> > from "the write bit is set in this PTE".
> Please think about this:
>
> Should pte_write() tell us whether PTE.W=1, or should it tell us
> that *something* can write to the PTE, which would include
> PTE.W=0/D=1?


Is it better now?


Subject: [PATCH] mm: Modify can_follow_write_pte/pmd for shadow stack

can_follow_write_pte/pmd look for the (RO & DIRTY) PTE/PMD to
verify a non-sharing RO page still exists after a broken COW.

However, a shadow stack PTE is always RO & DIRTY; it can be:

  RO & DIRTY_HW - is_shstk_pte(pte) is true; or
  RO & DIRTY_SW - the page is being shared.

Update these functions to check a non-sharing shadow stack page
still exists after the COW.

Also rename can_follow_write_pte/pmd() to can_follow_write() to
make their meaning clear; i.e. "Can we write to the page?", not
"Is the PTE writable?"

Signed-off-by: Yu-cheng Yu <[email protected]>
---
 mm/gup.c         | 38 ++++++++++++++++++++++++++++++++++----
 mm/huge_memory.c | 19 ++++++++++++++-----
 2 files changed, 48 insertions(+), 9 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index fc5f98069f4e..316967996232 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -63,11 +63,41 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 /*
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
+ *
+ * Background:
+ *
+ * When we force-write to a read-only page, the page fault
+ * handler copies the page and sets the new page's PTE to
+ * RO & DIRTY.  This routine tells
+ *
+ *     "Can we write to the page?"
+ *
+ * by checking:
+ *
+ *     (1) The page has been copied, i.e. FOLL_COW is set;
+ *     (2) The copy still exists and its PTE is RO & DIRTY.
+ *
+ * However, a shadow stack PTE is always RO & DIRTY; it can
+ * be:
+ *
+ *     RO & DIRTY_HW: when is_shstk_pte(pte) is true; or
+ *     RO & DIRTY_SW: when the page is being shared.
+ *
+ * To test a shadow stack's non-sharing page still exists,
+ * we verify that the new page's PTE is_shstk_pte(pte).
  */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write(pte_t pte, unsigned int flags,
+				    struct vm_area_struct *vma)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	if (!is_shstk_mapping(vma->vm_flags)) {
+		if (pte_write(pte))
+			return true;
+		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+			pte_dirty(pte));
+	} else {
+		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+			is_shstk_pte(pte));
+	}
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -105,7 +135,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+	if ((flags & FOLL_WRITE) && !can_follow_write(pte, flags, vma)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7f3e11d3b64a..822a563678b5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1388,11 +1388,20 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 /*
  * FOLL_FORCE can write to even unwritable pmd's, but only
  * after we've gone through a COW cycle and they are dirty.
+ * See comments in mm/gup.c, can_follow_write().
  */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
-{
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+static inline bool can_follow_write(pmd_t pmd, unsigned int flags,
+				    struct vm_area_struct *vma)
+{
+	if (!is_shstk_mapping(vma->vm_flags)) {
+		if (pmd_write(pmd))
+			return true;
+		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+			pmd_dirty(pmd));
+	} else {
+		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+			is_shstk_pmd(pmd));
+	}
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1405,7 +1414,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+	if (flags & FOLL_WRITE && !can_follow_write(*pmd, flags, vma))
 		goto out;
 
  /* Avoid dumping huge zero page */
-- 

2018-07-18 21:47:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On 07/18/2018 01:14 PM, Yu-cheng Yu wrote:
> On Tue, 2018-07-17 at 16:15 -0700, Dave Hansen wrote:
>> On 07/17/2018 04:03 PM, Yu-cheng Yu wrote:
>>>
>>> We need to find a way to differentiate "someone can write to this PTE"
>>> from "the write bit is set in this PTE".
>> Please think about this:
>>
>> Should pte_write() tell us whether PTE.W=1, or should it tell us
>> that *something* can write to the PTE, which would include
>> PTE.W=0/D=1?
>
>
> Is it better now?
>
>
> Subject: [PATCH] mm: Modify can_follow_write_pte/pmd for shadow stack
>
> can_follow_write_pte/pmd look for the (RO & DIRTY) PTE/PMD to
> verify a non-sharing RO page still exists after a broken COW.
>
> However, a shadow stack PTE is always RO & DIRTY; it can be:
>
>   RO & DIRTY_HW - is_shstk_pte(pte) is true; or
>   RO & DIRTY_SW - the page is being shared.
>
> Update these functions to check a non-sharing shadow stack page
> still exists after the COW.
>
> Also rename can_follow_write_pte/pmd() to can_follow_write() to
> make their meaning clear; i.e. "Can we write to the page?", not
> "Is the PTE writable?"
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> ---
>  mm/gup.c         | 38 ++++++++++++++++++++++++++++++++++----
>  mm/huge_memory.c | 19 ++++++++++++++-----
>  2 files changed, 48 insertions(+), 9 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index fc5f98069f4e..316967996232 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -63,11 +63,41 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  /*
>   * FOLL_FORCE can write to even unwritable pte's, but only
>   * after we've gone through a COW cycle and they are dirty.
> + *
> + * Background:
> + *
> + * When we force-write to a read-only page, the page fault
> + * handler copies the page and sets the new page's PTE to
> + * RO & DIRTY.  This routine tells
> + *
> + *     "Can we write to the page?"
> + *
> + * by checking:
> + *
> + *     (1) The page has been copied, i.e. FOLL_COW is set;
> + *     (2) The copy still exists and its PTE is RO & DIRTY.
> + *
> + * However, a shadow stack PTE is always RO & DIRTY; it can
> + * be:
> + *
> + *     RO & DIRTY_HW: when is_shstk_pte(pte) is true; or
> + *     RO & DIRTY_SW: when the page is being shared.
> + *
> + * To test a shadow stack's non-sharing page still exists,
> + * we verify that the new page's PTE is_shstk_pte(pte).

The content is getting there, but we need it next to the code, please.

>   */
> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> +static inline bool can_follow_write(pte_t pte, unsigned int flags,
> +				    struct vm_area_struct *vma)
>  {
> -	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +	if (!is_shstk_mapping(vma->vm_flags)) {
> +		if (pte_write(pte))
> +			return true;

Let me see if I can say this another way.

The bigger issue is that these patches change the semantics of
pte_write(). Before these patches, it meant that you *MUST* have this
bit set to write to the page controlled by the PTE. Now, it means: you
can write if this bit is set *OR* the shadowstack bit combination is set.

That's the fundamental problem. We need some code in the kernel that
logically represents the concept of "is this PTE a shadowstack PTE or a
PTE with the write bit set", and we will call that pte_write(), or maybe
pte_writable().

You *have* to somehow rectify this situation. We can absolutely not
leave pte_write() in its current, ambiguous state where it has no real
meaning or where it is used to mean _both_ things depending on context.

> +		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> +			pte_dirty(pte));
> +	} else {
> +		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> +			is_shstk_pte(pte));
> +	}
>  }

Ok, it's rewrite time I guess.

Yu-cheng, you may not know all the history, but this code is actually
the source of the "Dirty COW" security issue. We need to be very, very
careful with it, and super-explicit about all the logic. This is the
time to blow up the comments and walk folks through exactly what we
expect to happen.

Anybody think I'm being too verbose? Is there a reason not to just go
whole-hog on this sucker?

static inline bool can_follow_write(pte_t pte, unsigned int flags,
				    struct vm_area_struct *vma)
{
	/*
	 * FOLL_FORCE can "write" to hardware read-only PTEs, but
	 * has to do a COW operation first.  Do not allow the
	 * hardware protection override unless we see FOLL_FORCE
	 * *and* the COW has been performed by the fault code.
	 */
	bool gup_cow_ok = (flags & FOLL_FORCE) &&
			  (flags & FOLL_COW);

	/*
	 * FOLL_COW flags tell us whether the page fault code did a COW
	 * operation but not whether the PTE we are dealing with here
	 * was COW'd.  It could have been zapped and refaulted since the
	 * COW operation.
	 */
	bool pte_cow_ok;

	/* We have two COW pte "formats" */
	if (!is_shstk_mapping(vma->vm_flags)) {
		if (pte_write(pte)) {
			/* Any hardware-writable PTE is writable here */
			pte_cow_ok = true;
		} else {
			/* Is the COW-set dirty bit still there? */
			pte_cow_ok = pte_dirty(pte);
		}
	} else {
		/* Shadow stack PTEs are always hardware-writable */

		/*
		 * Shadow stack pages do copy-on-access, so any present
		 * shadow stack page has had a COW-equivalent performed.
		 */
		pte_cow_ok = is_shstk_pte(pte);
	}

	return gup_cow_ok && pte_cow_ok;
}

2018-07-18 23:16:10

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On Wed, 2018-07-18 at 14:45 -0700, Dave Hansen wrote:
> On 07/18/2018 01:14 PM, Yu-cheng Yu wrote:
> >
> > On Tue, 2018-07-17 at 16:15 -0700, Dave Hansen wrote:
> > >
> > > On 07/17/2018 04:03 PM, Yu-cheng Yu wrote:
> > > >
> > > >
> > > > We need to find a way to differentiate "someone can write to this PTE"
> > > > from "the write bit is set in this PTE".
> > > Please think about this:
> > >
> > > Should pte_write() tell us whether PTE.W=1, or should it tell us
> > > that *something* can write to the PTE, which would include
> > > PTE.W=0/D=1?
> >
> > Is it better now?
> >
> >
> > Subject: [PATCH] mm: Modify can_follow_write_pte/pmd for shadow stack
> >
> > can_follow_write_pte/pmd look for the (RO & DIRTY) PTE/PMD to
> > verify a non-sharing RO page still exists after a broken COW.
> >
> > However, a shadow stack PTE is always RO & DIRTY; it can be:
> >
> >   RO & DIRTY_HW - is_shstk_pte(pte) is true; or
> >   RO & DIRTY_SW - the page is being shared.
> >
> > Update these functions to check a non-sharing shadow stack page
> > still exists after the COW.
> >
> > Also rename can_follow_write_pte/pmd() to can_follow_write() to
> > make their meaning clear; i.e. "Can we write to the page?", not
> > "Is the PTE writable?"
> >
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > ---
> >  mm/gup.c         | 38 ++++++++++++++++++++++++++++++++++----
> >  mm/huge_memory.c | 19 ++++++++++++++-----
> >  2 files changed, 48 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index fc5f98069f4e..316967996232 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -63,11 +63,41 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> >  /*
> >   * FOLL_FORCE can write to even unwritable pte's, but only
> >   * after we've gone through a COW cycle and they are dirty.
> > + *
> > + * Background:
> > + *
> > + * When we force-write to a read-only page, the page fault
> > + * handler copies the page and sets the new page's PTE to
> > + * RO & DIRTY.  This routine tells
> > + *
> > + *     "Can we write to the page?"
> > + *
> > + * by checking:
> > + *
> > + *     (1) The page has been copied, i.e. FOLL_COW is set;
> > + *     (2) The copy still exists and its PTE is RO & DIRTY.
> > + *
> > + * However, a shadow stack PTE is always RO & DIRTY; it can
> > + * be:
> > + *
> > + *     RO & DIRTY_HW: when is_shstk_pte(pte) is true; or
> > + *     RO & DIRTY_SW: when the page is being shared.
> > + *
> > + * To test a shadow stack's non-sharing page still exists,
> > + * we verify that the new page's PTE is_shstk_pte(pte).
> The content is getting there, but we need it next to the code, please.
>
> >
> >   */
> > -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> > +static inline bool can_follow_write(pte_t pte, unsigned int flags,
> > +				    struct vm_area_struct *vma)
> >  {
> > -	return pte_write(pte) ||
> > -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> > +	if (!is_shstk_mapping(vma->vm_flags)) {
> > +		if (pte_write(pte))
> > +			return true;
> Let me see if I can say this another way.
>
> The bigger issue is that these patches change the semantics of
> pte_write().  Before these patches, it meant that you *MUST* have this
> bit set to write to the page controlled by the PTE.  Now, it means: you
> can write if this bit is set *OR* the shadowstack bit combination is set.

Here, we only figure out (1) if the page is pointed to by a writable PTE; or
(2) if the page is pointed to by a RO PTE (data or SHSTK) and it has been
copied and it still exists.  We are not trying to determine if the
SHSTK PTE is writable (we know it is not).

We look for the dirty bit to be sure the COW'ed page is still there.
The difference for the shadow stack case is that we look for the *hardware*
dirty bit.  Perhaps we can create another macro, pte_ro_dirty_hw(),
which is equivalent to is_shstk_pte().
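
A minimal sketch of that macro, using the _PAGE_RW/_PAGE_DIRTY_HW bit names from this series (the pte_ro_dirty_hw() name itself is only the suggestion above):

/* R/O + hardware-dirty is exactly the shadow stack PTE encoding */
#define pte_ro_dirty_hw(pte)				\
	(!(pte_flags(pte) & _PAGE_RW) &&		\
	 (pte_flags(pte) & _PAGE_DIRTY_HW))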

>
> That's the fundamental problem.  We need some code in the kernel that
> logically represents the concept of "is this PTE a shadowstack PTE or a
> PTE with the write bit set", and we will call that pte_write(), or maybe
> pte_writable().
>
> You *have* to somehow rectify this situation.  We can absolutely not
> leave pte_write() in its current, ambiguous state where it has no real
> meaning or where it is used to mean _both_ things depending on context.

True, the processor can always write to a page through a shadow stack
PTE, but it must do that with a CALL instruction.  Can we define a
write operation as: MOV r1, *(r2)?  Then we don't have any doubt about
pte_write() any more.

>
> >
> > +		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> > +			pte_dirty(pte));
> > +	} else {
> > +		return ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
> > +			is_shstk_pte(pte));
> > +	}
> >  }
> Ok, it's rewrite time I guess.
>
> Yu-cheng, you may not know all the history, but this code is actually
> the source of the "Dirty COW" security issue.  We need to be very, very
> careful with it, and super-explicit about all the logic.  This is the
> time to blow up the comments and walk folks through exactly what we
> expect to happen.
>
> Anybody think I'm being too verbose?  Is there a reason not to just go
> whole-hog on this sucker?
>
> static inline bool can_follow_write(pte_t pte, unsigned int flags,
> 				    struct vm_area_struct *vma)
> {
> 	/*
> 	 * FOLL_FORCE can "write" to hardware read-only PTEs, but
> 	 * has to do a COW operation first.  Do not allow the
> 	 * hardware protection override unless we see FOLL_FORCE
> 	 * *and* the COW has been performed by the fault code.
> 	 */
> 	bool gup_cow_ok = (flags & FOLL_FORCE) &&
> 			  (flags & FOLL_COW);
>
> 	/*
> 	 * FOLL_COW flags tell us whether the page fault code did a COW
> 	 * operation but not whether the PTE we are dealing with here
> 	 * was COW'd.  It could have been zapped and refaulted since the
> 	 * COW operation.
> 	 */
> 	bool pte_cow_ok;
>
> 	/* We have two COW pte "formats" */
> 	if (!is_shstk_mapping(vma->vm_flags)) {
> 		if (pte_write(pte)) {
> 			/* Any hardware-writable PTE is writable here */
> 			pte_cow_ok = true;
> 		} else {
> 			/* Is the COW-set dirty bit still there? */
> 			pte_cow_ok = pte_dirty(pte);
> 		}
> 	} else {
> 		/* Shadow stack PTEs are always hardware-writable */
>
> 		/*
> 		 * Shadow stack pages do copy-on-access, so any present
> 		 * shadow stack page has had a COW-equivalent performed.
> 		 */
> 		pte_cow_ok = is_shstk_pte(pte);
> 	}
>
> 	return gup_cow_ok && pte_cow_ok;
> }
> }

Ok, I will change it.

Yu-cheng

2018-07-19 00:07:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

>>> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>>> +static inline bool can_follow_write(pte_t pte, unsigned int flags,
>>> +     struct vm_area_struct *vma)
>>>  {
>>> - return pte_write(pte) ||
>>> - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
>>> + if (!is_shstk_mapping(vma->vm_flags)) {
>>> + if (pte_write(pte))
>>> + return true;
>> Let me see if I can say this another way.
>>
>> The bigger issue is that these patches change the semantics of
>> pte_write().  Before these patches, it meant that you *MUST* have this
>> bit set to write to the page controlled by the PTE.  Now, it means: you
>> can write if this bit is set *OR* the shadowstack bit combination is set.
>
> Here, we only figure out (1) if the page is pointed to by a writable PTE; or
> (2) if the page is pointed to by a RO PTE (data or SHSTK) and it has been
> copied and it still exists.  We are not trying to determine if the
> SHSTK PTE is writable (we know it is not).

Please think about the big picture. I'm not just talking about this
patch, but about every use of pte_write() in the kernel.

>> That's the fundamental problem.  We need some code in the kernel that
>> logically represents the concept of "is this PTE a shadowstack PTE or a
>> PTE with the write bit set", and we will call that pte_write(), or maybe
>> pte_writable().
>>
>> You *have* to somehow rectify this situation.  We can absolutely no
>> leave pte_write() in its current, ambiguous state where it has no real
>> meaning or where it is used to mean _both_ things depending on context.
>
> True, the processor can always write to a page through a shadow stack
> PTE, but it must do that with a CALL instruction.  Can we define a
> write operation as: MOV r1, *(r2)?  Then we don't have any doubt about
> pte_write() any more.

No, we can't just move the target. :)

You can define it this way, but then you also need to go to every spot
in the kernel that calls pte_write() (and _PAGE_RW in fact) and audit it
to ensure it means "mov ..." and not push.

2018-07-19 17:10:47

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On Wed, 2018-07-18 at 17:06 -0700, Dave Hansen wrote:
> >
> > >
> > > >
> > > > -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> > > > +static inline bool can_follow_write(pte_t pte, unsigned int flags,
> > > > +				    struct vm_area_struct *vma)
> > > >  {
> > > > -	return pte_write(pte) ||
> > > > -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> > > > +	if (!is_shstk_mapping(vma->vm_flags)) {
> > > > +		if (pte_write(pte))
> > > > +			return true;
> > > Let me see if I can say this another way.
> > >
> > > The bigger issue is that these patches change the semantics of
> > > pte_write().  Before these patches, it meant that you *MUST*
> > > have this
> > > bit set to write to the page controlled by the PTE.  Now, it
> > > means: you
> > > can write if this bit is set *OR* the shadowstack bit
> > > combination is set.
> > Here, we only figure out (1) if the page is pointed by a writable
> > PTE; or
> > (2) if the page is pointed by a RO PTE (data or SHSTK) and it has
> > been
> > copied and it still exists.  We are not trying to
> > determine if the
> > SHSTK PTE is writable (we know it is not).
> Please think about the big picture.  I'm not just talking about this
> patch, but about every use of pte_write() in the kernel.
>
> >
> > >
> > > That's the fundamental problem.  We need some code in the kernel
> > > that
> > > logically represents the concept of "is this PTE a shadowstack
> > > PTE or a
> > > PTE with the write bit set", and we will call that pte_write(),
> > > or maybe
> > > pte_writable().
> > >
> > > You *have* to somehow rectify this situation.  We can absolutely
> > > not
> > > leave pte_write() in its current, ambiguous state where it has
> > > no real
> > > meaning or where it is used to mean _both_ things depending on
> > > context.
> > True, the processor can always write to a page through a shadow
> > stack
> > PTE, but it must do that with a CALL instruction.  Can we define
> > a 
> > write operation as: MOV r1, *(r2)?  Then we don't have any doubt
> > about
> > pte_write() any more.
> No, we can't just move the target. :)
>
> You can define it this way, but then you also need to go to every
> spot
> in the kernel that calls pte_write() (and _PAGE_RW in fact) and
> audit it
> to ensure it means "mov ..." and not push.

Which pte_write() do you think is right?

bool is_shstk_pte(pte_t pte)
{
	return !(pte_flags(pte) & _PAGE_RW) &&
	       (pte_flags(pte) & _PAGE_DIRTY_HW);
}

int pte_write_1(pte_t pte)
{
	return (pte_flags(pte) & _PAGE_RW) && !is_shstk_pte(pte);
}

int pte_write_2(pte_t pte)
{
	return (pte_flags(pte) & _PAGE_RW) || is_shstk_pte(pte);
}

Yu-cheng


2018-07-19 19:32:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/27] mm: Modify can_follow_write_pte/pmd for shadow stack

On 07/19/2018 10:06 AM, Yu-cheng Yu wrote:
> Which pte_write() do you think is right?

There isn't one that's right.

The problem is that the behavior right now is ambiguous. Some callers
of pte_write() need to know about _PAGE_RW alone and others want to know
if (_PAGE_RW || is_shstk()).

The point is that you need both, plus a big audit of all the pte_write()
users to ensure they use the right one.

For instance, see spurious_fault_check(). We can get a shadowstack
fault that also has X86_PF_WRITE, but pte_write()==0. That might make a
shadowstack write fault falsely appear spurious.
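
For reference, the check being flagged looks roughly like this in arch/x86/mm/fault.c of that era (simplified):

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
{
	/*
	 * A shadow stack write fault has X86_PF_WRITE set in the error
	 * code but pte_write() == 0, so it trips this test.
	 */
	if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
		return 0;

	if ((error_code & X86_PF_INSTR) && !pte_exec(*pte))
		return 0;

	return 1;
}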

2018-07-20 14:21:35

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/27] mm: Handle THP/HugeTLB shadow stack page fault

On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> @@ -1193,6 +1195,8 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> pte_t entry;
> entry = mk_pte(pages[i], vma->vm_page_prot);
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + if (is_shstk_mapping(vma->vm_flags))
> + entry = pte_mkdirty_shstk(entry);

Peter Z was pointing out that we should get rid of all this generic code
manipulation. We might not easily be able to do it *all*, but we can do
better than what we've got here.

Basically, if you have code outside of arch/x86 in your patch set that
refers to shadow stacks, you should consider it a bug (for now),
especially if you have to hack .c files.

For instance, in the code above, you could move the is_shstk_mapping() into:

static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (likely(vma->vm_flags & VM_WRITE))
		pte = pte_mkwrite(pte);

+	pte = arch_pte_mkwrite(pte, vma);
+
	return pte;
}

... and add an arch callback that does:

static inline pte_t arch_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (!is_shstk_mapping(vma->vm_flags))
		return pte;

	WARN_ON(... pte bits incompatible with shadow stacks?);

	/* Lots of comments of course */
	return pte_mkdirty_shstk(pte);
}

This is just one example. You are probably going to need a couple of
similar things. Just remember: the bar is very high to make changes to
.c files outside of arch/x86. You can do a _bit_ more in non-x86
headers, but you have the most freedom to patch what you want as long as
it's in arch/x86.

2018-07-20 15:02:54

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/27] mm: Handle THP/HugeTLB shadow stack page fault

On Fri, 2018-07-20 at 07:20 -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> >
> > @@ -1193,6 +1195,8 @@ static int
> > do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >   pte_t entry;
> >   entry = mk_pte(pages[i], vma->vm_page_prot);
> >   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > + if (is_shstk_mapping(vma->vm_flags))
> > + entry = pte_mkdirty_shstk(entry);
> Peter Z was pointing out that we should get rid of all this generic
> code
> manipulation.  We might not easily be able to do it *all*, but we
> can do
> better than what we've got here.
>
> Basically, if you have code outside of arch/x86 in your patch set
> that
> refers to shadow stacks, you should consider it a bug (for now),
> especially if you have to hack .c files.
>
> For instance, in the code above, you could move the
> is_shstk_mapping() into:
>
> static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> {
> 	if (likely(vma->vm_flags & VM_WRITE))
> 		pte = pte_mkwrite(pte);
>
> +	pte = arch_pte_mkwrite(pte, vma);
> +
> 	return pte;
> }
>
> ... and add an arch callback that does:
>
> static inline pte_t arch_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
> {
> 	if (!is_shstk_mapping(vma->vm_flags))
> 		return pte;
>
> 	WARN_ON(... pte bits incompatible with shadow stacks?);
>
> 	/* Lots of comments of course */
> 	return pte_mkdirty_shstk(pte);
> }
>
> This is just one example.  You are probably going to need a couple
> of
> similar things.  Just remember: the bar is very high to make changes
> to
> .c files outside of arch/x86.  You can do a _bit_ more in non-x86
> headers, but you have the most freedom to patch what you want as
> long as
> it's in arch/x86.

Ok, I will work on that.  Thanks!

Yu-cheng

2018-08-14 22:08:46

by Yu-cheng Yu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 13/27] mm: Handle shadow stack page fault

On Wed, 2018-07-11 at 11:06 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 04:06:25PM -0700, Dave Hansen wrote:
> >
> > On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> > >
> > > +	if (is_shstk_mapping(vma->vm_flags))
> > > +		entry = pte_mkdirty_shstk(entry);
> > > +	else
> > > +		entry = pte_mkdirty(entry);
> > > +
> > > +	entry = maybe_mkwrite(entry, vma);
> > >  	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
> > >  		update_mmu_cache(vma, vmf->address, vmf->pte);
> > >  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > @@ -2526,7 +2532,11 @@ static int wp_page_copy(struct vm_fault *vmf)
> > >  	}
> > >  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> > >  	entry = mk_pte(new_page, vma->vm_page_prot);
> > > -	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > > +	if (is_shstk_mapping(vma->vm_flags))
> > > +		entry = pte_mkdirty_shstk(entry);
> > > +	else
> > > +		entry = pte_mkdirty(entry);
> > > +	entry = maybe_mkwrite(entry, vma);
> > Do we want to lift this hunk of code and put it elsewhere?  Maybe:
> >
> > entry = pte_set_vma_features(entry, vma);
> >
> > and then:
> >
> > pte_t pte_set_vma_features(pte_t entry, struct vm_area_struct *vma)
> > {
> > 	/*
> > 	 * Shadow stack PTEs are always dirty and always
> > 	 * writable.  They have a different encoding for
> > 	 * this than normal PTEs, though.
> > 	 */
> > 	if (is_shstk_mapping(vma->vm_flags))
> > 		entry = pte_mkdirty_shstk(entry);
> > 	else
> > 		entry = pte_mkdirty(entry);
> >
> > 	entry = maybe_mkwrite(entry, vma);
> >
> > 	return entry;
> > }
> Yes, that wants a helper like that. Not sold on the name, but
> whatever.
>
> Is there any way we can hide all the shadow stack magic in arch
> code?

We use is_shstk_mapping() only to determine whether _PAGE_DIRTY_SW or
_PAGE_DIRTY_HW should be set in a PTE.  One way to remove this shadow
stack code from generic code is to change pte_mkdirty(pte) to
pte_mkdirty(pte, vma) and handle shadow stack in the arch code.
Is this acceptable?

Thanks,
Yu-cheng
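
A sketch of what that signature change could look like, keeping the shadow stack logic under arch/x86 as Peter asks; the pte_mkdirty_vma() name and the override mechanism are illustrative only:

/* generic header: default falls back to plain pte_mkdirty() */
#ifndef pte_mkdirty_vma
static inline pte_t pte_mkdirty_vma(pte_t pte, struct vm_area_struct *vma)
{
	return pte_mkdirty(pte);
}
#endif

/* arch/x86 override: pick the dirty encoding from the VMA type */
#define pte_mkdirty_vma pte_mkdirty_vma
static inline pte_t pte_mkdirty_vma(pte_t pte, struct vm_area_struct *vma)
{
	if (is_shstk_mapping(vma->vm_flags))
		return pte_mkdirty_shstk(pte);	/* R/O + _PAGE_DIRTY_HW */
	return pte_mkdirty(pte);		/* normal dirty encoding */
}

Generic callers would then pass the VMA down, e.g. entry = pte_mkdirty_vma(entry, vma), and no shadow stack knowledge would remain in mm/.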