2020-04-29 22:10:30

by Yu-cheng Yu

Subject: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

Control-flow Enforcement Technology (CET) is a new Intel processor feature
that blocks return/jump-oriented programming attacks. Details can be found
in the "Intel 64 and IA-32 Architectures Software Developer's Manual" [1].

This series depends on the XSAVES supervisor state series that was split
out and submitted earlier [2].

I have gone through previous comments, and hope all concerns have been
resolved now. Please inform me if anything is overlooked.

Changes in v10:

- A shadow stack PTE is (!_PAGE_RW and _PAGE_DIRTY_HW). In handling page
faults, previous versions of this series used helpers such as
arch_copy_pte_mapping() and arch_set_vma_features() to manage the
_PAGE_DIRTY_HW bit for the copy-on-write logic. This has been simplified
by treating shadow stack as logically writable, so shadow stack faults
are handled like faults on normal writable data pages. Functions
pte_write(), pte_mkwrite(), pte_wrprotect(), maybe_mkwrite() etc. are
updated accordingly.

- Signal return code is updated according to the XSAVES supervisor state
changes.

- Other smaller changes are noted in each patch's log.
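
The first change above can be pictured with a small user-space model. This
is only a sketch, not code from the series: the bit positions follow the x86
PTE layout (bit 1 is R/W, bit 6 is Dirty), and the names MODEL_PAGE_RW,
MODEL_PAGE_DIRTY_HW, model_pte_shstk() and model_pte_write() are
hypothetical stand-ins for the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's _PAGE_RW and _PAGE_DIRTY_HW. */
#define MODEL_PAGE_RW        (UINT64_C(1) << 1)
#define MODEL_PAGE_DIRTY_HW  (UINT64_C(1) << 6)

/* A shadow stack PTE has R/W clear and the hardware Dirty bit set. */
static inline bool model_pte_shstk(uint64_t pte)
{
    return (pte & (MODEL_PAGE_RW | MODEL_PAGE_DIRTY_HW)) ==
           MODEL_PAGE_DIRTY_HW;
}

/* "Logically writable": a normal writable PTE or a shadow stack PTE. */
static inline bool model_pte_write(uint64_t pte)
{
    return (pte & MODEL_PAGE_RW) != 0 || model_pte_shstk(pte);
}
```

Under this model, a PTE with only the Dirty bit set counts as writable, which
is what lets the generic copy-on-write path handle it without special helpers.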

[1] Intel 64 and IA-32 Architectures Software Developer's Manual:

https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4

[2] XSAVES supervisor states patches:
https://lkml.kernel.org/r/[email protected]/

[3] CET Shadow Stack patches v9:

https://lkml.kernel.org/r/[email protected]/

Dave Martin (1):
ELF: Add ELF program property parsing support

Yu-cheng Yu (25):
Documentation/x86: Add CET description
x86/cpufeatures: Add CET CPU feature flags for Control-flow
Enforcement Technology (CET)
x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states
x86/cet: Add control-protection fault handler
x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack
x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages
x86/mm: Introduce _PAGE_COW
drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
x86/mm: Update pte_modify for _PAGE_COW
x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
transition from _PAGE_DIRTY_HW to _PAGE_COW
mm: Introduce VM_SHSTK for shadow stack memory
x86/mm: Shadow Stack page fault error checking
x86/mm: Update maybe_mkwrite() for shadow stack
mm: Fixup places that call pte_mkwrite() directly
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm: Update can_follow_write_pte() for shadow stack
x86/cet/shstk: User-mode shadow stack support
x86/cet/shstk: Handle signals for shadow stack
ELF: UAPI and Kconfig additions for ELF program properties
ELF: Introduce arch_setup_elf_property()
x86/cet/shstk: ELF header parsing for shadow stack
x86/cet/shstk: Handle thread shadow stack
x86/cet/shstk: Add arch_prctl functions for shadow stack

.../admin-guide/kernel-parameters.txt | 6 +
Documentation/x86/index.rst | 1 +
Documentation/x86/intel_cet.rst | 129 +++++++
arch/x86/Kconfig | 36 ++
arch/x86/entry/entry_64.S | 2 +-
arch/x86/ia32/ia32_signal.c | 17 +
arch/x86/include/asm/cet.h | 40 ++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/elf.h | 13 +
arch/x86/include/asm/fpu/internal.h | 10 +
arch/x86/include/asm/fpu/types.h | 22 ++
arch/x86/include/asm/fpu/xstate.h | 5 +-
arch/x86/include/asm/mmu_context.h | 3 +
arch/x86/include/asm/msr-index.h | 18 +
arch/x86/include/asm/pgtable.h | 209 +++++++++-
arch/x86/include/asm/pgtable_types.h | 58 ++-
arch/x86/include/asm/processor.h | 15 +
arch/x86/include/asm/special_insns.h | 32 ++
arch/x86/include/asm/traps.h | 7 +
arch/x86/include/uapi/asm/prctl.h | 5 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/include/uapi/asm/sigcontext.h | 9 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cet.c | 356 ++++++++++++++++++
arch/x86/kernel/cet_prctl.c | 87 +++++
arch/x86/kernel/cpu/common.c | 28 ++
arch/x86/kernel/cpu/cpuid-deps.c | 2 +
arch/x86/kernel/fpu/signal.c | 101 +++++
arch/x86/kernel/fpu/xstate.c | 25 +-
arch/x86/kernel/idt.c | 4 +
arch/x86/kernel/process.c | 12 +-
arch/x86/kernel/process_64.c | 29 ++
arch/x86/kernel/relocate_kernel_64.S | 2 +-
arch/x86/kernel/signal.c | 10 +
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 59 +++
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/mm/fault.c | 19 +
arch/x86/mm/mmap.c | 2 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 25 ++
drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
fs/Kconfig.binfmt | 3 +
fs/binfmt_elf.c | 131 +++++++
fs/compat_binfmt_elf.c | 4 +
fs/proc/task_mmu.c | 3 +
include/asm-generic/pgtable.h | 35 ++
include/linux/elf.h | 33 ++
include/linux/mm.h | 34 +-
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/elf.h | 12 +
mm/gup.c | 8 +-
mm/huge_memory.c | 10 +-
mm/memory.c | 5 +-
mm/migrate.c | 3 +-
mm/mmap.c | 5 +
mm/mprotect.c | 2 +-
scripts/as-x86_64-has-shadow-stack.sh | 4 +
.../arch/x86/include/asm/disabled-features.h | 8 +-
60 files changed, 1670 insertions(+), 53 deletions(-)
create mode 100644 Documentation/x86/intel_cet.rst
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/kernel/cet.c
create mode 100644 arch/x86/kernel/cet_prctl.c
create mode 100755 scripts/as-x86_64-has-shadow-stack.sh

--
2.21.0


2020-04-29 22:10:33

by Yu-cheng Yu

Subject: [PATCH v10 03/26] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states

Control-flow Enforcement Technology (CET) adds five MSRs. Introduce them
and their XSAVES supervisor states:

MSR_IA32_U_CET (user-mode CET settings),
MSR_IA32_PL3_SSP (user-mode Shadow Stack pointer),
MSR_IA32_PL0_SSP (kernel-mode Shadow Stack pointer),
MSR_IA32_PL1_SSP (Privilege Level 1 Shadow Stack pointer),
MSR_IA32_PL2_SSP (Privilege Level 2 Shadow Stack pointer).
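
As a rough illustration of how these MSRs map onto XSAVES state components,
the sketch below mirrors the two structures this patch adds, using
fixed-width user-space types. The model_ and MODEL_ names are hypothetical;
component numbers 11 and 12 are taken from the comments in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Component numbers as stated in the patch comments. */
enum model_xfeature {
    MODEL_XFEATURE_CET_USER   = 11,
    MODEL_XFEATURE_CET_KERNEL = 12,
};

#define MODEL_XFEATURE_MASK(nr)  (UINT64_C(1) << (nr))

/* User states: MSR_IA32_U_CET and MSR_IA32_PL3_SSP. */
struct model_cet_user_state {
    uint64_t user_cet;
    uint64_t user_ssp;
};

/* Kernel states: MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP. */
struct model_cet_kernel_state {
    uint64_t kernel_ssp;
    uint64_t pl1_ssp;
    uint64_t pl2_ssp;
};

/* Two and three 64-bit fields respectively, so no padding is expected. */
_Static_assert(sizeof(struct model_cet_user_state) == 16, "user state size");
_Static_assert(sizeof(struct model_cet_kernel_state) == 24, "kernel state size");
```

With these numbers the user-state mask is 0x800 and the kernel-state mask is
0x1000.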

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v6:
- Remove __packed from struct cet_user_state, struct cet_kernel_state.

arch/x86/include/asm/fpu/types.h | 22 ++++++++++++++++++
arch/x86/include/asm/fpu/xstate.h | 5 +++--
arch/x86/include/asm/msr-index.h | 18 +++++++++++++++
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/fpu/xstate.c | 25 +++++++++++++++++++--
5 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f098f6cab94b..d7ef4d9c7ad5 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -114,6 +114,9 @@ enum xfeature {
XFEATURE_Hi16_ZMM,
XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
XFEATURE_PKRU,
+ XFEATURE_RESERVED,
+ XFEATURE_CET_USER,
+ XFEATURE_CET_KERNEL,

XFEATURE_MAX,
};
@@ -128,6 +131,8 @@ enum xfeature {
#define XFEATURE_MASK_Hi16_ZMM (1 << XFEATURE_Hi16_ZMM)
#define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
+#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)

#define XFEATURE_MASK_FPSSE (XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
#define XFEATURE_MASK_AVX512 (XFEATURE_MASK_OPMASK \
@@ -229,6 +234,23 @@ struct pkru_state {
u32 pad;
} __packed;

+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+ u64 user_cet; /* user control-flow settings */
+ u64 user_ssp; /* user shadow stack pointer */
+};
+
+/*
+ * State component 12 is Control-flow Enforcement kernel states
+ */
+struct cet_kernel_state {
+ u64 kernel_ssp; /* kernel shadow stack */
+ u64 pl1_ssp; /* privilege level 1 shadow stack */
+ u64 pl2_ssp; /* privilege level 2 shadow stack */
+};
+
struct xstate_header {
u64 xfeatures;
u64 xcomp_bv;
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 422d8369012a..db89d796b22e 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -33,13 +33,14 @@
XFEATURE_MASK_BNDCSR)

/* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (0)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_CET_USER)

/*
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+ XFEATURE_MASK_CET_KERNEL)

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 12c9684d59ba..47f603729543 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -885,4 +885,22 @@
#define MSR_VM_IGNNE 0xc0010115
#define MSR_VM_HSAVE_PA 0xc0010117

+/* Control-flow Enforcement Technology MSRs */
+#define MSR_IA32_U_CET 0x6a0 /* user mode cet setting */
+#define MSR_IA32_S_CET 0x6a2 /* kernel mode cet setting */
+#define MSR_IA32_PL0_SSP 0x6a4 /* kernel shstk pointer */
+#define MSR_IA32_PL1_SSP 0x6a5 /* ring-1 shstk pointer */
+#define MSR_IA32_PL2_SSP 0x6a6 /* ring-2 shstk pointer */
+#define MSR_IA32_PL3_SSP 0x6a7 /* user shstk pointer */
+#define MSR_IA32_INT_SSP_TAB 0x6a8 /* exception shstk table */
+
+/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
+#define MSR_IA32_CET_SHSTK_EN 0x0000000000000001ULL
+#define MSR_IA32_CET_WRSS_EN 0x0000000000000002ULL
+#define MSR_IA32_CET_ENDBR_EN 0x0000000000000004ULL
+#define MSR_IA32_CET_LEG_IW_EN 0x0000000000000008ULL
+#define MSR_IA32_CET_NO_TRACK_EN 0x0000000000000010ULL
+#define MSR_IA32_CET_WAIT_ENDBR 0x0000000000000800ULL
+#define MSR_IA32_CET_BITMAP_MASK 0xfffffffffffff000ULL
+
#endif /* _ASM_X86_MSR_INDEX_H */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..a8df907e8017 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
#define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
#define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */
#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_CET_BIT 23 /* enable Control-flow Enforcement */
+#define X86_CR4_CET _BITUL(X86_CR4_CET_BIT)

/*
* x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 587e03f0094d..7c7be482e6f3 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -38,6 +38,9 @@ static const char *xfeature_names[] =
"Processor Trace (unused)" ,
"Protection Keys User registers",
"unknown xstate feature" ,
+ "Control-flow User registers" ,
+ "Control-flow Kernel registers" ,
+ "unknown xstate feature" ,
};

static short xsave_cpuid_features[] __initdata = {
@@ -51,6 +54,9 @@ static short xsave_cpuid_features[] __initdata = {
X86_FEATURE_AVX512F,
X86_FEATURE_INTEL_PT,
X86_FEATURE_PKU,
+ -1, /* Unused */
+ X86_FEATURE_SHSTK, /* XFEATURE_CET_USER */
+ X86_FEATURE_SHSTK, /* XFEATURE_CET_KERNEL */
};

/*
@@ -316,6 +322,8 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
+ print_xstate_feature(XFEATURE_MASK_CET_USER);
+ print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
}

/*
@@ -590,6 +598,8 @@ static void check_xstate_against_struct(int nr)
XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
+ XCHECK_SZ(sz, nr, XFEATURE_CET_USER, struct cet_user_state);
+ XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);

/*
* Make *SURE* to add any feature numbers in below if
@@ -797,8 +807,19 @@ void __init fpu__init_system_xstate(void)
* Clear XSAVE features that are disabled in the normal CPUID.
*/
for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
- if (!boot_cpu_has(xsave_cpuid_features[i]))
- xfeatures_mask_all &= ~BIT_ULL(i);
+ if (xsave_cpuid_features[i] == X86_FEATURE_SHSTK) {
+ /*
+ * X86_FEATURE_SHSTK and X86_FEATURE_IBT share
+ * same states, but can be enabled separately.
+ */
+ if (!boot_cpu_has(X86_FEATURE_SHSTK) &&
+ !boot_cpu_has(X86_FEATURE_IBT))
+ xfeatures_mask_all &= ~BIT_ULL(i);
+ } else {
+ if ((xsave_cpuid_features[i] == -1) ||
+ !boot_cpu_has(xsave_cpuid_features[i]))
+ xfeatures_mask_all &= ~BIT_ULL(i);
+ }
}

xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
--
2.21.0
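
The CPUID filtering loop in the xstate.c hunk above can be modeled in user
space as follows. This is a sketch under stated assumptions: the feature
identifiers, the cpu_caps bitmask and the function names are all
hypothetical, and only the keep/clear decision is meant to match the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPUID feature identifiers for this model only. */
enum model_cpu_feature {
    MODEL_FEATURE_NONE = -1,   /* slot has no backing feature ("Unused") */
    MODEL_FEATURE_PKU,
    MODEL_FEATURE_SHSTK,
    MODEL_FEATURE_IBT,
};

static inline bool model_cpu_has(uint32_t cpu_caps, int feature)
{
    return feature >= 0 && (cpu_caps & (UINT32_C(1) << feature)) != 0;
}

/*
 * Clear bit i of mask when the CPU lacks the feature backing slot i.
 * Slots backed by SHSTK survive if either SHSTK or IBT is present,
 * because both features share the same CET state components.
 */
static inline uint64_t model_filter_xfeatures(uint64_t mask,
                                               const int *features,
                                               size_t n, uint32_t cpu_caps)
{
    for (size_t i = 0; i < n; i++) {
        bool keep;

        if (features[i] == MODEL_FEATURE_SHSTK)
            keep = model_cpu_has(cpu_caps, MODEL_FEATURE_SHSTK) ||
                   model_cpu_has(cpu_caps, MODEL_FEATURE_IBT);
        else
            keep = model_cpu_has(cpu_caps, features[i]);

        if (!keep)
            mask &= ~(UINT64_C(1) << i);
    }
    return mask;
}
```

On a modeled CPU with only indirect branch tracking, the CET slots stay set
while the unused slot and any other missing feature are cleared.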

2020-04-29 22:10:36

by Yu-cheng Yu

Subject: [PATCH v10 01/26] Documentation/x86: Add CET description

Explain no_user_shstk/no_user_ibt kernel parameters, and introduce a new
document on Control-flow Enforcement Technology (CET).

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v10:
- Change no_cet_shstk and no_cet_ibt to no_user_shstk and no_user_ibt.
- Remove the opcode section, as it is already in the Intel SDM.
- Remove sections related to GLIBC implementation.
- Remove shadow stack memory management section, as it is already in the
code comments.
- Remove legacy bitmap related information, as it is not supported now.
- Fix arch_prctl() related text.
- Change SHSTK, IBT to plain English.

.../admin-guide/kernel-parameters.txt | 6 +
Documentation/x86/index.rst | 1 +
Documentation/x86/intel_cet.rst | 129 ++++++++++++++++++
3 files changed, 136 insertions(+)
create mode 100644 Documentation/x86/intel_cet.rst

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7bc83f3d9bdf..be715675df6d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3093,6 +3093,12 @@
noexec=on: enable non-executable mappings (default)
noexec=off: disable non-executable mappings

+ no_user_shstk [X86-64] Disable Shadow Stack for user-mode
+ applications
+
+ no_user_ibt [X86-64] Disable Indirect Branch Tracking for user-mode
+ applications
+
nosmap [X86,PPC]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index 265d9e9a093b..2aef972a868d 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -19,6 +19,7 @@ x86-specific Documentation
tlb
mtrr
pat
+ intel_cet
intel-iommu
intel_txt
amd-memory-encryption
diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
new file mode 100644
index 000000000000..746eda8c82f3
--- /dev/null
+++ b/Documentation/x86/intel_cet.rst
@@ -0,0 +1,129 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Control-flow Enforcement Technology (CET)
+=========================================
+
+[1] Overview
+============
+
+Control-flow Enforcement Technology (CET) is an Intel processor feature
+that provides protection against return/jump-oriented programming (ROP/JOP)
+attacks. It can be set up to protect both applications and the kernel.
+Only user-mode protection is implemented in the 64-bit kernel, including
+support for running legacy 32-bit applications.
+
+CET introduces Shadow Stack and Indirect Branch Tracking. Shadow stack is
+a secondary stack allocated from memory and cannot be directly modified by
+applications. When executing a CALL, the processor pushes the return
+address to both the normal stack and the shadow stack. Upon function
+return, the processor pops the shadow stack copy and compares it to the
+normal stack copy. If the two differ, the processor raises a control-
+protection fault. Indirect branch tracking verifies indirect CALL/JMP
+targets are intended as marked by the compiler with 'ENDBR' opcodes.
+
+There are two kernel configuration options:
+
+ X86_INTEL_SHADOW_STACK_USER, and
+ X86_INTEL_BRANCH_TRACKING_USER.
+
+These need to be enabled to build a CET-enabled kernel, and Binutils v2.31
+and GCC v8.1 or later are required to build a CET kernel. To build a CET-
+enabled application, GLIBC v2.28 or later is also required.
+
+There are two command-line options for disabling CET features::
+
+ no_user_shstk - disables user shadow stack, and
+ no_user_ibt - disables user indirect branch tracking.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET.
+
+[2] Application Enabling
+========================
+
+An application's CET capability is marked in its ELF header and can be
+verified from the following command output, in the NT_GNU_PROPERTY_TYPE_0
+field:
+
+ readelf -n <application>
+
+If an application supports CET and is statically linked, it will run with
+CET protection. If the application needs any shared libraries, the loader
+checks all dependencies and enables CET when all requirements are met.
+
+[3] CET arch_prctl()'s
+======================
+
+Several arch_prctl()'s have been added for CET:
+
+arch_prctl(ARCH_X86_CET_STATUS, u64 *addr)
+ Return CET feature status.
+
+ The parameter 'addr' is a pointer to a user buffer.
+ On returning to the caller, the kernel fills the following
+ information::
+
+ *addr = shadow stack/indirect branch tracking status
+ *(addr + 1) = shadow stack base address
+ *(addr + 2) = shadow stack size
+
+arch_prctl(ARCH_X86_CET_DISABLE, u64 features)
+ Disable shadow stack and/or indirect branch tracking as specified in
+ 'features'. Return -EPERM if CET is locked.
+
+arch_prctl(ARCH_X86_CET_LOCK)
+ Lock in all CET features. They cannot be turned off afterwards.
+
+arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, u64 *addr)
+ Allocate a new shadow stack and put a restore token at top.
+
+ The parameter 'addr' is a pointer to a user buffer and indicates the
+ shadow stack size to allocate. On returning to the caller, the kernel
+ fills '*addr' with the base address of the new shadow stack.
+
+ User-level threads that need a new stack are expected to allocate a
+ new shadow stack.
+
+Note:
+ There is no CET-enabling arch_prctl function. By design, CET is enabled
+ automatically if the binary and the system can support it.
+
+[4] The implementation of the Shadow Stack
+==========================================
+
+Shadow Stack size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However,
+because a compat-mode application's address space is smaller, each of its
+threads' shadow stacks is sized MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+The main program and its signal handlers use the same shadow stack.
+Because the shadow stack stores only return addresses, a large shadow
+stack covers the condition that both the program stack and the signal
+alternate stack run out.
+
+The kernel creates a restore token for the shadow stack restoring address
+and verifies that token when restoring from the signal handler.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHSTK flag set; its PTEs are required to be
+read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow stack access triggers a page fault with the shadow stack access bit
+set in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread.
--
2.21.0
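
The CALL/RET behaviour described in the overview of the new document can be
sketched as a toy user-space model. Nothing here is hardware or kernel code:
struct model_cpu, model_call() and model_ret() are hypothetical, and a
false return from model_ret() merely stands in for the control-protection
fault the processor would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_STACK_DEPTH 16

struct model_cpu {
    uint64_t stack[MODEL_STACK_DEPTH];  /* normal stack: program-writable */
    uint64_t shstk[MODEL_STACK_DEPTH];  /* shadow stack: CALL/RET only */
    size_t sp;                          /* entries in use on both stacks */
};

/* CALL: push the return address onto both stacks. */
static inline bool model_call(struct model_cpu *cpu, uint64_t ret_addr)
{
    if (cpu->sp >= MODEL_STACK_DEPTH)
        return false;                   /* model limit reached */
    cpu->stack[cpu->sp] = ret_addr;
    cpu->shstk[cpu->sp] = ret_addr;
    cpu->sp++;
    return true;
}

/*
 * RET: pop both copies and compare them. Returns true and stores the
 * return address on a match; returns false where real hardware would
 * raise a control-protection fault.
 */
static inline bool model_ret(struct model_cpu *cpu, uint64_t *ret_addr)
{
    if (cpu->sp == 0)
        return false;
    cpu->sp--;
    if (cpu->stack[cpu->sp] != cpu->shstk[cpu->sp])
        return false;
    *ret_addr = cpu->stack[cpu->sp];
    return true;
}
```

Overwriting an entry in stack[] between model_call() and model_ret(), as a
stack-smashing attack would, makes the comparison fail because the shadow
copy is untouched.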

2020-04-29 22:10:47

by Yu-cheng Yu

Subject: [PATCH v10 06/26] x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW

Before introducing _PAGE_COW for non-hardware memory management purposes in
the next patch, rename _PAGE_DIRTY to _PAGE_DIRTY_HW and _PAGE_BIT_DIRTY to
_PAGE_BIT_DIRTY_HW to make their meanings clearer. This patch makes no
functional changes.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
---
v9:
- In some places _PAGE_DIRTY was not changed to _PAGE_DIRTY_HW, because
it will be changed again in the next patch to _PAGE_DIRTY_BITS.
However, this causes compile issues if the next patch is not yet applied.
Fix it by changing all _PAGE_DIRTY to _PAGE_DIRTY_HW.

arch/x86/include/asm/pgtable.h | 18 +++++++++---------
arch/x86/include/asm/pgtable_types.h | 11 +++++------
arch/x86/kernel/relocate_kernel_64.S | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
4 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 4d02e64af1b3..90f9a73881ad 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,7 +124,7 @@ extern pmdval_t early_pmd_flags;
*/
static inline int pte_dirty(pte_t pte)
{
- return pte_flags(pte) & _PAGE_DIRTY;
+ return pte_flags(pte) & _PAGE_DIRTY_HW;
}


@@ -163,7 +163,7 @@ static inline int pte_young(pte_t pte)

static inline int pmd_dirty(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_DIRTY;
+ return pmd_flags(pmd) & _PAGE_DIRTY_HW;
}

static inline int pmd_young(pmd_t pmd)
@@ -173,7 +173,7 @@ static inline int pmd_young(pmd_t pmd)

static inline int pud_dirty(pud_t pud)
{
- return pud_flags(pud) & _PAGE_DIRTY;
+ return pud_flags(pud) & _PAGE_DIRTY_HW;
}

static inline int pud_young(pud_t pud)
@@ -333,7 +333,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)

static inline pte_t pte_mkclean(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_DIRTY);
+ return pte_clear_flags(pte, _PAGE_DIRTY_HW);
}

static inline pte_t pte_mkold(pte_t pte)
@@ -353,7 +353,7 @@ static inline pte_t pte_mkexec(pte_t pte)

static inline pte_t pte_mkdirty(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

static inline pte_t pte_mkyoung(pte_t pte)
@@ -434,7 +434,7 @@ static inline pmd_t pmd_mkold(pmd_t pmd)

static inline pmd_t pmd_mkclean(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_DIRTY);
+ return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
}

static inline pmd_t pmd_wrprotect(pmd_t pmd)
@@ -444,7 +444,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)

static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -488,7 +488,7 @@ static inline pud_t pud_mkold(pud_t pud)

static inline pud_t pud_mkclean(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_DIRTY);
+ return pud_clear_flags(pud, _PAGE_DIRTY_HW);
}

static inline pud_t pud_wrprotect(pud_t pud)
@@ -498,7 +498,7 @@ static inline pud_t pud_wrprotect(pud_t pud)

static inline pud_t pud_mkdirty(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

static inline pud_t pud_mkdevmap(pud_t pud)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b6606fe6cfdf..b82e0f167879 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -15,7 +15,7 @@
#define _PAGE_BIT_PWT 3 /* page write through */
#define _PAGE_BIT_PCD 4 /* page cache disabled */
#define _PAGE_BIT_ACCESSED 5 /* was accessed (raised by CPU) */
-#define _PAGE_BIT_DIRTY 6 /* was written to (raised by CPU) */
+#define _PAGE_BIT_DIRTY_HW 6 /* was written to (raised by CPU) */
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
#define _PAGE_BIT_PAT 7 /* on 4KB pages */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
@@ -46,7 +46,7 @@
#define _PAGE_PWT (_AT(pteval_t, 1) << _PAGE_BIT_PWT)
#define _PAGE_PCD (_AT(pteval_t, 1) << _PAGE_BIT_PCD)
#define _PAGE_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
-#define _PAGE_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
+#define _PAGE_DIRTY_HW (_AT(pteval_t, 1) << _PAGE_BIT_DIRTY_HW)
#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
#define _PAGE_SOFTW1 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
@@ -74,7 +74,7 @@
_PAGE_PKEY_BIT3)

#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
-#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
+#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY_HW | _PAGE_ACCESSED)
#else
#define _PAGE_KNL_ERRATUM_MASK 0
#endif
@@ -126,7 +126,7 @@
* pte_modify() does modify it.
*/
#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
- _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
+ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_HW | \
_PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \
_PAGE_UFFD_WP)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
@@ -163,7 +163,7 @@ enum page_cache_mode {
#define __RW _PAGE_RW
#define _USR _PAGE_USER
#define ___A _PAGE_ACCESSED
-#define ___D _PAGE_DIRTY
+#define ___D _PAGE_DIRTY_HW
#define ___G _PAGE_GLOBAL
#define __NX _PAGE_NX

@@ -205,7 +205,6 @@ enum page_cache_mode {
#define __PAGE_KERNEL_IO __PAGE_KERNEL
#define __PAGE_KERNEL_IO_NOCACHE __PAGE_KERNEL_NOCACHE

-
#ifndef __ASSEMBLY__

#define __PAGE_KERNEL_ENC (__PAGE_KERNEL | _ENC)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index a4d9a261425b..e3bb4ff95523 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -17,7 +17,7 @@
*/

#define PTR(x) (x << 3)
-#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY_HW)

/*
* control_page + KEXEC_CONTROL_CODE_MAX_SIZE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c2c6335a998c..d52d470e36b1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3501,7 +3501,7 @@ static int init_rmode_identity_map(struct kvm *kvm)
/* Set up identity-mapping pagetable for EPT in real mode */
for (i = 0; i < PT32_ENT_PER_PAGE; i++) {
tmp = (i << 22) + (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
- _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+ _PAGE_ACCESSED | _PAGE_DIRTY_HW | _PAGE_PSE);
r = kvm_write_guest_page(kvm, identity_map_pfn,
&tmp, i * sizeof(tmp), sizeof(tmp));
if (r < 0)
--
2.21.0

2020-04-29 22:10:50

by Yu-cheng Yu

Subject: [PATCH v10 12/26] mm: Introduce VM_SHSTK for shadow stack memory

A Shadow Stack PTE must be read-only and have _PAGE_DIRTY_HW set. However,
read-only and dirty PTEs also exist for copy-on-write (COW) pages. These
two cases are handled differently for page faults. Introduce VM_SHSTK to
track shadow stack VMAs.
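
A minimal user-space sketch of the idea: a read-only, dirty PTE is ambiguous
on its own, so the VMA flag decides how it is treated. Bit 37 matches
VM_HIGH_ARCH_BIT_5 in this patch; struct model_vma and the model_ functions
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit position as VM_HIGH_ARCH_5 in this patch. */
#define MODEL_VM_SHSTK  (UINT64_C(1) << 37)

struct model_vma {
    uint64_t vm_flags;
};

enum model_ro_dirty_kind {
    MODEL_RO_DIRTY_COW,     /* ordinary copy-on-write page */
    MODEL_RO_DIRTY_SHSTK,   /* shadow stack page */
};

/* Classify a read-only, dirty PTE by the VMA it belongs to. */
static inline enum model_ro_dirty_kind
model_classify_ro_dirty(const struct model_vma *vma)
{
    return (vma->vm_flags & MODEL_VM_SHSTK) ? MODEL_RO_DIRTY_SHSTK
                                            : MODEL_RO_DIRTY_COW;
}

/* Mirrors the arch_vma_name() hunk: only shadow stack VMAs get a name. */
static inline const char *model_vma_name(const struct model_vma *vma)
{
    if (vma->vm_flags & MODEL_VM_SHSTK)
        return "[shadow stack]";
    return NULL;
}
```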

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v9:
- Add VM_SHSTK case to arch_vma_name().
- Revise the commit log to explain why a new VM flag is added.

arch/x86/mm/mmap.c | 2 ++
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 8 ++++++++
3 files changed, 13 insertions(+)

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index cb91eccc4960..fe77fd6debf1 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -163,6 +163,8 @@ unsigned long get_mmap_base(int is_legacy)

const char *arch_vma_name(struct vm_area_struct *vma)
{
+ if (vma->vm_flags & VM_SHSTK)
+ return "[shadow stack]";
return NULL;
}

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8d382d4ec067..434692759265 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -657,6 +657,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_PKEY_BIT4)] = "",
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+ [ilog2(VM_SHSTK)] = "ss",
+#endif
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..54bb4cd9fee8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -336,6 +338,12 @@ extern unsigned int kobjsize(const void *objp);
# define VM_MPX VM_NONE
#endif

+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+# define VM_SHSTK VM_HIGH_ARCH_5
+#else
+# define VM_SHSTK VM_NONE
+#endif
+
#ifndef VM_GROWSUP
# define VM_GROWSUP VM_NONE
#endif
--
2.21.0

2020-04-29 22:10:52

by Yu-cheng Yu

Subject: [PATCH v10 14/26] x86/mm: Update maybe_mkwrite() for shadow stack

Shadow stack memory is writable, but its VMA has VM_SHSTK instead of
VM_WRITE. Update maybe_mkwrite() to include the shadow stack.
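
The resulting decision can be sketched in user space as below. The MODEL_
constants and model_ functions are hypothetical. In particular, the body of
model_pte_mkwrite_shstk() is an assumption (set Dirty, leave R/W clear),
since pte_mkwrite_shstk() itself is defined elsewhere in the series.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_VM_WRITE       (UINT64_C(1) << 1)
#define MODEL_VM_SHSTK       (UINT64_C(1) << 37)
#define MODEL_PAGE_RW        (UINT64_C(1) << 1)
#define MODEL_PAGE_DIRTY_HW  (UINT64_C(1) << 6)

/* Assumption: a shadow stack PTE becomes "writable" by setting Dirty only. */
static inline uint64_t model_pte_mkwrite_shstk(uint64_t pte)
{
    return (pte & ~MODEL_PAGE_RW) | MODEL_PAGE_DIRTY_HW;
}

/* VM_WRITE takes the normal path; otherwise VM_SHSTK gets the arch path. */
static inline uint64_t model_maybe_mkwrite(uint64_t pte, uint64_t vm_flags)
{
    if (vm_flags & MODEL_VM_WRITE)
        pte |= MODEL_PAGE_RW;
    else if (vm_flags & MODEL_VM_SHSTK)
        pte = model_pte_mkwrite_shstk(pte);
    return pte;
}
```

A VMA with neither flag leaves the PTE unchanged, matching the stub versions
of arch_maybe_mkwrite() used when ARCH_MAYBE_MKWRITE is not selected.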

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/mm/pgtable.c | 18 ++++++++++++++++++
include/asm-generic/pgtable.h | 24 ++++++++++++++++++++++++
include/linux/mm.h | 2 ++
mm/huge_memory.c | 2 ++
5 files changed, 50 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c98f82fffe85..ac07e1f6a2bc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1955,6 +1955,9 @@ config AS_HAS_SHADOW_STACK
config X86_INTEL_CET
def_bool n

+config ARCH_MAYBE_MKWRITE
+ def_bool n
+
config ARCH_HAS_SHADOW_STACK
def_bool n

@@ -1965,6 +1968,7 @@ config X86_INTEL_SHADOW_STACK_USER
depends on AS_HAS_SHADOW_STACK
select ARCH_USES_HIGH_VMA_FLAGS
select X86_INTEL_CET
+ select ARCH_MAYBE_MKWRITE
select ARCH_HAS_SHADOW_STACK
help
Shadow Stacks provides protection against program stack
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7bd2c3a52297..aa4d396ff98d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -603,6 +603,24 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
}
#endif

+#ifdef CONFIG_ARCH_MAYBE_MKWRITE
+pte_t arch_maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_flags & VM_SHSTK))
+ pte = pte_mkwrite_shstk(pte);
+ return pte;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_flags & VM_SHSTK))
+ pmd = pmd_mkwrite_shstk(pmd);
+ return pmd;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_ARCH_MAYBE_MKWRITE */
+
/**
* reserve_top_address - reserves a hole in the top of kernel address space
* @reserve - size of hole to reserve
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 329b8c8ca703..2c3875724809 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1191,6 +1191,30 @@ static inline bool arch_has_pfn_modify_check(void)
}
#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */

+#ifdef CONFIG_MMU
+#ifdef CONFIG_ARCH_MAYBE_MKWRITE
+pte_t arch_maybe_mkwrite(pte_t pte, struct vm_area_struct *vma);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#else /* !CONFIG_ARCH_MAYBE_MKWRITE */
+static inline pte_t arch_maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ return pte;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ return pmd;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* CONFIG_ARCH_MAYBE_MKWRITE */
+#endif /* CONFIG_MMU */
+
/*
* Architecture PAGE_KERNEL_* fallbacks
*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 54bb4cd9fee8..f0669e3cdd37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -944,6 +944,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
pte = pte_mkwrite(pte);
+ else
+ pte = arch_maybe_mkwrite(pte, vma);
return pte;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6ecd1045113b..608746bb9d19 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -485,6 +485,8 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
pmd = pmd_mkwrite(pmd);
+ else
+ pmd = arch_maybe_pmd_mkwrite(pmd, vma);
return pmd;
}

--
2.21.0

2020-04-29 22:10:55

by Yu-cheng Yu

Subject: [PATCH v10 13/26] x86/mm: Shadow Stack page fault error checking

Shadow stack accesses are those that are performed by the CPU where it
expects to encounter a shadow stack mapping. These accesses are performed
implicitly by CALL/RET at the site of the shadow stack pointer, or
explicitly by shadow stack management instructions like WRUSSQ.

Shadow stack accesses to shadow stack mappings can fault during normal,
valid operation, just like regular accesses to regular mappings. Shadow
stacks need some of the same features, such as delayed allocation, swap,
and copy-on-write.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping.

In handling a shadow stack page fault, verify it occurs within a shadow
stack mapping. It is always an error otherwise. For valid shadow stack
accesses, set FAULT_FLAG_WRITE to effect copy-on-write. Because clearing
_PAGE_DIRTY_HW (vs. _PAGE_RW) is used to trigger the fault, shadow stack
read fault and shadow stack write fault are not differentiated and both are
handled as a write access.
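
As a rough illustration of the two checks described above, the following
user-space sketch models the decision logic only. X86_PF_WRITE and
X86_PF_SHSTK use the bit positions from traps.h; the VM_* and
FAULT_FLAG_WRITE values are hypothetical placeholders, and the function
names shstk_access_error(), fault_flags() and shstk_fault_selfcheck() are
invented for this example. It is not the kernel's fault handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Page-fault error-code bits (bit 1: write, bit 6: shadow stack). */
#define X86_PF_WRITE (1UL << 1)
#define X86_PF_SHSTK (1UL << 6)

/* Hypothetical flag values, chosen only for this sketch. */
#define VM_WRITE (1UL << 1)
#define VM_SHSTK (1UL << 2)
#define FAULT_FLAG_WRITE (1U << 0)

/* Returns true when the access has to be rejected. */
bool shstk_access_error(unsigned long error_code, unsigned long vm_flags)
{
	/* A shadow stack access is valid only inside a shadow stack VMA. */
	if (error_code & X86_PF_SHSTK)
		return !(vm_flags & VM_SHSTK);

	/* A normal write needs a writable VMA. */
	if (error_code & X86_PF_WRITE)
		return !(vm_flags & VM_WRITE);

	return false;
}

/* Shadow stack reads and writes are both treated as write faults. */
unsigned int fault_flags(unsigned long error_code)
{
	unsigned int flags = 0;

	if (error_code & (X86_PF_SHSTK | X86_PF_WRITE))
		flags |= FAULT_FLAG_WRITE;
	return flags;
}

/* Small self-check covering the cases named in the commit log. */
void shstk_fault_selfcheck(void)
{
	assert(!shstk_access_error(X86_PF_SHSTK, VM_SHSTK));
	assert(shstk_access_error(X86_PF_SHSTK, VM_WRITE));
	assert(fault_flags(X86_PF_SHSTK) == FAULT_FLAG_WRITE);
}
```

In this model, a normal data write to a VMA that has VM_SHSTK but not
VM_WRITE falls through to the second check and is rejected there.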

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v10:
-Revise commit log.

arch/x86/include/asm/traps.h | 2 ++
arch/x86/mm/fault.c | 19 +++++++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 9bf804709ee6..b4f4c725a350 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -168,6 +168,7 @@ enum {
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
+ * bit 6 == 1: shadow stack access fault
*/
enum x86_pf_error_code {
X86_PF_PROT = 1 << 0,
@@ -176,5 +177,6 @@ enum x86_pf_error_code {
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
+ X86_PF_SHSTK = 1 << 6,
};
#endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a51df516b87b..a4a3c8f016f0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1210,6 +1210,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;

+ /*
+ * Verify a shadow stack access is within a shadow stack VMA.
+ * It is always an error otherwise. Normal data access to a
+ * shadow stack area is checked in the case that follows.
+ */
+ if (error_code & X86_PF_SHSTK) {
+ if (!(vma->vm_flags & VM_SHSTK))
+ return 1;
+ return 0;
+ }
+
if (error_code & X86_PF_WRITE) {
/* write, present and write, not present: */
if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1367,6 +1378,14 @@ void do_user_addr_fault(struct pt_regs *regs,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

+ /*
+ * Clearing _PAGE_DIRTY_HW is used to detect shadow stack access.
+ * This method cannot distinguish shadow stack read vs. write.
+ * For valid shadow stack accesses, set FAULT_FLAG_WRITE to effect
+ * copy-on-write.
+ */
+ if (hw_error_code & X86_PF_SHSTK)
+ flags |= FAULT_FLAG_WRITE;
if (hw_error_code & X86_PF_WRITE)
flags |= FAULT_FLAG_WRITE;
if (hw_error_code & X86_PF_INSTR)
--
2.21.0

2020-04-29 22:11:06

by Yu-cheng Yu

Subject: [PATCH v10 25/26] x86/cet/shstk: Handle thread shadow stack

The kernel allocates (and frees on thread exit) a new shadow stack for a
pthread child.

It is possible for the kernel to complete the clone syscall with the
child's shadow stack pointer set to NULL and let the child thread allocate
a shadow stack for itself. This approach has two issues: it is not
compatible with existing code that makes inline syscalls, and it cannot
handle signals that arrive before the child has allocated a shadow
stack.

A 64-bit shadow stack has a size of min(RLIMIT_STACK, 4 GB). A compat-mode
thread shadow stack has a size of 1/4 min(RLIMIT_STACK, 4 GB). This allows
more threads to run in a 32-bit address space.
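
The sizing rule above can be sketched in user space as follows. This is
only an illustration of the arithmetic: thread_shstk_size() and
SHSTK_MAX_SIZE are names made up for this example, and the page size is
passed in as a parameter rather than taken from the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 4 GB cap from the commit log. */
#define SHSTK_MAX_SIZE (1ULL << 32)

/*
 * Size of a new thread's shadow stack:
 * min(RLIMIT_STACK, 4 GB), divided by four in compat mode,
 * then rounded up to a whole number of pages.
 */
uint64_t thread_shstk_size(uint64_t rlimit_stack, bool compat,
			   uint64_t page_size)
{
	uint64_t size;

	/* The round-up below relies on a power-of-two page size. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	size = rlimit_stack < SHSTK_MAX_SIZE ? rlimit_stack : SHSTK_MAX_SIZE;
	if (compat)
		size /= 4;

	return (size + page_size - 1) & ~(page_size - 1);
}
```

With an 8 MB stack limit and 4 KB pages, this gives 8 MB for a 64-bit
thread and 2 MB for a compat-mode thread. With an unlimited stack, the
results are 4 GB and 1 GB.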

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Limit shadow stack size to 4 GB.

arch/x86/include/asm/cet.h | 2 ++
arch/x86/include/asm/mmu_context.h | 3 +++
arch/x86/kernel/cet.c | 41 ++++++++++++++++++++++++++++++
arch/x86/kernel/process.c | 7 +++++
4 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 56fe08eebae6..71dc92acd2f2 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -18,11 +18,13 @@ struct cet_status {

#ifdef CONFIG_X86_INTEL_CET
int cet_setup_shstk(void);
+int cet_setup_thread_shstk(struct task_struct *p);
void cet_disable_free_shstk(struct task_struct *p);
int cet_verify_rstor_token(bool ia32, unsigned long ssp, unsigned long *new_ssp);
void cet_restore_signal(struct sc_ext *sc);
int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
#else
+static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
static inline void cet_disable_free_shstk(struct task_struct *p) {}
static inline void cet_restore_signal(struct sc_ext *sc) { return; }
static inline int cet_setup_signal(bool ia32, unsigned long rstor,
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 4e55370e48e8..bb7a4a2d6923 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -12,6 +12,7 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
+#include <asm/cet.h>
#include <asm/debugreg.h>

extern atomic64_t last_mm_ctx_id;
@@ -155,6 +156,8 @@ do { \
#else
#define deactivate_mm(tsk, mm) \
do { \
+ if (!tsk->vfork_done) \
+ cet_disable_free_shstk(tsk); \
load_gs_index(0); \
loadsegment(fs, 0); \
} while (0)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 274fecdd9669..121552047b86 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -169,6 +169,47 @@ int cet_setup_shstk(void)
return 0;
}

+int cet_setup_thread_shstk(struct task_struct *tsk)
+{
+ unsigned long addr, size;
+ struct cet_user_state *state;
+ struct cet_status *cet = &tsk->thread.cet;
+
+ if (!cet->shstk_size)
+ return 0;
+
+ state = get_xsave_addr(&tsk->thread.fpu.state.xsave,
+ XFEATURE_CET_USER);
+
+ if (!state)
+ return -EINVAL;
+
+ /* Cap shadow stack size to 4 GB */
+ size = min(rlimit(RLIMIT_STACK), 1UL << 32);
+
+ /*
+ * Compat-mode pthreads share a limited address space.
+ * If each function call takes an average of four slots of
+ * stack space, we need 1/4 of the stack size for shadow stack.
+ */
+ if (in_compat_syscall())
+ size /= 4;
+ size = round_up(size, PAGE_SIZE);
+ addr = alloc_shstk(size);
+
+ if (IS_ERR((void *)addr)) {
+ cet->shstk_base = 0;
+ cet->shstk_size = 0;
+ return PTR_ERR((void *)addr);
+ }
+
+ fpu__prepare_write(&tsk->thread.fpu);
+ state->user_ssp = (u64)(addr + size);
+ cet->shstk_base = addr;
+ cet->shstk_size = size;
+ return 0;
+}
+
void cet_disable_free_shstk(struct task_struct *tsk)
{
struct cet_status *cet = &tsk->thread.cet;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9d9cff2c1018..ef1c2b8086a2 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -109,6 +109,7 @@ void exit_thread(struct task_struct *tsk)

free_vm86(t);

+ cet_disable_free_shstk(tsk);
fpu__drop(fpu);
}

@@ -179,6 +180,12 @@ int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
if (clone_flags & CLONE_SETTLS)
ret = set_new_tls(p, tls);

+#ifdef CONFIG_X86_64
+ /* Allocate a new shadow stack for pthread */
+ if (!ret && (clone_flags & (CLONE_VFORK | CLONE_VM)) == CLONE_VM)
+ ret = cet_setup_thread_shstk(p);
+#endif
+
if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
io_bitmap_share(p);

--
2.21.0

2020-04-29 22:11:09

by Yu-cheng Yu

Subject: [PATCH v10 23/26] ELF: Introduce arch_setup_elf_property()

An ELF file's .note.gnu.property indicates architecture features of the
file. These features are extracted earlier and stored in the struct
'arch_elf_state'. Introduce arch_setup_elf_property() to set up and enable
these features. The first use-case of this function is shadow stack.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
fs/binfmt_elf.c | 4 ++++
include/linux/elf.h | 6 ++++++
2 files changed, 10 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 327a995ff743..d7a4c0a1245e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1216,6 +1216,10 @@ static int load_elf_binary(struct linux_binprm *bprm)

set_binfmt(&elf_format);

+ retval = arch_setup_elf_property(&arch_state);
+ if (retval < 0)
+ goto out;
+
#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
retval = arch_setup_additional_pages(bprm, !!interpreter);
if (retval < 0)
diff --git a/include/linux/elf.h b/include/linux/elf.h
index 7bdc6da160c7..81f2161fa4a8 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -78,9 +78,15 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
{
return 0;
}
+
+static inline int arch_setup_elf_property(struct arch_elf_state *arch)
+{
+ return 0;
+}
#else
extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
bool compat, struct arch_elf_state *arch);
+extern int arch_setup_elf_property(struct arch_elf_state *arch);
#endif

#endif /* _LINUX_ELF_H */
--
2.21.0

2020-04-29 22:11:33

by Yu-cheng Yu

Subject: [PATCH v10 22/26] ELF: Add ELF program property parsing support

From: Dave Martin <[email protected]>

ELF program properties will be needed for detecting whether to
enable optional architecture or ABI features for a new ELF process.

For now, there are no generic properties that we care about, so do
nothing unless CONFIG_ARCH_USE_GNU_PROPERTY=y.

Otherwise, check for the presence of properties using the PT_GNU_PROPERTY
phdrs entry (if any), and pass each property to the arch code.

For now, the added code is not used.
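
To make the property walk concrete, here is a user-space sketch of the
same idea: step through the descriptor area of an NT_GNU_PROPERTY_TYPE_0
note, check sizes and ordering, and hand each property to a callback. The
names walk_gnu_properties(), property_cb and collect_x86_feature_1() are
invented for this example; the callback merely stands in for
arch_parse_elf_property(). Offsets here are relative to the start of the
descriptor area, which is assumed to be suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct gnu_property {
	uint32_t pr_type;
	uint32_t pr_datasz;
};

/* x86 property type carrying the IBT/SHSTK feature bits. */
#define GNU_PROPERTY_X86_FEATURE_1_AND 0xc0000002u

/* Hypothetical stand-in for the arch hook. */
typedef int (*property_cb)(uint32_t type, const void *data,
			   size_t datasz, void *ctx);

/* Returns 0 on success, -1 if malformed, or the callback's error. */
int walk_gnu_properties(const unsigned char *desc, size_t descsz,
			size_t align, property_cb cb, void *ctx)
{
	size_t off = 0;
	int have_prev = 0;
	uint32_t prev_type = 0;

	/* 4 for ELF32, 8 for ELF64 */
	assert(align == 4 || align == 8);

	while (off < descsz) {
		struct gnu_property pr;
		size_t remaining = descsz - off;
		size_t step;
		int ret;

		if (remaining < sizeof(pr))
			return -1;
		memcpy(&pr, desc + off, sizeof(pr));
		off += sizeof(pr);
		remaining -= sizeof(pr);

		/* Payload, including alignment padding, must fit. */
		if (pr.pr_datasz > remaining)
			return -1;
		step = ((size_t)pr.pr_datasz + align - 1) & ~(align - 1);
		if (step > remaining)
			return -1;

		/* Properties must be unique and sorted on pr_type. */
		if (have_prev && pr.pr_type <= prev_type)
			return -1;
		prev_type = pr.pr_type;
		have_prev = 1;

		ret = cb(pr.pr_type, desc + off, pr.pr_datasz, ctx);
		if (ret)
			return ret;
		off += step;
	}
	return 0;
}

/* Example callback: copy the x86 feature word into *ctx. */
int collect_x86_feature_1(uint32_t type, const void *data,
			  size_t datasz, void *ctx)
{
	if (type == GNU_PROPERTY_X86_FEATURE_1_AND) {
		if (datasz != sizeof(uint32_t))
			return -1;
		memcpy(ctx, data, sizeof(uint32_t));
	}
	return 0;
}
```

Unlike the patch, this sketch takes the descriptor bytes from a buffer
supplied by the caller instead of reading them from the file, and it
collapses the kernel's distinct error codes into -1.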

Signed-off-by: Dave Martin <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
---
fs/binfmt_elf.c | 127 +++++++++++++++++++++++++++++++++++++++
fs/compat_binfmt_elf.c | 4 ++
include/linux/elf.h | 19 ++++++
include/uapi/linux/elf.h | 4 ++
4 files changed, 154 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 13f25e241ac4..327a995ff743 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -40,12 +40,18 @@
#include <linux/sched/coredump.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/sizes.h>
+#include <linux/types.h>
#include <linux/cred.h>
#include <linux/dax.h>
#include <linux/uaccess.h>
#include <asm/param.h>
#include <asm/page.h>

+#ifndef ELF_COMPAT
+#define ELF_COMPAT 0
+#endif
+
#ifndef user_long_t
#define user_long_t long
#endif
@@ -682,6 +688,111 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
* libraries. There is no binary dependent code anywhere else.
*/

+static int parse_elf_property(const char *data, size_t *off, size_t datasz,
+ struct arch_elf_state *arch,
+ bool have_prev_type, u32 *prev_type)
+{
+ size_t o, step;
+ const struct gnu_property *pr;
+ int ret;
+
+ if (*off == datasz)
+ return -ENOENT;
+
+ if (WARN_ON_ONCE(*off > datasz || *off % ELF_GNU_PROPERTY_ALIGN))
+ return -EIO;
+ o = *off;
+ datasz -= *off;
+
+ if (datasz < sizeof(*pr))
+ return -ENOEXEC;
+ pr = (const struct gnu_property *)(data + o);
+ o += sizeof(*pr);
+ datasz -= sizeof(*pr);
+
+ if (pr->pr_datasz > datasz)
+ return -ENOEXEC;
+
+ WARN_ON_ONCE(o % ELF_GNU_PROPERTY_ALIGN);
+ step = round_up(pr->pr_datasz, ELF_GNU_PROPERTY_ALIGN);
+ if (step > datasz)
+ return -ENOEXEC;
+
+ /* Properties are supposed to be unique and sorted on pr_type: */
+ if (have_prev_type && pr->pr_type <= *prev_type)
+ return -ENOEXEC;
+ *prev_type = pr->pr_type;
+
+ ret = arch_parse_elf_property(pr->pr_type, data + o,
+ pr->pr_datasz, ELF_COMPAT, arch);
+ if (ret)
+ return ret;
+
+ *off = o + step;
+ return 0;
+}
+
+#define NOTE_DATA_SZ SZ_1K
+#define GNU_PROPERTY_TYPE_0_NAME "GNU"
+#define NOTE_NAME_SZ (sizeof(GNU_PROPERTY_TYPE_0_NAME))
+
+static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
+ struct arch_elf_state *arch)
+{
+ union {
+ struct elf_note nhdr;
+ char data[NOTE_DATA_SZ];
+ } note;
+ loff_t pos;
+ ssize_t n;
+ size_t off, datasz;
+ int ret;
+ bool have_prev_type;
+ u32 prev_type;
+
+ if (!IS_ENABLED(CONFIG_ARCH_USE_GNU_PROPERTY) || !phdr)
+ return 0;
+
+ /* load_elf_binary() shouldn't call us unless this is true... */
+ if (WARN_ON_ONCE(phdr->p_type != PT_GNU_PROPERTY))
+ return -ENOEXEC;
+
+ /* If the properties are crazy large, that's too bad (for now): */
+ if (phdr->p_filesz > sizeof(note))
+ return -ENOEXEC;
+
+ pos = phdr->p_offset;
+ n = kernel_read(f, &note, phdr->p_filesz, &pos);
+
+ BUILD_BUG_ON(sizeof(note) < sizeof(note.nhdr) + NOTE_NAME_SZ);
+ if (n < 0 || n < sizeof(note.nhdr) + NOTE_NAME_SZ)
+ return -EIO;
+
+ if (note.nhdr.n_type != NT_GNU_PROPERTY_TYPE_0 ||
+ note.nhdr.n_namesz != NOTE_NAME_SZ ||
+ strncmp(note.data + sizeof(note.nhdr),
+ GNU_PROPERTY_TYPE_0_NAME, n - sizeof(note.nhdr)))
+ return -ENOEXEC;
+
+ off = round_up(sizeof(note.nhdr) + NOTE_NAME_SZ,
+ ELF_GNU_PROPERTY_ALIGN);
+ if (off > n)
+ return -ENOEXEC;
+
+ if (note.nhdr.n_descsz > n - off)
+ return -ENOEXEC;
+ datasz = off + note.nhdr.n_descsz;
+
+ have_prev_type = false;
+ do {
+ ret = parse_elf_property(note.data, &off, datasz, arch,
+ have_prev_type, &prev_type);
+ have_prev_type = true;
+ } while (!ret);
+
+ return ret == -ENOENT ? 0 : ret;
+}
+
static int load_elf_binary(struct linux_binprm *bprm)
{
struct file *interpreter = NULL; /* to shut gcc up */
@@ -689,6 +800,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
int load_addr_set = 0;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
+ struct elf_phdr *elf_property_phdata = NULL;
unsigned long elf_bss, elf_brk;
int bss_prot = 0;
int retval, i;
@@ -726,6 +838,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
char *elf_interpreter;

+ if (elf_ppnt->p_type == PT_GNU_PROPERTY) {
+ elf_property_phdata = elf_ppnt;
+ continue;
+ }
+
if (elf_ppnt->p_type != PT_INTERP)
continue;

@@ -819,9 +936,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
goto out_free_dentry;

/* Pass PT_LOPROC..PT_HIPROC headers to arch code */
+ elf_property_phdata = NULL;
elf_ppnt = interp_elf_phdata;
for (i = 0; i < interp_elf_ex->e_phnum; i++, elf_ppnt++)
switch (elf_ppnt->p_type) {
+ case PT_GNU_PROPERTY:
+ elf_property_phdata = elf_ppnt;
+ break;
+
case PT_LOPROC ... PT_HIPROC:
retval = arch_elf_pt_proc(interp_elf_ex,
elf_ppnt, interpreter,
@@ -832,6 +954,11 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
}

+ retval = parse_elf_properties(interpreter ?: bprm->file,
+ elf_property_phdata, &arch_state);
+ if (retval)
+ goto out_free_dentry;
+
/*
* Allow arch code to reject the ELF at this point, whilst it's
* still possible to return an error to the code that invoked
diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
index aaad4ca1217e..13a087bc816b 100644
--- a/fs/compat_binfmt_elf.c
+++ b/fs/compat_binfmt_elf.c
@@ -17,6 +17,8 @@
#include <linux/elfcore-compat.h>
#include <linux/time.h>

+#define ELF_COMPAT 1
+
/*
* Rename the basic ELF layout types to refer to the 32-bit class of files.
*/
@@ -28,11 +30,13 @@
#undef elf_shdr
#undef elf_note
#undef elf_addr_t
+#undef ELF_GNU_PROPERTY_ALIGN
#define elfhdr elf32_hdr
#define elf_phdr elf32_phdr
#define elf_shdr elf32_shdr
#define elf_note elf32_note
#define elf_addr_t Elf32_Addr
+#define ELF_GNU_PROPERTY_ALIGN ELF32_GNU_PROPERTY_ALIGN

/*
* Some data types as stored in coredump.
diff --git a/include/linux/elf.h b/include/linux/elf.h
index 459cddcceaac..7bdc6da160c7 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -22,6 +22,9 @@
SET_PERSONALITY(ex)
#endif

+#define ELF32_GNU_PROPERTY_ALIGN 4
+#define ELF64_GNU_PROPERTY_ALIGN 8
+
#if ELF_CLASS == ELFCLASS32

extern Elf32_Dyn _DYNAMIC [];
@@ -32,6 +35,7 @@ extern Elf32_Dyn _DYNAMIC [];
#define elf_addr_t Elf32_Off
#define Elf_Half Elf32_Half
#define Elf_Word Elf32_Word
+#define ELF_GNU_PROPERTY_ALIGN ELF32_GNU_PROPERTY_ALIGN

#else

@@ -43,6 +47,7 @@ extern Elf64_Dyn _DYNAMIC [];
#define elf_addr_t Elf64_Off
#define Elf_Half Elf64_Half
#define Elf_Word Elf64_Word
+#define ELF_GNU_PROPERTY_ALIGN ELF64_GNU_PROPERTY_ALIGN

#endif

@@ -64,4 +69,18 @@ struct gnu_property {
u32 pr_datasz;
};

+struct arch_elf_state;
+
+#ifndef CONFIG_ARCH_USE_GNU_PROPERTY
+static inline int arch_parse_elf_property(u32 type, const void *data,
+ size_t datasz, bool compat,
+ struct arch_elf_state *arch)
+{
+ return 0;
+}
+#else
+extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+ bool compat, struct arch_elf_state *arch);
+#endif
+
#endif /* _LINUX_ELF_H */
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 61251ecabdd7..518651708d8f 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -368,6 +368,7 @@ typedef struct elf64_shdr {
* Notes used in ET_CORE. Architectures export some of the arch register sets
* using the corresponding note types via the PTRACE_GETREGSET and
* PTRACE_SETREGSET requests.
+ * The note name for all these is "LINUX".
*/
#define NT_PRSTATUS 1
#define NT_PRFPREG 2
@@ -430,6 +431,9 @@ typedef struct elf64_shdr {
#define NT_MIPS_FP_MODE 0x801 /* MIPS floating-point mode */
#define NT_MIPS_MSA 0x802 /* MIPS SIMD registers */

+/* Note types with note name "GNU" */
+#define NT_GNU_PROPERTY_TYPE_0 5
+
/* Note header in a PT_NOTE section */
typedef struct elf32_note {
Elf32_Word n_namesz; /* Name size */
--
2.21.0

2020-04-29 22:11:35

by Yu-cheng Yu

Subject: [PATCH v10 19/26] x86/cet/shstk: User-mode shadow stack support

This patch adds basic shadow stack enabling/disabling routines. A task's
shadow stack is allocated from memory with the VM_SHSTK flag and has a
fixed size of min(RLIMIT_STACK, 4 GB).

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Change no_cet_shstk to no_user_shstk.
- Limit shadow stack size to 4 GB, and round_up to PAGE_SIZE.
- Replace checking shstk_enabled with shstk_size being zero.
- WARN_ON_ONCE() when vm_munmap() fails.

v9:
- Change cpu_feature_enabled() to static_cpu_has().
- Merge cet_disable_shstk to cet_disable_free_shstk.
- Remove the empty slot at the top of the shadow stack, as it is not
needed.
- Move do_mmap_locked() to alloc_shstk(), which is a static function.

v6:
- Create a function do_mmap_locked() for shadow stack allocation.

v2:
- Change noshstk to no_cet_shstk.

arch/x86/include/asm/cet.h | 26 ++++
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/processor.h | 5 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cet.c | 135 ++++++++++++++++++
arch/x86/kernel/cpu/common.c | 28 ++++
arch/x86/kernel/process.c | 1 +
.../arch/x86/include/asm/disabled-features.h | 8 +-
8 files changed, 211 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..caac0687c8e4
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+/*
+ * Per-thread CET status
+ */
+struct cet_status {
+ unsigned long shstk_base;
+ unsigned long shstk_size;
+};
+
+#ifdef CONFIG_X86_INTEL_CET
+int cet_setup_shstk(void);
+void cet_disable_free_shstk(struct task_struct *p);
+#else
+static inline void cet_disable_free_shstk(struct task_struct *p) {}
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 4ea8584682f9..a0e1b24cfa02 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@
# define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
#endif

+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define DISABLE_SHSTK 0
+#else
+#define DISABLE_SHSTK (1<<(X86_FEATURE_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -75,7 +81,7 @@
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
-#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb9536f803f9..0ccf1c7ab173 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
#include <asm/unwind_hints.h>
#include <asm/vmxfeatures.h>
#include <asm/vdso/processor.h>
+#include <asm/cet.h>

#include <linux/personality.h>
#include <linux/cache.h>
@@ -543,6 +544,10 @@ struct thread_struct {

unsigned int sig_on_uaccess_err:1;

+#ifdef CONFIG_X86_INTEL_CET
+ struct cet_status cet;
+#endif
+
/* Floating point and extended processor state */
struct fpu fpu;
/*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index ba89cabe5fcf..e9cc2551573b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -144,6 +144,8 @@ obj-$(CONFIG_UNWINDER_ORC) += unwind_orc.o
obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o

+obj-$(CONFIG_X86_INTEL_CET) += cet.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..d8196c8e792a
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * cet.c - Control-flow Enforcement (CET)
+ *
+ * Copyright (c) 2019, Intel Corporation.
+ * Yu-cheng Yu <[email protected]>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <asm/msr.h>
+#include <asm/user.h>
+#include <asm/fpu/internal.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+
+static void start_update_msrs(void)
+{
+ fpregs_lock();
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ __fpregs_load_activate();
+}
+
+static void end_update_msrs(void)
+{
+ fpregs_unlock();
+}
+
+static unsigned long cet_get_shstk_addr(void)
+{
+ struct fpu *fpu = &current->thread.fpu;
+ unsigned long ssp = 0;
+
+ fpregs_lock();
+
+ if (fpregs_state_valid(fpu, smp_processor_id())) {
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+ } else {
+ struct cet_user_state *p;
+
+ p = get_xsave_addr(&fpu->state.xsave, XFEATURE_CET_USER);
+ if (p)
+ ssp = p->user_ssp;
+ }
+
+ fpregs_unlock();
+ return ssp;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long addr, populate;
+
+ down_write(&mm->mmap_sem);
+ addr = do_mmap(NULL, 0, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
+ VM_SHSTK, 0, &populate, NULL);
+ up_write(&mm->mmap_sem);
+
+ if (populate)
+ mm_populate(addr, populate);
+
+ return addr;
+}
+
+int cet_setup_shstk(void)
+{
+ unsigned long addr, size;
+ struct cet_status *cet = &current->thread.cet;
+
+ if (!static_cpu_has(X86_FEATURE_SHSTK))
+ return -EOPNOTSUPP;
+
+ size = round_up(min(rlimit(RLIMIT_STACK), 1UL << 32), PAGE_SIZE);
+ addr = alloc_shstk(size);
+
+ if (IS_ERR((void *)addr))
+ return PTR_ERR((void *)addr);
+
+ cet->shstk_base = addr;
+ cet->shstk_size = size;
+
+ start_update_msrs();
+ wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+ wrmsrl(MSR_IA32_U_CET, MSR_IA32_CET_SHSTK_EN);
+ end_update_msrs();
+ return 0;
+}
+
+void cet_disable_free_shstk(struct task_struct *tsk)
+{
+ struct cet_status *cet = &tsk->thread.cet;
+
+ if (!static_cpu_has(X86_FEATURE_SHSTK) ||
+ !cet->shstk_size || !cet->shstk_base)
+ return;
+
+ if (!tsk->mm || (tsk->mm != current->mm))
+ return;
+
+ if (tsk == current) {
+ u64 msr_val;
+
+ start_update_msrs();
+ rdmsrl(MSR_IA32_U_CET, msr_val);
+ wrmsrl(MSR_IA32_U_CET, msr_val & ~MSR_IA32_CET_SHSTK_EN);
+ wrmsrl(MSR_IA32_PL3_SSP, 0);
+ end_update_msrs();
+ }
+
+ while (1) {
+ int r;
+
+ r = vm_munmap(cet->shstk_base, cet->shstk_size);
+
+ /*
+ * Retry if mmap_sem is not available.
+ */
+ if (r == -EINTR) {
+ cond_resched();
+ continue;
+ }
+
+ WARN_ON_ONCE(r);
+ break;
+ }
+ cet->shstk_base = 0;
+ cet->shstk_size = 0;
+}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index bed0cb83fe24..1563b472e0f9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -55,6 +55,7 @@
#include <asm/microcode_intel.h>
#include <asm/intel-family.h>
#include <asm/cpu_device_id.h>
+#include <asm/cet.h>
#include <asm/uv/uv.h>

#include "cpu.h"
@@ -469,6 +470,32 @@ static __init int setup_disable_pku(char *arg)
__setup("nopku", setup_disable_pku);
#endif /* CONFIG_X86_64 */

+static __always_inline void setup_cet(struct cpuinfo_x86 *c)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+ !cpu_feature_enabled(X86_FEATURE_IBT))
+ return;
+
+ cr4_set_bits(X86_CR4_CET);
+}
+
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+static __init int setup_disable_shstk(char *s)
+{
+ /* require an exact match without trailing characters */
+ if (s[0] != '\0')
+ return 0;
+
+ if (!boot_cpu_has(X86_FEATURE_SHSTK))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+ pr_info("x86: 'no_user_shstk' specified, disabling user Shadow Stack\n");
+ return 1;
+}
+__setup("no_user_shstk", setup_disable_shstk);
+#endif
+
/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
@@ -1505,6 +1532,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
x86_init_rdrand(c);
x86_init_cache_qos(c);
setup_pku(c);
+ setup_cet(c);

/*
* Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index de182b84723a..9d9cff2c1018 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -42,6 +42,7 @@
#include <asm/spec-ctrl.h>
#include <asm/io_bitmap.h>
#include <asm/proto.h>
+#include <asm/cet.h>

#include "process.h"

diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
index 4ea8584682f9..a0e1b24cfa02 100644
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -56,6 +56,12 @@
# define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
#endif

+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define DISABLE_SHSTK 0
+#else
+#define DISABLE_SHSTK (1<<(X86_FEATURE_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -75,7 +81,7 @@
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
-#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP|DISABLE_SHSTK)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
--
2.21.0

2020-04-29 22:11:40

by Yu-cheng Yu

Subject: [PATCH v10 26/26] x86/cet/shstk: Add arch_prctl functions for shadow stack

arch_prctl(ARCH_X86_CET_STATUS, u64 *args)
Get CET feature status.

The parameter 'args' is a pointer to a user buffer. The kernel returns
the following information:

*args = shadow stack/IBT status
*(args + 1) = shadow stack base address
*(args + 2) = shadow stack size

arch_prctl(ARCH_X86_CET_DISABLE, u64 features)
Disable CET features specified in 'features'. Return -EPERM if CET is
locked.

arch_prctl(ARCH_X86_CET_LOCK)
Lock in CET features.

arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, u64 *args)
Allocate a new shadow stack.

The parameter 'args' is a pointer to a user buffer containing the
desired size to allocate. The kernel returns the allocated shadow
stack address in *args.
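
As an illustration of the ARCH_X86_CET_STATUS buffer layout described
above, the sketch below decodes the three u64 words a caller would get
back. It does not issue the syscall. The struct cet_status_view and the
function decode_cet_status() are hypothetical names for this example; the
feature bit values are those of the x86 GNU property ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bits from the x86 GNU property ABI. */
#define GNU_PROPERTY_X86_FEATURE_1_IBT   (1U << 0)
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (1U << 1)

/* Hypothetical decoded view of the three status words. */
struct cet_status_view {
	bool ibt_enabled;
	bool shstk_enabled;
	uint64_t shstk_base;
	uint64_t shstk_size;
};

/*
 * args[0] = shadow stack/IBT status bits
 * args[1] = shadow stack base address
 * args[2] = shadow stack size
 */
struct cet_status_view decode_cet_status(const uint64_t args[3])
{
	struct cet_status_view v;

	assert(args != NULL);

	v.ibt_enabled = (args[0] & GNU_PROPERTY_X86_FEATURE_1_IBT) != 0;
	v.shstk_enabled = (args[0] & GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
	v.shstk_base = args[1];
	v.shstk_size = args[2];
	return v;
}
```

A caller would pass a three-element u64 array to
arch_prctl(ARCH_X86_CET_STATUS, ...) and hand the filled array to a
helper like this. All three words are zero when shadow stack is not
enabled for the task.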

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Verify CET is enabled before handling arch_prctl.
- Change input parameters from unsigned long to u64, to make it clear they
are 64-bit.

arch/x86/include/asm/cet.h | 4 ++
arch/x86/include/uapi/asm/prctl.h | 5 ++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/cet.c | 29 +++++++++++
arch/x86/kernel/cet_prctl.c | 87 +++++++++++++++++++++++++++++++
arch/x86/kernel/process.c | 4 +-
6 files changed, 128 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/kernel/cet_prctl.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 71dc92acd2f2..99e6e741d28c 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -14,16 +14,20 @@ struct sc_ext;
struct cet_status {
unsigned long shstk_base;
unsigned long shstk_size;
+ unsigned int locked:1;
};

#ifdef CONFIG_X86_INTEL_CET
+int prctl_cet(int option, u64 arg2);
int cet_setup_shstk(void);
int cet_setup_thread_shstk(struct task_struct *p);
+int cet_alloc_shstk(unsigned long *arg);
void cet_disable_free_shstk(struct task_struct *p);
int cet_verify_rstor_token(bool ia32, unsigned long ssp, unsigned long *new_ssp);
void cet_restore_signal(struct sc_ext *sc);
int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
#else
+static inline int prctl_cet(int option, u64 arg2) { return -EINVAL; }
static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
static inline void cet_disable_free_shstk(struct task_struct *p) {}
static inline void cet_restore_signal(struct sc_ext *sc) { return; }
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..d962f0ec9ccf 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -14,4 +14,9 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003

+#define ARCH_X86_CET_STATUS 0x3001
+#define ARCH_X86_CET_DISABLE 0x3002
+#define ARCH_X86_CET_LOCK 0x3003
+#define ARCH_X86_CET_ALLOC_SHSTK 0x3004
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index e9cc2551573b..0b621e2afbdc 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -144,7 +144,7 @@ obj-$(CONFIG_UNWINDER_ORC) += unwind_orc.o
obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o

-obj-$(CONFIG_X86_INTEL_CET) += cet.o
+obj-$(CONFIG_X86_INTEL_CET) += cet.o cet_prctl.o

###
# 64 bit specific files
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 121552047b86..c1b9b540c03e 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -145,6 +145,35 @@ static int create_rstor_token(bool ia32, unsigned long ssp,
return 0;
}

+int cet_alloc_shstk(unsigned long *arg)
+{
+ unsigned long len = *arg;
+ unsigned long addr;
+ unsigned long token;
+ unsigned long ssp;
+
+ addr = alloc_shstk(round_up(len, PAGE_SIZE));
+
+ if (IS_ERR((void *)addr))
+ return PTR_ERR((void *)addr);
+
+ /* Restore token is 8 bytes and aligned to 8 bytes */
+ ssp = addr + len;
+ token = ssp;
+
+ if (!in_ia32_syscall())
+ token |= TOKEN_MODE_64;
+ ssp -= 8;
+
+ if (write_user_shstk_64(ssp, token)) {
+ vm_munmap(addr, len);
+ return -EINVAL;
+ }
+
+ *arg = addr;
+ return 0;
+}
+
int cet_setup_shstk(void)
{
unsigned long addr, size;
diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
new file mode 100644
index 000000000000..0139c48f2215
--- /dev/null
+++ b/arch/x86/kernel/cet_prctl.c
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/errno.h>
+#include <linux/uaccess.h>
+#include <linux/prctl.h>
+#include <linux/compat.h>
+#include <linux/mman.h>
+#include <linux/elfcore.h>
+#include <asm/processor.h>
+#include <asm/prctl.h>
+#include <asm/cet.h>
+
+/* See Documentation/x86/intel_cet.rst. */
+
+static int handle_get_status(u64 arg2)
+{
+ struct cet_status *cet = &current->thread.cet;
+ u64 buf[3] = {0, 0, 0};
+
+ if (cet->shstk_size) {
+ buf[0] |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
+ buf[1] = (u64)cet->shstk_base;
+ buf[2] = (u64)cet->shstk_size;
+ }
+
+ return copy_to_user((u64 __user *)arg2, buf, sizeof(buf));
+}
+
+static int handle_alloc_shstk(u64 arg2)
+{
+ int err = 0;
+ unsigned long arg;
+ unsigned long addr = 0;
+ unsigned long size = 0;
+
+ if (get_user(arg, (unsigned long __user *)arg2))
+ return -EFAULT;
+
+ size = arg;
+ err = cet_alloc_shstk(&arg);
+ if (err)
+ return err;
+
+ addr = arg;
+ if (put_user((u64)addr, (u64 __user *)arg2)) {
+ vm_munmap(addr, size);
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+int prctl_cet(int option, u64 arg2)
+{
+ struct cet_status *cet;
+
+ if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
+ return -EINVAL;
+
+ if (option == ARCH_X86_CET_STATUS)
+ return handle_get_status(arg2);
+
+ if (!static_cpu_has(X86_FEATURE_SHSTK))
+ return -EINVAL;
+
+ cet = &current->thread.cet;
+
+ switch (option) {
+ case ARCH_X86_CET_DISABLE:
+ if (cet->locked)
+ return -EPERM;
+ if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+ cet_disable_free_shstk(current);
+
+ return 0;
+
+ case ARCH_X86_CET_LOCK:
+ cet->locked = 1;
+ return 0;
+
+ case ARCH_X86_CET_ALLOC_SHSTK:
+ return handle_alloc_shstk(arg2);
+
+ default:
+ return -EINVAL;
+ }
+}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ef1c2b8086a2..de6773dd6a16 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -996,7 +996,7 @@ long do_arch_prctl_common(struct task_struct *task, int option,
return get_cpuid_mode();
case ARCH_SET_CPUID:
return set_cpuid_mode(task, cpuid_enabled);
+ default:
+ return prctl_cet(option, cpuid_enabled);
}
-
- return -EINVAL;
}
--
2.21.0

2020-04-29 22:11:47

by Yu-cheng Yu

Subject: [PATCH v10 24/26] x86/cet/shstk: ELF header parsing for shadow stack

Check an ELF file's .note.gnu.property, and set up a shadow stack if the
application supports it.
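
The check described above can be sketched in user-space C. This is a simplified model of arch_parse_elf_property() and the decision made in arch_setup_elf_property(), not the kernel code itself: struct elf_state_sketch and both function names are hypothetical, and the CPU capability is passed in as a plain flag instead of being read with static_cpu_has().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from the x86 GNU property ABI. */
#define GNU_PROPERTY_X86_FEATURE_1_AND   0xc0000002u
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (1u << 1)

/* Hypothetical stand-in for struct arch_elf_state. */
struct elf_state_sketch {
	uint32_t gnu_property;
};

/*
 * Record the FEATURE_1_AND bits if this property carries them.
 * Other property types are ignored; a wrong size rejects the binary.
 */
int parse_property_sketch(uint32_t type, const void *data, size_t datasz,
			  struct elf_state_sketch *state)
{
	assert(state != NULL);

	if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
		return 0;

	if (datasz != sizeof(uint32_t))
		return -ENOEXEC;

	memcpy(&state->gnu_property, data, sizeof(uint32_t));
	return 0;
}

/* Shadow stack is set up only if both the CPU and the binary support it. */
int wants_shstk_sketch(const struct elf_state_sketch *state, int cpu_has_shstk)
{
	return cpu_has_shstk &&
	       (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
}
```

A binary that does not carry the SHSTK bit, or a CPU without the feature, simply runs without a shadow stack; only a malformed property fails the exec.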

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v9:
- Change cpu_feature_enabled() to static_cpu_has().

arch/x86/Kconfig | 2 ++
arch/x86/include/asm/elf.h | 13 +++++++++++++
arch/x86/kernel/process_64.c | 29 +++++++++++++++++++++++++++++
3 files changed, 44 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ac07e1f6a2bc..8b7b97ff5fb4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1970,6 +1970,8 @@ config X86_INTEL_SHADOW_STACK_USER
select X86_INTEL_CET
select ARCH_MAYBE_MKWRITE
select ARCH_HAS_SHADOW_STACK
+ select ARCH_USE_GNU_PROPERTY
+ select ARCH_BINFMT_ELF_STATE
help
Shadow Stacks provides protection against program stack
corruption. It's a hardware feature. This only matters
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 69c0f892e310..fac79b621e0a 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -367,6 +367,19 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,
int uses_interp);
#define compat_arch_setup_additional_pages compat_arch_setup_additional_pages

+#ifdef CONFIG_ARCH_BINFMT_ELF_STATE
+struct arch_elf_state {
+ unsigned int gnu_property;
+};
+
+#define INIT_ARCH_ELF_STATE { \
+ .gnu_property = 0, \
+}
+
+#define arch_elf_pt_proc(ehdr, phdr, elf, interp, state) (0)
+#define arch_check_elf(ehdr, interp, interp_ehdr, state) (0)
+#endif
+
/* Do not change the values. See get_align_mask() */
enum align_flags {
ALIGN_VA_32 = BIT(0),
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 5ef9d8f25b0e..93ba4afd0c19 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -730,3 +730,32 @@ unsigned long KSTK_ESP(struct task_struct *task)
{
return task_pt_regs(task)->sp;
}
+
+#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
+int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+ bool compat, struct arch_elf_state *state)
+{
+ if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
+ return 0;
+
+ if (datasz != sizeof(unsigned int))
+ return -ENOEXEC;
+
+ state->gnu_property = *(unsigned int *)data;
+ return 0;
+}
+
+int arch_setup_elf_property(struct arch_elf_state *state)
+{
+ int r = 0;
+
+ memset(&current->thread.cet, 0, sizeof(struct cet_status));
+
+ if (static_cpu_has(X86_FEATURE_SHSTK)) {
+ if (state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+ r = cet_setup_shstk();
+ }
+
+ return r;
+}
+#endif
--
2.21.0

2020-04-29 22:11:59

by Yu-cheng Yu

Subject: [PATCH v10 20/26] x86/cet/shstk: Handle signals for shadow stack

To deliver a signal, create a shadow stack restore token, then put the
token and the signal restorer address on the shadow stack. For sigreturn,
verify the token and restore the shadow stack pointer.

Introduce WRUSS, a kernel-mode instruction that writes directly to the
user shadow stack. It is used to construct the user signal stack as
described above.

Introduce a signal context extension struct 'sc_ext', which is used to save
the shadow stack restore token address and the WAIT_ENDBR status. WAIT_ENDBR
will be introduced later in the Indirect Branch Tracking (IBT) series, but
add it to sc_ext now to keep the struct stable in case the IBT series is
applied later.
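
Where sc_ext lands on the signal frame can be sketched as an offset calculation. This mirrors the pointer arithmetic in save_cet_to_sigframe() and get_cet_from_sigframe() below; the function name is hypothetical and the sizes are passed in as parameters, because the real values (fpu_user_xstate_size, FP_XSTATE_MAGIC2_SIZE, sizeof(struct fregs_state)) depend on the kernel and the CPU.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Offset of sc_ext from the start of the fpstate area:
 *   [legacy fsave header, ia32 only] + xstate + magic2, rounded up to 8.
 */
size_t sc_ext_offset(int ia32, size_t legacy_size, size_t xstate_size,
		     size_t magic2_size)
{
	size_t off = ia32 ? legacy_size : 0;

	assert(xstate_size > 0);

	off += xstate_size + magic2_size;
	return (off + 7) & ~(size_t)7;	/* align up to 8 bytes */
}
```

Because of the round-up, the frame allocation in fpu__alloc_sigcontext_ext() reserves sizeof(struct sc_ext) + 8 bytes rather than just sizeof(struct sc_ext).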

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Combine with WRUSS instruction patch, since it is used only here.
- Revise signal restore code to the latest supervisor states handling.
Move shadow stack restore token checking out of the fast path.

v9:
- Update CET MSR access according to XSAVES supervisor state changes.
- Add 'wait_endbr' to struct 'sc_ext'.
- Update and simplify signal frame allocation, setup, and restoration.
- Update commit log text.

v2:
- Move CET status from sigcontext to a separate struct sc_ext, which is
located above the fpstate on the signal frame.
- Add a restore token for sigreturn address.

arch/x86/ia32/ia32_signal.c | 17 +++
arch/x86/include/asm/cet.h | 8 ++
arch/x86/include/asm/fpu/internal.h | 10 ++
arch/x86/include/asm/special_insns.h | 32 ++++++
arch/x86/include/uapi/asm/sigcontext.h | 9 ++
arch/x86/kernel/cet.c | 151 +++++++++++++++++++++++++
arch/x86/kernel/fpu/signal.c | 101 +++++++++++++++++
arch/x86/kernel/signal.c | 10 ++
8 files changed, 338 insertions(+)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index f9d8804144d0..cb19159817cb 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -35,6 +35,7 @@
#include <asm/sigframe.h>
#include <asm/sighandling.h>
#include <asm/smap.h>
+#include <asm/cet.h>

static inline void reload_segments(struct sigcontext_32 *sc)
{
@@ -205,6 +206,7 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
void __user **fpstate)
{
unsigned long sp, fx_aligned, math_size;
+ void __user *restorer = NULL;

/* Default to using normal stack */
sp = regs->sp;
@@ -218,8 +220,23 @@ static void __user *get_sigframe(struct ksignal *ksig, struct pt_regs *regs,
ksig->ka.sa.sa_restorer)
sp = (unsigned long) ksig->ka.sa.sa_restorer;

+ if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+ restorer = ksig->ka.sa.sa_restorer;
+ } else if (current->mm->context.vdso) {
+ if (ksig->ka.sa.sa_flags & SA_SIGINFO)
+ restorer = current->mm->context.vdso +
+ vdso_image_32.sym___kernel_rt_sigreturn;
+ else
+ restorer = current->mm->context.vdso +
+ vdso_image_32.sym___kernel_sigreturn;
+ }
+
sp = fpu__alloc_mathframe(sp, 1, &fx_aligned, &math_size);
*fpstate = (struct _fpstate_32 __user *) sp;
+
+ if (save_cet_to_sigframe(1, *fpstate, (unsigned long)restorer))
+ return (void __user *) -1L;
+
if (copy_fpstate_to_sigframe(*fpstate, (void __user *)fx_aligned,
math_size) < 0)
return (void __user *) -1L;
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index caac0687c8e4..56fe08eebae6 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -6,6 +6,8 @@
#include <linux/types.h>

struct task_struct;
+struct sc_ext;
+
/*
* Per-thread CET status
*/
@@ -17,8 +19,14 @@ struct cet_status {
#ifdef CONFIG_X86_INTEL_CET
int cet_setup_shstk(void);
void cet_disable_free_shstk(struct task_struct *p);
+int cet_verify_rstor_token(bool ia32, unsigned long ssp, unsigned long *new_ssp);
+void cet_restore_signal(struct sc_ext *sc);
+int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
#else
static inline void cet_disable_free_shstk(struct task_struct *p) {}
+static inline void cet_restore_signal(struct sc_ext *sc) { return; }
+static inline int cet_setup_signal(bool ia32, unsigned long rstor,
+ struct sc_ext *sc) { return -EINVAL; }
#endif

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 42159f45bf9c..b569ac929ccc 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -476,6 +476,16 @@ static inline void copy_kernel_to_fpregs(union fpregs_state *fpstate)
__copy_kernel_to_fpregs(fpstate, -1);
}

+#ifdef CONFIG_X86_INTEL_CET
+extern int save_cet_to_sigframe(int ia32, void __user *fp,
+ unsigned long restorer);
+#else
+static inline int save_cet_to_sigframe(int ia32, void __user *fp,
+ unsigned long restorer)
+{
+ return 0;
+}
+#endif
extern int copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int size);

/*
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 6d37b8fcfc77..1b9b2e79c353 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -222,6 +222,38 @@ static inline void clwb(volatile void *__p)
: [pax] "a" (p));
}

+#ifdef CONFIG_X86_INTEL_CET
+#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+ asm_volatile_goto("1: wrussd %1, (%0)\n"
+ _ASM_EXTABLE(1b, %l[fail])
+ :: "r" (addr), "r" (val)
+ :: fail);
+ return 0;
+fail:
+ return -EPERM;
+}
+#else
+static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
+{
+ WARN_ONCE(1, "%s used but not supported.\n", __func__);
+ return -EFAULT;
+}
+#endif
+
+static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
+{
+ asm_volatile_goto("1: wrussq %1, (%0)\n"
+ _ASM_EXTABLE(1b, %l[fail])
+ :: "r" (addr), "r" (val)
+ :: fail);
+ return 0;
+fail:
+ return -EPERM;
+}
+#endif /* CONFIG_X86_INTEL_CET */
+
#define nop() asm volatile ("nop")


diff --git a/arch/x86/include/uapi/asm/sigcontext.h b/arch/x86/include/uapi/asm/sigcontext.h
index 844d60eb1882..cf2d55db3be4 100644
--- a/arch/x86/include/uapi/asm/sigcontext.h
+++ b/arch/x86/include/uapi/asm/sigcontext.h
@@ -196,6 +196,15 @@ struct _xstate {
/* New processor state extensions go here: */
};

+/*
+ * Located at the end of sigcontext->fpstate, aligned to 8.
+ */
+struct sc_ext {
+ unsigned long total_size;
+ unsigned long ssp;
+ unsigned long wait_endbr;
+};
+
/*
* The 32-bit signal frame:
*/
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index d8196c8e792a..274fecdd9669 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -19,6 +19,8 @@
#include <asm/fpu/xstate.h>
#include <asm/fpu/types.h>
#include <asm/cet.h>
+#include <asm/special_insns.h>
+#include <uapi/asm/sigcontext.h>

static void start_update_msrs(void)
{
@@ -69,6 +71,80 @@ static unsigned long alloc_shstk(unsigned long size)
return addr;
}

+#define TOKEN_MODE_MASK 3UL
+#define TOKEN_MODE_64 1UL
+#define IS_TOKEN_64(token) ((token & TOKEN_MODE_MASK) == TOKEN_MODE_64)
+#define IS_TOKEN_32(token) ((token & TOKEN_MODE_MASK) == 0)
+
+/*
+ * Verify the restore token at the address of 'ssp' is
+ * valid and then set shadow stack pointer according to the
+ * token.
+ */
+int cet_verify_rstor_token(bool ia32, unsigned long ssp,
+ unsigned long *new_ssp)
+{
+ unsigned long token;
+
+ *new_ssp = 0;
+
+ if (!IS_ALIGNED(ssp, 8))
+ return -EINVAL;
+
+ if (get_user(token, (unsigned long __user *)ssp))
+ return -EFAULT;
+
+ /* Is 64-bit mode flag correct? */
+ if (!ia32 && !IS_TOKEN_64(token))
+ return -EINVAL;
+ else if (ia32 && !IS_TOKEN_32(token))
+ return -EINVAL;
+
+ token &= ~TOKEN_MODE_MASK;
+
+ /*
+ * Restore address properly aligned?
+ */
+ if ((!ia32 && !IS_ALIGNED(token, 8)) || !IS_ALIGNED(token, 4))
+ return -EINVAL;
+
+ /*
+ * Token was placed properly?
+ */
+ if ((ALIGN_DOWN(token, 8) - 8) != ssp)
+ return -EINVAL;
+
+ *new_ssp = token;
+ return 0;
+}
+
+/*
+ * Create a restore token on the shadow stack.
+ * A token is always 8-byte and aligned to 8.
+ */
+static int create_rstor_token(bool ia32, unsigned long ssp,
+ unsigned long *new_ssp)
+{
+ unsigned long addr;
+
+ *new_ssp = 0;
+
+ if ((!ia32 && !IS_ALIGNED(ssp, 8)) || !IS_ALIGNED(ssp, 4))
+ return -EINVAL;
+
+ addr = ALIGN_DOWN(ssp, 8) - 8;
+
+ /* Is the token for 64-bit? */
+ if (!ia32)
+ ssp |= TOKEN_MODE_64;
+
+ if (write_user_shstk_64(addr, ssp))
+ return -EFAULT;
+
+ *new_ssp = addr;
+ return 0;
+}
+
int cet_setup_shstk(void)
{
unsigned long addr, size;
@@ -133,3 +209,78 @@ void cet_disable_free_shstk(struct task_struct *tsk)
cet->shstk_base = 0;
cet->shstk_size = 0;
}
+
+/*
+ * Called from __fpu__restore_sig() and XSAVES buffer is protected by
+ * set_thread_flag(TIF_NEED_FPU_LOAD) in the slow path.
+ */
+void cet_restore_signal(struct sc_ext *sc_ext)
+{
+ struct cet_user_state *cet_user_state;
+ struct cet_status *cet = &current->thread.cet;
+ u64 msr_val = 0;
+
+ cet_user_state = get_xsave_addr(&current->thread.fpu.state.xsave,
+ XFEATURE_CET_USER);
+ if (!cet_user_state)
+ return;
+
+ if (cet->shstk_size) {
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ cet_user_state->user_ssp = sc_ext->ssp;
+ else
+ wrmsrl(MSR_IA32_PL3_SSP, sc_ext->ssp);
+
+ msr_val |= MSR_IA32_CET_SHSTK_EN;
+ }
+
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ cet_user_state->user_cet = msr_val;
+ else
+ wrmsrl(MSR_IA32_U_CET, msr_val);
+
+ return;
+}
+
+/*
+ * Setup the shadow stack for the signal handler: first,
+ * create a restore token to keep track of the current ssp,
+ * and then the return address of the signal handler.
+ */
+int cet_setup_signal(bool ia32, unsigned long rstor_addr, struct sc_ext *sc_ext)
+{
+ struct cet_status *cet = &current->thread.cet;
+ unsigned long ssp = 0, new_ssp = 0;
+ int err;
+
+ if (cet->shstk_size) {
+ if (!rstor_addr)
+ return -EINVAL;
+
+ ssp = cet_get_shstk_addr();
+ err = create_rstor_token(ia32, ssp, &new_ssp);
+ if (err)
+ return err;
+
+ if (ia32) {
+ ssp = new_ssp - sizeof(u32);
+ err = write_user_shstk_32(ssp, (unsigned int)rstor_addr);
+ } else {
+ ssp = new_ssp - sizeof(u64);
+ err = write_user_shstk_64(ssp, rstor_addr);
+ }
+
+ if (err)
+ return err;
+
+ sc_ext->ssp = new_ssp;
+ }
+
+ if (ssp) {
+ start_update_msrs();
+ wrmsrl(MSR_IA32_PL3_SSP, ssp);
+ end_update_msrs();
+ }
+
+ return 0;
+}
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 4dad5afc938d..95ee76d08971 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -52,6 +52,73 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
return 0;
}

+#ifdef CONFIG_X86_INTEL_CET
+int save_cet_to_sigframe(int ia32, void __user *fp, unsigned long restorer)
+{
+ int err = 0;
+
+ if (!current->thread.cet.shstk_size)
+ return 0;
+
+ if (fp) {
+ struct sc_ext ext = {0, 0, 0};
+
+ err = cet_setup_signal(ia32, restorer, &ext);
+ if (!err) {
+ void __user *p = fp;
+
+ ext.total_size = sizeof(ext);
+
+ if (ia32)
+ p += sizeof(struct fregs_state);
+
+ p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+ p = (void __user *)ALIGN((unsigned long)p, 8);
+
+ if (copy_to_user(p, &ext, sizeof(ext)))
+ return -EFAULT;
+ }
+ }
+
+ return err;
+}
+
+static int get_cet_from_sigframe(int ia32, void __user *fp, struct sc_ext *ext)
+{
+ int err = 0;
+
+ memset(ext, 0, sizeof(*ext));
+
+ if (!current->thread.cet.shstk_size)
+ return 0;
+
+ if (fp) {
+ void __user *p = fp;
+
+ if (ia32)
+ p += sizeof(struct fregs_state);
+
+ p += fpu_user_xstate_size + FP_XSTATE_MAGIC2_SIZE;
+ p = (void __user *)ALIGN((unsigned long)p, 8);
+
+ if (copy_from_user(ext, p, sizeof(*ext)))
+ return -EFAULT;
+
+ if (ext->total_size != sizeof(*ext))
+ return -EFAULT;
+
+ err = cet_verify_rstor_token(ia32, ext->ssp, &ext->ssp);
+ }
+
+ return err;
+}
+#else
+static int get_cet_from_sigframe(int ia32, void __user *fp, struct sc_ext *ext)
+{
+ return 0;
+}
+#endif
+
/*
* Signal frame handlers.
*/
@@ -294,6 +361,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
struct task_struct *tsk = current;
struct fpu *fpu = &tsk->thread.fpu;
struct user_i387_ia32_struct env;
+ struct sc_ext sc_ext;
u64 user_xfeatures = 0;
int fx_only = 0;
int ret = 0;
@@ -334,6 +402,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
if ((unsigned long)buf_fx % 64)
fx_only = 1;

+ ret = get_cet_from_sigframe(ia32_fxstate, buf, &sc_ext);
+ if (ret)
+ return ret;
+
if (!ia32_fxstate) {
/*
* Attempt to restore the FPU registers directly from user
@@ -346,7 +418,10 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
pagefault_disable();
ret = copy_user_to_fpregs_zeroing(buf_fx, user_xfeatures, fx_only);
pagefault_enable();
+
if (!ret) {
+ cet_restore_signal(&sc_ext);
+
/* Restore supervisor states */
if (test_thread_flag(TIF_NEED_FPU_LOAD) &&
xfeatures_mask_supervisor())
@@ -405,6 +480,9 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
fpregs_lock();
if (unlikely(init_bv))
copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
+
+ cet_restore_signal(&sc_ext);
+
/*
* Restore previously saved supervisor xstates along with
* copied-in user xstates.
@@ -473,12 +551,35 @@ int fpu__restore_sig(void __user *buf, int ia32_frame)
return __fpu__restore_sig(buf, buf_fx, size);
}

+#ifdef CONFIG_X86_INTEL_CET
+static unsigned long fpu__alloc_sigcontext_ext(unsigned long sp)
+{
+ struct cet_status *cet = &current->thread.cet;
+
+ /*
+ * sigcontext_ext is at: fpu + fpu_user_xstate_size +
+ * FP_XSTATE_MAGIC2_SIZE, then aligned to 8.
+ */
+ if (cet->shstk_size)
+ sp -= (sizeof(struct sc_ext) + 8);
+
+ return sp;
+}
+#else
+static unsigned long fpu__alloc_sigcontext_ext(unsigned long sp)
+{
+ return sp;
+}
+#endif
+
unsigned long
fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
unsigned long *buf_fx, unsigned long *size)
{
unsigned long frame_size = xstate_sigframe_size();

+ sp = fpu__alloc_sigcontext_ext(sp);
+
*buf_fx = sp = round_down(sp - frame_size, 64);
if (ia32_frame && use_fxsr()) {
frame_size += sizeof(struct fregs_state);
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 0052bbe5dfd4..5ee1b2e51de3 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -44,6 +44,7 @@
#include <asm/syscall.h>
#include <asm/sigframe.h>
#include <asm/signal.h>
+#include <asm/cet.h>

#ifdef CONFIG_X86_64
/*
@@ -237,6 +238,9 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
unsigned long buf_fx = 0;
int onsigstack = on_sig_stack(sp);
int ret;
+#ifdef CONFIG_X86_64
+ void __user *restorer = NULL;
+#endif

/* redzone */
if (IS_ENABLED(CONFIG_X86_64))
@@ -268,6 +272,12 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
if (onsigstack && !likely(on_sig_stack(sp)))
return (void __user *)-1L;

+#ifdef CONFIG_X86_64
+ if (ka->sa.sa_flags & SA_RESTORER)
+ restorer = ka->sa.sa_restorer;
+ ret = save_cet_to_sigframe(0, *fpstate, (unsigned long)restorer);
+#endif
+
/* save i387 and extended state */
ret = copy_fpstate_to_sigframe(*fpstate, (void __user *)buf_fx, math_size);
if (ret < 0)
--
2.21.0

2020-04-29 22:12:09

by Yu-cheng Yu

Subject: [PATCH v10 18/26] mm: Update can_follow_write_pte() for shadow stack

can_follow_write_pte() ensures a read-only page is COWed by checking the
FOLL_COW flag, and uses pte_dirty() to confirm the flag is still valid.

Like a writable data page, a shadow stack page is writable, and becomes
read-only during copy-on-write, but it is always dirty. Thus, in the
can_follow_write_pte() check, it belongs to the writable page case and
should be excluded from the read-only page pte_dirty() check. Apply
the same changes to can_follow_write_pmd().
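
The updated predicate can be sketched with a small truth-table model. The struct, the flag values, and the function name are hypothetical; the real code tests pte_write(), pte_dirty(), and arch_shadow_stack_mapping(vma->vm_flags).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values, not the kernel's FOLL_* constants. */
#define SK_FOLL_FORCE (1u << 0)
#define SK_FOLL_COW   (1u << 1)

/* Hypothetical reduced view of a PTE. */
struct pte_sketch {
	bool write;
	bool dirty;
};

bool can_follow_write_sketch(struct pte_sketch pte, unsigned int flags,
			     bool shstk_vma)
{
	/* A writable PTE can always be followed for write. */
	if (pte.write)
		return true;

	/*
	 * A read-only PTE is acceptable only after a forced COW cycle,
	 * and dirty proves that only outside a shadow stack mapping,
	 * where every page is dirty to begin with.
	 */
	return (flags & SK_FOLL_FORCE) && (flags & SK_FOLL_COW) &&
	       !shstk_vma && pte.dirty;
}
```

Without the shadow stack test, a read-only shadow stack PTE would pass the pte_dirty() check even though no COW had happened yet.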

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Reverse name changes to can_follow_write_*().

mm/gup.c | 8 +++++---
mm/huge_memory.c | 8 +++++---
2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 50681f0286de..c737782403ee 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -385,10 +385,12 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
* FOLL_FORCE can write to even unwritable pte's, but only
* after we've gone through a COW cycle and they are dirty.
*/
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
+ struct vm_area_struct *vma)
{
return pte_write(pte) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+ !arch_shadow_stack_mapping(vma->vm_flags) && pte_dirty(pte));
}

static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -431,7 +433,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
- if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+ if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags, vma)) {
pte_unmap_unlock(ptep, ptl);
return NULL;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 608746bb9d19..cb1b0cb4b4eb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1520,10 +1520,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
* FOLL_FORCE can write to even unwritable pmd's, but only
* after we've gone through a COW cycle and they are dirty.
*/
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags,
+ struct vm_area_struct *vma)
{
return pmd_write(pmd) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
+ !arch_shadow_stack_mapping(vma->vm_flags) && pmd_dirty(pmd));
}

struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1536,7 +1538,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,

assert_spin_locked(pmd_lockptr(mm, pmd));

- if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+ if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags, vma))
goto out;

/* Avoid dumping huge zero page */
--
2.21.0

2020-04-29 22:12:18

by Yu-cheng Yu

Subject: [PATCH v10 16/26] mm: Add guard pages around a shadow stack.

INCSSP(Q/D) increments the shadow stack pointer and 'pops and discards' the
first and the last elements in the range, effectively touching those memory
areas.

The maximum moving distance is 255 * 8 = 2040 bytes for INCSSPQ and
255 * 4 = 1020 bytes for INCSSPD. Both distances are well below PAGE_SIZE.
Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
CALL, and RET from going beyond.
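
The arithmetic above, and the gap handling added to vm_start_gap(), can be sketched as follows. The function names and SKETCH_PAGE_SIZE are hypothetical; a 4-KB page is assumed.

```c
#include <assert.h>

enum {
	INCSSP_MAX_COUNT = 255,		/* largest count INCSSP accepts */
	SKETCH_PAGE_SIZE = 4096		/* assumed page size */
};

/* Largest distance one INCSSP can move the shadow stack pointer. */
unsigned long incssp_max_bytes(unsigned int elem_size)
{
	assert(elem_size == 4 || elem_size == 8);
	return (unsigned long)INCSSP_MAX_COUNT * elem_size;
}

/* A gap is sufficient if a single INCSSP cannot jump across it. */
int gap_is_sufficient(unsigned long gap, unsigned int elem_size)
{
	return incssp_max_bytes(elem_size) < gap;
}

/* Start of the VMA including its guard gap, clamped to 0 on wraparound. */
unsigned long start_with_gap(unsigned long vm_start, unsigned long gap)
{
	unsigned long start = vm_start - gap;

	return start > vm_start ? 0 : start;
}
```

Since INCSSP touches the last element of the range it skips, a jump that starts inside the shadow stack and ends past it must land in the guard page and fault.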

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Define ARCH_SHADOW_STACK_GUARD_GAP.

arch/x86/include/asm/processor.h | 10 ++++++++++
include/linux/mm.h | 24 ++++++++++++++++++++----
2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3bcf27caf6c9..eb9536f803f9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -907,6 +907,16 @@ static inline void spin_lock_prefetch(const void *x)
#define STACK_TOP TASK_SIZE_LOW
#define STACK_TOP_MAX TASK_SIZE_MAX

+/*
+ * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D). INCSSPQ
+ * moves shadow stack pointer up to 255 * 8 = ~2 KB (~1KB for INCSSPD) and
+ * touches the first and the last element in the range, which triggers a
+ * page fault if the range is not in a shadow stack. Because of this,
+ * creating 4-KB guard pages around a shadow stack prevents these
+ * instructions from going beyond.
+ */
+#define ARCH_SHADOW_STACK_GUARD_GAP PAGE_SIZE
+
#define INIT_THREAD { \
.addr_limit = KERNEL_DS, \
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f0669e3cdd37..68eadf2c466d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2630,6 +2630,10 @@ void page_cache_async_readahead(struct address_space *mapping,
pgoff_t offset,
unsigned long size);

+#ifndef ARCH_SHADOW_STACK_GUARD_GAP
+#define ARCH_SHADOW_STACK_GUARD_GAP 0
+#endif
+
extern unsigned long stack_guard_gap;
/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
@@ -2662,9 +2666,15 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
unsigned long vm_start = vma->vm_start;
+ unsigned long gap = 0;

- if (vma->vm_flags & VM_GROWSDOWN) {
- vm_start -= stack_guard_gap;
+ if (vma->vm_flags & VM_GROWSDOWN)
+ gap = stack_guard_gap;
+ else if (vma->vm_flags & VM_SHSTK)
+ gap = ARCH_SHADOW_STACK_GUARD_GAP;
+
+ if (gap != 0) {
+ vm_start -= gap;
if (vm_start > vma->vm_start)
vm_start = 0;
}
@@ -2674,9 +2684,15 @@ static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
{
unsigned long vm_end = vma->vm_end;
+ unsigned long gap = 0;
+
+ if (vma->vm_flags & VM_GROWSUP)
+ gap = stack_guard_gap;
+ else if (vma->vm_flags & VM_SHSTK)
+ gap = ARCH_SHADOW_STACK_GUARD_GAP;

- if (vma->vm_flags & VM_GROWSUP) {
- vm_end += stack_guard_gap;
+ if (gap != 0) {
+ vm_end += gap;
if (vm_end < vma->vm_end)
vm_end = -PAGE_SIZE;
}
--
2.21.0

2020-04-29 22:12:26

by Yu-cheng Yu

Subject: [PATCH v10 15/26] mm: Fixup places that call pte_mkwrite() directly

A shadow stack page is made writable by pte_mkwrite_shstk(), which sets
_PAGE_DIRTY_HW. There are a few places that call pte_mkwrite() directly
and miss the maybe_mkwrite() fixup in the previous patch. Fix them with
maybe_mkwrite():

- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE directly
and call pte_mkwrite(), which is the same as maybe_mkwrite(). Change
them to maybe_mkwrite().

- In do_numa_page(), if the numa entry 'was-writable', then pte_mkwrite()
is called directly. Fix it by doing maybe_mkwrite().

- In change_pte_range(), pte_mkwrite() is called directly. Replace it with
maybe_mkwrite().
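
Why the direct calls miss shadow stack pages can be sketched with a bit-flag model. This assumes, based on the description above and the previous patch in the series, that maybe_mkwrite() applies pte_mkwrite() for VM_WRITE mappings and pte_mkwrite_shstk() for VM_SHSTK mappings; all names and bit values here are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values. */
#define SK_VM_WRITE      (1u << 0)
#define SK_VM_SHSTK      (1u << 1)
#define SK_PAGE_RW       (1u << 0)
#define SK_PAGE_DIRTY_HW (1u << 1)

/* Old call-site pattern: only VM_WRITE mappings are made writable. */
uint32_t direct_mkwrite_sketch(uint32_t pte, uint32_t vm_flags)
{
	if (vm_flags & SK_VM_WRITE)
		pte |= SK_PAGE_RW;
	return pte;
}

/* Assumed maybe_mkwrite() behaviour: shadow stack mappings are handled too. */
uint32_t maybe_mkwrite_sketch(uint32_t pte, uint32_t vm_flags)
{
	if (vm_flags & SK_VM_WRITE)
		pte |= SK_PAGE_RW;		/* pte_mkwrite() */
	else if (vm_flags & SK_VM_SHSTK)
		pte |= SK_PAGE_DIRTY_HW;	/* pte_mkwrite_shstk() */
	return pte;
}
```

For ordinary writable and read-only mappings the two functions agree; they differ only for a shadow stack mapping, which is the case these call sites missed.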

Signed-off-by: Yu-cheng Yu <[email protected]>
---
mm/memory.c | 5 ++---
mm/migrate.c | 3 +--
mm/mprotect.c | 2 +-
3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..b9002f644806 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3373,8 +3373,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
__SetPageUptodate(page);

entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
@@ -4033,7 +4032,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
pte = pte_modify(old_pte, vma->vm_page_prot);
pte = pte_mkyoung(pte);
if (was_writable)
- pte = pte_mkwrite(pte);
+ pte = maybe_mkwrite(pte, vma);
ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);

diff --git a/mm/migrate.c b/mm/migrate.c
index 7160c1556f79..0fa59b1562c6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2805,8 +2805,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}
} else {
entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 494192ca954b..02762af1057c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -135,7 +135,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (dirty_accountable && pte_dirty(ptent) &&
(pte_soft_dirty(ptent) ||
!(vma->vm_flags & VM_SOFTDIRTY))) {
- ptent = pte_mkwrite(ptent);
+ ptent = maybe_mkwrite(ptent, vma);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
--
2.21.0

2020-04-29 22:12:41

by Yu-cheng Yu

Subject: [PATCH v10 09/26] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS

After the introduction of _PAGE_COW, a modified page's PTE can have either
_PAGE_DIRTY_HW or _PAGE_COW. Change _PAGE_DIRTY to _PAGE_DIRTY_BITS.
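
The effect of the mask change can be sketched as below. Bit 6 is the hardware dirty bit on x86; the position used for the COW bit is a placeholder chosen for illustration, and the SK_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_DIRTY_HW   (1ull << 6)	/* x86 hardware dirty bit */
#define SK_PAGE_COW        (1ull << 58)	/* placeholder software bit */
#define SK_PAGE_DIRTY_BITS (SK_PAGE_DIRTY_HW | SK_PAGE_COW)

/* An entry counts as dirty if either bit is set. */
int entry_is_dirty(uint64_t val)
{
	return (val & SK_PAGE_DIRTY_BITS) != 0;
}

/* Clearing the dirty field must clear both bits. */
uint64_t entry_clear_dirty(uint64_t val)
{
	return val & ~SK_PAGE_DIRTY_BITS;
}
```

Clearing only the hardware bit would leave an entry that still tests as dirty whenever the COW bit is set.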

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Cc: Zhi Wang <[email protected]>
---
drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
index 2a4b23f8aa74..789dce23424b 100644
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1207,7 +1207,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
}

/* Clear dirty field. */
- se->val64 &= ~_PAGE_DIRTY;
+ se->val64 &= ~_PAGE_DIRTY_BITS;

ops->clear_pse(se);
ops->clear_ips(se);
--
2.21.0

2020-04-29 22:12:50

by Yu-cheng Yu

Subject: [PATCH v10 02/26] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)

Add CPU feature flags for Control-flow Enforcement Technology (CET).

CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect Branch Tracking
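
How these CPUID bits relate to the feature flag numbers can be sketched as follows. The helpers take raw register values as arguments, so nothing here executes CPUID; the SK_ names are hypothetical, and the word/bit split follows the (word*32 + bit) encoding visible in cpufeatures.h, where word 16 carries leaf 7 ECX and word 18 carries leaf 7 EDX.

```c
#include <assert.h>
#include <stdint.h>

/* Same encoding as cpufeatures.h: word index times 32, plus bit. */
#define SK_FEATURE(word, bit) ((word) * 32 + (bit))
#define SK_FEATURE_SHSTK SK_FEATURE(16, 7)
#define SK_FEATURE_IBT   SK_FEATURE(18, 20)

/* CPUID.(EAX=7,ECX=0):ECX[bit 7] */
int cpuid7_has_shstk(uint32_t ecx)
{
	return (ecx >> 7) & 1;
}

/* CPUID.(EAX=7,ECX=0):EDX[bit 20] */
int cpuid7_has_ibt(uint32_t edx)
{
	return (edx >> 20) & 1;
}

unsigned int feature_word(unsigned int feature)
{
	return feature / 32;
}

unsigned int feature_bit(unsigned int feature)
{
	return feature % 32;
}
```

On a real system the ecx and edx arguments would come from executing CPUID with EAX=7 and ECX=0.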

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/kernel/cpu/cpuid-deps.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index db189945e9b0..2c95a3efc2d9 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -339,6 +339,7 @@
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
@@ -365,6 +366,7 @@
#define X86_FEATURE_MD_CLEAR (18*32+10) /* VERW clears CPU buffers */
#define X86_FEATURE_TSX_FORCE_ABORT (18*32+13) /* "" TSX_FORCE_ABORT */
#define X86_FEATURE_PCONFIG (18*32+18) /* Intel PCONFIG */
+#define X86_FEATURE_IBT (18*32+20) /* Indirect Branch Tracking */
#define X86_FEATURE_SPEC_CTRL (18*32+26) /* "" Speculation Control (IBRS + IBPB) */
#define X86_FEATURE_INTEL_STIBP (18*32+27) /* "" Single Thread Indirect Branch Predictors */
#define X86_FEATURE_FLUSH_L1D (18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 3cbe24ca80ab..fec83cc74b9e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -69,6 +69,8 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_CQM_MBM_TOTAL, X86_FEATURE_CQM_LLC },
{ X86_FEATURE_CQM_MBM_LOCAL, X86_FEATURE_CQM_LLC },
{ X86_FEATURE_AVX512_BF16, X86_FEATURE_AVX512VL },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
+ { X86_FEATURE_IBT, X86_FEATURE_XSAVES },
{}
};

--
2.21.0

2020-04-29 22:13:20

by Yu-cheng Yu

Subject: [PATCH v10 10/26] x86/mm: Update pte_modify for _PAGE_COW

pte_modify() changes a PTE to 'newprot'. It doesn't use the pte_*()
helpers that a previous patch fixed up, so it must be fixed up separately.

Introduce fixup_dirty_pte() to set the dirty bits based on _PAGE_RW, and
apply the same changes to pmd_modify().
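
The fixup rule can be modelled in user space as a pure function on a 64-bit value. This is only a sketch: the bit positions (RW = bit 1, hardware Dirty = bit 6, _PAGE_COW = bit 58) follow the x86 PTE layout and the _PAGE_BIT_COW choice elsewhere in this series, and the function name model_fixup_dirty() is made up for the example.

```c
#include <stdint.h>

#define PAGE_RW		((uint64_t)1 << 1)
#define PAGE_DIRTY_HW	((uint64_t)1 << 6)
#define PAGE_COW	((uint64_t)1 << 58)
#define PAGE_DIRTY_BITS	(PAGE_DIRTY_HW | PAGE_COW)

/*
 * Model of the fixup: if either dirty bit is set, clear both and
 * set the one that matches the (new) write permission:
 *   writable  -> hardware dirty bit
 *   read-only -> software COW bit
 * A clean value is returned unchanged.
 */
uint64_t model_fixup_dirty(uint64_t val)
{
	if (val & PAGE_DIRTY_BITS) {
		val &= ~PAGE_DIRTY_BITS;
		val |= (val & PAGE_RW) ? PAGE_DIRTY_HW : PAGE_COW;
	}
	return val;
}
```

For instance, a value that ends up read-only with the hardware dirty bit set comes back with only PAGE_COW set, so the result never carries the Write=0,Dirty=1 combination.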

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Change static_cpu_has() to cpu_feature_enabled().
- Replace _PAGE_CHG_MASK approach with fixup functions.

arch/x86/include/asm/pgtable.h | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5f89035d1e60..f4870cd040de 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -726,6 +726,21 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)

static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);

+static inline pteval_t fixup_dirty_pte(pteval_t pteval)
+{
+ pte_t pte = __pte(pteval);
+
+ if (pte_dirty(pte)) {
+ pte = pte_mkclean(pte);
+
+ if (pte_flags(pte) & _PAGE_RW)
+ pte = pte_set_flags(pte, _PAGE_DIRTY_HW);
+ else
+ pte = pte_set_flags(pte, _PAGE_COW);
+ }
+ return pte_val(pte);
+}
+
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
pteval_t val = pte_val(pte), oldval = val;
@@ -736,16 +751,34 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
*/
val &= _PAGE_CHG_MASK;
val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+ val = fixup_dirty_pte(val);
val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
return __pte(val);
}

+static inline int pmd_write(pmd_t pmd);
+static inline pmdval_t fixup_dirty_pmd(pmdval_t pmdval)
+{
+ pmd_t pmd = __pmd(pmdval);
+
+ if (pmd_dirty(pmd)) {
+ pmd = pmd_mkclean(pmd);
+
+ if (pmd_flags(pmd) & _PAGE_RW)
+ pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW);
+ else
+ pmd = pmd_set_flags(pmd, _PAGE_COW);
+ }
+ return pmd_val(pmd);
+}
+
static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
{
pmdval_t val = pmd_val(pmd), oldval = val;

val &= _HPAGE_CHG_MASK;
val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+ val = fixup_dirty_pmd(val);
val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
return __pmd(val);
}
--
2.21.0

2020-04-29 22:13:26

by Yu-cheng Yu

Subject: [PATCH v10 07/26] x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages

Kernel read-only PTEs are set up with _PAGE_DIRTY_HW. Since such PTEs would
become shadow stack PTEs, remove the dirty bit.

Signed-off-by: Yu-cheng Yu <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 6 +++---
arch/x86/mm/pat/set_memory.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b82e0f167879..522b80b952f4 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -193,10 +193,10 @@ enum page_cache_mode {
#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
#define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0)
#define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC)
-#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G)
-#define __PAGE_KERNEL_RX (__PP| 0| 0|___A| 0|___D| 0|___G)
+#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G)
+#define __PAGE_KERNEL_RX (__PP| 0| 0|___A| 0| 0|___G)
#define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC)
-#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G)
+#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G)
#define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G)
#define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G)
#define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 59eca6a94ce7..87751b7e2131 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1923,7 +1923,7 @@ int set_memory_nx(unsigned long addr, int numpages)

int set_memory_ro(unsigned long addr, int numpages)
{
- return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+ return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY_HW), 0);
}

int set_memory_rw(unsigned long addr, int numpages)
--
2.21.0

2020-04-29 22:13:56

by Yu-cheng Yu

Subject: [PATCH v10 11/26] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY_HW to _PAGE_COW

When shadow stack is introduced, [R/O + _PAGE_DIRTY_HW] PTE is reserved
for shadow stack. Copy-on-write PTEs have [R/O + _PAGE_COW].

When a PTE goes from [R/W + _PAGE_DIRTY_HW] to [R/O + _PAGE_COW], it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a read-only PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE. However, this will not occur on processors supporting
shadow stack, therefore we don't need a TLB flush here.

The second case is that when software tests and replaces _PAGE_DIRTY_HW
with _PAGE_COW non-atomically, a transient shadow stack PTE can exist.
This is prevented with cmpxchg.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights into the issue. Jann Horn provided the cmpxchg solution.
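
The compare-and-exchange approach can be sketched in user space with C11 atomics. This is a model under stated assumptions, not the kernel code: the PTE is represented as an _Atomic uint64_t, atomic_compare_exchange_weak() stands in for the kernel's try_cmpxchg(), the bit positions follow this series, and the names model_wrprotect() and model_set_wrprotect() are invented for the example.

```c
#include <stdatomic.h>
#include <stdint.h>

#define PAGE_RW		((uint64_t)1 << 1)
#define PAGE_DIRTY_HW	((uint64_t)1 << 6)
#define PAGE_COW	((uint64_t)1 << 58)

/* Pure transform: move a set hardware dirty bit to COW, then clear RW. */
uint64_t model_wrprotect(uint64_t v)
{
	if (v & PAGE_DIRTY_HW)
		v = (v & ~PAGE_DIRTY_HW) | PAGE_COW;
	return v & ~PAGE_RW;
}

/*
 * Publish the write-protected value in one atomic step. If another
 * agent changed the entry in the meantime (for example, set the dirty
 * bit), the exchange fails, 'old' is refreshed with the current value,
 * and the new value is recomputed from it.
 */
void model_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t new;

	do {
		new = model_wrprotect(old);
	} while (!atomic_compare_exchange_weak(ptep, &old, new));
}
```

Because the whole value is swapped at once, no intermediate state with Write=0 and the hardware dirty bit set is ever stored, which is the point of replacing the separate test-and-clear steps.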

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v10:
- Replace bit shift with pte_wrprotect()/pmd_wrprotect(), which use bit
test & shift.
- Move READ_ONCE of old_pte into try_cmpxchg() loop.
- Change static_cpu_has() to cpu_feature_enabled().

v9:
- Change compile-time conditionals to runtime checks.
- Fix parameters of try_cmpxchg(): change pte_t/pmd_t to
pte_t.pte/pmd_t.pmd.

v4:
- Implement try_cmpxchg().

arch/x86/include/asm/pgtable.h | 52 ++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f4870cd040de..eaa38adb1038 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1316,6 +1316,32 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
+ /*
+ * Some processors can start a write, but end up seeing a read-only
+ * PTE by the time they get to the Dirty bit. In this case, they
+ * will set the Dirty bit, leaving a read-only, Dirty PTE which
+ * looks like a shadow stack PTE.
+ *
+ * However, this behavior has been improved and will not occur on
+ * processors supporting shadow stack. Without this guarantee, a
+ * transition to a non-present PTE and flush the TLB would be
+ * needed.
+ *
+ * When changing a writable PTE to read-only and if the PTE has
+ * _PAGE_DIRTY_HW set, move that bit to _PAGE_COW so that the
+ * PTE is not a shadow stack PTE.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pte_t old_pte, new_pte;
+
+ do {
+ old_pte = READ_ONCE(*ptep);
+ new_pte = pte_wrprotect(old_pte);
+
+ } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+ return;
+ }
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}

@@ -1372,6 +1398,32 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
{
+ /*
+ * Some processors can start a write, but end up seeing a read-only
+ * PMD by the time they get to the Dirty bit. In this case, they
+ * will set the Dirty bit, leaving a read-only, Dirty PMD which
+ * looks like a Shadow Stack PMD.
+ *
+ * However, this behavior has been improved and will not occur on
+ * processors supporting Shadow Stack. Without this guarantee, a
+ * transition to a non-present PMD and flush the TLB would be
+ * needed.
+ *
+ * When changing a writable PMD to read-only and if the PMD has
+ * _PAGE_DIRTY_HW set, we move that bit to _PAGE_COW so that the
+ * PMD is not a shadow stack PMD.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pmd_t old_pmd, new_pmd;
+
+ do {
+ old_pmd = READ_ONCE(*pmdp);
+ new_pmd = pmd_wrprotect(old_pmd);
+
+ } while (!try_cmpxchg((pmdval_t *)pmdp, (pmdval_t *)&old_pmd, pmd_val(new_pmd)));
+
+ return;
+ }
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

--
2.21.0

2020-04-29 22:14:00

by Yu-cheng Yu

Subject: [PATCH v10 21/26] ELF: UAPI and Kconfig additions for ELF program properties

Introduce basic ELF definitions relating to the NT_GNU_PROPERTY_TYPE_0
note.
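
To show how these definitions fit together, here is a small user-space sketch that walks the property array inside an NT_GNU_PROPERTY_TYPE_0 note descriptor and extracts the GNU_PROPERTY_X86_FEATURE_1_AND bits. It assumes the caller has already located the descriptor payload, that the data is in host byte order, and that entries are padded to 'align' bytes (8 for 64-bit ELF, 4 for 32-bit). The function find_x86_feature_1() is hypothetical and is not the parser added by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002u
#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001u
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002u

/* Same layout as the header struct in the patch. */
struct gnu_property {
	uint32_t pr_type;
	uint32_t pr_datasz;
};

/*
 * Scan 'len' bytes of property entries. On success, store the
 * feature bits in *features and return 0. Return -1 if the property
 * is absent or an entry is truncated or malformed.
 */
int find_x86_feature_1(const unsigned char *desc, size_t len,
		       size_t align, uint32_t *features)
{
	size_t off = 0;

	assert(align == 4 || align == 8);

	while (len - off >= sizeof(struct gnu_property)) {
		struct gnu_property pr;
		size_t data_off, step;

		memcpy(&pr, desc + off, sizeof(pr));
		data_off = off + sizeof(pr);
		if (pr.pr_datasz > len - data_off)
			return -1;	/* data runs past the buffer */

		if (pr.pr_type == GNU_PROPERTY_X86_FEATURE_1_AND) {
			if (pr.pr_datasz != sizeof(uint32_t))
				return -1;
			memcpy(features, desc + data_off, sizeof(*features));
			return 0;
		}

		/* Skip header plus data, rounded up to the alignment. */
		step = (sizeof(pr) + pr.pr_datasz + align - 1) & ~(align - 1);
		if (step > len - off)
			break;
		off += step;
	}
	return -1;
}
```

A caller would then test the result against GNU_PROPERTY_X86_FEATURE_1_SHSTK and GNU_PROPERTY_X86_FEATURE_1_IBT to see which features the binary was built for.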

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v10:
- Merge GNU_PROPERTY_X86_FEATURE_1_* from a separate patch.

fs/Kconfig.binfmt | 3 +++
include/linux/elf.h | 8 ++++++++
include/uapi/linux/elf.h | 8 ++++++++
3 files changed, 19 insertions(+)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index 62dc4f577ba1..d2cfe0729a73 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -36,6 +36,9 @@ config COMPAT_BINFMT_ELF
config ARCH_BINFMT_ELF_STATE
bool

+config ARCH_USE_GNU_PROPERTY
+ bool
+
config BINFMT_ELF_FDPIC
bool "Kernel support for FDPIC ELF binaries"
default y if !BINFMT_ELF
diff --git a/include/linux/elf.h b/include/linux/elf.h
index e3649b3e970e..459cddcceaac 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_ELF_H
#define _LINUX_ELF_H

+#include <linux/types.h>
#include <asm/elf.h>
#include <uapi/linux/elf.h>

@@ -56,4 +57,11 @@ static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) {
extern int elf_coredump_extra_notes_size(void);
extern int elf_coredump_extra_notes_write(struct coredump_params *cprm);
#endif
+
+/* NT_GNU_PROPERTY_TYPE_0 header */
+struct gnu_property {
+ u32 pr_type;
+ u32 pr_datasz;
+};
+
#endif /* _LINUX_ELF_H */
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 34c02e4290fe..61251ecabdd7 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -36,6 +36,7 @@ typedef __s64 Elf64_Sxword;
#define PT_LOPROC 0x70000000
#define PT_HIPROC 0x7fffffff
#define PT_GNU_EH_FRAME 0x6474e550
+#define PT_GNU_PROPERTY 0x6474e553

#define PT_GNU_STACK (PT_LOOS + 0x474e551)

@@ -443,4 +444,11 @@ typedef struct elf64_note {
Elf64_Word n_type; /* Content type */
} Elf64_Nhdr;

+/* .note.gnu.property types */
+#define GNU_PROPERTY_X86_FEATURE_1_AND 0xc0000002
+
+/* Bits of GNU_PROPERTY_X86_FEATURE_1_AND */
+#define GNU_PROPERTY_X86_FEATURE_1_IBT 0x00000001
+#define GNU_PROPERTY_X86_FEATURE_1_SHSTK 0x00000002
+
#endif /* _UAPI_LINUX_ELF_H */
--
2.21.0

2020-04-29 22:14:14

by Yu-cheng Yu

Subject: [PATCH v10 17/26] mm/mmap: Add shadow stack pages to memory accounting

Account shadow stack pages to stack memory.
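
The accounting decision can be pictured with a simplified user-space model of accountable_mapping(). The VM_WRITE, VM_SHARED and VM_NORESERVE values mirror the kernel's definitions, but VM_SHSTK_MODEL is a placeholder bit chosen for this sketch (the real VM_SHSTK value is defined elsewhere in the series), and the hugepage-file check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define VM_WRITE	UINT64_C(0x00000002)
#define VM_SHARED	UINT64_C(0x00000008)
#define VM_NORESERVE	UINT64_C(0x00200000)
/* Placeholder for this sketch only; not the kernel's VM_SHSTK value. */
#define VM_SHSTK_MODEL	UINT64_C(0x100000000)

/*
 * Model: shadow stack mappings are always accounted; otherwise only
 * private writable mappings without VM_NORESERVE are.
 */
bool model_accountable_mapping(uint64_t vm_flags)
{
	if (vm_flags & VM_SHSTK_MODEL)
		return true;

	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
}
```

The explicit shadow stack test matters because a shadow stack VMA does not carry VM_WRITE in the usual sense, so the final comparison alone would not account it.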

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Use arch_shadow_stack_mapping() to make meaning clear.

v8:
- Change shadow stack pages from data_vm to stack_vm.

arch/x86/mm/pgtable.c | 7 +++++++
include/asm-generic/pgtable.h | 11 +++++++++++
mm/mmap.c | 5 +++++
3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index aa4d396ff98d..f384e0314ba7 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -890,3 +890,10 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)

#endif /* CONFIG_X86_64 */
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+ return (vm_flags & VM_SHSTK);
+}
+#endif
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 2c3875724809..dbd415ab7dd8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1215,6 +1215,17 @@ static inline pmd_t arch_maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma
#endif /* CONFIG_ARCH_MAYBE_MKWRITE */
#endif /* CONFIG_MMU */

+#ifdef CONFIG_MMU
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+bool arch_shadow_stack_mapping(vm_flags_t vm_flags);
+#else
+static inline bool arch_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+ return false;
+}
+#endif /* CONFIG_ARCH_HAS_SHADOW_STACK */
+#endif /* CONFIG_MMU */
+
/*
* Architecture PAGE_KERNEL_* fallbacks
*
diff --git a/mm/mmap.c b/mm/mmap.c
index f609e9ec4a25..70d240b3559c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1681,6 +1681,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
if (file && is_file_hugepages(file))
return 0;

+ if (arch_shadow_stack_mapping(vm_flags))
+ return 1;
+
return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
}

@@ -3318,6 +3321,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
mm->stack_vm += npages;
else if (is_data_mapping(flags))
mm->data_vm += npages;
+ else if (arch_shadow_stack_mapping(flags))
+ mm->stack_vm += npages;
}

static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
--
2.21.0

2020-04-29 22:14:42

by Yu-cheng Yu

Subject: [PATCH v10 08/26] x86/mm: Introduce _PAGE_COW

There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux). That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0,Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set by hardware
and cannot normally be set by hardware on a Write=0 PTE. Software must
normally be involved to create one of these PTEs, so software can simply
opt to not create them.

But that leaves us with a Linux problem: we need to ensure we never create
Write=0,Dirty=1 PTEs. In places where we do create them, we need to find
an alternative way to represent them _without_ using the same hardware bit
combination. Thus, enter _PAGE_COW. This results in the following:

(a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
(b) A R/O page that has been COW'ed: (R/O + _PAGE_COW)
The user page is in a R/O VMA, and get_user_pages() needs a writable
copy. The page fault handler creates a copy of the page and sets
the new copy's PTE as R/O and _PAGE_COW.
(c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
(d) A shared shadow stack PTE: (R/O + _PAGE_COW)
When a shadow stack page is being shared among processes (this happens
at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow
stack access causes a fault, and the page is duplicated and
_PAGE_DIRTY_HW is set again. This is the COW equivalent for shadow
stack pages, even though it's copy-on-access rather than copy-on-write.
(e) A page where the processor observed a Write=1 PTE, started a write, set
Dirty=1, but then observed a Write=0 PTE. That's possible today, but
will not happen on processors that support shadow stack.

Use _PAGE_COW in pte_wrprotect() and _PAGE_DIRTY_HW in pte_mkwrite().
Apply the same changes to pmd and pud.
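
The bit movement described above can be modelled on plain 64-bit values. This is a user-space sketch only: it assumes shadow stack is enabled, uses the bit positions from this series (RW = bit 1, hardware Dirty = bit 6, _PAGE_COW = bit 58), and the cow_model_*() names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BIT_DIRTY_HW	6
#define PAGE_BIT_COW		58
#define PAGE_RW		((uint64_t)1 << 1)
#define PAGE_DIRTY_HW	((uint64_t)1 << 6)
#define PAGE_COW	((uint64_t)1 << 58)

/* The hardware shadow stack encoding: Write=0, Dirty=1. */
bool cow_model_is_shstk(uint64_t v)
{
	return (v & (PAGE_RW | PAGE_DIRTY_HW)) == PAGE_DIRTY_HW;
}

/* Write-protect: shift the hardware dirty bit into COW, then clear RW. */
uint64_t cow_model_wrprotect(uint64_t v)
{
	v |= ((v & PAGE_DIRTY_HW) >> PAGE_BIT_DIRTY_HW) << PAGE_BIT_COW;
	v &= ~PAGE_DIRTY_HW;
	return v & ~PAGE_RW;
}

/* Make writable: move COW back to the hardware dirty bit, then set RW. */
uint64_t cow_model_mkwrite(uint64_t v)
{
	if (v & PAGE_COW) {
		v &= ~PAGE_COW;
		v |= PAGE_DIRTY_HW;
	}
	return v | PAGE_RW;
}
```

Write-protecting a writable, dirty value yields a value with only the COW bit set, which cow_model_is_shstk() rejects, and making it writable again restores the original bits.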

When this patch is applied, there are six free bits left in the 64-bit PTE.
There are no more free bits in the 32-bit PTE (except for PAE) and shadow
stack is not implemented for the 32-bit kernel.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v10:
- Change _PAGE_BIT_DIRTY_SW to _PAGE_BIT_COW, as it is used for copy-on-
write PTEs.
- Update pte_write() and treat shadow stack as writable.
- Change *_mkdirty_shstk() to *_mkwrite_shstk() as these make shadow stack
pages writable.
- Use bit test & shift to move _PAGE_BIT_DIRTY_HW to _PAGE_BIT_COW.
- Change static_cpu_has() to cpu_feature_enabled().
- Revise commit log.

v9:
- Remove pte_move_flags() etc. and put the logic directly in
pte_wrprotect()/pte_mkwrite() etc.
- Change compile-time conditionals to run-time checks.
- Split out pte_modify()/pmd_modify() to a new patch.
- Update comments.

arch/x86/include/asm/pgtable.h | 120 ++++++++++++++++++++++++---
arch/x86/include/asm/pgtable_types.h | 41 ++++++++-
2 files changed, 150 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 90f9a73881ad..5f89035d1e60 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -122,9 +122,9 @@ extern pmdval_t early_pmd_flags;
* The following only work if pte_present() is true.
* Undefined behaviour if not..
*/
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
{
- return pte_flags(pte) & _PAGE_DIRTY_HW;
+ return pte_flags(pte) & _PAGE_DIRTY_BITS;
}


@@ -161,9 +161,9 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}

-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_DIRTY_HW;
+ return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
}

static inline int pmd_young(pmd_t pmd)
@@ -171,9 +171,9 @@ static inline int pmd_young(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_ACCESSED;
}

-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
{
- return pud_flags(pud) & _PAGE_DIRTY_HW;
+ return pud_flags(pud) & _PAGE_DIRTY_BITS;
}

static inline int pud_young(pud_t pud)
@@ -183,6 +183,12 @@ static inline int pud_young(pud_t pud)

static inline int pte_write(pte_t pte)
{
+ /*
+ * If _PAGE_DIRTY_HW is set, the PTE must either have
+ * _PAGE_RW or be a shadow stack PTE, which is logically writable.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY_HW);
return pte_flags(pte) & _PAGE_RW;
}

@@ -333,7 +339,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)

static inline pte_t pte_mkclean(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_DIRTY_HW);
+ return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
}

static inline pte_t pte_mkold(pte_t pte)
@@ -343,6 +349,17 @@ static inline pte_t pte_mkold(pte_t pte)

static inline pte_t pte_wrprotect(pte_t pte)
{
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PTE (RW=0,Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >>
+ _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW;
+ pte = pte_clear_flags(pte, _PAGE_DIRTY_HW);
+ }
+
return pte_clear_flags(pte, _PAGE_RW);
}

@@ -353,6 +370,18 @@ static inline pte_t pte_mkexec(pte_t pte)

static inline pte_t pte_mkdirty(pte_t pte)
{
+ pteval_t dirty = _PAGE_DIRTY_HW;
+
+ /* Avoid creating (HW)Dirty=1,Write=0 PTEs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+ dirty = _PAGE_COW;
+
+ return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+ pte = pte_clear_flags(pte, _PAGE_COW);
return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

@@ -363,6 +392,13 @@ static inline pte_t pte_mkyoung(pte_t pte)

static inline pte_t pte_mkwrite(pte_t pte)
{
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pte_flags(pte) & _PAGE_COW) {
+ pte = pte_clear_flags(pte, _PAGE_COW);
+ pte = pte_set_flags(pte, _PAGE_DIRTY_HW);
+ }
+ }
+
return pte_set_flags(pte, _PAGE_RW);
}

@@ -434,16 +470,41 @@ static inline pmd_t pmd_mkold(pmd_t pmd)

static inline pmd_t pmd_mkclean(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
+ return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
}

static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PMD (RW=0,Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pmdval_t v = native_pmd_val(pmd);
+
+ v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW <<
+ _PAGE_BIT_COW;
+ pmd = pmd_clear_flags(__pmd(v), _PAGE_DIRTY_HW);
+ }
+
return pmd_clear_flags(pmd, _PAGE_RW);
}

static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
+ pmdval_t dirty = _PAGE_DIRTY_HW;
+
+ /* Avoid creating (HW)Dirty=1,Write=0 PMDs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW))
+ dirty = _PAGE_COW;
+
+ return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+ pmd = pmd_clear_flags(pmd, _PAGE_COW);
return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}

@@ -464,6 +525,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)

static inline pmd_t pmd_mkwrite(pmd_t pmd)
{
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pmd_flags(pmd) & _PAGE_COW) {
+ pmd = pmd_clear_flags(pmd, _PAGE_COW);
+ pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW);
+ }
+ }
+
return pmd_set_flags(pmd, _PAGE_RW);
}

@@ -488,17 +556,36 @@ static inline pud_t pud_mkold(pud_t pud)

static inline pud_t pud_mkclean(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_DIRTY_HW);
+ return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
}

static inline pud_t pud_wrprotect(pud_t pud)
{
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PUD (RW=0,Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pudval_t v = native_pud_val(pud);
+
+ v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW <<
+ _PAGE_BIT_COW;
+ pud = pud_clear_flags(__pud(v), _PAGE_DIRTY_HW);
+ }
+
return pud_clear_flags(pud, _PAGE_RW);
}

static inline pud_t pud_mkdirty(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
+ pudval_t dirty = _PAGE_DIRTY_HW;
+
+ /* Avoid creating (HW)Dirty=1,Write=0 PUDs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW))
+ dirty = _PAGE_COW;
+
+ return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
}

static inline pud_t pud_mkdevmap(pud_t pud)
@@ -518,6 +605,13 @@ static inline pud_t pud_mkyoung(pud_t pud)

static inline pud_t pud_mkwrite(pud_t pud)
{
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pud_flags(pud) & _PAGE_COW) {
+ pud = pud_clear_flags(pud, _PAGE_COW);
+ pud = pud_set_flags(pud, _PAGE_DIRTY_HW);
+ }
+ }
+
return pud_set_flags(pud, _PAGE_RW);
}

@@ -1218,6 +1312,12 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
#define pmd_write pmd_write
static inline int pmd_write(pmd_t pmd)
{
+ /*
+ * If _PAGE_DIRTY_HW is set, then the PMD must either have
+ * _PAGE_RW or be a shadow stack PMD, which is logically writable.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY_HW);
return pmd_flags(pmd) & _PAGE_RW;
}

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 522b80b952f4..74229db078ce 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,7 +23,8 @@
#define _PAGE_BIT_SOFTW2 10 /* " */
#define _PAGE_BIT_SOFTW3 11 /* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
+#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
+#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
#define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
#define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
#define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
@@ -36,6 +37,16 @@
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4

+/*
+ * This bit indicates a copy-on-write page, and is different from
+ * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
+ */
+#ifdef CONFIG_X86_64
+#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW 0
+#endif
+
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
@@ -117,6 +128,34 @@
#define _PAGE_DEVMAP (_AT(pteval_t, 0))
#endif

+/*
+ * _PAGE_COW is used to separate R/O and copy-on-write PTEs created by
+ * software from the shadow stack PTE setting required by the hardware:
+ * (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
+ * (b) A R/O page that has been COW'ed: (R/O +_PAGE_COW)
+ * The user page is in a R/O VMA, and get_user_pages() needs a
+ * writable copy. The page fault handler creates a copy of the page
+ * and sets the new copy's PTE as R/O and _PAGE_COW.
+ * (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
+ * (d) A shared (copy-on-access) shadow stack PTE: (R/O + _PAGE_COW)
+ * When a shadow stack page is being shared among processes (this
+ * happens at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the
+ * next shadow stack access causes a fault, and the page is duplicated
+ * and _PAGE_DIRTY_HW is set again. This is the COW equivalent for
+ * shadow stack pages, even though it's copy-on-access rather than
+ * copy-on-write.
+ * (e) A page where the processor observed a Write=1 PTE, started a write,
+ * set Dirty=1, but then observed a Write=0 PTE. That's possible
+ * today, but will not happen on processors that support shadow stack.
+ */
+#ifdef CONFIG_X86_INTEL_SHADOW_STACK_USER
+#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW (_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_COW)
+
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

/*
--
2.21.0

2020-04-29 22:14:54

by Yu-cheng Yu

Subject: [PATCH v10 04/26] x86/cet: Add control-protection fault handler

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the Shadow Stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works similarly to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
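
As an illustration of how the fault's error code is turned into the text printed by the handler, the sketch below reproduces the lookup with its bounds check in user space. The string table matches the one in this patch; the function name cp_err_name() is hypothetical.

```c
#include <stddef.h>

/* Same strings, in the same order, as control_protection_err[] in the patch. */
static const char * const cp_err[] = {
	"unknown",
	"near-ret",
	"far-ret/iret",
	"endbranch",
	"rstorssp",
	"setssbsy",
};

/* Map an error code to its name; out-of-range codes map to "unknown". */
const char *cp_err_name(long error_code)
{
	size_t n = sizeof(cp_err) / sizeof(cp_err[0]);

	if (error_code < 0 || (size_t)error_code >= n)
		return cp_err[0];

	return cp_err[error_code];
}
```

With this table, error code 1 corresponds to a RET whose return address disagrees with the shadow stack copy, and error code 3 to an indirect branch that did not land on an ENDBR instruction.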

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
v10:
- Change CONFIG_X86_64 to CONFIG_X86_INTEL_CET.

v9:
- Add Shadow Stack pointer to the fault printout.

arch/x86/entry/entry_64.S | 2 +-
arch/x86/include/asm/traps.h | 5 +++
arch/x86/kernel/idt.c | 4 ++
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 59 ++++++++++++++++++++++++++++++
include/uapi/asm-generic/siginfo.h | 3 +-
6 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0e9504fabe52..f42780922387 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1034,7 +1034,7 @@ idtentry spurious_interrupt_bug do_spurious_interrupt_bug has_error_code=0
idtentry coprocessor_error do_coprocessor_error has_error_code=0
idtentry alignment_check do_alignment_check has_error_code=1
idtentry simd_coprocessor_error do_simd_coprocessor_error has_error_code=0
-
+idtentry control_protection do_control_protection has_error_code=1

/*
* Reload gs selector with exception handling
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c26a7e1d8a2c..9bf804709ee6 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -35,6 +35,9 @@ asmlinkage void alignment_check(void);
asmlinkage void machine_check(void);
#endif /* CONFIG_X86_MCE */
asmlinkage void simd_coprocessor_error(void);
+#ifdef CONFIG_X86_INTEL_CET
+asmlinkage void control_protection(void);
+#endif

#if defined(CONFIG_X86_64) && defined(CONFIG_XEN_PV)
asmlinkage void xen_divide_error(void);
@@ -86,6 +89,7 @@ dotraplinkage void do_simd_coprocessor_error(struct pt_regs *regs, long error_co
dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code);
#endif
dotraplinkage void do_mce(struct pt_regs *regs, long error_code);
+dotraplinkage void do_control_protection(struct pt_regs *regs, long error_code);

#ifdef CONFIG_X86_64
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs);
@@ -151,6 +155,7 @@ enum {
X86_TRAP_AC, /* 17, Alignment Check */
X86_TRAP_MC, /* 18, Machine Check */
X86_TRAP_XF, /* 19, SIMD Floating-Point Exception */
+ X86_TRAP_CP = 21, /* 21 Control Protection Fault */
X86_TRAP_IRET = 32, /* 32, IRET Exception */
};

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 87ef69a72c52..19160c8d734f 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -102,6 +102,10 @@ static const __initconst struct idt_data def_idts[] = {
#elif defined(CONFIG_X86_32)
SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32),
#endif
+
+#ifdef CONFIG_X86_INTEL_CET
+ INTG(X86_TRAP_CP, control_protection),
+#endif
};

/*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf0576cd0..c572a3de1037 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
*/
BUILD_BUG_ON(NSIGILL != 11);
BUILD_BUG_ON(NSIGFPE != 15);
- BUILD_BUG_ON(NSIGSEGV != 7);
+ BUILD_BUG_ON(NSIGSEGV != 8);
BUILD_BUG_ON(NSIGBUS != 5);
BUILD_BUG_ON(NSIGTRAP != 5);
BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d54cffdc7cac..d2515dfbc178 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -586,6 +586,65 @@ dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code)
}
NOKPROBE_SYMBOL(do_general_protection);

+static const char * const control_protection_err[] = {
+ "unknown",
+ "near-ret",
+ "far-ret/iret",
+ "endbranch",
+ "rstorssp",
+ "setssbsy",
+};
+
+/*
+ * When a control protection exception occurs, send a signal
+ * to the responsible application. Currently, control
+ * protection is only enabled for the user mode. This
+ * exception should not come from the kernel mode.
+ */
+dotraplinkage void
+do_control_protection(struct pt_regs *regs, long error_code)
+{
+ struct task_struct *tsk;
+
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+ if (notify_die(DIE_TRAP, "control protection fault", regs,
+ error_code, X86_TRAP_CP, SIGSEGV) == NOTIFY_STOP)
+ return;
+ cond_local_irq_enable(regs);
+
+ if (!user_mode(regs))
+ die("kernel control protection fault", regs, error_code);
+
+ if (!static_cpu_has(X86_FEATURE_SHSTK) &&
+ !static_cpu_has(X86_FEATURE_IBT))
+ WARN_ONCE(1, "CET is disabled but got control protection fault\n");
+
+ tsk = current;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_CP;
+
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ printk_ratelimit()) {
+ unsigned int max_err;
+ unsigned long ssp;
+
+ max_err = ARRAY_SIZE(control_protection_err) - 1;
+ if ((error_code < 0) || (error_code > max_err))
+ error_code = 0;
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+ pr_info("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+ tsk->comm, task_pid_nr(tsk),
+ regs->ip, regs->sp, ssp, error_code,
+ control_protection_err[error_code]);
+ print_vma_addr(KERN_CONT " in ", regs->ip);
+ pr_cont("\n");
+ }
+
+ force_sig_fault(SIGSEGV, SEGV_CPERR,
+ (void __user *)uprobe_get_trap_addr(regs));
+}
+NOKPROBE_SYMBOL(do_control_protection);
+
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
if (poke_int3_handler(regs))
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c267181..693071dbe641 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -229,7 +229,8 @@ typedef struct siginfo {
#define SEGV_ACCADI 5 /* ADI not enabled for mapped object */
#define SEGV_ADIDERR 6 /* Disrupting MCD error */
#define SEGV_ADIPERR 7 /* Precise MCD exception */
-#define NSIGSEGV 7
+#define SEGV_CPERR 8
+#define NSIGSEGV 8

/*
* SIGBUS si_codes
--
2.21.0
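
The error-code lookup in do_control_protection() above can be tried out in user space. What follows is a minimal sketch, not part of the posted patch: cp_err_names and cp_err_name() are hypothetical names chosen for illustration. The strings and the clamp-to-"unknown" range check mirror control_protection_err[] and the handler's own bounds test.

```c
#include <stddef.h>

/* Same strings, in the same order, as control_protection_err[] in the patch. */
static const char *const cp_err_names[] = {
	"unknown",
	"near-ret",
	"far-ret/iret",
	"endbranch",
	"rstorssp",
	"setssbsy",
};

/*
 * Map a control-protection error code to a printable name.
 * Codes outside the table are reported as "unknown" (index 0),
 * which is what the handler does before indexing the array.
 */
static const char *cp_err_name(long error_code)
{
	const long max_err =
		(long)(sizeof(cp_err_names) / sizeof(cp_err_names[0])) - 1;

	if (error_code < 0 || error_code > max_err)
		error_code = 0;
	return cp_err_names[error_code];
}
```

With this table, cp_err_name(3) names an IBT violation ("endbranch"), and any code past the last entry falls back to "unknown" rather than reading beyond the array.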

2020-04-29 22:14:55

by Yu-cheng Yu

Subject: [PATCH v10 05/26] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack

Shadow Stack provides protection against function return address
corruption. It is active when the processor supports it, the kernel has
CONFIG_X86_INTEL_SHADOW_STACK_USER, and the application is built for the
feature. This is only implemented for the 64-bit kernel. When it is
enabled, legacy non-shadow stack applications continue to work, but without
protection.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
v10:
- Change SHSTK to shadow stack in the help text.
- Change build-time check to config-time check.
- Change ARCH_HAS_SHSTK to ARCH_HAS_SHADOW_STACK.

arch/x86/Kconfig | 30 +++++++++++++++++++++++++++
scripts/as-x86_64-has-shadow-stack.sh | 4 ++++
2 files changed, 34 insertions(+)
create mode 100755 scripts/as-x86_64-has-shadow-stack.sh

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1197b5596d5a..c98f82fffe85 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1947,6 +1947,36 @@ config X86_INTEL_TSX_MODE_AUTO
side channel attacks- equals the tsx=auto command line parameter.
endchoice

+config AS_HAS_SHADOW_STACK
+ def_bool $(success,$(srctree)/scripts/as-x86_64-has-shadow-stack.sh $(CC))
+ help
+ Test the assembler for shadow stack instructions.
+
+config X86_INTEL_CET
+ def_bool n
+
+config ARCH_HAS_SHADOW_STACK
+ def_bool n
+
+config X86_INTEL_SHADOW_STACK_USER
+ prompt "Intel Shadow Stacks for user-mode"
+ def_bool n
+ depends on CPU_SUP_INTEL && X86_64
+ depends on AS_HAS_SHADOW_STACK
+ select ARCH_USES_HIGH_VMA_FLAGS
+ select X86_INTEL_CET
+ select ARCH_HAS_SHADOW_STACK
+ help
+ Shadow Stacks provide protection against program stack
+ corruption. It's a hardware feature. This only matters
+ if you have the right hardware. It's a security hardening
+ feature and apps must be enabled to use it. You get no
+ protection "for free" on old userspace. The hardware can
+ support user and kernel, but this option is for user space
+ only.
+
+ If unsure, say y.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/scripts/as-x86_64-has-shadow-stack.sh b/scripts/as-x86_64-has-shadow-stack.sh
new file mode 100755
index 000000000000..fac1d363a1b8
--- /dev/null
+++ b/scripts/as-x86_64-has-shadow-stack.sh
@@ -0,0 +1,4 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+echo "wrussq %rax, (%rbx)" | $* -x assembler -c -
--
2.21.0

2020-04-29 22:54:48

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 4/29/20 3:07 PM, Yu-cheng Yu wrote:
> +Note:
> + There is no CET-enabling arch_prctl function. By design, CET is enabled
> + automatically if the binary and the system can support it.

I think Andy and I danced around this last time. Let me try to say it
more explicitly.

I want CET kernel enabling to be able to be disconnected from the on-disk
binary. I want a binary compiled with CET to be able to disable it, and
I want a binary not compiled with CET to be able to enable it. I want
different threads in a process to be able to each have different CET status.

Which JITs was this tested with? I think as a bare minimum we need to
know that this design can accommodate _a_ modern JIT. It would be
horrible if the browser javascript engines couldn't use this design, for
instance.

2020-04-29 23:06:43

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Wed, 2020-04-29 at 15:53 -0700, Dave Hansen wrote:
> On 4/29/20 3:07 PM, Yu-cheng Yu wrote:
> > +Note:
> > + There is no CET-enabling arch_prctl function. By design, CET is enabled
> > + automatically if the binary and the system can support it.
>
> I think Andy and I danced around this last time. Let me try to say it
> more explicitly.
>
> I want CET kernel enabling to be able to be disconnected from the on-disk
> binary. I want a binary compiled with CET to be able to disable it, and
> I want a binary not compiled with CET to be able to enable it. I want
> different threads in a process to be able to each have different CET status.

The kernel patches we have now can be modified to support this model. If after
discussion this is favorable, I will modify code accordingly.

> Which JITs was this tested with? I think as a bare minimum we need to
> know that this design can accommodate _a_ modern JIT. It would be
> horrible if the browser javascript engines couldn't use this design, for
> instance.

JIT work is still in progress. When that is available I will test it.

Yu-cheng

2020-05-07 15:59:31

by Dave Hansen

Subject: Re: [PATCH v10 05/26] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack

On 4/29/20 3:07 PM, Yu-cheng Yu wrote:
> +config X86_INTEL_SHADOW_STACK_USER
> + prompt "Intel Shadow Stacks for user-mode"
> + def_bool n
> + depends on CPU_SUP_INTEL && X86_64
> + depends on AS_HAS_SHADOW_STACK
> + select ARCH_USES_HIGH_VMA_FLAGS
> + select X86_INTEL_CET
> + select ARCH_HAS_SHADOW_STACK

I called protection keys: X86_INTEL_MEMORY_PROTECTION_KEYS

AMD recently posted documentation which shows them implementing it as
well. The "INTEL_" is feeling now like a mistake.

Going forward, we should probably avoid sticking the company name on
them, if for no other reason than avoiding confusion and/or churn in the
future.

Shadow stacks, for instance, seem like something that another vendor
might implement one day. So, let's at least remove the "INTEL_" from
the config option names themselves. Mentioning Intel in the changelog
and the Kconfig help text is fine.

2020-05-07 17:01:43

by Yu-cheng Yu

Subject: Re: [PATCH v10 05/26] x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack

On Thu, 2020-05-07 at 08:55 -0700, Dave Hansen wrote:
> On 4/29/20 3:07 PM, Yu-cheng Yu wrote:
> > +config X86_INTEL_SHADOW_STACK_USER
> > + prompt "Intel Shadow Stacks for user-mode"
> > + def_bool n
> > + depends on CPU_SUP_INTEL && X86_64
> > + depends on AS_HAS_SHADOW_STACK
> > + select ARCH_USES_HIGH_VMA_FLAGS
> > + select X86_INTEL_CET
> > + select ARCH_HAS_SHADOW_STACK
>
> I called protection keys: X86_INTEL_MEMORY_PROTECTION_KEYS
>
> AMD recently posted documentation which shows them implementing it as
> well. The "INTEL_" is feeling now like a mistake.
>
> Going forward, we should probably avoid sticking the company name on
> them, if for no other reason than avoiding confusion and/or churn in the
> future.
>
> Shadow stacks, for instance, seem like something that another vendor
> might implement one day. So, let's at least remove the "INTEL_" from
> the config option names themselves. Mentioning Intel in the changelog
> and the Kconfig help text is fine.

Yes, sure.

Yu-cheng

2020-05-12 23:22:30

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Wed, 2020-04-29 at 16:02 -0700, Yu-cheng Yu wrote:
> On Wed, 2020-04-29 at 15:53 -0700, Dave Hansen wrote:
> > On 4/29/20 3:07 PM, Yu-cheng Yu wrote:
> > > +Note:
> > > + There is no CET-enabling arch_prctl function. By design, CET is enabled
> > > + automatically if the binary and the system can support it.
> >
> > I think Andy and I danced around this last time. Let me try to say it
> > more explicitly.
> >
> > > I want CET kernel enabling to be able to be disconnected from the on-disk
> > binary. I want a binary compiled with CET to be able to disable it, and
> > I want a binary not compiled with CET to be able to enable it. I want
> > different threads in a process to be able to each have different CET status.
>
> The kernel patches we have now can be modified to support this model. If after
> discussion this is favorable, I will modify code accordingly.

To turn on/off and to lock CET are application-level decisions. The kernel does
not prevent any of those. Should there be a need to provide an arch_prctl() to
turn on CET, it can be added without any conflict to this series.

> > Which JITs was this tested with? I think as a bare minimum we need to
> > know that this design can accommodate _a_ modern JIT. It would be
> > horrible if the browser javascript engines couldn't use this design, for
> > instance.
>
> JIT work is still in progress. When that is available I will test it.

I found CET has been enabled in LLVM JIT, Mesa JIT as well as sljit which is
used by jit. So the current model works with JIT.

Yu-cheng

2020-05-15 18:41:06

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 5/12/20 4:20 PM, Yu-cheng Yu wrote:
> On Wed, 2020-04-29 at 16:02 -0700, Yu-cheng Yu wrote:
>> On Wed, 2020-04-29 at 15:53 -0700, Dave Hansen wrote:
>>> On 4/29/20 3:07 PM, Yu-cheng Yu wrote:
>>>> +Note:
>>>> + There is no CET-enabling arch_prctl function. By design, CET is enabled
>>>> + automatically if the binary and the system can support it.
>>>
>>> I think Andy and I danced around this last time. Let me try to say it
>>> more explicitly.
>>>
>>> I want CET kernel enabling to be able to be disconnected from the on-disk
>>> binary. I want a binary compiled with CET to be able to disable it, and
>>> I want a binary not compiled with CET to be able to enable it. I want
>>> different threads in a process to be able to each have different CET status.
>>
>> The kernel patches we have now can be modified to support this model. If after
>> discussion this is favorable, I will modify code accordingly.
>
> To turn on/off and to lock CET are application-level decisions. The kernel does
> not prevent any of those. Should there be a need to provide an arch_prctl() to
> turn on CET, it can be added without any conflict to this series.

I spelled out what I wanted pretty clearly. On your next post, could
you please directly address each of the things I asked for? Please
directly answer the following questions in your next post with respect
to the code you post:

Can a binary compiled with CET run without CET?
Can a binary compiled without CET run CET-enabled code?
Can different threads in a process have different CET enabling state?

>>> Which JITs was this tested with? I think as a bare minimum we need to
>>> know that this design can accommodate _a_ modern JIT. It would be
>>> horrible if the browser javascript engines couldn't use this design, for
>>> instance.
>>
>> JIT work is still in progress. When that is available I will test it.
>
> I found CET has been enabled in LLVM JIT, Mesa JIT as well as sljit which is
> used by jit. So the current model works with JIT.

Great! I'm glad the model works. That's not what I asked, though.

Does this *code* work? Could you please indicate which JITs have been
enabled to use the code in this series? How much of the new ABI is in use?

Where are the selftests/ for this new ABI? Were you planning on
submitting any with this series?

2020-05-15 21:35:31

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Fri, 2020-05-15 at 11:39 -0700, Dave Hansen wrote:
> On 5/12/20 4:20 PM, Yu-cheng Yu wrote:
> > On Wed, 2020-04-29 at 16:02 -0700, Yu-cheng Yu wrote:
> > > On Wed, 2020-04-29 at 15:53 -0700, Dave Hansen wrote:
> > > > On 4/29/20 3:07 PM, Yu-cheng Yu wrote:
> > > > > +Note:
> > > > > + There is no CET-enabling arch_prctl function. By design, CET is enabled
> > > > > + automatically if the binary and the system can support it.
> > > >
> > > > I think Andy and I danced around this last time. Let me try to say it
> > > > more explicitly.
> > > >
> > > > I want CET kernel enabling to be able to be disconnected from the on-disk
> > > > binary. I want a binary compiled with CET to be able to disable it, and
> > > > I want a binary not compiled with CET to be able to enable it. I want
> > > > different threads in a process to be able to each have different CET status.
> > >
> > > The kernel patches we have now can be modified to support this model. If after
> > > discussion this is favorable, I will modify code accordingly.
> >
> > To turn on/off and to lock CET are application-level decisions. The kernel does
> > not prevent any of those. Should there be a need to provide an arch_prctl() to
> > turn on CET, it can be added without any conflict to this series.
>
> I spelled out what I wanted pretty clearly. On your next post, could
> you please directly address each of the things I asked for? Please
> directly answer the following questions in your next post with respect
> to the code you post:
>
> Can a binary compiled with CET run without CET?

Yes, but a few details:

- The shadow stack is transparent to the application. A CET application does
not have anything different from a non-CET application. However, if a CET
application uses any CET instructions (e.g. INCSSP), it must first check if CET
is turned on.
- If an application is compiled for IBT, the compiler inserts ENDBRs at branch
targets. These are nops if IBT is not on.
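
The "check if CET is turned on" step can be done without any kernel interface. The sketch below is an illustration rather than code from this series: read_ssp() and shstk_active() are hypothetical helper names. It relies on RDSSPQ executing as a NOP when shadow stack is not enabled, so a register zeroed beforehand stays zero. It assumes an assembler that knows the rdsspq mnemonic; on non-x86-64 builds the helper simply reports "not active".

```c
#include <stdint.h>

/*
 * Return the current shadow stack pointer, or 0 if shadow stack is not
 * active for this thread. RDSSPQ leaves its operand untouched when
 * shadow stack is disabled or unsupported, so the initial 0 survives.
 */
static inline uint64_t read_ssp(void)
{
#if defined(__x86_64__)
	uint64_t ssp = 0;

	__asm__ volatile("rdsspq %0" : "+r"(ssp));
	return ssp;
#else
	return 0;	/* no shadow stack on other architectures in this sketch */
#endif
}

/* Non-zero if the calling thread currently runs with a shadow stack. */
static inline int shstk_active(void)
{
	return read_ssp() != 0;
}
```

An application would then guard any explicit shadow stack manipulation (INCSSP and friends) with "if (shstk_active())" and skip it otherwise.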

> Can a binary compiled without CET run CET-enabled code?

Partially yes, but in reality somewhat difficult.

- If a non-CET application does exec() of a CET binary, then CET is enabled.
- If a non-CET application does fork(), and the child wants to turn on CET, it
would be difficult to manage the stack frames, unless the child knows what it is
doing. The JIT examples I mentioned previously run with CET enabled from the
beginning. Do you have a reason to do this? In other words, if the JIT code
needs CET, the app could have started with CET in the first place.
- If you are asking about dlopen(), the library will have the same setting as
the main application. Do you have any reason to have a library running with
CET, but the application does not have CET?

> Can different threads in a process have different CET enabling state?

Yes, if the parent starts with CET, children can turn it off. But for the same
reason described above, it is difficult to turn on CET from the middle.

> > > > Which JITs was this tested with? I think as a bare minimum we need to
> > > > know that this design can accommodate _a_ modern JIT. It would be
> > > > horrible if the browser javascript engines couldn't use this design, for
> > > > instance.
> > >
> > > JIT work is still in progress. When that is available I will test it.
> >
> > I found CET has been enabled in LLVM JIT, Mesa JIT as well as sljit which is
> > used by jit. So the current model works with JIT.
>
> Great! I'm glad the model works. That's not what I asked, though.
>
> Does this *code* work? Could you please indicate which JITs have been
> enabled to use the code in this series? How much of the new ABI is in use?

JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
are tested and in the master branch. Sljit fixes are in the release.

> Where are the selftests/ for this new ABI? Were you planning on
> submitting any with this series?

The ABI is more related to the application side, and therefore most suitable for
GLIBC unit tests. The more complicated areas such as pthreads, signals,
ucontext, fork() are all included there. I have been constantly running these
tests without any problems. I can provide more details if testing is the
concern.

Yu-cheng

2020-05-15 22:47:48

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 5/15/20 2:33 PM, Yu-cheng Yu wrote:
> On Fri, 2020-05-15 at 11:39 -0700, Dave Hansen wrote:
>> On 5/12/20 4:20 PM, Yu-cheng Yu wrote:
>> Can a binary compiled with CET run without CET?
>
> Yes, but a few details:
>
> - The shadow stack is transparent to the application. A CET application does
> not have anything different from a non-CET application. However, if a CET
> application uses any CET instructions (e.g. INCSSP), it must first check if CET
> is turned on.
> - If an application is compiled for IBT, the compiler inserts ENDBRs at branch
> targets. These are nops if IBT is not on.

I appreciate the detailed response, but it wasn't quite what I was
asking. Let's ignore IBT for now and just talk about shadow stacks.

An app compiled with the new ELF flags and running on a CET-enabled
kernel and CPU will start off with shadow stacks allocated and enabled,
right? It can turn its shadow stack off per-thread with the new prctl.
But, otherwise, it's stuck, the only way to turn shadow stacks off at
startup would be editing the binary.

Basically, if there ends up being a bug in an app that violates the
shadow stack rules, the app is broken, period. The only recourse is to
have the kernel disable CET and reboot.

Is that right?

>> Can a binary compiled without CET run CET-enabled code?
>
> Partially yes, but in reality somewhat difficult.
...
> - If a non-CET application does fork(), and the child wants to turn on CET, it
> would be difficult to manage the stack frames, unless the child knows what it is
> doing.

It might be hard to do, but it is possible with the patches you posted?
I think you're saying that the CET-enabled binary would do
arch_setup_elf_property() when it was first exec()'d. Later, it could
use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
then fork() and the child would not be using CET. Right?

What is ARCH_X86_CET_DISABLE used for, anyway?

> The JIT examples I mentioned previously run with CET enabled from the
> beginning. Do you have a reason to do this? In other words, if the JIT code
> needs CET, the app could have started with CET in the first place.

Let's say I have a JIT'd sandbox. I want the sandbox to be
CET-protected, but the JIT engine itself not to be.

> - If you are asking about dlopen(), the library will have the same setting as
> the main application. Do you have any reason to have a library running with
> CET, but the application does not have CET?

Sure, using old binaries. That's why IBT has a legacy bitmap and things
like MPX had ways of jumping into old non-enabled binaries.

>> Can different threads in a process have different CET enabling state?
>
> Yes, if the parent starts with CET, children can turn it off.

How would that work, though? clone() by default will copy the parent
xsave state, which means it will be CET-enabled, which means it needs a
shadow stack. So, if I want a CET-free child thread, I need to clone(),
then turn CET off, then free the shadow stack?

>> Does this *code* work? Could you please indicate which JITs have been
>> enabled to use the code in this series? How much of the new ABI is in use?
>
> JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
> frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
> are tested and in the master branch. Sljit fixes are in the release.

Huh, so who is using the new prctl() ABIs?

>> Where are the selftests/ for this new ABI? Were you planning on
>> submitting any with this series?
>
> The ABI is more related to the application side, and therefore most suitable for
> GLIBC unit tests.

I was mostly concerned with the kernel selftests. The things in
tools/testing/selftests/x86 in the kernel tree.

> The more complicated areas such as pthreads, signals, ucontext,
> fork() are all included there. I have been constantly running these
> tests without any problems. I can provide more details if testing is
> the concern.

For something this complicated, with new kernel ABIs, we need an
in-kernel selftest.

MPX was not that much different from this feature. It required a
boatload of compiler and linker changes to function. Yet, there was a
simple in-kernel test for it that didn't require *any* of that big pile
of toolchain bits.

Is there a reason we don't have one of those for CET?

2020-05-15 23:31:27

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Fri, 2020-05-15 at 15:43 -0700, Dave Hansen wrote:
> On 5/15/20 2:33 PM, Yu-cheng Yu wrote:
> > On Fri, 2020-05-15 at 11:39 -0700, Dave Hansen wrote:
> > > On 5/12/20 4:20 PM, Yu-cheng Yu wrote:
> > > Can a binary compiled with CET run without CET?
> >
> > Yes, but a few details:
> >
> > - The shadow stack is transparent to the application. A CET application does
> > not have anything different from a non-CET application. However, if a CET
> > application uses any CET instructions (e.g. INCSSP), it must first check if CET
> > is turned on.
> > - If an application is compiled for IBT, the compiler inserts ENDBRs at branch
> > targets. These are nops if IBT is not on.
>
> I appreciate the detailed response, but it wasn't quite what I was
> asking. Let's ignore IBT for now and just talk about shadow stacks.
>
> An app compiled with the new ELF flags and running on a CET-enabled
> kernel and CPU will start off with shadow stacks allocated and enabled,
> right? It can turn its shadow stack off per-thread with the new prctl.
> But, otherwise, it's stuck, the only way to turn shadow stacks off at
> startup would be editing the binary.
>
> Basically, if there ends up being a bug in an app that violates the
> shadow stack rules, the app is broken, period. The only recourse is to
> have the kernel disable CET and reboot.
>
> Is that right?

You must be talking about init or any of the system daemons, right?
Assuming we let the app itself start CET with an arch_prctl(), why would that be
different from the current approach?

> > > Can a binary compiled without CET run CET-enabled code?
> >
> > Partially yes, but in reality somewhat difficult.
> ...
> > - If a non-CET application does fork(), and the child wants to turn on CET, it
> > would be difficult to manage the stack frames, unless the child knows what it is
> > doing.
>
> It might be hard to do, but it is possible with the patches you posted?

It is possible to add an arch_prctl() to turn on CET. That is simple from the
kernel's perspective, but difficult for the application. Once the app enables
shadow stack, it has to take care not to return beyond the function call layers
before that point. It can no longer do longjmp or ucontext swaps to anything
before that point. It will also be complicated if the app enables shadow stack
in a signal handler.
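
To make the "cannot return beyond that point" problem concrete, here is a toy model in plain C. It is an illustration only and says nothing about how the hardware or this series is implemented: struct toy_cpu, toy_call() and toy_ret() are hypothetical names. The model keeps a normal stack and a shadow stack of return addresses, and lets the shadow stack be switched on after some calls have already been made.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_DEPTH 16

/* Toy model: a normal stack plus a shadow stack that can be enabled late. */
struct toy_cpu {
	unsigned long stack[TOY_DEPTH];	/* return addresses, normal stack */
	unsigned long shstk[TOY_DEPTH];	/* copies kept on the shadow stack */
	size_t sp;
	size_t ssp;
	bool shstk_on;
};

/* CALL: push the return address; mirror it on the shadow stack if enabled. */
static bool toy_call(struct toy_cpu *c, unsigned long ret)
{
	if (c->sp == TOY_DEPTH)
		return false;		/* model is full */
	c->stack[c->sp++] = ret;
	if (c->shstk_on)
		c->shstk[c->ssp++] = ret;
	return true;
}

/* RET: true on a clean return, false where a real system would fault. */
static bool toy_ret(struct toy_cpu *c)
{
	unsigned long ret;

	if (c->sp == 0)
		return false;		/* nothing to return to */
	ret = c->stack[--c->sp];
	if (!c->shstk_on)
		return true;		/* no checking without shadow stack */
	if (c->ssp == 0)
		return false;		/* frame predates enabling: no shadow copy */
	return c->shstk[--c->ssp] == ret;
}
```

Two calls made before setting shstk_on, followed by one call after it, give one good return and then a failing one: the second return targets a frame the shadow stack never recorded. The same reasoning applies to longjmp() or ucontext switches that land in such a frame.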

> I think you're saying that the CET-enabled binary would do
> arch_setup_elf_property() when it was first exec()'d. Later, it could
> use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
> then fork() and the child would not be using CET. Right?
>
> What is ARCH_X86_CET_DISABLE used for, anyway?

Both the parent and the child can do ARCH_X86_CET_DISABLE, if CET is not locked.

> > The JIT examples I mentioned previously run with CET enabled from the
> > beginning. Do you have a reason to do this? In other words, if the JIT code
> > needs CET, the app could have started with CET in the first place.
>
> Let's say I have a JIT'd sandbox. I want the sandbox to be
> CET-protected, but the JIT engine itself not to be.

I do not have any objections to this use case, but it needs some cautions as
stated above. It will be much easier and cleaner if the sandbox is in a
separate exec'ed task with CET on.

> > - If you are asking about dlopen(), the library will have the same setting as
> > the main application. Do you have any reason to have a library running with
> > CET, but the application does not have CET?
>
> Sure, using old binaries. That's why IBT has a legacy bitmap and things
> like MPX had ways of jumping into old non-enabled binaries.

If the app has CET, but libs do not, then the legacy bitmap can help.
If the app does not have CET, we don't run the libs with CET, right? This is
the case right now.

> > > Can different threads in a process have different CET enabling state?
> >
> > Yes, if the parent starts with CET, children can turn it off.
>
> How would that work, though? clone() by default will copy the parent
> xsave state, which means it will be CET-enabled, which means it needs a
> shadow stack. So, if I want a CET-free child thread, I need to clone(),
> then turn CET off, then free the shadow stack?

Yes, the child itself turns off CET.

> > > Does this *code* work? Could you please indicate which JITs have been
> > > enabled to use the code in this series? How much of the new ABI is in use?
> >
> > JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
> > frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
> > are tested and in the master branch. Sljit fixes are in the release.
>
> Huh, so who is using the new prctl() ABIs?

Any code can use the ABI, but the CET-enabling parts of JIT code mostly do not
use these new prctl()s, except perhaps to get CET status.

> > > Where are the selftests/ for this new ABI? Were you planning on
> > > submitting any with this series?
> >
> > The ABI is more related to the application side, and therefore most suitable for
> > GLIBC unit tests.
>
> I was mostly concerned with the kernel selftests. The things in
> tools/testing/selftests/x86 in the kernel tree.

I have run them with CET enabled. All of them pass, except for the following:
Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
address. This is understandable.

> > The more complicated areas such as pthreads, signals, ucontext,
> > fork() are all included there. I have been constantly running these
> > tests without any problems. I can provide more details if testing is
> > the concern.
>
> For something this complicated, with new kernel ABIs, we need an
> in-kernel selftest.
>
> MPX was not that much different from this feature. It required a
> boatload of compiler and linker changes to function. Yet, there was a
> simple in-kernel test for it that didn't require *any* of that big pile
> of toolchain bits.
>
> Is there a reason we don't have one of those for CET?

I have a quick test that checks shadow stack and IBT in both the main program
and in signal handlers. Currently it is public on GitHub. If that is desired, I
can submit it to the mailing list.

Yu-cheng

2020-05-15 23:58:24

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 5/15/20 4:29 PM, Yu-cheng Yu wrote:
> On Fri, 2020-05-15 at 15:43 -0700, Dave Hansen wrote:
>> Basically, if there ends up being a bug in an app that violates the
>> shadow stack rules, the app is broken, period. The only recourse is to
>> have the kernel disable CET and reboot.
>>
>> Is that right?
>
> You must be talking about init or any of the system daemons, right?
> Assuming we let the app itself start CET with an arch_prctl(), why would that be
> different from the current approach?

You're getting ahead of me a bit here.

I'm actually not asking directly about the prctls() or advocating for a
different approach. The MPX approach of _requiring_ the app to make a
prctl() was actually pretty nasty because sometimes threads got created
before the prctl() could get called. Apps ended up inadvertently
half-MPX-enabled. Not fun.

Let's say we have an app doing silly things like retpolines. (Lots of
apps do lots of silly things). It gets compiled in a distro but never
runs on a system with CET. The app gets run for the first time on a
system with CET. App goes boom. Not init, just some random app, say
/usr/bin/ldapsearch.

What's my recourse as an end user? I want to run my app and turn off
CET for that app. How can I do that?

>>>> Can a binary compiled without CET run CET-enabled code?
>>>
>>> Partially yes, but in reality somewhat difficult.
>> ...
>>> - If a non-CET application does fork(), and the child wants to turn on CET, it
>>> would be difficult to manage the stack frames, unless the child knows what it is
>>> doing.
>>
>> It might be hard to do, but it is possible with the patches you posted?
>
> It is possible to add an arch_prctl() to turn on CET. That is simple from the
> kernel's perspective, but difficult for the application. Once the app enables
> shadow stack, it has to take care not to return beyond the function call layers
> before that point. It can no longer do longjmp or ucontext swaps to anything
> before that point. It will also be complicated if the app enables shadow stack
> in a signal handler.

Yu-cheng, I'm having a very hard time getting direct answers to my
questions. Could you endeavor to give succinct, direct answers? If you
want to give a longer, conditioned answer, that's great. But, I'd
appreciate if you could please focus first on clearly answering the
questions that I'm asking.

Let me try again:

Is it possible, with the patches in this series, for a single-threaded
binary which has GNU_PROPERTY_X86_FEATURE_1_SHSTK unset to run with
shadow stack protection?

I think the answer is an unambiguous: "No". But I'd like to hear it
from you.
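
For reference, the patch 05 changelog states three conditions for shadow stack to be active: processor support, CONFIG_X86_INTEL_SHADOW_STACK_USER, and an application built for the feature. A minimal model of that rule is sketched below. It is a reading of the changelog, not code from the series and not the author's answer: shstk_enabled_at_exec() and its parameters are hypothetical, and the bit values are those the x86-64 psABI assigns to GNU_PROPERTY_X86_FEATURE_1_IBT and GNU_PROPERTY_X86_FEATURE_1_SHSTK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of the GNU_PROPERTY_X86_FEATURE_1_AND ELF property (x86-64 psABI). */
#define FEATURE_1_IBT	(1U << 0)	/* GNU_PROPERTY_X86_FEATURE_1_IBT */
#define FEATURE_1_SHSTK	(1U << 1)	/* GNU_PROPERTY_X86_FEATURE_1_SHSTK */

/*
 * Model of the exec-time rule in the changelog: all three conditions
 * must hold for the new program to start with a shadow stack.
 */
static bool shstk_enabled_at_exec(bool cpu_has_shstk,
				  bool kernel_config_on,
				  uint32_t elf_feature_1)
{
	return cpu_has_shstk && kernel_config_on &&
	       (elf_feature_1 & FEATURE_1_SHSTK) != 0;
}
```

Under this model a binary with the SHSTK bit unset never starts with a shadow stack, whatever the CPU and kernel support; whether it could gain one later is exactly the question being asked here.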

>> I think you're saying that the CET-enabled binary would do
>> arch_setup_elf_property() when it was first exec()'d. Later, it could
>> use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
>> then fork() and the child would not be using CET. Right?
>>
>> What is ARCH_X86_CET_DISABLE used for, anyway?
>
> Both the parent and the child can do ARCH_X86_CET_DISABLE, if CET is
> not locked.

Could you please describe a real-world example of why
ARCH_X86_CET_DISABLE exists? What kinds of apps will use it, or *are*
using it? Why was it created in the first place?

>>> The JIT examples I mentioned previously run with CET enabled from the
>>> beginning. Do you have a reason to do this? In other words, if the JIT code
>>> needs CET, the app could have started with CET in the first place.
>>
>> Let's say I have a JIT'd sandbox. I want the sandbox to be
>> CET-protected, but the JIT engine itself not to be.
>
> I do not have any objections to this use case, but it needs some cautions as
> stated above. It will be much easier and cleaner if the sandbox is in a
> separate exec'ed task with CET on.

OK, great suggestion! Could you do some research and look at the
various sandboxing techniques? Is imposing this requirement for a
separate exec'd task reasonable? Does it fit nicely with their existing
models? How about the Chrome browser and Firefox sandboxes?

>>>> Does this *code* work? Could you please indicate which JITs have been
>>>> enabled to use the code in this series? How much of the new ABI is in use?
>>>
>>> JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
>>> frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
>>> are tested and in the master branch. Sljit fixes are in the release.
>>
>> Huh, so who is using the new prctl() ABIs?
>
> Any code can use the ABI, but the CET-enabling parts of JIT code mostly do not
> use these new prctl()s, except perhaps to get CET status.

Which applications specifically are going to use the new prctl()s which
this series adds? How are they going to use them?

"Any code can use them" is not a specific enough answer.

>>>> Where are the selftests/ for this new ABI? Were you planning on
>>>> submitting any with this series?
>>>
>>> The ABI is more related to the application side, and therefore most suitable for
>>> GLIBC unit tests.
>>
>> I was mostly concerned with the kernel selftests. The things in
>> tools/testing/selftests/x86 in the kernel tree.
>
> I have run them with CET enabled. All of them pass, except for the following:
> Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
> address. This is understandable.

That is not what I meant. I'm going to be as explicit as I can:

I expect you to create a test case which you will submit with these
patches and the test case will go into the tools/testing/selftests/x86
directory in the kernel tree. This test case will exercise the kernel
functionality added in this series, especially the new prctl()s.

On a separate topic: You ran the selftests and one failed. This is a
*MASSIVE* warning sign. It should minimally be described in your cover
letter, and accompanied by a fix to the test case. It is absolutely
unacceptable to introduce a kernel feature that causes a test to fail.
You must either fix your kernel feature or fix the test.

This code cannot be accepted until this selftests issue is rectified.

>>> The more complicated areas such as pthreads, signals, ucontext,
>>> fork() are all included there. I have been constantly running these
>>> tests without any problems. I can provide more details if testing is
>>> the concern.
>>
>> For something this complicated, with new kernel ABIs, we need an
>> in-kernel sefltest.
>>
>> MPX was not that much different from this feature. It required a
>> boatload of compiler and linker changes to function. Yet, there was a
>> simple in-kernel test for it that didn't require *any* of that big pile
>> of toolchain bits.
>>
>> Is there a reason we don't have one of those for CET?
>
> I have a quick test that checks shadow stack and ibt in both main program and in
> signals. Currently it is public on Github. If that is desired, I can submit it
> to the mailing list.

Yes, that is desired. It must accompany this submission. It must also
exercise all of the new ABIs.

2020-05-16 00:16:02

by Andrew Cooper

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 15/05/2020 23:43, Dave Hansen wrote:
> On 5/15/20 2:33 PM, Yu-cheng Yu wrote:
>> On Fri, 2020-05-15 at 11:39 -0700, Dave Hansen wrote:
>>> On 5/12/20 4:20 PM, Yu-cheng Yu wrote:
>>> Can a binary compiled with CET run without CET?
>> Yes, but a few details:
>>
>> - The shadow stack is transparent to the application. A CET application does
>> not have anything different from a non-CET application. However, if a CET
>> application uses any CET instructions (e.g. INCSSP), it must first check if CET
>> is turned on.
>> - If an application is compiled for IBT, the compiler inserts ENDBRs at branch
>> targets. These are nops if IBT is not on.
> I appreciate the detailed response, but it wasn't quite what I was
> asking. Let's ignore IBT for now and just talk about shadow stacks.
>
> An app compiled with the new ELF flags and running on a CET-enabled
> kernel and CPU will start off with shadow stacks allocated and enabled,
> right? It can turn its shadow stack off per-thread with the new prctl.
> But, otherwise, it's stuck, the only way to turn shadow stacks off at
> startup would be editing the binary.
>
> Basically, if there ends up being a bug in an app that violates the
> shadow stack rules, the app is broken, period. The only recourse is to
> have the kernel disable CET and reboot.
>
> Is that right?

If I may interject with the experience of having got supervisor shadow
stacks working for Xen.

Turning shadow stacks off is quite easy - clear MSR_U_CET.SHSTK_EN and
the shadow stack will stay in whatever state it was in, and you can
largely forget about it.  (Of course, in a sandbox scenario, it would be
prudent to prevent the ability to disable shadow stacks.)

Turning shadow stacks on is much more tricky.  You cannot enable it in
any function you intend to return from, as the divergence between the
stack and shadow stack will constitute a control flow violation.


When it comes to binaries,  you can reasonably arrange for clone() to
start a thread on a new stack/shstk, as you can prepare both stacks
suitably before execution starts.

You cannot reasonably implement a system call for "turn shadow stacks on
for me", because you'll crash on the ret out of the VDSO from the system
call.  It would be possible to conceive of an exec()-like system call
which is "discard my current stack, turn on shstk, and start me on this
new stack/shstk".

In principle, with a pair of system calls to atomically manage the shstk
settings and stack switching, it might be possible to construct a
`run_with_shstk_enabled(func, stack, shstk)` API which executes in the
current thread's context and doesn't explode.

Fork() is a problem when shadow stacks are disabled in the parent.  The
moment shadow stacks are disabled, the regular stack diverges from the
shadow stack.  A CET-enabled app which turns off shstk and then fork()'s
must have the child inherit the shstk-off property.  If the child were
to start with shstk enabled, it would explode almost immediately due to
the parent's stack divergence which it inherited.
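
The enable and fork() hazards described above can be illustrated with a toy
model. This is only a sketch: `struct toy_cpu`, `toy_call()` and `toy_ret()`
are made-up names that imitate how CALL/RET interact with a shadow stack.
They are not kernel or hardware interfaces, and the return addresses are
arbitrary numbers.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_DEPTH 16

/* Hypothetical model of one thread: a regular stack, a shadow stack,
 * and a flag standing in for the SHSTK_EN control bit. */
struct toy_cpu {
    unsigned long stack[TOY_DEPTH];
    unsigned long shstk[TOY_DEPTH];
    size_t sp;       /* entries on the regular stack */
    size_t ssp;      /* entries on the shadow stack */
    bool shstk_en;
};

/* CALL: always pushes the return address on the regular stack, and also
 * on the shadow stack when shadow stacks are enabled. */
static void toy_call(struct toy_cpu *c, unsigned long ret_addr)
{
    c->stack[c->sp++] = ret_addr;
    if (c->shstk_en)
        c->shstk[c->ssp++] = ret_addr;
}

/* RET: returns true if the return is allowed, false where real hardware
 * would raise a control-protection fault. */
static bool toy_ret(struct toy_cpu *c)
{
    unsigned long ret_addr = c->stack[--c->sp];

    if (!c->shstk_en)
        return true;
    if (c->ssp == 0)
        return false;            /* nothing recorded for this frame */
    return c->shstk[--c->ssp] == ret_addr;
}

/* Shadow stacks get enabled inside a function that later returns.
 * *inner_ok reports the inner return, the result reports the outer one. */
static bool toy_enable_then_return(bool *inner_ok)
{
    struct toy_cpu c = { .shstk_en = false };

    toy_call(&c, 0x1000);        /* enter outer function, shstk off */
    c.shstk_en = true;           /* "turn shadow stacks on for me" */
    toy_call(&c, 0x2000);        /* inner call, tracked on both stacks */
    *inner_ok = toy_ret(&c);     /* matches: allowed */
    return toy_ret(&c);          /* outer frame was never recorded */
}

/* Parent disables shadow stacks, calls deeper, and a child then starts
 * with shadow stacks enabled on the inherited, diverged stacks. */
static bool toy_child_reenables_after_disable(void)
{
    struct toy_cpu c = { .shstk_en = true };

    toy_call(&c, 0x1000);        /* tracked on both stacks */
    c.shstk_en = false;          /* parent turns shstk off */
    toy_call(&c, 0x2000);        /* only on the regular stack */
    c.shstk_en = true;           /* child starts with shstk on */
    return toy_ret(&c);          /* 0x2000 vs. 0x1000: mismatch */
}
```

In this model the inner return in `toy_enable_then_return()` is allowed while
the outer one is not, and `toy_child_reenables_after_disable()` faults on its
first return, which matches the behaviour described in the text.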


Finally, seeing as the question was asked but not answered, it is
actually quite easy to figure out whether shadow stacks are enabled in
the current thread.

    mov     $1, %eax
    rdsspd  %eax
    cmp     $1, %eax
    je      no_shstk
            ...
no_shstk:

rdssp is allocated from the hint nop encoding space, and the minimum
alignment of the shadow stack pointer is 4.  On older parts, or with
shstk disabled (either at the system level, or for the thread), the $1
will be preserved in %eax, while if CET is active, it will be clobbered
with something that has the bottom two bits clear.

It turns out this is a lifesaver for codepaths (e.g. the NMI handler)
which need to use other CET instructions which aren't from the hint nop
space, and run before the BSP can set everything up.
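
For user space, the same yes/no check could be wrapped in C roughly as
follows. This is a sketch under assumptions: `shstk_enabled()` is a
hypothetical helper name, the build is 64-bit x86 with GCC or Clang inline
assembly, and the instruction is emitted as raw bytes (the encoding of
`rdsspq %rax`) so that assemblers without CET support still accept it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a shadow stack is active in the
 * calling thread, using the sentinel technique described above. */
static bool shstk_enabled(void)
{
#if defined(__x86_64__)
    uint64_t ssp = 1;    /* sentinel: a real SSP is at least 4-byte aligned */

    /* f3 48 0f 1e c8 is rdsspq %rax. It executes as a nop, leaving the
     * sentinel in place, on older parts or when shstk is disabled. */
    __asm__ volatile(".byte 0xf3, 0x48, 0x0f, 0x1e, 0xc8" : "+a"(ssp));
    return ssp != 1;
#else
    return false;        /* not x86-64: no CET shadow stack to detect */
#endif
}
```

The operand is tied to %rax through the "+a" constraint because the raw
encoding hard-codes that register.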

~Andrew

2020-05-16 02:41:27

by H.J. Lu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Fri, May 15, 2020 at 5:13 PM Andrew Cooper wrote:
>
> On 15/05/2020 23:43, Dave Hansen wrote:
> > On 5/15/20 2:33 PM, Yu-cheng Yu wrote:
> >> On Fri, 2020-05-15 at 11:39 -0700, Dave Hansen wrote:
> >>> On 5/12/20 4:20 PM, Yu-cheng Yu wrote:
> >>> Can a binary compiled with CET run without CET?
> >> Yes, but a few details:
> >>
> >> - The shadow stack is transparent to the application. A CET application does
> >> not have anything different from a non-CET application. However, if a CET
> >> application uses any CET instructions (e.g. INCSSP), it must first check if CET
> >> is turned on.
> >> - If an application is compiled for IBT, the compiler inserts ENDBRs at branch
> >> targets. These are nops if IBT is not on.
> > I appreciate the detailed response, but it wasn't quite what I was
> > asking. Let's ignore IBT for now and just talk about shadow stacks.
> >
> > An app compiled with the new ELF flags and running on a CET-enabled
> > kernel and CPU will start off with shadow stacks allocated and enabled,
> > right? It can turn its shadow stack off per-thread with the new prctl.
> > But, otherwise, it's stuck, the only way to turn shadow stacks off at
> > startup would be editing the binary.
> >
> > Basically, if there ends up being a bug in an app that violates the
> > shadow stack rules, the app is broken, period. The only recourse is to
> > have the kernel disable CET and reboot.
> >
> > Is that right?
>
> If I may interject with the experience of having got supervisor shadow
> stacks working for Xen.
>
> Turning shadow stacks off is quite easy - clear MSR_U_CET.SHSTK_EN and
> the shadow stack will stay in whatever state it was in, and you can
> largely forget about it. (Of course, in a sandbox scenario, it would be
> prudent to prevent the ability to disable shadow stacks.)
>
> Turning shadow stacks on is much more tricky. You cannot enable it in
> any function you intend to return from, as the divergence between the
> stack and shadow stack will constitute a control flow violation.
>
>
> When it comes to binaries, you can reasonably arrange for clone() to
> start a thread on a new stack/shstk, as you can prepare both stacks
> suitably before execution starts.
>
> You cannot reasonably implement a system call for "turn shadow stacks on
> for me", because you'll crash on the ret out of the VDSO from the system
> call. It would be possible to conceive of an exec()-like system call
> which is "discard my current stack, turn on shstk, and start me on this
> new stack/shstk".
>
> In principle, with a pair of system calls to atomically manage the ststk
> settings and stack switching, it might possible to construct a
> `run_with_shstk_enabled(func, stack, shstk)` API which executes in the
> current threads context and doesn't explode.
>
> Fork() is a problem when shadow stacks are disabled in the parent. The
> moment shadow stacks are disabled, the regular stack diverges from the
> shadow stack. A CET-enabled app which turns off shstk and then fork()'s
> must have the child inherit the shstk-off property. If the child were
> to start with shstk enabled, it would explode almost immediately due to
> the parent's stack divergence which it inherited.
>
>
> Finally seeing as the question was asked but not answered, it is
> actually quite easy to figure out whether shadow stacks are enabled in
> the current thread.
>
> mov $1, %eax
> rdsspd %eax

This is for 32-bit mode. I use

/* Check if shadow stack is in use. */
xorl %esi, %esi
rdsspq %rsi
testq %rsi, %rsi
/* Normal return if shadow stack isn't in use. */
je L(no_shstk)

> cmp $1, %eax
> je no_shstk
> ...
> no_shsk:
>
> rdssp is allocated from the hint nop encoding space, and the minimum
> alignment of the shadow stack pointer is 4. On older parts, or with
> shstk disabled (either at the system level, or for the thread), the $1
> will be preserved in %eax, while if CET is active, it will be clobbered
> with something that has the bottom two bits clear.
>
> It turns out this is a lifesaver for codepaths (e.g. the NMI handler)
> which need to use other CET instructions which aren't from the hint nop
> space, and run before the BSP can set everything up.
>
> ~Andrew



--
H.J.

2020-05-16 02:54:12

by H.J. Lu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Fri, May 15, 2020 at 4:56 PM Dave Hansen wrote:
>
> On 5/15/20 4:29 PM, Yu-cheng Yu wrote:
> > On Fri, 2020-05-15 at 15:43 -0700, Dave Hansen wrote:
> >> Basically, if there ends up being a bug in an app that violates the
> >> shadow stack rules, the app is broken, period. The only recourse is to
> >> have the kernel disable CET and reboot.
> >>
> >> Is that right?
> >
> > You must be talking about init or any of the system daemons, right?
> > Assuming we let the app itself start CET with an arch_prctl(), why would that be
> > different from the current approach?
>
> You're getting ahead of me a bit here.
>
> I'm actually not asking directly about the prctls() or advocating for a
> different approach. The MPX approach of _requiring the app to make a
> prctl() was actually pretty nasty because sometimes threads got created
> before the prctl() could get called. Apps ended up inadvertently
> half-MPX-enabled. Not fun.
>
> Let's say we have an app doing silly things like retpolines. (Lots of
> app do lots of silly things). It gets compiled in a distro but never
> runs on a system with CET. The app gets run for the first time on a
> system with CET. App goes boom. Not init, just some random app, say
> /usr/bin/ldapsearch.

I designed and implemented the CET toolchain and run-time in such a way
that this is very difficult to hit. Basically, CET won't be enabled on such
an app.

> What's my recourse as an end user? I want to run my app and turn off
> CET for that app. How can I do that?

The CET OS I designed turns CET off for you and you don't have to do
anything.

> >>>> Can a binary compiled without CET run CET-enabled code?
> >>>
> >>> Partially yes, but in reality somewhat difficult.
> >> ...
> >>> - If a not-CET application does fork(), and the child wants to turn on CET, it
> >>> would be difficult to manage the stack frames, unless the child knows what is is
> >>> doing.
> >>
> >> It might be hard to do, but it is possible with the patches you posted?
> >
> > It is possible to add an arch_prctl() to turn on CET. That is simple from the
> > kernel's perspective, but difficult for the application. Once the app enables
> > shadow stack, it has to take care not to return beyond the function call layers
> > before that point. It can no longer do longjmp or ucontext swaps to anything
> > before that point. It will also be complicated if the app enables shadow stack
> > in a signal handler.
>
> Yu-cheng, I'm having a very hard time getting direct answers to my
> questions. Could you endeavor to give succinct, direct answers? If you
> want to give a longer, conditioned answer, that's great. But, I'd
> appreciate if you could please focus first on clearly answering the
> questions that I'm asking.
>
> Let me try again:
>
> Is it possible with the patches in this series to run a single-
> threaded binary which was has GNU_PROPERTY_X86_FEATURE_1_SHSTK
> unset to run with shadow stack protection?

Yes, you can. I added such capabilities for testing purposes. But your
application will crash as soon as there is a CET violation. My CET software
design is very flexible. It can accommodate different requirements. We are
working with 2 OSVs to enable CET in their OSes. So far we haven't run into
any unexpected issues.

> I think the answer is an unambiguous: "No". But I'd like to hear it
> from you.
>
> >> I think you're saying that the CET-enabled binary would do
> >> arch_setup_elf_property() when it was first exec()'d. Later, it could
> >> use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
> >> then fork() and the child would not be using CET. Right?
> >>
> >> What is ARCH_X86_CET_DISABLE used for, anyway?
> >
> > Both the parent and the child can do ARCH_X86_CET_DISABLE, if CET is
> > not locked.
>
> Could you please describe a real-world example of why
> ARCH_X86_CET_DISABLE exists? What kinds of apps will use it, or *are*
> using it? Why was it created in the first place?
>
> >>> The JIT examples I mentioned previously run with CET enabled from the
> >>> beginning. Do you have a reason to do this? In other words, if the JIT code
> >>> needs CET, the app could have started with CET in the first place.
> >>
> >> Let's say I have a JIT'd sandbox. I want the sandbox to be
> >> CET-protected, but the JIT engine itself not to be.
> >
> > I do not have any objections to this use case, but it needs some cautions as
> > stated above. It will be much easier and cleaner if the sandbox is in a
> > separate exec'ed task with CET on.
>
> OK, great suggestion! Could you do some research and look at the
> various sandboxing techniques? Is imposing this requirement for a
> separate exec'd task reasonable? Does it fit nicely with their existing
> models? How about the Chrome browser and Firefox sandboxs?
>
> >>>> Does this *code* work? Could you please indicate which JITs have been
> >>>> enabled to use the code in this series? How much of the new ABI is in use?
> >>>
> >>> JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
> >>> frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
> >>> are tested and in the master branch. Sljit fixes are in the release.
> >>
> >> Huh, so who is using the new prctl() ABIs?
> >
> > Any code can use the ABI, but JIT code CET-enabling part mostly do not use these
> > new prctl()'s, except, probably to get CET status.
>
> Which applications specifically are going to use the new prctl()s which
> this series adds? How are they going to use them?
>
> "Any code can use them" is not a specific enough answer.
>
> >>>> Where are the selftests/ for this new ABI? Were you planning on
> >>>> submitting any with this series?
> >>>
> >>> The ABI is more related to the application side, and therefore most suitable for
> >>> GLIBC unit tests.
> >>
> >> I was mostly concerned with the kernel selftests. The things in
> >> tools/testing/selftests/x86 in the kernel tree.
> >
> > I have run them with CET enabled. All of them pass, except for the following:
> > Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
> > address. This is understandable.
>
> That is not what I meant. I'm going to be as explicit:
>
> I expect you to create a test case which you will submit with these
> patches and the test case will go into the tools/testing/selftests/x86
> directory in the kernel tree. This test case will exercise the kernel
> functionality added in this series, especially the new prctl()s.
>
> One a separate topic: You ran the selftests and one failed. This is a
> *MASSIVE* warning sign. It should minimally be described in your cover
> letter, and accompanied by a fix to the test case. It is absolutely
> unacceptable to introduce a kernel feature that causes a test to fail.
> You must either fix your kernel feature or you fix the test.
>
> This code can not be accepted until this selftests issue is rectified.
>
> >>> The more complicated areas such as pthreads, signals, ucontext,
> >>> fork() are all included there. I have been constantly running these
> >>> tests without any problems. I can provide more details if testing is
> >>> the concern.
> >>
> >> For something this complicated, with new kernel ABIs, we need an
> >> in-kernel sefltest.
> >>
> >> MPX was not that much different from this feature. It required a
> >> boatload of compiler and linker changes to function. Yet, there was a
> >> simple in-kernel test for it that didn't require *any* of that big pile
> >> of toolchain bits.
> >>
> >> Is there a reason we don't have one of those for CET?
> >
> > I have a quick test that checks shadow stack and ibt in both main program and in
> > signals. Currently it is public on Github. If that is desired, I can submit it
> > to the mailing list.
>
> Yes, that is desired. It must accompany this submission. It must also
> exercise all of the new ABIs.

Our CET smoke test is for quick validation of a CET OS, not just the kernel.
It requires the complete CET implementation. It does nothing if your OS isn't
CET enabled.

--
H.J.

2020-05-16 02:57:17

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
> On 5/15/20 4:29 PM, Yu-cheng Yu wrote:
> > On Fri, 2020-05-15 at 15:43 -0700, Dave Hansen wrote:
> > > Basically, if there ends up being a bug in an app that violates the
> > > shadow stack rules, the app is broken, period. The only recourse is to
> > > have the kernel disable CET and reboot.
> > >
> > > Is that right?
> >
> > You must be talking about init or any of the system daemons, right?
> > Assuming we let the app itself start CET with an arch_prctl(), why would that be
> > different from the current approach?
>
> You're getting ahead of me a bit here.
>
> I'm actually not asking directly about the prctls() or advocating for a
> different approach. The MPX approach of _requiring the app to make a
> prctl() was actually pretty nasty because sometimes threads got created
> before the prctl() could get called. Apps ended up inadvertently
> half-MPX-enabled. Not fun.
>
> Let's say we have an app doing silly things like retpolines. (Lots of
> app do lots of silly things). It gets compiled in a distro but never
> runs on a system with CET. The app gets run for the first time on a
> system with CET. App goes boom. Not init, just some random app, say
> /usr/bin/ldapsearch.
>
> What's my recourse as an end user? I want to run my app and turn off
> CET for that app. How can I do that?

GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT

> > > > > Can a binary compiled without CET run CET-enabled code?
> > > >
> > > > Partially yes, but in reality somewhat difficult.
> > > ...
> > > > - If a not-CET application does fork(), and the child wants to turn on CET, it
> > > > would be difficult to manage the stack frames, unless the child knows what is is
> > > > doing.
> > >
> > > It might be hard to do, but it is possible with the patches you posted?
> >
> > It is possible to add an arch_prctl() to turn on CET. That is simple from the
> > kernel's perspective, but difficult for the application. Once the app enables
> > shadow stack, it has to take care not to return beyond the function call layers
> > before that point. It can no longer do longjmp or ucontext swaps to anything
> > before that point. It will also be complicated if the app enables shadow stack
> > in a signal handler.
>
> Yu-cheng, I'm having a very hard time getting direct answers to my
> questions. Could you endeavor to give succinct, direct answers? If you
> want to give a longer, conditioned answer, that's great. But, I'd
> appreciate if you could please focus first on clearly answering the
> questions that I'm asking.
>
> Let me try again:
>
> Is it possible with the patches in this series to run a single-
> threaded binary which was has GNU_PROPERTY_X86_FEATURE_1_SHSTK
> unset to run with shadow stack protection?
>
> I think the answer is an unambiguous: "No". But I'd like to hear it
> from you.

No!

> > > I think you're saying that the CET-enabled binary would do
> > > arch_setup_elf_property() when it was first exec()'d. Later, it could
> > > use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
> > > then fork() and the child would not be using CET. Right?
> > >
> > > What is ARCH_X86_CET_DISABLE used for, anyway?
> >
> > Both the parent and the child can do ARCH_X86_CET_DISABLE, if CET is
> > not locked.
>
> Could you please describe a real-world example of why
> ARCH_X86_CET_DISABLE exists? What kinds of apps will use it, or *are*
> using it? Why was it created in the first place?

Currently, ld-linux turns off CET if the binary being loaded does not support
CET.

> > > > The JIT examples I mentioned previously run with CET enabled from the
> > > > beginning. Do you have a reason to do this? In other words, if the JIT code
> > > > needs CET, the app could have started with CET in the first place.
> > >
> > > Let's say I have a JIT'd sandbox. I want the sandbox to be
> > > CET-protected, but the JIT engine itself not to be.
> >
> > I do not have any objections to this use case, but it needs some cautions as
> > stated above. It will be much easier and cleaner if the sandbox is in a
> > separate exec'ed task with CET on.
>
> OK, great suggestion! Could you do some research and look at the
> various sandboxing techniques? Is imposing this requirement for a
> separate exec'd task reasonable? Does it fit nicely with their existing
> models? How about the Chrome browser and Firefox sandboxs?

I will check.

> > > > > Does this *code* work? Could you please indicate which JITs have been
> > > > > enabled to use the code in this series? How much of the new ABI is in use?
> > > >
> > > > JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
> > > > frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
> > > > are tested and in the master branch. Sljit fixes are in the release.
> > >
> > > Huh, so who is using the new prctl() ABIs?
> >
> > Any code can use the ABI, but JIT code CET-enabling part mostly do not use these
> > new prctl()'s, except, probably to get CET status.
>
> Which applications specifically are going to use the new prctl()s which
> this series adds? How are they going to use them?
>
> "Any code can use them" is not a specific enough answer.

We have four arch_prctl() calls. ARCH_X86_CET_DISABLE and ARCH_X86_CET_LOCK are
used by ld-linux. ARCH_X86_CET_STATUS is used in many places to determine if
CET is on. ARCH_X86_CET_ALLOC_SHSTK is used in ucontext related handling, but
it can be used by any application to switch shadow stacks.
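
As an illustration of the status query, a caller might look roughly like the
sketch below. Everything specific in it is an assumption rather than
something stated in this thread: the request number 0x3001, the three-word
result layout (feature bits, shadow stack base, shadow stack size), and the
names `struct cet_status` and `cet_get_status()` are placeholders. On a
kernel without this series the call is expected to simply fail.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* ASSUMPTION: request number as proposed by this series, not a merged ABI. */
#define ARCH_X86_CET_STATUS 0x3001

/* ASSUMPTION: result layout; hypothetical struct name. */
struct cet_status {
    uint64_t features;      /* which CET features are on */
    uint64_t shstk_base;    /* shadow stack base address */
    uint64_t shstk_size;    /* shadow stack size in bytes */
};

/* Hypothetical wrapper: returns 0 on success or a negative errno value. */
static int cet_get_status(struct cet_status *st)
{
#if defined(__x86_64__) && defined(SYS_arch_prctl)
    if (syscall(SYS_arch_prctl, ARCH_X86_CET_STATUS, st) == 0)
        return 0;
    return -errno;          /* e.g. -EINVAL on kernels without the series */
#else
    (void)st;
    return -ENOSYS;         /* arch_prctl() is not available here */
#endif
}
```

A caller would treat any negative return as "CET status unknown" and fall
back to behaving as if CET were off.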

> > > > > Where are the selftests/ for this new ABI? Were you planning on
> > > > > submitting any with this series?
> > > >
> > > > The ABI is more related to the application side, and therefore most suitable for
> > > > GLIBC unit tests.
> > >
> > > I was mostly concerned with the kernel selftests. The things in
> > > tools/testing/selftests/x86 in the kernel tree.
> >
> > I have run them with CET enabled. All of them pass, except for the following:
> > Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
> > address. This is understandable.
>
> That is not what I meant. I'm going to be as explicit:
>
> I expect you to create a test case which you will submit with these
> patches and the test case will go into the tools/testing/selftests/x86
> directory in the kernel tree. This test case will exercise the kernel
> functionality added in this series, especially the new prctl()s.

I will submit the test case as a separate patch in response to this discussion,
and combine it with the series when the discussion concludes.

> One a separate topic: You ran the selftests and one failed. This is a
> *MASSIVE* warning sign. It should minimally be described in your cover
> letter, and accompanied by a fix to the test case. It is absolutely
> unacceptable to introduce a kernel feature that causes a test to fail.
> You must either fix your kernel feature or you fix the test.
>
> This code can not be accepted until this selftests issue is rectified.

Sure, I will do that.

>
> > > > The more complicated areas such as pthreads, signals, ucontext,
> > > > fork() are all included there. I have been constantly running these
> > > > tests without any problems. I can provide more details if testing is
> > > > the concern.
> > >
> > > For something this complicated, with new kernel ABIs, we need an
> > > in-kernel sefltest.
> > >
> > > MPX was not that much different from this feature. It required a
> > > boatload of compiler and linker changes to function. Yet, there was a
> > > simple in-kernel test for it that didn't require *any* of that big pile
> > > of toolchain bits.
> > >
> > > Is there a reason we don't have one of those for CET?
> >
> > I have a quick test that checks shadow stack and ibt in both main program and in
> > signals. Currently it is public on Github. If that is desired, I can submit it
> > to the mailing list.
>
> Yes, that is desired. It must accompany this submission. It must also
> exercise all of the new ABIs.

Ok.

Yu-cheng

2020-05-16 14:12:05

by Andrew Cooper

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 16/05/2020 03:37, H.J. Lu wrote:
> On Fri, May 15, 2020 at 5:13 PM Andrew Cooper wrote:
>> Finally seeing as the question was asked but not answered, it is
>> actually quite easy to figure out whether shadow stacks are enabled in
>> the current thread.
>>
>> mov $1, %eax
>> rdsspd %eax
> This is for 32-bit mode.

It actually works for both, if all you need is a shstk yes/no check.

Usually, you also want SSP in the yes case, so substitute rdsspq %rax as
appropriate.

(On a tangent - binutils mandating the D/Q suffixes is very irritating
with mixed 32/64bit code because you have to #ifdef your instructions
despite the register operands being totally unambiguous.  Also, D is the
wrong suffix for AT&T syntax, and should be L.  Frankly - the Intel
manuals are wrong and should not have the operand size suffix included
in the opcode name, as they are inconsistent with all the other
instructions in this regard.)

> I use
>
> /* Check if shadow stack is in use. */
> xorl %esi, %esi
> rdsspq %rsi
> testq %rsi, %rsi
> /* Normal return if shadow stack isn't in use. */
> je L(no_shstk)

This is probably fine for user code, as I don't think it would be
legitimate for shstk to be enabled, with SSP being 0.

Sadly, the same is not true for kernel shadow stacks.

SSP is 0 after SYSCALL, SYSENTER and CLRSSBSY, and you've got to be
careful to re-establish the shadow stack before a CALL, interrupt or
exception tries pushing a word onto the shadow stack at 0xfffffffffffffff8.

It is a very good (lucky?) thing that frame is unmapped for other
reasons, because this corner case does not protect against multiple
threads/cores using the same shadow stack concurrently.

~Andrew

2020-05-17 23:11:34

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 5/15/20 7:51 PM, H.J. Lu wrote:
> On Fri, May 15, 2020 at 4:56 PM Dave Hansen wrote:
>> Let's say we have an app doing silly things like retpolines. (Lots of
>> app do lots of silly things). It gets compiled in a distro but never
>> runs on a system with CET. The app gets run for the first time on a
>> system with CET. App goes boom. Not init, just some random app, say
>> /usr/bin/ldapsearch.
>
> I designed and implemented CET toolchain and run-time in such a way
> for it very difficult to happen. Basically, CET won't be enabled on such
> an app.

Would you care to share any specifics about how this is implemented?
That would be great information to include in the kernel documentation
because it informs us about the reasons why we don't need a kernel-based
"kill switch".

>> What's my recourse as an end user? I want to run my app and turn off
>> CET for that app. How can I do that?
>
> The CET OS I designed turns CET off for you and you don't have to do
> anything.

OK, cool! Could you share some of the specifics about how it does that?

>> Is it possible with the patches in this series to run a single-
>> threaded binary which was has GNU_PROPERTY_X86_FEATURE_1_SHSTK
>> unset to run with shadow stack protection?
>
> Yes, you can. I added such capabilities for testing purpose. But
> you application will crash as soon as there is a CET violation. My
> CET software design is very flexible.

Yu-cheng specifically referred to the:

GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT

option. Is that the option you're talking about?

>>> I have a quick test that checks shadow stack and ibt in both main program and in
>>> signals. Currently it is public on Github. If that is desired, I can submit it
>>> to the mailing list.
>>
>> Yes, that is desired. It must accompany this submission. It must also
>> exercise all of the new ABIs.
>
> Our CET smoke test is for quick validation of CET OS, not just
> kernel. It requires the complete CET implementation. It does
> nothing if your OS isn't CET enabled.

I think requiring the complete CET implementation to be present for this
test to work is a mistake. We don't require anything other than an
enabled kernel and the selftests that ship with that kernel.

MPX required toolchain, library and compiler changes. Yet, we had a
totally standalone kernel test that found real bugs. It sounds like
this smoke test as it stands wouldn't be a great fit. But, that
shouldn't discourage us from finding something that _is_ a good fit for
the kernel-shipped selftests.

2020-05-18 13:43:20

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 5/15/20 7:53 PM, Yu-cheng Yu wrote:
> On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
>> What's my recourse as an end user? I want to run my app and turn off
>> CET for that app. How can I do that?
>
> GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT

Like I mentioned to H.J., this is something that we need to at least
acknowledge the existence of in the changelog and probably even the
Documentation/.

>>>> I think you're saying that the CET-enabled binary would do
>>>> arch_setup_elf_property() when it was first exec()'d. Later, it could
>>>> use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
>>>> then fork() and the child would not be using CET. Right?
>>>>
>>>> What is ARCH_X86_CET_DISABLE used for, anyway?
>>>
>>> Both the parent and the child can do ARCH_X86_CET_DISABLE, if CET is
>>> not locked.
>>
>> Could you please describe a real-world example of why
>> ARCH_X86_CET_DISABLE exists? What kinds of apps will use it, or *are*
>> using it? Why was it created in the first place?
>
> Currently, ld-linux turns off CET if the binary being loaded does not support
> CET.

Great! Could this please be immortalized in the documentation for the
prctl()?

>>>>>> Does this *code* work? Could you please indicate which JITs have been
>>>>>> enabled to use the code in this series? How much of the new ABI is in use?
>>>>>
>>>>> JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
>>>>> frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
>>>>> are tested and in the master branch. Sljit fixes are in the release.
>>>>
>>>> Huh, so who is using the new prctl() ABIs?
>>>
>>> Any code can use the ABI, but JIT code CET-enabling part mostly do not use these
>>> new prctl()'s, except, probably to get CET status.
>>
>> Which applications specifically are going to use the new prctl()s which
>> this series adds? How are they going to use them?
>>
>> "Any code can use them" is not a specific enough answer.
>
> We have four arch_prctl() calls. ARCH_X86_CET_DISABLE and ARCH_X86_CET_LOCK are
> used by ld-linux. ARCH_X86_CET_STATUS is used in many places to determine if
> CET is on. ARCH_X86_CET_ALLOC_SHSTK is used in ucontext related handling, but
> it can be used by any application to switch shadow stacks.

Could some of this information be added to the documentation, please?
It would also be nice to have some more details about how apps end up
using ARCH_X86_CET_STATUS. Why would they care that CET is on?
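
To make the ARCH_X86_CET_STATUS question concrete, here is a minimal
userspace sketch of how a caller might interpret the three u64 words the
patch description says the call returns (status bits, shadow stack base,
shadow stack size). The struct and helper names are hypothetical and not
part of the series; the bit values are assumed to match the
GNU_PROPERTY_X86_FEATURE_1_* definitions in the psABI. It only decodes a
buffer and does not issue the syscall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bits, assumed to match GNU_PROPERTY_X86_FEATURE_1_AND in the psABI. */
#define CET_FEATURE_IBT   (1ULL << 0)
#define CET_FEATURE_SHSTK (1ULL << 1)

/*
 * Hypothetical decoded form of the three u64 words described for
 * ARCH_X86_CET_STATUS: status bits, shadow stack base, shadow stack size.
 */
struct cet_info {
    bool ibt;
    bool shstk;
    uint64_t shstk_base;
    uint64_t shstk_size;
};

/* Split the raw status buffer into named fields. */
static struct cet_info cet_decode_status(const uint64_t buf[3])
{
    struct cet_info info;

    assert(buf);
    info.ibt = (buf[0] & CET_FEATURE_IBT) != 0;
    info.shstk = (buf[0] & CET_FEATURE_SHSTK) != 0;
    info.shstk_base = buf[1];
    info.shstk_size = buf[2];
    return info;
}

/* True if addr lies inside the reported shadow stack range. */
static bool cet_addr_on_shstk(const struct cet_info *info, uint64_t addr)
{
    /* Unsigned subtraction wraps for addr < base, so one compare suffices. */
    return info->shstk && (addr - info->shstk_base) < info->shstk_size;
}
```

A context-switching library (the ucontext case mentioned above) is one
plausible consumer: it would need to know whether a shadow stack exists
before trying to switch to another one.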

2020-05-18 14:07:41

by H.J. Lu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Mon, May 18, 2020 at 6:41 AM Dave Hansen <[email protected]> wrote:
>
> On 5/15/20 7:53 PM, Yu-cheng Yu wrote:
> > On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
> >> What's my recourse as an end user? I want to run my app and turn off
> >> CET for that app. How can I do that?
> >
> > GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
>
> Like I mentioned to H.J., this is something that we need to at least
> acknowledge the existence of in the changelog and probably even the
> Documentation/.
>
> >>>> I think you're saying that the CET-enabled binary would do
> >>>> arch_setup_elf_property() when it was first exec()'d. Later, it could
> >>>> use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
> >>>> then fork() and the child would not be using CET. Right?
> >>>>
> >>>> What is ARCH_X86_CET_DISABLE used for, anyway?
> >>>
> >>> Both the parent and the child can do ARCH_X86_CET_DISABLE, if CET is
> >>> not locked.
> >>
> >> Could you please describe a real-world example of why
> >> ARCH_X86_CET_DISABLE exists? What kinds of apps will use it, or *are*
> >> using it? Why was it created in the first place?
> >
> > Currently, ld-linux turns off CET if the binary being loaded does not support
> > CET.
>
> Great! Could this please be immortalized in the documentation for the
> prctl()?
>
> >>>>>> Does this *code* work? Could you please indicate which JITs have been
> >>>>>> enabled to use the code in this series? How much of the new ABI is in use?
> >>>>>
> >>>>> JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
> >>>>> frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
> >>>>> are tested and in the master branch. Sljit fixes are in the release.
> >>>>
> >>>> Huh, so who is using the new prctl() ABIs?
> >>>
> >>> Any code can use the ABI, but JIT code CET-enabling part mostly do not use these
> >>> new prctl()'s, except, probably to get CET status.
> >>
> >> Which applications specifically are going to use the new prctl()s which
> >> this series adds? How are they going to use them?
> >>
> >> "Any code can use them" is not a specific enough answer.
> >
> > We have four arch_prctl() calls. ARCH_X86_CET_DISABLE and ARCH_X86_CET_LOCK are
> > used by ld-linux. ARCH_X86_CET_STATUS is used in many places to determine if
> > CET is on. ARCH_X86_CET_ALLOC_SHSTK is used in ucontext related handling, but
> > it can be used by any application to switch shadow stacks.
>
> Could some of this information be added to the documentation, please?
> It would also be nice to have some more details about how apps end up
> using ARCH_X86_CET_STATUS. Why would they care that CET is on?

CET software spec is at

https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/Intel-CET-extension

My CET presentation at 2018 LPC is at

https://www.linuxplumbersconf.org/event/2/contributions/147/attachments/72/83/CET-LPC-2018.pdf

I am working on an updated CET presentation for 2020 LPC. Let me know
if you want to see the early draft.

--
H.J.

2020-05-18 14:25:28

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Mon, 2020-05-18 at 06:41 -0700, Dave Hansen wrote:
> On 5/15/20 7:53 PM, Yu-cheng Yu wrote:
> > On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
> > > What's my recourse as an end user? I want to run my app and turn off
> > > CET for that app. How can I do that?
> >
> > GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK,-IBT
>
> Like I mentioned to H.J., this is something that we need to at least
> acknowledge the existence of in the changelog and probably even the
> Documentation/.

Sure. I will do that.

>
> > > > > I think you're saying that the CET-enabled binary would do
> > > > > arch_setup_elf_property() when it was first exec()'d. Later, it could
> > > > > use the new prctl(ARCH_X86_CET_DISABLE) to disable its shadow stack,
> > > > > then fork() and the child would not be using CET. Right?
> > > > >
> > > > > What is ARCH_X86_CET_DISABLE used for, anyway?
> > > >
> > > > Both the parent and the child can do ARCH_X86_CET_DISABLE, if CET is
> > > > not locked.
> > >
> > > Could you please describe a real-world example of why
> > > ARCH_X86_CET_DISABLE exists? What kinds of apps will use it, or *are*
> > > using it? Why was it created in the first place?
> >
> > Currently, ld-linux turns off CET if the binary being loaded does not support
> > CET.
>
> Great! Could this please be immortalized in the documentation for the
> prctl()?

Yes.

>
> > > > > > > Does this *code* work? Could you please indicate which JITs have been
> > > > > > > enabled to use the code in this series? How much of the new ABI is in use?
> > > > > >
> > > > > > JIT does not necessarily use all of the ABI. The JIT changes mainly fix stack
> > > > > > frames and insert ENDBRs. I do not work on JIT. What I found is LLVM JIT fixes
> > > > > > are tested and in the master branch. Sljit fixes are in the release.
> > > > >
> > > > > Huh, so who is using the new prctl() ABIs?
> > > >
> > > > Any code can use the ABI, but JIT code CET-enabling part mostly do not use these
> > > > new prctl()'s, except, probably to get CET status.
> > >
> > > Which applications specifically are going to use the new prctl()s which
> > > this series adds? How are they going to use them?
> > >
> > > "Any code can use them" is not a specific enough answer.
> >
> > We have four arch_prctl() calls. ARCH_X86_CET_DISABLE and ARCH_X86_CET_LOCK are
> > used by ld-linux. ARCH_X86_CET_STATUS is used in many places to determine if
> > CET is on. ARCH_X86_CET_ALLOC_SHSTK is used in ucontext related handling, but
> > it can be used by any application to switch shadow stacks.
>
> Could some of this information be added to the documentation, please?
> It would also be nice to have some more details about how apps end up
> using ARCH_X86_CET_STATUS. Why would they care that CET is on?

Yes.

Yu-cheng

2020-05-18 14:28:51

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 5/18/20 7:01 AM, H.J. Lu wrote:
>> Could some of this information be added to the documentation, please?
>> It would also be nice to have some more details about how apps end up
>> using ARCH_X86_CET_STATUS. Why would they care that CET is on?
> CET software spec is at
>
> https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/Intel-CET-extension
>
> My CET presentation at 2018 LPC is at
>
> https://www.linuxplumbersconf.org/event/2/contributions/147/attachments/72/83/CET-LPC-2018.pdf
>
> I am working on an updated CET presentation for 2020 LPC. Let me know
> if you want to see the early draft.

There's a lot of great information in there!

However, please remember that presentations are no substitute for old
fashioned documentation in the kernel tree and changelogs. The fact
that we need to lean on them to answer basic questions about new
interfaces is not a great sign.

2020-05-18 23:49:44

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Fri, 2020-05-15 at 19:53 -0700, Yu-cheng Yu wrote:
> On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
> > On 5/15/20 4:29 PM, Yu-cheng Yu wrote:
> > > [...]
> > > I have run them with CET enabled. All of them pass, except for the following:
> > > Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
> > > address. This is understandable.
> > [...]
> > On a separate topic: You ran the selftests and one failed. This is a
> > *MASSIVE* warning sign. It should minimally be described in your cover
> > letter, and accompanied by a fix to the test case. It is absolutely
> > unacceptable to introduce a kernel feature that causes a test to fail.
> > You must either fix your kernel feature or you fix the test.
> >
> > This code can not be accepted until this selftests issue is rectified.

The x86/sigreturn test constructs 32-bit ldt entries, and does sigreturn from
64-bit to 32-bit context. We do not have a way to construct a static 32-bit
shadow stack. Why do we want that? I think we can simply run the test with CET
disabled.

Yu-cheng


2020-05-19 00:41:54

by Dave Hansen

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 5/18/20 4:47 PM, Yu-cheng Yu wrote:
> On Fri, 2020-05-15 at 19:53 -0700, Yu-cheng Yu wrote:
>> On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
>>> On 5/15/20 4:29 PM, Yu-cheng Yu wrote:
>>>> [...]
>>>> I have run them with CET enabled. All of them pass, except for the following:
>>>> Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
>>>> address. This is understandable.
>>> [...]
>>> On a separate topic: You ran the selftests and one failed. This is a
>>> *MASSIVE* warning sign. It should minimally be described in your cover
>>> letter, and accompanied by a fix to the test case. It is absolutely
>>> unacceptable to introduce a kernel feature that causes a test to fail.
>>> You must either fix your kernel feature or you fix the test.
>>>
>>> This code can not be accepted until this selftests issue is rectified.
> The x86/sigreturn test constructs 32-bit ldt entries, and does sigreturn from
> 64-bit to 32-bit context. We do not have a way to construct a static 32-bit
> shadow stack.

Why? What's the limiting factor? Hardware architecture? Something in
the kernel?

> Why do we want that? I think we can simply run the test with CET
> disabled.

The sadistic parts of selftests/x86 come from real bugs. Either bugs
where the kernel fell over, or where behavior changed that broke apps.
I'd suggest doing some research on where that particular test case came
from. Find the author of the test, look at the changelogs.

If this is something that a real app does, this is a problem. If it's a
sadistic test that Andy L added because it was an attack vector against
the entry code, it's a different story.

I don't personally know the background, but the changelogs can help you
find the person that does.

2020-05-19 01:37:12

by Andy Lutomirski

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description



> On May 18, 2020, at 5:38 PM, Dave Hansen <[email protected]> wrote:
>
> On 5/18/20 4:47 PM, Yu-cheng Yu wrote:
>>> On Fri, 2020-05-15 at 19:53 -0700, Yu-cheng Yu wrote:
>>> On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
>>>> On 5/15/20 4:29 PM, Yu-cheng Yu wrote:
>>>>> [...]
>>>>> I have run them with CET enabled. All of them pass, except for the following:
>>>>> Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
>>>>> address. This is understandable.
>>>> [...]
>>>> On a separate topic: You ran the selftests and one failed. This is a
>>>> *MASSIVE* warning sign. It should minimally be described in your cover
>>>> letter, and accompanied by a fix to the test case. It is absolutely
>>>> unacceptable to introduce a kernel feature that causes a test to fail.
>>>> You must either fix your kernel feature or you fix the test.
>>>>
>>>> This code can not be accepted until this selftests issue is rectified.
>> The x86/sigreturn test constructs 32-bit ldt entries, and does sigreturn from
>> 64-bit to 32-bit context. We do not have a way to construct a static 32-bit
>> shadow stack.
>
> Why? What's the limiting factor? Hardware architecture? Something in
> the kernel?
>
>> Why do we want that? I think we can simply run the test with CET
>> disabled.
>
> The sadistic parts of selftests/x86 come from real bugs. Either bugs
> where the kernel fell over, or where behavior changed that broke apps.
> I'd suggest doing some research on where that particular test case came
> from. Find the author of the test, look at the changelogs.
>
> If this is something that a real app does, this is a problem. If it's a
> sadistic test that Andy L added because it was an attack vector against
> the entry code, it's a different story.

There are quite a few tests in there that do these horrible things. In my personal opinion, sigreturn.c is one of the most important tests we have: it does every horrible thing to the entry code that I thought of and that I could come up with a way of doing. We have been saved from regressing many times by these tests. CET, and especially the CPL0 version of CET, is its own set of entry horror, and we need to keep these tests working.

I assume the basic issue is that we call raise(), the context magically changes to 32-bit, but SSP has a 64-bit value, and horrors happen. So I think two things need to happen:

1. Someone needs to document what happens when IRET tries to put a 64-bit value into SSP but CS is compat. Because Intel has plenty of history of doing colossally broken things here. IOW you could easily be hitting a hardware design problem, not a software issue per se.

2. The test needs to work. Assuming the hardware doesn’t do something utterly broken, either the 32-bit code needs to be adjusted to avoid any CALL or RET, or you need to write a little raise_on_32bit_shstk() func that switches to an SSP that fits in 32 bits, calls raise(), and switches back. From memory, I didn’t think there was a CALL or RET, so I’m guessing that SSP is getting truncated when we round trip through CPL3 compat mode and the result is that the kernel invoked the signal handler with the wrong SSP. Whoops.

>
> I don't personally know the background, but the changelogs can help you
> find the person that does.

2020-05-20 01:08:48

by Andy Lutomirski

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Mon, May 18, 2020 at 6:35 PM Andy Lutomirski <[email protected]> wrote:
>
>
>
> > On May 18, 2020, at 5:38 PM, Dave Hansen <[email protected]> wrote:
> >
> > On 5/18/20 4:47 PM, Yu-cheng Yu wrote:
> >>> On Fri, 2020-05-15 at 19:53 -0700, Yu-cheng Yu wrote:
> >>> On Fri, 2020-05-15 at 16:56 -0700, Dave Hansen wrote:
> >>>> On 5/15/20 4:29 PM, Yu-cheng Yu wrote:
> >>>>> [...]
> >>>>> I have run them with CET enabled. All of them pass, except for the following:
> >>>>> Sigreturn from 64-bit to 32-bit fails, because shadow stack is at a 64-bit
> >>>>> address. This is understandable.
> >>>> [...]
> >>>> On a separate topic: You ran the selftests and one failed. This is a
> >>>> *MASSIVE* warning sign. It should minimally be described in your cover
> >>>> letter, and accompanied by a fix to the test case. It is absolutely
> >>>> unacceptable to introduce a kernel feature that causes a test to fail.
> >>>> You must either fix your kernel feature or you fix the test.
> >>>>
> >>>> This code can not be accepted until this selftests issue is rectified.
> >> The x86/sigreturn test constructs 32-bit ldt entries, and does sigreturn from
> >> 64-bit to 32-bit context. We do not have a way to construct a static 32-bit
> >> shadow stack.
> >
> > Why? What's the limiting factor? Hardware architecture? Something in
> > the kernel?
> >
> >> Why do we want that? I think we can simply run the test with CET
> >> disabled.
> >
> > The sadistic parts of selftests/x86 come from real bugs. Either bugs
> > where the kernel fell over, or where behavior changed that broke apps.
> > I'd suggest doing some research on where that particular test case came
> > from. Find the author of the test, look at the changelogs.
> >
> > If this is something that a real app does, this is a problem. If it's a
> > sadistic test that Andy L added because it was an attack vector against
> > the entry code, it's a different story.
>
> There are quite a few tests in there that do these horrible things. In my personal opinion, sigreturn.c is one of the most important tests we have: it does every horrible thing to the entry code that I thought of and that I could come up with a way of doing. We have been saved from regressing many times by these tests. CET, and especially the CPL0 version of CET, is its own set of entry horror, and we need to keep these tests working.
>
> I assume the basic issue is that we call raise(), the context magically changes to 32-bit, but SSP has a 64-bit value, and horrors happen. So I think two things need to happen:
>
> 1. Someone needs to document what happens when IRET tries to put a 64-bit value into SSP but CS is compat. Because Intel has plenty of history of doing colossally broken things here. IOW you could easily be hitting a hardware design problem, not a software issue per se.
>
> 2. The test needs to work. Assuming the hardware doesn’t do something utterly broken, either the 32-bit code needs to be adjusted to avoid any CALL or RET, or you need to write a little raise_on_32bit_shstk() func that switches to an SSP that fits in 32 bits, calls raise(), and switches back. From memory, I didn’t think there was a CALL or RET, so I’m guessing that SSP is getting truncated when we round trip through CPL3 compat mode and the result is that the kernel invoked the signal handler with the wrong SSP. Whoops.
>

Following up here, I think this needs attention from the H/W architects.

From the SDM:

SYSRET and SYSEXIT:

IF ShadowStackEnabled(CPL)
SSP ← IA32_PL3_SSP;
FI;

IRET:

IF ShadowStackEnabled(CPL)
IF CPL = 3
THEN tempSSP ← IA32_PL3_SSP; FI;
IF ((EFER.LMA AND CS.L) = 0 AND tempSSP[63:32] != 0)
THEN #GP(0); FI;
SSP ← tempSSP

The semantics of actually executing in compat mode with SSP >= 2^32
are unclear. If nothing else, VM exit will save the full SSP and a
subsequent VM entry will fail.
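
For readers following along, the IRET condition quoted above can be
written as a small C predicate. This is only a model of the pseudocode as
quoted, not of real hardware behavior; the function and parameter names
are made up for illustration. It makes the asymmetry visible: IRET has an
explicit check for an SSP above 4 GiB when returning to a non-64-bit code
segment, while the quoted SYSRET/SYSEXIT pseudocode has no such check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the quoted IRET pseudocode:
 *   IF ((EFER.LMA AND CS.L) = 0 AND tempSSP[63:32] != 0) THEN #GP(0)
 * Returns true when the pseudocode would raise #GP(0).
 */
static bool iret_ssp_would_fault(bool efer_lma, bool cs_l, uint64_t pl3_ssp)
{
    bool target_is_64bit = efer_lma && cs_l;

    return !target_is_64bit && (pl3_ssp >> 32) != 0;
}

/* A few cases, including the compat-mode return with a high SSP. */
static void iret_ssp_examples(void)
{
    /* Compat-mode target, SSP above 4 GiB: faults per the pseudocode. */
    assert(iret_ssp_would_fault(true, false, 0x00007f0000000000ULL));
    /* 64-bit target with the same SSP: no fault. */
    assert(!iret_ssp_would_fault(true, true, 0x00007f0000000000ULL));
    /* Compat-mode target with an SSP that fits in 32 bits: no fault. */
    assert(!iret_ssp_would_fault(true, false, 0x00000000fffff000ULL));
}
```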

I don't know what the actual effect of operand-size-32 SYSRET or
SYSEXIT with too big a PL3_SSP will be, but I think it needs to be
documented. Ideally it will not put the CPU in an invalid state.
Ideally it will also not fault, because SYSRET faults in particular
are fatal unless the vector uses IST, and please please please don't
force more ISTs on anyone.

So I think we may need to put this entire series on hold until we get
some answers, because I suspect we're going to have a nice little root
hole otherwise.

2020-05-21 15:18:29

by Josh Poimboeuf

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Wed, Apr 29, 2020 at 03:07:06PM -0700, Yu-cheng Yu wrote:
> Control-flow Enforcement (CET) is a new Intel processor feature that blocks
> return/jump-oriented programming attacks. Details can be found in "Intel
> 64 and IA-32 Architectures Software Developer's Manual" [1].
>
> This series depends on the XSAVES supervisor state series that was split
> out and submitted earlier [2].
>
> I have gone through previous comments, and hope all concerns have been
> resolved now. Please inform me if anything is overlooked.
>
> Changes in v10:

Hi Yu-cheng,

Do you have a git branch with the latest Shadow Stack and IBT branches
applied? I tried to apply IBT v9 on top of this, but I guess the SS
code has changed since then and it didn't apply cleanly.

--
Josh

2020-05-21 15:59:53

by Yu-cheng Yu

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Thu, 2020-05-21 at 10:15 -0500, Josh Poimboeuf wrote:
> On Wed, Apr 29, 2020 at 03:07:06PM -0700, Yu-cheng Yu wrote:
> > Control-flow Enforcement (CET) is a new Intel processor feature that blocks
> > return/jump-oriented programming attacks. Details can be found in "Intel
> > 64 and IA-32 Architectures Software Developer's Manual" [1].
> >
> > This series depends on the XSAVES supervisor state series that was split
> > out and submitted earlier [2].
> >
> > I have gone through previous comments, and hope all concerns have been
> > resolved now. Please inform me if anything is overlooked.
> >
> > Changes in v10:
>
> Hi Yu-cheng,
>
> Do you have a git branch with the latest Shadow Stack and IBT branches
> applied? I tried to apply IBT v9 on top of this, but I guess the SS
> code has changed since then and it didn't apply cleanly.

It is here:

https://github.com/yyu168/linux_cet/commits/cet

Yu-cheng

2020-05-21 18:53:12

by Josh Poimboeuf

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Thu, May 21, 2020 at 08:57:57AM -0700, Yu-cheng Yu wrote:
> On Thu, 2020-05-21 at 10:15 -0500, Josh Poimboeuf wrote:
> > On Wed, Apr 29, 2020 at 03:07:06PM -0700, Yu-cheng Yu wrote:
> > > Control-flow Enforcement (CET) is a new Intel processor feature that blocks
> > > return/jump-oriented programming attacks. Details can be found in "Intel
> > > 64 and IA-32 Architectures Software Developer's Manual" [1].
> > >
> > > This series depends on the XSAVES supervisor state series that was split
> > > out and submitted earlier [2].
> > >
> > > I have gone through previous comments, and hope all concerns have been
> > > resolved now. Please inform me if anything is overlooked.
> > >
> > > Changes in v10:
> >
> > Hi Yu-cheng,
> >
> > Do you have a git branch with the latest Shadow Stack and IBT branches
> > applied? I tried to apply IBT v9 on top of this, but I guess the SS
> > code has changed since then and it didn't apply cleanly.
>
> It is here:
>
> https://github.com/yyu168/linux_cet/commits/cet

Thanks. FYI, I got the following warning on an AMD system.

[ 18.936979] get of unsupported state
[ 18.936989] WARNING: CPU: 251 PID: 1794 at arch/x86/kernel/fpu/xstate.c:919 get_xsave_addr+0x83/0x90
[ 18.949676] Modules linked in:
[ 18.952731] CPU: 251 PID: 1794 Comm: dracut-rootfs-g Not tainted 5.7.0-rc6+ #162
[ 18.960121] Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RDY1005C 11/22/2019
[ 18.968198] RIP: 0010:get_xsave_addr+0x83/0x90
[ 18.972637] Code: 5b c3 48 83 c4 08 31 c0 5b c3 80 3d f9 c2 7a 01 00 75 bc 48 c7 c7 c4 cb 8f a9 89 74 24 04 c6 05 e5 c2 7a 01 01 e8 3f 49 0a 00 <0f> 0b 8b 74 24 04 eb 9d 31 c0 c3 66 90 0f 1f 44 00 00 48 89 fe 0f
[ 18.991373] RSP: 0018:ffffb8db103cfcd8 EFLAGS: 00010286
[ 18.996591] RAX: 0000000000000000 RBX: ffff947da1189440 RCX: 0000000000000000
[ 19.003715] RDX: 0000000000000000 RSI: ffffffffaa6809d8 RDI: ffffffffaa67e58c
[ 19.010839] RBP: ffff947da1188000 R08: 0000000468bb5e6c R09: 0000000000000018
[ 19.017962] R10: 0000000000000002 R11: 00000000000000f0 R12: ffffb8db103cfd20
[ 19.025087] R13: ffff947da1189400 R14: 0000000000000000 R15: 0000000000000007
[ 19.032211] FS: 00007f0a81b15740(0000) GS:ffff947dcf8c0000(0000) knlGS:0000000000000000
[ 19.040321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 19.046057] CR2: 00007f0a81b156c0 CR3: 0000003fa125a000 CR4: 0000000000340ee0
[ 19.053183] Call Trace:
[ 19.055637] cet_restore_signal+0x26/0xf0
[ 19.059649] __fpu__restore_sig+0x4cc/0x6e0
[ 19.063832] ? remove_wait_queue+0x20/0x60
[ 19.067928] ? reuse_swap_page+0x6e/0x340
[ 19.071939] restore_sigcontext+0x162/0x1b0
[ 19.076128] ? recalc_sigpending+0x17/0x50
[ 19.080223] ? __set_task_blocked+0x34/0xa0
[ 19.084401] __do_sys_rt_sigreturn+0x92/0xde
[ 19.088675] do_syscall_64+0x55/0x1b0
[ 19.092342] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 19.097394] RIP: 0033:0x7f0a811389d1
[ 19.100970] Code: 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 41 ba 08 00 00 00 b8 0e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 07 c3 66 0f 1f 44 00 00 48 8b 15 81 44 38 00
[ 19.119709] RSP: 002b:00007ffd643d5dd8 EFLAGS: 00000246
[ 19.124933] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f0a811389d1
[ 19.132056] RDX: 0000000000000000 RSI: 00007ffd643d5e60 RDI: 0000000000000002
[ 19.139182] RBP: 000055e140190e20 R08: 000055e14017a014 R09: 0000000000000001
[ 19.146307] R10: 0000000000000008 R11: 0000000000000246 R12: 000055e13f47e4e0
[ 19.153436] R13: 00007ffd643d5e60 R14: 0000000000000000 R15: 0000000000000000

2020-05-21 19:12:10

by Yu-cheng Yu

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Thu, 2020-05-21 at 13:50 -0500, Josh Poimboeuf wrote:
[...]
>
> Thanks. FYI, I got the following warning on an AMD system.
>
> [ 18.936979] get of unsupported state
> [ 18.936989] WARNING: CPU: 251 PID: 1794 at arch/x86/kernel/fpu/xstate.c:919 get_xsave_addr+0x83/0x90
> [ 18.949676] Modules linked in:
> [ 18.952731] CPU: 251 PID: 1794 Comm: dracut-rootfs-g Not tainted 5.7.0-rc6+ #162
> [ 18.960121] Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RDY1005C 11/22/2019
> [ 18.968198] RIP: 0010:get_xsave_addr+0x83/0x90
> [ 18.972637] Code: 5b c3 48 83 c4 08 31 c0 5b c3 80 3d f9 c2 7a 01 00 75 bc 48 c7 c7 c4 cb 8f a9 89 74 24 04 c6 05 e5 c2 7a 01 01 e8 3f 49 0a 00 <0f> 0b 8b 74 24 04 eb 9d 31 c0 c3 66 90 0f 1f 44 00 00 48 89 fe 0f
> [ 18.991373] RSP: 0018:ffffb8db103cfcd8 EFLAGS: 00010286
> [ 18.996591] RAX: 0000000000000000 RBX: ffff947da1189440 RCX: 0000000000000000
> [ 19.003715] RDX: 0000000000000000 RSI: ffffffffaa6809d8 RDI: ffffffffaa67e58c
> [ 19.010839] RBP: ffff947da1188000 R08: 0000000468bb5e6c R09: 0000000000000018
> [ 19.017962] R10: 0000000000000002 R11: 00000000000000f0 R12: ffffb8db103cfd20
> [ 19.025087] R13: ffff947da1189400 R14: 0000000000000000 R15: 0000000000000007
> [ 19.032211] FS: 00007f0a81b15740(0000) GS:ffff947dcf8c0000(0000) knlGS:0000000000000000
> [ 19.040321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 19.046057] CR2: 00007f0a81b156c0 CR3: 0000003fa125a000 CR4: 0000000000340ee0
> [ 19.053183] Call Trace:
> [ 19.055637] cet_restore_signal+0x26/0xf0
> [ 19.059649] __fpu__restore_sig+0x4cc/0x6e0
> [ 19.063832] ? remove_wait_queue+0x20/0x60
> [ 19.067928] ? reuse_swap_page+0x6e/0x340
> [ 19.071939] restore_sigcontext+0x162/0x1b0
> [ 19.076128] ? recalc_sigpending+0x17/0x50
> [ 19.080223] ? __set_task_blocked+0x34/0xa0
> [ 19.084401] __do_sys_rt_sigreturn+0x92/0xde
> [ 19.088675] do_syscall_64+0x55/0x1b0
> [ 19.092342] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 19.097394] RIP: 0033:0x7f0a811389d1
> [ 19.100970] Code: 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 41 ba 08 00 00 00 b8 0e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 07 c3 66 0f 1f 44 00 00 48 8b 15 81 44 38 00
> [ 19.119709] RSP: 002b:00007ffd643d5dd8 EFLAGS: 00000246
> [ 19.124933] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f0a811389d1
> [ 19.132056] RDX: 0000000000000000 RSI: 00007ffd643d5e60 RDI: 0000000000000002
> [ 19.139182] RBP: 000055e140190e20 R08: 000055e14017a014 R09: 0000000000000001
> [ 19.146307] R10: 0000000000000008 R11: 0000000000000246 R12: 000055e13f47e4e0
> [ 19.153436] R13: 00007ffd643d5e60 R14: 0000000000000000 R15: 0000000000000000
>

Thanks! I will fix it.

Yu-cheng

2020-05-21 22:43:54

by Kees Cook

Subject: Re: [PATCH v10 26/26] x86/cet/shstk: Add arch_prctl functions for shadow stack

On Wed, Apr 29, 2020 at 03:07:32PM -0700, Yu-cheng Yu wrote:
> arch_prctl(ARCH_X86_CET_STATUS, u64 *args)
> Get CET feature status.
>
> The parameter 'args' is a pointer to a user buffer. The kernel returns
> the following information:
>
> *args = shadow stack/IBT status
> *(args + 1) = shadow stack base address
> *(args + 2) = shadow stack size
>
> arch_prctl(ARCH_X86_CET_DISABLE, u64 features)
> Disable CET features specified in 'features'. Return -EPERM if CET is
> locked.
>
> arch_prctl(ARCH_X86_CET_LOCK)
> Lock in CET features.
>
> arch_prctl(ARCH_X86_CET_ALLOC_SHSTK, u64 *args)
> Allocate a new shadow stack.
>
> The parameter 'args' is a pointer to a user buffer containing the
> desired size to allocate. The kernel returns the allocated shadow
> stack address in *args.

Hi! Just a quick note about getting these designs right -- prctl() (and
similar APIs) need to verify that all "unknown" flags are zero, or we
run the risk of breaking sloppy userspace callers who accidentally set
flags to which the kernel later gives meaning. Notes below...
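
As a rough illustration of the pattern, here is a standalone sketch of
rejecting unknown bits in the 'features' argument of the DISABLE call.
The helper name and the KNOWN mask are hypothetical and not taken from
the patch; the bit values are assumed to follow the psABI feature bits.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed feature bits (psABI GNU_PROPERTY_X86_FEATURE_1_*). */
#define CET_FEATURE_IBT    (1ULL << 0)
#define CET_FEATURE_SHSTK  (1ULL << 1)
#define CET_KNOWN_FEATURES (CET_FEATURE_IBT | CET_FEATURE_SHSTK)

/*
 * Hypothetical argument check: fail with -EINVAL if the caller sets any
 * bit that has no meaning today, so the bit can be given one later
 * without changing behavior for existing callers.
 */
static int cet_check_disable_features(uint64_t features)
{
    if (features & ~CET_KNOWN_FEATURES)
        return -EINVAL;
    return 0;
}

/* Known bits pass, stray bits are refused. */
static void cet_check_examples(void)
{
    assert(cet_check_disable_features(CET_FEATURE_SHSTK) == 0);
    assert(cet_check_disable_features(CET_FEATURE_SHSTK | (1ULL << 5)) == -EINVAL);
}
```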

>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> ---
> v10:
> - Verify CET is enabled before handling arch_prctl.
> - Change input parameters from unsigned long to u64, to make it clear they
> are 64-bit.
>
> arch/x86/include/asm/cet.h | 4 ++
> arch/x86/include/uapi/asm/prctl.h | 5 ++
> arch/x86/kernel/Makefile | 2 +-
> arch/x86/kernel/cet.c | 29 +++++++++++
> arch/x86/kernel/cet_prctl.c | 87 +++++++++++++++++++++++++++++++
> arch/x86/kernel/process.c | 4 +-
> 6 files changed, 128 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/kernel/cet_prctl.c
>
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index 71dc92acd2f2..99e6e741d28c 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -14,16 +14,20 @@ struct sc_ext;
> struct cet_status {
> unsigned long shstk_base;
> unsigned long shstk_size;
> + unsigned int locked:1;
> };
>
> #ifdef CONFIG_X86_INTEL_CET
> +int prctl_cet(int option, u64 arg2);
> int cet_setup_shstk(void);
> int cet_setup_thread_shstk(struct task_struct *p);
> +int cet_alloc_shstk(unsigned long *arg);
> void cet_disable_free_shstk(struct task_struct *p);
> int cet_verify_rstor_token(bool ia32, unsigned long ssp, unsigned long *new_ssp);
> void cet_restore_signal(struct sc_ext *sc);
> int cet_setup_signal(bool ia32, unsigned long rstor, struct sc_ext *sc);
> #else
> +static inline int prctl_cet(int option, u64 arg2) { return -EINVAL; }
> static inline int cet_setup_thread_shstk(struct task_struct *p) { return 0; }
> static inline void cet_disable_free_shstk(struct task_struct *p) {}
> static inline void cet_restore_signal(struct sc_ext *sc) { return; }
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 5a6aac9fa41f..d962f0ec9ccf 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -14,4 +14,9 @@
> #define ARCH_MAP_VDSO_32 0x2002
> #define ARCH_MAP_VDSO_64 0x2003
>
> +#define ARCH_X86_CET_STATUS 0x3001
> +#define ARCH_X86_CET_DISABLE 0x3002
> +#define ARCH_X86_CET_LOCK 0x3003
> +#define ARCH_X86_CET_ALLOC_SHSTK 0x3004
> +
> #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index e9cc2551573b..0b621e2afbdc 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -144,7 +144,7 @@ obj-$(CONFIG_UNWINDER_ORC) += unwind_orc.o
> obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
> obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
>
> -obj-$(CONFIG_X86_INTEL_CET) += cet.o
> +obj-$(CONFIG_X86_INTEL_CET) += cet.o cet_prctl.o
>
> ###
> # 64 bit specific files
> diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
> index 121552047b86..c1b9b540c03e 100644
> --- a/arch/x86/kernel/cet.c
> +++ b/arch/x86/kernel/cet.c
> @@ -145,6 +145,35 @@ static int create_rstor_token(bool ia32, unsigned long ssp,
> return 0;
> }
>
> +int cet_alloc_shstk(unsigned long *arg)
> +{
> + unsigned long len = *arg;
> + unsigned long addr;
> + unsigned long token;
> + unsigned long ssp;
> +
> + addr = alloc_shstk(round_up(len, PAGE_SIZE));
> +
> + if (IS_ERR((void *)addr))
> + return PTR_ERR((void *)addr);
> +
> + /* Restore token is 8 bytes and aligned to 8 bytes */
> + ssp = addr + len;
> + token = ssp;
> +
> + if (!in_ia32_syscall())
> + token |= TOKEN_MODE_64;
> + ssp -= 8;
> +
> + if (write_user_shstk_64(ssp, token)) {
> + vm_munmap(addr, len);
> + return -EINVAL;
> + }
> +
> + *arg = addr;
> + return 0;
> +}
> +
> int cet_setup_shstk(void)
> {
> unsigned long addr, size;
> diff --git a/arch/x86/kernel/cet_prctl.c b/arch/x86/kernel/cet_prctl.c
> new file mode 100644
> index 000000000000..0139c48f2215
> --- /dev/null
> +++ b/arch/x86/kernel/cet_prctl.c
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/errno.h>
> +#include <linux/uaccess.h>
> +#include <linux/prctl.h>
> +#include <linux/compat.h>
> +#include <linux/mman.h>
> +#include <linux/elfcore.h>
> +#include <asm/processor.h>
> +#include <asm/prctl.h>
> +#include <asm/cet.h>
> +
> +/* See Documentation/x86/intel_cet.rst. */
> +
> +static int handle_get_status(u64 arg2)
> +{
> + struct cet_status *cet = &current->thread.cet;
> + u64 buf[3] = {0, 0, 0};
> +
> + if (cet->shstk_size) {
> + buf[0] |= GNU_PROPERTY_X86_FEATURE_1_SHSTK;
> + buf[1] = (u64)cet->shstk_base;
> + buf[2] = (u64)cet->shstk_size;
> + }
> +
> + return copy_to_user((u64 __user *)arg2, buf, sizeof(buf));
> +}
> +
> +static int handle_alloc_shstk(u64 arg2)
> +{
> + int err = 0;
> + unsigned long arg;
> + unsigned long addr = 0;
> + unsigned long size = 0;
> +
> + if (get_user(arg, (unsigned long __user *)arg2))
> + return -EFAULT;
> +
> + size = arg;
> + err = cet_alloc_shstk(&arg);
> + if (err)
> + return err;
> +
> + addr = arg;
> + if (put_user((u64)addr, (u64 __user *)arg2)) {
> + vm_munmap(addr, size);
> + return -EFAULT;
> + }
> +
> + return 0;
> +}
> +
> +int prctl_cet(int option, u64 arg2)
> +{
> + struct cet_status *cet;
> +
> + if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
> + return -EINVAL;

Using -EINVAL here means userspace can't tell the difference between an
old kernel and a kernel not built with CONFIG_X86_INTEL_CET. Perhaps
-ENOTSUPP?

> +
> + if (option == ARCH_X86_CET_STATUS)
> + return handle_get_status(arg2);
> +
> + if (!static_cpu_has(X86_FEATURE_SHSTK))
> + return -EINVAL;

Similar case, though here it is a kernel that knows about CET running on
a CPU that doesn't support it.

> +
> + cet = &current->thread.cet;

You get this both here and in handle_get_status(). Why not just get it
once and pass it into handle_get_status()? (And perhaps rename it to
"copy_status_to_user" or so?)

> +
> + switch (option) {
> + case ARCH_X86_CET_DISABLE:

This must check for unknown flags before doing anything else:

if (arg2 & ~(GNU_PROPERTY_X86_FEATURE_1_SHSTK))
	return -EINVAL;

> + if (cet->locked)
> + return -EPERM;
> + if (arg2 & GNU_PROPERTY_X86_FEATURE_1_SHSTK)
> + cet_disable_free_shstk(current);
> +
> + return 0;
> +
> + case ARCH_X86_CET_LOCK:

Same here.

> + cet->locked = 1;
> + return 0;
> +
> + case ARCH_X86_CET_ALLOC_SHSTK:
> + return handle_alloc_shstk(arg2);
> +
> + default:
> + return -EINVAL;
> + }
> +}
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index ef1c2b8086a2..de6773dd6a16 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -996,7 +996,7 @@ long do_arch_prctl_common(struct task_struct *task, int option,
> return get_cpuid_mode();
> case ARCH_SET_CPUID:
> return set_cpuid_mode(task, cpuid_enabled);
> + default:
> + return prctl_cet(option, cpuid_enabled);
> }

This is weird, but yeah, there are only the cpuid and cet handlers... I
think do_arch_prctl_common() should call the second arg "arg2" and
there should be a series of calls:

ret = prctl_cpuid(task, option, arg2);
if (ret != -EINVAL)
	return ret;
return prctl_cet(option, arg2);

But that's just a style nit, I guess.

And really, why does x86's arch_prctl() return -EINVAL for unknown options?
It should use -ENOSYS for unknown options and -EINVAL for bad arguments,
but I guess it's too late for that. :)
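
A minimal userspace sketch of the two suggestions above: reject unknown flag bits up front, and chain the handlers so that -EINVAL from the first one means "not my option". Everything prefixed with mock_ is a hypothetical stand-in, not a kernel function; the option numbers come from the uapi header and the patch under review, and applying the same mask check to ARCH_X86_CET_LOCK is an assumption based on the "Same here" remark.

```c
#include <errno.h>
#include <stdint.h>

/* Values from arch/x86/include/uapi/asm/prctl.h and the patch under review. */
#define ARCH_GET_CPUID        0x1011
#define ARCH_SET_CPUID        0x1012
#define ARCH_X86_CET_DISABLE  0x3002
#define ARCH_X86_CET_LOCK     0x3003

/* Bit 1 of GNU_PROPERTY_X86_FEATURE_1_AND in the x86 psABI. */
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (1U << 1)

/* Hypothetical stand-in for current->thread.cet.locked. */
static int mock_cet_locked;

/* True if arg2 carries only feature bits this sketch knows about. */
static int mock_cet_flags_valid(uint64_t arg2)
{
	return (arg2 & ~(uint64_t)GNU_PROPERTY_X86_FEATURE_1_SHSTK) == 0;
}

/* Stand-in for the cpuid handler: -EINVAL means "not my option". */
static long mock_prctl_cpuid(int option, uint64_t arg2)
{
	(void)arg2;
	switch (option) {
	case ARCH_GET_CPUID:
	case ARCH_SET_CPUID:
		return 0;
	default:
		return -EINVAL;
	}
}

/* Stand-in for prctl_cet() with the flag check done before anything else. */
static long mock_prctl_cet(int option, uint64_t arg2)
{
	switch (option) {
	case ARCH_X86_CET_DISABLE:
		if (!mock_cet_flags_valid(arg2))
			return -EINVAL;
		if (mock_cet_locked)
			return -EPERM;
		return 0;
	case ARCH_X86_CET_LOCK:
		if (!mock_cet_flags_valid(arg2))
			return -EINVAL;
		mock_cet_locked = 1;
		return 0;
	default:
		return -EINVAL;
	}
}

/* Chained dispatch: try cpuid first, fall through to CET on -EINVAL. */
static long mock_arch_prctl_common(int option, uint64_t arg2)
{
	long ret = mock_prctl_cpuid(option, arg2);

	if (ret != -EINVAL)
		return ret;
	return mock_prctl_cet(option, arg2);
}
```

The chain only works cleanly if a handler never returns -EINVAL for an option it does own, which is the same ambiguity between "unknown option" and "bad argument" noted above.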

--
Kees Cook

2020-05-22 16:55:04

by Peter Zijlstra

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Sat, May 16, 2020 at 03:09:22PM +0100, Andrew Cooper wrote:

> Sadly, the same is not true for kernel shadow stacks.
>
> SSP is 0 after SYSCALL, SYSENTER and CLRSSBSY, and you've got to be
> careful to re-establish the shadow stack before a CALL, interrupt or
> exception tries pushing a word onto the shadow stack at 0xfffffffffffffff8.

Oh man, I can only imagine the joy that brings to #NM and friends :-(

2020-05-22 17:22:35

by Yu-cheng Yu

Subject: Re: [PATCH v10 26/26] x86/cet/shstk: Add arch_prctl functions for shadow stack

On Thu, 2020-05-21 at 15:42 -0700, Kees Cook wrote:
> On Wed, Apr 29, 2020 at 03:07:32PM -0700, Yu-cheng Yu wrote:
[...]
> > +
> > +int prctl_cet(int option, u64 arg2)
> > +{
> > + struct cet_status *cet;
> > +
> > + if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
> > + return -EINVAL;
>
> Using -EINVAL here means userspace can't tell the difference between an
> old kernel and a kernel not built with CONFIG_X86_INTEL_CET. Perhaps
> -ENOTSUPP?

I looked into this. The kernel and glibc are not in sync: the kernel's ENOTSUPP
(524) is unknown to glibc, as shown below. So maybe we should still use EINVAL
here?

Yu-cheng



In kernel:
----------

#define EOPNOTSUPP 95
#define ENOTSUPP 524

In GLIBC:
---------

printf("ENOTSUP=%d\n", ENOTSUP);
printf("EOPNOTSUPP=%d\n", EOPNOTSUPP);
printf("%s=524\n", strerror(524));

ENOTSUP=95
EOPNOTSUPP=95
Unknown error 524=524

2020-05-22 17:34:07

by Eugene Syromiatnikov

Subject: Re: [PATCH v10 26/26] x86/cet/shstk: Add arch_prctl functions for shadow stack

On Fri, May 22, 2020 at 10:17:43AM -0700, Yu-cheng Yu wrote:
> On Thu, 2020-05-21 at 15:42 -0700, Kees Cook wrote:
> > On Wed, Apr 29, 2020 at 03:07:32PM -0700, Yu-cheng Yu wrote:
> [...]
> > > +
> > > +int prctl_cet(int option, u64 arg2)
> > > +{
> > > + struct cet_status *cet;
> > > +
> > > + if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
> > > + return -EINVAL;
> >
> > Using -EINVAL here means userspace can't tell the difference between an
> > old kernel and a kernel not built with CONFIG_X86_INTEL_CET. Perhaps
> > -ENOTSUPP?
>
> Looked into this. The kernel and GLIBC are not in sync. So maybe we still use
> EINVAL here?
>
> Yu-cheng
>
>
>
> In kernel:
> ----------
>
> #define EOPNOTSUPP 95
> #define ENOTSUPP 524
>
> In GLIBC:
> ---------
>
> printf("ENOTSUP=%d\n", ENOTSUP);
> printf("EOPNOTSUPP=%d\n", EOPNOTSUPP);
> printf("%s=524\n", strerror(524));
>
> ENOTSUP=95
> EOPNOTSUPP=95
> Unknown error 524=524

EOPNOTSUPP/ENOTSUP/ENOTSUPP is actually a mess; it was summarized recently
by Michael Kerrisk [1]. From the kernel's point of view, I think it
would be reasonable to return EOPNOTSUPP, and expect userspace to
match against it using ENOTSUP.

[1] https://lore.kernel.org/linux-man/[email protected]/
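
A minimal userspace sketch of how a caller could tell the cases apart if the kernel returned EOPNOTSUPP as proposed. The enum and classify_cet_errno() are hypothetical names for illustration; the only facts relied on are that glibc on Linux defines ENOTSUP and EOPNOTSUPP to the same value (95), and that 524 has no userspace name.

```c
#include <errno.h>

enum cet_probe_result {
	CET_PROBE_OK,
	CET_PROBE_UNSUPPORTED,    /* kernel knows the option, CET unavailable */
	CET_PROBE_UNKNOWN_OPTION, /* old kernel, or a bad argument */
	CET_PROBE_OTHER_ERROR,
};

/* err is the positive errno value left behind by a failed arch_prctl(). */
static enum cet_probe_result classify_cet_errno(int err)
{
	switch (err) {
	case 0:
		return CET_PROBE_OK;
	case ENOTSUP: /* same value as EOPNOTSUPP on Linux/glibc */
		return CET_PROBE_UNSUPPORTED;
	case EINVAL:
		return CET_PROBE_UNKNOWN_OPTION;
	default:
		return CET_PROBE_OTHER_ERROR;
	}
}
```

Only one of ENOTSUP and EOPNOTSUPP can appear as a case label here, since duplicate case values do not compile where the two are equal.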

2020-05-22 17:52:29

by Andrew Cooper

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On 22/05/2020 17:49, Peter Zijlstra wrote:
> On Sat, May 16, 2020 at 03:09:22PM +0100, Andrew Cooper wrote:
>
>> Sadly, the same is not true for kernel shadow stacks.
>>
>> SSP is 0 after SYSCALL, SYSENTER and CLRSSBSY, and you've got to be
>> careful to re-establish the shadow stack before a CALL, interrupt or
>> exception tries pushing a word onto the shadow stack at 0xfffffffffffffff8.
> Oh man, I can only imagine the joy that brings to #NM and friends :-(

Establishing a supervisor shadow stack for the first time involves a
large leap of faith, even by usual x86 standards.

You need to have prepared MSR_PL0_SSP with correct mappings and
supervisor tokens, such that when you enable CR4.CET and
MSR_S_CET.SHSTK_EN, your SETSSBSY instruction succeeds at its atomic
"check the token and set the busy bit" shadow stack access.  Any failure
here tends to be a triple fault, and I didn't get around to figuring out
why #DF wasn't taken cleanly.

You also need to have prepared MSR_IST_SSP beforehand with the IST
shadow stack pointers matching any IST configuration in the IDT, lest an
NMI ruin your day on the instruction boundary before SETSSBSY.

A less obvious side effect of these "windows with an SSP of 0" is that
you're now forced to use IST for all non-maskable interrupts/exceptions,
even if you choose not to use SYSCALL, and you no longer need IST to
remove the risks of a userspace privilege escalation, and would prefer
not to use IST because of its problematic reentrancy characteristics.

For anyone counting the number of IST-necessary vectors across all
potential configurations in modern hardware, it's #DB, NMI, #DF, #MC,
#VE, #HV, #VC and #SX, against an architectural limit of 7.

There are several other amusing aspects, such as iret-to-self needing to
use call-oriented-programming to keep itself shadow-stack-safe, or the
fact that IRET to user mode doesn't fault if it fails to clear the
supervisor busy bit, instead leaving you to double fault at some point
in the future at the next syscall/interrupt/exception because the stack
is still busy.

~Andrew

P.S. For anyone interested,
https://lore.kernel.org/xen-devel/[email protected]/T/#u
for getting supervisor shadow stacks working on Xen, which is far
simpler to manage than Linux.  I do not envy whoever has the fun of
trying to make this work for Linux.

2020-05-22 18:16:33

by Yu-cheng Yu

Subject: Re: [PATCH v10 26/26] x86/cet/shstk: Add arch_prctl functions for shadow stack

On Fri, 2020-05-22 at 19:29 +0200, Eugene Syromiatnikov wrote:
> On Fri, May 22, 2020 at 10:17:43AM -0700, Yu-cheng Yu wrote:
> > On Thu, 2020-05-21 at 15:42 -0700, Kees Cook wrote:
> > > On Wed, Apr 29, 2020 at 03:07:32PM -0700, Yu-cheng Yu wrote:
> > [...]
> > > > +
> > > > +int prctl_cet(int option, u64 arg2)
> > > > +{
> > > > + struct cet_status *cet;
> > > > +
> > > > + if (!IS_ENABLED(CONFIG_X86_INTEL_CET))
> > > > + return -EINVAL;
> > >
> > > Using -EINVAL here means userspace can't tell the difference between an
> > > old kernel and a kernel not built with CONFIG_X86_INTEL_CET. Perhaps
> > > -ENOTSUPP?
> >
> > Looked into this. The kernel and GLIBC are not in sync. So maybe we still use
> > EINVAL here?
> >
> > Yu-cheng
> >
> >
> >
> > In kernel:
> > ----------
> >
> > #define EOPNOTSUPP 95
> > #define ENOTSUPP 524
> >
> > In GLIBC:
> > ---------
> >
> > printf("ENOTSUP=%d\n", ENOTSUP);
> > printf("EOPNOTSUPP=%d\n", EOPNOTSUPP);
> > printf("%s=524\n", strerror(524));
> >
> > ENOTSUP=95
> > EOPNOTSUPP=95
> > Unknown error 524=524
>
> EOPNOTSUPP/ENOTSUP/ENOTSUPP is actually a mess, it's summarized recently
> by Michael Kerrisk[1]. From the kernel's point of view, I think it
> would be reasonable to return EOPNOTSUPP, and expect that the userspace
> would use ENOTSUP to match against it.

OK, I will use EOPNOTSUPP and add a comment explaining why.

Yu-cheng

2020-05-29 02:12:31

by Yu-cheng Yu

Subject: Re: [PATCH v10 01/26] Documentation/x86: Add CET description

On Tue, 2020-05-19 at 18:04 -0700, Andy Lutomirski wrote:
> On Mon, May 18, 2020 at 6:35 PM Andy Lutomirski <[email protected]> wrote:
> > [...]
> > > On May 18, 2020, at 5:38 PM, Dave Hansen <[email protected]> wrote:
> > > [...]
> > > The sadistic parts of selftests/x86 come from real bugs. Either bugs
> > > where the kernel fell over, or where behavior changed that broke apps.
> > > I'd suggest doing some research on where that particular test case came
> > > from. Find the author of the test, look at the changelogs.
> > >
> > > If this is something that a real app does, this is a problem. If it's a
> > > sadistic test that Andy L added because it was an attack vector against
> > > the entry code, it's a different story.
> >
> > There are quite a few tests that do these horrible things in there. In my personal opinion, sigreturn.c is one of the most important tests we have: it does every horrible thing to the entry code that I thought of and that I could come up with a way of doing. We have been saved from regressing many times by these tests. CET, and especially the CPL0 version of CET, is its own set of entry horror, and we need to keep these tests working.
> >
> > I assume the basic issue is that we call raise(), the context magically changes to 32-bit, but SSP has a 64-bit value, and horrors happen. So I think two things need to happen:
> >
> > 1. Someone needs to document what happens when IRET tries to put a 64-bit value into SSP but CS is compat. Because Intel has plenty of history of doing colossally broken things here. IOW you could easily be hitting a hardware design problem, not a software issue per se.
> >
> > 2. The test needs to work. Assuming the hardware doesn’t do something utterly broken, either the 32-bit code needs to be adjusted to avoid any CALL or RET, or you need to write a little raise_on_32bit_shstk() func that switches to an SSP that fits in 32 bits, calls raise(), and switches back. From memory, I didn’t think there was a CALL or RET, so I’m guessing that SSP is getting truncated when we round trip through CPL3 compat mode and the result is that the kernel invoked the signal handler with the wrong SSP. Whoops.
> >
>
> Following up here, I think this needs attention from the H/W architects.
>
> From the SDM:
>
> SYSRET and SYSEXIT:
>
> IF ShadowStackEnabled(CPL)
> SSP ← IA32_PL3_SSP;
> FI;
>
> IRET:
>
> IF ShadowStackEnabled(CPL)
> IF CPL = 3
> THEN tempSSP ← IA32_PL3_SSP; FI;
> IF ((EFER.LMA AND CS.L) = 0 AND tempSSP[63:32] != 0)
> THEN #GP(0); FI;
> SSP ← tempSSP
>
> The semantics of actually executing in compat mode with SSP >= 2^32
> are unclear. If nothing else, VM exit will save the full SSP and a
> subsequent VM entry will fail.

Here is what I got after talking to the architect.

If the guest is in 32-bit mode, but its VM guest state SSP field is 64-bit, the
CPU only uses the lower 32 bits.

The SDM currently states a consistency check of the guest SSP field, but that
will be removed in the next version. Upon VM entry, the CPU only requires the
guest SSP to be pseudo-canonical like the RIP and RSP.

> I don't know what the actual effect of operand-size-32 SYSRET or
> SYSEXIT with too big a PL3_SSP will be, but I think it needs to be
> documented. Ideally it will not put the CPU in an invalid state.
> Ideally it will also not fault, because SYSRET faults in particular
> are fatal unless the vector uses IST, and please please please don't
> force more ISTs on anyone.

On SYSRET/SYSEXIT to a 32-bit context, the CPU only uses the lower 32 bits of
the user-mode SSP, and will not go into an invalid state and will not fault.
The SDM will be explicit about this.
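
The two behaviors discussed here can be modeled in a few lines. This is only a sketch of the semantics as quoted and described in this thread, not of any real hardware interface; iret_ssp_faults() and sysret_effective_ssp() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * IRET to CPL3, following the SDM pseudocode quoted above: #GP(0) is
 * raised when the target is not 64-bit mode and SSP[63:32] is non-zero.
 */
static bool iret_ssp_faults(bool efer_lma, bool cs_l, uint64_t pl3_ssp)
{
	bool target_is_64bit = efer_lma && cs_l;

	return !target_is_64bit && (pl3_ssp >> 32) != 0;
}

/*
 * SYSRET/SYSEXIT to a 32-bit context, as described above: no fault,
 * and only the lower 32 bits of the user-mode SSP are used.
 */
static uint64_t sysret_effective_ssp(bool to_64bit, uint64_t pl3_ssp)
{
	return to_64bit ? pl3_ssp : (pl3_ssp & 0xffffffffULL);
}
```

Under this model, the same oversized IA32_PL3_SSP value faults on IRET to compat mode but is silently truncated on SYSRET/SYSEXIT.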

Yu-cheng

>
> So I think we may need to put this entire series on hold until we get
> some answers, because I suspect we're going to have a nice little root
> hole otherwise.

2020-07-23 16:11:18

by Sean Christopherson

Subject: Re: [PATCH v10 03/26] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states

On Wed, Apr 29, 2020 at 03:07:09PM -0700, Yu-cheng Yu wrote:
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 12c9684d59ba..47f603729543 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -885,4 +885,22 @@
> #define MSR_VM_IGNNE 0xc0010115
> #define MSR_VM_HSAVE_PA 0xc0010117
>
> +/* Control-flow Enforcement Technology MSRs */
> +#define MSR_IA32_U_CET 0x6a0 /* user mode cet setting */
> +#define MSR_IA32_S_CET 0x6a2 /* kernel mode cet setting */
> +#define MSR_IA32_PL0_SSP 0x6a4 /* kernel shstk pointer */
> +#define MSR_IA32_PL1_SSP 0x6a5 /* ring-1 shstk pointer */
> +#define MSR_IA32_PL2_SSP 0x6a6 /* ring-2 shstk pointer */
> +#define MSR_IA32_PL3_SSP 0x6a7 /* user shstk pointer */
> +#define MSR_IA32_INT_SSP_TAB 0x6a8 /* exception shstk table */
> +
> +/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
> +#define MSR_IA32_CET_SHSTK_EN 0x0000000000000001ULL

Can we drop the MSR_IA32 prefix for the individual bits? Mostly to yield
shorter line lengths, but also because it's more or less redundant info,
and in some ways unhelpful as it's hard to quickly differentiate between
"this is an MSR index" and "this is a bit/mask for an MSR".

My vote would also be to use BIT() or BIT_ULL(). The SDM defines the flags
by their (decimal) bit number. Manually converting the bits to masks makes
it difficult to check for correctness.

E.g.

#define CET_SHSTK_EN BIT(0)
#define CET_WRSS_EN BIT(1)
#define CET_ENDBR_EN BIT(2)
#define CET_LEG_IW_EN BIT(3)
#define CET_NO_TRACK_EN BIT(4)
#define CET_WAIT_ENDBR BIT(11)
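
A small userspace sketch of why bit numbers are easier to audit than hand-written masks. BIT_ULL() is redefined locally as a stand-in for the kernel macro, and mask_to_bit() is a hypothetical helper that converts a single-bit mask back to the bit number the SDM would list. Applied to the masks in the patch, 0x800 decodes to bit 11.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's BIT_ULL(). */
#define BIT_ULL(nr) (1ULL << (nr))

#define CET_SHSTK_EN    BIT_ULL(0)
#define CET_WRSS_EN     BIT_ULL(1)
#define CET_ENDBR_EN    BIT_ULL(2)
#define CET_LEG_IW_EN   BIT_ULL(3)
#define CET_NO_TRACK_EN BIT_ULL(4)

/* Compile-time check that the bit form matches the patch's hex masks. */
_Static_assert(CET_SHSTK_EN == 0x0000000000000001ULL, "SHSTK_EN mismatch");
_Static_assert(CET_WRSS_EN == 0x0000000000000002ULL, "WRSS_EN mismatch");
_Static_assert(CET_ENDBR_EN == 0x0000000000000004ULL, "ENDBR_EN mismatch");
_Static_assert(CET_LEG_IW_EN == 0x0000000000000008ULL, "LEG_IW_EN mismatch");
_Static_assert(CET_NO_TRACK_EN == 0x0000000000000010ULL, "NO_TRACK_EN mismatch");

/* Return the bit number of a single-bit mask, or -1 if not exactly one bit. */
static int mask_to_bit(uint64_t mask)
{
	int bit = 0;

	if (mask == 0 || (mask & (mask - 1)) != 0)
		return -1;
	while (!(mask & 1)) {
		mask >>= 1;
		bit++;
	}
	return bit;
}
```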

> +#define MSR_IA32_CET_WRSS_EN 0x0000000000000002ULL
> +#define MSR_IA32_CET_ENDBR_EN 0x0000000000000004ULL
> +#define MSR_IA32_CET_LEG_IW_EN 0x0000000000000008ULL
> +#define MSR_IA32_CET_NO_TRACK_EN 0x0000000000000010ULL
> +#define MSR_IA32_CET_WAIT_ENDBR 0x00000000000000800UL
> +#define MSR_IA32_CET_BITMAP_MASK 0xfffffffffffff000ULL

This particular define, the so called BITMAP_MASK, is no longer used in the
IBT series. IMO it'd be better off dropping this mask as it's not clear
from the name that this is really nothing more than a mask for a virtual
address, e.g. at first glance (for someone without CET knowledge) it looks
like bits 63:12 hold a bitmap as opposed to holding a pointer to a bitmap.

2020-07-23 16:24:55

by Yu-cheng Yu

Subject: Re: [PATCH v10 03/26] x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states

On Thu, 2020-07-23 at 09:10 -0700, Sean Christopherson wrote:
> On Wed, Apr 29, 2020 at 03:07:09PM -0700, Yu-cheng Yu wrote:
> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index 12c9684d59ba..47f603729543 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -885,4 +885,22 @@
> > #define MSR_VM_IGNNE 0xc0010115
> > #define MSR_VM_HSAVE_PA 0xc0010117
> >
> > +/* Control-flow Enforcement Technology MSRs */
> > +#define MSR_IA32_U_CET 0x6a0 /* user mode cet setting */
> > +#define MSR_IA32_S_CET 0x6a2 /* kernel mode cet setting */
> > +#define MSR_IA32_PL0_SSP 0x6a4 /* kernel shstk pointer */
> > +#define MSR_IA32_PL1_SSP 0x6a5 /* ring-1 shstk pointer */
> > +#define MSR_IA32_PL2_SSP 0x6a6 /* ring-2 shstk pointer */
> > +#define MSR_IA32_PL3_SSP 0x6a7 /* user shstk pointer */
> > +#define MSR_IA32_INT_SSP_TAB 0x6a8 /* exception shstk table */
> > +
> > +/* MSR_IA32_U_CET and MSR_IA32_S_CET bits */
> > +#define MSR_IA32_CET_SHSTK_EN 0x0000000000000001ULL
>
> Can we drop the MSR_IA32 prefix for the individual bits? Mostly to yield
> shorter line lengths, but also because it's more or less redundant info,
> and in some ways unhelpful as it's hard to quickly differentiate between
> "this is an MSR index" and "this is a bit/mask for an MSR".

Agree!

>
> My vote would also be to use BIT() or BIT_ULL(). The SDM defines the flags
> by their (decimal) bit number. Manually converting the bits to masks makes
> it difficult to check for correctness.
>
> E.g.
>
> #define CET_SHSTK_EN BIT(0)
> #define CET_WRSS_EN BIT(1)
> #define CET_ENDBR_EN BIT(2)
> #define CET_LEG_IW_EN BIT(3)
> #define CET_NO_TRACK_EN BIT(4)
> #define CET_WAIT_ENDBR BIT(11)

I will change them.

>
> > +#define MSR_IA32_CET_WRSS_EN 0x0000000000000002ULL
> > +#define MSR_IA32_CET_ENDBR_EN 0x0000000000000004ULL
> > +#define MSR_IA32_CET_LEG_IW_EN 0x0000000000000008ULL
> > +#define MSR_IA32_CET_NO_TRACK_EN 0x0000000000000010ULL
> > +#define MSR_IA32_CET_WAIT_ENDBR 0x00000000000000800UL
> > +#define MSR_IA32_CET_BITMAP_MASK 0xfffffffffffff000ULL
>
> This particular define, the so called BITMAP_MASK, is no longer used in the
> IBT series. IMO it'd be better off dropping this mask as it's not clear
> from the name that this is really nothing more than a mask for a virtual
> address, e.g. at first glance (for someone without CET knowledge) it looks
> like bits 63:12 hold a bitmap as opposed to holding a pointer to a bitmap.

I will remove this.

Thanks,
Yu-cheng

2020-07-23 16:26:57

by Sean Christopherson

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Wed, Apr 29, 2020 at 03:07:06PM -0700, Yu-cheng Yu wrote:
> Control-flow Enforcement (CET) is a new Intel processor feature that blocks
> return/jump-oriented programming attacks. Details can be found in "Intel
> 64 and IA-32 Architectures Software Developer's Manual" [1].
>
> This series depends on the XSAVES supervisor state series that was split
> out and submitted earlier [2].

...

> Yu-cheng Yu (25):
> x86/cpufeatures: Add CET CPU feature flags for Control-flow
> Enforcement Technology (CET)
> x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states

How would people feel about taking the above two patches (02 and 03 in the
series) through the KVM tree to enable KVM virtualization of CET before the
kernel itself gains CET support? I.e. add the MSR and feature bits, along
with the XSAVES context switching. The feature definitions could use "" to
suppress displaying them in /proc/cpuinfo to avoid falsely advertising CET
to userspace.

AIUI, there are ABI issues that need to be sorted out, and that is likely
going to drag on for some time.

Is this a "hell no" sort of idea, or something that would be feasible if we
can show that there are no negative impacts to the kernel?

2020-07-23 16:42:29

by Dave Hansen

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On 7/23/20 9:25 AM, Sean Christopherson wrote:
> How would people feel about taking the above two patches (02 and 03 in the
> series) through the KVM tree to enable KVM virtualization of CET before the
> kernel itself gains CET support? I.e. add the MSR and feature bits, along
> with the XSAVES context switching. The feature definitions could use "" to
> suppress displaying them in /proc/cpuinfo to avoid falsely advertising CET
> to userspace.
>
> AIUI, there are ABI issues that need to be sorted out, and that is likely
> going to drag on for some time.
>
> Is this a "hell no" sort of idea, or something that would be feasible if we
> can show that there are no negative impacts to the kernel?

Negative impacts like bloating every task->fpu with XSAVE state that
will never get used? ;)

I thought KVM had its own vcpu->arch.guest_fpu buffers which mirrored
the size and format of task->fpu. Can we have KVM support today without
task->fpu support? I see some XSS munging in the KVM code so I think
this might be *possible*, but I don't see all of the plumbing that would
make it actually work.

2020-07-23 16:57:34

by Sean Christopherson

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Thu, Jul 23, 2020 at 09:41:37AM -0700, Dave Hansen wrote:
> On 7/23/20 9:25 AM, Sean Christopherson wrote:
> > How would people feel about taking the above two patches (02 and 03 in the
> > series) through the KVM tree to enable KVM virtualization of CET before the
> > kernel itself gains CET support? I.e. add the MSR and feature bits, along
> > with the XSAVES context switching. The feature definitions could use "" to
> > suppress displaying them in /proc/cpuinfo to avoid falsely advertising CET
> > to userspace.
> >
> > AIUI, there are ABI issues that need to be sorted out, and that is likely
> > going to drag on for some time.
> >
> > Is this a "hell no" sort of idea, or something that would be feasible if we
> > can show that there are no negative impacts to the kernel?
>
> Negative impacts like bloating every task->fpu with XSAVE state that
> will never get used? ;)

Gah, should have qualified that with "meaningful or measurable negative
impacts". E.g. the extra 40 bytes for CET XSAVE state seems like it would
be acceptable overhead, but noticeably increasing the latency of XSAVES
and/or XRSTORS would not be acceptable.

> I thought KVM had its own vcpu->arch.guest_fpu buffers which mirrored
> the size and format of task->fpu. Can we have KVM support today without
> task->fpu support? I see some XSS munging in the KVM code so I think
> this might be *possible*, but I don't see all of the plumbing that would
> make it actually work.

It'd be possible, but long term I don't think it's a good idea for KVM to
diverge from the kernel's FPU support, i.e. fully converting KVM to its own
implementation will likely lead to pain and maintenance problems. Without
fully converting KVM to a custom implementation, adding one-off support for
CET would be a massive hack job.

2020-07-23 18:45:25

by Dave Hansen

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On 7/23/20 9:56 AM, Sean Christopherson wrote:
> On Thu, Jul 23, 2020 at 09:41:37AM -0700, Dave Hansen wrote:
>> On 7/23/20 9:25 AM, Sean Christopherson wrote:
>>> How would people feel about taking the above two patches (02 and 03 in the
>>> series) through the KVM tree to enable KVM virtualization of CET before the
>>> kernel itself gains CET support? I.e. add the MSR and feature bits, along
>>> with the XSAVES context switching. The feature definitions could use "" to
>>> suppress displaying them in /proc/cpuinfo to avoid falsely advertising CET
>>> to userspace.
>>>
>>> AIUI, there are ABI issues that need to be sorted out, and that is likely
>>> going to drag on for some time.
>>>
>>> Is this a "hell no" sort of idea, or something that would be feasible if we
>>> can show that there are no negative impacts to the kernel?
>> Negative impacts like bloating every task->fpu with XSAVE state that
>> will never get used? ;)
> Gah, should have qualified that with "meaningful or measurable negative
> impacts". E.g. the extra 40 bytes for CET XSAVE state seems like it would
> be acceptable overhead, but noticeably increasing the latency of XSAVES
> and/or XRSTORS would not be acceptable.

It's 40 bytes, but it's 40 bytes of just pure, unadulterated waste. It
would have no *chance* of being used. It's also quite precisely
measurable on a given system:

cat /proc/slabinfo | grep task_struct | awk '{print $3 * 40}'
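
The same arithmetic as the one-liner, as a minimal C sketch. It assumes the usual /proc/slabinfo line layout (name, active_objs, num_objs, ...), so the third awk field is num_objs; cet_overhead_from_slabinfo_line() is a hypothetical helper, and the 40-byte figure is the one quoted in this thread.

```c
#include <stdio.h>
#include <string.h>

#define CET_XSAVE_BYTES 40UL /* per-task CET XSAVE state size quoted above */

/*
 * Parse one /proc/slabinfo line. Returns num_objs * 40 for the
 * task_struct cache, or -1 for any other (or malformed) line.
 */
static long cet_overhead_from_slabinfo_line(const char *line)
{
	char name[64];
	unsigned long active_objs, num_objs;

	if (sscanf(line, "%63s %lu %lu", name, &active_objs, &num_objs) != 3)
		return -1;
	if (strcmp(name, "task_struct") != 0)
		return -1;
	return (long)(num_objs * CET_XSAVE_BYTES);
}
```

For example, a cache holding 1500 task_struct objects would account for 60000 bytes of CET state under this estimate.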

I don't expect it would do *much* to XSAVE/XRSTOR. There's probably an
extra conditional and jump in the ucode, but that's probably in the
noise. I assume that all the CET state has functioning init and
modified trackers and we don't do anything to spoil their state. It
would be good to check that in practice, though it probably isn't the
end of the world either way. We've had some bugs in the past where we
accidentally took things out of their init state.

It will make signal entry/return slower since we use a plain XSAVE
without the init optimization. But, that's just a single cacheline on
average and some 0's to write. Probably not noticeable, including the
40 bytes of extra userspace signal stack space.

I think that puts me in the "mildly annoyed" camp more than "hell no",
but "mildly annoyed" is pretty much my resting state, so it doesn't
really move the needle. :)

Why the urgency, though?

https://windows-internals.com/cet-on-windows/

?

2020-07-24 03:44:02

by Yu-cheng Yu

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Thu, 2020-07-23 at 11:41 -0700, Dave Hansen wrote:
> On 7/23/20 9:56 AM, Sean Christopherson wrote:
> > On Thu, Jul 23, 2020 at 09:41:37AM -0700, Dave Hansen wrote:
> > > On 7/23/20 9:25 AM, Sean Christopherson wrote:
> > > > How would people feel about taking the above two patches (02 and 03 in the
> > > > series) through the KVM tree to enable KVM virtualization of CET before the
> > > > kernel itself gains CET support? I.e. add the MSR and feature bits, along
> > > > with the XSAVES context switching. The feature definitions could use "" to
> > > > suppress displaying them in /proc/cpuinfo to avoid falsely advertising CET
> > > > to userspace.
> > > >
> > > > AIUI, there are ABI issues that need to be sorted out, and that is likely
> > > > going to drag on for some time.
> > > >
> > > > Is this a "hell no" sort of idea, or something that would be feasible if we
> > > > can show that there are no negative impacts to the kernel?
> > > Negative impacts like bloating every task->fpu with XSAVE state that
> > > will never get used? ;)
> > Gah, should have qualified that with "meaningful or measurable negative
> > impacts". E.g. the extra 40 bytes for CET XSAVE state seems like it would
> > be acceptable overhead, but noticeably increasing the latency of XSAVES
> > and/or XRSTORS would not be acceptable.
>
> It's 40 bytes, but it's 40 bytes of just pure, unadulterated waste. It
> would have no *chance* of being used. It's also quite precisely
> measurable on a given system:
>
> cat /proc/slabinfo | grep task_struct | awk '{print $3 * 40}'

If there is value in getting these two patches merged first, we can move
XFEATURE_MASK_CET_USER to XFEATURE_MASK_SUPERVISOR_UNSUPPORTED for now, until
CET is eventually merged. That way, there is no space wasted.

Yu-cheng


2020-07-24 04:52:24

by Sean Christopherson

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Thu, Jul 23, 2020 at 08:40:33PM -0700, Yu-cheng Yu wrote:
> On Thu, 2020-07-23 at 11:41 -0700, Dave Hansen wrote:
> > On 7/23/20 9:56 AM, Sean Christopherson wrote:
> > > On Thu, Jul 23, 2020 at 09:41:37AM -0700, Dave Hansen wrote:
> > > > On 7/23/20 9:25 AM, Sean Christopherson wrote:
> > > > > How would people feel about taking the above two patches (02 and 03 in the
> > > > > series) through the KVM tree to enable KVM virtualization of CET before the
> > > > > kernel itself gains CET support? I.e. add the MSR and feature bits, along
> > > > > with the XSAVES context switching. The feature definitions could use "" to
> > > > > suppress displaying them in /proc/cpuinfo to avoid falsely advertising CET
> > > > > to userspace.
> > > > >
> > > > > AIUI, there are ABI issues that need to be sorted out, and that is likely
> > > > > going to drag on for some time.
> > > > >
> > > > > Is this a "hell no" sort of idea, or something that would be feasible if we
> > > > > can show that there are no negative impacts to the kernel?
> > > > Negative impacts like bloating every task->fpu with XSAVE state that
> > > > will never get used? ;)
> > > Gah, should have qualified that with "meaningful or measurable negative
> > > impacts". E.g. the extra 40 bytes for CET XSAVE state seems like it would
> > > be acceptable overhead, but noticeably increasing the latency of XSAVES
> > > and/or XRSTORS would not be acceptable.
> >
> > It's 40 bytes, but it's 40 bytes of just pure, unadulterated waste. It
> > would have no *chance* of being used. It's also quite precisely
> > measurable on a given system:
> >
> > cat /proc/slabinfo | grep task_struct | awk '{print $3 * 40}'
>
> If there is value in getting these two patches merged first, we can move
> XFEATURE_MASK_CET_USER to XFEATURE_MASK_SUPERVISOR_UNSUPPORTED for now, until
> CET is eventually merged. That way, there is no space wasted.

Merging them as disabled wouldn't help; KVM needs the features enabled so
that context switching the guest state works as expected.

2020-07-24 05:01:01

by Sean Christopherson

Subject: Re: [PATCH v10 00/26] Control-flow Enforcement: Shadow Stack

On Thu, Jul 23, 2020 at 11:41:55AM -0700, Dave Hansen wrote:
> On 7/23/20 9:56 AM, Sean Christopherson wrote:
> > On Thu, Jul 23, 2020 at 09:41:37AM -0700, Dave Hansen wrote:
> >> On 7/23/20 9:25 AM, Sean Christopherson wrote:
> >>> How would people feel about taking the above two patches (02 and 03 in the
> >>> series) through the KVM tree to enable KVM virtualization of CET before the
> >>> kernel itself gains CET support? I.e. add the MSR and feature bits, along
> >>> with the XSAVES context switching. The feature definitions could use "" to
> >>> suppress displaying them in /proc/cpuinfo to avoid falsely advertising CET
> >>> to userspace.
> >>>
> >>> AIUI, there are ABI issues that need to be sorted out, and that is likely
> >>> going to drag on for some time.
> >>>
> >>> Is this a "hell no" sort of idea, or something that would be feasible if we
> >>> can show that there are no negative impacts to the kernel?
> >> Negative impacts like bloating every task->fpu with XSAVE state that
> >> will never get used? ;)
> > Gah, should have qualified that with "meaningful or measurable negative
> > impacts". E.g. the extra 40 bytes for CET XSAVE state seems like it would
> > be acceptable overhead, but noticeably increasing the latency of XSAVES
> > and/or XRSTORS would not be acceptable.
>
> It's 40 bytes, but it's 40 bytes of just pure, unadulterated waste. It
> would have no *chance* of being used. It's also quite precisely

Well, technically the guest would be using that space :-).

> measurable on a given system:
>
> cat /proc/slabinfo | grep task_struct | awk '{print $3 * 40}'
>
> I don't expect it would do *much* to XSAVE/XRSTOR. There's probably an
> extra conditional and jump in the ucode, but that's probably in the
> noise. I assume that all the CET state has functioning init and
> modified trackers and we don't do anything to spoil their state. It
> would be good to check that in practice, though it probably isn't the
> end of the world either way. We've had some bugs in the past where we
> accidentally took things out of their init state.
>
> It will make signal entry/return slower since we use a plain XSAVE
> without the init optimization. But, that's just a single cacheline on
> average and some 0's to write. Probably not noticeable, including the
> 40 bytes of extra userspace signal stack space.
>
> I think that puts me in the "mildly annoyed" camp more than "hell no",
> but "mildly annoyed" is pretty much my resting state, so it doesn't
> really move the needle. :)
>
> Why the urgency, though?
>
> https://windows-internals.com/cet-on-windows/
>
> ?

No urgency, it'd simply be one less KVM feature for us to be carrying and
refreshing. And as a sort of general question, I was curious if folks
would be open to merging KVM support before kernel.
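
For reference, the slabinfo one-liner quoted above can be sketched in Python
as follows. This is only an illustration of the arithmetic (number of
task_struct objects times 40 bytes of CET XSAVE state); the function name and
parameters are hypothetical, and it matches the cache name exactly rather than
by substring as grep does.

```python
def cet_xsave_overhead_bytes(slabinfo_text, per_task_bytes=40, cache="task_struct"):
    """Estimate bytes spent on unused CET XSAVE state across all tasks.

    slabinfo_text is the content of /proc/slabinfo. per_task_bytes defaults
    to the 40 bytes discussed in the thread. Returns None if the cache is
    not listed.
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Data rows look like: <name> <active_objs> <num_objs> <objsize> ...
        # The third column (awk's $3) is num_objs.
        if len(fields) >= 3 and fields[0] == cache:
            return int(fields[2]) * per_task_bytes
    return None
```

On a live system the input would come from reading /proc/slabinfo, which
usually requires root.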