2022-11-04 22:51:59

by Edgecombe, Rick P

Subject: [PATCH v3 00/37] Shadow stacks for userspace

Hi,

This series implements Shadow Stacks for userspace using x86's Control-flow
Enforcement Technology (CET). CET consists of two related security features:
Shadow Stacks and Indirect Branch Tracking. This series implements just the
Shadow Stack part of this feature, and just for userspace.

The main use case for shadow stack is providing protection against return
oriented programming attacks. It works by maintaining a secondary (shadow)
stack using a special memory type that has protections against modification.
When executing a CALL instruction, the processor pushes the return address to
both the normal stack and to the special permissioned shadow stack. Upon RET,
the processor pops the shadow stack copy and compares it to the normal stack
copy. For more details, see the cover letter from v1 [0].

Thanks to all the reviewers of v2 [1]. There was a lot of very helpful
feedback. For v3 there are a lot of small changes, but not really any big ones.
I think the only remaining unresolved big issue is what to do about the
existing binaries that will fail if glibc is updated to utilize the shadow
stack kernel support. I am honestly not sure what is the right way forward.
More discussion on this below.

Other notable changes were:
- Dropping sigaltshstk support. It sounded like this could be a future
enhancement.
- Remove AMD patch (Thanks for Tested-by from John Allen)
- Promote ptrace/criu patches from OPTIONAL. If we might not have a new shadow
stack bit on day 1, we should give ptrace users something to work with.
- Detangle arch_prctl() numbers from LAM

Smaller changes are in the patches after the break.

This is off of v6.1-rc3 and this cleanup series [2]. Find the full tree here
[3].


Existing package compatibility problems
=======================================

This feature has a history of compatibility issues with existing userspace due
to the userspace enabling landing upstream ahead of the kernel support. The
first major issue encountered was the classic bad kernel regression - that some
existing distros would fail to boot on CET enabled kernels. This was due to
upstream glibc targeting an old abandoned CET kernel interface. It was resolved
with v1 of this series by switching the kernel enabling interface so old glibc
can’t find it, and is no longer an issue.

However, there are still some lesser compatibility issues that are worth
discussing, and that the kernel could possibly help avoid. These are around
apps being marked as shadow stack compatible when they actually are not.

When a binary is compatible with shadow stack it is supposed to be marked with
a specific elf bit. The design of the shadow stack implementation is that
glibc will detect this bit, and call kernel APIs to enable shadow stack.

Upstream glibc does not yet know how to do this. So the kernel’s shadow stack
implementation, and any compatibility issues, will remain dormant until these
CET glibc changes make it there. But many application binaries with the bit
marked exist today, and critically, it was applied widely and automatically by
some popular distro builds without verification that the packages actually
support shadow stack. So when glibc is updated, shadow stack will suddenly turn
on very widely with some missing verification.

In an ideal world this would be ok, because glibc has resolved many of the
shadow stack violating conditions internally. So as long as apps stick to the
normal usage of the glibc implementations for doing exotic stack things, then
apps *should* just work. However, in the real world there are apps that don't
stick to this. JITs in particular can violate the shadow stack enforcement.

In internal testing we have found that one popular package, node.js, crashes
on startup. It's unknown how many other apps would crash with more complicated
usage than a basic startup test. My assumption is that there are more that
would.

The other compatibility issue that comes from the widespread presence of this
elf bit involves ptrace-using applications. Some, like CRIU, do unusual
shadow-stack-violating things to the seized process as part of basic operation.
So while the application may not have issues in itself, they run into trouble
working with shadow stack enabled tracees. Others, like GDB, would fail when
doing some limited specific things like the "call a function" operation. While
it’s not unusual to have a new feature break saving and restoring an individual
target app, it is a bit unusual to have a new feature break CRIU usage for most
apps on the system. The kernel changes required for a fixed CRIU and GDB are
included in this series, but the userspace fixes are not upstream.


Blocking CET for the existing binaries
======================================

So we are not talking about a traditional kernel regression, where a fresh
kernel update breaks userspace. Instead we are talking about a userspace
component choosing to break existing apps by using new kernel functionality.
I’m not sure if it is the kernel’s job to stop this or not. But the kernel
actually could. It could detect the shadow stack elf bit and then later return
failure for the APIs that enable shadow stack. This would result in these apps
simply running normally without shadow stack.

Florian Weimer points out that this is a bit nasty, and I have to agree. I
think the workaround for this belongs in glibc. The best thing would be to pick
a new elf bit and have glibc look for this new one instead when deciding to
call the kernel shadow stack APIs. Then the old binaries could continue to work
without shadow stack, and new, more highly tested ones could be marked with the
new bit. Since there would then be kernel support and also there is a lot of
supporting HW out there, any CET enabling issues would be caught much earlier
when starting over with a new bit. So you could have a normal slow rollout
instead of a big bang.

But it doesn’t seem like the glibc developers are interested in working on a
solution. So I included a patch to do the detection and disable on the kernel
side and marked it RFC.

Distros could easily remove any kernel side check, and they can also fix
userspace regressions with package updates, or even before they turn on shadow
stack for their kernels. So we are talking about protecting users who will want
to use bleeding edge kernels and glibcs they build themselves. Probably not the
biggest category of users, but a helpful one to upstream developers.

This RFC patch also includes the ability for the kernel to allow broken
binaries by Kconfig, gated by CONFIG_EXPERT. So with this kernel patch, a user
who wants to try out CET would end up reading the Kconfig option and
understanding the situation before encountering breakage. But having this
release valve also weakens the forcing function to drive creation of a new elf
bit. So if that never happens, allowing broken binaries (ones with the existing
elf bit) would probably eventually have to become the default.

Another option might be a sysctl knob to toggle allowing these binaries instead
of a Kconfig option.

[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://github.com/rpedgeco/linux/tree/user_shstk_v3


Kirill A. Shutemov (1):
x86: Introduce userspace API for CET enabling

Mike Rapoport (1):
x86/cet/shstk: Add ARCH_CET_UNLOCK

Rick Edgecombe (10):
x86/fpu: Add helper for modifying xstate
mm: Don't allow write GUPs to shadow stack memory
mm: Warn on shadow stack memory in wrong vma
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Support wrss for userspace
x86: Expose thread features in /proc/$PID/status
x86/cet/shstk: Wire in CET interface
selftests/x86: Add shadow stack test
x86/fpu: Add helper for initing features
fs/binfmt_elf: Block old shstk elf bit

Yu-cheng Yu (25):
Documentation/x86: Add CET description
x86/cet/shstk: Add Kconfig option for Shadow Stack
x86/cpufeatures: Add CPU feature flags for shadow stacks
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
x86/cet: Add user control-protection fault handler
x86/mm: Remove _PAGE_DIRTY from kernel RO pages
x86/mm: Move pmd_write(), pud_write() up in the file
x86/mm: Introduce _PAGE_COW
x86/mm: Update pte_modify for _PAGE_COW
x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
transition from _PAGE_DIRTY to _PAGE_COW
mm: Move VM_UFFD_MINOR_BIT from 37 to 38
mm: Introduce VM_SHADOW_STACK for shadow stack memory
x86/mm: Check Shadow Stack page fault errors
x86/mm: Update maybe_mkwrite() for shadow stack
mm: Fixup places that call pte_mkwrite() directly
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm/mprotect: Exclude shadow stack from preserve_write
mm: Re-introduce vm_flags to do_mmap()
x86/shstk: Add user-mode shadow stack support
x86/shstk: Handle thread shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle signals for shadow stack
x86/cet: Add PTRACE interface for CET

Documentation/filesystems/proc.rst | 1 +
Documentation/x86/cet.rst | 151 +++++
Documentation/x86/index.rst | 1 +
arch/arm/kernel/signal.c | 2 +-
arch/arm64/include/asm/elf.h | 5 +
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/Kconfig | 37 ++
arch/x86/Kconfig.assembler | 5 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/ia32/ia32_signal.c | 1 +
arch/x86/include/asm/cet.h | 42 ++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 17 +-
arch/x86/include/asm/elf.h | 11 +
arch/x86/include/asm/fpu/api.h | 9 +
arch/x86/include/asm/fpu/regset.h | 7 +-
arch/x86/include/asm/fpu/sched.h | 3 +-
arch/x86/include/asm/fpu/types.h | 14 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/include/asm/mmu_context.h | 2 +
arch/x86/include/asm/msr-index.h | 5 +
arch/x86/include/asm/msr.h | 11 +
arch/x86/include/asm/pgtable.h | 321 ++++++++--
arch/x86/include/asm/pgtable_types.h | 65 +-
arch/x86/include/asm/processor.h | 9 +
arch/x86/include/asm/special_insns.h | 13 +
arch/x86/include/asm/trap_pf.h | 2 +
arch/x86/include/uapi/asm/mman.h | 3 +
arch/x86/include/uapi/asm/prctl.h | 11 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cpu/common.c | 35 +-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/cpu/proc.c | 23 +
arch/x86/kernel/fpu/core.c | 60 +-
arch/x86/kernel/fpu/regset.c | 90 +++
arch/x86/kernel/fpu/xstate.c | 148 +++--
arch/x86/kernel/fpu/xstate.h | 6 +
arch/x86/kernel/idt.c | 2 +-
arch/x86/kernel/process.c | 18 +-
arch/x86/kernel/process_64.c | 41 +-
arch/x86/kernel/ptrace.c | 20 +
arch/x86/kernel/shstk.c | 499 +++++++++++++++
arch/x86/kernel/signal.c | 7 +
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 107 +++-
arch/x86/mm/fault.c | 26 +
arch/x86/mm/mmap.c | 23 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 6 +
arch/x86/xen/enlighten_pv.c | 2 +-
arch/x86/xen/xen-asm.S | 2 +-
fs/aio.c | 2 +-
fs/binfmt_elf.c | 24 +-
fs/proc/array.c | 6 +
fs/proc/task_mmu.c | 3 +
include/linux/elf.h | 6 +
include/linux/mm.h | 37 +-
include/linux/pgtable.h | 35 ++
include/linux/proc_fs.h | 2 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/asm-generic/unistd.h | 2 +-
include/uapi/linux/elf.h | 16 +
ipc/shm.c | 2 +-
kernel/sys_ni.c | 1 +
mm/gup.c | 2 +-
mm/huge_memory.c | 19 +-
mm/memory.c | 7 +-
mm/migrate_device.c | 4 +-
mm/mmap.c | 19 +-
mm/mprotect.c | 7 +
mm/nommu.c | 4 +-
mm/userfaultfd.c | 10 +-
mm/util.c | 2 +-
tools/testing/selftests/x86/Makefile | 4 +-
.../testing/selftests/x86/test_shadow_stack.c | 574 ++++++++++++++++++
80 files changed, 2502 insertions(+), 179 deletions(-)
create mode 100644 Documentation/x86/cet.rst
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/kernel/shstk.c
create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

--
2.17.1



2022-11-04 22:52:45

by Edgecombe, Rick P

Subject: [PATCH v3 14/37] mm: Introduce VM_SHADOW_STACK for shadow stack memory

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
two cases are handled differently for page faults. Introduce
VM_SHADOW_STACK to track shadow stack VMAs.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Drop arch specific change in arch_vma_name(). The memory can show as
anonymous (Kirill)
- Change CONFIG_ARCH_HAS_SHADOW_STACK to CONFIG_X86_USER_SHADOW_STACK
in show_smap_vma_flags() (Boris)

Documentation/filesystems/proc.rst | 1 +
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 8 ++++++++
3 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 898c99eae8e4..05506dfa0480 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -560,6 +560,7 @@ encoded manner. The codes are the following:
mt arm64 MTE allocation tags are enabled
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
+ ss shadow stack page
== =======================================

Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a74cdcc9af0..7dee7afbb01b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -703,6 +703,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ [ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5314ad0a342d..42c4e4bc972d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -314,11 +314,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -334,6 +336,12 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+# define VM_SHADOW_STACK VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK VM_NONE
+#endif
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
--
2.17.1


2022-11-04 22:52:51

by Edgecombe, Rick P

Subject: [PATCH v3 03/37] x86/cpufeatures: Add CPU feature flags for shadow stacks

From: Yu-cheng Yu <[email protected]>

The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add a CPU feature flag for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).

To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.

The shadow stack feature, enumerated by the CPUID bit described above,
encompasses both supervisor and userspace support for shadow stack. In
near future patches, only userspace shadow stack will be enabled. In
expectation of future supervisor shadow stack support, create a software
CPU capability to enumerate kernel utilization of userspace shadow stack
support. This will also allow for userspace shadow stack to be disabled,
while leaving the shadow stack hardware capability exposed in the cpuinfo
proc. This user shadow stack bit should depend on the HW "shstk"
capability and that logic will be implemented in future patches.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Add user specific shadow stack cpu cap (Andrew Cooper)
- Drop reviewed-bys from Boris and Kees due to the above change.

v2:
- Remove IBT reference in commit log (Kees)
- Describe xsaves dependency using text from (Dave)

v1:
- Remove IBT, can be added in a follow on IBT series.

Yu-cheng v25:
- Make X86_FEATURE_IBT depend on X86_FEATURE_SHSTK.


arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/include/asm/disabled-features.h | 9 ++++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index b71f4f2ecdd5..5626ecb8a080 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -304,6 +304,7 @@
#define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */
#define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB during runtime firmware calls */
#define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
+#define X86_FEATURE_USER_SHSTK (11*32+18) /* Shadow stack support for user mode applications */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
@@ -365,6 +366,7 @@
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 33d2cd04d254..30cd12905499 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -87,6 +87,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define DISABLE_USER_SHSTK 0
+#else
+#define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -101,7 +107,8 @@
#define DISABLED_MASK8 (DISABLE_TDX_GUEST)
#define DISABLED_MASK9 (DISABLE_SGX)
#define DISABLED_MASK10 0
-#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET)
+#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
+ DISABLE_USER_SHSTK)
#define DISABLED_MASK12 0
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..bf1b55a1ba21 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -78,6 +78,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
{}
};

--
2.17.1


2022-11-04 22:53:11

by Edgecombe, Rick P

Subject: [PATCH v3 05/37] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states

From: Yu-cheng Yu <[email protected]>

Shadow stack register state can be managed with XSAVE. The registers
can logically be separated into two groups:
* Registers controlling user-mode operation
* Registers controlling kernel-mode operation

The architecture has two new XSAVE state components: one for each of
those groups of registers. This lets an OS manage them separately if
it chooses. Future patches for host userspace and KVM guests will only
utilize the user-mode registers, so only configure XSAVE to save
user-mode registers. This state will add 16 bytes to the xsave buffer
size.

Future patches will use the user-mode XSAVE area to save guest user-mode
CET state. However, VMCS includes new fields for guest CET supervisor
states. KVM can use these to save and restore guest supervisor state, so
host supervisor XSAVE support is not required.

Adding this exacerbates the already unwieldy if statement in
check_xstate_against_struct() that handles warning about un-implemented
xfeatures. So refactor these checks by having XCHECK_SZ() set a bool when
it actually checks the xfeature. This ends up exceeding 80 chars, but was
better on balance than other options explored. Pass the bool as pointer to
make it clear that XCHECK_SZ() can change the variable.

While configuring user-mode XSAVE, clarify kernel-mode registers are not
managed by XSAVE by defining the xfeature in
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT.
This serves more of a documentation as code purpose, and functionally,
only enables a few safety checks.

Both XSAVE state components are supervisor states, even the state
controlling user-mode operation. This is a departure from earlier features
like protection keys where the PKRU state is a normal user
(non-supervisor) state. Having the user state be supervisor-managed
ensures there is no direct, unprivileged access to it, making it harder
for an attacker to subvert CET.

To facilitate this privileged access, define the two user-mode CET MSRs,
and the bits defined in those MSRs relevant to future shadow stack
enablement patches.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Add missing "is" in commit log (Boris)
- Change to case statement for struct size checking (Boris)
- Adjust commas on xfeature_names (Kees, Boris)

v2:
- Change name to XFEATURE_CET_KERNEL_UNUSED (peterz)

KVM refresh:
- Reword commit log using some verbiage posted by Dave Hansen
- Remove unlikely to be used supervisor cet xsave struct
- Clarify that supervisor cet state is not saved by xsave
- Remove unused supervisor MSRs

v1:
- Remove outdated reference to sigreturn checks on msr's.

arch/x86/include/asm/fpu/types.h | 14 ++++-
arch/x86/include/asm/fpu/xstate.h | 6 ++-
arch/x86/kernel/fpu/xstate.c | 90 +++++++++++++++----------------
3 files changed, 59 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb7cd1139d97..344baad02b97 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
XFEATURE_PKRU,
XFEATURE_PASID,
- XFEATURE_RSRVD_COMP_11,
- XFEATURE_RSRVD_COMP_12,
+ XFEATURE_CET_USER,
+ XFEATURE_CET_KERNEL_UNUSED,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -138,6 +138,8 @@ enum xfeature {
#define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -252,6 +254,14 @@ struct pkru_state {
u32 pad;
} __packed;

+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+ u64 user_cet; /* user control-flow settings */
+ u64 user_ssp; /* user shadow stack pointer */
+};
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..d4427b88ee12 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -50,7 +50,8 @@
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA

/* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_CET_USER)

/*
* A supervisor state component may not always contain valuable information,
@@ -77,7 +78,8 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+ XFEATURE_MASK_CET_KERNEL)

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 59e543b95a3c..959d4dd64434 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -39,26 +39,26 @@
*/
static const char *xfeature_names[] =
{
- "x87 floating point registers" ,
- "SSE registers" ,
- "AVX registers" ,
- "MPX bounds registers" ,
- "MPX CSR" ,
- "AVX-512 opmask" ,
- "AVX-512 Hi256" ,
- "AVX-512 ZMM_Hi256" ,
- "Processor Trace (unused)" ,
+ "x87 floating point registers",
+ "SSE registers",
+ "AVX registers",
+ "MPX bounds registers",
+ "MPX CSR",
+ "AVX-512 opmask",
+ "AVX-512 Hi256",
+ "AVX-512 ZMM_Hi256",
+ "Processor Trace (unused)",
"Protection Keys User registers",
"PASID state",
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "AMX Tile config" ,
- "AMX Tile data" ,
- "unknown xstate feature" ,
+ "Control-flow User registers",
+ "Control-flow Kernel registers (unused)",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "AMX Tile config",
+ "AMX Tile data",
+ "unknown xstate feature",
};

static unsigned short xsave_cpuid_features[] __initdata = {
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_PKU,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -276,6 +277,7 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
+ print_xstate_feature(XFEATURE_MASK_CET_USER);
print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}
@@ -344,6 +346,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDREGS | \
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_CET_USER | \
XFEATURE_MASK_XTILE)

/*
@@ -446,14 +449,15 @@ static void __init __xstate_dump_leaves(void)
} \
} while (0)

-#define XCHECK_SZ(sz, nr, nr_macro, __struct) do { \
- if ((nr == nr_macro) && \
- WARN_ONCE(sz != sizeof(__struct), \
- "%s: struct is %zu bytes, cpu state %d bytes\n", \
- __stringify(nr_macro), sizeof(__struct), sz)) { \
+#define XCHECK_SZ(sz, nr, __struct) ({ \
+ if (WARN_ONCE(sz != sizeof(__struct), \
+ "[%s]: struct is %zu bytes, cpu state %d bytes\n", \
+ xfeature_names[nr], sizeof(__struct), sz)) { \
__xstate_dump_leaves(); \
} \
-} while (0)
+ true; \
+})
+

/**
* check_xtile_data_against_struct - Check tile data state size.
@@ -527,37 +531,29 @@ static bool __init check_xstate_against_struct(int nr)
* Ask the CPU for the size of the state.
*/
int sz = xfeature_size(nr);
+
/*
* Match each CPU state with the corresponding software
* structure.
*/
- XCHECK_SZ(sz, nr, XFEATURE_YMM, struct ymmh_struct);
- XCHECK_SZ(sz, nr, XFEATURE_BNDREGS, struct mpx_bndreg_state);
- XCHECK_SZ(sz, nr, XFEATURE_BNDCSR, struct mpx_bndcsr_state);
- XCHECK_SZ(sz, nr, XFEATURE_OPMASK, struct avx_512_opmask_state);
- XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
- XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
- XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
- XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
- XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
-
- /* The tile data size varies between implementations. */
- if (nr == XFEATURE_XTILE_DATA)
- check_xtile_data_against_struct(sz);
-
- /*
- * Make *SURE* to add any feature numbers in below if
- * there are "holes" in the xsave state component
- * numbers.
- */
- if ((nr < XFEATURE_YMM) ||
- (nr >= XFEATURE_MAX) ||
- (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
- ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+ switch (nr) {
+ case XFEATURE_YMM: return XCHECK_SZ(sz, nr, struct ymmh_struct);
+ case XFEATURE_BNDREGS: return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
+ case XFEATURE_BNDCSR: return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
+ case XFEATURE_OPMASK: return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
+ case XFEATURE_ZMM_Hi256: return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
+ case XFEATURE_Hi16_ZMM: return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
+ case XFEATURE_PKRU: return XCHECK_SZ(sz, nr, struct pkru_state);
+ case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
+ case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
+ case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
+ default:
WARN_ONCE(1, "no structure for xstate: %d\n", nr);
XSTATE_WARN_ON(1);
return false;
}
+
return true;
}

--
2.17.1


2022-11-04 22:53:29

by Edgecombe, Rick P

Subject: [PATCH v3 34/37] x86/fpu: Add helper for initing features

If an xfeature is saved in a buffer, the xfeature's bit will be set in
xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
is in its init state. In this case the xfeature buffer address cannot
be retrieved with get_xsave_addr().

Future patches will need to handle the case of writing to an xfeature
that may not be saved. So provide helpers to init an xfeature in an
xsave buffer.

This could of course be done directly by reaching into the xsave buffer,
however this would not be robust against future changes to optimize the
xsave buffer by compacting it. In that case the xsave buffer would need
to be re-arranged as well. So the logic belongs in a helper where it can
be unified.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v2:
- New patch

arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
arch/x86/kernel/fpu/xstate.h | 6 ++++
2 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 959d4dd64434..665737559a1f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -934,6 +934,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
}

+static int xsave_buffer_access_checks(int xfeature_nr)
+{
+ /*
+ * Do we even *have* xsave state?
+ */
+ if (!boot_cpu_has(X86_FEATURE_XSAVE))
+ return 1;
+
+ /*
+ * We should not ever be requesting features that we
+ * have not enabled.
+ */
+ if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ return 1;
+
+ return 0;
+}
+
/*
* Given the xsave area and a state inside, this function returns the
* address of the state.
@@ -954,17 +972,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
*/
void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
{
- /*
- * Do we even *have* xsave state?
- */
- if (!boot_cpu_has(X86_FEATURE_XSAVE))
- return NULL;
-
- /*
- * We should not ever be requesting features that we
- * have not enabled.
- */
- if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ if (xsave_buffer_access_checks(xfeature_nr))
return NULL;

/*
@@ -984,6 +992,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
return __raw_xsave_addr(xsave, xfeature_nr);
}

+/*
+ * Given the xsave area and a state inside, this function
+ * initializes an xfeature in the buffer.
+ *
+ * get_xsave_addr() will return NULL if the feature bit is
+ * not present in the header. This function will make it so
+ * the xfeature buffer address is ready to be retrieved by
+ * get_xsave_addr().
+ *
+ * Inputs:
+ * xstate: the thread's storage area for all FPU data
+ * xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
+ * XFEATURE_SSE, etc...)
+ * Output:
+ * 1 if the feature cannot be inited, 0 on success
+ */
+int init_xfeature(struct xregs_state *xsave, int xfeature_nr)
+{
+ if (xsave_buffer_access_checks(xfeature_nr))
+ return 1;
+
+ /*
+ * Mark the feature inited.
+ */
+ xsave->header.xfeatures |= BIT_ULL(xfeature_nr);
+ return 0;
+}
+
#ifdef CONFIG_ARCH_HAS_PKEYS

/*
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 5ad47031383b..fb8aae678e9f 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -54,6 +54,12 @@ extern void fpu__init_cpu_xstate(void);
extern void fpu__init_system_xstate(unsigned int legacy_size);

extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern int init_xfeature(struct xregs_state *xsave, int xfeature_nr);
+
+static inline int xfeature_saved(struct xregs_state *xsave, int xfeature_nr)
+{
+ return xsave->header.xfeatures & BIT_ULL(xfeature_nr);
+}

static inline u64 xfeatures_mask_supervisor(void)
{
--
2.17.1


2022-11-04 22:53:30

by Edgecombe, Rick P

Subject: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

From: Yu-cheng Yu <[email protected]>

Shadow stacks are normally written to via CALL/RET or specific CET
instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
operations the kernel will need to write to them directly using the
ring-0 only WRUSS instruction.

A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, which is within
the same shadow stack. This is distinctly different from other pointers
on the shadow stack, since those pointers point to executable code.

Introduce token setup and verify routines. Also introduce WRUSS, which is
a kernel-mode instruction but writes directly to user shadow stack.

In future patches that enable shadow stack to work with signals, the kernel
will need something to denote the point in the stack where sigreturn may be
called. This will prevent attackers calling sigreturn at arbitrary places
in the stack, in order to help prevent SROP attacks.

To do this, something that can only be written by the kernel needs to be
placed on the shadow stack. This can be accomplished by setting bit 63 in
the frame written to the shadow stack. Userspace return addresses can't
have this bit set as it is in the kernel range. It also can't be a
valid restore token.
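
The two encodings described above can be sketched in plain userspace C.
This is only a model under stated assumptions: it computes the values
that would be written, without WRUSS or a real shadow stack, and the
model_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_BIT63 (UINT64_C(1) << 63)

/*
 * 64-bit restore token: stored in the 8 bytes just below ssp, holding
 * the address directly above the token (ssp itself) with bit 0 set to
 * mark 64-bit mode.
 */
int model_make_rstor_token(uint64_t ssp, uint64_t *token, uint64_t *token_addr)
{
	if (ssp & 7)
		return -EINVAL;	/* token must be 8-byte aligned */

	*token_addr = ssp - 8;
	*token = ssp | 1;
	return 0;
}

/* Kernel-only data frame: the payload with bit 63 set. */
int model_encode_data(uint64_t data, uint64_t *frame)
{
	if (data & MODEL_BIT63)
		return -EINVAL;	/* payload must not already use bit 63 */

	*frame = data | MODEL_BIT63;
	return 0;
}

/* Accept only frames that carry the bit 63 marker, then strip it. */
int model_decode_data(uint64_t frame, uint64_t *data)
{
	if (!(frame & MODEL_BIT63))
		return -EINVAL;

	*data = frame & ~MODEL_BIT63;
	return 0;
}
```

In this model neither a userspace return address nor a restore token
decodes as a data frame, because both have bit 63 clear.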

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Drop shstk_check_rstor_token()
- Fail put_shstk_data() if bit 63 is set in the data (Kees)
- Add comment in create_rstor_token() (Kees)
- Pull in create_rstor_token() changes from future patch (Kees)

v2:
- Add data helpers for writing to shadow stack.

v1:
- Use xsave helpers.

Yu-cheng v30:
- Update commit log, remove description about signals.
- Update various comments.
- Remove variable 'ssp' init and adjust return value accordingly.
- Check get_user_shstk_addr() return value.
- Replace 'ia32' with 'proc32'.

arch/x86/include/asm/special_insns.h | 13 +++++
arch/x86/kernel/shstk.c | 73 ++++++++++++++++++++++++++++
2 files changed, 86 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 35f709f619fb..6d51a87aea7f 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
: [pax] "a" (p));
}

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+ asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+ _ASM_EXTABLE(1b, %l[fail])
+ :: [addr] "r" (addr), [val] "r" (val)
+ :: fail);
+ return 0;
+fail:
+ return -EFAULT;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
#define nop() asm volatile ("nop")

static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index a7a982924b9a..755b4af40413 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,8 @@
#include <asm/fpu/api.h>
#include <asm/prctl.h>

+#define SS_FRAME_SIZE 8
+
static bool features_enabled(unsigned long features)
{
return current->thread.features & features;
@@ -40,6 +42,35 @@ static void features_clr(unsigned long features)
current->thread.features &= ~features;
}

+/*
+ * Create a restore token on the shadow stack. A token is always 8 bytes
+ * and aligned to 8.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+ unsigned long addr;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(ssp, 8))
+ return -EINVAL;
+
+ addr = ssp - SS_FRAME_SIZE;
+
+ /*
+ * SSP is aligned, so reserved bits and mode bit are zero, just mark
+ * the token 64-bit.
+ */
+ ssp |= BIT(0);
+
+ if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+ return -EFAULT;
+
+ if (token_addr)
+ *token_addr = addr;
+
+ return 0;
+}
+
static unsigned long alloc_shstk(unsigned long size)
{
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
@@ -160,6 +191,48 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
return 0;
}

+static unsigned long get_user_shstk_addr(void)
+{
+ unsigned long long ssp;
+
+ fpregs_lock_and_load();
+
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+ fpregs_unlock();
+
+ return ssp;
+}
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+ if (WARN_ON_ONCE(data & BIT(63)))
+ return -EINVAL;
+
+ /*
+ * Mark the high bit so that the sigframe can't be processed as a
+ * return address.
+ */
+ if (write_user_shstk_64(addr, data | BIT(63)))
+ return -EFAULT;
+ return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+ unsigned long ldata;
+
+ if (unlikely(get_user(ldata, addr)))
+ return -EFAULT;
+
+ if (!(ldata & BIT(63)))
+ return -EINVAL;
+
+ *data = ldata & ~BIT(63);
+
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
--
2.17.1


2022-11-04 22:53:31

by Edgecombe, Rick P

Subject: [PATCH v3 28/37] x86/shstk: Handle signals for shadow stack

From: Yu-cheng Yu <[email protected]>

When a signal is handled, the context is normally pushed to the stack
before handling it. For shadow stacks, since the shadow stack only tracks
return addresses, there isn't any state that needs to be pushed. However,
there are still a few things that need to be done. These things are
visible to userspace and will be kernel ABI for shadow stacks.

One is to make sure the restorer address is written to shadow stack, since
the signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn. So add the restorer on the shadow stack
before handling the signal, so there is not a conflict when the signal
handler returns to the restorer.

The other thing to do is to place some type of checkable token on the
thread's shadow stack before handling the signal and check it during
sigreturn. This is an extra layer of protection to hamper attackers
calling sigreturn manually as in SROP-like attacks.

For this token we can use the shadow stack data format defined earlier.
Have the data pushed be the previous SSP. In the future the sigreturn
might want to return back to a different stack. Storing the SSP (instead
of a restore offset or something) allows for future functionality that
may want to restore to a different stack.

So, when handling a signal, push:
- the previous SSP, in the shadow stack data format
- the restorer address, directly below that SSP frame.

In sigreturn, verify SSP is stored in the data format and pop the shadow
stack.
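
The resulting layout can be modeled in userspace with an array standing
in for the shadow stack. This is a sketch of the scheme described above,
not the kernel implementation: struct model_shstk, the slot count and
all model_* names are hypothetical, and plain stores stand in for WRUSS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_BIT63 (UINT64_C(1) << 63)
#define MODEL_SLOTS 16

/* Simulated shadow stack: slot i lives at "address" base + 8 * i. */
struct model_shstk {
	uint64_t base;
	uint64_t ssp;
	uint64_t slot[MODEL_SLOTS];
};

static uint64_t *model_slot(struct model_shstk *s, uint64_t addr)
{
	if ((addr & 7) || addr < s->base ||
	    addr >= s->base + 8 * MODEL_SLOTS)
		return NULL;
	return &s->slot[(addr - s->base) / 8];
}

/* Signal delivery: push the old SSP as a data frame, then the restorer. */
int model_signal_setup(struct model_shstk *s, uint64_t restorer)
{
	uint64_t old_ssp = s->ssp;
	uint64_t *frame, *ret;

	if (old_ssp & 7)
		return -EINVAL;

	frame = model_slot(s, old_ssp - 8);
	ret = model_slot(s, old_ssp - 16);
	if (!frame || !ret)
		return -EFAULT;

	*frame = old_ssp | MODEL_BIT63;
	*ret = restorer;
	s->ssp = old_ssp - 16;
	return 0;
}

/* The handler's RET: the shadow stack copy must match, then it is popped. */
int model_ret(struct model_shstk *s, uint64_t ret_addr)
{
	uint64_t *top = model_slot(s, s->ssp);

	if (!top || *top != ret_addr)
		return -EFAULT;
	s->ssp += 8;
	return 0;
}

/* sigreturn: SSP must point at a data frame holding an aligned SSP. */
int model_sigreturn(struct model_shstk *s)
{
	uint64_t *frame = model_slot(s, s->ssp);
	uint64_t target;

	if (!frame)
		return -EFAULT;
	if (!(*frame & MODEL_BIT63))
		return -EINVAL;

	target = *frame & ~MODEL_BIT63;
	if (target & 7)
		return -EINVAL;

	s->ssp = target;
	return 0;
}
```

In the model, a sigreturn attempted while SSP still points at the
restorer address fails, since that slot does not carry the bit 63
marker.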

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Drop shstk_setup_rstor_token() (Kees)
- Drop x32 signal support, since x32 support is dropped

v2:
- Switch to new shstk signal format

v1:
- Use xsave helpers.
- Expand commit log.

Yu-cheng v27:
- Eliminate saving shadow stack pointer to signal context.

Yu-cheng v25:
- Update commit log/comments for the sc_ext struct.
- Use restorer address already calculated.
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
- Eliminate writing to MSR_IA32_U_CET for shadow stack.
- Change wrmsrl() to wrmsrl_safe() and handle error.

arch/x86/ia32/ia32_signal.c | 1 +
arch/x86/include/asm/cet.h | 5 ++
arch/x86/kernel/shstk.c | 98 +++++++++++++++++++++++++++++++++++++
arch/x86/kernel/signal.c | 7 +++
4 files changed, 111 insertions(+)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index c9c3859322fa..88d71b9de616 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,6 +34,7 @@
#include <asm/sigframe.h>
#include <asm/sighandling.h>
#include <asm/smap.h>
+#include <asm/cet.h>

static inline void reload_segments(struct sigcontext_32 *sc)
{
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 1a97223e7d2f..098e4ecfdf9b 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -6,6 +6,7 @@
#include <linux/types.h>

struct task_struct;
+struct ksignal;

#ifdef CONFIG_X86_USER_SHADOW_STACK
struct thread_shstk {
@@ -19,6 +20,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
unsigned long stack_size,
unsigned long *shstk_addr);
void shstk_free(struct task_struct *p);
+int setup_signal_shadow_stack(struct ksignal *ksig);
+int restore_signal_shadow_stack(void);
#else
static inline long cet_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
@@ -28,6 +31,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
unsigned long stack_size,
unsigned long *shstk_addr) { return 0; }
static inline void shstk_free(struct task_struct *p) {}
+static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
#endif /* CONFIG_X86_USER_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 755b4af40413..332b7c73a1af 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -233,6 +233,104 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
return 0;
}

+static int shstk_push_sigframe(unsigned long *ssp)
+{
+ unsigned long target_ssp = *ssp;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(*ssp, 8))
+ return -EINVAL;
+
+ if (!IS_ALIGNED(target_ssp, 8))
+ return -EINVAL;
+
+ *ssp -= SS_FRAME_SIZE;
+ if (put_shstk_data((void *__user)*ssp, target_ssp))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+ unsigned long token_addr;
+ int err;
+
+ err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+ if (unlikely(err))
+ return err;
+
+ /* Restore SSP aligned? */
+ if (unlikely(!IS_ALIGNED(token_addr, 8)))
+ return -EINVAL;
+
+ /* SSP in userspace? */
+ if (unlikely(token_addr >= TASK_SIZE_MAX))
+ return -EINVAL;
+
+ *ssp = token_addr;
+
+ return 0;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+ void __user *restorer = ksig->ka.sa.sa_restorer;
+ unsigned long ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(CET_SHSTK))
+ return 0;
+
+ if (!restorer)
+ return -EINVAL;
+
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
+ return -EINVAL;
+
+ err = shstk_push_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ /* Push restorer address */
+ ssp -= SS_FRAME_SIZE;
+ err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+ if (unlikely(err))
+ return -EFAULT;
+
+ fpregs_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+ unsigned long ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(CET_SHSTK))
+ return 0;
+
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
+ return -EINVAL;
+
+ err = shstk_pop_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ fpregs_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9c7265b524c7..be25f7dce2d5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -47,6 +47,7 @@
#include <asm/syscall.h>
#include <asm/sigframe.h>
#include <asm/signal.h>
+#include <asm/cet.h>

#ifdef CONFIG_X86_64
/*
@@ -472,6 +473,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp);
uc_flags = frame_uc_flags(regs);

+ if (setup_signal_shadow_stack(ksig))
+ return -EFAULT;
+
if (!user_access_begin(frame, sizeof(*frame)))
return -EFAULT;

@@ -675,6 +679,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
goto badframe;

+ if (restore_signal_shadow_stack())
+ goto badframe;
+
if (restore_altstack(&frame->uc.uc_stack))
goto badframe;

--
2.17.1


2022-11-04 22:53:33

by Edgecombe, Rick P

Subject: [PATCH v3 02/37] x86/cet/shstk: Add Kconfig option for Shadow Stack

From: Yu-cheng Yu <[email protected]>

Shadow Stack provides protection for applications against function return
address corruption. It is active when the processor supports it, the
kernel has CONFIG_X86_USER_SHADOW_STACK enabled, and the application is built
for the feature. This is only implemented for the 64-bit kernel. When it
is enabled, legacy non-Shadow Stack applications continue to work, but
without protection.

Since there is another feature that utilizes CET (Kernel IBT) that will
share implementation with Shadow Stacks, create CONFIG_X86_CET to signify
that at least one CET feature is configured.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Add X86_CET (Kees)
- Add back WRUSS dependency (Kees)
- Fix verbiage (Dave)
- Change from prompt to bool (Kirill)
- Add more to commit log

v2:
- Remove already wrong kernel size increase info (tglx)
- Change prompt to remove "Intel" (tglx)
- Update line about what CPUs are supported (Dave)

Yu-cheng v25:
- Remove X86_CET and use X86_SHADOW_STACK directly.

Yu-cheng v24:
- Update for the splitting X86_CET to X86_SHADOW_STACK and X86_IBT.

arch/x86/Kconfig | 24 ++++++++++++++++++++++++
arch/x86/Kconfig.assembler | 5 +++++
2 files changed, 29 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..f3d14f5accce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1852,6 +1852,11 @@ config CC_HAS_IBT
(CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
$(as-instr,endbr64)

+config X86_CET
+ def_bool n
+ help
+ CET features configured (Shadow Stack or IBT)
+
config X86_KERNEL_IBT
prompt "Indirect Branch Tracking"
bool
@@ -1859,6 +1864,7 @@ config X86_KERNEL_IBT
# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
depends on !LD_IS_LLD || LLD_VERSION >= 140000
select OBJTOOL
+ select X86_CET
help
Build the kernel with support for Indirect Branch Tracking, a
hardware support course-grain forward-edge Control Flow Integrity
@@ -1953,6 +1959,24 @@ config X86_SGX

If unsure, say N.

+config X86_USER_SHADOW_STACK
+ bool "X86 Userspace Shadow Stack"
+ depends on AS_WRUSS
+ depends on X86_64
+ select ARCH_USES_HIGH_VMA_FLAGS
+ select X86_CET
+ help
+ Shadow Stack protection is a hardware feature that detects function
+ return address corruption. This helps mitigate ROP attacks.
+ Applications must be enabled to use it, and old userspace does not
+ get protection "for free".
+
+ CPUs supporting shadow stacks were first released in 2020.
+
+ See Documentation/x86/cet.rst for more information.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08e2fc4..00c79dd93651 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -19,3 +19,8 @@ config AS_TPAUSE
def_bool $(as-instr,tpause %ecx)
help
Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
+
+config AS_WRUSS
+ def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+ help
+ Supported by binutils >= 2.31 and LLVM integrated assembler
--
2.17.1


2022-11-04 22:53:33

by Edgecombe, Rick P

Subject: [PATCH v3 19/37] mm/mmap: Add shadow stack pages to memory accounting

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Account shadow stack pages to stack memory.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
(Kirill)

v2:
- Remove is_shadow_stack_mapping() and just change it to directly bitwise
and VM_SHADOW_STACK.

Yu-cheng v26:
- Remove redundant #ifdef CONFIG_MMU.

Yu-cheng v25:
- Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().

mm/mmap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index f67606fbc464..4dc157869b34 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3298,6 +3298,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
mm->exec_vm += npages;
else if (is_stack_mapping(flags))
mm->stack_vm += npages;
+ else if (flags & VM_SHADOW_STACK)
+ mm->stack_vm += npages;
else if (is_data_mapping(flags))
mm->data_vm += npages;
}
--
2.17.1


2022-11-04 22:53:41

by Edgecombe, Rick P

Subject: [PATCH v3 33/37] selftests/x86: Add shadow stack test

Add a simple selftest for exercising some shadow stack behavior:
- map_shadow_stack syscall and pivot
- Faulting in shadow stack memory
- Handling shadow stack violations
- GUP of shadow stack memory
- mprotect() of shadow stack memory
- Userfaultfd on shadow stack memory

Since this test exercises a recently added syscall manually, it needs
to find the automatically created __NR_foo defines. Per the selftest
documentation, KHDR_INCLUDES can be used to help the selftest Makefiles
find the headers from the kernel source. This way the new selftest can
be built inside the kernel source tree without installing the headers
to the system. So also add KHDR_INCLUDES as described in the selftest
docs, to facilitate this.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Co-developed-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Change "+m" to "=m" in write_shstk() (Andrew Cooper)
- Fix userfaultfd test with transparent huge pages by doing a
MADV_DONTNEED, since the token write faults in the whole stack with
huge pages.

v2:
- Change print statements to more align with other selftests
- Add more tests
- Add KHDR_INCLUDES to Makefile

v1:
- New patch.

tools/testing/selftests/x86/Makefile | 4 +-
.../testing/selftests/x86/test_shadow_stack.c | 574 ++++++++++++++++++
2 files changed, 576 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 0388c4d60af0..cfc8a26ad151 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- corrupt_xstate_header amx
+ corrupt_xstate_header amx test_shadow_stack
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

@@ -34,7 +34,7 @@ BINARIES_64 := $(TARGETS_C_64BIT_ALL:%=%_64)
BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))

-CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
+CFLAGS := -O2 -g -std=gnu99 -pthread -Wall $(KHDR_INCLUDES)

# call32_from_64 in thunks.S uses absolute addresses.
ifeq ($(CAN_BUILD_WITH_NOPIE),1)
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
new file mode 100644
index 000000000000..b347447da317
--- /dev/null
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -0,0 +1,574 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program tests basic kernel shadow stack support. It enables shadow
+ * stack manually via arch_prctl(), instead of relying on glibc. Its
+ * Makefile doesn't compile with shadow stack support, so it doesn't rely on
+ * any particular glibc. As a result it can't do any operations that require
+ * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just
+ * stick to the basics and hope the compiler doesn't do anything strange.
+ */
+
+#define _GNU_SOURCE
+
+#include <sys/syscall.h>
+#include <asm/mman.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <x86intrin.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <stdint.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+
+#define SS_SIZE 0x200000
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+ printf("[SKIP]\tCompiler does not support CET.\n");
+ return 0;
+}
+#else
+void write_shstk(unsigned long *addr, unsigned long val)
+{
+ asm volatile("wrssq %[val], (%[addr])\n"
+ : "=m" (*addr)
+ : [addr] "r" (addr), [val] "r" (val));
+}
+
+static inline unsigned long __attribute__((always_inline)) get_ssp(void)
+{
+ unsigned long ret = 0;
+
+ asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
+ return ret;
+}
+
+/*
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull in all
+ * of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2) \
+({ \
+ long _ret; \
+ register long _num asm("eax") = __NR_arch_prctl; \
+ register long _arg1 asm("rdi") = (long)(arg1); \
+ register long _arg2 asm("rsi") = (long)(arg2); \
+ \
+ asm volatile ( \
+ "syscall\n" \
+ : "=a"(_ret) \
+ : "r"(_arg1), "r"(_arg2), \
+ "0"(_num) \
+ : "rcx", "r11", "memory", "cc" \
+ ); \
+ _ret; \
+})
+
+void *create_shstk(void *addr)
+{
+ return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+void *create_normal_mem(void *addr)
+{
+ return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+}
+
+void free_shstk(void *shstk)
+{
+ munmap(shstk, SS_SIZE);
+}
+
+int reset_shstk(void *shstk)
+{
+ return madvise(shstk, SS_SIZE, MADV_DONTNEED);
+}
+
+void try_shstk(unsigned long new_ssp)
+{
+ unsigned long ssp;
+
+ printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
+ new_ssp, *((unsigned long *)new_ssp));
+
+ ssp = get_ssp();
+ printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
+
+ asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+ asm volatile("saveprevssp");
+ printf("[INFO]\tssp is now %lx\n", get_ssp());
+
+ /* Switch back to original shadow stack */
+ ssp -= 8;
+ asm volatile("rstorssp (%0)\n":: "r" (ssp));
+ asm volatile("saveprevssp");
+}
+
+int test_shstk_pivot(void)
+{
+ void *shstk = create_shstk(0);
+
+ if (shstk == MAP_FAILED) {
+ printf("[FAIL]\tError creating shadow stack: %d\n", errno);
+ return 1;
+ }
+ try_shstk((unsigned long)shstk + SS_SIZE - 8);
+ free_shstk(shstk);
+
+ printf("[OK]\tShadow stack pivot\n");
+ return 0;
+}
+
+int test_shstk_faults(void)
+{
+ unsigned long *shstk = create_shstk(0);
+
+ /* Read shadow stack, test if it's zero to not get read optimized out */
+ if (*shstk != 0)
+ goto err;
+
+ /* Wrss memory that was already read. */
+ write_shstk(shstk, 1);
+ if (*shstk != 1)
+ goto err;
+
+ /* Page out memory, so we can wrss it again. */
+ if (reset_shstk((void *)shstk))
+ goto err;
+
+ write_shstk(shstk, 1);
+ if (*shstk != 1)
+ goto err;
+
+ printf("[OK]\tShadow stack faults\n");
+ return 0;
+
+err:
+ return 1;
+}
+
+unsigned long saved_ssp;
+unsigned long saved_ssp_val;
+volatile bool segv_triggered;
+
+void __attribute__((noinline)) violate_ss(void)
+{
+ saved_ssp = get_ssp();
+ saved_ssp_val = *(unsigned long *)saved_ssp;
+
+ /* Corrupt shadow stack */
+ printf("[INFO]\tCorrupting shadow stack\n");
+ write_shstk((void *)saved_ssp, 0);
+}
+
+void segv_handler(int signum, siginfo_t *si, void *uc)
+{
+ printf("[INFO]\tGenerated shadow stack violation successfully\n");
+
+ segv_triggered = true;
+
+ /* Fix shadow stack */
+ write_shstk((void *)saved_ssp, saved_ssp_val);
+}
+
+int test_shstk_violation(void)
+{
+ struct sigaction sa = {};
+
+ sa.sa_sigaction = segv_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ segv_triggered = false;
+
+ /* Make sure segv_triggered is set before violate_ss() */
+ asm volatile("" : : : "memory");
+
+ violate_ss();
+
+ signal(SIGSEGV, SIG_DFL);
+
+ printf("[OK]\tShadow stack violation test\n");
+
+ return !segv_triggered;
+}
+
+/* Gup test state */
+#define MAGIC_VAL 0x12345678
+bool is_shstk_access;
+void *shstk_ptr;
+int fd;
+
+void reset_test_shstk(void *addr)
+{
+ if (shstk_ptr != NULL)
+ free_shstk(shstk_ptr);
+ shstk_ptr = create_shstk(addr);
+}
+
+void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
+{
+ printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
+
+ segv_triggered = true;
+
+ /* Fix shadow stack */
+ if (is_shstk_access) {
+ reset_test_shstk(shstk_ptr);
+ return;
+ }
+
+ free_shstk(shstk_ptr);
+ create_normal_mem(shstk_ptr);
+}
+
+bool test_shstk_access(void *ptr)
+{
+ is_shstk_access = true;
+ segv_triggered = false;
+ write_shstk(ptr, MAGIC_VAL);
+
+ asm volatile("" : : : "memory");
+
+ return segv_triggered;
+}
+
+bool test_write_access(void *ptr)
+{
+ is_shstk_access = false;
+ segv_triggered = false;
+ *(unsigned long *)ptr = MAGIC_VAL;
+
+ asm volatile("" : : : "memory");
+
+ return segv_triggered;
+}
+
+bool gup_write(void *ptr)
+{
+ unsigned long val;
+
+ lseek(fd, (unsigned long)ptr, SEEK_SET);
+ if (write(fd, &val, sizeof(val)) < 0)
+ return 1;
+
+ return 0;
+}
+
+bool gup_read(void *ptr)
+{
+ unsigned long val;
+
+ lseek(fd, (unsigned long)ptr, SEEK_SET);
+ if (read(fd, &val, sizeof(val)) < 0)
+ return 1;
+
+ return 0;
+}
+
+int test_gup(void)
+{
+ struct sigaction sa = {};
+ int status;
+ pid_t pid;
+
+ sa.sa_sigaction = test_access_fix_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ segv_triggered = false;
+
+ fd = open("/proc/self/mem", O_RDWR);
+ if (fd == -1)
+ return 1;
+
+ reset_test_shstk(0);
+ if (gup_read(shstk_ptr))
+ return 1;
+ if (test_shstk_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup read -> shstk access success\n");
+
+ reset_test_shstk(0);
+ if (gup_write(shstk_ptr))
+ return 1;
+ if (test_shstk_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup write -> shstk access success\n");
+
+ reset_test_shstk(0);
+ if (gup_read(shstk_ptr))
+ return 1;
+ if (!test_write_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup read -> write access success\n");
+
+ reset_test_shstk(0);
+ if (gup_write(shstk_ptr))
+ return 1;
+ if (!test_write_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup write -> write access success\n");
+
+ close(fd);
+
+ /* COW/gup test */
+ reset_test_shstk(0);
+ pid = fork();
+ if (!pid) {
+ fd = open("/proc/self/mem", O_RDWR);
+ if (fd == -1)
+ exit(1);
+
+ if (gup_write(shstk_ptr)) {
+ close(fd);
+ exit(1);
+ }
+ close(fd);
+ exit(0);
+ }
+ waitpid(pid, &status, 0);
+ if (WEXITSTATUS(status)) {
+ printf("[FAIL]\tWrite in child failed\n");
+ return 1;
+ }
+ if (*(unsigned long *)shstk_ptr == MAGIC_VAL) {
+ printf("[FAIL]\tWrite in child wrote through to shared memory\n");
+ return 1;
+ }
+
+ printf("[INFO]\tCow gup write -> write access success\n");
+
+ free_shstk(shstk_ptr);
+
+ signal(SIGSEGV, SIG_DFL);
+
+ printf("[OK]\tShadow gup test\n");
+
+ return 0;
+}
+
+int test_mprotect(void)
+{
+ struct sigaction sa = {};
+
+ sa.sa_sigaction = test_access_fix_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ segv_triggered = false;
+
+ /* mprotect a shadow stack as read only */
+ reset_test_shstk(0);
+ if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+ printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+ return 1;
+ }
+
+ /* try to wrss it and fail */
+ if (!test_shstk_access(shstk_ptr)) {
+ printf("[FAIL]\tShadow stack access to read-only memory succeeded\n");
+ return 1;
+ }
+
+ /* then back to writable */
+ if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) {
+ printf("[FAIL]\tmprotect(PROT_WRITE) failed\n");
+ return 1;
+ }
+
+ /* then pivot to it and succeed */
+ if (test_shstk_access(shstk_ptr)) {
+ printf("[FAIL]\tShadow stack access to mprotect() writable memory failed\n");
+ return 1;
+ }
+
+ free_shstk(shstk_ptr);
+
+ signal(SIGSEGV, SIG_DFL);
+
+ printf("[OK]\tmprotect() test\n");
+
+ return 0;
+}
+
+char zero[4096];
+
+static void *uffd_thread(void *arg)
+{
+ struct uffdio_copy req;
+ int uffd = *(int *)arg;
+ struct uffd_msg msg;
+
+ if (read(uffd, &msg, sizeof(msg)) <= 0)
+ return (void *)1;
+
+ req.dst = msg.arg.pagefault.address;
+ req.src = (__u64)zero;
+ req.len = 4096;
+ req.mode = 0;
+
+ if (ioctl(uffd, UFFDIO_COPY, &req))
+ return (void *)1;
+
+ return (void *)0;
+}
+
+int test_userfaultfd(void)
+{
+ struct uffdio_register uffdio_register;
+ struct uffdio_api uffdio_api;
+ struct sigaction sa = {};
+ pthread_t thread;
+ void *res;
+ int uffd;
+
+ sa.sa_sigaction = test_access_fix_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ printf("[SKIP]\tUserfaultfd unavailable.\n");
+ return 0;
+ }
+
+ reset_test_shstk(0);
+
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = 0;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+ goto err;
+
+ uffdio_register.range.start = (__u64)shstk_ptr;
+ uffdio_register.range.len = 4096;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+ goto err;
+
+ if (pthread_create(&thread, NULL, &uffd_thread, &uffd))
+ goto err;
+
+ reset_shstk(shstk_ptr);
+ test_shstk_access(shstk_ptr);
+
+ if (pthread_join(thread, &res))
+ goto err;
+
+ if (test_shstk_access(shstk_ptr))
+ goto err;
+
+ free_shstk(shstk_ptr);
+
+ signal(SIGSEGV, SIG_DFL);
+
+ if (!res)
+ printf("[OK]\tUserfaultfd test\n");
+ return !!res;
+err:
+ free_shstk(shstk_ptr);
+ close(uffd);
+ signal(SIGSEGV, SIG_DFL);
+ return 1;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = 0;
+
+ if (ARCH_PRCTL(ARCH_CET_ENABLE, CET_SHSTK)) {
+ printf("[SKIP]\tCould not enable Shadow stack\n");
+ return 1;
+ }
+
+ if (ARCH_PRCTL(ARCH_CET_DISABLE, CET_SHSTK)) {
+ ret = 1;
+ printf("[FAIL]\tDisabling shadow stack failed\n");
+ }
+
+ if (ARCH_PRCTL(ARCH_CET_ENABLE, CET_SHSTK)) {
+ printf("[SKIP]\tCould not re-enable Shadow stack\n");
+ return 1;
+ }
+
+ if (ARCH_PRCTL(ARCH_CET_ENABLE, CET_WRSS)) {
+ printf("[SKIP]\tCould not enable WRSS\n");
+ ret = 1;
+ goto out;
+ }
+
+ /* Should have succeeded if here, but this is a test, so double check. */
+ if (!get_ssp()) {
+ printf("[FAIL]\tShadow stack disabled\n");
+ return 1;
+ }
+
+ if (test_shstk_pivot()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack pivot\n");
+ goto out;
+ }
+
+ if (test_shstk_faults()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack fault test\n");
+ goto out;
+ }
+
+ if (test_shstk_violation()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack violation test\n");
+ goto out;
+ }
+
+ if (test_gup()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack gup test\n");
+ }
+
+ if (test_mprotect()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack mprotect test\n");
+ }
+
+ if (test_userfaultfd()) {
+ ret = 1;
+ printf("[FAIL]\tUserfaultfd test\n");
+ }
+
+out:
+ /*
+ * Disable shadow stack before the function returns, or there will be a
+ * shadow stack violation.
+ */
+ if (ARCH_PRCTL(ARCH_CET_DISABLE, CET_SHSTK)) {
+ ret = 1;
+ printf("[FAIL]\tDisabling shadow stack failed\n");
+ }
+
+ return ret;
+}
+#endif
--
2.17.1


2022-11-04 22:53:50

by Edgecombe, Rick P

Subject: [PATCH v3 20/37] mm/mprotect: Exclude shadow stack from preserve_write

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

In change_pte_range(), when a PTE is changed for prot_numa, _PAGE_RW is
preserved to avoid the additional write fault after the NUMA hinting fault.
However, pte_write() now includes both normal writable and shadow stack
(Write=0, Dirty=1) PTEs, but the latter does not have _PAGE_RW and has no
need to preserve it.

Exclude shadow stack from preserve_write test, and apply the same change to
change_huge_pmd().
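
The decision described above can be modelled in a few lines of userspace C.
This is only a sketch under stated assumptions: the sketch_* names and the bit
values are made up for illustration and are not the kernel's pte_write() or
_PAGE_* definitions, and the shadow stack VMA check is reduced to a boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; hypothetical stand-ins for _PAGE_RW/_PAGE_DIRTY. */
#define SKETCH_PAGE_RW    (1u << 1)
#define SKETCH_PAGE_DIRTY (1u << 6)

/*
 * Models pte_write() as the commit log describes it: true for a normal
 * writable PTE (RW=1) and for a shadow stack PTE (Write=0, Dirty=1).
 */
static bool sketch_pte_write(uint32_t pte)
{
	if (pte & SKETCH_PAGE_RW)
		return true;
	return (pte & SKETCH_PAGE_DIRTY) != 0;
}

/*
 * Models the preserve_write test with the exclusion added by this patch:
 * a shadow stack VMA never asks for _PAGE_RW to be preserved.
 */
static bool sketch_preserve_write(bool prot_numa, uint32_t pte, bool vma_is_shstk)
{
	bool preserve_write = prot_numa && sketch_pte_write(pte);

	if (vma_is_shstk)
		preserve_write = false;

	return preserve_write;
}
```

In this model a NUMA hinting change on a normal writable PTE still preserves
write, while the same change on a shadow stack PTE does not.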

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

Yu-cheng v25:
- Move is_shadow_stack_mapping() to a separate line.

Yu-cheng v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

mm/huge_memory.c | 7 +++++++
mm/mprotect.c | 7 +++++++
2 files changed, 14 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73b9b78f8cf4..7643a4db1b50 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1803,6 +1803,13 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return 0;

preserve_write = prot_numa && pmd_write(*pmd);
+
+ /*
+ * Preserve only normal writable huge PMD, but not shadow
+ * stack (RW=0, Dirty=1).
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ preserve_write = false;
ret = 1;

#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 668bfaa6ed2a..ea82ce5f38fe 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -115,6 +115,13 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
pte_t ptent;
bool preserve_write = prot_numa && pte_write(oldpte);

+ /*
+ * Preserve only normal writable PTE, but not shadow
+ * stack (RW=0, Dirty=1).
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ preserve_write = false;
+
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
--
2.17.1


2022-11-04 22:53:52

by Edgecombe, Rick P

Subject: [PATCH v3 25/37] x86/shstk: Add user-mode shadow stack support

From: Yu-cheng Yu <[email protected]>

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).
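
As a rough userspace sketch of the sizing rule (not the kernel code itself):
the helper names, the 4 KB page size and the explicit stack_rlimit parameter
are assumptions made here so the arithmetic can be checked in isolation.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL        /* assumed page size */
#define SKETCH_SZ_4G     (4ULL << 30)

/* Round v up to the next multiple of the (assumed) page size. */
static uint64_t sketch_page_align(uint64_t v)
{
	return (v + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * A non-zero request is page aligned and used as is. A zero request
 * falls back to min(stack rlimit, 4GB), also page aligned. Overflow
 * for requests close to UINT64_MAX is not handled in this sketch.
 */
static uint64_t sketch_shstk_size(uint64_t requested, uint64_t stack_rlimit)
{
	uint64_t size;

	if (requested)
		return sketch_page_align(requested);

	size = stack_rlimit < SKETCH_SZ_4G ? stack_rlimit : SKETCH_SZ_4G;
	return sketch_page_align(size);
}
```

With an 8 MB stack rlimit this gives an 8 MB shadow stack; with an unlimited
rlimit the result is capped at 4 GB.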

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

Do not support IA32 emulation or x32.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Use define for set_clr_bits_msrl() (Kees)
- Make some functions static (Kees)
- Change feature_foo() to features_foo() (Kees)
- Centralize shadow stack size rlimit checks (Kees)
- Disable x32 support

v2:
- Get rid of unnecessary shstk->base checks
- Don't support IA32 emulation

v1:
- Switch to xsave helpers.
- Expand commit log.

Yu-cheng v30:
- Remove superfluous comments for struct thread_shstk.
- Replace 'populate' with 'unused'.

arch/x86/include/asm/cet.h | 7 ++
arch/x86/include/asm/msr.h | 11 +++
arch/x86/include/asm/processor.h | 3 +
arch/x86/include/uapi/asm/prctl.h | 3 +
arch/x86/kernel/shstk.c | 146 ++++++++++++++++++++++++++++++
5 files changed, 170 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index a2f3c6e06ef5..cade110b2ea8 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -8,12 +8,19 @@
struct task_struct;

#ifdef CONFIG_X86_USER_SHADOW_STACK
+struct thread_shstk {
+ u64 base;
+ u64 size;
+};
+
long cet_prctl(struct task_struct *task, int option, unsigned long features);
void reset_thread_features(void);
+void shstk_free(struct task_struct *p);
#else
static inline long cet_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
static inline void reset_thread_features(void) {}
+static inline void shstk_free(struct task_struct *p) {}
#endif /* CONFIG_X86_USER_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..a4b86eb537d6 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
int msr_set_bit(u32 msr, u8 bit);
int msr_clear_bit(u32 msr, u8 bit);

+/* Helper that can never get accidentally un-inlined. */
+#define set_clr_bits_msrl(msr, set, clear) do { \
+ u64 __val, __new_val; \
+ \
+ rdmsrl(msr, __val); \
+ __new_val = (__val & ~(clear)) | (set); \
+ \
+ if (__new_val != __val) \
+ wrmsrl(msr, __new_val); \
+} while (0)
+
#ifdef CONFIG_SMP
int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index ca66d320a263..a6c414dfd10f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
#include <asm/unwind_hints.h>
#include <asm/vmxfeatures.h>
#include <asm/vdso/processor.h>
+#include <asm/cet.h>

#include <linux/personality.h>
#include <linux/cache.h>
@@ -533,6 +534,8 @@ struct thread_struct {
#ifdef CONFIG_X86_USER_SHADOW_STACK
unsigned long features;
unsigned long features_locked;
+
+ struct thread_shstk shstk;
#endif

/* Floating point and extended processor state */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 2dae9997ee17..dad5288bf086 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,4 +26,7 @@
#define ARCH_CET_DISABLE 0x5002
#define ARCH_CET_LOCK 0x5003

+/* ARCH_CET_ features bits */
+#define CET_SHSTK (1ULL << 0)
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index ed6f25cc07c5..20da2008e021 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -8,14 +8,160 @@

#include <linux/sched.h>
#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
#include <asm/prctl.h>

+static bool features_enabled(unsigned long features)
+{
+ return current->thread.features & features;
+}
+
+static void features_set(unsigned long features)
+{
+ current->thread.features |= features;
+}
+
+static void features_clr(unsigned long features)
+{
+ current->thread.features &= ~features;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+ struct mm_struct *mm = current->mm;
+ unsigned long addr, unused;
+
+ mmap_write_lock(mm);
+ addr = do_mmap(NULL, 0, size, PROT_READ, flags,
+ VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+
+ mmap_write_unlock(mm);
+
+ return addr;
+}
+
+static unsigned long adjust_shstk_size(unsigned long size)
+{
+ if (size)
+ return PAGE_ALIGN(size);
+
+ return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+ while (1) {
+ int r;
+
+ r = vm_munmap(base, size);
+
+ /*
+ * vm_munmap() returns -EINTR when mmap_lock is held by
+ * something else, and that lock should not be held for a
+ * long time. Retry in that case.
+ */
+ if (r == -EINTR) {
+ cond_resched();
+ continue;
+ }
+
+ /*
+ * For all other types of vm_munmap() failure, either the
+ * system is out of memory or there is a bug.
+ */
+ WARN_ON_ONCE(r);
+ break;
+ }
+}
+
+static int shstk_setup(void)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ unsigned long addr, size;
+
+ /* Already enabled */
+ if (features_enabled(CET_SHSTK))
+ return 0;
+
+ /* Also not supported for 32 bit and x32 */
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall())
+ return -EOPNOTSUPP;
+
+ size = adjust_shstk_size(0);
+ addr = alloc_shstk(size);
+ if (IS_ERR_VALUE(addr))
+ return PTR_ERR((void *)addr);
+
+ fpregs_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+ wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+ fpregs_unlock();
+
+ shstk->base = addr;
+ shstk->size = size;
+ features_set(CET_SHSTK);
+
+ return 0;
+}
+
void reset_thread_features(void)
{
+ memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
current->thread.features = 0;
current->thread.features_locked = 0;
}

+void shstk_free(struct task_struct *tsk)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(CET_SHSTK))
+ return;
+
+ if (!tsk->mm)
+ return;
+
+ unmap_shadow_stack(shstk->base, shstk->size);
+}
+
+
+static int shstk_disable(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ /* Already disabled? */
+ if (!features_enabled(CET_SHSTK))
+ return 0;
+
+ fpregs_lock_and_load();
+ /* Disable WRSS too when disabling shadow stack */
+ set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
+ wrmsrl(MSR_IA32_PL3_SSP, 0);
+ fpregs_unlock();
+
+ shstk_free(current);
+ features_clr(CET_SHSTK);
+
+ return 0;
+}
+
long cet_prctl(struct task_struct *task, int option, unsigned long features)
{
if (option == ARCH_CET_LOCK) {
--
2.17.1


2022-11-04 22:53:53

by Edgecombe, Rick P

Subject: [PATCH v3 13/37] mm: Move VM_UFFD_MINOR_BIT from 37 to 38

From: Yu-cheng Yu <[email protected]>

To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Peter Xu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..5314ad0a342d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -365,7 +365,7 @@ extern unsigned int kobjsize(const void *objp);
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT 37
+# define VM_UFFD_MINOR_BIT 38
# define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */
#else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
# define VM_UFFD_MINOR VM_NONE
--
2.17.1


2022-11-04 22:54:01

by Edgecombe, Rick P

Subject: [PATCH v3 08/37] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

From: Yu-cheng Yu <[email protected]>

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.

In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
if HW dirty tracking is not needed, because if the Dirty bit is not already
set the CPU has to set Dirty=1 when the memory gets written to. This
creates additional work for the CPU. So traditional wisdom was to simply set
the Dirty bit whenever you didn't care about it. However, it was never
really very helpful for read only kernel memory.

When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
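
To make the ambiguity concrete, here is a small userspace model of the
set_memory_ro() change. It is a sketch only: the sketch_* helpers and bit
values are hypothetical stand-ins, not the kernel's page attribute code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; hypothetical stand-ins for _PAGE_RW/_PAGE_DIRTY. */
#define SKETCH_PAGE_RW    (1u << 1)
#define SKETCH_PAGE_DIRTY (1u << 6)

/* Write=0,Dirty=1 is the encoding that shadow stack capable CPUs treat specially. */
static bool sketch_looks_like_shstk(uint32_t pte)
{
	return !(pte & SKETCH_PAGE_RW) && (pte & SKETCH_PAGE_DIRTY);
}

/* Behavior before this patch: only the RW bit is cleared. */
static uint32_t sketch_set_ro_old(uint32_t pte)
{
	return pte & ~SKETCH_PAGE_RW;
}

/* Behavior after this patch: RW and Dirty are cleared together. */
static uint32_t sketch_set_ro_new(uint32_t pte)
{
	return pte & ~(SKETCH_PAGE_RW | SKETCH_PAGE_DIRTY);
}
```

Starting from a writable, dirty mapping, the old variant leaves a
Write=0,Dirty=1 entry behind and the new one does not.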

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---

v3:
- Update commit log (Andrew Cooper, Peterz)

v2:
- Normalize PTE bit descriptions between patches

arch/x86/include/asm/pgtable_types.h | 6 +++---
arch/x86/mm/pat/set_memory.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index aa174fed3a71..ff82237e7b6b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -192,10 +192,10 @@ enum page_cache_mode {
#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
#define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0)
#define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC)
-#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G)
-#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0|___D| 0|___G)
+#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G)
+#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G)
#define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC)
-#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G)
+#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G)
#define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G)
#define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G)
#define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 2e5a045731de..af2267a9cdab 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2026,7 +2026,7 @@ int set_memory_nx(unsigned long addr, int numpages)

int set_memory_ro(unsigned long addr, int numpages)
{
- return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+ return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
}

int set_memory_rw(unsigned long addr, int numpages)
--
2.17.1


2022-11-04 22:54:10

by Edgecombe, Rick P

Subject: [PATCH v3 30/37] x86/shstk: Support wrss for userspace

For the current shadow stack implementation, shadow stack contents can't
easily be provisioned with arbitrary data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, wrss, which
can be enabled to write directly to shadow stack permissioned memory from
userspace. Allow it to get enabled via the prctl interface.

Only enable the userspace wrss instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

From a fault handler perspective, WRSS will behave very similarly to WRUSS,
which is treated like a user access from a #PF err code perspective.
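
The dependency between the two feature bits can be sketched as a small state
model in userspace C. The struct and function names below are hypothetical;
the MSR writes and FPU locking done by the real code are left out on purpose.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirror the two prctl feature bits; the names are local to this sketch. */
#define SKETCH_CET_SHSTK (1UL << 0)
#define SKETCH_CET_WRSS  (1UL << 1)

struct sketch_thread {
	unsigned long features;
};

/* WRSS can only be toggled while shadow stack is enabled. */
static int sketch_wrss_control(struct sketch_thread *t, bool enable)
{
	if (!(t->features & SKETCH_CET_SHSTK))
		return -EPERM;

	if (enable)
		t->features |= SKETCH_CET_WRSS;
	else
		t->features &= ~SKETCH_CET_WRSS;

	return 0;
}

/* Disabling shadow stack drops WRSS along with it. */
static void sketch_shstk_disable(struct sketch_thread *t)
{
	t->features &= ~(SKETCH_CET_SHSTK | SKETCH_CET_WRSS);
}
```

In this model, enabling WRSS before shadow stack fails with -EPERM, and
disabling shadow stack leaves no feature bits set.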

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Make wrss_control() static
- Fix verbiage in commit log (Kees)

v2:
- Add some commit log verbiage from (Dave Hansen)

v1:
- New patch.

arch/x86/include/uapi/asm/prctl.h | 1 +
arch/x86/kernel/shstk.c | 33 +++++++++++++++++++++++++++++--
2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index dad5288bf086..5f1d3181e4a1 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -28,5 +28,6 @@

/* ARCH_CET_ features bits */
#define CET_SHSTK (1ULL << 0)
+#define CET_WRSS (1ULL << 1)

#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 9a025eea520f..cbd0970b26d7 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -364,6 +364,35 @@ void shstk_free(struct task_struct *tsk)
unmap_shadow_stack(shstk->base, shstk->size);
}

+static int wrss_control(bool enable)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ /*
+ * Only enable wrss if shadow stack is enabled. If shadow stack is not
+ * enabled, wrss will already be disabled, so don't bother clearing it
+ * when disabling.
+ */
+ if (!features_enabled(CET_SHSTK))
+ return -EPERM;
+
+ /* Already enabled/disabled? */
+ if (features_enabled(CET_WRSS) == enable)
+ return 0;
+
+ fpregs_lock_and_load();
+ if (enable) {
+ set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
+ features_set(CET_WRSS);
+ } else {
+ set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
+ features_clr(CET_WRSS);
+ }
+ fpregs_unlock();
+
+ return 0;
+}

static int shstk_disable(void)
{
@@ -376,12 +405,12 @@ static int shstk_disable(void)

fpregs_lock_and_load();
/* Disable WRSS too when disabling shadow stack */
- set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
+ set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN | CET_WRSS_EN);
wrmsrl(MSR_IA32_PL3_SSP, 0);
fpregs_unlock();

shstk_free(current);
- features_clr(CET_SHSTK);
+ features_clr(CET_SHSTK | CET_WRSS);

return 0;
}
--
2.17.1


2022-11-04 22:54:27

by Edgecombe, Rick P

Subject: [PATCH v3 31/37] x86: Expose thread features in /proc/$PID/status

Applications and loaders can have logic to decide whether to enable CET.
They usually don't report whether CET has been enabled or not, so there
is no way to verify whether an application actually is protected by CET
features.

Add two lines in /proc/$PID/status to report enabled and locked features.

Since this involves referring to arch specific defines in asm/prctl.h,
implement an arch breakout to emit the feature lines.
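
For a consumer of these lines, a minimal parsing sketch could look like the
following. It assumes the "key:<tab>name name ..." layout emitted by this
patch; the helper name is made up for illustration and reading the file itself
is left to the caller.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'line' is the status line for 'key' (without the
 * trailing colon) and lists 'feature' as a whole word.
 */
static bool sketch_status_has_feature(const char *line, const char *key,
				      const char *feature)
{
	size_t klen = strlen(key);
	size_t flen = strlen(feature);
	const char *p;

	if (!flen)
		return false;

	/* The ':' check keeps "..._features" from matching "..._features_locked". */
	if (strncmp(line, key, klen) != 0 || line[klen] != ':')
		return false;

	for (p = line + klen + 1; *p; ) {
		size_t n;

		p += strspn(p, " \t\n");	/* skip separators */
		n = strcspn(p, " \t\n");	/* length of next word */
		if (n == flen && strncmp(p, feature, flen) == 0)
			return true;
		p += n;
	}

	return false;
}
```

A caller would read /proc/$PID/status line by line and pass each line with
"x86_Thread_features" or "x86_Thread_features_locked" as the key.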

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
[Switched to CET, added to commit log]
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Move to /proc/pid/status (Kees)

v2:
- New patch

arch/x86/kernel/cpu/proc.c | 23 +++++++++++++++++++++++
fs/proc/array.c | 6 ++++++
include/linux/proc_fs.h | 2 ++
3 files changed, 31 insertions(+)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 099b6f0d96bd..105587d43500 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -4,6 +4,8 @@
#include <linux/string.h>
#include <linux/seq_file.h>
#include <linux/cpufreq.h>
+#include <asm/prctl.h>
+#include <linux/proc_fs.h>

#include "cpu.h"

@@ -175,3 +177,24 @@ const struct seq_operations cpuinfo_op = {
.stop = c_stop,
.show = show_cpuinfo,
};
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static void dump_x86_features(struct seq_file *m, unsigned long features)
+{
+ if (features & CET_SHSTK)
+ seq_puts(m, "shstk ");
+ if (features & CET_WRSS)
+ seq_puts(m, "wrss ");
+}
+
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task)
+{
+ seq_puts(m, "x86_Thread_features:\t");
+ dump_x86_features(m, task->thread.features);
+ seq_putc(m, '\n');
+
+ seq_puts(m, "x86_Thread_features_locked:\t");
+ dump_x86_features(m, task->thread.features_locked);
+ seq_putc(m, '\n');
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49283b8103c7..7ac43ecda1c2 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -428,6 +428,11 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
}

+__weak void arch_proc_pid_thread_features(struct seq_file *m,
+ struct task_struct *task)
+{
+}
+
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -451,6 +456,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+ arch_proc_pid_thread_features(m, task);
return 0;
}

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 81d6e4ec2294..5a8b21c0a587 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -158,6 +158,8 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task);
#endif /* CONFIG_PROC_PID_ARCH_STATUS */

+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task);
+
#else /* CONFIG_PROC_FS */

static inline void proc_root_init(void)
--
2.17.1


2022-11-04 22:54:36

by Edgecombe, Rick P

Subject: [PATCH v3 18/37] mm: Add guard pages around a shadow stack.

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the SSP to
different shadow stacks, but it requires a specially placed token in order
to do this. However, the architecture does not prevent incrementing the
stack pointer to wander onto an adjacent shadow stack. To prevent this in
software, enforce guard pages at the beginning of shadow stack vmas, such
that there will always be a gap between adjacent shadow stacks.

Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP), can move the SSP from one stack to the next. The
SSP can increment or decrement by CALL, RET and INCSSP. CALL and RET
can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.

The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:

addq $0x80, %rsp

However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:

READ_ONCE(ssp); // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8); // read+discard last popped stack element

The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.

This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and adds a strange difference to how most regular stacks work.
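
The gap size argument above is simple arithmetic and can be checked with a
short sketch. The constants restate numbers from this commit log and the code
comment (255 elements of 8 bytes, a 4 KB page); the helper names are invented
here, and the "step smaller than gap" test is a simplification of the
first/last element access behavior described above.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SIZE        4096UL
#define SKETCH_SHSTK_ENTRY_SIZE 8UL	/* one 64-bit shadow stack element */
#define SKETCH_INCSSP_MAX_COUNT 255UL	/* largest INCSSPQ pop count */

/* Largest distance a single INCSSPQ can move the SSP, in bytes. */
static unsigned long sketch_max_incssp_step(void)
{
	return SKETCH_INCSSP_MAX_COUNT * SKETCH_SHSTK_ENTRY_SIZE;
}

/*
 * Simplified check: a gap is enough if no single step is large enough
 * to jump across it without touching memory inside it.
 */
static bool sketch_gap_is_sufficient(unsigned long gap)
{
	return sketch_max_incssp_step() < gap;
}
```

Under these assumptions the largest step is 2040 bytes, so a one page gap
passes the check while, for example, a 1 KB gap would not.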

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v2:
- Use __weak instead of #ifdef (Dave Hansen)
- Only have start gap on shadow stack (Andy Luto)
- Create stack_guard_start_gap() to not duplicate code
in an arch version of vm_start_gap() (Dave Hansen)
- Improve commit log partly with verbiage from (Dave Hansen)

Yu-cheng v25:
- Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.

Yu-cheng v24:
- Instead of changing vm_*_gap(), create x86-specific versions.

arch/x86/mm/mmap.c | 23 +++++++++++++++++++++++
include/linux/mm.h | 11 ++++++-----
mm/mmap.c | 7 +++++++
3 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..66da1f3298b0 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -248,3 +248,26 @@ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
return false;
return true;
}
+
+unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_GROWSDOWN)
+ return stack_guard_gap;
+
+ /*
+ * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).
+ * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
+ * (~1KB for INCSSPD) and touches the first and the last element
+ * in the range, which triggers a page fault if the range is not
+ * in a shadow stack. Because of this, creating 4-KB guard pages
+ * around a shadow stack prevents these instructions from going
+ * beyond.
+ *
+ * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
+ * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return PAGE_SIZE;
+
+ return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d9536fa860a..0a3f7e2b32df 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2832,15 +2832,16 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
return mtree_load(&mm->mm_mt, addr);
}

+unsigned long stack_guard_start_gap(struct vm_area_struct *vma);
+
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
+ unsigned long gap = stack_guard_start_gap(vma);
unsigned long vm_start = vma->vm_start;

- if (vma->vm_flags & VM_GROWSDOWN) {
- vm_start -= stack_guard_gap;
- if (vm_start > vma->vm_start)
- vm_start = 0;
- }
+ vm_start -= gap;
+ if (vm_start > vma->vm_start)
+ vm_start = 0;
return vm_start;
}

diff --git a/mm/mmap.c b/mm/mmap.c
index 2def55555e05..f67606fbc464 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -281,6 +281,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
return origbrk;
}

+unsigned long __weak stack_guard_start_gap(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_GROWSDOWN)
+ return stack_guard_gap;
+ return 0;
+}
+
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
extern void mt_validate(struct maple_tree *mt);
extern void mt_dump(const struct maple_tree *mt);
--
2.17.1


2022-11-04 22:54:44

by Edgecombe, Rick P

Subject: [PATCH v3 04/37] x86/cpufeatures: Enable CET CR4 bit for shadow stack

From: Yu-cheng Yu <[email protected]>

Setting CR4.CET is a prerequisite for utilizing any CET features, most of
which also require setting MSRs.

Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
and is configured with kernel IBT. However, future patches that enable
userspace shadow stack support will need the bit set as well. So change
the logic to enable it in either case.

Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.
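
The resulting enable logic can be summarised with a small decision sketch.
This is a userspace model, not the kernel function: the struct, the function
name and the SKETCH_CET_ENDBR_EN value are assumptions for illustration, and
the IBT selftest and CPU capability handling are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the supervisor ENDBR enable bit. */
#define SKETCH_CET_ENDBR_EN (1ULL << 2)

struct sketch_cet_cfg {
	bool set_cr4_cet;	/* whether CR4.CET would be set */
	uint64_t s_cet_msr;	/* value that would be written to MSR_IA32_S_CET */
};

/*
 * CR4.CET is set if either kernel IBT or user shadow stack is in use.
 * The supervisor MSR only gets the ENDBR bit in the kernel IBT case.
 */
static struct sketch_cet_cfg sketch_setup_cet(bool kernel_ibt, bool user_shstk)
{
	struct sketch_cet_cfg cfg = { false, 0 };

	if (!kernel_ibt && !user_shstk)
		return cfg;

	if (kernel_ibt)
		cfg.s_cet_msr = SKETCH_CET_ENDBR_EN;

	cfg.set_cr4_cet = true;
	return cfg;
}
```

With only user shadow stack available, the model sets CR4.CET but writes 0 to
the supervisor MSR.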

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Remove stray new line (Boris)
- Simplify commit log (Andrew Cooper)

v2:
- In the shadow stack case, go back to only setting CR4.CET if the
kernel is compiled with user shadow stack support.
- Clear MSR_IA32_U_CET as well. (PeterZ)

KVM refresh:
- Set CR4.CET if SHSTK or IBT are supported by HW, so that KVM can
support CET even if IBT is disabled.
- Drop no_user_shstk (Dave Hansen)
- Elaborate on what the CR4 bit does in the commit log
- Integrate with Kernel IBT logic

v1:
- Moved kernel-parameters.txt changes here from patch 1.

arch/x86/kernel/cpu/common.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..0ba0a136adcb 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -596,28 +596,51 @@ __noendbr void ibt_restore(u64 save)

#endif

+#ifdef CONFIG_X86_CET
static __always_inline void setup_cet(struct cpuinfo_x86 *c)
{
- u64 msr = CET_ENDBR_EN;
+ bool kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+ bool user_shstk;
+ u64 msr = 0;

- if (!HAS_KERNEL_IBT ||
- !cpu_feature_enabled(X86_FEATURE_IBT))
+ /*
+ * Enable user shadow stack only if the Linux defined user shadow stack
+ * cap was not cleared by command line.
+ */
+ user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+ IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK) &&
+ !test_bit(X86_FEATURE_USER_SHSTK, (unsigned long *)cpu_caps_cleared);
+
+ if (!kernel_ibt && !user_shstk)
return;

+ if (user_shstk)
+ set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
+
+ if (kernel_ibt)
+ msr = CET_ENDBR_EN;
+
wrmsrl(MSR_IA32_S_CET, msr);
cr4_set_bits(X86_CR4_CET);

- if (!ibt_selftest()) {
+ if (kernel_ibt && !ibt_selftest()) {
pr_err("IBT selftest: Failed!\n");
setup_clear_cpu_cap(X86_FEATURE_IBT);
return;
}
}
+#else /* CONFIG_X86_CET */
+static inline void setup_cet(struct cpuinfo_x86 *c) {}
+#endif

__noendbr void cet_disable(void)
{
- if (cpu_feature_enabled(X86_FEATURE_IBT))
- wrmsrl(MSR_IA32_S_CET, 0);
+ if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+ cpu_feature_enabled(X86_FEATURE_SHSTK)))
+ return;
+
+ wrmsrl(MSR_IA32_S_CET, 0);
+ wrmsrl(MSR_IA32_U_CET, 0);
}

/*
--
2.17.1


2022-11-04 22:54:51

by Edgecombe, Rick P

Subject: [PATCH v3 21/37] mm: Re-introduce vm_flags to do_mmap()

From: Yu-cheng Yu <[email protected]>

There was no more caller passing vm_flags to do_mmap(), and vm_flags was
removed from the function's input by:

commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap(). Thus, re-introduce vm_flags to do_mmap().

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Peter Collingbourne <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
---
fs/aio.c | 2 +-
include/linux/mm.h | 3 ++-
ipc/shm.c | 2 +-
mm/mmap.c | 10 +++++-----
mm/nommu.c | 4 ++--
mm/util.c | 2 +-
6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5b2ff20ad322..66119297125a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -554,7 +554,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)

ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
PROT_READ | PROT_WRITE,
- MAP_SHARED, 0, &unused, NULL);
+ MAP_SHARED, 0, 0, &unused, NULL);
mmap_write_unlock(mm);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a3f7e2b32df..c9b387b905df 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2742,7 +2742,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
struct list_head *uf);
extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
- unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+ vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf);
extern int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool downgrade);
diff --git a/ipc/shm.c b/ipc/shm.c
index 7d86f058fb86..11e98de7e522 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1646,7 +1646,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
goto invalid;
}

- addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+ addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
*raddr = addr;
err = 0;
if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 4dc157869b34..47a8a7d9c560 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1246,11 +1246,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
*/
unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
- unsigned long flags, unsigned long pgoff,
- unsigned long *populate, struct list_head *uf)
+ unsigned long flags, vm_flags_t vm_flags,
+ unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf)
{
struct mm_struct *mm = current->mm;
- vm_flags_t vm_flags;
int pkey = 0;

validate_mm(mm);
@@ -1311,7 +1311,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
- vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

if (flags & MAP_LOCKED)
@@ -2880,7 +2880,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,

file = get_file(vma->vm_file);
ret = do_mmap(vma->vm_file, start, size,
- prot, flags, pgoff, &populate, NULL);
+ prot, flags, 0, pgoff, &populate, NULL);
fput(file);
out:
mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..20ff1ec89091 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1042,6 +1042,7 @@ unsigned long do_mmap(struct file *file,
unsigned long len,
unsigned long prot,
unsigned long flags,
+ vm_flags_t vm_flags,
unsigned long pgoff,
unsigned long *populate,
struct list_head *uf)
@@ -1049,7 +1050,6 @@ unsigned long do_mmap(struct file *file,
struct vm_area_struct *vma;
struct vm_region *region;
struct rb_node *rb;
- vm_flags_t vm_flags;
unsigned long capabilities, result;
int ret;
MA_STATE(mas, &current->mm->mm_mt, 0, 0);
@@ -1069,7 +1069,7 @@ unsigned long do_mmap(struct file *file,

/* we've determined that we can make the mapping, now translate what we
* now know into VMA flags */
- vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+ vm_flags |= determine_vm_flags(file, prot, flags, capabilities);


/* we're going to need to record the mapping */
diff --git a/mm/util.c b/mm/util.c
index 12984e76767e..aefe4fae7ecf 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -517,7 +517,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
if (!ret) {
if (mmap_write_lock_killable(mm))
return -EINTR;
- ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+ ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
&uf);
mmap_write_unlock(mm);
userfaultfd_unmap_complete(mm, &uf);
--
2.17.1


2022-11-04 22:55:46

by Edgecombe, Rick P

Subject: [PATCH v3 22/37] mm: Don't allow write GUPs to shadow stack memory

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Shadow stack memory is writable only in very specific, controlled ways.
However, since it is writable, the kernel treats it as such. As a result
there remain many ways for userspace to trigger the kernel to write to
shadow stacks via get_user_pages(, FOLL_WRITE) operations. To make this a
little less exposed, block writable GUPs for shadow stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections.
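
As a rough illustration, the write check behaves like the userspace model
below. The SK_* bit values and the function name are invented for this sketch
and are not the kernel's real VM_* or FOLL_* values. The sketch also leaves
out the further checks that check_vma_flags() performs after the FOLL_FORCE
test.

```c
#include <errno.h>

/* Illustrative bit values only; not the kernel's definitions. */
#define SK_VM_WRITE        (1UL << 0)
#define SK_VM_SHADOW_STACK (1UL << 1)
#define SK_FOLL_WRITE      (1U << 0)
#define SK_FOLL_FORCE      (1U << 1)

/* Returns 0 if the GUP request may proceed, -EFAULT otherwise. */
int sketch_check_write_gup(unsigned long vm_flags, unsigned int gup_flags)
{
	/* Read-only GUP requests are not affected by this change. */
	if (!(gup_flags & SK_FOLL_WRITE))
		return 0;

	/*
	 * Shadow stack VMAs carry the write flag, but are treated like
	 * non-writable VMAs here: only FOLL_FORCE gets through.
	 */
	if (!(vm_flags & SK_VM_WRITE) || (vm_flags & SK_VM_SHADOW_STACK)) {
		if (!(gup_flags & SK_FOLL_FORCE))
			return -EFAULT;
	}

	return 0;
}
```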

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Add comment in __pte_access_permitted() (Dave)
- Remove unneeded shadow stack specific check in
__pte_access_permitted() (Jann)

arch/x86/include/asm/pgtable.h | 5 +++++
mm/gup.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index d57dc1b2d3e8..0d18f3a4373d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1637,6 +1637,11 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
{
unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;

+ /*
+ * Write=0,Dirty=1 PTEs are shadow stack, which the kernel
+ * shouldn't generally allow access to, but since they
+ * are already Write=0, the below logic covers both cases.
+ */
if (write)
need_pte_bits |= _PAGE_RW;

diff --git a/mm/gup.c b/mm/gup.c
index fe195d47de74..411befd4d431 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1062,7 +1062,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
return -EFAULT;

if (write) {
- if (!(vm_flags & VM_WRITE)) {
+ if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/*
--
2.17.1


2022-11-04 23:00:05

by Edgecombe, Rick P

Subject: [PATCH v3 15/37] x86/mm: Check Shadow Stack page fault errors

From: Yu-cheng Yu <[email protected]>

The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).

Shadow stack accesses to shadow-stack mappings can see faults in normal,
valid operation just like regular accesses to regular mappings. Shadow
stacks need some of the same features like delayed allocation, swap and
copy-on-write. The kernel needs to use faults to implement those features.

The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.

This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping. Also, generate the errors for invalid shadow stack accesses.
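
The two rules above can be modeled in userspace C as follows. This is a
sketch under assumptions: the function names and the SK_VM_* values are
invented for illustration, and only the write and shadow stack error code
bits (bit 1 and bit 6) follow the x86 page fault error code layout. Read
faults, which the real access_error() also validates, are not modeled.

```c
#include <stdbool.h>

/* Page fault error code bits: bit 1 is write, bit 6 is shadow stack. */
#define SK_PF_WRITE (1UL << 1)
#define SK_PF_SHSTK (1UL << 6)

/* Illustrative VMA flag values only; not the kernel's definitions. */
#define SK_VM_WRITE        (1UL << 0)
#define SK_VM_SHADOW_STACK (1UL << 1)

/* Returns true when the access must be rejected as an error. */
bool sketch_access_error(unsigned long error_code, unsigned long vm_flags)
{
	/* Shadow stack accesses are only valid on shadow stack VMAs. */
	if (error_code & SK_PF_SHSTK) {
		if (!(vm_flags & SK_VM_SHADOW_STACK))
			return true;
		if (!(vm_flags & SK_VM_WRITE))
			return true;
		return false;
	}

	/* Normal writes are never valid on shadow stack VMAs. */
	if (error_code & SK_PF_WRITE) {
		if (vm_flags & SK_VM_SHADOW_STACK)
			return true;
		if (!(vm_flags & SK_VM_WRITE))
			return true;
		return false;
	}

	return false;
}

/* Every shadow stack access is handled as a write fault, reads included. */
bool sketch_fault_wants_write(unsigned long error_code)
{
	return (error_code & (SK_PF_SHSTK | SK_PF_WRITE)) != 0;
}
```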

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Improve comment talking about using FAULT_FLAG_WRITE (Peterz)

v2:
- Update commit log with verbiage/feedback from Dave Hansen
- Clarify reasoning for FAULT_FLAG_WRITE for all shadow stack accesses
- Update comments with some verbiage from Dave Hansen

Yu-cheng v30:
- Update Subject line and add a verb

arch/x86/include/asm/trap_pf.h | 2 ++
arch/x86/mm/fault.c | 26 ++++++++++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
+ * bit 6 == 1: shadow stack access fault
* bit 15 == 1: SGX MMU page-fault
*/
enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
+ X86_PF_SHSTK = 1 << 6,
X86_PF_SGX = 1 << 15,
};

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..0af3d7f52c2e 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1138,8 +1138,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;

+ /*
+ * Shadow stack accesses (PF_SHSTK=1) are only permitted to
+ * shadow stack VMAs. All other accesses result in an error.
+ */
+ if (error_code & X86_PF_SHSTK) {
+ if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
+ return 1;
+ if (unlikely(!(vma->vm_flags & VM_WRITE)))
+ return 1;
+ return 0;
+ }
+
if (error_code & X86_PF_WRITE) {
/* write, present and write, not present: */
+ if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
+ return 1;
if (unlikely(!(vma->vm_flags & VM_WRITE)))
return 1;
return 0;
@@ -1331,6 +1345,18 @@ void do_user_addr_fault(struct pt_regs *regs,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

+ /*
+ * To service shadow stack read faults, unlike normal read faults, the
+ * fault handler needs to create a type of memory that will also be
+ * writable (with instructions that generate shadow stack writes).
+ * In the case of COW memory, the COW needs to take place even with
+ * a shadow stack read. Otherwise the shared page will be left (shadow
+ * stack) writable in userspace. So trigger the appropriate behavior
+ * by setting FAULT_FLAG_WRITE for shadow stack accesses, even if the
+ * access was a shadow stack read.
+ */
+ if (error_code & X86_PF_SHSTK)
+ flags |= FAULT_FLAG_WRITE;
if (error_code & X86_PF_WRITE)
flags |= FAULT_FLAG_WRITE;
if (error_code & X86_PF_INSTR)
--
2.17.1


2022-11-04 23:00:08

by Edgecombe, Rick P

Subject: [PATCH v3 29/37] x86/shstk: Introduce map_shadow_stack syscall

When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be set up
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a few downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
restore tokens being written into the middle of pre-used shadow stacks.
It is ideal to prevent restore tokens being added at arbitrary
locations, so the check was to make sure the shadow stack had never been
written to.
3. It stood out from the rest of the madvise flags, as more of a direct
action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and set up new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
set up its own shadow stacks using the WRSS instruction. Towards this,
provide a flag so that stacks can optionally be set up securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
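
The argument checks the syscall applies before mapping anything can be
sketched in userspace C as below. The function name, the SK_ prefixed
constants and the fixed 4096 byte page size are assumptions made for this
illustration. The feature check that returns -ENOSYS is not modeled.

```c
#include <errno.h>

/* Assumed page size for this sketch; the kernel uses PAGE_SIZE. */
#define SK_PAGE_SIZE 4096UL
/* Mirrors the flag value this patch defines in the uapi header. */
#define SK_SHADOW_STACK_SET_TOKEN (1ULL << 0)

/*
 * Returns 0 and stores the page aligned mapping size in *aligned_size,
 * or a negative errno value when the arguments would be rejected.
 */
long sketch_check_map_args(unsigned long size, unsigned int flags,
			   unsigned long *aligned_size)
{
	int set_tok = (flags & SK_SHADOW_STACK_SET_TOKEN) != 0;
	unsigned long aligned;

	/* Unknown flags are blocked. */
	if (flags & ~SK_SHADOW_STACK_SET_TOKEN)
		return -EINVAL;

	/* A restore token is 8 bytes, so it needs at least that much room. */
	if (set_tok && size < 8)
		return -EINVAL;

	/* Round up to a page and catch wrap-around for huge sizes. */
	aligned = (size + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
	if (aligned < size)
		return -EOVERFLOW;

	*aligned_size = aligned;
	return 0;
}
```

Note that the mapping itself uses the aligned size, while the token is
placed relative to the size the caller asked for.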

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Change syscall common -> 64 (Kees)
- Use bit shift notation instead of 0x1 for uapi header (Kees)
- Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
- Block unsupported flags (Kees)
- Require size >= 8 to set token (Kees)

v2:
- Change syscall to take address like mmap() for CRIU's usage

v1:
- New patch (replaces PROT_SHADOW_STACK).

arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/uapi/asm/mman.h | 3 ++
arch/x86/kernel/shstk.c | 57 ++++++++++++++++++++++----
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 2 +-
kernel/sys_ni.c | 1 +
6 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..f65c671ce3b1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 64 map_shadow_stack sys_map_shadow_stack

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 775dbd3aff73..15c5a1c4fc29 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -12,6 +12,9 @@
((key) & 0x8 ? VM_PKEY_BIT3 : 0))
#endif

+/* Flags for map_shadow_stack(2) */
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */
+
#include <asm-generic/mman.h>

#endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 332b7c73a1af..9a025eea520f 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@
#include <linux/compat.h>
#include <linux/sizes.h>
#include <linux/user.h>
+#include <linux/syscalls.h>
#include <asm/msr.h>
#include <asm/fpu/xstate.h>
#include <asm/fpu/types.h>
@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
return 0;
}

-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+ unsigned long token_offset, bool set_res_tok)
{
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
struct mm_struct *mm = current->mm;
- unsigned long addr, unused;
+ unsigned long mapped_addr, unused;

- mmap_write_lock(mm);
- addr = do_mmap(NULL, 0, size, PROT_READ, flags,
- VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+ if (addr)
+ flags |= MAP_FIXED_NOREPLACE;

+ mmap_write_lock(mm);
+ mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+ VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
mmap_write_unlock(mm);

- return addr;
+ if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+ goto out;
+
+ if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+ vm_munmap(mapped_addr, size);
+ return -EINVAL;
+ }
+
+out:
+ return mapped_addr;
}

static unsigned long adjust_shstk_size(unsigned long size)
@@ -134,7 +147,7 @@ static int shstk_setup(void)
return -EOPNOTSUPP;

size = adjust_shstk_size(0);
- addr = alloc_shstk(size);
+ addr = alloc_shstk(0, size, 0, false);
if (IS_ERR_VALUE(addr))
return PTR_ERR((void *)addr);

@@ -179,7 +192,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,


size = adjust_shstk_size(stack_size);
- addr = alloc_shstk(size);
+ addr = alloc_shstk(0, size, 0, false);
if (IS_ERR_VALUE(addr))
return PTR_ERR((void *)addr);

@@ -373,6 +386,34 @@ static int shstk_disable(void)
return 0;
}

+
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+ bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+ unsigned long aligned_size;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -ENOSYS;
+
+ if (flags & ~SHADOW_STACK_SET_TOKEN)
+ return -EINVAL;
+
+ /* If there isn't space for a token */
+ if (set_tok && size < 8)
+ return -EINVAL;
+
+ /*
+ * An overflow would result in attempting to write the restore token
+ * to the wrong location. Not catastrophic, but just return the right
+ * error code and block it.
+ */
+ aligned_size = PAGE_ALIGN(size);
+ if (aligned_size < size)
+ return -EOVERFLOW;
+
+ return alloc_shstk(addr, aligned_size, size, set_tok);
+}
+
long cet_prctl(struct task_struct *task, int option, unsigned long features)
{
if (option == ARCH_CET_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..3ae05cbdea5b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)

#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452

/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
COND_SYSCALL(modify_ldt);
COND_SYSCALL(vm86);
COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);

/* s390 */
COND_SYSCALL(s390_pci_mmio_read);
--
2.17.1


2022-11-04 23:00:33

by Edgecombe, Rick P

Subject: [PATCH v3 32/37] x86/cet/shstk: Wire in CET interface

The kernel now has the main CET functionality to support applications.
Wire in the WRSS and shadow stack enable/disable functions into the
existing CET API skeleton.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v2:
- Split from other patches

arch/x86/kernel/shstk.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index cbd0970b26d7..71620b77a654 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -463,9 +463,17 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
return -EINVAL;

if (option == ARCH_CET_DISABLE) {
+ if (features & CET_WRSS)
+ return wrss_control(false);
+ if (features & CET_SHSTK)
+ return shstk_disable();
return -EINVAL;
}

/* Handle ARCH_CET_ENABLE */
+ if (features & CET_SHSTK)
+ return shstk_setup();
+ if (features & CET_WRSS)
+ return wrss_control(true);
return -EINVAL;
}
--
2.17.1


2022-11-04 23:00:48

by Edgecombe, Rick P

Subject: [PATCH v3 01/37] Documentation/x86: Add CET description

From: Yu-cheng Yu <[email protected]>

Introduce a new document on Control-flow Enforcement Technology (CET).

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Clarify kernel IBT is supported by the kernel. (Kees, Andrew Cooper)
- Clarify which arch_prctl's can take multiple bits. (Kees)
- Describe ASLR characteristics of thread shadow stacks. (Kees)
- Add exec section. (Andrew Cooper)
- Fix some capitalization (Bagas Sanjaya)
- Update new location of enablement status proc.
- Add info about new user_shstk software capability.
- Add more info about what the kernel pushes to the shadow stack on
signal.

v2:
- Updated to new arch_prctl() API
- Add bit about new proc status

v1:
- Update and clarify the docs.
- Moved kernel parameters documentation to other patch.

Documentation/x86/cet.rst | 147 ++++++++++++++++++++++++++++++++++++
Documentation/x86/index.rst | 1 +
2 files changed, 148 insertions(+)
create mode 100644 Documentation/x86/cet.rst

diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
new file mode 100644
index 000000000000..b56811566531
--- /dev/null
+++ b/Documentation/x86/cet.rst
@@ -0,0 +1,147 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Control-flow Enforcement Technology (CET)
+=========================================
+
+Overview
+========
+
+Control-flow Enforcement Technology (CET) is a term referring to several
+related x86 processor features that provide protection against control
+flow hijacking attacks. The HW feature itself can be set up to protect
+both applications and the kernel.
+
+CET introduces Shadow Stack and Indirect Branch Tracking (IBT). Shadow stack
+is a secondary stack allocated from memory and cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies indirect CALL/JMP targets are intended
+as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have both Shadow
+Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only userspace
+Shadow Stack and kernel IBT are supported.
+
+The Kconfig option is X86_USER_SHADOW_STACK, and it can be disabled with
+the kernel parameter clearcpuid, like this: "clearcpuid=user_shstk".
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "shstk" and "ibt" relate to the individual HW features. "user_shstk"
+relates to whether the userspace shadow stack specifically is supported.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output:
+
+ readelf -n <application> | grep -a SHSTK
+ properties: x86 feature: SHSTK
+
+The kernel does not process these application markers directly. Applications
+or loaders must enable CET features using the arch_prctl() interface described
+below. Typically this would be done in the dynamic loader or static runtime
+objects, as is the case in GLIBC.
+
+CET arch_prctl()'s
+==================
+
+ELF features should be enabled by the loader using the below arch_prctl's.
+
+arch_prctl(ARCH_CET_ENABLE, unsigned int feature)
+ Enable a single feature specified in 'feature'. Can only operate on
+ one feature at a time.
+
+arch_prctl(ARCH_CET_DISABLE, unsigned int feature)
+ Disable a single feature specified in 'feature'. Can only operate on
+ one feature at a time.
+
+arch_prctl(ARCH_CET_LOCK, unsigned int features)
+ Lock in features at their current enabled or disabled status. 'features'
+ is a mask of all features to lock. All bits set are processed, unset bits
+ are ignored. The mask is ORed with the existing value. So any feature bits
+ set here cannot be enabled or disabled afterwards.
+
+The return values are as follows:
+ On success, return 0. On error, errno can be::
+
+ -EPERM if any of the passed features are locked.
+ -EOPNOTSUPP if the feature is not supported by the hardware or
+ disabled by kernel parameter.
+ -EINVAL if the arguments are invalid (non-existent feature, etc.)
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc status
+===========
+To check if an application is actually running with shadow stack, the
+user can read /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+ x86_Thread_features: shstk wrss
+ x86_Thread_features_locked: shstk wrss
+
+The implementation of the Shadow Stack
+======================================
+
+Shadow Stack size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However, because
+a compat-mode application's address space is smaller, each of its threads'
+shadow stacks is sized MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+By default, the main program and its signal handlers use the same shadow
+stack. Because the shadow stack stores only return addresses, a large
+shadow stack covers the condition that both the program stack and the
+signal alternate stack run out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+ |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+ (bit 63 set to 1)
+ | ...| - Other state may be added in the future
+
+
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow stack access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stacks behave like mmap() with respect to
+ASLR behavior.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel. At that point,
+userspace can choose to re-enable or lock them.
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index c73d133fd37c..9ac03055c4b5 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -22,6 +22,7 @@ x86-specific Documentation
mtrr
pat
intel-hfi
+ cet
iommu
intel_txt
amd-memory-encryption
--
2.17.1


2022-11-04 23:01:13

by Edgecombe, Rick P

Subject: [PATCH v3 24/37] x86: Introduce userspace API for CET enabling

From: "Kirill A. Shutemov" <[email protected]>

Add three new arch_prctl() handles:

- ARCH_CET_ENABLE/DISABLE enables or disables the specified
feature. Returns 0 on success or an error.

- ARCH_CET_LOCK prevents future disabling or enabling of the
specified feature. Returns 0 on success or an error.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().

This is a preparation patch. It does not implement any features.
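
The intended bookkeeping can be sketched in userspace C as below. This is a
model of the semantics described above, not the code in this patch: the
enable and disable paths here still return an error until later patches wire
in real features. The struct, enum and SK_FEAT_* names are invented for the
sketch, and the order of the checks is an assumption.

```c
#include <errno.h>

/* Two made-up feature bits standing in for real CET features. */
#define SK_FEAT_A (1UL << 0)
#define SK_FEAT_B (1UL << 1)

enum sketch_op { SK_OP_ENABLE, SK_OP_DISABLE, SK_OP_LOCK };

/* Per-thread state: what is enabled and what can no longer change. */
struct sketch_thread {
	unsigned long features;
	unsigned long features_locked;
};

long sketch_cet_prctl(struct sketch_thread *t, enum sketch_op op,
		      unsigned long features)
{
	/* Locking takes a mask and ORs it into the existing lock bits. */
	if (op == SK_OP_LOCK) {
		t->features_locked |= features;
		return 0;
	}

	/* Locked features can be neither enabled nor disabled. */
	if (features & t->features_locked)
		return -EPERM;

	/* Enable and disable take exactly one known feature per call. */
	if (features != SK_FEAT_A && features != SK_FEAT_B)
		return -EINVAL;

	if (op == SK_OP_ENABLE)
		t->features |= features;
	else
		t->features &= ~features;
	return 0;
}
```

A loader would typically enable what it needs and then lock the set, after
which later attempts to change a locked feature fail with -EPERM.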

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
[tweaked with feedback from tglx]
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Move shstk.c Makefile changes earlier (Kees)
- Add #ifdef around features_locked and features (Kees)
- Encapsulate features reset earlier in reset_thread_features() so
features and features_locked are not referenced in code that would be
compiled !CONFIG_X86_USER_SHADOW_STACK. (Kees)
- Fix typo in commit log (Kees)
- Switch arch_prctl() numbers to avoid conflict with LAM

v2:
- Only allow one enable/disable per call (tglx)
- Return error code like a normal arch_prctl() (Alexander Potapenko)
- Make CET only (tglx)

arch/x86/include/asm/cet.h | 21 +++++++++++++++
arch/x86/include/asm/processor.h | 5 ++++
arch/x86/include/uapi/asm/prctl.h | 6 +++++
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/process_64.c | 7 ++++-
arch/x86/kernel/shstk.c | 44 +++++++++++++++++++++++++++++++
6 files changed, 84 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..a2f3c6e06ef5
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+long cet_prctl(struct task_struct *task, int option, unsigned long features);
+void reset_thread_features(void);
+#else
+static inline long cet_prctl(struct task_struct *task, int option,
+ unsigned long features) { return -EINVAL; }
+static inline void reset_thread_features(void) {}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 67c9d73b31fa..ca66d320a263 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -530,6 +530,11 @@ struct thread_struct {
*/
u32 pkru;

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ unsigned long features;
+ unsigned long features_locked;
+#endif
+
/* Floating point and extended processor state */
struct fpu fpu;
/*
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..2dae9997ee17 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,10 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003

+/* Don't use 0x3001-0x3004 because of old glibcs */
+
+#define ARCH_CET_ENABLE 0x5001
+#define ARCH_CET_DISABLE 0x5002
+#define ARCH_CET_LOCK 0x5003
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f901658d9f7c..fbb1cb34188d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -143,6 +143,8 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += sev.o

obj-$(CONFIG_CFI_CLANG) += cfi.o

+obj-$(CONFIG_X86_USER_SHADOW_STACK) += shstk.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6b3418bff326..17fec059317c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
load_gs_index(__USER_DS);
}

+ reset_thread_features();
+
loadsegment(fs, 0);
loadsegment(es, _ds);
loadsegment(ds, _ds);
@@ -830,7 +832,10 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_MAP_VDSO_64:
return prctl_map_vdso(&vdso_image_64, arg2);
#endif
-
+ case ARCH_CET_ENABLE:
+ case ARCH_CET_DISABLE:
+ case ARCH_CET_LOCK:
+ return cet_prctl(task, option, arg2);
default:
ret = -EINVAL;
break;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..ed6f25cc07c5
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <asm/prctl.h>
+
+void reset_thread_features(void)
+{
+ current->thread.features = 0;
+ current->thread.features_locked = 0;
+}
+
+long cet_prctl(struct task_struct *task, int option, unsigned long features)
+{
+ if (option == ARCH_CET_LOCK) {
+ task->thread.features_locked |= features;
+ return 0;
+ }
+
+ /* Don't allow via ptrace */
+ if (task != current)
+ return -EINVAL;
+
+ /* Do not allow to change locked features */
+ if (features & task->thread.features_locked)
+ return -EPERM;
+
+ /* Only support enabling/disabling one feature at a time. */
+ if (hweight_long(features) > 1)
+ return -EINVAL;
+
+ if (option == ARCH_CET_DISABLE) {
+ return -EINVAL;
+ }
+
+ /* Handle ARCH_CET_ENABLE */
+ return -EINVAL;
+}
--
2.17.1


2022-11-04 23:01:47

by Edgecombe, Rick P

Subject: [PATCH v3 17/37] mm: Fixup places that call pte_mkwrite() directly

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

With the introduction of shadow stack memory there are two ways a pte can
be writable: regular writable memory and shadow stack memory.

In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
where a PTE is made writable. However, there are places where pte_mkwrite()
is called directly and the logic should now also create a shadow stack PTE
in the case of a shadow stack VMA.

- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
directly and call pte_mkwrite(). Teach it about pte_mkwrite_shstk()

- When userfaultfd is creating a PTE after userspace handles the fault
it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk()

To make the code cleaner, introduce is_shstk_write() which simplifies
checking for VM_WRITE | VM_SHADOW_STACK together.

In other cases where pte_mkwrite() is called directly, the VMA will not
be VM_SHADOW_STACK, and so shadow stack memory should not be created.
- In the case of pte_savedwrite(), shadow stack VMAs are excluded.
- In the case of the "dirty_accountable" optimization in mprotect(),
shadow stack VMAs won't be VM_SHARED, so it is not necessary.
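
The selection logic described above can be sketched as a small standalone model. The flag values and the `pte_kind` enum are hypothetical and chosen only for illustration (the real VM_* bits are defined by the kernel); the point is that the shadow stack case must be tested before the plain VM_WRITE case, because a shadow stack VMA has both flags set.

```c
#include <stdbool.h>

/* Hypothetical flag values, not the kernel's actual VM_* bits. */
#define MODEL_VM_WRITE         0x2ULL
#define MODEL_VM_SHADOW_STACK  0x100000000ULL

/* Hypothetical result type naming which helper would be applied. */
enum pte_kind {
	PTE_READONLY,	/* neither helper is called */
	PTE_WRITABLE,	/* pte_mkwrite() */
	PTE_SHSTK,	/* pte_mkwrite_shstk() */
};

/* True only when both VM_WRITE and VM_SHADOW_STACK are set. */
bool is_shstk_write_model(unsigned long long vm_flags)
{
	const unsigned long long mask = MODEL_VM_SHADOW_STACK | MODEL_VM_WRITE;

	return (vm_flags & mask) == mask;
}

/* Mirrors the if/else-if ordering used in the call sites of this patch. */
enum pte_kind pick_pte_kind(unsigned long long vm_flags)
{
	if (is_shstk_write_model(vm_flags))
		return PTE_SHSTK;
	if (vm_flags & MODEL_VM_WRITE)
		return PTE_WRITABLE;
	return PTE_READONLY;
}
```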

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v3:
- Restore do_anonymous_page() that accidentally moved commits (Kirill)
- Open code maybe_mkwrite() cases from v2, so the behavior doesn't change
to mark non-writable PTEs dirty. (Nadav)

v2:
- Updated commit log with comments from Dave Hansen
- Dave also suggested (I understood) to maybe tweak vm_get_page_prot()
to avoid having to call maybe_mkwrite(). After playing around with
this I opted to *not* do this. Shadow stack memory is
effectively writable, so having the default permissions be writable
ended up mapping the zero page as writable and other surprises. So
creating shadow stack memory needs to be done with manual logic
like pte_mkwrite().
- Drop change in change_pte_range() because it couldn't actually trigger
for shadow stack VMAs.
- Clarify reasoning for skipped cases of pte_mkwrite().

Yu-cheng v25:
- Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().

arch/x86/include/asm/pgtable.h | 3 +++
arch/x86/mm/pgtable.c | 6 ++++++
include/linux/pgtable.h | 7 +++++++
mm/memory.c | 5 ++++-
mm/migrate_device.c | 4 +++-
mm/userfaultfd.c | 10 +++++++---
6 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index df67bcf9f69e..d57dc1b2d3e8 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -919,6 +919,9 @@ static inline pgd_t pti_set_user_pgtbl(pgd_t *pgdp, pgd_t pgd)
}
#endif /* CONFIG_PAGE_TABLE_ISOLATION */

+#define is_shstk_write is_shstk_write
+extern bool is_shstk_write(unsigned long vm_flags);
+
#endif /* __ASSEMBLY__ */


diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8525f2876fb4..f0e536bea3ca 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -876,3 +876,9 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)

#endif /* CONFIG_X86_64 */
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+bool is_shstk_write(unsigned long vm_flags)
+{
+ return (vm_flags & (VM_SHADOW_STACK | VM_WRITE)) ==
+ (VM_SHADOW_STACK | VM_WRITE);
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5ce6732a6b65..36926a207b6d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1567,6 +1567,13 @@ static inline bool arch_has_pfn_modify_check(void)
}
#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */

+#ifndef is_shstk_write
+static inline bool is_shstk_write(unsigned long vm_flags)
+{
+ return false;
+}
+#endif
+
/*
* Architecture PAGE_KERNEL_* fallbacks
*
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..b9bee283aad3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4128,7 +4128,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)

entry = mk_pte(page, vma->vm_page_prot);
entry = pte_sw_mkyoung(entry);
- if (vma->vm_flags & VM_WRITE)
+
+ if (is_shstk_write(vma->vm_flags))
+ entry = pte_mkwrite_shstk(pte_mkdirty(entry));
+ else if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6fa682eef7a0..4c21c600bf46 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -641,7 +641,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
goto abort;
}
entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
+ if (is_shstk_write(vma->vm_flags))
+ entry = pte_mkwrite_shstk(pte_mkdirty(entry));
+ else if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
}

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3d0fef3980b3..503135b079b6 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
int ret;
pte_t _dst_pte, *dst_pte;
bool writable = dst_vma->vm_flags & VM_WRITE;
+ bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
bool vm_shared = dst_vma->vm_flags & VM_SHARED;
bool page_in_cache = page->mapping;
spinlock_t *ptl;
@@ -83,9 +84,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
writable = false;
}

- if (writable)
- _dst_pte = pte_mkwrite(_dst_pte);
- else
+ if (writable) {
+ if (shstk)
+ _dst_pte = pte_mkwrite_shstk(_dst_pte);
+ else
+ _dst_pte = pte_mkwrite(_dst_pte);
+ } else
/*
* We need this to make sure write bit removed; as mk_pte()
* could return a pte with write bit set.
--
2.17.1


2022-11-04 23:01:49

by Edgecombe, Rick P

Subject: [PATCH v3 26/37] x86/shstk: Handle thread shadow stack

From: Yu-cheng Yu <[email protected]>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use stack_size passed from clone3() syscall for thread shadow stack size. A
compat-mode thread shadow stack size is further reduced to 1/4. This
allows more threads to run in a 32-bit address space. The clone() does not
pass stack_size, which was added to clone3(). In that case, use
RLIMIT_STACK size and cap to 4 GB.
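
As an illustration of the sizing rule in the paragraph above, a rough userspace model follows. `shstk_size_model()` is a hypothetical name, and the sketch covers only the two cases spelled out here: an explicit clone3() stack_size, or a fallback to RLIMIT_STACK capped at 4 GB. The compat-mode reduction and any page alignment are deliberately left out.

```c
#include <stdint.h>

#define MODEL_SZ_4G (4ULL << 30)

/*
 * Pick a thread shadow stack size.
 *  - stack_size != 0: the caller used clone3() and passed a size; use it.
 *  - stack_size == 0: plain clone(); fall back to the stack rlimit,
 *    capped at 4 GB so an unlimited rlimit does not produce a huge mapping.
 */
uint64_t shstk_size_model(uint64_t stack_size, uint64_t rlimit_stack)
{
	if (stack_size)
		return stack_size;

	return rlimit_stack < MODEL_SZ_4G ? rlimit_stack : MODEL_SZ_4G;
}
```

For example, an 8 MB rlimit with no explicit size gives an 8 MB shadow stack, while an unlimited rlimit (all bits set) gives 4 GB.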

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
and overwrite up the shadow stack. The child can safely overwrite down
the shadow stack, as the parent can just overwrite this later. So CET does
not add any additional limitations for vfork().
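
The CLONE_VM versus vfork() distinction comes down to a single flag test. The sketch below models it in userspace; `needs_new_shstk_model()` is a hypothetical name, the two constants carry the same numeric values as CLONE_VM and CLONE_VFORK, and the separate check for whether shadow stack is enabled at all is not modelled.

```c
#include <stdbool.h>

/* Same numeric values as CLONE_VM and CLONE_VFORK. */
#define MODEL_CLONE_VM     0x00000100UL
#define MODEL_CLONE_VFORK  0x00004000UL

/*
 * A new shadow stack is needed only when the child shares the address
 * space (CLONE_VM) and the parent keeps running (no CLONE_VFORK).
 *  - fork():  no CLONE_VM, the child gets a copy-on-write copy.
 *  - vfork(): CLONE_VM | CLONE_VFORK, the parent is suspended, so the
 *             shadow stack can be shared.
 *  - thread:  CLONE_VM only, a separate shadow stack is required.
 */
bool needs_new_shstk_model(unsigned long clone_flags)
{
	return (clone_flags & (MODEL_CLONE_VFORK | MODEL_CLONE_VM)) ==
	       MODEL_CLONE_VM;
}
```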

Userspace implementing POSIX vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared in the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Fix update_fpu_shstk() stub (Mike Rapoport)
- Fix chunks around alloc_shstk() in wrong patch (Kees)
- Fix stack_size/flags swap (Kees)
- Use centralized stack size logic (Kees)

v2:
- Have fpu_clone() take new shadow stack pointer and update SSP in
xsave buffer for new task. (tglx)

v1:
- Expand commit log.
- Add more comments.
- Switch to xsave helpers.

Yu-cheng v30:
- Update comments about clone()/clone3(). (Borislav Petkov)

arch/x86/include/asm/cet.h | 7 +++++
arch/x86/include/asm/fpu/sched.h | 3 +-
arch/x86/include/asm/mmu_context.h | 2 ++
arch/x86/kernel/fpu/core.c | 41 +++++++++++++++++++++++++++-
arch/x86/kernel/process.c | 18 +++++++++++-
arch/x86/kernel/shstk.c | 44 ++++++++++++++++++++++++++++--
6 files changed, 110 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index cade110b2ea8..1a97223e7d2f 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -15,11 +15,18 @@ struct thread_shstk {

long cet_prctl(struct task_struct *task, int option, unsigned long features);
void reset_thread_features(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+ unsigned long stack_size,
+ unsigned long *shstk_addr);
void shstk_free(struct task_struct *p);
#else
static inline long cet_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
static inline void reset_thread_features(void) {}
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+ unsigned long clone_flags,
+ unsigned long stack_size,
+ unsigned long *shstk_addr) { return 0; }
static inline void shstk_free(struct task_struct *p) {}
#endif /* CONFIG_X86_USER_SHADOW_STACK */

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index b2486b2cbc6e..54c9c2fd1907 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@

extern void save_fpregs_to_fpstate(struct fpu *fpu);
extern void fpu__drop(struct fpu *fpu);
-extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+ unsigned long shstk_addr);
extern void fpu_flush_thread(void);

/*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index b8d40ddeab00..d29988cbdf20 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -146,6 +146,8 @@ do { \
#else
#define deactivate_mm(tsk, mm) \
do { \
+ if (!tsk->vfork_done) \
+ shstk_free(tsk); \
load_gs_index(0); \
loadsegment(fs, 0); \
} while (0)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8b3162badab7..3a2f37ac3005 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -555,8 +555,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
}
}

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+ struct cet_user_state *xstate;
+
+ /* If ssp update is not needed. */
+ if (!ssp)
+ return 0;
+
+ xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+ XFEATURE_CET_USER);
+
+ /*
+ * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+ * stack and the fpu state should be up to date since it was just copied
+ * from the parent in fpu_clone(). So there must be a valid non-init CET
+ * state location in the buffer.
+ */
+ if (WARN_ON_ONCE(!xstate))
+ return 1;
+
+ xstate->user_ssp = (u64)ssp;
+
+ return 0;
+}
+#else
+static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
+{
+ return 0;
+}
+#endif
+
/* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+ unsigned long ssp)
{
struct fpu *src_fpu = &current->thread.fpu;
struct fpu *dst_fpu = &dst->thread.fpu;
@@ -616,6 +649,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
if (use_xsave())
dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;

+ /*
+ * Update shadow stack pointer, in case it changed during clone.
+ */
+ if (update_fpu_shstk(dst, ssp))
+ return 1;
+
trace_x86_fpu_copy_src(src_fpu);
trace_x86_fpu_copy_dst(dst_fpu);

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c21b7347a26d..12c28161867c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,6 +47,7 @@
#include <asm/frame.h>
#include <asm/unwind.h>
#include <asm/tdx.h>
+#include <asm/cet.h>

#include "process.h"

@@ -118,6 +119,7 @@ void exit_thread(struct task_struct *tsk)

free_vm86(t);

+ shstk_free(tsk);
fpu__drop(fpu);
}

@@ -139,6 +141,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
struct inactive_task_frame *frame;
struct fork_frame *fork_frame;
struct pt_regs *childregs;
+ unsigned long shstk_addr = 0;
int ret = 0;

childregs = task_pt_regs(p);
@@ -173,7 +176,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->flags = X86_EFLAGS_FIXED;
#endif

- fpu_clone(p, clone_flags, args->fn);
+ /* Allocate a new shadow stack for pthread if needed */
+ ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
+ &shstk_addr);
+ if (ret)
+ return ret;
+
+ fpu_clone(p, clone_flags, args->fn, shstk_addr);

/* Kernel thread ? */
if (unlikely(p->flags & PF_KTHREAD)) {
@@ -219,6 +228,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
io_bitmap_share(p);

+ /*
+ * If copy_thread() is failing, don't leak the shadow stack possibly
+ * allocated in shstk_alloc_thread_stack() above.
+ */
+ if (ret)
+ shstk_free(p);
+
return ret;
}

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 20da2008e021..a7a982924b9a 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -47,7 +47,7 @@ static unsigned long alloc_shstk(unsigned long size)
unsigned long addr, unused;

mmap_write_lock(mm);
- addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+ addr = do_mmap(NULL, 0, size, PROT_READ, flags,
VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);

mmap_write_unlock(mm);
@@ -126,6 +126,40 @@ void reset_thread_features(void)
current->thread.features_locked = 0;
}

+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+ unsigned long stack_size, unsigned long *shstk_addr)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+ unsigned long addr, size;
+
+ /*
+ * If shadow stack is not enabled on the new thread, skip any
+ * switch to a new shadow stack.
+ */
+ if (!features_enabled(CET_SHSTK))
+ return 0;
+
+ /*
+ * For CLONE_VM, except vfork, the child needs a separate shadow
+ * stack.
+ */
+ if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+ return 0;
+
+
+ size = adjust_shstk_size(stack_size);
+ addr = alloc_shstk(size);
+ if (IS_ERR_VALUE(addr))
+ return PTR_ERR((void *)addr);
+
+ shstk->base = addr;
+ shstk->size = size;
+
+ *shstk_addr = addr + size;
+
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -134,7 +168,13 @@ void shstk_free(struct task_struct *tsk)
!features_enabled(CET_SHSTK))
return;

- if (!tsk->mm)
+ /*
+ * When fork() with CLONE_VM fails, the child (tsk) already has a
+ * shadow stack allocated, and exit_thread() calls this function to
+ * free it. In this case the parent (current) and the child share
+ * the same mm struct.
+ */
+ if (!tsk->mm || tsk->mm != current->mm)
return;

unmap_shadow_stack(shstk->base, shstk->size);
--
2.17.1


2022-11-04 23:03:37

by Edgecombe, Rick P

Subject: [PATCH v3 36/37] x86/cet/shstk: Add ARCH_CET_UNLOCK

From: Mike Rapoport <[email protected]>

Userspace loaders may lock features before a CRIU restore operation has
the chance to set them to whatever state is required by the process
being restored. Allow a way for CRIU to unlock features. Add it as an
arch_prctl() like the other CET operations, but restrict it to being
called via the ptrace arch_prctl() interface.
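
A small model of the tracer-only path may help show the effect on the lock mask. It is a sketch under assumptions: `cet_ptrace_branch_model()` and its `checkpoint_restore` parameter are hypothetical names, and the function models only the branch taken when the target task is not the caller.

```c
#include <errno.h>
#include <stdbool.h>

/* Value copied from the uapi/asm/prctl.h hunk in this patch. */
#define ARCH_CET_UNLOCK 0x5004

/*
 * Models the "task != current" branch only. 'checkpoint_restore' stands
 * in for IS_ENABLED(CONFIG_CHECKPOINT_RESTORE). Unlock clears the named
 * bits from the locked mask; every other option is refused on this path.
 */
long cet_ptrace_branch_model(unsigned long *features_locked, int option,
			     bool checkpoint_restore, unsigned long features)
{
	if (option == ARCH_CET_UNLOCK && checkpoint_restore) {
		*features_locked &= ~features;
		return 0;
	}
	return -EINVAL;
}
```

Starting from a locked mask of 0x3, unlocking 0x1 leaves 0x2 locked; bits that are not named stay locked.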

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
[Merged into recent API changes, added commit log and docs]
Signed-off-by: Rick Edgecombe <[email protected]>
---

v3:
- Depend on CONFIG_CHECKPOINT_RESTORE (Kees)

Documentation/x86/cet.rst | 4 ++++
arch/x86/include/uapi/asm/prctl.h | 1 +
arch/x86/kernel/process_64.c | 1 +
arch/x86/kernel/shstk.c | 9 +++++++--
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/cet.rst b/Documentation/x86/cet.rst
index b56811566531..f69cafb1feff 100644
--- a/Documentation/x86/cet.rst
+++ b/Documentation/x86/cet.rst
@@ -66,6 +66,10 @@ arch_prctl(ARCH_CET_LOCK, unsigned int features)
are ignored. The mask is ORed with the existing value. So any feature bits
set here cannot be enabled or disabled afterwards.

+arch_prctl(ARCH_CET_UNLOCK, unsigned int features)
+ Unlock features. 'features' is a mask of all features to unlock. All
+ bits set are processed, unset bits are ignored.
+
The return values are as following:
On success, return 0. On error, errno can be::

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5f1d3181e4a1..0c37fd0ad8d9 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -25,6 +25,7 @@
#define ARCH_CET_ENABLE 0x5001
#define ARCH_CET_DISABLE 0x5002
#define ARCH_CET_LOCK 0x5003
+#define ARCH_CET_UNLOCK 0x5004

/* ARCH_CET_ features bits */
#define CET_SHSTK (1ULL << 0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 17fec059317c..03bc16c9cc19 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -835,6 +835,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_CET_ENABLE:
case ARCH_CET_DISABLE:
case ARCH_CET_LOCK:
+ case ARCH_CET_UNLOCK:
return cet_prctl(task, option, arg2);
default:
ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 71620b77a654..bed7032d35f2 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -450,9 +450,14 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
return 0;
}

- /* Don't allow via ptrace */
- if (task != current)
+ /* Only allow via ptrace */
+ if (task != current) {
+ if (option == ARCH_CET_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
+ task->thread.features_locked &= ~features;
+ return 0;
+ }
return -EINVAL;
+ }

/* Do not allow to change locked features */
if (features & task->thread.features_locked)
--
2.17.1


2022-11-04 23:11:00

by Edgecombe, Rick P

Subject: [PATCH v3 12/37] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

From: Yu-cheng Yu <[email protected]>

When Shadow Stack is in use, Write=0,Dirty=1 PTEs are reserved for shadow
stack. Copy-on-write PTEs then have Write=0,Cow=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a Write=0 PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE. However, this will not occur on processors supporting
Shadow Stack, and a TLB flush is not necessary.

The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
atomically, a transient shadow stack PTE can be created as a result.
Thus, prevent that with cmpxchg.
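
The second case can be illustrated with C11 atomics in userspace. This is a sketch, not the kernel code: the function names are hypothetical, the Write and Dirty bit positions follow the x86 PTE layout, and the Cow bit position is an assumed software bit chosen for illustration. The whole Write/Dirty/Cow transition is computed on a private copy and published with one compare-and-exchange, so no other observer can see the intermediate Write=0,Dirty=1 value.

```c
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_PAGE_RW     (1ULL << 1)	/* x86 Write bit */
#define MODEL_PAGE_DIRTY  (1ULL << 6)	/* x86 Dirty bit */
#define MODEL_PAGE_COW    (1ULL << 58)	/* assumed software bit */

/* Clear Write; if Dirty was set, move it to Cow. */
uint64_t pte_wrprotect_model(uint64_t pte)
{
	pte &= ~MODEL_PAGE_RW;
	if (pte & MODEL_PAGE_DIRTY) {
		pte &= ~MODEL_PAGE_DIRTY;
		pte |= MODEL_PAGE_COW;
	}
	return pte;
}

/*
 * Retry until the value the new PTE was computed from is still the
 * value in memory. On failure the compare-exchange reloads 'old', so a
 * concurrent Dirty=1 update is picked up on the next iteration.
 */
void ptep_set_wrprotect_model(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t updated;

	do {
		updated = pte_wrprotect_model(old);
	} while (!atomic_compare_exchange_weak(ptep, &old, updated));
}
```

A Write=1,Dirty=1 entry ends up as Write=0,Cow=1, and a clean writable entry ends up with all three bits clear.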

In the case of pmdp_set_wrprotect(), for nopmd configs the ->pmd operated
on does not exist and the logic would need to be different. Although the
extra functionality will normally be optimized out when user shadow
stacks are not configured, also exclude it in the preprocessor stage so
that it will still compile. User shadow stack is not supported there by
Linux anyway. Leave the cpu_feature_enabled() check so that the
functionality also disables based on runtime detection of the feature.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the cmpxchg solution.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- Remove unnecessary #ifdef (Dave Hansen)

v2:
- Compile out some code due to clang build error
- Clarify commit log (dhansen)
- Normalize PTE bit descriptions between patches (dhansen)
- Update comment with text from (dhansen)

Yu-cheng v30:
- Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELS > 2 (Borislav Petkov).

arch/x86/include/asm/pgtable.h | 35 ++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 81f388a5a5ab..f252c42f3ca1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1289,6 +1289,21 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
+ /*
+ * Avoid accidentally creating shadow stack PTEs
+ * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
+ * the hardware setting Dirty=1.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+ pte_t old_pte, new_pte;
+
+ old_pte = READ_ONCE(*ptep);
+ do {
+ new_pte = pte_wrprotect(old_pte);
+ } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+ return;
+ }
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}

@@ -1341,6 +1356,26 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ /*
+ * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+ * to _PAGE_COW (see comments at pmd_wrprotect()).
+ * When a thread reads a RW=1, Dirty=0 PMD and before changing it
+ * to RW=0, Dirty=0, another thread could have written to the page
+ * and the PMD is RW=1, Dirty=1 now.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+ pmd_t old_pmd, new_pmd;
+
+ old_pmd = READ_ONCE(*pmdp);
+ do {
+ new_pmd = pmd_wrprotect(old_pmd);
+ } while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+ return;
+ }
+#endif
+
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

--
2.17.1


2022-11-04 23:11:32

by Edgecombe, Rick P

Subject: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

From: Yu-cheng Yu <[email protected]>

Some applications (like GDB and CRIU) would like to tweak CET state via
ptrace. This allows for existing functionality to continue to work for
seized CET applications. Provide an interface based on the xsave buffer
format of CET, but filter unneeded states to make the kernel’s job
easier.

There is already ptrace functionality for accessing xstate, but this
does not include supervisor xfeatures. So there is not a completely
clear place for where to put the CET state. Adding it to the user
xfeatures regset would complicate that code, as it currently shares
logic with signals which should not have supervisor features.

Don’t add a general supervisor xfeature regset like the user one,
because it is better to maintain flexibility for other supervisor
xfeatures to define their own interface. For example, an xfeature may
decide not to expose all of its state to userspace. A lot of enum
values remain to be used, so just put it in dedicated CET regset.

The only downside to not having a generic supervisor xfeature regset,
is that apps need to be enlightened of any new supervisor xfeature
exposed this way (i.e. they can’t try to have generic save/restore
logic). But maybe that is a good thing, because they have to think
through each new xfeature instead of encountering issues when a new
supervisor xfeature is added.

By adding a CET regset, it also has the effect of including the CET state
in a core dump, which could be useful for debugging.

Inside the CET regset setter, filter out invalid state. Today this
includes states disallowed by the HW and states involving Indirect Branch
Tracking which the kernel does not currently support for userspace.

So this leaves three pieces of data that can be set: shadow stack
enablement, WRSS enablement and the shadow stack pointer. It is worth
noting that this is separate from enabling shadow stack via the
arch_prctl()s. Enabling shadow stack involves more than just flipping the
bit. The kernel is made aware that it has to do extra things when cloning
or handling signals. That logic is triggered off of separate feature
enablement state kept in the task struct. So flipping on HW shadow
stack enforcement without notifying the kernel to change its behavior
would severely limit what an application could do without crashing. Since
there is likely no use for this, only allow the CET registers to be set
if shadow stack is already enabled via the arch_prctl()s. This will let
apps like GDB toggle shadow stack enforcement for apps that already have
shadow stack enabled, and minimize scenarios the kernel has to worry
about.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>

---

v3:
- Drop dependence on thread.shstk.size, and use thread.features bits
- Drop 32 bit support

v2:
- Check alignment on ssp.
- Block IBT bits.
- Handle init states instead of returning error.
- Add verbose commit log justifying the design.

Yu-Cheng v12:
- Return -ENODEV when CET registers are in INIT state.
- Check reserved/non-support bits from user input.

arch/x86/include/asm/fpu/regset.h | 7 +--
arch/x86/include/asm/msr-index.h | 5 ++
arch/x86/kernel/fpu/regset.c | 90 +++++++++++++++++++++++++++++++
arch/x86/kernel/ptrace.c | 20 +++++++
include/uapi/linux/elf.h | 1 +
5 files changed, 120 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index 4f928d6a367b..8622184d87f5 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@

#include <linux/regset.h>

-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+ cetregs_active;
extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get,
- xstateregs_get;
+ xstateregs_get, cetregs_get;
extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
- xstateregs_set;
+ xstateregs_set, cetregs_set;

/*
* xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 10ac52705892..674c508798ee 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -437,6 +437,11 @@
#define CET_RESERVED (BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
#define CET_SUPPRESS BIT_ULL(10)
#define CET_WAIT_ENDBR BIT_ULL(11)
+#define CET_EG_LEG_BITMAP_BASE_MASK GENMASK_ULL(63, 13)
+
+#define CET_U_IBT_MASK (CET_ENDBR_EN | CET_LEG_IW_EN | CET_NO_TRACK_EN | \
+ CET_SUPPRESS_DISABLE | CET_SUPPRESS | CET_WAIT_ENDBR | \
+ CET_EG_LEG_BITMAP_BASE_MASK)

#define MSR_IA32_PL0_SSP 0x000006a4 /* ring-0 shadow stack pointer */
#define MSR_IA32_PL1_SSP 0x000006a5 /* ring-1 shadow stack pointer */
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 75ffaef8c299..21225b994b2d 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -8,6 +8,7 @@
#include <asm/fpu/api.h>
#include <asm/fpu/signal.h>
#include <asm/fpu/regset.h>
+#include <asm/prctl.h>

#include "context.h"
#include "internal.h"
@@ -174,6 +175,95 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
return ret;
}

+int cetregs_active(struct task_struct *target, const struct user_regset *regset)
+{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ if (target->thread.features & CET_SHSTK)
+ return regset->n;
+#endif
+ return 0;
+}
+
+int cetregs_get(struct task_struct *target, const struct user_regset *regset,
+ struct membuf to)
+{
+ struct fpu *fpu = &target->thread.fpu;
+ struct cet_user_state *cetregs;
+
+ if (!boot_cpu_has(X86_FEATURE_USER_SHSTK))
+ return -ENODEV;
+
+ sync_fpstate(fpu);
+ cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+ if (!cetregs) {
+ /*
+ * The registers are in the init state. The init values for
+ * these regs are zero, so just zero the output buffer.
+ */
+ membuf_zero(&to, sizeof(struct cet_user_state));
+ return 0;
+ }
+
+ return membuf_write(&to, cetregs, sizeof(struct cet_user_state));
+}
+
+int cetregs_set(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ struct fpu *fpu = &target->thread.fpu;
+ struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+ struct cet_user_state *cetregs, tmp;
+ int r;
+
+ if (!boot_cpu_has(X86_FEATURE_USER_SHSTK) ||
+ !cetregs_active(target, regset))
+ return -ENODEV;
+
+ r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &tmp, 0, -1);
+ if (r)
+ return r;
+
+ /*
+ * Some kernel instructions (IRET, etc) can cause exceptions in the case
+ * of disallowed CET register values. Just prevent invalid values.
+ */
+ if ((tmp.user_ssp >= TASK_SIZE_MAX) || !IS_ALIGNED(tmp.user_ssp, 8))
+ return -EINVAL;
+
+ /*
+ * Don't allow any IBT bits to be set because it is not supported by
+ * the kernel yet. Also don't allow reserved bits.
+ */
+ if ((tmp.user_cet & CET_RESERVED) || (tmp.user_cet & CET_U_IBT_MASK))
+ return -EINVAL;
+
+ fpu_force_restore(fpu);
+
+ /*
+ * Don't want to init the xfeature until the kernel will definitely
+ * overwrite it, otherwise if it inits and then fails out, it would
+ * end up initing it to random data.
+ */
+ if (!xfeature_saved(xsave, XFEATURE_CET_USER) &&
+ WARN_ON(init_xfeature(xsave, XFEATURE_CET_USER)))
+ return -ENODEV;
+
+ cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+ if (WARN_ON(!cetregs)) {
+ /*
+ * This shouldn't ever be NULL because it was successfully
+ * inited above if needed. The only scenario would be if an
+ * xfeature was somehow saved in a buffer, but not enabled in
+ * xsave.
+ */
+ return -ENODEV;
+ }
+
+ memmove(cetregs, &tmp, sizeof(tmp));
+ return 0;
+}
+
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION

/*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index eed8a65d335d..f9e6635b69ce 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -51,6 +51,7 @@ enum x86_regset_32 {
REGSET_XSTATE32,
REGSET_TLS32,
REGSET_IOPERM32,
+ REGSET_CET32,
};

enum x86_regset_64 {
@@ -58,6 +59,7 @@ enum x86_regset_64 {
REGSET_FP64,
REGSET_IOPERM64,
REGSET_XSTATE64,
+ REGSET_CET64,
};

#define REGSET_GENERAL \
@@ -1267,6 +1269,15 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
.active = ioperm_active,
.regset_get = ioperm_get
},
+ [REGSET_CET64] = {
+ .core_note_type = NT_X86_CET,
+ .n = sizeof(struct cet_user_state) / sizeof(u64),
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = cetregs_active,
+ .regset_get = cetregs_get,
+ .set = cetregs_set
+ },
};

static const struct user_regset_view user_x86_64_view = {
@@ -1336,6 +1347,15 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
.active = ioperm_active,
.regset_get = ioperm_get
},
+ [REGSET_CET32] = {
+ .core_note_type = NT_X86_CET,
+ .n = sizeof(struct cet_user_state) / sizeof(u64),
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = cetregs_active,
+ .regset_get = cetregs_get,
+ .set = cetregs_set
+ },
};

static const struct user_regset_view user_x86_32_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index c7b056af9ef0..11089731e2e9 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -406,6 +406,7 @@ typedef struct elf64_shdr {
#define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */
#define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */
#define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
+#define NT_X86_CET 0x203 /* x86 CET state */
#define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
#define NT_S390_TIMER 0x301 /* s390 timer register */
#define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
--
2.17.1


2022-11-04 23:12:10

by Edgecombe, Rick P

Subject: [PATCH v3 23/37] mm: Warn on shadow stack memory in wrong vma

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
treated as shadow stack by the CPU, but this combination used to be
created by the kernel on x86. Previous patches have changed the kernel to
avoid creating these PTEs unless they are for shadow stack memory. In case any
missed corners of the kernel are still creating PTEs like this for
non-shadow stack memory, and to catch any re-introductions of the logic,
warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
stack VMAs when they are being zapped. This won't catch transient cases
but should have decent coverage. It will be compiled out when shadow
stack is not configured.

In order to check if a pte is shadow stack in core mm code, add default
implementations for pte_shstk() and pmd_shstk().
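
For illustration only (not part of the patch), the condition being warned
about can be sketched in standalone C. The sk_/SK_ names are hypothetical;
the bit positions are the usual x86 ones for the Write and Dirty PTE bits,
and the CPU feature check done by the real pte_shstk() is left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Usual x86 PTE bit positions; the SK_ names are hypothetical. */
#define SK_PAGE_RW	(1ULL << 1)	/* Write */
#define SK_PAGE_DIRTY	(1ULL << 6)	/* Dirty */

/* A PTE looks like shadow stack to the CPU when Write=0 and Dirty=1. */
static bool sk_pte_shstk(uint64_t pte_flags)
{
	return (pte_flags & (SK_PAGE_RW | SK_PAGE_DIRTY)) == SK_PAGE_DIRTY;
}

/*
 * Model of the zap-time check: warn only when a shadow stack PTE turns
 * up in a VMA that is not a shadow stack VMA.
 */
static bool sk_should_warn(bool vma_is_shadow_stack, uint64_t pte_flags)
{
	return !vma_is_shadow_stack && sk_pte_shstk(pte_flags);
}
```

A normal dirty, writable PTE (Write=1, Dirty=1) never trips the check, and
neither does a Write=0, Dirty=1 PTE inside a shadow stack VMA.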

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---

v3:
- New patch

arch/x86/include/asm/pgtable.h | 2 ++
include/linux/pgtable.h | 14 ++++++++++++++
mm/huge_memory.c | 2 ++
mm/memory.c | 2 ++
4 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0d18f3a4373d..051c03e59468 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -129,6 +129,7 @@ static inline bool pte_dirty(pte_t pte)
return pte_flags(pte) & _PAGE_DIRTY_BITS;
}

+#define pte_shstk pte_shstk
static inline bool pte_shstk(pte_t pte)
{
if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
@@ -147,6 +148,7 @@ static inline bool pmd_dirty(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
}

+#define pmd_shstk pmd_shstk
static inline bool pmd_shstk(pmd_t pmd)
{
if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 36926a207b6d..dd84af70b434 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -500,6 +500,20 @@ static inline pte_t pte_mkwrite_shstk(pte_t pte)
}
#endif

+#ifndef pte_shstk
+static inline bool pte_shstk(pte_t pte)
+{
+ return false;
+}
+#endif
+
+#ifndef pmd_shstk
+static inline bool pmd_shstk(pmd_t pmd)
+{
+ return false;
+}
+#endif
+
#ifndef pte_clear_savedwrite
#define pte_clear_savedwrite pte_wrprotect
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7643a4db1b50..2540f0d4c8ff 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1656,6 +1656,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
*/
orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
tlb->fullmm);
+ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+ pmd_shstk(orig_pmd));
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
if (vma_is_special_huge(vma)) {
if (arch_needs_pgtable_deposit())
diff --git a/mm/memory.c b/mm/memory.c
index b9bee283aad3..4331f33a02d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1437,6 +1437,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
+ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+ pte_shstk(ptent));
tlb_remove_tlb_entry(tlb, pte, addr);
zap_install_uffd_wp_if_needed(vma, addr, pte, details,
ptent);
--
2.17.1


2022-11-04 23:23:28

by Edgecombe, Rick P

Subject: [RFC 37/37] fs/binfmt_elf: Block old shstk elf bit

The x86 Control-flow Enforcement Technology (CET) feature includes a new
feature called shadow stacks that provides security enforcement of
behavior that is rarely used by non-attackers.

There exists a lurking compatibility problem for userspace shadow stack.
Old binaries exist that are marked as supporting shadow stack in their
elf header, but actually will crash if shadow stack is enabled. This would
only happen if the loader chooses to call the kernel APIs that enable
shadow stack. However, glibc plans to update to do just this, at which
point the old apps will crash.

In a lot of ways this is userspace's business, however the kernel could
save the user from these crashes. It could do this by detecting the elf
bit and blocking the shadow stack APIs, so that the loader (glibc) will fail
to enable shadow stack and the binary would then run without shadow stack.
So implement this logic in the elf processing that happens during exec.

This is a bit dirty, and implemented here just for discussion on whether
the kernel should actually do something like this.

The elf loading logic in the kernel has to do a little extra scanning
through the elf header in order to find the shadow stack bit.

Since some people may not mind if some apps crash, also create
a Kconfig X86_USER_SHADOW_STACK_ALLOW_BROKEN to allow the old binaries
to still have access to the shadow stack kernel APIs.

This is based on an earlier patch by Yu-cheng Yu that was looking at elf
bits on the interpreter instead of the execing binary.
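
As a rough sketch of the flow described above (illustrative only, not the
patch itself), the property parsing and the resulting block on the shadow
stack APIs can be modelled in standalone C. The sk_/SK_ names are
hypothetical, and the allow_broken argument stands in for the
X86_USER_SHADOW_STACK_ALLOW_BROKEN Kconfig option:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GNU_PROPERTY_X86_FEATURE_1_AND		0xc0000002U
#define GNU_PROPERTY_X86_FEATURE_1_IBT		0x00000001U
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK	0x00000002U
#define SK_FEATURE_1_BAD	(GNU_PROPERTY_X86_FEATURE_1_IBT | \
				 GNU_PROPERTY_X86_FEATURE_1_SHSTK)

/* Hypothetical stand-in for the per-exec and per-thread state. */
struct sk_elf_state {
	uint32_t gnu_property;
	bool block_shstk;
};

/* Record the feature word if this is the x86 FEATURE_1_AND property. */
static int sk_parse_property(uint32_t type, const void *data, size_t datasz,
			     struct sk_elf_state *st)
{
	if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
		return 0;
	if (datasz != sizeof(uint32_t))
		return -ENOEXEC;
	memcpy(&st->gnu_property, data, sizeof(uint32_t));
	return 0;
}

/* Block the shadow stack APIs for marked binaries unless overridden. */
static void sk_process_property(struct sk_elf_state *st, bool allow_broken)
{
	st->block_shstk = !allow_broken &&
			  (st->gnu_property & SK_FEATURE_1_BAD) != 0;
}

/* Model of the early exit added to cet_prctl(). */
static int sk_cet_prctl(const struct sk_elf_state *st)
{
	return st->block_shstk ? -EINVAL : 0;
}
```

Note that in this model, as in the patch, a binary marked with either the
IBT or the SHSTK bit ends up blocked, which is why some working binaries
are caught as well.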

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/arm64/include/asm/elf.h | 5 +++++
arch/x86/Kconfig | 13 +++++++++++++
arch/x86/include/asm/cet.h | 2 ++
arch/x86/include/asm/elf.h | 11 +++++++++++
arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/process_64.c | 33 ++++++++++++++++++++++++++++++++
arch/x86/kernel/shstk.c | 15 +++++++++++++++
fs/binfmt_elf.c | 24 ++++++++++++++++++++++-
include/linux/elf.h | 6 ++++++
include/uapi/linux/elf.h | 15 +++++++++++++++
10 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 97932fbf973d..1aa76ed02dda 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -279,6 +279,11 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
return 0;
}

+static inline int arch_process_elf_property(struct arch_elf_state *arch)
+{
+ return 0;
+}
+
static inline int arch_elf_pt_proc(void *ehdr, void *phdr,
struct file *f, bool is_interp,
struct arch_elf_state *state)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f3d14f5accce..da9e43aa91a3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -28,6 +28,7 @@ config X86_64
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_USE_CMPXCHG_LOCKREF
+ select ARCH_USE_GNU_PROPERTY
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE
@@ -60,6 +61,7 @@ config X86
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
select ARCH_32BIT_OFF_T if X86_32
+ select ARCH_BINFMT_ELF_STATE
select ARCH_CLOCKSOURCE_INIT
select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
@@ -1977,6 +1979,17 @@ config X86_USER_SHADOW_STACK

If unsure, say N.

+config X86_USER_SHADOW_STACK_ALLOW_BROKEN
+ bool "Allow enabling shadow stack for broken binaries"
+ depends on EXPERT
+ depends on X86_USER_SHADOW_STACK
+ help
+ There exist old binaries that are marked as compatible with shadow
+ stack, but actually aren't. The kernel blocks these binaries from
+ getting shadow stack enabled by default. But some working binaries
+ are also blocked. Select this option if you would like to allow these
+ binaries to run with shadow stack, and possibly crash.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 098e4ecfdf9b..7f0cabb3db21 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -22,6 +22,7 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
void shstk_free(struct task_struct *p);
int setup_signal_shadow_stack(struct ksignal *ksig);
int restore_signal_shadow_stack(void);
+void bad_cet_binary_disable(bool disable);
#else
static inline long cet_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
@@ -33,6 +34,7 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
static inline void shstk_free(struct task_struct *p) {}
static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
static inline int restore_signal_shadow_stack(void) { return 0; }
+static inline void bad_cet_binary_disable(bool disable) {}
#endif /* CONFIG_X86_USER_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index cb0ff1055ab1..95ee133acffb 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -383,6 +383,17 @@ extern int compat_arch_setup_additional_pages(struct linux_binprm *bprm,

extern bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs);

+struct arch_elf_state {
+ unsigned int gnu_property;
+};
+
+#define INIT_ARCH_ELF_STATE { \
+ .gnu_property = 0, \
+}
+
+#define arch_elf_pt_proc(ehdr, phdr, elf, interp, state) (0)
+#define arch_check_elf(ehdr, interp, interp_ehdr, state) (0)
+
/* Do not change the values. See get_align_mask() */
enum align_flags {
ALIGN_VA_32 = BIT(0),
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a6c414dfd10f..4b333c801010 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -534,6 +534,7 @@ struct thread_struct {
#ifdef CONFIG_X86_USER_SHADOW_STACK
unsigned long features;
unsigned long features_locked;
+ bool bad_cet_binary_disable;

struct thread_shstk shstk;
#endif
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 03bc16c9cc19..461b8e9468df 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -867,3 +867,36 @@ unsigned long KSTK_ESP(struct task_struct *task)
{
return task_pt_regs(task)->sp;
}
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+ bool compat, struct arch_elf_state *state)
+{
+ if (type != GNU_PROPERTY_X86_FEATURE_1_AND)
+ return 0;
+
+ if (datasz != sizeof(unsigned int))
+ return -ENOEXEC;
+
+ state->gnu_property = *(unsigned int *)data;
+ return 0;
+}
+
+int arch_process_elf_property(struct arch_elf_state *state)
+{
+ bad_cet_binary_disable(state->gnu_property & GNU_PROPERTY_X86_FEATURE_1_BAD);
+ return 0;
+}
+#else /* CONFIG_X86_USER_SHADOW_STACK */
+int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
+ bool compat, struct arch_elf_state *state)
+{
+ return 0;
+}
+
+int arch_process_elf_property(struct arch_elf_state *state)
+{
+ return 0;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index bed7032d35f2..cb105e69c840 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -445,6 +445,9 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsi

long cet_prctl(struct task_struct *task, int option, unsigned long features)
{
+ if (task->thread.bad_cet_binary_disable)
+ return -EINVAL;
+
if (option == ARCH_CET_LOCK) {
task->thread.features_locked |= features;
return 0;
@@ -482,3 +485,15 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
return wrss_control(true);
return -EINVAL;
}
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK_ALLOW_BROKEN
+void bad_cet_binary_disable(bool disable)
+{
+ current->thread.bad_cet_binary_disable = false;
+}
+#else /* CONFIG_X86_USER_SHADOW_STACK_ALLOW_BROKEN */
+void bad_cet_binary_disable(bool disable)
+{
+ current->thread.bad_cet_binary_disable = disable;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK_ALLOW_BROKEN */
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6a11025e5850..8b6ae5e423fb 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -764,6 +764,8 @@ static int parse_elf_property(const char *data, size_t *off, size_t datasz,
#define GNU_PROPERTY_TYPE_0_NAME "GNU"
#define NOTE_NAME_SZ (sizeof(GNU_PROPERTY_TYPE_0_NAME))

+
+
static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
struct arch_elf_state *arch)
{
@@ -821,6 +823,18 @@ static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
return ret == -ENOENT ? 0 : ret;
}

+static int check_elf_properties(struct file *f, const struct elf_phdr *phdr)
+{
+ struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
+ int retval;
+
+ retval = parse_elf_properties(f, phdr, &arch_state);
+ if (retval)
+ return retval;
+
+ return arch_process_elf_property(&arch_state);
+}
+
static int load_elf_binary(struct linux_binprm *bprm)
{
struct file *interpreter = NULL; /* to shut gcc up */
@@ -920,13 +934,21 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (retval < 0)
goto out_free_dentry;

- break;
+ /* Quit if already found PT_GNU_PROPERTY */
+ if (elf_property_phdata)
+ break;
+
+ continue;

out_free_interp:
kfree(elf_interpreter);
goto out_free_ph;
}

+ retval = check_elf_properties(bprm->file, elf_property_phdata);
+ if (retval)
+ return retval;
+
elf_ppnt = elf_phdata;
for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++)
switch (elf_ppnt->p_type) {
diff --git a/include/linux/elf.h b/include/linux/elf.h
index c9a46c4e183b..faf961b92a95 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -92,9 +92,15 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
{
return 0;
}
+
+static inline int arch_process_elf_property(struct arch_elf_state *arch)
+{
+ return 0;
+}
#else
extern int arch_parse_elf_property(u32 type, const void *data, size_t datasz,
bool compat, struct arch_elf_state *arch);
+extern int arch_process_elf_property(struct arch_elf_state *arch);
#endif

#ifdef CONFIG_ARCH_HAVE_ELF_PROT
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 11089731e2e9..d9b58adce321 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -469,4 +469,19 @@ typedef struct elf64_note {
/* Bits for GNU_PROPERTY_AARCH64_FEATURE_1_BTI */
#define GNU_PROPERTY_AARCH64_FEATURE_1_BTI (1U << 0)

+/*
+ * See the x86 64 psABI at:
+ * https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI
+ * .note.gnu.property types for x86:
+ */
+/* 0xc0000000 and 0xc0000001 are reserved */
+#define GNU_PROPERTY_X86_FEATURE_1_AND 0xc0000002
+
+/* Bits for GNU_PROPERTY_X86_FEATURE_1_AND */
+#define GNU_PROPERTY_X86_FEATURE_1_IBT 0x00000001
+#define GNU_PROPERTY_X86_FEATURE_1_SHSTK 0x00000002
+
+#define GNU_PROPERTY_X86_FEATURE_1_BAD (GNU_PROPERTY_X86_FEATURE_1_IBT | \
+ GNU_PROPERTY_X86_FEATURE_1_SHSTK)
+
#endif /* _UAPI_LINUX_ELF_H */
--
2.17.1


2022-11-07 18:03:54

by Borislav Petkov

Subject: Re: [PATCH v3 03/37] x86/cpufeatures: Add CPU feature flags for shadow stacks

On Fri, Nov 04, 2022 at 03:35:30PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index b71f4f2ecdd5..5626ecb8a080 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -304,6 +304,7 @@
> #define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */
> #define X86_FEATURE_USE_IBPB_FW (11*32+16) /* "" Use IBPB during runtime firmware calls */
> #define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
> +#define X86_FEATURE_USER_SHSTK (11*32+18) /* Shadow stack support for user mode applications */

This clashes with the calldepth tracking in tip, see updated diff below.

> This is off of v6.1-rc3

In the future, please do all your x86 patches against latest tip/master
so that there are no issues like that.

Thx.

---
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 97669aaf1202..8d67249f5232 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -306,6 +306,7 @@
#define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
#define X86_FEATURE_SGX_EDECCSSA (11*32+18) /* "" SGX EDECCSSA user leaf function */
#define X86_FEATURE_CALL_DEPTH (11*32+19) /* "" Call depth tracking for RSB stuffing */
+#define X86_FEATURE_USER_SHSTK (11*32+20) /* Shadow stack support for user mode applications */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
@@ -367,6 +368,7 @@
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index bbb03b25263e..79fbb7799526 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -93,6 +93,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define DISABLE_USER_SHSTK 0
+#else
+#define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -108,7 +114,7 @@
#define DISABLED_MASK9 (DISABLE_SGX)
#define DISABLED_MASK10 0
#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- DISABLE_CALL_DEPTH_TRACKING)
+ DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
#define DISABLED_MASK12 0
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index d95221117129..c3e4e5246df9 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -79,6 +79,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
{}
};


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-07 18:41:37

by Edgecombe, Rick P

Subject: Re: [PATCH v3 04/37] x86/cpufeatures: Enable CET CR4 bit for shadow stack

On Mon, 2022-11-07 at 19:00 +0100, Borislav Petkov wrote:
> On Fri, Nov 04, 2022 at 03:35:31PM -0700, Rick Edgecombe wrote:
> > static __always_inline void setup_cet(struct cpuinfo_x86 *c)
> > {
> > - u64 msr = CET_ENDBR_EN;
> > + bool kernel_ibt = HAS_KERNEL_IBT &&
> > cpu_feature_enabled(X86_FEATURE_IBT);
> > + bool user_shstk;
> > + u64 msr = 0;
> >
> > - if (!HAS_KERNEL_IBT ||
> > - !cpu_feature_enabled(X86_FEATURE_IBT))
> > + /*
> > + * Enable user shadow stack only if the Linux defined user
> > shadow stack
> > + * cap was not cleared by command line.
> > + */
> > + user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> > + IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK) &&
> > + !test_bit(X86_FEATURE_USER_SHSTK, (unsigned long
> > *)cpu_caps_cleared);
>
> Huh, why poke at cpu_caps_cleared?

It was to catch if the software user shadow stack feature gets disabled
at boot with the "clearcpuid" command. Is there a better way to do
this?

>
> Look below:
>
> > + if (!kernel_ibt && !user_shstk)
> > return;
> >
> > + if (user_shstk)
> > + set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
> > +
> > + if (kernel_ibt)
> > + msr = CET_ENDBR_EN;
> > +
> > wrmsrl(MSR_IA32_S_CET, msr);
> > cr4_set_bits(X86_CR4_CET);
> >
> > - if (!ibt_selftest()) {
> > + if (kernel_ibt && !ibt_selftest()) {
> > pr_err("IBT selftest: Failed!\n");
> > setup_clear_cpu_cap(X86_FEATURE_IBT);
> > return;
> > }
> > }
> > +#else /* CONFIG_X86_CET */
> > +static inline void setup_cet(struct cpuinfo_x86 *c) {}
> > +#endif
> >
> > __noendbr void cet_disable(void)
> > {
> > - if (cpu_feature_enabled(X86_FEATURE_IBT))
> > - wrmsrl(MSR_IA32_S_CET, 0);
> > + if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
> > + cpu_feature_enabled(X86_FEATURE_SHSTK)))
> > + return;
> > +
> > + wrmsrl(MSR_IA32_S_CET, 0);
> > + wrmsrl(MSR_IA32_U_CET, 0);
>
> Here you need to do
>
> setup_clear_cpu_cap(X86_FEATURE_IBT);
> setup_clear_cpu_cap(X86_FEATURE_SHSTK);

This only gets called by kexec way after boot, as kexec is prepping to
transition to the new kernel. Do we want to be clearing feature bits at
that time?

>
> and then the cpu_feature_enabled() test above alone should suffice.
>
> But, before you do that, I'd like to ask you to update your patchset
> on top of tip/master because the conflicts are getting non-trivial.
> This one doesn't even want to apply with a large fuzz:
>
> $ patch -p1 --dry-run -F20 -i /tmp/new
> checking file arch/x86/kernel/cpu/common.c
> Hunk #1 FAILED at 596.
> 1 out of 1 hunk FAILED
>
> Thx.

Sure, sorry about that. I'll target tip for the next version.

>

2022-11-07 18:42:48

by Borislav Petkov

Subject: Re: [PATCH v3 04/37] x86/cpufeatures: Enable CET CR4 bit for shadow stack

On Fri, Nov 04, 2022 at 03:35:31PM -0700, Rick Edgecombe wrote:
> static __always_inline void setup_cet(struct cpuinfo_x86 *c)
> {
> - u64 msr = CET_ENDBR_EN;
> + bool kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
> + bool user_shstk;
> + u64 msr = 0;
>
> - if (!HAS_KERNEL_IBT ||
> - !cpu_feature_enabled(X86_FEATURE_IBT))
> + /*
> + * Enable user shadow stack only if the Linux defined user shadow stack
> + * cap was not cleared by command line.
> + */
> + user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> + IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK) &&
> + !test_bit(X86_FEATURE_USER_SHSTK, (unsigned long *)cpu_caps_cleared);

Huh, why poke at cpu_caps_cleared?

Look below:

> + if (!kernel_ibt && !user_shstk)
> return;
>
> + if (user_shstk)
> + set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
> +
> + if (kernel_ibt)
> + msr = CET_ENDBR_EN;
> +
> wrmsrl(MSR_IA32_S_CET, msr);
> cr4_set_bits(X86_CR4_CET);
>
> - if (!ibt_selftest()) {
> + if (kernel_ibt && !ibt_selftest()) {
> pr_err("IBT selftest: Failed!\n");
> setup_clear_cpu_cap(X86_FEATURE_IBT);
> return;
> }
> }
> +#else /* CONFIG_X86_CET */
> +static inline void setup_cet(struct cpuinfo_x86 *c) {}
> +#endif
>
> __noendbr void cet_disable(void)
> {
> - if (cpu_feature_enabled(X86_FEATURE_IBT))
> - wrmsrl(MSR_IA32_S_CET, 0);
> + if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
> + cpu_feature_enabled(X86_FEATURE_SHSTK)))
> + return;
> +
> + wrmsrl(MSR_IA32_S_CET, 0);
> + wrmsrl(MSR_IA32_U_CET, 0);

Here you need to do

setup_clear_cpu_cap(X86_FEATURE_IBT);
setup_clear_cpu_cap(X86_FEATURE_SHSTK);

and then the cpu_feature_enabled() test above alone should suffice.

But, before you do that, I'd like to ask you to update your patchset
on top of tip/master because the conflicts are getting non-trivial. This
one doesn't even want to apply with a large fuzz:

$ patch -p1 --dry-run -F20 -i /tmp/new
checking file arch/x86/kernel/cpu/common.c
Hunk #1 FAILED at 596.
1 out of 1 hunk FAILED

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-07 19:57:27

by Borislav Petkov

Subject: Re: [PATCH v3 04/37] x86/cpufeatures: Enable CET CR4 bit for shadow stack

On Mon, Nov 07, 2022 at 06:19:48PM +0000, Edgecombe, Rick P wrote:
> It was to catch if the software user shadow stack feature gets disabled
> at boot with the "clearcpuid" command.

I don't understand. clearcpuid does setup_clear_cpu_cap() too. It would
eventually clear the bit in boot_cpu_data.x86_capability's AFAICT.

cpu_feature_enabled() looks at boot_cpu_data too.

So what's the problem?

Oh, and also, you've added that clearcpuid thing to the help docs.
Please remove it. clearcpuid= taints the kernel and we've left it in
because some of your colleagues really wanted it for testing or whatnot.
But it is crap and it was on its way out at some point so we better not
proliferate its use any more.

> Is there a better way to do this?

Yeah, cpu_feature_enabled() should be enough and if it isn't, then we
need to fix it to be.

Which reminds me, I'd need to take Maxim's patch too:

https://lore.kernel.org/r/[email protected]

as it is a simplification.

> > Here you need to do
> >
> > setup_clear_cpu_cap(X86_FEATURE_IBT);
> > setup_clear_cpu_cap(X86_FEATURE_SHSTK);
>
> This only gets called by kexec way after boot, as kexec is prepping to
> transition to the new kernel. Do we want to be clearing feature bits at
> that time?

Hmm, I was under the impression you'll have the usual chicken bit
"noshstk" which gets added with every big feature. So it'll call that
thing here.

> Sure, sorry about that. I'll target tip for the next version.

Thanks!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-07 20:05:55

by Borislav Petkov

Subject: Re: [PATCH v3 04/37] x86/cpufeatures: Enable CET CR4 bit for shadow stack

On Mon, Nov 07, 2022 at 07:19:21PM +0000, Edgecombe, Rick P wrote:
> There was at one point, but there was a suggestion to remove in favor
> of clearcpuid. I can add it back.

Sounds like I need to school that "suggestor" about clearcpuid. :)

For example, when doing this:

1625c833db93 ("x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=")

I even managed to prevent the kernel from booting. So clearcpuid= is
definitely not for general consumption.

So as said, please remove it from your whole patchset's text.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-11-07 20:06:36

by Edgecombe, Rick P

Subject: Re: [PATCH v3 04/37] x86/cpufeatures: Enable CET CR4 bit for shadow stack

On Mon, 2022-11-07 at 20:30 +0100, Borislav Petkov wrote:
> On Mon, Nov 07, 2022 at 07:19:21PM +0000, Edgecombe, Rick P wrote:
> > There was at one point, but there was a suggestion to remove in
> > favor
> > of clearcpuid. I can add it back.
>
> Sounds like I need to school that "suggestor" about clearcpuid. :)
>
> For example, when doing this:
>
> 1625c833db93 ("x86/cpu: Allow feature bit names from /proc/cpuinfo
> in clearcpuid=")
>
> I even managed to prevent the kernel from booting. So clearcpuid= is
> definitely not for general consumption.
>
> So as said, please remove it from your whole patchset's text.

Will do, thanks.

2022-11-07 20:29:12

by Edgecombe, Rick P

Subject: Re: [PATCH v3 04/37] x86/cpufeatures: Enable CET CR4 bit for shadow stack

On Mon, 2022-11-07 at 19:37 +0100, Borislav Petkov wrote:
> On Mon, Nov 07, 2022 at 06:19:48PM +0000, Edgecombe, Rick P wrote:
> > It was to catch if the software user shadow stack feature gets
> > disabled
> > at boot with the "clearcpuid" command.
>
> I don't understand. clearcpuid does setup_clear_cpu_cap() too. It
> would
> eventually clear the bit in boot_cpu_data.x86_capability's AFAICT.
>
> cpu_feature_enabled() looks at boot_cpu_data too.
>
> So what's the problem?

You are right, there actually is no problem. I thought the
apply_forced_caps() was happening too late, but it is not. So this
check you highlighted would not be needed if we kept the clearcpuid
method.

Thanks.

>
> Oh, and also, you've added that clearcpuid thing to the help docs.
> Please remove it. clearcpuid= taints the kernel and we've left it in
> because some of your colleagues really wanted it for testing or
> whatnot.
> But it is crap and it was on its way out at some point so we better
> not
> proliferate its use any more.
>
> > Is there a better way to do this?
>
> Yeah, cpu_feature_enabled() should be enough and if it isn't, then we
> need to fix it to be.
>
> Which reminds me, I'd need to take Maxim's patch too:
>
> https://lore.kernel.org/r/[email protected]
>
> as it is a simplification.
>
> > > Here you need to do
> > >
> > > setup_clear_cpu_cap(X86_FEATURE_IBT);
> > > setup_clear_cpu_cap(X86_FEATURE_SHSTK);
> >
> > This only gets called by kexec way after boot, as kexec is prepping
> > to
> > transition to the new kernel. Do we want to be clearing feature
> > bits at
> > that time?
>
> Hmm, I was under the impression you'll have the usual chicken bit
> "noshstk" which gets added with every big feature. So it'll call that
> thing here.

There was at one point, but there was a suggestion to remove in favor
of clearcpuid. I can add it back.

2022-11-07 21:58:15

by Edgecombe, Rick P

Subject: Re: [RFC 37/37] fs/binfmt_elf: Block old shstk elf bit

On Mon, 2022-11-07 at 13:21 -0800, H.J. Lu wrote:
> > > Some applications and libraries are compiled with -fcf-
> > > protection,
> > > but
> > > they manipulate the stack in such a way that they aren't
> > > compatible
> > > with the shadow stack. However, if the build/test setup doesn't
> > > support
> > > shadow stack, it is impossible to validate.
> > >
> >
> > When we have everything in place, the problems would be much more
> > obvious when distros started turning it on. But we can't turn it on
> > as
>
> Not necessarily. The problem will show up only in a CET enabled
> environment since build/test setup may not be on a CET capable
> hardware.

Well, I'm not sure of the details of distro testing, but there are
plenty of TGL and later systems out there today. With kernel support,
I'm thinking these types of problems couldn't lurk for years like they
have.

>
> > planned without breaking things for existing binaries. We can have
> > both
> > by:
> > 1. Choosing a new bit, adding it to the tools, and never supporting
> > the
> > old bit in glibc.
> > 2. Providing the option to have the kernel block the old bit, so
> > upgraded users can decide what experience they would like. Then
> > distros
> > can find the problems and adjust their packages. I'm starting to
> > think
> > a default off sysctl toggle might be better than a Kconfig.
> > 3. Any other ideas?
>
> Don't enable CET in glibc until we can validate CET functionality.

Can you elaborate on what you mean by this? Not upstream glibc CET
support? Or have users not enable it? If the latter, how would they
know about all these problems?

And what is wrong with the cleanest option, number 1? The ABI document
can be updated.

2022-11-07 22:04:42

by H.J. Lu

Subject: Re: [RFC 37/37] fs/binfmt_elf: Block old shstk elf bit

On Mon, Nov 7, 2022 at 1:34 PM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Mon, 2022-11-07 at 13:21 -0800, H.J. Lu wrote:
> > > > Some applications and libraries are compiled with -fcf-
> > > > protection,
> > > > but
> > > > they manipulate the stack in such a way that they aren't
> > > > compatible
> > > > with the shadow stack. However, if the build/test setup doesn't
> > > > support
> > > > shadow stack, it is impossible to validate.
> > > >
> > >
> > > When we have everything in place, the problems would be much more
> > > obvious when distros started turning it on. But we can't turn it on
> > > as
> >
> > Not necessarily. The problem will show up only in a CET enabled
> > environment since build/test setup may not be on a CET capable
> > hardware.
>
> Well, I'm not sure of the details of distro testing, but there are
> plenty of TGL and later systems out there today. With kernel support,
> I'm thinking these types of problems couldn't lurk for years like they
> have.

If this is the case, we would have nothing to worry about since the CET
enabled applications won't pass validation if they aren't CET compatible.

> >
> > > planned without breaking things for existing binaries. We can have
> > > both
> > > by:
> > > 1. Choosing a new bit, adding it to the tools, and never supporting
> > > the
> > > old bit in glibc.
> > > 2. Providing the option to have the kernel block the old bit, so
> > > upgraded users can decide what experience they would like. Then
> > > distros
> > > can find the problems and adjust their packages. I'm starting to
> > > think
> > > a default off sysctl toggle might be better than a Kconfig.
> > > 3. Any other ideas?
> >
> > Don't enable CET in glibc until we can validate CET functionality.
>
> Can you elaborate on what you mean by this? Not upstream glibc CET
> support? Or have users not enable it? If the latter, how would they
> know about all these problems.

The current glibc doesn't support CET. To enable CET in an application,
one should validate it together with the CET enabled glibc under the CET
enabled kernel on a CET capable machine.

>
> And what is wrong with the cleanest option, number 1? The ABI document
> can be updated.

It doesn't help resolve any issues.

--
H.J.

2022-11-07 23:06:22

by Edgecombe, Rick P

Subject: Re: [RFC 37/37] fs/binfmt_elf: Block old shstk elf bit

On Mon, 2022-11-07 at 13:47 -0800, H.J. Lu wrote:
> On Mon, Nov 7, 2022 at 1:34 PM Edgecombe, Rick P
> <[email protected]> wrote:
> >
> > On Mon, 2022-11-07 at 13:21 -0800, H.J. Lu wrote:
> > > > > Some applications and libraries are compiled with -fcf-
> > > > > protection,
> > > > > but
> > > > > they manipulate the stack in such a way that they aren't
> > > > > compatible
> > > > > with the shadow stack. However, if the build/test setup
> > > > > doesn't
> > > > > support
> > > > > shadow stack, it is impossible to validate.
> > > > >
> > > >
> > > > When we have everything in place, the problems would be much
> > > > more
> > > > obvious when distros started turning it on. But we can't turn
> > > > it on
> > > > as
> > >
> > > Not necessarily. The problem will show up only in a CET enabled
> > > environment since build/test setup may not be on a CET capable
> > > hardware.
> >
> > Well, I'm not sure of the details of distro testing, but there are
> > plenty of TGL and later systems out there today. With kernel
> > support,
> > I'm thinking these types of problems couldn't lurk for years like
> > they
> > have.
>
> If this is the case, we would have nothing to worry about since the
> CET
> enabled applications won't pass validation if they aren't CET
> compatible.

Hmm, surely you haven't already forgotten that the problem binaries
are already shipped...

>
> > >
> > > > planned without breaking things for existing binaries. We can
> > > > have
> > > > both
> > > > by:
> > > > 1. Choosing a new bit, adding it to the tools, and never
> > > > supporting
> > > > the
> > > > old bit in glibc.
> > > > 2. Providing the option to have the kernel block the old bit,
> > > > so
> > > > upgraded users can decide what experience they would like. Then
> > > > distros
> > > > can find the problems and adjust their packages. I'm starting
> > > > to
> > > > think
> > > > a default off sysctl toggle might be better than a Kconfig.
> > > > 3. Any other ideas?
> > >
> > > Don't enable CET in glibc until we can validate CET
> > > functionality.
> >
> > Can you elaborate on what you mean by this? Not upstream glibc CET
> > support? Or have users not enable it? If the latter, how would they
> > know about all these problems.
>
> The current glibc doesn't support CET. To enable CET in an
> application,
> one should validate it together with the CET enabled glibc under the
> CET
> enabled kernel on a CET capable machine.

Agreed that this is how it should have gone.

>
> >
> > And what is wrong with the cleanest option, number 1? The ABI
> > document
> > can be updated.
>
> It doesn't help resolve any issues.

Please read the coverletter if you are unsure of what issues this is
trying to address. I should have put more in the commit log.



2022-11-08 00:23:39

by H.J. Lu

Subject: Re: [RFC 37/37] fs/binfmt_elf: Block old shstk elf bit

On Mon, Nov 7, 2022 at 2:47 PM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Mon, 2022-11-07 at 13:47 -0800, H.J. Lu wrote:
> > On Mon, Nov 7, 2022 at 1:34 PM Edgecombe, Rick P
> > <[email protected]> wrote:
> > >
> > > On Mon, 2022-11-07 at 13:21 -0800, H.J. Lu wrote:
> > > > > > Some applications and libraries are compiled with -fcf-
> > > > > > protection,
> > > > > > but
> > > > > > they manipulate the stack in such a way that they aren't
> > > > > > compatible
> > > > > > with the shadow stack. However, if the build/test setup
> > > > > > doesn't
> > > > > > support
> > > > > > shadow stack, it is impossible to validate.
> > > > > >
> > > > >
> > > > > When we have everything in place, the problems would be much
> > > > > more
> > > > > obvious when distros started turning it on. But we can't turn
> > > > > it on
> > > > > as
> > > >
> > > > Not necessarily. The problem will show up only in a CET enabled
> > > > environment since build/test setup may not be on a CET capable
> > > > hardware.
> > >
> > > Well, I'm not sure of the details of distro testing, but there are
> > > plenty of TGL and later systems out there today. With kernel
> > > support,
> > > I'm thinking these types of problems couldn't lurk for years like
> > > they
> > > have.
> >
> > If this is the case, we would have nothing to worry about since the
> > CET
> > enabled applications won't pass validation if they aren't CET
> > compatible.
>
> Hmm, surely you haven't already forgotten that the problem binaries
> are already shipped...

It should be OK since glibc doesn't support CET.

> >
> > > >
> > > > > planned without breaking things for existing binaries. We can
> > > > > have
> > > > > both
> > > > > by:
> > > > > 1. Choosing a new bit, adding it to the tools, and never
> > > > > supporting
> > > > > the
> > > > > old bit in glibc.
> > > > > 2. Providing the option to have the kernel block the old bit,
> > > > > so
> > > > > upgraded users can decide what experience they would like. Then
> > > > > distros
> > > > > can find the problems and adjust their packages. I'm starting
> > > > > to
> > > > > think
> > > > > a default off sysctl toggle might be better than a Kconfig.
> > > > > 3. Any other ideas?
> > > >
> > > > Don't enable CET in glibc until we can validate CET
> > > > functionality.
> > >
> > > Can you elaborate on what you mean by this? Not upstream glibc CET
> > > support? Or have users not enable it? If the latter, how would they
> > > know about all these problems.
> >
> > The current glibc doesn't support CET. To enable CET in an
> > application,
> > one should validate it together with the CET enabled glibc under the
> > CET
> > enabled kernel on a CET capable machine.
>
> Agreed that this is how it should have gone.
>
> >
> > >
> > > And what is wrong with the cleanest option, number 1? The ABI
> > > document
> > > can be updated.
> >
> > It doesn't help resolve any issues.
>
> Please read the coverletter if you are unsure of what issues this is
> trying to address. I should have put more in the commit log.
>
>
>


--
H.J.

2022-11-15 11:33:42

by Peter Zijlstra

Subject: Re: [PATCH v3 13/37] mm: Move VM_UFFD_MINOR_BIT from 37 to 38

On Fri, Nov 04, 2022 at 03:35:40PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
> VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.

Why though?!? Changelog utterly fails to provide rationale.

2022-11-15 17:35:02

by Edgecombe, Rick P

Subject: Re: [PATCH v3 13/37] mm: Move VM_UFFD_MINOR_BIT from 37 to 38

On Tue, 2022-11-15 at 12:20 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:40PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
> > VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to
> > 38.
>
> Why though?!? Changelog utterly fails to provide rationale.

I can beef this up. It is just a cleanup type thing to make things less
scattered around.

Thanks.

2022-11-15 19:44:42

by Peter Zijlstra

Subject: Re: [PATCH v3 36/37] x86/cet/shstk: Add ARCH_CET_UNLOCK

On Fri, Nov 04, 2022 at 03:36:03PM -0700, Rick Edgecombe wrote:

> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 71620b77a654..bed7032d35f2 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -450,9 +450,14 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
> return 0;
> }
>
> - /* Don't allow via ptrace */
> - if (task != current)
> + /* Only allow via ptrace */

Both the old and new comment are equally useless for they tell us
nothing the code doesn't already.

Why isn't ptrace allowed to call these, and doesn't that rather leave
CRIU in a bind, it can unlock but not re-lock the features, leaving
restored processes more vulnerable than they were.

> + if (task != current) {
> + if (option == ARCH_CET_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {

Why make this conditional on CRIU at all?

> + task->thread.features_locked &= ~features;
> + return 0;
> + }
> return -EINVAL;
> + }
>
> /* Do not allow to change locked features */
> if (features & task->thread.features_locked)
> --
> 2.17.1
>

2022-11-15 19:47:36

by Peter Zijlstra

Subject: Re: [PATCH v3 15/37] x86/mm: Check Shadow Stack page fault errors

On Fri, Nov 04, 2022 at 03:35:42PM -0700, Rick Edgecombe wrote:
> @@ -1331,6 +1345,18 @@ void do_user_addr_fault(struct pt_regs *regs,
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>
> + /*
> + * To service shadow stack read faults, unlike normal read faults, the
> + * fault handler needs to create a type of memory that will also be
> + * writable (with instructions that generate shadow stack writes).
> + * In the case of COW memory, the COW needs to take place even with
> + * a shadow stack read. Otherwise the shared page will be left (shadow
> + * stack) writable in userspace. So to trigger the appropriate behavior
> + * by setting FAULT_FLAG_WRITE for shadow stack accesses, even if the
> + * access was a shadow stack read.
> + */

Clear as mud... So SS pages are 'Write=0,Dirty=1', which, per
construction, lack a RW bit. And these pages are writable (WRUSS).

pte_wrprotect() seems to do: _PAGE_DIRTY->_PAGE_COW (which is really
weird in this situation), resulting in: 'Write=0,Dirty=0,Cow=1'.

That's regular RO memory and won't raise read-faults.

But I'm thinking RET will trip #PF here when it tries to read the SS
because the SSP is not a proper shadow stack page?

And in that case you want to tickle pte_mkwrite() to undo the
pte_wrprotect() above?

So while the #PF is a 'read' fault due to RET not actually writing to
the shadow stack, you want to force a write fault so it will re-instate
the SS page.

Did I get that right?
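
The bit juggling described above can be written down as a small userspace model. This is only a sketch of the encoding as stated in this mail (shadow stack = Write=0,Dirty=1; write-protect turns Dirty into Cow). The Write and Dirty positions are the x86 hardware ones; the Cow bit position and all model_* names are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_RW    (UINT64_C(1) << 1)  /* x86 hardware Write bit */
#define MODEL_PAGE_DIRTY (UINT64_C(1) << 6)  /* x86 hardware Dirty bit */
#define MODEL_PAGE_COW   (UINT64_C(1) << 58) /* software bit; position assumed */

/* A shadow stack PTE is encoded as Write=0, Dirty=1. */
bool model_pte_shstk(uint64_t pte)
{
	return (pte & (MODEL_PAGE_RW | MODEL_PAGE_DIRTY)) == MODEL_PAGE_DIRTY;
}

/* Write-protect: drop Write and move Dirty into the software Cow bit. */
uint64_t model_pte_wrprotect(uint64_t pte)
{
	pte &= ~MODEL_PAGE_RW;
	if (pte & MODEL_PAGE_DIRTY) {
		pte &= ~MODEL_PAGE_DIRTY;
		pte |= MODEL_PAGE_COW;
	}
	return pte;
}

/* Hypothetical inverse used when the fault re-instates the shadow stack page. */
uint64_t model_pte_mkwrite_shstk(uint64_t pte)
{
	pte &= ~(MODEL_PAGE_RW | MODEL_PAGE_COW);
	return pte | MODEL_PAGE_DIRTY;
}
```

After model_pte_wrprotect() the entry is plain read-only memory (Write=0,Dirty=0,Cow=1), so a shadow stack access has to fault and be handled as a write to get the Write=0,Dirty=1 encoding back.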

> + if (error_code & X86_PF_SHSTK)
> + flags |= FAULT_FLAG_WRITE;
> if (error_code & X86_PF_WRITE)
> flags |= FAULT_FLAG_WRITE;
> if (error_code & X86_PF_INSTR)
> --
> 2.17.1
>

2022-11-15 20:16:25

by Peter Zijlstra

Subject: Re: [PATCH v3 25/37] x86/shstk: Add user-mode shadow stack support

On Fri, Nov 04, 2022 at 03:35:52PM -0700, Rick Edgecombe wrote:

> +static int shstk_setup(void)
> +{
> + struct thread_shstk *shstk = &current->thread.shstk;
> + unsigned long addr, size;
> +
> + /* Already enabled */
> + if (features_enabled(CET_SHSTK))
> + return 0;
> +
> + /* Also not supported for 32 bit and x32 */
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall())
> + return -EOPNOTSUPP;
> +
> + size = adjust_shstk_size(0);
> + addr = alloc_shstk(size);
> + if (IS_ERR_VALUE(addr))
> + return PTR_ERR((void *)addr);
> +
> + fpregs_lock_and_load();
> + wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> + wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);

This..

> + fpregs_unlock();
> +
> + shstk->base = addr;
> + shstk->size = size;
> + features_set(CET_SHSTK);
> +
> + return 0;
> +}

> +static int shstk_disable(void)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return -EOPNOTSUPP;
> +
> + /* Already disabled? */
> + if (!features_enabled(CET_SHSTK))
> + return 0;
> +
> + fpregs_lock_and_load();
> + /* Disable WRSS too when disabling shadow stack */
> + set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);

And this... aren't very consistent in approach. Given there is no U_IBT
yet, why complicate matters like this?

> + wrmsrl(MSR_IA32_PL3_SSP, 0);
> + fpregs_unlock();
> +
> + shstk_free(current);
> + features_clr(CET_SHSTK);
> +
> + return 0;
> +}
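
To make the inconsistency concrete, here is a userspace sketch of the two write styles against a single simulated register. It is not kernel code: the real MSR_IA32_U_CET is only reachable from ring 0, and the model_* helpers are invented here. The bit positions follow the architectural layout of IA32_U_CET (bit 0 shadow stack enable, bit 1 WRSS enable, bit 2 ENDBR enable).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_CET_SHSTK_EN (UINT64_C(1) << 0)
#define MODEL_CET_WRSS_EN  (UINT64_C(1) << 1)
#define MODEL_CET_ENDBR_EN (UINT64_C(1) << 2)

/* One simulated MSR standing in for MSR_IA32_U_CET. */
static uint64_t model_u_cet;

uint64_t model_rdmsrl(void)
{
	return model_u_cet;
}

/* Full overwrite, in the style of wrmsrl(): every other bit is replaced. */
void model_wrmsrl(uint64_t val)
{
	model_u_cet = val;
}

/* Read-modify-write, same (set, clear) argument order as the quoted call. */
void model_set_clr_bits(uint64_t set, uint64_t clear)
{
	uint64_t old = model_rdmsrl();
	uint64_t val = (old & ~clear) | set;

	if (val != old)
		model_wrmsrl(val);
}
```

The two styles only differ in outcome when some other bit (for example the ENDBR enable) is already set in the register. As long as shadow stack is the sole user of the MSR, a plain overwrite in both paths gives the same result.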

2022-11-15 20:22:25

by Edgecombe, Rick P

Subject: Re: [PATCH v3 15/37] x86/mm: Check Shadow Stack page fault errors

On Tue, 2022-11-15 at 12:47 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:42PM -0700, Rick Edgecombe wrote:
> > @@ -1331,6 +1345,18 @@ void do_user_addr_fault(struct pt_regs
> > *regs,
> >
> > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> >
> > + /*
> > + * To service shadow stack read faults, unlike normal read
> > faults, the
> > + * fault handler needs to create a type of memory that will
> > also be
> > + * writable (with instructions that generate shadow stack
> > writes).
> > + * In the case of COW memory, the COW needs to take place
> > even with
> > + * a shadow stack read. Otherwise the shared page will be
> > left (shadow
> > + * stack) writable in userspace. So to trigger the
> > appropriate behavior
> > + * by setting FAULT_FLAG_WRITE for shadow stack accesses,
> > even if the
> > + * access was a shadow stack read.
> > + */
>
> Clear as mud... So SS pages are 'Write=0,Dirty=1', which, per
> construction, lack a RW bit. And these pages are writable (WRUSS).
>
> pte_wrprotect() seems to do: _PAGE_DIRTY->_PAGE_COW (which is really
> weird in this situation), resulting in: 'Write=0,Dirty=0,Cow=1'.
>
> That's regular RO memory and won't raise read-faults.
>
> But I'm thinking RET will trip #PF here when it tries to read the SS
> because the SSP is not a proper shadow stack page?
>
> And in that case you want to tickle pte_mkwrite() to undo the
> pte_wrprotect() above?
>
> So while the #PF is a 'read' fault due to RET not actually writing to
> the shadow stack, you want to force a write fault so it will re-
> instate
> the SS page.
>
> Did I get that right?

That's right. I think the assumption that needs to be broken in the
reader's head is that you can satisfy a read fault with a read-only PTE.
This is kind of baked in all over the place with the zero-pfn, COW,
etc. Maybe I should try to start with that.
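
As a rough sketch of what the hunk does, the flag derivation can be modelled in userspace like this. The X86_PF_* bit positions are the architectural page fault error code bits; the MODEL_FAULT_FLAG_* values are arbitrary placeholders and do not claim to match the kernel's definitions.

```c
#include <assert.h>

#define MODEL_X86_PF_WRITE (1u << 1) /* access was a write */
#define MODEL_X86_PF_INSTR (1u << 4) /* access was an instruction fetch */
#define MODEL_X86_PF_SHSTK (1u << 6) /* access was a shadow stack access */

/* Placeholder values for the generic fault flags (assumed for this model). */
#define MODEL_FAULT_FLAG_WRITE       (1u << 0)
#define MODEL_FAULT_FLAG_INSTRUCTION (1u << 1)

/* Translate a hardware error code into generic fault flags. */
unsigned int model_fault_flags(unsigned int error_code)
{
	unsigned int flags = 0;

	/* Any shadow stack access, read or write, is handled as a write. */
	if (error_code & MODEL_X86_PF_SHSTK)
		flags |= MODEL_FAULT_FLAG_WRITE;
	if (error_code & MODEL_X86_PF_WRITE)
		flags |= MODEL_FAULT_FLAG_WRITE;
	if (error_code & MODEL_X86_PF_INSTR)
		flags |= MODEL_FAULT_FLAG_INSTRUCTION;
	return flags;
}
```

The point is that a shadow stack read (SHSTK set, WRITE clear) still produces the write flag, so the fault handler takes the COW path instead of installing a read-only PTE.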

2022-11-15 20:22:25

by Peter Zijlstra

Subject: Re: [PATCH v3 18/37] mm: Add guard pages around a shadow stack.

On Fri, Nov 04, 2022 at 03:35:45PM -0700, Rick Edgecombe wrote:

> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index c90c20904a60..66da1f3298b0 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -248,3 +248,26 @@ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
> return false;
> return true;
> }
> +
> +unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags & VM_GROWSDOWN)
> + return stack_guard_gap;
> +
> + /*
> + * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).

Can we perhaps write this like: INCSSP[QD]? The () notation makes it
look like a function.

> + * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
> + * (~1KB for INCSSPD) and touches the first and the last element
> + * in the range, which triggers a page fault if the range is not
> + * in a shadow stack. Because of this, creating 4-KB guard pages
> + * around a shadow stack prevents these instructions from going
> + * beyond.
> + *
> + * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
> + * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
> + */
> + if (vma->vm_flags & VM_SHADOW_STACK)
> + return PAGE_SIZE;
> +
> + return 0;
> +}
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5d9536fa860a..0a3f7e2b32df 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2832,15 +2832,16 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
> return mtree_load(&mm->mm_mt, addr);
> }
>
> +unsigned long stack_guard_start_gap(struct vm_area_struct *vma);
> +
> static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
> {
> + unsigned long gap = stack_guard_start_gap(vma);
> unsigned long vm_start = vma->vm_start;
>
> - if (vma->vm_flags & VM_GROWSDOWN) {
> - vm_start -= stack_guard_gap;
> - if (vm_start > vma->vm_start)
> - vm_start = 0;
> - }
> + vm_start -= gap;
> + if (vm_start > vma->vm_start)
> + vm_start = 0;
> return vm_start;
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2def55555e05..f67606fbc464 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -281,6 +281,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
> return origbrk;
> }
>
> +unsigned long __weak stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags & VM_GROWSDOWN)
> + return stack_guard_gap;
> + return 0;
> +}

I'm thinking perhaps this wants to be an inline function?
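
For reference, the arithmetic behind the one-page gap and the underflow clamp in vm_start_gap() can be checked with a small userspace model. This is a sketch only: the flag values are placeholders, the 256-page default for the regular stack gap is what I believe the kernel uses, and the model_* names are invented.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE       4096UL
#define MODEL_VM_GROWSDOWN    0x1UL /* placeholder flag values */
#define MODEL_VM_SHADOW_STACK 0x2UL
#define MODEL_STACK_GUARD_GAP (256UL * MODEL_PAGE_SIZE) /* assumed default */

struct model_vma {
	unsigned long vm_start;
	unsigned long vm_flags;
};

/* Largest distance one INCSSPQ can move the SSP: 255 entries of 8 bytes. */
unsigned long model_max_incsspq_stride(void)
{
	return 255UL * 8UL;
}

/* Gap to keep free below the mapping, mirroring the x86 version in the patch. */
unsigned long model_stack_guard_start_gap(const struct model_vma *vma)
{
	if (vma->vm_flags & MODEL_VM_GROWSDOWN)
		return MODEL_STACK_GUARD_GAP;
	if (vma->vm_flags & MODEL_VM_SHADOW_STACK)
		return MODEL_PAGE_SIZE;
	return 0;
}

/* Start address including the gap, clamped to 0 if the subtraction wraps. */
unsigned long model_vm_start_gap(const struct model_vma *vma)
{
	unsigned long start = vma->vm_start - model_stack_guard_start_gap(vma);

	if (start > vma->vm_start)
		start = 0;
	return start;
}
```

Since 255 * 8 = 2040 bytes is less than one 4 KB page, a single INCSSPQ starting inside the shadow stack cannot step over the guard page without touching it.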

2022-11-15 20:38:48

by Edgecombe, Rick P

Subject: Re: [PATCH v3 36/37] x86/cet/shstk: Add ARCH_CET_UNLOCK

On Tue, 2022-11-15 at 15:47 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:36:03PM -0700, Rick Edgecombe wrote:
>
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 71620b77a654..bed7032d35f2 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -450,9 +450,14 @@ long cet_prctl(struct task_struct *task, int
> > option, unsigned long features)
> > return 0;
> > }
> >
> > - /* Don't allow via ptrace */
> > - if (task != current)
> > + /* Only allow via ptrace */
>
> Both the old and new comment are equally useless for they tell us
> nothing the code doesn't already.
>
> Why isn't ptrace allowed to call these, and doesn't that rather leave
> CRIU in a bind, it can unlock but not re-lock the features, leaving
> restored processes more vulnerable than they were.

As I understand it, CRIU does some poking at things via ptrace to set up
a "parasite", which is actual executable code injected into the process.
Then a lot of the restore code actually runs in the process getting
restored.

As for not allowing unlock to be used in the non-ptrace scenario, it's
to keep attackers from unlocking the shadow stack disable API and
calling it to disable shadow stack.

As for not allowing the others via ptrace, the call is in the tracing
process's context, so they would operate on its own registers instead
of the tracee's.
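
Put together, the intended policy can be sketched as a userspace model. This is my reading of the quoted hunk plus the lock handling from the earlier patch, not the actual kernel function: the option names, the struct, the is_current parameter (standing in for task == current) and the -EPERM return for locked features are assumptions of the sketch, and the CONFIG_CHECKPOINT_RESTORE condition is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum model_option { MODEL_ENABLE, MODEL_DISABLE, MODEL_LOCK, MODEL_UNLOCK };

struct model_task {
	unsigned long features;
	unsigned long features_locked;
};

/* is_current == false models a call made on a tracee via ptrace. */
long model_cet_prctl(struct model_task *task, bool is_current,
		     enum model_option option, unsigned long features)
{
	if (option == MODEL_LOCK) {
		task->features_locked |= features;
		return 0;
	}

	/* From a tracer, only unlocking is possible. */
	if (!is_current) {
		if (option == MODEL_UNLOCK) {
			task->features_locked &= ~features;
			return 0;
		}
		return -EINVAL;
	}

	/* Locked features cannot be changed by the task itself. */
	if (features & task->features_locked)
		return -EPERM;

	switch (option) {
	case MODEL_ENABLE:
		task->features |= features;
		return 0;
	case MODEL_DISABLE:
		task->features &= ~features;
		return 0;
	default:
		/* A task can never unlock its own features. */
		return -EINVAL;
	}
}
```

In this model a task that has locked shadow stack on cannot turn it off again by itself, while a tracer can unlock it so that code running inside the tracee (the CRIU parasite) can then change the state.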

>
> > + if (task != current) {
> > + if (option == ARCH_CET_UNLOCK &&
> > IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
>
> Why make this conditional on CRIU at all?

Kees asked for it. I think he was worried about attackers using it to
unlock and disable shadow stack, so he wanted to lock it down to the
maximum.

>
> > + task->thread.features_locked &= ~features;
> > + return 0;
> > + }
> > return -EINVAL;
> > + }
> >
> > /* Do not allow to change locked features */
> > if (features & task->thread.features_locked)
> > --
> > 2.17.1
> >

2022-11-15 20:39:26

by Peter Zijlstra

Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Fri, Nov 04, 2022 at 03:35:54PM -0700, Rick Edgecombe wrote:

> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> +{
> + asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> + _ASM_EXTABLE(1b, %l[fail])
> + :: [addr] "r" (addr), [val] "r" (val)
> + :: fail);
> + return 0;
> +fail:
> + return -EFAULT;
> +}
> +#endif /* CONFIG_X86_USER_SHADOW_STACK */

Why isn't this modelled after put_user() ?

Should you write a 64bit value even if the task receiving a signal is
32bit ?

2022-11-15 20:42:17

by Peter Zijlstra

Subject: Re: [PATCH v3 24/37] x86: Introduce userspace API for CET enabling

On Tue, Nov 15, 2022 at 01:26:23PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:51PM -0700, Rick Edgecombe wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Add three new arch_prctl() handles:
> >
> > - ARCH_CET_ENABLE/DISABLE enables or disables the specified
> > feature. Returns 0 on success or an error.
> >
> > - ARCH_CET_LOCK prevents future disabling or enabling of the
> > specified feature. Returns 0 on success or an error
> >
> > The features are handled per-thread and inherited over fork(2)/clone(2),
> > but reset on exec().
> >
> > This is preparation patch. It does not implement any features.
>
> Urgh... so much for sharing with other architectures I suppose :/
>
> The ARM64 BTI thing is very similar to IBT (except I think their
> approach to the legacy bitmap is much saner).
>
> Given that IBT isn't supported and needs the whole legacy bitmap mess,
> do we really want to call this CET ? Why not just make a Shadow Stack
> API and tackle IBT independently.

On that; ARM64 exposes PROT_BTI (to be used by mprotect()) and has an
ELF_ARM64_BTI note for the loader to bootstrap things.

We could co-opt that same interface and instead of flipping actual PTE
bits, have this thing manage the legacy bitmap -- basically have the
legacy bitmap function as an external PTE bit array (in inverse).

Basically, have every page mapped PROT_EXEC set the bit in the legacy
bitmap, while every page mapped PROT_EXEC|PROT_BTI will have its legacy
bitmap bit cleared to 0.

And as long as there is a single 0 in the bitmap, the feature is
enabled.

(obviously we can delay allocating the bitmap until the first PROT_EXEC
mapping that lacks PROT_BTI)
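
A toy userspace model of this idea, purely as a sketch: the bitmap size, the protection flag values and all model_* names are made up, and "a single 0 in the bitmap" is interpreted here as "at least one executable page whose legacy bit is clear", which is an assumption on top of the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NPAGES    64
#define MODEL_PROT_EXEC 0x04 /* placeholder values for the model */
#define MODEL_PROT_BTI  0x10

struct model_mm {
	uint8_t legacy[MODEL_NPAGES / 8]; /* 1 = executable page without tracking */
	uint8_t exec[MODEL_NPAGES / 8];   /* 1 = page is mapped executable */
};

/* Record a (re)mapping of one page with the given protection bits. */
void model_map_page(struct model_mm *mm, size_t page, int prot)
{
	assert(page < MODEL_NPAGES);
	uint8_t mask = (uint8_t)(1u << (page % 8));

	if (!(prot & MODEL_PROT_EXEC)) {
		mm->exec[page / 8] &= (uint8_t)~mask;
		mm->legacy[page / 8] &= (uint8_t)~mask;
		return;
	}
	mm->exec[page / 8] |= mask;
	if (prot & MODEL_PROT_BTI)
		mm->legacy[page / 8] &= (uint8_t)~mask;
	else
		mm->legacy[page / 8] |= mask;
}

/* Enabled while at least one executable page has its legacy bit clear. */
bool model_tracking_enabled(const struct model_mm *mm)
{
	for (size_t i = 0; i < MODEL_NPAGES / 8; i++)
		if (mm->exec[i] & ~mm->legacy[i])
			return true;
	return false;
}
```

A real implementation would allocate the bitmap lazily, as noted above, and would need to cover the whole address space rather than 64 pages. The model only shows how the bitmap can act as an inverted, external PTE bit.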



2022-11-15 20:43:08

by Peter Zijlstra

Subject: Re: [PATCH v3 24/37] x86: Introduce userspace API for CET enabling

On Fri, Nov 04, 2022 at 03:35:51PM -0700, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Add three new arch_prctl() handles:
>
> - ARCH_CET_ENABLE/DISABLE enables or disables the specified
> feature. Returns 0 on success or an error.
>
> - ARCH_CET_LOCK prevents future disabling or enabling of the
> specified feature. Returns 0 on success or an error
>
> The features are handled per-thread and inherited over fork(2)/clone(2),
> but reset on exec().
>
> This is preparation patch. It does not implement any features.

Urgh... so much for sharing with other architectures I suppose :/

The ARM64 BTI thing is very similar to IBT (except I think their
approach to the legacy bitmap is much saner).

Given that IBT isn't supported and needs the whole legacy bitmap mess,
do we really want to call this CET ? Why not just make a Shadow Stack
API and tackle IBT independently.

2022-11-15 20:43:31

by Peter Zijlstra

Subject: Re: [PATCH v3 24/37] x86: Introduce userspace API for CET enabling

On Fri, Nov 04, 2022 at 03:35:51PM -0700, Rick Edgecombe wrote:

> arch/x86/include/asm/cet.h | 21 +++++++++++++++
> arch/x86/kernel/shstk.c | 44 +++++++++++++++++++++++++++++++

You see what's going wrong there? Eradicate the CET...

2022-11-15 21:04:42

by Peter Zijlstra

Subject: Re: [PATCH v3 20/37] mm/mprotect: Exclude shadow stack from preserve_write

On Fri, Nov 04, 2022 at 03:35:47PM -0700, Rick Edgecombe wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 73b9b78f8cf4..7643a4db1b50 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1803,6 +1803,13 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> return 0;
>
> preserve_write = prot_numa && pmd_write(*pmd);
> +
> + /*
> + * Preserve only normal writable huge PMD, but not shadow
> + * stack (RW=0, Dirty=1).
> + */
> + if (vma->vm_flags & VM_SHADOW_STACK)
> + preserve_write = false;
> ret = 1;
>
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 668bfaa6ed2a..ea82ce5f38fe 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -115,6 +115,13 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> pte_t ptent;
> bool preserve_write = prot_numa && pte_write(oldpte);
>
> + /*
> + * Preserve only normal writable PTE, but not shadow
> + * stack (RW=0, Dirty=1).
> + */
> + if (vma->vm_flags & VM_SHADOW_STACK)
> + preserve_write = false;
> +
> /*
> * Avoid trapping faults against the zero or KSM
> * pages. See similar comment in change_huge_pmd.

These comments lack a why component; someone is going to wonder wtf this
code is doing in the near future -- that someone might be you.

2022-11-15 21:17:18

by Edgecombe, Rick P

Subject: Re: [PATCH v3 24/37] x86: Introduce userspace API for CET enabling

On Tue, 2022-11-15 at 15:25 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:51PM -0700, Rick Edgecombe wrote:
>
> > arch/x86/include/asm/cet.h | 21 +++++++++++++++
> > arch/x86/kernel/shstk.c | 44
> > +++++++++++++++++++++++++++++++
>
> You see what's going wrong there? Eradicate the CET...

Yep, will do. Thanks.

2022-11-15 21:20:32

by Peter Zijlstra

Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Fri, Nov 04, 2022 at 03:36:02PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> Some applications (like GDB and CRIU) would like to tweak CET state via
> ptrace. This allows for existing functionality to continue to work for
> seized CET applications. Provide an interface based on the xsave buffer
> format of CET, but filter unneeded states to make the kernel’s job
> easier.
>
> There is already ptrace functionality for accessing xstate, but this
> does not include supervisor xfeatures. So there is not a completely
> clear place for where to put the CET state. Adding it to the user
> xfeatures regset would complicate that code, as it currently shares
> logic with signals which should not have supervisor features.
>
> Don’t add a general supervisor xfeature regset like the user one,
> because it is better to maintain flexibility for other supervisor
> xfeatures to define their own interface. For example, an xfeature may
> decide not to expose all of its state to userspace. A lot of enum
> values remain to be used, so just put it in dedicated CET regset.
>
> The only downside to not having a generic supervisor xfeature regset,
> is that apps need to be enlightened of any new supervisor xfeature
> exposed this way (i.e. they can’t try to have generic save/restore
> logic). But maybe that is a good thing, because they have to think
> through each new xfeature instead of encountering issues when a new
> supervisor xfeature is added.

Per this argument this should not use the CET XSAVE format and CET name
at all, because that conflates the situation vs IBT. Enabling that might
not want to follow this precedent.

> By adding a CET regset, it also has the effect of including the CET state
> in a core dump, which could be useful for debugging.
>
> Inside the setter CET regset, filter out invalid state. Today this
> includes states disallowed by the HW and states involving Indirect Branch
> Tracking which the kernel does not currently support for userspace.
>
> So this leaves three pieces of data that can be set, shadow stack
> enablement, WRSS enablement and the shadow stack pointer. It is worth
> noting that this is separate from enabling shadow stack via the
> arch_prctl()s.

Does this validate the SSP, when set, points to an actual valid SS page?

2022-11-15 21:20:56

by Peter Zijlstra

Subject: Re: [PATCH v3 18/37] mm: Add guard pages around a shadow stack.

On Tue, Nov 15, 2022 at 08:40:19PM +0000, Edgecombe, Rick P wrote:
> > > +unsigned long __weak stack_guard_start_gap(struct vm_area_struct
> > > *vma)
> > > +{
> > > + if (vma->vm_flags & VM_GROWSDOWN)
> > > + return stack_guard_gap;
> > > + return 0;
> > > +}
> >
> > I'm thinking perhaps this wants to be an inline function?
>
> I don't think it can work with weak then.

That was kinda the point, __weak sucks and this is very small in any
case.

2022-11-15 21:22:18

by Peter Zijlstra

Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Fri, Nov 04, 2022 at 03:35:54PM -0700, Rick Edgecombe wrote:
> +static unsigned long get_user_shstk_addr(void)
> +{
> + unsigned long long ssp;
> +
> + fpregs_lock_and_load();
> +
> + rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> + fpregs_unlock();
> +
> + return ssp;
> +}

This doesn't return the shstk addr, unlike what the name suggests,
right?

2022-11-15 21:22:28

by Peter Zijlstra

Subject: Re: [PATCH v3 36/37] x86/cet/shstk: Add ARCH_CET_UNLOCK

On Tue, Nov 15, 2022 at 08:01:12PM +0000, Edgecombe, Rick P wrote:
> > > + if (task != current) {
> > > + if (option == ARCH_CET_UNLOCK &&
> > > IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
> >
> > Why make this conditional on CRIU at all?
>
> Kees asked for it, I think he was worried about attackers using it to
> unlock and disable shadow stack. So wanted to lock it down to the
> maximum.

Well, distros will all have this stuff enabled no? So not much
protection in practise.

2022-11-15 21:23:15

by Edgecombe, Rick P

Subject: Re: [PATCH v3 20/37] mm/mprotect: Exclude shadow stack from preserve_write

On Tue, 2022-11-15 at 13:05 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:47PM -0700, Rick Edgecombe wrote:
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 73b9b78f8cf4..7643a4db1b50 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1803,6 +1803,13 @@ int change_huge_pmd(struct mmu_gather *tlb,
> > struct vm_area_struct *vma,
> > return 0;
> >
> > preserve_write = prot_numa && pmd_write(*pmd);
> > +
> > + /*
> > + * Preserve only normal writable huge PMD, but not shadow
> > + * stack (RW=0, Dirty=1).
> > + */
> > + if (vma->vm_flags & VM_SHADOW_STACK)
> > + preserve_write = false;
> > ret = 1;
> >
> > #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 668bfaa6ed2a..ea82ce5f38fe 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -115,6 +115,13 @@ static unsigned long change_pte_range(struct
> > mmu_gather *tlb,
> > pte_t ptent;
> > bool preserve_write = prot_numa &&
> > pte_write(oldpte);
> >
> > + /*
> > + * Preserve only normal writable PTE, but not
> > shadow
> > + * stack (RW=0, Dirty=1).
> > + */
> > + if (vma->vm_flags & VM_SHADOW_STACK)
> > + preserve_write = false;
> > +
> > /*
> > * Avoid trapping faults against the zero or
> > KSM
> > * pages. See similar comment in
> > change_huge_pmd.
>
> These comments lack a why component; someone is going to wonder wtf
> this
> code is doing in the near future -- that someone might be you.

Good point, I'll expand it.

2022-11-15 21:23:56

by Edgecombe, Rick P

Subject: Re: [PATCH v3 24/37] x86: Introduce userspace API for CET enabling

On Tue, 2022-11-15 at 14:03 +0100, Peter Zijlstra wrote:
> On Tue, Nov 15, 2022 at 01:26:23PM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 04, 2022 at 03:35:51PM -0700, Rick Edgecombe wrote:
> > > From: "Kirill A. Shutemov" <[email protected]>
> > >
> > > Add three new arch_prctl() handles:
> > >
> > > - ARCH_CET_ENABLE/DISABLE enables or disables the specified
> > > feature. Returns 0 on success or an error.
> > >
> > > - ARCH_CET_LOCK prevents future disabling or enabling of the
> > > specified feature. Returns 0 on success or an error
> > >
> > > The features are handled per-thread and inherited over
> > > fork(2)/clone(2),
> > > but reset on exec().
> > >
> > > This is preparation patch. It does not implement any features.
> >
> > Urgh... so much for sharing with other architectures I suppose :/
> >
> > The ARM64 BTI thing is very similar to IBT (except I think their
> > approach to the legacy bitmap is much saner).
> >
> > Given that IBT isn't supported and needs the whole legacy bitmap
> > mess,
> > do we really want to call this CET ? Why not just make a Shadow
> > Stack
> > API and tackle IBT independently.
>
> On that; ARM64 exposes PROT_BTI (to be used by mprotect()) and have
> an
> ELF_ARM64_BTI note for the loader to bootstrap things.
>
> We could co-opt that same interface and instead of flipping actual
> PTE
> bits, have this thing manage the legacy bitmap -- basically have the
> legacy bitmap function as an external PTE bit array (in inverse).
>
> Basically, have every page mapped PROT_EXEC set the bit in the legacy
> bitmap while every page mapped PROT_EXEC|PROT_BTI will have the
> legacy
> bitmap bit to 0.
>
> And as long as there is a single 0 in the bitmap, the feature is
> enabled.
>
> (obviously we can delay allocating the bitmap until the first
> PROT_EXEC
> mapping that lacks PROT_BTI)

This is an interesting idea. I'll have to think a little more on it.

One non-impossible issue would be setting IBT in the MSR late. Each
thread would have to be interrupted and have it set, while no new
threads are created. Maybe this is easy and I just don't know how to do
it.

The other thing is there would be overhead compared to an IBT
implementation with a separate interface from BTI. Would have to look
at the tradeoffs.
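
To make the bitmap idea quoted above a bit more concrete, here is a rough
userspace model of it. This is only a sketch and not kernel code: struct
exec_model, the model_* helpers and the 256-page window are all made up for
illustration, and it assumes that "a single 0 in the bitmap" means an
executable page whose legacy bit is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a small window of pages, one bit per page. */
#define MODEL_PAGES 256
#define MODEL_WORDS (MODEL_PAGES / 64)

struct exec_model {
	uint64_t exec[MODEL_WORDS];   /* 1 = page is mapped PROT_EXEC */
	uint64_t legacy[MODEL_WORDS]; /* 1 = executable page lacks PROT_BTI */
};

/* Model mapping one page PROT_EXEC, with or without PROT_BTI. */
static void model_map_exec(struct exec_model *m, size_t page, bool bti)
{
	assert(page < MODEL_PAGES);

	size_t w = page / 64;
	uint64_t mask = UINT64_C(1) << (page % 64);

	m->exec[w] |= mask;
	if (bti)
		m->legacy[w] &= ~mask; /* PROT_EXEC|PROT_BTI: legacy bit is 0 */
	else
		m->legacy[w] |= mask;  /* PROT_EXEC only: legacy bit is 1 */
}

/* The feature counts as enabled while any executable page has a 0 bit. */
static bool model_ibt_enabled(const struct exec_model *m)
{
	for (size_t w = 0; w < MODEL_WORDS; w++)
		if (m->exec[w] & ~m->legacy[w])
			return true;
	return false;
}
```

In this model the enable state flips as soon as the first PROT_EXEC|PROT_BTI
page shows up, which is the "setting IBT in the MSR late" case mentioned
above.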

2022-11-15 21:25:09

by Peter Zijlstra

Subject: Re: [PATCH v3 15/37] x86/mm: Check Shadow Stack page fault errors

On Tue, Nov 15, 2022 at 08:03:06PM +0000, Edgecombe, Rick P wrote:

> That's right. I think the assumption that needs to be broken in the
> reader's head is that you can satisfy a read fault with a read-only PTE.
> This is kind of baked in all over the place with the zero-pfn, COW,
> etc. Maybe I should try to start with that.

Maybe something like:

CoW -- pte_wrprotect() -- changes a SS page 'Write=0,Dirty=1' to
'Write=0,Dirty=0,CoW=1' which is a 'regular' RO page. A SS read from RET
will #PF because it expects a SS page. Make sure to break the CoW so it
can be restored to an SS page, as such force the write path and tickle
pte_mkwrite().

2022-11-15 21:26:51

by Dave Hansen

Subject: Re: [PATCH v3 36/37] x86/cet/shstk: Add ARCH_CET_UNLOCK

On 11/15/22 12:57, Peter Zijlstra wrote:
> On Tue, Nov 15, 2022 at 08:01:12PM +0000, Edgecombe, Rick P wrote:
>>>> + if (task != current) {
>>>> + if (option == ARCH_CET_UNLOCK &&
>>>> IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
>>> Why make this conditional on CRIU at all?
>> Kees asked for it, I think he was worried about attackers using it to
>> unlock and disable shadow stack. So wanted to lock it down to the
>> maximum.
> Well, distros will all have this stuff enabled no? So not much
> protection in practise.

Yeah, that's true for the distros.

But, I would imagine that our more paranoid friends like the ChromeOS
folks might appreciate this.

2022-11-15 21:28:13

by Peter Zijlstra

Subject: Re: [PATCH v3 36/37] x86/cet/shstk: Add ARCH_CET_UNLOCK

On Tue, Nov 15, 2022 at 01:00:40PM -0800, Dave Hansen wrote:
> On 11/15/22 12:57, Peter Zijlstra wrote:
> > On Tue, Nov 15, 2022 at 08:01:12PM +0000, Edgecombe, Rick P wrote:
> >>>> + if (task != current) {
> >>>> + if (option == ARCH_CET_UNLOCK &&
> >>>> IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
> >>> Why make this conditional on CRIU at all?
> >> Kees asked for it, I think he was worried about attackers using it to
> >> unlock and disable shadow stack. So wanted to lock it down to the
> >> maximum.
> > Well, distros will all have this stuff enabled no? So not much
> > protection in practise.
>
> Yeah, that's true for the distros.
>
> But, I would imagine that our more paranoid friends like the ChromeOS
> folks might appreciate this.

ptrace can modify text, I'm not sure what if anything we're protecting
against.

2022-11-15 21:43:22

by Edgecombe, Rick P

Subject: Re: [PATCH v3 18/37] mm: Add guard pages around a shadow stack.

On Tue, 2022-11-15 at 13:04 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:45PM -0700, Rick Edgecombe wrote:
>
> > diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> > index c90c20904a60..66da1f3298b0 100644
> > --- a/arch/x86/mm/mmap.c
> > +++ b/arch/x86/mm/mmap.c
> > @@ -248,3 +248,26 @@ bool pfn_modify_allowed(unsigned long pfn,
> > pgprot_t prot)
> > return false;
> > return true;
> > }
> > +
> > +unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
> > +{
> > + if (vma->vm_flags & VM_GROWSDOWN)
> > + return stack_guard_gap;
> > +
> > + /*
> > + * Shadow stack pointer is moved by CALL, RET, and
> > INCSSP(Q/D).
>
> Can we perhaps write this like: INCSSP[QD] ? The () notation makes it
> look like a function.

Sure.

>
> > + * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
> > + * (~1KB for INCSSPD) and touches the first and the last
> > element
> > + * in the range, which triggers a page fault if the range is
> > not
> > + * in a shadow stack. Because of this, creating 4-KB guard
> > pages
> > + * around a shadow stack prevents these instructions from
> > going
> > + * beyond.
> > + *
> > + * Creation of VM_SHADOW_STACK is tightly controlled, so a
> > vma
> > + * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
> > + */
> > + if (vma->vm_flags & VM_SHADOW_STACK)
> > + return PAGE_SIZE;
> > +
> > + return 0;
> > +}
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 5d9536fa860a..0a3f7e2b32df 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2832,15 +2832,16 @@ struct vm_area_struct *vma_lookup(struct
> > mm_struct *mm, unsigned long addr)
> > return mtree_load(&mm->mm_mt, addr);
> > }
> >
> > +unsigned long stack_guard_start_gap(struct vm_area_struct *vma);
> > +
> > static inline unsigned long vm_start_gap(struct vm_area_struct
> > *vma)
> > {
> > + unsigned long gap = stack_guard_start_gap(vma);
> > unsigned long vm_start = vma->vm_start;
> >
> > - if (vma->vm_flags & VM_GROWSDOWN) {
> > - vm_start -= stack_guard_gap;
> > - if (vm_start > vma->vm_start)
> > - vm_start = 0;
> > - }
> > + vm_start -= gap;
> > + if (vm_start > vma->vm_start)
> > + vm_start = 0;
> > return vm_start;
> > }
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2def55555e05..f67606fbc464 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -281,6 +281,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
> > return origbrk;
> > }
> >
> > +unsigned long __weak stack_guard_start_gap(struct vm_area_struct
> > *vma)
> > +{
> > + if (vma->vm_flags & VM_GROWSDOWN)
> > + return stack_guard_gap;
> > + return 0;
> > +}
>
> I'm thinking perhaps this wants to be an inline function?

I don't think it can work with weak then.

2022-11-15 22:13:33

by Edgecombe, Rick P

Subject: Re: [PATCH v3 25/37] x86/shstk: Add user-mode shadow stack support

On Tue, 2022-11-15 at 13:32 +0100, Peter Zijlstra wrote:
> > + struct thread_shstk *shstk = &current->thread.shstk;
> > + unsigned long addr, size;
> > +
> > + /* Already enabled */
> > + if (features_enabled(CET_SHSTK))
> > + return 0;
> > +
> > + /* Also not supported for 32 bit and x32 */
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
> > in_32bit_syscall())
> > + return -EOPNOTSUPP;
> > +
> > + size = adjust_shstk_size(0);
> > + addr = alloc_shstk(size);
> > + if (IS_ERR_VALUE(addr))
> > + return PTR_ERR((void *)addr);
> > +
> > + fpregs_lock_and_load();
> > + wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> > + wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
>
> This..
>
> > + fpregs_unlock();
> > +
> > + shstk->base = addr;
> > + shstk->size = size;
> > + features_set(CET_SHSTK);
> > +
> > + return 0;
> > +}
> > +static int shstk_disable(void)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > + return -EOPNOTSUPP;
> > +
> > + /* Already disabled? */
> > + if (!features_enabled(CET_SHSTK))
> > + return 0;
> > +
> > + fpregs_lock_and_load();
> > + /* Disable WRSS too when disabling shadow stack */

Oops, this comment is in the wrong patch.

> > + set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
>
> And this... aren't very consistent in approach. Given there is no
> U_IBT
> yet, why complicate matters like this?

Sure, I can change it.

2022-11-15 22:14:26

by Edgecombe, Rick P

Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Tue, 2022-11-15 at 15:22 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:54PM -0700, Rick Edgecombe wrote:
> > +static unsigned long get_user_shstk_addr(void)
> > +{
> > + unsigned long long ssp;
> > +
> > + fpregs_lock_and_load();
> > +
> > + rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > + fpregs_unlock();
> > +
> > + return ssp;
> > +}
>
> This doesn't return the shstk addr, unlike what the name suggests,
> right?

That's a good point. get_user_ssp() would be a better name.

2022-11-15 22:39:01

by Edgecombe, Rick P

Subject: Re: [PATCH v3 18/37] mm: Add guard pages around a shadow stack.

On Tue, 2022-11-15 at 21:56 +0100, Peter Zijlstra wrote:
> On Tue, Nov 15, 2022 at 08:40:19PM +0000, Edgecombe, Rick P wrote:
> > > > +unsigned long __weak stack_guard_start_gap(struct
> > > > vm_area_struct
> > > > *vma)
> > > > +{
> > > > + if (vma->vm_flags & VM_GROWSDOWN)
> > > > + return stack_guard_gap;
> > > > + return 0;
> > > > +}
> > >
> > > I'm thinking perhaps this wants to be an inline function?
> >
> > I don't think it can work with weak then.
>
> That was kinda the point, __weak sucks and this is very small in any
> case.

__weak was suggested here:

https://lore.kernel.org/lkml/[email protected]/

Let me try to put in cross arch code again like the other suggestion
was. I can't remember the reason why I didn't do it.
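
For what it's worth, a non-weak version could look roughly like the userspace
sketch below. It is only a model under stated assumptions: the MODEL_* flag
values and struct model_vma are invented stand-ins for the real definitions,
the 256-page gap is assumed to match the default stack_guard_gap, and the
static assertion just restates the 255 * 8 arithmetic from the quoted patch
comment.

```c
#include <assert.h>

/* Invented stand-ins for the kernel definitions, for illustration only. */
#define MODEL_PAGE_SIZE       4096UL
#define MODEL_VM_GROWSDOWN    0x1UL
#define MODEL_VM_SHADOW_STACK 0x2UL
#define MODEL_STACK_GUARD_GAP (256UL * MODEL_PAGE_SIZE) /* assumed default */

/* INCSSPQ moves SSP by at most 255 * 8 bytes, so one guard page covers it. */
_Static_assert(255UL * 8 < MODEL_PAGE_SIZE, "one guard page must cover INCSSPQ");

struct model_vma {
	unsigned long vm_start;
	unsigned long vm_flags;
};

static inline unsigned long model_stack_guard_start_gap(const struct model_vma *vma)
{
	/* A vma is never both a growsdown stack and a shadow stack. */
	assert((vma->vm_flags & (MODEL_VM_GROWSDOWN | MODEL_VM_SHADOW_STACK)) !=
	       (MODEL_VM_GROWSDOWN | MODEL_VM_SHADOW_STACK));

	if (vma->vm_flags & MODEL_VM_GROWSDOWN)
		return MODEL_STACK_GUARD_GAP;
	if (vma->vm_flags & MODEL_VM_SHADOW_STACK)
		return MODEL_PAGE_SIZE;
	return 0;
}

static inline unsigned long model_vm_start_gap(const struct model_vma *vma)
{
	unsigned long vm_start = vma->vm_start - model_stack_guard_start_gap(vma);

	/* The subtraction wrapped around zero: clamp to the start of memory. */
	if (vm_start > vma->vm_start)
		vm_start = 0;
	return vm_start;
}
```

Folding both checks into one inline helper keeps the clamp in
vm_start_gap() unconditional, the same way the patch restructures it.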

2022-11-15 23:04:26

by Edgecombe, Rick P

Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

+ Christina

On Tue, 2022-11-15 at 15:43 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:36:02PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > Some applications (like GDB and CRIU) would like to tweak CET state
> > via
> > ptrace. This allows for existing functionality to continue to work
> > for
> > seized CET applications. Provide an interface based on the xsave
> > buffer
> > format of CET, but filter unneeded states to make the kernel’s job
> > easier.
> >
> > There is already ptrace functionality for accessing xstate, but
> > this
> > does not include supervisor xfeatures. So there is not a completely
> > clear place for where to put the CET state. Adding it to the user
> > xfeatures regset would complicate that code, as it currently shares
> > logic with signals which should not have supervisor features.
> >
> > Don’t add a general supervisor xfeature regset like the user one,
> > because it is better to maintain flexibility for other supervisor
> > xfeatures to define their own interface. For example, an xfeature
> > may
> > decide not to expose all of its state to userspace. A lot of enum
> > values remain to be used, so just put it in dedicated CET regset.
> >
> > The only downside to not having a generic supervisor xfeature
> > regset,
> > is that apps need to be enlightened of any new supervisor xfeature
> > exposed this way (i.e. they can’t try to have generic save/restore
> > logic). But maybe that is a good thing, because they have to think
> > through each new xfeature instead of encountering issues when a new
> > supervisor xfeature is added.
>
> Per this argument this should not use the CET XSAVE format and CET
> name
> at all, because that conflates the situation vs IBT. Enabling that
> might
> not want to follow this precedent.

Hmm, we definitely need to be able to set the SSP. Christina, does GDB
need anything else? I thought maybe toggling SHSTK_EN?

So it might end up looking pretty much the same, and it would just be
renamed and separated in concept.

>
> > By adding a CET regset, it also has the effect of including the CET
> > state
> > in a core dump, which could be useful for debugging.
> >
> > Inside the setter CET regset, filter out invalid state. Today this
> > includes states disallowed by the HW and states involving Indirect
> > Branch
> > Tracking which the kernel does not currently support for userspace.
> >
> > So this leaves three pieces of data that can be set, shadow stack
> > enablement, WRSS enablement and the shadow stack pointer. It is
> > worth
> > noting that this is separate from enabling shadow stack via the
> > arch_prctl()s.
>
> Does this validate the SSP, when set, points to an actual valid SS
> page?

No, but that situation is already possible and has to be handled
anyway. Just unmap your shadow stack, and map whatever other type of
memory at the same address without doing a call or ret. Then you will
segfault at the first call or ret.

2022-11-15 23:38:21

by Edgecombe, Rick P

Subject: Re: [PATCH v3 15/37] x86/mm: Check Shadow Stack page fault errors

On Tue, 2022-11-15 at 22:07 +0100, Peter Zijlstra wrote:
> On Tue, Nov 15, 2022 at 08:03:06PM +0000, Edgecombe, Rick P wrote:
>
> > That's right. I think the assumption that needs to be broken in the
> > reader's head is that you can satisfy a read fault with a read-only
> > PTE.
> > This is kind of baked in all over the place with the zero-pfn, COW,
> > etc. Maybe I should try to start with that.
>
> Maybe something like:
>
> CoW -- pte_wrprotect() -- changes a SS page 'Write=0,Dirty=1' to
> 'Write=0,Dirty=0,CoW=1' which is a 'regular' RO page. A SS read from
> RET
> will #PF because it expects a SS page. Make sure to break the CoW so
> it
> can be restored to an SS page, as such force the write path and
> tickle
> pte_mkwrite().

Hmm, TBH I'm not sure it's more clear. I tried to take this and fill it
out more. Does it sound better?


When a page becomes COW it changes from a shadow stack permissioned
page (Write=0,Dirty=1) to (Write=0,Dirty=0,CoW=1), which is simply
read-only to the CPU. When shadow stack is enabled, a RET would
normally pop the shadow stack by reading it with a "shadow stack read"
access. However, in the COW case the shadow stack memory does not have
shadow stack permissions, it is read-only. So it will generate a fault.

For conventionally writable pages, a read can be serviced with a
read-only PTE, and COW would not have to happen. But for shadow stack, there
isn't the concept of read-only shadow stack memory. If it is shadow
stack permissioned, it can be modified via CALL and RET instructions.
So COW needs to happen before any memory can be mapped with shadow
stack permissions.

Shadow stack accesses (read or write) need to be serviced with shadow
stack permissioned memory, so in the case of a shadow stack read
access, treat it as a WRITE fault so both COW will happen and the write
fault path will tickle maybe_mkwrite() and map the memory shadow stack.
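
The bit transitions described above can also be written down as a tiny
userspace model. This is a sketch for the discussion and not the kernel
implementation: struct model_pte, the model_* helpers and the access enum are
all hypothetical names, and the CoW field simply stands in for the software
bit this series uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three PTE bits discussed above. */
struct model_pte {
	bool write;
	bool dirty;
	bool cow; /* stands in for the series' software CoW bit */
};

enum model_access {
	MODEL_READ,
	MODEL_WRITE,
	MODEL_SHSTK_READ,  /* e.g. RET popping the shadow stack */
	MODEL_SHSTK_WRITE, /* e.g. CALL pushing to the shadow stack */
};

/* Shadow stack permission is encoded as Write=0,Dirty=1. */
static bool model_pte_shstk(struct model_pte pte)
{
	return !pte.write && pte.dirty;
}

/* Write-protecting moves Dirty into CoW so the result is plain read-only. */
static struct model_pte model_pte_wrprotect(struct model_pte pte)
{
	if (pte.dirty) {
		pte.dirty = false;
		pte.cow = true;
	}
	pte.write = false;
	return pte;
}

/* Any shadow stack access, even a read, has to take the write fault path. */
static bool model_fault_is_write(enum model_access access)
{
	return access != MODEL_READ;
}

/* After COW is broken, the page is mapped shadow stack again. */
static struct model_pte model_break_cow_shstk(struct model_pte pte)
{
	assert(pte.cow);
	pte.cow = false;
	pte.dirty = true;
	pte.write = false;
	return pte;
}
```

A plain read is the only access the model leaves on the read path, which is
the asymmetry the text above is trying to get across.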

2022-11-15 23:52:53

by Edgecombe, Rick P

Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Tue, 2022-11-15 at 15:18 +0100, Peter Zijlstra wrote:
> On Fri, Nov 04, 2022 at 03:35:54PM -0700, Rick Edgecombe wrote:
>
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> > +{
> > + asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > + _ASM_EXTABLE(1b, %l[fail])
> > + :: [addr] "r" (addr), [val] "r" (val)
> > + :: fail);
> > + return 0;
> > +fail:
> > + return -EFAULT;
> > +}
> > +#endif /* CONFIG_X86_USER_SHADOW_STACK */
>
> Why isn't this modelled after put_user() ?

You mean as far as supporting multiple sizes? It just isn't really
needed yet. We are only writing single frames. I suppose it might make
more sense with the alt shadow stack support, but that is dropped for
now.

The other difference here is that WRUSS is a weird instruction that is
treated as a user access even when it is executed in kernel mode. So it
doesn't need stac/clac.

>
> Should you write a 64bit value even if the task receiving a signal is
> 32bit ?

32 bit support was also dropped.

2022-11-16 10:59:08

by Peter Zijlstra

Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Tue, Nov 15, 2022 at 11:42:46PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-11-15 at 15:18 +0100, Peter Zijlstra wrote:
> > On Fri, Nov 04, 2022 at 03:35:54PM -0700, Rick Edgecombe wrote:
> >
> > > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > > +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> > > +{
> > > + asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > > + _ASM_EXTABLE(1b, %l[fail])
> > > + :: [addr] "r" (addr), [val] "r" (val)
> > > + :: fail);
> > > + return 0;
> > > +fail:
> > > + return -EFAULT;
> > > +}
> > > +#endif /* CONFIG_X86_USER_SHADOW_STACK */
> >
> > Why isn't this modelled after put_user() ?
>
> You mean as far as supporting multiple sizes? It just isn't really
> needed yet. We are only writing single frames. I suppose it might make
> more sense with the alt shadow stack support, but that is dropped for
> now.
>
> The other difference here is that WRUSS is a weird instruction that is
> treated as a user access even when it is executed in kernel mode. So it
> doesn't need stac/clac.
>
> >
> > Should you write a 64bit value even if the task receiving a signal is
> > 32bit ?
>
> 32 bit support was also dropped.

How? Task could start life as 64bit, frob LDT to set up 32bit code
segment and jump into it and start doing 32bit syscalls, then what?

AFAICT those 32bit syscalls will end up doing SA_IA32_ABI sigframes.

2022-11-16 11:02:11

by Peter Zijlstra

Subject: Re: [PATCH v3 15/37] x86/mm: Check Shadow Stack page fault errors

On Tue, Nov 15, 2022 at 11:13:34PM +0000, Edgecombe, Rick P wrote:

> When a page becomes COW it changes from a shadow stack permissioned
> page (Write=0,Dirty=1) to (Write=0,Dirty=0,CoW=1), which is simply
> read-only to the CPU. When shadow stack is enabled, a RET would
> normally pop the shadow stack by reading it with a "shadow stack read"
> access. However, in the COW case the shadow stack memory does not have
> shadow stack permissions, it is read-only. So it will generate a fault.
>
> For conventionally writable pages, a read can be serviced with a read
> only PTE, and COW would not have to happen. But for shadow stack, there
> isn't the concept of read-only shadow stack memory. If it is shadow
> stack permissioned, it can be modified via CALL and RET instructions.
> So COW needs to happen before any memory can be mapped with shadow
> stack permissions.
>
> Shadow stack accesses (read or write) need to be serviced with shadow
> stack permissioned memory, so in the case of a shadow stack read
> access, treat it as a WRITE fault so both COW will happen and the write
> fault path will tickle maybe_mkwrite() and map the memory shadow stack.

ACK.

2022-11-16 23:24:29

by Edgecombe, Rick P

Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Wed, 2022-11-16 at 11:18 +0100, Peter Zijlstra wrote:
> > >
> > > Should you write a 64bit value even if the task receiving a
> > > signal is
> > > 32bit ?
> >
> > 32 bit support was also dropped.
>
> How? Task could start life as 64bit, frob LDT to set up 32bit code
> segment and jump into it and start doing 32bit syscalls, then what?
>
> AFAICT those 32bit syscalls will end up doing SA_IA32_ABI sigframes.

Hmm, good point. This series used to support normal 32 bit apps via
ia32 emulation which would have handled this. But I removed it (blocked
in the enabling logic) because it didn't seem like it would get enough
use to justify the extra code. That doesn't block this scenario here
though.

Pardon the possibly naive question, but is this 32/64 bit mixing
something any normal, shstk-desiring application would actually do? Or
is it more something that they could do?

Thanks,

Rick

2022-11-17 12:32:11

by Schimpe, Christina

Subject: RE: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

> + Christina
>
> On Tue, 2022-11-15 at 15:43 +0100, Peter Zijlstra wrote:
> > On Fri, Nov 04, 2022 at 03:36:02PM -0700, Rick Edgecombe wrote:
> > > From: Yu-cheng Yu <[email protected]>
> > >
> > > Some applications (like GDB and CRIU) would like to tweak CET state
> > > via ptrace. This allows for existing functionality to continue to
> > > work for seized CET applications. Provide an interface based on the
> > > xsave buffer format of CET, but filter unneeded states to make the
> > > kernel’s job easier.
> > >
> > > There is already ptrace functionality for accessing xstate, but this
> > > does not include supervisor xfeatures. So there is not a completely
> > > clear place for where to put the CET state. Adding it to the user
> > > xfeatures regset would complicate that code, as it currently shares
> > > logic with signals which should not have supervisor features.
> > >
> > > Don’t add a general supervisor xfeature regset like the user one,
> > > because it is better to maintain flexibility for other supervisor
> > > xfeatures to define their own interface. For example, an xfeature
> > > > may decide not to expose all of its state to userspace. A lot of
> > > enum values remain to be used, so just put it in dedicated CET
> > > regset.
> > >
> > > The only downside to not having a generic supervisor xfeature
> > > regset, is that apps need to be enlightened of any new supervisor
> > > xfeature exposed this way (i.e. they can’t try to have generic
> > > save/restore logic). But maybe that is a good thing, because they
> > > have to think through each new xfeature instead of encountering
> > > > issues when a new supervisor xfeature is added.
> >
> > Per this argument this should not use the CET XSAVE format and CET
> > name at all, because that conflates the situation vs IBT. Enabling
> > that might not want to follow this precedent.
>
> Hmm, we definitely need to be able to set the SSP. Christina, does GDB need
> anything else? I thought maybe toggling SHSTK_EN?

In addition to the SSP, we want to write the CET state. For instance for inferior calls,
we want to reset the IBT bits.
However, we won't write states that are disallowed by HW.

2022-11-17 14:33:02

by Peter Zijlstra

Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Thu, Nov 17, 2022 at 12:25:16PM +0000, Schimpe, Christina wrote:
> > + Christina
> >
> > On Tue, 2022-11-15 at 15:43 +0100, Peter Zijlstra wrote:
> > > On Fri, Nov 04, 2022 at 03:36:02PM -0700, Rick Edgecombe wrote:
> > > > From: Yu-cheng Yu <[email protected]>
> > > >
> > > > Some applications (like GDB and CRIU) would like to tweak CET state
> > > > via ptrace. This allows for existing functionality to continue to
> > > > work for seized CET applications. Provide an interface based on the
> > > > xsave buffer format of CET, but filter unneeded states to make the
> > > > kernel’s job easier.
> > > >
> > > > There is already ptrace functionality for accessing xstate, but this
> > > > does not include supervisor xfeatures. So there is not a completely
> > > > clear place for where to put the CET state. Adding it to the user
> > > > xfeatures regset would complicate that code, as it currently shares
> > > > logic with signals which should not have supervisor features.
> > > >
> > > > Don’t add a general supervisor xfeature regset like the user one,
> > > > because it is better to maintain flexibility for other supervisor
> > > > xfeatures to define their own interface. For example, an xfeature
> > > > > may decide not to expose all of its state to userspace. A lot of
> > > > enum values remain to be used, so just put it in dedicated CET
> > > > regset.
> > > >
> > > > The only downside to not having a generic supervisor xfeature
> > > > regset, is that apps need to be enlightened of any new supervisor
> > > > xfeature exposed this way (i.e. they can’t try to have generic
> > > > save/restore logic). But maybe that is a good thing, because they
> > > > have to think through each new xfeature instead of encountering
> > > > > issues when a new supervisor xfeature is added.
> > >
> > > Per this argument this should not use the CET XSAVE format and CET
> > > name at all, because that conflates the situation vs IBT. Enabling
> > > that might not want to follow this precedent.
> >
> > Hmm, we definitely need to be able to set the SSP. Christina, does GDB need
> > anything else? I thought maybe toggling SHSTK_EN?
>
> In addition to the SSP, we want to write the CET state. For instance for inferior calls,
> we want to reset the IBT bits.

This is about Shadow Stack -- IBT is a completely different feature and
not subject of this series.

Also, wth is an inferior call?

2022-11-17 15:04:31

by Peter Zijlstra

Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Wed, Nov 16, 2022 at 10:38:19PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2022-11-16 at 11:18 +0100, Peter Zijlstra wrote:
> > > >
> > > > Should you write a 64bit value even if the task receiving a
> > > > signal is
> > > > 32bit ?
> > >
> > > 32 bit support was also dropped.
> >
> > How? Task could start life as 64bit, frob LDT to set up 32bit code
> > segment and jump into it and start doing 32bit syscalls, then what?
> >
> > AFAICT those 32bit syscalls will end up doing SA_IA32_ABI sigframes.
>
> Hmm, good point. This series used to support normal 32 bit apps via
> ia32 emulation which would have handled this. But I removed it (blocked
> in the enabling logic) because it didn't seem like it would get enough
> use to justify the extra code. That doesn't block this scenario here
> though.
>
> Pardon the possibly naive question, but is this 32/64 bit mixing
> something any normal, shstk-desiring, applications would actually do?
> Or more that they could do?

It is not something common, but it is something that things like Wine
do IIRC, and it would be a real shame if Wine could not use shadow
stacks or something, right ;-)

But more to the point; since the kernel cannot forbid this scenario
(aside from taking away the LDT entirely) it is something that needs
handling.

2022-11-17 20:41:50

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Thu, 2022-11-17 at 12:25 +0000, Schimpe, Christina wrote:
> > Hmm, we definitely need to be able to set the SSP. Christina, does
> > GDB need
> > anything else? I thought maybe toggling SHSTK_EN?
>
> In addition to the SSP, we want to write the CET state. For instance
> for inferior calls,
> we want to reset the IBT bits.
> However, we won't write states that are disallowed by HW.

Sorry, I should have given more background. Peter is saying we should
split the ptrace interface so that shadow stack and IBT are separate.
They would also no longer necessarily mirror the CET_U MSR format.
Instead the kernel would expose a kernel specific format that has the
needed bits of shadow stack support. And a separate one later for IBT.

So the question is what does shadow stack need to support for ptrace
besides SSP? Is it only SSP? The other features are SHSTK_EN and
WRSS_EN. It might actually be nice to keep how these bits get flipped
more controlled (remove them from ptrace). It looks like CRIU didn't
need them.


2022-11-18 17:15:14

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v3 27/37] x86/shstk: Introduce routines modifying shstk

On Thu, 2022-11-17 at 15:17 +0100, Peter Zijlstra wrote:
> On Wed, Nov 16, 2022 at 10:38:19PM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2022-11-16 at 11:18 +0100, Peter Zijlstra wrote:
> > > > >
> > > > > Should you write a 64bit value even if the task receiving a
> > > > > signal is
> > > > > 32bit ?
> > > >
> > > > 32 bit support was also dropped.
> > >
> > > How? Task could start life as 64bit, frob LDT to set up 32bit
> > > code
> > > segment and jump into it and start doing 32bit syscalls, then
> > > what?
> > >
> > > AFAICT those 32bit syscalls will end up doing SA_IA32_ABI
> > > sigframes.
> >
> > Hmm, good point. This series used to support normal 32 bit apps via
> > ia32 emulation which would have handled this. But I removed it
> > (blocked
> > in the enabling logic) because it didn't seem like it would get
> > enough
> > use to justify the extra code. That doesn't block this scenario
> > here
> > though.
> >
> > Pardon the possibly naive question, but is this 32/64 bit mixing
> > something any normal, shstk-desiring, applications would actually
> > do? Or more that they could do?
>
> It is not something common, but it is something that things like Wine
> do IIRC, and it would be a real shame if Wine could not use shadow
> stacks or something, right ;-)

Since Windows has shadow stack support, I guess. But it looks like it
doesn't support shadow stacks on 32 bit either. So for the time being,
it seems Wine wouldn't use this either... I think...

>
> But more to the point; since the kernel cannot forbid this scenario
> (aside from taking away the LDT entirely) it is something that needs
> handling.

I'm having to go educate myself a bit on this kind of code mixing and
existing ABI expectations. It seems you could also just make 32 bit
syscalls from 64 bit code to trigger the same behavior.

On one hand if we think that no one will use this, it would be a shame
to have to maintain 32 bit shadow stack support. But on the other hand,
we have all these apps being automatically marked as supporting shadow
stack. If this was not the case, I would think just declaring this
unsupported would be the best.

For bringing back 32 bit support, the tricky part might be a 32 bit
implementation of the new shadow stack sigframe design that supports
alt shadow stacks. Setting the high bit to guarantee the frame will not
point to user space won't work for 32 bit. But if we are mostly worried
about making sure it is still functional we could maybe just have a
slightly less protective format for the shadow stack sigframe for 32
bit. It would not have the same SROP protections. Have to think if this
is a security hole for 64 bit though.
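
To make the high bit point concrete, here is a minimal sketch. It is not
the code from the series: it assumes the 64 bit sigframe entry is simply
the previous SSP with bit 63 set, and all names in it are hypothetical.
An x86-64 user address never has bit 63 set, so such an entry cannot be
mistaken for a valid shadow stack pointer. A 4-byte entry has no spare
bit to play that role, since a 32 bit task can use its whole address
range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: bit 63 is never set in an x86-64 user address. */
#define SHSTK_SIGFRAME_BIT (UINT64_C(1) << 63)

/* Encode the previous SSP as a 64 bit sigframe entry (illustrative only). */
static uint64_t shstk_frame_encode(uint64_t prev_ssp)
{
	return prev_ssp | SHSTK_SIGFRAME_BIT;
}

/* A sigreturn-style check: accept only entries that carry the marker. */
static bool shstk_frame_decode(uint64_t entry, uint64_t *prev_ssp)
{
	if (!(entry & SHSTK_SIGFRAME_BIT))
		return false;
	*prev_ssp = entry & ~SHSTK_SIGFRAME_BIT;
	return true;
}

/* What a 4-byte slot would keep: the marker bit is truncated away. */
static uint32_t shstk_frame_truncate32(uint64_t entry)
{
	return (uint32_t)entry;
}
```

With this encoding a plain user pointer fails shstk_frame_decode(), while
the truncated 32 bit value is indistinguishable from an ordinary address.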

Anyway, I'm still digging on this one, and just wanted to let you know
where I was at.

2022-11-18 17:31:16

by Schimpe, Christina

[permalink] [raw]
Subject: RE: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

> On Thu, 2022-11-17 at 12:25 +0000, Schimpe, Christina wrote:
> > > Hmm, we definitely need to be able to set the SSP. Christina, does
> > > GDB need anything else? I thought maybe toggling SHSTK_EN?
> >
> > In addition to the SSP, we want to write the CET state. For instance
> > for inferior calls, we want to reset the IBT bits.
> > However, we won't write states that are disallowed by HW.
>
> Sorry, I should have given more background. Peter is saying we should split
> the ptrace interface so that shadow stack and IBT are separate.
> They would also no longer necessarily mirror the CET_U MSR format.
> Instead the kernel would expose a kernel specific format that has the
> needed bits of shadow stack support. And a separate one later for IBT.
>
> So the question is what does shadow stack need to support for ptrace
> besides SSP? Is it only SSP? The other features are SHSTK_EN and WRSS_EN.
> It might actually be nice to keep how these bits get flipped more controlled
> (remove them from ptrace). It looks like CRIU didn't need them.
>

GDB currently reads the CET_U and SSP registers. However, we don’t necessarily have to read EB_LEG_BITMAP_BASE.
In addition to SSP, we want to write the bits for the IBT state machine (TRACKER and SUPPRESS).
However, besides that GDB does not have to write anything else.
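
For reference, a rough sketch of how the CET_U fields named above could be
picked apart. The bit positions are assumed from the IA32_U_CET layout in
the Intel SDM; the macro and function names are made up for this example
and are not taken from GDB or from the series. Treating "reset the IBT
bits" as clearing TRACKER and SUPPRESS is also an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions assumed from the IA32_U_CET MSR layout (names illustrative). */
#define CET_SHSTK_EN                (UINT64_C(1) << 0)
#define CET_WRSS_EN                 (UINT64_C(1) << 1)
#define CET_SUPPRESS                (UINT64_C(1) << 10)
#define CET_TRACKER                 (UINT64_C(1) << 11)
#define CET_EB_LEG_BITMAP_BASE_MASK (~UINT64_C(0xfff))

/* Shadow stack enable bit. */
static bool cet_shstk_enabled(uint64_t cet_u)
{
	return cet_u & CET_SHSTK_EN;
}

/* Legacy bitmap base: the part GDB says it does not need to read. */
static uint64_t cet_leg_bitmap_base(uint64_t cet_u)
{
	return cet_u & CET_EB_LEG_BITMAP_BASE_MASK;
}

/* Clear only the IBT state machine bits, leaving everything else alone. */
static uint64_t cet_reset_ibt_state(uint64_t cet_u)
{
	return cet_u & ~(CET_SUPPRESS | CET_TRACKER);
}
```

The point of the split interface is that a debugger would only touch the
TRACKER and SUPPRESS bits through the IBT side, never SHSTK_EN or WRSS_EN.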


Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, http://www.intel.de
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

2022-11-18 17:32:33

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Fri, 2022-11-18 at 16:21 +0000, Schimpe, Christina wrote:
> > On Thu, 2022-11-17 at 12:25 +0000, Schimpe, Christina wrote:
> > > > Hmm, we definitely need to be able to set the SSP. Christina,
> > > > does
> > > > GDB need anything else? I thought maybe toggling SHSTK_EN?
> > >
> > > In addition to the SSP, we want to write the CET state. For
> > > instance
> > > for inferior calls, we want to reset the IBT bits.
> > > However, we won't write states that are disallowed by HW.
> >
> > Sorry, I should have given more background. Peter is saying we
> > should split
> > the ptrace interface so that shadow stack and IBT are separate.
> > They would also no longer necessarily mirror the CET_U MSR format.
> > Instead the kernel would expose a kernel specific format that has
> > the
> > needed bits of shadow stack support. And a separate one later for
> > IBT.
> >
> > So the question is what does shadow stack need to support for
> > ptrace
> > besides SSP? Is it only SSP? The other features are SHSTK_EN and
> > WRSS_EN.
> > It might actually be nice to keep how these bits get flipped more
> > controlled
> > (remove them from ptrace). It looks like CRIU didn't need them.
> >
>
> GDB currently reads the CET_U and SSP register. However, we don’t
> necessarily have to read EB_LEG_BITMAP_BASE.
> In addition to SSP, we want to write the bits for the IBT state
> machine (TRACKER and SUPPRESS).
> However, besides that GDB does not have to write anything else.

Again, this is just about shadow stack. IBT will have a separate
interface. So based on these comments, I'll change the interface in
this patch to one for simply reading/writing SSP.


2022-11-18 18:04:32

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Thu, 2022-11-17 at 15:14 +0100, Peter Zijlstra wrote:
> Also, wth is an inferior call?

I think it's the GDB functionality to make a call in the process being
debugged. So if you want to call something without an endbranch I
guess.

2022-11-18 18:05:26

by Schimpe, Christina

[permalink] [raw]
Subject: RE: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

> On Thu, 2022-11-17 at 15:14 +0100, Peter Zijlstra wrote:
> > Also, wth is an inferior call?
>
> I think it's the GDB functionality to make a call in the process being debugged.
> So if you want to call something without an endbranch I guess.

For inferior calls, GDB builds a dummy-frame to evaluate an expression:
https://sourceware.org/gdb/onlinedocs/gdb/Calling.html

Best Regards,
Christina


2022-11-21 07:57:36

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Thu, Nov 17, 2022 at 07:57:59PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2022-11-17 at 12:25 +0000, Schimpe, Christina wrote:
> > > Hmm, we definitely need to be able to set the SSP. Christina, does
> > > GDB need
> > > anything else? I thought maybe toggling SHSTK_EN?
> >
> > In addition to the SSP, we want to write the CET state. For instance
> > for inferior calls,
> > we want to reset the IBT bits.
> > However, we won't write states that are disallowed by HW.
>
> Sorry, I should have given more background. Peter is saying we should
> split the ptrace interface so that shadow stack and IBT are separate.
> They would also no longer necessarily mirror the CET_U MSR format.
> Instead the kernel would expose a kernel specific format that has the
> needed bits of shadow stack support. And a separate one later for IBT.
>
> So the question is what does shadow stack need to support for ptrace
> besides SSP? Is it only SSP? The other features are SHSTK_EN and
> WRSS_EN. It might actually be nice to keep how these bits get flipped
> more controlled (remove them from ptrace). It looks like CRIU didn't
> need them.

CRIU reads CET_U with ptrace(PTRACE_GETREGSET, NT_X86_CET). It's done
before the injection of the parasite. The value of SHSTK_EN is then used to
detect whether shadow stack is enabled and to set up the victim's shadow
stack for sigreturn.

--
Sincerely yours,
Mike.

2022-11-21 16:19:07

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Mon, 2022-11-21 at 09:40 +0200, Mike Rapoport wrote:
> On Thu, Nov 17, 2022 at 07:57:59PM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2022-11-17 at 12:25 +0000, Schimpe, Christina wrote:
> > > > Hmm, we definitely need to be able to set the SSP. Christina,
> > > > does
> > > > GDB need
> > > > anything else? I thought maybe toggling SHSTK_EN?
> > >
> > > In addition to the SSP, we want to write the CET state. For
> > > instance
> > > for inferior calls,
> > > we want to reset the IBT bits.
> > > However, we won't write states that are disallowed by HW.
> >
> > Sorry, I should have given more background. Peter is saying we
> > should
> > split the ptrace interface so that shadow stack and IBT are
> > separate.
> > They would also no longer necessarily mirror the CET_U MSR format.
> > Instead the kernel would expose a kernel specific format that has
> > the
> > needed bits of shadow stack support. And a separate one later for
> > IBT.
> >
> > So the question is what does shadow stack need to support for
> > ptrace
> > besides SSP? Is it only SSP? The other features are SHSTK_EN and
> > WRSS_EN. It might actually be nice to keep how these bits get
> > flipped
> > more controlled (remove them from ptrace). It looks like CRIU
> > didn't
> > need them.
>
>
> CRIU reads CET_U with ptrace(PTRACE_GETREGSET, NT_X86_CET). It's done
> before the injection of the parasite. The value of SHSTK_EN is then
> used to detect whether shadow stack is enabled and to set up the
> victim's shadow stack for sigreturn.

Hmm, can it read /proc/pid/status? It has some lines like this:
x86_Thread_features: shstk wrss
x86_Thread_features_locked: shstk wrss
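
Something like the following would be enough to pull a feature out of
those lines. This is only a sketch: it works on the text of the status
file already read into a buffer, the function name is made up, and it
assumes the "key: word word ..." format shown above.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `feature` appears as a whole word on the line that
 * starts with `key` followed by ':' in `status`, the text of
 * /proc/<pid>/status. Hypothetical helper, not from CRIU or the kernel.
 */
static bool status_has_feature(const char *status, const char *key,
			       const char *feature)
{
	size_t klen = strlen(key), flen = strlen(feature);
	const char *line = status;

	if (!flen)
		return false;

	while (line && *line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		/* Require the ':' right after the key, so that
		 * "x86_Thread_features" does not match "..._locked". */
		if (len > klen && !strncmp(line, key, klen) && line[klen] == ':') {
			const char *p = line + klen + 1, *end = line + len;

			while (p < end) {
				const char *tok;

				while (p < end && (*p == ' ' || *p == '\t'))
					p++;
				tok = p;
				while (p < end && *p != ' ' && *p != '\t')
					p++;
				if ((size_t)(p - tok) == flen &&
				    !strncmp(tok, feature, flen))
					return true;
			}
			return false;
		}
		line = eol ? eol + 1 : NULL;
	}
	return false;
}
```

A caller would read /proc/<pid>/status into a buffer and ask, for example,
status_has_feature(buf, "x86_Thread_features", "shstk").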

2022-11-22 10:13:32

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v3 35/37] x86/cet: Add PTRACE interface for CET

On Mon, Nov 21, 2022 at 03:52:57PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-11-21 at 09:40 +0200, Mike Rapoport wrote:
> > On Thu, Nov 17, 2022 at 07:57:59PM +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2022-11-17 at 12:25 +0000, Schimpe, Christina wrote:
> > > > > Hmm, we definitely need to be able to set the SSP. Christina,
> > > > > does
> > > > > GDB need
> > > > > anything else? I thought maybe toggling SHSTK_EN?
> > > >
> > > > In addition to the SSP, we want to write the CET state. For
> > > > instance
> > > > for inferior calls,
> > > > we want to reset the IBT bits.
> > > > However, we won't write states that are disallowed by HW.
> > >
> > > Sorry, I should have given more background. Peter is saying we
> > > should
> > > split the ptrace interface so that shadow stack and IBT are
> > > separate.
> > > They would also no longer necessarily mirror the CET_U MSR format.
> > > Instead the kernel would expose a kernel specific format that has
> > > the
> > > needed bits of shadow stack support. And a separate one later for
> > > IBT.
> > >
> > > So the question is what does shadow stack need to support for
> > > ptrace
> > > besides SSP? Is it only SSP? The other features are SHSTK_EN and
> > > WRSS_EN. It might actually be nice to keep how these bits get
> > > flipped
> > > more controlled (remove them from ptrace). It looks like CRIU
> > > didn't
> > > need them.
> >
> >
> > > CRIU reads CET_U with ptrace(PTRACE_GETREGSET, NT_X86_CET). It's done
> > > before the injection of the parasite. The value of SHSTK_EN is then
> > > used to detect whether shadow stack is enabled and to set up the
> > > victim's shadow stack for sigreturn.
>
> Hmm, can it read /proc/pid/status? It has some lines like this:
> x86_Thread_features: shstk wrss
> x86_Thread_features_locked: shstk wrss

It could, but that would be much more intrusive than GETREGSET because
currently /proc parsing and parasite injection don't really interact.
If anything, arch_prctl(ARCH_CET_GET) via ptrace would be much nicer than
/proc.

--
Sincerely yours,
Mike.