2023-02-18 21:16:30

by Edgecombe, Rick P

Subject: [PATCH v6 00/41] Shadow stacks for userspace

Hi,

This series implements Shadow Stacks for userspace using x86's Control-flow
Enforcement Technology (CET). CET consists of two related security features:
shadow stacks and indirect branch tracking. This series implements just the
shadow stack part of this feature, and just for userspace.

The main use case for shadow stack is providing protection against return
oriented programming attacks. It works by maintaining a secondary (shadow)
stack using a special memory type that has protections against modification.
When executing a CALL instruction, the processor pushes the return address to
both the normal stack and to the special permission shadow stack. Upon RET,
the processor pops the shadow stack copy and compares it to the normal stack
copy. For more details, see the cover letter from v1 [0].
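
The CALL/RET behavior described above can be illustrated with a small
userspace toy model. This is only a sketch: the struct, the function names
and the fixed depth are made up for illustration, and real hardware keeps
the shadow stack in specially protected memory rather than in an ordinary
array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_DEPTH 16	/* hypothetical capacity of this toy model */

/* Toy model: two parallel stacks of return addresses. */
struct sim_cpu {
	uint64_t stack[SIM_DEPTH];	/* normal stack, writable by the program */
	uint64_t shadow[SIM_DEPTH];	/* shadow copy, written only by sim_call() */
	size_t sp;			/* number of entries on each stack */
};

/* CALL: push the return address to both stacks. */
static inline bool sim_call(struct sim_cpu *cpu, uint64_t ret_addr)
{
	if (cpu->sp >= SIM_DEPTH)
		return false;
	cpu->stack[cpu->sp] = ret_addr;
	cpu->shadow[cpu->sp] = ret_addr;
	cpu->sp++;
	return true;
}

/*
 * RET: pop both copies and compare them. Returns true when they match;
 * false stands in for the control-protection fault raised by hardware.
 */
static inline bool sim_ret(struct sim_cpu *cpu, uint64_t *ret_addr)
{
	if (cpu->sp == 0)
		return false;
	cpu->sp--;
	*ret_addr = cpu->stack[cpu->sp];
	return cpu->stack[cpu->sp] == cpu->shadow[cpu->sp];
}
```

Overwriting an entry in stack[] between sim_call() and sim_ret() makes
sim_ret() return false, which is the ROP case the feature is meant to catch.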

The main changes in this version are the MM suggestions by David Hildenbrand
to have pte_mkwrite() take a vma, and rename _PAGE_COW.
The former is split over three patches:
mm: Introduce pte_mkwrite_kernel()
s390/mm: Introduce pmd_mkwrite_kernel()
mm: Make pte_mkwrite() take a VMA

With these changes, and an adjustment to "mm: Warn on shadow stack memory in
wrong vma", references to "shstk" are now only in x86 arch code, hopefully
addressing Andrew Morton's concerns. There are still a couple of VM_SHADOW_STACK
references, which seems to be in keeping with the treatment of other
VM_HIGH_ARCH flags. If other shadow stack implementations end up with identical
logic, it can easily be refactored at that point.

There was also some more feedback from Boris which was incorporated.

I left tested-by tags in place per discussion with testers. Testers, please
retest.

Previous version [1].

Thanks,
Rick


[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://lore.kernel.org/lkml/[email protected]/


Kirill A. Shutemov (1):
x86: Introduce userspace API for shadow stack

Mike Rapoport (1):
x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (19):
x86/fpu: Add helper for modifying xstate
x86: Move control protection handler to separate file
mm: Introduce pte_mkwrite_kernel()
s390/mm: Introduce pmd_mkwrite_kernel()
mm: Make pte_mkwrite() take a VMA
x86/mm: Introduce _PAGE_SAVED_DIRTY
x86/mm: Start actually marking _PAGE_SAVED_DIRTY
x86/mm: Teach pte_mkwrite() about stack memory
mm: Don't allow write GUPs to shadow stack memory
x86/mm: Introduce MAP_ABOVE4G
mm: Warn on shadow stack memory in wrong vma
x86/mm: Warn if create Write=0,Dirty=1 with raw prot
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Support WRSS for userspace
x86: Expose thread features in /proc/$PID/status
x86/shstk: Wire in shadow stack interface
selftests/x86: Add shadow stack test
x86/fpu: Add helper for initing features
x86/shstk: Add ARCH_SHSTK_STATUS

Yu-cheng Yu (20):
Documentation/x86: Add CET shadow stack description
x86/shstk: Add Kconfig option for shadow stack
x86/cpufeatures: Add CPU feature flags for shadow stacks
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
x86/shstk: Add user control-protection fault handler
x86/mm: Remove _PAGE_DIRTY from kernel RO pages
x86/mm: Move pmd_write(), pud_write() up in the file
x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY
mm: Move VM_UFFD_MINOR_BIT from 37 to 38
mm: Introduce VM_SHADOW_STACK for shadow stack memory
x86/mm: Check shadow stack page fault errors
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm: Re-introduce vm_flags to do_mmap()
x86/shstk: Add user-mode shadow stack support
x86/shstk: Handle thread shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle signals for shadow stack
x86: Add PTRACE interface for shadow stack

Documentation/filesystems/proc.rst | 1 +
Documentation/mm/arch_pgtable_helpers.rst | 9 +-
Documentation/x86/index.rst | 1 +
Documentation/x86/shstk.rst | 176 +++++
arch/alpha/include/asm/pgtable.h | 6 +-
arch/arc/include/asm/hugepage.h | 2 +-
arch/arc/include/asm/pgtable-bits-arcv2.h | 7 +-
arch/arm/include/asm/pgtable-3level.h | 7 +-
arch/arm/include/asm/pgtable.h | 2 +-
arch/arm/kernel/signal.c | 2 +-
arch/arm64/include/asm/pgtable.h | 9 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/arm64/mm/trans_pgd.c | 4 +-
arch/csky/include/asm/pgtable.h | 2 +-
arch/hexagon/include/asm/pgtable.h | 2 +-
arch/ia64/include/asm/pgtable.h | 2 +-
arch/loongarch/include/asm/pgtable.h | 4 +-
arch/m68k/include/asm/mcf_pgtable.h | 2 +-
arch/m68k/include/asm/motorola_pgtable.h | 6 +-
arch/m68k/include/asm/sun3_pgtable.h | 6 +-
arch/microblaze/include/asm/pgtable.h | 2 +-
arch/mips/include/asm/pgtable.h | 6 +-
arch/nios2/include/asm/pgtable.h | 2 +-
arch/openrisc/include/asm/pgtable.h | 2 +-
arch/parisc/include/asm/pgtable.h | 6 +-
arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +-
arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +-
arch/riscv/include/asm/pgtable.h | 6 +-
arch/s390/include/asm/hugetlb.h | 4 +-
arch/s390/include/asm/pgtable.h | 14 +-
arch/s390/mm/pageattr.c | 4 +-
arch/sh/include/asm/pgtable_32.h | 10 +-
arch/sparc/include/asm/pgtable_32.h | 2 +-
arch/sparc/include/asm/pgtable_64.h | 6 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/um/include/asm/pgtable.h | 2 +-
arch/x86/Kconfig | 24 +
arch/x86/Kconfig.assembler | 5 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 16 +-
arch/x86/include/asm/fpu/api.h | 9 +
arch/x86/include/asm/fpu/regset.h | 7 +-
arch/x86/include/asm/fpu/sched.h | 3 +-
arch/x86/include/asm/fpu/types.h | 16 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/include/asm/mmu_context.h | 2 +
arch/x86/include/asm/msr.h | 11 +
arch/x86/include/asm/pgtable.h | 322 ++++++++-
arch/x86/include/asm/pgtable_types.h | 71 +-
arch/x86/include/asm/processor.h | 8 +
arch/x86/include/asm/shstk.h | 40 ++
arch/x86/include/asm/special_insns.h | 13 +
arch/x86/include/asm/tlbflush.h | 3 +-
arch/x86/include/asm/trap_pf.h | 2 +
arch/x86/include/asm/traps.h | 12 +
arch/x86/include/uapi/asm/mman.h | 4 +
arch/x86/include/uapi/asm/prctl.h | 12 +
arch/x86/kernel/Makefile | 4 +
arch/x86/kernel/cet.c | 152 ++++
arch/x86/kernel/cpu/common.c | 35 +-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/cpu/proc.c | 23 +
arch/x86/kernel/fpu/core.c | 59 +-
arch/x86/kernel/fpu/regset.c | 86 +++
arch/x86/kernel/fpu/xstate.c | 148 ++--
arch/x86/kernel/fpu/xstate.h | 6 +
arch/x86/kernel/idt.c | 2 +-
arch/x86/kernel/process.c | 18 +-
arch/x86/kernel/process_64.c | 9 +-
arch/x86/kernel/ptrace.c | 12 +
arch/x86/kernel/shstk.c | 491 +++++++++++++
arch/x86/kernel/signal.c | 1 +
arch/x86/kernel/signal_32.c | 2 +-
arch/x86/kernel/signal_64.c | 8 +-
arch/x86/kernel/sys_x86_64.c | 6 +-
arch/x86/kernel/traps.c | 87 ---
arch/x86/mm/fault.c | 38 +
arch/x86/mm/pat/set_memory.c | 4 +-
arch/x86/mm/pgtable.c | 38 +
arch/x86/xen/enlighten_pv.c | 2 +-
arch/x86/xen/mmu_pv.c | 2 +-
arch/x86/xen/xen-asm.S | 2 +-
arch/xtensa/include/asm/pgtable.h | 2 +-
fs/aio.c | 2 +-
fs/proc/array.c | 6 +
fs/proc/task_mmu.c | 3 +
include/asm-generic/hugetlb.h | 4 +-
include/linux/mm.h | 46 +-
include/linux/mman.h | 4 +
include/linux/pgtable.h | 14 +
include/linux/proc_fs.h | 2 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/asm-generic/unistd.h | 2 +-
include/uapi/linux/elf.h | 2 +
ipc/shm.c | 2 +-
kernel/sys_ni.c | 1 +
mm/debug_vm_pgtable.c | 16 +-
mm/gup.c | 2 +-
mm/huge_memory.c | 7 +-
mm/hugetlb.c | 4 +-
mm/memory.c | 5 +-
mm/migrate_device.c | 2 +-
mm/mmap.c | 12 +-
mm/mprotect.c | 2 +-
mm/nommu.c | 4 +-
mm/userfaultfd.c | 2 +-
mm/util.c | 2 +-
tools/testing/selftests/x86/Makefile | 4 +-
.../testing/selftests/x86/test_shadow_stack.c | 676 ++++++++++++++++++
117 files changed, 2671 insertions(+), 324 deletions(-)
create mode 100644 Documentation/x86/shstk.rst
create mode 100644 arch/x86/include/asm/shstk.h
create mode 100644 arch/x86/kernel/cet.c
create mode 100644 arch/x86/kernel/shstk.c
create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

--
2.17.1



2023-02-18 21:16:33

by Edgecombe, Rick P

Subject: [PATCH v6 01/41] Documentation/x86: Add CET shadow stack description

From: Yu-cheng Yu <[email protected]>

Introduce a new document on Control-flow Enforcement Technology (CET).

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Literal format tweaks (Bagas Sanjaya)
- Update EOPNOTSUPP text due to unification after comment from (Kees)
- Update 32 bit signal support with new behavior
- Remove capitalization on shadow stack (Boris)
- Fix typo

v4:
- Drop clearcpuid piece (Boris)
- Add some info about 32 bit

v3:
- Clarify kernel IBT is supported by the kernel. (Kees, Andrew Cooper)
- Clarify which arch_prctl's can take multiple bits. (Kees)
- Describe ASLR characteristics of thread shadow stacks. (Kees)
- Add exec section. (Andrew Cooper)
- Fix some capitalization (Bagas Sanjaya)
- Update new location of enablement status proc.
- Add info about new user_shstk software capability.
- Add more info about what the kernel pushes to the shadow stack on
signal.

v2:
- Updated to new arch_prctl() API
- Add bit about new proc status
---
Documentation/x86/index.rst | 1 +
Documentation/x86/shstk.rst | 166 ++++++++++++++++++++++++++++++++++++
2 files changed, 167 insertions(+)
create mode 100644 Documentation/x86/shstk.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index c73d133fd37c..8ac64d7de4dc 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -22,6 +22,7 @@ x86-specific Documentation
mtrr
pat
intel-hfi
+ shstk
iommu
intel_txt
amd-memory-encryption
diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
new file mode 100644
index 000000000000..f2e6f323cf68
--- /dev/null
+++ b/Documentation/x86/shstk.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Control-flow Enforcement Technology (CET) Shadow Stack
+======================================================
+
+CET Background
+==============
+
+Control-flow Enforcement Technology (CET) is a term referring to several
+related x86 processor features that provide protection against control
+flow hijacking attacks. The HW feature itself can be set up to protect
+both applications and the kernel.
+
+CET introduces shadow stack and indirect branch tracking (IBT). Shadow stack
+is a secondary stack allocated from memory and cannot be directly modified by
+applications. When executing a CALL instruction, the processor pushes the
+return address to both the normal stack and the shadow stack. Upon
+function return, the processor pops the shadow stack copy and compares it
+to the normal stack copy. If the two differ, the processor raises a
+control-protection fault. IBT verifies indirect CALL/JMP targets are intended
+as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have both Shadow
+Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only userspace
+shadow stack and kernel IBT are supported.
+
+Requirements to use Shadow Stack
+================================
+
+To use userspace shadow stack you need HW that supports it, a kernel
+configured with it and userspace libraries compiled with it.
+
+The kernel Kconfig option is X86_USER_SHADOW_STACK, and it can be disabled
+with the kernel parameter: nousershstk.
+
+To build a user shadow stack enabled kernel, Binutils v2.29 or LLVM v6 or later
+are required.
+
+At run time, /proc/cpuinfo shows CET features if the processor supports
+CET. "user_shstk" means that userspace shadow stack is supported on the current
+kernel and HW.
+
+Application Enabling
+====================
+
+An application's CET capability is marked in its ELF note and can be verified
+from readelf/llvm-readelf output::
+
+ readelf -n <application> | grep -a SHSTK
+ properties: x86 feature: SHSTK
+
+The kernel does not process these application markers directly. Applications
+or loaders must enable CET features using the interface described in section 4.
+Typically this would be done in dynamic loader or static runtime objects, as is
+the case in GLIBC.
+
+Enabling arch_prctl()'s
+=======================
+
+ELF features should be enabled by the loader using the below arch_prctl's. They
+are only supported in 64 bit user applications.
+
+arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
+ Enable a single feature specified in 'feature'. Can only operate on
+ one feature at a time.
+
+arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
+ Disable a single feature specified in 'feature'. Can only operate on
+ one feature at a time.
+
+arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
+ Lock in features at their current enabled or disabled status. 'features'
+ is a mask of all features to lock. All bits set are processed, unset bits
+ are ignored. The mask is ORed with the existing value. So any feature bits
+ set here cannot be enabled or disabled afterwards.
+
+The return values are as follows. On success, return 0. On error, errno can
+be::
+
+ -EPERM if any of the passed features are locked.
+ -ENOTSUPP if the feature is not supported by the hardware or
+ kernel.
+ -EINVAL arguments (non existing feature, etc)
+
+The feature's bits supported are::
+
+ ARCH_SHSTK_SHSTK - Shadow stack
+ ARCH_SHSTK_WRSS - WRSS
+
+Currently shadow stack and WRSS are supported via this interface. WRSS
+can only be enabled with shadow stack, and is automatically disabled
+if shadow stack is disabled.
+
+Proc Status
+===========
+To check if an application is actually running with shadow stack, the
+user can read the /proc/$PID/status. It will report "wrss" or "shstk"
+depending on what is enabled. The lines look like this::
+
+ x86_Thread_features: shstk wrss
+ x86_Thread_features_locked: shstk wrss
+
+Implementation of the Shadow Stack
+==================================
+
+Shadow Stack Size
+-----------------
+
+A task's shadow stack is allocated from memory to a fixed size of
+MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
+the maximum size of the normal stack, but capped to 4 GB. However, because
+a compat-mode application's address space is smaller, each of its threads'
+shadow stacks is sized MIN(1/4 RLIMIT_STACK, 4 GB).
+
+Signal
+------
+
+By default, the main program and its signal handlers use the same shadow
+stack. Because the shadow stack stores only return addresses, a large
+shadow stack covers the condition that both the program stack and the
+signal alternate stack run out.
+
+When a signal happens, the old pre-signal state is pushed on the stack. When
+shadow stack is enabled, the shadow stack specific state is pushed onto the
+shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
+in a special format with bit 63 set. On sigreturn this old SSP token is
+verified and restored by the kernel. The kernel will also push the normal
+restorer address to the shadow stack to help userspace avoid a shadow stack
+violation on the sigreturn path that goes through the restorer.
+
+So the shadow stack signal frame format is as follows::
+
+ |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
+ (bit 63 set to 1)
+ | ...| - Other state may be added in the future
+
+
+32 bit ABI signals are not supported in shadow stack processes. Linux prevents
+32 bit execution while shadow stack is enabled by allocating shadow stacks
+outside of the 32 bit address space. When execution enters 32 bit mode, either
+via far call or returning to userspace, a #GP is generated by the hardware
+which will be delivered to the process as a segfault. When transitioning to
+userspace the registers' state will be as if the userspace ip being returned to
+caused the segfault.
+
+Fork
+----
+
+The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
+to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
+shadow access triggers a page fault with the shadow stack access bit set
+in the page fault error code.
+
+When a task forks a child, its shadow stack PTEs are copied and both the
+parent's and the child's shadow stack PTEs are cleared of the dirty bit.
+Upon the next shadow stack access, the resulting shadow stack page fault
+is handled by page copy/re-use.
+
+When a pthread child is created, the kernel allocates a new shadow stack
+for the new thread. New shadow stacks behave like mmap() with respect to
+ASLR behavior.
+
+Exec
+----
+
+On exec, shadow stack features are disabled by the kernel, at which point
+userspace can choose to re-enable or lock them.
--
2.17.1
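
As a usage illustration for the "Proc Status" section of the document added
above, here is a hedged sketch of checking a /proc/$PID/status line for a
feature name. The helper name is made up for this example; only the line
format ("x86_Thread_features: shstk wrss") is taken from the documentation
text, and the exact whitespace after the colon is an assumption, so both
spaces and tabs are accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `line` starts with `key` followed by ':' and lists
 * `feature` as a whitespace-separated token after the colon.
 * Hypothetical helper, not part of the series.
 */
static inline bool status_line_has_feature(const char *line, const char *key,
					   const char *feature)
{
	size_t klen = strlen(key);
	size_t flen = strlen(feature);
	const char *p;

	if (flen == 0)
		return false;
	/* Require an exact key match so "..._locked" lines are not confused. */
	if (strncmp(line, key, klen) != 0 || line[klen] != ':')
		return false;

	p = line + klen + 1;
	while (*p) {
		size_t tok;

		p += strspn(p, " \t\n");	/* skip separators */
		tok = strcspn(p, " \t\n");	/* length of the next token */
		if (tok == flen && strncmp(p, feature, flen) == 0)
			return true;
		p += tok;
	}
	return false;
}
```

A caller would read /proc/self/status line by line (for example with
fgets()) and pass each line together with "x86_Thread_features" or
"x86_Thread_features_locked" as the key.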


2023-02-18 21:16:38

by Edgecombe, Rick P

Subject: [PATCH v6 02/41] x86/shstk: Add Kconfig option for shadow stack

From: Yu-cheng Yu <[email protected]>

Shadow stack provides protection for applications against function return
address corruption. It is active when the processor supports it, the
kernel has CONFIG_X86_USER_SHADOW_STACK enabled, and the application is built
for the feature. This is only implemented for the 64-bit kernel. When it
is enabled, legacy non-shadow stack applications continue to work, but
without protection.

Since there is another feature that utilizes CET (Kernel IBT) that will
share implementation with shadow stacks, create CONFIG_X86_CET to signify
that at least one CET feature is configured.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Remove capitalization of shadow stack (Boris)

v3:
- Add X86_CET (Kees)
- Add back WRUSS dependency (Kees)
- Fix verbiage (Dave)
- Change from prompt to bool (Kirill)
- Add more to commit log

v2:
- Remove already wrong kernel size increase info (tglx)
- Change prompt to remove "Intel" (tglx)
- Update line about what CPUs are supported (Dave)

Yu-cheng v25:
- Remove X86_CET and use X86_SHADOW_STACK directly.
---
arch/x86/Kconfig | 24 ++++++++++++++++++++++++
arch/x86/Kconfig.assembler | 5 +++++
2 files changed, 29 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..f03791b73f9f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1851,6 +1851,11 @@ config CC_HAS_IBT
(CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
$(as-instr,endbr64)

+config X86_CET
+ def_bool n
+ help
+ CET features configured (Shadow stack or IBT)
+
config X86_KERNEL_IBT
prompt "Indirect Branch Tracking"
def_bool y
@@ -1858,6 +1863,7 @@ config X86_KERNEL_IBT
# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
depends on !LD_IS_LLD || LLD_VERSION >= 140000
select OBJTOOL
+ select X86_CET
help
Build the kernel with support for Indirect Branch Tracking, a
hardware support course-grain forward-edge Control Flow Integrity
@@ -1952,6 +1958,24 @@ config X86_SGX

If unsure, say N.

+config X86_USER_SHADOW_STACK
+ bool "X86 userspace shadow stack"
+ depends on AS_WRUSS
+ depends on X86_64
+ select ARCH_USES_HIGH_VMA_FLAGS
+ select X86_CET
+ help
+ Shadow stack protection is a hardware feature that detects function
+ return address corruption. This helps mitigate ROP attacks.
+ Applications must be enabled to use it, and old userspace does not
+ get protection "for free".
+
+ CPUs supporting shadow stacks were first released in 2020.
+
+ See Documentation/x86/shstk.rst for more information.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08e2fc4..00c79dd93651 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -19,3 +19,8 @@ config AS_TPAUSE
def_bool $(as-instr,tpause %ecx)
help
Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
+
+config AS_WRUSS
+ def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+ help
+ Supported by binutils >= 2.31 and LLVM integrated assembler
--
2.17.1


2023-02-18 21:16:43

by Edgecombe, Rick P

Subject: [PATCH v6 03/41] x86/cpufeatures: Add CPU feature flags for shadow stacks

From: Yu-cheng Yu <[email protected]>

The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add a CPU feature flag for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
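
For readers who want to check the enumeration from userspace, the CPUID bit
named above can be decoded roughly as follows. This is a sketch, not part of
the patch: the helper names are made up, and the live query assumes a GCC or
Clang toolchain that ships <cpuid.h> with __get_cpuid_count(). Note that the
hardware bit alone says nothing about whether the kernel has enabled
userspace shadow stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID7_ECX_SHSTK_BIT 7	/* CPUID.(EAX=7,ECX=0):ECX[bit 7] */

/* Decode the shadow stack bit from a raw ECX value of leaf 7, subleaf 0. */
static inline bool cpuid7_ecx_has_shstk(uint32_t ecx)
{
	return (ecx >> CPUID7_ECX_SHSTK_BIT) & 1u;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Query the running CPU; returns false if leaf 7 is not available. */
static inline bool cpu_has_shstk(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return cpuid7_ecx_has_shstk(ecx);
}
#endif
```

The bit position lines up with the (16*32+ 7) encoding used for
X86_FEATURE_SHSTK in the diff below, where word 16 holds the ECX output of
this leaf.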

To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.

The shadow stack feature, enumerated by the CPUID bit described above,
encompasses both supervisor and userspace support for shadow stack. In
near future patches, only userspace shadow stack will be enabled. In
expectation of future supervisor shadow stack support, create a software
CPU capability to enumerate kernel utilization of userspace shadow stack
support. This user shadow stack bit should depend on the HW "shstk"
capability and that logic will be implemented in future patches.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Drop "shstk" from cpuinfo (Boris)
- Remove capitalization on shadow stack (Boris)

v3:
- Add user specific shadow stack cpu cap (Andrew Cooper)
- Drop reviewed-bys from Boris and Kees due to the above change.

v2:
- Remove IBT reference in commit log (Kees)
- Describe xsaves dependency using text from (Dave)

v1:
- Remove IBT, can be added in a follow on IBT series.
---
arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index fdb8e09234ba..af4178e0d76a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -309,6 +309,7 @@
#define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
#define X86_FEATURE_SMBA (11*32+21) /* "" Slow Memory Bandwidth Allocation */
#define X86_FEATURE_BMEC (11*32+22) /* "" Bandwidth Monitoring Event Configuration */
+#define X86_FEATURE_USER_SHSTK (11*32+23) /* Shadow stack support for user mode applications */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
@@ -375,6 +376,7 @@
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* "" Shadow stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dfa4fb76f4b..505f78ddca82 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define DISABLE_USER_SHSTK 0
+#else
+#define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -114,7 +120,7 @@
#define DISABLED_MASK9 (DISABLE_SGX)
#define DISABLED_MASK10 0
#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- DISABLE_CALL_DEPTH_TRACKING)
+ DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
#define DISABLED_MASK12 0
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index f6748c8bd647..e462c1d3800a 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -81,6 +81,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
{}
};

--
2.17.1


2023-02-18 21:17:05

by Edgecombe, Rick P

Subject: [PATCH v6 04/41] x86/cpufeatures: Enable CET CR4 bit for shadow stack

From: Yu-cheng Yu <[email protected]>

Setting CR4.CET is a prerequisite for utilizing any CET features, most of
which also require setting MSRs.

Kernel IBT already enables the CET CR4 bit when it detects IBT HW support
and is configured with kernel IBT. However, future patches that enable
userspace shadow stack support will need the bit set as well. So change
the logic to enable it in either case.
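
The resulting decision can be summarized as a small truth table. The sketch
below models it as a pure function; the struct and function names are
hypothetical, and the IBT selftest fallback in the real setup_cet() is left
out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* What setup_cet() ends up doing for a given feature combination. */
struct cet_plan {
	bool set_cr4_cet;		/* set X86_CR4_CET */
	bool write_endbr_en;		/* MSR_IA32_S_CET gets CET_ENDBR_EN, else 0 */
	bool set_user_shstk_cap;	/* set X86_FEATURE_USER_SHSTK */
};

/*
 * kernel_ibt: kernel IBT is configured and the CPU supports IBT.
 * user_shstk: user shadow stack is configured and the CPU supports SHSTK.
 */
static inline struct cet_plan plan_cet(bool kernel_ibt, bool user_shstk)
{
	struct cet_plan plan = { false, false, false };

	/* Neither feature usable: leave CR4.CET and the MSR untouched. */
	if (!kernel_ibt && !user_shstk)
		return plan;

	plan.set_cr4_cet = true;
	plan.write_endbr_en = kernel_ibt;
	plan.set_user_shstk_cap = user_shstk;
	return plan;
}
```

The point of the change is the third row of the table: with only user
shadow stack available, CR4.CET is still set while the supervisor MSR is
written as 0.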

Clear MSR_IA32_U_CET in cet_disable() so that it can't live to see
userspace in a new kexec-ed kernel that has CR4.CET set from kernel IBT.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Remove #ifdeffery (Boris)

v4:
- Add back dedicated command line disable: "nousershstk" (Boris)

v3:
- Remove stray new line (Boris)
- Simplify commit log (Andrew Cooper)

v2:
- In the shadow stack case, go back to only setting CR4.CET if the
kernel is compiled with user shadow stack support.
- Clear MSR_IA32_U_CET as well. (PeterZ)
---
arch/x86/kernel/cpu/common.c | 35 +++++++++++++++++++++++++++--------
1 file changed, 27 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 38646f1b5f14..30c524cd8cad 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -599,27 +599,43 @@ __noendbr void ibt_restore(u64 save)

static __always_inline void setup_cet(struct cpuinfo_x86 *c)
{
- u64 msr = CET_ENDBR_EN;
+ bool user_shstk, kernel_ibt;

- if (!HAS_KERNEL_IBT ||
- !cpu_feature_enabled(X86_FEATURE_IBT))
+ if (!IS_ENABLED(CONFIG_X86_CET))
return;

- wrmsrl(MSR_IA32_S_CET, msr);
+ kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+ user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+ IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK);
+
+ if (!kernel_ibt && !user_shstk)
+ return;
+
+ if (user_shstk)
+ set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
+
+ if (kernel_ibt)
+ wrmsrl(MSR_IA32_S_CET, CET_ENDBR_EN);
+ else
+ wrmsrl(MSR_IA32_S_CET, 0);
+
cr4_set_bits(X86_CR4_CET);

- if (!ibt_selftest()) {
+ if (kernel_ibt && !ibt_selftest()) {
pr_err("IBT selftest: Failed!\n");
wrmsrl(MSR_IA32_S_CET, 0);
setup_clear_cpu_cap(X86_FEATURE_IBT);
- return;
}
}

__noendbr void cet_disable(void)
{
- if (cpu_feature_enabled(X86_FEATURE_IBT))
- wrmsrl(MSR_IA32_S_CET, 0);
+ if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+ cpu_feature_enabled(X86_FEATURE_SHSTK)))
+ return;
+
+ wrmsrl(MSR_IA32_S_CET, 0);
+ wrmsrl(MSR_IA32_U_CET, 0);
}

/*
@@ -1476,6 +1492,9 @@ static void __init cpu_parse_early_param(void)
if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
setup_clear_cpu_cap(X86_FEATURE_XSAVES);

+ if (cmdline_find_option_bool(boot_command_line, "nousershstk"))
+ setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK);
+
arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
if (arglen <= 0)
return;
--
2.17.1


2023-02-18 21:17:18

by Edgecombe, Rick P

Subject: [PATCH v6 05/41] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states

From: Yu-cheng Yu <[email protected]>

Shadow stack register state can be managed with XSAVE. The registers
can logically be separated into two groups:
* Registers controlling user-mode operation
* Registers controlling kernel-mode operation

The architecture has two new XSAVE state components: one for each of
those groups of registers. This lets an OS manage them separately if
it chooses. Future patches for host userspace and KVM guests will only
utilize the user-mode registers, so only configure XSAVE to save
user-mode registers. This state will add 16 bytes to the xsave buffer
size.
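
The 16 byte figure follows from the layout of the user-mode state: two
64-bit values. A standalone sketch that checks this at compile time is
below. It mirrors the struct added in the diff but uses <stdint.h> types and
a "_sketch" suffix so it builds outside the kernel; the component index 11
is taken from the comment in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XFEATURE_CET_USER_SKETCH	11	/* state component 11 */
#define XFEATURE_MASK_CET_USER_SKETCH	(1u << XFEATURE_CET_USER_SKETCH)

/* Userspace mirror of the kernel's struct cet_user_state. */
struct cet_user_state_sketch {
	uint64_t user_cet;	/* user control-flow settings */
	uint64_t user_ssp;	/* user shadow stack pointer */
};

/* Two u64 fields with no padding: 16 bytes added to the xsave buffer. */
static_assert(sizeof(struct cet_user_state_sketch) == 16,
	      "CET user state should be 16 bytes");
static_assert(offsetof(struct cet_user_state_sketch, user_ssp) == 8,
	      "user_ssp should directly follow user_cet");
```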

Future patches will use the user-mode XSAVE area to save guest user-mode
CET state. However, VMCS includes new fields for guest CET supervisor
states. KVM can use these to save and restore guest supervisor state, so
host supervisor XSAVE support is not required.

Adding this exacerbates the already unwieldy if statement in
check_xstate_against_struct() that handles warning about un-implemented
xfeatures. So refactor these checks by having XCHECK_SZ() set a bool when
it actually checks the xfeature. This ends up exceeding 80 chars, but was
better on balance than other options explored. Pass the bool as a pointer to
make it clear that XCHECK_SZ() can change the variable.

While configuring user-mode XSAVE, clarify kernel-mode registers are not
managed by XSAVE by defining the xfeature in
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT.
This serves more of a documentation as code purpose, and functionally,
only enables a few safety checks.

Both XSAVE state components are supervisor states, even the state
controlling user-mode operation. This is a departure from earlier features
like protection keys where the PKRU state is a normal user
(non-supervisor) state. Having the user state be supervisor-managed
ensures there is no direct, unprivileged access to it, making it harder
for an attacker to subvert CET.

To facilitate this privileged access, define the two user-mode CET MSRs,
and the bits defined in those MSRs relevant to future shadow stack
enablement patches.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Move comments from end of lines in cet_user_state struct (Boris)

v3:
- Add missing "is" in commit log (Boris)
- Change to case statement for struct size checking (Boris)
- Adjust commas on xfeature_names (Kees, Boris)

v2:
- Change name to XFEATURE_CET_KERNEL_UNUSED (peterz)

KVM refresh:
- Reword commit log using some verbiage posted by Dave Hansen
- Remove unlikely to be used supervisor cet xsave struct
- Clarify that supervisor cet state is not saved by xsave
- Remove unused supervisor MSRs
---
arch/x86/include/asm/fpu/types.h | 16 +++++-
arch/x86/include/asm/fpu/xstate.h | 6 ++-
arch/x86/kernel/fpu/xstate.c | 90 +++++++++++++++----------------
3 files changed, 61 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 7f6d858ff47a..eb810074f1e7 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
XFEATURE_PKRU,
XFEATURE_PASID,
- XFEATURE_RSRVD_COMP_11,
- XFEATURE_RSRVD_COMP_12,
+ XFEATURE_CET_USER,
+ XFEATURE_CET_KERNEL_UNUSED,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -138,6 +138,8 @@ enum xfeature {
#define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -252,6 +254,16 @@ struct pkru_state {
u32 pad;
} __packed;

+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+ /* user control-flow settings */
+ u64 user_cet;
+ /* user shadow stack pointer */
+ u64 user_ssp;
+};
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..d4427b88ee12 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -50,7 +50,8 @@
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA

/* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_CET_USER)

/*
* A supervisor state component may not always contain valuable information,
@@ -77,7 +78,8 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+ XFEATURE_MASK_CET_KERNEL)

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 714166cc25f2..13a80521dd51 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -39,26 +39,26 @@
*/
static const char *xfeature_names[] =
{
- "x87 floating point registers" ,
- "SSE registers" ,
- "AVX registers" ,
- "MPX bounds registers" ,
- "MPX CSR" ,
- "AVX-512 opmask" ,
- "AVX-512 Hi256" ,
- "AVX-512 ZMM_Hi256" ,
- "Processor Trace (unused)" ,
+ "x87 floating point registers",
+ "SSE registers",
+ "AVX registers",
+ "MPX bounds registers",
+ "MPX CSR",
+ "AVX-512 opmask",
+ "AVX-512 Hi256",
+ "AVX-512 ZMM_Hi256",
+ "Processor Trace (unused)",
"Protection Keys User registers",
"PASID state",
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "unknown xstate feature" ,
- "AMX Tile config" ,
- "AMX Tile data" ,
- "unknown xstate feature" ,
+ "Control-flow User registers",
+ "Control-flow Kernel registers (unused)",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "AMX Tile config",
+ "AMX Tile data",
+ "unknown xstate feature",
};

static unsigned short xsave_cpuid_features[] __initdata = {
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_PKU,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -276,6 +277,7 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
+ print_xstate_feature(XFEATURE_MASK_CET_USER);
print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}
@@ -344,6 +346,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDREGS | \
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_CET_USER | \
XFEATURE_MASK_XTILE)

/*
@@ -446,14 +449,15 @@ static void __init __xstate_dump_leaves(void)
} \
} while (0)

-#define XCHECK_SZ(sz, nr, nr_macro, __struct) do { \
- if ((nr == nr_macro) && \
- WARN_ONCE(sz != sizeof(__struct), \
- "%s: struct is %zu bytes, cpu state %d bytes\n", \
- __stringify(nr_macro), sizeof(__struct), sz)) { \
+#define XCHECK_SZ(sz, nr, __struct) ({ \
+ if (WARN_ONCE(sz != sizeof(__struct), \
+ "[%s]: struct is %zu bytes, cpu state %d bytes\n", \
+ xfeature_names[nr], sizeof(__struct), sz)) { \
__xstate_dump_leaves(); \
} \
-} while (0)
+ true; \
+})
+

/**
* check_xtile_data_against_struct - Check tile data state size.
@@ -527,36 +531,28 @@ static bool __init check_xstate_against_struct(int nr)
* Ask the CPU for the size of the state.
*/
int sz = xfeature_size(nr);
+
/*
* Match each CPU state with the corresponding software
* structure.
*/
- XCHECK_SZ(sz, nr, XFEATURE_YMM, struct ymmh_struct);
- XCHECK_SZ(sz, nr, XFEATURE_BNDREGS, struct mpx_bndreg_state);
- XCHECK_SZ(sz, nr, XFEATURE_BNDCSR, struct mpx_bndcsr_state);
- XCHECK_SZ(sz, nr, XFEATURE_OPMASK, struct avx_512_opmask_state);
- XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
- XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
- XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
- XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
- XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
-
- /* The tile data size varies between implementations. */
- if (nr == XFEATURE_XTILE_DATA)
- check_xtile_data_against_struct(sz);
-
- /*
- * Make *SURE* to add any feature numbers in below if
- * there are "holes" in the xsave state component
- * numbers.
- */
- if ((nr < XFEATURE_YMM) ||
- (nr >= XFEATURE_MAX) ||
- (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
- ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+ switch (nr) {
+ case XFEATURE_YMM: return XCHECK_SZ(sz, nr, struct ymmh_struct);
+ case XFEATURE_BNDREGS: return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
+ case XFEATURE_BNDCSR: return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
+ case XFEATURE_OPMASK: return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
+ case XFEATURE_ZMM_Hi256: return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
+ case XFEATURE_Hi16_ZMM: return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
+ case XFEATURE_PKRU: return XCHECK_SZ(sz, nr, struct pkru_state);
+ case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
+ case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
+ case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
+ default:
XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
return false;
}
+
return true;
}

--
2.17.1


2023-02-18 21:17:22

by Edgecombe, Rick P

Subject: [PATCH v6 06/41] x86/fpu: Add helper for modifying xstate

Just like user xfeatures, supervisor xfeatures can be active in the
registers or present in the task FPU buffer. If the registers are
active, the registers can be modified directly. If the registers are
not active, the modification must be performed on the task FPU buffer.

When the state is not active, the kernel could perform modifications
directly to the buffer. But in order for it to do that, it needs
to know where in the buffer the specific state it wants to modify is
located. Doing this is not robust against optimizations that compact
the FPU buffer, as each access would require computing where in the
buffer it is.

The easiest way to modify supervisor xfeature data is to force restore
the registers and write directly to the MSRs. Oftentimes this is just fine
anyway as the registers need to be restored before returning to userspace.
Do this for now, leaving buffer writing optimizations for the future.

Add a new function fpregs_lock_and_load() that can simultaneously call
fpregs_lock() and do this restore. Also perform some extra sanity
checks in this function since this will be used in non-fpu focused code.
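
The intended calling pattern can be sketched with a toy userspace model.
Everything below (struct toy_fpu and the toy_ functions) is hypothetical
and only models the idea: need_load stands in for TIF_NEED_FPU_LOAD, and
the assignment to reg_ssp stands in for an MSR write. It is not how the
kernel stores its state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for one task's supervisor state; not a kernel structure. */
struct toy_fpu {
	uint64_t buffer_ssp;	/* copy saved in the task FPU buffer */
	uint64_t reg_ssp;	/* copy live in the "registers" */
	bool need_load;		/* models TIF_NEED_FPU_LOAD */
	bool locked;		/* models the fpregs_lock() section */
};

/* Models fpregs_lock_and_load(): lock, then restore if the buffer is current. */
static void toy_lock_and_load(struct toy_fpu *f)
{
	f->locked = true;
	if (f->need_load) {
		f->reg_ssp = f->buffer_ssp;
		f->need_load = false;
	}
}

/* Models fpregs_unlock(). */
static void toy_unlock(struct toy_fpu *f)
{
	f->locked = false;
}

/* Caller pattern: always modify the state in the registers, never the buffer. */
static void toy_set_ssp(struct toy_fpu *f, uint64_t ssp)
{
	toy_lock_and_load(f);
	f->reg_ssp = ssp;	/* stands in for an MSR write */
	toy_unlock(f);
}
```

The point of the pattern is that the caller never needs to know where the
state lives inside the buffer; after the forced restore, the registers are
the only copy that matters.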

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Drop "but appear to work" (Boris)

v5:
- Fix spelling error (Boris)
- Don't export fpregs_lock_and_load() (Boris)

v3:
- Rename to fpregs_lock_and_load() to match the unlocking
fpregs_unlock(). (Kees)
- Elaborate in comment about helper. (Dave)

v2:
- Drop optimization of writing directly to the buffer, and change API
accordingly.
- fpregs_lock_and_load() suggested by tglx
- Some commit log verbiage from dhansen
---
arch/x86/include/asm/fpu/api.h | 9 +++++++++
arch/x86/kernel/fpu/core.c | 18 ++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 503a577814b2..aadc6893dcaa 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -82,6 +82,15 @@ static inline void fpregs_unlock(void)
preempt_enable();
}

+/*
+ * FPU state gets lazily restored before returning to userspace. So when in the
+ * kernel, the valid FPU state may be kept in the buffer. This function will force
+ * restore all the fpu state to the registers early if needed, and lock them from
+ * being automatically saved/restored. Then FPU state can be modified safely in the
+ * registers, before unlocking with fpregs_unlock().
+ */
+void fpregs_lock_and_load(void);
+
#ifdef CONFIG_X86_DEBUG_FPU
extern void fpregs_assert_state_consistent(void);
#else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..f851558b673f 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -753,6 +753,24 @@ void switch_fpu_return(void)
}
EXPORT_SYMBOL_GPL(switch_fpu_return);

+void fpregs_lock_and_load(void)
+{
+ /*
+ * fpregs_lock() only disables preemption (mostly). So modifying state
+ * in an interrupt could screw up some in progress fpregs operation.
+ * Warn about it.
+ */
+ WARN_ON_ONCE(!irq_fpu_usable());
+ WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+ fpregs_lock();
+
+ fpregs_assert_state_consistent();
+
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpregs_restore_userregs();
+}
+
#ifdef CONFIG_X86_DEBUG_FPU
/*
* If current FPU state according to its tracking (loaded FPU context on this
--
2.17.1


2023-02-18 21:17:46

by Edgecombe, Rick P

Subject: [PATCH v6 07/41] x86: Move control protection handler to separate file

Today the control protection handler is defined in traps.c and used only
for the kernel IBT feature. To reduce ifdeffery, move it to its own file.
In future patches, functionality will be added to make this handler also
handle user shadow stack faults. So name the file cet.c.

No functional change.

Tested-by: Pengfei Xu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Split move to cet.c and shadow stack enhancements to fault handler to
separate files. (Kees)
---
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/cet.c | 76 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 75 ---------------------------------------
3 files changed, 78 insertions(+), 75 deletions(-)
create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..92446f1dedd7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -144,6 +144,8 @@ obj-$(CONFIG_CFI_CLANG) += cfi.o

obj-$(CONFIG_CALL_THUNKS) += callthunks.o

+obj-$(CONFIG_X86_CET) += cet.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..7ad22b705b64
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <asm/bugs.h>
+#include <asm/traps.h>
+
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
+
+enum cp_error_code {
+ CP_EC = (1 << 15) - 1,
+
+ CP_RET = 1,
+ CP_IRET = 2,
+ CP_ENDBR = 3,
+ CP_RSTRORSSP = 4,
+ CP_SETSSBSY = 5,
+
+ CP_ENCL = 1 << 15,
+};
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
+ pr_err("Unexpected #CP\n");
+ BUG();
+ }
+
+ if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+ return;
+
+ if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
+ regs->ax = 0;
+ return;
+ }
+
+ pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+ if (!ibt_fatal) {
+ printk(KERN_DEFAULT CUT_HERE);
+ __warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+ return;
+ }
+ BUG();
+}
+
+/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
+noinline bool ibt_selftest(void)
+{
+ unsigned long ret;
+
+ asm (" lea ibt_selftest_ip(%%rip), %%rax\n\t"
+ ANNOTATE_RETPOLINE_SAFE
+ " jmp *%%rax\n\t"
+ "ibt_selftest_ip:\n\t"
+ UNWIND_HINT_FUNC
+ ANNOTATE_NOENDBR
+ " nop\n\t"
+
+ : "=a" (ret) : : "memory");
+
+ return !ret;
+}
+
+static int __init ibt_setup(char *str)
+{
+ if (!strcmp(str, "off"))
+ setup_clear_cpu_cap(X86_FEATURE_IBT);
+
+ if (!strcmp(str, "warn"))
+ ibt_fatal = false;
+
+ return 1;
+}
+
+__setup("ibt=", ibt_setup);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..cc223e60aba2 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -213,81 +213,6 @@ DEFINE_IDTENTRY(exc_overflow)
do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
}

-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
-enum cp_error_code {
- CP_EC = (1 << 15) - 1,
-
- CP_RET = 1,
- CP_IRET = 2,
- CP_ENDBR = 3,
- CP_RSTRORSSP = 4,
- CP_SETSSBSY = 5,
-
- CP_ENCL = 1 << 15,
-};
-
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
-{
- if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
- pr_err("Unexpected #CP\n");
- BUG();
- }
-
- if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
- return;
-
- if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
- regs->ax = 0;
- return;
- }
-
- pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
- if (!ibt_fatal) {
- printk(KERN_DEFAULT CUT_HERE);
- __warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
- return;
- }
- BUG();
-}
-
-/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
-noinline bool ibt_selftest(void)
-{
- unsigned long ret;
-
- asm (" lea ibt_selftest_ip(%%rip), %%rax\n\t"
- ANNOTATE_RETPOLINE_SAFE
- " jmp *%%rax\n\t"
- "ibt_selftest_ip:\n\t"
- UNWIND_HINT_FUNC
- ANNOTATE_NOENDBR
- " nop\n\t"
-
- : "=a" (ret) : : "memory");
-
- return !ret;
-}
-
-static int __init ibt_setup(char *str)
-{
- if (!strcmp(str, "off"))
- setup_clear_cpu_cap(X86_FEATURE_IBT);
-
- if (!strcmp(str, "warn"))
- ibt_fatal = false;
-
- return 1;
-}
-
-__setup("ibt=", ibt_setup);
-
-#endif /* CONFIG_X86_KERNEL_IBT */
-
#ifdef CONFIG_X86_F00F_BUG
void handle_invalid_op(struct pt_regs *regs)
#else
--
2.17.1


2023-02-18 21:17:58

by Edgecombe, Rick P

Subject: [PATCH v6 09/41] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

From: Yu-cheng Yu <[email protected]>

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.

In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
if HW dirty tracking is not needed, because if the Dirty bit is not already
set the CPU has to set Dirty=1 when the memory gets written to. This
creates additional work for the CPU. So traditional wisdom was to simply
set the Dirty bit whenever you didn't care about it. However, it was never
really very helpful for read-only kernel memory.

When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
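
To make the ambiguity concrete, here is a small standalone sketch of the
Write=0,Dirty=1 encoding. It assumes the standard x86 PTE bit positions
(Present is bit 0, Write is bit 1, Dirty is bit 6); the SK_ and sk_ names
are hypothetical and are not kernel helpers.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_PRESENT	(UINT64_C(1) << 0)
#define SK_PAGE_RW	(UINT64_C(1) << 1)
#define SK_PAGE_DIRTY	(UINT64_C(1) << 6)

/* Write=0,Dirty=1 is the encoding shadow-stack-capable CPUs treat specially. */
static bool sk_pte_looks_like_shstk(uint64_t pte)
{
	return (pte & (SK_PAGE_RW | SK_PAGE_DIRTY)) == SK_PAGE_DIRTY;
}

/* Read-only conversion as done after this patch: clear Write and Dirty together. */
static uint64_t sk_make_ro(uint64_t pte)
{
	return pte & ~(SK_PAGE_RW | SK_PAGE_DIRTY);
}
```

Clearing only the Write bit of a dirty, writable PTE would leave it in the
shadow stack encoding; clearing both bits, as sk_make_ro() does, avoids
that.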

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
v6:
- Also remove dirty from newly added set_memory_rox()

v5:
- Spelling and grammar in commit log (Boris)

v3:
- Update commit log (Andrew Cooper, Peterz)

v2:
- Normalize PTE bit descriptions between patches
---
arch/x86/include/asm/pgtable_types.h | 6 +++---
arch/x86/mm/pat/set_memory.c | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 447d4bee25c4..0646ad00178b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -192,10 +192,10 @@ enum page_cache_mode {
#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
#define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0)
#define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC)
-#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G)
-#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0|___D| 0|___G)
+#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G)
+#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G)
#define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC)
-#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G)
+#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G)
#define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G)
#define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G)
#define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 356758b7d4b4..1b5c0dc9f32b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2073,12 +2073,12 @@ int set_memory_nx(unsigned long addr, int numpages)

int set_memory_ro(unsigned long addr, int numpages)
{
- return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+ return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
}

int set_memory_rox(unsigned long addr, int numpages)
{
- pgprot_t clr = __pgprot(_PAGE_RW);
+ pgprot_t clr = __pgprot(_PAGE_RW | _PAGE_DIRTY);

if (__supported_pte_mask & _PAGE_NX)
clr.pgprot |= _PAGE_NX;
--
2.17.1


2023-02-18 21:18:01

by Edgecombe, Rick P

Subject: [PATCH v6 08/41] x86/shstk: Add user control-protection fault handler

From: Yu-cheng Yu <[email protected]>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, one is raised when the return address used by a RET
instruction differs from the copy on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT faults. Refactor this fault handler into separate user and kernel
handlers, like the page fault handler. Add a control-protection handler
for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
is compiled in the case of either of the two CET features supported in the
kernel: kernel IBT or user mode shadow stack. Move some static inline
functions from traps.c into a header so they can be used in cet.c.

Opportunistically fix a comment in the kernel IBT part of the fault
handler that is on the end of the line instead of preceding it.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when the feature
is missing. This unifies the behavior with the new shadow stack code, and
also prevents the kernel from crashing in this situation, which is
potentially recoverable.

The control-protection fault handler works similarly to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
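
From the userspace side, a SIGSEGV handler installed with SA_SIGINFO could
tell this fault apart from other segfaults by checking si_code. The sketch
below is an assumption about how an application might do that, not part of
this series. sk_is_cp_fault() is a hypothetical helper, and the fallback
definition of SEGV_CPERR uses the value 10 added by this patch, for C
library headers that do not define it yet.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for siginfo_t */
#endif

#include <signal.h>
#include <stdbool.h>

#ifndef SEGV_CPERR
#define SEGV_CPERR 10	/* value added to siginfo.h by this patch */
#endif

/* True if the siginfo describes a control-protection fault. */
static bool sk_is_cp_fault(const siginfo_t *info)
{
	return info->si_signo == SIGSEGV && info->si_code == SEGV_CPERR;
}
```

A handler would call sk_is_cp_fault() on the siginfo_t pointer it receives
and, for example, log a shadow stack violation before terminating.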

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Kerrisk <[email protected]>

---
v6:
- Split into separate patches (Kees)
- Change to "x86/shstk" in commit log (Boris)

v5:
- Move to separate file to avoid ifdeffery (Boris)
- Improvements to commit log (Boris)
- Rename control_protection_err (Boris)
- Move comment from end of line in IBT fault handler (Boris)

v3:
- Shorten user/kernel #CP handler function names (peterz)
- Restore CP_ENDBR check to kernel handler (peterz)
- Utilize CONFIG_X86_CET (Kees)
- Unify "unexpected" warnings (Andrew Cooper)
- Use 2d array for error code chars (Andrew Cooper)
- Add comment about why to read SSP MSR before enabling interrupts

v2:
- Integrate with kernel IBT fault handler
- Update printed messages. (Dave)
- Remove array_index_nospec() usage. (Dave)
- Remove IBT messages. (Dave)
- Add enclave error code bit processing in case it can get triggered
somehow.
- Add extra "unknown" in control_protection_err.
---
arch/arm/kernel/signal.c | 2 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/include/asm/traps.h | 12 +++
arch/x86/kernel/cet.c | 94 +++++++++++++++++++++---
arch/x86/kernel/idt.c | 2 +-
arch/x86/kernel/signal_32.c | 2 +-
arch/x86/kernel/signal_64.c | 2 +-
arch/x86/kernel/traps.c | 12 ---
arch/x86/xen/enlighten_pv.c | 2 +-
arch/x86/xen/xen-asm.S | 2 +-
include/uapi/asm-generic/siginfo.h | 3 +-
16 files changed, 117 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index be279fd48248..4bced22213d5 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1176,7 +1176,7 @@ void __init minsigstksz_setup(void)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 505f78ddca82..652e366b68a0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
#define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31))
#endif

+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT 0
+#else
+#define DISABLE_IBT (1 << (X86_FEATURE_IBT & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -128,7 +134,7 @@
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
DISABLE_ENQCMD)
#define DISABLED_MASK17 0
-#define DISABLED_MASK18 0
+#define DISABLED_MASK18 (DISABLE_IBT)
#define DISABLED_MASK19 0
#define DISABLED_MASK20 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF, xenpv_exc_double_fault);
#endif

/* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
#endif

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..75e0dabf0c45 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
struct stack_info *info);
#endif

+static inline void cond_local_irq_enable(struct pt_regs *regs)
+{
+ if (regs->flags & X86_EFLAGS_IF)
+ local_irq_enable();
+}
+
+static inline void cond_local_irq_disable(struct pt_regs *regs)
+{
+ if (regs->flags & X86_EFLAGS_IF)
+ local_irq_disable();
+}
+
#endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 7ad22b705b64..33d7d119be26 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -4,10 +4,6 @@
#include <asm/bugs.h>
#include <asm/traps.h>

-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
enum cp_error_code {
CP_EC = (1 << 15) - 1,

@@ -20,15 +16,80 @@ enum cp_error_code {
CP_ENCL = 1 << 15,
};

-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char cp_err[][10] = {
+ [0] = "unknown",
+ [1] = "near ret",
+ [2] = "far/iret",
+ [3] = "endbranch",
+ [4] = "rstorssp",
+ [5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+ unsigned int cpec = error_code & CP_EC;
+
+ if (cpec >= ARRAY_SIZE(cp_err))
+ cpec = 0;
+ return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+ WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+ user_mode(regs) ? "user mode" : "kernel mode",
+ cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
{
- if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
- pr_err("Unexpected #CP\n");
- BUG();
+ struct task_struct *tsk;
+ unsigned long ssp;
+
+ /*
+ * An exception was just taken from userspace. Since interrupts are disabled
+ * here, no scheduling should have messed with the registers yet and they
+ * will be whatever is live in userspace. So read the SSP before enabling
+ * interrupts so locking the fpregs to do it later is not required.
+ */
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+ cond_local_irq_enable(regs);
+
+ tsk = current;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_CP;
+
+ /* Ratelimit to prevent log spamming. */
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ __ratelimit(&cpf_rate)) {
+ pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+ tsk->comm, task_pid_nr(tsk),
+ regs->ip, regs->sp, ssp, error_code,
+ cp_err_string(error_code),
+ error_code & CP_ENCL ? " in enclave" : "");
+ print_vma_addr(KERN_CONT " in ", regs->ip);
+ pr_cont("\n");
}

- if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+ force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+ cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/* code label defined in asm below */
+extern void ibt_selftest_ip(void);
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ if ((error_code & CP_EC) != CP_ENDBR) {
+ do_unexpected_cp(regs, error_code);
return;
+ }

if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
regs->ax = 0;
@@ -74,3 +135,18 @@ static int __init ibt_setup(char *str)
}

__setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+ if (user_mode(regs)) {
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ do_user_cp_fault(regs, error_code);
+ else
+ do_unexpected_cp(regs, error_code);
+ } else {
+ if (cpu_feature_enabled(X86_FEATURE_IBT))
+ do_kernel_cp_fault(regs, error_code);
+ else
+ do_unexpected_cp(regs, error_code);
+ }
+}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
ISTG(X86_TRAP_MC, asm_exc_machine_check, IST_INDEX_MCE),
#endif

-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
INTG(X86_TRAP_CP, asm_exc_control_protection),
#endif

diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..c12624bc82a3 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 13a1e6083837..0e808c72bf7e 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cc223e60aba2..18fb9d620824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -77,18 +77,6 @@

DECLARE_BITMAP(system_vectors, NR_VECTORS);

-static inline void cond_local_irq_enable(struct pt_regs *regs)
-{
- if (regs->flags & X86_EFLAGS_IF)
- local_irq_enable();
-}
-
-static inline void cond_local_irq_disable(struct pt_regs *regs)
-{
- if (regs->flags & X86_EFLAGS_IF)
- local_irq_disable();
-}
-
__always_inline int is_valid_bugaddr(unsigned long addr)
{
if (addr < TASK_SIZE_MAX)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index bb59cc6ddb2d..9c29cd5393cc 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] = {
TRAP_ENTRY(exc_coprocessor_error, false ),
TRAP_ENTRY(exc_alignment_check, false ),
TRAP_ENTRY(exc_simd_coprocessor_error, false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
TRAP_ENTRY(exc_control_protection, false ),
#endif
};
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
xen_pv_trap asm_exc_spurious_interrupt_bug
xen_pv_trap asm_exc_coprocessor_error
xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
xen_pv_trap asm_exc_control_protection
#endif
#ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
#define SEGV_ADIPERR 7 /* Precise MCD exception */
#define SEGV_MTEAERR 8 /* Asynchronous ARM MTE error */
#define SEGV_MTESERR 9 /* Synchronous ARM MTE exception */
-#define NSIGSEGV 9
+#define SEGV_CPERR 10 /* Control protection fault */
+#define NSIGSEGV 10

/*
* SIGBUS si_codes
--
2.17.1


2023-02-18 21:18:32

by Edgecombe, Rick P

Subject: [PATCH v6 10/41] x86/mm: Move pmd_write(), pud_write() up in the file

From: Yu-cheng Yu <[email protected]>

To prepare for the introduction of _PAGE_SAVED_DIRTY, move pmd_write() and
pud_write() up in the file so that they can be used by other helpers
below. No functional changes.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0564edd24ffb..b39f16c0d507 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -160,6 +160,18 @@ static inline int pte_write(pte_t pte)
return pte_flags(pte) & _PAGE_RW;
}

+#define pmd_write pmd_write
+static inline int pmd_write(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+ return pud_flags(pud) & _PAGE_RW;
+}
+
static inline int pte_huge(pte_t pte)
{
return pte_flags(pte) & _PAGE_PSE;
@@ -1120,12 +1132,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);


-#define pmd_write pmd_write
-static inline int pmd_write(pmd_t pmd)
-{
- return pmd_flags(pmd) & _PAGE_RW;
-}
-
#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp)
@@ -1155,12 +1161,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

-#define pud_write pud_write
-static inline int pud_write(pud_t pud)
-{
- return pud_flags(pud) & _PAGE_RW;
-}
-
#ifndef pmdp_establish
#define pmdp_establish pmdp_establish
static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
--
2.17.1


2023-02-18 21:18:52

by Edgecombe, Rick P

Subject: [PATCH v6 11/41] mm: Introduce pte_mkwrite_kernel()

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

One of these changes is to allow pte_mkwrite() to create different
types of writable memory (the existing conventionally writable type and
also the new shadow stack type). Future patches will convert pte_mkwrite()
to take a VMA in order to facilitate this. However, there are places in the
kernel where pte_mkwrite() is called outside of the context of a VMA.
These are for kernel memory. So create a new variant called
pte_mkwrite_kernel() and switch the kernel users over to it. Have
pte_mkwrite() and pte_mkwrite_kernel() be the same for now. Future patches
will introduce changes to make pte_mkwrite() take a VMA.

Only do this for architectures that need it because they call pte_mkwrite()
in arch code without an associated VMA. Since it will currently only be
used in arch code, do not include it in arch_pgtable_helpers.rst.
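
The split described above can be sketched as a small user-space model. This
is only an illustration: pte_t here is a simplified stand-in, and
MODEL_PAGE_RW is a hypothetical bit value, not the real definition from any
architecture. It shows the state after this patch, where both helpers still
behave identically.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's pte_t (assumption for this sketch). */
typedef struct { uint64_t val; } pte_t;

/* Hypothetical "writable" bit; real values are per-architecture. */
#define MODEL_PAGE_RW (1ULL << 1)

/* Variant for kernel memory: callers have no VMA to pass. */
static inline pte_t pte_mkwrite_kernel(pte_t pte)
{
	pte.val |= MODEL_PAGE_RW;
	return pte;
}

/*
 * Variant for core mm callers. For now it simply wraps the kernel
 * variant; later patches change only this signature to take a VMA.
 */
static inline pte_t pte_mkwrite(pte_t pte)
{
	return pte_mkwrite_kernel(pte);
}
```

Keeping the wrapper trivial at this stage means the later signature change
touches pte_mkwrite() only, while VMA-less kernel callers stay untouched.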

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Tested-by: Pengfei Xu <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
Hi Non-x86 Archs,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push shadow stack memory details inside arch/x86.

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/[email protected]/#t

v6:
- New patch
---
arch/arm64/include/asm/pgtable.h | 7 ++++++-
arch/arm64/mm/trans_pgd.c | 4 ++--
arch/s390/include/asm/pgtable.h | 7 ++++++-
arch/s390/mm/pageattr.c | 2 +-
arch/x86/include/asm/pgtable.h | 7 ++++++-
arch/x86/xen/mmu_pv.c | 2 +-
6 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 65e78999c75d..ed555f947697 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -180,13 +180,18 @@ static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t prot)
return pmd;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_kernel(pte_t pte)
{
pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));
return pte;
}

+static inline pte_t pte_mkwrite(pte_t pte)
+{
+ return pte_mkwrite_kernel(pte);
+}
+
static inline pte_t pte_mkclean(pte_t pte)
{
pte = clear_pte_bit(pte, __pgprot(PTE_DIRTY));
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 4ea2eefbc053..5c07e68d80ea 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -40,7 +40,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
* read only (code, rodata). Clear the RDONLY bit from
* the temporary mappings we use during restore.
*/
- set_pte(dst_ptep, pte_mkwrite(pte));
+ set_pte(dst_ptep, pte_mkwrite_kernel(pte));
} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
/*
* debug_pagealloc will removed the PTE_VALID bit if
@@ -53,7 +53,7 @@ static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
*/
BUG_ON(!pfn_valid(pte_pfn(pte)));

- set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
+ set_pte(dst_ptep, pte_mkpresent(pte_mkwrite_kernel(pte)));
}
}

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index b26cbf1c533c..29522418b5f4 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -991,7 +991,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
return set_pte_bit(pte, __pgprot(_PAGE_PROTECT));
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_kernel(pte_t pte)
{
pte = set_pte_bit(pte, __pgprot(_PAGE_WRITE));
if (pte_val(pte) & _PAGE_DIRTY)
@@ -999,6 +999,11 @@ static inline pte_t pte_mkwrite(pte_t pte)
return pte;
}

+static inline pte_t pte_mkwrite(pte_t pte)
+{
+ return pte_mkwrite_kernel(pte);
+}
+
static inline pte_t pte_mkclean(pte_t pte)
{
pte = clear_pte_bit(pte, __pgprot(_PAGE_DIRTY));
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 85195c18b2e8..4ee5fe5caa23 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -96,7 +96,7 @@ static int walk_pte_level(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (flags & SET_MEMORY_RO)
new = pte_wrprotect(new);
else if (flags & SET_MEMORY_RW)
- new = pte_mkwrite(pte_mkdirty(new));
+ new = pte_mkwrite_kernel(pte_mkdirty(new));
if (flags & SET_MEMORY_NX)
new = set_pte_bit(new, __pgprot(_PAGE_NOEXEC));
else if (flags & SET_MEMORY_X)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b39f16c0d507..4f9fddcff2b9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -364,11 +364,16 @@ static inline pte_t pte_mkyoung(pte_t pte)
return pte_set_flags(pte, _PAGE_ACCESSED);
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite_kernel(pte_t pte)
{
return pte_set_flags(pte, _PAGE_RW);
}

+static inline pte_t pte_mkwrite(pte_t pte)
+{
+ return pte_mkwrite_kernel(pte);
+}
+
static inline pte_t pte_mkhuge(pte_t pte)
{
return pte_set_flags(pte, _PAGE_PSE);
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index ee29fb558f2e..a23f04243c19 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -150,7 +150,7 @@ void make_lowmem_page_readwrite(void *vaddr)
if (pte == NULL)
return; /* vaddr missing */

- ptev = pte_mkwrite(*pte);
+ ptev = pte_mkwrite_kernel(*pte);

if (HYPERVISOR_update_va_mapping(address, ptev, 0))
BUG();
--
2.17.1


2023-02-18 21:18:59

by Edgecombe, Rick P

Subject: [PATCH v6 12/41] s390/mm: Introduce pmd_mkwrite_kernel()

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

One of these changes is to allow pmd_mkwrite() to create different
types of writable memory (the existing conventionally writable type and
also the new shadow stack type). Future patches will convert pmd_mkwrite()
to take a VMA in order to facilitate this. However, there are places in the
kernel where pmd_mkwrite() is called outside of the context of a VMA.
These are for kernel memory. So create a new variant called
pmd_mkwrite_kernel() and switch the kernel users over to it. Have
pmd_mkwrite() and pmd_mkwrite_kernel() be the same for now. Future patches
will introduce changes to make pmd_mkwrite() take a VMA.

Only do this for architectures that need it because they call pmd_mkwrite()
in arch code without an associated VMA. Since it will currently only be
used in arch code, do not include it in arch_pgtable_helpers.rst.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Tested-by: Pengfei Xu <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
Hi Non-x86 Archs,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push shadow stack memory details inside arch/x86.

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/[email protected]/#t

v6:
- New patch
---
arch/s390/include/asm/pgtable.h | 7 ++++++-
arch/s390/mm/pageattr.c | 2 +-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 29522418b5f4..c48a447d1432 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1425,7 +1425,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
return set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_PROTECT));
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite_kernel(pmd_t pmd)
{
pmd = set_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_WRITE));
if (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY)
@@ -1433,6 +1433,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
return pmd;
}

+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+ return pmd_mkwrite_kernel(pmd);
+}
+
static inline pmd_t pmd_mkclean(pmd_t pmd)
{
pmd = clear_pmd_bit(pmd, __pgprot(_SEGMENT_ENTRY_DIRTY));
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 4ee5fe5caa23..7b6967dfacd0 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -146,7 +146,7 @@ static void modify_pmd_page(pmd_t *pmdp, unsigned long addr,
if (flags & SET_MEMORY_RO)
new = pmd_wrprotect(new);
else if (flags & SET_MEMORY_RW)
- new = pmd_mkwrite(pmd_mkdirty(new));
+ new = pmd_mkwrite_kernel(pmd_mkdirty(new));
if (flags & SET_MEMORY_NX)
new = set_pmd_bit(new, __pgprot(_SEGMENT_ENTRY_NOEXEC));
else if (flags & SET_MEMORY_X)
--
2.17.1


2023-02-18 21:19:11

by Edgecombe, Rick P

Subject: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

One of these unusual properties is that shadow stack memory is writable,
but only in limited ways. These limits are applied via a specific PTE
bit combination. Nevertheless, the memory is writable, and core mm code
will need to apply the writable permissions in the typical paths that
call pte_mkwrite().

In addition to VM_WRITE, the shadow stack VMAs will have a flag denoting
that they are a special shadow stack flavor of writable memory. So make
pte_mkwrite() take a VMA, so that the x86 implementation of it can know
whether to create regular writable memory or shadow stack memory.

Apply the same changes for pmd_mkwrite() and huge_pte_mkwrite().

No functional change.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Michal Simek <[email protected]>
Cc: Dinh Nguyen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Tested-by: Pengfei Xu <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
Hi Non-x86 Archs,

x86 has a feature that allows for the creation of a special type of
writable memory (shadow stack) that is only writable in limited specific
ways. Previously, changes were proposed to core MM code to teach it to
decide when to create normally writable memory or the special shadow stack
writable memory, but David Hildenbrand suggested[0] to change
pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
moved into x86 code.

Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
changes. So that is why you are seeing some patches out of a big x86
series pop up in your arch mailing list. There is no functional change.
After this refactor, the shadow stack series goes on to use the arch
helpers to push shadow stack memory details inside arch/x86.
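
As a rough illustration of where this is heading, an arch implementation
could use the VMA roughly as in the user-space model below. This is a sketch
under stated assumptions, not code from this series: the types are simplified
stand-ins, every MODEL_* value is hypothetical, model_pte_mkwrite_shstk() is
an invented helper name, and the "write clear, dirty set" encoding is assumed
here purely to show a distinct bit combination. In this patch itself the vma
argument is accepted but ignored.

```c
#include <stdint.h>

/* Simplified stand-ins for kernel types (assumptions for this sketch). */
typedef struct { uint64_t val; } pte_t;
struct vm_area_struct { unsigned long vm_flags; };

/* Hypothetical bit values, chosen only for the model. */
#define MODEL_PAGE_RW         (1ULL << 1)
#define MODEL_PAGE_DIRTY      (1ULL << 6)
#define MODEL_VM_WRITE        0x2UL
#define MODEL_VM_SHADOW_STACK 0x100UL

/* Conventionally writable memory. */
static inline pte_t pte_mkwrite_kernel(pte_t pte)
{
	pte.val |= MODEL_PAGE_RW;
	return pte;
}

/* Hypothetical helper: shadow stack flavor, assumed write=0 and dirty=1. */
static inline pte_t model_pte_mkwrite_shstk(pte_t pte)
{
	pte.val &= ~MODEL_PAGE_RW;
	pte.val |= MODEL_PAGE_DIRTY;
	return pte;
}

/* The VMA, not the caller, decides which kind of writable PTE results. */
static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (vma->vm_flags & MODEL_VM_SHADOW_STACK)
		return model_pte_mkwrite_shstk(pte);
	return pte_mkwrite_kernel(pte);
}
```

With this shape, core mm callers pass the VMA they already have and need no
knowledge of shadow stacks; architectures without the feature ignore the
argument.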

Testing was just 0-day build testing.

Hopefully that is enough context. Thanks!

[0] https://lore.kernel.org/lkml/[email protected]/#t

v6:
- New patch
---
Documentation/mm/arch_pgtable_helpers.rst | 9 ++++++---
arch/alpha/include/asm/pgtable.h | 6 +++++-
arch/arc/include/asm/hugepage.h | 2 +-
arch/arc/include/asm/pgtable-bits-arcv2.h | 7 ++++++-
arch/arm/include/asm/pgtable-3level.h | 7 ++++++-
arch/arm/include/asm/pgtable.h | 2 +-
arch/arm64/include/asm/pgtable.h | 4 ++--
arch/csky/include/asm/pgtable.h | 2 +-
arch/hexagon/include/asm/pgtable.h | 2 +-
arch/ia64/include/asm/pgtable.h | 2 +-
arch/loongarch/include/asm/pgtable.h | 4 ++--
arch/m68k/include/asm/mcf_pgtable.h | 2 +-
arch/m68k/include/asm/motorola_pgtable.h | 6 +++++-
arch/m68k/include/asm/sun3_pgtable.h | 6 +++++-
arch/microblaze/include/asm/pgtable.h | 2 +-
arch/mips/include/asm/pgtable.h | 6 +++---
arch/nios2/include/asm/pgtable.h | 2 +-
arch/openrisc/include/asm/pgtable.h | 2 +-
arch/parisc/include/asm/pgtable.h | 6 +++++-
arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++--
arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +-
arch/riscv/include/asm/pgtable.h | 6 +++---
arch/s390/include/asm/hugetlb.h | 4 ++--
arch/s390/include/asm/pgtable.h | 4 ++--
arch/sh/include/asm/pgtable_32.h | 10 ++++++++--
arch/sparc/include/asm/pgtable_32.h | 2 +-
arch/sparc/include/asm/pgtable_64.h | 6 +++---
arch/um/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/pgtable.h | 6 ++++--
arch/xtensa/include/asm/pgtable.h | 2 +-
include/asm-generic/hugetlb.h | 4 ++--
include/linux/mm.h | 2 +-
mm/debug_vm_pgtable.c | 16 ++++++++--------
mm/huge_memory.c | 6 +++---
mm/hugetlb.c | 4 ++--
mm/memory.c | 4 ++--
mm/migrate_device.c | 2 +-
mm/mprotect.c | 2 +-
mm/userfaultfd.c | 2 +-
42 files changed, 106 insertions(+), 69 deletions(-)

diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index fd2a19df884e..f119d16915d4 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -48,7 +48,8 @@ PTE Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_mkclean | Creates a clean PTE |
+---------------------------+--------------------------------------------------+
-| pte_mkwrite | Creates a writable PTE |
+| pte_mkwrite | Creates a writable PTE of the type specified by |
+| | the VMA. |
+---------------------------+--------------------------------------------------+
| pte_wrprotect | Creates a write protected PTE |
+---------------------------+--------------------------------------------------+
@@ -120,7 +121,8 @@ PMD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pmd_mkclean | Creates a clean PMD |
+---------------------------+--------------------------------------------------+
-| pmd_mkwrite | Creates a writable PMD |
+| pmd_mkwrite | Creates a writable PMD of the type specified by |
+| | the VMA. |
+---------------------------+--------------------------------------------------+
| pmd_wrprotect | Creates a write protected PMD |
+---------------------------+--------------------------------------------------+
@@ -224,7 +226,8 @@ HugeTLB Page Table Helpers
+---------------------------+--------------------------------------------------+
| huge_pte_mkdirty | Creates a dirty HugeTLB |
+---------------------------+--------------------------------------------------+
-| huge_pte_mkwrite | Creates a writable HugeTLB |
+| huge_pte_mkwrite | Creates a writable HugeTLB of the type specified |
+| | by the VMA. |
+---------------------------+--------------------------------------------------+
| huge_pte_wrprotect | Creates a write protected HugeTLB |
+---------------------------+--------------------------------------------------+
diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index 9e45f6735d5d..39da4baa4f2d 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -253,9 +253,13 @@ extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;
extern inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_FOW; return pte; }
extern inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~(__DIRTY_BITS); return pte; }
extern inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~(__ACCESS_BITS); return pte; }
-extern inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &= ~_PAGE_FOW; return pte; }
extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; }
extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; }
+extern inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ pte_val(pte) &= ~_PAGE_FOW;
+ return pte;
+}

/*
* The smp_rmb() in the following functions are required to order the load of
diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index 5001b796fb8d..223a96967188 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -21,7 +21,7 @@ static inline pmd_t pte_pmd(pte_t pte)
}

#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite(pmd, vma) pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
diff --git a/arch/arc/include/asm/pgtable-bits-arcv2.h b/arch/arc/include/asm/pgtable-bits-arcv2.h
index 515e82db519f..cbbbcbd0a780 100644
--- a/arch/arc/include/asm/pgtable-bits-arcv2.h
+++ b/arch/arc/include/asm/pgtable-bits-arcv2.h
@@ -84,7 +84,6 @@

PTE_BIT_FUNC(mknotpresent, &= ~(_PAGE_PRESENT));
PTE_BIT_FUNC(wrprotect, &= ~(_PAGE_WRITE));
-PTE_BIT_FUNC(mkwrite, |= (_PAGE_WRITE));
PTE_BIT_FUNC(mkclean, &= ~(_PAGE_DIRTY));
PTE_BIT_FUNC(mkdirty, |= (_PAGE_DIRTY));
PTE_BIT_FUNC(mkold, &= ~(_PAGE_ACCESSED));
@@ -92,6 +91,12 @@ PTE_BIT_FUNC(mkyoung, |= (_PAGE_ACCESSED));
PTE_BIT_FUNC(mkspecial, |= (_PAGE_SPECIAL));
PTE_BIT_FUNC(mkhuge, |= (_PAGE_HW_SZ));

+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ pte_val(pte) |= (_PAGE_WRITE);
+ return pte;
+}
+
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index eabe72ff7381..c25024455445 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -199,11 +199,16 @@ static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) op; return pmd; }

PMD_BIT_FUNC(wrprotect, |= L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
-PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);

+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ pmd_val(pmd) &= ~L_PMD_SECT_RDONLY;
+ return pmd;
+}
+
#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))

#define pmd_pfn(pmd) (((pmd_val(pmd) & PMD_MASK) & PHYS_MASK) >> PAGE_SHIFT)
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index f049072b2e85..dcefac6b1edc 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -227,7 +227,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
return set_pte_bit(pte, __pgprot(L_PTE_RDONLY));
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return clear_pte_bit(pte, __pgprot(L_PTE_RDONLY));
}
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ed555f947697..67748142eb8c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -187,7 +187,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return pte_mkwrite_kernel(pte);
}
@@ -489,7 +489,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
#define pmd_cont(pmd) pte_cont(pmd_pte(pmd))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite(pmd, vma) pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))
#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index 77bc6caff2d2..559bed7c9cd0 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -176,7 +176,7 @@ static inline pte_t pte_mkold(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte_val(pte) |= _PAGE_WRITE;
if (pte_val(pte) & _PAGE_MODIFIED)
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index f7048c18b6f9..8a317f41e04f 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -297,7 +297,7 @@ static inline pte_t pte_wrprotect(pte_t pte)
}

/* pte_mkwrite - mark page as writable */
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte_val(pte) |= _PAGE_WRITE;
return pte;
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 01517a5e6778..70724658ddee 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -265,7 +265,7 @@ ia64_phys_addr_valid (unsigned long addr)
* access rights:
*/
#define pte_wrprotect(pte) (__pte(pte_val(pte) & ~_PAGE_AR_RW))
-#define pte_mkwrite(pte) (__pte(pte_val(pte) | _PAGE_AR_RW))
+#define pte_mkwrite(pte, vma) (__pte(pte_val(pte) | _PAGE_AR_RW))
#define pte_mkold(pte) (__pte(pte_val(pte) & ~_PAGE_A))
#define pte_mkyoung(pte) (__pte(pte_val(pte) | _PAGE_A))
#define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 7a34e900d8c1..8cc4292357aa 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -360,7 +360,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte_val(pte) |= _PAGE_WRITE;
if (pte_val(pte) & _PAGE_MODIFIED)
@@ -460,7 +460,7 @@ static inline int pmd_write(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_WRITE);
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
pmd_val(pmd) |= _PAGE_WRITE;
if (pmd_val(pmd) & _PAGE_MODIFIED)
diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h
index b619b22823f8..ea1531b60645 100644
--- a/arch/m68k/include/asm/mcf_pgtable.h
+++ b/arch/m68k/include/asm/mcf_pgtable.h
@@ -208,7 +208,7 @@ static inline pte_t pte_mkold(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte_val(pte) |= CF_PAGE_WRITABLE;
return pte;
diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h
index 7ac3d64c6b33..58daa34902a0 100644
--- a/arch/m68k/include/asm/motorola_pgtable.h
+++ b/arch/m68k/include/asm/motorola_pgtable.h
@@ -152,7 +152,6 @@ static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) |= _PAGE_RONLY; return pte; }
static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~_PAGE_DIRTY; return pte; }
static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) &= ~_PAGE_RONLY; return pte; }
static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mknocache(pte_t pte)
@@ -165,6 +164,11 @@ static inline pte_t pte_mkcache(pte_t pte)
pte_val(pte) = (pte_val(pte) & _CACHEMASK040) | m68k_supervisor_cachemode;
return pte;
}
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ pte_val(pte) &= ~_PAGE_RONLY;
+ return pte;
+}

#define swapper_pg_dir kernel_pg_dir
extern pgd_t kernel_pg_dir[128];
diff --git a/arch/m68k/include/asm/sun3_pgtable.h b/arch/m68k/include/asm/sun3_pgtable.h
index 90d57e537eb1..89ce30bff5a1 100644
--- a/arch/m68k/include/asm/sun3_pgtable.h
+++ b/arch/m68k/include/asm/sun3_pgtable.h
@@ -140,10 +140,14 @@ static inline int pte_young(pte_t pte) { return pte_val(pte) & SUN3_PAGE_ACCESS
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_WRITEABLE; return pte; }
static inline pte_t pte_mkclean(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_MODIFIED; return pte; }
static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~SUN3_PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |= SUN3_PAGE_WRITEABLE; return pte; }
static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= SUN3_PAGE_MODIFIED; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= SUN3_PAGE_ACCESSED; return pte; }
static inline pte_t pte_mknocache(pte_t pte) { pte_val(pte) |= SUN3_PAGE_NOCACHE; return pte; }
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ pte_val(pte) |= SUN3_PAGE_WRITEABLE;
+ return pte;
+}
// use this version when caches work...
//static inline pte_t pte_mkcache(pte_t pte) { pte_val(pte) &= SUN3_PAGE_NOCACHE; return pte; }
// until then, use:
diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h
index 42f5988e998b..7a696c2e326c 100644
--- a/arch/microblaze/include/asm/pgtable.h
+++ b/arch/microblaze/include/asm/pgtable.h
@@ -263,7 +263,7 @@ static inline pte_t pte_mkread(pte_t pte) \
{ pte_val(pte) |= _PAGE_USER; return pte; }
static inline pte_t pte_mkexec(pte_t pte) \
{ pte_val(pte) |= _PAGE_USER | _PAGE_EXEC; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) \
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma) \
{ pte_val(pte) |= _PAGE_RW; return pte; }
static inline pte_t pte_mkdirty(pte_t pte) \
{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index a68c0b01d8cd..e92232739bb8 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -309,7 +309,7 @@ static inline pte_t pte_mkold(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte.pte_low |= _PAGE_WRITE;
if (pte.pte_low & _PAGE_MODIFIED) {
@@ -364,7 +364,7 @@ static inline pte_t pte_mkold(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte_val(pte) |= _PAGE_WRITE;
if (pte_val(pte) & _PAGE_MODIFIED)
@@ -591,7 +591,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
return pmd;
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
pmd_val(pmd) |= _PAGE_WRITE;
if (pmd_val(pmd) & _PAGE_MODIFIED)
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index ab793bc517f5..20cbf857099a 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -129,7 +129,7 @@ static inline pte_t pte_mkold(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte_val(pte) |= _PAGE_WRITE;
return pte;
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 6477c17b3062..7d1d0a3936eb 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -247,7 +247,7 @@ static inline pte_t pte_mkold(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
pte_val(pte) |= _PAGE_WRITE;
return pte;
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index ea357430aafe..9e4344a976a7 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -328,8 +328,12 @@ static inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~_PAGE_ACCESSED; retu
static inline pte_t pte_wrprotect(pte_t pte) { pte_val(pte) &= ~_PAGE_WRITE; return pte; }
static inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte) { pte_val(pte) |= _PAGE_WRITE; return pte; }
static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; return pte; }
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ pte_val(pte) |= _PAGE_WRITE;
+ return pte;
+}

/*
* Huge pte definitions.
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 75823f39e042..fc4b3a55482b 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -471,7 +471,7 @@ static inline pte_t pte_mkpte(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return __pte(pte_val(pte) | _PAGE_RW);
}
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index cb4c67bf45d7..4bfa37cbf5d0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -600,7 +600,7 @@ static inline pte_t pte_mkexec(pte_t pte)
return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC));
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
/*
* write implies read, hence set both
@@ -1072,7 +1072,7 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
-#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkwrite(pmd, vma) pte_pmd(pte_mkwrite(pmd_pte(pmd), (vma)))

#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
#define pmd_soft_dirty(pmd) pte_soft_dirty(pmd_pte(pmd))
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 70edad44dff6..20ce92108f6c 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -171,7 +171,7 @@ void unmap_kernel_page(unsigned long va);
do { pte_update(mm, addr, ptep, ~0, 0, 0); } while (0)

#ifndef pte_mkwrite
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return __pte(pte_val(pte) | _PAGE_RW);
}
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 1a89ebdc3acc..f32450eb270a 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -101,7 +101,7 @@ static inline int pte_write(pte_t pte)

#define pte_write pte_write

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return __pte(pte_val(pte) & ~_PAGE_RO);
}
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 879e9a6e5a87..526868817df0 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -85,7 +85,7 @@
#ifndef __ASSEMBLY__
/* pte_clear moved to later in this file */

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return __pte(pte_val(pte) | _PAGE_RW);
}
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 3e01f4f3ab08..7eb7dd558319 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -338,7 +338,7 @@ static inline pte_t pte_wrprotect(pte_t pte)

/* static inline pte_t pte_mkread(pte_t pte) */

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return __pte(pte_val(pte) | _PAGE_WRITE);
}
@@ -624,9 +624,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
- return pte_pmd(pte_mkwrite(pmd_pte(pmd)));
+ return pte_pmd(pte_mkwrite(pmd_pte(pmd), vma));
}

static inline pmd_t pmd_wrprotect(pmd_t pmd)
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index ccdbccfde148..558f7eef9c4d 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -102,9 +102,9 @@ static inline int huge_pte_dirty(pte_t pte)
return pte_dirty(pte);
}

-static inline pte_t huge_pte_mkwrite(pte_t pte)
+static inline pte_t huge_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
- return pte_mkwrite(pte);
+ return pte_mkwrite(pte, vma);
}

static inline pte_t huge_pte_mkdirty(pte_t pte)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index c48a447d1432..edf290429050 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -999,7 +999,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
return pte;
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return pte_mkwrite_kernel(pte);
}
@@ -1433,7 +1433,7 @@ static inline pmd_t pmd_mkwrite_kernel(pmd_t pmd)
return pmd;
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
return pmd_mkwrite_kernel(pmd);
}
diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
index d0240decacca..3ddf7e13c6aa 100644
--- a/arch/sh/include/asm/pgtable_32.h
+++ b/arch/sh/include/asm/pgtable_32.h
@@ -351,6 +351,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)

#define PTE_BIT_FUNC(h,fn,op) \
static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h op; return pte; }
+#define PTE_BIT_FUNC_VMA(h,fn,op) \
+static inline pte_t pte_##fn(pte_t pte, struct vm_area_struct *vma) \
+{ \
+ pte.pte_##h op; \
+ return pte; \
+}

#ifdef CONFIG_X2TLB
/*
@@ -359,11 +365,11 @@ static inline pte_t pte_##fn(pte_t pte) { pte.pte_##h op; return pte; }
* kernel permissions), we attempt to couple them a bit more sanely here.
*/
PTE_BIT_FUNC(high, wrprotect, &= ~(_PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE));
-PTE_BIT_FUNC(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
+PTE_BIT_FUNC_VMA(high, mkwrite, |= _PAGE_EXT_USER_WRITE | _PAGE_EXT_KERN_WRITE);
PTE_BIT_FUNC(high, mkhuge, |= _PAGE_SZHUGE);
#else
PTE_BIT_FUNC(low, wrprotect, &= ~_PAGE_RW);
-PTE_BIT_FUNC(low, mkwrite, |= _PAGE_RW);
+PTE_BIT_FUNC_VMA(low, mkwrite, |= _PAGE_RW);
PTE_BIT_FUNC(low, mkhuge, |= _PAGE_SZHUGE);
#endif

diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 5acc05b572e6..ba829881c904 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -241,7 +241,7 @@ static inline pte_t pte_mkold(pte_t pte)
return __pte(pte_val(pte) & ~SRMMU_REF);
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return __pte(pte_val(pte) | SRMMU_WRITE);
}
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 3bc9736bddb1..71df58562138 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -463,7 +463,7 @@ static inline pte_t pte_mkclean(pte_t pte)
return __pte(val);
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
unsigned long val = pte_val(pte), mask;

@@ -753,11 +753,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return __pmd(pte_val(pte));
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
pte_t pte = __pte(pmd_val(pmd));

- pte = pte_mkwrite(pte);
+ pte = pte_mkwrite(pte, vma);

return __pmd(pte_val(pte));
}
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index 4e3052f2671a..0f9554ab6b6f 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -204,7 +204,7 @@ static inline pte_t pte_mkyoung(pte_t pte)
return(pte);
}

-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
if (unlikely(pte_get_bits(pte, _PAGE_RW)))
return pte;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 4f9fddcff2b9..2b423d697490 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -369,7 +369,9 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)
return pte_set_flags(pte, _PAGE_RW);
}

-static inline pte_t pte_mkwrite(pte_t pte)
+struct vm_area_struct;
+
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
return pte_mkwrite_kernel(pte);
}
@@ -470,7 +472,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return pmd_set_flags(pmd, _PAGE_ACCESSED);
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd)
+static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
return pmd_set_flags(pmd, _PAGE_RW);
}
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index 5b5484d707b2..13f05b0d38a3 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -258,7 +258,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
{ pte_val(pte) |= _PAGE_DIRTY; return pte; }
static inline pte_t pte_mkyoung(pte_t pte)
{ pte_val(pte) |= _PAGE_ACCESSED; return pte; }
-static inline pte_t pte_mkwrite(pte_t pte)
+static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{ pte_val(pte) |= _PAGE_WRITABLE; return pte; }

#define pgprot_noncached(prot) \
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index a57d667addd2..45e8c4a98cf4 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -20,9 +20,9 @@ static inline unsigned long huge_pte_dirty(pte_t pte)
return pte_dirty(pte);
}

-static inline pte_t huge_pte_mkwrite(pte_t pte)
+static inline pte_t huge_pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
- return pte_mkwrite(pte);
+ return pte_mkwrite(pte, vma);
}

static inline pte_t huge_pte_mkdirty(pte_t pte)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7afc86d50442..4650e2580d60 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1102,7 +1102,7 @@ void free_compound_page(struct page *page);
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
- pte = pte_mkwrite(pte);
+ pte = pte_mkwrite(pte, vma);
return pte;
}

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c631ade3f1d2..ee7d6fc36128 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -107,10 +107,10 @@ static void __init pte_basic_tests(struct pgtable_debug_args *args, int idx)
WARN_ON(!pte_same(pte, pte));
WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
- WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
+ WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte), args->vma)));
WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte))));
WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte))));
- WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));
+ WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte, args->vma))));
WARN_ON(pte_dirty(pte_wrprotect(pte_mkclean(pte))));
WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte))));
}
@@ -151,7 +151,7 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
pte = pte_mkclean(pte);
set_pte_at(args->mm, args->vaddr, args->ptep, pte);
flush_dcache_page(page);
- pte = pte_mkwrite(pte);
+ pte = pte_mkwrite(pte, args->vma);
pte = pte_mkdirty(pte);
ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1);
pte = ptep_get(args->ptep);
@@ -197,10 +197,10 @@ static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx)
WARN_ON(!pmd_same(pmd, pmd));
WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd))));
WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd))));
- WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd))));
+ WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd), args->vma)));
WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd))));
WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd))));
- WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd))));
+ WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd, args->vma))));
WARN_ON(pmd_dirty(pmd_wrprotect(pmd_mkclean(pmd))));
WARN_ON(!pmd_dirty(pmd_wrprotect(pmd_mkdirty(pmd))));
/*
@@ -251,7 +251,7 @@ static void __init pmd_advanced_tests(struct pgtable_debug_args *args)
pmd = pmd_mkclean(pmd);
set_pmd_at(args->mm, vaddr, args->pmdp, pmd);
flush_dcache_page(page);
- pmd = pmd_mkwrite(pmd);
+ pmd = pmd_mkwrite(pmd, args->vma);
pmd = pmd_mkdirty(pmd);
pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1);
pmd = READ_ONCE(*args->pmdp);
@@ -903,8 +903,8 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args)
pte = mk_huge_pte(page, args->page_prot);

WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte)));
- WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte))));
- WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte))));
+ WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte), args->vma)));
+ WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte, args->vma))));

#ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
pte = pfn_pte(args->fixed_pmd_pfn, args->page_prot);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa..a216129e6a7c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -554,7 +554,7 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);
pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
- pmd = pmd_mkwrite(pmd);
+ pmd = pmd_mkwrite(pmd, vma);
return pmd;
}

@@ -1587,7 +1587,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
pmd = pmd_modify(oldpmd, vma->vm_page_prot);
pmd = pmd_mkyoung(pmd);
if (writable)
- pmd = pmd_mkwrite(pmd);
+ pmd = pmd_mkwrite(pmd, vma);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
spin_unlock(vmf->ptl);
@@ -1935,7 +1935,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
/* See change_pte_range(). */
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
can_change_pmd_writable(vma, addr, entry))
- entry = pmd_mkwrite(entry);
+ entry = pmd_mkwrite(entry, vma);

ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bdbfeb6fb393..0f5cf9a3cdb7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4898,7 +4898,7 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,

if (writable) {
entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
- vma->vm_page_prot)));
+ vma->vm_page_prot)), vma);
} else {
entry = huge_pte_wrprotect(mk_huge_pte(page,
vma->vm_page_prot));
@@ -4914,7 +4914,7 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
{
pte_t entry;

- entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)));
+ entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)), vma);
if (huge_ptep_set_access_flags(vma, address, ptep, entry, 1))
update_mmu_cache(vma, address, ptep);
}
diff --git a/mm/memory.c b/mm/memory.c
index 3e836fecd035..6ad031d5cfb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4089,7 +4089,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
entry = mk_pte(page, vma->vm_page_prot);
entry = pte_sw_mkyoung(entry);
if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = pte_mkwrite(pte_mkdirty(entry), vma);

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
@@ -4777,7 +4777,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
pte = pte_modify(old_pte, vma->vm_page_prot);
pte = pte_mkyoung(pte);
if (writable)
- pte = pte_mkwrite(pte);
+ pte = pte_mkwrite(pte, vma);
ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 721b2365dbca..0d2c97f6d561 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -646,7 +646,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = pte_mkwrite(pte_mkdirty(entry), vma);
}

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 61cf60015a8b..2a70d16ec556 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -200,7 +200,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
!pte_write(ptent) &&
can_change_pte_writable(vma, addr, ptent))
- ptent = pte_mkwrite(ptent);
+ ptent = pte_mkwrite(ptent, vma);

ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
if (pte_needs_flush(oldpte, ptent))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0499907b6f1a..c3bcc6fcf27b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -85,7 +85,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
}

if (writable)
- _dst_pte = pte_mkwrite(_dst_pte);
+ _dst_pte = pte_mkwrite(_dst_pte, dst_vma);
else
/*
* We need this to make sure write bit removed; as mk_pte()
--
2.17.1


2023-02-18 21:19:30

by Edgecombe, Rick P

Subject: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To avoid creating
them, in places where Linux normally creates Write=0,Dirty=1, it can use the
software-defined _PAGE_SAVED_DIRTY in place of the hardware _PAGE_DIRTY.
In other words, whenever Linux needs to create Write=0,Dirty=1, it instead
creates Write=0,SavedDirty=1 except for shadow stack, which is
Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
combinations would be set as follows for various types of memory:

(Write=0,SavedDirty=1,Dirty=0):
- A modified, copy-on-write (COW) page. Previously when a typical
anonymous writable mapping was made COW via fork(), the kernel would
mark it Write=0,Dirty=1. Now it will instead use the SavedDirty bit.
This happens in copy_present_pte().
- A R/O page that has been COW'ed. The user page is in a R/O VMA,
and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
handler creates a copy of the page and sets the new copy's PTE as
Write=0 and SavedDirty=1.
- A shared shadow stack PTE. When a shadow stack page is being shared
among processes (this happens at fork()), its PTE is made Dirty=0, so
the next shadow stack access causes a fault, and the page is
duplicated and Dirty=1 is set again. This is the COW equivalent for
shadow stack pages, even though it's copy-on-access rather than
copy-on-write.

(Write=0,SavedDirty=0,Dirty=1):
- A shadow stack PTE.
- A COW PTE created when a processor without shadow stack support set
Dirty=1.
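
As an illustration only, the encodings listed above can be modeled on a
plain 64-bit integer. This is a user-space sketch, not the kernel code: the
helper names (classify(), mksaveddirty(), clear_saveddirty()) and the shstk
flag are hypothetical stand-ins, bits 1 and 6 are the x86 Write and Dirty
bits, and bit 58 is assumed from _PAGE_BIT_SOFTW5 in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_RW          (UINT64_C(1) << 1)   /* hardware Write bit */
#define PTE_DIRTY       (UINT64_C(1) << 6)   /* hardware Dirty bit */
#define PTE_SAVED_DIRTY (UINT64_C(1) << 58)  /* software bit, assumed position */

static_assert((PTE_SAVED_DIRTY & (PTE_RW | PTE_DIRTY)) == 0,
              "software bit must not alias the hardware bits");

enum pte_kind {
    PTE_KIND_WRITABLE,     /* Write=1 */
    PTE_KIND_SHSTK,        /* Write=0,Dirty=1: shadow stack, or a dirty COW
                              PTE on hardware without shadow stack */
    PTE_KIND_SAVED_DIRTY,  /* Write=0,Dirty=0,SavedDirty=1 */
    PTE_KIND_CLEAN_RO,     /* Write=0 and neither dirty bit set */
};

/* Classify a PTE value by its Write/Dirty/SavedDirty bits only. */
static enum pte_kind classify(uint64_t pte)
{
    if (pte & PTE_RW)
        return PTE_KIND_WRITABLE;
    if (pte & PTE_DIRTY)
        return PTE_KIND_SHSTK;
    if (pte & PTE_SAVED_DIRTY)
        return PTE_KIND_SAVED_DIRTY;
    return PTE_KIND_CLEAN_RO;
}

/* Move the dirty state from the hardware bit to the software bit. */
static uint64_t mksaveddirty(uint64_t pte, bool shstk)
{
    if (!shstk)
        return pte;  /* the hardware Dirty bit is safe to keep */
    return (pte & ~PTE_DIRTY) | PTE_SAVED_DIRTY;
}

/* Move the dirty state back from the software bit to the hardware bit. */
static uint64_t clear_saveddirty(uint64_t pte, bool shstk)
{
    if (!shstk)
        return pte;
    return (pte | PTE_DIRTY) & ~PTE_SAVED_DIRTY;
}
```

In this model a dirty, write-protected value stays Write=0,Dirty=1 when
shstk is false, and becomes Write=0,SavedDirty=1 when it is true, so only
real shadow stack entries carry the hardware encoding.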

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_SAVED_DIRTY. No space is consumed in 32-bit
kernels because shadow stacks are not enabled there.

Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to start
creating _PAGE_SAVED_DIRTY PTEs will follow once other pieces are in place.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Co-developed-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
- Add _PAGE_SAVED_DIRTY to _PAGE_CHG_MASK

v5:
- Fix log, comments and whitespace (Boris)
- Remove capitalization on shadow stack (Boris)

v4:
- Teach pte_flags_need_flush() about _PAGE_COW bit
- Break apart patch for better bisectability

v3:
- Add comment around _PAGE_TABLE in response to comment
from (Andrew Cooper)
- Check for PSE in pmd_shstk (Andrew Cooper)
- Get to the point quicker in commit log (Andrew Cooper)
- Clarify and reorder commit log for why the PTE bit examples have
multiple entries. Apply same changes for comment. (peterz)
- Fix comment that implied dirty bit for COW was a specific x86 thing
(peterz)
- Fix swapping of Write/Dirty (PeterZ)
---
arch/x86/include/asm/pgtable.h | 79 ++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable_types.h | 65 ++++++++++++++++++++---
arch/x86/include/asm/tlbflush.h | 3 +-
3 files changed, 138 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2b423d697490..110e552eb602 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -301,6 +301,45 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
return native_make_pte(v & ~clear);
}

+/*
+ * COW and other write protection operations can result in Dirty=1,Write=0
+ * PTEs. But in the case of X86_FEATURE_USER_SHSTK, the software SavedDirty bit
+ * is used, since the Dirty=1,Write=0 will result in the memory being treated as
+ * shadow stack by the HW. So when creating dirty, write-protected memory, a
+ * software bit is used _PAGE_BIT_SAVED_DIRTY. The following functions
+ * pte_mksaveddirty() and pte_clear_saveddirty() take a conventional dirty,
+ * write-protected PTE (Write=0,Dirty=1) and transition it to the shadow stack
+ * compatible version. (Write=0,SavedDirty=1).
+ */
+static inline pte_t pte_mksaveddirty(pte_t pte)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pte;
+
+ pte = pte_clear_flags(pte, _PAGE_DIRTY);
+ return pte_set_flags(pte, _PAGE_SAVED_DIRTY);
+}
+
+static inline pte_t pte_clear_saveddirty(pte_t pte)
+{
+ /*
+ * _PAGE_SAVED_DIRTY is unnecessary on !X86_FEATURE_USER_SHSTK kernels,
+ * since the HW dirty bit can be used without creating shadow stack
+ * memory. See the _PAGE_SAVED_DIRTY definition for more details.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pte;
+
+ /*
+ * PTE is getting copied-on-write, so it will be dirtied
+ * if writable, or made shadow stack if shadow stack and
+ * being copied on access. Set the dirty bit for both
+ * cases.
+ */
+ pte = pte_set_flags(pte, _PAGE_DIRTY);
+ return pte_clear_flags(pte, _PAGE_SAVED_DIRTY);
+}
+
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
@@ -420,6 +459,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
return native_make_pmd(v & ~clear);
}

+/* See comments above pte_mksaveddirty() */
+static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pmd;
+
+ pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+ return pmd_set_flags(pmd, _PAGE_SAVED_DIRTY);
+}
+
+/* See comments above pte_mksaveddirty() */
+static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pmd;
+
+ pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+ return pmd_clear_flags(pmd, _PAGE_SAVED_DIRTY);
+}
+
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pmd_uffd_wp(pmd_t pmd)
{
@@ -491,6 +550,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
return native_make_pud(v & ~clear);
}

+/* See comments above pte_mksaveddirty() */
+static inline pud_t pud_mksaveddirty(pud_t pud)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pud;
+
+ pud = pud_clear_flags(pud, _PAGE_DIRTY);
+ return pud_set_flags(pud, _PAGE_SAVED_DIRTY);
+}
+
+/* See comments above pte_mksaveddirty() */
+static inline pud_t pud_clear_saveddirty(pud_t pud)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pud;
+
+ pud = pud_set_flags(pud, _PAGE_DIRTY);
+ return pud_clear_flags(pud, _PAGE_SAVED_DIRTY);
+}
+
static inline pud_t pud_mkold(pud_t pud)
{
return pud_clear_flags(pud, _PAGE_ACCESSED);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0646ad00178b..3b420b6c0584 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
#define _PAGE_BIT_SOFTW2 10 /* " */
#define _PAGE_BIT_SOFTW3 11 /* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
+#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
+#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
#define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
#define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
#define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4

+/*
+ * Indicates a Saved Dirty bit page.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_SAVED_DIRTY 0
+#endif
+
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
@@ -117,6 +127,40 @@
#define _PAGE_SOFTW4 (_AT(pteval_t, 0))
#endif

+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_SAVED_DIRTY is a software-only bit used to separate copy-on-write
+ * PTEs from shadow stack PTEs:
+ *
+ * (Write=0,SavedDirty=1,Dirty=0):
+ * - A modified, copy-on-write (COW) page. Previously when a typical
+ * anonymous writable mapping was made COW via fork(), the kernel would
+ * mark it Write=0,Dirty=1. Now it will instead use the SavedDirty bit.
+ * This happens in copy_present_pte().
+ * - A R/O page that has been COW'ed. The user page is in a R/O VMA,
+ * and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
+ * handler creates a copy of the page and sets the new copy's PTE as
+ * Write=0 and SavedDirty=1.
+ * - A shared shadow stack PTE. When a shadow stack page is being shared
+ * among processes (this happens at fork()), its PTE is made Dirty=0, so
+ * the next shadow stack access causes a fault, and the page is
+ * duplicated and Dirty=1 is set again. This is the COW equivalent for
+ * shadow stack pages, even though it's copy-on-access rather than
+ * copy-on-write.
+ *
+ * (Write=0,SavedDirty=0,Dirty=1):
+ * - A shadow stack PTE.
+ * - A COW PTE created when a processor without shadow stack support set
+ * Dirty=1.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_SAVED_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY)
+#else
+#define _PAGE_SAVED_DIRTY (_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY)
+
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

/*
@@ -125,9 +169,9 @@
* instance, and is *not* included in this mask since
* pte_modify() does modify it.
*/
-#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
- _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
- _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \
+#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
+ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_BITS | \
+ _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \
_PAGE_UFFD_WP)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)

@@ -186,12 +230,17 @@ enum page_cache_mode {
#define PAGE_READONLY __pg(__PP| 0|_USR|___A|__NX| 0| 0| 0)
#define PAGE_READONLY_EXEC __pg(__PP| 0|_USR|___A| 0| 0| 0| 0)

-#define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G)
-#define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G)
-#define _KERNPG_TABLE_NOENC (__PP|__RW| 0|___A| 0|___D| 0| 0)
-#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
+/*
+ * Page tables need to have Write=1 in order for any lower PTEs to be
+ * writable. This includes shadow stack memory (Write=0, Dirty=1).
+ */
#define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0)
#define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC)
+#define _KERNPG_TABLE_NOENC (__PP|__RW| 0|___A| 0|___D| 0| 0)
+#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
+
+#define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G)
+#define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G)
#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G)
#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G)
#define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..6c5ef14060a8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -273,7 +273,8 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
_PAGE_ACCESSED;
const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
- _PAGE_SOFTW3 | _PAGE_SOFTW4;
+ _PAGE_SOFTW3 | _PAGE_SOFTW4 |
+ _PAGE_SAVED_DIRTY;
const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
_PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
_PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
--
2.17.1


2023-02-18 21:19:42

by Edgecombe, Rick P

Subject: [PATCH v6 15/41] x86/mm: Update ptep/pmdp_set_wrprotect() for _PAGE_SAVED_DIRTY

From: Yu-cheng Yu <[email protected]>

When shadow stack is in use, Write=0,Dirty=1 PTEs are reserved for
shadow stack. Copy-on-write PTEs then have Write=0,SavedDirty=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,SavedDirty=1, it could
become a transient shadow stack PTE in two cases:

1. Some processors can start a write but end up seeing a Write=0 PTE by
the time they get to the Dirty bit, creating a transient shadow stack
PTE. However, this will not occur on processors supporting shadow
stack, and a TLB flush is not necessary.

2. When _PAGE_DIRTY is replaced with _PAGE_SAVED_DIRTY non-atomically, a
transient shadow stack PTE can be created as a result. Thus, prevent
that with cmpxchg.
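
The retry loop from case 2 can be sketched in userspace C11 as below. This
is only a model under stated assumptions, not the kernel code: the names
model_wrprotect() and model_set_wrprotect() are hypothetical, C11 atomics
stand in for try_cmpxchg(), and the SavedDirty bit position is a placeholder.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* RW and Dirty follow the x86 PTE layout; SAVED_DIRTY is a placeholder bit. */
#define PTE_RW          (UINT64_C(1) << 1)
#define PTE_DIRTY       (UINT64_C(1) << 6)
#define PTE_SAVED_DIRTY (UINT64_C(1) << 58)

/* Pure transformation: clear Write and move Dirty to SavedDirty. */
static uint64_t model_wrprotect(uint64_t pte)
{
	pte &= ~PTE_RW;
	if (pte & PTE_DIRTY) {
		pte &= ~PTE_DIRTY;
		pte |= PTE_SAVED_DIRTY;
	}
	return pte;
}

/*
 * Recompute the new value from whatever is currently in memory and only
 * store it if nothing changed in between. A concurrent Dirty=1 update
 * makes the compare-exchange fail, reloads 'old', and the loop retries,
 * so a Write=0,Dirty=1 value is never stored.
 */
static void model_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load(ptep);
	uint64_t next;

	do {
		next = model_wrprotect(old);
	} while (!atomic_compare_exchange_weak(ptep, &old, next));
}
```

A plain "clear RW, then move Dirty" sequence done in two separate stores
would expose the intermediate Write=0,Dirty=1 value, which is what the
single compare-exchange avoids.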

In the case of pmdp_set_wrprotect(), for nopmd configs the ->pmd operated
on does not exist and the logic would need to be different. Although the
extra functionality will normally be optimized out when user shadow
stacks are not configured, also exclude it in the preprocessor stage so
that it will still compile. User shadow stack is not supported there by
Linux anyway. Leave the cpu_feature_enabled() check so that the
functionality also gets disabled based on runtime detection of the
feature.

Similarly, compile it out in ptep_set_wrprotect() due to a clang warning
on i386. Like above, the code path should get optimized out on i386
since shadow stack is not supported on 32 bit kernels, but this makes
the compiler happy.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the cmpxchg solution.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Fix comment and log to update for _PAGE_COW being replaced with
_PAGE_SAVED_DIRTY.

v5:
- Commit log verbiage and formatting (Boris)
- Remove capitalization on shadow stack (Boris)
- Fix i386 warning on recent clang

v3:
- Remove unnecessary #ifdef (Dave Hansen)

v2:
- Compile out some code due to clang build error
- Clarify commit log (dhansen)
- Normalize PTE bit descriptions between patches (dhansen)
- Update comment with text from (dhansen)
---
arch/x86/include/asm/pgtable.h | 35 ++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 110e552eb602..e5f00c077039 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1192,6 +1192,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ /*
+ * Avoid accidentally creating shadow stack PTEs
+ * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
+ * the hardware setting Dirty=1.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+ pte_t old_pte, new_pte;
+
+ old_pte = READ_ONCE(*ptep);
+ do {
+ new_pte = pte_wrprotect(old_pte);
+ } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+ return;
+ }
+#endif
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}

@@ -1244,6 +1261,24 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ /*
+ * Avoid accidentally creating shadow stack PTEs
+ * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
+ * the hardware setting Dirty=1.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+ pmd_t old_pmd, new_pmd;
+
+ old_pmd = READ_ONCE(*pmdp);
+ do {
+ new_pmd = pmd_wrprotect(old_pmd);
+ } while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+ return;
+ }
+#endif
+
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

--
2.17.1


2023-02-18 21:19:45

by Edgecombe, Rick P

Subject: [PATCH v6 16/41] x86/mm: Start actually marking _PAGE_SAVED_DIRTY

The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
the same changes to pmd and pud.

For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
was not necessarily created with any of the existing PTE bit helpers.
That means that it can return a pte_t with Write=0,Dirty=1, a shadow
stack PTE, when it did not intend to create one.

Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid
creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
1. Marking Write=0 PTEs Dirty=1
2. Marking Dirty=1 PTEs Write=0

The first case cannot happen as the existing behavior of pte_modify() is to
filter out any Dirty bit passed in newprot. Handle the second case by
shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write
protected by the pte_modify() call. Apply the same changes to
pmd_modify().
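
The two cases above can be illustrated with a small userspace model. It is
a sketch only: model_modify() and KEEP_MASK are hypothetical stand-ins for
pte_modify() and _PAGE_CHG_MASK, the cpu_feature_enabled() check and the
PROT_NONE handling are left out, and the SavedDirty bit position is a
placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT     (UINT64_C(1) << 0)
#define PTE_RW          (UINT64_C(1) << 1)
#define PTE_DIRTY       (UINT64_C(1) << 6)
#define PTE_SAVED_DIRTY (UINT64_C(1) << 58)	/* placeholder position */

/* Bits carried over from the old PTE; everything else comes from newprot. */
#define KEEP_MASK (PTE_DIRTY | PTE_SAVED_DIRTY)

static uint64_t model_modify(uint64_t pte, uint64_t newprot)
{
	/* Case 1: a Dirty bit in newprot is masked out here. */
	uint64_t val = (pte & KEEP_MASK) | (newprot & ~KEEP_MASK);
	bool wr_protected = (pte & PTE_RW) && !(val & PTE_RW);

	/* Case 2: the PTE was just write protected while Dirty=1. */
	if (wr_protected && (val & PTE_DIRTY)) {
		val &= ~PTE_DIRTY;
		val |= PTE_SAVED_DIRTY;
	}
	return val;
}
```

With this model, write protecting a Write=1,Dirty=1 entry yields
Write=0,SavedDirty=1, and passing Dirty=1 in newprot for a clean entry
leaves it clean.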

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Co-developed-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
- Open code _PAGE_SAVED_DIRTY part in pte_modify() (Boris)
- Change the logic so the open coded part is not too ugly
- Merge pte_modify() patch with this one because of the above

v4:
- Break apart patch for better bisectability
---
arch/x86/include/asm/pgtable.h | 168 ++++++++++++++++++++++++++++-----
1 file changed, 145 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e5f00c077039..292a3b75d7fa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
* The following only work if pte_present() is true.
* Undefined behaviour if not..
*/
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
{
- return pte_flags(pte) & _PAGE_DIRTY;
+ return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return false;
+
+ return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}

static inline int pte_young(pte_t pte)
@@ -134,9 +142,18 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}

-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_DIRTY;
+ return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return false;
+
+ return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
+ (_PAGE_DIRTY | _PAGE_PSE);
}

#define pmd_young pmd_young
@@ -145,9 +162,9 @@ static inline int pmd_young(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_ACCESSED;
}

-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
{
- return pud_flags(pud) & _PAGE_DIRTY;
+ return pud_flags(pud) & _PAGE_DIRTY_BITS;
}

static inline int pud_young(pud_t pud)
@@ -157,13 +174,21 @@ static inline int pud_young(pud_t pud)

static inline int pte_write(pte_t pte)
{
- return pte_flags(pte) & _PAGE_RW;
+ /*
+ * Shadow stack pages are logically writable, but do not have
+ * _PAGE_RW. Check for them separately from _PAGE_RW itself.
+ */
+ return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
}

#define pmd_write pmd_write
static inline int pmd_write(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_RW;
+ /*
+ * Shadow stack pages are logically writable, but do not have
+ * _PAGE_RW. Check for them separately from _PAGE_RW itself.
+ */
+ return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
}

#define pud_write pud_write
@@ -375,7 +400,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)

static inline pte_t pte_mkclean(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_DIRTY);
+ return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
}

static inline pte_t pte_mkold(pte_t pte)
@@ -385,7 +410,16 @@ static inline pte_t pte_mkold(pte_t pte)

static inline pte_t pte_wrprotect(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_RW);
+ pte = pte_clear_flags(pte, _PAGE_RW);
+
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (pte_dirty(pte))
+ pte = pte_mksaveddirty(pte);
+ return pte;
}

static inline pte_t pte_mkexec(pte_t pte)
@@ -395,7 +429,19 @@ static inline pte_t pte_mkexec(pte_t pte)

static inline pte_t pte_mkdirty(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ pteval_t dirty = _PAGE_DIRTY;
+
+ /* Avoid creating Dirty=1,Write=0 PTEs */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pte_write(pte))
+ dirty = _PAGE_SAVED_DIRTY;
+
+ return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+ /* pte_clear_saveddirty() also sets Dirty=1 */
+ return pte_clear_saveddirty(pte);
}

static inline pte_t pte_mkyoung(pte_t pte)
@@ -412,7 +458,12 @@ struct vm_area_struct;

static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
- return pte_mkwrite_kernel(pte);
+ pte = pte_mkwrite_kernel(pte);
+
+ if (pte_dirty(pte))
+ pte = pte_clear_saveddirty(pte);
+
+ return pte;
}

static inline pte_t pte_mkhuge(pte_t pte)
@@ -503,17 +554,36 @@ static inline pmd_t pmd_mkold(pmd_t pmd)

static inline pmd_t pmd_mkclean(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_DIRTY);
+ return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
}

static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_RW);
+ pmd = pmd_clear_flags(pmd, _PAGE_RW);
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (pmd_dirty(pmd))
+ pmd = pmd_mksaveddirty(pmd);
+ return pmd;
}

static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ pmdval_t dirty = _PAGE_DIRTY;
+
+ /* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pmd_write(pmd))
+ dirty = _PAGE_SAVED_DIRTY;
+
+ return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+ return pmd_clear_saveddirty(pmd);
}

static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -533,7 +603,12 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)

static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
- return pmd_set_flags(pmd, _PAGE_RW);
+ pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+ if (pmd_dirty(pmd))
+ pmd = pmd_clear_saveddirty(pmd);
+
+ return pmd;
}

static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -577,17 +652,32 @@ static inline pud_t pud_mkold(pud_t pud)

static inline pud_t pud_mkclean(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_DIRTY);
+ return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
}

static inline pud_t pud_wrprotect(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_RW);
+ pud = pud_clear_flags(pud, _PAGE_RW);
+
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (pud_dirty(pud))
+ pud = pud_mksaveddirty(pud);
+ return pud;
}

static inline pud_t pud_mkdirty(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ pudval_t dirty = _PAGE_DIRTY;
+
+ /* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pud_write(pud))
+ dirty = _PAGE_SAVED_DIRTY;
+
+ return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
}

static inline pud_t pud_mkdevmap(pud_t pud)
@@ -607,7 +697,11 @@ static inline pud_t pud_mkyoung(pud_t pud)

static inline pud_t pud_mkwrite(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_RW);
+ pud = pud_set_flags(pud, _PAGE_RW);
+
+ if (pud_dirty(pud))
+ pud = pud_clear_saveddirty(pud);
+ return pud;
}

#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
@@ -724,6 +818,8 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
pteval_t val = pte_val(pte), oldval = val;
+ bool wr_protected;
+ pte_t pte_result;

/*
* Chop off the NX bit (if present), and add the NX portion of
@@ -732,17 +828,43 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
val &= _PAGE_CHG_MASK;
val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
- return __pte(val);
+
+ pte_result = __pte(val);
+
+ /*
+ * Do the saveddirty fixup if the PTE was just write protected and
+ * it's dirty.
+ */
+ wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
+ (val & _PAGE_DIRTY))
+ pte_result = pte_mksaveddirty(pte_result);
+
+ return pte_result;
}

static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
{
pmdval_t val = pmd_val(pmd), oldval = val;
+ bool wr_protected;
+ pmd_t pmd_result;

- val &= _HPAGE_CHG_MASK;
+ val &= (_HPAGE_CHG_MASK & ~_PAGE_DIRTY);
val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
- return __pmd(val);
+
+ pmd_result = __pmd(val);
+
+ /*
+ * Do the saveddirty fixup if the PMD was just write protected and
+ * it's dirty.
+ */
+ wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
+ (val & _PAGE_DIRTY))
+ pmd_result = pmd_mksaveddirty(pmd_result);
+
+ return pmd_result;
}

/*
--
2.17.1


2023-02-18 21:20:06

by Edgecombe, Rick P

Subject: [PATCH v6 17/41] mm: Move VM_UFFD_MINOR_BIT from 37 to 38

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Future patches will introduce a new VM flag VM_SHADOW_STACK that will be
VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_0 through VM_HIGH_ARCH_BIT_4 are
bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake
of order, make all VM_HIGH_ARCH_BITs stay together by moving
VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be
introduced as 37.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Peter Xu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4650e2580d60..e6f1789c8e69 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -366,7 +366,7 @@ extern unsigned int kobjsize(const void *objp);
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT 37
+# define VM_UFFD_MINOR_BIT 38
# define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */
#else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
# define VM_UFFD_MINOR VM_NONE
--
2.17.1


2023-02-18 21:20:11

by Edgecombe, Rick P

Subject: [PATCH v6 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
two cases are handled differently for page faults. Introduce
VM_SHADOW_STACK to track shadow stack VMAs.

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v6:
- Add comment about VM_SHADOW_STACK not being allowed with VM_SHARED
(David Hildenbrand)

v3:
- Drop arch specific change in arch_vma_name(). The memory can show as
anonymous (Kirill)
- Change CONFIG_ARCH_HAS_SHADOW_STACK to CONFIG_X86_USER_SHADOW_STACK
in show_smap_vma_flags() (Boris)
---
Documentation/filesystems/proc.rst | 1 +
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 8 ++++++++
3 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e224b6d5b642..115843e8cce3 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -564,6 +564,7 @@ encoded manner. The codes are the following:
mt arm64 MTE allocation tags are enabled
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
+ ss shadow stack page
== =======================================

Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index af1c49ae11b1..9e2cefe47749 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ [ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e6f1789c8e69..76e0a09aeffe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,11 +315,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+# define VM_SHADOW_STACK VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */
+#else
+# define VM_SHADOW_STACK VM_NONE
+#endif
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
--
2.17.1


2023-02-18 21:20:16

by Edgecombe, Rick P

Subject: [PATCH v6 19/41] x86/mm: Check shadow stack page fault errors

From: Yu-cheng Yu <[email protected]>

The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).

Shadow stack accesses to shadow-stack mappings can result in faults in
normal, valid operation just like regular accesses to regular mappings.
Shadow stacks need some of the same features like delayed allocation, swap
and copy-on-write. The kernel needs to use faults to implement those
features.

The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.

This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.

For the purpose of making this clearer, consider the following example.
If a process has a shadow stack, and forks, the shadow stack PTEs will
become read-only due to COW. If the CPU in one process performs a shadow
stack read access to the shadow stack, for example executing a RET and
causing the CPU to read the shadow stack copy of the return address, then
in order for the fault to be resolved the PTE will need to be set with
shadow stack permissions. But then the memory would be changeable from
userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
COW, otherwise the shared page would be changeable from both processes.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping. Also, generate the errors for invalid shadow stack accesses.
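
The decision logic described above can be modeled in plain C as follows.
This is a sketch under assumptions: the MODEL_* constants are placeholder
values rather than the kernel's, model_access_error() and
model_fault_flags() are hypothetical names, and protection key and
instruction fetch checks are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define X86_PF_WRITE		(1u << 1)
#define X86_PF_SHSTK		(1u << 6)

/* Placeholder values, chosen only for this model. */
#define MODEL_VM_WRITE		(1u << 0)
#define MODEL_VM_SHADOW_STACK	(1u << 1)
#define MODEL_FAULT_FLAG_WRITE	(1u << 0)

/* Returns true when the access must be rejected. */
static bool model_access_error(unsigned int error_code, unsigned int vm_flags)
{
	if (error_code & X86_PF_SHSTK) {
		/* Shadow stack accesses are only valid on shadow stack VMAs. */
		if (!(vm_flags & MODEL_VM_SHADOW_STACK))
			return true;
		if (!(vm_flags & MODEL_VM_WRITE))
			return true;
		return false;
	}

	if (error_code & X86_PF_WRITE) {
		/* Normal writes may not target shadow stack VMAs. */
		if (vm_flags & MODEL_VM_SHADOW_STACK)
			return true;
		if (!(vm_flags & MODEL_VM_WRITE))
			return true;
		return false;
	}

	return false;	/* read checks omitted in this model */
}

/* Shadow stack reads and writes are both handled as write faults. */
static unsigned int model_fault_flags(unsigned int error_code)
{
	unsigned int flags = 0;

	if (error_code & (X86_PF_SHSTK | X86_PF_WRITE))
		flags |= MODEL_FAULT_FLAG_WRITE;
	return flags;
}
```

Treating a shadow stack read as a write fault is what forces COW to
happen before the page is mapped with shadow stack permissions.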

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Update comment due to rename of Cow bit to SavedDirty

v5:
- Add description of COW example (Boris)
- Replace "permissioned" (Boris)
- Remove capitalization of shadow stack (Boris)

v4:
- Further improve comment talking about FAULT_FLAG_WRITE (Peterz)

v3:
- Improve comment talking about using FAULT_FLAG_WRITE (Peterz)
---
arch/x86/include/asm/trap_pf.h | 2 ++
arch/x86/mm/fault.c | 38 ++++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
+ * bit 6 == 1: shadow stack access fault
* bit 15 == 1: SGX MMU page-fault
*/
enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
+ X86_PF_SHSTK = 1 << 6,
X86_PF_SGX = 1 << 15,
};

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..42885d8e2036 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1138,8 +1138,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;

+ /*
+ * Shadow stack accesses (PF_SHSTK=1) are only permitted to
+ * shadow stack VMAs. All other accesses result in an error.
+ */
+ if (error_code & X86_PF_SHSTK) {
+ if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
+ return 1;
+ if (unlikely(!(vma->vm_flags & VM_WRITE)))
+ return 1;
+ return 0;
+ }
+
if (error_code & X86_PF_WRITE) {
/* write, present and write, not present: */
+ if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
+ return 1;
if (unlikely(!(vma->vm_flags & VM_WRITE)))
return 1;
return 0;
@@ -1331,6 +1345,30 @@ void do_user_addr_fault(struct pt_regs *regs,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

+ /*
+ * When a page becomes COW it changes from a shadow stack permission
+ * page (Write=0,Dirty=1) to (Write=0,Dirty=0,SavedDirty=1), which is simply
+ * read-only to the CPU. When shadow stack is enabled, a RET would
+ * normally pop the shadow stack by reading it with a "shadow stack
+ * read" access. However, in the COW case the shadow stack memory does
+ * not have shadow stack permissions, it is read-only. So it will
+ * generate a fault.
+ *
+ * For conventionally writable pages, a read can be serviced with a
+ * read only PTE, and COW would not have to happen. But for shadow
+ * stack, there isn't the concept of read-only shadow stack memory.
+ * If it is shadow stack permission, it can be modified via CALL and
+ * RET instructions. So COW needs to happen before any memory can be
+ * mapped with shadow stack permissions.
+ *
+ * Shadow stack accesses (read or write) need to be serviced with
+ * shadow stack permission memory, so in the case of a shadow stack
+ * read access, treat it as a WRITE fault so that both COW will happen and
+ * the write fault path will tickle maybe_mkwrite() and map the memory as
+ * shadow stack.
+ */
+ if (error_code & X86_PF_SHSTK)
+ flags |= FAULT_FLAG_WRITE;
if (error_code & X86_PF_WRITE)
flags |= FAULT_FLAG_WRITE;
if (error_code & X86_PF_INSTR)
--
2.17.1


2023-02-18 21:20:35

by Edgecombe, Rick P

Subject: [PATCH v6 20/41] x86/mm: Teach pte_mkwrite() about stack memory

If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
when it is made writable with pte_mkwrite(), it should create shadow
stack memory, not conventionally writable memory. Now that pte_mkwrite()
takes a VMA, and places where shadow stack memory might be created pass
one, pte_mkwrite() can know when it should do this.

So make pte_mkwrite() create shadow stack memory when the VMA has the
VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().

This requires referencing VM_SHADOW_STACK in these functions, which are
currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is
located) can't be pulled in without causing problems for files that
reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where
they can safely reference VM_SHADOW_STACK.
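
As a rough illustration of the resulting behavior, the model below mirrors
the two paths in userspace C. It is a sketch, not the kernel helpers:
model_mkwrite() and model_clear_saveddirty() are hypothetical names, the
SavedDirty bit position is a placeholder, and the model assumes that
clearing SavedDirty also sets the hardware Dirty bit, as the comment in
pte_mkwrite_shstk() states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_RW          (UINT64_C(1) << 1)
#define PTE_DIRTY       (UINT64_C(1) << 6)
#define PTE_SAVED_DIRTY (UINT64_C(1) << 58)	/* placeholder position */

/* Assumption: SavedDirty is dropped and hardware Dirty is set instead. */
static uint64_t model_clear_saveddirty(uint64_t pte)
{
	pte &= ~PTE_SAVED_DIRTY;
	return pte | PTE_DIRTY;
}

static uint64_t model_mkwrite(uint64_t pte, bool vma_is_shadow_stack)
{
	/* Shadow stack VMA: "writable" means Write=0,Dirty=1. */
	if (vma_is_shadow_stack)
		return model_clear_saveddirty(pte);

	/* Conventional VMA: set Write=1 and restore any saved dirty state. */
	pte |= PTE_RW;
	if (pte & (PTE_DIRTY | PTE_SAVED_DIRTY))
		pte = model_clear_saveddirty(pte);
	return pte;
}
```

The same input entry therefore ends up as Write=0,Dirty=1 in a shadow
stack VMA and as Write=1 (with Dirty restored if it was saved) elsewhere.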

Tested-by: Pengfei Xu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- New patch
---
arch/x86/include/asm/pgtable.h | 20 ++------------------
arch/x86/mm/pgtable.c | 26 ++++++++++++++++++++++++++
2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 292a3b75d7fa..6b7106457bfb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -456,15 +456,7 @@ static inline pte_t pte_mkwrite_kernel(pte_t pte)

struct vm_area_struct;

-static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
- pte = pte_mkwrite_kernel(pte);
-
- if (pte_dirty(pte))
- pte = pte_clear_saveddirty(pte);
-
- return pte;
-}
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);

static inline pte_t pte_mkhuge(pte_t pte)
{
@@ -601,15 +593,7 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return pmd_set_flags(pmd, _PAGE_ACCESSED);
}

-static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
-{
- pmd = pmd_set_flags(pmd, _PAGE_RW);
-
- if (pmd_dirty(pmd))
- pmd = pmd_clear_saveddirty(pmd);
-
- return pmd;
-}
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..98856bcc8102 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -880,3 +880,29 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)

#endif /* CONFIG_X86_64 */
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return pte_mkwrite_shstk(pte);
+
+ pte = pte_mkwrite_kernel(pte);
+
+ if (pte_dirty(pte))
+ pte = pte_clear_saveddirty(pte);
+
+ return pte;
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return pmd_mkwrite_shstk(pmd);
+
+ pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+ if (pmd_dirty(pmd))
+ pmd = pmd_clear_saveddirty(pmd);
+
+ return pmd;
+}
--
2.17.1


2023-02-18 21:20:38

by Edgecombe, Rick P

Subject: [PATCH v6 21/41] mm: Add guard pages around a shadow stack.

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the
SSP to different shadow stacks, but it requires a specially placed token
in order to do this. However, the architecture does not prevent
incrementing the stack pointer to wander onto an adjacent shadow stack.
To prevent this in software, enforce guard pages at the beginning of
shadow stack VMAs, such that there will always be a gap between adjacent
shadow stacks.

Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP), can move the SSP from one stack to the next. The
SSP can increment or decrement by CALL, RET and INCSSP. CALL and RET
can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.

The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:

addq $0x80, %rsp

However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:

READ_ONCE(ssp); // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8); // read+discard last popped stack element

The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.
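
The arithmetic can be checked with a small userspace model. This is a
sketch under assumptions: the helper names are hypothetical, only the
"last popped element" read of INCSSP is modeled, and CALL/RET are ignored
since they move the SSP by at most 8 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INCSSP_MAX_FRAMES	255u
#define FRAME_SIZE		8u

/* Largest single INCSSP move in bytes: 255 * 8 = 2040. */
static unsigned int incssp_max_move(void)
{
	return INCSSP_MAX_FRAMES * FRAME_SIZE;
}

/*
 * Try every starting SSP within reach of the gap and every frame count.
 * The address of the last popped element is (new SSP - 8). If that
 * address can ever lie at or beyond the end of the gap, one INCSSP could
 * jump clean over the gap without touching it.
 */
static bool gap_blocks_all_moves(uint64_t gap_start, uint64_t gap_size)
{
	uint64_t gap_end = gap_start + gap_size;
	uint64_t ssp;
	unsigned int nr;

	for (ssp = gap_start - incssp_max_move(); ssp <= gap_start;
	     ssp += FRAME_SIZE) {
		for (nr = 1; nr <= INCSSP_MAX_FRAMES; nr++) {
			uint64_t last = ssp + (uint64_t)nr * FRAME_SIZE -
					FRAME_SIZE;

			if (last >= gap_end)
				return false;
		}
	}
	return true;
}
```

In this model a 4096 byte gap stops every move, while a hypothetical
1024 byte gap would not, since 2040 - 8 = 2032 exceeds 1024.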

This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and adds a strange difference to how most regular stacks work.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Fix typo in commit log

v4:
- Drop references to 32 bit instructions
- Switch to generic code to drop __weak (Peterz)

v2:
- Use __weak instead of #ifdef (Dave Hansen)
- Only have start gap on shadow stack (Andy Luto)
- Create stack_guard_start_gap() to not duplicate code
in an arch version of vm_start_gap() (Dave Hansen)
- Improve commit log partly with verbiage from (Dave Hansen)

Yu-cheng v25:
- Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.
---
include/linux/mm.h | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76e0a09aeffe..a41577c5bf3e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2980,15 +2980,36 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
return mtree_load(&mm->mm_mt, addr);
}

+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_GROWSDOWN)
+ return stack_guard_gap;
+
+ /*
+ * Shadow stack pointer is moved by CALL, RET, and INCSSPQ.
+ * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
+ * and touches the first and the last element in the range, which
+ * triggers a page fault if the range is not in a shadow stack.
+ * Because of this, creating 4-KB guard pages around a shadow
+ * stack prevents these instructions from going beyond.
+ *
+ * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
+ * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return PAGE_SIZE;
+
+ return 0;
+}
+
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
+ unsigned long gap = stack_guard_start_gap(vma);
unsigned long vm_start = vma->vm_start;

- if (vma->vm_flags & VM_GROWSDOWN) {
- vm_start -= stack_guard_gap;
- if (vm_start > vma->vm_start)
- vm_start = 0;
- }
+ vm_start -= gap;
+ if (vm_start > vma->vm_start)
+ vm_start = 0;
return vm_start;
}

--
2.17.1


2023-02-18 21:20:53

by Edgecombe, Rick P

Subject: [PATCH v6 22/41] mm/mmap: Add shadow stack pages to memory accounting

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Account shadow stack pages to stack memory.

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v3:
- Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
(Kirill)

v2:
- Remove is_shadow_stack_mapping() and just change it to directly bitwise
and VM_SHADOW_STACK.

Yu-cheng v26:
- Remove redundant #ifdef CONFIG_MMU.

Yu-cheng v25:
- Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().
---
mm/mmap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 425a9349e610..9f85596cce31 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3290,6 +3290,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
mm->exec_vm += npages;
else if (is_stack_mapping(flags))
mm->stack_vm += npages;
+ else if (flags & VM_SHADOW_STACK)
+ mm->stack_vm += npages;
else if (is_data_mapping(flags))
mm->data_vm += npages;
}
--
2.17.1


2023-02-18 21:20:56

by Edgecombe, Rick P

Subject: [PATCH v6 23/41] mm: Re-introduce vm_flags to do_mmap()

From: Yu-cheng Yu <[email protected]>

There were no more callers passing vm_flags to do_mmap(), and vm_flags was
removed from the function's input by:

commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap(). Thus, re-introduce vm_flags to do_mmap().
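
The effect of the new parameter can be sketched in userspace C as
follows. This is only a model under stated assumptions: the function
name model_do_mmap_flags() is made up, the bit chosen for
DEMO_VM_SHADOW_STACK is a placeholder rather than the real flag value,
and the real do_mmap() ORs in more bits (calc_vm_flag_bits(),
mm->def_flags) than shown here. The point is that caller-supplied flags
are now ORed with the computed ones instead of being overwritten, so
existing callers that pass 0 see no change.

```c
typedef unsigned long long vm_flags_t;

#define VM_READ		0x00000001ULL
#define VM_WRITE	0x00000002ULL
#define VM_MAYREAD	0x00000010ULL
#define VM_MAYWRITE	0x00000020ULL
#define VM_MAYEXEC	0x00000040ULL
/* Placeholder bit for the example only, not the kernel's definition. */
#define DEMO_VM_SHADOW_STACK	(1ULL << 37)

/*
 * Model of the changed assignment in do_mmap(): start from the flags
 * passed by the caller and OR in the bits derived from prot.
 */
static vm_flags_t model_do_mmap_flags(vm_flags_t caller_flags,
				      vm_flags_t prot_bits)
{
	vm_flags_t vm_flags = caller_flags;

	vm_flags |= prot_bits | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
	return vm_flags;
}
```

A caller passing 0 gets the same result as before the change, while a
shadow stack allocation passing DEMO_VM_SHADOW_STACK | VM_WRITE keeps
both bits in the final flags.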

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Peter Collingbourne <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
---
fs/aio.c | 2 +-
include/linux/mm.h | 3 ++-
ipc/shm.c | 2 +-
mm/mmap.c | 10 +++++-----
mm/nommu.c | 4 ++--
mm/util.c | 2 +-
6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 562916d85cba..279c75ec6a05 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -554,7 +554,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)

ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
PROT_READ | PROT_WRITE,
- MAP_SHARED, 0, &unused, NULL);
+ MAP_SHARED, 0, 0, &unused, NULL);
mmap_write_unlock(mm);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a41577c5bf3e..6d6ffd7563bf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2890,7 +2890,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
struct list_head *uf);
extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
- unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+ vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf);
extern int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool downgrade);
diff --git a/ipc/shm.c b/ipc/shm.c
index bd2fcc4d454e..1c5476bfec8b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1662,7 +1662,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
goto invalid;
}

- addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+ addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
*raddr = addr;
err = 0;
if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 9f85596cce31..350bf156fcae 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1238,11 +1238,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
*/
unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
- unsigned long flags, unsigned long pgoff,
- unsigned long *populate, struct list_head *uf)
+ unsigned long flags, vm_flags_t vm_flags,
+ unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf)
{
struct mm_struct *mm = current->mm;
- vm_flags_t vm_flags;
int pkey = 0;

validate_mm(mm);
@@ -1303,7 +1303,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
- vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

if (flags & MAP_LOCKED)
@@ -2877,7 +2877,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,

file = get_file(vma->vm_file);
ret = do_mmap(vma->vm_file, start, size,
- prot, flags, pgoff, &populate, NULL);
+ prot, flags, 0, pgoff, &populate, NULL);
fput(file);
out:
mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 5b83938ecb67..3642a3e01265 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1042,6 +1042,7 @@ unsigned long do_mmap(struct file *file,
unsigned long len,
unsigned long prot,
unsigned long flags,
+ vm_flags_t vm_flags,
unsigned long pgoff,
unsigned long *populate,
struct list_head *uf)
@@ -1049,7 +1050,6 @@ unsigned long do_mmap(struct file *file,
struct vm_area_struct *vma;
struct vm_region *region;
struct rb_node *rb;
- vm_flags_t vm_flags;
unsigned long capabilities, result;
int ret;
MA_STATE(mas, &current->mm->mm_mt, 0, 0);
@@ -1069,7 +1069,7 @@ unsigned long do_mmap(struct file *file,

/* we've determined that we can make the mapping, now translate what we
* now know into VMA flags */
- vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+ vm_flags |= determine_vm_flags(file, prot, flags, capabilities);


/* we're going to need to record the mapping */
diff --git a/mm/util.c b/mm/util.c
index b56c92fb910f..77867bf9959a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -517,7 +517,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
if (!ret) {
if (mmap_write_lock_killable(mm))
return -EINTR;
- ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+ ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
&uf);
mmap_write_unlock(mm);
userfaultfd_unmap_complete(mm, &uf);
--
2.17.1


2023-02-18 21:21:11

by Edgecombe, Rick P

Subject: [PATCH v6 24/41] mm: Don't allow write GUPs to shadow stack memory

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

Shadow stack memory is writable only in very specific, controlled ways.
However, since it is writable, the kernel treats it as such. As a result
there remain many ways for userspace to trigger the kernel to write to
shadow stacks via get_user_pages(, FOLL_WRITE) operations. To make this a
little less exposed, block writable GUPs for shadow stack VMAs.

Still allow FOLL_FORCE to write through shadow stack protections, as it
does for read-only protections.
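
The resulting decision can be sketched as a small userspace C model.
The names model_check_write_gup() and DEMO_VM_SHADOW_STACK are invented
for the example, the flag values are illustrative, and the real
check_vma_flags() performs further checks after the FOLL_FORCE test that
are left out here.

```c
#include <errno.h>
#include <stdbool.h>

#define VM_WRITE		0x00000002ULL
/* Placeholder bit for the example only, not the kernel's definition. */
#define DEMO_VM_SHADOW_STACK	(1ULL << 37)

#define FOLL_WRITE	0x01u	/* illustrative values */
#define FOLL_FORCE	0x10u

/*
 * Model of the write path: a write GUP is refused unless the VMA is
 * conventionally writable, and shadow stack VMAs are treated like
 * non-writable ones. FOLL_FORCE still gets through in both cases.
 */
static int model_check_write_gup(unsigned long long vm_flags,
				 unsigned int gup_flags)
{
	bool write = gup_flags & FOLL_WRITE;

	if (!write)
		return 0;

	if (!(vm_flags & VM_WRITE) || (vm_flags & DEMO_VM_SHADOW_STACK)) {
		if (!(gup_flags & FOLL_FORCE))
			return -EFAULT;
	}
	return 0;
}
```

In this model a plain FOLL_WRITE on a shadow stack VMA yields -EFAULT,
while FOLL_WRITE | FOLL_FORCE is still accepted, matching the
description above.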

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v3:
- Add comment in __pte_access_permitted() (Dave)
- Remove unneeded shadow stack specific check in
__pte_access_permitted() (Jann)
---
arch/x86/include/asm/pgtable.h | 5 +++++
mm/gup.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6b7106457bfb..20d0df494269 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1641,6 +1641,11 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
{
unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;

+ /*
+ * Write=0,Dirty=1 PTEs are shadow stack, which the kernel
+ * shouldn't generally allow access to, but since they
+ * are already Write=0, the below logic covers both cases.
+ */
if (write)
need_pte_bits |= _PAGE_RW;

diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..bfd33d9edb89 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -982,7 +982,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
return -EFAULT;

if (write) {
- if (!(vm_flags & VM_WRITE)) {
+ if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
--
2.17.1


2023-02-18 21:21:24

by Edgecombe, Rick P

Subject: [PATCH v6 25/41] x86/mm: Introduce MAP_ABOVE4G

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One of the properties is that the shadow stack pointer (SSP), which is a
CPU register that points to the shadow stack like the stack pointer points
to the stack, can't be pointing outside of the 32 bit address space when
the CPU is executing in 32 bit mode. It is desirable to prevent executing
in 32 bit mode when shadow stack is enabled because the kernel can't easily
support 32 bit signals.

On x86 it is possible to transition to 32 bit mode without any special
interaction with the kernel, by doing a "far call" to a 32 bit segment.
So the shadow stack implementation can use this address space behavior
as a feature, by enforcing that shadow stack memory is always created
outside of the 32 bit address space. This way userspace will trigger a
general protection fault which will in turn trigger a segfault if it
tries to transition to 32 bit mode with shadow stack enabled.

This provides a clean, error-generating boundary for users who attempt
to use shadow stack in 32 bit mode, rather than leaving the kernel in a
half working state for userspace to be surprised by.

So to allow future shadow stack enabling patches to map shadow stacks
out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
is pretty much like MAP_32BIT, except that it has the opposite address
range. There are a few differences though.

If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
syscall.

Since the default search behavior is top down, the normal kaslr base can
be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its
own randomization in the bottom up case.

For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G
both are potentially valid, so both are used. In the bottom up search
path, the default behavior is already consistent with MAP_ABOVE4G since
mmap base should be above 4GB.
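
For the top down path, the limit selection can be sketched as the
following userspace C model. The function name topdown_low_limit() and
the 4096-byte page size are assumptions for the example. The flag value
is the one this patch adds to the x86 uapi header.

```c
#include <stdbool.h>

#ifndef MAP_ABOVE4G
#define MAP_ABOVE4G	0x80	/* value introduced by this patch */
#endif

#define DEMO_PAGE_SIZE	4096ULL
#define DEMO_4G		0x100000000ULL

/*
 * Model of the low_limit choice in the top down search: a 64 bit
 * syscall that passes MAP_ABOVE4G may not be placed below 4GB, and
 * everything else keeps the old PAGE_SIZE limit.
 */
static unsigned long long topdown_low_limit(bool in_32bit_syscall,
					    unsigned long flags)
{
	if (!in_32bit_syscall && (flags & MAP_ABOVE4G))
		return DEMO_4G;

	return DEMO_PAGE_SIZE;
}
```

Because the flag is ignored for 32 bit syscalls,
topdown_low_limit(true, MAP_ABOVE4G) returns the same limit as a call
without the flag.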

Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB.
So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode
with shadow stack enabled would usually segfault anyway. This already
provides pretty decent guard rails. But the addition of MAP_ABOVE4G is
some small complexity spent to make them more complete.

Tested-by: Pengfei Xu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v5:
- New patch
---
arch/x86/include/uapi/asm/mman.h | 1 +
arch/x86/kernel/sys_x86_64.c | 6 +++++-
include/linux/mman.h | 4 ++++
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 775dbd3aff73..5a0256e73f1e 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -3,6 +3,7 @@
#define _ASM_X86_MMAN_H

#define MAP_32BIT 0x40 /* only give out 32bit addresses */
+#define MAP_ABOVE4G 0x80 /* only map above 4GB */

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
#define arch_calc_vm_prot_bits(prot, key) ( \
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 8cc653ffdccd..06378b5682c1 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
- info.low_limit = PAGE_SIZE;
+ if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
+ info.low_limit = 0x100000000;
+ else
+ info.low_limit = PAGE_SIZE;
+
info.high_limit = get_mmap_base(0);

/*
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 58b3abd457a3..32156daa985a 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -15,6 +15,9 @@
#ifndef MAP_32BIT
#define MAP_32BIT 0
#endif
+#ifndef MAP_ABOVE4G
+#define MAP_ABOVE4G 0
+#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB 0
#endif
@@ -50,6 +53,7 @@
| MAP_STACK \
| MAP_HUGETLB \
| MAP_32BIT \
+ | MAP_ABOVE4G \
| MAP_HUGE_2MB \
| MAP_HUGE_1GB)

--
2.17.1


2023-02-18 21:21:29

by Edgecombe, Rick P

Subject: [PATCH v6 26/41] mm: Warn on shadow stack memory in wrong vma

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
treated as shadow stack by the CPU, but this combination used to be created by
the kernel on x86. Previous patches have changed the kernel to now avoid
creating these PTEs unless they are for shadow stack memory. In case any
missed corners of the kernel are still creating PTEs like this for
non-shadow stack memory, and to catch any re-introductions of the logic,
warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
stack VMAs when they are being zapped. This won't catch transient cases
but should have decent coverage. It will be compiled out when shadow
stack is not configured.

In order to check if a pte is shadow stack in core mm code, add two arch
breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack
specific code to be kept in arch/x86.
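
The condition being warned about can be sketched in userspace C as
below. This is a model only: pte_is_shstk() and zap_would_warn() are
names made up for the example, the PTE is represented as a raw 64-bit
value, and the bit positions (Write at bit 1, Dirty at bit 6) are the
usual x86 ones restated here as an assumption. The real helpers also
depend on the CPU feature being enabled.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_PAGE_RW	(1ULL << 1)	/* Write bit */
#define X86_PAGE_DIRTY	(1ULL << 6)	/* Dirty bit */

/* A PTE looks like shadow stack when it is Write=0 and Dirty=1. */
static bool pte_is_shstk(uint64_t pteval)
{
	return (pteval & (X86_PAGE_RW | X86_PAGE_DIRTY)) == X86_PAGE_DIRTY;
}

/*
 * Model of the zap-time check: warn when a shadow stack PTE is found
 * in a VMA that is not a shadow stack VMA.
 */
static bool zap_would_warn(bool vma_is_shadow_stack, uint64_t pteval)
{
	return !vma_is_shadow_stack && pte_is_shstk(pteval);
}
```

A Write=0,Dirty=1 PTE in an ordinary VMA triggers the warning in this
model. The same PTE in a shadow stack VMA, or a writable dirty PTE
anywhere, does not.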

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Add arch breakout to remove shstk from core MM code.

v5:
- Fix typo in commit log

v3:
- New patch
---
arch/x86/include/asm/pgtable.h | 6 ++++++
arch/x86/mm/pgtable.c | 12 ++++++++++++
include/linux/pgtable.h | 14 ++++++++++++++
mm/huge_memory.c | 1 +
mm/memory.c | 1 +
5 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 20d0df494269..f3dc16fc4389 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1687,6 +1687,12 @@ static inline bool arch_has_hw_pte_young(void)
return true;
}

+#define arch_check_zapped_pte arch_check_zapped_pte
+void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
+
+#define arch_check_zapped_pmd arch_check_zapped_pmd
+void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd);
+
#ifdef CONFIG_XEN_PV
#define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
static inline bool arch_has_hw_nonleaf_pmd_young(void)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 98856bcc8102..afab0bc7862b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -906,3 +906,15 @@ pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)

return pmd;
}
+
+void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
+{
+ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+ pte_shstk(pte));
+}
+
+void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
+{
+ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
+ pmd_shstk(pmd));
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..22787c86c8f2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -291,6 +291,20 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif

+#ifndef arch_check_zapped_pte
+static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
+ pte_t pte)
+{
+}
+#endif
+
+#ifndef arch_check_zapped_pmd
+static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
+ pmd_t pmd)
+{
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a216129e6a7c..842925f7fa9e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1696,6 +1696,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
*/
orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
tlb->fullmm);
+ arch_check_zapped_pmd(vma, orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
if (vma_is_special_huge(vma)) {
if (arch_needs_pgtable_deposit())
diff --git a/mm/memory.c b/mm/memory.c
index 6ad031d5cfb0..29e8f043b603 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1377,6 +1377,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
+ arch_check_zapped_pte(vma, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
zap_install_uffd_wp_if_needed(vma, addr, pte, details,
ptent);
--
2.17.1


2023-02-18 21:21:40

by Edgecombe, Rick P

Subject: [PATCH v6 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot

When user shadow stack is in use, Write=0,Dirty=1 is treated by the CPU as
shadow stack memory. So for shadow stack memory this bit combination is
valid, but when Dirty=1,Write=1 (conventionally writable) memory is being
write protected, the kernel has been taught to transition the Dirty=1
bit to SavedDirty=1, to avoid inadvertently creating shadow stack
memory. It does this inside pte_wrprotect() because it knows the PTE is
not intended to be a writable shadow stack entry, it is supposed to be
write protected.

However, when a PTE is created by a raw prot using mk_pte(), mk_pte()
can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
distinguish between the caller intending to create a shadow stack PTE or
needing the SavedDirty shift.

The kernel has been updated to not do this, and so Write=0,Dirty=1
memory should only be created by the pte_mkfoo() helpers. Add a warning
to make sure no new mk_pte() callers start doing this.
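
The difference between the two paths can be sketched in userspace C.
This is a simplified model, not the kernel implementation: the function
names are invented, PTEs and prots are raw 64-bit values, and the
position of the SavedDirty software bit (bit 58 here) is an assumption
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_PAGE_RW		(1ULL << 1)	/* Write bit */
#define X86_PAGE_DIRTY		(1ULL << 6)	/* Dirty bit */
#define DEMO_PAGE_SAVED_DIRTY	(1ULL << 58)	/* assumed software bit */

/*
 * Model of write protecting a PTE: clear Write, and if the PTE was
 * dirty, move Dirty to SavedDirty so the result is not Write=0,Dirty=1.
 */
static uint64_t model_pte_wrprotect(uint64_t pteval)
{
	pteval &= ~X86_PAGE_RW;
	if (pteval & X86_PAGE_DIRTY) {
		pteval &= ~X86_PAGE_DIRTY;
		pteval |= DEMO_PAGE_SAVED_DIRTY;
	}
	return pteval;
}

/*
 * Model of the new mk_pte() check: a raw prot that is Write=0,Dirty=1
 * would create a shadow stack PTE without going through a helper.
 */
static bool raw_prot_would_warn(uint64_t prot)
{
	return (prot & (X86_PAGE_DIRTY | X86_PAGE_RW)) == X86_PAGE_DIRTY;
}
```

In the model, write protecting a Write=1,Dirty=1 value never yields
something raw_prot_would_warn() flags, whereas handing a bare Dirty=1
prot to mk_pte() does.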

Tested-by: Pengfei Xu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- New patch (Note, this has already been a useful warning, it caught the
newly added set_memory_rox() doing this)
---
arch/x86/include/asm/pgtable.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f3dc16fc4389..db8fe5511c74 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1032,7 +1032,15 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
* (Currently stuck as a macro because of indirect forward reference
* to linux/mm.h:page_to_nid())
*/
-#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))
+#define mk_pte(page, pgprot) \
+({ \
+ pgprot_t __pgprot = pgprot; \
+ \
+ WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && \
+ (pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW)) == \
+ _PAGE_DIRTY); \
+ pfn_pte(page_to_pfn(page), __pgprot); \
+})

static inline int pmd_bad(pmd_t pmd)
{
--
2.17.1


2023-02-18 21:21:44

by Edgecombe, Rick P

Subject: [PATCH v6 28/41] x86: Introduce userspace API for shadow stack

From: "Kirill A. Shutemov" <[email protected]>

Add three new arch_prctl() handles:

- ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
feature. Returns 0 on success or an error.

- ARCH_SHSTK_LOCK prevents future disabling or enabling of the
specified feature. Returns 0 on success or an error.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().
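
The intended semantics can be sketched as a small userspace C model.
It is an illustration under assumptions: struct thread_model and
model_shstk_prctl() are made-up names, the ptrace restriction is left
out, and the enable/disable branches simply toggle bits, whereas in this
preparation patch they are stubs that return -EINVAL. The option
numbers are the ones added by this patch. __builtin_popcountl() is a
GCC/Clang builtin standing in for hweight_long().

```c
#include <errno.h>

#define ARCH_SHSTK_ENABLE	0x5001
#define ARCH_SHSTK_DISABLE	0x5002
#define ARCH_SHSTK_LOCK		0x5003

/* Per-thread state, mirroring the two fields added to thread_struct. */
struct thread_model {
	unsigned long features;
	unsigned long features_locked;
};

static long model_shstk_prctl(struct thread_model *t, int option,
			      unsigned long features)
{
	if (option == ARCH_SHSTK_LOCK) {
		t->features_locked |= features;
		return 0;
	}

	/* Locked features can no longer be enabled or disabled. */
	if (features & t->features_locked)
		return -EPERM;

	/* Only one feature may be changed per call. */
	if (__builtin_popcountl(features) > 1)
		return -EINVAL;

	if (option == ARCH_SHSTK_DISABLE) {
		t->features &= ~features;	/* assumed behavior */
		return 0;
	}
	if (option == ARCH_SHSTK_ENABLE) {
		t->features |= features;	/* assumed behavior */
		return 0;
	}
	return -EINVAL;
}
```

For example, enabling feature bit 0, locking it, and then trying to
disable it makes the last call fail with -EPERM while the bit stays set.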

This is a preparation patch. It does not implement any features.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
[tweaked with feedback from tglx]
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v4:
- Remove references to CET and replace with shadow stack (Peterz)

v3:
- Move shstk.c Makefile changes earlier (Kees)
- Add #ifdef around features_locked and features (Kees)
- Encapsulate features reset earlier in reset_thread_features() so
features and features_locked are not referenced in code that would be
compiled !CONFIG_X86_USER_SHADOW_STACK. (Kees)
- Fix typo in commit log (Kees)
- Switch arch_prctl() numbers to avoid conflict with LAM

v2:
- Only allow one enable/disable per call (tglx)
- Return error code like a normal arch_prctl() (Alexander Potapenko)
- Make CET only (tglx)
---
arch/x86/include/asm/processor.h | 6 +++++
arch/x86/include/asm/shstk.h | 21 +++++++++++++++
arch/x86/include/uapi/asm/prctl.h | 6 +++++
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/process_64.c | 7 ++++-
arch/x86/kernel/shstk.c | 44 +++++++++++++++++++++++++++++++
6 files changed, 85 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/shstk.h
create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 8d73004e4cac..bd16e012b3e9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -28,6 +28,7 @@ struct vm86;
#include <asm/unwind_hints.h>
#include <asm/vmxfeatures.h>
#include <asm/vdso/processor.h>
+#include <asm/shstk.h>

#include <linux/personality.h>
#include <linux/cache.h>
@@ -475,6 +476,11 @@ struct thread_struct {
*/
u32 pkru;

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ unsigned long features;
+ unsigned long features_locked;
+#endif
+
/* Floating point and extended processor state */
struct fpu fpu;
/*
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
new file mode 100644
index 000000000000..ec753809f074
--- /dev/null
+++ b/arch/x86/include/asm/shstk.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SHSTK_H
+#define _ASM_X86_SHSTK_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+void reset_thread_features(void);
+#else
+static inline long shstk_prctl(struct task_struct *task, int option,
+ unsigned long arg2) { return -EINVAL; }
+static inline void reset_thread_features(void) {}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_SHSTK_H */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..b2b3b7200b2d 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,10 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003

+/* Don't use 0x3001-0x3004 because of old glibcs */
+
+#define ARCH_SHSTK_ENABLE 0x5001
+#define ARCH_SHSTK_DISABLE 0x5002
+#define ARCH_SHSTK_LOCK 0x5003
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 92446f1dedd7..b366641703e3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -146,6 +146,8 @@ obj-$(CONFIG_CALL_THUNKS) += callthunks.o

obj-$(CONFIG_X86_CET) += cet.o

+obj-$(CONFIG_X86_USER_SHADOW_STACK) += shstk.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4e34b3b68ebd..71094c8a305f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
load_gs_index(__USER_DS);
}

+ reset_thread_features();
+
loadsegment(fs, 0);
loadsegment(es, _ds);
loadsegment(ds, _ds);
@@ -830,7 +832,10 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_MAP_VDSO_64:
return prctl_map_vdso(&vdso_image_64, arg2);
#endif
-
+ case ARCH_SHSTK_ENABLE:
+ case ARCH_SHSTK_DISABLE:
+ case ARCH_SHSTK_LOCK:
+ return shstk_prctl(task, option, arg2);
default:
ret = -EINVAL;
break;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..41ed6552e0a5
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <asm/prctl.h>
+
+void reset_thread_features(void)
+{
+ current->thread.features = 0;
+ current->thread.features_locked = 0;
+}
+
+long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+{
+ if (option == ARCH_SHSTK_LOCK) {
+ task->thread.features_locked |= features;
+ return 0;
+ }
+
+ /* Don't allow via ptrace */
+ if (task != current)
+ return -EINVAL;
+
+ /* Do not allow to change locked features */
+ if (features & task->thread.features_locked)
+ return -EPERM;
+
+ /* Only support enabling/disabling one feature at a time. */
+ if (hweight_long(features) > 1)
+ return -EINVAL;
+
+ if (option == ARCH_SHSTK_DISABLE) {
+ return -EINVAL;
+ }
+
+ /* Handle ARCH_SHSTK_ENABLE */
+ return -EINVAL;
+}
--
2.17.1


2023-02-18 21:21:48

by Edgecombe, Rick P

Subject: [PATCH v6 29/41] x86/shstk: Add user-mode shadow stack support

From: Yu-cheng Yu <[email protected]>

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).
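
The size calculation can be sketched in userspace C as follows. The
function name model_adjust_shstk_size() and the 4096-byte page size are
assumptions for the example, and the stack rlimit is passed in as a
plain number rather than read from the task.

```c
#define DEMO_PAGE_SIZE	4096ULL
#define DEMO_SZ_4G	(1ULL << 32)

/* Round up to the next multiple of the (assumed) page size. */
static unsigned long long demo_page_align(unsigned long long x)
{
	return (x + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/*
 * Model of the sizing rule: an explicit size is only page aligned,
 * while a size of 0 selects min(stack rlimit, 4GB), page aligned.
 */
static unsigned long long model_adjust_shstk_size(unsigned long long size,
						  unsigned long long stack_rlimit)
{
	unsigned long long limit;

	if (size)
		return demo_page_align(size);

	limit = stack_rlimit < DEMO_SZ_4G ? stack_rlimit : DEMO_SZ_4G;
	return demo_page_align(limit);
}
```

With a typical 8MB stack rlimit the model returns 8MB. With an
unlimited rlimit (all bits set) it caps the size at 4GB.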

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

Do not support IA32 emulation or x32.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Switch to EOPNOTSUPP
- Use MAP_ABOVE4G
- Move set_clr_bits_msrl() to patch where it is first used

v4:
- Just set MSR_IA32_U_CET when disabling shadow stack, since we don't
have IBT yet. (Peterz)

v3:
- Use define for set_clr_bits_msrl() (Kees)
- Make some functions static (Kees)
- Change feature_foo() to features_foo() (Kees)
- Centralize shadow stack size rlimit checks (Kees)
- Disable x32 support

v2:
- Get rid of unnecessary shstk->base checks
- Don't support IA32 emulation
---
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/shstk.h | 7 ++
arch/x86/include/uapi/asm/prctl.h | 3 +
arch/x86/kernel/shstk.c | 145 ++++++++++++++++++++++++++++++
4 files changed, 157 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index bd16e012b3e9..ff98cd6d5af2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -479,6 +479,8 @@ struct thread_struct {
#ifdef CONFIG_X86_USER_SHADOW_STACK
unsigned long features;
unsigned long features_locked;
+
+ struct thread_shstk shstk;
#endif

/* Floating point and extended processor state */
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index ec753809f074..2b1f7c9b9995 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -8,12 +8,19 @@
struct task_struct;

#ifdef CONFIG_X86_USER_SHADOW_STACK
+struct thread_shstk {
+ u64 base;
+ u64 size;
+};
+
long shstk_prctl(struct task_struct *task, int option, unsigned long features);
void reset_thread_features(void);
+void shstk_free(struct task_struct *p);
#else
static inline long shstk_prctl(struct task_struct *task, int option,
unsigned long arg2) { return -EINVAL; }
static inline void reset_thread_features(void) {}
+static inline void shstk_free(struct task_struct *p) {}
#endif /* CONFIG_X86_USER_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index b2b3b7200b2d..7dfd9dc00509 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,4 +26,7 @@
#define ARCH_SHSTK_DISABLE 0x5002
#define ARCH_SHSTK_LOCK 0x5003

+/* ARCH_SHSTK_ features bits */
+#define ARCH_SHSTK_SHSTK (1ULL << 0)
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 41ed6552e0a5..3cb85224d856 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -8,14 +8,159 @@

#include <linux/sched.h>
#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/shstk.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
#include <asm/prctl.h>

+static bool features_enabled(unsigned long features)
+{
+ return current->thread.features & features;
+}
+
+static void features_set(unsigned long features)
+{
+ current->thread.features |= features;
+}
+
+static void features_clr(unsigned long features)
+{
+ current->thread.features &= ~features;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
+ struct mm_struct *mm = current->mm;
+ unsigned long addr, unused;
+
+ mmap_write_lock(mm);
+ addr = do_mmap(NULL, 0, size, PROT_READ, flags,
+ VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+
+ mmap_write_unlock(mm);
+
+ return addr;
+}
+
+static unsigned long adjust_shstk_size(unsigned long size)
+{
+ if (size)
+ return PAGE_ALIGN(size);
+
+ return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+ while (1) {
+ int r;
+
+ r = vm_munmap(base, size);
+
+ /*
+ * vm_munmap() returns -EINTR when mmap_lock is held by
+ * something else, and that lock should not be held for a
+ * long time. Retry it for the case.
+ */
+ if (r == -EINTR) {
+ cond_resched();
+ continue;
+ }
+
+ /*
+ * For all other types of vm_munmap() failure, either the
+ * system is out of memory or there is bug.
+ */
+ WARN_ON_ONCE(r);
+ break;
+ }
+}
+
+static int shstk_setup(void)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ unsigned long addr, size;
+
+ /* Already enabled */
+ if (features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ /* Also not supported for 32 bit and x32 */
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall())
+ return -EOPNOTSUPP;
+
+ size = adjust_shstk_size(0);
+ addr = alloc_shstk(size);
+ if (IS_ERR_VALUE(addr))
+ return PTR_ERR((void *)addr);
+
+ fpregs_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+ wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+ fpregs_unlock();
+
+ shstk->base = addr;
+ shstk->size = size;
+ features_set(ARCH_SHSTK_SHSTK);
+
+ return 0;
+}
+
void reset_thread_features(void)
{
+ memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
current->thread.features = 0;
current->thread.features_locked = 0;
}

+void shstk_free(struct task_struct *tsk)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(ARCH_SHSTK_SHSTK))
+ return;
+
+ if (!tsk->mm)
+ return;
+
+ unmap_shadow_stack(shstk->base, shstk->size);
+}
+
+static int shstk_disable(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ /* Already disabled? */
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ fpregs_lock_and_load();
+ /* Disable WRSS too when disabling shadow stack */
+ wrmsrl(MSR_IA32_U_CET, 0);
+ wrmsrl(MSR_IA32_PL3_SSP, 0);
+ fpregs_unlock();
+
+ shstk_free(current);
+ features_clr(ARCH_SHSTK_SHSTK);
+
+ return 0;
+}
+
long shstk_prctl(struct task_struct *task, int option, unsigned long features)
{
if (option == ARCH_SHSTK_LOCK) {
--
2.17.1


2023-02-18 21:22:04

by Edgecombe, Rick P

Subject: [PATCH v6 30/41] x86/shstk: Handle thread shadow stack

From: Yu-cheng Yu <[email protected]>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use the stack_size passed from the clone3() syscall for the thread shadow
stack size. A compat-mode thread shadow stack size is further reduced to
1/4. This allows more threads to run in a 32-bit address space. clone()
does not pass stack_size, which was added in clone3(). In that case, use
the RLIMIT_STACK size, capped at 4 GB.
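
The size selection described above can be sketched as a small userspace
model. This is only an illustration of the logic in adjust_shstk_size() as
it appears in this series; the MODEL_* names, the model_shstk_size() helper
and the fixed 4 KiB page size are assumptions made for the example, and the
compat-mode reduction is not modelled.

```c
#include <stdint.h>

/* Assumed for illustration: 4 KiB pages and a 4 GiB cap. */
#define MODEL_PAGE_SIZE 4096ULL
#define MODEL_SZ_4G     (1ULL << 32)

/* Round x up to the next multiple of the page size. */
static uint64_t model_page_align(uint64_t x)
{
	return (x + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

/*
 * Hypothetical model of the size choice: a non-zero request (clone3()
 * stack_size) is page aligned; a zero request (clone()) falls back to
 * min(RLIMIT_STACK, 4 GiB), page aligned.
 */
uint64_t model_shstk_size(uint64_t requested, uint64_t rlimit_stack)
{
	uint64_t cap;

	if (requested)
		return model_page_align(requested);

	cap = rlimit_stack < MODEL_SZ_4G ? rlimit_stack : MODEL_SZ_4G;
	return model_page_align(cap);
}
```

With an unlimited RLIMIT_STACK the model yields exactly 4 GiB, which is the
cap the commit log refers to.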

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
calling function and overwrite entries further up the shadow stack. The
child can safely overwrite entries further down the shadow stack, as the
parent can just overwrite them later. So CET does not add any additional
limitations for vfork().
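
The CLONE_VM versus vfork() distinction above boils down to one flag test.
A minimal userspace sketch of that predicate follows; the MODEL_* constants
copy the values of CLONE_VM and CLONE_VFORK from the Linux UAPI headers, and
model_needs_new_shstk() is a hypothetical name used only for this example.

```c
/* Values mirror CLONE_VM and CLONE_VFORK from the Linux UAPI headers. */
#define MODEL_CLONE_VM    0x00000100UL
#define MODEL_CLONE_VFORK 0x00004000UL

/*
 * Hypothetical helper: returns 1 when the child shares the address space
 * but is not a vfork() child, i.e. when it needs its own shadow stack.
 */
int model_needs_new_shstk(unsigned long clone_flags)
{
	unsigned long mask = MODEL_CLONE_VFORK | MODEL_CLONE_VM;

	return (clone_flags & mask) == MODEL_CLONE_VM;
}
```

A plain fork() (no CLONE_VM) gets a copied address space, so it does not
need a new allocation either; only the thread-like case returns 1.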

Userspace implementing POSIX vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from the vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared with the
parent.

During this operation, the shadow stack pointer of the new thread needs
to be updated to point to the newly allocated shadow stack. Since the
ability to do this is confined to the FPU subsystem, change
fpu_clone() to take the new shadow stack pointer, and update it
internally inside the FPU subsystem. This part was suggested by Thomas
Gleixner.

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v3:
- Fix update_fpu_shstk() stub (Mike Rapoport)
- Fix chunks around alloc_shstk() in wrong patch (Kees)
- Fix stack_size/flags swap (Kees)
- Use centralized stack size logic (Kees)

v2:
- Have fpu_clone() take new shadow stack pointer and update SSP in
xsave buffer for new task. (tglx)

v1:
- Expand commit log.
- Add more comments.
- Switch to xsave helpers.

Yu-cheng v30:
- Update comments about clone()/clone3(). (Borislav Petkov)
---
arch/x86/include/asm/fpu/sched.h | 3 ++-
arch/x86/include/asm/mmu_context.h | 2 ++
arch/x86/include/asm/shstk.h | 7 +++++
arch/x86/kernel/fpu/core.c | 41 +++++++++++++++++++++++++++-
arch/x86/kernel/process.c | 18 ++++++++++++-
arch/x86/kernel/shstk.c | 43 ++++++++++++++++++++++++++++--
6 files changed, 109 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..3c2903bbb456 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -11,7 +11,8 @@

extern void save_fpregs_to_fpstate(struct fpu *fpu);
extern void fpu__drop(struct fpu *fpu);
-extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal);
+extern int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+ unsigned long shstk_addr);
extern void fpu_flush_thread(void);

/*
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index e01aa74a6de7..9714f08d941b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -147,6 +147,8 @@ do { \
#else
#define deactivate_mm(tsk, mm) \
do { \
+ if (!tsk->vfork_done) \
+ shstk_free(tsk); \
load_gs_index(0); \
loadsegment(fs, 0); \
} while (0)
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 2b1f7c9b9995..1399f4df098b 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -15,11 +15,18 @@ struct thread_shstk {

long shstk_prctl(struct task_struct *task, int option, unsigned long features);
void reset_thread_features(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+ unsigned long stack_size,
+ unsigned long *shstk_addr);
void shstk_free(struct task_struct *p);
#else
static inline long shstk_prctl(struct task_struct *task, int option,
unsigned long arg2) { return -EINVAL; }
static inline void reset_thread_features(void) {}
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+ unsigned long clone_flags,
+ unsigned long stack_size,
+ unsigned long *shstk_addr) { return 0; }
static inline void shstk_free(struct task_struct *p) {}
#endif /* CONFIG_X86_USER_SHADOW_STACK */

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index f851558b673f..bc3de4aeb661 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -552,8 +552,41 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
}
}

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+ struct cet_user_state *xstate;
+
+ /* If ssp update is not needed. */
+ if (!ssp)
+ return 0;
+
+ xstate = get_xsave_addr(&dst->thread.fpu.fpstate->regs.xsave,
+ XFEATURE_CET_USER);
+
+ /*
+ * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+ * stack and the fpu state should be up to date since it was just copied
+ * from the parent in fpu_clone(). So there must be a valid non-init CET
+ * state location in the buffer.
+ */
+ if (WARN_ON_ONCE(!xstate))
+ return 1;
+
+ xstate->user_ssp = (u64)ssp;
+
+ return 0;
+}
+#else
+static int update_fpu_shstk(struct task_struct *dst, unsigned long shstk_addr)
+{
+ return 0;
+}
+#endif
+
/* Clone current's FPU state on fork */
-int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
+int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal,
+ unsigned long ssp)
{
struct fpu *src_fpu = &current->thread.fpu;
struct fpu *dst_fpu = &dst->thread.fpu;
@@ -613,6 +646,12 @@ int fpu_clone(struct task_struct *dst, unsigned long clone_flags, bool minimal)
if (use_xsave())
dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;

+ /*
+ * Update shadow stack pointer, in case it changed during clone.
+ */
+ if (update_fpu_shstk(dst, ssp))
+ return 1;
+
trace_x86_fpu_copy_src(src_fpu);
trace_x86_fpu_copy_dst(dst_fpu);

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e57cd31bfec4..13a0a81d70b9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -48,6 +48,7 @@
#include <asm/frame.h>
#include <asm/unwind.h>
#include <asm/tdx.h>
+#include <asm/shstk.h>

#include "process.h"

@@ -119,6 +120,7 @@ void exit_thread(struct task_struct *tsk)

free_vm86(t);

+ shstk_free(tsk);
fpu__drop(fpu);
}

@@ -140,6 +142,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
struct inactive_task_frame *frame;
struct fork_frame *fork_frame;
struct pt_regs *childregs;
+ unsigned long shstk_addr = 0;
int ret = 0;

childregs = task_pt_regs(p);
@@ -174,7 +177,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->flags = X86_EFLAGS_FIXED;
#endif

- fpu_clone(p, clone_flags, args->fn);
+ /* Allocate a new shadow stack for pthread if needed */
+ ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size,
+ &shstk_addr);
+ if (ret)
+ return ret;
+
+ fpu_clone(p, clone_flags, args->fn, shstk_addr);

/* Kernel thread ? */
if (unlikely(p->flags & PF_KTHREAD)) {
@@ -220,6 +229,13 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
io_bitmap_share(p);

+ /*
+ * If copy_thread() is failing, don't leak the shadow stack possibly
+ * allocated in shstk_alloc_thread_stack() above.
+ */
+ if (ret)
+ shstk_free(p);
+
return ret;
}

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 3cb85224d856..1d30295e0066 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -47,7 +47,7 @@ static unsigned long alloc_shstk(unsigned long size)
unsigned long addr, unused;

mmap_write_lock(mm);
- addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+ addr = do_mmap(NULL, 0, size, PROT_READ, flags,
VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);

mmap_write_unlock(mm);
@@ -126,6 +126,39 @@ void reset_thread_features(void)
current->thread.features_locked = 0;
}

+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+ unsigned long stack_size, unsigned long *shstk_addr)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+ unsigned long addr, size;
+
+ /*
+ * If shadow stack is not enabled on the new thread, skip any
+ * switch to a new shadow stack.
+ */
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ /*
+ * For CLONE_VM, except vfork, the child needs a separate shadow
+ * stack.
+ */
+ if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+ return 0;
+
+ size = adjust_shstk_size(stack_size);
+ addr = alloc_shstk(size);
+ if (IS_ERR_VALUE(addr))
+ return PTR_ERR((void *)addr);
+
+ shstk->base = addr;
+ shstk->size = size;
+
+ *shstk_addr = addr + size;
+
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -134,7 +167,13 @@ void shstk_free(struct task_struct *tsk)
!features_enabled(ARCH_SHSTK_SHSTK))
return;

- if (!tsk->mm)
+ /*
+ * When fork() with CLONE_VM fails, the child (tsk) already has a
+ * shadow stack allocated, and exit_thread() calls this function to
+ * free it. In this case the parent (current) and the child share
+ * the same mm struct.
+ */
+ if (!tsk->mm || tsk->mm != current->mm)
return;

unmap_shadow_stack(shstk->base, shstk->size);
--
2.17.1


2023-02-18 21:22:19

by Edgecombe, Rick P

Subject: [PATCH v6 31/41] x86/shstk: Introduce routines modifying shstk

From: Yu-cheng Yu <[email protected]>

Shadow stacks are normally written to via CALL/RET or specific CET
instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
operations the kernel will need to write to them directly using the
ring-0 only WRUSS instruction.

A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, which is within
the same shadow stack. This is distinctly different from other pointers
on the shadow stack, since those pointers point to executable code.
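
The token layout can be illustrated with a small userspace model of what
create_rstor_token() in this patch computes. Nothing here touches a real
shadow stack; model_rstor_token() and MODEL_FRAME_SIZE are hypothetical
names for the example.

```c
#include <stdint.h>

#define MODEL_FRAME_SIZE 8ULL

/*
 * Hypothetical model: given the SSP that a later RSTORSSP should end up
 * with, compute where the token lives and what value it holds.
 * Returns 0 on success, -1 if ssp is not 8-byte aligned.
 */
int model_rstor_token(uint64_t ssp, uint64_t *token_addr, uint64_t *token_val)
{
	if (ssp % MODEL_FRAME_SIZE)
		return -1;

	/* The token sits in the frame directly below ssp ... */
	*token_addr = ssp - MODEL_FRAME_SIZE;
	/* ... and holds ssp itself, with bit 0 set to mark it 64-bit. */
	*token_val = ssp | 1ULL;
	return 0;
}
```

Masking off the low bits of the token value gives token_addr + 8, which is
the "points directly above the token" property described above.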

Introduce token setup and verify routines. Also introduce WRUSS, which is
a kernel-mode instruction but writes directly to user shadow stack.

In future patches that enable shadow stack to work with signals, the kernel
will need something to denote the point in the stack where sigreturn may be
called. This will prevent attackers calling sigreturn at arbitrary places
in the stack, in order to help prevent SROP attacks.

To do this, something that can only be written by the kernel needs to be
placed on the shadow stack. This can be accomplished by setting bit 63 in
the frame written to the shadow stack. Userspace return addresses can't
have this bit set as it is in the kernel range. It also can't be a
valid restore token.
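
The bit 63 data format can be sketched as an encode/decode pair. This is a
userspace model of the checks in put_shstk_data() and get_shstk_data() from
this patch, operating on a plain variable instead of shadow stack memory;
the model_* names are hypothetical.

```c
#include <stdint.h>

#define MODEL_BIT63 (1ULL << 63)

/* Encode: refuse data that already has bit 63 set, then tag it. */
int model_put_data(uint64_t *slot, uint64_t data)
{
	if (data & MODEL_BIT63)
		return -1;
	*slot = data | MODEL_BIT63;
	return 0;
}

/* Decode: only accept frames carrying the tag, and strip it. */
int model_get_data(uint64_t slot, uint64_t *data)
{
	if (!(slot & MODEL_BIT63))
		return -1;
	*data = slot & ~MODEL_BIT63;
	return 0;
}
```

A userspace return address never has bit 63 set, so the decode step rejects
it, and a tagged frame is not 8-byte-plus-bit-0 shaped like a restore token
pointing into user memory.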

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---
v5:
- Fix typo in commit log

v3:
- Drop shstk_check_rstor_token()
- Fail put_shstk_data() if bit 63 is set in the data (Kees)
- Add comment in create_rstor_token() (Kees)
- Pull in create_rstor_token() changes from future patch (Kees)

v2:
- Add data helpers for writing to shadow stack.

v1:
- Use xsave helpers.
---
arch/x86/include/asm/special_insns.h | 13 +++++
arch/x86/kernel/shstk.c | 73 ++++++++++++++++++++++++++++
2 files changed, 86 insertions(+)

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index de48d1389936..d6cd9344f6c7 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -202,6 +202,19 @@ static inline void clwb(volatile void *__p)
: [pax] "a" (p));
}

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+ asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+ _ASM_EXTABLE(1b, %l[fail])
+ :: [addr] "r" (addr), [val] "r" (val)
+ :: fail);
+ return 0;
+fail:
+ return -EFAULT;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
#define nop() asm volatile ("nop")

static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 1d30295e0066..13c02747386f 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,8 @@
#include <asm/fpu/api.h>
#include <asm/prctl.h>

+#define SS_FRAME_SIZE 8
+
static bool features_enabled(unsigned long features)
{
return current->thread.features & features;
@@ -40,6 +42,35 @@ static void features_clr(unsigned long features)
current->thread.features &= ~features;
}

+/*
+ * Create a restore token on the shadow stack. A token is always 8-byte
+ * and aligned to 8.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+ unsigned long addr;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(ssp, 8))
+ return -EINVAL;
+
+ addr = ssp - SS_FRAME_SIZE;
+
+ /*
+ * SSP is aligned, so reserved bits and mode bit are zero, just mark
+ * the token 64-bit.
+ */
+ ssp |= BIT(0);
+
+ if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+ return -EFAULT;
+
+ if (token_addr)
+ *token_addr = addr;
+
+ return 0;
+}
+
static unsigned long alloc_shstk(unsigned long size)
{
int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
@@ -159,6 +190,48 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
return 0;
}

+static unsigned long get_user_shstk_addr(void)
+{
+ unsigned long long ssp;
+
+ fpregs_lock_and_load();
+
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+ fpregs_unlock();
+
+ return ssp;
+}
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+ if (WARN_ON_ONCE(data & BIT(63)))
+ return -EINVAL;
+
+ /*
+ * Mark the high bit so that the sigframe can't be processed as a
+ * return address.
+ */
+ if (write_user_shstk_64(addr, data | BIT(63)))
+ return -EFAULT;
+ return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+ unsigned long ldata;
+
+ if (unlikely(get_user(ldata, addr)))
+ return -EFAULT;
+
+ if (!(ldata & BIT(63)))
+ return -EINVAL;
+
+ *data = ldata & ~BIT(63);
+
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
--
2.17.1


2023-02-18 21:22:21

by Edgecombe, Rick P

Subject: [PATCH v6 32/41] x86/shstk: Handle signals for shadow stack

From: Yu-cheng Yu <[email protected]>

When a signal is handled normally the context is pushed to the stack
before handling it. For shadow stacks, since the shadow stack only tracks
return addresses, there isn't any state that needs to be pushed. However,
there are still a few things that need to be done. These things are
visible to userspace and will be kernel ABI for shadow stacks.

One is to make sure the restorer address is written to shadow stack, since
the signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn. So add the restorer on the shadow stack
before handling the signal, so there is not a conflict when the signal
handler returns to the restorer.

The other thing to do is to place some type of checkable token on the
thread's shadow stack before handling the signal and check it during
sigreturn. This is an extra layer of protection to hamper attackers
calling sigreturn manually as in SROP-like attacks.

For this token we can use the shadow stack data format defined earlier.
Have the data pushed be the previous SSP. In the future the sigreturn
might want to return back to a different stack. Storing the SSP (instead
of a restore offset or something) allows for future functionality that
may want to restore to a different stack.

So, when handling a signal, push:
- the previous SSP, in the shadow stack data format
- the restorer address, below that token.

In sigreturn, verify SSP is stored in the data format and pop the shadow
stack.
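
The push and pop sequence can be walked through with a userspace model that
uses an ordinary array in place of the shadow stack. It follows the steps of
setup_signal_shadow_stack() and restore_signal_shadow_stack() in this patch,
but struct model_shstk, the model_* functions and the 16-slot size are all
assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_FRAME 8ULL
#define MODEL_BIT63 (1ULL << 63)

struct model_shstk {
	uint64_t base;      /* lowest simulated address */
	uint64_t slots[16]; /* sixteen 8-byte frames */
	uint64_t ssp;       /* simulated shadow stack pointer */
};

/* Map a simulated address to its slot, or NULL if out of range. */
static uint64_t *model_slot(struct model_shstk *s, uint64_t addr)
{
	if (addr % MODEL_FRAME || addr < s->base ||
	    addr >= s->base + sizeof(s->slots))
		return NULL;
	return &s->slots[(addr - s->base) / MODEL_FRAME];
}

/* Signal delivery: push the tagged old SSP, then the restorer address. */
int model_signal_setup(struct model_shstk *s, uint64_t restorer)
{
	uint64_t old = s->ssp, ssp = s->ssp;
	uint64_t *p;

	if (!old || old % MODEL_FRAME || (old & MODEL_BIT63))
		return -1;

	ssp -= MODEL_FRAME;
	p = model_slot(s, ssp);
	if (!p)
		return -1;
	*p = old | MODEL_BIT63;

	ssp -= MODEL_FRAME;
	p = model_slot(s, ssp);
	if (!p)
		return -1;
	*p = restorer;

	s->ssp = ssp;
	return 0;
}

/* sigreturn: the frame at SSP must be tagged data holding an aligned SSP. */
int model_sigreturn(struct model_shstk *s)
{
	uint64_t *p = model_slot(s, s->ssp);
	uint64_t target;

	if (!p || !(*p & MODEL_BIT63))
		return -1;

	target = *p & ~MODEL_BIT63;
	if (target % MODEL_FRAME)
		return -1;

	s->ssp = target;
	return 0;
}
```

In the model, calling model_sigreturn() while SSP still points at the
restorer address fails, because that frame lacks the bit 63 tag. It only
succeeds after the handler's RET has popped the restorer (SSP advanced by
8), which is the protection against calling sigreturn at arbitrary points.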

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>

---
v3:
- Drop shstk_setup_rstor_token() (Kees)
- Drop x32 signal support, since x32 support is dropped

v2:
- Switch to new shstk signal format

v1:
- Use xsave helpers.
- Expand commit log.

Yu-cheng v27:
- Eliminate saving shadow stack pointer to signal context.
---
arch/x86/include/asm/shstk.h | 5 ++
arch/x86/kernel/shstk.c | 98 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/signal.c | 1 +
arch/x86/kernel/signal_64.c | 6 +++
4 files changed, 110 insertions(+)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 1399f4df098b..acee68d30a07 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
#include <linux/types.h>

struct task_struct;
+struct ksignal;

#ifdef CONFIG_X86_USER_SHADOW_STACK
struct thread_shstk {
@@ -19,6 +20,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
unsigned long stack_size,
unsigned long *shstk_addr);
void shstk_free(struct task_struct *p);
+int setup_signal_shadow_stack(struct ksignal *ksig);
+int restore_signal_shadow_stack(void);
#else
static inline long shstk_prctl(struct task_struct *task, int option,
unsigned long arg2) { return -EINVAL; }
@@ -28,6 +31,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
unsigned long stack_size,
unsigned long *shstk_addr) { return 0; }
static inline void shstk_free(struct task_struct *p) {}
+static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
#endif /* CONFIG_X86_USER_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 13c02747386f..40f0a55762a9 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -232,6 +232,104 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
return 0;
}

+static int shstk_push_sigframe(unsigned long *ssp)
+{
+ unsigned long target_ssp = *ssp;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(*ssp, 8))
+ return -EINVAL;
+
+ if (!IS_ALIGNED(target_ssp, 8))
+ return -EINVAL;
+
+ *ssp -= SS_FRAME_SIZE;
+ if (put_shstk_data((void *__user)*ssp, target_ssp))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+ unsigned long token_addr;
+ int err;
+
+ err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+ if (unlikely(err))
+ return err;
+
+ /* Restore SSP aligned? */
+ if (unlikely(!IS_ALIGNED(token_addr, 8)))
+ return -EINVAL;
+
+ /* SSP in userspace? */
+ if (unlikely(token_addr >= TASK_SIZE_MAX))
+ return -EINVAL;
+
+ *ssp = token_addr;
+
+ return 0;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+ void __user *restorer = ksig->ka.sa.sa_restorer;
+ unsigned long ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ if (!restorer)
+ return -EINVAL;
+
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
+ return -EINVAL;
+
+ err = shstk_push_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ /* Push restorer address */
+ ssp -= SS_FRAME_SIZE;
+ err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+ if (unlikely(err))
+ return -EFAULT;
+
+ fpregs_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+ unsigned long ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
+ return -EINVAL;
+
+ err = shstk_pop_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ fpregs_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 004cb30b7419..356253e85ce9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -40,6 +40,7 @@
#include <asm/syscall.h>
#include <asm/sigframe.h>
#include <asm/signal.h>
+#include <asm/shstk.h>

static inline int is_ia32_compat_frame(struct ksignal *ksig)
{
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 0e808c72bf7e..cacf2ede6217 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -175,6 +175,9 @@ int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
frame = get_sigframe(ksig, regs, sizeof(struct rt_sigframe), &fp);
uc_flags = frame_uc_flags(regs);

+ if (setup_signal_shadow_stack(ksig))
+ return -EFAULT;
+
if (!user_access_begin(frame, sizeof(*frame)))
return -EFAULT;

@@ -260,6 +263,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
goto badframe;

+ if (restore_signal_shadow_stack())
+ goto badframe;
+
if (restore_altstack(&frame->uc.uc_stack))
goto badframe;

--
2.17.1


2023-02-18 21:22:24

by Edgecombe, Rick P

Subject: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be set up
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
restore tokens being written into the middle of pre-used shadow stacks.
It is ideal to prevent restore tokens being added at arbitrary
locations, so the check was to make sure the shadow stack had never been
written to.
3. It stood out from the rest of the madvise flags, as more of a direct
action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and set up new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
set up its own shadow stacks using the WRSS instruction. To this end,
provide a flag so that stacks can optionally be set up securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
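
The argument checks the syscall performs before mapping can be summarized in
a userspace model. This sketch mirrors the order of checks in the
SYSCALL_DEFINE3(map_shadow_stack, ...) body of this patch, minus the CPU
feature check; model_map_shadow_stack_check(), the MODEL_* names and the
fixed 4 KiB page size are assumptions for illustration, and no memory is
actually mapped.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_SET_TOKEN (1U << 0) /* mirrors SHADOW_STACK_SET_TOKEN */
#define MODEL_PAGE      4096ULL   /* assumed page size */

/*
 * Hypothetical model of the argument validation. On success returns 0
 * and stores the page-aligned mapping size in *aligned_size.
 */
long model_map_shadow_stack_check(uint64_t addr, uint64_t size,
				  unsigned int flags, uint64_t *aligned_size)
{
	int set_tok = flags & MODEL_SET_TOKEN;
	uint64_t aligned;

	/* Unknown flags are rejected. */
	if (flags & ~MODEL_SET_TOKEN)
		return -EINVAL;

	/* A token needs at least one 8-byte frame. */
	if (set_tok && size < 8)
		return -EINVAL;

	/* Explicit addresses must be above 4 GiB. */
	if (addr && addr <= 0xFFFFFFFFULL)
		return -EINVAL;

	/* Page alignment must not wrap around. */
	aligned = (size + MODEL_PAGE - 1) & ~(MODEL_PAGE - 1);
	if (aligned < size)
		return -EOVERFLOW;

	*aligned_size = aligned;
	return 0;
}
```

Note that in the patch the token is placed relative to the requested size
rather than the aligned size, so a caller that wants the token at the very
top of the mapping would pass a page-multiple size.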

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v5:
- Fix addr/mapped_addr (Kees)
- Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch
suggests this)
- Return error for addresses below 4G

v3:
- Change syscall common -> 64 (Kees)
- Use bit shift notation instead of 0x1 for uapi header (Kees)
- Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
- Block unsupported flags (Kees)
- Require size >= 8 to set token (Kees)

v2:
- Change syscall to take address like mmap() for CRIU's usage

v1:
- New patch (replaces PROT_SHADOW_STACK).
---
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/uapi/asm/mman.h | 3 ++
arch/x86/kernel/shstk.c | 59 ++++++++++++++++++++++----
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 2 +-
kernel/sys_ni.c | 1 +
6 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..f65c671ce3b1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 64 map_shadow_stack sys_map_shadow_stack

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 5a0256e73f1e..8148bdddbd2c 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -13,6 +13,9 @@
((key) & 0x8 ? VM_PKEY_BIT3 : 0))
#endif

+/* Flags for map_shadow_stack(2) */
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */
+
#include <asm-generic/mman.h>

#endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 40f0a55762a9..0a3decab70ee 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -17,6 +17,7 @@
#include <linux/compat.h>
#include <linux/sizes.h>
#include <linux/user.h>
+#include <linux/syscalls.h>
#include <asm/msr.h>
#include <asm/fpu/xstate.h>
#include <asm/fpu/types.h>
@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
return 0;
}

-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+ unsigned long token_offset, bool set_res_tok)
{
int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
struct mm_struct *mm = current->mm;
- unsigned long addr, unused;
+ unsigned long mapped_addr, unused;

- mmap_write_lock(mm);
- addr = do_mmap(NULL, 0, size, PROT_READ, flags,
- VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+ if (addr)
+ flags |= MAP_FIXED_NOREPLACE;

+ mmap_write_lock(mm);
+ mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+ VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
mmap_write_unlock(mm);

- return addr;
+ if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+ goto out;
+
+ if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+ vm_munmap(mapped_addr, size);
+ return -EINVAL;
+ }
+
+out:
+ return mapped_addr;
}

static unsigned long adjust_shstk_size(unsigned long size)
@@ -134,7 +147,7 @@ static int shstk_setup(void)
return -EOPNOTSUPP;

size = adjust_shstk_size(0);
- addr = alloc_shstk(size);
+ addr = alloc_shstk(0, size, 0, false);
if (IS_ERR_VALUE(addr))
return PTR_ERR((void *)addr);

@@ -178,7 +191,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
return 0;

size = adjust_shstk_size(stack_size);
- addr = alloc_shstk(size);
+ addr = alloc_shstk(0, size, 0, false);
if (IS_ERR_VALUE(addr))
return PTR_ERR((void *)addr);

@@ -371,6 +384,36 @@ static int shstk_disable(void)
return 0;
}

+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+ bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+ unsigned long aligned_size;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ if (flags & ~SHADOW_STACK_SET_TOKEN)
+ return -EINVAL;
+
+ /* If there isn't space for a token */
+ if (set_tok && size < 8)
+ return -EINVAL;
+
+ if (addr && addr <= 0xFFFFFFFF)
+ return -EINVAL;
+
+ /*
+ * An overflow would result in attempting to write the restore token
+ * to the wrong location. Not catastrophic, but just return the right
+ * error code and block it.
+ */
+ aligned_size = PAGE_ALIGN(size);
+ if (aligned_size < size)
+ return -EOVERFLOW;
+
+ return alloc_shstk(addr, aligned_size, size, set_tok);
+}
+
long shstk_prctl(struct task_struct *task, int option, unsigned long features)
{
if (option == ARCH_SHSTK_LOCK) {
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 33a0ee3bcb2e..392dc11e3556 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..b12940ec5926 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)

#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452

/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..cb9aebd34646 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
COND_SYSCALL(modify_ldt);
COND_SYSCALL(vm86);
COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);

/* s390 */
COND_SYSCALL(s390_pci_mmio_read);
--
2.17.1


2023-02-18 21:22:40

by Edgecombe, Rick P

Subject: [PATCH v6 34/41] x86/shstk: Support WRSS for userspace

For the current shadow stack implementation, shadow stack contents can't
easily be provisioned with arbitrary data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, WRSS, which
can be enabled to write directly to shadow stack permissioned memory from
userspace. Allow it to get enabled via the prctl interface.

Only enable the userspace WRSS instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

From a fault handler perspective, WRSS will behave very similarly to WRUSS,
which is treated like a user access from a #PF err code perspective.
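
The MSR update used to toggle the WRSS enable bit is a read-modify-write
that skips the write when nothing changes. A minimal userspace sketch of
that logic follows. It is a model only: struct fake_msr, set_clr_bits() and
FAKE_WRSS_EN are hypothetical names chosen for illustration, and a plain
variable stands in for the real MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an MSR: a value plus a count of writes. */
struct fake_msr {
	uint64_t val;
	unsigned int writes;
};

/* Bit chosen for illustration only; not taken from a kernel header. */
#define FAKE_WRSS_EN (1ULL << 1)

/*
 * Clear the 'clear' bits, set the 'set' bits, and only write the new
 * value back when it differs from what was read.
 */
static void set_clr_bits(struct fake_msr *msr, uint64_t set, uint64_t clear)
{
	uint64_t old_val = msr->val;
	uint64_t new_val = (old_val & ~clear) | set;

	if (new_val != old_val) {
		msr->val = new_val;
		msr->writes++;
	}
}
```

Enabling a bit that is already set, or clearing one that is already clear,
leaves the write counter untouched, which mirrors how the helper avoids a
redundant MSR write.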

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Make set_clr_bits_msrl() avoid side effects in 'msr'

v5:
- Switch to EOPNOTSUPP
- Move set_clr_bits_msrl() to patch where it is first used
- Commit log formatting

v3:
- Make wrss_control() static
- Fix verbiage in commit log (Kees)

v2:
- Add some commit log verbiage from (Dave Hansen)

v1:
- New patch.
---
arch/x86/include/asm/msr.h | 11 +++++++++++
arch/x86/include/uapi/asm/prctl.h | 1 +
arch/x86/kernel/shstk.c | 32 ++++++++++++++++++++++++++++++-
3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..2d3b35c957ad 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
int msr_set_bit(u32 msr, u8 bit);
int msr_clear_bit(u32 msr, u8 bit);

+/* Helper that can never get accidentally un-inlined. */
+#define set_clr_bits_msrl(msr, set, clear) do { \
+ u64 __val, __new_val, __msr = msr; \
+ \
+ rdmsrl(__msr, __val); \
+ __new_val = (__val & ~(clear)) | (set); \
+ \
+ if (__new_val != __val) \
+ wrmsrl(__msr, __new_val); \
+} while (0)
+
#ifdef CONFIG_SMP
int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 7dfd9dc00509..e31495668056 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -28,5 +28,6 @@

/* ARCH_SHSTK_ features bits */
#define ARCH_SHSTK_SHSTK (1ULL << 0)
+#define ARCH_SHSTK_WRSS (1ULL << 1)

#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 0a3decab70ee..009cb3fa0ae5 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -363,6 +363,36 @@ void shstk_free(struct task_struct *tsk)
unmap_shadow_stack(shstk->base, shstk->size);
}

+static int wrss_control(bool enable)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ /*
+ * Only enable wrss if shadow stack is enabled. If shadow stack is not
+ * enabled, wrss will already be disabled, so don't bother clearing it
+ * when disabling.
+ */
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return -EPERM;
+
+ /* Already enabled/disabled? */
+ if (features_enabled(ARCH_SHSTK_WRSS) == enable)
+ return 0;
+
+ fpregs_lock_and_load();
+ if (enable) {
+ set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
+ features_set(ARCH_SHSTK_WRSS);
+ } else {
+ set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
+ features_clr(ARCH_SHSTK_WRSS);
+ }
+ fpregs_unlock();
+
+ return 0;
+}
+
static int shstk_disable(void)
{
if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
@@ -379,7 +409,7 @@ static int shstk_disable(void)
fpregs_unlock();

shstk_free(current);
- features_clr(ARCH_SHSTK_SHSTK);
+ features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);

return 0;
}
--
2.17.1


2023-02-18 21:22:43

by Edgecombe, Rick P

Subject: [PATCH v6 35/41] x86: Expose thread features in /proc/$PID/status

Applications and loaders can have logic to decide whether to enable
shadow stack. They usually don't report whether shadow stack has been
enabled or not, so there is no way to verify whether an application
actually is protected by shadow stack.

Add two lines in /proc/$PID/status to report enabled and locked features.

Since this involves referring to arch specific defines in asm/prctl.h,
implement an arch breakout to emit the feature lines.
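
For a tool that wants to consume the new line, a minimal parsing sketch is
shown below. It assumes the line format emitted by this patch (the key
"x86_Thread_features:", a tab, then space-separated feature names). The
helper name status_line_has_feature() is hypothetical, and reading the
line out of /proc/$PID/status is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'feature' appears as a whole space-separated word in the
 * value part of an "x86_Thread_features:\t..." line. Lines with any other
 * key (including the "_locked" variant) return false.
 */
static bool status_line_has_feature(const char *line, const char *feature)
{
	static const char key[] = "x86_Thread_features:\t";
	size_t flen = strlen(feature);
	const char *p;

	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return false;

	p = line + sizeof(key) - 1;
	while (*p && *p != '\n') {
		/* Length of the current word, up to a space or newline. */
		size_t wlen = strcspn(p, " \n");

		if (wlen == flen && strncmp(p, feature, flen) == 0)
			return true;
		p += wlen;
		while (*p == ' ')
			p++;
	}
	return false;
}
```

Matching whole words rather than substrings keeps a query for one feature
name from matching a longer name that happens to contain it.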

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
[Switched to CET, added to commit log]
Signed-off-by: Rick Edgecombe <[email protected]>

---
v4:
- Remove "CET" references

v3:
- Move to /proc/pid/status (Kees)

v2:
- New patch
---
arch/x86/kernel/cpu/proc.c | 23 +++++++++++++++++++++++
fs/proc/array.c | 6 ++++++
include/linux/proc_fs.h | 2 ++
3 files changed, 31 insertions(+)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 099b6f0d96bd..31c0e68f6227 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -4,6 +4,8 @@
#include <linux/string.h>
#include <linux/seq_file.h>
#include <linux/cpufreq.h>
+#include <asm/prctl.h>
+#include <linux/proc_fs.h>

#include "cpu.h"

@@ -175,3 +177,24 @@ const struct seq_operations cpuinfo_op = {
.stop = c_stop,
.show = show_cpuinfo,
};
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static void dump_x86_features(struct seq_file *m, unsigned long features)
+{
+ if (features & ARCH_SHSTK_SHSTK)
+ seq_puts(m, "shstk ");
+ if (features & ARCH_SHSTK_WRSS)
+ seq_puts(m, "wrss ");
+}
+
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task)
+{
+ seq_puts(m, "x86_Thread_features:\t");
+ dump_x86_features(m, task->thread.features);
+ seq_putc(m, '\n');
+
+ seq_puts(m, "x86_Thread_features_locked:\t");
+ dump_x86_features(m, task->thread.features_locked);
+ seq_putc(m, '\n');
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49283b8103c7..7ac43ecda1c2 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -428,6 +428,11 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
}

+__weak void arch_proc_pid_thread_features(struct seq_file *m,
+ struct task_struct *task)
+{
+}
+
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -451,6 +456,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+ arch_proc_pid_thread_features(m, task);
return 0;
}

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 0260f5ea98fe..80ff8e533cbd 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -158,6 +158,8 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task);
#endif /* CONFIG_PROC_PID_ARCH_STATUS */

+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task);
+
#else /* CONFIG_PROC_FS */

static inline void proc_root_init(void)
--
2.17.1


2023-02-18 21:22:49

by Edgecombe, Rick P

Subject: [PATCH v6 36/41] x86/shstk: Wire in shadow stack interface

The kernel now has the main shadow stack functionality to support
applications. Wire in the WRSS and shadow stack enable/disable functions
into the existing shadow stack API skeleton.
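
The resulting enable/disable behavior can be summarized with a small
userspace model. This is a simplified sketch, not the kernel code: struct
thread_model, model_enable() and model_disable() are hypothetical, and the
CPU feature check, shadow stack allocation and MSR writes are left out.
Only the ordering of the feature checks and the WRSS-requires-shadow-stack
rule are represented.

```c
#include <assert.h>
#include <errno.h>

/* Feature bits as defined in this series' asm/prctl.h. */
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#define ARCH_SHSTK_WRSS  (1ULL << 1)

/* Hypothetical per-thread state; the kernel tracks this per task. */
struct thread_model {
	unsigned long long features;
};

/* Model of ARCH_SHSTK_ENABLE: shadow stack is checked before WRSS. */
static int model_enable(struct thread_model *t, unsigned long long features)
{
	if (features & ARCH_SHSTK_SHSTK) {
		t->features |= ARCH_SHSTK_SHSTK;
		return 0;
	}
	if (features & ARCH_SHSTK_WRSS) {
		/* WRSS cannot be turned on without shadow stack. */
		if (!(t->features & ARCH_SHSTK_SHSTK))
			return -EPERM;
		t->features |= ARCH_SHSTK_WRSS;
		return 0;
	}
	return -EINVAL;
}

/* Model of ARCH_SHSTK_DISABLE: WRSS is checked before shadow stack. */
static int model_disable(struct thread_model *t, unsigned long long features)
{
	if (features & ARCH_SHSTK_WRSS) {
		if (!(t->features & ARCH_SHSTK_SHSTK))
			return -EPERM;
		t->features &= ~ARCH_SHSTK_WRSS;
		return 0;
	}
	if (features & ARCH_SHSTK_SHSTK) {
		/* Disabling shadow stack also drops WRSS. */
		t->features &= ~(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);
		return 0;
	}
	return -EINVAL;
}
```

Note that each call acts on a single feature: if both bits are passed, only
the first matching branch runs.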

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v4:
- Remove "CET" references

v2:
- Split from other patches
---
arch/x86/kernel/shstk.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 009cb3fa0ae5..2faf9b45ac72 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -464,9 +464,17 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
return -EINVAL;

if (option == ARCH_SHSTK_DISABLE) {
+ if (features & ARCH_SHSTK_WRSS)
+ return wrss_control(false);
+ if (features & ARCH_SHSTK_SHSTK)
+ return shstk_disable();
return -EINVAL;
}

/* Handle ARCH_SHSTK_ENABLE */
+ if (features & ARCH_SHSTK_SHSTK)
+ return shstk_setup();
+ if (features & ARCH_SHSTK_WRSS)
+ return wrss_control(true);
return -EINVAL;
}
--
2.17.1


2023-02-18 21:23:10

by Edgecombe, Rick P

Subject: [PATCH v6 38/41] x86/fpu: Add helper for initing features

If an xfeature is saved in a buffer, the xfeature's bit will be set in
xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
is in its init state. In this case the xfeature buffer address cannot
be retrieved with get_xsave_addr().

Future patches will need to handle the case of writing to an xfeature
that may not be saved. So provide helpers to init an xfeature in an
xsave buffer.

This could of course be done directly by reaching into the xsave buffer,
however this would not be robust against future changes to optimize the
xsave buffer by compacting it. In that case the xsave buffer would need
to be re-arranged as well. So the logic properly belongs encapsulated
in a helper where the logic can be unified.
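
The bookkeeping involved is a single bit per feature in the xsave header.
A reduced model is sketched below, with hypothetical names
(struct model_xsave_header, model_xfeature_saved(), model_init_xfeature())
and none of the kernel's sanity checks or buffer layout handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced xsave header: only the xfeatures bitmap. */
struct model_xsave_header {
	uint64_t xfeatures;
};

/* True if the buffer holds saved state for feature number 'nr'. */
static bool model_xfeature_saved(const struct model_xsave_header *h, int nr)
{
	return (h->xfeatures & (1ULL << nr)) != 0;
}

/* Mark feature 'nr' as present so its state can be located and written. */
static void model_init_xfeature(struct model_xsave_header *h, int nr)
{
	h->xfeatures |= 1ULL << nr;
}
```

A caller would check whether the feature is saved, init it if not, and only
then ask for the feature's address in the buffer.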

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v2:
- New patch
---
arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
arch/x86/kernel/fpu/xstate.h | 6 ++++
2 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 13a80521dd51..3ff80be0a441 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -934,6 +934,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
}

+static int xsave_buffer_access_checks(int xfeature_nr)
+{
+ /*
+ * Do we even *have* xsave state?
+ */
+ if (!boot_cpu_has(X86_FEATURE_XSAVE))
+ return 1;
+
+ /*
+ * We should not ever be requesting features that we
+ * have not enabled.
+ */
+ if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ return 1;
+
+ return 0;
+}
+
/*
* Given the xsave area and a state inside, this function returns the
* address of the state.
@@ -954,17 +972,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
*/
void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
{
- /*
- * Do we even *have* xsave state?
- */
- if (!boot_cpu_has(X86_FEATURE_XSAVE))
- return NULL;
-
- /*
- * We should not ever be requesting features that we
- * have not enabled.
- */
- if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ if (xsave_buffer_access_checks(xfeature_nr))
return NULL;

/*
@@ -984,6 +992,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
return __raw_xsave_addr(xsave, xfeature_nr);
}

+/*
+ * Given the xsave area and a state inside, this function
+ * initializes an xfeature in the buffer.
+ *
+ * get_xsave_addr() will return NULL if the feature bit is
+ * not present in the header. This function will make it so
+ * the xfeature buffer address is ready to be retrieved by
+ * get_xsave_addr().
+ *
+ * Inputs:
+ * xsave: the thread's storage area for all FPU data
+ * xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
+ * XFEATURE_SSE, etc...)
+ * Output:
+ * 1 if the feature cannot be inited, 0 on success
+ */
+int init_xfeature(struct xregs_state *xsave, int xfeature_nr)
+{
+ if (xsave_buffer_access_checks(xfeature_nr))
+ return 1;
+
+ /*
+ * Mark the feature inited.
+ */
+ xsave->header.xfeatures |= BIT_ULL(xfeature_nr);
+ return 0;
+}
+
#ifdef CONFIG_ARCH_HAS_PKEYS

/*
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..dc06f63063ee 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -54,6 +54,12 @@ extern void fpu__init_cpu_xstate(void);
extern void fpu__init_system_xstate(unsigned int legacy_size);

extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern int init_xfeature(struct xregs_state *xsave, int xfeature_nr);
+
+static inline int xfeature_saved(struct xregs_state *xsave, int xfeature_nr)
+{
+ return xsave->header.xfeatures & BIT_ULL(xfeature_nr);
+}

static inline u64 xfeatures_mask_supervisor(void)
{
--
2.17.1


2023-02-18 21:23:12

by Edgecombe, Rick P

Subject: [PATCH v6 37/41] selftests/x86: Add shadow stack test

Add a simple selftest for exercising some shadow stack behavior:
- map_shadow_stack syscall and pivot
- Faulting in shadow stack memory
- Handling shadow stack violations
- GUP of shadow stack memory
- mprotect() of shadow stack memory
- Userfaultfd on shadow stack memory
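
As background for the map_shadow_stack part of the test, the argument
checks made by the syscall added earlier in this series can be modeled
roughly as below. This is a userspace sketch under assumptions:
MODEL_PAGE_SIZE and MODEL_SET_TOKEN are placeholder values rather than the
kernel's definitions, model_check_args() is a hypothetical helper, and the
actual mapping and token write are not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL		/* assumed page size */
#define MODEL_SET_TOKEN (1U << 0)	/* assumed flag value */

/*
 * Return 0 and store the page-aligned size if the arguments would pass
 * the checks, or a negative errno otherwise.
 */
static int model_check_args(unsigned long addr, unsigned long size,
			    unsigned int flags, unsigned long *aligned)
{
	bool set_tok = flags & MODEL_SET_TOKEN;
	unsigned long a;

	/* Unknown flags are rejected. */
	if (flags & ~MODEL_SET_TOKEN)
		return -EINVAL;
	/* A restore token needs 8 bytes. */
	if (set_tok && size < 8)
		return -EINVAL;
	/* A requested address must lie above 4 GiB. */
	if (addr && addr <= 0xFFFFFFFFUL)
		return -EINVAL;
	/* Round up to a page; wrap-around means the size was too large. */
	a = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	if (a < size)
		return -EOVERFLOW;

	*aligned = a;
	return 0;
}
```

With these checks, the selftest's request for SS_SIZE (0x200000) bytes with
a token at an unspecified address passes unchanged, since that size is
already page aligned.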

Since this test exercises a recently added syscall manually, it needs
to find the automatically created __NR_foo defines. Per the selftest
documentation, KHDR_INCLUDES can be used to help the selftest Makefiles
find the headers from the kernel source. This way the new selftest can
be built inside the kernel source tree without installing the headers
to the system. So also add KHDR_INCLUDES as described in the selftest
docs, to facilitate this.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Co-developed-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v6:
- Tweak mprotect test
- Code style tweaks

v5:
- Update 32 bit signal test with new ABI and better asm

v4:
- Add test for 32 bit signal ABI blocking

v3:
- Change "+m" to "=m" in write_shstk() (Andrew Cooper)
- Fix userfaultfd test with transparent huge pages by doing a
MADV_DONTNEED, since the token write faults in the whole stack with
huge pages.
---
tools/testing/selftests/x86/Makefile | 4 +-
.../testing/selftests/x86/test_shadow_stack.c | 676 ++++++++++++++++++
2 files changed, 678 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 0388c4d60af0..cfc8a26ad151 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- corrupt_xstate_header amx
+ corrupt_xstate_header amx test_shadow_stack
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

@@ -34,7 +34,7 @@ BINARIES_64 := $(TARGETS_C_64BIT_ALL:%=%_64)
BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))

-CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
+CFLAGS := -O2 -g -std=gnu99 -pthread -Wall $(KHDR_INCLUDES)

# call32_from_64 in thunks.S uses absolute addresses.
ifeq ($(CAN_BUILD_WITH_NOPIE),1)
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
new file mode 100644
index 000000000000..71de3527c67a
--- /dev/null
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -0,0 +1,676 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program tests basic kernel shadow stack support. It enables shadow
+ * stack manually via arch_prctl(), instead of relying on glibc. Its
+ * Makefile doesn't compile with shadow stack support, so it doesn't rely on
+ * any particular glibc. As a result it can't do any operations that require
+ * special glibc shadow stack support (longjmp(), swapcontext(), etc). Just
+ * stick to the basics and hope the compiler doesn't do anything strange.
+ */
+
+#define _GNU_SOURCE
+
+#include <sys/syscall.h>
+#include <asm/mman.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <x86intrin.h>
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#include <stdint.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <setjmp.h>
+
+#define SS_SIZE 0x200000
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+ printf("[SKIP]\tCompiler does not support CET.\n");
+ return 0;
+}
+#else
+void write_shstk(unsigned long *addr, unsigned long val)
+{
+ asm volatile("wrssq %[val], (%[addr])\n"
+ : "=m" (addr)
+ : [addr] "r" (addr), [val] "r" (val));
+}
+
+static inline unsigned long __attribute__((always_inline)) get_ssp(void)
+{
+ unsigned long ret = 0;
+
+ asm volatile("xor %0, %0; rdsspq %0" : "=r" (ret));
+ return ret;
+}
+
+/*
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull in all
+ * of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2) \
+({ \
+ long _ret; \
+ register long _num asm("eax") = __NR_arch_prctl; \
+ register long _arg1 asm("rdi") = (long)(arg1); \
+ register long _arg2 asm("rsi") = (long)(arg2); \
+ \
+ asm volatile ( \
+ "syscall\n" \
+ : "=a"(_ret) \
+ : "r"(_arg1), "r"(_arg2), \
+ "0"(_num) \
+ : "rcx", "r11", "memory", "cc" \
+ ); \
+ _ret; \
+})
+
+void *create_shstk(void *addr)
+{
+ return (void *)syscall(__NR_map_shadow_stack, addr, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+void *create_normal_mem(void *addr)
+{
+ return mmap(addr, SS_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+}
+
+void free_shstk(void *shstk)
+{
+ munmap(shstk, SS_SIZE);
+}
+
+int reset_shstk(void *shstk)
+{
+ return madvise(shstk, SS_SIZE, MADV_DONTNEED);
+}
+
+void try_shstk(unsigned long new_ssp)
+{
+ unsigned long ssp;
+
+ printf("[INFO]\tnew_ssp = %lx, *new_ssp = %lx\n",
+ new_ssp, *((unsigned long *)new_ssp));
+
+ ssp = get_ssp();
+ printf("[INFO]\tchanging ssp from %lx to %lx\n", ssp, new_ssp);
+
+ asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+ asm volatile("saveprevssp");
+ printf("[INFO]\tssp is now %lx\n", get_ssp());
+
+ /* Switch back to original shadow stack */
+ ssp -= 8;
+ asm volatile("rstorssp (%0)\n":: "r" (ssp));
+ asm volatile("saveprevssp");
+}
+
+int test_shstk_pivot(void)
+{
+ void *shstk = create_shstk(0);
+
+ if (shstk == MAP_FAILED) {
+ printf("[FAIL]\tError creating shadow stack: %d\n", errno);
+ return 1;
+ }
+ try_shstk((unsigned long)shstk + SS_SIZE - 8);
+ free_shstk(shstk);
+
+ printf("[OK]\tShadow stack pivot\n");
+ return 0;
+}
+
+int test_shstk_faults(void)
+{
+ unsigned long *shstk = create_shstk(0);
+
+ /* Read shadow stack, test if it's zero to not get read optimized out */
+ if (*shstk != 0)
+ goto err;
+
+ /* Wrss memory that was already read. */
+ write_shstk(shstk, 1);
+ if (*shstk != 1)
+ goto err;
+
+ /* Page out memory, so we can wrss it again. */
+ if (reset_shstk((void *)shstk))
+ goto err;
+
+ write_shstk(shstk, 1);
+ if (*shstk != 1)
+ goto err;
+
+ printf("[OK]\tShadow stack faults\n");
+ return 0;
+
+err:
+ return 1;
+}
+
+unsigned long saved_ssp;
+unsigned long saved_ssp_val;
+volatile bool segv_triggered;
+
+void __attribute__((noinline)) violate_ss(void)
+{
+ saved_ssp = get_ssp();
+ saved_ssp_val = *(unsigned long *)saved_ssp;
+
+ /* Corrupt shadow stack */
+ printf("[INFO]\tCorrupting shadow stack\n");
+ write_shstk((void *)saved_ssp, 0);
+}
+
+void segv_handler(int signum, siginfo_t *si, void *uc)
+{
+ printf("[INFO]\tGenerated shadow stack violation successfully\n");
+
+ segv_triggered = true;
+
+ /* Fix shadow stack */
+ write_shstk((void *)saved_ssp, saved_ssp_val);
+}
+
+int test_shstk_violation(void)
+{
+ struct sigaction sa = {};
+
+ sa.sa_sigaction = segv_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ segv_triggered = false;
+
+ /* Make sure segv_triggered is set before violate_ss() */
+ asm volatile("" : : : "memory");
+
+ violate_ss();
+
+ signal(SIGSEGV, SIG_DFL);
+
+ printf("[OK]\tShadow stack violation test\n");
+
+ return !segv_triggered;
+}
+
+/* Gup test state */
+#define MAGIC_VAL 0x12345678
+bool is_shstk_access;
+void *shstk_ptr;
+int fd;
+
+void reset_test_shstk(void *addr)
+{
+ if (shstk_ptr)
+ free_shstk(shstk_ptr);
+ shstk_ptr = create_shstk(addr);
+}
+
+void test_access_fix_handler(int signum, siginfo_t *si, void *uc)
+{
+ printf("[INFO]\tViolation from %s\n", is_shstk_access ? "shstk access" : "normal write");
+
+ segv_triggered = true;
+
+ /* Fix shadow stack */
+ if (is_shstk_access) {
+ reset_test_shstk(shstk_ptr);
+ return;
+ }
+
+ free_shstk(shstk_ptr);
+ create_normal_mem(shstk_ptr);
+}
+
+bool test_shstk_access(void *ptr)
+{
+ is_shstk_access = true;
+ segv_triggered = false;
+ write_shstk(ptr, MAGIC_VAL);
+
+ asm volatile("" : : : "memory");
+
+ return segv_triggered;
+}
+
+bool test_write_access(void *ptr)
+{
+ is_shstk_access = false;
+ segv_triggered = false;
+ *(unsigned long *)ptr = MAGIC_VAL;
+
+ asm volatile("" : : : "memory");
+
+ return segv_triggered;
+}
+
+bool gup_write(void *ptr)
+{
+ unsigned long val = MAGIC_VAL;
+
+ lseek(fd, (unsigned long)ptr, SEEK_SET);
+ if (write(fd, &val, sizeof(val)) < 0)
+ return 1;
+
+ return 0;
+}
+
+bool gup_read(void *ptr)
+{
+ unsigned long val;
+
+ lseek(fd, (unsigned long)ptr, SEEK_SET);
+ if (read(fd, &val, sizeof(val)) < 0)
+ return 1;
+
+ return 0;
+}
+
+int test_gup(void)
+{
+ struct sigaction sa = {};
+ int status;
+ pid_t pid;
+
+ sa.sa_sigaction = test_access_fix_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ segv_triggered = false;
+
+ fd = open("/proc/self/mem", O_RDWR);
+ if (fd == -1)
+ return 1;
+
+ reset_test_shstk(0);
+ if (gup_read(shstk_ptr))
+ return 1;
+ if (test_shstk_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup read -> shstk access success\n");
+
+ reset_test_shstk(0);
+ if (gup_write(shstk_ptr))
+ return 1;
+ if (test_shstk_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup write -> shstk access success\n");
+
+ reset_test_shstk(0);
+ if (gup_read(shstk_ptr))
+ return 1;
+ if (!test_write_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup read -> write access success\n");
+
+ reset_test_shstk(0);
+ if (gup_write(shstk_ptr))
+ return 1;
+ if (!test_write_access(shstk_ptr))
+ return 1;
+ printf("[INFO]\tGup write -> write access success\n");
+
+ close(fd);
+
+ /* COW/gup test */
+ reset_test_shstk(0);
+ pid = fork();
+ if (!pid) {
+ fd = open("/proc/self/mem", O_RDWR);
+ if (fd == -1)
+ exit(1);
+
+ if (gup_write(shstk_ptr)) {
+ close(fd);
+ exit(1);
+ }
+ close(fd);
+ exit(0);
+ }
+ waitpid(pid, &status, 0);
+ if (WEXITSTATUS(status)) {
+ printf("[FAIL]\tWrite in child failed\n");
+ return 1;
+ }
+ if (*(unsigned long *)shstk_ptr == MAGIC_VAL) {
+ printf("[FAIL]\tWrite in child wrote through to shared memory\n");
+ return 1;
+ }
+
+ printf("[INFO]\tCow gup write -> write access success\n");
+
+ free_shstk(shstk_ptr);
+
+ signal(SIGSEGV, SIG_DFL);
+
+ printf("[OK]\tShadow gup test\n");
+
+ return 0;
+}
+
+int test_mprotect(void)
+{
+ struct sigaction sa = {};
+
+ sa.sa_sigaction = test_access_fix_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ segv_triggered = false;
+
+ /* mprotect a shadow stack as read only */
+ reset_test_shstk(0);
+ if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+ printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+ return 1;
+ }
+
+ /* try to wrss it and fail */
+ if (!test_shstk_access(shstk_ptr)) {
+ printf("[FAIL]\tShadow stack access to read-only memory succeeded\n");
+ return 1;
+ }
+
+ /*
+ * The shadow stack was reset above to resolve the fault, make the new one
+ * read-only.
+ */
+ if (mprotect(shstk_ptr, SS_SIZE, PROT_READ) < 0) {
+ printf("[FAIL]\tmprotect(PROT_READ) failed\n");
+ return 1;
+ }
+
+ /* then back to writable */
+ if (mprotect(shstk_ptr, SS_SIZE, PROT_WRITE | PROT_READ) < 0) {
+ printf("[FAIL]\tmprotect(PROT_WRITE) failed\n");
+ return 1;
+ }
+
+ /* then wrss to it and succeed */
+ if (test_shstk_access(shstk_ptr)) {
+ printf("[FAIL]\tShadow stack access to mprotect() writable memory failed\n");
+ return 1;
+ }
+
+ free_shstk(shstk_ptr);
+
+ signal(SIGSEGV, SIG_DFL);
+
+ printf("[OK]\tmprotect() test\n");
+
+ return 0;
+}
+
+char zero[4096];
+
+static void *uffd_thread(void *arg)
+{
+ struct uffdio_copy req;
+ int uffd = *(int *)arg;
+ struct uffd_msg msg;
+
+ if (read(uffd, &msg, sizeof(msg)) <= 0)
+ return (void *)1;
+
+ req.dst = msg.arg.pagefault.address;
+ req.src = (__u64)zero;
+ req.len = 4096;
+ req.mode = 0;
+
+ if (ioctl(uffd, UFFDIO_COPY, &req))
+ return (void *)1;
+
+ return (void *)0;
+}
+
+int test_userfaultfd(void)
+{
+ struct uffdio_register uffdio_register;
+ struct uffdio_api uffdio_api;
+ struct sigaction sa = {};
+ pthread_t thread;
+ void *res;
+ int uffd;
+
+ sa.sa_sigaction = test_access_fix_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ printf("[SKIP]\tUserfaultfd unavailable.\n");
+ return 0;
+ }
+
+ reset_test_shstk(0);
+
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = 0;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+ goto err;
+
+ uffdio_register.range.start = (__u64)shstk_ptr;
+ uffdio_register.range.len = 4096;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+ goto err;
+
+ if (pthread_create(&thread, NULL, &uffd_thread, &uffd))
+ goto err;
+
+ reset_shstk(shstk_ptr);
+ test_shstk_access(shstk_ptr);
+
+ if (pthread_join(thread, &res))
+ goto err;
+
+ if (test_shstk_access(shstk_ptr))
+ goto err;
+
+ free_shstk(shstk_ptr);
+
+ signal(SIGSEGV, SIG_DFL);
+
+ if (!res)
+ printf("[OK]\tUserfaultfd test\n");
+ return !!res;
+err:
+ free_shstk(shstk_ptr);
+ close(uffd);
+ signal(SIGSEGV, SIG_DFL);
+ return 1;
+}
+
+/*
+ * Too complicated to pull it out of the 32 bit header, but also get the
+ * 64 bit one needed above. Just define a copy here.
+ */
+#define __NR_compat_sigaction 67
+
+/*
+ * Call 32 bit signal handler to get 32 bit signals ABI. Make sure
+ * to push the registers that will get clobbered.
+ */
+int sigaction32(int signum, const struct sigaction *restrict act,
+ struct sigaction *restrict oldact)
+{
+ register long syscall_reg asm("eax") = __NR_compat_sigaction;
+ register long signum_reg asm("ebx") = signum;
+ register long act_reg asm("ecx") = (long)act;
+ register long oldact_reg asm("edx") = (long)oldact;
+ int ret = 0;
+
+ asm volatile ("int $0x80;"
+ : "=a"(ret), "=m"(oldact)
+ : "r"(syscall_reg), "r"(signum_reg), "r"(act_reg),
+ "r"(oldact_reg)
+ : "r8", "r9", "r10", "r11"
+ );
+
+ return ret;
+}
+
+sigjmp_buf jmp_buffer;
+
+void segv_gp_handler(int signum, siginfo_t *si, void *uc)
+{
+ segv_triggered = true;
+
+ /*
+ * To work with old glibc, this can't rely on siglongjmp working with
+ * shadow stack enabled, so disable shadow stack before siglongjmp().
+ */
+ ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
+ siglongjmp(jmp_buffer, -1);
+}
+
+/*
+ * Transition to 32 bit mode and check that a #GP triggers a segfault.
+ */
+int test_32bit(void)
+{
+ struct sigaction sa = {};
+ struct sigaction *sa32;
+
+ /* Create sigaction in 32 bit address range */
+ sa32 = mmap(0, 4096, PROT_READ | PROT_WRITE,
+ MAP_32BIT | MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+ sa32->sa_flags = SA_SIGINFO;
+
+ sa.sa_sigaction = segv_gp_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL))
+ return 1;
+
+ segv_triggered = false;
+
+ /* Make sure segv_triggered is set before triggering the #GP */
+ asm volatile("" : : : "memory");
+
+ /*
+ * Set handler to somewhere in 32 bit address space
+ */
+ sa32->sa_handler = (void *)sa32;
+ if (sigaction32(SIGUSR1, sa32, NULL))
+ return 1;
+
+ if (!sigsetjmp(jmp_buffer, 1))
+ raise(SIGUSR1);
+
+ if (segv_triggered)
+ printf("[OK]\t32 bit test\n");
+
+ return !segv_triggered;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = 0;
+
+ if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+ printf("[SKIP]\tCould not enable Shadow stack\n");
+ return 1;
+ }
+
+ if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+ ret = 1;
+ printf("[FAIL]\tDisabling shadow stack failed\n");
+ }
+
+ if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+ printf("[SKIP]\tCould not re-enable Shadow stack\n");
+ return 1;
+ }
+
+ if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS)) {
+ printf("[SKIP]\tCould not enable WRSS\n");
+ ret = 1;
+ goto out;
+ }
+
+ /* Should have succeeded if here, but this is a test, so double check. */
+ if (!get_ssp()) {
+ printf("[FAIL]\tShadow stack disabled\n");
+ return 1;
+ }
+
+ if (test_shstk_pivot()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack pivot\n");
+ goto out;
+ }
+
+ if (test_shstk_faults()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack fault test\n");
+ goto out;
+ }
+
+ if (test_shstk_violation()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack violation test\n");
+ goto out;
+ }
+
+ if (test_gup()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack gup test\n");
+ goto out;
+ }
+
+ if (test_mprotect()) {
+ ret = 1;
+ printf("[FAIL]\tShadow stack mprotect test\n");
+ goto out;
+ }
+
+ if (test_userfaultfd()) {
+ ret = 1;
+ printf("[FAIL]\tUserfaultfd test\n");
+ goto out;
+ }
+
+ if (test_32bit()) {
+ ret = 1;
+ printf("[FAIL]\t32 bit test\n");
+ }
+
+ return ret;
+
+out:
+ /*
+ * Disable shadow stack before the function returns, or there will be a
+ * shadow stack violation.
+ */
+ if (ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK)) {
+ ret = 1;
+ printf("[FAIL]\tDisabling shadow stack failed\n");
+ }
+
+ return ret;
+}
+#endif
--
2.17.1


2023-02-18 21:23:17

by Edgecombe, Rick P

Subject: [PATCH v6 39/41] x86: Add PTRACE interface for shadow stack

From: Yu-cheng Yu <[email protected]>

Some applications (like GDB) would like to tweak shadow stack state via
ptrace. This allows for existing functionality to continue to work for
seized shadow stack applications. Provide a regset interface for
manipulating the shadow stack pointer (SSP).

There is already ptrace functionality for accessing xstate, but this
does not include supervisor xfeatures. So there is not a completely
clear place for where to put the shadow stack state. Adding it to the
user xfeatures regset would complicate that code, as it currently shares
logic with signals which should not have supervisor features.

Don't add a general supervisor xfeature regset like the user one,
because it is better to maintain flexibility for other supervisor
xfeatures to define their own interface. For example, an xfeature may
decide not to expose all of its state to userspace, as is actually the
case for shadow stack ptrace functionality. A lot of enum values remain
to be used, so just put it in a dedicated shadow stack regset.

The only downside to not having a generic supervisor xfeature regset
is that apps need to be enlightened of any new supervisor xfeature
exposed this way (i.e. they can't try to have generic save/restore
logic). But maybe that is a good thing, because they have to think
through each new xfeature instead of encountering issues when a new
supervisor xfeature is added.

Adding a shadow stack regset also has the effect of including the
shadow stack state in a core dump, which could be useful for debugging.

The shadow stack specific xstate includes the SSP, and the shadow stack
and WRSS enablement status. Enabling shadow stack or wrss in the kernel
involves more than just flipping the bit. The kernel is made aware that
it has to do extra things when cloning or handling signals. That logic
is triggered off of separate feature enablement state kept in the task
struct. So flipping on HW shadow stack enforcement without notifying
the kernel to change its behavior would severely limit what an application
could do without crashing, and the results would depend on kernel
internal implementation details. There is also no known use for controlling
this state via ptrace today. So only expose the SSP, which is something
that userspace already has indirect control over.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>

---
v5:
- Check shadow stack enablement status for tracee (rppt)
- Fix typo in comment

v4:
- Make shadow stack only. Reduce to only supporting SSP register, and
remove CET references (peterz)
- Add comment to not use 0x203, because binutils already looks for it in
coredumps. (Christina Schimpe)

v3:
- Drop dependence on thread.shstk.size, and use thread.features bits
- Drop 32 bit support

v2:
- Check alignment on ssp.
- Block IBT bits.
- Handle init states instead of returning error.
- Add verbose commit log justifying the design.
---
arch/x86/include/asm/fpu/regset.h | 7 +--
arch/x86/kernel/fpu/regset.c | 86 +++++++++++++++++++++++++++++++
arch/x86/kernel/ptrace.c | 12 +++++
include/uapi/linux/elf.h | 2 +
4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/regset.h b/arch/x86/include/asm/fpu/regset.h
index 4f928d6a367b..697b77e96025 100644
--- a/arch/x86/include/asm/fpu/regset.h
+++ b/arch/x86/include/asm/fpu/regset.h
@@ -7,11 +7,12 @@

#include <linux/regset.h>

-extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active;
+extern user_regset_active_fn regset_fpregs_active, regset_xregset_fpregs_active,
+ ssp_active;
extern user_regset_get2_fn fpregs_get, xfpregs_get, fpregs_soft_get,
- xstateregs_get;
+ xstateregs_get, ssp_get;
extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
- xstateregs_set;
+ xstateregs_set, ssp_set;

/*
* xstateregs_active == regset_fpregs_active. Please refer to the comment
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 6d056b68f4ed..c806952d9496 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -8,6 +8,7 @@
#include <asm/fpu/api.h>
#include <asm/fpu/signal.h>
#include <asm/fpu/regset.h>
+#include <asm/prctl.h>

#include "context.h"
#include "internal.h"
@@ -174,6 +175,91 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
return ret;
}

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+int ssp_active(struct task_struct *target, const struct user_regset *regset)
+{
+ if (target->thread.features & ARCH_SHSTK_SHSTK)
+ return regset->n;
+
+ return 0;
+}
+
+int ssp_get(struct task_struct *target, const struct user_regset *regset,
+ struct membuf to)
+{
+ struct fpu *fpu = &target->thread.fpu;
+ struct cet_user_state *cetregs;
+
+ if (!boot_cpu_has(X86_FEATURE_USER_SHSTK))
+ return -ENODEV;
+
+ sync_fpstate(fpu);
+ cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+ if (!cetregs) {
+ /*
+ * The registers are in the init state. The init values for
+ * these regs are zero, so just zero the output buffer.
+ */
+ membuf_zero(&to, sizeof(cetregs->user_ssp));
+ return 0;
+ }
+
+ return membuf_write(&to, (unsigned long *)&cetregs->user_ssp,
+ sizeof(cetregs->user_ssp));
+}
+
+int ssp_set(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ struct fpu *fpu = &target->thread.fpu;
+ struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+ struct cet_user_state *cetregs;
+ unsigned long user_ssp;
+ int r;
+
+ if (!boot_cpu_has(X86_FEATURE_USER_SHSTK) ||
+ !ssp_active(target, regset))
+ return -ENODEV;
+
+ r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_ssp, 0, -1);
+ if (r)
+ return r;
+
+ /*
+ * Some kernel instructions (IRET, etc) can cause exceptions in the case
+ * of disallowed CET register values. Just prevent invalid values.
+ */
+ if ((user_ssp >= TASK_SIZE_MAX) || !IS_ALIGNED(user_ssp, 8))
+ return -EINVAL;
+
+ fpu_force_restore(fpu);
+
+ /*
+ * Don't want to init the xfeature until the kernel will definitely
+ * overwrite it, otherwise if it inits and then fails out, it would
+ * end up initing it to random data.
+ */
+ if (!xfeature_saved(xsave, XFEATURE_CET_USER) &&
+ WARN_ON(init_xfeature(xsave, XFEATURE_CET_USER)))
+ return -ENODEV;
+
+ cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+ if (WARN_ON(!cetregs)) {
+ /*
+ * This shouldn't ever be NULL because it was successfully
+ * inited above if needed. The only scenario would be if an
+ * xfeature was somehow saved in a buffer, but not enabled in
+ * xsave.
+ */
+ return -ENODEV;
+ }
+
+ cetregs->user_ssp = user_ssp;
+ return 0;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION

/*
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index dfaa270a7cc9..095f04bdabdc 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -58,6 +58,7 @@ enum x86_regset_64 {
REGSET64_FP,
REGSET64_IOPERM,
REGSET64_XSTATE,
+ REGSET64_SSP,
};

#define REGSET_GENERAL \
@@ -1267,6 +1268,17 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
.active = ioperm_active,
.regset_get = ioperm_get
},
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ [REGSET64_SSP] = {
+ .core_note_type = NT_X86_SHSTK,
+ .n = 1,
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = ssp_active,
+ .regset_get = ssp_get,
+ .set = ssp_set
+ },
+#endif
};

static const struct user_regset_view user_x86_64_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 4c6a8fa5e7ed..413a15c07121 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -406,6 +406,8 @@ typedef struct elf64_shdr {
#define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */
#define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */
#define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
+/* Old binutils treats 0x203 as a CET state */
+#define NT_X86_SHSTK 0x204 /* x86 SHSTK state */
#define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
#define NT_S390_TIMER 0x301 /* s390 timer register */
#define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
--
2.17.1


2023-02-18 21:23:20

by Edgecombe, Rick P

Subject: [PATCH v6 40/41] x86/shstk: Add ARCH_SHSTK_UNLOCK

From: Mike Rapoport <[email protected]>

Userspace loaders may lock features before a CRIU restore operation has
the chance to set them to whatever state is required by the process
being restored. Allow a way for CRIU to unlock features. Add it as an
arch_prctl() like the other shadow stack operations, but restrict it to being
called via the ptrace arch_prctl() interface.

Reviewed-by: Kees Cook <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
[Merged into recent API changes, added commit log and docs]
Signed-off-by: Rick Edgecombe <[email protected]>

---
v4:
- Add to docs that it is ptrace only.
- Remove "CET" references

v3:
- Depend on CONFIG_CHECKPOINT_RESTORE (Kees)
---
Documentation/x86/shstk.rst | 4 ++++
arch/x86/include/uapi/asm/prctl.h | 1 +
arch/x86/kernel/process_64.c | 1 +
arch/x86/kernel/shstk.c | 9 +++++++--
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
index f2e6f323cf68..e8ed5fc0f7ae 100644
--- a/Documentation/x86/shstk.rst
+++ b/Documentation/x86/shstk.rst
@@ -73,6 +73,10 @@ arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
are ignored. The mask is ORed with the existing value. So any feature bits
set here cannot be enabled or disabled afterwards.

+arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
+ Unlock features. 'features' is a mask of all features to unlock. All
+ bits set are processed, unset bits are ignored. Only works via ptrace.
+
The return values are as follows. On success, return 0. On error, errno can
be::

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index e31495668056..200efbbe5809 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -25,6 +25,7 @@
#define ARCH_SHSTK_ENABLE 0x5001
#define ARCH_SHSTK_DISABLE 0x5002
#define ARCH_SHSTK_LOCK 0x5003
+#define ARCH_SHSTK_UNLOCK 0x5004

/* ARCH_SHSTK_ features bits */
#define ARCH_SHSTK_SHSTK (1ULL << 0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 71094c8a305f..d368854fa9c4 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -835,6 +835,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_SHSTK_ENABLE:
case ARCH_SHSTK_DISABLE:
case ARCH_SHSTK_LOCK:
+ case ARCH_SHSTK_UNLOCK:
return shstk_prctl(task, option, arg2);
default:
ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 2faf9b45ac72..3197ff824809 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -451,9 +451,14 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long features)
return 0;
}

- /* Don't allow via ptrace */
- if (task != current)
+ /* Via ptrace, only allow ARCH_SHSTK_UNLOCK */
+ if (task != current) {
+ if (option == ARCH_SHSTK_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
+ task->thread.features_locked &= ~features;
+ return 0;
+ }
return -EINVAL;
+ }

/* Do not allow to change locked features */
if (features & task->thread.features_locked)
--
2.17.1


2023-02-18 21:23:37

by Edgecombe, Rick P

Subject: [PATCH v6 41/41] x86/shstk: Add ARCH_SHSTK_STATUS

CRIU and GDB need to get the current shadow stack and WRSS enablement
status. This information is already available via /proc/pid/status, but
this is inconvenient for CRIU because it involves parsing the text output
in an area of the code where this is difficult. Provide a status
arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a
userspace address, and make the new arch_prctl simply copy the features
out to userspace.
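
As a rough sketch of how a process might query its own status with the new
operation: shstk_status() and shstk_is_enabled() are hypothetical helper
names, and the constants are defined locally with the values from this
series in case the installed headers lack them. On a kernel without this
patch, or on a non-x86 build, the call is simply expected to fail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from this series; defined locally in case headers lack them. */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#endif
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#endif

/*
 * Hypothetical helper: ask the kernel which features are enabled for the
 * calling thread. Returns 0 on success, -1 on failure.
 */
static int shstk_status(unsigned long *features)
{
	assert(features != NULL);
#ifdef SYS_arch_prctl
	return syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, features) ? -1 : 0;
#else
	return -1;	/* arch_prctl() is x86 only */
#endif
}

/* Decode the shadow stack enable bit from a returned feature mask. */
static bool shstk_is_enabled(unsigned long features)
{
	return (features & ARCH_SHSTK_SHSTK) != 0;
}
```

A tool like CRIU would issue the same request on behalf of a tracee
through the ptrace arch_prctl() path rather than calling it on itself.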

Tested-by: Pengfei Xu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Suggested-by: Mike Rapoport <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
v5:
- Fix typo in commit log

v4:
- New patch
---
Documentation/x86/shstk.rst | 6 ++++++
arch/x86/include/asm/shstk.h | 2 +-
arch/x86/include/uapi/asm/prctl.h | 1 +
arch/x86/kernel/process_64.c | 1 +
arch/x86/kernel/shstk.c | 8 +++++++-
5 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/shstk.rst b/Documentation/x86/shstk.rst
index e8ed5fc0f7ae..7f4af798794e 100644
--- a/Documentation/x86/shstk.rst
+++ b/Documentation/x86/shstk.rst
@@ -77,6 +77,11 @@ arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
Unlock features. 'features' is a mask of all features to unlock. All
bits set are processed, unset bits are ignored. Only works via ptrace.

+arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr)
+ Copy the currently enabled features to the address passed in addr. The
+ features are described using the bits passed into the others in
+ 'features'.
+
The return values are as follows. On success, return 0. On error, errno can
be::

@@ -84,6 +89,7 @@ be::
-ENOTSUPP if the feature is not supported by the hardware or
kernel.
-EINVAL arguments (non existing feature, etc)
+ -EFAULT if could not copy information back to userspace

The feature's bits supported are::

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index acee68d30a07..be9267897211 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -14,7 +14,7 @@ struct thread_shstk {
u64 size;
};

-long shstk_prctl(struct task_struct *task, int option, unsigned long features);
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
void reset_thread_features(void);
int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
unsigned long stack_size,
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 200efbbe5809..1b85bc876c2d 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,6 +26,7 @@
#define ARCH_SHSTK_DISABLE 0x5002
#define ARCH_SHSTK_LOCK 0x5003
#define ARCH_SHSTK_UNLOCK 0x5004
+#define ARCH_SHSTK_STATUS 0x5005

/* ARCH_SHSTK_ features bits */
#define ARCH_SHSTK_SHSTK (1ULL << 0)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d368854fa9c4..dde43caf196e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -836,6 +836,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_SHSTK_DISABLE:
case ARCH_SHSTK_LOCK:
case ARCH_SHSTK_UNLOCK:
+ case ARCH_SHSTK_STATUS:
return shstk_prctl(task, option, arg2);
default:
ret = -EINVAL;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 3197ff824809..4069d5bbbe8c 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -444,8 +444,14 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsi
return alloc_shstk(addr, aligned_size, size, set_tok);
}

-long shstk_prctl(struct task_struct *task, int option, unsigned long features)
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
{
+ unsigned long features = arg2;
+
+ if (option == ARCH_SHSTK_STATUS) {
+ return put_user(task->thread.features, (unsigned long __user *)arg2);
+ }
+
if (option == ARCH_SHSTK_LOCK) {
task->thread.features_locked |= features;
return 0;
--
2.17.1


2023-02-19 20:38:41

by Kees Cook

Subject: Re: [PATCH v6 11/41] mm: Introduce pte_mkwrite_kernel()

On Sat, Feb 18, 2023 at 01:14:03PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> One of these changes is to allow for pte_mkwrite() to create different
> types of writable memory (the existing conventionally writable type and
> also the new shadow stack type). Future patches will convert pte_mkwrite()
> to take a VMA in order to facilitate this, however there are places in the
> kernel where pte_mkwrite() is called outside of the context of a VMA.
> These are for kernel memory. So create a new variant called
> pte_mkwrite_kernel() and switch the kernel users over to it. Have
> pte_mkwrite() and pte_mkwrite_kernel() be the same for now. Future patches
> will introduce changes to make pte_mkwrite() take a VMA.
>
> Only do this for architectures that need it because they call pte_mkwrite()
> in arch code without an associated VMA. Since it will currently only be
> used in arch code, do not include it in arch_pgtable_helpers.rst.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: Pengfei Xu <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

I think it's a little weird that it's the only PTE helper taking a vma,
but it does seem like the right approach.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-19 20:39:37

by Kees Cook

Subject: Re: [PATCH v6 12/41] s390/mm: Introduce pmd_mkwrite_kernel()

On Sat, Feb 18, 2023 at 01:14:04PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> One of these changes is to allow for pmd_mkwrite() to create different
> types of writable memory (the existing conventionally writable type and
> also the new shadow stack type). Future patches will convert pmd_mkwrite()
> to take a VMA in order to facilitate this, however there are places in the
> kernel where pmd_mkwrite() is called outside of the context of a VMA.
> These are for kernel memory. So create a new variant called
> pmd_mkwrite_kernel() and switch the kernel users over to it. Have
> pmd_mkwrite() and pmd_mkwrite_kernel() be the same for now. Future patches
> will introduce changes to make pmd_mkwrite() take a VMA.
>
> Only do this for architectures that need it because they call pmd_mkwrite()
> in arch code without an associated VMA. Since it will currently only be
> used in arch code, do not include it in arch_pgtable_helpers.rst.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: Pengfei Xu <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

Yup, 1:1 refactor.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-19 20:40:56

by Kees Cook

Subject: Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

On Sat, Feb 18, 2023 at 01:14:05PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite().
>
> In addition to VM_WRITE, the shadow stack VMAs will have a flag denoting
> that they are a special shadow stack flavor of writable memory. So make
> pte_mkwrite() take a VMA, so that the x86 implementation of it can know to
> create regular writable memory or shadow stack memory.
>
> Apply the same changes for pmd_mkwrite() and huge_pte_mkwrite().
>
> No functional change.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Michal Simek <[email protected]>
> Cc: Dinh Nguyen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: Pengfei Xu <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

I'm not an arch maintainer, but it looks like a correct tree-wide
refactor.

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-19 20:42:04

by Kees Cook

Subject: Re: [PATCH v6 20/41] x86/mm: Teach pte_mkwrite() about stack memory

On Sat, Feb 18, 2023 at 01:14:12PM -0800, Rick Edgecombe wrote:
> If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
> when it is made writable with pte_mkwrite(), it should create shadow
> stack memory, not conventionally writable memory. Now that pte_mkwrite()
> takes a VMA, and places where shadow stack memory might be created pass
> one, pte_mkwrite() can know when it should do this.
>
> So make pte_mkwrite() create shadow stack memory when the VMA has the
> VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().
>
> This requires referencing VM_SHADOW_STACK in these functions, which are
> currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is
> located) can't be pulled in without causing problems for files that
> reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where
> they can safely reference VM_SHADOW_STACK.
>
> Tested-by: Pengfei Xu <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

Is there any realistic performance impact from making these not inline
now?

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-19 20:44:03

by Kees Cook

Subject: Re: [PATCH v6 25/41] x86/mm: Introduce MAP_ABOVE4G

On Sat, Feb 18, 2023 at 01:14:17PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which require some core mm changes to function
> properly.
>
> One of the properties is that the shadow stack pointer (SSP), which is a
> CPU register that points to the shadow stack like the stack pointer points
> to the stack, can't be pointing outside of the 32 bit address space when
> the CPU is executing in 32 bit mode. It is desirable to prevent executing
> in 32 bit mode when shadow stack is enabled because the kernel can't easily
> support 32 bit signals.
>
> On x86 it is possible to transition to 32 bit mode without any special
> interaction with the kernel, by doing a "far call" to a 32 bit segment.
> So the shadow stack implementation can use this address space behavior
> as a feature, by enforcing that shadow stack memory is always created
> outside of the 32 bit address space. This way userspace will trigger a
> general protection fault which will in turn trigger a segfault if it
> tries to transition to 32 bit mode with shadow stack enabled.
>
> This provides a clean error generating border for the user if they
> attempt to do 32 bit mode shadow stack, rather than leave the kernel in a
> half working state for userspace to be surprised by.
>
> So to allow future shadow stack enabling patches to map shadow stacks
> out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
> is pretty much like MAP_32BIT, except that it has the opposite address
> range. There are a few differences though.
>
> If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
> MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
> syscall.

Should the interface refuse to accept both set instead?

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-19 20:45:06

by Kees Cook

Subject: Re: [PATCH v6 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot

On Sat, Feb 18, 2023 at 01:14:19PM -0800, Rick Edgecombe wrote:
> When user shadow stack is in use, Write=0,Dirty=1 is treated by the CPU as
> shadow stack memory. So for shadow stack memory this bit combination is
> valid, but when Dirty=1,Write=1 (conventionally writable) memory is being
> write protected, the kernel has been taught to transition the Dirty=1
> bit to SavedDirty=1, to avoid inadvertently creating shadow stack
> memory. It does this inside pte_wrprotect() because it knows the PTE is
> not intended to be a writable shadow stack entry, it is supposed to be
> write protected.
>
> However, when a PTE is created by a raw prot using mk_pte(), mk_pte()
> can't know whether to adjust Dirty=1 to SavedDirty=1. It can't
> distinguish between the caller intending to create a shadow stack PTE or
> needing the SavedDirty shift.
>
> The kernel has been updated to not do this, and so Write=0,Dirty=1
> memory should only be created by the pte_mkfoo() helpers. Add a warning
> to make sure no new mk_pte() callers start doing this.
>
> Tested-by: Pengfei Xu <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
> v6:
> - New patch (Note, this has already been a useful warning, it caught the
> newly added set_memory_rox() doing this)
> ---
> arch/x86/include/asm/pgtable.h | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index f3dc16fc4389..db8fe5511c74 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1032,7 +1032,15 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
> * (Currently stuck as a macro because of indirect forward reference
> * to linux/mm.h:page_to_nid())
> */
> -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))
> +#define mk_pte(page, pgprot) \
> +({ \
> + pgprot_t __pgprot = pgprot; \
> + \
> + WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && \
> + (pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW)) == \
> + _PAGE_DIRTY); \
> + pfn_pte(page_to_pfn(page), __pgprot); \
> +})

This only warns? Should it also enforce the state?
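
The condition being warned about can be expressed as a small predicate.
A sketch only: the bit positions below are the architectural x86 page
table bits (R/W is bit 1, Dirty is bit 6), the MODEL_ names are
hypothetical stand-ins for the kernel's _PAGE_RW and _PAGE_DIRTY, and the
function is a userspace model rather than kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_RW		(UINT64_C(1) << 1)
#define MODEL_PAGE_DIRTY	(UINT64_C(1) << 6)

/* True when prot has Dirty=1 and Write=0, i.e. the shadow stack encoding. */
static bool prot_is_write0_dirty1(uint64_t prot)
{
	return (prot & (MODEL_PAGE_DIRTY | MODEL_PAGE_RW)) == MODEL_PAGE_DIRTY;
}
```

Masking with both bits before comparing is what lets Dirty=1,Write=1
(conventionally writable) and clean read-only protections pass without a
warning.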

--
Kees Cook

2023-02-19 20:47:25

by Kees Cook

Subject: Re: [PATCH v6 37/41] selftests/x86: Add shadow stack test

On Sat, Feb 18, 2023 at 01:14:29PM -0800, Rick Edgecombe wrote:
> Add a simple selftest for exercising some shadow stack behavior:
> - map_shadow_stack syscall and pivot
> - Faulting in shadow stack memory
> - Handling shadow stack violations
> - GUP of shadow stack memory
> - mprotect() of shadow stack memory
> - Userfaultfd on shadow stack memory
>
> Since this test exercises a recently added syscall manually, it needs
> to find the automatically created __NR_foo defines. Per the selftest
> documentation, KHDR_INCLUDES can be used to help the selftest Makefile's
> find the headers from the kernel source. This way the new selftest can
> be built inside the kernel source tree without installing the headers
> to the system. So also add KHDR_INCLUDES as described in the selftest
> docs, to facilitate this.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Co-developed-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

I'll get some test hardware and run this myself too, but overall,
ignoring the lack of kselftest_harness.h, it looks good:

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-19 20:48:27

by Kees Cook

Subject: Re: [PATCH v6 38/41] x86/fpu: Add helper for initing features

On Sat, Feb 18, 2023 at 01:14:30PM -0800, Rick Edgecombe wrote:
> If an xfeature is saved in a buffer, the xfeature's bit will be set in
> xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
> is in its init state. In this case the xfeature buffer address cannot
> be retrieved with get_xsave_addr().
>
> Future patches will need to handle the case of writing to an xfeature
> that may not be saved. So provide helpers to init an xfeature in an
> xsave buffer.
>
> This could of course be done directly by reaching into the xsave buffer,
> however this would not be robust against future changes to optimize the
> xsave buffer by compacting it. In that case the xsave buffer would need
> to be re-arranged as well. So the logic properly belongs encapsulated
> in a helper where the logic can be unified.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-20 01:01:06

by Michael Ellerman

Subject: Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

Rick Edgecombe <[email protected]> writes:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
...
> ---
> Hi Non-x86 Arch’s,
>
> x86 has a feature that allows for the creation of a special type of
> writable memory (shadow stack) that is only writable in limited specific
> ways. Previously, changes were proposed to core MM code to teach it to
> decide when to create normally writable memory or the special shadow stack
> writable memory, but David Hildenbrand suggested[0] to change
> pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
> moved into x86 code.
>
> Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
> changes. So that is why you are seeing some patches out of a big x86
> series pop up in your arch mailing list. There is no functional change.
> After this refactor, the shadow stack series goes on to use the arch
> helpers to push shadow stack memory details inside arch/x86.
...
> ---
> Documentation/mm/arch_pgtable_helpers.rst | 9 ++++++---
> arch/alpha/include/asm/pgtable.h | 6 +++++-
> arch/arc/include/asm/hugepage.h | 2 +-
> arch/arc/include/asm/pgtable-bits-arcv2.h | 7 ++++++-
> arch/arm/include/asm/pgtable-3level.h | 7 ++++++-
> arch/arm/include/asm/pgtable.h | 2 +-
> arch/arm64/include/asm/pgtable.h | 4 ++--
> arch/csky/include/asm/pgtable.h | 2 +-
> arch/hexagon/include/asm/pgtable.h | 2 +-
> arch/ia64/include/asm/pgtable.h | 2 +-
> arch/loongarch/include/asm/pgtable.h | 4 ++--
> arch/m68k/include/asm/mcf_pgtable.h | 2 +-
> arch/m68k/include/asm/motorola_pgtable.h | 6 +++++-
> arch/m68k/include/asm/sun3_pgtable.h | 6 +++++-
> arch/microblaze/include/asm/pgtable.h | 2 +-
> arch/mips/include/asm/pgtable.h | 6 +++---
> arch/nios2/include/asm/pgtable.h | 2 +-
> arch/openrisc/include/asm/pgtable.h | 2 +-
> arch/parisc/include/asm/pgtable.h | 6 +++++-
> arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +-
> arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++--
> arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
> arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
> arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +-

Looks like you discovered the joys of ppc's at-least 5 different MMU
implementations, sorry :)

Acked-by: Michael Ellerman <[email protected]> (powerpc)

cheers

2023-02-20 03:44:21

by Kees Cook

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

On Sat, Feb 18, 2023 at 01:13:52PM -0800, Rick Edgecombe wrote:
> This series implements Shadow Stacks for userspace using x86's Control-flow
> Enforcement Technology (CET). CET consists of two related security features:
> shadow stacks and indirect branch tracking. This series implements just the
> shadow stack part of this feature, and just for userspace.

Okay, I've done some bare metal testing, and it all looks happy. The
selftest passes, and I can see the stack address mismatch get
detected if I explicitly rewrite the saved function pointer on the stack:

[INFO] Want normal flow
[INFO] Found 0x401890 @ 0x7fff47cf2ef8
[INFO] Normal execution flow
[INFO] Want to redirect
[INFO] Found 0x401890 @ 0x7fff47cf2ef8
[INFO] Hijacked execution flow
[INFO] Enabling shadow stack
[INFO] Want to redirect
[INFO] Found 0x401890 @ 0x7fff47cf2ef8
Segmentation fault (core dumped)
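
For readers following along, the check exercised by this output can be
modeled in a few lines of C. This is only a toy software model, assuming
a made-up struct toy_cpu and helpers toy_call()/toy_ret() that exist
nowhere in the series; real shadow stacks are enforced by the CPU, and
the mismatch is delivered to the process as a fault rather than as a
return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STACK_DEPTH 16

/* Hypothetical model: a normal stack the program can scribble on, and a
 * shadow stack that only toy_call()/toy_ret() ever touch. */
struct toy_cpu {
	uint64_t stack[TOY_STACK_DEPTH];
	uint64_t shadow[TOY_STACK_DEPTH];
	size_t sp;
	size_t ssp;
	bool shstk_enabled;
};

/* CALL: push the return address on the normal stack and, if enabled,
 * on the shadow stack as well. Returns false when the stack is full. */
static inline bool toy_call(struct toy_cpu *cpu, uint64_t ret_addr)
{
	if (cpu->sp >= TOY_STACK_DEPTH)
		return false;
	cpu->stack[cpu->sp++] = ret_addr;
	if (cpu->shstk_enabled)
		cpu->shadow[cpu->ssp++] = ret_addr;
	return true;
}

/* RET: pop the normal stack; with shadow stack enabled, also pop the
 * shadow copy and compare. Returning false models the fault. */
static inline bool toy_ret(struct toy_cpu *cpu, uint64_t *target)
{
	uint64_t addr;

	assert(target != NULL);
	if (cpu->sp == 0)
		return false;
	addr = cpu->stack[--cpu->sp];
	if (cpu->shstk_enabled) {
		if (cpu->ssp == 0)
			return false;
		if (cpu->shadow[--cpu->ssp] != addr)
			return false;	/* mismatch: hijack detected */
	}
	*target = addr;
	return true;
}
```

Overwriting cpu.stack[0] between toy_call() and toy_ret() redirects the
return when shstk_enabled is false, and makes toy_ret() fail when it is
true, which mirrors the "Hijacked execution flow" and "Segmentation
fault" lines above.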

Tested-by: Kees Cook <[email protected]>

--
Kees Cook

2023-02-20 06:50:58

by Mike Rapoport

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

On Sat, Feb 18, 2023 at 01:13:52PM -0800, Rick Edgecombe wrote:
> Hi,
>
> This series implements Shadow Stacks for userspace using x86's Control-flow
> Enforcement Technology (CET). CET consists of two related security features:
> shadow stacks and indirect branch tracking. This series implements just the
> shadow stack part of this feature, and just for userspace.

For the series

Acked-by: Mike Rapoport (IBM) <[email protected]>

--
Sincerely yours,
Mike.

2023-02-20 11:19:01

by David Hildenbrand

Subject: Re: [PATCH v6 11/41] mm: Introduce pte_mkwrite_kernel()

On 19.02.23 21:38, Kees Cook wrote:
> On Sat, Feb 18, 2023 at 01:14:03PM -0800, Rick Edgecombe wrote:
>> The x86 Control-flow Enforcement Technology (CET) feature includes a new
>> type of memory called shadow stack. This shadow stack memory has some
>> unusual properties, which requires some core mm changes to function
>> properly.
>>
>> One of these changes is to allow for pte_mkwrite() to create different
>> types of writable memory (the existing conventionally writable type and
>> also the new shadow stack type). Future patches will convert pte_mkwrite()
>> to take a VMA in order to facilitate this, however there are places in the
>> kernel where pte_mkwrite() is called outside of the context of a VMA.
>> These are for kernel memory. So create a new variant called
>> pte_mkwrite_kernel() and switch the kernel users over to it. Have
>> pte_mkwrite() and pte_mkwrite_kernel() be the same for now. Future patches
>> will introduce changes to make pte_mkwrite() take a VMA.
>>
>> Only do this for architectures that need it because they call pte_mkwrite()
>> in arch code without an associated VMA. Since it will only currently be
>> used in arch code, so do not include it in arch_pgtable_helpers.rst.
>>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Tested-by: Pengfei Xu <[email protected]>
>> Suggested-by: David Hildenbrand <[email protected]>
>> Signed-off-by: Rick Edgecombe <[email protected]>
>
> I think it's a little weird that it's the only PTE helper taking a vma,
> but it does seem like the right approach.

Right. We could pass the vm flags instead, but not sure if that really
improves the situation. So unless someone has a better idea, this LGTM.
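
To make the trade-off concrete, here is a rough sketch of what the
VMA-taking helper could look like. Everything prefixed toy_/TOY_ is a
stand-in invented for this sketch (including the helper name
toy_pte_mkwrite_shstk() and the flag and bit values); it is not the x86
implementation from the series, only an illustration of the dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types and bits for illustration only. */
typedef struct { uint64_t val; } toy_pte_t;
struct toy_vma { uint64_t vm_flags; };

#define TOY_VM_WRITE		(1ULL << 1)
#define TOY_VM_SHADOW_STACK	(1ULL << 37)
#define TOY_PAGE_RW		(1ULL << 1)
#define TOY_PAGE_DIRTY		(1ULL << 6)

/* Kernel mappings have no VMA: always conventionally writable. */
static inline toy_pte_t toy_pte_mkwrite_kernel(toy_pte_t pte)
{
	pte.val |= TOY_PAGE_RW;
	return pte;
}

/* Shadow stack flavor of "writable": Write=0,Dirty=1. */
static inline toy_pte_t toy_pte_mkwrite_shstk(toy_pte_t pte)
{
	pte.val &= ~TOY_PAGE_RW;
	pte.val |= TOY_PAGE_DIRTY;
	return pte;
}

/* The VMA tells the arch code which kind of writable memory to create,
 * so core mm callers do not need to know about shadow stacks. */
static inline toy_pte_t toy_pte_mkwrite(toy_pte_t pte,
					const struct toy_vma *vma)
{
	assert(vma != NULL);
	if (vma->vm_flags & TOY_VM_SHADOW_STACK)
		return toy_pte_mkwrite_shstk(pte);
	return toy_pte_mkwrite_kernel(pte);
}
```

Passing only the vm flags would work for this body just as well; the
sketch takes the VMA because that is what the series does.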

--
Thanks,

David / dhildenb


2023-02-20 11:20:46

by David Hildenbrand

Subject: Re: [PATCH v6 11/41] mm: Introduce pte_mkwrite_kernel()

On 18.02.23 22:14, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> One of these changes is to allow for pte_mkwrite() to create different
> types of writable memory (the existing conventionally writable type and
> also the new shadow stack type). Future patches will convert pte_mkwrite()
> to take a VMA in order to facilitate this, however there are places in the
> kernel where pte_mkwrite() is called outside of the context of a VMA.
> These are for kernel memory. So create a new variant called
> pte_mkwrite_kernel() and switch the kernel users over to it. Have
> pte_mkwrite() and pte_mkwrite_kernel() be the same for now. Future patches
> will introduce changes to make pte_mkwrite() take a VMA.
>
> Only do this for architectures that need it because they call pte_mkwrite()
> in arch code without an associated VMA. Since it will only currently be
> used in arch code, so do not include it in arch_pgtable_helpers.rst.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: Pengfei Xu <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>

Acked-by: David Hildenbrand <[email protected]>

Do we also have to care about pmd_mkwrite() ?

--
Thanks,

David / dhildenb


2023-02-20 11:22:06

by David Hildenbrand

Subject: Re: [PATCH v6 12/41] s390/mm: Introduce pmd_mkwrite_kernel()

On 18.02.23 22:14, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> One of these changes is to allow for pmd_mkwrite() to create different
> types of writable memory (the existing conventionally writable type and
> also the new shadow stack type). Future patches will convert pmd_mkwrite()
> to take a VMA in order to facilitate this, however there are places in the
> kernel where pmd_mkwrite() is called outside of the context of a VMA.
> These are for kernel memory. So create a new variant called
> pmd_mkwrite_kernel() and switch the kernel users over to it. Have
> pmd_mkwrite() and pmd_mkwrite_kernel() be the same for now. Future patches
> will introduce changes to make pmd_mkwrite() take a VMA.
>
> Only do this for architectures that need it because they call pmd_mkwrite()
> in arch code without an associated VMA. Since it will only currently be
> used in arch code, so do not include it in arch_pgtable_helpers.rst.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: Pengfei Xu <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>

Heh, that answers my question to patch #11

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2023-02-20 11:24:33

by David Hildenbrand

Subject: Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

On 18.02.23 22:14, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> One of these unusual properties is that shadow stack memory is writable,
> but only in limited ways. These limits are applied via a specific PTE
> bit combination. Nevertheless, the memory is writable, and core mm code
> will need to apply the writable permissions in the typical paths that
> call pte_mkwrite().
>
> In addition to VM_WRITE, the shadow stack VMA's will have a flag denoting
> that they are special shadow stack flavor of writable memory. So make
> pte_mkwrite() take a VMA, so that the x86 implementation of it can know to
> create regular writable memory or shadow stack memory.
>
> Apply the same changes for pmd_mkwrite() and huge_pte_mkwrite().
>
> No functional change.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Michal Simek <[email protected]>
> Cc: Dinh Nguyen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: Pengfei Xu <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
> Hi Non-x86 Arch’s,
>
> x86 has a feature that allows for the creation of a special type of
> writable memory (shadow stack) that is only writable in limited specific
> ways. Previously, changes were proposed to core MM code to teach it to
> decide when to create normally writable memory or the special shadow stack
> writable memory, but David Hildenbrand suggested[0] to change
> pXX_mkwrite() to take a VMA, so awareness of shadow stack memory can be
> moved into x86 code.
>
> Since pXX_mkwrite() is defined in every arch, it requires some tree-wide
> changes. So that is why you are seeing some patches out of a big x86
> series pop up in your arch mailing list. There is no functional change.
> After this refactor, the shadow stack series goes on to use the arch
> helpers to push shadow stack memory details inside arch/x86.
>
> Testing was just 0-day build testing.
>
> Hopefully that is enough context. Thanks!
>
> [0] https://lore.kernel.org/lkml/[email protected]/#t
>
> v6:
> - New patch
> ---
> Documentation/mm/arch_pgtable_helpers.rst | 9 ++++++---
> arch/alpha/include/asm/pgtable.h | 6 +++++-
> arch/arc/include/asm/hugepage.h | 2 +-
> arch/arc/include/asm/pgtable-bits-arcv2.h | 7 ++++++-
> arch/arm/include/asm/pgtable-3level.h | 7 ++++++-
> arch/arm/include/asm/pgtable.h | 2 +-
> arch/arm64/include/asm/pgtable.h | 4 ++--
> arch/csky/include/asm/pgtable.h | 2 +-
> arch/hexagon/include/asm/pgtable.h | 2 +-
> arch/ia64/include/asm/pgtable.h | 2 +-
> arch/loongarch/include/asm/pgtable.h | 4 ++--
> arch/m68k/include/asm/mcf_pgtable.h | 2 +-
> arch/m68k/include/asm/motorola_pgtable.h | 6 +++++-
> arch/m68k/include/asm/sun3_pgtable.h | 6 +++++-
> arch/microblaze/include/asm/pgtable.h | 2 +-
> arch/mips/include/asm/pgtable.h | 6 +++---
> arch/nios2/include/asm/pgtable.h | 2 +-
> arch/openrisc/include/asm/pgtable.h | 2 +-
> arch/parisc/include/asm/pgtable.h | 6 +++++-
> arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +-
> arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++--
> arch/powerpc/include/asm/nohash/32/pgtable.h | 2 +-
> arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
> arch/powerpc/include/asm/nohash/64/pgtable.h | 2 +-
> arch/riscv/include/asm/pgtable.h | 6 +++---
> arch/s390/include/asm/hugetlb.h | 4 ++--
> arch/s390/include/asm/pgtable.h | 4 ++--
> arch/sh/include/asm/pgtable_32.h | 10 ++++++++--
> arch/sparc/include/asm/pgtable_32.h | 2 +-
> arch/sparc/include/asm/pgtable_64.h | 6 +++---
> arch/um/include/asm/pgtable.h | 2 +-
> arch/x86/include/asm/pgtable.h | 6 ++++--
> arch/xtensa/include/asm/pgtable.h | 2 +-
> include/asm-generic/hugetlb.h | 4 ++--
> include/linux/mm.h | 2 +-
> mm/debug_vm_pgtable.c | 16 ++++++++--------
> mm/huge_memory.c | 6 +++---
> mm/hugetlb.c | 4 ++--
> mm/memory.c | 4 ++--
> mm/migrate_device.c | 2 +-
> mm/mprotect.c | 2 +-
> mm/userfaultfd.c | 2 +-
> 42 files changed, 106 insertions(+), 69 deletions(-)

That looks painful but IMHO worth it :)

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2023-02-20 11:32:50

by David Hildenbrand

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On 18.02.23 22:14, Rick Edgecombe wrote:
> Some OSes have a greater dependence on software available bits in PTEs than
> Linux. That left the hardware architects looking for a way to represent a
> new memory type (shadow stack) within the existing bits. They chose to
> repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
> shadow stack memory, Linux should avoid creating memory with this PTE bit
> combination unless it intends for it to be shadow stack.
>
> The reason it's lightly used is that Dirty=1 is normally set by HW
> _before_ a write. A write with a Write=0 PTE would typically only generate
> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
> supports shadow stacks will no longer exhibit this oddity.
>
> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
> in places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_SAVED_DIRTY in place of the hardware _PAGE_DIRTY.
> In other words, whenever Linux needs to create Write=0,Dirty=1, it instead
> creates Write=0,SavedDirty=1 except for shadow stack, which is
> Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
> combinations would be set as follows for various types of memory:
I would simplify this (see below) and not repeat in such detail what the
patch already contains as comments.

>
> (Write=0,SavedDirty=1,Dirty=0):
> - A modified, copy-on-write (COW) page. Previously when a typical
> anonymous writable mapping was made COW via fork(), the kernel would
> mark it Write=0,Dirty=1. Now it will instead use the SavedDirty bit.
> This happens in copy_present_pte().
> - A R/O page that has been COW'ed. The user page is in a R/O VMA,
> and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
> handler creates a copy of the page and sets the new copy's PTE as
> Write=0 and SavedDirty=1.
> - A shared shadow stack PTE. When a shadow stack page is being shared
> among processes (this happens at fork()), its PTE is made Dirty=0, so
> the next shadow stack access causes a fault, and the page is
> duplicated and Dirty=1 is set again. This is the COW equivalent for
> shadow stack pages, even though it's copy-on-access rather than
> copy-on-write.
>
> (Write=0,SavedDirty=0,Dirty=1):
> - A shadow stack PTE.
> - A Cow PTE created when a processor without shadow stack support set
> Dirty=1.
>
> There are six bits left available to software in the 64-bit PTE after
> consuming a bit for _PAGE_SAVED_DIRTY. No space is consumed in 32-bit
> kernels because shadow stacks are not enabled there.
>
> Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to start
> creating _PAGE_SAVED_DIRTY PTEs will follow once other pieces are in place.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Reviewed-by: Kees Cook <[email protected]>
> Co-developed-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
> v6:
> - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
> - Add _PAGE_SAVED_DIRTY to _PAGE_CHG_MASK
>
> v5:
> - Fix log, comments and whitespace (Boris)
> - Remove capitalization on shadow stack (Boris)
>
> v4:
> - Teach pte_flags_need_flush() about _PAGE_COW bit
> - Break apart patch for better bisectability
>
> v3:
> - Add comment around _PAGE_TABLE in response to comment
> from (Andrew Cooper)
> - Check for PSE in pmd_shstk (Andrew Cooper)
> - Get to the point quicker in commit log (Andrew Cooper)
> - Clarify and reorder commit log for why the PTE bit examples have
> multiple entries. Apply same changes for comment. (peterz)
> - Fix comment that implied dirty bit for COW was a specific x86 thing
> (peterz)
> - Fix swapping of Write/Dirty (PeterZ)
> ---
> arch/x86/include/asm/pgtable.h | 79 ++++++++++++++++++++++++++++
> arch/x86/include/asm/pgtable_types.h | 65 ++++++++++++++++++++---
> arch/x86/include/asm/tlbflush.h | 3 +-
> 3 files changed, 138 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2b423d697490..110e552eb602 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -301,6 +301,45 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
> return native_make_pte(v & ~clear);
> }
>
> +/*
> + * COW and other write protection operations can result in Dirty=1,Write=0
> + * PTEs. But in the case of X86_FEATURE_USER_SHSTK, the software SavedDirty bit
> + * is used, since the Dirty=1,Write=0 will result in the memory being treated as
> + * shadow stack by the HW. So when creating dirty, write-protected memory, a
> + * software bit is used _PAGE_BIT_SAVED_DIRTY. The following functions
> + * pte_mksaveddirty() and pte_clear_saveddirty() take a conventional dirty,
> + * write-protected PTE (Write=0,Dirty=1) and transition it to the shadow stack
> + * compatible version. (Write=0,SavedDirty=1).
> + */
> +static inline pte_t pte_mksaveddirty(pte_t pte)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pte;
> +
> + pte = pte_clear_flags(pte, _PAGE_DIRTY);
> + return pte_set_flags(pte, _PAGE_SAVED_DIRTY);
> +}
> +
> +static inline pte_t pte_clear_saveddirty(pte_t pte)
> +{
> + /*
> + * _PAGE_SAVED_DIRTY is unnecessary on !X86_FEATURE_USER_SHSTK kernels,
> + * since the HW dirty bit can be used without creating shadow stack
> + * memory. See the _PAGE_SAVED_DIRTY definition for more details.
> + */
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pte;
> +
> + /*
> + * PTE is getting copied-on-write, so it will be dirtied
> + * if writable, or made shadow stack if shadow stack and
> + * being copied on access. Set the dirty bit for both
> + * cases.
> + */
> + pte = pte_set_flags(pte, _PAGE_DIRTY);
> + return pte_clear_flags(pte, _PAGE_SAVED_DIRTY);
> +}
> +
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> static inline int pte_uffd_wp(pte_t pte)
> {
> @@ -420,6 +459,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
> return native_make_pmd(v & ~clear);
> }
>
> +/* See comments above pte_mksaveddirty() */
> +static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pmd;
> +
> + pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
> + return pmd_set_flags(pmd, _PAGE_SAVED_DIRTY);
> +}
> +
> +/* See comments above pte_mksaveddirty() */
> +static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pmd;
> +
> + pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
> + return pmd_clear_flags(pmd, _PAGE_SAVED_DIRTY);
> +}
> +
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> static inline int pmd_uffd_wp(pmd_t pmd)
> {
> @@ -491,6 +550,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
> return native_make_pud(v & ~clear);
> }
>
> +/* See comments above pte_mksaveddirty() */
> +static inline pud_t pud_mksaveddirty(pud_t pud)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pud;
> +
> + pud = pud_clear_flags(pud, _PAGE_DIRTY);
> + return pud_set_flags(pud, _PAGE_SAVED_DIRTY);
> +}
> +
> +/* See comments above pte_mksaveddirty() */
> +static inline pud_t pud_clear_saveddirty(pud_t pud)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pud;
> +
> + pud = pud_set_flags(pud, _PAGE_DIRTY);
> + return pud_clear_flags(pud, _PAGE_SAVED_DIRTY);
> +}
> +
> static inline pud_t pud_mkold(pud_t pud)
> {
> return pud_clear_flags(pud, _PAGE_ACCESSED);
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 0646ad00178b..3b420b6c0584 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -21,7 +21,8 @@
> #define _PAGE_BIT_SOFTW2 10 /* " */
> #define _PAGE_BIT_SOFTW3 11 /* " */
> #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
> +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
> +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
> #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
> #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
> #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
> @@ -34,6 +35,15 @@
> #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
> #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
>
> +/*
> + * Indicates a Saved Dirty bit page.
> + */
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* copy-on-write */

Nope, not "copy-on-write" :) It's more like "dirty bit when the hw-dirty
bit cannot be used". Maybe simply drop the comment.

> +#else
> +#define _PAGE_BIT_SAVED_DIRTY 0
> +#endif
> +
> /* If _PAGE_BIT_PRESENT is clear, we use these: */
> /* - if the user mapped it with PROT_NONE; pte_present gives true */
> #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
> @@ -117,6 +127,40 @@
> #define _PAGE_SOFTW4 (_AT(pteval_t, 0))
> #endif
>
> +/*
> + * The hardware requires shadow stack to be read-only and Dirty.
> + * _PAGE_SAVED_DIRTY is a software-only bit used to separate copy-on-write
> + * PTEs from shadow stack PTEs:

I'd suggest phrasing this differently. COW is just one scenario where
this can happen. Also, I don't think that the description of
"separation" is correct.

Something like the following maybe?

"
However, there are valid cases where the kernel might create read-only
PTEs that are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty
tracking). In this case, the _PAGE_SAVED_DIRTY bit is used instead of
the HW-dirty bit, to avoid creating wrong "shadow stack" PTEs. Such
PTEs have (Write=0,SavedDirty=1,Dirty=0) set.

Note that on processors without shadow stack support, the
_PAGE_SAVED_DIRTY remains unused.
"
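
As a rough illustration of the rule described in that wording, the
following sketch models write-protecting a dirty PTE. The names
toy_pte_wrprotect(), toy_pte_dirty() and toy_pte_is_shstk() are made up
for this example, and the bit positions (RW=1, Dirty=6, SavedDirty=58)
are simply taken as assumptions from the hunks quoted in this mail; it
is not the code from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_RW		(1ULL << 1)
#define TOY_PAGE_DIRTY		(1ULL << 6)
#define TOY_PAGE_SAVED_DIRTY	(1ULL << 58)

/* Write=0,Dirty=1 is what the hardware treats as shadow stack. */
static inline bool toy_pte_is_shstk(uint64_t pte)
{
	return (pte & (TOY_PAGE_RW | TOY_PAGE_DIRTY)) == TOY_PAGE_DIRTY;
}

/* Software view of "dirty": either the HW bit or the saved copy. */
static inline bool toy_pte_dirty(uint64_t pte)
{
	return (pte & (TOY_PAGE_DIRTY | TOY_PAGE_SAVED_DIRTY)) != 0;
}

/* Write-protect a PTE (fork(), mprotect(), uffd-wp, ...). With shadow
 * stack support, move Dirty into SavedDirty so that the result is not
 * mistaken for a shadow stack PTE. Without it, SavedDirty stays unused. */
static inline uint64_t toy_pte_wrprotect(uint64_t pte, bool shstk_supported)
{
	pte &= ~TOY_PAGE_RW;
	if (shstk_supported && (pte & TOY_PAGE_DIRTY)) {
		pte &= ~TOY_PAGE_DIRTY;
		pte |= TOY_PAGE_SAVED_DIRTY;
	}
	assert(!shstk_supported || !toy_pte_is_shstk(pte));
	return pte;
}
```

Either way the dirty information survives: toy_pte_dirty() stays true
after write-protecting a dirty PTE, and only the bit that carries it
changes.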

Then I would simply drop the below (which is also too COW-specific, I think).

> + *
> + * (Write=0,SavedDirty=1,Dirty=0):
> + * - A modified, copy-on-write (COW) page. Previously when a typical
> + * anonymous writable mapping was made COW via fork(), the kernel would
> + * mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
> + * happens in copy_present_pte().
> + * - A R/O page that has been COW'ed. The user page is in a R/O VMA,
> + * and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
> + * handler creates a copy of the page and sets the new copy's PTE as
> + * Write=0 and SavedDirty=1.
> + * - A shared shadow stack PTE. When a shadow stack page is being shared
> + * among processes (this happens at fork()), its PTE is made Dirty=0, so
> + * the next shadow stack access causes a fault, and the page is
> + * duplicated and Dirty=1 is set again. This is the COW equivalent for
> + * shadow stack pages, even though it's copy-on-access rather than
> + * copy-on-write.
> + *
> + * (Write=0,SavedDirty=0,Dirty=1):
> + * - A shadow stack PTE.
> + * - A Cow PTE created when a processor without shadow stack support set
> + * Dirty=1.
> + */


--
Thanks,

David / dhildenb


2023-02-20 12:57:02

by David Hildenbrand

Subject: Re: [PATCH v6 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory

On 18.02.23 22:14, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
> read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
> two cases are handled differently for page faults. Introduce
> VM_SHADOW_STACK to track shadow stack VMAs.

I suggest simplifying and abstracting that description.

"New hardware extensions implement support for shadow stack memory, such
as x86 Control-flow Enforcement Technology (CET). Let's add a new VM
flag to identify these areas, for example, to be used to properly
indicate shadow stack PTEs to the hardware."

>
> Reviewed-by: Kees Cook <[email protected]>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> ---
> v6:
> - Add comment about VM_SHADOW_STACK not being allowed with VM_SHARED
> (David Hildenbrand)

Might want to add some more meat to the patch description on why that is
the case.

>
> v3:
> - Drop arch specific change in arch_vma_name(). The memory can show as
> anonymous (Kirill)
> - Change CONFIG_ARCH_HAS_SHADOW_STACK to CONFIG_X86_USER_SHADOW_STACK
> in show_smap_vma_flags() (Boris)
> ---
> Documentation/filesystems/proc.rst | 1 +
> fs/proc/task_mmu.c | 3 +++
> include/linux/mm.h | 8 ++++++++
> 3 files changed, 12 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index e224b6d5b642..115843e8cce3 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -564,6 +564,7 @@ encoded manner. The codes are the following:
> mt arm64 MTE allocation tags are enabled
> um userfaultfd missing tracking
> uw userfaultfd wr-protect tracking
> + ss shadow stack page
> == =======================================
>
> Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index af1c49ae11b1..9e2cefe47749 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> [ilog2(VM_UFFD_MINOR)] = "ui",
> #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> + [ilog2(VM_SHADOW_STACK)] = "ss",
> +#endif
> };
> size_t i;
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e6f1789c8e69..76e0a09aeffe 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -315,11 +315,13 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
> #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
> #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
> #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
> #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
> #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
> #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
> #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
> #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>
> #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void *objp);
> #endif
> #endif /* CONFIG_ARCH_HAS_PKEYS */
>
> +#ifdef CONFIG_X86_USER_SHADOW_STACK


Should we abstract this to CONFIG_ARCH_USER_SHADOW_STACK, seeing that
other architectures might similarly need it?
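
A sketch of what that abstraction might look like, under the assumption
that the flag body is a plain alias of VM_HIGH_ARCH_5 with a VM_NONE
fallback (the quoted hunk is cut off before the definition, so this is
a guess). CONFIG_ARCH_USER_SHADOW_STACK is the hypothetical symbol
proposed above, and all TOY_ names are stand-ins so the fragment
compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Normally provided by Kconfig; defined here only so that the sketch
 * is self-contained. */
#define CONFIG_ARCH_USER_SHADOW_STACK 1

#define TOY_BIT(n)		(1ULL << (n))
#define TOY_VM_NONE		0ULL
#define TOY_VM_HIGH_ARCH_BIT_5	37	/* only usable on 64-bit */
#define TOY_VM_HIGH_ARCH_5	TOY_BIT(TOY_VM_HIGH_ARCH_BIT_5)

static_assert(TOY_VM_HIGH_ARCH_BIT_5 < 64, "flag must fit in 64 bits");

/* Any architecture selecting the generic symbol gets the flag; others
 * see a zero flag, so tests against it compile away. */
#ifdef CONFIG_ARCH_USER_SHADOW_STACK
# define TOY_VM_SHADOW_STACK	TOY_VM_HIGH_ARCH_5
#else
# define TOY_VM_SHADOW_STACK	TOY_VM_NONE
#endif

static inline int toy_vma_is_shadow_stack(uint64_t vm_flags)
{
	return (vm_flags & TOY_VM_SHADOW_STACK) != 0;
}
```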

--
Thanks,

David / dhildenb


2023-02-20 12:58:05

by David Hildenbrand

Subject: Re: [PATCH v6 19/41] x86/mm: Check shadow stack page fault errors

On 18.02.23 22:14, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> The CPU performs "shadow stack accesses" when it expects to encounter
> shadow stack mappings. These accesses can be implicit (via CALL/RET
> instructions) or explicit (instructions like WRSS).
>
> Shadow stack accesses to shadow-stack mappings can result in faults in
> normal, valid operation just like regular accesses to regular mappings.
> Shadow stacks need some of the same features like delayed allocation, swap
> and copy-on-write. The kernel needs to use faults to implement those
> features.
>
> The architecture has concepts of both shadow stack reads and shadow stack
> writes. Any shadow stack access to non-shadow stack memory will generate
> a fault with the shadow stack error code bit set.
>
> This means that, unlike normal write protection, the fault handler needs
> to create a type of memory that can be written to (with instructions that
> generate shadow stack writes), even to fulfill a read access. So in the
> case of COW memory, the COW needs to take place even with a shadow stack
> read. Otherwise the page will be left (shadow stack) writable in
> userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
> for shadow stack accesses, even if the access was a shadow stack read.
>
> For the purpose of making this clearer, consider the following example.
> If a process has a shadow stack, and forks, the shadow stack PTEs will
> become read-only due to COW. If the CPU in one process performs a shadow
> stack read access to the shadow stack, for example executing a RET and
> causing the CPU to read the shadow stack copy of the return address, then
> in order for the fault to be resolved the PTE will need to be set with
> shadow stack permissions. But then the memory would be changeable from
> userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
> COW, otherwise the shared page would be changeable from both processes.
>
> Shadow stack accesses can also result in errors, such as when a shadow
> stack overflows, or if a shadow stack access occurs to a non-shadow-stack
> mapping. Also, generate the errors for invalid shadow stack accesses.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Reviewed-by: Kees Cook <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
> v6:
> - Update comment due to rename of Cow bit to SavedDirty
>
> v5:
> - Add description of COW example (Boris)
> - Replace "permissioned" (Boris)
> - Remove capitalization of shadow stack (Boris)
>
> v4:
> - Further improve comment talking about FAULT_FLAG_WRITE (Peterz)
>
> v3:
> - Improve comment talking about using FAULT_FLAG_WRITE (Peterz)
> ---
> arch/x86/include/asm/trap_pf.h | 2 ++
> arch/x86/mm/fault.c | 38 ++++++++++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+)
>
> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
> index 10b1de500ab1..afa524325e55 100644
> --- a/arch/x86/include/asm/trap_pf.h
> +++ b/arch/x86/include/asm/trap_pf.h
> @@ -11,6 +11,7 @@
> * bit 3 == 1: use of reserved bit detected
> * bit 4 == 1: fault was an instruction fetch
> * bit 5 == 1: protection keys block access
> + * bit 6 == 1: shadow stack access fault
> * bit 15 == 1: SGX MMU page-fault
> */
> enum x86_pf_error_code {
> @@ -20,6 +21,7 @@ enum x86_pf_error_code {
> X86_PF_RSVD = 1 << 3,
> X86_PF_INSTR = 1 << 4,
> X86_PF_PK = 1 << 5,
> + X86_PF_SHSTK = 1 << 6,
> X86_PF_SGX = 1 << 15,
> };
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 7b0d4ab894c8..42885d8e2036 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1138,8 +1138,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
> (error_code & X86_PF_INSTR), foreign))
> return 1;
>
> + /*
> + * Shadow stack accesses (PF_SHSTK=1) are only permitted to
> + * shadow stack VMAs. All other accesses result in an error.
> + */
> + if (error_code & X86_PF_SHSTK) {
> + if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
> + return 1;
> + if (unlikely(!(vma->vm_flags & VM_WRITE)))
> + return 1;
> + return 0;
> + }
> +
> if (error_code & X86_PF_WRITE) {
> /* write, present and write, not present: */
> + if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
> + return 1;
> if (unlikely(!(vma->vm_flags & VM_WRITE)))
> return 1;
> return 0;
> @@ -1331,6 +1345,30 @@ void do_user_addr_fault(struct pt_regs *regs,
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>
> + /*
> + * When a page becomes COW it changes from a shadow stack permission
> + * page (Write=0,Dirty=1) to (Write=0,Dirty=0,SavedDirty=1), which is simply
> + * read-only to the CPU. When shadow stack is enabled, a RET would
> + * normally pop the shadow stack by reading it with a "shadow stack
> + * read" access. However, in the COW case the shadow stack memory does
> + * not have shadow stack permissions, it is read-only. So it will
> + * generate a fault.
> + *
> + * For conventionally writable pages, a read can be serviced with a
> + * read only PTE, and COW would not have to happen. But for shadow
> + * stack, there isn't the concept of read-only shadow stack memory.
> + * If it is shadow stack permission, it can be modified via CALL and
> + * RET instructions. So COW needs to happen before any memory can be
> + * mapped with shadow stack permissions.
> + *
> + * Shadow stack accesses (read or write) need to be serviced with
> + * shadow stack permission memory, so in the case of a shadow stack
> + * read access, treat it as a WRITE fault so both COW will happen and
> + * the write fault path will tickle maybe_mkwrite() and map the memory
> + * shadow stack.
> + */

Again, I suggest dropping all details about COW from this comment and
from the patch description. It's just one such case that can happen.
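
For reference, the check ordering in the quoted access_error() hunk can be
modeled outside the kernel. This is only a sketch: shstk_access_error() is
a hypothetical name, the VM_* values are stand-ins chosen to match the
flags quoted elsewhere in this thread, and the pkey, instruction-fetch, and
read paths are left out.

```c
#include <stdbool.h>

/* Error-code bits, as listed in the quoted trap_pf.h hunk. */
#define X86_PF_WRITE	(1UL << 1)
#define X86_PF_SHSTK	(1UL << 6)

/* Stand-in VMA flags; values are assumptions for illustration. */
#define VM_WRITE	(1ULL << 1)
#define VM_SHADOW_STACK	(1ULL << 37)

/* Returns true when the faulting access must be rejected. */
static bool shstk_access_error(unsigned long error_code,
			       unsigned long long vm_flags)
{
	if (error_code & X86_PF_SHSTK) {
		/* Shadow stack accesses need a writable shadow stack VMA. */
		if (!(vm_flags & VM_SHADOW_STACK))
			return true;
		if (!(vm_flags & VM_WRITE))
			return true;
		return false;
	}

	if (error_code & X86_PF_WRITE) {
		/* Ordinary writes may never target a shadow stack VMA. */
		if (vm_flags & VM_SHADOW_STACK)
			return true;
		if (!(vm_flags & VM_WRITE))
			return true;
		return false;
	}

	/* Reads and other cases are outside the scope of this sketch. */
	return false;
}
```

The point of the ordering is that the X86_PF_SHSTK test comes first, so a
shadow stack access is never judged by the ordinary write rules.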


--
Thanks,

David / dhildenb


2023-02-20 12:59:38

by David Hildenbrand

Subject: Re: [PATCH v6 22/41] mm/mmap: Add shadow stack pages to memory accounting

On 18.02.23 22:14, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> Account shadow stack pages to stack memory.
>
> Reviewed-by: Kees Cook <[email protected]>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> ---
> v3:
> - Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
> (Kirill)
>
> v2:
> - Remove is_shadow_stack_mapping() and just change it to directly bitwise
> and VM_SHADOW_STACK.
>
> Yu-cheng v26:
> - Remove redundant #ifdef CONFIG_MMU.
>
> Yu-cheng v25:
> - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().
> ---
> mm/mmap.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 425a9349e610..9f85596cce31 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3290,6 +3290,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
> mm->exec_vm += npages;
> else if (is_stack_mapping(flags))
> mm->stack_vm += npages;
> + else if (flags & VM_SHADOW_STACK)
> + mm->stack_vm += npages;

Why not modify is_stack_mapping() ?

--
Thanks,

David / dhildenb


2023-02-20 20:24:38

by John Allen

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

On Sat, Feb 18, 2023 at 01:13:52PM -0800, Rick Edgecombe wrote:
> I left tested-by tags in place per discussion with testers. Testers, please
> retest.

v6 is still working well on my AMD system (Dell PowerEdge
R6515 w/ EPYC 7713).

The selftests run cleanly:

[INFO] new_ssp = 7f53069ffff8, *new_ssp = 7f5306a00001
[INFO] changing ssp from 7f53071ffff0 to 7f53069ffff8
[INFO] ssp is now 7f5306a00000
[OK] Shadow stack pivot
[OK] Shadow stack faults
[INFO] Corrupting shadow stack
[INFO] Generated shadow stack violation successfully
[OK] Shadow stack violation test
[INFO] Gup read -> shstk access success
[INFO] Gup write -> shstk access success
[INFO] Violation from normal write
[INFO] Gup read -> write access success
[INFO] Violation from normal write
[INFO] Gup write -> write access success
[INFO] Cow gup write -> write access success
[OK] Shadow gup test
[INFO] Violation from shstk access
[OK] mprotect() test
[OK] Userfaultfd test
[OK] 32 bit test

And I can see the control protection messages in dmesg when
running the shstk violation test from here:
https://gitlab.com/cet-software/cet-smoke-test

ld-linux-x86-64[51598] control protection ip:401139 sp:7ffd68b1b7c8 ssp:7fb433578fd8 error:1(near ret) in shstk1[401000+1000]

Tested-by: John Allen <[email protected]>

2023-02-20 21:24:02

by Edgecombe, Rick P

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

On Mon, 2023-02-20 at 08:50 +0200, Mike Rapoport wrote:
> On Sat, Feb 18, 2023 at 01:13:52PM -0800, Rick Edgecombe wrote:
> > Hi,
> >
> > This series implements Shadow Stacks for userspace using x86's
> > Control-flow
> > Enforcement Technology (CET). CET consists of two related security
> > features:
> > shadow stacks and indirect branch tracking. This series implements
> > just the
> > shadow stack part of this feature, and just for userspace.
>
> For the series
>
> Acked-by: Mike Rapoport (IBM) <[email protected]>

Thanks Mike! Sorry, I forgot to add it last time.

2023-02-20 21:24:36

by Edgecombe, Rick P

Subject: Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

On Mon, 2023-02-20 at 12:00 +1100, Michael Ellerman wrote:
> Acked-by: Michael Ellerman <[email protected]> (powerpc)

Thanks!

2023-02-20 21:38:45

by Edgecombe, Rick P

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On Mon, 2023-02-20 at 12:32 +0100, David Hildenbrand wrote:
> On 18.02.23 22:14, Rick Edgecombe wrote:
> > Some OSes have a greater dependence on software available bits in
> > PTEs than
> > Linux. That left the hardware architects looking for a way to
> > represent a
> > new memory type (shadow stack) within the existing bits. They chose
> > to
> > repurpose a lightly-used state: Write=0,Dirty=1. So in order to
> > support
> > shadow stack memory, Linux should avoid creating memory with this
> > PTE bit
> > combination unless it intends for it to be shadow stack.
> >
> > The reason it's lightly used is that Dirty=1 is normally set by HW
> > _before_ a write. A write with a Write=0 PTE would typically only
> > generate
> > a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1
> > *and*
> > generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware
> > which
> > supports shadow stacks will no longer exhibit this oddity.
> >
> > So that leaves Write=0,Dirty=1 PTEs created in software. To achieve
> > this,
> > in places where Linux normally creates Write=0,Dirty=1, it can use
> > the
> > software-defined _PAGE_SAVED_DIRTY in place of the hardware
> > _PAGE_DIRTY.
> > In other words, whenever Linux needs to create Write=0,Dirty=1, it
> > instead
> > creates Write=0,SavedDirty=1 except for shadow stack, which is
> > Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
> > combinations would be set as follows for various types of memory:
>
> I would simplify (see below) and not repeat what the patch contains
> as
> comments already that detailed.

This verbiage has had quite a bit of x86 maintainer attention already.
I hear what you are saying, but I'm a bit hesitant to take style
suggestions at this point for fear of the situation where people ask
for changes back and forth across different versions. Unless any x86
maintainers want to chime in again? More responses below.
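
To make the bit combinations in the quoted log concrete, here is a small
userspace model. It is a sketch under assumptions: it pretends
X86_FEATURE_USER_SHSTK is always enabled, wrprotect_sketch() and
is_shstk_encoding() are hypothetical names, and bit 58 is taken from the
_PAGE_BIT_SOFTW5 definition quoted further down.

```c
#include <stdint.h>

typedef uint64_t pteval_t;

/*
 * Write (bit 1) and Dirty (bit 6) are the architectural positions.
 * SavedDirty follows the quoted _PAGE_BIT_SOFTW5 value (58).
 */
#define _PAGE_RW		((pteval_t)1 << 1)
#define _PAGE_DIRTY		((pteval_t)1 << 6)
#define _PAGE_SAVED_DIRTY	((pteval_t)1 << 58)

/* Move the hardware Dirty bit into the software SavedDirty bit. */
static pteval_t mksaveddirty(pteval_t pte)
{
	pte &= ~_PAGE_DIRTY;
	return pte | _PAGE_SAVED_DIRTY;
}

/*
 * Hypothetical write-protect helper: clear Write, and if the PTE was
 * dirty, keep that fact in SavedDirty so that Write=0,Dirty=1 is never
 * produced by accident.
 */
static pteval_t wrprotect_sketch(pteval_t pte)
{
	pte &= ~_PAGE_RW;
	if (pte & _PAGE_DIRTY)
		pte = mksaveddirty(pte);
	return pte;
}

/* True only for the encoding the hardware treats as shadow stack. */
static int is_shstk_encoding(pteval_t pte)
{
	return (pte & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}
```

With this model, write-protecting a dirty, writable PTE yields
Write=0,SavedDirty=1,Dirty=0, which is_shstk_encoding() rejects, while a
PTE with only Dirty set is the one case it accepts.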

>
> >
> > (Write=0,SavedDirty=1,Dirty=0):
> > - A modified, copy-on-write (COW) page. Previously when a typical
> > anonymous writable mapping was made COW via fork(), the kernel
> > would
> > mark it Write=0,Dirty=1. Now it will instead use the SavedDirty
> > bit.
> > This happens in copy_present_pte().
> > - A R/O page that has been COW'ed. The user page is in a R/O VMA,
> > and get_user_pages(FOLL_FORCE) needs a writable copy. The page
> > fault
> > handler creates a copy of the page and sets the new copy's PTE
> > as
> > Write=0 and SavedDirty=1.
> > - A shared shadow stack PTE. When a shadow stack page is being
> > shared
> > among processes (this happens at fork()), its PTE is made
> > Dirty=0, so
> > the next shadow stack access causes a fault, and the page is
> > duplicated and Dirty=1 is set again. This is the COW equivalent
> > for
> > shadow stack pages, even though it's copy-on-access rather than
> > copy-on-write.
> >
> > (Write=0,SavedDirty=0,Dirty=1):
> > - A shadow stack PTE.
> > - A Cow PTE created when a processor without shadow stack support
> > set
> > Dirty=1.
> >
> > There are six bits left available to software in the 64-bit PTE
> > after
> > consuming a bit for _PAGE_SAVED_DIRTY. No space is consumed in 32-
> > bit
> > kernels because shadow stacks are not enabled there.
> >
> > Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to
> > start
> > creating _PAGE_SAVED_DIRTY PTEs will follow once other pieces are
> > in place.
> >
> > Tested-by: Pengfei Xu <[email protected]>
> > Tested-by: John Allen <[email protected]>
> > Reviewed-by: Kees Cook <[email protected]>
> > Co-developed-by: Yu-cheng Yu <[email protected]>
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> >
> > ---
> > v6:
> > - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
> > - Add _PAGE_SAVED_DIRTY to _PAGE_CHG_MASK
> >
> > v5:
> > - Fix log, comments and whitespace (Boris)
> > - Remove capitalization on shadow stack (Boris)
> >
> > v4:
> > - Teach pte_flags_need_flush() about _PAGE_COW bit
> > - Break apart patch for better bisectability
> >
> > v3:
> > - Add comment around _PAGE_TABLE in response to comment
> > from (Andrew Cooper)
> > - Check for PSE in pmd_shstk (Andrew Cooper)
> > - Get to the point quicker in commit log (Andrew Cooper)
> > - Clarify and reorder commit log for why the PTE bit examples
> > have
> > multiple entries. Apply same changes for comment. (peterz)
> > - Fix comment that implied dirty bit for COW was a specific x86
> > thing
> > (peterz)
> > - Fix swapping of Write/Dirty (PeterZ)
> > ---
> > arch/x86/include/asm/pgtable.h | 79
> > ++++++++++++++++++++++++++++
> > arch/x86/include/asm/pgtable_types.h | 65 ++++++++++++++++++++---
> > arch/x86/include/asm/tlbflush.h | 3 +-
> > 3 files changed, 138 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable.h
> > b/arch/x86/include/asm/pgtable.h
> > index 2b423d697490..110e552eb602 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -301,6 +301,45 @@ static inline pte_t pte_clear_flags(pte_t pte,
> > pteval_t clear)
> > return native_make_pte(v & ~clear);
> > }
> >
> > +/*
> > + * COW and other write protection operations can result in
> > Dirty=1,Write=0
> > + * PTEs. But in the case of X86_FEATURE_USER_SHSTK, the software
> > SavedDirty bit
> > + * is used, since the Dirty=1,Write=0 will result in the memory
> > being treated as
> > + * shadow stack by the HW. So when creating dirty, write-protected
> > memory, a
> > + * software bit is used _PAGE_BIT_SAVED_DIRTY. The following
> > functions
> > + * pte_mksaveddirty() and pte_clear_saveddirty() take a
> > conventional dirty,
> > + * write-protected PTE (Write=0,Dirty=1) and transition it to the
> > shadow stack
> > + * compatible version. (Write=0,SavedDirty=1).
> > + */
> > +static inline pte_t pte_mksaveddirty(pte_t pte)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > + return pte;
> > +
> > + pte = pte_clear_flags(pte, _PAGE_DIRTY);
> > + return pte_set_flags(pte, _PAGE_SAVED_DIRTY);
> > +}
> > +
> > +static inline pte_t pte_clear_saveddirty(pte_t pte)
> > +{
> > + /*
> > + * _PAGE_SAVED_DIRTY is unnecessary on !X86_FEATURE_USER_SHSTK
> > kernels,
> > + * since the HW dirty bit can be used without creating shadow
> > stack
> > + * memory. See the _PAGE_SAVED_DIRTY definition for more
> > details.
> > + */
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > + return pte;
> > +
> > + /*
> > + * PTE is getting copied-on-write, so it will be dirtied
> > + * if writable, or made shadow stack if shadow stack and
> > + * being copied on access. Set the dirty bit for both
> > + * cases.
> > + */
> > + pte = pte_set_flags(pte, _PAGE_DIRTY);
> > + return pte_clear_flags(pte, _PAGE_SAVED_DIRTY);
> > +}
> > +
> > #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> > static inline int pte_uffd_wp(pte_t pte)
> > {
> > @@ -420,6 +459,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd,
> > pmdval_t clear)
> > return native_make_pmd(v & ~clear);
> > }
> >
> > +/* See comments above pte_mksaveddirty() */
> > +static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > + return pmd;
> > +
> > + pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
> > + return pmd_set_flags(pmd, _PAGE_SAVED_DIRTY);
> > +}
> > +
> > +/* See comments above pte_mksaveddirty() */
> > +static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > + return pmd;
> > +
> > + pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
> > + return pmd_clear_flags(pmd, _PAGE_SAVED_DIRTY);
> > +}
> > +
> > #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> > static inline int pmd_uffd_wp(pmd_t pmd)
> > {
> > @@ -491,6 +550,26 @@ static inline pud_t pud_clear_flags(pud_t pud,
> > pudval_t clear)
> > return native_make_pud(v & ~clear);
> > }
> >
> > +/* See comments above pte_mksaveddirty() */
> > +static inline pud_t pud_mksaveddirty(pud_t pud)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > + return pud;
> > +
> > + pud = pud_clear_flags(pud, _PAGE_DIRTY);
> > + return pud_set_flags(pud, _PAGE_SAVED_DIRTY);
> > +}
> > +
> > +/* See comments above pte_mksaveddirty() */
> > +static inline pud_t pud_clear_saveddirty(pud_t pud)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > + return pud;
> > +
> > + pud = pud_set_flags(pud, _PAGE_DIRTY);
> > + return pud_clear_flags(pud, _PAGE_SAVED_DIRTY);
> > +}
> > +
> > static inline pud_t pud_mkold(pud_t pud)
> > {
> > return pud_clear_flags(pud, _PAGE_ACCESSED);
> > diff --git a/arch/x86/include/asm/pgtable_types.h
> > b/arch/x86/include/asm/pgtable_types.h
> > index 0646ad00178b..3b420b6c0584 100644
> > --- a/arch/x86/include/asm/pgtable_types.h
> > +++ b/arch/x86/include/asm/pgtable_types.h
> > @@ -21,7 +21,8 @@
> > #define _PAGE_BIT_SOFTW2 10 /* " */
> > #define _PAGE_BIT_SOFTW3 11 /* " */
> > #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
> > -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
> > +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
> > +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
> > #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4
> > */
> > #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4
> > */
> > #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4
> > */
> > @@ -34,6 +35,15 @@
> > #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software
> > dirty tracking */
> > #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
> >
> > +/*
> > + * Indicates a Saved Dirty bit page.
> > + */
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /*
> > copy-on-write */
>
> Nope, not "copy-on-write" :) It's more like "dirty bit when the hw-
> dirty
> bit cannot be used". Maybe simply drop the comment.

Oops, I missed this when I scrubbed _PAGE_COW. Thanks. Will fix.

>
> > +#else
> > +#define _PAGE_BIT_SAVED_DIRTY 0
> > +#endif
> > +
> > /* If _PAGE_BIT_PRESENT is clear, we use these: */
> > /* - if the user mapped it with PROT_NONE; pte_present gives true
> > */
> > #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
> > @@ -117,6 +127,40 @@
> > #define _PAGE_SOFTW4 (_AT(pteval_t, 0))
> > #endif
> >
> > +/*
> > + * The hardware requires shadow stack to be read-only and Dirty.
> > + * _PAGE_SAVED_DIRTY is a software-only bit used to separate copy-
> > on-write
> > + * PTEs from shadow stack PTEs:
>
> I'd suggest phrasing this differently. COW is just one scenario
> where
> this can happen. Also, I don't think that the description of
> "separation" is correct.
>
> Something like the following maybe?
>
> "
> However, there are valid cases where the kernel might create read-
> only
> PTEs that are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty
> tracking). In this case, the _PAGE_SAVED_DIRTY bit is used instead
> of
> the HW-dirty bit, to avoid creating a wrong "shadow stack" PTEs.
> Such
> PTEs have (Write=0,SavedDirty=1,Dirty=0) set.
>
> Note that on processors without shadow stack support, the
> _PAGE_SAVED_DIRTY remains unused.
> "
>
> Then I would simply drop the below (which is also too COW-specific I
> think).

COW is the main situation where shadow stacks become read-only. So, as
an example it is nice in that COW covers all the scenarios discussed.
Again, do any x86 maintainers want to weigh in here?

>
> > + *
> > + * (Write=0,SavedDirty=1,Dirty=0):
> > + * - A modified, copy-on-write (COW) page. Previously when a
> > typical
> > + * anonymous writable mapping was made COW via fork(), the
> > kernel would
> > + * mark it Write=0,Dirty=1. Now it will instead use the Cow
> > bit. This
> > + * happens in copy_present_pte().
> > + * - A R/O page that has been COW'ed. The user page is in a R/O
> > VMA,
> > + * and get_user_pages(FOLL_FORCE) needs a writable copy. The
> > page fault
> > + * handler creates a copy of the page and sets the new copy's
> > PTE as
> > + * Write=0 and SavedDirty=1.
> > + * - A shared shadow stack PTE. When a shadow stack page is being
> > shared
> > + * among processes (this happens at fork()), its PTE is made
> > Dirty=0, so
> > + * the next shadow stack access causes a fault, and the page is
> > + * duplicated and Dirty=1 is set again. This is the COW
> > equivalent for
> > + * shadow stack pages, even though it's copy-on-access rather
> > than
> > + * copy-on-write.
> > + *
> > + * (Write=0,SavedDirty=0,Dirty=1):
> > + * - A shadow stack PTE.
> > + * - A Cow PTE created when a processor without shadow stack
> > support set
> > + * Dirty=1.
> > + */
>
>

2023-02-20 22:08:19

by Edgecombe, Rick P

Subject: Re: [PATCH v6 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory

On Mon, 2023-02-20 at 13:56 +0100, David Hildenbrand wrote:
> On 18.02.23 22:14, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which requires some core mm changes to function
> > properly.
> >
> > A shadow stack PTE must be read-only and have _PAGE_DIRTY set.
> > However,
> > read-only and Dirty PTEs also exist for copy-on-write (COW) pages.
> > These
> > two cases are handled differently for page faults. Introduce
> > VM_SHADOW_STACK to track shadow stack VMAs.
>
> I suggest simplifying and abstracting that description.
>
> "New hardware extensions implement support for shadow stack memory,
> such
> as x86 Control-flow Enforcement Technology (CET). Let's add a new VM
> flag to identify these areas, for example, to be used to properly
> indicate shadow stack PTEs to the hardware."

Ah yea, that top blurb was added to all the non-x86 arch patches after
some feedback from Andrew Morton. He had said basically (in some more
colorful language) that the changelogs (at the time) were written
assuming the reader knows what a shadow stack is.

So it might be worth keeping a little more info in the log?

>
> >
> > Reviewed-by: Kees Cook <[email protected]>
> > Tested-by: Pengfei Xu <[email protected]>
> > Tested-by: John Allen <[email protected]>
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > Reviewed-by: Kirill A. Shutemov <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > Cc: Kees Cook <[email protected]>
> >
> > ---
> > v6:
> > - Add comment about VM_SHADOW_STACK not being allowed with
> > VM_SHARED
> > (David Hildenbrand)
>
> Might want to add some more meat to the patch description why that
> is
> the case.

Sure.

>
> >
> > v3:
> > - Drop arch specific change in arch_vma_name(). The memory can
> > show as
> > anonymous (Kirill)
> > - Change CONFIG_ARCH_HAS_SHADOW_STACK to
> > CONFIG_X86_USER_SHADOW_STACK
> > in show_smap_vma_flags() (Boris)
> > ---
> > Documentation/filesystems/proc.rst | 1 +
> > fs/proc/task_mmu.c | 3 +++
> > include/linux/mm.h | 8 ++++++++
> > 3 files changed, 12 insertions(+)
> >
> > diff --git a/Documentation/filesystems/proc.rst
> > b/Documentation/filesystems/proc.rst
> > index e224b6d5b642..115843e8cce3 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -564,6 +564,7 @@ encoded manner. The codes are the following:
> > mt arm64 MTE allocation tags are enabled
> > um userfaultfd missing tracking
> > uw userfaultfd wr-protect tracking
> > + ss shadow stack page
> > == =======================================
> >
> > Note that there is no guarantee that every flag and associated
> > mnemonic will
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index af1c49ae11b1..9e2cefe47749 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file
> > *m, struct vm_area_struct *vma)
> > #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> > [ilog2(VM_UFFD_MINOR)] = "ui",
> > #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > + [ilog2(VM_SHADOW_STACK)] = "ss",
> > +#endif
> > };
> > size_t i;
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e6f1789c8e69..76e0a09aeffe 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -315,11 +315,13 @@ extern unsigned int kobjsize(const void
> > *objp);
> > #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-
> > bit architectures */
> > #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-
> > bit architectures */
> > #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-
> > bit architectures */
> > +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit
> > architectures */
> > #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
> > #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
> > #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
> > #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
> > #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
> > +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
> > #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
> >
> > #ifdef CONFIG_ARCH_HAS_PKEYS
> > @@ -335,6 +337,12 @@ extern unsigned int kobjsize(const void
> > *objp);
> > #endif
> > #endif /* CONFIG_ARCH_HAS_PKEYS */
> >
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
>
>
> Should we abstract this to CONFIG_ARCH_USER_SHADOW_STACK, seeing
> that
> other architectures might similarly need it?

There was an ARCH_HAS_SHADOW_STACK but it got removed following this
discussion:

https://lore.kernel.org/lkml/[email protected]/

Now we have this new RFC for riscv as potentially a second
implementation. But it is still very early, and I'm not sure anyone
knows exactly what the similarities will be in a mature version. So I
think it would be better to refactor in an ARCH_HAS_SHADOW_STACK later
(and similar abstractions) once that series is more mature and we have
an idea of what pieces will be shared. I don't have a problem in
principle with an ARCH config, just don't think we should do it yet.

>

2023-02-20 22:32:58

by Edgecombe, Rick P

Subject: Re: [PATCH v6 27/41] x86/mm: Warn if create Write=0,Dirty=1 with raw prot

On Sun, 2023-02-19 at 12:45 -0800, Kees Cook wrote:
> > diff --git a/arch/x86/include/asm/pgtable.h
> > b/arch/x86/include/asm/pgtable.h
> > index f3dc16fc4389..db8fe5511c74 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -1032,7 +1032,15 @@ static inline unsigned long
> > pmd_page_vaddr(pmd_t pmd)
> > * (Currently stuck as a macro because of indirect forward
> > reference
> > * to linux/mm.h:page_to_nid())
> > */
> > -#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page),
> > (pgprot))
> > +#define mk_pte(page,
> > pgprot) \
> > +({
> > \
> > + pgprot_t __pgprot =
> > pgprot; \
> > +
> > \
> > + WARN_ON_ONCE(cpu_feature_enabled(X86_FEATURE_USER_SHSTK)
> > && \
> > + (pgprot_val(__pgprot) & (_PAGE_DIRTY | _PAGE_RW))
> > == \
> > +
> > _PAGE_DIRTY); \
> > + pfn_pte(page_to_pfn(page),
> > __pgprot); \
> > +})
>
> This only warns? Should it also enforce the state?

Hmm, you mean something like forcing Dirty=0 if Write=0?

The thing we are worried about here is some new x86 code that creates
Write=0,Dirty=1 PTEs directly because the developer is unaware or
forgot about shadow stack. The issue the warning actually caught was
kernel memory being marked Write=0,Dirty=1, which today is more about
consistency than any functional issue. But if some future hypothetical
code was creating a userspace PTE like this, and depending on the
memory being read-only, then the enforcement would be useful and
potentially save the day.

The downside is that it adds tricky logic into a low level helper that
shouldn't be required unless strange and wrong new code is added in the
future. And then it is still only useful if the warning doesn't catch
the issue in testing. And then there would be some slight risk that the
Dirty bit was expected to be there in some PTE without shadow stack
exposure, and a functional bug would be introduced.

I'm waffling here. I could be convinced either way. Hopefully that
helps characterize the dilemma at least.
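
To spell out the two options, here is a sketch of the condition the quoted
WARN_ON_ONCE() tests, next to what "enforcing" could look like. Both
function names are hypothetical, the cpu_feature_enabled() part is
omitted, and the enforcement variant is not something the series does.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pgprotval_t;

/* Architectural bit positions for Write and Dirty. */
#define _PAGE_RW	((pgprotval_t)1 << 1)
#define _PAGE_DIRTY	((pgprotval_t)1 << 6)

/* The condition the quoted warning checks: Write=0 together with Dirty=1. */
static bool prot_is_accidental_shstk(pgprotval_t prot)
{
	return (prot & (_PAGE_DIRTY | _PAGE_RW)) == _PAGE_DIRTY;
}

/*
 * Hypothetical enforcement: force Dirty=0 when Write=0, so the result
 * can never be interpreted as shadow stack memory.
 */
static pgprotval_t prot_enforce_sketch(pgprotval_t prot)
{
	if (prot_is_accidental_shstk(prot))
		prot &= ~_PAGE_DIRTY;
	return prot;
}
```

The functional risk mentioned above is visible here: the enforcement
variant silently drops the Dirty bit for any caller that relied on it.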

2023-02-20 22:38:38

by Edgecombe, Rick P

Subject: Re: [PATCH v6 25/41] x86/mm: Introduce MAP_ABOVE4G

On Sun, 2023-02-19 at 12:43 -0800, Kees Cook wrote:
> On Sat, Feb 18, 2023 at 01:14:17PM -0800, Rick Edgecombe wrote:
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which require some core mm changes to function
> > properly.
> >
> > One of the properties is that the shadow stack pointer (SSP), which
> > is a
> > CPU register that points to the shadow stack like the stack pointer
> > points
> > to the stack, can't be pointing outside of the 32 bit address space
> > when
> > the CPU is executing in 32 bit mode. It is desirable to prevent
> > executing
> > in 32 bit mode when shadow stack is enabled because the kernel
> > can't easily
> > support 32 bit signals.
> >
> > On x86 it is possible to transition to 32 bit mode without any
> > special
> > interaction with the kernel, by doing a "far call" to a 32 bit
> > segment.
> > So the shadow stack implementation can use this address space
> > behavior
> > as a feature, by enforcing that shadow stack memory is always
> > created
> > outside of the 32 bit address space. This way userspace will
> > trigger a
> > general protection fault which will in turn trigger a segfault if
> > it
> > tries to transition to 32 bit mode with shadow stack enabled.
> >
> > This provides a clean error generating border for the user if they
> > try
> > to do 32 bit mode shadow stack, rather than leave the
> > kernel in a
> > half working state for userspace to be surprised by.
> >
> > So to allow future shadow stack enabling patches to map shadow
> > stacks
> > out of the 32 bit address space, introduce MAP_ABOVE4G. The
> > behavior
> > is pretty much like MAP_32BIT, except that it has the opposite
> > address
> > range. There are a few differences though.
> >
> > If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use
> > the
> > MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a
> > 32 bit
> > syscall.
>
> Should the interface refuse to accept both set instead?

I guess that might be less surprising. But I think to do this would
either require adding logic to core mm or a new arch breakout. I
actually kind of wish there was an easy way to keep this flag from
being used from userspace and just be a kernel only thing. It is only
used internally in this series and there isn't any known use for
userspace.
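
For what it's worth, the check itself would be trivial; the open question
is only where it would live. A sketch, with a hypothetical function name
and with the MAP_ABOVE4G value assumed for illustration (0x40 is the
existing x86 MAP_32BIT value):

```c
#include <errno.h>

#define SKETCH_MAP_32BIT	0x40UL	/* existing x86 MAP_32BIT value */
#define SKETCH_MAP_ABOVE4G	0x80UL	/* assumed value, for illustration */

/* Hypothetical check: refuse the contradictory flag combination. */
static int check_map_range_flags(unsigned long flags)
{
	if ((flags & SKETCH_MAP_32BIT) && (flags & SKETCH_MAP_ABOVE4G))
		return -EINVAL;
	return 0;
}
```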

>
> Reviewed-by: Kees Cook <[email protected]>

2023-02-20 22:44:25

by Edgecombe, Rick P

Subject: Re: [PATCH v6 22/41] mm/mmap: Add shadow stack pages to memory accounting

On Mon, 2023-02-20 at 13:58 +0100, David Hildenbrand wrote:
> On 18.02.23 22:14, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > The x86 Control-flow Enforcement Technology (CET) feature includes
> > a new
> > type of memory called shadow stack. This shadow stack memory has
> > some
> > unusual properties, which requires some core mm changes to function
> > properly.
> >
> > Account shadow stack pages to stack memory.
> >
> > Reviewed-by: Kees Cook <[email protected]>
> > Tested-by: Pengfei Xu <[email protected]>
> > Tested-by: John Allen <[email protected]>
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > Co-developed-by: Rick Edgecombe <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > Cc: Kees Cook <[email protected]>
> >
> > ---
> > v3:
> > - Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
> > (Kirill)
> >
> > v2:
> > - Remove is_shadow_stack_mapping() and just change it to
> > directly bitwise
> > and VM_SHADOW_STACK.
> >
> > Yu-cheng v26:
> > - Remove redundant #ifdef CONFIG_MMU.
> >
> > Yu-cheng v25:
> > - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for
> > is_shadow_stack_mapping().
> > ---
> > mm/mmap.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 425a9349e610..9f85596cce31 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3290,6 +3290,8 @@ void vm_stat_account(struct mm_struct *mm,
> > vm_flags_t flags, long npages)
> > mm->exec_vm += npages;
> > else if (is_stack_mapping(flags))
> > mm->stack_vm += npages;
> > + else if (flags & VM_SHADOW_STACK)
> > + mm->stack_vm += npages;
>
> Why not modify is_stack_mapping() ?

It kind of sticks out a little in this conditional, but
is_stack_mapping() has this comment:
/*
* Stack area - automatically grows in one direction
*
* VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
* do_mmap() forbids all other combinations.
*/

Shadow stacks don't grow, so they don't quite fit. There used to be an
is_shadow_stack_mapping(), but it was removed because all that was
needed (for the time being) was the simple bitwise AND:

https://lore.kernel.org/lkml/[email protected]/
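
A reduced model of the accounting, to show why the two tests stay
separate. This is a sketch with assumptions: struct mm_counters and
account_stack() are hypothetical, VM_STACK is taken to be VM_GROWSDOWN as
on x86, and only the stack branches of vm_stat_account() are modeled.

```c
#include <stdbool.h>

typedef unsigned long long vm_flags_t;

#define VM_GROWSDOWN	0x00000100ULL
#define VM_SHADOW_STACK	(1ULL << 37)	/* VM_HIGH_ARCH_5 in the quoted patch */
#define VM_STACK	VM_GROWSDOWN	/* assumption: x86-style downward stack */

/* Hypothetical stand-in for the mm_struct counters. */
struct mm_counters {
	long stack_vm;
};

/* Matches only mappings that grow automatically in the stack direction. */
static bool is_stack_mapping(vm_flags_t flags)
{
	return (flags & VM_STACK) == VM_STACK;
}

/* Stack branches only: shadow stack is counted, but via its own test. */
static void account_stack(struct mm_counters *mm, vm_flags_t flags, long npages)
{
	if (is_stack_mapping(flags))
		mm->stack_vm += npages;
	else if (flags & VM_SHADOW_STACK)
		mm->stack_vm += npages;
}
```

A shadow stack VMA has no grows-down flag, so is_stack_mapping() keeps its
documented meaning while both kinds of memory still land in stack_vm.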

2023-02-20 22:52:38

by Edgecombe, Rick P

Subject: Re: [PATCH v6 20/41] x86/mm: Teach pte_mkwrite() about stack memory

On Sun, 2023-02-19 at 12:41 -0800, Kees Cook wrote:
> On Sat, Feb 18, 2023 at 01:14:12PM -0800, Rick Edgecombe wrote:
> > If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory.
> > So
> > when it is made writable with pte_mkwrite(), it should create
> > shadow
> > stack memory, not conventionally writable memory. Now that
> > pte_mkwrite()
> > takes a VMA, and places where shadow stack memory might be created
> > pass
> > one, pte_mkwrite() can know when it should do this.
> >
> > So make pte_mkwrite() create shadow stack memory when the VMA has
> > the
> > VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().
> >
> > This requires referencing VM_SHADOW_STACK in these functions, which
> > are
> > currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK
> > is
> > located) can't be pulled in without causing problems for files that
> > reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c,
> > where
> > they can safely reference VM_SHADOW_STACK.
> >
> > Tested-by: Pengfei Xu <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
>
> Is there any realistic performance impact from making these not
> inline
> now?

Hmm, I can't say definitively. I would think in write protecting
operations, the big cost would not be the PTE setters. For mapping
things read-only from the beginning (user text, etc), I'm not sure. I
guess it gives the compiler less flexibility, but also gives it the
option to have one copy and so less text size overall for the kernel.

Are there any specific microbenchmarks we could run?
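
For anyone sizing up the cost, the extra work in the out-of-line version
is roughly one flag test on the VMA. A userspace sketch of the dispatch
follows; struct vma_sketch and mkwrite_sketch() are hypothetical, the
exact bit handling is my simplification of the quoted commit logs, and it
assumes X86_FEATURE_USER_SHSTK is enabled.

```c
#include <stdint.h>

typedef uint64_t pteval_t;

#define _PAGE_RW		((pteval_t)1 << 1)
#define _PAGE_DIRTY		((pteval_t)1 << 6)
#define _PAGE_SAVED_DIRTY	((pteval_t)1 << 58)
#define VM_SHADOW_STACK		(1ULL << 37)

/* Hypothetical stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long long vm_flags;
};

static pteval_t mkwrite_sketch(pteval_t pte, const struct vma_sketch *vma)
{
	if (vma->vm_flags & VM_SHADOW_STACK) {
		/* Shadow stack memory is encoded as Write=0,Dirty=1. */
		pte &= ~(_PAGE_RW | _PAGE_SAVED_DIRTY);
		return pte | _PAGE_DIRTY;
	}

	/* Conventional memory: set Write and restore a saved Dirty bit. */
	pte |= _PAGE_RW;
	if (pte & _PAGE_SAVED_DIRTY) {
		pte &= ~_PAGE_SAVED_DIRTY;
		pte |= _PAGE_DIRTY;
	}
	return pte;
}
```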

2023-02-20 22:54:15

by Edgecombe, Rick P

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

On Sun, 2023-02-19 at 19:42 -0800, Kees Cook wrote:
> On Sat, Feb 18, 2023 at 01:13:52PM -0800, Rick Edgecombe wrote:
> > This series implements Shadow Stacks for userspace using x86's
> > Control-flow
> > Enforcement Technology (CET). CET consists of two related security
> > features:
> > shadow stacks and indirect branch tracking. This series implements
> > just the
> > shadow stack part of this feature, and just for userspace.
>
> Okay, I've done some bare metal testing, and it all looks happy. The
> selftest passes, and I can see the stack address mismatch get
> detected if I explicitly rewrite the saved function pointer on the
> stack:
>
> [INFO] Want normal flow
> [INFO] Found 0x401890 @ 0x7fff47cf2ef8
> [INFO] Normal execution flow
> [INFO] Want to redirect
> [INFO] Found 0x401890 @ 0x7fff47cf2ef8
> [INFO] Hijacked execution flow
> [INFO] Enabling shadow stack
> [INFO] Want to redirect
> [INFO] Found 0x401890 @ 0x7fff47cf2ef8
> Segmentation fault (core dumped)
>
> Tested-by: Kees Cook <[email protected]>

Thanks, and for the other tags!

2023-02-20 22:57:01

by Edgecombe, Rick P

Subject: Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

On Mon, 2023-02-20 at 12:23 +0100, David Hildenbrand wrote:
> That looks painful but IMHO worth it :)
>
> Acked-by: David Hildenbrand <[email protected]>

Thanks. Yes it was not the most fun, but I agree - worth it.

2023-02-21 02:38:11

by Pengfei Xu

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

Hi Rick,

On 2023-02-18 at 13:13:52 -0800, Rick Edgecombe wrote:
> Hi,
>
...
>
> I left tested-by tags in place per discussion with testers. Testers, please
> retest.
>

1. Tested the user space shstk kselftest on ADL-S and TGL-U, without glibc shstk
support, on CentOS 8 Stream:

// From the test_shadow_stack code in this patch series:
# ./test_shadow_stack
[INFO] new_ssp = 7f014ac2dff8, *new_ssp = 7f014ac2e001
[INFO] changing ssp from 7f014a1ffff0 to 7f014ac2dff8
[INFO] ssp is now 7f014ac2e000
[OK] Shadow stack pivot
[OK] Shadow stack faults
[INFO] Corrupting shadow stack
[INFO] Generated shadow stack violation successfully
[OK] Shadow stack violation test
[INFO] Gup read -> shstk access success
[INFO] Gup write -> shstk access success
[INFO] Violation from normal write
[INFO] Gup read -> write access success
[INFO] Violation from normal write
[INFO] Gup write -> write access success
[INFO] Cow gup write -> write access success
[OK] Shadow gup test
[INFO] Violation from shstk access
[OK] mprotect() test
[OK] Userfaultfd test
[OK] 32 bit test

// shstk violation without SHSTK glibc support
// Code link: https://github.com/intel/lkvs/blob/main/cet/shstk_cp.c
# ./shstk_cp
[PASS] Enable SHSTK successfully
[PASS] Disabling shadow stack successfully
[PASS] Re-enable shadow stack successfully
[PASS] SHSTK enabled, ssp:7fa3bfe00000
[INFO] do_hack() change address for return:
[INFO] Before,ssp:7fa3bfdffff8,*ssp:40133f,rbp:0x7ffc23b5b440,*rbp:7ffc23b5b480,*(rbp+1):40133f
[INFO] After, ssp:7fa3bfdffff8,*ssp:40133f,rbp:0x7ffc23b5b440,*rbp:7ffc23b5b480,*(rbp+1):401146
Segmentation fault (core dumped)

Dmesg:
[1117184.518588] shstk_cp[1523882] control protection ip:40122c sp:7ffc23b5b448 ssp:7fa3bfdffff8 error:1(near ret) in shstk_cp[401000+1000]

// shstk ARCH_SHSTK_STATUS read/set test without SHSTK Glibc support
// Code link: https://github.com/intel/lkvs/blob/main/cet/shstk_unlock_test.c
# ./shstk_unlock_test
[PASS] Parent process enable SHSTK.
[PASS] Parent pid:1522040, ssp:0x7f57fc400000
[INFO] pid:1522040, ssp:0x7f57fc3ffff8, *ssp:401799
[PASS] Unlock CET successfully for pid:1522041
[PASS] GET CET REG ret:0, err:0, ssp:7f57fc3ffff8
[PASS] SET CET REG ret:0, err:0, ssp:7f57fc3ffff8
[PASS] SET ssp -1 failed(expected) ret:-1, errno:22
[PASS] GET xstate successfully ret:0
[PASS] SHSTK is enabled in child process
[INFO] Child:1522041 origin ssp:0x7f57fc400000
[INFO] Child:1522041, ssp:0x7f57fc400000, bp,0x7ffcf32ba0f0, *bp:401dc0, *(bp+1):7f57fc43ad85
[PASS] Disabling shadow stack succesfully
[PASS] SHSTK_STATUS ok, feature:0 is 0, ret:0
[PASS] Child process re-enable ssp
[PASS] SHSTK_STATUS ok, feature:1 1st bit is 1, ret:0
[PASS] Child process enabled wrss
[PASS] SHSTK_STATUS ok, feature:3 2nd bit is 1, ret:0
[INFO] Child:1522041, ssp:0x7f57fc400000, bp,0x7ffcf32ba0f0, *bp:401dc0, *(bp+1):7f57fc43ad85
[INFO] ssp addr:0x7f57fc400000 is same as ssp_verify:0x7f57fc400000
[PASS] Child process disable shstk successfully.
[PASS] Parent process disable shadow stack successfully.
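
As a reading aid for the "feature:N" values in the log above, here is a small sketch of how such a status word can be decoded. The bit positions are an assumption taken from the log itself ("feature:1 1st bit is 1" after enabling shadow stack, "feature:3 2nd bit is 1" after enabling wrss); the MODEL_* names are hypothetical and are not the uapi constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the log above: bit 0 = shadow stack, bit 1 = WRSS. */
#define MODEL_SHSTK_BIT (UINT64_C(1) << 0)
#define MODEL_WRSS_BIT  (UINT64_C(1) << 1)

/* Is shadow stack reported as enabled in this status word? */
static inline bool model_shstk_enabled(uint64_t features)
{
	return (features & MODEL_SHSTK_BIT) != 0;
}

/* Is the WRSS instruction reported as enabled in this status word? */
static inline bool model_wrss_enabled(uint64_t features)
{
	return (features & MODEL_WRSS_BIT) != 0;
}
```

With these helpers, the three values seen in the log (0, 1 and 3) read as "nothing enabled", "shadow stack only" and "shadow stack plus WRSS".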


2. Tested on Fedora 37 with the glibc that Hongjiu provided, which has user space SHSTK support:
// shstk with Glibc support:
// Related Glibc support for Fedora37: http://gnu-4.sc.intel.com/git/?p=hjl/misc.git;a=tree;f=setup/fedora/37;h=63af84a8f28f3d0802f09266e47fb94eb5cdff26;hb=HEAD
# readelf -n shadow_test_fork | head
readelf: Warning: Gap in build notes detected from 0x4011d7 to 0x4011e4

Displaying notes found in: .note.gnu.property
Owner Data size Description
GNU 0x00000040 NT_GNU_PROPERTY_TYPE_0
Properties: x86 feature: IBT, SHSTK
...
// shadow_test_fork code is attached
// gcc -fcf-protection=full -mshstk -O0 -fno-stack-check -fno-stack-protector shadow_test_fork.c -o shadow_test_fork
# ./shadow_test_fork s2
[INFO] s2: stack rbp + 1
[INFO] do_hack() change address for return:
[INFO] After change, rbp+1 to hacked:0x401296
Segmentation fault (core dumped)

Dmesg:
[418653.591014] shadow_test_for[16529] control protection ip:401367 sp:7fff6ed0a728 ssp:7f661265bfe0 error:1(near ret) in shadow_test_fork[401000+1000]

All of the above user space SHSTK tests passed.

Many thanks Rick and all!

Thanks!
BR.
Pengfei

> --
> 2.17.1
>


Attachments:
(No filename) (4.18 kB)
shadow_test_fork.c (9.67 kB)

2023-02-21 08:32:09

by David Hildenbrand

Subject: Re: [PATCH v6 22/41] mm/mmap: Add shadow stack pages to memory accounting

On 20.02.23 23:44, Edgecombe, Rick P wrote:
> On Mon, 2023-02-20 at 13:58 +0100, David Hildenbrand wrote:
>> On 18.02.23 22:14, Rick Edgecombe wrote:
>>> From: Yu-cheng Yu <[email protected]>
>>>
>>> The x86 Control-flow Enforcement Technology (CET) feature includes
>>> a new
>>> type of memory called shadow stack. This shadow stack memory has
>>> some
>>> unusual properties, which requires some core mm changes to function
>>> properly.
>>>
>>> Account shadow stack pages to stack memory.
>>>
>>> Reviewed-by: Kees Cook <[email protected]>
>>> Tested-by: Pengfei Xu <[email protected]>
>>> Tested-by: John Allen <[email protected]>
>>> Signed-off-by: Yu-cheng Yu <[email protected]>
>>> Co-developed-by: Rick Edgecombe <[email protected]>
>>> Signed-off-by: Rick Edgecombe <[email protected]>
>>> Cc: Kees Cook <[email protected]>
>>>
>>> ---
>>> v3:
>>> - Remove unneeded VM_SHADOW_STACK check in accountable_mapping()
>>> (Kirill)
>>>
>>> v2:
>>> - Remove is_shadow_stack_mapping() and just change it to
>>> directly bitwise
>>> and VM_SHADOW_STACK.
>>>
>>> Yu-cheng v26:
>>> - Remove redundant #ifdef CONFIG_MMU.
>>>
>>> Yu-cheng v25:
>>> - Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for
>>> is_shadow_stack_mapping().
>>> ---
>>> mm/mmap.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 425a9349e610..9f85596cce31 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -3290,6 +3290,8 @@ void vm_stat_account(struct mm_struct *mm,
>>> vm_flags_t flags, long npages)
>>> mm->exec_vm += npages;
>>> else if (is_stack_mapping(flags))
>>> mm->stack_vm += npages;
>>> + else if (flags & VM_SHADOW_STACK)
>>> + mm->stack_vm += npages;
>>
>> Why not modify is_stack_mapping() ?
>
> It kind of sticks out a little in this conditional, but
> is_stack_mapping() has this comment:
> /*
> * Stack area - automatically grows in one direction
> *
> * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
> * do_mmap() forbids all other combinations.
> */
>
> Shadow stacks don't grow, so they don't quite fit. There used to be an
> is_shadow_stack_mapping(), but it was removed because all that was
> needed (for the time being) was the simple bitwise AND:
>
> https://lore.kernel.org/lkml/[email protected]/

As there is only a single user of is_stack_mapping(), I'd simply have
adjusted the doc of is_stack_mapping() to include shadow stacks.

--
Thanks,

David / dhildenb


2023-02-21 08:37:29

by David Hildenbrand

Subject: Re: [PATCH v6 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory

On 20.02.23 23:08, Edgecombe, Rick P wrote:
> On Mon, 2023-02-20 at 13:56 +0100, David Hildenbrand wrote:
>> On 18.02.23 22:14, Rick Edgecombe wrote:
>>> From: Yu-cheng Yu <[email protected]>
>>>
>>> The x86 Control-flow Enforcement Technology (CET) feature includes
>>> a new
>>> type of memory called shadow stack. This shadow stack memory has
>>> some
>>> unusual properties, which requires some core mm changes to function
>>> properly.
>>>
>>> A shadow stack PTE must be read-only and have _PAGE_DIRTY set.
>>> However,
>>> read-only and Dirty PTEs also exist for copy-on-write (COW) pages.
>>> These
>>> two cases are handled differently for page faults. Introduce
>>> VM_SHADOW_STACK to track shadow stack VMAs.
>>
>> I suggest simplifying and abstracting that description.
>>
>> "New hardware extensions implement support for shadow stack memory,
>> such
>> as x86 Control-flow Enforcement Technology (CET). Let's add a new VM
>> flag to identify these areas, for example, to be used to properly
>> indicate shadow stack PTEs to the hardware."
>
> Ah yea, that top blurb was added to all the non-x86 arch patches after
> some feedback from Andrew Morton. He had said basically (in some more
> colorful language) that the changelogs (at the time) were written
> assuming the reader knows what a shadow stack is.

Okay. It's a bit repetitive, though.

Ideally, we'd just explain it in the cover letter in detail and
Andrew's script would include the cover letter in the first commit.
IIRC, that's what usually happens.

>
> So it might be worth keeping a little more info in the log?

Copying the same paragraph into each commit is IMHO a bit repetitive.
But these are just my 2 cents.

[...]

>> Should we abstract this to CONFIG_ARCH_USER_SHADOW_STACK, seeing
>> that
>> other architectures might similarly need it?
>
> There was an ARCH_HAS_SHADOW_STACK but it got removed following this
> discussion:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Now we have this new RFC for riscv as potentially a second
> implementation. But it is still very early, and I'm not sure anyone
> knows exactly what the similarities will be in a mature version. So I
> think it would be better to refactor in an ARCH_HAS_SHADOW_STACK later
> (and similar abstractions) once that series is more mature and we have
> an idea of what pieces will be shared. I don't have a problem in
> principle with an ARCH config, just don't think we should do it yet.

Okay, easy to factor out later.

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2023-02-21 08:40:02

by David Hildenbrand

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On 20.02.23 22:38, Edgecombe, Rick P wrote:
> On Mon, 2023-02-20 at 12:32 +0100, David Hildenbrand wrote:
>> On 18.02.23 22:14, Rick Edgecombe wrote:
>>> Some OSes have a greater dependence on software available bits in
>>> PTEs than
>>> Linux. That left the hardware architects looking for a way to
>>> represent a
>>> new memory type (shadow stack) within the existing bits. They chose
>>> to
>>> repurpose a lightly-used state: Write=0,Dirty=1. So in order to
>>> support
>>> shadow stack memory, Linux should avoid creating memory with this
>>> PTE bit
>>> combination unless it intends for it to be shadow stack.
>>>
>>> The reason it's lightly used is that Dirty=1 is normally set by HW
>>> _before_ a write. A write with a Write=0 PTE would typically only
>>> generate
>>> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1
>>> *and*
>>> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware
>>> which
>>> supports shadow stacks will no longer exhibit this oddity.
>>>
>>> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve
>>> this,
>>> in places where Linux normally creates Write=0,Dirty=1, it can use
>>> the
>>> software-defined _PAGE_SAVED_DIRTY in place of the hardware
>>> _PAGE_DIRTY.
>>> In other words, whenever Linux needs to create Write=0,Dirty=1, it
>>> instead
>>> creates Write=0,SavedDirty=1 except for shadow stack, which is
>>> Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
>>> combinations would be set as follows for various types of memory:
>>
>> I would simplify (see below) and not repeat what the patch contains
>> as
>> comments already that detailed.
>
> This verbiage has had quite a bit of x86 maintainer attention already.
> I hear what you are saying, but I'm a bit hesitant to take style
> suggestions at this point for fear of the situation where people ask
> for changes back and forth across different versions. Unless any x86
> maintainers want to chime in again? More responses below.

Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
specialized. But whatever x86 maintainers prefer.

[...]

>> "
>> However, there are valid cases where the kernel might create read-
>> only
>> PTEs that are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty
>> tracking). In this case, the _PAGE_SAVED_DIRTY bit is used instead
>> of
>> the HW-dirty bit, to avoid creating wrong "shadow stack" PTEs.
>> Such
>> PTEs have (Write=0,SavedDirty=1,Dirty=0) set.
>>
>> Note that on processors without shadow stack support, the
>> _PAGE_SAVED_DIRTY remains unused.
>> "
>>
>> Then I would simply drop the text below (which is also too COW-specific I
>> think).
>
> COW is the main situation where shadow stacks become read-only. So, as
> an example it is nice in that COW covers all the scenarios discussed.
> Again, do any x86 maintainers want to weigh in here?

Again, I'd not specialize on COW in all patches too much (IMHO, it
creates more confusion than it actually helps for understanding what's
happening) and just call it a read-only PTE that is dirty. Simple as
that. And it's easy to see why that's problematic, because read-only
PTEs that are dirty would be identified as shadow stack PTEs, which we
want to work around.
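
To make the "read-only PTE that is dirty" problem concrete, here is a small userspace model of the workaround. It is a sketch under assumptions, not the x86 implementation: the bit positions and all model_* names are invented for illustration, and the real helpers handle more cases than shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define MODEL_PAGE_RW          (1u << 1)
#define MODEL_PAGE_DIRTY       (1u << 6)   /* hardware Dirty */
#define MODEL_PAGE_SAVED_DIRTY (1u << 11)  /* software-defined bit */

typedef uint32_t pte_model_t;

/* Write-protect: a dirty PTE must not end up as Write=0,Dirty=1. */
static inline pte_model_t model_pte_wrprotect(pte_model_t pte)
{
	pte &= ~MODEL_PAGE_RW;
	if (pte & MODEL_PAGE_DIRTY) {
		pte &= ~MODEL_PAGE_DIRTY;
		pte |= MODEL_PAGE_SAVED_DIRTY;
	}
	return pte;
}

/* Making the PTE writable again moves SavedDirty back to Dirty. */
static inline pte_model_t model_pte_mkwrite_plain(pte_model_t pte)
{
	pte |= MODEL_PAGE_RW;
	if (pte & MODEL_PAGE_SAVED_DIRTY) {
		pte &= ~MODEL_PAGE_SAVED_DIRTY;
		pte |= MODEL_PAGE_DIRTY;
	}
	return pte;
}

/* Either bit means "this page has been written to". */
static inline bool model_pte_dirty(pte_model_t pte)
{
	return (pte & (MODEL_PAGE_DIRTY | MODEL_PAGE_SAVED_DIRTY)) != 0;
}

/* What the hardware would treat as shadow stack: Write=0, Dirty=1. */
static inline bool model_pte_looks_like_shstk(pte_model_t pte)
{
	return (pte & (MODEL_PAGE_RW | MODEL_PAGE_DIRTY)) == MODEL_PAGE_DIRTY;
}
```

In this model a write-protected dirty page keeps its dirty state for software, yet never carries the Write=0,Dirty=1 combination that would be taken for a shadow stack PTE.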

Again, just my 2 cents. I'm not an x86 maintainer ;)

--
Thanks,

David / dhildenb


2023-02-21 08:43:51

by David Hildenbrand

Subject: Re: [PATCH v6 24/41] mm: Don't allow write GUPs to shadow stack memory

On 18.02.23 22:14, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> Shadow stack memory is writable only in very specific, controlled ways.
> However, since it is writable, the kernel treats it as such. As a result
> there remain many ways for userspace to trigger the kernel to write to
> shadow stacks via get_user_pages(, FOLL_WRITE) operations. To make this a
> little less exposed, block writable GUPs for shadow stack VMAs.
>
> Still allow FOLL_FORCE to write through shadow stack protections, as it
> does for read-only protections.
>
> Reviewed-by: Kees Cook <[email protected]>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
> v3:
> - Add comment in __pte_access_permitted() (Dave)
> - Remove unneeded shadow stack specific check in
> __pte_access_permitted() (Jann)
> ---
> arch/x86/include/asm/pgtable.h | 5 +++++
> mm/gup.c | 2 +-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 6b7106457bfb..20d0df494269 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1641,6 +1641,11 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
> {
> unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
>
> + /*
> + * Write=0,Dirty=1 PTEs are shadow stack, which the kernel
> + * shouldn't generally allow access to, but since they
> + * are already Write=0, the below logic covers both cases.
> + */
> if (write)
> need_pte_bits |= _PAGE_RW;

So, GUP fast will always fail when writing ...

>
> diff --git a/mm/gup.c b/mm/gup.c
> index f45a3a5be53a..bfd33d9edb89 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -982,7 +982,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> return -EFAULT;
>
> if (write) {
> - if (!(vm_flags & VM_WRITE)) {
> + if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
> if (!(gup_flags & FOLL_FORCE))
> return -EFAULT;
> /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */

and ordinary GUP without FOLL_FORCE.
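
The write branch of the quoted mm/gup.c hunk can be modelled in userspace as follows. This is only a sketch: the flag values and the model_* names are assumptions for illustration, and the further checks that the real check_vma_flags() performs after the FOLL_FORCE test are left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values, for illustration only. */
#define MODEL_VM_WRITE        (1u << 1)
#define MODEL_VM_SHADOW_STACK (1u << 5)
#define MODEL_FOLL_WRITE      (1u << 0)
#define MODEL_FOLL_FORCE      (1u << 4)

/*
 * Returns 0 if a GUP with these flags may proceed, -EFAULT otherwise.
 * Only the write permission branch from the hunk above is modelled.
 */
static inline int model_check_write_gup(unsigned int vm_flags,
					unsigned int gup_flags)
{
	if (!(gup_flags & MODEL_FOLL_WRITE))
		return 0;

	if (!(vm_flags & MODEL_VM_WRITE) ||
	    (vm_flags & MODEL_VM_SHADOW_STACK)) {
		/* Only a forced write (e.g. ptrace style access) gets through. */
		if (!(gup_flags & MODEL_FOLL_FORCE))
			return -EFAULT;
	}
	return 0;
}
```

A shadow stack VMA is treated like a read-only VMA here: a plain FOLL_WRITE fails, while FOLL_WRITE together with FOLL_FORCE is still let through.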

Acked-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2023-02-21 08:49:18

by David Hildenbrand

Subject: Re: [PATCH v6 37/41] selftests/x86: Add shadow stack test

On 18.02.23 22:14, Rick Edgecombe wrote:
> Add a simple selftest for exercising some shadow stack behavior:
> - map_shadow_stack syscall and pivot
> - Faulting in shadow stack memory
> - Handling shadow stack violations
> - GUP of shadow stack memory
> - mprotect() of shadow stack memory
> - Userfaultfd on shadow stack memory
>
> Since this test exercises a recently added syscall manually, it needs
> to find the automatically created __NR_foo defines. Per the selftest
> documentation, KHDR_INCLUDES can be used to help the selftest Makefiles
> find the headers from the kernel source. This way the new selftest can
> be built inside the kernel source tree without installing the headers
> to the system. So also add KHDR_INCLUDES as described in the selftest
> docs, to facilitate this.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Co-developed-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---


[...]

> +bool gup_write(void *ptr)
> +{
> + unsigned long val;
> +
> + lseek(fd, (unsigned long)ptr, SEEK_SET);
> + if (write(fd, &val, sizeof(val)) < 0)
> + return 1;

/proc/self/mem is for debug/ptrace access (FOLL_FORCE). I think you
might also want to add tests for ordinary GUP, checking that we fail to
obtain a write pin -- and call these tests "gup_ptrace_read" /
"gup_ptrace_write"

A simple approach would be to trigger a read()/write() on a file opened
via O_DIRECT, using the shadow stack as buffer. While the write()
[reading from the page] is expected to work, a read() [writing to the
page] has to fail.


--
Thanks,

David / dhildenb


2023-02-21 20:02:21

by Edgecombe, Rick P

Subject: Re: [PATCH v6 37/41] selftests/x86: Add shadow stack test

On Tue, 2023-02-21 at 09:48 +0100, David Hildenbrand wrote:
> On 18.02.23 22:14, Rick Edgecombe wrote:
> > Add a simple selftest for exercising some shadow stack behavior:
> > - map_shadow_stack syscall and pivot
> > - Faulting in shadow stack memory
> > - Handling shadow stack violations
> > - GUP of shadow stack memory
> > - mprotect() of shadow stack memory
> > - Userfaultfd on shadow stack memory
> >
> > Since this test exercises a recently added syscall manually, it
> > needs
> > to find the automatically created __NR_foo defines. Per the
> > selftest
> > documentation, KHDR_INCLUDES can be used to help the selftest
> > Makefiles
> > find the headers from the kernel source. This way the new selftest
> > can
> > be built inside the kernel source tree without installing the
> > headers
> > to the system. So also add KHDR_INCLUDES as described in the
> > selftest
> > docs, to facilitate this.
> >
> > Tested-by: Pengfei Xu <[email protected]>
> > Tested-by: John Allen <[email protected]>
> > Co-developed-by: Yu-cheng Yu <[email protected]>
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> >
> > ---
>
>
> [...]
>
> > +bool gup_write(void *ptr)
> > +{
> > + unsigned long val;
> > +
> > + lseek(fd, (unsigned long)ptr, SEEK_SET);
> > + if (write(fd, &val, sizeof(val)) < 0)
> > + return 1;
>
> /proc/self/mem is for debug/ptrace access (FOLL_FORCE). I think you
> might also want to add tests for ordinary GUP, checking that we fail
> to
> obtain a write pin -- and call these tests "gup_ptrace_read" /
> "gup_ptrace_write"

Yes, this only tests the FOLL_FORCE case, but it does exercise GUP.

>
> A simple approach would be to trigger a read()/write() on a file
> opened
> via O_DIRECT, using the shadow stack as buffer. While the write()
> [reading from the page] is expected to work, a read() [writing to
> the
> page] has to fail.

Hmm, good idea. This would be nice to add.
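
A rough sketch of what such a helper might look like is below. It is not part of the posted selftest: the function name, the scratch file name and the result enum are hypothetical, and the expectation that a shadow stack buffer yields EFAULT is an assumption based on the GUP change in this series. The helper only performs the O_DIRECT read() into a caller-supplied buffer; obtaining a shadow stack buffer (for example via map_shadow_stack) is left to the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Fallback so the sketch still builds where the libc hides O_DIRECT; the
 * read is then an ordinary buffered read and does not exercise direct I/O.
 */
#ifndef O_DIRECT
#define O_DIRECT 0
#endif

enum dio_result { DIO_OK, DIO_EFAULT, DIO_SKIP };

/*
 * read() len bytes from an O_DIRECT file into buf. With O_DIRECT the kernel
 * has to pin the user buffer for writing, without FOLL_FORCE, so a shadow
 * stack buffer would be expected to give DIO_EFAULT. Anything that prevents
 * the test from running (no O_DIRECT support, unwritable directory) is
 * reported as DIO_SKIP. buf and len must suit O_DIRECT alignment rules.
 */
static inline enum dio_result model_direct_read_into(void *buf, size_t len)
{
	char path[] = "./dio_test_XXXXXX";	/* hypothetical scratch file */
	enum dio_result res = DIO_SKIP;
	void *src = NULL;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return DIO_SKIP;

	/* Fill the file with a known pattern through a normal write first. */
	if (posix_memalign(&src, 4096, len) == 0) {
		memset(src, 0x5a, len);
		if (write(fd, src, len) == (ssize_t)len && fsync(fd) == 0) {
			int dfd = open(path, O_RDONLY | O_DIRECT);

			if (dfd >= 0) {
				ssize_t n = read(dfd, buf, len);

				if (n == (ssize_t)len)
					res = DIO_OK;
				else if (n < 0 && errno == EFAULT)
					res = DIO_EFAULT;
				close(dfd);
			}
		}
		free(src);
	}

	close(fd);
	unlink(path);
	return res;
}
```

For an ordinary page-aligned buffer the helper should return DIO_OK (or DIO_SKIP on filesystems without O_DIRECT support); a selftest along the lines David describes would pass a shadow stack address instead and check for DIO_EFAULT.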

2023-02-21 20:03:55

by Edgecombe, Rick P

Subject: Re: [PATCH v6 24/41] mm: Don't allow write GUPs to shadow stack memory

On Tue, 2023-02-21 at 09:42 +0100, David Hildenbrand wrote:
> and ordinary GUP without FOLL_FORCE.
>
> Acked-by: David Hildenbrand <[email protected]>

Thanks! And for the other acks.

2023-02-21 20:08:49

by Edgecombe, Rick P

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On Tue, 2023-02-21 at 09:38 +0100, David Hildenbrand wrote:
> Again, I'd not specialize on COW in all patches too much (IMHO, it
> creates more confusion than it actually helps for understanding
> what's
> happening) and just call it a read-only PTE that is dirty. Simple as
> that. And it's easy to see why that's problematic, because read-only
> PTEs that are dirty would be identified as shadow stack PTEs, which
> we
> want to work around.
>
> Again, just my 2 cents. I'm not an x86 maintainer ;)

Right, I see the point. Let's see if they have any opinion. There is a
bit of a historical reason for the focus on COW. As you well know the
dirty bit used to be important for that case. But I think it's still
not a terrible example. It covers some typical cases, but yes, we don't
want to mislead the reader into thinking it is a COW-only scenario.

2023-02-21 20:13:22

by Dave Hansen

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On 2/21/23 00:38, David Hildenbrand wrote:
> Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
> specialized. But whatever x86 maintainers prefer.

At this point, I'm not going to be too nitpicky. I personally think we
need to get _something_ merged. We can then nitpick it to death once
it's in the tree.

So I prefer whatever will move the set along. ;)

2023-02-22 00:10:35

by Edgecombe, Rick P

Subject: Re: [PATCH v6 22/41] mm/mmap: Add shadow stack pages to memory accounting

On Tue, 2023-02-21 at 09:31 +0100, David Hildenbrand wrote:
> > > Why not modify is_stack_mapping() ?
> >
> > It kind of sticks out a little in this conditional, but
> > is_stack_mapping() has this comment:
> > /*
> > * Stack area - automatically grows in one direction
> > *
> > * VM_GROWSUP / VM_GROWSDOWN VMAs are always private anonymous:
> > * do_mmap() forbids all other combinations.
> > */
> >
> > Shadow stacks don't grow, so they don't quite fit. There used to be
> > an
> > is_shadow_stack_mapping(), but it was removed because all that was
> > needed (for the time being) was the simple bitwise AND:
> >
> >
https://lore.kernel.org/lkml/[email protected]/
>
> As there is only a single user of is_stack_mapping(), I'd simply
> have
> adjusted the doc of is_stack_mapping() to include shadow stacks.

Ok, I'll update the comment and add it there.
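
One possible shape of that change, modelled in userspace, is sketched below. It is an assumption about how the result might look, not the patch that was eventually posted: the flag values, struct mm_model and the model_* names are hypothetical, and the accounting function is reduced to the stack case discussed here.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long vm_flags_model_t;

/* Hypothetical flag values, for illustration only. */
#define MODEL_VM_GROWSDOWN    (1ull << 8)
#define MODEL_VM_SHADOW_STACK (1ull << 32)

struct mm_model {
	long total_vm;
	long stack_vm;
};

/*
 * Stack area: either a conventional stack that grows automatically in one
 * direction, or a shadow stack, which does not grow.
 */
static inline bool model_is_stack_mapping(vm_flags_model_t flags)
{
	return (flags & MODEL_VM_GROWSDOWN) ||
	       (flags & MODEL_VM_SHADOW_STACK);
}

/* Simplified accounting: only the total and stack counters are modelled. */
static inline void model_vm_stat_account(struct mm_model *mm,
					 vm_flags_model_t flags, long npages)
{
	mm->total_vm += npages;
	if (model_is_stack_mapping(flags))
		mm->stack_vm += npages;
}
```

With the shadow stack test folded into the helper, the accounting function no longer needs its own VM_SHADOW_STACK branch, which is the simplification David asked for.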

2023-02-22 01:03:32

by Edgecombe, Rick P

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On Tue, 2023-02-21 at 12:13 -0800, Dave Hansen wrote:
> On 2/21/23 00:38, David Hildenbrand wrote:
> > Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
> > specialized. But whatever x86 maintainers prefer.
>
> At this point, I'm not going to be too nitpicky. I personally think
> we
> need to get _something_ merged. We can then nitpick it to death once
> it's in the tree.
>
> So I prefer whatever will move the set along. ;)

Ok, David's general suggestion across these x86/mm patches is to make
things less COW specific. Sounds like you don't have a problem with
that. I'll just do that and hope I don't stir up any additional
concerns. Thanks all.

2023-02-22 09:06:17

by David Hildenbrand

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On 21.02.23 21:13, Dave Hansen wrote:
> On 2/21/23 00:38, David Hildenbrand wrote:
>> Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
>> specialized. But whatever x86 maintainers prefer.
>
> At this point, I'm not going to be too nitpicky. I personally think we
> need to get _something_ merged. We can then nitpick it to death once
> it's in the tree.

Yes, but ... do we have to rush right now?

This series wasn't in -next and we're in the merge window. Is the plan
to still include it into this merge window?

Also, I think concise patch descriptions and comments are not
necessarily nitpicking like "please rename that variable".

>
> So I prefer whatever will move the set along. ;)

If the plan is to merge it in the next merge window (which I suspect,
but I might be wrong), I suggest including it in -next fairly soonish,
and in the meantime, polish the remaining bits.

Knowing the plan would be good ;)

--
Thanks,

David / dhildenb


2023-02-22 17:24:30

by Dave Hansen

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On 2/22/23 01:05, David Hildenbrand wrote:
> This series wasn't in -next and we're in the merge window. Is the plan
> to still include it into this merge window?

No way. It's 6.4 material at the earliest.

I'm just saying to Rick not to worry _too_ much about earlier feedback
from me if folks have more recent review feedback.

2023-02-22 17:28:35

by David Hildenbrand

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On 22.02.23 18:23, Dave Hansen wrote:
> On 2/22/23 01:05, David Hildenbrand wrote:
>> This series wasn't in -next and we're in the merge window. Is the plan
>> to still include it into this merge window?
>
> No way. It's 6.4 material at the earliest.
>
> I'm just saying to Rick not to worry _too_ much about earlier feedback
> from me if folks have more recent review feedback.

Great. So I hope we can get this into -next soon and that we'll only get
non-earth-shattering feedback so this can land in 6.4.

--
Thanks,

David / dhildenb


2023-02-22 17:42:18

by Kees Cook

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On February 22, 2023 9:27:35 AM PST, David Hildenbrand <[email protected]> wrote:
>On 22.02.23 18:23, Dave Hansen wrote:
>> On 2/22/23 01:05, David Hildenbrand wrote:
>>> This series wasn't in -next and we're in the merge window. Is the plan
>>> to still include it into this merge window?
>>
>> No way. It's 6.4 material at the earliest.
>>
>> I'm just saying to Rick not to worry _too_ much about earlier feedback
>> from me if folks have more recent review feedback.
>
>Great. So I hope we can get this into -next soon and that we'll only get non-earth-shattering feedback so this can land in 6.4.

Yes please. Who's going to take it? :)


--
Kees Cook

2023-02-22 17:54:46

by Dave Hansen

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On 2/22/23 09:42, Kees Cook wrote:
> On February 22, 2023 9:27:35 AM PST, David Hildenbrand <[email protected]> wrote:
>> On 22.02.23 18:23, Dave Hansen wrote:
>>> On 2/22/23 01:05, David Hildenbrand wrote:
>>>> This series wasn't in -next and we're in the merge window. Is the plan
>>>> to still include it into this merge window?
>>> No way. It's 6.4 material at the earliest.
>>>
>>> I'm just saying to Rick not to worry _too_ much about earlier feedback
>>> from me if folks have more recent review feedback.
>> Great. So I hope we can get this into -next soon and that we'll only get non-earth-shattering feedback so this can land in 6.4.
> Yes please. Who's going to take it? :)

I'm more than happy to queue it in x86/mm. I'll plan to queue Rick's
next version that he posts and then we can do any fixes or minor changes
on top of that.

2023-02-22 19:29:06

by Borislav Petkov

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

On Sat, Feb 18, 2023 at 01:13:52PM -0800, Rick Edgecombe wrote:
> Hi,
>
> This series implements Shadow Stacks for userspace using x86's Control-flow

What is the base commit this applies on?

It ain't v6.2...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-02-22 19:32:31

by Edgecombe, Rick P

Subject: Re: [PATCH v6 00/41] Shadow stacks for userspace

On Wed, 2023-02-22 at 20:28 +0100, Borislav Petkov wrote:
> On Sat, Feb 18, 2023 at 01:13:52PM -0800, Rick Edgecombe wrote:
> > Hi,
> >
> > This series implements Shadow Stacks for userspace using x86's
> > Control-flow
>
> What is the base commit this applies on?
>
> It ain't v6.2...

It was tip/master the week I sent it out:
0a5e985fb1c8

2023-02-22 19:39:42

by Kees Cook

Subject: Re: [PATCH v6 14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

On Wed, Feb 22, 2023 at 09:54:36AM -0800, Dave Hansen wrote:
> On 2/22/23 09:42, Kees Cook wrote:
> > On February 22, 2023 9:27:35 AM PST, David Hildenbrand <[email protected]> wrote:
> >> On 22.02.23 18:23, Dave Hansen wrote:
> >>> On 2/22/23 01:05, David Hildenbrand wrote:
> >>>> This series wasn't in -next and we're in the merge window. Is the plan
> >>>> to still include it into this merge window?
> >>> No way. It's 6.4 material at the earliest.
> >>>
> >>> I'm just saying to Rick not to worry _too_ much about earlier feedback
> >>> from me if folks have more recent review feedback.
> >> Great. So I hope we can get this into -next soon and that we'll only get non-earth-shattering feedback so this can land in 6.4.
> > > Yes please. Who's going to take it? :)
>
> I'm more than happy to queue it in x86/mm. I'll plan to queue Rick's
> next version that he posts and then we can do any fixes or minor changes
> on top of that.

That sounds lovely; thank you! :)

--
Kees Cook

2023-02-22 22:14:08

by Deepak Gupta

Subject: Re: [PATCH v6 18/41] mm: Introduce VM_SHADOW_STACK for shadow stack memory

On Tue, Feb 21, 2023 at 09:34:35AM +0100, David Hildenbrand wrote:
>On 20.02.23 23:08, Edgecombe, Rick P wrote:
>>On Mon, 2023-02-20 at 13:56 +0100, David Hildenbrand wrote:
>>>On 18.02.23 22:14, Rick Edgecombe wrote:
>>>>From: Yu-cheng Yu <[email protected]>
>>>>
>>>>The x86 Control-flow Enforcement Technology (CET) feature includes
>>>>a new
>>>>type of memory called shadow stack. This shadow stack memory has
>>>>some
>>>>unusual properties, which requires some core mm changes to function
>>>>properly.
>>>>
>>>>A shadow stack PTE must be read-only and have _PAGE_DIRTY set.
>>>>However,
>>>>read-only and Dirty PTEs also exist for copy-on-write (COW) pages.
>>>>These
>>>>two cases are handled differently for page faults. Introduce
>>>>VM_SHADOW_STACK to track shadow stack VMAs.
>>>
>>>I suggest simplifying and abstracting that description.
>>>
>>>"New hardware extensions implement support for shadow stack memory,
>>>such
>>>as x86 Control-flow Enforcement Technology (CET). Let's add a new VM
>>>flag to identify these areas, for example, to be used to properly
>>>indicate shadow stack PTEs to the hardware."
>>
>>Ah yea, that top blurb was added to all the non-x86 arch patches after
>>some feedback from Andrew Morton. He had said basically (in some more
>>colorful language) that the changelogs (at the time) were written
>>assuming the reader knows what a shadow stack is.
>
>Okay. It's a bit repetitive, though.
>
>Ideally, we'd just explain it in the cover letter in detail and
>Andrews's script would include the cover letter in the first commit.
>IIRC, that's what usually happens.
>
>>
>>So it might be worth keeping a little more info in the log?
>
>Copying the same paragraph into each commit is IMHO a bit repetitive.
>But these are just my 2 cents.
>
>[...]
>
>>>Should we abstract this to CONFIG_ARCH_USER_SHADOW_STACK, seeing
>>>that
>>>other architectures might similarly need it?
>>
>>There was an ARCH_HAS_SHADOW_STACK but it got removed following this
>>discussion:
>>
>>https://lore.kernel.org/lkml/[email protected]/
>>
>>Now we have this new RFC for riscv as potentially a second
>>implementation. But it is still very early, and I'm not sure anyone
>>knows exactly what the similarities will be in a mature version. So I
>>think it would be better to refactor in an ARCH_HAS_SHADOW_STACK later
>>(and similar abstractions) once that series is more mature and we have
>>an idea of what pieces will be shared. I don't have a problem in
>>principle with an ARCH config, just don't think we should do it yet.
>
>Okay, easy to factor out later.

I would be more than happy if this config name were abstracted so that arches can
choose to implement it. It's a bit sad that it was generic earlier and was later changed due
to lack of support from other architectures. Now there are three architectures that either
already support shadow stack (x86), have announced support (aarch64), or are planning to
support it (riscv).

However, given the patch reduction I will get from the `pte_mkwrite` refactor, I am in favor of
a future refactor for the config.
>
>Acked-by: David Hildenbrand <[email protected]>
>
>--
>Thanks,
>
>David / dhildenb
>

2023-02-22 23:07:55

by Edgecombe, Rick P

Subject: Re: [PATCH v6 19/41] x86/mm: Check shadow stack page fault errors

On Mon, 2023-02-20 at 13:57 +0100, David Hildenbrand wrote:
> >
> > + /*
> > + * When a page becomes COW it changes from a shadow stack
> > permission
> > + * page (Write=0,Dirty=1) to (Write=0,Dirty=0,SavedDirty=1),
> > which is simply
> > + * read-only to the CPU. When shadow stack is enabled, a RET
> > would
> > + * normally pop the shadow stack by reading it with a "shadow
> > stack
> > + * read" access. However, in the COW case the shadow stack
> > memory does
> > + * not have shadow stack permissions, it is read-only. So it
> > will
> > + * generate a fault.
> > + *
> > + * For conventionally writable pages, a read can be serviced
> > with a
> > + * read only PTE, and COW would not have to happen. But for
> > shadow
> > + * stack, there isn't the concept of read-only shadow stack
> > memory.
> > + * If it is shadow stack permission, it can be modified via
> > CALL and
> > + * RET instructions. So COW needs to happen before any memory
> > can be
> > + * mapped with shadow stack permissions.
> > + *
> > + * Shadow stack accesses (read or write) need to be serviced
> > with
> > + * shadow stack permission memory, so in the case of a shadow
> > stack
> > + * read access, treat it as a WRITE fault so both COW will
> > happen and
> > + * the write fault path will tickle maybe_mkwrite() and map
> > the memory
> > + * shadow stack.
> > + */
>
> Again, I suggest dropping all details about COW from this comment
> and
> from the patch description. It's just one such case that can happen.

Hi David,

I was just trying to edit this one to drop COW details, but I think in
this case, one of the major reasons for the code *is* actually COW. We
are not working around the whole inadvertent shadow stack memory piece
here, but something else: Making sure shadow stack memory is faulted in
and doing COW if required to make this possible. I came up with this,
does it seem better?


/*
 * For conventionally writable pages, a read can be serviced with a
 * read only PTE. But for shadow stack, there isn't a concept of
 * read-only shadow stack memory. If a PTE has the shadow stack
 * permission, it can be modified via CALL and RET instructions. So
 * core MM needs to fault in a writable PTE and do things it already
 * does for write faults.
 *
 * Shadow stack accesses (read or write) need to be serviced with
 * shadow stack permission memory, so in the case of a shadow stack
 * read access, treat it as a WRITE fault so both any required COW will
 * happen and the write fault path will tickle maybe_mkwrite() and map
 * the memory shadow stack.
 */



Thanks,
Rick
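
The PTE states named in the quoted comment can be modeled in a few lines of userspace C. This is only a sketch for readers of the thread, not code from the series: the PTE_* names and both helpers are hypothetical, the Write and Dirty positions follow the usual x86 layout, and the SavedDirty position is an arbitrary software bit chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions; PTE_SAVED_DIRTY is an assumed software bit. */
#define PTE_WRITE       (1ULL << 1)
#define PTE_DIRTY       (1ULL << 6)
#define PTE_SAVED_DIRTY (1ULL << 58)

/* Shadow stack permission as the CPU sees it: Write=0, Dirty=1. */
static bool pte_is_shstk(uint64_t pte)
{
	return (pte & (PTE_WRITE | PTE_DIRTY)) == PTE_DIRTY;
}

/*
 * Write-protect a PTE for COW. Dirty is moved to SavedDirty, so the result
 * is plain read-only (Write=0,Dirty=0,SavedDirty=1) rather than shadow stack.
 */
static uint64_t pte_wrprotect_model(uint64_t pte)
{
	pte &= ~PTE_WRITE;
	if (pte & PTE_DIRTY) {
		pte &= ~PTE_DIRTY;
		pte |= PTE_SAVED_DIRTY;
	}
	return pte;
}
```

In this model a shadow stack PTE that goes through pte_wrprotect_model() stops satisfying pte_is_shstk(), which is why a later RET on that page faults and has to be handled like a write.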

2023-02-23 00:03:53

by Deepak Gupta

Subject: Re: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
>When operating with shadow stacks enabled, the kernel will automatically
>allocate shadow stacks for new threads, however in some cases userspace
>will need additional shadow stacks. The main example of this is the
>ucontext family of functions, which require userspace allocating and
>pivoting to userspace managed stacks.
>
>Unlike most other user memory permissions, shadow stacks need to be
>provisioned with special data in order to be useful. They need to be setup
>with a restore token so that userspace can pivot to them via the RSTORSSP
>instruction. But, the security design of shadow stack's is that they
>should not be written to except in limited circumstances. This presents a
>problem for userspace, as to how userspace can provision this special
>data, without allowing for the shadow stack to be generally writable.
>
>Previously, a new PROT_SHADOW_STACK was attempted, which could be
>mprotect()ed from RW permissions after the data was provisioned. This was
>found to not be secure enough, as other thread's could write to the
>shadow stack during the writable window.
>
>The kernel can use a special instruction, WRUSS, to write directly to
>userspace shadow stacks. So the solution can be that memory can be mapped
>as shadow stack permissions from the beginning (never generally writable
>in userspace), and the kernel itself can write the restore token.
>
>First, a new madvise() flag was explored, which could operate on the
>PROT_SHADOW_STACK memory. This had a couple downsides:
>1. Extra checks were needed in mprotect() to prevent writable memory from
> ever becoming PROT_SHADOW_STACK.
>2. Extra checks/vma state were needed in the new madvise() to prevent
> restore tokens being written into the middle of pre-used shadow stacks.
> It is ideal to prevent restore tokens being added at arbitrary
> locations, so the check was to make sure the shadow stack had never been
> written to.
>3. It stood out from the rest of the madvise flags, as more of direct
> action than a hint at future desired behavior.
>
>So rather than repurpose two existing syscalls (mmap, madvise) that don't
>quite fit, just implement a new map_shadow_stack syscall to allow
>userspace to map and setup new shadow stacks in one step. While ucontext
>is the primary motivator, userspace may have other unforeseen reasons to
>setup it's own shadow stacks using the WRSS instruction. Towards this
>provide a flag so that stacks can be optionally setup securely for the
>common case of ucontext without enabling WRSS. Or potentially have the
>kernel set up the shadow stack in some new way.

Was the following ever attempted?

void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
- limit PROT_SHADOWSTACK protection flag to only mmap (and thus mprotect can't
convert memory from shadow stack to non-shadow stack type or vice versa)
- limit PROT_SHADOWSTACK protection flag to anonymous memory only.
- top level mmap handler to put a token at the base using WRUSS if prot == PROT_SHADOWSTACK

You would essentially get shadow stack manufacturing with an existing (single) syscall.
Being a bit selfish here: this would also allow other architectures to reuse it and
do their own implementation of mapping and placing the token at the base.

>
>The following example demonstrates how to create a new shadow stack with
>map_shadow_stack:
>void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);
>
>Tested-by: Pengfei Xu <[email protected]>
>Tested-by: John Allen <[email protected]>
>Reviewed-by: Kees Cook <[email protected]>
>Signed-off-by: Rick Edgecombe <[email protected]>
>
>---
>v5:
> - Fix addr/mapped_addr (Kees)
> - Switch to EOPNOTSUPP (Kees suggested ENOTSUPP, but checkpatch
> suggests this)
> - Return error for addresses below 4G
>
>v3:
> - Change syscall common -> 64 (Kees)
> - Use bit shift notation instead of 0x1 for uapi header (Kees)
> - Call do_mmap() with MAP_FIXED_NOREPLACE (Kees)
> - Block unsupported flags (Kees)
> - Require size >= 8 to set token (Kees)
>
>v2:
> - Change syscall to take address like mmap() for CRIU's usage
>
>v1:
> - New patch (replaces PROT_SHADOW_STACK).
>---
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/include/uapi/asm/mman.h | 3 ++
> arch/x86/kernel/shstk.c | 59 ++++++++++++++++++++++----
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 2 +-
> kernel/sys_ni.c | 1 +
> 6 files changed, 58 insertions(+), 9 deletions(-)
>
>diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>index c84d12608cd2..f65c671ce3b1 100644
>--- a/arch/x86/entry/syscalls/syscall_64.tbl
>+++ b/arch/x86/entry/syscalls/syscall_64.tbl
>@@ -372,6 +372,7 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
>+451 64 map_shadow_stack sys_map_shadow_stack
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
>diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
>index 5a0256e73f1e..8148bdddbd2c 100644
>--- a/arch/x86/include/uapi/asm/mman.h
>+++ b/arch/x86/include/uapi/asm/mman.h
>@@ -13,6 +13,9 @@
> ((key) & 0x8 ? VM_PKEY_BIT3 : 0))
> #endif
>
>+/* Flags for map_shadow_stack(2) */
>+#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */
>+
> #include <asm-generic/mman.h>
>
> #endif /* _ASM_X86_MMAN_H */
>diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
>index 40f0a55762a9..0a3decab70ee 100644
>--- a/arch/x86/kernel/shstk.c
>+++ b/arch/x86/kernel/shstk.c
>@@ -17,6 +17,7 @@
> #include <linux/compat.h>
> #include <linux/sizes.h>
> #include <linux/user.h>
>+#include <linux/syscalls.h>
> #include <asm/msr.h>
> #include <asm/fpu/xstate.h>
> #include <asm/fpu/types.h>
>@@ -71,19 +72,31 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
> return 0;
> }
>
>-static unsigned long alloc_shstk(unsigned long size)
>+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
>+ unsigned long token_offset, bool set_res_tok)
> {
> int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
> struct mm_struct *mm = current->mm;
>- unsigned long addr, unused;
>+ unsigned long mapped_addr, unused;
>
>- mmap_write_lock(mm);
>- addr = do_mmap(NULL, 0, size, PROT_READ, flags,
>- VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
>+ if (addr)
>+ flags |= MAP_FIXED_NOREPLACE;
>
>+ mmap_write_lock(mm);
>+ mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
>+ VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
> mmap_write_unlock(mm);
>
>- return addr;
>+ if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
>+ goto out;
>+
>+ if (create_rstor_token(mapped_addr + token_offset, NULL)) {
>+ vm_munmap(mapped_addr, size);
>+ return -EINVAL;
>+ }
>+
>+out:
>+ return mapped_addr;
> }
>
> static unsigned long adjust_shstk_size(unsigned long size)
>@@ -134,7 +147,7 @@ static int shstk_setup(void)
> return -EOPNOTSUPP;
>
> size = adjust_shstk_size(0);
>- addr = alloc_shstk(size);
>+ addr = alloc_shstk(0, size, 0, false);
> if (IS_ERR_VALUE(addr))
> return PTR_ERR((void *)addr);
>
>@@ -178,7 +191,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> return 0;
>
> size = adjust_shstk_size(stack_size);
>- addr = alloc_shstk(size);
>+ addr = alloc_shstk(0, size, 0, false);
> if (IS_ERR_VALUE(addr))
> return PTR_ERR((void *)addr);
>
>@@ -371,6 +384,36 @@ static int shstk_disable(void)
> return 0;
> }
>
>+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
>+{
>+ bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
>+ unsigned long aligned_size;
>+
>+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
>+ return -EOPNOTSUPP;
>+
>+ if (flags & ~SHADOW_STACK_SET_TOKEN)
>+ return -EINVAL;
>+
>+ /* If there isn't space for a token */
>+ if (set_tok && size < 8)
>+ return -EINVAL;
>+
>+ if (addr && addr <= 0xFFFFFFFF)
>+ return -EINVAL;
>+
>+ /*
>+ * An overflow would result in attempting to write the restore token
>+ * to the wrong location. Not catastrophic, but just return the right
>+ * error code and block it.
>+ */
>+ aligned_size = PAGE_ALIGN(size);
>+ if (aligned_size < size)
>+ return -EOVERFLOW;
>+
>+ return alloc_shstk(addr, aligned_size, size, set_tok);
>+}
>+
> long shstk_prctl(struct task_struct *task, int option, unsigned long features)
> {
> if (option == ARCH_SHSTK_LOCK) {
>diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>index 33a0ee3bcb2e..392dc11e3556 100644
>--- a/include/linux/syscalls.h
>+++ b/include/linux/syscalls.h
>@@ -1058,6 +1058,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> unsigned long home_node,
> unsigned long flags);
>+asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
>
> /*
> * Architecture-specific system calls
>diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>index 45fa180cc56a..b12940ec5926 100644
>--- a/include/uapi/asm-generic/unistd.h
>+++ b/include/uapi/asm-generic/unistd.h
>@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> #undef __NR_syscalls
>-#define __NR_syscalls 451
>+#define __NR_syscalls 452
>
> /*
> * 32 bit systems traditionally used different
>diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>index 860b2dcf3ac4..cb9aebd34646 100644
>--- a/kernel/sys_ni.c
>+++ b/kernel/sys_ni.c
>@@ -381,6 +381,7 @@ COND_SYSCALL(vm86old);
> COND_SYSCALL(modify_ldt);
> COND_SYSCALL(vm86);
> COND_SYSCALL(kexec_file_load);
>+COND_SYSCALL(map_shadow_stack);
>
> /* s390 */
> COND_SYSCALL(s390_pci_mmio_read);
>--
>2.17.1
>
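
The argument checks in the SYSCALL_DEFINE3() hunk above can be restated as a small userspace model. This is a sketch under assumptions, not the kernel code: check_map_shadow_stack_args() and MODEL_PAGE_SIZE are hypothetical names, a 4 KiB page size and a 64-bit unsigned long are assumed, and the cpu_feature_enabled() check is left out because it has no userspace equivalent.

```c
#include <errno.h>

#define SHADOW_STACK_SET_TOKEN (1ULL << 0)  /* value from the uapi hunk */
#define MODEL_PAGE_SIZE 4096UL               /* assumption: 4 KiB pages */

/*
 * Returns 0 and stores the page-aligned size on success, or a negative
 * errno, applying the checks in the same order as the quoted syscall.
 */
static long check_map_shadow_stack_args(unsigned long addr, unsigned long size,
					unsigned int flags,
					unsigned long *aligned_size)
{
	unsigned long aligned;

	if (flags & ~SHADOW_STACK_SET_TOKEN)
		return -EINVAL;		/* unknown flag bits */

	if ((flags & SHADOW_STACK_SET_TOKEN) && size < 8)
		return -EINVAL;		/* no room for an 8-byte token */

	if (addr && addr <= 0xFFFFFFFFUL)
		return -EINVAL;		/* address hints must be above 4G */

	aligned = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	if (aligned < size)
		return -EOVERFLOW;	/* rounding up wrapped around */

	*aligned_size = aligned;
	return 0;
}
```

Note that the quoted patch passes the original (unaligned) size as the token offset and the aligned size as the mapping length; the model only reproduces the validation step.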

2023-02-23 01:12:00

by Edgecombe, Rick P

Subject: Re: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
> On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
> > When operating with shadow stacks enabled, the kernel will
> > automatically
> > allocate shadow stacks for new threads, however in some cases
> > userspace
> > will need additional shadow stacks. The main example of this is the
> > ucontext family of functions, which require userspace allocating
> > and
> > pivoting to userspace managed stacks.
> >
> > Unlike most other user memory permissions, shadow stacks need to be
> > provisioned with special data in order to be useful. They need to
> > be setup
> > with a restore token so that userspace can pivot to them via the
> > RSTORSSP
> > instruction. But, the security design of shadow stack's is that
> > they
> > should not be written to except in limited circumstances. This
> > presents a
> > problem for userspace, as to how userspace can provision this
> > special
> > data, without allowing for the shadow stack to be generally
> > writable.
> >
> > Previously, a new PROT_SHADOW_STACK was attempted, which could be
> > mprotect()ed from RW permissions after the data was provisioned.
> > This was
> > found to not be secure enough, as other thread's could write to the
> > shadow stack during the writable window.
> >
> > The kernel can use a special instruction, WRUSS, to write directly
> > to
> > userspace shadow stacks. So the solution can be that memory can be
> > mapped
> > as shadow stack permissions from the beginning (never generally
> > writable
> > in userspace), and the kernel itself can write the restore token.
> >
> > First, a new madvise() flag was explored, which could operate on
> > the
> > PROT_SHADOW_STACK memory. This had a couple downsides:
> > 1. Extra checks were needed in mprotect() to prevent writable
> > memory from
> > ever becoming PROT_SHADOW_STACK.
> > 2. Extra checks/vma state were needed in the new madvise() to
> > prevent
> > restore tokens being written into the middle of pre-used shadow
> > stacks.
> > It is ideal to prevent restore tokens being added at arbitrary
> > locations, so the check was to make sure the shadow stack had
> > never been
> > written to.
> > 3. It stood out from the rest of the madvise flags, as more of
> > direct
> > action than a hint at future desired behavior.
> >
> > So rather than repurpose two existing syscalls (mmap, madvise) that
> > don't
> > quite fit, just implement a new map_shadow_stack syscall to allow
> > userspace to map and setup new shadow stacks in one step. While
> > ucontext
> > is the primary motivator, userspace may have other unforeseen
> > reasons to
> > setup it's own shadow stacks using the WRSS instruction. Towards
> > this
> > provide a flag so that stacks can be optionally setup securely for
> > the
> > common case of ucontext without enabling WRSS. Or potentially have
> > the
> > kernel set up the shadow stack in some new way.
>
> Was following ever attempted?
>
> void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
> - limit PROT_SHADOWSTACK protection flag to only mmap (and thus
> mprotect can't
> convert memory from shadow stack to non-shadow stack type or vice
> versa)
> - limit PROT_SHADOWSTACK protection flag to anonymous memory only.
> - top level mmap handler to put a token at the base using WRUSS if
> prot == PROT_SHADOWSTACK
>
> You essentially would get shadow stack manufacturing with existing
> (single) syscall.
> Acting a bit selfish here, this allows other architectures as well to
> re-use this and
> do their own implementation of mapping and placing the token at the
> base.

Yes, I looked at it. You end up with a pile of checks and hooks added
to mmap() and various other places as you outline. We also now have the
MAP_ABOVE4G limitation for x86 shadow stack that would need checking
for too. It's not exactly a clean fit. Then, callers would have to pass
special x86 flags in anyway.

It doesn't seem like the complexity of the checks is worth saving the
tiny syscall. Is there some reason why riscv can't use the same syscall
stub? It doesn't need to live forever in x86 code. Not sure what the
savings for riscv of the mmap+checks approach are either...

I did wonder if there could be some sort of more general syscall for
mapping and provisioning special security-type memory. But we probably
need a few more non-shadow stack examples to get an idea of what that
would look like.

2023-02-23 12:15:24

by Heiko Carstens

Subject: Re: [PATCH v6 12/41] s390/mm: Introduce pmd_mkwrite_kernel()

On Sat, Feb 18, 2023 at 01:14:04PM -0800, Rick Edgecombe wrote:
> The x86 Control-flow Enforcement Technology (CET) feature includes a new
> type of memory called shadow stack. This shadow stack memory has some
> unusual properties, which requires some core mm changes to function
> properly.
>
> One of these changes is to allow for pmd_mkwrite() to create different
> types of writable memory (the existing conventionally writable type and
> also the new shadow stack type). Future patches will convert pmd_mkwrite()
> to take a VMA in order to facilitate this, however there are places in the
> kernel where pmd_mkwrite() is called outside of the context of a VMA.
> These are for kernel memory. So create a new variant called
> pmd_mkwrite_kernel() and switch the kernel users over to it. Have
> pmd_mkwrite() and pmd_mkwrite_kernel() be the same for now. Future patches
> will introduce changes to make pmd_mkwrite() take a VMA.
>
> Only do this for architectures that need it because they call pmd_mkwrite()
> in arch code without an associated VMA. Since it will only currently be
> used in arch code, so do not include it in arch_pgtable_helpers.rst.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Tested-by: Pengfei Xu <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
...
> ---
> arch/s390/include/asm/pgtable.h | 7 ++++++-
> arch/s390/mm/pageattr.c | 2 +-
> 2 files changed, 7 insertions(+), 2 deletions(-)

Acked-by: Heiko Carstens <[email protected]>

2023-02-23 12:56:48

by David Hildenbrand

Subject: Re: [PATCH v6 19/41] x86/mm: Check shadow stack page fault errors

On 23.02.23 00:07, Edgecombe, Rick P wrote:
> On Mon, 2023-02-20 at 13:57 +0100, David Hildenbrand wrote:
>>>
>>> + /*
>>> + * When a page becomes COW it changes from a shadow stack
>>> permission
>>> + * page (Write=0,Dirty=1) to (Write=0,Dirty=0,SavedDirty=1),
>>> which is simply
>>> + * read-only to the CPU. When shadow stack is enabled, a RET
>>> would
>>> + * normally pop the shadow stack by reading it with a "shadow
>>> stack
>>> + * read" access. However, in the COW case the shadow stack
>>> memory does
>>> + * not have shadow stack permissions, it is read-only. So it
>>> will
>>> + * generate a fault.
>>> + *
>>> + * For conventionally writable pages, a read can be serviced
>>> with a
>>> + * read only PTE, and COW would not have to happen. But for
>>> shadow
>>> + * stack, there isn't the concept of read-only shadow stack
>>> memory.
>>> + * If it is shadow stack permission, it can be modified via
>>> CALL and
>>> + * RET instructions. So COW needs to happen before any memory
>>> can be
>>> + * mapped with shadow stack permissions.
>>> + *
>>> + * Shadow stack accesses (read or write) need to be serviced
>>> with
>>> + * shadow stack permission memory, so in the case of a shadow
>>> stack
>>> + * read access, treat it as a WRITE fault so both COW will
>>> happen and
>>> + * the write fault path will tickle maybe_mkwrite() and map
>>> the memory
>>> + * shadow stack.
>>> + */
>>
>> Again, I suggest dropping all details about COW from this comment
>> and
>> from the patch description. It's just one such case that can happen.
>
> Hi David,

Hi Rick,

>
> I was just trying to edit this one to drop COW details, but I think in
> this case, one of the major reasons for the code *is* actually COW. We
> are not working around the whole inadvertent shadow stack memory piece
> here, but something else: Making sure shadow stack memory is faulted in
> and doing COW if required to make this possible. I came up with this,
> does it seem better?

Regarding the fault handling I completely agree. We have to treat a read
like a write event. And as read-only shadow stack PTEs don't exist, we
have to tell the MM to create a writable one for us.

>
>
> /*
>  * For conventionally writable pages, a read can be serviced with a
>  * read only PTE. But for shadow stack, there isn't a concept of
>  * read-only shadow stack memory. If a PTE has the shadow stack
>  * permission, it can be modified via CALL and RET instructions. So
>  * core MM needs to fault in a writable PTE and do things it already
>  * does for write faults.
>  *
>  * Shadow stack accesses (read or write) need to be serviced with
>  * shadow stack permission memory, so in the case of a shadow stack
>  * read access, treat it as a WRITE fault so both any required COW will
>  * happen and the write fault path will tickle maybe_mkwrite() and map
>  * the memory shadow stack.
>  */

That sounds good! I'd rewrite the last part slightly.

"
Shadow stack accesses (read or write) need to be serviced with
shadow stack permission memory, which always includes write permissions.
So in the case of a shadow stack read access, treat it as a WRITE fault.
This will make sure that MM will prepare everything (e.g., break COW)
such that maybe_mkwrite() can create a proper shadow stack PTE.
"

--
Thanks,

David / dhildenb
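
The rule agreed on in this exchange, that a shadow stack read is handled as a write fault, can be sketched as a pure function. This is a minimal userspace model, not the patch itself: fault_flags_from_error_code() is a hypothetical name, the FAULT_FLAG_WRITE value is a stand-in, and the error code bit positions are the x86 ones as I understand them (bit 1 for write, bit 6 for shadow stack access).

```c
/* Page fault error code bits; X86_PF_SHSTK marks a shadow stack access. */
#define X86_PF_WRITE  (1UL << 1)
#define X86_PF_SHSTK  (1UL << 6)

/* Stand-in for the core MM flag; the value here is illustrative. */
#define FAULT_FLAG_WRITE (1U << 0)

static unsigned int fault_flags_from_error_code(unsigned long error_code)
{
	unsigned int flags = 0;

	if (error_code & X86_PF_WRITE)
		flags |= FAULT_FLAG_WRITE;

	/* Shadow stack read or write: no read-only mapping can satisfy it. */
	if (error_code & X86_PF_SHSTK)
		flags |= FAULT_FLAG_WRITE;

	return flags;
}
```

With the write flag set, core MM runs its normal write fault path (breaking COW where needed), after which maybe_mkwrite() can install a shadow stack PTE.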


2023-02-23 13:48:31

by Borislav Petkov

Subject: Re: [PATCH v6 37/41] selftests/x86: Add shadow stack test

On Sat, Feb 18, 2023 at 01:14:29PM -0800, Rick Edgecombe wrote:
> Since this test exercises a recently added syscall manually, it needs
> to find the automatically created __NR_foo defines. Per the selftest
> documentation, KHDR_INCLUDES can be used to help the selftest Makefile's

Well, why don't you make it easier for the user of this to not have to
jump through hoops to get the test built?

IOW, something like the below on top.

It works if I do

$ make -j<num> test_shadow_stack_64

It would only need to be fixed to work when you do

$ make -j<num>

without arguments as then make does a parallel build.

I guess something like

ifneq ($(filter test_shadow_stack_64, $(MAKECMDGOALS)),)
.NOTPARALLEL:
endif

needs to happen but I'm not sure...

Thx.

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index a5c5ee73052a..9287dc7c0263 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -14,6 +14,10 @@ top_srcdir ?= ../../..
abs_srctree := $(shell cd $(top_srcdir) && pwd)
KHDR_INCLUDES := -isystem ${abs_srctree}/usr/include

+test_shadow_stack_64: test_shadow_stack.c helpers.h
+ cd $(top_srcdir) && $(MAKE) headers
+ $(CC) -m64 -o $@ $(CFLAGS) $(EXTRA_CFLAGS) $(KHDR_INCLUDES) $^ -lrt -ldl
+
TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
test_vsyscall mov_ss_trap \
@@ -22,7 +26,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- corrupt_xstate_header amx test_shadow_stack
+ corrupt_xstate_header amx
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

@@ -38,7 +42,7 @@ BINARIES_64 := $(TARGETS_C_64BIT_ALL:%=%_64)
BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))

-CFLAGS := -O2 -g -std=gnu99 -pthread -Wall $(KHDR_INCLUDES)
+CFLAGS := -O2 -g -std=gnu99 -pthread -Wall

# call32_from_64 in thunks.S uses absolute addresses.
ifeq ($(CAN_BUILD_WITH_NOPIE),1)

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-02-23 17:55:12

by Edgecombe, Rick P

Subject: Re: [PATCH v6 37/41] selftests/x86: Add shadow stack test

On Thu, 2023-02-23 at 14:47 +0100, Borislav Petkov wrote:
> On Sat, Feb 18, 2023 at 01:14:29PM -0800, Rick Edgecombe wrote:
> > Since this test exercises a recently added syscall manually, it
> > needs
> > to find the automatically created __NR_foo defines. Per the
> > selftest
> > documentation, KHDR_INCLUDES can be used to help the selftest
> > Makefile's
>
> Well, why don't you make it easier for the user of this to not have
> to
> jump through hoops to get the test built?
>
> IOW, something like the below ontop.
>
> It works if I do
>
> $ make -j<num> test_shadow_stack_64
>
> It would only need to be fixed to work when you do
>
> $ make -j<num>
>
> without arguments as then make does a parallel build.
>
> I guess something like
>
> ifneq ($(filter test_shadow_stack_64, $(MAKECMDGOALS)),)
> .NOTPARALLEL:
> endif
>
> needs to happen but I'm not sure...

Ah, I see. I had built the kernel with CONFIG_HEADERS_INSTALL and so
this was already done for me.

The proposed Makefile solution seems a bit unusual. What about this
less complicated solution to just make this case work?

diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
index 71de3527c67a..02fe1b135ba8 100644
--- a/tools/testing/selftests/x86/test_shadow_stack.c
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -34,6 +34,24 @@

#define SS_SIZE 0x200000

+/*
+ * Define the ABI defines if needed, so people can run the tests
+ * without building the headers.
+ */
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 451
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
+
+#define ARCH_SHSTK_ENABLE 0x5001
+#define ARCH_SHSTK_DISABLE 0x5002
+#define ARCH_SHSTK_LOCK 0x5003
+#define ARCH_SHSTK_UNLOCK 0x5004
+#define ARCH_SHSTK_STATUS 0x5005
+
+#define ARCH_SHSTK_SHSTK (1ULL << 0)
+#define ARCH_SHSTK_WRSS (1ULL << 1)
+#endif
+
#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
int main(int argc, char *argv[])
{


I was a bit surprised this was even as tricky as the KHDR_INCLUDES
part. The other tests seem to be fine only because they use old header
definitions that have been there forever. So it seems like the proper
way to build the selftests should involve building the headers. So
alternatively, why not just always encourage building the headers
before running the selftests by warning if
${abs_srctree}/usr/include/linux is not found?

Do either of those seem any better?
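
For context, fallback defines like the ones in the hunk above are what let a test call the new syscall through syscall(2) when the installed headers predate it. A minimal sketch follows; map_shadow_stack_raw() is a hypothetical helper name, and 451 is the number assigned in this version of the series, which may differ in a released kernel (the #ifndef lets installed headers take precedence).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks mirroring the hunk above; used only if the headers lack them. */
#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 451
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
#endif

/* Hypothetical helper: returns the mapping, or (void *)-1 with errno set. */
static void *map_shadow_stack_raw(void *addr, size_t size, unsigned int flags)
{
	return (void *)syscall(__NR_map_shadow_stack, addr, size, flags);
}
```

A caller would pass addr as NULL to let the kernel choose the location and SHADOW_STACK_SET_TOKEN to have the kernel write the restore token, matching the example in the patch 33 changelog.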

2023-02-23 17:59:54

by Edgecombe, Rick P

Subject: Re: [PATCH v6 12/41] s390/mm: Introduce pmd_mkwrite_kernel()

On Thu, 2023-02-23 at 13:14 +0100, Heiko Carstens wrote:
> Acked-by: Heiko Carstens <[email protected]>

Thanks!

2023-02-23 21:20:34

by Deepak Gupta

Subject: Re: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

On Wed, Feb 22, 2023 at 5:11 PM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
> > On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
> > > When operating with shadow stacks enabled, the kernel will
> > > automatically
> > > allocate shadow stacks for new threads, however in some cases
> > > userspace
> > > will need additional shadow stacks. The main example of this is the
> > > ucontext family of functions, which require userspace allocating
> > > and
> > > pivoting to userspace managed stacks.
> > >
> > > Unlike most other user memory permissions, shadow stacks need to be
> > > provisioned with special data in order to be useful. They need to
> > > be setup
> > > with a restore token so that userspace can pivot to them via the
> > > RSTORSSP
> > > instruction. But, the security design of shadow stack's is that
> > > they
> > > should not be written to except in limited circumstances. This
> > > presents a
> > > problem for userspace, as to how userspace can provision this
> > > special
> > > data, without allowing for the shadow stack to be generally
> > > writable.
> > >
> > > Previously, a new PROT_SHADOW_STACK was attempted, which could be
> > > mprotect()ed from RW permissions after the data was provisioned.
> > > This was
> > > found to not be secure enough, as other thread's could write to the
> > > shadow stack during the writable window.
> > >
> > > The kernel can use a special instruction, WRUSS, to write directly
> > > to
> > > userspace shadow stacks. So the solution can be that memory can be
> > > mapped
> > > as shadow stack permissions from the beginning (never generally
> > > writable
> > > in userspace), and the kernel itself can write the restore token.
> > >
> > > First, a new madvise() flag was explored, which could operate on
> > > the
> > > PROT_SHADOW_STACK memory. This had a couple downsides:
> > > 1. Extra checks were needed in mprotect() to prevent writable
> > > memory from
> > > ever becoming PROT_SHADOW_STACK.
> > > 2. Extra checks/vma state were needed in the new madvise() to
> > > prevent
> > > restore tokens being written into the middle of pre-used shadow
> > > stacks.
> > > It is ideal to prevent restore tokens being added at arbitrary
> > > locations, so the check was to make sure the shadow stack had
> > > never been
> > > written to.
> > > 3. It stood out from the rest of the madvise flags, as more of
> > > direct
> > > action than a hint at future desired behavior.
> > >
> > > So rather than repurpose two existing syscalls (mmap, madvise) that
> > > don't
> > > quite fit, just implement a new map_shadow_stack syscall to allow
> > > userspace to map and setup new shadow stacks in one step. While
> > > ucontext
> > > is the primary motivator, userspace may have other unforeseen
> > > reasons to
> > > setup it's own shadow stacks using the WRSS instruction. Towards
> > > this
> > > provide a flag so that stacks can be optionally setup securely for
> > > the
> > > common case of ucontext without enabling WRSS. Or potentially have
> > > the
> > > kernel set up the shadow stack in some new way.
> >
> > Was following ever attempted?
> >
> > void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
> > - limit PROT_SHADOWSTACK protection flag to only mmap (and thus
> > mprotect can't
> > convert memory from shadow stack to non-shadow stack type or vice
> > versa)
> > - limit PROT_SHADOWSTACK protection flag to anonymous memory only.
> > - top level mmap handler to put a token at the base using WRUSS if
> > prot == PROT_SHADOWSTACK
> >
> > You essentially would get shadow stack manufacturing with existing
> > (single) syscall.
> > Acting a bit selfish here, this allows other architectures as well to
> > re-use this and
> > do their own implementation of mapping and placing the token at the
> > base.
>
> Yes, I looked at it. You end up with a pile of checks and hooks added
> to mmap() and various other places as you outline. We also now have the
> MAP_ABOVE4G limitation for x86 shadow stack that would need checking
> for too. It's not exactly a clean fit. Then, callers would have to pass
> special x86 flags in anyway.

riscv has a mechanism by which a 32-bit app can run on a 64-bit kernel.
So technically, if there is both 32-bit and 64-bit code in the address
space, MAP_ABOVE4G could be useful.
Although I am not aware of any such requirement from apps/developers
yet (to guarantee address mappings above 4G).

But I see this as orthogonal to memory protection flags.

>
> It doesn't seem like the complexity of the checks is worth saving the
> tiny syscall. Is there some reason why riscv can't use the same syscall
> stub? It doesn't need to live forever in x86 code. Not sure what the
> savings are for riscv of the mmap+checks approach are either...

I don't see a lot of extra complexity here.
If `mprotect` and friends don't know about `PROT_SHADOWSTACK`, they'll
just fail by default (which is desired)

It's only `mmap` that needs to be enlightened. And it can just pass
`VM_SHADOW_STACK` to `do_mmap` if the input is `PROT_SHADOWSTACK`.
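
A rough sketch of the idea, as a userspace model rather than kernel
code. All SK_* names and values are made up for illustration; the only
point is that mmap translates the new prot bit into a VMA flag, while
an mprotect that does not know the bit rejects it:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SK_PROT_READ         0x01UL
#define SK_PROT_WRITE        0x02UL
#define SK_PROT_EXEC         0x04UL
#define SK_PROT_SHADOWSTACK  0x40UL

#define SK_VM_READ           0x01UL
#define SK_VM_WRITE          0x02UL
#define SK_VM_SHADOW_STACK   0x100UL

/* mmap side: the only place that understands the shadow stack bit. */
long sk_mmap_prot_to_vm(unsigned long prot, unsigned long *vm_flags)
{
	assert(vm_flags != NULL);

	if (prot & SK_PROT_SHADOWSTACK) {
		/* Shadow stack may not be combined with other protections. */
		if (prot != SK_PROT_SHADOWSTACK)
			return -EINVAL;
		*vm_flags = SK_VM_SHADOW_STACK;
		return 0;
	}

	*vm_flags = 0;
	if (prot & SK_PROT_READ)
		*vm_flags |= SK_VM_READ;
	if (prot & SK_PROT_WRITE)
		*vm_flags |= SK_VM_WRITE;
	return 0;
}

/* mprotect side: unknown prot bits simply fail. */
long sk_mprotect_check(unsigned long prot)
{
	const unsigned long known =
		SK_PROT_READ | SK_PROT_WRITE | SK_PROT_EXEC;

	return (prot & ~known) ? -EINVAL : 0;
}
```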

Adding a syscall just for mapping shadow stack is weird when it can be
solved with existing system calls.
As you say in your response below, it would be good to have such a
syscall which serves a larger purpose (e.g. provisioning special
security-type memory).

arm64's memory tagging is one such example. It is not exactly
security-type memory (though the eventual application of this feature
is security).
It adds extra meaning to virtual addresses (i.e. an address has tags).
arm64 went with a protection flag `PROT_MTE` instead of a
special system call.

That being said, this patch has gone through multiple revisions
and I am new to the party. If others don't have issues with this
special system call,
I think it's fine then. In the case of riscv, I can choose to use this
mechanism or go arm's route and define a PROT_SHADOWSTACK which is
arch specific.

>
> I did wonder if there could be some sort of more general syscall for
> mapping and provisioning special security-type memory. But we probably
> need a few more non-shadow stack examples to get an idea of what that
> would look like.

As I mentioned, memory tagging, and thus PROT_MTE, is already such a
use case: it uses `mmap/mprotect` protection flags to give special
meaning to a virtual address.

2023-02-23 23:44:39

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

On Thu, 2023-02-23 at 13:20 -0800, Deepak Gupta wrote:
> > It doesn't seem like the complexity of the checks is worth saving
> > the
> > tiny syscall. Is there some reason why riscv can't use the same
> > syscall
> > stub? It doesn't need to live forever in x86 code. Not sure what
> > the
> > savings are for riscv of the mmap+checks approach are either...
>
> I don't see a lot of extra complexity here.
> If `mprotect` and friends don't know about `PROT_SHADOWSTACK`,
> they'll
> just fail by default (which is desired)

To me, the options look like: cram it into an existing syscall or
create one that does exactly what is needed.

To replace the new syscall with mmap(), with the existing MM
implementation, I think you would need to add to mmap:
1. Limit PROT_SHADOW_STACK to anonymous, private memory.
2. Limit PROT_SHADOW_STACK to MAP_ABOVE4G (or create a MAP_SHADOW_STACK
that does both). MAP_ABOVE4G protects from using shadow stack in 32 bit
mode, after some ABI issues were raised. So it is supposed to be
enforced.
3. Add additional logic for MAP_ABOVE4G to work with MAP_FIXED as is
required by CRIU.
4. Add another MAP_ flag to specify whether to write the token (AFAIK
it would be the first flag to do something like that, so there would
likely be debates about whether it fits into the flags). Sort out the
behavior of trying to write the token to a non-PROT_SHADOW_STACK mmap
call.
5. Add arch breakout for mmap to call into the token writing.
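
To make the list concrete, here is a userspace model of roughly what
those checks would amount to. It is only a sketch: every SK_* name and
value is invented for illustration (including the token flag from item
4), and rejecting the token flag on a non-shadow-stack mapping is just
one of the possible behaviors that would need sorting out:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SK_PROT_SHADOW_STACK 0x40UL
#define SK_MAP_PRIVATE       0x02UL
#define SK_MAP_FIXED         0x10UL
#define SK_MAP_ANONYMOUS     0x20UL
#define SK_MAP_ABOVE4G       0x80UL
#define SK_MAP_SHSTK_TOKEN   0x100UL	/* item 4: made-up flag */

#define SK_4G (1ULL << 32)

long sk_check_shstk_mmap(uint64_t addr, unsigned long prot,
			 unsigned long flags)
{
	bool shstk = (prot & SK_PROT_SHADOW_STACK) != 0;

	/* Item 4: token requested on a non-shadow-stack mapping. */
	if (!shstk)
		return (flags & SK_MAP_SHSTK_TOKEN) ? -EINVAL : 0;

	/* Item 1: anonymous, private memory only. */
	if (!(flags & SK_MAP_ANONYMOUS) || !(flags & SK_MAP_PRIVATE))
		return -EINVAL;

	/* Item 2: MAP_ABOVE4G is mandatory for shadow stacks. */
	if (!(flags & SK_MAP_ABOVE4G))
		return -EINVAL;

	/* Item 3: MAP_FIXED has to respect the 4G limit as well. */
	if ((flags & SK_MAP_FIXED) && addr < SK_4G)
		return -ERANGE;

	/* Item 5: the arch hook writing the token would run after this. */
	return 0;
}
```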

I think you are right that no mprotect() changes would be needed with
the current shadow stack MM implementation (it wasn't the case for the
original PROT_SHADOW_STACK). But I'm still not sure if it is simpler
than the syscall.

I do wonder a little bit about trying to remove some of these
limitations of the existing shadow stack MM implementation, which could
make an mmap based interface a little better fit. Like if someday
shadow stack memory added support for all the options that mmap
supports. But I'm not sure if that would just result in more complexity
in other places (MM, signals) that would barely get used. Like, I'm not
sure how shadow stack permissioned mmap()ed files should work. You
would have to require writable files to map them as shadow stack, but
then that is not as locked down as we have today with the anonymous
memory. And a "shadow stack" file permission would seem a bit
overboard.

Then probably some more dirty bit issues with mmaped files. I'm not
fully sure. That one was definitely an issue when PROT_SHADOW_STACK was
dropped, but David Hildenbrand has now removed at least some of the
issues it hit.

So the optimum solution might depend on if we add more shadow stack MM
support later. But it is always possible to add mmap support later too.

>
> It's only `mmap` that needs to be enlightened. And it can just pass
> `VMA_SHADOW_STACK` to `do_mmap` if input is `PROT_SHADOWSTACK`.
>
> Adding a syscall just for mapping shadow stack is weird when it can
> be
> solved with existing system calls.
> As you say in your response below, it would be good to have such a
> syscall which serve larger purposes (e.g. provisioning special
> security-type memory)
>
> arm64's memory tagging is one such example. Not exactly security-type
> memory (but eventual application is security for this feature) .
> It adds extra meaning to virtual addresses (i.e. an address has
> tags).
> arm64 went about using a protection flag `PROT_MTE` instead of a
> special system call.

It looks like that memory can be written with a special instruction?
And so it doesn't need to be provisioned by the kernel, as is the case
here.
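
For reference, the provisioning in question is small: a restore token
in the top slot of the new shadow stack. Below is a simplified
userspace model of it, assuming the 64-bit token layout as I understand
it (the token holds the address just above itself, with bit 0 marking
64-bit mode). The helper names are made up, and a plain array stands in
for shadow stack memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_TOKEN_64BIT 0x1ULL	/* mode bit, assumed layout */

/*
 * Write a restore token into the top 8-byte slot of a shadow stack
 * that starts at virtual address 'base' and is 'size' bytes long.
 * Returns the slot index of the token, or -1 on bad alignment.
 */
long sk_write_restore_token(uint64_t *stack, uint64_t base, size_t size)
{
	assert(stack != NULL);

	if (size < 8 || (base % 8) || (size % 8))
		return -1;

	size_t slot = size / 8 - 1;

	/* The token points at the address right above itself. */
	stack[slot] = (base + size) | SK_TOKEN_64BIT;
	return (long)slot;
}

/* The kind of check a pivot to this stack would perform on the token. */
int sk_token_valid(const uint64_t *stack, uint64_t base, size_t slot)
{
	uint64_t token_addr = base + slot * 8;

	return stack[slot] == ((token_addr + 8) | SK_TOKEN_64BIT);
}
```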

>
> Being said that since this patch has gone through multiple revisions
> and I am new to the party. If others dont have issues on this special
> system call,
> I think it's fine then. In case of riscv I can choose to use this
> mechanism or go via arm's route to define PROT_SHADOWSTACK which is
> arch specific.

Ok, sounds good. The advice I got from maintainers after the many
attempts to cram it into the existing interfaces was: don't be afraid
to add a syscall. And sure enough, when MAP_ABOVE4G came along it
continued to make things simpler.

2023-02-24 11:46:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 37/41] selftests/x86: Add shadow stack test

On Thu, Feb 23, 2023 at 05:54:55PM +0000, Edgecombe, Rick P wrote:
> The proposed Makefile solution seems a bit unusual. What about this
> less complicated solution to just make this case work?

I like simple. :)

> So alternatively, why not just always encourage building the headers
> before running the selftests by warning if
> ${abs_srctree}/usr/include/linux is not found?

s/encourage/automate/

Imagine this situation: maintainer says: "please run the selftests".
User says, "uh oh, it fails building, I need to figure that out first."

So we should not have users have to figure out stuff if we can code it
to happen automatically for the default case.

If they want something special, then they can do all the figuring out
they want. :-)

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-02-24 12:20:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 28/41] x86: Introduce userspace API for shadow stack

On Sat, Feb 18, 2023 at 01:14:20PM -0800, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 500b96e71f18..b2b3b7200b2d 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -20,4 +20,10 @@
> #define ARCH_MAP_VDSO_32 0x2002
> #define ARCH_MAP_VDSO_64 0x2003
>
> +/* Don't use 0x3001-0x3004 because of old glibcs */

So where is this all new interface to userspace programs documented? Do
we have an agreement with all the involved parties that this is how
we're going to support shadow stacks and that this is what userspace
should do?

I'd like to avoid one more fiasco with glibc etc here...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-02-24 12:22:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 29/41] x86/shstk: Add user-mode shadow stack support

On Sat, Feb 18, 2023 at 01:14:21PM -0800, Rick Edgecombe wrote:
> Do not support IA32 emulation or x32.

Because? Simplicity?

No one cares about 32-bit?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-02-24 18:26:17

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v6 29/41] x86/shstk: Add user-mode shadow stack support

On Fri, 2023-02-24 at 13:22 +0100, Borislav Petkov wrote:
> On Sat, Feb 18, 2023 at 01:14:21PM -0800, Rick Edgecombe wrote:
> > Do not support IA32 emulation or x32.
>
> Because? Simplicity?
>
> No one cares about 32-bit?

Yea a little of both. Originally shadow stack 32 bit emulation was
supported for 64 bit kernels, but then Andy Lutomirski asked for the
signal ABI to be flexible enough to support alt shadow stacks in the
future. This involved stuffing things on the shadow stack that didn't
work well for 32 bit. Resolving this wasn't exhaustively explored, but
there weren't any obvious things that jumped out.

Then there was the question of how much 32 bit CET apps running on 64
bit kernels would get used. Since shadow stack needs a re-compile this
would only be for newly built 32 bit binaries, not old legacy binaries
that seem to be the thing often using legacy emulation. So it was kind
of not expected to be used much or at all, so any kind of complications
tipped the scales toward dropping it. PeterZ brought up WINE running 32
bit Windows apps, but apparently Windows doesn't support 32 bit CET
either. And there is also the fact that we can always add it later if a
big use case shows up.

I'll add a little more info in the commit log about this.

2023-02-24 18:34:05

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 29/41] x86/shstk: Add user-mode shadow stack support

On Fri, Feb 24, 2023 at 06:25:57PM +0000, Edgecombe, Rick P wrote:
> I'll add a little more info in the commit log about this.

Thanks.

And I can't do anything but agree with the train of thought you guys
are having - 32-bit is a dead horse and we should stop trying to revive
it.

:-)

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-02-24 18:38:28

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v6 28/41] x86: Introduce userspace API for shadow stack

On Fri, 2023-02-24 at 13:20 +0100, Borislav Petkov wrote:
> On Sat, Feb 18, 2023 at 01:14:20PM -0800, Rick Edgecombe wrote:
> > diff --git a/arch/x86/include/uapi/asm/prctl.h
> > b/arch/x86/include/uapi/asm/prctl.h
> > index 500b96e71f18..b2b3b7200b2d 100644
> > --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -20,4 +20,10 @@
> > #define ARCH_MAP_VDSO_32 0x2002
> > #define ARCH_MAP_VDSO_64 0x2003
> >
> > +/* Don't use 0x3001-0x3004 because of old glibcs */
>
> So where is this all new interface to userspace programs documented?

In the first patch:

https://lore.kernel.org/lkml/[email protected]/

Then some more documentation is added about ARCH_SHSTK_UNLOCK and
ARCH_SHSTK_STATUS (which are for CRIU) in those patches.

> Do
> we have an agreement with all the involved parties that this is how
> we're going to support shadow stacks and that this is what userspace
> should do?
>
> I'd like to avoid one more fiasco with glibc etc here...

There are glibc patches prepared by HJ to use the new interface and
it's my understanding that he has discussed the changes with the other
glibc folks. Those glibc patches are used for testing these kernel
patches, but will not go upstream until the kernel patches do, to avoid
repeating the past problems. So I think it's as prepared as it can be.

One future thing that might come up... Glibc has this mode called
"permissive mode". When glibc is configured this way (compile time or
env var), it is supposed to disable shadow stack when dlopen()ing any
DSO that doesn't have the shadow stack elf header bit. The problem is,
it doesn't really work as intended. It only turns off shadow stack for
the calling thread, leaving the other threads free to call into the DSO
with shadow stack enabled. It's not clear yet if they are going to be
able to fix it in userspace. So this prompted some thinking about
whether there could be an additional kernel mode to help glibc in this
scenario. But
for non-permissive mode, glibc is queued up to use the interface as is.
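
A toy model of the permissive mode problem, just to illustrate the
per-thread nature of it. The struct and function names are invented for
this sketch and say nothing about how glibc implements it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Shadow stack enablement is per-thread state. */
struct sk_thread {
	bool shstk_enabled;
};

/*
 * Models the behavior described above: on dlopen() of a DSO without
 * the shadow stack bit, only the calling thread gets it turned off.
 */
void sk_permissive_dlopen(struct sk_thread *threads, size_t n,
			  size_t caller)
{
	assert(threads != NULL && caller < n);
	threads[caller].shstk_enabled = false;
}

/* Threads that can still call into the DSO with shadow stack on. */
size_t sk_threads_still_enforcing(const struct sk_thread *threads,
				  size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (threads[i].shstk_enabled)
			count++;
	return count;
}
```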

2023-02-24 18:40:19

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v6 37/41] selftests/x86: Add shadow stack test

On Fri, 2023-02-24 at 12:45 +0100, Borislav Petkov wrote:
> On Thu, Feb 23, 2023 at 05:54:55PM +0000, Edgecombe, Rick P wrote:
> > The proposed Makefile solution seems a bit unusual. What about this
> > less complicated solution to just make this case work?
>
> I like simple. :)

Done!

>
> > So alternatively, why not just always encourage building the
> > headers
> > before running the selftests by warning if
> > ${abs_srctree}/usr/include/linux is not found?
>
> s/encourage/automate/
>
> Imagine this situation: maintainer says: "please run the selftests".
> User says, "uh oh, it fails building, I need to figure that out
> first."
>
> So we should not have users have to figure out stuff if we can code
> it
> to happen automatically for the default case.
>
> If they want something special, then they can do all the figuring out
> they want. :-)

I guess you could also run into the problem of using old headers. Say
you check out kernel A, build headers, then check out kernel B and
build selftests. You get surprised by using the old headers from A.

Today the selftests don't seem to depend on having a .config or
anything. But autobuilding does, so it's kind of a change to how tied
in the selftests are. But maybe it's how it should be for all of them.
Hmm.

2023-02-28 10:58:54

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v6 28/41] x86: Introduce userspace API for shadow stack

On Fri, Feb 24, 2023 at 06:37:57PM +0000, Edgecombe, Rick P wrote:
> In the first patch:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Then some more documentation is added about ARCH_SHSTK_UNLOCK and
> ARCH_SHSTK_STATUS (which are for CRIU) in those patches.

Right, I was thinking more about ARCH_PRCTL(2), the man page.

But you can send that to the manpages folks later. I.e., it should be
nearly impossible to be missed. :)

> There are glibc patches prepared by HJ to use the new interface and
> it's my understanding that he has discussed the changes with the other
> glibc folks. Those glibc patches are used for testing these kernel
> patches, but will not get upstream until the kernel patches to avoid
> repeating the past problems. So I think it's as prepared as it can be.

Good.

> One future thing that might come up... Glibc has this mode called
> "permissive mode". When glibc is configured this way (compile time or
> env var), it is supposed to disable shadow stack when dlopen()ing any
> DSO that doesn't have the shadow stack elf header bit.

Maybe I don't understand all the possible use cases but if I were
interested in using shadow stack, then I'd enable it for all objects.
And if I want permissive, I'd disable it for all. A mixed thing sounds
like a mixed can of worms waiting to be opened.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-02-28 22:36:18

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v6 28/41] x86: Introduce userspace API for shadow stack

On Tue, 2023-02-28 at 11:58 +0100, Borislav Petkov wrote:
> On Fri, Feb 24, 2023 at 06:37:57PM +0000, Edgecombe, Rick P wrote:
> > In the first patch:
> >
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Then some more documentation is added about ARCH_SHSTK_UNLOCK and
> > ARCH_SHSTK_STATUS (which are for CRIU) in those patches.
>
> Right, I was thinking more about ARCH_PRCTL(2), the man page.
>
> But you can send that to the manpages folks later. I.e., it should be
> nearly impossible to be missed. :)

Sure I can add something. Somewhere I have a man page for
map_shadow_stack written up as well.

>
> > There are glibc patches prepared by HJ to use the new interface and
> > it's my understanding that he has discussed the changes with the
> > other
> > glibc folks. Those glibc patches are used for testing these kernel
> > patches, but will not get upstream until the kernel patches to
> > avoid
> > repeating the past problems. So I think it's as prepared as it can
> > be.
>
> Good.
>
> > One future thing that might come up... Glibc has this mode called
> > "permissive mode". When glibc is configured this way (compile time
> > or
> > env var), it is supposed to disable shadow stack when dlopen()ing
> > any
> > DSO that doesn't have the shadow stack elf header bit.
>
> Maybe I don't understand all the possible use cases but if I were
> interested in using shadow stack, then I'd enable it for all objects.

Enabling for all objects is the ideal, but in practice distros don't
have that.

> And if I want permissive, I'd disable it for all. A mixed thing
> sounds
> like a mixed can of worms waiting to be opened.

It is definitely a can of worms.


2023-03-01 15:39:21

by Deepak Gupta

[permalink] [raw]
Subject: Re: [PATCH v6 11/41] mm: Introduce pte_mkwrite_kernel()

On Sat, Feb 18, 2023 at 01:14:03PM -0800, Rick Edgecombe wrote:
>The x86 Control-flow Enforcement Technology (CET) feature includes a new
>type of memory called shadow stack. This shadow stack memory has some
>unusual properties, which requires some core mm changes to function
>properly.
>
>One of these changes is to allow for pte_mkwrite() to create different
>types of writable memory (the existing conventionally writable type and
>also the new shadow stack type). Future patches will convert pte_mkwrite()
>to take a VMA in order to facilitate this, however there are places in the
>kernel where pte_mkwrite() is called outside of the context of a VMA.
>These are for kernel memory. So create a new variant called
>pte_mkwrite_kernel() and switch the kernel users over to it. Have
>pte_mkwrite() and pte_mkwrite_kernel() be the same for now. Future patches
>will introduce changes to make pte_mkwrite() take a VMA.
>
>Only do this for architectures that need it because they call pte_mkwrite()
>in arch code without an associated VMA. Since it will only currently be
>used in arch code, so do not include it in arch_pgtable_helpers.rst.
>
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Tested-by: Pengfei Xu <[email protected]>
>Suggested-by: David Hildenbrand <[email protected]>
>Signed-off-by: Rick Edgecombe <[email protected]>
>

Acked-by: Deepak Gupta <[email protected]>

2023-03-01 15:41:44

by Deepak Gupta

[permalink] [raw]
Subject: Re: [PATCH v6 13/41] mm: Make pte_mkwrite() take a VMA

On Sat, Feb 18, 2023 at 01:14:05PM -0800, Rick Edgecombe wrote:
>The x86 Control-flow Enforcement Technology (CET) feature includes a new
>type of memory called shadow stack. This shadow stack memory has some
>unusual properties, which requires some core mm changes to function
>properly.
>
>One of these unusual properties is that shadow stack memory is writable,
>but only in limited ways. These limits are applied via a specific PTE
>bit combination. Nevertheless, the memory is writable, and core mm code
>will need to apply the writable permissions in the typical paths that
>call pte_mkwrite().
>
>In addition to VM_WRITE, the shadow stack VMA's will have a flag denoting
>that they are special shadow stack flavor of writable memory. So make
>pte_mkwrite() take a VMA, so that the x86 implementation of it can know to
>create regular writable memory or shadow stack memory.
>
>Apply the same changes for pmd_mkwrite() and huge_pte_mkwrite().
>
>No functional change.
>
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: Michal Simek <[email protected]>
>Cc: Dinh Nguyen <[email protected]>
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Cc: [email protected]
>Tested-by: Pengfei Xu <[email protected]>
>Suggested-by: David Hildenbrand <[email protected]>
>Signed-off-by: Rick Edgecombe <[email protected]>
>

Acked-by: Deepak Gupta <[email protected]>

2023-03-01 15:43:25

by Deepak Gupta

[permalink] [raw]
Subject: Re: [PATCH v6 20/41] x86/mm: Teach pte_mkwrite() about stack memory

On Sat, Feb 18, 2023 at 01:14:12PM -0800, Rick Edgecombe wrote:
>If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
>when it is made writable with pte_mkwrite(), it should create shadow
>stack memory, not conventionally writable memory. Now that pte_mkwrite()
>takes a VMA, and places where shadow stack memory might be created pass
>one, pte_mkwrite() can know when it should do this.
>
>So make pte_mkwrite() create shadow stack memory when the VMA has the
>VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().
>
>This requires referencing VM_SHADOW_STACK in these functions, which are
>currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is
>located) can't be pulled in without causing problems for files that
>reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where
>they can safely reference VM_SHADOW_STACK.
>
>Tested-by: Pengfei Xu <[email protected]>
>Signed-off-by: Rick Edgecombe <[email protected]>
>

Acked-by: Deepak Gupta <[email protected]>