2022-02-01 15:07:45

by Edgecombe, Rick P

Subject: [PATCH 00/35] Shadow stacks for userspace

Hi,

This is a slight reboot of the userspace CET series. I will be taking over the
series from Yu-cheng. Per some internal recommendations, I’ve reset the version
number and am calling it a new series. Hopefully, it doesn’t cause confusion.

The new plan is to upstream only userspace Shadow Stack support at this point.
IBT can follow later, but for now I’ll focus solely on the most in-demand and
widely available (with the feature on AMD CPUs now) part of CET.

I thought as part of this reset, it might be useful to more fully write-up the
design and summarize the history of the previous CET series. So this slightly
long cover letter does that. The "Updates" section has the changes, if anyone
doesn't want the history.


Why is Shadow Stack Wanted
==========================
The main use case for userspace shadow stack is providing protection against
return oriented programming (ROP) attacks. Fedora and Ubuntu already have
many/most packages enabled for shadow stack. The main missing piece is Linux
kernel support, and there seems to be a great deal of interest in the
ecosystem in getting this feature supported. Besides security, Google has
also done some work on using shadow stack to improve the performance and
reliability of tracing.


Userspace Shadow Stack Implementation
=====================================
Shadow stack works by maintaining a secondary (shadow) stack that cannot be
directly modified by applications. When executing a CALL instruction, the
processor pushes the return address to both the normal stack and to the
specially permissioned shadow stack. Upon RET, the processor pops the shadow
stack copy and compares it to the normal stack copy. If the two differ, the
processor raises a control protection fault. This implementation supports
shadow stack on 64-bit kernels only, with 32-bit support available only via
IA32 emulation.

Shadow Stack Memory
-------------------
The majority of this series deals with changes for handling the special
shadow stack memory permissions. This memory is specified by the
Dirty+RO PTE bit combination. A tricky aspect of this is that this
combination was previously used to specify COW memory. So Linux needs
to handle COW differently when shadow stack is in use. The solution is
to use a software PTE bit to denote COW memory, and to take care to
clear the dirty bit when setting the memory RO.
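
As a rough illustration (a minimal sketch, assuming the _PAGE_COW
software bit introduced later in this series; these are not the
series' actual helpers), the two Write=0 cases end up distinguished
like this:

  /*
   * Illustrative only: disambiguating Write=0 PTEs once _PAGE_COW
   * exists.  Shadow stack: Write=0, Dirty=1.  COW: Write=0, Cow=1.
   */
  static inline bool pte_is_shstk(pteval_t val)
  {
          return (val & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
  }

  static inline bool pte_is_cow(pteval_t val)
  {
          return (val & (_PAGE_RW | _PAGE_COW)) == _PAGE_COW;
  }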

Setup and Upkeep of HW Registers
--------------------------------
Using userspace CET requires a CR4 bit set, and also the manipulation
of two xsave managed MSRs. The kernel needs to modify these registers
during various operations like clone and signal handling. These
operations may happen when the registers are restored to the CPU, or
saved in an xsave buffer. Since the recent AMX triggered FPU overhaul
removed direct access to the xsave buffer, this series adds an
interface to operate on the supervisor xstate.

New ABIs
--------
This series introduces some new ABIs. The primary one is the shadow
stack itself. Since it is readable and the shadow stack pointer is
exposed to user space, applications can easily read and process the
shadow stack, and in fact the tracing use cases plan to do exactly
that.

Most of the shadow stack contents are written by HW, but some of the
entries are added by the kernel. The main place for this is signals. As
part of handling a signal, the kernel does some manual adjustment of
the shadow stack that userspace depends on.

In addition to the contents of the shadow stack there is also user
visible behavior around when new shadow stacks are created and set in
the shadow stack pointer (SSP) register. This is relatively
straightforward – shadow stacks are created when new stacks are created
(thread creation, fork, etc). It is more or less what is required to
keep apps working.

For situations when userspace creates a new stack (i.e. makecontext(),
fibers, etc), a new syscall is provided for creating shadow stack
memory. To make the shadow stack usable, it needs to have a restore
token written to the protected memory. So the syscall provides a way to
specify that this should be done by the kernel.
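
For illustration, a userspace allocation might look roughly like the
sketch below. The syscall number and the SHADOW_STACK_SET_TOKEN flag
name are assumptions modeled on the map_shadow_stack patch in this
series; see that patch for the real definitions.

  #include <stddef.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define __NR_map_shadow_stack  451              /* assumed number */
  #define SHADOW_STACK_SET_TOKEN (1ULL << 0)      /* assumed flag */

  /* Ask the kernel for a new shadow stack with a restore token
   * written at the top, ready to be pivoted to (e.g. via RSTORSSP). */
  static void *alloc_shstk(size_t size)
  {
          return (void *)syscall(__NR_map_shadow_stack, size,
                                 SHADOW_STACK_SET_TOKEN);
  }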

When a shadow stack violation happens (i.e. the return address on the
stack does not match the return address on the shadow stack), a
segfault is generated with a new si_code specific to CET violations.
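
For example, a process could distinguish such a violation in its
SIGSEGV handler (a minimal sketch; SEGV_CPERR and its value come from
the siginfo change later in this series):

  #include <signal.h>
  #include <unistd.h>

  #ifndef SEGV_CPERR
  #define SEGV_CPERR 10   /* from this series' siginfo.h change */
  #endif

  static void handler(int sig, siginfo_t *info, void *uc)
  {
          if (info->si_code == SEGV_CPERR)
                  write(2, "control protection fault\n", 25);
          _exit(1);
  }

  static void install_handler(void)
  {
          struct sigaction sa = { 0 };

          sa.sa_sigaction = handler;
          sa.sa_flags = SA_SIGINFO;
          sigaction(SIGSEGV, &sa, NULL);
  }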

Lastly, a new arch_prctl interface is created for controlling the
enablement of CET-like features. It is intended to also be used for
LAM. It operates on the feature status per-thread, so for process-wide
enabling it is intended to be used early, in things like the dynamic
linker/loader. However, it can be used later for per-thread enablement
of features like WRSS.
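
As a sketch, enablement could look like the fragment below. The
ARCH_X86_FEATURE_ENABLE name appears later in this letter, but its
value and the LINUX_X86_FEATURE_SHSTK constant are assumptions; see
the arch_prctl patch in this series for the real interface.

  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define ARCH_X86_FEATURE_ENABLE 0x3001          /* assumed value */
  #define LINUX_X86_FEATURE_SHSTK (1 << 1)        /* assumed value */

  int main(void)
  {
          /* Enable early: after this, a RET past this stack frame
           * would not find a matching shadow stack entry and would
           * fault, which is why glibc must do this at startup. */
          if (syscall(SYS_arch_prctl, ARCH_X86_FEATURE_ENABLE,
                      LINUX_X86_FEATURE_SHSTK))
                  return 1;

          /* ... runs with shadow stack protection ... */
          exit(0);        /* never return past the enable point */
  }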

WRSS
----
WRSS is an instruction that can write to shadow stacks. The HW provides
a way to enable this instruction for userspace use. Since shadow stacks
are initially created protected, enabling WRSS gives any apps that want
to do unusual things with their stacks a way to weaken the protection
and make things more flexible. A new feature bit is defined to control
enabling/disabling of WRSS.
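
For example, once enabled, userspace could write a shadow stack slot
directly (a sketch; assumes ssp points into a shadow stack mapping and
that the assembler knows the WRSSQ mnemonic):

  /* Write val to the shadow stack slot at ssp using WRSS. */
  static inline void wrss(unsigned long *ssp, unsigned long val)
  {
          asm volatile ("wrssq %1, %0" : "+m" (*ssp) : "r" (val));
  }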


History
=======
The branding “CET” really consists of two features: “Shadow Stack” and
“Indirect Branch Tracking”. They both restrict previously allowed, but rarely
valid behaviors and require userspace to change to avoid these behaviors before
enabling the protection. These raw HW features need to be assembled into a
software solution across userspace and kernel in order to add security value.
The kernel part of this solution has evolved iteratively starting with a lengthy
RFC period.

Until now, the enabling effort was trying to support both Shadow Stack and IBT.
This history will focus on a few areas of the shadow stack development history
that I thought stood out.

Signals
-------
Originally, signals placed the location of the shadow stack restore
token inside the saved state on the stack. This was problematic with
respect to past ABI promises. So the restore location was instead just
assumed from the shadow stack pointer. This works because in the
normal, allowed cases of calling sigreturn, the shadow stack pointer
should be right at the restore token at that time. There is no
alternate shadow stack support. If an alt shadow stack is added later,
we would need to find a place to store the regular shadow stack token
location. Options could be to push something on the alt shadow stack,
or to keep something on the kernel side. So the current design keeps
things simple, while slightly kicking the can down the road if alt
shadow stacks become a thing later. siglongjmp() is handled in glibc,
using the INCSSP instruction to unwind the shadow stack over the token.

Shadow Stack Allocation
-----------------------
makecontext() implementations need a way to create new shadow stacks
with restore tokens, such that they can be pivoted to from userspace.
The first interface to do this was an arch_prctl(). It created a shadow
stack with a restore token pre-setup, since the kernel has an
instruction that can write to user shadow stacks. However, this
interface was abandoned for being strange.

The next version created PROT_SHADOW_STACK. This interface had two
problems. First, it left userspace no option but to create writable
memory, write a restore token, then mprotect() it PROT_SHADOW_STACK.
The writable window left the shadow stack exposed, weakening the
security. Second, it caused problems with the guard pages. Since the
memory was initially created writable, it did not have a guard page,
but was then mprotected to a type of memory that should have one. This
resulted in missing guard pages and confused rb_subtree_gap values.

This version introduces a new syscall that behaves similarly to the
initial arch_prctl() interface in that it has the kernel write the
restore token.

Enabling Interface
------------------
For the entire history of the original CET series, the design was to
enable shadow stack automatically if the feature bit was detected in
the elf header. Then it was userspace’s responsibility to turn it off
via an arch_prctl() if it was not desired, and this was handled by the
glibc dynamic loader. Glibc's standard behavior (when CET is configured)
is to leave shadow stack enabled if the executable and all linked
libraries are marked with shadow stack.

Many distros (Fedora and others) have binaries already marked with
shadow stack, waiting for kernel support. Unfortunately their glibc
binaries expect the original arch_prctl() interface for allocating
shadow stacks, as those changes were pushed ahead of kernel support.
The net result of it all is that, when updating to a kernel with shadow
stack support, these binaries would suddenly get shadow stack enabled
and expect the arch_prctl() interface to be there. Calls to
makecontext() would then fail, resulting in visible breakage. This
series deals with this problem as described below in "Updates".


Updates
=======
These updates were mostly driven by public comments, but a lot of the design
elements are new. I would like some extra scrutiny on the updates.

New syscall for Shadow Stack Allocation
---------------------------------------
A new syscall is added for allocating shadow stacks, to replace
PROT_SHADOW_STACK. Several options were considered, as described in the
patch "x86/cet/shstk: Introduce map_shadow_stack syscall".

Xsave Managed Supervisor State Modifications
--------------------------------------------
The shadow stack feature requires the kernel to modify xsaves-managed
state. On one of the last versions of Yu-cheng's series, Boris
commented that the pattern used to do this was not necessarily ideal:
it forced a restore to the registers and always did the modification
there. Then Thomas did an overhaul of the fpu code, part of which made
raw access to the xsave buffer private to the fpu code. So this series
tries to expose access again, in a way that addresses Boris' comments.

The method is to provide functions like wrmsrl/rdmsrl, but that can
direct the operation to the correct location (registers or buffer),
while giving the proper notice to the fpu subsystem so things don't get
clobbered or corrupted.
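
Schematically, a user of the interface looks like the sketch below.
The helper names are illustrative, modeled on the "x86/fpu: Add
helpers for modifying supervisor xstate" patch in this series; see
that patch for the real signatures.

  /* Illustrative kernel fragment; helper names are assumptions. */
  static void example_adjust_user_ssp(void)
  {
          void *state;
          u64 ssp;

          /* Pin the task's FPU state wherever it currently lives. */
          state = start_update_xsave_msrs(XFEATURE_CET_USER);

          /* Reads/writes get routed to the registers or the xsave
           * buffer as appropriate, with notice to the FPU subsystem. */
          xsave_rdmsrl(state, MSR_IA32_PL3_SSP, &ssp);
          xsave_wrmsrl(state, MSR_IA32_PL3_SSP, ssp - 8);

          end_update_xsave_msrs();
  }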

In the past a solution like this was discussed as part of the PASID
series, and Thomas was not in favor. In CET's case there is more logic
around the CET MSRs than in PASID's, and wrapping this logic minimizes
the near-identical open coded logic that would otherwise be needed to
do this efficiently. In addition, it resolves the above described
problem of having no access to the xsave buffer. So it is being put
forward here under the supposition that CET's usage may lead to a
different conclusion, not to try to ignore past direction.

The user interrupt series has needs similar to CET's, and will also use
this internal interface if it's found acceptable.

Support for WRSS
----------------
Andy Lutomirski had asked that, if we change the shadow stack
allocation API such that userspace cannot create arbitrary shadow
stacks, we look at exposing an interface to enable the WRSS instruction
for userspace. This way apps that want to do unexpected things with
shadow stacks would still have the option to create shadow stacks with
arbitrary data.

Switch Enabling Interface
-------------------------
As described above, there is a problem of userspace binaries waiting
to break as soon as the kernel supports CET. This needs to be prevented
by changing the interface such that the old binaries will not enable
shadow stack AND will behave as if shadow stack is not enabled. They
should run normally, just without shadow stack protection. Creating a
new feature (SHSTK2) for shadow stack was explored: SHSTK would never
be supported by the kernel, and all the userspace build tools would be
updated to target SHSTK2 instead of SHSTK, so old SHSTK binaries would
be cleanly disabled.

But automatic elf-header-based enabling has downsides of its own. The
elf header feature spec is not defined by the kernel, and there are
proposals to expand it to describe additional logic. A simpler
interface, where the kernel is simply told what to enable and all the
decision making is left to userspace, is more flexible for userspace
and simpler for the kernel. There also already needs to be an
ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and likely LAM will use
it too), so this avoids having two ways to turn on these types of
features. The only tricky part for shadow stack is that it has to be
enabled very early: wherever shadow stack is enabled, the app cannot
return past that point, otherwise there will be a shadow stack
violation. It turns out glibc can enable shadow stack this early, so it
works nicely. So not automatically enabling any features from the elf
header cleanly disables all old binaries, which expect the kernel to
enable CET features automatically. Then, after the kernel changes are
upstream, glibc can be updated to use the new interface. This is the
solution implemented in this series.

Expand Commit Logs
------------------
As part of spinning up on this series, I found some of the commit logs
did not describe the changes in enough detail for me to understand
their purpose. I tried to expand the logs and comments where I had to
go digging. Hopefully it's useful.

Limit to only Intel Processors
------------------------------
Shadow stack is supported on some AMD processors, but this revision
(with expanded HW usage and xsaves changes) has only been tested on
Intel ones. So this series has a patch to limit shadow stack support to
Intel processors. Ideally the patch would not even make it to mainline,
and should be dropped as soon as this testing is done. It's included
just in case.


Future Work
===========
Even though this is now exclusively a shadow stack series, there is still some
remaining shadow stack work to be done.

Ptrace
------
Early in the series, there was a patch to allow IA32_U_CET and
IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
up to basic support, and it remains the plan. It will be needed for
in-progress gdb support.

CRIU Support
------------
In the past there was some speculation on the mailing list about
whether CRIU would need to be taught about CET. It turns out it does.
The first issue hit is that CRIU calls sigreturn directly from the
"parasite code" it injects into the dumper process. This violates the
protection this shadow stack implementation intends to provide against
attackers doing exactly that.

With so many packages already enabled with shadow stack, there is
probably desire to make it work seamlessly. But in the meantime, if
distros want to support shadow stack and CRIU, users could manually
disable shadow stack via "GLIBC_TUNABLES=glibc.cpu.x86_shstk=off" for
a process they want to dump. It's not ideal.

I’d like to hear what people think about having shadow stack in the
kernel without this resolved. Nothing would change for any users until
they enable shadow stack in the kernel and update to a glibc configured
with CET. Should CRIU userspace be solved before kernel support?

Selftests
---------
There are some CET selftests being worked on and they are not included
here.

Thanks,

Rick

Rick Edgecombe (7):
x86/mm: Prevent VM_WRITE shadow stacks
x86/fpu: Add helpers for modifying supervisor xstate
x86/fpu: Add unsafe xsave buffer helpers
x86/cet/shstk: Introduce map_shadow_stack syscall
selftests/x86: Add map_shadow_stack syscall test
x86/cet/shstk: Support wrss for userspace
x86/cpufeatures: Limit shadow stack to Intel CPUs

Yu-cheng Yu (28):
Documentation/x86: Add CET description
x86/cet/shstk: Add Kconfig option for Shadow Stack
x86/cpufeatures: Add CET CPU feature flags for Control-flow
Enforcement Technology (CET)
x86/cpufeatures: Introduce CPU setup and option parsing for CET
x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
x86/cet: Add control-protection fault handler
x86/mm: Remove _PAGE_DIRTY from kernel RO pages
x86/mm: Move pmd_write(), pud_write() up in the file
x86/mm: Introduce _PAGE_COW
drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
x86/mm: Update pte_modify for _PAGE_COW
x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
transition from _PAGE_DIRTY to _PAGE_COW
mm: Move VM_UFFD_MINOR_BIT from 37 to 38
mm: Introduce VM_SHADOW_STACK for shadow stack memory
x86/mm: Check Shadow Stack page fault errors
x86/mm: Update maybe_mkwrite() for shadow stack
mm: Fixup places that call pte_mkwrite() directly
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm: Update can_follow_write_pte() for shadow stack
mm/mprotect: Exclude shadow stack from preserve_write
mm: Re-introduce vm_flags to do_mmap()
x86/cet/shstk: Add user-mode shadow stack support
x86/process: Change copy_thread() argument 'arg' to 'stack_size'
x86/cet/shstk: Handle thread shadow stack
x86/cet/shstk: Introduce shadow stack token setup/verify routines
x86/cet/shstk: Handle signals for shadow stack
x86/cet/shstk: Add arch_prctl elf feature functions

.../admin-guide/kernel-parameters.txt | 4 +
Documentation/filesystems/proc.rst | 1 +
Documentation/x86/cet.rst | 145 ++++++
Documentation/x86/index.rst | 1 +
arch/arm/kernel/signal.c | 2 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/Kconfig | 22 +
arch/x86/Kconfig.assembler | 5 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/ia32/ia32_signal.c | 25 +-
arch/x86/include/asm/cet.h | 54 +++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/fpu/api.h | 8 +
arch/x86/include/asm/fpu/types.h | 23 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/mman.h | 24 +
arch/x86/include/asm/mmu_context.h | 2 +
arch/x86/include/asm/msr-index.h | 20 +
arch/x86/include/asm/page_types.h | 7 +
arch/x86/include/asm/pgtable.h | 302 ++++++++++--
arch/x86/include/asm/pgtable_types.h | 48 +-
arch/x86/include/asm/processor.h | 6 +
arch/x86/include/asm/special_insns.h | 30 ++
arch/x86/include/asm/trap_pf.h | 2 +
arch/x86/include/uapi/asm/mman.h | 8 +-
arch/x86/include/uapi/asm/prctl.h | 10 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/cpu/common.c | 20 +
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/elf_feature_prctl.c | 72 +++
arch/x86/kernel/fpu/xstate.c | 167 ++++++-
arch/x86/kernel/idt.c | 4 +
arch/x86/kernel/process.c | 17 +-
arch/x86/kernel/process_64.c | 2 +
arch/x86/kernel/shstk.c | 446 ++++++++++++++++++
arch/x86/kernel/signal.c | 13 +
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 62 +++
arch/x86/mm/fault.c | 19 +
arch/x86/mm/mmap.c | 48 ++
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 25 +
drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
fs/aio.c | 2 +-
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 19 +-
include/linux/pgtable.h | 8 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/asm-generic/unistd.h | 2 +-
ipc/shm.c | 2 +-
kernel/sys_ni.c | 1 +
mm/gup.c | 16 +-
mm/huge_memory.c | 27 +-
mm/memory.c | 5 +-
mm/migrate.c | 3 +-
mm/mmap.c | 15 +-
mm/mprotect.c | 9 +-
mm/nommu.c | 4 +-
mm/util.c | 2 +-
tools/testing/selftests/x86/Makefile | 9 +-
.../selftests/x86/test_map_shadow_stack.c | 75 +++
69 files changed, 1797 insertions(+), 92 deletions(-)
create mode 100644 Documentation/x86/cet.rst
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/include/asm/mman.h
create mode 100644 arch/x86/kernel/elf_feature_prctl.c
create mode 100644 arch/x86/kernel/shstk.c
create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c


base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
--
2.17.1


2022-02-01 15:07:50

by Edgecombe, Rick P

Subject: [PATCH 02/35] x86/cet/shstk: Add Kconfig option for Shadow Stack

From: Yu-cheng Yu <[email protected]>

Shadow Stack provides protection against function return address
corruption. It is active when the processor supports it, the kernel has
CONFIG_X86_SHADOW_STACK enabled, and the application is built for the
feature. This is only implemented for the 64-bit kernel. When it is
enabled, legacy non-Shadow Stack applications continue to work, but without
protection.

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

Yu-cheng v25:
- Remove X86_CET and use X86_SHADOW_STACK directly.

Yu-cheng v24:
- Update for the splitting X86_CET to X86_SHADOW_STACK and X86_IBT.

arch/x86/Kconfig | 22 ++++++++++++++++++++++
arch/x86/Kconfig.assembler | 5 +++++
2 files changed, 27 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ebe8fc76949a..b9efa0fd906d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,6 +26,7 @@ config X86_64
depends on 64BIT
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
+ select ARCH_HAS_SHADOW_STACK
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
@@ -1940,6 +1941,27 @@ config X86_SGX

If unsure, say N.

+config ARCH_HAS_SHADOW_STACK
+ def_bool n
+
+config X86_SHADOW_STACK
+ prompt "Intel Shadow Stack"
+ def_bool n
+ depends on AS_WRUSS
+ depends on ARCH_HAS_SHADOW_STACK
+ select ARCH_USES_HIGH_VMA_FLAGS
+ help
+ Shadow Stack protection is a hardware feature that detects function
+ return address corruption. This helps mitigate ROP attacks.
+ Applications must be enabled to use it, and old userspace does not
+ get protection "for free".
+ Support for this feature is present on Tiger Lake family of
+ processors released in 2020 or later. Enabling this feature
+ increases kernel text size by 3.7 KB.
+ See Documentation/x86/intel_cet.rst for more information.
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 26b8c08e2fc4..00c79dd93651 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -19,3 +19,8 @@ config AS_TPAUSE
def_bool $(as-instr,tpause %ecx)
help
Supported by binutils >= 2.31.1 and LLVM integrated assembler >= V7
+
+config AS_WRUSS
+ def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
+ help
+ Supported by binutils >= 2.31 and LLVM integrated assembler
--
2.17.1

2022-02-01 15:07:54

by Edgecombe, Rick P

Subject: [PATCH 03/35] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)

From: Yu-cheng Yu <[email protected]>

Add CPU feature flags for Control-flow Enforcement Technology (CET).

CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect Branch Tracking

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

v1:
- Remove IBT, can be added in a follow on IBT series.

Yu-cheng v25:
- Make X86_FEATURE_IBT depend on X86_FEATURE_SHSTK.

Yu-cheng v24:
- Update for splitting CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK and
CONFIG_X86_IBT.
- Move DISABLE_IBT definition to the IBT series.

arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6db4e2932b3d..c3eb94b13fef 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -355,6 +355,7 @@
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..b7728f7afb2b 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -65,6 +65,12 @@
# define DISABLE_SGX (1 << (X86_FEATURE_SGX & 31))
#endif

+#ifdef CONFIG_X86_SHADOW_STACK
+#define DISABLE_SHSTK 0
+#else
+#define DISABLE_SHSTK (1 << (X86_FEATURE_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -85,7 +91,7 @@
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
- DISABLE_ENQCMD)
+ DISABLE_ENQCMD|DISABLE_SHSTK)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
#define DISABLED_MASK19 0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..bf1b55a1ba21 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -78,6 +78,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
{}
};

--
2.17.1

2022-02-01 15:07:59

by Edgecombe, Rick P

Subject: [PATCH 04/35] x86/cpufeatures: Introduce CPU setup and option parsing for CET

From: Yu-cheng Yu <[email protected]>

Introduce CPU setup and boot option parsing for CET features.

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>

---

v1:
- Moved kernel-parameters.txt changes here from patch 1.

Yu-cheng v25:
- Remove software-defined X86_FEATURE_CET.

Yu-cheng v24:
- Update #ifdef placement to reflect Kconfig changes of splitting shadow stack
and ibt.

Documentation/admin-guide/kernel-parameters.txt | 4 ++++
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/cpu/common.c | 12 ++++++++++++
3 files changed, 18 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f5a27f067db9..6c5456c56dbf 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3389,6 +3389,10 @@
noexec=on: enable non-executable mappings (default)
noexec=off: disable non-executable mappings

+ no_user_shstk [X86-64] Disable Shadow Stack for user-mode
+ applications. Disabling shadow stack also disables
+ IBT.
+
nosmap [X86,PPC]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..a8df907e8017 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
#define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
#define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */
#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_CET_BIT 23 /* enable Control-flow Enforcement */
+#define X86_CR4_CET _BITUL(X86_CR4_CET_BIT)

/*
* x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7b8382c11788..9ee339f5b8ca 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -515,6 +515,14 @@ static __init int setup_disable_pku(char *arg)
__setup("nopku", setup_disable_pku);
#endif /* CONFIG_X86_64 */

+static __always_inline void setup_cet(struct cpuinfo_x86 *c)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return;
+
+ cr4_set_bits(X86_CR4_CET);
+}
+
/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
@@ -1261,6 +1269,9 @@ static void __init cpu_parse_early_param(void)
if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
setup_clear_cpu_cap(X86_FEATURE_XSAVES);

+ if (cmdline_find_option_bool(boot_command_line, "no_user_shstk"))
+ setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+
arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
if (arglen <= 0)
return;
@@ -1632,6 +1643,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)

x86_init_rdrand(c);
setup_pku(c);
+ setup_cet(c);

/*
* Clear/Set all flags overridden by options, need do it
--
2.17.1

2022-02-01 15:08:04

by Edgecombe, Rick P

Subject: [PATCH 05/35] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states

From: Yu-cheng Yu <[email protected]>

Control-flow Enforcement Technology (CET) introduces these MSRs:

MSR_IA32_U_CET (user-mode CET settings),
MSR_IA32_PL3_SSP (user-mode shadow stack pointer),

MSR_IA32_PL0_SSP (kernel-mode shadow stack pointer),
MSR_IA32_PL1_SSP (Privilege Level 1 shadow stack pointer),
MSR_IA32_PL2_SSP (Privilege Level 2 shadow stack pointer),
MSR_IA32_S_CET (kernel-mode CET settings),
MSR_IA32_INT_SSP_TAB (exception shadow stack table).

The two user-mode MSRs belong to XFEATURE_CET_USER. The first three
kernel-mode MSRs belong to XFEATURE_CET_KERNEL. Both XSAVES states are
supervisor states. This means that there is no direct, unprivileged access
to these states, making it harder for an attacker to subvert CET.

For future ptrace() support, shadow stack addresses and MSR reserved bits
are checked before being written to the supervisor states.

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

v1:
- Remove outdated reference to sigreturn checks on msr's.

Yu-cheng v29:
- Move CET MSR definition up in msr-index.h.

Yu-cheng v28:
- Add XFEATURE_MASK_CET_USER to XFEATURES_INIT_FPSTATE_HANDLED.

Yu-cheng v25:
- Update xsave_cpuid_features[]. Now CET XSAVES features depend on
X86_FEATURE_SHSTK (vs. the software-defined X86_FEATURE_CET).

arch/x86/include/asm/fpu/types.h | 23 +++++++++++++++++++++--
arch/x86/include/asm/fpu/xstate.h | 6 ++++--
arch/x86/include/asm/msr-index.h | 20 ++++++++++++++++++++
arch/x86/kernel/fpu/xstate.c | 13 ++++++++++++-
4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb7cd1139d97..e2b21197661c 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -115,8 +115,8 @@ enum xfeature {
XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
XFEATURE_PKRU,
XFEATURE_PASID,
- XFEATURE_RSRVD_COMP_11,
- XFEATURE_RSRVD_COMP_12,
+ XFEATURE_CET_USER,
+ XFEATURE_CET_KERNEL,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -138,6 +138,8 @@ enum xfeature {
#define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
+#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -252,6 +254,23 @@ struct pkru_state {
u32 pad;
} __packed;

+/*
+ * State component 11 is Control-flow Enforcement user states
+ */
+struct cet_user_state {
+ u64 user_cet; /* user control-flow settings */
+ u64 user_ssp; /* user shadow stack pointer */
+};
+
+/*
+ * State component 12 is Control-flow Enforcement kernel states
+ */
+struct cet_kernel_state {
+ u64 kernel_ssp; /* kernel shadow stack */
+ u64 pl1_ssp; /* privilege level 1 shadow stack */
+ u64 pl2_ssp; /* privilege level 2 shadow stack */
+};
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..d4427b88ee12 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -50,7 +50,8 @@
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA

/* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_CET_USER)

/*
* A supervisor state component may not always contain valuable information,
@@ -77,7 +78,8 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
+ XFEATURE_MASK_CET_KERNEL)

/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3faf0f97edb1..0ee77ce4c753 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -362,6 +362,26 @@


#define MSR_CORE_PERF_LIMIT_REASONS 0x00000690
+
+/* Control-flow Enforcement Technology MSRs */
+#define MSR_IA32_U_CET 0x000006a0 /* user mode cet setting */
+#define MSR_IA32_S_CET 0x000006a2 /* kernel mode cet setting */
+#define CET_SHSTK_EN BIT_ULL(0)
+#define CET_WRSS_EN BIT_ULL(1)
+#define CET_ENDBR_EN BIT_ULL(2)
+#define CET_LEG_IW_EN BIT_ULL(3)
+#define CET_NO_TRACK_EN BIT_ULL(4)
+#define CET_SUPPRESS_DISABLE BIT_ULL(5)
+#define CET_RESERVED (BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
+#define CET_SUPPRESS BIT_ULL(10)
+#define CET_WAIT_ENDBR BIT_ULL(11)
+
+#define MSR_IA32_PL0_SSP 0x000006a4 /* kernel shadow stack pointer */
+#define MSR_IA32_PL1_SSP 0x000006a5 /* ring-1 shadow stack pointer */
+#define MSR_IA32_PL2_SSP 0x000006a6 /* ring-2 shadow stack pointer */
+#define MSR_IA32_PL3_SSP 0x000006a7 /* user shadow stack pointer */
+#define MSR_IA32_INT_SSP_TAB 0x000006a8 /* exception shadow stack table */
+
#define MSR_GFX_PERF_LIMIT_REASONS 0x000006B0
#define MSR_RING_PERF_LIMIT_REASONS 0x000006B1

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 02b3ddaf4f75..44397202762b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -50,6 +50,8 @@ static const char *xfeature_names[] =
"Processor Trace (unused)" ,
"Protection Keys User registers",
"PASID state",
+ "Control-flow User registers" ,
+ "Control-flow Kernel registers" ,
"unknown xstate feature" ,
"unknown xstate feature" ,
"unknown xstate feature" ,
@@ -73,6 +75,8 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
[XFEATURE_PKRU] = X86_FEATURE_PKU,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
};
@@ -250,6 +254,8 @@ static void __init print_xstate_features(void)
print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
print_xstate_feature(XFEATURE_MASK_PKRU);
print_xstate_feature(XFEATURE_MASK_PASID);
+ print_xstate_feature(XFEATURE_MASK_CET_USER);
+ print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
}
@@ -405,6 +411,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDREGS | \
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_CET_USER | \
XFEATURE_MASK_XTILE)

/*
@@ -621,6 +628,8 @@ static bool __init check_xstate_against_struct(int nr)
XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+ XCHECK_SZ(sz, nr, XFEATURE_CET_USER, struct cet_user_state);
+ XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);

/* The tile data size varies between implementations. */
if (nr == XFEATURE_XTILE_DATA)
@@ -634,7 +643,9 @@ static bool __init check_xstate_against_struct(int nr)
if ((nr < XFEATURE_YMM) ||
(nr >= XFEATURE_MAX) ||
(nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
- ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
+ (nr == XFEATURE_RSRVD_COMP_13) ||
+ (nr == XFEATURE_RSRVD_COMP_14) ||
+ (nr == XFEATURE_RSRVD_COMP_16)) {
WARN_ONCE(1, "no structure for xstate: %d\n", nr);
XSTATE_WARN_ON(1);
return false;
--
2.17.1

2022-02-01 15:08:06

by Edgecombe, Rick P

Subject: [PATCH 07/35] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

From: Yu-cheng Yu <[email protected]>

The x86 family of processors does not directly create read-only and Dirty
PTEs. These PTEs are created by software. One such case is that kernel
read-only pages are historically set up as Dirty.

New processors that support Shadow Stack regard read-only and Dirty PTEs as
shadow stack pages. This results in ambiguity between shadow stack and
kernel read-only pages. To resolve this, remove Dirty from kernel read-
only pages.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 6 +++---
arch/x86/mm/pat/set_memory.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..3781a79b6388 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -190,10 +190,10 @@ enum page_cache_mode {
#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
#define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0)
#define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC)
-#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G)
-#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0|___D| 0|___G)
+#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G)
+#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G)
#define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC)
-#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G)
+#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G)
#define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G)
#define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G)
#define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index b4072115c8ef..844bb30280b7 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1943,7 +1943,7 @@ int set_memory_nx(unsigned long addr, int numpages)

int set_memory_ro(unsigned long addr, int numpages)
{
- return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+ return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
}

int set_memory_rw(unsigned long addr, int numpages)
--
2.17.1

2022-02-01 15:08:10

by Edgecombe, Rick P

Subject: [PATCH 11/35] x86/mm: Update pte_modify for _PAGE_COW

From: Yu-cheng Yu <[email protected]>

The read-only and Dirty PTE has been used to indicate copy-on-write pages.
However, newer x86 processors also regard a read-only and Dirty PTE as a
shadow stack page. In order to separate the two, the software-defined
_PAGE_COW is created to replace _PAGE_DIRTY for the copy-on-write case, and
pte_*() are updated.

pte_modify() changes a PTE to 'newprot', but it doesn't use the pte_*()
helpers. Introduce fixup_dirty_pte(), which sets a dirty PTE, based on
_PAGE_RW, to either _PAGE_DIRTY or _PAGE_COW.

Apply the same changes to pmd_modify().

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/pgtable.h | 37 ++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a4a75e78a934..5c3886f6ccda 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -773,6 +773,23 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)

static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);

+static inline pteval_t fixup_dirty_pte(pteval_t pteval)
+{
+ pte_t pte = __pte(pteval);
+
+ /*
+ * Fix up potential shadow stack page flags because the RO, Dirty
+ * PTE is special.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pte_dirty(pte)) {
+ pte = pte_mkclean(pte);
+ pte = pte_mkdirty(pte);
+ }
+ }
+ return pte_val(pte);
+}
+
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
pteval_t val = pte_val(pte), oldval = val;
@@ -783,16 +800,36 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
*/
val &= _PAGE_CHG_MASK;
val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+ val = fixup_dirty_pte(val);
val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
return __pte(val);
}

+static inline int pmd_write(pmd_t pmd);
+static inline pmdval_t fixup_dirty_pmd(pmdval_t pmdval)
+{
+ pmd_t pmd = __pmd(pmdval);
+
+ /*
+ * Fix up potential shadow stack page flags because the RO, Dirty
+ * PMD is special.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pmd_dirty(pmd)) {
+ pmd = pmd_mkclean(pmd);
+ pmd = pmd_mkdirty(pmd);
+ }
+ }
+ return pmd_val(pmd);
+}
+
static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
{
pmdval_t val = pmd_val(pmd), oldval = val;

val &= _HPAGE_CHG_MASK;
val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+ val = fixup_dirty_pmd(val);
val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
return __pmd(val);
}
--
2.17.1

2022-02-01 15:08:19

by Edgecombe, Rick P

Subject: [PATCH 06/35] x86/cet: Add control-protection fault handler

From: Yu-cheng Yu <[email protected]>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works in a similar way to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Kerrisk <[email protected]>
---

v1:
- Update static asserts for NSIGSEGV

Yu-cheng v29:
- Remove pr_emerg() since it is followed by die().
- Change boot_cpu_has() to cpu_feature_enabled().

Yu-cheng v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.

arch/arm/kernel/signal.c | 2 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/kernel/idt.c | 4 ++
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 62 ++++++++++++++++++++++++++++++
include/uapi/asm-generic/siginfo.h | 3 +-
10 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index c532a6041066..59aaadce9d52 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index d8aaf4b6f432..d2da57c415b8 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -983,7 +983,7 @@ void __init minsigstksz_setup(void)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index d984282b979f..8776a34c6444 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index 6cc124a3bb98..dc50b2a78692 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -752,7 +752,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 2a78d2af1265..7fe2bd37bd1a 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..a90791433152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS, exc_stack_segment);
DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP, exc_general_protection);
DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC, exc_alignment_check);

+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
/* Raw exception entries which need extra work */
DECLARE_IDTENTRY_RAW(X86_TRAP_UD, exc_invalid_op);
DECLARE_IDTENTRY_RAW(X86_TRAP_BP, exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..9f1bdaabc246 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -113,6 +113,10 @@ static const __initconst struct idt_data def_idts[] = {
#elif defined(CONFIG_X86_32)
SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32),
#endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+ INTG(X86_TRAP_CP, asm_exc_control_protection),
+#endif
};

/*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index b52407c56000..ff50cd978ea5 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
*/
BUILD_BUG_ON(NSIGILL != 11);
BUILD_BUG_ON(NSIGFPE != 15);
- BUILD_BUG_ON(NSIGSEGV != 9);
+ BUILD_BUG_ON(NSIGSEGV != 10);
BUILD_BUG_ON(NSIGBUS != 5);
BUILD_BUG_ON(NSIGTRAP != 6);
BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c9d566dcf89a..54b7a146fd5e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
#include <linux/io.h>
#include <linux/hardirq.h>
#include <linux/atomic.h>
+#include <linux/nospec.h>

#include <asm/stacktrace.h>
#include <asm/processor.h>
@@ -641,6 +642,67 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
cond_local_irq_disable(regs);
}

+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+ "unknown",
+ "near-ret",
+ "far-ret/iret",
+ "endbranch",
+ "rstorssp",
+ "setssbsy",
+ "unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application. Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+ struct task_struct *tsk;
+
+ if (!user_mode(regs)) {
+ die("kernel control protection fault", regs, error_code);
+ panic("Unexpected kernel control protection fault. Machine halted.");
+ }
+
+ cond_local_irq_enable(regs);
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+ tsk = current;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_CP;
+
+ /*
+ * Ratelimit to prevent log spamming.
+ */
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ __ratelimit(&cpf_rate)) {
+ unsigned long ssp;
+ int cpf_type;
+
+ cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+ pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+ tsk->comm, task_pid_nr(tsk),
+ regs->ip, regs->sp, ssp, error_code,
+ control_protection_err[cpf_type]);
+ print_vma_addr(KERN_CONT " in ", regs->ip);
+ pr_cont("\n");
+ }
+
+ force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+ cond_local_irq_disable(regs);
+}
+#endif
+
static bool do_int3(struct pt_regs *regs)
{
int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 3ba180f550d7..081f4b37d22c 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -240,7 +240,8 @@ typedef struct siginfo {
#define SEGV_ADIPERR 7 /* Precise MCD exception */
#define SEGV_MTEAERR 8 /* Asynchronous ARM MTE error */
#define SEGV_MTESERR 9 /* Synchronous ARM MTE exception */
-#define NSIGSEGV 9
+#define SEGV_CPERR 10 /* Control protection fault */
+#define NSIGSEGV 10

/*
* SIGBUS si_codes
--
2.17.1

2022-02-01 15:08:19

by Edgecombe, Rick P

Subject: [PATCH 13/35] mm: Move VM_UFFD_MINOR_BIT from 37 to 38

From: Yu-cheng Yu <[email protected]>

To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..2e74c0ab6d25 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -359,7 +359,7 @@ extern unsigned int kobjsize(const void *objp);
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT 37
+# define VM_UFFD_MINOR_BIT 38
# define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */
#else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
# define VM_UFFD_MINOR VM_NONE
--
2.17.1

2022-02-01 15:08:19

by Edgecombe, Rick P

Subject: [PATCH 15/35] x86/mm: Check Shadow Stack page fault errors

From: Yu-cheng Yu <[email protected]>

Shadow stack accesses are those that are performed by the CPU where it
expects to encounter a shadow stack mapping. These accesses are performed
implicitly by CALL/RET at the site of the shadow stack pointer, or made
explicitly by shadow stack management instructions like WRUSSQ.

Shadow stack accesses to shadow stack mappings can see faults in normal,
valid operation, just like regular accesses to regular mappings. Shadow
stacks need some of the same features, like delayed allocation, swap and
copy-on-write.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping.

In handling a shadow stack page fault, verify it occurs within a shadow
stack mapping. It is always an error otherwise. For valid shadow stack
accesses, set FAULT_FLAG_WRITE to effect copy-on-write. Because clearing
_PAGE_DIRTY (vs. _PAGE_RW) is used to trigger the fault, shadow stack
read faults and shadow stack write faults are not differentiated, and
both are handled as a write access.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

Yu-cheng v30:
- Update Subject line and add a verb.

arch/x86/include/asm/trap_pf.h | 2 ++
arch/x86/mm/fault.c | 19 +++++++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 10b1de500ab1..afa524325e55 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
+ * bit 6 == 1: shadow stack access fault
* bit 15 == 1: SGX MMU page-fault
*/
enum x86_pf_error_code {
@@ -20,6 +21,7 @@ enum x86_pf_error_code {
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
+ X86_PF_SHSTK = 1 << 6,
X86_PF_SGX = 1 << 15,
};

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d0074c6ed31a..6769134986ec 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1107,6 +1107,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;

+ /*
+ * Verify a shadow stack access is within a shadow stack VMA.
+ * It is always an error otherwise. Normal data access to a
+ * shadow stack area is checked in the cases that follow.
+ */
+ if (error_code & X86_PF_SHSTK) {
+ if (!(vma->vm_flags & VM_SHADOW_STACK))
+ return 1;
+ return 0;
+ }
+
if (error_code & X86_PF_WRITE) {
/* write, present and write, not present: */
if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1300,6 +1311,14 @@ void do_user_addr_fault(struct pt_regs *regs,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

+ /*
+ * Clearing _PAGE_DIRTY is used to detect shadow stack access.
+ * This method cannot distinguish shadow stack read vs. write.
+ * For valid shadow stack accesses, set FAULT_FLAG_WRITE to effect
+ * copy-on-write.
+ */
+ if (error_code & X86_PF_SHSTK)
+ flags |= FAULT_FLAG_WRITE;
if (error_code & X86_PF_WRITE)
flags |= FAULT_FLAG_WRITE;
if (error_code & X86_PF_INSTR)
--
2.17.1

2022-02-01 15:08:19

by Edgecombe, Rick P

Subject: [PATCH 14/35] mm: Introduce VM_SHADOW_STACK for shadow stack memory

From: Yu-cheng Yu <[email protected]>

A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
two cases are handled differently for page faults. Introduce
VM_SHADOW_STACK to track shadow stack VMAs.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---
Documentation/filesystems/proc.rst | 1 +
arch/x86/mm/mmap.c | 2 ++
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 8 ++++++++
4 files changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 061744c436d9..3f8c0fbb9cb3 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -555,6 +555,7 @@ encoded manner. The codes are the following:
mt arm64 MTE allocation tags are enabled
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
+ ss shadow stack page
== =======================================

Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..f3f52c5e2fd6 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)

const char *arch_vma_name(struct vm_area_struct *vma)
{
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return "[shadow stack]";
return NULL;
}

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 18f8c3acbb85..78d9b0fd2aee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -679,6 +679,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+ [ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e74c0ab6d25..311c6018d503 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -308,11 +308,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -328,6 +330,12 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */

+#ifdef CONFIG_X86_SHADOW_STACK
+# define VM_SHADOW_STACK VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK VM_NONE
+#endif
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
--
2.17.1

2022-02-01 15:08:19

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 09/35] x86/mm: Introduce _PAGE_COW

From: Yu-cheng Yu <[email protected]>

There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux). That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0, Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set by hardware
and cannot normally be set by hardware on a Write=0 PTE. Software must
normally be involved to create one of these PTEs, so software can simply
opt to not create them.

In places where Linux normally creates Write=0, Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
words, whenever Linux needs to create Write=0, Dirty=1, it instead creates
Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This
clearly separates shadow stack from other data, and results in the
following:

(a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
(b) A R/O page that has been COW'ed: (Write=0, Cow=1)
The user page is in a R/O VMA, and get_user_pages() needs a writable
copy. The page fault handler creates a copy of the page and sets
the new copy's PTE as Write=0 and Cow=1.
(c) A shadow stack PTE: (Write=0, Dirty=1)
(d) A shared shadow stack PTE: (Write=0, Cow=1)
When a shadow stack page is being shared among processes (this happens
at fork()), its PTE is made Dirty=0, so the next shadow stack access
causes a fault, and the page is duplicated and Dirty=1 is set again.
This is the COW equivalent for shadow stack pages, even though it's
copy-on-access rather than copy-on-write.
(e) A page where the processor observed a Write=1 PTE, started a write, set
Dirty=1, but then observed a Write=0 PTE. That's possible today, but
will not happen on processors that support shadow stack.

Define _PAGE_COW and update pte_*() helpers and apply the same changes to
pmd and pud.

After this, there are six free bits left in the 64-bit PTE, and no more
free bits in the 32-bit PTE (except for PAE); shadow stack is therefore
not implemented for the 32-bit kernel.
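
To make the transitions concrete, here is the intended round trip for a
normal (non shadow stack) PTE, in terms of the helpers below (a sketch;
the exact call sites vary):

	pte = pte_mkwrite(pte_mkdirty(pte));	/* Write=1, Dirty=1 */
	pte = pte_wrprotect(pte);		/* Write=0, Cow=1 - the dirty
						 * value moves to the software
						 * bit, so the PTE cannot be
						 * mistaken for shadow stack */
	pte = pte_mkwrite(pte);			/* Write=1, Dirty=1 - the dirty
						 * value moves back */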

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/pgtable.h | 196 ++++++++++++++++++++++++---
arch/x86/include/asm/pgtable_types.h | 42 +++++-
2 files changed, 217 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aff5e489ff17..a4a75e78a934 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -123,9 +123,20 @@ extern pmdval_t early_pmd_flags;
* The following only work if pte_present() is true.
* Undefined behaviour if not..
*/
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
{
- return pte_flags(pte) & _PAGE_DIRTY;
+ /*
+ * A dirty PTE has Dirty=1 or Cow=1.
+ */
+ return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return false;
+
+ return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}

static inline int pte_young(pte_t pte)
@@ -133,9 +144,20 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}

-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_DIRTY;
+ /*
+ * A dirty PMD has Dirty=1 or Cow=1.
+ */
+ return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return false;
+
+ return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}

static inline int pmd_young(pmd_t pmd)
@@ -143,9 +165,12 @@ static inline int pmd_young(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_ACCESSED;
}

-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
{
- return pud_flags(pud) & _PAGE_DIRTY;
+ /*
+ * A dirty PUD has Dirty=1 or Cow=1.
+ */
+ return pud_flags(pud) & _PAGE_DIRTY_BITS;
}

static inline int pud_young(pud_t pud)
@@ -155,13 +180,23 @@ static inline int pud_young(pud_t pud)

static inline int pte_write(pte_t pte)
{
- return pte_flags(pte) & _PAGE_RW;
+ /*
+ * Shadow stack pages are always writable - though not by normal
+ * instructions, only by shadow stack operations. Hence the
+ * additional Write=0,Dirty=1 check via pte_shstk().
+ */
+ return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
}

#define pmd_write pmd_write
static inline int pmd_write(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_RW;
+ /*
+ * Shadow stack pages are always writable - though not by normal
+ * instructions, only by shadow stack operations. Hence the
+ * additional Write=0,Dirty=1 check via pmd_shstk().
+ */
+ return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
}

#define pud_write pud_write
@@ -299,6 +334,24 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
return native_make_pte(v & ~clear);
}

+static inline pte_t pte_mkcow(pte_t pte)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pte;
+
+ pte = pte_clear_flags(pte, _PAGE_DIRTY);
+ return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pte;
+
+ pte = pte_set_flags(pte, _PAGE_DIRTY);
+ return pte_clear_flags(pte, _PAGE_COW);
+}
+
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
@@ -318,7 +371,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)

static inline pte_t pte_mkclean(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_DIRTY);
+ return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
}

static inline pte_t pte_mkold(pte_t pte)
@@ -328,7 +381,16 @@ static inline pte_t pte_mkold(pte_t pte)

static inline pte_t pte_wrprotect(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_RW);
+ pte = pte_clear_flags(pte, _PAGE_RW);
+
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PTE (RW=0, Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (pte_dirty(pte))
+ pte = pte_mkcow(pte);
+ return pte;
}

static inline pte_t pte_mkexec(pte_t pte)
@@ -338,7 +400,18 @@ static inline pte_t pte_mkexec(pte_t pte)

static inline pte_t pte_mkdirty(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ pteval_t dirty = _PAGE_DIRTY;
+
+ /* Avoid creating (HW)Dirty=1, Write=0 PTEs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+ dirty = _PAGE_COW;
+
+ return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+ return pte_clear_cow(pte);
}

static inline pte_t pte_mkyoung(pte_t pte)
@@ -348,7 +421,12 @@ static inline pte_t pte_mkyoung(pte_t pte)

static inline pte_t pte_mkwrite(pte_t pte)
{
- return pte_set_flags(pte, _PAGE_RW);
+ pte = pte_set_flags(pte, _PAGE_RW);
+
+ if (pte_dirty(pte))
+ pte = pte_clear_cow(pte);
+
+ return pte;
}

static inline pte_t pte_mkhuge(pte_t pte)
@@ -395,6 +473,24 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
return native_make_pmd(v & ~clear);
}

+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pmd;
+
+ pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+ return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pmd;
+
+ pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+ return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pmd_uffd_wp(pmd_t pmd)
{
@@ -419,17 +515,36 @@ static inline pmd_t pmd_mkold(pmd_t pmd)

static inline pmd_t pmd_mkclean(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_DIRTY);
+ return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
}

static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_RW);
+ pmd = pmd_clear_flags(pmd, _PAGE_RW);
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (pmd_dirty(pmd))
+ pmd = pmd_mkcow(pmd);
+ return pmd;
}

static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ pmdval_t dirty = _PAGE_DIRTY;
+
+ /* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pmd_write(pmd))
+ dirty = _PAGE_COW;
+
+ return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+ return pmd_clear_cow(pmd);
}

static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -449,7 +564,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)

static inline pmd_t pmd_mkwrite(pmd_t pmd)
{
- return pmd_set_flags(pmd, _PAGE_RW);
+ pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+ if (pmd_dirty(pmd))
+ pmd = pmd_clear_cow(pmd);
+ return pmd;
}

static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -466,6 +585,24 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
return native_make_pud(v & ~clear);
}

+static inline pud_t pud_mkcow(pud_t pud)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pud;
+
+ pud = pud_clear_flags(pud, _PAGE_DIRTY);
+ return pud_set_flags(pud, _PAGE_COW);
+}
+
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pud;
+
+ pud = pud_set_flags(pud, _PAGE_DIRTY);
+ return pud_clear_flags(pud, _PAGE_COW);
+}
+
static inline pud_t pud_mkold(pud_t pud)
{
return pud_clear_flags(pud, _PAGE_ACCESSED);
@@ -473,17 +610,32 @@ static inline pud_t pud_mkold(pud_t pud)

static inline pud_t pud_mkclean(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_DIRTY);
+ return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
}

static inline pud_t pud_wrprotect(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_RW);
+ pud = pud_clear_flags(pud, _PAGE_RW);
+
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (pud_dirty(pud))
+ pud = pud_mkcow(pud);
+ return pud;
}

static inline pud_t pud_mkdirty(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+ pudval_t dirty = _PAGE_DIRTY;
+
+ /* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pud_write(pud))
+ dirty = _PAGE_COW;
+
+ return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
}

static inline pud_t pud_mkdevmap(pud_t pud)
@@ -503,7 +655,11 @@ static inline pud_t pud_mkyoung(pud_t pud)

static inline pud_t pud_mkwrite(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_RW);
+ pud = pud_set_flags(pud, _PAGE_RW);
+
+ if (pud_dirty(pud))
+ pud = pud_clear_cow(pud);
+ return pud;
}

#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3781a79b6388..1bfab70ff9ac 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
#define _PAGE_BIT_SOFTW2 10 /* " */
#define _PAGE_BIT_SOFTW3 11 /* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
+#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
+#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
#define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
#define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
#define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4

+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW 0
+#endif
+
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
@@ -115,6 +125,36 @@
#define _PAGE_DEVMAP (_AT(pteval_t, 0))
#endif

+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
+ * (b) A R/O page that has been COW'ed: (Write=0, Cow=1)
+ * The user page is in a R/O VMA, and get_user_pages() needs a
+ * writable copy. The page fault handler creates a copy of the page
+ * and sets the new copy's PTE as Write=0, Cow=1.
+ * (c) A shadow stack PTE: (Write=0, Dirty=1)
+ * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1)
+ * When a shadow stack page is being shared among processes (this
+ * happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next
+ * shadow stack access causes a fault, and the page is duplicated and
+ * _PAGE_DIRTY is set again. This is the COW equivalent for shadow
+ * stack pages, even though it's copy-on-access rather than
+ * copy-on-write.
+ * (e) A page where the processor observed a Write=1 PTE, started a write,
+ * set Dirty=1, but then observed a Write=0 PTE (changed by another
+ * thread). That's possible today, but will not happen on processors
+ * that support shadow stack.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW (_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

/*
--
2.17.1

2022-02-01 15:08:20

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 17/35] mm: Fixup places that call pte_mkwrite() directly

From: Yu-cheng Yu <[email protected]>

When serving a page fault, maybe_mkwrite() makes a PTE writable if it is in
a writable vma. A shadow stack vma is writable, but its PTEs need
_PAGE_DIRTY to be set to become writable. For this reason, maybe_mkwrite()
has been updated.
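
For reference, the updated maybe_mkwrite() from earlier in the series has
roughly this shape (a sketch, not the verbatim hunk):

	static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
	{
		if (likely(vma->vm_flags & VM_WRITE))
			pte = pte_mkwrite(pte);
		else if (vma->vm_flags & VM_SHADOW_STACK)
			pte = pte_mkwrite_shstk(pte);
		return pte;
	}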

There are a few places that call pte_mkwrite() directly, where the result
is currently the same as from maybe_mkwrite(). These sites need to be
updated for shadow stack as well. Thus, change them to maybe_mkwrite():

- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE directly
and call pte_mkwrite(), which is the same as maybe_mkwrite(). Change
them to maybe_mkwrite().

- In do_numa_page(), if the numa entry was writable, then pte_mkwrite()
is called directly. Fix it by doing maybe_mkwrite(). Make the same
changes to do_huge_pmd_numa_page().

- In change_pte_range(), pte_mkwrite() is called directly. Replace it with
maybe_mkwrite().

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

Yu-cheng v25:
- Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().

mm/huge_memory.c | 2 +-
mm/memory.c | 5 ++---
mm/migrate.c | 3 +--
mm/mprotect.c | 2 +-
4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2adedcfca00b..3588e9fefbe0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1489,7 +1489,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
pmd = pmd_modify(oldpmd, vma->vm_page_prot);
pmd = pmd_mkyoung(pmd);
if (was_writable)
- pmd = pmd_mkwrite(pmd);
+ pmd = maybe_pmd_mkwrite(pmd, vma);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
spin_unlock(vmf->ptl);
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..c79444603d5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3793,8 +3793,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)

entry = mk_pte(page, vma->vm_page_prot);
entry = pte_sw_mkyoung(entry);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
@@ -4428,7 +4427,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
pte = pte_modify(old_pte, vma->vm_page_prot);
pte = pte_mkyoung(pte);
if (was_writable)
- pte = pte_mkwrite(pte);
+ pte = maybe_mkwrite(pte, vma);
ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..438f1e21b9c7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2697,8 +2697,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}
} else {
entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0138dfcdb1d8..b0012c13a00e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -135,7 +135,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (dirty_accountable && pte_dirty(ptent) &&
(pte_soft_dirty(ptent) ||
!(vma->vm_flags & VM_SOFTDIRTY))) {
- ptent = pte_mkwrite(ptent);
+ ptent = maybe_mkwrite(ptent, vma);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
--
2.17.1

2022-02-01 15:08:22

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 18/35] mm: Add guard pages around a shadow stack.

From: Yu-cheng Yu <[email protected]>

INCSSP(Q/D) increments the shadow stack pointer and 'pops and discards'
the first and the last elements in the range, effectively touching those
memory areas.

The maximum distance INCSSPQ can move the shadow stack pointer is
255 * 8 = 2040 bytes (255 * 4 = 1020 bytes for INCSSPD). Both are well
below PAGE_SIZE. Thus, putting a guard page on both ends of a shadow
stack prevents INCSSP, CALL, and RET from going beyond it.
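
As a quick check of the math (illustrative only):

	INCSSPQ: moves at most 255 * 8 = 2040 bytes < 4096 (PAGE_SIZE)
	INCSSPD: moves at most 255 * 4 = 1020 bytes < 4096

so even the farthest-reaching shadow stack instruction cannot step over a
single 4-KB guard page without touching it and faulting.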

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

Yu-cheng v25:
- Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.

Yu-cheng v24:
- Instead changing vm_*_gap(), create x86-specific versions.

arch/x86/include/asm/page_types.h | 7 +++++
arch/x86/mm/mmap.c | 46 +++++++++++++++++++++++++++++++
include/linux/mm.h | 4 +++
3 files changed, 57 insertions(+)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index a506a411474d..e1533fdc08b4 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -73,6 +73,13 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);

extern void initmem_init(void);

+#define vm_start_gap vm_start_gap
+struct vm_area_struct;
+extern unsigned long vm_start_gap(struct vm_area_struct *vma);
+
+#define vm_end_gap vm_end_gap
+extern unsigned long vm_end_gap(struct vm_area_struct *vma);
+
#endif /* !__ASSEMBLY__ */

#endif /* _ASM_X86_PAGE_DEFS_H */
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index f3f52c5e2fd6..81f9325084d3 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -250,3 +250,49 @@ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
return false;
return true;
}
+
+/*
+ * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D). INCSSPQ
+ * moves shadow stack pointer up to 255 * 8 = ~2 KB (~1KB for INCSSPD) and
+ * touches the first and the last element in the range, which triggers a
+ * page fault if the range is not in a shadow stack. Because of this,
+ * creating 4-KB guard pages around a shadow stack prevents these
+ * instructions from going beyond.
+ */
+#define SHADOW_STACK_GUARD_GAP PAGE_SIZE
+
+unsigned long vm_start_gap(struct vm_area_struct *vma)
+{
+ unsigned long vm_start = vma->vm_start;
+ unsigned long gap = 0;
+
+ if (vma->vm_flags & VM_GROWSDOWN)
+ gap = stack_guard_gap;
+ else if (vma->vm_flags & VM_SHADOW_STACK)
+ gap = SHADOW_STACK_GUARD_GAP;
+
+ if (gap != 0) {
+ vm_start -= gap;
+ if (vm_start > vma->vm_start)
+ vm_start = 0;
+ }
+ return vm_start;
+}
+
+unsigned long vm_end_gap(struct vm_area_struct *vma)
+{
+ unsigned long vm_end = vma->vm_end;
+ unsigned long gap = 0;
+
+ if (vma->vm_flags & VM_GROWSUP)
+ gap = stack_guard_gap;
+ else if (vma->vm_flags & VM_SHADOW_STACK)
+ gap = SHADOW_STACK_GUARD_GAP;
+
+ if (gap != 0) {
+ vm_end += gap;
+ if (vm_end < vma->vm_end)
+ vm_end = -PAGE_SIZE;
+ }
+ return vm_end;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b3cb3a17037b..e125358d7f75 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2797,6 +2797,7 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
return vma;
}

+#ifndef vm_start_gap
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
unsigned long vm_start = vma->vm_start;
@@ -2808,7 +2809,9 @@ static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
}
return vm_start;
}
+#endif

+#ifndef vm_end_gap
static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
{
unsigned long vm_end = vma->vm_end;
@@ -2820,6 +2823,7 @@ static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
}
return vm_end;
}
+#endif

static inline unsigned long vma_pages(struct vm_area_struct *vma)
{
--
2.17.1

2022-02-01 15:08:25

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 22/35] x86/mm: Prevent VM_WRITE shadow stacks

Shadow stack accesses are writes from handle_mm_fault()'s perspective. So
to generate the correct PTE, maybe_mkwrite() relies on the presence of
VM_SHADOW_STACK or VM_WRITE in the vma.

In future patches, when VM_SHADOW_STACK becomes actually creatable by
userspace, a problem could arise if a user calls
mprotect( , , PROT_WRITE) on VM_SHADOW_STACK shadow stack memory. The
fault handler would then be confused by shadow stack accesses, create a
writable PTE for them, and the process would fault in a loop.

Prevent this from happening by blocking this kind of memory (VM_WRITE and
VM_SHADOW_STACK) from being created, instead of complicating the fault
handler logic to handle it.

Add an x86 arch_validate_flags() implementation to handle the check.
Rename the uapi/asm/mman.h header guard to be able to use it for
arch/x86/include/asm/mman.h where the arch_validate_flags() will be.
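
As an illustration, once shadow stack mappings are creatable by
userspace, the blocked case would look something like this (a sketch;
'ssp_page' is a hypothetical page inside a shadow stack mapping):

	if (mprotect(ssp_page, 4096, PROT_READ | PROT_WRITE))
		perror("mprotect");	/* expected to fail with EINVAL,
					 * rejected by arch_validate_flags() */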

Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- New patch.

arch/x86/include/asm/mman.h | 21 +++++++++++++++++++++
arch/x86/include/uapi/asm/mman.h | 6 +++---
2 files changed, 24 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/mman.h

diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
new file mode 100644
index 000000000000..b44fe31deb3a
--- /dev/null
+++ b/arch/x86/include/asm/mman.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_MMAN_H
+#define _ASM_X86_MMAN_H
+
+#include <linux/mm.h>
+#include <uapi/asm/mman.h>
+
+#ifdef CONFIG_X86_SHADOW_STACK
+static inline bool arch_validate_flags(unsigned long vm_flags)
+{
+ if ((vm_flags & VM_SHADOW_STACK) && (vm_flags & VM_WRITE))
+ return false;
+
+ return true;
+}
+
+#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
+
+#endif /* CONFIG_X86_SHADOW_STACK */
+
+#endif /* _ASM_X86_MMAN_H */
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index d4a8d0424bfb..9704e27c4d24 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-#ifndef _ASM_X86_MMAN_H
-#define _ASM_X86_MMAN_H
+#ifndef _UAPI_ASM_X86_MMAN_H
+#define _UAPI_ASM_X86_MMAN_H

#define MAP_32BIT 0x40 /* only give out 32bit addresses */

@@ -28,4 +28,4 @@

#include <asm-generic/mman.h>

-#endif /* _ASM_X86_MMAN_H */
+#endif /* _UAPI_ASM_X86_MMAN_H */
--
2.17.1

2022-02-01 15:08:29

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 20/35] mm: Update can_follow_write_pte() for shadow stack

From: Yu-cheng Yu <[email protected]>

Can_follow_write_pte() ensures a read-only page is COWed by checking the
FOLL_COW flag, and uses pte_dirty() to validate the flag is still valid.

Like a writable data page, a shadow stack page is writable, and becomes
read-only during copy-on-write, but it is always dirty. Thus, in the
can_follow_write_pte() check, it belongs to the writable page case and
should be excluded from the read-only page pte_dirty() check. Apply
the same changes to can_follow_write_pmd().

While at it, also split the long line into smaller ones.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

Yu-cheng v26:
- Instead of passing vm_flags, pass down vma pointer to can_follow_write_*().

Yu-cheng v25:
- Split long line into smaller ones.

Yu-cheng v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

mm/gup.c | 16 ++++++++++++----
mm/huge_memory.c | 16 ++++++++++++----
2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..95b7d1084c44 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -464,10 +464,18 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
* FOLL_FORCE can write to even unwritable pte's, but only
* after we've gone through a COW cycle and they are dirty.
*/
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
+ struct vm_area_struct *vma)
{
- return pte_write(pte) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+ if (pte_write(pte))
+ return true;
+ if ((flags & (FOLL_FORCE | FOLL_COW)) != (FOLL_FORCE | FOLL_COW))
+ return false;
+ if (!pte_dirty(pte))
+ return false;
+ if (is_shadow_stack_mapping(vma->vm_flags))
+ return false;
+ return true;
}

static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -510,7 +518,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
- if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+ if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags, vma)) {
pte_unmap_unlock(ptep, ptl);
return NULL;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3588e9fefbe0..1c7167e6f223 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,10 +1346,18 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
* FOLL_FORCE can write to even unwritable pmd's, but only
* after we've gone through a COW cycle and they are dirty.
*/
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags,
+ struct vm_area_struct *vma)
{
- return pmd_write(pmd) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+ if (pmd_write(pmd))
+ return true;
+ if ((flags & (FOLL_FORCE | FOLL_COW)) != (FOLL_FORCE | FOLL_COW))
+ return false;
+ if (!pmd_dirty(pmd))
+ return false;
+ if (is_shadow_stack_mapping(vma->vm_flags))
+ return false;
+ return true;
}

struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1362,7 +1370,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,

assert_spin_locked(pmd_lockptr(mm, pmd));

- if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+ if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags, vma))
goto out;

/* Avoid dumping huge zero page */
--
2.17.1

2022-02-01 15:08:30

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 21/35] mm/mprotect: Exclude shadow stack from preserve_write

From: Yu-cheng Yu <[email protected]>

In change_pte_range(), when a PTE is changed for prot_numa, _PAGE_RW is
preserved to avoid the additional write fault after the NUMA hinting
fault. However, pte_write() now covers both normal writable PTEs and
shadow stack (RW=0, Dirty=1) PTEs; the latter do not have _PAGE_RW set,
so there is nothing to preserve.

Exclude shadow stack from preserve_write test, and apply the same change to
change_huge_pmd().

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v25:
- Move is_shadow_stack_mapping() to a separate line.

v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().

mm/huge_memory.c | 7 +++++++
mm/mprotect.c | 7 +++++++
2 files changed, 14 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c7167e6f223..01375e39b52b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1750,6 +1750,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
return 0;

preserve_write = prot_numa && pmd_write(*pmd);
+
+ /*
+ * Preserve only normal writable huge PMD, but not shadow
+ * stack (RW=0, Dirty=1).
+ */
+ if (is_shadow_stack_mapping(vma->vm_flags))
+ preserve_write = false;
ret = 1;

#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b0012c13a00e..faac710f0891 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,6 +77,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
pte_t ptent;
bool preserve_write = prot_numa && pte_write(oldpte);

+ /*
+ * Preserve only normal writable PTE, but not shadow
+ * stack (RW=0, Dirty=1).
+ */
+ if (is_shadow_stack_mapping(vma->vm_flags))
+ preserve_write = false;
+
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
--
2.17.1

2022-02-01 15:08:35

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 23/35] x86/fpu: Add helpers for modifying supervisor xstate

Add helpers that can be used to modify supervisor xstate safely for the
current task.

State for supervisor xstate based features can be live and
accessed via MSRs, or saved in memory in an xsave buffer. When the
kernel needs to modify this state it needs to be sure to operate on it
in the right place, so the modifications don't get clobbered.

In the past, supervisor xstate features have used get_xsave_addr()
directly, and performed open-coded logic to operate on the saved state
correctly. This has posed two problems:
1. The logic has been gotten wrong more than once.
2. To reduce code, less common paths are not optimized. Determining
which paths are less common is based on assumptions about far-away
code that could change.

In addition, now that get_xsave_addr() is not available outside of the
core fpu code, there isn't even a way for these supervisor features to
modify the in memory state.

To resolve these problems, add some helpers that encapsulate the correct
logic to operate on the correct copy of the state. Map the MSRs to their
struct field locations in a case statement in __get_xsave_member().

Use the helpers like this, to write to either the MSR or saved state
(the helpers return 0 on success):

	void *xstate;
	int r;

	xstate = start_update_xsave_msrs(XFEATURE_FOO);
	r = xsave_rdmsrl(xstate, MSR_IA32_FOO_1, &val);
	if (!r)
		xsave_wrmsrl(xstate, MSR_IA32_FOO_2, FOO_ENABLE);
	end_update_xsave_msrs();
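
The read-modify-write helper follows the same pattern, e.g. (a sketch,
using the same hypothetical FOO feature):

	xstate = start_update_xsave_msrs(XFEATURE_FOO);
	/* set FOO_ENABLE and clear FOO_LEGACY in one operation */
	xsave_set_clear_bits_msrl(xstate, MSR_IA32_FOO_1, FOO_ENABLE, FOO_LEGACY);
	end_update_xsave_msrs();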

Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- New patch.

arch/x86/include/asm/fpu/api.h | 5 ++
arch/x86/kernel/fpu/xstate.c | 134 +++++++++++++++++++++++++++++++++
2 files changed, 139 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index c83b3020350a..6aec27984b62 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -165,4 +165,9 @@ static inline bool fpstate_is_confidential(struct fpu_guest *gfpu)
struct task_struct;
extern long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2);

+void *start_update_xsave_msrs(int xfeature_nr);
+void end_update_xsave_msrs(void);
+int xsave_rdmsrl(void *state, unsigned int msr, unsigned long long *p);
+int xsave_wrmsrl(void *state, u32 msr, u64 val);
+int xsave_set_clear_bits_msrl(void *state, u32 msr, u64 set, u64 clear);
#endif /* _ASM_X86_FPU_API_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 44397202762b..c5e20e0d0725 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1867,3 +1867,137 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
return 0;
}
#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
+static u64 *__get_xsave_member(void *xstate, u32 msr)
+{
+ switch (msr) {
+ /* Currently there are no MSR's supported */
+ default:
+ WARN_ONCE(1, "x86/fpu: unsupported xstate msr (%u)\n", msr);
+ return NULL;
+ }
+}
+
+/*
+ * Return a pointer to the xstate for the feature if it should be used, or NULL
+ * if the MSRs should be written to directly. To do this safely, using the
+ * associated read/write helpers is required.
+ */
+void *start_update_xsave_msrs(int xfeature_nr)
+{
+ void *xstate;
+
+ /*
+ * fpregs_lock() only disables preemption (mostly). So modifying
+ * state in an interrupt could screw up an in-progress fpregs
+ * operation, yet appear to work. Warn about it.
+ */
+ WARN_ON_ONCE(!in_task());
+ WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+ fpregs_lock();
+
+ fpregs_assert_state_consistent();
+
+ /*
+ * If the registers don't need to be reloaded, go ahead and
+ * operate on them directly.
+ */
+ if (!test_thread_flag(TIF_NEED_FPU_LOAD))
+ return NULL;
+
+ xstate = get_xsave_addr(&current->thread.fpu.fpstate->regs.xsave, xfeature_nr);
+
+ /*
+ * If regs are in the init state, they can't be retrieved from
+ * init_fpstate due to the init optimization, but are not necessarily
+ * zero. The only option is to restore to make everything live and
+ * operate on registers. This will clear TIF_NEED_FPU_LOAD.
+ *
+ * Otherwise, if not in the init state but TIF_NEED_FPU_LOAD is set,
+ * operate on the buffer. The registers will be restored before going
+ * to userspace in any case, but the task might get preempted before
+ * then, so this possibly saves an xsave.
+ */
+ if (!xstate)
+ fpregs_restore_userregs();
+ return xstate;
+}
+
+void end_update_xsave_msrs(void)
+{
+ fpregs_unlock();
+}
+
+/*
+ * When TIF_NEED_FPU_LOAD is set and fpregs_state_valid() is true, the saved
+ * state and fp state match. In this case, the kernel has some good options -
+ * it can skip the restore before returning to userspace or it could skip
+ * an xsave if preempted before then.
+ *
+ * But if this correspondence is broken by either a write to the in-memory
+ * buffer or the registers, the kernel needs to be notified so it doesn't miss
+ * an xsave or restore. __xsave_msrl_prepare_write() performs this check and
+ * notifies the kernel if needed. Use before writes only, to not take away
+ * the kernel's options when not required.
+ *
+ * If TIF_NEED_FPU_LOAD is set, then the logic in start_update_xsave_msrs()
+ * must have resulted in targeting the in-memory state, so invalidating the
+ * registers is the right thing to do.
+ */
+static void __xsave_msrl_prepare_write(void)
+{
+ if (test_thread_flag(TIF_NEED_FPU_LOAD) &&
+ fpregs_state_valid(&current->thread.fpu, smp_processor_id()))
+ __fpu_invalidate_fpregs_state(&current->thread.fpu);
+}
+
+int xsave_rdmsrl(void *xstate, unsigned int msr, unsigned long long *p)
+{
+ u64 *member_ptr;
+
+ if (!xstate)
+ return rdmsrl_safe(msr, p);
+
+ member_ptr = __get_xsave_member(xstate, msr);
+ if (!member_ptr)
+ return 1;
+
+ *p = *member_ptr;
+
+ return 0;
+}
+
+int xsave_wrmsrl(void *xstate, u32 msr, u64 val)
+{
+ u64 *member_ptr;
+
+ __xsave_msrl_prepare_write();
+ if (!xstate)
+ return wrmsrl_safe(msr, val);
+
+ member_ptr = __get_xsave_member(xstate, msr);
+ if (!member_ptr)
+ return 1;
+
+ *member_ptr = val;
+
+ return 0;
+}
+
+int xsave_set_clear_bits_msrl(void *xstate, u32 msr, u64 set, u64 clear)
+{
+ u64 val, new_val;
+ int ret;
+
+ ret = xsave_rdmsrl(xstate, msr, &val);
+ if (ret)
+ return ret;
+
+ new_val = (val & ~clear) | set;
+
+ if (new_val != val)
+ return xsave_wrmsrl(xstate, msr, new_val);
+
+ return 0;
+}
--
2.17.1

2022-02-01 15:08:35

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 29/35] x86/cet/shstk: Introduce shadow stack token setup/verify routines

From: Yu-cheng Yu <[email protected]>

A shadow stack restore token marks a restore point of the shadow stack,
and the address in a token must point directly above the token, within
the same shadow stack. This is distinctly different from other pointers
on the shadow stack, since those point into executable code.

Introduce token setup and verify routines. Also introduce WRUSS, a
kernel-mode instruction that writes directly to the user shadow stack.
It is used to construct the user signal stack.
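
For reference, the 64-bit token format implemented below is (a sketch):

	bit 0:	set for a 64-bit token
	bit 1:	"busy" flag; must be clear for a restorable token
	rest:	the shadow stack pointer value, i.e. the address directly
		above the token (8-byte aligned in the 64-bit case)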

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

v1:
- Use xsave helpers.

Yu-cheng v30:
- Update commit log, remove description about signals.
- Update various comments.
- Remove variable 'ssp' init and adjust return value accordingly.
- Check get_user_shstk_addr() return value.
- Replace 'ia32' with 'proc32'.

Yu-cheng v29:
- Update comments for the use of get_xsave_addr().

Yu-cheng v28:
- Add comments for get_xsave_addr().

Yu-cheng v27:
- For shstk_check_rstor_token(), instead of an input param, use current
shadow stack pointer.
- In response to comments, fix/simplify a few syntax/format issues.

arch/x86/include/asm/cet.h | 7 ++
arch/x86/include/asm/special_insns.h | 30 +++++++
arch/x86/kernel/shstk.c | 122 +++++++++++++++++++++++++++
3 files changed, 159 insertions(+)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 63ee8b45080d..6e8a7a807dcc 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -19,6 +19,9 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
void shstk_free(struct task_struct *p);
int shstk_disable(void);
void reset_thread_shstk(void);
+int shstk_setup_rstor_token(bool proc32, unsigned long restorer,
+ unsigned long *new_ssp);
+int shstk_check_rstor_token(bool proc32, unsigned long *new_ssp);
#else
static inline void shstk_setup(void) {}
static inline int shstk_alloc_thread_stack(struct task_struct *p,
@@ -27,6 +30,10 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
static inline void shstk_free(struct task_struct *p) {}
static inline void shstk_disable(void) {}
static inline void reset_thread_shstk(void) {}
+static inline int shstk_setup_rstor_token(bool proc32, unsigned long restorer,
+ unsigned long *new_ssp) { return 0; }
+static inline int shstk_check_rstor_token(bool proc32,
+ unsigned long *new_ssp) { return 0; }
#endif /* CONFIG_X86_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 68c257a3de0d..f45f378ca1fc 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -222,6 +222,36 @@ static inline void clwb(volatile void *__p)
: [pax] "a" (p));
}

+#ifdef CONFIG_X86_SHADOW_STACK
+static inline int write_user_shstk_32(u32 __user *addr, u32 val)
+{
+ if (WARN_ONCE(!IS_ENABLED(CONFIG_IA32_EMULATION) &&
+ !IS_ENABLED(CONFIG_X86_X32),
+ "%s used but not supported.\n", __func__)) {
+ return -EFAULT;
+ }
+
+ asm_volatile_goto("1: wrussd %[val], (%[addr])\n"
+ _ASM_EXTABLE(1b, %l[fail])
+ :: [addr] "r" (addr), [val] "r" (val)
+ :: fail);
+ return 0;
+fail:
+ return -EFAULT;
+}
+
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+ asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+ _ASM_EXTABLE(1b, %l[fail])
+ :: [addr] "r" (addr), [val] "r" (val)
+ :: fail);
+ return 0;
+fail:
+ return -EFAULT;
+}
+#endif /* CONFIG_X86_SHADOW_STACK */
+
#define nop() asm volatile ("nop")

static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 358f24e806cc..e0caab50ca77 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -23,6 +23,33 @@
#include <asm/special_insns.h>
#include <asm/fpu/api.h>

+/*
+ * Create a restore token on the shadow stack. A token is always 8 bytes
+ * and aligned to 8.
+ */
+static int create_rstor_token(bool proc32, unsigned long ssp,
+ unsigned long *token_addr)
+{
+ unsigned long addr;
+
+ /* Aligned to 8 is aligned to 4, so test 8 first */
+ if ((!proc32 && !IS_ALIGNED(ssp, 8)) || !IS_ALIGNED(ssp, 4))
+ return -EINVAL;
+
+ addr = ALIGN_DOWN(ssp, 8) - 8;
+
+ /* Is the token for 64-bit? */
+ if (!proc32)
+ ssp |= BIT(0);
+
+ if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+ return -EFAULT;
+
+ *token_addr = addr;
+
+ return 0;
+}
+
static unsigned long alloc_shstk(unsigned long size)
{
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
@@ -213,3 +240,98 @@ int shstk_disable(void)
shstk_free(current);
return 0;
}
+
+static unsigned long get_user_shstk_addr(void)
+{
+ void *xstate;
+ unsigned long long ssp;
+
+ xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
+
+ xsave_rdmsrl(xstate, MSR_IA32_PL3_SSP, &ssp);
+
+ end_update_xsave_msrs();
+
+ return ssp;
+}
+
+/*
+ * Create a restore token on shadow stack, and then push the user-mode
+ * function return address.
+ */
+int shstk_setup_rstor_token(bool proc32, unsigned long ret_addr,
+ unsigned long *new_ssp)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ unsigned long ssp, token_addr;
+ int err;
+
+ if (!shstk->size)
+ return 0;
+
+ if (!ret_addr)
+ return -EINVAL;
+
+ ssp = get_user_shstk_addr();
+ if (!ssp)
+ return -EINVAL;
+
+ err = create_rstor_token(proc32, ssp, &token_addr);
+ if (err)
+ return err;
+
+ if (proc32) {
+ ssp = token_addr - sizeof(u32);
+ err = write_user_shstk_32((u32 __user *)ssp, (u32)ret_addr);
+ } else {
+ ssp = token_addr - sizeof(u64);
+ err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
+ }
+
+ if (!err)
+ *new_ssp = ssp;
+
+ return err;
+}
+
+/*
+ * Verify the user shadow stack has a valid token on it, and then set
+ * *new_ssp according to the token.
+ */
+int shstk_check_rstor_token(bool proc32, unsigned long *new_ssp)
+{
+ unsigned long token_addr;
+ unsigned long token;
+ bool shstk32;
+
+ token_addr = get_user_shstk_addr();
+ if (!token_addr)
+ return -EINVAL;
+
+ if (get_user(token, (unsigned long __user *)token_addr))
+ return -EFAULT;
+
+ /* Is mode flag correct? */
+ shstk32 = !(token & BIT(0));
+ if (proc32 ^ shstk32)
+ return -EINVAL;
+
+ /* Is busy flag set? */
+ if (token & BIT(1))
+ return -EINVAL;
+
+ /* Mask out flags */
+ token &= ~3UL;
+
+ /* Restore address aligned? */
+ if ((!proc32 && !IS_ALIGNED(token, 8)) || !IS_ALIGNED(token, 4))
+ return -EINVAL;
+
+ /* Token placed properly? */
+ if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+ return -EINVAL;
+
+ *new_ssp = token;
+
+ return 0;
+}
--
2.17.1

2022-02-01 15:08:35

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 31/35] x86/cet/shstk: Add arch_prctl elf feature functions

From: Yu-cheng Yu <[email protected]>

Some CPU features that adjust the behavior of existing instructions
should be enabled only if the application supports these modifications.

Provide a per-thread arch_prctl interface for modifying, checking and
locking the enablement status of features like these. This interface
operates on the per-thread state, which is copied for new threads. It is
intended to be used mostly early in an application (e.g. by a dynamic
loader) such that the behavior will be inherited by new threads created
by the application.

Today the only user is shadow stack, but keep the names generic because
other features like LAM can use it as well.

The interface is as below:
arch_prctl(ARCH_X86_FEATURE_STATUS, u64 *args)
Get feature status.

The parameter 'args' is a pointer to a user buffer. The kernel returns
the following information:

*args = shadow stack/IBT status
*(args + 1) = shadow stack base address
*(args + 2) = shadow stack size

32-bit binaries use the same interface, but only lower 32-bits of each
item.

arch_prctl(ARCH_X86_FEATURE_DISABLE, unsigned int features)
Disable features specified in 'features'. Return -EPERM if any of the
passed features are locked. Return -ECANCELED if any of the features
failed to disable. In this case call ARCH_X86_FEATURE_STATUS to find
out which features are still enabled.

arch_prctl(ARCH_X86_FEATURE_ENABLE, unsigned int features)
Enable features specified in 'features'. Return -EPERM if any of the
passed features are locked. Return -ECANCELED if any of the features
failed to enable. In this case call ARCH_X86_FEATURE_STATUS to find
out which features were enabled.

arch_prctl(ARCH_X86_FEATURE_LOCK, unsigned int features)
Lock in all features at their current enabled or disabled status.
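
For example, a dynamic loader might use the interface roughly like this
(an illustrative sketch using the raw syscall; headers and error
handling omitted):

	unsigned long long st[3];

	/* Try to enable shadow stack, then lock the decision in */
	if (!syscall(SYS_arch_prctl, ARCH_X86_FEATURE_ENABLE,
		     LINUX_X86_FEATURE_SHSTK))
		syscall(SYS_arch_prctl, ARCH_X86_FEATURE_LOCK,
			LINUX_X86_FEATURE_SHSTK);

	syscall(SYS_arch_prctl, ARCH_X86_FEATURE_STATUS, st);
	/* st[0] = enabled features, st[1] = shstk base, st[2] = size */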

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- Changed from ENOSYS and ENOTSUPP error codes per checkpatch.
- Changed interface/filename to be more generic so it can be shared with
LAM.
- Add lock mask, such that some features can be locked, while leaving others
to be enabled later.
- Add ARCH_X86_FEATURE_ENABLE to use instead of parsing the elf header
- Change ARCH_X86_FEATURE_DISABLE to actually return an error on
failure.

arch/x86/include/asm/cet.h | 6 +++
arch/x86/include/asm/processor.h | 1 +
arch/x86/include/uapi/asm/prctl.h | 10 +++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/elf_feature_prctl.c | 66 +++++++++++++++++++++++++++++
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 1 +
7 files changed, 86 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/elf_feature_prctl.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index faff8dc86159..cbc7cfcba5dc 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -40,6 +40,12 @@ static inline int setup_signal_shadow_stack(int proc32, void __user *restorer) {
static inline int restore_signal_shadow_stack(void) { return 0; }
#endif /* CONFIG_X86_SHADOW_STACK */

+#ifdef CONFIG_X86_SHADOW_STACK
+int prctl_elf_feature(int option, u64 arg2);
+#else
+static inline int prctl_elf_feature(int option, u64 arg2) { return -EINVAL; }
+#endif
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a9f4e9c4ca81..100af0f570c9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -531,6 +531,7 @@ struct thread_struct {

#ifdef CONFIG_X86_SHADOW_STACK
struct thread_shstk shstk;
+ u64 feat_prctl_locked;
#endif

/* Floating point and extended processor state */
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..aa294c7bcf41 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,14 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003

+#define ARCH_X86_FEATURE_STATUS 0x3001
+#define ARCH_X86_FEATURE_DISABLE 0x3002
+#define ARCH_X86_FEATURE_LOCK 0x3003
+#define ARCH_X86_FEATURE_ENABLE 0x3004
+
+/* x86 feature bits to be used with ARCH_X86_FEATURE arch_prctl()s */
+#define LINUX_X86_FEATURE_IBT 0x00000001
+#define LINUX_X86_FEATURE_SHSTK 0x00000002
+
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index d60ae6c365c7..531dba96d4dc 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -153,7 +153,7 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += sev.o

obj-$(CONFIG_ARCH_HAS_CC_PLATFORM) += cc_platform.o

-obj-$(CONFIG_X86_SHADOW_STACK) += shstk.o
+obj-$(CONFIG_X86_SHADOW_STACK) += shstk.o elf_feature_prctl.o
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/elf_feature_prctl.c b/arch/x86/kernel/elf_feature_prctl.c
new file mode 100644
index 000000000000..47de201db3f7
--- /dev/null
+++ b/arch/x86/kernel/elf_feature_prctl.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/errno.h>
+#include <linux/uaccess.h>
+#include <linux/prctl.h>
+#include <linux/compat.h>
+#include <linux/mman.h>
+#include <linux/elfcore.h>
+#include <linux/processor.h>
+#include <asm/prctl.h>
+#include <asm/cet.h>
+
+/* See Documentation/x86/intel_cet.rst. */
+
+static int elf_feat_copy_status_to_user(struct thread_shstk *shstk, u64 __user *ubuf)
+{
+ u64 buf[3] = {};
+
+ if (shstk->size) {
+ buf[0] = LINUX_X86_FEATURE_SHSTK;
+ buf[1] = shstk->base;
+ buf[2] = shstk->size;
+ }
+
+ return copy_to_user(ubuf, buf, sizeof(buf));
+}
+
+int prctl_elf_feature(int option, u64 arg2)
+{
+ struct thread_struct *thread = &current->thread;
+ u64 feat_succ = 0;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return -EOPNOTSUPP;
+
+ switch (option) {
+ case ARCH_X86_FEATURE_STATUS:
+ return elf_feat_copy_status_to_user(&thread->shstk, (u64 __user *)arg2);
+ case ARCH_X86_FEATURE_DISABLE:
+ if (arg2 & thread->feat_prctl_locked)
+ return -EPERM;
+
+ if (arg2 & LINUX_X86_FEATURE_SHSTK && !shstk_disable())
+ feat_succ |= LINUX_X86_FEATURE_SHSTK;
+
+ if (feat_succ != arg2)
+ return -ECANCELED;
+ return 0;
+ case ARCH_X86_FEATURE_ENABLE:
+ if (arg2 & thread->feat_prctl_locked)
+ return -EPERM;
+
+ if (arg2 & LINUX_X86_FEATURE_SHSTK && !shstk_setup())
+ feat_succ |= LINUX_X86_FEATURE_SHSTK;
+
+ if (feat_succ != arg2)
+ return -ECANCELED;
+ return 0;
+ case ARCH_X86_FEATURE_LOCK:
+ thread->feat_prctl_locked |= arg2;
+ return 0;
+
+ default:
+ return -EINVAL;
+ }
+}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0fbcf33255fa..11bf09b60f9d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -1005,5 +1005,5 @@ long do_arch_prctl_common(struct task_struct *task, int option,
return fpu_xstate_prctl(task, option, arg2);
}

- return -EINVAL;
+ return prctl_elf_feature(option, arg2);
}
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 682d85a63a1d..f330be17e2d1 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -130,6 +130,7 @@ int shstk_setup(void)

void reset_thread_shstk(void)
{
+ current->thread.feat_prctl_locked = 0;
memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
}

--
2.17.1

2022-02-01 15:08:35

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 25/35] x86/cet/shstk: Add user-mode shadow stack support

From: Yu-cheng Yu <[email protected]>

Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).
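
For example (a sketch of the sizing rule from the code below):

	size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
	/* RLIMIT_STACK = 8 MB (a common default) -> 8 MB shadow stack;
	 * RLIMIT_STACK unlimited -> capped at 4 GB */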

Add the user shadow stack MSRs to the xsave helpers, so they can be used
to implement the functionality.

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

v1:
- Switch to xsave helpers.
- Expand commit log.

Yu-cheng v30:
- Remove superfluous comments for struct thread_shstk.
- Replace 'populate' with 'unused'.

Yu-cheng v28:
- Update shstk_setup() with wrmsrl_safe(), returns success when shadow
stack feature is not present (since this is a setup function).

Yu-cheng v27:
- Change 'struct cet_status' to 'struct thread_shstk', and change member
types from unsigned long to u64.
- Re-order local variables in reverse order of length.
- WARN_ON_ONCE() when vm_munmap() fails.

arch/x86/include/asm/cet.h | 29 ++++++
arch/x86/include/asm/processor.h | 5 ++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/fpu/xstate.c | 5 +-
arch/x86/kernel/process_64.c | 2 +
arch/x86/kernel/shstk.c | 149 +++++++++++++++++++++++++++++++
6 files changed, 190 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/kernel/shstk.c

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
new file mode 100644
index 000000000000..de90e4ae083a
--- /dev/null
+++ b/arch/x86/include/asm/cet.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CET_H
+#define _ASM_X86_CET_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+struct task_struct;
+
+struct thread_shstk {
+ u64 base;
+ u64 size;
+};
+
+#ifdef CONFIG_X86_SHADOW_STACK
+int shstk_setup(void);
+void shstk_free(struct task_struct *p);
+int shstk_disable(void);
+void reset_thread_shstk(void);
+#else
+static inline void shstk_setup(void) {}
+static inline void shstk_free(struct task_struct *p) {}
+static inline void shstk_disable(void) {}
+static inline void reset_thread_shstk(void) {}
+#endif /* CONFIG_X86_SHADOW_STACK */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_CET_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2c5f12ae7d04..a9f4e9c4ca81 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
#include <asm/unwind_hints.h>
#include <asm/vmxfeatures.h>
#include <asm/vdso/processor.h>
+#include <asm/cet.h>

#include <linux/personality.h>
#include <linux/cache.h>
@@ -528,6 +529,10 @@ struct thread_struct {
*/
u32 pkru;

+#ifdef CONFIG_X86_SHADOW_STACK
+ struct thread_shstk shstk;
+#endif
+
/* Floating point and extended processor state */
struct fpu fpu;
/*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 6aef9ee28a39..d60ae6c365c7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -153,6 +153,7 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += sev.o

obj-$(CONFIG_ARCH_HAS_CC_PLATFORM) += cc_platform.o

+obj-$(CONFIG_X86_SHADOW_STACK) += shstk.o
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c5e20e0d0725..25b1b0c417fd 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1871,7 +1871,10 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
static u64 *__get_xsave_member(void *xstate, u32 msr)
{
switch (msr) {
- /* Currently there are no MSR's supported */
+ case MSR_IA32_PL3_SSP:
+ return &((struct cet_user_state *)xstate)->user_ssp;
+ case MSR_IA32_U_CET:
+ return &((struct cet_user_state *)xstate)->user_cet;
default:
WARN_ONCE(1, "x86/fpu: unsupported xstate msr (%u)\n", msr);
return NULL;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3402edec236c..f05fe27d4967 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
load_gs_index(__USER_DS);
}

+ reset_thread_shstk();
+
loadsegment(fs, 0);
loadsegment(es, _ds);
loadsegment(ds, _ds);
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..4e8686ed885f
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <[email protected]>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/internal.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+ struct mm_struct *mm = current->mm;
+ unsigned long addr, unused;
+
+ mmap_write_lock(mm);
+ addr = do_mmap(NULL, 0, size, PROT_READ, flags, VM_SHADOW_STACK, 0,
+ &unused, NULL);
+ mmap_write_unlock(mm);
+
+ return addr;
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+ while (1) {
+ int r;
+
+ r = vm_munmap(base, size);
+
+ /*
+ * vm_munmap() returns -EINTR when mmap_lock is held by
+ * something else, and that lock should not be held for a
+ * long time. Retry in that case.
+ */
+ if (r == -EINTR) {
+ cond_resched();
+ continue;
+ }
+
+ /*
+ * For all other types of vm_munmap() failure, either the
+ * system is out of memory or there is a bug.
+ */
+ WARN_ON_ONCE(r);
+ break;
+ }
+}
+
+int shstk_setup(void)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ unsigned long addr, size;
+ void *xstate;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ shstk->size ||
+ shstk->base)
+ return 1;
+
+ size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+ addr = alloc_shstk(size);
+ if (IS_ERR_VALUE(addr))
+ return 1;
+
+ xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
+ err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, addr + size);
+ if (!err)
+ err = xsave_wrmsrl(xstate, MSR_IA32_U_CET, CET_SHSTK_EN);
+ end_update_xsave_msrs();
+
+ if (err) {
+ /*
+ * Don't leak the shadow stack if something went wrong writing the
+ * MSRs. Warn about it because things may be in a weird state.
+ */
+ WARN_ON_ONCE(1);
+ unmap_shadow_stack(addr, size);
+ return 1;
+ }
+
+ shstk->base = addr;
+ shstk->size = size;
+ return 0;
+}
+
+void reset_thread_shstk(void)
+{
+ memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
+}
+
+void shstk_free(struct task_struct *tsk)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ !shstk->size ||
+ !shstk->base)
+ return;
+
+ if (!tsk->mm)
+ return;
+
+ unmap_shadow_stack(shstk->base, shstk->size);
+
+ shstk->base = 0;
+ shstk->size = 0;
+}
+
+int shstk_disable(void)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ void *xstate;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ !shstk->size ||
+ !shstk->base)
+ return 1;
+
+ xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
+ err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, 0, CET_SHSTK_EN);
+ if (!err)
+ err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, 0);
+ end_update_xsave_msrs();
+
+ if (err)
+ return 1;
+
+ shstk_free(current);
+ return 0;
+}
--
2.17.1

2022-02-01 15:08:40

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 26/35] x86/process: Change copy_thread() argument 'arg' to 'stack_size'

From: Yu-cheng Yu <[email protected]>

The single call site of copy_thread() passes stack size in 'arg'. To make
this clear, and in preparation for using this argument for shadow stack
allocation, change 'arg' to 'stack_size'. No functional changes.

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/kernel/process.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 81d8ef036637..82a816178e7f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -130,8 +130,9 @@ static int set_new_tls(struct task_struct *p, unsigned long tls)
return do_set_thread_area_64(p, ARCH_SET_FS, tls);
}

-int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
- struct task_struct *p, unsigned long tls)
+int copy_thread(unsigned long clone_flags, unsigned long sp,
+ unsigned long stack_size, struct task_struct *p,
+ unsigned long tls)
{
struct inactive_task_frame *frame;
struct fork_frame *fork_frame;
@@ -175,7 +176,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
if (unlikely(p->flags & PF_KTHREAD)) {
p->thread.pkru = pkru_get_init_value();
memset(childregs, 0, sizeof(struct pt_regs));
- kthread_frame_init(frame, sp, arg);
+ kthread_frame_init(frame, sp, stack_size);
return 0;
}

@@ -208,7 +209,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
*/
childregs->sp = 0;
childregs->ip = 0;
- kthread_frame_init(frame, sp, arg);
+ kthread_frame_init(frame, sp, stack_size);
return 0;
}

--
2.17.1

2022-02-01 15:08:43

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 34/35] x86/cet/shstk: Support wrss for userspace

For the current shadow stack implementation, shadow stack contents cannot
be arbitrarily provisioned with data. This property helps apps protect
themselves better, but also restricts any potential apps that may want to
do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, wrss, which
can be enabled to write directly to shadow stack permissioned memory from
userspace. Allow it to get enabled via the prctl interface.

Only enable the userspace wrss instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

Prevent shadow stacks from becoming executable to assist apps that want
W^X enforced. Add an arch_validate_flags() implementation to handle the
check. Rename the uapi/asm/mman.h header guard to be able to use it for
arch/x86/include/asm/mman.h where the arch_validate_flags() will be.

From a fault handler perspective, WRSS will behave very similarly to WRUSS,
which is treated like a user access from a PF err code perspective.
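
As a rough sketch, enabling WRSS from userspace could look like the below.
This is illustrative only: the ARCH_X86_FEATURE_ENABLE option name is an
assumption here (the real prctl interface is defined earlier in the
series), and the feature bit values are copied from the uapi header.

/* Hypothetical sketch: enable shadow stack and WRSS in one arch_prctl() */
#include <sys/syscall.h>
#include <unistd.h>

#define LINUX_X86_FEATURE_SHSTK	0x00000002
#define LINUX_X86_FEATURE_WRSS	0x00000010

static int enable_shstk_wrss(void)
{
	/*
	 * The enable path turns on shadow stack before WRSS, so both
	 * bits can be requested together. ARCH_X86_FEATURE_ENABLE is
	 * an assumed option name, not defined in this patch.
	 */
	return syscall(SYS_arch_prctl, ARCH_X86_FEATURE_ENABLE,
		       LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);
}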

Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- New patch.

arch/x86/include/asm/cet.h | 3 +++
arch/x86/include/asm/mman.h | 5 ++++-
arch/x86/include/uapi/asm/prctl.h | 2 +-
arch/x86/kernel/elf_feature_prctl.c | 6 +++++
arch/x86/kernel/shstk.c | 35 ++++++++++++++++++++++++++++-
5 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index cbc7cfcba5dc..c8ff0bd5f5bc 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -10,6 +10,7 @@ struct task_struct;
struct thread_shstk {
u64 base;
u64 size;
+ bool wrss;
};

#ifdef CONFIG_X86_SHADOW_STACK
@@ -19,6 +20,7 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
void shstk_free(struct task_struct *p);
int shstk_disable(void);
void reset_thread_shstk(void);
+int wrss_control(bool enable);
int shstk_setup_rstor_token(bool proc32, unsigned long restorer,
unsigned long *new_ssp);
int shstk_check_rstor_token(bool proc32, unsigned long *new_ssp);
@@ -32,6 +34,7 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
static inline void shstk_free(struct task_struct *p) {}
static inline int shstk_disable(void) { return 1; }
static inline void reset_thread_shstk(void) {}
+static inline int wrss_control(bool enable) { return 1; }
static inline int shstk_setup_rstor_token(bool proc32, unsigned long restorer,
unsigned long *new_ssp) { return 0; }
static inline int shstk_check_rstor_token(bool proc32,
diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
index b44fe31deb3a..c05951a36d93 100644
--- a/arch/x86/include/asm/mman.h
+++ b/arch/x86/include/asm/mman.h
@@ -8,7 +8,10 @@
#ifdef CONFIG_X86_SHADOW_STACK
static inline bool arch_validate_flags(unsigned long vm_flags)
{
- if ((vm_flags & VM_SHADOW_STACK) && (vm_flags & VM_WRITE))
+ /*
+ * Shadow stack must not be executable, to help with W^X due to wrss.
+ */
+ if ((vm_flags & VM_SHADOW_STACK) && (vm_flags & (VM_WRITE | VM_EXEC)))
return false;

return true;
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index aa294c7bcf41..210976925325 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -28,6 +28,6 @@
/* x86 feature bits to be used with ARCH_X86_FEATURE arch_prctl()s */
#define LINUX_X86_FEATURE_IBT 0x00000001
#define LINUX_X86_FEATURE_SHSTK 0x00000002
-
+#define LINUX_X86_FEATURE_WRSS 0x00000010

#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/elf_feature_prctl.c b/arch/x86/kernel/elf_feature_prctl.c
index 47de201db3f7..ecad6ebeb4dd 100644
--- a/arch/x86/kernel/elf_feature_prctl.c
+++ b/arch/x86/kernel/elf_feature_prctl.c
@@ -21,6 +21,8 @@ static int elf_feat_copy_status_to_user(struct thread_shstk *shstk, u64 __user *
buf[1] = shstk->base;
buf[2] = shstk->size;
}
+ if (shstk->wrss)
+ buf[0] |= LINUX_X86_FEATURE_WRSS;

return copy_to_user(ubuf, buf, sizeof(buf));
}
@@ -40,6 +42,8 @@ int prctl_elf_feature(int option, u64 arg2)
if (arg2 & thread->feat_prctl_locked)
return -EPERM;

+ if (arg2 & LINUX_X86_FEATURE_WRSS && !wrss_control(false))
+ feat_succ |= LINUX_X86_FEATURE_WRSS;
if (arg2 & LINUX_X86_FEATURE_SHSTK && !shstk_disable())
feat_succ |= LINUX_X86_FEATURE_SHSTK;

@@ -52,6 +56,8 @@ int prctl_elf_feature(int option, u64 arg2)

if (arg2 & LINUX_X86_FEATURE_SHSTK && !shstk_setup())
feat_succ |= LINUX_X86_FEATURE_SHSTK;
+ if (arg2 & LINUX_X86_FEATURE_WRSS && !wrss_control(true))
+ feat_succ |= LINUX_X86_FEATURE_WRSS;

if (feat_succ != arg2)
return -ECANCELED;
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 53be5d5539d4..92612236b4ef 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -230,6 +230,36 @@ void shstk_free(struct task_struct *tsk)
shstk->size = 0;
}

+int wrss_control(bool enable)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ void *xstate;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return 1;
+ /*
+ * Only enable wrss if shadow stack is enabled. If shadow stack is not
+ * enabled, wrss will already be disabled, so don't bother clearing it
+ * when disabling.
+ */
+ if (!shstk->size || shstk->wrss == enable)
+ return 1;
+
+ xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
+ if (enable)
+ err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, CET_WRSS_EN, 0);
+ else
+ err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, 0, CET_WRSS_EN);
+ end_update_xsave_msrs();
+
+ if (err)
+ return 1;
+
+ shstk->wrss = enable;
+ return 0;
+}
+
int shstk_disable(void)
{
struct thread_shstk *shstk = &current->thread.shstk;
@@ -242,7 +272,9 @@ int shstk_disable(void)
return 1;

xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
- err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, 0, CET_SHSTK_EN);
+ /* Disable WRSS too when disabling shadow stack */
+ err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, 0,
+ CET_SHSTK_EN | CET_WRSS_EN);
if (!err)
err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, 0);
end_update_xsave_msrs();
@@ -251,6 +283,7 @@ int shstk_disable(void)
return 1;

shstk_free(current);
+ shstk->wrss = 0;
return 0;
}

--
2.17.1

2022-02-01 15:08:47

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 27/35] x86/fpu: Add unsafe xsave buffer helpers

CET will need to modify the xsave buffer of a new FPU that was just
created in the process of copying a thread. In this case the normal
helpers will not work, because they operate on the current thread's FPU.

So add unsafe helpers to allow for this kind of modification. Make the
unsafe helpers operate on the MSR like the safe helpers for symmetry and
to avoid exposing the underlying xsave structures. Don't add a read
helper because it is not needed at this time.
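
For illustration, the intended call pattern looks roughly like the below
(this mirrors the thread shadow stack allocation later in this series;
'tsk', 'addr' and 'stack_size' are stand-ins for the caller's state):

/*
 * Sketch: 'tsk' was just created by copy_thread() and cannot run or
 * be scheduled yet, so touching its xsave buffer directly is safe.
 */
void *xstate = get_xsave_buffer_unsafe(&tsk->thread.fpu, XFEATURE_CET_USER);

if (WARN_ON_ONCE(!xstate))
	return -EINVAL;

/* Point the new thread's shadow stack pointer at its own stack */
xsave_wrmsrl_unsafe(xstate, MSR_IA32_PL3_SSP, (u64)(addr + stack_size));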

Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/fpu/api.h | 9 ++++++---
arch/x86/kernel/fpu/xstate.c | 27 ++++++++++++++++++++++-----
2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 6aec27984b62..5cb557b9d118 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -167,7 +167,10 @@ extern long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long

void *start_update_xsave_msrs(int xfeature_nr);
void end_update_xsave_msrs(void);
-int xsave_rdmsrl(void *state, unsigned int msr, unsigned long long *p);
-int xsave_wrmsrl(void *state, u32 msr, u64 val);
-int xsave_set_clear_bits_msrl(void *state, u32 msr, u64 set, u64 clear);
+int xsave_rdmsrl(void *xstate, unsigned int msr, unsigned long long *p);
+int xsave_wrmsrl(void *xstate, u32 msr, u64 val);
+int xsave_set_clear_bits_msrl(void *xstate, u32 msr, u64 set, u64 clear);
+
+void *get_xsave_buffer_unsafe(struct fpu *fpu, int xfeature_nr);
+int xsave_wrmsrl_unsafe(void *xstate, u32 msr, u64 val);
#endif /* _ASM_X86_FPU_API_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 25b1b0c417fd..71b08026474c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1881,6 +1881,17 @@ static u64 *__get_xsave_member(void *xstate, u32 msr)
}
}

+/*
+ * Operate on the xsave buffer directly. It makes no guarantees that the
+ * buffer will stay valid now or in the future. This function is pretty
+ * much only useful when the caller knows the fpu's thread can't be
+ * scheduled or otherwise operated on concurrently.
+ */
+void *get_xsave_buffer_unsafe(struct fpu *fpu, int xfeature_nr)
+{
+ return get_xsave_addr(&fpu->fpstate->regs.xsave, xfeature_nr);
+}
+
/*
* Return a pointer to the xstate for the feature if it should be used, or NULL
* if the MSRs should be written to directly. To do this safely, using the
@@ -1971,14 +1982,11 @@ int xsave_rdmsrl(void *xstate, unsigned int msr, unsigned long long *p)
return 0;
}

-int xsave_wrmsrl(void *xstate, u32 msr, u64 val)
+
+int xsave_wrmsrl_unsafe(void *xstate, u32 msr, u64 val)
{
u64 *member_ptr;

- __xsave_msrl_prepare_write();
- if (!xstate)
- return wrmsrl_safe(msr, val);
-
member_ptr = __get_xsave_member(xstate, msr);
if (!member_ptr)
return 1;
@@ -1988,6 +1996,15 @@ int xsave_wrmsrl(void *xstate, u32 msr, u64 val)
return 0;
}

+int xsave_wrmsrl(void *xstate, u32 msr, u64 val)
+{
+ __xsave_msrl_prepare_write();
+ if (!xstate)
+ return wrmsrl_safe(msr, val);
+
+ return xsave_wrmsrl_unsafe(xstate, msr, val);
+}
+
int xsave_set_clear_bits_msrl(void *xstate, u32 msr, u64 set, u64 clear)
{
u64 val, new_val;
--
2.17.1

2022-02-01 15:08:50

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 35/35] x86/cpufeatures: Limit shadow stack to Intel CPUs

Shadow stack is supported on newer AMD processors, but the kernel
implementation has not been tested on them. Prevent basic issues from
showing up for normal users by disabling shadow stack on all CPUs except
Intel until it has been tested, at which point the limitation should be
removed.

Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- New patch.

arch/x86/kernel/cpu/common.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9ee339f5b8ca..7fbfe707a1db 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -517,6 +517,14 @@ __setup("nopku", setup_disable_pku);

static __always_inline void setup_cet(struct cpuinfo_x86 *c)
{
+ /*
+ * Shadow stack is supported on AMD processors, but has not been
+ * tested. Only support it on Intel processors until this is done,
+ * at which point this vendor check should be removed.
+ */
+ if (c->x86_vendor != X86_VENDOR_INTEL)
+ setup_clear_cpu_cap(X86_FEATURE_SHSTK);
+
if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
return;

--
2.17.1

2022-02-01 15:08:53

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 24/35] mm: Re-introduce vm_flags to do_mmap()

From: Yu-cheng Yu <[email protected]>

There were no more callers passing vm_flags to do_mmap(), and vm_flags was
removed from the function's input by:

commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap(). Thus, re-introduce vm_flags to do_mmap().

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Peter Collingbourne <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
---
fs/aio.c | 2 +-
include/linux/mm.h | 3 ++-
ipc/shm.c | 2 +-
mm/mmap.c | 10 +++++-----
mm/nommu.c | 4 ++--
mm/util.c | 2 +-
6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 4ceba13a7db0..a24618e0e3fc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -554,7 +554,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)

ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
PROT_READ | PROT_WRITE,
- MAP_SHARED, 0, &unused, NULL);
+ MAP_SHARED, 0, 0, &unused, NULL);
mmap_write_unlock(mm);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e125358d7f75..481e1271409f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2689,7 +2689,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
struct list_head *uf);
extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
- unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+ vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf);
extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf, bool downgrade);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
diff --git a/ipc/shm.c b/ipc/shm.c
index b3048ebd5c31..f236b3e14ec4 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1646,7 +1646,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
goto invalid;
}

- addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+ addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
*raddr = addr;
err = 0;
if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 9bab326332af..9c82a1b02cfc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1410,11 +1410,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
*/
unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
- unsigned long flags, unsigned long pgoff,
- unsigned long *populate, struct list_head *uf)
+ unsigned long flags, vm_flags_t vm_flags,
+ unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf)
{
struct mm_struct *mm = current->mm;
- vm_flags_t vm_flags;
int pkey = 0;

*populate = 0;
@@ -1474,7 +1474,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
- vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

if (flags & MAP_LOCKED)
@@ -3011,7 +3011,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,

file = get_file(vma->vm_file);
ret = do_mmap(vma->vm_file, start, size,
- prot, flags, pgoff, &populate, NULL);
+ prot, flags, 0, pgoff, &populate, NULL);
fput(file);
out:
mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 55a9e48a7a02..a6e0243cd69b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1057,6 +1057,7 @@ unsigned long do_mmap(struct file *file,
unsigned long len,
unsigned long prot,
unsigned long flags,
+ vm_flags_t vm_flags,
unsigned long pgoff,
unsigned long *populate,
struct list_head *uf)
@@ -1064,7 +1065,6 @@ unsigned long do_mmap(struct file *file,
struct vm_area_struct *vma;
struct vm_region *region;
struct rb_node *rb;
- vm_flags_t vm_flags;
unsigned long capabilities, result;
int ret;

@@ -1083,7 +1083,7 @@ unsigned long do_mmap(struct file *file,

/* we've determined that we can make the mapping, now translate what we
* now know into VMA flags */
- vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+ vm_flags |= determine_vm_flags(file, prot, flags, capabilities);

/* we're going to need to record the mapping */
region = kmem_cache_zalloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/util.c b/mm/util.c
index 7e43369064c8..d419821364cc 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -516,7 +516,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
if (!ret) {
if (mmap_write_lock_killable(mm))
return -EINTR;
- ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+ ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
&uf);
mmap_write_unlock(mm);
userfaultfd_unmap_complete(mm, &uf);
--
2.17.1

2022-02-01 15:09:15

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 32/35] x86/cet/shstk: Introduce map_shadow_stack syscall

When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace to allocate and
pivot to userspace-managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be set up
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace: how can it provision this special data without
allowing the shadow stack to be generally writable?

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory is mapped
with shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple of downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
restore tokens being written into the middle of pre-used shadow stacks.
It is ideal to prevent restore tokens being added at arbitrary
locations, so the check was to make sure the shadow stack had never been
written to.
3. It stood out from the rest of the madvise flags, as more of a direct
action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and set up new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
set up its own shadow stacks using the WRSS instruction. Toward this,
provide a flag so that stacks can optionally be set up securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shadow_stack = map_shadow_stack(stack_size, SHADOW_STACK_SET_TOKEN);
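
A slightly fuller sketch, including the pivot, might look like the below.
This assumes the restore token lands 8 bytes below the requested size
(per the create_rstor_token() placement) and uses raw inline asm for
RSTORSSP/SAVEPREVSSP rather than any library wrapper:

unsigned long size = 0x200000;
long ret = syscall(__NR_map_shadow_stack, size, SHADOW_STACK_SET_TOKEN);
void *ssp;

if (ret < 0)
	return -1;
ssp = (void *)ret;

/* RSTORSSP consumes the token; SAVEPREVSSP saves the old SSP */
asm volatile("rstorssp (%0)\n\t"
	     "saveprevssp"
	     : : "r" ((char *)ssp + size - 8) : "memory");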

Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- New patch (replaces PROT_SHADOW_STACK).

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/uapi/asm/mman.h | 2 ++
arch/x86/kernel/shstk.c | 39 +++++++++++++++++++++++---
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 2 +-
kernel/sys_ni.c | 1 +
7 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..68106c12937f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
448 i386 process_mrelease sys_process_mrelease
449 i386 futex_waitv sys_futex_waitv
450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
+451 i386 map_shadow_stack sys_map_shadow_stack
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..d9639e3e0a33 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common map_shadow_stack sys_map_shadow_stack

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
index 9704e27c4d24..dd4e8405e189 100644
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -26,6 +26,8 @@
((key) & 0x8 ? VM_PKEY_BIT3 : 0))
#endif

+#define SHADOW_STACK_SET_TOKEN 0x1 /* Set up a restore token in the shadow stack */
+
#include <asm-generic/mman.h>

#endif /* _UAPI_ASM_X86_MMAN_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index f330be17e2d1..53be5d5539d4 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -15,6 +15,7 @@
#include <linux/compat.h>
#include <linux/sizes.h>
#include <linux/user.h>
+#include <linux/syscalls.h>
#include <asm/msr.h>
#include <asm/fpu/internal.h>
#include <asm/fpu/xstate.h>
@@ -45,12 +46,14 @@ static int create_rstor_token(bool proc32, unsigned long ssp,
if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
return -EFAULT;

- *token_addr = addr;
+ if (token_addr)
+ *token_addr = addr;

return 0;
}

-static unsigned long alloc_shstk(unsigned long size)
+static unsigned long alloc_shstk(unsigned long size, unsigned long token_offset,
+ bool set_res_tok)
{
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
struct mm_struct *mm = current->mm;
@@ -61,6 +64,15 @@ static unsigned long alloc_shstk(unsigned long size)
&unused, NULL);
mmap_write_unlock(mm);

+ if (!set_res_tok || IS_ERR_VALUE(addr))
+ goto out;
+
+ if (create_rstor_token(in_ia32_syscall(), addr + token_offset, NULL)) {
+ vm_munmap(addr, size);
+ return -EINVAL;
+ }
+
+out:
return addr;
}

@@ -103,7 +115,7 @@ int shstk_setup(void)
return 1;

size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
- addr = alloc_shstk(size);
+ addr = alloc_shstk(size, size, false);
if (IS_ERR_VALUE(addr))
return 1;

@@ -181,7 +193,7 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
return -EINVAL;

stack_size = PAGE_ALIGN(stack_size);
- addr = alloc_shstk(stack_size);
+ addr = alloc_shstk(stack_size, stack_size, false);
if (IS_ERR_VALUE(addr)) {
shstk->base = 0;
shstk->size = 0;
@@ -380,3 +392,22 @@ int restore_signal_shadow_stack(void)

return err;
}
+
+SYSCALL_DEFINE2(map_shadow_stack, unsigned long, size, unsigned int, flags)
+{
+ unsigned long aligned_size;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return -ENOSYS;
+
+ /*
+ * An overflow would result in attempting to write the restore token
+ * to the wrong location. Not catastrophic, but just return the right
+ * error code and block it.
+ */
+ aligned_size = PAGE_ALIGN(size);
+ if (aligned_size < size)
+ return -EOVERFLOW;
+
+ return alloc_shstk(aligned_size, size, flags & SHADOW_STACK_SET_TOKEN);
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 819c0cb00b6d..11220c40b26a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1060,6 +1060,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
unsigned long home_node,
unsigned long flags);
+asmlinkage long sys_map_shadow_stack(unsigned long size, unsigned int flags);

/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c48b0ae3ba3..41112fdd3b66 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -887,7 +887,7 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)

#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452

/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a492f159624f..16a6e1a57c2b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -380,6 +380,7 @@ COND_SYSCALL(vm86old);
COND_SYSCALL(modify_ldt);
COND_SYSCALL(vm86);
COND_SYSCALL(kexec_file_load);
+COND_SYSCALL(map_shadow_stack);

/* s390 */
COND_SYSCALL(s390_pci_mmio_read);
--
2.17.1

2022-02-01 15:10:43

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 16/35] x86/mm: Update maybe_mkwrite() for shadow stack

From: Yu-cheng Yu <[email protected]>

When serving a page fault, maybe_mkwrite() makes a PTE writable if its vma
has VM_WRITE.

A shadow stack vma has VM_SHADOW_STACK. Its PTEs have _PAGE_DIRTY, but not
_PAGE_WRITE. In fork(), _PAGE_DIRTY is cleared to cause copy-on-write,
and in the page fault handler, _PAGE_DIRTY is restored and the shadow stack
page is writable again.

Introduce an x86 version of maybe_mkwrite(), which sets proper PTE bits
according to VM flags.

Apply the same changes to maybe_pmd_mkwrite().

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

Yu-cheng v29:
- Remove likely()'s.

arch/x86/include/asm/pgtable.h | 6 ++++++
arch/x86/mm/pgtable.c | 20 ++++++++++++++++++++
include/linux/mm.h | 2 ++
mm/huge_memory.c | 2 ++
4 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e1061b9cba6a..36166bdd0b98 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -282,6 +282,9 @@ static inline int pmd_trans_huge(pmd_t pmd)
return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

+#define maybe_pmd_mkwrite maybe_pmd_mkwrite
+extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static inline int pud_trans_huge(pud_t pud)
{
@@ -1660,6 +1663,9 @@ static inline bool arch_faults_on_old_pte(void)
return false;
}

+#define maybe_mkwrite maybe_mkwrite
+extern pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma);
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_PGTABLE_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..c22c8e9c37e8 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -610,6 +610,26 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
}
#endif

+pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_WRITE)
+ pte = pte_mkwrite(pte);
+ else if (vma->vm_flags & VM_SHADOW_STACK)
+ pte = pte_mkwrite_shstk(pte);
+ return pte;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_WRITE)
+ pmd = pmd_mkwrite(pmd);
+ else if (vma->vm_flags & VM_SHADOW_STACK)
+ pmd = pmd_mkwrite_shstk(pmd);
+ return pmd;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/**
* reserve_top_address - reserves a hole in the top of kernel address space
* @reserve - size of hole to reserve
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311c6018d503..b3cb3a17037b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -955,12 +955,14 @@ void free_compound_page(struct page *page);
* pte_mkwrite. But get_user_pages can cause write faults for mappings
* that do not have writing enabled, when used by access_process_vm.
*/
+#ifndef maybe_mkwrite
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
}
+#endif

vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..2adedcfca00b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -491,12 +491,14 @@ static int __init setup_transparent_hugepage(char *str)
}
__setup("transparent_hugepage=", setup_transparent_hugepage);

+#ifndef maybe_pmd_mkwrite
pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
pmd = pmd_mkwrite(pmd);
return pmd;
}
+#endif

#ifdef CONFIG_MEMCG
static inline struct deferred_split *get_deferred_split_queue(struct page *page)
--
2.17.1

2022-02-01 15:10:43

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 12/35] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

From: Yu-cheng Yu <[email protected]>

When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
shadow stack. Copy-on-write PTEs have [R/O + _PAGE_COW].

When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a read-only PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE. However, this will not occur on processors supporting
Shadow Stack, and a TLB flush is not necessary.

The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
atomically, a transient shadow stack PTE can be created as a result.
Thus, prevent that with cmpxchg.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the cmpxchg solution.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

Yu-cheng v30:
- Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELES > 2 (Borislav Petkov).

arch/x86/include/asm/pgtable.h | 38 ++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5c3886f6ccda..e1061b9cba6a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1295,6 +1295,24 @@ static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
+ /*
+ * If Shadow Stack is enabled, pte_wrprotect() moves _PAGE_DIRTY
+ * to _PAGE_COW (see comments at pte_wrprotect()).
+ * When a thread reads a RW=1, Dirty=0 PTE and before changing it
+ * to RW=0, Dirty=0, another thread could have written to the page
+ * and the PTE is RW=1, Dirty=1 now. Use try_cmpxchg() to detect
+ * PTE changes and update old_pte, then try again.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pte_t old_pte, new_pte;
+
+ old_pte = READ_ONCE(*ptep);
+ do {
+ new_pte = pte_wrprotect(old_pte);
+ } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+ return;
+ }
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}

@@ -1347,6 +1365,26 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
{
+#if CONFIG_PGTABLE_LEVELS > 2
+ /*
+ * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+ * to _PAGE_COW (see comments at pmd_wrprotect()).
+ * When a thread reads a RW=1, Dirty=0 PMD and before changing it
+ * to RW=0, Dirty=0, another thread could have written to the page
+ * and the PMD is RW=1, Dirty=1 now. Use try_cmpxchg() to detect
+ * PMD changes and update old_pmd, then try again.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pmd_t old_pmd, new_pmd;
+
+ old_pmd = READ_ONCE(*pmdp);
+ do {
+ new_pmd = pmd_wrprotect(old_pmd);
+ } while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+ return;
+ }
+#endif
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

--
2.17.1

2022-02-01 15:10:44

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 19/35] mm/mmap: Add shadow stack pages to memory accounting

From: Yu-cheng Yu <[email protected]>

Account shadow stack pages to stack memory.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

Yu-cheng v26:
- Remove redundant #ifdef CONFIG_MMU.

Yu-cheng v25:
- Remove #ifdef CONFIG_ARCH_HAS_SHADOW_STACK for is_shadow_stack_mapping().

Yu-cheng v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().
- Change VM_SHSTK to VM_SHADOW_STACK.

arch/x86/include/asm/pgtable.h | 3 +++
arch/x86/mm/pgtable.c | 5 +++++
include/linux/pgtable.h | 8 ++++++++
mm/mmap.c | 5 +++++
4 files changed, 21 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 36166bdd0b98..55641498485c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1666,6 +1666,9 @@ static inline bool arch_faults_on_old_pte(void)
#define maybe_mkwrite maybe_mkwrite
extern pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma);

+#define is_shadow_stack_mapping is_shadow_stack_mapping
+extern bool is_shadow_stack_mapping(vm_flags_t vm_flags);
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_PGTABLE_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index c22c8e9c37e8..61a364b9ae0a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -884,3 +884,8 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)

#endif /* CONFIG_X86_64 */
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
+bool is_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+ return vm_flags & VM_SHADOW_STACK;
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bc8713a76e03..21fdb1273571 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -911,6 +911,14 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
__ptep_modify_prot_commit(vma, addr, ptep, pte);
}
#endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+#ifndef is_shadow_stack_mapping
+static inline bool is_shadow_stack_mapping(vm_flags_t vm_flags)
+{
+ return false;
+}
+#endif
+
#endif /* CONFIG_MMU */

/*
diff --git a/mm/mmap.c b/mm/mmap.c
index 1e8fdb0b51ed..9bab326332af 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1716,6 +1716,9 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
if (file && is_file_hugepages(file))
return 0;

+ if (is_shadow_stack_mapping(vm_flags))
+ return 1;
+
return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
}

@@ -3345,6 +3348,8 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
mm->stack_vm += npages;
else if (is_data_mapping(flags))
mm->data_vm += npages;
+ else if (is_shadow_stack_mapping(flags))
+ mm->stack_vm += npages;
}

static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
--
2.17.1

2022-02-01 15:10:46

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 08/35] x86/mm: Move pmd_write(), pud_write() up in the file

From: Yu-cheng Yu <[email protected]>

To prepare the introduction of _PAGE_COW, move pmd_write() and
pud_write() up in the file, so that they can be used by other
helpers below. No functional changes.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
arch/x86/include/asm/pgtable.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 8a9432fb3802..aff5e489ff17 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,6 +158,18 @@ static inline int pte_write(pte_t pte)
return pte_flags(pte) & _PAGE_RW;
}

+#define pmd_write pmd_write
+static inline int pmd_write(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define pud_write pud_write
+static inline int pud_write(pud_t pud)
+{
+ return pud_flags(pud) & _PAGE_RW;
+}
+
static inline int pte_huge(pte_t pte)
{
return pte_flags(pte) & _PAGE_PSE;
@@ -1116,12 +1128,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);


-#define pmd_write pmd_write
-static inline int pmd_write(pmd_t pmd)
-{
- return pmd_flags(pmd) & _PAGE_RW;
-}
-
#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp)
@@ -1151,12 +1157,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

-#define pud_write pud_write
-static inline int pud_write(pud_t pud)
-{
- return pud_flags(pud) & _PAGE_RW;
-}
-
#ifndef pmdp_establish
#define pmdp_establish pmdp_establish
static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
--
2.17.1

2022-02-01 15:10:49

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 10/35] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS

From: Yu-cheng Yu <[email protected]>

After the introduction of _PAGE_COW, a modified page's PTE can have either
_PAGE_DIRTY or _PAGE_COW. Change _PAGE_DIRTY to _PAGE_DIRTY_BITS.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Zhenyu Wang <[email protected]>
Cc: Zhi Wang <[email protected]>
---
drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
index 99d1781fa5f0..75ce4e823902 100644
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1210,7 +1210,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
}

/* Clear dirty field. */
- se->val64 &= ~_PAGE_DIRTY;
+ se->val64 &= ~_PAGE_DIRTY_BITS;

ops->clear_pse(se);
ops->clear_ips(se);
--
2.17.1

2022-02-01 15:10:51

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 30/35] x86/cet/shstk: Handle signals for shadow stack

From: Yu-cheng Yu <[email protected]>

When a signal is handled normally the context is pushed to the stack
before handling it. For shadow stacks, since the shadow stack only tracks
return addresses, there isn't any state that needs to be pushed. However,
there are still a few things that need to be done. These things are
userspace visible and will be kernel ABI for shadow stacks.

One is to make sure the restorer address is written to the shadow stack,
since the signal handler (if not changing ucontext) returns to the
restorer, and the restorer calls sigreturn. So add the restorer to the
shadow stack before handling the signal, so that there is no conflict when
the signal handler returns to the restorer.

The other thing to do is to place a restore token on the thread's shadow
stack before handling the signal and check it during sigreturn. This
is an extra layer of protection to hamper attackers calling sigreturn
manually as in SROP-like attacks.

So, when handling a signal push
- a shadow stack restore token pointing to the current shadow stack
address
- the restorer address below the restore token.

In sigreturn, verify the restore token and pop the shadow stack.
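
For reference, a sketch of the resulting shadow stack layout just before
the handler runs (higher addresses at the top):

	old SSP ->  [ previous return addresses ...     ]
	            [ restore token (points at old SSP) ]
	new SSP ->  [ restorer address                   ]

When the handler returns, ret pops the restorer address off the shadow
stack in step with the normal stack return; sigreturn then verifies the
token and switches back to the old SSP.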

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
---

v1:
- Use xsave helpers.
- Expand commit log.

Yu-cheng v27:
- Eliminate saving shadow stack pointer to signal context.

Yu-cheng v25:
- Update commit log/comments for the sc_ext struct.
- Use restorer address already calculated.
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
- Eliminate writing to MSR_IA32_U_CET for shadow stack.
- Change wrmsrl() to wrmsrl_safe() and handle error.

arch/x86/ia32/ia32_signal.c | 25 ++++++++++++++++-----
arch/x86/include/asm/cet.h | 4 ++++
arch/x86/kernel/shstk.c | 44 +++++++++++++++++++++++++++++++++++++
arch/x86/kernel/signal.c | 13 +++++++++++
4 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index c9c3859322fa..a8d038409d60 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,6 +34,7 @@
#include <asm/sigframe.h>
#include <asm/sighandling.h>
#include <asm/smap.h>
+#include <asm/cet.h>

static inline void reload_segments(struct sigcontext_32 *sc)
{
@@ -112,6 +113,10 @@ COMPAT_SYSCALL_DEFINE0(sigreturn)

if (!ia32_restore_sigcontext(regs, &frame->sc))
goto badframe;
+
+ if (restore_signal_shadow_stack())
+ goto badframe;
+
return regs->ax;

badframe:
@@ -137,6 +142,9 @@ COMPAT_SYSCALL_DEFINE0(rt_sigreturn)
if (!ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext))
goto badframe;

+ if (restore_signal_shadow_stack())
+ goto badframe;
+
if (compat_restore_altstack(&frame->uc.uc_stack))
goto badframe;

@@ -261,6 +269,9 @@ int ia32_setup_frame(int sig, struct ksignal *ksig,
restorer = &frame->retcode;
}

+ if (setup_signal_shadow_stack(1, restorer))
+ return -EFAULT;
+
if (!user_access_begin(frame, sizeof(*frame)))
return -EFAULT;

@@ -318,6 +329,15 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig,

frame = get_sigframe(ksig, regs, sizeof(*frame), &fp);

+ if (ksig->ka.sa.sa_flags & SA_RESTORER)
+ restorer = ksig->ka.sa.sa_restorer;
+ else
+ restorer = current->mm->context.vdso +
+ vdso_image_32.sym___kernel_rt_sigreturn;
+
+ if (setup_signal_shadow_stack(1, restorer))
+ return -EFAULT;
+
if (!user_access_begin(frame, sizeof(*frame)))
return -EFAULT;

@@ -333,11 +353,6 @@ int ia32_setup_rt_frame(int sig, struct ksignal *ksig,
unsafe_put_user(0, &frame->uc.uc_link, Efault);
unsafe_compat_save_altstack(&frame->uc.uc_stack, regs->sp, Efault);

- if (ksig->ka.sa.sa_flags & SA_RESTORER)
- restorer = ksig->ka.sa.sa_restorer;
- else
- restorer = current->mm->context.vdso +
- vdso_image_32.sym___kernel_rt_sigreturn;
unsafe_put_user(ptr_to_compat(restorer), &frame->pretcode, Efault);

/*
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 6e8a7a807dcc..faff8dc86159 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -22,6 +22,8 @@ void reset_thread_shstk(void);
int shstk_setup_rstor_token(bool proc32, unsigned long restorer,
unsigned long *new_ssp);
int shstk_check_rstor_token(bool proc32, unsigned long *new_ssp);
+int setup_signal_shadow_stack(int proc32, void __user *restorer);
+int restore_signal_shadow_stack(void);
#else
static inline int shstk_setup(void) { return 1; }
static inline int shstk_alloc_thread_stack(struct task_struct *p,
@@ -34,6 +36,8 @@ static inline int shstk_setup_rstor_token(bool proc32, unsigned long restorer,
unsigned long *new_ssp) { return 0; }
static inline int shstk_check_rstor_token(bool proc32,
unsigned long *new_ssp) { return 0; }
+static inline int setup_signal_shadow_stack(int proc32, void __user *restorer) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
#endif /* CONFIG_X86_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index e0caab50ca77..682d85a63a1d 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -335,3 +335,47 @@ int shstk_check_rstor_token(bool proc32, unsigned long *new_ssp)

return 0;
}
+
+int setup_signal_shadow_stack(int proc32, void __user *restorer)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ unsigned long new_ssp;
+ void *xstate;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || !shstk->size)
+ return 0;
+
+ err = shstk_setup_rstor_token(proc32, (unsigned long)restorer,
+ &new_ssp);
+ if (err)
+ return err;
+
+ xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
+ err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, new_ssp);
+ end_update_xsave_msrs();
+
+ return err;
+}
+
+int restore_signal_shadow_stack(void)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ void *xstate;
+ int proc32 = in_ia32_syscall();
+ unsigned long new_ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || !shstk->size)
+ return 0;
+
+ err = shstk_check_rstor_token(proc32, &new_ssp);
+ if (err)
+ return err;
+
+ xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
+ err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, new_ssp);
+ end_update_xsave_msrs();
+
+ return err;
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ec71e06ae364..e6202fc2a56c 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -48,6 +48,7 @@
#include <asm/syscall.h>
#include <asm/sigframe.h>
#include <asm/signal.h>
+#include <asm/cet.h>

#ifdef CONFIG_X86_64
/*
@@ -471,6 +472,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp);
uc_flags = frame_uc_flags(regs);

+ if (setup_signal_shadow_stack(0, ksig->ka.sa.sa_restorer))
+ return -EFAULT;
+
if (!user_access_begin(frame, sizeof(*frame)))
return -EFAULT;

@@ -576,6 +580,9 @@ static int x32_setup_rt_frame(struct ksignal *ksig,

uc_flags = frame_uc_flags(regs);

+ if (setup_signal_shadow_stack(0, ksig->ka.sa.sa_restorer))
+ return -EFAULT;
+
if (!user_access_begin(frame, sizeof(*frame)))
return -EFAULT;

@@ -674,6 +681,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
goto badframe;

+ if (restore_signal_shadow_stack())
+ goto badframe;
+
if (restore_altstack(&frame->uc.uc_stack))
goto badframe;

@@ -991,6 +1001,9 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
goto badframe;

+ if (restore_signal_shadow_stack())
+ goto badframe;
+
if (compat_restore_altstack(&frame->uc.uc_stack))
goto badframe;

--
2.17.1

2022-02-01 15:11:06

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 28/35] x86/cet/shstk: Handle thread shadow stack

From: Yu-cheng Yu <[email protected]>

When a process is duplicated, but the child shares the address space with
the parent, there is potential for the threads sharing a single stack to
cause conflicts for each other. In the normal non-CET case this is handled
in two ways.

With regular CLONE_VM a new stack is provided by userspace such that the
parent and child have different stacks.

For vfork, the parent is suspended until the child exits. So as long as
the child doesn't return from the vfork()/CLONE_VFORK calling function and
sticks to a limited set of operations, the parent and child can share the
same stack.

For shadow stack, these scenarios present similar sharing problems. For the
CLONE_VM case, the child and the parent must have separate shadow stacks.
Instead of changing clone to take a shadow stack, have the kernel just
allocate one and switch to it.

Use stack_size passed from clone3() syscall for thread shadow stack size. A
compat-mode thread shadow stack size is further reduced to 1/4. This
allows more threads to run in a 32-bit address space. clone() does not
pass stack_size, which was added in clone3(). In that case, use the
RLIMIT_STACK size and cap it to 4 GB.

For shadow stack enabled vfork(), the parent and child can share the same
shadow stack, like they can share a normal stack. Since the parent is
suspended until the child terminates, the child will not interfere with
the parent while executing as long as it doesn't return from the vfork()
and overwrite up the shadow stack. The child can safely overwrite down
the shadow stack, as the parent can just overwrite this later. So CET does
not add any additional limitations for vfork().

Userspace implementing posix vfork() can actually prevent the child from
returning from the vfork() calling function, using CET. Glibc does this
by adjusting the shadow stack pointer in the child, so that the child
receives a #CP if it tries to return from the vfork() calling function.

Free the shadow stack on thread exit by doing it in mm_release(). Skip
this when exiting a vfork() child since the stack is shared in the
parent.
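
From the userspace side, a hedged sketch of a clone3() call supplying
stack_size, which the kernel here also uses (page-aligned, and quartered
in compat mode) to size the new thread's shadow stack:

#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args */
#include <sys/syscall.h>
#include <unistd.h>

static pid_t spawn_thread(void *stack, unsigned long stack_size)
{
	struct clone_args args = {
		.flags      = CLONE_VM | CLONE_FS | CLONE_FILES |
			      CLONE_SIGHAND | CLONE_THREAD,
		.stack      = (unsigned long long)stack,
		.stack_size = stack_size,	/* also sizes the shadow stack */
	};

	return syscall(__NR_clone3, &args, sizeof(args));
}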

Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- Expand commit log.
- Add more comments.
- Switch to xsave helpers.

Yu-cheng v30:
- Update comments about clone()/clone3(). (Borislav Petkov)

Yu-cheng v29:
- WARN_ON_ONCE() when get_xsave_addr() returns NULL, and update comments.
(Dave Hansen)

Yu-cheng v28:
- Split out copy_thread() argument name changes to a new patch.
- Add compatibility for earlier clone(), which does not pass stack_size.
- Add comment for get_xsave_addr(), explain the handling of null return
value.

arch/x86/include/asm/cet.h | 5 +++
arch/x86/include/asm/mmu_context.h | 2 +
arch/x86/kernel/process.c | 6 +++
arch/x86/kernel/shstk.c | 68 +++++++++++++++++++++++++++++-
4 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index de90e4ae083a..63ee8b45080d 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -14,11 +14,16 @@ struct thread_shstk {

#ifdef CONFIG_X86_SHADOW_STACK
int shstk_setup(void);
+int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
+ unsigned long stack_size);
void shstk_free(struct task_struct *p);
int shstk_disable(void);
void reset_thread_shstk(void);
#else
static inline int shstk_setup(void) { return 1; }
+static inline int shstk_alloc_thread_stack(struct task_struct *p,
+ unsigned long clone_flags,
+ unsigned long stack_size) { return 0; }
static inline void shstk_free(struct task_struct *p) {}
static inline int shstk_disable(void) { return 1; }
static inline void reset_thread_shstk(void) {}
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 27516046117a..8e721d2c45d5 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -146,6 +146,8 @@ do { \
#else
#define deactivate_mm(tsk, mm) \
do { \
+ if (!tsk->vfork_done) \
+ shstk_free(tsk); \
load_gs_index(0); \
loadsegment(fs, 0); \
} while (0)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 82a816178e7f..0fbcf33255fa 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,7 @@
#include <asm/proto.h>
#include <asm/frame.h>
#include <asm/unwind.h>
+#include <asm/cet.h>

#include "process.h"

@@ -117,6 +118,7 @@ void exit_thread(struct task_struct *tsk)

free_vm86(t);

+ shstk_free(tsk);
fpu__drop(fpu);
}

@@ -217,6 +219,10 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
if (clone_flags & CLONE_SETTLS)
ret = set_new_tls(p, tls);

+ /* Allocate a new shadow stack for pthread */
+ if (!ret)
+ ret = shstk_alloc_thread_stack(p, clone_flags, stack_size);
+
if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
io_bitmap_share(p);

diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 4e8686ed885f..358f24e806cc 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -106,6 +106,66 @@ void reset_thread_shstk(void)
memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
}

+int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
+ unsigned long stack_size)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+ unsigned long addr;
+ void *xstate;
+
+ /*
+ * If shadow stack is not enabled on the new thread, skip any
+ * switch to a new shadow stack.
+ */
+ if (!shstk->size)
+ return 0;
+
+ /*
+ * clone() does not pass stack_size, which was added to clone3().
+ * Use RLIMIT_STACK and cap to 4 GB.
+ */
+ if (!stack_size)
+ stack_size = min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G);
+
+ /*
+ * For CLONE_VM, except vfork, the child needs a separate shadow
+ * stack.
+ */
+ if ((clone_flags & (CLONE_VFORK | CLONE_VM)) != CLONE_VM)
+ return 0;
+
+
+ /*
+ * Compat-mode pthreads share a limited address space.
+ * If each function call takes an average of four slots of
+ * stack space, allocate 1/4 of the stack size for the shadow stack.
+ */
+ if (in_compat_syscall())
+ stack_size /= 4;
+
+ /*
+ * 'tsk' is configured with a shadow stack and the fpu.state is
+ * up to date since it was just copied from the parent. There
+ * must be a valid non-init CET state location in the buffer.
+ */
+ xstate = get_xsave_buffer_unsafe(&tsk->thread.fpu, XFEATURE_CET_USER);
+ if (WARN_ON_ONCE(!xstate))
+ return -EINVAL;
+
+ stack_size = PAGE_ALIGN(stack_size);
+ addr = alloc_shstk(stack_size);
+ if (IS_ERR_VALUE(addr)) {
+ shstk->base = 0;
+ shstk->size = 0;
+ return PTR_ERR((void *)addr);
+ }
+
+ xsave_wrmsrl_unsafe(xstate, MSR_IA32_PL3_SSP, (u64)(addr + stack_size));
+ shstk->base = addr;
+ shstk->size = stack_size;
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
@@ -115,7 +175,13 @@ void shstk_free(struct task_struct *tsk)
!shstk->base)
return;

- if (!tsk->mm)
+ /*
+ * When fork() with CLONE_VM fails, the child (tsk) already has a
+ * shadow stack allocated, and exit_thread() calls this function to
+ * free it. In this case the parent (current) and the child share
+ * the same mm struct.
+ */
+ if (!tsk->mm || tsk->mm != current->mm)
return;

unmap_shadow_stack(shstk->base, shstk->size);
--
2.17.1

2022-02-01 15:11:11

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH 33/35] selftests/x86: Add map_shadow_stack syscall test

Add a simple selftest for exercising the new map_shadow_stack syscall.

Co-developed-by: Yu, Yu-cheng <[email protected]>
Signed-off-by: Yu, Yu-cheng <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v1:
- New patch.

tools/testing/selftests/x86/Makefile | 9 ++-
.../selftests/x86/test_map_shadow_stack.c | 75 +++++++++++++++++++
2 files changed, 83 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 8a1f62ab3c8e..9114943336f9 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -9,11 +9,13 @@ UNAME_M := $(shell uname -m)
CAN_BUILD_I386 := $(shell ./check_cc.sh $(CC) trivial_32bit_program.c -m32)
CAN_BUILD_X86_64 := $(shell ./check_cc.sh $(CC) trivial_64bit_program.c)
CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
+CAN_BUILD_WITH_SHSTK := $(shell ./check_cc.sh $(CC) trivial_program.c -mshstk -fcf-protection)

TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
test_vsyscall mov_ss_trap \
- syscall_arg_fault fsgsbase_restore sigaltstack
+ syscall_arg_fault fsgsbase_restore sigaltstack \
+ test_map_shadow_stack
TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
@@ -105,3 +107,8 @@ $(OUTPUT)/test_syscall_vdso_32: thunks_32.S
# state.
$(OUTPUT)/check_initial_reg_state_32: CFLAGS += -Wl,-ereal_start -static
$(OUTPUT)/check_initial_reg_state_64: CFLAGS += -Wl,-ereal_start -static
+
+ifeq ($(CAN_BUILD_WITH_SHSTK),1)
+$(OUTPUT)/test_map_shadow_stack_64: CFLAGS += -mshstk -fcf-protection
+$(OUTPUT)/test_map_shadow_stack_32: CFLAGS += -mshstk -fcf-protection
+endif
\ No newline at end of file
diff --git a/tools/testing/selftests/x86/test_map_shadow_stack.c b/tools/testing/selftests/x86/test_map_shadow_stack.c
new file mode 100644
index 000000000000..dfd94ef0176d
--- /dev/null
+++ b/tools/testing/selftests/x86/test_map_shadow_stack.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <sys/syscall.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <x86intrin.h>
+
+#define SS_SIZE 0x200000
+
+void *create_shstk(void)
+{
+ return (void *)syscall(__NR_map_shadow_stack, SS_SIZE, SHADOW_STACK_SET_TOKEN);
+}
+
+#if (__GNUC__ < 8) || (__GNUC__ == 8 && __GNUC_MINOR__ < 5)
+int main(int argc, char *argv[])
+{
+ printf("SKIP: compiler does not support CET.");
+ return 0;
+}
+#else
+void try_shstk(unsigned long new_ssp)
+{
+ unsigned long ssp0, ssp1;
+
+ printf("pid=%d\n", getpid());
+ printf("new_ssp = %lx, *new_ssp = %lx\n",
+ new_ssp, *((unsigned long *)new_ssp));
+
+ ssp0 = _get_ssp();
+ printf("changing ssp from %lx to %lx\n", ssp0, new_ssp);
+
+ /* Make sure ssp0 is aligned to 8 bytes */
+ if ((ssp0 & 0xf) != 0)
+ ssp0 &= -8;
+
+ asm volatile("rstorssp (%0)\n":: "r" (new_ssp));
+ asm volatile("saveprevssp");
+ ssp1 = _get_ssp();
+ printf("ssp is now %lx\n", ssp1);
+
+ ssp0 -= 8;
+ asm volatile("rstorssp (%0)\n":: "r" (ssp0));
+ asm volatile("saveprevssp");
+}
+
+int main(int argc, char *argv[])
+{
+ void *shstk;
+
+ if (!_get_ssp()) {
+ printf("SKIP: shadow stack disabled.");
+ return 0;
+ }
+
+ shstk = create_shstk();
+ if (shstk == MAP_FAILED) {
+ printf("FAIL: Error creating shadow stack: %d\n", errno);
+ return 1;
+ }
+ try_shstk((unsigned long)shstk + SS_SIZE - 8);
+
+ printf("PASS.\n");
+ return 0;
+}
+#endif
--
2.17.1

2022-02-01 15:40:38

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 34/35] x86/cet/shstk: Support wrss for userspace

* Rick Edgecombe:

> For the current shadow stack implementation, shadow stack contents cannot
> be arbitrarily provisioned with data. This property helps apps protect
> themselves better, but also restricts any potential apps that may want to
> do exotic things at the expense of a little security.
>
> The x86 shadow stack feature introduces a new instruction, wrss, which
> can be enabled to write directly to shadow stack permissioned memory from
> userspace. Allow it to get enabled via the prctl interface.

Why can't this be turned on unconditionally?

Thanks,
Florian

2022-02-01 20:47:29

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 34/35] x86/cet/shstk: Support wrss for userspace

* H. J. Lu:

> On Sun, Jan 30, 2022 at 11:57 PM Florian Weimer <[email protected]> wrote:
>>
>> * Rick Edgecombe:
>>
>> > For the current shadow stack implementation, shadow stack contents cannot
>> > be arbitrarily provisioned with data. This property helps apps protect
>> > themselves better, but also restricts any potential apps that may want to
>> > do exotic things at the expense of a little security.
>> >
>> > The x86 shadow stack feature introduces a new instruction, wrss, which
>> > can be enabled to write directly to shadow stack permissioned memory from
>> > userspace. Allow it to get enabled via the prctl interface.
>>
>> Why can't this be turned on unconditionally?
>
> WRSS can be a security risk since it defeats the whole purpose of
> Shadow Stack. If an application needs to write to shadow stack,
> it can make a syscall to enable it. After the CET patches are checked
> in Linux kernel, I will make a proposal to allow applications or shared
> libraries to opt-in WRSS through a linker option, a compiler option or
> a function attribute.

Ahh, that makes sense. I assumed that without WRSS, the default was to
allow plain writes. 8-)

Thanks,
Florian

2022-02-01 20:48:26

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 34/35] x86/cet/shstk: Support wrss for userspace

On Sun, Jan 30, 2022 at 11:57 PM Florian Weimer <[email protected]> wrote:
>
> * Rick Edgecombe:
>
> > For the current shadow stack implementation, shadow stack contents cannot
> > be arbitrarily provisioned with data. This property helps apps protect
> > themselves better, but also restricts any potential apps that may want to
> > do exotic things at the expense of a little security.
> >
> > The x86 shadow stack feature introduces a new instruction, wrss, which
> > can be enabled to write directly to shadow stack permissioned memory from
> > userspace. Allow it to get enabled via the prctl interface.
>
> Why can't this be turned on unconditionally?

WRSS can be a security risk since it defeats the whole purpose of
Shadow Stack. If an application needs to write to shadow stack,
it can make a syscall to enable it. After the CET patches are checked
in Linux kernel, I will make a proposal to allow applications or shared
libraries to opt-in WRSS through a linker option, a compiler option or
a function attribute.
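
As a rough illustration of that per-thread opt-in (a hedged sketch, not
code from the series: ARCH_X86_FEATURE_ENABLE is the arch_prctl() named
in the cover letter, while the feature-bit name LINUX_X86_FEATURE_WRSS
and the map_shadow_stack numbers are assumed from this series' headers):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>  /* _wrssq(), needs -mshstk -fcf-protection */

#define SS_SIZE 0x1000

int main(void)
{
        /* Per-thread WRSS opt-in; constant names assumed from the series. */
        if (syscall(SYS_arch_prctl, ARCH_X86_FEATURE_ENABLE,
                    LINUX_X86_FEATURE_WRSS))
                return 1;

        /* Map a shadow stack with no token. A plain store to it would
         * fault, but wrss succeeds once the feature is enabled. */
        unsigned long *ss = (void *)syscall(__NR_map_shadow_stack,
                                            SS_SIZE, 0);
        if (ss == MAP_FAILED)
                return 1;

        _wrssq(0x42, ss);       /* direct write to shadow stack memory */
        printf("shadow stack word: %lx\n", *ss);
        return 0;
}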

--
H.J.

2022-02-04 05:46:40

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 35/35] x86/cpufeatures: Limit shadow stack to Intel CPUs

On Thu, Feb 3, 2022 at 1:58 PM John Allen <[email protected]> wrote:
>
> On Sun, Jan 30, 2022 at 01:18:38PM -0800, Rick Edgecombe wrote:
> > Shadow stack is supported on newer AMD processors, but the kernel
> > implementation has not been tested on them. Prevent basic issues from
> > showing up for normal users by disabling shadow stack on all CPUs except
> > Intel until it has been tested, at which point the limitation should be
> > removed.
>
> Hi Rick,
>
> I have been testing Yu-Cheng's patchsets on AMD hardware and I am
> working on testing this version now. How are you testing this new
> series? I can partially test by calling the prctl enable for shadow
> stack directly from a program, but I'm not sure how useful that's going
> to be without the glibc support. Do you have a public repo with the
> necessary glibc changes to enable shadow stack early?
>

The glibc CET branch is at

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/cet/master

--
H.J.

2022-02-04 15:52:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 33/35] selftests/x86: Add map_shadow_stack syscall test

On 1/30/22 13:18, Rick Edgecombe wrote:
> Add a simple selftest for exercising the new map_shadow_stack syscall.

This is a good start for the selftest. But, it would be really nice to
see a few additional smoke tests in here that are independent of the
library support.

For instance, it would be nice to have tests that:

1. Write to the shadow stack with normal instructions (and recover from
the inevitable SEGV). Make sure the siginfo looks like we expect.
2. Corrupt the regular stack, or maybe just use a retpoline
to induce a shadow stack exception. Ditto on checking the siginfo
3. Do enough CALLs that will likely trigger a fault and an on-demand
shadow stack page allocation.

That will test the *basics* and should be pretty simple to write.
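
A minimal, hedged sketch of test 1, in the style of the selftest above
(same assumed syscall number and flag as the patch; the expected si_code
check is left as a TODO since the value comes from elsewhere in the
series):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/mman.h>
#include <signal.h>
#include <setjmp.h>
#include <stdio.h>
#include <unistd.h>

#define SS_SIZE 0x200000

static sigjmp_buf jbuf;

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
        /* TODO: verify si->si_code matches the series' new value for
         * shadow stack access faults. */
        siglongjmp(jbuf, 1);
}

int main(void)
{
        struct sigaction sa = { 0 };
        unsigned long *ss;

        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        ss = (void *)syscall(__NR_map_shadow_stack, SS_SIZE,
                             SHADOW_STACK_SET_TOKEN);
        if (ss == MAP_FAILED) {
                printf("FAIL: could not map shadow stack\n");
                return 1;
        }

        if (!sigsetjmp(jbuf, 1)) {
                *ss = 0;        /* plain write to shadow stack memory */
                printf("FAIL: write did not fault\n");
                return 1;
        }
        printf("PASS: plain write faulted as expected\n");
        return 0;
}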

2022-02-04 16:41:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

Rick,

On Sun, Jan 30 2022 at 13:18, Rick Edgecombe wrote:
> This is a slight reboot of the userspace CET series. I will be taking over the
> series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> number and am calling it a new series. Hopefully, it doesn’t cause
> confusion.

That's fine as it seems to be a major change in course, so a reset to V1
is justified. Don't worry about confusion, we can easily confuse ourselves
with far more minor things than that version reset :)

> The new plan is to upstream only userspace Shadow Stack support at this point.
> IBT can follow later, but for now I’ll focus solely on the most in-demand and
> widely available (with the feature on AMD CPUs now) part of CET.

We just have to keep IBT in mind so that we don't add roadblocks which
we regret some time later.

> I thought as part of this reset, it might be useful to more fully write-up the
> design and summarize the history of the previous CET series. So this slightly
> long cover letter does that. The "Updates" section has the changes, if anyone
> doesn't want the history.

Thanks for that lengthy writeup. It's appreciated. There is too much
confusion already so a coherent summary is helpful.

> Why is Shadow Stack Wanted
> ==========================
> The main use case for userspace shadow stack is providing protection against
> return oriented programming attacks. Fedora and Ubuntu already have many/most
> packages enabled for shadow stack.

Which is unfortunately part of the overall problem ...

> History
> =======
> The branding “CET” really consists of two features: “Shadow Stack” and
> “Indirect Branch Tracking”. They both restrict previously allowed, but rarely
> valid behaviors and require userspace to change to avoid these behaviors before
> enabling the protection. These raw HW features need to be assembled into a
> software solution across userspace and kernel in order to add security value.
> The kernel part of this solution has evolved iteratively starting with a lengthy
> RFC period.
>
> Until now, the enabling effort was trying to support both Shadow Stack and IBT.
> This history will focus on a few areas of the shadow stack development history
> that I thought stood out.
>
> Signals
> -------
> Originally signals placed the location of the shadow stack restore
> token inside the saved state on the stack. This was problematic from a
> past ABI promises perspective. So the restore location was instead just
> assumed from the shadow stack pointer. This works because in normal
> allowed cases of calling sigreturn, the shadow stack pointer should be
> right at the restore token at that time. There is no alternate shadow
> stack support. If an alt shadow stack is added later we would
> need to

So how is that going to work? altstack is not an esoteric corner case.

> Enabling Interface
> ------------------
> For the entire history of the original CET series, the design was to
> enable shadow stack automatically if the feature bit was detected in
> the elf header. Then it was userspace’s responsibility to turn it off
> via an arch_prctl() if it was not desired, and this was handled by the
> glibc dynamic loader. Glibc’s standard behavior (when CET if configured
> is to leave shadow stack enabled if the executable and all linked
> libraries are marked with shadow stacks.
>
> Many distros (Fedora and others) have binaries already marked with
> shadow stack, waiting for kernel support. Unfortunately their glibc
> binaries expect the original arch_prctl() interface for allocating
> shadow stacks, as those changes were pushed ahead of kernel support.
> The net result of it all is, when updating to a kernel with shadow
> stack these binaries would suddenly get shadow stack enabled and expect
> the arch_prctl() interface to be there. And so calls to makecontext()
> will fail, resulting in visible breakages. This series deals with this
> problem as described below in "Updates".

I'm really impressed by the well thought out coordination on the glibc and
distro side. Designed by committee never worked ...

> Updates
> =======
> These updates were mostly driven by public comments, but a lot of the design
> elements are new. I would like some extra scrutiny on the updates.
>
> New syscall for Shadow Stack Allocation
> ---------------------------------------
> A new syscall is added for allocating shadow stacks to replace
> PROT_SHADOW_STACK. Several options were considered, as described in the
> “x86/cet/shstk: Introduce map_shadow_stack syscall”.
>
> Xsave Managed Supervisor State Modifications
> --------------------------------------------
> The shadow stack feature requires the kernel to modify xsaves managed
> state. On one of the last versions of Yu-cheng’s series Boris had
> commented on the pattern it was using to do this not necessarily being
> ideal. The pattern was to force a restore to the registers and always
> do the modification there. Then Thomas did an overhaul of the fpu code,
> part of which consisted of making raw access to the xsave buffer
> private to the fpu code. So this series tries to expose access again,
> and in a way that addresses Boris’ comments.
>
> The method is to provide functions like wmsrl/rdmsrl, but that can
> direct the operation to the correct location (registers or buffer),
> while giving the proper notice to the fpu subsystem so things don’t get
> clobbered or corrupted.
>
> In the past a solution like this was discussed as part of the PASID
> series, and Thomas was not in favor. In CET’s case there is more
> logic around the CET MSRs than in PASID's, and wrapping this logic
> minimizes near identical open coded logic needed to do this more
> efficiently. In addition it resolves the above described problem of
> having no access to the xsave buffer. So it is being put forward here
> under the supposition that CET’s usage may lead to a different
> conclusion, not to try to ignore past direction.
>
> The user interrupt series has similar needs as CET, and will also use
> this internal interface if it’s found acceptable.

I'll have a look.

> Switch Enabling Interface
> -------------------------
> But there are existing downsides to automatic elf header processing
> based enabling. The elf header feature spec is not defined by the
> kernel and there are proposals to expand it to describe additional
> logic. A simpler interface where the kernel is simply told what to
> enable, and leaves all the decision making to userspace, is more
> flexible for userspace and simpler for the kernel. There also already
> needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and
> likely LAM will use it too), so it avoids there being two ways to turn
> on these types of features. The only tricky part for shadow stack is
> that it has to be enabled very early. Wherever the shadow stack is
> enabled, the app cannot return from that point, otherwise there will be
> a shadow stack violation. It turns out glibc can enable shadow stack
> this early, so it works nicely. So not automatically enabling any
> features in the elf header will cleanly disable all old binaries, which
> expect the kernel to enable CET features automatically. Then after the
> kernel changes are upstream, glibc can be updated to use the new
> interface. This is the solution implemented in this series.

Makes sense.
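
To illustrate why the early-enable constraint is workable, here is a
hedged sketch (not the glibc code): ARCH_X86_FEATURE_ENABLE is the
arch_prctl() named above, while the feature-bit name
LINUX_X86_FEATURE_SHSTK is assumed for illustration. The key point is
that no frame live at the enable point may ever RET, since its CALL
predates the shadow stack:

#include <unistd.h>
#include <sys/syscall.h>

static __attribute__((noreturn)) void enable_shstk_then_run(int (*entry)(void))
{
        long ret;

        /* Raw syscall via inline asm, so not even a libc wrapper frame
         * has to RET across the enable point. */
        asm volatile("syscall"
                     : "=a" (ret)
                     : "a" (__NR_arch_prctl),
                       "D" (ARCH_X86_FEATURE_ENABLE),
                       "S" (LINUX_X86_FEATURE_SHSTK)
                     : "rcx", "r11", "memory");
        if (ret)
                _exit(127);

        /* Only CALLs made after this point may RET: entry()'s return
         * address is pushed to the new shadow stack by its CALL. */
        _exit(entry());
}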

> Expand Commit Logs
> ------------------
> As part of spinning up on this series, I found some of the commit logs
> did not describe the changes in enough detail for me to understand their
> purpose. I tried to expand the logs and comments, where I had to go
> digging. Hopefully it’s useful.

Proper changelogs are always appreciated.

> Limit to only Intel Processors
> ------------------------------
> Shadow stack is supported on some AMD processors, but this revision
> (with expanded HW usage and xsaves changes) has only been tested on
> Intel ones. So this series has a patch to limit shadow stack support to
> Intel processors. Ideally the patch would not even make it to mainline,
> and should be dropped as soon as this testing is done. It's included
> just in case.

Ha. I can give you access to an AMD machine with CET SS supported :)

> Future Work
> ===========
> Even though this is now exclusively a shadow stack series, there is still some
> remaining shadow stack work to be done.
>
> Ptrace
> ------
> Early in the series, there was a patch to allow IA32_U_CET and
> IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
> up to basic support, and it remains the plan. It will be needed for
> in-progress gdb support.

It's pretty much a prerequisite for enabling it, right?

> CRIU Support
> ------------
> In the past there was some speculation on the mailing list about
> whether CRIU would need to be taught about CET. It turns out, it does.
> The first issue hit is that CRIU calls sigreturn directly from its
> “parasite code” that it injects into the dumper process. This violates
> this shadow stack implementation’s protection that intends to prevent
> attackers from doing this.
>
> With so many packages already enabled with shadow stack, there is
> probably desire to make it work seamlessly. But in the meantime if
> distros want to support shadow stack and CRIU, users could manually
> disable shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for
> a process they want to dump. It’s not ideal.
>
> I’d like to hear what people think about having shadow stack in the
> kernel without this resolved. Nothing would change for any users until
> they enable shadow stack in the kernel and update to a glibc configured
> with CET. Should CRIU userspace be solved before kernel support?

Definitely yes. Making CRIU users add a glibc tunable is not really an
option. We can't break CRIU systems with a kernel upgrade.

Thanks,

tglx


2022-02-04 17:34:22

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

Hi Thomas,

Thanks for feedback on the plan.

On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > Until now, the enabling effort was trying to support both Shadow
> > Stack and IBT.
> > This history will focus on a few areas of the shadow stack
> > development history
> > that I thought stood out.
> >
> > Signals
> > -------
> > Originally signals placed the location of the shadow stack
> > restore
> > token inside the saved state on the stack. This was
> > problematic from a
> > past ABI promises perspective. So the restore location was
> > instead just
> > assumed from the shadow stack pointer. This works because in
> > normal
> > allowed cases of calling sigreturn, the shadow stack pointer
> > should be
> > right at the restore token at that time. There is no
> > alternate shadow
> > stack support. If an alt shadow stack is added later we
> > would
> > need to
>
> So how is that going to work? altstack is not an esoteric corner
> case.

My understanding is that the main usages for the signal stack were
handling stack overflows and corruption. Since the shadow stack only
contains return addresses rather than large stack allocations, and is
not generally writable or pivotable, I thought there was a good
possibility an alt shadow stack would not end up being especially
useful. Does it seem like reasonable guesswork?

If it does seem likely, I'll give it some more thought than that hand-wavy
plan.

>
> > Limit to only Intel Processors
> > ------------------------------
> > Shadow stack is supported on some AMD processors, but this
> > revision
> > (with expanded HW usage and xsaves changes) has only
> > been tested on
> > Intel ones. So this series has a patch to limit shadow stack
> > support to
> > Intel processors. Ideally the patch would not even make it
> > to mainline,
> > and should be dropped as soon as this testing is done. It's
> > included
> > just in case.
>
> Ha. I can give you access to an AMD machine with CET SS supported :)

Thanks for the offer. It sounds like John Allen can do this testing.

>
> > Future Work
> > ===========
> > Even though this is now exclusively a shadow stack series, there is
> > still some
> > remaining shadow stack work to be done.
> >
> > Ptrace
> > ------
> > Early in the series, there was a patch to allow IA32_U_CET
> > and
> > IA32_PL3_SSP to be set. This patch was dropped and planned
> > as a follow
> > up to basic support, and it remains the plan. It will be
> > needed for
> > in-progress gdb support.
>
> It's pretty much a prerequisite for enabling it, right?

Yes.

>
> > CRIU Support
> > ------------
> > In the past there was some speculation on the mailing list
> > about
> > whether CRIU would need to be taught about CET. It turns
> > out, it does.
> > The first issue hit is that CRIU calls sigreturn directly
> > from its
> > “parasite code” that it injects into the dumper process.
> > This violates
> > this shadow stack implementation’s protection that intends
> > to prevent
> > attackers from doing this.
> >
> > With so many packages already enabled with shadow stack,
> > there is
> > probably desire to make it work seamlessly. But in the
> > meantime if
> > distros want to support shadow stack and CRIU, users could
> > manually
> > disable shadow stack via
> > “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for
> > a process they want to dump. It’s not ideal.
> >
> > I’d like to hear what people think about having shadow stack
> > in the
> > kernel without this resolved. Nothing would change for any
> > users until
> > they enable shadow stack in the kernel and update to a glibc
> > configured
> > with CET. Should CRIU userspace be solved before kernel
> > support?
>
> Definitely yes. Making CRIU users add a glibc tunable is not really
> an
> option. We can't break CRIU systems with a kernel upgrade.

Ok got it, thanks. Just to be clear though, existing distros/binaries
out there will not have shadow stack enabled with just an updated
kernel (due to the enabling changes). So the CRIU tools would only
break after future glibc binaries enable CET, which users/distros would
have to do specifically (glibc doesn't even enable CET by default).

Since the purpose of this feature is to restrict previously allowed
behaviors, and it’s apparently getting enabled by default in some
distros' packages, I guess there is a decent chance that once a system
is updated with a future glibc some app somewhere will break. I was
under the impression that as long as there were no breakages under a
current set of binaries (including glibc), this was not considered a
kernel regression. Please correct me if this is wrong. I think there
are other options if we want to make this softer.

Of course none of that prevents known breakages from being fixed for
normal reasons and I’ll look into that for CRIU.

2022-02-07 07:02:45

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Sun, Feb 6, 2022 at 5:42 AM David Laight <[email protected]> wrote:
>
> From: Edgecombe, Rick P
> > Sent: 05 February 2022 20:15
> >
> > On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> > > On Sat, Feb 5, 2022 at 5:27 AM David Laight <[email protected]>
> > > wrote:
> > > >
> > > > From: Edgecombe, Rick P
> > > > > Sent: 04 February 2022 01:08
> > > > > Hi Thomas,
> > > > >
> > > > > Thanks for feedback on the plan.
> > > > >
> > > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > > > Until now, the enabling effort was trying to support both
> > > > > > > Shadow
> > > > > > > Stack and IBT.
> > > > > > > This history will focus on a few areas of the shadow stack
> > > > > > > development history
> > > > > > > that I thought stood out.
> > > > > > >
> > > > > > > Signals
> > > > > > > -------
> > > > > > > Originally signals placed the location of the shadow
> > > > > > > stack
> > > > > > > restore
> > > > > > > token inside the saved state on the stack. This was
> > > > > > > problematic from a
> > > > > > > past ABI promises perspective. So the restore location
> > > > > > > was
> > > > > > > instead just
> > > > > > > assumed from the shadow stack pointer. This works
> > > > > > > because in
> > > > > > > normal
> > > > > > > allowed cases of calling sigreturn, the shadow stack
> > > > > > > pointer
> > > > > > > should be
> > > > > > > right at the restore token at that time. There is no
> > > > > > > alternate shadow
> > > > > > > stack support. If an alt shadow stack is added later
> > > > > > > we
> > > > > > > would
> > > > > > > need to
> > > > > >
> > > > > > So how is that going to work? altstack is not an esoteric
> > > > > > corner
> > > > > > case.
> > > > >
> > > > > My understanding is that the main usages for the signal stack
> > > > > were
> > > > > handling stack overflows and corruption. Since the shadow stack
> > > > > only
> > > > > contains return addresses rather than large stack allocations,
> > > > > and is
> > > > > not generally writable or pivotable, I thought there was a good
> > > > > possibility an alt shadow stack would not end up being especially
> > > > > useful. Does it seem like reasonable guesswork?
> > > >
> > > > The other 'problem' is that it is valid to longjump out of a signal
> > > > handler.
> > > > These days you have to use siglongjmp() not longjmp() but it is
> > > > still used.
> > > >
> > > > It is probably also valid to use siglongjmp() to jump from a nested
> > > > signal handler into the outer handler.
> > > > Given both signal handlers can have their own stack, there can be
> > > > three
> > > > stacks involved.
> >
> > So the scenario is?
> >
> > 1. Handle signal 1
> > 2. sigsetjmp()
> > 3. sigaltstack()
> > 4. Handle signal 2 on alt stack
> > 5. siglongjmp()
> >
> > I'll check that it is covered by the tests, but I think it should work
> > in this series that has no alt shadow stack. I have only done a high
> > level overview of how the shadow stack stuff that doesn't involve the
> > kernel works in glibc. Sounds like I'll need to do a deeper dive.
>
> The posix/xopen definition for setjmp/longjmp doesn't require such
> longjmp requests to work.
>
> Although they still have to do something that doesn't break badly.
> Aborting the process is probably fine!
>
> > > > I think the shadow stack pointer has to be in ucontext - which also
> > > > means the application can change it before returning from a signal.
> >
> > Yes we might need to change it to support alt shadow stacks. Can you
> > elaborate why you think it has to be in ucontext? I was thinking of
> > looking at three options for storing the ssp:
> > - Stored in the shadow stack like a token using WRUSS from the kernel.
> > - Stored on the kernel side using a hashmap that maps ucontext or
> > sigframe userspace address to ssp (this is of course similar to
> > storing in ucontext, except that the user can’t change the ssp).
> > - Stored writable in userspace in ucontext.
> >
> > But in this version, without alt shadow stacks, the shadow stack
> > pointer is not stored in ucontext. This causes the limitation that
> > userspace can only call sigreturn when it has returned back to a point
> > where there is a restore token on the shadow stack (which was placed
> > there by the kernel). This doesn’t mean it can’t switch to a different
> > shadow stack or handle a nested signal, but it limits the possibility
> > for calling sigreturn with a totally different sigframe (like CRIU and
> > SROP attacks do). It should hopefully be a helpful, protective
> > limitation for most apps and I'm hoping CRIU can be fixed without
> > removing it.
> >
> > I am not aware of other limitations to signals (besides normal shadow
> > stack enforcement), but I could be missing it. And people's skepticism
> > is making me want to go back over it with more scrutiny.
> >
> > > > In much the same way as all the segment registers can be changed
> > > > leading to all the nasty bugs when the final 'return to user' code
> > > > traps in kernel when loading invalid segment registers or executing
> > > > iret.
> >
> > I don't think this is as difficult to avoid because userspace ssp has
> > its own register that should not be accessed at that point, but I have
> > not given this aspect enough analysis. Thanks for bringing it up.
>
> So the user ssp isn't saved (or restored) by the trap entry/exit.
> So it needs to be saved by the context switch code?
> Much like the user segment registers?
> So you are likely to get the same problems if restoring it can fault
> in kernel (eg for a non-canonical address).
>
> > > > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > > > call?
> > >
> > > No. setjmp/longjmp save and restore shadow stack pointer.
>
> Ok, I was thinking that direct access to the user ssp would be
> a privileged operation.

User space can only pop the shadow stack. longjmp does:

#ifdef SHADOW_STACK_POINTER_OFFSET
# if IS_IN (libc) && defined SHARED && defined FEATURE_1_OFFSET
        /* Check if Shadow Stack is enabled. */
        testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET
        jz L(skip_ssp)
# else
        xorl %eax, %eax
# endif
        /* Check and adjust the Shadow-Stack-Pointer. */
        /* Get the current ssp. */
        rdsspq %rax
        /* And compare it with the saved ssp value. */
        subq SHADOW_STACK_POINTER_OFFSET(%rdi), %rax
        je L(skip_ssp)
        /* Count the number of frames to adjust and adjust it
           with incssp instruction. The instruction can adjust
           the ssp by [0..255] value only thus use a loop if
           the number of frames is bigger than 255. */
        negq %rax
        shrq $3, %rax
        /* NB: We saved Shadow-Stack-Pointer of setjmp. Since we are
           restoring Shadow-Stack-Pointer of setjmp's caller, we
           need to unwind shadow stack by one more frame. */
        addq $1, %rax

        movl $255, %ebx
L(loop):
        cmpq %rbx, %rax
        cmovb %rax, %rbx
        incsspq %rbx
        subq %rbx, %rax
        ja L(loop)

L(skip_ssp):
#endif

> If it can be written you don't really have to worry about what code
> is trying to do - it can actually do what it likes.
> It just catches unintentional operations (like buffer overflows).
>
> Was there any 'spare' space in struct jmpbuf ?

By pure luck, we have ONE spare space in sigjmp_buf.

> Otherwise you can only enable shadow stacks if everything has been
> recompiled - including any shared libraries that might be dlopen()ed.
> (or does the compiler invent an alloca() call somehow for a
> size that comes back from glibc?)
>
> I've never really considered how setjmp/longjmp handle callee saved
> register variables (apart from it being hard).
> The original pdp11 implementation probably only needed to save r6 and r7.
>
> What does happen to all the 'extended state' that XSAVE handles?
> IIRC all the AVX registers are caller saved (so should probably
> be zeroed), but some of the SSE ones are callee saved, and one or
> two of the fpu flags are sticky and annoying enough to save/restore
> at the best of times.
>
> > It sounds like it would help to write up in a lot more detail exactly
> > how all the signal and specialer stack manipulation scenarios work in
> > glibc.
>
> Some cross references might have made people notice that the ucontext
> extensions for AVX512 (if not earlier ones) broke the minimal/default
> signal stack size.
>
> David
>

--
H.J.

2022-02-07 07:04:00

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 00/35] Shadow stacks for userspace

From: Edgecombe, Rick P
> Sent: 04 February 2022 01:08
> Hi Thomas,
>
> Thanks for feedback on the plan.
>
> On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > Until now, the enabling effort was trying to support both Shadow
> > > Stack and IBT.
> > > This history will focus on a few areas of the shadow stack
> > > development history
> > > that I thought stood out.
> > >
> > > Signals
> > > -------
> > > Originally signals placed the location of the shadow stack
> > > restore
> > > token inside the saved state on the stack. This was
> > > problematic from a
> > > past ABI promises perspective. So the restore location was
> > > instead just
> > > assumed from the shadow stack pointer. This works because in
> > > normal
> > > allowed cases of calling sigreturn, the shadow stack pointer
> > > should be
> > > right at the restore token at that time. There is no
> > > alternate shadow
> > > stack support. If an alt shadow stack is added later we
> > > would
> > > need to
> >
> > So how is that going to work? altstack is not an esoteric corner
> > case.
>
> My understanding is that the main usages for the signal stack were
> handling stack overflows and corruption. Since the shadow stack only
> contains return addresses rather than large stack allocations, and is
> not generally writable or pivotable, I thought there was a good
> possibility an alt shadow stack would not end up being especially
> useful. Does it seem like reasonable guesswork?

The other 'problem' is that it is valid to longjump out of a signal handler.
These days you have to use siglongjmp() not longjmp() but it is still used.

It is probably also valid to use siglongjmp() to jump from a nested
signal handler into the outer handler.
Given both signal handlers can have their own stack, there can be three
stacks involved.

I think the shadow stack pointer has to be in ucontext - which also
means the application can change it before returning from a signal.
In much the same way as all the segment registers can be changed
leading to all the nasty bugs when the final 'return to user' code
traps in kernel when loading invalid segment registers or executing iret.

Hmmm... do shadow stacks mean that longjmp() has to be a system call?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-02-07 08:08:00

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Thu, 2022-02-03 at 21:20 -0800, Andy Lutomirski wrote:
> On Thu, Feb 3, 2022, at 5:08 PM, Edgecombe, Rick P wrote:
> > Hi Thomas,
> > > > Signals
> > > > -------
> > > > Originally signals placed the location of the shadow
> > > > stack
> > > > restore
> > > > token inside the saved state on the stack. This was
> > > > problematic from a
> > > > past ABI promises perspective.
>
> What was the actual problem?

The meat of the discussion that I saw was this thread:

https://lore.kernel.org/lkml/CALCETrVTeYfzO-XWh+VwTuKCyPyp-oOMGH=QR_msG9tPQ4xPmA@mail.gmail.com/

The problem was that the saved shadow stack pointer was placed in a
location unexpected by userspace (at the end of the floating point
state), which apparently can be relocated by userspace that would not
be aware of the shadow stack state at the end. I think an earlier
version was not extendable either.

It does not seem to be a fully dead end, but I have to admit I didn’t
fully internalize the limits imposed by the userspace expectations of
the sigframe because the current solution seemed good. I’ll have to dig
into it a little more because alt stack, CRIU and these expectations
all seem intertwined.

Here is a question for you guys though – is a general ucontext
extension solution a nice-to-have on its own?

I was thinking that it would be better to keep CET stuff out of the
sigframe related structures if it can be avoided. One reason is that it
keeps security related register values out of writable locations. I’m
not thinking of any specific problem, but just a general idea of not
opening that stuff up if it’s not needed. An example is in the IBT
series, where the wait-for-endbranch state was saved in a ucontext flag. Some
usages may want to keep it from being cleared in a signal and the
endbranch check skipped. So for shadow stack, just the general notion
that this is not ideal state to open up.

On where to keep the wait-for-endbranch state, PeterZ had suggested the
possibility of having a per-mm hashmap of (userspace stack addresses)->
(kernel side saved state), limited to some sensible limit of items.
This could be extendable to other state besides CET stuff. I was
thinking to look in that direction if it’s needed for the alt shadow
stack.

But, is it swimming against the place where saved state is "supposed"
to be? There is some optimum of compatibility (more apps able to opt-
in) and security. I guess it's probably not good to have the kernel
bend over backwards trying to get both.

> > > > So the restore location was
> > > > instead just
> > > > assumed from the shadow stack pointer. This works
> > > > because in
> > > > normal
> > > > allowed cases of calling sigreturn, the shadow stack
> > > > pointer
> > > > should be
> > > > right at the restore token at that time. There is no
> > > > alternate shadow
> > > > stack support. If an alt shadow stack is added later we
> > > > would
> > > > need to
> > >
> > > So how is that going to work? altstack is not an esoteric corner
> > > case.
> >
> > My understanding is that the main usages for the signal stack were
> > handling stack overflows and corruption. Since the shadow stack
> > only
> > contains return addresses rather than large stack allocations, and
> > is
> > not generally writable or pivotable, I thought there was a good
> > possibility an alt shadow stack would not end up being especially
> > useful. Does it seem like reasonable guesswork?
>
> It's also used for things like DOSEMU that execute in a weird context
> and then trap back out to the outer program using a signal handler
> and an altstack. Also, imagine someone writing a SIGSEGV handler
> specifically intended to handle shadow stack overflow.

Interesting, thanks. I had been thinking that an alt shadow stack would
require a new interface that would mostly just sit dormant taking up
space. But probably an (obvious) better way would be to just have the
sigaltstack() syscall automatically create a new shadow stack, like the
rest of the series does automatically for new threads. I think I’ll
see how that would look.

>
> The shadow stack can be pivoted using RSTORSSP.

Yes, I just meant that the ability to pivot or modify is restricted (in
RSTORSSP's case by restore token checks) and so with less ability to
interact with it, it could be less likely for there to be corruptions.
This is of course just speculation.

2022-02-07 11:30:31

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Sat, Feb 5, 2022 at 12:15 PM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> > On Sat, Feb 5, 2022 at 5:27 AM David Laight <[email protected]>
> > wrote:
> > >
> > > From: Edgecombe, Rick P
> > > > Sent: 04 February 2022 01:08
> > > > Hi Thomas,
> > > >
> > > > Thanks for feedback on the plan.
> > > >
> > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > > Until now, the enabling effort was trying to support both
> > > > > > Shadow
> > > > > > Stack and IBT.
> > > > > > This history will focus on a few areas of the shadow stack
> > > > > > development history
> > > > > > that I thought stood out.
> > > > > >
> > > > > > Signals
> > > > > > -------
> > > > > > Originally signals placed the location of the shadow
> > > > > > stack
> > > > > > restore
> > > > > > token inside the saved state on the stack. This was
> > > > > > problematic from a
> > > > > > past ABI promises perspective. So the restore location
> > > > > > was
> > > > > > instead just
> > > > > > assumed from the shadow stack pointer. This works
> > > > > > because in
> > > > > > normal
> > > > > > allowed cases of calling sigreturn, the shadow stack
> > > > > > pointer
> > > > > > should be
> > > > > > right at the restore token at that time. There is no
> > > > > > alternate shadow
> > > > > > stack support. If an alt shadow stack is added later
> > > > > > we
> > > > > > would
> > > > > > need to
> > > > >
> > > > > So how is that going to work? altstack is not an esoteric
> > > > > corner
> > > > > case.
> > > >
> > > > My understanding is that the main usages for the signal stack
> > > > were
> > > > handling stack overflows and corruption. Since the shadow stack
> > > > only
> > > > contains return addresses rather than large stack allocations,
> > > > and is
> > > > not generally writable or pivotable, I thought there was a good
> > > > possibility an alt shadow stack would not end up being especially
> > > > useful. Does it seem like reasonable guesswork?
> > >
> > > The other 'problem' is that it is valid to longjump out of a signal
> > > handler.
> > > These days you have to use siglongjmp() not longjmp() but it is
> > > still used.
> > >
> > > It is probably also valid to use siglongjmp() to jump from a nested
> > > signal handler into the outer handler.
> > > Given both signal handlers can have their own stack, there can be
> > > three
> > > stacks involved.
>
> So the scenario is?
>
> 1. Handle signal 1
> 2. sigsetjmp()
> 3. sigaltstack()
> 4. Handle signal 2 on alt stack
> 5. siglongjmp()
>
> I'll check that it is covered by the tests, but I think it should work
> in this series that has no alt shadow stack. I have only done a high
> level overview of how the shadow stack stuff that doesn't involve the
> kernel works in glibc. Sounds like I'll need to do a deeper dive.
>
> > >
> > > I think the shadow stack pointer has to be in ucontext - which also
> > > means the application can change it before returning from a signal.
>
> Yes we might need to change it to support alt shadow stacks. Can you
> elaborate why you think it has to be in ucontext? I was thinking of
> looking at three options for storing the ssp:
> - Stored in the shadow stack like a token using WRUSS from the kernel.
> - Stored on the kernel side using a hashmap that maps ucontext or
> sigframe userspace address to ssp (this is of course similar to
> storing in ucontext, except that the user can’t change the ssp).
> - Stored writable in userspace in ucontext.
>
> But in this version, without alt shadow stacks, the shadow stack
> pointer is not stored in ucontext. This causes the limitation that
> userspace can only call sigreturn when it has returned back to a point
> where there is a restore token on the shadow stack (which was placed
> there by the kernel). This doesn’t mean it can’t switch to a different
> shadow stack or handle a nested signal, but it limits the possibility
> for calling sigreturn with a totally different sigframe (like CRIU and
> SROP attacks do). It should hopefully be a helpful, protective
> limitation for most apps and I'm hoping CRIU can be fixed without
> removing it.
>
> I am not aware of other limitations to signals (besides normal shadow
> stack enforcement), but I could be missing it. And people's skepticism
> is making me want to go back over it with more scrutiny.
>
> > > In much the same way as all the segment registers can be changed
> > > leading to all the nasty bugs when the final 'return to user' code
> > > traps in kernel when loading invalid segment registers or executing
> > > iret.
>
> I don't think this is as difficult to avoid because userspace ssp has
> its own register that should not be accessed at that point, but I have
> not given this aspect enough analysis. Thanks for bringing it up.
>
> > >
> > > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > > call?
> >
> > No. setjmp/longjmp save and restore shadow stack pointer.
> >
>
> It sounds like it would help to write up in a lot more detail exactly
> how all the signal and specialer stack manipulation scenarios work in
> glibc.
>

setjmp/longjmp work on the same sigjmp_buf. Shadow stack pointer
is saved and restored, just like any other callee-saved register.
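
A conceptual C sketch of that unwind (hedged: this mirrors the glibc
assembly quoted earlier in the thread, but the real implementation runs
in assembly right before the final jump, since popping entries from a C
helper would desynchronize its own return; the jmp_buf layout here is
illustrative, not glibc's):

#include <x86intrin.h>  /* _get_ssp(), _inc_ssp(); needs -mshstk */

struct my_jmp_buf {
        unsigned long ssp;      /* saved shadow stack pointer */
        /* ... plus the usual callee-saved registers ... */
};

static void my_setjmp_save(struct my_jmp_buf *b)
{
        b->ssp = _get_ssp();
}

static void my_longjmp_unwind(const struct my_jmp_buf *b)
{
        unsigned long cur = _get_ssp();
        unsigned long frames;

        if (!cur)
                return;         /* shadow stack not enabled */

        /* One shadow stack entry per frame, 8 bytes each, plus one
         * extra frame because we return to the setjmp caller (the NB
         * comment in the assembly above). */
        frames = (b->ssp - cur) / 8 + 1;

        /* incssp can only pop up to 255 entries per instruction. */
        while (frames) {
                unsigned int n = frames > 255 ? 255 : (unsigned int)frames;

                _inc_ssp(n);
                frames -= n;
        }
}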


--
H.J.

2022-02-07 11:47:07

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Sat, Feb 5, 2022 at 5:27 AM David Laight <[email protected]> wrote:
>
> From: Edgecombe, Rick P
> > Sent: 04 February 2022 01:08
> > Hi Thomas,
> >
> > Thanks for feedback on the plan.
> >
> > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > Until now, the enabling effort was trying to support both Shadow
> > > > Stack and IBT.
> > > > This history will focus on a few areas of the shadow stack
> > > > development history
> > > > that I thought stood out.
> > > >
> > > > Signals
> > > > -------
> > > > Originally signals placed the location of the shadow stack
> > > > restore
> > > > token inside the saved state on the stack. This was
> > > > problematic from a
> > > > past ABI promises perspective. So the restore location was
> > > > instead just
> > > > assumed from the shadow stack pointer. This works because in
> > > > normal
> > > > allowed cases of calling sigreturn, the shadow stack pointer
> > > > should be
> > > > right at the restore token at that time. There is no
> > > > alternate shadow
> > > > stack support. If an alt shadow stack is added later we
> > > > would
> > > > need to
> > >
> > > So how is that going to work? altstack is not an esoteric corner
> > > case.
> >
> > My understanding is that the main usages for the signal stack were
> > handling stack overflows and corruption. Since the shadow stack only
> > contains return addresses rather than large stack allocations, and is
> > not generally writable or pivotable, I thought there was a good
> > possibility an alt shadow stack would not end up being especially
> > useful. Does it seem like reasonable guesswork?
>
> The other 'problem' is that it is valid to longjump out of a signal handler.
> These days you have to use siglongjmp() not longjmp() but it is still used.
>
> It is probably also valid to use siglongjmp() to jump from a nested
> signal handler into the outer handler.
> Given both signal handlers can have their own stack, there can be three
> stacks involved.
>
> I think the shadow stack pointer has to be in ucontext - which also
> means the application can change it before returning from a signal.
> In much the same way as all the segment registers can be changed
> leading to all the nasty bugs when the final 'return to user' code
> traps in kernel when loading invalid segment registers or executing iret.
>
> Hmmm... do shadow stacks mean that longjmp() has to be a system call?

No. setjmp/longjmp save and restore shadow stack pointer.

--
H.J.

2022-02-07 11:59:44

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

(added more CRIU people)

On Sun, Jan 30, 2022 at 01:18:03PM -0800, Rick Edgecombe wrote:
> Hi,
>
> This is a slight reboot of the userspace CET series. I will be taking over the
> series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> number and am calling it a new series. Hopefully, it doesn’t cause confusion.
>
> The new plan is to upstream only userspace Shadow Stack support at this point.
> IBT can follow later, but for now I’ll focus solely on the most in-demand and
> widely available (with the feature on AMD CPUs now) part of CET.
>
> I thought as part of this reset, it might be useful to more fully write-up the
> design and summarize the history of the previous CET series. So this slightly
> long cover letter does that. The "Updates" section has the changes, if anyone
> doesn't want the history.
>
>
> Why is Shadow Stack Wanted
> ==========================
> The main use case for userspace shadow stack is providing protection against
> return oriented programming attacks. Fedora and Ubuntu already have many/most
> packages enabled for shadow stack. The main missing piece is Linux kernel
> support and there seems to be a high amount of interest in the ecosystem for
> getting this feature supported. Besides security, Google has also done some
> work on using shadow stack to improve performance and reliability of tracing.
>
>
> Userspace Shadow Stack Implementation
> =====================================
> Shadow stack works by maintaining a secondary (shadow) stack that cannot be
> directly modified by applications. When executing a CALL instruction, the
> processor pushes the return address to both the normal stack and to the special
> permissioned shadow stack. Upon ret, the processor pops the shadow stack copy
> and compares it to the normal stack copy. If the two differ, the processor
> raises a control protection fault. This implementation supports shadow stack on
> 64 bit kernels only, with support for 32 bit only via IA32 emulation.
>
> Shadow Stack Memory
> -------------------
> The majority of this series deals with changes for handling the special
> shadow stack memory permissions. This memory is specified by the
> Dirty+RO PTE bits. A tricky aspect of this is that this combination was
> previously used to specify COW memory. So Linux needs to handle COW
> differently when shadow stack is in use. The solution is to use a
> software PTE bit to denote COW memory, and take care to clear the dirty
> bit when setting the memory RO.
>
> Setup and Upkeep of HW Registers
> --------------------------------
> Using userspace CET requires a CR4 bit set, and also the manipulation
> of two xsave managed MSRs. The kernel needs to modify these registers
> during various operations like clone and signal handling. These
> operations may happen when the registers are restored to the CPU, or
> saved in an xsave buffer. Since the recent AMX triggered FPU overhaul
> removed direct access to the xsave buffer, this series adds an
> interface to operate on the supervisor xstate.
>
> New ABIs
> --------
> This series introduces some new ABIs. The primary one is the shadow
> stack itself. Since it is readable and the shadow stack pointer is
> exposed to user space, applications can easily read and process the
> shadow stack. And in fact the tracing usages plan to do exactly that.
>
> Most of the shadow stack contents are written by HW, but some of the
> entries are added by the kernel. The main place for this is signals. As
> part of handling the signal the kernel does some manual adjustment of
> the shadow stack that userspace depends on.
>
> In addition to the contents of the shadow stack there is also user
> visible behavior around when new shadow stacks are created and set in
> the shadow stack pointer (SSP) register. This is relatively
> straightforward – shadow stacks are created when new stacks are created
> (thread creation, fork, etc). It is more or less what is required to
> keep apps working.
>
> For situations when userspace creates a new stack (i.e. makecontext(),
> fibers, etc), a new syscall is provided for creating shadow stack
> memory. To make the shadow stack usable, it needs to have a restore
> token written to the protected memory. So the syscall provides a way to
> specify that this should be done by the kernel.
>
> When a shadow stack violation happens (when the return address on the
> stack does not match the return address on the shadow stack), a segfault
> is generated with a new si_code specific to CET violations.
>
> Lastly, a new arch_prctl interface is created for controlling the
> enablement of CET-like features. It is intended to also be used for
> LAM. It operates on the feature status per-thread, so for process wide
> enabling it is intended to be used early in things like dynamic
> linker/loaders. However, it can be used later for per-thread enablement
> of features like WRSS.
>
> WRSS
> ----
> WRSS is an instruction that can write to shadow stacks. The HW provides
> a way to enable this instruction for userspace use. Since shadow
> stacks are created initially protected, enabling WRSS allows any apps
> that want to do unusual things with their stacks to have a way to
> weaken protection and make things more flexible. A new feature bit is
> defined to control enabling/disabling of WRSS.
>
>
> History
> =======
> The branding “CET” really consists of two features: “Shadow Stack” and
> “Indirect Branch Tracking”. They both restrict previously allowed, but rarely
> valid behaviors and require userspace to change to avoid these behaviors before
> enabling the protection. These raw HW features need to be assembled into a
> software solution across userspace and kernel in order to add security value.
> The kernel part of this solution has evolved iteratively starting with a lengthy
> RFC period.
>
> Until now, the enabling effort was trying to support both Shadow Stack and IBT.
> This history will focus on a few areas of the shadow stack development history
> that I thought stood out.
>
> Signals
> -------
> Originally signals placed the location of the shadow stack restore
> token inside the saved state on the stack. This was problematic from a
> past ABI promises perspective. So the restore location was instead just
> assumed from the shadow stack pointer. This works because in normal
> allowed cases of calling sigreturn, the shadow stack pointer should be
> right at the restore token at that time. There is no alternate shadow
> stack support. If an alt shadow stack is added later we would need to
> find a place to store the regular shadow stack token location. Options
> could be to push something on the alt shadow stack, or to keep
> something on the kernel side. So the current design keeps things simple
> while slightly kicking the can down the road if alt shadow stacks
> become a thing later. Siglongjmp is handled in glibc, using the incssp
> instruction to unwind the shadow stack over the token.
>
> Shadow Stack Allocation
> -----------------------
> makecontext() implementations need a way to create new shadow stacks
> with restore tokens such that they can be pivoted to from userspace.
> The first interface to do this was an arch_prctl(). It created a shadow
> stack with a restore token pre-setup, since the kernel has an
> instruction that can write to user shadow stacks. However, this
> interface was abandoned for being strange.
>
> The next version created PROT_SHADOW_STACK. This interface had two
> problems. One, it left no options but for userspace to create writable
> memory, write a restore token, then mprotect() it PROT_SHADOW_STACK.
> The writable window left the shadow stack exposed, weakening the
> security. Second, it caused problems with the guard pages. Since the
> memory was initially created writable it did not have a guard page, but
> then was mprotected later to a type of memory that should have one.
> This resulted in missing guard pages and confused rb_subtree_gap’s.
>
> This version introduces a new syscall that behaves similarly to the
> initial arch_prctl() interface in that it has the kernel write the
> restore token.
>
> Enabling Interface
> ------------------
> For the entire history of the original CET series, the design was to
> enable shadow stack automatically if the feature bit was detected in
> the elf header. Then it was userspace’s responsibility to turn it off
> via an arch_prctl() if it was not desired, and this was handled by the
> glibc dynamic loader. Glibc’s standard behavior (when CET is configured)
> is to leave shadow stack enabled if the executable and all linked
> libraries are marked with shadow stacks.
>
> Many distros (Fedora and others) have binaries already marked with
> shadow stack, waiting for kernel support. Unfortunately their glibc
> binaries expect the original arch_prctl() interface for allocating
> shadow stacks, as those changes were pushed ahead of kernel support.
> The net result of it all is that, when updating to a kernel with
> shadow stack support, these binaries would suddenly get shadow stack
> enabled and expect the arch_prctl() interface to be there. Calls to
> makecontext() would then fail, resulting in visible breakage. This
> series deals with this
> problem as described below in "Updates".
>
>
> Updates
> =======
> These updates were mostly driven by public comments, but a lot of the design
> elements are new. I would like some extra scrutiny on the updates.
>
> New syscall for Shadow Stack Allocation
> ---------------------------------------
> A new syscall is added for allocating shadow stacks to replace
> PROT_SHADOW_STACK. Several options were considered, as described in the
> “x86/cet/shstk: Introduce map_shadow_stack syscall” patch.
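> 
> As a rough illustration, usage from userspace could look like the
> sketch below. The exact prototype and flags are defined in that
> patch; the (size, flags) form and the __NR define used here are just
> assumptions for the example.
> 
>     /* Sketch only: see the syscall patch for the real prototype. */
>     #include <sys/syscall.h>
>     #include <unistd.h>
> 
>     static void *alloc_shstk(unsigned long size)
>     {
>         /* The kernel maps the memory and writes the restore token
>          * itself, so no writable window is ever exposed. */
>         return (void *)syscall(__NR_map_shadow_stack, size, 0);
>     }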
>
> Xsave Managed Supervisor State Modifications
> --------------------------------------------
> The shadow stack feature requires the kernel to modify xsaves managed
> state. On one of the last versions of Yu-cheng’s series, Boris had
> commented that the pattern it was using to do this was not necessarily
> ideal. The pattern was to force a restore to the registers and always
> do the modification there. Then Thomas did an overhaul of the fpu code,
> part of which consisted of making raw access to the xsave buffer
> private to the fpu code. So this series tries to expose access again,
> and in a way that addresses Boris’ comments.
>
> The method is to provide functions like wrmsrl/rdmsrl, but that can
> direct the operation to the correct location (registers or buffer),
> while giving the proper notice to the fpu subsystem so things don’t get
> clobbered or corrupted.
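> 
> Roughly, using the interface looks like the sketch below (names and
> exact signatures are illustrative; the real ones are in the fpu
> patches):
> 
>     /* Sketch: adjust the user shadow stack pointer by one frame,
>      * wherever the task's CET state currently lives (registers or
>      * xsave buffer). */
>     void *state = start_update_xsave_msrs(XFEATURE_CET_USER);
>     u64 ssp;
> 
>     xsave_rdmsrl(state, MSR_IA32_PL3_SSP, &ssp);
>     xsave_wrmsrl(state, MSR_IA32_PL3_SSP, ssp + 8);
>     end_update_xsave_msrs();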
>
> In the past a solution like this was discussed as part of the PASID
> series, and Thomas was not in favor. In CET’s case there is more
> logic around the CET MSRs than around PASID's, and wrapping this
> logic minimizes the nearly identical open coded logic that would
> otherwise be needed. In addition it resolves the above described problem of
> having no access to the xsave buffer. So it is being put forward here
> under the supposition that CET’s usage may lead to a different
> conclusion, not to try to ignore past direction.
>
> The user interrupt series has similar needs as CET, and will also use
> this internal interface if it’s found acceptable.
>
> Support for WRSS
> ----------------
> Andy Lutomirski had asked that, if we change the shadow stack
> allocation API such that userspace cannot create arbitrary shadow
> stacks, we look at exposing an interface to enable the WRSS
> instruction for userspace. This way apps that want to do unexpected
> things with shadow stacks
> would still have the option to create shadow stacks with arbitrary
> data.
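> 
> As a sketch, with WRSS enabled a thread could then write shadow stack
> memory directly, for example to lay down its own restore token (the
> helper below is illustrative; instruction details are in the SDM):
> 
>     /* Sketch: an 8-byte shadow stack write via WRSS. This faults
>      * unless the kernel has enabled WRSS for the thread. */
>     static inline void wrssq(unsigned long *addr, unsigned long val)
>     {
>         asm volatile("wrssq %[val], (%[addr])"
>                      : : [addr] "r" (addr), [val] "r" (val)
>                      : "memory");
>     }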
>
> Switch Enabling Interface
> -------------------------
> As described above there is a problem with userspace binaries waiting
> to break as soon as the kernel supports CET. This needs to be prevented
> by changing the interface such that the old binaries will not enable
> shadow stack AND behave as if shadow stack is not enabled. They should
> run normally without shadow stack protection. Creating a new feature
> (SHSTK2) for shadow stack was explored. SHSTK would never be supported
> by the kernel, and all the userspace build tools would be updated to
> target SHSTK2 instead of SHSTK. So old SHSTK binaries would be cleanly
> disabled.
>
> But there are other downsides to automatic elf header processing
> based enabling. The elf header feature spec is not defined by the
> kernel and there are proposals to expand it to describe additional
> logic. An interface where the kernel is simply told what to enable,
> leaving all the decision making to userspace, is more flexible for
> userspace and simpler for the kernel. There also already needs to be
> an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and likely LAM will
> use it too), so it avoids there being two ways to turn on these types
> of features. The only tricky part for shadow stack is that it has to
> be enabled very early. The app cannot return past the point where
> shadow stack was enabled, otherwise there will be a shadow stack
> violation. It turns out glibc can enable shadow stack this early, so
> it works nicely. So not automatically enabling any features from the
> elf header will cleanly disable all old binaries, which expect the
> kernel to enable CET features automatically. Then after the kernel
> changes are upstream, glibc can be updated to use the new interface.
> This is the solution implemented in this series.
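> 
> In rough terms the flow is sketched below. ARCH_X86_FEATURE_ENABLE is
> the arch_prctl() from this series; the feature constant name and the
> policy check are illustrative.
> 
>     /* Sketch of early enabling, e.g. inlined in the loader's
>      * startup path. Userspace makes the policy decision. */
>     if (exe_and_libs_marked_shstk())
>         syscall(SYS_arch_prctl, ARCH_X86_FEATURE_ENABLE,
>                 LINUX_X86_FEATURE_SHSTK);
>     /* Frames entered before this point have no shadow stack
>      * entries, so execution must never return past it. */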
>
> Expand Commit Logs
> ------------------
> As part of spinning up on this series, I found some of the commit logs
> did not describe the changes in enough detail for me to understand their
> purpose. I tried to expand the logs and comments where I had to go
> digging. Hopefully it’s useful.
>
> Limit to only Intel Processors
> ------------------------------
> Shadow stack is supported on some AMD processors, but this revision
> (with expanded HW usage and xsaves changes) has only been tested on
> Intel ones. So this series has a patch to limit shadow stack support to
> Intel processors. Ideally the patch would not even make it to mainline,
> and should be dropped as soon as this testing is done. It's included
> just in case.
>
>
> Future Work
> ===========
> Even though this is now exclusively a shadow stack series, there is still some
> remaining shadow stack work to be done.
>
> Ptrace
> ------
> Early in the series, there was a patch to allow IA32_U_CET and
> IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
> up to basic support, and it remains the plan. It will be needed for
> in-progress gdb support.
>
> CRIU Support
> ------------
> In the past there was some speculation on the mailing list about
> whether CRIU would need to be taught about CET. It turns out, it does.
> The first issue hit is that CRIU calls sigreturn directly from its
> “parasite code” that it injects into the dumper process. This violates
> this shadow stack implementation’s protection that intends to prevent
> attackers from doing this.
>
> With so many packages already enabled with shadow stack, there is
> probably desire to make it work seamlessly. But in the meantime if
> distros want to support shadow stack and CRIU, users could manually
> disable shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for
> a process they want to dump. It’s not ideal.
>
> I’d like to hear what people think about having shadow stack in the
> kernel without this resolved. Nothing would change for any users until
> they enable shadow stack in the kernel and update to a glibc configured
> with CET. Should CRIU userspace be solved before kernel support?
>
> Selftests
> ---------
> There are some CET selftests being worked on and they are not included
> here.
>
> Thanks,
>
> Rick
>
> Rick Edgecombe (7):
> x86/mm: Prevent VM_WRITE shadow stacks
> x86/fpu: Add helpers for modifying supervisor xstate
> x86/fpu: Add unsafe xsave buffer helpers
> x86/cet/shstk: Introduce map_shadow_stack syscall
> selftests/x86: Add map_shadow_stack syscall test
> x86/cet/shstk: Support wrss for userspace
> x86/cpufeatures: Limit shadow stack to Intel CPUs
>
> Yu-cheng Yu (28):
> Documentation/x86: Add CET description
> x86/cet/shstk: Add Kconfig option for Shadow Stack
> x86/cpufeatures: Add CET CPU feature flags for Control-flow
> Enforcement Technology (CET)
> x86/cpufeatures: Introduce CPU setup and option parsing for CET
> x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
> x86/cet: Add control-protection fault handler
> x86/mm: Remove _PAGE_DIRTY from kernel RO pages
> x86/mm: Move pmd_write(), pud_write() up in the file
> x86/mm: Introduce _PAGE_COW
> drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
> x86/mm: Update pte_modify for _PAGE_COW
> x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
> transition from _PAGE_DIRTY to _PAGE_COW
> mm: Move VM_UFFD_MINOR_BIT from 37 to 38
> mm: Introduce VM_SHADOW_STACK for shadow stack memory
> x86/mm: Check Shadow Stack page fault errors
> x86/mm: Update maybe_mkwrite() for shadow stack
> mm: Fixup places that call pte_mkwrite() directly
> mm: Add guard pages around a shadow stack.
> mm/mmap: Add shadow stack pages to memory accounting
> mm: Update can_follow_write_pte() for shadow stack
> mm/mprotect: Exclude shadow stack from preserve_write
> mm: Re-introduce vm_flags to do_mmap()
> x86/cet/shstk: Add user-mode shadow stack support
> x86/process: Change copy_thread() argument 'arg' to 'stack_size'
> x86/cet/shstk: Handle thread shadow stack
> x86/cet/shstk: Introduce shadow stack token setup/verify routines
> x86/cet/shstk: Handle signals for shadow stack
> x86/cet/shstk: Add arch_prctl elf feature functions
>
> .../admin-guide/kernel-parameters.txt | 4 +
> Documentation/filesystems/proc.rst | 1 +
> Documentation/x86/cet.rst | 145 ++++++
> Documentation/x86/index.rst | 1 +
> arch/arm/kernel/signal.c | 2 +-
> arch/arm64/kernel/signal.c | 2 +-
> arch/arm64/kernel/signal32.c | 2 +-
> arch/sparc/kernel/signal32.c | 2 +-
> arch/sparc/kernel/signal_64.c | 2 +-
> arch/x86/Kconfig | 22 +
> arch/x86/Kconfig.assembler | 5 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/ia32/ia32_signal.c | 25 +-
> arch/x86/include/asm/cet.h | 54 +++
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/disabled-features.h | 8 +-
> arch/x86/include/asm/fpu/api.h | 8 +
> arch/x86/include/asm/fpu/types.h | 23 +-
> arch/x86/include/asm/fpu/xstate.h | 6 +-
> arch/x86/include/asm/idtentry.h | 4 +
> arch/x86/include/asm/mman.h | 24 +
> arch/x86/include/asm/mmu_context.h | 2 +
> arch/x86/include/asm/msr-index.h | 20 +
> arch/x86/include/asm/page_types.h | 7 +
> arch/x86/include/asm/pgtable.h | 302 ++++++++++--
> arch/x86/include/asm/pgtable_types.h | 48 +-
> arch/x86/include/asm/processor.h | 6 +
> arch/x86/include/asm/special_insns.h | 30 ++
> arch/x86/include/asm/trap_pf.h | 2 +
> arch/x86/include/uapi/asm/mman.h | 8 +-
> arch/x86/include/uapi/asm/prctl.h | 10 +
> arch/x86/include/uapi/asm/processor-flags.h | 2 +
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/cpu/common.c | 20 +
> arch/x86/kernel/cpu/cpuid-deps.c | 1 +
> arch/x86/kernel/elf_feature_prctl.c | 72 +++
> arch/x86/kernel/fpu/xstate.c | 167 ++++++-
> arch/x86/kernel/idt.c | 4 +
> arch/x86/kernel/process.c | 17 +-
> arch/x86/kernel/process_64.c | 2 +
> arch/x86/kernel/shstk.c | 446 ++++++++++++++++++
> arch/x86/kernel/signal.c | 13 +
> arch/x86/kernel/signal_compat.c | 2 +-
> arch/x86/kernel/traps.c | 62 +++
> arch/x86/mm/fault.c | 19 +
> arch/x86/mm/mmap.c | 48 ++
> arch/x86/mm/pat/set_memory.c | 2 +-
> arch/x86/mm/pgtable.c | 25 +
> drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
> fs/aio.c | 2 +-
> fs/proc/task_mmu.c | 3 +
> include/linux/mm.h | 19 +-
> include/linux/pgtable.h | 8 +
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/siginfo.h | 3 +-
> include/uapi/asm-generic/unistd.h | 2 +-
> ipc/shm.c | 2 +-
> kernel/sys_ni.c | 1 +
> mm/gup.c | 16 +-
> mm/huge_memory.c | 27 +-
> mm/memory.c | 5 +-
> mm/migrate.c | 3 +-
> mm/mmap.c | 15 +-
> mm/mprotect.c | 9 +-
> mm/nommu.c | 4 +-
> mm/util.c | 2 +-
> tools/testing/selftests/x86/Makefile | 9 +-
> .../selftests/x86/test_map_shadow_stack.c | 75 +++
> 69 files changed, 1797 insertions(+), 92 deletions(-)
> create mode 100644 Documentation/x86/cet.rst
> create mode 100644 arch/x86/include/asm/cet.h
> create mode 100644 arch/x86/include/asm/mman.h
> create mode 100644 arch/x86/kernel/elf_feature_prctl.c
> create mode 100644 arch/x86/kernel/shstk.c
> create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c
>
>
> base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
> --
> 2.17.1

--
Sincerely yours,
Mike.

2022-02-07 12:10:23

by John Allen

[permalink] [raw]
Subject: Re: [PATCH 35/35] x86/cpufeatures: Limit shadow stack to Intel CPUs

On Sun, Jan 30, 2022 at 01:18:38PM -0800, Rick Edgecombe wrote:
> Shadow stack is supported on newer AMD processors, but the kernel
> implementation has not been tested on them. Prevent basic issues from
> showing up for normal users by disabling shadow stack on all CPUs except
> Intel until it has been tested. At which point the limitation should be
> removed.

Hi Rick,

I have been testing Yu-Cheng's patchsets on AMD hardware and I am
working on testing this version now. How are you testing this new
series? I can partially test by calling the prctl enable for shadow
stack directly from a program, but I'm not sure how useful that's going
to be without the glibc support. Do you have a public repo with the
necessary glibc changes to enable shadow stack early?

Thanks,
John

2022-02-07 13:20:00

by John Allen

[permalink] [raw]
Subject: Re: [PATCH 35/35] x86/cpufeatures: Limit shadow stack to Intel CPUs

On Thu, Feb 03, 2022 at 02:23:43PM -0800, H.J. Lu wrote:
> On Thu, Feb 3, 2022 at 1:58 PM John Allen <[email protected]> wrote:
> >
> > On Sun, Jan 30, 2022 at 01:18:38PM -0800, Rick Edgecombe wrote:
> > > Shadow stack is supported on newer AMD processors, but the kernel
> > > implementation has not been tested on them. Prevent basic issues from
> > > showing up for normal users by disabling shadow stack on all CPUs except
> > > Intel until it has been tested. At which point the limitation should be
> > > removed.
> >
> > Hi Rick,
> >
> > I have been testing Yu-Cheng's patchsets on AMD hardware and I am
> > working on testing this version now. How are you testing this new
> > series? I can partially test by calling the prctl enable for shadow
> > stack directly from a program, but I'm not sure how useful that's going
> > to be without the glibc support. Do you have a public repo with the
> > necessary glibc changes to enable shadow stack early?
> >
>
> The glibc CET branch is at
>
> https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/cet/master

Thanks, I ran some smoke tests with the updated glibc and it's looking
good so far. Additionally, I ran the new kselftest and it passed.

Thanks,
John

>
> --
> H.J.

2022-02-07 15:12:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Sat, Feb 05, 2022 at 12:21:12PM -0800, H.J. Lu wrote:

> setjmp/longjmp work on the same sigjmp_buf. Shadow stack pointer
> is saved and restored, just like any other callee-saved registers.

How is having that shadow stack pointer in user-writable memory not a
problem? That seems like a prime target to subvert the whole shadow
stack machinery.

2022-02-07 15:27:31

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 33/35] selftests/x86: Add map_shadow_stack syscall test

On Thu, 2022-02-03 at 14:42 -0800, Dave Hansen wrote:
> This is a good start for the selftest. But, it would be really nice to
> see a few additional smoke tests in here that are independent of the
> library support.

Sure. I had actually included this just because the "adding a syscall"
docs said to make sure to include a test for the syscall. There are
some other tests that were being planned as a follow up.

>
> For instance, it would be nice to have tests that:
>
> 1. Write to the shadow stack with normal instructions (and recover from
> the inevitable SEGV). Make sure the siginfo looks like we expect.
> 2. Corrupt the regular stack, or maybe just use a retpoline
> to induce a shadow stack exception. Ditto on checking the siginfo.
> 3. Do enough CALLs that will likely trigger a fault and an on-demand
> shadow stack page allocation.
>
> That will test the *basics* and should be pretty simple to write.

Most of this already exists in the private tests. I'll combine it into
a single selftest. Having wrss now makes it a bit easier because
those writes are treated as shadow stack accesses, so we can do these
operations directly without too much calling acrobatics.
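
For example, the direct-write test can look roughly like the sketch
below (illustrative, not the final selftest; the map_shadow_stack
prototype is assumed to be (size, flags)):

    #include <setjmp.h>
    #include <signal.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static sigjmp_buf jbuf;

    static void segv(int sig, siginfo_t *si, void *ctx)
    {
        /* A real test would check si->si_addr and si_code here */
        siglongjmp(jbuf, 1);
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        unsigned long *shstk;

        sa.sa_sigaction = segv;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Prototype assumed; see the map_shadow_stack patch */
        shstk = (void *)syscall(__NR_map_shadow_stack, 0x1000, 0);
        if (sigsetjmp(jbuf, 1))
            return 0;   /* faulted as expected: pass */
        *shstk = 1;     /* normal write to shstk memory */
        return 1;       /* no fault: fail */
    }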

Thanks,

Rick

2022-02-07 17:34:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace



On Thu, Feb 3, 2022, at 5:08 PM, Edgecombe, Rick P wrote:
> Hi Thomas,

>> > Signals
>> > -------
>> > Originally signals placed the location of the shadow stack
>> > restore
>> > token inside the saved state on the stack. This was
>> > problematic from a
>> > past ABI promises perspective.

What was the actual problem?

>> > So the restore location was
>> > instead just
>> > assumed from the shadow stack pointer. This works because in
>> > normal
>> > allowed cases of calling sigreturn, the shadow stack pointer
>> > should be
>> > right at the restore token at that time. There is no
>> > alternate shadow
>> > stack support. If an alt shadow stack is added later we
>> > would
>> > need to
>>
>> So how is that going to work? altstack is not an esoteric corner
>> case.
>
> My understanding is that the main usages for the signal stack were
> handling stack overflows and corruption. Since the shadow stack only
> contains return addresses rather than large stack allocations, and is
> not generally writable or pivotable, I thought there was a good
> possibility an alt shadow stack would not end up being especially
> useful. Does it seem like reasonable guesswork?

It's also used for things like DOSEMU that execute in a weird context and then trap back out to the outer program using a signal handler and an altstack. Also, imagine someone writing a SIGSEGV handler specifically intended to handle shadow stack overflow.

The shadow stack can be pivoted using RSTORSSP.

2022-02-08 01:08:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/35] x86/cet/shstk: Add Kconfig option for Shadow Stack

On 1/30/22 13:18, Rick Edgecombe wrote:
> +config X86_SHADOW_STACK
> + prompt "Intel Shadow Stack"
> + def_bool n
> + depends on AS_WRUSS
> + depends on ARCH_HAS_SHADOW_STACK
> + select ARCH_USES_HIGH_VMA_FLAGS
> + help
> + Shadow Stack protection is a hardware feature that detects function
> + return address corruption. This helps mitigate ROP attacks.
> + Applications must be enabled to use it, and old userspace does not
> + get protection "for free".
> + Support for this feature is present on Tiger Lake family of
> + processors released in 2020 or later. Enabling this feature
> + increases kernel text size by 3.7 KB.

I guess the "2020" comment is still OK. But, given that it's on AMD and
a couple of other Intel models, maybe we should just leave this at:

CPUs supporting shadow stacks were first released in 2020.

If we say anything, we mostly want folks to just go read the
documentation if they need more details.

2022-02-08 05:34:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On 2/6/22 23:20, Adrian Reber wrote:
>>> CRIU Support
>>> ------------
>>> In the past there was some speculation on the mailing list about
>>> whether CRIU would need to be taught about CET. It turns out, it does.
>>> The first issue hit is that CRIU calls sigreturn directly from its
>>> “parasite code” that it injects into the dumper process. This violates
>>> this shadow stack implementation’s protection that intends to prevent
>>> attackers from doing this.
...
> From the CRIU side I can say that I would definitely like to see this
> resolved. CRIU just went through a similar exercise with rseq() being
> enabled in glibc and CI broke all around for us and other projects
> relying on CRIU. Although rseq() was around for a long time we were not
> aware of it but luckily 5.13 introduced a way to handle it for CRIU with
> ptrace. An environment variable existed but did not really help when
> CRIU is called somewhere in the middle of the container software stack.
>
> From my point of view a solution not involving an environment variable
> would definitely be preferred.

Have there been things like this for CRIU in the past? Something where
CRIU needs control but that's also security-sensitive?

Any thoughts on how you would _like_ to see this resolved?

2022-02-08 09:34:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 02/35] x86/cet/shstk: Add Kconfig option for Shadow Stack

On Mon, Feb 07 2022 at 14:39, Dave Hansen wrote:

> On 1/30/22 13:18, Rick Edgecombe wrote:
>> +config X86_SHADOW_STACK
>> + prompt "Intel Shadow Stack"
>> + def_bool n
>> + depends on AS_WRUSS
>> + depends on ARCH_HAS_SHADOW_STACK
>> + select ARCH_USES_HIGH_VMA_FLAGS
>> + help
>> + Shadow Stack protection is a hardware feature that detects function
>> + return address corruption. This helps mitigate ROP attacks.
>> + Applications must be enabled to use it, and old userspace does not
>> + get protection "for free".
>> + Support for this feature is present on Tiger Lake family of
>> + processors released in 2020 or later. Enabling this feature
>> + increases kernel text size by 3.7 KB.
>
> I guess the "2020" comment is still OK. But, given that it's on AMD and
> a couple of other Intel models, maybe we should just leave this at:
>
> CPUs supporting shadow stacks were first released in 2020.

Yes.

> If we say anything, we mostly want folks to just go read the
> documentation if they need more details.

Also the kernel text size increase blurb is pretty useless as that's a
number which is wrong from day one.

Thanks,

tglx

2022-02-08 11:28:53

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Sun, 2022-02-06 at 13:42 +0000, David Laight wrote:
> > I don't think this is as difficult to avoid because userspace ssp
> > has
> > its own register that should not be accessed at that point, but I
> > have
> > not given this aspect enough analysis. Thanks for bringing it up.
>
> So the user ssp isn't saved (or restored) by the trap entry/exit.
> So it needs to be saved by the context switch code?
> Much like the user segment registers?
> So you are likely to get the same problems if restoring it can fault
> in kernel (eg for a non-canonical address).

PL3_SSP is lazily saved and restored by the FPU supervisor xsave code,
which has its buffer in kernel memory. For the most part it is
userspace instructions that use this register and they can only modify
it in limited ways.

It does look like IRET can cause a #CP if the PL3 SSP is not aligned,
but only after RIP and CPL are set back to userspace. I'm not confident
enough interpreting the specs to assert the specific behavior and will
follow up internally to clarify.

2022-02-08 13:15:29

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 02/35] x86/cet/shstk: Add Kconfig option for Shadow Stack

On Sun, Jan 30 2022 at 13:18, Rick Edgecombe wrote:
> +config ARCH_HAS_SHADOW_STACK
> + def_bool n
> +
> +config X86_SHADOW_STACK
> + prompt "Intel Shadow Stack"

It's also available on AMD, right?

Thanks,

tglx

2022-02-08 15:39:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Mon, Feb 07 2022 at 17:31, Andy Lutomirski wrote:
> So this leaves altshadowstack. If we want to allow userspace to handle
> a shstk overflow, I think we need altshadowstack. And I can easily
> imagine signal handling in a coroutine or user-threading environment (Go?
> UMCG or whatever it's called?) wanting this. As noted, this obnoxious
> Andy person didn't like putting any shstk-related extensions in the FPU
> state.
>
> For better or for worse, altshadowstack is (I think) fundamentally a new
> API. No amount of ucontext magic is going to materialize an entire
> shadow stack out of nowhere when someone calls sigaltstack(). So the
> questions are: should we support altshadowstack from day one and, if so,
> what should it look like?

I think we should support them from day one.

> So I don't have a complete or even almost complete design in mind, but I
> think we do need to make a conscious decision either to design this
> right or to skip it for v1.

Skipping it might create a fundamental design fail situation as it might
require changes to the shadow stack signal handling in general which
becomes a nightmare once a non-altstack API is exposed.

> As for CRIU, I don't think anyone really expects a new kernel, running
> new userspace that takes advantage of features in the new kernel, to
> work with old CRIU.

Yes, CRIU needs updates, but what ensures that CRIU managed user space
does not use SHSTK if CRIU is not updated yet?

> Upgrading to a SHSTK kernel should still allow using CRIU with
> non-SHSTK userspace, but I don't see how it's possible for CRIU to
> handle SHSTK without updates. We should certainly do our best to make
> CRIU's life easy, though.

Handling CRIU with SHSTK enabled has to be part of the overall design
otherwise we'll either end up with horrible hacks or with a requirement
to change the V1 UAPI....

Thanks,

tglx

2022-02-08 16:01:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 07/35] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

On 1/30/22 13:18, Rick Edgecombe wrote:
> The x86 family of processors do not directly create read-only and Dirty
> PTEs. These PTEs are created by software.

That's not strictly correct.

There's nothing in the architecture today to prevent the CPU from
creating Write=0,Dirty=1 PTEs. In fact, some CPUs do this in weird
situations. It wouldn't be wrong to say:

Processors sometimes directly create read-only and Dirty PTEs.

which is the opposite of what is written above. This is why the CET
spec has the blurb about shadow-stack-supporting CPUs promise not to do
this any more.

> One such case is that kernel
> read-only pages are historically setup as Dirty.

^ set up

> New processors that support Shadow Stack regard read-only and Dirty PTEs as
> shadow stack pages.

This also isn't *quite* correct. It's not just having a new processor,
it includes enabling shadow stacks.

> This results in ambiguity between shadow stack and kernel read-only
> pages. To resolve this, removed Dirty from kernel read-only pages.

One thing that's not clear from the spec: does this cause an *actual*
problem? For instance, does setting:

IA32_U_CET.SH_STK_EN=1
but
IA32_S_CET.SH_STK_EN=0

mean that shadow stacks are enforced in user *MODE* or on
user-paging-permission (U=0) PTEs?

I think it's modes, but it would be nice to be clear. *BUT*, if this is
accurate, doesn't it also mean that this patch is not strictly necessary?

Don't get me wrong, the patch is probably still a good idea, but let's
make sure we get the exact reasoning clear.

2022-02-08 16:42:13

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 03/35] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)

On 1/30/22 13:18, Rick Edgecombe wrote:
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -78,6 +78,7 @@ static const struct cpuid_dep cpuid_deps[] = {
> { X86_FEATURE_XFD, X86_FEATURE_XSAVES },
> { X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
> { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
> + { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
> {}
> };

Please add a chunk to the changelog that explains the dependency. This
would suffice:

To protect shadow stack state from malicious modification, the
registers are only accessible in supervisor mode. This
implementation context-switches the registers with XSAVES. Make
X86_FEATURE_SHSTK depend on XSAVES.

The XSAVES dependency is touched on in the documentation, but it's a bit
buried in there.

2022-02-08 17:15:39

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, Feb 08, 2022 at 11:16:51AM +0200, Mike Rapoport wrote:
>
> > Any thoughts on how you would _like_ to see this resolved?
>
> Ideally, CRIU will need a knob that will tell the kernel/CET machinery
> where the next RET will jump, along the lines of
> restore_signal_shadow_stack() AFAIU.
>
> But such a knob will immediately reduce the security value of the entire
> thing, and I don't have good ideas how to deal with it :(

Probably a kind of latch in the task_struct which would trigger off once
a return to a different address happened, thus we would be able to jump inside
parasite code. Of course such a trigger should be available under proper
capability only.

2022-02-08 23:55:22

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 03/35] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)

On Mon, 2022-02-07 at 14:45 -0800, Dave Hansen wrote:
> Please add a chunk to the changelog that explains the dependency.
> This
> would suffice:
>
> To protect shadow stack state from malicious modification,
> the
> registers are only accessible in supervisor mode. This
> implementation context-switches the registers with XSAVES.
> Make
> X86_FEATURE_SHSTK depend on XSAVES.

Thanks. Yea, I don't think that part of the design is really elaborated
on anywhere. It can be some foreshadowing for the signal stuff later
too.

2022-02-09 00:41:24

by Adrian Reber

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Sun, Feb 06, 2022 at 08:42:03PM +0200, Mike Rapoport wrote:
> (added more CRIU people)

Thanks, Mike.

> On Sun, Jan 30, 2022 at 01:18:03PM -0800, Rick Edgecombe wrote:
> > This is a slight reboot of the userspace CET series. I will be taking over the
> > series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> > number and am calling it a new series. Hopefully, it doesn’t cause confusion.
> >
> > The new plan is to upstream only userspace Shadow Stack support at this point.
> > IBT can follow later, but for now I’ll focus solely on the most in-demand and
> > widely available (with the feature on AMD CPUs now) part of CET.
> >
> > I thought as part of this reset, it might be useful to more fully write-up the
> > design and summarize the history of the previous CET series. So this slightly
> > long cover letter does that. The "Updates" section has the changes, if anyone
> > doesn't want the history.

[...]

> > CRIU Support
> > ------------
> > In the past there was some speculation on the mailing list about
> > whether CRIU would need to be taught about CET. It turns out, it does.
> > The first issue hit is that CRIU calls sigreturn directly from its
> > “parasite code” that it injects into the dumper process. This violates
> > this shadow stack implementation’s protection that intends to prevent
> > attackers from doing this.
> >
> > With so many packages already enabled with shadow stack, there is
> > probably desire to make it work seamlessly. But in the meantime if
> > distros want to support shadow stack and CRIU, users could manually
> > disable shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for
> > a process they want to dump. It’s not ideal.
> >
> > I’d like to hear what people think about having shadow stack in the
> > kernel without this resolved. Nothing would change for any users until
> > they enable shadow stack in the kernel and update to a glibc configured
> > with CET. Should CRIU userspace be solved before kernel support?

From the CRIU side I can say that I would definitely like to see this
resolved. CRIU just went through a similar exercise with rseq() being
enabled in glibc and CI broke all around for us and other projects
relying on CRIU. Although rseq() was around for a long time we were not
aware of it but luckily 5.13 introduced a way to handle it for CRIU with
ptrace. An environment variable existed but did not really help when
CRIU is called somewhere in the middle of the container software stack.

From my point of view a solution not involving an environment variable
would definitely be preferred.

Adrian

2022-02-09 03:14:09

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 02/35] x86/cet/shstk: Add Kconfig option for Shadow Stack

On Tue, 2022-02-08 at 09:41 +0100, Thomas Gleixner wrote:
> On Mon, Feb 07 2022 at 14:39, Dave Hansen wrote:
>
> > On 1/30/22 13:18, Rick Edgecombe wrote:
> > > +config X86_SHADOW_STACK
> > > + prompt "Intel Shadow Stack"
> > > + def_bool n
> > > + depends on AS_WRUSS
> > > + depends on ARCH_HAS_SHADOW_STACK
> > > + select ARCH_USES_HIGH_VMA_FLAGS
> > > + help
> > > + Shadow Stack protection is a hardware feature that detects
> > > function
> > > + return address corruption. This helps mitigate ROP
> > > attacks.
> > > + Applications must be enabled to use it, and old userspace
> > > does not
> > > + get protection "for free".
> > > + Support for this feature is present on Tiger Lake family
> > > of
> > > + processors released in 2020 or later. Enabling this
> > > feature
> > > + increases kernel text size by 3.7 KB.
> >
> > I guess the "2020" comment is still OK. But, given that it's on
> > AMD and
> > a couple of other Intel models, maybe we should just leave this at:
> >
> > CPUs supporting shadow stacks were first released in 2020.
>
> Yes.
>
> > If we say anything, we mostly want folks to just go read the
> > documentation if they need more details.
>
> Also the kernel text size increase blurb is pretty useless as that's
> a
> number which is wrong from day one.

Makes sense. Thanks.

2022-02-09 06:10:57

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 04/35] x86/cpufeatures: Introduce CPU setup and option parsing for CET

On Mon, 2022-02-07 at 14:49 -0800, Dave Hansen wrote:
> Given this:
>
>
> https://lore.kernel.org/all/[email protected]/
>
> I'd probably yank the command-line option out of this series, or
> stick
> it in a separate patch that you tack on to the end.

Makes sense. I'll change the docs to point out exactly how to use this
new parameter for shadow stack. It could come in handy if some
important service mis-marks itself as shadow stack capable and
complicates boot.

Thanks.

2022-02-09 06:30:49

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > > > But such a knob will immediately reduce the security value of
> > > > the entire
> > > > thing, and I don't have good ideas how to deal with it :(
> > >
> > > Probably a kind of latch in the task_struct which would trigger
> > > off once
> > > a return to a different address happened, thus we would be able to
> > > jump inside
> > > parasite code. Of course such a trigger should be available under
> > > proper
> > > capability only.
> >
> > I'm not fully in touch with how parasite, etc works. Are we
> > talking about save or restore?
>
> We use parasite code in question during checkpoint phase as far as I
> remember.
> push addr/lret trick is used to run "injected" code (code injection
> itself is
> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> into this code
> for years already, do we still need to support compat mode at all?
>
> > If it's restore, what exactly does CRIU need to do? Is it just
> > that CRIU needs to return
> > out from its resume code into the to-be-resumed program without
> > tripping CET? Would it
> > be acceptable for CRIU to require that at least one shstk slot be
> > free at save time?
> > Or do we need a mechanism to atomically switch to a completely full
> > shadow stack at resume?
> >
> > Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > that is intended for
> > use for altshadowstack could safely verify a token on the
> > altshadowstack, possibly
> > compare to something in ucontext (or not -- this isn't clearly
> > necessary) and switch
> > back to the previous stack. CRIU could use that too. Obviously
> > CRIU will need a way
> > to populate the relevant stacks, but WRUSS can be used for that,
> > and I think this
> > is a fundamental requirement for CRIU -- CRIU restore absolutely
> > needs a way to write
> > the saved shadow stack data into the shadow stack.

Still wrapping my head around the CRIU save and restore steps, but
another general approach might be to give ptrace the ability to
temporarily pause/resume/set CET enablement and SSP for a stopped
thread. Then injected code doesn't need to jump through any hoops or
possibly run into road blocks. I'm not sure how much this opens things
up if the thread has to be stopped...

Cyrill, could it fit into the CRIU pause and resume flow? What action
causes the final resuming of execution of the restored process for
checkpointing and for restore? Wondering if we could somehow make CET
re-enable exactly then.

And I guess this also needs a way to create shadow stack allocations at
a specific address to match where they were in the dumped process. That
is missing in this series.


> >
> > So I think the only special capability that CRIU really needs is
> > WRUSS, and
> > we need to wire that up anyway.
>
> Thanks for these notes, Andy! I can't provide any sane answer here
> since I didn't
> read the tech spec for this feature yet :-)

2022-02-09 06:38:10

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 07/35] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

On Mon, 2022-02-07 at 16:13 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > The x86 family of processors do not directly create read-only and
> > Dirty
> > PTEs. These PTEs are created by software.
>
> That's not strictly correct.
>
> There's nothing in the architecture today to prevent the CPU from
> creating Write=0,Dirty=1 PTEs. In fact, some CPUs do this in weird
> situations. It wouldn't be wrong to say:
>
> Processors sometimes directly create read-only and Dirty PTEs.
>
> which is the opposite of what is written above. This is why the CET
> spec has the blurb about shadow-stack-supporting CPUs promise not to
> do
> this any more.

Yea, it's wrong. The whole point of the new assurance is that it could
before as you say.

>
> > One such case is that kernel
> > read-only pages are historically setup as Dirty.
>
> ^ set up
>
> > New processors that support Shadow Stack regard read-only and Dirty
> > PTEs as
> > shadow stack pages.
>
> This also isn't *quite* correct. It's not just having a new
> processor,
> it includes enabling shadow stacks.

Right.

>
> > This results in ambiguity between shadow stack and kernel read-only
> > pages. To resolve this, removed Dirty from kernel read-only
> > pages.
>
> One thing that's not clear from the spec: does this cause an *actual*
> problem? For instance, does setting:
>
> IA32_U_CET.SH_STK_EN=1
> but
> IA32_S_CET.SH_STK_EN=0
>
> mean that shadow stacks are enforced in user *MODE* or on
> user-paging-permission (U=0) PTEs?
>
> I think it's modes, but it would be nice to be clear. *BUT*, if this
> is
> accurate, doesn't it also mean that this patch is not strictly
> necessary?
>
> Don't get me wrong, the patch is probably still a good idea, but
> let's
> make sure we get the exact reasoning clear.

Yea, I think this is just a tying up loose ends thing. It is not
functionally needed until there would be shadow stack mode for kernel.
I'll update the patch to make this clear.

2022-02-09 07:12:05

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 00/35] Shadow stacks for userspace

From: Edgecombe, Rick P
> Sent: 05 February 2022 20:15
>
> On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> > On Sat, Feb 5, 2022 at 5:27 AM David Laight <[email protected]>
> > wrote:
> > >
> > > From: Edgecombe, Rick P
> > > > Sent: 04 February 2022 01:08
> > > > Hi Thomas,
> > > >
> > > > Thanks for feedback on the plan.
> > > >
> > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > > Until now, the enabling effort was trying to support both
> > > > > > Shadow
> > > > > > Stack and IBT.
> > > > > > This history will focus on a few areas of the shadow stack
> > > > > > development history
> > > > > > that I thought stood out.
> > > > > >
> > > > > > Signals
> > > > > > -------
> > > > > > Originally signals placed the location of the shadow
> > > > > > stack
> > > > > > restore
> > > > > > token inside the saved state on the stack. This was
> > > > > > problematic from a
> > > > > > past ABI promises perspective. So the restore location
> > > > > > was
> > > > > > instead just
> > > > > > assumed from the shadow stack pointer. This works
> > > > > > because in
> > > > > > normal
> > > > > > allowed cases of calling sigreturn, the shadow stack
> > > > > > pointer
> > > > > > should be
> > > > > > right at the restore token at that time. There is no
> > > > > > alternate shadow
> > > > > > stack support. If an alt shadow stack is added later
> > > > > > we
> > > > > > would
> > > > > > need to
> > > > >
> > > > > So how is that going to work? altstack is not an esoteric
> > > > > corner
> > > > > case.
> > > >
> > > > My understanding is that the main usages for the signal stack
> > > > were
> > > > handling stack overflows and corruption. Since the shadow stack
> > > > only
> > > > contains return addresses rather than large stack allocations,
> > > > and is
> > > > not generally writable or pivotable, I thought there was a good
> > > > possibility an alt shadow stack would not end up being especially
> > > > useful. Does it seem like reasonable guesswork?
> > >
> > > The other 'problem' is that it is valid to longjump out of a signal
> > > handler.
> > > These days you have to use siglongjmp() not longjmp() but it is
> > > still used.
> > >
> > > It is probably also valid to use siglongjmp() to jump from a nested
> > > signal handler into the outer handler.
> > > Given both signal handlers can have their own stack, there can be
> > > three
> > > stacks involved.
>
> So the scenario is?
>
> 1. Handle signal 1
> 2. sigsetjmp()
> 3. sigaltstack()
> 4. Handle signal 2 on alt stack
> 5. siglongjmp()
>
> I'll check that it is covered by the tests, but I think it should work
> in this series that has no alt shadow stack. I have only done a high
> level overview of how the shadow stack stuff, that doesn't involve the
> kernel, works in glibc. Sounds like I'll need to do a deeper dive.

The posix/xopen definition for setjmp/longjmp doesn't require such
longjmp requests to work.

Although they still have to do something that doesn't break badly.
Aborting the process is probably fine!

> > > I think the shadow stack pointer has to be in ucontext - which also
> > > means the application can change it before returning from a signal.
>
> Yes we might need to change it to support alt shadow stacks. Can you
> elaborate why you think it has to be in ucontext? I was thinking of
> looking at three options for storing the ssp:
> - Stored in the shadow stack like a token using WRUSS from the kernel.
> - Stored on the kernel side using a hashmap that maps ucontext or
> sigframe userspace address to ssp (this is of course similar to
> storing in ucontext, except that the user can’t change the ssp).
> - Stored writable in userspace in ucontext.
>
> But in this version, without alt shadow stacks, the shadow stack
> pointer is not stored in ucontext. This causes the limitation that
> userspace can only call sigreturn when it has returned back to a point
> where there is a restore token on the shadow stack (which was placed
> there by the kernel). This doesn’t mean it can’t switch to a different
> shadow stack or handle a nested signal, but it limits the possibility
> for calling sigreturn with a totally different sigframe (like CRIU and
> SROP attacks do). It should hopefully be a helpful, protective
> limitation for most apps and I'm hoping CRIU can be fixed without
> removing it.
>
> I am not aware of other limitations to signals (besides normal shadow
> stack enforcement), but I could be missing it. And people's skepticism
> is making me want to go back over it with more scrutiny.
>
> > > In much the same way as all the segment registers can be changed
> > > leading to all the nasty bugs when the final 'return to user' code
> > > traps in kernel when loading invalid segment registers or executing
> > > iret.
>
> I don't think this is as difficult to avoid because userspace ssp has
> its own register that should not be accessed at that point, but I have
> not given this aspect enough analysis. Thanks for bringing it up.

So the user ssp isn't saved (or restored) by the trap entry/exit.
So it needs to be saved by the context switch code?
Much like the user segment registers?
So you are likely to get the same problems if restoring it can fault
in kernel (eg for a non-canonical address).

> > > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > > call?
> >
> > No. setjmp/longjmp save and restore shadow stack pointer.

Ok, I was thinking that direct access to the user ssp would be
a privileged operation.
If it can be written you don't really have to worry about what code
is trying to do - it can actually do what it likes.
It just catches unintentional operations (like buffer overflows).

Was there any 'spare' space in struct jmpbuf ?
Otherwise you can only enable shadow stacks if everything has been
recompiled - including any shared libraries that might be dlopen()ed.
(or does the compiler invent an alloca() call somehow for a
size that comes back from glibc?)

I've never really considered how setjmp/longjmp handle callee saved
register variables (apart from it being hard).
The original pdp11 implementation probably only needed to save r6 and r7.

What does happen to all the 'extended state' that XSAVE handles?
IIRC all the AVX registers are caller saved (so should probably
be zeroed), but some of the SSE ones are callee saved, and one or
two of the fpu flags are sticky and annoying enough to save/restore
at the best of times.

> It sounds like it would help to write up in a lot more detail exactly
> how all the signal and specialer stack manipulation scenarios work in
> glibc.

Some cross references might have made people notice that the ucontext
extensions for AVX512 (if not earlier ones) broke the minimal/default
signal stack size.

David


2022-02-09 07:19:55

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Mon, Feb 07, 2022 at 08:30:50AM -0800, Dave Hansen wrote:
> On 2/6/22 23:20, Adrian Reber wrote:
> >>> CRIU Support
> >>> ------------
> >>> In the past there was some speculation on the mailing list about
> >>> whether CRIU would need to be taught about CET. It turns out, it does.
> >>> The first issue hit is that CRIU calls sigreturn directly from its
> >>> “parasite code” that it injects into the dumper process. This violates
> >>> this shadow stack implementation’s protection that intends to prevent
> >>> attackers from doing this.
> ...
> > From the CRIU side I can say that I would definitely like to see this
> > resolved. CRIU just went through a similar exercise with rseq() being
> > enabled in glibc and CI broke all around for us and other projects
> > relying on CRIU. Although rseq() was around for a long time we were not
> > aware of it but luckily 5.13 introduced a way to handle it for CRIU with
> > ptrace. An environment variable existed but did not really help when
> > CRIU is called somewhere in the middle of the container software stack.
> >
> > From my point of view a solution not involving an environment variable
> > would definitely be preferred.
>
> Have there been things like this for CRIU in the past? Something where
> CRIU needs control but that's also security-sensitive?

Generally CRIU requires (almost) root privileges to work, but I don't think
it handles something as security sensitive and restrictive as shadow stacks.

> Any thoughts on how you would _like_ to see this resolved?

Ideally, CRIU will need a knob that will tell the kernel/CET machinery
where the next RET will jump, along the lines of
restore_signal_shadow_stack() AFAIU.

But such a knob will immediately reduce the security value of the entire
thing, and I don't have good ideas how to deal with it :(

--
Sincerely yours,
Mike.

2022-02-09 07:28:26

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> >> But such a knob will immediately reduce the security value of the entire
> >> thing, and I don't have good ideas how to deal with it :(
> >
> > Probably a kind of latch in the task_struct which would trigger off once
> > a return to a different address happened, thus we would be able to jump inside
> > parasite code. Of course such a trigger should be available under proper
> > capability only.
>
> I'm not fully in touch with how parasite, etc works. Are we talking about save or restore?

We use parasite code in question during checkpoint phase as far as I remember.
push addr/lret trick is used to run "injected" code (code injection itself is
done via ptrace) in compat mode at least. Dima, Andrei, I didn't look into this code
for years already, do we still need to support compat mode at all?

> If it's restore, what exactly does CRIU need to do? Is it just that CRIU needs to return
> out from its resume code into the to-be-resumed program without tripping CET? Would it
> be acceptable for CRIU to require that at least one shstk slot be free at save time?
> Or do we need a mechanism to atomically switch to a completely full shadow stack at resume?
>
> Off the top of my head, a sigreturn (or sigreturn-like mechanism) that is intended for
> use for altshadowstack could safely verify a token on the altshadowstack, possibly
> compare to something in ucontext (or not -- this isn't clearly necessary) and switch
> back to the previous stack. CRIU could use that too. Obviously CRIU will need a way
> to populate the relevant stacks, but WRUSS can be used for that, and I think this
> is a fundamental requirement for CRIU -- CRIU restore absolutely needs a way to write
> the saved shadow stack data into the shadow stack.
>
> So I think the only special capability that CRIU really needs is WRUSS, and
> we need to wire that up anyway.

Thanks for these notes, Andy! I can't provide any sane answer here since I didn't
read the tech spec for this feature yet :-)

2022-02-09 07:57:09

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

* David Laight:

> Was there any 'spare' space in struct jmpbuf ?

jmp_buf in glibc looks like this:

(gdb) ptype/o jmp_buf
type = struct __jmp_buf_tag {
/* 0 | 64 */ __jmp_buf __jmpbuf;
/* 64 | 4 */ int __mask_was_saved;
/* XXX 4-byte hole */
/* 72 | 128 */ __sigset_t __saved_mask;

/* total size (bytes): 200 */
} [1]
(gdb) ptype/o __jmp_buf
type = long [8]

The glibc ABI reserves space for 1024 signals, something that Linux is
never going to implement. We can use that space to store a few extra
registers in __saved_mask. There is a complication because the
pthread_cancel unwinding allocates only space for the __jmpbuf member.
Fortunately, we do not need to unwind the shadow stack for thread
cancellation, so we don't need that extra space in that case.
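
Roughly, the idea is something like this (illustrative only, not the
actual glibc code; __ssp stands for the slot carved out of the
__saved_mask space, and the intrinsics need -mshstk):

    #include <immintrin.h>      /* _get_ssp() / _inc_ssp() */

    /* sigsetjmp side: stash the shadow stack pointer. _get_ssp()
     * reads SSP with RDSSP and yields 0 if shadow stack is off. */
    env->__ssp = _get_ssp();

    /* siglongjmp side: SSP cannot be written directly, so pop
     * entries with INCSSP until the saved value is reached. */
    while (_get_ssp() && _get_ssp() < env->__ssp)
        _inc_ssp(1);            /* pops one 8-byte entry */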

Thanks,
Florian


2022-02-09 08:09:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On 2/5/22 12:15, Edgecombe, Rick P wrote:
> On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
>> On Sat, Feb 5, 2022 at 5:27 AM David Laight <[email protected]>
>> wrote:
>>>
>>> From: Edgecombe, Rick P
>>>> Sent: 04 February 2022 01:08
>>>> Hi Thomas,
>>>>
>>>> Thanks for feedback on the plan.
>>>>
>>>> On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
>>>>>> Until now, the enabling effort was trying to support both
>>>>>> Shadow
>>>>>> Stack and IBT.
>>>>>> This history will focus on a few areas of the shadow stack
>>>>>> development history
>>>>>> that I thought stood out.
>>>>>>
>>>>>> Signals
>>>>>> -------
>>>>>> Originally signals placed the location of the shadow
>>>>>> stack
>>>>>> restore
>>>>>> token inside the saved state on the stack. This was
>>>>>> problematic from a
>>>>>> past ABI promises perspective. So the restore location
>>>>>> was
>>>>>> instead just
>>>>>> assumed from the shadow stack pointer. This works
>>>>>> because in
>>>>>> normal
>>>>>> allowed cases of calling sigreturn, the shadow stack
>>>>>> pointer
>>>>>> should be
>>>>>> right at the restore token at that time. There is no
>>>>>> alternate shadow
>>>>>> stack support. If an alt shadow stack is added later
>>>>>> we
>>>>>> would
>>>>>> need to
>>>>>
>>>>> So how is that going to work? altstack is not an esoteric
>>>>> corner
>>>>> case.
>>>>
>>>> My understanding is that the main usages for the signal stack
>>>> were
>>>> handling stack overflows and corruption. Since the shadow stack
>>>> only
>>>> contains return addresses rather than large stack allocations,
>>>> and is
>>>> not generally writable or pivotable, I thought there was a good
>>>> possibility an alt shadow stack would not end up being especially
>>>> useful. Does it seem like reasonable guesswork?
>>>
>>> The other 'problem' is that it is valid to longjump out of a signal
>>> handler.
>>> These days you have to use siglongjmp() not longjmp() but it is
>>> still used.
>>>
>>> It is probably also valid to use siglongjmp() to jump from a nested
>>> signal handler into the outer handler.
>>> Given both signal handlers can have their own stack, there can be
>>> three
>>> stacks involved.
>
> So the scenario is?
>
> 1. Handle signal 1
> 2. sigsetjmp()
> 3. sigaltstack()
> 4. Handle signal 2 on alt stack
> 5. siglongjmp()
>
> I'll check that it is covered by the tests, but I think it should work
> in this series that has no alt shadow stack. I have only done a high
> level overview of how the shadow stack stuff, that doesn't involve the
> kernel, works in glibc. Sounds like I'll need to do a deeper dive.
>
>>>
>>> I think the shadow stack pointer has to be in ucontext - which also
>>> means the application can change it before returning from a signal.
>
> Yes we might need to change it to support alt shadow stacks. Can you
> elaborate why you think it has to be in ucontext? I was thinking of
> looking at three options for storing the ssp:
> - Stored in the shadow stack like a token using WRUSS from the kernel.
> - Stored on the kernel side using a hashmap that maps ucontext or
> sigframe userspace address to ssp (this is of course similar to
> storing in ucontext, except that the user can’t change the ssp).
> - Stored writable in userspace in ucontext.
>
> But in this version, without alt shadow stacks, the shadow stack
> pointer is not stored in ucontext. This causes the limitation that
> userspace can only call sigreturn when it has returned back to a point
> where there is a restore token on the shadow stack (which was placed
> there by the kernel).



I'll reply here and maybe cover multiple things.


User code already needs to rewind the regular stack to call sigreturn --
sigreturn finds the signal frame based on ESP/RSP. So if you call it
from the wrong place, you go boom. I think that the Linux SHSTK ABI
should have the property that no amount of tampering with just the
ucontext and associated structures can cause sigreturn to redirect to
the wrong IP -- there should be something on the shadow stack that also
gets verified in sigreturn. IIRC the series does this, but it's been a
while. The post-sigreturn SSP should be entirely implied by
pre-sigreturn SSP (or perhaps something on the shadow stack), so, in the
absence of an altshadowstack feature, no ucontext changes should be needed.

We can also return from a signal or from more than one signal at once,
as above, using siglongjmp. It seems like this should Just Work (tm),
at least in the absence of altshadowstack.

So this leaves altshadowstack. If we want to allow userspace to handle
a shstk overflow, I think we need altshadowstack. And I can easily
imagine signal handling in a coroutine or user-threading environment (Go?
UMCG or whatever it's called?) wanting this. As noted, this obnoxious
Andy person didn't like putting any shstk-related extensions in the FPU
state.

For better or for worse, altshadowstack is (I think) fundamentally a new
API. No amount of ucontext magic is going to materialize an entire
shadow stack out of nowhere when someone calls sigaltstack(). So the
questions are: should we support altshadowstack from day one and, if so,
what should it look like?

If we want to be clever, we could attempt to make altshadowstack
compatible with RSTORSSP. Signal delivery pushes a restore token to the
old stack (hah! what if the old stack is full?) and pushes the RSTORSSP
busy magic to the new stack, and sigreturn inverts it. Code that wants
to return without sigreturn does it manually with RSTORSSP. (Assuming
that I've understood the arcane RSTORSSP sequence right. Intel wins
major points for documentation quality here.) Or we could invent our
own scheme. In either case, I don't immediately see any reason that the
ucontext needs to contain a shadow stack pointer.

There's a delightful wart to consider, though. siglongjmp, at least as
currently envisioned, can't return off an altshadowstack: the whole
point of the INCSSP distance restrictions is to avoid incrementing right
off the top of the current stack, but siglongjmp off an altshadowstack
fundamentally switches stacks. So either siglongjmp off an
altshadowstack needs to be illegal or it needs to work differently. (By
incssp-ing to the top of the altshadowstack, then switching, then
incssp-ing some more? How does it even find the top of the current
altshadowstack?) And the plot thickens if one tries to siglongjmp off
two nested altshadowstack-using signals in a single call. Fortunately,
since altshadowstack is a new API, it's not entirely crazy to have
different rules.

So I don't have a complete or even almost complete design in mind, but I
think we do need to make a conscious decision either to design this
right or to skip it for v1.

As for CRIU, I don't think anyone really expects a new kernel, running
new userspace that takes advantage of features in the new kernel, to
work with old CRIU. Upgrading to a SHSTK kernel should still allow
using CRIU with non-SHSTK userspace, but I don't see how it's possible
for CRIU to handle SHSTK without updates. We should certainly do our
best to make CRIU's life easy, though.

> This doesn’t mean it can’t switch to a different
> shadow stack or handle a nested signal, but it limits the possibility
> for calling sigreturn with a totally different sigframe (like CRIU and
> SROP attacks do). It should hopefully be a helpful, protective
> limitation for most apps and I'm hoping CRIU can be fixed without
> removing it.
>
> I am not aware of other limitations to signals (besides normal shadow
> stack enforcement), but I could be missing it. And people's skepticism
> is making me want to go back over it with more scrutiny.
>
>>> In much the same way as all the segment registers can be changed
>>> leading to all the nasty bugs when the final 'return to user' code
>>> traps in kernel when loading invalid segment registers or executing
>>> iret.
>
> I don't think this is as difficult to avoid because userspace ssp has
> its own register that should not be accessed at that point, but I have
> not given this aspect enough analysis. Thanks for bringing it up.
>
>>>
>>> Hmmm... do shadow stacks mean that longjmp() has to be a system
>>> call?
>>
>> No. setjmp/longjmp save and restore shadow stack pointer.
>>
>
> It sounds like it would help to write up in a lot more detail exactly
> how all the signal and specialer stack manipulation scenarios work in
> glibc.
>


2022-02-09 08:15:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 04/35] x86/cpufeatures: Introduce CPU setup and option parsing for CET

> * Some CPU features depend on higher CPUID levels, which may not always
> * be available due to CPUID level capping or broken virtualization
> @@ -1261,6 +1269,9 @@ static void __init cpu_parse_early_param(void)
> if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
> setup_clear_cpu_cap(X86_FEATURE_XSAVES);
>
> + if (cmdline_find_option_bool(boot_command_line, "no_user_shstk"))
> + setup_clear_cpu_cap(X86_FEATURE_SHSTK);

Given this:

https://lore.kernel.org/all/[email protected]/

I'd probably yank the command-line option out of this series, or stick
it in a separate patch that you tack on to the end.

2022-02-09 08:52:11

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 05/35] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states

On 1/30/22 13:18, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> Control-flow Enforcement Technology (CET) introduces these MSRs:
>
> MSR_IA32_U_CET (user-mode CET settings),
> MSR_IA32_PL3_SSP (user-mode shadow stack pointer),
>
> MSR_IA32_PL0_SSP (kernel-mode shadow stack pointer),
> MSR_IA32_PL1_SSP (Privilege Level 1 shadow stack pointer),
> MSR_IA32_PL2_SSP (Privilege Level 2 shadow stack pointer),
> MSR_IA32_S_CET (kernel-mode CET settings),
> MSR_IA32_INT_SSP_TAB (exception shadow stack table).

To be honest, I'm not sure this is very valuable. It's *VERY* close to
the exact information in the structure definitions. It's also not
obviously related to XSAVE. It's more of the "what" this patch does
than the "why". Good changelogs talk about "why".

> The two user-mode MSRs belong to XFEATURE_CET_USER. The first three of
> kernel-mode MSRs belong to XFEATURE_CET_KERNEL. Both XSAVES states are
> supervisor states. This means that there is no direct, unprivileged access
> to these states, making it harder for an attacker to subvert CET.

Forgive me while I go into changelog lecture mode for a moment.

I was constantly looking up at the list of MSRs and trying to reconcile
them with this paragraph. Imagine if you had started out this changelog
by saying:

Shadow stack register state can be managed with XSAVE. The
registers can logically be separated into two groups:

* Registers controlling user-mode operation
* Registers controlling kernel-mode operation

The architecture has two new XSAVE state components: one for
each group of registers. This _lets_ an OS manage them
separately if it chooses. Linux chooses to ... <explain the
design choice here, or why we don't care yet>.

Both XSAVE state components are supervisor states, even the
state controlling user-mode operation. This is a departure from
earlier features like protection keys where the PKRU state is
a normal user (non-supervisor) state. Having the user state be
supervisor-managed ensures there is no direct, unprivileged
access to it, making it harder for an attacker to subvert CET.

Also, IBT gunk is in here too, right? Let's at least *mention* that in
the changelog.

...
> /* All supervisor states including supported and unsupported states. */
> #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 3faf0f97edb1..0ee77ce4c753 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -362,6 +362,26 @@
>
>
> #define MSR_CORE_PERF_LIMIT_REASONS 0x00000690
> +
> +/* Control-flow Enforcement Technology MSRs */
> +#define MSR_IA32_U_CET 0x000006a0 /* user mode cet setting */
> +#define MSR_IA32_S_CET 0x000006a2 /* kernel mode cet setting */
> +#define CET_SHSTK_EN BIT_ULL(0)
> +#define CET_WRSS_EN BIT_ULL(1)
> +#define CET_ENDBR_EN BIT_ULL(2)
> +#define CET_LEG_IW_EN BIT_ULL(3)
> +#define CET_NO_TRACK_EN BIT_ULL(4)
> +#define CET_SUPPRESS_DISABLE BIT_ULL(5)
> +#define CET_RESERVED (BIT_ULL(6) | BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))

Would GENMASK_ULL() look any nicer here? I guess it's pretty clear
as-is that bits 6->9 are reserved.
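
For reference, the GENMASK_ULL() spelling would be something like this
(GENMASK_ULL(h, l) covers bits l through h, inclusive):

	#define CET_RESERVED			GENMASK_ULL(9, 6)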

> +#define CET_SUPPRESS BIT_ULL(10)
> +#define CET_WAIT_ENDBR BIT_ULL(11)

Are those bit fields common for both registers? It might be worth a
comment to mention that.

> +#define MSR_IA32_PL0_SSP 0x000006a4 /* kernel shadow stack pointer */
> +#define MSR_IA32_PL1_SSP 0x000006a5 /* ring-1 shadow stack pointer */
> +#define MSR_IA32_PL2_SSP 0x000006a6 /* ring-2 shadow stack pointer */

Are PL1/2 ever used in this implementation? If not, let's axe these
definitions.

> +#define MSR_IA32_PL3_SSP 0x000006a7 /* user shadow stack pointer */
> +#define MSR_IA32_INT_SSP_TAB 0x000006a8 /* exception shadow stack table */
> +
> #define MSR_GFX_PERF_LIMIT_REASONS 0x000006B0
> #define MSR_RING_PERF_LIMIT_REASONS 0x000006B1
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 02b3ddaf4f75..44397202762b 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -50,6 +50,8 @@ static const char *xfeature_names[] =
> "Processor Trace (unused)" ,
> "Protection Keys User registers",
> "PASID state",
> + "Control-flow User registers" ,
> + "Control-flow Kernel registers" ,
> "unknown xstate feature" ,
> "unknown xstate feature" ,
> "unknown xstate feature" ,
> @@ -73,6 +75,8 @@ static unsigned short xsave_cpuid_features[] __initdata = {
> [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
> [XFEATURE_PKRU] = X86_FEATURE_PKU,
> [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
> + [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
> + [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
> [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
> [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
> };
> @@ -250,6 +254,8 @@ static void __init print_xstate_features(void)
> print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
> print_xstate_feature(XFEATURE_MASK_PKRU);
> print_xstate_feature(XFEATURE_MASK_PASID);
> + print_xstate_feature(XFEATURE_MASK_CET_USER);
> + print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
> print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
> print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
> }
> @@ -405,6 +411,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
> XFEATURE_MASK_BNDREGS | \
> XFEATURE_MASK_BNDCSR | \
> XFEATURE_MASK_PASID | \
> + XFEATURE_MASK_CET_USER | \
> XFEATURE_MASK_XTILE)
>
> /*
> @@ -621,6 +628,8 @@ static bool __init check_xstate_against_struct(int nr)
> XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
> XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
> XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
> + XCHECK_SZ(sz, nr, XFEATURE_CET_USER, struct cet_user_state);
> + XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);
>
> /* The tile data size varies between implementations. */
> if (nr == XFEATURE_XTILE_DATA)
> @@ -634,7 +643,9 @@ static bool __init check_xstate_against_struct(int nr)
> if ((nr < XFEATURE_YMM) ||
> (nr >= XFEATURE_MAX) ||
> (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
> - ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
> + (nr == XFEATURE_RSRVD_COMP_13) ||
> + (nr == XFEATURE_RSRVD_COMP_14) ||
> + (nr == XFEATURE_RSRVD_COMP_16)) {
> WARN_ONCE(1, "no structure for xstate: %d\n", nr);
> XSTATE_WARN_ON(1);
> return false;

That if() is getting unwieldy. While I generally despise macros
implicitly modifying variables, this might be worth it. We could have a
local function variable:

bool feature_checked = false;

and then muck with it in the macro:

#define XCHECK_SZ(sz, nr, nr_macro, __struct) do {			\
	if (nr == nr_macro) {						\
		feature_checked = true;					\
		if (WARN_ONCE(sz != sizeof(__struct), ...)) {		\
			__xstate_dump_leaves();				\
		}							\
	}								\
} while (0)

Then the if() just makes sure the feature was checked instead of
checking for reserved features explicitly. We could also do:

bool c = false;

...

c |= XCHECK_SZ(sz, nr, XFEATURE_YMM, struct ymmh_struct);
c |= XCHECK_SZ(sz, nr, XFEATURE_BNDREGS, struct ...
c |= XCHECK_SZ(sz, nr, XFEATURE_BNDCSR, struct ...
...

but that starts to run into 80 columns. Those are both nice because
they mean you don't have to maintain a list of reserved features in the
code. Another option would be to define a:

bool xfeature_is_reserved(int nr)
{
switch (nr) {
case XFEATURE_RSRVD_COMP_13:
...

so the if() looks nicer and won't grow; the function will grow instead.

Either way, I think this needs some refactoring.
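
With the first variant, the tail of check_xstate_against_struct() would
then reduce to something like this sketch:

	/*
	 * Every feature with a struct must have been size-checked
	 * above; anything else, including the reserved components,
	 * is unexpected.
	 */
	if (!feature_checked) {
		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
		XSTATE_WARN_ON(1);
		return false;
	}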

2022-02-09 09:01:20

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 03/35] x86/cpufeatures: Add CET CPU feature flags for Control-flow Enforcement Technology (CET)

On Sun, Jan 30, 2022 at 01:18:06PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> Add CPU feature flags for Control-flow Enforcement Technology (CET).
>
> CPUID.(EAX=7,ECX=0):ECX[bit 7] Shadow stack
> CPUID.(EAX=7,ECX=0):EDX[bit 20] Indirect Branch Tracking

It looks like this only adds the SHSTK bit, maybe drop mention of IBT
here.

I wonder if we could land this (and the IBT part) without waiting for
everything else in the respective series?

-Kees

>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
>
> v1:
> - Remove IBT, can be added in a follow on IBT series.
>
> Yu-cheng v25:
> - Make X86_FEATURE_IBT depend on X86_FEATURE_SHSTK.
>
> Yu-cheng v24:
> - Update for splitting CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK and
> CONFIG_X86_IBT.
> - Move DISABLE_IBT definition to the IBT series.
>
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/disabled-features.h | 8 +++++++-
> arch/x86/kernel/cpu/cpuid-deps.c | 1 +
> 3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 6db4e2932b3d..c3eb94b13fef 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -355,6 +355,7 @@
> #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
> #define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
> #define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
> +#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
> #define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
> #define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
> #define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
> diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
> index 8f28fafa98b3..b7728f7afb2b 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -65,6 +65,12 @@
> # define DISABLE_SGX (1 << (X86_FEATURE_SGX & 31))
> #endif
>
> +#ifdef CONFIG_X86_SHADOW_STACK
> +#define DISABLE_SHSTK 0
> +#else
> +#define DISABLE_SHSTK (1 << (X86_FEATURE_SHSTK & 31))
> +#endif
> +
> /*
> * Make sure to add features to the correct mask
> */
> @@ -85,7 +91,7 @@
> #define DISABLED_MASK14 0
> #define DISABLED_MASK15 0
> #define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
> - DISABLE_ENQCMD)
> + DISABLE_ENQCMD|DISABLE_SHSTK)
> #define DISABLED_MASK17 0
> #define DISABLED_MASK18 0
> #define DISABLED_MASK19 0
> diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
> index c881bcafba7d..bf1b55a1ba21 100644
> --- a/arch/x86/kernel/cpu/cpuid-deps.c
> +++ b/arch/x86/kernel/cpu/cpuid-deps.c
> @@ -78,6 +78,7 @@ static const struct cpuid_dep cpuid_deps[] = {
> { X86_FEATURE_XFD, X86_FEATURE_XSAVES },
> { X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
> { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
> + { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
> {}
> };
>
> --
> 2.17.1
>

--
Kees Cook

2022-02-09 09:09:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Fri, Feb 04, 2022 at 01:08:25AM +0000, Edgecombe, Rick P wrote:

> > So how is that going to work? altstack is not an esoteric corner
> > case.
>
> My understanding is that the main usages for the signal stack were
> handling stack overflows and corruption. Since the shadow stack only
> contains return addresses rather than large stack allocations, and is
> not generally writable or pivotable, I thought there was a good
> possibility an alt shadow stack would not end up being especially
> useful. Does it seem like reasonable guesswork?

altstacks are also used in userspace threading implementations.

2022-02-09 09:11:10

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 06/35] x86/cet: Add control-protection fault handler

On 1/30/22 13:18, Rick Edgecombe wrote:
> A control-protection fault is triggered when a control-flow transfer
> attempt violates Shadow Stack or Indirect Branch Tracking constraints.
> For example, the return address for a RET instruction differs from the copy
> on the shadow stack; or an indirect JMP instruction, without the NOTRACK
> prefix, arrives at a non-ENDBR opcode.
>
> The control-protection fault handler works in a similar way as the general
> protection fault handler. It provides the si_code SEGV_CPERR to the signal
> handler.

It's not a big deal, but we should probably just remove IBT from the
changelogs for now.

> arch/arm/kernel/signal.c | 2 +-
> arch/arm64/kernel/signal.c | 2 +-
> arch/arm64/kernel/signal32.c | 2 +-
> arch/sparc/kernel/signal32.c | 2 +-
> arch/sparc/kernel/signal_64.c | 2 +-
> arch/x86/include/asm/idtentry.h | 4 ++
> arch/x86/kernel/idt.c | 4 ++
> arch/x86/kernel/signal_compat.c | 2 +-
> arch/x86/kernel/traps.c | 62 ++++++++++++++++++++++++++++++
> include/uapi/asm-generic/siginfo.h | 3 +-
> 10 files changed, 78 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
> index c532a6041066..59aaadce9d52 100644
> --- a/arch/arm/kernel/signal.c
> +++ b/arch/arm/kernel/signal.c
> @@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
> */
> static_assert(NSIGILL == 11);
> static_assert(NSIGFPE == 15);
> -static_assert(NSIGSEGV == 9);
> +static_assert(NSIGSEGV == 10);
> static_assert(NSIGBUS == 5);
> static_assert(NSIGTRAP == 6);
> static_assert(NSIGCHLD == 6);
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index d8aaf4b6f432..d2da57c415b8 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -983,7 +983,7 @@ void __init minsigstksz_setup(void)
> */
> static_assert(NSIGILL == 11);
> static_assert(NSIGFPE == 15);
> -static_assert(NSIGSEGV == 9);
> +static_assert(NSIGSEGV == 10);
> static_assert(NSIGBUS == 5);
> static_assert(NSIGTRAP == 6);
> static_assert(NSIGCHLD == 6);
> diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
> index d984282b979f..8776a34c6444 100644
> --- a/arch/arm64/kernel/signal32.c
> +++ b/arch/arm64/kernel/signal32.c
> @@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
> */
> static_assert(NSIGILL == 11);
> static_assert(NSIGFPE == 15);
> -static_assert(NSIGSEGV == 9);
> +static_assert(NSIGSEGV == 10);
> static_assert(NSIGBUS == 5);
> static_assert(NSIGTRAP == 6);
> static_assert(NSIGCHLD == 6);
> diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
> index 6cc124a3bb98..dc50b2a78692 100644
> --- a/arch/sparc/kernel/signal32.c
> +++ b/arch/sparc/kernel/signal32.c
> @@ -752,7 +752,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
> */
> static_assert(NSIGILL == 11);
> static_assert(NSIGFPE == 15);
> -static_assert(NSIGSEGV == 9);
> +static_assert(NSIGSEGV == 10);
> static_assert(NSIGBUS == 5);
> static_assert(NSIGTRAP == 6);
> static_assert(NSIGCHLD == 6);
> diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
> index 2a78d2af1265..7fe2bd37bd1a 100644
> --- a/arch/sparc/kernel/signal_64.c
> +++ b/arch/sparc/kernel/signal_64.c
> @@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
> */
> static_assert(NSIGILL == 11);
> static_assert(NSIGFPE == 15);
> -static_assert(NSIGSEGV == 9);
> +static_assert(NSIGSEGV == 10);
> static_assert(NSIGBUS == 5);
> static_assert(NSIGTRAP == 6);
> static_assert(NSIGCHLD == 6);
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 1345088e9902..a90791433152 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS, exc_stack_segment);
> DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP, exc_general_protection);
> DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC, exc_alignment_check);
>
> +#ifdef CONFIG_X86_SHADOW_STACK
> +DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
> +#endif
> +
> /* Raw exception entries which need extra work */
> DECLARE_IDTENTRY_RAW(X86_TRAP_UD, exc_invalid_op);
> DECLARE_IDTENTRY_RAW(X86_TRAP_BP, exc_int3);
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index df0fa695bb09..9f1bdaabc246 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -113,6 +113,10 @@ static const __initconst struct idt_data def_idts[] = {
> #elif defined(CONFIG_X86_32)
> SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32),
> #endif
> +
> +#ifdef CONFIG_X86_SHADOW_STACK
> + INTG(X86_TRAP_CP, asm_exc_control_protection),
> +#endif
> };
>
> /*
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index b52407c56000..ff50cd978ea5 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
> */
> BUILD_BUG_ON(NSIGILL != 11);
> BUILD_BUG_ON(NSIGFPE != 15);
> - BUILD_BUG_ON(NSIGSEGV != 9);
> + BUILD_BUG_ON(NSIGSEGV != 10);
> BUILD_BUG_ON(NSIGBUS != 5);
> BUILD_BUG_ON(NSIGTRAP != 6);
> BUILD_BUG_ON(NSIGCHLD != 6);
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index c9d566dcf89a..54b7a146fd5e 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -39,6 +39,7 @@
> #include <linux/io.h>
> #include <linux/hardirq.h>
> #include <linux/atomic.h>
> +#include <linux/nospec.h>
>
> #include <asm/stacktrace.h>
> #include <asm/processor.h>
> @@ -641,6 +642,67 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> cond_local_irq_disable(regs);
> }
>
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static const char * const control_protection_err[] = {
> + "unknown",
> + "near-ret",
> + "far-ret/iret",
> + "endbranch",
> + "rstorssp",
> + "setssbsy",
> + "unknown",
> +};
> +
> +static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
> +
> +/*
> + * When a control protection exception occurs, send a signal to the responsible
> + * application. Currently, control protection is only enabled for user mode.
> + * This exception should not come from kernel mode.
> + */

Please move that last sentence to the code which enforces that expectation.

> +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +{
> + struct task_struct *tsk;
> +
> + if (!user_mode(regs)) {
> + die("kernel control protection fault", regs, error_code);
> + panic("Unexpected kernel control protection fault. Machine halted.");
> + }

s/ Machine halted.//

I think they'll get the point when they see "kernel panic".
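
I.e., just:

	if (!user_mode(regs)) {
		die("kernel control protection fault", regs, error_code);
		panic("Unexpected kernel control protection fault.");
	}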

> +
> + cond_local_irq_enable(regs);
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> + WARN_ONCE(1, "Control protection fault with CET support disabled\n");
> +
> + tsk = current;
> + tsk->thread.error_code = error_code;
> + tsk->thread.trap_nr = X86_TRAP_CP;
> +
> + /*
> + * Ratelimit to prevent log spamming.
> + */
> + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> + __ratelimit(&cpf_rate)) {
> + unsigned long ssp;
> + int cpf_type;
> +
> + cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));

Isn't 'error_code' generated by the hardware? Is this defending against
userspace which can somehow trigger this with an arbitrary 'error_code'?

I'm also not sure I like using array_index_nospec() as the *only* bounds
checking on the array. Is that the way folks are using it these days?
Even the comment above it has a pattern like this:

> * if (index < size) {
> * index = array_index_nospec(index, size);
> * val = array[index];
> * }


> + rdmsrl(MSR_IA32_PL3_SSP, ssp);
> + pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
> + tsk->comm, task_pid_nr(tsk),
> + regs->ip, regs->sp, ssp, error_code,
> + control_protection_err[cpf_type]);
> + print_vma_addr(KERN_CONT " in ", regs->ip);
> + pr_cont("\n");
> + }
> +
> + force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> + cond_local_irq_disable(regs);
> +}
> +#endif
> +
> static bool do_int3(struct pt_regs *regs)
> {
> int res;
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index 3ba180f550d7..081f4b37d22c 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -240,7 +240,8 @@ typedef struct siginfo {
> #define SEGV_ADIPERR 7 /* Precise MCD exception */
> #define SEGV_MTEAERR 8 /* Asynchronous ARM MTE error */
> #define SEGV_MTESERR 9 /* Synchronous ARM MTE exception */
> -#define NSIGSEGV 9
> +#define SEGV_CPERR 10 /* Control protection fault */
> +#define NSIGSEGV 10
>
> /*
> * SIGBUS si_codes


2022-02-09 09:53:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace



On Tue, Feb 8, 2022, at 1:29 AM, Cyrill Gorcunov wrote:
> On Tue, Feb 08, 2022 at 11:16:51AM +0200, Mike Rapoport wrote:
>>
>> > Any thoughts on how you would _like_ to see this resolved?
>>
>> Ideally, CRIU will need a knob that will tell the kernel/CET machinery
>> where the next RET will jump, along the lines of
>> restore_signal_shadow_stack() AFAIU.
>>
>> But such a knob will immediately reduce the security value of the entire
>> thing, and I don't have good ideas how to deal with it :(
>
> Probably a kind of latch in the task_struct which would trigger off once
> a return to a different address happened, thus we would be able to jump
> inside parasite code. Of course such a trigger should be available under
> proper capability only.

I'm not fully in touch with how parasite, etc works. Are we talking about save or restore? If it's restore, what exactly does CRIU need to do? Is it just that CRIU needs to return out from its resume code into the to-be-resumed program without tripping CET? Would it be acceptable for CRIU to require that at least one shstk slot be free at save time? Or do we need a mechanism to atomically switch to a completely full shadow stack at resume?

Off the top of my head, a sigreturn (or sigreturn-like mechanism) that is intended for use for altshadowstack could safely verify a token on the altshadowstack, possibly compare to something in ucontext (or not -- this isn't clearly necessary) and switch back to the previous stack. CRIU could use that too. Obviously CRIU will need a way to populate the relevant stacks, but WRUSS can be used for that, and I think this is a fundamental requirement for CRIU -- CRIU restore absolutely needs a way to write the saved shadow stack data into the shadow stack.

So I think the only special capability that CRIU really needs is WRUSS, and we need to wire that up anyway.
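
For the record, WRUSS is the ring-0 instruction for writing user shadow stack memory. A rough sketch of the kind of kernel helper I mean -- the name and the exact exception-table plumbing here are guesses on my part, not necessarily what the series does:

	static inline int write_user_shstk_64(u64 __user *addr, u64 val)
	{
		/* wrussq writes val to user shadow stack memory at addr;
		 * a fault branches to the 'fail' label via the extable. */
		asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
				  _ASM_EXTABLE(1b, %l[fail])
				  : : [addr] "r" (addr), [val] "r" (val)
				  : : fail);
		return 0;
	fail:
		return -EFAULT;
	}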

2022-02-09 09:55:42

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 06/35] x86/cet: Add control-protection fault handler

On Mon, 2022-02-07 at 15:56 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > A control-protection fault is triggered when a control-flow
> > transfer
> > attempt violates Shadow Stack or Indirect Branch Tracking
> > constraints.
> > For example, the return address for a RET instruction differs from
> > the copy
> > on the shadow stack; or an indirect JMP instruction, without the
> > NOTRACK
> > prefix, arrives at a non-ENDBR opcode.
> >
> > The control-protection fault handler works in a similar way as the
> > general
> > protection fault handler. It provides the si_code SEGV_CPERR to
> > the signal
> > handler.
>
> It's not a big deal, but we should probably just remove IBT from the
> changelogs for now.

Makes sense. I should have scrubbed these better for IBT.

>
> > arch/arm/kernel/signal.c | 2 +-
> > arch/arm64/kernel/signal.c | 2 +-
> > arch/arm64/kernel/signal32.c | 2 +-
> > arch/sparc/kernel/signal32.c | 2 +-
> > arch/sparc/kernel/signal_64.c | 2 +-
> > arch/x86/include/asm/idtentry.h | 4 ++
> > arch/x86/kernel/idt.c | 4 ++
> > arch/x86/kernel/signal_compat.c | 2 +-
> > arch/x86/kernel/traps.c | 62
> > ++++++++++++++++++++++++++++++
> > include/uapi/asm-generic/siginfo.h | 3 +-
> > 10 files changed, 78 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
> > index c532a6041066..59aaadce9d52 100644
> > --- a/arch/arm/kernel/signal.c
> > +++ b/arch/arm/kernel/signal.c
> > @@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs
> > *regs)
> > */
> > static_assert(NSIGILL == 11);
> > static_assert(NSIGFPE == 15);
> > -static_assert(NSIGSEGV == 9);
> > +static_assert(NSIGSEGV == 10);
> > static_assert(NSIGBUS == 5);
> > static_assert(NSIGTRAP == 6);
> > static_assert(NSIGCHLD == 6);
> > diff --git a/arch/arm64/kernel/signal.c
> > b/arch/arm64/kernel/signal.c
> > index d8aaf4b6f432..d2da57c415b8 100644
> > --- a/arch/arm64/kernel/signal.c
> > +++ b/arch/arm64/kernel/signal.c
> > @@ -983,7 +983,7 @@ void __init minsigstksz_setup(void)
> > */
> > static_assert(NSIGILL == 11);
> > static_assert(NSIGFPE == 15);
> > -static_assert(NSIGSEGV == 9);
> > +static_assert(NSIGSEGV == 10);
> > static_assert(NSIGBUS == 5);
> > static_assert(NSIGTRAP == 6);
> > static_assert(NSIGCHLD == 6);
> > diff --git a/arch/arm64/kernel/signal32.c
> > b/arch/arm64/kernel/signal32.c
> > index d984282b979f..8776a34c6444 100644
> > --- a/arch/arm64/kernel/signal32.c
> > +++ b/arch/arm64/kernel/signal32.c
> > @@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct
> > pt_regs *regs)
> > */
> > static_assert(NSIGILL == 11);
> > static_assert(NSIGFPE == 15);
> > -static_assert(NSIGSEGV == 9);
> > +static_assert(NSIGSEGV == 10);
> > static_assert(NSIGBUS == 5);
> > static_assert(NSIGTRAP == 6);
> > static_assert(NSIGCHLD == 6);
> > diff --git a/arch/sparc/kernel/signal32.c
> > b/arch/sparc/kernel/signal32.c
> > index 6cc124a3bb98..dc50b2a78692 100644
> > --- a/arch/sparc/kernel/signal32.c
> > +++ b/arch/sparc/kernel/signal32.c
> > @@ -752,7 +752,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr,
> > u32 u_ossptr, unsigned long sp)
> > */
> > static_assert(NSIGILL == 11);
> > static_assert(NSIGFPE == 15);
> > -static_assert(NSIGSEGV == 9);
> > +static_assert(NSIGSEGV == 10);
> > static_assert(NSIGBUS == 5);
> > static_assert(NSIGTRAP == 6);
> > static_assert(NSIGCHLD == 6);
> > diff --git a/arch/sparc/kernel/signal_64.c
> > b/arch/sparc/kernel/signal_64.c
> > index 2a78d2af1265..7fe2bd37bd1a 100644
> > --- a/arch/sparc/kernel/signal_64.c
> > +++ b/arch/sparc/kernel/signal_64.c
> > @@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs,
> > unsigned long orig_i0, unsigned long
> > */
> > static_assert(NSIGILL == 11);
> > static_assert(NSIGFPE == 15);
> > -static_assert(NSIGSEGV == 9);
> > +static_assert(NSIGSEGV == 10);
> > static_assert(NSIGBUS == 5);
> > static_assert(NSIGTRAP == 6);
> > static_assert(NSIGCHLD == 6);
> > diff --git a/arch/x86/include/asm/idtentry.h
> > b/arch/x86/include/asm/idtentry.h
> > index 1345088e9902..a90791433152 100644
> > --- a/arch/x86/include/asm/idtentry.h
> > +++ b/arch/x86/include/asm/idtentry.h
> > @@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,
> > exc_stack_segment);
> > DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP, exc_general_protection)
> > ;
> > DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC, exc_alignment_check);
> >
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
> > +#endif
> > +
> > /* Raw exception entries which need extra work */
> > DECLARE_IDTENTRY_RAW(X86_TRAP_UD, exc_invalid_op);
> > DECLARE_IDTENTRY_RAW(X86_TRAP_BP, exc_int3);
> > diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> > index df0fa695bb09..9f1bdaabc246 100644
> > --- a/arch/x86/kernel/idt.c
> > +++ b/arch/x86/kernel/idt.c
> > @@ -113,6 +113,10 @@ static const __initconst struct idt_data
> > def_idts[] = {
> > #elif defined(CONFIG_X86_32)
> > SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32),
> > #endif
> > +
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > + INTG(X86_TRAP_CP, asm_exc_control_protection),
> > +#endif
> > };
> >
> > /*
> > diff --git a/arch/x86/kernel/signal_compat.c
> > b/arch/x86/kernel/signal_compat.c
> > index b52407c56000..ff50cd978ea5 100644
> > --- a/arch/x86/kernel/signal_compat.c
> > +++ b/arch/x86/kernel/signal_compat.c
> > @@ -27,7 +27,7 @@ static inline void
> > signal_compat_build_tests(void)
> > */
> > BUILD_BUG_ON(NSIGILL != 11);
> > BUILD_BUG_ON(NSIGFPE != 15);
> > - BUILD_BUG_ON(NSIGSEGV != 9);
> > + BUILD_BUG_ON(NSIGSEGV != 10);
> > BUILD_BUG_ON(NSIGBUS != 5);
> > BUILD_BUG_ON(NSIGTRAP != 6);
> > BUILD_BUG_ON(NSIGCHLD != 6);
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index c9d566dcf89a..54b7a146fd5e 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -39,6 +39,7 @@
> > #include <linux/io.h>
> > #include <linux/hardirq.h>
> > #include <linux/atomic.h>
> > +#include <linux/nospec.h>
> >
> > #include <asm/stacktrace.h>
> > #include <asm/processor.h>
> > @@ -641,6 +642,67 @@
> > DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> > cond_local_irq_disable(regs);
> > }
> >
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static const char * const control_protection_err[] = {
> > + "unknown",
> > + "near-ret",
> > + "far-ret/iret",
> > + "endbranch",
> > + "rstorssp",
> > + "setssbsy",
> > + "unknown",
> > +};
> > +
> > +static DEFINE_RATELIMIT_STATE(cpf_rate,
> > DEFAULT_RATELIMIT_INTERVAL,
> > + DEFAULT_RATELIMIT_BURST);
> > +
> > +/*
> > + * When a control protection exception occurs, send a signal to
> > the responsible
> > + * application. Currently, control protection is only enabled for
> > user mode.
> > + * This exception should not come from kernel mode.
> > + */
>
> Please move that last sentence to the code which enforces that
> expectation.

Ok.

>
> > +DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> > +{
> > + struct task_struct *tsk;
> > +
> > + if (!user_mode(regs)) {
> > + die("kernel control protection fault", regs,
> > error_code);
> > + panic("Unexpected kernel control protection
> > fault. Machine halted.");
> > + }
>
> s/ Machine halted.//
>
> I think they'll get the point when they see "kernel panic".

Ok.

>
> > +
> > + cond_local_irq_enable(regs);
> > +
> > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > + WARN_ONCE(1, "Control protection fault with CET support
> > disabled\n");
> > +
> > + tsk = current;
> > + tsk->thread.error_code = error_code;
> > + tsk->thread.trap_nr = X86_TRAP_CP;
> > +
> > + /*
> > + * Ratelimit to prevent log spamming.
> > + */
> > + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> > + __ratelimit(&cpf_rate)) {
> > + unsigned long ssp;
> > + int cpf_type;
> > +
> > + cpf_type = array_index_nospec(error_code,
> > ARRAY_SIZE(control_protection_err));
>
> Isn't 'error_code' generated by the hardware? Is this defending
> against
> userspace which can somehow get trigger this with an arbitrary
> 'error_code'?
>
> I'm also not sure I like using array_index_nospec() as the *only*
> bounds
> checking on the array. Is that the way folks are using it these
> days?

Yea, I was wondering about that too. It looks like it came from this
comment:
https://lore.kernel.org/lkml/202102041201.C2B93F8D8A@keescook/
...which didn't raise any speculation concerns. What it does do though,
is massage the index to 0 if it is out of bounds, leading to the
"unknown" message being selected for out of bounds error codes. If
that's the purpose though, I'm not sure why "unknown" is also the last
element of the array though.

I think maybe this will not typically be a fast path, and so
conditional logic would be easier to read and better on balance.

I'm now realizing that this is missing the "ENCL" error code bit (15)
which denotes #CP during enclave execution.
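
Something like this sketch is what I have in mind (CP_ENCL here is a
hypothetical define for error code bit 15):

	unsigned long err = error_code & ~CP_ENCL;
	int cpf_type = 0;	/* "unknown" */

	if (err < ARRAY_SIZE(control_protection_err))
		cpf_type = err;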

> Even the comment above it has a pattern like this:
>
> > * if (index < size) {
> > * index = array_index_nospec(index, size);
> > * val = array[index];
> > * }
>
>
> > + rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > + pr_emerg("%s[%d] control protection ip:%lx sp:%lx
> > ssp:%lx error:%lx(%s)",
> > + tsk->comm, task_pid_nr(tsk),
> > + regs->ip, regs->sp, ssp, error_code,
> > + control_protection_err[cpf_type]);
> > + print_vma_addr(KERN_CONT " in ", regs->ip);
> > + pr_cont("\n");
> > + }
> > +
> > + force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> > + cond_local_irq_disable(regs);
> > +}
> > +#endif
> > +
> > static bool do_int3(struct pt_regs *regs)
> > {
> > int res;
> > diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-
> > generic/siginfo.h
> > index 3ba180f550d7..081f4b37d22c 100644
> > --- a/include/uapi/asm-generic/siginfo.h
> > +++ b/include/uapi/asm-generic/siginfo.h
> > @@ -240,7 +240,8 @@ typedef struct siginfo {
> > #define SEGV_ADIPERR 7 /* Precise MCD exception */
> > #define SEGV_MTEAERR 8 /* Asynchronous ARM MTE error */
> > #define SEGV_MTESERR 9 /* Synchronous ARM MTE exception */
> > -#define NSIGSEGV 9
> > +#define SEGV_CPERR 10 /* Control protection fault */
> > +#define NSIGSEGV 10
> >
> > /*
> > * SIGBUS si_codes
>
>

2022-02-09 09:58:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 26/35] x86/process: Change copy_thread() argument 'arg' to 'stack_size'

On Sun, Jan 30 2022 at 13:18, Rick Edgecombe wrote:
> -int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> - struct task_struct *p, unsigned long tls)
> +int copy_thread(unsigned long clone_flags, unsigned long sp,
> + unsigned long stack_size, struct task_struct *p,
> + unsigned long tls)
> {
> struct inactive_task_frame *frame;
> struct fork_frame *fork_frame;
> @@ -175,7 +176,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> if (unlikely(p->flags & PF_KTHREAD)) {
> p->thread.pkru = pkru_get_init_value();
> memset(childregs, 0, sizeof(struct pt_regs));
> - kthread_frame_init(frame, sp, arg);
> + kthread_frame_init(frame, sp, stack_size);
> return 0;
> }
>
> @@ -208,7 +209,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> */
> childregs->sp = 0;
> childregs->ip = 0;
> - kthread_frame_init(frame, sp, arg);
> + kthread_frame_init(frame, sp, stack_size);
> return 0;
> }

Can you please change the prototypes too for completeness' sake?
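
I.e., also the generic declaration -- a sketch, assuming it still lives
in include/linux/sched/task.h:

	extern int copy_thread(unsigned long clone_flags, unsigned long sp,
			       unsigned long stack_size, struct task_struct *p,
			       unsigned long tls);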

Thanks,

tglx

2022-02-09 10:42:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 09/35] x86/mm: Introduce _PAGE_COW

On 1/30/22 13:18, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0, Dirty=1.
>
> The reason it's lightly used is that Dirty=1 is normally set by hardware
> and cannot normally be set by hardware on a Write=0 PTE. Software must
> normally be involved to create one of these PTEs, so software can simply
> opt to not create them.

This is kinda skipping over something important:

The reason it's lightly used is that Dirty=1 is normally set
_before_ a write. A write with a Write=0 PTE would typically
only generate a fault, not set Dirty=1. Hardware can (rarely)
both set Dirty=1 *and* generate the fault, resulting in a
Dirty=1,Write=0 PTE. Hardware which supports shadow stacks
will no longer exhibit this oddity.

> In places where Linux normally creates Write=0, Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0, Dirty=1, it instead creates
> Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This
> clearly separates shadow stack from other data, and results in the
> following:

Following _what_... What are these? I think they're PTE states. Best
to say that.

> (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)

Could you give an example of this? Would this be a typical anonymous
page which was Write=1,Dirty=1, then historically made Write=0,Dirty=1
at fork()?

> (b) A R/O page that has been COW'ed: (Write=0, Cow=1)
> The user page is in a R/O VMA, and get_user_pages() needs a writable
> copy. The page fault handler creates a copy of the page and sets
> the new copy's PTE as Write=0 and Cow=1.
> (c) A shadow stack PTE: (Write=0, Dirty=1)
> (d) A shared shadow stack PTE: (Write=0, Cow=1)
> When a shadow stack page is being shared among processes (this happens
> at fork()), its PTE is made Dirty=0, so the next shadow stack access
> causes a fault, and the page is duplicated and Dirty=1 is set again.
> This is the COW equivalent for shadow stack pages, even though it's
> copy-on-access rather than copy-on-write.

Just like code, it's also nice to format these in a way which allows
them to be visually compared, trivially. So, let's expand all the bits
and vertically align everything. To break this down a bit, we have two
old states:

[a] (Write=0, Dirty=0, Cow=1)
[b] (Write=0, Dirty=0, Cow=1)

And two new ones:

[c] (Write=0, Dirty=1, Cow=0)
[d] (Write=0, Dirty=0, Cow=1)

That makes me wonder what the difference is between [a] and [b] and why
they are separate. Is their handling different? How are those two
states differentiated?

> (e) A page where the processor observed a Write=1 PTE, started a write, set
> Dirty=1, but then observed a Write=0 PTE. That's possible today, but
> will not happen on processors that support shadow stack.

This left me wondering how you are going to detangle the mess where PTEs
look like shadow-stack PTEs on non-shadow-stack hardware. Could you
cover that here?

You can shorten that above bullet to this to help make the space:

(e) (Write=0, Dirty=1, Cow=0) PTE created when a processor
without shadow stack support set Dirty=1.


> Define _PAGE_COW and update pte_*() helpers and apply the same changes to
> pmd and pud.
>
> After this, there are six free bits left in the 64-bit PTE, and no more
> free bits in the 32-bit PTE (except for PAE) and Shadow Stack is not
> implemented for the 32-bit kernel.

Just say:

There are six bits left available to software in the 64-bit PTE
after consuming a bit for _PAGE_COW. No space is consumed in
32-bit kernels because shadow stacks are not enabled there.

There's no need to rub it in that 32-bit is out of space.

> -static inline int pte_dirty(pte_t pte)
> +static inline bool pte_dirty(pte_t pte)
> {
> - return pte_flags(pte) & _PAGE_DIRTY;
> + /*
> + * A dirty PTE has Dirty=1 or Cow=1.
> + */

I don't really like that comment because "Cow" isn't anywhere to be found.

> + return pte_flags(pte) & _PAGE_DIRTY_BITS;
> +}
> +
> +static inline bool pte_shstk(pte_t pte)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> + return false;
> +
> + return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
> }
>
> static inline int pte_young(pte_t pte)
> @@ -133,9 +144,20 @@ static inline int pte_young(pte_t pte)
> return pte_flags(pte) & _PAGE_ACCESSED;
> }
>
> -static inline int pmd_dirty(pmd_t pmd)
> +static inline bool pmd_dirty(pmd_t pmd)
> {
> - return pmd_flags(pmd) & _PAGE_DIRTY;
> + /*
> + * A dirty PMD has Dirty=1 or Cow=1.
> + */
> + return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
> +}
> +
> +static inline bool pmd_shstk(pmd_t pmd)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> + return false;
> +
> + return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
> }
>
> static inline int pmd_young(pmd_t pmd)
> @@ -143,9 +165,12 @@ static inline int pmd_young(pmd_t pmd)
> return pmd_flags(pmd) & _PAGE_ACCESSED;
> }
>
> -static inline int pud_dirty(pud_t pud)
> +static inline bool pud_dirty(pud_t pud)
> {
> - return pud_flags(pud) & _PAGE_DIRTY;
> + /*
> + * A dirty PUD has Dirty=1 or Cow=1.
> + */
> + return pud_flags(pud) & _PAGE_DIRTY_BITS;
> }
>
> static inline int pud_young(pud_t pud)
> @@ -155,13 +180,23 @@ static inline int pud_young(pud_t pud)
>
> static inline int pte_write(pte_t pte)
> {
> - return pte_flags(pte) & _PAGE_RW;
> + /*
> + * Shadow stack pages are always writable - but not by normal
> + * instructions, and only by shadow stack operations. Therefore,
> + * the W=0,D=1 test with pte_shstk().
> + */

I think that comment is off a bit. It's not really connected to the
code. We don't, for instance need to know what the bit combination is
inside pte_shstk(). Further, it's a bit mean to talk about "W" in the
comment and _PAGE_RW in the code. How about:

/*
* Shadow stack pages are logically writable, but do not have
* _PAGE_RW. Check for them separately from _PAGE_RW itself.
*/

> + return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
> }
>
> #define pmd_write pmd_write
> static inline int pmd_write(pmd_t pmd)
> {
> - return pmd_flags(pmd) & _PAGE_RW;
> + /*
> + * Shadow stack pages are always writable - but not by normal
> + * instructions, and only by shadow stack operations. Therefore,
> + * the W=0,D=1 test with pmd_shstk().
> + */
> + return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
> }

Ditto on the comment. Please copy the pte_write() one here too.

>
> #define pud_write pud_write
> @@ -299,6 +334,24 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
> return native_make_pte(v & ~clear);
> }
>
> +static inline pte_t pte_mkcow(pte_t pte)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> + return pte;
> +
> + pte = pte_clear_flags(pte, _PAGE_DIRTY);
> + return pte_set_flags(pte, _PAGE_COW);
> +}
> +
> +static inline pte_t pte_clear_cow(pte_t pte)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> + return pte;
> +
> + pte = pte_set_flags(pte, _PAGE_DIRTY);
> + return pte_clear_flags(pte, _PAGE_COW);
> +}

I think we need to say *SOMETHING* about the X86_FEATURE_SHSTK and
_PAGE_COW connection here. Otherwise they look like two random features
that are interacting in an unknown way.

Maybe even something this simple:

/*
* _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
* See the _PAGE_COW definition for more details.
*/

Also, the manipulation of _PAGE_DIRTY is not clear here. It's obvious
why we have to:

pte_clear_flags(pte, _PAGE_COW);

in a function called pte_clear_cow() but, again, how does _PAGE_DIRTY fit?

> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> static inline int pte_uffd_wp(pte_t pte)
> {
> @@ -318,7 +371,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
>
> static inline pte_t pte_mkclean(pte_t pte)
> {
> - return pte_clear_flags(pte, _PAGE_DIRTY);
> + return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
> }
>
> static inline pte_t pte_mkold(pte_t pte)
> @@ -328,7 +381,16 @@ static inline pte_t pte_mkold(pte_t pte)
>
> static inline pte_t pte_wrprotect(pte_t pte)
> {
> - return pte_clear_flags(pte, _PAGE_RW);
> + pte = pte_clear_flags(pte, _PAGE_RW);
> +
> + /*
> + * Blindly clearing _PAGE_RW might accidentally create
> + * a shadow stack PTE (RW=0, Dirty=1). Move the hardware

Could you grep this series and try to be consistent about the formatting
here? (Not that I've been perfect in this regard either). I think we
have at least:

Write=X,Dirty=Y
W=X,D=Y
RW=X,Dirty=Y

> + * dirty value to the software bit.
> + */
> + if (pte_dirty(pte))
> + pte = pte_mkcow(pte);
> + return pte;
> }

One of my logical checks for this is "does it all go away when this is
compiled out". Because of this:

+static inline pte_t pte_mkcow(pte_t pte)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pte;
...

the answer is yes! So, this looks good to me. Just thought I'd share a
bit of my thought process.

> static inline pte_t pte_mkexec(pte_t pte)
> @@ -338,7 +400,18 @@ static inline pte_t pte_mkexec(pte_t pte)
>
> static inline pte_t pte_mkdirty(pte_t pte)
> {
> - return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
> + pteval_t dirty = _PAGE_DIRTY;
> +
> + /* Avoid creating (HW)Dirty=1, Write=0 PTEs */

The "(HW)" thing doesn't make a lot of sense any longer. I think we had
a set of HWDirty and SWDirty bits, but SWDirty ended up being morphed
over to _PAGE_COW.

> + if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
> + dirty = _PAGE_COW;
> +
> + return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> +}
> +
> +static inline pte_t pte_mkwrite_shstk(pte_t pte)
> +{
> + return pte_clear_cow(pte);
> }

This one is a bit of black magic. This is taking a PTE from
(presumably) state [d] to state [c] from earlier in the changelog.

Write=0,Dirty=0,Cow=1
to
Write=0,Dirty=1,Cow=0

It's hard to wrap my head around how clearing a software bit (from the
naming) will make this PTE writable.

There's either something wrong with the naming, or something wrong with
my mental model of what "COW clearing" is.

> static inline pte_t pte_mkyoung(pte_t pte)
> @@ -348,7 +421,12 @@ static inline pte_t pte_mkyoung(pte_t pte)
>
> static inline pte_t pte_mkwrite(pte_t pte)
> {
> - return pte_set_flags(pte, _PAGE_RW);
> + pte = pte_set_flags(pte, _PAGE_RW);
> +
> + if (pte_dirty(pte))
> + pte = pte_clear_cow(pte);
> +
> + return pte;
> }

Along the same lines as the last few comments, this leaves me wondering
why a pte_dirty() can't also be a "COW PTE".

... <snipping the pmd/pud copies> ...
> #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 3781a79b6388..1bfab70ff9ac 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -21,7 +21,8 @@
> #define _PAGE_BIT_SOFTW2 10 /* " */
> #define _PAGE_BIT_SOFTW3 11 /* " */
> #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
> +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
> +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
> #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
> #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
> #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
> @@ -34,6 +35,15 @@
> #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
> #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
>
> +/*
> + * Indicates a copy-on-write page.
> + */
> +#ifdef CONFIG_X86_SHADOW_STACK
> +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
> +#else
> +#define _PAGE_BIT_COW 0
> +#endif
> +
> /* If _PAGE_BIT_PRESENT is clear, we use these: */
> /* - if the user mapped it with PROT_NONE; pte_present gives true */
> #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
> @@ -115,6 +125,36 @@
> #define _PAGE_DEVMAP (_AT(pteval_t, 0))
> #endif
>
> +/*
> + * The hardware requires shadow stack to be read-only and Dirty.
> + * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
> + * from shadow stack PTEs:
> + * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
> + * (b) A R/O page that has been COW'ed: (Write=0, Cow=1)
> + * The user page is in a R/O VMA, and get_user_pages() needs a
> + * writable copy. The page fault handler creates a copy of the page
> + * and sets the new copy's PTE as Write=0, Cow=1.
> + * (c) A shadow stack PTE: (Write=0, Dirty=1)
> + * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1)
> + * When a shadow stack page is being shared among processes (this
> + * happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next
> + * shadow stack access causes a fault, and the page is duplicated and
> + * _PAGE_DIRTY is set again. This is the COW equivalent for shadow
> + * stack pages, even though it's copy-on-access rather than
> + * copy-on-write.
> + * (e) A page where the processor observed a Write=1 PTE, started a write,
> + * set Dirty=1, but then observed a Write=0 PTE (changed by another
> + * thread). That's possible today, but will not happen on processors
> + * that support shadow stack.

This info, again, is great. Let's keep it, but please do reformat it
like the changelog version to make the bit states easier to grok.

2022-02-09 11:08:34

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 05/35] x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states

On Mon, 2022-02-07 at 15:28 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > Control-flow Enforcement Technology (CET) introduces these MSRs:
> >
> > MSR_IA32_U_CET (user-mode CET settings),
> > MSR_IA32_PL3_SSP (user-mode shadow stack pointer),
> >
> > MSR_IA32_PL0_SSP (kernel-mode shadow stack pointer),
> > MSR_IA32_PL1_SSP (Privilege Level 1 shadow stack pointer),
> > MSR_IA32_PL2_SSP (Privilege Level 2 shadow stack pointer),
> > MSR_IA32_S_CET (kernel-mode CET settings),
> > MSR_IA32_INT_SSP_TAB (exception shadow stack table).
>
> To be honest, I'm not sure this is very valuable. It's *VERY* close
> to
> the exact information in the structure definitions. It's also not
> obviously related to XSAVE. It's more of the "what" this patch does
> than the "why". Good changelogs talk about "why".

Ok I'll look at re-wording this.

>
> > The two user-mode MSRs belong to XFEATURE_CET_USER. The first
> > three of
> > kernel-mode MSRs belong to XFEATURE_CET_KERNEL. Both XSAVES states
> > are
> > supervisor states. This means that there is no direct,
> > unprivileged access
> > to these states, making it harder for an attacker to subvert CET.

Oh, well I guess this *is* mentioned elsewhere than in patch 3.

>
> Forgive me while I go into changelog lecture mode for a moment.
>
> I was constantly looking up at the list of MSRs and trying to
> reconcile
> them with this paragraph. Imagine if you had started out this
> changelog
> by saying:
>
> Shadow stack register state can be managed with XSAVE. The
> registers can logically be separated into two groups:
>
> * Registers controlling user-mode operation
> * Registers controlling kernel-mode operation
>
> The architecture has two new XSAVE state components: one for
> each group of registers. This _lets_ an OS manage them
> separately if it chooses. Linux chooses to ... <explain the
> design choice here, or why we don't care yet>.
>
> Both XSAVE state components are supervisor states, even the
> state controlling user-mode operation. This is a departure
> from
> earlier features like protection keys where the PKRU state is
> a normal user (non-supervisor) state. Having the user state be
>
> supervisor-managed ensures there is no direct, unprivileged
> access to it, making it harder for an attacker to subvert CET.
>
> Also, IBT gunk is in here too, right? Let's at least *mention* that
> in
> the changelog.

We can remove the IBT stuff if it's better. I always appreciate finding
the unused features in headers when hacking around. But it all adds to
build time slightly I guess.

>
> ...
> > /* All supervisor states including supported and unsupported
> > states. */
> > #define XFEATURE_MASK_SUPERVISOR_ALL
> > (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
> > diff --git a/arch/x86/include/asm/msr-index.h
> > b/arch/x86/include/asm/msr-index.h
> > index 3faf0f97edb1..0ee77ce4c753 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -362,6 +362,26 @@
> >
> >
> > #define MSR_CORE_PERF_LIMIT_REASONS 0x00000690
> > +
> > +/* Control-flow Enforcement Technology MSRs */
> > +#define MSR_IA32_U_CET 0x000006a0 /* user mode
> > cet setting */
> > +#define MSR_IA32_S_CET 0x000006a2 /* kernel
> > mode cet setting */
> > +#define CET_SHSTK_EN BIT_ULL(0)
> > +#define CET_WRSS_EN BIT_ULL(1)
> > +#define CET_ENDBR_EN BIT_ULL(2)
> > +#define CET_LEG_IW_EN BIT_ULL(3)
> > +#define CET_NO_TRACK_EN BIT_ULL(4)
> > +#define CET_SUPPRESS_DISABLE BIT_ULL(5)
> > +#define CET_RESERVED (BIT_ULL(6) |
> > BIT_ULL(7) | BIT_ULL(8) | BIT_ULL(9))
>
> Would GENMASK_ULL() look any nicer here? I guess it's pretty clear
> as-is that bits 6->9 are reserved.

Hmm, visually I think it makes it easier to catch that a reserved bit
needs to be removed from this list if it becomes unreserved and gets
added some day.

>
> > +#define CET_SUPPRESS BIT_ULL(10)
> > +#define CET_WAIT_ENDBR BIT_ULL(11)
>
> Are those bit fields common for both registers? It might be worth a
> comment to mention that.

Yes, I'll mention that.

>
> > +#define MSR_IA32_PL0_SSP 0x000006a4 /* kernel shadow
> > stack pointer */
> > +#define MSR_IA32_PL1_SSP 0x000006a5 /* ring-1 shadow
> > stack pointer */
> > +#define MSR_IA32_PL2_SSP 0x000006a6 /* ring-2 shadow
> > stack pointer */
>
> Are PL1/2 ever used in this implementation? If not, let's axe these
> definitions.

They are not used. Ok.

>
> > +#define MSR_IA32_PL3_SSP 0x000006a7 /* user shadow stack
> > pointer */
> > +#define MSR_IA32_INT_SSP_TAB 0x000006a8 /* exception
> > shadow stack table */
> > +
> > #define MSR_GFX_PERF_LIMIT_REASONS 0x000006B0
> > #define MSR_RING_PERF_LIMIT_REASONS 0x000006B1
> >
> > diff --git a/arch/x86/kernel/fpu/xstate.c
> > b/arch/x86/kernel/fpu/xstate.c
> > index 02b3ddaf4f75..44397202762b 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -50,6 +50,8 @@ static const char *xfeature_names[] =
> > "Processor Trace (unused)" ,
> > "Protection Keys User registers",
> > "PASID state",
> > + "Control-flow User registers" ,
> > + "Control-flow Kernel registers" ,
> > "unknown xstate feature" ,
> > "unknown xstate feature" ,
> > "unknown xstate feature" ,
> > @@ -73,6 +75,8 @@ static unsigned short xsave_cpuid_features[]
> > __initdata = {
> > [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
> > [XFEATURE_PKRU] = X86_FEATURE_PKU,
> > [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
> > + [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
> > + [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
> > [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
> > [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
> > };
> > @@ -250,6 +254,8 @@ static void __init print_xstate_features(void)
> > print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
> > print_xstate_feature(XFEATURE_MASK_PKRU);
> > print_xstate_feature(XFEATURE_MASK_PASID);
> > + print_xstate_feature(XFEATURE_MASK_CET_USER);
> > + print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
> > print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
> > print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
> > }
> > @@ -405,6 +411,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
> > XFEATURE_MASK_BNDREGS | \
> > XFEATURE_MASK_BNDCSR | \
> > XFEATURE_MASK_PASID | \
> > + XFEATURE_MASK_CET_USER | \
> > XFEATURE_MASK_XTILE)
> >
> > /*
> > @@ -621,6 +628,8 @@ static bool __init check_xstate_against_struct(int nr)
> > XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
> > XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
> > XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
> > + XCHECK_SZ(sz, nr, XFEATURE_CET_USER, struct cet_user_state);
> > + XCHECK_SZ(sz, nr, XFEATURE_CET_KERNEL, struct cet_kernel_state);
> >
> > /* The tile data size varies between implementations. */
> > if (nr == XFEATURE_XTILE_DATA)
> > @@ -634,7 +643,9 @@ static bool __init check_xstate_against_struct(int nr)
> > if ((nr < XFEATURE_YMM) ||
> > (nr >= XFEATURE_MAX) ||
> > (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
> > - ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_RSRVD_COMP_16))) {
> > + (nr == XFEATURE_RSRVD_COMP_13) ||
> > + (nr == XFEATURE_RSRVD_COMP_14) ||
> > + (nr == XFEATURE_RSRVD_COMP_16)) {
> > WARN_ONCE(1, "no structure for xstate: %d\n", nr);
> > XSTATE_WARN_ON(1);
> > return false;
>
> That if() is getting unwieldy. While I generally despise macros
> implicitly modifying variables, this might be worth it. We could have
> a local function variable:
>
> bool feature_checked = false;
>
> and then muck with it in the macro:
>
> #define XCHECK_SZ(sz, nr, nr_macro, __struct) do {
> 	if (nr == nr_macro) {
> 		feature_checked = true;
> 		if (WARN_ONCE(sz != sizeof(__struct), ... ) {
> 			__xstate_dump_leaves();
> 		}
> 	}
> } while (0)
>
> Then the if() just makes sure the feature was checked instead of
> checking for reserved features explicitly. We could also do:
>
> bool c = false;
>
> ...
>
> c |= XCHECK_SZ(sz, nr, XFEATURE_YMM, struct ymmh_struct);
> c |= XCHECK_SZ(sz, nr, XFEATURE_BNDREGS, struct ...
> c |= XCHECK_SZ(sz, nr, XFEATURE_BNDCSR, struct ...
> ...
>
> but that starts to run into 80 columns. Those are both nice because
> they mean you don't have to maintain a list of reserved features in
> the code. Another option would be to define a:
>
> bool xfeature_is_reserved(int nr)
> {
> switch (nr) {
> case XFEATURE_RSRVD_COMP_13:
> ...
>
> so the if() looks nicer and won't grow; the function will grow
> instead.
>
> Either way, I think this needs some refactoring.

Yes, this makes sense. I'll play around with it.
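
As a first pass, the helper variant might look like this (untested
sketch, using the reserved-component names from the hunk above):

	static bool xfeature_is_reserved(int nr)
	{
		switch (nr) {
		case XFEATURE_RSRVD_COMP_13:
		case XFEATURE_RSRVD_COMP_14:
		case XFEATURE_RSRVD_COMP_16:
			return true;
		default:
			return false;
		}
	}

so the big if() only needs the range checks plus a single
xfeature_is_reserved(nr) call.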

2022-02-09 11:54:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 23/35] x86/fpu: Add helpers for modifying supervisor xstate

On Sun, Jan 30 2022 at 13:18, Rick Edgecombe wrote:
> In addition, now that get_xsave_addr() is not available outside of the
> core fpu code, there isn't even a way for these supervisor features to
> modify the in memory state.
>
> To resolve these problems, add some helpers that encapsulate the correct
> logic to operate on the correct copy of the state. Map the MSR's to the
> struct field location in a case statements in __get_xsave_member().

I like the approach in principle, but you still expose the xstate
internals via the void pointer. It's just a question of time until this
is typecast and abused in interesting ways.

Something like the below untested (on top of the whole series) preserves
the encapsulation and reduces the code at the call sites.

Thanks,

tglx
---
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -165,12 +165,7 @@ static inline bool fpstate_is_confidenti
struct task_struct;
extern long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2);

-void *start_update_xsave_msrs(int xfeature_nr);
-void end_update_xsave_msrs(void);
-int xsave_rdmsrl(void *xstate, unsigned int msr, unsigned long long *p);
-int xsave_wrmsrl(void *xstate, u32 msr, u64 val);
-int xsave_set_clear_bits_msrl(void *xstate, u32 msr, u64 set, u64 clear);
-
-void *get_xsave_buffer_unsafe(struct fpu *fpu, int xfeature_nr);
-int xsave_wrmsrl_unsafe(void *xstate, u32 msr, u64 val);
+int xsave_rdmsrs(int xfeature_nr, struct xstate_msr *xmsr, int num_msrs);
+int xsave_wrmsrs(int xfeature_nr, struct xstate_msr *xmsr, int num_msrs);
+int xsave_wrmsrs_on_task(struct task_struct *tsk, int xfeature_nr, struct xstate_msr *xmsr, int num_msrs);
#endif /* _ASM_X86_FPU_API_H */
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -601,4 +601,12 @@ struct fpu_state_config {
/* FPU state configuration information */
extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;

+struct xstate_msr {
+ unsigned int msr;
+ unsigned int bitop;
+ u64 val;
+ u64 set;
+ u64 clear;
+};
+
#endif /* _ASM_X86_FPU_H */
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1868,7 +1868,7 @@ int proc_pid_arch_status(struct seq_file
}
#endif /* CONFIG_PROC_PID_ARCH_STATUS */

-static u64 *__get_xsave_member(void *xstate, u32 msr)
+static u64 *xstate_get_member(void *xstate, u32 msr)
{
switch (msr) {
case MSR_IA32_PL3_SSP:
@@ -1882,22 +1882,11 @@ static u64 *__get_xsave_member(void *xst
}

/*
- * Operate on the xsave buffer directly. It makes no gaurantees that the
- * buffer will stay valid now or in the futre. This function is pretty
- * much only useful when the caller knows the fpu's thread can't be
- * scheduled or otherwise operated on concurrently.
- */
-void *get_xsave_buffer_unsafe(struct fpu *fpu, int xfeature_nr)
-{
- return get_xsave_addr(&fpu->fpstate->regs.xsave, xfeature_nr);
-}
-
-/*
* Return a pointer to the xstate for the feature if it should be used, or NULL
* if the MSRs should be written to directly. To do this safely, using the
* associated read/write helpers is required.
*/
-void *start_update_xsave_msrs(int xfeature_nr)
+static void *xsave_msrs_op_start(int xfeature_nr)
{
void *xstate;

@@ -1938,7 +1927,7 @@ void *start_update_xsave_msrs(int xfeatu
return xstate;
}

-void end_update_xsave_msrs(void)
+static void xsave_msrs_op_end(void)
{
fpregs_unlock();
}
@@ -1951,7 +1940,7 @@ void end_update_xsave_msrs(void)
*
* But if this correspondence is broken by either a write to the in-memory
* buffer or the registers, the kernel needs to be notified so it doesn't miss
- * an xsave or restore. __xsave_msrl_prepare_write() peforms this check and
+ * an xsave or restore. xsave_msrs_prepare_write() performs this check and
* notifies the kernel if needed. Use before writes only, to not take away
* the kernel's options when not required.
*
@@ -1959,65 +1948,107 @@ void end_update_xsave_msrs(void)
* must have resulted in targeting the in-memory state, so invaliding the
* registers is the right thing to do.
*/
-static void __xsave_msrl_prepare_write(void)
+static void xsave_msrs_prepare_write(void)
{
if (test_thread_flag(TIF_NEED_FPU_LOAD) &&
fpregs_state_valid(&current->thread.fpu, smp_processor_id()))
__fpu_invalidate_fpregs_state(&current->thread.fpu);
}

-int xsave_rdmsrl(void *xstate, unsigned int msr, unsigned long long *p)
+static int read_xstate_or_msr(struct xstate_msr *xmsr, void *xstate)
{
u64 *member_ptr;

if (!xstate)
- return rdmsrl_safe(msr, p);
+ return rdmsrl_safe(xmsr->msr, &xmsr->val);

- member_ptr = __get_xsave_member(xstate, msr);
+ member_ptr = xstate_get_member(xstate, xmsr->msr);
if (!member_ptr)
return 1;

- *p = *member_ptr;
-
+ xmsr->val = *member_ptr;
return 0;
}

+int xsave_rdmsrs(int xfeature_nr, struct xstate_msr *xmsr, int num_msrs)
+{
+ void *xstate = xsave_msrs_op_start(xfeature_nr);
+ int i, ret;
+
+ for (i = 0, ret = 0; !ret && i < num_msrs; i++, xmsr++)
+ ret = read_xstate_or_msr(xmsr, xstate);
+
+ xsave_msrs_op_end();
+ return ret;
+}

-int xsave_wrmsrl_unsafe(void *xstate, u32 msr, u64 val)
+static int write_xstate(struct xstate_msr *xmsr, void *xstate)
{
- u64 *member_ptr;
+ u64 *member_ptr = xstate_get_member(xstate, xmsr->msr);

- member_ptr = __get_xsave_member(xstate, msr);
if (!member_ptr)
return 1;

- *member_ptr = val;
-
+ *member_ptr = xmsr->val;
return 0;
}

-int xsave_wrmsrl(void *xstate, u32 msr, u64 val)
+static int write_xstate_or_msr(struct xstate_msr *xmsr, void *xstate)
{
- __xsave_msrl_prepare_write();
if (!xstate)
- return wrmsrl_safe(msr, val);
-
- return xsave_wrmsrl_unsafe(xstate, msr, val);
+ return wrmsrl_safe(xmsr->msr, xmsr->val);
+ return write_xstate(xmsr, xstate);
}

-int xsave_set_clear_bits_msrl(void *xstate, u32 msr, u64 set, u64 clear)
+static int mod_xstate_or_msr_bits(struct xstate_msr *xmsr, void *xstate)
{
- u64 val, new_val;
+ u64 val;
int ret;

- ret = xsave_rdmsrl(xstate, msr, &val);
+ ret = read_xstate_or_msr(xmsr, xstate);
if (ret)
return ret;

- new_val = (val & ~clear) | set;
+ val = xmsr->val;
+ xmsr->val = (val & ~xmsr->clear) | xmsr->set;

- if (new_val != val)
- return xsave_wrmsrl(xstate, msr, new_val);
+ if (val != xmsr->val)
+ return write_xstate_or_msr(xmsr, xstate);

return 0;
}
+
+static int __xsave_wrmsrs(void *xstate, struct xstate_msr *xmsr, int num_msrs)
+{
+ int i, ret;
+
+ for (i = 0, ret = 0; !ret && i < num_msrs; i++, xmsr++) {
+ if (!xmsr->bitop)
+ ret = write_xstate_or_msr(xmsr, xstate);
+ else
+ ret = mod_xstate_or_msr_bits(xmsr, xstate);
+ }
+
+ return ret;
+}
+
+int xsave_wrmsrs(int xfeature_nr, struct xstate_msr *xmsr, int num_msrs)
+{
+ void *xstate = xsave_msrs_op_start(xfeature_nr);
+ int ret;
+
+ xsave_msrs_prepare_write();
+ ret = __xsave_wrmsrs(xstate, xmsr, num_msrs);
+ xsave_msrs_op_end();
+ return ret;
+}
+
+int xsave_wrmsrs_on_task(struct task_struct *tsk, int xfeature_nr, struct xstate_msr *xmsr,
+ int num_msrs)
+{
+ void *xstate = get_xsave_addr(&tsk->thread.fpu.fpstate->regs.xsave, xfeature_nr);
+
+ if (WARN_ON_ONCE(!xstate))
+ return -EINVAL;
+ return __xsave_wrmsrs(xstate, xmsr, num_msrs);
+}
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -106,8 +106,7 @@ int shstk_setup(void)
{
struct thread_shstk *shstk = &current->thread.shstk;
unsigned long addr, size;
- void *xstate;
- int err;
+ struct xstate_msr xmsr[2];

if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
shstk->size ||
@@ -119,13 +118,10 @@ int shstk_setup(void)
if (IS_ERR_VALUE(addr))
return 1;

- xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
- err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, addr + size);
- if (!err)
- err = xsave_wrmsrl(xstate, MSR_IA32_U_CET, CET_SHSTK_EN);
- end_update_xsave_msrs();
+ xmsr[0] = (struct xstate_msr) { .msr = MSR_IA32_PL3_SSP, .val = addr + size };
+ xmsr[1] = (struct xstate_msr) { .msr = MSR_IA32_U_CET, .val = CET_SHSTK_EN };

- if (err) {
+ if (xsave_wrmsrs(XFEATURE_CET_USER, xmsr, ARRAY_SIZE(xmsr))) {
/*
* Don't leak shadow stack if something went wrong with writing the
* msrs. Warn about it because things may be in a weird state.
@@ -150,8 +146,8 @@ int shstk_alloc_thread_stack(struct task
unsigned long stack_size)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
+ struct xstate_msr xmsr[1];
unsigned long addr;
- void *xstate;

/*
* If shadow stack is not enabled on the new thread, skip any
@@ -183,15 +179,6 @@ int shstk_alloc_thread_stack(struct task
if (in_compat_syscall())
stack_size /= 4;

- /*
- * 'tsk' is configured with a shadow stack and the fpu.state is
- * up to date since it was just copied from the parent. There
- * must be a valid non-init CET state location in the buffer.
- */
- xstate = get_xsave_buffer_unsafe(&tsk->thread.fpu, XFEATURE_CET_USER);
- if (WARN_ON_ONCE(!xstate))
- return -EINVAL;
-
stack_size = PAGE_ALIGN(stack_size);
addr = alloc_shstk(stack_size, stack_size, false);
if (IS_ERR_VALUE(addr)) {
@@ -200,7 +187,11 @@ int shstk_alloc_thread_stack(struct task
return PTR_ERR((void *)addr);
}

- xsave_wrmsrl_unsafe(xstate, MSR_IA32_PL3_SSP, (u64)(addr + stack_size));
+ xmsr[0] = (struct xstate_msr) { .msr = MSR_IA32_PL3_SSP, .val = addr + stack_size };
+ if (xsave_wrmsrs_on_task(tsk, XFEATURE_CET_USER, xmsr, ARRAY_SIZE(xmsr))) {
+ unmap_shadow_stack(addr, stack_size);
+ return 1;
+ }
shstk->base = addr;
shstk->size = stack_size;
return 0;
@@ -232,8 +223,8 @@ void shstk_free(struct task_struct *tsk)

int wrss_control(bool enable)
{
+ struct xstate_msr xmsr[1] = {[0] = { .msr = MSR_IA32_U_CET, .bitop = 1,}, };
struct thread_shstk *shstk = &current->thread.shstk;
- void *xstate;
int err;

if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
@@ -246,13 +237,11 @@ int wrss_control(bool enable)
if (!shstk->size || shstk->wrss == enable)
return 1;

- xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
if (enable)
- err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, CET_WRSS_EN, 0);
+ xmsr[0].set = CET_WRSS_EN;
else
- err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, 0, CET_WRSS_EN);
- end_update_xsave_msrs();
-
+ xmsr[0].clear = CET_WRSS_EN;
+ err = xsave_wrmsrs(XFEATURE_CET_USER, xmsr, ARRAY_SIZE(xmsr));
if (err)
return 1;

@@ -263,7 +252,7 @@ int wrss_control(bool enable)
int shstk_disable(void)
{
struct thread_shstk *shstk = &current->thread.shstk;
- void *xstate;
+ struct xstate_msr xmsr[2];
int err;

if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
@@ -271,14 +260,11 @@ int shstk_disable(void)
!shstk->base)
return 1;

- xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
- /* Disable WRSS too when disabling shadow stack */
- err = xsave_set_clear_bits_msrl(xstate, MSR_IA32_U_CET, 0,
- CET_SHSTK_EN | CET_WRSS_EN);
- if (!err)
- err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, 0);
- end_update_xsave_msrs();
+ xmsr[0] = (struct xstate_msr) { .msr = MSR_IA32_U_CET, .bitop = 1,
+ .set = 0, .clear = CET_SHSTK_EN | CET_WRSS_EN };
+ xmsr[1] = (struct xstate_msr) { .msr = MSR_IA32_PL3_SSP, .val = 0 };

+ err = xsave_wrmsrs(XFEATURE_CET_USER, xmsr, ARRAY_SIZE(xmsr));
if (err)
return 1;

@@ -289,16 +275,10 @@ int shstk_disable(void)

static unsigned long get_user_shstk_addr(void)
{
- void *xstate;
- unsigned long long ssp;
-
- xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
-
- xsave_rdmsrl(xstate, MSR_IA32_PL3_SSP, &ssp);
-
- end_update_xsave_msrs();
+ struct xstate_msr xmsr[1] = { [0] = {.msr = MSR_IA32_PL3_SSP, }, };

- return ssp;
+ xsave_rdmsrs(XFEATURE_CET_USER, xmsr, ARRAY_SIZE(xmsr));
+ return xmsr[0].val;
}

/*
@@ -385,8 +365,8 @@ int shstk_check_rstor_token(bool proc32,
int setup_signal_shadow_stack(int proc32, void __user *restorer)
{
struct thread_shstk *shstk = &current->thread.shstk;
+ struct xstate_msr xmsr[1];
unsigned long new_ssp;
- void *xstate;
int err;

if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || !shstk->size)
@@ -397,18 +377,15 @@ int setup_signal_shadow_stack(int proc32
if (err)
return err;

- xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
- err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, new_ssp);
- end_update_xsave_msrs();
-
- return err;
+ xmsr[0] = (struct xstate_msr) { .msr = MSR_IA32_PL3_SSP, .val = new_ssp };
+ return xsave_wrmsrs(XFEATURE_CET_USER, xmsr, ARRAY_SIZE(xmsr));
}

int restore_signal_shadow_stack(void)
{
struct thread_shstk *shstk = &current->thread.shstk;
- void *xstate;
int proc32 = in_ia32_syscall();
+ struct xstate_msr xmsr[1];
unsigned long new_ssp;
int err;

@@ -419,11 +396,8 @@ int restore_signal_shadow_stack(void)
if (err)
return err;

- xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
- err = xsave_wrmsrl(xstate, MSR_IA32_PL3_SSP, new_ssp);
- end_update_xsave_msrs();
-
- return err;
+ xmsr[0] = (struct xstate_msr) { .msr = MSR_IA32_PL3_SSP, .val = new_ssp };
+ return xsave_wrmsrs(XFEATURE_CET_USER, xmsr, ARRAY_SIZE(xmsr));
}

SYSCALL_DEFINE2(map_shadow_stack, unsigned long, size, unsigned int, flags)

2022-02-09 12:08:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, Feb 8, 2022, at 1:31 AM, Thomas Gleixner wrote:
> On Mon, Feb 07 2022 at 17:31, Andy Lutomirski wrote:
>> So this leaves altshadowstack. If we want to allow userspace to handle
>> a shstk overflow, I think we need altshadowstack. And I can easily
>> imagine signal handling in a coroutine or user-threading environment (Go?
>> UMCG or whatever it's called?) wanting this. As noted, this obnoxious
>> Andy person didn't like putting any shstk-related extensions in the FPU
>> state.
>>
>> For better or for worse, altshadowstack is (I think) fundamentally a new
>> API. No amount of ucontext magic is going to materialize an entire
>> shadow stack out of nowhere when someone calls sigaltstack(). So the
>> questions are: should we support altshadowstack from day one and, if so,
>> what should it look like?
>
> I think we should support them from day one.
>
>> So I don't have a complete or even almost complete design in mind, but I
>> think we do need to make a conscious decision either to design this
>> right or to skip it for v1.
>
> Skipping it might create a fundamental design fail situation as it might
> require changes to the shadow stack signal handling in general which
> becomes a nightmare once a non-altstack API is exposed.

It would also expose a range of kernels in which shstk is on but programs that want altshadowstack don't have it. That would be annoying.

>
>> As for CRIU, I don't think anyone really expects a new kernel, running
>> new userspace that takes advantage of features in the new kernel, to
>> work with old CRIU.
>
> Yes, CRIU needs updates, but what ensures that CRIU managed user space
> does not use SHSTK if CRIU is not updated yet?

In some sense this is like any other feature. If a program uses timerfd but CRIU doesn't support timerfd, then it won't work. SHSTK is a bit unique because it's likely that all programs on a system will start using it all at once.

2022-02-09 12:23:55

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

Hi Rick,

On Wed, Feb 09, 2022 at 02:18:42AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > > > > But such a knob will immediately reduce the security value of
> > > > > the entire
> > > > > thing, and I don't have good ideas how to deal with it :(
> > > >
> > > > Probably a kind of latch in the task_struct which would trigger off
> > > > once a return to a different address happened, thus we would be able
> > > > to jump inside parasite code. Of course such a trigger should be
> > > > available under proper capability only.
> > >
> > > I'm not fully in touch with how parasite, etc works. Are we
> > > talking about save or restore?
> >
> > We use the parasite code in question during the checkpoint phase, as
> > far as I remember. The push addr/lret trick is used to run "injected"
> > code (the code injection itself is done via ptrace) in compat mode at
> > least. Dima, Andrei, I didn't look into this code for years already;
> > do we still need to support compat mode at all?
> >
> > > If it's restore, what exactly does CRIU need to do? Is it just that
> > > CRIU needs to return out from its resume code into the to-be-resumed
> > > program without tripping CET? Would it be acceptable for CRIU to
> > > require that at least one shstk slot be free at save time? Or do we
> > > need a mechanism to atomically switch to a completely full shadow
> > > stack at resume?
> > >
> > > Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > > that is intended for use for altshadowstack could safely verify a
> > > token on the altshadowstack, possibly compare to something in
> > > ucontext (or not -- this isn't clearly necessary) and switch back to
> > > the previous stack. CRIU could use that too. Obviously CRIU will
> > > need a way to populate the relevant stacks, but WRUSS can be used for
> > > that, and I think this is a fundamental requirement for CRIU -- CRIU
> > > restore absolutely needs a way to write the saved shadow stack data
> > > into the shadow stack.
>
> Still wrapping my head around the CRIU save and restore steps, but
> another general approach might be to give ptrace the ability to
> temporarily pause/resume/set CET enablement and SSP for a stopped
> thread. Then injected code doesn't need to jump through any hoops or
> possibly run into road blocks. I'm not sure how much this opens things
> up if the thread has to be stopped...

IIRC, criu dump does something like this:
* Stop the process being dumped (victim) with ptrace
* Inject parasite code and data into the victim, again with ptrace.
Among other things the parasite data contains a sigreturn frame with
saved victim state.
* Resume the victim process, which will run parasite code now.
* When the parasite finishes, it uses that frame to sigreturn to normal
victim execution.

So, my feeling is that for dump side WRUSS should be enough.

> Cyrill, could it fit into the CRIU pause and resume flow? What action
> causes the final resuming of execution of the restored process for
> checkpointing and for restore? Wondering if we could somehow make CET
> re-enable exactly then.
>
> And I guess this also needs a way to create shadow stack allocations at
> a specific address to match where they were in the dumped process. That
> is missing in this series.

Yes, criu restore will need to recreate shadow stack mappings. Currently,
we recreate the restored process (target) address space based on
/proc/pid/maps and /proc/pid/smaps. CRIU preserves the virtual addresses
and VMA flags. The relevant steps of the restore process can be summarised as:
* Clone() the target process tree
* Recreate VMAs with the needed size and flags, but not necessarily at the
correct place yet
* Partially populate memory data from the saved images
* Move VMAs to their exact addresses
* Complete restoring the data
* Create a frame for sigreturn and jump to the target.

Here, the stack used after sigreturn contains the data that was captured
during dump and is entirely different from what the shadow stack will contain.

There are several points when the target threads are stopped, so
pausing/resuming CET may help.

> > > So I think the only special capability that CRIU really needs is
> > > WRUSS, and we need to wire that up anyway.
> >
> > Thanks for these notes, Andy! I can't provide any sane answer here
> > since I didn't read the tech spec for this feature yet :-)
>

--
Sincerely yours,
Mike.

2022-02-09 12:24:33

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

[un-Cc'ed a lot of people, as the question is highly off-topic, so I
don't feel like the answer is of big interest to them, keeping x86
maintainer in]

On 2/8/22 17:02, Cyrill Gorcunov wrote:
> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
>>>> But such a knob will immediately reduce the security value of the entire
>>>> thing, and I don't have good ideas how to deal with it :(
>>>
>>> Probably a kind of latch in the task_struct which would trigger off once
>>> return to a different address happened, thus we would be able to jump inside
>>> parasite code. Of course such a trigger should be available under proper
>>> capability only.
>>
>> I'm not fully in touch with how parasite, etc works. Are we talking about save or restore?
>
> We use parasite code in question during checkpoint phase as far as I remember.
> push addr/lret trick is used to run "injected" code (code injection itself is
> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look into this code
> for years already, do we still need to support compat mode at all?

Cyrill, I haven't been working on/with Virtuozzo people for the last 5
years, so I don't know. As you're more connected to Vz, your question
seems to imply that ia32 C/R is no longer needed by Vz customers. If it's
not needed anymore - I'm all for stopping testing of it in CRIU.

The only thing I ask before you go and remove that is to ping the person
who paid a substantial amount of money on the bug bounty to get ia32
support in CRIU. Although in the end I didn't get a cent out of it (VZ
managers insisted on receiving all of the money), I still feel
responsible to that person, as the amount he paid was the biggest bounty
at that moment and I was the person who presented him ia32 C/R as
working and being tested.
If you need his contacts - ping me, I'll search and find it.

Other than that - if no one needs ia32 C/R, let's go ahead and drop
testing (and maybe some complicated code) of it.

Thanks,
Dmitry

2022-02-09 12:42:58

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Wed, Feb 09, 2022 at 02:18:42AM +0000, Edgecombe, Rick P wrote:
...
>
> Still wrapping my head around the CRIU save and restore steps, but
> another general approach might be to give ptrace the ability to
> temporarily pause/resume/set CET enablement and SSP for a stopped
> thread. Then injected code doesn't need to jump through any hoops or
> possibly run into road blocks. I'm not sure how much this opens things
> up if the thread has to be stopped...
>
> Cyrill, could it fit into the CRIU pause and resume flow? What action
> causes the final resuming of execution of the restored process for
> checkpointing and for restore? Wondering if we could somehow make CET
> re-enable exactly then.
>
> And I guess this also needs a way to create shadow stack allocations at
> a specific address to match where they were in the dumped process. That
> is missing in this series.

Thanks Rick! This sounds like an option. I need a couple of days to refresh
my memory about criu internals. Let me CC a few currently active criu
developers (the CC list is already big enough though :), maybe this will
speed up the procedure.

2022-02-09 12:44:57

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, Feb 08, 2022 at 09:54:14PM +0000, Dmitry Safonov wrote:
> [un-Cc'ed a lot of people, as the question is highly off-topic, so I
> don't feel like the answer is of big interest to them, keeping x86
> maintainer in]
>
> On 2/8/22 17:02, Cyrill Gorcunov wrote:
> >>> Probably a kind of latch in the task_struct which would trigger off once
> >>> return to a different address happened, thus we would be able to jump inside
> >>> parasite code. Of course such a trigger should be available under proper
> >>> capability only.
> >>
> >> I'm not fully in touch with how parasite, etc works. Are we talking about save or restore?
> >
> > We use parasite code in question during checkpoint phase as far as I remember.
> > push addr/lret trick is used to run "injected" code (code injection itself is
> > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look into this code
> > for years already, do we still need to support compat mode at all?
>
> Cyrill, I haven't been working on/with Virtuozzo people last 5 years, so
> I don't know. As you're more connected to Vz, your question seems to
> imply that ia32 C/R is no longer needed by Vz customers. If it's not
> needed anymore - I'm all for stopping testing of it in CRIU.

Nope. I didn't see any sign that Vz intends to drop ia32 support. But
Vz's criu instance follows the vanilla one, which is why I asked you
and Andrew about ia32 support. This ia32 code snippet with stack
manipulation simply popped into my mind immediately when Andy asked
how we deal with the stack.

Also, we adjust the stack in the restorer code, but I need some time to
recall all these details since, as I said, I haven't worked with the criu
code for years.

Cyrill

2022-02-09 18:40:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 10/35] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS

On 1/30/22 13:18, Rick Edgecombe wrote:
>
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 99d1781fa5f0..75ce4e823902 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1210,7 +1210,7 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
> }
>
> /* Clear dirty field. */
> - se->val64 &= ~_PAGE_DIRTY;
> + se->val64 &= ~_PAGE_DIRTY_BITS;
>
> ops->clear_pse(se);
> ops->clear_ips(se);

Are these x86 CPU page table values? I see ->val64 being used like this:

e->val64 &= ~GEN8_PAGE_PRESENT;
and
se.val64 |= GEN8_PAGE_PRESENT | GEN8_PAGE_RW;

where we also have:

#define GEN8_PAGE_PRESENT BIT_ULL(0)
#define GEN8_PAGE_RW BIT_ULL(1)

Which tells me that these are probably *close* to the CPU's page tables.
But, I honestly don't know which format they are. I don't know if
_PAGE_COW is still a software bit in that format or not.

Either way, I don't think we should be messing with i915 device page tables.

Or, are these somehow magically shared with the CPU in some way I don't
know about?

[ If these are device-only page tables, it would probably be nice to
stop using _PAGE_FOO for them. It would avoid confusion like this. ]

2022-02-09 19:19:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 11/35] x86/mm: Update pte_modify for _PAGE_COW

On 1/30/22 13:18, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> The read-only and Dirty PTE has been used to indicate copy-on-write pages.

Nit: This is another opportunity to use consistent terminology
for these Write=0,Dirty=1 PTEs.

> However, newer x86 processors also regard a read-only and Dirty PTE as a
> shadow stack page. In order to separate the two, the software-defined
> _PAGE_COW is created to replace _PAGE_DIRTY for the copy-on-write case, and
> pte_*() are updated.

The tense here is weird. "_PAGE_COW is created" is present tense, but
it refers to something that happened earlier in the series.

> Pte_modify() changes a PTE to 'newprot', but it doesn't use the pte_*().

I'm not seeing a clear problem statement in there. It looks something
like this to me:

pte_modify() takes a "raw" pgprot_t which was not necessarily
created with any of the existing PTE bit helpers. That means
that it can return a pte_t with Write=0,Dirty=1: a shadow stack
PTE when it did not intend to create one.

But, this kinda looks like a hack to me.

It all boils down to _PAGE_CHG_MASK. If pte_modify() can change the
bit's value, it is not included in _PAGE_CHG_MASK. But, pte_modify()
*CAN* change the _PAGE_DIRTY value now.

Another way of saying it is that _PAGE_DIRTY is now a permission bit
(part-time, at least).


> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index a4a75e78a934..5c3886f6ccda 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -773,6 +773,23 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>
> static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
>
> +static inline pteval_t fixup_dirty_pte(pteval_t pteval)
> +{
> + pte_t pte = __pte(pteval);
> +
> + /*
> + * Fix up potential shadow stack page flags because the RO, Dirty
> + * PTE is special.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> + if (pte_dirty(pte)) {
> + pte = pte_mkclean(pte);
> + pte = pte_mkdirty(pte);
> + }
> + }
> + return pte_val(pte);
> +}
> +
> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> {
> pteval_t val = pte_val(pte), oldval = val;
> @@ -783,16 +800,36 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> */
> val &= _PAGE_CHG_MASK;
> val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
> + val = fixup_dirty_pte(val);
> val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
> return __pte(val);
> }

Maybe something like this? We can take _PAGE_DIRTY out of
_PAGE_CHG_MASK, then the p*_modify() functions look like this:

static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
pteval_t val = pte_val(pte), oldval = val;
+ pte_t pte_result;

/* Chop off any bits that might change with 'newprot': */
val &= _PAGE_CHG_MASK;
val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);

+ pte_result = __pte(val);
+
+ if (pte_dirty(pte))
+ pte_result = pte_mkdirty(pte_result);
+
+ return pte_result;
}


This:

1. Makes logical sense: the dirty bit *IS* special in that it has to be
logically preserved across permission changes.
2. Would work with or without shadow stacks. That exact code would even
work on a non-shadow-stack kernel
3. Doesn't introduce *any* new shadow-stack conditional code; the one
already hidden in pte_mkdirty() is sufficient.
4. Avoids silly things like setting a bit and then immediately clearing
it in a "fixup".
5. Removes the opaque "fixup" abstraction function.

That's way better if I do say so myself.

2022-02-09 19:55:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 15/35] x86/mm: Check Shadow Stack page fault errors

On 1/30/22 13:18, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> Shadow stack accesses are those that are performed by the CPU where it
> expects to encounter a shadow stack mapping. These accesses are performed
> implicitly by CALL/RET at the site of the shadow stack pointer. These
> accesses are made explicitly by shadow stack management instructions like
> WRUSSQ.

The passive voice is killing me. Here's a rewrite:

The CPU performs "shadow stack accesses" when it expects to
encounter shadow stack mappings. These accesses can be
implicit (via CALL/RET instructions) or explicit (instructions
like WRUSSQ).

Since we defined what a shadow stack access *is*, shouldn't we also
connect it to X86_PF_SHSTK?

> Shadow stacks accesses to shadow-stack mapping can see faults in normal,

^ mappings

> valid operation just like regular accesses to regular mappings. Shadow
> stacks need some of the same features like delayed allocation, swap and
> copy-on-write.

... and use faults to implement those features.

> Shadow stack accesses can also result in errors, such as when a shadow
> stack overflows, or if a shadow stack access occurs to a non-shadow-stack
> mapping.

Those two paragraphs tell a pretty good story. Nice.

> In handling a shadow stack page fault, verify it occurs within a shadow
> stack mapping. It is always an error otherwise. For valid shadow stack
> accesses, set FAULT_FLAG_WRITE to effect copy-on-write. Because clearing
> _PAGE_DIRTY (vs. _PAGE_RW) is used to trigger the fault, shadow stack read
> fault and shadow stack write fault are not differentiated and both are
> handled as a write access.

This paragraph is a rehash of what the code does. It can go.

*But*, with or without this paragraph, the reader is left with all
background and no discussion of why this patch exists.

Even just this would be fine:

Handle valid and invalid shadow-stack accesses in the page fault
handler.


> diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
> index 10b1de500ab1..afa524325e55 100644
> --- a/arch/x86/include/asm/trap_pf.h
> +++ b/arch/x86/include/asm/trap_pf.h
> @@ -11,6 +11,7 @@
> * bit 3 == 1: use of reserved bit detected
> * bit 4 == 1: fault was an instruction fetch
> * bit 5 == 1: protection keys block access
> + * bit 6 == 1: shadow stack access fault
> * bit 15 == 1: SGX MMU page-fault
> */
> enum x86_pf_error_code {
> @@ -20,6 +21,7 @@ enum x86_pf_error_code {
> X86_PF_RSVD = 1 << 3,
> X86_PF_INSTR = 1 << 4,
> X86_PF_PK = 1 << 5,
> + X86_PF_SHSTK = 1 << 6,
> X86_PF_SGX = 1 << 15,
> };
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d0074c6ed31a..6769134986ec 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1107,6 +1107,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
> (error_code & X86_PF_INSTR), foreign))
> return 1;
>
> + /*
> + * Verify a shadow stack access is within a shadow stack VMA.
> + * It is always an error otherwise. Normal data access to a
> + * shadow stack area is checked in the case followed.
> + */

That comment needs some help. Maybe:

Shadow stack accesses (PF_SHSTK=1) are only permitted to
shadow stack VMAs. All other accesses result in an error.

I don't think we need to talk about the other cases being handled below.

> + if (error_code & X86_PF_SHSTK) {
> + if (!(vma->vm_flags & VM_SHADOW_STACK))
> + return 1;
> + return 0;
> + }
> +
> if (error_code & X86_PF_WRITE) {
> /* write, present and write, not present: */
> if (unlikely(!(vma->vm_flags & VM_WRITE)))
> @@ -1300,6 +1311,14 @@ void do_user_addr_fault(struct pt_regs *regs,
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>
> + /*
> + * Clearing _PAGE_DIRTY is used to detect shadow stack access.
> + * This method cannot distinguish shadow stack read vs. write.
> + * For valid shadow stack accesses, set FAULT_FLAG_WRITE to effect
> + * copy-on-write.
> + */

Too much detail. This is also rather unconnected to the code I can see:

> + if (error_code & X86_PF_SHSTK)
> + flags |= FAULT_FLAG_WRITE;

Also, the use of "effect" here is arguably wrong. It's odd at best.
I'd use some alternative wording.

Let's stick to the facts:
1. Shadow stack pages architecturally can't be read-only
2. Don't bother with read faults, consider everything a write

BTW, what happens if we don't do this? What breaks?

2022-02-09 20:32:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 12/35] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

On 1/30/22 13:18, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> When Shadow Stack is introduced, [R/O + _PAGE_DIRTY] PTE is reserved for
> shadow stack. Copy-on-write PTEs have [R/O + _PAGE_COW].

<sigh> Another way to refer to these PTEs. In the last patch, it was:

"read-only and Dirty PTE"
and now:
"[R/O + _PAGE_DIRTY]"

> When a PTE goes from [R/W + _PAGE_DIRTY] to [R/O + _PAGE_COW], it could
> become a transient shadow stack PTE in two cases:
>
> The first case is that some processors can start a write but end up seeing
> a read-only PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE. However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
>
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.

== Background ==

Shadow stack PTEs always have [Write=0,Dirty=1].

As currently implemented, ptep_set_wrprotect() simply clears _PAGE_RW:
(Write=1 -> Write=0).

== Problem ==

This could cause a problem if ptep_set_wrprotect() caused a PTE to
transition from:

[Write=1,Dirty=1]
to
[Write=0,Dirty=1]

Which would inadvertently create a shadow stack PTE instead of
write-protecting it. ptep_set_wrprotect() cannot simply check for the
Dirty=1 bit because the hardware can set it at any time.

== Solution ==

Perform a compare-and-exchange operation on the PTE to avoid racing with
the hardware. The cmpxchg is expected to be more expensive than the
existing clear_bit(). Continue using the cheaper clear_bit() when shadow
stacks are not in play.

> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 5c3886f6ccda..e1061b9cba6a 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1295,6 +1295,24 @@ static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
> static inline void ptep_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> + /*
> + * If Shadow Stack is enabled, pte_wrprotect() moves _PAGE_DIRTY
> + * to _PAGE_COW (see comments at pte_wrprotect()).
> + * When a thread reads a RW=1, Dirty=0 PTE and before changing it
> + * to RW=0, Dirty=0, another thread could have written to the page
> + * and the PTE is RW=1, Dirty=1 now. Use try_cmpxchg() to detect
> + * PTE changes and update old_pte, then try again.
> + */

I think we can trim that down. We don't need to explain what cmpxchg
does or why it loops. That's way too much detail that we don't need.
Maybe:

/*
* Avoid accidentally creating shadow stack PTEs
* (Write=0,Dirty=1). Use cmpxchg() to prevent races with
* the hardware setting Dirty=1.
*/

BTW, is it *really* a problem with other threads setting Dirty=1? This
is happening under the page table lock on this side at least.

> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> + pte_t old_pte, new_pte;
> +
> + old_pte = READ_ONCE(*ptep);
> + do {
> + new_pte = pte_wrprotect(old_pte);
> + } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> + return;
> + }
> clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> }




2022-02-09 20:47:54

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 23/35] x86/fpu: Add helpers for modifying supervisor xstate

On Tue, 2022-02-08 at 09:51 +0100, Thomas Gleixner wrote:
> I like the approach in principle, but you still expose the xstate
> internals via the void pointer. It's just a question of time that
> this
> is type casted and abused in interesting ways.

Thanks for taking a look. I have to say though, these changes are
making me scratch my head a bit. Should we really design around callers
digging into mysterious pointers with magic offsets instead of using
easy helpers full of warnings about pitfalls? It should look odd in a
code review too, I would think.

>
> Something like the below untested (on top of the whole series)
> preserves
> the encapsulation and reduces the code at the call sites.
>
>
It loses the ability to know which MSR write actually failed. It also
loses the ability to perform read/write logic under a single
transaction. The latter is not needed for this series, but this snippet
from the IBT series does it:

int ibt_get_clear_wait_endbr(void)
{
void *xstate;
u64 msr_val = 0;

if (!current->thread.shstk.ibt)
return 0;

xstate = start_update_xsave_msrs(XFEATURE_CET_USER);
if (!xsave_rdmsrl(xstate, MSR_IA32_U_CET, &msr_val))
xsave_wrmsrl(xstate, MSR_IA32_U_CET, msr_val &
~CET_WAIT_ENDBR);
end_update_xsave_msrs();

return msr_val & CET_WAIT_ENDBR;
}
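
Under the xstate_msr scheme, expressing this would need a new entry
point that keeps the read and the dependent write in one transaction --
hypothetically something like:

	int xsave_rdwrmsrs(int xfeature_nr, struct xstate_msr *xmsr,
			   int num_msrs);

(name and signature invented here purely for illustration).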

I suppose we could just add a new function to do that logic in a single
transaction when the time comes. But inventing data structures to
describe work to be passed off to some executor always seems to break
at the first new requirement. What I usually wanted was a programming
language, and I already had it.

Not to bikeshed though, it will still get the job done.

2022-02-09 23:18:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 16/35] x86/mm: Update maybe_mkwrite() for shadow stack

First of all, that changelog doesn't really explain the problem. It's
all background and no "why".

*Why* does maybe_mkwrite() take a VMA? What's the point?


> #endif /* _ASM_X86_PGTABLE_H */
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..c22c8e9c37e8 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -610,6 +610,26 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
> }
> #endif
>
> +pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags & VM_WRITE)
> + pte = pte_mkwrite(pte);
> + else if (vma->vm_flags & VM_SHADOW_STACK)
> + pte = pte_mkwrite_shstk(pte);
> + return pte;
> +}

First, this makes me wonder why we need pte_mkwrite() *AND*
pte_mkwrite_shstk(). Is there a difference in their behavior that matters?

Second, I don't like the copy-and-paste to make an arch-specific "hook"
for a function. This is a very good way to ensure that arch code and
generic code fork and accumulate separate bugs.

I'd much rather have this do (in generic code):

pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
if (vma->vm_flags & VM_WRITE)
pte = pte_mkwrite(pte);

pte = arch_maybe_mkwrite(pte, vma);

return pte;
}

Actually, is there a reason the generic code could not even just add:

if (vma->vm_flags & VM_ARCH_MAYBE_MKWRITE_MASK)
pte = arch_maybe_mkwrite(pte, vma);

or heck even just the x86-specific code itself:

if (vma->vm_flags & VM_SHADOW_STACK)
pte = pte_mkwrite_shstk(pte);

with a stub defined for pte_mkwrite_shstk()?
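
The stub would be trivial (sketch only, following the same
#ifndef-override convention used for maybe_mkwrite() below):

	#ifndef pte_mkwrite_shstk
	static inline pte_t pte_mkwrite_shstk(pte_t pte)
	{
		/* No shadow stack support: leave the PTE untouched */
		return pte;
	}
	#endif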

In the end, it's just a question of whether the generic code wants
something to say "arch" or "shstk". But, I don't think we need a forked
x86 copy of these functions.

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 311c6018d503..b3cb3a17037b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -955,12 +955,14 @@ void free_compound_page(struct page *page);
> * pte_mkwrite. But get_user_pages can cause write faults for mappings
> * that do not have writing enabled, when used by access_process_vm.
> */
> +#ifndef maybe_mkwrite

maybe_mkwrite is defined in asm/pgtable.h. Where is the #include?

> static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> {
> if (likely(vma->vm_flags & VM_WRITE))
> pte = pte_mkwrite(pte);
> return pte;
> }
> +#endif
>
> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
> void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..2adedcfca00b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -491,12 +491,14 @@ static int __init setup_transparent_hugepage(char *str)
> }
> __setup("transparent_hugepage=", setup_transparent_hugepage);
>
> +#ifndef maybe_pmd_mkwrite
> pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> {
> if (likely(vma->vm_flags & VM_WRITE))
> pmd = pmd_mkwrite(pmd);
> return pmd;
> }
> +#endif
>
> #ifdef CONFIG_MEMCG
> static inline struct deferred_split *get_deferred_split_queue(struct page *page)


2022-02-09 23:20:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 19/35] mm/mmap: Add shadow stack pages to memory accounting

On 1/30/22 13:18, Rick Edgecombe wrote:
> +bool is_shadow_stack_mapping(vm_flags_t vm_flags)
> +{
> + return vm_flags & VM_SHADOW_STACK;
> +}
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index bc8713a76e03..21fdb1273571 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -911,6 +911,14 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
> __ptep_modify_prot_commit(vma, addr, ptep, pte);
> }
> #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +#ifndef is_shadow_stack_mapping
> +static inline bool is_shadow_stack_mapping(vm_flags_t vm_flags)
> +{
> + return false;
> +}
> +#endif

Hold your horses there. Remember:

+#ifdef CONFIG_X86_SHADOW_STACK
+# define VM_SHADOW_STACK VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK VM_NONE
+#endif

Plus:

#define VM_NONE 0x00000000

That means the arch-generic version, when CONFIG_X86_SHADOW_STACK is off
compiles down to:

bool is_shadow_stack_mapping(vm_flags_t vm_flags)
{
return vm_flags & 0x00000000;
}

I _suspect_ the compiler *might* compile that down to the same thing as:

return false;

So, why not just have one version, no additional #ifdefs, and be done
with it? Heck, why have the helper in the first place? Just check
VM_SHADOW_STACK directly.

2022-02-09 23:22:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 20/35] mm: Update can_follow_write_pte() for shadow stack

On 1/30/22 13:18, Rick Edgecombe wrote:
> Like a writable data page, a shadow stack page is writable, and becomes
> read-only during copy-on-write, but it is always dirty.

One other thing...

The language in these changelogs is a bit sloppy. For instance, what
does "always dirty" mean here? pte_dirty()? Or strictly _PAGE_DIRTY?

In other words, logically dirty, or literally "has *the* dirty bit set"?


2022-02-09 23:22:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 18/35] mm: Add guard pages around a shadow stack.

On 1/30/22 13:18, Rick Edgecombe wrote:
> INCSSP(Q/D) increments shadow stack pointer and 'pops and discards' the
> first and the last elements in the range, effectively touches those memory
> areas.

This is a pretty close copy of the instruction reference text for
INCSSP. I'm feeling rather dense today, but that's just not making any
sense.

The pseudocode is more sensible in the SDM. I think this needs a better
explanation:

The INCSSP instruction increments the shadow stack pointer. It
is the shadow stack analog of an instruction like:

addq $0x80, %rsp

However, there is one important difference between an ADD on
%rsp and INCSSP. In addition to modifying SSP, INCSSP also
reads from the memory of the first and last elements that were
"popped". You can think of it as acting like this:

READ_ONCE(ssp); // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8); // read+discard last popped stack element


> The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes and
> 255 * 4 = 1020 bytes by INCSSPD. Both ranges are far from PAGE_SIZE.

... That maximum distance, combined with a guard page at the end of
a shadow stack, ensures that INCSSP will fault before it is able to move
across an entire guard page.

> Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
> CALL, and RET from going beyond.

>
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index a506a411474d..e1533fdc08b4 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -73,6 +73,13 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
>
> extern void initmem_init(void);
>
> +#define vm_start_gap vm_start_gap
> +struct vm_area_struct;
> +extern unsigned long vm_start_gap(struct vm_area_struct *vma);
> +
> +#define vm_end_gap vm_end_gap
> +extern unsigned long vm_end_gap(struct vm_area_struct *vma);
> +
> #endif /* !__ASSEMBLY__ */
>
> #endif /* _ASM_X86_PAGE_DEFS_H */
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index f3f52c5e2fd6..81f9325084d3 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -250,3 +250,49 @@ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
> return false;
> return true;
> }
> +
> +/*
> + * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D). INCSSPQ
> + * moves shadow stack pointer up to 255 * 8 = ~2 KB (~1KB for INCSSPD) and
> + * touches the first and the last element in the range, which triggers a
> + * page fault if the range is not in a shadow stack. Because of this,
> + * creating 4-KB guard pages around a shadow stack prevents these
> + * instructions from going beyond.
> + */
> +#define SHADOW_STACK_GUARD_GAP PAGE_SIZE
> +
> +unsigned long vm_start_gap(struct vm_area_struct *vma)
> +{
> + unsigned long vm_start = vma->vm_start;
> + unsigned long gap = 0;
> +
> + if (vma->vm_flags & VM_GROWSDOWN)
> + gap = stack_guard_gap;
> + else if (vma->vm_flags & VM_SHADOW_STACK)
> + gap = SHADOW_STACK_GUARD_GAP;
> +
> + if (gap != 0) {
> + vm_start -= gap;
> + if (vm_start > vma->vm_start)
> + vm_start = 0;
> + }
> + return vm_start;
> +}
> +
> +unsigned long vm_end_gap(struct vm_area_struct *vma)
> +{
> + unsigned long vm_end = vma->vm_end;
> + unsigned long gap = 0;
> +
> + if (vma->vm_flags & VM_GROWSUP)
> + gap = stack_guard_gap;
> + else if (vma->vm_flags & VM_SHADOW_STACK)
> + gap = SHADOW_STACK_GUARD_GAP;
> +
> + if (gap != 0) {
> + vm_end += gap;
> + if (vm_end < vma->vm_end)
> + vm_end = -PAGE_SIZE;
> + }
> + return vm_end;
> +}

First of all, __weak would be a lot better than these #ifdefs.

Second, I have the same basic objection to this as the maybe_mkwrite()
mess. This is a forked copy of the code. Instead of refactoring, it's
just copied-pasted-and-#ifdef'd. Not so nice.

Isn't this just a matter of overriding 'stack_guard_gap' for
VM_SHADOW_STACK? Why don't we just do this:

unsigned long stack_guard_gap(struct vm_area_struct *vma)
{
if (vma->vm_flags & VM_SHADOW_STACK)
return SHADOW_STACK_GUARD_GAP;

return __stack_guard_gap;
}

Or, worst-case if people don't want 2 easily compiled-out lines added to
generic code, define:

unsigned long __weak stack_guard_gap(struct vm_area_struct *vma)
{
return __stack_guard_gap;
}

in generic code, and put the top definition in arch/x86.

2022-02-09 23:24:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 17/35] mm: Fixup places that call pte_mkwrite() directly

On 1/30/22 13:18, Rick Edgecombe wrote:
> - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE directly
> and call pte_mkwrite(), which is the same as maybe_mkwrite(). Change
> them to maybe_mkwrite().

Those look OK.

> - In do_numa_page(), if the numa entry was writable, then pte_mkwrite()
> is called directly. Fix it by doing maybe_mkwrite(). Make the same
> changes to do_huge_pmd_numa_page().

This is another "what", not "why" changelog. This change puzzles me.

*Why* is this needed? It sounds like pte_mkwrite() doesn't work for
shadow stack PTEs. Let's say that explicitly.

I also think this is ab/misuse of maybe_mkwrite().

The shadow stack VMA *REQUIRES* PTEs with Dirty=1. There's no *maybe*
about it. The rest of this is essentially a hack to get
VM_SHADOW_STACK-required bits into the PTE. We have a place where we
store those VMA-required bits: vma->vm_page_prot. Look at how we store
the pkey bits in there for instance.

Let's say we set _PAGE_DIRTY in vma->vm_page_prot. We'd come into
do_anonymous_page() for instance and do this:

> entry = mk_pte(page, vma->vm_page_prot); <--- PTE is Write=0,Dirty=1 Yay!
> entry = pte_sw_mkyoung(entry);
> if (vma->vm_flags & VM_WRITE) <--- False, skip the pte_mkwrite()
> entry = pte_mkwrite(pte_mkdirty(entry));

In other words, it "just works" because shadow stack VMAs don't have
VM_WRITE set.

I think the other VM_WRITE checks would be fine too, although I'm unsure
about the change_page_attr() one.

2022-02-09 23:29:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 20/35] mm: Update can_follow_write_pte() for shadow stack

On 1/30/22 13:18, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> Can_follow_write_pte() ensures a read-only page is COWed by checking the
> FOLL_COW flag, and uses pte_dirty() to validate the flag is still valid.
>
> Like a writable data page, a shadow stack page is writable, and becomes
> read-only during copy-on-write,

I thought we could not have read-only shadow stack pages. What does a
read-only shadow stack PTE look like? ;)

> but it is always dirty. Thus, in the
> can_follow_write_pte() check, it belongs to the writable page case and
> should be excluded from the read-only page pte_dirty() check. Apply
> the same changes to can_follow_write_pmd().
>
> While at it, also split the long line into smaller ones.

FWIW, I probably would have had a preparatory patch for this part. The
advantage is that if you break existing code, it's a lot easier to
figure it out if you have a separate refactoring patch. Also, for a
patch like this, the refactoring might result in the same exact binary.
It's a pretty good sign that your patch won't cause regressions if it
results in the same binary.

> diff --git a/mm/gup.c b/mm/gup.c
> index f0af462ac1e2..95b7d1084c44 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -464,10 +464,18 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> * FOLL_FORCE can write to even unwritable pte's, but only
> * after we've gone through a COW cycle and they are dirty.
> */
> -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> +static inline bool can_follow_write_pte(pte_t pte, unsigned int flags,
> + struct vm_area_struct *vma)
> {
> - return pte_write(pte) ||
> - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> + if (pte_write(pte))
> + return true;
> + if ((flags & (FOLL_FORCE | FOLL_COW)) != (FOLL_FORCE | FOLL_COW))
> + return false;
> + if (!pte_dirty(pte))
> + return false;
> + if (is_shadow_stack_mapping(vma->vm_flags))
> + return false;

You had me up until this is_shadow_stack_mapping(). It wasn't mentioned
at all in the changelog. Logically, I think it's trying to say that a
shadow stack VMA never allows a FOLL_FORCE override.

That makes some sense, but it's a pretty big point not to mention in the
changelog.

> + return true;
> }
>
> static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -510,7 +518,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> }
> if ((flags & FOLL_NUMA) && pte_protnone(pte))
> goto no_page;
> - if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
> + if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags, vma)) {
> pte_unmap_unlock(ptep, ptl);
> return NULL;
> }



2022-02-09 23:36:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 14/35] mm: Introduce VM_SHADOW_STACK for shadow stack memory

On 1/30/22 13:18, Rick Edgecombe wrote:
> A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
> read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
> two cases are handled differently for page faults. Introduce
> VM_SHADOW_STACK to track shadow stack VMAs.

This is also a very appropriate place to remind folks that VM_WRITE is
mutually exclusive with this flag. That's pretty important.

2022-02-10 03:09:23

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Wed, Feb 9, 2022 at 6:37 PM Andy Lutomirski <[email protected]> wrote:
>
> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> >> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> >>>>> But such a knob will immediately reduce the security value of
> >>>>> the entire
> >>>>> thing, and I don't have good ideas how to deal with it :(
> >>>>
> >>>> Probably a kind of latch in the task_struct which would trigger
> >>>> off once
> >>>> return to a different address happened, thus we would be able to
> >>>> jump inside
> >>>> parasite code. Of course such trigger should be available under
> >>>> proper
> >>>> capability only.
> >>>
> >>> I'm not fully in touch with how parasite, etc works. Are we
> >>> talking about save or restore?
> >>
> >> We use parasite code in question during checkpoint phase as far as I
> >> remember.
> >> push addr/lret trick is used to run "injected" code (code injection
> >> itself is
> >> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> >> into this code
> >> for years already, do we still need to support compat mode at all?
> >>
> >>> If it's restore, what exactly does CRIU need to do? Is it just
> >>> that CRIU needs to return
> >>> out from its resume code into the to-be-resumed program without
> >>> tripping CET? Would it
> >>> be acceptable for CRIU to require that at least one shstk slot be
> >>> free at save time?
> >>> Or do we need a mechanism to atomically switch to a completely full
> >>> shadow stack at resume?
> >>>
> >>> Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> >>> that is intended for
> >>> use for altshadowstack could safely verify a token on the
> >>> altshadowstack, possibly
> >>> compare to something in ucontext (or not -- this isn't clearly
> >>> necessary) and switch
> >>> back to the previous stack. CRIU could use that too. Obviously
> >>> CRIU will need a way
> >>> to populate the relevant stacks, but WRUSS can be used for that,
> >>> and I think this
> >>> is a fundamental requirement for CRIU -- CRIU restore absolutely
> >>> needs a way to write
> >>> the saved shadow stack data into the shadow stack.
> >
> > Still wrapping my head around the CRIU save and restore steps, but
> > another general approach might be to give ptrace the ability to
> > temporarily pause/resume/set CET enablement and SSP for a stopped
> > thread. Then injected code doesn't need to jump through any hoops or
> > possibly run into road blocks. I'm not sure how much this opens things
> > up if the thread has to be stopped...
>
> Hmm, that's maybe not insane.
>
> An alternative would be to add a bona fide ptrace call-a-function
> mechanism. I can think of two potentially usable variants:
>
> 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> shadow stack push and all.
>
> 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> frame just like a real signal is being delivered with the specified
> handler. There could be a variant to opt-in to also using a specified
> altstack and altshadowstack.
>
> 2 would be more expensive but would avoid the need for much in the way
> of asm magic. The injected code could be plain C (or Rust or Zig or
> whatever).
>
> All of this only really handles save, not restore. I don't understand
> restore enough to fully understand the issue.

FWIW, CET-enabled GDB can call a function in a CET-enabled process.
Adding Felix who may know more about it.


--
H.J.

2022-02-10 04:16:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On 2/8/22 18:18, Edgecombe, Rick P wrote:
> On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
>> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
>>>>> But such a knob will immediately reduce the security value of
>>>>> the entire
>>>>> thing, and I don't have good ideas how to deal with it :(
>>>>
>>>> Probably a kind of latch in the task_struct which would trigger
>>>> off once
>>>> return to a different address happened, thus we would be able to
>>>> jump inside
>>>> parasite code. Of course such trigger should be available under
>>>> proper
>>>> capability only.
>>>
>>> I'm not fully in touch with how parasite, etc works. Are we
>>> talking about save or restore?
>>
>> We use parasite code in question during checkpoint phase as far as I
>> remember.
>> push addr/lret trick is used to run "injected" code (code injection
>> itself is
>> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
>> into this code
>> for years already, do we still need to support compat mode at all?
>>
>>> If it's restore, what exactly does CRIU need to do? Is it just
>>> that CRIU needs to return
>>> out from its resume code into the to-be-resumed program without
>>> tripping CET? Would it
>>> be acceptable for CRIU to require that at least one shstk slot be
>>> free at save time?
>>> Or do we need a mechanism to atomically switch to a completely full
>>> shadow stack at resume?
>>>
>>> Off the top of my head, a sigreturn (or sigreturn-like mechanism)
>>> that is intended for
>>> use for altshadowstack could safely verify a token on the
>>> altshadowstack, possibly
>>> compare to something in ucontext (or not -- this isn't clearly
>>> necessary) and switch
>>> back to the previous stack. CRIU could use that too. Obviously
>>> CRIU will need a way
>>> to populate the relevant stacks, but WRUSS can be used for that,
>>> and I think this
>>> is a fundamental requirement for CRIU -- CRIU restore absolutely
>>> needs a way to write
>>> the saved shadow stack data into the shadow stack.
>
> Still wrapping my head around the CRIU save and restore steps, but
> another general approach might be to give ptrace the ability to
> temporarily pause/resume/set CET enablement and SSP for a stopped
> thread. Then injected code doesn't need to jump through any hoops or
> possibly run into road blocks. I'm not sure how much this opens things
> up if the thread has to be stopped...

Hmm, that's maybe not insane.

An alternative would be to add a bona fide ptrace call-a-function
mechanism. I can think of two potentially usable variants:

1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
shadow stack push and all.

2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
frame just like a real signal is being delivered with the specified
handler. There could be a variant to opt-in to also using a specified
altstack and altshadowstack.

2 would be more expensive but would avoid the need for much in the way
of asm magic. The injected code could be plain C (or Rust or Zig or
whatever).

All of this only really handles save, not restore. I don't understand
restore enough to fully understand the issue.

--Andy
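
To make variant 1 concrete, tracer-side usage might look like the
sketch below. PTRACE_CALL_FUNCTION is only a proposal in this thread;
the request number and exact semantics here are invented for
illustration:

	/* Hypothetical request - not a real ptrace API */
	#define PTRACE_CALL_FUNCTION 0x42ff

	/* With the tracee stopped, ask the kernel to emulate CALL
	 * func_addr, including the shadow stack push. */
	if (ptrace(PTRACE_CALL_FUNCTION, pid, (void *)func_addr, NULL) == -1)
		perror("PTRACE_CALL_FUNCTION");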

2022-02-10 16:37:47

by Willgerodt, Felix

[permalink] [raw]
Subject: RE: [PATCH 00/35] Shadow stacks for userspace

> -----Original Message-----
> From: H.J. Lu <[email protected]>
> Sent: Thursday, 10 February 2022 03:54
> To: Lutomirski, Andy <[email protected]>; Willgerodt, Felix
> <[email protected]>
> Subject: Re: [PATCH 00/35] Shadow stacks for userspace
>
> On Wed, Feb 9, 2022 at 6:37 PM Andy Lutomirski <[email protected]> wrote:
> >
> > On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > >> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > >>>>> But such a knob will immediately reduce the security value of
> > >>>>> the entire
> > >>>>> thing, and I don't have good ideas how to deal with it :(
> > >>>>
> > >>>> Probably a kind of latch in the task_struct which would trigger
> > >>>> off once
> > >>>> return to a different address happened, thus we would be able to
> > >>>> jump inside
> > >>>> parasite code. Of course such trigger should be available under
> > >>>> proper
> > >>>> capability only.
> > >>>
> > >>> I'm not fully in touch with how parasite, etc works. Are we
> > >>> talking about save or restore?
> > >>
> > >> We use parasite code in question during checkpoint phase as far as I
> > >> remember.
> > >> push addr/lret trick is used to run "injected" code (code injection
> > >> itself is
> > >> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> > >> into this code
> > >> for years already, do we still need to support compat mode at all?
> > >>
> > >>> If it's restore, what exactly does CRIU need to do? Is it just
> > >>> that CRIU needs to return
> > >>> out from its resume code into the to-be-resumed program without
> > >>> tripping CET? Would it
> > >>> be acceptable for CRIU to require that at least one shstk slot be
> > >>> free at save time?
> > >>> Or do we need a mechanism to atomically switch to a completely full
> > >>> shadow stack at resume?
> > >>>
> > >>> Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > >>> that is intended for
> > >>> use for altshadowstack could safely verify a token on the
> > >>> altshadowstack, possibly
> > >>> compare to something in ucontext (or not -- this isn't clearly
> > >>> necessary) and switch
> > >>> back to the previous stack. CRIU could use that too. Obviously
> > >>> CRIU will need a way
> > >>> to populate the relevant stacks, but WRUSS can be used for that,
> > >>> and I think this
> > >>> is a fundamental requirement for CRIU -- CRIU restore absolutely
> > >>> needs a way to write
> > >>> the saved shadow stack data into the shadow stack.
> > >
> > > Still wrapping my head around the CRIU save and restore steps, but
> > > another general approach might be to give ptrace the ability to
> > > temporarily pause/resume/set CET enablement and SSP for a stopped
> > > thread. Then injected code doesn't need to jump through any hoops or
> > > possibly run into road blocks. I'm not sure how much this opens things
> > > up if the thread has to be stopped...
> >
> > Hmm, that's maybe not insane.
> >
> > An alternative would be to add a bona fide ptrace call-a-function
> > mechanism. I can think of two potentially usable variants:
> >
> > 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> > shadow stack push and all.
> >
> > 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> > frame just like a real signal is being delivered with the specified
> > handler. There could be a variant to opt-in to also using a specified
> > altstack and altshadowstack.
> >
> > 2 would be more expensive but would avoid the need for much in the way
> > of asm magic. The injected code could be plain C (or Rust or Zig or
> > whatever).
> >
> > All of this only really handles save, not restore. I don't understand
> > restore enough to fully understand the issue.
>
> FWIW, CET-enabled GDB can call a function in a CET-enabled process.
> Adding Felix who may know more about it.
>
>
> --
> H.J.

I don't know much about CRIU or kernel code, so I will stick to explaining
what our GDB patches for CET (not upstream yet) currently do.

GDB does inferior calls by setting the PC to the function it wants to call
and by manipulating the return address. It basically creates a dummy
frame and runs that on top of where it currently is.

To enable this for CET, our GDB CET patches push onto the shstk of the
inferior by writing to the inferior's memory, if it isn't out of range,
and by incrementing the SSP (using NT_X86_CET), both via ptrace.

(GDB also has a command called 'return', which basically returns early from
a function. Here GDB just decrements the SSP via ptrace.)
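
Roughly, the push half of that sequence could look like the sketch
below. The NT_X86_CET note number and the two-u64 regset layout follow
the earlier, unmerged kernel patches and are assumptions here; the ssp
adjustment direction follows the architecture (the shadow stack grows
down):

	#include <stdint.h>
	#include <sys/ptrace.h>
	#include <sys/types.h>
	#include <sys/uio.h>

	#define NT_X86_CET 0x203  /* from the unmerged patches; assumption */

	struct cet_regs { uint64_t cet_flags; uint64_t ssp; };  /* assumed layout */

	static int shstk_push(pid_t pid, uint64_t retaddr)
	{
		struct cet_regs regs;
		struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };

		if (ptrace(PTRACE_GETREGSET, pid, (void *)NT_X86_CET, &iov) == -1)
			return -1;
		regs.ssp -= 8;  /* the shadow stack grows down, like %rsp */
		if (ptrace(PTRACE_POKEDATA, pid, (void *)regs.ssp,
			   (void *)retaddr) == -1)
			return -1;  /* write the return address to the shstk */
		return ptrace(PTRACE_SETREGSET, pid, (void *)NT_X86_CET, &iov);
	}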

This was based on earlier versions of the kernel patches.
If the interface needs to change or if new interfaces become available to
do this better, that is fine from our pov. We didn't upstream this yet.

Also, if you have any concerns with this approach please let me know.

Regards,
Felix

2022-02-10 22:11:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 21/35] mm/mprotect: Exclude shadow stack from preserve_write

On 1/30/22 13:18, Rick Edgecombe wrote:
> In change_pte_range(), when a PTE is changed for prot_numa, _PAGE_RW is
> preserved to avoid the additional write fault after the NUMA hinting fault.
> However, pte_write() now includes both normal writable and shadow stack
> (RW=0, Dirty=1) PTEs, but the latter does not have _PAGE_RW and has no need
> to preserve it.

This series creates an interesting situation: it causes a logical
disconnection between things that were tightly coupled before. For
instance, before this series, _PAGE_RW=1 and "writable" really were
synonyms. They meant the same thing.

One of the complexities in this series is differentiating the two. For
instance, a shadow stack page can be written to, even though it has
_PAGE_RW=0.

This particular patch seems to be hacking around the problem that a
p*_mkwrite() doesn't work on shadow stack PTE/PMDs. First, that makes
me wonder what *actually* happens if we do a plain pte_mkwrite() on a
shadow stack PTE. I *think* it will take the [Write=0,Dirty=1] PTE and

pte = pte_set_flags(pte, _PAGE_RW);

so we'll end up with [Write=1,Dirty=1], which is bad.

Let's say pte_mkwrite() can't be fixed. We should probably make it
VM_BUG_ON() if it's ever asked to muck with a shadow stack PTE.

It's also weird because we have this pte_write()==1 PTE in a !VM_WRITE
VMA. Then, we're trying to pte_mkwrite() under this !VM_WRITE VMA.

pte_write() <-- returns true on shadow stack PTE!
pte_mkwrite() <-- illegal on shadow stack PTE

I need to think about this a little more. I don't have a solution.
But, as-is, it seems untenable. The rules are just too
counterintuitive to live.
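
For illustration, the guard being suggested could look something like
this sketch (is_shadow_stack_pte() is a hypothetical helper for the
Write=0,Dirty=1 encoding, not something defined in the series as
posted):

	static inline pte_t pte_mkwrite(pte_t pte)
	{
		/* Setting _PAGE_RW on a Write=0,Dirty=1 (shadow stack) PTE
		 * would silently turn it into a normal writable mapping. */
		VM_BUG_ON(is_shadow_stack_pte(pte));
		return pte_set_flags(pte, _PAGE_RW);
	}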

2022-02-10 23:01:58

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 20/35] mm: Update can_follow_write_pte() for shadow stack

From: Dave Hansen
> Sent: 09 February 2022 22:52
>
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > Like a writable data page, a shadow stack page is writable, and becomes
> > read-only during copy-on-write, but it is always dirty.
>
> One other thing...
>
> The language in these changelogs is a bit sloppy. For instance, what
> does "always dirty" mean here? pte_dirty()? Or strictly _PAGE_DIRTY?
>
> In other words, logically dirty, or literally "has *the* dirty bit set"?

Doesn't COW have to set it readonly - so that the access faults?
And then the fault code sets it readonly+dirty (without write)
to allow the shadow stack accesses to not fault.

Or am I mis-guessing what the docs actually say?

David

2022-02-11 00:21:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 18/35] mm: Add guard pages around a shadow stack.

On Thu, Feb 10, 2022 at 2:44 PM Dave Hansen <[email protected]> wrote:
>
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > INCSSP(Q/D) increments shadow stack pointer and 'pops and discards' the
> > first and the last elements in the range, effectively touches those memory
> > areas.
> >
> > The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes and
> > 255 * 4 = 1020 bytes by INCSSPD. Both ranges are far from PAGE_SIZE.
> > Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
> > CALL, and RET from going beyond.
>
> What is the downside of not applying this patch? The shadow stack gap
> is 1MB instead of 4k?
>
> That, frankly, doesn't seem too bad. How badly do we *need* this patch?

1MB of per-thread guard address space in a 32-bit program may be a
show stopper. Do we intend to support any of this for 32-bit?

--Andy

2022-02-11 05:19:01

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 18/35] mm: Add guard pages around a shadow stack.

From: Dave Hansen
> Sent: 09 February 2022 22:24
>
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > INCSSP(Q/D) increments shadow stack pointer and 'pops and discards' the
> > first and the last elements in the range, effectively touches those memory
> > areas.
>
> This is a pretty close copy of the instruction reference text for
> INCSSP. I'm feeling rather dense today, but that's just not making any
> sense.
>
> The pseudocode is more sensible in the SDM. I think this needs a better
> explanation:
>
> The INCSSP instruction increments the shadow stack pointer. It
> is the shadow stack analog of an instruction like:
>
> addq $0x80, %rsp
>
> However, there is one important difference between an ADD on
> %rsp and INCSSP. In addition to modifying SSP, INCSSP also
> reads from the memory of the first and last elements that were
> "popped". You can think of it as acting like this:
>
> READ_ONCE(ssp); // read+discard top element on stack
> ssp += nr_to_pop * 8; // move the shadow stack
> READ_ONCE(ssp-8); // read+discard last popped stack element
>
>
> > The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes and
> > 255 * 4 = 1020 bytes by INCSSPD. Both ranges are far from PAGE_SIZE.
>
> ... That maximum distance, combined with a guard page at the end of
> a shadow stack, ensures that INCSSP will fault before it is able to move
> across an entire guard page.
>
> > Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
> > CALL, and RET from going beyond.

Do you need a real guard page?
Or is it just enough to ensure that the adjacent page isn't another
shadow stack page?

Any other page will cause a fault because the PTE isn't readonly+dirty.

I'm not sure how common single-page allocations are in Linux.
But adjacent shadow stacks may be rare anyway.
So a check against both adjacent PTE entries would suffice.
Or maybe always allocate an even (or odd) numbered page.

David

2022-02-11 07:43:55

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 10/35] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS

CC [email protected]

Thread:
https://lore.kernel.org/lkml/[email protected]/

On Wed, 2022-02-09 at 08:58 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> >
> > diff --git a/drivers/gpu/drm/i915/gvt/gtt.c
> > b/drivers/gpu/drm/i915/gvt/gtt.c
> > index 99d1781fa5f0..75ce4e823902 100644
> > --- a/drivers/gpu/drm/i915/gvt/gtt.c
> > +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> > @@ -1210,7 +1210,7 @@ static int split_2MB_gtt_entry(struct
> > intel_vgpu *vgpu,
> > }
> >
> > /* Clear dirty field. */
> > - se->val64 &= ~_PAGE_DIRTY;
> > + se->val64 &= ~_PAGE_DIRTY_BITS;
> >
> > ops->clear_pse(se);
> > ops->clear_ips(se);
>
> Are these x86 CPU page table values? I see ->val64 being used like
> this:
>
> e->val64 &= ~GEN8_PAGE_PRESENT;
> and
> se.val64 |= GEN8_PAGE_PRESENT | GEN8_PAGE_RW;
>
> where we also have:
>
> #define GEN8_PAGE_PRESENT BIT_ULL(0)
> #define GEN8_PAGE_RW BIT_ULL(1)
>
> Which tells me that these are probably *close* to the CPU's page
> tables.
> But, I honestly don't know which format they are. I don't know if
> _PAGE_COW is still a software bit in that format or not.
>
> Either way, I don't think we should be messing with i915 device page
> tables.
>
> Or, are these somehow magically shared with the CPU in some way I
> don't
> know about?
>
> [ If these are device-only page tables, it would probably be nice to
> stop using _PAGE_FOO for them. It would avoid confusion like this.
> ]

The two Reviewed-by tags are giving me pause, but as far as I can tell
this should not be clearing _PAGE_DIRTY_BITS. This code seems to be
shadowing guest page tables, and the change would clear the COW
software bit in the guest page tables. So, yes, I think this should be
dropped.

2022-02-11 08:02:25

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 18/35] mm: Add guard pages around a shadow stack.

On 1/30/22 13:18, Rick Edgecombe wrote:
> INCSSP(Q/D) increments shadow stack pointer and 'pops and discards' the
> first and the last elements in the range, effectively touches those memory
> areas.
>
> The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes and
> 255 * 4 = 1020 bytes by INCSSPD. Both ranges are far from PAGE_SIZE.
> Thus, putting a gap page on both ends of a shadow stack prevents INCSSP,
> CALL, and RET from going beyond.

What is the downside of not applying this patch? The shadow stack gap
is 1MB instead of 4k?

That, frankly, doesn't seem too bad. How badly do we *need* this patch?

2022-02-11 09:06:37

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 18/35] mm: Add guard pages around a shadow stack.

On Thu, 2022-02-10 at 22:38 +0000, David Laight wrote:
> Do you need a real guard page?
> Or is it just enough to ensure that the adjacent page isn't another
> shadow stack page?
>
> Any other page will cause a fault because the PTE isn't
> readonly+dirty.
>
> I'm not sure how common single-page allocations are in Linux.

I think it came from this discussion:

https://lore.kernel.org/lkml/CAG48ez1ytOfQyNZMNPFp7XqKcpd7_aRai9G5s7rx0V=8ZG+r2A@mail.gmail.com/#t

> But adjacent shadow stacks may be rare anyway.
> So a check against both adjacent PTE entries would suffice.
> Or maybe always allocate an even (or odd) numbered page.

It just needs to not be adjacent to shadow stack memory to do the job.
Would that be simple to implement? It might be a tradeoff of code
complexity.

2022-02-11 09:30:35

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Thu, Feb 10, 2022 at 11:41:16PM -0800, [email protected] wrote:
> On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> >
> > An alternative would be to add a bona fide ptrace call-a-function mechanism.
> > I can think of two potentially usable variants:
> >
> > 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> > shadow stack push and all.
> >
> > 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> > frame just like a real signal is being delivered with the specified handler.
> > There could be a variant to opt-in to also using a specified altstack and
> > altshadowstack.
>
> I think this would be ideal. In CRIU, the parasite code is executed in
> the "daemon" mode and returns back via sigreturn. Right now, CRIU needs
> to generate a signal frame. If I understand your idea right, the signal
> frame will be generated by the kernel.
>
> >
> > 2 would be more expensive but would avoid the need for much in the way of
> > asm magic. The injected code could be plain C (or Rust or Zig or whatever).
> >
> > All of this only really handles save, not restore. I don't understand
> > restore enough to fully understand the issue.
>
> In a few words, it works like this: CRIU restores all required resources
> and prepares a signal frame with a target process state, then it
> switches to a small PIE blob, where it restores vma-s and calls
> rt_sigreturn.

I think it's also important to note that the stack is restored as a part of
the process memory, i.e. its contents are read from the images.

> >
> > --Andy

--
Sincerely yours,
Mike.

2022-02-11 09:55:48

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > > On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
> > > > > > But such a knob will immediately reduce the security value of
> > > > > > the entire
> > > > > > thing, and I don't have good ideas how to deal with it :(
> > > > >
> > > > > Probably a kind of latch in the task_struct which would trigger
> > > > > off once
> > > > > return to a different address happened, thus we would be able to
> > > > > jump inside
> > > > > parasite code. Of course such trigger should be available under
> > > > > proper
> > > > > capability only.
> > > >
> > > > I'm not fully in touch with how parasite, etc works. Are we
> > > > talking about save or restore?
> > >
> > > We use parasite code in question during checkpoint phase as far as I
> > > remember.
> > > push addr/lret trick is used to run "injected" code (code injection
> > > itself is
> > > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look
> > > into this code
> > > for years already, do we still need to support compat mode at all?
> > >
> > > > If it's restore, what exactly does CRIU need to do? Is it just
> > > > that CRIU needs to return
> > > > out from its resume code into the to-be-resumed program without
> > > > tripping CET? Would it
> > > > be acceptable for CRIU to require that at least one shstk slot be
> > > > free at save time?
> > > > Or do we need a mechanism to atomically switch to a completely full
> > > > shadow stack at resume?
> > > >
> > > > Off the top of my head, a sigreturn (or sigreturn-like mechanism)
> > > > that is intended for
> > > > use for altshadowstack could safely verify a token on the
> > > > altshadowstack, possibly
> > > > compare to something in ucontext (or not -- this isn't clearly
> > > > necessary) and switch
> > > > back to the previous stack. CRIU could use that too. Obviously
> > > > CRIU will need a way
> > > > to populate the relevant stacks, but WRUSS can be used for that,
> > > > and I think this
> > > > is a fundamental requirement for CRIU -- CRIU restore absolutely
> > > > needs a way to write
> > > > the saved shadow stack data into the shadow stack.
> >
> > Still wrapping my head around the CRIU save and restore steps, but
> > another general approach might be to give ptrace the ability to
> > temporarily pause/resume/set CET enablement and SSP for a stopped
> > thread. Then injected code doesn't need to jump through any hoops or
> > possibly run into road blocks. I'm not sure how much this opens things
> > up if the thread has to be stopped...
>
> Hmm, that's maybe not insane.
>
> An alternative would be to add a bona fide ptrace call-a-function mechanism.
> I can think of two potentially usable variants:
>
> 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> shadow stack push and all.
>
> 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> frame just like a real signal is being delivered with the specified handler.
> There could be a variant to opt-in to also using a specified altstack and
> altshadowstack.

I think this would be ideal. In CRIU, the parasite code is executed in
the "daemon" mode and returns back via sigreturn. Right now, CRIU needs
to generate a signal frame. If I understand your idea right, the signal
frame will be generated by the kernel.

>
> 2 would be more expensive but would avoid the need for much in the way of
> asm magic. The injected code could be plain C (or Rust or Zig or whatever).
>
> All of this only really handles save, not restore. I don't understand
> restore enough to fully understand the issue.

In a few words, it works like this: CRIU restores all required resources
and prepares a signal frame with a target process state, then it
switches to a small PIE blob, where it restores vma-s and calls
rt_sigreturn.

>
> --Andy

2022-02-11 11:02:09

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 18/35] mm: Add guard pages around a shadow stack.

From: Edgecombe, Rick P
> Sent: 10 February 2022 23:43
>
> On Thu, 2022-02-10 at 22:38 +0000, David Laight wrote:
> > Do you need a real guard page?
> > Or is it just enough to ensure that the adjacent page isn't another
> > shadow stack page?
> >
> > Any other page will cause a fault because the PTE isn't
> > readonly+dirty.
> >
> > I'm not sure how common single-page allocations are in Linux.
>
> I think it came from this discussion:
>
> https://lore.kernel.org/lkml/CAG48ez1ytOfQyNZMNPFp7XqKcpd7_aRai9G5s7rx0V=8ZG+r2A@mail.gmail.com/#t
>
> > But adjacent shadow stacks may be rare anyway.
> > So a check against both adjacent PTE entries would suffice.
> > Or maybe always allocate an even (or odd) numbered page.
>
> It just needs to not be adjacent to shadow stack memory to do the job.
> Would that be simple to implement? It might be a tradeoff of code
> complexity.

That's what I thought.
Although the VA use for guard pages might be a problem in itself.

I'm not sure why I thought shadow stacks would be a single page.
For user space that 'only' allows 512 calls.
For kernel it is a massive waste of memory.
It is probably worth putting multiple kernel shadow stacks into the same page.
(Code that can overrun can do other stuff more easily.)

The hardware engineers failed to think about the implementation (again).
The shadow stack should (probably) run in the opposite direction to
the normal stack.
Then the shadow stack can be placed at the other end of the VA allocated
to a user space stack.

David

2022-02-11 12:27:17

by Wang, Zhi A

[permalink] [raw]
Subject: Re: [PATCH 10/35] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS

On 2/11/22 1:39 AM, Edgecombe, Rick P wrote:
> CC [email protected]
>
> Thread:
> https://lore.kernel.org/lkml/[email protected]/
>
> On Wed, 2022-02-09 at 08:58 -0800, Dave Hansen wrote:
>> On 1/30/22 13:18, Rick Edgecombe wrote:
>>>
>>> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c
>>> b/drivers/gpu/drm/i915/gvt/gtt.c
>>> index 99d1781fa5f0..75ce4e823902 100644
>>> --- a/drivers/gpu/drm/i915/gvt/gtt.c
>>> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
>>> @@ -1210,7 +1210,7 @@ static int split_2MB_gtt_entry(struct
>>> intel_vgpu *vgpu,
>>> }
>>>
>>> /* Clear dirty field. */
>>> - se->val64 &= ~_PAGE_DIRTY;
>>> + se->val64 &= ~_PAGE_DIRTY_BITS;
>>>
>>> ops->clear_pse(se);
>>> ops->clear_ips(se);
>>
>> Are these x86 CPU page table values? I see ->val64 being used like
>> this:
>>
>> e->val64 &= ~GEN8_PAGE_PRESENT;
>> and
>> se.val64 |= GEN8_PAGE_PRESENT | GEN8_PAGE_RW;
>>
>> where we also have:
>>
>> #define GEN8_PAGE_PRESENT BIT_ULL(0)
>> #define GEN8_PAGE_RW BIT_ULL(1)
>>
>> Which tells me that these are probably *close* to the CPU's page
>> tables.
>> But, I honestly don't know which format they are. I don't know if
>> _PAGE_COW is still a software bit in that format or not.
>>
>> Either way, I don't think we should be messing with i915 device page
>> tables.
>>
>> Or, are these somehow magically shared with the CPU in some way I
>> don't
>> know about?
>>
>> [ If these are device-only page tables, it would probably be nice to
>> stop using _PAGE_FOO for them. It would avoid confusion like this.
>> ]
>
> The two Reviewed-by tags are giving me pause, but as far as I can tell
> this should not be clearing _PAGE_DIRTY_BITS. This code seems to be
> shadowing guest page tables, and the change would clear the COW
> software bit in the guest page tables. So, yes, I think this should be
> dropped.
>

Hi:

According to the PRM https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-lkf-vol06-memory_views.pdf p.28,
the GPU page table format is IA-like, and there are scenarios where the
IA cores and the GPU share a page table. That's why they share part of
the definitions. But the dirty bits are ignored by the HW that GVT-g
supports. The code should copy the bits from the guest PDPE 2M entry
and then clear some unused bits. So _PAGE_DIRTY is misused here.

I would suggest you remove that line from your patch, and I will clean
up this function after your patches get merged.

Thanks,
Zhi.

2022-02-11 15:16:07

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 26/35] x86/process: Change copy_thread() argument 'arg' to 'stack_size'

On Tue, 2022-02-08 at 09:38 +0100, Thomas Gleixner wrote:
> On Sun, Jan 30 2022 at 13:18, Rick Edgecombe wrote:
> > -int copy_thread(unsigned long clone_flags, unsigned long sp,
> > unsigned long arg,
> > - struct task_struct *p, unsigned long tls)
> > +int copy_thread(unsigned long clone_flags, unsigned long sp,
> > + unsigned long stack_size, struct task_struct *p,
> > + unsigned long tls)
> > {
> > struct inactive_task_frame *frame;
> > struct fork_frame *fork_frame;
> > @@ -175,7 +176,7 @@ int copy_thread(unsigned long clone_flags,
> > unsigned long sp, unsigned long arg,
> > if (unlikely(p->flags & PF_KTHREAD)) {
> > p->thread.pkru = pkru_get_init_value();
> > memset(childregs, 0, sizeof(struct pt_regs));
> > - kthread_frame_init(frame, sp, arg);
> > + kthread_frame_init(frame, sp, stack_size);
> > return 0;
> > }
> >
> > @@ -208,7 +209,7 @@ int copy_thread(unsigned long clone_flags,
> > unsigned long sp, unsigned long arg,
> > */
> > childregs->sp = 0;
> > childregs->ip = 0;
> > - kthread_frame_init(frame, sp, arg);
> > + kthread_frame_init(frame, sp, stack_size);
> > return 0;
> > }
>
> Can you please change the prototypes too for completeness sake?

In the header it's:
extern int copy_thread(unsigned long, unsigned long, unsigned long,
struct task_struct *, unsigned long);

And the various arch implementations call the stack size: arg,
kthread_arg, stk_sz, etc.

Adding names to the prototype would conflict with some archs' names
unless they were all unified. Is it a worthwhile refactor?


2022-02-11 16:30:44

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 18/35] mm: Add guard pages around a shadow stack.

On Thu, 2022-02-10 at 15:07 -0800, Andy Lutomirski wrote:
> On Thu, Feb 10, 2022 at 2:44 PM Dave Hansen <[email protected]>
> wrote:
> >
> > On 1/30/22 13:18, Rick Edgecombe wrote:
> > > INCSSP(Q/D) increments shadow stack pointer and 'pops and
> > > discards' the
> > > first and the last elements in the range, effectively touches
> > > those memory
> > > areas.
> > >
> > > The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes
> > > and
> > > 255 * 4 = 1020 bytes by INCSSPD. Both ranges are far from
> > > PAGE_SIZE.
> > > Thus, putting a gap page on both ends of a shadow stack prevents
> > > INCSSP,
> > > CALL, and RET from going beyond.
> >
> > What is the downside of not applying this patch? The shadow stack
> > gap
> > is 1MB instead of 4k?
> >
> > That, frankly, doesn't seem too bad. How badly do we *need* this
> > patch?

Like just using VM_SHADOW_STACK | VM_GROWSDOWN to get a regular stack
sized gap? I think it could work. It also simplifies the mm->stack_vm
accounting.

It would no longer get a gap at the end though. I don't think it's
needed.

>
> 1MB of per-thread guard address space in a 32-bit program may be a
> show stopper. Do we intend to support any of this for 32-bit?

It is supported in 32-bit compatibility mode, although IBT dropped it.
I guess this was probably the reason.

2022-02-12 07:19:47

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 18/35] mm: Add guard pages around a shadow stack.

On Fri, 2022-02-11 at 09:54 -0800, Andy Lutomirski wrote:
> > Like just using VM_SHADOW_STACK | VM_GROWSDOWN to get a regular
> > stack
> > sized gap? I think it could work. It also simplifies the mm-
> > >stack_vm
> > accounting.
>
> Seems not crazy. Do we want automatically growing shadow stacks? I
> don't really like the historical unix behavior where the main thread
> has
> a sort-of-infinite stack and every other thread has a fixed stack.

Ah! I was scratching my head about why glibc did that - it's historical.
Yea, it's probably not needed and adds strange behavior.

>
> >
> > It would no longer get a gap at the end though. I don't think it's
> > needed.
> >
>
> I may have missed something about the oddball way the mm code works,
> but
> it seems if you have a gap at one end of every shadow stack, you
> automatically have a gap at the other end.

Right, we only need one, and this patch added a gap on both ends.

Per Andy's comment about the "oddball way" the mm code does the gaps -
The previous version of this (PROT_SHADOW_STACK) had an issue where if
you started with writable memory, then called mprotect() with
PROT_SHADOW_STACK, the internal rb tree would get confused over the
sudden appearance of a gap. This new version follows closer to how
MAP_STACK avoids the problem I saw. But the way these guard gaps work
seems to barely avoid problems when you do things like split the vma by
mprotect()ing the middle of one. I wasn't sure if it's worth a
refactor. I guess the solution is quite old and there haven't been
problems. I'm not even sure what the change would be, but it does feel
like adding to something fragile. Maybe shadow stack should just place
a guard page manually and not add any special shadow stack logic to the
mm code...

Other than that I'm inclined to remove the end gap and justify this
patch better in the commit log. Something like this (borrowing some
from Dave's comments):

The architecture of shadow stack constrains the ability of
userspace to move the shadow stack pointer (ssp) in order to
prevent corrupting or switching to other shadow stacks. The
RSTORSSP instruction can move the ssp to different shadow
stacks, but it requires a specially placed token in order to
switch shadow stacks. However, the architecture does not
prevent incrementing or decrementing the shadow stack pointer
to wander onto an adjacent shadow stack. To prevent this in
software, enforce guard pages at the beginning of shadow stack
vmas, such that there will always be a gap between adjacent
shadow stacks.

Make the gap big enough so that no userspace ssp changing
operations (besides RSTORSSP), can move the ssp from one stack
to the next. The ssp can increment or decrement by CALL, RET
and INCSSP. CALL and RET can move the ssp by a maximum of 8
bytes, at which point the shadow stack would be accessed.

The INCSSP instruction can also increment the shadow stack
pointer. It is the shadow stack analog of an instruction like:

addq $0x80, %rsp

However, there is one important difference between an ADD on
%rsp and INCSSP. In addition to modifying SSP, INCSSP also
reads from the memory of the first and last elements that were
"popped". It can be thought of as acting like this:

READ_ONCE(ssp); // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8); // read+discard last popped stack element

The maximum distance INCSSP can move the ssp is 2040 bytes,
at which point it would read the memory. Therefore a single
page gap is enough to prevent any operation from shifting the
ssp to an adjacent stack without first touching the guard page
and causing a fault.

This could be accomplished by using VM_GROWSDOWN, but this has
two downsides.
1. VM_GROWSDOWN will have a 1MB gap which is on the large
side for 32 bit address spaces
2. The behavior would allow shadow stacks to grow, which
is unneeded and adds a strange difference to how most
regular stacks work.
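
In mm terms, the difference from VM_GROWSDOWN could boil down to
something like this sketch (the helper name is illustrative, modeled on
the existing stack gap handling in mm.h):

	/* Sketch: a fixed one-page start gap for shadow stack VMAs,
	 * versus the stack_guard_gap-sized gap VM_GROWSDOWN gets. */
	static unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
	{
		if (vma->vm_flags & VM_GROWSDOWN)
			return stack_guard_gap;  /* defaults to 256 pages (1MB) */
		if (vma->vm_flags & VM_SHADOW_STACK)
			return PAGE_SIZE;        /* 2040 bytes < 4096, see above */
		return 0;
	}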

2022-02-12 16:30:47

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 22/35] x86/mm: Prevent VM_WRITE shadow stacks

On Fri, 2022-02-11 at 14:19 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > Shadow stack accesses are writes from handle_mm_fault()
> > perspective. So to
> > generate the correct PTE, maybe_mkwrite() will rely on the presence
> > of
> > VM_SHADOW_STACK or VM_WRITE in the vma.
> >
> > In future patches, when VM_SHADOW_STACK is actually creatable by
> > userspace, a problem could happen if a user calls
> > mprotect( , , PROT_WRITE) on VM_SHADOW_STACK shadow stack memory.
> > The code
> > would then be confused in the event of shadow stack accesses, and
> > create a
> > writable PTE for a shadow stack access. Then the process would
> > fault in a
> > loop.
> >
> > Prevent this from happening by blocking this kind of memory
> > (VM_WRITE and
> > VM_SHADOW_STACK) from being created, instead of complicating the
> > fault
> > handler logic to handle it.
> >
> > Add an x86 arch_validate_flags() implementation to handle the
> > check.
> > Rename the uapi/asm/mman.h header guard to be able to use it for
> > arch/x86/include/asm/mman.h where the arch_validate_flags() will
> > be.
>
> It would be great if this also said:
>
> There is an existing arch_validate_flags() hook for mmap() and
> mprotect() which allows architectures to reject unwanted
> ->vm_flags combinations. Add an implementation for x86.
>
> That's somewhat implied from what is there already, but making it
> more
> clear would be nice. There's a much higher bar to add a new arch
> hook
> than to just implement an existing one.

Ok, makes sense.

>
>
> > diff --git a/arch/x86/include/asm/mman.h
> > b/arch/x86/include/asm/mman.h
> > new file mode 100644
> > index 000000000000..b44fe31deb3a
> > --- /dev/null
> > +++ b/arch/x86/include/asm/mman.h
> > @@ -0,0 +1,21 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_MMAN_H
> > +#define _ASM_X86_MMAN_H
> > +
> > +#include <linux/mm.h>
> > +#include <uapi/asm/mman.h>
> > +
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static inline bool arch_validate_flags(unsigned long vm_flags)
> > +{
> > + if ((vm_flags & VM_SHADOW_STACK) && (vm_flags & VM_WRITE))
> > + return false;
> > +
> > + return true;
> > +}
>
> The design decision here seems to be that VM_SHADOW_STACK is itself a
> pseudo-VM_WRITE flag. Like you said: "Shadow stack accesses are
> writes
> from handle_mm_fault()".
>
> Very early on, this series seems to have made the decision that
> shadow
> stacks are writable and need lots of write handling behavior, *BUT*
> shouldn't have VM_WRITE set. As a whole, that seems odd.
>
> The alternative would be *requiring* VM_WRITE and VM_SHADOW_STACK be
> set
> together. I guess the downside is that pte_mkwrite() would need to
> be
> made to work on shadow stack PTEs.
>
> That particular design decision was never discussed. I think it has
> a
> really big impact on the rest of the series. What do you think? Was
> it
> a good idea? Or would the alternative be more complicated than what
> you
> have now?

First of all, thanks again for the deep review of the MM piece. I'm
still pondering the overall problem, which is why I haven't responded
to those yet.

I had originally thought that the MM changes were a bit hard to follow.
I was also somewhat amazed at how naturally normal COW worked. I was
wondering where the big COW stuff would be happening. In the way that
COW was sort of tucked away, overloading writability seemed sort of
aligned. But the names are very confusing, and this patch probably
should have been a hint that there are problems design wise.

For writability, especially with WRSS, I do think it's a bit unnatural
to think of shadow stack memory as anything but writable. Especially
when it comes to COW. But shadow stack accesses are not always writes -
INCSSP, for example. The code will create shadow stack memory for shadow
stack access loads, which of course isn't writing anything, but is
required to make the instruction work. So it calls mkwrite(), which is
weird. But... it does need to leave it in a state that is kind of
writable, so it makes a little sense I guess.

I was wondering if maybe the mm code can't be fully sensible for shadow
stacks without creating maybe_mkshstk() and adding it everywhere in a
whole new fault path. Then you have reads, writes and shadow stack
accesses that each have their own logic. It might require so many
additions that better names and comments are preferable. I don't know
though, still trying to come up with a good opinion.
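
Purely to illustrate the shape of that idea (maybe_mkshstk() and
pte_mkdirtyshstk() are hypothetical names, and the PTE details would
have to match the series' COW handling):

	/* Hypothetical third case alongside read and write faults: a
	 * shadow stack access wants a Write=0,Dirty=1 PTE. */
	static inline pte_t maybe_mkshstk(pte_t pte, struct vm_area_struct *vma)
	{
		if (vma->vm_flags & VM_SHADOW_STACK)
			pte = pte_mkdirtyshstk(pte);  /* hypothetical helper */
		return pte;
	}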


2022-02-12 18:23:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 22/35] x86/mm: Prevent VM_WRITE shadow stacks

On 1/30/22 13:18, Rick Edgecombe wrote:
> Shadow stack accesses are writes from handle_mm_fault() perspective. So to
> generate the correct PTE, maybe_mkwrite() will rely on the presence of
> VM_SHADOW_STACK or VM_WRITE in the vma.
>
> In future patches, when VM_SHADOW_STACK is actually creatable by
> userspace, a problem could happen if a user calls
> mprotect( , , PROT_WRITE) on VM_SHADOW_STACK shadow stack memory. The code
> would then be confused in the event of shadow stack accesses, and create a
> writable PTE for a shadow stack access. Then the process would fault in a
> loop.
>
> Prevent this from happening by blocking this kind of memory (VM_WRITE and
> VM_SHADOW_STACK) from being created, instead of complicating the fault
> handler logic to handle it.
>
> Add an x86 arch_validate_flags() implementation to handle the check.
> Rename the uapi/asm/mman.h header guard to be able to use it for
> arch/x86/include/asm/mman.h where the arch_validate_flags() will be.

It would be great if this also said:

There is an existing arch_validate_flags() hook for mmap() and
mprotect() which allows architectures to reject unwanted
->vm_flags combinations. Add an implementation for x86.

That's somewhat implied from what is there already, but making it more
clear would be nice. There's a much higher bar to add a new arch hook
than to just implement an existing one.


> diff --git a/arch/x86/include/asm/mman.h b/arch/x86/include/asm/mman.h
> new file mode 100644
> index 000000000000..b44fe31deb3a
> --- /dev/null
> +++ b/arch/x86/include/asm/mman.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_MMAN_H
> +#define _ASM_X86_MMAN_H
> +
> +#include <linux/mm.h>
> +#include <uapi/asm/mman.h>
> +
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static inline bool arch_validate_flags(unsigned long vm_flags)
> +{
> + if ((vm_flags & VM_SHADOW_STACK) && (vm_flags & VM_WRITE))
> + return false;
> +
> + return true;
> +}

The design decision here seems to be that VM_SHADOW_STACK is itself a
pseudo-VM_WRITE flag. Like you said: "Shadow stack accesses are writes
from handle_mm_fault()".

Very early on, this series seems to have made the decision that shadow
stacks are writable and need lots of write handling behavior, *BUT*
shouldn't have VM_WRITE set. As a whole, that seems odd.

The alternative would be *requiring* VM_WRITE and VM_SHADOW_STACK be set
together. I guess the downside is that pte_mkwrite() would need to be
made to work on shadow stack PTEs.

That particular design decision was never discussed. I think it has a
really big impact on the rest of the series. What do you think? Was it
a good idea? Or would the alternative be more complicated than what you
have now?

2022-02-12 23:22:10

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 23/35] x86/fpu: Add helpers for modifying supervisor xstate

On Fri, 2022-02-11 at 16:27 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > Add helpers that can be used to modify supervisor xstate safely for
> > the
> > current task.
>
> This should be at the end of the changelog.

Hmm, ok.

>
> > State for supervisors xstate based features can be live and
> > accesses via MSR's, or saved in memory in an xsave buffer. When the
> > kernel needs to modify this state it needs to be sure to operate on
> > it
> > in the right place, so the modifications don't get clobbered.
>
> We tend to call these "supervisor xfeatures". The "state is in the
> registers" we call "active". Maybe:
>
> Just like user xfeatures, supervisor xfeatures can be either
> active in the registers or inactive and present in the task FPU
> buffer. If the registers are active, the registers can be
> modified directly. If the registers are not active, the
> modification must be performed on the task FPU buffer.

Ok, thanks.

>
>
> > In the past supervisor xstate features have used get_xsave_addr()
> > directly, and performed open coded logic handle operating on the
> > saved
> > state correctly. This has posed two problems:
> > 1. It has logic that has been gotten wrong more than once.
> > 2. To reduce code, less common path's are not optimized.
> > Determination
>
> "paths" ^
>

Arg, thanks.

>
> > xstate = start_update_xsave_msrs(XFEATURE_FOO);
> > r = xsave_rdmsrl(state, MSR_IA32_FOO_1, &val)
> > if (r)
> > xsave_wrmsrl(state, MSR_IA32_FOO_2, FOO_ENABLE);
> > end_update_xsave_msrs();
>
> This looks OK. I'm not thrilled about it. The
> start_update_xsave_msrs() can probably drop the "_msrs". Maybe:
>
> start_xfeature_update(...);

Hmm, this whole thing pretends to be updating MSRs, which is often not
true. Maybe the xsave_rdmsrl/xsave_wrmsrl should be renamed too.
xsave_readl()/xsave_writel() or something.

>
> Also, if you have to do the address lookup in xsave_rdmsrl() anyway,
> I
> wonder if the 'xstate' should just be a full fledged 'struct
> xregs_state'.
>
> The other option would be to make a little on-stack structure like:
>
> struct xsave_update {
> int feature;
> struct xregs_state *xregs;
> };
>
> Then you do:
>
> struct xsave_update xsu;
> ...
> start_update_xsave_msrs(&xsu, XFEATURE_FOO);
>
> and then pass it along to each of the other operations:
>
> r = xsave_rdmsrl(xsu, MSR_IA32_FOO_1, &val)
>
> It's slightly less likely to get type confused as a 'void *';

The 'void *' is actually a pointer to the specific xfeature in the
buffer. So the read/writes don't have to re-compute the offset every
time. It's not too much work though. I'm really surprised by the desire
to obfuscate the pointer, but I guess if we really want to, I'd rather
do that and keep the regular read/write operations.

If we don't care about the extra lookups this can totally drop the
caller side state. The feature nr can be looked up from the MSR along
with the struct offset. Then it doesn't expose the pointer to the
buffer, since it's all recomputed every operation.

So like:
start_xfeature_update();
r = xsave_readl(MSR_IA32_FOO_1, &val)
if (r)
xsave_writel(MSR_IA32_FOO_2, FOO_ENABLE);
end_xfeature_update();

The WARNs then happen in the read/writes. An early iteration looked
like that. I liked this version with caller side state, but thought it
might be worth revisiting if there really is a strong desire to hide
the pointer.
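
For the stateless variant, the per-access lookup could be as simple as
the sketch below (the helper is illustrative; the CET MSR and xfeature
names are the ones used by the series):

	/* Sketch: map an MSR to the xfeature holding its saved copy, so
	 * xsave_readl()/xsave_writel() can recompute the buffer offset
	 * on every access. */
	static int msr_to_xfeature_nr(u32 msr)
	{
		switch (msr) {
		case MSR_IA32_U_CET:
		case MSR_IA32_PL3_SSP:
			return XFEATURE_CET_USER;
		default:
			WARN_ONCE(1, "x86/fpu: unsupported xstate msr (%u)\n", msr);
			return -1;
		}
	}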

>
> > +static u64 *__get_xsave_member(void *xstate, u32 msr)
> > +{
> > + switch (msr) {
> > + /* Currently there are no MSR's supported */
> > + default:
> > + WARN_ONCE(1, "x86/fpu: unsupported xstate msr (%u)\n",
> > msr);
> > + return NULL;
> > + }
> > +}
>
> Just to get an idea what this is doing, it's OK to include the shadow
> stack MSRs in here.

Ok.

>
> Are you sure this should return a u64*? We have lots of <=64-bit
> XSAVE
> fields.

I thought it should only be used with 64-bit MSRs. Maybe it needs a
better name?

>
> > +/*
> > + * Return a pointer to the xstate for the feature if it should be
> > used, or NULL
> > + * if the MSRs should be written to directly. To do this safely,
> > using the
> > + * associated read/write helpers is required.
> > + */
> > +void *start_update_xsave_msrs(int xfeature_nr)
> > +{
> > + void *xstate;
> > +
> > + /*
> > + * fpregs_lock() only disables preemption (mostly). So modifing
> > state
>
> modifying ^
>
> > + * in an interrupt could screw up some in progress fpregs
> > operation,
>
> ^ in-progress

I swear I ran checkpatch...

>
> > + * but appear to work. Warn about it.
> > + */
> > + WARN_ON_ONCE(!in_task());
> > + WARN_ON_ONCE(current->flags & PF_KTHREAD);
>
> This might also be a good spot to check that xfeature_nr is in
> fpstate.xfeatures.

Hmm, good idea.

>
> > + fpregs_lock();
> > +
> > + fpregs_assert_state_consistent();
> > +
> > + /*
> > + * If the registers don't need to be reloaded. Go ahead and
> > operate on the
> > + * registers.
> > + */
> > + if (!test_thread_flag(TIF_NEED_FPU_LOAD))
> > + return NULL;
> > +
> > + xstate = get_xsave_addr(&current->thread.fpu.fpstate-
> > >regs.xsave, xfeature_nr);
> > +
> > + /*
> > + * If regs are in the init state, they can't be retrieved from
> > + * init_fpstate due to the init optimization, but are not
> > nessarily
>
> necessarily ^

Oof, thanks.

>
> Spell checker time. ":set spell" in vim works for me nicely.
>
> > + * zero. The only option is to restore to make everything live
> > and
> > + * operate on registers. This will clear TIF_NEED_FPU_LOAD.
> > + *
> > + * Otherwise, if not in the init state but TIF_NEED_FPU_LOAD is
> > set,
> > + * operate on the buffer. The registers will be restored before
> > going
> > + * to userspace in any case, but the task might get preempted
> > before
> > + * then, so this possibly saves an xsave.
> > + */
> > + if (!xstate)
> > + fpregs_restore_userregs();
>
> Won't fpregs_restore_userregs() end up setting TIF_NEED_FPU_LOAD=0?
> Isn't that a case where a "return NULL" is needed?

This is for the case when the feature is in the init state. For CET's
case this could just zero the buffer and return the pointer to it, but
for other features the init state wasn't always zero. So this just
makes all the features "active" and TIF_NEED_FPU_LOAD is cleared. It
then returns NULL and the read/writes go to the MSRs. It still looks
correct to me, am I missing something?

>
> In any case, this makes me think this code should start out stupid
> and
> slow. Keep the API as-is, but make the first patch unconditionally
> do
> the WRMSR. Leave the "fast" buffer modifications for a follow-on
> patch.

Ok. Should I drop the optimized versions from the series or just split
them out? The optimizations were trying to address Boris' comments:
https://lore.kernel.org/lkml/[email protected]/

2022-02-13 10:36:16

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 10/35] drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS

On Fri, 2022-02-11 at 07:13 +0000, Wang, Zhi A wrote:
> I would suggest you remove that line from your patch, and I will
> clean
> up this function after your patches get merged.

Thanks!

2022-02-14 03:27:55

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 25/35] x86/cet/shstk: Add user-mode shadow stack support

On 1/30/22 13:18, Rick Edgecombe wrote:
> Add the user shadow stack MSRs to the xsave helpers, so they can be used
> to implement the functionality.

Do these MSRs ever affect kernel-mode operation?

If so, we might need to switch them more aggressively at context-switch
time like PKRU.

If not, they can continue to be context-switched with the PASID state
which does not affect kernel-mode operation.

Either way, it would be nice to have some changelog material to that effect.

2022-02-14 07:16:34

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 25/35] x86/cet/shstk: Add user-mode shadow stack support

On Fri, 2022-02-11 at 15:37 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > Add the user shadow stack MSRs to the xsave helpers, so they can be
> > used
> > to implement the functionality.
>
> Do these MSRs ever affect kernel-mode operation?
>
> If so, we might need to switch them more aggressively at context-
> switch
> time like PKRU.
>
> If not, they can continue to be context-switched with the PASID state
> which does not affect kernel-mode operation.
>
> Either way, it would be nice to have some changelog material to that
> effect.

The only special shadow stack thing the kernel does is WRUSS, which per
the SDM only needs the CR4 bit set to work (unlike WRSS). So I think
the lazy restore is ok.

2022-02-14 08:42:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 25/35] x86/cet/shstk: Add user-mode shadow stack support



On Fri, Feb 11, 2022, at 3:37 PM, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
>> Add the user shadow stack MSRs to the xsave helpers, so they can be used
>> to implement the functionality.
>
> Do these MSRs ever affect kernel-mode operation?
>
> If so, we might need to switch them more aggressively at context-switch
> time like PKRU.
>
> If not, they can continue to be context-switched with the PASID state
> which does not affect kernel-mode operation.

PASID? PASID is all kinds of weird. I assume you mean switching it with all the normal state.

>
> Either way, it would be nice to have some changelog material to that effect.

2022-02-14 09:15:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 23/35] x86/fpu: Add helpers for modifying supervisor xstate

On 1/30/22 13:18, Rick Edgecombe wrote:
> Add helpers that can be used to modify supervisor xstate safely for the
> current task.

This should be at the end of the changelog.

> State for supervisors xstate based features can be live and
> accesses via MSR's, or saved in memory in an xsave buffer. When the
> kernel needs to modify this state it needs to be sure to operate on it
> in the right place, so the modifications don't get clobbered.

We tend to call these "supervisor xfeatures". The "state is in the
registers" we call "active". Maybe:

Just like user xfeatures, supervisor xfeatures can be either
active in the registers or inactive and present in the task FPU
buffer. If the registers are active, the registers can be
modified directly. If the registers are not active, the
modification must be performed on the task FPU buffer.


> In the past supervisor xstate features have used get_xsave_addr()
> directly, and performed open coded logic handle operating on the saved
> state correctly. This has posed two problems:
> 1. It has logic that has been gotten wrong more than once.
> 2. To reduce code, less common path's are not optimized. Determination

"paths" ^


> xstate = start_update_xsave_msrs(XFEATURE_FOO);
> r = xsave_rdmsrl(state, MSR_IA32_FOO_1, &val)
> if (r)
> xsave_wrmsrl(state, MSR_IA32_FOO_2, FOO_ENABLE);
> end_update_xsave_msrs();

This looks OK. I'm not thrilled about it. The
start_update_xsave_msrs() can probably drop the "_msrs". Maybe:

start_xfeature_update(...);

Also, if you have to do the address lookup in xsave_rdmsrl() anyway, I
wonder if the 'xstate' should just be a full-fledged 'struct xregs_state'.

The other option would be to make a little on-stack structure like:

struct xsave_update {
        int feature;
        struct xregs_state *xregs;
};

Then you do:

struct xsave_update xsu;
...
start_update_xsave_msrs(&xsu, XFEATURE_FOO);

and then pass it along to each of the other operations:

r = xsave_rdmsrl(xsu, MSR_IA32_FOO_1, &val)

It's slightly less likely to get type-confused than a 'void *'.
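
To make that concrete, here is a minimal sketch of the handle-based
variant, extrapolated from the snippet above rather than taken from the
series (the init-state handling of the posted patch is omitted for
brevity):

struct xsave_update {
        int feature;                    /* XFEATURE_* number */
        struct xregs_state *xregs;      /* NULL: operate on live registers */
};

void start_xfeature_update(struct xsave_update *xsu, int xfeature_nr)
{
        fpregs_lock();
        fpregs_assert_state_consistent();

        xsu->feature = xfeature_nr;
        xsu->xregs = test_thread_flag(TIF_NEED_FPU_LOAD) ?
                     &current->thread.fpu.fpstate->regs.xsave : NULL;
}

int xsave_rdmsrl(struct xsave_update *xsu, u32 msr, u64 *val)
{
        u64 *member;

        if (!xsu->xregs)
                return rdmsrl_safe(msr, val);   /* state is live in registers */

        /* Otherwise locate the member inside the task's xsave buffer. */
        member = __get_xsave_member(get_xsave_addr(xsu->xregs, xsu->feature), msr);
        if (!member)
                return -ENODEV;

        *val = *member;
        return 0;
}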

> +static u64 *__get_xsave_member(void *xstate, u32 msr)
> +{
> + switch (msr) {
> + /* Currently there are no MSR's supported */
> + default:
> + WARN_ONCE(1, "x86/fpu: unsupported xstate msr (%u)\n", msr);
> + return NULL;
> + }
> +}

Just to get an idea what this is doing, it's OK to include the shadow
stack MSRs in here.

Are you sure this should return a u64*? We have lots of <=64-bit XSAVE
fields.
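
One way to accommodate the smaller fields is to hand back the member size
together with the pointer. A sketch using the shadow stack MSR mentioned
above (cet_user_state is the kernel's CET xstate layout; the size-out
parameter is a hypothetical addition, not from the series):

static void *__get_xsave_member(void *xstate, u32 msr, unsigned int *size)
{
        struct cet_user_state *cet = xstate;

        switch (msr) {
        case MSR_IA32_PL3_SSP:
                /* u64 here, but callers no longer have to assume that */
                *size = sizeof(cet->user_ssp);
                return &cet->user_ssp;
        default:
                WARN_ONCE(1, "x86/fpu: unsupported xstate msr (%u)\n", msr);
                return NULL;
        }
}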

> +/*
> + * Return a pointer to the xstate for the feature if it should be used, or NULL
> + * if the MSRs should be written to directly. To do this safely, using the
> + * associated read/write helpers is required.
> + */
> +void *start_update_xsave_msrs(int xfeature_nr)
> +{
> + void *xstate;
> +
> + /*
> + * fpregs_lock() only disables preemption (mostly). So modifing state

modifying ^

> + * in an interrupt could screw up some in progress fpregs operation,

^ in-progress

> + * but appear to work. Warn about it.
> + */
> + WARN_ON_ONCE(!in_task());
> + WARN_ON_ONCE(current->flags & PF_KTHREAD);

This might also be a good spot to check that xfeature_nr is in
fpstate.xfeatures.

> + fpregs_lock();
> +
> + fpregs_assert_state_consistent();
> +
> + /*
> + * If the registers don't need to be reloaded. Go ahead and operate on the
> + * registers.
> + */
> + if (!test_thread_flag(TIF_NEED_FPU_LOAD))
> + return NULL;
> +
> + xstate = get_xsave_addr(&current->thread.fpu.fpstate->regs.xsave, xfeature_nr);
> +
> + /*
> + * If regs are in the init state, they can't be retrieved from
> + * init_fpstate due to the init optimization, but are not nessarily

necessarily ^

Spell checker time. ":set spell" in vim works for me nicely.

> + * zero. The only option is to restore to make everything live and
> + * operate on registers. This will clear TIF_NEED_FPU_LOAD.
> + *
> + * Otherwise, if not in the init state but TIF_NEED_FPU_LOAD is set,
> + * operate on the buffer. The registers will be restored before going
> + * to userspace in any case, but the task might get preempted before
> + * then, so this possibly saves an xsave.
> + */
> + if (!xstate)
> + fpregs_restore_userregs();

Won't fpregs_restore_userregs() end up setting TIF_NEED_FPU_LOAD=0?
Isn't that a case where a "return NULL" is needed?

In any case, this makes me think this code should start out stupid and
slow. Keep the API as-is, but make the first patch unconditionally do
the WRMSR. Leave the "fast" buffer modifications for a follow-on patch.
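
A minimal sketch of that first "stupid and slow" version, assuming it is
always acceptable to force the registers live (function names follow the
posted API; the bodies are illustrative, not from the series):

void *start_update_xsave_msrs(int xfeature_nr)
{
        fpregs_lock();
        fpregs_assert_state_consistent();

        /* No buffer-vs-register decision yet: always make the state live. */
        if (test_thread_flag(TIF_NEED_FPU_LOAD))
                fpregs_restore_userregs();

        return NULL;    /* NULL: the read/write helpers go straight to MSRs */
}

int xsave_wrmsrl(void *xstate, u32 msr, u64 val)
{
        WARN_ON_ONCE(xstate);   /* the slow version never hands out a buffer */
        wrmsrl(msr, val);
        return 0;
}

void end_update_xsave_msrs(void)
{
        fpregs_unlock();
}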

2022-02-14 09:32:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 18/35] mm: Add guard pages around a shadow stack.

On 2/10/22 15:40, Edgecombe, Rick P wrote:
> On Thu, 2022-02-10 at 15:07 -0800, Andy Lutomirski wrote:
>> On Thu, Feb 10, 2022 at 2:44 PM Dave Hansen <[email protected]>
>> wrote:
>>>
>>> On 1/30/22 13:18, Rick Edgecombe wrote:
>>>> INCSSP(Q/D) increments shadow stack pointer and 'pops and
>>>> discards' the
>>>> first and the last elements in the range, effectively touches
>>>> those memory
>>>> areas.
>>>>
>>>> The maximum moving distance by INCSSPQ is 255 * 8 = 2040 bytes
>>>> and
>>>> 255 * 4 = 1020 bytes by INCSSPD. Both ranges are far from
>>>> PAGE_SIZE.
>>>> Thus, putting a gap page on both ends of a shadow stack prevents
>>>> INCSSP,
>>>> CALL, and RET from going beyond.
>>>
>>> What is the downside of not applying this patch? The shadow stack
>>> gap
>>> is 1MB instead of 4k?
>>>
>>> That, frankly, doesn't seem too bad. How badly do we *need* this
>>> patch?
>
> Like just using VM_SHADOW_STACK | VM_GROWSDOWN to get a regular stack
> sized gap? I think it could work. It also simplifies the mm->stack_vm
> accounting.

Seems not crazy. Do we want automatically growing shadow stacks? I
don't really like the historical unix behavior where the main thread has
a sort-of-infinite stack and every other thread has a fixed stack.

>
> It would no longer get a gap at the end though. I don't think it's
> needed.
>

I may have missed something about the oddball way the mm code works, but
it seems if you have a gap at one end of every shadow stack, you
automatically have a gap at the other end.


2022-02-14 18:45:35

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 26/35] x86/process: Change copy_thread() argument 'arg' to 'stack_size'

On Sun, Jan 30, 2022 at 10:22 PM Rick Edgecombe
<[email protected]> wrote:
>
> From: Yu-cheng Yu <[email protected]>
>
> The single call site of copy_thread() passes stack size in 'arg'. To make
> this clear and in preparation of using this argument for shadow stack
> allocation, change 'arg' to 'stack_size'. No functional changes.

Actually that name is misleading - the single caller copy_process() indeed does:

retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);

but the member "stack_size" of "struct kernel_clone_args" can actually
also be a pointer argument given to a kthread, see create_io_thread()
and kernel_thread():

pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
        struct kernel_clone_args args = {
                .flags          = ((lower_32_bits(flags) | CLONE_VM |
                                    CLONE_UNTRACED) & ~CSIGNAL),
                .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
                .stack          = (unsigned long)fn,
                .stack_size     = (unsigned long)arg,
        };

        return kernel_clone(&args);
}

And then in copy_thread(), we have:

kthread_frame_init(frame, sp, arg)


So I'm not sure whether this name change really makes sense, or
whether it just adds to the confusion.

2022-02-14 21:21:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 25/35] x86/cet/shstk: Add user-mode shadow stack support

On 2/11/22 16:07, Andy Lutomirski wrote:
> On Fri, Feb 11, 2022, at 3:37 PM, Dave Hansen wrote:
>> On 1/30/22 13:18, Rick Edgecombe wrote:
>>> Add the user shadow stack MSRs to the xsave helpers, so they can be used
>>> to implement the functionality.
>> Do these MSRs ever affect kernel-mode operation?
>>
>> If so, we might need to switch them more aggressively at context-switch
>> time like PKRU.
>>
>> If not, they can continue to be context-switched with the PASID state
>> which does not affect kernel-mode operation.
> PASID? PASID is all kinds of weird. I assume you mean switching it
> with all the normal state.

I was grouping PASID along with the CET MSRs because they're the only
supervisor state. But, yeah, it's all XRSTOR'd at the same spot right
now, user or kernel.

2022-02-15 02:21:44

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 26/35] x86/process: Change copy_thread() argument 'arg' to 'stack_size'

On Mon, 2022-02-14 at 13:33 +0100, Jann Horn wrote:
> On Sun, Jan 30, 2022 at 10:22 PM Rick Edgecombe
> <[email protected]> wrote:
> >
> > From: Yu-cheng Yu <[email protected]>
> >
> > The single call site of copy_thread() passes stack size in
> > 'arg'. To make
> > this clear and in preparation of using this argument for shadow
> > stack
> > allocation, change 'arg' to 'stack_size'. No functional changes.
>
> Actually that name is misleading - the single caller copy_process()
> indeed does:
>
> retval = copy_thread(clone_flags, args->stack, args->stack_size, p,
> args->tls);
>
> but the member "stack_size" of "struct kernel_clone_args" can
> actually
> also be a pointer argument given to a kthread, see create_io_thread()
> and kernel_thread():
>
> pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long
> flags)
> {
> struct kernel_clone_args args = {
> .flags = ((lower_32_bits(flags) | CLONE_VM |
> CLONE_UNTRACED) & ~CSIGNAL),
> .exit_signal = (lower_32_bits(flags) & CSIGNAL),
> .stack = (unsigned long)fn,
> .stack_size = (unsigned long)arg,
> };
>
> return kernel_clone(&args);
> }
>
> And then in copy_thread(), we have:
>
> kthread_frame_init(frame, sp, arg)
>
>
> So I'm not sure whether this name change really makes sense, or
> whether it just adds to the confusion.

Thanks Jann. Yea I guess this makes it worse.

Reading a bit of the history, it seems there used to be unwieldy
argument lists which were replaced by a big "struct kernel_clone_args"
to be passed around. The re-use of the stack_size argument is from
before there was the struct. And then the struct just inherited the
confusion when it was introduced.

So if a separate *data member was added to kernel_clone_args for
kernel_thread() and create_io_thread() to use, they wouldn't need to
re-use stack_size. And copy_thread() could just take a struct
kernel_clone_args pointer. It might make it more clear.

But copy_thread() has a ton of arch specific implementations. So they
would all need to be updated to do this.

Just playing around with it, I did this which only makes the change on
x86. Duplicating it for 21 copy_thread()s seems perfectly doable, but
I'm not sure how much people care vs renaming arg to
stacksize_or_data...

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 11bf09b60f9d..cfbba5b14609 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -132,9 +132,8 @@ static int set_new_tls(struct task_struct *p, unsigned long tls)
         return do_set_thread_area_64(p, ARCH_SET_FS, tls);
 }
 
-int copy_thread(unsigned long clone_flags, unsigned long sp,
-                unsigned long stack_size, struct task_struct *p,
-                unsigned long tls)
+int copy_thread(unsigned long clone_flags, struct kernel_clone_args *args,
+                struct task_struct *p)
 {
         struct inactive_task_frame *frame;
         struct fork_frame *fork_frame;
@@ -178,7 +177,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
         if (unlikely(p->flags & PF_KTHREAD)) {
                 p->thread.pkru = pkru_get_init_value();
                 memset(childregs, 0, sizeof(struct pt_regs));
-                kthread_frame_init(frame, sp, stack_size);
+                kthread_frame_init(frame, args->stack, (unsigned long)args->data);
                 return 0;
         }
 
@@ -191,8 +190,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
         frame->bx = 0;
         *childregs = *current_pt_regs();
         childregs->ax = 0;
-        if (sp)
-                childregs->sp = sp;
+        if (args->stack)
+                childregs->sp = args->stack;
 
 #ifdef CONFIG_X86_32
         task_user_gs(p) = get_user_gs(current_pt_regs());
@@ -211,17 +210,17 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
                  */
                 childregs->sp = 0;
                 childregs->ip = 0;
-                kthread_frame_init(frame, sp, stack_size);
+                kthread_frame_init(frame, args->stack, (unsigned long)args->data);
                 return 0;
         }
 
         /* Set a new TLS for the child thread? */
         if (clone_flags & CLONE_SETTLS)
-                ret = set_new_tls(p, tls);
+                ret = set_new_tls(p, args->tls);
 
         /* Allocate a new shadow stack for pthread */
         if (!ret)
-                ret = shstk_alloc_thread_stack(p, clone_flags, stack_size);
+                ret = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
 
         if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
                 io_bitmap_share(p);
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index b9198a1b3a84..f138b23aee50 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -34,6 +34,7 @@ struct kernel_clone_args {
         int io_thread;
         struct cgroup *cgrp;
         struct css_set *cset;
+        void *data;
 };
 
 /*
@@ -67,8 +68,8 @@ extern void fork_init(void);
 
 extern void release_task(struct task_struct * p);
 
-extern int copy_thread(unsigned long, unsigned long, unsigned long,
-                       struct task_struct *, unsigned long);
+extern int copy_thread(unsigned long clone_flags, struct kernel_clone_args *args,
+                       struct task_struct *p);
 
 extern void flush_thread(void);
 
@@ -85,10 +86,10 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
+struct task_struct *create_io_thread(int (*fn)(void *), void *data, int node);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
-extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
+extern pid_t kernel_thread(int (*fn)(void *), void *data, unsigned long flags);
 extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);
 int kernel_wait(pid_t pid, int *stat);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..8af202e5651e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2170,7 +2170,7 @@ static __latent_entropy struct task_struct *copy_process(
         retval = copy_io(clone_flags, p);
         if (retval)
                 goto bad_fork_cleanup_namespaces;
-        retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
+        retval = copy_thread(clone_flags, args, p);
         if (retval)
                 goto bad_fork_cleanup_io;
 
@@ -2487,7 +2487,7 @@ struct mm_struct *copy_init_mm(void)
  * The returned task is inactive, and the caller must fire it up through
  * wake_up_new_task(p). All signals are blocked in the created task.
  */
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
+struct task_struct *create_io_thread(int (*fn)(void *), void *data, int node)
 {
         unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
                               CLONE_IO;
@@ -2496,7 +2496,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
                                     CLONE_UNTRACED) & ~CSIGNAL),
                 .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
                 .stack          = (unsigned long)fn,
-                .stack_size     = (unsigned long)arg,
+                .data           = data,
                 .io_thread      = 1,
         };
 
@@ -2594,14 +2594,14 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 /*
  * Create a kernel thread.
  */
-pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
+pid_t kernel_thread(int (*fn)(void *), void *data, unsigned long flags)
 {
         struct kernel_clone_args args = {
                 .flags          = ((lower_32_bits(flags) | CLONE_VM |
                                     CLONE_UNTRACED) & ~CSIGNAL),
                 .exit_signal    = (lower_32_bits(flags) & CSIGNAL),
                 .stack          = (unsigned long)fn,
-                .stack_size     = (unsigned long)arg,
+                .data           = data,
         };
 
         return kernel_clone(&args);


2022-02-15 08:54:59

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 26/35] x86/process: Change copy_thread() argument 'arg' to 'stack_size'

On Tue, Feb 15, 2022 at 01:22:04AM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-02-14 at 13:33 +0100, Jann Horn wrote:
> > On Sun, Jan 30, 2022 at 10:22 PM Rick Edgecombe
> > <[email protected]> wrote:
> > >
> > > From: Yu-cheng Yu <[email protected]>
> > >
> > > The single call site of copy_thread() passes stack size in
> > > 'arg'. To make
> > > this clear and in preparation of using this argument for shadow
> > > stack
> > > allocation, change 'arg' to 'stack_size'. No functional changes.
> >
> > Actually that name is misleading - the single caller copy_process()
> > indeed does:
> >
> > retval = copy_thread(clone_flags, args->stack, args->stack_size, p,
> > args->tls);
> >
> > but the member "stack_size" of "struct kernel_clone_args" can
> > actually
> > also be a pointer argument given to a kthread, see create_io_thread()
> > and kernel_thread():
> >
> > pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long
> > flags)
> > {
> > struct kernel_clone_args args = {
> > .flags = ((lower_32_bits(flags) | CLONE_VM |
> > CLONE_UNTRACED) & ~CSIGNAL),
> > .exit_signal = (lower_32_bits(flags) & CSIGNAL),
> > .stack = (unsigned long)fn,
> > .stack_size = (unsigned long)arg,
> > };
> >
> > return kernel_clone(&args);
> > }
> >
> > And then in copy_thread(), we have:
> >
> > kthread_frame_init(frame, sp, arg)
> >
> >
> > So I'm not sure whether this name change really makes sense, or
> > whether it just adds to the confusion.
>
> Thanks Jann. Yea I guess this makes it worse.
>
> Reading a bit of the history, it seems there used to be unwieldy
> argument lists which were replaced by a big "struct kernel_clone_args"
> to be passed around. The re-use of the stack_size argument is from
> before there was the struct. And then the struct just inherited the
> confusion when it was introduced.
>
> So if a separate *data member was added to kernel_clone_args for
> kernel_thread() and create_io_thread() to use, they wouldn't need to
> re-use stack_size. And copy_thread() could just take a struct
> kernel_clone_args pointer. It might make it more clear.

I'm honestly not sure it makes things that much better but I don't feel
strongly about it either.

>
> But copy_thread() has a ton of arch specific implementations. So they
> would all need to be updated to do this.

When struct kernel_clone_args was introduced I also removed the
copy_thread_tls() and copy_thread() split. So now we're only left with
copy_thread(). That already allowed us to get rid of some arch-specific
code. I didn't go further in trying to unify more arch-specific code. It
might be worth it but I didn't come to a clear conclusion.


2022-02-28 20:30:51

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> >
> > Still wrapping my head around the CRIU save and restore steps, but
> > another general approach might be to give ptrace the ability to
> > temporarily pause/resume/set CET enablement and SSP for a stopped
> > thread. Then injected code doesn't need to jump through any hoops or
> > possibly run into road blocks. I'm not sure how much this opens things
> > up if the thread has to be stopped...
>
> Hmm, that's maybe not insane.
>
> An alternative would be to add a bona fide ptrace call-a-function mechanism.
> I can think of two potentially usable variants:
>
> 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
> shadow stack push and all.
>
> 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
> frame just like a real signal is being delivered with the specified handler.
> There could be a variant to opt-in to also using a specified altstack and
> altshadowstack.

Using ptrace() will not solve CRIU's issue with sigreturn because sigreturn
is called from the victim context rather than from the criu process that
controls the dump and uses ptrace().

Even with the current shadow stack interface Rick proposed, CRIU can restore
the victim using ptrace without any additional knobs, but we lose an
important ability to "self-cure" the victim from the parasite in case
anything goes wrong with the criu control process.

Moreover, the issue with backward compatibility is not with ptrace but with
sigreturn and it seems that criu is not its only user.

So I think we need a way to allow direct calls to sigreturn that will
bypass the check and restore of the shadow stack.

I only know that there are sigreturn users other than criu that show up in
Debian codesearch, and I don't know how they use it, but for full
backward compatibility we'd need to have no-CET sigreturn as the default and
add a new flag, say UC_CHECK_SHSTK, to rt_sigframe->uc.uc_flags, or even a
new syscall for libc signal handling.
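
A rough sketch of that idea (UC_CHECK_SHSTK and its bit value are
hypothetical; the existing x86 uc_flags bits are UC_FP_XSTATE,
UC_SIGCONTEXT_SS and UC_STRICT_RESTORE_SS, so the next free bit is used):

#define UC_CHECK_SHSTK	0x8	/* hypothetical: frame carries a shstk token */

/*
 * setup_rt_frame() would set UC_CHECK_SHSTK in the uc_flags of frames it
 * builds itself, so sigreturn only verifies the shadow stack for
 * kernel-built frames and hand-built legacy frames keep working:
 */
static int maybe_restore_shstk(unsigned long uc_flags)
{
	if (!(uc_flags & UC_CHECK_SHSTK))
		return 0;	/* legacy sigreturn caller: skip the check */

	return restore_signal_shadow_stack();
}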

> 2 would be more expensive but would avoid the need for much in the way of
> asm magic. The injected code could be plain C (or Rust or Zig or whatever).
>
> All of this only really handles save, not restore. I don't understand
> restore enough to fully understand the issue.

Restore is more complex, will get to it later.

> --Andy

--
Sincerely yours,
Mike.

2022-02-28 20:32:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace



On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
>> On 2/8/22 18:18, Edgecombe, Rick P wrote:
>> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
>> >
>> > Still wrapping my head around the CRIU save and restore steps, but
>> > another general approach might be to give ptrace the ability to
>> > temporarily pause/resume/set CET enablement and SSP for a stopped
>> > thread. Then injected code doesn't need to jump through any hoops or
>> > possibly run into road blocks. I'm not sure how much this opens things
>> > up if the thread has to be stopped...
>>
>> Hmm, that's maybe not insane.
>>
>> An alternative would be to add a bona fide ptrace call-a-function mechanism.
>> I can think of two potentially usable variants:
>>
>> 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr,
>> shadow stack push and all.
>>
>> 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal
>> frame just like a real signal is being delivered with the specified handler.
>> There could be a variant to opt-in to also using a specified altstack and
>> altshadowstack.
>
> Using ptrace() will not solve CRIU's issue with sigreturn because sigreturn
> is called from the victim context rather than from the criu process that
> controls the dump and uses ptrace().

I'm not sure I follow.

>
> Even with the current shadow stack interface Rick proposed, CRIU can restore
> the victim using ptrace without any additional knobs, but we loose an
> important ability to "self-cure" the victim from the parasite in case
> anything goes wrong with criu control process.
>
> Moreover, the issue with backward compatibility is not with ptrace but with
> sigreturn and it seems that criu is not its only user.

So we need an ability for a tracer to cause the tracee to call a function and to return successfully. Apparently a gdb branch can already do this with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the trick. I don't see why we need a sigreturn-but-don't-verify -- we just need this mechanism to create a frame such that sigreturn actually works.
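
For concreteness, the tracer-side flow could be as simple as the following; the request and its number are purely hypothetical:

#include <sys/ptrace.h>
#include <sys/types.h>

#define PTRACE_CALL_FUNCTION_SIGFRAME	0x4300	/* hypothetical request */

/*
 * Ask the kernel to build a real signal frame in the stopped tracee and
 * point it at func; func finishes with a plain sigreturn, which restores
 * the saved state, shadow stack included.
 */
static long call_in_tracee(pid_t pid, unsigned long func)
{
	if (ptrace((enum __ptrace_request)PTRACE_CALL_FUNCTION_SIGFRAME,
		   pid, 0, (void *)func))
		return -1;

	return ptrace(PTRACE_CONT, pid, 0, 0);
}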

--Andy

2022-02-28 21:40:04

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
>
>
> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> >> >
> >
> > Even with the current shadow stack interface Rick proposed, CRIU can restore
> > the victim using ptrace without any additional knobs, but we loose an
> > important ability to "self-cure" the victim from the parasite in case
> > anything goes wrong with criu control process.
> >
> > Moreover, the issue with backward compatibility is not with ptrace but with
> > sigreturn and it seems that criu is not its only user.
>
> So we need an ability for a tracer to cause the tracee to call a function
> and to return successfully. Apparently a gdb branch can already do this
> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
> trick. I don't see why we need a sigretur-but-dont-verify -- we just
> need this mechanism to create a frame such that sigreturn actually works.

If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
into the tracee and makes the tracee call sigreturn.
I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
PTRACE_SYSCALL.

In that case this defeats the purpose of sigreturn in CRIU, because it is
called asynchronously by the tracee when the tracer is about to detach or
has even already detached.

For the synchronous use case PTRACE_SETREGSET will be enough; the rest of the
sigframe can be restored by other means.

And with 'criu restore' there may be even no tracer by the time sigreturn
is called.

> --Andy

--
Sincerely yours,
Mike.

2022-02-28 23:27:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace



On Mon, Feb 28, 2022, at 1:30 PM, Mike Rapoport wrote:
> On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
>>
>>
>> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
>> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
>> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
>> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
>> >> >
>> >
>> > Even with the current shadow stack interface Rick proposed, CRIU can restore
>> > the victim using ptrace without any additional knobs, but we loose an
>> > important ability to "self-cure" the victim from the parasite in case
>> > anything goes wrong with criu control process.
>> >
>> > Moreover, the issue with backward compatibility is not with ptrace but with
>> > sigreturn and it seems that criu is not its only user.
>>
>> So we need an ability for a tracer to cause the tracee to call a function
>> and to return successfully. Apparently a gdb branch can already do this
>> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
>> trick. I don't see why we need a sigretur-but-dont-verify -- we just
>> need this mechanism to create a frame such that sigreturn actually works.
>
> If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
> into the tracee and makes the tracee call sigreturn.
> I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
> PTRACE_SYSCALL.
>
> In such case this defeats the purpose of sigreturn in CRIU because it is
> called asynchronously by the tracee when the tracer is about to detach or
> even already detached.

The intent of PTRACE_CALL_FUNCTION_SIGFRAME is to push a signal frame onto the stack and call a function. That function should then be able to call sigreturn just like any normal signal handler. There should be no requirement that the tracer still be attached when this happens, although the code calling sigreturn still needs to be mapped.

(Specifically, on modern arches, the user runtime is expected to provide a "restorer" that calls sigreturn. A hypothetical PTRACE_CALL_FUNCTION_SIGFRAME would either need to call sigreturn directly or provide a restorer.)
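
For illustration, an x86-64 restorer is nothing more than a stub that
enters rt_sigreturn; glibc registers its __restore_rt via SA_RESTORER. A
minimal hand-rolled equivalent:

/* x86-64 only: a bare restorer that immediately performs rt_sigreturn. */
void my_restorer(void);
asm(".globl my_restorer\n"
    "my_restorer:\n"
    "	movq $15, %rax\n"	/* 15 == __NR_rt_sigreturn on x86-64 */
    "	syscall\n");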

2022-03-04 02:48:09

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
> > > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
> > > frame onto
> > > the stack and call a function. That function should then be able
> > > to call
> > > sigreturn just like any normal signal handler.
> >
> > Ok, let me reiterate.
> >
> > We have a seized and stopped tracee, use
> > PTRACE_CALL_FUNCTION_SIGFRAME
> > to push a signal frame onto the tracee's stack so that sigreturn
> > could use
> > that frame, then set the tracee %rip to the function we'd like to
> > call and
> > then we PTRACE_CONT the tracee. Tracee continues to execute the
> > parasite
> > code that calls sigreturn to clean up and restore the tracee
> > process.
> >
> > PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
> > shadow
> > stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
> > won't
> > bail out at restore_signal_shadow_stack().
>
> That is the intent.
>
> >
> > The only thing that CRIU actually needs is to push a restore token
> > to the
> > shadow stack, so for us a ptrace call that does that would be
> > ideal.
> >
>
> That seems fine too. The main benefit of the SIGFRAME approach is
> that, AIUI, CRIU eventually constructs a signal frame anyway, and
> getting one ready-made seems plausibly helpful. But if it's not
> actually that useful, then there's no need to do it.

I guess pushing a token to the shadow stack could be done like GDB does
calls, with just the basic CET ptrace support. So do we even need a
specific push token operation?

I suppose if CRIU already used some kernel encapsulation of a seized
call/return operation it would have been easier to make CRIU work with
the introduction of CET. But the design of CRIU seems to be to have the
kernel expose just enough and then tie it all together in userspace.

Andy, did you have any other usages for PTRACE_CALL_FUNCTION in mind? I
couldn't find any other CRIU-like users of sigreturn in the debian
source search (but didn't read all 819 pages that come up with
"sigreturn"). It seemed to be mostly seccomp sandbox references.

2022-03-04 05:10:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Thu, Mar 3, 2022 at 11:43 AM Mike Rapoport <[email protected]> wrote:
>
> On Mon, Feb 28, 2022 at 02:55:30PM -0800, Andy Lutomirski wrote:
> >
> >
> > On Mon, Feb 28, 2022, at 1:30 PM, Mike Rapoport wrote:
> > > On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
> > >>
> > >>
> > >> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> > >> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> > >> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> > >> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> > >> >> >
> > >> >
> > >> > Even with the current shadow stack interface Rick proposed, CRIU can restore
> > >> > the victim using ptrace without any additional knobs, but we loose an
> > >> > important ability to "self-cure" the victim from the parasite in case
> > >> > anything goes wrong with criu control process.
> > >> >
> > >> > Moreover, the issue with backward compatibility is not with ptrace but with
> > >> > sigreturn and it seems that criu is not its only user.
> > >>
> > >> So we need an ability for a tracer to cause the tracee to call a function
> > >> and to return successfully. Apparently a gdb branch can already do this
> > >> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
> > >> trick. I don't see why we need a sigretur-but-dont-verify -- we just
> > >> need this mechanism to create a frame such that sigreturn actually works.
> > >
> > > If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
> > > into the tracee and makes the tracee call sigreturn.
> > > I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
> > > PTRACE_SYSCALL.
> > >
> > > In such case this defeats the purpose of sigreturn in CRIU because it is
> > > called asynchronously by the tracee when the tracer is about to detach or
> > > even already detached.
> >
> > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal frame onto
> > the stack and call a function. That function should then be able to call
> > sigreturn just like any normal signal handler.
>
> Ok, let me reiterate.
>
> We have a seized and stopped tracee, use PTRACE_CALL_FUNCTION_SIGFRAME
> to push a signal frame onto the tracee's stack so that sigreturn could use
> that frame, then set the tracee %rip to the function we'd like to call and
> then we PTRACE_CONT the tracee. Tracee continues to execute the parasite
> code that calls sigreturn to clean up and restore the tracee process.
>
> PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the shadow
> stack, just like setup_rt_frame() does, so that sys_rt_sigreturn() won't
> bail out at restore_signal_shadow_stack().

That is the intent.

>
> The only thing that CRIU actually needs is to push a restore token to the
> shadow stack, so for us a ptrace call that does that would be ideal.
>

That seems fine too. The main benefit of the SIGFRAME approach is
that, AIUI, CRIU eventually constructs a signal frame anyway, and
getting one ready-made seems plausibly helpful. But if it's not
actually that useful, then there's no need to do it.

2022-03-04 07:09:01

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Mon, Feb 28, 2022 at 02:55:30PM -0800, Andy Lutomirski wrote:
>
>
> On Mon, Feb 28, 2022, at 1:30 PM, Mike Rapoport wrote:
> > On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote:
> >>
> >>
> >> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote:
> >> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote:
> >> >> On 2/8/22 18:18, Edgecombe, Rick P wrote:
> >> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote:
> >> >> >
> >> >
> >> > Even with the current shadow stack interface Rick proposed, CRIU can restore
> >> > the victim using ptrace without any additional knobs, but we loose an
> >> > important ability to "self-cure" the victim from the parasite in case
> >> > anything goes wrong with criu control process.
> >> >
> >> > Moreover, the issue with backward compatibility is not with ptrace but with
> >> > sigreturn and it seems that criu is not its only user.
> >>
> >> So we need an ability for a tracer to cause the tracee to call a function
> >> and to return successfully. Apparently a gdb branch can already do this
> >> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the
> >> trick. I don't see why we need a sigretur-but-dont-verify -- we just
> >> need this mechanism to create a frame such that sigreturn actually works.
> >
> > If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame
> > into the tracee and makes the tracee call sigreturn.
> > I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or
> > PTRACE_SYSCALL.
> >
> > In such case this defeats the purpose of sigreturn in CRIU because it is
> > called asynchronously by the tracee when the tracer is about to detach or
> > even already detached.
>
> The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal frame onto
> the stack and call a function. That function should then be able to call
> sigreturn just like any normal signal handler.

Ok, let me reiterate.

We have a seized and stopped tracee, use PTRACE_CALL_FUNCTION_SIGFRAME
to push a signal frame onto the tracee's stack so that sigreturn could use
that frame, then set the tracee %rip to the function we'd like to call and
then we PTRACE_CONT the tracee. Tracee continues to execute the parasite
code that calls sigreturn to clean up and restore the tracee process.

PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the shadow
stack, just like setup_rt_frame() does, so that sys_rt_sigreturn() won't
bail out at restore_signal_shadow_stack().

The only thing that CRIU actually needs is to push a restore token to the
shadow stack, so for us a ptrace call that does that would be ideal.

--
Sincerely yours,
Mike.

2022-03-04 20:56:04

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On 3/3/22 17:30, Edgecombe, Rick P wrote:
> On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
>>>> The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
>>>> frame onto
>>>> the stack and call a function. That function should then be able
>>>> to call
>>>> sigreturn just like any normal signal handler.
>>>
>>> Ok, let me reiterate.
>>>
>>> We have a seized and stopped tracee, use
>>> PTRACE_CALL_FUNCTION_SIGFRAME
>>> to push a signal frame onto the tracee's stack so that sigreturn
>>> could use
>>> that frame, then set the tracee %rip to the function we'd like to
>>> call and
>>> then we PTRACE_CONT the tracee. Tracee continues to execute the
>>> parasite
>>> code that calls sigreturn to clean up and restore the tracee
>>> process.
>>>
>>> PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
>>> shadow
>>> stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
>>> won't
>>> bail out at restore_signal_shadow_stack().
>>
>> That is the intent.
>>
>>>
>>> The only thing that CRIU actually needs is to push a restore token
>>> to the
>>> shadow stack, so for us a ptrace call that does that would be
>>> ideal.
>>>
>>
>> That seems fine too. The main benefit of the SIGFRAME approach is
>> that, AIUI, CRIU eventually constructs a signal frame anyway, and
>> getting one ready-made seems plausibly helpful. But if it's not
>> actually that useful, then there's no need to do it.
>
> I guess pushing a token to the shadow stack could be done like GDB does
> calls, with just the basic CET ptrace support. So do we even need a
> specific push token operation?
>
> I suppose if CRIU already used some kernel encapsulation of a seized
> call/return operation it would have been easier to make CRIU work with
> the introduction of CET. But the design of CRIU seems to be to have the
> kernel expose just enough and then tie it all together in userspace.
>
> Andy, did you have any other usages for PTRACE_CALL_FUNCTION in mind? I
> couldn't find any other CRIU-like users of sigreturn in the debian
> source search (but didn't read all 819 pages that come up with
> "sigreturn"). It seemed to be mostly seccomp sandbox references.

I don't see a benefit compelling enough to justify the added complexity,
given that existing mechanisms can do it.

The sigframe thing, OTOH, seems genuinely useful if CRIU would actually
use it to save the full register state. Generating a signal frame from
scratch is a pain. That being said, if CRIU isn't excited, then don't
bother.

2022-03-07 21:43:18

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Mon, Mar 7, 2022 at 10:57 AM Mike Rapoport <[email protected]> wrote:
>
> On Fri, Mar 04, 2022 at 11:13:19AM -0800, Andy Lutomirski wrote:
> > On 3/3/22 17:30, Edgecombe, Rick P wrote:
> > > On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
> > > > > > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
> > > > > > frame onto
> > > > > > the stack and call a function. That function should then be able
> > > > > > to call
> > > > > > sigreturn just like any normal signal handler.
> > > > >
> > > > > Ok, let me reiterate.
> > > > >
> > > > > We have a seized and stopped tracee, use
> > > > > PTRACE_CALL_FUNCTION_SIGFRAME
> > > > > to push a signal frame onto the tracee's stack so that sigreturn
> > > > > could use
> > > > > that frame, then set the tracee %rip to the function we'd like to
> > > > > call and
> > > > > then we PTRACE_CONT the tracee. Tracee continues to execute the
> > > > > parasite
> > > > > code that calls sigreturn to clean up and restore the tracee
> > > > > process.
> > > > >
> > > > > PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
> > > > > shadow
> > > > > stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
> > > > > won't
> > > > > bail out at restore_signal_shadow_stack().
> > > >
> > > > That is the intent.
> > > >
> > > > >
> > > > > The only thing that CRIU actually needs is to push a restore token
> > > > > to the
> > > > > shadow stack, so for us a ptrace call that does that would be
> > > > > ideal.
> > > > >
> > > >
> > > > That seems fine too. The main benefit of the SIGFRAME approach is
> > > > that, AIUI, CRIU eventually constructs a signal frame anyway, and
> > > > getting one ready-made seems plausibly helpful. But if it's not
> > > > actually that useful, then there's no need to do it.
> > >
> > > I guess pushing a token to the shadow stack could be done like GDB does
> > > calls, with just the basic CET ptrace support. So do we even need a
> > > specific push token operation?
>
> I've tried to follow gdb CET push implementation, but got lost.
> What is "basic CET ptrace support"? I don't see any ptrace changes in this
> series.

Here is the CET ptrace patch on CET 5.16 kernel branch:

https://github.com/hjl-tools/linux/commit/3a43ec29ddac56f87807161b5aeafa80f632363d
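
For reference, tracer-side access through that regset might look roughly
like this; NT_X86_CET's value and the two-u64 layout are assumed from the
out-of-tree patch above, not from any mainline ABI:

#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

#define NT_X86_CET 0x203	/* note type used by the out-of-tree patch */

struct cet_regs {
	uint64_t cet;	/* MSR_IA32_U_CET control bits */
	uint64_t ssp;	/* shadow stack pointer */
};

/* Read the tracee's SSP, e.g. to find where a restore token would go. */
static long read_ssp(pid_t pid, uint64_t *ssp)
{
	struct cet_regs regs;
	struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };
	long ret = ptrace(PTRACE_GETREGSET, pid, (void *)NT_X86_CET, &iov);

	if (!ret)
		*ssp = regs.ssp;

	return ret;
}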


--
H.J.

2022-03-07 23:32:15

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Fri, Mar 04, 2022 at 11:13:19AM -0800, Andy Lutomirski wrote:
> On 3/3/22 17:30, Edgecombe, Rick P wrote:
> > On Thu, 2022-03-03 at 15:00 -0800, Andy Lutomirski wrote:
> > > > > The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal
> > > > > frame onto
> > > > > the stack and call a function. That function should then be able
> > > > > to call
> > > > > sigreturn just like any normal signal handler.
> > > >
> > > > Ok, let me reiterate.
> > > >
> > > > We have a seized and stopped tracee, use
> > > > PTRACE_CALL_FUNCTION_SIGFRAME
> > > > to push a signal frame onto the tracee's stack so that sigreturn
> > > > could use
> > > > that frame, then set the tracee %rip to the function we'd like to
> > > > call and
> > > > then we PTRACE_CONT the tracee. Tracee continues to execute the
> > > > parasite
> > > > code that calls sigreturn to clean up and restore the tracee
> > > > process.
> > > >
> > > > PTRACE_CALL_FUNCTION_SIGFRAME also pushes a restore token to the
> > > > shadow
> > > > stack, just like setup_rt_frame() does, so that sys_rt_sigreturn()
> > > > won't
> > > > bail out at restore_signal_shadow_stack().
> > >
> > > That is the intent.
> > >
> > > >
> > > > The only thing that CRIU actually needs is to push a restore token
> > > > to the
> > > > shadow stack, so for us a ptrace call that does that would be
> > > > ideal.
> > > >
> > >
> > > That seems fine too. The main benefit of the SIGFRAME approach is
> > > that, AIUI, CRIU eventually constructs a signal frame anyway, and
> > > getting one ready-made seems plausibly helpful. But if it's not
> > > actually that useful, then there's no need to do it.
> >
> > I guess pushing a token to the shadow stack could be done like GDB does
> > calls, with just the basic CET ptrace support. So do we even need a
> > specific push token operation?

I've tried to follow gdb CET push implementation, but got lost.
What is "basic CET ptrace support"? I don't see any ptrace changes in this
series.

> > I suppose if CRIU already used some kernel encapsulation of a seized
> > call/return operation it would have been easier to make CRIU work with
> > the introduction of CET. But the design of CRIU seems to be to have the
> > kernel expose just enough and then tie it all together in userspace.
> >
> > Andy, did you have any other usages for PTRACE_CALL_FUNCTION in mind? I
> > couldn't find any other CRIU-like users of sigreturn in the debian
> > source search (but didn't read all 819 pages that come up with
> > "sigreturn"). It seemed to be mostly seccomp sandbox references.
>
> I don't see a benefit compelling enough to justify the added complexity,
> given that existing mechanisms can do it.
>
> The sigframe thing, OTOH, seems genuinely useful if CRIU would actually use
> it to save the full register state. Generating a signal frame from scratch
> is a pain. That being said, if CRIU isn't excited, then don't bother.

CRIU is excited :)

I was just looking for the minimal possible interface that will allow us to
call sigreturn. Rick is right, and CRIU does try to expose as little as
possible and handle the pain in userspace.

The SIGFRAME approach is indeed very helpful, especially if we can make it
work on other architectures eventually.

--
Sincerely yours,
Mike.

2022-03-08 06:17:12

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 00/35] Shadow stacks for userspace

From: Mike Rapoport
> Sent: 07 March 2022 18:57
...
> > The sigframe thing, OTOH, seems genuinely useful if CRIU would actually use
> > it to save the full register state. Generating a signal frame from scratch
> > is a pain. That being said, if CRIU isn't excited, then don't bother.
>
> CRIU is excited :)
>
> I just was looking for the minimal possible interface that will allow us to
> call sigreturn. Rick is right and CRIU does try to expose as little as
> possible and handle the pain in the userspace.
>
> The SIGFRAME approach is indeed very helpful, especially if we can make it
> work on other architectures eventually.

I thought the full sigframe layout depends very much on what the kernel
decides it needs to save?
Some parts are exposed to the signal handler, but there are large
blocks of data that XSAVE (etc) save that have to be put onto the
signal stack.
Is it even vaguely feasible to replicate what a specific kernel
generates on specific hardware in a userspace library?
The size of this data is getting bigger and bigger - causing
issues with the SIGALTSTACK (and even thread stack) minimum sizes.
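
As an aside on the sizing point: the kernel already reports the
hardware-dependent minimum through the auxiliary vector, so userspace can
size sigaltstack without replicating the XSAVE layout. A small runnable
example:

#include <signal.h>
#include <stdio.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51
#endif

int main(void)
{
	unsigned long min = getauxval(AT_MINSIGSTKSZ);

	/* Zero means the running kernel predates the aux vector entry. */
	if (!min)
		min = MINSIGSTKSZ;

	printf("minimum sigaltstack size on this CPU: %lu bytes\n", min);
	return 0;
}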

David


2022-06-01 18:43:52

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Wed, 2022-06-01 at 11:06 +0300, Mike Rapoport wrote:
> > Yea, having something working is really great. My only hesitancy is
> > that, per a discussion on the LAM patchset, we are going to make
> > this
> > enabling API CET only (same semantics for though). I suppose the
> > locking API arch_prctl() could still be support other arch
> > features,
> > but it might be a second CET only regset. It's not the end of the
> > world.
>
> The support for CET in criu is anyway experimental for now, if the
> kernel
> API will be slightly different in the end, we'll update criu.
> The important things are the ability to control tracee shadow stack
> from ptrace, the ability to map the shadow stack at fixed address and
> the
> ability to control the features at least from ptrace.
> As long as we have APIs that provide those, it should be Ok.
>
> > I guess the other consideration is tieing CRIU to glibc
> > peculiarities.
> > Like even if we fix glibc, then CRIU may not work with some other
> > libc
> > or app that force disables for some weird reason. Is it supposed to
> > be
> > libc-agnostic?
>
> Actually using the ptrace to control the CET features does not tie
> criu to
> glibc. The current proposal for the arch_prctl() allows libc to lock
> CET
> features and having a ptrace call to control the lock makes criu
> agnostic
> to libc behaviour.

From staring at the glibc code, I'm suspicious something was weird with
your test setup, as I don't think it should be locking. But I guess to
be completely proper you would need to save and restore the lock state
anyway. So, ok yea, on balance probably better to have an extra
interface.

Should we make it a GET/SET interface?

2022-06-01 18:57:44

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, 2022-05-31 at 19:36 +0300, Mike Rapoport wrote:
> > WRSS is a feature where you would usually want to lock it as
> > disabled,
> > but WRSS cannot be enabled if shadow stack is not enabled. Locking
> > shadow stack and WRSS off together doesn't have any security
> > benefits
> > in theory. so I'm thinking glibc doesn't need to do this. The
> > kernel
> > could even refuse to lock WRSS without shadow stack being enabled.
> > Could we avoid the extra ptrace functionality then?
>
> What I see for is that a program can support shadow stack, glibc
> enables
> shadow stack, does not enable WRSS and than calls
>
> arch_prctl(ARCH_X86_FEATURE_LOCK,
> LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);

I see the logic is glibc will lock SHSTK|IBT if either is enabled in
the elf header. I guess that is why I didn't see the locking happening
for me, because my manual enablement test doesn't have either set in
the header.

I can't see where that glibc knows about WRSS though...

The glibc logic seems wrong to me also, because shadow stack or IBT
could be force-disabled via glibc tunables. I don't see why the elf
header bit should exclusively control the feature locking. Or why both
should be locked if only one is in the header.

>
> so that WRSS cannot be re-enabled.
>
> For the programs that do not support shadow stack, both SHSTK and
> WRSS are
> disabled, but still there is the same call to
> arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack
> nor
> WRSS can be enabled.
>
> My original plan was to run CRIU with no shadow stack, enable shadow
> stack
> and WRSS in the restored tasks using arch_prct() and after the shadow
> stack
> contents is restored disable WRSS.
>
> Obviously, this didn't work with glibc I have :)

Were you disabling shadow stack via a glibc tunable? Or was the elf
header marked for IBT? If it was a plain old binary, the code looks to
me like it should not lock any features.

>
> On the bright side, having a ptrace call to unlock shadow stack and
> wrss
> allows running CRIU itself with shadow stack.
>

Yea, having something working is really great. My only hesitancy is
that, per a discussion on the LAM patchset, we are going to make this
enabling API CET-only (same semantics though). I suppose the
locking API arch_prctl() could still support other arch features,
but it might be a second CET-only regset. It's not the end of the
world.

I guess the other consideration is tying CRIU to glibc peculiarities.
Like even if we fix glibc, then CRIU may not work with some other libc
or app that force disables for some weird reason. Is it supposed to be
libc-agnostic?

2022-06-01 19:14:26

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

Mike,

Thanks for doing this. Glad to hear this is solvable with the current
paradigm.

On Tue, 2022-05-31 at 14:59 +0300, Mike Rapoport wrote:
> * add the ability to unlock shadow stack features using ptrace. This is
> required because the current glibc (or at least the version I used
> for
> tests) locks shadow stack state when it loads a program. This locking
> means
> that a process will either have shadow stack disabled without an
> ability to
> enable it or it will have shadow stack enabled with WRSS disabled and
> again, there is no way to re-enable WRSS. With that, ptrace looked
> like the
> most sensible interface to interfere with the shadow stack locking.

So whatever glibc you have locks features even if it doesn't enable
shadow stack? Hmm, I've not encountered this. Which glibc is it?

WRSS is a feature where you would usually want to lock it as disabled,
but WRSS cannot be enabled if shadow stack is not enabled. Locking
shadow stack and WRSS off together doesn't have any security benefits
in theory, so I'm thinking glibc doesn't need to do this. The kernel
could even refuse to lock WRSS without shadow stack being enabled.
Could we avoid the extra ptrace functionality then?
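
A minimal sketch of that kernel-side refusal, assuming illustrative
names (the feature bits are the ones from this thread; the function
name and task fields are made up for the example):

/* Sketch only: refuse to lock WRSS unless shadow stack is enabled. */
static int x86_feature_lock(struct task_struct *task, u64 features)
{
	u64 enabled = task->thread.features; /* assumed field */

	if ((features & LINUX_X86_FEATURE_WRSS) &&
	    !(enabled & LINUX_X86_FEATURE_SHSTK))
		return -EINVAL; /* a WRSS lock is meaningless without SHSTK */

	task->thread.features_locked |= features; /* assumed field */
	return 0;
}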

Rick

2022-06-01 20:08:09

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, May 31, 2022 at 04:25:13PM +0000, Edgecombe, Rick P wrote:
> Mike,
>
> Thanks for doing this. Glad to hear this is solvable with the current
> paradigm.
>
> On Tue, 2022-05-31 at 14:59 +0300, Mike Rapoport wrote:
> > * add the ability to unlock shadow stack features using ptrace. This is
> > required because the current glibc (or at least the version I used
> > for
> > tests) locks shadow stack state when it loads a program. This locking
> > means
> > that a process will either have shadow stack disabled without an
> > ability to
> > enable it or it will have shadow stack enabled with WRSS disabled and
> > again, there is no way to re-enable WRSS. With that, ptrace looked
> > like the
> > most sensible interface to interfere with the shadow stack locking.
>
> So whatever glibc you have locks features even if it doesn't enable
> shadow stack? Hmm, I've not encountered this. Which glibc is it?

I use glibc from here:
https://gitlab.com/x86-glibc/glibc/, commit b6f9a22a00c1f8ae8c0991886f0a714f2f5da002

AFAIU, it's H.J.'s CET work.


> WRSS is a feature where you would usually want to lock it as disabled,
> but WRSS cannot be enabled if shadow stack is not enabled. Locking
> shadow stack and WRSS off together doesn't have any security benefits
> in theory, so I'm thinking glibc doesn't need to do this. The kernel
> could even refuse to lock WRSS without shadow stack being enabled.
> Could we avoid the extra ptrace functionality then?

What I see is that a program can support shadow stack, glibc enables
shadow stack, does not enable WRSS and then calls

arch_prctl(ARCH_X86_FEATURE_LOCK,
LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);

so that WRSS cannot be re-enabled.

For the programs that do not support shadow stack, both SHSTK and WRSS are
disabled, but still there is the same call to
arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack nor
WRSS can be enabled.

My original plan was to run CRIU with no shadow stack, enable shadow stack
and WRSS in the restored tasks using arch_prctl() and, after the shadow stack
contents are restored, disable WRSS.
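
As a sketch, that plan in the restorer would look something like the
code below. ARCH_X86_FEATURE_ENABLE/DISABLE are assumed counterparts of
the ARCH_X86_FEATURE_LOCK call discussed in this thread, and all the
numeric values are placeholders rather than the real uapi ones:

/* Sketch of the intended restore flow; request numbers and feature
 * bits are placeholders for whatever the series' uapi headers define. */
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_X86_FEATURE_ENABLE  0x3001 /* placeholder value */
#define ARCH_X86_FEATURE_DISABLE 0x3002 /* placeholder value */
#define LINUX_X86_FEATURE_SHSTK  0x1    /* placeholder value */
#define LINUX_X86_FEATURE_WRSS   0x2    /* placeholder value */

static inline void wrssq(unsigned long *addr, unsigned long val)
{
	/* WRSS allows ring-3 writes to shadow stack memory */
	asm volatile("wrssq %1, %0" : "=m" (*addr) : "r" (val));
}

static void restore_shstk(unsigned long *ssp, unsigned long *vals, int n)
{
	syscall(SYS_arch_prctl, ARCH_X86_FEATURE_ENABLE,
		LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);

	for (int i = 0; i < n; i++)
		wrssq(&ssp[i], vals[i]); /* write saved shadow stack entries */

	/* WRSS was only needed for the restore itself */
	syscall(SYS_arch_prctl, ARCH_X86_FEATURE_DISABLE,
		LINUX_X86_FEATURE_WRSS);
}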

Obviously, this didn't work with the glibc I have :)

On the bright side, having a ptrace call to unlock shadow stack and wrss
allows running CRIU itself with shadow stack.

> Rick

--
Sincerely yours,
Mike.

2022-06-01 20:11:06

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

Hi all,

On Mon, Mar 07, 2022 at 11:07:01AM -0800, H.J. Lu wrote:
> On Mon, Mar 7, 2022 at 10:57 AM Mike Rapoport <[email protected]> wrote:
> >
> > On Fri, Mar 04, 2022 at 11:13:19AM -0800, Andy Lutomirski wrote:
> > > On 3/3/22 17:30, Edgecombe, Rick P wrote:
>
> Here is the CET ptrace patch on CET 5.16 kernel branch:
>
> https://github.com/hjl-tools/linux/commit/3a43ec29ddac56f87807161b5aeafa80f632363d

It took me a while, but at last I have a version of CRIU that knows how to
handle shadow stack. For the shadow stack manipulation during dump and for
the creation of the sigframe for sigreturn, I used the CET ptrace patch
for 5.16 (thanks H.J.).

For the restore I had to add two modifications to the kernel APIs on top of
this version of the shadow stack series:

* add an address parameter to map_shadow_stack() so that it'll call mmap()
with MAP_FIXED if an address is requested. This is required to restore the
shadow stack at the same address it had at dump time (see the sketch after
this list).

* add the ability to unlock shadow stack features using ptrace. This is
required because the current glibc (or at least the version I used for
tests) locks shadow stack state when it loads a program. This locking means
that a process will either have shadow stack disabled without an ability to
enable it or it will have shadow stack enabled with WRSS disabled and
again, there is no way to re-enable WRSS. With that, ptrace looked like the
most sensible interface to interfere with the shadow stack locking.
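
Roughly, the restore side of the first change would be used as in the
sketch below; the syscall number and the token flag are assumptions
based on this series rather than a final ABI:

/* Sketch only: recreate a shadow stack at its dump-time address.
 * The syscall number and flag value are placeholders. */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_map_shadow_stack
#define SYS_map_shadow_stack 453 /* placeholder number */
#endif
#define SHADOW_STACK_SET_TOKEN 0x1 /* placeholder flag */

static void *remap_shstk(uint64_t dump_addr, size_t size)
{
	/* A non-zero addr asks the kernel to mmap() the shadow stack
	 * MAP_FIXED at that address, per the modification above. */
	long ret = syscall(SYS_map_shadow_stack, dump_addr, size,
			   SHADOW_STACK_SET_TOKEN);

	return ret == (long)dump_addr ? (void *)ret : NULL;
}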

I've pushed the kernel modifications here:

https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=cet/kvm

and CRIU modifications here:

https://github.com/rppt/criu/tree/cet/v0.1

--
Sincerely yours,
Mike.

2022-06-01 20:34:41

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, 2022-05-31 at 11:00 -0700, H.J. Lu wrote:
> > The glibc logic seems wrong to me also, because shadow stack or IBT
> > could be force-disabled via glibc tunables. I don't see why the elf
> > header bit should exclusively control the feature locking. Or why
> > both
> > should be locked if only one is in the header.
>
> glibc locks SHSTK and IBT only if they are enabled at run-time.

That's not what I saw in the code. Somehow Mike saw something different
as well.

> It doesn't
> enable/disable/lock WRSS at the moment. If WRSS can be enabled
> via arch_prctl at any time, we can't lock it. If WRSS should be
> locked early,
> how should it be enabled in an application? Also, can WRSS be enabled
> from a dlopened object?

I think in the past we discussed having another elf header bit that
behaved differently (OR vs AND).
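
For reference, the existing marking is the AND-combined
GNU_PROPERTY_X86_FEATURE_1_AND note: a bit survives into the final link
only if every input object set it, so a hypothetical OR-combined
property would need a new property type. Checking the existing bits
looks roughly like this (the constants are the real ones from elf.h):

/* Check the AND-combined CET marking bits produced by the linker. */
#include <elf.h>
#include <stdbool.h>
#include <stdint.h>

static bool binary_wants_shstk(uint32_t feature_1_and)
{
	return feature_1_and & GNU_PROPERTY_X86_FEATURE_1_SHSTK;
}

static bool binary_wants_ibt(uint32_t feature_1_and)
{
	return feature_1_and & GNU_PROPERTY_X86_FEATURE_1_IBT;
}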

2022-06-01 21:06:51

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, May 31, 2022 at 05:34:50PM +0000, Edgecombe, Rick P wrote:
> On Tue, 2022-05-31 at 19:36 +0300, Mike Rapoport wrote:
> > > WRSS is a feature where you would usually want to lock it as
> > > disabled,
> > > but WRSS cannot be enabled if shadow stack is not enabled. Locking
> > > shadow stack and WRSS off together doesn't have any security
> > > benefits
> > > in theory, so I'm thinking glibc doesn't need to do this. The
> > > kernel
> > > could even refuse to lock WRSS without shadow stack being enabled.
> > > Could we avoid the extra ptrace functionality then?
> >
> > What I see is that a program can support shadow stack, glibc
> > enables
> > shadow stack, does not enable WRSS and then calls
> >
> > arch_prctl(ARCH_X86_FEATURE_LOCK,
> > LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);
>
> I see the logic is glibc will lock SHSTK|IBT if either is enabled in
> the elf header. I guess that is why I didn't see the locking happening
> for me, because my manual enablement test doesn't have either set in
> the header.

The locking was quite a surprise for me when I moved from a standalone test
to a system with CET-enabled glibc :)

> I can't see where glibc knows about WRSS though...

Right, it was my mistake, as H.J. said glibc locks SHSTK and IBT.

> The glibc logic seems wrong to me also, because shadow stack or IBT
> could be force-disabled via glibc tunables. I don't see why the elf
> header bit should exclusively control the feature locking. Or why both
> should be locked if only one is in the header.
>
> >
> > so that WRSS cannot be re-enabled.
> >
> > For the programs that do not support shadow stack, both SHSTK and
> > WRSS are
> > disabled, but still there is the same call to
> > arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack
> > nor
> > WRSS can be enabled.
> >
> > My original plan was to run CRIU with no shadow stack, enable shadow
> > stack
> > and WRSS in the restored tasks using arch_prctl() and, after the shadow
> > stack
> > contents are restored, disable WRSS.
> >
> > Obviously, this didn't work with the glibc I have :)
>
> Were you disabling shadow stack via a glibc tunable? Or was the elf
> header marked for IBT? If it was a plain old binary, the code looks to
> me like it should not lock any features.

I built criu as a plain old binary; there were no SHSTK or IBT markers. And
I saw that there was a call to arch_prctl() that locked the features as
disabled.

> > On the bright side, having a ptrace call to unlock shadow stack and
> > wrss
> > allows running CRIU itself with shadow stack.
>
> Yea, having something working is really great. My only hesitancy is
> that, per a discussion on the LAM patchset, we are going to make this
> enabling API CET-only (same semantics though). I suppose the
> locking API arch_prctl() could still support other arch features,
> but it might be a second CET-only regset. It's not the end of the
> world.

The support for CET in criu is experimental for now anyway; if the kernel
API ends up slightly different, we'll update criu.
The important things are the ability to control the tracee's shadow stack
from ptrace, the ability to map the shadow stack at a fixed address and the
ability to control the features at least from ptrace.
As long as we have APIs that provide those, it should be OK.

> I guess the other consideration is tying CRIU to glibc peculiarities.
> Like even if we fix glibc, then CRIU may not work with some other libc
> or app that force disables for some weird reason. Is it supposed to be
> libc-agnostic?

Actually, using ptrace to control the CET features does not tie criu to
glibc. The current proposal for the arch_prctl() allows libc to lock CET
features, and having a ptrace call to control the lock makes criu agnostic
to libc behaviour.

--
Sincerely yours,
Mike.

2022-06-01 21:28:54

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Tue, May 31, 2022 at 10:34 AM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Tue, 2022-05-31 at 19:36 +0300, Mike Rapoport wrote:
> > > WRSS is a feature where you would usually want to lock it as
> > > disabled,
> > > but WRSS cannot be enabled if shadow stack is not enabled. Locking
> > > shadow stack and WRSS off together doesn't have any security
> > > benefits
> > > in theory, so I'm thinking glibc doesn't need to do this. The
> > > kernel
> > > could even refuse to lock WRSS without shadow stack being enabled.
> > > Could we avoid the extra ptrace functionality then?
> >
> > What I see is that a program can support shadow stack, glibc
> > enables
> > shadow stack, does not enable WRSS and then calls
> >
> > arch_prctl(ARCH_X86_FEATURE_LOCK,
> > LINUX_X86_FEATURE_SHSTK | LINUX_X86_FEATURE_WRSS);
>
> I see the logic is glibc will lock SHSTK|IBT if either is enabled in
> the elf header. I guess that is why I didn't see the locking happening
> for me, because my manual enablement test doesn't have either set in
> the header.
>
> I can't see where glibc knows about WRSS though...
>
> The glibc logic seems wrong to me also, because shadow stack or IBT
> could be force-disabled via glibc tunables. I don't see why the elf
> header bit should exclusively control the feature locking. Or why both
> should be locked if only one is in the header.

glibc locks SHSTK and IBT only if they are enabled at run-time. It doesn't
enable/disable/lock WRSS at the moment. If WRSS can be enabled
via arch_prctl at any time, we can't lock it. If WRSS should be locked early,
how should it be enabled in an application? Also, can WRSS be enabled
from a dlopened object?

> >
> > so that WRSS cannot be re-enabled.
> >
> > For the programs that do not support shadow stack, both SHSTK and
> > WRSS are
> > disabled, but still there is the same call to
> > arch_prctl(ARCH_X86_FEATURE_LOCK, ...) and then neither shadow stack
> > nor
> > WRSS can be enabled.
> >
> > My original plan was to run CRIU with no shadow stack, enable shadow
> > stack
> > and WRSS in the restored tasks using arch_prctl() and, after the shadow
> > stack
> > contents are restored, disable WRSS.
> >
> > Obviously, this didn't work with the glibc I have :)
>
> Were you disabling shadow stack via a glibc tunable? Or was the elf
> header marked for IBT? If it was a plain old binary, the code looks to
> me like it should not lock any features.
>
> >
> > On the bright side, having a ptrace call to unlock shadow stack and
> > wrss
> > allows running CRIU itself with shadow stack.
> >
>
> Yea, having something working is really great. My only hesitancy is
> that, per a discussion on the LAM patchset, we are going to make this
> enabling API CET-only (same semantics though). I suppose the
> locking API arch_prctl() could still support other arch features,
> but it might be a second CET-only regset. It's not the end of the
> world.
>
> I guess the other consideration is tying CRIU to glibc peculiarities.
> Like even if we fix glibc, then CRIU may not work with some other libc
> or app that force disables for some weird reason. Is it supposed to be
> libc-agnostic?
>


--
H.J.

2022-06-01 21:30:59

by H.J. Lu

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Wed, Jun 1, 2022 at 10:27 AM Edgecombe, Rick P
<[email protected]> wrote:
>
> On Tue, 2022-05-31 at 11:00 -0700, H.J. Lu wrote:
> > > The glibc logic seems wrong to me also, because shadow stack or IBT
> > > could be force-disabled via glibc tunables. I don't see why the elf
> > > header bit should exclusively control the feature locking. Or why
> > > both
> > > should be locked if only one is in the header.
> >
> > glibc locks SHSTK and IBT only if they are enabled at run-time.
>
> That's not what I saw in the code. Somehow Mike saw something different
> as well.

The current glibc cet branch:

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/cet/master

locks only the available CET features. Since only SHSTK is available,
I saw

arch_prctl(0x3003 /* ARCH_??? */, 0x2) = 0

CET features are always enabled early in ld.so to allow function
calls in the CET-enabled ld.so. ld.so always locks CET features,
even if they are disabled because the program or a dependency
library isn't CET-enabled.

> > It doesn't
> > enable/disable/lock WRSS at the moment. If WRSS can be enabled
> > via arch_prctl at any time, we can't lock it. If WRSS should be
> > locked early,
> > how should it be enabled in an application? Also, can WRSS be enabled
> > from a dlopened object?
>
> I think in the past we discussed having another elf header bit that
> behaved differently (OR vs AND).

We should have a complete list of use cases and design a way to
support them.

--
H.J.

2022-06-09 19:23:01

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/35] Shadow stacks for userspace

On Wed, Jun 01, 2022 at 05:24:26PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2022-06-01 at 11:06 +0300, Mike Rapoport wrote:
> > > Yea, having something working is really great. My only hesitancy is
> > > that, per a discussion on the LAM patchset, we are going to make
> > > this
> > > enabling API CET-only (same semantics though). I suppose the
> > > locking API arch_prctl() could still support other arch
> > > features,
> > > but it might be a second CET-only regset. It's not the end of the
> > > world.
> >
> > The support for CET in criu is experimental for now anyway; if the
> > kernel
> > API ends up slightly different, we'll update criu.
> > The important things are the ability to control the tracee's shadow
> > stack from ptrace, the ability to map the shadow stack at a fixed
> > address and the
> > ability to control the features at least from ptrace.
> > As long as we have APIs that provide those, it should be OK.
> >
> > > I guess the other consideration is tying CRIU to glibc
> > > peculiarities.
> > > Like even if we fix glibc, then CRIU may not work with some other
> > > libc
> > > or app that force disables for some weird reason. Is it supposed to
> > > be
> > > libc-agnostic?
> >
> > Actually, using ptrace to control the CET features does not tie
> > criu to
> > glibc. The current proposal for the arch_prctl() allows libc to lock
> > CET
> > features, and having a ptrace call to control the lock makes criu
> > agnostic
> > to libc behaviour.
>
> From staring at the glibc code, I'm suspicious something was weird with
> your test setup, as I don't think it should be locking. But I guess to
> be completely proper you would need to save and restore the lock state
> anyway. So, ok yea, on balance probably better to have an extra
> interface.
>
> Should we make it a GET/SET interface?

Yes, I think so.

--
Sincerely yours,
Mike.