Hi,
This is an overdue followup to the “Shadow stacks for userspace” CET series.
Thanks for all the comments on the first version [0]. They drove a decent
amount of changes for v2. Since it has been a while, I'll try to summarize the
areas that got major changes since last time. Smaller changes are listed in
each patch.
The cover letter is organized into the following sections:
1. Shadow Stack Memory Solution
2. FPU API
3. Alt Shadow Stacks
4. Compatibility of Existing Binaries/Enabling Interface
5. CRIU Support
6. Bigger Selftest
Last time, two bigger pieces of new functionality were requested (alt shadow
stack and CRIU support). Alt shadow stack support was requested not because
there was an immediate need, but because of the risk that signal ABI decisions
made now could create implementation problems if alt shadow stacks were added
later.
A POC for alt shadow stacks may be enough to gauge this risk. CRIU support may
also not be something critical for day one, if glibc disables all existing
binaries as described in section 4. So I marked the patches at the end that
support those two things as RFC/OPTIONAL. The earlier patches will support a
smaller, basic initial implementation. So I’m wondering if we could consider
just enabling the basics upstream first, assuming the RFC pieces here look
passable.
1. Shadow Stack Memory Solution
===============================
Dave had a lot of questions and feedback about how shadow stack memory is
handled, including why shadow stack VMAs were not VM_WRITE. These questions
prompted a revisit of the design, and in the end shadow stacks were switched
to be VM_WRITE. I've tried to summarize how shadow stack memory is supposed to
work, with some examples of how MM features interact with shadow stack memory.
Shadow Stack Memory Summary
---------------------------
Integrating shadow stack memory into the kernel has two main challenges. One,
the kernel already creates Write=0,Dirty=1 PTEs for other purposes, and now it
can't, or it would inadvertently create shadow stack memory.
And, two, shadow stack memory fits strangely into the existing concepts of
Copy-On-Write and “writable” memory. It is *sort of* writable, in that it can
be changed by userspace, but sort of not, in that it has Write=0 and can't be
written by normal mov-type accesses. So we still have the “writable” memory we
always had, but now we also have another type of memory that is changeable from
userspace. Another weird aspect is that memory has to be shadow stack in order
to serve a “shadow stack read”, so a shadow stack read also needs to cause
something like a Copy-On-Write, as the result will be changeable from
userspace.
Dealing with the new meaning of Dirty and Write bits
----------------------------------------------------
The first issue is solved by creating _PAGE_COW using a software PTE bit. This
is hidden inside the pgtable.h helpers, such that it *mostly* (more on this
later) happens without changing core code. Basically, pte_wrprotect() will
clear Dirty and set Cow=1 if the PTE was dirty. pte_mkdirty() sets Cow=1 if the
PTE is Write=0. Then pte_dirty() returns true for Dirty=1 or Cow=1.
Since this requires a little extra work, this behavior is compiled out when
shadow stack support is not enabled for the kernel.
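As a rough sketch (simplified, and not the literal patch code), the helpers end
up doing something like this, with _PAGE_COW being the new software bit:

    static inline pte_t pte_wrprotect(pte_t pte)
    {
        pte = pte_clear_flags(pte, _PAGE_RW);

        /*
         * Write=0,Dirty=1 would read as shadow stack, so carry the dirty
         * information in the software Cow bit instead.
         */
        if (pte_flags(pte) & _PAGE_DIRTY) {
            pte = pte_clear_flags(pte, _PAGE_DIRTY);
            pte = pte_set_flags(pte, _PAGE_COW);
        }
        return pte;
    }

    static inline pte_t pte_mkdirty(pte_t pte)
    {
        /* A Write=0 PTE must record dirtiness in Cow, not Dirty */
        if (!(pte_flags(pte) & _PAGE_RW))
            return pte_set_flags(pte, _PAGE_COW);

        return pte_set_flags(pte, _PAGE_DIRTY);
    }

    static inline bool pte_dirty(pte_t pte)
    {
        /* Either flavor of dirty counts as dirty */
        return pte_flags(pte) & (_PAGE_DIRTY | _PAGE_COW);
    }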
Dealing with a new type of writable memory
------------------------------------------
The other side of the problem - dealing with the concept-splitting new type of
userspace-changeable memory - leaves a few more loose ends. Probably the most
important thing is that we don’t want the kernel thinking that shadow stack
memory is protected from changes from userspace. But we also don’t want the
kernel to treat it like normal writable memory in some ways either, for example
to get confused and inadvertently make it writable in the normal (PTE Write=1)
sense.
The solution here is to treat shadow stack memory as a special class of
writable memory by updating places where memory is made writable to be aware of
it, and treat all shadow stack accesses as if they are writes.
Shadow stack accesses are always treated as write faults because even shadow
stack reads need to be made (shadow stack) writable in order to service them.
Logic creating PTEs then decides whether to create shadow stack or normal
writable memory by the VMA type. Most of this is encapsulated in
maybe_mkwrite() but some differentiation needs to be open coded where
pte_mkwrite() is called directly.
Shadow stack VMAs are a special type of writable memory and so they are created as
VM_WRITE | VM_SHADOW_STACK. The benefit of making them also VM_WRITE is that
there is some existing logic around using VM_WRITE to make decisions in the
kernel that apply to shadow stack memory as well.
- Scheduling code decides whether to migrate a VMA depending on whether
it’s VM_WRITE. The same reasoning should apply for shadow stack
memory.
- While there is no current interface for mmap()ing files as shadow
stack, various drivers enforce non-writable mappings by checking
!VM_WRITE and clearing VM_MAYWRITE. Because there is no longer a way
to mmap() something arbitrarily as shadow stack, this can’t be hit.
But this un-hittable wrong logic makes the design confusing and
brittle.
The downside of having shadow stack memory have VM_WRITE is that any logic that
assumes VM_WRITE means normally writable, for example open coded like:
    if (flags & VM_WRITE)
        pte_mkwrite()
...will no longer be correct. It will need to be changed to have additional
logic that knows about shadow stack. It turns out there are not too many of
these cases and so this series just adds the logic.
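For illustration, a shadow-stack-aware version of that kind of spot ends up
looking roughly like the below, where pte_mkwrite_shstk() is just a made-up
name here for whatever helper creates the Write=0,Dirty=1 encoding:

    if (flags & VM_WRITE) {
        if (flags & VM_SHADOW_STACK)
            pte = pte_mkwrite_shstk(pte);   /* Write=0,Dirty=1 */
        else
            pte = pte_mkwrite(pte);         /* Write=1 */
    }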
The solution for this second issue also tweaks the behavior of pte_write() and
pte_dirty(). pte_write() checks whether a PTE is writable. Previously this was
only the case when Write=1, but now pte_write() also returns true for shadow
stack memory.
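Roughly (again just a sketch, with pte_shstk() standing in for a check of the
Write=0,Dirty=1 encoding):

    static inline bool pte_write(pte_t pte)
    {
        /* Shadow stack PTEs now count as writable too */
        return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
    }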
There are some additional areas that are probably worth commenting on:
COW
---
When a shadow stack page is shared as part of COW, it becomes read-only,
just like normally writable memory would be. As part of the Dirty bit
solution described above, pte_wrprotect() will move Dirty=1 to COW=1.
This will leave the PTE in a read-only state automatically. Then when
it takes a shadow stack access, it will perform COW, copying the page
and making it writable. Logic added as part of the shadow stack memory
solution will detect that the VMA is shadow stack and make the PTE a
shadow stack PTE.
mprotect()/VM_WRITE
-------------------
Shadow stack memory doesn’t have a PROT flag. It is created either
internally in the kernel or via a special syscall. When it is created
this way, the VMA gets VM_WRITE|VM_SHADOW_STACK. However, some
functionality of the kernel will remove VM_WRITE, for example
mprotect(). When this happens the memory is expected to be read only. So
without any intervention, there may be a VMA that is VM_SHADOW_STACK and
not VM_WRITE. We could try to prevent this from happening (for example by
blocking mprotect() from operating on shadow stack memory), however some
things like userfaultfd call mprotect() internally and depend on it to
work.
So mprotect()ing shadow stack memory can make it read-only (non-shadow
stack). It can then become shadow stack again by mprotect()ing it with
PROT_WRITE. It always keeps the VM_SHADOW_STACK, so that it can never
become normally writable memory.
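For example, from userspace's point of view (illustrative only, with ptr
pointing into a shadow stack mapping):

    mprotect(ptr, len, PROT_READ);              /* now plain read-only */
    mprotect(ptr, len, PROT_READ | PROT_WRITE); /* shadow stack again, not
                                                   normally writable, since
                                                   VM_SHADOW_STACK was kept */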
GUP
---
Shadow stack memory is generally treated as writable by the kernel, but
it behaves differently than other writable memory with respect to GUP.
FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
set. Shadow stack memory is writable from the perspective of being
changeable by userspace, but it is also protected memory from
userspace’s perspective. So preventing it from being writable via
FOLL_WRITE helps make it harder for userspace to arbitrarily write to
it. However, like read-only memory, FOLL_FORCE can still write through
it. This means shadow stacks can be written to via things like
“/proc/self/mem”. Apps that want extra security will have to prevent
access to kernel features that can write with FOLL_FORCE.
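	The kind of check this amounts to in the GUP path is roughly the below
	(a sketch only, not the literal mm/gup.c change):

	    /* Refuse a plain write GUP (no FOLL_FORCE) of shadow stack memory */
	    if ((gup_flags & FOLL_WRITE) && !(gup_flags & FOLL_FORCE) &&
	        (vma->vm_flags & VM_SHADOW_STACK))
	        return -EFAULT;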
2. FPU API
==========
The last version of this had an interface for modifying the FPU state in either
the buffer or the registers to try to minimize saves and restores. Shortly
after that, Thomas experimented with a different FPU optimization that was
incompatible with how the interface kept state in the caller. So it doesn't
seem like a robust interface, and for this version the optimization piece of
the API is dropped and the force-restore technique is used again.
3. Alt Shadow Stacks
====================
Andy Lutomirski asked about alt shadow stack support. The following describes
the design of shadow stack support for signals and alt shadow stacks.
Signal handling and shadow stacks
---------------------------------
Signals push information about the execution context to the stack that will
handle the signal. The data pushed is used to restore registers and other state
after the signal. In the case of handling the signal on a normal stack, the
stack just needs to be unwound over the stack frame, but in the case of alt
stacks, the saved stack pointer is important for the sigreturn to find its way
back to the thread stack. With shadow stack there is a new type of stack
pointer, the shadow stack pointer (SSP), that needs to be restored. Just like
the regular stack pointer, it needs to be saved somewhere in order to implement
shadow alt stacks. Beyond supporting basic functionality, it would be nice if
shadow stacks could make sigreturn oriented programming (SROP) attacks harder.
Alt stacks
----------
The automatically-created thread shadow stacks are sized such that shadow stack
overflows should not normally be expected. However, especially since userspace
can create and pivot to arbitrarily sized shadow stacks and we now optionally
have WRSS, overflows are not impossible. To cover the case of shadow stack
overflow, users may want to handle a signal on an alternate shadow stack.
Normal signal alt stacks had problems with using swapcontext() in the signal
handler. Apps couldn't do it safely, because a subsequent signal would
overwrite the previous signal's stack. The kernel would see the current stack
pointer was not on the alt stack (since it swapcontext()ed off of it), so
would restart the signal from the end of the alt stack, clobbering the previous
signal. The solution was to create a new flag that would change the signal
behavior to disable alt stack switching while on the alt stack. Then new
signals would be pushed onto the alt stack. On sigreturn, when the sigframe for
the first signal that switched to the alt stack is encountered, the alt signal
stack would be re-enabled. Then subsequent signals would start at the end of
the alt stack again.
For regular alt stacks, this swapcontext() capable behavior is enabled by
having the kernel clear its copy of the alt signal stack address and length
after this data is saved to the sigframe. So when the first sigframe on the alt
stack is sigreturn-ed, the alt stack is automatically restored.
In order to support swapcontext() on alt shadow stacks, we can have something
similar where we push the SSP, alt shadow stack base and length to some kind of
shadow stack sigframe. This leaves the question of where to push this data.
SROP
----
Similar to normal returns, sigreturns can be security sensitive. One exploit
technique (SROP) is to call sigreturn directly with the stack pointer at a
forged sigframe, i.e. somewhere on the stack other than a real kernel-placed
sigframe. These attacks can be made harder by placing
something on the protected shadow stack to signify that a specific location on
the shadow stack corresponds to where sigreturn is supposed to be called. The
kernel can check for this token during sigreturn, and then sigreturn can’t be
called at arbitrary places on the stack.
Shadow stack signal format
--------------------------
So to handle alt shadow stacks we need to push some data onto a stack. To
prevent SROP we need to push something to the shadow stack that the kernel can
know it must have placed there itself. To support both we can push a special
shadow stack sigframe to the shadow stack that contains the necessary alt stack
restore data, in a format that couldn't possibly occur naturally. To be extra
careful, this data should be written such that it can't be used as a regular
shadow stack return address or a shadow stack token. To make sure it can't be
used, data is pushed with the high bit (bit 63) set. This bit is a linear
address bit in both the token format and a normal return address, so it should
not conflict with anything. It puts any return address in the kernel half of
the address space, so would never be created naturally by a userspace program.
It will not be a valid restore token either, as the kernel address will never
be pointing to the previous frame in the shadow stack.
When a signal hits, the format pushed to the stack that is handling the signal
is four 8-byte values (since we are 64-bit only):
|1...old SSP|1...alt stack size|1...alt stack base|0|
The zero (without high bit set) at the end is pushed to act as a guard frame.
An attacker cannot restore from a point where the frame processed would span
two shadow stack sigframes because the kernel would detect the missing high
bit.
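Expressed as a struct purely for illustration (these field names are not part
of the patches or the ABI), the four slots are:

    struct shstk_sigframe_layout {
        u64 old_ssp;            /* BIT(63) | SSP before the signal */
        u64 alt_stack_size;     /* BIT(63) | alt shadow stack size */
        u64 alt_stack_base;     /* BIT(63) | alt shadow stack base */
        u64 guard;              /* 0 - guard word, high bit clear  */
    };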
setjmp()/longjmp()
------------------
In past designs for userspace shadow stacks, shadow alt stacks were not
supported. Since there was only one shadow stack, longjmp() could jump out of a
signal by using incssp to unwind the SSP to the place where the setjmp() was
called. In order to support longjmp() off of an alt shadow stack, a restore
token could be pushed to the original stack before switching to the alt stack.
Userspace could search the alt stack for the alt stack sigframe to find the
restore token, then restore back to it and continue unwinding. However, the
main point of alt shadow stacks is to handle shadow stack overflows. So
requiring there be space to push a token would prevent the feature from being
used for its main purpose. So in this design nothing is pushed to the old
stack.
Since shadow alt stacks are a new feature, longjmp()ing from an alt shadow stack
will simply not be supported. If a libc wants to support this it will need to
enable WRSS and write its own restore token. This could likely even let it
jump straight back to the setjmp() point and skip the whole incssp piece. It
could even work for longjmp() after a swapcontext(). So this kernel design
makes longjmp() support a security/compatibility tradeoff that the kernel is
not entirely in charge of making.
sigaltshstk() syscall
---------------------
The sigaltstack() syscall works pretty well and is a familiar interface, so
sigaltshstk() is just a copy. It uses the same stack_t struct for transferring
the shadow stack pointer, size and flags. For the flags however, it will not
honor the meaning of the existing flags. Future flags may not have sensible
meanings for shadow stack, so sigaltshstk() will start from scratch for flag
meanings. As long as we are making new flag meanings, we can make SS_AUTODISARM
the default behavior for sigaltshstk(), and not require a flag. Today the only
flag supported is SS_DISABLE, and a !SS_AUTODISARM mode is not supported.
sigaltshstk() is separate from sigaltstack(). You can have one without the
other, neither or both together. Because the shadow stack specific state is
pushed to the shadow stack, the two features don’t need to know about each
other.
Preventing use as an arbitrary “set SSP”
----------------------------------------
So now when a signal hits it will jump to the location specified in
sigaltshstk(). Currently (without WRSS), userspace doesn’t have the ability to
arbitrarily set the SSP. But telling the kernel to set the SSP to an arbitrary
point on signal is kind of like that. So there would be a weakening of the
shadow stack protections unless additional checks are made. With the
SS_AUTODISARM-style behavior, the SSP will only jump to the shadow stack if the
SSP is not already on the shadow stack, otherwise it will just push the SSP. So
we really only need to worry about the transition to the start of the alt
shadow stack. So the kernel checks for a token whenever transitioning to the
alt stack from a place other than the alt stack. This token can be placed when
doing the allocation using the existing map_shadow_stack syscall.
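Putting the pieces together, hypothetical userspace usage might look like the
below. The raw syscall numbers are the ones from this series' syscall table,
the token-placing flag name for map_shadow_stack is only an assumption, and
glibc wrappers don't exist yet:

    stack_t ss = {};

    /* Allocate an alt shadow stack with a restore token placed in it */
    ss.ss_sp = (void *)syscall(451 /* map_shadow_stack */, 0, alt_size,
                               SHADOW_STACK_SET_TOKEN);
    ss.ss_size = alt_size;
    ss.ss_flags = 0;    /* SS_AUTODISARM-style behavior is the default */

    syscall(452 /* sigaltshstk */, &ss, NULL);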
RFC
---
Lastly, Andy Lutomirski raised the issue of alt shadow stacks (I think) out of
concern that we might settle on an ABI that wouldn’t support them if there was
later demand. The ABI of the sigreturn token was actually changed to support alt
shadow stacks here. So if this whole series feels like a lot of code, I wanted
to toss out the option of settling on how we could do alt shadow stacks
someday, but then leave the implementation until later.
4. Compatibility of Existing Binaries/Enabling Interface
========================================================
The last version of this dealt with the problem of old glibcs breaking against
future upstream shadow stack enabled kernels. Unfortunately, more userspace
issues have been found. In anticipation of kernel support, some distros have
apparently been force-compiling applications with shadow stack support. Of
course compiling with shadow stack mostly just means marking the ELF header
bit as "this binary supports shadow stack", and having this bit doesn't
necessarily mean that the binary actually supports shadow stack. In the case of
JITing or other custom stack switching programs, it often doesn't. I have come
across at least one popular distro package that completely fails to even start
up, so there are likely more issues hidden in less common code paths. None of
these apps will break until glibc is updated to use the new kernel API for
enabling shadow stack. They will simply not run with shadow stack.
Waiting until glibc updates to break packages might not technically be a kernel
regression, but it's not good either. With the current kernel API, the decision
of which binaries to enable shadow stack for is left to userspace. So to prevent
breakages my plan is to engage the glibc community to detect and not enable CET
for these old binaries as part of the upstreaming of glibc CET support that will
work with the new kernel interface, and then only enable CET on future, more
carefully compiled binaries. This will also lessen the impact of old CRIUs
(pre-Mike's changes) failing to save shadow stack enabled programs, as most
existing binaries won't all turn on CET at once.
5. CRIU Support
===============
Big thanks to Mike Rapoport for a POC [1] that fixes CRIU to work with
processes that enable shadow stacks. The general design is to allow CET
features to be unlocked via ptrace only, then WRSS can be used to manipulate
the shadow stack to allow CRIU’s sigreturn-oriented operation to continue to
work. He needed a few tweaks to the kernel in order for CRIU to do this,
including the general CET ptrace support that was missing in recent postings of
CET. So this is added back in, as well as his new UNLOCK ptrace-only
arch_prctl(). With the new plan of not trying to enable shadow stack for most
apps all at once, I wonder if this functionality might also be a good candidate
for a fast follow up. Note, this CRIU POC will need to be updated to target the
final signal shadow stack format.
6. Bigger Selftest
==================
This version adds a bigger selftest that exercises the shadow stack kernel
features without any special glibc requirements. It manually enables shadow
stack with the arch_prctl() and exercises the shadow stack arch_prctl()s,
shadow stack MM behavior, userfaultfd, signals, and the two new syscalls.
[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://lore.kernel.org/lkml/[email protected]/
Kirill A. Shutemov (2):
x86: Introduce userspace API for CET enabling
x86: Expose thread features status in /proc/$PID/arch_status
Mike Rapoport (1):
x86/cet/shstk: Add ARCH_CET_UNLOCK
Rick Edgecombe (11):
x86/fpu: Add helper for modifying xstate
mm: Don't allow write GUPs to shadow stack memory
x86/cet/shstk: Introduce map_shadow_stack syscall
x86/cet/shstk: Support wrss for userspace
x86/cet/shstk: Wire in CET interface
selftests/x86: Add shadow stack test
x86/cpufeatures: Limit shadow stack to Intel CPUs
x86: Separate out x86_regset for 32 and 64 bit
x86: Improve formatting of user_regset arrays
x86/fpu: Add helper for initing features
x86: Add alt shadow stack support
Yu-cheng Yu (25):
Documentation/x86: Add CET description
x86/cet/shstk: Add Kconfig option for Shadow Stack
x86/cpufeatures: Add CPU feature flags for shadow stacks
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
x86/cet: Add user control-protection fault handler
x86/mm: Remove _PAGE_DIRTY from kernel RO pages
x86/mm: Move pmd_write(), pud_write() up in the file
x86/mm: Introduce _PAGE_COW
x86/mm: Update pte_modify for _PAGE_COW
x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
transition from _PAGE_DIRTY to _PAGE_COW
mm: Move VM_UFFD_MINOR_BIT from 37 to 38
mm: Introduce VM_SHADOW_STACK for shadow stack memory
x86/mm: Check Shadow Stack page fault errors
x86/mm: Update maybe_mkwrite() for shadow stack
mm: Fixup places that call pte_mkwrite() directly
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm/mprotect: Exclude shadow stack from preserve_write
mm: Re-introduce vm_flags to do_mmap()
x86/cet/shstk: Add user-mode shadow stack support
x86/cet/shstk: Handle thread shadow stack
x86/cet/shstk: Introduce routines modifying shstk
x86/cet/shstk: Handle signals for shadow stack
x86/cet: Add PTRACE interface for CET
Documentation/filesystems/proc.rst | 1 +
Documentation/x86/cet.rst | 143 ++++
Documentation/x86/index.rst | 1 +
arch/arm/kernel/signal.c | 2 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/Kconfig | 18 +
arch/x86/Kconfig.assembler | 5 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/x86/ia32/ia32_signal.c | 1 +
arch/x86/include/asm/cet.h | 49 ++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/fpu/api.h | 6 +
arch/x86/include/asm/fpu/regset.h | 7 +-
arch/x86/include/asm/fpu/sched.h | 3 +-
arch/x86/include/asm/fpu/types.h | 14 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/include/asm/mmu_context.h | 2 +
arch/x86/include/asm/msr-index.h | 5 +
arch/x86/include/asm/msr.h | 11 +
arch/x86/include/asm/pgtable.h | 314 ++++++++-
arch/x86/include/asm/pgtable_types.h | 48 +-
arch/x86/include/asm/processor.h | 11 +
arch/x86/include/asm/special_insns.h | 13 +
arch/x86/include/asm/trap_pf.h | 2 +
arch/x86/include/uapi/asm/mman.h | 2 +
arch/x86/include/uapi/asm/prctl.h | 10 +
arch/x86/kernel/Makefile | 4 +
arch/x86/kernel/cpu/common.c | 30 +-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/fpu/core.c | 59 +-
arch/x86/kernel/fpu/regset.c | 95 +++
arch/x86/kernel/fpu/xstate.c | 198 +++---
arch/x86/kernel/fpu/xstate.h | 6 +
arch/x86/kernel/idt.c | 2 +-
arch/x86/kernel/proc.c | 63 ++
arch/x86/kernel/process.c | 24 +-
arch/x86/kernel/process_64.c | 8 +-
arch/x86/kernel/ptrace.c | 188 +++--
arch/x86/kernel/shstk.c | 628 +++++++++++++++++
arch/x86/kernel/signal.c | 10 +
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 98 ++-
arch/x86/mm/fault.c | 21 +
arch/x86/mm/mmap.c | 25 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/xen/enlighten_pv.c | 2 +-
arch/x86/xen/xen-asm.S | 2 +-
fs/aio.c | 2 +-
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 38 +-
include/linux/pgtable.h | 14 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/asm-generic/unistd.h | 2 +-
include/uapi/linux/elf.h | 1 +
ipc/shm.c | 2 +-
kernel/sys_ni.c | 2 +
mm/gup.c | 2 +-
mm/huge_memory.c | 16 +-
mm/memory.c | 3 +-
mm/migrate_device.c | 3 +-
mm/mmap.c | 22 +-
mm/mprotect.c | 7 +
mm/nommu.c | 4 +-
mm/userfaultfd.c | 10 +-
mm/util.c | 2 +-
tools/testing/selftests/x86/Makefile | 4 +-
.../testing/selftests/x86/test_shadow_stack.c | 646 ++++++++++++++++++
73 files changed, 2670 insertions(+), 281 deletions(-)
create mode 100644 Documentation/x86/cet.rst
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/kernel/proc.c
create mode 100644 arch/x86/kernel/shstk.c
create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c
base-commit: f76349cf41451c5c42a99f18a9163377e4b364ff
--
2.17.1
The kernel now has the main CET functionality to support applications.
Wire in the WRSS and shadow stack enable/disable functions into the
existing CET API skeleton.
Signed-off-by: Rick Edgecombe <[email protected]>
---
v2:
- Split from other patches
arch/x86/kernel/shstk.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index fc64a04366aa..0efec02dbe6b 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -477,9 +477,17 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
return -EINVAL;
if (option == ARCH_CET_DISABLE) {
+ if (features & CET_WRSS)
+ return wrss_control(false);
+ if (features & CET_SHSTK)
+ return shstk_disable();
return -EINVAL;
}
/* Handle ARCH_CET_ENABLE */
+ if (features & CET_SHSTK)
+ return shstk_setup();
+ if (features & CET_WRSS)
+ return wrss_control(true);
return -EINVAL;
}
--
2.17.1
From: Yu-cheng Yu <[email protected]>
To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..be80fc827212 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -365,7 +365,7 @@ extern unsigned int kobjsize(const void *objp);
#endif
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT 37
+# define VM_UFFD_MINOR_BIT 38
# define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */
#else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
# define VM_UFFD_MINOR VM_NONE
--
2.17.1
From: Yu-cheng Yu <[email protected]>
Shadow stacks are normally written to via CALL/RET or specific CET
instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
operations the kernel will need to write to them directly using the ring-0
only WRUSS instruction.
A shadow stack restore token marks a restore point of the shadow stack, and
the address in a token must point directly above the token, within the same
shadow stack. This is distinctly different from other pointers on the shadow
stack, since those pointers point to an executable code area.
Introduce token setup and verify routines. Also introduce WRUSS, which is
a kernel-mode instruction but writes directly to user shadow stack.
In future patches that enable shadow stack to work with signals, the kernel
will need something to denote the point in the stack where sigreturn may be
called. This will prevent attackers calling sigreturn at arbitrary places
in the stack, in order to help prevent SROP attacks.
To do this, something that can only be written by the kernel needs to be
placed on the shadow stack. This can be accomplished by setting bit 63 in
the frame written to the shadow stack. Userspace return addresses can't
have this bit set as it is in the kernel address range. It also can't be a
valid restore token.
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---
v2:
- Add data helpers for writing to shadow stack.
v1:
- Use xsave helpers.
Yu-cheng v30:
- Update commit log, remove description about signals.
- Update various comments.
- Remove variable 'ssp' init and adjust return value accordingly.
- Check get_user_shstk_addr() return value.
- Replace 'ia32' with 'proc32'.
Yu-cheng v29:
- Update comments for the use of get_xsave_addr().
arch/x86/include/asm/special_insns.h | 13 ++++
arch/x86/kernel/shstk.c | 108 +++++++++++++++++++++++++++
2 files changed, 121 insertions(+)
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 35f709f619fb..f096f52bd059 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
: [pax] "a" (p));
}
+#ifdef CONFIG_X86_SHADOW_STACK
+static inline int write_user_shstk_64(u64 __user *addr, u64 val)
+{
+ asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
+ _ASM_EXTABLE(1b, %l[fail])
+ :: [addr] "r" (addr), [val] "r" (val)
+ :: fail);
+ return 0;
+fail:
+ return -EFAULT;
+}
+#endif /* CONFIG_X86_SHADOW_STACK */
+
#define nop() asm volatile ("nop")
static inline void serialize(void)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index db4e53f9fdaf..8904aef487bf 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,8 @@
#include <asm/fpu/api.h>
#include <asm/prctl.h>
+#define SS_FRAME_SIZE 8
+
static bool feature_enabled(unsigned long features)
{
return current->thread.features & features;
@@ -40,6 +42,31 @@ static void feature_clr(unsigned long features)
current->thread.features &= ~features;
}
+/*
+ * Create a restore token on the shadow stack. A token is always 8-byte
+ * and aligned to 8.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+ unsigned long addr;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(ssp, 8))
+ return -EINVAL;
+
+ addr = ssp - SS_FRAME_SIZE;
+
+ /* Mark the token 64-bit */
+ ssp |= BIT(0);
+
+ if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+ return -EFAULT;
+
+ *token_addr = addr;
+
+ return 0;
+}
+
static unsigned long alloc_shstk(unsigned long size)
{
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
@@ -158,6 +185,87 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
return 0;
}
+static unsigned long get_user_shstk_addr(void)
+{
+ unsigned long long ssp;
+
+ fpu_lock_and_load();
+
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+ fpregs_unlock();
+
+ return ssp;
+}
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+ WARN_ON(data & BIT(63));
+
+ /*
+ * Mark the high bit so that the sigframe can't be processed as a
+ * return address.
+ */
+ if (write_user_shstk_64(addr, data | BIT(63)))
+ return -EFAULT;
+ return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+ unsigned long ldata;
+
+ if (unlikely(get_user(ldata, addr)))
+ return -EFAULT;
+
+ if (!(ldata & BIT(63)))
+ return -EINVAL;
+
+ *data = ldata & ~BIT(63);
+
+ return 0;
+}
+
+/*
+ * Verify the user shadow stack has a valid token on it, and then set
+ * *new_ssp according to the token.
+ */
+static int shstk_check_rstor_token(unsigned long *new_ssp)
+{
+ unsigned long token_addr;
+ unsigned long token;
+
+ token_addr = get_user_shstk_addr();
+ if (!token_addr)
+ return -EINVAL;
+
+ if (get_user(token, (unsigned long __user *)token_addr))
+ return -EFAULT;
+
+ /* Is mode flag correct? */
+ if (!(token & BIT(0)))
+ return -EINVAL;
+
+ /* Is busy flag set? */
+ if (token & BIT(1))
+ return -EINVAL;
+
+ /* Mask out flags */
+ token &= ~3UL;
+
+ /* Restore address aligned? */
+ if (!IS_ALIGNED(token, 8))
+ return -EINVAL;
+
+ /* Token placed properly? */
+ if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+ return -EINVAL;
+
+ *new_ssp = token;
+
+ return 0;
+}
+
void shstk_free(struct task_struct *tsk)
{
struct thread_shstk *shstk = &tsk->thread.shstk;
--
2.17.1
From: Yu-cheng Yu <[email protected]>
The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add a CPU feature flag for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---
v2:
- Remove IBT reference in commit log (Kees)
- Describe xsaves dependency using text from (Dave)
v1:
- Remove IBT, can be added in a follow on IBT series.
Yu-cheng v25:
- Make X86_FEATURE_IBT depend on X86_FEATURE_SHSTK.
Yu-cheng v24:
- Update for splitting CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK and
CONFIG_X86_IBT.
- Move DISABLE_IBT definition to the IBT series.
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..d0b49da95c70 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -365,6 +365,7 @@
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 33d2cd04d254..00fe41eee92d 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -87,6 +87,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif
+#ifdef CONFIG_X86_SHADOW_STACK
+#define DISABLE_SHSTK 0
+#else
+#define DISABLE_SHSTK (1 << (X86_FEATURE_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -107,7 +113,7 @@
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
- DISABLE_ENQCMD)
+ DISABLE_ENQCMD|DISABLE_SHSTK)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
#define DISABLED_MASK19 0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index c881bcafba7d..bf1b55a1ba21 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -78,6 +78,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
{}
};
--
2.17.1
From: Yu-cheng Yu <[email protected]>
The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the SSP to
a different shadow stack, but it requires a specially placed token in order
to do this. However, the architecture does not prevent incrementing the
stack pointer to wander onto an adjacent shadow stack. To prevent this in
software, enforce guard pages at the beginning of shadow stack vmas, such
that there will always be a gap between adjacent shadow stacks.
Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP) can move the SSP from one stack to the next. The
SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and RET
can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.
The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:
addq $0x80, %rsp
However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:
READ_ONCE(ssp); // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8); // read+discard last popped stack element
The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.
This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and adds a strange difference to how most regular stacks work.
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---
v2:
- Use __weak instead of #ifdef (Dave Hansen)
- Only have start gap on shadow stack (Andy Luto)
- Create stack_guard_start_gap() to not duplicate code
in an arch version of vm_start_gap() (Dave Hansen)
- Improve commit log partly with verbiage from (Dave Hansen)
Yu-cheng v25:
- Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.
Yu-cheng v24:
- Instead changing vm_*_gap(), create x86-specific versions.
arch/x86/mm/mmap.c | 23 +++++++++++++++++++++++
include/linux/mm.h | 11 ++++++-----
mm/mmap.c | 7 +++++++
3 files changed, 36 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index f3f52c5e2fd6..b0427bd2da30 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -250,3 +250,26 @@ bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
return false;
return true;
}
+
+unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_GROWSDOWN)
+ return stack_guard_gap;
+
+ /*
+ * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).
+ * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
+ * (~1KB for INCSSPD) and touches the first and the last element
+ * in the range, which triggers a page fault if the range is not
+ * in a shadow stack. Because of this, creating 4-KB guard pages
+ * around a shadow stack prevents these instructions from going
+ * beyond.
+ *
+ * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
+ * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return PAGE_SIZE;
+
+ return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fef14ab3abcb..09458e77bf52 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2775,15 +2775,16 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
return vma;
}
+unsigned long stack_guard_start_gap(struct vm_area_struct *vma);
+
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
+ unsigned long gap = stack_guard_start_gap(vma);
unsigned long vm_start = vma->vm_start;
- if (vma->vm_flags & VM_GROWSDOWN) {
- vm_start -= stack_guard_gap;
- if (vm_start > vma->vm_start)
- vm_start = 0;
- }
+ vm_start -= gap;
+ if (vm_start > vma->vm_start)
+ vm_start = 0;
return vm_start;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d780f415be3..f0d2e9143bd0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -247,6 +247,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
return origbrk;
}
+unsigned long __weak stack_guard_start_gap(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_GROWSDOWN)
+ return stack_guard_gap;
+ return 0;
+}
+
static inline unsigned long vma_compute_gap(struct vm_area_struct *vma)
{
unsigned long gap, prev_end;
--
2.17.1
From: "Kirill A. Shutemov" <[email protected]>
Applications and loaders can have logic to decide whether to enable CET.
They usually don't report whether CET has been enabled or not, so there
is no way to verify whether an application actually is protected by CET
features.
Add two lines in /proc/$PID/arch_status to report enabled and locked
features.
Signed-off-by: Kirill A. Shutemov <[email protected]>
[Switched to CET, added to commit log]
Signed-off-by: Rick Edgecombe <[email protected]>
---
v2:
- New patch
arch/x86/kernel/Makefile | 2 ++
arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
arch/x86/kernel/proc.c | 63 ++++++++++++++++++++++++++++++++++++
3 files changed, 65 insertions(+), 47 deletions(-)
create mode 100644 arch/x86/kernel/proc.c
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8950d1f71226..b87b8a0a3749 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -141,6 +141,8 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += sev.o
obj-$(CONFIG_X86_SHADOW_STACK) += shstk.o
+obj-$(CONFIG_PROC_FS) += proc.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5e6a4867fd05..9258fc1169cc 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,8 +10,6 @@
#include <linux/mman.h>
#include <linux/nospec.h>
#include <linux/pkeys.h>
-#include <linux/seq_file.h>
-#include <linux/proc_fs.h>
#include <linux/vmalloc.h>
#include <asm/fpu/api.h>
@@ -1746,48 +1744,3 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
return -EINVAL;
}
}
-
-#ifdef CONFIG_PROC_PID_ARCH_STATUS
-/*
- * Report the amount of time elapsed in millisecond since last AVX512
- * use in the task.
- */
-static void avx512_status(struct seq_file *m, struct task_struct *task)
-{
- unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
- long delta;
-
- if (!timestamp) {
- /*
- * Report -1 if no AVX512 usage
- */
- delta = -1;
- } else {
- delta = (long)(jiffies - timestamp);
- /*
- * Cap to LONG_MAX if time difference > LONG_MAX
- */
- if (delta < 0)
- delta = LONG_MAX;
- delta = jiffies_to_msecs(delta);
- }
-
- seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
- seq_putc(m, '\n');
-}
-
-/*
- * Report architecture specific information
- */
-int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
- struct pid *pid, struct task_struct *task)
-{
- /*
- * Report AVX512 state if the processor and build option supported.
- */
- if (cpu_feature_enabled(X86_FEATURE_AVX512F))
- avx512_status(m, task);
-
- return 0;
-}
-#endif /* CONFIG_PROC_PID_ARCH_STATUS */
diff --git a/arch/x86/kernel/proc.c b/arch/x86/kernel/proc.c
new file mode 100644
index 000000000000..fa9cbe13c298
--- /dev/null
+++ b/arch/x86/kernel/proc.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <uapi/asm/prctl.h>
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+ unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+ long delta;
+
+ if (!timestamp) {
+ /*
+ * Report -1 if no AVX512 usage
+ */
+ delta = -1;
+ } else {
+ delta = (long)(jiffies - timestamp);
+ /*
+ * Cap to LONG_MAX if time difference > LONG_MAX
+ */
+ if (delta < 0)
+ delta = LONG_MAX;
+ delta = jiffies_to_msecs(delta);
+ }
+
+ seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+ seq_putc(m, '\n');
+}
+
+static void dump_features(struct seq_file *m, unsigned long features)
+{
+ if (features & CET_SHSTK)
+ seq_puts(m, "shstk ");
+ if (features & CET_WRSS)
+ seq_puts(m, "wrss ");
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ /*
+ * Report AVX512 state if the processor and build option supported.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+ avx512_status(m, task);
+
+ seq_puts(m, "Thread_features:\t");
+ dump_features(m, task->thread.features);
+ seq_putc(m, '\n');
+
+ seq_puts(m, "Thread_features_locked:\t");
+ dump_features(m, task->thread.features_locked);
+ seq_putc(m, '\n');
+
+ return 0;
+}
--
2.17.1
From: Yu-cheng Yu <[email protected]>
When a signal is handled normally, the context is pushed to the stack
before handling it. For shadow stacks, since the shadow stack only tracks
return addresses, there isn't any state that needs to be pushed. However,
there are still a few things that need to be done. These things are
userspace visible and will be kernel ABI for shadow stacks.
One is to make sure the restorer address is written to the shadow stack, since
the signal handler (if not changing ucontext) returns to the restorer, and
the restorer calls sigreturn. So push the restorer address onto the shadow
stack before handling the signal, so there is not a conflict when the signal
handler returns to the restorer.
The other thing to do is to place some type of checkable token on the
thread's shadow stack before handling the signal and check it during
sigreturn. This is an extra layer of protection to hamper attackers
calling sigreturn manually as in SROP-like attacks.
For this token we can use the shadow stack data format defined earlier.
Have the data pushed be the previous SSP. In the future the sigreturn
might want to return back to a different stack. Storing the SSP (instead
of a restore offset or something) allows for future functionality that
may want to restore to a different stack.
So, when handling a signal, push:
 - the SSP in the shadow stack data format
 - the restorer address below the restore token
In sigreturn, verify the SSP is stored in the data format and pop the shadow
stack.
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kees Cook <[email protected]>
---
v2:
- Switch to new shstk signal format
v1:
- Use xsave helpers.
- Expand commit log.
Yu-cheng v27:
- Eliminate saving shadow stack pointer to signal context.
Yu-cheng v25:
- Update commit log/comments for the sc_ext struct.
- Use restorer address already calculated.
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
- Eliminate writing to MSR_IA32_U_CET for shadow stack.
- Change wrmsrl() to wrmsrl_safe() and handle error.
arch/x86/ia32/ia32_signal.c | 1 +
arch/x86/include/asm/cet.h | 5 ++
arch/x86/kernel/shstk.c | 126 ++++++++++++++++++++++++++++++------
arch/x86/kernel/signal.c | 10 +++
4 files changed, 123 insertions(+), 19 deletions(-)
diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index c9c3859322fa..88d71b9de616 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,6 +34,7 @@
#include <asm/sigframe.h>
#include <asm/sighandling.h>
#include <asm/smap.h>
+#include <asm/cet.h>
static inline void reload_segments(struct sigcontext_32 *sc)
{
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 924de99e0c61..8c6fab9f402a 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
struct task_struct;
+struct ksignal;
struct thread_shstk {
u64 base;
@@ -22,6 +23,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
void shstk_free(struct task_struct *p);
int shstk_disable(void);
void reset_thread_shstk(void);
+int setup_signal_shadow_stack(struct ksignal *ksig);
+int restore_signal_shadow_stack(void);
#else
static inline long cet_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
@@ -33,6 +36,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
static inline void shstk_free(struct task_struct *p) {}
static inline int shstk_disable(void) { return -EOPNOTSUPP; }
static inline void reset_thread_shstk(void) {}
+static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
+static inline int restore_signal_shadow_stack(void) { return 0; }
#endif /* CONFIG_X86_SHADOW_STACK */
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 8904aef487bf..04442134aadd 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -227,41 +227,129 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
}
/*
- * Verify the user shadow stack has a valid token on it, and then set
- * *new_ssp according to the token.
+ * Create a restore token on shadow stack, and then push the user-mode
+ * function return address.
*/
-static int shstk_check_rstor_token(unsigned long *new_ssp)
+static int shstk_setup_rstor_token(unsigned long ret_addr, unsigned long *new_ssp)
{
- unsigned long token_addr;
- unsigned long token;
+ unsigned long ssp, token_addr;
+ int err;
+
+ if (!ret_addr)
+ return -EINVAL;
+
+ ssp = get_user_shstk_addr();
+ if (!ssp)
+ return -EINVAL;
+
+ err = create_rstor_token(ssp, &token_addr);
+ if (err)
+ return err;
+
+ ssp = token_addr - sizeof(u64);
+ err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
+
+ if (!err)
+ *new_ssp = ssp;
+
+ return err;
+}
+
+static int shstk_push_sigframe(unsigned long *ssp)
+{
+ unsigned long target_ssp = *ssp;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(*ssp, 8))
+ return -EINVAL;
- token_addr = get_user_shstk_addr();
- if (!token_addr)
+ if (!IS_ALIGNED(target_ssp, 8))
return -EINVAL;
- if (get_user(token, (unsigned long __user *)token_addr))
+ *ssp -= SS_FRAME_SIZE;
+ if (put_shstk_data((u64 __user *)*ssp, target_ssp))
return -EFAULT;
- /* Is mode flag correct? */
- if (!(token & BIT(0)))
+ return 0;
+}
+
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+ unsigned long token_addr;
+ int err;
+
+ err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+ if (unlikely(err))
+ return err;
+
+ /* Restore SSP aligned? */
+ if (unlikely(!IS_ALIGNED(token_addr, 8)))
return -EINVAL;
- /* Is busy flag set? */
- if (token & BIT(1))
+ /* SSP in userspace? */
+ if (unlikely(token_addr >= TASK_SIZE_MAX))
return -EINVAL;
- /* Mask out flags */
- token &= ~3UL;
+ *ssp = token_addr;
+
+ return 0;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+ void __user *restorer = ksig->ka.sa.sa_restorer;
+ unsigned long ssp;
+ int err;
- /* Restore address aligned? */
- if (!IS_ALIGNED(token, 8))
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ !feature_enabled(CET_SHSTK))
+ return 0;
+
+ if (!restorer)
return -EINVAL;
- /* Token placed properly? */
- if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
+ return -EINVAL;
+
+ err = shstk_push_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ /* Push restorer address */
+ ssp -= SS_FRAME_SIZE;
+ err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+ if (unlikely(err))
+ return -EFAULT;
+
+ fpu_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+ unsigned long ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ !feature_enabled(CET_SHSTK))
+ return 0;
+
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
return -EINVAL;
- *new_ssp = token;
+ err = shstk_pop_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ fpu_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
return 0;
}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9c7265b524c7..d2081305f698 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -47,6 +47,7 @@
#include <asm/syscall.h>
#include <asm/sigframe.h>
#include <asm/signal.h>
+#include <asm/cet.h>
#ifdef CONFIG_X86_64
/*
@@ -472,6 +473,9 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
frame = get_sigframe(&ksig->ka, regs, sizeof(struct rt_sigframe), &fp);
uc_flags = frame_uc_flags(regs);
+ if (setup_signal_shadow_stack(ksig))
+ return -EFAULT;
+
if (!user_access_begin(frame, sizeof(*frame)))
return -EFAULT;
@@ -675,6 +679,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
goto badframe;
+ if (restore_signal_shadow_stack())
+ goto badframe;
+
if (restore_altstack(&frame->uc.uc_stack))
goto badframe;
@@ -992,6 +999,9 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
goto badframe;
+ if (restore_signal_shadow_stack())
+ goto badframe;
+
if (compat_restore_altstack(&frame->uc.uc_stack))
goto badframe;
--
2.17.1
To handle stack overflows, applications can register a separate signal alt
stack to handle signals on. To handle shadow stack
overflows the kernel can similarly provide the ability to have an alt
shadow stack.
Signals push information about the execution context to the stack that
will handle the signal. The data pushed is used to restore registers
and other state after the signal. In the case of handling the signal on
a normal stack, the stack just needs to be unwound over the stack frame,
but in the case of alt stacks, the saved stack pointer is important for
the sigreturn to find its way back to the thread. With shadow stack
there is a new type of stack pointer, the shadow stack pointer (SSP), that
needs to be restored. Just like the regular stack pointer, it needs to be
saved somewhere in order to implement shadow alt stacks. This is already
done as part of the token placed to prevent SROP attacks, so on sigreturn
from an alt shadow stack, the kernel can easily know which SSP to restore.
But to enable SS_AUTODISARM like functionality, the kernel also needs to
push the shadow alt stack base and size somewhere, like happens in regular
alt stacks. So push this data using the same format. In the end the
shadow stack sigframe looks like this:
|1...old SSP|1...alt stack size|1...alt stack base| 0|
In the future, any other data could come between the alt stack base and
the guard zero. The guard zero is to prevent tricking the kernel into
processing half of one frame and half of the adjacent frame.
In past designs for userspace shadow stacks, shadow alt stacks were not
supported. Since there was only one shadow stack, longjmp() could jump out
of a signal by using incssp to unwind the SSP to the place where the
setjmp() was called. Since alt shadow stacks are a new thing, simply don't
support longjmp()ing from an alt shadow stack.
Introduce a new syscall "sigaltshstk" that behaves similarly to
sigaltstack. Have it take new and old stack_t's to specify the base and
length of the alt shadow stack. Don't have it adopt the same flag
semantics though, because not all alt stack flags will necessarily apply
to alt shadow stacks. As long as the syscall is getting new flag meanings,
make SS_AUTODISARM the default behavior for sigaltshstk(), and not require
a flag. Today the only flag supported is SS_DISABLE, and a !SS_AUTODISARM
mode is not supported.
So when a signal hits it will jump to the location specified in
sigaltshstk(). Currently (without WRSS), userspace doesn’t have the
ability to arbitrarily set the SSP. But telling the kernel to set the
SSP to an arbitrary point on signal is kind of like that. So there would
be a weakening of the shadow stack protections unless additional checks
are made. With the SS_AUTODISARM-style behavior, the SSP will only jump to
the shadow stack if the SSP is not already on the shadow stack, otherwise
it will just push the SSP. So have the kernel check for a token
whenever transitioning to the alt stack from a place other than the alt
stack. This token can be written by the kernel during shadow stack
allocation, using the map_shadow_stack syscall.
Signed-off-by: Rick Edgecombe <[email protected]>
---
v2:
- New patch
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/cet.h | 2 +
arch/x86/include/asm/processor.h | 3 +
arch/x86/kernel/process.c | 3 +
arch/x86/kernel/shstk.c | 178 +++++++++++++++---
include/linux/syscalls.h | 1 +
kernel/sys_ni.c | 1 +
.../testing/selftests/x86/test_shadow_stack.c | 75 ++++++++
8 files changed, 240 insertions(+), 24 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d9639e3e0a33..a2dd5d56caa4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -373,6 +373,7 @@
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
451 common map_shadow_stack sys_map_shadow_stack
+452 common sigaltshstk sys_sigaltshstk
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index edf681d4843a..52119b913ed6 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -26,6 +26,7 @@ void reset_thread_shstk(void);
int setup_signal_shadow_stack(struct ksignal *ksig);
int restore_signal_shadow_stack(void);
int wrss_control(bool enable);
+void reset_alt_shstk(void);
#else
static inline long cet_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
@@ -40,6 +41,7 @@ static inline void reset_thread_shstk(void) {}
static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
static inline int restore_signal_shadow_stack(void) { return 0; }
static inline int wrss_control(bool enable) { return -EOPNOTSUPP; }
+static inline void reset_alt_shstk(void) {}
#endif /* CONFIG_X86_SHADOW_STACK */
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3a0c9d9d4d1d..b9fb966edec7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -536,6 +536,9 @@ struct thread_struct {
#ifdef CONFIG_X86_SHADOW_STACK
struct thread_shstk shstk;
+ unsigned long sas_shstk_sp;
+ size_t sas_shstk_size;
+ unsigned int sas_shstk_flags;
#endif
/* Floating point and extended processor state */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 5e63d190becd..b71eb2d6a20f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -176,6 +176,9 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
frame->flags = X86_EFLAGS_FIXED;
#endif
+ if ((clone_flags & (CLONE_VM|CLONE_VFORK)) == CLONE_VM)
+ reset_alt_shstk();
+
/* Allocate a new shadow stack for pthread if needed */
ret = shstk_alloc_thread_stack(p, clone_flags, args->flags, &shstk_addr);
if (ret)
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index af1255164f0c..05ee3793b60f 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -25,6 +25,7 @@
#include <asm/special_insns.h>
#include <asm/fpu/api.h>
#include <asm/prctl.h>
+#include <asm/signal.h>
#define SS_FRAME_SIZE 8
@@ -149,11 +150,18 @@ int shstk_setup(void)
return 0;
}
+void reset_alt_shstk(void)
+{
+ current->thread.sas_shstk_sp = 0;
+ current->thread.sas_shstk_size = 0;
+}
+
void reset_thread_shstk(void)
{
memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
current->thread.features = 0;
current->thread.features_locked = 0;
+ reset_alt_shstk();
}
int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
@@ -238,39 +246,67 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
return 0;
}
+static bool on_alt_shstk(unsigned long ssp)
+{
+ unsigned long alt_ss_start = current->thread.sas_shstk_sp;
+ unsigned long alt_ss_end = alt_ss_start + current->thread.sas_shstk_size;
+
+ return ssp >= alt_ss_start && ssp < alt_ss_end;
+}
+
+static bool alt_shstk_active(void)
+{
+ return current->thread.sas_shstk_sp;
+}
+
+static int alt_shstk_valid(unsigned long ssp, size_t size)
+{
+ if (ssp && (size < PAGE_SIZE || size >= TASK_SIZE_MAX))
+ return -EINVAL;
+
+ if (ssp >= TASK_SIZE_MAX)
+ return -EINVAL;
+
+ return 0;
+}
+
/*
- * Create a restore token on shadow stack, and then push the user-mode
- * function return address.
+ * Verify the user shadow stack has a valid token on it, and then set
+ * *new_ssp according to the token.
*/
-static int shstk_setup_rstor_token(unsigned long ret_addr, unsigned long *new_ssp)
+static int shstk_check_rstor_token(unsigned long token_addr, unsigned long *new_ssp)
{
- unsigned long ssp, token_addr;
- int err;
+ unsigned long token;
- if (!ret_addr)
+ if (get_user(token, (unsigned long __user *)token_addr))
+ return -EFAULT;
+
+ /* Is mode flag correct? */
+ if (!(token & BIT(0)))
return -EINVAL;
- ssp = get_user_shstk_addr();
- if (!ssp)
+ /* Is busy flag set? */
+ if (token & BIT(1))
return -EINVAL;
- err = create_rstor_token(ssp, &token_addr);
- if (err)
- return err;
+ /* Mask out flags */
+ token &= ~3UL;
+
+ /* Restore address aligned? */
+ if (!IS_ALIGNED(token, 8))
+ return -EINVAL;
- ssp = token_addr - sizeof(u64);
- err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
+ /* Token placed properly? */
+ if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
+ return -EINVAL;
- if (!err)
- *new_ssp = ssp;
+ *new_ssp = token;
- return err;
+ return 0;
}
-static int shstk_push_sigframe(unsigned long *ssp)
+static int shstk_push_sigframe(unsigned long *ssp, unsigned long target_ssp)
{
- unsigned long target_ssp = *ssp;
-
/* Token must be aligned */
if (!IS_ALIGNED(*ssp, 8))
return -EINVAL;
@@ -278,17 +314,32 @@ static int shstk_push_sigframe(unsigned long *ssp)
if (!IS_ALIGNED(target_ssp, 8))
return -EINVAL;
+ *ssp -= SS_FRAME_SIZE;
+ if (write_user_shstk_64((u64 __user *)*ssp, 0))
+ return -EFAULT;
+
+ *ssp -= SS_FRAME_SIZE;
+ if (put_shstk_data((u64 __user *)*ssp, current->thread.sas_shstk_sp))
+ return -EFAULT;
+
+ *ssp -= SS_FRAME_SIZE;
+ if (put_shstk_data((u64 __user *)*ssp, current->thread.sas_shstk_size))
+ return -EFAULT;
+
*ssp -= SS_FRAME_SIZE;
if (put_shstk_data((void __user *)*ssp, target_ssp))
return -EFAULT;
+ current->thread.sas_shstk_sp = 0;
+ current->thread.sas_shstk_size = 0;
+
return 0;
}
static int shstk_pop_sigframe(unsigned long *ssp)
{
- unsigned long token_addr;
+ unsigned long token_addr, shstk_sp, shstk_size;
int err;
err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
@@ -303,7 +354,38 @@ static int shstk_pop_sigframe(unsigned long *ssp)
if (unlikely(token_addr >= TASK_SIZE_MAX))
return -EINVAL;
+ *ssp += SS_FRAME_SIZE;
+ err = get_shstk_data(&shstk_size, (void __user *)*ssp);
+ if (unlikely(err))
+ return err;
+
+ *ssp += SS_FRAME_SIZE;
+ err = get_shstk_data(&shstk_sp, (void __user *)*ssp);
+ if (unlikely(err))
+ return err;
+
+ if (unlikely(alt_shstk_valid((unsigned long)shstk_sp, shstk_size)))
+ return -EINVAL;
+
*ssp = token_addr;
+ current->thread.sas_shstk_sp = shstk_sp;
+ current->thread.sas_shstk_size = shstk_size;
+
+ return 0;
+}
+
+static unsigned long get_sig_start_ssp(unsigned long orig_ssp, unsigned long *ssp)
+{
+ unsigned long sp_end = (current->thread.sas_shstk_sp +
+ current->thread.sas_shstk_size) - SS_FRAME_SIZE;
+
+ if (!alt_shstk_active() || on_alt_shstk(*ssp)) {
+ *ssp = orig_ssp;
+ return 0;
+ }
+
+ if (shstk_check_rstor_token(sp_end, ssp))
+ return -EINVAL;
return 0;
}
@@ -311,7 +393,7 @@ static int shstk_pop_sigframe(unsigned long *ssp)
int setup_signal_shadow_stack(struct ksignal *ksig)
{
void __user *restorer = ksig->ka.sa.sa_restorer;
- unsigned long ssp;
+ unsigned long ssp, orig_ssp;
int err;
if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
@@ -321,11 +403,15 @@ int setup_signal_shadow_stack(struct ksignal *ksig)
if (!restorer)
return -EINVAL;
- ssp = get_user_shstk_addr();
- if (unlikely(!ssp))
+ orig_ssp = get_user_shstk_addr();
+ if (unlikely(!orig_ssp))
return -EINVAL;
- err = shstk_push_sigframe(&ssp);
+ err = get_sig_start_ssp(orig_ssp, &ssp);
+ if (unlikely(err))
+ return err;
+
+ err = shstk_push_sigframe(&ssp, orig_ssp);
if (unlikely(err))
return err;
@@ -496,3 +582,47 @@ long cet_prctl(struct task_struct *task, int option, unsigned long features)
return wrss_control(true);
return -EINVAL;
}
+
+SYSCALL_DEFINE2(sigaltshstk, const stack_t __user *, uss, stack_t __user *, uoss)
+{
+ unsigned long ssp;
+ stack_t new, old;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return -ENOSYS;
+
+ ssp = get_user_shstk_addr();
+
+ if (unlikely(!ssp || on_alt_shstk(ssp)))
+ return -EPERM;
+
+ if (uss) {
+ if (unlikely(copy_from_user(&new, uss, sizeof(stack_t))))
+ return -EFAULT;
+
+ if (unlikely(alt_shstk_valid((unsigned long)new.ss_sp,
+ new.ss_size)))
+ return -EINVAL;
+
+ if (new.ss_flags & SS_DISABLE) {
+ current->thread.sas_shstk_sp = 0;
+ current->thread.sas_shstk_size = 0;
+ return 0;
+ }
+
+ current->thread.sas_shstk_sp = (unsigned long) new.ss_sp;
+ current->thread.sas_shstk_size = new.ss_size;
+ /* No saved flags for now */
+ }
+
+ if (!uoss)
+ return 0;
+
+ memset(&old, 0, sizeof(stack_t));
+ old.ss_sp = (void __user *)current->thread.sas_shstk_sp;
+ old.ss_size = current->thread.sas_shstk_size;
+ if (copy_to_user(uoss, &old, sizeof(stack_t)))
+ return -EFAULT;
+
+ return 0;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3ae05cbdea5b..7b7e7bb992c2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1057,6 +1057,7 @@ asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long l
unsigned long home_node,
unsigned long flags);
asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
+asmlinkage long sys_sigaltshstk(const struct sigaltstack *uss, struct sigaltstack *uoss);
/*
* Architecture-specific system calls
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index cb9aebd34646..3a5f8b76e7a4 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -382,6 +382,7 @@ COND_SYSCALL(modify_ldt);
COND_SYSCALL(vm86);
COND_SYSCALL(kexec_file_load);
COND_SYSCALL(map_shadow_stack);
+COND_SYSCALL(sigaltshstk);
/* s390 */
COND_SYSCALL(s390_pci_mmio_read);
diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c
index 249397736d0d..22b856de5cdd 100644
--- a/tools/testing/selftests/x86/test_shadow_stack.c
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -492,6 +492,76 @@ int test_userfaultfd(void)
return 1;
}
+volatile bool segv_pass;
+
+long sigaltshstk(stack_t *uss, stack_t *ouss)
+{
+ return syscall(__NR_sigaltshstk, uss, ouss);
+}
+
+void segv_alt_handler(int signum, siginfo_t *si, void *uc)
+{
+ unsigned long min = (unsigned long)shstk_ptr;
+ unsigned long max = (unsigned long)shstk_ptr + SS_SIZE;
+ unsigned long ssp = get_ssp();
+ stack_t alt_shstk_stackt;
+
+ if (sigaltshstk(NULL, &alt_shstk_stackt))
+ goto fail;
+
+ if (alt_shstk_stackt.ss_sp || alt_shstk_stackt.ss_size)
+ goto fail;
+
+ if (ssp < min || ssp > max - 8)
+ goto fail;
+
+ segv_pass = true;
+ return;
+fail:
+ segv_pass = false;
+}
+
+int test_shstk_alt_stack(void)
+{
+ stack_t alt_shstk_stackt;
+ struct sigaction sa;
+ int ret = 1;
+
+ sa.sa_sigaction = segv_alt_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGUSR1, &sa, NULL))
+ return 1;
+
+ shstk_ptr = create_shstk(0);
+ if (shstk_ptr == MAP_FAILED)
+ goto err_sig;
+
+ alt_shstk_stackt.ss_sp = shstk_ptr;
+ alt_shstk_stackt.ss_size = SS_SIZE;
+ if (sigaltshstk(&alt_shstk_stackt, NULL) == -1)
+ goto err_shstk;
+
+ segv_pass = false;
+
+ /* Make sure segv_pass is written before the signal */
+ asm volatile("" : : : "memory");
+
+ raise(SIGUSR1);
+
+ if (segv_pass) {
+ printf("[OK]\tAlt shadow stack test.\n");
+ ret = 0;
+ }
+
+err_shstk:
+ alt_shstk_stackt.ss_flags = SS_DISABLE;
+ sigaltshstk(&alt_shstk_stackt, NULL);
+ free_shstk(shstk_ptr);
+err_sig:
+ signal(SIGUSR1, SIG_DFL);
+ return ret;
+}
+
int main(int argc, char *argv[])
{
int ret = 0;
@@ -556,6 +626,11 @@ int main(int argc, char *argv[])
printf("[FAIL]\tUserfaultfd test\n");
}
+ if (test_shstk_alt_stack()) {
+ ret = 1;
+ printf("[FAIL]\tAlt shadow stack test\n");
+ }
+
out:
/*
* Disable shadow stack before the function returns, or there will be a
--
2.17.1
Back in 2018, Ingo Molnar suggested[0] improving the formatting of the
struct user_regset arrays. They have multiple member initializations per
line and some lines exceed 100 chars. Reformat them like he suggested.
[0] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Rick Edgecombe <[email protected]>
---
v2:
- New patch
arch/x86/kernel/ptrace.c | 107 ++++++++++++++++++++++++---------------
1 file changed, 65 insertions(+), 42 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 1a4df5fbc5e9..eed8a65d335d 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1235,28 +1235,37 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
static struct user_regset x86_64_regsets[] __ro_after_init = {
[REGSET_GENERAL64] = {
- .core_note_type = NT_PRSTATUS,
- .n = sizeof(struct user_regs_struct) / sizeof(long),
- .size = sizeof(long), .align = sizeof(long),
- .regset_get = genregs_get, .set = genregs_set
+ .core_note_type = NT_PRSTATUS,
+ .n = sizeof(struct user_regs_struct) / sizeof(long),
+ .size = sizeof(long),
+ .align = sizeof(long),
+ .regset_get = genregs_get,
+ .set = genregs_set
},
[REGSET_FP64] = {
- .core_note_type = NT_PRFPREG,
- .n = sizeof(struct fxregs_state) / sizeof(long),
- .size = sizeof(long), .align = sizeof(long),
- .active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
+ .core_note_type = NT_PRFPREG,
+ .n = sizeof(struct fxregs_state) / sizeof(long),
+ .size = sizeof(long),
+ .align = sizeof(long),
+ .active = regset_xregset_fpregs_active,
+ .regset_get = xfpregs_get,
+ .set = xfpregs_set
},
[REGSET_XSTATE64] = {
- .core_note_type = NT_X86_XSTATE,
- .size = sizeof(u64), .align = sizeof(u64),
- .active = xstateregs_active, .regset_get = xstateregs_get,
- .set = xstateregs_set
+ .core_note_type = NT_X86_XSTATE,
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = xstateregs_active,
+ .regset_get = xstateregs_get,
+ .set = xstateregs_set
},
[REGSET_IOPERM64] = {
- .core_note_type = NT_386_IOPERM,
- .n = IO_BITMAP_LONGS,
- .size = sizeof(long), .align = sizeof(long),
- .active = ioperm_active, .regset_get = ioperm_get
+ .core_note_type = NT_386_IOPERM,
+ .n = IO_BITMAP_LONGS,
+ .size = sizeof(long),
+ .align = sizeof(long),
+ .active = ioperm_active,
+ .regset_get = ioperm_get
},
};
@@ -1276,42 +1285,56 @@ static const struct user_regset_view user_x86_64_view = {
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
static struct user_regset x86_32_regsets[] __ro_after_init = {
[REGSET_GENERAL32] = {
- .core_note_type = NT_PRSTATUS,
- .n = sizeof(struct user_regs_struct32) / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .regset_get = genregs32_get, .set = genregs32_set
+ .core_note_type = NT_PRSTATUS,
+ .n = sizeof(struct user_regs_struct32) / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .regset_get = genregs32_get,
+ .set = genregs32_set
},
[REGSET_FP32] = {
- .core_note_type = NT_PRFPREG,
- .n = sizeof(struct user_i387_ia32_struct) / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .active = regset_fpregs_active, .regset_get = fpregs_get, .set = fpregs_set
+ .core_note_type = NT_PRFPREG,
+ .n = sizeof(struct user_i387_ia32_struct) / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .active = regset_fpregs_active,
+ .regset_get = fpregs_get,
+ .set = fpregs_set
},
[REGSET_XFP32] = {
- .core_note_type = NT_PRXFPREG,
- .n = sizeof(struct fxregs_state) / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
+ .core_note_type = NT_PRXFPREG,
+ .n = sizeof(struct fxregs_state) / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .active = regset_xregset_fpregs_active,
+ .regset_get = xfpregs_get,
+ .set = xfpregs_set
},
[REGSET_XSTATE32] = {
- .core_note_type = NT_X86_XSTATE,
- .size = sizeof(u64), .align = sizeof(u64),
- .active = xstateregs_active, .regset_get = xstateregs_get,
- .set = xstateregs_set
+ .core_note_type = NT_X86_XSTATE,
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = xstateregs_active,
+ .regset_get = xstateregs_get,
+ .set = xstateregs_set
},
[REGSET_TLS32] = {
- .core_note_type = NT_386_TLS,
- .n = GDT_ENTRY_TLS_ENTRIES, .bias = GDT_ENTRY_TLS_MIN,
- .size = sizeof(struct user_desc),
- .align = sizeof(struct user_desc),
- .active = regset_tls_active,
- .regset_get = regset_tls_get, .set = regset_tls_set
+ .core_note_type = NT_386_TLS,
+ .n = GDT_ENTRY_TLS_ENTRIES,
+ .bias = GDT_ENTRY_TLS_MIN,
+ .size = sizeof(struct user_desc),
+ .align = sizeof(struct user_desc),
+ .active = regset_tls_active,
+ .regset_get = regset_tls_get,
+ .set = regset_tls_set
},
[REGSET_IOPERM32] = {
- .core_note_type = NT_386_IOPERM,
- .n = IO_BITMAP_BYTES / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .active = ioperm_active, .regset_get = ioperm_get
+ .core_note_type = NT_386_IOPERM,
+ .n = IO_BITMAP_BYTES / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .active = ioperm_active,
+ .regset_get = ioperm_get
},
};
--
2.17.1
From: Yu-cheng Yu <[email protected]>
In change_pte_range(), when a PTE is changed for prot_numa, _PAGE_RW is
preserved to avoid the additional write fault after the NUMA hinting fault.
However, pte_write() now includes both normal writable and shadow stack
(Write=0, Dirty=1) PTEs, but the latter does not have _PAGE_RW and has no
need to preserve it.
Exclude shadow stack from preserve_write test, and apply the same change to
change_huge_pmd().
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
Yu-cheng v25:
- Move is_shadow_stack_mapping() to a separate line.
Yu-cheng v24:
- Change arch_shadow_stack_mapping() to is_shadow_stack_mapping().
mm/huge_memory.c | 7 +++++++
mm/mprotect.c | 7 +++++++
2 files changed, 14 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 11fc69eb4717..492c4f190f55 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1800,6 +1800,13 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return 0;
preserve_write = prot_numa && pmd_write(*pmd);
+
+ /*
+ * Preserve only normal writable huge PMD, but not shadow
+ * stack (RW=0, Dirty=1).
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ preserve_write = false;
ret = 1;
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
diff --git a/mm/mprotect.c b/mm/mprotect.c
index bc6bddd156ca..983206529dce 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -114,6 +114,13 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
pte_t ptent;
bool preserve_write = prot_numa && pte_write(oldpte);
+ /*
+ * Preserve only normal writable PTE, but not shadow
+ * stack (RW=0, Dirty=1).
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ preserve_write = false;
+
/*
* Avoid trapping faults against the zero or KSM
* pages. See similar comment in change_huge_pmd.
--
2.17.1
If an xfeature is saved in a buffer, the xfeature's bit will be set in
xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
is in its init state. In this case the xfeature buffer address cannot
be retrieved with get_xsave_addr().
Future patches will need to handle the case of writing to an xfeature
that may not be saved. So provide helpers to init an xfeature in an
xsave buffer.
This could of course be done directly by reaching into the xsave buffer,
however this would not be robust against future changes that optimize the
xsave buffer by compacting it. In that case the open-coded manipulation
would need to be re-arranged as well. So the logic properly belongs
encapsulated in a helper where it can be unified.
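As a rough sketch (not something in this patch), a later caller that wants
to write an xfeature which may still be in its init state could then do
something like the following, with XFEATURE_CET_USER used only as an
example feature:

	if (!xfeature_saved(xsave, XFEATURE_CET_USER)) {
		if (init_xfeature(xsave, XFEATURE_CET_USER))
			return -EINVAL;
	}

	state = get_xsave_addr(xsave, XFEATURE_CET_USER);
	if (!state)
		return -EINVAL;
	/* ... modify state; it is no longer treated as init ... */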
Signed-off-by: Rick Edgecombe <[email protected]>
---
v2:
- New patch
arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
arch/x86/kernel/fpu/xstate.h | 6 ++++
2 files changed, 53 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9258fc1169cc..82cee1f2f0c8 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -942,6 +942,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
}
+static int xsave_buffer_access_checks(int xfeature_nr)
+{
+ /*
+ * Do we even *have* xsave state?
+ */
+ if (!boot_cpu_has(X86_FEATURE_XSAVE))
+ return 1;
+
+ /*
+ * We should not ever be requesting features that we
+ * have not enabled.
+ */
+ if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ return 1;
+
+ return 0;
+}
+
/*
* Given the xsave area and a state inside, this function returns the
* address of the state.
@@ -962,17 +980,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
*/
void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
{
- /*
- * Do we even *have* xsave state?
- */
- if (!boot_cpu_has(X86_FEATURE_XSAVE))
- return NULL;
-
- /*
- * We should not ever be requesting features that we
- * have not enabled.
- */
- if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ if (xsave_buffer_access_checks(xfeature_nr))
return NULL;
/*
@@ -992,6 +1000,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
return __raw_xsave_addr(xsave, xfeature_nr);
}
+/*
+ * Given the xsave area and a state inside, this function
+ * initializes an xfeature in the buffer.
+ *
+ * get_xsave_addr() will return NULL if the feature bit is
+ * not present in the header. This function will make it so
+ * the xfeature buffer address is ready to be retrieved by
+ * get_xsave_addr().
+ *
+ * Inputs:
+ * xstate: the thread's storage area for all FPU data
+ * xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
+ * XFEATURE_SSE, etc...)
+ * Output:
+ * 1 if the feature cannot be inited, 0 on success
+ */
+int init_xfeature(struct xregs_state *xsave, int xfeature_nr)
+{
+ if (xsave_buffer_access_checks(xfeature_nr))
+ return 1;
+
+ /*
+ * Mark the feature inited.
+ */
+ xsave->header.xfeatures |= BIT_ULL(xfeature_nr);
+ return 0;
+}
+
#ifdef CONFIG_ARCH_HAS_PKEYS
/*
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 5ad47031383b..fb8aae678e9f 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -54,6 +54,12 @@ extern void fpu__init_cpu_xstate(void);
extern void fpu__init_system_xstate(unsigned int legacy_size);
extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+extern int init_xfeature(struct xregs_state *xsave, int xfeature_nr);
+
+static inline int xfeature_saved(struct xregs_state *xsave, int xfeature_nr)
+{
+ return xsave->header.xfeatures & BIT_ULL(xfeature_nr);
+}
static inline u64 xfeatures_mask_supervisor(void)
{
--
2.17.1
From: Yu-cheng Yu <[email protected]>
A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
two cases are handled differently for page faults. Introduce
VM_SHADOW_STACK to track shadow stack VMAs.
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---
Documentation/filesystems/proc.rst | 1 +
arch/x86/mm/mmap.c | 2 ++
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 8 ++++++++
4 files changed, 14 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e7aafc82be99..d54ff397947a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -560,6 +560,7 @@ encoded manner. The codes are the following:
mt arm64 MTE allocation tags are enabled
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
+ ss shadow stack page
== =======================================
Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..f3f52c5e2fd6 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
const char *arch_vma_name(struct vm_area_struct *vma)
{
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return "[shadow stack]";
return NULL;
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4e0023643f8b..a20899392c8d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -700,6 +700,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_ARCH_HAS_SHADOW_STACK
+ [ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
};
size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index be80fc827212..8cd413c5a329 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -314,11 +314,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -334,6 +336,12 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_X86_SHADOW_STACK
+# define VM_SHADOW_STACK VM_HIGH_ARCH_5
+#else
+# define VM_SHADOW_STACK VM_NONE
+#endif
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
--
2.17.1
In fill_thread_core_info() the ptrace accessible registers are collected
for a core file to be written out as notes. The note array is allocated
from a size calculated by iterating the user regset view, and counting the
regsets that have a non-zero core_note_type. However, this only works if
the regsets with a zero core_note_type all sit at the end of the regset
view. If there are any in the middle, fill_thread_core_info() will overflow
the note allocation, as it iterates over the full size of the view while
the allocation is smaller than that.
To apparently avoid this problem, x86_32_regsets and x86_64_regsets need
to be constructed in a special way. They both draw their indices from a
shared enum x86_regset, but 32 bit and 64 bit don't all support the same
regsets and can be compiled in at the same time in the case of
IA32_EMULATION. So this enum has to be laid out in a special way such that
there are no gaps for both x86_32_regsets and x86_64_regsets. This
involves ordering them just right by creating aliases for enums that
are only in one view or the other, or creating multiple versions like
REGSET_IOPERM32/REGSET_IOPERM64.
So the collection of the registers tries to minimize the size of the
allocation, but it doesn’t quite work. Then the x86 ptrace side works
around it by constructing the enum just right to avoid a problem. In the
end there is no functional problem, but it is somewhat strange and
fragile.
It could also be improved like this [1], by better utilizing the smaller
array, but this still wastes space in the regset arrays if they are not
carefully crafted to avoid gaps. Instead, just fully separate out the
enums and give them separate 32 and 64 enum names. Add some bitsize-free
defines for REGSET_GENERAL and REGSET_FP since they are the only two
referred to in bitsize generic code.
This should have no functional change and is only changing how constants
are generated and referred to.
[1] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Rick Edgecombe <[email protected]>
---
v2:
- New patch
arch/x86/kernel/ptrace.c | 61 ++++++++++++++++++++++++++--------------
1 file changed, 40 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 37c12fb92906..1a4df5fbc5e9 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -44,16 +44,35 @@
#include "tls.h"
-enum x86_regset {
- REGSET_GENERAL,
- REGSET_FP,
- REGSET_XFP,
- REGSET_IOPERM64 = REGSET_XFP,
- REGSET_XSTATE,
- REGSET_TLS,
+enum x86_regset_32 {
+ REGSET_GENERAL32,
+ REGSET_FP32,
+ REGSET_XFP32,
+ REGSET_XSTATE32,
+ REGSET_TLS32,
REGSET_IOPERM32,
};
+enum x86_regset_64 {
+ REGSET_GENERAL64,
+ REGSET_FP64,
+ REGSET_IOPERM64,
+ REGSET_XSTATE64,
+};
+
+#define REGSET_GENERAL \
+({ \
+ BUILD_BUG_ON((int)REGSET_GENERAL32 != (int)REGSET_GENERAL64); \
+ REGSET_GENERAL32; \
+})
+
+#define REGSET_FP \
+({ \
+ BUILD_BUG_ON((int)REGSET_FP32 != (int)REGSET_FP64); \
+ REGSET_FP32; \
+})
+
+
struct pt_regs_offset {
const char *name;
int offset;
@@ -788,13 +807,13 @@ long arch_ptrace(struct task_struct *child, long request,
#ifdef CONFIG_X86_32
case PTRACE_GETFPXREGS: /* Get the child extended FPU state. */
return copy_regset_to_user(child, &user_x86_32_view,
- REGSET_XFP,
+ REGSET_XFP32,
0, sizeof(struct user_fxsr_struct),
datap) ? -EIO : 0;
case PTRACE_SETFPXREGS: /* Set the child extended FPU state. */
return copy_regset_from_user(child, &user_x86_32_view,
- REGSET_XFP,
+ REGSET_XFP32,
0, sizeof(struct user_fxsr_struct),
datap) ? -EIO : 0;
#endif
@@ -1086,13 +1105,13 @@ static long ia32_arch_ptrace(struct task_struct *child, compat_long_t request,
case PTRACE_GETFPXREGS: /* Get the child extended FPU state. */
return copy_regset_to_user(child, &user_x86_32_view,
- REGSET_XFP, 0,
+ REGSET_XFP32, 0,
sizeof(struct user32_fxsr_struct),
datap);
case PTRACE_SETFPXREGS: /* Set the child extended FPU state. */
return copy_regset_from_user(child, &user_x86_32_view,
- REGSET_XFP, 0,
+ REGSET_XFP32, 0,
sizeof(struct user32_fxsr_struct),
datap);
@@ -1215,19 +1234,19 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
#ifdef CONFIG_X86_64
static struct user_regset x86_64_regsets[] __ro_after_init = {
- [REGSET_GENERAL] = {
+ [REGSET_GENERAL64] = {
.core_note_type = NT_PRSTATUS,
.n = sizeof(struct user_regs_struct) / sizeof(long),
.size = sizeof(long), .align = sizeof(long),
.regset_get = genregs_get, .set = genregs_set
},
- [REGSET_FP] = {
+ [REGSET_FP64] = {
.core_note_type = NT_PRFPREG,
.n = sizeof(struct fxregs_state) / sizeof(long),
.size = sizeof(long), .align = sizeof(long),
.active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
},
- [REGSET_XSTATE] = {
+ [REGSET_XSTATE64] = {
.core_note_type = NT_X86_XSTATE,
.size = sizeof(u64), .align = sizeof(u64),
.active = xstateregs_active, .regset_get = xstateregs_get,
@@ -1256,31 +1275,31 @@ static const struct user_regset_view user_x86_64_view = {
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
static struct user_regset x86_32_regsets[] __ro_after_init = {
- [REGSET_GENERAL] = {
+ [REGSET_GENERAL32] = {
.core_note_type = NT_PRSTATUS,
.n = sizeof(struct user_regs_struct32) / sizeof(u32),
.size = sizeof(u32), .align = sizeof(u32),
.regset_get = genregs32_get, .set = genregs32_set
},
- [REGSET_FP] = {
+ [REGSET_FP32] = {
.core_note_type = NT_PRFPREG,
.n = sizeof(struct user_i387_ia32_struct) / sizeof(u32),
.size = sizeof(u32), .align = sizeof(u32),
.active = regset_fpregs_active, .regset_get = fpregs_get, .set = fpregs_set
},
- [REGSET_XFP] = {
+ [REGSET_XFP32] = {
.core_note_type = NT_PRXFPREG,
.n = sizeof(struct fxregs_state) / sizeof(u32),
.size = sizeof(u32), .align = sizeof(u32),
.active = regset_xregset_fpregs_active, .regset_get = xfpregs_get, .set = xfpregs_set
},
- [REGSET_XSTATE] = {
+ [REGSET_XSTATE32] = {
.core_note_type = NT_X86_XSTATE,
.size = sizeof(u64), .align = sizeof(u64),
.active = xstateregs_active, .regset_get = xstateregs_get,
.set = xstateregs_set
},
- [REGSET_TLS] = {
+ [REGSET_TLS32] = {
.core_note_type = NT_386_TLS,
.n = GDT_ENTRY_TLS_ENTRIES, .bias = GDT_ENTRY_TLS_MIN,
.size = sizeof(struct user_desc),
@@ -1311,10 +1330,10 @@ u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
void __init update_regset_xstate_info(unsigned int size, u64 xstate_mask)
{
#ifdef CONFIG_X86_64
- x86_64_regsets[REGSET_XSTATE].n = size / sizeof(u64);
+ x86_64_regsets[REGSET_XSTATE64].n = size / sizeof(u64);
#endif
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
- x86_32_regsets[REGSET_XSTATE].n = size / sizeof(u64);
+ x86_32_regsets[REGSET_XSTATE32].n = size / sizeof(u64);
#endif
xstate_fx_sw_bytes[USER_XSTATE_XCR0_WORD] = xstate_mask;
}
--
2.17.1
From: Yu-cheng Yu <[email protected]>
When Shadow Stack is in use, Write=0,Dirty=1 PTE are reserved for shadow
stack. Copy-on-write PTEs then have Write=0,Cow=1.
When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
become a transient shadow stack PTE in two cases:
The first case is that some processors can start a write but end up seeing
a Write=0 PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE. However, this will not occur on processors supporting
Shadow Stack, and a TLB flush is not necessary.
The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
atomically, a transient shadow stack PTE can be created as a result.
Thus, prevent that with cmpxchg.
Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the cmpxchg solution.
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---
v2:
- Compile out some code due to clang build error
- Clarify commit log (dhansen)
- Normalize PTE bit descriptions between patches (dhansen)
- Update comment with text from (dhansen)
Yu-cheng v30:
- Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELS > 2 (Borislav Petkov).
arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2f2963429f48..58c7bf9d7392 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
+#ifdef CONFIG_X86_SHADOW_STACK
+ /*
+ * Avoid accidentally creating shadow stack PTEs
+ * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
+ * the hardware setting Dirty=1.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pte_t old_pte, new_pte;
+
+ old_pte = READ_ONCE(*ptep);
+ do {
+ new_pte = pte_wrprotect(old_pte);
+ } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+ return;
+ }
+#endif
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}
@@ -1339,6 +1356,25 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
{
+#ifdef CONFIG_X86_SHADOW_STACK
+ /*
+ * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+ * to _PAGE_COW (see comments at pmd_wrprotect()).
+ * When a thread reads a RW=1, Dirty=0 PMD and before changing it
+ * to RW=0, Dirty=0, another thread could have written to the page
+ * and the PMD is RW=1, Dirty=1 now.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pmd_t old_pmd, new_pmd;
+
+ old_pmd = READ_ONCE(*pmdp);
+ do {
+ new_pmd = pmd_wrprotect(old_pmd);
+ } while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+ return;
+ }
+#endif
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}
--
2.17.1
From: Yu-cheng Yu <[email protected]>
With the introduction of shadow stack memory there are two ways a pte can
be writable: regular writable memory and shadow stack memory.
In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
where a PTE is made writable. However, there are places where pte_mkwrite()
is called directly and the logic should now also create a shadow stack PTE
in the case of a shadow stack VMA.
- do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
directly and call pte_mkwrite(), which is the same as maybe_mkwrite()
in logic and intention. Just change them to maybe_mkwrite().
- When userfaultfd is creating a PTE after userspace handles the fault
it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk().
In other cases where pte_mkwrite() is called directly, the VMA will not
be VM_SHADOW_STACK, and so shadow stack memory should not be created.
- In the case of pte_savedwrite(), shadow stack VMAs are excluded.
- In the case of the "dirty_accountable" optimization in mprotect(),
shadow stack VMAs won't be VM_SHARED, so it is not necessary.
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---
v2:
- Updated commit log with comment's from Dave Hansen
- Dave also suggested (I understood) to maybe tweak vm_get_page_prot()
to avoid having to call maybe_mkwrite(). After playing around with
this I opted to *not* do this. Shadow stack memory is effectively
writable, so having the default permissions be writable ended up mapping
the zero page as writable and caused other surprises. So
creating shadow stack memory needs to be done with manual logic
like pte_mkwrite().
- Drop change in change_pte_range() because it couldn't actually trigger
for shadow stack VMAs.
- Clarify reasoning for skipped cases of pte_mkwrite().
Yu-cheng v25:
- Apply same changes to do_huge_pmd_numa_page() as to do_numa_page().
mm/migrate_device.c | 3 +--
mm/userfaultfd.c | 10 +++++++---
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..eba3164736b3 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -606,8 +606,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
goto abort;
}
entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
- entry = pte_mkwrite(pte_mkdirty(entry));
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7327b2573f7c..b49372c7de41 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
int ret;
pte_t _dst_pte, *dst_pte;
bool writable = dst_vma->vm_flags & VM_WRITE;
+ bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
bool vm_shared = dst_vma->vm_flags & VM_SHARED;
bool page_in_cache = page->mapping;
spinlock_t *ptl;
@@ -83,9 +84,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
writable = false;
}
- if (writable)
- _dst_pte = pte_mkwrite(_dst_pte);
- else
+ if (writable) {
+ if (shstk)
+ _dst_pte = pte_mkwrite_shstk(_dst_pte);
+ else
+ _dst_pte = pte_mkwrite(_dst_pte);
+ } else
/*
* We need this to make sure write bit removed; as mk_pte()
* could return a pte with write bit set.
--
2.17.1
From: Yu-cheng Yu <[email protected]>
Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).
Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.
Do not support IA32 emulation.
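For reference, a thread would turn this on roughly like the sketch below.
This is only an illustration: ARCH_CET_ENABLE is assumed from the
arch_prctl() interface elsewhere in this series, and glibc would normally
do the enabling during startup. Like the selftest, enabling has to be
balanced by disabling before returning from a function that was entered
with shadow stack still off:

	#include <unistd.h>
	#include <sys/syscall.h>
	#include <asm/prctl.h>

	int main(void)
	{
		/* Hypothetical sketch; on error shadow stack stays off */
		if (syscall(SYS_arch_prctl, ARCH_CET_ENABLE, CET_SHSTK))
			return 1;

		/* ... run with shadow stack ... */

		/* Disable before main() returns, since main() was entered without it */
		syscall(SYS_arch_prctl, ARCH_CET_DISABLE, CET_SHSTK);
		return 0;
	}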
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---
v2:
- Get rid of unnecessary shstk->base checks
- Don't support IA32 emulation
v1:
- Switch to xsave helpers.
- Expand commit log.
Yu-cheng v30:
- Remove superfluous comments for struct thread_shstk.
- Replace 'populate' with 'unused'.
Yu-cheng v28:
- Update shstk_setup() with wrmsrl_safe(), returns success when shadow
stack feature is not present (since this is a setup function).
arch/x86/include/asm/cet.h | 13 +++
arch/x86/include/asm/msr.h | 11 +++
arch/x86/include/asm/processor.h | 5 ++
arch/x86/include/uapi/asm/prctl.h | 2 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/process_64.c | 2 +
arch/x86/kernel/shstk.c | 143 ++++++++++++++++++++++++++++++
7 files changed, 178 insertions(+)
diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
index 0fa4dbc98c49..a4a1f4c0089b 100644
--- a/arch/x86/include/asm/cet.h
+++ b/arch/x86/include/asm/cet.h
@@ -7,12 +7,25 @@
struct task_struct;
+struct thread_shstk {
+ u64 base;
+ u64 size;
+};
+
#ifdef CONFIG_X86_SHADOW_STACK
long cet_prctl(struct task_struct *task, int option,
unsigned long features);
+int shstk_setup(void);
+void shstk_free(struct task_struct *p);
+int shstk_disable(void);
+void reset_thread_shstk(void);
#else
static inline long cet_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
+static inline int shstk_setup(void) { return -EOPNOTSUPP; }
+static inline void shstk_free(struct task_struct *p) {}
+static inline int shstk_disable(void) { return -EOPNOTSUPP; }
+static inline void reset_thread_shstk(void) {}
#endif /* CONFIG_X86_SHADOW_STACK */
#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 65ec1965cd28..a9cb4c434e60 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
int msr_set_bit(u32 msr, u8 bit);
int msr_clear_bit(u32 msr, u8 bit);
+static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
+{
+ u64 val, new_val;
+
+ rdmsrl(msr, val);
+ new_val = (val & ~clear) | set;
+
+ if (new_val != val)
+ wrmsrl(msr, new_val);
+}
+
#ifdef CONFIG_SMP
int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a92bf76edafe..3a0c9d9d4d1d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -27,6 +27,7 @@ struct vm86;
#include <asm/unwind_hints.h>
#include <asm/vmxfeatures.h>
#include <asm/vdso/processor.h>
+#include <asm/cet.h>
#include <linux/personality.h>
#include <linux/cache.h>
@@ -533,6 +534,10 @@ struct thread_struct {
unsigned long features;
unsigned long features_locked;
+#ifdef CONFIG_X86_SHADOW_STACK
+ struct thread_shstk shstk;
+#endif
+
/* Floating point and extended processor state */
struct fpu fpu;
/*
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 028158e35269..41af3a8c4fa4 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -26,4 +26,6 @@
#define ARCH_CET_DISABLE 0x4002
#define ARCH_CET_LOCK 0x4003
+#define CET_SHSTK 0x1
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a20a5ebfacd7..8950d1f71226 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -139,6 +139,8 @@ obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += sev.o
+obj-$(CONFIG_X86_SHADOW_STACK) += shstk.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 8fa2c2b7de65..be544b4b4c8b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
load_gs_index(__USER_DS);
}
+ reset_thread_shstk();
+
loadsegment(fs, 0);
loadsegment(es, _ds);
loadsegment(ds, _ds);
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index e3276ac9e9b9..a0b8d4adb2bf 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -8,8 +8,151 @@
#include <linux/sched.h>
#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/cet.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
#include <asm/prctl.h>
+static bool feature_enabled(unsigned long features)
+{
+ return current->thread.features & features;
+}
+
+static void feature_set(unsigned long features)
+{
+ current->thread.features |= features;
+}
+
+static void feature_clr(unsigned long features)
+{
+ current->thread.features &= ~features;
+}
+
+static unsigned long alloc_shstk(unsigned long size)
+{
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+ struct mm_struct *mm = current->mm;
+ unsigned long addr, unused;
+
+ mmap_write_lock(mm);
+ addr = do_mmap(NULL, 0, size, PROT_READ, flags,
+ VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+
+ mmap_write_unlock(mm);
+
+ return addr;
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+ while (1) {
+ int r;
+
+ r = vm_munmap(base, size);
+
+ /*
+ * vm_munmap() returns -EINTR when mmap_lock is held by
+ * something else, and that lock should not be held for a
+ * long time. Retry it for the case.
+ */
+ if (r == -EINTR) {
+ cond_resched();
+ continue;
+ }
+
+ /*
+ * For all other types of vm_munmap() failure, either the
+ * system is out of memory or there is a bug.
+ */
+ WARN_ON_ONCE(r);
+ break;
+ }
+}
+
+int shstk_setup(void)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ unsigned long addr, size;
+
+ /* Already enabled */
+ if (feature_enabled(CET_SHSTK))
+ return 0;
+
+ /* Also not supported for 32 bit */
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || in_ia32_syscall())
+ return -EOPNOTSUPP;
+
+ size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+ addr = alloc_shstk(size);
+ if (IS_ERR_VALUE(addr))
+ return PTR_ERR((void *)addr);
+
+ fpu_lock_and_load();
+ wrmsrl(MSR_IA32_PL3_SSP, addr + size);
+ wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
+ fpregs_unlock();
+
+ shstk->base = addr;
+ shstk->size = size;
+ feature_set(CET_SHSTK);
+
+ return 0;
+}
+
+void reset_thread_shstk(void)
+{
+ memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
+ current->thread.features = 0;
+ current->thread.features_locked = 0;
+}
+
+void shstk_free(struct task_struct *tsk)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
+ !feature_enabled(CET_SHSTK))
+ return;
+
+ if (!tsk->mm)
+ return;
+
+ unmap_shadow_stack(shstk->base, shstk->size);
+}
+
+int shstk_disable(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return -EOPNOTSUPP;
+
+ /* Already disabled? */
+ if (!feature_enabled(CET_SHSTK))
+ return 0;
+
+ fpu_lock_and_load();
+ /* Disable WRSS too when disabling shadow stack */
+ set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
+ wrmsrl(MSR_IA32_PL3_SSP, 0);
+ fpregs_unlock();
+
+ shstk_free(current);
+ feature_clr(CET_SHSTK);
+
+ return 0;
+}
+
long cet_prctl(struct task_struct *task, int option, unsigned long features)
{
if (option == ARCH_CET_LOCK) {
--
2.17.1
On Thu, Sep 29, 2022 at 03:28:57PM -0700, Rick Edgecombe wrote:
> This is an overdue followup to the “Shadow stacks for userspace” CET series.
> Thanks for all the comments on the first version [0]. They drove a decent
> amount of changes for v2. Since it has been awhile, I’ll try to summarize the
> areas that got major changes since last time. Smaller changes are listed in
> each patch.
Thanks for the write-up!
> [...]
> GUP
> ---
> Shadow stack memory is generally treated as writable by the kernel, but
> it behaves differently than other writable memory with respect to GUP.
> FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
> set. Shadow stack memory is writable from the perspective of being
> changeable by userspace, but it is also protected memory from
> userspace’s perspective. So preventing it from being writable via
> FOLL_WRITE helps make it harder for userspace to arbitrarily write to
> it. However, like read-only memory, FOLL_FORCE can still write through
> it. This means shadow stacks can be written to via things like
> “/proc/self/mem”. Apps that want extra security will have to prevent
> access to kernel features that can write with FOLL_FORCE.
This seems like a problem to me -- the point of SS is that there cannot be
a way to write to them without specific instruction sequences. The fact
that /proc/self/mem bypasses memory protections was an old design mistake
that keeps leading to surprising behaviors. It would be much nicer to
draw the line somewhere and just say that FOLL_FORCE doesn't work on
VM_SHADOW_STACK. Why must FOLL_FORCE be allowed to write to SS?
> [...]
> Shadow stack signal format
> --------------------------
> So to handle alt shadow stacks we need to push some data onto a stack. To
> prevent SROP we need to push something to the shadow stack that the kernel can
> [...]
> shadow stack return address or a shadow stack tokens. To make sure it can’t be
> used, data is pushed with the high bit (bit 63) set. This bit is a linear
> address bit in both the token format and a normal return address, so it should
> not conflict with anything. It puts any return address in the kernel half of
> the address space, so would never be created naturally by a userspace program.
> It will not be a valid restore token either, as the kernel address will never
> be pointing to the previous frame in the shadow stack.
>
> When a signal hits, the format pushed to the stack that is handling the signal
> is four 8 byte values (since we are 64 bit only):
> |1...old SSP|1...alt stack size|1...alt stack base|0|
Do these end up being non-canonical addresses? (To avoid confusion with
"real" kernel addresses?)
-Kees
--
Kees Cook
On Mon, Oct 3, 2022 at 7:04 PM Kees Cook <[email protected]> wrote:
> On Thu, Sep 29, 2022 at 03:28:57PM -0700, Rick Edgecombe wrote:
> > This is an overdue followup to the “Shadow stacks for userspace” CET series.
> > Thanks for all the comments on the first version [0]. They drove a decent
> > amount of changes for v2. Since it has been awhile, I’ll try to summarize the
> > areas that got major changes since last time. Smaller changes are listed in
> > each patch.
>
> Thanks for the write-up!
>
> > [...]
> > GUP
> > ---
> > Shadow stack memory is generally treated as writable by the kernel, but
> > it behaves differently than other writable memory with respect to GUP.
> > FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
> > set. Shadow stack memory is writable from the perspective of being
> > changeable by userspace, but it is also protected memory from
> > userspace’s perspective. So preventing it from being writable via
> > FOLL_WRITE helps make it harder for userspace to arbitrarily write to
> > it. However, like read-only memory, FOLL_FORCE can still write through
> > it. This means shadow stacks can be written to via things like
> > “/proc/self/mem”. Apps that want extra security will have to prevent
> > access to kernel features that can write with FOLL_FORCE.
>
> This seems like a problem to me -- the point of SS is that there cannot be
> a way to write to them without specific instruction sequences. The fact
> that /proc/self/mem bypasses memory protections was an old design mistake
> that keeps leading to surprising behaviors. It would be much nicer to
> draw the line somewhere and just say that FOLL_FORCE doesn't work on
> VM_SHADOW_STACK. Why must FOLL_FORCE be allowed to write to SS?
But once you have FOLL_FORCE, you can also just write over stuff like
executable code instead of writing over the stack. I don't think
allowing FOLL_FORCE writes over shadow stacks from /proc/$pid/mem is
making things worse in any way, and it's probably helpful for stuff
like debuggers.
If you don't want /proc/$pid/mem to be able to do stuff like that,
then IMO the way to go is to change when /proc/$pid/mem uses
FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.
On Thu, Sep 29, 2022 at 03:29:00PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> The Control-Flow Enforcement Technology contains two related features,
> one of which is Shadow Stacks. Future patches will utilize this feature
> for shadow stack support in KVM, so add a CPU feature flags for Shadow
> Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
>
> To protect shadow stack state from malicious modification, the registers
> are only accessible in supervisor mode. This implementation
> context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
> on XSAVES.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Thu, Sep 29, 2022 at 03:29:09PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2f2963429f48..58c7bf9d7392 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> static inline void ptep_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> +#ifdef CONFIG_X86_SHADOW_STACK
> + /*
> + * Avoid accidentally creating shadow stack PTEs
> + * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
> + * the hardware setting Dirty=1.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> + pte_t old_pte, new_pte;
> +
> + old_pte = READ_ONCE(*ptep);
> + do {
> + new_pte = pte_wrprotect(old_pte);
> + } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> + return;
> + }
> +#endif
> clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> }
Okay, this addresses my previous question. The need in cmpxchg is
unfortunate, but well.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Sep 29, 2022, at 3:29 PM, Rick Edgecombe <[email protected]> wrote:
> From: Yu-cheng Yu <[email protected]>
>
> When Shadow Stack is in use, Write=0,Dirty=1 PTE are reserved for shadow
> stack. Copy-on-write PTes then have Write=0,Cow=1.
>
> When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
> become a transient shadow stack PTE in two cases:
>
> The first case is that some processors can start a write but end up seeing
> a Write=0 PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE. However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
>
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
>
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue. Jann Horn provided the cmpxchg solution.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
>
> v2:
> - Compile out some code due to clang build error
> - Clarify commit log (dhansen)
> - Normalize PTE bit descriptions between patches (dhansen)
> - Update comment with text from (dhansen)
>
> Yu-cheng v30:
> - Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELES > 2 (Borislav Petkov).
>
> arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2f2963429f48..58c7bf9d7392 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> static inline void ptep_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> +#ifdef CONFIG_X86_SHADOW_STACK
> + /*
> + * Avoid accidentally creating shadow stack PTEs
> + * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
> + * the hardware setting Dirty=1.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> + pte_t old_pte, new_pte;
> +
> + old_pte = READ_ONCE(*ptep);
> + do {
> + new_pte = pte_wrprotect(old_pte);
> + } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> + return;
> + }
> +#endif
Is there no way of using IS_ENABLED() here instead of these ifdefs?
Did you have a look at ptep_set_access_flags() and friends and check that they
do not need to be changed too? Perhaps you should at least add some
assertion just to ensure nothing breaks.
On Thu, Sep 29, 2022 at 03:29:11PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
> read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
> two cases are handled differently for page faults. Introduce
> VM_SHADOW_STACK to track shadow stack VMAs.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> Documentation/filesystems/proc.rst | 1 +
> arch/x86/mm/mmap.c | 2 ++
> fs/proc/task_mmu.c | 3 +++
> include/linux/mm.h | 8 ++++++++
> 4 files changed, 14 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index e7aafc82be99..d54ff397947a 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -560,6 +560,7 @@ encoded manner. The codes are the following:
> mt arm64 MTE allocation tags are enabled
> um userfaultfd missing tracking
> uw userfaultfd wr-protect tracking
> + ss shadow stack page
> == =======================================
>
> Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index c90c20904a60..f3f52c5e2fd6 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
>
> const char *arch_vma_name(struct vm_area_struct *vma)
> {
> + if (vma->vm_flags & VM_SHADOW_STACK)
> + return "[shadow stack]";
> return NULL;
> }
>
But why here?
CONFIG_ARCH_HAS_SHADOW_STACK implies that there will be more than one arch
that supports shadow stack. The name has to come from generic code too, no?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Sep 29, 2022 at 03:29:10PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
> VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Reviewed-by: Axel Rasmussen <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Peter Xu <[email protected]>
> Cc: Mike Kravetz <[email protected]>
Acked-by: Peter Xu <[email protected]>
--
Peter Xu
On Mon, 2022-10-03 at 10:04 -0700, Kees Cook wrote:
> > Shadow stack signal format
> > --------------------------
> > So to handle alt shadow stacks we need to push some data onto a
> > stack. To
> > prevent SROP we need to push something to the shadow stack that the
> > kernel can
> > [...]
> > shadow stack return address or a shadow stack tokens. To make sure
> > it can’t be
> > used, data is pushed with the high bit (bit 63) set. This bit is a
> > linear
> > address bit in both the token format and a normal return address,
> > so it should
> > not conflict with anything. It puts any return address in the
> > kernel half of
> > the address space, so would never be created naturally by a
> > userspace program.
> > It will not be a valid restore token either, as the kernel address
> > will never
> > be pointing to the previous frame in the shadow stack.
> >
> > When a signal hits, the format pushed to the stack that is handling
> > the signal
> > is four 8 byte values (since we are 64 bit only):
> > > 1...old SSP|1...alt stack size|1...alt stack base|0|
>
> Do these end up being non-canonical addresses? (To avoid confusion
> with
> "real" kernel addresses?)
Usually, but not necessarily with LAM. LAM cannot mask bit 63 though.
So hypothetically they could become "real" kernel addresses some day.
To keep them in the user half but still make sure they are not usable,
you would either have to encode the bits over a lot of entries which
would use extra space, or shrink the available address space, which
could cause compatibility problems.
Do you think it's an issue?
On Thu, Sep 29, 2022 at 03:29:11PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> A shadow stack PTE must be read-only and have _PAGE_DIRTY set. However,
> read-only and Dirty PTEs also exist for copy-on-write (COW) pages. These
> two cases are handled differently for page faults. Introduce
> VM_SHADOW_STACK to track shadow stack VMAs.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Reviewed-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> Documentation/filesystems/proc.rst | 1 +
> arch/x86/mm/mmap.c | 2 ++
> fs/proc/task_mmu.c | 3 +++
> include/linux/mm.h | 8 ++++++++
> 4 files changed, 14 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index e7aafc82be99..d54ff397947a 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -560,6 +560,7 @@ encoded manner. The codes are the following:
> mt arm64 MTE allocation tags are enabled
> um userfaultfd missing tracking
> uw userfaultfd wr-protect tracking
> + ss shadow stack page
> == =======================================
>
> Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index c90c20904a60..f3f52c5e2fd6 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
>
> const char *arch_vma_name(struct vm_area_struct *vma)
> {
> + if (vma->vm_flags & VM_SHADOW_STACK)
> + return "[shadow stack]";
> return NULL;
> }
I agree with Kirill: this should be in the arch-agnostic code.
--
Kees Cook
On 10/3/22 11:11, Nadav Amit wrote:
>> +#ifdef CONFIG_X86_SHADOW_STACK
>> + /*
>> + * Avoid accidentally creating shadow stack PTEs
>> + * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
>> + * the hardware setting Dirty=1.
>> + */
>> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> + pte_t old_pte, new_pte;
>> +
>> + old_pte = READ_ONCE(*ptep);
>> + do {
>> + new_pte = pte_wrprotect(old_pte);
>> + } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
>> +
>> + return;
>> + }
>> +#endif
> There is no way of using IS_ENABLED() here instead of these ifdefs?
Actually, both the existing #ifdef and an IS_ENABLED() check would be
superfluous as-is.
Adding X86_FEATURE_SHSTK to disabled-features.h gives cpu_feature_enabled()
compile-time optimizations for free. No need for *any* additional
CONFIG_* checks.
The only issue would be if the #ifdef'd code won't even compile with
X86_FEATURE_SHSTK disabled.
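For reference, the disabled-features.h pattern looks roughly like this (a
sketch; which DISABLED_MASK word covers X86_FEATURE_SHSTK is left out here):

	/* arch/x86/include/asm/disabled-features.h (sketch) */
	#ifdef CONFIG_X86_SHADOW_STACK
	# define DISABLE_SHSTK	0
	#else
	# define DISABLE_SHSTK	(1 << (X86_FEATURE_SHSTK & 31))
	#endif

	/*
	 * DISABLE_SHSTK is then OR'd into the DISABLED_MASK word that holds
	 * X86_FEATURE_SHSTK, so cpu_feature_enabled(X86_FEATURE_SHSTK)
	 * becomes a compile-time false when the config is off.
	 */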
On Thu, Sep 29, 2022 at 03:29:15PM -0700, Rick Edgecombe wrote:
> [...]
> +unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags & VM_GROWSDOWN)
> + return stack_guard_gap;
> +
> + /*
> + * Shadow stack pointer is moved by CALL, RET, and INCSSP(Q/D).
> + * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
> + * (~1KB for INCSSPD) and touches the first and the last element
> + * in the range, which triggers a page fault if the range is not
> + * in a shadow stack. Because of this, creating 4-KB guard pages
> + * around a shadow stack prevents these instructions from going
> + * beyond.
> + *
> + * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
> + * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
> + */
Thank you for the details on how the size choice is made here! :)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fef14ab3abcb..09458e77bf52 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2775,15 +2775,16 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
> return vma;
> }
>
> +unsigned long stack_guard_start_gap(struct vm_area_struct *vma);
> +
> static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
> {
> + unsigned long gap = stack_guard_start_gap(vma);
> unsigned long vm_start = vma->vm_start;
>
> - if (vma->vm_flags & VM_GROWSDOWN) {
> - vm_start -= stack_guard_gap;
> - if (vm_start > vma->vm_start)
> - vm_start = 0;
> - }
> + vm_start -= gap;
> + if (vm_start > vma->vm_start)
> + vm_start = 0;
> return vm_start;
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9d780f415be3..f0d2e9143bd0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -247,6 +247,13 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
> return origbrk;
> }
>
I feel like something could be done with these definitions to make them
inline, instead of __weak:
#ifndef stack_guard_start_gap
> +unsigned long __weak stack_guard_start_gap(struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags & VM_GROWSDOWN)
> + return stack_guard_gap;
> + return 0;
> +}
#endif
And then move the x86 stack_guard_start_gap to a header?
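Roughly this shape (a sketch only; where the x86 override actually lands is
just a guess):

	/* include/linux/mm.h */
	#ifndef stack_guard_start_gap
	static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
	{
		if (vma->vm_flags & VM_GROWSDOWN)
			return stack_guard_gap;
		return 0;
	}
	#endif

	/*
	 * arch/x86 would then provide the shadow-stack-aware version in a
	 * header and #define stack_guard_start_gap stack_guard_start_gap,
	 * so the generic fallback above is never used there.
	 */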
It's not exactly fast-path, but it feels a little weird. Regardless:
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> With the introduction of shadow stack memory there are two ways a pte can
> be writable: regular writable memory and shadow stack memory.
>
> In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
> or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
> where a PTE is made writable. However, there are places where pte_mkwrite()
> is called directly and the logic should now also create a shadow stack PTE
> in the case of a shadow stack VMA.
>
> - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
> directly and call pte_mkwrite(), which is the same as maybe_mkwrite()
> in logic and intention. Just change them to maybe_mkwrite().
>
> - When userfaultfd is creating a PTE after userspace handles the fault
> it calls pte_mkwrite() directly. Teach it about pte_mkwrite_shstk()
>
> In other cases where pte_mkwrite() is called directly, the VMA will not
> be VM_SHADOW_STACK, and so shadow stack memory should not be created.
> - In the case of pte_savedwrite(), shadow stack VMAs are excluded.
> - In the case of the "dirty_accountable" optimization in mprotect(),
> shadow stack VMAs won't be VM_SHARED, so it is not necessary.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On 9/29/2022 3:29 PM, Rick Edgecombe wrote:
> If an xfeature is saved in a buffer, the xfeature's bit will be set in
> xsave->header.xfeatures. The CPU may opt to not save the xfeature if it
> is in its init state. In this case the xfeature buffer address cannot
> be retrieved with get_xsave_addr().
>
> Future patches will need to handle the case of writing to an xfeature
> that may not be saved. So provide helpers to init an xfeature in an
> xsave buffer.
>
> This could of course be done directly by reaching into the xsave buffer;
> however, this would not be robust against future changes that optimize the
> xsave buffer by compacting it. In that case the xsave buffer would need
> to be re-arranged as well. So this logic properly belongs encapsulated
> in a helper where it can be unified.
>
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
>
> v2:
> - New patch
>
> arch/x86/kernel/fpu/xstate.c | 58 +++++++++++++++++++++++++++++-------
> arch/x86/kernel/fpu/xstate.h | 6 ++++
> 2 files changed, 53 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 9258fc1169cc..82cee1f2f0c8 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -942,6 +942,24 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
> return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
> }
>
> +static int xsave_buffer_access_checks(int xfeature_nr)
> +{
> + /*
> + * Do we even *have* xsave state?
> + */
> + if (!boot_cpu_has(X86_FEATURE_XSAVE))
> + return 1;
> +
> + /*
> + * We should not ever be requesting features that we
> + * have not enabled.
> + */
> + if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> + return 1;
> +
> + return 0;
> +}
> +
> /*
> * Given the xsave area and a state inside, this function returns the
> * address of the state.
> @@ -962,17 +980,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
> */
> void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
> {
> - /*
> - * Do we even *have* xsave state?
> - */
> - if (!boot_cpu_has(X86_FEATURE_XSAVE))
> - return NULL;
> -
> - /*
> - * We should not ever be requesting features that we
> - * have not enabled.
> - */
> - if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
> + if (xsave_buffer_access_checks(xfeature_nr))
> return NULL;
>
> /*
> @@ -992,6 +1000,34 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
> return __raw_xsave_addr(xsave, xfeature_nr);
> }
>
> +/*
> + * Given the xsave area and a state inside, this function
> + * initializes an xfeature in the buffer.
But, this function sets XSTATE_BV bits in the buffer. That does not
*initialize* the state, right?
> + *
> + * get_xsave_addr() will return NULL if the feature bit is
> + * not present in the header. This function will make it so
> + * the xfeature buffer address is ready to be retrieved by
> + * get_xsave_addr().
Looks like this is used in the next patch to help the ptrace path.
We have the state copy function -- copy_uabi_to_xstate() -- that retrieves
the address using __raw_xsave_addr() instead of get_xsave_addr(), copies
the state, and then updates XSTATE_BV.
__raw_xsave_addr() also handles whether the state is in the compacted
format or not. I think you can use it.
Also, I'm curious about the reason why you want to update XSTATE_BV
first with this new helper.
Overall, I'm not sure these new helpers are necessary.
Thanks,
Chang
On Thu, Sep 29, 2022 at 03:29:10PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> To introduce VM_SHADOW_STACK as VM_HIGH_ARCH_BIT (37), and make all
> VM_HIGH_ARCH_BITs stay together, move VM_UFFD_MINOR_BIT from 37 to 38.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On 10/3/22 12:43, Kees Cook wrote:
>> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
>> +{
>> + u64 val, new_val;
>> +
>> + rdmsrl(msr, val);
>> + new_val = (val & ~clear) | set;
>> +
>> + if (new_val != val)
>> + wrmsrl(msr, new_val);
>> +}
> I always get uncomfortable when I see these kinds of generalized helper
> functions for touching cpu bits, etc. It just begs for future attacker
> abuse to muck with arbitrary bits -- even marked inline there is a risk
> the compiler will ignore that in some circumstances (not as currently
> used in the code, but I'm imagining future changes leading to such a
> condition). Will you humor me and change this to a macro instead? That'll
> force it always inline (even __always_inline isn't always inline):
Oh, are you thinking that this is dangerous because it's so surgical and
non-intrusive? It's even more powerful to an attacker than, say
wrmsrl(), because there they actually have to know what the existing
value is to update it. With this helper, it's quite easy to flip an
individual bit without disturbing the neighboring bits.
Is that it?
I don't _like_ the #defines, but doing one here doesn't seem too onerous
considering how critical MSRs are.
On Thu, Sep 29, 2022 at 03:29:21PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> Introduce basic shadow stack enabling/disabling/allocation routines.
> A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
> and has a fixed size of min(RLIMIT_STACK, 4GB).
>
> Keep the task's shadow stack address and size in thread_struct. This will
> be copied when cloning new threads, but needs to be cleared during exec,
> so add a function to do this.
>
> Do not support IA32 emulation.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> ---
>
> v2:
> - Get rid of unnecessary shstk->base checks
> - Don't support IA32 emulation
>
> v1:
> - Switch to xsave helpers.
> - Expand commit log.
>
> Yu-cheng v30:
> - Remove superfluous comments for struct thread_shstk.
> - Replace 'populate' with 'unused'.
>
> Yu-cheng v28:
> - Update shstk_setup() with wrmsrl_safe(), returns success when shadow
> stack feature is not present (since this is a setup function).
>
> arch/x86/include/asm/cet.h | 13 +++
> arch/x86/include/asm/msr.h | 11 +++
> arch/x86/include/asm/processor.h | 5 ++
> arch/x86/include/uapi/asm/prctl.h | 2 +
> arch/x86/kernel/Makefile | 2 +
> arch/x86/kernel/process_64.c | 2 +
> arch/x86/kernel/shstk.c | 143 ++++++++++++++++++++++++++++++
> 7 files changed, 178 insertions(+)
>
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index 0fa4dbc98c49..a4a1f4c0089b 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -7,12 +7,25 @@
>
> struct task_struct;
>
> +struct thread_shstk {
> + u64 base;
> + u64 size;
> +};
> +
> #ifdef CONFIG_X86_SHADOW_STACK
> long cet_prctl(struct task_struct *task, int option,
> unsigned long features);
> +int shstk_setup(void);
> +void shstk_free(struct task_struct *p);
> +int shstk_disable(void);
> +void reset_thread_shstk(void);
> #else
> static inline long cet_prctl(struct task_struct *task, int option,
> unsigned long features) { return -EINVAL; }
> +static inline int shstk_setup(void) { return -EOPNOTSUPP; }
> +static inline void shstk_free(struct task_struct *p) {}
> +static inline int shstk_disable(void) { return -EOPNOTSUPP; }
> +static inline void reset_thread_shstk(void) {}
> #endif /* CONFIG_X86_SHADOW_STACK */
shstk_setup() and shstk_disable() are not called outside of shstk.c, so
they can be removed from this header entirely.
>
> #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 65ec1965cd28..a9cb4c434e60 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -310,6 +310,17 @@ void msrs_free(struct msr *msrs);
> int msr_set_bit(u32 msr, u8 bit);
> int msr_clear_bit(u32 msr, u8 bit);
>
> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> +{
> + u64 val, new_val;
> +
> + rdmsrl(msr, val);
> + new_val = (val & ~clear) | set;
> +
> + if (new_val != val)
> + wrmsrl(msr, new_val);
> +}
I always get uncomfortable when I see these kinds of generalized helper
functions for touching cpu bits, etc. It just begs for future attacker
abuse to muck with arbitrary bits -- even marked inline there is a risk
the compiler will ignore that in some circumstances (not as currently
used in the code, but I'm imagining future changes leading to such a
condition). Will you humor me and change this to a macro instead? That'll
force it always inline (even __always_inline isn't always inline):
/* Helper that can never get accidentally un-inlined. */
#define set_clr_bits_msrl(msr, set, clear) do { \
u64 __val, __new_val; \
\
rdmsrl(msr, __val); \
__new_val = (__val & ~(clear)) | (set); \
\
if (__new_val != __val) \
wrmsrl(msr, __new_val); \
} while (0)
> +
> #ifdef CONFIG_SMP
> int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
> int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index a92bf76edafe..3a0c9d9d4d1d 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -27,6 +27,7 @@ struct vm86;
> #include <asm/unwind_hints.h>
> #include <asm/vmxfeatures.h>
> #include <asm/vdso/processor.h>
> +#include <asm/cet.h>
>
> #include <linux/personality.h>
> #include <linux/cache.h>
> @@ -533,6 +534,10 @@ struct thread_struct {
> unsigned long features;
> unsigned long features_locked;
>
> +#ifdef CONFIG_X86_SHADOW_STACK
> + struct thread_shstk shstk;
> +#endif
> +
> /* Floating point and extended processor state */
> struct fpu fpu;
> /*
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index 028158e35269..41af3a8c4fa4 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -26,4 +26,6 @@
> #define ARCH_CET_DISABLE 0x4002
> #define ARCH_CET_LOCK 0x4003
>
For readability, maybe add: /* ARCH_CET_* "features" bits */
> +#define CET_SHSTK 0x1
This is UAPI, so the BIT() macro isn't available, but since this is
unsigned long, please use the form: (1ULL << 0) etc...
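Something like (sketch of the suggestion):

	/* ARCH_CET_* "features" bits */
	#define CET_SHSTK		(1ULL << 0)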
> +
> #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index a20a5ebfacd7..8950d1f71226 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -139,6 +139,8 @@ obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
>
> obj-$(CONFIG_AMD_MEM_ENCRYPT) += sev.o
>
> +obj-$(CONFIG_X86_SHADOW_STACK) += shstk.o
> +
> ###
> # 64 bit specific files
> ifeq ($(CONFIG_X86_64),y)
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 8fa2c2b7de65..be544b4b4c8b 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -514,6 +514,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
> load_gs_index(__USER_DS);
> }
>
> + reset_thread_shstk();
> +
> loadsegment(fs, 0);
> loadsegment(es, _ds);
> loadsegment(ds, _ds);
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index e3276ac9e9b9..a0b8d4adb2bf 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -8,8 +8,151 @@
>
> #include <linux/sched.h>
> #include <linux/bitops.h>
> +#include <linux/types.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/signal.h>
> +#include <linux/compat.h>
> +#include <linux/sizes.h>
> +#include <linux/user.h>
> +#include <asm/msr.h>
> +#include <asm/fpu/xstate.h>
> +#include <asm/fpu/types.h>
> +#include <asm/cet.h>
> +#include <asm/special_insns.h>
> +#include <asm/fpu/api.h>
> #include <asm/prctl.h>
>
> +static bool feature_enabled(unsigned long features)
> +{
> + return current->thread.features & features;
> +}
> +
> +static void feature_set(unsigned long features)
> +{
> + current->thread.features |= features;
> +}
> +
> +static void feature_clr(unsigned long features)
> +{
> + current->thread.features &= ~features;
> +}
"feature" vs "features" here is confusing. Should these helpers enforce
the single-bit-set requirements? If so, please switch to a bit number
instead of a mask. If not, please rename these to
"features_{enabled,set,clr}", and fix "features_enabled" to check them
all:
return (current->thread.features & features) == features;
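i.e., roughly (sketch of the rename, not from the patch):

	static bool features_enabled(unsigned long features)
	{
		/* True only if *all* of the requested feature bits are set */
		return (current->thread.features & features) == features;
	}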
> +static unsigned long alloc_shstk(unsigned long size)
> +{
> + int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> + struct mm_struct *mm = current->mm;
> + unsigned long addr, unused;
WARN_ON + clamp on "size" here, or perhaps move the bounds check from
shstk_setup() into here?
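One possible shape (a hypothetical sketch; the SZ_4G cap just mirrors the
check currently in shstk_setup()):

	/* In alloc_shstk(), before do_mmap(): clamp and sanity-check the size */
	size = PAGE_ALIGN(min_t(unsigned long long, size, SZ_4G));
	if (WARN_ON_ONCE(!size))
		return (unsigned long)-EINVAL;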
> +
> + mmap_write_lock(mm);
> + addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> + VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
This will use the mmap base address offset randomization, I guess?
> +
> + mmap_write_unlock(mm);
> +
> + return addr;
> +}
> +
> +static void unmap_shadow_stack(u64 base, u64 size)
> +{
> + while (1) {
> + int r;
> +
> + r = vm_munmap(base, size);
> +
> + /*
> + * vm_munmap() returns -EINTR when mmap_lock is held by
> + * something else, and that lock should not be held for a
> +	 * long time. Retry in that case.
> + */
> + if (r == -EINTR) {
> + cond_resched();
> + continue;
> + }
> +
> + /*
> + * For all other types of vm_munmap() failure, either the
> +	 * system is out of memory or there is a bug.
> + */
> + WARN_ON_ONCE(r);
> + break;
> + }
> +}
> +
> +int shstk_setup(void)
Only called locally. Make static?
> +{
> + struct thread_shstk *shstk = ¤t->thread.shstk;
> + unsigned long addr, size;
> +
> + /* Already enabled */
> + if (feature_enabled(CET_SHSTK))
> + return 0;
> +
> + /* Also not supported for 32 bit */
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) || in_ia32_syscall())
> + return -EOPNOTSUPP;
> +
> + size = PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
> + addr = alloc_shstk(size);
> + if (IS_ERR_VALUE(addr))
> + return PTR_ERR((void *)addr);
> +
> + fpu_lock_and_load();
> + wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> + wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
> + fpregs_unlock();
> +
> + shstk->base = addr;
> + shstk->size = size;
> + feature_set(CET_SHSTK);
> +
> + return 0;
> +}
> +
> +void reset_thread_shstk(void)
> +{
> + memset(¤t->thread.shstk, 0, sizeof(struct thread_shstk));
> + current->thread.features = 0;
> + current->thread.features_locked = 0;
> +}
If features is always going to be tied to shstk, why not put them in the
shstk struct?
Also, shouldn't this also be called from arch_setup_new_exec() instead
of the open-coded wipe of features there?
> +
> +void shstk_free(struct task_struct *tsk)
> +{
> + struct thread_shstk *shstk = &tsk->thread.shstk;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> + !feature_enabled(CET_SHSTK))
> + return;
> +
> + if (!tsk->mm)
> + return;
> +
> + unmap_shadow_stack(shstk->base, shstk->size);
I feel like base and size should be zeroed here?
> +}
> +
> +int shstk_disable(void)
This is only called locally. static?
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> + return -EOPNOTSUPP;
> +
> + /* Already disabled? */
> + if (!feature_enabled(CET_SHSTK))
> + return 0;
> +
> + fpu_lock_and_load();
> + /* Disable WRSS too when disabling shadow stack */
> + set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_SHSTK_EN);
> + wrmsrl(MSR_IA32_PL3_SSP, 0);
> + fpregs_unlock();
> +
> + shstk_free(current);
> + feature_clr(CET_SHSTK);
> +
> + return 0;
> +}
> +
> long cet_prctl(struct task_struct *task, int option, unsigned long features)
> {
> if (option == ARCH_CET_LOCK) {
> --
> 2.17.1
>
--
Kees Cook
On Thu, Sep 29, 2022 at 03:29:23PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> Shadow stacks are normally written to via CALL/RET or specific CET
> instructions like RSTORSSP/SAVEPREVSSP. However, during some Linux
> operations the kernel will need to write to them directly using the ring-0 only
> WRUSS instruction.
>
> A shadow stack restore token marks a restore point of the shadow stack, and
> the address in a token must point directly above the token, which is within
> the same shadow stack. This is distinctively different from other pointers
> on the shadow stack, since those pointers point to executable code area.
>
> Introduce token setup and verify routines. Also introduce WRUSS, which is
> a kernel-mode instruction but writes directly to user shadow stack.
>
> In future patches that enable shadow stack to work with signals, the kernel
> will need something to denote the point in the stack where sigreturn may be
> called. This will prevent attackers from calling sigreturn at arbitrary places
> in the stack, in order to help prevent SROP attacks.
>
> To do this, something that can only be written by the kernel needs to be
> placed on the shadow stack. This can be accomplished by setting bit 63 in
> the frame written to the shadow stack. Userspace return addresses can't
> have this bit set as it is in the kernel range. It also can't be a
> valid restore token.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> ---
>
> v2:
> - Add data helpers for writing to shadow stack.
>
> v1:
> - Use xsave helpers.
>
> Yu-cheng v30:
> - Update commit log, remove description about signals.
> - Update various comments.
> - Remove variable 'ssp' init and adjust return value accordingly.
> - Check get_user_shstk_addr() return value.
> - Replace 'ia32' with 'proc32'.
>
> Yu-cheng v29:
> - Update comments for the use of get_xsave_addr().
>
> arch/x86/include/asm/special_insns.h | 13 ++++
> arch/x86/kernel/shstk.c | 108 +++++++++++++++++++++++++++
> 2 files changed, 121 insertions(+)
>
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 35f709f619fb..f096f52bd059 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
> : [pax] "a" (p));
> }
>
> +#ifdef CONFIG_X86_SHADOW_STACK
> +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> +{
> + asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> + _ASM_EXTABLE(1b, %l[fail])
> + :: [addr] "r" (addr), [val] "r" (val)
> + :: fail);
> + return 0;
> +fail:
> + return -EFAULT;
> +}
> +#endif /* CONFIG_X86_SHADOW_STACK */
> +
> #define nop() asm volatile ("nop")
>
> static inline void serialize(void)
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index db4e53f9fdaf..8904aef487bf 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -25,6 +25,8 @@
> #include <asm/fpu/api.h>
> #include <asm/prctl.h>
>
> +#define SS_FRAME_SIZE 8
> +
> static bool feature_enabled(unsigned long features)
> {
> return current->thread.features & features;
> @@ -40,6 +42,31 @@ static void feature_clr(unsigned long features)
> current->thread.features &= ~features;
> }
>
> +/*
> + * Create a restore token on the shadow stack. A token is always 8-byte
> + * and aligned to 8.
> + */
> +static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
> +{
> + unsigned long addr;
> +
> + /* Token must be aligned */
> + if (!IS_ALIGNED(ssp, 8))
> + return -EINVAL;
> +
> + addr = ssp - SS_FRAME_SIZE;
> +
> + /* Mark the token 64-bit */
> + ssp |= BIT(0);
Wow, that confused me for a moment. :) SDE says:
- Bit 63:2 – Value of shadow stack pointer when this restore point was created.
- Bit 1 – Reserved. Must be zero.
- Bit 0 – Mode bit. If 0, the token is a compatibility/legacy mode
“shadow stack restore” token. If 1, then this shadow stack restore
token can be used with a RSTORSSP instruction in 64-bit mode.
So shouldn't this actually be:
ssp &= ~BIT(1); /* Reserved */
ssp |= BIT(0); /* RSTORSSP instruction in 64-bit mode */
> +
> + if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
> + return -EFAULT;
> +
> + *token_addr = addr;
> +
> + return 0;
> +}
> +
> static unsigned long alloc_shstk(unsigned long size)
> {
> int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> @@ -158,6 +185,87 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> return 0;
> }
>
> +static unsigned long get_user_shstk_addr(void)
> +{
> + unsigned long long ssp;
> +
> + fpu_lock_and_load();
> +
> + rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> + fpregs_unlock();
> +
> + return ssp;
> +}
> +
> +static int put_shstk_data(u64 __user *addr, u64 data)
> +{
> + WARN_ON(data & BIT(63));
Let's make this a bit more defensive:
if (WARN_ON_ONCE(data & BIT(63)))
return -EFAULT;
> +
> + /*
> + * Mark the high bit so that the sigframe can't be processed as a
> + * return address.
> + */
> + if (write_user_shstk_64(addr, data | BIT(63)))
> + return -EFAULT;
> + return 0;
> +}
> +
> +static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
> +{
> + unsigned long ldata;
> +
> + if (unlikely(get_user(ldata, addr)))
> + return -EFAULT;
> +
> + if (!(ldata & BIT(63)))
> + return -EINVAL;
> +
> + *data = ldata & ~BIT(63);
> +
> + return 0;
> +}
> +
> +/*
> + * Verify the user shadow stack has a valid token on it, and then set
> + * *new_ssp according to the token.
> + */
> +static int shstk_check_rstor_token(unsigned long *new_ssp)
> +{
> + unsigned long token_addr;
> + unsigned long token;
> +
> + token_addr = get_user_shstk_addr();
> + if (!token_addr)
> + return -EINVAL;
> +
> + if (get_user(token, (unsigned long __user *)token_addr))
> + return -EFAULT;
> +
> + /* Is mode flag correct? */
> + if (!(token & BIT(0)))
> + return -EINVAL;
> +
> + /* Is busy flag set? */
"Busy"? Not "Reserved"?
> + if (token & BIT(1))
> + return -EINVAL;
> +
> + /* Mask out flags */
> + token &= ~3UL;
> +
> + /* Restore address aligned? */
> + if (!IS_ALIGNED(token, 8))
> + return -EINVAL;
> +
> + /* Token placed properly? */
> + if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >= TASK_SIZE_MAX)
> + return -EINVAL;
> +
> + *new_ssp = token;
> +
> + return 0;
> +}
> +
> void shstk_free(struct task_struct *tsk)
> {
> struct thread_shstk *shstk = &tsk->thread.shstk;
> --
> 2.17.1
>
--
Kees Cook
On Thu, Sep 29, 2022 at 03:29:24PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> When a signal is handled normally the context is pushed to the stack
> before handling it. For shadow stacks, since the shadow stack only tracks
> return addresses, there isn't any state that needs to be pushed. However,
> there are still a few things that need to be done. These things are
> userspace visible and will be kernel ABI for shadow stacks.
>
> One is to make sure the restorer address is written to the shadow stack, since
> the signal handler (if not changing ucontext) returns to the restorer, and
> the restorer calls sigreturn. So add the restorer address to the shadow stack
> before handling the signal, so there is not a conflict when the signal
> handler returns to the restorer.
>
> The other thing to do is to place some type of checkable token on the
> thread's shadow stack before handling the signal and check it during
> sigreturn. This is an extra layer of protection to hamper attackers
> calling sigreturn manually as in SROP-like attacks.
>
> For this token we can use the shadow stack data format defined earlier.
> Have the data pushed be the previous SSP. In the future the sigreturn
> might want to return back to a different stack. Storing the SSP (instead
> of a restore offset or something) allows for future functionality that
> may want to restore to a different stack.
>
> So, when handling a signal, push:
> - the SSP, in the shadow stack data format
> - the restorer address below the restore token.
>
> In sigreturn, verify SSP is stored in the data format and pop the shadow
> stack.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Cyrill Gorcunov <[email protected]>
> Cc: Florian Weimer <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> ---
>
> v2:
> - Switch to new shstk signal format
>
> v1:
> - Use xsave helpers.
> - Expand commit log.
>
> Yu-cheng v27:
> - Eliminate saving shadow stack pointer to signal context.
>
> Yu-cheng v25:
> - Update commit log/comments for the sc_ext struct.
> - Use restorer address already calculated.
> - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
> - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
> - Eliminate writing to MSR_IA32_U_CET for shadow stack.
> - Change wrmsrl() to wrmsrl_safe() and handle error.
>
> arch/x86/ia32/ia32_signal.c | 1 +
> arch/x86/include/asm/cet.h | 5 ++
> arch/x86/kernel/shstk.c | 126 ++++++++++++++++++++++++++++++------
> arch/x86/kernel/signal.c | 10 +++
> 4 files changed, 123 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index c9c3859322fa..88d71b9de616 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -34,6 +34,7 @@
> #include <asm/sigframe.h>
> #include <asm/sighandling.h>
> #include <asm/smap.h>
> +#include <asm/cet.h>
>
> static inline void reload_segments(struct sigcontext_32 *sc)
> {
> diff --git a/arch/x86/include/asm/cet.h b/arch/x86/include/asm/cet.h
> index 924de99e0c61..8c6fab9f402a 100644
> --- a/arch/x86/include/asm/cet.h
> +++ b/arch/x86/include/asm/cet.h
> @@ -6,6 +6,7 @@
> #include <linux/types.h>
>
> struct task_struct;
> +struct ksignal;
>
> struct thread_shstk {
> u64 base;
> @@ -22,6 +23,8 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
> void shstk_free(struct task_struct *p);
> int shstk_disable(void);
> void reset_thread_shstk(void);
> +int setup_signal_shadow_stack(struct ksignal *ksig);
> +int restore_signal_shadow_stack(void);
> #else
> static inline long cet_prctl(struct task_struct *task, int option,
> unsigned long features) { return -EINVAL; }
> @@ -33,6 +36,8 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
> static inline void shstk_free(struct task_struct *p) {}
> static inline int shstk_disable(void) { return -EOPNOTSUPP; }
> static inline void reset_thread_shstk(void) {}
> +static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
> +static inline int restore_signal_shadow_stack(void) { return 0; }
> #endif /* CONFIG_X86_SHADOW_STACK */
>
> #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 8904aef487bf..04442134aadd 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -227,41 +227,129 @@ static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
> }
>
> /*
> - * Verify the user shadow stack has a valid token on it, and then set
> - * *new_ssp according to the token.
> + * Create a restore token on shadow stack, and then push the user-mode
> + * function return address.
> */
> -static int shstk_check_rstor_token(unsigned long *new_ssp)
> +static int shstk_setup_rstor_token(unsigned long ret_addr, unsigned long *new_ssp)
Oh, hrm. Prior patch defines shstk_check_rstor_token() and
doesn't call it. This patch removes it. :P Can you please remove
shstk_check_rstor_token() from the prior patch?
> {
> - unsigned long token_addr;
> - unsigned long token;
> + unsigned long ssp, token_addr;
> + int err;
> +
> + if (!ret_addr)
> + return -EINVAL;
> +
> + ssp = get_user_shstk_addr();
> + if (!ssp)
> + return -EINVAL;
> +
> + err = create_rstor_token(ssp, &token_addr);
> + if (err)
> + return err;
> +
> + ssp = token_addr - sizeof(u64);
> + err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
> +
> + if (!err)
> + *new_ssp = ssp;
> +
> + return err;
> +}
> +
> +static int shstk_push_sigframe(unsigned long *ssp)
> +{
> + unsigned long target_ssp = *ssp;
> +
> + /* Token must be aligned */
> + if (!IS_ALIGNED(*ssp, 8))
> + return -EINVAL;
>
> - token_addr = get_user_shstk_addr();
> - if (!token_addr)
> + if (!IS_ALIGNED(target_ssp, 8))
> return -EINVAL;
>
> - if (get_user(token, (unsigned long __user *)token_addr))
> + *ssp -= SS_FRAME_SIZE;
> + if (put_shstk_data((void *__user)*ssp, target_ssp))
> return -EFAULT;
>
> - /* Is mode flag correct? */
> - if (!(token & BIT(0)))
> + return 0;
> +}
> +
> +
> +static int shstk_pop_sigframe(unsigned long *ssp)
> +{
> + unsigned long token_addr;
> + int err;
> +
> + err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
> + if (unlikely(err))
> + return err;
> +
> + /* Restore SSP aligned? */
> + if (unlikely(!IS_ALIGNED(token_addr, 8)))
> return -EINVAL;
Why doesn't this always fail, given BIT(0) being set? I don't see it
getting cleared until the end of this function.
>
> - /* Is busy flag set? */
> - if (token & BIT(1))
> + /* SSP in userspace? */
> + if (unlikely(token_addr >= TASK_SIZE_MAX))
> return -EINVAL;
BIT(63) already got cleared by here (in get_shstk_data()), but yes,
this is still a reasonable check.
>
> - /* Mask out flags */
> - token &= ~3UL;
> + *ssp = token_addr;
> +
> + return 0;
> +}
--
Kees Cook
On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
> Did you have a look at ptep_set_access_flags() and friends and
> checked they
> do not need to be changed too?
ptep_set_access_flags() doesn't actually set any additional dirty bits
on x86, so I think it's ok.
> Perhaps you should at least add some
> assertion just to ensure nothing breaks.
You mean in ptep_set_access_flags()? I think some assertions would be
really great, I'm just not sure where.
On Thu, Sep 29, 2022 at 03:29:27PM -0700, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Applications and loaders can have logic to decide whether to enable CET.
> They usually don't report whether CET has been enabled or not, so there
> is no way to verify whether an application actually is protected by CET
> features.
>
> Add two lines in /proc/$PID/arch_status to report enabled and locked
> features.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> [Switched to CET, added to commit log]
> Signed-off-by: Rick Edgecombe <[email protected]>
>
> ---
>
> v2:
> - New patch
>
> arch/x86/kernel/Makefile | 2 ++
> arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
> arch/x86/kernel/proc.c | 63 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 65 insertions(+), 47 deletions(-)
> create mode 100644 arch/x86/kernel/proc.c
This is two patches: one to create proc.c, the other to add CET support.
I found where the "arch_status" conversation was:
https://lore.kernel.org/all/CALCETrUjF9PBmkzH1J86vw4ZW785DP7FtcT+gcSrx29=BUnjoQ@mail.gmail.com/
Andy, what did you mean "make sure that everything in it is namespaced"?
Everything already has a field name. And arch_status doesn't exactly
solve having compat fields -- it still needs to be handled manually?
Anyway... we have arch_status, so I guess it's fine.
> [...]
> +int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
> + struct pid *pid, struct task_struct *task)
> +{
> + /*
> + * Report AVX512 state if the processor and build option supported.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_AVX512F))
> + avx512_status(m, task);
> +
> + seq_puts(m, "Thread_features:\t");
> + dump_features(m, task->thread.features);
> + seq_putc(m, '\n');
> +
> + seq_puts(m, "Thread_features_locked:\t");
> + dump_features(m, task->thread.features_locked);
> + seq_putc(m, '\n');
Why are these always present instead of ifdefed?
-Kees
--
Kees Cook
On Thu, Sep 29, 2022 at 03:29:28PM -0700, Rick Edgecombe wrote:
> The kernel now has the main CET functionality to support applications.
> Wire in the WRSS and shadow stack enable/disable functions into the
> existing CET API skeleton.
>
> Signed-off-by: Rick Edgecombe <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Mon, Oct 3, 2022, at 3:37 PM, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:27PM -0700, Rick Edgecombe wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> Applications and loaders can have logic to decide whether to enable CET.
>> They usually don't report whether CET has been enabled or not, so there
>> is no way to verify whether an application actually is protected by CET
>> features.
>>
>> Add two lines in /proc/$PID/arch_status to report enabled and locked
>> features.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> [Switched to CET, added to commit log]
>> Signed-off-by: Rick Edgecombe <[email protected]>
>>
>> ---
>>
>> v2:
>> - New patch
>>
>> arch/x86/kernel/Makefile | 2 ++
>> arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
>> arch/x86/kernel/proc.c | 63 ++++++++++++++++++++++++++++++++++++
>> 3 files changed, 65 insertions(+), 47 deletions(-)
>> create mode 100644 arch/x86/kernel/proc.c
>
> This is two patches: one to create proc.c, the other to add CET support.
>
> I found where the "arch_status" conversation was:
> https://lore.kernel.org/all/CALCETrUjF9PBmkzH1J86vw4ZW785DP7FtcT+gcSrx29=BUnjoQ@mail.gmail.com/
>
> Andy, what did you mean "make sure that everything in it is namespaced"?
> Everything already has a field name. And arch_status doesn't exactly
> solve having compat fields -- it still needs to be handled manually?
> Anyway... we have arch_status, so I guess it's fine.
I think I meant that, since it's "arch_status" not "x86_status", the fields should have names like "x86.Thread_features". Otherwise if another architecture adds a Thread_features field, then anything running under something like qemu userspace emulation could be confused.
Assuming that's what I meant, I think my comment still stands :)
On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <[email protected]> wrote:
> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>> Did you have a look at ptep_set_access_flags() and friends and
>> checked they
>> do not need to be changed too?
>
> ptep_set_access_flags() doesn't actually set any additional dirty bits
> on x86, so I think it's ok.
Are you sure about that? (lost my confidence today so I am hesitant).
Looking on insert_pfn(), I see:
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...
This appears to set the dirty bit while potentially leaving the write-bit
clear. This is the scenario you want to avoid, no?
>> Perhaps you should at least add some
>> assertion just to ensure nothing breaks.
>
> You mean in ptep_set_access_flags()? I think some assertions would be
> really great, I'm just not sure where.
Yes, on x86’s version of the function.
On Oct 3, 2022, at 4:17 PM, Nadav Amit <[email protected]> wrote:
> On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <[email protected]> wrote:
>
>> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>>> Did you have a look at ptep_set_access_flags() and friends and
>>> checked they
>>> do not need to be changed too?
>>
>> ptep_set_access_flags() doesn't actually set any additional dirty bits
>> on x86, so I think it's ok.
>
> Are you sure about that? (lost my confidence today so I am hesitant).
>
> Looking on insert_pfn(), I see:
>
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...
>
> This appears to set the dirty bit while potentially leaving the write-bit
> clear. This is the scenario you want to avoid, no?
No. I am not paying attention. Ignore.
On 9/29/22 15:29, Rick Edgecombe wrote:
> To handle stack overflows, applications can register a separate signal alt
> stack to use for the stack to handle signals. To handle shadow stack
> overflows the kernel can similarly provide the ability to have an alt
> shadow stack.
The overall SHSTK mechanism has a concept of a shadow stack that is
valid and not in use and a shadow stack that is in use. This is used,
for example, by RSTORSSP. I would like to imagine that this serves a
real purpose (presumably preventing two different threads from using the
same shadow stack and thus corrupting each other's state).
So maybe altshstk should use exactly the same mechanism. Either signal
delivery should do the atomic verify-and-mark-busy routine or registering
the stack as an altstack should do it.
I think your patch has this maybe 1/3 implemented, but I don't see any
atomics, and you seem to have removed (?) the code that actually
modifies the token on the stack.
>
> +static bool on_alt_shstk(unsigned long ssp)
> +{
> + unsigned long alt_ss_start = current->thread.sas_shstk_sp;
> + unsigned long alt_ss_end = alt_ss_start + current->thread.sas_shstk_size;
> +
> + return ssp >= alt_ss_start && ssp < alt_ss_end;
> +}
We're forcing AUTODISARM behavior (right?), so I don't think this is
needed at all. User code is never "on the alt stack". It's either "on
the alt stack but the alt stack is disarmed, so it's not on the alt
stack" or it's just straight up not on the alt stack.
On Oct 3, 2022, at 4:20 PM, Nadav Amit <[email protected]> wrote:
> On Oct 3, 2022, at 4:17 PM, Nadav Amit <[email protected]> wrote:
>
>> On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <[email protected]> wrote:
>>
>>> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>>>> Did you have a look at ptep_set_access_flags() and friends and
>>>> checked they
>>>> do not need to be changed too?
>>>
>>> ptep_set_access_flags() doesn't actually set any additional dirty bits
>>> on x86, so I think it's ok.
>>
>> Are you sure about that? (lost my confidence today so I am hesitant).
>>
>> Looking on insert_pfn(), I see:
>>
>> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...
>>
>> This appears to set the dirty bit while potentially leaving the write-bit
>> clear. This is the scenario you want to avoid, no?
>
> No. I am not paying attention. Ignore.
Sorry for the spam. Just this “dirty” argument is confusing. This indeed
seems like a flow that can set the dirty bit. I think.
On Mon, 2022-10-03 at 16:25 -0700, Nadav Amit wrote:
> On Oct 3, 2022, at 4:20 PM, Nadav Amit <[email protected]> wrote:
>
> > On Oct 3, 2022, at 4:17 PM, Nadav Amit <[email protected]>
> > wrote:
> >
> > > On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <
> > > [email protected]> wrote:
> > >
> > > > On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
> > > > > Did you have a look at ptep_set_access_flags() and friends
> > > > > and
> > > > > checked they
> > > > > do not need to be changed too?
> > > >
> > > > ptep_set_access_flags() doesn't actually set any additional
> > > > dirty bits
> > > > on x86, so I think it's ok.
> > >
> > > Are you sure about that? (lost my confidence today so I am
> > > hesitant).
> > >
> > > Looking on insert_pfn(), I see:
> > >
> > > entry = maybe_mkwrite(pte_mkdirty(entry),
> > > vma);
> > > if (ptep_set_access_flags(vma, addr, pte,
> > > entry, 1)) ...
> > >
> > > This appears to set the dirty bit while potentially leaving the
> > > write-bit
> > > clear. This is the scenario you want to avoid, no?
> >
> > No. I am not paying attention. Ignore.
>
> Sorry for the spam. Just this “dirty” argument is confusing. This
> indeed
> seems like a flow that can set the dirty bit. I think.
I think the HW dirty bit will not be set here. How it works is,
pte_mkdirty() will not actually set the HW dirty bit, but instead the
software COW bit. Here is the relevant snippet:
static inline pte_t pte_mkdirty(pte_t pte)
{
	pteval_t dirty = _PAGE_DIRTY;

	/* Avoid creating Dirty=1,Write=0 PTEs */
	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
		dirty = _PAGE_COW;

	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
}
So for a !VM_WRITE vma, you end up with a Write=0,Cow=1 PTE passed
into ptep_set_access_flags(). Does that make sense?
On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> With the introduction of shadow stack memory there are two ways a pte can
> be writable: regular writable memory and shadow stack memory.
>
> In past patches, maybe_mkwrite() has been updated to apply pte_mkwrite()
> or pte_mkwrite_shstk() depending on the VMA flag. This covers most cases
> where a PTE is made writable. However, there are places where pte_mkwrite()
> is called directly and the logic should now also create a shadow stack PTE
> in the case of a shadow stack VMA.
>
> - do_anonymous_page() and migrate_vma_insert_page() check VM_WRITE
> directly and call pte_mkwrite(), which is the same as maybe_mkwrite()
> in logic and intention. Just change them to maybe_mkwrite().
Looks like you folded the change for do_anonymous_page() into the wrong patch.
I see the relevant change in the previous patch.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Oct 3, 2022, at 4:38 PM, Edgecombe, Rick P <[email protected]> wrote:
> I think the HW dirty bit will not be set here. How it works is,
> pte_mkdirty() will not actually set the HW dirty bit, but instead the
> software COW bit. Here is the relevant snippet:
>
> static inline pte_t pte_mkdirty(pte_t pte)
> {
> 	pteval_t dirty = _PAGE_DIRTY;
>
> 	/* Avoid creating Dirty=1,Write=0 PTEs */
> 	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
> 		dirty = _PAGE_COW;
>
> 	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> }
>
> So for a !VM_WRITE vma, you end up with Write=0,Cow=1 PTE passed
> into ptep_set_access_flags(). Does it make sense?
Thanks for your patience with me. I should have read the series in order.
On Mon, 2022-10-03 at 20:47 +0300, Kirill A . Shutemov wrote:
> > @@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
> >
> > const char *arch_vma_name(struct vm_area_struct *vma)
> > {
> > + if (vma->vm_flags & VM_SHADOW_STACK)
> > + return "[shadow stack]";
> > return NULL;
> > }
> >
>
> But why here?
>
> CONFIG_ARCH_HAS_SHADOW_STACK implies that there will be more than one
> arch
> that supports shadow stack. The name has to come from generic code
> too, no?
I'm not aware of any other arch that will, so I wonder if I should just
remove ARCH_HAS_SHADOW_STACK actually.
Hopefully I will not waste your time again… If it has been discussed in the
last 26 iterations, just tell me and ignore.
On Sep 29, 2022, at 3:29 PM, Rick Edgecombe <[email protected]> wrote:
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -606,8 +606,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> goto abort;
> }
> entry = mk_pte(page, vma->vm_page_prot);
> - if (vma->vm_flags & VM_WRITE)
> - entry = pte_mkwrite(pte_mkdirty(entry));
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> }
This is not exactly the same logic. You might dirty read-only pages since
you call pte_mkdirty() unconditionally. It has been known not to be very
robust (e.g., dirty-COW and friends). Perhaps it is not dangerous following
some recent enhancements, but why do you want to take the risk?
Instead, although it might seem redundant, the compiler will hopefully
make it efficient:
	if (vma->vm_flags & VM_WRITE) {
		entry = pte_mkdirty(entry);
		entry = maybe_mkwrite(entry, vma);
	}
On Mon, Oct 03, 2022 at 01:04:37PM -0700, Dave Hansen wrote:
> On 10/3/22 12:43, Kees Cook wrote:
> >> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> >> +{
> >> + u64 val, new_val;
> >> +
> >> + rdmsrl(msr, val);
> >> + new_val = (val & ~clear) | set;
> >> +
> >> + if (new_val != val)
> >> + wrmsrl(msr, new_val);
> >> +}
> > I always get uncomfortable when I see these kinds of generalized helper
> > functions for touching cpu bits, etc. It just begs for future attacker
> > abuse to muck with arbitrary bits -- even marked inline there is a risk
> > the compiler will ignore that in some circumstances (not as currently
> > used in the code, but I'm imagining future changes leading to such a
> > condition). Will you humor me and change this to a macro instead? That'll
> > force it always inline (even __always_inline isn't always inline):
>
> Oh, are you thinking that this is dangerous because it's so surgical and
> non-intrusive? It's even more powerful to an attacker than, say
> wrmsrl(), because there they actually have to know what the existing
> value is to update it. With this helper, it's quite easy to flip an
> individual bit without disturbing the neighboring bits.
>
> Is that it?
Yeah, it was kind of the combo: both a potential entry point to wrmsrl()
for arbitrary values and one where all the work is done to mask
stuff out.
> I don't _like_ the #defines, but doing one here doesn't seem too onerous
> considering how critical MSRs are.
I bet there are others, but this just weirded me out. I'll live with a
macro, especially since the chance of it mutating into a non-inline is
very small, but I figured I'd mention the idea.
--
Kees Cook
On Mon, Oct 03, 2022 at 03:45:50PM -0700, Andy Lutomirski wrote:
>
>
> On Mon, Oct 3, 2022, at 3:37 PM, Kees Cook wrote:
> > On Thu, Sep 29, 2022 at 03:29:27PM -0700, Rick Edgecombe wrote:
> >> From: "Kirill A. Shutemov" <[email protected]>
> >>
> >> Applications and loaders can have logic to decide whether to enable CET.
> >> They usually don't report whether CET has been enabled or not, so there
> >> is no way to verify whether an application actually is protected by CET
> >> features.
> >>
> >> Add two lines in /proc/$PID/arch_status to report enabled and locked
> >> features.
> >>
> >> Signed-off-by: Kirill A. Shutemov <[email protected]>
> >> [Switched to CET, added to commit log]
> >> Signed-off-by: Rick Edgecombe <[email protected]>
> >>
> >> ---
> >>
> >> v2:
> >> - New patch
> >>
> >> arch/x86/kernel/Makefile | 2 ++
> >> arch/x86/kernel/fpu/xstate.c | 47 ---------------------------
> >> arch/x86/kernel/proc.c | 63 ++++++++++++++++++++++++++++++++++++
> >> 3 files changed, 65 insertions(+), 47 deletions(-)
> >> create mode 100644 arch/x86/kernel/proc.c
> >
> > This is two patches: one to create proc.c, the other to add CET support.
> >
> > I found where the "arch_status" conversation was:
> > https://lore.kernel.org/all/CALCETrUjF9PBmkzH1J86vw4ZW785DP7FtcT+gcSrx29=BUnjoQ@mail.gmail.com/
> >
> > Andy, what did you mean "make sure that everything in it is namespaced"?
> > Everything already has a field name. And arch_status doesn't exactly
> > solve having compat fields -- it still needs to be handled manually?
> > Anyway... we have arch_status, so I guess it's fine.
>
> I think I meant that, since it's "arch_status" not "x86_status", the fields should have names like "x86.Thread_features". Otherwise if another architecture adds a Thread_features field, then anything running under something like qemu userspace emulation could be confused.
>
> Assuming that's what I meant, I think my comment still stands :)
Ah, but that would be needed for compat things too in "arch_status", and
could just as well live in "status".
How about moving both of these into "status", with appropriate names?
x86_64.Thread_features: ...
i386.LDT_or_something: ...
?
Does anything consume arch_status yet? Looks like probably not:
https://codesearch.debian.net/search?q=%5Cbarch_status%5Cb&literal=0&perpkg=1
--
Kees Cook
On Mon, Oct 03, 2022 at 06:33:52PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2022-10-03 at 10:04 -0700, Kees Cook wrote:
> > > Shadow stack signal format
> > > --------------------------
> > > So to handle alt shadow stacks we need to push some data onto a
> > > stack. To
> > > prevent SROP we need to push something to the shadow stack that the
> > > kernel can
> > > [...]
> > > shadow stack return address or a shadow stack tokens. To make sure
> > > it can’t be
> > > used, data is pushed with the high bit (bit 63) set. This bit is a
> > > linear
> > > address bit in both the token format and a normal return address,
> > > so it should
> > > not conflict with anything. It puts any return address in the
> > > kernel half of
> > > the address space, so would never be created naturally by a
> > > userspace program.
> > > It will not be a valid restore token either, as the kernel address
> > > will never
> > > be pointing to the previous frame in the shadow stack.
> > >
> > > When a signal hits, the format pushed to the stack that is handling
> > > the signal
> > > is four 8 byte values (since we are 64 bit only):
> > > > 1...old SSP|1...alt stack size|1...alt stack base|0|
> >
> > Do these end up being non-canonical addresses? (To avoid confusion
> > with
> > "real" kernel addresses?)
>
> Usually, but not necessarily with LAM. LAM cannot mask bit 63 though.
> So hypothetically they could become "real" kernel addresses some day.
> To keep them in the user half but still make sure they are not usable,
> you would either have to encode the bits over a lot of entries which
> would use extra space, or shrink the available address space, which
> could cause compatibility problems.
>
> Do you think it's an issue?
Nah; I think it's a good solution. I was just trying to make sure I
understood it correctly. Thanks!
--
Kees Cook
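
A minimal sketch of the bit-63 marking discussed above, reusing the series'
write_user_shstk_64() helper; the function names and exact error codes here
are illustrative, not the actual patch:

static int push_shstk_sigdata(u64 __user *addr, u64 data)
{
	/* The stored value itself must leave bit 63 free. */
	if (WARN_ON_ONCE(data & BIT(63)))
		return -EINVAL;

	/*
	 * Setting bit 63 puts the value in the kernel half of the address
	 * space, so it can never be consumed as a return address or as a
	 * restore token.
	 */
	return write_user_shstk_64(addr, data | BIT(63));
}

static int pop_shstk_sigdata(u64 *data, u64 __user *addr)
{
	u64 ldata;

	if (get_user(ldata, addr))
		return -EFAULT;

	/* Reject anything the kernel did not mark during signal setup. */
	if (!(ldata & BIT(63)))
		return -EINVAL;

	*data = ldata & ~BIT(63);
	return 0;
}

This is essentially what put_shstk_data()/get_shstk_data() in the series do
for the sigframe format quoted above.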
On Mon, Oct 03, 2022 at 07:25:03PM +0200, Jann Horn wrote:
> On Mon, Oct 3, 2022 at 7:04 PM Kees Cook <[email protected]> wrote:
> > On Thu, Sep 29, 2022 at 03:28:57PM -0700, Rick Edgecombe wrote:
> > > This is an overdue followup to the “Shadow stacks for userspace” CET series.
> > > Thanks for all the comments on the first version [0]. They drove a decent
> > > amount of changes for v2. Since it has been awhile, I’ll try to summarize the
> > > areas that got major changes since last time. Smaller changes are listed in
> > > each patch.
> >
> > Thanks for the write-up!
> >
> > > [...]
> > > GUP
> > > ---
> > > Shadow stack memory is generally treated as writable by the kernel, but
> > > it behaves differently than other writable memory with respect to GUP.
> > > FOLL_WRITE will not GUP shadow stack memory unless FOLL_FORCE is also
> > > set. Shadow stack memory is writable from the perspective of being
> > > changeable by userspace, but it is also protected memory from
> > > userspace’s perspective. So preventing it from being writable via
> > > FOLL_WRITE help’s make it harder for userspace to arbitrarily write to
> > > it. However, like read-only memory, FOLL_FORCE can still write through
> > > it. This means shadow stacks can be written to via things like
> > > “/proc/self/mem”. Apps that want extra security will have to prevent
> > > access to kernel features that can write with FOLL_FORCE.
> >
> > This seems like a problem to me -- the point of SS is that there cannot be
> > a way to write to them without specific instruction sequences. The fact
> > that /proc/self/mem bypasses memory protections was an old design mistake
> > that keeps leading to surprising behaviors. It would be much nicer to
> > draw the line somewhere and just say that FOLL_FORCE doesn't work on
> > VM_SHADOW_STACK. Why must FOLL_FORCE be allowed to write to SS?
>
> But once you have FOLL_FORCE, you can also just write over stuff like
> executable code instead of writing over the stack. I don't think
> allowing FOLL_FORCE writes over shadow stacks from /proc/$pid/mem is
> making things worse in any way, and it's probably helpful for stuff
> like debuggers.
>
> If you don't want /proc/$pid/mem to be able to do stuff like that,
> then IMO the way to go is to change when /proc/$pid/mem uses
> FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.
Yeah, all reasonable. I just wish we could ditch FOLL_FORCE; it continues
to weird me out how powerful that fd's side-effects are.
--
Kees Cook
From: Kees Cook <[email protected]>
...
> >
> > If you don't want /proc/$pid/mem to be able to do stuff like that,
> > then IMO the way to go is to change when /proc/$pid/mem uses
> > FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.
>
> Yeah, all reasonable. I just wish we could ditch FOLL_FORCE; it continues
> to weird me out how powerful that fd's side-effects are.
Could you remove FOLL_FORCE from /proc/$pid/mem and add a
/proc/$pid/mem_force that enables FOLL_FORCE but requires root
(or similar) access.
Although I suspect gdb may like to have write access to
code?
David
From: Dave Hansen
> Sent: 03 October 2022 21:05
>
> On 10/3/22 12:43, Kees Cook wrote:
> >> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> >> +{
> >> + u64 val, new_val;
> >> +
> >> + rdmsrl(msr, val);
> >> + new_val = (val & ~clear) | set;
> >> +
> >> + if (new_val != val)
> >> + wrmsrl(msr, new_val);
> >> +}
> > I always get uncomfortable when I see these kinds of generalized helper
> > functions for touching cpu bits, etc. It just begs for future attacker
> > abuse to muck with arbitrary bits -- even marked inline there is a risk
> > the compiler will ignore that in some circumstances (not as currently
> > used in the code, but I'm imagining future changes leading to such a
> > condition). Will you humor me and change this to a macro instead? That'll
> > force it always inline (even __always_inline isn't always inline):
>
> Oh, are you thinking that this is dangerous because it's so surgical and
> non-intrusive? It's even more powerful to an attacker than, say
> wrmsrl(), because there they actually have to know what the existing
> value is to update it. With this helper, it's quite easy to flip an
> individual bit without disturbing the neighboring bits.
>
> Is that it?
>
> I don't _like_ the #defines, but doing one here doesn't seem too onerous
> considering how critical MSRs are.
How often is the 'msr' number not a compile-time constant?
Adding rd/wrmsr variants that verify this would reduce the
attack surface as well.
David
On Mon, 2022-10-03 at 21:04 -0700, Kees Cook wrote:
> > I don't _like_ the #defines, but doing one here doesn't seem too
> > onerous
> > considering how critical MSRs are.
>
> I bet there are others, but this just weirded me out. I'll live with
> a
> macro, especially since the chance of it mutating into a non-inline is
> very small, but I figured I'd mention the idea.
Makes sense. I'll change it to a define.
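
A minimal sketch of the #define form being agreed to here (illustrative only;
the actual define may differ):

/*
 * As a macro, the MSR read-modify-write is always expanded at the call
 * site and can never be turned into an out-of-line helper that takes
 * arbitrary set/clear masks.
 */
#define set_clr_bits_msrl(msr, set, clear) do {			\
	u64 __scb_val, __scb_new_val;				\
								\
	rdmsrl(msr, __scb_val);					\
	__scb_new_val = (__scb_val & ~(clear)) | (set);		\
								\
	if (__scb_new_val != __scb_val)				\
		wrmsrl(msr, __scb_new_val);			\
} while (0)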
On Mon, 2022-10-03 at 18:56 -0700, Nadav Amit wrote:
> Hopefully I will not waste your time again… If it has been discussed
> in the
> last 26 iterations, just tell me and ignore.
>
> On Sep 29, 2022, at 3:29 PM, Rick Edgecombe <
> [email protected]> wrote:
>
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -606,8 +606,7 @@ static void migrate_vma_insert_page(struct
> > migrate_vma *migrate,
> > goto abort;
> > }
> > entry = mk_pte(page, vma->vm_page_prot);
> > - if (vma->vm_flags & VM_WRITE)
> > - entry = pte_mkwrite(pte_mkdirty(entry));
> > + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > }
>
> This is not exactly the same logic. You might dirty read-only pages
> since
> you call pte_mkdirty() unconditionally. It has been known not to be
> very
> robust (e.g., dirty-COW and friends). Perhaps it is not dangerous
> following
> some recent enhancements, but why do you want to take the risk?
Yeah, those changes let me drop a patch. But it's a good point.
>
> Instead, although it might seem redundant, the compiler will hopefully
> make it efficient:
>
> if (vma->vm_flags & VM_WRITE) {
> entry = pte_mkdirty(entry);
> entry = maybe_mkwrite(entry, vma);
> }
>
Thanks Nadav. I think you're right, it should have the open-coded logic
here and in the do_anonymous_page() chunk that got moved to the
previous patch by accident.
On Mon, 2022-10-03 at 16:21 -0700, Andy Lutomirski wrote:
> On 9/29/22 15:29, Rick Edgecombe wrote:
> > To handle stack overflows, applications can register a separate
> > signal alt
> > stack to use for the stack to handle signals. To handle shadow
> > stack
> > overflows the kernel can similarly provide the ability to have an
> > alt
> > shadow stack.
>
>
> The overall SHSTK mechanism has a concept of a shadow stack that is
> valid and not in use and a shadow stack that is in use. This is
> used,
> for example, by RSTORSSP. I would like to imagine that this serves
> a
> real purpose (presumably preventing two different threads from using
> the
> same shadow stack and thus corrupting each others' state).
>
> So maybe altshstk should use exactly the same mechanism. Either
> signal
> delivery should do the atomic verify-and-mark-busy routine or
> registering
> the stack as an altstack should do it.
>
> I think your patch has this maybe 1/3 implemented
I'm not following how it breaks down into 3 parts, so hopefully I'm not
missing something. We could do a software busy bit for the token at the
end of alt shstk though. It seems like a good idea.
The busy-like bit in the RSTORSSP-type token is not called out as a
busy bit, but instead defined as reserved (must be 0) in some states.
(Note, it is different than the supervisor shadow stack format). Yea,
we could just probably use it like RSTORSSP does for this operation.
Or just invent another new token format and stay away from bits marked
reserved. Then it wouldn't have to be atomic either, since userspace
couldn't use it.
> , but I don't see any
> atomics, and you seem to have removed (?) the code that actually
> modifies the token on the stack.
The past series didn't do any busy-bit-like operation. The token just
marked where sigreturn should be called. There was actually a similar
problem to what you described above, in that the token marking the
sigreturn point could have been usable by RSTORSSP from another thread.
In this version (even back in the non-RFC patches), using a made-up
token format that RSTORSSP knows nothing about avoids this in a
different way than a busy bit would. Two threads couldn't use a shstk
sigframe at the same time unless they somehow were already using the
same shadow stack.
>
> >
> > +static bool on_alt_shstk(unsigned long ssp)
> > +{
> > + unsigned long alt_ss_start = current->thread.sas_shstk_sp;
> > + unsigned long alt_ss_end = alt_ss_start + current->thread.sas_shstk_size;
> > +
> > + return ssp >= alt_ss_start && ssp < alt_ss_end;
> > +}
>
> We're forcing AUTODISARM behavior (right?), so I don't think this is
> needed at all. User code is never "on the alt stack". It's either
> "on
> the alt stack but the alt stack is disarmed, so it's not on the alt
> stack" or it's just straight up not on the alt stack.
Err, right. This can be dropped. Thanks.
On Tue, 2022-10-04 at 02:56 +0300, Kirill A . Shutemov wrote:
> On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > With the introduction of shadow stack memory there are two ways a
> > pte can
> > be writable: regular writable memory and shadow stack memory.
> >
> > In past patches, maybe_mkwrite() has been updated to apply
> > pte_mkwrite()
> > or pte_mkwrite_shstk() depending on the VMA flag. This covers most
> > cases
> > where a PTE is made writable. However, there are places where
> > pte_mkwrite()
> > is called directly and the logic should now also create a shadow
> > stack PTE
> > in the case of a shadow stack VMA.
> >
> > - do_anonymous_page() and migrate_vma_insert_page() check
> > VM_WRITE
> > directly and call pte_mkwrite(), which is the same as
> > maybe_mkwrite()
> > in logic and intention. Just change them to maybe_mkwrite().
>
> Looks like you folded change for do_anonymous_page() into the wrong
> patch.
> I see the relevant change in the previous patch.
Arg, yep thanks. It got moved accidentally.
On Tue, Oct 4, 2022, at 9:12 AM, Edgecombe, Rick P wrote:
> On Mon, 2022-10-03 at 16:21 -0700, Andy Lutomirski wrote:
>> On 9/29/22 15:29, Rick Edgecombe wrote:
>> > To handle stack overflows, applications can register a separate
>> > signal alt
>> > stack to use for the stack to handle signals. To handle shadow
>> > stack
>> > overflows the kernel can similarly provide the ability to have an
>> > alt
>> > shadow stack.
>>
>>
>> The overall SHSTK mechanism has a concept of a shadow stack that is
>> valid and not in use and a shadow stack that is in use. This is
>> used,
>> for example, by RSTORSSP. I would like to imagine that this serves
>> a
>> real purpose (presumably preventing two different threads from using
>> the
>> same shadow stack and thus corrupting each others' state).
>>
>> So maybe altshstk should use exactly the same mechanism. Either
>> signal
>> delivery should do the atomic verify-and-mark-busy routine or
>> registering
>> the stack as an altstack should do it.
>>
>> I think your patch has this maybe 1/3 implemented
>
> I'm not following how it breaks down into 3 parts, so hopefully I'm not
> missing something. We could do a software busy bit for the token at the
> end of alt shstk though. It seems like a good idea.
I didn't mean there were three parts. I just wild @&! guessed the amount of code written versus needed :)
>
> The busy-like bit in the RSTORSSP-type token is not called out as a
> busy bit, but instead defined as reserved (must be 0) in some states.
> (Note, it is different than the supervisor shadow stack format). Yea,
> we could just probably use it like RSTORSSP does for this operation.
>
> Or just invent another new token format and stay away from bits marked
> reserved. Then it wouldn't have to be atomic either, since userspace
> couldn't use it.
But userspace *can* use it by delivering a signal. I consider the scenario where two user threads set up the same altshstk and take signals concurrently to be about as dangerous and about as likely (under accidental or malicious conditions) as two user threads doing RSTORSSP at the same time. Someone at Intel thought the latter was a big deal, so maybe we should match its behavior.
On Tue, 2022-10-04 at 10:46 -0700, Andy Lutomirski wrote:
> > The busy-like bit in the RSTORSSP-type token is not called out as a
> > busy bit, but instead defined as reserved (must be 0) in some
> > states.
> > (Note, it is different than the supervisor shadow stack format).
> > Yea,
> > we could just probably use it like RSTORSSP does for this
> > operation.
> >
> > Or just invent another new token format and stay away from bits
> > marked
> > reserved. Then it wouldn't have to be atomic either, since
> > userspace
> > couldn't use it.
>
> But userspace *can* use it by delivering a signal. I consider the
> scenario where two user threads set up the same altshstk and take
> signals concurrently to be about as dangerous and about as likely
> (under accidental or malicious conditions) as two user threads doing
> RSTORSSP at the same time. Someone at Intel thought the latter was a
> big deal, so maybe we should match its behavior.
Right, for alt shadow stacks there should be some busy-like checking, or
that scenario could happen. For regular on-thread stack signals (earlier
in this series) we don't need a busy bit.
On Tue, Oct 04, 2022 at 10:17:57AM +0000, David Laight wrote:
> From: Dave Hansen
> > Sent: 03 October 2022 21:05
> >
> > On 10/3/22 12:43, Kees Cook wrote:
> > >> +static inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> > >> +{
> > >> + u64 val, new_val;
> > >> +
> > >> + rdmsrl(msr, val);
> > >> + new_val = (val & ~clear) | set;
> > >> +
> > >> + if (new_val != val)
> > >> + wrmsrl(msr, new_val);
> > >> +}
> > > I always get uncomfortable when I see these kinds of generalized helper
> > > functions for touching cpu bits, etc. It just begs for future attacker
> > > abuse to muck with arbitrary bits -- even marked inline there is a risk
> > > the compiler will ignore that in some circumstances (not as currently
> > > used in the code, but I'm imagining future changes leading to such a
> > > condition). Will you humor me and change this to a macro instead? That'll
> > > force it always inline (even __always_inline isn't always inline):
> >
> > Oh, are you thinking that this is dangerous because it's so surgical and
> > non-intrusive? It's even more powerful to an attacker than, say
> > wrmsrl(), because there they actually have to know what the existing
> > value is to update it. With this helper, it's quite easy to flip an
> > individual bit without disturbing the neighboring bits.
> >
> > Is that it?
> >
> > I don't _like_ the #defines, but doing one here doesn't seem too onerous
> > considering how critical MSRs are.
>
> How often is the 'msr' number not a compile-time constant?
> Adding rd/wrmsr variants that verify this would reduce the
> attack surface as well.
Oh, yes! I do this all the time with FORTIFY shenanigans. Right, so,
instead of a macro, the "cannot be un-inlined" could be enforced with
this (untested):
static __always_inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
{
	u64 val, new_val;

	BUILD_BUG_ON(!__builtin_constant_p(msr) ||
		     !__builtin_constant_p(set) ||
		     !__builtin_constant_p(clear));

	rdmsrl(msr, val);
	new_val = (val & ~clear) | set;

	if (new_val != val)
		wrmsrl(msr, new_val);
}
--
Kees Cook
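
For illustration, a caller under this constant-arguments rule could look like
the following (MSR and bit names taken from elsewhere in the series;
pick_msr() is a hypothetical helper, and whether __builtin_constant_p()
resolves after inlining is compiler-dependent, which is presumably why the
above is marked untested):

	/* Fine: every argument is a compile-time constant. */
	set_clr_bits_msrl(MSR_IA32_U_CET, CET_SHSTK_EN, 0);

	/* Would trip the BUILD_BUG_ON(): 'msr' is a runtime value. */
	u32 msr = pick_msr();	/* hypothetical helper */
	set_clr_bits_msrl(msr, CET_SHSTK_EN, 0);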
On Tue, Oct 04, 2022 at 09:57:48AM +0000, David Laight wrote:
> From: Kees Cook <[email protected]>
> ...
> > >
> > > If you don't want /proc/$pid/mem to be able to do stuff like that,
> > > then IMO the way to go is to change when /proc/$pid/mem uses
> > > FOLL_FORCE, or to limit overall write access to /proc/$pid/mem.
> >
> > Yeah, all reasonable. I just wish we could ditch FOLL_FORCE; it continues
> > to weird me out how powerful that fd's side-effects are.
>
> Could you remove FOLL_FORCE from /proc/$pid/mem and add a
> /proc/$pid/mem_force that enables FOLL_FORCE but requires root
> (or similar) access.
>
> Although I suspect gdb may like to have write access to
> code?
As Jann has reminded me out of band, while FOLL_FORCE is still worrisome,
the real issue is having /proc/$pid/mem access at all without an active
ptrace attachment (and to self).
Here's my totally untested idea to require access to /proc/$pid/mem
having an established ptrace relationship:
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index c952c5ba8fab..0393741eeabb 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -64,6 +64,7 @@ extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead);
#define PTRACE_MODE_NOAUDIT 0x04
#define PTRACE_MODE_FSCREDS 0x08
#define PTRACE_MODE_REALCREDS 0x10
+#define PTRACE_MODE_ATTACHED 0x20
/* shorthands for READ/ATTACH and FSCREDS/REALCREDS combinations */
#define PTRACE_MODE_READ_FSCREDS (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 93f7e3d971e4..fadec587d133 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -826,7 +826,7 @@ static int __mem_open(struct inode *inode, struct file *file, unsigned int mode)
static int mem_open(struct inode *inode, struct file *file)
{
- int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH);
+ int ret = __mem_open(inode, file, PTRACE_MODE_ATTACHED);
/* OK to pass negative loff_t, we can catch out-of-range */
file->f_mode |= FMODE_UNSIGNED_OFFSET;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1893d909e45c..c97e6d734ae5 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -304,6 +304,12 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
* or halting the specified task is impossible.
*/
+ /*
+ * If an existing ptrace relationship is required, not even
+ * introspection is allowed.
+ */
+ if ((mode & PTRACE_MODE_ATTACHED) && ptrace_parent(task) != current)
+ return -EPERM;
/* Don't let security modules deny introspection */
if (same_thread_group(task, current))
return 0;
--
Kees Cook
On Mon, 2022-10-03 at 13:44 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:23PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > Shadow stacks are normally written to via CALL/RET or specific CET
> > instructions like RSTORSSP/SAVEPREVSSP. However during some Linux
> > operations the kernel will need to write to directly using the
> > ring-0 only
> > WRUSS instruction.
> >
> > A shadow stack restore token marks a restore point of the shadow
> > stack, and
> > the address in a token must point directly above the token, which
> > is within
> > the same shadow stack. This is distinctively different from other
> > pointers
> > on the shadow stack, since those pointers point to executable code
> > area.
> >
> > Introduce token setup and verify routines. Also introduce WRUSS,
> > which is
> > a kernel-mode instruction but writes directly to user shadow stack.
> >
> > In future patches that enable shadow stack to work with signals,
> > the kernel
> > will need something to denote the point in the stack where
> > sigreturn may be
> > called. This will prevent attackers calling sigreturn at arbitrary
> > places
> > in the stack, in order to help prevent SROP attacks.
> >
> > To do this, something that can only be written by the kernel needs
> > to be
> > placed on the shadow stack. This can be accomplished by setting bit
> > 63 in
> > the frame written to the shadow stack. Userspace return addresses
> > can't
> > have this bit set as it is in the kernel range. It is also can't be
> > a
> > valid restore token.
> >
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > Co-developed-by: Rick Edgecombe <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > Cc: Kees Cook <[email protected]>
> >
> > ---
> >
> > v2:
> > - Add data helpers for writing to shadow stack.
> >
> > v1:
> > - Use xsave helpers.
> >
> > Yu-cheng v30:
> > - Update commit log, remove description about signals.
> > - Update various comments.
> > - Remove variable 'ssp' init and adjust return value accordingly.
> > - Check get_user_shstk_addr() return value.
> > - Replace 'ia32' with 'proc32'.
> >
> > Yu-cheng v29:
> > - Update comments for the use of get_xsave_addr().
> >
> > arch/x86/include/asm/special_insns.h | 13 ++++
> > arch/x86/kernel/shstk.c | 108
> > +++++++++++++++++++++++++++
> > 2 files changed, 121 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/special_insns.h
> > b/arch/x86/include/asm/special_insns.h
> > index 35f709f619fb..f096f52bd059 100644
> > --- a/arch/x86/include/asm/special_insns.h
> > +++ b/arch/x86/include/asm/special_insns.h
> > @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
> > : [pax] "a" (p));
> > }
> >
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> > +{
> > + asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > + _ASM_EXTABLE(1b, %l[fail])
> > + :: [addr] "r" (addr), [val] "r" (val)
> > + :: fail);
> > + return 0;
> > +fail:
> > + return -EFAULT;
> > +}
> > +#endif /* CONFIG_X86_SHADOW_STACK */
> > +
> > #define nop() asm volatile ("nop")
> >
> > static inline void serialize(void)
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index db4e53f9fdaf..8904aef487bf 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -25,6 +25,8 @@
> > #include <asm/fpu/api.h>
> > #include <asm/prctl.h>
> >
> > +#define SS_FRAME_SIZE 8
> > +
> > static bool feature_enabled(unsigned long features)
> > {
> > return current->thread.features & features;
> > @@ -40,6 +42,31 @@ static void feature_clr(unsigned long features)
> > current->thread.features &= ~features;
> > }
> >
> > +/*
> > + * Create a restore token on the shadow stack. A token is always
> > 8-byte
> > + * and aligned to 8.
> > + */
> > +static int create_rstor_token(unsigned long ssp, unsigned long
> > *token_addr)
> > +{
> > + unsigned long addr;
> > +
> > + /* Token must be aligned */
> > + if (!IS_ALIGNED(ssp, 8))
> > + return -EINVAL;
> > +
> > + addr = ssp - SS_FRAME_SIZE;
> > +
> > + /* Mark the token 64-bit */
> > + ssp |= BIT(0);
>
> Wow, that confused me for a moment. :) SDE says:
>
> - Bit 63:2 – Value of shadow stack pointer when this restore point
> was created.
> - Bit 1 – Reserved. Must be zero.
> - Bit 0 – Mode bit. If 0, the token is a compatibility/legacy mode
> “shadow stack restore” token. If 1, then this shadow stack
> restore
> token can be used with a RSTORSSP instruction in 64-bit
> mode.
>
> So shouldn't this actually be:
>
> ssp &= ~BIT(1); /* Reserved */
> ssp |= BIT(0); /* RSTORSSP instruction in 64-bit mode */
Since SSP is aligned, it should not have bits 0 to 2 set. I'll add a
comment.
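
Something along these lines in create_rstor_token() would capture that (a
sketch of the comment, not the actual wording):

	/*
	 * ssp is 8-byte aligned, so bits 2:0 are already zero: bit 1
	 * (reserved) stays clear, and setting bit 0 marks the token as
	 * usable by RSTORSSP in 64-bit mode.
	 */
	ssp |= BIT(0);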
>
> > +
> > + if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
> > + return -EFAULT;
> > +
> > + *token_addr = addr;
> > +
> > + return 0;
> > +}
> > +
> > static unsigned long alloc_shstk(unsigned long size)
> > {
> > int flags = MAP_ANONYMOUS | MAP_PRIVATE;
> > @@ -158,6 +185,87 @@ int shstk_alloc_thread_stack(struct
> > task_struct *tsk, unsigned long clone_flags,
> > return 0;
> > }
> >
> > +static unsigned long get_user_shstk_addr(void)
> > +{
> > + unsigned long long ssp;
> > +
> > + fpu_lock_and_load();
> > +
> > + rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > + fpregs_unlock();
> > +
> > + return ssp;
> > +}
> > +
> > +static int put_shstk_data(u64 __user *addr, u64 data)
> > +{
> > + WARN_ON(data & BIT(63));
>
> Let's make this a bit more defensive:
>
> if (WARN_ON_ONCE(data & BIT(63)))
> return -EFAULT;
Hmm, sure. I'm thinking EINVAL since the failure has nothing to do with
faulting.
>
> > +
> > + /*
> > + * Mark the high bit so that the sigframe can't be processed as
> > a
> > + * return address.
> > + */
> > + if (write_user_shstk_64(addr, data | BIT(63)))
> > + return -EFAULT;
> > + return 0;
> > +}
> > +
> > +static int get_shstk_data(unsigned long *data, unsigned long
> > __user *addr)
> > +{
> > + unsigned long ldata;
> > +
> > + if (unlikely(get_user(ldata, addr)))
> > + return -EFAULT;
> > +
> > + if (!(ldata & BIT(63)))
> > + return -EINVAL;
> > +
> > + *data = ldata & ~BIT(63);
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Verify the user shadow stack has a valid token on it, and then
> > set
> > + * *new_ssp according to the token.
> > + */
> > +static int shstk_check_rstor_token(unsigned long *new_ssp)
> > +{
> > + unsigned long token_addr;
> > + unsigned long token;
> > +
> > + token_addr = get_user_shstk_addr();
> > + if (!token_addr)
> > + return -EINVAL;
> > +
> > + if (get_user(token, (unsigned long __user *)token_addr))
> > + return -EFAULT;
> > +
> > + /* Is mode flag correct? */
> > + if (!(token & BIT(0)))
> > + return -EINVAL;
> > +
> > + /* Is busy flag set? */
>
> "Busy"? Not "Reserved"?
Yes, reserved is more accurate; I'll change it. In a previous-ssp token
it is 1, so it's kind of busy-like. That is probably where the comment
came from.
>
> > + if (token & BIT(1))
> > + return -EINVAL;
> > +
> > + /* Mask out flags */
> > + token &= ~3UL;
> > +
> > + /* Restore address aligned? */
> > + if (!IS_ALIGNED(token, 8))
> > + return -EINVAL;
> > +
> > + /* Token placed properly? */
> > + if (((ALIGN_DOWN(token, 8) - 8) != token_addr) || token >=
> > TASK_SIZE_MAX)
> > + return -EINVAL;
> > +
> > + *new_ssp = token;
> > +
> > + return 0;
> > +}
> > +
> > void shstk_free(struct task_struct *tsk)
> > {
> > struct thread_shstk *shstk = &tsk->thread.shstk;
> > --
> > 2.17.1
> >
>
>
On Mon, 2022-10-03 at 12:07 -0700, Chang S. Bae wrote:
> > +/*
> > + * Given the xsave area and a state inside, this function
> > + * initializes an xfeature in the buffer.
>
> But, this function sets XSTATE_BV bits in the buffer. That does not
> *initialize* the state, right?
No, it doesn't actually write out the init state to the buffer.
>
> > + *
> > + * get_xsave_addr() will return NULL if the feature bit is
> > + * not present in the header. This function will make it so
> > + * the xfeature buffer address is ready to be retrieved by
> > + * get_xsave_addr().
>
> Looks like this is used in the next patch to help ptracer().
>
> We have the state copy function -- copy_uabi_to_xstate() that
> retrieves
> the address using __raw_xsave_addr() instead of get_xsave_addr(),
> copies
> the state, and then updates XSTATE_BV.
>
> __raw_xsave_addr() also ensures whether the state is in the
> compacted
> format or not. I think you can use it.
>
> Also, I'm curious about the reason why you want to update XSTATE_BV
> first with this new helper.
>
> Overall, I'm not sure these new helpers are necessary.
Thomas had experimented with this optimization where init state
features weren't saved:
https://lore.kernel.org/lkml/[email protected]/
It made me think non-FPU code should not assume things about the state
of the buffer, as FPU code might have to move things when initializing them.
So the operation is worth centralizing in a helper. I think you are
right, today it is not doing much and could be open coded. I guess the
question is, should it be open coded or centralized? I'm fine either
way.
From: Kees Cook
> Sent: 04 October 2022 20:32
...
> Oh, yes! I do this all the time with FORTIFY shenanigans. Right, so,
> instead of a macro, the "cannot be un-inlined" could be enforced with
> this (untested):
>
> static __always_inline void set_clr_bits_msrl(u32 msr, u64 set, u64 clear)
> {
> u64 val, new_val;
>
> BUILD_BUG_ON(!__builtin_constant_p(msr) ||
> !__builtin_constant_p(set) ||
> !__builtin_constant_p(clear));
You can reduce the amount of text the brain has to parse
by using:
BUILD_BUG_ON(!__builtin_constant_p(msr + set + clear));
Just requires the brain to do a quick 'oh yes'...
David
On Wed, 2022-10-05 at 02:43 +0000, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
> > diff --git a/arch/x86/include/asm/special_insns.h
> > b/arch/x86/include/asm/special_insns.h
> > index 35f709f619fb..f096f52bd059 100644
> > --- a/arch/x86/include/asm/special_insns.h
> > +++ b/arch/x86/include/asm/special_insns.h
> > @@ -223,6 +223,19 @@ static inline void clwb(volatile void *__p)
> > : [pax] "a" (p));
> > }
> >
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static inline int write_user_shstk_64(u64 __user *addr, u64 val)
> > +{
> > + asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > + _ASM_EXTABLE(1b, %l[fail])
> > + :: [addr] "r" (addr), [val] "r" (val)
> > + :: fail);
>
> "1: wrssq %[val], %[addr]\n"
> _ASM_EXTABLE(1b, %l[fail])
> : [addr] "+m" (*addr)
> : [val] "r" (val)
> :: fail
>
> Otherwise you've failed to tell the compiler that you wrote to *addr.
>
> With that fixed, it's not volatile because there are no unexpressed
> side
> effects.
Ok, thanks!
* Andrew Cooper:
> You don't actually need a hole to create a guard. Any mapping of type
> != shstk will do.
>
> If you've got a load of threads, you can tightly pack stack / shstk /
> stack / shstk with no holes, and they each act as each other guard pages.
Can userspace read the shadow stack directly? Writing is obviously
blocked, but reading?
GCC's stack-clash probing uses OR instructions, so it would be fine with
a readable mapping. POSIX does not appear to require PROT_NONE mappings
for the stack guard region, either. However, the
pthread_attr_setguardsize manual page pretty clearly says that it's got
to be unreadable and unwriteable. Hence my question.
Thanks,
Florian
* Andrew Cooper:
> On 10/10/2022 13:33, Florian Weimer wrote:
>> * Andrew Cooper:
>>
>>> You don't actually need a hole to create a guard. Any mapping of type
>>> != shstk will do.
>>>
>>> If you've got a load of threads, you can tightly pack stack / shstk /
>>> stack / shstk with no holes, and they each act as each other guard pages.
>> Can userspace read the shadow stack directly? Writing is obviously
>> blocked, but reading?
>
> Yes - regular reads are permitted to shstk memory.
>
> It's actually a great way to get backtraces with no extra metadata
> needed.
Indeed, I hope shadow stacks can be used to put the discussion around
frame pointers to rest, at least when it comes to profiling. 8-)
>> POSIX does not appear to require PROT_NONE mappings
>> for the stack guard region, either. However, the
>> pthread_attr_setguardsize manual page pretty clearly says that it's got
>> to be unreadable and unwriteable. Hence my question.
>
> Hmm. If that's what the manuals say, then fine.
>
> But honestly, you don't get very far at all without faulting on a
> read-only stack.
I guess we can update the manual page proactively. It does look like a
tempting optimization.
Thanks,
Florian
On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 7327b2573f7c..b49372c7de41 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> int ret;
> pte_t _dst_pte, *dst_pte;
> bool writable = dst_vma->vm_flags & VM_WRITE;
> + bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
> bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> bool page_in_cache = page->mapping;
> spinlock_t *ptl;
> @@ -83,9 +84,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> writable = false;
> }
>
> - if (writable)
> - _dst_pte = pte_mkwrite(_dst_pte);
> - else
> + if (writable) {
> + if (shstk)
> + _dst_pte = pte_mkwrite_shstk(_dst_pte);
> + else
> + _dst_pte = pte_mkwrite(_dst_pte);
> + } else
> /*
> * We need this to make sure write bit removed; as mk_pte()
> * could return a pte with write bit set.
Urgh.. that's unfortunate. But yeah, I don't see a way to make that
pretty either.
On Thu, Sep 29, 2022 at 03:29:00PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> The Control-Flow Enforcement Technology contains two related features,
> one of which is Shadow Stacks. Future patches will utilize this feature
> for shadow stack support in KVM, so add a CPU feature flags for Shadow
> Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
>
> To protect shadow stack state from malicious modification, the registers
> are only accessible in supervisor mode. This implementation
> context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
> on XSAVES.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>
> Cc: Kees Cook <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, 2022-10-14 at 17:52 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:14PM -0700, Rick Edgecombe wrote:
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 7327b2573f7c..b49372c7de41 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -63,6 +63,7 @@ int mfill_atomic_install_pte(struct mm_struct
> > *dst_mm, pmd_t *dst_pmd,
> > int ret;
> > pte_t _dst_pte, *dst_pte;
> > bool writable = dst_vma->vm_flags & VM_WRITE;
> > + bool shstk = dst_vma->vm_flags & VM_SHADOW_STACK;
> > bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > bool page_in_cache = page->mapping;
> > spinlock_t *ptl;
> > @@ -83,9 +84,12 @@ int mfill_atomic_install_pte(struct mm_struct
> > *dst_mm, pmd_t *dst_pmd,
> > writable = false;
> > }
> >
> > - if (writable)
> > - _dst_pte = pte_mkwrite(_dst_pte);
> > - else
> > + if (writable) {
> > + if (shstk)
> > + _dst_pte = pte_mkwrite_shstk(_dst_pte);
> > + else
> > + _dst_pte = pte_mkwrite(_dst_pte);
> > + } else
> > /*
> > * We need this to make sure write bit removed; as
> > mk_pte()
> > * could return a pte with write bit set.
>
> Urgh.. that's unfortunate. But yeah, I don't see a way to make that
> pretty either.
Nadav pointed out that:

	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

and:

	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

are not actually the same, because in the former a non-writable PTE
gets marked dirty. So I was actually going to add two more cases like
the ugly case.
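
A sketch of what those open-coded cases could look like, combining Nadav's
suggestion with the shadow stack split (not the actual patch):

	if (vma->vm_flags & VM_WRITE) {
		entry = pte_mkdirty(entry);

		if (vma->vm_flags & VM_SHADOW_STACK)
			entry = pte_mkwrite_shstk(entry);
		else
			entry = pte_mkwrite(entry);
	}

This keeps read-only VMAs from being dirtied while still creating shadow
stack PTEs for VM_SHADOW_STACK VMAs.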
On Fri, 2022-10-14 at 18:20 +0200, Borislav Petkov wrote:
> On Thu, Sep 29, 2022 at 03:29:00PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > The Control-Flow Enforcement Technology contains two related
> > features,
> > one of which is Shadow Stacks. Future patches will utilize this
> > feature
> > for shadow stack support in KVM, so add a CPU feature flags for
> > Shadow
> > Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
> >
> > To protect shadow stack state from malicious modification, the
> > registers
> > are only accessible in supervisor mode. This implementation
> > context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK
> > depend
> > on XSAVES.
> >
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > Co-developed-by: Rick Edgecombe <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > Cc: Kees Cook <[email protected]>
>
> Reviewed-by: Borislav Petkov <[email protected]>
Thanks!
Just now realizing, I never responded to most of this feedback as the
later conversation focused in on one area. All seems good (thanks!),
except not sure about the below:
On Mon, 2022-10-03 at 12:43 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:21PM -0700, Rick Edgecombe wrote:
> > +
> > + mmap_write_lock(mm);
> > + addr = do_mmap(NULL, addr, size, PROT_READ, flags,
> > + VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
>
> This will use the mmap base address offset randomization, I guess?
Yes.
>
> > +
> > + mmap_write_unlock(mm);
> > +
> > + return addr;
> > +}
> > +
> > +static void unmap_shadow_stack(u64 base, u64 size)
> > +{
> > + while (1) {
> > + int r;
> > +
> > + r = vm_munmap(base, size);
> > +
> > + /*
> > + * vm_munmap() returns -EINTR when mmap_lock is held by
> > + * something else, and that lock should not be held for
> > a
> > + * long time. Retry it for the case.
> > + */
> > + if (r == -EINTR) {
> > + cond_resched();
> > + continue;
> > + }
> > +
> > + /*
> > + * For all other types of vm_munmap() failure, either
> > the
> > + * system is out of memory or there is bug.
> > + */
> > + WARN_ON_ONCE(r);
> > + break;
> > + }
> > +}
> > +
> > +int shstk_setup(void)
>
> Only called local. Make static?
>
> > +{
> > + struct thread_shstk *shstk = &current->thread.shstk;
> > + unsigned long addr, size;
> > +
> > + /* Already enabled */
> > + if (feature_enabled(CET_SHSTK))
> > + return 0;
> > +
> > + /* Also not supported for 32 bit */
> > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> > in_ia32_syscall())
> > + return -EOPNOTSUPP;
> > +
> > + size = PAGE_ALIGN(min_t(unsigned long long,
> > rlimit(RLIMIT_STACK), SZ_4G));
> > + addr = alloc_shstk(size);
> > + if (IS_ERR_VALUE(addr))
> > + return PTR_ERR((void *)addr);
> > +
> > + fpu_lock_and_load();
> > + wrmsrl(MSR_IA32_PL3_SSP, addr + size);
> > + wrmsrl(MSR_IA32_U_CET, CET_SHSTK_EN);
> > + fpregs_unlock();
> > +
> > + shstk->base = addr;
> > + shstk->size = size;
> > + feature_set(CET_SHSTK);
> > +
> > + return 0;
> > +}
> > +
> > +void reset_thread_shstk(void)
> > +{
> > + memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
> > + current->thread.features = 0;
> > + current->thread.features_locked = 0;
> > +}
>
> If features is always going to be tied to shstk, why not put them in
> the
> shstk struct?
CET and LAM used to share an enabling interface and also kernel-side
enablement state tracking. But in the end LAM got its own enabling
interface. Thomas had suggested that they could share a state field on
the kernel side. But then LAM already had enough state tracking for
its needs.
Shadow stack used to track enabling with the fields in the shstk struct
that keep track of the thread's shadow stack. But then we added WRSS,
which needs another field to keep track of its status. So I thought to
leave the 'features' field and use it for all the CET features
tracking. I left it outside of the shstk struct so it looks usable for
any other features that might be looking for a status bit. I can
definitely compile it out when there is no user shadow stack.
snip
> > +
> > +void shstk_free(struct task_struct *tsk)
> > +{
> > + struct thread_shstk *shstk = &tsk->thread.shstk;
> > +
> > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK) ||
> > + !feature_enabled(CET_SHSTK))
> > + return;
> > +
> > + if (!tsk->mm)
> > + return;
> > +
> > + unmap_shadow_stack(shstk->base, shstk->size);
>
> I feel like base and size should be zeroed here?
>
The code used to use shstk->base and shstk->size to keep track of
whether shadow stack was enabled. I'm not sure why to zero them now.
Just defensively, or did you see a concrete issue?
On Wed, 2022-10-05 at 22:58 +0000, Andrew Cooper wrote:
> On 05/10/2022 23:47, Edgecombe, Rick P wrote:
> > On Wed, 2022-10-05 at 02:43 +0000, Andrew Cooper wrote:
> > > On 29/09/2022 23:29, Rick Edgecombe wrote:
> > > > diff --git a/arch/x86/include/asm/special_insns.h
> > > > b/arch/x86/include/asm/special_insns.h
> > > > index 35f709f619fb..f096f52bd059 100644
> > > > --- a/arch/x86/include/asm/special_insns.h
> > > > +++ b/arch/x86/include/asm/special_insns.h
> > > > @@ -223,6 +223,19 @@ static inline void clwb(volatile void
> > > > *__p)
> > > > : [pax] "a" (p));
> > > > }
> > > >
> > > > +#ifdef CONFIG_X86_SHADOW_STACK
> > > > +static inline int write_user_shstk_64(u64 __user *addr, u64
> > > > val)
> > > > +{
> > > > + asm_volatile_goto("1: wrussq %[val], (%[addr])\n"
> > > > + _ASM_EXTABLE(1b, %l[fail])
> > > > + :: [addr] "r" (addr), [val] "r" (val)
> > > > + :: fail);
> > >
> > > "1: wrssq %[val], %[addr]\n"
> > > _ASM_EXTABLE(1b, %l[fail])
> > > : [addr] "+m" (*addr)
> > > : [val] "r" (val)
> > > :: fail
> > >
> > > Otherwise you've failed to tell the compiler that you wrote to
> > > *addr.
> > >
> > > With that fixed, it's not volatile because there are no
> > > unexpressed
> > > side
> > > effects.
> >
> > Ok, thanks!
>
> On further consideration, it should be "=m" not "+m", which is even
> less
> constrained, and even easier for an enterprising optimiser to try and
> do
> something useful with.
AFAICT this won't work on all GCC versions. Asm gotos used to not
support having outputs, and they are also implicitly volatile anyway.
So I think I'll have to leave it. But I can change the wrss one in the
selftest to "=m".
Kees, sorry for the delayed response. There was so much feedback, I
missed responding to some.
On Mon, 2022-10-03 at 13:52 -0700, Kees Cook wrote:
> On Thu, Sep 29, 2022 at 03:29:24PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > When a signal is handled normally the context is pushed to the
> > stack
> > before handling it. For shadow stacks, since the shadow stack only
> > tracks
> > return addresses, there isn't any state that needs to be pushed.
> > However,
> > there are still a few things that need to be done. These things are
> > userspace visible and which will be kernel ABI for shadow stacks.
> >
> > One is to make sure the restorer address is written to shadow
> > stack, since
> > the signal handler (if not changing ucontext) returns to the
> > restorer, and
> > the restorer calls sigreturn. So add the restorer on the shadow
> > stack
> > before handling the signal, so there is not a conflict when the
> > signal
> > handler returns to the restorer.
> >
> > The other thing to do is to place some type of checkable token on
> > the
> > thread's shadow stack before handling the signal and check it
> > during
> > sigreturn. This is an extra layer of protection to hamper attackers
> > calling sigreturn manually as in SROP-like attacks.
> >
> > For this token we can use the shadow stack data format defined
> > earlier.
> > Have the data pushed be the previous SSP. In the future the
> > sigreturn
> > might want to return back to a different stack. Storing the SSP
> > (instead
> > of a restore offset or something) allows for future functionality
> > that
> > may want to restore to a different stack.
> >
> > So, when handling a signal push
> > - the SSP pointing in the shadow stack data format
> > - the restorer address below the restore token.
> >
> > In sigreturn, verify SSP is stored in the data format and pop the
> > shadow
> > stack.
> >
> > Signed-off-by: Yu-cheng Yu <[email protected]>
> > Co-developed-by: Rick Edgecombe <[email protected]>
> > Signed-off-by: Rick Edgecombe <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Cyrill Gorcunov <[email protected]>
> > Cc: Florian Weimer <[email protected]>
> > Cc: H. Peter Anvin <[email protected]>
> > Cc: Kees Cook <[email protected]>
> >
> > ---
> >
> > v2:
> > - Switch to new shstk signal format
> >
> > v1:
> > - Use xsave helpers.
> > - Expand commit log.
> >
> > Yu-cheng v27:
> > - Eliminate saving shadow stack pointer to signal context.
> >
> > Yu-cheng v25:
> > - Update commit log/comments for the sc_ext struct.
> > - Use restorer address already calculated.
> > - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
> > - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
> > - Eliminate writing to MSR_IA32_U_CET for shadow stack.
> > - Change wrmsrl() to wrmsrl_safe() and handle error.
> >
> > arch/x86/ia32/ia32_signal.c | 1 +
> > arch/x86/include/asm/cet.h | 5 ++
> > arch/x86/kernel/shstk.c | 126 ++++++++++++++++++++++++++++++
> > ------
> > arch/x86/kernel/signal.c | 10 +++
> > 4 files changed, 123 insertions(+), 19 deletions(-)
> >
> > diff --git a/arch/x86/ia32/ia32_signal.c
> > b/arch/x86/ia32/ia32_signal.c
> > index c9c3859322fa..88d71b9de616 100644
> > --- a/arch/x86/ia32/ia32_signal.c
> > +++ b/arch/x86/ia32/ia32_signal.c
> > @@ -34,6 +34,7 @@
> > #include <asm/sigframe.h>
> > #include <asm/sighandling.h>
> > #include <asm/smap.h>
> > +#include <asm/cet.h>
> >
> > static inline void reload_segments(struct sigcontext_32 *sc)
> > {
> > diff --git a/arch/x86/include/asm/cet.h
> > b/arch/x86/include/asm/cet.h
> > index 924de99e0c61..8c6fab9f402a 100644
> > --- a/arch/x86/include/asm/cet.h
> > +++ b/arch/x86/include/asm/cet.h
> > @@ -6,6 +6,7 @@
> > #include <linux/types.h>
> >
> > struct task_struct;
> > +struct ksignal;
> >
> > struct thread_shstk {
> > u64 base;
> > @@ -22,6 +23,8 @@ int shstk_alloc_thread_stack(struct task_struct
> > *p, unsigned long clone_flags,
> > void shstk_free(struct task_struct *p);
> > int shstk_disable(void);
> > void reset_thread_shstk(void);
> > +int setup_signal_shadow_stack(struct ksignal *ksig);
> > +int restore_signal_shadow_stack(void);
> > #else
> > static inline long cet_prctl(struct task_struct *task, int option,
> > unsigned long features) { return -EINVAL; }
> > @@ -33,6 +36,8 @@ static inline int shstk_alloc_thread_stack(struct
> > task_struct *p,
> > static inline void shstk_free(struct task_struct *p) {}
> > static inline int shstk_disable(void) { return -EOPNOTSUPP; }
> > static inline void reset_thread_shstk(void) {}
> > +static inline int setup_signal_shadow_stack(struct ksignal *ksig)
> > { return 0; }
> > +static inline int restore_signal_shadow_stack(void) { return 0; }
> > #endif /* CONFIG_X86_SHADOW_STACK */
> >
> > #endif /* __ASSEMBLY__ */
> > diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> > index 8904aef487bf..04442134aadd 100644
> > --- a/arch/x86/kernel/shstk.c
> > +++ b/arch/x86/kernel/shstk.c
> > @@ -227,41 +227,129 @@ static int get_shstk_data(unsigned long
> > *data, unsigned long __user *addr)
> > }
> >
> > /*
> > - * Verify the user shadow stack has a valid token on it, and then
> > set
> > - * *new_ssp according to the token.
> > + * Create a restore token on shadow stack, and then push the user-
> > mode
> > + * function return address.
> > */
> > -static int shstk_check_rstor_token(unsigned long *new_ssp)
> > +static int shstk_setup_rstor_token(unsigned long ret_addr,
> > unsigned long *new_ssp)
>
> Oh, hrm. Prior patch defines shstk_check_rstor_token() and
> doesn't call it. This patch removes it. :P Can you please remove
> shstk_check_rstor_token() from the prior patch?
Yes, this function is not needed until the alt shadow stack stuff. It
got all mangled across earlier patches. I removed it altogether now.
Thanks.
>
> > {
> > - unsigned long token_addr;
> > - unsigned long token;
> > + unsigned long ssp, token_addr;
> > + int err;
> > +
> > + if (!ret_addr)
> > + return -EINVAL;
> > +
> > + ssp = get_user_shstk_addr();
> > + if (!ssp)
> > + return -EINVAL;
> > +
> > + err = create_rstor_token(ssp, &token_addr);
> > + if (err)
> > + return err;
> > +
> > + ssp = token_addr - sizeof(u64);
> > + err = write_user_shstk_64((u64 __user *)ssp, (u64)ret_addr);
> > +
> > + if (!err)
> > + *new_ssp = ssp;
> > +
> > + return err;
> > +}
> > +
> > +static int shstk_push_sigframe(unsigned long *ssp)
> > +{
> > + unsigned long target_ssp = *ssp;
> > +
> > + /* Token must be aligned */
> > + if (!IS_ALIGNED(*ssp, 8))
> > + return -EINVAL;
> >
> > - token_addr = get_user_shstk_addr();
> > - if (!token_addr)
> > + if (!IS_ALIGNED(target_ssp, 8))
> > return -EINVAL;
> >
> > - if (get_user(token, (unsigned long __user *)token_addr))
> > + *ssp -= SS_FRAME_SIZE;
> > + if (put_shstk_data((void *__user)*ssp, target_ssp))
> > return -EFAULT;
> >
> > - /* Is mode flag correct? */
> > - if (!(token & BIT(0)))
> > + return 0;
> > +}
> > +
> > +
> > +static int shstk_pop_sigframe(unsigned long *ssp)
> > +{
> > + unsigned long token_addr;
> > + int err;
> > +
> > + err = get_shstk_data(&token_addr, (unsigned long __user
> > *)*ssp);
> > + if (unlikely(err))
> > + return err;
> > +
> > + /* Restore SSP aligned? */
> > + if (unlikely(!IS_ALIGNED(token_addr, 8)))
> > return -EINVAL;
>
> Why doesn't this always fail, given BIT(0) being set? I don't see it
> getting cleared until the end of this function.
Because it isn't a normal token; it was an address in the "data format"
that had bit 63 set. Then bit 63 was cleared, making it a normal
address.
>
> >
> > - /* Is busy flag set? */
> > - if (token & BIT(1))
> > + /* SSP in userspace? */
> > + if (unlikely(token_addr >= TASK_SIZE_MAX))
> > return -EINVAL;
>
> BIT(63) already got cleared by here (in get_shstk_data(), but yes,
> this is still a reasonable check.
Good point. I guess I can leave it. Thanks.
>
> >
> > - /* Mask out flags */
> > - token &= ~3UL;
> > + *ssp = token_addr;
> > +
> > + return 0;
> > +}
>
>
On Thu, Oct 20, 2022 at 09:29:38PM +0000, Edgecombe, Rick P wrote:
> The code used to use shstk->base and shstk->size to keep track of
> whether shadow stack was enabled. I'm not sure why to zero them now.
> Just defensively, or did you see a concrete issue?
Just to be defensive. It's not fast path by any means, so better to
have values that make a bit of sense there. *shrug* It just stood out
as feeling "missing" while I was reading the code.
--
Kees Cook
On Thu, Oct 20, 2022 at 10:08:17PM +0000, Edgecombe, Rick P wrote:
> Kees, sorry for the delayed response. There was so much feedback, I
> missed responding to some.
No worries! Most of my feedback was just to get help filling gaps in my
understanding. :) I appreciate the replies -- I'm looking forward to v3!
--
Kees Cook