2022-12-03 02:15:59

by Edgecombe, Rick P

Subject: [PATCH v4 00/39] Shadow stacks for userspace

Hi,

This series implements Shadow Stacks for userspace using x86's Control-flow
Enforcement Technology (CET). CET consists of two related security features:
Shadow Stacks and Indirect Branch Tracking. This series implements just the
Shadow Stack part, and just for userspace.

The main use case for shadow stack is providing protection against
return-oriented programming (ROP) attacks. It works by maintaining a
secondary (shadow) stack using a special memory type that has protections
against modification. When executing a CALL instruction, the processor pushes
the return address to both the normal stack and to the special permissioned
shadow stack. Upon RET, the processor pops the shadow stack copy and compares
it to the normal stack copy; if the two differ, the processor raises a
control-protection (#CP) fault. For more details, see the cover letter from
v1 [0].
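
To illustrate, a call/return on a shadow stack enabled CPU roughly behaves as
below (simplified pseudocode, not exact ISA semantics):

	CALL:	push return address to the regular stack
		push return address to the shadow stack
	RET:	pop addr from the regular stack
		pop saddr from the shadow stack
		if (addr != saddr)
			raise #CP (control-protection fault)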


I humbly think this is looking decent at this point, please consider applying.

Again there were some smaller changes (logged in the patches). The more
noteworthy changes were:
- Separate Shadow stack from IBT in kernel APIs. This involves renaming the
arch_prctl()s and changing the ptrace CET regset interface to only expose
the SSP, and not any IBT bits.
- Handle the 32 bit case more completely. (see the commit log of "x86: Prevent
32 bit operations for 64 bit shstk tasks" for the details)
- Drop elf header bit filtering compatibility patch for now, per Linus. [1]
- Break apart _PAGE_COW patch for bisectability reasons.

I left the Tested-by tags that were already in place; testers, please re-test.

Previous version [2].

Thanks,

Rick

[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://lore.kernel.org/lkml/CAHk-=wgP5mk3poVeejw16Asbid0ghDt4okHnWaWKLBkRhQntRA@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/

Kirill A. Shutemov (1):
x86: Introduce userspace API for shadow stack

Mike Rapoport (1):
x86/shstk: Add ARCH_SHSTK_UNLOCK

Rick Edgecombe (13):
x86/fpu: Add helper for modifying xstate
x86/mm: Introduce _PAGE_COW
x86/mm: Start actually marking _PAGE_COW
mm: Don't allow write GUPs to shadow stack memory
mm: Warn on shadow stack memory in wrong vma
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Support wrss for userspace
x86: Expose thread features in /proc/$PID/status
x86: Prevent 32 bit operations for 64 bit shstk tasks
x86/shstk: Wire in shadow stack interface
selftests/x86: Add shadow stack test
x86/fpu: Add helper for initing features
x86/shstk: Add ARCH_SHSTK_STATUS

Yu-cheng Yu (24):
Documentation/x86: Add CET shadow stack description
x86/shstk: Add Kconfig option for Shadow Stack
x86/cpufeatures: Add CPU feature flags for shadow stacks
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
x86: Add user control-protection fault handler
x86/mm: Remove _PAGE_DIRTY from kernel RO pages
x86/mm: Move pmd_write(), pud_write() up in the file
x86/mm: Update pte_modify for _PAGE_COW
x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
transition from _PAGE_DIRTY to _PAGE_COW
mm: Move VM_UFFD_MINOR_BIT from 37 to 38
mm: Introduce VM_SHADOW_STACK for shadow stack memory
x86/mm: Check Shadow Stack page fault errors
x86/mm: Update maybe_mkwrite() for shadow stack
mm: Fixup places that call pte_mkwrite() directly
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm/mprotect: Exclude shadow stack from preserve_write
mm: Re-introduce vm_flags to do_mmap()
x86/shstk: Add user-mode shadow stack support
x86/shstk: Handle thread shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle signals for shadow stack
x86: Add PTRACE interface for shadow stack

Documentation/filesystems/proc.rst | 1 +
Documentation/x86/index.rst | 1 +
Documentation/x86/shstk.rst | 172 +++++
arch/arm/kernel/signal.c | 2 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/Kconfig | 24 +
arch/x86/Kconfig.assembler | 5 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 16 +-
arch/x86/include/asm/fpu/api.h | 9 +
arch/x86/include/asm/fpu/regset.h | 7 +-
arch/x86/include/asm/fpu/sched.h | 3 +-
arch/x86/include/asm/fpu/types.h | 14 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/include/asm/mmu_context.h | 2 +
arch/x86/include/asm/msr.h | 11 +
arch/x86/include/asm/pgtable.h | 320 +++++++-
arch/x86/include/asm/pgtable_types.h | 65 +-
arch/x86/include/asm/processor.h | 8 +
arch/x86/include/asm/shstk.h | 52 ++
arch/x86/include/asm/sighandling.h | 1 +
arch/x86/include/asm/special_insns.h | 13 +
arch/x86/include/asm/tlbflush.h | 3 +-
arch/x86/include/asm/trap_pf.h | 2 +
arch/x86/include/uapi/asm/mman.h | 3 +
arch/x86/include/uapi/asm/prctl.h | 12 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cpu/common.c | 37 +-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/cpu/proc.c | 23 +
arch/x86/kernel/fpu/core.c | 60 +-
arch/x86/kernel/fpu/regset.c | 87 +++
arch/x86/kernel/fpu/xstate.c | 148 ++--
arch/x86/kernel/fpu/xstate.h | 6 +
arch/x86/kernel/idt.c | 2 +-
arch/x86/kernel/process.c | 18 +-
arch/x86/kernel/process_64.c | 8 +
arch/x86/kernel/ptrace.c | 12 +
arch/x86/kernel/shstk.c | 495 +++++++++++++
arch/x86/kernel/signal.c | 21 +
arch/x86/kernel/signal_64.c | 6 +
arch/x86/kernel/signal_compat.c | 7 +-
arch/x86/kernel/traps.c | 107 ++-
arch/x86/mm/fault.c | 38 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 6 +
arch/x86/xen/enlighten_pv.c | 2 +-
arch/x86/xen/xen-asm.S | 2 +-
fs/aio.c | 2 +-
fs/proc/array.c | 6 +
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 57 +-
include/linux/pgtable.h | 35 +
include/linux/proc_fs.h | 2 +
include/linux/ptrace.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/asm-generic/unistd.h | 2 +-
include/uapi/linux/elf.h | 2 +
ipc/shm.c | 2 +-
kernel/signal.c | 8 +
kernel/sys_ni.c | 1 +
mm/gup.c | 2 +-
mm/huge_memory.c | 20 +-
mm/memory.c | 7 +-
mm/migrate_device.c | 4 +-
mm/mmap.c | 12 +-
mm/mprotect.c | 8 +
mm/nommu.c | 4 +-
mm/userfaultfd.c | 10 +-
mm/util.c | 2 +-
tools/testing/selftests/x86/Makefile | 4 +-
.../testing/selftests/x86/test_shadow_stack.c | 685 ++++++++++++++++++
78 files changed, 2560 insertions(+), 178 deletions(-)
create mode 100644 Documentation/x86/shstk.rst
create mode 100644 arch/x86/include/asm/shstk.h
create mode 100644 arch/x86/kernel/shstk.c
create mode 100644 tools/testing/selftests/x86/test_shadow_stack.c

--
2.17.1


2022-12-03 02:16:18

by Edgecombe, Rick P

Subject: [PATCH v4 33/39] x86: Prevent 32 bit operations for 64 bit shstk tasks

Shadow stack is only supported in 64 bit kernels. Since 64 bit kernels can
run 32 bit applications using 32 bit emulation, it would still be possible
to support shadow stack for 32 bit applications. The challenge is that the
shadow stack sigframe format uses noncanonical bits in the shadow stack
frame, and 32 bit has no such bits. This means the 32 bit shadow stack
sigframe would need either an alternate, as yet undetermined, solution or to
not support any features that expand the shadow stack sigframe (alt shadow
stack). It would add complexity to separate out these future features
between the ABIs.

But shadow stack support for 32 bit would not likely be very useful. 32
bit emulation is mostly there to support old apps, which will not have been
built with shadow stack enabled. So this series disables support for 32 bit
apps by blocking the arch_prctl()s when made via a 32 bit syscall.

But PeterZ points out (see link) that some applications will utilize a 32
bit segment to run 32 bit code inside a 64 bit process. WINE does this to
run 32 bit Windows apps. However, doing this with shadow stack is not
likely to happen in practice since Windows does not support shadow stack
for 32 bit applications either.

Any preexisting applications that are marked with shadow stack are unlikely
to make it to the point where they can exercise any 32 bit signal
interactions anyway, because the HW requires the shadow stack pointer
register to be in the 32 bit range in this case, but the shadow stack would
have been allocated in the 64 bit address space. The shadow stack enabled
app would #GP when doing the long jump into the 32 bit segment.

So since 32 bit is not easy to support and there are likely not many users,
more cleanly don't support 32 bit signals in a 64 bit address space by not
allowing 32 bit ABI signal handlers when shadow stack is enabled. Do this by
clearing any 32 bit ABI signal handlers when shadow stack is enabled, and
disallowing any further 32 bit ABI signal handlers. Also, return an error
code for the clone operations when made via a 32 bit syscall.
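
As a rough sketch of the resulting userspace-visible behavior (hypothetical
example; see the hunks below for the exact error paths):

	/* In a task with shadow stack enabled, via the 32 bit syscall ABI: */
	sigaction(SIGUSR1, &sa, NULL);	/* rejected with -EINVAL */
	clone(...);			/* fails: no 32 bit shadow stack support */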

Link: https://lore.kernel.org/lkml/[email protected]/#t
Signed-off-by: Rick Edgecombe <[email protected]>
---

v4:
- New patch

arch/x86/include/asm/shstk.h | 12 ++++++++++++
arch/x86/include/asm/sighandling.h | 1 +
arch/x86/kernel/shstk.c | 10 ++++++++--
arch/x86/kernel/signal.c | 20 ++++++++++++++++++++
arch/x86/kernel/signal_compat.c | 5 +++++
include/linux/ptrace.h | 1 +
kernel/signal.c | 8 ++++++++
7 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 746c040f7cb6..c82f22fd5e6d 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -4,6 +4,7 @@

#ifndef __ASSEMBLY__
#include <linux/types.h>
+#include <asm/prctl.h>

struct task_struct;
struct ksignal;
@@ -22,6 +23,12 @@ int shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
void shstk_free(struct task_struct *p);
int setup_signal_shadow_stack(struct ksignal *ksig);
int restore_signal_shadow_stack(void);
+bool features_enabled(unsigned long features);
+
+static inline bool shstk_enabled(void)
+{
+ return features_enabled(ARCH_SHSTK_SHSTK);
+}
#else
static inline long shstk_prctl(struct task_struct *task, int option,
unsigned long features) { return -EINVAL; }
@@ -33,6 +40,11 @@ static inline int shstk_alloc_thread_stack(struct task_struct *p,
static inline void shstk_free(struct task_struct *p) {}
static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
static inline int restore_signal_shadow_stack(void) { return 0; }
+
+static inline bool shstk_enabled(void)
+{
+ return false;
+}
#endif /* CONFIG_X86_USER_SHADOW_STACK */

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/include/asm/sighandling.h b/arch/x86/include/asm/sighandling.h
index e770c4fc47f4..eba88c7a6446 100644
--- a/arch/x86/include/asm/sighandling.h
+++ b/arch/x86/include/asm/sighandling.h
@@ -24,4 +24,5 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);
int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs);

+void flush_32bit_signals(struct task_struct *t);
#endif /* _ASM_X86_SIGHANDLING_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index e59544fec96d..3a7bcc01d985 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -24,11 +24,11 @@
#include <asm/shstk.h>
#include <asm/special_insns.h>
#include <asm/fpu/api.h>
-#include <asm/prctl.h>
+#include <asm/sighandling.h>

#define SS_FRAME_SIZE 8

-static bool features_enabled(unsigned long features)
+bool features_enabled(unsigned long features)
{
return current->thread.features & features;
}
@@ -146,6 +146,8 @@ static int shstk_setup(void)
if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_32bit_syscall())
return -EOPNOTSUPP;

+ flush_32bit_signals(current);
+
size = adjust_shstk_size(0);
addr = alloc_shstk(0, size, 0, false);
if (IS_ERR_VALUE(addr))
@@ -183,6 +185,10 @@ int shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
if (!features_enabled(ARCH_SHSTK_SHSTK))
return 0;

+ /* If shadow stack is enabled, 32 bit syscalls are not supported */
+ if (in_32bit_syscall())
+ return 1;
+
/*
* For CLONE_VM, except vfork, the child needs a separate shadow
* stack.
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index b2c9853ce1c5..721b326d61ec 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -352,6 +352,26 @@ void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
force_sig(SIGSEGV);
}

+void flush_32bit_signals(struct task_struct *t)
+{
+ const unsigned long flags_32 = SA_IA32_ABI | SA_X32_ABI;
+ struct k_sigaction *ka;
+ int i;
+
+ spin_lock_irq(&t->sighand->siglock);
+ ka = &t->sighand->action[0];
+ for (i = 0; i < _NSIG; i++) {
+ if (ka->sa.sa_flags & flags_32) {
+ ka->sa.sa_handler = SIG_DFL;
+ ka->sa.sa_flags = 0;
+ ka->sa.sa_restorer = NULL;
+ sigemptyset(&ka->sa.sa_mask);
+ }
+ ka++;
+ }
+ spin_unlock_irq(&t->sighand->siglock);
+}
+
#ifdef CONFIG_DYNAMIC_SIGFRAME
#ifdef CONFIG_STRICT_SIGALTSTACK_SIZE
static bool strict_sigaltstack_size __ro_after_init = true;
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index d441804443d5..9c73435bc393 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -177,6 +177,11 @@ static inline void signal_compat_build_tests(void)
/* any new si_fields should be added here */
}

+bool sigaction_compat_invalid(void)
+{
+ return in_32bit_syscall() && shstk_enabled();
+}
+
void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
{
signal_compat_build_tests();
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index c952c5ba8fab..30ec68f56caf 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -405,6 +405,7 @@ static inline void user_single_step_report(struct pt_regs *regs)
extern int task_current_syscall(struct task_struct *target, struct syscall_info *info);

extern void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact);
+bool sigaction_compat_invalid(void);

/*
* ptrace report for syscall entry and exit looks identical.
diff --git a/kernel/signal.c b/kernel/signal.c
index d140672185a4..a75351c8fc0e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4079,6 +4079,11 @@ void kernel_sigaction(int sig, __sighandler_t action)
}
EXPORT_SYMBOL(kernel_sigaction);

+bool __weak sigaction_compat_invalid(void)
+{
+ return false;
+}
+
void __weak sigaction_compat_abi(struct k_sigaction *act,
struct k_sigaction *oact)
{
@@ -4093,6 +4098,9 @@ int do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact)
if (!valid_signal(sig) || sig < 1 || (act && sig_kernel_only(sig)))
return -EINVAL;

+ if (sigaction_compat_invalid())
+ return -EINVAL;
+
k = &p->sighand->action[sig-1];

spin_lock_irq(&p->sighand->siglock);
--
2.17.1

2022-12-03 02:16:21

by Edgecombe, Rick P

Subject: [PATCH v4 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

From: Yu-cheng Yu <[email protected]>

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.

In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
if HW dirty tracking is not needed, because if the Dirty bit is not already
set, the CPU has to set it when the memory gets written to. This creates
additional work for the CPU. So traditional wisdom was to simply set the
Dirty bit whenever you didn't care about it. However, it was never really
very helpful for read-only kernel memory.

When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.
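
For reference, the PTE bit combinations involved are:

	Write=0,Dirty=0: read-only memory
	Write=0,Dirty=1: shadow stack memory (on shadow stack capable CPUs)
	Write=1,Dirty=1: conventional writable memory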

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---

v3:
- Update commit log (Andrew Cooper, Peterz)

v2:
- Normalize PTE bit descriptions between patches

arch/x86/include/asm/pgtable_types.h | 6 +++---
arch/x86/mm/pat/set_memory.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 447d4bee25c4..0646ad00178b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -192,10 +192,10 @@ enum page_cache_mode {
#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
#define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0)
#define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC)
-#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G)
-#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0|___D| 0|___G)
+#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G)
+#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G)
#define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC)
-#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G)
+#define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX| 0| 0|___G)
#define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G)
#define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW| 0|___A| 0|___D|_PSE|___G)
#define __PAGE_KERNEL_WP (__PP|__RW| 0|___A|__NX|___D| 0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 0db69514fe29..50e07e8493e0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2055,7 +2055,7 @@ int set_memory_nx(unsigned long addr, int numpages)

int set_memory_ro(unsigned long addr, int numpages)
{
- return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+ return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
}

int set_memory_rox(unsigned long addr, int numpages)
--
2.17.1

2022-12-03 02:16:56

by Edgecombe, Rick P

Subject: [PATCH v4 22/39] mm: Re-introduce vm_flags to do_mmap()

From: Yu-cheng Yu <[email protected]>

There were no more callers passing vm_flags to do_mmap(), so vm_flags was
removed from the function's parameters by:

commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").

There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to
do_mmap(). Thus, re-introduce vm_flags to do_mmap().
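
As a sketch of how the new parameter is used (the shadow stack allocation
call site appears later in the series; the exact arguments here are
illustrative):

	addr = do_mmap(NULL, 0, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
		       VM_SHADOW_STACK, 0, &populate, NULL);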

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Peter Collingbourne <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
---
fs/aio.c | 2 +-
include/linux/mm.h | 3 ++-
ipc/shm.c | 2 +-
mm/mmap.c | 10 +++++-----
mm/nommu.c | 4 ++--
mm/util.c | 2 +-
6 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5b2ff20ad322..66119297125a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -554,7 +554,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)

ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
PROT_READ | PROT_WRITE,
- MAP_SHARED, 0, &unused, NULL);
+ MAP_SHARED, 0, 0, &unused, NULL);
mmap_write_unlock(mm);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0991d2fc5a8..7c10d2a7bbfd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2731,7 +2731,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
struct list_head *uf);
extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
- unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+ vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf);
extern int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool downgrade);
diff --git a/ipc/shm.c b/ipc/shm.c
index bd2fcc4d454e..1c5476bfec8b 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1662,7 +1662,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
goto invalid;
}

- addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+ addr = do_mmap(file, addr, size, prot, flags, 0, 0, &populate, NULL);
*raddr = addr;
err = 0;
if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 7ce04d2d5a10..a3ce13be0767 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1239,11 +1239,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
*/
unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
- unsigned long flags, unsigned long pgoff,
- unsigned long *populate, struct list_head *uf)
+ unsigned long flags, vm_flags_t vm_flags,
+ unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf)
{
struct mm_struct *mm = current->mm;
- vm_flags_t vm_flags;
int pkey = 0;

validate_mm(mm);
@@ -1304,7 +1304,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
- vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

if (flags & MAP_LOCKED)
@@ -2877,7 +2877,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,

file = get_file(vma->vm_file);
ret = do_mmap(vma->vm_file, start, size,
- prot, flags, pgoff, &populate, NULL);
+ prot, flags, 0, pgoff, &populate, NULL);
fput(file);
out:
mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..20ff1ec89091 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1042,6 +1042,7 @@ unsigned long do_mmap(struct file *file,
unsigned long len,
unsigned long prot,
unsigned long flags,
+ vm_flags_t vm_flags,
unsigned long pgoff,
unsigned long *populate,
struct list_head *uf)
@@ -1049,7 +1050,6 @@ unsigned long do_mmap(struct file *file,
struct vm_area_struct *vma;
struct vm_region *region;
struct rb_node *rb;
- vm_flags_t vm_flags;
unsigned long capabilities, result;
int ret;
MA_STATE(mas, &current->mm->mm_mt, 0, 0);
@@ -1069,7 +1069,7 @@ unsigned long do_mmap(struct file *file,

/* we've determined that we can make the mapping, now translate what we
* now know into VMA flags */
- vm_flags = determine_vm_flags(file, prot, flags, capabilities);
+ vm_flags |= determine_vm_flags(file, prot, flags, capabilities);


/* we're going to need to record the mapping */
diff --git a/mm/util.c b/mm/util.c
index 12984e76767e..aefe4fae7ecf 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -517,7 +517,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
if (!ret) {
if (mmap_write_lock_killable(mm))
return -EINTR;
- ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
+ ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
&uf);
mmap_write_unlock(mm);
userfaultfd_unmap_complete(mm, &uf);
--
2.17.1

2022-12-03 02:17:45

by Edgecombe, Rick P

Subject: [PATCH v4 19/39] mm: Add guard pages around a shadow stack.

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

The architecture of shadow stack constrains the ability of userspace to
move the shadow stack pointer (SSP) in order to prevent corrupting or
switching to other shadow stacks. The RSTORSSP instruction can move the
SSP to a different shadow stack, but it requires a specially placed token
in order to do this. However, the architecture does not prevent incrementing
the
stack pointer to wander onto an adjacent shadow stack. To prevent this in
software, enforce guard pages at the beginning of shadow stack vmas, such
that there will always be a gap between adjacent shadow stacks.

Make the gap big enough so that no userspace SSP changing operations
(besides RSTORSSP), can move the SSP from one stack to the next. The
SSP can increment or decrement by CALL, RET and INCSSP. CALL and RET
can move the SSP by a maximum of 8 bytes, at which point the shadow
stack would be accessed.

The INCSSP instruction can also increment the shadow stack pointer. It
is the shadow stack analog of an instruction like:

addq $0x80, %rsp

However, there is one important difference between an ADD on %rsp and
INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
of the first and last elements that were "popped". It can be thought of
as acting like this:

READ_ONCE(ssp); // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8); // read+discard last popped stack element

The maximum distance INCSSP can move the SSP is 2040 bytes, before it
would read the memory. Therefore a single page gap will be enough to
prevent any operation from shifting the SSP to an adjacent stack, since
it would have to land in the gap at least once, causing a fault.
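
As a quick check of that arithmetic: INCSSPQ takes an 8-bit operand, so it
can move the SSP by at most 255 * 8 = 2040 bytes, less than the 4096-byte
guard page. CALL and RET only move the SSP by 8 bytes at a time.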

This could be accomplished by using VM_GROWSDOWN, but this has a
downside. The behavior would allow shadow stacks to grow, which is
unneeded and adds a strange difference to how most regular stacks work.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

v4:
- Drop references to 32 bit instructions
- Switch to generic code to drop __weak (Peterz)

v2:
- Use __weak instead of #ifdef (Dave Hansen)
- Only have start gap on shadow stack (Andy Luto)
- Create stack_guard_start_gap() to not duplicate code
in an arch version of vm_start_gap() (Dave Hansen)
- Improve commit log partly with verbiage from (Dave Hansen)

Yu-cheng v25:
- Move SHADOW_STACK_GUARD_GAP to arch/x86/mm/mmap.c.

Yu-cheng v24:
- Instead changing vm_*_gap(), create x86-specific versions.

include/linux/mm.h | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f10797a1b236..e0991d2fc5a8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2821,15 +2821,36 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
return mtree_load(&mm->mm_mt, addr);
}

+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_GROWSDOWN)
+ return stack_guard_gap;
+
+ /*
+ * Shadow stack pointer is moved by CALL, RET, and INCSSPQ.
+ * INCSSPQ moves shadow stack pointer up to 255 * 8 = ~2 KB
+ * and touches the first and the last element in the range, which
+ * triggers a page fault if the range is not in a shadow stack.
+ * Because of this, creating 4-KB guard pages around a shadow
+ * stack prevents these instructions from going beyond.
+ *
+ * Creation of VM_SHADOW_STACK is tightly controlled, so a vma
+ * can't be both VM_GROWSDOWN and VM_SHADOW_STACK
+ */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return PAGE_SIZE;
+
+ return 0;
+}
+
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
+ unsigned long gap = stack_guard_start_gap(vma);
unsigned long vm_start = vma->vm_start;

- if (vma->vm_flags & VM_GROWSDOWN) {
- vm_start -= stack_guard_gap;
- if (vm_start > vma->vm_start)
- vm_start = 0;
- }
+ vm_start -= gap;
+ if (vm_start > vma->vm_start)
+ vm_start = 0;
return vm_start;
}

--
2.17.1

2022-12-03 02:17:46

by Edgecombe, Rick P

Subject: [PATCH v4 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

From: Yu-cheng Yu <[email protected]>

When Shadow Stack is in use, Write=0,Dirty=1 PTEs are reserved for shadow
stack. Copy-on-write PTEs then have Write=0,Cow=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a Write=0 PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE. However, this will not occur on processors supporting
Shadow Stack, and a TLB flush is not necessary.

The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
atomically, a transient shadow stack PTE can be created as a result.
Thus, prevent that with cmpxchg.
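
A simplified illustration of the second case, with the wrprotect done
non-atomically (this is the window the cmpxchg closes):

	ptep_set_wrprotect()			PTE value
	read PTE				Write=1,Dirty=1
	clear Write				Write=0,Dirty=1  <- transient
						   shadow stack PTE
	move Dirty to Cow			Write=0,Cow=1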

In the case of pmdp_set_wrprotect(), for nopmd configs the ->pmd operated
on does not exist and the logic would need to be different. Although the
extra functionality will normally be optimized out when user shadow
stacks are not configured, also exclude it in the preprocessor stage so
that it will still compile. User shadow stack is not supported there by
Linux anyway. Leave the cpu_feature_enabled() check so that the
functionality also disables based on runtime detection of the feature.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue. Jann Horn provided the cmpxchg solution.

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v3:
- Remove unnecessary #ifdef (Dave Hansen)

v2:
- Compile out some code due to clang build error
- Clarify commit log (dhansen)
- Normalize PTE bit descriptions between patches (dhansen)
- Update comment with text from (dhansen)

Yu-cheng v30:
- Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELS > 2 (Borislav Petkov).

arch/x86/include/asm/pgtable.h | 35 ++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 67bd2627c293..b68428099932 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1195,6 +1195,21 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
+ /*
+ * Avoid accidentally creating shadow stack PTEs
+ * (Write=0,Dirty=1). Use cmpxchg() to prevent races with
+ * the hardware setting Dirty=1.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+ pte_t old_pte, new_pte;
+
+ old_pte = READ_ONCE(*ptep);
+ do {
+ new_pte = pte_wrprotect(old_pte);
+ } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+ return;
+ }
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}

@@ -1247,6 +1262,26 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp)
{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ /*
+ * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+ * to _PAGE_COW (see comments at pmd_wrprotect()).
+ * When a thread reads a RW=1, Dirty=0 PMD and before changing it
+ * to RW=0, Dirty=0, another thread could have written to the page
+ * and the PMD is RW=1, Dirty=1 now.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
+ pmd_t old_pmd, new_pmd;
+
+ old_pmd = READ_ONCE(*pmdp);
+ do {
+ new_pmd = pmd_wrprotect(old_pmd);
+ } while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+ return;
+ }
+#endif
+
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

--
2.17.1

2022-12-03 02:19:07

by Edgecombe, Rick P

Subject: [PATCH v4 10/39] x86/mm: Introduce _PAGE_COW

Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
in places where Linux normally creates Write=0,Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
Further differentiated by VMA flags, these PTE bit combinations would be
set as follows for various types of memory:

(Write=0,Cow=1,Dirty=0):
- A modified, copy-on-write (COW) page. Previously when a typical
anonymous writable mapping was made COW via fork(), the kernel would
mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
happens in copy_present_pte().
- A R/O page that has been COW'ed. The user page is in a R/O VMA,
and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
handler creates a copy of the page and sets the new copy's PTE as
Write=0 and Cow=1.
- A shared shadow stack PTE. When a shadow stack page is being shared
among processes (this happens at fork()), its PTE is made Dirty=0, so
the next shadow stack access causes a fault, and the page is
duplicated and Dirty=1 is set again. This is the COW equivalent for
shadow stack pages, even though it's copy-on-access rather than
copy-on-write.

(Write=0,Cow=0,Dirty=1):
- A shadow stack PTE.
- A Cow PTE created when a processor without shadow stack support set
Dirty=1.

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
because shadow stacks are not enabled there.

This is a preparatory patch. Changes to actually start marking _PAGE_COW
will follow once other pieces are in place.
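
As a rough sketch of the helpers this patch adds (the call sites are wired
up in later patches):

	pte = pte_mkcow(pte);		/* Write=0,Dirty=1 -> Write=0,Cow=1 */
	pte = pte_clear_cow(pte);	/* Write=0,Cow=1 -> Write=0,Dirty=1 */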

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Co-developed-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v4:
- Teach pte_flags_need_flush() about _PAGE_COW bit
- Break apart patch for better bisectability

v3:
- Add comment around _PAGE_TABLE in response to comment
from (Andrew Cooper)
- Check for PSE in pmd_shstk (Andrew Cooper)
- Get to the point quicker in commit log (Andrew Cooper)
- Clarify and reorder commit log for why the PTE bit examples have
multiple entries. Apply same changes for comment. (peterz)
- Fix comment that implied dirty bit for COW was a specific x86 thing
(peterz)
- Fix swapping of Write/Dirty (PeterZ)

v2:
- Update commit log with comments (Dave Hansen)
- Add comments in code to explain pte modification code better (Dave)
- Clarify info on the meaning of various Write,Cow,Dirty combinations

arch/x86/include/asm/pgtable.h | 78 ++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable_types.h | 59 +++++++++++++++++++--
arch/x86/include/asm/tlbflush.h | 3 +-
3 files changed, 134 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1d6f121ee35..ee5fbdc2615f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -300,6 +300,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
return native_make_pte(v & ~clear);
}

+/*
+ * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the case
+ * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since
+ * Dirty=1,Write=0 will result in the memory being treated as shadow stack
+ * by the HW. So when creating COW memory, the software bit _PAGE_BIT_COW
+ * is used instead. The following functions pte_mkcow() and pte_clear_cow()
+ * take a PTE marked conventionally COW (Dirty=1) and transition it to the
+ * shadow stack compatible version of COW (Cow=1).
+ */
+
+static inline pte_t pte_mkcow(pte_t pte)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pte;
+
+ pte = pte_clear_flags(pte, _PAGE_DIRTY);
+ return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+ /*
+ * _PAGE_COW is unnecessary on !X86_FEATURE_USER_SHSTK kernels.
+ * See the _PAGE_COW definition for more details.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pte;
+
+ /*
+ * PTE is getting copied-on-write, so it will be dirtied
+ * if writable, or made shadow stack if it is shadow stack
+ * memory being copied on access. Set the dirty bit for both
+ * cases.
+ */
+ pte = pte_set_flags(pte, _PAGE_DIRTY);
+ return pte_clear_flags(pte, _PAGE_COW);
+}
+
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
@@ -396,6 +434,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
return native_make_pmd(v & ~clear);
}

+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pmd;
+
+ pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+ return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pmd;
+
+ pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+ return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pmd_uffd_wp(pmd_t pmd)
{
@@ -467,6 +525,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
return native_make_pud(v & ~clear);
}

+/* See comments above pte_mkcow() */
+static inline pud_t pud_mkcow(pud_t pud)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pud;
+
+ pud = pud_clear_flags(pud, _PAGE_DIRTY);
+ return pud_set_flags(pud, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return pud;
+
+ pud = pud_set_flags(pud, _PAGE_DIRTY);
+ return pud_clear_flags(pud, _PAGE_COW);
+}
+
static inline pud_t pud_mkold(pud_t pud)
{
return pud_clear_flags(pud, _PAGE_ACCESSED);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0646ad00178b..5c3f942865d9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
#define _PAGE_BIT_SOFTW2 10 /* " */
#define _PAGE_BIT_SOFTW3 11 /* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
+#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
+#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
#define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
#define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
#define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4

+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW 0
+#endif
+
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
@@ -117,6 +127,40 @@
#define _PAGE_SOFTW4 (_AT(pteval_t, 0))
#endif

+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ *
+ * (Write=0,Cow=1,Dirty=0):
+ * - A modified, copy-on-write (COW) page. Previously when a typical
+ * anonymous writable mapping was made COW via fork(), the kernel would
+ * mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
+ * happens in copy_present_pte().
+ * - A R/O page that has been COW'ed. The user page is in a R/O VMA,
+ * and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
+ * handler creates a copy of the page and sets the new copy's PTE as
+ * Write=0 and Cow=1.
+ * - A shared shadow stack PTE. When a shadow stack page is being shared
+ * among processes (this happens at fork()), its PTE is made Dirty=0, so
+ * the next shadow stack access causes a fault, and the page is
+ * duplicated and Dirty=1 is set again. This is the COW equivalent for
+ * shadow stack pages, even though it's copy-on-access rather than
+ * copy-on-write.
+ *
+ * (Write=0,Cow=0,Dirty=1):
+ * - A shadow stack PTE.
+ * - A Cow PTE created when a processor without shadow stack support set
+ * Dirty=1.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW (_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

/*
@@ -186,12 +230,17 @@ enum page_cache_mode {
#define PAGE_READONLY __pg(__PP| 0|_USR|___A|__NX| 0| 0| 0)
#define PAGE_READONLY_EXEC __pg(__PP| 0|_USR|___A| 0| 0| 0| 0)

-#define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G)
-#define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G)
-#define _KERNPG_TABLE_NOENC (__PP|__RW| 0|___A| 0|___D| 0| 0)
-#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
+/*
+ * Page tables need to have Write=1 in order for any lower PTEs to be
+ * writable. This includes shadow stack memory (Write=0, Dirty=1)
+ */
#define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0)
#define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC)
+#define _KERNPG_TABLE_NOENC (__PP|__RW| 0|___A| 0|___D| 0| 0)
+#define _KERNPG_TABLE (__PP|__RW| 0|___A| 0|___D| 0| 0| _ENC)
+
+#define __PAGE_KERNEL (__PP|__RW| 0|___A|__NX|___D| 0|___G)
+#define __PAGE_KERNEL_EXEC (__PP|__RW| 0|___A| 0|___D| 0|___G)
#define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX| 0| 0|___G)
#define __PAGE_KERNEL_ROX (__PP| 0| 0|___A| 0| 0| 0|___G)
#define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 662598dea937..08d9b5fce4fd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -283,7 +283,8 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
_PAGE_ACCESSED;
const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
- _PAGE_SOFTW3 | _PAGE_SOFTW4;
+ _PAGE_SOFTW3 | _PAGE_SOFTW4 |
+ _PAGE_COW;
const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
_PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
_PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
--
2.17.1

2022-12-03 02:19:10

by Edgecombe, Rick P

Subject: [PATCH v4 07/39] x86: Add user control-protection fault handler

From: Yu-cheng Yu <[email protected]>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, when the return address for a RET instruction differs from the
copy on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT. Refactor this fault handler into separate user and kernel handlers,
like the page fault handler. Add a control-protection handler for usermode.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when
!cpu_feature_enabled(). This unifies the behavior with the new shadow stack
code, and also prevents the kernel from crashing in this situation, which
is potentially recoverable.

The control-protection fault handler works in a similar way as the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
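
For illustration, userspace could observe the new si_code with a handler
like this (minimal sketch; error handling omitted):

	#include <signal.h>
	#include <stdio.h>

	static void handler(int sig, siginfo_t *info, void *ucontext)
	{
		if (info->si_code == SEGV_CPERR)
			printf("#CP, si_addr: %p\n", info->si_addr);
	}

	/* ... */
	struct sigaction sa = { .sa_sigaction = handler,
				.sa_flags = SA_SIGINFO };
	sigaction(SIGSEGV, &sa, NULL);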

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Kerrisk <[email protected]>
---

v3:
- Shorten user/kernel #CP handler function names (peterz)
- Restore CP_ENDBR check to kernel handler (peterz)
- Utilize CONFIG_X86_CET (Kees)
- Unify "unexpected" warnings (Andrew Cooper)
- Use 2d array for error code chars (Andrew Cooper)
- Add comment about why to read SSP MSR before enabling interrupts

v2:
- Integrate with kernel IBT fault handler
- Update printed messages. (Dave)
- Remove array_index_nospec() usage. (Dave)
- Remove IBT messages. (Dave)
- Add enclave error code bit processing in case it can get triggered
somehow.
- Add extra "unknown" in control_protection_err.

v1:
- Update static asserts for NSIGSEGV

Yu-cheng v29:
- Remove pr_emerg() since it is followed by die().
- Change boot_cpu_has() to cpu_feature_enabled().

arch/arm/kernel/signal.c | 2 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 2 +-
arch/x86/kernel/idt.c | 2 +-
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 107 ++++++++++++++++++++---
arch/x86/xen/enlighten_pv.c | 2 +-
arch/x86/xen/xen-asm.S | 2 +-
include/uapi/asm-generic/siginfo.h | 3 +-
13 files changed, 114 insertions(+), 24 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 9ad911f1647c..81b13a21046e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1166,7 +1166,7 @@ void __init minsigstksz_setup(void)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
*/
static_assert(NSIGILL == 11);
static_assert(NSIGFPE == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
static_assert(NSIGBUS == 5);
static_assert(NSIGTRAP == 6);
static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 7a2954a16cb7..b7646b471537 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
#define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31))
#endif

+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT 0
+#else
+#define DISABLE_IBT (1 << (X86_FEATURE_IBT & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -128,7 +134,7 @@
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
DISABLE_ENQCMD)
#define DISABLED_MASK17 0
-#define DISABLED_MASK18 0
+#define DISABLED_MASK18 (DISABLE_IBT)
#define DISABLED_MASK19 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF, xenpv_exc_double_fault);
#endif

/* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
#endif

diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
ISTG(X86_TRAP_MC, asm_exc_machine_check, IST_INDEX_MCE),
#endif

-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
INTG(X86_TRAP_CP, asm_exc_control_protection),
#endif

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 879ef8c72f5c..d441804443d5 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
*/
BUILD_BUG_ON(NSIGILL != 11);
BUILD_BUG_ON(NSIGFPE != 15);
- BUILD_BUG_ON(NSIGSEGV != 9);
+ BUILD_BUG_ON(NSIGSEGV != 10);
BUILD_BUG_ON(NSIGBUS != 5);
BUILD_BUG_ON(NSIGTRAP != 6);
BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8b83d8fbce71..e35c70dc1afb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -213,12 +213,7 @@ DEFINE_IDTENTRY(exc_overflow)
do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
}

-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
+#ifdef CONFIG_X86_CET
enum cp_error_code {
CP_EC = (1 << 15) - 1,

@@ -231,15 +226,87 @@ enum cp_error_code {
CP_ENCL = 1 << 15,
};

-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char control_protection_err[][10] = {
+ [0] = "unknown",
+ [1] = "near ret",
+ [2] = "far/iret",
+ [3] = "endbranch",
+ [4] = "rstorssp",
+ [5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+ unsigned int cpec = error_code & CP_EC;
+
+ if (cpec >= ARRAY_SIZE(control_protection_err))
+ cpec = 0;
+ return control_protection_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+ WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+ user_mode(regs) ? "user mode" : "kernel mode",
+ cp_err_string(error_code));
+}
+#endif /* CONFIG_X86_CET */
+
+void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code);
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
{
- if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
- pr_err("Unexpected #CP\n");
- BUG();
+ struct task_struct *tsk;
+ unsigned long ssp;
+
+ /*
+ * An exception was just taken from userspace. Since interrupts are disabled
+ * here, no scheduling should have messed with the registers yet and they
+ * will be whatever is live in userspace. So read the SSP before enabling
+ * interrupts so locking the fpregs to do it later is not required.
+ */
+ rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+ cond_local_irq_enable(regs);
+
+ tsk = current;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_CP;
+
+ /* Ratelimit to prevent log spamming. */
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ __ratelimit(&cpf_rate)) {
+ pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+ tsk->comm, task_pid_nr(tsk),
+ regs->ip, regs->sp, ssp, error_code,
+ cp_err_string(error_code),
+ error_code & CP_ENCL ? " in enclave" : "");
+ print_vma_addr(KERN_CONT " in ", regs->ip);
+ pr_cont("\n");
}

- if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+ force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+ cond_local_irq_disable(regs);
+}
+#endif
+
+void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code);
+
+#ifdef CONFIG_X86_KERNEL_IBT
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
+
+void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ if ((error_code & CP_EC) != CP_ENDBR) {
+ do_unexpected_cp(regs, error_code);
return;
+ }

if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
regs->ax = 0;
@@ -285,9 +352,25 @@ static int __init ibt_setup(char *str)
}

__setup("ibt=", ibt_setup);
-
#endif /* CONFIG_X86_KERNEL_IBT */

+#ifdef CONFIG_X86_CET
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+ if (user_mode(regs)) {
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ do_user_cp_fault(regs, error_code);
+ else
+ do_unexpected_cp(regs, error_code);
+ } else {
+ if (cpu_feature_enabled(X86_FEATURE_IBT))
+ do_kernel_cp_fault(regs, error_code);
+ else
+ do_unexpected_cp(regs, error_code);
+ }
+}
+#endif /* CONFIG_X86_CET */
+
#ifdef CONFIG_X86_F00F_BUG
void handle_invalid_op(struct pt_regs *regs)
#else
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index a7d83c7800e4..e58d6cd30853 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -639,7 +639,7 @@ static struct trap_array_entry trap_array[] = {
TRAP_ENTRY(exc_coprocessor_error, false ),
TRAP_ENTRY(exc_alignment_check, false ),
TRAP_ENTRY(exc_simd_coprocessor_error, false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
TRAP_ENTRY(exc_control_protection, false ),
#endif
};
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
xen_pv_trap asm_exc_spurious_interrupt_bug
xen_pv_trap asm_exc_coprocessor_error
xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
xen_pv_trap asm_exc_control_protection
#endif
#ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
#define SEGV_ADIPERR 7 /* Precise MCD exception */
#define SEGV_MTEAERR 8 /* Asynchronous ARM MTE error */
#define SEGV_MTESERR 9 /* Synchronous ARM MTE exception */
-#define NSIGSEGV 9
+#define SEGV_CPERR 10 /* Control protection fault */
+#define NSIGSEGV 10

/*
* SIGBUS si_codes
--
2.17.1

2022-12-03 02:19:17

by Edgecombe, Rick P

Subject: [PATCH v4 14/39] mm: Move VM_UFFD_MINOR_BIT from 37 to 38

From: Yu-cheng Yu <[email protected]>

The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

Future patches will introduce a new VM flag VM_SHADOW_STACK that will be
VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_0 through VM_HIGH_ARCH_BIT_4 are
bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake
of order, make all VM_HIGH_ARCH_BITs stay together by moving
VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be
introduced as bit 37.
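
The resulting high VM flag bit layout is then:

	bits 32-36:	VM_HIGH_ARCH_BIT_0 .. VM_HIGH_ARCH_BIT_4
	bit 37:		VM_HIGH_ARCH_BIT_5 (VM_SHADOW_STACK, added later)
	bit 38:		VM_UFFD_MINOR_BIT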

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Peter Xu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Mike Kravetz <[email protected]>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bfac5a166cb8..9ae6bc38d0b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -354,7 +354,7 @@ extern unsigned int kobjsize(const void *objp);
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT 37
+# define VM_UFFD_MINOR_BIT 38
# define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */
#else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
# define VM_UFFD_MINOR VM_NONE
--
2.17.1

2022-12-03 02:20:14

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v4 06/39] x86/fpu: Add helper for modifying xstate

Just like user xfeatures, supervisor xfeatures can be active in the
registers or present in the task FPU buffer. If the registers are
active, the registers can be modified directly. If the registers are
not active, the modification must be performed on the task FPU buffer.

When the state is not active, the kernel could perform modifications
directly to the buffer. But in order for it to do that, it needs
to know where in the buffer the specific state it wants to modify is
located. Doing this is not robust against optimizations that compact
the FPU buffer, as each access would require computing where in the
buffer it is.

The easiest way to modify supervisor xfeature data is to force restore
the registers and write directly to the MSRs. Often times this is just fine
anyway as the registers need to be restored before returning to userspace.
Do this for now, leaving buffer writing optimizations for the future.

Add a new function fpregs_lock_and_load() that can simultaneously call
fpregs_lock() and do this restore. Also perform some extra sanity
checks in this function since this will be used in non-fpu focused code.
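
A caller modifying supervisor xfeature data would then follow a pattern
like this sketch (the MSR and bit names are the ones used by the shadow
stack patches later in this series):

	u64 msr_val;

	fpregs_lock_and_load();		/* force-restore xstate to registers and pin the fpregs */
	rdmsrl(MSR_IA32_U_CET, msr_val);
	wrmsrl(MSR_IA32_U_CET, msr_val | CET_WRSS_EN);	/* modify the live register */
	fpregs_unlock();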

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v3:
- Rename to fpregs_lock_and_load() to match the unlocking
fpregs_unlock(). (Kees)
- Elaborate in comment about helper. (Dave)

v2:
- Drop optimization of writing directly the buffer, and change API
accordingly.
- fpregs_lock_and_load() suggested by tglx
- Some commit log verbiage from dhansen

v1:
- New patch.

arch/x86/include/asm/fpu/api.h | 9 +++++++++
arch/x86/kernel/fpu/core.c | 19 +++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 503a577814b2..aadc6893dcaa 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -82,6 +82,15 @@ static inline void fpregs_unlock(void)
preempt_enable();
}

+/*
+ * FPU state gets lazily restored before returning to userspace. So when in the
+ * kernel, the valid FPU state may be kept in the buffer. This function will force
+ * restore all the fpu state to the registers early if needed, and lock them from
+ * being automatically saved/restored. Then FPU state can be modified safely in the
+ * registers, before unlocking with fpregs_unlock().
+ */
+void fpregs_lock_and_load(void);
+
#ifdef CONFIG_X86_DEBUG_FPU
extern void fpregs_assert_state_consistent(void);
#else
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 9baa89a8877d..9af78e9d92a0 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -753,6 +753,25 @@ void switch_fpu_return(void)
}
EXPORT_SYMBOL_GPL(switch_fpu_return);

+void fpregs_lock_and_load(void)
+{
+ /*
+ * fpregs_lock() only disables preemption (mostly). So modifing state
+ * in an interrupt could screw up some in progress fpregs operation,
+ * but appear to work. Warn about it.
+ */
+ WARN_ON_ONCE(!irq_fpu_usable());
+ WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+ fpregs_lock();
+
+ fpregs_assert_state_consistent();
+
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpregs_restore_userregs();
+}
+EXPORT_SYMBOL_GPL(fpregs_lock_and_load);
+
#ifdef CONFIG_X86_DEBUG_FPU
/*
* If current FPU state according to its tracking (loaded FPU context on this
--
2.17.1

2022-12-03 02:21:34

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v4 17/39] x86/mm: Update maybe_mkwrite() for shadow stack

From: Yu-cheng Yu <[email protected]>

When serving a page fault, maybe_mkwrite() makes a PTE writable if there is
a write access to it, and its vma has VM_WRITE. Shadow stack accesses to
shadow stack vma's are also treated as write accesses by the fault handler.
This is because setting shadow stack memory makes it writable via some
instructions, so COW has to happen even for shadow stack reads.

So maybe_mkwrite() should continue to set VM_WRITE vma's as normally
writable, but also set VM_WRITE|VM_SHADOW_STACK vma's as shadow stack.

Do this by adding a pte_mkwrite_shstk() and a cross-arch stub. Check for
VM_SHADOW_STACK in maybe_mkwrite() and call pte_mkwrite_shstk()
accordingly.

Apply the same changes to maybe_pmd_mkwrite().
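
Summarized, the intended outcomes are (a sketch, not literal code):

	vma->vm_flags			resulting PTE
	-------------			-------------
	!VM_WRITE			unchanged
	VM_WRITE			writable, via pte_mkwrite()
	VM_WRITE | VM_SHADOW_STACK	shadow stack (Write=0,Dirty=1), via pte_mkwrite_shstk()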

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

v3:
- Remove unneeded define for maybe_mkwrite (Peterz)
- Switch to cleaner version of maybe_mkwrite() (Peterz)

v2:
- Change to handle shadow stacks that are VM_WRITE|VM_SHADOW_STACK
- Ditch arch specific maybe_mkwrite(), and make the code generic
- Move do_anonymous_page() to next patch (Kirill)

Yu-cheng v29:
- Remove likely()'s.

arch/x86/include/asm/pgtable.h | 2 ++
include/linux/mm.h | 13 ++++++++++---
include/linux/pgtable.h | 14 ++++++++++++++
mm/huge_memory.c | 10 +++++++---
4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 4a149cec0c07..e4530b39f378 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -420,6 +420,7 @@ static inline pte_t pte_mkdirty(pte_t pte)
return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
}

+#define pte_mkwrite_shstk pte_mkwrite_shstk
static inline pte_t pte_mkwrite_shstk(pte_t pte)
{
/* pte_clear_cow() also sets Dirty=1 */
@@ -556,6 +557,7 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
}

+#define pmd_mkwrite_shstk pmd_mkwrite_shstk
static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
{
return pmd_clear_cow(pmd);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b982c2749e7b..f10797a1b236 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1004,12 +1004,19 @@ void free_compound_page(struct page *page);
* servicing faults for write access. In the normal case, do always want
* pte_mkwrite. But get_user_pages can cause write faults for mappings
* that do not have writing enabled, when used by access_process_vm.
+ *
+ * If a vma is shadow stack (a type of writable memory), mark the pte shadow
+ * stack.
*/
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
- if (likely(vma->vm_flags & VM_WRITE))
- pte = pte_mkwrite(pte);
- return pte;
+ if (!(vma->vm_flags & VM_WRITE))
+ return pte;
+
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return pte_mkwrite_shstk(pte);
+
+ return pte_mkwrite(pte);
}

vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 70e2a7e06a76..d8096578610a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -524,6 +524,13 @@ static inline pte_t pte_sw_mkyoung(pte_t pte)
#define pte_mk_savedwrite pte_mkwrite
#endif

+#ifndef pte_mkwrite_shstk
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+ return pte;
+}
+#endif
+
#ifndef pte_clear_savedwrite
#define pte_clear_savedwrite pte_wrprotect
#endif
@@ -532,6 +539,13 @@ static inline pte_t pte_sw_mkyoung(pte_t pte)
#define pmd_savedwrite pmd_write
#endif

+#ifndef pmd_mkwrite_shstk
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+ return pmd;
+}
+#endif
+
#ifndef pmd_mk_savedwrite
#define pmd_mk_savedwrite pmd_mkwrite
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 811d19b5c4f6..60451e588955 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -553,9 +553,13 @@ __setup("transparent_hugepage=", setup_transparent_hugepage);

pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
- if (likely(vma->vm_flags & VM_WRITE))
- pmd = pmd_mkwrite(pmd);
- return pmd;
+ if (!(vma->vm_flags & VM_WRITE))
+ return pmd;
+
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return pmd_mkwrite_shstk(pmd);
+
+ return pmd_mkwrite(pmd);
}

#ifdef CONFIG_MEMCG
--
2.17.1

2022-12-03 02:30:42

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v4 31/39] x86/shstk: Support wrss for userspace

For the current shadow stack implementation, shadow stack contents can't
easily be provisioned with arbitrary data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, wrss, which
can be enabled to write directly to shadow stack permissioned memory from
userspace. Allow it to get enabled via the prctl interface.

Only enable the userspace wrss instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

From a fault handler perspective, WRSS will behave very similarly to WRUSS,
which is treated like a user access from a #PF err code perspective.
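
From userspace, the flow would look something like this sketch
(ARCH_SHSTK_ENABLE being the enable command of the arch_prctl()
interface wired up later in this series):

	/* shadow stack must be enabled first, or enabling wrss fails with -EPERM */
	syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
	syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_WRSS);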

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
---

v3:
- Make wrss_control() static
- Fix verbiage in commit log (Kees)

v2:
- Add some commit log verbiage from (Dave Hansen)

v1:
- New patch.

arch/x86/include/uapi/asm/prctl.h | 1 +
arch/x86/kernel/shstk.c | 31 ++++++++++++++++++++++++++++++-
2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index fc97ca7c4884..f13751c6bae4 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -33,5 +33,6 @@

/* ARCH_SHSTK_ features bits */
#define ARCH_SHSTK_SHSTK (1ULL << 0)
+#define ARCH_SHSTK_WRSS (1ULL << 1)

#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 8f329c22728a..e59544fec96d 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -364,6 +364,35 @@ void shstk_free(struct task_struct *tsk)
unmap_shadow_stack(shstk->base, shstk->size);
}

+static int wrss_control(bool enable)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ /*
+ * Only enable wrss if shadow stack is enabled. If shadow stack is not
+ * enabled, wrss will already be disabled, so don't bother clearing it
+ * when disabling.
+ */
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return -EPERM;
+
+ /* Already enabled/disabled? */
+ if (features_enabled(ARCH_SHSTK_WRSS) == enable)
+ return 0;
+
+ fpregs_lock_and_load();
+ if (enable) {
+ set_clr_bits_msrl(MSR_IA32_U_CET, CET_WRSS_EN, 0);
+ features_set(ARCH_SHSTK_WRSS);
+ } else {
+ set_clr_bits_msrl(MSR_IA32_U_CET, 0, CET_WRSS_EN);
+ features_clr(ARCH_SHSTK_WRSS);
+ }
+ fpregs_unlock();
+
+ return 0;
+}

static int shstk_disable(void)
{
@@ -381,7 +410,7 @@ static int shstk_disable(void)
fpregs_unlock();

shstk_free(current);
- features_clr(ARCH_SHSTK_SHSTK);
+ features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);

return 0;
}
--
2.17.1

2022-12-03 02:31:28

by Edgecombe, Rick P

[permalink] [raw]
Subject: [PATCH v4 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks

From: Yu-cheng Yu <[email protected]>

The Control-Flow Enforcement Technology contains two related features,
one of which is Shadow Stacks. Future patches will utilize this feature
for shadow stack support in KVM, so add CPU feature flags for Shadow
Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).

To protect shadow stack state from malicious modification, the registers
are only accessible in supervisor mode. This implementation
context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
on XSAVES.

The shadow stack feature, enumerated by the CPUID bit described above,
encompasses both supervisor and userspace support for shadow stack. In
near future patches, only userspace shadow stack will be enabled. In
expectation of future supervisor shadow stack support, create a software
CPU capability to enumerate kernel utilization of userspace shadow stack
support. This will also allow for userspace shadow stack to be disabled,
while leaving the shadow stack hardware capability exposed in the cpuinfo
proc. This user shadow stack bit should depend on the HW "shstk"
capability and that logic will be implemented in future patches.
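
Kernel code can then key off the software capability rather than the raw
CPUID bit, for example (the pattern used by later patches in this
series):

	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
		return -EOPNOTSUPP;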

Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Co-developed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Cc: Kees Cook <[email protected]>
---

v3:
- Add user specific shadow stack cpu cap (Andrew Cooper)
- Drop reviewed-bys from Boris and Kees due to the above change.

v2:
- Remove IBT reference in commit log (Kees)
- Describe xsaves dependency using text from (Dave)

v1:
- Remove IBT, can be added in a follow on IBT series.

Yu-cheng v25:
- Make X86_FEATURE_IBT depend on X86_FEATURE_SHSTK.

arch/x86/include/asm/cpufeatures.h | 2 ++
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 11a0e06362e4..aab7fa4104d7 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -307,6 +307,7 @@
#define X86_FEATURE_SGX_EDECCSSA (11*32+18) /* "" SGX EDECCSSA user leaf function */
#define X86_FEATURE_CALL_DEPTH (11*32+19) /* "" Call depth tracking for RSB stuffing */
#define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
+#define X86_FEATURE_USER_SHSTK (11*32+21) /* Shadow stack support for user mode applications */

/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
@@ -369,6 +370,7 @@
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
+#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
#define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index c44b56f7ffba..7a2954a16cb7 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define DISABLE_USER_SHSTK 0
+#else
+#define DISABLE_USER_SHSTK (1 << (X86_FEATURE_USER_SHSTK & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -114,7 +120,7 @@
#define DISABLED_MASK9 (DISABLE_SGX)
#define DISABLED_MASK10 0
#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
- DISABLE_CALL_DEPTH_TRACKING)
+ DISABLE_CALL_DEPTH_TRACKING|DISABLE_USER_SHSTK)
#define DISABLED_MASK12 0
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index d95221117129..c3e4e5246df9 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -79,6 +79,7 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_XFD, X86_FEATURE_XSAVES },
{ X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
{ X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
{}
};

--
2.17.1

2022-12-03 02:39:06

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

On Fri, Dec 02, 2022 at 04:35:35PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
> shadow stack pages.
>
> In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
> if HW dirty tracking is not needed, because if the Dirty bit is not already
> set the CPU has to set Dirty=1 when it the memory gets written to. This
> creates addiontal work for the CPU. So tradional wisdom was to simply set
> the Dirty bit whenever you didn't care about it. However, it was never
> really very helpful for read only kernel memory.
>
> When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
> such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
> avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
> functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
> any functional benefit either, so to reduce ambiguity between shadow stack
> and regular Write=0 pages, removed Dirty=1 from any kernel Write=0 PTEs.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 02:39:55

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Fri, Dec 02, 2022 at 04:35:34PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> A control-protection fault is triggered when a control-flow transfer
> attempt violates Shadow Stack or Indirect Branch Tracking constraints.
> For example, the return address for a RET instruction differs from the copy
> on the shadow stack.
>
> There already exists a control-protection fault handler for handling kernel
> IBT. Refactor this fault handler into sparate user and kernel handlers,
> like the page fault handler. Add a control-protection handler for usermode.
>
> Keep the same behavior for the kernel side of the fault handler, except for
> converting a BUG to a WARN in the case of a #CP happening when
> !cpu_feature_enabled(). This unifies the behavior with the new shadow stack
> code, and also prevents the kernel from crashing under this situation which
> is potentially recoverable.
>
> The control-protection fault handler works in a similar way as the general
> protection fault handler. It provides the si_code SEGV_CPERR to the signal
> handler.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> Co-developed-by: Rick Edgecombe <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

This looks much cleaner to me. Thanks!

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 02:40:22

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 06/39] x86/fpu: Add helper for modifying xstate

On Fri, Dec 02, 2022 at 04:35:33PM -0800, Rick Edgecombe wrote:
> Just like user xfeatures, supervisor xfeatures can be active in the
> registers or present in the task FPU buffer. If the registers are
> active, the registers can be modified directly. If the registers are
> not active, the modification must be performed on the task FPU buffer.
>
> When the state is not active, the kernel could perform modifications
> directly to the buffer. But in order for it to do that, it needs
> to know where in the buffer the specific state it wants to modify is
> located. Doing this is not robust against optimizations that compact
> the FPU buffer, as each access would require computing where in the
> buffer it is.
>
> The easiest way to modify supervisor xfeature data is to force restore
> the registers and write directly to the MSRs. Often times this is just fine
> anyway as the registers need to be restored before returning to userspace.
> Do this for now, leaving buffer writing optimizations for the future.
>
> Add a new function fpregs_lock_and_load() that can simultaneously call
> fpregs_lock() and do this restore. Also perform some extra sanity
> checks in this function since this will be used in non-fpu focused code.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Suggested-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 02:41:52

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 17/39] x86/mm: Update maybe_mkwrite() for shadow stack

On Fri, Dec 02, 2022 at 04:35:44PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> When serving a page fault, maybe_mkwrite() makes a PTE writable if there is
> a write access to it, and its vma has VM_WRITE. Shadow stack accesses to
> shadow stack vma's are also treated as write accesses by the fault handler.
> This is because setting shadow stack memory makes it writable via some
> instructions, so COW has to happen even for shadow stack reads.
>
> So maybe_mkwrite() should continue to set VM_WRITE vma's as normally
> writable, but also set VM_WRITE|VM_SHADOW_STACK vma's as shadow stack.
>
> Do this by adding a pte_mkwrite_shstk() and a cross-arch stub. Check for
> VM_SHADOW_STACK in maybe_mkwrite() and call pte_mkwrite_shstk()
> accordingly.
>
> Apply the same changes to maybe_pmd_mkwrite().
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 02:46:47

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

On Fri, Dec 02, 2022 at 04:35:39PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> When Shadow Stack is in use, Write=0,Dirty=1 PTE are reserved for shadow
> stack. Copy-on-write PTes then have Write=0,Cow=1.
>
> When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
> become a transient shadow stack PTE in two cases:
>
> The first case is that some processors can start a write but end up seeing
> a Write=0 PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE. However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
>
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
>
> In the case of pmdp_set_wrprotect(), for nopmd configs the ->pmd operated
> on does not exist and the logic would need to be different. Although the
> extra functionality will normally be optimized out when user shadow
> stacks are not configured, also exclude it in the preprocessor stage so
> that it will still compile. User shadow stack is not supported there by
> Linux anyway. Leave the cpu_feature_enabled() check so that the
> functionality also disables based on runtime detection of the feature.
>
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue. Jann Horn provided the cmpxchg solution.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 02:56:03

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks

On Fri, Dec 02, 2022 at 04:35:30PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> The Control-Flow Enforcement Technology contains two related features,
> one of which is Shadow Stacks. Future patches will utilize this feature
> for shadow stack support in KVM, so add CPU feature flags for Shadow
> Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).
>
> To protect shadow stack state from malicious modification, the registers
> are only accessible in supervisor mode. This implementation
> context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend
> on XSAVES.
>
> The shadow stack feature, enumerated by the CPUID bit described above,
> encompasses both supervisor and userspace support for shadow stack. In
> near future patches, only userspace shadow stack will be enabled. In
> expectation of future supervisor shadow stack support, create a software
> CPU capability to enumerate kernel utilization of userspace shadow stack
> support. This will also allow for userspace shadow stack to be disabled,
> while leaving the shadow stack hardware capability exposed in the cpuinfo
> proc. This user shadow stack bit should depend on the HW "shstk"
> capability and that logic will be implemented in future patches.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 02:59:32

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 31/39] x86/shstk: Support wrss for userspace

On Fri, Dec 02, 2022 at 04:35:58PM -0800, Rick Edgecombe wrote:
> For the current shadow stack implementation, shadow stack contents can't
> easily be provisioned with arbitrary data. This property helps apps
> protect themselves better, but also restricts any potential apps that may
> want to do exotic things at the expense of a little security.
>
> The x86 shadow stack feature introduces a new instruction, wrss, which
> can be enabled to write directly to shadow stack permissioned memory from
> userspace. Allow it to get enabled via the prctl interface.
>
> Only enable the userspace wrss instruction, which allows writes to
> userspace shadow stacks from userspace. Do not allow it to be enabled
> independently of shadow stack, as HW does not support using WRSS when
> shadow stack is disabled.
>
> From a fault handler perspective, WRSS will behave very similarly to WRUSS,
> which is treated like a user access from a #PF err code perspective.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Signed-off-by: Rick Edgecombe <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 03:07:14

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 10/39] x86/mm: Introduce _PAGE_COW

On Fri, Dec 02, 2022 at 04:35:37PM -0800, Rick Edgecombe wrote:
> Some OSes have a greater dependence on software available bits in PTEs than
> Linux. That left the hardware architects looking for a way to represent a
> new memory type (shadow stack) within the existing bits. They chose to
> repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
> shadow stack memory, Linux should avoid creating memory with this PTE bit
> combination unless it intends for it to be shadow stack.
>
> The reason it's lightly used is that Dirty=1 is normally set by HW
> _before_ a write. A write with a Write=0 PTE would typically only generate
> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
> supports shadow stacks will no longer exhibit this oddity.
>
> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
> in places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
> Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
> Further differentiated by VMA flags, these PTE bit combinations would be
> set as follows for various types of memory:
>
> (Write=0,Cow=1,Dirty=0):
> - A modified, copy-on-write (COW) page. Previously when a typical
> anonymous writable mapping was made COW via fork(), the kernel would
> mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
> happens in copy_present_pte().
> - A R/O page that has been COW'ed. The user page is in a R/O VMA,
> and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
> handler creates a copy of the page and sets the new copy's PTE as
> Write=0 and Cow=1.
> - A shared shadow stack PTE. When a shadow stack page is being shared
> among processes (this happens at fork()), its PTE is made Dirty=0, so
> the next shadow stack access causes a fault, and the page is
> duplicated and Dirty=1 is set again. This is the COW equivalent for
> shadow stack pages, even though it's copy-on-access rather than
> copy-on-write.
>
> (Write=0,Cow=0,Dirty=1):
> - A shadow stack PTE.
> - A Cow PTE created when a processor without shadow stack support set
> Dirty=1.
>
> There are six bits left available to software in the 64-bit PTE after
> consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
> because shadow stacks are not enabled there.
>
> This is a prepratory patch. Changes to actually start marking _PAGE_COW
> will follow once other pieces are in place.
>
> Tested-by: Pengfei Xu <[email protected]>
> Tested-by: John Allen <[email protected]>
> Co-developed-by: Yu-cheng Yu <[email protected]>
> Signed-off-by: Yu-cheng Yu <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

--
Kees Cook

2022-12-03 23:09:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 33/39] x86: Prevent 32 bit operations for 64 bit shstk tasks

On Fri, Dec 2, 2022 at 4:44 PM Rick Edgecombe
<[email protected]> wrote:
>

> So since 32 bit is not easy to support, and there are likely not many
> users. More cleanly don't support 32 bit signals in a 64 bit address
> space by not allowing 32 bit ABI signal handlers when shadow stack is
> enabled. Do this by clearing any 32 bit ABI signal handlers when shadow
> stack is enabled, and disallow any further 32 bit ABI signal handlers.
> Also, return an error code for the clone operations when in a 32 bit
> syscall.
>

This seems unfortunate. The result will be a highly incomprehensible
crash. Maybe instead deny enabling shadow stack in the first place?
Or at least pr_warn_once if anything gets flushed.

2022-12-04 21:12:58

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 33/39] x86: Prevent 32 bit operations for 64 bit shstk tasks

On Sat, 2022-12-03 at 14:49 -0800, Andy Lutomirski wrote:
> On Fri, Dec 2, 2022 at 4:44 PM Rick Edgecombe
> <[email protected]> wrote:
> >
> > So since 32 bit is not easy to support, and there are likely not many
> > users. More cleanly don't support 32 bit signals in a 64 bit address
> > space by not allowing 32 bit ABI signal handlers when shadow stack is
> > enabled. Do this by clearing any 32 bit ABI signal handlers when shadow
> > stack is enabled, and disallow any further 32 bit ABI signal handlers.
> > Also, return an error code for the clone operations when in a 32 bit
> > syscall.
> >
>
> This seems unfortunate. The result will be a highly incomprehensible
> crash. Maybe instead deny enabling shadow stack in the first place?
> Or at least pr_warn_once if anything gets flushed.

Thanks for the suggestion! Denying seems much better, I'll change it.

2022-12-07 11:14:51

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks

On Fri, Dec 02, 2022 at 04:35:30PM -0800, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 11a0e06362e4..aab7fa4104d7 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -307,6 +307,7 @@
> #define X86_FEATURE_SGX_EDECCSSA (11*32+18) /* "" SGX EDECCSSA user leaf function */
> #define X86_FEATURE_CALL_DEPTH (11*32+19) /* "" Call depth tracking for RSB stuffing */
> #define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
> +#define X86_FEATURE_USER_SHSTK (11*32+21) /* Shadow stack support for user mode applications */
>
> /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
> #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
> @@ -369,6 +370,7 @@
> #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
> #define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
> #define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
> +#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
> #define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
> #define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
> #define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */

What is the end goal here?

To have X86_FEATURE_KERNEL_SHSTK or so someday too?

If so, then the hardware bit X86_FEATURE_SHSTK should be hidden in
/proc/cpuinfo, i.e.,

#define X86_FEATURE_SHSTK (16*32+ 7) /* "" Shadow Stack hardware support */

note the "", otherwise you'll have people go:

"hey, I have "shstk" in /proc/cpuinfo on my machine. Why doesn't it do
anything?"

Or am I misreading where this is headed?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-07 22:54:11

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks

On Wed, 2022-12-07 at 12:00 +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 04:35:30PM -0800, Rick Edgecombe wrote:
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 11a0e06362e4..aab7fa4104d7 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -307,6 +307,7 @@
> > #define X86_FEATURE_SGX_EDECCSSA (11*32+18) /* "" SGX EDECCSSA user leaf function */
> > #define X86_FEATURE_CALL_DEPTH (11*32+19) /* "" Call depth tracking for RSB stuffing */
> > #define X86_FEATURE_MSR_TSX_CTRL (11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
> > +#define X86_FEATURE_USER_SHSTK (11*32+21) /* Shadow stack support for user mode applications */
> >
> > /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
> > #define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
> > @@ -369,6 +370,7 @@
> > #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
> > #define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
> > #define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
> > +#define X86_FEATURE_SHSTK (16*32+ 7) /* Shadow Stack */
> > #define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
> > #define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
> > #define X86_FEATURE_VPCLMULQDQ (16*32+10) /* Carry-Less Multiplication Double Quadword */
>
> What is the end goal here?
>
> To have X86_FEATURE_KERNEL_SHSTK or so someday too?
>
> If so, then the hardware bit X86_FEATURE_SHSTK should be hidden in
> /proc/cpuinfo, i.e.,
>
> #define X86_FEATURE_SHSTK (16*32+ 7) /* "" Shadow Stack hardware support */
>
> note the "", otherwise you'll have people go:
>
> "hey, I have "shstk" in /proc/cpuinfo on my machine. Why doesn't it do
> anything?"
>
> Or am I misreading where this is headed?

Yes, the suggestion was to have one for kernel and one for user. But I
was also thinking about how KVM could hypothetically support shadow
stack in guests in the !CONFIG_X86_USER_SHADOW_STACK case (it only
needs CET_U xsave support). So that configuration wouldn't expose
user_shstk and since KVM's guest feature support is retrieved
programmatically, it could be nice to have some hint for KVM users that
they could try. Maybe it's simpler to just tie KVM and host support
together though. I'll remove "shstk".

2022-12-08 11:56:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 03/39] x86/cpufeatures: Add CPU feature flags for shadow stacks

On Wed, Dec 07, 2022 at 10:35:59PM +0000, Edgecombe, Rick P wrote:
> Yes, the suggestion was to have one for kernel and one for user. But I
> was also thinking about how KVM could hypothetically support shadow
> stack in guests in the !CONFIG_X86_USER_SHADOW_STACK case (it only
> needs CET_U xsave support). So that configuration wouldn't expose
> user_shstk and since KVM's guest feature support is retrieved
> programmatically, it could be nice to have some hint for KVM users that
> they could try. Maybe it's simpler to just tie KVM and host support
> together though. I'll remove "shstk".

Hmm, I don't have a clear idea how guest shstk support should work, so
maybe this is all way off, but yeah, if the host supports CET - the
*hardware* feature - then you can use the same logic to support that in
a VM.

I.e., if the guest sees CET - i.e., HV has advertized it - then guest
kernel behaves exactly the same as on the host.

But it is likely I'm missing something more involved...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-15 01:06:05

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 33/39] x86: Prevent 32 bit operations for 64 bit shstk tasks

On Sun, 2022-12-04 at 12:51 -0800, Rick Edgecombe wrote:
> On Sat, 2022-12-03 at 14:49 -0800, Andy Lutomirski wrote:
> > On Fri, Dec 2, 2022 at 4:44 PM Rick Edgecombe
> > <[email protected]> wrote:
> > >
> > > So since 32 bit is not easy to support, and there are likely not many
> > > users. More cleanly don't support 32 bit signals in a 64 bit address
> > > space by not allowing 32 bit ABI signal handlers when shadow stack is
> > > enabled. Do this by clearing any 32 bit ABI signal handlers when shadow
> > > stack is enabled, and disallow any further 32 bit ABI signal handlers.
> > > Also, return an error code for the clone operations when in a 32 bit
> > > syscall.
> > >
> >
> > This seems unfortunate. The result will be a highly incomprehensible
> > crash. Maybe instead deny enabling shadow stack in the first place?
> > Or at least pr_warn_once if anything gets flushed.
>
> Thanks for the suggestion! Denying seems much better, I'll change it.

Argh, the solution only works in the normal case where the first task
enables shadow stack. Otherwise the process could:
1. Have two threads without shadow stack
2. Enable shadow stack in thread 1
3. Register 32 bit handler from thread 2
4. Handle 32 bit signal in thread 1

For this amount of special case ugliness, it should fix the whole
problem, I think.

Trying to fix it up by adding 32 bit signal blocking state into struct
sighand_struct, so it would actually be per-process, spills this into
core code. I think it might not be the best solution. I'm not sure what
is yet.

2022-12-20 12:36:24

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 06/39] x86/fpu: Add helper for modifying xstate

On Fri, Dec 02, 2022 at 04:35:33PM -0800, Rick Edgecombe wrote:
> +void fpregs_lock_and_load(void)

Fun naming :)

> +{
> + /*
> + * fpregs_lock() only disables preemption (mostly). So modifing state

Unknown word [modifing] in comment.
Suggestions: ['modifying',...

> + * in an interrupt could screw up some in progress fpregs operation,
> + * but appear to work. Warn about it.
> + */
> + WARN_ON_ONCE(!irq_fpu_usable());
> + WARN_ON_ONCE(current->flags & PF_KTHREAD);
> +
> + fpregs_lock();

So it locks them here...

/me goes further into the patchset

aha, and the counterpart of this function is fpregs_unlock() so
everything gets sandwiched between the two.

Ok, I guess.

> +EXPORT_SYMBOL_GPL(fpregs_lock_and_load);

Exported for KVM?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-20 16:40:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Fri, Dec 02, 2022 at 04:35:34PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> A control-protection fault is triggered when a control-flow transfer
> attempt violates Shadow Stack or Indirect Branch Tracking constraints.
> For example, the return address for a RET instruction differs from the copy
> on the shadow stack.
>
> There already exists a control-protection fault handler for handling kernel
> IBT. Refactor this fault handler into sparate user and kernel handlers,

Unknown word [sparate] in commit message.
Suggestions: ['separate',

You could use a spellchecker with your commit messages so that it
catches all those typos.

...

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 8b83d8fbce71..e35c70dc1afb 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -213,12 +213,7 @@ DEFINE_IDTENTRY(exc_overflow)
> do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
> }
>
> -#ifdef CONFIG_X86_KERNEL_IBT
> -
> -static __ro_after_init bool ibt_fatal = true;
> -
> -extern void ibt_selftest_ip(void); /* code label defined in asm below */
> -
> +#ifdef CONFIG_X86_CET
> enum cp_error_code {
> CP_EC = (1 << 15) - 1,
>
> @@ -231,15 +226,87 @@ enum cp_error_code {
> CP_ENCL = 1 << 15,
> };
>
> -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> +static const char control_protection_err[][10] = {

You already use the "cp_" prefix for the other things, might as well use
it here too.

> + [0] = "unknown",
> + [1] = "near ret",
> + [2] = "far/iret",
> + [3] = "endbranch",
> + [4] = "rstorssp",
> + [5] = "setssbsy",
> +};
> +
> +static const char *cp_err_string(unsigned long error_code)
> +{
> + unsigned int cpec = error_code & CP_EC;
> +
> + if (cpec >= ARRAY_SIZE(control_protection_err))
> + cpec = 0;
> + return control_protection_err[cpec];
> +}
> +
> +static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
> +{
> + WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
> + user_mode(regs) ? "user mode" : "kernel mode",
> + cp_err_string(error_code));
> +}
> +#endif /* CONFIG_X86_CET */
> +
> +void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code);

What's that forward declaration for?

In any case, push it into traps.h pls.

I gotta say, I'm not a big fan of that ifdeffery here. Do we really
really need it?

> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
> +
> +void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
> {
> - if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> - pr_err("Unexpected #CP\n");
> - BUG();
> + struct task_struct *tsk;
> + unsigned long ssp;
> +
> + /*
> + * An exception was just taken from userspace. Since interrupts are disabled
> + * here, no scheduling should have messed with the registers yet and they
> + * will be whatever is live in userspace. So read the SSP before enabling
> + * interrupts so locking the fpregs to do it later is not required.
> + */
> + rdmsrl(MSR_IA32_PL3_SSP, ssp);
> +
> + cond_local_irq_enable(regs);
> +
> + tsk = current;
> + tsk->thread.error_code = error_code;
> + tsk->thread.trap_nr = X86_TRAP_CP;
> +
> + /* Ratelimit to prevent log spamming. */
> + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> + __ratelimit(&cpf_rate)) {
> + pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
> + tsk->comm, task_pid_nr(tsk),
> + regs->ip, regs->sp, ssp, error_code,
> + cp_err_string(error_code),
> + error_code & CP_ENCL ? " in enclave" : "");
> + print_vma_addr(KERN_CONT " in ", regs->ip);
> + pr_cont("\n");
> }
>
> - if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
> + force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> + cond_local_irq_disable(regs);
> +}
> +#endif
> +
> +void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code);
> +
> +#ifdef CONFIG_X86_KERNEL_IBT
> +static __ro_after_init bool ibt_fatal = true;
> +
> +extern void ibt_selftest_ip(void); /* code label defined in asm below */

Yeah, pls put that comment above the function. Side comments are nasty.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-20 19:19:55

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

Just typos and spelling fixes:

On Fri, Dec 02, 2022 at 04:35:35PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
> shadow stack pages.
>
> In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
> if HW dirty tracking is not needed, because if the Dirty bit is not already
> set the CPU has to set Dirty=1 when it the memory gets written to. This

s/it //

> creates addiontal work for the CPU.

"additional"

> So tradional wisdom was to simply set

Unknown word [tradional] in commit message.
Suggestions: ['traditional', ...

> the Dirty bit whenever you didn't care about it. However, it was never
> really very helpful for read only kernel memory.

read-only

> When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
> such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
> avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
> functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
> any functional benefit either, so to reduce ambiguity between shadow stack
> and regular Write=0 pages, removed Dirty=1 from any kernel Write=0 PTEs.

s/removed/remove/

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-20 22:03:18

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 10/39] x86/mm: Introduce _PAGE_COW

On Fri, Dec 02, 2022 at 04:35:37PM -0800, Rick Edgecombe wrote:
> There are six bits left available to software in the 64-bit PTE after
> consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
> because shadow stacks are not enabled there.
>
> This is a prepratory patch. Changes to actually start marking _PAGE_COW

Unknown word [prepratory] in commit message.
Suggestions: ['preparatory',

> will follow once other pieces are in place.

And regardless, you don't really need this sentence at all, AFAICT.

...

> +/*
> + * Normally COW memory can result in Dirty=1,Write=0 PTs. But in the case
^^^

PTEs.

> + * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since the
> + * Dirty=1,Write=0 will result in the memory being treated as shaodw stack
> + * by the HW. So when creating COW memory, a software bit is used
> + * _PAGE_BIT_COW. The following functions pte_mkcow() and pte_clear_cow()
> + * take a PTE marked conventially COW (Dirty=1) and transition it to the

Unknown word [conventially] in comment.
Suggestions: ['conventionally', ...

> + * shadow stack compatible version of COW (Cow=1).
> + */
> +

^ Superfluous newline.

> +static inline pte_t pte_mkcow(pte_t pte)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pte;
> +
> + pte = pte_clear_flags(pte, _PAGE_DIRTY);
> + return pte_set_flags(pte, _PAGE_COW);
> +}
> +
> +static inline pte_t pte_clear_cow(pte_t pte)
> +{
> + /*
> + * _PAGE_COW is unnecessary on !X86_FEATURE_USER_SHSTK kernels.

I'm guessing this "unnecessary" is supposed to mean that on kernels not
supporting shadow stack, a COW page uses the old bit flags?

I.e., Dirty=1,Write=0?

Might as well write it this way to be perfectly clear.

> + * See the _PAGE_COW definition for more details.
> + */
> + if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> + return pte;
> +
> + /*
> + * PTE is getting copied-on-write, so it will be dirtied
> + * if writable, or made shadow stack if shadow stack and
> + * being copied on access. Set they dirty bit for both

"Set the dirty bit.."

> + * cases.
> + */
> + pte = pte_set_flags(pte, _PAGE_DIRTY);
> + return pte_clear_flags(pte, _PAGE_COW);
> +}

Rest looks ok.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-20 22:06:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Fri, Dec 02, 2022 at 04:35:34PM -0800, Rick Edgecombe wrote:
> Keep the same behavior for the kernel side of the fault handler, except for
> converting a BUG to a WARN in the case of a #CP happening when
> !cpu_feature_enabled().
^^^^^^^^^^^^^^^^^^^^^^^^^

Yeah, let's stick to plain english in the commit message pls, instead of
pseudo code.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-21 00:20:34

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 06/39] x86/fpu: Add helper for modifying xstate

On Tue, 2022-12-20 at 13:04 +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 04:35:33PM -0800, Rick Edgecombe wrote:
> > +void fpregs_lock_and_load(void)
>
> Fun naming :)

Yea. Suggested by Thomas.

>
> > +{
> > +       /*
> > +        * fpregs_lock() only disables preemption (mostly). So modifing state
>
> Unknown word [modifing] in comment.
> Suggestions: ['modifying',...

Sorry about this and the others. I get spelling errors in checkpatch,
so I must have dictionary issues or something.

>
> > +        * in an interrupt could screw up some in progress fpregs operation,
> > +        * but appear to work. Warn about it.
> > +        */
> > +       WARN_ON_ONCE(!irq_fpu_usable());
> > +       WARN_ON_ONCE(current->flags & PF_KTHREAD);
> > +
> > +       fpregs_lock();
>
> So it locks them here...
>
> /me goes further into the patchset
> aha, and the counterpart of this function is fpregs_unlock() so
> everything gets sandwiched between the two.
>
> Ok, I guess.
>
> > +EXPORT_SYMBOL_GPL(fpregs_lock_and_load);
>
> Exported for KVM?
>
Yes, the KVM series needed it. Part of the reasoning here was to
provide some helpers to avoid mistakes in modifying xstate, so the
general idea was that it should be available. I suppose that series
could add the export though?

2022-12-21 01:09:33

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Tue, 2022-12-20 at 17:19 +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 04:35:34PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > A control-protection fault is triggered when a control-flow transfer
> > attempt violates Shadow Stack or Indirect Branch Tracking constraints.
> > For example, the return address for a RET instruction differs from the copy
> > on the shadow stack.
> >
> > There already exists a control-protection fault handler for handling kernel
> > IBT. Refactor this fault handler into sparate user and kernel handlers,
>
> Unknown word [sparate] in commit message.
> Suggestions: ['separate',
>
> You could use a spellchecker with your commit messages so that it
> catches all those typos.

Argh, sorry. I'll upgrade my tools.

>
> ...
>
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index 8b83d8fbce71..e35c70dc1afb 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -213,12 +213,7 @@ DEFINE_IDTENTRY(exc_overflow)
> >         do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
> >  }
> >  
> > -#ifdef CONFIG_X86_KERNEL_IBT
> > -
> > -static __ro_after_init bool ibt_fatal = true;
> > -
> > -extern void ibt_selftest_ip(void); /* code label defined in asm below */
> > -
> > +#ifdef CONFIG_X86_CET
> >  enum cp_error_code {
> >         CP_EC        = (1 << 15) - 1,
> >  
> > @@ -231,15 +226,87 @@ enum cp_error_code {
> >         CP_ENCL      = 1 << 15,
> >  };
> >  
> > -DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
> > +static const char control_protection_err[][10] = {
>
> You already use the "cp_" prefix for the other things, might as well use
> it here too.

Sure.

>
> > +       [0] = "unknown",
> > +       [1] = "near ret",
> > +       [2] = "far/iret",
> > +       [3] = "endbranch",
> > +       [4] = "rstorssp",
> > +       [5] = "setssbsy",
> > +};
> > +
> > +static const char *cp_err_string(unsigned long error_code)
> > +{
> > +       unsigned int cpec = error_code & CP_EC;
> > +
> > +       if (cpec >= ARRAY_SIZE(control_protection_err))
> > +               cpec = 0;
> > +       return control_protection_err[cpec];
> > +}
> > +
> > +static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
> > +{
> > +       WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
> > +                    user_mode(regs) ? "user mode" : "kernel mode",
> > +                    cp_err_string(error_code));
> > +}
> > +#endif /* CONFIG_X86_CET */
> > +
> > +void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code);
>
> What's that forward declaration for?

The reason is cpu_feature_enabled(X86_FEATURE_USER_SHSTK) will compile-
time evaluate to false when user shadow stack is not configured, so a
do_user_cp_fault() implementation is not needed in that case. The
reference in exc_control_protection will get optimized out, but it
still needs to be defined.

Otherwise it could have a stub implementation in an #else block, but
this seemed cleaner.
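
i.e., the caller in exc_control_protection looks like:

	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
		do_user_cp_fault(regs, error_code);	/* call compiles away when the feature is off */
	else
		do_unexpected_cp(regs, error_code);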

>
> In any case, push it into traps.h pls.

It's not really supposed to be called from outside traps.c. The only
reason it is not static is because it doesn't work with these
IS_ENABLED()-style solutions for compiling out code. Now I'm wondering
if these are not preferred...

>
> I gotta say, I'm not a big fan of that ifdeffery here. Do we really
> really need it?

You mean having separate paths for kernel IBT and user shadow stack
that compile out? I guess it could just all be in place if
CONFIG_X86_CET is in place.

I don't know, I thought it was relatively clean, but I can remove it.

>
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
> > +                             DEFAULT_RATELIMIT_BURST);
> > +
> > +void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
> >  {
> > -       if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
> > -               pr_err("Unexpected #CP\n");
> > -               BUG();
> > +       struct task_struct *tsk;
> > +       unsigned long ssp;
> > +
> > +       /*
> > +        * An exception was just taken from userspace. Since interrupts are disabled
> > +        * here, no scheduling should have messed with the registers yet and they
> > +        * will be whatever is live in userspace. So read the SSP before enabling
> > +        * interrupts so locking the fpregs to do it later is not required.
> > +        */
> > +       rdmsrl(MSR_IA32_PL3_SSP, ssp);
> > +
> > +       cond_local_irq_enable(regs);
> > +
> > +       tsk = current;
> > +       tsk->thread.error_code = error_code;
> > +       tsk->thread.trap_nr = X86_TRAP_CP;
> > +
> > +       /* Ratelimit to prevent log spamming. */
> > +       if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
> > +           __ratelimit(&cpf_rate)) {
> > +               pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
> > +                        tsk->comm, task_pid_nr(tsk),
> > +                        regs->ip, regs->sp, ssp, error_code,
> > +                        cp_err_string(error_code),
> > +                        error_code & CP_ENCL ? " in enclave" : "");
> > +               print_vma_addr(KERN_CONT " in ", regs->ip);
> > +               pr_cont("\n");
> >         }
> >  
> > -       if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
> > +       force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
> > +       cond_local_irq_disable(regs);
> > +}
> > +#endif
> > +
> > +void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code);
> > +
> > +#ifdef CONFIG_X86_KERNEL_IBT
> > +static __ro_after_init bool ibt_fatal = true;
> > +
> > +extern void ibt_selftest_ip(void); /* code label defined in asm below */
>
> Yeah, pls put that comment above the function. Side comments are
> nasty.
>
> Thx.
>
Sure. Thanks.
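
I.e., it'll become:

/* code label defined in asm below */
extern void ibt_selftest_ip(void);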

2022-12-21 01:15:18

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Tue, 2022-12-20 at 22:21 +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 04:35:34PM -0800, Rick Edgecombe wrote:
> > Keep the same behavior for the kernel side of the fault handler,
> > except for
> > converting a BUG to a WARN in the case of a #CP happening when
> > !cpu_feature_enabled().
> ^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Yeah, let's stick to plain english in the commit message pls, instead
> of
> pseudo code.

Sure, I'll change it. Thanks.

2022-12-21 01:43:26

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 10/39] x86/mm: Introduce _PAGE_COW

On Tue, 2022-12-20 at 22:29 +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 04:35:37PM -0800, Rick Edgecombe wrote:
> > There are six bits left available to software in the 64-bit PTE
> > after
> > consuming a bit for _PAGE_COW. No space is consumed in 32-bit
> > kernels
> > because shadow stacks are not enabled there.
> >
> > This is a prepratory patch. Changes to actually start marking
> > _PAGE_COW
>
> Unknown word [prepratory] in commit message.
> Suggestions: ['preparatory',

Sorry about all these spelling errors.

>
> > will follow once other pieces are in place.
>
> And regardless, you don't really need this sentence at all, AFAICT.
>
> ...

Ok.

>
> > +/*
> > + * Normally COW memory can result in Dirty=1,Write=0 PTs. But in
> > the case
>                                                         ^^^
>
> PTEs.

Thanks.

>
> > + * of X86_FEATURE_USER_SHSTK, the software COW bit is used, since
> > the
> > + * Dirty=1,Write=0 will result in the memory being treated as
> > shaodw stack
> > + * by the HW. So when creating COW memory, a software bit is used
> > + * _PAGE_BIT_COW. The following functions pte_mkcow() and
> > pte_clear_cow()
> > + * take a PTE marked conventially COW (Dirty=1) and transition it
> > to the
>
> Unknown word [conventially] in comment.
> Suggestions: ['conventionally', ...
>
> > + * shadow stack compatible version of COW (Cow=1).
> > + */
> > +
>
> ^ Superfluous newline.

Thanks.

>
> > +static inline pte_t pte_mkcow(pte_t pte)
> > +{
> > +       if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > +               return pte;
> > +
> > +       pte = pte_clear_flags(pte, _PAGE_DIRTY);
> > +       return pte_set_flags(pte, _PAGE_COW);
> > +}
> > +
> > +static inline pte_t pte_clear_cow(pte_t pte)
> > +{
> > +       /*
> > +        * _PAGE_COW is unnecessary on !X86_FEATURE_USER_SHSTK kernels.
>
> I'm guessing this "unnecessary" is supposed to mean that on kernels
> not
> supporting shadow stack, a COW page uses the old bit flags?
>
> I.e., Dirty=1,Write=0?
>
> Might as well write it this way to be perfectly clear.

Right, I can clarify.
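
Maybe something like:

        /*
         * Without shadow stack, there is no need for the software COW
         * bit: COW memory can just stay Dirty=1,Write=0 as usual. See
         * the _PAGE_COW definition for more details.
         */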

>
> > +        * See the _PAGE_COW definition for more details.
> > +        */
> > +       if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> > +               return pte;
> > +
> > +       /*
> > +        * PTE is getting copied-on-write, so it will be dirtied
> > +        * if writable, or made shadow stack if shadow stack and
> > +        * being copied on access. Set they dirty bit for both
>
> "Set the dirty bit.."

Oof.

>
> > +        * cases.
> > +        */
> > +       pte = pte_set_flags(pte, _PAGE_DIRTY);
> > +       return pte_clear_flags(pte, _PAGE_COW);
> > +}
>
> Rest looks ok.

Thanks.

>
> Thx.
>

2022-12-21 10:55:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 06/39] x86/fpu: Add helper for modifying xstate

On Wed, Dec 21, 2022 at 12:03:39AM +0000, Edgecombe, Rick P wrote:
> Sorry about this and the others. I get spelling errors in checkpatch,
> so I must have dictionary issues or something.

No worries, happens to everyone.

> Yes, the KVM series needed it. Part of the reasoning here was to
> provide some helpers to avoid mistakes in modifying xstate, so the
> general idea was that it should be available. I suppose that series
> could add the export though?

Yeah, you do have a point. Preemptive exposure of functionality is not
really needed, so yeah, let's add it when actually really needed.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-21 11:16:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Wed, Dec 21, 2022 at 12:37:51AM +0000, Edgecombe, Rick P wrote:
> You mean having separate paths for kernel IBT and user shadow stack
> that compile out? I guess it could just all be in place if
> CONFIG_X86_CET is in place.
>
> I don't know, I thought it was relatively clean, but I can remove it.

Yeah, I'm wondering if we really need the ifdeffery. I always question
ifdeffery because it is a) ugly, b) a mess to deal with and having it is
not really worth it. Yeah, we save a couple of KBs, big deal.

What would practically happen is, shadow stack will be default-enabled
on the majority of kernels out there - distro ones - so it will be
enabled practically everywhere.

And it'll be off only in some self-built kernels which are the very
small minority.

And how much are the space savings with the whole set applied, with and
without the Kconfig item enabled? Probably only a couple of KBs.

And if so, I'm thinking we could at least make the traps.c stuff
unconditional - it'll be there but won't run. Unless we get some weird
#CP but it'll be caught by do_unexpected_cp().

And you have feature tests everywhere so it's not like it'll get
"misused".

And when you do that, you'll have everything a lot simpler, a lot less
Kconfig items to build-test and all good.

Right?

Or am I completely way off into the weeds here and am missing an
important aspect...?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-21 22:45:25

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Wed, 2022-12-21 at 11:41 +0100, Borislav Petkov wrote:
> On Wed, Dec 21, 2022 at 12:37:51AM +0000, Edgecombe, Rick P wrote:
> > You mean having separate paths for kernel IBT and user shadow stack
> > that compile out? I guess it could just all be in place if
> > CONFIG_X86_CET is in place.
> >
> > I don't know, I thought it was relatively clean, but I can remove it.
>
> Yeah, I'm wondering if we really need the ifdeffery. I always question
> ifdeffery because it is a) ugly, b) a mess to deal with and having it is
> not really worth it. Yeah, we save a couple of KBs, big deal.
>
> What would practically happen is, shadow stack will be default-enabled
> on the majority of kernels out there - distro ones - so it will be
> enabled practically everywhere.
>
> And it'll be off only in some self-built kernels which are the very
> small minority.
>
> And how much are the space savings with the whole set applied, with and
> without the Kconfig item enabled? Probably only a couple of KBs.

Oh, you mean the whole Kconfig thing. Yea, I mean I see the point about
typical configs. But at least CONFIG_X86_CET seems consistent with
CONFIG_INTEL_TDX_GUEST, CONFIG_IOMMU_SVA, etc.

What about moving it out of traps.c to a cet.c, like
exc_vmm_communication for CONFIG_AMD_MEM_ENCRYPT? Then the inclusion
logic lives in the build files, instead of an ifdef.
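
The build-file side would just be the usual kbuild pattern, something
like (sketch; the file name is whatever the patch ends up using):

# arch/x86/kernel/Makefile
obj-$(CONFIG_X86_CET)           += cet.o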

>
> And if so, I'm thinking we could at least make the traps.c stuff
> unconditional - it'll be there but won't run. Unless we get some weird
> #CP but it'll be caught by do_unexpected_cp().
>
> And you have feature tests everywhere so it's not like it'll get
> "misused".
>
> And when you do that, you'll have everything a lot simpler, a lot less
> Kconfig items to build-test and all good.
>
> Right?
>
> Or am I completely way off into the weeds here and am missing an
> important aspect...?
>

One aspect that has come up a couple of times is how closely related
all these CET features are (or aren't). Shadow stack and IBT are mostly
separate, but do share an xfeature and an exception type. Similarly for
supervisor and user mode support for either of the CET features. So
maybe that is what is unusual here. There are some aspects that make
them look like separate features, which leads people to think they
should be separate in the code. But actually separating them leads to
excess ifdeffery.

To me, putting the whole handler in even if there are no CET features
seems like too much. Leaving it in unconditionally is apparently also
inconsistent with some of the previous traps.c decisions. So I'd leave
CONFIG_X86_CET at least and it can help anyone building super stripped
down kernels. But I'm ok with whatever people think.

2022-12-27 13:47:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

Just textual improvements:

On Fri, Dec 02, 2022 at 04:35:39PM -0800, Rick Edgecombe wrote:
> From: Yu-cheng Yu <[email protected]>
>
> When Shadow Stack is in use, Write=0,Dirty=1 PTE are reserved for shadow

Pls, no caps.

> stack. Copy-on-write PTes then have Write=0,Cow=1.

"... are preserved for shadow stack pages."

>
> When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
> become a transient shadow stack PTE in two cases:

1. Some processors ...

2. When _PAGE_DIRTY ...

> The first case is that some processors can start a write but end up seeing
> a Write=0 PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE. However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
>
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
>
> In the case of pmdp_set_wrprotect(), for nopmd configs the ->pmd operated
> on does not exist and the logic would need to be different. Although the
> extra functionality will normally be optimized out when user shadow
> stacks are not configured, also exclude it in the preprocessor stage so
> that it will still compile. User shadow stack is not supported there by
> Linux anyway. Leave the cpu_feature_enabled() check so that the
> functionality also disables based on runtime detection of the feature.

"... also gets disabled ..."

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-27 23:01:29

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v4 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

On Tue, 2022-12-27 at 14:26 +0100, Borislav Petkov wrote:
> Just textual improvements:
>
> On Fri, Dec 02, 2022 at 04:35:39PM -0800, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <[email protected]>
> >
> > When Shadow Stack is in use, Write=0,Dirty=1 PTE are reserved for
> > shadow
>
> Pls, no caps.

Sure on "Shadow Stack". For Write=0,Dirty=1 there was a previous
suggestion to standardize on how these bits are referred to across the
series in both the comments and commit logs. I think the capitalization
helps differentiate between the concepts of write and dirty and the
actual PTE bits with those names. Especially since shadow stack muddies
the concepts of writable and dirty memory, I thought it was a helpful
distinction. Is it ok?

The other suggestions seem good.
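
For reference, the ptep_set_wrprotect() change being discussed is
roughly the following (a sketch of the approach described in the log,
not the literal patch):

static inline void ptep_set_wrprotect(struct mm_struct *mm,
                                      unsigned long addr, pte_t *ptep)
{
        /*
         * Clearing Write on a Dirty=1 PTE could momentarily produce
         * the shadow stack encoding (Write=0,Dirty=1). Do the whole
         * Write=1,Dirty=1 -> Write=0,Cow=1 transition with cmpxchg so
         * no transient shadow stack PTE can be observed.
         */
        if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK)) {
                pte_t old_pte, new_pte;

                old_pte = READ_ONCE(*ptep);
                do {
                        new_pte = pte_wrprotect(old_pte);
                } while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));

                return;
        }
        clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}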

Thanks,

Rick

2023-01-04 12:59:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 07/39] x86: Add user control-protection fault handler

On Wed, Dec 21, 2022 at 09:42:50PM +0000, Edgecombe, Rick P wrote:
> Oh, you mean the whole Kconfig thing. Yea, I mean I see the point about
> typical configs. But at least CONFIG_X86_CET seems consistent with
> CONFIG_INTEL_TDX_GUEST, CONFIG_IOMMU_SVA, etc.
>
> What about moving it out of traps.c to a cet.c, like
> exc_vmm_communication for CONFIG_AMD_MEM_ENCRYPT? Then the inclusion
> logic lives in the build files, instead of an ifdef.

Yeah, that definitely sounds cleaner. Another example would be the #MC handler
being in mce code and not in traps.c.

So yeah, the reason why I'm even mentioning this is that I get an allergic
reaction when I see unwieldy ifdeffery in one screen worth of code. But this is
just me. :)

> One aspect that has come up a couple of times is how closely related
> all these CET features are (or aren't). Shadow stack and IBT are mostly
> separate, but do share an xfeature and an exception type. Similarly for
> supervisor and user mode support for either of the CET features. So
> maybe that is what is unusual here. There are some aspects that make
> them look like separate features, which leads people to think they
> should be separate in the code. But actually separating them leads to
> excess ifdeffery.

Yeah, I think you solved that correctly by having the common X86_CET symbol
selected by the two.
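
I.e. (a sketch of the wiring; the option text here is illustrative):

config X86_CET
        def_bool n

config X86_KERNEL_IBT
        bool "Indirect Branch Tracking"
        select X86_CET

config X86_USER_SHADOW_STACK
        bool "X86 userspace shadow stack"
        select X86_CET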

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-04 13:41:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

On Tue, Dec 27, 2022 at 10:26:33PM +0000, Edgecombe, Rick P wrote:
> Sure on "Shadow Stack". For Write=0,Dirty=1 there was a previous
> suggestion to standardize on how these bits are referred to across the
> series in both the comments and commit logs. I think the capitalization
> helps differentiate between the concepts of write and dirty and the
> actual PTE bits with those names. Especially since shadow stack muddies
> the concepts of writable and dirty memory, I thought it was a helpful
> distinction. Is it ok?

Oh sorry, I meant only

s/Shadow Stack/shadow stack/

The page flags are fine.

Bottom line is: hw documents love to capitalize features and concepts and that's
just unnecessary marketing bla.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette