Control-flow Enforcement Technology (CET) is a new Intel processor feature
that blocks return/jump-oriented programming attacks. Details are in the
"Intel 64 and IA-32 Architectures Software Developer's Manual" [1].

CET can protect applications and the kernel. This series enables only
application-level protection, and has three parts:

- Shadow stack [2],
- Indirect branch tracking [3], and
- Selftests [4].

I have run tests on these patches for quite some time, and they have been
very stable. Linux distributions with CET are available now, and Intel
processors with CET are becoming available. It would be nice if CET
support could be accepted into the kernel. I will work to address any
issues should they come up.

Changes in v15:
- Rebase to v5.10-rc3.
- Small changes to the documentation to make meanings clear.
- Remove changes to tools/arch/x86/include/ files.
- Remove Reviewed-by tags from patches that have been revised too many
times.
[1] Intel 64 and IA-32 Architectures Software Developer's Manual:
https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4
[2] CET Shadow Stack patches v14:
https://lkml.kernel.org/r/[email protected]/
[3] Indirect Branch Tracking patches v14:
https://lkml.kernel.org/r/[email protected]/
[4] I am holding off the selftests changes and working to get Acked-by's.
The earlier version of the selftests patches:
https://lkml.kernel.org/r/[email protected]/
[5] The kernel ptrace patch is tested with an Intel-internal updated GDB.
I am holding off the kernel ptrace patch to re-test it with my earlier
patch for fixing regset holes.
Yu-cheng Yu (26):
Documentation/x86: Add CET description
x86/cpufeatures: Add CET CPU feature flags for Control-flow
Enforcement Technology (CET)
x86/fpu/xstate: Introduce CET MSR XSAVES supervisor states
x86/cet: Add control-protection fault handler
x86/cet/shstk: Add Kconfig option for user-mode Shadow Stack
x86/mm: Change _PAGE_DIRTY to _PAGE_DIRTY_HW
x86/mm: Remove _PAGE_DIRTY_HW from kernel RO pages
x86/mm: Introduce _PAGE_COW
drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
x86/mm: Update pte_modify for _PAGE_COW
x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
transition from _PAGE_DIRTY_HW to _PAGE_COW
mm: Introduce VM_SHSTK for shadow stack memory
x86/mm: Shadow Stack page fault error checking
x86/mm: Update maybe_mkwrite() for shadow stack
mm: Fixup places that call pte_mkwrite() directly
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm: Update can_follow_write_pte() for shadow stack
mm: Re-introduce vm_flags to do_mmap()
x86/cet/shstk: User-mode shadow stack support
x86/cet/shstk: Handle signals for shadow stack
binfmt_elf: Define GNU_PROPERTY_X86_FEATURE_1_AND properties
ELF: Introduce arch_setup_elf_property()
x86/cet/shstk: Handle thread shadow stack
x86/cet/shstk: Add arch_prctl functions for shadow stack
mm: Introduce PROT_SHSTK for shadow stack
.../admin-guide/kernel-parameters.txt | 6 +
Documentation/x86/index.rst | 1 +
Documentation/x86/intel_cet.rst | 138 +++++++
arch/arm64/include/asm/elf.h | 5 +
arch/x86/Kconfig | 39 ++
arch/x86/ia32/ia32_signal.c | 17 +
arch/x86/include/asm/cet.h | 42 +++
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/elf.h | 13 +
arch/x86/include/asm/fpu/internal.h | 10 +
arch/x86/include/asm/fpu/types.h | 23 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/mman.h | 83 +++++
arch/x86/include/asm/mmu_context.h | 3 +
arch/x86/include/asm/msr-index.h | 20 +
arch/x86/include/asm/page_64_types.h | 10 +
arch/x86/include/asm/pgtable.h | 209 ++++++++++-
arch/x86/include/asm/pgtable_types.h | 57 ++-
arch/x86/include/asm/processor.h | 5 +
arch/x86/include/asm/special_insns.h | 32 ++
arch/x86/include/asm/trap_pf.h | 2 +
arch/x86/include/uapi/asm/mman.h | 28 +-
arch/x86/include/uapi/asm/prctl.h | 4 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/include/uapi/asm/sigcontext.h | 9 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/cet.c | 343 ++++++++++++++++++
arch/x86/kernel/cet_prctl.c | 68 ++++
arch/x86/kernel/cpu/common.c | 28 ++
arch/x86/kernel/cpu/cpuid-deps.c | 2 +
arch/x86/kernel/fpu/signal.c | 100 +++++
arch/x86/kernel/fpu/xstate.c | 25 +-
arch/x86/kernel/idt.c | 4 +
arch/x86/kernel/process.c | 14 +-
arch/x86/kernel/process_64.c | 32 ++
arch/x86/kernel/relocate_kernel_64.S | 2 +-
arch/x86/kernel/signal.c | 10 +
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 59 +++
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/mm/fault.c | 19 +
arch/x86/mm/mmap.c | 2 +
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 25 ++
drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
fs/aio.c | 2 +-
fs/binfmt_elf.c | 4 +
fs/proc/task_mmu.c | 3 +
include/linux/elf.h | 6 +
include/linux/mm.h | 38 +-
include/linux/pgtable.h | 35 ++
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/elf.h | 9 +
ipc/shm.c | 2 +-
mm/gup.c | 8 +-
mm/huge_memory.c | 10 +-
mm/memory.c | 5 +-
mm/migrate.c | 3 +-
mm/mmap.c | 23 +-
mm/mprotect.c | 2 +-
mm/nommu.c | 4 +-
mm/util.c | 2 +-
scripts/as-x86_64-has-shadow-stack.sh | 4 +
65 files changed, 1594 insertions(+), 90 deletions(-)
create mode 100644 Documentation/x86/intel_cet.rst
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/include/asm/mman.h
create mode 100644 arch/x86/kernel/cet.c
create mode 100644 arch/x86/kernel/cet_prctl.c
create mode 100755 scripts/as-x86_64-has-shadow-stack.sh
--
2.21.0
Shadow stack accesses are those that are performed by the CPU where it
expects to encounter a shadow stack mapping. They are performed implicitly
by CALL/RET at the location of the shadow stack pointer, and explicitly by
shadow stack management instructions like WRUSSQ.

Shadow stack accesses to shadow stack mappings can fault in normal, valid
operation, just like regular accesses to regular mappings. Shadow stacks
need some of the same features, such as delayed allocation, swap and
copy-on-write.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping.

In handling a shadow stack page fault, verify it occurs within a shadow
stack mapping. It is always an error otherwise. For valid shadow stack
accesses, set FAULT_FLAG_WRITE to effect copy-on-write. Because clearing
_PAGE_DIRTY_HW (vs. _PAGE_RW) is used to trigger the fault, shadow stack
read faults and shadow stack write faults are not differentiated, and both
are handled as write accesses.

Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
arch/x86/include/asm/trap_pf.h | 2 ++
arch/x86/mm/fault.c | 19 +++++++++++++++++++
2 files changed, 21 insertions(+)
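
The decision logic described above can be modelled outside the kernel. What
follows is only a user-space sketch, not the patch itself: the MODEL_ and
model_ names are invented for illustration, the error-code bit positions
follow enum x86_pf_error_code, and it assumes a shadow stack VMA does not
carry the ordinary write permission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Page-fault error-code bits, as in enum x86_pf_error_code. */
#define PF_WRITE (1u << 1)
#define PF_SHSTK (1u << 6)

/* Hypothetical stand-ins for VM_WRITE, VM_SHSTK and FAULT_FLAG_WRITE. */
#define MODEL_VM_WRITE          (1u << 0)
#define MODEL_VM_SHSTK          (1u << 1)
#define MODEL_FAULT_FLAG_WRITE  (1u << 0)

/* True when the access must be rejected, like access_error() returning 1. */
static inline bool model_access_error(uint32_t error_code, uint32_t vm_flags)
{
	/* A shadow stack access is only valid inside a shadow stack VMA. */
	if (error_code & PF_SHSTK)
		return !(vm_flags & MODEL_VM_SHSTK);

	/* An ordinary write needs a writable VMA. */
	if (error_code & PF_WRITE)
		return !(vm_flags & MODEL_VM_WRITE);

	return false;
}

/* Shadow stack reads and writes both take the write (copy-on-write) path. */
static inline uint32_t model_fault_flags(uint32_t error_code)
{
	uint32_t flags = 0;

	if (error_code & (PF_SHSTK | PF_WRITE))
		flags |= MODEL_FAULT_FLAG_WRITE;
	return flags;
}
```

In this model a shadow stack access that lands in a regular VMA is rejected,
while one inside a shadow stack VMA is accepted and then treated as a write
so that the copy-on-write path runs.
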
diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
index 305bc1214aef..205766c438b3 100644
--- a/arch/x86/include/asm/trap_pf.h
+++ b/arch/x86/include/asm/trap_pf.h
@@ -11,6 +11,7 @@
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
* bit 5 == 1: protection keys block access
+ * bit 6 == 1: shadow stack access fault
*/
enum x86_pf_error_code {
X86_PF_PROT = 1 << 0,
@@ -19,6 +20,7 @@ enum x86_pf_error_code {
X86_PF_RSVD = 1 << 3,
X86_PF_INSTR = 1 << 4,
X86_PF_PK = 1 << 5,
+ X86_PF_SHSTK = 1 << 6,
};
#endif /* _ASM_X86_TRAP_PF_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82bf37a5c9ec..941f55ee7c75 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1110,6 +1110,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;
+ /*
+ * Verify a shadow stack access is within a shadow stack VMA.
+ * It is always an error otherwise. Normal data access to a
+ * shadow stack area is checked in the case that follows.
+ */
+ if (error_code & X86_PF_SHSTK) {
+ if (!(vma->vm_flags & VM_SHSTK))
+ return 1;
+ return 0;
+ }
+
if (error_code & X86_PF_WRITE) {
/* write, present and write, not present: */
if (unlikely(!(vma->vm_flags & VM_WRITE)))
@@ -1275,6 +1286,14 @@ void do_user_addr_fault(struct pt_regs *regs,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+ /*
+ * Clearing _PAGE_DIRTY_HW is used to detect shadow stack access.
+ * This method cannot distinguish shadow stack read vs. write.
+ * For valid shadow stack accesses, set FAULT_FLAG_WRITE to effect
+ * copy-on-write.
+ */
+ if (hw_error_code & X86_PF_SHSTK)
+ flags |= FAULT_FLAG_WRITE;
if (hw_error_code & X86_PF_WRITE)
flags |= FAULT_FLAG_WRITE;
if (hw_error_code & X86_PF_INSTR)
--
2.21.0
There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux). That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0,Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set by hardware
and cannot normally be set by hardware on a Write=0 PTE. Software must
normally be involved to create one of these PTEs, so software can simply
opt to not create them.

But that leaves us with a Linux problem: we need to ensure we never create
Write=0,Dirty=1 PTEs. In places where we do create them, we need to find
an alternative way to represent them _without_ using the same hardware bit
combination. Thus, enter _PAGE_COW. This results in the following:

(a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
(b) A R/O page that has been COW'ed: (R/O + _PAGE_COW)
    The user page is in a R/O VMA, and get_user_pages() needs a writable
    copy. The page fault handler creates a copy of the page and sets
    the new copy's PTE as R/O and _PAGE_COW.
(c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
(d) A shared shadow stack PTE: (R/O + _PAGE_COW)
    When a shadow stack page is being shared among processes (this happens
    at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow
    stack access causes a fault, and the page is duplicated and
    _PAGE_DIRTY_HW is set again. This is the COW equivalent for shadow
    stack pages, even though it's copy-on-access rather than copy-on-write.
(e) A page where the processor observed a Write=1 PTE, started a write, set
    Dirty=1, but then observed a Write=0 PTE. That's possible today, but
    will not happen on processors that support shadow stack.

Use _PAGE_COW in pte_wrprotect() and _PAGE_DIRTY_HW in pte_mkwrite().
Apply the same changes to pmd and pud.

When this patch is applied, there are six free bits left in the 64-bit PTE.
There are no more free bits in the 32-bit PTE (except for PAE) and shadow
stack is not implemented for the 32-bit kernel.

Signed-off-by: Yu-cheng Yu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 120 ++++++++++++++++++++++++---
arch/x86/include/asm/pgtable_types.h | 41 ++++++++-
2 files changed, 150 insertions(+), 11 deletions(-)
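
The bit handling described above can be tried out in user space. The sketch
below is a simplified model rather than the kernel code: the m_ and M_ names
are made up, it behaves as if X86_FEATURE_SHSTK were always enabled, it
leaves out _PAGE_SOFT_DIRTY, and the bit positions (RW=1, Dirty=6, COW=58)
are the x86-64 values this patch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define M_BIT_DIRTY_HW   6
#define M_BIT_COW        58
#define M_PAGE_RW        (UINT64_C(1) << 1)
#define M_PAGE_DIRTY_HW  (UINT64_C(1) << M_BIT_DIRTY_HW)
#define M_PAGE_COW       (UINT64_C(1) << M_BIT_COW)

/* Write=0,Dirty=1 is the encoding the hardware treats as shadow stack. */
static inline bool m_is_shstk_encoding(uint64_t v)
{
	return !(v & M_PAGE_RW) && (v & M_PAGE_DIRTY_HW);
}

/* Logically writable: either RW or a shadow stack PTE. */
static inline bool m_write(uint64_t v)
{
	return (v & (M_PAGE_RW | M_PAGE_DIRTY_HW)) != 0;
}

/* Move the hardware dirty bit into the software COW bit, then drop RW. */
static inline uint64_t m_wrprotect(uint64_t v)
{
	v |= ((v & M_PAGE_DIRTY_HW) >> M_BIT_DIRTY_HW) << M_BIT_COW;
	v &= ~M_PAGE_DIRTY_HW;
	v &= ~M_PAGE_RW;
	assert(!m_is_shstk_encoding(v));	/* never a shadow stack PTE */
	return v;
}

/* Reverse direction: COW becomes hardware dirty once the PTE is writable. */
static inline uint64_t m_mkwrite(uint64_t v)
{
	if (v & M_PAGE_COW) {
		v &= ~M_PAGE_COW;
		v |= M_PAGE_DIRTY_HW;
	}
	return v | M_PAGE_RW;
}

/* Dirtying a non-writable PTE must use the software bit. */
static inline uint64_t m_mkdirty(uint64_t v)
{
	return v | (m_write(v) ? M_PAGE_DIRTY_HW : M_PAGE_COW);
}
```

Write-protecting a dirty, writable value yields R/O + COW, and making it
writable again restores the hardware dirty bit, so the dirty state survives
the round trip without ever passing through Write=0,Dirty=1.
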
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b23697658b28..c88c7ccf0318 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -121,9 +121,9 @@ extern pmdval_t early_pmd_flags;
* The following only work if pte_present() is true.
* Undefined behaviour if not..
*/
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
{
- return pte_flags(pte) & _PAGE_DIRTY_HW;
+ return pte_flags(pte) & _PAGE_DIRTY_BITS;
}
@@ -160,9 +160,9 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_DIRTY_HW;
+ return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
}
static inline int pmd_young(pmd_t pmd)
@@ -170,9 +170,9 @@ static inline int pmd_young(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_ACCESSED;
}
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
{
- return pud_flags(pud) & _PAGE_DIRTY_HW;
+ return pud_flags(pud) & _PAGE_DIRTY_BITS;
}
static inline int pud_young(pud_t pud)
@@ -182,6 +182,12 @@ static inline int pud_young(pud_t pud)
static inline int pte_write(pte_t pte)
{
+ /*
+ * If _PAGE_DIRTY_HW is set, the PTE must either have
+ * _PAGE_RW or be a shadow stack PTE, which is logically writable.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY_HW);
return pte_flags(pte) & _PAGE_RW;
}
@@ -333,7 +339,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
static inline pte_t pte_mkclean(pte_t pte)
{
- return pte_clear_flags(pte, _PAGE_DIRTY_HW);
+ return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
}
static inline pte_t pte_mkold(pte_t pte)
@@ -343,6 +349,17 @@ static inline pte_t pte_mkold(pte_t pte)
static inline pte_t pte_wrprotect(pte_t pte)
{
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PTE (RW=0,Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >>
+ _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW;
+ pte = pte_clear_flags(pte, _PAGE_DIRTY_HW);
+ }
+
return pte_clear_flags(pte, _PAGE_RW);
}
@@ -353,6 +370,18 @@ static inline pte_t pte_mkexec(pte_t pte)
static inline pte_t pte_mkdirty(pte_t pte)
{
+ pteval_t dirty = _PAGE_DIRTY_HW;
+
+ /* Avoid creating (HW)Dirty=1,Write=0 PTEs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+ dirty = _PAGE_COW;
+
+ return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+ pte = pte_clear_flags(pte, _PAGE_COW);
return pte_set_flags(pte, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}
@@ -363,6 +392,13 @@ static inline pte_t pte_mkyoung(pte_t pte)
static inline pte_t pte_mkwrite(pte_t pte)
{
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pte_flags(pte) & _PAGE_COW) {
+ pte = pte_clear_flags(pte, _PAGE_COW);
+ pte = pte_set_flags(pte, _PAGE_DIRTY_HW);
+ }
+ }
+
return pte_set_flags(pte, _PAGE_RW);
}
@@ -434,16 +470,41 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
static inline pmd_t pmd_mkclean(pmd_t pmd)
{
- return pmd_clear_flags(pmd, _PAGE_DIRTY_HW);
+ return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
}
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PMD (RW=0,Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pmdval_t v = native_pmd_val(pmd);
+
+ v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW <<
+ _PAGE_BIT_COW;
+ pmd = pmd_clear_flags(__pmd(v), _PAGE_DIRTY_HW);
+ }
+
return pmd_clear_flags(pmd, _PAGE_RW);
}
static inline pmd_t pmd_mkdirty(pmd_t pmd)
{
+ pmdval_t dirty = _PAGE_DIRTY_HW;
+
+ /* Avoid creating (HW)Dirty=1,Write=0 PMDs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW))
+ dirty = _PAGE_COW;
+
+ return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+ pmd = pmd_clear_flags(pmd, _PAGE_COW);
return pmd_set_flags(pmd, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
}
@@ -464,6 +525,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
static inline pmd_t pmd_mkwrite(pmd_t pmd)
{
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pmd_flags(pmd) & _PAGE_COW) {
+ pmd = pmd_clear_flags(pmd, _PAGE_COW);
+ pmd = pmd_set_flags(pmd, _PAGE_DIRTY_HW);
+ }
+ }
+
return pmd_set_flags(pmd, _PAGE_RW);
}
@@ -488,17 +556,36 @@ static inline pud_t pud_mkold(pud_t pud)
static inline pud_t pud_mkclean(pud_t pud)
{
- return pud_clear_flags(pud, _PAGE_DIRTY_HW);
+ return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
}
static inline pud_t pud_wrprotect(pud_t pud)
{
+ /*
+ * Blindly clearing _PAGE_RW might accidentally create
+ * a shadow stack PUD (RW=0,Dirty=1). Move the hardware
+ * dirty value to the software bit.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ pudval_t v = native_pud_val(pud);
+
+ v |= (v & _PAGE_DIRTY_HW) >> _PAGE_BIT_DIRTY_HW <<
+ _PAGE_BIT_COW;
+ pud = pud_clear_flags(__pud(v), _PAGE_DIRTY_HW);
+ }
+
return pud_clear_flags(pud, _PAGE_RW);
}
static inline pud_t pud_mkdirty(pud_t pud)
{
- return pud_set_flags(pud, _PAGE_DIRTY_HW | _PAGE_SOFT_DIRTY);
+ pudval_t dirty = _PAGE_DIRTY_HW;
+
+ /* Avoid creating (HW)Dirty=1,Write=0 PUDs */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW))
+ dirty = _PAGE_COW;
+
+ return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
}
static inline pud_t pud_mkdevmap(pud_t pud)
@@ -518,6 +605,13 @@ static inline pud_t pud_mkyoung(pud_t pud)
static inline pud_t pud_mkwrite(pud_t pud)
{
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+ if (pud_flags(pud) & _PAGE_COW) {
+ pud = pud_clear_flags(pud, _PAGE_COW);
+ pud = pud_set_flags(pud, _PAGE_DIRTY_HW);
+ }
+ }
+
return pud_set_flags(pud, _PAGE_RW);
}
@@ -1131,6 +1225,12 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
#define pmd_write pmd_write
static inline int pmd_write(pmd_t pmd)
{
+ /*
+ * If _PAGE_DIRTY_HW is set, then the PMD must either have
+ * _PAGE_RW or be a shadow stack PMD, which is logically writable.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+ return pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY_HW);
return pmd_flags(pmd) & _PAGE_RW;
}
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 7462a574fc93..5f764d8d9bae 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,7 +23,8 @@
#define _PAGE_BIT_SOFTW2 10 /* " */
#define _PAGE_BIT_SOFTW3 11 /* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
+#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
+#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
#define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
#define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
#define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
@@ -36,6 +37,16 @@
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
+/*
+ * This bit indicates a copy-on-write page, and is different from
+ * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
+ */
+#ifdef CONFIG_X86_64
+#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW 0
+#endif
+
/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
#define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
@@ -117,6 +128,34 @@
#define _PAGE_DEVMAP (_AT(pteval_t, 0))
#endif
+/*
+ * _PAGE_COW is used to separate R/O and copy-on-write PTEs created by
+ * software from the shadow stack PTE setting required by the hardware:
+ * (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
+ * (b) A R/O page that has been COW'ed: (R/O +_PAGE_COW)
+ * The user page is in a R/O VMA, and get_user_pages() needs a
+ * writable copy. The page fault handler creates a copy of the page
+ * and sets the new copy's PTE as R/O and _PAGE_COW.
+ * (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
+ * (d) A shared (copy-on-access) shadow stack PTE: (R/O + _PAGE_COW)
+ * When a shadow stack page is being shared among processes (this
+ * happens at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the
+ * next shadow stack access causes a fault, and the page is duplicated
+ * and _PAGE_DIRTY_HW is set again. This is the COW equivalent for
+ * shadow stack pages, even though it's copy-on-access rather than
+ * copy-on-write.
+ * (e) A page where the processor observed a Write=1 PTE, started a write,
+ * set Dirty=1, but then observed a Write=0 PTE. That's possible
+ * today, but will not happen on processors that support shadow stack.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK_USER
+#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW (_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY_HW | _PAGE_COW)
+
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
/*
--
2.21.0
A shadow stack PTE must be read-only and have _PAGE_DIRTY_HW set. However,
read-only and dirty PTEs also exist for copy-on-write (COW) pages. These
two cases are handled differently for page faults. Introduce VM_SHSTK to
track shadow stack VMAs.
Signed-off-by: Yu-cheng Yu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
arch/x86/mm/mmap.c | 2 ++
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 8 ++++++++
3 files changed, 13 insertions(+)
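
As a rough illustration of how the new flag surfaces to user space, the
sketch below models the two reporting paths touched by this patch. It is a
user-space approximation only: the m_ and M_ names are hypothetical, and the
flag value assumes VM_SHSTK maps to bit 37 (VM_HIGH_ARCH_BIT_5) as defined
here with CONFIG_X86_SHADOW_STACK_USER enabled.

```c
#include <assert.h>	/* assert() and strcmp() are for callers checking results */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for VM_SHSTK when the shadow stack option is enabled. */
#define M_VM_SHSTK (UINT64_C(1) << 37)

/* Mirrors arch_vma_name(): a special name for shadow stack VMAs only. */
static inline const char *m_vma_name(uint64_t vm_flags)
{
	return (vm_flags & M_VM_SHSTK) ? "[shadow stack]" : NULL;
}

/* Mirrors the smaps mnemonic table entry added for VM_SHSTK. */
static inline const char *m_smaps_mnemonic(uint64_t vm_flags)
{
	return (vm_flags & M_VM_SHSTK) ? "ss" : "";
}
```

When the option is disabled, the patch defines VM_SHSTK as VM_NONE, so the
corresponding checks in the kernel always evaluate to false.
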
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index c90c20904a60..a22c6b6fc607 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -165,6 +165,8 @@ unsigned long get_mmap_base(int is_legacy)
const char *arch_vma_name(struct vm_area_struct *vma)
{
+ if (vma->vm_flags & VM_SHSTK)
+ return "[shadow stack]";
return NULL;
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 217aa2705d5d..c72143cdbb5d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -661,6 +661,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_PKEY_BIT4)] = "",
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_X86_SHADOW_STACK_USER
+ [ilog2(VM_SHSTK)] = "ss",
+#endif
};
size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..c7f527bd21fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -304,11 +304,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -324,6 +326,12 @@ extern unsigned int kobjsize(const void *objp);
#endif
#endif /* CONFIG_ARCH_HAS_PKEYS */
+#ifdef CONFIG_X86_SHADOW_STACK_USER
+# define VM_SHSTK VM_HIGH_ARCH_5
+#else
+# define VM_SHSTK VM_NONE
+#endif
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
--
2.21.0
On Tue, Nov 10, 2020 at 08:21:45AM -0800, Yu-cheng Yu wrote:
> Control-flow Enforcement (CET) is a new Intel processor feature that blocks
> return/jump-oriented programming attacks. Details are in "Intel 64 and
> IA-32 Architectures Software Developer's Manual" [1].
>
> CET can protect applications and the kernel. This series enables only
> application-level protection, and has three parts:
>
> - Shadow stack [2],
> - Indirect branch tracking [3], and
> - Selftests [4].
>
> I have run tests on these patches for quite some time, and they have been
> very stable. Linux distributions with CET are available now, and Intel
> processors with CET are becoming available. It would be nice if CET
> support can be accepted into the kernel. I will be working to address any
> issues should they come up.
>
Is there a way to run these patches for testing? Bochs emulation or anything
else? I presume you've been testing against violations of CET in user space?
Can you share your testing?
Balbir Singh.
On 11/27/2020 1:29 AM, Balbir Singh wrote:
> On Tue, Nov 10, 2020 at 08:21:45AM -0800, Yu-cheng Yu wrote:
>> Control-flow Enforcement (CET) is a new Intel processor feature that blocks
>> return/jump-oriented programming attacks. Details are in "Intel 64 and
>> IA-32 Architectures Software Developer's Manual" [1].
>>
>> CET can protect applications and the kernel. This series enables only
>> application-level protection, and has three parts:
>>
>> - Shadow stack [2],
>> - Indirect branch tracking [3], and
>> - Selftests [4].
>>
>> I have run tests on these patches for quite some time, and they have been
>> very stable. Linux distributions with CET are available now, and Intel
>> processors with CET are becoming available. It would be nice if CET
>> support can be accepted into the kernel. I will be working to address any
>> issues should they come up.
>>
>
> Is there a way to run these patches for testing? Bochs emulation or anything
> else? I presume you've been testing against violations of CET in user space?
> Can you share your testing?
>
> Balbir Singh.
>
Machines with CET are already available on the market. I tested these
patches on real machines running Fedora. There is a quick test in my
earlier selftest patches:
https://lore.kernel.org/linux-api/[email protected]/
Thanks,
Yu-cheng
On 12/8/2020 9:50 AM, Borislav Petkov wrote:
> On Tue, Nov 10, 2020 at 08:21:53AM -0800, Yu-cheng Yu wrote:
>> There is essentially no room left in the x86 hardware PTEs on some OSes
>> (not Linux). That left the hardware architects looking for a way to
>> represent a new memory type (shadow stack) within the existing bits.
>> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
>
> It is not clear to me what the definition and semantics of that bit are.
>
> +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
>
> Is it set by hw or by sw and hw uses it to know it is a shadow stack
> page, and so on.
>
> I think you should lead with its definition.
Ok.
...
>> Write=0,Dirty=1 PTEs. In places where we do create them, we need to find
>> an alternative way to represent them _without_ using the same hardware bit
>> combination. Thus, enter _PAGE_COW. This results in the following:
>>
>> (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
>> (b) A R/O page that has been COW'ed: (R/O + _PAGE_COW)
>
> Both are "R/O + _PAGE_COW". Where's the difference? The dirty bit?
The PTEs are the same for both (a) and (b), but come from different routes.
>> The user page is in a R/O VMA, and get_user_pages() needs a writable
>> copy. The page fault handler creates a copy of the page and sets
>> the new copy's PTE as R/O and _PAGE_COW.
>> (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
>
> So W=0, D=1 ?
Yes.
>> (d) A shared shadow stack PTE: (R/O + _PAGE_COW)
>> When a shadow stack page is being shared among processes (this happens
>> at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow
>> stack access causes a fault, and the page is duplicated and
>> _PAGE_DIRTY_HW is set again. This is the COW equivalent for shadow
>> stack pages, even though it's copy-on-access rather than copy-on-write.
>> (e) A page where the processor observed a Write=1 PTE, started a write, set
>> Dirty=1, but then observed a Write=0 PTE.
>
> How does that happen? Something changed the PTE's W bit to 0 in-between?
Yes.
...
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index b23697658b28..c88c7ccf0318 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -121,9 +121,9 @@ extern pmdval_t early_pmd_flags;
>> * The following only work if pte_present() is true.
>> * Undefined behaviour if not..
>> */
>> -static inline int pte_dirty(pte_t pte)
>> +static inline bool pte_dirty(pte_t pte)
>> {
>> - return pte_flags(pte) & _PAGE_DIRTY_HW;
>> + return pte_flags(pte) & _PAGE_DIRTY_BITS;
>
> Why?
>
> Does _PAGE_COW mean dirty too?
Yes. Basically [read-only & dirty] is created by software. Now the
software uses a different bit.
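
A small user-space sketch may make this concrete. It is not kernel code: the
m_ and M_ names are invented, and the bit positions (RW=1, Dirty=6, COW=58)
are the x86-64 values from the patch. It also shows one plausible reason for
the int to bool change in pte_dirty(): bit 58 does not fit in the low 32
bits, so an int-sized result could read as zero for a COW-only PTE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define M_PAGE_RW          (UINT64_C(1) << 1)
#define M_PAGE_DIRTY_HW    (UINT64_C(1) << 6)
#define M_PAGE_COW         (UINT64_C(1) << 58)
#define M_PAGE_DIRTY_BITS  (M_PAGE_DIRTY_HW | M_PAGE_COW)

/* Dirty means either the hardware bit or the software COW bit is set. */
static inline bool m_pte_dirty(uint64_t flags)
{
	return (flags & M_PAGE_DIRTY_BITS) != 0;
}

/* What a 32-bit return value would keep: only the low 32 bits. */
static inline int m_pte_dirty_as_int(uint64_t flags)
{
	return (int)(uint32_t)(flags & M_PAGE_DIRTY_BITS);
}

/* Cleaning a PTE has to clear both representations of dirty. */
static inline uint64_t m_pte_mkclean(uint64_t flags)
{
	return flags & ~M_PAGE_DIRTY_BITS;
}
```

With these definitions a PTE carrying only the COW bit reports dirty through
the bool helper, while the truncated variant returns 0 for the same input.
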
>> @@ -343,6 +349,17 @@ static inline pte_t pte_mkold(pte_t pte)
>>
>> static inline pte_t pte_wrprotect(pte_t pte)
>> {
>> + /*
>> + * Blindly clearing _PAGE_RW might accidentally create
>> + * a shadow stack PTE (RW=0,Dirty=1). Move the hardware
>> + * dirty value to the software bit.
>> + */
>> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> + pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >>
>> + _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW;
>
> Let that line stick out. And that shifting is not grokkable at a quick
> glance, at least not to me. Simplify?
Ok.
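
One possible shape for such a simplification, shown purely as a user-space
sketch and not as what a later revision of the patch does: replace the
shift chain with an explicit test. The m_ and M_ names are hypothetical, and
the bit positions (RW=1, Dirty=6, COW=58) are those used in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define M_BIT_DIRTY_HW   6
#define M_BIT_COW        58
#define M_PAGE_RW        (UINT64_C(1) << 1)
#define M_PAGE_DIRTY_HW  (UINT64_C(1) << M_BIT_DIRTY_HW)
#define M_PAGE_COW       (UINT64_C(1) << M_BIT_COW)

/* The form in the patch: shift Dirty down to bit 0, then up into COW. */
static inline uint64_t m_wrprotect_shift(uint64_t v)
{
	v |= ((v & M_PAGE_DIRTY_HW) >> M_BIT_DIRTY_HW) << M_BIT_COW;
	v &= ~M_PAGE_DIRTY_HW;
	return v & ~M_PAGE_RW;
}

/* An equivalent form that spells out the intent with a branch. */
static inline uint64_t m_wrprotect_branch(uint64_t v)
{
	if (v & M_PAGE_DIRTY_HW) {
		v &= ~M_PAGE_DIRTY_HW;
		v |= M_PAGE_COW;
	}
	return v & ~M_PAGE_RW;
}
```

Both helpers produce the same value for every combination of the RW, Dirty
and COW bits; the branch form trades a conditional for readability.
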
>> static inline pmd_t pmd_wrprotect(pmd_t pmd)
>> {
>> + /*
>> + * Blindly clearing _PAGE_RW might accidentally create
>> + * a shadow stack PMD (RW=0,Dirty=1). Move the hardware
>> + * dirty value to the software bit.
>
> This whole carefully sidestepping the possibility of creating a shadow
> stack pXd is kinda sucky...
>
>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
>> index 7462a574fc93..5f764d8d9bae 100644
>> --- a/arch/x86/include/asm/pgtable_types.h
>> +++ b/arch/x86/include/asm/pgtable_types.h
>> @@ -23,7 +23,8 @@
>> #define _PAGE_BIT_SOFTW2 10 /* " */
>> #define _PAGE_BIT_SOFTW3 11 /* " */
>> #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
>> -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
>> +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
>> +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
>> #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
>> #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
>> #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
>> @@ -36,6 +37,16 @@
>> #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
>> #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
>>
>> +/*
>> + * This bit indicates a copy-on-write page, and is different from
>> + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
>> + */
>> +#ifdef CONFIG_X86_64
>
> CONFIG_X86_64 ? Do all x86 machines out there support CET?
>
> If anything, CONFIG_X86_CET...
Ok.
--
Yu-cheng
On Tue, Nov 10, 2020 at 08:21:53AM -0800, Yu-cheng Yu wrote:
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
It is not clear to me what the definition and semantics of that bit are.
+#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
Is it set by hw or by sw and hw uses it to know it is a shadow stack
page, and so on.
I think you should lead with its definition.
> The reason it's lightly used is that Dirty=1 is normally set by hardware
> and cannot normally be set by hardware on a Write=0 PTE. Software must
> normally be involved to create one of these PTEs, so software can simply
> opt to not create them.
>
> But that leaves us with a Linux problem: we need to ensure we never create
Please use passive voice in your commit message: no "we" or "I", etc.
> Write=0,Dirty=1 PTEs. In places where we do create them, we need to find
> an alternative way to represent them _without_ using the same hardware bit
> combination. Thus, enter _PAGE_COW. This results in the following:
>
> (a) A modified, copy-on-write (COW) page: (R/O + _PAGE_COW)
> (b) A R/O page that has been COW'ed: (R/O + _PAGE_COW)
Both are "R/O + _PAGE_COW". Where's the difference? The dirty bit?
> The user page is in a R/O VMA, and get_user_pages() needs a writable
> copy. The page fault handler creates a copy of the page and sets
> the new copy's PTE as R/O and _PAGE_COW.
> (c) A shadow stack PTE: (R/O + _PAGE_DIRTY_HW)
So W=0, D=1 ?
> (d) A shared shadow stack PTE: (R/O + _PAGE_COW)
> When a shadow stack page is being shared among processes (this happens
> at fork()), its PTE is cleared of _PAGE_DIRTY_HW, so the next shadow
> stack access causes a fault, and the page is duplicated and
> _PAGE_DIRTY_HW is set again. This is the COW equivalent for shadow
> stack pages, even though it's copy-on-access rather than copy-on-write.
> (e) A page where the processor observed a Write=1 PTE, started a write, set
> Dirty=1, but then observed a Write=0 PTE.
How does that happen? Something changed the PTE's W bit to 0 in-between?
> That's possible today, but
> will not happen on processors that support shadow stack.
>
> Use _PAGE_COW in pte_wrprotect() and _PAGE_DIRTY_HW in pte_mkwrite().
> Apply the same changes to pmd and pud.
>
> When this patch is applied, there are six free bits left in the 64-bit PTE.
s/When this patch is applied/After this/
Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.
Also, do
$ git grep 'This patch' Documentation/process
for more details.
> There are no more free bits in the 32-bit PTE (except for PAE) and shadow
> stack is not implemented for the 32-bit kernel.
>
> Signed-off-by: Yu-cheng Yu <[email protected]>
> ---
> arch/x86/include/asm/pgtable.h | 120 ++++++++++++++++++++++++---
> arch/x86/include/asm/pgtable_types.h | 41 ++++++++-
> 2 files changed, 150 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index b23697658b28..c88c7ccf0318 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -121,9 +121,9 @@ extern pmdval_t early_pmd_flags;
> * The following only work if pte_present() is true.
> * Undefined behaviour if not..
> */
> -static inline int pte_dirty(pte_t pte)
> +static inline bool pte_dirty(pte_t pte)
> {
> - return pte_flags(pte) & _PAGE_DIRTY_HW;
> + return pte_flags(pte) & _PAGE_DIRTY_BITS;
Why?
Does _PAGE_COW mean dirty too?
> @@ -343,6 +349,17 @@ static inline pte_t pte_mkold(pte_t pte)
>
> static inline pte_t pte_wrprotect(pte_t pte)
> {
> + /*
> + * Blindly clearing _PAGE_RW might accidentally create
> + * a shadow stack PTE (RW=0,Dirty=1). Move the hardware
> + * dirty value to the software bit.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> + pte.pte |= (pte.pte & _PAGE_DIRTY_HW) >>
> + _PAGE_BIT_DIRTY_HW << _PAGE_BIT_COW;
Let that line stick out. And that shifting is not grokkable at a quick
glance, at least not to me. Simplify?
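For instance, something along these lines reads easier to me. It is
only a standalone sketch and not the patch's code: the sk_ names are
made up for illustration, the bit positions come from the diff below
plus the usual x86 R/W and Dirty bits, and it assumes that "move" means
the hardware dirty bit is cleared once it has been copied to the
software bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's PTE bit definitions. */
#define SK_PAGE_RW       (UINT64_C(1) << 1)   /* x86 R/W bit */
#define SK_PAGE_DIRTY_HW (UINT64_C(1) << 6)   /* x86 hardware Dirty bit */
#define SK_PAGE_COW      (UINT64_C(1) << 58)  /* _PAGE_BIT_SOFTW5 in the diff */

/*
 * Write-protect a raw PTE value without leaving RW=0,Dirty=1 behind.
 * shstk_enabled stands in for cpu_feature_enabled(X86_FEATURE_SHSTK).
 */
static inline uint64_t sk_wrprotect(uint64_t pte, bool shstk_enabled)
{
	/* Assumed semantics: hardware dirty is moved, not copied. */
	if (shstk_enabled && (pte & SK_PAGE_DIRTY_HW)) {
		pte &= ~SK_PAGE_DIRTY_HW;
		pte |= SK_PAGE_COW;
	}

	return pte & ~SK_PAGE_RW;
}
```

An explicit test-and-set like this avoids the shift arithmetic entirely,
at the cost of a branch.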
> static inline pmd_t pmd_wrprotect(pmd_t pmd)
> {
> + /*
> + * Blindly clearing _PAGE_RW might accidentally create
> + * a shadow stack PMD (RW=0,Dirty=1). Move the hardware
> + * dirty value to the software bit.
This whole carefully sidestepping the possibility of creating a shadow
stack pXd is kinda sucky...
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 7462a574fc93..5f764d8d9bae 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -23,7 +23,8 @@
> #define _PAGE_BIT_SOFTW2 10 /* " */
> #define _PAGE_BIT_SOFTW3 11 /* " */
> #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */
> +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */
> +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */
> #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
> #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
> #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
> @@ -36,6 +37,16 @@
> #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
> #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
>
> +/*
> + * This bit indicates a copy-on-write page, and is different from
> + * _PAGE_BIT_SOFT_DIRTY, which tracks which pages a task writes to.
> + */
> +#ifdef CONFIG_X86_64
CONFIG_X86_64 ? Do all x86 machines out there support CET?
If anything, CONFIG_X86_CET...
> +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */
> +#else
> +#define _PAGE_BIT_COW 0
> +#endif
> +
> /* If _PAGE_BIT_PRESENT is clear, we use these: */
> /* - if the user mapped it with PROT_NONE; pte_present gives true */
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Dec 08, 2020 at 10:25:15AM -0800, Yu, Yu-cheng wrote:
> > Both are "R/O + _PAGE_COW". Where's the difference? The dirty bit?
>
> The PTEs are the same for both (a) and (b), but come from different routes.
Do not be afraid to go into detail and explain to me what those routes
are please.
> > > (e) A page where the processor observed a Write=1 PTE, started a write, set
> > > Dirty=1, but then observed a Write=0 PTE.
> >
> > How does that happen? Something changed the PTE's W bit to 0 in-between?
>
> Yes.
Also, do not shy away from going into detail and explaining what you
mean here. Example?
> > Does _PAGE_COW mean dirty too?
>
> Yes. Basically [read-only & dirty] is created by software. Now the
> software uses a different bit.
That convention:
"[read-only & dirty] is created by software."
needs some prominent writeup somewhere explaining what it is.
Thx.
--
Regards/Gruss,
Boris.
On 12/8/2020 10:47 AM, Borislav Petkov wrote:
> On Tue, Dec 08, 2020 at 10:25:15AM -0800, Yu, Yu-cheng wrote:
>>> Both are "R/O + _PAGE_COW". Where's the difference? The dirty bit?
>>
>> The PTEs are the same for both (a) and (b), but come from different routes.
>
> Do not be afraid to go into detail and explain to me what those routes
> are please.
Case (a) is a normal writable data page that has gone through fork().
So it would have W=0, D=1. But here, the software chooses not to use the
D bit, and instead uses W=0, COW=1.
Case (b) is a normal read-only data page. Since it is read-only, fork()
won't affect it. In __get_user_pages(), a copy of the read-only page is
needed, and the page is duplicated. The software sets COW=1 for the new
copy.
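To illustrate how the two cases look to the dirty check, here is a
rough standalone sketch. It is not the code from the patch: the sk_
names are hypothetical, and it assumes _PAGE_DIRTY_BITS is simply the
hardware dirty bit or'ed with the COW bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's PTE bit definitions. */
#define SK_PAGE_RW         (UINT64_C(1) << 1)   /* x86 R/W bit */
#define SK_PAGE_DIRTY_HW   (UINT64_C(1) << 6)   /* x86 hardware Dirty bit */
#define SK_PAGE_COW        (UINT64_C(1) << 58)  /* software COW bit */

/* Assumption: either bit means "this page has been modified". */
#define SK_PAGE_DIRTY_BITS (SK_PAGE_DIRTY_HW | SK_PAGE_COW)

/* Cases (a) and (b): W=0, COW=1 still reports dirty. */
static inline bool sk_pte_dirty(uint64_t pte)
{
	return (pte & SK_PAGE_DIRTY_BITS) != 0;
}

/* Case (c): only W=0 together with hardware D=1 is a shadow stack PTE. */
static inline bool sk_pte_shstk(uint64_t pte)
{
	return (pte & (SK_PAGE_RW | SK_PAGE_DIRTY_HW)) == SK_PAGE_DIRTY_HW;
}
```

With this, a W=0, COW=1 PTE is dirty but is never mistaken for a shadow
stack PTE, which is the point of using a separate software bit.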
>>>> (e) A page where the processor observed a Write=1 PTE, started a write, set
>>>> Dirty=1, but then observed a Write=0 PTE.
>>>
>>> How does that happen? Something changed the PTE's W bit to 0 in-between?
>>
>> Yes.
>
> Also, do not shy away from going into detail and explaining what you
> mean here. Example?
Thread-A is writing to a writable page, and the page's PTE is becoming
W=1, D=1. In the middle of it, Thread-B is changing the PTE to W=0.
>>> Does _PAGE_COW mean dirty too?
>>
>> Yes. Basically [read-only & dirty] is created by software. Now the
>> software uses a different bit.
>
> That convention:
>
> "[read-only & dirty] is created by software."
>
> needs some prominent writeup somewhere explaining what it is.
>
> Thx.
>
I will put these into the comments.
--
Yu-cheng
On 12/10/2020 9:41 AM, Borislav Petkov wrote:
> On Tue, Dec 08, 2020 at 11:24:16AM -0800, Yu, Yu-cheng wrote:
>> Case (a) is a normal writable data page that has gone through fork(). So it
>
> Writable?
>> has W=0, D=1. But here, the software chooses not to use the D bit, and
>
> But it has W=0. So not writable?
Maybe I will change it to: A page in a writable VMA that has been
modified and has gone through fork().
>> instead, W=0, COW=1.
>
> So the "new" way of denoting that the page is modified is COW=1
> *when* on CET hw. The D=1 bit is still used on the rest thus the two
> _PAGE_DIRTY_BITS.
>
> Am I close?
COW=1 is used only in copy-on-write situations (when CET is enabled).
If W=1, the D bit is used.
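As a rough sketch of the W=1 direction (standalone and not the patch's
code: the sk_ names are hypothetical, and it assumes pte_mkwrite()
simply reverses what pte_wrprotect() does to the dirty state):

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's PTE bit definitions. */
#define SK_PAGE_RW       (UINT64_C(1) << 1)   /* x86 R/W bit */
#define SK_PAGE_DIRTY_HW (UINT64_C(1) << 6)   /* x86 hardware Dirty bit */
#define SK_PAGE_COW      (UINT64_C(1) << 58)  /* software COW bit */

/*
 * Make a raw PTE value writable. Once W=1 there is no risk of forming
 * a shadow stack PTE, so the dirty state goes back to the hardware bit.
 */
static inline uint64_t sk_mkwrite(uint64_t pte)
{
	if (pte & SK_PAGE_COW) {
		pte &= ~SK_PAGE_COW;
		pte |= SK_PAGE_DIRTY_HW;
	}

	return pte | SK_PAGE_RW;
}
```

So a W=0, COW=1 PTE becomes W=1, D=1 when it is made writable, and a
clean PTE stays clean.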
>> Case (b) is a normal read-only data page. Since it is read-only, fork()
>> won't affect it. In __get_user_pages(), a copy of the read-only page is
>> needed, and the page is duplicated. The software sets COW=1 for the new
>> copy.
>
> That makes more sense.
>
>> Thread-A is writing to a writable page, and the page's PTE is becoming W=1,
>> D=1. In the middle of it, Thread-B is changing the PTE to W=0.
>
> Yah, add that to the explanation pls.
>
Sure.
On Tue, Dec 08, 2020 at 11:24:16AM -0800, Yu, Yu-cheng wrote:
> Case (a) is a normal writable data page that has gone through fork(). So it
Writable?
> has W=0, D=1. But here, the software chooses not to use the D bit, and
But it has W=0. So not writable?
> instead, W=0, COW=1.
So the "new" way of denoting that the page is modified is COW=1
*when* on CET hw. The D=1 bit is still used on the rest thus the two
_PAGE_DIRTY_BITS.
Am I close?
> Case (b) is a normal read-only data page. Since it is read-only, fork()
> won't affect it. In __get_user_pages(), a copy of the read-only page is
> needed, and the page is duplicated. The software sets COW=1 for the new
> copy.
That makes more sense.
> Thread-A is writing to a writable page, and the page's PTE is becoming W=1,
> D=1. In the middle of it, Thread-B is changing the PTE to W=0.
Yah, add that to the explanation pls.
--
Regards/Gruss,
Boris.