2015-05-07 17:41:33

by Dave Hansen

Subject: [PATCH 00/12] [RFC] x86: Memory Protection Keys

This is a big, fat RFC. This code is going to be unrunnable for
anyone outside of Intel. But, this patch set has user interface
implications because we need to pass the protection key into
the kernel somehow.

At this point, I would especially appreciate feedback on how
we should do that. I've taken the most expedient approach for
this first attempt, especially since we piggyback on existing
syscalls here.

There is a lot of work left to do here. Mainly, we need to
ensure that when we are walking the page tables in software,
we obey protection keys wherever possible. This is
going to mean a lot of audits of the page table walking code,
although some of it, like access_process_vm(), we can probably
safely ignore.

This set is also available here:

git://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-pkeys.git pkeys-v001

== FEATURE OVERVIEW ==

Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU
feature which will be found in future Intel CPUs. The work here
was done with the aid of simulators.

Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
page tables when an application changes protection domains. It
works by dedicating 4 previously ignored bits in each page table
entry to a "protection key", giving 16 possible keys.

There is also a new user-accessible register (PKRU) with two
separate bits (Access Disable and Write Disable) for each key.
Being a CPU register, PKRU is inherently thread-local,
potentially giving each thread a different set of protections
from every other thread.

There are two new instructions (RDPKRU/WRPKRU) for reading and
writing to the new register. The feature is only available in
64-bit mode, even though there is theoretically space in the PAE
PTEs. These permissions are enforced on data access only and
have no effect on instruction fetches.
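
To make the new register concrete, here is a minimal userspace
sketch of flipping protections. The opcode bytes are the
documented RDPKRU/WRPKRU encodings; the helper names and the
PKRU bit-layout macros are illustrative only, not part of this
set:

	#include <stdint.h>

	/* RDPKRU: ECX must be 0; PKRU comes back in EAX, EDX is cleared */
	static inline uint32_t rdpkru(void)
	{
		uint32_t eax, edx;

		asm volatile(".byte 0x0f,0x01,0xee"
			     : "=a" (eax), "=d" (edx) : "c" (0));
		return eax;
	}

	/* WRPKRU: new PKRU value in EAX; ECX and EDX must be 0 */
	static inline void wrpkru(uint32_t pkru)
	{
		asm volatile(".byte 0x0f,0x01,0xef"
			     : : "a" (pkru), "c" (0), "d" (0) : "memory");
	}

	/* each key owns two PKRU bits: Access Disable and Write Disable */
	#define PKRU_AD(key)	(1u << (2 * (key)))
	#define PKRU_WD(key)	(1u << (2 * (key) + 1))

	static inline void pkey_deny_writes(int key)
	{
		wrpkru(rdpkru() | PKRU_WD(key));
	}

	static inline void pkey_allow_all(int key)
	{
		wrpkru(rdpkru() & ~(PKRU_AD(key) | PKRU_WD(key)));
	}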


2015-05-07 17:43:59

by Dave Hansen

Subject: [PATCH 02/12] x86, pku: define new CR4 bit


There is a new bit in CR4 for enabling protection keys.

---

b/arch/x86/include/uapi/asm/processor-flags.h | 2 ++
1 file changed, 2 insertions(+)

diff -puN arch/x86/include/uapi/asm/processor-flags.h~pkeys-1-cr4 arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~pkeys-1-cr4	2015-05-07 10:31:41.384187278 -0700
+++ b/arch/x86/include/uapi/asm/processor-flags.h 2015-05-07 10:31:41.387187413 -0700
@@ -120,6 +120,8 @@
#define X86_CR4_SMEP _BITUL(X86_CR4_SMEP_BIT)
#define X86_CR4_SMAP_BIT 21 /* enable SMAP support */
#define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
+#define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */
+#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)

/*
* x86-64 Task Priority Register, CR8
_

2015-05-07 17:41:21

by Dave Hansen

Subject: [PATCH 01/12] x86, pkeys: cpuid bit definition


There are two CPUID bits for protection keys. One is for whether
the CPU contains the feature, and the other will appear set once
the OS enables protection keys. Specifically:

Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable
Protection keys (and the RDPKRU/WRPKRU instructions)

This is because userspace cannot see CR4 contents, but it can
see CPUID contents.

X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:

CPUID.(EAX=07H,ECX=0H):ECX.PKU [bit 3]

X86_FEATURE_OSPKE is "OSPKU":

CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]

These are the first CPU features which need to look at the
ECX word in CPUID leaf 0x7, so this patch also includes
fetching that word into the cpuinfo->x86_capability[] array.
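
A userspace probe for those two bits might look like this (a
sketch, assuming a GCC/clang toolchain that provides
__get_cpuid_count; the bit positions are the ones quoted above):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID.(EAX=07H,ECX=0H) */
		if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
			return 1;

		printf("PKU   (CPU has protection keys): %d\n", !!(ecx & (1 << 3)));
		printf("OSPKE (kernel set CR4.PKE):      %d\n", !!(ecx & (1 << 4)));
		return 0;
	}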

---

b/arch/x86/include/asm/cpufeature.h | 6 +++++-
b/arch/x86/kernel/cpu/common.c | 1 +
2 files changed, 6 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/cpufeature.h~pkeys-0-cpuid arch/x86/include/asm/cpufeature.h
--- a/arch/x86/include/asm/cpufeature.h~pkeys-0-cpuid 2015-05-07 10:31:40.985169281 -0700
+++ b/arch/x86/include/asm/cpufeature.h 2015-05-07 10:31:40.991169552 -0700
@@ -12,7 +12,7 @@
#include <asm/disabled-features.h>
#endif

-#define NCAPINTS 13 /* N 32-bit words worth of info */
+#define NCAPINTS 14 /* N 32-bit words worth of info */
#define NBUGINTS 1 /* N 32-bit bug flags */

/*
@@ -252,6 +252,10 @@
/* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
#define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */

+/* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 13 */
+#define X86_FEATURE_PKU (13*32+ 3) /* Protection Keys for Userspace */
+#define X86_FEATURE_OSPKE (13*32+ 4) /* OS Protection Keys Enable */
+
/*
* BUG word(s)
*/
diff -puN arch/x86/kernel/cpu/common.c~pkeys-0-cpuid arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-0-cpuid 2015-05-07 10:31:40.987169371 -0700
+++ b/arch/x86/kernel/cpu/common.c 2015-05-07 10:31:40.991169552 -0700
@@ -635,6 +635,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
cpuid_count(0x00000007, 0, &eax, &ebx, &ecx, &edx);

c->x86_capability[9] = ebx;
+ c->x86_capability[13] = ecx;
}

/* Extended state features: level 0x0000000d */
_

2015-05-07 17:41:43

by Dave Hansen

Subject: [PATCH 05/12] x86, pkeys: new page fault error code bit: PF_PK


Note: "PK" is how the Intel SDM refers to this bit, so we also
use that nomenclature.

This only defines the bit; it does not plumb it anywhere to be
handled.

---

b/arch/x86/mm/fault.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN arch/x86/mm/fault.c~pkeys-4-pfec arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-4-pfec 2015-05-07 10:31:42.568240681 -0700
+++ b/arch/x86/mm/fault.c 2015-05-07 10:31:42.571240816 -0700
@@ -31,6 +31,7 @@
* bit 2 == 0: kernel-mode access 1: user-mode access
* bit 3 == 1: use of reserved bit detected
* bit 4 == 1: fault was an instruction fetch
+ * bit 5 == 1: protection keys block access
*/
enum x86_pf_error_code {

@@ -39,6 +40,7 @@ enum x86_pf_error_code {
PF_USER = 1 << 2,
PF_RSVD = 1 << 3,
PF_INSTR = 1 << 4,
+ PF_PK = 1 << 5,
};

/*
@@ -912,7 +914,10 @@ static int spurious_fault_check(unsigned

if ((error_code & PF_INSTR) && !pte_exec(*pte))
return 0;
-
+ /*
+ * Note: We do not do lazy flushing on protection key
+ * changes, so no spurious fault will ever set PF_PK.
+ */
return 1;
}

_

2015-05-07 17:42:44

by Dave Hansen

Subject: [PATCH 06/12] x86, pkeys: store protection in high VMA flags


vma->vm_flags is an 'unsigned long', so it has space for 32 flags
on 32-bit architectures. The high 32 bits are unused on 64-bit
platforms. We've steered away from using the unused high VMA
bits for things because we would have difficulty supporting it
on 32-bit.

Protection Keys are not available in 32-bit mode, so there is
no concern about supporting this feature in 32-bit mode or on
32-bit CPUs.

This patch carves out 4 bits from the high half of
vma->vm_flags and allows architectures to set a config option
to make them available.
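
For illustration, a later patch can reassemble the 4-bit key
from these bits with something like the sketch below (the
helper name is hypothetical here; the VM_PKEY_BIT* aliases
show up in patch 08):

	static inline int vma_pkey(struct vm_area_struct *vma)
	{
		return	((vma->vm_flags & VM_PKEY_BIT0) ? 1 : 0) |
			((vma->vm_flags & VM_PKEY_BIT1) ? 2 : 0) |
			((vma->vm_flags & VM_PKEY_BIT2) ? 4 : 0) |
			((vma->vm_flags & VM_PKEY_BIT3) ? 8 : 0);
	}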

---

b/arch/x86/Kconfig | 1 +
b/include/linux/mm.h | 7 +++++++
b/mm/Kconfig | 3 +++
3 files changed, 11 insertions(+)

diff -puN arch/x86/Kconfig~pkeys-7-eat-high-vma-flags arch/x86/Kconfig
--- a/arch/x86/Kconfig~pkeys-7-eat-high-vma-flags 2015-05-07 10:31:42.943257595 -0700
+++ b/arch/x86/Kconfig 2015-05-07 10:31:42.951257956 -0700
@@ -142,6 +142,7 @@ config X86
select ACPI_LEGACY_TABLES_LOOKUP if ACPI
select X86_FEATURE_NAMES if PROC_FS
select SRCU
+ select ARCH_USES_HIGH_VMA_FLAGS if X86_64

config INSTRUCTION_DECODER
def_bool y
diff -puN include/linux/mm.h~pkeys-7-eat-high-vma-flags include/linux/mm.h
--- a/include/linux/mm.h~pkeys-7-eat-high-vma-flags 2015-05-07 10:31:42.945257685 -0700
+++ b/include/linux/mm.h 2015-05-07 10:31:42.951257956 -0700
@@ -153,6 +153,13 @@ extern unsigned int kobjsize(const void
#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */

+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_HIGH_ARCH_1 0x100000000 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_2 0x200000000 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_3 0x400000000 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_4 0x800000000 /* bit only usable on 64-bit architectures */
+#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+
#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
diff -puN mm/Kconfig~pkeys-7-eat-high-vma-flags mm/Kconfig
--- a/mm/Kconfig~pkeys-7-eat-high-vma-flags 2015-05-07 10:31:42.947257775 -0700
+++ b/mm/Kconfig 2015-05-07 10:31:42.952258001 -0700
@@ -635,3 +635,6 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+config ARCH_USES_HIGH_VMA_FLAGS
+ bool
_

2015-05-07 17:41:23

by Dave Hansen

Subject: [PATCH 03/12] x86, pkey: pkru xsave fields and data structure


The protection keys register (PKRU) is saved and restored using
xsave. Define the data structure that we will use to access it
inside the xsave buffer, and also double-check that the new
structure matches the size that comes out of the CPU.

---

b/arch/x86/include/asm/processor.h | 9 +++++++++
b/arch/x86/include/asm/xsave.h | 3 ++-
b/arch/x86/kernel/xsave.c | 7 +++++++
3 files changed, 18 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/processor.h~pkeys-2-xsave arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~pkeys-2-xsave 2015-05-07 10:31:41.756204056 -0700
+++ b/arch/x86/include/asm/processor.h 2015-05-07 10:31:41.763204372 -0700
@@ -406,6 +406,15 @@ struct bndcsr {
u64 bndstatus;
} __packed;

+/*
+ * "The size of XSAVE state component for PKRU is 8 bytes,
+ * of which only the first four bytes are used...".
+ */
+struct pkru {
+ u32 pkru;
+ u32 pkru_unused;
+} __packed;
+
struct xsave_hdr_struct {
u64 xstate_bv;
u64 xcomp_bv;
diff -puN arch/x86/include/asm/xsave.h~pkeys-2-xsave arch/x86/include/asm/xsave.h
--- a/arch/x86/include/asm/xsave.h~pkeys-2-xsave 2015-05-07 10:31:41.758204147 -0700
+++ b/arch/x86/include/asm/xsave.h 2015-05-07 10:31:41.764204417 -0700
@@ -14,6 +14,7 @@
#define XSTATE_OPMASK 0x20
#define XSTATE_ZMM_Hi256 0x40
#define XSTATE_Hi16_ZMM 0x80
+#define XSTATE_PKRU 0x200

#define XSTATE_FPSSE (XSTATE_FP | XSTATE_SSE)
#define XSTATE_AVX512 (XSTATE_OPMASK | XSTATE_ZMM_Hi256 | XSTATE_Hi16_ZMM)
@@ -33,7 +34,7 @@
| XSTATE_OPMASK | XSTATE_ZMM_Hi256 | XSTATE_Hi16_ZMM)

/* Supported features which require eager state saving */
-#define XSTATE_EAGER (XSTATE_BNDREGS | XSTATE_BNDCSR)
+#define XSTATE_EAGER (XSTATE_BNDREGS | XSTATE_BNDCSR | XSTATE_PKRU)

/* All currently supported features */
#define XCNTXT_MASK (XSTATE_LAZY | XSTATE_EAGER)
diff -puN arch/x86/kernel/xsave.c~pkeys-2-xsave arch/x86/kernel/xsave.c
--- a/arch/x86/kernel/xsave.c~pkeys-2-xsave 2015-05-07 10:31:41.760204237 -0700
+++ b/arch/x86/kernel/xsave.c 2015-05-07 10:31:41.764204417 -0700
@@ -528,6 +528,13 @@ void setup_xstate_comp(void)
+ xstate_comp_sizes[i-1];

}
+ /*
+ * Check that the size of the "PKRU" xsave area
+ * which the CPU knows about matches the kernel
+ * data structure that we have defined.
+ */
+ if ((xstate_features >= XSTATE_PKRU) && xstate_comp_sizes[XSTATE_PKRU])
+ WARN_ON(xstate_comp_sizes[XSTATE_PKRU] != sizeof(struct pkru));
}

/*
_

2015-05-07 17:46:01

by Dave Hansen

Subject: [PATCH 04/12] x86, pkeys: PTE bits


Previous documentation has referred to these 4 bits as "ignored".
That means that software could have made use of them. But, as
far as I know, the kernel never used them.

They are still ignored when protection keys are not enabled, so
they could theoretically still get used for software purposes.
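
For illustration, packing a key into a pte value with the new
bits would look something like this (a hypothetical helper, not
part of this patch):

	static inline pteval_t pte_set_pkey_bits(pteval_t pteval, int pkey)
	{
		/* clear bits 59-62, then drop the 4-bit key in their place */
		pteval &= ~(_PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
			    _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3);
		pteval |= ((pteval_t)pkey & 0xf) << _PAGE_BIT_PKEY_BIT0;
		return pteval;
	}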

---

b/arch/x86/include/asm/pgtable_types.h | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-3-ptebits arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-3-ptebits 2015-05-07 10:31:42.194223812 -0700
+++ b/arch/x86/include/asm/pgtable_types.h 2015-05-07 10:31:42.198223992 -0700
@@ -25,7 +25,11 @@
#define _PAGE_BIT_SPLITTING _PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
#define _PAGE_BIT_HIDDEN _PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
+#define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */
+#define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */
+#define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */
+#define _PAGE_BIT_PKEY_BIT3 62 /* Protection Keys, bit 4/4 */
+#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */

/* If _PAGE_BIT_PRESENT is clear, we use these: */
/* - if the user mapped it with PROT_NONE; pte_present gives true */
@@ -47,6 +51,10 @@
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
#define _PAGE_CPA_TEST (_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
+#define _PAGE_PKEY_BIT0 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT0)
+#define _PAGE_PKEY_BIT1 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT1)
+#define _PAGE_PKEY_BIT2 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT2)
+#define _PAGE_PKEY_BIT3 (_AT(pteval_t, 1) << _PAGE_BIT_PKEY_BIT3)
#define __HAVE_ARCH_PTE_SPECIAL

#ifdef CONFIG_KMEMCHECK
_

2015-05-07 17:42:42

by Dave Hansen

Subject: [PATCH 08/12] x86, pkeys: arch-specific protection bits


Lots of things seem to do:

vma->vm_page_prot = vm_get_page_prot(flags);

and the ptes get created right from things we pull out
of ->vm_page_prot. So it is very convenient if we can
store the protection key in flags and vm_page_prot, just
like the existing permission bits (_PAGE_RW/PRESENT). It
greatly reduces the amount of plumbing and arch-specific
hacking we have to do in generic code.

This also takes the new PROT_PKEY{0,1,2,3} flags and
turns *those* into VM_ flags for vma->vm_flags.

The protection key values are stored in 4 places:
1. "prot" argument to system calls
2. vma->vm_flags, filled from the mmap "prot"
3. vma->vm_page_prot, filled from vma->vm_flags
4. the PTE itself.

The pseudocode for these four steps is as follows:

mmap(PROT_PKEY*)
vma->vm_flags = ... | arch_calc_vm_prot_bits(mmap_prot);
vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
pte = pfn | vma->vm_page_prot

Note that these are new definitions for x86:

arch_vm_get_page_prot()
arch_calc_vm_prot_bits()
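
A worked example of the four steps, using key 5 (binary 0101)
and the PROT_PKEY* bits from the next patch:

	/* 1. prot: PROT_PKEY0 (0x10) and PROT_PKEY2 (0x40) encode key 5 */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_PKEY0 | PROT_PKEY2,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	/*
	 * 2. arch_calc_vm_prot_bits() sets VM_PKEY_BIT0 | VM_PKEY_BIT2
	 *    in vma->vm_flags
	 * 3. arch_vm_get_page_prot() turns those into
	 *    _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT2 in vma->vm_page_prot
	 * 4. each pte ends up with bits 59 and 61 set, i.e. pkey == 5
	 */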

---

b/arch/x86/include/asm/pgtable_types.h | 12 ++++++++++--
b/arch/x86/include/uapi/asm/mman.h | 17 +++++++++++++++++
b/include/linux/mm.h | 4 ++++
3 files changed, 31 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable_types.h~pkeys-7-store-pkey-in-vma arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~pkeys-7-store-pkey-in-vma 2015-05-07 10:31:43.740293543 -0700
+++ b/arch/x86/include/asm/pgtable_types.h 2015-05-07 10:31:43.747293858 -0700
@@ -104,7 +104,12 @@
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
_PAGE_DIRTY)

-/* Set of bits not changed in pte_modify */
+/*
+ * Set of bits not changed in pte_modify. The pte's
+ * protection key is treated like _PAGE_RW, for
+ * instance, and is *not* included in this mask since
+ * pte_modify() does modify it.
+ */
#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
_PAGE_SOFT_DIRTY)
@@ -220,7 +225,10 @@ enum page_cache_mode {
/* PTE_PFN_MASK extracts the PFN from a (pte|pmd|pud|pgd)val_t */
#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)

-/* PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t */
+/*
+ * PTE_FLAGS_MASK extracts the flags from a (pte|pmd|pud|pgd)val_t
+ * This includes the protection key value.
+ */
#define PTE_FLAGS_MASK (~PTE_PFN_MASK)

typedef struct pgprot { pgprotval_t pgprot; } pgprot_t;
diff -puN arch/x86/include/uapi/asm/mman.h~pkeys-7-store-pkey-in-vma arch/x86/include/uapi/asm/mman.h
--- a/arch/x86/include/uapi/asm/mman.h~pkeys-7-store-pkey-in-vma 2015-05-07 10:31:43.742293633 -0700
+++ b/arch/x86/include/uapi/asm/mman.h 2015-05-07 10:31:43.747293858 -0700
@@ -6,6 +6,23 @@
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)

+/*
+ * Take the 4 protection key bits out of the vma->vm_flags
+ * value and turn them in to the bits that we can put in
+ * to a pte.
+ */
+#define arch_vm_get_page_prot(vm_flags) __pgprot( \
+ ((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) | \
+ ((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) | \
+ ((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) | \
+ ((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))
+
+#define arch_calc_vm_prot_bits(prot) ( \
+ ((prot) & PROT_PKEY0 ? VM_PKEY_BIT0 : 0) | \
+ ((prot) & PROT_PKEY1 ? VM_PKEY_BIT1 : 0) | \
+ ((prot) & PROT_PKEY2 ? VM_PKEY_BIT2 : 0) | \
+ ((prot) & PROT_PKEY3 ? VM_PKEY_BIT3 : 0))
+
#include <asm-generic/mman.h>

#endif /* _ASM_X86_MMAN_H */
diff -puN include/linux/mm.h~pkeys-7-store-pkey-in-vma include/linux/mm.h
--- a/include/linux/mm.h~pkeys-7-store-pkey-in-vma 2015-05-07 10:31:43.744293723 -0700
+++ b/include/linux/mm.h 2015-05-07 10:31:43.748293904 -0700
@@ -162,6 +162,10 @@ extern unsigned int kobjsize(const void

#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
+# define VM_PKEY_BIT0 VM_HIGH_ARCH_1 /* A protection key is a 4-bit value */
+# define VM_PKEY_BIT1 VM_HIGH_ARCH_2
+# define VM_PKEY_BIT2 VM_HIGH_ARCH_3
+# define VM_PKEY_BIT3 VM_HIGH_ARCH_4
#elif defined(CONFIG_PPC)
# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
_

2015-05-07 17:44:53

by Dave Hansen

Subject: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls


If a system call takes a PROT_{NONE,EXEC,WRITE,...} argument,
this adds support for it to take a protection key:

mmap()
mprotect()
drivers/char/agp/frontend.c's ioctl(AGPIOC_RESERVE)

This does not include direct support for shmat() since it uses
a different set of permission bits. You can use mprotect()
after the attach to assign an attached SHM segment a protection
key, as sketched below.
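
Roughly (a sketch; key 1 is chosen arbitrarily):

	void *addr = shmat(shmid, NULL, 0);

	if (addr != (void *)-1)
		/* assign the attached segment protection key 1 */
		mprotect(addr, segment_size,
			 PROT_READ | PROT_WRITE | PROT_PKEY0);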

---

b/include/uapi/asm-generic/mman-common.h | 4 ++++
1 file changed, 4 insertions(+)

diff -puN include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits 2015-05-07 10:31:43.367276719 -0700
+++ b/include/uapi/asm-generic/mman-common.h 2015-05-07 10:31:43.370276855 -0700
@@ -10,6 +10,10 @@
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
#define PROT_SEM 0x8 /* page may be used for atomic ops */
+#define PROT_PKEY0 0x10 /* protection key value (bit 0) */
+#define PROT_PKEY1 0x20 /* protection key value (bit 1) */
+#define PROT_PKEY2 0x40 /* protection key value (bit 2) */
+#define PROT_PKEY3 0x80 /* protection key value (bit 3) */
#define PROT_NONE 0x0 /* page can not be accessed */
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
_

2015-05-07 17:41:27

by Dave Hansen

Subject: [PATCH 10/12] x86, pkeys: differentiate Protection Key faults from normal



---

b/arch/x86/mm/fault.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff -puN arch/x86/mm/fault.c~pkeys-12-fault-differentiation arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-12-fault-differentiation 2015-05-07 10:31:44.570330979 -0700
+++ b/arch/x86/mm/fault.c 2015-05-07 10:31:44.573331114 -0700
@@ -1009,6 +1009,15 @@ int show_unhandled_signals = 1;
static inline int
access_error(unsigned long error_code, struct vm_area_struct *vma)
{
+ /*
+ * Read or write was blocked by protection keys. We do
+ * this check before any others because we do not want
+ * to, for instance, confuse a protection-key-denied
+ * write with one for which we should do a COW.
+ */
+ if (error_code & PF_PK)
+ return 1;
+
if (error_code & PF_WRITE) {
/* write, present and write, not present: */
if (unlikely(!(vma->vm_flags & VM_WRITE)))
_

2015-05-07 17:41:40

by Dave Hansen

Subject: [PATCH 11/12] x86, pkeys: actually enable Memory Protection Keys in CPU


This sets the bit in 'cr4' to actually enable the protection
keys feature. We also include a boot-time disable for the
feature, "nopku".

Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE cpuid
bit to appear set. At this point in boot, identify_cpu()
has already run the actual CPUID instructions and populated
the "cpu features" structures. We need to go back and
re-run get_cpu_cap() to make sure it gets updated values.

We *could* simply re-populate the 13th word of the cpuid
data, but this is probably quick enough.


---

b/Documentation/kernel-parameters.txt | 3 +++
b/arch/x86/kernel/cpu/common.c | 27 +++++++++++++++++++++++++++
2 files changed, 30 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~pkeys-5-should-be-last-patch arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~pkeys-5-should-be-last-patch 2015-05-07 10:31:44.946347938 -0700
+++ b/arch/x86/kernel/cpu/common.c 2015-05-07 10:31:44.952348209 -0700
@@ -306,6 +306,32 @@ static __always_inline void setup_smap(s
}
}

+#ifdef CONFIG_X86_64
+/*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
+{
+ if (!cpu_has(c, X86_FEATURE_PKU))
+ return;
+
+ cr4_set_bits(X86_CR4_PKE);
+ /*
+ * Setting X86_CR4_PKE will cause the X86_FEATURE_OSPKE
+ * cpuid bit to be set. We need to ensure that we
+ * update that bit in this CPU's "cpu_info".
+ */
+ get_cpu_cap(&boot_cpu_data);
+}
+
+static __init int setup_disable_pku(char *arg)
+{
+ setup_clear_cpu_cap(X86_FEATURE_PKU);
+ return 1;
+}
+__setup("nopku", setup_disable_pku);
+#endif /* CONFIG_X86_64 */
+
/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
@@ -957,6 +983,7 @@ static void identify_cpu(struct cpuinfo_
}

#ifdef CONFIG_X86_64
+ setup_pku(c);
detect_ht(c);
#endif

diff -puN Documentation/kernel-parameters.txt~pkeys-5-should-be-last-patch Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt~pkeys-5-should-be-last-patch 2015-05-07 10:31:44.948348028 -0700
+++ b/Documentation/kernel-parameters.txt 2015-05-07 10:31:44.953348254 -0700
@@ -936,6 +936,9 @@ bytes respectively. Such letter suffixes
Enable debug messages at boot time. See
Documentation/dynamic-debug-howto.txt for details.

+ nopku [X86] Disable Memory Protection Keys CPU feature found
+ in some Intel CPUs.
+
eagerfpu= [X86]
on enable eager fpu restore
off disable eager fpu restore
_

2015-05-07 17:41:38

by Dave Hansen

Subject: [PATCH 12/12] x86, pkeys: Documentation



---

b/Documentation/x86/protection-keys.txt | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff -puN /dev/null Documentation/x86/protection-keys.txt
--- /dev/null 2015-05-06 22:34:35.845652580 -0700
+++ b/Documentation/x86/protection-keys.txt 2015-05-07 10:31:45.360366611 -0700
@@ -0,0 +1,22 @@
+Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU
+feature which will be found in future Intel CPUs. The work here
+was done with the aid of simulators.
+
+Memory Protection Keys provides a mechanism for enforcing
+page-based protections, but without requiring modification of the
+page tables when an application changes protection domains. It
+works by dedicating 4 previously ignored bits in each page table
+entry to a "protection key", giving 16 possible keys.
+
+There is also a new user-accessible register (PKRU) with two
+separate bits (Access Disable and Write Disable) for each key.
+Being a CPU register, PKRU is inherently thread-local,
+potentially giving each thread a different set of protections
+from every other thread.
+
+There are two new instructions (RDPKRU/WRPKRU) for reading and
+writing to the new register. The feature is only available in
+64-bit mode, even though there is theoretically space in the PAE
+PTEs. These permissions are enforced on data access only and
+have no effect on instruction fetches.
+
_

2015-05-07 17:44:48

by Dave Hansen

Subject: [PATCH 09/12] x86, pkeys: notify userspace about protection key faults


A protection key fault is very similar to any other access
error. There must be a VMA, etc... We even want to take
the same action (SIGSEGV) that we do with a normal access
fault.

However, we do need to let userspace know that something
is different. We do this the same way we did with
SEGV_BNDERR for Memory Protection eXtensions (MPX):
define a new SEGV code: SEGV_PKUERR.

We will, at some point, need to give userspace a way to
figure out which protection key covers the address that
we faulted on. We can either do that with a separate
interface, or we could pass it up in the siginfo like
MPX did.

Suggestions welcome. :)
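
For reference, the userspace side might end up looking like
this (a sketch only, and note the open question above about
how to recover the key itself):

	#include <signal.h>

	static void segv_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == SEGV_PKUERR) {
			/* blocked by a protection key; PKRU is readable here */
		} else if (si->si_code == SEGV_ACCERR) {
			/* an ordinary permission fault */
		}
	}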

---

b/arch/x86/mm/fault.c | 5 ++++-
b/include/uapi/asm-generic/siginfo.h | 10 +++++++++-
2 files changed, 13 insertions(+), 2 deletions(-)

diff -puN arch/x86/mm/fault.c~pkeys-13-siginfo arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c~pkeys-13-siginfo 2015-05-07 10:31:44.169312893 -0700
+++ b/arch/x86/mm/fault.c 2015-05-07 10:31:44.174313118 -0700
@@ -838,7 +838,10 @@ static noinline void
bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
{
- __bad_area(regs, error_code, address, SEGV_ACCERR);
+ if (error_code & PF_PK)
+ __bad_area(regs, error_code, address, SEGV_PKUERR);
+ else
+ __bad_area(regs, error_code, address, SEGV_ACCERR);
}

static void
diff -puN include/uapi/asm-generic/siginfo.h~pkeys-13-siginfo include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~pkeys-13-siginfo 2015-05-07 10:31:44.170312938 -0700
+++ b/include/uapi/asm-generic/siginfo.h 2015-05-07 10:31:44.174313118 -0700
@@ -95,6 +95,13 @@ typedef struct siginfo {
void __user *_lower;
void __user *_upper;
} _addr_bnd;
+ int protection_key; /* FIXME: protection key value??
+ * Do we really need this in here?
+ * userspace can get the PKRU value in
+ * the signal handler, but they do not
+ * easily have access to the PKEY value
+ * from the PTE.
+ */
} _sigfault;

/* SIGPOLL */
@@ -206,7 +213,8 @@ typedef struct siginfo {
#define SEGV_MAPERR (__SI_FAULT|1) /* address not mapped to object */
#define SEGV_ACCERR (__SI_FAULT|2) /* invalid permissions for mapped object */
#define SEGV_BNDERR (__SI_FAULT|3) /* failed address bound checks */
-#define NSIGSEGV 3
+#define SEGV_PKUERR (__SI_FAULT|4) /* failed protection key checks */
+#define NSIGSEGV 4

/*
* SIGBUS si_codes
_

2015-05-07 17:57:14

by Ingo Molnar

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys


* Dave Hansen <[email protected]> wrote:

> == FEATURE OVERVIEW ==
>
> Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU
> feature which will be found in future Intel CPUs. The work here was
> done with the aid of simulators.
>
> Memory Protection Keys provides a mechanism for enforcing page-based
> protections, but without requiring modification of the page tables
> when an application changes protection domains. It works by
> dedicating 4 previously ignored bits in each page table entry to a
> "protection key", giving 16 possible keys.
>
> There is also a new user-accessible register (PKRU) with two
> separate bits (Access Disable and Write Disable) for each key. Being
> a CPU register, PKRU is inherently thread-local, potentially giving
> each thread a different set of protections from every other thread.
>
> There are two new instructions (RDPKRU/WRPKRU) for reading and
> writing to the new register. The feature is only available in
> 64-bit mode, even though there is theoretically space in the PAE
> PTEs. These permissions are enforced on data access only and have
> no effect on instruction fetches.

So I'm wondering what the primary usecases are for this feature?
Could you outline applications/workloads/libraries that would
benefit from this?

Thanks,

Ingo

2015-05-07 18:09:49

by Dave Hansen

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>> > There are two new instructions (RDPKRU/WRPKRU) for reading and
>> > writing to the new register. The feature is only available in
>> > 64-bit mode, even though there is theoretically space in the PAE
>> > PTEs. These permissions are enforced on data access only and have
>> > no effect on instruction fetches.
> So I'm wondering what the primary usecases are for this feature?
> Could you outline applications/workloads/libraries that would
> benefit from this?

There are lots of things that folks would _like_ to mprotect(), but end
up not being feasible because of the overhead of going and mucking with
thousands of PTEs and shooting down remote TLBs every time you want to
change protections.

Data structures like logs or journals that are only written to in very
limited code paths, but that you want to protect from "stray" writes.

Maybe even a database where a query operation will never need to write
to memory, but an insert would. You could keep the data R/O during the
entire operation except when an insert is actually in progress. It
narrows the window where data might be corrupted. This becomes even
more valuable if a stray write to memory is guaranteed to hit storage...
like with persistent memory.

Someone mentioned to me that valgrind does lots of mprotect()s and might
benefit from this.

We could keep heap metadata as R/O and only make it R/W inside of
malloc() itself to catch corruption more quickly.
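
Roughly like this (every name in the sketch is invented):

	void *malloc(size_t size)
	{
		void *p;

		pkey_allow_writes(HEAP_METADATA_PKEY);	/* hypothetical PKRU helper */
		p = carve_chunk_and_update_freelists(size);
		pkey_deny_writes(HEAP_METADATA_PKEY);	/* metadata back to R/O */

		return p;
	}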

More crazy ideas welcome. :)

2015-05-07 18:48:47

by Vlastimil Babka

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 08:09 PM, Dave Hansen wrote:
> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and
>>>> writing to the new register. The feature is only available in
>>>> 64-bit mode, even though there is theoretically space in the PAE
>>>> PTEs. These permissions are enforced on data access only and have
>>>> no effect on instruction fetches.
>> So I'm wondering what the primary usecases are for this feature?
>> Could you outline applications/workloads/libraries that would
>> benefit from this?
>
> There are lots of things that folks would _like_ to mprotect(), but end
> up not being feasible because of the overhead of going and mucking with
> thousands of PTEs and shooting down remote TLBs every time you want to
> change protections.
>
> Data structures like logs or journals that are only written to in very
> limited code paths, but that you want to protect from "stray" writes.
>
> Maybe even a database where a query operation will never need to write
> to memory, but an insert would. You could keep the data R/O during the
> entire operation except when an insert is actually in progress. It
> narrows the window where data might be corrupted. This becomes even
> more valuable if a stray write to memory is guaranteed to hit storage...
> like with persistent memory.
>
> Someone mentioned to me that valgrind does lots of mprotect()s and might
> benefit from this.
>
> We could keep heap metadata as R/O and only make it R/W inside of
> malloc() itself to catch corruption more quickly.

But that metadata is typically within the same page as the data itself
(for small objects at least), no?

> More crazy ideas welcome. :)

Since you asked :) I wonder if the usefulness could be extended by
making it possible for a thread to revoke its access to WRPKRU (it's not
privileged, right?). Then I could imagine some extra security for
sandbox/bytecode/JIT code so it doesn't interfere with the runtime. But
since it doesn't block instruction fetches, then maybe it wouldn't make
much difference...


2015-05-07 19:12:10

by Alan Cox

Subject: Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls

> diff -puN include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits include/uapi/asm-generic/mman-common.h
> --- a/include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits 2015-05-07 10:31:43.367276719 -0700
> +++ b/include/uapi/asm-generic/mman-common.h 2015-05-07 10:31:43.370276855 -0700
> @@ -10,6 +10,10 @@
> #define PROT_WRITE 0x2 /* page can be written */
> #define PROT_EXEC 0x4 /* page can be executed */
> #define PROT_SEM 0x8 /* page may be used for atomic ops */
> +#define PROT_PKEY0 0x10 /* protection key value (bit 0) */
> +#define PROT_PKEY1 0x20 /* protection key value (bit 1) */
> +#define PROT_PKEY2 0x40 /* protection key value (bit 2) */
> +#define PROT_PKEY3 0x80 /* protection key value (bit 3) */

That's leaking deep Intelisms into asm-generic, which makes me very
uncomfortable. Whether we need to reserve some bits for "arch specific"
is one question, what we do with them ought not to be leaking out.

To start with trying to port code people will want to do

#define PROT_PKEY0 0
#define PROT_PKEY1 0
..

etc

2015-05-07 19:18:59

by Alan Cox

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

> Data structures like logs or journals that are only written to in very
> limited code paths, but that you want to protect from "stray" writes.

Anything with lots of data where you want to minimise the risk of stray
accesses even if just as a debug aid (consider things like memcached).
>
> Maybe even a database where a query operation will never need to write
> to memory, but an insert would. You could keep the data R/O during the
> entire operation except when an insert is actually in progress. It
> narrows the window where data might be corrupted. This becomes even
> more valuable if a stray write to memory is guaranteed to hit storage...
> like with persistent memory.
>
> Someone mentioned to me that valgrind does lots of mprotect()s and might
> benefit from this.

You can also use it for certain types of emulator trickery, and I suspect
even for things like interpreters and controlling access to "tainted"
values.

Other obvious uses are making it a shade harder for SSL or ssh type
errors to leak things like key data by reducing the damage done by out of
bound accesses.

> We could keep heap metadata as R/O and only make it R/W inside of
> malloc() itself to catch corruption more quickly.

If you implement multiple malloc pools you can chop up lots of stuff.

In library land it isn't just stuff like malloc, you can use it as
a debug weapon to protect library private data from naughty application
code.

There are some other debug uses when catching faults - fast ways to do
range access breakpoints for example.

Alan

2015-05-07 19:19:30

by Dave Hansen

Subject: Re: [PATCH 07/12] mm: Pass the 4-bit protection key in via PROT_ bits to syscalls

On 05/07/2015 12:11 PM, One Thousand Gnomes wrote:
>> diff -puN include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits include/uapi/asm-generic/mman-common.h
>> --- a/include/uapi/asm-generic/mman-common.h~pkeys-11-user-abi-bits 2015-05-07 10:31:43.367276719 -0700
>> +++ b/include/uapi/asm-generic/mman-common.h 2015-05-07 10:31:43.370276855 -0700
>> @@ -10,6 +10,10 @@
>> #define PROT_WRITE 0x2 /* page can be written */
>> #define PROT_EXEC 0x4 /* page can be executed */
>> #define PROT_SEM 0x8 /* page may be used for atomic ops */
>> +#define PROT_PKEY0 0x10 /* protection key value (bit 0) */
>> +#define PROT_PKEY1 0x20 /* protection key value (bit 1) */
>> +#define PROT_PKEY2 0x40 /* protection key value (bit 2) */
>> +#define PROT_PKEY3 0x80 /* protection key value (bit 3) */
>
> Thats leaking deep Intelisms into asm-generic which makes me very
> uncomfortable. Whether we need to reserve some bits for "arch specific"
> is one question, what we do with them ought not to be leaking out.
>
> To start with trying to port code people will want to do
>
> #define PROT_PKEY0 0
> #define PROT_PKEY1 0

Yeah, I feel pretty uncomfortable with it as well. I really don't
expect these to live like this in asm-generic when I submit this.

Powerpc and ia64 have _something_ resembling protection keys, so the
concept isn't entirely x86 or Intel-specific. My hope would be that we
do this in a way that other architectures can use.

2015-05-07 19:22:37

by Christian Borntraeger

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 08:09 PM, Dave Hansen wrote:
> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and
>>>> writing to the new register. The feature is only available in
>>>> 64-bit mode, even though there is theoretically space in the PAE
>>>> PTEs. These permissions are enforced on data access only and have
>>>> no effect on instruction fetches.
>> So I'm wondering what the primary usecases are for this feature?
>> Could you outline applications/workloads/libraries that would
>> benefit from this?
>
> There are lots of things that folks would _like_ to mprotect(), but end
> up not being feasible because of the overhead of going and mucking with
> thousands of PTEs and shooting down remote TLBs every time you want to
> change protections.

These protection bits would need to be cached in TLBs as well, no?
So the saving would come by switching the PKRU instead of the page bits.

This all looks like s390 storage keys (with the key in pagetables instead
of a dedicated place). There we also have 16 values for the key and 4 bits
in the PSW that describe the thread-local key; both are matched.
There is an additional field F (fetch protection) that decides whether the
key value is used for stores or for stores+fetches.

2015-05-07 19:26:38

by Ingo Molnar

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys


* One Thousand Gnomes <[email protected]> wrote:

> > We could keep heap metadata as R/O and only make it R/W inside of
> > malloc() itself to catch corruption more quickly.
>
> If you implement multiple malloc pools you can chop up lots of
> stuff.

I'd say that a 64-bit address space is large enough to hide buffers in
from accidental corruption, without any runtime page protection
flipping overhead?

> In library land it isn't just stuff like malloc, you can use it as a
> debug weapon to protect library private data from naughty
> application code.
>
> There are some other debug uses when catching faults - fast ways to
> do range access breakpoints for example.

I think libraries are happy enough to work without bugs - apps digging
around in library data are in a "you keep all the broken pieces"
situation, why would a library want to slow every good citizen
down with extra protection flipping/unflipping accesses?

The Valgrind usecase looks somewhat legit, albeit not necessarily for
multithreaded apps: there you generally really want protection changes
to be globally visible, such as publishing the effects of free() or
malloc().

Also, will apps/libraries bother if it's not a standard API and if it
only runs on very fresh CPUs?

Thanks,

Ingo

2015-05-07 19:29:25

by Dave Hansen

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 12:22 PM, Christian Borntraeger wrote:
> On 05/07/2015 08:09 PM, Dave Hansen wrote:
>> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and
>>>>> writing to the new register. The feature is only available in
>>>>> 64-bit mode, even though there is theoretically space in the PAE
>>>>> PTEs. These permissions are enforced on data access only and have
>>>>> no effect on instruction fetches.
>>> So I'm wondering what the primary usecases are for this feature?
>>> Could you outline applications/workloads/libraries that would
>>> benefit from this?
>>
>> There are lots of things that folks would _like_ to mprotect(), but end
>> up not being feasible because of the overhead of going and mucking with
>> thousands of PTEs and shooting down remote TLBs every time you want to
>> change protections.
>
> These protection bits would need to be cached in TLBs as well, no?

Yes, they are cached in the TLBs. It's actually explicitly called out
in the documentation.

> So the saving would come by switching the PKRU instead of the page bits.

Right.

> This all looks like s390 storage keys (with the key in pagetables instead
> of a dedicated place). There we also have 16 values for the key and 4 bits
> in the PSW that describe the thread local key both are matched.
> There is an additional field F (fetch protection) that decides, if the
> key value is used for stores or for stores+fetches.

OK, so a thread can only be in one domain at a time?

That's a bit different than x86 where each page can be in one protection
domain, but each CPU thread can independently enable/disable access to
each of the 16 protection domains.

2015-05-07 19:40:27

by Dave Hansen

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 12:26 PM, Ingo Molnar wrote:
> The Valgrind usecase looks somewhat legit, albeit not necessarily for
> multithreaded apps: there you generally really want protection changes
> to be globally visible, such as publishing the effects of free() or
> malloc().

I guess we could theoretically have an IPC of some kind that voluntarily
broadcasts changes so that we can be guaranteed that other threads see it.

> Also, will apps/libraries bother if it's not a standard API and if it
> only runs on very fresh CPUs?

It's always a problem with new CPU features.

I've thought a bit about trying to "emulate" the feature on older CPUs
using good ol' mprotect() so that we could have an API that folks can
use _today_, but that would get magically fast on future CPUs. But, the
problem with that is the thread-local aspect.

mprotect() is fundamentally process-wide and protection key rights are
fundamentally thread-local. Those things are going to be hard to
reconcile unless we do something slightly extreme like having per-thread
page tables.

2015-05-07 19:45:29

by Christian Borntraeger

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 09:29 PM, Dave Hansen wrote:
> On 05/07/2015 12:22 PM, Christian Borntraeger wrote:
>> On 05/07/2015 08:09 PM, Dave Hansen wrote:
>>> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and
>>>>>> writing to the new register. The feature is only available in
>>>>>> 64-bit mode, even though there is theoretically space in the PAE
>>>>>> PTEs. These permissions are enforced on data access only and have
>>>>>> no effect on instruction fetches.
>>>> So I'm wondering what the primary usecases are for this feature?
>>>> Could you outline applications/workloads/libraries that would
>>>> benefit from this?
>>>
>>> There are lots of things that folks would _like_ to mprotect(), but end
>>> up not being feasible because of the overhead of going and mucking with
>>> thousands of PTEs and shooting down remote TLBs every time you want to
>>> change protections.
>>
>> These protection bits would need to be cached in TLBs as well, no?
>
> Yes, they are cached in the TLBs. It's actually explicitly called out
> in the documentation.
>
>> So the saving would come by switching the PKRU instead of the page bits.
>
> Right.
>
>> This all looks like s390 storage keys (with the key in pagetables instead
>> of a dedicated place). There we also have 16 values for the key and 4 bits
>> in the PSW that describe the thread local key both are matched.
>> There is an additional field F (fetch protection) that decides, if the
>> key value is used for stores or for stores+fetches.
>
> OK, so a thread can only be in one domain at a time?

Via the PSW, yes.
Actually the docs talk about the access key, which is usually the PSW. There are
some instructions like MOVE WITH KEY that allow specifying the key for this
specific instruction. For compiled code these instructions are not used in
Linux and I cannot really see a way to implement that properly. Furthermore,
enabling these key ops has other implications which are unwanted.


> That's a bit different than x86 where each page can be in one protection
> domain, but each CPU thread can independently enable/disable access to
> each of the 16 protection domains.
>

2015-05-07 19:49:54

by Dave Hansen

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 12:45 PM, Christian Borntraeger wrote:
>>> >> This all looks like s390 storage keys (with the key in pagetables instead
>>> >> of a dedicated place). There we also have 16 values for the key and 4 bits
>>> >> in the PSW that describe the thread local key both are matched.
>>> >> There is an additional field F (fetch protection) that decides, if the
>>> >> key value is used for stores or for stores+fetches.
>> >
>> > OK, so a thread can only be in one domain at a time?
> Via the PSW yes.
> Actually the docs talk about access key, which is usually the PSW. There are
> some instructions like MOVE WITH KEY that allow to specify the key for this
> specific instruction. For compiled code these insructions are not used in
> Linux and I can not really see a way to implement that properly. Furthermore
> enabling these key ops has other implications which are unwanted.

OK, so we have two basic operations that need to be done for
protection/storage/$FOO keys:

1. Assign a key (or set of keys) to a memory area
2. Have a thread request that access (read and/or write) to a set of
areas be acquired or revoked.

For (2) on x86, we basically allow any combination of keys and r/w
permissions. On s390, we would need to ensure that access to only one
key was allowed at a time.

BTW, do the s390 keys affect instructions and data, or data only?

The x86 ones affect data only.

2015-05-07 19:57:45

by Christian Borntraeger

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 09:49 PM, Dave Hansen wrote:
> On 05/07/2015 12:45 PM, Christian Borntraeger wrote:
>>>>>> This all looks like s390 storage keys (with the key in pagetables instead
>>>>>> of a dedicated place). There we also have 16 values for the key and 4 bits
>>>>>> in the PSW that describe the thread local key both are matched.
>>>>>> There is an additional field F (fetch protection) that decides, if the
>>>>>> key value is used for stores or for stores+fetches.
>>>>
>>>> OK, so a thread can only be in one domain at a time?
>> Via the PSW yes.
>> Actually the docs talk about access key, which is usually the PSW. There are
>> some instructions like MOVE WITH KEY that allow to specify the key for this
>> specific instruction. For compiled code these insructions are not used in
>> Linux and I can not really see a way to implement that properly. Furthermore
>> enabling these key ops has other implications which are unwanted.
>
> OK, so we have to basic operations that need to be done for
> protection/storage/$FOO keys:
>
> 1. Assign a key (or set of keys) to a memory area
> 2. Have a thread request the access (read and/or write) to a set of
> areas be acquired or revoked.
>
> For (2) on x86, we basically allow any combination of keys and r/w
> permissions. On s390, we would need to ensure that acces to only one
> key was allowed at a time.
>
> BTW, do the s390 keys affect instructions and data, or data only?

Both. In fact it's also used for I/O. Maybe that also points out the
biggest difference: the storage key is a property of the physical page
frame (and not of the virtual page defined by the page tables).
So we cannot really use that for shared memory and then set different
protection keys in different mappings.


> The x86 ones affect data only.
>

2015-05-07 20:11:34

by Alan Cox

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On Thu, 7 May 2015 21:26:20 +0200
Ingo Molnar <[email protected]> wrote:

>
> * One Thousand Gnomes <[email protected]> wrote:
>
> > > We could keep heap metadata as R/O and only make it R/W inside of
> > > malloc() itself to catch corruption more quickly.
> >
> > If you implement multiple malloc pools you can chop up lots of
> > stuff.
>
> I'd say that a 64-bit address space is large enough to hide buffers in
> from accidental corruption, without any runtime page protection
> flipping overhead?

I'd say no. And from actual real world demand for PK the answer is also
no. It's already a problem with very large data sets. Worse still, in many
cases it's a problem that nobody is actually measuring or doing much about
(because mprotect on many gigabytes of data is expensive).

> > In library land it isn't just stuff like malloc, you can use it as a
> > debug weapon to protect library private data from naughty
> > application code.
> >
> > There are some other debug uses when catching faults - fast ways to
> > do range access breakpoints for example.
>
> I think libraries are happy enough to work without bugs - apps digging
> around in library data are in a "you keep all the broken pieces"
> situation, why would a library want to slow down every good citizen
> down with extra protection flipping/unflipping accesses?

For debugging, when the library-maintained data is sensitive or
something you don't want corrupted, or because the user puts security
first. Protection keys are an awful lot faster than mprotect. You've got
no synchronization and shootdowns to do, just a CPU register to load to
indicate which mask of keys you are happy with. That really changes what
it is useful for, because it's cheap. It means you can happily do stuff
like:

	while (data_blocks) {
		allow_key_and_source_access();
		do_crypto_func();
		revoke_key_and_source_access();
		do_network_io();	/* Can't accidentally leak keys or input */
	}


> Also, will apps/libraries bother if it's not a standard API and if it
> only runs on very fresh CPUs?

In time I think yes.

Alan

2015-05-07 21:45:46

by Dave Hansen

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On 05/07/2015 11:48 AM, Vlastimil Babka wrote:
> On 05/07/2015 08:09 PM, Dave Hansen wrote:
>> On 05/07/2015 10:57 AM, Ingo Molnar wrote:
>>>>> There are two new instructions (RDPKRU/WRPKRU) for reading and
>>>>> writing to the new register. The feature is only available in
>>>>> 64-bit mode, even though there is theoretically space in the PAE
>>>>> PTEs. These permissions are enforced on data access only and have
>>>>> no effect on instruction fetches.
>>> So I'm wondering what the primary usecases are for this feature?
>>> Could you outline applications/workloads/libraries that would
>>> benefit from this?
>>
>> There are lots of things that folks would _like_ to mprotect(), but end
>> up not being feasible because of the overhead of going and mucking with
>> thousands of PTEs and shooting down remote TLBs every time you want to
>> change protections.
>>
>> Data structures like logs or journals that are only written to in very
>> limited code paths, but that you want to protect from "stray" writes.
>>
>> Maybe even a database where a query operation will never need to write
>> to memory, but an insert would. You could keep the data R/O during the
>> entire operation except when an insert is actually in progress. It
>> narrows the window where data might be corrupted. This becomes even
>> more valuable if a stray write to memory is guaranteed to hit storage...
>> like with persistent memory.
>>
>> Someone mentioned to me that valgrind does lots of mprotect()s and might
>> benefit from this.
>>
>> We could keep heap metadata as R/O and only make it R/W inside of
>> malloc() itself to catch corruption more quickly.
>
> But that metadata is typically within the same page as the data itself
> (for small objects at least), no?

I guess it depends on the implementation. I honestly don't know what
glibc's malloc does specifically.

>> More crazy ideas welcome. :)
>
> Since you asked :) I wonder if the usefulness could be extended by
> making it possible for a thread to revoke its access to WRPKRU (it's not
> privileged, right?). Then I could imagine some extra security for
> sandbox/bytecode/JIT code so it doesn't interfere with the runtime. But
> since it doesn't block instruction fetches, then maybe it wouldn't make
> much difference...

Correct, it is not privileged. The only way to "revoke" access would be
to disable the feature in CR4, in which case the keys wouldn't be
enforced either.

PKRU is saved/restored using xsave*/xrstor*, which require having the
FPU enabled. But, you can still *use* them even if the FPU is not in
play. So we can't use the FPU en/disable to help us, either. :(

2015-05-08 04:51:24

by Ingo Molnar

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys


* One Thousand Gnomes <[email protected]> wrote:

> On Thu, 7 May 2015 21:26:20 +0200
> Ingo Molnar <[email protected]> wrote:
>
> >
> > * One Thousand Gnomes <[email protected]> wrote:
> >
> > > > We could keep heap metadata as R/O and only make it R/W inside of
> > > > malloc() itself to catch corruption more quickly.
> > >
> > > If you implement multiple malloc pools you can chop up lots of
> > > stuff.
> >
> > I'd say that a 64-bit address space is large enough to hide
> > buffers in from accidental corruption, without any runtime page
> > protection flipping overhead?
>
> I'd say no. [...]

So putting your buffers anywhere in an address range
18446744073709551616 bytes large (well, 281474976710656 bytes with
current CPUs) isn't enough to protect them from stray writes? Could you
outline the situations where that isn't enough?

> [...] And from actual real world demand for PK the answer is also
> no. It's already a problem with very large data sets. [...]

So that's why I asked: what real world demand is there? Is it
described/documented/reported anywhere public?

> [...] Worse still in many cases its a problem that nobody is
> actually measuring or doing much about (because mprotect on many
> gigabytes of data is expensive).

It's not necessarily expensive if the remote TLB shootdown guarantee
is weakened (i.e. we could have an mprotect() flag that says "I don't
need remote TLB shootdowns") - and nobody has asked for that yet
AFAICS.

With 2MB or 1GB pages it would be even cheaper.

Also, the way databases usually protect themselves is by making a
robust central engine and communicating with (complex) DB users via
memory sharing and IPC.

> > I think libraries are happy enough to work without bugs - apps
> > digging around in library data are in a "you keep all the broken
> > pieces" situation, why would a library want to slow down every
> > good citizen down with extra protection flipping/unflipping
> > accesses?
>
> For debugging, when the library maintained data is sensitive or
> something you don't want corupted, or because the user puts security
> first. Protection keys are an awful lot faster than mprotect.

There's no flushing of TLBs involved, even locally; a PK 'flip' is just
a handful of cycles no matter whether protections are narrowed or
broadened, right?

> [...] You've got no synchronization and shootdowns to do just a CPU
> register to load to indicate which mask of keys you are happy with.
> That really changes what it is useful for, because it's cheap. It
> means you can happily do stuff like
>
> 	while (data_blocks) {
> 		allow_key_and_source_access();
> 		do_crypto_func();
> 		revoke_key_and_source_access();
> 		do_network_io();	/* Can't accidentally leak keys or input */
> 	}

That looks useful if it's fast enough. I suspect a similar benefit
could be gained if we allowed individually randomized anonymous
mmap()s: the key wouldn't just be part of the heap, but isolated and
randomized somewhere in a 64-bit (48-bit) address space.

Thanks,

Ingo

2015-05-08 05:19:05

by Kevin Easton

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

On Thu, May 07, 2015 at 08:18:43PM +0100, One Thousand Gnomes wrote:
> > We could keep heap metadata as R/O and only make it R/W inside of
> > malloc() itself to catch corruption more quickly.
>
> If you implement multiple malloc pools you can chop up lots of stuff.
>
> In library land it isn't just stuff like malloc, you can use it as
> a debug weapon to protect library private data from naughty application
> code.

How could a library (or debugger, for that matter) arbitrate ownership
of the protection domains with the application?

One interesting use for it might be to provide an interface to
allocate memory and associate it with a lock that's supposed to be held
while accessing that memory. The allocation function hashes the lock
address down to one of the 15 non-zero protection domains and applies
that key to the memory; the lock function then adds RW access to the
appropriate protection domain and the unlock function removes it.
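
Something like this, perhaps (all names invented; assume some
mechanism exists to tag the memory with a key at allocation
time):

	static int key_for_lock(pthread_mutex_t *lock)
	{
		/* hash the lock address down to keys 1..15; 0 stays the default */
		return 1 + (((unsigned long)lock >> 4) % 15);
	}

	void lock_and_open(pthread_mutex_t *lock)
	{
		pthread_mutex_lock(lock);
		pkey_allow_all(key_for_lock(lock));	/* clear AD+WD in PKRU */
	}

	void close_and_unlock(pthread_mutex_t *lock)
	{
		pkey_deny_all(key_for_lock(lock));	/* set AD+WD in PKRU */
		pthread_mutex_unlock(lock);
	}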

- Kevin

2015-05-09 19:25:22

by Dr. David Alan Gilbert

Subject: Re: [PATCH 00/12] [RFC] x86: Memory Protection Keys

* Vlastimil Babka ([email protected]) wrote:
> On 05/07/2015 08:09 PM, Dave Hansen wrote:
> >On 05/07/2015 10:57 AM, Ingo Molnar wrote:
> >>>>There are two new instructions (RDPKRU/WRPKRU) for reading and
> >>>>writing to the new register. The feature is only available in
> >>>>64-bit mode, even though there is theoretically space in the PAE
> >>>>PTEs. These permissions are enforced on data access only and have
> >>>>no effect on instruction fetches.
> >>So I'm wondering what the primary usecases are for this feature?
> >>Could you outline applications/workloads/libraries that would
> >>benefit from this?
> >
> >There are lots of things that folks would _like_ to mprotect(), but end
> >up not being feasible because of the overhead of going and mucking with
> >thousands of PTEs and shooting down remote TLBs every time you want to
> >change protections.
> >
> >Data structures like logs or journals that are only written to in very
> >limited code paths, but that you want to protect from "stray" writes.
> >
> >Maybe even a database where a query operation will never need to write
> >to memory, but an insert would. You could keep the data R/O during the
> >entire operation except when an insert is actually in progress. It
> >narrows the window where data might be corrupted. This becomes even
> >more valuable if a stray write to memory is guaranteed to hit storage...
> >like with persistent memory.
> >
> >Someone mentioned to me that valgrind does lots of mprotect()s and might
> >benefit from this.
> >
> >We could keep heap metadata as R/O and only make it R/W inside of
> >malloc() itself to catch corruption more quickly.
>
> But that metadata is typically within the same page as the data
> itself (for small objects at least), no?
>
> >More crazy ideas welcome. :)
>
> Since you asked :) I wonder if the usefulness could be extended by
> making it possible for a thread to revoke its access to WRPKRU (it's
> not privileged, right?). Then I could imagine some extra security
> for sandbox/bytecode/JIT code so it doesn't interfere with the
> runtime. But since it doesn't block instruction fetches, then maybe
> it wouldn't make much difference...

Even without revoking a thread's ability to change it, it would still
be useful just to restrict what data your JITed code can get to; if a JIT
generated the code, it would know it's not generating any code that changes
the keys, and so as long as it bounds the code that's accessible, it could
use this to stop the generated code getting to JIT data structures.
I can see it also being useful for things like NaCl that supposedly bound
what the code can contain.

Dave

--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ gro.gilbert @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/

2015-05-15 21:10:05

by Thomas Gleixner

Subject: Re: [PATCH 06/12] x86, pkeys: store protection in high VMA flags

On Thu, 7 May 2015, Dave Hansen wrote:
> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> +#define VM_HIGH_ARCH_1 0x100000000 /* bit only usable on 64-bit architectures */

Nit. Shouldn't this start with VM_HIGH_ARCH_0 ?

Thanks,

tglx

2015-05-15 21:13:27

by Dave Hansen

Subject: Re: [PATCH 06/12] x86, pkeys: store protection in high VMA flags

On 05/15/2015 02:10 PM, Thomas Gleixner wrote:
> On Thu, 7 May 2015, Dave Hansen wrote:
>> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
>> +#define VM_HIGH_ARCH_1 0x100000000 /* bit only usable on 64-bit architectures */
>
> Nit. Shouldn't this start with VM_HIGH_ARCH_0 ?

Yeah, it does make the later #defines look a bit funny. I modeled it
after the "low" VM_ARCH_ flags which start at 1:

#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
#define VM_ARCH_2 0x02000000

I can change it to be 0-based, though.